
Maximum likelihood estimation from two possibly mismatched data sets

Olivier Besson

To cite this version:

Olivier Besson. Maximum likelihood covariance matrix estimation from two possibly mismatched data sets. Signal Processing, Elsevier, 2020, 167, pp.107285-107294. doi:10.1016/j.sigpro.2019.107285. hal-02572461

HAL Id: hal-02572461 https://hal.archives-ouvertes.fr/hal-02572461 Submitted on 13 May 2020



This is an author's version published in: https://oatao.univ-toulouse.fr/25984


Maximum likelihood covariance matrix estimation from two possibly mismatched data sets

Olivier Besson

ISAE-SUPAERO, 10 Avenue Edouard Belin, Toulouse 31055, France

Abstract

We consider estimating the covariance matrix from two data sets, one whose covariance matrix R1 is the sought one and another set of samples whose covariance matrix R2 slightly differs from the sought one, due e.g. to different measurement configurations. We assume however that the two matrices are rather close, which we formulate by assuming that $R_1^{1/2} R_2^{-1} R_1^{1/2} \mid R_1$ follows a Wishart distribution around the identity matrix. It turns out that this assumption results in two data sets with different marginal distributions, hence the problem becomes that of covariance matrix estimation from two data sets which are distribution-mismatched. The maximum likelihood estimator (MLE) is derived and is shown to depend on the values of the number of samples in each set. We show that it involves whitening of one data set by the other one, shrinkage of eigenvalues and colorization, at least when one data set contains more samples than the size p of the observation space. When both data sets have less than p samples but the total number is larger than p, the MLE again entails eigenvalue shrinkage, but this time after a projection operation. Simulation results compare the new estimator to state of the art techniques.

Keywords: Covariance matrix estimation; Maximum likelihood; Mismatch

1. Problem statement

Analysis or processing of multichannel data most often relies on the covariance matrix, which is a fundamental tool e.g., for principal component analysis, spectral analysis, adaptive filtering, detection, direction of arrival estimation among others [1-3]. In practical applications, the p × p covariance matrix R needs to be estimated from a finite number n of samples. When the latter are independent and Gaussian distributed, the maximum likelihood estimator of R is n^{-1} S, where X is the p × n data matrix and S = X X^T is the sample covariance matrix (SCM) [1]. However, in low sample support or when deviation from the Gaussian assumption is at hand, the SCM tends to behave poorly. In particular, it was observed that the sample covariance matrix is usually less well-conditioned than the true covariance matrix, and therefore considerable effort has been dedicated to regularizing it with a view to improving its performance.

One of the most important approaches in this respect is due to Stein [4-6] who, instead of maximizing the likelihood function, advocated minimizing a meaningful loss function within a given class of estimators. Stein hence introduced the concepts of admissible estimation and minimax estimators under the so-called Stein's loss. He showed that the SCM-based estimator is not minimax and derived minimax estimators in two important classes, namely estimators of the form R̂ = G D G^T, where D is a diagonal matrix and G is the Cholesky factor of S, or of the form R̂ = U diag(φ(λ)) U^T, where U diag(λ) U^T is the eigenvalue decomposition of S and φ(λ) is a non-linear function of λ. This seminal work of Stein gave rise to a great number of studies, see for instance [7-13] and references therein. A second class of robust estimates is based on linear shrinkage of the SCM towards a target matrix (an approach which can be interpreted as an empirical Bayes technique), i.e., estimates of the form R̂ = α R_t + β S, where R_t = I is the most widely spread choice, see e.g., [14-20]. Note that these techniques applied with R_t = I achieve an affine transformation of the eigenvalues of S, while retaining the eigenvectors, and therefore bear resemblance with Stein's method, although the selection of α, β may not be driven by the same principle. Robustness to a possibly non-Gaussian distribution has also been a topic of considerable interest and many papers have focused on robust estimation for elliptically distributed data, see e.g., [21-30] and references therein.

Most of the above cited works deal with estimation of a covariance matrix from a single data set. In this paper, we consider a situation where two data sets X1 and X2 are available, with respective covariance matrices R1 and R2. This situation typically arises in radar applications when one wishes to detect a target buried in clutter with unknown statistics [31,32]. In order to infer the latter, training samples are generally used, which hopefully share the same statistics as the clutter in the cell under test (CUT). However, it has been evidenced that clutter is most often heterogeneous [31], with a discrepancy compared to the CUT that may grow with the distance to the CUT [33]. Therefore, one is led to use some clustering that separates training samples, either based on their proximity to the CUT or by means of some statistical criterion, such as the power selected training [34]. The samples so selected are deemed to be representative of the clutter in the CUT while others are less reliable, which corresponds to the situation considered herein. A second example is in the field of synthetic aperture radar in the case where a scene is imaged on two consecutive days, with possible changes in between [35]. Finally, in hyperspectral imagery, the problem of target or anomaly detection leads to a very similar framework. Indeed, the background in a pixel under test has to be estimated from the local pixels around it and from pixels located further apart [36].

In the present paper, we assume that R2 is close to R1, the covariance matrix we wish to estimate. Since R2 differs from but is close to R1, we investigate using both X1 and X2 to estimate R1. The reason for also using X2 is that, although its covariance matrix is not R1, it is close to it. Additionally, one might face situations where the number of samples in X1 is very small. This paper constitutes a first approach to this specific problem and we focus herein on the most natural approach, namely maximum likelihood estimation. The objective is to figure out the pros and cons of the latter and the conditions under which it is an accurate estimator. The paper is organized as follows. In Section 2 we formulate the statistical assumptions: more precisely, we assume that $R_1^{1/2} R_2^{-1} R_1^{1/2} \mid R_1$ is a random matrix with a Wishart distribution around the identity matrix, and we derive the joint distribution of (X1, X2). Section 3 is devoted to the derivation of the maximum likelihood estimator of R1 from (X1, X2), taking into account the possible configurations regarding the number of samples in each data set. Numerical simulations illustrate the performance of the MLE and compare it with existing alternatives in Section 4. Conclusions and possible extensions of the present work are drawn in Section 5.

2. Data model

Let us assume that we have two sets of measurements X1 (p × n1) and X2 (p × n2) which are distributed according to X1 ~ N(0, R1, I) and X2 ~ N(0, R2, I), where N(0, Σ, Ω) denotes the matrix-variate normal distribution whose density is

$$p(X) = (2\pi)^{-pn/2}\,|\Sigma|^{-n/2}\,|\Omega|^{-p/2}\,\operatorname{etr}\left\{-\tfrac{1}{2}\Sigma^{-1}X\Omega^{-1}X^{T}\right\}$$

with |·| the determinant and etr{·} the exponential of the trace of a matrix. Note that we consider real-valued data here whereas in radar applications it is customary to consider complex-valued signals. In Appendix A we show how the results below can be readily extended to the complex case. Our goal in this paper is to estimate R1, using both X1 and X2 even if R1 ≠ R2. However, we assume that the two matrices are close to each other. In order to define a model that can reflect the proximity between R1 and R2, we note that the natural distance between them is given by

$$d^{2}(R_1,R_2) = \sum_{k=1}^{p}\log^{2}\lambda_k\!\left(G_1^{T}R_2^{-1}G_1\right)$$

[37,38], where G1 is a square-root of R1, i.e., R1 = G1 G1^T, and λ_k(G_1^T R_2^{-1} G_1) stands for the kth eigenvalue of G_1^T R_2^{-1} G_1. This matrix is pivotal in adaptive detection problems also. More precisely, in the case of a covariance mismatch between the training samples and the data under test, it is shown in [39] that the performance of the well-known adaptive matched filter depends essentially on this matrix. Therefore, it becomes natural to encapsulate the difference between R1 and R2 through the matrix W = G_1^T R_2^{-1} G_1 and its proximity to the identity matrix. There are of course different ways to translate this constraint in the model. For instance, a frequentist approach may be advocated where the joint probability density function of (X1, X2) would be maximized under the constraint that the distance between W and I is smaller than some value. Alternatively, and this is what we elect here, one can resort to an empirical Bayes approach where the matrix W follows some prior distribution rather concentrated around I. For mathematical tractability, we choose a conjugate prior for W and we assume that W follows a Wishart distribution with ν degrees of freedom and parameter matrix μ^{-1} I, i.e., W ~ W_p(ν, μ^{-1} I). Of course, this is a rather strong assumption whose validity would be difficult to check, e.g., on real data. However, it is in accordance with the mere knowledge we have about the relation between R1 and R2, and it allows for tractable derivations.

Using the fact that X1 | R1 and X2 | R2 are independent and Gaussian distributed with respective covariance matrices R1 and R2, and since R2 = G1 W^{-1} G1^T, we thus assume the following stochastic model:

$$p(X_1,X_2\mid R_1,W) = (2\pi)^{-p(n_1+n_2)/2}\,|R_1|^{-n_1/2}\,\left|W^{-1}R_1\right|^{-n_2/2}\,\operatorname{etr}\left\{-\tfrac12 X_1^{T}R_1^{-1}X_1 - \tfrac12 X_2^{T}G_1^{-T}WG_1^{-1}X_2\right\} \quad (1a)$$

$$p(W) = \frac{\mu^{\nu p/2}}{2^{\nu p/2}\,\Gamma_p(\nu/2)}\,|W|^{(\nu-p-1)/2}\,\operatorname{etr}\left\{-\tfrac{\mu}{2}W\right\} \quad (1b)$$

Note that E{W^{-1}} = (ν − p − 1)^{-1} μ I, so that E{R2} = E{G1 W^{-1} G1^T} = (ν − p − 1)^{-1} μ R1: therefore, for E{R2} to be equal to R1, one must select μ = ν − p − 1. Observe also that W comes closer to I as ν grows large. Indeed, E{W} = ν(ν − p − 1)^{-1} I and E{(W − E{W})²} = pν(ν − p − 1)^{-2} I, which goes to zero as ν → ∞ [40].

The marginal distribution of (X1, X2) is obtained by integrating (1) with respect to W, which results in

$$\begin{aligned} p(X_1,X_2\mid R_1) &= \int_{W>0} p(X_1,X_2\mid R_1,W)\,p(W)\,dW \\ &= \frac{(2\pi)^{-p(n_1+n_2)/2}\mu^{\nu p/2}}{2^{\nu p/2}\Gamma_p(\nu/2)}\,|R_1|^{-(n_1+n_2)/2}\operatorname{etr}\left\{-\tfrac12 X_1^{T}R_1^{-1}X_1\right\}\int_{W>0}|W|^{(\nu+n_2-p-1)/2}\operatorname{etr}\left\{-\tfrac12 W\left(\mu I + G_1^{-1}X_2X_2^{T}G_1^{-T}\right)\right\}dW \\ &= \frac{(2\pi)^{-p(n_1+n_2)/2}\mu^{\nu p/2}\,2^{(\nu+n_2)p/2}\,\Gamma_p((\nu+n_2)/2)}{2^{\nu p/2}\Gamma_p(\nu/2)}\,|R_1|^{-(n_1+n_2)/2}\left|\mu I + G_1^{-1}X_2X_2^{T}G_1^{-T}\right|^{-(\nu+n_2)/2}\operatorname{etr}\left\{-\tfrac12 X_1^{T}R_1^{-1}X_1\right\} \\ &= (2\pi)^{-pn_1/2}|R_1|^{-n_1/2}\operatorname{etr}\left\{-\tfrac12 X_1^{T}R_1^{-1}X_1\right\} \times \pi^{-pn_2/2}\frac{\Gamma_p((\nu+n_2)/2)}{\Gamma_p(\nu/2)}\,|\mu R_1|^{-n_2/2}\left|I + X_2^{T}[\mu R_1]^{-1}X_2\right|^{-(\nu+n_2)/2} \end{aligned} \quad (2)$$

In order to obtain the third equality, we made use of the fact that, if S ~ W_p(ν, Σ),

$$\int_{S>0}p(S)\,dS = 1 \;\Rightarrow\; \int_{S>0}|S|^{(\nu-p-1)/2}\operatorname{etr}\left\{-\tfrac12\Sigma^{-1}S\right\}dS = 2^{\nu p/2}\,\Gamma_p(\nu/2)\,|\Sigma|^{\nu/2} \quad (3)$$

Note that p(X1, X2 | R1) in (2) can be factored as p(X1, X2 | R1) = f1(X1, R1) × f2(X2, R1), which shows that X1 and X2 are marginally independent and that p(X1, X2 | R1) = p(X1 | R1) p(X2 | R1), with p(X1 | R1) ∝ etr{−½ X1^T R1^{-1} X1} and p(X2 | R1) ∝ |I + X2^T [μ R1]^{-1} X2|^{-(ν+n2)/2}. Due to the model adopted for the random matrix W = G1^T R2^{-1} G1, X2 follows a matrix-variate Student distribution [41]. Therefore, the fact that R2 ≠ R1 results here in two data sets with different distributions: one set X1 is Gaussian distributed with covariance matrix R1 while the uncertainty in R2 leads to a Student distribution for X2. This is a rather original situation where one has to carry out covariance matrix estimation from two data sets which are mismatched in their distributions. This peculiarity will result in new schemes compared to the conventional case of a single set with given distribution, as detailed now.
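For concreteness, the following sketch shows how data could be generated under the above hierarchical model. This is not code from the paper; it assumes NumPy/SciPy, and the function and variable names are illustrative only:

```python
import numpy as np
from scipy.stats import wishart

def generate_two_sets(R1, n1, n2, nu, rng):
    """Draw (X1, X2) from the hierarchical model of Section 2 (real-valued case)."""
    p = R1.shape[0]
    mu = nu - p - 1                      # choice that makes E{R2} = R1
    G1 = np.linalg.cholesky(R1)          # one possible square root of R1
    # W ~ W_p(nu, mu^{-1} I): prior concentrated around I for large nu
    W = wishart.rvs(df=nu, scale=np.eye(p) / mu, random_state=rng)
    R2 = G1 @ np.linalg.solve(W, G1.T)   # R2 = G1 W^{-1} G1^T
    # columns of X1 (resp. X2) are i.i.d. N(0, R1) (resp. N(0, R2))
    X1 = G1 @ rng.standard_normal((p, n1))
    X2 = np.linalg.cholesky(R2) @ rng.standard_normal((p, n2))
    return X1, X2, R2
```

Marginally, X1 is then Gaussian while X2 follows a matrix-variate Student distribution, as stated above.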

3. Maximum likelihood estimation

In this section we address estimation of R1 from (X1, X2) and we focus on the most natural estimator, i.e., the maximum likelihood estimator. From (2), the log-likelihood function is, up to an additive constant,

$$f(R_1) = -\frac{n_1+n_2}{2}\log|R_1| - \frac{\nu+n_2}{2}\log\left|I+\mu^{-1}R_1^{-1}S_2\right| - \frac12\operatorname{Tr}\left\{R_1^{-1}S_1\right\} = \frac{\nu-n_1}{2}\log|R_1| - \frac{\nu+n_2}{2}\log\left|R_1+\mu^{-1}S_2\right| - \frac12\operatorname{Tr}\left\{R_1^{-1}S_1\right\} \quad (4)$$

where S1 = X1 X1^T and S2 = X2 X2^T. Differentiating the previous equation and using the facts that d|R| = |R| Tr{R^{-1} dR} and dR^{-1} = −R^{-1} (dR) R^{-1}, we obtain the following equation that the ML solution should satisfy:

$$(\nu-n_1)R_1^{-1} - (\nu+n_2)\left(R_1+\mu^{-1}S_2\right)^{-1} + R_1^{-1}S_1R_1^{-1} = 0 \quad (5)$$

In order to solve (5), we must investigate various configurations for (n1, n2), as the solution will depend on them. Before going into the technical details of each case, we give an overview of the results obtained.
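As a sanity check when experimenting with the estimators derived below, the log-likelihood (4) is straightforward to evaluate numerically. A minimal sketch, assuming NumPy and using slogdet for numerical stability (the function name is ours, not the paper's):

```python
import numpy as np

def log_likelihood(R1, S1, S2, n1, n2, nu, mu):
    """Evaluate f(R1) of Eq. (4), up to the additive constant dropped in the text."""
    sign, logdet_R1 = np.linalg.slogdet(R1)
    assert sign > 0, "R1 must be positive definite"
    # log|I + mu^{-1} R1^{-1} S2| is computed as log|R1 + mu^{-1} S2| - log|R1|
    _, logdet_R1_S2 = np.linalg.slogdet(R1 + S2 / mu)
    trace_term = np.trace(np.linalg.solve(R1, S1))
    return (0.5 * (nu - n1) * logdet_R1
            - 0.5 * (nu + n2) * logdet_R1_S2
            - 0.5 * trace_term)
```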

3.1. Summary of results

As is illustrated below, the expression of the maximum likelihood estimator depends on the respective values of n1 and n2. In the sequel three cases will be distinguished: a first situation where n1 < p and n2 ≥ p, a second one which is the mirror situation, namely n1 ≥ p and n2 < p, and finally a third, more challenging case where n1 < p, n2 < p and n1 + n2 ≥ p.

In the first [respectively second] case, the ML solution is given by (11) [resp. (21)]: it will be shown that the estimation process entails whitening of X1 [resp. X2] by the inverse of the square-root of the sample covariance matrix of X2 [resp. X1], followed by shrinkage of eigenvalues and finally colorization by the square-root of the sample covariance matrix of X2 [resp. X1]. The technique of eigenvalue shrinkage is rather well known but usually applied to the SCM of a single set: herein, due to the presence of two data sets, this technique is applied to one data set after it has been whitened by the second one. Interestingly enough, the ML solution can also be written as (14) [resp. (22)], that is as a weighted sum of the SCM of each data set, where the weighting matrix is diagonal for one set of samples, and non-diagonal for the other set.

Finally, when n2 < p, n1 < p and n1 + n2 ≥ p, the procedure includes a partitioning between the subspace spanned by the columns of X2 and its orthogonal complement. In the former, shrinkage of eigenvalues is used while, in the latter, the projection of the SCM of X1 is retained.

3.2. Case n1 < p and n2 ≥ p

We consider first the case where n1 < p and n2 ≥ p, i.e., n1 is not large enough for S1 to be positive definite and one needs to use X2 in order to estimate R1, even though R2 ≠ R1. Eq. (5) can be rewritten as

$$(\nu-n_1)R_1^{-1}\left(R_1+\mu^{-1}S_2\right) - (\nu+n_2)I + R_1^{-1}S_1R_1^{-1}\left(R_1+\mu^{-1}S_2\right) = 0$$
$$\Rightarrow\; -(n_1+n_2)I + (\nu-n_1)\mu^{-1}R_1^{-1}S_2 + R_1^{-1}S_1 + \mu^{-1}R_1^{-1}S_1R_1^{-1}S_2 = 0$$
$$\Rightarrow\; R_1S_2^{-1}R_1 - \left[\frac{\nu-n_1}{\mu(n_1+n_2)}I + \frac{1}{n_1+n_2}S_1S_2^{-1}\right]R_1 - \frac{1}{\mu(n_1+n_2)}S_1 = 0 \quad (6)$$

Let S2 = L2 L2^T and let us define R̃12 = L2^{-1} R1 L2^{-T} and S̃1 = L2^{-1} S1 L2^{-T}. Then, pre-multiplying the previous equation by L2^{-1} and post-multiplying it by L2^{-T}, we obtain

$$\tilde R_{12}^2 - \left[\frac{\nu-n_1}{\mu(n_1+n_2)}I + \frac{1}{n_1+n_2}\tilde S_1\right]\tilde R_{12} - \frac{1}{\mu(n_1+n_2)}\tilde S_1 = 0 \quad (7)$$

Let w be an eigenvector of R̃12 associated with eigenvalue ξ. Then,

$$\xi^2 w - \xi\left[\frac{\nu-n_1}{\mu(n_1+n_2)}w + \frac{1}{n_1+n_2}\tilde S_1 w\right] - \frac{1}{\mu(n_1+n_2)}\tilde S_1 w = 0 \;\Rightarrow\; \left[\frac{\xi}{n_1+n_2} + \frac{1}{\mu(n_1+n_2)}\right]\tilde S_1 w = \xi\left[\xi - \frac{\nu-n_1}{\mu(n_1+n_2)}\right]w \quad (8)$$

which implies that w is also an eigenvector of S̃1. Either it is associated with a zero eigenvalue (there are p − n1 of them) and, in this case, ξ = (ν − n1)/(μ(n1 + n2)), or it is associated with a strictly positive eigenvalue λ and ξ satisfies the second-order polynomial equation

$$\xi^2 - \xi\left[\frac{\nu-n_1}{\mu(n_1+n_2)} + \frac{\lambda}{n_1+n_2}\right] - \frac{\lambda}{\mu(n_1+n_2)} = 0 \quad (9)$$

The above polynomial has obviously two real-valued roots, one being negative, the other being positive, and thus the latter is the eigenvalue of R̃12. To summarize, if we let L2^{-1} X1 = U Σ V^T = Σ_{k=1}^{n1} σ_k u_k v_k^T be the singular value decomposition of L2^{-1} X1, we have

$$\tilde R_{12} = \sum_{k=1}^{n_1}\xi_k u_ku_k^T + \frac{\nu-n_1}{\mu(n_1+n_2)}\sum_{k=n_1+1}^{p}u_ku_k^T = \sum_{k=1}^{n_1}\left[\xi_k - \frac{\nu-n_1}{\mu(n_1+n_2)}\right]u_ku_k^T + \frac{\nu-n_1}{\mu(n_1+n_2)}I \quad (10)$$

where ξ_k is the positive root of (9) with σ_k² substituted for λ. The MLE of R1 is thus

$$R_1 = \sum_{k=1}^{n_1}\left[\xi_k - \frac{\nu-n_1}{\mu(n_1+n_2)}\right]L_2u_ku_k^TL_2^T + \frac{\nu-n_1}{\mu(n_1+n_2)}S_2 \quad (11)$$

It is instructive to study the form of this solution. The original data X1 is first adaptively whitened by L2^{-1} and its sample covariance matrix is computed. The eigenvectors of the latter are retained and the eigenvalues are modified. Then, data is re-colored by L2. Note that the technique of regularizing eigenvalues while keeping eigenvectors is classical in robust covariance matrix estimation. However, this technique usually applies to one set of samples. Here it applies to one set of samples after it has been "whitened" by the other set. Indeed, a whitening-colorization operation is performed pre and post eigenvalue modification. Another important observation is that the transformation λ → ξ preserves the order of the eigenvalues, an important issue in Stein's estimation using eigenvalue decomposition [42-44]. This can be seen by differentiating (9) with respect to λ, which gives

$$\left[2\xi - \frac{\nu-n_1}{\mu(n_1+n_2)} - \frac{\lambda}{n_1+n_2}\right]\frac{\partial\xi}{\partial\lambda} = \frac{\xi}{n_1+n_2} + \frac{1}{\mu(n_1+n_2)} \quad (12)$$

Since the bracketed term on the left-hand side of the previous equation is positive, it follows that ∂ξ/∂λ > 0 and therefore the transformation preserves ordering of the eigenvalues. This property will hold true in the other cases developed below.

A comment is also in order regarding the behavior of the MLE when ν grows large, i.e., when W comes closer to I. Indeed, with μ = ν − p − 1, one has

$$\lim_{\nu\to\infty}\xi_k = \frac{1+\lambda_k}{n_1+n_2} \;\Rightarrow\; \lim_{\nu\to\infty}\tilde R_{12} = \frac{1}{n_1+n_2}\left[\tilde S_1 + I\right] \;\Rightarrow\; \lim_{\nu\to\infty}R_1 = \frac{1}{n_1+n_2}L_2\left[L_2^{-1}S_1L_2^{-T} + I\right]L_2^T = \frac{1}{n_1+n_2}\left[S_1+S_2\right] \quad (13)$$

which shows that, as W comes closer to I, i.e., as R2 comes closer to R1, the MLE is simply the sample covariance matrix of the whole data, as may be expected.

Finally, another interpretation of the MLE can be obtained by rewriting it in another form. Noting that the range space of L2^{-1} X1 coincides with the range space of u1, ..., u_{n1}, it follows that u_k = L2^{-1} X1 η_k for some vector η_k. Therefore, (11) can be rewritten as

$$R_1 = X_1\left[\sum_{k=1}^{n_1}\left(\xi_k - \frac{\nu-n_1}{\mu(n_1+n_2)}\right)\eta_k\eta_k^T\right]X_1^T + \frac{\nu-n_1}{\mu(n_1+n_2)}S_2 = X_1\Delta_1X_1^T + \frac{\nu-n_1}{\mu(n_1+n_2)}X_2X_2^T \quad (14)$$

Consequently, the MLE is a weighted version of the sample covariance matrices of each data set. In fact, it can be shown (we omit the details) that if a solution to (5) is sought which is of the form (14), then Δ1 is the solution to the equation

$$\Delta_1^2 + \left[\frac{\nu-n_1}{\mu(n_1+n_2)}\left(X_1^TS_2^{-1}X_1\right)^{-1} - \frac{1}{n_1+n_2}I\right]\Delta_1 - \frac{\nu+n_2}{\mu(n_1+n_2)^2}\left(X_1^TS_2^{-1}X_1\right)^{-1} = 0 \quad (15)$$

It ensues that Δ1 and X1^T S2^{-1} X1 share the same eigenvectors, which are indeed the right singular vectors v_k of L2^{-1} X1. Moreover, the eigenvalues γ_k of Δ1 satisfy

$$\gamma_k^2 + \gamma_k\left[\frac{(\nu-n_1)\sigma_k^{-2}}{\mu(n_1+n_2)} - \frac{1}{n_1+n_2}\right] - \frac{(\nu+n_2)\sigma_k^{-2}}{\mu(n_1+n_2)^2} = 0 \quad (16)$$

To summarize, the MLE of R1 can either be written as in (11), where the eigenvalues ξ_k are related to the eigenvalues λ_k of L2^{-1} S1 L2^{-T} by (9), or as in (14), where Δ1 is given by (15).
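The computation implied by (9)-(11) is summarized in the sketch below (assuming NumPy; a sketch under the stated assumptions, not the author's code). It whitens X1 by the Cholesky factor L2 of S2, shrinks the squared singular values through the positive root of (9), and re-colors by L2 as in (11):

```python
import numpy as np

def mle_case1(X1, X2, nu, mu):
    """MLE of R1 when n1 < p and n2 >= p, following Eqs. (9)-(11)."""
    p, n1 = X1.shape
    n2 = X2.shape[1]
    S2 = X2 @ X2.T
    L2 = np.linalg.cholesky(S2)                  # S2 = L2 L2^T (requires n2 >= p)
    U, sigma, _ = np.linalg.svd(np.linalg.solve(L2, X1), full_matrices=False)
    lam = sigma ** 2                             # non-zero eigenvalues of S1_tilde
    a = (nu - n1) / (mu * (n1 + n2))
    # positive root of xi^2 - xi*(a + lam/(n1+n2)) - lam/(mu*(n1+n2)) = 0, Eq. (9)
    b = a + lam / (n1 + n2)
    c = lam / (mu * (n1 + n2))
    xi = 0.5 * (b + np.sqrt(b ** 2 + 4.0 * c))
    L2U = L2 @ U
    return (L2U * (xi - a)) @ L2U.T + a * S2     # Eq. (11)
```

With μ = ν − p − 1 and ν large, the output approaches (S1 + S2)/(n1 + n2), in line with (13).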

3.3. Case n2 < p and n1 ≥ p

We now consider a situation where n2 < p and n1 ≥ p, under which one has a sufficient number of "good" samples X1 for S1 to be full-rank. Yet, it might be of interest to use X2 even though its covariance matrix R2 ≠ R1. The derivation of the MLE follows along the same lines as in the previous case, except that now S2 is rank-deficient and S1 is full-rank. Starting from the ML Eq. (5), one can also write

$$(\nu-n_1)R_1^{-1}\left(R_1+\mu^{-1}S_2\right) - (\nu+n_2)I + R_1^{-1}S_1R_1^{-1}\left(R_1+\mu^{-1}S_2\right) = 0$$
$$\Rightarrow\; -(n_1+n_2)I + (\nu-n_1)\mu^{-1}R_1^{-1}S_2 + R_1^{-1}S_1 + \mu^{-1}R_1^{-1}S_1R_1^{-1}S_2 = 0$$
$$\Rightarrow\; -(n_1+n_2)R_1S_1^{-1}R_1 + (\nu-n_1)\mu^{-1}R_1S_1^{-1}S_2 + R_1 + \mu^{-1}S_2 = 0$$
$$\Rightarrow\; R_1S_1^{-1}R_1 - \frac{\nu-n_1}{\mu(n_1+n_2)}R_1S_1^{-1}S_2 - \frac{1}{n_1+n_2}R_1 - \frac{1}{\mu(n_1+n_2)}S_2 = 0 \quad (17)$$

Let S1 = L1 L1^T and let us define R̃11 = L1^{-1} R1 L1^{-T} and S̃2 = L1^{-1} S2 L1^{-T}. Then, taking the transpose of the previous equation, pre-multiplying by L1^{-1} and post-multiplying by L1^{-T}, we obtain

$$\tilde R_{11}^2 - \left[\frac{\nu-n_1}{\mu(n_1+n_2)}\tilde S_2 + \frac{1}{n_1+n_2}I\right]\tilde R_{11} - \frac{1}{\mu(n_1+n_2)}\tilde S_2 = 0 \quad (18)$$

As before, it can be seen that R̃11 and S̃2 share the same eigenvectors. The p − n2 eigenvectors of S̃2 associated with a zero eigenvalue will correspond to a constant eigenvalue of R̃11 equal to (n1 + n2)^{-1}. A strictly positive eigenvalue ζ of R̃11 is related to its counterpart λ of S̃2 by

$$\zeta^2 - \zeta\left[\frac{\lambda(\nu-n_1)}{\mu(n_1+n_2)} + \frac{1}{n_1+n_2}\right] - \frac{\lambda}{\mu(n_1+n_2)} = 0 \quad (19)$$

Now, if we let L1^{-1} X2 = Y Θ Z^T = Σ_{k=1}^{n2} θ_k y_k z_k^T be the singular value decomposition of L1^{-1} X2, we have

$$\tilde R_{11} = \sum_{k=1}^{n_2}\zeta_k y_ky_k^T + \frac{1}{n_1+n_2}\sum_{k=n_2+1}^{p}y_ky_k^T = \sum_{k=1}^{n_2}\left[\zeta_k - \frac{1}{n_1+n_2}\right]y_ky_k^T + \frac{1}{n_1+n_2}I \quad (20)$$

where ζ_k is the positive root of (19) with θ_k² substituted for λ. The MLE of R1 becomes

$$R_1 = \sum_{k=1}^{n_2}\left[\zeta_k - \frac{1}{n_1+n_2}\right]L_1y_ky_k^TL_1^T + \frac{1}{n_1+n_2}S_1 \quad (21)$$

Again, since the range space of L1^{-1} X2 is spanned by y1, ..., y_{n2}, one has y_k = L1^{-1} X2 χ_k and hence

$$R_1 = X_2\left[\sum_{k=1}^{n_2}\left(\zeta_k - \frac{1}{n_1+n_2}\right)\chi_k\chi_k^T\right]X_2^T + \frac{1}{n_1+n_2}X_1X_1^T = X_2\Delta_2X_2^T + \frac{1}{n_1+n_2}X_1X_1^T \quad (22)$$

Note that (22) differs from (14) in that the weighting matrix applied between X1 and X1^T is now diagonal while that applied between X2 and X2^T is no longer diagonal. Furthermore, if one looks for a solution of the form (22), then Δ2 is the solution to

$$\Delta_2^2 + \Delta_2\left[\frac{1}{n_1+n_2}\left(X_2^TS_1^{-1}X_2\right)^{-1} - \frac{\nu-n_1}{\mu(n_1+n_2)}I\right] - \frac{\nu+n_2}{\mu(n_1+n_2)^2}\left(X_2^TS_1^{-1}X_2\right)^{-1} = 0 \quad (23)$$

and Δ2 and X2^T S1^{-1} X2 share the same eigenvectors (actually the z_k); the eigenvalue γ_k of Δ2 is obtained as the positive solution to

$$\gamma_k^2 + \gamma_k\left[\frac{\theta_k^{-2}}{n_1+n_2} - \frac{\nu-n_1}{\mu(n_1+n_2)}\right] - \frac{(\nu+n_2)\theta_k^{-2}}{\mu(n_1+n_2)^2} = 0 \quad (24)$$

Remark 1. When n1 ≥ p and n2 ≥ p, the previous techniques can still be used, with slight variations. In this case, S̃1 and S̃2 are now full-rank, and therefore the MLE of R1 is given by (11) but with the first sum extended to p eigenvectors (S̃1 has now p non-zero eigenvalues), and with the second term vanishing. The ML solution is also given by (21) with the first sum extended to p eigenvectors and the second term vanishing.
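The computation for the case n2 < p ≤ n1 mirrors the previous sketch, with the roles of the two data sets exchanged; a sketch of (19)-(21) under the same assumptions (NumPy, illustrative names):

```python
import numpy as np

def mle_case2(X1, X2, nu, mu):
    """MLE of R1 when n2 < p and n1 >= p, following Eqs. (19)-(21)."""
    p, n1 = X1.shape
    n2 = X2.shape[1]
    S1 = X1 @ X1.T
    L1 = np.linalg.cholesky(S1)                  # S1 = L1 L1^T (requires n1 >= p)
    Y, theta, _ = np.linalg.svd(np.linalg.solve(L1, X2), full_matrices=False)
    lam = theta ** 2                             # non-zero eigenvalues of S2_tilde
    c = 1.0 / (n1 + n2)
    # positive root of zeta^2 - zeta*(lam*(nu-n1)/(mu*(n1+n2)) + c) - lam/(mu*(n1+n2)) = 0
    b = lam * (nu - n1) / (mu * (n1 + n2)) + c
    d = lam / (mu * (n1 + n2))
    zeta = 0.5 * (b + np.sqrt(b ** 2 + 4.0 * d))
    L1Y = L1 @ Y
    return (L1Y * (zeta - c)) @ L1Y.T + c * S1   # Eq. (21)
```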

3.4. Case n1 < p, n2 < p and n1 + n2 ≥ p

We now consider the more challenging case where neither of the two data sets contains enough samples for their respective sample covariance matrices to be full rank, and thus it becomes mandatory to combine both sets. This situation is a bit trickier and requires some carefulness. Going back to (5), the MLE of R1 should satisfy

$$(\nu-n_1)R_1^{-1} - (\nu+n_2)\left(R_1+\mu^{-1}S_2\right)^{-1} + R_1^{-1}S_1R_1^{-1} = 0$$
$$\Rightarrow\; (\nu-n_1)R_1^{-1} + R_1^{-1}S_1R_1^{-1} - (\nu+n_2)\left[R_1^{-1} - \mu^{-1}R_1^{-1}X_2\left(I+\mu^{-1}X_2^TR_1^{-1}X_2\right)^{-1}X_2^TR_1^{-1}\right] = 0$$
$$\Rightarrow\; (n_1+n_2)R_1 = (\nu+n_2)\mu^{-1}X_2\left(I+\mu^{-1}X_2^TR_1^{-1}X_2\right)^{-1}X_2^T + X_1X_1^T \quad (25)$$

Before pursuing, it is worth looking at the previous equation to get some insight. We observe that the projection of R1 onto the subspace orthogonal to X2 will be equal to the projection of S1 onto this same subspace. This suggests to use a decomposition that splits data in R(X2) and its orthogonal complement. To do so, let us consider the SVD of X2 as

$$X_2 = CDE^T = \begin{bmatrix}C_a & C_b\end{bmatrix}\begin{bmatrix}D_a\\ 0\end{bmatrix}E^T = C_aD_aE^T$$

where C is p × p, Ca is p × n2 and Da is the n2 × n2 diagonal matrix of singular values. Let us also operate a change of coordinates and define

$$\Sigma = C^TR_1C = \begin{bmatrix}C_a^TR_1C_a & C_a^TR_1C_b\\ C_b^TR_1C_a & C_b^TR_1C_b\end{bmatrix} = \begin{bmatrix}\Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb}\end{bmatrix} \quad (26)$$

With these definitions, it is straightforward to show that X2^T R1^{-1} X2 = E Da Σ_{a.b}^{-1} Da E^T, where Σ_{a.b} = Σ_{aa} − Σ_{ab} Σ_{bb}^{-1} Σ_{ba}, and thus

$$X_2\left(I+\mu^{-1}X_2^TR_1^{-1}X_2\right)^{-1}X_2^T = C_aD_a\left(I+\mu^{-1}D_a\Sigma_{a.b}^{-1}D_a\right)^{-1}D_aC_a^T = C_a\left(D_a^{-2}+\mu^{-1}\Sigma_{a.b}^{-1}\right)^{-1}C_a^T \quad (27)$$

Therefore, pre-multiplying (25) by C^T and post-multiplying it by C, we obtain

$$(n_1+n_2)\begin{bmatrix}\Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb}\end{bmatrix} = (\nu+n_2)\mu^{-1}\begin{bmatrix}\left(D_a^{-2}+\mu^{-1}\Sigma_{a.b}^{-1}\right)^{-1} & 0\\ 0 & 0\end{bmatrix} + \begin{bmatrix}C_a^TS_1C_a & C_a^TS_1C_b\\ C_b^TS_1C_a & C_b^TS_1C_b\end{bmatrix} \quad (28)$$

which immediately implies that

$$(n_1+n_2)\Sigma_{ba} = C_b^TS_1C_a, \qquad (n_1+n_2)\Sigma_{bb} = C_b^TS_1C_b \quad (29)$$

This corroborates the comments we made after Eq. (25) since one has

$$(n_1+n_2)C_b\Sigma_{bb}C_b^T = (n_1+n_2)C_bC_b^TR_1C_bC_b^T = (n_1+n_2)P_{X_2}^{\perp}R_1P_{X_2}^{\perp} = C_bC_b^TS_1C_bC_b^T = P_{X_2}^{\perp}S_1P_{X_2}^{\perp} \quad (30)$$

It now remains to find Σaa or, equivalently, Σ_{a.b}. Towards this end, note that

$$(n_1+n_2)\Sigma_{aa} = (\nu+n_2)\mu^{-1}\left(D_a^{-2}+\mu^{-1}\Sigma_{a.b}^{-1}\right)^{-1} + C_a^TS_1C_a \quad (31)$$

However,

$$(n_1+n_2)\Sigma_{aa} = (n_1+n_2)\Sigma_{a.b} + (n_1+n_2)\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba} = (n_1+n_2)\Sigma_{a.b} + C_a^TS_1C_b\left(C_b^TS_1C_b\right)^{-1}C_b^TS_1C_a \quad (32)$$

which leads to

$$(n_1+n_2)\Sigma_{a.b} = (\nu+n_2)\mu^{-1}\left(D_a^{-2}+\mu^{-1}\Sigma_{a.b}^{-1}\right)^{-1} + \left(C^TS_1C\right)_{a.b} \quad (33)$$

For the sake of notational convenience, let us denote F = (C^T S1 C)_{a.b}. Post-multiplying the previous equation by (D_a^{-2} + μ^{-1} Σ_{a.b}^{-1}) results in

$$(n_1+n_2)\Sigma_{a.b}D_a^{-2}\Sigma_{a.b} - \left[(\nu-n_1)\mu^{-1}I + FD_a^{-2}\right]\Sigma_{a.b} - \mu^{-1}F = 0 \;\Rightarrow\; \tilde\Sigma_{a.b}^2 - \left[\frac{\nu-n_1}{\mu(n_1+n_2)}I + \frac{1}{n_1+n_2}\tilde F\right]\tilde\Sigma_{a.b} - \frac{1}{\mu(n_1+n_2)}\tilde F = 0 \quad (34)$$

where Σ̃_{a.b} = Da^{-1} Σ_{a.b} Da^{-1} and F̃ = Da^{-1} F Da^{-1}. Similarly to what was done before, Σ̃_{a.b} and F̃ share the same eigenvectors. When the eigenvalue λ of F̃ is zero (there are actually p − n1 of them [45]), the corresponding eigenvalue φ of Σ̃_{a.b} is (ν − n1)/(μ(n1 + n2)). For each of the r = n1 + n2 − p non-zero λ, the corresponding φ is the unique positive root of

$$\phi^2 - \left[\frac{\nu-n_1}{\mu(n_1+n_2)} + \frac{\lambda}{n_1+n_2}\right]\phi - \frac{\lambda}{\mu(n_1+n_2)} = 0 \quad (35)$$

Therefore, if ũ_k are the eigenvectors of F̃, Σ̃_{a.b} is given by

$$\tilde\Sigma_{a.b} = \sum_{k=1}^{r}\phi_k\tilde u_k\tilde u_k^T + \frac{\nu-n_1}{\mu(n_1+n_2)}\sum_{k=r+1}^{n_2}\tilde u_k\tilde u_k^T = \sum_{k=1}^{r}\left[\phi_k - \frac{\nu-n_1}{\mu(n_1+n_2)}\right]\tilde u_k\tilde u_k^T + \frac{\nu-n_1}{\mu(n_1+n_2)}I \quad (36)$$

Once Σ̃_{a.b} is computed, Σ_{a.b} = Da Σ̃_{a.b} Da and Σaa can be obtained from (32). Finally, the MLE of R1 is given by R1 = C Σ C^T.

We now present an alternative way to compute the solution. From (25), it appears that R1 can be written as (n1 + n2) R1 = X1 X1^T + X2 Δ2 X2^T, where Δ2 = (ν + n2) μ^{-1} (I + μ^{-1} X2^T R1^{-1} X2)^{-1}. Let X = [X1 X2] and let X^T = QR be the QR decomposition of X^T, with Q a (n1 + n2) × p semi-orthogonal matrix, i.e., Q^T Q = I_p. Let us partition Q as Q = [Q1; Q2] so that X1^T = Q1 R and X2^T = Q2 R. Then, one has

$$(n_1+n_2)R_1 = X_1X_1^T + X_2\Delta_2X_2^T = R^T\left[Q_1^TQ_1 + Q_2^T\Delta_2Q_2\right]R$$

and therefore

$$\begin{aligned}(n_1+n_2)^{-1}X_2^TR_1^{-1}X_2 &= Q_2\left[Q_1^TQ_1 + Q_2^T\Delta_2Q_2\right]^{-1}Q_2^T = Q_2\left[I + Q_2^T(\Delta_2-I)Q_2\right]^{-1}Q_2^T \\ &= Q_2\left[I - Q_2^T\left((\Delta_2-I)^{-1}+Q_2Q_2^T\right)^{-1}Q_2\right]Q_2^T \\ &= Q_2Q_2^T - Q_2Q_2^T\left[(\Delta_2-I)^{-1}+Q_2Q_2^T\right]^{-1}Q_2Q_2^T \\ &= \left[\left(Q_2Q_2^T\right)^{-1}+\Delta_2-I\right]^{-1}\end{aligned} \quad (37)$$

Consequently, if we define B2 = (Q2 Q2^T)^{-1} − I,

$$\Delta_2^{-1} = (\nu+n_2)^{-1}\mu\left[I+\mu^{-1}X_2^TR_1^{-1}X_2\right] = (\nu+n_2)^{-1}\mu I + (\nu+n_2)^{-1}X_2^TR_1^{-1}X_2 = (\nu+n_2)^{-1}\mu I + (\nu+n_2)^{-1}(n_1+n_2)\left(\Delta_2+B_2\right)^{-1} \quad (38)$$

Pre-multiplying the previous equation by (Δ2 + B2) and post-multiplying by Δ2, we obtain the following second-order polynomial equation:

$$\Delta_2^2 + \left[B_2 - \frac{\nu-n_1}{\mu}I\right]\Delta_2 - \frac{\nu+n_2}{\mu}B_2 = 0 \quad (39)$$

It follows that Δ2 and B2 share the same eigenvectors. If λ is a non-zero eigenvalue of B2 (there are n1 + n2 − p of them), then the corresponding eigenvalue γ of Δ2 is the unique positive root of the following polynomial equation:

$$\gamma^2 + \left[\lambda - \frac{\nu-n_1}{\mu}\right]\gamma - \frac{\nu+n_2}{\mu}\lambda = 0 \quad (40)$$

If λ = 0 then γ = (ν − n1) μ^{-1}. Finally, the solution Δ2 is given by

$$\Delta_2 = \sum_{k=1}^{r}\gamma_k b_kb_k^T + \frac{\nu-n_1}{\mu}\sum_{k=r+1}^{n_2}b_kb_k^T = \sum_{k=1}^{r}\left[\gamma_k - \frac{\nu-n_1}{\mu}\right]b_kb_k^T + \frac{\nu-n_1}{\mu}I \quad (41)$$

where b_k are the eigenvectors of B2. Note that

$$B_2b = \lambda b \;\Rightarrow\; \left(Q_2Q_2^T\right)^{-1}b - b = \lambda b \;\Rightarrow\; \left(Q_2Q_2^T\right)^{-1}b = (1+\lambda)b \;\Rightarrow\; \left(Q_2Q_2^T\right)b = (1+\lambda)^{-1}b$$

and hence b is an eigenvector of Q2 Q2^T associated with eigenvalue (1 + λ)^{-1}, or equivalently a right singular vector of Q2^T. Observe also that, since X2^T = Q2 R and X X^T = R^T Q^T Q R = R^T R, one has

$$Q_2Q_2^T = X_2^TR^{-1}R^{-T}X_2 = X_2^T\left(R^TR\right)^{-1}X_2 = X_2^T\left(XX^T\right)^{-1}X_2 = X_2^T\left(X_1X_1^T+X_2X_2^T\right)^{-1}X_2$$

Hence, if we let S = X1 X1^T + X2 X2^T = L L^T, then Q2^T and L^{-1} X2 share the same right singular vectors.
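The alternative route through (37)-(41) is convenient to implement because it only requires a QR decomposition and an n2 × n2 eigendecomposition. A sketch, assuming NumPy (function and variable names are ours):

```python
import numpy as np

def mle_case3(X1, X2, nu, mu):
    """MLE of R1 when n1 < p, n2 < p and n1 + n2 >= p, following Eqs. (37)-(41)."""
    p, n1 = X1.shape
    n2 = X2.shape[1]
    XT = np.vstack([X1.T, X2.T])                 # X^T, of size (n1+n2) x p
    Q, _ = np.linalg.qr(XT)                      # X^T = Q R with Q^T Q = I_p
    Q2 = Q[n1:, :]                               # X2^T = Q2 R
    B2 = np.linalg.inv(Q2 @ Q2.T) - np.eye(n2)   # B2 = (Q2 Q2^T)^{-1} - I
    lam, V = np.linalg.eigh(B2)
    a = (nu - n1) / mu
    # positive root of gamma^2 + (lam - a)*gamma - (nu + n2)*lam/mu = 0, Eq. (40)
    gamma = 0.5 * ((a - lam) + np.sqrt((lam - a) ** 2 + 4.0 * (nu + n2) * lam / mu))
    Delta2 = (V * gamma) @ V.T                   # Eq. (41)
    return (X1 @ X1.T + X2 @ Delta2 @ X2.T) / (n1 + n2)
```

For the zero eigenvalues of B2 this formula returns γ = (ν − n1)/μ, consistent with (41).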

4. Numerical simulations

In this section, we evaluate numerically the performance of the MLE presented above through Monte-Carlo simulations. We consider a scenario where the size of the observation space is p = 128. Three cases will be considered for the covariance matrix R1, which correspond to different kinds of processes. In the first case the (k, ℓ) element is R1(k, ℓ) = P ρ^{|k−ℓ|} + δ(k, ℓ) with ρ = 0.7. The second case assumes that R1(k, ℓ) = P e^{−0.5(2πσ_f |k−ℓ|)²} + δ(k, ℓ) with σ_f = 0.02. In the third case, R1(k, ℓ) = r_AR(|k − ℓ|) + δ(k, ℓ), where r_AR(|k − ℓ|) corresponds to the correlation of an autoregressive process whose poles are located at 0.95 e^{±i2π 0.05}, 0.9 e^{±i2π 0.15} and 0.9 e^{±i2π 0.18}. Finally, P = 100 and r_AR(0) = 100. The corresponding processes are rather lowpass in cases 1 and 2, while case 3 concerns processes with sharp peaks in their spectrum. In each simulation X1 is generated from a Gaussian distribution with covariance matrix R1. Then W is generated from a Wishart distribution with ν = p + 2 degrees of freedom and parameter matrix (ν − p − 1)^{-1} I, and R2 is computed as R2 = G1 W^{-1} G1^T. Then X2 is generated from a Gaussian distribution with covariance matrix R2.

The MLE is compared with four competitors. The first is the sample covariance matrix based on all samples, i.e., (n1 + n2)^{-1} S where S = X1 X1^T + X2 X2^T. The second is of the form G_SCM D G_SCM^T, where G_SCM is the Cholesky factor of S and D is a diagonal matrix which is chosen to minimize Stein's loss and is given by D_{k,k} = 1/(n1 + n2 + p − 2k + 1). The third is of the same form but is meant at minimizing the natural distance between R1 and its estimate: as shown in [13], it amounts to choosing $D_{k,k} = \exp\{-\mathrm{E}[\log\chi^2_{n_1+n_2-k+1}]\}$. Finally, we consider the class of orthogonally invariant estimators of the form U_SCM diag(φ(λ)) U_SCM^T, where S = U_SCM diag(λ) U_SCM^T is the eigenvalue decomposition of S and φ(λ) = [φ_1(λ), ..., φ_p(λ)]. Stein showed that the choice $\phi_k = \lambda_k/\big(n_1+n_2-p+1+2\lambda_k\sum_{j\neq k}(\lambda_k-\lambda_j)^{-1}\big)$ is the best with respect to Stein's loss. However this choice has two drawbacks: it can result in some φ_k < 0 and it does not preserve the order of the eigenvalues λ_k, which is a problem [42]. In order to overcome these problems, Stein proposed an isotonizing scheme that guarantees φ_k > 0 and preserves order, see [46] for details of this scheme. We consider this improved estimator as the fourth alternative. The figure of merit for all estimators will be the natural distance between the true and the estimated covariance matrices, d²(R1, R̂1) = Σ_{k=1}^p log² λ_k(R1^{-1} R̂1).
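A sketch of the figure of merit and of the case 1 covariance matrix, assuming NumPy/SciPy (illustrative code only, not the simulation code used to produce the figures):

```python
import numpy as np
from scipy.linalg import toeplitz

def natural_distance_sq(R_true, R_hat):
    """d^2(R1, R1_hat) = sum_k log^2 lambda_k(R1^{-1} R1_hat)."""
    lam = np.real(np.linalg.eigvals(np.linalg.solve(R_true, R_hat)))
    return np.sum(np.log(lam) ** 2)

def covariance_case1(p, P=100.0, rho=0.7):
    """Case 1: R1(k, l) = P * rho^{|k - l|} + delta(k, l)."""
    return P * toeplitz(rho ** np.arange(p)) + np.eye(p)
```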

The simulation results are shown in Figs. 1-4, where we consider different values for the total number of samples n = n1 + n2, namely n = p, n = 3p/2 and n = 2p. The main conclusions regarding these simulations are the following:

• the MLE is shown to outperform its competitors when n1 is small and n is large enough; typically it has the best performance for n = 3p/2 and n = 2p. One can observe that the improvement achieved by the MLE is more important when n = 2p and n1 is small, i.e., when one has very few samples drawn from R1 and a large majority of samples drawn from R2.
• in contrast, when n = p the other methods can perform better than the MLE, especially when n1 is above a threshold, i.e., when the number of "good" samples is large enough.
• among the Stein-like methods, that based on eigenvalue decomposition (with isotonizing) is the best, but the method based on Cholesky factorization and minimization of the geodesic distance comes very close.

Fig. 1. Average distance between R̂1 and R1 in case 1.
Fig. 2. Average distance between R̂1 and R1 in case 2.
Fig. 3. Average distance between R̂1 and R1 in case 3.

In a final simulation, we evaluate the influence of ν: recall that, as ν increases, W is closer to I and thus R2 is closer to R1, which means that X2 should be nearly as informative as X1. In Fig. 4 we display the average distance as a function of ν in case 1 with n1 + n2 = 2p. It is observed that, as ν increases, the performance of all estimators improves. The proposed MLE is no longer the most accurate above a threshold, where it is dominated by Stein's estimator based on the eigenvalues of the whole sample covariance matrix. However, the proposed MLE still performs better than all other estimators.

Fig. 4. Average distance between R̂1 and R1 in case 1 versus ν, with n1 + n2 = 2p.

5. Conclusions

In this paper, we considered the problem of estimating a covariance matrix R1 from two data sets, one set X1 whose covariance matrix is actually R1 and another set X2 whose covariance matrix R2 is different but close to R1. Since the distance between R1 and R2 depends on the eigenvalues of W = G1^T R2^{-1} G1, we embedded the latter in a Bayesian framework and assumed that it followed a Wishart distribution around the identity matrix. We showed that the problem is that of estimating R1 from two data sets with different distributions. The maximum likelihood estimator was derived and its expression was shown to depend on the number of samples in X1 and X2. The MLE was shown to perform quite well, as compared to state of the art algorithms, at least when the number of samples in X1 is small and the total number of samples n is large enough. However, as in a classical framework with a single data set, there is room for improvement of the MLE, especially in low sample support. Therefore, future work should be devoted to improving the MLE in this situation. For instance, one could study how the MLE could be regularized or could investigate whether a Stein-like approach is possible for this two data sets framework. Alternatively, a frequentist approach where joint estimation of R1 and W is performed under some constraints constitutes a worthy path of investigation.

Declaration of Competing Interest

None.

Appendix A. Extension to complex-valued data

In this appendix, we briefly show that the derivations concerning the maximum likelihood estimator can be extended in a straightforward manner to the complex case. Let us assume here that X1 | R1 ~ CN(0, R1, I) and X2 | R2 ~ CN(0, R2, I) are complex-valued data, distributed according to a circularly symmetric complex-valued matrix-variate normal distribution. Let R1 = G1 G1^H, where ^H stands for the Hermitian transpose, and R2 = G1 W^{-1} G1^H, where W ~ CW_p(ν, μ^{-1} I) follows a complex Wishart distribution. The statistical (complex-valued) model is thus

$$p(X_1,X_2\mid R_1,W) = \pi^{-p(n_1+n_2)}\,|R_1|^{-n_1}\left|W^{-1}R_1\right|^{-n_2}\operatorname{etr}\left\{-X_1^HR_1^{-1}X_1 - X_2^HG_1^{-H}WG_1^{-1}X_2\right\} \quad (A.1a)$$

$$p(W) = \frac{\mu^{\nu p}}{\tilde\Gamma_p(\nu)}\,|W|^{\nu-p}\operatorname{etr}\{-\mu W\} \quad (A.1b)$$

Note that, in the complex case, E{W^{-1}} = (ν − p)^{-1} μ I [40], so that E{R2} = E{G1 W^{-1} G1^H} = (ν − p)^{-1} μ R1. Therefore, for E{R2} to be equal to R1, one must have μ = ν − p in the complex case, instead of μ = ν − p − 1 in the real case.

The marginal distribution of (X1, X2) is now

$$\begin{aligned} p(X_1,X_2\mid R_1) &= \int_{W>0}p(X_1,X_2\mid R_1,W)\,p(W)\,dW \\ &= \frac{\pi^{-p(n_1+n_2)}\mu^{\nu p}}{\tilde\Gamma_p(\nu)}\,|R_1|^{-(n_1+n_2)}\operatorname{etr}\left\{-X_1^HR_1^{-1}X_1\right\}\int_{W>0}|W|^{\nu+n_2-p}\operatorname{etr}\left\{-W\left(\mu I + G_1^{-1}X_2X_2^HG_1^{-H}\right)\right\}dW \\ &= \frac{\pi^{-p(n_1+n_2)}\mu^{\nu p}\,\tilde\Gamma_p(\nu+n_2)}{\tilde\Gamma_p(\nu)}\,|R_1|^{-(n_1+n_2)}\operatorname{etr}\left\{-X_1^HR_1^{-1}X_1\right\}\left|\mu I + G_1^{-1}X_2X_2^HG_1^{-H}\right|^{-(\nu+n_2)} \\ &= \pi^{-pn_1}|R_1|^{-n_1}\operatorname{etr}\left\{-X_1^HR_1^{-1}X_1\right\} \times \pi^{-pn_2}\frac{\tilde\Gamma_p(\nu+n_2)}{\tilde\Gamma_p(\nu)}\,|\mu R_1|^{-n_2}\left|I + X_2^H[\mu R_1]^{-1}X_2\right|^{-(\nu+n_2)} \end{aligned} \quad (A.2)$$

and we recover the fact that X1 | R1 is Gaussian distributed and that X2 | R1 is Student distributed. From (A.2), the log-likelihood function is, up to an additive constant,

$$\tilde f(R_1) = -(n_1+n_2)\log|R_1| - (\nu+n_2)\log\left|I+\mu^{-1}R_1^{-1}S_2\right| - \operatorname{Tr}\left\{R_1^{-1}S_1\right\} = (\nu-n_1)\log|R_1| - (\nu+n_2)\log\left|R_1+\mu^{-1}S_2\right| - \operatorname{Tr}\left\{R_1^{-1}S_1\right\} \quad (A.3)$$

where S1 = X1 X1^H and S2 = X2 X2^H. Differentiating the previous equation, it follows that the maximum likelihood estimator of R1 should satisfy

$$(\nu-n_1)R_1^{-1} - (\nu+n_2)\left(R_1+\mu^{-1}S_2\right)^{-1} + R_1^{-1}S_1R_1^{-1} = 0 \quad (A.4)$$

which is exactly (5), the equation in the real case. From there, all previous derivations follow simply by replacing the transpose by the Hermitian transpose.

References

[1] R.J. Muirhead, Aspects of Multivariate Statistical Theory, John Wiley & Sons, Hoboken, NJ, 1982.
[2] L.L. Scharf, Statistical Signal Processing: Detection, Estimation and Time Series Analysis, Addison Wesley, Reading, MA, 1991.
[3] M.S. Srivastava, Methods of Multivariate Statistics, John Wiley & Sons, New York, 2002.
[4] C. Stein, Inadmissibility of the usual estimator for the mean of a multivariate normal distribution, in: Proceedings 3rd Berkeley Symposium on Mathematical Statistics and Probability, 1956, pp. 197–206.
[5] C. Stein, Lectures on the theory of estimation of many parameters, J. Math. Sci. 34 (1986) 1373–1403.
[6] W. James, C. Stein, Estimation with quadratic loss, Springer Series in Statistics (Perspectives in Statistics), Springer, pp. 443–460.
[7] D.K. Dey, C. Srinivasan, Estimation of a covariance matrix under Stein's loss, Ann. Stat. 13 (4) (1985) 1581–1591.
[8] D.K. Dey, C. Srinivasan, Trimmed minimax estimator of a covariance matrix, Ann. Inst. Stat. Math. 38 (1986) 101–108.
[9] F. Perron, Minimax estimators of a covariance matrix, J. Multivar. Anal. 43 (1) (1992) 16–28.
[10] T. Ma, L. Jia, Y. Su, A new estimator of covariance matrix, J. Stat. Plan. Inference 142 (2) (2012) 529–536.
[11] H. Tsukuma, Estimation of a high-dimensional covariance matrix with the Stein loss, J. Multivar. Anal. 148 (2016) 1–17.
[12] H. Tsukuma, Minimax estimation of a normal covariance matrix with the partial Iwasawa decomposition, J. Multivar. Anal. 145 (2016) 190–207.
[13] M.-T. Tsai, On the maximum likelihood estimation of a covariance matrix, Math. Methods Stat. 27 (2018) 71–82.
[14] L.R. Haff, Empirical Bayes estimation of the multivariate normal covariance matrix, Ann. Stat. 8 (3) (1980) 586–597.
[15] O. Ledoit, M. Wolf, A well-conditioned estimator for large-dimensional covariance matrices, J. Multivar. Anal. 88 (2) (2004) 365–411.
[16] P. Stoica, J. Li, X. Zhu, J.R. Guerci, On using a priori knowledge in space–time adaptive processing, IEEE Trans. Signal Process. 56 (6) (2008) 2598–2602.
[17] Y. Chen, A. Wiesel, Y.C. Eldar, A.O. Hero, Shrinkage algorithms for MMSE covariance estimation, IEEE Trans. Signal Process. 58 (10) (2010) 5016–5029.
[18] T. Fisher, X. Sun, Improved Stein-type shrinkage estimators for the high-dimensional multivariate normal covariance matrix, Comput. Stat. Data Anal. 55 (5) (2011) 1909–1918.
[19] A. Coluccia, Regularized covariance matrix estimation via empirical Bayes, IEEE Signal Process. Lett. 22 (11) (2015) 2127–2131.
[20] Y. Ikeda, T. Kubokawa, M.S. Srivastava, Comparison of linear shrinkage estimators of a large covariance matrix in normal and non-normal distributions, Comput. Stat. Data Anal. 95 (2016) 95–108.
[21] T. Kubokawa, M.S. Srivastava, Robust improvement in estimation of a covariance matrix in an elliptically contoured distribution, Ann. Stat. 27 (2) (1999) 600–609.
[22] F. Pascal, P. Forster, J.-P. Ovarlez, P. Larzabal, Performance analysis of covariance matrix estimates in impulsive noise, IEEE Trans. Signal Process. 56 (6) (2008) 2206–2217.
[23] Y. Chen, A. Wiesel, A.O. Hero, Robust shrinkage estimation of high-dimensional covariance matrices, IEEE Trans. Signal Process. 59 (9) (2011) 4097–4107.
[24] E. Ollila, D. Tyler, V. Koivunen, H. Poor, Complex elliptically symmetric distributions: survey, new results and applications, IEEE Trans. Signal Process. 60 (11) (2012) 5597–5625.
[25] A. Wiesel, Unified framework to regularized covariance estimation in scaled Gaussian models, IEEE Trans. Signal Process. 60 (1) (2012) 29–38.
[26] M. Mahot, F. Pascal, P. Forster, J.-P. Ovarlez, Asymptotic properties of robust complex covariance matrix estimates, IEEE Trans. Signal Process. 61 (13) (2013) 3348–3356.
[27] Y.I. Abramovich, O. Besson, Regularized covariance matrix estimation in complex elliptically symmetric distributions using the expected likelihood approach - part 1: the oversampled case, IEEE Trans. Signal Process. 61 (23) (2013) 5807–5818.
[28] O. Besson, Y.I. Abramovich, Regularized covariance matrix estimation in complex elliptically symmetric distributions using the expected likelihood approach - part 2: the under-sampled case, IEEE Trans. Signal Process. 61 (23) (2013) 5819–5829.
[29] F. Pascal, Y. Chitour, Y. Quek, Generalized robust shrinkage estimator and its application to STAP detection problem, IEEE Trans. Signal Process. 62 (21) (2014) 5640–5651.
[30] E. Ollila, E. Raninen, Optimal high-dimensional shrinkage covariance estimation for elliptical distributions, IEEE Trans. Signal Process. 67 (10) (2019) 2707–2719.
[31] W.L. Melvin, Space-time adaptive radar performance in heterogeneous clutter, IEEE Trans. Aerosp. Electron. Syst. 36 (2) (2000) 621–633.
[32] W.L. Melvin, J.A. Scheer (Eds.), Principles of Modern Radar: Advanced Techniques, vol. 2, Institution of Engineering and Technology, 2012.
[33] R. Nitzberg, An effect of range-heterogeneous clutter on adaptive Doppler filters, IEEE Trans. Aerosp. Electron. Syst. 26 (3) (1990) 475–480.
[34] D.J. Rabideau, A.O. Steinhardt, Improved adaptive clutter cancellation through data-adaptive training, IEEE Trans. Aerosp. Electron. Syst. 35 (3) (1999) 879–891.
[35] L.M. Novak, Change detection for multi-polarization multi-pass SAR, in: Proceedings SPIE 5808, Algorithms for Synthetic Aperture Radar Imagery XII, 2005, pp. 234–246.
[36] N.M. Nasrabadi, Hyperspectral target detection: an overview of current and future challenges, IEEE Signal Process. Mag. 31 (1) (2014) 34–44.
[37] R. Bhatia, Positive Definite Matrices, Princeton University Press, 2007.
[38] S.T. Smith, Covariance, subspace and intrinsic Cramér-Rao bounds, IEEE Trans. Signal Process. 53 (5) (2005) 1610–1630.
[39] R.S. Raghavan, False alarm analysis of the AMF algorithm for mismatched training, IEEE Trans. Signal Process. 67 (1) (2019) 83–96.
[40] J.A. Tague, C.I. Caldwell, Expectations of useful complex Wishart forms, Multidimensional Systems and Signal Processing 5 (1994) 263–279.
[41] A.K. Gupta, D.K. Nagar, Matrix Variate Distributions, Chapman & Hall/CRC, Boca Raton, FL, 2000.
[42] Y. Sheena, A. Takemura, Inadmissibility of non-order preserving orthogonally invariant estimators of the covariance matrix in the case of Stein's loss, J. Multivar. Anal. 41 (1992) 117–131.
[43] B. Rajaratnam, D. Vincenzi, A theoretical study of Stein's covariance estimator, Biometrika 103 (2016) 653–666.
[44] B. Naul, B. Rajaratnam, D. Vincenzi, The role of the isotonizing algorithm in Stein's covariance matrix estimator, Comput. Stat. 31 (4) (2016) 1453–1476.
[45] L. Guttman, General theory and methods for matric factoring, Psychometrika 9 (1) (1944) 1–16.
[46] S. Lin, M. Perlman, A Monte Carlo comparison of four estimators of a covariance matrix, in: P.R. Krishnaiah (Ed.), Multivariate Analysis VI, North Holland, Amsterdam, 1985, pp. 411–429.
Bhatia , Positive Definite Matrices, Princeton University Press, 2007 . matrix, Ann. Stat. 8 (3) (1980) 586–597 . [38] S.T. Smith , Covariance, subspace and intrinsic Cramér-Rao bounds, IEEE Trans. [15] O. Ledoit , M. Wolf , A well-conditioned estimator for large-dimensional covari- Signal Process. 53 (5) (2005) 1610–1630 . ance matrices, J. Multivar. Anal. 88 (2) (2004) 365–411 . [39] R.S. Raghavan , False alarm analysis of the AMF algorithm for mismatched [16] P. Stoica , J. Li , X. Zhu , J.R. Guerci , On using a priori knowledge in space– training, IEEE Trans. Signal Process. 67 (1) (2019) 83–96 . time adaptive processing, IEEE Trans.Signal Process. 56 (6) (2008) 2598– [40] J.A. Tague , C.I. Caldwell , Expectations of useful complex Wishart forms, Multi- 2602 . dimensional Systems and Signal Processing 5 (1994) 263–279 . [17] Y. Chen , A. Wiesel , Y.C. Eldar , A.O. Hero ,Shrinkage algorithms for MMSE [41] A.K. Gupta , D.K. Nagar , Matrix Variate Distributions, Chapman & Hall/CRC, Boca covariance estimation, IEEE Trans.Signal Process. 58 (10) (2010) 5016– Raton, FL, 20 0 0 . 5029 . [42] Y. Sheena, A. Takemura , Inadmissibility of non-order preserving orthogonally [18] T. Fisher , X. Sun , Improved stein-type shrinkage estimators for the high-dimen- invariant estimators of the covariance matrix in the case of Stein’s loss, J. Mul- sional multivariate normal covariance matrix, Comput. Stat. Data Anal. 55 (5) tivar. Anal. 41 (1992) 117–131 . (2011) 1909–1918 . [43] B. Rajaratnam , D. Vincenzi , A theoretical study of Stein’s covariance estimator, [19] A. Coluccia , Regularized covariance matrix estimation via empirical Bayes, IEEE Biometrika 103 (2016) 653–666 . Signal Process. Lett. 22 (11) (2015) 2127–2131 . [44] B. Naul , B. Rajaratnam , D. Vincenzi , The role of isotonizing algorithm in Stein’s [20] Y. Ikeda , T. Kubokawa , M.S. Srivastava , Comparison of linear shrinkage esti- covariance matrix estimator, Comput. Stat. 31 (4) (2016) 1453–1476 . mators of a large covariance matrix in normal and non-normal distributions, [45] L. Guttman , General theory and methods for matric factoring, Psychometrica 9 Comput. Stat. Data Anal. 95 (2016) 95–108 . (1) (1944) 1–16 . [21] T. Kubokawa , M.S. Srivastava , Robust improvement in estimation of a covari- [46] S. Lin , M. Perlman , A Monte Carlo comparison of four estimators of a covari- ance matrix in an elliptically contoured distribution, Ann. Stat. 27 (2) (1999) ance matrix, in: P.R. Krishnaiah (Ed.), Multivariate Analysis VI, North Holland, 600–609 . Amsterdam, 1985, pp. 411–429 .