arXiv:1605.07646v4 [stat.ME] 8 May 2020

AIMS: AVERAGE INFORMATION MATRIX SPLITTING

Shengxin Zhu*
Department of Mathematics, Xi'an Jiaotong-Liverpool University
Laboratory for Intelligent Computing and Financial Technology
Suzhou 215123, P.R.China

Tongxiang Gu and Xingping Liu
Institute of Applied Physics and Computational Mathematics
Laboratory of Computational Physics
Beijing 100088, P.R.China

Abstract. For linear mixed models with co-variance matrices that are not necessarily linearly dependent on the variance component parameters, we prove that the average of the observed information and the Fisher information can be split into two parts. The essential part enjoys a simple and computationally friendly formula, while the other part, which involves a lot of computation, is a random zero matrix and thus is negligible.

2010 Mathematics Subject Classification. Primary: 65J12, 62P10; Secondary: 65C60, 65F99.

Key words and phrases. Observed information matrix, Fisher information matrix, Newton method, linear mixed model, variance parameter estimation, average information.

* Corresponding author: Shengxin Zhu. This research is supported by the Foundation of LCP (6142A05180501), Jiangsu Science and Technology Basic Research Program (BK20171237), Key Program Special Fund of XJTLU (KSF-E-21, KSF-P-02), XJTLU Research Development Fund (RDF-2017-02-23), and partially supported by NSFC (No. 11771002, 11571047, 11671049, 11671051, 61672003 and 11871339).

1. Introduction. Many statistical methods require an estimation of unknown (co-)variance parameter(s). The estimation is usually obtained by maximizing a likelihood function. In principle, one requires the negative Hessian matrix of the log-likelihood—the so-called observed information matrix—to obtain a maximum likelihood estimator according to the Newton-Raphson method [3]. The observed information matrix is usually referred to simply as the information matrix [7, p.30], and its use is now a standard procedure in computational statistics [11]. The expected value of the observed information matrix is the Fisher information matrix. It keeps the essential information about the unknown parameters and enjoys a simpler formula; therefore it is widely used in many applications. The resulting algorithm is called the Fisher-scoring algorithm [13].

The Fisher scoring algorithm is a success in simplifying the approximation of the Hessian matrix of the log-likelihood. Still, evaluating the elements of the Fisher information matrix remains one of the bottlenecks of the maximization procedure, which prohibits the use of the Fisher scoring algorithm for large data sets. In particular, high-throughput technologies in biological science and engineering mean that the size of data sets has suddenly increased by several orders of magnitude. Further simplification of the corresponding statistical models and the computational procedure is therefore quite needed for applications of large scale statistical models, such as genome-wide association studies, which involve many thousands of parameters to be estimated [22].

The aim of this short note is to provide a concise mathematical result: the average of the observed information and the Fisher information can be split into two parts; the essential part enjoys a simple formula and is easy to compute, while the other part, which involves a lot of computation, is a random zero matrix and thus is negligible. Such a splitting and approximation provides a significant reduction in computation. We should mention that the average information idea was proposed in [12] for (co)variance matrices which depend linearly on the variance parameters. It results in an efficient algorithm for breeding applications [6], followed by [14]. However, previous results assume that the variance-covariance matrix is linearly dependent on the underlying variance parameters. Here we prove that similar results can still be obtained even when the variance-covariance matrix is not linearly dependent on the underlying variance parameters.

2. Preliminary. Consider the following widely-used linear mixed model [1, 2, 5, 21, 28]
\[
y = X\tau + Zu + \epsilon, \tag{1}
\]
where $y \in \mathbb{R}^{n\times 1}$ is the observation, $\tau \in \mathbb{R}^{p\times 1}$ is the vector of fixed effects, $X \in \mathbb{R}^{n\times p}$ is the design matrix which corresponds to the fixed effects, $u \in \mathbb{R}^{b\times 1}$ is the vector of unobserved random effects, $Z \in \mathbb{R}^{n\times b}$ is the design matrix which corresponds to the random effects, and $\epsilon \in \mathbb{R}^{n\times 1}$ is the vector of residual errors. The random effects $u$ and the residual errors $\epsilon$ follow multivariate normal distributions such that $E(u)=0$, $E(\epsilon)=0$, $u \sim N(0,\sigma^2 G)$, $\epsilon \sim N(0,\sigma^2 R)$ and
\[
\operatorname{var}\begin{pmatrix} u \\ \epsilon \end{pmatrix} = \sigma^2 \begin{pmatrix} G & 0 \\ 0 & R \end{pmatrix}, \tag{2}
\]
where $G \in \mathbb{R}^{b\times b}$ and $R \in \mathbb{R}^{n\times n}$. Typically $G$ and $R$ are parameterized matrices with unknown parameters to be estimated. Precisely, suppose that $G = G(\gamma)$, $R = R(\phi)$, and denote $\kappa = (\gamma,\phi)$, $\theta = (\sigma^2,\kappa)$. Estimating our main concern, the variance parameters $\theta$, requires a conceptually simple nonlinear iterative procedure: one has to maximize a residual log-likelihood of the form [18, p.252]
\[
\ell_R = \mathrm{const} - \frac{1}{2}\Big\{ (n-\nu)\log\sigma^2 + \log\det(H) + \log\det(X^T H^{-1} X) + \frac{y^T P y}{\sigma^2} \Big\}, \tag{3}
\]
where $H = R(\phi) + Z G(\gamma) Z^T$, $\nu = \operatorname{rank}(X)$ and
\[
P = H^{-1} - H^{-1} X (X^T H^{-1} X)^{-1} X^T H^{-1}.
\]
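To make the quantities in (1)-(3) concrete, the following minimal NumPy sketch builds a small toy instance and evaluates the residual log-likelihood. The sizes, the choice $G = \gamma I$, $R = I$, and all variable names are illustrative assumptions and not part of the paper.

```python
# Minimal numerical sketch of (1)-(3); toy sizes and the choice
# G = gamma*I, R = I are assumptions made only for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, p, b = 30, 2, 4
X = rng.standard_normal((n, p))          # fixed-effect design matrix
Z = rng.standard_normal((n, b))          # random-effect design matrix

def H_of(gamma):
    """H = R + Z G Z^T with G = gamma*I and R = I (toy parameterization)."""
    return np.eye(n) + gamma * Z @ Z.T

def P_of(H):
    """P = H^{-1} - H^{-1} X (X^T H^{-1} X)^{-1} X^T H^{-1}."""
    Hi = np.linalg.inv(H)
    return Hi - Hi @ X @ np.linalg.solve(X.T @ Hi @ X, X.T @ Hi)

def resid_loglik(y, sigma2, gamma):
    """Residual log-likelihood (3), up to the additive constant."""
    H, nu = H_of(gamma), np.linalg.matrix_rank(X)
    P = P_of(H)
    _, logdetH = np.linalg.slogdet(H)
    _, logdetXHX = np.linalg.slogdet(X.T @ np.linalg.inv(H) @ X)
    return -0.5 * ((n - nu) * np.log(sigma2) + logdetH + logdetXHX
                   + y @ P @ y / sigma2)

# simulate y from model (1) and evaluate the criterion at the true parameters
tau, sigma2, gamma = np.array([1.0, -0.5]), 1.5, 0.8
u = np.sqrt(sigma2 * gamma) * rng.standard_normal(b)
eps = np.sqrt(sigma2) * rng.standard_normal(n)
y = X @ tau + Z @ u + eps
print(resid_loglik(y, sigma2, gamma))
```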

Here we suppose that $X$ is of full rank, say $\nu = p$. The first derivatives of $\ell_R$ are referred to as the scores for the variance parameters $\theta := (\sigma^2,\kappa)^T$ [18, p.252]:
\[
s(\sigma^2) = \frac{\partial \ell_R}{\partial \sigma^2} = -\frac{1}{2}\left\{ \frac{n-\nu}{\sigma^2} - \frac{y^T P y}{\sigma^4} \right\}, \tag{4}
\]

\[
s(\kappa_i) = \frac{\partial \ell_R}{\partial \kappa_i} = -\frac{1}{2}\left\{ \operatorname{tr}\Big(P\frac{\partial H}{\partial \kappa_i}\Big) - \frac{1}{\sigma^2}\, y^T P \frac{\partial H}{\partial \kappa_i} P y \right\}, \tag{5}
\]
where $\kappa = (\gamma^T,\phi^T)^T$. We shall denote

\[
S(\theta) = \big(s(\sigma^2), s(\kappa_1), \ldots, s(\kappa_m)\big)^T.
\]
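As a sanity check, the score formulas (4)-(5) can be verified against finite differences of $\ell_R$. The sketch below does this for the toy family $H(\kappa) = I + \kappa Z Z^T$ with a single variance parameter; this family and the toy sizes are assumptions made only for illustration.

```python
# Check the scores (4)-(5) against central finite differences of ell_R
# for the toy family H(kappa) = I + kappa * Z Z^T (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n, p, b = 30, 2, 4
X, Z = rng.standard_normal((n, p)), rng.standard_normal((n, b))
y = rng.standard_normal(n)              # any data vector works for this check
nu = np.linalg.matrix_rank(X)

def P_of(H):
    Hi = np.linalg.inv(H)
    return Hi - Hi @ X @ np.linalg.solve(X.T @ Hi @ X, X.T @ Hi)

def ell_R(s2, kappa):
    H = np.eye(n) + kappa * Z @ Z.T
    _, ldH = np.linalg.slogdet(H)
    _, ldX = np.linalg.slogdet(X.T @ np.linalg.inv(H) @ X)
    return -0.5 * ((n - nu) * np.log(s2) + ldH + ldX + y @ P_of(H) @ y / s2)

s2, kappa = 1.3, 0.7
H, Hdot = np.eye(n) + kappa * Z @ Z.T, Z @ Z.T
P = P_of(H)
s_sigma2 = -0.5 * ((n - nu) / s2 - y @ P @ y / s2**2)                 # (4)
s_kappa = -0.5 * (np.trace(P @ Hdot) - y @ P @ Hdot @ P @ y / s2)     # (5)

h = 1e-5
print(s_sigma2 - (ell_R(s2 + h, kappa) - ell_R(s2 - h, kappa)) / (2 * h))  # ~ 0
print(s_kappa - (ell_R(s2, kappa + h) - ell_R(s2, kappa - h)) / (2 * h))   # ~ 0
```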

Algorithm 1 Newton-Raphson method to solve S(θ) = 0.

1: Give an initial guess $\theta_0$
2: for $k = 0, 1, 2, \ldots$ until convergence do
3:   Solve $I_O(\theta_k)\,\delta_k = S(\theta_k)$
4:   $\theta_{k+1} = \theta_k + \delta_k$
5: end for

The negative Hessian of the log-likelihood function is referred to as the observed information matrix, which we denote by $I_O$:
\[
I_O = -\begin{pmatrix}
\frac{\partial^2 \ell_R}{\partial\sigma^2\partial\sigma^2} & \frac{\partial^2 \ell_R}{\partial\sigma^2\partial\kappa_1} & \cdots & \frac{\partial^2 \ell_R}{\partial\sigma^2\partial\kappa_m} \\
\frac{\partial^2 \ell_R}{\partial\kappa_1\partial\sigma^2} & \frac{\partial^2 \ell_R}{\partial\kappa_1\partial\kappa_1} & \cdots & \frac{\partial^2 \ell_R}{\partial\kappa_1\partial\kappa_m} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 \ell_R}{\partial\kappa_m\partial\sigma^2} & \frac{\partial^2 \ell_R}{\partial\kappa_m\partial\kappa_1} & \cdots & \frac{\partial^2 \ell_R}{\partial\kappa_m\partial\kappa_m}
\end{pmatrix}. \tag{6}
\]
Given an initial guess of the variance parameter $\theta_0$, a standard approach to maximize $\ell_R$, or to find the root of the score equation $S(\theta) = 0$, is the Newton-Raphson method (Algorithm 1), which requires elements of the observed information matrix. In particular,
\[
I_O(\kappa_i,\kappa_j) = \frac{\operatorname{tr}(P\ddot H_{ij}) - \operatorname{tr}(P\dot H_i P \dot H_j)}{2} + \frac{2y^T P\dot H_i P\dot H_j Py - y^T P \ddot H_{ij} P y}{2\sigma^2}, \tag{7}
\]
where $\dot H_i = \frac{\partial H}{\partial\kappa_i}$ and $\ddot H_{ij} = \frac{\partial^2 H}{\partial\kappa_i\partial\kappa_j}$. Each element $I_O(\kappa_i,\kappa_j)$ involves two computationally intensive trace terms, which prohibits the practical use of the (exact) Newton-Raphson method for large data sets. In practice, the Fisher information matrix, $I = E(I_O)$, is preferred. The elements of the Fisher information matrix have simpler forms than those of the observed information matrix, for example
\[
I(\kappa_i,\kappa_j) = \frac{\operatorname{tr}(P\dot H_i P\dot H_j)}{2}. \tag{8}
\]
The corresponding algorithm is referred to as the Fisher scoring algorithm [13]. Still, the elements $I(\kappa_i,\kappa_j)$ of the Fisher information matrix involve computationally expensive trace terms. When the covariance matrix $H$ is linearly dependent on the variance parameters, say $\ddot H_{ij} = 0$, researchers noticed that the average information
\[
\frac{I(\kappa_i,\kappa_j) + I_O(\kappa_i,\kappa_j)}{2} = \frac{y^T P\dot H_i P \dot H_j P y}{2\sigma^2} \tag{9}
\]
enjoys a simpler and more computationally friendly formula [6, 12]:
\[
I_A = \frac{y^T P\dot H_i P\dot H_j P y}{2\sigma^2}. \tag{10}
\]
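The identity (9) is easy to confirm numerically. The sketch below compares the observed (7), Fisher (8) and average (10) information for a single variance parameter in a toy model where $H$ is linear in the parameter (so $\ddot H = 0$); the sizes and the family $H(\kappa) = I + \kappa Z Z^T$ are illustrative assumptions.

```python
# Compare the observed (7), Fisher (8) and average (10) information for one
# variance parameter when H is linear in it, so ddot(H) = 0 and (9) is exact.
import numpy as np

rng = np.random.default_rng(1)
n, p, b = 30, 2, 4
X, Z = rng.standard_normal((n, p)), rng.standard_normal((n, b))
y = rng.standard_normal(n)
s2, kappa = 1.0, 0.5

H = np.eye(n) + kappa * Z @ Z.T     # linear in kappa
Hd = Z @ Z.T                        # dot(H)
Hdd = np.zeros((n, n))              # ddot(H) = 0 for a linear parameterization
Hi = np.linalg.inv(H)
P = Hi - Hi @ X @ np.linalg.solve(X.T @ Hi @ X, X.T @ Hi)

I_obs = (np.trace(P @ Hdd) - np.trace(P @ Hd @ P @ Hd)) / 2 \
    + (2 * y @ P @ Hd @ P @ Hd @ P @ y - y @ P @ Hdd @ P @ y) / (2 * s2)   # (7)
I_fis = np.trace(P @ Hd @ P @ Hd) / 2                                      # (8)
I_avg = y @ P @ Hd @ P @ Hd @ P @ y / (2 * s2)                             # (10)

print((I_obs + I_fis) / 2 - I_avg)   # ~ 0, i.e. identity (9)
```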

For more general covariance matrices, when $\ddot H_{ij} \neq 0$, we still have the following nice property by the classical matrix splitting [20, p.94]. The average information matrix can be split into two parts as follows [25]:
\[
\frac{I(\kappa_i,\kappa_j) + I_O(\kappa_i,\kappa_j)}{2}
= \underbrace{\frac{y^T P\dot H_i P\dot H_j P y}{2\sigma^2}}_{I_A(\kappa_i,\kappa_j)}
+ \underbrace{\frac{\operatorname{tr}(P\ddot H_{ij}) - y^T P\ddot H_{ij}Py/\sigma^2}{4}}_{I_Z(\kappa_i,\kappa_j)}. \tag{11}
\]

Such a splitting enjoys a nice property which is presented as our main result.

3. Main result.

Theorem 3.1. Let $I_O$ and $I$ be the observed information matrix and the Fisher information matrix for the residual log-likelihood of a linear mixed model, respectively. Then the average of the observed information matrix and the Fisher information matrix can be split as $\frac{I_O + I}{2} = I_A + I_Z$, such that the expectation of $I_A$ is the Fisher information matrix and $E(I_Z) = 0$.

Such a splitting aims to remove computationally expensive yet negligible terms so that a Newton-like method is applicable to large data sets which involve thousands of fixed and random effects. It keeps the essential information in the observed information matrix. In this sense, $I_A$ is a good data-dependent (on $y$) approximation to the data-independent Fisher information. A quasi-Newton iterative procedure is obtained by replacing $I_O$ with $I_A$ in Algorithm 1.
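The sketch below spells out this quasi-Newton (average-information) iteration for a toy one-way random-effects model with $H(\gamma) = I + \gamma Z Z^T$. It is an illustration under stated assumptions, not the authors' implementation: the data, starting values and the crude positivity clipping are illustrative, and no step-size control or convergence test is included.

```python
# Quasi-Newton iteration of Algorithm 1 with I_O replaced by I_A, for the toy
# model H(gamma) = I + gamma * Z Z^T.  Illustrative sketch only: no step-size
# control, and a crude clip keeps the parameters positive.
import numpy as np

rng = np.random.default_rng(2)
n, groups, per = 200, 10, 20
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
Z = np.kron(np.eye(groups), np.ones((per, 1)))     # group-indicator design
sigma2_true, gamma_true = 1.0, 1.0
u = np.sqrt(sigma2_true * gamma_true) * rng.standard_normal(groups)
y = X @ np.array([1.0, -0.5]) + Z @ u + np.sqrt(sigma2_true) * rng.standard_normal(n)
nu = np.linalg.matrix_rank(X)

s2, g = 0.5, 0.5                                   # initial guess theta_0
for k in range(10):
    H = np.eye(n) + g * Z @ Z.T
    Hd = Z @ Z.T
    Hi = np.linalg.inv(H)
    P = Hi - Hi @ X @ np.linalg.solve(X.T @ Hi @ X, X.T @ Hi)
    Py, PHdPy = P @ y, P @ Hd @ P @ y
    # score vector, formulas (4)-(5)
    S = np.array([-0.5 * ((n - nu) / s2 - y @ Py / s2**2),
                  -0.5 * (np.trace(P @ Hd) - y @ PHdPy / s2)])
    # average-information matrix I_A, cf. (25)-(27) below
    IA = np.array([[y @ Py / (2 * s2**3),    y @ PHdPy / (2 * s2**2)],
                   [y @ PHdPy / (2 * s2**2), Py @ Hd @ PHdPy / (2 * s2)]])
    delta = np.linalg.solve(IA, S)
    s2, g = max(s2 + delta[0], 1e-4), max(g + delta[1], 1e-4)
    print(k, s2, g)      # iterates approach the REML estimates of (sigma^2, gamma)
```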

The proof of the main result is built up in the following subsections.

3.1. Basic Lemma.

Lemma 3.2. Let $y \sim N(X\tau, \sigma^2 H)$ be a random vector, where $H$ is a symmetric positive definite matrix. Then
\[
P = H^{-1} - H^{-1}X(X^T H^{-1} X)^{-1} X^T H^{-1}
\]
is a weighted projection matrix such that
1. $PX = 0$;
2. $PHP = P$;
3. $\operatorname{tr}(PH) = n - \nu$, where $\operatorname{rank}(X) = \nu$;
4. $P\,E(yy^T) = \sigma^2 PH$.

Proof. The first two items can be verified by direct computation. Since $H$ is a positive definite matrix, there exists $H^{1/2}$ such that
\[
\operatorname{tr}(PH) = \operatorname{tr}(H^{1/2}PH^{1/2}) = \operatorname{tr}\big(I - \hat X(\hat X^T\hat X)^{-1}\hat X^T\big) = n - \operatorname{rank}(\hat X) = n - \nu,
\]
where $\hat X = H^{-1/2}X$. The fourth item follows because
\[
P\,E(yy^T) = P\big(\operatorname{var}(y) + X\tau(X\tau)^T\big) = \sigma^2 PH + PX\tau(X\tau)^T = \sigma^2 PH,
\]
since $PX = 0$.
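For readers who want a quick numerical confirmation of Lemma 3.2, the following sketch checks items 1-3 for a random positive definite $H$ and a random full-rank $X$ (toy sizes, illustrative only).

```python
# Numerical check of Lemma 3.2, items 1-3, on random toy matrices.
import numpy as np

rng = np.random.default_rng(3)
n, nu = 20, 3
X = rng.standard_normal((n, nu))               # full column rank (almost surely)
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)                    # symmetric positive definite
Hi = np.linalg.inv(H)
P = Hi - Hi @ X @ np.linalg.solve(X.T @ Hi @ X, X.T @ Hi)

print(np.max(np.abs(P @ X)))               # item 1: P X = 0
print(np.max(np.abs(P @ H @ P - P)))       # item 2: P H P = P
print(np.trace(P @ H) - (n - nu))          # item 3: tr(P H) = n - nu
```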

Lemma 3.3. Let $H$ be a matrix parameterized by $\kappa$, and let $X$ be a constant matrix. Then the partial derivative of the projection matrix
\[
P = H^{-1} - H^{-1}X(X^T H^{-1} X)^{-1} X^T H^{-1}
\]
with respect to $\kappa_i$ is
\[
\dot P_i = -P \dot H_i P,
\]
where $\dot P_i = \frac{\partial P}{\partial \kappa_i}$ and $\dot H_i = \frac{\partial H}{\partial \kappa_i}$.

Proof. See Lemma B.3 in [8]. Using the derivative of the inverse of a matrix,

\[
\frac{\partial A^{-1}}{\partial \kappa_i} = -A^{-1}\frac{\partial A}{\partial \kappa_i}A^{-1},
\]
we have

\begin{align*}
\dot P_i &= \frac{\partial}{\partial\kappa_i}\big(H^{-1} - H^{-1}X(X^T H^{-1}X)^{-1}X^T H^{-1}\big)\\
&= -H^{-1}\dot H_i H^{-1} + H^{-1}\dot H_i H^{-1}X(X^T H^{-1}X)^{-1}X^T H^{-1}\\
&\quad - H^{-1}X(X^T H^{-1}X)^{-1}X^T H^{-1}\dot H_i H^{-1}X(X^T H^{-1}X)^{-1}X^T H^{-1}\\
&\quad + H^{-1}X(X^T H^{-1}X)^{-1}X^T H^{-1}\dot H_i H^{-1}\\
&= -H^{-1}\dot H_i P + H^{-1}X(X^T H^{-1}X)^{-1}X^T H^{-1}\dot H_i P\\
&= -P\dot H_i P.
\end{align*}
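A finite-difference check of Lemma 3.3 is straightforward; the sketch below does it for the illustrative family $H(\kappa) = I + \kappa Z Z^T$ (an assumption made only for this example).

```python
# Finite-difference check of dP/dkappa = -P * dot(H) * P (Lemma 3.3)
# for the toy family H(kappa) = I + kappa * Z Z^T.
import numpy as np

rng = np.random.default_rng(4)
n, p, b = 20, 3, 4
X, Z = rng.standard_normal((n, p)), rng.standard_normal((n, b))

def P_of(kappa):
    H = np.eye(n) + kappa * Z @ Z.T
    Hi = np.linalg.inv(H)
    return Hi - Hi @ X @ np.linalg.solve(X.T @ Hi @ X, X.T @ Hi)

kappa, h = 0.6, 1e-6
P, Hdot = P_of(kappa), Z @ Z.T
Pdot_fd = (P_of(kappa + h) - P_of(kappa - h)) / (2 * h)
print(np.max(np.abs(Pdot_fd + P @ Hdot @ P)))   # ~ 0
```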

3.2. Formulae of the observed information matrix.

Lemma 3.4. The elements of the observed information matrix for the residual log-likelihood (3) are given by

\begin{align}
I_O(\sigma^2,\sigma^2) &= \frac{y^T P y}{\sigma^6} - \frac{n-\nu}{2\sigma^4}, \tag{12}\\
I_O(\sigma^2,\kappa_i) &= \frac{1}{2\sigma^4}\, y^T P\dot H_i P y, \tag{13}\\
I_O(\kappa_i,\kappa_j) &= \frac{1}{2}\Big\{\operatorname{tr}(P\ddot H_{ij}) - \operatorname{tr}(P\dot H_i P\dot H_j)\Big\}
+ \frac{1}{2\sigma^2}\Big\{2y^T P\dot H_i P\dot H_j P y - y^T P\ddot H_{ij} P y\Big\}, \tag{14}
\end{align}

where $\dot H_i = \frac{\partial H}{\partial \kappa_i}$ and $\ddot H_{ij} = \frac{\partial^2 H}{\partial\kappa_i\partial\kappa_j}$.

Proof. See Result 4 in [8]. The result in (12) is standard and follows directly from the definition. The result in (13) follows from Lemma 3.3 applied to the score in (4). The first term in (14) follows because

\[
\frac{\partial}{\partial\kappa_j}\operatorname{tr}(P\dot H_i) = \operatorname{tr}(P\ddot H_{ij}) + \operatorname{tr}(\dot P_j\dot H_i) = \operatorname{tr}(P\ddot H_{ij}) - \operatorname{tr}(P\dot H_j P\dot H_i) \qquad (\dot P_j = -P\dot H_j P).
\]
The second term in (14) follows because, using the result in Lemma 3.3, we have

\[
-\frac{\partial(P\dot H_i P)}{\partial\kappa_j} = P\dot H_j P\dot H_i P - P\ddot H_{ij}P + P\dot H_i P\dot H_j P. \tag{15}
\]

Further, note that $\dot H_i$, $\dot H_j$ and $P$ are symmetric, so that
\[
y^T P\dot H_i P\dot H_j P y = y^T P\dot H_j P\dot H_i P y,
\]
which yields the second term in (14).

3.3. Formulae of the Fisher information matrix. The Fisher information matrix, $I$, is the expected value of the observed information matrix, $I = E(I_O)$. The Fisher information matrix enjoys a simpler formula than the observed information matrix and captures the essential information provided by the data; thus it is a natural approximation to the negative Jacobian of the score (the negative Hessian of the log-likelihood).

Lemma 3.5. The elements of the Fisher information matrix for the residual log-likelihood function in (3) are given by
\begin{align}
I(\sigma^2,\sigma^2) &= E\big(I_O(\sigma^2,\sigma^2)\big) = \frac{\operatorname{tr}(PH)}{2\sigma^4} = \frac{n-\nu}{2\sigma^4}, \tag{16}\\
I(\sigma^2,\kappa_i) &= E\big(I_O(\sigma^2,\kappa_i)\big) = \frac{1}{2\sigma^2}\operatorname{tr}(P\dot H_i), \tag{17}\\
I(\kappa_i,\kappa_j) &= E\big(I_O(\kappa_i,\kappa_j)\big) = \frac{1}{2}\operatorname{tr}(P\dot H_i P\dot H_j). \tag{18}
\end{align}

Proof. The formulas can be found in [18]. Here we supply an alternative proof. First note that $PX = 0$ and, according to Lemma 3.2,
\[
P\,E(yy^T) = P\big(\sigma^2 H + X\tau(X\tau)^T\big) = \sigma^2 PH. \tag{19}
\]
Then
\[
E(y^T P y) = E\big(\operatorname{tr}(Pyy^T)\big) = \operatorname{tr}\big(P\,E(yy^T)\big) = \sigma^2\operatorname{tr}(PH) = (n-\nu)\sigma^2. \tag{20}
\]
Therefore
\[
E\big(I_O(\sigma^2,\sigma^2)\big) = \frac{E(y^T P y)}{\sigma^6} - \frac{n-\nu}{2\sigma^4} = \frac{n-\nu}{2\sigma^4}. \tag{21}
\]
Second, we notice that $PHP = P$. Applying the procedure in (20), we have
\begin{align}
E(y^T P\dot H_i P y) &= \operatorname{tr}\big(P\dot H_i P\,E(yy^T)\big) = \sigma^2\operatorname{tr}(P\dot H_i P H) = \sigma^2\operatorname{tr}(PHP\dot H_i) = \sigma^2\operatorname{tr}(P\dot H_i), \tag{22}\\
E(y^T P\dot H_i P\dot H_j P y) &= \sigma^2\operatorname{tr}(P\dot H_i P\dot H_j P H) = \sigma^2\operatorname{tr}(PHP\dot H_i P\dot H_j) = \sigma^2\operatorname{tr}(P\dot H_i P\dot H_j), \tag{23}\\
E(y^T P\ddot H_{ij} P y) &= \sigma^2\operatorname{tr}(P\ddot H_{ij} P H) = \sigma^2\operatorname{tr}(P\ddot H_{ij}). \tag{24}
\end{align}
Substituting (22) into (13), we obtain (17). Substituting (23) and (24) into (14), we obtain (18).
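The expectations (20) and (22) are easy to confirm by simulation. The Monte Carlo sketch below draws $y \sim N(X\tau,\sigma^2 H)$ for a toy parameterization and compares sample averages of the quadratic forms with the trace formulas; the toy sizes and the number of replications are illustrative assumptions.

```python
# Monte Carlo check of (20) and (22): E(y'Py) = (n - nu) sigma^2 and
# E(y'P dot(H) P y) = sigma^2 tr(P dot(H)) for y ~ N(X tau, sigma^2 H).
import numpy as np

rng = np.random.default_rng(5)
n, p, b, s2 = 25, 2, 4, 1.7
X, Z = rng.standard_normal((n, p)), rng.standard_normal((n, b))
tau = rng.standard_normal(p)
H = np.eye(n) + 0.9 * Z @ Z.T
Hd = Z @ Z.T
Hi = np.linalg.inv(H)
P = Hi - Hi @ X @ np.linalg.solve(X.T @ Hi @ X, X.T @ Hi)
nu = np.linalg.matrix_rank(X)
L = np.linalg.cholesky(s2 * H)

reps, q1, q2 = 20000, 0.0, 0.0
for _ in range(reps):
    y = X @ tau + L @ rng.standard_normal(n)
    q1 += y @ P @ y
    q2 += y @ P @ Hd @ P @ y
print(q1 / reps - (n - nu) * s2)           # small Monte Carlo error, cf. (20)
print(q2 / reps - s2 * np.trace(P @ Hd))   # small Monte Carlo error, cf. (22)
```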

Using the Fisher information matrix as an approximation to the negative Jacobian of the score results in the widely-used Fisher-scoring algorithm [13].

3.4. Proof of the main result.

Proof. Let
\begin{align}
I_A(\sigma^2,\sigma^2) &= \frac{1}{2\sigma^6}\, y^T P y; \tag{25}\\
I_A(\sigma^2,\kappa_i) &= \frac{1}{2\sigma^4}\, y^T P\dot H_i P y; \tag{26}\\
I_A(\kappa_i,\kappa_j) &= \frac{1}{2\sigma^2}\, y^T P\dot H_i P\dot H_j P y; \tag{27}
\end{align}
then we have
\begin{align}
I_Z(\sigma^2,\sigma^2) &= 0, \tag{28}\\
I_Z(\sigma^2,\kappa_i) &= \frac{\operatorname{tr}(P\dot H_i)}{4\sigma^2} - \frac{y^T P\dot H_i P y}{4\sigma^4}, \tag{29}\\
I_Z(\kappa_i,\kappa_j) &= \frac{\operatorname{tr}(P\ddot H_{ij}) - y^T P\ddot H_{ij} P y/\sigma^2}{4}. \tag{30}
\end{align}

Applying the result in (20), we have
\[
E\big(I_A(\sigma^2,\sigma^2)\big) = \frac{n-\nu}{2\sigma^4} = I(\sigma^2,\sigma^2). \tag{31}
\]
Applying the result in (22), we have
\[
E\big(I_A(\sigma^2,\kappa_i)\big) = \frac{\operatorname{tr}(P\dot H_i)}{2\sigma^2} \quad\text{and}\quad E\big(I_Z(\sigma^2,\kappa_i)\big) = 0. \tag{32}
\]
Applying the results in (23) and (24), we have
\[
E\big(I_A(\kappa_i,\kappa_j)\big) = \frac{\operatorname{tr}(P\dot H_i P\dot H_j)}{2} = I(\kappa_i,\kappa_j) \tag{33}
\]
and $E\big(I_Z(\kappa_i,\kappa_j)\big) = 0$. This completes the proof.
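To illustrate the theorem when $H$ is genuinely nonlinear in the variance parameter (so that $\ddot H \neq 0$), the Monte Carlo sketch below uses the illustrative family $H(\kappa) = I + e^{\kappa} Z Z^T$ and shows that the sample averages of the $I_Z$ elements in (29)-(30) are close to zero; all sizes and parameter values are assumptions made only for this example.

```python
# Monte Carlo illustration of Theorem 3.1 for a nonlinear parameterization
# H(kappa) = I + exp(kappa) * Z Z^T (so ddot(H) != 0): the elements of I_Z in
# (29)-(30) average to approximately zero over y ~ N(X tau, sigma^2 H).
import numpy as np

rng = np.random.default_rng(6)
n, p, b, s2, kappa = 25, 2, 4, 1.2, 0.3
X, Z = rng.standard_normal((n, p)), rng.standard_normal((n, b))
tau = rng.standard_normal(p)
ZZt = Z @ Z.T
H = np.eye(n) + np.exp(kappa) * ZZt
Hd = np.exp(kappa) * ZZt        # dot(H)
Hdd = np.exp(kappa) * ZZt       # ddot(H), nonzero here
Hi = np.linalg.inv(H)
P = Hi - Hi @ X @ np.linalg.solve(X.T @ Hi @ X, X.T @ Hi)
L = np.linalg.cholesky(s2 * H)
tPHd, tPHdd = np.trace(P @ Hd), np.trace(P @ Hdd)

reps, iz_sk, iz_kk = 20000, 0.0, 0.0
for _ in range(reps):
    y = X @ tau + L @ rng.standard_normal(n)
    iz_sk += tPHd / (4 * s2) - y @ P @ Hd @ P @ y / (4 * s2**2)      # (29)
    iz_kk += (tPHdd - y @ P @ Hdd @ P @ y / s2) / 4                  # (30)
print(iz_sk / reps, iz_kk / reps)   # both close to 0, up to Monte Carlo error
```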

4. Discussion. The average information splitting is one of the key techniques for reducing computation in maximum likelihood methods [22]; other techniques, such as sparse inversion (see the state of the art of sparse inversion algorithms in [27]), should also be implemented to evaluate the score of the log-likelihood. More details can be found in the review report [26]. The Fisher information matrix is preferred not only for finding the variance of an estimator and in Bayesian inference [15], but also for analyzing the asymptotic behavior of maximum likelihood estimates [16, 23, 24]. Besides traditional application fields like the genetical theory of natural selection and breeding [4], many other fields, including theoretical physics and information geometry, also use the Fisher information matrix [9, 10, 17, 19]. Therefore the average information matrix splitting technique also shows promise in these directions.

Acknowledgments. We would like to thank the anonymous reviewers for some constructive feedback.

REFERENCES

[1] Z. Chen, S. Zhu, Q. Niu and X. Lu, Censorious young: Knowledge discovery from high-throughput movie rating data with lme4, in 2019 IEEE 4th International Conference on Big Data Analytics (ICBDA), 2019, 32–36, URL https://ieeexplore.ieee.org/document/8713193.
[2] Z. Chen, S. Zhu, Q. Niu and T. Zuo, Knowledge discovery and recommendation with linear mixed model, IEEE Access, 8 (2020), 38304–38317, URL https://ieeexplore.ieee.org/document/8993770.
[3] B. Efron and D. V. Hinkley, Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information, Biometrika, 65 (1978), 457–483, URL https://doi.org/10.1093/biomet/65.3.457.
[4] R. Fisher, The Genetical Theory of Natural Selection, Clarendon Press, Oxford, 1930.
[5] B. Gao, G. Zhan, H. Wang, Y. Wang and S. Zhu, Learning with linear mixed model for group recommendation systems, in Proceedings of the 2019 11th International Conference on Machine Learning and Computing, ICMLC '19, Association for Computing Machinery, New York, NY, USA, 2019, 81–85, URL https://doi.org/10.1145/3318299.3318342.
[6] A. R. Gilmour, R. Thompson and B. R. Cullis, Average information REML: An efficient algorithm for variance parameter estimation in linear mixed models, Biometrics, 51 (1995), 1440–1450, URL http://www.jstor.org/stable/2533274.
[7] G. Givens and J. Hoeting, Computational Statistics, 2nd edition, Wiley Series in Computational Statistics, John Wiley & Sons, Inc., New Jersey, 2005.
[8] F. N. Gumedze and T. T. Dunne, Parameter estimation and inference in the linear mixed model, Linear Algebra Appl., 435 (2011), 1920–1944, URL https://www.sciencedirect.com/science/article/pii/S002437951100320X.

[9] A. Heavens, Generalised Fisher matrices, Entropy, 18 (2016), 236, URL https://www.mdpi.com/1099-4300/18/6/236/htm.
[10] W. Janke, D. Johnston and R. Kenna, Information geometry and phase transitions, Physica A: Statistical Mechanics and its Applications, 336 (2004), 181–186, URL http://www.sciencedirect.com/science/article/pii/S0378437104000469.
[11] R. I. Jennrich and P. F. Sampson, Newton-Raphson and related algorithms for maximum likelihood variance component estimation, Technometrics, 18 (1976), 11–17, URL https://www.tandfonline.com/doi/abs/10.1080/00401706.1976.10489395.
[12] D. Johnson and R. Thompson, Restricted maximum likelihood estimation of variance components for univariate animal models using sparse matrix techniques and average information, Journal of Dairy Science, 78 (1995), 449–456, URL http://www.sciencedirect.com/science/article/pii/S0022030295766541.
[13] N. T. Longford, A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested random effects, Biometrika, 74 (1987), 817–827, URL https://doi.org/10.1093/biomet/74.4.817.
[14] K. Meyer, An average information restricted maximum likelihood algorithm for estimating reduced rank genetic covariance matrices or covariance functions for animal models with equal design matrices, Genetics Selection Evolution, 29 (1997), 97, URL https://gsejournal.biomedcentral.com/track/pdf/10.1186/1297-9686-29-2-97.
[15] J. I. Myung and D. J. Navarro, Information Matrix, chapter 1, American Cancer Society, 2005, URL https://onlinelibrary.wiley.com/doi/abs/10.1002/0470013192.bsa302.
[16] J. W. Pratt, F. Y. Edgeworth and R. A. Fisher on the efficiency of maximum likelihood estimation, The Annals of Statistics, 4 (1976), 501–514.
[17] M. Prokopenko, J. T. Lizier, O. Obst and X. R. Wang, Relating Fisher information to order parameters, Phys. Rev. E, 84 (2011), 041116, URL https://link.aps.org/doi/10.1103/PhysRevE.84.041116.
[18] S. R. Searle, G. Casella and C. E. McCulloch, Variance Components, Wiley Series in Probability and Statistics, Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, 2006, Reprint of the 1992 original, Wiley-Interscience Paperback Series.
[19] M. Vallisneri, A User Manual for the Fisher Information Matrix, California Institute of Technology, Jet Propulsion Laboratory, 2007.
[20] R. S. Varga, Matrix Iterative Analysis, vol. 27 of Springer Series in Computational Mathematics, expanded edition, Springer-Verlag, Berlin, 2000.
[21] Y. Wang, T. Wu, F. Ma and S. Zhu, Personalized recommender systems with multiple source data, in Computing Conference 2020 (ed. xxx), vol. x of Advances in Intelligent Systems and Computing, Springer International Publishing, Cham, 2020, 1–17.
[22] S. Welham, S. Zhu and A. J. Wathen, Big data, fast models: faster calculation of models from high-throughput biological data sets, Knowledge Transfer Report IP12-0009, Smith Institute and The University of Oxford, Oxford, 2013.
[23] R. Zamir, A necessary and sufficient condition for equality in the matrix Fisher information inequality, Technical report, Tel Aviv University, 1997.
[24] R. Zamir, A proof of the Fisher information inequality via a data processing argument, IEEE Transactions on Information Theory, 44 (1998), 1246–1250, URL https://ieeexplore.ieee.org/document/669301.
[25] S. Zhu, T. Gu and X. Liu, Information matrix splitting, arXiv preprint arXiv:1605.07646.
[26] S. Zhu and A. J. Wathen, Essential formulae for restricted maximum likelihood and its derivatives associated with the linear mixed models, arXiv preprint arXiv:1805.05188.
[27] S. Zhu and A. J. Wathen, Sparse inversion for derivative of log determinant, arXiv preprint arXiv:1911.00685.
[28] T. Zuo, S. Zhu and J. Lu, A hybrid recommender system combining singular value decomposition and linear mixed model, in Computing Conference 2020 (ed. xxx), vol. x of Advances in Intelligent Systems and Computing, Springer International Publishing, Cham, 2020, xxx–xxx.

Received xxxx 20xx; revised xxxx 20xx.

E-mail address: [email protected]