AIMS: AVERAGE INFORMATION MATRIX SPLITTING

Shengxin Zhu*
Laboratory for Intelligent Computing and Financial Technology
Department of Mathematics, Xi’an Jiaotong-Liverpool University
Suzhou, 215123, P.R. China

Tongxiang Gu and Xingping Liu
Laboratory of Computational Physics
Institute of Applied Physics and Computational Mathematics
Beijing 100088, P.R. China

arXiv:1605.07646v4 [stat.ME] 8 May 2020

Abstract. For linear mixed models with covariance matrices that are not linearly dependent on the variance component parameters, we prove that the average of the observed information and the Fisher information can be split into two parts. The essential part enjoys a simple and computationally friendly formula, while the other part, which involves a large amount of computation, is a random zero matrix and is therefore negligible.

2010 Mathematics Subject Classification. Primary: 65J12, 62P10; Secondary: 65C60, 65F99.
Key words and phrases. Observed information matrix, Fisher information matrix, Newton method, linear mixed model, variance parameter estimation, average information.
This research is supported by the Foundation of LCP (6142A05180501), the Jiangsu Science and Technology Basic Research Program (BK20171237), the Key Program Special Fund of XJTLU (KSF-E-21, KSF-P-02), the Research Development Fund of XJTLU (RDF-2017-02-23), and partially supported by NSFC (No. 11771002, 11571047, 11671049, 11671051, 6162003, and 11871339).
*Corresponding author: Shengxin Zhu.

1. Introduction. Many statistical methods require an estimate of unknown (co)variance parameters. The estimate is usually obtained by maximizing a log-likelihood function. In principle, one requires the observed information matrix (the negative Hessian matrix of the log-likelihood) to obtain a maximum likelihood estimator via the Newton-Raphson method [11]. The expected value of the observed information matrix is usually referred to as the Fisher information matrix, or simply the information matrix. It keeps the essential information about the unknown parameters and enjoys a simpler formula; it is therefore widely used in many applications [3]. The resulting algorithm is called the Fisher scoring algorithm, which is widely used in informetrics [13] and is now a standard procedure in computational statistics [7, p.30].

The Fisher scoring algorithm succeeds in simplifying the approximation of the Hessian matrix of the log-likelihood. Still, evaluating the elements of the Fisher information matrix remains one of the bottlenecks in a log-likelihood maximization procedure, which prohibits the use of the Fisher scoring algorithm for large data sets. In particular, high-throughput technologies in biological science and engineering mean that the size of data sets and of the corresponding statistical models has suddenly increased by several orders of magnitude. Further simplification of the computational procedure is much needed for applications to large-scale statistical models, such as genome-wide association studies, which involve many thousands of parameters to be estimated [22].

The aim of this short note is to provide a concise mathematical result: the average of the observed information and the Fisher information can be split into two parts; the essential part enjoys a simple formula and is easy to compute, while the other part, which involves a large amount of computation, is a random zero matrix and is therefore negligible. Such a splitting and approximation provides a significant reduction in computation.
We should mention that the average information idea was proposed in [12] for (co)variance matrices which depend linearly on the variance parameters. It resulted in an efficient breeding algorithm in [6] and was followed by [14]. However, these previous results assume that the variance-covariance matrix is linearly dependent on the underlying variance parameters. Here we prove that similar results can still be obtained even when the variance-covariance matrix is not linearly dependent on the underlying variance parameters.

2. Preliminary. Consider the following widely used linear mixed model [1, 2, 5, 28, 21]

y = Xτ + Zu + ε,   (1)

where y ∈ R^{n×1} is the vector of observations, τ ∈ R^{p×1} is the vector of fixed effects, X ∈ R^{n×p} is the design matrix which corresponds to the fixed effects, u ∈ R^{b×1} is the vector of unobserved random effects, Z ∈ R^{n×b} is the design matrix which corresponds to the random effects, and ε ∈ R^{n×1} is the vector of residual errors. The random effects u and the residual errors ε follow multivariate normal distributions such that E(u) = 0, E(ε) = 0, u ∼ N(0, σ²G), ε ∼ N(0, σ²R), and

\operatorname{var}\begin{pmatrix} u \\ \epsilon \end{pmatrix} = \sigma^2 \begin{pmatrix} G & 0 \\ 0 & R \end{pmatrix},   (2)

where G ∈ R^{b×b} and R ∈ R^{n×n}. Typically G and R are parameterized matrices with unknown parameters to be estimated. Precisely, suppose that G = G(γ), R = R(φ), and denote κ = (γ, φ) and θ = (σ², κ). Estimating the variance parameters θ, our main concern, requires a conceptually simple nonlinear iterative procedure: one has to maximize a residual log-likelihood function of the form [18, p.252]

\ell_R = \mathrm{const} - \frac{1}{2}\Big\{ (n-\nu)\log\sigma^2 + \log\det(H) + \log\det(X^T H^{-1} X) + \frac{y^T P y}{\sigma^2} \Big\},   (3)

where H = R(φ) + ZG(γ)Zᵀ, ν = rank(X), and P = H⁻¹ − H⁻¹X(XᵀH⁻¹X)⁻¹XᵀH⁻¹. Here we suppose that X is of full rank, say ν = p. The first derivatives of ℓ_R are referred to as the scores for the variance parameters θ := (σ², κ)ᵀ [18, p.252]:

s(\sigma^2) = \frac{\partial \ell_R}{\partial \sigma^2} = -\frac{1}{2}\Big\{ \frac{n-\nu}{\sigma^2} - \frac{y^T P y}{\sigma^4} \Big\},   (4)

s(\kappa_i) = \frac{\partial \ell_R}{\partial \kappa_i} = -\frac{1}{2}\Big\{ \operatorname{tr}\Big(P\frac{\partial H}{\partial \kappa_i}\Big) - \frac{1}{\sigma^2}\, y^T P \frac{\partial H}{\partial \kappa_i} P y \Big\},   (5)

where κ = (γᵀ, φᵀ)ᵀ. We shall denote S(θ) = (s(σ²), s(κ_1), ..., s(κ_m))ᵀ.

Algorithm 1 Newton-Raphson method to solve S(θ) = 0.
1: Give an initial guess of θ_0
2: for k = 0, 1, 2, ... until convergence do
3:   Solve I_O(θ_k) δ_k = S(θ_k)
4:   θ_{k+1} = θ_k + δ_k
5: end for

The negative Hessian of the log-likelihood function is referred to as the observed information matrix. We will denote this matrix by I_O:

I_O = -\begin{pmatrix}
\frac{\partial^2 \ell_R}{\partial\sigma^2\partial\sigma^2} & \frac{\partial^2 \ell_R}{\partial\sigma^2\partial\kappa_1} & \cdots & \frac{\partial^2 \ell_R}{\partial\sigma^2\partial\kappa_m} \\
\frac{\partial^2 \ell_R}{\partial\kappa_1\partial\sigma^2} & \frac{\partial^2 \ell_R}{\partial\kappa_1\partial\kappa_1} & \cdots & \frac{\partial^2 \ell_R}{\partial\kappa_1\partial\kappa_m} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 \ell_R}{\partial\kappa_m\partial\sigma^2} & \frac{\partial^2 \ell_R}{\partial\kappa_m\partial\kappa_1} & \cdots & \frac{\partial^2 \ell_R}{\partial\kappa_m\partial\kappa_m}
\end{pmatrix}.   (6)

Given an initial guess θ_0 of the variance parameters, a standard approach to maximize ℓ_R, or to find a root of the score equation S(θ) = 0, is the Newton-Raphson method (Algorithm 1), which requires the elements of the observed information matrix. In particular,

I_O(\kappa_i,\kappa_j) = \frac{\operatorname{tr}(P\ddot H_{ij}) - \operatorname{tr}(P\dot H_i P \dot H_j)}{2} + \frac{2\, y^T P \dot H_i P \dot H_j P y - y^T P \ddot H_{ij} P y}{2\sigma^2},   (7)

where Ḣ_i = ∂H/∂κ_i and Ḧ_ij = ∂²H/(∂κ_i∂κ_j). Each element I_O(κ_i, κ_j) involves two computationally intensive trace terms, which prohibits the practical use of the (exact) Newton-Raphson method for large data sets.

In practice, the Fisher information matrix, I = E(I_O), is preferred. The elements of the Fisher information matrix have simpler forms than those of the observed information matrix, for example

I(\kappa_i,\kappa_j) = \frac{\operatorname{tr}(P\dot H_i P \dot H_j)}{2}.   (8)

The corresponding algorithm is referred to as the Fisher scoring algorithm [13]. Still, the element I(κ_i, κ_j) of the Fisher information matrix involves computationally expensive trace terms.
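To make the cost of these formulas concrete, the following sketch evaluates the score (5), the observed information element (7), and the Fisher information element (8) with dense NumPy linear algebra on a small simulated model. It is only an illustration under stated assumptions, not code from the paper: the sizes, the single-parameter covariance G(γ) = exp(γ)I_b (chosen so that Ḧ ≠ 0, matching the nonlinear setting of this note), and all variable names are hypothetical.

```python
# Minimal sketch (hypothetical toy model): score and information elements for one
# variance parameter gamma, using dense O(n^3) linear algebra throughout.
import numpy as np

rng = np.random.default_rng(0)
n, p, b = 50, 2, 5                       # observations, fixed effects, random effects
X = rng.standard_normal((n, p))          # fixed-effects design
Z = rng.standard_normal((n, b))          # random-effects design
y = rng.standard_normal(n)
sigma2, gamma = 1.0, 0.3

# Covariance model that is NOT linear in gamma: G(gamma) = exp(gamma) * I_b.
R = np.eye(n)
H = R + np.exp(gamma) * (Z @ Z.T)        # H = R(phi) + Z G(gamma) Z^T
H_dot = np.exp(gamma) * (Z @ Z.T)        # dH/dgamma
H_ddot = np.exp(gamma) * (Z @ Z.T)       # d^2H/dgamma^2, nonzero here

Hinv = np.linalg.inv(H)
P = Hinv - Hinv @ X @ np.linalg.solve(X.T @ Hinv @ X, X.T @ Hinv)
Py = P @ y

# Score (5): s(gamma) = -1/2 [ tr(P H_dot) - y^T P H_dot P y / sigma^2 ]
score = -0.5 * (np.trace(P @ H_dot) - Py @ H_dot @ Py / sigma2)

# Observed information (7) and Fisher information (8); the trace terms are the
# expensive part that the average-information splitting is designed to avoid.
PHd = P @ H_dot
I_obs = 0.5 * (np.trace(P @ H_ddot) - np.trace(PHd @ PHd)) \
        + (2 * Py @ H_dot @ P @ H_dot @ Py - Py @ H_ddot @ Py) / (2 * sigma2)
I_fisher = 0.5 * np.trace(PHd @ PHd)
print(score, I_obs, I_fisher)
```

Even in this tiny example the trace terms require full matrix products; for large n and thousands of parameters they dominate the cost of every Newton-Raphson or Fisher scoring step.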
When the variance-covariance matrix H depends linearly on the variance parameters, so that Ḧ_ij = 0, researchers noticed that the average information matrix

\frac{I(\kappa_i,\kappa_j) + I_O(\kappa_i,\kappa_j)}{2} = \frac{y^T P \dot H_i P \dot H_j P y}{2\sigma^2}   (9)

enjoys a simpler and more computationally friendly formula [6, 12]:

I_A = \frac{y^T P \dot H_i P \dot H_j P y}{2\sigma^2}.   (10)

For more general covariance matrices, when Ḧ_ij ≠ 0, we still have the following nice property by the classical matrix splitting [20, p.94]. The average information matrix can be split into two parts as follows [25]:

\frac{I(\kappa_i,\kappa_j) + I_O(\kappa_i,\kappa_j)}{2}
  = \underbrace{\frac{y^T P \dot H_i P \dot H_j P y}{2\sigma^2}}_{I_A(\kappa_i,\kappa_j)}
  + \underbrace{\frac{\operatorname{tr}(P\ddot H_{ij}) - y^T P \ddot H_{ij} P y/\sigma^2}{4}}_{I_Z(\kappa_i,\kappa_j)}.   (11)

Such a splitting enjoys a nice property, which is presented as our main result.

3. Main result.

Theorem 3.1. Let I_O and I be the observed information matrix and the Fisher information matrix for the residual log-likelihood of the linear mixed model, respectively. Then the average of the observed information matrix and the Fisher information matrix can be split as (I_O + I)/2 = I_A + I_Z, such that the expectation of I_A is the Fisher information matrix and E(I_Z) = 0.

Such a splitting aims to remove computationally expensive yet negligible terms, so that a Newton-like method is applicable to large data sets which involve thousands of fixed and random effects. It keeps the essential information in the observed information matrix. In this sense, I_A is a good data-dependent (through y) approximation to the data-independent Fisher information. A quasi-Newton iterative procedure is obtained by replacing I_O with I_A in Algorithm 1.

Proof of the main result.

3.1. Basic Lemma.

Lemma 3.2. Let y ∼ N(Xτ, σ²H) be a random variable, where H is a symmetric positive definite matrix. Then P = H⁻¹ − H⁻¹X(XᵀH⁻¹X)⁻¹XᵀH⁻¹ is a weighted projection matrix such that
1. PX = 0;
2. PHP = P;
3. tr(PH) = n − ν, where rank(X) = ν;
4. P E(yyᵀ) = σ²PH.

Proof. The first two properties can be verified by direct computation.
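As a numerical sanity check, the first three properties of Lemma 3.2 and the splitting identity (11) can also be verified on the same kind of hypothetical toy model as before; the sketch below is illustrative only, and all sizes and names are assumptions rather than material from the paper.

```python
# Minimal sketch (hypothetical toy model): check Lemma 3.2 items 1-3 and the
# average-information splitting (11) for a single nonlinear parameter gamma.
import numpy as np

rng = np.random.default_rng(1)
n, p, b = 60, 3, 4
X = rng.standard_normal((n, p))
Z = rng.standard_normal((n, b))
y = rng.standard_normal(n)
sigma2, gamma = 1.0, 0.2

H = np.eye(n) + np.exp(gamma) * (Z @ Z.T)   # H = R + Z G(gamma) Z^T with R = I
H_dot = np.exp(gamma) * (Z @ Z.T)           # dH/dgamma
H_ddot = np.exp(gamma) * (Z @ Z.T)          # d^2H/dgamma^2

Hinv = np.linalg.inv(H)
P = Hinv - Hinv @ X @ np.linalg.solve(X.T @ Hinv @ X, X.T @ Hinv)

# Lemma 3.2, items 1-3: PX = 0, PHP = P, tr(PH) = n - nu.
nu = np.linalg.matrix_rank(X)
assert np.allclose(P @ X, 0)
assert np.allclose(P @ H @ P, P)
assert np.isclose(np.trace(P @ H), n - nu)

# Splitting (11): (I + I_O)/2 = I_A + I_Z.
Py = P @ y
PHd = P @ H_dot
I_fisher = 0.5 * np.trace(PHd @ PHd)                                           # (8)
I_obs = 0.5 * (np.trace(P @ H_ddot) - np.trace(PHd @ PHd)) \
        + (2 * Py @ H_dot @ P @ H_dot @ Py - Py @ H_ddot @ Py) / (2 * sigma2)  # (7)
I_A = Py @ H_dot @ P @ H_dot @ Py / (2 * sigma2)                               # (10)
I_Z = (np.trace(P @ H_ddot) - Py @ H_ddot @ Py / sigma2) / 4                   # I_Z in (11)
assert np.isclose((I_fisher + I_obs) / 2, I_A + I_Z)
print("splitting (11) verified:", I_A, I_Z)
```

Averaging I_Z over many simulated realizations of y would show it fluctuating around zero, which is the content of E(I_Z) = 0 in Theorem 3.1.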