Appendix: The Multivariate Normal Distribution
In dealing with multivariate distributions such as the multivariate normal, it is convenient to extend the expectation and variance operators to random vectors. The expectation of a random vector $X = (X_1,\ldots,X_n)^t$ is defined componentwise by
\[
\operatorname{E}(X) \;=\; \begin{pmatrix} \operatorname{E}[X_1] \\ \vdots \\ \operatorname{E}[X_n] \end{pmatrix}.
\]
Linearity carries over from the scalar case in the sense that
\[
\operatorname{E}(X + Y) = \operatorname{E}(X) + \operatorname{E}(Y), \qquad \operatorname{E}(MX) = M \operatorname{E}(X)
\]
for a compatible random vector $Y$ and a compatible matrix $M$. The same componentwise conventions hold for the expectation of a random matrix and the variances and covariances of a random vector. Thus, we can express the variance-covariance matrix of a random vector $X$ as
\[
\operatorname{Var}(X) \;=\; \operatorname{E}\{[X - \operatorname{E}(X)][X - \operatorname{E}(X)]^t\} \;=\; \operatorname{E}(XX^t) - \operatorname{E}(X)\operatorname{E}(X)^t.
\]
These notational choices produce many other compact formulas. For instance, the random quadratic form $X^t M X$ has expectation
\[
\operatorname{E}(X^t M X) \;=\; \operatorname{tr}[M \operatorname{Var}(X)] + \operatorname{E}(X)^t M \operatorname{E}(X). \tag{A.1}
\]
To verify this assertion, observe that
\[
\begin{aligned}
\operatorname{E}(X^t M X) &= \operatorname{E}\Bigl(\sum_i \sum_j X_i m_{ij} X_j\Bigr) \\
&= \sum_i \sum_j m_{ij} \operatorname{E}(X_i X_j) \\
&= \sum_i \sum_j m_{ij}\,[\operatorname{Cov}(X_i, X_j) + \operatorname{E}(X_i)\operatorname{E}(X_j)] \\
&= \operatorname{tr}[M \operatorname{Var}(X)] + \operatorname{E}(X)^t M \operatorname{E}(X).
\end{aligned}
\]

Among the many possible definitions of the multivariate normal distribution, we adopt the one most widely used in stochastic simulation. Our point of departure will be random vectors with independent standard normal components. If such a random vector $X$ has $n$ components, then its density is
\[
\prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-x_j^2/2} \;=\; \Bigl(\frac{1}{2\pi}\Bigr)^{n/2} e^{-x^t x/2}.
\]
Because the standard normal distribution has mean 0, variance 1, and characteristic function $e^{-s^2/2}$, it follows that $X$ has mean vector $0$, variance matrix $I$, and characteristic function
\[
\operatorname{E}(e^{i s^t X}) \;=\; \prod_{j=1}^{n} e^{-s_j^2/2} \;=\; e^{-s^t s/2}.
\]

We now define any affine transformation $Y = AX + \mu$ of $X$ to be multivariate normal [1, 2]. This definition has several practical consequences. First, it is clear that $\operatorname{E}(Y) = \mu$ and $\operatorname{Var}(Y) = A \operatorname{Var}(X) A^t = AA^t = \Omega$. Second, any affine transformation $BY + \nu = BAX + B\mu + \nu$ of $Y$ is also multivariate normal. Third, any subvector of $Y$ is multivariate normal. Fourth, the characteristic function of $Y$ is
\[
\operatorname{E}(e^{i s^t Y}) \;=\; e^{i s^t \mu} \operatorname{E}(e^{i s^t A X}) \;=\; e^{i s^t \mu - s^t A A^t s/2} \;=\; e^{i s^t \mu - s^t \Omega s/2}.
\]
Fifth, the sum of two independent multivariate normal random vectors is multivariate normal. Indeed, if $Z = BU + \nu$ is suitably dimensioned and $X$ is independent of $U$, then we can represent the sum
\[
Y + Z \;=\; \begin{pmatrix} A & B \end{pmatrix} \begin{pmatrix} X \\ U \end{pmatrix} + \mu + \nu
\]
in the required form.
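For readers who wish to experiment numerically, the following minimal sketch (not part of the original text; it assumes NumPy, and the matrix $A$, shift $\mu$, and symmetric matrix $M$ are arbitrary choices made purely for illustration) simulates $Y = AX + \mu$ from independent standard normals, compares the sample mean and covariance with $\mu$ and $\Omega = AA^t$, and checks the quadratic-form identity (A.1) applied to $Y$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary affine transformation Y = A X + mu, with X standard normal in R^3.
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.5, 0.0]])              # A is m x n with m = 2, n = 3
mu = np.array([1.0, -2.0])
Omega = A @ A.T                              # theoretical variance matrix of Y

# Simulate many replicates of Y = A X + mu.
n_samples = 200_000
X = rng.standard_normal((n_samples, A.shape[1]))
Y = X @ A.T + mu

print("sample mean:", Y.mean(axis=0))        # close to mu
print("sample covariance:\n", np.cov(Y.T))   # close to Omega = A A^t

# Empirical check of the quadratic-form identity (A.1) applied to Y:
# E(Y^t M Y) = tr[M Var(Y)] + E(Y)^t M E(Y).
M = np.array([[1.0, 0.3],
              [0.3, 2.0]])
lhs = np.einsum('ij,jk,ik->i', Y, M, Y).mean()
rhs = np.trace(M @ Omega) + mu @ M @ mu
print("E(Y^t M Y):", lhs, "vs", rhs)
```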
This enumeration omits two more subtle issues. One is whether $Y$ possesses a density. Observe that $Y$ lives in an affine subspace of dimension equal to or less than the rank of $A$. Thus, if $Y$ has $m$ components, then $n \ge m$ must hold in order for $Y$ to possess a density. A second issue is the existence and nature of the conditional density of a set of components of $Y$ given the remaining components.

We can clarify both of these issues by making canonical choices of $X$ and $A$ based on the QR decomposition of a matrix. Assuming that $n \ge m$, we can write
\[
A^t \;=\; Q \begin{pmatrix} R \\ 0 \end{pmatrix},
\]
where $Q$ is an $n \times n$ orthogonal matrix and $R = L^t$ is an $m \times m$ upper-triangular matrix with nonnegative diagonal entries. (If $n = m$, we omit the zero matrix in the QR decomposition.) It follows that
\[
AX \;=\; \begin{pmatrix} L & 0^t \end{pmatrix} Q^t X \;=\; \begin{pmatrix} L & 0^t \end{pmatrix} Z.
\]
In view of the usual change-of-variables formula for probability densities and the facts that the orthogonal matrix $Q^t$ preserves inner products and has determinant $\pm 1$, the random vector $Z$ has $n$ independent standard normal components and serves as a substitute for $X$. Not only is this true, but we can dispense with the last $n - m$ components of $Z$ because they are multiplied by the matrix $0^t$. Thus, we can safely assume $n = m$ and calculate the density of $Y = LZ + \mu$ when $L$ is invertible. The change-of-variables formula then shows that $Y$ has density
\[
\begin{aligned}
f(y) &= \Bigl(\frac{1}{2\pi}\Bigr)^{n/2} |\det L^{-1}|\, e^{-(y-\mu)^t (L^{-1})^t L^{-1} (y-\mu)/2} \\
&= \Bigl(\frac{1}{2\pi}\Bigr)^{n/2} |\det \Omega|^{-1/2}\, e^{-(y-\mu)^t \Omega^{-1} (y-\mu)/2},
\end{aligned}
\]
where $\Omega = LL^t$ is the variance matrix of $Y$. By definition $LL^t$ is the Cholesky decomposition of $\Omega$.

To address the issue of conditional densities, consider the compatibly partitioned vectors $Y^t = (Y_1^t, Y_2^t)$, $X^t = (X_1^t, X_2^t)$, and $\mu^t = (\mu_1^t, \mu_2^t)$ and matrices
\[
L \;=\; \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}, \qquad
\Omega \;=\; \begin{pmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{pmatrix}.
\]
Now suppose that $X$ is standard normal, that $Y = LX + \mu$, and that $L_{11}$ has full rank. For $Y_1 = y_1$ fixed, the equation $y_1 = L_{11} X_1 + \mu_1$ shows that $X_1$ is fixed at the value $x_1 = L_{11}^{-1}(y_1 - \mu_1)$. Because no restrictions apply to $X_2$, we have
\[
Y_2 \;=\; L_{22} X_2 + L_{21} L_{11}^{-1}(y_1 - \mu_1) + \mu_2.
\]
Thus, $Y_2$ given $Y_1$ is normal with mean $L_{21} L_{11}^{-1}(y_1 - \mu_1) + \mu_2$ and variance $L_{22} L_{22}^t$. To express these in terms of the blocks of $\Omega = LL^t$, observe that
\[
\begin{aligned}
\Omega_{11} &= L_{11} L_{11}^t \\
\Omega_{21} &= L_{21} L_{11}^t \\
\Omega_{22} &= L_{21} L_{21}^t + L_{22} L_{22}^t.
\end{aligned}
\]
The first two of these equations imply that $L_{21} L_{11}^{-1} = \Omega_{21} \Omega_{11}^{-1}$. The last equation then gives
\[
\begin{aligned}
L_{22} L_{22}^t &= \Omega_{22} - L_{21} L_{21}^t \\
&= \Omega_{22} - \Omega_{21} (L_{11}^t)^{-1} L_{11}^{-1} \Omega_{12} \\
&= \Omega_{22} - \Omega_{21} \Omega_{11}^{-1} \Omega_{12}.
\end{aligned}
\]
These calculations do not require that $Y_2$ possess a density. In summary, the conditional distribution of $Y_2$ given $Y_1$ is normal with mean and variance
\[
\begin{aligned}
\operatorname{E}(Y_2 \mid Y_1) &= \Omega_{21} \Omega_{11}^{-1}(Y_1 - \mu_1) + \mu_2 \\
\operatorname{Var}(Y_2 \mid Y_1) &= \Omega_{22} - \Omega_{21} \Omega_{11}^{-1} \Omega_{12}.
\end{aligned} \tag{A.2}
\]
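The conditional formulas (A.2) translate directly into a few lines of linear algebra. The sketch below is again an illustration rather than anything from the text: the covariance matrix $\Omega$, mean $\mu$, and observed value $y_1$ are arbitrary, and NumPy is assumed. It computes $\operatorname{E}(Y_2 \mid Y_1 = y_1)$ and $\operatorname{Var}(Y_2 \mid Y_1)$ and confirms that the conditional variance agrees with the Cholesky block $L_{22} L_{22}^t$ derived above.

```python
import numpy as np

# Hypothetical partitioned parameters: Y1 is the first component of Y,
# Y2 the remaining two components.
mu = np.array([0.0, 1.0, -1.0])
Omega = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 1.2],
                  [0.5, 1.2, 2.0]])
k = 1                                        # dimension of Y1

O11, O12 = Omega[:k, :k], Omega[:k, k:]
O21, O22 = Omega[k:, :k], Omega[k:, k:]
mu1, mu2 = mu[:k], mu[k:]

y1 = np.array([2.0])                         # an observed value of Y1

# Formula (A.2): conditional mean and variance of Y2 given Y1 = y1.
cond_mean = O21 @ np.linalg.solve(O11, y1 - mu1) + mu2
cond_var = O22 - O21 @ np.linalg.solve(O11, O12)

# Cross-check against the Cholesky factorization Omega = L L^t:
# the derivation above shows Var(Y2 | Y1) = L22 L22^t.
L = np.linalg.cholesky(Omega)                # lower triangular
L22 = L[k:, k:]
print("conditional mean:", cond_mean)
print("conditional variance:\n", cond_var)
print("matches L22 L22^t:", np.allclose(cond_var, L22 @ L22.T))
```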
A.1 References

[1] Rao CR (1973) Linear Statistical Inference and Its Applications, 2nd ed. Wiley, New York

[2] Severini TA (2005) Elements of Distribution Theory. Cambridge University Press, Cambridge