
Appendix: The Multivariate Normal Distribution

In dealing with multivariate distributions such as the multivariate normal, it is convenient to extend the expectation and variance operators to random vectors. The expectation of a random vector $X = (X_1,\ldots,X_n)^t$ is defined componentwise by

\[
E(X) = \begin{pmatrix} E[X_1] \\ \vdots \\ E[X_n] \end{pmatrix}.
\]
Linearity carries over from the scalar case in the sense that

\[
E(X + Y) = E(X) + E(Y), \qquad E(MX) = M\,E(X)
\]
for a compatible random vector $Y$ and a compatible matrix $M$. The same componentwise conventions hold for the expectation of a random matrix and the variance and covariance of a random vector. Thus, we can express the variance-covariance matrix of a random vector $X$ as

\[
\operatorname{Var}(X) = E\{[X - E(X)][X - E(X)]^t\} = E(XX^t) - E(X)E(X)^t.
\]
These notational choices produce many other compact formulas. For instance, the random quadratic form $X^t M X$ has expectation

\[
E(X^t M X) = \operatorname{tr}[M \operatorname{Var}(X)] + E(X)^t M E(X). \tag{A.1}
\]

To verify this assertion, observe that

\begin{align*}
E(X^t M X) &= E\Bigl[\sum_i \sum_j X_i m_{ij} X_j\Bigr] \\
&= \sum_i \sum_j m_{ij}\,E(X_i X_j) \\
&= \sum_i \sum_j m_{ij}\,[\operatorname{Cov}(X_i, X_j) + E(X_i)E(X_j)] \\
&= \operatorname{tr}[M \operatorname{Var}(X)] + E(X)^t M E(X).
\end{align*}
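Identity (A.1) is easy to check numerically. The following is a minimal NumPy sketch; the particular mean vector, covariance matrix, and symmetric matrix $M$ are arbitrary illustrative choices, not values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative mean vector, covariance matrix, and symmetric M.
mu = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((3, 3))
Omega = A @ A.T                      # a valid (positive semidefinite) covariance matrix
M = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 1.0]])

# Simulate X ~ N(mu, Omega) and average the quadratic form X^t M X.
X = rng.multivariate_normal(mu, Omega, size=200_000)
mc_estimate = np.mean(np.einsum('ij,jk,ik->i', X, M, X))

# Right-hand side of (A.1): tr[M Var(X)] + E(X)^t M E(X).
exact = np.trace(M @ Omega) + mu @ M @ mu

print(mc_estimate, exact)   # the two values should agree closely
```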

Among the many possible definitions of the multivariate normal distribution, we adopt the one most widely used in stochastic simulation. Our point of departure will be random vectors with independent standard normal components. If such a random vector $X$ has $n$ components, then its density is
\[
\prod_{j=1}^n \frac{1}{\sqrt{2\pi}}\, e^{-x_j^2/2} = \Bigl(\frac{1}{2\pi}\Bigr)^{n/2} e^{-x^t x/2}.
\]
Because the standard normal distribution has mean 0, variance 1, and characteristic function $e^{-s^2/2}$, it follows that $X$ has mean vector $\mathbf{0}$, variance matrix $I$, and characteristic function
\[
E(e^{i s^t X}) = \prod_{j=1}^n e^{-s_j^2/2} = e^{-s^t s/2}.
\]
We now define any affine transformation $Y = AX + \mu$ of $X$ to be multivariate normal [1, 2]. This definition has several practical consequences. First, it is clear that $E(Y) = \mu$ and $\operatorname{Var}(Y) = A \operatorname{Var}(X) A^t = AA^t = \Omega$. Second, any affine transformation $BY + \nu = BAX + B\mu + \nu$ of $Y$ is also multivariate normal. Third, any subvector of $Y$ is multivariate normal. Fourth, the characteristic function of $Y$ is
\[
E(e^{i s^t Y}) = e^{i s^t \mu} E(e^{i s^t A X}) = e^{i s^t \mu - s^t A A^t s/2} = e^{i s^t \mu - s^t \Omega s/2}.
\]
Fifth, the sum of two independent multivariate normal random vectors is multivariate normal. Indeed, if $Z = BU + \nu$ is suitably dimensioned and $X$ is independent of $U$, then we can represent the sum
\[
Y + Z = \begin{pmatrix} A & B \end{pmatrix} \begin{pmatrix} X \\ U \end{pmatrix} + \mu + \nu
\]
in the required form.

This enumeration omits two more subtle issues. One is whether $Y$ possesses a density. Observe that $Y$ lives in an affine subspace of dimension equal to or less than the rank of $A$. Thus, if $Y$ has $m$ components, then $n \ge m$ must hold in order for $Y$ to possess a density. A second issue is the existence and nature of the conditional density of a set of components of $Y$ given the remaining components.

We can clarify both of these issues by making canonical choices of $X$ and $A$ based on the QR decomposition of a matrix. Assuming that $n \ge m$, we can write
\[
A^t = Q \begin{pmatrix} R \\ 0 \end{pmatrix},
\]
where $Q$ is an $n \times n$ orthogonal matrix and $R = L^t$ is an $m \times m$ upper-triangular matrix with nonnegative diagonal entries. (If $n = m$, we omit the zero matrix in the QR decomposition.) It follows that
\[
AX = \begin{pmatrix} L & 0^t \end{pmatrix} Q^t X = \begin{pmatrix} L & 0^t \end{pmatrix} Z.
\]

In view of the usual change-of-variables formula for probability densities and the facts that the orthogonal matrix $Q^t$ preserves inner products and has determinant $\pm 1$, the random vector $Z$ has $n$ independent standard normal components and serves as a substitute for $X$. Not only is this true, but we can dispense with the last $n - m$ components of $Z$ because they are multiplied by the matrix $0^t$. Thus, we can safely assume $n = m$ and calculate the density of $Y = LZ + \mu$ when $L$ is invertible. The change-of-variables formula then shows that $Y$ has density

\begin{align*}
f(y) &= \Bigl(\frac{1}{2\pi}\Bigr)^{n/2} |\det L^{-1}|\, e^{-(y-\mu)^t (L^{-1})^t L^{-1} (y-\mu)/2} \\
&= \Bigl(\frac{1}{2\pi}\Bigr)^{n/2} |\det \Omega|^{-1/2}\, e^{-(y-\mu)^t \Omega^{-1} (y-\mu)/2},
\end{align*}
where $\Omega = LL^t$ is the variance matrix of $Y$. By definition $LL^t$ is the Cholesky decomposition of $\Omega$.

To address the issue of conditional densities, consider the compatibly partitioned vectors $Y^t = (Y_1^t, Y_2^t)$, $X^t = (X_1^t, X_2^t)$, and $\mu^t = (\mu_1^t, \mu_2^t)$ and matrices
\[
L = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}, \qquad
\Omega = \begin{pmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{pmatrix}.
\]
Now suppose that $X$ is standard normal, that $Y = LX + \mu$, and that $L_{11}$ has full rank. For $Y_1 = y_1$ fixed, the equation $y_1 = L_{11}X_1 + \mu_1$ shows that $X_1$ is fixed at the value $x_1 = L_{11}^{-1}(y_1 - \mu_1)$. Because no restrictions apply to $X_2$, we have
\[
Y_2 = L_{22}X_2 + L_{21}L_{11}^{-1}(y_1 - \mu_1) + \mu_2.
\]
Thus, $Y_2$ given $Y_1$ is normal with mean $L_{21}L_{11}^{-1}(y_1 - \mu_1) + \mu_2$ and variance $L_{22}L_{22}^t$. To express these in terms of the blocks of $\Omega = LL^t$, observe that
\begin{align*}
\Omega_{11} &= L_{11}L_{11}^t \\
\Omega_{21} &= L_{21}L_{11}^t \\
\Omega_{22} &= L_{21}L_{21}^t + L_{22}L_{22}^t.
\end{align*}
The first two of these equations imply that $L_{21}L_{11}^{-1} = \Omega_{21}\Omega_{11}^{-1}$. The last equation then gives
\begin{align*}
L_{22}L_{22}^t &= \Omega_{22} - L_{21}L_{21}^t \\
&= \Omega_{22} - \Omega_{21}(L_{11}^t)^{-1}L_{11}^{-1}\Omega_{12} \\
&= \Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12}.
\end{align*}
These calculations do not require that $Y_2$ possess a density. In summary, the conditional distribution of $Y_2$ given $Y_1$ is normal with mean and variance
\begin{align*}
E(Y_2 \mid Y_1) &= \Omega_{21}\Omega_{11}^{-1}(Y_1 - \mu_1) + \mu_2 \\
\operatorname{Var}(Y_2 \mid Y_1) &= \Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12}. \tag{A.2}
\end{align*}
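Both the sampling recipe $Y = LZ + \mu$ and the conditional formulas (A.2) translate directly into code. The sketch below is illustrative only: the mean $\mu$ and covariance $\Omega$ are arbitrary assumed values, the partition into $Y_1$ and $Y_2$ is chosen for demonstration, and NumPy's linear-algebra routines stand in for the factorizations discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary illustrative mean vector and positive definite covariance matrix.
mu = np.array([0.0, 1.0, -1.0, 2.0])
B = rng.standard_normal((4, 4))
Omega = B @ B.T + 4.0 * np.eye(4)

# Draw Y = L Z + mu, where L is the lower-triangular Cholesky factor of Omega.
L = np.linalg.cholesky(Omega)        # Omega = L L^t
Z = rng.standard_normal(4)
Y = L @ Z + mu

# Partition: Y1 = first two components, Y2 = last two components.
idx1, idx2 = slice(0, 2), slice(2, 4)
O11, O12 = Omega[idx1, idx1], Omega[idx1, idx2]
O21, O22 = Omega[idx2, idx1], Omega[idx2, idx2]

# Conditional mean and variance of Y2 given Y1 = y1, per (A.2).
y1 = Y[idx1]
cond_mean = O21 @ np.linalg.solve(O11, y1 - mu[idx1]) + mu[idx2]
cond_var = O22 - O21 @ np.linalg.solve(O11, O12)

# Cross-check: the conditional variance equals L22 L22^t, as derived above.
L22 = L[idx2, idx2]
print(cond_mean)
print(np.allclose(cond_var, L22 @ L22.T))   # True
```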

A.1 References

[1] Rao CR (1973) Linear Statistical Inference and Its Applications, 2nd ed. Wiley, New York

[2] Severini TA (2005) Elements of Distribution Theory. Cambridge University Press, Cambridge
