
1 Appendix: Common distributions

This chapter provides details for common univariate and multivariate distributions, including definitions, moments, and simulation. Many distributions can be parameterized in different ways. Devroye (1986) provides a complete treatment of random number generation, although care must be taken to ensure the parameterizations are consistent.

Uniform

A random variable $X$ has a uniform distribution on the interval $[\alpha,\beta]$, denoted $X \sim \mathcal{U}[\alpha,\beta]$, if the probability density function (pdf) is

$$p(x|\alpha,\beta) = \frac{1}{\beta-\alpha} \quad (1)$$

for $x \in [\alpha,\beta]$ and 0 otherwise. The mean and variance of a uniform random variable are $E(X) = \frac{\alpha+\beta}{2}$ and $var(X) = \frac{(\beta-\alpha)^2}{12}$, respectively. The uniform distribution plays a foundational role in random number generation. In particular, uniform random numbers are required for the inverse transform simulation method, accept-reject algorithms, and the Metropolis algorithm. Fast and accurate pre-programmed algorithms are available in most statistical software packages and programming languages.
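The inverse transform method mentioned above can be sketched concretely. The following is a minimal illustration (Python, standard library only; the function names are ours, not the book's): uniform draws are pushed through an inverse cdf, here the exponential cdf $F(x) = 1 - e^{-x/\theta}$, so $F^{-1}(u) = -\theta\ln(1-u)$.

```python
import math
import random

def inverse_transform(inv_cdf, n, rng):
    """Push n U(0,1) draws through an inverse cdf."""
    return [inv_cdf(rng.random()) for _ in range(n)]

def exp_inv_cdf(u, theta=2.0):
    # F(x) = 1 - exp(-x/theta)  =>  F^{-1}(u) = -theta*log(1-u)
    return -theta * math.log(1.0 - u)

rng = random.Random(0)
draws = inverse_transform(exp_inv_cdf, 20000, rng)
sample_mean = sum(draws) / len(draws)  # should be near theta = 2
```

Any distribution with a tractable inverse cdf can be simulated this way from uniform draws alone.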

Bernoulli

A random variable $X \in \{0,1\}$ has a Bernoulli distribution with parameter $\theta$, denoted $X \sim \mathcal{B}er(\theta)$, if the probability mass function is

$$\text{Prob}(X = x|\theta) = \theta^x(1-\theta)^{1-x}. \quad (2)$$

The mean and variance of a Bernoulli random variable are $E(X) = \theta$ and $var(X) = \theta(1-\theta)$, respectively. To simulate $X \sim \mathcal{B}er(\theta)$,

1. Draw $U \sim \mathcal{U}(0,1)$
2. Set $X = 1$ if $U \le \theta$, and $X = 0$ otherwise.
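The two-step recipe can be written out directly; a minimal sketch (Python, standard library; names ours), thresholding a uniform draw at $\theta$ so that $\text{Prob}(X = 1) = \theta$:

```python
import random

def bernoulli(theta, rng):
    """X = 1 if U <= theta (an event of probability theta), else 0."""
    return 1 if rng.random() <= theta else 0

rng = random.Random(1)
draws = [bernoulli(0.3, rng) for _ in range(20000)]
p_hat = sum(draws) / len(draws)  # should be near theta = 0.3
```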

Binomial

A random variable $X \in \{0,1,\ldots,n\}$ has a binomial distribution with parameters $n$ and $\theta$, denoted $X \sim \mathcal{B}in(n,\theta)$, if the probability mass function is

$$\text{Prob}(X = x|n,\theta) = \frac{n!}{x!(n-x)!}\theta^x(1-\theta)^{n-x}, \quad (3)$$

where $n! = n(n-1)! = n(n-1)\cdots 2\cdot 1$. The mean and variance of a binomial random variable are $E(X) = n\theta$ and $var(X) = n\theta(1-\theta)$, respectively. The binomial distribution arises as the distribution of a sum of $n$ independent Bernoulli trials. The binomial is closely related to a number of other distributions. If $W_1,\ldots,W_n$ are i.i.d. $\mathcal{B}er(p)$, then $\sum_{i=1}^n W_i \sim \mathcal{B}in(n,p)$. As $n \to \infty$ and $p \to 0$ with $np = \lambda$ held fixed, $\mathcal{B}in(n,p)$ converges in distribution to a Poisson distribution with parameter $\lambda$. To simulate $X \sim \mathcal{B}in(n,\theta)$,

1. Draw $W_1,\ldots,W_n$ independently, $W_i \sim \mathcal{B}er(\theta)$
2. Set $X = \#(W_i = 1)$.
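A sketch of the sum-of-Bernoullis construction (Python, standard library; names ours):

```python
import random

def binomial(n, theta, rng):
    """Count successes in n independent Bernoulli(theta) trials."""
    return sum(1 for _ in range(n) if rng.random() <= theta)

rng = random.Random(2)
draws = [binomial(10, 0.4, rng) for _ in range(5000)]
mean_hat = sum(draws) / len(draws)  # should be near n*theta = 4
```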

Poisson

A random variable $X \in \mathbb{N}_+$ (the non-negative integers) has a Poisson distribution with parameter $\lambda$, denoted $X \sim \mathcal{P}oi(\lambda)$, if the probability mass function is

$$\text{Prob}(X = x|\lambda) = \frac{\lambda^x e^{-\lambda}}{x!}. \quad (4)$$

The mean and variance of a Poisson random variable are $E(X) = \lambda$ and $var(X) = \lambda$, respectively.
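One classical simulation method counts unit-rate exponential inter-arrival times until their running sum exceeds $\lambda$; the count of completed arrivals is $\mathcal{P}oi(\lambda)$. A sketch (Python, standard library; names ours):

```python
import math
import random

def poisson(lam, rng):
    """Count exp(1) arrivals in [0, lam]; the count is Poi(lam)."""
    total, count = 0.0, 0
    while True:
        total += -math.log(1.0 - rng.random())  # exp(1) inter-arrival
        if total > lam:
            return count
        count += 1

rng = random.Random(3)
draws = [poisson(3.0, rng) for _ in range(20000)]
mean_hat = sum(draws) / len(draws)
var_hat = sum((d - mean_hat) ** 2 for d in draws) / len(draws)
```

Both sample moments should be near $\lambda = 3$, reflecting the equal mean and variance of the Poisson.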

To simulate $X \sim \mathcal{P}oi(\lambda)$,

1. Draw $Z_1, Z_2, \ldots$ independently, $Z_i \sim \exp(1)$
2. Set $X = \inf\left\{n \ge 0 : \sum_{i=1}^{n+1} Z_i > \lambda\right\}$.

Exponential

A random variable $X \in \mathbb{R}_+$ has an exponential distribution with parameter $\theta$, denoted $X \sim \exp(\theta)$, if the pdf is

$$p(x|\theta) = \frac{1}{\theta}\exp\left(-\frac{x}{\theta}\right). \quad (5)$$

The mean and variance of an exponential random variable are $E(X) = \theta$ and $var(X) = \theta^2$, respectively.

The inverse transform method is the easiest way to simulate exponential random variables, since the cumulative distribution function is $F(x) = 1 - e^{-x/\theta}$. To simulate $X \sim \exp(\theta)$,

1. Draw $U \sim \mathcal{U}[0,1]$
2. Set $X = -\theta\ln(1-U)$.

Gamma

A random variable $X \in \mathbb{R}_+$ has a gamma distribution with parameters $\alpha$ and $\beta$, denoted $X \sim \mathcal{G}(\alpha,\beta)$, if the pdf is

$$p(x|\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}\exp(-\beta x), \quad (6)$$

where the gamma function $\Gamma(\alpha)$ is defined in Appendix 4. The mean and variance of a gamma random variable are $E(X) = \alpha/\beta$ and $var(X) = \alpha/\beta^2$, respectively. It is important to note that there are different parameterizations of the gamma distribution. For example, some authors (and MATLAB) parameterize the gamma density as

$$p(x|\alpha,\beta) = \frac{1}{\Gamma(\alpha)\beta^\alpha}x^{\alpha-1}\exp(-x/\beta).$$

Notice that if $Y \sim \mathcal{G}(\alpha,1)$ and $X = Y/\beta$, then $X \sim \mathcal{G}(\alpha,\beta)$. To see this, note that the inverse transform is $Y = \beta X$ and $dY/dX = \beta$, which implies that

$$p(x|\alpha,\beta) = \frac{1}{\Gamma(\alpha)}(\beta x)^{\alpha-1}\exp(-\beta x)\,\beta = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}\exp(-\beta x),$$

which is the density of a $\mathcal{G}(\alpha,\beta)$ random variable. The exponential distribution is a special case of the gamma distribution when $\alpha = 1$: $X \sim \mathcal{G}(1,\theta^{-1})$ implies that $X \sim \exp(\theta)$.

Gamma random variable simulation is standard, with built-in generators in most software packages. These algorithms typically use accept-reject algorithms that are customized to the specific values of $\alpha$ and $\beta$. To simulate $X \sim \mathcal{G}(\alpha,\beta)$ when $\alpha$ is integer-valued,

1. Draw $X_1,\ldots,X_\alpha$ independently, $X_i \sim \exp(1)$
2. Set $X = \frac{1}{\beta}\sum_{i=1}^{\alpha}X_i$.

For non-integer $\alpha$, accept-reject methods provide fast and accurate algorithms for gamma simulation. To avoid confusion over parameterizations, the transformation method can be used. To simulate $X \sim \mathcal{G}(\alpha,\beta)$,

1. Draw $Y \sim \mathcal{G}(\alpha,1)$
2. Set $X = Y/\beta$.
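For integer $\alpha$, the sum-of-exponentials recipe combined with the $Y/\beta$ rescaling can be sketched as follows (Python, standard library; names ours; $\mathcal{G}(\alpha,\beta)$ in the book's parameterization with mean $\alpha/\beta$):

```python
import math
import random

def gamma_integer_shape(alpha, beta, rng):
    """Sum alpha unit exponentials (a G(alpha,1) draw), then divide by beta."""
    y = sum(-math.log(1.0 - rng.random()) for _ in range(alpha))
    return y / beta

rng = random.Random(4)
draws = [gamma_integer_shape(3, 2.0, rng) for _ in range(20000)]
mean_hat = sum(draws) / len(draws)  # should be near alpha/beta = 1.5
```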

Beta

A random variable $X \in [0,1]$ has a beta distribution with parameters $\alpha$ and $\beta$, denoted $X \sim \mathcal{B}(\alpha,\beta)$, if the pdf is

$$p(x|\alpha,\beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}, \quad (7)$$

where

$$B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$

is the beta function. Since $\int p(x|\alpha,\beta)\,dx = 1$,

$$B(\alpha,\beta) = \int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\,dx.$$

The mean and variance of a beta random variable are

$$E(X) = \frac{\alpha}{\alpha+\beta} \quad \text{and} \quad var(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}, \quad (8)$$

respectively. If $\alpha = \beta = 1$, then $X \sim \mathcal{U}(0,1)$.

If $\alpha$ and $\beta$ are integers, to simulate $X \sim \mathcal{B}(\alpha,\beta)$,

1. Draw $X_1 \sim \mathcal{G}(\alpha,1)$ and $X_2 \sim \mathcal{G}(\beta,1)$
2. Set $X = \frac{X_1}{X_1+X_2}$.

For the general case, fast algorithms involving accept-reject, composition, and transformation methods are available in standard software packages.
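The gamma-ratio construction also works for non-integer $\alpha$ and $\beta$ when a general gamma generator is available; a sketch using Python's built-in `random.gammavariate` (shape/scale parameterization; function names ours):

```python
import random

def beta_from_gammas(alpha, beta, rng):
    """X = G1/(G1 + G2), with G1 ~ G(alpha, 1) and G2 ~ G(beta, 1)."""
    g1 = rng.gammavariate(alpha, 1.0)
    g2 = rng.gammavariate(beta, 1.0)
    return g1 / (g1 + g2)

rng = random.Random(5)
draws = [beta_from_gammas(2.0, 3.0, rng) for _ in range(20000)]
mean_hat = sum(draws) / len(draws)  # should be near alpha/(alpha+beta) = 0.4
```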

Chi-squared

A random variable $X \in \mathbb{R}_+$ has a chi-squared distribution with parameter $\nu$, denoted $X \sim \mathcal{X}^2_\nu$, if the pdf is

$$p(x|\nu) = \frac{1}{2^{\nu/2}\,\Gamma\left(\frac{\nu}{2}\right)}x^{\frac{\nu}{2}-1}\exp\left(-\frac{x}{2}\right). \quad (9)$$

The mean and variance of $X$ are $E(X) = \nu$ and $var(X) = 2\nu$, respectively. The $\mathcal{X}^2_\nu$-distribution is a special case of the gamma distribution: $\mathcal{X}^2_\nu = \mathcal{G}\left(\frac{\nu}{2},\frac{1}{2}\right)$. Simulating chi-squared random variables typically uses the transformation method. For integer values of $\nu$, the following two-step procedure simulates a $\mathcal{X}^2_\nu$ random variable:

1. Draw $Z_1,\ldots,Z_\nu$ independently, $Z_i \sim \mathcal{N}(0,1)$
2. Set $X = \sum_{i=1}^{\nu}Z_i^2$.

When $\nu$ is large, simulating using normal random variables is computationally costly, and alternative, more computationally efficient algorithms use gamma random variable generation.
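The sum-of-squared-normals procedure in a few lines (Python, standard library; names ours):

```python
import random

def chi_squared(nu, rng):
    """Sum of nu squared standard normal draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))

rng = random.Random(6)
draws = [chi_squared(5, rng) for _ in range(20000)]
mean_hat = sum(draws) / len(draws)   # should be near nu = 5
var_hat = sum((d - mean_hat) ** 2 for d in draws) / len(draws)  # near 2*nu = 10
```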

Inverse gamma

A random variable $X \in \mathbb{R}_+$ has an inverse gamma distribution, denoted $X \sim \mathcal{IG}(\alpha,\beta)$, if the pdf is

$$p(x|\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{-(\alpha+1)}\exp\left(-\frac{\beta}{x}\right). \quad (10)$$

The mean and variance of the inverse gamma distribution are

$$E(X) = \frac{\beta}{\alpha-1} \quad \text{and} \quad var(X) = \frac{\beta^2}{(\alpha-1)^2(\alpha-2)} \quad (11)$$

for $\alpha > 2$. If $Y \sim \mathcal{G}(\alpha,\beta)$, then $X = Y^{-1} \sim \mathcal{IG}(\alpha,\beta)$. To see this, note that

$$1 = \int_0^\infty \frac{\beta^\alpha}{\Gamma(\alpha)}y^{\alpha-1}\exp(-\beta y)\,dy = \int_0^\infty \frac{\beta^\alpha}{\Gamma(\alpha)}\left(\frac{1}{x}\right)^{\alpha-1}\exp\left(-\frac{\beta}{x}\right)\frac{1}{x^2}\,dx = \int_0^\infty \frac{\beta^\alpha}{\Gamma(\alpha)}\frac{1}{x^{\alpha+1}}\exp\left(-\frac{\beta}{x}\right)dx.$$

The following two steps simulate an $X \sim \mathcal{IG}(\alpha,\beta)$:

1. Draw $Y \sim \mathcal{G}(\alpha,1)$
2. Set $X = \beta/Y$.

Again, as in the case of the gamma distribution, some authors use a different parameterization for this distribution, so it is important to be careful to make sure you are drawing using the correct parameters. In the case of prior distributions over scale parameters, $\sigma^2$, it is additionally complicated because some authors, such as Zellner (1971), parameterize models in terms of $\sigma$ instead of $\sigma^2$.
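The two-step inverse gamma recipe in code (Python; `random.gammavariate(alpha, 1.0)` supplies the $\mathcal{G}(\alpha,1)$ draw; names ours):

```python
import random

def inverse_gamma(alpha, beta, rng):
    """X = beta / Y with Y ~ G(alpha, 1); E[X] = beta/(alpha - 1)."""
    return beta / rng.gammavariate(alpha, 1.0)

rng = random.Random(7)
draws = [inverse_gamma(5.0, 4.0, rng) for _ in range(20000)]
mean_hat = sum(draws) / len(draws)  # should be near beta/(alpha-1) = 1
```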

Normal

A random variable $X \in \mathbb{R}$ has a normal distribution with parameters $\mu$ and $\sigma^2$, denoted $X \sim \mathcal{N}(\mu,\sigma^2)$, if the pdf is

$$p(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right). \quad (12)$$

The mean and variance are $E(X) = \mu$ and $var(X) = \sigma^2$.

Given the importance of normal random variables, all software packages have functions to draw normal random variables. The algorithms typically use transformation methods drawing uniform and exponential random variables, or look-up tables.

Lognormal

A random variable $X \in \mathbb{R}_+$ has a lognormal distribution with parameters $\mu$ and $\sigma^2$, denoted $X \sim \mathcal{LN}(\mu,\sigma^2)$, if the pdf is

$$p(x|\mu,\sigma^2) = \frac{1}{x\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(\ln x - \mu)^2\right). \quad (13)$$

The mean and variance of the lognormal distribution are $E(X) = e^{\mu+\frac{1}{2}\sigma^2}$ and $var(X) = \exp(2\mu+\sigma^2)(\exp(\sigma^2)-1)$. It is related to a normal distribution via the transformation $X = e^{\mu+\sigma Z}$, where $Z \sim \mathcal{N}(0,1)$. Although finite moments of the lognormal exist, the distribution does not admit a moment-generating function.

Simulating lognormal random variables via the transformation method is straightforward: $X = e^{\mu+\sigma\varepsilon}$, where $\varepsilon \sim \mathcal{N}(0,1)$, is $\mathcal{LN}(\mu,\sigma^2)$.

Truncated Normal

A random variable $X$ has a truncated normal distribution with parameters $\mu,\sigma^2$ and truncation region $(a,b)$ if the pdf is

$$p(x|a < x < b) = \frac{\phi(x|\mu,\sigma^2)}{\Phi(b|\mu,\sigma^2) - \Phi(a|\mu,\sigma^2)},$$

where $\phi$ is the normal pdf, $\Phi$ is the normal cdf, and $\int_{-\infty}^b \phi(x|\mu,\sigma^2)\,dx = \Phi(b|\mu,\sigma^2)$. The mean of a truncated normal distribution is

$$E(X|a < X < b) = \mu + \sigma\,\frac{\phi_a - \phi_b}{\Phi_b - \Phi_a},$$

where $\phi_x$ is the standard normal density evaluated at $(x-\mu)/\sigma$ and $\Phi_x$ is the standard normal cdf evaluated at $(x-\mu)/\sigma$. The inversion method can be used to simulate a truncated normal random variable. A two-step algorithm provides a draw from a truncated standard normal:

1. Draw $U \sim \mathcal{U}[0,1]$
2. Set $X = \Phi^{-1}\left[\Phi(a) + U\left(\Phi(b) - \Phi(a)\right)\right]$,

where $\Phi(a) = \int_{-\infty}^a (2\pi)^{-1/2}\exp(-x^2/2)\,dx$. For a general truncated normal, $X \sim \mathcal{TN}(\mu,\sigma^2)\,1_{[a,b]}$,

1. Draw $U \sim \mathcal{U}[0,1]$
2. Set $X = \mu + \sigma\,\Phi^{-1}\left[\Phi\left(\frac{a-\mu}{\sigma}\right) + U\left(\Phi\left(\frac{b-\mu}{\sigma}\right) - \Phi\left(\frac{a-\mu}{\sigma}\right)\right)\right]$,

where $\Phi^{-1}$ is the inverse of the standard normal cdf.
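The inversion algorithm for a general truncated normal can be sketched with the standard normal cdf and its inverse from `statistics.NormalDist` (Python standard library; function name ours):

```python
import random
from statistics import NormalDist

def truncated_normal(mu, sigma, a, b, rng):
    """Inversion sampler for N(mu, sigma^2) restricted to (a, b)."""
    nd = NormalDist()
    fa = nd.cdf((a - mu) / sigma)
    fb = nd.cdf((b - mu) / sigma)
    u = rng.random()
    return mu + sigma * nd.inv_cdf(fa + u * (fb - fa))

rng = random.Random(8)
draws = [truncated_normal(0.0, 1.0, 1.0, 2.0, rng) for _ in range(5000)]
mean_hat = sum(draws) / len(draws)
```

Every draw lies in $(a,b)$ by construction; for the standard normal truncated to $(1,2)$, the mean formula above gives $(\phi(1)-\phi(2))/(\Phi(2)-\Phi(1)) \approx 1.383$.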

Double exponential

A random variable $X \in \mathbb{R}$ has a double exponential (or Laplace) distribution with parameters $\mu$ and $\sigma$, denoted $X \sim \mathcal{DE}(\mu,\sigma)$, if the pdf is

$$p(x|\mu,\sigma) = \frac{1}{2\sigma}\exp\left(-\frac{|x-\mu|}{\sigma}\right). \quad (14)$$

The mean and variance are $E(X) = \mu$ and $var(X) = 2\sigma^2$.

The following two steps utilize the composition method to simulate a $\mathcal{DE}(\mu,\sigma)$ random variable:

1. Draw $\lambda \sim \mathcal{E}(2)$ and $\varepsilon \sim \mathcal{N}(0,1)$
2. Set $X = \mu + \sigma\sqrt{\lambda}\,\varepsilon$.
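The scale-mixture construction in code (Python, standard library; `expovariate(0.5)` gives an exponential with mean 2; names ours):

```python
import math
import random

def double_exponential(mu, sigma, rng):
    """Composition: lam ~ E(2), eps ~ N(0,1), X = mu + sigma*sqrt(lam)*eps."""
    lam = rng.expovariate(0.5)   # exponential with mean 2
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * math.sqrt(lam) * eps

rng = random.Random(9)
draws = [double_exponential(1.0, 0.5, rng) for _ in range(20000)]
mean_hat = sum(draws) / len(draws)                              # near mu = 1
var_hat = sum((d - mean_hat) ** 2 for d in draws) / len(draws)  # near 2*sigma^2 = 0.5
```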

Check exponential

A random variable $X \in \mathbb{R}$ has a check (or asymmetric) exponential distribution with parameters $\mu$, $\sigma$, and $\gamma$, denoted $X \sim \mathcal{CE}(\mu,\sigma,\gamma)$, if the pdf is

$$p(x|\mu,\sigma) = \frac{1}{\sigma\sigma_\gamma}\exp\left(-\rho_\gamma\left(\frac{x-\mu}{\sigma}\right)\right), \quad (15)$$

where $\rho_\gamma(x) = |x| - (2\gamma-1)x$ and $\sigma_\gamma^{-1} = 2\gamma(1-\gamma)$. The double exponential is a special case when $\gamma = \frac{1}{2}$.

The following two steps utilize the composition method to simulate a $\mathcal{CE}(\mu,\sigma,\gamma)$ random variable:

Step 1: Draw $\lambda \sim \mathcal{E}(\sigma_\gamma)$ and $\varepsilon \sim \mathcal{N}(0,1)$
Step 2: Set $X = \mu + \sigma\left((2\gamma-1)\lambda + \sqrt{\lambda}\,\varepsilon\right)$.

Student t

A random variable $X \in \mathbb{R}$ has a t-distribution with parameters $\nu$, $\mu$, and $\sigma^2$, denoted $X \sim t_\nu(\mu,\sigma^2)$, if the pdf is

$$p(x|\nu,\mu,\sigma^2) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\pi\nu}\,\sigma}\left(1 + \frac{(x-\mu)^2}{\nu\sigma^2}\right)^{-\frac{\nu+1}{2}}. \quad (16)$$

When $\mu = 0$ and $\sigma = 1$, the distribution is denoted merely as $t_\nu$. The mean and variance of the t-distribution are $E(X) = \mu$ and $var(X) = \sigma^2\,\frac{\nu}{\nu-2}$ for $\nu > 2$. The Cauchy distribution is the special case where $\nu = 1$.

The following two steps utilize the composition method to simulate a $t_\nu(\mu,\sigma^2)$ random variable:

1. Draw $X_1 \sim \mathcal{N}(0,\sigma^2)$ and $X_2 \sim \mathcal{X}^2_\nu$
2. Set $X = \mu + \frac{X_1}{(X_2/\nu)^{1/2}}$.
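The normal/chi-squared composition in code (Python; `gammavariate(nu/2, 2.0)` produces the $\mathcal{X}^2_\nu$ draw; names ours):

```python
import math
import random

def student_t(nu, mu, sigma, rng):
    """X = mu + sigma * Z / sqrt(X2/nu), with Z ~ N(0,1) and X2 ~ chi^2_nu."""
    z = rng.gauss(0.0, 1.0)
    x2 = rng.gammavariate(nu / 2.0, 2.0)  # chi^2_nu as G(nu/2, scale 2)
    return mu + sigma * z / math.sqrt(x2 / nu)

rng = random.Random(10)
draws = [student_t(10, 2.0, 1.5, rng) for _ in range(20000)]
mean_hat = sum(draws) / len(draws)  # should be near mu = 2
```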

Inverse Gaussian

A random variable $X \in \mathbb{R}_+$ has an inverse Gaussian distribution with parameters $\mu$ and $\lambda$, denoted $X \sim \mathcal{IN}(\mu,\lambda)$, if the pdf is

$$p(x|\mu,\lambda) = \sqrt{\frac{\lambda}{2\pi x^3}}\exp\left(-\frac{\lambda(x-\mu)^2}{2\mu^2 x}\right). \quad (17)$$

The mean and variance of an inverse Gaussian random variable are $E(X) = \mu$ and $var(X) = \mu^3/\lambda$, respectively.

To simulate an inverse Gaussian $X \sim \mathcal{IN}(\mu,\lambda)$ random variable,

1. Draw $U \sim \mathcal{U}(0,1)$ and $Y \sim \mathcal{X}^2_1$
2. Set $W = \mu + \frac{\mu^2 Y}{2\lambda} - \frac{\mu}{2\lambda}\sqrt{4\mu\lambda Y + \mu^2 Y^2}$
3. If $U \le \frac{\mu}{\mu+W}$, set $X = W$; otherwise, set $X = \frac{\mu^2}{W}$.
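The three steps above follow the Michael-Schucany-Haas transformation sampler; a sketch of our reconstruction (Python, standard library; names ours):

```python
import math
import random

def inverse_gaussian(mu, lam, rng):
    """Transformation sampler: take the smaller root W of the quadratic,
    then accept W or mu^2/W with probability mu/(mu + W)."""
    y = rng.gauss(0.0, 1.0) ** 2  # chi^2_1 draw
    w = mu + (mu * mu * y) / (2.0 * lam) \
        - (mu / (2.0 * lam)) * math.sqrt(4.0 * mu * lam * y + (mu * y) ** 2)
    if rng.random() <= mu / (mu + w):
        return w
    return mu * mu / w

rng = random.Random(11)
draws = [inverse_gaussian(1.0, 2.0, rng) for _ in range(20000)]
mean_hat = sum(draws) / len(draws)  # should be near mu = 1
```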

Generalized inverse Gaussian

A random variable $X \in \mathbb{R}_+$ has a generalized inverse Gaussian distribution with parameters $a$, $b$, and $p$, denoted $X \sim \mathcal{GIG}(a,b,p)$, if the pdf is

$$p(x|a,b,p) = \frac{(a/b)^{p/2}}{2K_p\left(\sqrt{ab}\right)}x^{p-1}\exp\left(-\frac{1}{2}\left(ax + \frac{b}{x}\right)\right), \quad (18)$$

where $K_p$ is the modified Bessel function of the second kind. The mean and variance are known, but are complicated expressions involving the Bessel functions. The gamma distribution is the special case with $b = 0$, and the inverse gamma is the special case with $a = 0$.

Simulating GIG random variables is typically done using rejection methods.

Multinomial

A vector of random variables $X = (X_1,\ldots,X_k)$ has a multinomial distribution, denoted $X \sim \text{Mult}(n;p_1,\ldots,p_k)$, if

$$p(X = x|p_1,\ldots,p_k) = \frac{n!}{x_1!\cdots x_k!}\prod_{i=1}^k p_i^{x_i}, \quad (19)$$

where $\sum_{i=1}^k x_i = n$. The multinomial distribution is a natural extension of the Bernoulli and binomial distributions. The Bernoulli distribution gives a single trial resulting in success or failure. The binomial distribution is an extension that involves $n$ independently repeated Bernoulli trials. The multinomial allows for multiple outcomes, instead of the two outcomes in the binomial distribution. There are still $n$ total trials, but now the outcome of each trial is assigned into one of $k$ categories, and $x_i$ counts the number of outcomes in category $i$. The probability of category $i$ is $p_i$. The mean, variance, and covariances of the multinomial distribution are given by

$$E(X_i) = np_i, \quad var(X_i) = np_i(1-p_i), \quad \text{and} \quad cov(X_i,X_j) = -np_ip_j. \quad (20)$$

Multinomial distributions are often used in modeling finite mixture distributions, where the multinomial random variables represent the various mixture components.
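A direct category-assignment sampler (Python, standard library; names ours), assigning each of $n$ trials by inverting the cumulative category probabilities:

```python
import random

def multinomial(n, probs, rng):
    """Assign each of n trials to one of k categories; returns the counts."""
    counts = [0] * len(probs)
    for _ in range(n):
        u, acc = rng.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if u <= acc:
                counts[i] += 1
                break
        else:
            counts[-1] += 1  # guard against floating-point rounding at u near 1
    return counts

rng = random.Random(12)
draws = [multinomial(10, [0.2, 0.3, 0.5], rng) for _ in range(5000)]
mean_c0 = sum(d[0] for d in draws) / len(draws)  # near n*p_1 = 2
```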

Standard software packages provide multinomial samplers.

Dirichlet

A vector of random variables $X = (X_1,\ldots,X_k)$ with $\sum_{i=1}^k X_i = 1$ has a Dirichlet distribution, denoted $X \sim \mathcal{D}(\alpha_1,\ldots,\alpha_k)$, if

$$p(x|\alpha_1,\ldots,\alpha_k) = \frac{\Gamma\left(\sum_{i=1}^k \alpha_i\right)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_k)}\prod_{i=1}^k x_i^{\alpha_i-1}. \quad (21)$$

The Dirichlet distribution is used as a prior for mixture probabilities in mixture models. The mean, variance, and covariances of the Dirichlet distribution are given by

$$E(X_i) = \frac{\alpha_i}{\sum_{j=1}^k \alpha_j}, \quad var(X_i) = \frac{\alpha_i\sum_{j\ne i}\alpha_j}{\left(\sum_{j=1}^k \alpha_j\right)^2\left(\sum_{j=1}^k \alpha_j + 1\right)}, \quad \text{and} \quad cov(X_i,X_j) = \frac{-\alpha_i\alpha_j}{\left(\sum_{l=1}^k \alpha_l\right)^2\left(\sum_{l=1}^k \alpha_l + 1\right)}.$$

To simulate $X = (X_1,\ldots,X_k) \sim \mathcal{D}(\alpha_1,\ldots,\alpha_k)$, draw $k$ independent gamma variates $Y_i \sim \mathcal{G}(\alpha_i,1)$ and then set $X_i = Y_i/\sum_{j=1}^k Y_j$.

Multivariate normal

A $1 \times p$ random vector $X \in \mathbb{R}^p$ has a multivariate normal distribution with parameters $\mu$ and $\Sigma$, denoted $X \sim \mathcal{MVN}(\mu,\Sigma)$, if the pdf is

$$p(x|\mu,\Sigma) = \left(\frac{1}{2\pi}\right)^{p/2}|\Sigma|^{-1/2}\exp\left(-\frac{1}{2}(x-\mu)\Sigma^{-1}(x-\mu)'\right), \quad (22)$$

where $|\Sigma|$ is the determinant of the positive definite symmetric matrix $\Sigma$. The mean and covariance matrix of a multivariate normal random variable are $E(X) = \mu$ and $cov(X) = \Sigma$, respectively.

Given the importance of normal random variables, all software packages have functions to draw multivariate normal random variables.
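Internally, such functions typically transform i.i.d. standard normals through a Cholesky factor $L$ of $\Sigma$ (so $X = \mu + Lz$ has covariance $LL' = \Sigma$). A bivariate sketch with the factor written out by hand (Python, standard library; names ours):

```python
import math
import random

def mvn2_draw(mu, cov, rng):
    """Bivariate normal draw via an explicit 2x2 Cholesky factor."""
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 * l21)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2)

rng = random.Random(13)
cov = [[2.0, 0.6], [0.6, 1.0]]
draws = [mvn2_draw((0.0, 0.0), cov, rng) for _ in range(20000)]
cov_hat = sum(x * y for x, y in draws) / len(draws)  # near cov[0][1] = 0.6
```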

Multivariate t

A $1 \times p$ random vector $X \in \mathbb{R}^p$ has a multivariate t-distribution with parameters $\nu$, $\mu$, and $\Sigma$, denoted $X \sim \mathcal{MV}t_\nu(\mu,\Sigma)$, if the pdf is given by

$$p(x|\nu,\mu,\Sigma) = \frac{\Gamma\left(\frac{\nu+p}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)(\pi\nu)^{p/2}\,|\Sigma|^{1/2}}\left(1 + \frac{(x-\mu)\Sigma^{-1}(x-\mu)'}{\nu}\right)^{-\frac{\nu+p}{2}}. \quad (23)$$

The mean and covariance matrix of a multivariate t random variable are $E(X) = \mu$ and $cov(X) = \frac{\nu}{\nu-2}\Sigma$, respectively.

The following two steps provide a draw from a multivariate t-distribution:

Step 1. Simulate $Y \sim \mathcal{N}(0,\Sigma)$ and $Z \sim \mathcal{X}^2_\nu$
Step 2. Set $X = \mu + \frac{Y}{(Z/\nu)^{1/2}}$.

Wishart

A random $m \times m$ matrix $\Sigma$ has a Wishart distribution, $\Sigma \sim \mathcal{W}_m(v,V)$, if the density function is given by

$$p(\Sigma|v,V) = \frac{|\Sigma|^{\frac{v-m-1}{2}}}{2^{\frac{vm}{2}}\,|V|^{\frac{v}{2}}\,\Gamma_m\left(\frac{v}{2}\right)}\exp\left(-\frac{1}{2}\text{tr}\left(V^{-1}\Sigma\right)\right), \quad (24)$$

for $v > m$, where

$$\Gamma_m\left(\frac{v}{2}\right) = \pi^{\frac{m(m-1)}{4}}\prod_{k=1}^m \Gamma\left(\frac{v-k+1}{2}\right) \quad (25)$$

is the multivariate gamma function. If $v < m$, then $\Sigma$ does not have a density (although its distribution is defined). The Wishart distribution arises naturally in multivariate settings with normally distributed random variables as the distribution of quadratic forms of multivariate normal random variables.

The Wishart distribution can be viewed as a multivariate generalization of the $\mathcal{X}^2$ distribution. From this, it is clear how to sample from a Wishart distribution:

1. Draw $X_j \sim \mathcal{N}(0,V)$ for $j = 1,\ldots,v$
2. Set $S = \sum_{j=1}^v X_jX_j'$.

Inverted Wishart

A random $m \times m$ matrix $\Sigma$ has an inverted Wishart distribution, denoted $\Sigma \sim \mathcal{IW}_m(b,B)$, if the density function is

$$p(\Sigma|b,B) = \frac{|B|^{\frac{b}{2}}\,|\Sigma|^{-\frac{b+m+1}{2}}}{2^{\frac{bm}{2}}\,\Gamma_m\left(\frac{b}{2}\right)}\exp\left(-\frac{1}{2}\text{tr}\left(B\Sigma^{-1}\right)\right). \quad (26)$$

This also implies that $\Sigma^{-1}$ has a Wishart distribution, $\Sigma^{-1} \sim \mathcal{W}_m(b,B^{-1})$. The Jacobian of the transformation is

$$\left|\frac{\partial \Sigma^{-1}}{\partial \Sigma}\right| = |\Sigma|^{-(m+1)}.$$

To generate $\Sigma \sim \mathcal{IW}_m(b,B)$, follow the two-step procedure:

Step 1: Draw $X_i \sim \mathcal{N}(0,B^{-1})$ for $i = 1,\ldots,b$
Step 2: Set $\Sigma = \left(\sum_{i=1}^b X_iX_i'\right)^{-1}$.

In cases where $m$ is extremely large, there are more efficient algorithms for drawing inverted Wishart random variables that factor $\Sigma$.
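The quadratic-form construction for the Wishart (and, by matrix inversion of the draw, the inverted Wishart) can be sketched in the $2 \times 2$ case (Python, standard library; names ours):

```python
import math
import random

def mvn2_zero(v, rng):
    """Draw X ~ N(0, V) for a 2x2 covariance V via its Cholesky factor."""
    l11 = math.sqrt(v[0][0])
    l21 = v[1][0] / l11
    l22 = math.sqrt(v[1][1] - l21 * l21)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (l11 * z1, l21 * z1 + l22 * z2)

def wishart2(v, V, rng):
    """S = sum_{j=1}^{v} X_j X_j' with X_j ~ N(0, V); E[S] = v*V."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(v):
        x = mvn2_zero(V, rng)
        for i in range(2):
            for j in range(2):
                s[i][j] += x[i] * x[j]
    return s

rng = random.Random(14)
V = [[1.0, 0.3], [0.3, 2.0]]
draws = [wishart2(8, V, rng) for _ in range(2000)]
mean_00 = sum(s[0][0] for s in draws) / len(draws)  # near v*V[0][0] = 8
```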

2 Likelihoods, priors, and posteriors

This appendix provides likelihoods and priors for the following types of observed data: Bernoulli, Poisson, normal, normal regression, and multivariate normal. For each specification, proper conjugate and Jeffreys' priors are given.

2.1 Bernoulli observations

Likelihood: If $y_t|\theta \sim \mathcal{B}er(\theta)$, $\theta \in [0,1]$, then the likelihood is

$$p(y|\theta) = \prod_{t=1}^T p(y_t|\theta) = \prod_{t=1}^T \theta^{y_t}(1-\theta)^{1-y_t} = \theta^{\sum_{t=1}^T y_t}(1-\theta)^{T-\sum_{t=1}^T y_t}, \quad (27)$$

where $\sum_{t=1}^T y_t$ is a sufficient statistic. Fisher's information for Bernoulli observations is

$$I(\theta) = -E_\theta\left[\frac{\partial^2 \ln p(y_t|\theta)}{\partial \theta^2}\right] = \frac{1}{\theta(1-\theta)},$$

where $E_\theta$ denotes the expectation of $y_t$ conditional on $\theta$.

Priors: A proper conjugate prior for Bernoulli observations is the beta distribution. If $\theta \sim \mathcal{B}(a,A)$, then

$$p(\theta) = \frac{\Gamma(a+A)}{\Gamma(a)\Gamma(A)}\theta^{a-1}(1-\theta)^{A-1} = \frac{\theta^{a-1}(1-\theta)^{A-1}}{B(a,A)},$$

where $B(a,A)$ is the beta function. Jeffreys' prior is

$$p(\theta) \propto I(\theta)^{\frac{1}{2}} \propto \theta^{-\frac{1}{2}}(1-\theta)^{-\frac{1}{2}} \sim \mathcal{B}\left(\frac{1}{2},\frac{1}{2}\right).$$

Posterior: By Bayes rule, the posterior distribution is

$$p(\theta|y) \propto p(y|\theta)p(\theta) \propto \theta^{a+\sum_{t=1}^T y_t-1}(1-\theta)^{A+T-\sum_{t=1}^T y_t-1} \sim \mathcal{B}(a_T,A_T),$$

where $a_T = a + \sum_{t=1}^T y_t$ and $A_T = A + T - \sum_{t=1}^T y_t$. The moments of the beta distribution are given in Appendix 2.

Marginal likelihood: The marginal likelihood is

$$p(y) = \int p(y|\theta)p(\theta)\,d\theta = \frac{1}{B(a,A)}\int \theta^{a_T-1}(1-\theta)^{A_T-1}\,d\theta = \frac{B(a_T,A_T)}{B(a,A)}.$$

Predictive likelihood: The predictive distribution is

$$p\left(y_{T+1}|y^T\right) = \frac{p\left(y^{T+1}\right)}{p\left(y^T\right)} = \frac{\int p\left(y^{T+1}|\theta\right)p(\theta)\,d\theta}{\int p\left(y^T|\theta\right)p(\theta)\,d\theta} = \frac{B\left(a_T+y_{T+1},\,A_T+1-y_{T+1}\right)}{B(a_T,A_T)}.$$
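The beta-Bernoulli conjugate update is a one-liner; a sketch (Python, standard library; names ours) showing the posterior mean concentrating on the true $\theta$:

```python
import random

def beta_bernoulli_update(y, a, A):
    """Posterior is B(a_T, A_T): a_T = a + sum(y), A_T = A + T - sum(y)."""
    s = sum(y)
    return a + s, A + len(y) - s

rng = random.Random(15)
theta_true = 0.7
y = [1 if rng.random() <= theta_true else 0 for _ in range(500)]
a_T, A_T = beta_bernoulli_update(y, 1.0, 1.0)  # flat B(1,1) prior
post_mean = a_T / (a_T + A_T)
```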

2.2 Multinomial observations

Likelihood: Multinomial observations consist of data from $k$ different categories, and $y_i$ counts the number of observations in the $i$th category. If $y|\theta \sim \text{Mult}(T;\theta_1,\ldots,\theta_k)$, $\theta_i \in [0,1]$, then the likelihood of $T$ trials is

$$p(y_1,\ldots,y_k|\theta_1,\ldots,\theta_k) \propto \theta_1^{y_1}\cdots\theta_k^{y_k}, \quad \text{where} \quad \sum_{i=1}^k y_i = T \quad \text{and} \quad \sum_{i=1}^k \theta_i = 1.$$

Prior: If we assume a Dirichlet prior distribution, $\theta \sim \text{Dir}(\alpha)$, with density

$$p(\theta_1,\ldots,\theta_k|\alpha) = \frac{\Gamma\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)}\theta_1^{\alpha_1-1}\cdots\theta_k^{\alpha_k-1}.$$

Posterior: By Bayes rule, the posterior is Dirichlet with parameter $\alpha + y$:

$$p(\theta_1,\ldots,\theta_k|y_1,\ldots,y_k) \propto \theta_1^{y_1}\cdots\theta_k^{y_k}\cdot\theta_1^{\alpha_1-1}\cdots\theta_k^{\alpha_k-1} \propto \theta_1^{\alpha_1+y_1-1}\cdots\theta_k^{\alpha_k+y_k-1} \sim \text{Dir}(\alpha+y).$$

2.3 Poisson observations

Likelihood: If $y_t|\lambda \sim \mathcal{P}oi(\lambda)$, then

$$p(y|\lambda) = \prod_{t=1}^T \frac{\lambda^{y_t}e^{-\lambda}}{y_t!} = \frac{e^{-T\lambda}\lambda^{\sum_{t=1}^T y_t}}{\prod_{t=1}^T y_t!},$$

where $\sum_{t=1}^T y_t$ is a sufficient statistic for $\lambda$. Fisher's information for Poisson observations is $I(\lambda) = \lambda^{-1}$.

Prior: A conjugate prior for $\lambda$ is a gamma distribution. If $\lambda \sim \mathcal{G}(a,A)$, then the prior pdf is

$$p(\lambda) = \frac{A^a}{\Gamma(a)}\lambda^{a-1}\exp(-A\lambda).$$

Jeffreys' prior is

$$p(\lambda) \propto I(\lambda)^{\frac{1}{2}} = \lambda^{-\frac{1}{2}},$$

which can be viewed as a special case of the gamma prior with $A = 0$ and $a = 1/2$.

Posterior distribution: The posterior distribution is

$$p(\lambda|y) \propto p(y|\lambda)p(\lambda) \propto e^{-(A+T)\lambda}\lambda^{a+\sum_{t=1}^T y_t-1} \sim \mathcal{G}(a_T,A_T),$$

where $a_T = a + \sum_{t=1}^T y_t$ and $A_T = A + T$.

Marginal likelihood: The marginal likelihood is given by

$$p(y) = \int p(y|\lambda)p(\lambda)\,d\lambda = \left(\prod_{t=1}^T y_t!\right)^{-1}\frac{\Gamma(a_T)\,A^a}{\Gamma(a)\,A_T^{a_T}}.$$

Predictive distribution: The predictive distribution is given by

$$\text{Prob}\left(y_{T+1}|y^T\right) = \int p(y_{T+1}|\lambda)\,p(\lambda|y^T)\,d\lambda = \int \frac{e^{-\lambda}\lambda^{y_{T+1}}}{y_{T+1}!}\,\frac{A_T^{a_T}}{\Gamma(a_T)}\lambda^{a_T-1}\exp(-A_T\lambda)\,d\lambda = \frac{A_T^{a_T}\,\Gamma(a_T+y_{T+1})}{y_{T+1}!\,(A_T+1)^{a_T+y_{T+1}}\,\Gamma(a_T)}.$$
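The gamma-Poisson update in code (Python, standard library; the Poisson draws reuse the exponential-arrival construction from the previous appendix; names ours):

```python
import math
import random

def gamma_poisson_update(y, a, A):
    """Posterior is G(a_T, A_T) with a_T = a + sum(y), A_T = A + T."""
    return a + sum(y), A + len(y)

def poisson_draw(lam, rng):
    total, count = 0.0, 0
    while True:
        total += -math.log(1.0 - rng.random())
        if total > lam:
            return count
        count += 1

rng = random.Random(16)
y = [poisson_draw(4.0, rng) for _ in range(400)]
a_T, A_T = gamma_poisson_update(y, 2.0, 1.0)
post_mean = a_T / A_T  # E[G(a_T, A_T)] = a_T/A_T, near the true lambda = 4
```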

2.4 Normal observations with known variance: independent prior

Likelihood: If $y_t|\mu,\sigma^2 \sim \mathcal{N}(\mu,\sigma^2)$, the likelihood is

$$p(y|\mu,\sigma^2) = \prod_{t=1}^T p(y_t|\mu,\sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{T}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{t=1}^T (y_t-\mu)^2\right) = \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{T}{2}}\exp\left(-\frac{1}{2\sigma^2}\left(\sum_{t=1}^T (y_t-\bar{y})^2 + T(\bar{y}-\mu)^2\right)\right),$$

where $\bar{y} = T^{-1}\sum_{t=1}^T y_t$. Assuming the variance is known, the likelihood of $\mu$ is

$$p(y|\mu,\sigma^2) \propto \exp\left(-\frac{T(\mu-\bar{y})^2}{2\sigma^2}\right),$$

and the other terms are absorbed into the constant of integration. Fisher's information for normal observations with $\sigma^2$ known is

$$I(\mu) = -E_\mu\left[\frac{\partial^2 \ln p(y_t|\mu,\sigma^2)}{\partial \mu^2}\right] = \frac{1}{\sigma^2}.$$

Prior: Assuming $\sigma^2$ is known, a proper conjugate prior distribution for $\mu$ that is independent of $\sigma^2$ is $\mu \sim \mathcal{N}(a,A)$. Jeffreys' prior for normal observations (with known variance) is, as a function of $\mu$, a constant, $p(\mu) \propto 1$, which is improper. However, the resulting posterior is proper and can be viewed as a limiting case of the normal conjugate prior with $a = 0$ and $A \to \infty$.

Posterior distribution: With the normal conjugate prior, $\mu \sim \mathcal{N}(a,A)$, the posterior is given by Bayes rule as

$$p(\mu|y,\sigma^2) \propto p(y|\mu,\sigma^2)p(\mu) \propto \exp\left(-\frac{1}{2}\left[\frac{(\mu-\bar{y})^2}{\sigma^2/T} + \frac{(\mu-a)^2}{A}\right]\right).$$

To simplify the posterior, complete the square for the term inside the brackets using the results in Appendix 5,

$$\frac{(\mu-\bar{y})^2}{\sigma^2/T} + \frac{(\mu-a)^2}{A} = \frac{(\mu-a_T)^2}{A_T} + \frac{(\bar{y}-a)^2}{\sigma^2/T+A},$$

where

$$\frac{a_T}{A_T} = \frac{\bar{y}}{\sigma^2/T} + \frac{a}{A} \quad \text{and} \quad \frac{1}{A_T} = \frac{1}{\sigma^2/T} + \frac{1}{A}. \quad (28)$$

Thus, the posterior is

$$p(\mu|y,\sigma^2) \propto \exp\left(-\frac{(\mu-a_T)^2}{2A_T}\right) \sim \mathcal{N}(a_T,A_T).$$

Marginal likelihood: The marginal likelihood can be computed easily using sufficient statistics, since $\bar{y}$ is a sufficient statistic for $\mu$ if $\sigma^2$ is known. In this case, $p(y|\mu,\sigma^2) = p(\bar{y}|\mu,\sigma^2)$. This implies that $p(y|\sigma^2) = p(\bar{y}|\sigma^2) = \int p(\bar{y}|\mu,\sigma^2)p(\mu)\,d\mu$. Since $\bar{y} \sim \mathcal{N}(\mu,\sigma^2/T)$ and $\mu \sim \mathcal{N}(a,A)$, the marginalization can be done via substitution, which implies that $p(\bar{y}|\sigma^2) \sim \mathcal{N}(a,A+\sigma^2/T)$ and

$$p(\bar{y}|\sigma^2) = \left(\frac{1}{2\pi(A+\sigma^2/T)}\right)^{\frac{1}{2}}\exp\left(-\frac{1}{2}\frac{(\bar{y}-a)^2}{A+\sigma^2/T}\right).$$

Predictive distribution: The predictive distribution is

$$p(y_{T+1}|\sigma^2,y) = \int p(y_{T+1}|\mu,\sigma^2)\,p(\mu|\sigma^2,y)\,d\mu.$$

This can also be easily computed using the integration by substitution trick: since $y_{T+1} = \mu + \sigma\varepsilon_{T+1}$ and $p(\mu|\sigma^2,y) \sim \mathcal{N}(a_T,A_T)$, then $\mu = a_T + \sqrt{A_T}\,\tilde{\varepsilon}_T$ for some standard normal $\tilde{\varepsilon}_T$. Thus,

$$y_{T+1} = a_T + \sqrt{A_T}\,\tilde{\varepsilon}_T + \sigma\varepsilon_{T+1}$$

and $p(y_{T+1}|\sigma^2,y) \sim \mathcal{N}(a_T,A_T+\sigma^2)$.
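Equation (28) translates directly into code; a deterministic sketch (Python; names ours) showing that a very diffuse prior ($A$ large) pushes the posterior mean to $\bar{y}$ and the posterior variance to $\sigma^2/T$:

```python
def normal_known_variance_update(y, sigma2, a, A):
    """Conjugate update (28): 1/A_T = T/sigma2 + 1/A,
    a_T = A_T * (T*ybar/sigma2 + a/A)."""
    T = len(y)
    ybar = sum(y) / T
    A_T = 1.0 / (T / sigma2 + 1.0 / A)
    a_T = A_T * (T * ybar / sigma2 + a / A)
    return a_T, A_T

a_T, A_T = normal_known_variance_update([1.0, 2.0, 3.0], 1.0, 0.0, 1e6)
# with A = 1e6 the data dominate: a_T ~ ybar = 2, A_T ~ sigma2/T = 1/3
```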

2.5 Normal observations with known variance: dependent prior

Likelihood: As in the previous case, if $y_t|\mu,\sigma^2 \sim \mathcal{N}(\mu,\sigma^2)$, the likelihood as a function of $\mu$ is

$$p(y|\mu,\sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{T}{2}}\exp\left(-\frac{1}{2\sigma^2}\left(\sum_{t=1}^T (y_t-\bar{y})^2 + T(\bar{y}-\mu)^2\right)\right) \propto \exp\left(-\frac{T(\mu-\bar{y})^2}{2\sigma^2}\right).$$

Prior: Again, assume that $\sigma^2$ is known, but now consider a dependent conjugate prior distribution for $\mu$ conditional on $\sigma^2$: $\mu \sim \mathcal{N}(a,\sigma^2 A)$ (notice the subtle differences between this and the prior in the previous case).

Posterior: The posterior distribution is

$$p(\mu|y,\sigma^2) \propto p(y|\mu,\sigma^2)p(\mu|\sigma^2) \propto \exp\left(-\frac{1}{2\sigma^2}\left[\frac{(\mu-\bar{y})^2}{T^{-1}} + \frac{(\mu-a)^2}{A}\right]\right).$$

Completing the square,

$$\frac{(\mu-\bar{y})^2}{T^{-1}} + \frac{(\mu-a)^2}{A} = \frac{(\mu-a_T)^2}{A_T} + \frac{(\bar{y}-a)^2}{T^{-1}+A},$$

where

$$\frac{a_T}{A_T} = \frac{a}{A} + \frac{\bar{y}}{T^{-1}} \quad \text{and} \quad \frac{1}{A_T} = \frac{1}{T^{-1}} + \frac{1}{A}, \quad (29)$$

implies the posterior distribution is

$$p(\mu|y,\sigma^2) \propto \exp\left(-\frac{(\mu-a_T)^2}{2\sigma^2 A_T}\right) \sim \mathcal{N}(a_T,A_T\sigma^2).$$

Notice the slight differences between this example and the previous one, in terms of the hyperparameters and the form of the posterior distribution.

Marginal distribution: The marginal likelihood can be computed in the same manner as in the previous section since $\bar{y}$ is a sufficient statistic for $\mu$ if $\sigma^2$ is known. For this prior, $\bar{y} \sim \mathcal{N}(\mu,\sigma^2/T)$ and $\mu \sim \mathcal{N}(a,A\sigma^2)$, which implies that $p(\bar{y}|\sigma^2) \sim \mathcal{N}(a,\sigma^2(A+T^{-1}))$ or

$$p(\bar{y}|\sigma^2) = \left(\frac{1}{2\pi\sigma^2(A+T^{-1})}\right)^{\frac{1}{2}}\exp\left(-\frac{1}{2\sigma^2}\frac{(\bar{y}-a)^2}{A+T^{-1}}\right).$$

Predictive distribution: Using the arguments in the previous section, $p(y_{T+1}|\sigma^2,y) \sim \mathcal{N}(a_T,(A_T+1)\sigma^2)$.

2.6 Normal variance with known mean: conjugate prior

Likelihood: If $y_t|\mu,\sigma^2 \sim \mathcal{N}(\mu,\sigma^2)$ and assuming that $\mu$ is known, the likelihood, as a function of $\sigma^2$, is

$$p(y|\mu,\sigma^2) \propto \left(\frac{1}{\sigma^2}\right)^{\frac{T}{2}}\exp\left(-\frac{\sum_{t=1}^T (y_t-\mu)^2}{2\sigma^2}\right).$$

Fisher's information for normal observations (with known mean) is

$$I(\sigma^2) = -E_{\sigma^2}\left[\frac{\partial^2 \ln p(y_t|\mu,\sigma^2)}{\partial (\sigma^2)^2}\right] = \frac{1}{2\sigma^4}.$$

Prior distribution: A conjugate prior distribution for $\sigma^2$ is the inverse gamma distribution. If $\sigma^2 \sim \mathcal{IG}\left(\frac{b}{2},\frac{B}{2}\right)$, then

$$p(\sigma^2) = \frac{(B/2)^{\frac{b}{2}}}{\Gamma\left(\frac{b}{2}\right)}\left(\frac{1}{\sigma^2}\right)^{\frac{b}{2}+1}\exp\left(-\frac{B}{2\sigma^2}\right).$$

Jeffreys' prior is $p(\sigma^2) \propto (\sigma^2)^{-1}$, which is improper. The resulting posterior is proper and can be viewed as a limiting case of the inverse gamma conjugate prior with $b = B = 0$. A flat or constant prior for $\sigma^2$ also leads to a proper posterior.

Posterior distribution: By Bayes rule,

$$p(\sigma^2|\mu,y) \propto p(y|\mu,\sigma^2)p(\sigma^2) \propto \left(\frac{1}{\sigma^2}\right)^{\frac{b+T}{2}+1}\exp\left(-\frac{B+\sum_{t=1}^T(y_t-\mu)^2}{2\sigma^2}\right) \sim \mathcal{IG}\left(\frac{b_T}{2},\frac{B_T}{2}\right),$$

where $b_T = b + T$ and $B_T = B + \sum_{t=1}^T (y_t-\mu)^2$. The parameterization $\sigma^2 \sim \mathcal{IG}\left(\frac{b}{2},\frac{B}{2}\right)$ is used instead of $\sigma^2 \sim \mathcal{IG}(b,B)$ so that the hyperparameters do not carry any 1/2 terms; this is chosen for notational simplicity. It is also common in the literature to assume $p(\sigma) \sim \mathcal{IG}\left(\frac{b}{2},\frac{B}{2}\right)$, which only changes the first term in the expression for $b_T$.
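The inverse gamma variance update in code (Python; names ours), with the posterior mean $B_T/(b_T-2)$ of the $\mathcal{IG}(b_T/2, B_T/2)$ distribution computed for a small worked example:

```python
def variance_update(y, mu, b, B):
    """sigma^2 | y ~ IG(b_T/2, B_T/2): b_T = b + T, B_T = B + sum (y_t - mu)^2."""
    b_T = b + len(y)
    B_T = B + sum((yt - mu) ** 2 for yt in y)
    return b_T, B_T

b_T, B_T = variance_update([1.0, -1.0, 2.0, 0.0], 0.0, 3.0, 2.0)
post_mean = B_T / (b_T - 2.0)  # E[IG(b_T/2, B_T/2)] = B_T/(b_T - 2)
```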

Marginal likelihood: The marginal likelihood is

$$p(y|\mu) = \int p(y|\mu,\sigma^2)p(\sigma^2)\,d\sigma^2 = \frac{(B/2)^{\frac{b}{2}}}{(2\pi)^{\frac{T}{2}}\,\Gamma\left(\frac{b}{2}\right)}\int \left(\frac{1}{\sigma^2}\right)^{\frac{b_T}{2}+1}\exp\left(-\frac{B_T}{2\sigma^2}\right)d\sigma^2.$$

Using the results in Appendix 4,

$$\int \left(\frac{1}{\sigma^2}\right)^{\frac{b_T}{2}+1}\exp\left(-\frac{B_T}{2\sigma^2}\right)d\sigma^2 = \Gamma\left(\frac{b_T}{2}\right)\left(\frac{B_T}{2}\right)^{-\frac{b_T}{2}}.$$

Thus, the marginal likelihood is

$$p(y|\mu) = \frac{(B/2)^{\frac{b}{2}}}{(2\pi)^{\frac{T}{2}}}\,\frac{\Gamma\left(\frac{b_T}{2}\right)}{\Gamma\left(\frac{b}{2}\right)}\left(\frac{B_T}{2}\right)^{-\frac{b_T}{2}}.$$

Predictive distribution: The predictive distribution is

$$p(y_{T+1}|\mu,y) = \int p(y_{T+1}|\mu,\sigma^2)\,p(\sigma^2|\mu,y)\,d\sigma^2 \propto \int \left(\frac{1}{\sigma^2}\right)^{\frac{1}{2}}\exp\left(-\frac{(y_{T+1}-\mu)^2}{2\sigma^2}\right)\left(\frac{1}{\sigma^2}\right)^{\frac{b_T}{2}+1}\exp\left(-\frac{B_T}{2\sigma^2}\right)d\sigma^2$$

$$\propto \int \left(\frac{1}{\sigma^2}\right)^{\frac{b_T+1}{2}+1}\exp\left(-\frac{B_T+(y_{T+1}-\mu)^2}{2\sigma^2}\right)d\sigma^2 \propto \left[1+\frac{(y_{T+1}-\mu)^2}{b_T\,\frac{B_T}{b_T}}\right]^{-\frac{b_T+1}{2}},$$

which is a Student t distribution, $p(y_{T+1}|\mu,y) \sim t_{b_T}\left(\mu,\frac{B_T}{b_T}\right)$.

2.7 Unknown mean and variance: dependent conjugate priors

Likelihood: If $y_t|\mu,\sigma^2 \sim \mathcal{N}(\mu,\sigma^2)$ and assuming that both $\mu$ and $\sigma^2$ are unknown, then the likelihood as a function of $\mu$ and $\sigma^2$ is

$$p(y|\mu,\sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{T}{2}}\exp\left(-\frac{\sum_{t=1}^T(y_t-\bar{y})^2 + T(\bar{y}-\mu)^2}{2\sigma^2}\right).$$

Fisher's information is

$$I(\mu,\sigma^2) = -E_{\mu,\sigma^2}\begin{bmatrix}\frac{\partial^2 \ln p(y_t|\mu,\sigma^2)}{\partial \mu^2} & \frac{\partial^2 \ln p(y_t|\mu,\sigma^2)}{\partial \mu\,\partial \sigma^2}\\ \frac{\partial^2 \ln p(y_t|\mu,\sigma^2)}{\partial \mu\,\partial \sigma^2} & \frac{\partial^2 \ln p(y_t|\mu,\sigma^2)}{\partial (\sigma^2)^2}\end{bmatrix} = \begin{bmatrix}\frac{1}{\sigma^2} & 0\\ 0 & \frac{1}{2\sigma^4}\end{bmatrix}.$$

Prior distribution: A conjugate prior for $(\mu,\sigma^2)$ is $p(\mu,\sigma^2) = p(\mu|\sigma^2)p(\sigma^2)$, where

$$p(\mu|\sigma^2) \sim \mathcal{N}(a,A\sigma^2) \quad \text{and} \quad p(\sigma^2) \sim \mathcal{IG}\left(\frac{b}{2},\frac{B}{2}\right). \quad (30)$$

This distribution is often expressed as $\mathcal{NIG}(a,A,b,B)$. Notice that the prior distributions for $\mu$ and $\sigma^2$ are dependent. The marginal prior distribution for $\mu$ is not normal but is a Student t distribution:

$$p(\mu) = \int p(\mu|\sigma^2)p(\sigma^2)\,d\sigma^2 = \int \left(\frac{1}{2\pi\sigma^2 A}\right)^{\frac{1}{2}}\exp\left(-\frac{(\mu-a)^2}{2\sigma^2 A}\right)\frac{(B/2)^{\frac{b}{2}}}{\Gamma\left(\frac{b}{2}\right)}\left(\frac{1}{\sigma^2}\right)^{\frac{b}{2}+1}\exp\left(-\frac{B}{2\sigma^2}\right)d\sigma^2 \propto \int \left(\frac{1}{\sigma^2}\right)^{\frac{b+1}{2}+1}\exp\left(-\frac{1}{\sigma^2}\left[\frac{(\mu-a)^2}{2A}+\frac{B}{2}\right]\right)d\sigma^2.$$

Using the integration results in Appendix 4,

$$p(\mu) \propto \left[\frac{B}{2}+\frac{(\mu-a)^2}{2A}\right]^{-\frac{b+1}{2}} \propto \left[1+\frac{(\mu-a)^2}{b\,\frac{AB}{b}}\right]^{-\frac{b+1}{2}},$$

which is a Student t distribution, $t_b\left(a,\frac{AB}{b}\right)$. Fisher's information implies that Jeffreys' prior is

$$p(\mu,\sigma^2) \propto \det\left[I(\mu,\sigma^2)\right]^{\frac{1}{2}} \propto \left(\frac{1}{\sigma^2}\right)^{\frac{3}{2}}.$$

This prior is improper, but leads to a proper posterior distribution that can be viewed as the limiting case of

$$p(\mu,\sigma^2|y) \sim \mathcal{N}(a_T,A_T\sigma^2)\,\mathcal{IG}\left(\frac{b_T}{2},\frac{B_T}{2}\right)$$

with $a = 0$, $A \to \infty$, and $B = b = 0$.

Posterior distribution: Bayes rule and a few lines of algebra imply that

$$p(\mu,\sigma^2|y) \propto p(y|\mu,\sigma^2)p(\mu|\sigma^2)p(\sigma^2) \propto \left(\frac{1}{\sigma^2}\right)^{\frac{T+b+1}{2}+1}\exp\left(-\frac{1}{2\sigma^2}\left[\frac{(\mu-\bar{y})^2}{T^{-1}}+\frac{(\mu-a)^2}{A}+\sum_{t=1}^T(y_t-\bar{y})^2+B\right]\right).$$

Completing the square in $\mu$ implies that

$$\frac{(\mu-\bar{y})^2}{T^{-1}}+\frac{(\mu-a)^2}{A} = \frac{(\mu-a_T)^2}{A_T}+\frac{(\bar{y}-a)^2}{1/T+A},$$

where

$$\frac{a_T}{A_T} = \frac{a}{A}+\frac{\bar{y}}{T^{-1}} \quad \text{and} \quad \frac{1}{A_T} = \frac{1}{A}+\frac{1}{T^{-1}}. \quad (31)$$

Inserting this into the likelihood,

$$p(\mu,\sigma^2|y) \propto \left(\frac{1}{\sigma^2}\right)^{\frac{b+T+1}{2}+1}\exp\left(-\frac{1}{2\sigma^2}\left[\frac{(\mu-a_T)^2}{A_T}+B_T\right]\right),$$

where

$$B_T = B+\frac{(\bar{y}-a)^2}{1/T+A}+\sum_{t=1}^T(y_t-\bar{y})^2. \quad (32)$$

Given the prior structure, the posterior is conjugate if it can be expressed as

$$p(\mu,\sigma^2|y) \propto p(\mu|\sigma^2,y)\,p(\sigma^2|y). \quad (33)$$

A few lines of algebra shows that

$$p(\mu,\sigma^2|y) \propto \left(\frac{1}{\sigma^2}\right)^{\frac{1}{2}}\exp\left(-\frac{(\mu-a_T)^2}{2\sigma^2 A_T}\right)\left(\frac{1}{\sigma^2}\right)^{\frac{T+b}{2}+1}\exp\left(-\frac{B_T}{2\sigma^2}\right),$$

which implies that

$$p(\mu|\sigma^2,y) \sim \mathcal{N}(a_T,\sigma^2 A_T) \quad \text{and} \quad p(\sigma^2|y) \sim \mathcal{IG}\left(\frac{b_T}{2},\frac{B_T}{2}\right), \quad (34)$$

where $b_T = b+T$ and $B_T$ is defined above. Thus, $p(\mu,\sigma^2|y) \sim \mathcal{NIG}(a_T,A_T,b_T,B_T)$.

Marginal posterior distributions: The marginal parameter distributions, $p(\sigma^2|y)$ and $p(\mu|y)$, are both known analytically. For the first,

$$p(\sigma^2|y) = \int p(\mu,\sigma^2|y)\,d\mu \propto \left(\frac{1}{\sigma^2}\right)^{\frac{b_T}{2}+1}\exp\left(-\frac{B_T}{2\sigma^2}\right) \sim \mathcal{IG}\left(\frac{b_T}{2},\frac{B_T}{2}\right).$$

Using the integration results in Appendix 4, the marginal $p(\mu|y)$ is

$$p(\mu|y) = \int p(\mu,\sigma^2|y)\,d\sigma^2 = \int p(\mu|\sigma^2,y)\,p(\sigma^2|y)\,d\sigma^2 \propto \int \left(\frac{1}{\sigma^2}\right)^{\frac{b_T+1}{2}+1}\exp\left(-\frac{1}{\sigma^2}\left[\frac{(\mu-a_T)^2}{2A_T}+\frac{B_T}{2}\right]\right)d\sigma^2 \propto \left[1+\frac{(\mu-a_T)^2}{b_T\,\frac{A_TB_T}{b_T}}\right]^{-\frac{b_T+1}{2}},$$

which is the kernel of a t-distribution; thus $\mu|y \sim t_{b_T}\left(a_T,\frac{A_TB_T}{b_T}\right)$.

Marginal likelihood: The marginal likelihood is

$$p(y) = \iint p(y|\mu,\sigma^2)\,p(\mu|\sigma^2)\,p(\sigma^2)\,d\mu\,d\sigma^2.$$

To compute this integral, define $K_y = (2\pi)^{-\frac{T}{2}}$, $K_\sigma = (B/2)^{\frac{b}{2}}/\Gamma\left(\frac{b}{2}\right)$, and $K_\mu = (2\pi A)^{-\frac{1}{2}}$. The marginal likelihood is

$$p(y) = K_yK_\sigma K_\mu\int \left(\frac{1}{\sigma^2}\right)^{\frac{b+T+1}{2}+1}\exp\left(-\frac{S+B}{2\sigma^2}\right)\int \exp\left(-\frac{1}{2\sigma^2}\left[\frac{(\mu-\bar{y})^2}{T^{-1}}+\frac{(\mu-a)^2}{A}\right]\right)d\mu\,d\sigma^2,$$

where $S = \sum_{t=1}^T (y_t-\bar{y})^2$. Completing the square,

$$\frac{(\mu-\bar{y})^2}{T^{-1}}+\frac{(\mu-a)^2}{A} = \frac{(\mu-a_T)^2}{A_T}+\frac{(\bar{y}-a)^2}{T^{-1}+A},$$

where

$$\frac{a_T}{A_T} = \frac{\bar{y}}{T^{-1}}+\frac{a}{A} \quad \text{and} \quad \frac{1}{A_T} = \frac{1}{T^{-1}}+\frac{1}{A}.$$

The integral terms can be expressed as

$$\int \left(\frac{1}{\sigma^2}\right)^{\frac{b+T+1}{2}+1}\exp\left(-\frac{1}{2\sigma^2}\left[S+B+\frac{(\bar{y}-a)^2}{T^{-1}+A}\right]\right)\left(\int \exp\left(-\frac{(\mu-a_T)^2}{2\sigma^2 A_T}\right)d\mu\right)d\sigma^2.$$

The inner integral is

$$\int \exp\left(-\frac{(\mu-a_T)^2}{2\sigma^2 A_T}\right)d\mu = \sqrt{2\pi A_T\sigma^2}.$$

Substituting back into the main expression, the remaining integral can be computed to yield

$$p(y) = \sqrt{2\pi A_T}\,K_yK_\sigma K_\mu\int \left(\frac{1}{\sigma^2}\right)^{\frac{b+T}{2}+1}\exp\left(-\frac{S+B+\frac{(\bar{y}-a)^2}{T^{-1}+A}}{2\sigma^2}\right)d\sigma^2 = \left(\frac{A_T}{A}\right)^{\frac{1}{2}}\left(\frac{1}{2\pi}\right)^{\frac{T}{2}}\left(\frac{B}{2}\right)^{\frac{b}{2}}\frac{\Gamma\left(\frac{b+T}{2}\right)}{\Gamma\left(\frac{b}{2}\right)}\left[\frac{1}{2}\left(S+B+\frac{(\bar{y}-a)^2}{T^{-1}+A}\right)\right]^{-\frac{b+T}{2}}.$$

This is a multivariate t-distribution (in the data vector $y$).

Predictive distribution: The predictive distribution is given by

$$p(y_{T+1}|y^T) = \iint p(y_{T+1}|\mu,\sigma^2)\,p(\mu,\sigma^2|y^T)\,d\mu\,d\sigma^2.$$

To simplify, first compute the integral against $\mu$ by substituting from the posterior. Since $p(\mu|\sigma^2,y^T) \sim \mathcal{N}(a_T,\sigma^2 A_T)$, conditional on $y^T$ we have that

$$\mu = a_T + \sigma\sqrt{A_T}\,\tilde{\varepsilon}_T,$$

where $\tilde{\varepsilon}_T$ is an independent standard normal random variable. Substituting in $y_{T+1} = \mu + \sigma\varepsilon_{T+1}$,

$$y_{T+1} = a_T + \sigma\sqrt{A_T}\,\tilde{\varepsilon}_T + \sigma\varepsilon_{T+1} = a_T + \sigma\eta_{T+1},$$

where $\eta_{T+1} \sim \mathcal{N}(0,A_T+1)$. Thus,

$$p(y_{T+1}|y) \propto \int \left(\frac{1}{\sigma^2}\right)^{\frac{1}{2}}\exp\left(-\frac{(y_{T+1}-a_T)^2}{2\sigma^2(A_T+1)}\right)\left(\frac{1}{\sigma^2}\right)^{\frac{b_T}{2}+1}\exp\left(-\frac{B_T}{2\sigma^2}\right)d\sigma^2 \propto \left[1+\frac{(y_{T+1}-a_T)^2}{B_T(A_T+1)}\right]^{-\frac{b_T+1}{2}},$$

which is a Student t distribution,

$$y_{T+1}|y \sim t_{b_T}\left(a_T,\frac{B_T(A_T+1)}{b_T}\right).$$

2.8 Linear Regression

Likelihood: Consider a regression model specification, $y_t|\beta,x_t,\sigma^2 \sim \mathcal{N}(x_t\beta,\sigma^2)$, where $x_t$ is a $1 \times p$ vector of observed covariates, $\beta$ is a $p \times 1$ vector of regression coefficients, and $\varepsilon_t \sim \mathcal{N}(0,\sigma^2)$. The model can be expressed as $y = x\beta + \varepsilon$, where $y$ is a $T \times 1$ vector of dependent variables, $x$ is a $T \times p$ matrix of regressor variables, and $\varepsilon$ is a $T \times 1$ vector of errors. The likelihood function is

$$p(y|\beta,\sigma^2,x) = \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{T}{2}}\exp\left(-\frac{\sum_{t=1}^T(y_t-x_t\beta)^2}{2\sigma^2}\right) \quad (35)$$

$$\propto \left(\frac{1}{\sigma^2}\right)^{\frac{T}{2}}\exp\left(-\frac{(y-x\beta)'(y-x\beta)}{2\sigma^2}\right). \quad (36)$$

The exponential term can be expressed in a more useful form. Defining the OLS estimator as $\hat{\beta} = (x'x)^{-1}x'y$ and the residual sum of squares by $S = (y-x\hat{\beta})'(y-x\hat{\beta})$, then

$$(y-x\beta)'(y-x\beta) = \left(y-x\hat{\beta}-x(\beta-\hat{\beta})\right)'\left(y-x\hat{\beta}-x(\beta-\hat{\beta})\right) = (\beta-\hat{\beta})'(x'x)(\beta-\hat{\beta})+S,$$

since $(y-x\hat{\beta})'x(\beta-\hat{\beta}) = 0$. Thus, the likelihood is

$$p(y|\beta,\sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{T}{2}}\exp\left(-\frac{1}{2\sigma^2}(\beta-\hat{\beta})'(x'x)(\beta-\hat{\beta})-\frac{S}{2\sigma^2}\right),$$

and $(\hat{\beta},S)$ are sufficient for $(\beta,\sigma^2)$.

Information matrix: By direct analogy with the location-scale case, the per-observation information matrix is block diagonal, with blocks $x_t'x_t/\sigma^2$ for $\beta$ and $1/(2\sigma^4)$ for $\sigma^2$.
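The conjugate updating this section derives can be sketched for a single covariate, where the matrix formulas reduce to scalars (Python; function name ours; hyperparameters follow the $\mathcal{NIG}(a,A,b,B)$ prior used in this section):

```python
def regression_update(x, y, a, A, b, B):
    """Single-covariate conjugate update:
    A_T = (x'x + 1/A)^{-1},  a_T = A_T (x'y + a/A),
    b_T = b + T,  B_T = B + y'y + a^2/A - a_T^2/A_T."""
    xx = sum(xi * xi for xi in x)
    xy = sum(xi * yi for xi, yi in zip(x, y))
    yy = sum(yi * yi for yi in y)
    A_T = 1.0 / (xx + 1.0 / A)
    a_T = A_T * (xy + a / A)
    b_T = b + len(y)
    B_T = B + yy + a * a / A - a_T * a_T / A_T
    return a_T, A_T, b_T, B_T

# noiseless data y = 2x: with a diffuse prior, a_T recovers the OLS slope 2
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
a_T, A_T, b_T, B_T = regression_update(x, y, 0.0, 1e8, 2.0, 1.0)
```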

Prior: A proper conjugate prior is given by $p(\beta,\sigma^2) \sim \mathcal{NIG}(a,A,b,B)$, which implies that $p(\beta|\sigma^2) \sim \mathcal{N}(a,\sigma^2 A)$ and $p(\sigma^2) \sim \mathcal{IG}\left(\frac{b}{2},\frac{B}{2}\right)$, where the multivariate normal density is

$$p(\beta|\sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{p}{2}}|A|^{-\frac{1}{2}}\exp\left(-\frac{1}{2\sigma^2}(\beta-a)'A^{-1}(\beta-a)\right),$$

since $|\sigma^2 A| = (\sigma^2)^p|A|$. For the variance parameter,

$$p(\sigma^2) = \frac{(B/2)^{\frac{b}{2}}}{\Gamma\left(\frac{b}{2}\right)}\left(\frac{1}{\sigma^2}\right)^{\frac{b}{2}+1}\exp\left(-\frac{B}{2\sigma^2}\right).$$

By analogy to the location-scale case, the marginal prior for $\beta$ is a multivariate t-distribution, $\beta \sim t_{p,b}(a,BA)$. This can be shown by direct integration:

$$p(\beta) = \int p(\beta|\sigma^2)p(\sigma^2)\,d\sigma^2 \propto \left[B+(\beta-a)'A^{-1}(\beta-a)\right]^{-\frac{b+p}{2}} \propto \left[1+(\beta-a)'(BA)^{-1}(\beta-a)\right]^{-\frac{b+p}{2}}.$$

Jeffreys did not formally consider multiple parameter cases, but his rule can be used to derive a prior for this model. The Jeffreys prior is given by $p(\beta,\sigma^2) \propto 1/\sigma^{p+2}$.

Posterior: The posterior distribution is

$$p(\beta,\sigma^2|y) \propto \left(\frac{1}{\sigma^2}\right)^{\frac{b+T}{2}+1}\left(\frac{1}{\sigma^2}\right)^{\frac{p}{2}}\exp\left(-\frac{1}{2\sigma^2}Q(\beta)\right), \quad (37)$$

where the quadratic form $Q(\beta)$ is given by

$$Q(\beta) = (\beta-\hat{\beta})'(x'x)(\beta-\hat{\beta})+(\beta-a)'A^{-1}(\beta-a)+B+S. \quad (38)$$

Using the matrix complete-the-square formula, we have that

$$Q(\beta) = (\beta-a_T)'A_T^{-1}(\beta-a_T)+\left(\hat{\beta}-a\right)'\left[(x'x)^{-1}+A\right]^{-1}\left(\hat{\beta}-a\right)+B+S, \quad (39)$$

where

$$A_T = \left(x'x+A^{-1}\right)^{-1} \quad \text{and} \quad a_T = A_T\left(x'x\hat{\beta}+A^{-1}a\right).$$

Substituting in $\hat{\beta}$ implies that $a_T = A_T\left[x'y+A^{-1}a\right]$. Thus,

$$Q(\beta) = (\beta-a_T)'A_T^{-1}(\beta-a_T)+B_T, \quad (40)$$

where

$$B_T = B+y'y+a'A^{-1}a-a_T'A_T^{-1}a_T. \quad (41)$$

For future reference, a bit of matrix algebra implies that $B_T$ can be expressed alternatively as

$$B_T = B+S+\left(\hat{\beta}-a\right)'\left[A+(x'x)^{-1}\right]^{-1}\left(\hat{\beta}-a\right) \quad (42)$$

or

$$B_T = B+(y-xa)'(I+xAx')^{-1}(y-xa), \quad (43)$$

which is useful since it depends explicitly on $S$, the sum of squared residuals. Thus, the posterior is

$$p(\beta,\sigma^2|y) \propto \left(\frac{1}{\sigma^2}\right)^{\frac{p}{2}}\exp\left(-\frac{1}{2\sigma^2}(\beta-a_T)'A_T^{-1}(\beta-a_T)\right)\left(\frac{1}{\sigma^2}\right)^{\frac{b+T}{2}+1}\exp\left(-\frac{B_T}{2\sigma^2}\right) \quad (44)$$

$$\propto p(\beta|\sigma^2,x,y)\,p(\sigma^2|x,y), \quad (45)$$

where

$$p(\beta|\sigma^2,x,y) \sim \mathcal{N}(a_T,\sigma^2 A_T) \quad \text{and} \quad p(\sigma^2|x,y) \sim \mathcal{IG}\left(\frac{b_T}{2},\frac{B_T}{2}\right), \quad (46)$$

and $b_T = b+T$.

Marginal posteriors: The marginal posterior for the variance is immediate, $p(\sigma^2|x,y) \sim \mathcal{IG}\left(\frac{b_T}{2},\frac{B_T}{2}\right)$. For the regression parameters,

$$p(\beta|y) = \int p(\beta,\sigma^2|y)\,d\sigma^2 \propto \int \left(\frac{1}{\sigma^2}\right)^{\frac{b_T+p}{2}+1}\exp\left(-\frac{1}{2\sigma^2}\left[B_T+(\beta-a_T)'A_T^{-1}(\beta-a_T)\right]\right)d\sigma^2.$$

28 This is a univariate integral in 2:

bT +p 2 +1 1 1 1 2 exp BT + ( aT )0 A ( aT ) d 2 22 T Z     bT +p 1  2  BT + ( aT )0 A ( aT ) T , / 2  

bT +p 1 2 p ( y) BT + ( aT )0 A ( aT ) j / T bT +p 1 2  AT BT  ( aT )0 b+T ( aT ) BT + ; / " (b + T) #

which is the kernel of a multivariate t-distribution. Thus, $p\left(\beta|y\right) \sim t_{b_{T}}\left(a_{T}, A_{T}B_{T}/b_{T}\right)$. The moments of the posterior are known: $E\left(\beta|y\right) = a_{T}$. Since $x'x\left(x'x\right)^{-1}x'y = x'x\hat{\beta}$, the posterior mean can be expressed as a shrinkage estimator of the form

$$E\left(\beta|x,y\right) = W\hat{\beta} + \left(I - W\right)a,$$

where $W = \left(A^{-1} + x'x\right)^{-1}x'x$ is the matrix of shrinkage weights.

Marginal likelihood: The marginal likelihood is given by

$$p\left(y\right) = \iint p\left(y|\sigma^{2},\beta\right)p\left(\sigma^{2},\beta\right)d\beta\, d\sigma^{2} = \int\left[\int p\left(y|\sigma^{2},\beta\right)p\left(\beta|\sigma^{2}\right)d\beta\right]p\left(\sigma^{2}\right)d\sigma^{2}.$$

The inner integral involving $\beta$ can be done by brute force or using the integration by substitution trick, with the latter being far simpler. Express the model as $y = x\beta + \varepsilon$, where $\varepsilon \sim \mathcal{MVN}\left(0,\sigma^{2}I\right)$. Since $p\left(\beta|\sigma^{2}\right) \sim \mathcal{MVN}\left(a,\sigma^{2}A\right)$, this implies that $\beta = a + \sigma A^{\frac{1}{2}}\xi$, for $\xi \sim \mathcal{MVN}\left(0,I\right)$. Thus

$$y = x\beta + \varepsilon = xa + \sigma xA^{\frac{1}{2}}\xi + \varepsilon,$$

which implies that $p\left(y|\sigma^{2}\right) \sim \mathcal{MVN}\left(xa,\sigma^{2}\left(I + xAx'\right)\right)$. The determinant of $\left(I + xAx'\right)$ is needed to express the multivariate density of $y$. To do this, note the matrix identity $\left|D + EFG\right| = \left|D\right|\left|F\right|\left|F^{-1} + GD^{-1}E\right|$, which implies that $\left|I + xAx'\right| = \left|A\right|\left|A^{-1} + x'x\right| = \left|A\right|/\left|A_{T}\right|$. Thus,

$$p\left(y|\sigma^{2}\right) = \frac{\left|A_{T}\right|^{\frac{1}{2}}}{\left(2\pi\sigma^{2}\right)^{\frac{T}{2}}\left|A\right|^{\frac{1}{2}}}\exp\left(-\frac{1}{2\sigma^{2}}\left(y-xa\right)'\left[I + xAx'\right]^{-1}\left(y-xa\right)\right).$$

Using equations (42) and (43), $B_{T} = B + \left(y-xa\right)'\left[I + xAx'\right]^{-1}\left(y-xa\right)$. Given this, the outer integral is

$$p\left(y\right) = \int p\left(y|\sigma^{2}\right)p\left(\sigma^{2}\right)d\sigma^{2}, \quad \text{where} \quad p\left(y|\sigma^{2}\right)p\left(\sigma^{2}\right) = \frac{\left|A_{T}\right|^{\frac{1}{2}}}{\left|A\right|^{\frac{1}{2}}}\frac{\left(B/2\right)^{b/2}}{\left(2\pi\right)^{\frac{T}{2}}\Gamma\left(b/2\right)}\left(\frac{1}{\sigma^{2}}\right)^{\frac{b+T}{2}+1}\exp\left(-\frac{B_{T}}{2\sigma^{2}}\right).$$

The marginal likelihood is then

$$p\left(y\right) = \frac{\left|A_{T}\right|^{\frac{1}{2}}}{\left|A\right|^{\frac{1}{2}}}\frac{\left(B/2\right)^{b/2}}{\left(2\pi\right)^{\frac{T}{2}}\Gamma\left(b/2\right)}\int\left(\frac{1}{\sigma^{2}}\right)^{\frac{b+T}{2}+1}\exp\left(-\frac{B_{T}}{2\sigma^{2}}\right)d\sigma^{2},$$

and since

$$\int\left(\frac{1}{\sigma^{2}}\right)^{\frac{b_{T}}{2}+1}\exp\left(-\frac{B_{T}}{2\sigma^{2}}\right)d\sigma^{2} = \Gamma\left(\frac{b_{T}}{2}\right)\left(\frac{B_{T}}{2}\right)^{-\frac{b_{T}}{2}},$$

$$p\left(y\right) = \frac{\left|A_{T}\right|^{\frac{1}{2}}}{\left|A\right|^{\frac{1}{2}}}\frac{\left(B/2\right)^{b/2}}{\left(2\pi\right)^{\frac{T}{2}}}\frac{\Gamma\left(b_{T}/2\right)}{\Gamma\left(b/2\right)}\left(\frac{B_{T}}{2}\right)^{-\frac{b_{T}}{2}}.$$

2.9 Multivariate Normal observations (through here)

Likelihood: Suppose $y_{t}|\mu,\Sigma \sim \mathcal{N}_{m}\left(\mu,\Sigma\right)$, where $y_{t} = \left(y_{t}^{1},\ldots,y_{t}^{m}\right)$ and $\mu = \left(\mu^{1},\ldots,\mu^{m}\right)$ are both $1 \times m$ vectors and $\Sigma$ is an $m \times m$ positive definite covariance matrix. It is convenient to stack the system as $y = X\mu + \varepsilon$, where

$$y = \begin{pmatrix} y_{1} \\ \vdots \\ y_{T} \end{pmatrix}, \quad X = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}, \quad \text{and} \quad \varepsilon = \begin{pmatrix} \varepsilon_{1} \\ \vdots \\ \varepsilon_{T} \end{pmatrix},$$

where $y$ and $\varepsilon$ are $T \times m$ matrices and $X$ is a $T \times 1$ vector of ones. The likelihood function is given by

$$p\left(y|\mu,\Sigma\right) = \prod_{t=1}^{T}p\left(y_{t}|\mu,\Sigma\right) = \prod_{t=1}^{T}\left(2\pi\right)^{-\frac{m}{2}}\left|\Sigma\right|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}\left(y_{t}-\mu\right)\Sigma^{-1}\left(y_{t}-\mu\right)'\right)$$
$$= \left(2\pi\right)^{-\frac{Tm}{2}}\left|\Sigma\right|^{-\frac{T}{2}}\exp\left(-\frac{1}{2}\sum_{t=1}^{T}\left(y_{t}-\mu\right)\Sigma^{-1}\left(y_{t}-\mu\right)'\right)$$
$$= \left(2\pi\right)^{-\frac{Tm}{2}}\left|\Sigma\right|^{-\frac{T}{2}}\exp\left(-\frac{1}{2}\mathrm{tr}\left[\left(y-X\mu\right)'\left(y-X\mu\right)\Sigma^{-1}\right]\right).$$

This uses the fact that

$$\sum_{t=1}^{T}\left(y_{t}-\mu\right)\Sigma^{-1}\left(y_{t}-\mu\right)' = \mathrm{tr}\left(\sum_{t=1}^{T}\left(y_{t}-\mu\right)'\left(y_{t}-\mu\right)\Sigma^{-1}\right).$$

Since

$$\left(y-X\mu\right)'\left(y-X\mu\right) = \left(y-X\bar{y}\right)'\left(y-X\bar{y}\right) + \left(\mu-\bar{y}\right)'X'X\left(\mu-\bar{y}\right) = S + \left(\mu-\bar{y}\right)'X'X\left(\mu-\bar{y}\right),$$

where

$$S = \left(y-X\bar{y}\right)'\left(y-X\bar{y}\right) = \sum_{t=1}^{T}\left(y_{t}-\bar{y}\right)'\left(y_{t}-\bar{y}\right)$$

is $T$ times the observed variance-covariance matrix and

$$\bar{y} = \left(X'X\right)^{-1}X'y = T^{-1}\sum_{t=1}^{T}y_{t}.$$

Then

$$p\left(y|\mu,\Sigma\right) \propto \left|\Sigma\right|^{-\frac{T}{2}}\exp\left(-\frac{1}{2}\mathrm{tr}\left[\left(S + \left(\mu-\bar{y}\right)'X'X\left(\mu-\bar{y}\right)\right)\Sigma^{-1}\right]\right).$$

The sufficient statistics are $S$ and $\bar{y}$, since $X$ does not depend on the data.

Information matrix/tensor:
Posterior distribution:
Marginal posterior distributions:
Marginal likelihood:
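The trace identity and the sufficient-statistic decomposition above can be checked numerically. The following is a minimal sketch using NumPy with synthetic data; the variable names and the specific covariance matrix are illustrative, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
T, m = 50, 3
y = rng.normal(size=(T, m))          # rows are the 1 x m observations y_t
mu = rng.normal(size=m)              # a 1 x m mean vector
C = rng.normal(size=(m, m))
Sigma = C @ C.T + m * np.eye(m)      # an arbitrary positive-definite Sigma
Sinv = np.linalg.inv(Sigma)

# Sum-over-observations form of the quadratic term in the likelihood
quad_sum = sum((y[t] - mu) @ Sinv @ (y[t] - mu) for t in range(T))

# Stacked trace form: tr[(y - X mu)'(y - X mu) Sigma^{-1}], X a column of ones
quad_tr = np.trace((y - mu).T @ (y - mu) @ Sinv)

# Decomposition into sufficient statistics S and ybar:
# (y - X mu)'(y - X mu) = S + T (mu - ybar)'(mu - ybar)
ybar = y.mean(axis=0)
S = (y - ybar).T @ (y - ybar)
decomp = S + T * np.outer(mu - ybar, mu - ybar)
```

Both quadratic forms agree, and the decomposition reproduces the full cross-product matrix, confirming that $(S, \bar{y})$ carry all the information about $(\mu, \Sigma)$.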

2.10 Multivariate regression (need)

3 Direct sampling: generating i.i.d. random variables from distributions

The Gibbs sampler and MH algorithms require simulating i.i.d. random variables from "recognizable" distributions. Appendix 1 provides a list of common "recognizable" distributions, along with methods for generating random variables from these distributions. This section briefly reviews the standard methods and approaches for simulating random variables from recognizable distributions.

Most of these algorithms first generate random variables from a relatively simple "building block" distribution, such as a uniform or normal distribution, and then transform these draws to obtain a sample from another distribution. This section describes a number of these approaches that are commonly encountered in practice. Most "random-number" generators actually use deterministic methods, along with transformations. In this regard, it is important to remember a famous quotation attributed to von Neumann: "Anyone who attempts to generate random numbers by deterministic means is, of course, living in a state of sin."

Inverse CDF method The inverse distribution method uses samples of uniform random variables to generate draws from random variables with a continuous distribution function, $F$. Since $F\left(X\right)$ is uniformly distributed on $[0,1]$, draw a uniform random variable and invert the CDF to get a draw from $F$. Thus, to sample from $F$,

Step 1: Draw $U \sim \mathcal{U}\left[0,1\right]$
Step 2: Set $X = F^{-1}\left(U\right)$,

where $F^{-1}\left(U\right) = \inf\left\{x : F\left(x\right) \geq U\right\}$.

This inversion method provides i.i.d. draws from $F$ provided that $F^{-1}\left(U\right)$ can be exactly calculated. For example, the CDF of an exponential random variable with parameter $\lambda$ is $F\left(x\right) = 1 - \exp\left(-\lambda x\right)$, which can easily be inverted: $X = -\log\left(1-U\right)/\lambda$. When $F^{-1}$ cannot be analytically calculated, approximate inversions can be used. For example, suppose that the density is a known analytical function. Then, $F\left(x\right)$ can be computed to an arbitrary degree of accuracy on a grid and inversions can be approximately calculated, generating an approximate draw

from $F$. With all approximations, there is a natural trade-off between computational speed and accuracy. One example where efficient approximations are possible is inversions involving normal distributions, which is useful for generating truncated normal random variables. Outside of these limited cases, the inverse transform method does not provide a computationally attractive approach for drawing random variables from a given distribution function. In particular, it does not work well in multiple dimensions.
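The exponential example above can be sketched in a few lines of NumPy; the function name and parameter values are illustrative:

```python
import numpy as np

def exp_inverse_cdf(u, lam):
    """Invert F(x) = 1 - exp(-lam*x): F^{-1}(u) = -log(1-u)/lam."""
    # log1p(-u) computes log(1-u) accurately for u near 0
    return -np.log1p(-u) / lam

rng = np.random.default_rng(1)
lam = 2.0
u = rng.uniform(size=100_000)        # Step 1: uniform draws
x = exp_inverse_cdf(u, lam)          # Step 2: invert the CDF
# An exponential with rate lam has mean 1/lam = 0.5
```

The same two-step recipe applies to any distribution whose CDF can be inverted in closed form.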

Functional Transformations The second main method uses functional transformations to express the distribution of a random variable that is a known function of another random variable. Suppose that $X \sim F$, admitting a density $f$, and that $y = h\left(x\right)$ is an increasing continuous function. Thus, we can define $x = h^{-1}\left(y\right)$ as the inverse of the function $h$. The distribution of $Y$ is given by

$$F_{Y}\left(a\right) = \mathrm{Prob}\left(Y \leq a\right) = \int_{-\infty}^{h^{-1}\left(a\right)}f\left(x\right)dx = F_{X}\left(h^{-1}\left(a\right)\right).$$

Differentiating with respect to $y$ gives the density via Leibnitz's rule:

$$f_{Y}\left(y\right) = f\left(h^{-1}\left(y\right)\right)\frac{d}{dy}h^{-1}\left(y\right),$$

where we make explicit that the density is over the random variable $Y$. This result is used widely. For example, if $X \sim \mathcal{N}\left(0,1\right)$, then $Y = \mu + \sigma X$. Since $x = h^{-1}\left(y\right) = \frac{y-\mu}{\sigma}$, the distribution function is $F_{X}\left(\frac{y-\mu}{\sigma}\right)$ and the density is

$$f_{Y}\left(y\right) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^{2}\right).$$

Transformations are widely used to simulate both univariate and multivariate random variables. As examples, if $Y \sim \chi^{2}\left(\nu\right)$ and $\nu$ is an integer, then $Y = \sum_{i=1}^{\nu}X_{i}^{2}$ where each $X_{i}$ is independent standard normal. Exponential random variables can be used to simulate $\chi^{2}$, gamma, beta, and Poisson random variables. The famous Box-Muller algorithm simulates normals from uniform and exponential random variables. In the multivariate setting, Wishart (and inverse Wishart) random variables can be simulated via sums of outer products of standard normal random vectors.
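The Box-Muller transformation and the location-scale transformation mentioned above can be sketched as follows; this is a minimal NumPy illustration with arbitrary values for $\mu$ and $\sigma$:

```python
import numpy as np

def box_muller(n, rng):
    """Generate 2n standard normals from 2n uniforms via Box-Muller."""
    u1 = rng.uniform(size=n)
    u2 = rng.uniform(size=n)
    # -2*log(1-u1) is an exponential draw (1-u1 is in (0,1], so log is safe)
    r = np.sqrt(-2.0 * np.log1p(-u1))
    theta = 2.0 * np.pi * u2
    return np.concatenate([r * np.cos(theta), r * np.sin(theta)])

rng = np.random.default_rng(2)
z = box_muller(100_000, rng)   # ~ N(0, 1)
y = 1.5 + 2.0 * z              # location-scale transform: Y ~ N(1.5, 4)
```

In practice, library routines are preferred, but the example shows how a uniform "building block" yields normal draws through a deterministic transformation.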

Mixture distributions A special case of the transformation method in the multidimensional setting generates continuous mixture distributions. The density of a continuous mixture distribution is given by

$$p\left(x\right) = \int p\left(x|\lambda\right)p\left(\lambda\right)d\lambda,$$

where $p\left(x|\lambda\right)$ is viewed as a density conditional on the parameter $\lambda$. One example of this is the class of scale mixtures of normal distributions, where

$$p\left(x|\lambda\right) = \frac{1}{\sqrt{2\pi\lambda}}\exp\left(-\frac{x^{2}}{2\lambda}\right)$$

and $\lambda$ is the conditional variance of $X$. It is often simpler just to write

$$X = \sqrt{\lambda}\,\varepsilon,$$

where $\varepsilon \sim \mathcal{N}\left(0,1\right)$. The distribution of $\sqrt{\lambda}$ determines the marginal distribution of $X$. Here are a number of examples of scale mixture distributions.

t-distribution. The t-distribution arises in many problems involving inverse gamma priors and conditionally normally distributed likelihoods. If $p\left(y_{t}|\lambda\right) \sim \mathcal{N}\left(0,\lambda\right)$ and $p\left(\lambda\right) \sim \mathcal{IG}\left(\frac{b}{2},\frac{B}{2}\right)$, then the marginal distribution of $y_{t}$ is $t_{b}\left(0,B/b\right)$. The proof is direct by analytically computing the marginal distribution,

$$p\left(y_{t}\right) = \int p\left(y_{t}|\lambda\right)p\left(\lambda\right)d\lambda.$$

Using the integration results in Appendix 4:

$$p\left(y\right) = \int_{0}^{\infty}\frac{1}{\sqrt{2\pi\lambda}}\exp\left(-\frac{y^{2}}{2\lambda}\right)\frac{\left(B/2\right)^{b/2}}{\Gamma\left(b/2\right)}\left(\frac{1}{\lambda}\right)^{\frac{b}{2}+1}\exp\left(-\frac{B}{2\lambda}\right)d\lambda$$
$$\propto \int_{0}^{\infty}\left(\frac{1}{\lambda}\right)^{\frac{b+1}{2}+1}\exp\left(-\frac{y^{2}+B}{2\lambda}\right)d\lambda,$$

which takes the general form of equation (49). Computing the integral,

$$p\left(y\right) \propto \left[B + y^{2}\right]^{-\frac{b+1}{2}} \propto \left[1 + \frac{y^{2}}{b\left(B/b\right)}\right]^{-\frac{b+1}{2}},$$

thus $y_{t} \sim t_{b}\left(0,B/b\right)$. More generally, if $\mu$ is known, then $y_{t}|\mu,\lambda \sim \mathcal{N}\left(\mu,\lambda\right)$ and $\lambda \sim \mathcal{IG}\left(\frac{b}{2},\frac{B}{2}\right)$ imply that $p\left(y_{t}|\mu\right) \sim t_{b}\left(\mu,B/b\right)$.

Double-exponential distribution. The double exponential distribution is also a scale mixture distribution: if $p\left(y_{t}|\lambda\right) \sim \mathcal{N}\left(0,\lambda\right)$ and $\lambda \sim \exp\left(2\right)$, then the marginal distribution of $y_{t}$ is $\mathcal{DE}\left(0,1\right)$. The proof is again by direct integration using the integration results in the appendix:

$$\int_{0}^{\infty}\frac{1}{\sqrt{2\pi\lambda}}\exp\left(-\frac{y^{2}}{2\lambda}\right)\frac{1}{2}\exp\left(-\frac{\lambda}{2}\right)d\lambda = \frac{1}{2}\exp\left(-\left|y\right|\right),$$

by substituting $a = 1$ and $b = \left|y\right|$ into integral (50). More generally, if $\mu$ and $\sigma^{2}$ are known, then $y_{t}|\mu,\sigma^{2},\lambda \sim \mathcal{N}\left(\mu,\sigma^{2}\lambda\right)$ and $\lambda \sim \exp\left(2\right)$ imply that $p\left(y_{t}|\mu,\sigma^{2}\right) \sim \mathcal{DE}\left(\mu,\sigma^{2}\right)$, substituting $b = \left|y-\mu\right|/\sigma$ and multiplying both sides by $\sigma^{-1}$.

Asymmetric Laplacean. The asymmetric Laplacean distribution is a scale mixture of normal distributions: if

$$p\left(y_{t}|\lambda\right) \sim \mathcal{N}\left(\left(1-2\gamma\right)\lambda,\lambda\right) \quad \text{and} \quad \lambda \sim \exp\left(\frac{1}{2\gamma\left(1-\gamma\right)}\right),$$

then $y_{t} \sim \mathcal{CE}\left(\gamma;0,1\right)$. The proof uses integral (51) in the appendix:

$$\int_{0}^{\infty}\frac{1}{\sqrt{2\pi\lambda}}\exp\left(-\frac{1}{2\lambda}\left(y + \left(2\gamma-1\right)\lambda\right)^{2}\right)2\gamma\left(1-\gamma\right)\exp\left(-2\gamma\left(1-\gamma\right)\lambda\right)d\lambda = 2\gamma\left(1-\gamma\right)\exp\left(-\left|y\right| - \left(2\gamma-1\right)y\right).$$

More generally, if $\mu$ and $\sigma^{2}$ are known, $y_{t}|\mu,\sigma^{2},\lambda \sim \mathcal{N}\left(\mu + \left(1-2\gamma\right)\lambda,\sigma^{2}\lambda\right)$ and $\lambda \sim \exp\left(\frac{1}{2\gamma\left(1-\gamma\right)}\right)$ imply that $p\left(y_{t}|\mu,\sigma^{2}\right) \sim \mathcal{CE}\left(\gamma;\mu,\sigma^{2}\right)$.

Exponential power family. The exponential power family (Box and Tiao, 1973) is given by

$$p\left(y|\sigma,\beta\right) = \left(2\sigma\right)^{-1}c\left(\beta\right)\exp\left(-\left|\frac{y}{\sigma}\right|^{\frac{2}{1+\beta}}\right), \quad (47)$$

where $c\left(\beta\right)^{-1} = \Gamma\left(1 + \frac{1+\beta}{2}\right)$. Following West (1987) we have the following scale mixture of normals representation

$$p\left(y|\sigma=1,\beta\right) = c\left(\beta\right)\exp\left(-\left|y\right|^{\frac{2}{1+\beta}}\right) = \int_{0}^{\infty}\frac{1}{\sqrt{2\pi\lambda}}\exp\left(-\frac{y^{2}}{2\lambda}\right)p\left(\lambda\right)d\lambda, \quad (48)$$

where

$$p\left(\lambda\right) \propto \lambda^{-3/2}\,St_{\frac{1+\beta}{2}}^{+}\left(\lambda^{-1}\right)$$

and $St_{\alpha}^{+}$ is the (analytically intractable) density of a positive stable distribution.
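The t and double-exponential representations above translate directly into simulation recipes: draw the mixing variable $\lambda$, then draw $\sqrt{\lambda}\,\varepsilon$. A minimal NumPy sketch (the parameter values $b = B = 5$ are illustrative; note that an $\mathcal{IG}(b/2, B/2)$ draw is the reciprocal of a gamma draw with shape $b/2$ and scale $2/B$):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
z = rng.standard_normal(n)

# Student-t via inverse-gamma mixing:
# lam ~ IG(b/2, B/2)  =>  sqrt(lam)*z ~ t_b(0, B/b)
b, B = 5.0, 5.0
lam_t = 1.0 / rng.gamma(b / 2.0, 2.0 / B, size=n)   # IG(b/2, B/2) draws
t_draws = np.sqrt(lam_t) * z                        # ~ t_5; variance b/(b-2) = 5/3

# Double exponential via exponential mixing:
# lam ~ exp(mean 2)  =>  sqrt(lam)*eps ~ DE(0,1), density (1/2)exp(-|y|)
lam_de = rng.exponential(scale=2.0, size=n)
de_draws = np.sqrt(lam_de) * rng.standard_normal(n)
```

The t draws should have variance $b/(b-2) = 5/3$ and the double-exponential draws variance 2, which provides a quick check on the representations.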

Factorization Method Another method that is useful in some multivariate settings is known as factorization. The rules of probability imply that a joint density, $p\left(x_{1},\ldots,x_{n}\right)$, can always be factored as

$$p\left(x_{1},\ldots,x_{n}\right) = p\left(x_{n}|x_{1},\ldots,x_{n-1}\right)p\left(x_{1},\ldots,x_{n-1}\right) = p\left(x_{1}\right)p\left(x_{2}|x_{1}\right)\cdots p\left(x_{n}|x_{1},\ldots,x_{n-1}\right).$$

In this case, simulating $X_{1} \sim p\left(x_{1}\right)$, $X_{2} \sim p\left(x_{2}|X_{1}\right)$, ..., $X_{n} \sim p\left(x_{n}|X_{1},\ldots,X_{n-1}\right)$ generates a draw from the joint distribution. This procedure is common in Bayesian statistics, where the distribution of $X_{1}$ is a marginal distribution and the other distributions are conditional distributions. This is used repeatedly to express a joint posterior in terms of lower dimensional conditional posteriors.
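The factorization recipe can be sketched with a bivariate normal: drawing $X_1$ from its marginal and $X_2$ from its conditional reproduces the joint distribution. A minimal NumPy illustration with an arbitrary correlation $\rho = 0.7$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, rho = 200_000, 0.7

# p(x1, x2) = p(x1) * p(x2 | x1): marginal draw, then conditional draw
x1 = rng.standard_normal(n)                                   # X1 ~ N(0, 1)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)  # X2|X1 ~ N(rho*X1, 1-rho^2)
```

The pairs $(X_1, X_2)$ are then jointly standard bivariate normal with correlation $\rho$, even though neither draw required a joint sampler.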

Rejection sampling The final general method discussed here is the accept-reject method, developed by von Neumann. Suppose that the goal is to generate a sample from $f\left(x\right)$, where it is assumed that $f$ is bounded, that is, $f\left(x\right) \leq cg\left(x\right)$ for some constant $c$ and density $g$. The accept-reject algorithm is a two-step procedure:

Step 1: Draw $U \sim \mathcal{U}\left[0,1\right]$ and $X \sim g$
Step 2: Accept $Y = X$ if $U \leq \frac{f\left(X\right)}{cg\left(X\right)}$, otherwise return to Step 1.

Rejection sampling simulates repeatedly until a draw satisfying $U \leq \frac{f\left(X\right)}{cg\left(X\right)}$ is found. By direct calculation, it is clear that $Y$ has density $f$:

$$\mathrm{Prob}\left(Y \leq y\right) = \mathrm{Prob}\left(X \leq y \,\middle|\, U \leq \frac{f\left(X\right)}{cg\left(X\right)}\right) = \frac{\mathrm{Prob}\left(X \leq y,\, U \leq \frac{f\left(X\right)}{cg\left(X\right)}\right)}{\mathrm{Prob}\left(U \leq \frac{f\left(X\right)}{cg\left(X\right)}\right)}$$
$$= \frac{\int_{-\infty}^{y}\left[\int_{0}^{f\left(x\right)/cg\left(x\right)}du\right]g\left(x\right)dx}{\int_{-\infty}^{\infty}\left[\int_{0}^{f\left(x\right)/cg\left(x\right)}du\right]g\left(x\right)dx} = \frac{\frac{1}{c}\int_{-\infty}^{y}f\left(x\right)dx}{\frac{1}{c}\int_{-\infty}^{\infty}f\left(x\right)dx} = \int_{-\infty}^{y}f\left(x\right)dx.$$

Rejection sampling requires (a) a bounding or dominating density $g$, (b) an ability to evaluate the ratio $f/g$, (c) an ability to simulate i.i.d. draws from $g$, and (d) the bounding constant $c$. Rejection sampling does not require that the normalization constant $\int f\left(x\right)dx$ be known, since the algorithm only requires knowledge of $f/c$. For continuous densities on a bounded support, it is easy to satisfy (a) and (c) (a uniform density works), but for

continuous densities on unbounded support it can be more difficult since we need to find a density with heavier tails and higher peaks. Setting $c = \sup_{x}\frac{f\left(x\right)}{g\left(x\right)}$ maximizes the acceptance probability. In practice, finding the constant is difficult because $f$ generally depends on a multi-dimensional parameter vector, $f\left(x|\theta_{f}\right)$, and thus the bounding is over $x$ and $\theta_{f}$.

Rejection sampling is often used to generate random variables from various recognizable distributions, such as the gamma or beta distribution. In these cases, the structure of the densities is well known and the bounding density can be tailored to generate fast and efficient rejection sampling algorithms. In many of these cases, the densities are log-concave (e.g., normal, double exponential, gamma, and beta). A density $f$ is log-concave if $\ln\left(f\left(x\right)\right)$ is concave. For differentiable densities, this is equivalent to assuming that $\frac{d\ln f\left(x\right)}{dx} = \frac{f'\left(x\right)}{f\left(x\right)}$ is non-increasing in $x$ and $\frac{d^{2}\ln f\left(x\right)}{dx^{2}} \leq 0$. Under these conditions, it is possible to develop "black-box" generation methods that perform well in many settings. Another modification that "adapts" the dominating densities works well for log-concave densities.

The basis of rejection sampling is the simple fact that since

$$f\left(x\right) = \int_{0}^{f\left(x\right)}du = \int_{0}^{f\left(x\right)}\mathcal{U}\left(du\right),$$

where $\mathcal{U}$ is the uniform distribution function, the density $f$ is a marginal distribution from the joint distribution

$$\left(X,U\right) \sim \mathcal{U}\left(\left\{\left(x,u\right) : 0 \leq u \leq f\left(x\right)\right\}\right).$$

More generally, if $X \in \mathbb{R}^{d}$ is a random vector with density $f$ and $U$ is an independent random variable distributed $\mathcal{U}\left[0,1\right]$, then $\left(X, cUf\left(X\right)\right)$ is uniformly distributed on the set $A = \left\{\left(x,u\right) : x \in \mathbb{R}^{d},\, 0 \leq u \leq cf\left(x\right)\right\}$. Conversely, if $\left(X,U\right)$ is uniformly distributed on $A$, then the density of $X$ is $f\left(x\right)$.
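The two-step accept-reject procedure can be sketched for a simple bounded target. Here the target is the Beta(2,2) density $f(x) = 6x(1-x)$ on $[0,1]$, the proposal $g$ is uniform, and $c = \sup f = 1.5$; the function name and target are illustrative, not from the text:

```python
import numpy as np

def rejection_sample(n, rng):
    """Sample f(x) = 6x(1-x) on [0,1] via uniform proposals, c = sup f = 1.5."""
    c = 1.5
    out = np.empty(n)
    filled = 0
    while filled < n:
        x = rng.uniform(size=n)                   # Step 1: X ~ g (uniform)
        u = rng.uniform(size=n)                   #         U ~ U[0,1]
        accepted = x[u <= 6 * x * (1 - x) / c]    # Step 2: keep if U <= f(X)/(c g(X))
        take = min(n - filled, accepted.size)
        out[filled:filled + take] = accepted[:take]
        filled += take
    return out

rng = np.random.default_rng(5)
draws = rejection_sample(100_000, rng)
# Beta(2,2) has mean 1/2 and variance 1/20; the acceptance rate is 1/c = 2/3
```

The expected acceptance rate $1/c$ shows why a tight bounding constant matters: a loose $c$ wastes proposals.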

4 Integration results

There are a number of useful integration results that are repeatedly used. The first is

$$\sqrt{2\pi\sigma^{2}} = \int \exp\left(-\frac{\left(x-\mu\right)^{2}}{2\sigma^{2}}\right)dx,$$

which defines the univariate normal distribution. The gamma function, defined as

$$\Gamma\left(\alpha\right) = \int_{0}^{\infty}y^{\alpha-1}e^{-y}dy,$$

is important for a number of distributions, most notably the gamma and inverse gamma. Dividing both sides of the gamma function expression by $\Gamma\left(\alpha\right)$,

$$1 = \int_{0}^{\infty}\frac{y^{\alpha-1}e^{-y}}{\Gamma\left(\alpha\right)}dy,$$

which implies that $y^{\alpha-1}e^{-y}/\Gamma\left(\alpha\right)$ is a proper density. Changing variables to $x = \beta y$ gives the standard form of the gamma distribution. Integration by parts implies that $\Gamma\left(\alpha+1\right) = \alpha\Gamma\left(\alpha\right)$, which in turn implies that $\Gamma\left(1\right) = 1$, $\Gamma\left(n\right) = \left(n-1\right)!$,

$$\Gamma\left(\alpha\right)\Gamma\left(\alpha + \frac{1}{2}\right) = 2^{1-2\alpha}\sqrt{\pi}\,\Gamma\left(2\alpha\right),$$

and $\Gamma\left(1/2\right) = \sqrt{\pi}$.

A number of integrals are useful for analytic characterization of distributions, either priors or posteriors, that are scale mixtures of normals. As discussed in the previous appendix, scale mixtures involve integrals that are a product of two distributions. The following integral identities are useful for analyzing these specifications. For any given $p, a, b > 0$,

$$\int_{0}^{\infty}x^{p-1}\exp\left(-ax^{b}\right)dx = \frac{1}{b}a^{-\frac{p}{b}}\Gamma\left(\frac{p}{b}\right) \quad (49)$$
$$\int_{0}^{\infty}\left(\frac{1}{x}\right)^{p+1}\exp\left(-ax^{-b}\right)dx = \frac{1}{b}a^{-\frac{p}{b}}\Gamma\left(\frac{p}{b}\right).$$

These integrals are useful when combining gamma or inverse gamma prior distributions with normal likelihood functions. Second, for any $a$ and $b$,

$$\int_{0}^{\infty}\frac{a}{\sqrt{2\pi x}}\exp\left(-\frac{1}{2}\left(a^{2}x + \frac{b^{2}}{x}\right)\right)dx = \exp\left(-\left|ab\right|\right), \quad (50)$$

which is useful for the double exponential distribution. A related integral, which is useful for the check exponential distribution, is

$$\int_{0}^{\infty}\frac{a}{\sqrt{2\pi x}}\exp\left(-\frac{1}{2}\left(\frac{b^{2}}{x} + 2\left(2\gamma-1\right)b + a^{2}x\right)\right)dx = \exp\left(-\left|ab\right| - \left(2\gamma-1\right)b\right). \quad (51)$$
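Identities (49) and (50) can be spot-checked numerically. The sketch below uses a simple midpoint rule over a truncated range (valid here because both integrands vanish well before the upper limit); `integrate` is a hypothetical helper and the parameter values are arbitrary:

```python
import numpy as np
from math import gamma, exp

def integrate(f, upper=60.0, n=2_000_000):
    """Midpoint rule on (0, upper) for integrands that vanish beyond upper."""
    dx = upper / n
    x = (np.arange(n) + 0.5) * dx
    return f(x).sum() * dx

# (49): int_0^inf x^{p-1} exp(-a x^b) dx = (1/b) a^{-p/b} Gamma(p/b)
p, a, b = 3.0, 2.0, 1.5
lhs49 = integrate(lambda x: x**(p - 1) * np.exp(-a * x**b))
rhs49 = (1.0 / b) * a**(-p / b) * gamma(p / b)

# (50): int_0^inf a (2 pi x)^{-1/2} exp(-(a^2 x + b^2/x)/2) dx = exp(-|ab|)
aa, bb = 1.3, 0.7
lhs50 = integrate(lambda x: aa / np.sqrt(2 * np.pi * x)
                  * np.exp(-0.5 * (aa**2 * x + bb**2 / x)))
rhs50 = exp(-abs(aa * bb))
```

Both sides agree to several decimal places, which is a useful sanity check when these identities are used to marginalize scale-mixture representations.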

38 5 Useful algebraic formulae

Completing the square (have fun showing the algebra for the second part).

$$\frac{\left(x-\mu_{1}\right)^{2}}{\sigma_{1}} + \frac{\left(x-\mu_{2}\right)^{2}}{\sigma_{2}} = \frac{\left(x-\mu_{3}\right)^{2}}{\sigma_{3}} + \frac{\left(\mu_{1}-\mu_{2}\right)^{2}}{\sigma_{1}+\sigma_{2}},$$

where

$$\sigma_{3} = \left(\frac{1}{\sigma_{1}} + \frac{1}{\sigma_{2}}\right)^{-1} \quad \text{and} \quad \mu_{3} = \sigma_{3}\left(\frac{\mu_{1}}{\sigma_{1}} + \frac{\mu_{2}}{\sigma_{2}}\right).$$

Check: assume that $\sigma_{2} = \infty$. Then $\sigma_{3} = \sigma_{1}$ and $\mu_{3} = \mu_{1}$.

Alternatively, the following is a useful representation when analyzing shrinkage and shrinkage factors. Defining

$$w = \frac{\sigma_{1}}{\sigma_{1}+\sigma_{2}},$$

a few lines of algebra imply that the complete-the-square formula can be written as

$$\frac{\left(x-\mu_{1}\right)^{2}}{\sigma_{1}} + \frac{\left(x-\mu_{2}\right)^{2}}{\sigma_{2}} = \frac{\left(\mu_{1}-\mu_{2}\right)^{2}\left(1-w\right)}{\sigma_{2}} + \frac{\left(x - \left(1-w\right)\mu_{1} - w\mu_{2}\right)^{2}}{\sigma_{1}\left(1-w\right)}.$$
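Both forms of the scalar complete-the-square identity can be verified numerically in a few lines; the specific values of $x$, $\mu_i$, and $\sigma_i$ below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)
x, m1, m2 = rng.normal(size=3)
s1, s2 = 1.7, 0.4                        # the two "variances" sigma_1, sigma_2

lhs = (x - m1)**2 / s1 + (x - m2)**2 / s2

# First form: sigma_3 = (1/s1 + 1/s2)^{-1}, mu_3 = sigma_3 (m1/s1 + m2/s2)
s3 = 1.0 / (1.0 / s1 + 1.0 / s2)
m3 = s3 * (m1 / s1 + m2 / s2)
rhs = (x - m3)**2 / s3 + (m1 - m2)**2 / (s1 + s2)

# Shrinkage form with w = s1 / (s1 + s2)
w = s1 / (s1 + s2)
rhs_w = ((m1 - m2)**2 * (1 - w) / s2
         + (x - ((1 - w) * m1 + w * m2))**2 / (s1 * (1 - w)))
```

Note that $\sigma_1(1-w) = \sigma_3$ and $(1-w)/\sigma_2 = 1/(\sigma_1+\sigma_2)$, so the two right-hand sides are algebraically identical.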

Vector/matrix completion of the square:

$$\left(x-\mu_{1}\right)\Sigma_{1}^{-1}\left(x-\mu_{1}\right)' + \left(x-\mu_{2}\right)\Sigma_{2}^{-1}\left(x-\mu_{2}\right)' = \left(x-\mu_{3}\right)\Sigma_{3}^{-1}\left(x-\mu_{3}\right)' + \left(\mu_{1}-\mu_{2}\right)\Sigma_{4}^{-1}\left(\mu_{1}-\mu_{2}\right)',$$

where

$$\Sigma_{3} = \left(\Sigma_{1}^{-1} + \Sigma_{2}^{-1}\right)^{-1}, \quad \mu_{3} = \left(\mu_{1}\Sigma_{1}^{-1} + \mu_{2}\Sigma_{2}^{-1}\right)\Sigma_{3}, \quad \text{and} \quad \Sigma_{4} = \Sigma_{1} + \Sigma_{2}.$$

This is used for multivariate regressions:

$$\sum_{t=1}^{T}\left(y_{t}-\mu\right)\Sigma^{-1}\left(y_{t}-\mu\right)' = \mathrm{tr}\left(\sum_{t=1}^{T}\left(y_{t}-\mu\right)'\left(y_{t}-\mu\right)\Sigma^{-1}\right).$$

The fundamental identity of likelihood times prior equals posterior times marginal uses completing the square and the convolution of two normals: we have

$$\mathcal{N}\left(\mu_{1},\Sigma_{1}\right)\times\mathcal{N}\left(\mu_{2},\Sigma_{2}\right) = \mathcal{N}\left(\mu,\Sigma\right)\times c\left(\mu_{1},\Sigma_{1},\mu_{2},\Sigma_{2}\right),$$

where

$$\mu = \Sigma\left(\Sigma_{1}^{-1}\mu_{1} + \Sigma_{2}^{-1}\mu_{2}\right) \quad \text{and} \quad \Sigma = \left(\Sigma_{1}^{-1} + \Sigma_{2}^{-1}\right)^{-1},$$

and the marginal is given by

$$c\left(\mu_{1},\Sigma_{1},\mu_{2},\Sigma_{2}\right) = \left(2\pi\right)^{-\frac{k}{2}}\left|\Sigma_{1}+\Sigma_{2}\right|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}\left(\mu_{2}-\mu_{1}\right)'\left(\Sigma_{1}+\Sigma_{2}\right)^{-1}\left(\mu_{2}-\mu_{1}\right)\right).$$
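The convolution identity can be checked pointwise in the univariate case ($k = 1$), where the marginal constant reduces to a normal density evaluated at $\mu_2$ with mean $\mu_1$ and variance $\sigma_1^2 + \sigma_2^2$. A minimal sketch with arbitrary values:

```python
import numpy as np

def npdf(x, m, v):
    """Univariate normal density with mean m and variance v."""
    return np.exp(-0.5 * (x - m)**2 / v) / np.sqrt(2 * np.pi * v)

x, m1, v1, m2, v2 = 0.3, -0.5, 1.2, 1.1, 0.6

v = 1.0 / (1.0 / v1 + 1.0 / v2)      # "posterior" variance
m = v * (m1 / v1 + m2 / v2)          # "posterior" mean
c = npdf(m2, m1, v1 + v2)            # marginal constant c(m1, v1, m2, v2)

lhs = npdf(x, m1, v1) * npdf(x, m2, v2)   # likelihood times prior
rhs = npdf(x, m, v) * c                   # posterior times marginal
```

The two sides agree exactly, which is the univariate version of the "likelihood times prior equals posterior times marginal" identity.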
