CHAPTER 1: DISTRIBUTION THEORY

Basic Concepts

• Random variable (R.V.)
  - discrete random variable: probability mass function
  - continuous random variable: probability density function

• Random vector
  - multivariate version of a random variable
  - joint probability mass function, joint density function
• Formal definitions based on measure theory
  - an R.V. is a measurable function on a probability space, which consists of sets in a σ-field equipped with a probability measure;
  - probability mass function: the Radon-Nikodym derivative of the measure induced by the random variable w.r.t. a counting measure;
  - density function: the Radon-Nikodym derivative of the measure induced by the random variable w.r.t. the Lebesgue measure.
• Descriptive quantities of a distribution
  - cumulative distribution function

FX(x) ≡ P(X ≤ x)

  - moments: E[X^k], k = 1, 2, ... (mean if k = 1); central moments: E[(X − µ)^k] (variance if k = 2)
  - quantile function

FX^{-1}(p) ≡ inf{x : FX(x) > p}

  - skewness: E[(X − µ)^3]/Var(X)^{3/2}
  - kurtosis: E[(X − µ)^4]/Var(X)^2
• Characteristic function (c.f.)
  - definition: φX(t) = E[exp{itX}]
  - the c.f. uniquely determines the distribution:

FX(b) − FX(a) = lim_{T→∞} (1/2π) ∫_{−T}^{T} [(e^{−ita} − e^{−itb})/(it)] φX(t) dt

fX(x) = (1/2π) ∫_{−∞}^{∞} e^{−itx} φX(t) dt   for continuous r.v.

  - moment generating function (if it exists)

mX(t) = E[exp{tX}]

• Descriptive quantities of two R.V.s (X, Y)
  - covariance and correlation: Cov(X, Y), corr(X, Y)
  - conditional density or probability (the f's become probability mass functions if the R.V.s are discrete):

fX|Y(x|y) = fX,Y(x, y)/fY(y)

E[X|Y = y] = ∫ x fX|Y(x|y) dx

  - independence

P(X ≤ x, Y ≤ y) = P(X ≤ x)P(Y ≤ y)

fX,Y(x, y) = fX(x)fY(y)

• Double expectation formulae

E[X] = E[E[X|Y]]

Var(X) = E[Var(X|Y)] + Var(E[X|Y])
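As a quick numerical illustration of these two identities (an added sketch, assuming numpy is available; the Poisson-Gamma mixture is my own example, not the in-class one), take Y ∼ Gamma(θ, β) and X|Y ∼ Poisson(Y), so that E[X] = E[Y] and Var(X) = E[Y] + Var(Y):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, beta = 3.0, 2.0          # illustrative shape/scale for Y
n = 200_000

# simulate the two-stage experiment: Y first, then X given Y
y = rng.gamma(shape=theta, scale=beta, size=n)
x = rng.poisson(lam=y)

# double expectation: E[X] = E[E[X|Y]] = E[Y]
print(x.mean(), theta * beta)

# total variance: Var(X) = E[Var(X|Y)] + Var(E[X|Y]) = E[Y] + Var(Y)
print(x.var(), theta * beta + theta * beta**2)
```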

• This is useful for computing the moments of X when X given Y has a simple distribution!

In-class exercise Compute E[X] and Var(X) if Y ∼ Bernoulli(p) and X given Y = y follows N(µ_y, σ_y^2).

Discrete Random Variables

• Binomial distribution
  - A Binomial r.v. is the total number of successes in n independent Bernoulli trials.
  - Binomial(n, p):

P(Sn = k) = (n choose k) p^k (1 − p)^{n−k},  k = 0, ..., n

E[Sn] = np,  Var(Sn) = np(1 − p),  φX(t) = (1 − p + pe^{it})^n

  - Summation of two independent Binomial r.v.'s:

Binomial(n1, p) + Binomial(n2, p) ∼ Binomial(n1 + n2, p)
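A quick empirical check of this identity (an added sketch, assuming numpy and scipy; the parameters are illustrative), comparing the empirical pmf of the sum with the Binomial(n1 + n2, p) pmf:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n1, n2, p = 5, 8, 0.3
s = rng.binomial(n1, p, 100_000) + rng.binomial(n2, p, 100_000)

# empirical pmf of the sum vs. the Binomial(n1 + n2, p) pmf
ks = np.arange(n1 + n2 + 1)
emp = np.bincount(s, minlength=n1 + n2 + 1) / s.size
print(np.max(np.abs(emp - stats.binom.pmf(ks, n1 + n2, p))))  # should be small
```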

In-class exercise Prove the last statement.

• Negative Binomial distribution
  - It describes the distribution of the number of independent Bernoulli trials needed to obtain m successes.
  - NB(m, p):

P(Wm = k) = (k − 1 choose m − 1) p^m (1 − p)^{k−m},  k = m, m + 1, ...

E[Wm] = m/p,  Var(Wm) = m/p^2 − m/p

  - Summation of two independent r.v.'s:

NB(m1, p) + NB(m2, p) ∼ NB(m1 + m2, p)
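Because software libraries often parameterize the negative binomial by the number of failures rather than the number of trials, the safest sanity check here is a direct construction (an added sketch, assuming numpy; parameters are illustrative) of the mean, variance, and additivity stated above:

```python
import numpy as np

rng = np.random.default_rng(2)

def nb_trials(m, p, size):
    """Number of Bernoulli(p) trials until the m-th success (support m, m+1, ...).
    Equals m plus the number of failures, which numpy's negative_binomial draws."""
    return m + rng.negative_binomial(m, p, size)

m1, m2, p = 3, 4, 0.4
w = nb_trials(m1, p, 200_000) + nb_trials(m2, p, 200_000)
w_direct = nb_trials(m1 + m2, p, 200_000)

m = m1 + m2
print(w.mean(), w_direct.mean(), m / p)             # means agree with m/p
print(w.var(), w_direct.var(), m / p**2 - m / p)    # variances agree with m/p^2 - m/p
```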

In-class exercise Prove the last statement.

• Hypergeometric distribution
  - It describes the distribution of the number of black balls when we randomly draw n balls (without replacement) from an urn with M black and (N − M) white balls (urn model).
  - HG(N, M, n):

P(Sn = k) = (M choose k)(N − M choose n − k) / (N choose n),  k = 0, 1, ..., n

E[Sn] = nM/N,  Var(Sn) = n (M/N)(1 − M/N)(N − n)/(N − 1)

• Poisson distribution
  - It is used to describe the distribution of the number of incidences in an infinite population.
  - Poisson(λ):

P(X = k) = λ^k e^{−λ}/k!,  k = 0, 1, 2, ...

E[X] = Var(X) = λ,  φX(t) = exp{−λ(1 − e^{it})}

  - Summation of two independent r.v.'s:

Poisson(λ1) + Poisson(λ2) ∼ Poisson(λ1 + λ2)

• Poisson distribution related to Binomial distribution
  - Assume

X1 ∼ Poisson(λ1), X2 ∼ Poisson(λ2), X1 ⊥⊥ X2.

Then X1 | X1 + X2 = n ∼ Binomial(n, λ1/(λ1 + λ2)).

  - If Xn1, ..., Xnn are i.i.d. Bernoulli(pn) and n·pn → λ, then

P(Sn = k) = [n!/(k!(n − k)!)] pn^k (1 − pn)^{n−k} → (λ^k/k!) e^{−λ},  where Sn = Xn1 + ... + Xnn.
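Numerically the convergence is easy to see (an added sketch, assuming scipy): with n·pn = λ held fixed, the Binomial(n, pn) pmf approaches the Poisson(λ) pmf as n grows.

```python
import numpy as np
from scipy import stats

lam = 3.0
ks = np.arange(15)
for n in (10, 100, 1000, 10000):
    p_n = lam / n
    # maximal pointwise gap between Binomial(n, p_n) and Poisson(lam) pmfs
    gap = np.max(np.abs(stats.binom.pmf(ks, n, p_n) - stats.poisson.pmf(ks, lam)))
    print(n, gap)   # the gap shrinks as n grows
```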

In-class exercise Prove these statements.

• Multinomial Distribution

  - Assume Y1, ..., Yn are independent k-category random variables such that P(Yi = category l) = pl, with Σ_{l=1}^{k} pl = 1.
  - Nl, 1 ≤ l ≤ k, counts the number of times that {Y1, ..., Yn} takes category value l.
  - Then

    P(N1 = n1, ..., Nk = nk) = (n choose n1, ..., nk) p1^{n1} ··· pk^{nk},

n1 + ... + nk = n

  - the covariance matrix of (N1, ..., Nk) is n(diag(p1, ..., pk) − pp') with p = (p1, ..., pk)', i.e., the (l, l) entry is n pl(1 − pl) and the (l, l') entry is −n pl pl' for l ≠ l'.

Continuous Random Variables

• Uniform distribution
  - Uniform(a, b): fX(x) = I[a,b](x)/(b − a),

where IA(x) is an indicator function taking value 1 if x ∈ A and 0 otherwise.
  - E[X] = (a + b)/2,  Var(X) = (b − a)^2/12
  - The uniform distribution on [0, 1] is the most fundamental distribution: it is what each pseudo random number generator tries to generate numbers from.

• Normal distribution
  - It describes the distributional behavior of averaged data in practice.
  - N(µ, σ^2):

    fX(x) = (1/√(2πσ^2)) exp{−(x − µ)^2/(2σ^2)}

    E[X] = µ,  Var(X) = σ^2,  φX(t) = exp{itµ − σ^2 t^2/2}

• Gamma distribution
  - It is useful to describe the distribution of a positive random variable with a long right tail.
  - Gamma(θ, β):

fX(x) = [1/(β^θ Γ(θ))] x^{θ−1} exp{−x/β},  x > 0

E[X] = θβ,  Var(X) = θβ^2

  - θ is called the shape parameter and β is the scale parameter.
  - θ = 1 reduces to the so-called exponential distribution, Exp(β), and θ = n/2, β = 2 is equivalent to a chi-squared distribution with n degrees of freedom.
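A small sampling check in the shape-scale convention used here (an added sketch with illustrative parameters; numpy's gamma generator also uses shape and scale, so no conversion is needed):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, beta = 2.5, 1.5                      # shape and scale, as in Gamma(θ, β) above
x = rng.gamma(shape=theta, scale=beta, size=500_000)

print(x.mean(), theta * beta)               # E[X] = θβ
print(x.var(), theta * beta**2)             # Var(X) = θβ^2
```

With scipy, the same distribution is stats.gamma(a=θ, scale=β); rate-based parameterizations (as in the Caution below) would instead pass scale = 1/rate.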

Caution In some references, the notation Gamma(θ, β) is used to refer to the distribution Gamma(θ, 1/β) as defined here (i.e., β is a rate rather than a scale)!

• Cauchy distribution
  - Cauchy(a, b):

fX(x) = 1/[bπ {1 + (x − a)^2/b^2}]

  - E[|X|] = ∞,  φX(t) = exp{iat − |bt|}
  - This distribution is often used as a counterexample: its mean does not exist.

Algebra of Random Variables

• Assumption
  - We assume that X and Y are independent with distribution functions FX and FY, respectively.
  - We are interested in the distribution or density of

X + Y, XY, X/Y.

  - For XY and X/Y, we assume Y to be positive for convenience, although this is not necessary.

• Summation of X and Y
  - Derivation:

FX+Y(z) = E[I(X + Y ≤ z)] = EY[EX[I(X ≤ z − Y)|Y]]
        = EY[FX(z − Y)] = ∫ FX(z − y) dFY(y)

  - Convolution formula (the density version refers to the situation when X and Y are continuous):

    FX+Y(z) = ∫ FY(z − x) dFX(x) ≡ FX ∗ FY(z)

fX ∗ fY(z) ≡ ∫ fX(z − y)fY(y) dy = ∫ fY(z − x)fX(x) dx

• Product and quotient of X and Y (Y > 0)
  - FXY(z) = E[E[I(XY ≤ z)|Y]] = ∫ FX(z/y) dFY(y)

    fXY(z) = ∫ fX(z/y) (1/y) fY(y) dy

  - FX/Y(z) = E[E[I(X/Y ≤ z)|Y]] = ∫ FX(yz) dFY(y)

    fX/Y(z) = ∫ fX(yz) y fY(y) dy
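As an added illustration of the quotient formula (a sketch; the half-normal choice of Y is my own example): for X ∼ N(0, 1) and Y = |Z| with Z ∼ N(0, 1) independent of X, the formula gives fX/Y(z) = ∫_0^∞ φ(yz) y · 2φ(y) dy = 1/[π(1 + z^2)], the standard Cauchy density, which a simulation confirms:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.standard_normal(200_000)
y = np.abs(rng.standard_normal(200_000))     # half-normal, so Y > 0

# Kolmogorov-Smirnov distance between X/Y and the standard Cauchy: small
print(stats.kstest(x / y, "cauchy"))
```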

In-class exercise How can the distributions be obtained if Y may not be positive?

• Application of the algebra formulae
  - N(µ1, σ1^2) + N(µ2, σ2^2) ∼ N(µ1 + µ2, σ1^2 + σ2^2)
  - Gamma(r1, θ) + Gamma(r2, θ) ∼ Gamma(r1 + r2, θ)

  - Poisson(λ1) + Poisson(λ2) ∼ Poisson(λ1 + λ2)

  - Negative Binomial(m1, p) + Negative Binomial(m2, p) ∼ Negative Binomial(m1 + m2, p)

In-class exercise Verify the summation formula for Poisson r.v.'s.

• Deriving the distribution of a sum of R.V.s using characteristic functions
  - Key property If X and Y are independent, then

φX+Y(t) = φX(t)φY(t)
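A quick empirical check of this key property before the worked normal example below (an added sketch; the exponential/uniform pair is my own choice, and the empirical characteristic function is just the sample average of exp{itX}):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=2.0, size=200_000)     # any two independent samples will do
y = rng.uniform(-1.0, 1.0, size=200_000)

def ecf(sample, t):
    """Empirical characteristic function: average of exp(i t X) over the sample."""
    return np.mean(np.exp(1j * t * sample))

for t in (0.3, 1.0, 2.5):
    print(ecf(x + y, t), ecf(x, t) * ecf(y, t))  # the two columns nearly agree
```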

  - For example, if X and Y are normal with

    φX(t) = exp{iµ1 t − σ1^2 t^2/2},  φY(t) = exp{iµ2 t − σ2^2 t^2/2},

⇒ φX+Y(t) = exp{i(µ1 + µ2)t − (σ1^2 + σ2^2)t^2/2}.

• More distributions generated from simple random variables
  - Assume X ∼ N(0, 1), Y ∼ χ^2_m and Z ∼ χ^2_n are independent. Then

X/√(Y/m) ∼ Student's t(m),

(Y/m)/(Z/n) ∼ Snedecor's F(m, n),

Y/(Y + Z) ∼ Beta(m/2, n/2).

• Densities of the t-, F- and Beta-distributions
  - f_{t(m)}(x) = [Γ((m + 1)/2)/(√(πm) Γ(m/2))] · (1 + x^2/m)^{−(m+1)/2} · I_{(−∞,∞)}(x)

  - f_{F(m,n)}(x) = [Γ((m + n)/2)/(Γ(m/2)Γ(n/2))] · (m/n)^{m/2} x^{m/2−1}/(1 + mx/n)^{(m+n)/2} · I_{(0,∞)}(x)

  - f_{Beta(a,b)}(x) = [Γ(a + b)/(Γ(a)Γ(b))] x^{a−1}(1 − x)^{b−1} I(x ∈ (0, 1))

  - It is always recommended that an indicator function be included in the density expression to indicate the support of the r.v.

• Exponential vs Beta distributions

  - Assume Y1, ..., Yn+1 are i.i.d. Exp(θ).
  - Then Zi = (Y1 + ... + Yi)/(Y1 + ... + Yn+1) ∼ Beta(i, n − i + 1).
  - In fact, (Z1, ..., Zn) has the same joint distribution as the order statistics (ξn:1, ..., ξn:n) of n independent Uniform(0, 1) random variables.

Transformation of Random Vectors

Theorem 1.3 Suppose that X is a k-dimensional random vector with density function fX(x1, ..., xk). Let g be a one-to-one and continuously differentiable map from R^k to R^k. Then Y = g(X) is a random vector with density function

fX(g^{-1}(y1, ..., yk)) |J_{g^{-1}}(y1, ..., yk)|,

where g^{-1} is the inverse of g and J_{g^{-1}} is the Jacobian of g^{-1}.

In-class exercise Prove it by a change of variables in multivariate integration.

• An application
  - Let R^2 ∼ Exp(2), R > 0, and Θ ∼ Uniform(0, 2π) be independent;
  - X = R cos Θ and Y = R sin Θ are two independent standard normal random variables;
  - this can be applied to simulate normally distributed data (see the sketch below).
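A minimal Box-Muller sketch of this application (added here, assuming numpy and scipy; it uses the fact that −2 log U ∼ Exp(2) for U ∼ Uniform(0, 1)):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 200_000

u1, u2 = rng.uniform(size=n), rng.uniform(size=n)
r = np.sqrt(-2.0 * np.log(u1))       # R^2 = -2 log U1 ~ Exp(2)
theta = 2.0 * np.pi * u2             # Θ ~ Uniform(0, 2π)

x, y = r * np.cos(theta), r * np.sin(theta)

print(x.mean(), x.var(), np.corrcoef(x, y)[0, 1])   # ≈ 0, 1, 0
print(stats.kstest(x, "norm"))                      # x looks standard normal
```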

In-class exercise Prove this statement.

In-class remark It is fundamental to have a mechanism to generate data from U(0, 1) (a pseudo random number generator). To generate data from any distribution FX, generate FX^{-1}(U) where U ∼ U(0, 1). Alternative algorithms include the acceptance-rejection method, MCMC, etc.

Multivariate Normal Distribution

• Definition
  Y = (Y1, ..., Yn)' is said to have a multivariate normal distribution with mean vector µ = (µ1, ..., µn)' and non-degenerate covariance matrix Σ_{n×n} if its joint density is

fY(y1, ..., yn) = [1/((2π)^{n/2}|Σ|^{1/2})] exp{−(1/2)(y − µ)'Σ^{-1}(y − µ)}

• Characteristic function

φY(t) = E[e^{it'Y}]
      = ∫ (2π)^{−n/2}|Σ|^{−1/2} exp{it'y − (1/2)(y − µ)'Σ^{-1}(y − µ)} dy
      = ∫ (2π)^{−n/2}|Σ|^{−1/2} exp{−(1/2) y'Σ^{-1}y + (it + Σ^{-1}µ)'y − (1/2) µ'Σ^{-1}µ} dy
      = exp{−µ'Σ^{-1}µ/2} ∫ (2π)^{−n/2}|Σ|^{−1/2} exp{−(1/2)(y − Σit − µ)'Σ^{-1}(y − Σit − µ) + (1/2)(Σit + µ)'Σ^{-1}(Σit + µ)} dy
      = exp{−µ'Σ^{-1}µ/2 + (1/2)(Σit + µ)'Σ^{-1}(Σit + µ)}

      = exp{it'µ − (1/2) t'Σt}.

• Some remarks
  - The multivariate normal distribution is uniquely determined by µ and Σ.
  - For the standard multivariate normal distribution,

φX(t) = exp{−t't/2}.

  - the moment generating function of Y is

exp{t'µ + t'Σt/2}.

In-class exercise Derive the moments of the standard normal distribution.

• Linear transformations of normal r.v.s

Theorem 1.4 If Y = A_{n×k} X_{k×1} where X ∼ N(0, I) (standard multivariate normal distribution), then Y's characteristic function is given by

φY(t) = exp{−t'Σt/2},  t = (t1, ..., tn)' ∈ R^n,

where Σ = AA', so rank(Σ) = rank(A); i.e., Y ∼ N(0, Σ).

Conversely, if φY(t) = exp{−t'Σt/2} with Σ_{n×n} ≥ 0 of rank k, i.e., Y ∼ N(0, Σ), then there exists a matrix A with rank(A) = k such that Y = A_{n×k} X_{k×1} with X ∼ N(0, I_{k×k}).

Proof of Theorem 1.4 For the first part, since Y = AX and X ∼ N(0, I_{k×k}), we obtain

φY(t) = E[exp{it'(AX)}]
      = E[exp{i(A't)'X}]
      = exp{−(A't)'(A't)/2}
      = exp{−t'AA't/2}.

Thus, Y ∼ N(0, Σ) with Σ = AA'. Clearly, rank(Σ) = rank(AA') = rank(A).

For the second part, suppose φY(t) = exp{−t'Σt/2}. There exists an orthogonal matrix O such that

Σ = O'DO,  D = diag(d1, ..., dk, 0, ..., 0),  d1, ..., dk > 0.

Define Z = OY; then

φZ(t) = E[exp{it'(OY)}] = E[exp{i(O't)'Y}]

      = exp{−(O't)'Σ(O't)/2} = exp{−d1 t1^2/2 − ... − dk tk^2/2}.

Thus, Z1, ..., Zk are independent N(0, d1), ..., N(0, dk), while Zk+1, ..., Zn are zero. Let Xi = Zi/√di, i = 1, ..., k, and write O' = (B_{n×k}, C_{n×(n−k)}). Clearly,

Y = O'Z = B diag{√d1, ..., √dk} X,

so the result holds by letting A = B diag{√d1, ..., √dk}.
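The construction in this proof can be turned directly into a sampler for N(0, Σ) (a minimal added sketch, assuming numpy; the covariance matrix below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.0, 0.5],
                  [0.3, 0.5, 1.5]])          # illustrative positive definite Σ

# eigendecomposition as in the proof; numpy returns Σ = V diag(d) V'
d, V = np.linalg.eigh(Sigma)
A = V @ np.diag(np.sqrt(np.clip(d, 0.0, None)))   # A with AA' = Σ

X = rng.standard_normal((3, 100_000))        # X ~ N(0, I)
Y = A @ X                                    # Y = AX ~ N(0, Σ)

print(np.cov(Y))                             # empirical covariance ≈ Σ
```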

In-class remark Note that this proof also shows the way to generate data from N(0, Σ).

• Conditional normal distributions

Theorem 1.5 Suppose that Y = (Y1, ..., Yk, Yk+1, ..., Yn)' has a multivariate normal distribution with mean µ = (µ^{(1)'}, µ^{(2)'})' and a non-degenerate covariance matrix

Σ = ( Σ11  Σ12
      Σ21  Σ22 ).

Let Y^{(1)} = (Y1, ..., Yk)' and Y^{(2)} = (Yk+1, ..., Yn)'. Then
(i) Y^{(1)} ∼ N_k(µ^{(1)}, Σ11).
(ii) Y^{(1)} and Y^{(2)} are independent iff Σ12 = Σ21 = 0.
(iii) For any matrix A_{m×n}, AY has a multivariate normal distribution with mean Aµ and covariance AΣA'.
(iv) The conditional distribution of Y^{(1)} given Y^{(2)} is

Y^{(1)} | Y^{(2)} ∼ N_k(µ^{(1)} + Σ12 Σ22^{-1}(Y^{(2)} − µ^{(2)}), Σ11 − Σ12 Σ22^{-1} Σ21).

Proof of Theorem 1.5 We prove (iii) first. Since φY(t) = exp{it'µ − t'Σt/2},

φ_{AY}(t) = E[e^{it'AY}]
          = E[e^{i(A't)'Y}]
          = exp{i(A't)'µ − (A't)'Σ(A't)/2}
          = exp{it'(Aµ) − t'(AΣA')t/2}.

Thus, AY ∼ N(Aµ, AΣA').

(i) follows from (iii) by letting A = (I_{k×k}, 0_{k×(n−k)}).

To prove (ii), note that

φY(t) = exp[ it^{(1)'}µ^{(1)} + it^{(2)'}µ^{(2)}

        − (1/2){ t^{(1)'}Σ11 t^{(1)} + 2 t^{(1)'}Σ12 t^{(2)} + t^{(2)'}Σ22 t^{(2)} } ].

Thus, Y^{(1)} and Y^{(2)} are independent iff the characteristic function factors into a product of separate functions of t^{(1)} and

t^{(2)}, which holds iff Σ12 = 0.

To prove (iv), consider Z^{(1)} = Y^{(1)} − µ^{(1)} − Σ12 Σ22^{-1}(Y^{(2)} − µ^{(2)}).

From (iii), Z^{(1)} ∼ N(0, Σ_Z) with

Σ_Z = Cov(Y^{(1)}, Y^{(1)}) − 2 Σ12 Σ22^{-1} Cov(Y^{(2)}, Y^{(1)}) + Σ12 Σ22^{-1} Cov(Y^{(2)}, Y^{(2)}) Σ22^{-1} Σ21

    = Σ11 − Σ12 Σ22^{-1} Σ21.

On the other hand,

Cov(Z^{(1)}, Y^{(2)}) = Cov(Y^{(1)}, Y^{(2)}) − Σ12 Σ22^{-1} Cov(Y^{(2)}, Y^{(2)}) = 0,

so Z^{(1)} is independent of Y^{(2)} by (ii). Therefore, the conditional distribution of Z^{(1)} given Y^{(2)} is the same as the unconditional distribution of Z^{(1)}, which is N(0, Σ_Z). That is,

Z^{(1)} | Y^{(2)} ∼ Z^{(1)} ∼ N(0, Σ11 − Σ12 Σ22^{-1} Σ21).

The result holds.
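A short simulation check of Theorem 1.5(iv) for a bivariate normal (an added sketch with illustrative parameters; it conditions on Y2 falling in a narrow bin rather than on an exact value):

```python
import numpy as np

rng = np.random.default_rng(8)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

Y = rng.multivariate_normal(mu, Sigma, size=1_000_000)
y2 = -1.0                                         # condition on Y2 ≈ y2
sel = np.abs(Y[:, 1] - y2) < 0.02                 # narrow bin around y2
y1_given = Y[sel, 0]

cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (y2 - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
print(y1_given.mean(), cond_mean)                 # ≈ µ1 + Σ12 Σ22^{-1}(y2 − µ2)
print(y1_given.var(), cond_var)                   # ≈ Σ11 − Σ12 Σ22^{-1} Σ21
```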

In-class remark The constructed Z has a geometric interpretation: it is (Y^{(1)} − µ^{(1)}) minus its projection on the normalized variable Σ22^{-1/2}(Y^{(2)} − µ^{(2)}).

• An example: the measurement error model
  - X and U are independent with X ∼ N(0, σx^2) and U ∼ N(0, σu^2).
  - The additive measurement error model assumes

Y = X + U.

  - Note that the covariance matrix of (X, Y) is

( σx^2          σx^2
  σx^2    σx^2 + σu^2 )

so from (iv),

X | Y ∼ N( [σx^2/(σx^2 + σu^2)] Y, σx^2 − σx^4/(σx^2 + σu^2) ) ≡ N(λY, σx^2(1 − λ)).

  - Here, λ = σx^2/σy^2 is called the reliability coefficient, where σy^2 = σx^2 + σu^2.

• Quadratic forms of normal random variables
  - Sum of squares of independent standard normal r.v.s:

N(0, 1)^2 + ... + N(0, 1)^2 ∼ χ^2_n ≡ Gamma(n/2, 2).

  - If Y ∼ N(0, Σ_{n×n}) with Σ > 0, then

Y'Σ^{-1}Y ∼ χ^2_n.

Proof: Y can be expressed as AX where X ∼ N(0, I) and A = Σ^{1/2}. Thus,

Y'Σ^{-1}Y = X'A'(AA')^{-1}AX = X'X ∼ χ^2_n.
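A quick simulation check of this fact (an added sketch, assuming numpy and scipy; the Σ below is illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
Sigma = np.array([[1.5, 0.4, 0.2],
                  [0.4, 1.0, 0.3],
                  [0.2, 0.3, 2.0]])
n = Sigma.shape[0]

Y = rng.multivariate_normal(np.zeros(n), Sigma, size=200_000)
q = np.einsum("ij,jk,ik->i", Y, np.linalg.inv(Sigma), Y)   # Y'Σ^{-1}Y for each draw

print(q.mean(), n)                          # E[χ²_n] = n
print(q.var(), 2 * n)                       # Var[χ²_n] = 2n
print(stats.kstest(q, "chi2", args=(n,)))   # KS distance to χ²_n is small
```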

In-class exercise What is the distribution of Y'BY for an n × n symmetric matrix B?

• Noncentral chi-square distribution
  - Assume X ∼ N(µ, I_{n×n}).
  - Define Y = X'X ∼ χ^2_n(δ) with δ = µ'µ.
  - Y has a noncentral chi-square distribution and its density function is a Poisson-probability weighted summation of gamma densities:

fY(y) = Σ_{k=0}^{∞} [exp{−δ/2}(δ/2)^k / k!] Gamma(y; (2k + n)/2, 2).
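The in-class remark that follows describes the equivalent two-stage sampling scheme; here is a minimal sketch of it (added, assuming numpy; µ is illustrative), compared against direct draws of X'X with X ∼ N(µ, I):

```python
import numpy as np

rng = np.random.default_rng(10)
mu = np.array([1.0, -0.5, 2.0])
n, delta = mu.size, float(mu @ mu)
size = 300_000

# direct construction: Y = X'X with X ~ N(µ, I)
X = mu + rng.standard_normal((size, n))
y_direct = np.sum(X**2, axis=1)

# two-stage construction: S ~ Poisson(δ/2), then Y | S = k ~ Gamma((2k+n)/2, 2)
S = rng.poisson(delta / 2, size)
y_mixture = rng.gamma(shape=(2 * S + n) / 2, scale=2.0)

print(y_direct.mean(), y_mixture.mean(), n + delta)          # E = n + δ
print(y_direct.var(), y_mixture.var(), 2 * (n + 2 * delta))  # Var = 2(n + 2δ)
```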

In-class remark Equivalently, we generate S from Poisson(δ/2) and, given S = k, we generate Y ∼ Gamma((2k + n)/2, 2).

• More about normal distributions
  - noncentral t-distribution: N(δ, 1)/√(χ^2_n/n) (numerator and denominator independent)
  - noncentral F-distribution: (χ^2_n(δ)/n)/(χ^2_m/m) (numerator and denominator independent)
  - One trick to evaluate E[X'AX] when X ∼ N(µ, Σ):

E[X'AX] = E[tr(X'AX)] = E[tr(AXX')] = tr(A E[XX']) = tr(A(µµ' + Σ)) = µ'Aµ + tr(AΣ)

Distribution Family

• Location-scale family

  - aX + b has density fX((x − b)/a)/a, a > 0, b ∈ R, with mean aE[X] + b and variance a^2 Var(X)
  - Examples include the normal distribution, uniform distribution, gamma distributions, etc.

• Exponential family
  - The exponential family includes many commonly used distributions, such as the binomial and Poisson distributions for discrete variables and the normal, gamma, and Beta distributions for continuous variables.
  - The distributions within this family have a general expression for their densities and share some family-specific properties.

• General form of densities in an exponential family

{ pθ(x)}, which consists of densities indexed by θ (one or more parameters), is said to form an s-parameter exponential family if

pθ(x) = exp{ Σ_{k=1}^{s} ηk(θ)Tk(x) − B(θ) } h(x),

where h(x) ≥ 0 and

exp{B(θ)} = ∫ exp{ Σ_{k=1}^{s} ηk(θ)Tk(x) } h(x) dµ(x) < ∞.

  - When (η1(θ), ..., ηs(θ))' is exactly θ itself, the above form is called the canonical form of the exponential family.

• Examples
  - X1, ..., Xn ∼ N(µ, σ^2):

exp{ (µ/σ^2) Σ_{i=1}^{n} xi − (1/(2σ^2)) Σ_{i=1}^{n} xi^2 − nµ^2/(2σ^2) } · (1/(√(2π) σ))^n.

Its canonical parameterization is θ1 = µ/σ^2 and θ2 = 1/σ^2.

  - X ∼ Binomial(n, p):

exp{ x log(p/(1 − p)) + n log(1 − p) } (n choose x).

Its canonical parameterization is θ = log{p/(1 − p)}.

  - X ∼ Poisson(λ):

P(X = x) = exp{x log λ − λ}/x!

Its canonical parameterization is θ = log λ.

• Moment generating function in the exponential family

Theorem 1.6 Suppose the densities of an exponential family can be written in the canonical form

exp{ Σ_{k=1}^{s} ηk Tk(x) − A(η) } h(x),

where η = (η1, ..., ηs)'. Then the moment generating function (MGF) of (T1, ..., Ts), defined as

MT(t1, ..., ts) = E [exp{t1T1 + ... + tsTs}]

for t = (t1, ..., ts)', is given by

MT(t) = exp{A(η + t) − A(η)}.

Proof of Theorem 1.6 Since

exp{A(η)} = ∫ exp{ Σ_{k=1}^{s} ηk Tk(x) } h(x) dµ(x),

we have

MT(t) = E[exp{t1T1 + ... + tsTs}]

      = ∫ exp{ Σ_{k=1}^{s} (ηk + tk)Tk(x) − A(η) } h(x) dµ(x)

      = exp{A(η + t) − A(η)}.

• MGF and cumulant generating function (CGF)

  - The MGF can be used to calculate the moments of (T1, ..., Ts).
  - Define the cumulant generating function (CGF)

KT(t1, ..., ts) = log MT(t1, ..., ts) = A(η + t) − A(η).

The coefficients in its Taylor expansion are called the cumulants of (T1, ..., Ts).
  - The first two cumulants are the mean and the variance.

• Examples of using the MGF
  - Normal distribution: η = µ/σ^2 and

A(η) = µ^2/(2σ^2) = η^2σ^2/2.

Thus,

MT(t) = exp{ (σ^2/2)((η + t)^2 − η^2) } = exp{µt + t^2σ^2/2}.

  - For X ∼ N(0, σ^2), since T = X, from the Taylor expansion we have

E[X^{2r+1}] = 0,  E[X^{2r}] = 1 · 3 ··· (2r − 1) σ^{2r},  r = 1, 2, ...

  - The Gamma distribution has a canonical form

exp{−x/b + (a − 1) log x − log(Γ(a)b^a)} I(x > 0).

  - Treating b as the only parameter (a is a constant), the canonical parameterization is η = −1/b with T = X.
  - Note A(η) = log(Γ(a)b^a) = a log(−1/η) + log Γ(a), so we obtain

    MX(t) = exp{ a log(η/(η + t)) } = (1 − bt)^{−a}.

This gives

E[X] = ab,  E[X^2] = ab^2 + (ab)^2, ...

READING MATERIALS: Lehmann and Casella, Sections 1.4 and 1.5