CHAPTER 1: DISTRIBUTION THEORY

Basic Concepts

• Random variable (R.V.)
  - discrete random variable: probability mass function
  - continuous random variable: probability density function

• Random vector
  - multivariate version of a random variable
  - joint probability mass function, joint density function
• Formal definitions based on measure theory
  - an R.V. is a measurable function on a probability space, which consists of sets in a σ-field equipped with a probability measure;
  - probability mass function: the Radon-Nikodym derivative of the measure induced by the random variable w.r.t. a counting measure;
  - density function: the Radon-Nikodym derivative of the measure induced by the random variable w.r.t. the Lebesgue measure.
• Descriptive quantities of a distribution
  - cumulative distribution function

FX(x) ≡ P(X ≤ x)

  - moments: E[X^k], k = 1, 2, ... (mean if k = 1); central moments: E[(X − µ)^k] (variance if k = 2)
  - quantile function

FX^{-1}(p) ≡ inf{x : FX(x) > p}

  - skewness: E[(X − µ)^3]/Var(X)^{3/2}
  - kurtosis: E[(X − µ)^4]/Var(X)^2
• Characteristic function (c.f.)
  - definition: φX(t) = E[exp{itX}]
  - the c.f. uniquely determines the distribution:

FX(b) − FX(a) = lim_{T→∞} (1/2π) ∫_{−T}^{T} [(e^{−ita} − e^{−itb})/(it)] φX(t) dt

fX(x) = (1/2π) ∫_{−∞}^{∞} e^{−itx} φX(t) dt   for continuous r.v.

  - moment generating function (if it exists)

mX(t) = E[exp{tX}]

• Descriptive quantities of two R.V.s (X, Y)
  - covariance and correlation: Cov(X, Y), corr(X, Y)
  - conditional density or probability (the f's become probability mass functions if the R.V.s are discrete):

fX|Y(x|y) = fX,Y(x, y)/fY(y)

E[X|Y = y] = ∫ x fX|Y(x|y) dx

  - independence

P(X ≤ x, Y ≤ y) = P(X ≤ x)P(Y ≤ y)

fX,Y(x, y) = fX(x)fY(y)

• Double expectation formulae

E[X] = E[E[X|Y]]

Var(X) = E[Var(X|Y)] + Var(E[X|Y])
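As a quick numerical illustration of these two identities (an added sketch, assuming numpy is available; the Poisson-Gamma mixture is my own example, not the in-class one), take Y ∼ Gamma(θ, β) and X|Y ∼ Poisson(Y), so that E[X] = E[Y] and Var(X) = E[Y] + Var(Y):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, beta = 3.0, 2.0          # illustrative shape/scale for Y
n = 200_000

# simulate the two-stage experiment: Y first, then X given Y
y = rng.gamma(shape=theta, scale=beta, size=n)
x = rng.poisson(lam=y)

# double expectation: E[X] = E[E[X|Y]] = E[Y]
print(x.mean(), theta * beta)

# total variance: Var(X) = E[Var(X|Y)] + Var(E[X|Y]) = E[Y] + Var(Y)
print(x.var(), theta * beta + theta * beta**2)
```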

• This is useful for computing the moments of X when X given Y has a simple distribution!

In-class exercise Compute E[X] and Var(X) if Y ∼ Bernoulli(p) and X given Y = y follows N(µ_y, σ_y^2).

Discrete Random Variables

• Binomial distribution
  - A Binomial r.v. is the total number of successes in n independent Bernoulli trials.
  - Binomial(n, p):

P(Sn = k) = (n choose k) p^k (1 − p)^{n−k},  k = 0, ..., n

E[Sn] = np,  Var(Sn) = np(1 − p),  φX(t) = (1 − p + pe^{it})^n

  - Summation of two independent Binomial r.v.'s:

Binomial(n1, p) + Binomial(n2, p) ∼ Binomial(n1 + n2, p)
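A quick empirical check of this identity (an added sketch, assuming numpy and scipy; the parameters are illustrative), comparing the empirical pmf of the sum with the Binomial(n1 + n2, p) pmf:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n1, n2, p = 5, 8, 0.3
s = rng.binomial(n1, p, 100_000) + rng.binomial(n2, p, 100_000)

# empirical pmf of the sum vs. the Binomial(n1 + n2, p) pmf
ks = np.arange(n1 + n2 + 1)
emp = np.bincount(s, minlength=n1 + n2 + 1) / s.size
print(np.max(np.abs(emp - stats.binom.pmf(ks, n1 + n2, p))))  # should be small
```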

In-class exercise Prove the last statement.

• Negative Binomial distribution
  - It describes the distribution of the number of independent Bernoulli trials needed to obtain m successes.
  - NB(m, p):

P(Wm = k) = (k − 1 choose m − 1) p^m (1 − p)^{k−m},  k = m, m + 1, ...

E[Wm] = m/p,  Var(Wm) = m/p^2 − m/p

  - Summation of two independent r.v.'s:

NB(m1, p) + NB(m2, p) ∼ NB(m1 + m2, p)
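Because software libraries often parameterize the negative binomial by the number of failures rather than the number of trials, the safest sanity check here is a direct construction (an added sketch, assuming numpy; parameters are illustrative) of the mean, variance, and additivity stated above:

```python
import numpy as np

rng = np.random.default_rng(2)

def nb_trials(m, p, size):
    """Number of Bernoulli(p) trials until the m-th success (support m, m+1, ...).
    Equals m plus the number of failures, which numpy's negative_binomial draws."""
    return m + rng.negative_binomial(m, p, size)

m1, m2, p = 3, 4, 0.4
w = nb_trials(m1, p, 200_000) + nb_trials(m2, p, 200_000)
w_direct = nb_trials(m1 + m2, p, 200_000)

m = m1 + m2
print(w.mean(), w_direct.mean(), m / p)             # means agree with m/p
print(w.var(), w_direct.var(), m / p**2 - m / p)    # variances agree with m/p^2 - m/p
```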

In-class exercise Prove the last statement.

• Hypergeometric distribution
  - It describes the distribution of the number of black balls when we randomly draw n balls (without replacement) from an urn with M black and (N − M) white balls (urn model).
  - HG(N, M, n):

P(Sn = k) = (M choose k)(N − M choose n − k) / (N choose n),  k = 0, 1, ..., n

E[Sn] = nM/N,  Var(Sn) = n (M/N)(1 − M/N)(N − n)/(N − 1)

• Poisson distribution
  - It is used to describe the distribution of the number of incidences in an infinite population.
  - Poisson(λ):

P(X = k) = λ^k e^{−λ}/k!,  k = 0, 1, 2, ...

E[X] = Var(X) = λ,  φX(t) = exp{−λ(1 − e^{it})}

  - Summation of two independent r.v.'s:

Poisson(λ1) + Poisson(λ2) ∼ Poisson(λ1 + λ2)

• Poisson distribution related to Binomial distribution
  - Assume

X1 ∼ Poisson(λ1), X2 ∼ Poisson(λ2), X1 ⊥⊥ X2.

Then X1 | X1 + X2 = n ∼ Binomial(n, λ1/(λ1 + λ2)).

  - If Xn1, ..., Xnn are i.i.d. Bernoulli(pn) and n·pn → λ, then

P(Sn = k) = [n!/(k!(n − k)!)] pn^k (1 − pn)^{n−k} → (λ^k/k!) e^{−λ},  where Sn = Xn1 + ... + Xnn.
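Numerically the convergence is easy to see (an added sketch, assuming scipy): with n·pn = λ held fixed, the Binomial(n, pn) pmf approaches the Poisson(λ) pmf as n grows.

```python
import numpy as np
from scipy import stats

lam = 3.0
ks = np.arange(15)
for n in (10, 100, 1000, 10000):
    p_n = lam / n
    # maximal pointwise gap between Binomial(n, p_n) and Poisson(lam) pmfs
    gap = np.max(np.abs(stats.binom.pmf(ks, n, p_n) - stats.poisson.pmf(ks, lam)))
    print(n, gap)   # the gap shrinks as n grows
```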

In-class exercise Prove these statements.

• Multinomial Distribution

  - Assume Y1, ..., Yn are independent k-category random variables such that P(Yi = category l) = pl, with Σ_{l=1}^{k} pl = 1.
  - Nl, 1 ≤ l ≤ k, counts the number of times that {Y1, ..., Yn} takes category value l.
  - Then

    P(N1 = n1, ..., Nk = nk) = (n choose n1, ..., nk) p1^{n1} ··· pk^{nk},

n1 + ... + nk = n

  - the covariance matrix of (N1, ..., Nk) is n(diag(p1, ..., pk) − pp') with p = (p1, ..., pk)', i.e., the (l, l) entry is n pl(1 − pl) and the (l, l') entry is −n pl pl' for l ≠ l'.

Continuous Random Variables

• Uniform distribution
  - Uniform(a, b): fX(x) = I[a,b](x)/(b − a),

where IA(x) is an indicator function taking value 1 if x ∈ A and 0 otherwise.
  - E[X] = (a + b)/2,  Var(X) = (b − a)^2/12
  - The uniform distribution on [0, 1] is the most fundamental distribution: it is what each pseudo random number generator tries to generate numbers from.

• Normal distribution
  - It describes the distributional behavior of averaged data in practice.
  - N(µ, σ^2):

    fX(x) = (1/√(2πσ^2)) exp{−(x − µ)^2/(2σ^2)}

    E[X] = µ,  Var(X) = σ^2,  φX(t) = exp{itµ − σ^2 t^2/2}

• Gamma distribution
  - It is useful to describe the distribution of a positive random variable with a long right tail.
  - Gamma(θ, β):

fX(x) = [1/(β^θ Γ(θ))] x^{θ−1} exp{−x/β},  x > 0

E[X] = θβ,  Var(X) = θβ^2

  - θ is called the shape parameter and β is the scale parameter.
  - θ = 1 reduces to the so-called exponential distribution, Exp(β), and θ = n/2, β = 2 is equivalent to a chi-squared distribution with n degrees of freedom.
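A small sampling check in the shape-scale convention used here (an added sketch with illustrative parameters; numpy's gamma generator also uses shape and scale, so no conversion is needed):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, beta = 2.5, 1.5                      # shape and scale, as in Gamma(θ, β) above
x = rng.gamma(shape=theta, scale=beta, size=500_000)

print(x.mean(), theta * beta)               # E[X] = θβ
print(x.var(), theta * beta**2)             # Var(X) = θβ^2
```

With scipy, the same distribution is stats.gamma(a=θ, scale=β); rate-based parameterizations (as in the Caution below) would instead pass scale = 1/rate.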

Caution In some references, the notation Gamma(θ, β) is used to refer to the distribution Gamma(θ, 1/β) as defined here (i.e., β is a rate rather than a scale)!

• Cauchy distribution
  - Cauchy(a, b):

fX(x) = 1/[bπ {1 + (x − a)^2/b^2}]

  - E[|X|] = ∞,  φX(t) = exp{iat − |bt|}
  - This distribution is often used as a counterexample: its mean does not exist.

Algebra of Random Variables

• Assumption
  - We assume that X and Y are independent with distribution functions FX and FY, respectively.
  - We are interested in the distribution or density of

X + Y, XY, X/Y.

  - For XY and X/Y, we assume Y to be positive for convenience, although this is not necessary.

• Summation of X and Y
  - Derivation:

FX+Y(z) = E[I(X + Y ≤ z)] = EY[EX[I(X ≤ z − Y)|Y]]
        = EY[FX(z − Y)] = ∫ FX(z − y) dFY(y)

  - Convolution formula (the density version refers to the situation when X and Y are continuous):

    FX+Y(z) = ∫ FY(z − x) dFX(x) ≡ FX ∗ FY(z)

fX ∗ fY(z) ≡ ∫ fX(z − y)fY(y) dy = ∫ fY(z − x)fX(x) dx

• Product and quotient of X and Y (Y > 0)
  - FXY(z) = E[E[I(XY ≤ z)|Y]] = ∫ FX(z/y) dFY(y)

    fXY(z) = ∫ fX(z/y) (1/y) fY(y) dy

  - FX/Y(z) = E[E[I(X/Y ≤ z)|Y]] = ∫ FX(yz) dFY(y)

    fX/Y(z) = ∫ fX(yz) y fY(y) dy
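As an added illustration of the quotient formula (a sketch; the half-normal choice of Y is my own example): for X ∼ N(0, 1) and Y = |Z| with Z ∼ N(0, 1) independent of X, the formula gives fX/Y(z) = ∫_0^∞ φ(yz) y · 2φ(y) dy = 1/[π(1 + z^2)], the standard Cauchy density, which a simulation confirms:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.standard_normal(200_000)
y = np.abs(rng.standard_normal(200_000))     # half-normal, so Y > 0

# Kolmogorov-Smirnov distance between X/Y and the standard Cauchy: small
print(stats.kstest(x / y, "cauchy"))
```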

In-class exercise How can the distributions be obtained if Y may not be positive?

• Application of the algebra formulae
  - N(µ1, σ1^2) + N(µ2, σ2^2) ∼ N(µ1 + µ2, σ1^2 + σ2^2)
  - Gamma(r1, θ) + Gamma(r2, θ) ∼ Gamma(r1 + r2, θ)

  - Poisson(λ1) + Poisson(λ2) ∼ Poisson(λ1 + λ2)

  - Negative Binomial(m1, p) + Negative Binomial(m2, p) ∼ Negative Binomial(m1 + m2, p)

In-class exercise Verify the summation formula for Poisson r.v.'s.

• Deriving the distribution of a sum of R.V.s using characteristic functions
  - Key property If X and Y are independent, then

φX+Y(t) = φX(t)φY(t)
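A quick empirical check of this key property before the worked normal example below (an added sketch; the exponential/uniform pair is my own choice, and the empirical characteristic function is just the sample average of exp{itX}):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=2.0, size=200_000)     # any two independent samples will do
y = rng.uniform(-1.0, 1.0, size=200_000)

def ecf(sample, t):
    """Empirical characteristic function: average of exp(i t X) over the sample."""
    return np.mean(np.exp(1j * t * sample))

for t in (0.3, 1.0, 2.5):
    print(ecf(x + y, t), ecf(x, t) * ecf(y, t))  # the two columns nearly agree
```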

  - For example, if X and Y are normal with

    φX(t) = exp{iµ1 t − σ1^2 t^2/2},  φY(t) = exp{iµ2 t − σ2^2 t^2/2},

⇒ φX+Y(t) = exp{i(µ1 + µ2)t − (σ1^2 + σ2^2)t^2/2}.

• More distributions generated from simple random variables
  - Assume X ∼ N(0, 1), Y ∼ χ^2_m and Z ∼ χ^2_n are independent. Then

X/√(Y/m) ∼ Student's t(m),

(Y/m)/(Z/n) ∼ Snedecor's F(m, n),

Y/(Y + Z) ∼ Beta(m/2, n/2).

• Densities of the t-, F- and Beta-distributions
  - f_{t(m)}(x) = [Γ((m + 1)/2)/(√(πm) Γ(m/2))] · (1 + x^2/m)^{−(m+1)/2} · I_{(−∞,∞)}(x)

  - f_{F(m,n)}(x) = [Γ((m + n)/2)/(Γ(m/2)Γ(n/2))] · (m/n)^{m/2} x^{m/2−1}/(1 + mx/n)^{(m+n)/2} · I_{(0,∞)}(x)

  - f_{Beta(a,b)}(x) = [Γ(a + b)/(Γ(a)Γ(b))] x^{a−1}(1 − x)^{b−1} I(x ∈ (0, 1))

  - It is always recommended that an indicator function be included in the density expression to indicate the support of the r.v.

• Exponential vs Beta distributions

  - Assume Y1, ..., Yn+1 are i.i.d. Exp(θ).
  - Then Zi = (Y1 + ... + Yi)/(Y1 + ... + Yn+1) ∼ Beta(i, n − i + 1).
  - In fact, (Z1, ..., Zn) has the same joint distribution as the order statistics (ξn:1, ..., ξn:n) of n independent Uniform(0, 1) random variables.

Transformation of Random Vectors

Theorem 1.3 Suppose that X is a k-dimensional random vector with density function fX(x1, ..., xk). Let g be a one-to-one and continuously differentiable map from R^k to R^k. Then Y = g(X) is a random vector with density function

fX(g^{-1}(y1, ..., yk)) |J_{g^{-1}}(y1, ..., yk)|,

where g^{-1} is the inverse of g and J_{g^{-1}} is the Jacobian of g^{-1}.

In-class exercise Prove it by a change of variables in multivariate integration.

• An application
  - Let R^2 ∼ Exp(2), R > 0, and Θ ∼ Uniform(0, 2π) be independent;
  - X = R cos Θ and Y = R sin Θ are two independent standard normal random variables;
  - this can be applied to simulate normally distributed data (see the sketch below).
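A minimal Box-Muller sketch of this application (added here, assuming numpy and scipy; it uses the fact that −2 log U ∼ Exp(2) for U ∼ Uniform(0, 1)):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 200_000

u1, u2 = rng.uniform(size=n), rng.uniform(size=n)
r = np.sqrt(-2.0 * np.log(u1))       # R^2 = -2 log U1 ~ Exp(2)
theta = 2.0 * np.pi * u2             # Θ ~ Uniform(0, 2π)

x, y = r * np.cos(theta), r * np.sin(theta)

print(x.mean(), x.var(), np.corrcoef(x, y)[0, 1])   # ≈ 0, 1, 0
print(stats.kstest(x, "norm"))                      # x looks standard normal
```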

In-class exercise Prove this statement.

In-class remark It is fundamental to have a mechanism to generate data from U(0, 1) (a pseudo random number generator). To generate data from any distribution FX, generate FX^{-1}(U) where U ∼ U(0, 1). Alternative algorithms include the acceptance-rejection method, MCMC, etc.

Multivariate Normal Distribution

• Definition
  Y = (Y1, ..., Yn)' is said to have a multivariate normal distribution with mean vector µ = (µ1, ..., µn)' and non-degenerate covariance matrix Σ_{n×n} if its joint density is

fY(y1, ..., yn) = [1/((2π)^{n/2}|Σ|^{1/2})] exp{−(1/2)(y − µ)'Σ^{-1}(y − µ)}

• Characteristic function

φY(t) = E[e^{it'Y}]
      = ∫ (2π)^{−n/2}|Σ|^{−1/2} exp{it'y − (1/2)(y − µ)'Σ^{-1}(y − µ)} dy
      = ∫ (2π)^{−n/2}|Σ|^{−1/2} exp{−(1/2) y'Σ^{-1}y + (it + Σ^{-1}µ)'y − (1/2) µ'Σ^{-1}µ} dy
      = exp{−µ'Σ^{-1}µ/2} ∫ (2π)^{−n/2}|Σ|^{−1/2} exp{−(1/2)(y − Σit − µ)'Σ^{-1}(y − Σit − µ) + (1/2)(Σit + µ)'Σ^{-1}(Σit + µ)} dy
      = exp{−µ'Σ^{-1}µ/2 + (1/2)(Σit + µ)'Σ^{-1}(Σit + µ)}

      = exp{it'µ − (1/2) t'Σt}.

• Some remarks
  - The multivariate normal distribution is uniquely determined by µ and Σ.
  - For the standard multivariate normal distribution,

φX(t) = exp{−t't/2}.

  - the moment generating function of Y is

exp{t'µ + t'Σt/2}.

In-class exercise Derive the moments of the standard normal distribution.

• Linear transformations of normal r.v.s

Theorem 1.4 If Y = A_{n×k} X_{k×1} where X ∼ N(0, I) (standard multivariate normal distribution), then Y's characteristic function is given by

φY(t) = exp{−t'Σt/2},  t = (t1, ..., tn)' ∈ R^n,

where Σ = AA', so rank(Σ) = rank(A); i.e., Y ∼ N(0, Σ).

Conversely, if φY(t) = exp{−t'Σt/2} with Σ_{n×n} ≥ 0 of rank k, i.e., Y ∼ N(0, Σ), then there exists a matrix A with rank(A) = k such that Y = A_{n×k} X_{k×1} with X ∼ N(0, I_{k×k}).

Proof of Theorem 1.4 For the first part, since Y = AX and X ∼ N(0, I_{k×k}), we obtain

φY(t) = E[exp{it'(AX)}]
      = E[exp{i(A't)'X}]
      = exp{−(A't)'(A't)/2}
      = exp{−t'AA't/2}.

Thus, Y ∼ N(0, Σ) with Σ = AA'. Clearly, rank(Σ) = rank(AA') = rank(A).

For the second part, suppose φY(t) = exp{−t'Σt/2}. There exists an orthogonal matrix O such that

Σ = O'DO,  D = diag(d1, ..., dk, 0, ..., 0),  d1, ..., dk > 0.

Define Z = OY; then

φZ(t) = E[exp{it'(OY)}] = E[exp{i(O't)'Y}]

      = exp{−(O't)'Σ(O't)/2} = exp{−d1 t1^2/2 − ... − dk tk^2/2}.

Thus, Z1, ..., Zk are independent N(0, d1), ..., N(0, dk), while Zk+1, ..., Zn are zero. Let Xi = Zi/√di, i = 1, ..., k, and write O' = (B_{n×k}, C_{n×(n−k)}). Clearly,

Y = O'Z = B diag{√d1, ..., √dk} X,

so the result holds by letting A = B diag{√d1, ..., √dk}.
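The construction in this proof can be turned directly into a sampler for N(0, Σ) (a minimal added sketch, assuming numpy; the covariance matrix below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.0, 0.5],
                  [0.3, 0.5, 1.5]])          # illustrative positive definite Σ

# eigendecomposition as in the proof; numpy returns Σ = V diag(d) V'
d, V = np.linalg.eigh(Sigma)
A = V @ np.diag(np.sqrt(np.clip(d, 0.0, None)))   # A with AA' = Σ

X = rng.standard_normal((3, 100_000))        # X ~ N(0, I)
Y = A @ X                                    # Y = AX ~ N(0, Σ)

print(np.cov(Y))                             # empirical covariance ≈ Σ
```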

In-class remark Note that this proof also shows the way to generate data from N(0, Σ).

• Conditional normal distributions

Theorem 1.5 Suppose that Y = (Y1, ..., Yk, Yk+1, ..., Yn)' has a multivariate normal distribution with mean µ = (µ^{(1)'}, µ^{(2)'})' and a non-degenerate covariance matrix

Σ = ( Σ11  Σ12
      Σ21  Σ22 ).

Let Y^{(1)} = (Y1, ..., Yk)' and Y^{(2)} = (Yk+1, ..., Yn)'. Then
(i) Y^{(1)} ∼ N_k(µ^{(1)}, Σ11).
(ii) Y^{(1)} and Y^{(2)} are independent iff Σ12 = Σ21 = 0.
(iii) For any matrix A_{m×n}, AY has a multivariate normal distribution with mean Aµ and covariance AΣA'.
(iv) The conditional distribution of Y^{(1)} given Y^{(2)} is

Y^{(1)} | Y^{(2)} ∼ N_k(µ^{(1)} + Σ12 Σ22^{-1}(Y^{(2)} − µ^{(2)}), Σ11 − Σ12 Σ22^{-1} Σ21).

Proof of Theorem 1.5 We prove (iii) first. Since φY(t) = exp{it'µ − t'Σt/2},

φ_{AY}(t) = E[e^{it'AY}]
          = E[e^{i(A't)'Y}]
          = exp{i(A't)'µ − (A't)'Σ(A't)/2}
          = exp{it'(Aµ) − t'(AΣA')t/2}.

Thus, AY ∼ N(Aµ, AΣA').

(i) follows from (iii) by letting A = (I_{k×k}, 0_{k×(n−k)}).

To prove (ii), note that

φY(t) = exp[ it^{(1)'}µ^{(1)} + it^{(2)'}µ^{(2)}

        − (1/2){ t^{(1)'}Σ11 t^{(1)} + 2 t^{(1)'}Σ12 t^{(2)} + t^{(2)'}Σ22 t^{(2)} } ].

Thus, Y^{(1)} and Y^{(2)} are independent iff the characteristic function factors into a product of separate functions of t^{(1)} and

t^{(2)}, which holds iff Σ12 = 0.

To prove (iv), consider Z^{(1)} = Y^{(1)} − µ^{(1)} − Σ12 Σ22^{-1}(Y^{(2)} − µ^{(2)}).

From (iii), Z^{(1)} ∼ N(0, Σ_Z) with

Σ_Z = Cov(Y^{(1)}, Y^{(1)}) − 2 Σ12 Σ22^{-1} Cov(Y^{(2)}, Y^{(1)}) + Σ12 Σ22^{-1} Cov(Y^{(2)}, Y^{(2)}) Σ22^{-1} Σ21

    = Σ11 − Σ12 Σ22^{-1} Σ21.

On the other hand,

Cov(Z^{(1)}, Y^{(2)}) = Cov(Y^{(1)}, Y^{(2)}) − Σ12 Σ22^{-1} Cov(Y^{(2)}, Y^{(2)}) = 0,

so Z^{(1)} is independent of Y^{(2)} by (ii). Therefore, the conditional distribution of Z^{(1)} given Y^{(2)} is the same as the unconditional distribution of Z^{(1)}, which is N(0, Σ_Z). That is,

Z^{(1)} | Y^{(2)} ∼ Z^{(1)} ∼ N(0, Σ11 − Σ12 Σ22^{-1} Σ21).

The result holds.
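A short simulation check of Theorem 1.5(iv) for a bivariate normal (an added sketch with illustrative parameters; it conditions on Y2 falling in a narrow bin rather than on an exact value):

```python
import numpy as np

rng = np.random.default_rng(8)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

Y = rng.multivariate_normal(mu, Sigma, size=1_000_000)
y2 = -1.0                                         # condition on Y2 ≈ y2
sel = np.abs(Y[:, 1] - y2) < 0.02                 # narrow bin around y2
y1_given = Y[sel, 0]

cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (y2 - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
print(y1_given.mean(), cond_mean)                 # ≈ µ1 + Σ12 Σ22^{-1}(y2 − µ2)
print(y1_given.var(), cond_var)                   # ≈ Σ11 − Σ12 Σ22^{-1} Σ21
```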

In-class remark The constructed Z has a geometric interpretation: it is (Y^{(1)} − µ^{(1)}) minus its projection on the normalized variable Σ22^{-1/2}(Y^{(2)} − µ^{(2)}).

• An example: the measurement error model
  - X and U are independent with X ∼ N(0, σx^2) and U ∼ N(0, σu^2).
  - The additive measurement error model assumes

Y = X + U.

  - Note that the covariance matrix of (X, Y) is

( σx^2          σx^2
  σx^2    σx^2 + σu^2 )

so from (iv),

X | Y ∼ N( [σx^2/(σx^2 + σu^2)] Y, σx^2 − σx^4/(σx^2 + σu^2) ) ≡ N(λY, σx^2(1 − λ)).

  - Here, λ = σx^2/σy^2 is called the reliability coefficient, where σy^2 = σx^2 + σu^2.

• Quadratic forms of normal random variables
  - Sum of squares of independent standard normal r.v.s:

N(0, 1)^2 + ... + N(0, 1)^2 ∼ χ^2_n ≡ Gamma(n/2, 2).

  - If Y ∼ N(0, Σ_{n×n}) with Σ > 0, then

Y'Σ^{-1}Y ∼ χ^2_n.

Proof: Y can be expressed as AX where X ∼ N(0, I) and A = Σ^{1/2}. Thus,

Y'Σ^{-1}Y = X'A'(AA')^{-1}AX = X'X ∼ χ^2_n.
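A quick simulation check of this fact (an added sketch, assuming numpy and scipy; the Σ below is illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
Sigma = np.array([[1.5, 0.4, 0.2],
                  [0.4, 1.0, 0.3],
                  [0.2, 0.3, 2.0]])
n = Sigma.shape[0]

Y = rng.multivariate_normal(np.zeros(n), Sigma, size=200_000)
q = np.einsum("ij,jk,ik->i", Y, np.linalg.inv(Sigma), Y)   # Y'Σ^{-1}Y for each draw

print(q.mean(), n)                          # E[χ²_n] = n
print(q.var(), 2 * n)                       # Var[χ²_n] = 2n
print(stats.kstest(q, "chi2", args=(n,)))   # KS distance to χ²_n is small
```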

In-class exercise What is the distribution of Y'BY for an n × n symmetric matrix B?

• Noncentral chi-square distribution
  - Assume X ∼ N(µ, I_{n×n}).
  - Define Y = X'X ∼ χ^2_n(δ) with δ = µ'µ.
  - Y has a noncentral chi-square distribution and its density function is a Poisson-probability weighted summation of gamma densities:

fY(y) = Σ_{k=0}^{∞} [exp{−δ/2}(δ/2)^k / k!] Gamma(y; (2k + n)/2, 2).
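The in-class remark that follows describes the equivalent two-stage sampling scheme; here is a minimal sketch of it (added, assuming numpy; µ is illustrative), compared against direct draws of X'X with X ∼ N(µ, I):

```python
import numpy as np

rng = np.random.default_rng(10)
mu = np.array([1.0, -0.5, 2.0])
n, delta = mu.size, float(mu @ mu)
size = 300_000

# direct construction: Y = X'X with X ~ N(µ, I)
X = mu + rng.standard_normal((size, n))
y_direct = np.sum(X**2, axis=1)

# two-stage construction: S ~ Poisson(δ/2), then Y | S = k ~ Gamma((2k+n)/2, 2)
S = rng.poisson(delta / 2, size)
y_mixture = rng.gamma(shape=(2 * S + n) / 2, scale=2.0)

print(y_direct.mean(), y_mixture.mean(), n + delta)          # E = n + δ
print(y_direct.var(), y_mixture.var(), 2 * (n + 2 * delta))  # Var = 2(n + 2δ)
```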

In-class remark Equivalently, we generate S from Poisson(δ/2) and, given S = k, we generate Y ∼ Gamma((2k + n)/2, 2).

• More about normal distributions
  - noncentral t-distribution: N(δ, 1)/√(χ^2_n/n) (numerator and denominator independent)
  - noncentral F-distribution: (χ^2_n(δ)/n)/(χ^2_m/m) (numerator and denominator independent)
  - One trick to evaluate E[X'AX] when X ∼ N(µ, Σ):

E[X'AX] = E[tr(X'AX)] = E[tr(AXX')] = tr(A E[XX']) = tr(A(µµ' + Σ)) = µ'Aµ + tr(AΣ)

Distribution Family

• Location-scale family

  - aX + b has density fX((x − b)/a)/a, a > 0, b ∈ R, with mean aE[X] + b and variance a^2 Var(X)
  - Examples include the normal distribution, uniform distribution, gamma distributions, etc.

• Exponential family
  - The exponential family includes many commonly used distributions, such as the binomial and Poisson distributions for discrete variables and the normal, gamma, and Beta distributions for continuous variables.
  - The distributions within this family have a general expression for their densities and share some family-specific properties.

• General form of densities in an exponential family

{ pθ(x)}, which consists of densities indexed by θ (one or more parameters), is said to form an s-parameter exponential family if

pθ(x) = exp{ Σ_{k=1}^{s} ηk(θ)Tk(x) − B(θ) } h(x),

where h(x) ≥ 0 and

exp{B(θ)} = ∫ exp{ Σ_{k=1}^{s} ηk(θ)Tk(x) } h(x) dµ(x) < ∞.

  - When (η1(θ), ..., ηs(θ))' is exactly θ itself, the above form is called the canonical form of the exponential family.

• Examples
  - X1, ..., Xn ∼ N(µ, σ^2):

exp{ (µ/σ^2) Σ_{i=1}^{n} xi − (1/(2σ^2)) Σ_{i=1}^{n} xi^2 − nµ^2/(2σ^2) } · (1/(√(2π) σ))^n.

Its canonical parameterization is θ1 = µ/σ^2 and θ2 = 1/σ^2.

  - X ∼ Binomial(n, p):

exp{ x log(p/(1 − p)) + n log(1 − p) } (n choose x).

Its canonical parameterization is θ = log{p/(1 − p)}.

  - X ∼ Poisson(λ):

P(X = x) = exp{x log λ − λ}/x!

Its canonical parameterization is θ = log λ.

• Moment generating function in the exponential family

Theorem 1.6 Suppose the densities of an exponential family can be written in the canonical form

exp{ Σ_{k=1}^{s} ηk Tk(x) − A(η) } h(x),

where η = (η1, ..., ηs)'. Then the moment generating function (MGF) of (T1, ..., Ts), defined as

MT(t1, ..., ts) = E [exp{t1T1 + ... + tsTs}]

for t = (t1, ..., ts)', is given by

MT(t) = exp{A(η + t) − A(η)}.

Proof of Theorem 1.6 Since

exp{A(η)} = ∫ exp{ Σ_{k=1}^{s} ηk Tk(x) } h(x) dµ(x),

we have

MT(t) = E[exp{t1T1 + ... + tsTs}]

      = ∫ exp{ Σ_{k=1}^{s} (ηk + tk)Tk(x) − A(η) } h(x) dµ(x)

      = exp{A(η + t) − A(η)}.

• MGF and cumulant generating function (CGF)

  - The MGF can be used to calculate the moments of (T1, ..., Ts).
  - Define the cumulant generating function (CGF)

KT(t1, ..., ts) = log MT(t1, ..., ts) = A(η + t) − A(η).

The coefficients in its Taylor expansion are called the cumulants of (T1, ..., Ts).
  - The first two cumulants are the mean and the variance.

• Examples of using the MGF
  - Normal distribution: η = µ/σ^2 and

A(η) = µ^2/(2σ^2) = η^2σ^2/2.

Thus,

MT(t) = exp{ (σ^2/2)((η + t)^2 − η^2) } = exp{µt + t^2σ^2/2}.

  - For X ∼ N(0, σ^2), since T = X, from the Taylor expansion we have

E[X^{2r+1}] = 0,  E[X^{2r}] = 1 · 3 ··· (2r − 1) σ^{2r},  r = 1, 2, ...

  - The Gamma distribution has a canonical form

exp{−x/b + (a − 1) log x − log(Γ(a)b^a)} I(x > 0).

  - Treating b as the only parameter (a is a constant), the canonical parameterization is η = −1/b with T = X.
  - Note A(η) = log(Γ(a)b^a) = a log(−1/η) + log Γ(a), so we obtain

    MX(t) = exp{ a log(η/(η + t)) } = (1 − bt)^{−a}.

This gives

E[X] = ab,  E[X^2] = ab^2 + (ab)^2, ...

READING MATERIALS: Lehmann and Casella, Sections 1.4 and 1.5