
Department of Mathematics

Ma 3/103: Introduction to Probability and Statistics, KC Border, Winter 2017

Lecture 9: Joint Distributions

Relevant textbook passages: Pitman [3]: Chapter 5, Sections 6.4–6.5; Larsen–Marx [2]: Sections 3.7, 3.8, 3.9, 3.11.

9.1 Random vectors and joint distributions

Recall that a random variable X is a real-valued function on the sample space (S, E, P), where P is a probability measure on S, and that it induces a probability measure PX on R, called the distribution of X, given by

PX(B) = P(X ∈ B) = P({s ∈ S : X(s) ∈ B}),

for every (Borel) subset B of R. The distribution is enough to calculate the expectation of any (Borel) function of X. Now suppose I have more than one random variable on the same sample space. Then I can consider the random vector (X, Y), or X = (X1, . . . , Xn).

• A random vector X defines a probability measure PX on Rⁿ, called the distribution of X, via

PX(B) = P(X ∈ B) = P({s ∈ S : (X1(s), . . . , Xn(s)) ∈ B}),

for every (Borel) subset B of Rⁿ. This distribution is also called the joint distribution of X1, . . . , Xn.

• We can use this to define a joint cumulative distribution function, denoted FX , by

FX (x1, . . . , xn) = P (Xi ⩽ xi, for all i = 1, . . . , n)

• N.B. While the subscript X,Y or X is often used to identify a joint distribution or cumulative distribution function, it is also frequently omitted. You are supposed to figure out the domain of the function by inspecting its arguments.

9.1.1 Example Let S = {SS, SF, FS, FF} and let P be the probability measure on S defined by

P(SS) = 7/12, P(SF) = 3/12, P(FS) = 1/12, P(FF) = 1/12.

Define the random variables X and Y by

X(SS) = 1, X(SF) = 1, X(FS) = 0, X(FF) = 0,
Y(SS) = 1, Y(SF) = 0, Y(FS) = 1, Y(FF) = 0.

That is, X and Y indicate Success or Failure on two different experiments, but the experiments are not necessarily independent.


Then

PX(1) = 10/12, PX(0) = 2/12,
PY(1) = 8/12, PY(0) = 4/12,
PX,Y(1, 1) = 7/12, PX,Y(1, 0) = 3/12, PX,Y(0, 1) = 1/12, PX,Y(0, 0) = 1/12. □
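The following short Python sketch (an addition, not part of the original notes) tabulates this joint pmf exactly and recovers the marginals above; it also confirms that X and Y are not independent, since PX,Y(1, 1) = 7/12 differs from PX(1)PY(1).

from fractions import Fraction

# Joint pmf of (X, Y) from Example 9.1.1
pXY = {(1, 1): Fraction(7, 12), (1, 0): Fraction(3, 12),
       (0, 1): Fraction(1, 12), (0, 0): Fraction(1, 12)}

# Marginals: sum the joint pmf over the other coordinate
pX = {x: sum(p for (xx, y), p in pXY.items() if xx == x) for x in (0, 1)}
pY = {y: sum(p for (x, yy), p in pXY.items() if yy == y) for y in (0, 1)}

print(pX[1], pX[0])                  # 5/6, 1/6  (that is, 10/12 and 2/12)
print(pY[1], pY[0])                  # 2/3, 1/3  (that is, 8/12 and 4/12)
print(pXY[(1, 1)] == pX[1] * pY[1])  # False: X and Y are not independent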

9.2 Joint PMFs

A random vector X on a probability space (S, E, P) is discrete if you can enumerate its range. (Pitman [3]: Section 3.1, p. 348; Larsen–Marx [2]: Section 3.7.) When X1, . . . , Xn are discrete, the joint probability mass function of the random vector X = (X1, . . . , Xn) is usually denoted pX, and is given by

pX(x1, x2, . . . , xn) = P(X1 = x1 and X2 = x2 and ··· and Xn = xn).

If X and Y are independent random variables, then pX,Y(x, y) = pX(x)pY(y). For a function g of X and Y we have

E g(X, Y) = ∑_x ∑_y g(x, y) pX,Y(x, y).
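As a concrete illustration (not from the original notes), here is a minimal Python sketch that applies this formula to two independent fair dice with g(x, y) = max(x, y); the joint pmf is the product of the marginals because the dice are assumed independent.

from fractions import Fraction

# Joint pmf of two independent fair dice: pX,Y(x, y) = pX(x) pY(y) = 1/36
pXY = {(x, y): Fraction(1, 36) for x in range(1, 7) for y in range(1, 7)}

# E g(X, Y) = sum over x and y of g(x, y) pX,Y(x, y), here with g = max
Eg = sum(max(x, y) * p for (x, y), p in pXY.items())
print(Eg)   # 161/36, about 4.47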

9.3 Joint densities

Let X and Y be random variables on a probability space (S, E, P). The random vector (X, Y) has a joint density fX,Y(x, y) if for every (Borel) subset B ⊂ R² (Pitman [3]: Chapter 5.1; Larsen–Marx [2]: Section 3.7),

P((X, Y) ∈ B) = ∫∫_B fX,Y(x, y) dx dy.

If X and Y are independent, then fX,Y(x, y) = fX(x)fY(y). For example,

P(X ⩾ Y) = ∫_{−∞}^{∞} ∫_{y}^{∞} fX,Y(x, y) dx dy.

For a function g of X and Y we have

E g(X, Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) fX,Y(x, y) dx dy.

Again, the subscript X,Y or X on a density is frequently omitted.
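For a numerical illustration (an addition to the notes), the double integral for P(X ⩾ Y) can be evaluated with SciPy; the joint density below is that of two independent standard normals, an illustrative assumption, so by symmetry the answer should be 1/2.

import numpy as np
from scipy import integrate
from scipy.stats import norm

# Joint density of two independent standard normals (illustrative choice)
def f(x, y):
    return norm.pdf(x) * norm.pdf(y)

# P(X >= Y) = integral over y in (-inf, inf) of integral over x in (y, inf) of f(x, y)
# dblquad treats the integrand's first argument as the inner variable (limits gfun, hfun)
# and the second argument as the outer variable (limits a, b).
prob, err = integrate.dblquad(f,                 # integrand f(x, y); x is the inner variable
                              -np.inf, np.inf,   # outer variable y over (-inf, inf)
                              lambda y: y,       # inner variable x from y ...
                              lambda y: np.inf)  # ... to infinity
print(prob)   # ~ 0.5, as symmetry demands

A Monte Carlo estimate (drawing many (X, Y) pairs and averaging the indicator of X ⩾ Y) would do just as well and avoids the quadrature machinery.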

9.4 Recovering marginal distributions from joint distributions

So now we have random variables X and Y, and the random vector (X, Y). They have distributions PX, PY, and PX,Y. How are they related? The marginal distribution of X is just the distribution PX of X alone. We can recover its probability mass function from the joint probability mass function pX,Y as follows.


In the discrete case:

pX(x) = P(X = x) = ∑_y pX,Y(x, y).

Likewise

pY(y) = P(Y = y) = ∑_x pX,Y(x, y).

If X and Y are independent random variables, then pX,Y (x, y) = pX (x)pY (y).

For the density case, the marginal density of X, denoted fX, is given by (Larsen–Marx [2]: p. 169)

fX(x) = ∫_{−∞}^{∞} fX,Y(x, y) dy,

and the marginal density fY of Y is given by

fY(y) = ∫_{−∞}^{∞} fX,Y(x, y) dx.

The recovery of a marginal density of X from a joint density of X and Y is sometimes described as “integrating out” y.
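As a numerical check (an addition, not in the original notes), one can “integrate out” y on a computer. Below, the joint density is a standard bivariate normal with correlation ρ = 0.6, an illustrative assumption, whose X-marginal is standard normal.

import numpy as np
from scipy import integrate
from scipy.stats import norm

rho = 0.6   # illustrative correlation

def f_xy(x, y):
    # standard bivariate normal density with correlation rho
    q = (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
    return np.exp(-q / 2) / (2 * np.pi * np.sqrt(1 - rho**2))

# Marginal density of X at x0: integrate the joint density over y
x0 = 1.3
fX_x0, _ = integrate.quad(lambda y: f_xy(x0, y), -np.inf, np.inf)
print(fX_x0, norm.pdf(x0))   # both ~ 0.1714: the marginal is standard normal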

9.5 ⋆ Expectation of a random vector

Since random vectors are just vector-valued functions on a sample space S, we can add them and multiply them just like any other functions. For example, the sum of the random vectors X = (X1, . . . , Xn) and Y = (Y1, . . . , Yn) is given by

(X + Y)(s) = X(s) + Y(s) = (X1(s), . . . , Xn(s)) + (Y1(s), . . . , Yn(s)).

Thus the set of random vectors is a vector space. In fact, the subset of random vectors whose components have a finite expectation is also a vector subspace of the vector space of all random vectors. If X = (X1, . . . , Xn) is a random vector, and each Xi has expectation E Xi, the expectation of X is defined to be

EX = (E X1, . . . , E Xn).

• Expectation is a linear operator on the space of random vectors. This means that

E(aX + bY ) = a EX + b EY .

• Expectation is a positive operator on the space of random vectors. For vectors x = (x1, . . . , xn), define x ≧ 0 if xi ⩾ 0 for each i = 1, . . . , n. Then

X ≧ 0 =⇒ EX ≧ 0.

9.6 The distribution of a sum

We already know how to calculate the expectation of a sum of random variables: since expectation is a linear operator, the expectation of a sum is the sum of the expectations.


We are now in a position to describe the distribution of the sum of two random variables. Let Z = X + Y. (Pitman [3]: p. 147; Larsen–Marx [2]: p. 178ff.)

Discrete case:

P(Z = z) = ∑_{(x,y) : x+y=z} pX,Y(x, y) = ∑_{all x} pX,Y(x, z − x).
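A quick Python sketch (an addition to the notes) applies this formula to the sum of two independent fair dice; independence lets us write pX,Y(x, z − x) = pX(x) pY(z − x).

from fractions import Fraction

# pmfs of two independent fair dice
pX = {k: Fraction(1, 6) for k in range(1, 7)}
pY = dict(pX)

# P(Z = z) = sum over x of pX,Y(x, z - x) = sum over x of pX(x) pY(z - x)
pZ = {}
for z in range(2, 13):
    pZ[z] = sum(pX[x] * pY[z - x] for x in pX if (z - x) in pY)

print(pZ[7], pZ[2])   # 1/6 and 1/36, the familiar answers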

9.6.1 Density of a sum

If (X, Y) has joint density fX,Y(x, y), what is the density of X + Y? Recall that the density is the derivative of the cdf, so (Pitman [3]: pp. 372–373)

fX+Y(t) = d/dt P(X + Y ⩽ t)
        = d/dt ∫∫_{(x,y) : x ⩽ t−y} fX,Y(x, y) dx dy
        = d/dt ∫_{−∞}^{∞} ( ∫_{−∞}^{t−y} fX,Y(x, y) dx ) dy
        = ∫_{−∞}^{∞} d/dt ( ∫_{−∞}^{t−y} fX,Y(x, y) dx ) dy
        = ∫_{−∞}^{∞} fX,Y(t − y, y) dy.

So if X and Y are independent, we get the convolution

fX+Y(t) = ∫_{−∞}^{∞} fX(t − y) fY(y) dy.
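To see the convolution formula in action (an addition to the notes), here is a grid approximation for X, Y independent Uniform(0, 1), chosen because the answer, the triangular density on [0, 2], is known in closed form.

import numpy as np

dx = 0.001
grid = np.arange(0.0, 1.0, dx)
fX = np.ones_like(grid)        # Uniform(0,1) density sampled on the grid
fY = np.ones_like(grid)

# Discretized convolution: fZ(t) is approximately sum over y of fX(t - y) fY(y) dx
fZ = np.convolve(fX, fY) * dx
t = np.arange(fZ.size) * dx

# The triangular density is t on [0,1] and 2 - t on [1,2]
print(fZ[500], t[500])         # ~ 0.5 at t ~ 0.5
print(fZ.max())                # ~ 1.0, the peak at t = 1

A finer grid spacing dx trades runtime for accuracy; the shape of the triangle is already clear at this resolution.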

9.7 Covariance

When X and Y are independent, we proved that (Pitman [3]: § 6.4, p. 430)

Var(X + Y) = Var X + Var Y.

More generally, however, since expectation is a positive linear operator,

Var(X + Y) = E((X + Y) − E(X + Y))²
           = E((X − E X) + (Y − E Y))²
           = E((X − E X)² + 2(X − E X)(Y − E Y) + (Y − E Y)²)
           = Var(X) + Var(Y) + 2 E(X − E X)(Y − E Y).

9.7.1 Definition The covariance of X and Y is defined to be

Cov(X,Y ) = E(X − E X)(Y − E Y ). (1)

In general Var(X + Y ) = Var(X) + Var(Y ) + 2 Cov(X,Y ).
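A Monte Carlo sanity check of this identity (not in the original notes), using an arbitrarily constructed correlated pair:

import numpy as np

rng = np.random.default_rng(0)
U = rng.standard_normal(200_000)
V = rng.standard_normal(200_000)
X, Y = U, 0.5 * U + V          # a correlated pair with Cov(X, Y) = 0.5

lhs = np.var(X + Y)
rhs = np.var(X) + np.var(Y) + 2 * np.cov(X, Y)[0, 1]
print(lhs, rhs)                # both ~ 1 + 1.25 + 2(0.5) = 3.25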


There is another way to write the covariance:

Cov(X,Y ) = E(XY ) − E(X) E(Y ). (2)

Proof: Since expectation is a positive linear operator,

Cov(X, Y) = E((X − E(X))(Y − E(Y)))
          = E(XY − X E(Y) − Y E(X) + E(X) E(Y))
          = E(XY) − E(X) E(Y) − E(X) E(Y) + E(X) E(Y)
          = E(XY) − E(X) E(Y).

9.7.2 Remark It follows that for any random variable X,

Cov(X,X) = Var X.

If X and Y are independent, then Cov(X,Y ) = 0.

The converse is not true.

9.7.3 Example (Covariance = 0, but variables are not independent) (cf. Feller [1, p. 236]) Let X be a random variable that assumes the values ±1 and ±2, each with probability 1/4. (So E X = 0.) Define Y = X², and let Ȳ = E Y (= 2.5). Then

Cov(X, Y) = E(X(Y − Ȳ))
          = (1/4)·1·(1 − Ȳ) + (1/4)·(−1)·(1 − Ȳ) + (1/4)·2·(4 − Ȳ) + (1/4)·(−2)·(4 − Ȳ)
          = 0.

But X and Y are not independent:

P(X = 1 & Y = 1) = P(X = 1) = 1/4,

but P(X = 1) = 1/4 and P(Y = 1) = 1/2, so

P(X = 1) · P(Y = 1) = 1/8 ≠ 1/4 = P(X = 1 & Y = 1).
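The computation can be checked mechanically; the following sketch (an addition to the notes) works the example out in exact arithmetic.

from fractions import Fraction

# X takes the values -2, -1, 1, 2, each with probability 1/4; Y = X^2
pX = {x: Fraction(1, 4) for x in (-2, -1, 1, 2)}

EX  = sum(x * p for x, p in pX.items())       # 0
EY  = sum(x**2 * p for x, p in pX.items())    # 5/2
EXY = sum(x**3 * p for x, p in pX.items())    # E(XY) = E(X^3) = 0
print(EXY - EX * EY)                          # Cov(X, Y) = 0

# Yet X and Y are not independent:
PXY11 = pX[1]                                 # P(X = 1 and Y = 1) = P(X = 1) = 1/4
PX1_PY1 = pX[1] * (pX[1] + pX[-1])            # P(X = 1) P(Y = 1) = 1/4 * 1/2 = 1/8
print(PXY11, PX1_PY1)                         # 1/4 vs 1/8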

9.7.4 Example (Covariance = 0, but variables are not independent) Let U, V be independent and identically distributed random variables with E U = E V = 0. Define

X = U + V, Y = U − V.

Since E X = E Y = 0,

Cov(X, Y) = E(XY) = E((U + V)(U − V)) = E(U² − V²) = E U² − E V² = 0,

since U and V have the same distribution. But are X and Y independent?


If U and V are integer-valued, then X and Y are also integer-valued, but more importantly they have the same parity. That is, X is odd if and only if Y is odd. (This is a handy fact for KenKen solvers.) So let U and V be independent and assume the values ±1 and ±2, each with probability 1/4. (Then E U = E V = 0.) Then

P(X is odd) = P(X is even) = P(Y is odd) = P(Y is even) = 1/2,

but

P(X is even and Y is odd) = 0 ≠ 1/4 = P(X is even) P(Y is odd),

so X and Y are not independent. □

9.7.5 Remark The product (X − E X)(Y − E Y) is positive at outcomes s where X(s) and Y(s) are either both above or both below their means, and negative when one is above and the other is below. So one very loose interpretation of positive covariance is that the random variables tend to be both above average or both below average, rather than on opposite sides of their means. Of course this is just a tendency. (Pitman [3]: p. 432)

9.7.6 Example (The effects of covariance) For mean zero random variables that have a positive covariance, the joint density tends to concentrate on the diagonal. Figure 9.1 shows the joint density of two standard normals with various covariances. Figure 9.3 shows random samples from these distributions. □

9.8 A covariance menagerie

Recall that for independent random variables X and Y, Var(X + Y) = Var X + Var Y, and Cov(X, Y) = 0. For any random variable X with finite variance, Var X = E(X²) − (E X)², so E(X²) = Var X + (E X)². Also, if E X = 0, then Cov(X, Y) = E(XY). (Why?)

9.8.1 Theorem (A Covariance Menagerie) Let X1, . . . , Xn be independent and identically distributed random variables with common mean µ and variance σ². Define

S = ∑_{i=1}^{n} Xi and X̄ = S/n, and let

Di = Xi − X̄ (i = 1, . . . , n)

be the deviation of Xi from X̄. Then

1. E(XiXj) = (E Xi)(E Xj) = µ², for i ≠ j (by independence).

2. E(Xi²) = σ² + µ².

3. E(XiS) = ∑_{j≠i} E(XiXj) + E(Xi²) = (n − 1)µ² + σ² + µ² = σ² + nµ².

4. E(XiX̄) = (σ²/n) + µ².

5. E(S) = nµ.

6. Var(S) = nσ².


Figure 9.1. Joint density of standard normals, as covariance changes.


Figure 9.2. Contours of the joint density of standard normals, as covariance changes.



Figure 9.3. Random samples from standard normals, as covariance changes.

7. E(S²) = nσ² + n²µ².

8. E(X̄) = µ.

9. Var(X̄) = σ²/n.

10. E(X̄²) = (σ²/n) + µ².

11. E(Di) = 0, i = 1, . . . , n.

12. Var(Di) = E(Di²) = (n − 1)σ²/n.

Var(Di) = E(Xi − X̄)² = E(Xi²) − 2 E(XiX̄) + E(X̄²)
        = (σ² + µ²) − 2((σ²/n) + µ²) + ((σ²/n) + µ²) = (1 − 1/n)σ².

13. Cov(Di, Dj) = E(DiDj) = −σ²/n, for i ≠ j.

E(DiDj) = E((Xi − X̄)(Xj − X̄)) = E(XiXj) − E(XiX̄) − E(XjX̄) + E(X̄²)
        = µ² − [(σ²/n) + µ²] − [(σ²/n) + µ²] + [(σ²/n) + µ²] = −σ²/n.

Note that this means that deviations from the mean are negatively correlated. This makes sense, because if one variate is bigger than the mean, another must be smaller to offset the difference.

14. Cov(Di, S) = E(DiS) = 0.

E(DiS) = E((Xi − (S/n))S) = E(XiS) − E(S²/n) = (σ² + nµ²) − (nσ² + n²µ²)/n = 0.

15. Cov(Di, X̄) = E(DiX̄) = E(DiS)/n = 0.

The proof of each item is a straightforward plug-and-chug calculation. The only reason for writing this as a theorem is to be able to refer to it easily.
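A Monte Carlo spot check of a few of these items (an addition to the notes), with illustrative parameters n = 5, µ = 3, σ = 2:

import numpy as np

rng = np.random.default_rng(1)
n, reps = 5, 200_000
mu, sigma = 3.0, 2.0                                   # illustrative parameters

X = rng.normal(mu, sigma, size=(reps, n))              # each row: X_1, ..., X_n i.i.d.
Xbar = X.mean(axis=1)
D = X - Xbar[:, None]                                  # deviations D_i = X_i - Xbar

print(Xbar.var(), sigma**2 / n)                        # item 9:  Var(Xbar) = sigma^2/n      (~ 0.8)
print(D[:, 0].var(), (n - 1) * sigma**2 / n)           # item 12: Var(D_i) = (n-1)sigma^2/n  (~ 3.2)
print(np.cov(D[:, 0], D[:, 1])[0, 1], -sigma**2 / n)   # item 13: Cov(D_i, D_j) = -sigma^2/n (~ -0.8)
print(np.cov(D[:, 0], Xbar)[0, 1])                     # item 15: Cov(D_i, Xbar) = 0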

9.9 ⋆ Covariance matrix of a random vector

In general, we define the covariance matrix of a random vector X = (X1, . . . , Xn) to be the n × n matrix whose (i, j) entry is Cov(Xi, Xj):

Var X = [E(Xi − E Xi)(Xj − E Xj)]_{i,j} = [Cov(Xi, Xj)]_{i,j}.

9.10 ⋆ Variance of a linear combination of random variables

9.10.1 Proposition Let X = (X1, . . . , Xn) be a random vector with covariance matrix

Σ = [Cov(Xi, Xj)]_{i,j},

and let a = (a1, . . . , an). The random variable

Z = a · X = ∑_{i=1}^{n} aiXi

has variance given by

Var Z = aΣa′ = ∑_{i=1}^{n} ∑_{j=1}^{n} Cov(Xi, Xj) ai aj,

where a is treated as a row vector, and a′ is its transpose, a column vector.

Proof: This just uses the fact that expectation is a positive linear operator. Since adding constants doesn't change variance, we may subtract means and assume that each E Xi = 0. Then Cov(Xi, Xj) = E(XiXj), and Z has mean 0, so

Var Z = E Z² = E( (∑_{i=1}^{n} aiXi)(∑_{j=1}^{n} ajXj) )
      = E ∑_{i=1}^{n} ∑_{j=1}^{n} XiXj ai aj
      = ∑_{i=1}^{n} ∑_{j=1}^{n} E(XiXj) ai aj
      = ∑_{i=1}^{n} ∑_{j=1}^{n} Cov(Xi, Xj) ai aj.

Since Var Z ⩾ 0 and a is arbitrary, we see that Var X is a positive semidefinite matrix.
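The formula is easy to verify numerically; the covariance matrix and weights below are arbitrary illustrative values (an addition, not from the notes).

import numpy as np

Sigma = np.array([[ 2.0, 0.3, -0.5],
                  [ 0.3, 1.0,  0.2],
                  [-0.5, 0.2,  1.5]])      # an illustrative covariance matrix
a = np.array([1.0, -2.0, 0.5])             # weights

var_Z = a @ Sigma @ a                      # Var(a . X) = a Sigma a'
print(var_Z)                               # 4.275

# Monte Carlo check with a mean-zero normal vector X having covariance Sigma
rng = np.random.default_rng(2)
X = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)
print(np.var(X @ a))                       # ~ var_Z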

9.11 ⋆ An inner product for random variables

Since random variables are just functions on the probability space (S, E,P ), the set of random variables is a vector space under the usual operations of addition of functions and multiplication by scalars. The collection of random variables that have finite variance is a linear subspace, often denoted L2(P ), and it has a natural inner product.

9.11.1 Fact (Inner product on the space L2(P) of random variables) Let L2(P) denote the linear space of random variables that have finite variance. Then

(X,Y ) = E XY

is a real inner product on L2(P). The proof of this is straightforward, and is essentially the same as the proof that the Euclidean inner product on Rᵐ is an inner product.¹ The next result is just the Cauchy–Schwartz Inequality for this inner product, but I've written out a self-contained proof for you.

9.11.2 Cauchy–Schwartz Inequality

E(XY)² ⩽ (E X²)(E Y²), (3)

with equality only if X and Y are linearly dependent, that is, there exist a, b, not both zero, such that aX + bY = 0 a.s.

¹ As usual, there is a caveat for infinite sample spaces, namely that equality is replaced by equality almost surely.


Proof: If X or Y is zero (almost surely), then we have equality, so assume X, Y are nonzero. Define the quadratic polynomial Q: R → R by

Q(λ) = E((λX + Y)²) ⩾ 0.

Since expectation is a positive linear operator,

Q(λ) = λ² E(X²) + 2λ E(XY) + E(Y²).

Therefore the discriminant of the quadratic Q is nonpositive,² that is, 4 E(XY)² − 4 E(X²) E(Y²) ⩽ 0, or E(XY)² ⩽ E(X²) E(Y²). Equality in (3) can occur only if the discriminant is zero, in which case Q has a real root. That is, there is some λ for which Q(λ) = E((λX + Y)²) = 0. But this implies that λX + Y = 0 (almost surely).

9.11.3 Corollary |Cov(X, Y)|² ⩽ Var X Var Y. (4)

Proof: Apply the Cauchy–Schwartz inequality to the random variables X − E X and Y − E Y, and then take square roots.

9.12 Covariance is bilinear

Covariance is bilinear in each argument: for random variables X, Y, Z and constants a, b,

Cov(aX + bY, Z) = a Cov(X, Z) + b Cov(Y, Z),

and similarly in the second argument. This follows from formula (2) and the linearity of expectation.

9.13 Correlation

9.13.1 Definition The correlation between X and Y is defined to be

Corr(X, Y) = Cov(X, Y) / ((SD X)(SD Y)).

It is also equal to

Corr(X, Y) = Cov(X∗, Y∗) = E(X∗Y∗),

where X∗ and Y∗ are the standardizations of X and Y.

Let X have mean µX and standard deviation σX, and ditto for Y. Recall that

X∗ = (X − µX)/σX

has mean 0 and standard deviation 1. Thus by the alternate formula for covariance,

Cov(X∗, Y∗) = E(X∗Y∗) − E(X∗) E(Y∗) = E(X∗Y∗),

since E(X∗) = E(Y∗) = 0. Now

E(X∗Y∗) = E( ((X − µX)/σX) ((Y − µY)/σY) )
        = (E(XY) − E(X) E(Y)) / (σX σY)
        = Corr(X, Y).

Corollary 9.11.3 (the Cauchy–Schwartz Inequality) implies (Pitman [3]: p. 433):

−1 ⩽ Corr(X,Y ) ⩽ 1

If the correlation between X and Y is zero, then X − E X and Y − E Y are orthogonal with respect to our inner product.
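A short numerical illustration (an addition to the notes): standardizing and averaging the product reproduces the correlation. The pair X, Y below is an arbitrary construction with Corr(X, Y) = 0.8.

import numpy as np

rng = np.random.default_rng(3)
U = rng.standard_normal(200_000)
X = U + 0.5 * rng.standard_normal(200_000)   # Var X = 1.25
Y = U + 0.5 * rng.standard_normal(200_000)   # Var Y = 1.25, Cov(X, Y) = 1

Xs = (X - X.mean()) / X.std()                # standardizations X*, Y*
Ys = (Y - Y.mean()) / Y.std()

print(np.mean(Xs * Ys))                      # Corr(X, Y) = E(X* Y*) ~ 1/1.25 = 0.8
print(np.corrcoef(X, Y)[0, 1])               # agrees with NumPy's built-in correlation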

Bibliography

[1] W. Feller. 1968. An introduction to probability theory and its applications, 3rd ed., volume 1. New York: Wiley.

[2] R. J. Larsen and M. L. Marx. 2012. An introduction to mathematical statistics and its applications, 5th ed. Boston: Prentice Hall.

[3] J. Pitman. 1993. Probability. Springer Texts in Statistics. New York, Berlin, and Heidelberg: Springer.

² In case you have forgotten how you derived the quadratic formula in Algebra I, rewrite the polynomial as f(z) = αz² + βz + γ = (1/α)(αz + β/2)² − (β² − 4αγ)/(4α), and note that the only way to guarantee that f(z) ⩾ 0 for all z is to have α > 0 and β² − 4αγ ⩽ 0.
