
Multivariate probability distributions

Patrik Hoyer
Dept. of Computer Science, University of Helsinki

Contents:

• Joint distribution
• Marginal distribution
• Conditional distribution
• Independence
• Generating data
• Expectation, variance, covariance, correlation
• Multivariate Gaussian distribution
• Multivariate linear regression
• Estimating a distribution from sample data

• Random variable
  - sample space (set of possible elementary outcomes)
  - probability distribution over the sample space
• Examples:
  - The throw of a die

    x      1    2    3    4    5    6
    P(x)   1/6  1/6  1/6  1/6  1/6  1/6

  - The sum of two dice

    x      2     3     4     5    6     7    8     9    10    11    12
    P(x)   1/36  1/18  1/12  1/9  5/36  1/6  5/36  1/9  1/12  1/18  1/36

  - Two separate dice (red, blue)

    x      (1,1)  (1,2)  (1,3)  (1,4)  (1,5)  (1,6)  (2,1)  (2,2)  (2,3)  ...  (6,6)
    P(x)   1/36   1/36   1/36   1/36   1/36   1/36   1/36   1/36   1/36   ...  1/36

• Discrete variables:
  - Finite number of states (e.g. the dice examples)
  - Infinite number of states (e.g. how many heads before the first tails in a sequence of coin tosses?)
• Continuous variables: Each particular state has a probability of zero, so we need the concept of a probability density:

    $P(X \le x) = \int_{-\infty}^{x} p(t)\,dt$

  (e.g. how long until the next bus arrives? what will be the price of oil a year from now?)

• A probability distribution satisfies...

  1. Probabilities are non-negative:

    $P(X = x) = P_X(x) = P(x) \ge 0$

  2. Sum to one:

    $\sum_x P(x) = 1$   (discrete)
    $\int p(x)\,dx = 1$   (continuous)

  [Note that in the discrete case this means that there exists no value of x such that P(x) > 1. However, this does not in general hold for a continuous density p(x)!]

• The joint distribution of two random variables:
  - Let X and Y be random variables. Their joint distribution is

    P(x, y) = P(X = x and Y = y)

  - Example: Two coin tosses, X denotes the first throw, Y the second
    (note: independence!)

                      Y = H   Y = T
    P(x, y):  X = H    0.25    0.25
              X = T    0.25    0.25

  - Example: X: Rain today? Y: Rain tomorrow?

                      Y = Y   Y = N
    P(x, y):  X = Y    0.5     0.2
              X = N    0.1     0.2

• Marginal distribution:
  - 'Interested in or observing only one of the two variables'
  - The distribution is obtained by summing (or integrating) over the other variable:

    $P(x) = \sum_y P(x, y)$,   $p(x) = \int p(x, y)\,dy$

  - Example (continued): What is the probability of rain tomorrow? That is, what is P(y)?

                      Y = Y   Y = N
              X = Y    0.5     0.2
              X = N    0.1     0.2

    P(y):              0.6     0.4

    In the same fashion, we can calculate that the chance of rain today is 0.7.
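    As a quick numerical check of the marginalization formula (not part of the original slides), here is a minimal NumPy sketch that recovers P(y) and P(x) from the rain example's joint table; the row/column layout of the array is an assumption made for illustration.

```python
import numpy as np

# Joint distribution P(x, y) from the rain example:
# rows index X (rain today: yes, no), columns index Y (rain tomorrow: yes, no).
P_xy = np.array([[0.5, 0.2],
                 [0.1, 0.2]])

P_y = P_xy.sum(axis=0)  # marginal over X: P(y) = sum_x P(x, y)
P_x = P_xy.sum(axis=1)  # marginal over Y: P(x) = sum_y P(x, y)

print(P_y)  # [0.6 0.4] -> chance of rain tomorrow is 0.6
print(P_x)  # [0.7 0.3] -> chance of rain today is 0.7
```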

• Conditional distribution:
  - 'If we observe X = x, how does that affect our belief about the value of Y?'
  - Obtained by selecting the appropriate row/column of the joint distribution, and renormalizing it to sum to one:

    $P(y \mid X = x) = P(y \mid x) = \dfrac{P(x, y)}{P(x)}$,   $p(y \mid x) = \dfrac{p(x, y)}{p(x)}$

  - Example (continued): What is the probability that it rains tomorrow, given that it does not rain today? That is, what is P(y | X = 'no rain')?

                      Y = Y   Y = N
              X = Y    0.5     0.2
              X = N    0.1     0.2

    P(y | X = 'no rain'):  0.1 / (0.1 + 0.2) ≈ 0.33   and   0.2 / (0.1 + 0.2) ≈ 0.67
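    A matching sketch for the conditioning step (again not from the original slides), using the same assumed array layout for the rain example's joint table:

```python
import numpy as np

P_xy = np.array([[0.5, 0.2],   # X = rain today (row 0: yes, row 1: no)
                 [0.1, 0.2]])  # Y = rain tomorrow (col 0: yes, col 1: no)

# Condition on X = 'no rain' (row 1): select the row and renormalize it.
row = P_xy[1, :]
P_y_given_no_rain = row / row.sum()
print(P_y_given_no_rain)  # [0.333... 0.666...]
```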

⇒ Chain rule:

    $P(x, y) = P(x)\,P(y \mid x) = P(y)\,P(x \mid y)$
    $p(x, y) = p(x)\,p(y \mid x) = p(y)\,p(x \mid y)$

• So the joint distribution can be specified directly, or using the marginal and conditional distributions (one can even choose 'which way' to specify it)

• Independence: Two random variables are independent if and only if knowing the value of one does not change our belief about the second:

    $\forall x: P(y \mid x) = P(y) \quad\Leftrightarrow\quad \forall y: P(x \mid y) = P(x)$

  This is equivalent to being able to write the joint distribution as the product of the marginals:

    $P(x, y) = P(x)\,P(y)$

  We write this as: $X \perp\!\!\!\perp Y$

  or, if we want to explicitly specify the distribution: $(X \perp\!\!\!\perp Y)_P$

• Example: Two coin tosses...
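  For finite tables, the factorization criterion can be checked mechanically. The following sketch (not from the original slides; the helper name `is_independent` is illustrative) compares the coin-toss table with the rain table:

```python
import numpy as np

def is_independent(P_xy, tol=1e-12):
    """Check whether a joint table factorizes as the product of its marginals."""
    P_x = P_xy.sum(axis=1, keepdims=True)   # marginal of X as a column
    P_y = P_xy.sum(axis=0, keepdims=True)   # marginal of Y as a row
    return np.allclose(P_xy, P_x * P_y, atol=tol)

coins = np.array([[0.25, 0.25],
                  [0.25, 0.25]])
rain  = np.array([[0.5, 0.2],
                  [0.1, 0.2]])

print(is_independent(coins))  # True: the two tosses are independent
print(is_independent(rain))   # False: rain today and rain tomorrow are dependent
```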

• Three or more variables:
  - joint distribution: P(v, w, x, y, z, ...)   ('multidimensional array/function')
  - marginal distributions, e.g.:

    $P(x) = \sum_{v,w,y,z,\dots} P(v, w, x, y, z, \dots)$
    $P(x, y) = \sum_{v,w,z,\dots} P(v, w, x, y, z, \dots)$

  - conditional distributions, e.g.:

    $P(x \mid v, w, y, z, \dots) = P(v, w, x, y, z, \dots)\,/\,P(v, w, y, z, \dots)$
    $P(x, y \mid v, w, z, \dots) = P(v, w, x, y, z, \dots)\,/\,P(v, w, z, \dots)$
    $P(v, w, y, z, \dots \mid x) = P(v, w, x, y, z, \dots)\,/\,P(x)$
    $P(x \mid y) = \sum_{v,w,z,\dots} P(v, w, x, z, \dots \mid y)$   ← both marginal and conditional

  - Chain rule:

    $P(v, w, x, y, z, \dots) = P(v)\,P(w \mid v)\,P(x \mid v, w)\,P(y \mid v, w, x)\,P(z \mid v, w, x, y)\,P(\dots \mid v, w, x, y, z)$

  - Complete independence between all variables if and only if:

    $P(v, w, x, y, z, \dots) = P(v)\,P(w)\,P(x)\,P(y)\,P(z)\,P(\dots)$

  - Conditional independence (e.g. if we know the value of z, then x does not give any additional information about y):

    $P(x, y \mid z) = P(x \mid z)\,P(y \mid z)$

    This is also written $X \perp\!\!\!\perp Y \mid Z$, or, explicitly noting the distribution, $(X \perp\!\!\!\perp Y \mid Z)_P$

  - In general, we can say that marginal distributions are conditional on not knowing the value of other variables:

    $P(x) = P(x \mid \emptyset)$

    and (marginal) independence is independence conditional on not observing other variables:

    $P(x, y \mid \emptyset) = P(x \mid \emptyset)\,P(y \mid \emptyset)$

  - Example of conditional independence: Drownings and ice-cream sales. These are mutually dependent (both happen during warm weather) but are, at least approximately, conditionally independent given the weather

  - Example of conditional dependence: Two coin tosses and a bell that rings whenever they get the same result. The coins are marginally independent but conditionally dependent given the bell!

    X: First coin toss    Y: Second coin toss    Z: Bell

                      Y = H   Y = T
    P(x, y):  X = H    0.25    0.25     (independent)
              X = T    0.25    0.25

                                          Y = H   Y = T
    P(x, y | Z = 'bell rang'):  X = H      0.5     0       (dependent!)
                                X = T      0       0.5
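    The following sketch (not from the original slides) builds the full joint table P(x, y, z) for this example; the array layout and the 0/1 state encoding are assumptions made for illustration. It reproduces both tables above:

```python
import numpy as np

# Joint P(x, y, z): axis 0 = X (H, T), axis 1 = Y (H, T), axis 2 = Z (no bell, bell).
# The bell rings exactly when the two tosses agree.
P = np.zeros((2, 2, 2))
for x in range(2):
    for y in range(2):
        P[x, y, int(x == y)] = 0.25   # index 1 = 'bell rang', index 0 = 'no bell'

# Marginally, X and Y are independent:
P_xy = P.sum(axis=2)
print(P_xy)                                   # all entries 0.25

# Conditioned on Z = 'bell rang', they become dependent:
P_xy_given_bell = P[:, :, 1] / P[:, :, 1].sum()
print(P_xy_given_bell)                        # [[0.5, 0. ], [0. , 0.5]]
```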

• Data generation, sampling
  - Given some P(x), how can we draw samples (generate data) from that distribution?

    Answer: Divide the unit interval [0, 1] into parts corresponding to the probabilities, draw a uniformly distributed number from that interval, and select the state whose part the number falls into:

    [Figure: the unit interval [0, 1] divided into segments of widths P(x1), P(x2), ..., P(x6); the drawn number 0.30245... falls into the segment for x2, so X := x2]
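    A minimal sketch of this unit-interval method in NumPy (not from the original slides; the function name `sample_discrete` is illustrative). The cumulative sums of the probabilities serve as the segment boundaries:

```python
import numpy as np

def sample_discrete(probs, n, rng=None):
    """Draw n samples from a discrete distribution by placing the probabilities
    on the unit interval and locating uniform random numbers within it."""
    rng = np.random.default_rng() if rng is None else rng
    cdf = np.cumsum(probs)                 # right edges of the segments on [0, 1]
    u = rng.uniform(size=n)                # uniform draws on [0, 1]
    idx = np.searchsorted(cdf, u)          # segment index each draw falls into
    return np.minimum(idx, len(cdf) - 1)   # guard against round-off at the right edge

probs = [1/6] * 6                          # a fair die
samples = sample_discrete(probs, 10000)
print(np.bincount(samples, minlength=6) / 10000)   # each state with frequency ~1/6
```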

• Given a joint distribution P(x, y, z), how can we draw samples (generate data)?
  - We could list all joint states, then proceed as above, or...
  - Draw data sequentially from conditional distributions:
    1. First draw x from P(x)
    2. Next y from P(y | x)
    3. Finally z from P(z | x, y)

    Note: We can freely choose any ordering of the variables!

  Example (continued): Two coin tosses and a bell that rings if and only if the two tosses give the same result
  - can draw all the variables simultaneously by listing all the joint states, calculating their probabilities, placing them on the unit interval, and then drawing the joint state
  - can first independently generate the coin tosses, then assign the bell
  - can first draw one coin toss and the bell, and then assign the second coin toss
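  A sketch of the second option (not from the original slides; the 0/1 state encoding is an assumption for illustration): the tosses are generated independently, and the bell is then assigned deterministically from P(z | x, y).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bell_example(n):
    """Ancestral sampling: draw x, then y, then z from their conditionals."""
    x = rng.integers(0, 2, size=n)   # first coin toss: P(x)
    y = rng.integers(0, 2, size=n)   # second coin toss: P(y | x) = P(y)
    z = (x == y).astype(int)         # bell: P(z | x, y) puts all mass on 'x == y'
    return x, y, z

x, y, z = sample_bell_example(100000)
print(z.mean())                                  # the bell rings about half the time
print(np.corrcoef(x[z == 1], y[z == 1])[0, 1])   # 1.0: perfectly dependent given the bell
```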

• Numerical random variables
  - Expectation:

    $E\{X\} = \sum_x x\,P(x)$   (discrete)
    $E\{X\} = \int x\,p(x)\,dx$   (continuous)

  - Variance: $\mathrm{Var}(X) = \sigma_X^2 = \sigma_{XX} = E\{(X - E\{X\})^2\}$
  - Covariance: $\mathrm{Cov}(X, Y) = \sigma_{XY} = E\{(X - E\{X\})(Y - E\{Y\})\}$
  - Correlation coefficient: $\rho_{XY} = \dfrac{\sigma_{XY}}{\sqrt{\sigma_X^2\,\sigma_Y^2}}$
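  A small sketch (not from the original slides) estimating these quantities from simulated data; the particular toy data-generating process is only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10000)
y = 2 * x + rng.normal(size=10000)       # a variable correlated with x

mean_x = np.mean(x)                      # sample estimate of E{X}
var_x = np.var(x)                        # sample estimate of Var(X) = sigma_X^2
cov_xy = np.cov(x, y, bias=True)[0, 1]   # sample estimate of Cov(X, Y) = sigma_XY
rho_xy = cov_xy / np.sqrt(np.var(x) * np.var(y))   # correlation coefficient

print(mean_x, var_x, cov_xy, rho_xy)     # roughly 0, 1, 2, 0.89
```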

  - Multivariate numerical random variables (random vectors)

    Expectation:

    $E\{V\} = \begin{pmatrix} E\{V_1\} \\ E\{V_2\} \\ \vdots \\ E\{V_N\} \end{pmatrix}$

    Covariance matrix ('variance-covariance matrix'):

    $C_V = \Sigma_V = E\{(V - E\{V\})(V - E\{V\})^T\} = \begin{pmatrix} \mathrm{Var}(V_1) & \cdots & \mathrm{Cov}(V_1, V_N) \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}(V_N, V_1) & \cdots & \mathrm{Var}(V_N) \end{pmatrix} = \begin{pmatrix} \sigma_{V_1 V_1} & \cdots & \sigma_{V_1 V_N} \\ \vdots & \ddots & \vdots \\ \sigma_{V_N V_1} & \cdots & \sigma_{V_N V_N} \end{pmatrix}$
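    A corresponding sketch for random vectors (not from the original slides), estimating the mean vector and covariance matrix from simulated observations; the mean and covariance used to generate the data are only illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
# 10000 observations of a 3-dimensional random vector V, one row per observation.
V = rng.multivariate_normal(mean=[1.0, 0.0, -1.0],
                            cov=[[2.0, 0.5, 0.0],
                                 [0.5, 1.0, 0.3],
                                 [0.0, 0.3, 1.0]],
                            size=10000)

E_V = V.mean(axis=0)            # sample estimate of E{V}
C_V = np.cov(V, rowvar=False)   # sample estimate of the covariance matrix C_V

print(E_V)   # close to [1, 0, -1]
print(C_V)   # close to the covariance matrix used above
```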

• Conditional expectation, variance, covariance, correlation
  - Conditional expectation (note: a function of y!):

    $E\{X \mid Y = y\} = \sum_x x\,P(x \mid y)$   (discrete)
    $E\{X \mid Y = y\} = \int x\,p(x \mid y)\,dx$   (continuous)

  - Conditional variance (note: a function of y!):

    $\mathrm{Var}(X \mid Y = y) = \sigma^2_{X|y} = \sigma_{XX|y} = E\{(X - E\{X\})^2\}_{P(X \mid Y = y)}$

  - Conditional covariance (note: a function of z!):

    $\mathrm{Cov}(X, Y \mid z) = \sigma_{XY|z} = E\{(X - E\{X\})(Y - E\{Y\})\}_{P(X, Y \mid Z = z)}$

  - Conditional correlation coefficient (note: a function of z!):

    $\rho_{XY|z} = \dfrac{\sigma_{XY|z}}{\sqrt{\sigma^2_{X|z}\,\sigma^2_{Y|z}}}$

• Multivariate Gaussian ('normal') density:

  [The following is excerpted from 'gaussian identities' by sam roweis (revised July 1999)]

  A d-dimensional multidimensional gaussian (normal) density for x is

  $p(x) = \mathcal{N}(\mu, \Sigma) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2}\exp\!\left(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$   (1)

  where $\Sigma$ is a symmetric positive semi-definite covariance matrix. It has entropy

  $S = \tfrac{1}{2}\log\!\left[(2\pi e)^d\,|\Sigma|\right] - \mathrm{const}$  bits   (2)

  where the (unfortunate) constant is the log of the units in which x is measured over the 'natural units'.

  Linear functions of a normal vector x: no matter how x is distributed,

  $E[Ax + y] = A\,E[x] + y$   (3a)
  $\mathrm{Covar}[Ax + y] = A\,\mathrm{Covar}[x]\,A^T$   (3b)

  In particular this means that for normally distributed quantities:

  $x \sim \mathcal{N}(\mu, \Sigma) \;\Rightarrow\; (Ax + y) \sim \mathcal{N}(A\mu + y,\; A\Sigma A^T)$   (4a)
  $x \sim \mathcal{N}(\mu, \Sigma) \;\Rightarrow\; \Sigma^{-1/2}(x - \mu) \sim \mathcal{N}(0, I)$   (4b)
  $x \sim \mathcal{N}(\mu, \Sigma) \;\Rightarrow\; (x - \mu)^T \Sigma^{-1} (x - \mu) \sim \chi^2_n$   (4c)

• The multivariate Gaussian has the following properties:

  - mean vector (µ) and covariance matrix (Σ) as the only parameters

  - all marginal and conditional distributions are also Gaussian, and the conditional (co)variances do not depend on the values of the conditioning variables:

    Let x and y be random vectors whose dimensions are n and m. If they are joined together into one random vector $z = (x^T, y^T)^T$, with dimension n + m, then its mean $m_z$ and covariance matrix $C_z$ are

    $m_z = \begin{pmatrix} m_x \\ m_y \end{pmatrix}, \qquad C_z = \begin{pmatrix} C_x & C_{xy} \\ C_{yx} & C_y \end{pmatrix},$   (1)

    where $m_x$ and $m_y$ are the means of x and y, $C_x$ and $C_y$ are the covariance matrices of x and y respectively, and $C_{xy}$ contains the cross covariances. If z is multivariate Gaussian, then x and y are also Gaussian. Additionally, the conditional distributions p(x | y) and p(y | x) are Gaussian. The latter's mean and covariance matrix are

    $m_{y|x} = m_y + C_{yx}\,C_x^{-1}\,(x - m_x)$   (2)
    $C_{y|x} = C_y - C_{yx}\,C_x^{-1}\,C_{xy}$   (3)
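    A sketch implementing equations (2) and (3) above (not from the original slides; the function name `gaussian_conditional` and the toy numbers are assumptions for illustration):

```python
import numpy as np

def gaussian_conditional(m_z, C_z, n, x):
    """Mean and covariance of p(y | x) when z = (x, y) is jointly Gaussian.
    m_z, C_z: mean and covariance of z; n: dimension of x; x: observed value.
    Implements equations (2) and (3)."""
    m_x, m_y = m_z[:n], m_z[n:]
    C_x, C_xy = C_z[:n, :n], C_z[:n, n:]
    C_yx, C_y = C_z[n:, :n], C_z[n:, n:]
    m_y_given_x = m_y + C_yx @ np.linalg.solve(C_x, x - m_x)
    C_y_given_x = C_y - C_yx @ np.linalg.solve(C_x, C_xy)
    return m_y_given_x, C_y_given_x

# Toy example: x and y are both one-dimensional with correlation 0.8.
m_z = np.array([0.0, 0.0])
C_z = np.array([[1.0, 0.8],
                [0.8, 1.0]])
print(gaussian_conditional(m_z, C_z, 1, np.array([2.0])))
# mean 1.6, covariance 0.36: observing x = 2 shifts and sharpens our belief about y
```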

Let v be a Gaussian random vector over three variables $(v_1, v_2, v_3)^T$ whose mean $m_v = E\{v\} = 0$, and covariance matrix

$C_v = E\{v v^T\} = \begin{pmatrix} 6 & 2 & 1 \\ 2 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}.$   (4)

Calculate the marginal distribution $p(v_1, v_3)$. Are $v_1$ and $v_3$ independent? What is their correlation coefficient?

Are v2 and v3 independent?

Are v2 and v3 conditionally independent, given v1?
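One way to check these questions numerically (this sketch is not part of the original exercise; it reuses the conditional-covariance formula (3) above):

```python
import numpy as np

C_v = np.array([[6.0, 2.0, 1.0],
                [2.0, 1.0, 0.0],
                [1.0, 0.0, 1.0]])

# Marginal covariance of (v1, v3): pick the corresponding rows and columns.
C_13 = C_v[np.ix_([0, 2], [0, 2])]
print(C_13)                                    # [[6, 1], [1, 1]]

# Correlation coefficient of v1 and v3 (nonzero, so not independent).
rho_13 = C_v[0, 2] / np.sqrt(C_v[0, 0] * C_v[2, 2])
print(rho_13)                                  # about 0.41

# v2 and v3: their marginal covariance is 0, so (being Gaussian) they are independent.
print(C_v[1, 2])                               # 0.0

# Partial covariance of v2 and v3 given v1: C_23 - C_21 * C_11^{-1} * C_13.
partial_cov_23_given_1 = C_v[1, 2] - C_v[1, 0] * C_v[0, 2] / C_v[0, 0]
print(partial_cov_23_given_1)                  # -1/3: conditionally dependent given v1
```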

• The conditional variance, conditional covariance, and conditional correlation coefficient, for the Gaussian distribution, are known as the partial variance $\sigma^2_{X \cdot Z}$, partial covariance $\sigma_{XY \cdot Z}$, and partial correlation coefficient $\rho_{XY \cdot Z}$ (respectively)
• These can of course always be computed directly from the covariance matrix (regardless of whether the distribution actually is Gaussian!)...

...but they can only be safely interpreted as conditional variance, conditional covariance, and conditional correlation coefficient (respectively) for the Gaussian distribution.

UNIVERSITY OF HELSINKI Patrik Hoyer 23 Dept. of Computer Science • for Gaussian: zero (partial) covariance ⇔ zero (conditional) covariance ⇔ (conditional) independence

  i.e. $(\sigma_{XY \cdot Z} = 0) \;\Leftrightarrow\; (\forall z: \sigma_{XY|z} = 0) \;\Leftrightarrow\; (X \perp\!\!\!\perp Y \mid Z)$

• in general: we only have one-way implication: zero (conditional) covariance ⇐ (conditional) independence

  i.e. $(\forall z: \sigma_{XY|z} = 0) \;\Leftarrow\; (X \perp\!\!\!\perp Y \mid Z)$

  Note, however, that conditional independence does not imply zero partial covariance in the completely general case!

• Linear regression: $\hat{y} = r_{yx}\,x + \epsilon_y$

  [Figure: scatter plot of the data, y plotted against x, with the fitted regression line]

• Fit a line through the data, explaining how y varies with x.
• Minimize the sum-of-squares error between $\hat{y}$ and y.
• $r_{yx} = \dfrac{\sigma_{XY}}{\sigma_X^2}$
• Probabilistic interpretation: $\hat{y} \approx E\{Y \mid X = x\}$
  (note that this is true only for roughly linear relationships)
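  A sketch checking the coefficient formula on simulated data (not from the original slides; the toy slope 1.5 and noise level are assumptions). The covariance-based coefficient is compared against an ordinary least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=5000)
y = 1.5 * x + rng.normal(scale=0.5, size=5000)

# Regression coefficient from the (co)variances: r_yx = sigma_XY / sigma_X^2.
r_yx = np.cov(x, y, bias=True)[0, 1] / np.var(x)
print(r_yx)                      # close to 1.5

# The same least-squares slope from np.polyfit, for comparison.
print(np.polyfit(x, y, 1)[0])
```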

• Note the symmetry: We could equally well regress x on y!

  $\hat{x} = r_{xy}\,y + \epsilon_x$

  [Figure: the same data plotted with x against y, showing the regression of x on y]

• Multivariate linear regression:

  $\hat{z} = a\,x + b\,y + \epsilon_z$

  $a = r_{zx \cdot y} = \dfrac{\sigma_{ZX \cdot Y}}{\sigma^2_{X \cdot Y}}$

  [Figure: the data plotted over the three variables x, y, and z]

  Note that the partial regression coefficient $r_{zx \cdot y}$ is NOT, in general, the same as the coefficient $r_{zx}$ one gets from regressing z on x while ignoring y.

  Note also that $r_{zx \cdot y}$ is derived from the partial (co)variances. This holds regardless of the form of the underlying distribution.
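  A sketch illustrating this on simulated data (not from the original slides; the true coefficients a = 2 and b = -1 are assumptions for illustration). The partial regression coefficient computed from the partial (co)variances recovers the coefficient for x, while the simple regression of z on x ignoring y does not:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20000
y = rng.normal(size=n)
x = 0.8 * y + rng.normal(scale=0.6, size=n)              # x and y are correlated
z = 2.0 * x - 1.0 * y + rng.normal(scale=0.5, size=n)    # true coefficients: a = 2, b = -1

C = np.cov(np.vstack([x, y, z]), bias=True)              # 3x3 covariance matrix, order (x, y, z)
s_xx, s_xy, s_xz = C[0, 0], C[0, 1], C[0, 2]
s_yy, s_yz = C[1, 1], C[1, 2]

# Partial (co)variances given y, computed directly from the covariance matrix:
s_zx_dot_y = s_xz - s_yz * s_xy / s_yy
s_xx_dot_y = s_xx - s_xy ** 2 / s_yy

a = s_zx_dot_y / s_xx_dot_y        # partial regression coefficient r_{zx.y}
print(a)                           # close to 2.0

# Ignoring y gives a different coefficient r_zx:
print(s_xz / s_xx)                 # noticeably different from 2.0
```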
