Random Variables and Densities

Probability and Statistics. Sam Roweis, Machine Learning Summer School, January 2005.

• Review: a random variable X represents outcomes or states of the world. Instantiations of variables are usually written in lower case: x. We write p(x) to mean probability(X = x).
• Sample space: the space of all possible outcomes/states. (May be discrete or continuous or mixed.)
• Probability mass (density) function p(x) ≥ 0: assigns a non-negative number to each point in the sample space and sums (integrates) to unity: ∑_x p(x) = 1 or ∫_x p(x) dx = 1. Intuitively: how often does x occur, how much do we believe in x.
• Ensemble: random variable + sample space + probability function.
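As a tiny illustrative aside (not from the original slides), here is what these definitions look like on a made-up discrete sample space, using NumPy:

```python
import numpy as np

# Hypothetical sample space: faces of a biased six-sided die (numbers invented).
p = np.array([0.25, 0.25, 0.20, 0.15, 0.10, 0.05])

assert np.all(p >= 0)              # p(x) >= 0 for every point in the sample space
assert np.isclose(p.sum(), 1.0)    # sums to unity: sum_x p(x) = 1

print("p(X = 3) =", p[2])          # how much we believe in the outcome x = 3
```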

Probability

• We use p(x) to represent our beliefs B(x) about the states x of the world.
• There is a formal calculus for manipulating uncertainties represented by probabilities.
• Any consistent set of beliefs obeying the Cox Axioms can be mapped into probabilities:
  1. Rationally ordered degrees of belief:
     if B(x) > B(y) and B(y) > B(z) then B(x) > B(z)
  2. Belief in x and its negation x̄ are related: B(x) = f[B(x̄)]
  3. Belief in conjunction depends only on conditionals:
     B(x and y) = g[B(x), B(y|x)] = g[B(y), B(x|y)]

Expectations, Moments

• Expectation of a function a(x) is written E[a] or ⟨a⟩:
  E[a] = ⟨a⟩ = ∑_x p(x) a(x)
  e.g. mean = ∑_x x p(x), variance = ∑_x (x − E[x])² p(x)
• Moments are expectations of higher order powers. (Mean is first moment. Autocorrelation is second moment.)
• Centralized moments have lower moments subtracted away (e.g. variance, skew, kurtosis).
• Deep fact: knowledge of all orders of moments completely defines the entire distribution.
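A minimal sketch of these expectation formulas on a discrete ensemble (the pmf below is invented for illustration; the code is an addition, not from the slides):

```python
import numpy as np

x = np.arange(6)                                  # sample space {0, ..., 5}
p = np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1])      # arbitrary pmf for illustration

def expectation(a, p, x):
    """E[a] = sum_x p(x) a(x)."""
    return np.sum(p * a(x))

mean = expectation(lambda v: v, p, x)                  # first moment
var = expectation(lambda v: (v - mean) ** 2, p, x)     # second central moment
second_moment = expectation(lambda v: v ** 2, p, x)    # uncentred second moment

print(mean, var, second_moment - mean ** 2)   # last two agree: Var[x] = E[x^2] - E[x]^2
```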

Means, Variances and Covariances

• Remember the definition of the mean and covariance of a vector random variable:
  E[x] = ∫ x p(x) dx = m
  Cov[x] = E[(x − m)(x − m)⊤] = ∫ (x − m)(x − m)⊤ p(x) dx = V
  which is the expected value of the outer product of the variable with itself, after subtracting the mean.
• Also, the covariance between two variables:
  Cov[x, y] = E[(x − m_x)(y − m_y)⊤] = ∫ (x − m_x)(y − m_y)⊤ p(x, y) dx dy = C
  which is the expected value of the outer product of one variable with another, after subtracting their means.
• Note: C is not symmetric.

Marginal Probabilities

• We can "sum out" part of a joint distribution to get the marginal distribution of a subset of variables:
  p(x) = ∑_y p(x, y)
• This is like adding slices of the table together (collapsing the joint table along the summed-out axis).
• Another equivalent definition: p(x) = ∑_y p(x|y) p(y).
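A small NumPy sketch of both ideas, summing out a variable of a joint table and computing a covariance as an expected product; the 2x3 table is invented and the code is an illustrative addition:

```python
import numpy as np

# Joint table p(x, y) over x in {0, 1} and y in {0, 1, 2} (made-up numbers).
pxy = np.array([[0.10, 0.20, 0.10],
                [0.25, 0.15, 0.20]])

px = pxy.sum(axis=1)   # marginal p(x) = sum_y p(x, y): add slices of the table
py = pxy.sum(axis=0)   # marginal p(y) = sum_x p(x, y)

# Cross-covariance Cov[x, y] = sum_{x,y} (x - m_x)(y - m_y) p(x, y)  (scalars here).
xs, ys = np.arange(2), np.arange(3)
mx, my = xs @ px, ys @ py
C = sum(pxy[i, j] * (xs[i] - mx) * (ys[j] - my) for i in range(2) for j in range(3))
print(px, py, C)
```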

Joint Probability

• Key concept: two or more random variables may interact. Thus, the probability of one taking on a certain value depends on which value(s) the others are taking.
• We call this a joint ensemble and write
  p(x, y) = prob(X = x and Y = y)

Conditional Probability

• If we know that some event has occurred, it changes our belief about the probability of other events.
• This is like taking a "slice" through the joint table: e.g. fixing z in the joint p(x, y, z) gives a slice proportional to p(x, y|z).
• p(x|y) = p(x, y)/p(y)
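As a sketch (an illustrative addition, reusing the invented joint table from above), a conditional is just a slice of the joint, renormalized:

```python
import numpy as np

pxy = np.array([[0.10, 0.20, 0.10],    # p(x, y): rows index x, columns index y
                [0.25, 0.15, 0.20]])

py = pxy.sum(axis=0)                   # p(y)
px_given_y = pxy / py                  # p(x | y) = p(x, y) / p(y), one column per value of y

print(px_given_y[:, 1])                # the "slice" at y = 1, renormalized
print(px_given_y.sum(axis=0))          # each conditional sums to 1
```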


Bayes' Rule

• Manipulating the basic definition of conditional probability gives one of the most important formulas in probability:
  p(x|y) = p(y|x) p(x) / p(y) = p(y|x) p(x) / ∑_x′ p(y|x′) p(x′)
• This gives us a way of "reversing" conditional probabilities.
• Thus, all joint probabilities can be factored by selecting an ordering for the random variables and using the "chain rule":
  p(x, y, z, . . .) = p(x) p(y|x) p(z|x, y) p(. . . |x, y, z)

Entropy

• Measures the amount of ambiguity or uncertainty in a distribution:
  H(p) = −∑_x p(x) log p(x)
• Expected value of −log p(x) (a function which depends on p(x)!).
• H(p) > 0 unless only one possible outcome, in which case H(p) = 0.
• Maximal value when p is uniform.
• Tells you the expected "cost" if each event costs −log p(event).
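A minimal numeric sketch of Bayes' rule and the entropy formula; the prior and likelihood arrays are invented, and the code is an addition for illustration:

```python
import numpy as np

prior = np.array([0.7, 0.3])                 # p(x), invented
likelihood = np.array([[0.9, 0.2],           # p(y | x): rows index y, columns index x
                       [0.1, 0.8]])

# Bayes' rule: p(x | y) = p(y | x) p(x) / sum_x' p(y | x') p(x')
joint = likelihood * prior                   # element [y, x] = p(y | x) p(x)
posterior = joint / joint.sum(axis=1, keepdims=True)
print(posterior[0])                          # p(x | y = 0)

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), with the convention 0 log 0 = 0."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

print(entropy(prior), entropy(np.array([0.5, 0.5])))   # the uniform case is maximal
```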

Independence & Conditional Independence

• Two variables are independent iff their joint factors:
  p(x, y) = p(x) p(y)
  (The joint table is then the outer product of the two marginals p(x) and p(y).)
• Two variables are conditionally independent given a third one if for all values of the conditioning variable, the resulting slice factors:
  p(x, y|z) = p(x|z) p(y|z)  ∀z

Cross Entropy (KL Divergence)

• An asymmetric measure of the distance between two distributions:
  KL[p‖q] = ∑_x p(x) [log p(x) − log q(x)]
• KL > 0 unless p = q, in which case KL = 0.
• Tells you the extra cost if events were generated by p(x) but instead of charging under p(x) you charged under q(x). (See the code sketch below.)

Statistics

• Probability: inferring probabilistic quantities for data given fixed models (e.g. prob. of events, marginals, conditionals, etc).
• Statistics: inferring a model given fixed data observations (e.g. clustering, classification, regression).
• Many approaches to statistics: frequentist, Bayesian, decision theory, ...

(Conditional) Probability Tables

• For discrete (categorical) quantities, the most basic parametrization is the probability table, which lists p(x_i = kth value).
• Since PTs must be nonnegative and sum to 1, for k-ary variables there are k − 1 free parameters.
• If a discrete variable is conditioned on the values of some other discrete variables, we make one table for each possible setting of the parents: these are called conditional probability tables or CPTs.
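A small numeric sketch of the KL divergence defined above (the two distributions are invented; the code is an illustrative addition):

```python
import numpy as np

def kl(p, q):
    """KL[p || q] = sum_x p(x) (log p(x) - log q(x)); assumes q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(kl(p, q), kl(q, p))   # asymmetric: the two directions differ
print(kl(p, p))             # zero iff the distributions are equal
```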


Some (Conditional) Probability Functions

• Probability density functions p(x) (for continuous variables) or probability mass functions p(x = k) (for discrete variables) tell us how likely it is to get a particular value for a random variable (possibly conditioned on the values of some other variables).
• We can consider various types of variables: binary/discrete (categorical), continuous, interval, and integer counts.
• For each type we'll see some basic probability models which are parametrized families of distributions.

Exponential Family

• For a (continuous or discrete) random variable x,
  p(x|η) = h(x) exp{η⊤T(x) − A(η)} = (1/Z(η)) h(x) exp{η⊤T(x)}
  is an exponential family distribution with natural parameter η.
• Function T(x) is a sufficient statistic.
• Function A(η) = log Z(η) is the log normalizer.
• Key idea: all you need to know about the data is captured in the summarizing function T(x).

Bernoulli Distribution

• For a binary random variable x ∈ {0, 1} with p(x = 1) = π:
  p(x|π) = π^x (1 − π)^(1−x) = exp{ x log(π/(1 − π)) + log(1 − π) }
• Exponential family with:
  η = log(π/(1 − π))
  T(x) = x
  A(η) = −log(1 − π) = log(1 + e^η)
  h(x) = 1
• The logistic function links natural parameter and chance of heads (numeric check below):
  π = 1/(1 + e^(−η)) = logistic(η)

Multinomial

• For a categorical (discrete) random variable taking on K possible values, let π_k be the probability of the kth value. We can use a binary vector x = (x_1, x_2, . . . , x_k, . . . , x_K) in which x_k = 1 if and only if the variable takes on its kth value. Now we can write
  p(x|π) = π_1^{x_1} π_2^{x_2} · · · π_K^{x_K} = exp{ ∑_i x_i log π_i }
• Exactly like a probability table, but written using binary vectors.
• If we observe this variable several times X = {x^1, x^2, . . . , x^N}, the (iid) probability depends on the total observed counts of each value:
  p(X|π) = ∏_n p(x^n|π) = exp{ ∑_i (∑_n x_i^n) log π_i } = exp{ ∑_i c_i log π_i }
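A quick numeric check of the Bernoulli exponential-family bookkeeping above (illustrative addition, with an arbitrary choice of π):

```python
import numpy as np

pi = 0.3
eta = np.log(pi / (1 - pi))          # natural parameter
A = np.log(1 + np.exp(eta))          # log normalizer A(eta) = -log(1 - pi)

def p_exp_family(x):
    # h(x) exp{eta * T(x) - A(eta)} with h(x) = 1, T(x) = x
    return np.exp(eta * x - A)

for x in (0, 1):
    assert np.isclose(p_exp_family(x), pi**x * (1 - pi)**(1 - x))

print(1 / (1 + np.exp(-eta)))        # logistic(eta) recovers pi = p(x = 1)
```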

Poisson

• For an integer count variable with rate λ:
  p(x|λ) = λ^x e^(−λ) / x! = (1/x!) exp{ x log λ − λ }
• Exponential family with:
  η = log λ
  T(x) = x
  A(η) = λ = e^η
  h(x) = 1/x!
• e.g. the number of photons x that arrive at a pixel during a fixed interval given mean intensity λ.
• Other count densities: (neg)binomial, geometric.

Multinomial as Exponential Family

• The multinomial parameters are constrained: ∑_i π_i = 1.
  Define (the last) one in terms of the rest: π_K = 1 − ∑_{i=1}^{K−1} π_i
  p(x|π) = exp{ ∑_{i=1}^{K−1} x_i log(π_i/π_K) + k log π_K }
• Exponential family with:
  η_i = log π_i − log π_K
  T(x_i) = x_i
  A(η) = −k log π_K = k log ∑_i e^(η_i)
  h(x) = 1
• The softmax function relates direct and natural parameters (code sketch below):
  π_i = e^(η_i) / ∑_j e^(η_j)

Gaussian (normal)

• For a continuous univariate random variable:
  p(x|µ, σ²) = (1/(√(2π) σ)) exp{ −(x − µ)²/(2σ²) }
             = (1/√(2π)) exp{ µx/σ² − x²/(2σ²) − µ²/(2σ²) − log σ }
• Exponential family with:
  η = [µ/σ² ; −1/(2σ²)]
  T(x) = [x ; x²]
  A(η) = log σ + µ²/(2σ²)
  h(x) = 1/√(2π)
• Note: a univariate Gaussian is a two-parameter distribution with a two-component vector of sufficient statistics.

Important Gaussian Facts

• All marginals of a Gaussian are again Gaussian.
• Any conditional of a Gaussian is again Gaussian.
  (Figure: a joint Gaussian p(x, y) with covariance Σ; the slice at x = x0 gives the conditional p(y|x = x0), and summing out y gives the marginal p(x).)
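A sketch of the softmax relation between natural and direct multinomial parameters (the π values are invented; the code is an illustrative addition):

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])          # direct parameters, sum to 1
eta = np.log(pi) - np.log(pi[-1])       # eta_i = log pi_i - log pi_K (so eta_K = 0)

softmax = np.exp(eta) / np.exp(eta).sum()
print(softmax)                          # recovers pi: [0.5, 0.3, 0.2]
```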

Multivariate Gaussian Distribution

• For a continuous vector random variable:
  p(x|µ, Σ) = |2πΣ|^(−1/2) exp{ −½ (x − µ)⊤ Σ^(−1) (x − µ) }
• Exponential family with:
  η = [Σ^(−1)µ ; −½ Σ^(−1)]
  T(x) = [x ; xx⊤]
  A(η) = log |Σ|/2 + µ⊤Σ^(−1)µ/2
  h(x) = (2π)^(−n/2)
• Sufficient statistics: mean vector and correlation matrix.
• Other densities: Student-t, Laplacian.
• For non-negative values use exponential, Gamma, log-normal.

Gaussian Marginals/Conditionals

• To find these parameters is mostly linear algebra: let z = [x⊤ y⊤]⊤ be normally distributed according to
  z = [x ; y] ∼ N( [a ; b] , [A C ; C⊤ B] )
  where C is the (non-symmetric) cross-covariance matrix between x and y, which has as many rows as the size of x and as many columns as the size of y.
• The marginal distributions are:
  x ∼ N(a; A)
  y ∼ N(b; B)
• The conditional distributions are:
  x|y ∼ N(a + CB^(−1)(y − b); A − CB^(−1)C⊤)
  y|x ∼ N(b + C⊤A^(−1)(x − a); B − C⊤A^(−1)C)
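A NumPy sketch of the conditional formula on a small, invented joint Gaussian (x and y are both one-dimensional here; the code is an illustrative addition):

```python
import numpy as np

a, b = np.array([0.0]), np.array([1.0])          # means of x and y
A = np.array([[2.0]]); B = np.array([[1.0]])     # covariances of x and y
C = np.array([[0.8]])                            # cross-covariance Cov[x, y]

y_obs = np.array([2.0])

# x | y ~ N(a + C B^{-1} (y - b),  A - C B^{-1} C^T)
Binv = np.linalg.inv(B)
cond_mean = a + C @ Binv @ (y_obs - b)
cond_cov = A - C @ Binv @ C.T
print(cond_mean, cond_cov)       # [0.8], [[1.36]]
```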

Parameter Constraints

• If we want to use general optimizations (e.g. conjugate gradient) to learn latent variable models, we often have to make sure parameters respect certain constraints (e.g. ∑_k α_k = 1, Σ_k positive definite).
• A good trick is to reparameterize these quantities in terms of unconstrained values. For mixing proportions, use the softmax:
  α_k = exp(q_k) / ∑_j exp(q_j)
• For covariance matrices, use the Cholesky decomposition (code sketch below):
  Σ^(−1) = A⊤A
  |Σ|^(−1/2) = ∏_i A_ii
  where A is upper triangular with positive diagonal:
  A_ii = exp(r_i) > 0,  A_ij = a_ij (j > i),  A_ij = 0 (j < i)

Parameterizing Conditionals

• When the variable(s) being conditioned on (parents) are discrete, we just have one density for each possible setting of the parents, e.g. a table of natural parameters in exponential models or a table of tables for discrete models.
• When the conditioned variable is continuous, its value sets some of the parameters for the other variables.
• A very common instance of this for regression is the "linear-Gaussian": p(y|x) = gauss(θ⊤x; Σ).
• For discrete children and continuous parents, we often use a Bernoulli/multinomial whose parameters are some function f(θ⊤x).
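A sketch of the unconstrained Cholesky-style parameterization for a 3x3 precision matrix; the unconstrained values r and a are arbitrary, and the code is an illustrative addition:

```python
import numpy as np

d = 3
r = np.array([0.1, -0.3, 0.2])      # unconstrained diagonal parameters
a = np.array([0.5, -1.0, 0.7])      # unconstrained strictly-upper-triangular entries

A = np.zeros((d, d))
A[np.diag_indices(d)] = np.exp(r)   # A_ii = exp(r_i) > 0
A[np.triu_indices(d, k=1)] = a      # A_ij = a_ij for j > i, zero below the diagonal

prec = A.T @ A                      # Sigma^{-1} = A^T A, positive definite by construction
print(np.linalg.eigvalsh(prec) > 0)                      # all True
print(np.prod(np.diag(A)), np.sqrt(np.linalg.det(prec))) # |Sigma|^{-1/2} = prod_i A_ii
```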

Moments

• For continuous variables, moment calculations are important.
• We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(η). The qth derivative gives the qth centred moment:
  dA(η)/dη = mean
  d²A(η)/dη² = variance
  · · ·
• When the sufficient statistic is a vector, partial derivatives need to be considered.

Generalized Linear Models (GLMs)

• Generalized Linear Models: p(y|x) is exponential family with conditional mean µ = f(θ⊤x).
• The function f is called the response function; if we choose it to be the inverse of the mapping between conditional mean and natural parameters,
  η = ψ(µ),  f(·) = ψ^(−1)(·),
  then it is called the canonical response function.
• We can be even more general and define distributions by arbitrary energy functions proportional to the log probability:
  p(x) ∝ exp{ −∑_k H_k(x) }
• A common choice is to use pairwise terms in the energy:
  H(x) = ∑_i a_i x_i + ∑_{pairs ij} w_ij x_i x_j

Matrix Inversion Lemma (Sherman-Morrison-Woodbury Formulae)

• There is a good trick for inverting matrices when they can be decomposed into the sum of an easily inverted matrix (D) and a low rank outer product. It is called the matrix inversion lemma (numeric check below):
  (D − AB^(−1)A⊤)^(−1) = D^(−1) + D^(−1)A(B − A⊤D^(−1)A)^(−1)A⊤D^(−1)
• The same trick can be used to compute determinants:
  log |D + AB^(−1)A⊤| = log |D| − log |B| + log |B + A⊤D^(−1)A|

Jensen's Inequality

• For any concave function f(·) and any distribution on x,
  E[f(x)] ≤ f(E[x])
• e.g. log(·) and √· are concave.
• This allows us to bound expressions like
  log p(x) = log ∑_z p(x, z)
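A quick NumPy check of the matrix inversion lemma and the determinant version on random matrices (shapes and values are arbitrary; the code is an illustrative addition):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2
A = rng.standard_normal((n, k))                  # low-rank factor
D = np.diag(rng.uniform(1.0, 2.0, n))            # easily inverted diagonal matrix
B = 50.0 * np.eye(k)                             # well-conditioned k x k matrix

Dinv, Binv = np.linalg.inv(D), np.linalg.inv(B)

lhs = np.linalg.inv(D - A @ Binv @ A.T)
rhs = Dinv + Dinv @ A @ np.linalg.inv(B - A.T @ Dinv @ A) @ A.T @ Dinv
print(np.allclose(lhs, rhs))                     # True

# Determinant version: log|D + A B^{-1} A^T| = log|D| - log|B| + log|B + A^T D^{-1} A|
lhs_det = np.linalg.slogdet(D + A @ Binv @ A.T)[1]
rhs_det = (np.linalg.slogdet(D)[1] - np.linalg.slogdet(B)[1]
           + np.linalg.slogdet(B + A.T @ Dinv @ A)[1])
print(np.allclose(lhs_det, rhs_det))             # True
```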

Matrix Derivatives

• Here are some useful matrix derivatives:
  ∂/∂A log |A| = (A^(−1))⊤
  ∂/∂A trace[B⊤A] = B
  ∂/∂A trace[BA⊤CA] = 2CAB   (for symmetric B and C)

Logsum

• Often you can easily compute b_k = log p(x|z = k, θ_k), but it will be very negative, say −10⁶ or smaller.
• Now, to compute ℓ = log p(x|θ) you need to compute log ∑_k e^(b_k). (e.g. for calculating responsibilities at test time or for learning)
• Careful! Do not compute this by doing log(sum(exp(b))). You will get underflow and an incorrect answer. Instead do this:
  – Add a constant exponent B to all the values b_k such that the largest value comes close to the maximum exponent allowed by machine precision: B = MAXEXPONENT − log(K) − max(b).
  – Compute log(sum(exp(b + B))) − B.
• Example: if log p(x|z = 1) = −120 and log p(x|z = 2) = −120, what is log p(x) = log[p(x|z = 1) + p(x|z = 2)]?
  Answer: log[2 e^(−120)] = −120 + log 2.
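A sketch of the logsum trick in code; for simplicity it shifts by max(b) rather than by the MAXEXPONENT-based constant above, which serves the same purpose of avoiding underflow (the code is an illustrative addition):

```python
import numpy as np

def logsumexp(b):
    """log(sum_k exp(b_k)), computed stably by shifting by the largest value."""
    B = np.max(b)
    return B + np.log(np.sum(np.exp(b - B)))

print(logsumexp(np.array([-120.0, -120.0])))   # -120 + log 2, as in the example above

b = np.array([-1200.0, -1200.0])               # negative enough that exp() underflows
print(np.log(np.sum(np.exp(b))))               # naive way: -inf (underflow)
print(logsumexp(b))                            # stable: -1200 + log 2
```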