Probability and Statistics
Sam Roweis
October, 2002

Random Variables and Densities

Review:
• Random variables X represent outcomes or states of the world. Instantiations of variables are usually written in lower case, e.g. x, and we write p(x) to mean probability(X = x).
• Sample Space: the space of all possible outcomes/states. (May be discrete, continuous, or mixed.)
• Probability mass (density) function p(x) ≥ 0: assigns a non-negative number to each point in the sample space and sums (integrates) to unity: Σ_x p(x) = 1 or ∫_x p(x) dx = 1. Intuitively: how often does x occur, how much do we believe in x.
• Ensemble: random variable + sample space + probability function.
• But you have to be careful about the context and about defining random variables and sample spaces carefully. Otherwise you can get into trouble (see, e.g., Simpson's paradox / the Prisoner's paradox).

Probability

• We use p(x) to represent our beliefs B(x) about the states x of the world.
• There is a formal calculus for manipulating uncertainties represented by probabilities.
• Any consistent set of beliefs obeying the Cox Axioms can be mapped into probabilities:
1. Rationally ordered degrees of belief: if B(x) > B(y) and B(y) > B(z) then B(x) > B(z).
2. Belief in x and its negation x̄ are related: B(x) = f[B(x̄)].
3. Belief in a conjunction depends only on conditionals: B(x and y) = g[B(x), B(y|x)] = g[B(y), B(x|y)].

Expectations, Moments

• Expectation of a function a(x) is written E[a] or ⟨a⟩:
E[a] = ⟨a⟩ = Σ_x p(x) a(x)
e.g. mean = Σ_x x p(x), variance = Σ_x (x − E[x])² p(x)
• Moments are expectations of higher order powers. (Mean is the first moment. Autocorrelation is the second moment.)
• Central moments have lower moments subtracted away (e.g. variance, skew, kurtosis).
• Deep fact: knowledge of all orders of moments completely defines the entire distribution.
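A small illustration (not part of the original slides): the following Python sketch computes an expectation, the mean, and the variance for a made-up discrete distribution; the sample space xs and probabilities ps are arbitrary example values.

# Moments of a discrete distribution: expectations of powers of x.
xs = [0, 1, 2, 3]            # example sample space
ps = [0.1, 0.2, 0.3, 0.4]    # p(x) >= 0, sums to 1

def expect(f):
    # E[f(x)] = sum_x p(x) f(x)
    return sum(p * f(x) for x, p in zip(xs, ps))

mean = expect(lambda x: x)                     # first moment
second = expect(lambda x: x ** 2)              # second moment (autocorrelation)
variance = expect(lambda x: (x - mean) ** 2)   # second central moment

print(mean, second, variance)                  # 2.0 5.0 1.0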

Joint Probability

• Key concept: two or more random variables may interact. Thus, the probability of one taking on a certain value depends on which value(s) the others are taking.
• We call this a joint ensemble and write p(x,y) = prob(X = x and Y = y).

Conditional Probability

• If we know that some event has occurred, it changes our belief about the probability of other events.
• This is like taking a "slice" through the joint table (see the numeric sketch after the figure below):
p(x|y) = p(x,y)/p(y)

[Figure: cubes depicting the joint p(x,y,z) and the conditional slice p(x,y|z) over the axes x, y, z.]
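As a concrete (made-up) example of the slicing operation above, here is a minimal Python/NumPy sketch that conditions a small joint table p(x,y) on y; the table entries are arbitrary but sum to one.

import numpy as np

# A made-up joint table p(x, y): rows index x in {0,1,2}, columns index y in {0,1}.
pxy = np.array([[0.10, 0.20],
                [0.30, 0.15],
                [0.05, 0.20]])

y0 = 1
py0 = pxy[:, y0].sum()            # p(y = y0): the normalizer of the slice
px_given_y0 = pxy[:, y0] / py0    # p(x | y = y0) = p(x, y0) / p(y0)

print(py0)             # 0.55
print(px_given_y0)     # approximately [0.364, 0.273, 0.364]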

Marginal Probabilities

• We can "sum out" part of a joint distribution to get the marginal distribution of a subset of variables:
p(x) = Σ_y p(x,y)
• This is like adding slices of the table together.
[Figure: summing the cube p(x,y,z) over z collapses it to a table over x and y: Σ_z p(x,y,z).]
• Another equivalent definition: p(x) = Σ_y p(x|y) p(y).

Bayes' Rule

• Manipulating the basic definition of conditional probability gives one of the most important formulas in probability:
p(x|y) = p(y|x) p(x) / p(y) = p(y|x) p(x) / Σ_x' p(y|x') p(x')
• This gives us a way of "reversing" conditional probabilities.
• Thus, all joint probabilities can be factored by selecting an ordering for the random variables and using the "chain rule":
p(x,y,z,...) = p(x) p(y|x) p(z|x,y) p(...|x,y,z)
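The following Python/NumPy sketch (an added illustration with made-up numbers) marginalizes a joint table and checks that Bayes' rule gives the same conditional as slicing the joint directly.

import numpy as np

# Made-up joint table p(x, y): rows index x, columns index y; entries sum to 1.
pxy = np.array([[0.10, 0.20],
                [0.30, 0.15],
                [0.05, 0.20]])

px = pxy.sum(axis=1)              # marginal p(x) = sum_y p(x, y)
py = pxy.sum(axis=0)              # marginal p(y) = sum_x p(x, y)

# Bayes' rule: p(x | y) = p(y | x) p(x) / sum_x' p(y | x') p(x')
py_given_x = pxy / px[:, None]    # p(y | x) = p(x, y) / p(x)
y0 = 0
numer = py_given_x[:, y0] * px
px_given_y0 = numer / numer.sum()

# Same answer as conditioning the joint table directly.
assert np.allclose(px_given_y0, pxy[:, y0] / py[y0])
print(px_given_y0)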

Independence & Conditional Independence

• Two variables are independent iff their joint factors:
p(x,y) = p(x) p(y)
[Figure: under independence the table p(x,y) is the outer product of the marginals p(x) and p(y).]
• Two variables are conditionally independent given a third one if, for all values of the conditioning variable, the resulting slice factors:
p(x,y|z) = p(x|z) p(y|z)   ∀z

Cross Entropy (KL Divergence)

• An asymmetric measure of the distance between two distributions:
KL[p||q] = Σ_x p(x) [log p(x) − log q(x)]
• KL > 0 unless p = q, in which case KL = 0.
• Tells you the extra cost if events were generated by p(x) but instead of charging under p(x) you charged under q(x).
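A minimal Python/NumPy sketch (added illustration; the distributions p and q are arbitrary examples) of the KL divergence and its basic properties:

import numpy as np

def kl(p, q):
    # KL[p || q] = sum_x p(x) (log p(x) - log q(x)); assumes q(x) > 0 wherever p(x) > 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                   # terms with p(x) = 0 contribute nothing
    return np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask])))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(kl(p, p))   # 0.0: the divergence vanishes when the distributions are equal
print(kl(p, q))   # > 0 otherwise
print(kl(q, p))   # generally differs from kl(p, q): the measure is asymmetric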

Entropy

• Measures the amount of ambiguity or uncertainty in a distribution:
H(p) = − Σ_x p(x) log p(x)
• Expected value of − log p(x) (a function which depends on p(x)!).
• H(p) > 0 unless only one outcome is possible, in which case H(p) = 0.
• Maximal value when p is uniform.
• Tells you the expected "cost" if each event costs − log p(event).

Jensen's Inequality

• For any concave function f() and any distribution on x,
E[f(x)] ≤ f(E[x])
[Figure: a concave function f, with f(E[x]) lying above E[f(x)].]
• e.g. log() and √ are concave.
• This allows us to bound expressions like log p(x) = log Σ_z p(x,z).

Statistics

• Probability: inferring probabilistic quantities for data given fixed models (e.g. prob. of events, marginals, conditionals, etc).
• Statistics: inferring a model given fixed data observations (e.g. clustering, classification, regression).
• Many approaches to statistics: frequentist, Bayesian, decision theory, ...

(Conditional) Probability Tables

• For discrete (categorical) quantities, the most basic parametrization is the probability table, which lists p(x_i = k-th value).
• Since PTs must be nonnegative and sum to 1, for k-ary variables there are k − 1 free parameters.
• If a discrete variable is conditioned on the values of some other discrete variables, we make one table for each possible setting of the parents: these are called conditional probability tables or CPTs.
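A minimal sketch (not from the slides) of how a CPT might be stored: one probability table per setting of the discrete parent, each row summing to one so that a k-ary child has k − 1 free parameters per row. The variable names are hypothetical.

# CPT for a ternary child x conditioned on a binary parent z:
# one table per parent setting; each row sums to 1 (k - 1 = 2 free parameters per row).
cpt = {
    0: [0.7, 0.2, 0.1],   # p(x | z = 0)
    1: [0.1, 0.3, 0.6],   # p(x | z = 1)
}

def p_x_given_z(x, z):
    return cpt[z][x]

print(p_x_given_z(x=2, z=1))   # 0.6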


Some (Conditional) Probability Functions

• Probability density functions p(x) (for continuous variables) or probability mass functions p(x = k) (for discrete variables) tell us how likely it is to get a particular value for a random variable (possibly conditioned on the values of some other variables).
• We can consider various types of variables: binary/discrete (categorical), continuous, interval, and integer counts.
• For each type we'll see some basic probability models which are parametrized families of distributions.

Exponential Family

• For a (continuous or discrete) random variable x,
p(x|η) = h(x) exp{η^T T(x) − A(η)}
       = (1/Z(η)) h(x) exp{η^T T(x)}
is an exponential family distribution with natural parameter η.
• Function T(x) is a sufficient statistic.
• Function A(η) = log Z(η) is the log normalizer.
• Key idea: all you need to know about the data is captured in the summarizing function T(x).
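As an added illustration of this form (a sketch, not part of the slides), the Python function below evaluates log p(x|η) from user-supplied h, T, and A; it is exercised with the Bernoulli parametrization derived on the next slide.

import numpy as np

def expfam_logprob(x, eta, T, A, log_h):
    # log p(x | eta) = log h(x) + eta . T(x) - A(eta)
    return log_h(x) + np.dot(eta, T(x)) - A(eta)

# Bernoulli with p(heads) = 0.3 written in exponential family form.
pi = 0.3
eta = np.array([np.log(pi / (1 - pi))])      # natural parameter
T = lambda x: np.array([float(x)])           # sufficient statistic T(x) = x
A = lambda eta: np.log1p(np.exp(eta[0]))     # log normalizer log(1 + e^eta)
log_h = lambda x: 0.0                        # h(x) = 1

print(np.exp(expfam_logprob(1, eta, T, A, log_h)))   # 0.3
print(np.exp(expfam_logprob(0, eta, T, A, log_h)))   # 0.7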

Bernoulli

• For a binary random variable with p(heads) = π:
p(x|π) = π^x (1 − π)^(1−x)
       = exp{ x log(π/(1 − π)) + log(1 − π) }
• Exponential family with:
η = log(π/(1 − π))
T(x) = x
A(η) = −log(1 − π) = log(1 + e^η)
h(x) = 1
• The logistic function relates the natural parameter and the chance of heads:
π = 1/(1 + e^(−η))

Multinomial

• For a set of integer counts on k trials:
p(x|π) = (k!/(x_1! x_2! ··· x_n!)) π_1^(x_1) π_2^(x_2) ··· π_n^(x_n) = h(x) exp{ Σ_i x_i log π_i }
• But the parameters are constrained: Σ_i π_i = 1, so we define the last one π_n = 1 − Σ_{i=1}^{n−1} π_i:
p(x|π) = h(x) exp{ Σ_{i=1}^{n−1} log(π_i/π_n) x_i + k log π_n }
• Exponential family with:
η_i = log π_i − log π_n
T(x_i) = x_i
A(η) = −k log π_n = k log Σ_i e^(η_i)
h(x) = k!/(x_1! x_2! ··· x_n!)
• The softmax function relates the basic and natural parameters:
π_i = e^(η_i) / Σ_j e^(η_j)
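A quick numeric check (added, with arbitrary example parameters) of the logistic and softmax links stated above:

import numpy as np

# Multinomial: recover the basic parameters from the natural ones via the softmax.
pi = np.array([0.2, 0.3, 0.5])          # basic parameters, sum to 1
eta = np.log(pi) - np.log(pi[-1])       # eta_i = log pi_i - log pi_n (so eta_n = 0)
print(np.exp(eta) / np.exp(eta).sum())  # [0.2 0.3 0.5] recovered

# Bernoulli special case: the logistic function recovers the chance of heads.
p_heads = 0.3
eta_b = np.log(p_heads / (1 - p_heads))
print(1.0 / (1.0 + np.exp(-eta_b)))     # 0.3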

Poisson

• For an integer count variable with rate λ:
p(x|λ) = λ^x e^(−λ) / x!
       = (1/x!) exp{ x log λ − λ }
• Exponential family with:
η = log λ
T(x) = x
A(η) = λ = e^η
h(x) = 1/x!
• e.g. the number of photons x that arrive at a pixel during a fixed interval given mean intensity λ.
• Other count densities: binomial, exponential.

Gaussian (normal)

• For a continuous univariate random variable:
p(x|μ,σ²) = (1/(√(2π) σ)) exp{ −(x − μ)²/(2σ²) }
          = (1/√(2π)) exp{ μx/σ² − x²/(2σ²) − μ²/(2σ²) − log σ }
• Exponential family with:
η = [μ/σ² ; −1/(2σ²)]
T(x) = [x ; x²]
A(η) = log σ + μ²/(2σ²)
h(x) = 1/√(2π)
• Note: a univariate Gaussian is a two-parameter distribution with a two-component vector of sufficient statistics.

Multivariate Gaussian Distribution

• For a continuous vector random variable:
p(x|μ,Σ) = |2πΣ|^(−1/2) exp{ −(1/2) (x − μ)^T Σ^(−1) (x − μ) }
• Exponential family with:
η = [Σ^(−1)μ ; −(1/2)Σ^(−1)]
T(x) = [x ; xx^T]
A(η) = log|Σ|/2 + μ^T Σ^(−1) μ/2
h(x) = (2π)^(−n/2)
• Sufficient statistics: mean vector and correlation matrix.
• Other densities: Student-t, Laplacian.
• For non-negative values use exponential, Gamma, log-normal.

Gaussian Marginals/Conditionals

• To find these parameters is mostly linear algebra. Let z = [x^T y^T]^T be normally distributed according to:
z = [x ; y] ~ N( [a ; b], [A C ; C^T B] )
where C is the (non-symmetric) cross-covariance matrix between x and y, which has as many rows as the size of x and as many columns as the size of y.
• The marginal distributions are:
x ~ N(a ; A)
y ~ N(b ; B)
• and the conditional distributions are:
x|y ~ N(a + C B^(−1)(y − b) ; A − C B^(−1) C^T)
y|x ~ N(b + C^T A^(−1)(x − a) ; B − C^T A^(−1) C)
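The conditional formulas above are easy to sanity-check numerically. Below is an added Python/NumPy sketch (the joint covariance numbers are made up) that compares the analytic conditional mean and variance of x given y = y0 against a Monte Carlo estimate.

import numpy as np

rng = np.random.default_rng(0)

# Made-up joint Gaussian over z = [x, y], with scalar x and y for simplicity.
a, b = 1.0, -2.0                        # means of x and y
A, B, C = 2.0, 1.5, 0.8                 # var(x), var(y), cov(x, y)
mean = np.array([a, b])
cov = np.array([[A, C],
                [C, B]])

# Analytic conditional x | y = y0 from the formulas above.
y0 = 0.0
cond_mean = a + C / B * (y0 - b)
cond_var = A - C / B * C

# Monte Carlo check: sample the joint and keep samples with y near y0.
z = rng.multivariate_normal(mean, cov, size=1_000_000)
near = z[np.abs(z[:, 1] - y0) < 0.02, 0]
print(cond_mean, near.mean())           # should be close
print(cond_var, near.var())             # should be close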

Important Gaussian Facts

• All marginals of a Gaussian are again Gaussian.
• Any conditional of a Gaussian is again Gaussian.
[Figure: a joint Gaussian p(x,y) with covariance Σ, its marginal p(x), and a conditional slice p(y|x = x0).]

Moments

• For continuous variables, moment calculations are important.
• We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(η). The qth derivative gives the qth centred moment:
dA(η)/dη = mean
d²A(η)/dη² = variance
...
• When the sufficient statistic is a vector, partial derivatives need to be considered.
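A small added check of this fact, using the Poisson parametrization given earlier (the rate value is arbitrary): for the Poisson, A(η) = e^η with η = log λ, so both derivatives should equal λ.

import numpy as np

lam = 3.0
eta = np.log(lam)
A = np.exp                      # Poisson log normalizer A(eta) = exp(eta)

eps = 1e-4
dA = (A(eta + eps) - A(eta - eps)) / (2 * eps)               # first derivative ~ mean
d2A = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps**2    # second derivative ~ variance
print(dA, d2A)                  # both approximately 3.0, the Poisson mean and variance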

Linear-Gaussian Conditionals

• When the variable(s) being conditioned on (parents) are discrete, we just have one density for each possible setting of the parents, e.g. a table of natural parameters in exponential models or a table of tables for discrete models.
• When the conditioned variable is continuous, its value sets some of the parameters for the other variables.
• A very common instance of this for regression is the "linear-Gaussian": p(y|x) = gauss(θ^T x; Σ).
• For discrete children and continuous parents, we often use a Bernoulli/multinomial whose parameters are some function f(θ^T x).

Likelihood

• So far we have focused on the (log) probability function p(x|θ), which assigns a probability (density) to any joint configuration of variables x given fixed parameters θ.
• But in learning we turn this on its head: we have some fixed data and we want to find parameters.
• Think of p(x|θ) as a function of θ for fixed x:
L(θ; x) = p(x|θ)
ℓ(θ; x) = log p(x|θ)
This function is called the (log) "likelihood".
• Choose θ to maximize some cost function c(θ) which includes ℓ(θ):
c(θ) = ℓ(θ; D)            maximum likelihood (ML)
c(θ) = ℓ(θ; D) + r(θ)     maximum a posteriori (MAP) / penalized ML
(also cross-validation, Bayesian estimators, BIC, AIC, ...)
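An added sketch of a linear-Gaussian conditional viewed as a likelihood: the function below evaluates log N(y; θ^T x, σ²) for a fixed (x, y) pair at a few candidate θ values (all numbers are illustrative).

import numpy as np

def loglik(theta, x, y, sigma2=1.0):
    # log of gauss(y; theta^T x, sigma2), viewed as a function of theta for fixed (x, y)
    mean = np.dot(theta, x)
    return -0.5 * np.log(2 * np.pi * sigma2) - 0.5 * (y - mean) ** 2 / sigma2

x = np.array([1.0, 2.0])     # one input vector
y = 1.5                      # its observed output
for theta in (np.array([0.0, 0.0]), np.array([0.5, 0.5]), np.array([1.5, 0.0])):
    print(theta, loglik(theta, x, y))   # settings whose predicted mean is near y score higher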

Multiple Observations, Complete Data, IID Sampling

• A single observation of the data x is rarely useful on its own.
• Generally we have data including many observations, which creates a set of random variables: D = {x^1, x^2, ..., x^M}.
• Two very common assumptions:
1. Observations are independently and identically distributed (IID) according to the joint distribution of the graphical model: IID samples.
2. We observe all random variables in the domain on each observation: complete data.

Maximum Likelihood

• For IID data:
p(D|θ) = ∏_m p(x^m|θ)
ℓ(θ; D) = Σ_m log p(x^m|θ)
• Idea of maximum likelihood estimation (MLE): pick the setting of parameters most likely to have generated the data we saw:
θ*_ML = argmax_θ ℓ(θ; D)
• Very commonly used in statistics.
• Often leads to "intuitive", "appealing", or "natural" estimators.

Example: Bernoulli Trials

• We observe M iid coin flips: D = H,H,T,H,...
• Model: p(H) = θ, p(T) = (1 − θ)
• Likelihood:
ℓ(θ; D) = log p(D|θ)
        = log ∏_m θ^(x^m) (1 − θ)^(1−x^m)
        = log θ Σ_m x^m + log(1 − θ) Σ_m (1 − x^m)
        = N_H log θ + N_T log(1 − θ)
• Take derivatives and set to zero:
∂ℓ/∂θ = N_H/θ − N_T/(1 − θ)
⇒ θ*_ML = N_H / (N_H + N_T)

Example: Univariate Normal

• We observe M iid real samples: D = 1.18, −.25, .78, ...
• Model: p(x) = (2πσ²)^(−1/2) exp{−(x − μ)²/2σ²}
• Likelihood (using probability density):
ℓ(θ; D) = log p(D|θ)
        = −(M/2) log(2πσ²) − (1/2) Σ_m (x^m − μ)²/σ²
• Take derivatives and set to zero:
∂ℓ/∂μ = (1/σ²) Σ_m (x^m − μ)
∂ℓ/∂σ² = −M/(2σ²) + (1/(2σ⁴)) Σ_m (x^m − μ)²
⇒ μ_ML = (1/M) Σ_m x^m
   σ²_ML = (1/M) Σ_m (x^m)² − μ²_ML
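A quick added numeric check of both estimators on simulated data (the true parameter values below are arbitrary):

import numpy as np

rng = np.random.default_rng(1)

# Bernoulli: the ML estimate is the fraction of heads, N_H / (N_H + N_T).
flips = rng.random(10_000) < 0.7           # simulated coin with true theta = 0.7
print(flips.mean())                        # close to 0.7

# Univariate normal: ML estimates are the sample mean and the (biased) sample variance.
xs = rng.normal(loc=2.0, scale=3.0, size=10_000)
mu_ml = xs.mean()
var_ml = (xs ** 2).mean() - mu_ml ** 2     # (1/M) sum x_m^2 - mu_ML^2
print(mu_ml, var_ml)                       # close to 2.0 and 9.0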

Example: Multinomial

• We observe M iid die rolls (K-sided): D = 3,1,K,2,...
• Model: p(k) = θ_k, Σ_k θ_k = 1
• Likelihood (for binary indicators [x^m = k]):
ℓ(θ; D) = log p(D|θ)
        = log ∏_m θ_1^([x^m=1]) ··· θ_k^([x^m=k])
        = Σ_k log θ_k Σ_m [x^m = k] = Σ_k N_k log θ_k
• Take derivatives and set to zero (enforcing Σ_k θ_k = 1):
∂ℓ/∂θ_k = N_k/θ_k − M
⇒ θ*_k = N_k/M

Example: Univariate Normal

[Figure: illustration accompanying the univariate normal example (labels A and B along the x axis).]

Example: Linear Regression

• In linear regression, some inputs (covariates, parents) and all outputs (responses, children) are continuous valued variables.
• For each child and setting of discrete parents we use the model:
p(y|x,θ) = gauss(y|θ^T x, σ²)
• The likelihood is the familiar "squared error" cost:
ℓ(θ; D) = −(1/2σ²) Σ_m (y^m − θ^T x^m)²
• The ML parameters can be solved for using linear least-squares:
∂ℓ/∂θ = (1/σ²) Σ_m (y^m − θ^T x^m) x^m
⇒ θ*_ML = (X^T X)^(−1) X^T Y

Sufficient Statistics

• A statistic is a function of a random variable.
• T(X) is a "sufficient statistic" for X if
T(x^1) = T(x^2) ⇒ L(θ; x^1) = L(θ; x^2)   ∀θ
• Equivalently (by the Neyman factorization theorem) we can write:
p(x|θ) = h(x, T(x)) g(T(x), θ)
• Example: exponential family models:
p(x|θ) = h(x) exp{η^T T(x) − A(η)}
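An added Python/NumPy sketch of the least-squares solution on simulated data (the "true" weights and noise level are made up for the example):

import numpy as np

rng = np.random.default_rng(2)

# Simulated regression data: y = theta^T x + Gaussian noise.
true_theta = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(500, 3))
Y = X @ true_theta + 0.1 * rng.normal(size=500)

# ML solution via the normal equations: theta = (X^T X)^{-1} X^T Y.
theta_ml = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta_ml)                              # close to [1.0, -2.0, 0.5]

# In practice a dedicated least-squares solver is preferred for numerical stability.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(theta_ml, theta_lstsq))    # True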

Example: Linear Regression

[Figure: scatter plot of regression data points over axes X and Y.]

Sufficient Statistics are Sums

• In the examples above, the sufficient statistics were merely sums (counts) of the data:
Bernoulli: # of heads, tails
Multinomial: # of each type
Gaussian: mean, mean-square
Regression: correlations
• As we will see, this is true for all exponential family models: sufficient statistics are average natural parameters.
• Only exponential family models have simple sufficient statistics. (There are some degenerate exceptions, e.g. the uniform has sufficient statistics of max/min.)

MLE for Exponential Family Models

• Recall the probability function for exponential models:
p(x|θ) = h(x) exp{η^T T(x) − A(η)}
• For iid data, the sufficient statistic is Σ_m T(x^m):
ℓ(η; D) = log p(D|η) = (Σ_m log h(x^m)) − M A(η) + η^T (Σ_m T(x^m))
• Take derivatives and set to zero:
∂ℓ/∂η = Σ_m T(x^m) − M ∂A(η)/∂η
⇒ ∂A(η)/∂η |_(η = η_ML) = (1/M) Σ_m T(x^m)
recalling that the natural moments of an exponential distribution are the derivatives of the log normalizer.

Fundamental Operations with Distributions

• Generate data: draw samples from the distribution. This often involves generating a uniformly distributed variable in the range [0,1] and transforming it. For more complex distributions it may involve an iterative procedure that takes a long time to produce a single sample (e.g. Gibbs sampling, MCMC).
• Compute log probabilities. When all variables are either observed or marginalized the result is a single number which is the log prob of the configuration.
• Inference: compute expectations of some variables given others which are observed or marginalized.
• Learning: set the parameters of the density functions given some (partially) observed data to maximize likelihood or penalized likelihood.
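An added worked example of this moment-matching condition for the Bernoulli case (simulated data; the true probability is arbitrary): the ML natural parameter is the one whose mean, dA/dη = logistic(η), equals the average sufficient statistic.

import numpy as np

rng = np.random.default_rng(3)

# Simulated Bernoulli data with true pi = 0.25.
x = (rng.random(50_000) < 0.25).astype(float)
mean_T = x.mean()                        # (1/M) sum_m T(x^m), the average sufficient statistic

eta_ml = np.log(mean_T / (1 - mean_T))   # invert dA/deta = logistic(eta) = mean_T
pi_ml = 1.0 / (1.0 + np.exp(-eta_ml))    # back to the basic parameter

print(mean_T, pi_ml)                     # equal: the model's mean matches the data mean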

Basic Statistical Problems

• Let's remind ourselves of the basic problems we discussed on the first day: density estimation, clustering, classification and regression.
• Density estimation is hardest. If we can do joint density estimation then we can always condition to get what we want:
Regression: p(y|x) = p(y,x)/p(x)
Classification: p(c|x) = p(c,x)/p(x)
Clustering: p(c|x) = p(c,x)/p(x), c unobserved

Learning with Known Model Structure

• In AI the bottleneck is often knowledge acquisition.
• Human experts are rare, expensive, unreliable, slow.
• But we have lots of data.
• Want to build systems automatically based on data and a small amount of prior information (from experts).
• Many systems we build will be essentially probability models.
• Assume the prior information we have specifies the type & structure of the model, as well as the form of the (conditional) distributions or potentials.
• In this case learning ≡ setting parameters.
• Also possible to do "structure learning" to learn the model.