Probability and Statistics
Sam Roweis
Machine Learning Summer School, January 2005
Random Variables and Densities

• Review: a random variable X represents outcomes or states of the world.
  Instantiations of variables are usually written in lower case: x.
  We will write p(x) to mean probability(X = x).
• Sample space: the space of all possible outcomes/states.
  (May be discrete or continuous or mixed.)
• Probability mass (density) function p(x) ≥ 0.
  Assigns a non-negative number to each point in the sample space.
  Sums (integrates) to unity: Σ_x p(x) = 1 or ∫_x p(x) dx = 1.
  Intuitively: how often does x occur, how much do we believe in x.
• Ensemble: random variable + sample space + probability function.

Probability

• We use probabilities p(x) to represent our beliefs B(x) about the states x of the world.
• There is a formal calculus for manipulating uncertainties represented by probabilities.
• Any consistent set of beliefs obeying the Cox Axioms can be mapped into probabilities:
  1. Rationally ordered degrees of belief:
     if B(x) > B(y) and B(y) > B(z) then B(x) > B(z).
  2. Belief in x and its negation x̄ are related: B(x) = f[B(x̄)].
  3. Belief in a conjunction depends only on conditionals:
     B(x and y) = g[B(x), B(y|x)] = g[B(y), B(x|y)].

Expectations, Moments

• The expectation of a function a(x) is written E[a] or ⟨a⟩:
  E[a] = ⟨a⟩ = Σ_x p(x) a(x)
  e.g. mean = Σ_x x p(x), variance = Σ_x (x − E[x])² p(x).
• Moments are expectations of higher-order powers.
  (The mean is the first moment; the autocorrelation is the second moment.)
• Centralized moments have the lower moments subtracted away (e.g. variance, skew, kurtosis).
• Deep fact: knowledge of all orders of moments completely defines the entire distribution.

Means, Variances and Covariances

• Remember the definition of the mean and covariance of a vector random variable:
  E[x] = ∫_x x p(x) dx = m
  Cov[x] = E[(x − m)(x − m)⊤] = ∫_x (x − m)(x − m)⊤ p(x) dx = V
  which is the expected value of the outer product of the variable with itself, after subtracting the mean.
• Also, the covariance between two variables:
  Cov[x, y] = E[(x − m_x)(y − m_y)⊤] = ∫_{x,y} (x − m_x)(y − m_y)⊤ p(x, y) dx dy = C
  which is the expected value of the outer product of one variable with another, after subtracting their means.
• Note: C is not symmetric.

Marginal Probabilities

• We can "sum out" part of a joint distribution to get the marginal distribution of a subset of variables:
  p(x) = Σ_y p(x, y)
• This is like adding slices of the table together.
  [Figure: slices of a joint table summed along one axis.]
• Another equivalent definition: p(x) = Σ_y p(x|y) p(y).

Joint Probability

• Key concept: two or more random variables may interact.
  Thus, the probability of one taking on a certain value depends on which value(s) the others are taking.
• We call this a joint ensemble and write
  p(x, y) = prob(X = x and Y = y).
  [Figure: a three-dimensional joint table p(x, y, z) over axes x, y, z.]

Conditional Probability

• If we know that some event has occurred, it changes our belief about the probability of other events.
• This is like taking a "slice" through the joint table:
  p(x|y) = p(x, y) / p(y)
  [Figure: a slice p(x, y|z) through the joint table p(x, y, z).]
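The table operations above are easy to check numerically. Below is a minimal sketch (not part of the original slides) using NumPy; the 2x3 array `joint` and the test values are purely illustrative.

```python
import numpy as np

# Hypothetical 2x3 joint table p(x, y); rows index x, columns index y.
# Values are made up for illustration and sum to 1.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.05, 0.30]])
assert np.isclose(joint.sum(), 1.0)

# Marginals: "sum out" one variable ("adding slices of the table together").
p_x = joint.sum(axis=1)            # p(x) = sum_y p(x, y)
p_y = joint.sum(axis=0)            # p(y) = sum_x p(x, y)

# Conditional: p(x | y) = p(x, y) / p(y), a normalized "slice" through the table.
p_x_given_y = joint / p_y          # column j holds p(x | y = j)

# Sanity checks: each conditional column sums to 1, and
# p(x) = sum_y p(x | y) p(y), the equivalent definition of the marginal.
assert np.allclose(p_x_given_y.sum(axis=0), 1.0)
assert np.allclose(p_x, p_x_given_y @ p_y)

print("p(x)     =", p_x)
print("p(x|y=0) =", p_x_given_y[:, 0])
```

The same axis-summing and normalizing pattern extends directly to tables over more than two variables.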
Bayes' Rule

• Manipulating the basic definition of conditional probability gives one of the most important formulas in probability theory:
  p(x|y) = p(y|x) p(x) / p(y) = p(y|x) p(x) / Σ_{x′} p(y|x′) p(x′)
• This gives us a way of "reversing" conditional probabilities.
• Thus, all joint probabilities can be factored by selecting an ordering for the random variables and using the "chain rule":
  p(x, y, z, ...) = p(x) p(y|x) p(z|x, y) p(...|x, y, z)

Entropy

• Measures the amount of ambiguity or uncertainty in a distribution:
  H(p) = −Σ_x p(x) log p(x)
• Expected value of −log p(x) (a function which depends on p(x)!).
• H(p) > 0 unless only one outcome is possible, in which case H(p) = 0.
• Maximal value when p is uniform.
• Tells you the expected "cost" if each event costs −log p(event).

Independence & Conditional Independence

• Two variables are independent iff their joint factors:
  p(x, y) = p(x) p(y)
  [Figure: the joint table p(x, y) as the outer product of the marginals p(x) and p(y).]
• Two variables are conditionally independent given a third one if, for all values of the conditioning variable, the resulting slice factors:
  p(x, y|z) = p(x|z) p(y|z)   ∀z

Cross Entropy (KL Divergence)

• An asymmetric measure of the distance between two distributions:
  KL[p‖q] = Σ_x p(x) [log p(x) − log q(x)]
• KL > 0 unless p = q, in which case KL = 0.
• Tells you the extra cost if events were generated by p(x) but instead of charging under p(x) you charged under q(x).

Statistics

• Probability: inferring probabilistic quantities for data given fixed models (e.g. probabilities of events, marginals, conditionals, etc.).
• Statistics: inferring a model given fixed data observations (e.g. clustering, classification, regression).
• Many approaches to statistics: frequentist, Bayesian, decision theory, ...

(Conditional) Probability Tables

• For discrete (categorical) quantities, the most basic parametrization is the probability table, which lists p(x_i = k-th value).
• Since PTs must be non-negative and sum to 1, for k-ary variables there are k − 1 free parameters.
• If a discrete variable is conditioned on the values of some other discrete variables, we make one table for each possible setting of the parents: these are called conditional probability tables, or CPTs.
  [Figure: a joint table p(x, y, z) and its conditional slices p(x, y|z).]

Some (Conditional) Probability Functions

• Probability density functions p(x) (for continuous variables) or probability mass functions p(x = k) (for discrete variables) tell us how likely it is to get a particular value for a random variable (possibly conditioned on the values of some other variables).
• We can consider various types of variables: binary/discrete (categorical), continuous, interval, and integer counts.
• For each type we'll see some basic probability models which are parametrized families of distributions.

Exponential Family

• For a (continuous or discrete) random variable x,
  p(x|η) = h(x) exp{ η⊤ T(x) − A(η) }
         = (1/Z(η)) h(x) exp{ η⊤ T(x) }
  is an exponential family distribution with natural parameter η.
• The function T(x) is a sufficient statistic.
• The function A(η) = log Z(η) is the log normalizer.
• Key idea: all you need to know about the data is captured in the summarizing function T(x).

Bernoulli Distribution

• For a binary random variable x ∈ {0, 1} with p(x = 1) = π:
  p(x|π) = π^x (1 − π)^(1−x)
         = exp{ x log(π / (1 − π)) + log(1 − π) }
• Exponential family with:
  η = log(π / (1 − π))
  T(x) = x
  A(η) = −log(1 − π) = log(1 + e^η)
  h(x) = 1
• The logistic function links the natural parameter and the chance of heads:
  π = 1 / (1 + e^(−η)) = logistic(η)

Multinomial

• For a categorical (discrete) random variable taking on K possible values, let π_k be the probability of the k-th value. We can use a binary vector x = (x_1, x_2, ..., x_k, ..., x_K) in which x_k = 1 if and only if the variable takes on its k-th value. Now we can write
  p(x|π) = π_1^(x_1) π_2^(x_2) ··· π_K^(x_K) = exp{ Σ_i x_i log π_i }
• Exactly like a probability table, but written using binary vectors.
• If we observe this variable several times X = {x^1, x^2, ..., x^N}, the (iid) probability depends on the total observed counts of each value:
  p(X|π) = Π_n p(x^n|π) = exp{ Σ_i Σ_n x_i^n log π_i } = exp{ Σ_i c_i log π_i }
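As a quick numerical sanity check on the Bernoulli slide (my own sketch, not from the deck; the value π = 0.3 is arbitrary), the code below confirms that the exponential family form with η = log(π/(1 − π)) and A(η) = log(1 + e^η) reproduces π^x (1 − π)^(1−x), and that the logistic function recovers π from η.

```python
import numpy as np

pi = 0.3                              # arbitrary chance of heads, p(x = 1)
eta = np.log(pi / (1.0 - pi))         # natural parameter eta = log(pi / (1 - pi))
A = np.log(1.0 + np.exp(eta))         # log normalizer A(eta) = -log(1 - pi) = log(1 + e^eta)

for x in (0, 1):
    standard = pi**x * (1.0 - pi)**(1 - x)   # p(x|pi) = pi^x (1 - pi)^(1 - x)
    exp_family = np.exp(eta * x - A)         # h(x) exp{eta T(x) - A(eta)}, with h(x) = 1, T(x) = x
    assert np.isclose(standard, exp_family)

# The logistic function recovers the chance of heads from the natural parameter.
assert np.isclose(1.0 / (1.0 + np.exp(-eta)), pi)
print("Bernoulli exponential-family form checked for pi =", pi)
```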
Poisson

• For an integer count variable with rate λ:
  p(x|λ) = λ^x e^(−λ) / x!
         = (1/x!) exp{ x log λ − λ }
• Exponential family with:
  η = log λ
  T(x) = x
  A(η) = λ = e^η
  h(x) = 1/x!
• e.g. the number of photons x that arrive at a pixel during a fixed interval, given mean intensity λ.
• Other count densities: (negative) binomial, geometric.

Multinomial as Exponential Family

• The multinomial parameters are constrained: Σ_i π_i = 1.
  Define (the last) one in terms of the rest: π_K = 1 − Σ_{i=1}^{K−1} π_i
  p(x|π) = exp{ Σ_{i=1}^{K−1} log(π_i/π_K) x_i + k log π_K }   (where k = Σ_i x_i is the total count)
• Exponential family with:
  η_i = log π_i − log π_K
  T(x_i) = x_i
  A(η) = −k log π_K = k log Σ_i e^(η_i)
  h(x) = 1
• The softmax function relates direct and natural parameters:
  π_i = e^(η_i) / Σ_j e^(η_j)

Gaussian (normal)

• For a continuous univariate random variable:
  p(x|µ, σ²) = (1/(√(2π) σ)) exp{ −(x − µ)²/(2σ²) }
             = (1/√(2π)) exp{ µx/σ² − x²/(2σ²) − µ²/(2σ²) − log σ }
• Exponential family with:
  η = [µ/σ² ; −1/(2σ²)]
  T(x) = [x ; x²]
  A(η) = log σ + µ²/(2σ²)
  h(x) = 1/√(2π)
• Note: a univariate Gaussian is a two-parameter distribution with a two-component vector of sufficient statistics.

Important Gaussian Facts

• All marginals of a Gaussian are again Gaussian.
• Any conditional of a Gaussian is again Gaussian.
  [Figure: a joint Gaussian p(x, y) with covariance Σ, its marginal p(x), and a conditional p(y|x = x0).]

Multivariate Gaussian Distribution

• For a continuous vector random variable:
  p(x|µ, Σ) = |2πΣ|^(−1/2) exp{ −(1/2) (x − µ)⊤ Σ^(−1) (x − µ) }
• Exponential family with:
  η = [Σ^(−1)µ ; −(1/2)Σ^(−1)]

Gaussian Marginals/Conditionals

• To find these parameters is mostly linear algebra.
• Let z = [x⊤ y⊤]⊤ be normally distributed according to:
  z = [x ; y] ∼ N( [a ; b], [A C ; C⊤ B] )
  where C is the (non-symmetric) cross-covariance matrix between x and y, which has as many rows as the size of x and as many columns as the size of y.
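The partitioned setup above can be exercised numerically. The sketch below (not from the slides) uses the standard partitioned-Gaussian identities, namely that the marginal of x is N(a, A) and the conditional of y given x is N(b + C⊤A⁻¹(x − a), B − C⊤A⁻¹C); the matrices and test points are made up for illustration, and the check is that the joint density factors into the marginal times the conditional.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian N(mean, cov) at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(2.0 * np.pi * cov)
    return -0.5 * (logdet + diff @ np.linalg.solve(cov, diff))

# Made-up partitioned parameters for z = [x; y] ~ N([a; b], [[A, C], [C.T, B]]).
a = np.array([0.0, 1.0])                  # mean of x (2-dimensional)
b = np.array([2.0])                       # mean of y (1-dimensional)
A = np.array([[2.0, 0.3], [0.3, 1.0]])    # Cov[x]
B = np.array([[1.5]])                     # Cov[y]
C = np.array([[0.4], [0.2]])              # cross-covariance Cov[x, y]

mean_z = np.concatenate([a, b])
cov_z = np.block([[A, C], [C.T, B]])

# Standard partitioned-Gaussian identities (not spelled out in the extract above):
#   marginal:    x          ~ N(a, A)
#   conditional: y | x = x0 ~ N(b + C.T A^{-1} (x0 - a),  B - C.T A^{-1} C)
x0 = np.array([0.5, 0.8])
y0 = np.array([2.3])
cond_mean = b + C.T @ np.linalg.solve(A, x0 - a)
cond_cov = B - C.T @ np.linalg.solve(A, C)

# Check: log p(x0, y0) = log p(x0) + log p(y0 | x0).
lhs = gaussian_logpdf(np.concatenate([x0, y0]), mean_z, cov_z)
rhs = gaussian_logpdf(x0, a, A) + gaussian_logpdf(y0, cond_mean, cond_cov)
assert np.isclose(lhs, rhs)
print("joint = marginal * conditional verified at a test point")
```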