Review: Probability and Statistics
Sam Roweis
October 2, 2002
Random Variables and Densities
• Random variables: X represents outcomes or states of the world. Instantiations of variables are usually written in lower case: x. We will write p(x) to mean probability(X = x).
• Sample space: the space of all possible outcomes/states. (May be discrete, continuous, or mixed.)
• Probability mass (density) function p(x) \ge 0: assigns a non-negative number to each point in the sample space and sums (integrates) to unity: \sum_x p(x) = 1 or \int_x p(x)\,dx = 1. Intuitively: how often does x occur, how much do we believe in x.
• Ensemble: random variable + sample space + probability function.

Probability
• We use probabilities p(x) to represent our beliefs B(x) about the states x of the world.
• There is a formal calculus for manipulating uncertainties represented by probabilities.
• Any consistent set of beliefs obeying the Cox Axioms can be mapped into probabilities:
1. Rationally ordered degrees of belief: if B(x) > B(y) and B(y) > B(z) then B(x) > B(z).
2. Belief in x and its negation \bar{x} are related: B(x) = f[B(\bar{x})].
3. Belief in a conjunction depends only on conditionals: B(x and y) = g[B(x), B(y|x)] = g[B(y), B(x|y)].

Expectations, Moments
• The expectation of a function a(x) is written E[a] or \langle a \rangle:
E[a] = \langle a \rangle = \sum_x p(x)\,a(x)
e.g. mean = \sum_x x\,p(x), variance = \sum_x (x - E[x])^2\,p(x).
• Moments are expectations of higher-order powers. (The mean is the first moment; the autocorrelation is the second moment.)
• Centralized moments have the lower moments subtracted away (e.g. variance, skew, kurtosis).
• Deep fact: knowledge of all orders of moments completely defines the entire distribution.

Joint Probability
• Key concept: two or more random variables may interact. Thus, the probability of one taking on a certain value depends on which value(s) the others are taking.
• We call this a joint ensemble and write p(x, y) = prob(X = x and Y = y).
[Figure: the joint table p(x, y, z).]

Conditional Probability
• If we know that some event has occurred, it changes our belief about the probability of other events.
• This is like taking a "slice" through the joint table:
p(x | y) = p(x, y) / p(y)
[Figure: a conditional slice p(x, y | z) of the joint p(x, y, z).]

Marginal Probabilities
• We can "sum out" part of a joint distribution to get the marginal distribution of a subset of variables:
p(x) = \sum_y p(x, y)
• This is like adding slices of the table together.
• Another equivalent definition: p(x) = \sum_y p(x | y)\,p(y).
[Figure: summing the joint p(x, y) over z.]

Bayes' Rule
• Manipulating the basic definition of conditional probability gives one of the most important formulas in probability theory:
p(x | y) = \frac{p(y | x)\,p(x)}{p(y)} = \frac{p(y | x)\,p(x)}{\sum_{x'} p(y | x')\,p(x')}
• This gives us a way of "reversing" conditional probabilities.
• Thus, all joint probabilities can be factored by selecting an ordering for the random variables and using the "chain rule":
p(x, y, z, \ldots) = p(x)\,p(y | x)\,p(z | x, y)\,p(\ldots | x, y, z)

Independence & Conditional Independence
• Two variables are independent iff their joint factors:
p(x, y) = p(x)\,p(y)
[Figure: a factored joint p(x, y) = p(x) p(y).]
• Two variables are conditionally independent given a third one if, for all values of the conditioning variable, the resulting slice factors:
p(x, y | z) = p(x | z)\,p(y | z) \quad \forall z

Entropy
• Measures the amount of ambiguity or uncertainty in a distribution:
H(p) = -\sum_x p(x) \log p(x)
• Expected value of -\log p(x) (a function which itself depends on p(x)!).
• H(p) > 0 unless there is only one possible outcome, in which case H(p) = 0.
• Maximal value when p is uniform.
• Tells you the expected "cost" if each event costs -\log p(event).
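The table manipulations above (marginalizing, conditioning, Bayes' rule, entropy) are easy to check numerically. Below is a minimal Python/NumPy sketch; the 2x2 joint table and the variable names (joint, px, py_given_x, ...) are made up purely for illustration.

```python
import numpy as np

# A made-up joint table p(x, y) over two binary variables (rows = x, cols = y).
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])
assert np.isclose(joint.sum(), 1.0)   # sums to unity

# Marginals: "sum out" the other variable.
px = joint.sum(axis=1)                # p(x) = sum_y p(x, y)
py = joint.sum(axis=0)                # p(y) = sum_x p(x, y)

# Conditionals: a "slice" of the table, renormalized.
py_given_x = joint / px[:, None]      # p(y | x) = p(x, y) / p(x)

# Bayes' rule: reverse the conditional.
px_given_y = (py_given_x * px[:, None]) / py[None, :]
# Check against conditioning the joint on y directly.
assert np.allclose(px_given_y, joint / py[None, :])

# Entropy of the marginal p(x): H(p) = -sum_x p(x) log p(x).
H = -np.sum(px * np.log(px))
print("p(x) =", px, " H(p(x)) =", H)
```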
Be Careful!
• Watch the context: e.g. Simpson's paradox.
• Define random variables and sample spaces carefully: e.g. the Prisoner's paradox.

Cross Entropy (KL Divergence)
• An asymmetric measure of the distance between two distributions:
KL[p \| q] = \sum_x p(x)\,[\log p(x) - \log q(x)]
• KL > 0 unless p = q, in which case KL = 0.
• Tells you the extra cost if events were generated by p(x) but instead of charging under p(x) you charged under q(x).

Statistics
• Probability: inferring probabilistic quantities for data given fixed models (e.g. prob. of events, marginals, conditionals, etc.).
• Statistics: inferring a model given fixed data observations (e.g. clustering, classification, regression).
• Many approaches to statistics: frequentist, Bayesian, decision theory, ...

(Conditional) Probability Tables
• For discrete (categorical) quantities, the most basic parametrization is the probability table, which lists p(x_i = k-th value).
• Since PTs must be nonnegative and sum to 1, for k-ary variables there are k - 1 free parameters.
• If a discrete variable is conditioned on the values of some other discrete variables, we make one table for each possible setting of the parents: these are called conditional probability tables or CPTs.
[Figure: the joint table p(x, y, z) and a conditional slice p(x, y | z).]

Some (Conditional) Probability Functions
• Probability density functions p(x) (for continuous variables) or probability mass functions p(x = k) (for discrete variables) tell us how likely it is to get a particular value for a random variable (possibly conditioned on the values of some other variables).
• We can consider various types of variables: binary/discrete (categorical), continuous, interval, and integer counts.
• For each type we'll see some basic probability models which are parametrized families of distributions.

Exponential Family
• For a (continuous or discrete) random variable x,
p(x | \eta) = h(x) \exp\{\eta^\top T(x) - A(\eta)\} = \frac{1}{Z(\eta)}\,h(x) \exp\{\eta^\top T(x)\}
is an exponential family distribution with natural parameter \eta.
• The function T(x) is a sufficient statistic.
• The function A(\eta) = \log Z(\eta) is the log normalizer.
• Key idea: all you need to know about the data is captured in the summarizing function T(x).

Bernoulli
• For a binary random variable with p(heads) = \pi:
p(x | \pi) = \pi^x (1 - \pi)^{1 - x} = \exp\left\{ x \log\frac{\pi}{1 - \pi} + \log(1 - \pi) \right\}
• Exponential family with:
\eta = \log\frac{\pi}{1 - \pi}
T(x) = x
A(\eta) = -\log(1 - \pi) = \log(1 + e^{\eta})
h(x) = 1
• The logistic function relates the natural parameter and the chance of heads:
\pi = \frac{1}{1 + e^{-\eta}}

Multinomial
• For a set of integer counts on k trials:
p(x | \pi) = \frac{k!}{x_1! x_2! \cdots x_n!}\,\pi_1^{x_1} \pi_2^{x_2} \cdots \pi_n^{x_n} = h(x) \exp\left\{ \sum_i x_i \log \pi_i \right\}
• But the parameters are constrained: \sum_i \pi_i = 1. So we define the last one as \pi_n = 1 - \sum_{i=1}^{n-1} \pi_i:
p(x | \pi) = h(x) \exp\left\{ \sum_{i=1}^{n-1} \log\frac{\pi_i}{\pi_n}\, x_i + k \log \pi_n \right\}
• Exponential family with:
\eta_i = \log \pi_i - \log \pi_n
T(x_i) = x_i
A(\eta) = -k \log \pi_n = k \log \sum_i e^{\eta_i}
h(x) = k! / (x_1! x_2! \cdots x_n!)
• The softmax function relates the basic and natural parameters:
\pi_i = \frac{e^{\eta_i}}{\sum_j e^{\eta_j}}

Poisson
• For an integer count variable with rate \lambda:
p(x | \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!} \exp\{ x \log \lambda - \lambda \}
• Exponential family with:
\eta = \log \lambda
T(x) = x
A(\eta) = \lambda = e^{\eta}
h(x) = 1 / x!
• e.g. the number of photons x that arrive at a pixel during a fixed interval given mean intensity \lambda.
• Other count densities: binomial, exponential.
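As a quick numerical check of the exponential-family bookkeeping above, the following Python/NumPy sketch (the particular values of pi and p are made up for illustration) verifies that the Bernoulli mass function can be rewritten as h(x) exp{eta T(x) - A(eta)}, and that the logistic and softmax maps recover the basic parameters from the natural ones.

```python
import numpy as np

pi = 0.3                            # chance of heads
eta = np.log(pi / (1 - pi))         # natural parameter
A = np.log(1 + np.exp(eta))         # log normalizer, A(eta) = -log(1 - pi)

for x in (0, 1):
    standard = pi**x * (1 - pi)**(1 - x)    # pi^x (1 - pi)^(1 - x)
    exp_family = np.exp(eta * x - A)        # h(x) exp{eta T(x) - A(eta)}, h(x) = 1
    assert np.isclose(standard, exp_family)

# The logistic function maps the natural parameter back to pi.
assert np.isclose(1 / (1 + np.exp(-eta)), pi)

# Multinomial: the softmax maps natural parameters eta_i = log(pi_i / pi_n) back to pi.
p = np.array([0.2, 0.5, 0.3])
eta_vec = np.log(p / p[-1])                 # last component is the reference, eta_n = 0
softmax = np.exp(eta_vec) / np.exp(eta_vec).sum()
assert np.allclose(softmax, p)
print("Bernoulli and multinomial exponential-family forms check out.")
```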
Gaussian (normal)
• For a continuous univariate random variable:
p(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2}(x - \mu)^2 \right\} = \frac{1}{\sqrt{2\pi}} \exp\left\{ \frac{\mu x}{\sigma^2} - \frac{x^2}{2\sigma^2} - \frac{\mu^2}{2\sigma^2} - \log \sigma \right\}
• Exponential family with:
\eta = [\mu/\sigma^2 ;\; -1/(2\sigma^2)]
T(x) = [x ;\; x^2]
A(\eta) = \log \sigma + \mu^2/(2\sigma^2)
h(x) = 1/\sqrt{2\pi}
• Note: a univariate Gaussian is a two-parameter distribution with a two-component vector of sufficient statistics.
[Figure: a joint Gaussian p(x, y), its marginal p(x), and a conditional slice p(y | x = x0).]

Important Gaussian Facts
• All marginals of a Gaussian are again Gaussian.
• Any conditional of a Gaussian is again Gaussian.

Multivariate Gaussian Distribution
• For a continuous vector random variable:
p(x | \mu, \Sigma) = |2\pi\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\}
• Exponential family with:
\eta = [\Sigma^{-1}\mu ;\; -\tfrac{1}{2}\Sigma^{-1}]
T(x) = [x ;\; x x^\top]
A(\eta) = \log|\Sigma|/2 + \mu^\top \Sigma^{-1} \mu / 2
h(x) = (2\pi)^{-n/2}
• Sufficient statistics: mean vector and correlation matrix.
• Other densities: Student-t, Laplacian.
• For non-negative values use exponential, Gamma, log-normal.

Gaussian Marginals/Conditionals
• Finding these parameters is mostly linear algebra. Let z = [x^\top\, y^\top]^\top be normally distributed according to
z = \begin{bmatrix} x \\ y \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix} \right)
where C is the (non-symmetric) cross-covariance matrix between x and y, which has as many rows as the size of x and as many columns as the size of y.
• The marginal distributions are:
x \sim \mathcal{N}(a, A)
y \sim \mathcal{N}(b, B)
• and the conditional distributions are:
x | y \sim \mathcal{N}(a + C B^{-1}(y - b),\; A - C B^{-1} C^\top)
y | x \sim \mathcal{N}(b + C^\top A^{-1}(x - a),\; B - C^\top A^{-1} C)

Moments
• For continuous variables, moment calculations are important.
• We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(\eta).

Generalized Linear Models (GLMs)
• Generalized Linear Models: p(y | x) is exponential family with conditional mean \mu = f(\theta^\top x).
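The Gaussian marginal/conditional formulas above are easy to sanity-check numerically. Below is a minimal Python/NumPy sketch (the particular a, b, A, B, C values are made up for illustration) that computes x | y from the formulas on the slide and confirms the result against an independent computation through the precision matrix of the joint.

```python
import numpy as np

# A made-up joint Gaussian over z = [x; y], with x two-dimensional and y one-dimensional.
a = np.array([1.0, -1.0])            # mean of x
b = np.array([0.5])                  # mean of y
A = np.array([[2.0, 0.3],
              [0.3, 1.0]])           # cov(x)
B = np.array([[1.5]])                # cov(y)
C = np.array([[0.4],
              [0.2]])                # cross-covariance between x and y

y0 = np.array([1.2])                 # value we condition on

# Conditional x | y = y0 from the slide's formulas.
mean_formula = a + C @ np.linalg.solve(B, y0 - b)          # a + C B^{-1} (y0 - b)
cov_formula = A - C @ np.linalg.solve(B, C.T)              # A - C B^{-1} C^T

# Independent check via the precision (inverse covariance) of the joint:
# the conditional covariance is the inverse of the x-block of the precision,
# and the conditional mean is a - Lxx^{-1} Lxy (y0 - b).
Sigma = np.block([[A, C], [C.T, B]])
Lam = np.linalg.inv(Sigma)
Lxx, Lxy = Lam[:2, :2], Lam[:2, 2:]
cov_precision = np.linalg.inv(Lxx)
mean_precision = a - np.linalg.solve(Lxx, Lxy @ (y0 - b))

assert np.allclose(mean_formula, mean_precision)
assert np.allclose(cov_formula, cov_precision)
print("x | y = y0  ~  N(", mean_formula, ",", cov_formula, ")")
```

The precision-matrix route works because block inversion gives Lxx = (A - C B^{-1} C^T)^{-1}, so inverting the x-block of the precision recovers exactly the conditional covariance quoted on the slide.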