Probability and Statistics
Sam Roweis
October, 2002

Random Variables and Densities
Review:
• Random variables X represent outcomes or states of the world.
  Instantiations of variables are usually in lower case: x.
  We will write p(x) to mean probability(X = x).
• Sample Space: the space of all possible outcomes/states.
  (May be discrete or continuous or mixed.)
• Probability mass (density) function p(x) ≥ 0.
  Assigns a non-negative number to each point in sample space.
  Sums (integrates) to unity: Σ_x p(x) = 1 or ∫_x p(x) dx = 1.
  Intuitively: how often does x occur, how much do we believe in x.
• Ensemble: random variable + sample space + probability function.
• But you have to be careful about the context and about defining random variables and sample spaces carefully. Otherwise you can get into trouble (see, e.g., Simpson's paradox / the Prisoner's paradox).

Probability
• We use probabilities p(x) to represent our beliefs B(x) about the states x of the world.
• There is a formal calculus for manipulating uncertainties represented by probabilities.
• Any consistent set of beliefs obeying the Cox Axioms can be mapped into probabilities:
  1. Rationally ordered degrees of belief:
     if B(x) > B(y) and B(y) > B(z) then B(x) > B(z)
  2. Belief in x and its negation x̄ are related: B(x) = f[B(x̄)]
  3. Belief in conjunction depends only on conditionals:
     B(x and y) = g[B(x), B(y|x)] = g[B(y), B(x|y)]

Expectations, Moments
• Expectation of a function a(x) is written E[a] or ⟨a⟩:
  E[a] = ⟨a⟩ = Σ_x p(x) a(x)
  e.g. mean = Σ_x x p(x), variance = Σ_x (x − E[x])² p(x)
• Moments are expectations of higher order powers.
  (Mean is the first moment. Autocorrelation is the second moment.)
• Centralized moments have lower moments subtracted away (e.g. variance, skew, kurtosis).
• Deep fact: knowledge of all orders of moments completely defines the entire distribution.

Joint Probability
• Key concept: two or more random variables may interact. Thus, the probability of one taking on a certain value depends on which value(s) the others are taking.
• We call this a joint ensemble and write p(x, y) = prob(X = x and Y = y).
  (Figure: a three-dimensional joint probability table p(x, y, z).)

Conditional Probability
• If we know that some event has occurred, it changes our belief about the probability of other events.
• This is like taking a "slice" through the joint table:
  p(x|y) = p(x, y) / p(y)
  (Figure: a slice p(x, y|z) of the joint table p(x, y, z).)

Marginal Probabilities
• We can "sum out" part of a joint distribution to get the marginal distribution of a subset of variables:
  p(x) = Σ_y p(x, y)
• This is like adding slices of the table together.
• Another equivalent definition: p(x) = Σ_y p(x|y) p(y).

Bayes' Rule
• Manipulating the basic definition of conditional probability gives one of the most important formulas in probability theory:
  p(x|y) = p(y|x) p(x) / p(y) = p(y|x) p(x) / Σ_x′ p(y|x′) p(x′)
• This gives us a way of "reversing" conditional probabilities.
• Thus, all joint probabilities can be factored by selecting an ordering for the random variables and using the "chain rule":
  p(x, y, z, ...) = p(x) p(y|x) p(z|x, y) p(...|x, y, z)
  (A short numerical sketch of these operations appears below, after the Cross Entropy slide.)

Independence & Conditional Independence
• Two variables are independent iff their joint factors:
  p(x, y) = p(x) p(y)
• Two variables are conditionally independent given a third one if for all values of the conditioning variable, the resulting slice factors:
  p(x, y|z) = p(x|z) p(y|z)  ∀z

Cross Entropy (KL Divergence)
• An asymmetric measure of the distance between two distributions:
  KL[p‖q] = Σ_x p(x) [log p(x) − log q(x)]
• KL > 0 unless p = q, in which case KL = 0.
• Tells you the extra cost if events were generated by p(x) but instead of charging under p(x) you charged under q(x).
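The table operations above (marginalization, conditioning, Bayes' rule) and the KL divergence are easy to check numerically. The following is a minimal sketch, not part of the original slides, assuming numpy is available; the 2x3 joint table and the model q(x) are made-up values chosen purely for illustration.

    import numpy as np

    # a made-up joint table p(x, y): rows index x, columns index y
    p_xy = np.array([[0.10, 0.20, 0.10],
                     [0.25, 0.05, 0.30]])

    p_x = p_xy.sum(axis=1)               # marginal p(x): "sum out" y
    p_y = p_xy.sum(axis=0)               # marginal p(y): "sum out" x

    p_x_given_y = p_xy / p_y             # conditional slices p(x|y) = p(x,y)/p(y)
    p_y_given_x = p_xy / p_x[:, None]    # conditional slices p(y|x) = p(x,y)/p(x)

    # Bayes' rule: p(x|y) = p(y|x) p(x) / sum_x' p(y|x') p(x')
    numer = p_y_given_x * p_x[:, None]
    assert np.allclose(numer / numer.sum(axis=0), p_x_given_y)

    # KL[p||q] between the marginal p(x) and an arbitrary model q(x)
    q_x = np.array([0.5, 0.5])
    kl = np.sum(p_x * (np.log(p_x) - np.log(q_x)))   # >= 0, zero iff p = q

The same pattern extends to tables over more variables; each extra variable just adds an axis to sum or slice over.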
Entropy
• Measures the amount of ambiguity or uncertainty in a distribution:
  H(p) = − Σ_x p(x) log p(x)
• Expected value of − log p(x) (a function which depends on p(x)!).
• H(p) > 0 unless there is only one possible outcome, in which case H(p) = 0.
• Maximal value when p is uniform.
• Tells you the expected "cost" if each event costs − log p(event).

Jensen's Inequality
• For any concave function f() and any distribution on x,
  E[f(x)] ≤ f(E[x])
• e.g. log() and √() are concave.
• This allows us to bound expressions like log p(x) = log Σ_z p(x, z).
  (Figure: a chord of a concave curve lies below the curve, so E[f(x)] ≤ f(E[x]).)

Statistics
• Probability: inferring probabilistic quantities for data given fixed models (e.g. prob. of events, marginals, conditionals, etc).
• Statistics: inferring a model given fixed data observations (e.g. clustering, classification, regression).
• Many approaches to statistics: frequentist, Bayesian, decision theory, ...

(Conditional) Probability Tables
• For discrete (categorical) quantities, the most basic parametrization is the probability table, which lists p(x_i = k-th value).
• Since PTs must be nonnegative and sum to 1, for k-ary variables there are k − 1 free parameters.
• If a discrete variable is conditioned on the values of some other discrete variables, we make one table for each possible setting of the parents: these are called conditional probability tables or CPTs.

Some (Conditional) Probability Functions
• Probability density functions p(x) (for continuous variables) or probability mass functions p(x = k) (for discrete variables) tell us how likely it is to get a particular value for a random variable (possibly conditioned on the values of some other variables).
• We can consider various types of variables: binary/discrete (categorical), continuous, interval, and integer counts.
• For each type we'll see some basic probability models which are parametrized families of distributions.

Exponential Family
• For a (continuous or discrete) random variable x,
  p(x|η) = h(x) exp{ η^⊤ T(x) − A(η) }
         = (1/Z(η)) h(x) exp{ η^⊤ T(x) }
  is an exponential family distribution with natural parameter η.
• Function T(x) is a sufficient statistic.
• Function A(η) = log Z(η) is the log normalizer.
• Key idea: all you need to know about the data is captured in the summarizing function T(x).

Bernoulli
• For a binary random variable with p(heads) = π:
  p(x|π) = π^x (1 − π)^(1−x)
         = exp{ x log(π/(1 − π)) + log(1 − π) }
• Exponential family with:
  η = log(π/(1 − π))
  T(x) = x
  A(η) = − log(1 − π) = log(1 + e^η)
  h(x) = 1
• The logistic function relates the natural parameter and the chance of heads:
  π = 1/(1 + e^(−η))

Multinomial
• For a set of integer counts on k trials:
  p(x|π) = (k!/(x_1! x_2! ··· x_n!)) π_1^(x_1) π_2^(x_2) ··· π_n^(x_n) = h(x) exp{ Σ_i x_i log π_i }
• But the parameters are constrained: Σ_i π_i = 1.
  So we define the last one π_n = 1 − Σ_{i=1}^{n−1} π_i, giving
  p(x|π) = h(x) exp{ Σ_{i=1}^{n−1} log(π_i/π_n) x_i + k log π_n }
• Exponential family with:
  η_i = log π_i − log π_n
  T(x_i) = x_i
  A(η) = −k log π_n = k log Σ_i e^(η_i)
  h(x) = k!/(x_1! x_2! ··· x_n!)
• The softmax function relates the basic and natural parameters:
  π_i = e^(η_i) / Σ_j e^(η_j)

Poisson
• For an integer count variable with rate λ:
  p(x|λ) = λ^x e^(−λ) / x!
         = (1/x!) exp{ x log λ − λ }
• Exponential family with:
  η = log λ
  T(x) = x
  A(η) = λ = e^η
  h(x) = 1/x!
• e.g. the number of photons x that arrive at a pixel during a fixed interval given mean intensity λ.
• Other count densities: binomial, exponential.
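As a quick sanity check on the exponential-family forms above, here is a minimal sketch, not part of the original slides, using only the Python standard library; the rate λ = 2.5 is an arbitrary made-up value. It confirms that the Poisson pmf equals h(x) exp{η T(x) − A(η)} with η = log λ, T(x) = x, A(η) = e^η and h(x) = 1/x!; the Bernoulli and multinomial forms can be checked the same way.

    import math

    lam = 2.5                      # made-up rate λ
    eta = math.log(lam)            # natural parameter η = log λ

    for x in range(10):
        standard   = lam**x * math.exp(-lam) / math.factorial(x)            # λ^x e^(−λ) / x!
        exp_family = math.exp(eta * x - math.exp(eta)) / math.factorial(x)  # h(x) exp{ηx − A(η)}
        assert math.isclose(standard, exp_family)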
Gaussian (normal)
• For a continuous univariate random variable:
  p(x|µ, σ²) = (1/(√(2π) σ)) exp{ −(x − µ)²/(2σ²) }
             = (1/√(2π)) exp{ µx/σ² − x²/(2σ²) − µ²/(2σ²) − log σ }
• Exponential family with:
  η = [µ/σ² ; −1/(2σ²)]
  T(x) = [x ; x²]
  A(η) = log σ + µ²/(2σ²)
  h(x) = 1/√(2π)
• Note: a univariate Gaussian is a two-parameter distribution with a two-component vector of sufficient statistics.

Multivariate Gaussian Distribution
• For a continuous vector random variable:
  p(x|µ, Σ) = |2πΣ|^(−1/2) exp{ −(1/2)(x − µ)^⊤ Σ^(−1) (x − µ) }
• Exponential family with:
  η = [Σ^(−1)µ ; −(1/2)Σ^(−1)]
  T(x) = [x ; xx^⊤]
  A(η) = log|Σ|/2 + µ^⊤ Σ^(−1) µ/2
  h(x) = (2π)^(−n/2)
• Sufficient statistics: mean vector and correlation matrix.
• Other densities: Student-t, Laplacian.
• For non-negative values use exponential, Gamma, log-normal.

Gaussian Marginals/Conditionals
• Finding the parameters of Gaussian marginals and conditionals is mostly linear algebra.
• Let z = [x^⊤ y^⊤]^⊤ be normally distributed according to:
  z = [x ; y] ∼ N( [a ; b], [A C ; C^⊤ B] )
  where C is the (non-symmetric) cross-covariance matrix between x and y, which has as many rows as the size of x and as many columns as the size of y.
• The marginal distributions are:
  x ∼ N(a; A)
  y ∼ N(b; B)
  and the conditional distributions are:
  x|y ∼ N(a + CB^(−1)(y − b); A − CB^(−1)C^⊤)
  y|x ∼ N(b + C^⊤A^(−1)(x − a); B − C^⊤A^(−1)C)

Important Gaussian Facts
• All marginals of a Gaussian are again Gaussian.

Moments
• For continuous variables, moment calculations are important.
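The conditioning formula on the Gaussian Marginals/Conditionals slide above is easy to exercise numerically. Here is a minimal sketch, not part of the original slides, assuming numpy is available; the means, covariance blocks and observed y are made-up values chosen only so the joint covariance is valid. It computes the mean and covariance of x|y as a + CB^(−1)(y − b) and A − CB^(−1)C^⊤.

    import numpy as np

    a = np.array([0.0, 1.0])                  # mean of x
    b = np.array([2.0, -1.0])                 # mean of y
    A = np.array([[2.0, 0.3], [0.3, 1.0]])    # Cov[x]
    B = np.array([[1.5, 0.2], [0.2, 1.0]])    # Cov[y]
    C = np.array([[0.4, 0.1], [0.0, 0.3]])    # cross-covariance between x and y

    y_obs = np.array([2.5, -0.5])             # an observed value of y

    B_inv = np.linalg.inv(B)
    cond_mean = a + C @ B_inv @ (y_obs - b)   # a + C B^(-1) (y - b)
    cond_cov  = A - C @ B_inv @ C.T           # A - C B^(-1) C^T

    print(cond_mean)
    print(cond_cov)

The symmetric formula for y|x follows by swapping the roles of the two blocks.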