Random Variables and Densities

Probability and Statistics. Sam Roweis, Machine Learning Summer School, January 2005.

• Review: a random variable X represents outcomes or states of the world. Instantiations of variables are usually written in lower case: x. We write p(x) to mean probability(X = x).
• Sample space: the space of all possible outcomes/states. (May be discrete or continuous or mixed.)
• Probability mass (density) function p(x) ≥ 0: assigns a non-negative number to each point in the sample space and sums (integrates) to unity: ∑_x p(x) = 1 or ∫_x p(x) dx = 1. Intuitively: how often does x occur, how much do we believe in x.
• Ensemble: random variable + sample space + probability function.
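As a tiny illustrative aside (not from the original slides), here is what these definitions look like on a made-up discrete sample space, using NumPy:

```python
import numpy as np

# Hypothetical sample space: faces of a biased six-sided die (numbers invented).
p = np.array([0.25, 0.25, 0.20, 0.15, 0.10, 0.05])

assert np.all(p >= 0)              # p(x) >= 0 for every point in the sample space
assert np.isclose(p.sum(), 1.0)    # sums to unity: sum_x p(x) = 1

print("p(X = 3) =", p[2])          # how much we believe in the outcome x = 3
```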

Probability

• We use p(x) to represent our beliefs B(x) about the states x of the world.
• There is a formal calculus for manipulating uncertainties represented by probabilities.
• Any consistent set of beliefs obeying the Cox Axioms can be mapped into probabilities:
  1. Rationally ordered degrees of belief:
     if B(x) > B(y) and B(y) > B(z) then B(x) > B(z)
  2. Belief in x and its negation x̄ are related: B(x) = f[B(x̄)]
  3. Belief in conjunction depends only on conditionals:
     B(x and y) = g[B(x), B(y|x)] = g[B(y), B(x|y)]

Expectations, Moments

• Expectation of a function a(x) is written E[a] or ⟨a⟩:
  E[a] = ⟨a⟩ = ∑_x p(x) a(x)
  e.g. mean = ∑_x x p(x), variance = ∑_x (x − E[x])² p(x)
• Moments are expectations of higher order powers. (Mean is first moment. Autocorrelation is second moment.)
• Centralized moments have lower moments subtracted away (e.g. variance, skew, kurtosis).
• Deep fact: knowledge of all orders of moments completely defines the entire distribution.
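A minimal sketch of these expectation formulas on a discrete ensemble (the pmf below is invented for illustration; the code is an addition, not from the slides):

```python
import numpy as np

x = np.arange(6)                                  # sample space {0, ..., 5}
p = np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1])      # arbitrary pmf for illustration

def expectation(a, p, x):
    """E[a] = sum_x p(x) a(x)."""
    return np.sum(p * a(x))

mean = expectation(lambda v: v, p, x)                  # first moment
var = expectation(lambda v: (v - mean) ** 2, p, x)     # second central moment
second_moment = expectation(lambda v: v ** 2, p, x)    # uncentred second moment

print(mean, var, second_moment - mean ** 2)   # last two agree: Var[x] = E[x^2] - E[x]^2
```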

Means, Variances and Covariances

• Remember the definition of the mean and covariance of a vector random variable:
  E[x] = ∫ x p(x) dx = m
  Cov[x] = E[(x − m)(x − m)⊤] = ∫ (x − m)(x − m)⊤ p(x) dx = V
  which is the expected value of the outer product of the variable with itself, after subtracting the mean.
• Also, the covariance between two variables:
  Cov[x, y] = E[(x − m_x)(y − m_y)⊤] = ∫ (x − m_x)(y − m_y)⊤ p(x, y) dx dy = C
  which is the expected value of the outer product of one variable with another, after subtracting their means.
• Note: C is not symmetric.

Marginal Probabilities

• We can "sum out" part of a joint distribution to get the marginal distribution of a subset of variables:
  p(x) = ∑_y p(x, y)
• This is like adding slices of the table together (collapsing the joint table along the summed-out axis).
• Another equivalent definition: p(x) = ∑_y p(x|y) p(y).
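A small NumPy sketch of both ideas, summing out a variable of a joint table and computing a covariance as an expected product; the 2x3 table is invented and the code is an illustrative addition:

```python
import numpy as np

# Joint table p(x, y) over x in {0, 1} and y in {0, 1, 2} (made-up numbers).
pxy = np.array([[0.10, 0.20, 0.10],
                [0.25, 0.15, 0.20]])

px = pxy.sum(axis=1)   # marginal p(x) = sum_y p(x, y): add slices of the table
py = pxy.sum(axis=0)   # marginal p(y) = sum_x p(x, y)

# Cross-covariance Cov[x, y] = sum_{x,y} (x - m_x)(y - m_y) p(x, y)  (scalars here).
xs, ys = np.arange(2), np.arange(3)
mx, my = xs @ px, ys @ py
C = sum(pxy[i, j] * (xs[i] - mx) * (ys[j] - my) for i in range(2) for j in range(3))
print(px, py, C)
```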

Joint Probability

• Key concept: two or more random variables may interact. Thus, the probability of one taking on a certain value depends on which value(s) the others are taking.
• We call this a joint ensemble and write
  p(x, y) = prob(X = x and Y = y)

Conditional Probability

• If we know that some event has occurred, it changes our belief about the probability of other events.
• This is like taking a "slice" through the joint table: e.g. fixing z in the joint p(x, y, z) gives a slice proportional to p(x, y|z).
• p(x|y) = p(x, y)/p(y)
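As a sketch (an illustrative addition, reusing the invented joint table from above), a conditional is just a slice of the joint, renormalized:

```python
import numpy as np

pxy = np.array([[0.10, 0.20, 0.10],    # p(x, y): rows index x, columns index y
                [0.25, 0.15, 0.20]])

py = pxy.sum(axis=0)                   # p(y)
px_given_y = pxy / py                  # p(x | y) = p(x, y) / p(y), one column per value of y

print(px_given_y[:, 1])                # the "slice" at y = 1, renormalized
print(px_given_y.sum(axis=0))          # each conditional sums to 1
```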


Bayes' Rule

• Manipulating the basic definition of conditional probability gives one of the most important formulas in probability:
  p(x|y) = p(y|x) p(x) / p(y) = p(y|x) p(x) / ∑_x′ p(y|x′) p(x′)
• This gives us a way of "reversing" conditional probabilities.
• Thus, all joint probabilities can be factored by selecting an ordering for the random variables and using the "chain rule":
  p(x, y, z, . . .) = p(x) p(y|x) p(z|x, y) p(. . . |x, y, z)

Entropy

• Measures the amount of ambiguity or uncertainty in a distribution:
  H(p) = −∑_x p(x) log p(x)
• Expected value of −log p(x) (a function which depends on p(x)!).
• H(p) > 0 unless only one possible outcome, in which case H(p) = 0.
• Maximal value when p is uniform.
• Tells you the expected "cost" if each event costs −log p(event).
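A minimal numeric sketch of Bayes' rule and the entropy formula; the prior and likelihood arrays are invented, and the code is an addition for illustration:

```python
import numpy as np

prior = np.array([0.7, 0.3])                 # p(x), invented
likelihood = np.array([[0.9, 0.2],           # p(y | x): rows index y, columns index x
                       [0.1, 0.8]])

# Bayes' rule: p(x | y) = p(y | x) p(x) / sum_x' p(y | x') p(x')
joint = likelihood * prior                   # element [y, x] = p(y | x) p(x)
posterior = joint / joint.sum(axis=1, keepdims=True)
print(posterior[0])                          # p(x | y = 0)

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), with the convention 0 log 0 = 0."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

print(entropy(prior), entropy(np.array([0.5, 0.5])))   # the uniform case is maximal
```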

Independence & Conditional Independence

• Two variables are independent iff their joint factors:
  p(x, y) = p(x) p(y)
  (The joint table is then the outer product of the two marginals p(x) and p(y).)
• Two variables are conditionally independent given a third one if for all values of the conditioning variable, the resulting slice factors:
  p(x, y|z) = p(x|z) p(y|z)  ∀z

Cross Entropy (KL Divergence)

• An asymmetric measure of the distance between two distributions:
  KL[p‖q] = ∑_x p(x) [log p(x) − log q(x)]
• KL > 0 unless p = q, in which case KL = 0.
• Tells you the extra cost if events were generated by p(x) but instead of charging under p(x) you charged under q(x). (See the code sketch below.)

Statistics

• Probability: inferring probabilistic quantities for data given fixed models (e.g. prob. of events, marginals, conditionals, etc).
• Statistics: inferring a model given fixed data observations (e.g. clustering, classification, regression).
• Many approaches to statistics: frequentist, Bayesian, decision theory, ...

(Conditional) Probability Tables

• For discrete (categorical) quantities, the most basic parametrization is the probability table, which lists p(x_i = kth value).
• Since PTs must be nonnegative and sum to 1, for k-ary variables there are k − 1 free parameters.
• If a discrete variable is conditioned on the values of some other discrete variables, we make one table for each possible setting of the parents: these are called conditional probability tables or CPTs.
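A small numeric sketch of the KL divergence defined above (the two distributions are invented; the code is an illustrative addition):

```python
import numpy as np

def kl(p, q):
    """KL[p || q] = sum_x p(x) (log p(x) - log q(x)); assumes q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(kl(p, q), kl(q, p))   # asymmetric: the two directions differ
print(kl(p, p))             # zero iff the distributions are equal
```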


Some (Conditional) Probability Functions

• Probability density functions p(x) (for continuous variables) or probability mass functions p(x = k) (for discrete variables) tell us how likely it is to get a particular value for a random variable (possibly conditioned on the values of some other variables).
• We can consider various types of variables: binary/discrete (categorical), continuous, interval, and integer counts.
• For each type we'll see some basic probability models which are parametrized families of distributions.

Exponential Family

• For a (continuous or discrete) random variable x,
  p(x|η) = h(x) exp{η⊤T(x) − A(η)} = (1/Z(η)) h(x) exp{η⊤T(x)}
  is an exponential family distribution with natural parameter η.
• Function T(x) is a sufficient statistic.
• Function A(η) = log Z(η) is the log normalizer.
• Key idea: all you need to know about the data is captured in the summarizing function T(x).

Bernoulli Distribution

• For a binary random variable x ∈ {0, 1} with p(x = 1) = π:
  p(x|π) = π^x (1 − π)^(1−x) = exp{ x log(π/(1 − π)) + log(1 − π) }
• Exponential family with:
  η = log(π/(1 − π))
  T(x) = x
  A(η) = −log(1 − π) = log(1 + e^η)
  h(x) = 1
• The logistic function links natural parameter and chance of heads (numeric check below):
  π = 1/(1 + e^(−η)) = logistic(η)

Multinomial

• For a categorical (discrete) random variable taking on K possible values, let π_k be the probability of the kth value. We can use a binary vector x = (x_1, x_2, . . . , x_k, . . . , x_K) in which x_k = 1 if and only if the variable takes on its kth value. Now we can write
  p(x|π) = π_1^{x_1} π_2^{x_2} · · · π_K^{x_K} = exp{ ∑_i x_i log π_i }
• Exactly like a probability table, but written using binary vectors.
• If we observe this variable several times X = {x^1, x^2, . . . , x^N}, the (iid) probability depends on the total observed counts of each value:
  p(X|π) = ∏_n p(x^n|π) = exp{ ∑_i (∑_n x_i^n) log π_i } = exp{ ∑_i c_i log π_i }
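A quick numeric check of the Bernoulli exponential-family bookkeeping above (illustrative addition, with an arbitrary choice of π):

```python
import numpy as np

pi = 0.3
eta = np.log(pi / (1 - pi))          # natural parameter
A = np.log(1 + np.exp(eta))          # log normalizer A(eta) = -log(1 - pi)

def p_exp_family(x):
    # h(x) exp{eta * T(x) - A(eta)} with h(x) = 1, T(x) = x
    return np.exp(eta * x - A)

for x in (0, 1):
    assert np.isclose(p_exp_family(x), pi**x * (1 - pi)**(1 - x))

print(1 / (1 + np.exp(-eta)))        # logistic(eta) recovers pi = p(x = 1)
```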

Poisson

• For an integer count variable with rate λ:
  p(x|λ) = λ^x e^(−λ) / x! = (1/x!) exp{ x log λ − λ }
• Exponential family with:
  η = log λ
  T(x) = x
  A(η) = λ = e^η
  h(x) = 1/x!
• e.g. the number of photons x that arrive at a pixel during a fixed interval given mean intensity λ.
• Other count densities: (neg)binomial, geometric.

Multinomial as Exponential Family

• The multinomial parameters are constrained: ∑_i π_i = 1.
  Define (the last) one in terms of the rest: π_K = 1 − ∑_{i=1}^{K−1} π_i
  p(x|π) = exp{ ∑_{i=1}^{K−1} x_i log(π_i/π_K) + k log π_K }
• Exponential family with:
  η_i = log π_i − log π_K
  T(x_i) = x_i
  A(η) = −k log π_K = k log ∑_i e^(η_i)
  h(x) = 1
• The softmax function relates direct and natural parameters (code sketch below):
  π_i = e^(η_i) / ∑_j e^(η_j)

Gaussian (normal)

• For a continuous univariate random variable:
  p(x|µ, σ²) = (1/(√(2π) σ)) exp{ −(x − µ)²/(2σ²) }
             = (1/√(2π)) exp{ µx/σ² − x²/(2σ²) − µ²/(2σ²) − log σ }
• Exponential family with:
  η = [µ/σ² ; −1/(2σ²)]
  T(x) = [x ; x²]
  A(η) = log σ + µ²/(2σ²)
  h(x) = 1/√(2π)
• Note: a univariate Gaussian is a two-parameter distribution with a two-component vector of sufficient statistics.

Important Gaussian Facts

• All marginals of a Gaussian are again Gaussian.
• Any conditional of a Gaussian is again Gaussian.
  (Figure: a joint Gaussian p(x, y) with covariance Σ; the slice at x = x0 gives the conditional p(y|x = x0), and summing out y gives the marginal p(x).)
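A sketch of the softmax relation between natural and direct multinomial parameters (the π values are invented; the code is an illustrative addition):

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])          # direct parameters, sum to 1
eta = np.log(pi) - np.log(pi[-1])       # eta_i = log pi_i - log pi_K (so eta_K = 0)

softmax = np.exp(eta) / np.exp(eta).sum()
print(softmax)                          # recovers pi: [0.5, 0.3, 0.2]
```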

Multivariate Gaussian Distribution

• For a continuous vector random variable:
  p(x|µ, Σ) = |2πΣ|^(−1/2) exp{ −½ (x − µ)⊤ Σ^(−1) (x − µ) }
• Exponential family with:
  η = [Σ^(−1)µ ; −½ Σ^(−1)]
  T(x) = [x ; xx⊤]
  A(η) = log |Σ|/2 + µ⊤Σ^(−1)µ/2
  h(x) = (2π)^(−n/2)
• Sufficient statistics: mean vector and correlation matrix.
• Other densities: Student-t, Laplacian.
• For non-negative values use exponential, Gamma, log-normal.

Gaussian Marginals/Conditionals

• To find these parameters is mostly linear algebra: let z = [x⊤ y⊤]⊤ be normally distributed according to
  z = [x ; y] ∼ N( [a ; b] , [A C ; C⊤ B] )
  where C is the (non-symmetric) cross-covariance matrix between x and y, which has as many rows as the size of x and as many columns as the size of y.
• The marginal distributions are:
  x ∼ N(a; A)
  y ∼ N(b; B)
• The conditional distributions are:
  x|y ∼ N(a + CB^(−1)(y − b); A − CB^(−1)C⊤)
  y|x ∼ N(b + C⊤A^(−1)(x − a); B − C⊤A^(−1)C)
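A NumPy sketch of the conditional formula on a small, invented joint Gaussian (x and y are both one-dimensional here; the code is an illustrative addition):

```python
import numpy as np

a, b = np.array([0.0]), np.array([1.0])          # means of x and y
A = np.array([[2.0]]); B = np.array([[1.0]])     # covariances of x and y
C = np.array([[0.8]])                            # cross-covariance Cov[x, y]

y_obs = np.array([2.0])

# x | y ~ N(a + C B^{-1} (y - b),  A - C B^{-1} C^T)
Binv = np.linalg.inv(B)
cond_mean = a + C @ Binv @ (y_obs - b)
cond_cov = A - C @ Binv @ C.T
print(cond_mean, cond_cov)       # [0.8], [[1.36]]
```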

Parameter Constraints

• If we want to use general optimizations (e.g. conjugate gradient) to learn latent variable models, we often have to make sure parameters respect certain constraints (e.g. ∑_k α_k = 1, Σ_k positive definite).
• A good trick is to reparameterize these quantities in terms of unconstrained values. For mixing proportions, use the softmax:
  α_k = exp(q_k) / ∑_j exp(q_j)
• For covariance matrices, use the Cholesky decomposition (code sketch below):
  Σ^(−1) = A⊤A
  |Σ|^(−1/2) = ∏_i A_ii
  where A is upper triangular with positive diagonal:
  A_ii = exp(r_i) > 0,  A_ij = a_ij (j > i),  A_ij = 0 (j < i)

Parameterizing Conditionals

• When the variable(s) being conditioned on (parents) are discrete, we just have one density for each possible setting of the parents, e.g. a table of natural parameters in exponential models or a table of tables for discrete models.
• When the conditioned variable is continuous, its value sets some of the parameters for the other variables.
• A very common instance of this for regression is the "linear-Gaussian": p(y|x) = gauss(θ⊤x; Σ).
• For discrete children and continuous parents, we often use a Bernoulli/multinomial whose parameters are some function f(θ⊤x).
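A sketch of the unconstrained Cholesky-style parameterization for a 3x3 precision matrix; the unconstrained values r and a are arbitrary, and the code is an illustrative addition:

```python
import numpy as np

d = 3
r = np.array([0.1, -0.3, 0.2])      # unconstrained diagonal parameters
a = np.array([0.5, -1.0, 0.7])      # unconstrained strictly-upper-triangular entries

A = np.zeros((d, d))
A[np.diag_indices(d)] = np.exp(r)   # A_ii = exp(r_i) > 0
A[np.triu_indices(d, k=1)] = a      # A_ij = a_ij for j > i, zero below the diagonal

prec = A.T @ A                      # Sigma^{-1} = A^T A, positive definite by construction
print(np.linalg.eigvalsh(prec) > 0)                      # all True
print(np.prod(np.diag(A)), np.sqrt(np.linalg.det(prec))) # |Sigma|^{-1/2} = prod_i A_ii
```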

Moments

• For continuous variables, moment calculations are important.
• We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(η). The qth derivative gives the qth centred moment:
  dA(η)/dη = mean
  d²A(η)/dη² = variance
  · · ·
• When the sufficient statistic is a vector, partial derivatives need to be considered.

Generalized Linear Models (GLMs)

• Generalized Linear Models: p(y|x) is exponential family with conditional mean µ = f(θ⊤x).
• The function f is called the response function; if we choose it to be the inverse of the mapping between conditional mean and natural parameters,
  η = ψ(µ),  f(·) = ψ^(−1)(·),
  then it is called the canonical response function.
• We can be even more general and define distributions by arbitrary energy functions proportional to the log probability:
  p(x) ∝ exp{ −∑_k H_k(x) }
• A common choice is to use pairwise terms in the energy:
  H(x) = ∑_i a_i x_i + ∑_{pairs ij} w_ij x_i x_j

Matrix Inversion Lemma (Sherman-Morrison-Woodbury Formulae)

• There is a good trick for inverting matrices when they can be decomposed into the sum of an easily inverted matrix (D) and a low rank outer product. It is called the matrix inversion lemma (numeric check below):
  (D − AB^(−1)A⊤)^(−1) = D^(−1) + D^(−1)A(B − A⊤D^(−1)A)^(−1)A⊤D^(−1)
• The same trick can be used to compute determinants:
  log |D + AB^(−1)A⊤| = log |D| − log |B| + log |B + A⊤D^(−1)A|

Jensen's Inequality

• For any concave function f(·) and any distribution on x,
  E[f(x)] ≤ f(E[x])
• e.g. log(·) and √· are concave.
• This allows us to bound expressions like
  log p(x) = log ∑_z p(x, z)
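A quick NumPy check of the matrix inversion lemma and the determinant version on random matrices (shapes and values are arbitrary; the code is an illustrative addition):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2
A = rng.standard_normal((n, k))                  # low-rank factor
D = np.diag(rng.uniform(1.0, 2.0, n))            # easily inverted diagonal matrix
B = 50.0 * np.eye(k)                             # well-conditioned k x k matrix

Dinv, Binv = np.linalg.inv(D), np.linalg.inv(B)

lhs = np.linalg.inv(D - A @ Binv @ A.T)
rhs = Dinv + Dinv @ A @ np.linalg.inv(B - A.T @ Dinv @ A) @ A.T @ Dinv
print(np.allclose(lhs, rhs))                     # True

# Determinant version: log|D + A B^{-1} A^T| = log|D| - log|B| + log|B + A^T D^{-1} A|
lhs_det = np.linalg.slogdet(D + A @ Binv @ A.T)[1]
rhs_det = (np.linalg.slogdet(D)[1] - np.linalg.slogdet(B)[1]
           + np.linalg.slogdet(B + A.T @ Dinv @ A)[1])
print(np.allclose(lhs_det, rhs_det))             # True
```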

Matrix Derivatives

• Here are some useful matrix derivatives:
  ∂/∂A log |A| = (A^(−1))⊤
  ∂/∂A trace[B⊤A] = B
  ∂/∂A trace[BA⊤CA] = 2CAB   (for symmetric B and C)

Logsum

• Often you can easily compute b_k = log p(x|z = k, θ_k), but it will be very negative, say −10⁶ or smaller.
• Now, to compute ℓ = log p(x|θ) you need to compute log ∑_k e^(b_k). (e.g. for calculating responsibilities at test time or for learning)
• Careful! Do not compute this by doing log(sum(exp(b))). You will get underflow and an incorrect answer. Instead do this:
  – Add a constant exponent B to all the values b_k such that the largest value comes close to the maximum exponent allowed by machine precision: B = MAXEXPONENT − log(K) − max(b).
  – Compute log(sum(exp(b + B))) − B.
• Example: if log p(x|z = 1) = −120 and log p(x|z = 2) = −120, what is log p(x) = log[p(x|z = 1) + p(x|z = 2)]?
  Answer: log[2 e^(−120)] = −120 + log 2.
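A sketch of the logsum trick in code; for simplicity it shifts by max(b) rather than by the MAXEXPONENT-based constant above, which serves the same purpose of avoiding underflow (the code is an illustrative addition):

```python
import numpy as np

def logsumexp(b):
    """log(sum_k exp(b_k)), computed stably by shifting by the largest value."""
    B = np.max(b)
    return B + np.log(np.sum(np.exp(b - B)))

print(logsumexp(np.array([-120.0, -120.0])))   # -120 + log 2, as in the example above

b = np.array([-1200.0, -1200.0])               # negative enough that exp() underflows
print(np.log(np.sum(np.exp(b))))               # naive way: -inf (underflow)
print(logsumexp(b))                            # stable: -1200 + log 2
```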