IFT 6268 review
Roland Memisevic
January 14, 2015


Probabilities in AI

- Probabilities allow us to be explicit about uncertainties: instead of representing values, we can "keep all options open" by defining a distribution over alternatives.
- Example: instead of setting x = 4, define all of p(x = 1), p(x = 2), p(x = 3), p(x = 4), p(x = 5).
- Benefits:
  1. Robustness (let modules tell each other their whole state of knowledge)
  2. A measure of uncertainty ("error bars")
  3. Multimodality (keep ambiguities around)
- We can still express x = 4 as a special case.

Random variable: "not random, not a variable"

- The only relevant property of a random variable is its distribution.
- p(x) is a distribution if p(x) >= 0 and \sum_x p(x) = 1.
- Notational quirks:
  - The symbol p can be heavily overloaded. The argument decides. For example, in "p(x, y) = p(x)p(y)" each p means something different!
  - Sometimes we write X for the RV and x for the values it can take on.
  - Another common notation is p(X = x).
  - \sum_x refers to the sum over all values that x can take on.
- For continuous x, replace \sum by \int (up to some measure-theoretic "glitches" that we usually ignore in practice).
- Some prefer to use the term "density" or "probability density function (pdf)" to refer to continuous p(.).

Some useful distributions (1d)

Discrete:
- Bernoulli: p(x) = p^x (1 - p)^{1 - x}, where x is 0 or 1.
- Discrete distribution (also known as "multinoulli").
- Binomial, Multinomial: sums over Bernoulli/Discrete draws. (Sometimes "multinomial" is used to refer to a discrete distribution, too...)
- Poisson: p(k) = \lambda^k \exp(-\lambda) / k!
- ...

Continuous:
- Uniform.
- Gaussian (1d): p(x) = (2\pi\sigma^2)^{-1/2} \exp(-\frac{1}{2\sigma^2} (x - \mu)^2)
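The pmf/pdf formulas above translate directly into code. A minimal numpy sketch (the parameter values are made up for illustration) that evaluates them and checks normalization:

```python
import math
import numpy as np

def bernoulli(x, p):
    """Bernoulli pmf: p^x (1 - p)^(1 - x), for x in {0, 1}."""
    return p**x * (1.0 - p)**(1 - x)

def poisson(k, lam):
    """Poisson pmf: lam^k exp(-lam) / k!"""
    return lam**k * math.exp(-lam) / math.factorial(k)

def gaussian(x, mu, sigma2):
    """1d Gaussian pdf: (2 pi sigma^2)^(-1/2) exp(-(x - mu)^2 / (2 sigma^2))."""
    return np.exp(-0.5 * (x - mu)**2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

print(bernoulli(0, 0.3) + bernoulli(1, 0.3))     # sums to 1 over {0, 1}
print(sum(poisson(k, 4.0) for k in range(50)))   # ~1 over k = 0, 1, ...
xs = np.linspace(-10.0, 10.0, 10001)
print(np.trapz(gaussian(xs, 0.0, 1.0), xs))      # integrates to ~1
```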

How to represent discrete values

- A very useful way to represent a variable that takes one out of K values: as a K-vector with (K - 1) 0's, and one 1 at position k:

    x = (0, ..., 0, 1, 0, ..., 0)^T

- This is known as one-of-K encoding, one-hot encoding, or orthogonal encoding.
- Using a one-hot encoding allows us to write the discrete distribution compactly as

    p(x) = \prod_k \mu_k^{x_k}

  where \mu_k is the probability for state k.
- This can greatly simplify calculations (see below).
- Note that we can interpret x itself as a probability distribution.
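A small numpy illustration (the probabilities mu are made up) of why this product form is convenient: factors with x_k = 0 contribute 1, so the product picks out exactly the probability of the active state, and the log-probability becomes a dot product.

```python
import numpy as np

K, k = 5, 3
x = np.zeros(K)
x[k] = 1.0                                   # one-hot: (0, 0, 0, 1, 0)

mu = np.array([0.1, 0.2, 0.3, 0.25, 0.15])   # state probabilities, sum to 1

p = np.prod(mu**x)                           # p(x) = prod_k mu_k^{x_k}
print(p, mu[k])                              # both 0.25

print(x @ np.log(mu))                        # log p(x) = sum_k x_k log mu_k
```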

Summarizing properties

- Any relevant properties of RVs are just properties of their distributions.
- Mean:

    \mu = \sum_x p(x) x = E[x]

- Variance:

    \sigma^2 = \sum_x p(x) (x - \mu)^2 = E[(x - \mu)^2]

- (Standard deviation: \sigma = \sqrt{\sigma^2})

The Gaussian (1d)

    p(x) = N(x | \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp(-\frac{1}{2\sigma^2} (x - \mu)^2)
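To make the definitions concrete, a quick numpy sketch (with a made-up distribution over the values 1..5) that computes the mean and variance both from the formulas and by Monte-Carlo sampling:

```python
import numpy as np

vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
p    = np.array([0.1, 0.2, 0.4, 0.2, 0.1])     # a made-up distribution

mu     = np.sum(p * vals)                      # mean: sum_x p(x) x
sigma2 = np.sum(p * (vals - mu)**2)            # variance: sum_x p(x)(x - mu)^2
print(mu, sigma2, np.sqrt(sigma2))             # 3.0, 1.2, ~1.095

rng = np.random.default_rng(0)
samples = rng.choice(vals, size=100000, p=p)   # sample from the distribution
print(samples.mean(), samples.var())           # close to 3.0 and 1.2
```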

Multiple variables

- Everything one may want to know about a random vector can be derived from the joint distribution.
- Likewise, for a vector x we can write p(x) >= 0 and \sum_x p(x) = 1.
- For discrete RVs, the joint is a table (or a higher-dimensional array).
- Everything else stays the same.

Conditionals, marginals

- The joint distribution p(x, y) of two variables x and y also satisfies p(x, y) >= 0 and \sum_{x,y} p(x, y) = 1.
- Marginal distributions:

    p(x) = \sum_y p(x, y)    and    p(y) = \sum_x p(x, y)

- Imagine collapsing tables.
- Conditional distributions:

    p(y | x) = p(x, y) / p(x)    and    p(x | y) = p(x, y) / p(y)

- Think of the conditional as a family of distributions, "indexed" by the conditioning variable. (We could also write p(y | x) as p_x(y).)
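For discrete variables all of this is just table arithmetic; a small numpy sketch with a made-up 2x3 joint:

```python
import numpy as np

# A made-up joint distribution: rows index x (2 values), columns index y
# (3 values); the entries are nonnegative and sum to 1.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

# Marginals: "collapse" the table by summing out the other variable.
p_x = p_xy.sum(axis=1)                 # p(x) = sum_y p(x, y)
p_y = p_xy.sum(axis=0)                 # p(y) = sum_x p(x, y)
print(p_x, p_y)

# Conditionals: one distribution over y per value of x.
p_y_given_x = p_xy / p_x[:, None]      # p(y | x) = p(x, y) / p(x)
print(p_y_given_x.sum(axis=1))         # each row sums to 1
```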

Summarizing properties, correlation

- Mean:

    \mu = \sum_x p(x) x = E[x]

- Covariance:

    cov(x_i, x_j) = E[(x_i - \mu_i)(x_j - \mu_j)]

- Covariance matrix \Sigma, with \Sigma_{ij} = cov(x_i, x_j):

    \Sigma = \sum_x p(x) (x - \mu)(x - \mu)^T

- The correlation coefficient:

    corr(x_i, x_j) = cov(x_i, x_j) / \sqrt{\sigma_i^2 \sigma_j^2}

- Two variables for which the covariance is zero are called uncorrelated.

Correlation example

[Figure: two scatter plots, labeled "uncorrelated" and "correlated"]
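A numpy sketch estimating the covariance matrix and correlation coefficient from samples (the toy data and coefficients are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=10000)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=10000)   # correlated with x1
X = np.stack([x1, x2], axis=1)                 # one row per sample

mu = X.mean(axis=0)
centered = X - mu
Sigma = centered.T @ centered / len(X)         # sample covariance matrix
corr = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])

print(Sigma)                    # close to [[1.0, 0.8], [0.8, 1.0]]
print(corr)                     # close to 0.8
print(np.cov(X.T, bias=True))   # numpy's built-in estimate agrees
```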

The multivariate Gaussian

    p(x) = (2\pi)^{-D/2} |\Sigma|^{-1/2} \exp(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu))

A fundamental formula

    p(x | y) p(y) = p(x, y) = p(y | x) p(x)

- This can be generalized to more variables (the "chain rule of probability").
- A special case is Bayes' rule:

    p(x | y) = p(y | x) p(x) / p(y)
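A direct numpy transcription of the multivariate Gaussian density (with made-up mu and Sigma), including a 1d sanity check against the earlier formula:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density, straight from the formula above."""
    D = len(mu)
    diff = x - mu
    norm = (2 * np.pi)**(-D / 2) * np.linalg.det(Sigma)**(-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mvn_pdf(np.array([0.5, 0.5]), mu, Sigma))

# For D = 1 it must agree with the 1d Gaussian:
print(mvn_pdf(np.array([0.3]), np.array([0.0]), np.array([[1.0]])))
print(np.exp(-0.5 * 0.3**2) / np.sqrt(2 * np.pi))
```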

Independence and conditional independence

- Two RVs are called independent if

    p(x, y) = p(x) p(y)

- This captures our intuition of "dependence". In particular, note that this definition implies p(y | x) = p(y).
- Independence implies uncorrelatedness, but not vice versa!
- Related: two RVs are called conditionally independent, given a third variable z, if

    p(x, y | z) = p(x | z) p(y | z)

- (Note that these concepts are just a property of the joint.)

Independence is useful

- Say we have some variables x_1, x_2, ..., x_K.
- Even just defining their joint (let alone doing computations with it) is hopeless for large K!
- But what if all the x_i are independent? Then we need to specify just K probabilities, because the joint is the product.
- A more sophisticated version of this idea, using conditional independence, is the basis for the area of graphical models.
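The classic counterexample behind "uncorrelated does not imply independent" (not from the slides) is y = x^2 with x symmetric around zero; a quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.choice([-1.0, 0.0, 1.0], size=100000)   # symmetric around 0
y = x**2                                        # deterministic function of x

cov = np.mean(x * y) - np.mean(x) * np.mean(y)
print(cov)                                      # ~0: uncorrelated

# But not independent: p(x=1, y=0) = 0, while p(x=1) p(y=0) > 0.
print(np.mean((x == 1) & (y == 0)))             # ~0.0
print(np.mean(x == 1) * np.mean(y == 0))        # ~1/9
```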

Maximum likelihood

- Task: given a set of data points (x_1, ..., x_N), build a model of the data-generating process.
- Approach: fit a parametric distribution p(x; w) with some parameters w to the data.
- How? Maximize the probability of "seeing" the data under your model!
- This is easy if examples are independent and identically distributed ("iid"):

    p(x_1, ..., x_N; w) = \prod_i p(x_i; w)

  (another useful property of independence).
- Instead of maximizing probability, we may maximize log-probability, because the log function is monotonic. So we may maximize:

    L(w) := log \prod_i p(x_i; w) = \sum_i log p(x_i; w)

- Thus each example x_i contributes an additive component to the objective.

Gaussian example

- What is the ML estimate of the mean of a Gaussian? We need to maximize

    L(\mu) = \sum_i log p(x_i; \mu) = -\frac{1}{2\sigma^2} \sum_i (x_i - \mu)^2 - const.

- The derivative is:

    \frac{\partial L(\mu)}{\partial \mu} = \frac{1}{\sigma^2} \sum_i (x_i - \mu) = \frac{1}{\sigma^2} (\sum_i x_i - N\mu)

- By setting it to zero, we get:

    \mu = \frac{1}{N} \sum_i x_i
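A numerical confirmation of this derivation (synthetic data; the true mean is made up): the grid maximizer of the log-likelihood coincides with the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.5, scale=1.0, size=1000)    # true mean is 2.5

mu_ml = data.mean()                                 # closed-form ML estimate

# Brute-force check: maximize L(mu) = -1/(2 sigma^2) sum_i (x_i - mu)^2
# (constants dropped) over a grid of candidate means.
grid = np.linspace(0.0, 5.0, 2001)
loglik = np.array([-0.5 * np.sum((data - m)**2) for m in grid])
print(mu_ml, grid[np.argmax(loglik)])               # both close to 2.5
```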

Linear regression

[Figure: scatter plot of training pairs, x -> t]

- Given two real-valued observations x and t, learn to predict t from x.
- This is a supervised learning problem.

Linear regression

- We can define linear regression as a probabilistic model, if we make the following assumption:

    t = y(x, w) + \epsilon

- In words, we assume there is a true, underlying function y(x, w), and the function values we observe are corrupted by additive Gaussian noise \epsilon.
- Thus

    p(t | x; w) = N(t | y(x, w), \sigma^2)

Noise vs. dependencies we don't care about

- Actually, linear regression can work fine also with highly non-Gaussian noise.

Linear regression

- To fit the conditional Gaussian, given training data D = {(x_n, t_n)}_{n=1}^N, we make the iid assumption and get:

    p(D) = \prod_n N(t_n | y(x_n, w), \sigma^2)

- Using monotonicity of the log, we may again maximize the log-probability (or minimize its negation):

    minimize \sum_{n=1}^N (t_n - w^T x_n)^2 + const.

Least squares

- To optimize with respect to w, we differentiate:

    \frac{\partial E}{\partial w} = -\sum_{n=1}^N (t_n - w^T x_n) x_n^T

- Setting the derivative to zero:

    0 = -\sum_{n=1}^N t_n x_n^T + w^T \sum_{n=1}^N x_n x_n^T

  yields the solution

    w = (\sum_n x_n x_n^T)^{-1} \sum_n t_n x_n

- (It can be instructional to write down the case for 1-d inputs, if this confuses you.)

Least squares

- We can write this more compactly with the following definitions:

    t = (t_1, ..., t_N)^T,    X = matrix with rows x_1^T, ..., x_N^T

- This allows us to write the solution as

    w = (X^T X)^{-1} X^T t

  "The normal equations."
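A runnable numpy version of the normal equations on synthetic data (w_true and the noise level are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.normal(size=(N, D))              # one row per data point, as above
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true + 0.1 * rng.normal(size=N)

# Normal equations: w = (X^T X)^{-1} X^T t. Solving the linear system is
# numerically preferable to forming the inverse explicitly.
w = np.linalg.solve(X.T @ X, X.T @ t)
print(w)                                 # close to w_true

print(np.linalg.lstsq(X, t, rcond=None)[0])   # built-in solver agrees
```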

Linear classification

[Figure: x -> t]

- A prediction task, where the outputs t are discrete (that is, they can take on one of K values C_1, ..., C_K), is called classification.
- Like regression, this is a supervised learning problem.

(Multi-class) logistic regression

- Logistic regression defines a probabilistic model over classes given inputs as follows:

    p(C_k | x) = exp(w_k^T x) / \sum_{j=1}^K exp(w_j^T x)

  where w_1, ..., w_K are parameters.
- The exp-function ensures positivity, and the normalization ensures that the outputs sum to one.
- (In practice, one usually adds constant "bias" terms inside the exp's.)

Multi-class logistic regression

- Represent discrete one-hot labels row-wise in a matrix T, like we did before for continuous vectors.
- The negative log-likelihood cost, assuming iid training data, can then be written

    E(W; D) = -log \prod_n p(t_n | x_n)
            = -log \prod_n \prod_k p(C_k | x_n)^{t_nk}
            = -\sum_n \sum_k t_nk log p(C_k | x_n)
            = -\sum_n (\sum_k t_nk w_k^T x_n - log \sum_{j=1}^K exp(w_j^T x_n))

- In contrast to linear regression, there is no closed-form solution for W.
- But one can use gradient-based optimization to minimize E(W; D) iteratively.
- The gradient with respect to each parameter vector w_k is

    \frac{\partial E(W; D)}{\partial w_k} = -\sum_n (t_nk - \frac{exp(w_k^T x_n)}{\sum_j exp(w_j^T x_n)}) x_n = \sum_n (p(C_k | x_n) - t_nk) x_n

- It can be shown that E(W; D) is convex, so there are no local minima.
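A vectorized numpy sketch of this cost and gradient (the shapes and smoke-test data are made up; W stores the w_k as columns):

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax, with the row max subtracted for stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def nll_and_grad(W, X, T):
    """E(W; D) and its gradient. X: (N, D), T: (N, K) one-hot, W: (D, K)."""
    P = softmax(X @ W)              # p(C_k | x_n) for all n, k
    nll = -np.sum(T * np.log(P))    # -sum_n sum_k t_nk log p(C_k | x_n)
    grad = X.T @ (P - T)            # columns: sum_n (p(C_k|x_n) - t_nk) x_n
    return nll, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
T = np.eye(4)[rng.integers(0, 4, size=5)]   # 5 random one-hot labels, K = 4
W = np.zeros((3, 4))
print(nll_and_grad(W, X, T)[0])             # 5 * log(4) at W = 0
```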

Learning with stochastic gradient descent

    W^{(\tau+1)} = W^{(\tau)} - \eta \frac{\partial E_n}{\partial W}

- Here, \tau denotes the iteration number, and E_n is the cost contributed by the n-th training case (one term in the sum over n).
- Parameters are initialized to some random starting value W^{(0)}.
- \eta is called the learning rate, and it is typically set to a small real value (such as \eta = 0.001). It may be reduced as learning progresses.
- It is convenient to think of W as a vector, not a matrix, when doing learning. (Think of it in "vectorized" form: vec(W).)
- One could use the gradient of the whole sum instead, but that is often slower (because of redundancies in the data).
- Since the algorithm visits one training case at a time, it will jitter around an idealized "average path" towards the optimum.
- That's why it's called "stochastic" gradient descent.

[Figure: an SGD path jittering from W^(0) around the idealized average path towards the optimum]
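A minimal SGD loop for the logistic-regression cost above, assuming the model of the previous slides (the toy task, learning rate, and epoch count are made up):

```python
import numpy as np

def sgd(X, T, eta=0.05, epochs=50, seed=0):
    """Stochastic gradient descent, one training case per update."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    K = T.shape[1]
    W = 0.01 * rng.normal(size=(D, K))        # random starting value W^(0)
    for _ in range(epochs):
        for n in rng.permutation(N):          # visit cases in random order
            a = X[n] @ W
            p = np.exp(a - a.max())
            p /= p.sum()                      # p(C_k | x_n)
            grad_n = np.outer(X[n], p - T[n])    # gradient of E_n alone
            W -= eta * grad_n                 # W <- W - eta dE_n/dW
    return W

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
labels = (X[:, 0] > 0).astype(int)            # a linearly separable toy task
T = np.eye(2)[labels]
W = sgd(X, T)
print((np.argmax(X @ W, axis=1) == labels).mean())   # close to 1.0
```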

The "logsumexp" trick

- Expressions like

    exp(w_k^T x) / \sum_{j=1}^K exp(w_j^T x)

  are very common but highly unstable, because the exp's in the denominator can cause an under- or overflow.
- Never compute sums \sum_i exp(a_i) naively.
- Add a constant A to each argument in all exp's, so that even the largest argument is small; then undo the operation after computing the sum:

    logsumexp(a_1, ..., a_K) = log (\sum_i exp(a_i + A)) - A,    with A = -max_i a_i

- Many software packages supply a convenience function "logsumexp" for this purpose.

Random variables and information

- "Probabilities allow us to be explicit about uncertainty." So how can we measure uncertainty?
- Idea: define the information content

    log (1 / p(x)) = -log p(x)

  contained in a random event.
- The information content is additive for independent events.
- So if we use log_2 and fair coin tosses, then the information content is measured in bits and it exactly fits our intuition.
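The trick in code, with arguments large enough that the naive sum overflows:

```python
import numpy as np

def logsumexp(a):
    """log(sum_i exp(a_i)), computed stably by shifting all arguments."""
    A = -np.max(a)                       # A = -max_i a_i, as above
    return np.log(np.sum(np.exp(a + A))) - A

a = np.array([1000.0, 1001.0, 1002.0])
print(np.log(np.sum(np.exp(a))))         # naive: overflow -> inf
print(logsumexp(a))                      # stable: ~1002.41

# Stable log-domain softmax: log p(C_k | x) = a_k - logsumexp(a)
print(np.exp(a - logsumexp(a)))          # well-defined probabilities
```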

Entropy

- To measure uncertainty, we define the entropy

    H(X) = -\sum_x p(x) log p(x)

  which is the average information content.
- For a continuous RV:

    H(X) = -\int_x p(x) log p(x) dx

- The more uniform, the more uncertain. The more "peaky", the more certain.
- Question: which probability distribution has maximum entropy, given mean and (co)variance(s)?

KL divergence

- Closely related to entropy is the Kullback-Leibler divergence (KL divergence, or "relative entropy"):

    KL(p || q) = \sum_x p(x) log (p(x) / q(x))

- The KL divergence measures the dissimilarity between two distributions.
- It is not symmetric.
- Maximum likelihood learning amounts to minimizing the KL divergence between the model distribution and the empirical distribution over the observed training data (exercise).
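Both definitions in numpy (the example distributions are made up; note the 0 log 0 = 0 convention):

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_x p(x) log p(x), in nats (use log2 for bits)."""
    p = p[p > 0]                          # convention: 0 log 0 = 0
    return -np.sum(p * np.log(p))

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m]))

uniform = np.full(4, 0.25)
peaky = np.array([0.97, 0.01, 0.01, 0.01])
print(entropy(uniform), entropy(peaky))   # uniform is the most uncertain

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
print(kl(p, q), kl(q, p))                 # not symmetric
```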

Mutual Information

- The mutual information between two random variables x, y with joint density p(x, y) is defined as

    MI(x, y) = \sum_{x,y} p(x, y) log ( p(x, y) / (p(x) p(y)) )

- It is the KL divergence between p(x, y) and the joint of two perfectly independent random variables (with marginals p(x) and p(y)).
- Thus, MI measures the dependence between x and y.
- It is nonnegative, and it is zero iff x and y are independent, in other words iff p(x, y) = p(x) p(y).

Frequentist – Bayesian

- Probability theory tells us how to calculate with probabilities.
- As scientists, we may ask how to interpret a probabilistic expression, such as

    p(x = 1) = 0.7

- There are two common interpretations:
  1. Frequentist: "The relative frequency of x in a (possibly infinite) population of trials is 0.7."
  2. Bayesian: "I believe that x is 1 with certainty 0.7."
- The Bayesian view used to be contentious because it is less intuitive. But it gives us the freedom to turn model parameters into random variables. And it is now an established view in ML.
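Mutual information from a joint table, reusing the KL idea (the joints are made up to show the two extreme cases):

```python
import numpy as np

def mutual_information(p_xy):
    """MI = KL between the joint and the product of its marginals."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    m = p_xy > 0
    return np.sum(p_xy[m] * np.log(p_xy[m] / (p_x @ p_y)[m]))

# Independent joint (outer product of marginals): MI = 0.
print(mutual_information(np.outer([0.4, 0.6], [0.3, 0.7])))   # 0.0

# Perfectly dependent joint (y = x): MI = H(x) = log 2.
print(mutual_information(np.diag([0.5, 0.5])))                # ~0.693
```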

Reading

- A good introduction to most of the concepts discussed in this class can be found in:
  Pattern Recognition and Machine Learning. C. Bishop. Springer, 2006.
- Most illustrations in this presentation are from that book.

The "vision equation"

- The purpose of vision: infer world properties (or hidden "causes"), s, from an image, I.
- We can express this with an analysis, or encoder, or inference, equation:

    s = f(I)

- Learning then amounts to estimating the parameters of f from image data.

Latent variables and generative models

- In practice, it is often much easier to write down how images get formed, given the causes.
- This leads to the synthesis, or decoder, equation:

    I = g(s)

  which describes how images depend on the state of the world.
- s is called a "latent variable" or "hidden variable", because unlike the image, I, we do not observe it.

Latent variables and generative models

- To incorporate ambiguities and uncertainties, we can re-phrase this equation as a conditional probability:

    I ~ p(I | s)

- This allows us to define analysis using Bayes' rule:

    p(s | I) = p(I | s) p(s) / p(I)

- Thus, analysis requires a prior, p(s), over the latent variables.
- And inference amounts to updating our prior belief based on data, I, to arrive at the posterior distribution p(s | I).

Latent variables and generative models

- For maximum likelihood learning, we need to marginalize over s:

    p(I) = \sum_s p(I | s) p(s)

- How are f and g, or the probabilities, defined in practice?
- There is a wide variety of possibilities, and we will cover a sample of these in this course.
- All models involve constraining the "capacity" of s, to force the learned representation to be meaningful.

Natural images are not random

[Figure: Venn diagram, "All natural images" as a small subset of "All images"]

- Constraining the capacity of s forces the learner to "zoom in" on where the data is.