IFT 6268 Probability Review
Roland Memisevic
January 14, 2015

Probabilities in AI
• Probabilities allow us to be explicit about uncertainties: instead of representing single values, we can "keep all options open" by defining a distribution over alternatives.
• Example: instead of setting x = 4, define all of p(x = 1), p(x = 2), p(x = 3), p(x = 4), p(x = 5).
• Benefits:
  1. Robustness (let modules tell each other their whole state of knowledge)
  2. Measure of uncertainty ("errorbars")
  3. Multimodality (keep ambiguities around)
• We can still express x = 4 as a special case.

Random variable: "not random, not a variable"
• The only relevant property of a random variable is its distribution.
• p(x) is a distribution if $p(x) \ge 0$ and $\sum_x p(x) = 1$.
• Notational quirks:
  • The symbol p can be heavily overloaded; the argument decides. For example, in "p(x, y) = p(x)p(y)" each p means something different!
  • Sometimes we write X for the RV and x for the values it can take on. Another common notation is p(X = x).
  • $\sum_x$ refers to the sum over all values that x can take on.
  • For continuous x, replace $\sum$ by $\int$ (up to some measure-theoretic "glitches" that we usually ignore in practice).
  • Some prefer the term "density" or "probability density function (pdf)" for continuous p(·).

Some useful distributions (1d)
• Discrete:
  • Bernoulli: $p^x (1-p)^{1-x}$, where x is 0 or 1.
  • Discrete distribution (also known as "multinoulli").
  • Binomial, Multinomial: sum over Bernoulli/Discrete. (Sometimes "multinomial" is used to refer to a discrete distribution, too.)
  • Poisson: $p(k) = \frac{\lambda^k \exp(-\lambda)}{k!}$
  • ...
• Continuous:
  • Uniform.
  • Gaussian (1d): $p(x) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)$

How to represent discrete values
• A very useful way to represent a variable that takes one out of K values: as a K-vector with (K - 1) 0's and a single 1 at position k.
• Using a one-hot encoding allows us to write the discrete distribution compactly as
  $p(x) = \prod_k \mu_k^{x_k}$,
  where $\mu_k$ is the probability of state k.
• This can greatly simplify calculations (see below).
• This is known as one-of-K encoding, one-hot encoding, or orthogonal encoding.
• Note that we can interpret x itself as a probability distribution.

Summarizing properties
• Any relevant properties of RVs are just properties of their distributions.
• Mean: $\mu = \sum_x p(x)\, x = E[x]$
• Variance: $\sigma^2 = \sum_x p(x)(x - \mu)^2 = E[(x - \mu)^2]$
• (Standard deviation: $\sigma = \sqrt{\sigma^2}$)

The Gaussian (1d)
$p(x) = \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$

Multiple variables
• Everything one may want to know about a random vector can be derived from the joint distribution.
• The joint distribution p(x, y) of two variables x and y also satisfies $p(x, y) \ge 0$ and $\sum_{x,y} p(x, y) = 1$.
• Likewise, for a vector x we can write $p(x) \ge 0$ and $\sum_x p(x) = 1$.
• For discrete RVs, the joint is a table (or a higher-dimensional array).
• Everything else stays the same.

Conditionals, marginals
• Marginal distributions:
  $p(x) = \sum_y p(x, y)$ and $p(y) = \sum_x p(x, y)$
• Imagine collapsing tables.
• Conditional distributions:
  $p(y \mid x) = \frac{p(x, y)}{p(x)}$ and $p(x \mid y) = \frac{p(x, y)}{p(y)}$
• Think of the conditional as a family of distributions, "indexed" by the conditioning variable. (We could also write $p(y \mid x)$ as $p_x(y)$.)
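To make "collapsing tables" concrete, here is a minimal NumPy sketch (not from the slides; the 2x3 joint table and the names `joint`, `p_x`, `p_y_given_x` are illustrative choices) that computes marginals and a conditional from a discrete joint distribution:

```python
import numpy as np

# Joint distribution p(x, y) over x in {0, 1} and y in {0, 1, 2},
# stored as a table (rows: values of x, columns: values of y).
# Entries are nonnegative and sum to 1.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])

# Marginals: "collapse" the table by summing out the other variable.
p_x = joint.sum(axis=1)   # p(x) = sum_y p(x, y)
p_y = joint.sum(axis=0)   # p(y) = sum_x p(x, y)

# Conditional p(y | x): a family of distributions, one row per value of x.
p_y_given_x = joint / p_x[:, None]

print(p_x, p_y)                  # the two marginals
print(p_y_given_x.sum(axis=1))   # each conditional distribution sums to 1
```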
Summarizing properties, correlation
• Mean: $\mu = \sum_x p(x)\, x = E[x]$
• Covariance: $\mathrm{cov}(x_i, x_j) = E[(x_i - \mu_i)(x_j - \mu_j)]$
• Covariance matrix: $\Sigma_{ij} = \mathrm{cov}(x_i, x_j)$, i.e. $\Sigma = \sum_x p(x)(x - \mu)(x - \mu)^T$
• The correlation coefficient: $\mathrm{corr}(x_i, x_j) = \frac{\mathrm{cov}(x_i, x_j)}{\sqrt{\sigma_i^2 \sigma_j^2}}$
• Two variables for which the covariance is zero are called uncorrelated.

Correlation example
[Figure: scatter plots of an uncorrelated pair and a correlated pair of variables.]

The multivariate Gaussian
$p(x) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$

A fundamental formula
$p(x \mid y)\, p(y) = p(x, y) = p(y \mid x)\, p(x)$
• This can be generalized to more variables ("chain rule of probability").
• A special case is Bayes' rule:
  $p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}$

Independence and conditional independence
• Two RVs are called independent if $p(x, y) = p(x)\, p(y)$.
• This captures our intuition of "dependence". In particular, note that the definition implies $p(y \mid x) = p(y)$.
• Independence implies uncorrelatedness, but not vice versa!
• Related: two RVs are called conditionally independent, given a third variable z, if $p(x, y \mid z) = p(x \mid z)\, p(y \mid z)$.
• (Note that these concepts are just a property of the joint.)

Independence is useful
• Say we have some variables $x_1, x_2, \ldots, x_K$.
• Even just defining their joint (let alone doing computations with it) is hopeless for large K!
• But what if all the $x_i$ are independent?
• Then we need to specify just K probabilities, because the joint is the product.
• A more sophisticated version of this idea, using conditional independence, is the basis for the area of graphical models.

Maximum likelihood
• Task: given a set of data points $(x_1, \ldots, x_N)$, build a model of the data-generating process.
• Approach: fit a parametric distribution $p(x; w)$ with some parameters w to the data.
• How? Maximize the probability of "seeing" the data under your model!
• Another useful property of independence: this is easy if the examples are independent and identically distributed ("iid"):
  $p(x_1, \ldots, x_N; w) = \prod_i p(x_i; w)$
• Instead of maximizing probability, we may maximize log-probability, because the log function is monotonic. So we may maximize
  $L(w) := \log \prod_i p(x_i; w) = \sum_i \log p(x_i; w)$
• Thus each example $x_i$ contributes an additive component to the objective.

Gaussian example
• What is the ML-estimate of the mean of a Gaussian? We need to maximize
  $L(\mu) = \sum_i \log p(x_i; \mu) = -\frac{1}{2\sigma^2} \sum_i (x_i - \mu)^2 - \text{const.}$
• The derivative is
  $\frac{\partial L(\mu)}{\partial \mu} = \frac{1}{\sigma^2} \sum_i (x_i - \mu) = \frac{1}{\sigma^2}\left(\sum_i x_i - N\mu\right)$
• Setting it to zero, we get
  $\mu = \frac{1}{N} \sum_i x_i$
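As a quick numerical sanity check of this result (a sketch under assumed parameters, not part of the slides), the ML estimate of the Gaussian mean is simply the sample average, and perturbing it can only lower the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # iid samples from a Gaussian

def gaussian_log_lik(mu, x, sigma=1.5):
    """Log-likelihood of the data under N(mu, sigma^2), up to a constant that does not depend on mu."""
    return -0.5 / sigma**2 * np.sum((x - mu) ** 2)

mu_ml = x.mean()   # closed-form ML estimate: mu = (1/N) * sum_i x_i
print(mu_ml)
print(gaussian_log_lik(mu_ml, x) >= gaussian_log_lik(mu_ml + 0.1, x))   # True
print(gaussian_log_lik(mu_ml, x) >= gaussian_log_lik(mu_ml - 0.1, x))   # True
```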
Linear regression
[Figure: training data; the task is to map an input x to a real-valued target t.]
• Given two real-valued observations x and t, learn to predict t from x.
• This is a supervised learning problem.
• We can define linear regression as a probabilistic model if we make the following assumption:
  $t = y(x, w) + \epsilon$
• In words, we assume there is a true, underlying function $y(x, w)$, and the function values we observe are corrupted by additive Gaussian noise.
• Thus
  $p(t \mid x; w) = \mathcal{N}(t \mid y(x, w), \sigma^2)$

Noise vs. dependencies we don't care about
• Actually, linear regression can also work fine with highly non-Gaussian noise.

Linear regression (continued)
• To fit the conditional Gaussian, given training data $\mathcal{D} = \{(x_n, t_n)\}_{n=1}^N$, we make the iid assumption and get
  $p(\mathcal{D}) = \prod_n \mathcal{N}(t_n \mid y(x_n, w), \sigma^2)$
• Using monotonicity of the log, we may again maximize the log-probability (or minimize its negation):
  minimize $\sum_{n=1}^N (t_n - w^T x_n)^2 + \text{const.}$

Least squares
• To optimize with respect to w, we differentiate:
  $\frac{\partial E}{\partial w} = -\sum_{n=1}^N (t_n - w^T x_n)\, x_n^T$
• Setting the derivative to zero,
  $0 = -\sum_{n=1}^N t_n x_n^T + w^T \sum_{n=1}^N x_n x_n^T,$
  yields the solution
  $w = \left(\sum_{n=1}^N x_n x_n^T\right)^{-1} \sum_{n=1}^N t_n x_n$
• We can write this more compactly with the definitions
  $t = (t_1, \ldots, t_N)^T$ and $X = (x_1^T; \ldots; x_N^T)$ (one row per input),
  which allow us to write the solution as
  $w = (X^T X)^{-1} X^T t$
  "The normal equations." (See the sketch below.)
• (It can be instructional to write down the case for 1-d inputs, if this confuses you.)

Linear classification
[Figure: training data; the task is to map an input x to a discrete label t.]
• A prediction task where the outputs t are discrete (that is, they can take on one of K values $\mathcal{C}_1, \ldots, \mathcal{C}_K$) is called classification.
• Like regression, this is a supervised learning problem.

(Multi-class) logistic regression
• Logistic regression defines a probabilistic model over classes given inputs as follows:
  $p(\mathcal{C}_k \mid x_n) = \frac{\exp(w_k^T x_n)}{\sum_{j=1}^K \exp(w_j^T x_n)}$
  where $w_1, \ldots, w_K$ are parameters.
• The exp-function ensures positivity, and the normalization ensures that the outputs sum to one.
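To make the normal equations concrete, here is a minimal NumPy sketch (illustrative only; the synthetic data, the noise level, and the choice of `np.linalg.solve` instead of an explicit matrix inverse are my assumptions, not the slides'):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data under the assumed linear-Gaussian model: t = w_true^T x + noise.
N, D = 100, 3
w_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(N, D))                      # one row per input x_n
t = X @ w_true + rng.normal(scale=0.1, size=N)   # targets with additive Gaussian noise

# Normal equations: w = (X^T X)^{-1} X^T t.
# Solving the linear system is numerically preferable to forming the inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ t)

print(w_hat)   # should be close to w_true
```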
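Similarly, a small sketch of the multi-class logistic regression model itself, i.e. the softmax over linear scores (the weights here are random placeholders; fitting them by maximum likelihood is not shown):

```python
import numpy as np

def class_probabilities(W, x):
    """p(C_k | x) = exp(w_k^T x) / sum_j exp(w_j^T x), computed in a numerically stable way."""
    scores = W @ x                      # one linear score w_k^T x per class
    scores -= scores.max()              # subtract the max for numerical stability
    unnormalized = np.exp(scores)       # exp ensures positivity
    return unnormalized / unnormalized.sum()   # normalization makes the outputs sum to one

rng = np.random.default_rng(0)
K, D = 4, 3                             # 4 classes, 3-dimensional inputs
W = rng.normal(size=(K, D))             # placeholder parameters w_1, ..., w_K
x = rng.normal(size=D)

p = class_probabilities(W, x)
print(p, p.sum())                       # nonnegative probabilities that sum to 1
```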
