IFT 6268 review
Roland Memisevic
January 14, 2015


Probabilities in AI

- Probabilities allow us to be explicit about uncertainties: instead of representing values, we can "keep all options open" by defining a distribution over alternatives.
- Example: instead of setting x = 4, define all of p(x = 1), p(x = 2), p(x = 3), p(x = 4), p(x = 5).
- Benefits:
  1. Robustness (let modules tell each other their whole state of knowledge)
  2. A measure of uncertainty ("error bars")
  3. Multimodality (keep ambiguities around)
- We can still express x = 4 as a special case.

Random variable: "not random, not a variable"

- The only relevant property of a random variable is its distribution.
- p(x) is a distribution if p(x) >= 0 and \sum_x p(x) = 1.
- Notational quirks:
  - The symbol p can be heavily overloaded. The argument decides. For example, in "p(x, y) = p(x)p(y)" each p means something different!
  - Sometimes we write X for the RV and x for the values it can take on.
  - Another common notation is p(X = x).
  - \sum_x refers to the sum over all values that x can take on.
- For continuous x, replace \sum by \int (up to some measure-theoretic "glitches" that we usually ignore in practice).
- Some prefer to use the term "density" or "probability density function (pdf)" to refer to continuous p(.).

Some useful distributions (1d)

Discrete:
- Bernoulli: p(x) = p^x (1 - p)^{1 - x}, where x is 0 or 1.
- Discrete distribution (also known as "multinoulli").
- Binomial, Multinomial: sums over Bernoulli/Discrete draws. (Sometimes "multinomial" is used to refer to a discrete distribution, too...)
- Poisson: p(k) = \lambda^k \exp(-\lambda) / k!
- ...

Continuous:
- Uniform.
- Gaussian (1d): p(x) = (2\pi\sigma^2)^{-1/2} \exp(-\frac{1}{2\sigma^2} (x - \mu)^2)
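The pmf/pdf formulas above translate directly into code. A minimal numpy sketch (the parameter values are made up for illustration) that evaluates them and checks normalization:

```python
import math
import numpy as np

def bernoulli(x, p):
    """Bernoulli pmf: p^x (1 - p)^(1 - x), for x in {0, 1}."""
    return p**x * (1.0 - p)**(1 - x)

def poisson(k, lam):
    """Poisson pmf: lam^k exp(-lam) / k!"""
    return lam**k * math.exp(-lam) / math.factorial(k)

def gaussian(x, mu, sigma2):
    """1d Gaussian pdf: (2 pi sigma^2)^(-1/2) exp(-(x - mu)^2 / (2 sigma^2))."""
    return np.exp(-0.5 * (x - mu)**2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

print(bernoulli(0, 0.3) + bernoulli(1, 0.3))     # sums to 1 over {0, 1}
print(sum(poisson(k, 4.0) for k in range(50)))   # ~1 over k = 0, 1, ...
xs = np.linspace(-10.0, 10.0, 10001)
print(np.trapz(gaussian(xs, 0.0, 1.0), xs))      # integrates to ~1
```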

How to represent discrete values

- A very useful way to represent a variable that takes one out of K values: as a K-vector with (K - 1) 0's, and one 1 at position k:

    x = (0, ..., 0, 1, 0, ..., 0)^T

- This is known as one-of-K encoding, one-hot encoding, or orthogonal encoding.
- Using a one-hot encoding allows us to write the discrete distribution compactly as

    p(x) = \prod_k \mu_k^{x_k}

  where \mu_k is the probability for state k.
- This can greatly simplify calculations (see below).
- Note that we can interpret x itself as a probability distribution.
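A small numpy illustration (the probabilities mu are made up) of why this product form is convenient: factors with x_k = 0 contribute 1, so the product picks out exactly the probability of the active state, and the log-probability becomes a dot product.

```python
import numpy as np

K, k = 5, 3
x = np.zeros(K)
x[k] = 1.0                                   # one-hot: (0, 0, 0, 1, 0)

mu = np.array([0.1, 0.2, 0.3, 0.25, 0.15])   # state probabilities, sum to 1

p = np.prod(mu**x)                           # p(x) = prod_k mu_k^{x_k}
print(p, mu[k])                              # both 0.25

print(x @ np.log(mu))                        # log p(x) = sum_k x_k log mu_k
```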

Summarizing properties

- Any relevant properties of RVs are just properties of their distributions.
- Mean:

    \mu = \sum_x p(x) x = E[x]

- Variance:

    \sigma^2 = \sum_x p(x) (x - \mu)^2 = E[(x - \mu)^2]

- (Standard deviation: \sigma = \sqrt{\sigma^2})

The Gaussian (1d)

    p(x) = N(x | \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp(-\frac{1}{2\sigma^2} (x - \mu)^2)
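To make the definitions concrete, a quick numpy sketch (with a made-up distribution over the values 1..5) that computes the mean and variance both from the formulas and by Monte-Carlo sampling:

```python
import numpy as np

vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
p    = np.array([0.1, 0.2, 0.4, 0.2, 0.1])     # a made-up distribution

mu     = np.sum(p * vals)                      # mean: sum_x p(x) x
sigma2 = np.sum(p * (vals - mu)**2)            # variance: sum_x p(x)(x - mu)^2
print(mu, sigma2, np.sqrt(sigma2))             # 3.0, 1.2, ~1.095

rng = np.random.default_rng(0)
samples = rng.choice(vals, size=100000, p=p)   # sample from the distribution
print(samples.mean(), samples.var())           # close to 3.0 and 1.2
```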

Multiple variables

- Everything one may want to know about a random vector can be derived from the joint distribution.
- Likewise, for a vector x we can write p(x) >= 0 and \sum_x p(x) = 1.
- For discrete RVs, the joint is a table (or a higher-dimensional array).
- Everything else stays the same.

Conditionals, marginals

- The joint distribution p(x, y) of two variables x and y also satisfies p(x, y) >= 0 and \sum_{x,y} p(x, y) = 1.
- Marginal distributions:

    p(x) = \sum_y p(x, y)    and    p(y) = \sum_x p(x, y)

- Imagine collapsing tables.
- Conditional distributions:

    p(y | x) = p(x, y) / p(x)    and    p(x | y) = p(x, y) / p(y)

- Think of the conditional as a family of distributions, "indexed" by the conditioning variable. (We could also write p(y | x) as p_x(y).)
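For discrete variables all of this is just table arithmetic; a small numpy sketch with a made-up 2x3 joint:

```python
import numpy as np

# A made-up joint distribution: rows index x (2 values), columns index y
# (3 values); the entries are nonnegative and sum to 1.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

# Marginals: "collapse" the table by summing out the other variable.
p_x = p_xy.sum(axis=1)                 # p(x) = sum_y p(x, y)
p_y = p_xy.sum(axis=0)                 # p(y) = sum_x p(x, y)
print(p_x, p_y)

# Conditionals: one distribution over y per value of x.
p_y_given_x = p_xy / p_x[:, None]      # p(y | x) = p(x, y) / p(x)
print(p_y_given_x.sum(axis=1))         # each row sums to 1
```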

Summarizing properties, correlation

- Mean:

    \mu = \sum_x p(x) x = E[x]

- Covariance:

    cov(x_i, x_j) = E[(x_i - \mu_i)(x_j - \mu_j)]

- Covariance matrix \Sigma, with \Sigma_{ij} = cov(x_i, x_j):

    \Sigma = \sum_x p(x) (x - \mu)(x - \mu)^T

- The correlation coefficient:

    corr(x_i, x_j) = cov(x_i, x_j) / \sqrt{\sigma_i^2 \sigma_j^2}

- Two variables for which the covariance is zero are called uncorrelated.

Correlation example

[Figure: two scatter plots, labeled "uncorrelated" and "correlated"]
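A numpy sketch estimating the covariance matrix and correlation coefficient from samples (the toy data and coefficients are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=10000)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=10000)   # correlated with x1
X = np.stack([x1, x2], axis=1)                 # one row per sample

mu = X.mean(axis=0)
centered = X - mu
Sigma = centered.T @ centered / len(X)         # sample covariance matrix
corr = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])

print(Sigma)                    # close to [[1.0, 0.8], [0.8, 1.0]]
print(corr)                     # close to 0.8
print(np.cov(X.T, bias=True))   # numpy's built-in estimate agrees
```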

The multivariate Gaussian

    p(x) = (2\pi)^{-D/2} |\Sigma|^{-1/2} \exp(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu))

A fundamental formula

    p(x | y) p(y) = p(x, y) = p(y | x) p(x)

- This can be generalized to more variables (the "chain rule of probability").
- A special case is Bayes' rule:

    p(x | y) = p(y | x) p(x) / p(y)
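A direct numpy transcription of the multivariate Gaussian density (with made-up mu and Sigma), including a 1d sanity check against the earlier formula:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density, straight from the formula above."""
    D = len(mu)
    diff = x - mu
    norm = (2 * np.pi)**(-D / 2) * np.linalg.det(Sigma)**(-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mvn_pdf(np.array([0.5, 0.5]), mu, Sigma))

# For D = 1 it must agree with the 1d Gaussian:
print(mvn_pdf(np.array([0.3]), np.array([0.0]), np.array([[1.0]])))
print(np.exp(-0.5 * 0.3**2) / np.sqrt(2 * np.pi))
```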

Independence and conditional independence

- Two RVs are called independent if

    p(x, y) = p(x) p(y)

- This captures our intuition of "dependence". In particular, note that this definition implies p(y | x) = p(y).
- Independence implies uncorrelatedness, but not vice versa!
- Related: two RVs are called conditionally independent, given a third variable z, if

    p(x, y | z) = p(x | z) p(y | z)

- (Note that these concepts are just a property of the joint.)

Independence is useful

- Say we have some variables x_1, x_2, ..., x_K.
- Even just defining their joint (let alone doing computations with it) is hopeless for large K!
- But what if all the x_i are independent? Then we need to specify just K probabilities, because the joint is the product.
- A more sophisticated version of this idea, using conditional independence, is the basis for the area of graphical models.
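The classic counterexample behind "uncorrelated does not imply independent" (not from the slides) is y = x^2 with x symmetric around zero; a quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.choice([-1.0, 0.0, 1.0], size=100000)   # symmetric around 0
y = x**2                                        # deterministic function of x

cov = np.mean(x * y) - np.mean(x) * np.mean(y)
print(cov)                                      # ~0: uncorrelated

# But not independent: p(x=1, y=0) = 0, while p(x=1) p(y=0) > 0.
print(np.mean((x == 1) & (y == 0)))             # ~0.0
print(np.mean(x == 1) * np.mean(y == 0))        # ~1/9
```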

Maximum likelihood

- Task: given a set of data points (x_1, ..., x_N), build a model of the data-generating process.
- Approach: fit a parametric distribution p(x; w) with some parameters w to the data.
- How? Maximize the probability of "seeing" the data under your model!
- This is easy if examples are independent and identically distributed ("iid"):

    p(x_1, ..., x_N; w) = \prod_i p(x_i; w)

  (another useful property of independence).
- Instead of maximizing probability, we may maximize log-probability, because the log function is monotonic. So we may maximize:

    L(w) := log \prod_i p(x_i; w) = \sum_i log p(x_i; w)

- Thus each example x_i contributes an additive component to the objective.

Gaussian example

- What is the ML estimate of the mean of a Gaussian? We need to maximize

    L(\mu) = \sum_i log p(x_i; \mu) = -\frac{1}{2\sigma^2} \sum_i (x_i - \mu)^2 - const.

- The derivative is:

    \frac{\partial L(\mu)}{\partial \mu} = \frac{1}{\sigma^2} \sum_i (x_i - \mu) = \frac{1}{\sigma^2} (\sum_i x_i - N\mu)

- By setting it to zero, we get:

    \mu = \frac{1}{N} \sum_i x_i
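A numerical confirmation of this derivation (synthetic data; the true mean is made up): the grid maximizer of the log-likelihood coincides with the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.5, scale=1.0, size=1000)    # true mean is 2.5

mu_ml = data.mean()                                 # closed-form ML estimate

# Brute-force check: maximize L(mu) = -1/(2 sigma^2) sum_i (x_i - mu)^2
# (constants dropped) over a grid of candidate means.
grid = np.linspace(0.0, 5.0, 2001)
loglik = np.array([-0.5 * np.sum((data - m)**2) for m in grid])
print(mu_ml, grid[np.argmax(loglik)])               # both close to 2.5
```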

Linear regression

[Figure: scatter plot of training pairs, x -> t]

- Given two real-valued observations x and t, learn to predict t from x.
- This is a supervised learning problem.

Linear regression

- We can define linear regression as a probabilistic model, if we make the following assumption:

    t = y(x, w) + \epsilon

- In words, we assume there is a true, underlying function y(x, w), and the function values we observe are corrupted by additive Gaussian noise \epsilon.
- Thus

    p(t | x; w) = N(t | y(x, w), \sigma^2)

Noise vs. dependencies we don't care about

- Actually, linear regression can work fine also with highly non-Gaussian noise.

Linear regression

- To fit the conditional Gaussian, given training data D = {(x_n, t_n)}_{n=1}^N, we make the iid assumption and get:

    p(D) = \prod_n N(t_n | y(x_n, w), \sigma^2)

- Using monotonicity of the log, we may again maximize the log-probability (or minimize its negation):

    minimize \sum_{n=1}^N (t_n - w^T x_n)^2 + const.

Least squares

- To optimize with respect to w, we differentiate:

    \frac{\partial E}{\partial w} = -\sum_{n=1}^N (t_n - w^T x_n) x_n^T

- Setting the derivative to zero:

    0 = -\sum_{n=1}^N t_n x_n^T + w^T \sum_{n=1}^N x_n x_n^T

  yields the solution

    w = (\sum_n x_n x_n^T)^{-1} \sum_n t_n x_n

- (It can be instructional to write down the case for 1-d inputs, if this confuses you.)

Least squares

- We can write this more compactly with the following definitions:

    t = (t_1, ..., t_N)^T,    X = matrix with rows x_1^T, ..., x_N^T

- This allows us to write the solution as

    w = (X^T X)^{-1} X^T t

  "The normal equations."
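A runnable numpy version of the normal equations on synthetic data (w_true and the noise level are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.normal(size=(N, D))              # one row per data point, as above
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true + 0.1 * rng.normal(size=N)

# Normal equations: w = (X^T X)^{-1} X^T t. Solving the linear system is
# numerically preferable to forming the inverse explicitly.
w = np.linalg.solve(X.T @ X, X.T @ t)
print(w)                                 # close to w_true

print(np.linalg.lstsq(X, t, rcond=None)[0])   # built-in solver agrees
```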

Linear classification

[Figure: x -> t]

- A prediction task, where the outputs t are discrete (that is, they can take on one of K values C_1, ..., C_K), is called classification.
- Like regression, this is a supervised learning problem.

(Multi-class) logistic regression

- Logistic regression defines a probabilistic model over classes given inputs as follows:

    p(C_k | x) = exp(w_k^T x) / \sum_{j=1}^K exp(w_j^T x)

  where w_1, ..., w_K are parameters.
- The exp-function ensures positivity, and the normalization ensures that the outputs sum to one.
- (In practice, one usually adds constant "bias" terms inside the exp's.)

Multi-class logistic regression

- Represent discrete one-hot labels row-wise in a matrix T, like we did before for continuous vectors.
- The negative log-likelihood cost, assuming iid training data, can then be written

    E(W; D) = -log \prod_n p(t_n | x_n)
            = -log \prod_n \prod_k p(C_k | x_n)^{t_nk}
            = -\sum_n \sum_k t_nk log p(C_k | x_n)
            = -\sum_n (\sum_k t_nk w_k^T x_n - log \sum_{j=1}^K exp(w_j^T x_n))

- In contrast to linear regression, there is no closed-form solution for W.
- But one can use gradient-based optimization to minimize E(W; D) iteratively.
- The gradient with respect to each parameter vector w_k is

    \frac{\partial E(W; D)}{\partial w_k} = -\sum_n (t_nk - \frac{exp(w_k^T x_n)}{\sum_j exp(w_j^T x_n)}) x_n = \sum_n (p(C_k | x_n) - t_nk) x_n

- It can be shown that E(W; D) is convex, so there are no local minima.
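A vectorized numpy sketch of this cost and gradient (the shapes and smoke-test data are made up; W stores the w_k as columns):

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax, with the row max subtracted for stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def nll_and_grad(W, X, T):
    """E(W; D) and its gradient. X: (N, D), T: (N, K) one-hot, W: (D, K)."""
    P = softmax(X @ W)              # p(C_k | x_n) for all n, k
    nll = -np.sum(T * np.log(P))    # -sum_n sum_k t_nk log p(C_k | x_n)
    grad = X.T @ (P - T)            # columns: sum_n (p(C_k|x_n) - t_nk) x_n
    return nll, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
T = np.eye(4)[rng.integers(0, 4, size=5)]   # 5 random one-hot labels, K = 4
W = np.zeros((3, 4))
print(nll_and_grad(W, X, T)[0])             # 5 * log(4) at W = 0
```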

Learning with stochastic gradient descent

    W^{(\tau+1)} = W^{(\tau)} - \eta \frac{\partial E_n}{\partial W}

- Here, \tau denotes the iteration number, and E_n is the cost contributed by the n-th training case (one term in the sum over n).
- Parameters are initialized to some random starting value W^{(0)}.
- \eta is called the learning rate, and it is typically set to a small real value (such as \eta = 0.001). It may be reduced as learning progresses.
- It is convenient to think of W as a vector, not a matrix, when doing learning. (Think of it in "vectorized" form: vec(W).)
- One could use the gradient of the whole sum instead, but that is often slower (because of redundancies in the data).
- Since the algorithm visits one training case at a time, it will jitter around an idealized "average path" towards the optimum.
- That's why it's called "stochastic" gradient descent.

[Figure: an SGD path jittering from W^(0) around the idealized average path towards the optimum]
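A minimal SGD loop for the logistic-regression cost above, assuming the model of the previous slides (the toy task, learning rate, and epoch count are made up):

```python
import numpy as np

def sgd(X, T, eta=0.05, epochs=50, seed=0):
    """Stochastic gradient descent, one training case per update."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    K = T.shape[1]
    W = 0.01 * rng.normal(size=(D, K))        # random starting value W^(0)
    for _ in range(epochs):
        for n in rng.permutation(N):          # visit cases in random order
            a = X[n] @ W
            p = np.exp(a - a.max())
            p /= p.sum()                      # p(C_k | x_n)
            grad_n = np.outer(X[n], p - T[n])    # gradient of E_n alone
            W -= eta * grad_n                 # W <- W - eta dE_n/dW
    return W

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
labels = (X[:, 0] > 0).astype(int)            # a linearly separable toy task
T = np.eye(2)[labels]
W = sgd(X, T)
print((np.argmax(X @ W, axis=1) == labels).mean())   # close to 1.0
```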

The "logsumexp" trick

- Expressions like

    exp(w_k^T x) / \sum_{j=1}^K exp(w_j^T x)

  are very common but highly unstable, because the exp's in the denominator can cause an under- or overflow.
- Never compute sums \sum_i exp(a_i) naively.
- Add a constant A to each argument in all exp's, so that even the largest argument is small; then undo the operation after computing the sum:

    logsumexp(a_1, ..., a_K) = log (\sum_i exp(a_i + A)) - A,    with A = -max_i a_i

- Many software packages supply a convenience function "logsumexp" for this purpose.

Random variables and information

- "Probabilities allow us to be explicit about uncertainty." So how can we measure uncertainty?
- Idea: define the information content

    log (1 / p(x)) = -log p(x)

  contained in a random event.
- The information content is additive for independent events.
- So if we use log_2 and fair coin tosses, then the information content is measured in bits and it exactly fits our intuition.
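The trick in code, with arguments large enough that the naive sum overflows:

```python
import numpy as np

def logsumexp(a):
    """log(sum_i exp(a_i)), computed stably by shifting all arguments."""
    A = -np.max(a)                       # A = -max_i a_i, as above
    return np.log(np.sum(np.exp(a + A))) - A

a = np.array([1000.0, 1001.0, 1002.0])
print(np.log(np.sum(np.exp(a))))         # naive: overflow -> inf
print(logsumexp(a))                      # stable: ~1002.41

# Stable log-domain softmax: log p(C_k | x) = a_k - logsumexp(a)
print(np.exp(a - logsumexp(a)))          # well-defined probabilities
```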

Entropy

- To measure uncertainty, we define the entropy

    H(X) = -\sum_x p(x) log p(x)

  which is the average information content.
- For a continuous RV:

    H(X) = -\int_x p(x) log p(x) dx

- The more uniform, the more uncertain. The more "peaky", the more certain.
- Question: which probability distribution has maximum entropy, given mean and (co)variance(s)?

KL divergence

- Closely related to entropy is the Kullback-Leibler divergence (KL divergence, or "relative entropy"):

    KL(p || q) = \sum_x p(x) log (p(x) / q(x))

- The KL divergence measures the dissimilarity between two distributions.
- It is not symmetric.
- Maximum likelihood learning amounts to minimizing the KL divergence between the model distribution and the empirical distribution over the observed training data (exercise).
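Both definitions in numpy (the example distributions are made up; note the 0 log 0 = 0 convention):

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_x p(x) log p(x), in nats (use log2 for bits)."""
    p = p[p > 0]                          # convention: 0 log 0 = 0
    return -np.sum(p * np.log(p))

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m]))

uniform = np.full(4, 0.25)
peaky = np.array([0.97, 0.01, 0.01, 0.01])
print(entropy(uniform), entropy(peaky))   # uniform is the most uncertain

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
print(kl(p, q), kl(q, p))                 # not symmetric
```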

Mutual Information

- The mutual information between two random variables x, y with joint density p(x, y) is defined as

    MI(x, y) = \sum_{x,y} p(x, y) log ( p(x, y) / (p(x) p(y)) )

- It is the KL divergence between p(x, y) and the joint of two perfectly independent random variables (with marginals p(x) and p(y)).
- Thus, MI measures the dependence between x and y.
- It is nonnegative, and it is zero iff x and y are independent, in other words iff p(x, y) = p(x) p(y).

Frequentist – Bayesian

- Probability theory tells us how to calculate with probabilities.
- As scientists, we may ask how to interpret a probabilistic expression, such as

    p(x = 1) = 0.7

- There are two common interpretations:
  1. Frequentist: "The relative frequency of x in a (possibly infinite) population of trials is 0.7."
  2. Bayesian: "I believe that x is 1 with certainty 0.7."
- The Bayesian view used to be contentious because it is less intuitive. But it gives us the freedom to turn model parameters into random variables. And it is now an established view in ML.
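Mutual information from a joint table, reusing the KL idea (the joints are made up to show the two extreme cases):

```python
import numpy as np

def mutual_information(p_xy):
    """MI = KL between the joint and the product of its marginals."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    m = p_xy > 0
    return np.sum(p_xy[m] * np.log(p_xy[m] / (p_x @ p_y)[m]))

# Independent joint (outer product of marginals): MI = 0.
print(mutual_information(np.outer([0.4, 0.6], [0.3, 0.7])))   # 0.0

# Perfectly dependent joint (y = x): MI = H(x) = log 2.
print(mutual_information(np.diag([0.5, 0.5])))                # ~0.693
```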

Reading

- A good introduction to most of the concepts discussed in this class can be found in:
  Pattern Recognition and Machine Learning. C. Bishop. Springer, 2006.
- Most illustrations in this presentation are from that book.

The "vision equation"

- The purpose of vision: infer world properties (or hidden "causes"), s, from an image, I.
- We can express this with an analysis, or encoder, or inference, equation:

    s = f(I)

- Learning then amounts to estimating the parameters of f from image data.

Latent variables and generative models

- In practice, it is often much easier to write down how images get formed, given the causes.
- This leads to the synthesis, or decoder, equation:

    I = g(s)

  which describes how images depend on the state of the world.
- s is called a "latent variable" or "hidden variable", because unlike the image, I, we do not observe it.

Latent variables and generative models

- To incorporate ambiguities and uncertainties, we can re-phrase this equation as a conditional probability:

    I ~ p(I | s)

- This allows us to define analysis using Bayes' rule:

    p(s | I) = p(I | s) p(s) / p(I)

- Thus, analysis requires a prior, p(s), over the latent variables.
- And inference amounts to updating our prior belief based on data, I, to arrive at the posterior distribution p(s | I).

Latent variables and generative models

- For maximum likelihood learning, we need to marginalize over s:

    p(I) = \sum_s p(I | s) p(s)

- How are f and g, or the probabilities, defined in practice?
- There is a wide variety of possibilities, and we will cover a sample of these in this course.
- All models involve constraining the "capacity" of s, to force the learned representation to be meaningful.

Natural images are not random

[Figure: Venn diagram, "All natural images" as a small subset of "All images"]

- Constraining the capacity of s forces the learner to "zoom in" on where the data is.