Sampling and Monte Carlo Integration
Total Page:16
File Type:pdf, Size:1020Kb
Sampling and Monte Carlo Integration Michael Gutmann Probabilistic Modelling and Reasoning (INFR11134) School of Informatics, University of Edinburgh Spring semester 2018 Recap Learning and inference often involves intractable integrals, e.g. I Marginalisation Z p(x) = p(x, y)dy y I Expectations Z E [g(x) | yo] = g(x)p(x|yo)dx for some function g. I For unobserved variables, likelihood and gradient of the log lik Z L(θ) = p(D; θ) = p(u, D; θdu), u ∇θ`(θ) = Ep(u|D;θ) [∇θ log p(u, D; θ)] Notation: Ep(x) is sometimes used to indicate that the expectation is taken with respect to p(x). Michael Gutmann Sampling and Monte Carlo Integration 2 / 41 Recap Learning and inference often involves intractable integrals, e.g. I For unnormalised models with intractable partition functions p˜(D; θ) L(θ) = R x p˜(x; θ)dx ∇θ`(θ) ∝ m(D; θ) − Ep(x;θ) [m(x; θ)] I Combined case of unnormalised models with intractable partition functions and unobserved variables. I Evaluation of intractable integrals can sometimes be avoided by using other learning criteria (e.g. score matching). I Here: methods to approximate integrals like those above using sampling. Michael Gutmann Sampling and Monte Carlo Integration 3 / 41 Program 1. Monte Carlo integration 2. Sampling Michael Gutmann Sampling and Monte Carlo Integration 4 / 41 Program 1. Monte Carlo integration Approximating expectations by averages Importance sampling 2. Sampling Michael Gutmann Sampling and Monte Carlo Integration 5 / 41 Averages with iid samples I Tutorial 7: For Gaussians, the sample average is an estimate (MLE) of the mean (expectation) E[x] 1 n x¯ = X x ≈ [x] n i E i=1 I Gaussianity not needed: assume xi are iid observations of x ∼ p(x). Z 1 n [x] = xp(x)dx ≈ x¯ x¯ = X x E n n n i i=1 I Subscript n reminds us that we used n samples to compute the average. I Approximating integrals by means of sample averages is called Monte Carlo integration. Michael Gutmann Sampling and Monte Carlo Integration 6 / 41 Averages with iid samples I Sample average is unbiased 1 n n [¯x ] = X [x ] =∗ [x] = [x] E n n E i nE E i=1 (∗: “identically distributed” assumption is used, not independence) I Variability " # 1 n 1 n 1 [¯x ] = X x =∗ X [x ] = [x] V n n2 V i n2 V i nV i=1 i=1 (∗: independence assumption used) I Squared error decreases as 1/n h i 1 [¯x ] = (¯x − [x])2 = [x] V n E n E nV Michael Gutmann Sampling and Monte Carlo Integration 7 / 41 Averages with iid samples I Weak law of large numbers: [x] Pr (|x¯ − [x]| ≥ ) ≤ V n E n2 I As n → ∞, the probability for the sample average to deviate from the expected value goes to zero. I We say that sample average converges in probability to the expected value. I Speed of convergence depends on the variance V[x]. I Different “laws of large numbers” exist that make different assumptions. Michael Gutmann Sampling and Monte Carlo Integration 8 / 41 Chebyshev’s inequality I Weak law of large numbers is a direct consequence of Chebyshev’s inequality I Chebyshev’s inequality: Let s be some random variable with mean E[s] and variance V[s]. [s] Pr (|s − [s]| ≥ ) ≤ V E 2 I This means that for all random variables: I probability to deviate more than three standard deviation from the mean is less than 1/9 ≈ 0.11 p (set = 3 V(s)) I Probability to deviate more than 6 standard deviations: 1/36 ≈ 0.03. These are conservative values; for many distributions, the probabilities will be smaller. Michael Gutmann Sampling and Monte Carlo Integration 9 / 41 Proofs (not examinable) I Chebyshev’s inequality follows from Markov’s inequality. I Markov’s inequality: For a random variable y ≥ 0 [y] Pr(y ≥ t) ≤ E (t > 0) t I Chebyshev’s inequality is obtained by setting y = |s − E[s]| 2 2 Pr (|s − E[s]| ≥ t) = Pr (s − E[s]) ≥ t (s − [s])2 ≤ E E . t2 Chebyshev’s inequality follows with t = , and because 2 E[(s − E[s] ] is the variance V[s] of s. Michael Gutmann Sampling and Monte Carlo Integration 10 / 41 Proofs (not examinable) Proof for Markov’s inequality: Let t be an arbitrary positive number and y a one-dimensional non-negative random variable with pdf p. We can decompose the expectation of y using t as split-point, Z ∞ Z t Z ∞ E[y] = up(u)du = up(u)du + up(u)du. 0 0 t Since u ≥ t in the second term, we obtain the inequality Z t Z ∞ E[y] ≥ up(u)du + tp(u)du. 0 t The second term is t times the probability that y ≥ t, so that Z t E[y] ≥ up(u)du + t Pr(y ≥ t) 0 ≥ t Pr(y ≥ t), where the second line holds because the first term in the first line is non-negative. This gives Markov’s inequality (y) Pr(y ≥ t) ≤ E (t > 0) t Michael Gutmann Sampling and Monte Carlo Integration 11 / 41 Averages with correlated samples I When computing the variance of the sample average [x] [¯x ] = V V n n we assumed the samples are identically and independently distributed. I The variance shrinks with increasing n and the average becomes more and more concentrated around E[x]. I Corresponding results exist for the case of statistically dependent samples xi . Known as “ergodic theorems”. I Important for the theory of Markov chain Monte Carlo methods but requires advanced mathematical theory. Michael Gutmann Sampling and Monte Carlo Integration 12 / 41 More general expectations I So far, we have considered Z 1 n [x] = xp(x)dx ≈ X x E n i i=1 where xi ∼ p(x) I This generalises Z 1 n [g(x)] = g(x)p(x)dx ≈ X g(x ) E n i i=1 where xi ∼ p(x) 1 I Variance of the approximation if the xi are iid is n V[g(x)] Michael Gutmann Sampling and Monte Carlo Integration 13 / 41 Example (Based on a slide from Amos Storkey) Z 1 n [g(x)] = g(x)N (x; 0, 1)dx ≈ X g(x )(x ∼ N (x; 0, 1)) E n i i i=1 for g(x) = x and g(x) = x 2 Left: sample average as a function of n Right: Variability (0.5 quantile: solid, 0.1 and 0.9 quantiles: dashed) 3 3 2.5 2.5 2 2 1.5 1.5 1 1 0.5 0.5 Average 0 0 -0.5 -0.5 -1 Distribution of the average -1.5 -1 -2 -1.5 0 200 400 600 800 1000 0 200 400 600 800 1000 Number of samples Number of samples Michael Gutmann Sampling and Monte Carlo Integration 14 / 41 Example (Based on a slide from Amos Storkey) Z 1 n [g(x)] = g(x)N (x; 0, 1)dx ≈ X g(x )(x ∼ N (x; 0, 1)) E n i i i=1 for g(x) = exp(0.6x 2) Left: sample average as a function of n Right: Variability (0.5 quantile: solid, 0.1 and 0.9 quantiles: dashed) 10 15 9 8 7 10 6 5 Average 4 5 3 Distribution of the average 2 1 0 0 200 400 600 800 1000 0 200 400 600 800 1000 Number of samples Number of samples Michael Gutmann Sampling and Monte Carlo Integration 15 / 41 Example I Indicators that something is wrong: I Strong fluctuations in the sample average as n increases. I Large non-declining variability. I Note: integral is not finite: Z 1 Z exp(0.6x 2)N (x; 0, 1)dx = √ exp(0.6x 2) exp(−0.5x 2)dx 2π 1 Z = √ exp(0.1x 2)dx 2π = ∞ but for any n, the sample average is finite and may be mistaken for a good approximation. I Check variability when approximating the expected value by a sample average! Michael Gutmann Sampling and Monte Carlo Integration 16 / 41 Approximating general integrals I If the integral does not correspond to an expectation, we can smuggle in a pdf q to rewrite it as an expected value with respect to q Z Z q(x) I = g(x)dx = g(x) dx q(x) Z g(x) = q(x)dx q(x) g(x) = Eq(x) q(x) 1 n g(x ) ≈ X i n q(x ) i=1 i with xi ∼ q(x) (iid) I This is the basic idea of importance sampling. I q is called the importance (or proposal) distribution Michael Gutmann Sampling and Monte Carlo Integration 17 / 41 Choice of the importance distribution I Call the approximation bI, n 1 X g(xi ) bI = n q(x ) i=1 i I bI is unbiased by construction g(x) Z g(x) Z [bI] = = q(x)dx = g(x)dx = I E E q(x) q(x) I Variance " # h i 1 g(x) 1 g(x)2 1 g(x)2 bI = = − V nV q(x) nE q(x) n E q(x) | {z } I2 Depends on the second moment. Michael Gutmann Sampling and Monte Carlo Integration 18 / 41 Choice of the importance distribution I The second moment is " # g(x)2 Z g(x)2 Z g(x)2 = q(x)dx = dx E q(x) q(x) q(x) Z |g(x)| = |g(x)| dx q(x) I Bad: q(x) is small when |g(x)| is large.