Sampling and Monte Carlo Integration
Michael Gutmann
Probabilistic Modelling and Reasoning (INFR11134)
School of Informatics, University of Edinburgh
Spring semester 2018

Recap

Learning and inference often involve intractable integrals, e.g.

- Marginalisation:
    p(x) = \int p(x, y) \, dy
- Expectations:
    \mathbb{E}[g(x) \mid y_o] = \int g(x) \, p(x \mid y_o) \, dx
  for some function g.
- For unobserved variables, the likelihood and the gradient of the log-likelihood:
    L(\theta) = p(\mathcal{D}; \theta) = \int p(u, \mathcal{D}; \theta) \, du,
    \nabla_\theta \ell(\theta) = \mathbb{E}_{p(u \mid \mathcal{D}; \theta)} \left[ \nabla_\theta \log p(u, \mathcal{D}; \theta) \right]
- For unnormalised models with intractable partition functions:
    L(\theta) = \frac{\tilde{p}(\mathcal{D}; \theta)}{\int \tilde{p}(x; \theta) \, dx},
    \nabla_\theta \ell(\theta) \propto m(\mathcal{D}; \theta) - \mathbb{E}_{p(x; \theta)} \left[ m(x; \theta) \right]
- The combined case of unnormalised models with intractable partition functions and unobserved variables.

Notation: \mathbb{E}_{p(x)} is sometimes used to indicate that the expectation is taken with respect to p(x).

Evaluation of intractable integrals can sometimes be avoided by using other learning criteria (e.g. score matching). Here we consider methods to approximate integrals like those above using sampling.

Program

1. Monte Carlo integration
   - Approximating expectations by averages
   - Importance sampling
2. Sampling

Averages with iid samples

- Tutorial 7: for Gaussians, the sample average is an estimate (the MLE) of the mean (expectation) E[x],
    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \approx \mathbb{E}[x]
- Gaussianity is not needed: assume the x_i are iid observations of x \sim p(x). Then
    \mathbb{E}[x] = \int x \, p(x) \, dx \approx \bar{x}_n, \qquad \bar{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_i
- The subscript n reminds us that we used n samples to compute the average.
- Approximating integrals by means of sample averages is called Monte Carlo integration.
- The sample average is unbiased,
    \mathbb{E}[\bar{x}_n] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[x_i] \overset{*}{=} \frac{n}{n} \mathbb{E}[x] = \mathbb{E}[x]
  (*: the "identically distributed" assumption is used, not independence).
- Variability:
    \mathbb{V}[\bar{x}_n] = \mathbb{V}\left[ \frac{1}{n} \sum_{i=1}^{n} x_i \right] \overset{*}{=} \frac{1}{n^2} \sum_{i=1}^{n} \mathbb{V}[x_i] = \frac{1}{n} \mathbb{V}[x]
  (*: the independence assumption is used).
- The squared error thus decreases as 1/n,
    \mathbb{V}[\bar{x}_n] = \mathbb{E}\left[ (\bar{x}_n - \mathbb{E}[x])^2 \right] = \frac{1}{n} \mathbb{V}[x]
- Weak law of large numbers:
    \Pr\left( |\bar{x}_n - \mathbb{E}[x]| \ge \epsilon \right) \le \frac{\mathbb{V}[x]}{n \epsilon^2}
- As n \to \infty, the probability for the sample average to deviate from the expected value goes to zero: the sample average converges in probability to the expected value.
- The speed of convergence depends on the variance V[x].
- Different "laws of large numbers" exist that make different assumptions.

Chebyshev's inequality

- The weak law of large numbers is a direct consequence of Chebyshev's inequality.
- Chebyshev's inequality: let s be a random variable with mean E[s] and variance V[s]. Then
    \Pr\left( |s - \mathbb{E}[s]| \ge \epsilon \right) \le \frac{\mathbb{V}[s]}{\epsilon^2}
- This means that, for all random variables,
  - the probability of deviating more than three standard deviations from the mean is less than 1/9 \approx 0.11 (set \epsilon = 3\sqrt{\mathbb{V}[s]}),
  - the probability of deviating more than six standard deviations is less than 1/36 \approx 0.03.
  These are conservative values; for many distributions the probabilities will be smaller.
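As a concrete illustration of the 1/n behaviour (a minimal NumPy sketch, not part of the original slides; the choice p(x) = N(2, 1), the sample sizes, the number of repetitions and the seed are all arbitrary), one can compare the empirical squared error of the sample average against V[x]/n:

import numpy as np

rng = np.random.default_rng(0)
true_mean, true_var = 2.0, 1.0   # assume x ~ N(2, 1), chosen only for this demo

for n in (10, 100, 1000, 10000):
    # Draw 1000 independent sample averages x_bar_n to estimate E[(x_bar_n - E[x])^2].
    averages = rng.normal(true_mean, np.sqrt(true_var), size=(1000, n)).mean(axis=1)
    mse = np.mean((averages - true_mean) ** 2)
    print(f"n = {n:5d}   empirical squared error = {mse:.5f}   V[x]/n = {true_var / n:.5f}")

The empirical error closely tracks V[x]/n; the Chebyshev bound above is looser but, unlike this check, holds for every distribution with finite variance.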
Proofs (not examinable)

- Chebyshev's inequality follows from Markov's inequality.
- Markov's inequality: for a random variable y \ge 0,
    \Pr(y \ge t) \le \frac{\mathbb{E}[y]}{t} \qquad (t > 0)
- Chebyshev's inequality is obtained by setting y = |s - \mathbb{E}[s]|:
    \Pr\left( |s - \mathbb{E}[s]| \ge t \right) = \Pr\left( (s - \mathbb{E}[s])^2 \ge t^2 \right) \le \frac{\mathbb{E}\left[ (s - \mathbb{E}[s])^2 \right]}{t^2}
  Chebyshev's inequality follows with t = \epsilon, and because \mathbb{E}[(s - \mathbb{E}[s])^2] is the variance \mathbb{V}[s] of s.

Proof of Markov's inequality: let t be an arbitrary positive number and y a one-dimensional non-negative random variable with pdf p. We can decompose the expectation of y using t as split-point,
    \mathbb{E}[y] = \int_0^\infty u \, p(u) \, du = \int_0^t u \, p(u) \, du + \int_t^\infty u \, p(u) \, du.
Since u \ge t in the second term, we obtain the inequality
    \mathbb{E}[y] \ge \int_0^t u \, p(u) \, du + \int_t^\infty t \, p(u) \, du.
The second term is t times the probability that y \ge t, so that
    \mathbb{E}[y] \ge \int_0^t u \, p(u) \, du + t \Pr(y \ge t) \ge t \Pr(y \ge t),
where the last inequality holds because the first term is non-negative. This gives Markov's inequality,
    \Pr(y \ge t) \le \frac{\mathbb{E}[y]}{t} \qquad (t > 0).

Averages with correlated samples

- When computing the variance of the sample average,
    \mathbb{V}[\bar{x}_n] = \frac{\mathbb{V}[x]}{n},
  we assumed that the samples are identically and independently distributed. The variance shrinks with increasing n and the average becomes more and more concentrated around E[x].
- Corresponding results exist for the case of statistically dependent samples x_i; they are known as "ergodic theorems".
- These results are important for the theory of Markov chain Monte Carlo methods, but they require advanced mathematical theory.

More general expectations

- So far we have considered
    \mathbb{E}[x] = \int x \, p(x) \, dx \approx \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad x_i \sim p(x).
- This generalises to
    \mathbb{E}[g(x)] = \int g(x) \, p(x) \, dx \approx \frac{1}{n} \sum_{i=1}^{n} g(x_i), \qquad x_i \sim p(x).
- If the x_i are iid, the variance of the approximation is \frac{1}{n} \mathbb{V}[g(x)].

Example (based on a slide from Amos Storkey)

    \mathbb{E}[g(x)] = \int g(x) \, \mathcal{N}(x; 0, 1) \, dx \approx \frac{1}{n} \sum_{i=1}^{n} g(x_i), \qquad x_i \sim \mathcal{N}(x; 0, 1)

for g(x) = x and g(x) = x^2.
[Figure: left, the sample average as a function of the number of samples n; right, its variability across runs (0.5 quantile solid, 0.1 and 0.9 quantiles dashed).]

The same approximation for g(x) = \exp(0.6 x^2):
[Figure: same layout as above; the sample average fluctuates strongly and its variability does not decline with n.]

- Indicators that something is wrong:
  - strong fluctuations in the sample average as n increases,
  - large, non-declining variability.
- Note that the integral is not finite:
    \int \exp(0.6 x^2) \, \mathcal{N}(x; 0, 1) \, dx = \frac{1}{\sqrt{2\pi}} \int \exp(0.6 x^2) \exp(-0.5 x^2) \, dx = \frac{1}{\sqrt{2\pi}} \int \exp(0.1 x^2) \, dx = \infty,
  but for any n the sample average is finite and may be mistaken for a good approximation.
- Check the variability when approximating an expected value by a sample average!
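The following small simulation (not from the slides; the number of runs, the sample size and the seed are arbitrary choices) makes this check concrete: repeating the estimate over independent runs and comparing quantiles shows a shrinking spread for g(x) = x^2 but not for g(x) = exp(0.6 x^2):

import numpy as np

rng = np.random.default_rng(0)
n_runs, n = 200, 1000
x = rng.standard_normal((n_runs, n))                      # x_i ~ N(0, 1), 200 independent runs

for name, g in (("x^2", x**2), ("exp(0.6 x^2)", np.exp(0.6 * x**2))):
    running = np.cumsum(g, axis=1) / np.arange(1, n + 1)  # sample average as a function of n, per run
    for m in (100, 1000):
        q10, q90 = np.quantile(running[:, m - 1], [0.1, 0.9])
        print(f"g(x) = {name:12s}  n = {m:4d}   0.1-0.9 quantile range of the average: "
              f"[{q10:.2f}, {q90:.2f}]")

For x^2 the interquantile range narrows as n grows; for exp(0.6 x^2) it stays wide, which is exactly the warning sign described above.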
Approximating general integrals

- If the integral does not correspond to an expectation, we can smuggle in a pdf q to rewrite it as an expected value with respect to q:
    I = \int g(x) \, dx = \int g(x) \frac{q(x)}{q(x)} \, dx = \int \frac{g(x)}{q(x)} q(x) \, dx = \mathbb{E}_{q(x)}\left[ \frac{g(x)}{q(x)} \right] \approx \frac{1}{n} \sum_{i=1}^{n} \frac{g(x_i)}{q(x_i)},
  with x_i \sim q(x) (iid).
- This is the basic idea of importance sampling.
- q is called the importance (or proposal) distribution.

Choice of the importance distribution

- Call the approximation \hat{I},
    \hat{I} = \frac{1}{n} \sum_{i=1}^{n} \frac{g(x_i)}{q(x_i)}.
- \hat{I} is unbiased by construction:
    \mathbb{E}[\hat{I}] = \mathbb{E}_{q(x)}\left[ \frac{g(x)}{q(x)} \right] = \int \frac{g(x)}{q(x)} q(x) \, dx = \int g(x) \, dx = I.
- Variance:
    \mathbb{V}[\hat{I}] = \frac{1}{n} \mathbb{V}_{q(x)}\left[ \frac{g(x)}{q(x)} \right] = \frac{1}{n} \mathbb{E}_{q(x)}\left[ \left( \frac{g(x)}{q(x)} \right)^2 \right] - \frac{1}{n} \underbrace{\left( \mathbb{E}_{q(x)}\left[ \frac{g(x)}{q(x)} \right] \right)^2}_{I^2}
  The variance thus depends on the second moment.
- The second moment is
    \mathbb{E}_{q(x)}\left[ \left( \frac{g(x)}{q(x)} \right)^2 \right] = \int \left( \frac{g(x)}{q(x)} \right)^2 q(x) \, dx = \int \frac{g(x)^2}{q(x)} \, dx = \int |g(x)| \, \frac{|g(x)|}{q(x)} \, dx
- Bad: q(x) is small when |g(x)| is large (see the sketch below).
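As a rough illustration (not from the slides; the integrand, the Gaussian proposals, the sample size and the seed are all my own choices), the sketch below estimates I = \int \exp(-x^2/2) \, dx = \sqrt{2\pi} with proposals q = N(0, s^2) of different widths. The narrowest proposal (s = 0.5) is exactly the bad case just described, since q(x) is small where |g(x)| is still large:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def g(x):
    """Integrand; the true value of the integral is sqrt(2*pi), about 2.5066."""
    return np.exp(-0.5 * x**2)

def q_pdf(x, s):
    """Density of the proposal q = N(0, s^2)."""
    return np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2 * np.pi))

for s in (0.5, 1.0, 3.0):                    # s = 0.5 has lighter tails than g: a deliberately poor proposal
    x = rng.normal(0.0, s, size=n)           # x_i ~ q
    w = g(x) / q_pdf(x, s)                   # importance weights g(x_i) / q(x_i)
    print(f"s = {s}:  I_hat = {w.mean():.4f}  (true value {np.sqrt(2 * np.pi):.4f}), "
          f"std of the weights = {w.std():.3f}")

In this sketch, for s = 0.5 the second moment of g(x)/q(x) is in fact infinite, so the printed estimate and its apparent spread cannot be trusted, mirroring the divergent-integral warning in the earlier example; for s = 1.0 the proposal is proportional to g, every weight equals sqrt(2*pi) and the variance is zero; s = 3.0 lies in between.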