Deep Learning Srihari

Importance Sampling

Sargur N. Srihari [email protected]

Topics in Monte Carlo Methods

1. Sampling and Monte Carlo Methods
2. Importance Sampling
3. Markov Chain Monte Carlo Methods
4. Gibbs Sampling
5. Mixing between separated modes

Importance Sampling: Choice of p(x)

• Sum or integral to be computed:
  – Discrete case: $s = \sum_x p(x) f(x) = E_p[f(x)]$
  – Continuous case: $s = \int p(x) f(x)\,dx = E_p[f(x)]$
• An important step is deciding which part of the integrand plays the role of the probability p(x), from which we sample $x^{(1)},\dots,x^{(n)}$
• And which part plays the role of f(x), whose expected value (under the probability distribution) is estimated as

$$\hat{s}_n = \frac{1}{n}\sum_{i=1}^{n} f(x^{(i)})$$

Decomposition of Integrand

• In the equations $s = \sum_x p(x) f(x)$ and $s = \int p(x) f(x)\,dx$
• There is no unique decomposition, because p(x)f(x) can be rewritten as
  $$p(x) f(x) = q(x)\,\frac{p(x) f(x)}{q(x)}$$
  – where we now sample from q and average $\frac{p(x) f(x)}{q(x)}$
  – In many cases the problem $s = \int p(x) f(x)\,dx$ is specified naturally as the expectation of f(x) under a given distribution p(x)
• But this natural decomposition may not be optimal in the number of samples required

Difficulty of sampling from p(x)

• The principal reason for sampling from p(x) is to evaluate the expectation of some f(x):
  $$E[f] = \int f(x) p(x)\,dx$$
• Given samples $x^{(i)}$, i = 1,...,n, from p(x), the finite-sum approximation is
  $$\hat{f} = \frac{1}{n}\sum_{i=1}^{n} f(x^{(i)})$$
• But drawing samples from p(x) may be impractical
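The finite-sum approximation can be sketched in a few lines. This is only an illustration under assumptions that are not in the slides: p is taken to be a standard normal and f(x) = x², so the exact expectation is 1. The name `monte_carlo_mean` is hypothetical.

```python
import random


def monte_carlo_mean(f, sampler, n):
    """Finite-sum approximation (1/n) * sum_i f(x_i), with each x_i drawn by sampler()."""
    return sum(f(sampler()) for _ in range(n)) / n


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed toy setup (not from the slides): p = N(0, 1), f(x) = x^2, exact E[f] = 1.
    estimate = monte_carlo_mean(lambda x: x * x, lambda: rng.gauss(0.0, 1.0), 20000)
    print(estimate)
```

The sketch only works because sampling from p is easy here; when it is not, the sampler has to be replaced by a proposal and the terms reweighted.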

Using a proposal distribution

• Rejection sampling uses a proposal distribution

• Determining if a sample is in the shaded area
• Generate two random numbers
  – $z_0$ from q(z)
  – $u_0$ from the uniform distribution over $[0, k\,q(z_0)]$
• This pair has a uniform distribution under the curve of the function k q(z)
• If $u_0 > p(z_0)$ the pair is rejected, otherwise it is retained
• The remaining pairs have a uniform distribution under the curve of p(z), and hence the corresponding z values are distributed according to p(z), as desired

Importance sampling retains samples

• Importance sampling uses:
  – A proposal distribution, like rejection sampling
    • In rejection sampling, samples not matching the conditioning are rejected
    • But in importance sampling all samples are retained
  – Assumes that for any x, p(x) can be evaluated
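The contrast between rejecting samples and retaining all of them can be sketched as follows. The setup is assumed for illustration and is not from the slides: the target is a Beta(2, 2) density p(z) = 6z(1 − z) on [0, 1], the proposal q is uniform on [0, 1], and k = 1.5 so that k q(z) ≥ p(z). The function names are hypothetical.

```python
import random


def p(z):
    """Assumed target density on [0, 1]: Beta(2, 2), p(z) = 6 z (1 - z)."""
    return 6.0 * z * (1.0 - z)


K = 1.5  # assumed constant with K * q(z) >= p(z) for the uniform proposal q(z) = 1


def rejection_estimate(f, n, rng):
    """Average f over the accepted samples; returns (estimate, number accepted)."""
    total, accepted = 0.0, 0
    for _ in range(n):
        z0 = rng.random()            # z0 ~ q(z) = U(0, 1)
        u0 = rng.uniform(0.0, K)     # u0 ~ U[0, k q(z0)], with q(z0) = 1
        if u0 > p(z0):
            continue                 # the pair (z0, u0) is rejected
        total += f(z0)
        accepted += 1
    return total / accepted, accepted


def importance_estimate(f, n, rng):
    """Average f weighted by p/q over all n samples; nothing is discarded."""
    total = 0.0
    for _ in range(n):
        z = rng.random()             # z ~ q(z) = U(0, 1)
        total += f(z) * p(z) / 1.0   # weight p(z) / q(z), with q(z) = 1
    return total / n


if __name__ == "__main__":
    rng = random.Random(0)
    print(rejection_estimate(lambda z: z * z, 30000, rng))
    print(importance_estimate(lambda z: z * z, 30000, rng))
```

Both functions target the same quantity, E_p[z²] = 0.3 for this assumed target. The rejection version discards roughly a third of its draws (acceptance rate 1/k), while the importance version uses every draw.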

Determining Importance weights

• Samples $\{x^{(i)}\}$ are drawn from a simpler proposal distribution q(x)

$$E[f] = \int f(x) p(x)\,dx = \int f(x)\frac{p(x)}{q(x)} q(x)\,dx \approx \frac{1}{n}\sum_{i=1}^{n}\frac{p(x^{(i)})}{q(x^{(i)})} f(x^{(i)})$$

• Samples are weighted by the ratios $r_i = p(x^{(i)}) / q(x^{(i)})$
  – Known as importance weights
  – They correct the bias introduced by sampling from the wrong distribution
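A minimal sketch of the weighted sum, under assumptions not taken from the slides: p = N(0, 1), the proposal q = N(0, 2²), and f(x) = x² (so the exact answer is 1). The helper names are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def importance_sampling(f, n, rng, q_std=2.0):
    """Estimate E_p[f] for p = N(0, 1) from samples of q = N(0, q_std^2).

    Returns the estimate and the list of importance weights r_i = p(x_i) / q(x_i).
    """
    weights, total = [], 0.0
    for _ in range(n):
        x = rng.gauss(0.0, q_std)                                # x_i ~ q
        r = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, q_std)  # importance weight r_i
        weights.append(r)
        total += r * f(x)
    return total / n, weights


if __name__ == "__main__":
    estimate, weights = importance_sampling(lambda x: x * x, 20000, random.Random(0))
    print(estimate, sum(weights) / len(weights))
```

Because both densities are normalized here, the average weight should itself be close to 1, which is a cheap sanity check on the weights.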

Importance Sampling

• A better way to compute integrals than rejection sampling
• Sample from a distribution q(x) and reweight the samples in a principled way

$$E_{x\sim p}[f(x)] = \sum_x f(x) p(x) = \sum_x f(x)\frac{p(x)}{q(x)} q(x) = E_{x\sim q}[f(x) w(x)] \approx \frac{1}{M}\sum_{m=1}^{M} f(x^{(m)}) w(x^{(m)})$$

• This can be treated as computing the expectation of f(x)p(x)/q(x) with respect to q(x)
• where w(x) = p(x)/q(x) and the samples $x^{(m)}$ are drawn from q
• So take samples from q(x) and reweight them with w(x)
  – p(x) is often intractable, but there is a solution

Case of Uniform q(x)

[Figure: weights w(x) = p(x)/q(x) for p(x) = N(0.5, 1) and q(x) = U(0, 1)]
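The weights for this uniform-proposal case can be computed directly. The sketch below uses the two distributions named on the slide; the function names are hypothetical. One caveat, stated here as an observation rather than something from the slides: U(0, 1) only covers [0, 1], so the reweighted samples can only recover integrals of p restricted to that interval, and the average weight approximates the mass of p on [0, 1] rather than 1.

```python
import math
import random


def p(x):
    """Density of N(0.5, 1)."""
    return math.exp(-0.5 * (x - 0.5) ** 2) / math.sqrt(2.0 * math.pi)


def mean_weight(n, rng):
    """Average of w(x) = p(x) / q(x) over x ~ q = U(0, 1), where q(x) = 1 on [0, 1]."""
    return sum(p(rng.random()) / 1.0 for _ in range(n)) / n


def mass_of_p_on_unit_interval():
    """Exact integral of p over [0, 1], i.e. Phi(0.5) - Phi(-0.5) for the standard normal CDF."""
    return math.erf(0.5 / math.sqrt(2.0))


if __name__ == "__main__":
    print(mean_weight(20000, random.Random(0)), mass_of_p_on_unit_interval())
```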

• With x = (e, z), the estimate is
  $$\frac{1}{M}\sum_{m=1}^{M} w_e(z^{(m)})$$
  where $w_e(z) = p(e, z)/q(z)$ is the sample weight
• Unlike rejection sampling, importance sampling uses all samples

Dealing with intractable p(x)

Unnormalized Importance sampling

• Define the weight using the unnormalized distribution $\tilde{p}(x)$, where $p(x) = \tilde{p}(x)/\alpha$:
  $$w(x) = \frac{\tilde{p}(x)}{q(x)}$$
• Then
  $$E_{x\sim q}[w(x)] = \sum_x q(x)\frac{\tilde{p}(x)}{q(x)} = \sum_x \tilde{p}(x) = \alpha$$
• And
  $$E_{x\sim p}[f(x)] = \sum_x p(x) f(x) = \sum_x q(x) f(x)\frac{p(x)}{q(x)} = \frac{1}{\alpha}\sum_x q(x) f(x)\frac{\tilde{p}(x)}{q(x)} = \frac{1}{\alpha} E_{x\sim q}[f(x) w(x)] = \frac{E_{x\sim q}[f(x) w(x)]}{E_{x\sim q}[w(x)]}$$
• Therefore, we can determine $E_{x\sim p}[f(x)]$ using samples x[m] drawn from q as follows:
  $$E_{x\sim p}[f(x)] \approx \frac{\sum_m f(x[m])\, w(x[m])}{\sum_m w(x[m])}$$

Normalized Importance Sampling

• Unnormalized importance sampling is unsuitable for conditional probabilities

  – With unnormalized importance sampling we could estimate the numerator as
    $$\frac{1}{M}\sum_{m=1}^{M} \delta(z^{(m)})\, w_e(z^{(m)})$$
    where $\delta(\cdot)$ is an indicator function and $w_e(z) = p(e, z)/q(z)$
  – and the denominator from samples as
    $$P(E = e) \approx \frac{1}{M}\sum_{m=1}^{M} w_e(z^{(m)})$$
    which is shown on Slide 16
• Estimating the numerator and denominator separately compounds errors
  » If the numerator is an underestimate and the denominator is an overestimate, the overall result is a severe underestimate

Using same set of samples

• If we use the same set of M samples $z^{(1)},\dots,z^{(M)} \sim q$
  – For both numerator and denominator, we avoid this issue of compounding errors
  – The final form of normalized importance sampling is the ratio of the two estimates above,
    $$\frac{\sum_{m=1}^{M} \delta(z^{(m)})\, w_e(z^{(m)})}{\sum_{m=1}^{M} w_e(z^{(m)})}$$
    where $w_e(z) = p(e, z)/q(z)$
• However, it is biased, as is easiest to see for M = 1, where the weight cancels
  – Fortunately both numerator and denominator are unbiased, so the result is asymptotically unbiased

Derivation of Optimal q*(x)

• Any Monte Carlo estimator
  $$\hat{s}_p = \frac{1}{n}\sum_{i=1,\; x^{(i)}\sim p}^{n} f(x^{(i)})$$
  – can be transformed into an importance sampling estimator
    $$\hat{s}_q = \frac{1}{n}\sum_{i=1,\; x^{(i)}\sim q}^{n} \frac{p(x^{(i)}) f(x^{(i)})}{q(x^{(i)})}$$
    using $p(x) f(x) = q(x)\,\frac{p(x) f(x)}{q(x)}$
• It can be readily seen that the expected value of the estimator does not depend on q: $E_q[\hat{s}_q] = E_p[\hat{s}_p] = s$
  – The variance of an importance sampling estimator is sensitive to the choice of q. The variance is
    $$\mathrm{Var}[\hat{s}_q] = \frac{1}{n}\,\mathrm{Var}\!\left[\frac{p(x) f(x)}{q(x)}\right]$$
• The minimum variance occurs when q is
  $$q^*(x) = \frac{p(x)\,|f(x)|}{Z}$$
• where Z is the normalization constant, chosen so that q*(x) sums or integrates to one

Choice of suboptimal q(x)

• Any choice of q is valid (in the sense of yielding the correct expected value) and q* is the optimal one (it yields minimum variance)
• Sampling from q* is usually infeasible
• Other choices of q can be feasible while still reducing the variance somewhat
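The effect of q on the variance can be checked numerically in a case where q* happens to be easy to sample. The example is assumed for illustration and is not from the slides: p = N(0, 1) and f(x) = x², so that q*(x) = x² p(x) with Z = 1, which is the density of a chi-distributed magnitude (3 degrees of freedom) with a random sign. All names are hypothetical.

```python
import math
import random


def normal_pdf(x, std):
    """Density of N(0, std^2) at x."""
    return math.exp(-0.5 * (x / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def per_sample_terms(sample_q, pdf_q, n):
    """Terms p(x) f(x) / q(x) for p = N(0, 1), f(x) = x^2 and x ~ q."""
    terms = []
    for _ in range(n):
        x = sample_q()
        terms.append(normal_pdf(x, 1.0) * x * x / pdf_q(x))
    return terms


def compare_proposals(n, rng):
    """Sample variance of p(x) f(x) / q(x) under three assumed proposals."""

    def sample_q_star():
        # |x| is chi-distributed with 3 degrees of freedom, with a random sign,
        # giving the density x^2 * N(x; 0, 1) = p(x) |f(x)| / Z with Z = 1.
        r = math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3)))
        return r if rng.random() < 0.5 else -r

    proposals = {
        "q = p": (lambda: rng.gauss(0.0, 1.0), lambda x: normal_pdf(x, 1.0)),
        "q = N(0, 4)": (lambda: rng.gauss(0.0, 2.0), lambda x: normal_pdf(x, 2.0)),
        "q = q*": (sample_q_star, lambda x: x * x * normal_pdf(x, 1.0)),
    }
    return {name: sample_variance(per_sample_terms(s, d, n))
            for name, (s, d) in proposals.items()}


if __name__ == "__main__":
    for name, var in compare_proposals(20000, random.Random(0)).items():
        print(name, var)
```

Under q* every term equals Z, so the variance collapses to floating-point noise. That only works because f does not change sign here and because Z is known, which is exactly what is unavailable in realistic problems.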

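Biased Importance Sampling (BIS)

• Another approach is BIS
  – Has the advantage of not requiring normalized p or q
  – For discrete variables, the BIS estimator is
    $$\hat{s}_{BIS} = \frac{\sum_{i=1}^{n} \frac{\tilde{p}(x^{(i)})}{\tilde{q}(x^{(i)})} f(x^{(i)})}{\sum_{i=1}^{n} \frac{\tilde{p}(x^{(i)})}{\tilde{q}(x^{(i)})}}$$
    where $\tilde{p}$ and $\tilde{q}$ are unnormalized forms of p and q, and $x^{(i)}$ are samples from q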
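A self-normalized estimator of this kind can be sketched with unnormalized densities only. The setup is assumed for illustration: the target is an unnormalized N(0, 1), the proposal is N(0, 2²) with its normalizer also dropped, and f(x) = x², whose exact expectation under the target is 1. The function names are hypothetical.

```python
import math
import random


def p_tilde(x):
    """Unnormalized N(0, 1); its normalizing constant is treated as unknown."""
    return math.exp(-0.5 * x * x)


def q_tilde(x):
    """Unnormalized N(0, 2^2) proposal density."""
    return math.exp(-0.125 * x * x)


def self_normalized_estimate(f, n, rng):
    """Estimate sum_i w_i f(x_i) / sum_i w_i with w_i = p_tilde(x_i) / q_tilde(x_i)."""
    num = den = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)          # x_i ~ q = N(0, 2^2)
        w = p_tilde(x) / q_tilde(x)      # unnormalized importance weight
        num += w * f(x)
        den += w
    return num / den


if __name__ == "__main__":
    print(self_normalized_estimate(lambda x: x * x, 20000, random.Random(0)))
```

The unknown constants cancel between numerator and denominator because both sums use the same samples, which is also why the estimator is biased for finite n.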

Effect of choice of q

• Good q → efficient Monte Carlo estimation
• Poor choice of q → much worse efficiency
  – Looking at $\mathrm{Var}[\hat{s}_q] = \frac{1}{n}\,\mathrm{Var}\!\left[\frac{p(x) f(x)}{q(x)}\right]$, if there are samples for which $\frac{p(x)\,|f(x)|}{q(x)}$ is large, then the variance of the estimator can get large
    • This happens when q(x) is tiny while neither p(x) nor f(x) is small enough to cancel it
  – The q distribution is usually chosen to be a very simple distribution so that it is easy to sample from
    • When x is high dimensional, this simplicity causes it to match p or p|f| poorly
    • Very small or very large ratios are possible when x is high dimensional

Importance sampling in Deep learning

• Used in many ML algorithms, including deep learning
• Examples
  – To accelerate training in neural language models with a large vocabulary
    • And other neural nets with a large number of outputs
  – Estimate the partition function (which normalizes a probability distribution)
  – Estimate the log-likelihood in deep directed models such as the variational autoencoder
  – Estimate the gradient in SGD where most of the cost comes from a small number of misclassified samples
    • Sampling more difficult examples can reduce the variance of the gradient

Importance Sampling and VAE

To evaluate the goodness of a VAE we decide to use Bayesian model comparison

[Figure: VAE encoder and decoder]

Thus for each model M we need the marginal likelihood $p(X \mid M)$

Use importance sampling with q being scaled and shifted Gaussians from our encoder
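The idea of estimating a marginal likelihood with a shifted and scaled Gaussian proposal can be sketched on a toy latent-variable model. This is not a VAE: the model is assumed to be z ~ N(0, 1) and x | z ~ N(z, 1), chosen because the exact answer p(x) = N(x; 0, 2) is available for comparison. The proposal mean and standard deviation stand in for what an encoder would output; all names are hypothetical.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_marginal_likelihood(x, n, rng, q_mean, q_std):
    """log p(x) ~= log((1/n) sum_i p(x | z_i) p(z_i) / q(z_i)), z_i ~ q = N(q_mean, q_std^2)."""
    log_w = []
    for _ in range(n):
        z = rng.gauss(q_mean, q_std)
        log_w.append(
            log_normal_pdf(z, 0.0, 1.0)                # assumed prior p(z) = N(0, 1)
            + log_normal_pdf(x, z, 1.0)                # assumed likelihood p(x | z) = N(z, 1)
            - log_normal_pdf(z, q_mean, q_std ** 2)    # proposal density q(z)
        )
    top = max(log_w)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in log_w)) - math.log(n)


if __name__ == "__main__":
    x = 1.0
    estimate = log_marginal_likelihood(x, 5000, random.Random(0), q_mean=0.5 * x, q_std=1.0)
    print(estimate, log_normal_pdf(x, 0.0, 2.0))
```

The proposal is centered at the exact posterior mean x/2 but is deliberately wider than the exact posterior, so the weights stay bounded.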

http://bjlkeng.github.io/posts/importance-sampling-and-estimating-marginal-likelihood-in-variational-autoencoders/

Importance Sampling in PERT

Project Evaluation and Review Technique (PERT)

• Task j starts at time $S_j$, takes time $T_j$ and ends at time $E_j = S_j + T_j$
• If every task takes the time in the table, then $E_{10} = 15$
• If the $T_j$ are independent and exponentially distributed with means as given, then the completion time $E_{10}$ is a random variable with mean 18 and a long tail to the right
• If there is a severe penalty for $E_{10} > 70$, direct simulation with n = 10,000 indicates that $E_{10} > 70$ happened only 2 times
• Importance sampling gets a better estimate of this probability: λ = 4θ and n = 10,000 gave $\hat{\mu}_\lambda = 2.02 \times 10^{-5}$

[Figure: exponential distribution, mean = 1/λ]

https://statweb.stanford.edu/~owen/mc/Ch-var-is.pdf
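The rare-event estimate can be sketched as below. Several things are assumptions rather than content of the slide: the task table is not reproduced in the text, so the predecessor lists and mean durations θ_j are stand-ins chosen so that the deterministic completion time is 15; and "λ = 4θ" is read as sampling each T_j from an exponential whose mean is four times its nominal mean. The sketch is not expected to reproduce the quoted figure exactly.

```python
import math
import random

# ASSUMED task table (a stand-in, not taken from the text): task -> (predecessors, mean theta_j).
TASKS = {
    1: ((), 4.0), 2: ((1,), 4.0), 3: ((1,), 2.0), 4: ((2,), 5.0), 5: ((2,), 2.0),
    6: ((3,), 3.0), 7: ((3,), 2.0), 8: ((3,), 3.0), 9: ((5, 6, 7), 2.0), 10: ((4, 8, 9), 2.0),
}


def completion_time(durations):
    """E_10, where E_j = S_j + T_j and S_j is the latest end time among task j's predecessors."""
    end = {}
    for j in sorted(TASKS):  # predecessors always have smaller indices in this table
        preds, _ = TASKS[j]
        start = max((end[k] for k in preds), default=0.0)
        end[j] = start + durations[j]
    return end[10]


def direct_estimate(n, threshold, rng):
    """Fraction of direct simulations with E_10 > threshold."""
    hits = 0
    for _ in range(n):
        t = {j: rng.expovariate(1.0 / theta) for j, (_, theta) in TASKS.items()}
        hits += completion_time(t) > threshold
    return hits / n


def importance_estimate(n, threshold, rng, scale=4.0):
    """Sample T_j with mean scale * theta_j and reweight by the density ratio p/q."""
    total = 0.0
    for _ in range(n):
        t, log_w = {}, 0.0
        for j, (_, theta) in TASKS.items():
            lam = scale * theta                      # proposal mean (assumed reading of 4*theta)
            t[j] = rng.expovariate(1.0 / lam)        # expovariate takes a rate, i.e. 1 / mean
            # log of [exp(-t/theta)/theta] / [exp(-t/lam)/lam]
            log_w += math.log(lam / theta) - t[j] / theta + t[j] / lam
        if completion_time(t) > threshold:
            total += math.exp(log_w)
    return total / n


if __name__ == "__main__":
    print(completion_time({j: theta for j, (_, theta) in TASKS.items()}))
    print(direct_estimate(10000, 70.0, random.Random(0)))
    print(importance_estimate(10000, 70.0, random.Random(0)))
```

Under the stretched proposal a large share of the simulated projects exceed the threshold, so the estimate is built from many small weighted contributions instead of a handful of rare hits.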