
Machine Learning Srihari

Basic Sampling Methods

Sargur Srihari [email protected]

Topics
1. Motivation
2. Sampling from PGMs
3. Transforming a Uniform Distribution
4. Rejection Sampling
5. Importance Sampling
6. Sampling-Importance-Resampling

1. Motivation

• Inference is the task of determining a response to a query from a given model
• When exact inference is intractable, we need some form of approximation
• Inference methods based on numerical sampling are known as Monte Carlo techniques
• Most situations will require evaluating expectations of unobserved variables, e.g., to make predictions

Using samples for inference
• Obtain a set of samples z(l), where l = 1,..,L
• Drawn independently from distribution p(z)
• Allows the expectation
      E[f] = ∫ f(z) p(z) dz
  to be approximated by
      f̂ = (1/L) Σ_{l=1}^{L} f(z(l))
  called an estimator (a short sketch follows below)
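A minimal sketch of the estimator above, assuming we already have a sampler for p(z); the choices p(z) = N(0,1) and f(z) = z², for which the true expectation is 1, are illustrative and not from the slides.

    import numpy as np

    rng = np.random.default_rng(0)

    def mc_estimate(f, sample_p, L):
        # Approximate E[f] = ∫ f(z) p(z) dz by the finite sum (1/L) Σ_l f(z^(l))
        z = sample_p(L)                      # L independent samples from p(z)
        return np.mean(f(z))

    # Illustrative choices: p(z) = N(0,1) and f(z) = z^2, so E[f] = 1
    f_hat = mc_estimate(lambda z: z**2, lambda L: rng.standard_normal(L), L=20)
    print(f_hat)                             # close to 1 even with only 20 samples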

– Then E[f̂] = E[f], i.e., the estimator has the correct mean
– And var[f̂] = (1/L) E[(f − E[f])²], which is the variance of the estimator
• Accuracy is independent of the dimensionality of z
  – High accuracy can be achieved with few (10 or 20) samples
• However, samples may not be independent
  – Effective sample size may be smaller than apparent sample size
  – In the example, f(z) is small where p(z) is high, and vice versa
• Expectation may be dominated by regions of small probability, thereby requiring large sample sizes

2. Sampling from directed PGMs

• If the joint distribution is represented by a BN
  – no observed variables
  – straightforward method is ancestral sampling
• Distribution is specified by
      p(z) = ∏_{i=1}^{M} p(z_i | pa_i)
  – where z_i is the set of variables associated with node i, and

  – pa_i is the set of variables associated with the parents of node i
• To obtain samples from the joint

  – we make one pass through the set of variables in the order z_1,..,z_M, sampling from the conditional distribution p(z_i | pa_i)
• After one pass through the graph we obtain one sample (a sketch follows below)
• Frequency of different values defines the distribution
  – E.g., allowing us to determine marginals
      P(L,S) = Σ_{D,I,G} P(D,I,G,L,S) = Σ_{D,I,G} P(D) P(I) P(G | D,I) P(L | G) P(S | I)
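A minimal sketch of one ancestral-sampling pass, assuming the network is given as nodes in topological order with one conditional sampler per node; the two-node network A → B and its probabilities are hypothetical, invented purely to exercise the function (they are not the D, I, G, L, S network of the slides).

    import numpy as np

    rng = np.random.default_rng(0)

    def ancestral_sample(nodes, samplers):
        # One pass through the variables in topological order (parents before children);
        # samplers[node](sample_so_far) returns a draw from p(z_i | pa_i)
        sample = {}
        for node in nodes:
            sample[node] = samplers[node](sample)
        return sample

    # Hypothetical two-node network A -> B with invented probabilities:
    samplers = {
        "A": lambda s: rng.random() < 0.6,                       # p(A=1) = 0.6
        "B": lambda s: rng.random() < (0.9 if s["A"] else 0.2),  # p(B=1 | A)
    }
    samples = [ancestral_sample(["A", "B"], samplers) for _ in range(10000)]
    # Frequencies of values across samples approximate marginals,
    # e.g., P(B=1) = 0.6*0.9 + 0.4*0.2 = 0.62
    print(np.mean([s["B"] for s in samples]))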

Ancestral sampling with some nodes instantiated
• Directed graph where some nodes are instantiated with observed values
      P(L = l0, S = s1) = Σ_{D,I,G} P(D) P(I) P(G | D,I) × P(L = l0 | G) P(S = s1 | I)
• Called Logic sampling
• Use ancestral sampling, except
  – when a sample is obtained for an observed variable:
    • if the sampled value agrees with the observed value, it is retained and we proceed to the next variable
    • if they do not agree, the whole sample is discarded
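A sketch of logic sampling built on the ancestral_sample helper and the hypothetical A → B network above; for brevity it checks agreement after a full pass, whereas the slides discard the sample as soon as a disagreement occurs.

    def logic_sample(nodes, samplers, evidence, n_samples):
        # Ancestral sampling with rejection of samples that contradict the evidence
        kept = []
        while len(kept) < n_samples:
            s = ancestral_sample(nodes, samplers)
            if all(s[v] == val for v, val in evidence.items()):
                kept.append(s)
        return kept

    # Condition on B = True and look at the frequency of A among retained samples:
    posterior = logic_sample(["A", "B"], samplers, {"B": True}, 2000)
    print(np.mean([s["A"] for s in posterior]))   # ≈ P(A=1 | B=1) ≈ 0.87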

Properties of Logic Sampling

      P(L = l0, S = s1) = Σ_{D,I,G} P(D) P(I) P(G | D,I) × P(L = l0 | G) P(S = s1 | I)

• Samples correctly from the posterior distribution
  – Corresponds to sampling from the joint distribution of hidden and data variables
• But the probability of accepting a sample decreases as
  – the number of variables increases, and
  – the number of states the variables can take increases
• A special case of importance sampling
  – Rarely used in practice

Undirected Graphs

• No one-pass sampling strategy, even for the prior distribution with no observed variables
• Computationally expensive methods such as Gibbs sampling must be used
  – Start with a sample
  – Replace the first variable conditioned on the values of the rest, then the next variable, etc.

3. Basic Sampling Algorithms

• Simple strategies for generating random samples from a given distribution
• Assume that the algorithm is provided with a pseudo-random generator for the uniform distribution over (0,1)
• For standard distributions we can use the transformation method for generating non-uniformly distributed random numbers

Transformation Method

• Goal: generate random numbers from a simple non-uniform distribution
• Assume we have a source of uniformly distributed random numbers
• z is uniformly distributed over (0,1), i.e., p(z) = 1 in that interval

[Figure: the uniform density p(z) = 1 on the interval (0,1)]

Inverse Probability Transform

• Let F(x) be the cumulative distribution function of some distribution we want to sample from
• Let F⁻¹(u) be its inverse

[Figure: the CDF F rises from 0 to 1; a uniform draw u on the vertical axis is mapped to x = F⁻¹(u) on the horizontal axis]

• Theorem:
  – If u ~ U(0,1) is uniform, then F⁻¹(u) ~ F
  – Proof:
    • P(F⁻¹(u) ≤ x) = P(u ≤ F(x)), applying F to both sides of the lhs
    • = F(x), because P(u ≤ y) = y since u is uniform on the unit interval

Transformation for Standard Distributions

[Figure: a uniform density p(z) on (0,1) is mapped through y = f(z) to a non-uniform density p(y)]

      p(y) = p(z) |dz/dy|     (1)

      z = h(y) ≡ ∫_{−∞}^{y} p(ŷ) dŷ

Geometry of Transformation
• We are interested in generating random variables from p(y)
  – non-uniform random variables

• h(y) is the indefinite integral of the desired p(y)
• z ~ U(0,1) is transformed using y = h⁻¹(z)
• Results in y being distributed as p(y)

Transformations for Exponential & Cauchy
• We need samples from the exponential distribution
      p(y) = λ exp(−λy),  0 ≤ y < ∞
  – In this case the indefinite integral is
      z = h(y) = ∫_{0}^{y} p(ŷ) dŷ = 1 − exp(−λy)
  – If we transform using y = −λ⁻¹ ln(1 − z)
  – then y will have an exponential distribution
• We need samples from the Cauchy distribution
      p(y) = (1/π) · 1/(1 + y²)
  – The inverse of the indefinite integral can be expressed as a tan function (see the sketch below)
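A minimal sketch of both transformations, assuming only a U(0,1) source; the rate λ = 2 is an arbitrary illustrative choice. For the Cauchy, h(y) = 1/2 + (1/π) arctan(y), so h⁻¹(z) = tan(π(z − 1/2)).

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.random(100000)             # z ~ U(0,1)

    # Exponential: y = -(1/λ) ln(1 - z)
    lam = 2.0                          # arbitrary rate, for illustration
    y_exp = -np.log(1.0 - z) / lam
    print(y_exp.mean())                # ≈ 1/λ = 0.5

    # Cauchy: the inverse of the indefinite integral is a tan function
    y_cauchy = np.tan(np.pi * (z - 0.5))
    print(np.median(y_cauchy))         # ≈ 0 (the Cauchy mean does not exist)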

Generalization of Transformation Method

• Single-variable transformation
      p(y) = p(z) |dz/dy|
  – where z is uniformly distributed over (0,1)
• Multiple-variable transformation

      p(y_1,...,y_M) = p(z_1,...,z_M) |∂(z_1,...,z_M)/∂(y_1,...,y_M)|

Transformation Method for Gaussian

• Box-Muller method for Gaussian
  – Example of a bivariate Gaussian
• First generate a uniform distribution in the unit circle
  – Generate pairs of uniformly distributed random numbers z1, z2 ∈ (−1, 1)
    • Can be done from U(0,1) using z → 2z − 1
  – Discard each pair unless z1² + z2² < 1
  – Leads to a uniform distribution of points inside the unit circle, with
      p(z1, z2) = 1/π

Generating a Gaussian

• For each pair z1, z2 evaluate the quantities
      y1 = z1 (−2 ln r² / r²)^(1/2)      y2 = z2 (−2 ln r² / r²)^(1/2)
  where r² = z1² + z2²

– Then y1 and y2 are independent Gaussians with zero mean and unit variance
• For arbitrary mean and variance:
  – If y ~ N(0,1), then σy + µ has distribution N(µ, σ²)
• In the multivariate case:
  – If the components of z are independent and N(0,1), then y = µ + Lz will have distribution N(µ, Σ), where Σ = LLᵀ is called the Cholesky decomposition (see the sketch below)
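A minimal sketch of the polar Box-Muller method above, followed by the shift-and-Cholesky step for an arbitrary N(µ, Σ); the particular µ and Σ are arbitrary illustrative values.

    import numpy as np

    rng = np.random.default_rng(0)

    def box_muller(n):
        # Generate n pairs of independent N(0,1) samples via the unit-circle method
        out = []
        while len(out) < n:
            z1, z2 = 2.0 * rng.random(2) - 1.0        # uniform on (-1,1) x (-1,1)
            r2 = z1**2 + z2**2
            if r2 >= 1.0 or r2 == 0.0:                # keep only points inside the unit circle
                continue
            scale = np.sqrt(-2.0 * np.log(r2) / r2)
            out.append((z1 * scale, z2 * scale))
        return np.array(out)

    z_pairs = box_muller(5000)                        # shape (5000, 2), independent N(0,1)

    # Arbitrary mean and covariance (illustrative values):
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.6],
                      [0.6, 1.0]])
    L = np.linalg.cholesky(Sigma)                     # Sigma = L L^T
    y = mu + z_pairs @ L.T                            # y ~ N(mu, Sigma)
    print(np.cov(y, rowvar=False))                    # ≈ Sigma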

Limitation of Transformation Method

• Need to first calculate, and then invert, the indefinite integral of the desired distribution
• Feasible only for a small number of distributions
• Need alternate approaches
• Rejection sampling and importance sampling are applicable to univariate distributions only
  – But useful as components in more general strategies

4. Rejection Sampling

• Allows sampling from a relatively complex distribution
• Consider the univariate case, then extend to several variables
• Wish to sample from distribution p(z)
  – Not a simple standard distribution
  – Sampling from p(z) directly is difficult
• Suppose we are able to easily evaluate p̃(z) for any given value of z, up to a normalizing constant Zp
      p(z) = (1/Zp) p̃(z),   where p̃(z) is the unnormalized distribution

– where p(z) can readily be evaluated but Zp is unknown – e.g., p(z) is a mixture of Gaussians – Note that we may know the mixture distribution but we need samples to generate expectations 19 Machine Learning Srihari Rejection sampling: Proposal distribution

• Samples are drawn from a simple distribution, called the proposal distribution q(z)
• Introduce a constant k whose value is such that kq(z) ≥ p̃(z) for all z
  – kq(z) is called the comparison function

Rejection Sampling Intuition
• Samples are drawn from the simple distribution q(z)
• Rejected if they fall in the grey area
  – Between the unnormalized distribution p̃(z) and the scaled distribution kq(z)
• Resulting samples are distributed according to p(z), which is the normalized version of p̃(z)

Determining if a sample is in the shaded area
• Generate two random numbers

– z0 from q(z)

– u0 from the uniform distribution over [0, kq(z0)]

• This pair has a uniform distribution under the curve of the function kq(z)
• If u0 > p̃(z0) the pair is rejected, otherwise it is retained (see the sketch below)
• Remaining pairs have a uniform distribution under the curve of p̃(z), and hence the corresponding z values are distributed according to p(z), as desired
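A minimal sketch of the accept/reject rule above; the unnormalized target (a two-component Gaussian mixture), the Gaussian proposal, and the grid-based choice of k are all illustrative assumptions rather than the slides' Gamma example.

    import numpy as np

    rng = np.random.default_rng(0)

    def p_tilde(z):
        # Unnormalized target: two-component Gaussian mixture (illustrative)
        return 0.3 * np.exp(-0.5 * ((z + 1.0) / 0.5)**2) + 0.7 * np.exp(-0.5 * ((z - 2.0) / 0.8)**2)

    def q(z, mu=0.5, s=2.5):
        # Proposal: a broad Gaussian density
        return np.exp(-0.5 * ((z - mu) / s)**2) / (s * np.sqrt(2.0 * np.pi))

    # Choose k so that k q(z) >= p_tilde(z) everywhere (checked numerically on a grid)
    grid = np.linspace(-10.0, 10.0, 10001)
    k = 1.05 * np.max(p_tilde(grid) / q(grid))

    samples = []
    while len(samples) < 5000:
        z0 = rng.normal(0.5, 2.5)              # z0 drawn from q(z)
        u0 = rng.uniform(0.0, k * q(z0))       # u0 uniform over [0, k q(z0)]
        if u0 <= p_tilde(z0):                  # reject the pair if u0 > p_tilde(z0)
            samples.append(z0)
    samples = np.array(samples)                # distributed according to the normalized target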

Example of Rejection Sampling
• Task of sampling from the Gamma distribution
      Gam(z | a, b) = bᵃ z^(a−1) exp(−bz) / Γ(a)
  [Figure: the Gamma density with a scaled Cauchy comparison function lying above it]

• Since the Gamma is roughly bell-shaped, the proposal distribution is a Cauchy
• The Cauchy has to be slightly generalized
  – To ensure it is nowhere smaller than the Gamma

Adaptive Rejection Sampling
• Used when it is difficult to find a suitable analytic envelope for p(z)
• Straightforward when p(z) is log concave
  – i.e., when ln p(z) has derivatives that are non-increasing functions of z
  – The function ln p(z) and its gradient are evaluated at an initial set of grid points
  – Intersections of the resulting tangent lines are used to construct the envelope
    • A sequence of linear functions

Dimensionality and Rejection Sampling

• Gaussian example
  [Figure: the true distribution p(z) is a Gaussian; the proposal q(z) is a Gaussian with slightly larger variance, shown scaled as kq(z)]
• Acceptance rate is the ratio of the volumes under p(z) and kq(z)
  – diminishes exponentially with dimensionality

5. Importance Sampling

• Principal reason for sampling p(z) is evaluating the expectation of some f(z)
      E[f] = ∫ f(z) p(z) dz
  – E.g., in Bayesian regression, p(t | x) = ∫ p(t | x,w) p(w) dw, where p(t | x,w) ~ N(y(x,w), β⁻¹)
• Given samples z(l), l = 1,..,L, from p(z), the finite-sum approximation is
      f̂ = (1/L) Σ_{l=1}^{L} f(z(l))
• But drawing samples from p(z) may be impractical
• Importance sampling uses:
  – a proposal distribution, like rejection sampling
    • But all of the samples are retained
  – Assumes that for any z, p(z) can be evaluated

Determining Importance weights

• Sample {z(l)} from a simpler distribution q(z)

      E[f] = ∫ f(z) p(z) dz
           = ∫ f(z) [p(z)/q(z)] q(z) dz              (q(z) is the proposal distribution)
           ≈ (1/L) Σ_{l=1}^{L} [p(z(l)) / q(z(l))] f(z(l))
  – Unlike rejection sampling, all of the samples are retained

• Samples are weighted by the ratios r_l = p(z(l)) / q(z(l))
  – Known as importance weights
  – These correct the bias introduced by sampling from the wrong distribution (see the sketch below)
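A minimal sketch of the weighted estimator above; the target p(z) = N(3,1), the proposal q(z) = N(0,3), and f(z) = z are arbitrary illustrative choices, and both densities are assumed to be evaluable pointwise.

    import numpy as np

    rng = np.random.default_rng(0)

    def normal_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * np.sqrt(2.0 * np.pi))

    # Illustrative choices: target p(z) = N(3,1), proposal q(z) = N(0,3), f(z) = z
    L = 20000
    z = rng.normal(0.0, 3.0, size=L)                          # samples drawn from q(z)
    r = normal_pdf(z, 3.0, 1.0) / normal_pdf(z, 0.0, 3.0)     # importance weights r_l = p/q
    f_hat = np.mean(r * z)                                    # (1/L) Σ_l r_l f(z^(l))
    print(f_hat)                                              # ≈ 3, the mean of the target p(z)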

Likelihood-weighted Sampling

• Importance sampling of a graphical model using ancestral sampling
• Weight of a sample z given evidence variables E

      r(z) = ∏_{z_i ∈ E} p(z_i | pa_i)
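A sketch of likelihood weighting under the same assumptions as the earlier ancestral-sampling sketch: evidence nodes are clamped to their observed values instead of being sampled, and each sample is weighted by r(z) above. The cond_prob functions, which return p(z_i = value | pa_i), are hypothetical additions not defined on the slides.

    def likelihood_weighted_sample(nodes, samplers, cond_prob, evidence):
        # One weighted sample: clamp evidence nodes, sample the remaining nodes ancestrally
        sample, weight = {}, 1.0
        for node in nodes:
            if node in evidence:
                sample[node] = evidence[node]
                weight *= cond_prob[node](evidence[node], sample)   # multiply in p(z_i | pa_i)
            else:
                sample[node] = samplers[node](sample)
        return sample, weight

    # With the hypothetical A -> B network from the ancestral-sampling sketch:
    cond_prob = {"B": lambda val, s: (0.9 if s["A"] else 0.2) if val else (0.1 if s["A"] else 0.8)}
    pairs = [likelihood_weighted_sample(["A", "B"], samplers, cond_prob, {"B": True}) for _ in range(5000)]
    print(sum(w for s, w in pairs if s["A"]) / sum(w for s, w in pairs))   # ≈ P(A=1 | B=1)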

6. Sampling-Importance Re-sampling (SIR)

• Rejection sampling depends on a suitable value of k such that kq(z) ≥ p̃(z)
  [Figure: rejection sampling with a Gaussian proposal q(z), shown scaled as kq(z), above the true distribution p(z)]
  – For many pairs of distributions p(z) and q(z) it is impractical to determine the value of k
  – If k is sufficiently large to guarantee a bound, the acceptance rate becomes impractically small

• The method makes use of a sampling distribution q(z) but avoids having to determine k
• Name of the method: sampling from the proposal distribution, followed by determining importance weights, and then resampling

SIR Method
• Two stages
• Stage 1: L samples z(1),..,z(L) are drawn from q(z)

• Stage 2: Weights w1,..,wL are constructed
  – As in importance sampling, wl ∝ p(z(l)) / q(z(l)), normalized so the weights sum to one
• Finally, a second set of L samples is drawn from the discrete distribution {z(1),..,z(L)} with probabilities given by {w1,..,wL}
• If moments with respect to the distribution p(z) are needed, use:
      E[f] = ∫ f(z) p(z) dz ≈ Σ_{l=1}^{L} wl f(z(l))
  (a sketch follows below)
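A minimal sketch of the two SIR stages, reusing the illustrative target/proposal pair from the importance-sampling sketch (target N(3,1), proposal N(0,3)); only unnormalized target evaluations are needed, so the target density here simply stands in for a p̃(z) whose normalizer is unknown.

    import numpy as np

    rng = np.random.default_rng(0)

    def normal_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * np.sqrt(2.0 * np.pi))

    # Stage 1: draw L samples from the proposal q(z) = N(0, 3)
    L = 5000
    z = rng.normal(0.0, 3.0, size=L)

    # Stage 2: importance weights w_l ∝ p̃(z^(l)) / q(z^(l)), normalized to sum to one
    p_tilde = normal_pdf(z, 3.0, 1.0)            # stands in for an unnormalized target
    w = p_tilde / normal_pdf(z, 0.0, 3.0)
    w /= w.sum()

    # Resample: a second set of L samples from {z^(l)} with probabilities {w_l}
    resampled = rng.choice(z, size=L, replace=True, p=w)
    print(resampled.mean())                      # ≈ 3, the mean of the target

    # Moments can also be computed directly with the weights: E[f] ≈ Σ_l w_l f(z^(l))
    print(np.sum(w * z))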