Markov Chain Monte Carlo Methods
Total Page:16
File Type:pdf, Size:1020Kb
Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos Contents Markov Chain Monte Carlo Methods Sampling • Rejection • Importance • Hastings-Metropolis • Gibbs Markov Chains • Properties Particle Filtering • Condensation 2 Monte Carlo Methods 3 Markov Chain Monte Carlo History A recent survey places the Metropolis algorithm among the 10 algorithms that have had the greatest influence on the development and practice of science and engineering in the 20th century (Beichl&Sullivan, 2000). The Metropolis algorithm is an instance of a large class of sampling algorithms, known as Markov chain Monte Carlo (MCMC) MCMC plays significant role in statistics, econometrics, physics and computing science. There are several high-dimensional problems, such as computing the volume of a convex body in d dimensions and other high-dim integrals, for which MCMC simulation is the only known general approach for providing a solution within a reasonable time. 4 Markov Chain Monte Carlo History While convalescing from an illness in 1946, Stanislaw Ulam was playing solitaire. It, then, occurred to him to try to compute the chances that a particular solitaire laid out with 52 cards would come out successfully (Eckhard, 1987). Exhaustive combinatorial calculations are very difficult. New, more practical approach: laying out several solitaires at random and then observing and counting the number of successful plays. This idea of selecting a statistical sample to approximate a hard combinatorial problem by a much simpler problem is at the heart of modern Monte Carlo simulation. 5 FERMIAC FERMIAC: The beginning of Monte Carlo Methods Developed in the Early 1930’s The Monte Carlo trolley, or FERMIAC, was an analog computer invented by physicist Enrico Fermi to aid in his studies of neutron transport. At each stage, pseudo-random numbers are used to make decisions that affect the behavior of each neutron (random choice between fast and slow neutrons) 6 ENIAC ENIAC (pron.: /ˈini.æk/; Electronic Numerical Integrator And Computer) was the first electronic general-purpose computer. February 14, 1946: The completed machine was announced to the public Digital, and capable of being reprogrammed to solve a full range of computing problems Designed by John Mauchly and J. Presper Eckert of the University of Pennsylvania Stanislaw Ulam, John von Neumann, Enrico Fermi runs MCMC on ENIAC 7 MANIAC MANIAC Mathematical Analyzer, Numerical Integrator, and Computer or Mathematical Analyzer, Numerator, Integrator, and Computer 1952: built under the direction of Nicholas Metropolis at the Los Alamos Scientific Laboratory. It was based on the von Neumann architecture of the IAS, developed by John von Neumann Metropolis chose the name MANIAC in the hope of stopping the rash of silly acronyms for machine names Illiac, Johnniac, Silliac…. 8 MCMC Applications Sampling from high-dimensional, complicated distributions Bayesian inference and learning Normalization Marginalization Expectation Global optimization 9 The Monte Carlo principle Our goal is to estimate the following integral: The idea of Monte Carlo simulation is to draw an i.i.d. set of samples {x(i) } from a target density p(x) defined on a high-dim. space X. These samples can be used to approximate the target density with the following empirical point-mass function: 10 The Monte Carlo principle Estimate: Estimator: 11 The Monte Carlo principle Theorems a.s. consistent Unbiased estimation Independent of dimension d! Asymptotically normal 12 The Monte Carlo principle One “tiny” problem… Monte Carlo methods need sample from distribution p(x). When p(x) has standard form, e.g. Uniform or Gaussian, it is straightforward to sample from it using easily available routines. However, when this is not the case, we need to introduce more sophisticated sampling techniques. MCMC sampling ) 13 Sampling 14 Main Goal Sample from distribution p(x) that is only known up to a proportionality constant For example, p(x) 0.3 exp(−0.2x2) +0.7 exp(−0.2(x − 10)2) ∝ 15 Rejection Sampling 16 Rejection Sampling Suppose that p(x) is known up to a proportionality constant (e.g. exp(-x2)). It is easy to sample from q(x) that satisfies p(x) ≤ M q(x), M < ∞ M is known Rejection sampling 17 Rejection Sampling 18 Rejection Sampling Theorem The accepted x(i ) can be shown to be sampled with probability p(x) (Robert & Casella, 1999, p. 49). Severe limitations: It is not always possible to bound p(x)/q(x) with a reasonable constant M over the whole space X. If M is too large, the acceptance probability is too small. In high dimensional spaces it can be exponentially slow to sample points. (The points usually will be rejected) 19 Importance Sampling 20 Importance Sampling Importance sampling is an alternative “classical” solution that goes back to the 1940’s. Let us introduce, again, an arbitrary importance proposal distribution q(x) such that its support includes the support of p(x). Then we can rewrite I(f) as follows: 21 Importance Sampling Consequently, (i ) N if one can simulate N i.i.d. samples {x } i=1 according to q(x) and evaluate w(x(i )), possible Monte Carlo estimate of I (f ) is ) 22 Importance Sampling Theorem This estimator is unbiased Under weak assumptions, the strong law of large numbers applies: Some proposal distributions q(x) will obviously be preferable to others. Find one that minimizes the variance of the estimator: 23 Importance Sampling Theorem The variance is minimal when we adopt the following optimal importance distribution: The optimal proposal is not very useful in the sense that it is not easy to sample from | f (x)|p(x). High sampling efficiency is achieved when we focus on sampling from p(x) in the important regions where | f (x)|p(x) is relatively large; hence the name importance sampling 24 Importance Sampling Importance sampling estimates can be super-efficient: For a given function f (x), it is possible to find a distribution q(x) that yields an estimate with a lower variance than when using q(x)= p(x)! Problems with importance sampling In high dimensions it is not efficient either… 25 Andrey Markov Markov Chains 26 Markov Chains Markov chain: Homogen Markov chain: 27 Markov Chains Assume that the state space is finite: 1-Step state transition matrix: Lemma: The state transition matrix is stochastic: t-Step state transition matrix: Lemma: 28 Markov Chains Example Markov chain with three states (s = 3) Transition matrix Transition graph 29 Markov Chains If the probability vector for the initial state is it follows that and, after several iterations (multiplications by T ) stationary distribution no matter what initial distribution ¹(x1) was. The chain has forgotten its past. 30 Markov Chains Our goal is to find conditions under which the Markov chain forgets its past, and independently from its starting distribution, the state distributions converge to a stationary distribution. Definition: [stationary distribution, invariant distribution] This is a necessary condition for having limit behavior of the Markov chain. 31 Markov Chains Theorem: For any starting point, the chain will convergence to the unique invariant distribution p(x), as long as 1. T is a stochastic transition matrix 2. T is irreducible 3. T is aperiodic 32 Limit Theorem of Markov Chains More formally: If the Markov chain is Irreducible and Aperiodic, then: 33 Markov Chains Definition Irreducibility: For each pairs of states (i,j), there is a positive probability, starting in state i, that the process will ever enter state j. = The matrix T cannot be reduced to separate smaller matrices = Transition graph is connected. It is possible to get to any state from any state. 34 Markov Chains Definition Aperiodicity: The chain cannot get trapped in cycles. Definition A state i has period k if any return to state i, must occur in multiples of k time steps. Formally, the period of a state i is defined as (where "gcd" is the greatest common divisor) For example, suppose it is possible to return to the state in {6,8,10,12,...} time steps. Then k=2 35 Markov Chains Definition If k = 1, then the state is said to be aperiodic: returns to state i can occur at irregular times. In other words, a state i is aperiodic if there exists n such that for all n' ≥ n, Definition A Markov chain is aperiodic if every state is aperiodic. 36 Markov Chains Theorem: An irreducible Markov chain only needs one aperiodic state to imply all states are aperiodic. Corollary: An irreducible chain is said to be aperiodic, if for some n 0 and ¸ some state j Example for periodic Markov chain: In this case Let However, if we start the chain from (1,0), or (0,1), then the chain get traps into a cycle, it doesn’t forget its past. Periodic Markov chains don’t forget their past. 37 Reversible Markov chains Detailed Balance Property Definition: reversibility /detailed balance condition: Theorem: A sufficient, but not necessary, condition to ensure that a particular ¼ is the desired invariant distribution of the Markov chain is the detailed balance condition. 38 How fast can Markov chains forget the past? MCMC samplers are irreducible and aperiodic Markov chains have the target distribution as the invariant distribution. the detailed balance condition is satisfied. It is also important to design samplers that converge quickly. 39 Spectral properties Theorem: If ¼ is the left eigenvector of the matrix T with eigenvalue 1. The Perron-Frobenius theorem from linear algebra tells us that the remaining eigenvalues have absolute value less than 1. The second largest eigenvalue, therefore, determines the rate of convergence of the chain, and should be as small as possible. 40 The Hastings-Metropolis Algorithm 41 The Hastings-Metropolis Algorithm Our goal: Generate samples from the following discrete distribution: We don’t know B ! The main idea is to construct a time-reversible Markov chain with (¼1,…,¼m) limit distributions Later we will discuss what to do when the distribution is continuous 42 The Hastings-Metropolis Algorithm Let {1,2,…,m} be the state space of a Markov chain that we can simulate.