Sampling Algorithms, from Survey Sampling to Monte Carlo Methods: Tutorial and Literature Review
Benyamin Ghojogh* [email protected]
Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Hadi Nekoei* [email protected]
MILA (Montreal Institute for Learning Algorithms) – Quebec AI Institute, Montreal, Quebec, Canada

Aydin Ghojogh* [email protected]

Fakhri Karray [email protected]
Department of Electrical and Computer Engineering, Centre for Pattern Analysis and Machine Intelligence, University of Waterloo, Waterloo, ON, Canada

Mark Crowley [email protected]
Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

* The first three authors contributed equally to this work.

arXiv:2011.00901v1 [stat.ME] 2 Nov 2020

Abstract

This paper is a tutorial and literature review on sampling algorithms. We have two main types of sampling in statistics. The first type is survey sampling, which draws samples from a set or population. The second type is sampling from a probability distribution, where we have a probability density or mass function. In this paper, we cover both types of sampling. First, we review some required background on mean squared error, variance, bias, maximum likelihood estimation, the Bernoulli, Binomial, and Hypergeometric distributions, the Horvitz–Thompson estimator, and the Markov property. Then, we explain the theory of simple random sampling, bootstrapping, stratified sampling, and cluster sampling. We also briefly introduce multistage sampling, network sampling, and snowball sampling. Afterwards, we switch to sampling from distributions. We explain sampling from the cumulative distribution function, Monte Carlo approximation, simple Monte Carlo methods, and Markov Chain Monte Carlo (MCMC) methods. For simple Monte Carlo methods, whose iterations are independent, we cover importance sampling and rejection sampling. For MCMC methods, we cover the Metropolis algorithm, the Metropolis-Hastings algorithm, Gibbs sampling, and slice sampling. Then, we explain the random walk behaviour of Monte Carlo methods and more efficient Monte Carlo methods, including Hamiltonian (or hybrid) Monte Carlo, Adler's overrelaxation, and ordered overrelaxation. Finally, we summarize the characteristics, pros, and cons of the sampling methods compared to each other. This paper can be useful for different fields of statistics, machine learning, reinforcement learning, and computational physics.

1. Introduction

Sampling is a fundamental task in statistics. However, this terminology is used for two different tasks in statistics. On one hand, sampling refers to survey sampling, which is selecting instances from a population or set:

D := \{ x_1, x_2, \ldots, x_N \},  (1)

where the population size is N := |D|. Note that some of the instances of this population may be repetitive numbers/vectors. Survey sampling draws n samples from the population D to obtain a set of samples S, where n := |S|.
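For concreteness, the following minimal Python sketch (ours, not code from the paper; it assumes NumPy is available, and the variable names are our own) draws a survey sample S of size n from a finite population D in the sense of Eq. (1):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# A finite population D of size N; some entries may repeat, as noted above.
D = np.array([2.0, 3.5, 3.5, 7.1, 0.4, 5.9, 7.1, 1.2])
N = len(D)

# Draw a survey sample S of size n from the population D.
# replace=False corresponds to sampling without replacement.
n = 4
S = rng.choice(D, size=n, replace=False)

print("population D:", D)
print("sample S:    ", S)
```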
There are several articles and books on survey sampling, such as (Barnett, 1974; Smith, 1976; Foreman, 1991; Schofield, 1996; Nassiuma, 2001; Chaudhuri & Stenger, 2005; Tillé, 2006; Mukhopadhyay, 2008; Scheaffer et al., 2011; Fuller, 2011; Tillé & Matei, 2012; Hibberts et al., 2012; Singh & Mangat, 2013; Kalton, 2020). It is a field of research in statistics, with many possible future developments (Brick, 2011), especially in distributed networks and graphs (Frank, 2011a; Heckathorn & Cameron, 2017). Some of the popular methods in survey sampling are Simple Random Sampling (SRS) (Barnett, 1974), bootstrapping (Efron & Tibshirani, 1994), stratified sampling, cluster sampling (Barnett, 1974), multistage sampling (Lance & Hattori, 2016), network sampling (Frank, 2011b), and snowball sampling (Goodman, 1961).

On the other hand, sampling can refer to drawing samples from probability distributions. Usually, in real-world applications, the distributions of data are complicated to sample from; for example, they can be mixtures of several distributions (Ghojogh et al., 2019a). One can approximate samples from a complicated distribution by sampling from some other simple-to-sample distribution. The sampling methods which perform this sampling approximation are referred to as Monte Carlo methods (Mackay, 1998; Bishop, 2006; Kalos & Whitlock, 2009; Hammersley, 2013; Kroese et al., 2013). Monte Carlo approximation (Kalos & Whitlock, 2009) can be used for estimating the expectation or probability of a function of the data over the distribution. Monte Carlo methods can be divided into two main categories, i.e., simple methods and Markov Chain Monte Carlo (MCMC) (MacKay, 2003). Note that Monte Carlo methods are iterative. In simple Monte Carlo methods, every iteration is independent of previous iterations and drawing samples is performed blindly. Importance sampling (Glynn & Iglehart, 1989) and rejection sampling (Casella et al., 2004; Bishop, 2006; Robert & Casella, 2013) are examples of simple Monte Carlo methods.
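As a toy illustration of this idea (a sketch of ours, not the paper's code; it assumes NumPy, and f is an arbitrary choice for the example), the expectation of a function of the data can be approximated by averaging the function over samples drawn independently from the distribution, which is the essence of simple Monte Carlo estimation:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def f(x):
    # Any function of the data; here f(x) = x^2 as an example.
    return x ** 2

# Draw i.i.d. samples from a simple-to-sample distribution, e.g. N(0, 1).
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Monte Carlo approximation: E[f(X)] is estimated by (1/n) * sum_i f(x_i).
mc_estimate = np.mean(f(samples))

print("Monte Carlo estimate of E[X^2]:", mc_estimate)  # close to 1 for N(0, 1)
```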
In MCMC (Murray, 2007), however, every iteration depends on its previous iteration because these methods have the memory of the Markov property (Koller & Friedman, 2009). Some examples of MCMC are the Metropolis algorithm (Metropolis et al., 1953), the Metropolis-Hastings algorithm (Hastings, 1970), Gibbs sampling (Geman & Geman, 1984), and slice sampling (Neal, 2003; Skilling & MacKay, 2003). The Metropolis algorithms are usually slow because of their random walk behaviour (Spitzer, 2013). Some efficient methods for faster exploration of the range of data by sampling methods are the Hamiltonian (or hybrid) Monte Carlo method (Duane et al., 1987), Adler's overrelaxation (Adler, 1981), and ordered overrelaxation (Neal, 1998). Monte Carlo methods were originally developed in computational physics (Newman, 2013); hence, they have applications in physics (Binder et al., 2012). They also have applications in other fields such as finance (Glasserman, 2013) and reinforcement learning (Barto & Duff, 1994; Wang et al., 2012; Sutton & Barto, 2018).

In this tutorial and literature review paper, we cover both areas of sampling, i.e., survey sampling and sampling from distributions using Monte Carlo methods. The remainder of this paper is organized as follows. Section 2 reviews some required background on mean squared error, variance, bias, estimation using maximum likelihood estimation, the Bernoulli, Binomial, and Hypergeometric distributions, the Horvitz–Thompson estimator, and the Markov property. We introduce, in detail, the methods of survey sampling and Monte Carlo methods in Sections 3 and 4, respectively. Finally, we provide a summary of the methods, their pros and cons, and conclusions in Section 5.

2. Background

2.1. Mean Squared Error, Variance, and Bias

The materials of this subsection are taken from our previous tutorial paper (Ghojogh & Crowley, 2019). Assume we have a variable X and we estimate it. Let the random variable \widehat{X} denote the estimate of X. Let \mathbb{E}(\cdot) and \mathbb{P}(\cdot) denote expectation and probability, respectively. The variance of estimating this random variable is defined as:

\mathrm{Var}(\widehat{X}) := \mathbb{E}\big[(\widehat{X} - \mathbb{E}(\widehat{X}))^2\big],  (2)

which means the average deviation of \widehat{X} from the mean of our estimate, \mathbb{E}(\widehat{X}), where the deviation is squared for symmetry of difference. This variance can be restated as:

\mathrm{Var}(\widehat{X}) = \mathbb{E}(\widehat{X}^2) - (\mathbb{E}(\widehat{X}))^2.  (3)

See Appendix A for proof.

Our estimation can have a bias. The bias of our estimate is defined as:

\mathrm{Bias}(\widehat{X}) := \mathbb{E}(\widehat{X}) - X,  (4)

which means how much the mean of our estimate deviates from the original X.

Definition 1 (Unbiased Estimator). If the bias of an estimator is zero, i.e., \mathbb{E}(\widehat{X}) = X, the estimator is unbiased.

The Mean Squared Error (MSE) of our estimate, \widehat{X}, is defined as:

\mathrm{MSE}(\widehat{X}) := \mathbb{E}\big[(\widehat{X} - X)^2\big],  (5)

which means how much our estimate deviates from the original X.

The relation of MSE, variance, and bias is as follows:

\mathrm{MSE}(\widehat{X}) = \mathrm{Var}(\widehat{X}) + (\mathrm{Bias}(\widehat{X}))^2.  (6)

See Appendix A for proof.

If we have two random variables \widehat{X} and \widehat{Y}, we can say:

\mathrm{Var}(a\widehat{X} + b\widehat{Y}) = a^2\, \mathrm{Var}(\widehat{X}) + b^2\, \mathrm{Var}(\widehat{Y}) + 2ab\, \mathrm{Cov}(\widehat{X}, \widehat{Y}),  (7)

where \mathrm{Cov}(\widehat{X}, \widehat{Y}) is the covariance, defined as:

\mathrm{Cov}(\widehat{X}, \widehat{Y}) := \mathbb{E}(\widehat{X}\widehat{Y}) - \mathbb{E}(\widehat{X})\, \mathbb{E}(\widehat{Y}).  (8)

See Appendix A for proof.

If the two random variables are independent, i.e., X ⊥⊥ Y, we have:

\mathbb{E}(\widehat{X}\widehat{Y}) = \mathbb{E}(\widehat{X})\, \mathbb{E}(\widehat{Y}) \;\Longrightarrow\; \mathrm{Cov}(\widehat{X}, \widehat{Y}) = 0.  (9)

See Appendix A for proof. Note that Eq. (9) is not true for the reverse implication (this can be shown by counterexample).

We can extend Eqs. (7) and (8) to multiple random variables:

\mathrm{Var}\Big(\sum_{i=1}^{k} a_i X_i\Big) = \sum_{i=1}^{k} a_i^2\, \mathrm{Var}(X_i) + \sum_{i=1}^{k} \sum_{j=1,\, j \neq i}^{k} a_i a_j\, \mathrm{Cov}(X_i, X_j),  (10)

\mathrm{Cov}\Big(\sum_{i=1}^{k_1} a_i X_i,\; \sum_{j=1}^{k_2} b_j Y_j\Big) = \sum_{i=1}^{k_1} \sum_{j=1}^{k_2} a_i b_j\, \mathrm{Cov}(X_i, Y_j),  (11)

where the a_i's and b_j's are not random.

According to Eq. (9), if the random variables are independent, Eq. (10) is simplified to:

\mathrm{Var}\Big(\sum_{i=1}^{k} a_i X_i\Big) = \sum_{i=1}^{k} a_i^2\, \mathrm{Var}(X_i).  (12)

Lemma 2. The variance of the estimate of the mean is:

\mathrm{Var}(\widehat{\mu}) = \frac{1}{N} \sigma^2.  (16)

Proof. See Appendix A for proof.

Proposition 1. An unbiased estimator for the variance is:

\widehat{\sigma}^2 = \frac{1}{N-1} \sum_{j=1}^{N} (x_j - \widehat{\mu})^2.  (17)

Proof. See Appendix A for proof.

Note that Eq. (14) is a biased estimate of the variance because its expectation is:

\mathbb{E}(\widehat{\sigma}^2) = \mathbb{E}\Big(\frac{1}{N} \sum_{j=1}^{N} (x_j - \widehat{\mu})^2\Big) = \frac{N-1}{N}\, \sigma^2.

2.3. Bernoulli, Binomial, and Hypergeometric Distributions

The Bernoulli distribution is a discrete distribution taking the values one and zero with probabilities p and 1 − p, respectively. Its expected value and variance are:

\mathbb{E}(X) = p,  (18)

\mathrm{Var}(X) = p\,(1 - p),  (19)

respectively.
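To connect Proposition 1 with the note on the biased estimator above, the following sketch (ours, not the paper's code; it assumes NumPy, and the sample size and trial count are arbitrary choices) compares the two variance estimators empirically: dividing by N underestimates σ² on average, while dividing by N − 1 is unbiased.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

true_var = 4.0          # sigma^2 of the data-generating distribution
N = 5                   # small sample size, where the bias is clearly visible
n_trials = 200_000

biased_estimates = []
unbiased_estimates = []
for _ in range(n_trials):
    x = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=N)
    mu_hat = x.mean()
    ss = np.sum((x - mu_hat) ** 2)
    biased_estimates.append(ss / N)          # divide by N (biased, Eq. (14)-style)
    unbiased_estimates.append(ss / (N - 1))  # divide by N - 1 (Eq. (17), Bessel's correction)

print("mean of biased estimator:  ", np.mean(biased_estimates))    # ~ (N-1)/N * sigma^2 = 3.2
print("mean of unbiased estimator:", np.mean(unbiased_estimates))  # ~ sigma^2 = 4.0
```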
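Similarly, a quick empirical check of Eqs. (18) and (19) (again a sketch of ours assuming NumPy): the sample mean and sample variance of Bernoulli draws approach p and p(1 − p), respectively.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

p = 0.3
# Binomial with a single trial (n=1) gives Bernoulli(p) draws: 1 w.p. p, 0 w.p. 1-p.
x = rng.binomial(n=1, p=p, size=1_000_000)

print("empirical mean:    ", x.mean())  # ~ p = 0.3
print("empirical variance:", x.var())   # ~ p(1 - p) = 0.21
```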