Simulating Data with SAS ®
Total Page:16
File Type:pdf, Size:1020Kb
Simulating Data with SAS ® Rick Wicklin From Simulating Data with SAS®. Full book available for purchase here. Contents Acknowledgments v I Essentials of Simulating Data 1 Chapter 1. Introduction to Simulation ................................................................................................... 3 Chapter 2. Simulating Data from Common Univariate Distributions ................................................. 11 Chapter 3. Preliminary and Background Information ......................................................................... 29 II Basic Simulation Techniques 49 Chapter 4. Simulating Data to Estimate Sampling Distributions ......................................................... 51 Chapter 5. Using Simulation to Evaluate Statistical Techniques ......................................................... 73 Chapter 6. Strategies for Efficient and Effective Simulation ............................................................... 93 III Advanced Simulation Techniques 107 Chapter 7. Advanced Simulation of Univariate Data ......................................................................... 109 Chapter 8. Simulating Data from Basic Multivariate Distributions ................................................... 129 Chapter 9. Advanced Simulation of Multivariate Data ...................................................................... 153 Chapter 10. Building Correlation and Covariance Matrices ................................................................. 175 IV Applications of Simulation in Statistical Modeling 195 Chapter 11. Simulating Data for Basic Regression Models ................................................................. 197 Chapter 12. Simulating Data for Advanced Regression Models .......................................................... 225 Chapter 13. Simulating Data from Times Series Models ..................................................................... 251 Chapter 14. Simulating Data from Spatial Models ............................................................................... 263 Chapter 15. Resampling and Bootstrap Methods ................................................................................. 281 Chapter 16. Moment Matching and the Moment-Ratio Diagram ......................................................... 297 V Appendix 323 Appendix A. A SAS/IML Primer .......................................................................................................... 325 Index 339 From Simulating Data with SAS® by Rick Wicklin. Copyright © 2013, SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED. From Simulating Data with SAS®. Full book available for purchase here. Chapter 2 Simulating Data from Common Univariate Distributions Contents 2.1 Introduction to Simulating Univariate Data . 11 2.2 Getting Started: Simulate Data from the Standard Normal Distribution . 12 2.3 Template for Simulating Univariate Data in the DATA Step . 13 2.4 Simulating Data from Discrete Distributions . 14 2.4.1 The Bernoulli Distribution . 14 2.4.2 The Binomial Distribution . 15 2.4.3 The Geometric Distribution . 16 2.4.4 The Discrete Uniform Distribution . 17 2.4.5 Tabulated Distributions . 18 2.4.6 The Poisson Distribution . 19 2.5 Simulating Data from Continuous Distributions . 20 2.5.1 The Normal Distribution . 21 2.5.2 The Uniform Distribution . 22 2.5.3 The Exponential Distribution . 22 2.6 Simulating Univariate Data in SAS/IML Software . 24 2.6.1 Simulating Discrete Data . 24 2.6.2 Sampling from Finite Sets . 25 2.6.3 Simulating Continuous Data . 26 2.7 Univariate Distributions Supported in SAS Software . 27 2.8 References . 28 2.1 Introduction to Simulating Univariate Data There are three primary ways to simulate data in SAS software: • Use the DATA step to simulate data from univariate and uncorrelated multivariate distributions. You can use the RAND function to generate random values from more than 20 standard univariate distributions. You can combine these elementary distributions to build more complicated distributions. 12 Chapter 2: Simulating Data from Common Univariate Distributions • Use the SAS/IML language to simulate data from many distributions, including correlated multivariate distributions. You can use the RANDGEN subroutine to generate random values from standard univariate distributions, or you can use several predefined modules to generate data from multivariate distributions. You can extend the SAS/IML language by defining new functions that sample from distributions that are not built into SAS. • Use specialized procedures in SAS/STAT software and SAS/ETS software to simulate data with special properties. Procedures that generate random samples include the SIMNORMAL, SIM2D, and COPULA procedures. This chapter describes the two most important techniques that are used to simulate data in SAS software: the DATA step and the SAS/IML language. Although the DATA step is a useful tool for simulating univariate data, SAS/IML software is more powerful for simulating multivariate data. To learn how to use the SAS/IML language effectively, see Wicklin (2010). Most of the terminology in this book is standard. However, a term that you might not be familiar with is the term random variate. A random variate is a particular outcome of a random variable (Devroye 1986). For example, let X be a Bernoulli random variable that takes on the value 1 with probability p and the value 0 with probability 1 p. If you draw five observations from the probability distribution, you might obtain the values 0; 1; 1; 0; 1. Those five numbers are random variates. This book also uses the terms “simulated values” and “simulated data.” Some authors refer to simulated data as “fake data.” 2.2 Getting Started: Simulate Data from the Standard Normal Distribution To “simulate data” means to generate a random sample from a distribution with known properties. Because an example is often an effective way to convey main ideas, the following DATA step generates a random sample of 100 observations from the standard normal distribution. Figure 2.1 shows the first five observations. data Normal(keep=x); call streaminit(4321); /* Step 1 */ do i = 1 to 100; /* Step 2 */ x = rand("Normal"); /* Step 3 */ output; end; run; proc print data=Normal(obs=5); run; 2.3: Template for Simulating Univariate Data in the DATA Step 13 Figure 2.1 A Few Observations from a Normal Distribution Obs x 1 1.24067 2 -0.53532 3 -1.01394 4 0.68965 5 -0.32458 The DATA step consists of three steps: 1. Set the seed value with the STREAMINIT function. Seeds for random number generation are discussed further in Section 3.3. 2. Use a DO loop to iterate 100 times. 3. For each iteration, call the RAND function to generate a random value from the standard normal distribution. If you change the seed value, you will get a different random sample. If you change the number 100, you will get a sample with a different number of observations. To get a nonnormal distribution, change the name of the distribution from “Normal” to one of the families listed in Section 2.7. Some distributions, including the normal distribution, include parameters that you can specify after the name. 2.3 Template for Simulating Univariate Data in the DATA Step It is easy to generalize the example in the previous section. The following SAS pseudocode shows a basic template that you can use to generate N observations with a specified distribution: %let N = 100; /* size of sample */ data Sample(keep=x); call streaminit(4321); /* or use a different seed */ do i = 1 to &N; /* &N is the value of the N macro var */ /* specify distribution and parameters */ x = rand("DistribName", param1, param2, ...); output; end; run; The simulated data are written to the Sample data set. The macro variable N is defined in order to emphasize the role of that parameter. The expression &N is replaced by the value of the macro parameter (here, 100) before the DATA step is run. 14 Chapter 2: Simulating Data from Common Univariate Distributions The (pseudo) DATA step demonstrates the following steps for simulating data: 1. A call to the STREAMINIT subroutine, which specifies the seed that initializes the random number stream. When the argument is a positive integer, as in this example, the random sequence is reproducible. If you specify 0 as the argument, the random number sequence is initialized from your computer’s internal system clock. This implies that the random sequence will be different each time that you run the program. Seeds for random number generation are discussed in Section 3.3. 2. A DO loop that iterates N times. 3. A call to the RAND function, which generates one random value each time that the function is called. The first argument is the name of a distribution. The supported distributions are enumerated in Section 2.7. Subsequent arguments are parameter values for the distribution. 2.4 Simulating Data from Discrete Distributions When the set of possible outcomes is finite or countably infinite (like the integers), assigning a probability to each outcome creates a discrete probability distribution. Of course, the sum of the probabilities over all outcomes is unity. The following sections generate a sample of size N 100 from some well-known discrete D distributions. The code is followed by a frequency plot of the sample, which is overlaid with the exact probabilities of obtaining each value. You can use PROC FREQ to compute the empirical distribution of the data; the exact probabilities are obtained from the probability mass function (PMF) of the distribution. Section 3.4.2