Sampling Distribution of a Normal Variable


Ismor Fischer, 5/29/2012

5.2 Formal Statement and Examples

Given a random variable X, suppose that the population distribution of X is known to be normal, with mean µ and variance σ², that is, X ~ N(µ, σ). Then, for any sample size n, it follows that the sampling distribution of X̄ is normal, with mean µ and variance σ²/n, that is, X̄ ~ N(µ, σ/√n).

Comments:

• σ/√n is called the "standard error of the mean," denoted SEM, or more simply, s.e.

• The corresponding Z-score transformation formula is

      Z = (X̄ − µ) / (σ/√n) ~ N(0, 1).

Example: Suppose that the ages X of a certain population are normally distributed, with mean µ = 27.0 years and standard deviation σ = 12.0 years, i.e., X ~ N(27, 12). The probability that the age of a single randomly selected individual is less than 30 years is

      P(X < 30) = P(Z < (30 − 27)/12) = P(Z < 0.25) = 0.5987.

[Figure: normal curve of X ~ N(27, 12), with the area to the left of 30 shaded.]

Now consider all random samples of size n = 36 taken from this population. By the above, their mean ages X̄ are also normally distributed, with mean µ = 27 yrs as before, but with standard error σ/√n = 12/√36 = 2 yrs. That is, X̄ ~ N(27, 2). The probability that the mean age of a single sample of n = 36 randomly selected individuals is less than 30 years is

      P(X̄ < 30) = P(Z < (30 − 27)/2) = P(Z < 1.5) = 0.9332.

[Figure: normal curve of X̄ ~ N(27, 2), with the area to the left of 30 shaded.]

Thus, in this population, the probability that the average age of 36 random people is under 30 years old is much greater than the probability that the age of one random person is under 30 years old.

Exercise: Compare the two probabilities of being under 24 years old.

Exercise: Compare the two probabilities of being between 24 and 30 years old.

If X ~ N(µ, σ) only approximately, then X̄ ~ N(µ, σ/√n) approximately. (The larger the value of n, the better the approximation.) In fact, more is true...

IMPORTANT GENERALIZATION: The Central Limit Theorem

Given any random variable X, discrete or continuous, with finite mean µ and finite variance σ², then regardless of the shape of the population distribution of X, as the sample size n gets larger, the sampling distribution of X̄ becomes increasingly closer to normal, with mean µ and variance σ²/n; that is, X̄ ~ N(µ, σ/√n), approximately. More formally,

      Z = (X̄ − µ) / (σ/√n) → N(0, 1) as n → ∞.

Intuitively, there is less variation between different sample mean values than there is between different population values. This formal result states that, under very general conditions, the sampling variability is usually much smaller than the population variability, and it also gives the precise form of the "limiting distribution" of the statistic.

What if the population standard deviation σ is unknown? Then it can be replaced by the sample standard deviation s, provided n is large. That is, X̄ ~ N(µ, s/√n) approximately, if n ≥ 30 or so, for "most" distributions (... but see the example below). Since the value s/√n is a sample-based estimate of the true standard error s.e., it is commonly denoted ŝ.e.

Because the mean µ_X̄ of the sampling distribution is equal to the mean µ_X of the population distribution, i.e., E[X̄] = µ_X, we say that X̄ is an unbiased estimator of µ_X. In other words, the sample mean is an unbiased estimator of the population mean. By contrast, a biased estimator is a statistic θ̂ whose expected value consistently overestimates or underestimates its intended population parameter θ.

Many other versions of the CLT exist, related to the so-called Laws of Large Numbers.
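The two probabilities in the age example are easy to check numerically. Below is a minimal sketch in Python, assuming SciPy is available; norm.cdf evaluates the normal cumulative distribution function.

```python
# Numerical check of the age example: P(X < 30) vs. P(X-bar < 30).
from math import sqrt

from scipy.stats import norm

mu, sigma, n = 27.0, 12.0, 36

# One randomly selected individual: X ~ N(27, 12)
p_individual = norm.cdf(30, loc=mu, scale=sigma)

# Sample mean of n = 36 individuals: X-bar ~ N(27, 12/sqrt(36)) = N(27, 2)
sem = sigma / sqrt(n)  # standard error of the mean
p_mean = norm.cdf(30, loc=mu, scale=sem)

print(f"P(X < 30)     = {p_individual:.4f}")  # 0.5987
print(f"P(X-bar < 30) = {p_mean:.4f}")        # 0.9332
```

The same two calls with 24 in place of 30, or differences of two cdf evaluations, handle the exercises above.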
Example: Consider an (infinite) population of paper notes, 50% of which are blank, 30% of which are ten-dollar bills, and the remaining 20% of which are twenty-dollar bills.

Experiment 1: Randomly select a single note from the population.

Random variable: X = $ amount obtained

    x     f(x) = P(X = x)
    0           .5
   10           .3
   20           .2

[Figure: probability histogram of f(x), bars of height .5, .3, .2.]

Mean: µ_X = E[X] = (.5)(0) + (.3)(10) + (.2)(20) = $7.00

Variance: σ²_X = E[(X − µ_X)²] = (.5)(−7)² + (.3)(3)² + (.2)(13)² = 61

Standard deviation: σ_X = $7.81

Experiment 2: Each of n = 2 people randomly selects a note, and they split the winnings.

Random variable: X̄ = $ sample mean amount obtained per person

  x̄            0       5       10      5       10      15      10      15      20
  (x1, x2)     (0,0)   (0,10)  (0,20)  (10,0)  (10,10) (10,20) (20,0)  (20,10) (20,20)
  Probability  .25     .15     .10     .15     .09     .06     .10     .06     .04

Each probability is the product of the two selection probabilities; for example, P(0, 10) = (.5)(.3) = .15. Collecting equal values of x̄ gives the sampling distribution:

    x̄     f(x̄) = P(X̄ = x̄)
    0     .25
    5     .30 = .15 + .15
   10     .29 = .10 + .09 + .10
   15     .12 = .06 + .06
   20     .04

[Figure: probability histogram of f(x̄).]

Mean: µ_X̄ = (.25)(0) + (.30)(5) + (.29)(10) + (.12)(15) + (.04)(20) = $7.00 = µ_X !!

Variance: σ²_X̄ = (.25)(−7)² + (.30)(−2)² + (.29)(3)² + (.12)(8)² + (.04)(13)² = 30.5 = 61/2 = σ²_X / n !!

Standard deviation: σ_X̄ = $5.52 = σ_X / √n !!

Experiment 3: Each of n = 3 people randomly selects a note, and they split the winnings.

Random variable: X̄ = $ sample mean amount obtained per person

There are now 3³ = 27 outcomes (x1, x2, x3), each with probability equal to the product of its three selection probabilities:

  x̄            0        3.33     6.67     3.33     6.67      10        6.67     10        13.33
  (x1,x2,x3)   (0,0,0)  (0,0,10) (0,0,20) (0,10,0) (0,10,10) (0,10,20) (0,20,0) (0,20,10) (0,20,20)
  Probability  .125     .075     .050     .075     .045      .030      .050     .030      .020

  x̄            3.33     6.67      10        6.67      10         13.33      10        13.33      16.67
  (x1,x2,x3)   (10,0,0) (10,0,10) (10,0,20) (10,10,0) (10,10,10) (10,10,20) (10,20,0) (10,20,10) (10,20,20)
  Probability  .075     .045      .030      .045      .027       .018       .030      .018       .012

  x̄            6.67     10        13.33     10        13.33      16.67      13.33     16.67      20
  (x1,x2,x3)   (20,0,0) (20,0,10) (20,0,20) (20,10,0) (20,10,10) (20,10,20) (20,20,0) (20,20,10) (20,20,20)
  Probability  .050     .030      .020      .030      .018       .012       .020      .012       .008

Collecting equal values of x̄ gives the sampling distribution:

    x̄       f(x̄) = P(X̄ = x̄)
    0.00    .125
    3.33    .225 = .075 + .075 + .075
    6.67    .285 = .050 + .045 + .050 + .045 + .045 + .050
   10.00    .207 = .030 + .030 + .030 + .027 + .030 + .030 + .030
   13.33    .114 = .020 + .018 + .018 + .020 + .018 + .020
   16.67    .036 = .012 + .012 + .012
   20.00    .008

[Figure: probability histogram of f(x̄).]

Mean: µ_X̄ = Exercise = $7.00 = µ_X !!!

Variance: σ²_X̄ = Exercise = 20.333 = 61/3 = σ²_X / n !!!

Standard deviation: σ_X̄ = $4.51 = σ_X / √n !!!

The tendency toward a normal distribution becomes stronger as the sample size n gets larger, despite the mild skew in the original population values. This is an empirical consequence of the Central Limit Theorem. For most such distributions, n ≥ 30 or so is sufficient for a reasonable normal approximation to the sampling distribution. In fact, if the distribution is symmetric, then convergence to a bell curve can often be seen for much smaller n, say only n = 5 or 6.
Recall also, from the first result in this section, that if the population is normally distributed (with known σ), then so is the sampling distribution, for any n. BUT BEWARE....

However, if the population distribution of X is highly skewed, then the sampling distribution of X̄ can be highly skewed as well (especially if n is not very large); i.e., relying on the CLT can be risky! (Although sometimes a transformation, such as ln(X) or √X, can restore a bell shape to the values. Later...)

Example: The two graphs on the bottom of this page are simulated sampling distributions for the highly skewed population shown below. Both are density histograms based on the means of 1000 random samples; the first corresponds to samples of size n = 30, the second to n = 100. Note that skew is still present!

[Figure: highly skewed population distribution, followed by the two simulated density histograms of sample means, for n = 30 and n = 100.]
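The simulation behind those two histograms is easy to reproduce. The exact population pictured above is not recoverable from the figure, so the sketch below assumes a lognormal population as a stand-in for "highly skewed"; the point is the residual skew in the sample means at n = 30 and even n = 100.

```python
# Simulated sampling distributions of X-bar for a highly skewed population.
# The original population is not recoverable from the figure, so a
# lognormal distribution is assumed here as a stand-in.
import numpy as np

rng = np.random.default_rng(seed=0)

def skewness(a):
    """Sample skewness: third standardized central moment."""
    a = np.asarray(a)
    return np.mean((a - a.mean()) ** 3) / a.std() ** 3

for n in (30, 100):
    # Means of 1000 random samples of size n, as in the example above.
    samples = rng.lognormal(mean=0.0, sigma=1.5, size=(1000, n))
    means = samples.mean(axis=1)
    print(f"n = {n:3d}: skewness of the 1000 sample means = {skewness(means):+.2f}")
```

Both skewness values come out clearly positive, so the bell shape promised by the CLT has not yet arrived even at n = 100, exactly the caution raised above.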