
Basic Statistical Inference for Survey Data
Professor Ron Fricker
Naval Postgraduate School, Monterey, California

Goals for this Lecture
• Review of descriptive statistics
• Review of basic statistical inference
  – Point estimation
  – Sampling distributions and the standard error
  – Confidence intervals for the mean
  – Hypothesis tests for the mean
• Compare and contrast classical statistical assumptions to survey data requirements
• Discuss how to adapt methods to survey data with basic sample designs

Two Roles of Statistics
• Descriptive: describing a sample or population
  – Numerical (mean, variance, mode)
  – Graphical (histogram, boxplot)
• Inferential: using a sample to infer facts about a population
  – Estimating (e.g., estimating the average starting salary of those with systems engineering Master's degrees)
  – Testing theories (e.g., evaluating whether a Master's degree increases income)
  – Building models (e.g., modeling the relationship of how an advanced degree increases income)

A descriptive statistics question: What was the average survey response to question 7?
An inferential question: Given the sample, what can we say about the average response to question 7 for the population?

Lots of Descriptive Statistics
• Numerical
  – Measures of location: mean, median, trimmed mean, percentiles
  – Measures of variability: variance, standard deviation, range, inter-quartile range
  – Measures for categorical data: mode, proportions
• Graphical
  – Continuous: histograms, boxplots, scatterplots
  – Categorical: bar charts, pie charts

Continuous Data: Sample Mean, Variance, and Standard Deviation
• The sample average or sample mean is a measure of location or central tendency:
  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• The sample variance is a measure of variability:
  $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
• The standard deviation is the square root of the variance: $s = \sqrt{s^2}$

Statistical Inference
• The sample mean, $\bar{x}$, and the sample variance, $s^2$, are statistics calculated from data
• Sample statistics used to estimate the true value of a population quantity are called estimators
• Point estimation: estimate a population statistic with a sample statistic
• Interval estimation: estimate a population statistic with an interval
  – Incorporates uncertainty in the sample statistic
• Hypothesis tests: test theories about the population based on evidence in the sample data
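As a concrete companion to the formulas above, the following Python sketch (not part of the original slides) computes the sample mean, sample variance, and standard deviation by hand and checks them against NumPy's built-in estimators. The data values are made up purely for illustration.

```python
# Minimal sketch: sample mean, variance, and standard deviation as defined
# on the "Continuous Data" slide. The responses below are hypothetical.
import numpy as np

x = np.array([5.2, 4.8, 6.1, 5.5, 4.9, 5.8, 6.3, 5.1])  # made-up survey responses

n = len(x)
xbar = x.sum() / n                        # sample mean
s2 = ((x - xbar) ** 2).sum() / (n - 1)    # sample variance (n - 1 divisor)
s = np.sqrt(s2)                           # sample standard deviation

# NumPy's estimators with ddof=1 use the same n - 1 divisor
assert np.isclose(s2, x.var(ddof=1))
assert np.isclose(s, x.std(ddof=1))

print(f"mean = {xbar:.3f}, variance = {s2:.3f}, sd = {s:.3f}")
```

Note that `ddof=1` matters here: NumPy's default divisor is n, not the n - 1 used in the sample variance formula above.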
Classical Statistical Assumptions vs. Survey Practice / Requirements
• Basic statistical methods assume:
  – The population is of infinite size (or so large as to be essentially infinite)
  – The sample size is a small fraction of the population
  – The sample is drawn from the population via SRS
• In surveys:
  – The population is always finite (though it may be very large)
  – The sample could be a sizeable fraction of the population
    • "Sizeable" is roughly > 5%
  – Sampling may be complex

Point Estimation (1)
• Example: use the sample mean or proportion to estimate the population mean or proportion
• Using SRS or a self-weighting sampling scheme, the usual estimators for the mean calculated in all stat software packages are generally fine
  – Assuming no other adjustments are necessary (e.g., nonresponse, poststratification, etc.)
• Except under SRS, the usual point estimates for the standard deviation are almost always wrong

Point Estimation (2)
• Naïve analyses just present sample statistics for the means and/or proportions
  – Perhaps with some intuitive sense that the sample statistics are a measure of the population
  – But they often don't account for the sample design
• However, point estimates alone provide no information about sampling uncertainty
  – If you did another survey, how much might its results differ from the current results?
• Also, even for the mean, if the sample design is not self-weighting, the software estimators need to be adjusted

Sampling Distributions
• Abstract from people and surveys to random variables and their distributions
• The sampling distribution is the probability distribution of a sample statistic
  [Figure: the distribution of individual observations, with standard deviation σ, compared to the sampling distribution of the mean of 5 observations; standard error $\sigma_{\bar{X}} = \sigma/\sqrt{5}$]

Demonstrating Randomness
http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html

Simulating Sampling Distributions
http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html

Central Limit Theorem (CLT) for the Sample Mean
• Let $X_1, X_2, \ldots, X_n$ be a random sample from any distribution with mean $\mu$ and standard deviation $\sigma$
• For large sample size n, the distribution of the sample mean is approximately normal
  – with mean $\mu$, and
  – standard deviation $\sigma/\sqrt{n}$
• The larger the value of n, the better the approximation

Example: Sums of Dice Rolls
[Figure: histograms of the outcome of a single die roll and of the sums of 2, 5, and 10 dice; as the number of dice grows, the distribution of the sum looks increasingly bell-shaped]

Demonstrating Sampling Distributions and the CLT
http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html

Interval Estimation for µ
• The best estimate for $\mu$ is $\bar{X}$
• But $\bar{X}$ will never be exactly $\mu$
  – Further, there is no way to tell how far off it is
• BUT we can estimate $\mu$'s location with an interval and be right some of the time
  – Narrow intervals: higher chance of being wrong
  – Wide intervals: less chance of being wrong, but also less useful
• AND with confidence intervals (CIs) we can define the probability that the interval "covers" $\mu$!

Confidence Intervals: Main Idea
• Based on the normal distribution, we know $\bar{X}$ is within 2 standard errors of $\mu$ 95% of the time
• Alternatively, $\mu$ is within 2 standard errors of $\bar{X}$ 95% of the time
  [Figure: the (unobserved) population distribution and the (unobserved) distribution of the sample mean (mean of 5 observations), along with the unobserved population mean, the sample mean, and a 95% confidence interval for the population mean]
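The sampling-distribution ideas above are easy to check by simulation. The following Python sketch (not part of the original slides) mimics the dice example: it draws many samples of n = 5 die rolls, and shows that the standard deviation of the simulated sample means is close to $\sigma/\sqrt{n}$ and that most sample means fall within 2 standard errors of $\mu$. NumPy is assumed; the number of replications is arbitrary.

```python
# Minimal sketch: simulating the sampling distribution of the mean of
# n = 5 fair die rolls and comparing it to what the CLT predicts.
import numpy as np

rng = np.random.default_rng(1)
n = 5                  # observations per sample
reps = 100_000         # number of simulated samples (arbitrary)

rolls = rng.integers(1, 7, size=(reps, n))   # fair die rolls: a non-normal population
sample_means = rolls.mean(axis=1)

mu = 3.5                     # population mean of one die roll
sigma = np.sqrt(35 / 12)     # population sd of one die roll
se = sigma / np.sqrt(n)      # standard error of the mean

print("sd of simulated sample means:", sample_means.std(ddof=1))
print("sigma / sqrt(n):             ", se)

# Fraction of sample means within 2 standard errors of mu
print("within 2 SEs of mu:          ", np.mean(np.abs(sample_means - mu) < 2 * se))
```

Because the mean of only five discrete die rolls is not yet perfectly normal, the last fraction comes out a little above the normal-theory value of roughly 95%; with larger n it moves closer, consistent with the CLT.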
A Simulation
[Figure: 50 simulated 95% confidence intervals for a population with mean = 50 and sd = 10, each based on n = 5 observations; 2 of the intervals do not include the population mean]

Another Simulation
[Figure: 200 simulated 95% confidence intervals for a population with mean = 50 and sd = 10; 10 of the intervals do not include the population mean]

Confidence Interval for µ (1)
• For $\bar{X}$ from a sample of size n from a population with mean $\mu$,
  $T = \frac{\bar{X} - \mu}{s/\sqrt{n}}$
  has a t distribution with n − 1 "degrees of freedom"
  – Exactly, if the population has a normal distribution
  – Approximately, for the sample mean via the CLT
• Use the t distribution to build a CI for the mean:
  $\Pr\left(-t_{\alpha/2,\,n-1} < T < t_{\alpha/2,\,n-1}\right) = 1 - \alpha$

Review: the t Distribution
[Figure: the standard normal density compared with t densities having 3, 10, and 100 degrees of freedom; the horizontal axis is the number of standard errors from the mean]

Confidence Interval for µ (2)
• Flip the probability statement around to get a confidence interval:
  (1) $\Pr\left(-t_{\alpha/2,\,n-1} < T < t_{\alpha/2,\,n-1}\right) = 1 - \alpha$
  (2) $\Pr\left(-t_{\alpha/2,\,n-1} < \frac{\bar{X} - \mu}{s/\sqrt{n}} < t_{\alpha/2,\,n-1}\right) = 1 - \alpha$
  (3) $\Pr\left(\bar{X} - t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} < \mu < \bar{X} + t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}\right) = 1 - \alpha$

Example: Constructing a 95% Confidence Interval for µ
• Choose the confidence level: 1 − α
• Remember the degrees of freedom: ν = n − 1
• Find $t_{\alpha/2,\,n-1}$
  – Example: if α = 0.05 and df = 7, then $t_{0.025,7} = 2.365$
• Calculate $\bar{X}$ and $s/\sqrt{n}$
• Then
  $\Pr\left(\bar{X} - 2.365\frac{s}{\sqrt{n}} < \mu < \bar{X} + 2.365\frac{s}{\sqrt{n}}\right) = 0.95$

Hypothesis Tests
• The basic idea is to test a hypothesis or theory against empirical evidence from a sample
  – E.g., "The fraction of new students aware of the school discrimination policy is less than 75%."
  – Does the data support or refute the assertion?
  [Figure: if we assume p = 0.75 is true, how likely are we to see the observed $\hat{p} = 0.832$ (or something more extreme)? That probability is the p-value; if it is small, we don't believe our assumption.]

One-Sample, Two-sided t-Test
• Hypotheses: $H_0\!: \mu = \mu_0$ versus $H_a\!: \mu \neq \mu_0$
• Standardized test statistic: $t = \frac{\bar{X} - \mu_0}{\sqrt{s^2/n}}$
• p-value = $\Pr(T < -|t|) + \Pr(T > |t|) = \Pr(|T| > |t|)$, where T follows a t distribution with n − 1 degrees of freedom
• Reject $H_0$ if p < α, where α is the predetermined significance level

One-Sample, One-sided t-Tests
• Hypotheses: $H_0\!: \mu = \mu_0$ versus $H_a\!: \mu < \mu_0$, or $H_0\!: \mu = \mu_0$ versus $H_a\!: \mu > \mu_0$
• Standardized test statistic: $t = \frac{\bar{X} - \mu_0}{\sqrt{s^2/n}}$
• p-value = $\Pr(T < t)$ or p-value = $\Pr(T > t)$, depending on $H_a$, where T follows a t distribution with n − 1 degrees of freedom
• Reject $H_0$ if p < α

Applying Continuous Methods to Binary Survey Questions
• Surveys often have binary questions, where we want to infer the proportion of the population in one category or the other
• Code the binary responses as a 1/0 variable and, for large n, appeal to the CLT
  – A confidence interval for the mean is a CI on the proportion of "1"s
  – A t-test for the mean is a hypothesis test on the proportion of "1"s

Applying Continuous Methods to Likert Scale Survey Data
• Likert scale data is inherently categorical
• If willing to assume the "distance" between categories is equal, then the categories can be coded with integers and the CLT applied:
  – Strongly agree = 1, Agree = 2, Neutral = 3, Disagree = 4, Strongly disagree = 5

Adjusting Standard Errors (for Basic Survey Sample Designs)
• Sample > 5% of the population: finite population correction
  – Multiply the standard error by $\sqrt{(N-n)/N}$
  – E.g.
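Tying the confidence-interval slides to the finite population correction, here is a minimal Python sketch (not part of the original slides) of a t-based 95% CI whose standard error is adjusted by $\sqrt{(N-n)/N}$. NumPy and SciPy are assumed; the response values, the null value µ0 = 3.5, and the population size N are hypothetical.

```python
# Minimal sketch: t-based 95% CI for a mean with the finite population
# correction (FPC) from the last slide. Data, N, and mu0 are hypothetical.
import numpy as np
from scipy import stats

y = np.array([3, 4, 4, 5, 2, 4, 3, 5, 4, 4], dtype=float)  # e.g., Likert-coded responses
N = 80                     # hypothetical population size
n = len(y)                 # sample is n/N = 12.5% > 5% of the population

ybar = y.mean()
se = y.std(ddof=1) / np.sqrt(n)          # usual standard error of the mean
se_fpc = se * np.sqrt((N - n) / N)       # FPC-adjusted standard error

t_crit = stats.t.ppf(0.975, df=n - 1)    # t_{0.025, n-1}
lower, upper = ybar - t_crit * se_fpc, ybar + t_crit * se_fpc
print(f"mean = {ybar:.2f}, 95% CI with FPC = ({lower:.2f}, {upper:.2f})")

# For comparison, a one-sample t-test of H0: mu = 3.5; like most
# general-purpose routines, it ignores the FPC and any survey weights.
t_stat, p_value = stats.ttest_1samp(y, popmean=3.5)
print(f"t = {t_stat:.2f}, p-value = {p_value:.3f}")
```

As the slides caution, general-purpose routines assume SRS from an effectively infinite population, so corrections like the FPC (or weights for non-self-weighting designs) have to be applied separately or handled by survey-aware software.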