Basic Statistical Inference for Survey Data


Basic Statistical Inference for Survey Data
Professor Ron Fricker
Naval Postgraduate School, Monterey, California

Goals for this Lecture
• Review of descriptive statistics
• Review of basic statistical inference
  – Point estimation
  – Sampling distributions and the standard error
  – Confidence intervals for the mean
  – Hypothesis tests for the mean
• Compare and contrast classical statistical assumptions with survey data requirements
• Discuss how to adapt methods to survey data with basic sample designs

Two Roles of Statistics
• Descriptive: describing a sample or population
  – Numerical (mean, variance, mode)
  – Graphical (histogram, boxplot)
• Inferential: using a sample to infer facts about a population
  – Estimating (e.g., estimating the average starting salary of those with systems engineering Master's degrees)
  – Testing theories (e.g., evaluating whether a Master's degree increases income)
  – Building models (e.g., modeling the relationship between an advanced degree and income)
• A descriptive statistics question: What was the average survey response to question 7?
• An inferential question: Given the sample, what can we say about the average response to question 7 for the population?
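As a quick illustration of the descriptive side, the numerical measures above can be computed directly with Python's standard library (the question-7 responses below are invented purely for illustration):

```python
import statistics

# Hypothetical responses to survey question 7 (1-5 coding, invented for illustration)
responses = [4, 5, 3, 4, 4, 2, 5, 4, 3, 4]

# Descriptive statistics: these summarize this sample and nothing more
print("mean:    ", statistics.mean(responses))      # measure of location
print("median:  ", statistics.median(responses))    # measure of location
print("mode:    ", statistics.mode(responses))      # measure for categorical data
print("variance:", statistics.variance(responses))  # measure of variability (n-1 denominator)
```

Answering the inferential question — what these numbers say about the whole population — is what the rest of the lecture builds toward.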
Lots of Descriptive Statistics
• Numerical:
  – Measures of location: mean, median, trimmed mean, percentiles
  – Measures of variability: variance, standard deviation, range, inter-quartile range
  – Measures for categorical data: mode, proportions
• Graphical:
  – Continuous: histograms, boxplots, scatterplots
  – Categorical: bar charts, pie charts

Continuous Data: Sample Mean, Variance, and Standard Deviation
• The sample average or sample mean is a measure of location or central tendency:
  x̄ = (1/n) Σ_{i=1..n} x_i
• The sample variance is a measure of variability:
  s² = (1/(n−1)) Σ_{i=1..n} (x_i − x̄)²
• The standard deviation is the square root of the variance: s = √(s²)

Statistical Inference
• The sample mean x̄ and the sample variance s² are statistics calculated from data
• Sample statistics used to estimate true values in the population are called estimators
• Point estimation: estimate a population quantity with a single sample statistic
• Interval estimation: estimate a population quantity with an interval
  – Incorporates the uncertainty in the sample statistic
• Hypothesis tests: test theories about the population based on evidence in the sample data

Classical Statistical Assumptions vs. Survey Practice / Requirements
• Basic statistical methods assume:
  – The population is of infinite size (or so large as to be essentially infinite)
  – The sample size is a small fraction of the population
  – The sample is drawn from the population via simple random sampling (SRS)
• In surveys:
  – The population is always finite (though it may be very large)
  – The sample may be a sizeable fraction of the population ("sizeable" is roughly > 5%)
  – The sampling design may be complex

Point Estimation (1)
• Example: use the sample mean or proportion to estimate the population mean or proportion
• Under SRS or a self-weighting sampling scheme, the usual estimators for the mean calculated by all statistical software packages are generally fine
  – Assuming no other adjustments (e.g., for nonresponse, poststratification, etc.) are necessary
• Except under SRS, the usual point estimates for the standard deviation are almost always wrong

Point Estimation (2)
• Naïve analyses just present sample statistics for the means and/or proportions
  – Perhaps with some intuitive sense that the sample statistics are a measure of the population
  – But often without accounting for the sample design
• Moreover, point estimates alone provide no information about sampling uncertainty
  – If you did another survey, how much might its results differ from the current results?
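The repeated-sampling idea behind that last question can be simulated in a few lines. This sketch uses a single fair die as the population (echoing the lecture's dice example) and shows that the spread of the sample means comes out near σ/√n:

```python
import random
import statistics

random.seed(1)  # reproducible sketch

# Population: a single fair die, with mean mu = 3.5 and sd sigma ≈ 1.708
die = [1, 2, 3, 4, 5, 6]
sigma = statistics.pstdev(die)

n = 5  # observations per sample
# Draw many samples of size n and record each sample's mean
means = [statistics.mean(random.choices(die, k=n)) for _ in range(20_000)]

# The sd of the sample means (the standard error) is close to sigma / sqrt(n)
print("mean of sample means:", round(statistics.mean(means), 3))   # near mu = 3.5
print("sd of sample means:  ", round(statistics.stdev(means), 3))  # near sigma/sqrt(5) ≈ 0.764
```

Rerunning with larger n shrinks the spread of the means by the √n factor, which is exactly the standard-error behavior the next slides formalize.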
• Also, even for the mean, if the sample design is not self-weighting, the software's default estimators need adjustment

Sampling Distributions
• Abstract from people and surveys to random variables and their distributions
• A sampling distribution is the probability distribution of a sample statistic
  [Figure: the distribution of individual observations, with standard deviation σ, alongside the narrower sampling distribution of means of five observations; the standard error of such a mean is σ_x̄ = σ/√5]
• Demonstrations of randomness and of simulated sampling distributions:
  http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html

Central Limit Theorem (CLT) for the Sample Mean
• Let X1, X2, …, Xn be a random sample from any distribution with mean µ and standard deviation σ
• For large sample size n, the distribution of the sample mean is approximately normal
  – with mean µ and standard deviation σ/√n
• The larger the value of n, the better the approximation

Example: Sums of Dice Rolls
  [Figure: frequency histograms of one roll of a single die and of the sums of 2, 5, and 10 dice; as the number of dice grows, the distribution of the sum looks increasingly bell-shaped]

Interval Estimation for µ
• The best estimate for µ is X̄
• But X̄ will never be exactly µ
  – Further, there is no way to tell how far off it is
• BUT we can estimate µ's location with an interval and be right some of the time
  – Narrow intervals: higher chance of being wrong
  – Wide intervals: less chance of being wrong, but also less useful
• AND with confidence intervals (CIs) we can define the probability that the interval "covers" µ!

Confidence Intervals: Main Idea
• Based on the normal distribution, we know X̄ is within 2 standard errors of µ 95% of the time
• Alternatively, µ is within 2 standard errors of X̄ 95% of the time
  [Figure: the (unobserved) distribution of individual observations, the (unobserved) distribution of the sample mean, the unobserved population mean, a sample mean, and the resulting 95% confidence interval for the population mean]

A Simulation
  [Figure: 95% confidence intervals for 50 samples with mean = 50, sd = 10, n = 5; 2 intervals do not include the population mean]

Another Simulation
  [Figure: 95% confidence intervals for 200 samples with mean = 50, sd = 10, n = 95; 10 intervals do not include the population mean]

Confidence Interval for µ (1)
• For X̄ from a sample of size n from a population with mean µ,
  T = (X̄ − µ) / (s/√n)
  has a t distribution with n−1 "degrees of freedom"
  – Exactly, if the population has a normal distribution
  – Approximately, for the sample mean via the CLT
• Use the t distribution to build a CI for the mean:
  Pr(−t_{α/2, n−1} < T < t_{α/2, n−1}) = 1 − α

Review: the t Distribution
  [Figure: densities of the standard normal and of t distributions with 3, 10, and 100 degrees of freedom, plotted against the number of SEs from the mean; as the degrees of freedom increase, the t density approaches the normal]

Confidence Interval for µ (2)
• Flip the probability statement around to get a confidence interval:
  (1) Pr(−t_{α/2, n−1} < T < t_{α/2, n−1}) = 1 − α
  (2) Pr(−t_{α/2, n−1} < (X̄ − µ)/(s/√n) < t_{α/2, n−1}) = 1 − α
  (3) Pr(X̄ − t_{α/2, n−1} · s/√n < µ < X̄ + t_{α/2, n−1} · s/√n) = 1 − α

Example: Constructing a 95% Confidence Interval for µ
• Choose the confidence level: 1 − α
• Remember that the degrees of freedom ν = n − 1
• Find t_{α/2, n−1}
  – Example: if α = 0.05 and df = 7, then t_{0.025, 7} = 2.365
• Calculate X̄ and s/√n
• Then
  Pr(X̄ − 2.365 · s/√n < µ < X̄ + 2.365 · s/√n) = 0.95

Hypothesis Tests
• The basic idea is to test a hypothesis / theory against empirical evidence from a sample
  – E.g., "The fraction of new students aware of the school discrimination policy is less than 75%."
  – Does the data support or refute the assertion?
  [Figure: a sampling distribution centered at the assumed p = 0.75, with the observed p̂ = 0.832 marked; if we assume p = 0.75 is true, the p-value is the probability of seeing this (or something more extreme) — if it is small, we don't believe our assumption]

One-Sample, Two-Sided t-Test
• Hypotheses: H0: µ = µ0 versus Ha: µ ≠ µ0
• Standardized test statistic: t = (X̄ − µ0) / √(s²/n)
• p-value = Pr(T < −|t| or T > |t|) = Pr(|T| > |t|), where T follows a t distribution with n−1 degrees of freedom
• Reject H0 if p < α, where α is the predetermined significance level

One-Sample, One-Sided t-Tests
• Hypotheses: H0: µ = µ0 with Ha: µ < µ0, or H0: µ = µ0 with Ha: µ > µ0
• Standardized test statistic: t = (X̄ − µ0) / √(s²/n)
• p-value = Pr(T < t) or p-value = Pr(T > t), depending on Ha, where T follows a t distribution with n−1 degrees of freedom
• Reject H0 if p < α

Applying Continuous Methods to Binary Survey Questions
• Surveys often include binary questions, where we want to infer the proportion of the population in one category or the other
• Code the binary responses as a 1/0 variable and, for large n, appeal to the CLT
  – A confidence interval for the mean is then a CI on the proportion of "1"s
  – A t-test for the mean is then a hypothesis test on the proportion of "1"s

Applying Continuous Methods to Likert Scale Survey Data
• Likert scale data is inherently categorical
• If we are willing to assume the "distance" between categories is equal, we can code the categories with integers and appeal to the CLT:
  Strongly agree = 1, Agree = 2, Neutral = 3, Disagree = 4, Strongly disagree = 5

Adjusting Standard Errors (for Basic Survey Sample Designs)
• Sample > 5% of the population: apply the finite population correction
  – Multiply the standard error by √((N − n)/N)
  – E.g.
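A minimal sketch of folding the finite population correction into a confidence interval follows. All the numbers (N, n, the responses) are hypothetical, and the normal quantile stands in for the t quantile since n is large:

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical survey: n = 400 respondents drawn by SRS from a population of
# N = 5000, with a binary question coded 1/0 (all numbers invented)
N, n = 5000, 400
ones = 332                       # respondents answering "yes"
x = [1] * ones + [0] * (n - ones)

xbar = statistics.mean(x)        # point estimate of the population proportion
s = statistics.stdev(x)          # sample sd (n-1 denominator)
se_srs = s / math.sqrt(n)        # usual infinite-population standard error

# Sample is 8% (> 5%) of the population, so apply the fpc
fpc = math.sqrt((N - n) / N)
se_adj = se_srs * fpc

# For n = 400 the t quantile is essentially the normal one (~1.966 vs ~1.960)
z = NormalDist().inv_cdf(0.975)
lo, hi = xbar - z * se_adj, xbar + z * se_adj
print(f"estimate {xbar:.3f}, adjusted SE {se_adj:.4f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

Note that the fpc shrinks the standard error (here by a factor of about 0.96), so ignoring it makes the interval conservatively wide rather than invalid; the reverse is not true for complex designs, where default software standard errors can be badly wrong in either direction.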