<<

Basic Statistical for

Professor Ron Fricker Naval Postgraduate School Monterey, California

1 Goals for this Lecture

• Review of descriptive • Review of basic distributions and the – Confidence intervals for the – Hypothesis tests for the mean • Compare and contrast classical statistical assumptions to survey data requirements • Discuss how to adapt methods to survey data with basic designs

2 Two Roles of Statistics

• Descriptive: Describing a sample or population – Numerical: (mean, , ) – Graphical: (, boxplot) • Inferential: Using a sample to infer facts about a population – Estimating (e.g., estimating the average starting salary of those with Master’s degrees) – Testing theories (e.g., evaluating whether a Master’s degree increases income) – Building models (e.g., modeling the relationship of how an advanced degree increases income) A Question: What was the average survey response to question 7? An Inferential Question: Given the sample, what can we say about the average response to question 7 for the population? Lots of Descriptive Statistics

• Numerical: – Measures of location • Mean, , trimmed mean, – Measures of variability • Variance, , , inter-quartile range – Measures for categorical data • Mode, proportions • Graphical – Continuous: , boxplots, scatterplots – Categorical: Bar charts, pie charts Continuous Data: Sample Mean, Variance, and Standard Deviation

• Sample average or sample mean is a measure of location or : 1 n xx= ∑ i n i=1 • Sample variance is a measure of variability n 2 1 2 s =−∑ ()xxi n − 1 i=1 • Standard deviation is the square root of the variance s = s 2 7 Statistical Inference

• Sample mean, x , and sample variance , s2, are statistics calculated from data • Sample statistics used to estimate true value of population called • Point estimation: estimate a population with a sample statistic • : estimate a population statistic with an interval – Incorporates uncertainty in the sample statistic • Hypothesis tests: test theories about the population based on evidence in the sample data

8 Classical Statistical Assumptions vs. Survey Practice / Requirements

• Basic statistical methods assume: – Population is of infinite size (or so large as to be essentially infinite) – Sample size is a small fraction of the population – Sample is drawn from the population via SRS • In surveys: – Population always finite (though may be very large) – Sample could be sizeable fraction of the population • “Sizeable” is roughly > 5% – Sampling may be complex

9 Point Estimation (1)

• Example: Use sample mean or proportion to estimate population mean or proportion • Using SRS or a self-weighting sampling scheme, usual estimators for the mean calculated in all stat software packages are generally fine – Assuming no other adjustments are necessary • E.g., nonresponse, poststratification, etc • Except under SRS, usual point estimates for standard deviation almost always wrong 10 Point Estimation (2)

• Naïve analyses just present sample statistics for the and/or proportions – Perhaps some intuitive sense that the sample statistics are a measure of the population – But often don’t account for sample design • However, when using point estimates, no about sample uncertainty provided – If you did another survey, how much might its results differ from the current results? • Also, even for mean, if sample design not self-

weighting, need to adjust software estimators11 Sampling Distributions

• Abstract from people and surveys to random variables and their distributions

12 Sampling Distributions

is the distribution of a sample statistic

0.9 0.8 Distribution Sampling Individual 0.7 Mean of 5 individual obs distribution of 0.6 means of five 0.5 with standard obs 0.4 deviation σ 0.3 0.2 Standard error: 0.1 0.0 σ =σ 5 812 813 814 815 816 817 818 X Index

13 Demonstrating

http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html Simulating Sampling Distributions

http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html (CLT) for the Sample Mean

• Let X1, X2, …, Xn be a random sample from any distribution with mean µ and standard deviation σ • For large sample size n, the distribution of the sample mean has approximately a – with mean µ, and σ – standard deviation n • The larger the value of n, the better the approximation

16 Example: Sums of Dice Rolls

Roll of a Single Die

120

100

y 80 nc 60 que One roll e r

F 40

20

0 Sum of Two Dice 123456700 Outcom e 600

500 y

c 400 n e u q

e Sum of 2 rolls 300 Fr

200

100

0 Sum of 5 Dice 234 5678 970 101112 Sum 60

50 y

nc 40 e

u Sum of 5 rolls

q 30 e Fr 20

10

0 Sum of 10 Dice 350 1 3 5 7 9 11 13 15 17 19 21 23 25 Sum 300

250

y Sum of

nc 200 ue q

e 150 r F 10 rolls 100

50

0 5 8 11 14 17 20 23 26 29 32 35 38 41 44 47 50 53 56 59 Sum 17 Demonstrating Sampling Distributions and the CLT

http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html Interval Estimation for µ

• Best estimate for µ is X • But X will never be exactly µ – Further, there is no way to tell how far off • BUT can estimate µ’s location with an interval and be right some of the time – Narrow intervals: higher chance of being wrong – Wide intervals: less chance of being wrong, but also less useful • AND with confidence intervals (CIs) can define the probability the interval “covers” µ! Confidence Intervals: Main Idea

• Based on the normal distribution, we know X is within 2 s.e.s of µ 95% of the time • Alternatively, µ is within 2 s.e.s of X 95% of the time 0.9 (Unobserved) dist. of

0.8 Individual sample mean 0.7 Mean of 5 0.6 Sample 0.5 Unobserved mean 95% confidence 0.4 pop mean interval for pop mean 0.3

0.2 0.1 (Unobserved) dist. 0.0 812 813 814 815 816 817 818 of population 20 Index A

intervals not including population mean: 2

100

90

80

70

60

50

40

30

20

10 1 5 10 15 20 25 30 35 40 45 50 sample 95% Confidence Intervals for mean = 50, sd = 10, n = 5 Another Simulation

intervals not including population mean: 10

80

70

60

50

40

30

20 1 10 20 30 40 50 60 70 80 90 100 110 120130 140 150160 170 180190 200 sample 95% Confidence Intervals for mean = 50, sd = 10, n = 95 for µ (1)

• For X from sample of size n from a population with mean µ, X − µ T = sn/ has a t distribution with n-1 “degrees of freedom” – Precisely if population has normal distribution – Approximately for sample mean via CLT • Use the t distribution to build a CI for the mean:

Pr ()−tTαα/2,nn−−1 <

23 Review: the t Distribution

0.40

normal T3 0.30 T10 T100

0.20

0.10

0.00 -4 -3 -2 -1 0 1 2 3 4 Z= number of SE’s from the mean Confidence Interval for µ (2)

• Flip the probability statement around to get a confidence interval:

(1) Pr ()−

25 Example: Constructing a 95% Confidence Interval for µ

• Choose the confidence level: 1-α • Remember the degrees of freedom (ν) = n -1

•Find tα/2,n−1 – Example: if α = 0.05, df=7 then t0.025,7 = 2.365 • Calculate X and s / n • Then ⎛ s s ⎞ Pr⎜ X − 2.365 < µ < X + 2.365 ⎟ = 0.95 ⎝ n n ⎠ Hypothesis Tests

• Basic idea is to test a hypothesis / theory on empirical evidence from a sample – E.g., “The fraction of new students aware of the school discrimination policy is less than 75%.” – Does the data support or refute the assertion?

This is the probability. If it’s small, we don’t believe our assumption.

p=0.75 pˆ = 0.832 If we assume this is true, how likely are we to see this 27 (or something more extreme)? One-Sample, Two-sided t-Test

• Hypothesis:

H0: µ = µ0 H : µ ≠ µ a 0 X − µ • Standardized : t = 0 sn2 / • p-value = Pr(T<-t and T>t) = Pr(|T|>t), where T follows a t distribution with n-1 degrees of freedom

• Reject H0 if p < α, where α is the predetermined significance level 28 One-Sample, One-sided t-Tests

• Hypotheses: H : µ = µ H : µ = µ 0 0 or 0 0 Ha: µ < µ0 Ha: µ > µ0 X − µ • Standardized test statistic: t = 0 sn2 / • p-value = Pr(T < t) or p-value = Pr(T > t), depending on Ha, where T follows a t distribution with n-1 degrees of freedom

• Reject H0 if p < α

29 Applying Continuous Methods to Binary Survey Questions

• In surveys, often have binary questions, where desire to infer proportion of population in one category or the other • Code binary question responses as 1/0 variable and for large n appeal to the CLT – Confidence interval for the mean is a CI on the proportion of “1”s – T-test for the mean is a hypothesis test on the proportion of “1”s

30 Applying Continuous Methods to Survey Data

• Likert scale data is inherently categorical • If willing to make assumption that “distance” between categories is equal, then can code with integers and appeal to CLT Strongly agree 1 Agree 2 Neutral 3 Disagree 4 Strongly disagree 5

31 Adjusting Standard Errors (for Basic Survey Sample Designs)

• Sample > 5% of population: finite population correction – Multiply the standard error by ()Nn− /N –E.g. s..e(x)=−(Nn)/Ns×/ n

• Stratified sample, weighted sum of the strata : H se..()x = ∑(Nhh/N)Var(x ) h=1

32 What We Have Just Reviewed

• Review of descriptive statistics • Review of basic statistical inference – Point estimation – Sampling distributions and the standard error – Confidence intervals for the mean – Hypothesis tests for the mean • Compare and contrast classical statistical assumptions to survey data requirements • Discuss how to adapt methods to survey data with basic sample designs

33