Lies, damn lies and ... SAS to the rescue!
Peter L. Flom
Peter Flom Consulting
SESUG September, 2015 Broad outline
1 Introduction
2 Descriptive statistics
3 Descriptive graphics
4 Inferential statistics
5 The regression family
6 Multivariate statistics
Part I
Introduction Schedule
8:00 Descriptive statistics
8:40 Break
8:50 Descriptive graphics
9:30 Break
9:40 Inferential statistics
10:10 Break
10:20 The regression family
11:00 Break
11:10 Multivariate statistics
12:00 End
Introductions of participants and self
What I plan to do in this course
Give you a fundamental understanding of some basic statistical methods
Give you a very brief survey of many more advanced methods
Help you learn to work with statistics and statisticians
Give some SAS code you can give to others
Note that the most important stuff is at the beginning, so ask questions!
What I don’t plan to do in this course
Teach you to be a statistician
Teach you SAS
What I want from you
Attention
Questions - after all, it’s not even graded
Feedback - even anonymous is OK, after the course
Part II
Descriptive statistics Introduction
Outline
1 Introduction
2 Measures of central tendency
3 Measures of spread
4 Other measures Introduction
Descriptive vs. inferential stats
Descriptive statistics describe a variable or a sample.
Inferential statistics let you infer from a sample to a population (more later).
Descriptive statistics are necessary even when your goal is inference.
Introduction
Types of descriptive statistics
For continuous variables, descriptive statistics include
Measures of central tendency
Measures of dispersion or spread
Measures of skewness
Measures of kurtosis
Other measures
For categorical variables, we are mostly limited to frequencies.
Measures of central tendency
The mean What it is
Definition The mean is the ordinary average. Add up the numbers and divide by the number of numbers.
Or, if you want a formula
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where x is the variable and there are n values of the variable.
Measures of central tendency The mean What can go wrong
Outliers
Skewness
The clock problem
The rate problem
Different scales
Measures of central tendency The mean Mean salary
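The clock problem listed among the pitfalls of the mean can be sketched in SAS. This is a minimal illustration with made-up data (two events, at 23:00 and 01:00); the dataset and variable names are hypothetical. The ordinary mean of the hours is 12 (noon), while averaging the hours as directions on a circle gives a value at or very near 0 (midnight).

```sas
/* Hypothetical data: two events, at 23:00 and 01:00 on a 24-hour clock */
data clock;
   input hour @@;
   angle = hour * 2 * constant('pi') / 24;   /* hour -> angle in radians */
   s = sin(angle);
   c = cos(angle);
   datalines;
23 1
;
run;

/* Average the hours directly, and average the sine and cosine parts */
proc means data = clock noprint;
   var hour s c;
   output out = avg mean = mean_hour mean_s mean_c;
run;

data avg;
   set avg;
   /* Turn the mean direction back into an hour between 0 and 24 */
   circ_hour = mod(atan2(mean_s, mean_c) * 24 / (2 * constant('pi')) + 24, 24);
run;

proc print data = avg;
   var mean_hour circ_hour;
run;
```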
proc means data = sashelp.baseball maxdec = 2;
   var salary;
run;

Analysis Variable : Salary 1987 Salary in $ Thousands
N     Mean     Std Dev   Minimum   Maximum
263   535.93   451.12    67.50     2460.00
Measures of central tendency The mean Alternatives
1 The median
2 The trimmed mean and Winsorized mean
3 The geometric mean
4 The harmonic mean
which are the topics of the next few slides
Measures of central tendency The median Median salary
The median is simply the value that divides the distribution in half: half are lower, half are higher.
ods select BasicMeasures;
proc univariate data = sashelp.baseball;
   var salary;
run;

Basic Statistical Measures
Location              Variability
Mean     535.9259     Std Deviation         451.11868
Median   425.0000     Variance              203508
Mode     750.0000     Range                 2393
                      Interquartile Range   560.00000
Measures of central tendency The median What can go wrong
Sometimes we want the outliers
When there are many ties, the median may not be completely determined.
Measures of central tendency The trimmed mean and Winsorized mean What it is
A compromise between the mean and the median.
To calculate the trimmed mean, you remove a certain percentage of the highest and lowest points and then find the mean of what remains.
The Winsorized mean is similar but, rather than deleting the points, you set them equal to the lowest or highest values that are not extreme.
Measures of central tendency The trimmed mean and Winsorized mean What can go wrong
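A small worked example of the trimmed and Winsorized means described above, using five made-up values and removing (or replacing) one point in each tail, i.e. 20% per tail:

```latex
\begin{aligned}
\text{data} &: 1,\ 2,\ 3,\ 4,\ 100 \\
\text{mean} &= \frac{1 + 2 + 3 + 4 + 100}{5} = 22 \\
\text{trimmed mean} &= \frac{2 + 3 + 4}{3} = 3 \\
\text{Winsorized mean} &= \frac{2 + 2 + 3 + 4 + 4}{5} = 3 \\
\text{median} &= 3
\end{aligned}
```

With one large outlier, both compromise measures land on the median here, while the mean is pulled far above every value but one.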
If the distribution is skewed, the trimmed mean is not an unbiased estimator for either the mean or median. Measures of central tendency The trimmed mean and Winsorized mean Trimmed and winsorized mean salary
ods select TrimmedMeans WinsorizedMeans;
proc univariate data = sashelp.baseball trimmed = .1 winsorized = .1;
   var salary;
run;

% trimmed per tail   N    SE      SE      Trimmed mean   Winsorized mean
10.27                27   25.08   25.09   463.89         486.04
Measures of central tendency The geometric mean What it is
Definition It’s like the mean, except instead of adding the numbers and then dividing by the count, you multiply the numbers and take the nth root of the product
or, if you want a formula
$$\left(\prod_{i=1}^{n} x_i\right)^{1/n}$$
Measures of central tendency The geometric mean What can go wrong
Doesn’t work when any value is 0 or negative Measures of central tendency The geometric mean When to use it
Useful for combining measures on different scales, e.g. candidates for college: combine SAT (0 to 1600) and HS GPA (0 to 4)
Proportional growth over a series of times
Measures of central tendency The geometric mean Geometric mean of college applicants
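The proportional-growth use of the geometric mean can be sketched as follows. The three yearly growth factors are made up for illustration, and the dataset and variable names are hypothetical. The geometric mean is the constant yearly factor that reproduces the total growth; the arithmetic mean overstates it.

```sas
/* Hypothetical growth factors for three years: +10%, +50%, -20% */
data growth;
   input y1 y2 y3;
   gmean = geomean(of y1-y3);   /* average growth factor per year */
   amean = mean(of y1-y3);      /* larger than gmean: overstates compound growth */
   total = y1 * y2 * y3;        /* actual three-year growth factor */
   check = gmean ** 3;          /* equals total, up to rounding */
   datalines;
1.10 1.50 0.80
;
run;

proc print data = growth;
run;
```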
data college;
   input name $ GPA SAT @@;
   datalines;
Jill 3.0 1550 Joe 4.0 1500
;
data college;
   set college;
   gmean = geomean(GPA, SAT);
   amean = mean(GPA, SAT);
run;
proc print data = college;
run;

Obs   name   GPA   SAT    gmean     amean
1     Jill   3     1550   68.1909   776.5
2     Joe    4     1500   77.4597   752.0
Measures of central tendency The harmonic mean Harmonic mean of round trip travel
Definition It is the reciprocal of the arithmetic mean of the reciprocals of a set of numbers.
$$H = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$$
Measures of central tendency The harmonic mean When to use it
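A short derivation of why the harmonic mean is the right average for speeds over equal distances. The symbol d (the one-way distance) is introduced here for illustration; the speeds 50 and 80 are the first row of the round-trip example in this section.

```latex
\begin{aligned}
\text{average speed} &= \frac{\text{total distance}}{\text{total time}}
  = \frac{2d}{\frac{d}{x_1} + \frac{d}{x_2}}
  = \frac{2}{\frac{1}{x_1} + \frac{1}{x_2}} = H \\
x_1 = 50,\ x_2 = 80 &: \quad H = \frac{2}{0.02 + 0.0125} = \frac{2}{0.0325} \approx 61.54
\end{aligned}
```

The distance d cancels, so the result does not depend on how long the trip is, only on the two legs being equal in length.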
Averaging rates, such as speeds or batting averages
Averaging ratios, such as price-earnings ratios
Measures of central tendency The harmonic mean What can go wrong
Like the geometric mean, it doesn’t work with negative numbers or 0’s. Measures of central tendency The harmonic mean SAS code
data speed;
   input To From @@;
   datalines;
50 80 40 70
;
data speed;
   set speed;
   hmean = harmean(to, from);
   amean = mean(to, from);
   time = 100 / to + 100 / from;
   actualspeed = 200 / time;
run;
proc print data = speed;
run;

Obs   To   From   hmean   amean   time   actualspeed
1     50   80     61.54   65      3.25   61.54
2     40   70     50.91   55      3.93   50.91
Measures of central tendency Exercises Exercises
1 Name 3 variables for which the mean would not be appropriate
2 For each of those, decide which measure of central tendency would be appropriate and why
Measures of spread
Standard deviation What it is
Definition The standard deviation is the square root of the average squared difference between the mean and the individual values.
Or
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
Measures of spread Standard deviation What can go wrong
If the mean isn’t a good measure of central tendency, the sd isn’t a good measure of spread. Measures of spread Standard deviation SD of salary
proc means data = sashelp.baseball;
   var salary;
run;

Basic Statistical Measures
Location              Variability
Mean     535.9259     Std Deviation         451.11868
Median   425.0000     Variance              203508
Mode     750.0000     Range                 2393
                      Interquartile Range   560.00000
Measures of spread Standard deviation Alternatives
Median absolute deviation (MAD)
Range and interquartile range
More quantiles
Gini’s mean difference
Variations on MAD
(also see graphics, later)
Measures of spread MAD What it is
Definition The median absolute deviation is what it says:
1 Find the median
2 Find each value’s deviation from the median
3 Take absolute values
4 Find the median of those
Measures of spread MAD What can go wrong?
Not very efficient
Not appropriate with asymmetric distributions
Measures of spread Range and interquartile range What it is
The range runs from the smallest value to the largest value
The IQR runs from the 1st quartile to the 3rd quartile
Measures of spread Range and interquartile range What can go wrong
The range is strongly affected by even a single outlier
The IQR is not affected at all by outliers
Measures of spread Range and interquartile range SAS code
ods select RobustScale;
proc univariate data = sashelp.baseball RobustScale;
   var salary;
run;

Robust Measures of Scale
Measure                  Value      Estimate of Sigma
Interquartile Range      560.0000   415.1285
Gini’s Mean Difference   468.0400   414.7897
MAD                      275.0000   407.7150
Sn                       381.6320   382.9424
Qn                       327.7303   325.9949
Measures of spread Range and interquartile range Exercises
List 3 variables that would not be well analyzed by the SD and suggest alternatives. Other measures
Skewness What it is
Definition Skewness is the asymmetry of the distribution.
$$\frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^3}{\left[\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^{3/2}}$$
Skewness can take any value: negative means left skew, positive means right skew, and a symmetrical distribution has skewness 0.
Other measures Skewness Alternatives and problems
One good way to look at skewness is with density plots (to be covered later)
What can go wrong
A single outlier can generate skewness.
Again, if the mean is not an appropriate measure of central tendency, this is not an appropriate measure of skew
Other measures Skewness Skewness of salary
ods select Moments;
proc univariate data = sashelp.baseball;
   var salary;
run;

Moments
N                 263      Sum Weights        263
Mean              535.93   Sum Observations   140948.507
Std Deviation     451.12   Variance           203508.064
Skewness          1.59     Kurtosis           3.05896473
Coeff Variation   84.18    Std Error Mean     27.82
Other measures Kurtosis What it is
It is a measure of the peakedness of the distribution. However, it is very nonintuitive and hard to interpret. It can be used to indicate a non-normal distribution, but its use beyond that is tricky (and confuses even experienced people). Better to use graphical measures such as density plots (to be covered later) Other measures Exercises and further reading Exercises
List 3 variables that are markedly skewed, either to the right or left
Other measures Exercises and further reading Discussion
Other measures Exercises and further reading Further reading
www.statisticalanalysisconsulting.com/how-to-go-wrong-with-the-mean/
www.statisticalanalysisconsulting.com/measures-of-central-tendency-the-harmonic-mean/
www.statisticalanalysisconsulting.com/measures-of-central-tendency-the-trimmed-mean-and-median/
www.statisticalanalysisconsulting.com/statistical-measures-of-spread/
Exploratory Data Analysis by John Tukey
Part III
Descriptive graphics Introduction
Outline
5 Introduction
6 Univariate graphics
7 Bivariate graphics
8 Trivariate and multivariate graphics
9 Time series data
10 Exercises and further reading Introduction
General thoughts on statistical graphics - 1
A good graph will
Show the data
Induce the viewer to think about the substance of the data
Avoid distorting the data
Present many numbers in a small space
Introduction
General thoughts on statistical graphics - 2
Make large data sets coherent
Encourage the eye to look at different parts of the data
Reveal several levels of detail
Serve a clear purpose
Introduction
General thoughts on statistical graphics - 3
But a good graphic will not
Be a substitute for a table
Be a substitute for a model
Introduction
General thoughts on statistical graphics - 4
Use of color, shape and so on
Consider the audience
Not all chart junk is bad
Univariate graphics
Univariate discrete data Introduction
This is usually counts or proportions of something, e.g. number of Democrats, Republicans and others. Here:
Pie charts should be avoided
Dot charts are often good
A table may be even better
Log scales are sometimes helpful
Univariate graphics Univariate discrete data Pie chart with 51 categories - a mess
[Figure: pie chart by Geographical_Area, one slice per state plus the District of Columbia and Puerto Rico]
Univariate graphics Univariate discrete data Dot chart with 51 categories
[Figure: dot chart of population by state, listed from Wyoming to California. x axis: Population (millions, log scale); y axis: State; legend: region (West, South, Northeast, Midwest)]
Univariate graphics Univariate discrete data Pie chart with 9 categories - a table or dot chart
[Figure: pie chart with nine slices, labelled as follows]
Middle Atlantic Division       40621237
East South Central Division    18084651
Mountain Division              21784507
East North Central Division    46395654
New England Division           14303542
Pacific Division               49070441
West South Central Division    35235521
West North Central Division    20165794
South Atlantic Division        58398377
Univariate graphics Univariate discrete data Pie chart with 4 categories - a table or text
[Figure: pie chart by Geographical_Area with four slices: Midwest Region, Northeast Region, South Region, West Region]
Univariate graphics Univariate discrete data SAS code
The SAS code for the pie charts isn’t shown because you shouldn’t use it. The code for the dot plot is complex; I can e-mail it to you if you want.
Univariate graphics Univariate continuous data Introduction
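For readers who want a starting point, a basic dot chart can be drawn with the DOT statement of PROC SGPLOT. This is not the code behind the state population figure, which is not shown in the source; it is a minimal sketch that assumes sashelp.baseball as a stand-in dataset and plots mean salary by team on a log axis.

```sas
/* Stand-in example: mean 1987 salary by team, sorted, on a log scale */
proc sgplot data = sashelp.baseball;
   dot team / response = salary stat = mean categoryorder = respdesc;
   xaxis type = log label = "Mean 1987 salary in $ thousands (log scale)";
run;
```

Sorting the categories by the response, rather than alphabetically, is usually what makes a dot chart easier to read than a pie chart.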
Histograms can be misleading, at least if they are unadorned
Density plots are often better, and several smooths can be used.
Box plots provide a useful summary
When N is small, strip charts can be useful
Univariate graphics Univariate continuous data Density plot - example
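One way to adorn a histogram is to overlay density estimates on it. A minimal sketch on the same salary variable; the choice of a kernel and a normal curve is an illustration, not the author's recommendation:

```sas
proc sgplot data = sashelp.baseball;
   histogram salary;                  /* the unadorned histogram */
   density salary / type = kernel;    /* kernel smooth of the same data */
   density salary / type = normal;    /* normal curve for comparison */
run;
```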
[Figure: "Density plot, salaries". x axis: 1987 Salary in $ Thousands; y axis: Density; curves: Kernel, Kernel c=0.5, Kernel c=2]
Univariate graphics Univariate continuous data Density plot - SAS code
proc sgplot data = sashelp.baseball;
   density salary / type = kernel;
   density salary / type = kernel (c = .5) curvelabelattrs = (color = red);
   density salary / type = kernel (c = 2) curvelabelattrs = (color = green);
   xaxis min = 0;
run;
Univariate graphics Univariate continuous data Box plot - example
[Figure: box plot of salary (plot title: "Density plot, salaries"). y axis: 1987 Salary in $ Thousands]
proc sgplot data = sashelp.baseball;
   vbox salary;
run;
Univariate graphics Univariate continuous data Box plot - example, log scale
[Figure: box plot of salary on a log scale (plot title: "Salary by division"). y axis: 1987 Salary in $ Thousands]
proc sgplot data = sashelp.baseball;
   vbox salary;
   yaxis type = log logbase = 10 logstyle = linear;
run;
Univariate graphics Univariate continuous data Strip plot - example
[Figure: "Salary strip plot". x axis: jitter; y axis: 1987 Salary in $ Thousands]
Univariate graphics Univariate continuous data Strip plot - SAS code
data strip;
   set sashelp.baseball;
   jitter = 1 * (ranuni(1234) / 5) + .9;
run;
title "Salary strip plot";
proc sgplot data = strip;
   scatter x = jitter y = salary;
   xaxis min = 0 max = 2 display = none;
run;
Bivariate graphics
Both categorical Mosaic plots
A little known and under-used plot is the mosaic plot. It is a way of visualizing a crosstabulation. For example, sex and party ID.
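The PROC FREQ call on this slide reads a dataset named mosaic with the variables party, sex and count; the source does not show that dataset. A minimal sketch of what it could look like, with counts that are entirely made up for illustration:

```sas
/* Hypothetical counts, for illustration only */
data mosaic;
   length party $ 12 sex $ 6;   /* avoid truncating "Republican" */
   input party $ sex $ count;
   datalines;
Democrat Female 60
Democrat Male 40
Republican Female 45
Republican Male 55
Other Female 10
Other Male 15
;
run;
```

Each row is one cell of the crosstabulation, which is why the PROC FREQ step needs a WEIGHT statement naming count.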
[Figure: mosaic plot produced by the FREQ procedure]
ods select MosaicPlot;
proc freq data = mosaic;
   table party*sex / plots = mosaic;
   weight count;
run;
Bivariate graphics One categorical Introduction
When N is relatively small, a strip chart is good: it shows all the data.
When N is larger, a parallel boxplot shows a lot of the key information.
Bivariate graphics One categorical Parallel boxplot - example
[Figure: "Salary by division", parallel box plots. x axis: League and Division (AE, AW, NE, NW); y axis: 1987 Salary in $ Thousands]
title "Salary by division";
proc sgplot data = sashelp.baseball;
   vbox salary / category = div;
run;
Bivariate graphics One categorical Strip chart - example
[Figure: "Salary by division, rookies", strip chart. x axis: 1987 Salary in $ Thousands; y axis: League and Division (NW, AW, NE, AE)]
title "Salary by division, rookies";
proc sgplot data = sashelp.baseball;
   scatter x = salary y = div;
   where yrmajor le 1;
run;
Bivariate graphics Neither categorical The scatter plot
The most common (and one of the best) basic options here is the scatter plot. But there are variations. Bivariate graphics Neither categorical Scatter plot - basic example
[Figure: scatter plot (plot title: "Salary by division"). x axis: Career Hits; y axis: 1987 Salary in $ Thousands]
proc sgplot data = sashelp.baseball;
   scatter x = CrHits y = Salary;
run;
Bivariate graphics Neither categorical Scatter plot - log scale
[Figure: scatter plot with both axes on a log scale (plot title: "Salary by division"). x axis: Career Hits; y axis: 1987 Salary in $ Thousands]
proc sgplot data = sashelp.baseball;
   scatter x = CrHits y = Salary;
   xaxis type = log logstyle = linear;
   yaxis type = log logstyle = linear;
run;
Bivariate graphics Neither categorical Scatter plot - log scale plus loess
[Figure: scatter plot on log scales with a loess curve (plot title: "Salary by division"). x axis: Career Hits; y axis: 1987 Salary in $ Thousands; legend: 1987 Salary in $ Thousands, Loess]
Bivariate graphics Neither categorical Scatter, log scale with loess
proc sgplot data = sashelp.baseball;
   xaxis label = "Career hits (log scale)" type = log logstyle = linear;
   yaxis label = "Salary in thousands of $ (log scale)" type = log logstyle = linear;
   scatter x = CrHits y = salary;
   loess x = CrHits y = Salary / nomarkers;
   ellipse x = CrHits y = Salary;
run;
Bivariate graphics Neither categorical Scatter plot - A fancy example
[Figure: "Scatter plot with density plots". States plotted by postal abbreviation, with a prediction ellipse (α=.05) and marginal density plots. x axis: Unemployment (%); y axis: Infant Mortality]
Bivariate graphics Neither categorical Scatter plot - Another fancy example
[Figure: "Box plot w/barchart". Upper panel: box plots of Weight by ht; lower panel: bar chart of N by ht]
Trivariate and multivariate graphics
All continuous The scatterplot matrix - example
[Figure: scatterplot matrix of Career Hits, Career Home Runs and 1987 Salary in $ Thousands (plot title: "Salary by division")]
proc sgscatter data = sashelp.baseball;
   matrix CrHits CrHome Salary;
run;
Trivariate and multivariate graphics All continuous The scatterplot matrix - a more complex example
proc sgscatter data = sashelp.baseball;
   matrix CrHits CrHome CrBB CrRbi /
      markerattrs = (symbol = circlefilled size = 8)
      diagonal = (kernel)
      ellipse
      colorresponse = salary;
run;
Trivariate and multivariate graphics All continuous Bubble plot
[Figure: "Bubble plot of rookie salaries". x axis: Career Hits; y axis: Career Home Runs; bubble size: salary]
title "Bubble plot of rookie salaries";
proc sgplot data = sashelp.baseball;
   bubble x = CrHits y = CrHome size = salary;
   where yrmajor le 1;
run;
Trivariate and multivariate graphics Some continuous Coplot
[Figure: coplot with one panel per League and Division (AE, AW, NE, NW). x axis: Career Hits; y axis: Career Home Runs]
proc sgpanel data = sashelp.baseball;
   panelby div;
   scatter x = CrHits y = CrHome;
run;
Trivariate and multivariate graphics Some continuous Scatter plot matrix with group variable - example
[Figure: "Several statistics by league and division - 5 years or less". Scatterplot matrix of Career Hits, Career Home Runs, 1987 Salary in $ Thousands and Career Times at Bat; legend: League and Division (AE, AW, NE, NW)]
proc sgscatter data = sashelp.baseball;
   title "Several statistics by league and division - 5 years or less";
   matrix CrHits CrHome Salary CrAtBat / group = div diagonal = (kernel);
   where yrmajor le 5;
run;
Trivariate and multivariate graphics Some continuous Scatter plot matrix - another example
proc sgscatter data = sashelp.baseball;
   plot (salary) * (nHits nHome NBB nAssts) /
      markerattrs = (symbol = circlefilled size = 8)
      loess
      colorresponse = yrmajor
      colormodel = twocolorramp;
run;
Trivariate and multivariate graphics None continuous Introduction
When all variables are categorical, generalizations of the mosaic plot can be used. Time series data
Electrical workers over time
[Figure: "Timeseries decomposition", TIMESERIES procedure, series values for ELECTRIC. x axis: DATE (1977 to 1982); y axis: electrical workers, thousands]
Time series data
Electrical workers over time
[Figure: "Timeseries decomposition", TIMESERIES procedure, seasonal decomposition/adjustment for ELECTRIC. Panels: Trend-Cycle, Seasonal-Irregular, Irregular, Seasonally Adjusted; x axes: 1977 to 1982]
Time series data
Electrical workers over time - SAS code
title "Timeseries decomposition";
proc timeseries data = sashelp.workers out = _null_ plots = (series decomp);
   id date interval = month;
   var electric;
run;
Exercises and further reading
Describe a set of variables and say what graph you would use for it and why Exercises and further reading
Discussion Exercises and further reading
Further reading - blog links
Parallel box plots http://www.statisticalanalysisconsulting.com/graphics-for-bivariate-data-parallel-box-plots/
Pie is delicious but not nutritious http://www.statisticalanalysisconsulting.com/graphics-for-univariate-data-pie-is-delicious-but-not-nutritious/
Scatterplots http://www.statisticalanalysisconsulting.com/scatterplots-and-enhancements/
Graphics: The good, the bad and the ugly http://www.statisticalanalysisconsulting.com/graphics-the-good-the-bad-and-the-ugly/
Exercises and further reading
Further reading - books
Creating more effective graphs by Naomi Robbins
Visualizing data by William S. Cleveland
The elements of graphing data by William S. Cleveland
A trout in the milk by Howard Wainer
Part IV
Inferential statistics From sample to population
A population is the entire set of all the subjects (people or whatever) that you want to study.
A sample is a subset of that population.
A random sample is a sample where all subjects have a definable chance of being selected.
Null and alternative hypotheses
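Drawing a simple random sample, in which every subject has the same known chance of selection, can be sketched with PROC SURVEYSELECT. Treating sashelp.baseball as the "population" is an assumption made only for illustration, as are the sample size, the seed and the output dataset name.

```sas
/* Treat the 1986 players as the population and draw 50 at random */
proc surveyselect data = sashelp.baseball out = sample
                  method = srs n = 50 seed = 2015;
run;
```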
The null hypothesis is usually "nothing is going on"
The alternative is "something is going on"
Trial analogy
What is a p value?
Definition If, in the population from which this sample was randomly drawn, the null were strictly true, what is the probability of getting a test statistic at least as large as the one we got, in a sample the size of the one we have?
In other words, if we do 1000 really silly things, what proportion will come out significant?
Experiments vs. observational studies
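The "1000 really silly things" idea can be simulated. This minimal sketch draws 1000 samples of size 30 from a population where the null (mean 0) is exactly true, runs a t test on each, and reports the proportion with p below 0.05. The sample size, seed and dataset names are assumptions; the proportion should come out at roughly 5%.

```sas
/* 1000 samples of 30 observations each, all from a normal(0, 1) population */
data sim;
   call streaminit(2015);
   do rep = 1 to 1000;
      do i = 1 to 30;
         x = rand('normal', 0, 1);
         output;
      end;
   end;
run;

ods exclude all;                      /* do not print 1000 tables */
proc ttest data = sim h0 = 0 plots = none;
   by rep;
   var x;
   ods output TTests = tt;            /* one row of test results per sample */
run;
ods exclude none;

data tt;
   set tt;
   sig = (probt < 0.05);              /* 1 if "significant", else 0 */
run;

proc means data = tt mean;            /* mean of sig = proportion significant */
   var sig;
run;
```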
In an experiment, subjects are randomly selected and then randomly assigned to a condition
In an observational study, neither of these is true
Some people use the term quasi-experiment where one of the above is true
Problems
Not usually the question we want to ask
Strongly affected by sample size
The Bayesian approach
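The dependence of the p value on sample size can be seen from the one-sample t statistic. The symbol d (the effect measured in standard deviation units) and the numbers are illustrative assumptions:

```latex
\begin{aligned}
t &= \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = d\,\sqrt{n},
  \qquad d = \frac{\bar{x} - \mu_0}{s} \\
d = 0.1,\ n = 100 &: \quad t = 0.1 \times 10 = 1 \\
d = 0.1,\ n = 10000 &: \quad t = 0.1 \times 100 = 10
\end{aligned}
```

The same small effect is far from significant in the first case and overwhelmingly significant in the second, although nothing about the effect itself has changed.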
Idea
Set a prior - often a uniform prior
Let data modify it.
Advantages
More intuitive
Lets you have a prior
Disadvantages
Hard to set a prior
Uninformed prior usually gives similar results to frequentist approach
Still not the question we are interested in
What we want
Effect sizes and measures of their accuracy
Risk-reward analysis
Further reading
The Insignificance of Statistical Significance Testing by Douglas Johnson
The Cult of Statistical Significance by Stephen Ziliak and Deirdre McCloskey
Part V
The regression family Introduction
Outline
11 Introduction
12 The OLS model
13 Other models for continuous DV
14 The logistic family
15 Count models
16 Multilevel models
17 Exercises and further reading Introduction
What is regression?
Regression is a term for a variety of models relating dependent variables (usually just one) to one or more independent variables. Introduction
Varieties of regression
The type of regression depends on the nature of the dependent variable and on the nature of the relationships.
Continuous - OLS and alternatives (see below)
Dichotomous - Logistic
Categorical (>2 levels) - Multinomial logistic
Ordinal - Ordinal logistic
Count - Poisson, negative binomial and variations
Time to event - Survival models
The OLS model
What it is
Ordinary least squares is the most common regression model and it is what people mean when they say ‘regression’. The model is
$$Y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p + e$$
where e is error and is normally distributed with 0 mean and constant variance.
The OLS model
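A minimal sketch of fitting the OLS model above in SAS. The choice of predictors from sashelp.baseball is an assumption made for illustration, not a model proposed in the source:

```sas
/* Y = salary; x1..x3 = years in the majors, hits and home runs in 1986 */
proc reg data = sashelp.baseball;
   model salary = YrMajor nHits nHome;
run;
quit;
```

The parameter estimates table gives b0 (Intercept) and b1 to b3, one per predictor.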
What can go wrong
Overfitting
Nonlinear fits
Nonnormal residuals
Dependent data
Collinearity
Other models for continuous DV
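Two of these problems, collinearity and nonnormal residuals, can be screened with standard PROC REG output. A hedged sketch; the predictors are chosen for illustration because career at-bats and career hits are likely to be strongly related:

```sas
ods graphics on;
proc reg data = sashelp.baseball plots = diagnostics;
   /* VIF requests variance inflation factors, a common collinearity check */
   model salary = YrMajor CrAtBat CrHits / vif;
run;
quit;
```

The diagnostics panel includes residual plots and a residual Q-Q plot, which help with judging nonlinearity and nonnormal residuals.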
Introduction
Multivariate adaptive regression splines (MARS) - PROC ADAPTIVEREG
Quantile regression - PROC QUANTREG
Transformations - PROC TRANSREG
More information: See my paper at SGF 2015.
Other models for continuous DV MARS Introduction
MARS models allow extremely flexible curves (called splines) to be fit to data.
MARS models are most useful
In high dimensional spaces
When there is little substantive reason to assume linearity or a low-level polynomial fit
Other models for continuous DV MARS Advantages and disadvantages of MARS models
Advantages of MARS models:
Very flexible fitting of the relationship between independent and dependent variables
Model selection methods that can sharply reduce the dimension of the model.
SAS implementation of these models extends them to dependent variables in the exponential family.
Can be more accurate than GLM, with greater parsimony
Disadvantages of MARS models:
Hard to interpret
Less familiar
Other models for continuous DV MARS Example
I modeled baseball salary as a function of various attributes of the players. ADAPTIVEREG got a significantly higher R² with considerably fewer terms. But the result is very hard to interpret.
proc adaptivereg data = sashelp.baseball plots = all details = bases;
   class team;
   model salary = YrMajor nAtBat nHits nHome nOuts;
run;
Other models for continuous DV Quantile regression Introduction
There are at least three motivations for quantile regression:
DV is bimodal or multimodal
Highly skewed DV
Substantive interest in the quantiles
Advantages include:
No assumptions about the distribution of the residuals
More flexible hypotheses
Disadvantages include:
Not as powerful as OLS regression when that is the appropriate model
Not robust to high leverage points.
Other models for continuous DV Quantile regression Example
A quantile regression of baseball salary:
proc quantreg data = sashelp.baseball plots = all;
   model salary = YrMajor nAtBat nHits nHome nOuts / quantile = (0.1, 0.5, 0.9);
run;
revealed that the relationship between salary and various player attributes was different at different levels of salary, e.g. number of home runs was more important at high levels of salary. But this should be viewed with caution because of high leverage points.
Other models for continuous DV TRANSREG Introduction
Sometimes it makes sense to transform one or more variables.
Can do in data step but PROC TRANSREG offers many options and allows automation of some tasks
Some transformations (e.g. splines) are hard or impossible in data step
TRANSREG is very flexible and allows optimal fitting.
Other models for continuous DV TRANSREG Example
A spline regression of baseball salary
proc transreg data = sashelp.baseball plots = all;
   model identity(salary) = spline(YrMajor nAtBat nHits nHome nOuts);
run;
showed non-monotonic relationships between salary and performance
Other models for continuous DV TRANSREG The logistic family
Introduction
When the dependent variable is categorical (either dichotomous, nominal or ordinal) OLS regression is not recommended because
The assumption of normal residuals is violated
The predicted values can be ludicrous
The usual method for these cases is logistic regression (either ‘normal’, multinomial or ordinal). The key output is odds ratio estimates.
The logistic family
What are odds ratios?
In OLS regression the dependent variable is continuous. In logistic, it’s not. How do we go from a 0 - 1 response to a continuous one from −∞ to ∞?
Find odds of something happening for each level of each IV, e.g. odds of men and women voting for Obama. That goes from 0 to ∞.
Take ratio of the odds. That goes from 0 to ∞ as well.
Take log of the ratio for modeling. That goes from −∞ to ∞.
But the OR is easier to interpret
The logistic family
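A worked example of the steps from probabilities to odds to an odds ratio to its log. The two probabilities are made up for illustration and do not come from any data in the source:

```latex
\begin{aligned}
p_{\text{women}} = 0.6 &: \quad \text{odds} = \frac{0.6}{1 - 0.6} = 1.5 \\
p_{\text{men}} = 0.5 &: \quad \text{odds} = \frac{0.5}{1 - 0.5} = 1.0 \\
\text{OR} &= \frac{1.5}{1.0} = 1.5 \\
\log(\text{OR}) &= \ln 1.5 \approx 0.405
\end{aligned}
```

An OR of 1.5 reads directly as "the odds are 50% higher for women", which is why the OR is easier to interpret than its log.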
Logistic regression - examples
Predict or explain purchase of a product vs. no purchase - dichotomous
Predict or explain position on a team - multinomial
Predict or explain likelihood of returning - ordinal
The logistic family
What can go wrong
Coding 0 and 1 incorrectly - be careful which response SAS is modelling
Effect coding. For categorical IVs, SAS defaults to effect coding, but reference coding is often better
Quasi-complete and complete separation - slicing the pie too thin
Concordant and discordant in output don’t mean what they seem to
Need to use SLICE to get interaction odds ratios
The logistic family
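The first two pitfalls can be avoided by stating the modelled response and the coding explicitly. A minimal sketch; the use of sashelp.heart and the choice of predictors are assumptions made for illustration:

```sas
proc logistic data = sashelp.heart;
   /* Reference coding instead of the default effect coding */
   class sex (ref = 'Female') / param = ref;
   /* State which response level is being modelled */
   model status (event = 'Dead') = sex AgeAtStart Cholesterol;
run;
```

The Response Profile table in the output confirms which level of status is treated as the event.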
Ordinal and multinomial logistic example
When the DV has multiple categories, they can be ordinal or nominal. If ordinal, use PROC LOGISTIC and the LINK = clogit. If nominal, LINK = glogit. Interpretation can be tricky, but is basically a generalization of the dichotomous case. Count models
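A hedged sketch of both links, again assuming sashelp.heart for illustration: BP_Status (High, Normal, Optimal) is treated as ordinal and DeathCause as nominal. The predictors are arbitrary choices.

```sas
/* Ordinal DV: cumulative logit */
proc logistic data = sashelp.heart;
   model BP_Status = AgeAtStart Weight / link = clogit;
run;

/* Nominal DV: generalized logit */
proc logistic data = sashelp.heart;
   class sex / param = ref;
   model DeathCause = AgeAtStart sex / link = glogit;
run;
```

For the ordinal model it is worth checking the Response Profile table to confirm that SAS has ordered the categories as intended.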
Count models
Introduction
When the DV is a count (a non-negative integer) and especially when the counts aren’t very large, OLS is not recommended. Count models such as Poisson or negative binomial regression should be used. PROC GENMOD is used for these analyses.
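A hedged sketch of both count models in PROC GENMOD. Using home runs (nHome) from sashelp.baseball as the count DV, and these two predictors, is an assumption made purely for illustration.

```sas
/* Poisson regression: assumes the variance equals the mean */
proc genmod data = sashelp.baseball;
  model nHome = nAtBat YrMajor / dist = poisson link = log;
run;

/* Negative binomial regression: adds a dispersion parameter */
proc genmod data = sashelp.baseball;
  model nHome = nAtBat YrMajor / dist = negbin link = log;
run;
```

In the Poisson output, a Pearson chi-square divided by its degrees of freedom that is well above 1 suggests overdispersion, in which case the negative binomial fit is worth comparing.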
Examples
How many cell phones does a person own?
How many divorces will a person go through?
What can go wrong?
Overdispersion
Failure to fit
Abundance of 0’s - use ZIP (zero-inflated Poisson) or ZINB (zero-inflated negative binomial) models
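For the excess-zeros case, a zero-inflated Poisson model can be sketched in PROC GENMOD as below. The data set and variables are again illustrative assumptions; whether nHome really has excess zeros would need to be checked first.

```sas
/* Sketch of a ZIP model; use dist = zinb for zero-inflated negative binomial */
proc genmod data = sashelp.baseball;
  model nHome = nAtBat YrMajor / dist = zip;
  /* the zero-inflation part gets its own (here, deliberately small) set of predictors */
  zeromodel nAtBat / link = logit;
run;
```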
Multilevel models
Introduction
All the regression models above assume independent errors. When this assumption is violated, things can go very wrong. Multilevel models (MLM) are one way to deal with this.
Examples
Repeated measurements of the same thing on the same people
Measurements on people who are clustered
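The clustered case can be sketched with a random-intercept model in PROC MIXED. Treating players in sashelp.baseball as clustered within teams is an assumption made only to have runnable example data; the predictors are likewise illustrative.

```sas
/* Random intercept for each team: players on the same team share a team effect,
   so their errors are no longer assumed independent */
proc mixed data = sashelp.baseball;
  class team;
  model logSalary = YrMajor nHits / solution;
  random intercept / subject = team;
run;
```

For repeated measurements on the same people, the same idea applies with the person as the subject.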
Exercises and further reading
Exercises
From your experience, list several regression problems and propose a regression method for each.
Discussion
Further reading - blog links
Simple linear regression: http://www.statisticalanalysisconsulting.com/what-is-simple-linear-regression/
Multiple linear regression: http://www.statisticalanalysisconsulting.com/what-is-multiple-linear-regression/
Survival analysis: http://www.statisticalanalysisconsulting.com/what-is-survival-analysis/
Alternative methods of regression when OLS is not right: http://support.sas.com/resources/papers/proceedings15/3412-2015.pdf
Further reading - books
Regression Analysis by Example by Samprit Chatterjee and Ali Hadi
Regression Models for Categorical and Limited Dependent Variables by J. Scott Long
Categorical Data Analysis by Alan Agresti
Part VI
Multivariate statistics Introduction
Sometimes there is no dependent variable, but you want to be able to figure out what is going on in a huge mass of data.
Exploratory factor analysis Introduction
Factor analysis is a method of finding latent factors in multivariate data. Latent variables are those that can’t be directly measured. Examples:
  Personality scales
  IQ
  Views on complex issues
Exploratory factor analysis Steps involved
Extracting factors - several methods
Rotation - many methods, in two groups
  Orthogonal - each factor is uncorrelated with the others; easier to interpret but may not be realistic
  Oblique - factors can be correlated
Interpretation - EFA is not determinate; much will depend on interpretation
Exploratory factor analysis Example
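As a hedged sketch of the extraction and rotation steps, an oblique rotation could be requested as follows. The extraction method, the number of factors and the promax rotation are illustrative assumptions, not the settings used for the course example.

```sas
proc factor data = sashelp.baseball
            method   = principal   /* extraction method (assumed for illustration) */
            nfactors = 2           /* number of factors to keep (assumed)          */
            rotate   = promax;     /* oblique rotation: factors may be correlated  */
  var nAtBat -- nBB nAssts nOuts nError;
run;
```

With an oblique rotation, PROC FACTOR also reports the correlations between factors, which helps judge whether an orthogonal solution would have been realistic.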
Factor analysis of current statistics showed 2 factors:
  proc factor data = sashelp.baseball r = varimax;
    var nAtBat -- nBB nAssts nOuts nError;
  run;

Rotated Factor Pattern
                                   Factor1     Factor2
  nAtBat   Times at Bat in 1986    0.88078     0.37098
  nHits    Hits in 1986            0.87357     0.33843
  nHome    Home Runs in 1986       0.81700    -0.19594
  nRuns    Runs in 1986            0.91078     0.21618
  nRBI     RBIs in 1986            0.92417     0.04853
  nBB      Walks in 1986           0.74709     0.09339
  nAssts   Assists in 1986         0.03736     0.92947
  nOuts    Put Outs in 1986        0.45303    -0.03541
  nError   Errors in 1986          0.10152     0.87866
Exploratory factor analysis What can go wrong
GIGO can appear like GIPO - garbage in, pearls out
No simple structure
Unclear number of factors
Principal component analysis (PCA) Introduction
PCA is a dimension reduction method; use it when you have a large number of variables that you want to reduce with minimal loss of information.
Principal component analysis (PCA) What can go wrong
Components may not make sense
Components may not be useful for further analysis
If doing regression, consider partial least squares.
Cluster analysis Introduction
Cluster analysis is a set of methods for finding groups of observations that go together in ways you are not aware of at the outset. Examples:
  Do patrons of a store fall into groups of people who buy certain items?
  Do politicians fall into groups based on their votes on bills?
Cluster analysis Methods
Agglomerative methods - start with items separate and gradually combine them using
  A measure of distance
  A measure of linkage
K-means methods - assign a number of clusters and a distance measure and let the algorithm do the work
Cluster analysis Example
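The k-means approach listed above can be sketched with PROC FASTCLUS. Standardizing the variables first, asking for 3 clusters, and the output data set names bb_std and bb_kmeans are all assumptions made for illustration.

```sas
/* Standardize so that variables on large scales (e.g. put outs) do not dominate the distances */
proc stdize data = sashelp.baseball out = bb_std method = std;
  var nAtBat -- nBB nAssts nOuts nError;
run;

/* K-means style clustering with a preset number of clusters */
proc fastclus data = bb_std maxclusters = 3 out = bb_kmeans;
  var nAtBat -- nBB nAssts nOuts nError;
run;
```

The output data set bb_kmeans contains a CLUSTER variable giving each player's assigned cluster.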
Cluster analysis of the same variables
  proc cluster data = sashelp.baseball method = average CCC pseudo print = 10 outtree = bb4clust;
    var nAtBat -- nBB nAssts nOuts nError;
  run;
Cluster analysis Example - continued
showed evidence of 3 clusters:
[Figure: The CLUSTER Procedure, Average Linkage Cluster Analysis. Criteria for the Number of Clusters: CCC, Pseudo F and Pseudo T-Squared plotted against Number of Clusters (2 to 10)]
Cluster analysis Example - continued
with the following attributes
[Figure: panel of plots, one per variable: Times at Bat, Hits, Home Runs, Runs, RBIs, Walks, Put Outs, Assists and Errors in 1986]
Multidimensional scaling Introduction
MDS is a method for figuring out how people are judging similarity, or what similarity is based on. There are many options and choices and (relatively) little literature.
Multidimensional scaling Examples
How do people group politicians?
How do customers group brands of items?
Multidimensional scaling What can go wrong
Overfitting - use training and test sets
Results may not be useful - try different methods
Exercises and further reading
Outline
18 Exercises and further reading Exercises and further reading
Exercises
Come up with an example of a multivariate method that would be useful in your research or business Exercises and further reading
Further reading
Using Multivariate Statistics by Barbara Tabachnick and Linda Fidell
Part VII
Summary and so on General thoughts
Statistics and data analysis are not tools to be applied in a rote fashion.
Data analysis should illuminate a scientific or business phenomenon or attempt to solve a problem.
The time to consult with a data analyst is as early as possible and as often as possible.
Summary
Descriptive statistics are a vital first step in any analysis
Graphical methods are also vital
Inference allows you to go from a sample to a population, but can have problems
Regression relates a DV to one or more IVs
Multivariate statistics allow you to summarize large data sets
Contact information
Peter Flom
Peter Flom Consulting
www.StatisticalAnalysisConsulting.com
917 488 7176
Thank you!