Lies, damn lies and ... SAS to the rescue!

Peter L. Flom

Peter Flom Consulting

SESUG September, 2015

Broad outline

1 Introduction
2 Descriptive statistics
3 Descriptive graphics
4 Inferential statistics
5 The regression family
6 Multivariate statistics

Part I

Introduction Schedule

8:00 Descriptive statistics
8:40 Break
8:50 Descriptive graphics
9:30 Break
9:40 Inferential statistics
10:10 Break
10:20 The regression family
11:00 Break
11:10 Multivariate statistics
12:00 End

Introductions of participants and self

What I plan to do in this course

Give you a fundamental understanding of some basic statistical methods
Give you a very brief survey of many more advanced methods
Help you learn to work with statistics and statisticians
Give you some SAS code you can give to others
Note that the most important material is at the beginning, so ask questions!

What I don’t plan to do in this course

Teach you to be a statistician
Teach you SAS

What I want from you

Attention
Questions - after all, it’s not even graded
Feedback - even anonymous is OK, after the course

Part II

Descriptive statistics Introduction

Outline

1 Introduction

2 Measures of central tendency

3 Measures of spread

4 Other measures Introduction

Descriptive vs. inferential stats

Descriptive statistics describe a variable or a sample. Inferential statistics let you infer from a sample to a population (more later) Descriptive statistics are necessary even when your goal is inference Introduction

Types of descriptive statistics

For continuous variables, descriptive statistics include
Measures of central tendency
Measures of dispersion or spread
Measures of skewness
Measures of kurtosis
Other measures

For categorical variables, we are mostly limited to frequencies.

Measures of central tendency

Measures of central tendency The mean What it is

Definition The mean is the ordinary average. Add up the numbers and divide by the number of numbers.

Or, if you want a formula

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

where x is the variable and there are n values of the variable.

Measures of central tendency The mean What can go wrong

Outliers
Skewness
The clock problem
The rate problem
Different scales

Measures of central tendency The mean Mean salary

proc means data = sashelp.baseball maxdec = 2;
  var salary;
run;

Analysis Variable : Salary 1987 Salary in $ Thousands

N     Mean     Std Dev   Minimum   Maximum
263   535.93   451.12    67.50     2460.00

Measures of central tendency The mean Alternatives

1 The median
2 The trimmed mean and Winsorized mean
3 The geometric mean
4 The harmonic mean

which are the topics of the next few slides

Measures of central tendency The median Median salary

The median is simply the value that divides the distribution in half - half are lower, half are higher.

ods select BasicMeasures;
proc univariate data = sashelp.baseball;
  var salary;
run;

Basic Statistical Measures

Location              Variability
Mean     535.9259     Std Deviation         451.11868
Median   425.0000     Variance              203508
Mode     750.0000     Range                 2393
                      Interquartile Range   560.00000

Measures of central tendency The median What can go wrong

Sometimes we want the mean
When there are many ties, the median may not be completely determined.

Measures of central tendency The trimmed mean and Winsorized mean What it is

A compromise between the mean and the median. To calculate the trimmed mean, you remove a certain percentage of the highest and lowest points and then find the mean of what remains. The Winsorized mean is similar but, rather than deleting the points, you set them equal to the lowest or highest values that are not extreme. Measures of central tendency The trimmed mean and Winsorized mean What can go wrong

If the distribution is skewed, the trimmed mean is not an unbiased estimator for either the mean or median. Measures of central tendency The trimmed mean and Winsorized mean Trimmed and winsorized mean salary

ods select TrimmedMeans WinsorizedMeans;
proc univariate data = sashelp.baseball trimmed = .1 winsorized = .1;
  var salary;
run;

Trimmed per tail
%       N    SE (trimmed)   SE (Winsorized)   Trimmed mean   Winsorized mean
10.27   27   25.08          25.09             463.89         486.04

Measures of central tendency The geometric mean What it is

Definition It’s like the mean, except instead of adding the numbers and then dividing by the count, you multiply the numbers and take the nth root of the product.

or, if you want a formula

$$\left( \prod_{i=1}^{n} x_i \right)^{1/n}$$

Measures of central tendency The geometric mean What can go wrong

Doesn’t work when any value is 0 or negative Measures of central tendency The geometric mean When to use it

Useful for combining measures on different scales, e.g. candidates for college - combine SAT (0 to 1600) and HS GPA (0 to 4)
Proportional growth over a series of time periods

Measures of central tendency The geometric mean Geometric mean of college applicants

data college;
  input name $ GPA SAT @@;
  datalines;
Jill 3.0 1550 Joe 4.0 1500
;
data college;
  set college;
  gmean = geomean(GPA, SAT);
  amean = mean(GPA, SAT);
run;
proc print data = college;
run;

Obs  name  GPA  SAT   gmean    amean
1    Jill  3    1550  68.1909  776.5
2    Joe   4    1500  77.4597  752.0

Measures of central tendency The harmonic mean Harmonic mean of round trip travel

Definition It is the reciprocal of the arithmetic mean of the reciprocals of a set of numbers.

$$H = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$$

Measures of central tendency The harmonic mean When to use it
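A round trip makes a quick worked check of the formula. Assume, as the SAS example in this section does, 100 miles each way, at 50 mph out and 80 mph back. The harmonic mean of the two speeds equals total distance divided by total time, which is why it is the right average for rates:

```latex
H = \frac{2}{\frac{1}{50} + \frac{1}{80}} = \frac{2}{0.0325} \approx 61.54
\qquad
\text{actual speed} = \frac{200}{\frac{100}{50} + \frac{100}{80}} = \frac{200}{3.25} \approx 61.54
```

The arithmetic mean of the two speeds, 65, overstates the actual speed because more time is spent on the slower leg.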

Averaging rates, such as speeds or batting averages
Averaging ratios, such as price-earnings ratios

Measures of central tendency The harmonic mean What can go wrong

Like the geometric mean, it doesn’t work with negative numbers or 0’s. Measures of central tendency The harmonic mean SAS code

data speed;
  input To From @@;
  datalines;
50 80 40 70
;
data speed;
  set speed;
  hmean = harmean(to, from);
  amean = mean(to, from);
  time = 100 / to + 100 / from;
  actualspeed = 200 / time;
run;
proc print data = speed;
run;

Obs  To  From  hmean  amean  time  actualspeed
1    50  80    61.54  65     3.25  61.54
2    40  70    50.91  55     3.93  50.91

Measures of central tendency Exercises Exercises

1 Name 3 variables for which the mean would not be appropriate
2 For each of those, decide which measure of central tendency would be appropriate and why

Measures of spread

Measures of spread Standard deviation What it is

Definition The standard deviation is the square root of the average squared difference between the mean and the individual values (for a sample, the sum is divided by n − 1 rather than n).

Or

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Measures of spread Standard deviation What can go wrong

If the mean isn’t a good measure of central tendency, the sd isn’t a good measure of spread. Measures of spread Standard deviation SD of salary

proc means data = sashelp.baseball;
  var salary;
run;

Basic Statistical Measures

Location              Variability
Mean     535.9259     Std Deviation         451.11868
Median   425.0000     Variance              203508
Mode     750.0000     Range                 2393
                      Interquartile Range   560.00000

Measures of spread Standard deviation Alternatives

Median absolute deviation (MAD)
Range and interquartile range
More quantiles
Gini’s mean difference
Variations on MAD
(also see graphics, later)

Measures of spread MAD What it is

Definition The median absolute deviation is what it says:
1 Find the median
2 Find each value’s deviation from the median
3 Take absolute values
4 Find the median of those

Measures of spread MAD What can go wrong?

Not very efficient Not appropriate with asymmetric distributions Measures of spread Range and interquartile range What it is

The range is the distance from the smallest to the largest value
The IQR is the distance from the 1st quartile to the 3rd quartile

Measures of spread Range and interquartile range What can go wrong

The range is strongly affected by even a single outlier
The IQR is not affected at all by outliers

Measures of spread Range and interquartile range SAS code
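As a minimal sketch (not taken from the original slides), PROC MEANS can report the range and the interquartile range directly, alongside the minimum, maximum and quartiles they are built from. The statistic keywords are standard PROC MEANS keywords; maxdec = 2 is chosen only for readability.

```sas
/* Range and IQR of 1987 salary, with the pieces they are computed from. */
/* RANGE = MAX - MIN; QRANGE = Q3 - Q1.                                   */
proc means data = sashelp.baseball min max range q1 q3 qrange maxdec = 2;
  var salary;
run;
```

The RANGE and QRANGE columns should agree, up to rounding, with the Range and Interquartile Range that PROC UNIVARIATE reports for the same variable.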

ods select RobustScale;
proc univariate data = sashelp.baseball RobustScale;
  var salary;
run;

Robust Measures of Scale

Measure                  Value      Estimate of Sigma
Interquartile Range      560.0000   415.1285
Gini’s Mean Difference   468.0400   414.7897
MAD                      275.0000   407.7150
Sn                       381.6320   382.9424
Qn                       327.7303   325.9949

Measures of spread Range and interquartile range Exercises

List 3 variables that would not be well analyzed by the SD and suggest alternatives. Other measures

Other measures Skewness What it is

Definition Skewness is the asymmetry of the distribution.

$$\frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left[ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^{3/2}}$$

Skewness can take on any value: negative means left skew, positive means right skew, and 0 means symmetrical.

Other measures Skewness Alternatives and problems

One good way to look at skewness is with density plots (to be covered later)

What can go wrong
A single outlier can generate skewness.
Again, if the mean is not an appropriate measure of central tendency, this is not an appropriate measure of skew

Other measures Skewness Skewness of salary

ods select Moments;
proc univariate data = sashelp.baseball;
  var salary;
run;

Moments

N                 263      Sum Weights        263
Mean              535.93   Sum Observations   140948.507
Std Deviation     451.12   Variance           203508.064
Skewness          1.59     Kurtosis           3.05896473
Coeff Variation   84.18    Std Error Mean     27.82

Other measures Kurtosis What it is

It is often described as a measure of the peakedness of the distribution, although it is driven mainly by the weight of the tails. It is very nonintuitive and hard to interpret. It can be used to indicate a non-normal distribution, but its use beyond that is tricky (and confuses even experienced people). Better to use graphical methods such as density plots (to be covered later)

Other measures Exercises and further reading Exercises

List 3 variables that are markedly skewed, either to the right or left Other measures Exercises and further reading Discussion Other measures Exercises and further reading Further reading

www.statisticalanalysisconsulting.com/how-to-go-wrong-with-the-mean/
www.statisticalanalysisconsulting.com/measures-of-central-tendency-the-harmonic-mean/
www.statisticalanalysisconsulting.com/measures-of-central-tendency-the-trimmed-mean-and-median/
www.statisticalanalysisconsulting.com/statistical-measures-of-spread/
Exploratory Data Analysis by John Tukey

Part III

Descriptive graphics Introduction

Outline

5 Introduction

6 Univariate graphics

7 Bivariate graphics

8 Trivariate and multivariate graphics

9 Time series data

10 Exercises and further reading Introduction

General thoughts on statistical graphics - 1

A good graph will Show the data Induce the viewer to think about the substance of the data Avoid distorting the data Present many numbers in a small space Introduction

General thoughts on statistical graphics - 2

Make large data sets coherent Encourage the eye to look at different parts of the data Reveal several levels of detail Serve a clear purpose Introduction

General thoughts on statistical graphics - 3

But a good graphic will not Be a substitute for a table Be a substitute for a model Introduction

General thoughts on statistical graphics - 4

Use of color, shape and so on Consider the audience Not all chart junk is bad Univariate graphics

Univariate graphics Univariate discrete data Introduction

This is usually counts or proportions of something, e.g. number of Democrats, Republicans and others. Here: Pie charts should be avoided Dot charts are often good A table may be even better Log scales are sometimes helpful Univariate graphics Univariate discrete data Pie chart with 51 categories - a mess

[Figure: pie chart with one slice per Geographical_Area (the states, the District of Columbia and Puerto Rico)]

Univariate graphics Univariate discrete data Dot chart with 51 categories

[Figure: dot chart of population (millions, log scale) by state, ordered from Wyoming to California, with markers colored by region (West, South, Northeast, Midwest)]

Univariate graphics Univariate discrete data Pie chart with 9 categories - a table or dot chart

[Figure: pie chart with 9 slices, labeled as follows]

Middle Atlantic Division      40621237
East South Central Division   18084651
Mountain Division             21784507
East North Central Division   46395654
New England Division          14303542
Pacific Division              49070441
West South Central Division   35235521
West North Central Division   20165794
South Atlantic Division       58398377

Univariate graphics Univariate discrete data Pie chart with 4 categories - a table or text

[Figure: pie chart with 4 slices: Midwest Region, Northeast Region, South Region, West Region]

Univariate graphics Univariate discrete data SAS code

The SAS code for the pie charts isn’t shown because you shouldn’t use it. The code for the dot plot is complex; I can e-mail it to you if you want.

Univariate graphics Univariate continuous data Introduction

Histograms can be misleading, at least if they are unadorned Density plots are often better, and several smooths can be used. Box plots provide a useful summary When N is small, strip charts can be useful Univariate graphics Univariate continuous data Density plot - example

[Figure: "Density plot, salaries". x-axis: 1987 Salary in $ Thousands; y-axis: Density; curves: Kernel, Kernel (c=0.5), Kernel (c=2)]

Univariate graphics Univariate continuous data Density plot - SAS code

proc sgplot data = sashelp.baseball;
  density salary / type = kernel;
  density salary / type = kernel (c = .5) curvelabelattrs = (color = red);
  density salary / type = kernel (c = 2) curvelabelattrs = (color = green);
  xaxis min = 0;
run;

Univariate graphics Univariate continuous data Box plot - example

[Figure: box plot. y-axis: 1987 Salary in $ Thousands]

proc sgplot data = sashelp.baseball;
  vbox salary;
run;

Univariate graphics Univariate continuous data Box plot - example, log scale

[Figure: box plot, log scale. y-axis: 1987 Salary in $ Thousands]

proc sgplot data = sashelp.baseball;
  vbox salary;
  yaxis type = log logbase = 10 logstyle = linear;
run;

Univariate graphics Univariate continuous data Strip plot - example

[Figure: "Salary strip plot". x-axis: jitter; y-axis: 1987 Salary in $ Thousands]

Univariate graphics Univariate continuous data Strip plot - SAS code

data strip;
  set sashelp.baseball;
  jitter = 1 * (ranuni(1234) / 5) + .9;
run;
title "Salary strip plot";
proc sgplot data = strip;
  scatter x = jitter y = salary;
  xaxis min = 0 max = 2 display = none;
run;

Bivariate graphics

Bivariate graphics Both categorical Mosaic plots

A little known and under-used plot is the mosaic plot. It is a way of visualizing a crosstabulation. For example, sex and party ID.

[Figure: mosaic plot from the FREQ procedure]

ods select MosaicPlot;
proc freq data = mosaic;
  table party*sex / plots = mosaic;
  weight count;
run;

Bivariate graphics One categorical Introduction

When N is relatively small, a strip chart is good - it shows all the data. When N is larger, a parallel boxplot shows a lot of the key information. Bivariate graphics One categorical Parallel boxplot - example

[Figure: "Salary by division". Parallel box plots; x-axis: League and Division (AE, AW, NE, NW); y-axis: 1987 Salary in $ Thousands]

title "Salary by division";
proc sgplot data = sashelp.baseball;
  vbox salary / category = div;
run;

Bivariate graphics One categorical Strip chart - example

[Figure: "Salary by division, rookies". x-axis: 1987 Salary in $ Thousands; y-axis: League and Division (AE, AW, NE, NW)]

title "Salary by division, rookies";
proc sgplot data = sashelp.baseball;
  scatter x = salary y = div;
  where yrmajor le 1;
run;

Bivariate graphics Neither categorical The scatter plot

The most common (and one of the best) basic options here is the scatter plot. But there are variations. Bivariate graphics Neither categorical Scatter plot - basic example

[Figure: scatter plot. x-axis: Career Hits; y-axis: 1987 Salary in $ Thousands]

proc sgplot data = sashelp.baseball;
  scatter x = CrHits y = Salary;
run;

Bivariate graphics Neither categorical Scatter plot - log scale

[Figure: scatter plot, log scales. x-axis: Career Hits; y-axis: 1987 Salary in $ Thousands]

proc sgplot data = sashelp.baseball;
  scatter x = CrHits y = Salary;
  xaxis type = log logstyle = linear;
  yaxis type = log logstyle = linear;
run;

Bivariate graphics Neither categorical Scatter plot - log scale plus loess

[Figure: scatter plot, log scales, with loess curve. x-axis: Career Hits; y-axis: 1987 Salary in $ Thousands; legend: 1987 Salary in $ Thousands, Loess]

Bivariate graphics Neither categorical Scatter, log scale with loess

proc sgplot data = sashelp.baseball;
  xaxis label = "Career hits (log scale)" type = log logstyle = linear;
  yaxis label = "Salary in thousands of $ (log scale)" type = log logstyle = linear;
  scatter x = CrHits y = salary;
  loess x = CrHits y = Salary / nomarkers;
  ellipse x = CrHits y = Salary;
run;

Bivariate graphics Neither categorical Scatter plot - A fancy example

[Figure: "Scatter plot with density plots". States plotted by postal abbreviation, with a prediction ellipse (α=.05); x-axis: Unemployment (%); y-axis: Infant Mortality; marginal density plots along both axes]

Bivariate graphics Neither categorical Scatter plot - Another fancy example

[Figure: "Box plot w/barchart". Upper panel: box plots of Weight by ht; lower panel: bar chart of N by ht]

Trivariate and multivariate graphics

Trivariate and multivariate graphics All continuous The scatterplot matrix - example

[Figure: scatterplot matrix of Career Hits, Career Home Runs and 1987 Salary in $ Thousands]

proc sgscatter data = sashelp.baseball;
  matrix CrHits CrHome Salary;
run;

Trivariate and multivariate graphics All continuous The scatterplot matrix - a more complex example

proc sgscatter data = sashelp.baseball;
  matrix CrHits CrHome CrBB CrRbi /
    markerattrs = (symbol = circlefilled size = 8)
    diagonal = (kernel)
    ellipse
    colorresponse = salary;
run;

Trivariate and multivariate graphics All continuous Bubble plot

[Figure: "Bubble plot of rookie salaries". x-axis: Career Hits; y-axis: Career Home Runs; bubble size: salary]

title "Bubble plot of rookie salaries";
proc sgplot data = sashelp.baseball;
  bubble x = CrHits y = CrHome size = salary;
  where yrmajor le 1;
run;

Trivariate and multivariate graphics Some continuous Coplot

[Figure: panel of scatter plots, one per League and Division (AE, AW, NE, NW). x-axis: Career Hits; y-axis: Career Home Runs]

proc sgpanel data = sashelp.baseball;
  panelby div;
  scatter x = CrHits y = CrHome;
run;

Trivariate and multivariate graphics Some continuous Scatter plot matrix with group variable - example

[Figure: "Several statistics by league and division - 5 years or less". Scatterplot matrix of Career Hits, Career Home Runs, 1987 Salary in $ Thousands and Career Times at Bat, grouped by League and Division (AE, AW, NE, NW)]

proc sgscatter data = sashelp.baseball;
  title "Several statistics by league and division - 5 years or less";
  matrix CrHits CrHome Salary CrAtBat / group = div diagonal = (kernel);
  where yrmajor le 5;
run;

Trivariate and multivariate graphics Some continuous Scatter plot matrix - another example

proc sgscatter data = sashelp.baseball;
  plot (salary) * (nHits nHome NBB nAssts) /
    markerattrs = (symbol = circlefilled size = 8)
    loess
    colorresponse = yrmajor
    colormodel = twocolorramp;
run;

Trivariate and multivariate graphics None continuous Introduction

When all variables are categorical, generalizations of the mosaic plot can be used. Time series data

Time series data

Electrical workers over time

[Figure: "Series Values for ELECTRIC" from the TIMESERIES procedure. x-axis: DATE (1977 to 1982); y-axis: electrical workers, thousands]

Time series data

Electrical workers over time

[Figure: "Seasonal Decomposition/Adjustment for ELECTRIC" from the TIMESERIES procedure. Four panels over 1977 to 1982: Trend-Cycle, Seasonal-Irregular, Irregular, Seasonally Adjusted]

Time series data

Electrical workers over time - SAS code

title "Timeseries decomposition";
proc timeseries data = sashelp.workers out = _null_ plots = (series decomp);
  id date interval = month;
  var electric;
run;

Exercises and further reading

Exercises and further reading

Describe a set of variables and say what graph you would use for it and why Exercises and further reading

Discussion Exercises and further reading

Further reading - blog links

Parallel box plots http://www.statisticalanalysisconsulting.com/graphics-for-bivariate-data-parallel-box-plots/
Pie is delicious but not nutritious http://www.statisticalanalysisconsulting.com/graphics-for-univariate-data-pie-is-delicious-but-not-nutritious/
Scatterplots http://www.statisticalanalysisconsulting.com/scatterplots-and-enhancements/
Graphics: The good, the bad and the ugly http://www.statisticalanalysisconsulting.com/graphics-the-good-the-bad-and-the-ugly/

Exercises and further reading

Further reading - books

Creating more effective graphs by Naomi Robbins
Visualizing data by William S. Cleveland
The elements of graphing data by William S. Cleveland
A trout in the milk by Howard Wainer

Part IV

Inferential statistics From sample to population

A population is the entire set of all the subjects (people or whatever) that you want to study. A sample is a subset of that population. A random sample is a sample where all subjects have a definable chance of being selected Null and alternative hypotheses

The null hypothesis is usually "nothing is going on" The alternative is "something is going on" Trial analogy What is a p value?

Definition If, in the population from which this sample was randomly drawn, the null were strictly true, what is the probability of getting a test statistic at least as extreme as the one we got, in a sample the size of the one we have?

In other words, if we do 1000 really silly things, what proportion will come out significant?

Experiments vs. observational studies

In an experiment, subjects are randomly selected and then randomly assigned to a condition
In an observational study, neither of these is true
Some people use the term quasi-experiment when one of the above is true

Problems

Not usually the question we want to ask Strongly affected by sample size The Bayesian approach

Idea
Set a prior - often a uniform prior
Let data modify it.

Advantages
More intuitive
Lets you have a prior

Disadvantages
Hard to set a prior
Uninformed prior usually gives similar results to frequentist approach
Still not the question we are interested in

What we want

Effect sizes and measures of their accuracy
Risk reward analysis

Further reading

The Insignificance of Statistical Significance Testing by Douglas Johnson
The Cult of Statistical Significance by Stephen Ziliak and Deirdre McCloskey

Part V

The regression family Introduction

Outline

11 Introduction

12 The OLS model

13 Other models for continuous DV

14 The logistic family

15 Count models

16 Multilevel models

17 Exercises and further reading Introduction

What is regression?

Regression is a term for a variety of models relating dependent variables (usually just one) to one or more independent variables. Introduction

Varieties of regression

The type of regression depends on the nature of the dependent variable and on the nature of the relationships.
Continuous - OLS and alternatives (see below)
Dichotomous - Logistic
Categorical (>2 levels) - Multinomial logistic
Ordinal - ordinal logistic
Count - Poisson, negative binomial and variations
Time to event - survival models

The OLS model

The OLS model

What it is

Ordinary least squares is the most common regression model, and it is what people usually mean when they say 'regression'. The model is

$$Y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p + e$$

where e is error and is normally distributed with 0 mean and constant variance.

The OLS model
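A minimal sketch of fitting an OLS model in SAS, assuming PROC REG and an illustrative set of predictors from sashelp.baseball. The predictor list is an assumption made for the example, not a model from the course. The VIF option and the diagnostics panel are one way to start looking for collinearity and badly behaved residuals; the plot request assumes ODS Graphics is enabled.

```sas
/* OLS regression of 1987 salary on a few player attributes.    */
/* VIF: variance inflation factors, a quick collinearity check. */
/* PLOTS = DIAGNOSTICS: residual and fit diagnostics panel.     */
proc reg data = sashelp.baseball plots = diagnostics;
  model salary = YrMajor nAtBat nHits nHome nOuts / vif;
run;
quit;
```

PROC REG is interactive, so the QUIT statement is what ends the procedure.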

What can go wrong

Overfitting
Nonlinear fits
Nonnormal residuals
Dependent data
Collinearity

Other models for continuous DV

Other models for continuous DV

Introduction

Multivariate adaptive regression splines (MARS) - PROC ADAPTIVEREG
Quantile regression - PROC QUANTREG
Transformations - PROC TRANSREG
More information: See my paper at SGF 2015.

Other models for continuous DV MARS Introduction

MARS models allow extremely flexible curves (called splines) to be fit to data. MARS models are most useful
In high dimensional spaces
When there is little substantive reason to assume linearity or a low-level polynomial fit

Other models for continuous DV MARS Advantages and disadvantages of MARS models

Advantages of MARS models:
Very flexible fitting of the relationship between independent and dependent variables
Model selection methods that can sharply reduce the dimension of the model.
SAS implementation of these models extends them to dependent variables in the exponential family.
Can be more accurate than GLM, with greater parsimony

Disadvantages of MARS models:
Hard to interpret
Less familiar

Other models for continuous DV MARS Example

I modeled baseball salary as a function of various attributes of the players. ADAPTIVEREG got a significantly higher R² with considerably fewer terms. But the result is very hard to interpret.

proc adaptivereg data = sashelp.baseball plots = all details = bases;
  class team;
  model salary = YrMajor nAtBat nHits nHome nOuts;
run;

Other models for continuous DV Quantile regression Introduction

There are at least three motivations for quantile regression:
DV is bimodal or multimodal
Highly skewed DV
Substantive interest in the quantiles

Advantages include:
No assumptions about the distribution of the residuals
More flexible hypotheses

Disadvantages include:
Not as powerful as OLS regression when that is the appropriate model
Not robust to high leverage points.

Other models for continuous DV Quantile regression Example

A quantile regression of baseball salary:

proc quantreg data = sashelp.baseball plots = all;
  model salary = YrMajor nAtBat nHits nHome nOuts / quantile = (0.1, 0.5, 0.9);
run;

revealed that the relationship between salary and various player attributes was different at different levels of salary, e.g. number of home runs was more important at high levels of salary. But this should be viewed with caution because of high leverage points.

Other models for continuous DV TRANSREG Introduction

Sometimes it makes sense to transform one or more variables. Can do in data step but PROC TRANSREG offers many options and allows automation of some tasks Some transformations (e.g. splines) are hard or impossible in data step TRANSREG is very flexible and allows optimal fitting. Other models for continuous DV TRANSREG Example

A spline regression of baseball salary

proc transreg data = sashelp.baseball plots = all;
  model identity(salary) = spline(YrMajor nAtBat nHits nHome nOuts);
run;

showed non-monotonic relationships between salary and performance

Other models for continuous DV TRANSREG

The logistic family

The logistic family

Introduction

When the dependent variable is categorical (either dichotomous, nominal or ordinal) OLS regression is not recommended because
The assumption of normal residuals is violated
The predicted values can be ludicrous

The usual method for these cases is logistic regression (either 'normal', multinomial or ordinal). The key output is odds ratio estimates.

The logistic family

What are odds ratios?

In OLS regression the dependent variable is continuous. In logistic, it’s not. How do we go from a 0 - 1 response to a continuous one from −∞ to ∞?
Find odds of something happening for each level of each IV, e.g. odds of men and women voting for Obama. That goes from 0 to ∞
Take ratio of the odds. That goes from 0 to ∞ as well.
Take log of the ratio for modeling. That goes from −∞ to ∞
But the OR is easier to interpret

The logistic family
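The chain from probability to odds to odds ratio to log odds ratio can be written out as follows. The probabilities 0.60 and 0.50 are hypothetical numbers chosen only to make the arithmetic easy; they are not estimates from any data.

```latex
\text{odds} = \frac{p}{1-p}, \qquad
\text{OR} = \frac{p_1/(1-p_1)}{p_2/(1-p_2)}, \qquad
\log(\text{OR}) \in (-\infty, \infty)

% Hypothetical numbers: p_1 = 0.60, p_2 = 0.50
\text{odds}_1 = \frac{0.60}{0.40} = 1.5, \quad
\text{odds}_2 = \frac{0.50}{0.50} = 1.0, \quad
\text{OR} = \frac{1.5}{1.0} = 1.5, \quad
\log(\text{OR}) \approx 0.405
```

An OR of 1.5 reads as "the odds are 50% higher in group 1", which is usually easier to communicate than a log odds ratio of 0.405.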

Logistic regression - examples

Predict or explain purchase of a product vs. no purchase - dichotomous
Predict or explain position on a team - multinomial
Predict or explain likelihood of returning - ordinal

The logistic family

What can go wrong

Coding 0 and 1 incorrectly - be careful which response SAS is modelling
Effect coding. For categorical IVs, SAS defaults to effect coding, but reference coding is often better
Quasi-complete and complete separation - slicing the pie too thin
Concordant and discordant in output don’t mean what they seem to
Need to use SLICE to get interaction odds ratios

The logistic family
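A hedged sketch of how the first two pitfalls can be handled explicitly in PROC LOGISTIC. It assumes the sashelp.heart data set, with Status (Alive or Dead) as the DV; the choice of data set and predictors is illustrative only. EVENT= states which response level is modelled, and PARAM = REF with REF= requests reference coding with a named reference level.

```sas
/* Dichotomous logistic regression with the modelled event   */
/* and the coding of the categorical IV both spelled out.    */
proc logistic data = sashelp.heart;
  class sex (ref = 'Female') / param = ref;     /* reference coding, not effect coding */
  model status (event = 'Dead') = sex AgeAtStart; /* model the probability of 'Dead'    */
run;
```

With these options the odds ratio for sex compares Male with Female, for the odds of Status being Dead.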

Ordinal and multinomial logistic example

When the DV has multiple categories, they can be ordinal or nominal. If ordinal, use PROC LOGISTIC with LINK = clogit. If nominal, use LINK = glogit. Interpretation can be tricky, but is basically a generalization of the dichotomous case.
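A minimal sketch of both cases, assuming the sashelp.heart data set (not used in the original slides). The three-level cholesterol variable chol_level and its cutoffs are hypothetical, created here only to have an ordered DV; DeathCause serves as the nominal DV.

```sas
/* Hypothetical ordered DV: three cholesterol levels (cutoffs are illustrative) */
data heart_ord;
   set sashelp.heart;
   if missing(Cholesterol)   then chol_level = .;
   else if Cholesterol < 200 then chol_level = 1;
   else if Cholesterol < 240 then chol_level = 2;
   else                           chol_level = 3;
run;

/* Ordinal DV: cumulative logit, as named in the text */
proc logistic data = heart_ord;
   model chol_level = AgeAtStart Weight / link = clogit;
run;

/* Nominal DV: generalized logit */
proc logistic data = sashelp.heart;
   class Sex / param = ref;
   model DeathCause = Sex AgeAtStart / link = glogit;
run;
```

In the generalized logit output there is one set of coefficients per non-reference category of the DV, which is part of what makes interpretation tricky.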

Count models

Introduction

When the DV is a count (a non-negative integer), and especially when the counts aren’t very large, OLS is not recommended. Count models such as Poisson or negative binomial regression should be used. PROC GENMOD is used for these analyses.
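A minimal sketch of both models, assuming the sashelp.baseball data used elsewhere in these slides, with home runs in 1986 (nHome) as the count DV. The choice of predictors is for illustration only.

```sas
/* Poisson regression */
proc genmod data = sashelp.baseball;
   model nHome = nAtBat YrMajor / dist = poisson link = log;
run;

/* Negative binomial regression: adds a dispersion parameter */
proc genmod data = sashelp.baseball;
   model nHome = nAtBat YrMajor / dist = negbin link = log;
run;
```

In the goodness-of-fit table for the Poisson fit, a Pearson chi-square divided by its degrees of freedom that is well above 1 is a common sign of overdispersion.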

Examples

- How many cell phones does a person own?
- How many divorces will a person go through?

What can go wrong?

- Overdispersion
- Failure to fit
- Abundance of 0’s - use ZIP (zero-inflated Poisson) or ZINB (zero-inflated negative binomial) models
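A zero-inflated Poisson model can be sketched in PROC GENMOD as below, again assuming sashelp.baseball with nHome as the DV. Which variables go into the zero-inflation part is an assumption made for illustration.

```sas
proc genmod data = sashelp.baseball;
   /* count part of the model */
   model nHome = nAtBat YrMajor / dist = zip;
   /* zero-inflation part: models the probability of an excess zero */
   zeromodel nAtBat / link = logit;
run;
```

Replacing dist = zip with dist = zinb fits the zero-inflated negative binomial version.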

Multilevel models

Introduction

All the regression models above assume independent errors. When this assumption is violated, things can go very wrong. Multilevel models (MLM) are one way to deal with this.

Examples

- Repeated measurements of the same thing on the same people
- Measurements on people who are clustered
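The slides do not name a procedure for multilevel models. One common choice for a continuous DV is PROC MIXED, and a minimal random-intercept sketch is shown here. It assumes the sashelp.baseball data, treating players as clustered within teams; the choice of predictors is for illustration only.

```sas
proc mixed data = sashelp.baseball;
   class Team;
   /* fixed effects for the player-level predictors */
   model logSalary = nHits YrMajor / solution;
   /* random intercept: players on the same team share a team effect */
   random intercept / subject = Team;
run;
```

The covariance parameter estimates split the variance into a between-team part and a residual part.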

Exercises and further reading

Exercises

From your experience, list several regression problems and propose a regression method for each.

Discussion

Further reading - blog links

- Simple linear regression: http://www.statisticalanalysisconsulting.com/what-is-simple-linear-regression/
- Multiple linear regression: http://www.statisticalanalysisconsulting.com/what-is-multiple-linear-regression/
- Survival analysis: http://www.statisticalanalysisconsulting.com/what-is-survival-analysis/
- Alternative methods of regression when OLS is not right: http://support.sas.com/resources/papers/proceedings15/3412-2015.pdf

Further reading - books

- Regression Analysis by Example by Samprit Chatterjee and Ali Hadi
- Regression Models for Categorical and Limited Dependent Variables by J. Scott Long
- Categorical Data Analysis by Alan Agresti

Part VI

Multivariate statistics

Introduction

Sometimes there is no dependent variable, but you want to be able to figure out what is going on in a huge mass of data.

Exploratory factor analysis

Introduction

Factor analysis is a method of finding latent factors in multivariate data. Latent variables are those that can’t be directly measured. Examples:
- Personality scales
- IQ
- Views on complex issues

Steps involved

- Extracting factors - several methods
- Rotation - many methods, in two groups
  - Orthogonal - each factor is uncorrelated with the others; easier to interpret but may not be realistic
  - Oblique - factors can be correlated
- Interpretation - EFA is not determinate; much will depend on interpretation

Example
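To make the orthogonal versus oblique choice concrete, here is a minimal sketch of an oblique rotation, assuming the sashelp.baseball data used elsewhere in these slides. The extraction method, the number of factors and the choice of promax are assumptions made for illustration.

```sas
proc factor data = sashelp.baseball
            method   = principal   /* extraction method */
            nfactors = 2           /* number of factors to keep */
            rotate   = promax;     /* oblique rotation: factors may correlate */
   var nAtBat -- nBB nAssts nOuts nError;
run;
```

With an oblique rotation the output also reports the correlations between factors, which helps in judging whether an orthogonal solution would have been realistic.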

Factor analysis of current statistics showed 2 factors:

    /* r = is short for rotate = ; varimax is an orthogonal rotation */
    proc factor data = sashelp.baseball r = varimax;
       var nAtBat -- nBB nAssts nOuts nError;
    run;

Rotated Factor Pattern

Variable  Label                  Factor1    Factor2
nAtBat    Times at Bat in 1986   0.88078    0.37098
nHits     Hits in 1986           0.87357    0.33843
nHome     Home Runs in 1986      0.81700   -0.19594
nRuns     Runs in 1986           0.91078    0.21618
nRBI      RBIs in 1986           0.92417    0.04853
nBB       Walks in 1986          0.74709    0.09339
nAssts    Assists in 1986        0.03736    0.92947
nOuts     Put Outs in 1986       0.45303   -0.03541
nError    Errors in 1986         0.10152    0.87866

What can go wrong

- GIGO can appear like GIPO - garbage in, pearls out
- No simple structure
- Unclear number of factors

Principal component analysis (PCA)

Introduction

PCA is a dimension reduction method; use it when you have a large number of variables that you want to reduce with minimal loss of information.

What can go wrong

- Components may not make sense
- Components may not be useful for further analysis
- If doing regression, consider partial least squares.

Cluster analysis

Introduction

Cluster analysis is a set of methods for finding groups of observations that go together in ways you are not aware of at the outset. Examples:
- Do patrons of a store fall into groups of people who buy certain items?
- Do politicians fall into groups based on their votes on bills?

Methods

- Agglomerative methods - start with items separate and gradually combine them using
  - A measure of distance
  - A measure of linkage
- K-means methods - assign a number of clusters and a distance measure and let the algorithm do the work

Example
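As a sketch of the k-means approach, PROC FASTCLUS can be run on standardized variables. The standardization step, the data set names bb_std and bb_kmeans, and the choice of three clusters are assumptions made for illustration.

```sas
/* put the variables on a common scale so no single one dominates the distances */
proc stdize data = sashelp.baseball out = bb_std method = std;
   var nAtBat -- nBB nAssts nOuts nError;
run;

/* k-means: the number of clusters is fixed in advance */
proc fastclus data = bb_std maxclusters = 3 out = bb_kmeans;
   var nAtBat -- nBB nAssts nOuts nError;
run;
```

The output data set bb_kmeans contains a CLUSTER variable giving each player's assigned cluster.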

Cluster analysis of the same variables

    proc cluster data = sashelp.baseball method = average
                 ccc pseudo print = 10 outtree = bb4clust;
       var nAtBat -- nBB nAssts nOuts nError;
    run;

Example - continued

showed evidence of 3 clusters:

[Figure: PROC CLUSTER output, Average Linkage Cluster Analysis, "Criteria for the Number of Clusters": CCC, Pseudo F and Pseudo T-Squared plotted against Number of Clusters (2 to 10).]

Example - continued

with the following attributes

[Figure: nine panels, one per variable: Times at Bat in 1986, Hits in 1986, Home Runs in 1986, Runs in 1986, RBIs in 1986, Walks in 1986, Assists in 1986, Put Outs in 1986 and Errors in 1986.]

Multidimensional scaling

Introduction

MDS is a method for figuring out how people judge similarity, or what similarity is based on. There are many options and choices and (relatively) little literature.

Examples

- How do people group politicians?
- How do customers group brands of items?

What can go wrong

- Overfitting - use training and test sets
- Results may not be useful - try different methods

Exercises and further reading

Outline

18 Exercises and further reading

Exercises

Come up with an example of a multivariate method that would be useful in your research or business.

Further reading

Using Multivariate Statistics by Barbara Tabachnick and Linda Fidell

Part VII

Summary and so on

General thoughts

- Statistics and data analysis are not tools to be applied in a rote fashion.
- Data analysis should illuminate a scientific or business phenomenon or attempt to solve a problem.
- The time to consult with a data analyst is as early as possible and as often as possible.

Summary

- Descriptive statistics are a vital first step in any analysis
- Graphical methods are also vital
- Inference allows you to go from a sample to a population, but can have problems
- Regression relates a DV to one or more IVs
- Multivariate statistics allow you to summarize large data sets.

Contact information

Peter Flom
Peter Flom Consulting
www.StatisticalAnalysisConsulting.com
917 488 7176

Thank you!