Lies, Damn Lies and ... SAS to the Rescue!

Peter L. Flom
Peter Flom Consulting
SESUG, September 2015

Broad outline

1 Introduction
2 Descriptive statistics
3 Descriptive graphics
4 Inferential statistics
5 The regression family
6 Multivariate statistics

Part I: Introduction

Schedule

     8:00  Descriptive statistics
     8:40  Break
     8:50  Descriptive graphics
     9:30  Break
     9:40  Inferential statistics
    10:10  Break
    10:20  The regression family
    11:00  Break
    11:10  Multivariate statistics
    12:00  End

Introductions of participants and self

What I plan to do in this course

- Give you a fundamental understanding of some basic statistical methods
- Give you a very brief survey of many more advanced methods
- Help you learn to work with statistics and statisticians
- Give you some SAS code you can give to others

Note that the most important stuff is at the beginning, so ask questions!

What I don’t plan to do in this course

- Teach you to be a statistician
- Teach you SAS

What I want from you

- Attention
- Questions: after all, it’s not even graded
- Feedback: even anonymous is OK, after the course

Part II: Descriptive statistics

Outline

1 Introduction
2 Measures of central tendency
3 Measures of spread
4 Other measures

Introduction

Descriptive vs. inferential stats

- Descriptive statistics describe a variable or a sample.
- Inferential statistics let you infer from a sample to a population (more later).
- Descriptive statistics are necessary even when your goal is inference.

Types of descriptive statistics

For continuous variables, descriptive statistics include

- Measures of central tendency
- Measures of dispersion or spread
- Measures of skewness
- Measures of kurtosis
- Other measures

For categorical variables, we are mostly limited to frequencies.

Measures of central tendency

The mean: What it is

Definition: The mean is the ordinary average.
Add up the numbers and divide by the number of numbers. Or, if you want a formula,

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

where $x$ is the variable and there are $n$ values of the variable.

The mean: What can go wrong

- Outliers
- Skewness
- The clock problem
- The rate problem
- Different scales

The mean: Mean salary

    proc means data = sashelp.baseball maxdec = 2;
      var salary;
    run;

    Analysis Variable : Salary  1987 Salary in $ Thousands

      N      Mean    Std Dev    Minimum    Maximum
    263    535.93     451.12      67.50    2460.00

The mean: Alternatives

1 The median
2 The trimmed mean and Winsorized mean
3 The geometric mean
4 The harmonic mean

which are the topics of the next few slides.

The median: Median salary

The median is simply the value that divides the distribution in half: half are lower, half are higher.

    ods select BasicMeasures;
    proc univariate data = sashelp.baseball;
      var salary;
    run;

    Basic Statistical Measures

    Location                 Variability
    Mean      535.9259       Std Deviation          451.11868
    Median    425.0000       Variance                  203508
    Mode      750.0000       Range                       2393
                             Interquartile Range    560.00000

The median: What can go wrong

- Sometimes we want the outliers
- When there are many ties, the median may not be completely determined.

The trimmed mean and Winsorized mean: What it is

A compromise between the mean and the median.

To calculate the trimmed mean, you remove a certain percentage of the highest and lowest points and then find the mean of what remains.

The Winsorized mean is similar but, rather than deleting the points, you set them equal to the lowest or highest values that are not extreme.

The trimmed mean and Winsorized mean: What can go wrong

If the distribution is skewed, the trimmed mean is not an unbiased estimator for either the mean or median.
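
The trimming and Winsorizing steps described above can be checked by hand on a tiny data set. The following is a minimal sketch, not part of the course materials: the data set `toy` and its ten values are made up for illustration, and the integer form of `trimmed=` and `winsorized=` (a count per tail rather than a proportion) is used so the arithmetic is easy to follow.

```sas
/* Toy data: ten made-up values with one obvious outlier (100). */
data toy;
  input x @@;
  datalines;
1 3 4 5 6 7 8 9 12 100
;

/* An integer for TRIMMED= / WINSORIZED= is the number of observations
   handled in EACH tail; a fraction below .5 would be a proportion. */
ods select BasicMeasures TrimmedMeans WinsorizedMeans;
proc univariate data = toy trimmed = 1 winsorized = 1;
  var x;
run;
```

Working the same numbers by hand: the ordinary mean is 155 / 10 = 15.5 and the median is (6 + 7) / 2 = 6.5. Trimming one value from each tail drops 1 and 100, leaving 54 / 8 = 6.75. Winsorizing one value in each tail replaces 1 with 3 and 100 with 12, giving 69 / 10 = 6.9. Both compromise estimates sit close to the median, while the mean is pulled far toward the outlier.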
The trimmed mean and Winsorized mean: Trimmed and Winsorized mean salary

    ods select TrimmedMeans WinsorizedMeans;
    proc univariate data = sashelp.baseball trimmed = .1 winsorized = .1;
      var salary;
    run;

    Trimmed per tail
        %     N       SE       SE    Trimmed mean    Winsorized mean
    10.27    27    25.08    25.09          463.89             486.04

The geometric mean: What it is

Definition: It’s like the mean, except instead of adding the numbers and then dividing by the count, you multiply the numbers and take the nth root of the product. Or, if you want a formula,

$$\left( \prod_{i=1}^{n} x_i \right)^{1/n}$$

The geometric mean: What can go wrong

Doesn’t work when any value is 0 or negative.

The geometric mean: When to use it

- Useful for combining measures on different scales, e.g. candidates for college: combine SAT (0 to 1600) and HS GPA (0 to 4)
- Proportional growth over a series of times

The geometric mean: Geometric mean of college applicants

    data college;
      input name $ GPA SAT @@;
      datalines;
    Jill 3.0 1550 Joe 4.0 1500
    ;

    data college;
      set college;
      gmean = geomean(GPA, SAT);
      amean = mean(GPA, SAT);
    run;

    proc print data = college;
    run;

    Obs    name    GPA     SAT      gmean    amean
      1    Jill      3    1550    68.1909    776.5
      2    Joe       4    1500    77.4597    752.0

The harmonic mean: Harmonic mean of round trip travel

Definition: It is the reciprocal of the arithmetic mean of the reciprocals of a set of numbers.

$$H = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$$

The harmonic mean: When to use it

- Averaging rates, such as speeds or batting averages
- Averaging ratios, such as price-earnings ratios

The harmonic mean: What can go wrong

Like the geometric mean, it doesn’t work with negative numbers or 0’s.
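
One way to respect the positive-values restriction is to check the data before calling the functions. This is a minimal sketch under stated assumptions: the data set `rates`, the variables `x1`-`x3`, and the `note` variable are hypothetical names made up for illustration, and the check shown is just one possible guard.

```sas
/* Made-up data: row 1 is all positive, row 2 has a 0, row 3 a negative. */
data rates;
  input x1 x2 x3;
  datalines;
50 80 40
50 0 40
50 -80 40
;

data rates_checked;
  set rates;
  /* Compute the means only when every value is strictly positive. */
  if min(of x1-x3) > 0 then do;
    gmean = geomean(of x1-x3);
    hmean = harmean(of x1-x3);
  end;
  /* Otherwise leave both means missing and record why. */
  else note = 'nonpositive value';
run;

proc print data = rates_checked;
run;
```

Flagging the row, rather than letting the function decide what to return, makes the problem visible in the printed data instead of only in the log.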
The harmonic mean: SAS code

    data speed;
      input To From @@;
      datalines;
    50 80 40 70
    ;

    data speed;
      set speed;
      hmean = harmean(to, from);
      amean = mean(to, from);
      time = 100 / to + 100 / from;   /* total time for a distance of 100 each way */
      actualspeed = 200 / time;       /* total distance divided by total time */
    run;

    proc print data = speed;
    run;

    Obs    To    From    hmean    amean    time    actualspeed
      1    50      80    61.54       65    3.25          61.54
      2    40      70    50.91       55    3.93          50.91

Exercises

1 Name 3 variables for which the mean would not be appropriate.
2 For each of those, decide which measure of central tendency would be appropriate, and why.

Measures of spread

Standard deviation: What it is

Definition: The standard deviation is the square root of the average squared difference between the mean and the individual values (with $n - 1$ rather than $n$ in the denominator). Or

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Standard deviation: What can go wrong

If the mean isn’t a good measure of central tendency, the SD isn’t a good measure of spread.

Standard deviation: SD of salary

    ods select BasicMeasures;
    proc univariate data = sashelp.baseball;
      var salary;
    run;

    Basic Statistical Measures

    Location                 Variability
    Mean      535.9259       Std Deviation          451.11868
    Median    425.0000       Variance                  203508
    Mode      750.0000       Range                       2393
                             Interquartile Range    560.00000

Standard deviation: Alternatives

- Median absolute deviation (MAD)
- Range and interquartile range
- More quantiles
- Gini’s mean difference
- Variations on MAD

(also see graphics, later)

MAD: What it is

Definition: The median absolute deviation is what it says:

1 Find the median
2 Find each value’s deviation from the median
3 Take absolute values
4 Find the median of those

MAD: What can go wrong?
- Not very efficient
- Not appropriate with asymmetric distributions

Range and interquartile range: What it is

- The range is the distance from the smallest value to the largest value.
- The IQR is the distance from the 1st quartile to the 3rd quartile.

Range and interquartile range: What can go wrong

- The range is strongly affected by even a single outlier
- The IQR is not affected at all by outliers

Range and interquartile range: SAS code

    ods select RobustScale;
    proc univariate data = sashelp.baseball RobustScale;
      var salary;
    run;

    Robust Measures of Scale

    Measure                      Value    Estimate of Sigma
    Interquartile Range       560.0000             415.1285
    Gini’s Mean Difference    468.0400             414.7897
    MAD                       275.0000             407.7150
    Sn                        381.6320             382.9424
    Qn                        327.7303             325.9949

Exercises

List 3 variables that would not be well analyzed by the SD and suggest alternatives.

Other measures

Skewness: What it is

Definition: Skewness is the asymmetry of the distribution.

$$\frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left[ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^{3/2}}$$

Skewness can take on any value: negative means left skew, positive means right skew, and 0 is the value for a symmetrical distribution.

Skewness: Alternatives and problems

One good way to look at skewness is with density plots (to be covered later).

What can go wrong

- A single outlier can generate skewness.
- Again, if the mean is not an appropriate measure of central tendency, this is not an appropriate measure of skew.

Skewness: Skewness of salary

    ods select Moments;
    proc univariate data = sashelp.baseball;
      var salary;
    run;

    Moments

    N                     263       Sum Weights                263
    Mean               535.93       Sum Observations    140948.507
    Std Deviation      451.12       Variance            203508.064
    Skewness             1.59       Kurtosis            3.05896473
    Coeff Variation     84.18       Std Error Mean           27.82

Kurtosis: What it is

It is a measure of the peakedness of the distribution.
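
Kurtosis can be requested next to skewness in a single step. This is a minimal sketch, not taken from the slides: it uses the `skewness` and `kurtosis` statistic keywords of PROC MEANS on the same `sashelp.baseball` salary variable used throughout.

```sas
/* Request shape statistics alongside the usual location and spread. */
proc means data = sashelp.baseball n mean std skewness kurtosis maxdec = 2;
  var salary;
run;
```

With default settings, SAS reports excess kurtosis, so a normal distribution scores about 0 rather than 3. Assuming the defaults are unchanged, the value printed here should agree with the Kurtosis entry in the Moments table from PROC UNIVARIATE.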
