<<

Data Screening SCS Short Course 1 ' $ ' $

Course Outline

• Part 1: Getting started Failures to screen data Entering and checking raw data Data entry Creating a documented database Data Screening Checking data at input Assessing univariate problems USA Draft Lottery Data Boxplots and outliers 400

Self-Reports of Height and Weight Fusion Times for Stereograms Transformations to symmetry 130

120 300 Normal probability plots

110

VV 100 • Part 2: Assessing bivariate problems 200 G 90 r o u 80 p

Draft Priority value Enhanced scatterplots

70 Reported weight in Kg NV 100 Smoothing relations 60

50 Plotting discrete data Sex of subject F M 0 40 20 40 60 80 100 120 140 160 180 0 1020304050 Jan Mar May Jul Sep Nov Measured weight in Kg Fusion Time Month Transformations to linearity Dealing with non-constant variance • Michael Friendly Part 3: Multivariate problems and missing data Assessing multivariate problems Multivariate normality York University Multivariate outliers SCS Short Course Dealing with missing data Estimation with missing data (EM algorithms) October, 2004 Simple Imputation Multiple Imputation • SAS macro programs: http://www.math.yorku.ca/SCS/sssg/ http://www.math.yorku.ca/SCS/sasmac/ Color versions of these slides: http://www.math.yorku.ca/SCS/Courses/screen/

& % & %

Michael Friendly

Data Screening SCS Short Course 2 Data Screening SCS Short Course 3 ' $ ' $

Failures to Screen Data

Fusion times for random dot stereograms

• Does knowledge of the form of an embedded image affected time required for subjects to fuse the images? Failures to Screen Data • Two group design: Group NV (no visual info), Group VV (visual and verbal info). Data on Self-Reports of height and weight among men and women active in • t-test: t(76) = 1.939,p=0.0562, NS! exercise TTEST PROCEDURE • Regression of reported weight on measured weight gave very different Variable: TIME Fusion Time regressions for men and women Variances T DF Prob>|T| ------• Plotting the data suggested an answer Unequal 2.0384 70.0 0.0453 Equal 1.9395 76.0 0.0562 Self-Reports of Height and Weight Self-Reports of Height and Weight 170 130 • Boxplots show: times are positively skewed, differ in variance, and one 160 120 150 large outlier in the NV group.

140 110 Fusion Times for Stereograms Fusion Times for Stereograms

130 100 120

110 90

100

80 90 VV VV

80 70 G G Reported weight in Kg Measured weight in Kg 70 r r o o 60 u u 60 p p

50 50 40 NV NV Sex of subject F M Sex of subject F M 30 40 40 50 60 70 80 90 100 110 120 130 20 40 60 80 100 120 140 160 180 Reported weight in Kg Measured weight in Kg

0 1020304050 -0.5 0.0 0.5 1.0 1.5 2.0 Fusion Time log Fusion Time • Transforming the raw data to log (time) cured these problems, and led to the opposite conclusion!

See lib.stat.cmu.edu/DASL/Stories/FusionTime.html & % & %

Michael Friendly Michael Friendly Data Screening SCS Short Course 4 Data Screening SCS Short Course 5 ' $ ' $

Entering Raw Data - Basic tools Entering Raw Data - Statistical packages • Ordinary editor (e.g., Notepad, WinEdt, UltraEdit) Easy for small data sets • SPSS Manual alignment of input fields Startup: Newdata window No protection against input errors (wrong type, out of range, etc.) Define variable - name, type (num/char), variable label, missing values • Spreadsheet (e.g., Excel) Import .por (portable file) to SAS Easy for small to moderate sized data sets Data conversion tools (e.g., dbmscopy → SAS) Automatic alignment of input fields Automatic calculation of derived variables, Keyboard macros for repetitive tasks Programmable macros for checking input Import .xls spreadsheet to SAS (File Ð Import) Data conversion tools (e.g., dbmscopy → SAS, SPSS) • Database packages (Access, dBase, etc.) Easy for small to large sized data sets Define fields (type, length, min, max) Values can be verified as entered Import .dbf database to SAS (File Ð Import) Data conversion tools (e.g., dbmscopy → SAS, SPSS)

& % & %

Michael Friendly Michael Friendly

Data Screening SCS Short Course 6 Data Screening SCS Short Course 7 ' $ ' $

• SAS/FSP - Full Screen Product

Design display screen to suit the application Entering Raw Data - Statistical packages Define variable - name, type (num/char), variable label filename psy303 ’~/sasuser/psyc303’; • SAS/Insight proc fsedit data=psy303.class97 screen=psy303.screen97;

Globals Ð Analyze Ð Interactive Data Analysis Ð New Define variables - name, type (num/char), measurement level (interval, nominal), role (group, label, frequency) label

Assign name, type, min, max, required, etc. to each variable Automatic range checking Automatically computed fields asn1 = mean( of A1-A5); asn2 = mean( of A6-A10);

& % & %

Michael Friendly Michael Friendly Data Screening SCS Short Course 8 Data Screening SCS Short Course 9 ' $ ' $

Creating a documented database

• Example: data Andy Allanson ACLEC 293 66 1 30 29 14 1 293 66 1 ... Alan Ashby NHOUC 315 81 7 24 38 39 14 3449 835 69 ... ASEA1B479130 18 66 72 76 3 1624 457 63 ... NMONRF496141 20 65 78 37 11 56281575 225 ... A Galarraga NMON1B321 87 10 39 42 30 2 396 101 12 ... A Griffin AOAKSS594169 4 74 51 35 11 44081133 19 ... Checking data at input

• SAS • Check categorical variables using ’other’ format Assign descriptive labels to variables data baseball(label=’1986 Baseball Hitter Data’); User-defined formats (PROC FORMAT) for variable values input name $1-14 league $15 team $16-18 position ...... /* Formats to specify coding of variables (other=) */ if (put(league, $league.) = ’ ’ then error; proc format; if (put(team, $team.) = ’ ’ then error; value $league if (put(position, $position.) = ’ ’ then error; ’N’ =’National’ ’A’ =’American’ other= ’ ’; ... value $team ’ATL’=’Atlanta ’ ’BAL’=’Baltimore ’ • Check ranges of numeric variables ’BOS’=’Boston ’ ’CAL’=’California ’ ’CHA’=’ A ’ ’CHN’=’Chicago N ’ if !(0 < atbat < 500) then error; ... other = ’ ’; if !(0 < hits < 500) then error; data baseball(label=’1986 Baseball Hitter Data’); if !(0 < years < 50) then error; input name $1-14 league $15 team $16-18 position $19-20 ... atbat 3. hits 3. homer 3. runs 3. rbi 3. walks 3. years 3. atbatc 5. hitsc 4. homerc 4. runsc 4. rbic 4. walksc 4. 4. assists 3. errors 3. salary 4.; label name = "Hitter’s name" atbat = ’Times ’ hits = ’Hits’ homer = ’Home Runs’ runs = ’Runs’ rbi = ’Runs Batted In’ walks= ’Walks’ years = ’Years in Major Leagues’ ... format league $league. team $team.;

& % & %

Michael Friendly Michael Friendly

Data Screening SCS Short Course 10 Data Screening SCS Short Course 11 ' $ ' $

Checking numeric variables - the DATACHK macro

• Uses PROC UNIVARIATE to extract descriptive stats, high/low obs. • Formats output to 5 variables/page Checking variables • Boxplot of standardized scores to show distribution shape, outliers • Lists observations with more than nout (default: 3) extreme z scores, • Descriptive statistics checks |z| > zout (default: 2) SPSS - Frequencies • Example: SAS - PROC UNIVARIATE %include data(baseball); %datachk(data=baseball, id=name, Min, Max, # missing var=salary runs hits rbi atbat homer assists putouts); Mean, median, std. dev, skewness, etc. Use plot option for stem-leaf/boxplot and normal probability plot Documentation: Use ID statement to identify highest/lowest obs. http://www.math.yorku.ca/SCS/sasmac/datachk.html proc univariate plot data=baseball; var atbat -- salary ; Hebb lab (SAS): %webhelp(datachk);

id name; Baseball Hitters Data 5

• Consistency checks (e.g., unmarried teen-aged widows?) Eddie M 4 Don Mattingl

Sid Bream Willie Upsha Steve Balbon LeonWally Durham Joyner KentPete HrbekO’Brien K Hernandez SPSS - Crosstabs 3 Bill Buckner

SAS - PROC FREQ 2 proc freq; tables age * marital; 1 Standard score 0 • But: these can generate too much output! -1

-2

-3 assists atbat hits homer putouts rbi runs salary Variable

& % & %

Michael Friendly Michael Friendly Data Screening SCS Short Course 12 Data Screening SCS Short Course 13 ' $ ' $

Variable Stat Value Extremes Id

ATBAT N 322 16 Miss 0 19 Cliff Johnson Times at Bat Mean 380.9286 19 Terry Kennedy Std 153.405 20 The datachk macro Skew -0.07806 663 677 680 Boxplots of standard scores show the ‘shape’ of each variable, with labels for 687 T Fernandez ------’far-out’ observations. HITS N 322 1 Mike Schmidt Baseball Hitters Data Miss 0 2 Tony Armas Hits Mean 101.0248 3 Doug Baker 5 Std 46.45474 4 Terry Kennedy

Skew 0.291154 Eddie M 211 4 213 T Fernandez Bob Horner Don Mattingl 223 Kirby Puckett Willie Upsha Glenn Davis Steve Balbon LeonWally Durham Joyner KentPete HrbekO’Brien K Hernandez 238 Don Mattingly Von Hayes Steve Garvey ------3 Bill Buckner RBI N 322 0 Doug Baker Eddie Murray Miss 0 0 Mike Schmidt Runs Batted In Mean 48.02795 0 Tony Armas 2 Std 26.16689 2 Bob Boone Skew 0.608377 113 Don Mattingly 116 117 1 121 Joe Carter ------Standard score RUNS N 322 0 Mike Schmidt 0 Miss 0 1 Cliff Johnson Runs Mean 50.90994 1 Doug Baker Std 26.0241 1 Tony Armas Skew 0.415779 -1 108 Joe Carter 117 Don Mattingly 119 Kirby Puckett 130 R Henderson -2 ------SALARY N 263 68 B Robidoux Miss 59 68 Mike Kingery -3 Salary (in 1000$) Mean 535.9658 70 Std 451.104 70 assists atbat hits homer putouts rbi runs salary Skew 1.589077 * Variable 1975 Don Mattingly 2127 Mike Schmidt 2413 2460 Eddie Murray ------& % & %

Michael Friendly Michael Friendly

Data Screening SCS Short Course 14 Data Screening SCS Short Course 15 ' $ ' $

Sidebar: Using SAS macros

• SAS macros are high-level, general programs consisting of a series of DATA steps and PROC steps. Sidebar: Using SAS macros • Keyword arguments substitute your data names, variable names, and E.g., the SYMBOX macro is defined with the following arguments: options for the named macro parameters. symbox.sas ··· • Use as: 1 %macro symbox( 2 data=_last_, /* name of input data set */ %macname(data=dataset, var=variables, ...); 3 var=, /* name(s) of the variable(s) to examine*/ 4 id=, /* name of ID variable */ e.g., 5 out=symout, /* name of output data set */ %boxplot(data=nations, var=imr, class=region, id=nation); 6 orient=V, /* orientation of boxplots: H or V */ 7 powers=-1 -0.5 0 .5 1, /* list of powers to consider */ 8 name=symbox /* name for graph in graphics catalog */ • Most arguments have default values (e.g., data= last ) 9 );

• All SSSG and VCD macros have internal and/or online documentation, Typical use: http://www.math/yorku.ca/SCS/sssg/ 1 %symbox(data=baseball, http://www.math/yorku.ca/SCS/sasmac/ 2 var=Salary Runs, /* analysis variables */ http://www.math/yorku.ca/SCS/vcd/ 3 id=name, /* player ID variable */ 4 powers =-1 -.5 0 .5 1 2); • Macros can be installed in directories automatically searched by SAS. Put the following options statement in your AUTOEXEC.SAS file: options sasautos=(’c:\sasuser\macros’ sasautos);

& % & %

Michael Friendly Michael Friendly Data Screening SCS Short Course 16 Data Screening SCS Short Course 17 ' $ ' $

Boxplots

Boxplots provide a schematic graphical summary of important features of a distribution, including: • Assessing univariate problems the center (mean, median) • the spread of the middle of the data (IQR) • the behavior of the tails • Boxplots • outliers (plotted individually)

• Transformations to symmetry outliers?? Upper fence (not drawn) • Outliers

1.5 IQR • Normal probability plots Max. inside observation

Baseball data (runs): Power plot Baseball data: Salary Q 75th percentile 5 8 3

Trim : 5 % Eddie Murray Slope: 0.55 Jim Rice 7 mean 4 Power: 0.50 +

Mike Schmidt IQR median 6 Don Mattingl Eddie Murray Jim Rice 3 Dave Winfiel K Hernandez R Henderson 5 Q1 25th percentile 2 4

1 3 Min. inside observation

0 2 1.5 IQR Salary (in 1000$) (Std.)

1 -1

0 Centered Mid Value of runs

-2 H Reynolds Kal Daniels John Moses -1 Lower fence (not drawn) DarylK StillwellBoston AndresJeffMike Reed Aldrete Thoma ReyDaleAl Quinones Newman Sveum CurtGlenn Ford Braggs -3 B Robidoux Mike Kingery -2 -1/X -1/Sqrt Log Sqrt Raw Power 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 Squared Spread • Notched boxplots for multiple groups: “Notches” at

± . IQR√ 95% CI Median 1 58 n show approximate 95% confidence intervals around the medians. Medians differ if the notches do not overlap (McGill et al., 1978).

& % & %

Michael Friendly Michael Friendly

Data Screening SCS Short Course 18 Data Screening SCS Short Course 19 ' $ ' $

Boxplots - ANOVA data

Boxplots - Example • Boxplots are particularly useful for comparing groups • ANOVA: Do means differ? 1970 USA Draft Lottery • ANOVA: Assumes equal within-group variance! • Each birth date assigned a “random” priority value for selection to the Example: Survival times of animals (Box and Cox, 1964) military • Animals exposed to one of 3 types of poison • Ordinary scatterplot does not reveal anything unusual • Given one of 4 treatments • Boxplots by month show those born later in the year more likely to be • → 3× 4 design, n =4per group drafted Survival times of animals Poison USA Draft Lottery Data USA Draft Lottery Data 1 2 3 400 400 12.5

300 300 10.0

7.5 200 200

Draft Priority value Draft Priority value 5.0 Survival time (hrs)

100 100

2.5

0 0 0 100 200 300 400 Jan Mar May Jul Sep Nov Brithday (day of year) Month 0 See Friendly (1991), “SAS System for Statistical Graphics” §6.3. 1A 1B 1C 1D 2A 2B 2C 2D 3A 3B 3C 3D Poison-Treatment Group

• Boxplot shows that variance increases with mean (why?)

& % & %

Michael Friendly Michael Friendly Data Screening SCS Short Course 20 Data Screening SCS Short Course 21 ' $ ' $

Boxplots with SAS

• BOXPLOT macro - Graphics plots with many options %boxplot(data=draftusa, class=Month, Boxplots - ANOVA data var=priority, id=day, connect=3, cnotch=red); • PROC BOXPLOT (Version 8) p • Methods we will learn today suggest that power transformations, y → y proc boxplot; plot priority * month; are often useful. • Methods we will learn next week suggest rate = 1 / time • SAS/INSIGHT - Analyze Ð Box Plot, select response as Y, class variable(s) as X. Selecting highlights obs. in all other views. Survival rates of animals Poison 1 2 3

0

-1

-2

-3

-4 Survival rate (-survivors / hr)

-5

-6 1A 1B 1C 1D 2A 2B 2C 2D 3A 3B 3C 3D Poison-Treatment Group

& % & %

Michael Friendly Michael Friendly

Data Screening SCS Short Course 22 Data Screening SCS Short Course 23 ' $ ' $

Transformations – Ladder of Powers

Transformations to symmetry p • Power transformations are of the form x → x .

• Transformations have several uses in data analysis, including: • A useful family of transformations is ladder of powers (Tukey, 1977), making a distribution more symmetric. defined as x → tp(x),  equalizing variability (spreads) across groups.  xp−1 p =0 making the relationship between two variables linear. t x p p( )= (1) log10 xp=0 • These goals often coincide: a transformation that achieves one goal will often help for another (but not always). • Key ideas: 0 • Some tools (Friendly, 1991): log(x) plays the role of x in the family. /p → x p< Understanding the ladder of powers. 1 keeps order of the same for 0. SYMBOX macro - boxplots of data transformed to various powers. • For simplicity, usually use only√ simple integer and half-integer powers SYMPLOT macro - various plots designed to assess symmetry. (sometimes, p =1/3 → 3 x); scale the values to keep results simple. p POWER plot: line with slope b ⇒ y → y , where p =1− b Power Transformation Re-expression (rounded to 0.5). 3 p x BOXCOX macro - for regression model, transform y → y to minimize 3 Cube /100 2 MSE (or maximum likelihood); influence plot shows impact of 2 Square x /10 observations on choice of power (Box and Cox, 1964). 1 NONE (Raw) x p √ BOXGLM macro - for GLM (anova/regression), transform y → y to 1/2 Square root x minimize MSE (or max. likelihood) p 0 Log log10 x BOXTID macro - for regression, transform xi → xi (Box and Tidwell, √ − / x 1962). -1/2 Reciprocal root 10 -1 Reciprocal −100/x

& % & %

Michael Friendly Michael Friendly Data Screening SCS Short Course 24 Data Screening SCS Short Course 25 ' $ ' $

Ladder of Powers – Example

Baseball data - runs • Ladder of Powers – Properties SYMBOX macro - transforms a variable to a list of powers, show standardized scores using the BOXPLOT macro %include data(baseball); • Preserve the order of data values. Larger data values on the original title ’Baseball data: Runs’; scale will be larger on the transformed scale. (That’s why negative powers %symbox(data=baseball, var=Runs, powers =-1 -.5 0 .5 1 2); have their sign reversed.) Baseball data: Runs 5 • They change the spacing of the data values. Powers p<1, such as 4 √ 3 x and log x compress values in the upper tail of the distribution relative 2 p> x2 to low values; powers 1, such as , have the opposite effect, 1 expanding the spacing of values in the upper end relative to the lower end. 0 √ -1

• Shape of the distribution changes systematically with p.If x pulls in R -2 u n the upper tail, log x will do so more strongly, and negative powers will be s -3 -4 stronger still. -5 • Requires all x>0. If some values are negative, add a constant first, i.e., -6 -7 x → tp(x + c) -8 • Has an effect only if the range of x values is moderately large. -9 -10 -1/X -1/Sqrt Log Sqrt Raw 1.5 2 Power √ • runs → runs looks best.

& % & %

Michael Friendly Michael Friendly

Data Screening SCS Short Course 26 Data Screening SCS Short Course 27 ' $ ' $

Ladder of Powers – Example

Baseball data - salary • SYMBOX macro - transforms a variable to a list of powers, show standardized scores using the BOXPLOT macro Plots for assessing symmetry title ’Baseball data: Salary’; %symbox(data=baseball, var=Salary, powers =-1 -.5 0 .5 1, id=name); Upper vs. lower plots Baseball data: Salary 5 • In a symmetric distribution, the distances of points at the lower end to the

Eddie Murray median should match the distances of corresponding points in the upper Jim Rice 4

Mike Schmidt end to the median.

Don Mattingl Eddie Murray Ozzie Smith Jim Rice Dale Murphy Gary Carter 3 Dave Winfiel K Hernandez R Henderson Lower distance to median = Upper distance to median Wade Boggs 2 Med − x(i) = x(n+1−i) − Med

1

f(X) f(X) 0.2 0.2 0 Symmetric + Skew Med Salary (in 1000$) (Std.) P25 P25 P75 Med P10 -1 0.1 0.1 P75 P10 P90

P90 -2 H Reynolds Kal Daniels John Moses DarylK StillwellBoston AndresJeffMike Reed Aldrete Thoma ReyDaleAl Quinones Newman Sveum CurtGlenn Ford Braggs B Robidoux Mike Kingery 0.0 0.0 -3 0 5 10 15 0 5 10 15 -1/X -1/Sqrt Log Sqrt Raw X X Power

• salary → log(salary) looks best.

See http://www.math.yorku.ca/SCS/sasmac/symbox.html

& % & %

Michael Friendly Michael Friendly Data Screening SCS Short Course 28 Data Screening SCS Short Course 29 ' $ ' $

• SYMPLOT macro - Upper vs. lower plot (plot=UPLO). Points should plot Plots for assessing symmetry as a straight line with slope = 1 in a symmetric distribution. title ’Baseball data (runs): Upper vs. Lower plot’; %symplot(data=baseball, var=runs, plot=uplo); Untilting: Mid vs. spread plots

Baseball data (runs): Upper vs. Lower plot • In the Upper vs. Lower plot we must judge departure from symmetry by

50 divergence from the line y = x.

• Change coordinates, so that the reference line for symmetry becomes 40 o horizontal. Rotate 45 , by plotting:

mid ≡ [x(n+1−i) + x(i)]/2 vs. x(n+1−i) − x(i) ≡ spread 30

• In a symmetric distribution, each mid value should equal the median

20 [x(n+1−i) + x(i)]/2=Median

• i i

Upper distance to median 10 Mid values will increase with in a +-skewed distribution, decrease with in a −-skewed distribution. f(X) f(X) 0.2 Symmetric 0.2 + Skew Med 0 P25 P25 P75 Med P10 0 1020304050

Lower distance to median 0.1 mid25 0.1 mid25 P75 P10 P90 mid10 mid10 P90 • For skewed distributions, the points will tend to rise above the line (positive mid5 mid5

0.0 0.0 skew) or fall below (negative skew). 0 5 10 15 0 5 10 15 X X

& % & %

Michael Friendly Michael Friendly

Data Screening SCS Short Course 30 Data Screening SCS Short Course 31 ' $ ' $

• SYMPLOT macro - Mid vs. spread plots (plot=MIDSPRD). Points should plot as a horizontal line with slope = 0 in a symmetric distribution. title ’Baseball data (runs): Mid - Spread plot’; %symplot(data=baseball, var=runs, plot=midsprd); Plots for assessing symmetry Baseball data (runs): Mid - Spread plot

56 2 Power plot: Mid vs. z plots

55 • Emerson and Stoto (1982) suggest a variation of the Mid vs. Spread plot, 54 scaled so that a slope, b indicates the power p =1− b for a 53 transformation to approximate symmetry.

52 • In this display, we plot the centered mid value,

51 x(i) + x(n+1−i) − M 50 2

Mid value of runs 49 against a squared measure of spread,

48 2 2 2 2 [M − x(i)] +[x(n+1−i) − M] z2 ≡ Lower + Upper 47 = 4M 4M

46 • SYMPLOT macro - Power plots (plot=power). Points should plot as a 0 1020304050607080 SPREAD horizontal line with slope = 0 in a symmetric distribution. title ’Baseball data (runs): Power plot’; • Because the plot is untilted (slope = 0) when the distribution is symmetric, %symplot(data=baseball, var=runs, plot=power); expansion of the vertical scale allows us to see systematic departures from flatness far more clearly.

& % & %

Michael Friendly Michael Friendly Data Screening SCS Short Course 32 Data Screening SCS Short Course 33 ' $ ' $

Baseball data (runs): Power plot Normal probability plots 8

Trim : 5 % Slope: 0.55 7 Power: 0.50 • Compare observed distribution some theoretical distribution (e.g., the 6 normal or Gaussian distribution)

5 • Ordinary histograms not particularly useful for this, because

4 they use arbitrary bins (class intervals) 3 they lose resolution in the tails (where differences are likely)

2 the standard for comparison is a curve

1 • Quantile-comparison plots (Q-Q plots) plot the quantiles of the data

0

Centered Mid Value of runs against corresponding quantiles in the theoretical distribution, i.e.,

-1 −1 x(i) vs. zi =Φ (pi)

-2 i−1/2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 where x(i) is the i-th sorted data value, having a proportion, pi = Squared Spread n −1 of the observations below it, and zi =Φ (pi) is the corresponding • Symmetry is indicated by a line with slope=0 and intercept=0. quantile in the normal distribution. • p − b The SYMPLOT macro rounds =1 to the nearest half-integer. • When the data follows the normal distribution, the points in such a plot will

• It is often useful to exclude (trim) the highests/lowest 5Ð10% of follow a straight line with slope = 1.

observations for automatic diagnosis. • Departures from the line shows how the data differ from the assumed

See http://www.math.yorku.ca/SCS/sssg/symplot.html distribution.

& % & %

Michael Friendly Michael Friendly

Data Screening SCS Short Course 34 Data Screening SCS Short Course 35 ' $ ' $

Normal probability plots Normal probability plots

Patterns of deviation for Normal Q-Q plots: • De-trended plots show the deviations more clearly • Postive (negative) skewed: Both tails above (below) the comparison line • Plot x(i) − zi vs. zi. • Heavy tailed: Lower tail below, upper tail above the comparison line

1.0 4 0.9 (a) Normal (0,1) data (b) Normal (0,1) with outliers 4 7 (a) Normal (0,1) data (b) Normal (0,1) with outliers 0.8 6 0.7 3 0.6 3 5 0.5 0.4 2 4 0.3 0.2 2 1 3 0.1 0.0 2 -0.1 0 -0.2 1 1 -0.3 Data Value Data Value -1 0 -0.4 Deviation From Normal Deviation From Normal -0.5 -1 -0.6 0 -2 -0.7 -2 -0.8 -3 -0.9 -3 -1.0 -1 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 -4 -4 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 Normal Quantile Normal Quantile Normal Quantile Normal Quantile 3 13 (c) Chi Square (2) data (d) Student t (2) data 8 20 12 (c) Chi Square (2) data (d) Student t (2) data 11 7 10 2 6 9 8 5 7 4 10 1 6 5 3 4 2 0 3 1 2 Data Value Data Value 1 0 0 Deviation From Normal Deviation From Normal 0 -1 -1 -1 -2 -2 -3 -3 -2 -4 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 -4 -10 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 Normal Quantile Normal Quantile Normal Quantile Normal Quantile

& % & %

Michael Friendly Michael Friendly Data Screening SCS Short Course 36 Data Screening SCS Short Course 37 ' $ ' $

Normal probability plots Normal probability plots: confidence bands

Baseball data - salary • Points in a Q-Q plot are not equally variable—observations in the tails vary • Raw data most for normal. %nqplot(data=baseball, var=salary); • s z z Calculate estimated standard error, ˆ( i), of the ordinate i and plot 3000 1000 z ± s z 900 curves showing the interval i 2ˆ( i) to give approximate 95% 800 2000 700

confidence intervals. (Chambers et al. (1983) provide formulas.) 600

500 1000 σ p − p 400 s z ˆ i (1 i) 300 ˆ( i)= 200 f z n 0 ( i) 100 Salary (in 1000$) 0 Deviation From Normal

-100 • Confidence bands help to judge how well the data follow the assumed -1000 -200 distribution -300 -2000 -400 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 700 400 Normal Quantile Normal Quantile • 600 Tr y log salary — better, but not perfect (who is?)

500 300 data baseball; set baseball; 400 label logsal = ’log 1986 salary’;

300 200 logsal = log(salary); %nqplot(data=baseball, var=logsal); 200 10 2

100 100 9 Infant Mortality Rate Deviation From Normal

0 8 1

-100 0 7

6 0 -200

5

-300 -100 log 1986 salary -3-2-10123-3 -2 -1 0 1 2 3 4 Deviation From Normal -1 Normal Quantile Normal Quantile

3

See http://www.math.yorku.ca/SCS/sssg/nqplot.html 2 -2 -3 -2 -1 0 1 2 3 -3-2-10123 Normal Quantile Normal Quantile

& % & %

Michael Friendly Michael Friendly

Data Screening SCS Short Course 107 ' $

References

Box, G. E. P. and Cox, D. R. An analysis of transformations (with discussion). Journal of the Royal Statistical Society, Series B, 26:211Ð252, 1964. 19, 22

Box, G. E. P. and Tidwell, P. W. Transformation of the independent variables. Technometrics, 4:531Ð550, 1962. 22

Chambers, J. M., Cleveland, W. S., Kleiner, B., and Tukey, P. A. Graphical Methods for Data Analysis. Wadsworth, Belmont, CA, 1983. 36

Emerson, J. D. and Stoto, M. A. Exploratory methods for choosing power transformations. Journal of the American Statistical Association, 77:103Ð108, 1982. 31

Friendly, M. SAS System for Statistical Graphics. SAS Institute, Cary, NC, 1st edition, 1991. 18, 22

Gnanadesikan, R. and Kettenring, J. R. Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics, 28:81Ð124, 1972.

Little, R. J. A. and Rubin, D. B. Statistical Analysis with Missing Data. John Wiley and Sons, New York, 1987.

McGill, R., Tukey, J. W., and Larsen, W. Variations of box plots. The American Statistician, 32:12Ð16, 1978. 17

Rubin, D. B. Multiple Imputation for Nonresponse in Surveys. John Wiley and Sons, New York, 1987.

Schafer, J. L. Analysis of Incomplete Multivariate Data. Chapman & Hall, London, 1997.

Tukey, J. W. Exploratory Data Analysis. Addison Wesley, Reading, MA, 1977. 23

& %

Michael Friendly