
116 Beginning Tutorials

Understanding PROC UNIVARIATE Statistics

Wendell F. Refior, Paul Revere Insurance Group

Introduction

The purpose of this presentation is to explain the basic statistics reported in PROC UNIVARIATE. An intuitive and graphical approach to data analysis is encouraged, but to conserve space this text limits the figures that will be used in the oral presentation. The author advises that data inspection and cleaning should precede statistical analysis and description. John W. Tukey's Exploratory Data Analysis tools are very helpful for preliminary data inspection. If the values of your data set do not conform to a bell-shaped curve, as does the normal distribution, the question of transforming the values prior to analysis is raised. Key summary statistics are covered next. Brief mention will be made of the problem of statistical vs. practical significance, and of the topics of statistical estimation and hypothesis testing. The Schlotzhauer and Littell text, SAS® System for Elementary Analysis, is highly recommended and was helpful to the author.

Overview

The EXTREMES and QUANTILES sections of the PROC UNIVARIATE output are covered first, with a focus on preliminary data inspection. The BOXPLOT and HISTOGRAM (stem and leaf) plots of the data, advocated by Tukey, are introduced next as useful for spotting possible data errors or unusual characteristics in the distribution of data values. The MOMENTS section, dealing with common summary statistics, is then covered. The normal distribution and the normal probability plot are then explained. The topic of testing for normality of the distribution is taken up before a brief mention of statistical testing and estimation.

Data Inspection vs. Description

The careful data analyst will thoroughly inspect the data and correct data errors whenever it is both possible and practical to do so. For this reason, the Missing Value Section and Extremes Section are presented before other statistical concepts. Checking and cleaning the data could save both programmers and analysts from wasting days or weeks of research effort when severe data errors would otherwise require a complete re-analysis of the results. Note that PROC FREQ is another good procedure to use for this purpose.

Missing Value Section

This section is displayed only if there is at least one missing value.

MISSING VALUE - This shows which character represents a missing value. For numeric variables, the default missing value is of course the decimal point or "dot". Other values ranging from .A to .Z may have been used.

COUNT - The number shown by COUNT is the number of occurrences of missing. Note that the NMISS output variable contains this number, if requested.

% COUNT/NOBS - The number displayed is the percentage of all observations for which the analysis variable has a missing value. It may be calculated as follows:

ROUND(100 * (NMISS / (N + NMISS)), .01)

Extremes Section

It is important to check these values if you have any doubt that they are legitimate values for the analysis variable in your data set. Obviously, corrections should be made before valuable time is spent interpreting the other statistical results.

LOWEST - The five (5) lowest non-missing data values are listed.

HIGHEST - The five (5) highest data values are listed.

Quantiles Section

100% MAX - The highest or maximum value is shown here, the 100-th percentile.

75% Q3 - The third quartile or 75-th percentile is shown here.

50% MED - The median or 50-th percentile is shown here, the second quartile.

25% Q1 - The first quartile or 25-th percentile is shown here.

0% MIN - The lowest or minimum value is shown here, the zero-th percentile.
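The Missing Value, Extremes, and Quantiles figures described above can be sketched in a few lines of Python. This is only an illustration, not how PROC UNIVARIATE computes them: the function name `inspect_variable`, the dictionary keys, and the use of `None` for a missing value are assumptions made for the example, and the quartiles use one simple interpolation rule, whereas PROC UNIVARIATE offers several definitions that may give slightly different answers.

```python
import statistics


def inspect_variable(values):
    """Return missing-value, extreme-value, and quantile summaries.

    None marks a missing value here (a stand-in for the SAS "dot").
    At least two non-missing values are required.
    """
    present = sorted(v for v in values if v is not None)
    n = len(present)          # N: non-missing observations
    nmiss = len(values) - n   # NMISS: missing observations
    # % COUNT/NOBS, mirroring ROUND(100 * (NMISS / (N + NMISS)), .01)
    pct_missing = round(100 * nmiss / (n + nmiss), 2)
    # Quartiles by linear interpolation between the sorted values.
    # This is an assumed definition chosen for simplicity.
    q1, med, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "N": n,
        "NMISS": nmiss,
        "PCT_MISSING": pct_missing,
        "LOWEST": present[:5],    # five lowest non-missing values
        "HIGHEST": present[-5:],  # five highest values
        "MAX": present[-1],
        "Q3": q3,
        "MED": med,
        "Q1": q1,
        "MIN": present[0],
    }


# Hypothetical data: eleven observations, two of them missing.
sample = [7, None, 3, 9, 1, 5, None, 2, 8, 4, 6]
for name, value in inspect_variable(sample).items():
    print(f"{name:12} {value}")
```

Scanning the LOWEST and HIGHEST lists and the missing percentage first, before reading any other statistic, follows the inspection-before-description order recommended above.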

NESUG '90 Proceedings

The median is widely used in place of the average or mean to indicate the middle value of a distribution. If the distribution contains some unusually high values compared to the remainder of the values, as would be the case for income data, then the median would be the preferred measure of the center of the distribution. In fact, media reports on personal income usually quote studies on the median annual personal income, not the mean. Now, notice that the median is completely insensitive to, i.e. unchanged by, the highest and lowest values, provided N >= 3. If N is an odd number and all the values were sorted in, say, ascending order, then the value that is the middle one in that list would be the median. If each place were numbered and that number is called its rank, then the median would be the value with rank equal to (N+1)/2, the middle place in that order. For N=3, it would be the second highest (or second lowest) value. If, in this trivial case of just three observations, the lowest income was $20,000/year, the middle value was $40,000/year, and the highest income was any value greater than or equal to $40,000/year, then the median would always be $40,000. However, the mean, described in the Moments Section, would vary according to the highest value. A high value of $40,000 would result in a mean of about $33,000 (or exactly $100,000/3), but the mean would be three times that figure, $100,000, if the high value were $240,000. Several methods for computing the median when N is an even number are available in PROC UNIVARIATE, but will not be discussed here. Remember, the median (and percentiles) are measured on a scale of rank or place in sorted values, instead of on the scale of dollars as would be the mean in our income example.

RANGE - The range is calculated as the difference between the highest and lowest values, i.e., MAX - MIN. It is an indicator of the spread of the data, but is less used than other measures of dispersion.

Q3-Q1 - This quantity is the inter-quartile range. It may be used as a measure of dispersion, where half this quantity ((Q3-Q1)/2) is substituted for the standard deviation. Like the median, this measure has the desirable trait of being insensitive to the extreme values, in fact to any value above Q3 or below Q1. It is used in the BOXPLOT.

MODE - The mode is the value with the most occurrences.

Box and Whiskers Plot

The height of the box represents the inter-quartile range, where the top is Q3 and the bottom is Q1. The belt across the box is the median and the mean is represented by a plus (+) sign. The lines extending from the top and bottom reach points up to a distance of two times (Q3-Q1) from the median. Zeros (0) reach another 1.5 times (Q3-Q1), and asterisks mark values beyond that.

Stem and Leaf Plot

The HISTOGRAM used is a stem and leaf plot. The scale is on the left and frequency counts are on the right for each "stem". The leaves on each stem represent one or more data points. If one is represented, then each leaf may be the units digit of the value, while the tens and higher digits are represented on the scale on the left. It is very instructive to consider the shape of the distribution of values as revealed by this plot. If it is NOT much like a bell-shaped curve, then normal theory methods may not apply to this set of values.

Normal Probability Plot

First, recall that the normal distribution is a theoretical model, the classic bell-shaped curve used for grading on the curve. This model may or may not be like the actual shape of your data. Before you apply statistical methods that assume that the values of your data come from an underlying normal distribution, you should check the validity of that assumption.

D:NORMAL - The D:NORMAL statistic on the output provides a formal test of the hypothesis that the data are from a normal distribution. A value for PROB>D of .01 would indicate a rejection of this hypothesis. However, if the SKEWNESS and KURTOSIS values are sufficiently close to zero, then normal theory methods may still be used. The rejection of normality may only be a result of the fact that you have a very large data set and therefore enough "power" to detect even minute deviations from normality.

The normal probability plot is a cumulative distribution graph of your data on the scale of z, the standard normal statistic. The zero in a standard normal distribution is the mean and median (and the mode). A distance of one in either direction is one standard deviation from the mean. The expected normal distribution points, if visible, are plotted as plus (+) signs in an approximately straight line at about a 45 degree angle. The data points will definitely be shown as asterisks (*) and should cover most of the plus signs if the data conform to a normal distribution. Gross deviations from a straight line may suggest either that normal theory methods not be used or that the data should first be transformed by some mathematical function to yield a more bell-shaped distribution, if normal theory methods are to be used. It would be advisable to consult with a statistician in the latter case. Tukey gives many suggestions for reshaping the data to be more "well-behaved."

Key Statistics (Moments Section)

N - N is the number of observations that were included in the analysis. This number excludes only the observations for which this variable had a missing value. Therefore N + NMISS = NOBS, where NOBS is the total number of observations in the data set. Note NOBS may be requested in the output data set using the OUTPUT statement.

SUM WGTS - The sum of the observation weights would in a sample survey context be an estimate of the size of the target population, obviously much greater than the size of the sample itself. This talk will not cover how to do weighted analysis or use of the WEIGHT statement. Only note here that a weight of one (1) is assumed for each observation, such that SUM WGTS = N.

MEAN - The mean, or more commonly, the average, value of the variable is of course equal to the sum of the N non-missing values divided by N. The mean is a measure of the center of a distribution of values. Caution is advised in interpreting the mean as the center, since it is pulled to the high side in the presence of even just a few unusually large values, as might be the case in a study on personal income. In such cases, it is advisable to use the median as noted in the Quantiles Section.

VARIANCE - The variance is a measure of the spread or diversity of the values. It is calculated by first subtracting the mean from each value and squaring that difference, then summing the squares and dividing the whole sum by (N-1), although some use N here. Thus it is impossible for the result to be negative, since even the square of a negative number is a positive number. The variance, like the mean, is highly inflated by even just a few unusually large values. In such a case, consider using the square of the inter-quartile range as a measure of spread or dispersion. To interpret this statistic, think of it as the mean or average of those squared differences. Or, roughly, just think of it as the square of the average distance from the mean. The greater the average distance, disregarding whether positive or negative, the greater the squared average distance, and thus the greater the variance. If, on average, the values are "close" to the mean, then the variance will be small and the mean can be viewed as a rather precise estimator of the middle of this population of values. However, if the values are "distantly spread" from the mean, on average, then the variance will be large and the mean will be viewed as a poor or imprecise estimator or predictor of values from this population. See also the STD DEV.

STD DEV - The standard deviation is commonly preferred over the variance as a measure of spread or dispersion of the values, since it is on the same scale as the values themselves. The STD DEV is the positive square root of the variance and is measured in the same unit as the values, whereas the variance is measured in terms of the square of that unit. In the presence of a few unusually large values, consider using the inter-quartile range as a measure of dispersion.

SKEWNESS - The skewness statistic is a measure of how evenly the values are dispersed on the high side versus the low side of the mean. A normal distribution, the classic bell-shaped curve used for "grading on a curve," has the value zero (0) for skewness, since it is symmetrically or evenly spread on the high and low sides of the mean. A positive value measures the tendency for values to be spread further from the mean on the high side versus the low. This statistic can be used as a red flag or yellow "caution" light, warning you to consider transforming your data prior to analysis, if you wish to use normal theory statistical tests or estimation procedures. A positive skewness may suggest that for estimation or prediction purposes, a wider confidence "half" interval (discussed under STD MEAN) be used on the high side than on the low side, and vice versa for a negative skewness.

KURTOSIS - The kurtosis statistic is a measure of whether the extremes are more distant from the mean, or the "tails" on a graph of the data values are "heavier," than would be expected for a bell-shaped curve, given the location of the middle data values. A zero value would be attained for a normal theory distribution of values. A positive value indicates how much more likely values are to fall relatively far from the mean as compared to a bell-shaped curve, where the probability of a value more than 2 standard deviations from the mean is about one (1) out of 20 or five (5) percent. A "high" kurtosis is another type of warning that normal theory tests and estimation may be invalid, or may suggest that a data transformation be performed before data analysis. See Snedecor and Cochran for tests.

USS - Uncorrected (for the mean) sum of squares (USS) will not be discussed here.

CSS - Corrected (for the mean) sum of squares (CSS) will not be discussed here.

CV - The coefficient of variation (CV) is calculated by multiplying 100 times the ratio of the standard deviation to the mean. Here the factor of 100 puts this measure on the same scale as percent, although no percent sign is used. A CV of 10 means that one standard deviation is one tenth or 10 percent of the mean. A CV of 100 indicates one standard deviation is exactly equal in size to the mean. Since this is a "relative measure," a given value of CV may have widely different practical significance depending on the study and the context.

STD MEAN - The standard error of the mean, also abbreviated as S.E.(MEAN), is calculated as the standard deviation divided by the square root of N (see N above). In sample surveys, the confidence interval estimate of the population mean is formed by adding (and/or subtracting) a certain multiple of the standard error to (and/or from) the value of the sample mean. For thirty or more observations, a standard normal statistic of about two (actually 1.96) is frequently used (assuming a bell-shaped distribution) to form a 95 percent (95%) confidence interval for the population mean. The resulting confidence interval (C.I.) could be expressed as follows:

95% C.I. of population mean = MEAN +/- (1.96) * STD MEAN
   = ( MEAN - (1.96) * STD MEAN, MEAN + (1.96) * STD MEAN )

For fewer than forty (40) values (N < 40), use a tabled t-statistic, determined by the probability level and the degrees of freedom (df), i.e., (N-1). These values may be obtained from a basic statistics textbook and are somewhat higher due to the fact that estimates based on "small" samples tend to be more variable. If N = 12, so df = 11, then the result would be as follows:

95% C.I. of population mean = MEAN +/- t(table: 11, .975) * STD MEAN
   = ( MEAN - (2.201) * STD MEAN, MEAN + (2.201) * STD MEAN )

Note: The validity of the assumption of a bell-shaped distribution is the responsibility of the data analyst(s); if in doubt, consult with a statistician.

Practical vs. Statistical Significance

Remember to use common sense and your knowledge about the data when you begin to set up statistical tests. A study with an unnecessarily large set of data may detect differences as "statistically significant" when such small differences may be of no practical significance in the real world. Studies more typically have very limited numbers of subjects or observations due to budget and time constraints, and may run the risk of not being able to detect with sufficient statistical significance a difference in the analysis variables that would be of important practical significance, if only it could be demonstrated to be a statistically valid study. Consult with a survey statistician before you begin your study, during that all important design stage, to avoid requiring too few or too many subjects or observations in your study.

Test Statistics

T:MEAN=0 - This is the test t-statistic for testing the hypothesis that the mean of this distribution is zero (0). Of course, this hypothesis may be of no practical use to your study! If it is of use, the PROB > |T| statistic would indicate a rejection of this hypothesis at the alpha equal five percent level of significance if it is less than 0.05.

SGN RANK - This is a signed rank test of the same hypothesis that the mean is zero (0). This test is a non-parametric test and is valid even if the normality assumption is invalid. The level of significance of this test is shown by PROB > |S|.
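The Moments Section statistics, the confidence interval, and the T:MEAN=0 statistic described above can be sketched in Python as follows. This is a hedged illustration rather than the procedure's actual code: the function name `moments` and its dictionary keys are invented for the example, the skewness and kurtosis lines use a common bias-adjusted sample formula that is assumed (not confirmed by this paper) to correspond to what the output reports, and the tabled t value is supplied by the caller, as it would be when read from a textbook table.

```python
import math
import statistics


def moments(values, t_crit):
    """Compute the main Moments Section statistics for non-missing values.

    t_crit is the tabled multiplier for the interval, e.g. 2.201 for
    df = 11 or 1.96 for large N. Needs N >= 4 and a nonzero spread.
    """
    n = len(values)
    mean = sum(values) / n
    css = sum((x - mean) ** 2 for x in values)  # corrected sum of squares
    variance = css / (n - 1)                    # divisor N-1, as in the text
    std_dev = math.sqrt(variance)
    z = [(x - mean) / std_dev for x in values]  # standardized values
    # Assumed bias-adjusted sample formulas; both are 0 for a normal curve.
    skewness = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
    kurtosis = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
                * sum(v ** 4 for v in z)
                - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    std_mean = std_dev / math.sqrt(n)           # standard error of the mean
    half_width = t_crit * std_mean
    return {
        "N": n,
        "MEAN": mean,
        "VARIANCE": variance,
        "STD_DEV": std_dev,
        "SKEWNESS": skewness,
        "KURTOSIS": kurtosis,
        "CV": 100 * std_dev / mean,
        "STD_MEAN": std_mean,
        "CI95": (mean - half_width, mean + half_width),
        "T": mean / std_mean,                   # t for the hypothesis mean = 0
    }


# Hypothetical sample with N = 12, so df = 11 and the tabled t is 2.201.
data = [2, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 10]
for name, value in moments(data, t_crit=2.201).items():
    print(f"{name:10} {value}")

# The income example: the median ignores the top value, the mean does not.
for incomes in ([20000, 40000, 40000], [20000, 40000, 240000]):
    print(statistics.median(incomes), statistics.mean(incomes))
```

Passing the multiplier in as `t_crit` keeps the sketch free of any statistical table lookup and makes the choice between 1.96 and a tabled t value explicit at the call site.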


Conclusion

PROC UNIVARIATE can be an extremely useful tool to the Programmer/Analyst as well as the Research Scientist for both data cleaning and descriptive statistics. Understanding how to use all the statistics and when not to use those that may be invalid is the responsibility of the user. Again, when in doubt, consult your friendly (neighborhood) statistician.

References

Dixon, W.J. and Massey, F.J., Jr. (1969), Introduction to Statistical Analysis, 3rd Ed., New York: McGraw-Hill.

Schlotzhauer, S.D. and Littell, R.C. (1987), SAS System for Elementary Statistical Analysis, Cary, N.C.: SAS Institute Inc.

Snedecor, G.W. and Cochran, W.G. (1967), Statistical Methods, 6th Ed., Ames, Iowa: Iowa State University Press.

Tukey, J.W. (1977), Exploratory Data Analysis, Reading, MA: Addison-Wesley.

SAS is a registered trademark of SAS Institute Inc., Cary NC, USA.

[Figure: sample PROC UNIVARIATE output, showing the MOMENTS, QUANTILES, and EXTREMES sections, followed by the HISTOGRAM (stem and leaf) with BOXPLOT and the NORMAL PROBABILITY PLOT.]
