Beginning Tutorials

Understanding PROC UNIVARIATE Statistics
Wendell F. Refior, Paul Revere Insurance Group

Introduction
The purpose of this presentation is to explain the basic statistics reported in PROC UNIVARIATE. An intuitive and graphical approach to data analysis is encouraged, but to conserve space this text limits the figures to those used in the oral presentation. The author advises that data inspection and cleaning should precede statistical analysis and description. John W. Tukey's Exploratory Data Analysis tools are very helpful for preliminary data inspection. If the values of your data set do not conform to a bell-shaped curve, as does the normal distribution, the question of transforming the values prior to analysis is raised. Key summary statistics are covered next. Brief mention will be made of the problem of statistical vs. practical significance, and of the topics of statistical estimation and hypothesis testing. The Schlotzhauer and Littell text, SAS® System for Elementary Analysis, is highly recommended and was helpful to the author.

Overview
The PROC UNIVARIATE output EXTREMES and QUANTILES sections are given first, as the focus is on preliminary data inspection. The BOXPLOT and HISTOGRAM (stem and leaf) plots of the data, advocated by Tukey, are introduced next as useful for spotting possible data errors or unusual characteristics in the distribution of data values. The MOMENTS section dealing with common summary statistics will then be covered. The normal distribution and the normal probability plot are then explained. The topic of testing for normality of the distribution is then taken up before a brief mention of statistical testing and estimation.

Data Inspection vs. Description
The careful data analyst will thoroughly inspect and correct data errors whenever it is both possible and practical to do so. For this reason, the Missing Value Section and Extremes Section are presented here, before other statistical concepts. Checking and cleaning the data could save both programmers and analysts from wasting days or weeks of research effort, since severe data errors might require a complete re-analysis of the results. Note that PROC FREQ is another good procedure to use for this purpose.

Missing Value Section
This section is displayed only if there is at least one missing value.

MISSING VALUE - This shows which character represents a missing value. For numeric variables, the default missing value is of course the decimal point or "dot". Other values may have been used, ranging from .A to .Z.

COUNT - The number shown by COUNT is the number of occurrences of the missing value. Note that the NMISS output variable contains this number, if requested.

% COUNT/NOBS - The number displayed is the percentage of all observations for which the analysis variable has a missing value. It may be calculated as follows: ROUND(100 * ( NMISS / (N + NMISS) ), .01).

Extremes Section
It is important to check these values if you have any doubt that they are legitimate values for the analysis variable in your data set. Obviously, corrections should be made before valuable time is spent interpreting the other statistical results.

LOWEST - The five (5) lowest, non-missing, data values are listed.

HIGHEST - The five (5) highest data values are listed.

Quantiles Section
100% MAX - The highest or maximum value is shown here, the 100-th percentile.

75% Q3 - The third quartile or 75-th percentile is shown here.

50% MED - The median or 50-th percentile is shown here, the second quartile.

25% Q1 - The first quartile or 25-th percentile is shown here.

0% MIN - The lowest or minimum value is shown here, the zero-th percentile.
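As a rough illustration of the Missing Value, Extremes, and Quantiles sections above, the following Python sketch reproduces the same quantities outside of SAS. It is not PROC UNIVARIATE's own code. Several things here are assumptions made only for the example: None stands in for a missing value, the function names and the sample data are hypothetical, and the percentile rule used (the empirical distribution function with averaging) is assumed to correspond to the procedure's default percentile definition. Python's round() may also differ from the SAS ROUND function in borderline cases.

```python
import math

def missing_percent(values):
    """Return (NMISS, % COUNT/NOBS); None stands in for a missing value (assumption)."""
    nmiss = sum(1 for v in values if v is None)
    n = len(values) - nmiss
    # Mirrors ROUND(100 * (NMISS / (N + NMISS)), .01) from the text.
    return nmiss, round(100 * nmiss / (n + nmiss), 2)

def extremes(values, k=5):
    """Return the k lowest and k highest non-missing values."""
    data = sorted(v for v in values if v is not None)
    return data[:k], data[-k:]

def percentile(sorted_data, p):
    """Percentile by the empirical distribution function with averaging.

    Assumed (not confirmed by the paper) to match the procedure's default.
    Write n*p = j + g with j an integer: if g == 0, average the j-th and
    (j+1)-th sorted values; otherwise take the (j+1)-th sorted value.
    """
    n = len(sorted_data)
    if p <= 0:
        return sorted_data[0]      # 0% MIN
    if p >= 1:
        return sorted_data[-1]     # 100% MAX
    j = math.floor(n * p)
    g = n * p - j
    if g == 0:
        return (sorted_data[j - 1] + sorted_data[j]) / 2
    return sorted_data[j]

# Hypothetical sample: ten non-missing values and two missing values.
values = [12, None, 7, 3, 9, None, 15, 1, 20, 5, 8, 11]

nmiss, pct = missing_percent(values)
lowest, highest = extremes(values)
data = sorted(v for v in values if v is not None)
quantiles = {label: percentile(data, p)
             for label, p in [("100% MAX", 1.0), ("75% Q3", 0.75),
                              ("50% MED", 0.5), ("25% Q1", 0.25),
                              ("0% MIN", 0.0)]}
print(nmiss, pct, lowest, highest, quantiles)
```

Other percentile definitions would give slightly different quartiles for small samples, so the quartile values from this sketch should be compared with actual procedure output before being relied on.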
NESUG '90 Proceedings

The median is widely used in place of the average or mean to indicate the middle value of a distribution. If the distribution contains some unusually high values compared to the remainder of the values, as would be the case for income data, then the median would be the preferred measure of the center of the distribution. In fact, media reports on personal income usually quote studies on the median annual personal income, not the mean. Now, notice that the median is completely insensitive to, i.e. unchanged by, the highest and lowest values, provided N >= 3. If N is an odd number and all the values were sorted in, say, ascending order, then the value that is the middle one in that list would be the median. If each place were numbered and that number is called its rank, then the median would be the value with rank equal to (N+1)/2, the middle place in that order. For N=3, it would be the second highest (or second lowest) value. If, in this trivial case of just three observations, the lowest income was $20,000/year, the middle value was $40,000/year, and the highest income was any value greater than or equal to $40,000/year, then the median would always be $40,000. However, the mean, described in the Moments Section, would vary according to the highest value. A high value of $40,000 would result in a mean of about $33,000 (or exactly $100,000/3), but the mean would be three times that figure, $100,000, if the high value were $240,000. Several methods for computing the median when N is an even number are available in PROC UNIVARIATE, but will not be discussed here. Remember, the median (and percentiles) are measured on a scale of rank or place in sorted values, instead of on the scale of dollars as would be the mean in our income example.

RANGE - The range is calculated as the difference between the highest and lowest values, i.e., MAX - MIN. It is an indicator of the spread of the data, but is less used than other measures of dispersion.

Q3-Q1 - This quantity is the inter-quartile range. It may be used as a measure of dispersion, where half this quantity ((Q3-Q1)/2) is substituted for the standard deviation. Like the median, this measure has the desirable trait of being insensitive to the extreme values, in fact to any value above Q3 or below Q1. It is used in the BOXPLOT.

MODE - The mode is the value with the most occurrences.

Box and Whiskers Plot
The height of the box represents the inter-quartile range, where the top is Q3 and the bottom is Q1. The belt across the box is the median and the mean is represented by a plus (+) sign. The lines extended from the top and bottom represent points up to a distance equal to two times (Q3-Q1) from the median. Zeros (0) reach another 1.5 times (Q3-Q1), and asterisks mark points beyond that.

Stem and Leaf Plot
The HISTOGRAM used is a stem and leaf plot. The scale is on the left and frequency counts are on the right for each "stem". The leaves on each stem represent one or more data points. If one is represented, then each leaf may be the units digit of the value, while the tens and higher digits are represented on the scale on the left. It is very instructive to consider the shape of the distribution of values as revealed by this histogram. If it is NOT much like a bell-shaped curve, then normal theory methods may not apply to this set of values.

Normal Probability Plot
First, recall that the normal distribution is a theoretical model, the classic bell-shaped curve used for grading on the curve. This model may or may not be like the actual shape of your data. Before you apply statistical methods that assume that the values of your data come from an underlying normal distribution, you should check the validity of that assumption.

D:NORMAL - The D:NORMAL statistic on the output provides a formal test of the hypothesis that the data are from a normal distribution. A value for PROBD of .01 would indicate a rejection of this hypothesis. However, if the values of SKEWNESS and KURTOSIS are sufficiently close to zero, then normal theory methods may still be used. The rejection of normality may only be a result of the fact that you have a very large data set and therefore enough "power" to detect even minute deviations from normality.

The normal probability plot is a cumulative distribution graph of your data on the scale of z, the standard normal statistic. The zero in a standard normal distribution is the mean and median (and the mode). A distance of one in either direction is one standard deviation from the mean. The expected normal distribution points, if visible, are plotted as plus (+) signs in an approximately straight line at about a 45 degree angle. The data points will definitely be shown as asterisks (*) and should cover most of the plus signs if the data conform to a normal distribution. Gross deviations from a straight line may suggest either that normal theory methods should not be used or that the data should first be transformed by some mathematical function to yield a more bell-shaped distribution, if normal theory methods are to be used. It would be advisable to consult with a statistician in the latter case.

as the mean or average of those squared differences. Or, just think of it as the square of the average distance from the mean. The greater the average distance, disregarding whether positive or negative, the greater the squared average distance, thus, the greater the variance. If, on average, the values are "close" to the mean, then the variance will be small and the mean can be view