Bulletin of the Psychonomic Society
1988, 26 (5), 433-436

Measuring variation for nominal data

TARALD O. KVALSETH
University of Minnesota, Minneapolis, Minnesota

This paper is concerned with the measurement of variation or dispersion for nominal categorical data. A new measure of such variation is proposed. It is defined as the complement of the standard deviation of the individual category frequencies from the modal frequency. Large-sample distribution methods are developed for constructing confidence intervals and testing hypotheses for the equivalent population measure of variation. Numerical sample data are used to exemplify the statistical inference methods.

Address correspondence to Tarald O. Kvalseth, Department of Mechanical Engineering, University of Minnesota, Minneapolis, Minnesota 55455.

For measuring the variation or dispersion in a set of nominal data, a number of alternative measures have been defined in the literature. Such measures are generally functions of the absolute frequencies or counts $n_1, n_2, \ldots, n_I$ for a set of $I$ categories that are exhaustive and mutually exclusive. Thus, among the total number of $n = \sum_i n_i$ observations (individuals or items), $n_i$ observations belong to category $i$, with each observation belonging to one and only one category. Among the best known variation measures for the $n_i$ ($i = 1, 2, \ldots, I$) are those that may be termed the mean deviation from the mode (MDM), the index of qualitative variation (IQV), and the relative entropy index (REI). These three measures may be defined as follows (e.g., Reynolds, 1977, p. 30):

$$\mathrm{MDM} = 1 - \frac{\sum_i (n_m - n_i)}{n(I-1)}, \tag{1}$$

$$\mathrm{IQV} = \left(\frac{I}{I-1}\right)\left(1 - \sum_i (n_i/n)^2\right), \tag{2}$$

and

$$\mathrm{REI} = \frac{-\sum_i (n_i/n)\log(n_i/n)}{\log I}, \tag{3}$$

where $n_m$ in Equation 1 is the modal frequency, that is,

$$n_m = \max\{n_1, n_2, \ldots, n_I\}.$$

The summations in Equations 1 through 3 are over all $i$ from $i = 1$ to $i = I$.

Each of the three measures varies between 0 and 1. They take the value of 0 when all observations fall in a single category, and they equal 1 when equally many observations fall in all categories, irrespective of the number of categories. Consequently, these measures may be appropriate when comparing the variation in different data sets having different numbers of categories. Of the three measures, the IQV in Equation 2 appears to be the most widely accepted, especially in the social sciences (e.g., Ott, Mendenhall, & Larson, 1978, pp. 110-113). The measure in Equation 3, which is based on Shannon's (1948) entropy in information theory, has been used quite extensively in the biological sciences (Pielou, 1977).

The measure MDM in Equation 1 suffers from one serious limitation: its value is determined exclusively by the relative modal frequency $n_m/n$ and $I$ (the number of categories). The fact that MDM is unaffected by frequencies other than the modal frequency is readily observed by expressing MDM differently, as done below. The purpose of the present paper is to propose an alternative measure of variation for nominal data that is based on deviations from the mode, but that depends on all the frequencies and not only the modal frequency. Furthermore, the large-sample distribution properties of the new measure are derived; this permits statistical hypotheses to be tested and confidence intervals to be constructed for this variation measure. The statistical inference procedures are exemplified using numerical data.

A NEW MODAL-FREQUENCY MEASURE

In terms of the population probabilities $p_i$ ($i = 1, 2, \ldots, I$), the population analogue of the expression in Equation 1 is

$$V_1 = 1 - \frac{1}{I-1}\sum_i (p_m - p_i) \tag{4a}$$

$$\phantom{V_1} = 1 - \frac{I p_m - 1}{I-1}, \tag{4b}$$

where

$$p_m = \max\{p_1, p_2, \ldots, p_I\}.$$

From Equation 4b it is seen that $V_1$ and its estimator $\hat{V}_1 = \mathrm{MDM}$, obtained by substituting the observed relative frequencies $\hat{p}_i = n_i/n$ for the $p_i$ ($i = 1, 2, \ldots, I$) in Equation 4, are determined exclusively by the modal probabilities $p_m$ and $\hat{p}_m = n_m/n$, respectively.
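The three descriptive measures in Equations 1 through 3 can be sketched in a few lines of Python. This is only an illustration: the function names are not from the paper, the count vector `[75, 40, 35]` is a hypothetical data set chosen to share its total, number of categories, and modal frequency with the counts 75, 67, 8 used later in the paper, and the convention that empty categories contribute nothing to the entropy is an assumption of the sketch.

```python
import math


def mdm(counts):
    """Mean deviation from the mode (Equation 1)."""
    n, k, n_mode = sum(counts), len(counts), max(counts)
    return 1 - sum(n_mode - c for c in counts) / (n * (k - 1))


def mdm_from_mode(counts):
    """Equation 4b with p_m replaced by n_m / n: only the mode matters."""
    n, k = sum(counts), len(counts)
    return 1 - (k * max(counts) / n - 1) / (k - 1)


def iqv(counts):
    """Index of qualitative variation (Equation 2)."""
    n, k = sum(counts), len(counts)
    return (k / (k - 1)) * (1 - sum((c / n) ** 2 for c in counts))


def rei(counts):
    """Relative entropy index (Equation 3)."""
    n, k = sum(counts), len(counts)
    # Assumption: empty categories are skipped (0 * log 0 treated as 0).
    entropy = -sum((c / n) * math.log(c / n) for c in counts if c > 0)
    return entropy / math.log(k)


paper_counts = [75, 67, 8]          # counts used in the paper's numerical example
hypothetical_counts = [75, 40, 35]  # hypothetical: same n, I, and modal frequency

for counts in (paper_counts, hypothetical_counts):
    print(counts,
          "MDM:", round(mdm(counts), 4),
          "IQV:", round(iqv(counts), 4),
          "REI:", round(rei(counts), 4))
```

Both count vectors give the same MDM because their modal frequency, total, and number of categories coincide, whereas IQV and REI respond to how the remaining observations are spread over the non-modal categories.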

Copyright 1988 Psychonomic Society, Inc.

The measure $V_1$ is based on the first-order mean of the deviations $p_m - p_i$ ($i = 1, 2, \ldots, I$). The measure proposed here is based instead on their second-order mean:

$$V_2 = 1 - \left\{\frac{1}{I-1}\sum_i (p_m - p_i)^2\right\}^{1/2}. \tag{5}$$

In terms of the terminology used when measuring the variation of quantitative variables, $V_2$ is simply the complement of the standard deviation of the $p_i$ from $p_m$. The estimate ($\hat{V}_2$) of $V_2$ obtained by substituting the $\hat{p}_i = n_i/n$ for all the $p_i$ in Equation 5 is given by the standard deviation from the mode (SDM) measure:

$$\mathrm{SDM} = 1 - \left\{\frac{1}{n^2(I-1)}\sum_i (n_m - n_i)^2\right\}^{1/2}. \tag{6}$$

The properties of $V_2$ (and hence of $\hat{V}_2 = \mathrm{SDM}$) may be outlined as follows: (1) $V_2$ takes into account all the probabilities $p_i$ ($i = 1, 2, \ldots, I$) and the number of categories $I$; (2) $V_2$ ranges in value between 0 and 1 inclusive; (3) $V_2 = 0$ if and only if there exists an $i$ such that $p_i = 1$ (SDM = 0 only when all observations belong to a single category); and (4) $V_2 = 1$ if and only if $p_i = 1/I$ for all $i$ (SDM = 1 only when the observations are distributed uniformly over all the categories).

The preceding properties of the proposed measure follow immediately from Equation 5 (and Equation 6). It may also be pointed out that, when comparing $V_2$ in Equation 5 with $V_1$ in Equation 4, $V_2 \le V_1$ with equality for $I = 2$ (i.e., the nominal variable has two categories) or for the two extremes of no or maximum variation defined under Properties 3 and 4 above; the inequality follows from a well-known theorem for generalized means (Hardy, Littlewood, & Pólya, 1952, Theorem 16). Properties 1-4 also hold for the measures IQV and REI in Equations 2 and 3 and for their population analogues.

STATISTICAL INFERENCES

In addition to using SDM as a descriptive statistic for the variation in the observations $n_i$ ($i = 1, 2, \ldots, I$), it may clearly be of interest to make statistical inferences about the underlying population quantity $V_2$. Thus, an investigator may want to test certain hypotheses about $V_2$ and construct confidence intervals. Such approximate inferences can be made by using asymptotic distribution methods, as will be demonstrated in some detail.

Under the usual multinomial sampling model with the total number of observations $n$ fixed, it follows from the delta method (e.g., Agresti, 1984, pp. 247-250; Bishop, Fienberg, & Holland, 1975, chap. 14) that the estimator $\hat{V}_2 = \mathrm{SDM}$ of $V_2$ is asymptotically normally distributed with mean $V_2$ and variance $\sigma^2(\hat{V}_2)$ that may be derived as follows. The partial derivatives of $V_2$ in Equation 5 with respect to the $p_i$ ($i = 1, 2, \ldots, I$) are given by

$$V'_{2i} = \frac{\partial V_2}{\partial p_i} = \begin{cases} \dfrac{p_m - p_i}{(I-1)(1 - V_2)}, & i \ne m, \\[2ex] -\dfrac{I p_m - 1}{(I-1)(1 - V_2)}, & i = m. \end{cases} \tag{7}$$

It is also easily determined that

$$\sum_i p_i V'_{2i} = V_2 - 1. \tag{8}$$

The variance $\sigma^2(\hat{V}_2)$ is given by

$$\sigma^2(\hat{V}_2) = \frac{1}{n}\left\{\sum_i p_i (V'_{2i})^2 - \left(\sum_i p_i V'_{2i}\right)^2\right\}, \tag{9}$$

so that when the expressions in Equations 7 and 8 are substituted into Equation 9, the following variance formula is obtained:

$$\sigma^2(\hat{V}_2) = \frac{1}{n}\left\{\frac{\sum_i p_i (p_m - p_i)^2 + p_m (I p_m - 1)^2}{(I-1)^2 (1 - V_2)^2} - (1 - V_2)^2\right\}, \tag{10}$$

provided $V_2$ is neither 0 nor 1.

When $n$ is reasonably large, and when the estimated variance $\hat{\sigma}^2(\hat{V}_2)$ is computed from Equation 10 by replacing $p_i$ by the observed proportions $\hat{p}_i = n_i/n$ for all $i$, approximate inferences can be made about the population measure $V_2$ based on the preceding results. For most practical purposes, an investigator is likely to be particularly interested in constructing a confidence interval for $V_2$ and testing the hypothesis that two or more $V_2$ values are equal. Thus, an approximate $(1-\alpha)100\%$ confidence interval for $V_2$ is given by

$$\hat{V}_2 \pm z_{\alpha/2}\,\hat{\sigma}(\hat{V}_2), \tag{11}$$

where $z_{\alpha/2}$ is obtained from the standard normal tables (e.g., $z_{\alpha/2} = 1.96$ for $\alpha = .05$). The null hypothesis that two $V_2$ values are equal, that is, $H_0: V_{21} = V_{22}$, can be tested (against the alternative hypothesis $H_1: V_{21} \ne V_{22}$) using the test statistic

$$Z = \frac{\hat{V}_{21} - \hat{V}_{22}}{\left\{\hat{\sigma}^2(\hat{V}_{21}) + \hat{\sigma}^2(\hat{V}_{22})\right\}^{1/2}} \tag{12}$$

and rejecting $H_0$ at the $\alpha$ level of significance if $|Z| > z_{\alpha/2}$. This testing procedure is based on the assumption that the estimators $\hat{V}_{21}$ and $\hat{V}_{22}$ are independent, that is, that they are based on two independent samples.

Consider now the more general hypothesis $H_0: V_{21} = V_{22} = \cdots = V_{2k}$ for $k \ge 2$. Define the weighted average of the $\hat{V}_{2j}$ ($j = 1, 2, \ldots, k$) as

$$\bar{V}_2 = \frac{\sum_j \hat{V}_{2j} / \hat{\sigma}^2(\hat{V}_{2j})}{\sum_j 1 / \hat{\sigma}^2(\hat{V}_{2j})}. \tag{13}$$

If the samples of sizes $n_j$ ($j = 1, 2, \ldots, k$) are independent and reasonably large, the hypothesis of equal variations may be tested by referring the value of the statistic

$$X^2 = \sum_j \frac{(\hat{V}_{2j} - \bar{V}_2)^2}{\hat{\sigma}^2(\hat{V}_{2j})} \tag{14}$$

to tables of the chi-square distribution with $k-1$ degrees of freedom (e.g., Agresti, 1984, pp. 190-194; Fleiss, 1981, section 10.1). The hypothesis is rejected at the $\alpha$ level of significance if the value of Equation 14 exceeds the upper $100\alpha\%$ point of the chi-square distribution with $k-1$ degrees of freedom. If the null hypothesis is rejected, pairwise comparisons can be made using Equation 12. If the null hypothesis is not rejected, so that the population variations $V_{2j}$ can reasonably be assumed to be equal for $j = 1, 2, \ldots, k$, then $\bar{V}_2$ in Equation 13 may serve as an overall variation estimate.

NUMERICAL EXAMPLES

To demonstrate the use of the variation measure SDM and the statistical inference methods developed above, some numerical sample data have been used. Consider first the case when a sample of size $n = 150$ is distributed across $I = 3$ nominal categories as follows: $n_1 = 75$, $n_2 = 67$, and $n_3 = 8$. These $n_i$ may, for example, be individuals classified as Democrat, Republican, or Independent. The variation in these data is computed from Equation 6 as being

$$\mathrm{SDM} = 1 - \left\{\frac{1}{(150)^2(3-1)}\left[(75-67)^2 + (75-8)^2\right]\right\}^{1/2} = .68.$$

By comparison, MDM = .75 from Equation 1 or MDM = $\hat{V}_1$ = .75 from Equation 4b with $\hat{p}_m = 75/150$. From Equations 2 and 3, IQV = .82 and REI = .79.

By substituting $\hat{p}_i = n_i/n$ for $p_i$ and $\hat{V}_2 = \mathrm{SDM} = .6819$ for $V_2$ in Equation 10, the estimated variance is computed as $\hat{\sigma}^2(\hat{V}_2) = .0016$. According to Equation 11, an approximate 95% confidence interval for the population measure $V_2$ is given by $.6819 \pm 1.96(.0016)^{1/2}$, or (.60, .76).

Consider next that the previous data referred to the party affiliation of a sample of males, whereas another independent sample of females resulted in $n_1 = 114$, $n_2 = 76$, and $n_3 = 10$. Based on these data and Equations 6 and 10, it is found that $\mathrm{SDM} = \hat{V}_{22} = .6085$ and $\hat{\sigma}^2(\hat{V}_{22}) = .0018$ as compared with the previous values of $\hat{V}_{21} = .6819$ and $\hat{\sigma}^2(\hat{V}_{21}) = .0016$ for males. When these results are substituted into Equation 12, the value of the test statistic becomes $Z = 1.26$, so that $H_0: V_{21} = V_{22}$ is not rejected at the .05 level of significance (since $1.26 < z_{\alpha/2} = 1.96$ for $\alpha = .05$). Consequently, it is concluded that the variation in political party affiliation does not differ significantly between males and females.

Finally, consider that the preceding two data sets refer to married males and females. Two additional independent samples of unmarried males and females produced $n_{i3} = 110, 80, 10$ and $n_{i4} = 135, 120, 45$. For these data sets 3 and 4, it is found that $\hat{V}_{23} = .6309$, $\hat{\sigma}^2(\hat{V}_{23}) = .0016$, $\hat{V}_{24} = .7849$, and $\hat{\sigma}^2(\hat{V}_{24}) = .0011$. When these results together with those of the previous two data sets are substituted into Equations 13 and 14, it is found that $\bar{V}_2 = .6896$ and $X^2 = 14.10$. Since 14.10 exceeds even the tabulated chi-square value of 12.84 for $\alpha = .005$ and $k - 1 = 3$ degrees of freedom, the decision is to reject $H_0: V_{21} = V_{22} = V_{23} = V_{24}$. Thus, it is concluded that the variation in party affiliation differs significantly ($p < .005$) between the four population groups. It has already been decided that no such significant difference exists between married males and females, so that additional pairwise comparisons using Equation 12 are needed if the specific sources of the variation difference are to be determined.

CONCLUDING COMMENTS

In the preceding numerical examples, the same number of categories ($I = 3$) was used when comparing the population $V_{2j}$ for different $j$. However, the same statistical procedures apply if the number of categories ($I_j$) differs for different $j$. Similarly, the sample measure or descriptive statistic SDM is appropriate for comparing the variation in data sets with differing numbers of categories, since SDM takes into account the number of categories.

The new measure SDM is proposed as being suitable and useful for measuring the variation or dispersion in nominal data, that is, when the underlying variable is measured on a nominal scale. It is not a proper measure, nor are the alternative measures in Equations 1-3 appropriate, when the data or the underlying variable is ordinal, since SDM does not use the ordering among the categories. In fact, if SDM or the other measures in Equations 1-3 are used for ordinal categorical data, misleading results may occur. For measuring the variation in ordinal data, an index such as that proposed by Leik (1966) is more appropriate.

In conclusion, the SDM measure and the associated statistical inference methods presented in this paper offer alternative means of analyzing the variation for nominal variables. Although SDM is clearly preferable to MDM in Equation 1, the desirable properties of SDM are also possessed by IQV and REI in Equations 2 and 3. Thus, the choice of measure to use in any given situation is up to the individual investigator. The use of more than one variation measure for each nominal data set is perhaps the most prudent compromise.

REFERENCES

AGRESTI, A. (1984). Analysis of ordinal categorical data. New York: Wiley.
BISHOP, Y. M. M., FIENBERG, S. E., & HOLLAND, P. W. (1975). Discrete multivariate analysis: Theory and practice. Cambridge, MA: MIT Press.
FLEISS, J. L. (1981). Statistical methods for rates and proportions. New York: Wiley.
HARDY, G. H., LITTLEWOOD, J. E., & PÓLYA, G. (1952). Inequalities (2nd ed.). Cambridge, England: Cambridge University Press.
LEIK, R. L. (1966). A measure of ordinal consensus. Pacific Sociological Review, 9, 85-90.
OTT, L., MENDENHALL, W., & LARSON, R. F. (1978). Statistics: A tool for the social sciences. North Scituate, MA: Duxbury.
PIELOU, E. C. (1977). Mathematical ecology. New York: Wiley.
REYNOLDS, H. T. (1977). The analysis of cross-classifications. New York: Free Press.
SHANNON, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379-423, 623-656.

(Manuscript received for publication January 30, 1988.)
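As a check on the numerical examples, the SDM measure and the large-sample procedures can be sketched in Python. This is a minimal sketch under stated assumptions, not the author's own computation: the function names are illustrative, the variance is the expression obtained by applying the delta method to Equation 5 under multinomial sampling, the modal category is assumed to be unique, and the critical values 1.96 and 12.84 are simply the tabulated values quoted in the text. Because full-precision variances are used here, the pooled estimate and chi-square value may differ slightly in the later digits from the printed .6896 and 14.10, which appear to be based on rounded inputs.

```python
import math


def sdm(counts):
    """Standard deviation from the mode (Equation 6)."""
    n, k, n_mode = sum(counts), len(counts), max(counts)
    squared_dev = sum((n_mode - c) ** 2 for c in counts)
    return 1 - math.sqrt(squared_dev / (n ** 2 * (k - 1)))


def sdm_variance(counts):
    """Estimated large-sample variance of SDM (delta method; assumes a unique mode)."""
    n, k = sum(counts), len(counts)
    p = [c / n for c in counts]
    p_mode = max(p)
    spread = 1 - sdm(counts)  # this is 1 - V2, which must lie strictly between 0 and 1
    if not 0 < spread < 1:
        raise ValueError("variance formula requires SDM strictly between 0 and 1")
    numerator = sum(pi * (p_mode - pi) ** 2 for pi in p) + p_mode * (k * p_mode - 1) ** 2
    return (numerator / ((k - 1) ** 2 * spread ** 2) - spread ** 2) / n


def confidence_interval(counts, z=1.96):
    """Approximate confidence interval: estimate +/- z * standard error."""
    estimate = sdm(counts)
    half_width = z * math.sqrt(sdm_variance(counts))
    return estimate - half_width, estimate + half_width


def z_statistic(counts_a, counts_b):
    """Two-sample statistic for equal variation in two independent samples."""
    diff = sdm(counts_a) - sdm(counts_b)
    return diff / math.sqrt(sdm_variance(counts_a) + sdm_variance(counts_b))


def homogeneity_test(samples):
    """Inverse-variance weighted average and chi-square statistic (k - 1 df)."""
    estimates = [sdm(s) for s in samples]
    weights = [1 / sdm_variance(s) for s in samples]
    pooled = sum(w * v for w, v in zip(weights, estimates)) / sum(weights)
    x2 = sum(w * (v - pooled) ** 2 for w, v in zip(weights, estimates))
    return pooled, x2


# Party-affiliation counts from the numerical examples
married_males = [75, 67, 8]
married_females = [114, 76, 10]
unmarried_males = [110, 80, 10]
unmarried_females = [135, 120, 45]
samples = [married_males, married_females, unmarried_males, unmarried_females]

low, high = confidence_interval(married_males)
z = z_statistic(married_males, married_females)
pooled, x2 = homogeneity_test(samples)

print("SDM:", [round(sdm(s), 4) for s in samples])
print("variance:", [round(sdm_variance(s), 4) for s in samples])
print("95% CI, married males:", (round(low, 2), round(high, 2)))
print("Z, married males vs. females:", round(z, 2), "reject:", abs(z) > 1.96)
print("pooled:", round(pooled, 4), "X2:", round(x2, 2), "reject:", x2 > 12.84)
```

Keeping the variance in its own function makes it easy to reuse for the confidence interval, the two-sample comparison, and the k-sample test, which all depend on the same estimated variance.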