Measuring Variation for Nominal Data
Bulletin of the Psychonomic Society
1988, 26 (5), 433-436
Copyright 1988 Psychonomic Society, Inc.

TARALD O. KVALSETH
University of Minnesota, Minneapolis, Minnesota

Address correspondence to Tarald O. Kvalseth, Department of Mechanical Engineering, University of Minnesota, Minneapolis, Minnesota 55455.

This paper is concerned with the measurement of variation or dispersion for nominal categorical data. A new measure of such variation is proposed. It is defined as the complement of the standard deviation of the individual category frequencies from the modal frequency. Large-sample distribution methods are developed for constructing confidence intervals and testing hypotheses for the equivalent population measure of variation. Numerical sample data are used to exemplify the statistical inference methods.

For measuring the variation or dispersion in a set of nominal data, a number of alternative measures have been defined in the literature. Such measures are generally functions of the absolute frequencies or counts $n_1, n_2, \ldots, n_I$ for a set of $I$ categories that are exhaustive and mutually exclusive. Thus, among the total number of $n = \sum_i n_i$ observations (individuals or items), $n_i$ observations belong to category $i$, with each observation belonging to one and only one category. Among the best known variation measures for the $n_i$ ($i = 1, 2, \ldots, I$) are those that may be termed the mean deviation from the mode (MDM), the index of qualitative variation (IQV), and the relative entropy index (REI). These three measures may be defined as follows (e.g., Reynolds, 1977, p. 30):

$$\mathrm{MDM} = 1 - \frac{\sum_i (n_m - n_i)}{n(I-1)}, \tag{1}$$

$$\mathrm{IQV} = \left(\frac{I}{I-1}\right)\left(1 - \sum_i (n_i/n)^2\right), \tag{2}$$

and

$$\mathrm{REI} = \frac{-\sum_i (n_i/n)\log(n_i/n)}{\log I}, \tag{3}$$

where $n_m$ in Equation 1 is the modal frequency, that is,

$$n_m = \max\{n_1, n_2, \ldots, n_I\}.$$

The summations in Equations 1 through 3 are over all $i$ from $i = 1$ to $i = I$.

Each of the three measures varies between 0 and 1. They take the value of 0 when all observations fall in a single category, and they equal 1 when equally many observations fall in all categories, irrespective of the number of categories. Consequently, these measures may be appropriate when comparing the variation in different data sets having different numbers of categories. Of the three measures, the IQV in Equation 2 appears to be the most widely accepted, especially in the social sciences (e.g., Ott, Mendenhall, & Larson, 1978, pp. 110-113). The measure in Equation 3, which is based on Shannon's (1948) entropy in information theory, has been used quite extensively in the biological sciences (Pielou, 1977).

The measure MDM in Equation 1 suffers from one serious limitation: its value is determined exclusively by the relative modal frequency $n_m/n$ and $I$ (the number of categories). The fact that MDM is unaffected by frequencies other than the modal frequency is readily observed by expressing MDM differently, as done below.

The purpose of the present paper is to propose an alternative measure of variation for nominal data that is based on deviations from the mode, but that depends on all the frequencies and not only the modal frequency. Furthermore, the large-sample distribution properties of the new measure are derived; this permits statistical hypotheses to be tested and confidence intervals to be constructed for this variation measure. The statistical inference procedures are exemplified using numerical data.

A NEW MODAL-FREQUENCY MEASURE

In terms of the population probabilities $p_i$ ($i = 1, 2, \ldots, I$), the population analogue of the expression in Equation 1 is

$$V_1 = 1 - \frac{1}{I-1}\sum_i (p_m - p_i) \tag{4a}$$

$$\phantom{V_1} = 1 - \frac{I p_m - 1}{I-1}, \tag{4b}$$

where

$$p_m = \max\{p_1, p_2, \ldots, p_I\}.$$

From Equation 4b it is seen that $V_1$ and its estimator $\hat{V}_1 = \mathrm{MDM}$, obtained by substituting the observed relative frequencies $\hat{p}_i = n_i/n$ for the $p_i$ ($i = 1, 2, \ldots, I$) in Equation 4, are determined exclusively by the modal probabilities $p_m$ and $\hat{p}_m = n_m/n$, respectively.
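The three descriptive measures of Equations 1 through 3 can be illustrated with a minimal Python sketch. The function names and the second set of counts are hypothetical and chosen only for illustration; empty categories are assumed to contribute zero to the entropy sum. The sketch shows that two data sets with the same modal frequency receive the same mean deviation from the mode (in agreement with Equation 4b), even though the remaining frequencies differ.

```python
import math

def mean_deviation_from_mode(counts):
    """Equation 1: one minus the mean deviation of the counts from the modal count."""
    n, k = sum(counts), len(counts)
    n_mode = max(counts)
    return 1 - sum(n_mode - c for c in counts) / (n * (k - 1))

def index_of_qualitative_variation(counts):
    """Equation 2: IQV."""
    n, k = sum(counts), len(counts)
    return (k / (k - 1)) * (1 - sum((c / n) ** 2 for c in counts))

def relative_entropy_index(counts):
    """Equation 3: REI. Empty categories are skipped (assumed convention 0*log 0 = 0)."""
    n, k = sum(counts), len(counts)
    entropy = -sum((c / n) * math.log(c / n) for c in counts if c > 0)
    return entropy / math.log(k)

a = [110, 80, 10]   # counts quoted in the paper's numerical example
b = [110, 45, 45]   # hypothetical counts: same modal frequency, different spread

# Equation 4b with p_m = 110/200 and I = 3 gives 1 - (3*0.55 - 1)/2 = 0.675 for both.
print(mean_deviation_from_mode(a), mean_deviation_from_mode(b))
# IQV and REI respond to the non-modal frequencies and therefore differ.
print(index_of_qualitative_variation(a), index_of_qualitative_variation(b))
print(relative_entropy_index(a), relative_entropy_index(b))
```

The identical first pair of values is the limitation that motivates a measure depending on all the frequencies.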
The measure $V_1$ is based on the first-order mean of the deviations of the $p_i$ from the modal probability $p_m$. Consider instead the following measure based on the second-order mean of the $(I-1)$ deviations $(p_i - p_m)$:

$$V_2 = 1 - \left\{\frac{1}{I-1}\sum_i (p_m - p_i)^2\right\}^{1/2}. \tag{5}$$

In terms of the terminology used when measuring the variation of quantitative variables, $V_2$ is simply the complement of the standard deviation of the $p_i$ from $p_m$. The estimate ($\hat{V}_2$) of $V_2$ obtained by substituting the $\hat{p}_i = n_i/n$ for all the $p_i$ in Equation 5 is given by the standard deviation from the mode (SDM) measure:

$$\mathrm{SDM} = 1 - \left\{\frac{1}{n^2(I-1)}\sum_i (n_m - n_i)^2\right\}^{1/2}. \tag{6}$$

The properties of $V_2$ (and hence of $\hat{V}_2 = \mathrm{SDM}$) may be outlined as follows: (1) $V_2$ takes into account all the probabilities $p_i$ ($i = 1, 2, \ldots, I$) and the number of categories $I$; (2) $V_2$ ranges in value between 0 and 1 inclusive; (3) $V_2 = 0$ if and only if there exists an $i$ such that $p_i = 1$ (SDM = 0 only when all observations belong to a single category); and (4) $V_2 = 1$ if and only if $p_i = 1/I$ for all $i$ (SDM = 1 only when the observations are distributed uniformly over all the categories).

The preceding properties of the proposed measure follow immediately from Equation 5 (and Equation 6). It may also be pointed out that, when comparing $V_2$ in Equation 5 with $V_1$ in Equation 4, $V_2 \le V_1$ with equality for $I = 2$ (i.e., the nominal variable has two categories) or for the two extremes of no or maximum variation defined under Properties 3 and 4 above; the inequality follows from a well-known theorem for generalized means (Hardy, Littlewood, & Pólya, 1952, Theorem 16). Properties 1-4 also hold for the measures IQV and REI in Equations 2 and 3 and for their population analogues.

STATISTICAL INFERENCES

In addition to using SDM as a descriptive statistic for the variation in the observations $n_i$ ($i = 1, 2, \ldots, I$), it may clearly be of interest to make statistical inferences about the underlying population quantity $V_2$. Thus, an investigator may want to test certain hypotheses about $V_2$ and construct confidence intervals. Such approximate inferences can be made by using asymptotic distribution methods, as will be demonstrated in some detail.

Under the usual multinomial sampling model with the total number of observations $n$ fixed, it follows from the delta method (e.g., Agresti, 1984, pp. 247-250; Bishop, Fienberg, & Holland, 1975, chap. 14) that the estimator $\hat{V}_2 = \mathrm{SDM}$ of $V_2$ is asymptotically normally distributed with mean $V_2$ and variance $\sigma^2(\hat{V}_2)$ that may be derived as follows. The partial derivatives of $V_2$ in Equation 5 with respect to the $p_i$ ($i = 1, 2, \ldots, I$) are given by

$$V'_{2i} = \frac{\partial V_2}{\partial p_i} = \frac{p_m - p_i}{(I-1)(1-V_2)} \quad \text{for } p_i \ne p_m \tag{7a}$$

$$\phantom{V'_{2i} = \frac{\partial V_2}{\partial p_i}} = \frac{1 - I p_m}{(I-1)(1-V_2)} \quad \text{for } p_i = p_m. \tag{7b}$$

It is also easily determined that

$$\bar{V}'_2 = \sum_i p_i V'_{2i} = V_2 - 1. \tag{8}$$

The variance $\sigma^2(\hat{V}_2)$ is given by

$$\sigma^2(\hat{V}_2) = \frac{1}{n}\left(\sum_i p_i (V'_{2i})^2 - (\bar{V}'_2)^2\right), \tag{9}$$

so that when the expressions in Equations 7 and 8 are substituted into Equation 9, the following variance formula is obtained:

$$\sigma^2(\hat{V}_2) = \frac{1}{n}\left\{\frac{\sum_i p_i (p_m - p_i)^2 + p_m (1 - I p_m)^2}{(I-1)^2 (1-V_2)^2} - (1-V_2)^2\right\}, \tag{10}$$

provided $V_2$ is neither 0 nor 1.

When $n$ is reasonably large, and when the estimated variance $\hat{\sigma}^2(\hat{V}_2)$ is computed from Equation 10 by replacing $p_i$ by the observed proportions $\hat{p}_i = n_i/n$ for all $i$, approximate inferences can be made about the population measure $V_2$ based on the preceding results. For most practical purposes, an investigator is likely to be particularly interested in constructing a confidence interval for $V_2$ and testing the hypothesis that two or more $V_2$ values are equal. Thus, an approximate $(1-\alpha)100\%$ confidence interval for $V_2$ is given by

$$\hat{V}_2 \pm z_{\alpha/2}\,\hat{\sigma}(\hat{V}_2), \tag{11}$$

where $z_{\alpha/2}$ is obtained from the standard normal tables (e.g., $z_{\alpha/2} = 1.96$ for $\alpha = .05$). The null hypothesis that two $V_2$ values are equal, that is, $H_0: V_{21} = V_{22}$, can be tested (against the alternative hypothesis $H_1: V_{21} \ne V_{22}$) using the test statistic

$$z = \frac{\hat{V}_{21} - \hat{V}_{22}}{\left[\hat{\sigma}^2(\hat{V}_{21}) + \hat{\sigma}^2(\hat{V}_{22})\right]^{1/2}} \tag{12}$$

and rejecting $H_0$ at the $\alpha$ level of significance if $|z| > z_{\alpha/2}$. This testing procedure is based on the assumption that the estimators $\hat{V}_{21}$ and $\hat{V}_{22}$ are independent, that is, that they are based on two independent samples.

Consider now the more general hypothesis $H_0: V_{21} = V_{22} = \cdots = V_{2k}$ for $k \ge 2$. Define the weighted average of the $\hat{V}_{2j}$ ($j = 1, 2, \ldots, k$) as

$$\bar{V}_2 = \frac{\sum_j \hat{V}_{2j}/\hat{\sigma}^2(\hat{V}_{2j})}{\sum_j 1/\hat{\sigma}^2(\hat{V}_{2j})}. \tag{13}$$

If the samples of sizes $n_j$ ($j = 1, 2, \ldots, k$) are independent and reasonably large, the hypothesis of equal variations may be tested by referring the value of the statistic

$$X^2 = \sum_j \frac{(\hat{V}_{2j} - \bar{V}_2)^2}{\hat{\sigma}^2(\hat{V}_{2j})} \tag{14}$$

to tables of the chi-square distribution with $k-1$ degrees of freedom (e.g., Agresti, 1984, pp. 190-194; Fleiss, 1981, section 10.1). The hypothesis is rejected at the $\alpha$ level of significance if the value of Equation 14 exceeds the upper $100\alpha\%$ point of the chi-square distribution with $k-1$ degrees of freedom.

... independent samples of unmarried males and females produced $n_i = 110, 80, 10$ and $n_i = 135, 120, 45$. For these data sets 3 and 4, it is found that $\hat{V}_{23} = .6309$, $\hat{\sigma}^2(\hat{V}_{23}) = .0016$, $\hat{V}_{24} = .7849$, and $\hat{\sigma}^2(\hat{V}_{24}) = .0011$. When these results together with those of the previous two data sets are substituted into Equations 13 and 14, it is found that $\bar{V}_2 = .6896$ and $X^2 = 14.10$. Since 14.10 exceeds even the tabulated chi-square value of 12.84 for $\alpha = .005$ and $k-1 = 3$ degrees of freedom, the decision is to reject $H_0: V_{21} = V_{22} = V_{23} = V_{24}$. Thus, it is concluded that the variation in party affiliation differs significantly ($p < .005$)
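The inference steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a definitive implementation: the function names are hypothetical, the variance is obtained by substituting the derivatives of Equations 7 and 8 directly into Equation 9 with the observed proportions, a unique modal category with 0 < SDM < 1 is assumed, and the weighted average used for the chi-square statistic is assumed to use inverse-variance weights.

```python
import math

def sdm(counts):
    """Standard deviation from the mode (Equation 6)."""
    n, k = sum(counts), len(counts)
    n_mode = max(counts)
    squared = sum((n_mode - c) ** 2 for c in counts)
    return 1 - math.sqrt(squared / (n * n * (k - 1)))

def sdm_variance(counts):
    """Estimated large-sample variance of SDM by the delta method.

    Assumes a unique modal category and 0 < SDM < 1.
    """
    n, k = sum(counts), len(counts)
    p = [c / n for c in counts]
    m = p.index(max(p))                        # position of the modal category
    v = sdm(counts)
    denom = (k - 1) * (1 - v)
    grad = [(p[m] - pi) / denom for pi in p]   # Equation 7a
    grad[m] = (1 - k * p[m]) / denom           # Equation 7b
    mean_grad = sum(pi * g for pi, g in zip(p, grad))        # Equation 8: equals v - 1
    second_moment = sum(pi * g * g for pi, g in zip(p, grad))
    return (second_moment - mean_grad ** 2) / n              # Equation 9

def confidence_interval(counts, z=1.96):
    """Approximate interval of Equation 11 (z = 1.96 for alpha = .05)."""
    v, se = sdm(counts), math.sqrt(sdm_variance(counts))
    return v - z * se, v + z * se

def two_sample_z(a, b):
    """Test statistic for equality of two independent variation measures."""
    return (sdm(a) - sdm(b)) / math.sqrt(sdm_variance(a) + sdm_variance(b))

def homogeneity_statistic(samples):
    """Weighted average (inverse-variance weights assumed) and chi-square statistic."""
    est = [sdm(c) for c in samples]
    var = [sdm_variance(c) for c in samples]
    pooled = sum(e / s for e, s in zip(est, var)) / sum(1 / s for s in var)
    x2 = sum((e - pooled) ** 2 / s for e, s in zip(est, var))
    return pooled, x2

data_set_3 = [110, 80, 10]
data_set_4 = [135, 120, 45]
print(round(sdm(data_set_3), 4), round(sdm_variance(data_set_3), 4))  # 0.6309 0.0016
print(round(sdm(data_set_4), 4), round(sdm_variance(data_set_4), 4))  # 0.7849 0.0011
print(confidence_interval(data_set_3))
print(two_sample_z(data_set_3, data_set_4))
```

The two printed pairs agree with the values quoted in the text for data sets 3 and 4. The pooled statistic is compared with the upper chi-square point for $k-1$ degrees of freedom; for two samples it reduces to the square of the two-sample $z$ statistic.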