Interval Estimation for a Difference Between Intraclass Kappa Statistics

BIOMETRICS58, 209-215 March 2002 Interval Estimation for a Difference Between Intraclass Kappa Statistics Allan Donner* and Guangyong Zou Department of Epidemiology and Biostatistics, University of Western Ontario, London, Ontario N6A 5C1, Canada * e-mail: [email protected] SUMMARY.Model-based inference procedures for the kappa statistic have developed rapidly over the last decade. However, no method has yet been developed for constructing a confidence interval about a difference between independent kappa statistics that is valid in samples of small to moderate size. In this article, we propose and evaluate two such methods based on an idea proposed by Newcombe (1998, Statistics in Medicine, 17, 873-890) for constructing a confidence interval for a difference between independent propor- tions. The methods are shown to provide very satisfactory results in sample sizes as small as 25 subjects per group. Sample size requirements that achieve a prespecified expected width for a confidence interval about a difference of kappa statistic are also presented. KEYWORDS: Confidence interval; Interobserver agreement; Intraclass correlation; Reliability studies; Sam- ple size. 1. Introduction of many measurement studies in educational and psychologi- There has recently been increased interest in developing mod- cal research (e.g., Alsawalmeh and Feldt, 1992). el-based inferences for the kappa statistic, particularly as ap- Appropriate testing procedures for this problem have been plied to the assessment of interobserver agreement. Most of described by Donner, Eliasziw, and Klar (1996) for the case this research has focused on problems that arise in a single of independent samples of subjects and Donner et al. (2000) sample of subjects, each assessed on the presence or absence for the case of dependent samples involving the same sub- of a binary trait by two observers (e.g., Donner and Eliasziw, jects. Reed (2000) has also addressed this problem, provid- 1992; Hale and Fleiss, 1993; Basu, Banerjee, and Sen, 2000; ing an executable Fortran code for testing the homogeneity of kappa statistics in the former case. However, to the best Klar, Lipsitz, and Ibrahim, 2000; Nam, 2000). Extension of of our knowledge, procedures have not yet been developed this research to single-sample inference problems involving for constructing an interval estimate about the difference be- more than two raters and/or multinomial outcomes has also tween two kappa statistics. With the increased emphasis on recently appeared (e.g., Altaye, Donner, and Klar, 2001; Bart- confidence-interval construction as an alternative to signifi- fay and Donner, 2001). cance testing, this would seem to be an important gap in the Although single-sample problems involving the kappa sta- literature. tistic perhaps predominate, the need to compare two or more As discussed by Nam (2000), there are two models that kappa statistics also frequently arises in practice. For exam- have tended to be adopted for constructing inferences for the ple, it may be of interest to compare measures of interobserver kappa statistic in the caSe of two raters and a binary outcome. agreement across different study populations or different pa- The first model allows the marginal probabilities of a success tient subgroups in a single study. The latter was an objective to differ between the two raters and leads naturally to Cohen’s of a study reported by Ishmail et al. (1987) that focused on (1960) kappa. The second model assumes the same marginal interobserver agreement with respect to the presence of the probability of success for each rater and leads naturally to the third heart sound (S3), recognized as an important sign in intraclass kappa statistic, identically equal to Scott’s (1955) T. the evaluation of patients with congestive heart failure. One Landis and Koch (1977) and Bloch and Kraemer (1989) have question of interest in this study was whether the degree of presented arguments supporting the use of this model when interobserver agreement varied by the gender of the patient, the main emphasis is directed at the reliability of the mea- an issue that arose because it may be more difficult to hear S3 surement process rather than in potential differences among in women compared with men. Other examples of such inves- raters. Zwick (1988) also discusses this issue, pointing out tigations are given by Barlow, Lai, and Azen (1991), McLellan that, if marginal differences are small, the value of Cohen’s et al. (1985), and McLaughlin et al. (1987). The comparison of kappa will be close to that of Scott’s T and the choice between coefficients of interobserver agreement is also the major focus them will not be important. Furthermore, as shown by Black- 209 210 Biometries, March 2002 man and Koval (2000), the two estimators are asymptotically Let (Ih,uh) denote lower and upper lOO(1 - a)% limits equivalent. for nh. Then the hybrid confidence interval for A is given by In this article, we develop and evaluate a method for con- (L,U),where structing a confidence interval for a difference between two in- L = (RI - R2) - ~,~2{var(ki)l~~=~~+ ~ar(k2)1~~=~~ }1/2 traclass kappa statistics computed from independent samples of subjects. The procedures compared are developed in Sec- and tion 2, followed in Section 3 by a simulation study evaluating = - k2) } 1/2 ' their properties. Section 4 provides guidelines for sample size u (a1 + ~,/2{var(~1)lK,"u,+",,(R2)("2=12 estimation, while two examples illustrating the procedures are Further details concerning the theoretical basis for this pro- presented in Section 5. The article concludes with overall rec- cedure are given in the Appendix. ommendations for practice and with advice concerning avail- One approach to obtaining the limits (Ih, uh), h = 1,2, able software. was proposed by Donner and Eliasziw (1992), who used a 2. Development of Procedures goodness-of-fit approach to construct a confidence interval for Following Donner et al. (1996), we let Xijh denote the rating nh. This approach, algebraically equivalent to an approach for the ith subject by the jth rater in sample h (h = 1,2). independently proposed by Hale and Fleiss (1993), assumes Letting nh = Pr(Xijh = 1) be the probability of a success, only that the observed frequencies Wkh (k = 1,2,3) follow a the probabilities of joint responses within sample h are given multinomial distribution with corresponding probability Pkh by Plh(nh) = Pr(Xi1h = 1,X,2h = 1) = 7ri + nh(1- nh)nh, conditional on Nh. If estimated probabilities &(Kh) are ob- P2h(nh) = Pr(Xilh = 0,XiZh = 1 or Xilh = 1, Xi2h = 0) = tained by replacing ~h with ?h, it then follows that x; = 2nh(l - sjLd(l- nh), and P3h(nh) = Pr(Xi1h = O,Xi2h = 0) ~;=,{nkh - NhPkh (nh)12 / {NhPkh (nh)} has a limiting chi- = (1 - nh) +nh(l - nh)nh. This model, previously discussed square distribution with 1 d.f. One can then obtain the con- by several authors, including Mak (1988) and Bloch and Krae- fidence limits (Zh,uh) for Kh by finding the two admissible mer (1989), has been referred to as the common correlation roots to the cubic equation xG2 = x:,~-,. We refer to this model because it assumes that the correlation between any method as the hybrid goodness-of-fit (HGOF) approach. pair (Xilh,Xi~h)has the same value nh. An alternative method of obtaining (Ih,uh) is to invert a We now assume that the observed frequencies nlh,n2h, n3h modified Wald test (Rao and Mukerjee, 1997), which may corresponding to Plh(nh), P2h(nh), Psh(nh) follow a multi- be constructed by replacing var(kh) with vTrrp(Rh),obtained nomial distribution conditional on Nh = $=, nkh, the to- by substituting ejrh for nh. The confidence limits for nh are tal number of subjects in sample h. Letting +h = (2nlh + then given by the two admissible roots of the cubic equation 2 2 nzh)/(2Nh), Bloch and Kraemer (1989) show that, for a sin- given by (Rh - nh) /varp(nh) = xl,l-cu. An approach similar gle sample of subjects, the maximum likelihood estimator of to this was applied by Lee and Tu (1994) to the problem of nh under the common correlation model is given by k.h = constructing confidence limits about Cohen's kappa, who refer 1-7~h/{2Nh?h (1-?h)}, with large-sample variance obtained to it as the profile variance approach. We refer to it here as as var(Rh) = (1 - nh)[( 1- nh)( 1- 2nh) +nh(2 - nh)/{2nh( 1- the hybrid profile variance (HPR) approach. Note that the nh)}]/Nh.It is easily verified that kh is simply the standard HGOF and HPR methods differ only in how the confidence intraclass correlation coefficient (e.g., Snedecor and Cochran, limits (Zh, uh), h = 1,2, are computed, 1989) as applied to dichotomous outcome data. We can estimate var(kh) by substituting +h and k.h for rh 3. Evaluation and nh, respectively. A large-sample lOO(1 - a)%confidence The finite sample properties of the three methods presented interval for n1-n~ is then given by (R1 - R2)?~z,/~{v2r(k.1)+ in Section 2 are, for the most part, intractable. We therefore &ir(R2)}1/2, where i&ir(kh) is the sample estimate of var(kh) compared the performance of the SA, HGOF, and HPR meth- and 2,p denotes the lOO(1 - a/2)% percentile point of the ods using Monte Carlo simulation. Simulation runs having standard normal distribution. We refer to this method as the +h(l-+h) = 0 (for which kh is undefined) were discarded until simple asymptotic (SA) method. 10,000 runs were obtained. The expected proportion of such A potential difficulty with the SA method is that the sam- runs is given by {n2+ 7r( 1 - n)~}~$- { (1- T)~ +n( 1- n)~}~, pling distribution of kh may be far from normal when nh is well under 5% for all parameter combinations considered in close to one and the nuisance parameter nh is close to zero or the simulation.

Load more