
SPSS Algorithms for Bootstrapping and Jackknifing

Generalized Measures of Agreement

Brian G. Dates

Southwest Counseling and Development Services

Jason E. King

Baylor College of Medicine

Paper presented at the annual meeting of the Southwest Educational Research Association, New Orleans, LA, February 6-9, 2008.


Abstract

The authors recently developed SPSS syntax for obtaining multi-rater and multi-category measures of inter-rater reliability. The present work extends the syntax to include resampling algorithms (i.e., the jackknife, the percentile bootstrap, the bootstrap-t, and the bias-corrected and accelerated bootstrap).

SPSS Algorithms for Bootstrapping and Jackknifing Generalized Measures of Agreement

Measures of inter-rater agreement, or reliability, applicable to nominally scaled data have been available for some time (Scott, 1955; Cohen, 1960; Holley & Guilford, 1964; Fleiss, 1971; Light & Margolin, 1971; Conger, 1980; Perreault & Leigh, 1989; Gwet, 2001). However, researchers often remain unaware of these tools and their importance when conducting studies involving multiple raters. One reason is that the literature on inter-rater agreement is scattered and sparse. Further, published articles on the topic are generally limited in scope and length, often leaving the reader with more questions than answers. In addition, commercial software packages such as SAS and SPSS have not developed "point and click" solutions capable of analyzing multiple raters and/or multiple categories; the programs that are available usually provide analysis for only the two-rater, two-category form of the statistic.

In light of this need, the authors recently developed SPSS syntax for obtaining multi-rater and multi-category measures of inter-rater reliability (Dates & King, 2007). The present work extends the syntax to include bootstrap and jackknife resampling algorithms. The SPSS syntax files presented in this paper permit calculation of jackknife and BCa bootstrap estimates for several generalized measures of agreement: Bennett's S, Fleiss' generalized kappa, Cohen's kappa, Conger's kappa, and Gwet's AC1. Although we do not include syntax for Light's kappa or Krippendorff's alpha, syntax for the latter has been made available by Andrew Hayes at Ohio State University: http://www.comm.ohio-state.edu/ahayes/ .

Generalized Measures of Agreement

We first briefly describe the inter-rater reliability indices of interest in this paper, followed by a discussion of resampling techniques. Finally, we present the SPSS syntax with instructions for use.

Bennett's S

In 1954, Bennett and colleagues (Bennett, Alpert, & Goldstein) suggested adjusting inter-rater reliability to accommodate the percentage of rater agreement that might be expected by chance. To this end, they proposed an index which adjusted the proportion of rater agreement based on the number of categories employed. The formula is:

$$S = \frac{Q P_a - 1}{Q - 1}$$

where: Q = the number of categories and $P_a$ = the observed proportion of rater agreement.

Bennett's S, the name generally applied to this statistic, is also known as Guilford's G (Holley & Guilford, 1964) because Guilford was the first person to use the approach extensively in the determination of inter-rater reliability. In either case, Bennett provided a formula for the variance of the statistic as well:

$$s_S^{2} = \left(\frac{Q}{Q-1}\right)^{2} \frac{P_a(1 - P_a)}{n-1}$$

where n = the number of subjects rated.
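To make the computation concrete, the following is a minimal Python sketch (not the authors' SPSS syntax) that computes S and its variance for a small hypothetical two-rater dataset; numpy and the example ratings are assumptions introduced only for illustration.

import numpy as np

# hypothetical subjects x raters matrix of category codes (two raters)
ratings = np.array([[1, 1], [2, 2], [2, 1], [1, 1], [2, 2], [1, 2]])
n = ratings.shape[0]
Q = len(np.unique(ratings))                    # number of categories in use
Pa = np.mean(ratings[:, 0] == ratings[:, 1])   # observed proportion of agreement
S = (Q * Pa - 1) / (Q - 1)                     # Bennett's S: correction for number of categories
var_S = (Q / (Q - 1)) ** 2 * Pa * (1 - Pa) / (n - 1)
print(S, var_S ** 0.5)                         # statistic and its standard error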

Fleiss’ Generalized Kappa

In 1955, Scott developed a measure of inter-rater agreement, which he termed pi. Although Scott agreed that a term representing chance agreement among raters depends on the categories used, he criticized Bennett's S on the grounds that chance agreement should also take into account the marginal probabilities of agreement. His formula for two raters and two categories was:

$$\pi = \frac{p_a - p_{e.\pi}}{1 - p_{e.\pi}}$$

where: $\pi$ = the chance-corrected agreement coefficient, $p_a$ = the calculated proportion of rater agreement presented above, and $p_{e.\pi}$ = the proportion of rater agreement expected by chance, which for two raters and two categories was expressed as:

$$p_{e.\pi} = \frac{(n_{+1} + n_{1+})^{2} + (n_{+2} + n_{2+})^{2}}{4n^{2}}$$

Fleiss (1971) proposed a generalized kappa to encompass multiple raters and multiple categories. His formula for the expanded chance-corrected term, $p_{e.\kappa}$, was:

$$p_{e.\kappa} = \sum_{q=1}^{Q} \bar{p}_q^{\,2}$$

where: q = category, Q = the number of categories, R = the number of raters, n = the number of subjects, $r_{iq}$ = the number of raters who placed subject i in category q, and

$$\bar{p}_q = \frac{1}{n}\sum_{i=1}^{n} \frac{r_{iq}}{R}, \qquad \text{or, as one equation,} \qquad p_{e.\kappa} = \sum_{q=1}^{Q}\left(\frac{1}{n}\sum_{i=1}^{n}\frac{r_{iq}}{R}\right)^{2}$$

Therefore, $p_{e.\kappa}$ equals the sum across all categories of the square of the proportion of rater use per category. Fleiss' expansion (actually an expansion of Scott's pi, not Cohen's kappa) has become the principal index of inter-rater agreement over the past 30 years.
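As an illustration of the generalized formula, here is a brief Python sketch (again, not the authors' SPSS syntax) that computes Fleiss' kappa from a subjects-by-raters matrix; the observed-agreement term uses the standard Fleiss pairwise form, the function name is hypothetical, and the example data mirror the three-rater dataset shown later in the Instructions for Use section.

import numpy as np

def fleiss_kappa(ratings, Q):
    n, r = ratings.shape
    # r_iq: number of raters assigning subject i to category q
    counts = np.stack([(ratings == q + 1).sum(axis=1) for q in range(Q)], axis=1)
    # observed agreement: proportion of agreeing rater pairs per subject (standard Fleiss form)
    pa = ((counts * (counts - 1)).sum(axis=1) / (r * (r - 1))).mean()
    # chance agreement: sum of squared overall category proportions
    p_bar = counts.sum(axis=0) / (n * r)
    pe = (p_bar ** 2).sum()
    return (pa - pe) / (1 - pe)

ratings = np.array([[1, 1, 1], [2, 1, 2], [2, 2, 2], [2, 1, 2], [2, 1, 2], [2, 1, 2]])
print(fleiss_kappa(ratings, Q=2))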

Cohen’s Kappa

Cohen (1960) disagreed with Scott regarding the nature of the estimator of chance agreement, $p_e$. He believed that it should be based on each rater's specific category use rather than on overall category use alone. His formula for chance agreement took the form:

$$p_{e.\kappa} = \frac{n_{+1}\,n_{1+} + n_{+2}\,n_{2+}}{n^{2}}$$

Cohen based his chance agreement correction on the sum across categories of the product of rater marginals within each category. Thus, the appropriate expression for the expanded Cohen estimator is:

$$p_{e.\kappa} = \sum_{q=1}^{Q} p_{q}$$

where

$$p_{q} = \prod_{r=1}^{R} \frac{n_{rq}}{n},$$

$\Pi$ = the product of the terms that follow, and $n_{rq}$ = the number of subjects placed in category q by rater r. As one equation:

$$p_{e.\kappa} = \sum_{q=1}^{Q} \prod_{r=1}^{R} \frac{n_{rq}}{n}$$

There are references in the literature to an expansion of Cohen's kappa; however, the one paper we located presented an expansion for interval data. To our knowledge, ours is the only syntax that provides a solution for a generalized Cohen's kappa.
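The generalized chance-agreement term can be illustrated with a short Python sketch (not the authors' SPSS syntax). Only the chance term is computed here; it would be combined with the observed agreement as kappa = (pa - pe)/(1 - pe). The function name is hypothetical and the data again mirror the three-rater example used later in the paper.

import numpy as np

def cohen_pe_generalized(ratings, Q):
    # n_rq / n: proportion of subjects each rater placed in each category, shape (raters, Q)
    marginals = np.stack([(ratings == q + 1).mean(axis=0) for q in range(Q)], axis=1)
    # sum over categories of the product of the rater marginals within each category
    return marginals.prod(axis=0).sum()

ratings = np.array([[1, 1, 1], [2, 1, 2], [2, 2, 2], [2, 1, 2], [2, 1, 2], [2, 1, 2]])
print(cohen_pe_generalized(ratings, Q=2))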

Conger’s Kappa

Conger (1980) conceived of chance agreement in a manner similar to Fleiss, with the difference that the sum across all categories of the variance of the rater marginals is subtracted from Fleiss' $p_{e.\pi}$. Conger thereby created an expression for the proportion of agreement due to chance that eliminates the rater effect he believed was embedded in Fleiss' expression. The formula for Conger's chance agreement statistic took the form:

$$p_{e.C} = p_{e.\pi} - \sum_{q=1}^{Q} \frac{s_{q}^{2}}{R-1}$$

where $p_{e.C}$ = Conger's expression of chance agreement, $p_{e.\pi}$ = Fleiss' expression for chance agreement,

$$s_{q}^{2} = \frac{R\sum_{r=1}^{R} n_{rq}^{2} - \left(\sum_{r=1}^{R} n_{rq}\right)^{2}}{n^{2}R^{2}},$$

$n_{rq}$ = the number of subjects placed in category q by rater r, and R = the number of raters.
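A brief Python sketch of Conger's chance term, assuming the reconstruction above (Fleiss' term minus the per-category population variance of the rater marginals divided by R - 1); this is not the authors' SPSS syntax, the function name is hypothetical, and the data match the three-rater example used later in the paper.

import numpy as np

def conger_pe(ratings, Q):
    n, r = ratings.shape
    # p_rq: proportion of subjects rater r placed in category q, shape (raters, Q)
    p = np.stack([(ratings == q + 1).mean(axis=0) for q in range(Q)], axis=1)
    p_bar = p.mean(axis=0)                 # average marginal per category (Fleiss)
    pe_fleiss = (p_bar ** 2).sum()
    s2 = p.var(axis=0)                     # population variance of rater marginals, per category
    return pe_fleiss - s2.sum() / (r - 1)

ratings = np.array([[1, 1, 1], [2, 1, 2], [2, 2, 2], [2, 1, 2], [2, 1, 2], [2, 1, 2]])
print(conger_pe(ratings, Q=2))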

Gwet’s AC1

Gwet (2001) has proposed a third "kappa type" statistic (AC1), which, like Scott's pi, considers only category usage, but which also takes into consideration both the number of categories and the probability of category non-use. His formula for the proportion of agreement expected by chance, for two categories and two raters, is:

$$p_{e.AC1} = 2\pi_{1}(1 - \pi_{1})$$

where:

$$\pi_{1} = \frac{p_{+1} + p_{1+}}{2}, \qquad p_{+1} = \frac{n_{+1}}{n}, \qquad p_{1+} = \frac{n_{1+}}{n}$$

Gwet's chance correction factor is thus based on the product of the proportion of category use and the proportion of non-use of that category. For example, in the case of two categories the equation above can express $p_e$ solely in terms of $\pi_1$ because $(1 - \pi_1)$ is equal to $\pi_2$. Because Gwet's formula for $p_e$ is based on category use, its relationship to Scott's pi can be expressed as:

$$p_{e.AC1} = \frac{1 - p_{e.\pi}}{Q - 1}$$

The equation for the AC1 estimator may be expanded to accommodate multiple raters and multiple categories:

$$p_{e.AC1} = \frac{1}{Q-1}\sum_{q=1}^{Q} \pi_{q}(1 - \pi_{q})$$
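A minimal Python sketch of the generalized AC1 chance term (not the authors' SPSS syntax), with a hypothetical function name and the same three-rater example data; the category marginals are averaged across raters before applying the formula above.

import numpy as np

def gwet_pe_ac1(ratings, Q):
    # pi_q: average across raters of the proportion of subjects placed in category q
    pi = np.stack([(ratings == q + 1).mean(axis=0) for q in range(Q)], axis=1).mean(axis=0)
    return (pi * (1 - pi)).sum() / (Q - 1)

ratings = np.array([[1, 1, 1], [2, 1, 2], [2, 2, 2], [2, 1, 2], [2, 1, 2], [2, 1, 2]])
print(gwet_pe_ac1(ratings, Q=2))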

Jackknife and Bootstrap Approaches

Jackknife


The Tukey-Quenouille jackknife (Quenouille, 1956; Tukey, 1958) is a resampling technique used to estimate bias and standard errors from the data at hand (i.e., the sample drawn), without relying on parametric assumptions. The approach entails drawing n samples of size (n - 1): each subject is removed in turn from the dataset, the statistic is calculated on the remaining subjects, and the resulting estimates are aggregated to form an empirically derived sampling distribution from which bias and variability estimates can be derived. A jackknife statistic is considered more precise than the corresponding least squares solution. The jackknife bias is the difference between the inter-rater statistic value estimated via the least squares methodology and the value estimated through the jackknife resampling approach. Because the resampling outcome is considered closer to the true population value than is the least squares outcome, the jackknife bias is really a measure of bias in the least squares solution.
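As a sketch of the procedure, the following Python fragment (not the authors' SPSS syntax) applies the standard leave-one-out jackknife bias and standard-error formulas to an arbitrary agreement statistic; the dataset and the simple two-rater percent-agreement statistic are hypothetical stand-ins.

import numpy as np

def jackknife(ratings, stat):
    n = ratings.shape[0]
    full = stat(ratings)                                   # estimate from the complete sample
    loo = np.array([stat(np.delete(ratings, i, axis=0)) for i in range(n)])  # leave-one-out estimates
    bias = (n - 1) * (loo.mean() - full)                   # jackknife estimate of bias
    se = np.sqrt((n - 1) / n * ((loo - loo.mean()) ** 2).sum())
    return full - bias, bias, se                           # bias-corrected estimate, bias, standard error

ratings = np.array([[1, 1], [2, 2], [2, 1], [1, 1], [2, 2], [1, 2]])
percent_agreement = lambda x: np.mean(x[:, 0] == x[:, 1])
print(jackknife(ratings, percent_agreement))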

Bootstrap

Efron's (1979, 1982) bootstrap has proven effective in a variety of contexts, even more so than the jackknife. Once again, the conjecture is that the sampling distribution of a statistic can be approximated by the distribution of a large number of resampled estimates of the statistic obtained from a single sample of observations. Instead of removing one subject at a time and calculating the statistic for each of the n sets of (n - 1) subjects, bootstrapping draws a resample of size n with replacement, repeats the process a user-determined number of times, and then calculates the statistics of interest on each resample. Drawing 1,000 resamples is generally accepted as the minimum allowable, with more resamples needed to determine confidence intervals than to determine standard errors. Bootstrap bias has the same interpretation as jackknife bias.

Due to its efficacy, bootstrapping has become fashionable in many fields. In fact, "an almost bewildering array" of variants of bootstrap confidence intervals has been advanced (Hall, 1988, p. 927). These vary in the accuracy with which the bootstrap-generated interval spans the true interval, and accuracy is also contingent on the statistic under examination: no single bootstrapping routine is optimal across all statistical techniques.

The ordinary percentile method is perhaps the most commonly employed approach. For our application, the agreement statistics are determined by finding the mean of all the resamples, as in the least squares approach; the confidence limits, however, are determined by ordering the bootstrap values of the statistic in a vector and finding the values that correspond to the 0.025 and 0.975 percentile levels, respectively. Those are used as the limits of the 95% confidence interval. A second approach is the bootstrap-t method, which additionally estimates a sampling distribution of test statistics using the bootstrap resamples. However, this approach requires a parametric estimate of the standard error of interest; otherwise a "nested" bootstrap is needed (i.e., bootstrapping within every bootstrap sample).
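A minimal Python sketch of the percentile approach (not the authors' SPSS syntax): subjects are resampled with replacement, the statistic is recomputed on each resample, and the 0.025 and 0.975 points of the bootstrap distribution serve as the 95% confidence limits. The data, statistic, and parameter choices are illustrative only.

import numpy as np

def percentile_bootstrap(ratings, stat, reps=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = ratings.shape[0]
    # resample subjects (rows) with replacement and recompute the statistic each time
    boot = np.array([stat(ratings[rng.integers(0, n, n)]) for _ in range(reps)])
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])  # 0.025 and 0.975 points
    return boot.mean(), boot.std(ddof=1), (lo, hi)

ratings = np.array([[1, 1], [2, 2], [2, 1], [1, 1], [2, 2], [1, 2]])
percent_agreement = lambda x: np.mean(x[:, 0] == x[:, 1])
print(percentile_bootstrap(ratings, percent_agreement, reps=2000))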


Efron (1987) offered the bias-corrected and accelerated (BCa) bootstrap method to correct for problems with the percentile bootstrap. The BCa method allows for the possibility that the difference distribution associated with the bootstrap estimates is not centered on zero but is distributed around an unknown constant (i.e., "biased") and with non-constant variance (i.e., the variance tends to "accelerate" across values in the sampling distribution). The method adjusts the endpoints of the bootstrap distribution for bias and non-constant variance. Computing the accelerated bootstrap confidence interval requires estimating a bias coefficient (z0) and an acceleration coefficient (a). Both can be estimated nonparametrically from the data or calculated theoretically for a specific distribution. The syntax at hand used constants based on the normal distribution.
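For comparison, here is a compact Python sketch of a BCa interval (not the authors' SPSS syntax, which uses normal-distribution constants); this version estimates z0 from the bootstrap distribution and a from a jackknife, the common nonparametric route, and assumes numpy and scipy are available. The data and statistic are again illustrative.

import numpy as np
from scipy.stats import norm

def bca_interval(ratings, stat, reps=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = ratings.shape[0]
    theta = stat(ratings)
    boot = np.array([stat(ratings[rng.integers(0, n, n)]) for _ in range(reps)])
    z0 = norm.ppf((boot < theta).mean())                     # bias-correction coefficient
    loo = np.array([stat(np.delete(ratings, i, axis=0)) for i in range(n)])
    d = loo.mean() - loo
    a = (d ** 3).sum() / (6 * (d ** 2).sum() ** 1.5)         # acceleration coefficient (jackknife estimate)
    z = norm.ppf([alpha / 2, 1 - alpha / 2])
    adj = norm.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))       # adjusted percentile levels
    return tuple(np.quantile(boot, adj))                     # BCa confidence limits

ratings = np.array([[1, 1], [2, 2], [2, 1], [1, 1], [2, 2], [1, 2]])
percent_agreement = lambda x: np.mean(x[:, 0] == x[:, 1])
print(bca_interval(ratings, percent_agreement))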

SPSS Syntax

Instructions for Use

Each syntax file employs a macro and performs most of the calculations using matrix operations. The syntax accommodates any number of raters, subjects, and categories. The SPSS data file should be prepared with raters as columns (variables) and subjects as rows. The rater variables should be named "rater" with a number appended, e.g., rater1, rater2, rater3, etc. Each cell holds the category number assigned by a single rater to a single subject. Categories are numbered only for identification and do not represent any ordered relationship or relative preference. The lowest number used for a rating category must be "1". Missing data cause a subject to be eliminated from the analysis, since all of the inter-rater reliability statistics assume complete datasets. When filled in, the dataset should look like the example below for three raters and six subjects.

       rater1   rater2   rater3
  1       1        1        1
  2       2        1        2
  3       2        2        2
  4       2        1        2
  5       2        1        2
  6       2        1        2

The first step in running any of the syntaxes is to tell SPSS where to get the data file containing the dataset. In the first line of any of the syntaxes, after the “Get file” command, fill in the path to the data file, as in:

GET FILE='c:/program files/spss/yourfile.sav'.

This tells SPSS to go to the c:/ drive, look in the folder "program files", then in the folder "spss", and open the SPSS data file "yourfile.sav" (yourfile.sav is a placeholder name).

The second step before running the syntax is to make sure the macro call line is filled in appropriately. This line, which is preceded by the comment line "Macro Call Line" near the end of the syntax file just before Section 9, contains information about the raters and the number of bootstrap resamplings desired for the analysis. On the macro call line, enter the range of raters, e.g., rater1 to rater3 for three raters or rater1 to rater6 for six raters, after the prompt "raters=" and before the first "/". After the slash, enter the number of bootstrap resamplings desired after the prompt "reps=". If the choice were five raters and 5,000 bootstrap samples, the entire macro call line would read,

irr VARS=rater1 to rater5/reps=5000 .

Once the macro call line has been completed, the entire syntax file can be selected and run.

Output

The first section of the output in the Viewer contains two tables. The first table gives the number of times each rater employed each category, with raters as rows and categories as columns. The second table gives the number of times each rater agreed with all the other raters in each category, again with raters as rows and categories as columns. Also presented are basic summaries of the number of items (subjects) rated, the number of rating categories, the number of raters, and the overall proportion of rater agreement.

The next section contains the inter-rater reliability statistics based on the standard least squares approach. From left to right these are: 1) the inter-rater agreement statistic, 2) the asymptotic standard error of the statistic, 3) the standardized z-score indicating the number of standard deviations of the agreement statistic from the mean, 4) the probability (significance) level associated with the z-statistic, 5) the lower bound of the 95% confidence interval around the statistic, and 6) the upper bound of the 95% confidence interval. These are presented first for the overall rater agreement statistic and then for the individual categories. If a category is not used by any of the raters, the output shows -99999.00 to indicate missing data. The exception is Bennett's S, which has no category-based statistics.

Next follow the jackknife and bootstrap estimates of the sample statistic, the bias, and the number of resamples computed, followed by overall and category-specific information as in the least squares solution. The output contains estimates based on the conditional variance, that is, the variance based on subjects alone. Results can be generalized to the universe of subjects but not to the universe of raters; they are generalizable only to the raters used to generate the dataset. It could be said, therefore, that the same group of raters would rate any set of subjects on the same factors with the same level of agreement as in the present study.

The unconditional variance solution is determined by adding the variance among raters to the conditional variance. This variance estimate would be useful if the goal were to develop a rating system that could be used by any rater, not simply those in the study at hand. This will not often be the case, as a very large number of both subjects and raters is necessary to justify the unconditional variance solution. Where the unconditional variance does apply, it can be said that any group of raters rating any group of subjects on the same factors as in the study at hand would show the same level of agreement as in the present study. If the user is interested in a solution based on the unconditional variance, an ordinary least squares solution is available in the syntax on Jason King's website: http://www.ccitonline.org/jking/homepage/ .

The next output is a graph of the bootstrap distribution with a normal curve superimposed, which gives the user a visual representation of the distribution. Finally, Section 9 of the syntax contains references to SPSS scripts that clean the Viewer by eliminating irrelevant output. These may optionally be developed and applied by the user. The script that deletes the Case Processing Summary is created from the "Clean Viewer" script in the SPSS Scripts folder by typing the following lines in the Sub Main section of the script:

intType = SPSSPivot
strLabel = "Case Processing Summary"

The remaining scripts are created by altering a line in the SPSS Starter Script "Delete Viewer Items": remove the apostrophe before the type of item for which deletion is desired. If the user decides not to use these scripts, simply delete the references to them from the syntax. If some or all of the deletions are desired, the line alteration for each script and the filename under which to save it are:

Deletion Purpose          Line Change                                Save in Scripts Folder As
Case Processing Summary   intType = SPSSPivot                        Delete Case Processing Summary.sbs
                          strLabel = "Case Processing Summary"
Output Title              'intTypeToDelete = SPSSTitle               Delete Output Title.sbs
                          to intTypeToDelete = SPSSTitle
Output Warnings           'intTypeToDelete = SPSSWarning             Delete Output Warnings.sbs
                          to intTypeToDelete = SPSSWarning
Output Notes              'intTypeToDelete = SPSSNote                Delete Output Notes.sbs
                          to intTypeToDelete = SPSSNote
Output Log                'intTypeToDelete = SPSSLog                 Delete Output Log.sbs
                          to intTypeToDelete = SPSSLog


References

Bennett, E. M., Alpert, R., & Goldstein, A. C. (1954). Communications through limited response questioning. Public Opinion Quarterly, 18, 303-308.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.

Conger, A. J. (1980). Integration and generalization of kappas for multiple raters. Psychological Bulletin, 88, 322-328.

Dates, B. G., & King, J. E. (2007, February). SPSS solutions for inter-rater agreement using nominal data. Paper presented at the annual meeting of the Southwest Educational Research Association, San Antonio, TX, February 7-9, 2007.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.

Efron, B. (1982). The jackknife, the bootstrap, and other resampling plans. Philadelphia: Society for Industrial and Applied Mathematics.

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171-185.

Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378-382.

Gwet, K. (2001). Handbook of inter-rater reliability. STATAXIS Publishing Company.

Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals. Annals of Statistics, 16, 927-953.

Holley, J. W., & Guilford, J. P. (1964). A note on the G index of agreement. Educational and Psychological Measurement, 24, 749-753.

Light, R. J., & Margolin, B. H. (1971). An analysis of variance for categorical data. Journal of the American Statistical Association, 66, 534-544.

Perreault, W. D., & Leigh, L. E. (1989). Reliability of nominal data based on qualitative judgments. Journal of Marketing Research, 26, 135-148.

Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika, 43, 353-360.

Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19, 321-325.

Tukey, J. W. (1958). Bias and confidence in not-quite large samples. Annals of Mathematical Statistics, 29, 614.

