Journal of Clinical Epidemiology 64 (2011) 96–106

Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed

Jan Kottner a,*, Laurent Audigé b, Stig Brorson c, Allan Donner d, Byron J. Gajewski e, Asbjørn Hróbjartsson f, Chris Roberts g, Mohamed Shoukri h, David L. Streiner i

a Department of Nursing Science, Centre for Humanities and Health Sciences, Charité-Universitätsmedizin Berlin, Berlin, Germany
b AO Clinical Investigation and Documentation, Dübendorf, Switzerland
c Department of Orthopaedic Surgery, Herlev University Hospital, Herlev, Denmark
d Department of Epidemiology and Biostatistics, Schulich School of Medicine and Dentistry, The University of Western Ontario, London, Ontario, Canada
e Department of Biostatistics, University of Kansas Schools of Medicine & Nursing, Kansas City, KS, USA
f The Nordic Cochrane Centre, Rigshospitalet, Copenhagen, Denmark
g School of Community Based Medicine, The University of Manchester, Manchester, UK
h Department of Biostatistics, Epidemiology and Scientific Computing, King Faisal Specialist Hospital and Research Center, The University of Western Ontario, London, Ontario, Canada
i Department of Psychiatry, University of Toronto, Toronto, Ontario, Canada

Accepted 2 March 2010

Abstract

Objective: Results of reliability and agreement studies are intended to provide information about the amount of error inherent in any diagnosis, score, or measurement. The level of reliability and agreement among users of scales, instruments, or classifications is widely unknown. Therefore, there is a need for rigorously conducted interrater and intrarater reliability and agreement studies. Information about sample selection, study design, and statistical analysis is often incomplete. Because of inadequate reporting, interpretation and synthesis of study results are often difficult. Widely accepted criteria, standards, or guidelines for reporting reliability and agreement in the health care and medical field are lacking.
The objective was to develop guidelines for reporting reliability and agreement studies.
Study Design and Setting: Eight experts in reliability and agreement investigation developed guidelines for reporting.
Results: Fifteen issues that should be addressed when reliability and agreement are reported are proposed. The issues correspond to the headings usually used in publications.
Conclusion: The proposed guidelines intend to improve the quality of reporting. © 2011 Elsevier Inc. All rights reserved.

Keywords: Agreement; Guideline; Interrater; Intrarater; Reliability; Reproducibility

* Corresponding author. Department of Nursing Science, Centre for Humanities and Health Sciences, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany. Tel.: +49-30-450-529-054; fax: +49-30-450-529-900. E-mail address: [email protected] (J. Kottner).
0895-4356/$ – see front matter © 2011 Elsevier Inc. All rights reserved. doi:10.1016/j.jclinepi.2010.03.002

What is new?

Key finding
- Reporting of interrater/intrarater reliability and agreement is often incomplete and inadequate.
- Widely accepted criteria, standards, or guidelines for reliability and agreement reporting in the health care and medical fields are lacking.
- Fifteen issues that should be addressed when reliability and agreement are reported are proposed.

What this adds to what is known
- There is a need for rigorous interrater and intrarater reliability and agreement studies to be performed in the future.
- Systematic reviews and meta-analyses of reliability and agreement studies will become more frequent.

What is the implication and what should change now?
- Reporting of reliability and agreement studies should be improved.
- The proposed guidelines help to improve reporting.

1. Background

Reliability and agreement are important issues in classification, scale and instrument development, quality assurance, and in the conduct of clinical studies [1–3]. Results of reliability and agreement studies provide information about the amount of error inherent in any diagnosis, score, or measurement, where the amount of measurement error determines the validity of the study results or scores [1,3–6].

The terms "reliability" and "agreement" are often used interchangeably. However, the two concepts are conceptually distinct [7–12]. Reliability may be defined as the ratio of variability between subjects (e.g., patients) or objects (e.g., computed tomography scans) to the total variability of all measurements in the sample [1,6]. Therefore, reliability can be defined as the ability of a measurement to differentiate between subjects or objects. On the other hand, agreement is the degree to which scores or ratings are identical. Both concepts are important, because they provide information about the quality of measurements. Furthermore, the study designs for examining the two concepts are similar. We focus on two aspects of these concepts:

- Interrater agreement/reliability (different raters, using the same scale, classification, instrument, or procedure, assess the same subjects or objects)
- Intrarater agreement/reliability (also referred to as test–retest) (the same rater, using the same scale, classification, instrument, or procedure, assesses the same subjects or objects at different times).

Issues regarding internal consistency are not addressed here.

When reporting the results of reliability and agreement studies, it is necessary to provide information sufficient to understand how the study was designed and conducted and how the results were obtained. Reliability and agreement are not fixed properties of measurement tools but, rather, are the product of interactions between the tools, the subjects/objects, and the context of the assessment. Reliability and agreement estimates are affected by various sources of variability in the measurement setting (e.g., rater and sample characteristics, type of instrument, administration process) and the statistical approach (e.g., assumptions concerning measurement level, statistical model). Therefore, study results are only interpretable when the measurement setting is sufficiently described and the method of calculation or graphical presentation is fully explained.

After reviewing many reliability and agreement studies, it becomes apparent that important information about the study design and statistical analysis is often incomplete [2,12–21]. Because of inadequate reporting, interpretation and synthesis of the results are often difficult. Moreover, widely accepted formal criteria, standards, or guidelines for reliability and agreement reporting in the health care and medical fields are lacking [2,22,23]. Authors of reviews have often established their own criteria for data extraction and quality assessment [12,15,17,19,21–25]. The American Educational Research Association, American Psychological Association (APA), and the National Council on Measurement in Education have attempted to improve the reporting of reliability and agreement, but they focus on psychological testing [26].

2. Project

In the absence of standards for reporting reliability and agreement studies in the medical field, we evolved the idea that formal guidelines might be useful for researchers, authors, reviewers, and journal editors. The lead author initially contacted 13 experts in reliability and agreement investigation and asked whether they saw a need for such guidelines and whether they wished to take part in this project. The experts were informally identified based on substantial contributions and publications in the field of agreement and reliability. Each expert was also asked whether he knew an additional expert who would be interested in participating. Finally, from all these individuals, an international group of eight researchers was formed, experienced in instrument development and evaluation (L.A., D.L.S., S.B., and A.H.), reliability and agreement estimation (A.D., B.J.G., C.R., and M.S.), or in systematic reviews of reliability studies (L.A., S.B., and A.H.).

The objective was to develop guidelines for reporting reliability and agreement studies. The specific aim was to establish items that should be addressed when reliability and agreement studies are reported.

The development process contained elements of Glaser's state-of-the-art method [27] and of the nominal group technique [28]. Based on an extensive literature review, the project coordinator (J.K.) produced a first draft of guidelines, including an initial list of possible items. It was sent to all team members, who were asked to review and comment on the document and wording, propose changes, and indicate whether they agreed or disagreed. Feedback indicated that there was a need for clarification of key concepts related to reliability and agreement studies. Therefore, definitions of key concepts were discussed among the team members, and definitions were established (Appendix). Based on the comments from the first review, a second draft was created and again sent to team members, accompanied by a summary report of all criticisms, discussions, and suggestions that had been made. Based on the critiques of the second draft, a third draft was created, reviewed, and discussed again. After the review of the fourth draft, consensus was achieved, and team members approved these guidelines in their current form.

Table 1. Guidelines for Reporting Reliability
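The conceptual distinction drawn in the Background, between reliability as a between-subject variance ratio and agreement as the degree to which ratings are identical, can be made concrete with a small numerical sketch. The sketch below uses invented scores (not data from the article) for two raters scoring six subjects, computes a one-way random-effects intraclass correlation coefficient, ICC(1,1), as one common variance-ratio reliability index, and contrasts it with exact agreement (the proportion of subjects on whom the raters give identical scores):

```python
# Hypothetical illustration of reliability vs. agreement.
# Scores are invented: two raters rate the same six subjects on a 1-10 scale.
# Reliability is estimated here as ICC(1,1), the one-way random-effects
# intraclass correlation (between-subject variance relative to total variance);
# agreement is the proportion of subjects with identical ratings.

def icc_1_1(ratings):
    """One-way random-effects ICC for a subjects-by-raters table of scores."""
    n = len(ratings)        # number of subjects
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    subj_means = [sum(row) / k for row in ratings]
    # Between-subjects and within-subjects mean squares from one-way ANOVA
    msb = k * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    msw = sum((x - m) ** 2
              for row, m in zip(ratings, subj_means)
              for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

def exact_agreement(ratings):
    """Proportion of subjects on whom all raters give identical scores."""
    return sum(len(set(row)) == 1 for row in ratings) / len(ratings)

scores = [[2, 3], [4, 4], [5, 6], [7, 7], [8, 9], [9, 9]]
print(round(icc_1_1(scores), 2))       # -> 0.96
print(round(exact_agreement(scores), 2))  # -> 0.5
```

With these invented scores, reliability is high (0.96) because the raters order the subjects almost identically, yet exact agreement is only 0.5 because they give identical scores on only three of six subjects. This is precisely why the guidelines treat the two concepts, and the statistics reported for each, as distinct.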