Proceed Boldly Yet Cautiously — in the Patient-reported Outcomes (PRO) World

Dorota Staniewska, PhD, Psychometric Statistician, Outcomes Research, Evidera

INTRODUCTION

Advanced psychometric techniques have been gaining ground in recent years in the evaluation of patient-reported outcome (PRO) instruments.1,2 Properly applied, psychometric modeling (whether from the IRT or Rasch families) can provide unparalleled power in detecting non-functioning items, help define disease-specific outcomes, and specify responder behavior. Misused, these methods can lead to wrong inferences about the population and to the selection of inappropriate items for analysis.

The advantages that parametric modeling brings to instrument development and to the understanding of population behavior are reviewed here, together with words of caution regarding indiscriminate application of measurement theory models.

PROCEED BOLDLY!

Item response theory (IRT) defines patient responses to each individual item as a function of a patient characteristic (the latent trait) and of the characteristics of the item (generally called discrimination and difficulty, following educational measurement conventions). IRT is a powerful technique allowing for a more in-depth understanding of the underlying population and item characteristics. Because IRT has been used extensively in educational testing over the last 40 years, robust analytic techniques have been developed for most of the estimation problems. Unlike Classical Test Theory techniques, which describe patient performance in terms of a domain or total score and consider all items to be equal, the IRT approach examines each item's contribution to the construct measured by the whole instrument. With IRT, given acceptable item fit, more information can be gleaned about the quality of measurement and, because person latent traits and item difficulties are on the same scale, an immediate check of whether the two are compatible is possible. In particular, the following issues have strong theoretical underpinnings:

1. Construction of new instruments with strong measurement properties;
2. Evaluation of the fit of each individual item to the measurement model chosen;
3. Evaluation of the statistical consequences of choosing some items over others for the instrument;
4. Evaluation of the relative merits of different instruments measuring the same trait;
5. Detection of the presence of potentially biased items; and
6. Detection of changes in latent trait across different evaluation times for subpopulations of interest.

IRT methods allow for collecting items that measure the same latent trait into robust and statistically valid item banks. In addition, they naturally provide a measurable degree of precision at every level of the latent trait and, through item and test information, describe the degree of precision of both the individual item and the whole instrument at each level of the latent trait.
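As a concrete illustration of these ideas, here is a minimal Python sketch of the two-parameter logistic (2PL) model, one common member of the IRT family for dichotomous items. The item parameters below are hypothetical; in practice they would be estimated from patient response data.

```python
import numpy as np

def irf_2pl(theta, a, b):
    """2PL item response function: probability of endorsing an item at
    latent trait theta, given discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information_2pl(theta, a, b):
    """Fisher information contributed by one 2PL item at theta."""
    p = irf_2pl(theta, a, b)
    return a**2 * p * (1.0 - p)

# Hypothetical (a, b) parameters for a three-item PRO scale
items = [(1.2, -0.5), (0.8, 0.0), (1.5, 1.0)]
theta_grid = np.linspace(-3, 3, 61)

# Test information is the sum of the item informations; the standard
# error of the latent trait estimate at theta is its inverse square root
test_info = sum(item_information_2pl(theta_grid, a, b) for a, b in items)
se_theta = 1.0 / np.sqrt(test_info)
```

Because difficulty b lives on the same scale as theta, plotting test_info against theta_grid shows at a glance where the instrument measures precisely and where it does not.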

This is paramount in developing parallel forms of instruments, and is especially salient in Computer Adaptive Testing, where in-depth information about each item is necessary in order to pick the one most appropriate to the current estimate of the patient's latent trait. Applying IRT techniques can also be useful at the development stage of an instrument, when psychometric item fit and distractor performance can be examined in order to select the items that best fit the population.

Additional techniques readily available when using psychometrics are Differential Item Functioning (DIF) and equating. Differential Item Functioning was developed as part of Classical Test Theory and then expanded by the application of IRT methods. DIF helps to identify potentially biased items, i.e., items for which one subgroup (for example, males, when DIF due to gender is being examined) scores differently (lower or higher) on the item than the other subgroup when controlled for the latent trait. As the population can be partitioned in many ways (for example, by gender, race, education, or disease group), this is a very powerful technique, alerting the researcher to problems with certain items and, more importantly, having the potential to further inform the instrument development process. Thus, DIF is quite useful in PRO development both to examine subgroup differences in responses in particularly heterogeneous patient populations and to provide a quantitative measure of the variabilities discovered during the qualitative phase of development (provided an adequate sample is available).
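To make the DIF idea concrete, the sketch below runs a common logistic-regression screen for uniform DIF on simulated data: the item response is modeled as a function of a matching score (here, a rest score standing in for the latent trait) and a group indicator. All names and values are illustrative, and this is only one of several accepted DIF procedures.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: binary responses to one item, a "rest score" (total on
# the remaining items, standing in for the latent trait), and a group
# indicator (e.g., 0 = female, 1 = male)
rng = np.random.default_rng(0)
n = 400
rest_score = rng.normal(0.0, 1.0, n)
group = rng.integers(0, 2, n)
# Build in uniform DIF: the item is harder for group 1 at equal trait level
logit = 1.0 * rest_score - 0.6 * group
item = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# Logistic-regression DIF screen: a significant group coefficient after
# controlling for the matching score flags uniform DIF; adding a
# group-by-score interaction term would test for non-uniform DIF.
X = sm.add_constant(np.column_stack([rest_score, group]))
fit = sm.Logit(item, X).fit(disp=0)
print("coefficients (const, rest_score, group):", fit.params)
print("p-values:", fit.pvalues)
```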

Equating allows patient latent traits (i.e., scores) obtained across different administrations of an instrument to be put on the same scale. In particular, while the follow-up version of an instrument might differ from the baseline version (through, for instance, the addition of new items), as long as the number of overlapping items is sufficient (30 to 70 percent, depending on the construct3), the IRT-based scale scores from the two instruments can be directly compared with equating. This in turn allows for valid interpretations of any observed improvements in score. Another application of equating scale scores would be equating two different populations (e.g., pediatric and adult cancer patients) so that they, too, can be directly compared.

PROCEED CAUTIOUSLY!

While software for the analysis of instrument responses has been developed (e.g., RUMM, IRTPRO, Multilog, and even an experimental SAS procedure), both the setting up of the models and the interpretation of the output are not always as straightforward as they might seem and should be approached with care. In particular, standard normal distributions for the latent trait and the difficulty parameter are generally assumed and estimated; if this is not the case for the PRO (if, for example, the behavior is unipolar, like alcohol abuse disorder, or bimodal, like spinal muscular atrophy), care should be taken to set reasonable initial estimates of the population distribution.

One should never forget that item response theory models come with strong parametric assumptions; all models assume unidimensionality (only one trait is measured by the collection of items), monotonicity (the probability of a higher response increases with increased latent trait), and local independence (only the latent trait explains performance on an item; conditioned on it, the responses are independent). While small deviations from these three assumptions are permissible,4,5 conspicuous violations of any of them result in faulty inferences about model fit and in the violation of construct validity (i.e., what is measured by the instrument is no longer what was intended and may, in fact, be impossible to ascertain). In the context of PRO instruments, this translates directly into the impossibility of interpreting the significance of improvements in the score. If violations are suspected, either IRT is not appropriate for the scale or more advanced IRT approaches need to be employed (such as those developed by Mark Reckase6 or Howard Wainer7).
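Some of these assumptions can be screened empirically before committing to a model. A rough sketch on simulated data follows: a dominant first eigenvalue of the inter-item correlation matrix is consistent with unidimensionality, and endorsement rates that rise across rest-score groups are consistent with monotonicity. A fuller local independence check, such as residual correlations from a fitted model (Yen's Q3), is omitted here because it requires model output.

```python
import numpy as np

# Simulated 0/1 response matrix: rows = patients, columns = items
rng = np.random.default_rng(1)
theta = rng.normal(0.0, 1.0, 500)
difficulty = np.linspace(-1.5, 1.5, 8)
prob = 1.0 / (1.0 + np.exp(-(theta[:, None] - difficulty)))
responses = (rng.random((500, 8)) < prob).astype(int)

# Unidimensionality screen: a first eigenvalue of the inter-item
# correlation matrix that dominates the second is consistent with a
# single underlying trait (a heuristic, not a formal test).
eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(responses.T)))[::-1]
print("eigenvalue ratio (1st/2nd):", round(eigvals[0] / eigvals[1], 2))

# Monotonicity screen: endorsement rates should rise across ascending
# rest-score groups for every item.
for j in range(responses.shape[1]):
    rest = responses.sum(axis=1) - responses[:, j]
    quartiles = np.array_split(np.argsort(rest), 4)
    rates = [responses[g, j].mean() for g in quartiles]
    print(f"item {j}: endorsement by rest-score quartile {np.round(rates, 2)}")
```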

Furthermore, a much larger sample size than for nonparametric analysis is needed in order to provide reliable estimates of thresholds. While the recommended sample size varies8,9 and has not been systematically studied in the high-reliability PRO realm, at least 300 patients per item is generally recommended.10 However, some authors11 indicate that sample sizes exceeding 100 are sufficient for Rasch modeling of PRO data, while others12 point to the number of items and of scores as being more consequential for estimation.

The Food and Drug Administration (FDA) encourages, but does not require, the use of IRT or Rasch analysis as part of the PRO instrument development and evaluation process for those PRO endpoints that will be used for product labeling.13 To the best of our knowledge, these psychometric analyses have generally resulted in more focused conversations and in the development of instruments more grounded in measurement theory. However, misusing IRT or Rasch analysis can lead to inappropriate inferences about both the items and the population of interest, especially if the local independence, unidimensionality, and monotonicity assumptions are violated.

The Rasch-only approach to instrument analysis does have an immediately observable disadvantage. Because the same discrimination parameter is assumed and estimated for all items, item dependencies might not be immediately apparent, especially if residuals and residual correlations are not examined carefully. In more parameter-heavy models, item dependencies can be assessed immediately through unusual behavior of an item's discrimination parameter. In fact, the assumption of the same value of the discrimination parameter for every item should itself be carefully considered. As it is all but impossible to ascertain the validity of this assumption from the content perspective, it is probably safer to check whether the discriminations estimated under the Generalized Partial Credit Model and the Graded Response Model are similar.
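As a minimal sketch of that check, suppose per-item slope (discrimination) estimates have already been exported from IRT software after fitting both models; the values and the flagging rule below are purely illustrative.

```python
import numpy as np

# Hypothetical slope (discrimination) estimates for the same six items,
# exported from IRT software after fitting the Graded Response Model
# (GRM) and the Generalized Partial Credit Model (GPCM); values are
# illustrative only, not real output.
a_grm = np.array([1.3, 1.1, 1.4, 0.6, 1.2, 1.3])
a_gpcm = np.array([1.2, 1.0, 1.5, 0.5, 1.1, 1.2])

# If slopes agree across the two models and cluster around one common
# value, an equal-slope (Rasch-family) model is easier to defend.
print("cross-model agreement:", round(np.corrcoef(a_grm, a_gpcm)[0, 1], 2))

pooled = (a_grm + a_gpcm) / 2.0
cv = pooled.std() / pooled.mean()  # relative spread of the slopes
print("coefficient of variation of slopes:", round(cv, 2))
# Flag items whose pooled slope sits far from the common value
# (the 2-standard-deviation rule here is an arbitrary illustration)
flagged = np.where(np.abs(pooled - pooled.mean()) > 2 * pooled.std())[0]
print("items with unusual slopes:", flagged)
```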

Despite all the above caveats regarding the Rasch model, it needs to be stated that if the fit of the Partial Credit Model is found to be comparable to that of any other psychometric model, it should be favored because of its simplicity and the relative ease of interpreting its output.

CONCLUSIONS

We are by no means claiming that this is a complete list of the advantages of IRT or of the warnings about misapplying the models. We are hoping, however, that this article will give the reader both insight and pause about this exciting direction that PRO research has been taking over the last 10 years.

It is true that careful application of psychometric techniques will greatly inform the instrument development process and provide incredible insight into patient responses as a function of their disease severity. The blind application of these techniques, however, could result in faulty inferences and thus in substantive misjudgments of the validity of the resulting instrument, and therefore potentially fatal conclusions regarding the trait measured and improvements in score.

We cannot stress enough that the presence of reliable estimates from psychometric methods is not a guarantee of either the content or the construct validity of the instrument and will not compensate for failures in item phrasing, population misspecifications, etc. The validity of instrument development still needs to hold, and methods to ensure this validity have been discussed in this forum before.14,15 A careful examination of the data and its assumptions will ensure success in applying any model and lead to reliable and valid conclusions, resulting in a more powerful instrument being developed.

For more information, please contact [email protected].

References

1. Reeve BB, Hays RD, Bjorner JB, et al. Psychometric Evaluation and Calibration of Health-related Quality of Life Item Banks: Plans for the Patient-Reported Outcomes Measurement Information System (PROMIS). Med Care. 2007 May;45(5 Suppl 1):S22-S31.
2. Cook KF, Keefe F, Jensen MP, et al. Development and Validation of a New Self-report Measure of Pain Behaviors. Pain. 2013 Dec;154(12):2867-2876.
3. Kolen MJ, Brennan RL. Test Equating, Scaling, and Linking: Methods and Practices (Statistics for Social and Public Policy). Springer; 2004.
4. Harrison DA. Robustness of IRT Parameter Estimation to Violations of the Unidimensionality Assumption. J Educ Behav Stat. 1986;11(2):91-115.
5. Ackerman TA. The Robustness of LOGIST and BILOG IRT Estimation Programs to Violations of Local Independence. Paper presented at the annual meeting of the American Educational Research Association; April 1987; Washington, DC.
6. Reckase MD. The Difficulty of Test Items that Measure More than One Ability. Appl Psychol Meas. 1985 Dec;9(4):401-412.
7. Wainer H, Lewis C. Toward a Psychometrics for Testlets. J Educ Meas. 1990 Mar;27(1):1-14.
8. Reise SP, Yu J. Parameter Recovery in the Graded Response Model Using MULTILOG. J Educ Meas. 1990 Jun;27(2):133-144.
9. Forero CG, Maydeu-Olivares A. Estimation of IRT Graded Response Models: Limited Versus Full Information Methods. Psychol Methods. 2009 Sep;14(3):275-299.
10. Kim S-H, Cohen AS. A Comparison of Linking and Concurrent Calibration Under the Graded Response Model. Appl Psychol Meas. 2002 Mar;26(1):25-41.
11. Chen WH, Lenderking W, Jin Y, Wyrwich KW, Gelhorn H, Revicki DA. Is Rasch Model Analysis Applicable in Small Sample Size Pilot Studies for Assessing Item Characteristics? An Example Using PROMIS Pain Behavior Item Bank Data. Qual Life Res. 2014 Mar;23(2):485-493.
12. Sébille V, Blanchin M, Guillemin F, et al. A Simple Ratio-based Approach for Power and Sample Size Determination for 2-group Comparison Using Rasch Models. BMC Med Res Methodol. 2014 Jul 5;14(1):87.
13. U.S. Department of Health and Human Services, Food and Drug Administration. Guidance for Industry on Patient-Reported Outcome Measures: Use in Medical Product Development to Support Labeling Claims. Federal Register. 2009 Dec;74(235):65132-65133.
14. Vernon M, Wyrwich K. Methods for Selecting and Measuring Endpoints that are Meaningful to Patients in Rare Disease Clinical Development Programs. The Evidence Forum. Bethesda, MD, USA: Evidera; 2014 Mar:10-12.
15. Lenderking W, Revicki D. Clinician-reported Outcomes (ClinROs), Concepts and Development. The Evidence Forum. Bethesda, MD, USA: Evidera; 2013 Oct:36-40.
