<<

Item Response Theory (IRT): Enhancing Health Outcomes Measurement

Bryce B. Reeve, Ph.D. e-mail: [email protected] Presentation Overview • IRT Models – Theory of IRT (Reeve) – IRT item, , and person properties (Reeve) – Comparison with Classical Theory (Reeve) – IRT Assumptions and Model Fit (Orlando Edelen) – IRT Scoring (Orlando Edelen) • Applying IRT to enhancing health outcomes measurement – Designing and evaluating scales (Siemons; Krishnan) – Assessing Differential Item Functioning (DIF) (Orlando Edelen) – Linking scales (Glas; Oude Voshaar) – Item Banking and Computerized Adaptive Testing (Bjorner; Nikolaus) Please Note:

• The quality of a health outcomes measure is related to the attention the developer(s) took to use qualitative and quantitative methods integrating multiple perspectives throughout the process.

• IRT Methods do not replace the classical/traditional test theory methods for item/scale analysis.

• IRT analysis is not a magic wand! – It cannot fix bad or poorly defined constructs – By itself, it does not address all forms of and other attributes that evaluate the quality of a . The Need for Better Outcome Measures

Needs Challenges Develop measures that are Have a minimum set of valid, reliable, and sensitive questions to reduce to detect clinically respondent burden. meaningful change Different forms of an Different forms to be instrument to measure linked on the same metric different health levels. for group comparisons Non-biased measurement Detect differences in across groups group perceptions What is (IRT)? 1. Theory for Scale Construction 2. Methodology for: – Evaluating the properties of items within a scale and the overall scale – Refining a scale – Scoring an individual – Linking multiple scales on to a common metric – Tailoring PRO measures to an individual or group.

• IRT is designed for: – Modeling latent “unobservable” variables (traits, domains, q) – Multi-item Scales/ IRT Model: Item Characteristic Curves I am unhappy some of the time?

1.00

False True 0.75

0.50

0.25 ProbabilityResponse of

0.00

-3 -2 -1 0 1 2 3 None Depression Severe q IRT Model I am unhappy some of the time?

1.00

True 0.75 b (item) parameter:

0.50 - Threshold - Location - Difficulty 0.25 - Severity ProbabilityResponse of b = .25 0.00

-3 -2 -1 0 1 2 3 None Depression Severe q IRT Models I am unhappy some of the time. I don’t care what happens to me.

1.00 b parameter: - Threshold True 0.75 - Location - Difficulty - Severity 0.50

0.25 b = .25 ProbabilityResponse of b = 1.33 0.00

-3 -2 -1 0 1 2 3 None Depression Severe q IRT Model I am unhappy some of the time?

1.00 True

0.75 a = 2.83 a (item) parameter: - slope 0.50 - discrimination - relationship 0.25 with trait

ProbabilityResponse of b = .25 0.00

-3 -2 -1 0 1 2 3 None Depression Severe q IRT Models I am unhappy some of the time. I don’t care what happens to me.

1.00 I cry easily. a = 2.20 True 0.75 a = 2.83 a = 1.11

0.50

b = -0.23 0.25 b = 0.25 ProbabilityResponse of b = 1.33 0.00

-3 -2 -1 0 1 2 3 None Depression Severe q IRT Model: Item Characteristic Curves I am unhappy some of the time?

1.00

False True 0.75

1 P X  1q   i 1.7 a q b  0.50 1  e i i ai(q – bi)

0.25 ProbabilityResponse of

0.00

-3 -2 -1 0 1 2 3 None Depression Severe q IRT: Item Information Curves (The range of the latent construct over which an item is most useful for distinguishing among respondents) 2.0 I am unhappy some of the time a = 2.83 I don’t care what happens to me I cry easily 1.5 a = 2.20

1.0 Information 0.5 a = 1.11

0.0 -3 -2 -1 0 1 2 3 None Depression Severe b = -0.23 b = 0.25 b = 1.33 Building reliable and efficient measures…

2.0 I am unhappy some of the time.

I don’t seem to care what happens to me.

1.0 Information

I cry easily.

0.0 -3.00 -2.00 -1.00 0.00 1.00 2.00 3.00 Mild Severe Depression 10 Items from the MMPI-2 Depression Scale Scale (Test) Information Curve (The range of the latent construct over which a scale is most useful for distinguishing among respondents)

Reliabilit y  1 - Error Variance 25 m r = .96 TI (q )   I (q )  1 - () 2 i 1 20 1 r = .95  1  Informatio n 15 r = .93

10

Information r = .90

5 r = .80

0

-3 -2 -1 0 1 2 3 None Depression Severe Questions on the MMPI-2 depression scales were chosen because they maximally discriminate a clinically depressed group from a non-clinical group

25 r = .96

20 r = .95

15 r = .93

10

Information r = .90

5 r = .80

0

-3 -2 -1 0 1 2 3 None Depression Severe Standard Error of Measurement Curve (The range of the latent construct over which a scale is most useful for measuring respondent trait levels)

2.5 1 SEM  I 2.0

1.5

1.0

0.5 Standard of MeasurementError

0.0

-3 -2 -1 0 1 2 3 None Depression Severe What is the reduction in information going from a 22 to 12 item scale?

25

20 r = .95

15 r = .93

10

r = .90 Information

5 r = .80

0 -3.00 -2.00 -1.00 0.00 1.00 2.00 3.00

Low Fear of Disease Recurrence High q *r = approximate What about IRT models for questions with more than two response categories?

Data from responses to the PROMIS Depression Item Bank. Item Response Theory (IRT): Category Response Curves

In the past 7 days, 1.0 Never Some Always I felt unhappy. 0.8 times Often Rarely

0.6

P P

0.4

0.2

0.0 -3.00 -2.00 -1.00 0.00 1.00 2.00 3.00 very mild Depressive Symptoms severe In the past 7 days, 1.0 Never I felt I had no 0.8 Some reason for living. times Always 0.6 Rarely Often P P

0.4

0.2

0.0 -3.00 -2.00 -1.00 0.00 1.00 2.00 3.00 very mild Depressive Symptoms severe Item Response Theory (IRT) In the past 7 days, I felt unhappy.

1.0 Never Some Always 0.8 Rarely times Often Category 0.6

Response P Curves 0.4

0.2

0.0 -3.00 -2.00 -1.00 0.00 1.00 2.00 3.00 very mild Depressive Symptoms severe

4.0

3.0 Information 2.0

Function Information

1.0

0.0 -3.00 -2.00 -1.00 0.00 1.00 2.00 3.00 very mild Depressive Symptoms severe Item Response Theory (IRT) In the past 7 days, I felt I had no reason for living. 1.0 Never 0.8 Some Category times Always 0.6 Rarely Often Response P Curves 0.4

0.2

0.0 -3.00 -2.00 -1.00 0.00 1.00 2.00 3.00 very mild Depressive Symptoms severe

4.0

3.0 Information 2.0

Function Information

1.0

0.0 -3.00 -2.00 -1.00 0.00 1.00 2.00 3.00 very mild Depressive Symptoms severe Item Response Theory (IRT): Item Information Functions

I felt unhappy. 1. Never I felt depressed. 2. Rarely I withdrew from other people. 3. Sometimes I felt worthless. 4. Often 6.0 I felt I had no reason for living. 5. Always

5.0

4.0

3.0

2.0 Information

1.0

0.0 -3.00 -2.00 -1.00 0.00 1.00 2.00 3.00 very mild Depressive Symptoms severe IRT Family of Models

IRT models come in many varieties (over a 100) to handle: • Unidimensional and multidimensional data • Binary, polytomous, and continuous response data • Ordered as well as unordered response data

IRT Models You May See in Outcomes Research Item Response Model Format Model Characteristics Rasch / 1- Dichotomous Discrimination power equal across all Parameter Logistic items. Threshold varies across items. 2-Parameter Dichotomous Discrimination and threshold Logistic parameters vary across items. Graded Response Polytomous Ordered responses. Discrimination varies across items. Nominal Polytomous No pre-specified item order. Discrimination varies across items. Partial Credit Polytomous Discrimination power constrained to () be equal across items. Polytomous Discrimination equal across items. (Rasch Model) Item threshold steps equal across items. Generalized Partial Polytomous Variation of Partial Credit Model with Credit discrimination varying among items. Applications of IRT models for Health Outcomes Measurement 1. Design and Evaluation 1. Design and Evaluation 1. Design and Evaluation 2. Testing for Differential Item Functioning (DIF)

In the past 7 days, I cried In the past 7 days, I felt blue None of A little of Some of Most of All of the None of A little of Some of Most of All of the the time the time the time the time time the time the time the time the time time

Depression 3. Linking Health Outcome Measures

PROMIS CES Depression Depression Measure Scale 3. Linking Health Outcome Measures

PROMIS Becks CES Depression Depression Depression Measure Inventory Scale

20 30 40 50 60 70 80 Depression 4. Item Banking and Computerized Adaptive Testing (CAT)

In the past 7 days, I felt unhappy:

Some Never Rarely Often Always times

no mild moderate severe extreme depression depression depression depression depression

      

Depression Item Bank

Item Item Item Item Item Item Item Item Item Item 1 2 3 4 5 6 7 8 9 n Traditional Measurement Theory (Classical Test Theory, CTT) versus Modern Measurement Theory Classical Test Theory Item Response Theory Measures of precision fixed for Precision measures vary across all scores scores Longer scales increase Shorter, targeted scales can be reliability equally reliable Scale properties are sample Item & scale properties are dependent invariant within a linear transformation Comparing person scores Person scores comparable across dependent on item set different item sets Comparing respondents Different scales can be placed on requires parallel scales a common metric Mixed item formats leads to Easily handles mixed item formats unbalanced impact on total scale scores Summed scores are on an Scores on interval scale ordinal scale Graphical tools for item and scale analysis Questions on the MMPI-2 depression scales were chosen because they maximally discriminate a clinically depressed group from a non-clinical group

25 r = .96

20 r = .95

15 r = .93

10

Information r = .90

5 r = .80

0

-3 -2 -1 0 1 2 3 None to mild Depression Severe Conclusions

• IRT serves as a powerful analytic tool to help design health outcomes measures.

• Limitations – Lack of user-friendliness of software – Required knowledge of measurement theory. – Needs large sample sizes Important Deadlines

April 12: Oral and Poster Presentation Abstract Submissions Due Energizing the Science of May 31: Quality of Life Research: Scholarship Applications and Award Nominations Due Where have we been and where can we go? July 1: Presenters Confirm Participation (Oral and Poster Presentations) August 12: Early Registration Deadline September 16: Advanced Registration Deadline September 16: ISOQOL Hotel Room Block Closes

isoqol.org/2013conference Sample Size Issues Sample Size Issues • The IRT model to be estimated – Parameters , Sample Size  - Rasch models need less data. • The number of items or questions. – Number of items , Sample Size  • The number of response options. – Number of response categories , Sample Size  • Unidimensionality of construct – Better the data meet assumption of unidimensionality, sample size  • The item properties – Items at the extremes need more data • Population distribution – Distributed across theta continuum, Sample Size  • Purpose of Study – Evaluation of an instrument, smaller sample sizes needed – Estimate accurate respondent scores, larger sample sizes needed. – Calibrating items for an item bank, larger sample sizes Rasch / 1-Parameter Logistic IRT Model

1 1 1 P X  1q   P X  1q   P X  1q   i  q b  i  a q b  i 1.7 a q b  1  e i 1  e i 1  e i

1.00 q - bi

0.75 I am unhappy some of the time I don’t care what happens to me I cry easily

0.50

b = -0.23 0.25 b = 0.25 ProbabilityResponse of b = 1.33 0.00

-3 -2 -1 0 1 2 3 None Depression Severe q 2-Parameter Logistic IRT Model

1 P X  1q   i 1.7 a q b  1  e i i ai(q – bi)

1.00 a = 2.20

0.75 a = 2.83 a = 1.11

0.50

b = -0.23 0.25 b = 0.25 ProbabilityResponse of b = 1.33 0.00

-3 -2 -1 0 1 2 3 None Depression Severe q