Assessment and Evaluation of a Crosswalk between the Functional Independence Measure

and the Continuity Assessment Record and Evaluation

by

DAVID MELLICK

BS, University of Northern Colorado

MA, University of Northern Arizona

A dissertation submitted to the

Faculty of the Graduate School of the

University of Colorado in partial fulfillment

of the requirements for the degree of

Doctor of Philosophy

Clinical Sciences Program

2020

This dissertation for the Doctor of Philosophy degree by

David Mellick

has been approved for the

Clinical Sciences Program

by

Heather Haugen, PhD Chair

Cynthia Harrison-Felix, PhD Research Mentor

Gale Whiteneck, PhD

Jessica M. Ketchum, PhD

David Weitzenkamp, PhD

Date: Tuesday, August 4, 2020

Assessment and Evaluation of a Crosswalk between the Functional Independence Measure and

the Continuity Assessment Record and Evaluation

Dissertation directed by Associate Professor Heather Haugen.

Abstract

Background: Assessment of functional outcome is a major initiative for facilities engaging in post-acute care. Unfortunately, over time the various types of post-acute care facilities developed measures to assess function independently of each other, making comparisons of patients across facilities difficult. To rectify this difference, in 2005 the Centers for Medicare and Medicaid Services developed a comprehensive measure that all facilities were to use to assess function, called the Continuity Assessment Record and Evaluation (CARE). However, because of the change in instruments, comparing scores on the previous measures with scores on the CARE requires a crosswalk. This study focused on the creation and evaluation of a crosswalk between the motor subscale of the Functional Independence Measure (FIM), which had been used to assess function in inpatient rehabilitation facilities, and the newer CARE tool.

Methods: An existing dataset of 982 persons who had sustained a moderate to severe traumatic brain injury and were assessed using both FIM and CARE at inpatient rehabilitation admission and discharge was used to create three crosswalks, each with a different methodology (expert opinion, equipercentile, and Rasch). The dataset was split into a training dataset and a validation dataset. Each crosswalk was evaluated using several criteria, including reduction in uncertainty, percent of crosswalked scores falling within ½ standard deviation (SD) of the reference measure, population invariance, comparison of statistical moments, and effect size.

Results: Using the training dataset, the expert opinion crosswalk met all of the criteria except for the direction of population invariance within the race category. The equipercentile methodology satisfied all of the evaluation criteria, and the Rasch model met all of the criteria except for a difference in directionality in the skewness of the distributions and a failure to meet the targeted 80% of scores falling within ½ SD of the reference assessment. These results differed from the validation sample for the population invariance criteria, in which the age categories were in opposite directions and the observed differences in standardized mean difference for age exceeded the threshold of 0.08 for both the equipercentile and Rasch crosswalks.

Conclusions and Significance: All three crosswalk methods performed acceptably against the evaluation criteria. Therefore, motor/physical functional outcome can be compared between cohorts that were assessed using different measures, and researchers wanting to make such comparisons could use any of the three crosswalks.

Table of Contents

CHAPTER I. INTRODUCTION
  Overview
  Pathways of Care
  How is Function Measured in Post-Acute Care Settings?
  Continuity Assessment Record and Evaluation
  Linking Scores
  Specific Aims
  Significance
  Summary

CHAPTER II. REVIEW OF THE LITERATURE
  Introduction
  Literature Search Methods
  Review of the Literature
  FIM
  CARE
  Measurement Linking
  Criteria for Linking
  Linking Functional Outcomes
  Summary of Gaps in the Literature

CHAPTER III. METHODS
  Research Design and Data Collection
  The Traumatic Brain Injury Model Systems National Database
  Study Participants
  Variables Collected
  Analysis Plan
  Missing Data
  Analysis Plan for Research Aim 1
  Analysis Plan for Research Aim 2
  Validation Sample

CHAPTER IV. RESULTS OF ANALYSIS
  Sample Demographics
  Sample Functional Outcome Measure Scores
  Missing Data
  Expert Opinion
  Equipercentile
  Rasch

CHAPTER V. CONCLUSION
  Study Limitations
  Summary

CHAPTER VI. BIBLIOGRAPHY

List of Tables

Table I-1: Differences in Post-Acute Care Assessments
Table I-2: Differences in Functional Measure Items
Table II-1: Justification for CARE tool core items
Table III-1: Inclusion criteria for TBIMS NDB
Table III-2: Study Dataset Variables
Table III-3: Scoring Codes for the FIM-M and CARE
Table IV-1: Categorization of Severity of TBI
Table IV-2: Demographics of the samples
Table IV-3: Functional Outcome Measure Scores
Table IV-4: Expert Opinion FIM-M and CARE Item Conversion Scores
Table IV-5: Correlations and RiU between the CAREeo and FIMeo
Table IV-6: Percent of the scores within 1/2 SD of other assessment using expert opinion method
Table IV-7: Demographic Population Invariance for FIMeo and CAREeo
Table IV-8: Difference in SMD for Demographic variables for FIMeo and CAREeo
Table IV-9: Statistical moments 1 and 2 for CAREeo and FIMeo
Table IV-10: Statistical moments 3 and 4 for CAREeo and FIMeo
Table IV-11: Effect sizes of CAREeo and FIMeo crosswalks by expert opinion method
Table IV-12: Concordance table CARE to FIM-M and FIM-M to CARE
Table IV-13: Correlations and RiU between the assessments using equipercentile method
Table IV-14: Percent of the scores within 1/2 SD of original score using equipercentile method
Table IV-15: Demographic population invariance for CARE and CAREfromFIM using equipercentile method
Table IV-16: Difference in SMD for demographic variables for CARE using equipercentile method
Table IV-17: Demographic population invariance for FIM-M and FIMfromCARE using equipercentile method
Table IV-18: Difference in SMD for demographic variables for FIM-M using equipercentile method
Table IV-19: Statistical moments 1 and 2 for CARE and FIM-M using the equipercentile method
Table IV-20: Statistical moments 3 and 4 for CARE and FIM-M using the equipercentile method
Table IV-21: Effect sizes of CARE and FIM-M crosswalks by equipercentile method
Table IV-22: FIM concordance table using Rasch
Table IV-23: CARE concordance table using Rasch
Table IV-24: Correlations and RiU between the assessments using Rasch
Table IV-25: Percent of the scores within ½ SD of original score using Rasch method
Table IV-26: Demographic population invariance for CARE and CAREfromFIM using Rasch method
Table IV-27: Difference in SMD for demographic variables for CARE using Rasch method
Table IV-28: Demographic population invariance for FIM and FIMfromCARE using Rasch method
Table IV-29: Difference in SMD for demographic variables for FIM-M using Rasch method
Table IV-30: Statistical moments 1 and 2 for CARE and FIM-M using the Rasch method
Table IV-31: Statistical moments 3 and 4 for CARE and FIM using the Rasch method
Table IV-32: Effect sizes of CARE and FIM-M crosswalks by Rasch method
Table V-1: Results of the evaluation criteria for the training dataset
Table V-2: Results of the evaluation criteria for the validation dataset

List of Figures

Figure IV-1: Scatterplot of FIM-M and CARE total scores by time of administration
Figure IV-2: Scatterplot of CAREeo and FIMeo
Figure IV-3: Graphical representation of CARE to FIM-M concordance table
Figure IV-4: Graphical representation of FIM-M to CARE concordance table
Figure IV-5: Scatterplot of CARE using the equipercentile method
Figure IV-6: Scatterplot of FIM-M using the equipercentile method
Figure IV-7: FIM-M person/item map
Figure IV-8: FIM-M item map
Figure IV-9: CARE person/item map
Figure IV-10: CARE item map
Figure IV-11: Scatterplot of CARE using Rasch
Figure IV-12: Scatterplot of FIM-M using Rasch

List of Abbreviations

ACT: American college test
ADL: Activity of daily living
AM-PAC: Activity measure for post-acute care
CI: Confidence interval
CoA: Coefficient of alienation
CMS: Centers for Medicare and Medicaid services
CRAN: Comprehensive R archive network
CTT: Classical test theory
DRA: Deficit reduction act
FIM-M: FIM motor subscale
GCS: Glasgow coma scale
HAQ-DI: Health assessment questionnaire disability index
IADL: Instrumental activities of daily living
ICC: Intraclass correlation coefficient
IRF: Inpatient rehabilitation facility
IRF-PAI: Inpatient rehabilitation facility – Patient assessment instrument
IRT: Item response theory
K-MBI: Korean modified Barthel index
LEAS: Lower extremity activity scale
LSU HIS: Louisiana state university health status instruments
LTACH: Long term acute care hospital
MDS: Minimum data set
NDB: National database
Neuro-QOL: Quality of life outcomes in neurological disorders
NIDILRR: National institute on disability, independent living, and rehabilitation research
OASIS: Outcome and assessment information set
PAC: Post acute care
PAC-PRD: Post-acute care payment reform demonstration
PROMIS: Patient reported outcomes measurement information systems
PPS: Prospective payment system
PTA: Post traumatic amnesia
RiU: Reduction in uncertainty
SAT: Scholastic aptitude test
SCIM: Spinal cord independence measure
SD: Standard deviation
SE: Standard error
SEED: Standard error for equating differences
SF-36: Medical outcomes study short form 36 item version
SMD: Standardized mean difference
SNF: Skilled nursing facility
TBI: Traumatic brain injury
TBIMS: Traumatic brain injury model systems
TFC: Time to follow commands
UDSMR: Uniform Data System for Medical Rehabilitation
UCLA: University of California Los Angeles
US: United States
VA: Department of Veterans Affairs

CHAPTER I.

INTRODUCTION

Overview

Pathways of Care

Receiving health care in the United States (US) after a catastrophic injury is anything but easy to comprehend. Which providers perform what care, and for whom, is a constant dance that reflects decisions based on the type and severity of injury, the type of insurance an individual has, and the availability of treatment. A newly injured person has various care pathways they may take, or more likely have decided for them, based on financial and/or insurance constraints. Take, for example, a fictional person who was involved in a car crash and sustained a traumatic brain injury (TBI). Immediately after the crash she was transported to a local trauma center’s emergency department, where she was quickly assessed and admitted to the hospital. During the acute care stay she was stabilized, and many of the critical issues surrounding her medical condition were minimized.

Upon discharge from the acute care facility, there are several possible destinations to which she could be discharged to receive post-acute care (PAC) in the continuum of care and recovery after catastrophic injury. As described below, these PAC facilities offer various levels of support. One problem with admission to a PAC facility is that there are no guidelines to determine in which category of PAC a person may thrive; rather, placement is often based on care need, ability to pay, and the availability of family/home support.

Each PAC facility type has developed distinctive outcome measures that assess a patient’s performance in basic activities such as eating, bathing, grooming, and walking, to help determine the best post-acute placement (e.g., return to home or some alternative, if necessary), the treatment plan, and ultimately government insurance reimbursement rates. Today there are several levels of PAC facilities:

• Inpatient Rehabilitation Facilities (IRFs) are licensed facilities that are required to

provide to each patient a minimum of three hours of combined physical,

occupational, or speech therapy per day. IRFs are also characterized by an increased

physician presence and care is often provided by registered nurses.

• Long Term Acute Care Hospitals (LTACHs) are facilities that provide treatment for

patients with more serious medical conditions that require medical care on an

ongoing basis but no longer require intensive acute medical care or extensive

diagnostic procedures.

• Skilled Nursing Facilities (SNFs) are nursing homes or hospital-based care units that

obtain Centers for Medicare & Medicaid Services (CMS) certification to provide

skilled nursing care and rehabilitation services on an inpatient basis.

• Home Health Agencies (HHAs) are organizations which provide care to persons who

are typically unable to leave their residence without considerable effort and require

at least some part-time skilled nursing and/or therapy services.

Broadly, all of these types of PAC facilities are designed to minimize the disabling impact of injuries and health conditions, and to help individuals maximize physical and psychological recovery while improving quality of life and rebuilding connectedness to their community. In 2013, nearly 22 percent of discharges from US hospitals, or nearly 8 million, used some type of post-acute services.1 Half of these discharges were to a HHA, 41% were to a SNF, 7% were to an IRF, and 2% were to a LTACH. Most importantly, each type of PAC has its own method of measuring a person’s function upon admission and discharge to verify change (i.e., improvement/gain) in function and to justify reimbursement that is specific to the type of PAC facility. Given the importance of function, we first must know how function can be measured.

How is Function Measured in Post-Acute Care Settings?

The term ‘Functional Assessment’ was first coined by Lawton in 1971 who defined it as any “systematic attempt to measure objectively the level at which a person is functioning in a variety of areas”.2 In practice, “functional assessment” aims to measure the performance of a person on activities of daily living (ADL). Dittmar and Gresham have defined five applications for functional assessment: 1) evaluating individual outcomes; 2) planning for treatment interventions; 3) determining effectiveness of treatment; 4) maintaining continuity of care; 5) improving resources or staffing needs.3 Further uses have been identified such as the basis for payment systems and classification into disability related groups (DRGs) or payment/reimbursement categorizations.4 In 2002, the World

Health Organization further clarified the term “functional status” as an umbrella term that includes all body structures, activities, and participation in daily life.5 However, early measures used in rehabilitation, such as the Katz and Barthel indexes,6 focused specifically on “activities”, such as activities of daily living, and included eating, bathing, grooming, dressing, as well as mobility items such as transfer ability, walking, climbing stairs, and use of a wheelchair. Furthermore, unlike measures such as height and weight, functional activity does not have a standard unit or scale of measurement; instead, it can only be measured using a rating or response scale based on a person’s observed performance of a given activity.

Measurement of a patient’s outcomes in post-acute rehabilitation remains an important tool for disability management. Standard measurement procedures and instruments are often used to inform stakeholders of the outcomes achieved by patients. Despite more than three decades of PAC measurement, there has been little agreement on the timing of assessments, nor have there been consistent approaches and assessments for comparing a patient’s performance across PAC settings. Over time, with the objective of improving the measurement process, researchers have developed separate instruments to evaluate the functional status of persons in a PAC setting.

Three PAC outcome assessments have solidified measurement over the past 30 years: the Inpatient Rehabilitation Facility – Patient Assessment Instrument (IRF-PAI) for IRFs,7 the Minimum Data Set (MDS) for SNFs,8 and the Outcome and Assessment Information Set (OASIS) for HHAs.9 LTACHs neither created nor were mandated to use any specific outcome assessment.

To assess functional status within a SNF, the MDS was developed under a

congressional mandate calling for a uniform assessment for residents of nursing homes to

develop care plans.10 However, in time it became an instrument to measure quality of care

and ultimately became a source for determining reimbursement for healthcare services.11,12

HHAs assess individuals using the OASIS. The OASIS was developed using funds from the Robert Wood Johnson Foundation as well as the Health Care Financing Administration with the goal to improve home health services care.11,13 In 2000, OASIS data provided the basis for the establishment of Medicare reimbursement for home health.

To measure functional status for patients in IRFs, the FIM™, formerly the Functional

Independence Measure, was developed in 1987 by a task force of the American Congress of

Rehabilitation Medicine and the American Academy of Physical Medicine and Rehabilitation as a measure of burden of care.14 Starting in 1990, the FIM became the basis for the CMS

mandated tool IRF-PAI. Licensed IRFs were required to measure functional status using the

FIM on all Medicare and Medicaid patients and submit these data to CMS. In 2002, data

reported using the IRF-PAI, including facility, diagnostic, and demographic data along with

the FIM were used to determine the Prospective Payment System (PPS) Part A reimbursement for Medicare fee-for-service patients.15

Each of these assessments (FIM, OASIS, and MDS) is site-specific within the PAC system, lacks common terminology, definitions, and domains, and was developed using a different methodology.16 Furthermore, each provider type utilized a different PPS for Medicare reimbursement. Complicating the matter, many PAC facilities could treat patients with the same condition yet record different functional statuses because of the differences in assessments, leading to biases in which person was treated at which PAC setting, depending on the PAC options in the region and other factors not measurable using Medicare claims data.17 Further, each assessment collects data on a different schedule (Table I-1). Even when a tool collects data referred to as “at admission”, only the OASIS actually uses the calendar day of admission as the time frame, whereas the IRF-PAI has a three-day window from admission and the MDS window spans eight days. Furthermore, the MDS has interim collection at day 14 as well as every 30 days after admission until day 100, and the OASIS requires assessments every 60 days. Both the IRF-PAI and the OASIS have a discharge assessment, but the MDS does not require one.

Table I-1: Differences in Post-Acute Care Assessments

| Dimension | Inpatient Rehabilitation Facilities | Skilled Nursing Facilities | Home Health Agencies |
|---|---|---|---|
| Tool | IRF-PAI | MDS | OASIS |
| Frequency of Measurement | Admission and discharge | Initial (day 1-8); day 14; day 30; every 30 days until day 100 | Calendar day of admission and every 60 days thereafter; and at discharge |
| Time Period of Measurement | Lowest level within first/last 3 days | 7 day look-back | Status on day of assessment |
| Method of Measurement | Direct observation or with reported performance | Information gathered from multiple caregivers’ descriptions and documentation. Direct observation not required. | Direct observation preferred, but also often used interviews with patient in-home caregiver |

While all three instruments attempt to classify functional status, each does so with different items (Table I-2). Major differences for the MDS are that the ability to transfer is included in both the bathing and toileting items, and that the MDS does not differentiate upper body dressing from lower body dressing. The IRF-PAI accounts for not only the level of assistance needed for bowel and bladder control, but also the use of equipment and frequency of accidents, whereas the MDS and OASIS focus primarily on incontinence. The IRF-PAI is the only instrument that specifically captures walking up stairs as an indicator for locomotion.

Finally, the OASIS combines all transfer abilities regardless of activity (tub/shower, toilet, chair), whereas the IRF-PAI separates out those activities and, as mentioned before, the MDS combines toileting and bathing functions with its transfer items.

Table I-2: Differences in Functional Measure Items

| Domain | IRF-PAI | MDS | OASIS |
|---|---|---|---|
| Self-Care | Eating includes the ability to use suitable utensils to bring food to the mouth, as well as the ability to chew and swallow the food once the meal is presented in the customary manner on a table or tray. | Eating – how resident eats and drinks. Includes intake of nourishment by other means. | Feeding or eating – ability to feed self meals and snacks. |
| | Grooming includes oral care, hair grooming (combing or brushing hair), washing the hands, washing the face, and either shaving the face or applying make-up. If the subject neither shaves nor applies make-up, grooming includes only the first four tasks. | Personal hygiene – how resident maintains personal hygiene, including combing hair, brushing teeth, shaving, applying makeup, washing/drying face, hands and perineum. | Grooming – ability to tend to personal hygiene needs (i.e. washing face and hands, hair care, shaving or make up, teeth or denture care, fingernail care). |
| | Bathing includes washing, rinsing, and drying the body from the neck down (excluding the back) in either a tub, shower, or sponge/bed bath. | Bathing – how resident takes full-body bath/shower, sponge bath, and transfers in/out of tub/shower. | Bathing – ability to wash entire body; excludes grooming (washing face and hands only). |
| | Dressing – Upper Body includes dressing and undressing above the waist, as well as applying and removing a prosthesis or orthosis when applicable. The patient performs this activity safely. | Dressing – how resident puts on, fastens, and takes off all items of clothing, including donning/removing prosthesis. | Ability to dress upper body |
| | Dressing – Lower Body includes dressing and undressing from the waist down, as well as applying and removing a prosthesis or orthosis when applicable. The patient performs this activity safely. | | Ability to dress lower body – including undergarments, slacks, socks or nylons, shoes. |
| | Toileting includes maintaining perineal hygiene and adjusting clothing before and after using a toilet, commode, [...], or [...]. The patient performs this activity safely. | Toilet use – how resident uses the toilet room; transfer on/off toilet, cleanses, changes pad, manages ostomy or catheter, adjusts clothes. | Toileting – ability to get to and from the toilet or bedside [...]. |
| Sphincter control | Bladder level of assistance, include use of equipment and frequency of accidents. | Bladder continence – control of urinary bladder function, with appliances, or continence programs, if employed. | Urinary incontinence or urinary catheter presence. |
| | Bowel level of assistance, include use of equipment and frequency of accidents. | Bowel continence – control of bowel movement, with appliance or bowel continence programs, if employed. | Bowel incontinence frequency. |
| Transfers | Transfers: Bed, Chair, Wheelchair includes all aspects of transferring from a bed to a chair and back, or from a bed to a wheelchair and back, or coming to a standing position if walking is the typical mode of locomotion. | Modes of transfer – bedfast all or most of time, or bed rails used for bed mobility or transfer. | Transferring: ability to move from bed to chair, on and off toilet or commode, into and out of tub or shower, and ability to turn and position self in bed if patient is bedfast. |
| | Transfers: Toilet includes safely getting on and off a standard toilet. | Toilet use – how resident uses the toilet room; transfer on/off toilet, cleanses, changes pad, manages ostomy or catheter, adjusts clothes. | |
| | Transfers: Tub/Shower includes getting into and out of a tub/shower. The patient performs the activity safely. | Bathing – how resident takes full-body bath/shower, sponge bath, and transfers in/out of tub/shower. | |
| Locomotion | Walking includes walking on a level surface once in a standing position. Wheelchair includes using a wheelchair on a level surface once in a seated position. | Locomotion on unit – how resident moves between locations in his/her room and adjacent corridor on same floor. If in wheelchair, self-sufficiency once in chair. | Ambulation/Locomotion – ability to safely walk, once in a standing position, or use a wheelchair, once in a seated position, on a variety of surfaces. |
| | Stairs includes going up and down 12 to 14 stairs (one flight) indoors in a safe manner. | Locomotion off unit – how resident moves to and returns from off unit locations. If facility has only one floor, how resident moves to and from distant areas on the floor. If in wheelchair, self-sufficiency once in chair. | |

Because of differences in measurement content and timing, several studies have tried to equate functional status across these three PAC settings. In 1997, Williams et al. convened an expert panel to see if a score on the MDS could be transformed to match a score on the FIM, the functional measurement tool of the IRF-PAI.18 The results were mixed: while some items showed a high level of agreement, there were floor and ceiling effects in which most patients scored too high or too low on a specific item, providing too little variance to translate the scores. Later, in 2000, CMS funded research to develop a new multipurpose functional assessment instrument, the Minimum Data Set for Post-Acute Care (MDS-PAC),19 which aimed to respond to the Balanced Budget Act of 1997 by providing a uniform assessment tool for SNFs. It was hoped that the resulting instrument could be used to help create a scoring link between the FIM and MDS-PAC, as well as to evaluate whether the estimated FIM score would yield payment equity with the new measure. Unfortunately, adding a fourth measure of functional status in an attempt to equate the FIM and the MDS was not successful.

Nevertheless, both of these research projects further solidified the need and desire for a true standardized instrument that could be used in all PAC settings.

Continuity Assessment Record and Evaluation

In light of the disappointing results of linking the FIM to the MDS, in 2005, and as a result of the Deficit Reduction Act (DRA), CMS developed a Medicare Payment Reform

Demonstration that assessed the consistency of payment incentives for Medicare populations treated at various PAC settings, including LTACHs, IRFs, SNFs, and HHAs. This demonstration project allowed CMS to understand differences in patient treatment, outcomes, and cost in the various settings.20 As a result of this project, researchers created a standardized patient assessment tool to be used in all PAC settings at admission and discharge. The tool was named the Continuity Assessment Record and Evaluation (CARE)

Item Set. Like the other outcome assessments used in post-acute facilities and as mandated

by Congress, the CARE measures the medical, cognitive, and functional status of Medicare

beneficiaries as well as change in functional status and other outcomes such as presence of

urinary tract infections, Clostridium difficile infection, and pressure ulcers. In fact, the CARE

was designed to replace the existing federal assessment tools for each of the PAC settings

including the MDS used in SNFs, the OASIS in HHAs, and the IRF-PAI which includes the FIM

for IRFs,20,21 and be used in LTACHs which previously did not have a standard of outcome

measurement.

While the overall goal of the new CARE is to consolidate functional assessment across all PAC settings, the transition introduced a number of challenges at the practice level. First, given the relatively recent adoption of the CARE, the tool has not been fully validated independently of the CMS research that created it. Second, the CARE may offer more theoretical value than practical effectiveness. Third, the CARE may be more complicated, inconsistent, and burdensome to administer. Fourth, switching to the CARE implies that some people will be assessed with the FIM, MDS, or OASIS on admission and then with the CARE at discharge. So, for these cases, how does a clinician assess change or success?

This last issue is important not only for clinicians, but also for researchers who have

relied heavily on outcome assessments such as the FIM, MDS and OASIS. Moreover, studies

that are longitudinal in nature need to address the inconsistency of functional outcome

measurement across time. How does a researcher compare outcomes, either within a single

patient, or cohorts of patients, when the measurement tool changes? One way to address

this “seam” in the data would be to provide a statistical link between the scores on one measure and the scores on another measure.

Linking Scores

At its root, test linking is a process by which there is a transformation of a score from

one test to a score on another test. Holland and Dorans22 provided an historical framework

that classified score linking into three categories: predicting, scale aligning, and equating.

Each type of linking method determines how the linked scores are used and interpreted,22-25

and not all provide methodology that results in valid equating.

Predicting

For prediction linking, the goal is to predict scores for one test using scores from another test. The model may be multivariable in nature utilizing information from other instruments or subject-specific characteristics (e.g., age, sex, injury severity) to improve

prediction. By the very nature of regression modeling, this method does not qualify as a valid equating methodology because it violates assumptions of equating outlined below;23 specifically, regression modeling will not produce scores that can be used bi-directionally.
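The lack of bi-directionality can be seen in a small numerical sketch. The Python below is illustrative only: the paired scores are invented, the names `ols_line`, `predict`, `score_a`, and `score_b` are hypothetical, and it is not part of the analyses in this study. It fits one least-squares line predicting measure B from measure A and another predicting A from B, then converts a score from A to B and back.

```python
from statistics import mean


def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line predicting y from x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def predict(line, value):
    """Apply a fitted (intercept, slope) line to a single score."""
    intercept, slope = line
    return intercept + slope * value


# Hypothetical paired scores on two measures, invented for illustration.
score_a = [20, 35, 40, 55, 60, 75, 80]
score_b = [30, 38, 52, 50, 70, 72, 88]

a_to_b = ols_line(score_a, score_b)  # regression predicting B from A
b_to_a = ols_line(score_b, score_a)  # regression predicting A from B

# Convert a score from A to B, then back to A with the reverse regression.
start = 30
round_trip = predict(b_to_a, predict(a_to_b, start))
print(start, round(round_trip, 2))
```

Because the product of the two regression slopes equals the squared correlation, which is below 1 whenever the measures are not perfectly correlated, the round trip pulls the score toward the mean of measure A instead of returning the starting value.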

Scale alignment

The goal of scale alignment is to create a common scale that each measure would transform onto.24 Scale alignment had its root in educational testing development (e.g. SAT and ACT). Scale aligning has many subcategories, including activities such as battery scaling,25 anchor scaling,24 vertical scaling,25,26,27,28 calibration,24 and concordance29. Each of these subcategories are defined by whether or not the linking function is applied to instruments with similar constructs, similar difficulty, similar reliability, and used for the same population of examinees. Battery scaling is used if the linking occurs for different constructs but a similar population. For example, this occurs when researchers are linking verbal and math sections of an educational test. If researchers are linking the same construct such as “verbal” with test that differ in difficulty it is considered vertical scaling.

Calibration refers to linking tests that have different reliability. This occurs when researchers are creating a “short form” of a test that uses a subset of the items in the complete test. Finally, concordance refers to linking tests that represent the same construct, have similar difficulty and reliability, and are assessed on the same population. The vast majority of linkages fall under this concordance category, as true equating, described below, is rarely accomplished.23

Equating

Equating is the most rigorous form of linking in which a score on one assessment

directly maps to a score on the other.24,25 True scale equating is difficult to obtain, as

Holland30 noted several fundamental assumptions for a linkage to be considered equating:

1) Equal construct – each assessment is measuring the same construct;

2) Equal reliability – the assessments should have the same level of reliability;

3) Symmetry – the transformation should be bi-directional;

4) Equity – the examiner/assessor should be indifferent to which assessment is

administered; and

5) Population invariance – the linking function that is used to produce the equating

should be the same regardless of subpopulation from which it is derived.
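
Two of these assumptions lend themselves to a compact statement. The notation here is introduced only for illustration and is an assumption, not the notation of the cited authors: let e_Y(x) denote the function that links a score x on assessment X to the scale of assessment Y, let e_X(y) denote the link in the reverse direction, and let a superscript identify the subpopulation from which a linking function was derived.

```latex
% Symmetry: the two linking functions are inverses of each other
e_X\bigl(e_Y(x)\bigr) = x \quad \text{for every score } x

% Population invariance: the same function results in every subpopulation
e_Y^{(P_1)}(x) = e_Y^{(P_2)}(x) \quad \text{for all subpopulations } P_1, P_2 \text{ and every score } x
```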

Dorans23 also points out that many examples of linking scores are more likely to be concordances. While the goal of both concordances and equating is to establish a relationship between scores of two assessments, the terminology used to describe them carries meaning and implies different interpretations.

Specific Aims

Because of the vast amount of data already collected using different outcome assessments in PAC settings and the move to consolidate these differences in a unified measure, various research projects will need to be initiated to fully understand how to overcome the disparity in data sources. Linking the scores is one way of addressing the seam as it remains important for researchers and clinicians alike to be able to equate

function regardless of the measure used. Furthermore, this “linking” will ultimately need to be constructed for various populations, as the function may not be the same depending on the population studied. For instance, the linking function could be significantly different for those with a significant motor impairment like spinal cord injury versus people having sustained a TBI. This study will attempt to create and evaluate crosswalks, or score linkages, between the FIM and CARE, utilizing a longitudinal database of people with moderate to severe TBI to determine if reducing the influence of the discrepancy in the data can be effective.

Research Aim 1: Create bi-directional crosswalks between the FIM and the CARE that map the score from one scale to that of the other.

Research Aim 2: Evaluate each crosswalk using methods to assess equating methodology.

Significance

The ability to crosswalk different assessments is an important step in establishing the validity, efficacy, and effectiveness of instruments. This dissertation will develop and assess a variable crosswalk between the FIM and the CARE. While the FIM consists of 18 items spanning the domains of self-care, mobility, bowel/bladder function, and cognition, the proposed crosswalk will exclude the cognitive section, as the CARE tool does not capture cognitive items at both admission and discharge as it does for the mobility and self-care components.

Specifically, this dissertation will focus on linking the CARE and the FIM Motor (FIM-M) subscale.

While the FIM is largely regarded as the standard of functional independence measurement,31,32 its use has primarily been relegated to IRFs. Other PAC settings utilized

different outcome measurements and comparison between them has been difficult.33,34

Because of this disconnect, in 2005 CMS developed and introduced the CARE Item Set, with the objective to unify the existing federal assessment tools for each of the PAC settings.21

Amongst other assessments, the CARE includes a set of items to measure motor function.

Given the movement toward a comprehensive post-acute measurement system, it is necessary to assess a linkage between the currently used FIM and the newly adopted CARE.

Developing and testing crosswalks between the FIM-M and CARE will provide evidence of the ability to compare patients’ functional status over time and between cohorts that have been administered either outcome measure. While the crosswalk developed for this dissertation will only be validated for people with moderate to severe TBI, it will contribute to the field by detailing methods that others may follow to design crosswalks for other populations.

Summary

Functional outcome measurement is an important tool in describing a patient’s

recovery following treatment in PAC settings. Furthermore, it also plays an integral role in

determining financial reimbursements from CMS to those PAC facilities. Recently CMS has

adopted a single measure for the assessment of function in PAC settings called the CARE

Item set. For PAC settings like IRFs, this tool now replaces a longstanding tool, the FIM,

which has been in existence for functional measurement since 1987. Given the widely adopted use of the FIM, many existing longitudinal research studies that have used FIM as a measure of outcome will likely need to switch to the CARE because of the mandate of CMS

to collect this measure instead of FIM. To date, no studies outside the internal studies which

CMS conducted during the creation of the CARE have attempted to link FIM and CARE item set scores. This study presents the development and evaluation of three crosswalks between the FIM-M and CARE item set in a large sample of patients who sustained a TBI and who completed both instruments as part of an existing research protocol. First, an expert review panel will evaluate the content of both items and scales and create a harmonized scoring algorithm by which each test will be scored. Second, using classical test theory, an equipercentile method will be employed to evaluate and link the score distributions of each test. Finally, an Item Response Theory (IRT) approach will be utilized to convert the test scores to a single logit scale on which concordance scores can be calculated. The accuracy of each crosswalk will be evaluated by measuring the reduction in

uncertainty (RiU), population invariance, statistical moment comparisons, and effect sizes.

Each crosswalk will be cross-validated in a proportion of the study sample not used for

crosswalk creation.

CHAPTER II

REVIEW OF THE LITERATURE

Introduction

This literature review aims to provide a comprehensive overview not only of the FIM and CARE and their prior use and applicability to measurement in a brain injury population, but also of the history of various linking frameworks that have been utilized in a variety of medical conditions. Therefore, a literature review was completed to further understand:

• The history and utility of the FIM

• The history and utility of the CARE

• Common equating frameworks and the methods to evaluate them

Literature Search Methods

The literature review was completed with the purpose of identifying relevant peer-reviewed publications, published white papers, and government publications. Searches were conducted primarily through PubMed and Google Scholar using the following terms:

• ‘post-acute rehabilitation’ and ‘outcome’

• ‘rehabilitation’ and ‘traumatic brain injury’

• ‘functional independence’ and ‘traumatic brain injury’

• ‘functional independence measure’

• ‘continuity assessment record and evaluation’

• ‘test equating’ and ‘equating’ or ‘linking’ or ‘concordance’

• ‘equipercentile’

• ‘crosswalk’ and ‘test linking’

• ‘test equating’ and ‘rasch’ or ‘equating’

• ‘evaluation’ and ‘test equating’

• ‘crosswalk’ and ‘function’

Article titles and abstracts were reviewed to select pertinent studies or papers to contribute to the understanding of the project. The references from those selected papers were also reviewed to complete an exhaustive and comprehensive review.

Review of the Literature

FIM

The FIM is used to objectively measure functional independence and burden of care through an individual’s level of motor and cognitive ability and assesses the extent of assistance required to complete activities of daily living (ADLs). The scale accounts for a patient’s level of independence, amount of assistance needed, use of adaptive or assistive devices, and the percentage of a given task completed successfully. This instrument is comprised of 18 items with a seven-level response scale of independent performance in self-care, sphincter control, mobility, locomotion, communication and social cognition.35

Thus, a total FIM score ranges from 18 to 126. Hamilton36 noted that, “Because each item is scaled on the basis of functional independence, it is expected that the total score (with each

item appropriately weighted) will correlate with the burden of care for the disabled person”

(p. 862). Over time, the FIM has been associated with having different subscales with

Granger identifying three constructs: ADL, mobility, and continence.35 Stineman et al.37

showed in a study of 93,829 rehabilitation inpatients that a factor analysis of the FIM

instrument supported the identification of ADL/motor and cognitive/communication

dimensions across 20 impairment categories. The multidimensional structure of the FIM has since been demonstrated by means of a Rasch analysis followed by factor analysis of standardized residuals, which showed the divergence of the five cognitively oriented items from the 13 motor-oriented items.32,38
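
The score ranges implied by this item structure follow from simple arithmetic. The short sketch below assumes each item is rated from 1 to 7, which is consistent with the seven-level response scale and the stated total range of 18 to 126; the constant and function names are illustrative.

```python
# Each FIM item is assumed to be rated from 1 to 7 (seven-level scale).
FIM_MIN_RATING, FIM_MAX_RATING = 1, 7
N_MOTOR_ITEMS = 13      # motor-oriented items
N_COGNITIVE_ITEMS = 5   # cognitively oriented items

def score_range(n_items):
    """Lowest and highest possible summed score for n_items items."""
    return n_items * FIM_MIN_RATING, n_items * FIM_MAX_RATING

total_range = score_range(N_MOTOR_ITEMS + N_COGNITIVE_ITEMS)  # (18, 126)
motor_range = score_range(N_MOTOR_ITEMS)                      # (13, 91)
cognitive_range = score_range(N_COGNITIVE_ITEMS)              # (5, 35)
print(total_range, motor_range, cognitive_range)
```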

In US inpatient rehabilitation settings, the Uniform Data System for Medical

Rehabilitation (UDSMR) is the most widely used clinical database for assessing

rehabilitation outcomes.39,40 Administered in most inpatient rehabilitation facilities within

three days of admission and discharge,41 the FIM is the core functional status measure of

the UDSMR and was developed to establish a uniform standard for the assessment of

functional status during medical rehabilitation.40,42 The FIM incorporates concepts and

items from previous functional assessment instruments, such as the Katz Index of ADL, the

PULSES profile, the Kenny Self-Care Evaluation, and the Barthel Index.43 The FIM system was

developed by a national task force co-sponsored by the American Congress of Rehabilitation

Medicine and the American Academy of Physical Medicine and Rehabilitation to develop a national UDSMR to rate patient functional independence and the outcomes of medical

rehabilitation.36 The original work of this task force was expanded by the Department of

Rehabilitation Medicine at the State University of New York at Buffalo. Since 1987, it has

been the mission of the UDSMR to measure medical rehabilitation outcomes across the continuum of care in PAC settings.44 The UDSMR maintains a national data repository for reporting purposes of three million case records from 1,400 facilities around the world.44,45

Furthermore, eRehabData, offered to providers from the American Medical Rehabilitation Providers Association, is another large inpatient rehabilitation outcome system that offers services to IRFs to maintain compliance with CMS by processing and submitting IRF-PAI data, including the FIM, to CMS.

Psychometric studies of the FIM instrument support its use for research purposes.

One of the strengths of the FIM is that it has undergone several methodological evaluations,

in which it has demonstrated good psychometrics.46,47 Dodds et al.47 noted high internal

consistency (Cronbach’s coefficient of .93 at admission and .95 at discharge), demonstrating the FIM to be a reliable instrument. Extensive investigations of the FIM’s reliability and validity have provided evidence of its interrater and test-retest reliability,48 internal consistency,37,49 concurrent validity,50,51 and predictive validity51,52. Ottenbacher et al.48 performed a meta-analysis of 11 studies that estimated the median interrater reliability for the total FIM to be .95 and the test-retest and equivalence reliability to be .95 and .92, respectively. Additionally, the stability of the FIM motor score, or knowing that the assessment was measuring the same construct across time, was demonstrated in several studies.32,38
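
For readers unfamiliar with the internal consistency statistic reported above, the following is a minimal sketch of the standard Cronbach's alpha calculation. The ratings are hypothetical and the function name is illustrative; this is not the computation or data used by Dodds et al.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item ratings."""
    k = len(item_scores[0])
    # Variance of each item across respondents.
    item_vars = [pvariance([resp[i] for resp in item_scores]) for i in range(k)]
    # Variance of the summed score across respondents.
    total_var = pvariance([sum(resp) for resp in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical ratings (1 to 7) for four respondents on three items.
ratings = [
    [2, 3, 2],
    [4, 4, 5],
    [6, 5, 6],
    [7, 7, 6],
]
alpha = cronbach_alpha(ratings)
print(round(alpha, 3))
```

Values close to 1 indicate that the items vary together, which is the sense in which the FIM is described as internally consistent.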

CARE

As part of the national Post-Acute Care Payment Reform Demonstration (PAC-PRD) and enabled by Congress under the DRA of 2005, the CARE was designed to standardize

assessments of patients’ medical, functional, cognitive, and social support status across all

PAC settings. Largely this effort stemmed from modernizing CMS’s existing assessments

including the IRF-PAI, the MDS, and the OASIS.

Kramer53 first detailed the types of approaches to assess PAC and to make

recommendations to CMS on ways to create a uniform assessment. In his report, Kramer

recommended 31 domains for which information about discharge placement, care

transitions, and outcome monitoring should be measured. Only three domains listed

satisfied all three purposes. They included Physical functioning/mobility, ADL/self-care, and

Instrumental ADL (IADL)/advanced cognition.53 While Kramer noted the importance of

these three domains, he cautioned that the scale to measure these domains needs to be

sensitive to small differences for those who are dependent because such difference could

have a strong impact on where these beneficiaries can reside.53 Kramer also outlined a long-term vision for achieving a uniform patient assessment that included the following steps:53

1. Agree on a core dataset at every hospital discharge

2. Develop algorithms that would recommend patient placement

3. Assure that uniform assessment could be delivered electronically

4. Create a health and outcome monitoring system

5. Base payment on metrics regardless of discharge setting

With these goals in mind Kramer set the stage for the creation and adoption of what would

ultimately be known as the CARE.

Johnston54 also joined the chorus of voices in support of a uniform post-acute assessment and described a framework within which it could be created. He also noted that

any uniform assessment would need to include a measurement of the extent of functional independence, as well as the ability to participate in self-care.

The background of the development of the CARE is detailed in Gage’s55 three-part

report prepared by RTI International and delivered to the CMS Office of Clinical Standards and Quality. In the first report, Gage utilized expert panels and reported on methods for inclusion of items into the CARE. Supported by previous research,53 the work identified four

domains: medical severity, functional impairment, cognitive impairment, and social support.

Given the breadth of information of the CARE tool in its entirety, only the functional domain

will be described in more detail here.

Initially, the CARE consisted of core and supplemental items. The core items

included six self-care items as well as five functional mobility items, and were used to

evaluate all patients regardless of functional level.55 They represented items covering a

range of difficulty, were easily scored, and played a critical role in discharge planning. The

core self-care items included basic self-care such as eating, tube feeding, oral hygiene, toilet

hygiene, and dressing upper and lower body. The core functional mobility items consisted

of lying to sitting on side of bed, sit to stand, transfers, walking, and wheelchair distance.

Each core item from either self-care or functional mobility is scored on a six-level rating

scale measuring the need for assistance: dependent, substantial assistance, partial

assistance, supervision or touching assistance, set-up or cleanup assistance, or

independent. The purpose of each item is to quantify the amount of assistance and

therefore inform resource utilization needs as well as discharge placement. The

justifications for inclusion of each core item are listed in Table II-1.

Table II-1: Justification for CARE tool core items

Self-Care / Mobility Items | Reasons for inclusion into the CARE

Eating | Eating measures the ability to use suitable utensils to bring food to the mouth and swallow food once the meal is presented on a table or tray and also includes modified food consistency. Patients requiring higher levels of assistance may have higher resource utilization and this may also affect PAC discharge placement.

Tube Feeding | Tube feeding includes the ability to manage all equipment and supplies for tube feeding. Patients requiring higher levels of assistance may have higher resource utilization and this may also affect PAC discharge placement. The supervision required for patients with substantial assistance may not be available in all settings. The tube feeding item is distinct from both the swallowing item and the eating item because patients who are able to manage the feeding tube on their own will be rated as independent and may require additional resources.

Oral Hygiene | The oral hygiene item is included because it is an activity that all patients need to perform. Patients requiring higher levels of assistance may have higher resource utilization.

Toilet Hygiene | Toilet hygiene includes the ability to maintain perineal hygiene, adjust clothes before and after using the toilet, commode, bedpan, or urinal. Patients requiring higher levels of assistance may have higher resource utilization.

Upper body dressing | Upper body dressing includes the ability to put on and remove shirt or pajama top, including buttoning three buttons. This item measures upper body mobility and fine motor skills. Patients requiring higher levels of assistance may have higher resource utilization.

Lower body dressing | Lower body dressing includes the ability to dress and undress below the waist, including fasteners. This item measures lower body mobility, balance, and dexterity. Similar to the upper body dressing item, patients requiring higher levels of assistance may have higher resource utilization.

Lying to sitting on side of bed | This is a lower level function item. Need for assistance with this item is indicative of resource utilization and may also affect PAC discharge placement.

Sit to stand | This item measures balance and transition and is a more difficult functional item that may be used to assess fall risk. Need for assistance with this item is indicative of resource utilization.

Toilet transfer / Chair/Bed to chair transfer | Both toilet transfer and chair-to-chair transfer are included in the CARE tool. Chair-to-chair, or bed-to-chair, transfer is a more basic surface-to-surface transfer, but toilet transfer is more difficult because it occurs in a constrained space. Toilet transfer is predictive of a patient’s ability to return home. For both items, patients requiring higher levels of assistance may have higher resource utilization and this may also affect PAC discharge placement.

Longest distance the patient can walk | The walking item codes the longest distance the patient can walk. This is a performance based item and the response categories include Walk 150 ft, Walk 100 ft, Walk 50 ft, or Walk in room once standing. This locomotion item is predictive of post-acute discharge placement and resource utilization. Patients with limited mobility requiring higher levels of assistance may have higher resource utilization.

Longest distance the patient can wheel | For patients whose primary mode of mobility is wheelchair, there is a locomotion item that corresponds to the walking item. The wheelchair item codes the longest distance the patient can wheel. This is a performance based item and the response categories include Wheel 150 ft, Wheel 100 ft, Wheel 50 ft, or Wheel in room once sitting. This locomotion item is predictive of post-acute discharge placement and resource utilization. Patients with limited mobility requiring higher levels of assistance may have higher resource utilization.

While the core set of items was being discussed, the expert panels also considered supplemental items that would help clarify a patient’s functional status. The supplemental items address a range of activities that differed in difficulty and are listed below:

• Wash upper body

• Shower/bathe self

• Roll left or right

• Sit to lying

• Picking up object

• Putting on/taking off footwear

• Wheelchair use for mobility

• 1 step (curb)

• Walk 50 feet with two turns

• 12 steps interior

• 4 steps exterior

• Walk 10 feet on uneven surface

• Car transfer

• Wheelchair users only—short ramp

• Wheelchair users only—long ramp

• Telephone—answering

• Telephone—placing call

Gage summarizes the report by acknowledging that this first version of the CARE, while solid, still has further development to achieve. Factors specific to less common

diseases like stroke and spinal cord injury are yet to be accounted for.55 In the demonstration project, the CARE was effective in uniformly assessing patients and in assisting in the creation of the PPS. All PAC settings (IRFs, HHAs, SNFs and LTACHs)

were able to utilize the CARE to collect consistent and reliable data, and transmit these data

to CMS.17

Later that year, RTI continued testing the CARE by conducting further reliability studies as well as inter-rater reliability testing. They found that the CARE items were reliable

when used across settings and by different disciplines.56 The levels of agreement varied but

most showed correlation coefficients above 0.70; a few appeared weaker across the board

such as certain aspects of swallowing measurement, walking 150 feet, light shopping, and

laundry. The researchers also performed factor analysis and Rasch analysis to assess the dimensionality and item fit of the CARE measures, and found reasonable unidimensionality and good item fit for the core variables in the CARE.

To underscore this initiative for a uniform measure to assess function, Chang and

colleagues57 developed a Chinese version of the CARE (CARE-C) to better appreciate post-

acute measurement in Taiwan. While their study focused on stroke only, they concluded

that the development of CARE-C was useful in evaluating functional quality metrics and facilitated assessment in PAC settings in Taiwan.

As evidenced by the historic utility of the FIM, but its use in only one post-acute

setting, and the new drive for CMS to implement the CARE tool in all PAC settings, there

remains a lack of research that addresses how these two assessments could be linked such

that the “seam” generated by switching assessments can be minimized. The primary

methodology for doing this was developed by research done primarily in educational testing

and is defined as measurement equating.

Measurement Linking

Linking is a statistical process that is used to adjust scores on instruments so that

scores on each of the instruments can be used interchangeably. Interchangeably means that

regardless of the instruments or version of the instrument taken, the same scores represent

the same level of achievement and differences in scores are not due to difficulty differences

between alternate versions of the instrument. In this context, “difficulty” is defined as the estimate of skill level needed to pass an item or instrument. While this concept is rather

easy to follow when discussing an aptitude test in which there exists right and wrong

answers, it is a bit more abstract to think about instruments that measure functional ability

that score difficulty levels which are not based on “passing an item”, yet the concepts and

terminology remain the same. Depending on how well a linking function works (described

further in detail below) the end result is considered to be a “crosswalk” between the two

instruments. Beginning in 1951, Flanagan58 differentiated the various types of frameworks

necessary for linking and the construction of score scales. He used the term comparability

when describing scores on instruments that were scaled such that the distributions were

similar for a certain population. For instance, an instrument that measures reading

comprehension could be comparable to an instrument that measures mathematical

aptitude for children in a certain age group. He argued that any linking had to be population

specific, meaning, in our example, that the linking transformation may not be appropriate

for college-aged students. In that regard, he stated, “Comparability which would hold for all

types of groups—that is, general comparability between different tests, or even between various forms of a particular test—is strictly and logically impossible” (p. 748). Flanagan described three major topics of linking. First, he differentiated linking scores that measure the same construct from linking scores that measure different constructs. For example, linking two instruments that measure reading ability is different from linking an instrument that measures reading comprehension to one that measures mathematical skill. Second, he recognized linking scores on tests that measure the same construct but differ in difficulty.

Third, he noted that some linking transformations could be symmetric, meaning that the transformation of scores is bi-directional.58 Many of these concepts are still considered and used today; however, the nomenclature has evolved from simply using the term comparability.

Following Flanagan’s 1951 effort, Angoff59 proceeded to use the term equating to

represent linking scores on multiple forms of a test that were created to be similar in

nature. Borrowing from the educational sphere, there exist many versions of the Scholastic

Aptitude Test (SAT) all of which are linked such that each version is equated with the other.

Differing from Flanagan, Angoff postulated that equating relationships should be population independent. Angoff also changed the way linking procedures were described. First, he used the term calibration to refer to linking scores on instruments that measure the same construct but differ in reliability and/or difficulty. In this case, imagine linking a 100-item instrument that measures reading comprehension to an instrument that contains only 10 of those 100 items, essentially creating a “short form” version of the instrument. Angoff would term this type of linking calibration because the reliability of the two versions of the instrument would be different. Second, he limited the use of the term comparability to linking scores on tests that measure different constructs.

Mislevy59 and Lim60 separately developed and further expanded the frameworks for linking, adding statistical moderation and projection. Statistical moderation is a process of linking scores through a third (moderator) variable. Projection is a method of non-symmetrical linking using regression. Neither Mislevy nor Lim required their frameworks to distinguish between linking measures with the same or similar constructs and linking measures with different constructs.

Finally, Dorans22,60 made a distinction between the linking of scores from

instruments that measure the same construct versus linking of scores that measure

different constructs. He used the term concordance to represent the linking of scores from instruments with similar constructs, in which statistical methods are used to produce similar score distributions for the measures.

In general, linking will be called something different depending on the attributes of the instruments being linked. While equating may be the goal of linking, few transformations will meet this rigorous definition, which requires abiding by the five principles (equal construct, equal reliability, symmetry, equity, and population invariance). Most transformations will fall under the definition of scale alignment, in which the transformation is one of the following: battery scaling, in which the instruments measure different constructs (reading and math) across a common population (5th graders); vertical scaling, in which the constructs and reliability are similar (math) but the instruments differ in difficulty (addition and subtraction vs fractions and percentiles) and in populations of examinees (2nd graders vs 5th graders); calibration, in which the instruments measure the same construct (math) in the same population (5th graders) but differ in instrument reliability (short form); and finally concordance, in which the instruments measure the same constructs, have similar reliability and difficulty, and are measured on the same population.

There exists a variety of techniques used to equate tests,59,61 which can be reduced to two different approaches to equating: classical test theory (CTT) and IRT.

Classical Test Theory

CTT is regarded as the traditional approach to measurement. It is based on the

assumption that a test-taker has an observed score and a true score; where the observed

score is estimated as the true score plus/minus some unobservable measurement error.62

CTT is used in item analysis, a process in which an individual’s response to an item on an

instrument is analyzed to assess item difficulty, or to identify an individual’s response to a

particular item as it relates to the overall score to reflect a point-biserial correlation. It is

also used to develop instruments, and to evaluate their quality, reliability, and validity. One

process that uses CTT relies on having score distributions of the two tests coincide in a well-

defined population, which is called equipercentile equating. Conversely, a broader approach uses regression procedures, in which one score is regressed on another and the fitted model is then used to calculate a prediction of the other score.61 Pommerich et al.29 demonstrated

through linking of American College Test (ACT) and SAT that the equipercentile method

(concordance) and regression (prediction) serve different purposes, and the result should

be used accordingly. While the former is more appropriate when determining comparable

scores at which the same percentages of examinees score above and below relevant score

points, the latter is more appropriate to predict an individual’s score.

One benefit of the equipercentile methodology is that equated scores will always be

within the range of possible scores under the traditional conceptualization of percentiles

and percentile ranks. However, equating relationships cannot be determined outside the range of observed scores. For example, if the distributions of the instruments to be linked do not represent all possible scores, then the linking transformation will not be able to create a link to those scores. Suppose two tests, both with possible scores ranging from 0 to 100, need to be linked, yet in our population one test has a restricted range such that all scores are above 50; the equipercentile linking transformation will then never be able to link to a score lower than 50. Furthermore, the score distribution can be irregular due to random error in estimating the equivalents; therefore, smoothing methods can be used to mitigate the effects of random error and provide more regular shapes. However, smoothing can introduce systematic error since the raw data will be changed to better fit a normal distribution. The intent in using a smoothing method is for the increase in systematic error to be more than offset by the decrease in random error.
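
The equipercentile idea and its range limitation can be sketched in a few lines. This is an unsmoothed illustration under stated assumptions: the scores are hypothetical, the mid-percentile rank convention and linear interpolation of quantiles are choices made here for simplicity, and the function names are illustrative. It is not the procedure used in this dissertation.

```python
from bisect import bisect_left, bisect_right

def percentile_rank(scores_sorted, x):
    """Mid-percentile rank of x: share below x plus half the share equal to x."""
    below = bisect_left(scores_sorted, x)
    at_or_below = bisect_right(scores_sorted, x)
    return (below + at_or_below) / (2 * len(scores_sorted))

def quantile(scores_sorted, p):
    """Empirical quantile with linear interpolation, for 0 <= p <= 1."""
    pos = p * (len(scores_sorted) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(scores_sorted) - 1)
    frac = pos - lo
    return scores_sorted[lo] * (1 - frac) + scores_sorted[hi] * frac

def equipercentile_link(x, x_scores, y_scores):
    """Map score x on test X to the test Y score with the same percentile rank."""
    xs, ys = sorted(x_scores), sorted(y_scores)
    return quantile(ys, percentile_rank(xs, x))

# Hypothetical observed scores; test Y has no observed scores below 55.
x_scores = [10, 25, 40, 55, 70, 85, 100]
y_scores = [55, 60, 68, 75, 81, 90, 97]

linked = [equipercentile_link(x, x_scores, y_scores) for x in range(0, 101, 10)]
print([round(v, 1) for v in linked])
```

Every linked value falls between the lowest and highest observed Y scores, so a Y score below 55 can never be produced from these data, which is the limitation described above.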

IRT (Rasch)

In 1953 Rasch63 created a system in which, while trying to equate two assessments, he independently estimated scale item difficulty free from the effect of the ability of the person responding to the items, as well as estimated person measures devoid of the effect of the difficulty of the item. While he understood that he could not determine how a person would respond to an item, he could however, estimate the probability of success on that

item. He postulated that the probability is only responsive to a person’s ability and the difficulty of the item as expressed in this formula:

\[
\log\left(\frac{\text{Probability of Success}}{\text{Probability of Failure}}\right) = \text{Ability} - \text{Difficulty}
\]
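As an illustration only, solving the log-odds relationship for the probability of success gives a logistic function of the difference between ability and difficulty. The sketch below assumes both quantities are expressed in logits; the function names are hypothetical.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of success when log-odds equal ability minus difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def log_odds(p):
    """Natural log of the odds of success for a probability p."""
    return math.log(p / (1.0 - p))

# A person whose ability equals the item difficulty has even odds.
p_equal = rasch_probability(1.0, 1.0)

# Converting the probability back to log-odds recovers ability - difficulty.
recovered = log_odds(rasch_probability(2.0, 0.5))
```

When ability exceeds difficulty the probability rises above one half, and when difficulty exceeds ability it falls below one half.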

IRT is a body of related psychometric theory that provides an integrated

psychometric framework for developing and scoring tests. The main feature of IRT models is that they relate item responses to characteristics of individual persons and test items. The models are functions relating item and person parameters to the probability of a discrete outcome, such as a correct response to an item. IRT models have been developed for tests for which items are scored dichotomously (0/1) as well as for tests for which items are scored polytomously. IRT provides a basis for estimating parameters, ascertaining how well data fit a model, and investigating the psychometric properties of tests. Applications of IRT include test development, item banking, differential item functioning, adaptive testing, test equating, and test scaling. IRT (latent trait) methods have recently been advocated as a potential improvement over traditional CTT methods.64-66 Lord64 argued from theoretical considerations that traditional equating methods are not appropriate for equating tests of differing difficulty, whereas IRT methods have the capacity to provide an appropriate equating in this case. It is possible that a large calibrated item pool could be used to

construct forms that share the same content and statistical specifications to such a degree

that scores can be truly equated; often, however, the relationship between scores on such

forms is better described as calibrated.66

Criteria for Linking

Three questions arise when contemplating the development of a crosswalk. The first two, ‘Should linking be done?’ and ‘How should linking be accomplished?’, have been addressed previously in the literature review; the last question, ‘Has the equating procedure been done well enough?’, is the focus of this section. Harris and Crouse67 stated that evaluating equating is problematic because a diversity of methods exists, and they thoroughly reviewed and discussed the available criteria. Among them are the equating-in-a-circle paradigm, generated or simulated data, large sample criteria, standard error (SE), indices, consistency, and many more. Kolen and Brennan25 have also identified that properties of equating such as equity, symmetry, and population invariance can be used to develop evaluative criteria.

Several studies have compared linking methods derived from linear, equipercentile and IRT methodology.64,68,69 The findings from these studies were equivocal.

Lord64 posited that there were not enough experimental results to generalize which methods under specific research designs are best suited to equate different tests. Kolen69 performed a cross-validation study which compared two traditional equating schemes with seven IRT models. He found that the equipercentile method produced more stable results than an IRT method. In a paper that sought to analyze the National Reference Scale, a dataset that equated reading in 14 separate reading tests for students in the fourth, fifth, and sixth grades, Rentz and Bashaw70 concluded that the Rasch and equipercentile equating results were reasonably similar. Slinde71 noted, however, that the equipercentile method was an inadequate method for tests of differing difficulties and hypothesized that methods utilizing

item characteristic curve theory are likely more appropriate for vertical equating. Kolen69 also showed that overall the equipercentile and IRT methods were similar, although he demonstrated that the equipercentile procedure produced more stable results when the instruments being linked were easier.

Marco, Petersen, and Stewart72 compared a variety of equipercentile, linear, and IRT equating methods for equating the verbal portion of the SAT. When a test was equated to an anchor test of similar difficulty all but one of the methods appeared to be satisfactory.

When tests of different difficulty were equated, the IRT methods were superior and the linear methods clearly inferior. Empirically, Jaeger73 identified five indices that should logically discriminate between situations in which the linear equating method adequately adjusts for differences between score distributions and situations in which a more complex solution is needed: (1) Similarity of Cumulative Score Distributions, (2) Shape of the Raw-Score to Scaled Score Transformations, (3) Consistency of Linear and Equipercentile Equating Results, (4) Similarity of Item Difficulty Distributions, and (5) Similarity of Item Discrimination Distributions. He found that linear methodology discriminated in four of the five indices and acknowledged that item difficulty plays a role in reducing the adequacy of the linear model.

In another study, Skaggs and Lissitz74 explored the differences between linear, equipercentile and IRT test equating methods. Using a Monte Carlo approach, their results revealed a lack of robustness for the Rasch model when the equal discrimination assumption was violated, and the recommended procedure for tests similar to those in the study was the equipercentile method.

Kolen and Brennan25 postulated that estimating random error using the standard error for equating differences (SEED) can be used to develop equating criteria. Harris and Crouse67 called these evaluative criteria SE criteria and defined them as an analytical method to estimate the amount of equating error from sampling. While use of the SEED was mostly reserved for kernel equating methodology, Moses and Zhuang75 extended its use to inform more traditional equating functions such as equipercentile. The advantages of using the SEED are that it is easy to apply and to interpret, yet its main disadvantage is that it overlooks systematic errors. In general, SEs of equating are used (a) as a means of expressing equating error when scores are reported, (b) as a basis for comparing equating methods, and (c) in the selection of sample sizes for desired equating precision.25

The equating studies that do concentrate on the SEED tend to be concerned with the derivation of SEs for various equating methodologies and data collection designs.67 So far, the SEED has been derived for most of the equating procedures. The delta method has been used to derive most of the SE functions for the equating designs and methods including IRT equating.76 Liou and Cheng77 and Wang78 provided SEs for equipercentile equating; and

Holland et al.79 provided SEs for the kernel equating method. Wang78 extended the derivation of Holland et al.79 to provide the SEED for the log-linear method.

When investigating the literature, one feature that has a major effect on equating is sample size. Aşiret80 studied the different equating methodologies when using data with small sample sizes (n < 200). He postulated that with sample sizes over 200, the type of procedure used became less important.

The equating literature has mixed views about utilizing statistical significance tests when selecting or evaluating equating functions.25,81,82 Often, selection of an equating function relies more on theory and heuristics.83 Kolen and Brennan25 have also identified that properties of equating such as equity, symmetry and population invariance can be used to develop evaluative criteria. Ultimately there exist many different approaches to evaluating the effectiveness of an equating process, but there is no definitive criterion. In fact, many of the researchers who have provided alternative evaluations reach different conclusions depending on the criterion of choice. Livingston et al.82 stated that:

“These two studies do not contradict each other; they simply used

different methods of evaluating the results. That the selection of a

criterion can have such an impact on interpreting equating results has

devastating implications for the vast majority of equating studies for

which such a large target population is unavailable” (p. 108).

Linking Functional Outcomes

The previous section on linking and evaluation of linking has revolved around

literature in educational testing. Transfer of these procedures to functional assessments has been somewhat limited and did not take hold until the mid-1990s. In order to identify gaps in the literature as it relates to linking functional outcome measures in the PAC

settings, this section will document recent studies in which linking has been performed on

functional outcomes specifically.

In 1997 two studies were published that created crosswalks linking functional assessments. While these measures are more health status measures than functional outcome measures, the studies highlight the importance of linking across different assessment types. Fisher et al.84 linked the physical functioning subscale of the Medical Outcomes Study Short Form 36 item version (SF-36) and the Louisiana State University Health Status Instruments (LSU HSI) in a convenience sample of 285 people waiting in a general medical clinic. They used an IRT-based approach and found that the item difficulties were similar and the two equated instruments were moderately correlated (.80). In

the second study, Williams18 developed and tested a crosswalk between the FIM as

administered in an IRF to the MDS which was administered in nursing homes across several

rehabilitation diagnoses (stroke, hip fracture, cardiac disease, amputation, and

gastrointestinal disease). Unlike traditional CTT or IRT methodologies, they proposed

creating links between the instruments by utilizing expert opinion. Experts in the field of

rehabilitation would assess the items of each test as well as the scoring categories to create

a “pseudo–FIM” score from the MDS score, which would be compared to the observed

score of the FIM. They concluded that for most comparisons, the degree of agreement fell

within commonly accepted standards of accuracy, yet reported statistically significant

differences between the means when assessing the crosswalk.

Another attempt to crosswalk the FIM was performed by Buchanan.85 In a desire to

replace the FIM with a newer measure based on the nursing home outcome assessment tool MDS, the MDS-PAC was linked to the FIM in order to utilize the prospective payment system (PPS) for inpatient rehabilitation. Again not utilizing CTT or IRT methodology, a recalibration or realignment of items and scores was performed by clinical experts. Ultimately they concluded that the linking effort failed to achieve sufficient accuracy for use with a payment system, although they did imply that the translation might have been acceptable for outcome measurement or quality monitoring.

In 2007, Velozo86 utilized Rasch methodology to develop a crosswalk between the

FIM and MDS based on a sample of Department of Veterans Affairs (VA) patients across a

variety of diagnostic groups. More specifically, the crosswalk was developed using common-person methodology, in which the same people are administered the different assessments. He determined that the FIM and MDS were indeed measuring the same construct, the internal consistency as measured by person separation was high (Cronbach alpha = 0.94), and there were good point-measure correlations (0.54-0.84). Velozo stopped short of identifying the crosswalk between the MDS and FIM as “successful” and pointed out that future validity testing will be needed to assess accuracy and applicability of the crosswalk. Nevertheless, Velozo concluded that while successful crosswalks can be created, the viability of the methodology is still in its infancy as it relates to healthcare.

Perhaps proving Velozo’s caution, Wang87 performed a validation analysis of

Velozo’s crosswalk between the FIM and MDS. Their findings were mixed, as they found small mean differences in the FIM observed versus MDS-derived FIM Motor scores, but the difference was highly population invariant. Furthermore, while the group distributions of the scores were similar, the individual scores were disappointing as only 34% of the derived scores were within 5 points of the observed FIM Motor scores that had a range of 13 to 91.
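The two validation summaries mentioned above, the mean difference between observed and derived scores and the percentage of derived scores within 5 points of the observed score, can be computed as in the following minimal sketch. The function name, the example scores, and the choice to treat "within 5 points" as inclusive are all assumptions made for illustration; they are not taken from the validation study.

```python
import numpy as np

def crosswalk_agreement(observed, derived, tolerance=5):
    """Summarize group-level and individual-level agreement of a crosswalk."""
    observed = np.asarray(observed, dtype=float)
    derived = np.asarray(derived, dtype=float)
    diff = derived - observed
    return {
        # Group-level summary: average signed difference.
        "mean_difference": float(diff.mean()),
        # Individual-level summary: share of people whose derived score is
        # within the tolerance (assumed inclusive) of the observed score.
        "pct_within_tolerance": float(100.0 * np.mean(np.abs(diff) <= tolerance)),
    }

# Hypothetical observed and crosswalk-derived motor scores for five people.
observed = [30, 45, 60, 72, 88]
derived = [33, 52, 58, 60, 90]
result = crosswalk_agreement(observed, derived)
```

A small mean difference can coexist with a low percentage within tolerance, which is why group-level and individual-level accuracy are reported separately.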

Haley88 used a nonequivalent group design IRT approach to develop a crosswalk between physical functioning items in the Activity Measure for Post Acute Care (AM-PAC) and the Quality of Life Outcomes in Neurological Disorders (Neuro-QOL). The AM-PAC items

37

were administered to rehabilitation inpatients (n=1,041), whereas the Neuro-QOL items were administered to a community sample (n=549) without neurological conditions surveyed through the

Internet. This approach linked scores by anchoring each test on common items shared by

the two assessments. He concluded that this type of linking allowed for accurate estimation

of AM-PAC mobility and ADL subscale scores based on Neuro-QOL mobility and ADL

subscale scores and vice versa, yet cautioned that the results need to be validated in a design in which the same persons take both tests (person anchoring).

Similarly, ten Klooster89 utilized Rasch methods as well as two-parameter and multidimensional IRT models to establish crosswalks between the SF-36 physical functioning scale and the Health Assessment Questionnaire Disability index (HAQ-DI). He co-calibrated both scales using data from 1,791 patients with rheumatoid arthritis. He found that the

Rasch-based crosswalk performed similarly to the crosswalks using the two-parameter and multidimensional IRT models. Oude Voshaar90 used the conversion scores from ten Klooster to link the SF-36 physical functioning scale and the HAQ-DI in a sample of patients who participated in a longitudinal study from the Netherlands with a variety of rheumatic diseases (rheumatoid arthritis, fibromyalgia, systemic lupus erythematosus). They assessed the reliability of the crosswalk using intraclass correlation coefficients (ICC) and construction of Bland-Altman plots of the difference against the mean of predicted and observed scores.

They concluded that the crosswalk performed well across various subspecialties of rheumatic diseases and in a cross-cultural setting. However, they did stipulate that the crosswalk was successful only for group-level comparisons.
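A Bland-Altman assessment of the kind described above plots, for each person, the difference between predicted and observed scores against their mean, and summarizes agreement by the mean difference (bias) and limits of agreement. The sketch below uses the conventional limits of bias plus or minus 1.96 standard deviations of the differences; the function name and example data are hypothetical.

```python
import statistics

def bland_altman_limits(observed, predicted):
    """Return bias, 95% limits of agreement, and the (mean, difference) points."""
    diffs = [p - o for o, p in zip(observed, predicted)]
    means = [(p + o) / 2 for o, p in zip(observed, predicted)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return {
        "bias": bias,
        "lower": bias - 1.96 * sd,
        "upper": bias + 1.96 * sd,
        "points": list(zip(means, diffs)),  # coordinates for the plot
    }

# Hypothetical observed and crosswalk-predicted scores.
observed = [10, 20, 30, 40]
predicted = [12, 18, 33, 41]
ba = bland_altman_limits(observed, predicted)
```

Wide limits of agreement relative to the score range would suggest that a crosswalk is better suited to group comparisons than to individual-level conversion.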

Noonan91 sought to find a crosswalk between the Modified Fatigue Impact Scale and the PROMIS Fatigue Short Form. Using an equipercentile linking function in a sample of 444 persons with multiple sclerosis, he found that there was a significant impact of sample size, but that the crosswalk created was appropriate for group-level analysis with sample sizes over 150.

Another crosswalk study using IRT to link functional scores was undertaken by Schalet et al.92 They linked scores from the Veterans RAND 12-Item Health Survey with scores from the Patient-Reported Outcomes Measurement Information System (PROMIS®) Global Health Score. Using a common-person methodology, a common metric was created using IRT from which crosswalk tables for both the mental and physical measures were produced. The linking was evaluated by calculating the standard deviations (SD) of differences and estimating confidence intervals (CI). They showed that linking physical health performed better than mental health and the correlation coefficients for both were lower (.63 to .80) than what would be needed for a robust crosswalk.

More recently, Ghomrawi93 utilized an equipercentile methodology to develop a crosswalk between the University of California, Los Angeles (UCLA) Activity Scale and the

Lower Extremity Activity Scale (LEAS). Each of these assessments is used to assess activity levels of persons having undergone joint replacement. The study focused on patients having had primary total knee arthroplasty or total hip arthroplasty at the Hospital for Special

Surgery between 2007 and 2011. They showed that their crosswalk derived scores had a similar responsiveness to change as the original scores and had similar discriminant properties.

Finally, Hong94 used IRT (Rasch) modeling to develop a crosswalk between the FIM

and the Korean version of the Modified Barthel Index (K-MBI) based on a sample of Korean community dwelling individuals (n=276). Each patient was measured using both instruments so that a common person equating could be produced. Hong concluded that their

crosswalks between the FIM and K-MBI demonstrated good internal consistency on total

scores and for subscales (Cronbach alpha = 0.93-0.97) and there was good reliability

between the raw FIM scores and the converted K-MBI scores (r=0.91-0.93).

Given the lack of standardization in both creating and evaluating crosswalks, this study utilizes several linking methodologies and evaluates each using several of the evaluation criteria most commonly applied by previous researchers.

Summary of Gaps in the Literature

As seen in this literature review, much of the linking research has been developed

and applied in education testing. The desire to equate instruments stems largely from the need to have multiple versions of forms as well as to equate persons’ abilities across different assessments. Only recently has this methodology been applied to measurement of functional status for persons with physical/cognitive/mental limitations. Because the CARE measure was adopted relatively recently as the de facto CMS measurement of function in PAC settings, no studies have yet published crosswalks between the various PAC

measurements and CARE. Several studies have attempted to develop crosswalks between

the FIM and other functional status measurements like the MDS,18,86,87,94 while others have

tried to link outcome assessments such as the modified fatigue impact scale,91 and the

mental summary measure from the Veterans RAND 12-Item Health Survey to the respective PROMIS counterparts for fatigue and global health.91,92 Only one utilized CTT,93 two used clinical expertise,18,85 whereas the majority took advantage of IRT modeling using Rasch.

There have been no studies that have attempted to develop a crosswalk between FIM and CARE. This dissertation is the first to apply three equating methodologies to crosswalk the FIM and the CARE and to apply several criteria to evaluate the effectiveness of those methodologies.

CHAPTER III.

METHODS

Research Design and Data Collection

The Traumatic Brain Injury Model Systems National Database

This project utilized an existing dataset, the TBI Model Systems (TBIMS) National

Database (NDB). Established in 1987, the TBIMS, a program funded by the National Institute

on Disability, Independent Living, and Rehabilitation Research (NIDILRR), created a

longitudinal database for which data from individuals having sustained a TBI were collected.

Domains of data collected included, but were not limited to, demographic characteristics, injury characteristics, mortality, injury severity, functional status, community participation, and various psycho-social outcomes. Data consist of information gathered from the individual regarding their pre-injury status, information abstracted from the medical record regarding their hospitalization course, and follow-up interviews with patients at 1, 2, and 5 years, and every 5 years thereafter, on the anniversary of their injury.

The TBIMS NDB contains data on individuals sustaining a moderate to severe

traumatic brain injury having received inpatient rehabilitation at one of the following

NIDILRR-funded TBIMS Centers:

1. The Virginia Commonwealth TBI Model System*

2. The Institute for Rehabilitation and Research – Memorial Hermann

3. Southeastern Michigan TBI System*

4. Northern California TBI Model System*

5. The Ohio Regional TBI Model System*

6. Moss TBI Model System*

7. University of Alabama at Birmingham TBI Care System*

8. Rocky Mountain Regional Brain Injury System*

9. Spaulding-Harvard TBI Model System

10. Mayo Clinic TBI Model System*

11. Northern New Jersey TBI System

12. Carolinas TBI Rehabilitation and Research System

13. University of Washington TBI Model System*

14. JFK Johnson Rehabilitation Institute TBI Model System

15. University of Pittsburgh Medical Center TBI Model System

16. North Texas TBI Model System*

17. New York TBI Model System*

18. Midwest Regional TBI Model System

19. Rusk Rehabilitation TBIMS at New York University

20. Indiana University / Rehabilitation Hospital of Indiana*

21. South Florida TBI Model System

The asterisked Centers provided data for the current study. Each subject met the criteria for inclusion into the NDB below:

Table III-1: Inclusion criteria for TBIMS NDB

Dimension Definition

TBI Damage to the brain tissue caused by an external mechanical force as evidenced by medically documented loss of consciousness or post traumatic amnesia (PTA) or by abnormal neurological findings that can be attributed to TBI on physical examination or mental status examination.

Severity Having either PTA longer than 24 hours, loss of consciousness exceeding 30 minutes, Glasgow Coma Scale (GCS) Score of less than 13, or neuroimaging abnormalities defining severity as moderate to severe.

Age Sixteen years or older at the time of injury.

Pathway Admitted to a TBIMS’s acute care hospital within 72 hours of injury.

Comprehensive Received acute and comprehensive rehabilitation within each Center’s “system of Rehabilitation care”. Comprehensive rehabilitation must occur in an IRF, SNF or LTACH and meet the following criteria:
• Medical and rehabilitation care is supervised on a regular basis by a physician;
• 24 hour nursing care;
• Integrated team approach including Physical Therapy, Occupational Therapy, Speech, Psychology/Neuropsychology.

Consent All participants are consented or, if unable, family or legal guardian provide consent.

Study Participants

All data for this project were collected as part of a multi-center TBIMS collaborative module study and all participants consented to the additional data collection. Eleven Centers participated in the study, which only included persons discharged from acute rehabilitation between 10/1/2016 and 12/24/2018. Participants were evaluated on functional status using both the FIM and CARE by trained and certified clinical staff at each facility. Collection of both the FIM and CARE was mandated for IRFs by CMS to receive reimbursement for patients who were receiving CMS benefits. Largely, the evaluations were conducted by several rehabilitation clinical staff members, each focusing on the items that best matched their specialty. Each patient was evaluated at least twice during his or her

stay, once within three days of admission and again within three days prior to discharge from the inpatient rehabilitation facility.

Variables Collected

Data extracted from the TBIMS NDB consisted of demographic and injury characteristics needed to characterize the sample. While the CARE does contain elements to address cognition, the administration and construct of those variables do not relate to function and therefore do not map well to the cognitive FIM items. Thus, only functional mobility and self-care items from both the FIM and CARE were included in the study dataset. The items in the final dataset are listed in Table III-2.

Table III-2: Study Dataset Variables

Item Definition

DEMOGRAPHICS

Sex Male vs. Female.

Severity of Injury An index of severity was created using the GCS, time to follow commands (TFC), or length of PTA. If any one of these measures indicated a severe TBI, then the TBI was considered Severe; otherwise it was classified as Moderate.

Minority Status Any non-White race was considered Minority vs. Non-Minority.

Cause of Injury Etiology of TBI consisted of major groupings: Vehicular, Falls, Violence, Sports, Other.

Age Age at time of injury

CARE

Eating The ability to use suitable utensils to bring food to the mouth and swallow food once the meal is presented on a table/tray. Includes modified food consistency.

Oral hygiene The ability to use suitable items to clean teeth. Dentures (if applicable): The ability to remove and replace dentures from and to the mouth, and manage equipment for soaking and rinsing them.

Toileting hygiene The ability to maintain perineal hygiene, adjust clothes before and after using the toilet, commode, bedpan or urinal. If managing an ostomy, include wiping the opening but not managing equipment.

Table III-2 cont.

Shower/bathe self The ability to bathe self in shower or tub, including washing, rinsing, and drying self. Does not include transferring in/out of tub/shower.

Upper body dressing The ability to put on and remove shirt or pajama top; includes buttoning, if applicable.

Lower body dressing The ability to dress and undress below the waist, including fasteners; does not include footwear.

Putting on/taking off footwear The ability to put on and take off socks and shoes or other footwear that is appropriate for safe mobility.

Roll left and right The ability to roll from lying on back to left and right side, and return to lying on back.

Sit to lying The ability to move from sitting on side of bed to lying flat on the bed.

Lying to sitting on side of bed The ability to safely move from lying on back to sitting on the side of the bed with feet flat on the floor, and with no back support.

Sit to stand The ability to safely come to a standing position from sitting in a chair or on the side of the bed.

Chair/bed-to-chair transfer The ability to safely transfer to and from a bed to a chair (or wheelchair).

Toilet transfer The ability to safely get on and off a toilet or commode.

Walk 10 feet Once standing, the ability to walk at least 10 feet in a room, corridor or similar space.

Walk 50 feet with two turns Once standing, the ability to walk at least 50 feet and make two turns.

Walk 150 feet Once standing, the ability to walk at least 150 feet in a corridor or similar space.

Walking 10 feet on uneven surfaces The ability to walk 10 feet on uneven or sloping surfaces, such as grass or gravel.

1 step (curb) The ability to step over a curb or up and down one step.

4 steps The ability to go up and down four steps with or without a rail.

12 steps The ability to go up and down 12 steps with or without a rail.

Picking up object The ability to bend/stoop from a standing position to pick up a small object, such as a spoon, from the floor.

Wheel 50 feet with two turns Once seated in wheelchair/scooter, the ability to wheel at least 50 feet and make two turns.

Wheel 150 feet Once seated in wheelchair/scooter, the ability to wheel at least 150 feet in a corridor or similar space.

Table III-2 cont.

CARE Total Score The total CARE Score was calculated using the methodology outlined by the National Quality Forum,95 which sums the following self-care items (eating, oral care, toileting, showering, dressing upper body, dressing lower body and footwear) and 5 mobility items (roll left and right, lying to sitting, sit to stand, chair to bed transfer, toilet transfer), as well as 4 items of either walking (10ft, 50ft, 150ft, 10ft uneven surface) or wheelchair (50ft, 150ft; each wheelchair item is counted twice in scoring) depending on whether a person was walking at time of assessment, plus the 1-step, 4-step, and 12-step items and picking up an object.

FIM-M

Eating Eating includes the ability to use suitable utensils to bring food to the mouth, as well as the ability to chew and swallow the food once the meal is presented in the customary manner on a table or tray.

Grooming Grooming includes oral care, hair grooming (combing or brushing hair), washing the hands, washing the face, and either shaving the face or applying make-up. If the subject neither shaves nor applies make-up, Grooming includes only the first four tasks.

Bathing Bathing includes washing, rinsing, and drying the body from the neck down (excluding the back) in either a tub, shower, or sponge/bed bath.

Dressing Upper Body Dressing – Upper Body includes dressing and undressing above the waist, as well as applying and removing a prosthesis or orthosis when applicable. The patient performs this activity safely.

Dressing Lower Body Dressing – Lower Body includes dressing and undressing from the waist down, as well as applying and removing a prosthesis or orthosis when applicable. The patient performs this activity safely.

Toileting Toileting includes maintaining perineal hygiene and adjusting clothing before and after using a toilet, commode, bedpan, or urinal. The patient performs this activity safely.

Transfers: Bed, Chair, Wheelchair Transfers: Bed, Chair, Wheelchair includes all aspects of transferring from a bed to a chair and back, or from a bed to a wheelchair and back, or coming to a standing position if walking is the typical mode of locomotion.

Transfers: toilet Transfers: Toilet includes safely getting on and off a standard toilet.

Transfers: tub Tub includes getting into and out of a tub. The patient performs the activity safely.

Transfers: shower Shower includes getting into and out of a shower. The patient performs the activity safely.

Locomotion: Walk Locomotion: Walk includes walking on a level surface once in a standing position. The patient performs the activity safely. This is the first of two locomotion function modifiers.

Table III-2 cont.

Locomotion: Wheelchair Locomotion: Wheelchair includes using a wheelchair on a level surface once in a seated position. The patient performs the activity safely. This is the second locomotion function modifier.

Locomotion: Stairs Locomotion: Stairs includes going up and down 12 to 14 stairs (one flight) indoors in a safe manner.

FIM-M Total The total FIM-M subscale consists of the sum of the 6 items of self-care (eating, grooming, toileting, bathing, dressing upper body, dressing lower body) and seven items of mobility (bed/chair/wheelchair transfer, toilet transfer, tub/shower transfer, bowel, bladder, locomotion (walk or wheelchair), and stairs).

The codes for scoring both the FIM-M items and the CARE items can be seen in Table III-3: Scoring Codes for the FIM-M and CARE.

Table III-3: Scoring Codes for the FIM-M and CARE

CARE Level                                FIM-M Level
6 – Independence                          7 – Complete independence (timely, safely)
5 – Set-up or clean-up assistance         6 – Modified independence (device)
4 – Supervision or touching assistance    5 – Supervision
3 – Partial/moderate assistance           4 – Minimal assistance (subject 75%+)
2 – Substantial/maximal assistance        3 – Moderate assistance (subject 50%+)
1 – Dependent                             2 – Maximal assistance (subject 25%+)
                                          1 – Total assistance (subject 0%+)

Analysis Plan

Using the data available, each individual was randomly assigned to either “admission” or “discharge”, which determined whether that person’s admission assessments or discharge assessments were utilized for analysis. This maintains the assumption of independence, meaning that the observations between groups are independent. The analyzable dataset therefore contained a mixture of admission records and discharge records, but no single person had more than one entry in the dataset. The sample was then randomized into a training cohort and a validation cohort at a 70/30 percent split. The training dataset was used to create the crosswalks, whereas the separate validation dataset, which was independent of the training set, provided an unbiased evaluation of the crosswalks.
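The two randomization steps described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used in the study: the function name, the seed, the equal probability of admission versus discharge, and the sample of 1,000 identifiers are all hypothetical.

```python
import random

def assign_cohorts(person_ids, train_fraction=0.70, seed=2020):
    """Pick one assessment per person, then split people into two cohorts."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    ids = list(person_ids)

    # Step 1: each person contributes either the admission or the discharge
    # assessment, so no person has more than one record in the dataset.
    timepoint = {pid: rng.choice(["admission", "discharge"]) for pid in ids}

    # Step 2: shuffle the people and cut at the training fraction.
    shuffled = ids[:]
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    training = set(shuffled[:n_train])

    return {
        pid: (timepoint[pid], "training" if pid in training else "validation")
        for pid in ids
    }

# Hypothetical sample of 1,000 person identifiers.
assignments = assign_cohorts(range(1, 1001))
n_training = sum(1 for _, cohort in assignments.values() if cohort == "training")
```

Because the split is made on people rather than on records, no individual can appear in both the training and validation cohorts.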

Missing Data

Using the training cohort, a missing data analysis was conducted. Missingness was identified by creating a frequency distribution for all demographic variables as well as the total CARE and FIM-M scores. For the FIM-M subscale, all 13 items were summed, and any one missing item resulted in a missing total subscale score. Similarly, the CARE score was considered missing if any one of the items was missing; however, scoring rules were in place such that some missingness was recoded to the lowest score. For instance, if a person was not walking, all of the walking and stairs items were entered as “not applicable”. In this circumstance, to obtain a score, the walking items were replaced by the wheelchair items and the stairs scores were recoded to 1, as per instructions. In the same way, any item coded as “item not attempted due to severity or safety” was scored as 1. Any item that was skipped and/or classified as refused resulted in a missing calculated total score. For each measure, the percentage of missingness was shown with a frequency distribution. If more than 10% of the total scores were missing, an analysis comparing people with missing data vs. those with complete data was planned to identify potential bias in the results.
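The CARE scoring rules described above can be sketched as a small function. This is an illustrative Python sketch only: the item names, the code strings, and the choice to count wheelchair items solely as substitutes for walking items are assumptions made for the example, not the instrument's official coding.

```python
NOT_APPLICABLE = "NA"              # hypothetical code for "not applicable"
NOT_ATTEMPTED = "NOT_ATTEMPTED"    # hypothetical code for "not attempted due to severity or safety"

def care_total(items, wheelchair_for, stairs_items):
    """Sum CARE items; return None when the total must be treated as missing."""
    total = 0
    for name, value in items.items():
        if name in wheelchair_for.values():
            continue                       # wheelchair items count only as substitutes
        if value == NOT_APPLICABLE and name in wheelchair_for:
            value = items.get(wheelchair_for[name])   # walking item replaced by wheelchair item
        elif value == NOT_APPLICABLE and name in stairs_items:
            value = 1                      # stairs recoded to the lowest score
        if value == NOT_ATTEMPTED:
            value = 1                      # severity/safety recoded to the lowest score
        if not isinstance(value, int):
            return None                    # skipped, refused, or otherwise unscored
        total += value
    return total

WHEELCHAIR_FOR = {"walk_150ft": "wheel_150ft"}   # hypothetical item names
STAIRS = {"stairs_12"}
example = {
    "eating": 6,
    "toileting": NOT_ATTEMPTED,
    "walk_150ft": NOT_APPLICABLE,
    "wheel_150ft": 4,
    "stairs_12": NOT_APPLICABLE,
}
total = care_total(example, WHEELCHAIR_FOR, STAIRS)
```

Returning `None` rather than a partial sum mirrors the rule that a single unscored item makes the whole total missing.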

Analysis Plan for Research Aim 1

Aim 1 - Create bi-directional crosswalks between the FIM-M scale and the CARE that map the score from one scale to that of the other.

a. Using expert opinion, create a new scoring algorithm that utilizes the items and scoring that match according to the definitions of each scale.

b. Use equipercentile matching to determine the relationship in which a total score matches an equivalent percentile on the other instrument.

c. Use Rasch analysis to perform a common-person equating methodology that converts each scale's scores to logits from which the other score can be equated.

Using identified best practices in test equating, three methods of creating crosswalks were performed. The first method relied solely on content expert opinion, the second method utilized CTT to equate the assessments using an equipercentile framework, and the third method used more modern IRT methods, specifically common-person equating using Rasch analysis. For the expert opinion and Rasch methods, a traditional crosswalk was not produced. The expert opinion methodology created new scores, so there was no rescoring back to the original measure. Similarly, for the Rasch methodology, to retain the benefits of converting the measurement to logits, the logits were not calibrated back to the original measurement. For all three methodologies two “crosswalks” were created: one in which the original FIM-M scores mapped to a new CARE standard, and a second in which the original CARE scores mapped to a new FIM-M standard. Only the equipercentile method involved a true back scoring to the reference measure.

Expert Opinion

Using methodology similar to that of previous studies,18 an expert panel with knowledge of both the FIM and CARE tool as well as measurement theory was assembled. The panel included Gale Whiteneck, PhD, who has extensive knowledge of psychometrics and is intimately familiar with the dataset being used. The other panelist, besides the author of this dissertation, was Linda Jones, PhD, who had just completed her own dissertation on the creation of a crosswalk between the FIM and the Spinal Cord Independence Measure. Each panelist was provided a list of items from both the FIM-M and CARE tool and was instructed to map similar items based on the content of the questions and/or instructions from the assessment developers.7,95 Next, the expert panel evaluated the commonality of the coding metrics and identified where code levels could appropriately be grouped. Given that the item scale range is 1 to 7 for the FIM and 1 to 6 for the CARE, reviewing the definitions for each item necessitated the creation of a common score.

Once the paired items and codes were agreed upon by the expert panel, the

resulting decisions were vetted by clinical rehabilitation staff at the Rocky Mountain

Regional Brain Injury System to see if the decisions made were representative of clinical

knowledge. A physical therapist and an occupational therapist reviewed the item and score

mapping from the expert panel. They corroborated the expert panel’s decision, so based on

that opinion, new total scores were generated for both the CARE and FIM-M.

Equipercentile

Unlike linear linking, in which the means and SDs of the score distributions are matched between assessments, the equipercentile methodology matches the entire distribution. To conduct the analysis, a raw score frequency distribution and associated percentile ranks were obtained for both the FIM-M and CARE assessments. The cumulative frequency was plotted, and since the line did not reveal large steps between scores, additional smoothing methods were not performed. Smoothing methods would typically be used to reduce irregularities due to sampling error in either the score distributions or the equipercentile equating function itself.96 Using the equate package (available on the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=equate) in the statistical package R,97 scores having identical percentile ranks in the cumulative distributions of both assessments were considered equivalent, and an equivalency table was produced from them.
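The idea of matching percentile ranks can be illustrated with a short sketch. The Python code below is a simplified, unsmoothed version written for illustration; it is not the algorithm implemented in the R equate package, and the function names, the mid-percentile-rank convention, and the linear interpolation between observed scores are all assumptions of this example.

```python
from collections import Counter

def percentile_ranks(scores):
    """Mid-percentile rank (0-100) of each observed score value."""
    counts = Counter(scores)
    n = len(scores)
    ranks, below = {}, 0
    for value in sorted(counts):
        ranks[value] = 100.0 * (below + counts[value] / 2.0) / n
        below += counts[value]
    return ranks

def interpolate(x, xs, ys):
    """Linear interpolation of y at x; xs must be strictly increasing."""
    if x <= xs[0]:
        return float(ys[0])
    if x >= xs[-1]:
        return float(ys[-1])
    for i in range(1, len(xs)):
        if x <= xs[i]:
            weight = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + weight * (ys[i] - ys[i - 1])

def equipercentile_table(from_scores, to_scores):
    """Map each observed 'from' score to the 'to' score with the same percentile rank."""
    from_pr = percentile_ranks(from_scores)
    to_pr = percentile_ranks(to_scores)
    to_values = sorted(to_pr)
    to_ranks = [to_pr[v] for v in to_values]
    return {v: interpolate(pr, to_ranks, to_values) for v, pr in from_pr.items()}

# Toy data: the second scale is exactly twice the first, so the
# equivalents should simply double each score.
fim_scores = [13, 20, 20, 45, 91]
care_scores = [26, 40, 40, 90, 182]
fim_to_care = equipercentile_table(fim_scores, care_scores)
```

Calling the function with the arguments reversed gives the crosswalk in the other direction, which is how a bi-directional pair of tables would be built under this sketch.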

Rasch

For linking using Rasch, there only needs to be some commonality between the items and/or persons. Given that all participants were administered both the FIM and CARE, a common-person equating method using Rasch was conducted.98

Winsteps v4.4.0 was used to conduct the Rasch analyses. Two steps were completed to create both a FIM-M to CARE and a CARE to FIM-M recalibration. First, an item-person map was created for each measure (FIM-M and CARE), in which item difficulty and person ability were expressed on a common scale in logits. Second, each measure was calibrated to the reference measure. Importantly, the statistical power of using logits was retained: the crosswalk was assessed without converting the logits back to the original scoring.

Analysis Plan for Research Aim 2

Research Aim 2: Each crosswalk will be evaluated using methods to assess equating

methodology, resulting in recommendations regarding the use of crosswalks in TBI

research.


a. A RiU statistic was calculated and compared to a value considered sufficiently

large (50% RiU).99

b. The percent of subjects with linked scores within ½ SD to the actual score was

calculated. A threshold of 80% stood as a marker for valid equating.81

c. Population invariance was measured by comparing the actual total scores with the converted scores across important sub-populations: sex, minority status, severity of TBI, age, and cause of injury. A successful equating methodology would show the same directionality, with a difference in Standardized Mean Differences (SMD) of less than 0.08.99,100

d. The four moments of the distributions, including the means, SDs, skewness,

and kurtosis were calculated and compared.

e. Effect sizes were calculated to determine the percent overlap that exists

between the converted and actual scores.101

Given the lack of agreed-upon methods for evaluating crosswalks, for the purposes of this dissertation both statistical significance tests (measure of uncertainty statistic, population invariance, and effect size) and descriptive plots of statistical moments were used to evaluate the effectiveness of equating the FIM-M with the CARE.

Measure of Uncertainty Reduction

In 1969, McNemar102 described a statistical index, the coefficient of alienation (CoA), as:

$$\mathrm{CoA} = \sqrt{1 - r^2},$$

which measures the uncertainty about a dependent variable that is left after accounting for information from the predictor variable. Dorans99 defined the RiU as a percentage:

$$\mathrm{RiU} = \frac{\sigma_Y - \sigma_Y\sqrt{1 - r^2}}{\sigma_Y} = 1 - \sqrt{1 - r^2}$$

A 50% reduction therefore equates to exactly halfway between 100% reduction (r = 1) and 0% (r = 0), and the needed correlation coefficient to achieve the 50% reduction is 0.866. For each crosswalk the CoA and the RiU were calculated for comparison.
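As a worked illustration of these two indices, the short Python sketch below computes the CoA as the square root of (1 - r²) and the RiU as 100 × (1 - CoA). The function names are chosen for this example only.

```python
import math

def coefficient_of_alienation(r):
    """CoA = sqrt(1 - r^2): uncertainty left after accounting for the predictor."""
    return math.sqrt(1.0 - r * r)

def reduction_in_uncertainty(r):
    """RiU expressed as a percentage: 100 * (1 - CoA)."""
    return 100.0 * (1.0 - coefficient_of_alienation(r))

# A correlation of 0.866 sits at roughly the 50% threshold.
riu_at_threshold = reduction_in_uncertainty(0.866)
meets_standard = reduction_in_uncertainty(0.95) >= 50.0
```

Because the RiU depends only on r, any crosswalk with a correlation of about 0.866 or higher meets the 50% standard.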

Population Invariance

A second important quality to assess is population invariance. Lord and Wingersky103 indicated that, under certain theoretical conditions, true score equating methods are group invariant. A successful equating relationship will be the same regardless of the group of subjects used to conduct the equating. Dorans and Holland99 developed procedures and statistical methodology for investigating group invariance. However, while fitting separate models for each population and comparing them to the aggregated model is theoretically worthwhile, having just one model is more practical. Therefore, a standardized mean difference (SMD) metric was calculated.91 The groups assessed for invariance were defined by sex, minority status, age, severity of TBI, and cause of injury. A lower SMD indicates less difference between the selected groups. Based on Dorans's work in 2000, a cutoff of 0.08 was chosen: a difference in SMD greater than 0.08 was taken to indicate a meaningful difference in the distribution between each crosswalk.
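The invariance check can be sketched as follows. This Python example assumes that the SMD is the difference in subgroup means divided by the pooled SD of the two subgroups, and that the comparison of interest is the difference between the SMDs obtained on the two measures; the function names and toy data are illustrative only.

```python
import math
from statistics import mean, variance

def pooled_sd(group_a, group_b):
    """Pooled standard deviation of two independent groups."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return math.sqrt(pooled_var)

def smd(group_a, group_b):
    """Standardized mean difference between two subgroups on one measure."""
    return (mean(group_a) - mean(group_b)) / pooled_sd(group_a, group_b)

def is_invariant(smd_actual, smd_converted, cutoff=0.08):
    """True when the two SMDs differ by no more than the cutoff."""
    return abs(smd_actual - smd_converted) <= cutoff

# Toy subgroup scores on a single measure.
example_smd = smd([1, 2, 3], [2, 3, 4])
```

In use, `smd` would be computed once on the actual scores and once on the converted scores for the same pair of subgroups, and `is_invariant` would compare the two values.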

Comparison of Statistical Moments

Statistical moments are used to quantify characteristics about the shape of the distribution. The mean and SD provide information on the location and variability (spread and dispersion) of a set of numbers. These are referred to as the first two statistical moments of the distribution. The third and fourth moments are skewness and kurtosis.

Skewness is a measure of the symmetry of the shape of the distribution. If the shape is perfectly symmetrical the skewness will be zero. However, if there is “a tail” the skewness

will be positive or negative depending on the direction of “the tail”. Kurtosis, on the other

hand, will measure the flatness or peakedness of a distribution. High peaks are labeled as

leptokurtic and flat shaped peaks are called platykurtic.

For all scores on the original outcome and the equated scores, the distributions were plotted and compared. In addition, the directionality and magnitude of the moments were compared. Cooper100 and Byers99 provided guidance on evaluating the moments to assess the similarity of the distributions. Using their definition, a successful crosswalk between the FIM-M and the CARE should show that 1) the means of the converted and actual outcome measures are within one SD of each other, 2) the SDs of the two measures are within one unit of each other, and 3) the 95% CIs for both kurtosis and skewness overlap.
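The four moments can be computed directly from a list of scores, as in the Python sketch below. It assumes moment-based (population) formulas for skewness and reports kurtosis as excess kurtosis, so that a normal distribution gives 0 and a platykurtic distribution gives a negative value; the software used for the dissertation may have applied different small-sample corrections, and the confidence intervals are not reproduced here.

```python
from statistics import mean, stdev

def moments(scores):
    """Mean, sample SD, skewness, and excess kurtosis of a list of scores."""
    n = len(scores)
    m = mean(scores)
    m2 = sum((x - m) ** 2 for x in scores) / n   # second central moment
    m3 = sum((x - m) ** 3 for x in scores) / n   # third central moment
    m4 = sum((x - m) ** 4 for x in scores) / n   # fourth central moment
    return {
        "mean": m,
        "sd": stdev(scores),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,          # excess kurtosis
    }

flat = moments([1, 2, 3, 4, 5])   # symmetric and flat: zero skew, negative kurtosis
```

Comparing the dictionaries returned for the converted and actual scores gives the differences in means and SDs described in the criteria above.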

Comparison of Effect Sizes

Other than visually comparing the overlap between two distributions, calculating the percent overlap can be difficult. An effect size, while usually a tool for measuring the strength of the relationship between two variables, can be used to ascertain the similarity of the distributions. Cohen’s D is a common effect size statistic that is calculated as the mean difference between two measures divided by the pooled SD. Smaller effect sizes indicate a higher percentage of overlap between the two measures.101 The Cohen’s D effect size was computed between the converted and actual outcome measures. An effect size of less than 0.2 was considered to indicate the overlap necessary for successful linking.
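The link between effect size and overlap can be made concrete with a small sketch. The Python code below computes Cohen's D from a mean difference and a pooled SD, and then converts it to an overlapping coefficient under the assumption of two normal distributions with equal variance. That overlap formula is one common definition and is an assumption of this example; it is not necessarily the measure used in the cited reference.

```python
from statistics import NormalDist

def cohens_d(mean_difference, pooled_sd):
    """Cohen's D from summary statistics."""
    return mean_difference / pooled_sd

def overlap_coefficient(d):
    """Shared area of two equal-variance normal curves whose means differ by d SDs."""
    return 2.0 * NormalDist().cdf(-abs(d) / 2.0)

d_small = cohens_d(-0.02, 8.63)            # a very small mean difference
overlap_at_cutoff = overlap_coefficient(0.2)
```

Under this definition an effect size of 0 corresponds to complete overlap, and the overlap shrinks as the absolute effect size grows.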

Validation Sample

The crosswalks that were created in Aim 1 using the 70% training sample were then applied to the 30% validation cohort. These crosswalks were then assessed using the same methodology as in Aim 2. The results, when compared with the evaluation of the training dataset, determine whether the crosswalks maintain similar characteristics in a sample not used to create them.


CHAPTER IV.

RESULTS OF ANALYSIS

Sample Demographics

Data from 982 subjects were available for these analyses. Data were randomly split using a 70/30 ratio, yielding a training sample of 684 individuals and a validation sample of 298. Subjects were then randomized as to whether the rehabilitation admission or the discharge test scores would be used. For the training dataset, 339 (49.7%) of the test administrations were from the discharge scores, and a similar proportion (n = 148, 49.6%) was found in the validation dataset.

Testing invariance using SMD requires that the demographics be dichotomized. For all the crosswalks each demographic variable was dichotomized as follows: race/ethnicity categories that were non-White or Hispanic were considered minorities whereas all people identified as White were considered non-minority; cause of injury was dichotomized as vehicular vs. non-vehicular etiologies; age was broken down as less than 26 years of age vs. 26 and older at the time of injury; and, using several severity measures (TFC, PTA and GCS), an index severity measure created mild-moderate vs. severe categories. To calculate the severity index, each severity variable was categorized as mild, moderate or severe.104 Table IV-1 shows the categorization scheme used. For the dichotomous variable, a person was classified as severe if any severity variable classified that person as severe; otherwise the person was classified into the mild/moderate category.


Table IV-1: Categorization of Severity of TBI

Variable Severe Moderate Mild Time to Follow Commands >= 1 day < 1 day Days of Post Traumatic >14 days 1 to 14 days 0 days Amnesia Glasgow Coma Scale Score 3 -8 9-12 13-15 (upon admission to the Emergency department)
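The dichotomous severity rule can be sketched as follows, using the "Severe" thresholds from Table IV-1 (TFC of at least 1 day, PTA of more than 14 days, GCS of 3 to 8). This Python sketch is illustrative; the function names are hypothetical, and ignoring a missing indicator rather than treating the case as missing is an assumption of the example.

```python
def is_severe(tfc_days=None, pta_days=None, gcs=None):
    """True if any available indicator falls in the 'Severe' column of Table IV-1."""
    checks = []
    if tfc_days is not None:
        checks.append(tfc_days >= 1)      # time to follow commands of 1 day or more
    if pta_days is not None:
        checks.append(pta_days > 14)      # more than 14 days of post-traumatic amnesia
    if gcs is not None:
        checks.append(3 <= gcs <= 8)      # Glasgow Coma Scale score of 3 to 8
    return any(checks)

def severity_group(tfc_days=None, pta_days=None, gcs=None):
    """Dichotomous label used for the invariance comparisons."""
    return "severe" if is_severe(tfc_days, pta_days, gcs) else "mild/moderate"

group_a = severity_group(tfc_days=0.5, pta_days=20, gcs=14)   # PTA alone is severe
group_b = severity_group(tfc_days=0.5, pta_days=3, gcs=14)    # no indicator is severe
```

A single severe indicator is enough to place a person in the severe group, which is consistent with the large share of the sample classified as severe.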

Finally the sex variable was already in a dichotomous form (male vs. female). All

dichotomous variables are designated in bold in Table IV-2.

Basic demographic characteristics for the training sample and the validation sample are shown in Table IV-2. In general, the sample was primarily male, white, and young, had sustained a severe injury, and had been injured in a vehicular crash. A chi-square test of significance for categorical variables and the Kruskal-Wallis rank sum test for continuous variables were used to identify potential differences between the samples on these demographic characteristics. All continuous variables (Age, GCS, TFC Days, PTA Days) were determined not to be normally distributed, each having p<.05 on the Shapiro-Wilk normality test. PTA was calculated as the days between injury and the date of emergence. TFC was calculated as the days between injury and the ability to follow simple motor commands. If the person was still in PTA at the time of discharge, the date of discharge was used as the endpoint for the calculation. The length of PTA was significantly longer for the validation sample than the training sample (p = 0.02). There were no other significant differences between the samples regarding demographic or injury severity characteristics (all p’s > 0.05).


Table IV-2: Demographics of the samples

Variable                         Training sample     Validation sample   Significance
                                 (N = 684)           (N = 298)
n                                684                 298
Male Sex, count (%)              514 (75.1)          227 (76.4)          0.727
Race/ethnicity, count (%)
  Asian/Pacific Islander         17 (2.5)            8 (2.7)
  Black                          131 (19.2)          55 (18.5)
  Hispanic Origin                70 (10.2)           32 (10.7)
  Native American                5 (0.7)             4 (1.3)
  Other                          16 (2.3)            2 (0.7)
  Unknown                        1 (0.1)             0 (0.0)
  White                          444 (64.9)          197 (66.1)
Non-Minority, count (%)          444 (64.9)          197 (66.1)          0.773
Etiology, count (%)
  Falls                          222 (32.5)          101 (33.9)
  Other                          82 (12.0)           29 (9.7)
  Sports                         16 (2.3)            12 (4.0)
  Vehicular                      293 (42.9)          122 (40.9)
  Violence                       70 (10.2)           34 (11.4)
Vehicular Etiology, count (%)    293 (42.8)          122 (40.9)          0.629
Age, count (%)
  16-25                          163 (23.9)          74 (24.8)
  26-35                          131 (19.2)          53 (17.8)
  36-45                          77 (11.3)           30 (10.1)
  46-55                          95 (13.9)           39 (13.1)
  56-65                          99 (14.5)           51 (17.1)
  66-75                          60 (8.8)            29 (9.7)
  76-85                          49 (7.2)            17 (5.7)
  86+                            9 (1.3)             5 (1.7)
Age, mean (SD)                   44.60 (20.0)        44.16 (19.9)        0.748
Age Groups 26-86+, count (%)     520 (76.0)          224 (75.2)          0.836
GCS, median [IQR]                13.0 [7.0, 15.0]    14.0 [8.0, 15.0]    0.092
PTA, median [IQR]                16.0 [4.0, 31.0]                        0.002*
TFC, median [IQR]                1.0 [0.5, 6.0]      1.0 [0.5, 4.0]      0.059
Severity, count (%)                                                      0.722
  Mild                           84 (12.4)           37 (12.4)
  Moderate                       85 (12.6)           43 (14.4)
  Severe                         508 (75.0)          218 (73.2)
Severe TBI, count (%)            508 (75.0)          218 (73.2)          0.588

Sample Functional Outcome Measure Scores

Total scores for both FIM-M and CARE were slightly lower in the validation sample than in the training sample; however, neither difference was statistically significant (see Table IV-3). The mean CARE score in the training sample (mean = 81.64, SD = 31.07) compared to the validation sample (mean = 78.11, SD = 33.10) showed a mean difference of 3.53. Similarly, the mean difference between the training sample and the validation sample for the FIM-M was 1.60 (training: mean = 52.72, SD = 21.68; validation: mean = 51.12, SD = 22.83). Figure IV-1 shows that scores for both the FIM-M and CARE were spread across the range of each instrument in the training sample. Not surprisingly, lower scores are seen with the admission administration. The lines represent linear regression fits by administration type.

Table IV-3: Functional Outcome Measure Scores

Outcome                    Training sample   Validation sample   Significance
n                          684               298
CARE Total, mean (SD)      81.64 (31.1)      78.1 (33.1)         0.13
FIM-M Total, mean (SD)     52.72 (21.7)      51.1 (22.8)         0.30


Figure IV-1: Scatterplot of FIM-M and CARE total scores by time of administration

A = admission; D = discharge

The correlation between the FIM-M Total and CARE Total was 0.952, yielding a RiU of 69.5%, well above the established threshold of 50%.

Missing Data

None of the variables used in the analysis showed more than 10% missing. The CARE total score was missing for 6.7% (n=46) of cases in the training sample and 7.7% (n=23) of cases in the validation sample. The FIM-M was missing for three people (0.4%) in the training sample and only 1 person (0.3%) in the validation sample. The only missing data for the demographic variables was 1 person missing sex in the validation sample. Any case that was missing data was removed from further analysis.

Expert Opinion

Rescoring of the FIM-M and CARE

Utilizing the expert opinion process, nine variables from each FIM-M and CARE were

utilized in creating new scores. Domains covered were Eating, Toileting, Bathing, Dressing

upper body and lower body, Chair and Toilet transfer, Locomotion, and Stairs. Each domain

was re-scored between 1 and 4 which gave the converted total scores a range between 9

and 36. The following assessments will compare the CARE newly computed score (CAREeo)

and the FIM-M newly computed score (FIMeo). Table IV-4 shows the specifics of the coding.

In general, all FIM-M items used the same conversion system. All but one CARE item (Dressing lower body) used the same scheme. For Dressing lower body, the CARE item that addressed putting on footwear was also included: the “lowest”, or most severe, score of the dressing and footwear items took precedence in the final conversion.

Table IV-4: Expert Opinion FIM-M and CARE Item Conversion Scores

FIM-M
  Items: Feeding; Bathing; Toileting; Dressing upper body; Dressing lower body; Bed/Chair transfer; Toilet transfer; Locomotion; Stairs

  Original Score   Converted score
  1                1
  2                1
  3                2
  4                3
  5                3
  6                4
  7                4

CARE
  Items: Eating; Showering; Toileting; Dressing upper body; Dressing lower body; Chair transfer; Toilet transfer; Walking/Wheel 150ft; 12 Steps

  Original Score   Converted score
  1                1
  2                1
  3                2
  4                3
  5                3
  6                4
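The rescoring in Table IV-4 amounts to a lookup from each original item score to a common 1 to 4 score, followed by a sum over the nine domains. The Python sketch below illustrates this; the dictionary keys used for the CARE domains and the function names are hypothetical, and the footwear rule is implemented as taking the lower of the two original CARE scores before conversion.

```python
# Original score -> converted score, as listed in Table IV-4.
FIM_TO_COMMON = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3, 6: 4, 7: 4}
CARE_TO_COMMON = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3, 6: 4}

def fim_eo(item_scores):
    """Converted FIM-M total (FIMeo) from nine original item scores."""
    return sum(FIM_TO_COMMON[score] for score in item_scores)

def care_eo(domain_scores, footwear_score):
    """Converted CARE total (CAREeo); the lower of dressing and footwear is used."""
    scores = dict(domain_scores)
    scores["dressing_lower"] = min(scores["dressing_lower"], footwear_score)
    return sum(CARE_TO_COMMON[score] for score in scores.values())

care_domains = {
    "eating": 6, "showering": 6, "toileting": 6, "dressing_upper": 6,
    "dressing_lower": 6, "chair_transfer": 6, "toilet_transfer": 6,
    "walk_wheel_150ft": 6, "steps_12": 6,
}
best_fim = fim_eo([7] * 9)
care_with_footwear_limit = care_eo(care_domains, footwear_score=3)
```

With nine domains each scored 1 to 4, the converted totals range from 9 to 36 on both instruments.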

The scatterplot and distributional shapes after rescoring both the FIM-M and the CARE using expert opinion are shown in Figure IV-2. The blue line represents a regression line, and the marginal graphs show the distribution of each variable. The graph reveals a strong positive linear relationship between the scores (Slope = 0.94, p <.001) and shows similar scatter (homoscedasticity) across values of FIMeo and CAREeo.

Figure IV-2: Scatterplot of CAREeo and FIMeo


Assessment of crosswalk

Reduction in Uncertainty

In the training sample, the original CARE and original FIM-M had an overall correlation of 0.95 with a RiU of 69.3%. The correlation between the newly calculated CARE

(CAREeo) and FIM-M (FIMeo) showed a slightly higher correlation of 0.97 with a RiU of

75.0%. Table IV-5 displays the correlation coefficient, r2, the CoA and the RiU for both the original scores and the scores converted from expert opinion. The validation sample shows similar results; however, the correlations for both the actual scores and the calculated scores were slightly smaller (r=0.94 and r=0.96, respectively).

Table IV-6: Correlations and RiU between the CAREeo and FIMeo

Sample       Crosswalk                  Correlation   Correlation2   CoA    RiU    RiU Percent
Training     Actual FIM-M and CARE      0.95          0.91           0.31   0.69   69.3
             Converted FIM-M and CARE   0.97          0.94           0.25   0.75   75.0
Validation   Actual FIM-M and CARE      0.94          0.88           0.34   0.66   66.1
             Converted FIM-M and CARE   0.97          0.94           0.25   0.75   74.7

Percent scores within 0.5 SD

Over 93% of the scores from either CAREeo to FIMeo or from FIMeo to CAREeo fell within ½ SD of the score of the target measure (see Table IV-6). The table shows the sample size (N), the ½ SD of the target assessment (Target), and the percentage of scores that fell within ½ SD of the target assessment (Percent).


Table IV-7: Percent of the scores within 1/2 SD of other assessment using expert opinion method

Sample       Crosswalk          N     Target   Percent
Training     CAREeo to FIMeo    613   4.3      94.6
             FIMeo to CAREeo    613   4.3      94.6
Validation   CAREeo to FIMeo    272   4.0      93.8
             FIMeo to CAREeo    272   4.1      93.8
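The ½ SD criterion can be sketched in a few lines of Python. The example assumes that the tolerance is half of the sample SD of the target (actual) scores and that linked and actual scores are paired by person; the function name and toy values are illustrative.

```python
from statistics import stdev

def percent_within_half_sd(linked_scores, actual_scores):
    """Percent of linked scores lying within half an SD of the paired actual score."""
    half_sd = 0.5 * stdev(actual_scores)          # tolerance taken from the target measure
    hits = sum(
        1 for linked, actual in zip(linked_scores, actual_scores)
        if abs(linked - actual) <= half_sd
    )
    return 100.0 * hits / len(actual_scores)

actual = [10, 20, 30, 40]
linked = [11, 19, 45, 40]      # the third linked score misses by 15 points
percent = percent_within_half_sd(linked, actual)
```

The resulting percentage would then be compared against the 80% marker for valid equating described in the analysis plan.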

Population Invariance

To assess population invariance, mean differences in each outcome (FIMeo and CAREeo) were assessed by the dichotomous demographic variables: Sex, Severity, Race, Age and Cause (see Table IV-7). An invariance statistic (SMD) was produced for each assessment, and the difference in subpopulation invariance was calculated. Differences in invariance statistics greater than 0.08 in absolute value indicate that there are differences in how the crosswalk scores behave in relation to the demographic variable.

Table IV-8: Demographic Population Invariance for FIMeo and CAREeo

Sample       Demographic                 Outcome   Mean Difference   Pooled SD   Invariance   CI Low   CI High
Training     Female - Male               CAREeo    -0.11             8.64        -0.01        -0.20    0.17
                                         FIMeo     -0.08             8.63        -0.01        -0.19    0.17
             Mild/moderate - Severe      CAREeo    1.44              8.62        0.17         -0.02    0.35
                                         FIMeo     1.54              8.61        0.18         0.00     0.36
             Non-minority - Minority     CAREeo    0.04              8.64        0.00         -0.16    0.17
                                         FIMeo     -0.08             8.63        -0.01        -0.18    0.16
             26+ - <26                   CAREeo    -0.66             8.64        -0.08        -0.26    0.11
                                         FIMeo     -0.45             8.63        -0.05        -0.24    0.13
             Non-vehicular - Vehicular   CAREeo    -0.30             8.64        -0.04        -0.20    0.13
                                         FIMeo     -0.23             8.63        -0.03        -0.19    0.13
Validation   Female - Male               CAREeo    0.70              8.09        0.09         -0.20    0.37
                                         FIMeo     0.60              8.29        0.07         -0.21    0.36
             Mild/moderate - Severe      CAREeo    1.02              8.08        0.13         -0.14    0.39
                                         FIMeo     1.21              8.28        0.15         -0.12    0.41
             Non-minority - Minority     CAREeo    -1.07             8.08        -0.13        -0.38    0.12
                                         FIMeo     -1.09             8.28        -0.13        -0.38    0.12
             26+ - <26                   CAREeo    0.35              8.10        0.04         -0.23    0.32
                                         FIMeo     0.57              8.29        0.07         -0.21    0.34
             Non-vehicular - Vehicular   CAREeo    0.85              8.09        0.11         -0.14    0.35
                                         FIMeo     0.76              8.29        0.09         -0.15    0.34

For each demographic variable except Race/ethnicity, the direction (+/-) of the mean difference was the same for the CAREeo and FIMeo scoring in the training sample. Males, persons over 26 years of age at injury, and persons with vehicular-related injuries all scored higher than the comparison group on both measures (CAREeo and FIMeo). Those with more severe injuries scored on average lower than persons with moderate or mild injuries on both assessments. Race/ethnicity was the only variable that was mixed in direction of effect, evidenced by minorities scoring higher than non-minorities on the CAREeo yet the opposite on the FIMeo. For the validation sample, the direction of the mean difference was consistent between measures for each demographic variable; however, sex, age and cause of injury differed in direction from the training sample. Table IV-8 shows that the difference in SMD was not larger than 0.08 for any demographic variable, indicating that the differences found in demographics were similar in both assessments for both the training and validation samples.


Table IV-10: Difference in SMD for Demographic variables for FIMeo and CAREeo

Sample       Demographic   Invariance Difference
Training     Sex           -0.003
             Severity      -0.012
             Race          0.017
             Age           -0.026
             Cause         -0.009
Validation   Sex           0.015
             Severity      -0.019
             Race          -0.003
             Age           -0.027
             Cause         0.015

Statistical Moments

In the training sample, the mean difference between the CAREeo and FIMeo was -0.02, well within the range of one SD (see Table IV-9). Furthermore, the difference in SDs (0.01) was less than one unit. Similar statistics were noted in the validation sample with a small mean difference (0.16) and SD difference (-0.20).

Table IV-11: Statistical moments 1 and 2 for CAREeo and FIMeo

Sample       CAREeo Mean   CAREeo SD   FIMeo Mean   FIMeo SD   Mean Difference   SD Difference
Training     22.86         8.63        22.88        8.62       -0.02             0.01
Validation   23.43         8.08        23.26        8.28       0.16              -0.20

Both CAREeo and FIMeo revealed a slightly negative skew (x = -0.22, 95% CI = [-.41, -0.32] and x = -0.21, 95% CI = [-0.40, -0.02]), with the Pearson’s kurtosis coefficients (x = -1.18, 95% CI = [-1.14, -0.99] and x = -1.18, 95% CI = [-1.37, -1.00]) revealing a platykurtic distribution in the training sample. All coefficients for both skew and kurtosis fell within each other’s 95% CI (see Table IV-10). The shape of the distribution in the validation sample also revealed a slightly negative skew as well as a negative kurtosis coefficient. Neither skew nor kurtosis was outside the 95% CI of the other assessment.

Table IV-12: Statistical moments 3 and 4 for CAREeo and FIMeo

Sample       Crosswalk   Moment     Coefficient   SE     Low     High
Training     CAREeo      Skewness   -0.22         0.1    -0.42   -0.03
             FIMeo       Skewness   -0.27         0.1    -0.47   -0.07
             CAREeo      Kurtosis   -1.17         0.1    -1.36   -0.97
             FIMeo       Kurtosis   -1.13         0.1    -1.33   -0.93
Validation   CAREeo      Skewness   -0.19         0.15   -0.49   0.11
             FIMeo       Skewness   -0.24         0.15   -0.54   0.05
             CAREeo      Kurtosis   -0.96         0.15   -1.26   -0.67
             FIMeo       Kurtosis   -0.99         0.15   -1.28   -0.69

Effect Size

Table IV-11 shows the Cohen’s D effect size calculated from the difference of the means between the CAREeo and FIMeo for both the training sample (-.003) and the validation sample (0.019). Both are indicative of negligible differences in the crosswalked distributions.

Table IV-13: Effect sizes of CAREeo and FIMeo crosswalks by expert opinion method

Sample       Outcome          Mean Difference   Pooled SD   Cohen’s D   CI Low   CI High
Training     CAREeo - FIMeo   -0.02             8.63        <0.00       -0.11    0.11
Validation   CAREeo - FIMeo   0.16              8.18        0.02        -0.15    0.19

Equipercentile

One crosswalk was produced and applied to each sample (training/validation) using

the equipercentile method. To accomplish the crosswalk, the R package equate was used to

create concordance tables from the raw frequency distributions of both the CARE and the

FIM-M. In the first crosswalk, the equipercentile equivalent of a FIM-M score on the CARE


scale was calculated by finding the percentile rank in FIM-M of a particular score, and then finding the CARE score associated with that CARE percentile rank. The second crosswalk was created the same way except that the equipercentile equivalent of a CARE score on the

FIM-M scale was calculated. The concordance table (Table IV-12) is shown below.

Table IV-14: Concordance table CARE to FIM-M and FIM-M to CARE

CARE Score   Computed FIM-M   SE       FIM-M Score   Computed CARE   SE
22           12.767           0.085    13            22.437          0.265
23           13.078           0.137    14            27.142          1.016
24           13.178           0.153    15            27.983          2.411
25           13.289           0.169    16            29.237          1.282
26           13.389           0.184    17            30.900          1.373
27           13.772           1.615    18            32.320          2.323
28           15.014           2.059    19            33.207          1.497
29           15.831           0.896    20            33.781          2.444
30           16.410           0.945    21            34.994          2.093
31           17.085           1.177    22            36.987          4.326
32           17.810           1.361    23            38.186          1.661
33           18.646           2.545    24            41.504          1.365
34           20.389           4.380    25            42.067          1.381
35           21.003           1.215    26            42.757          2.819
36           21.440           1.246    27            43.860          3.576
37           22.009           2.768    28            45.322          2.432
38           22.841           1.409    29            46.006          1.638
39           23.375           1.439    30            46.422          1.659
40           23.721           3.624    31            48.778          3.024
41           23.988           3.633    32            49.886          3.822
42           24.911           1.842    33            50.775          5.133
43           26.324           3.779    34            53.003          5.177
44           27.086           2.181    35            54.522          2.618
45           27.771           1.719    36            55.146          2.629
46           28.980           5.265    37            56.041          5.297
47           30.362           3.200    38            57.173          2.289
48           30.681           2.007    39            58.261          3.237
49           31.148           2.020    40            59.840          5.429
50           32.162           5.448    41            60.886          2.337
51           33.180           4.113    42            62.007          1.652
52           33.643           2.753    43            64.178          5.575
53           33.999           2.760    44            65.423          8.395
54           34.621           2.782    45            67.457          5.635
55           35.532           8.428    46            69.530          2.842
56           36.978           2.823    47            71.397          3.435
57           37.815           2.439    48            73.136          1.917
58           38.768           2.873    49            74.156          1.733
59           39.391           2.890    50            75.476          2.184
60           40.171           5.800    51            76.514          2.190
61           41.121           2.498    52            77.796          3.506
62           41.994           1.356    53            79.965          2.190
63           42.446           1.370    54            81.518          3.514
64           42.857           4.459    55            83.057          2.187
65           43.549           8.954    56            84.549          2.919
66           44.600           1.379    57            86.833          3.488
67           44.887           1.380    58            88.504          2.176
68           45.093           1.383    59            89.441          2.167
69           45.505           2.016    60            90.224          1.564
70           46.335           2.024    61            91.715          1.314
71           46.837           1.403    62            92.436          1.308
72           47.248           1.409    63            93.812          1.395
73           47.881           1.672    64            95.301          1.832
74           48.833           1.850    65            96.779          1.807
75           49.710           1.326    66            97.706          1.145
76           50.396           1.332    67            98.531          3.953
77           51.513           1.560    68            101.572         1.709
78           52.091           1.556    69            103.137         2.495
79           52.491           1.563    70            105.408         4.899
80           53.025           1.557    71            108.935         14.254
81           53.654           2.341    72            111.758         1.975
82           54.322           2.340    73            112.957         3.382
83           54.959           1.555    74            113.705         1.331
84           55.609           2.077    75            114.173         1.306
85           56.321           2.073    76            116.290         3.173
86           56.742           1.692    77            119.707         1.532
87           57.081           1.688    78            120.468         1.494
88           57.640           2.064    79            121.905         1.645
89           58.471           2.060    80            123.555         1.848
90           59.761           1.672    81            124.722         2.652
91           60.489           1.669    82            125.658         1.469
92           61.395           1.823    83            126.126         1.428
93           62.537           0.951    84            126.845         1.604
94           63.126           0.936    85            128.119         1.804
95           63.794           1.259    86            129.927         0.523
96           64.365           1.250    87            130.756         2.313
97           65.265           2.166    88            131.641         0.249
98           66.366           1.420    89            131.934         0.186
99           67.167           1.396    90            132.207         0.121
100          67.514           0.879    91            132.402         0.065
101          67.795           0.865
102          68.217           0.852
103          68.874           2.286
104          69.580           1.321
105          69.891           1.304
106          70.114           1.296
107          70.513           1.190
108          70.882           1.168
109          71.005           1.163
110          71.252           1.155
111          71.573           2.132
112          72.259           2.098
113          73.023           1.800
114          74.795           1.562
115          75.506           1.534
116          75.862           1.507
117          76.218           1.489
118          76.455           1.483
119          76.671           1.890
120          77.357           1.852
121          78.312           2.103
122          79.101           1.751
123          79.783           1.323
124          80.317           1.288
125          81.237           2.246
126          82.686           3.603
127          84.142           1.455
128          84.921           1.207
129          85.388           1.171
130          86.084           0.592
131          87.130           1.209
132          89.188           0.545

A graphical representation of the concordance table can be seen in Figure IV-3 and Figure IV-4. Each figure displays the identity line, which is where the scores would fall if they were on the same scale and no conversion were applied, and the equipercentile line, which shows the applied conversion.

Figure IV-3: Graphical representation of CARE to FIM-M concordance table


Figure IV-4: Graphical representation of FIM-M to CARE concordance table

Once the concordance tables were applied, new scores (CAREfromFIM and FIMfromCARE) were added to the dataset. Figure IV-5 and Figure IV-6 display the scatterplots as well as the score distributions for both measures. There is a strong linear relationship between the CARE Total and CAREfromFIM (Slope = 0.95, p <.001) as well as between the FIM Total and FIMfromCARE (Slope = 0.96, p <.001). Each scatterplot below shows the regression line in blue, with each scale's distribution shown in the margins of the graph.
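Applying a concordance table is a direct lookup from an observed total to its computed equivalent. The Python sketch below uses only the first four CARE to FIM-M rows of the concordance table as an excerpt; the record layout and the function name are hypothetical.

```python
# Excerpt only: the first four CARE -> computed FIM-M rows of the concordance table.
CARE_TO_FIM = {22: 12.767, 23: 13.078, 24: 13.178, 25: 13.289}

def fim_from_care(care_total, table=CARE_TO_FIM):
    """Return the computed FIM-M equivalent, or None if the CARE total is missing or not in the table."""
    if care_total is None:
        return None
    return table.get(care_total)

records = [{"care": 22}, {"care": 25}, {"care": None}]
for record in records:
    # add the crosswalked score alongside the original total
    record["fim_from_care"] = fim_from_care(record["care"])
```

The same pattern, with the FIM-M to CARE half of the table, would produce the CAREfromFIM score.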


Figure IV-5: Scatterplot of CARE using the equipercentile method


Figure IV-6: Scatterplot of FIM-M using the equipercentile method

Assessment of crosswalk

Reduction in Uncertainty

Both crosswalk correlations, FIMfromCARE to FIM-M and CAREfromFIM to CARE, were high (r=.954 and r=.953, respectively), resulting in a RiU percentage of 70.0% for the FIM-M crosswalk and 69.8% for the CARE crosswalk. Similar but smaller coefficients were found in the validation sample, in which all correlations were 0.94 with corresponding percent RiU values of 65.1%-65.7%.

Table IV-15: Correlations and RiU between the assessments using equipercentile method

Sample      Crosswalk             Correlation  Correlation²  CoA   RiU   RiU Percent
Training    FIM-M - CARE          0.95         0.91          0.3   0.7   69.5
            FIM-M - FIMfromCARE   0.95         0.91          0.3   0.7   70.0
            CARE - CAREfromFIM    0.95         0.91          0.3   0.7   69.8
Validation  FIM-M - CARE          0.94         0.88          0.35  0.65  65.1
            FIM-M - FIMfromCARE   0.94         0.88          0.34  0.66  65.5
            CARE - CAREfromFIM    0.94         0.88          0.34  0.66  65.7
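The columns of the table above are consistent with deriving the RiU from the correlation through the coefficient of alienation (CoA), that is, CoA = sqrt(1 - r²) and RiU = 1 - CoA. The Python sketch below illustrates that arithmetic under this assumption; the function name is hypothetical.

```python
import math


def reduction_in_uncertainty(r: float) -> float:
    """RiU = 1 - sqrt(1 - r^2), where sqrt(1 - r^2) is the coefficient of alienation."""
    coefficient_of_alienation = math.sqrt(1.0 - r ** 2)
    return 1.0 - coefficient_of_alienation


if __name__ == "__main__":
    # Training-sample correlation between FIM-M and FIMfromCARE.
    riu_percent = 100.0 * reduction_in_uncertainty(0.954)
    print(round(riu_percent, 1))  # 70.0
```

Because the relationship is nonlinear in r, small changes in a high correlation produce noticeably different RiU values, which is why the tabled percentages differ even where the rounded correlations are identical.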

Percent scores within ½ SD

In the training sample, over 90% of the converted scores fell within ½ SD of the original score (identified in the table as the Target) for both the CARE (91.2%) and the FIM-M (91.7%). In the validation sample, over 88% of scores fell within ½ SD (see Table IV-14).

Table IV-16: Percent of the scores within 1/2 SD of original score using equipercentile method

Sample      Crosswalk  N    Target  Percent
Training    CARE       635  16.55   0.91
            FIM-M      635  11.42   0.92
Validation  CARE       274  15.53   0.88
            FIM-M      274  10.84   0.89
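The Target values in the table equal half of the SD of the original measure (for example, 16.55 is half of the training CARE SD of 33.10). A minimal Python sketch of this criterion follows. It assumes the sample SD of the original scores and an inclusive comparison; the function name and the toy data are hypothetical and are not study data.

```python
from statistics import stdev
from typing import Sequence, Tuple


def percent_within_half_sd(original: Sequence[float],
                           converted: Sequence[float]) -> Tuple[float, float]:
    """Return (target, percent) where target = 0.5 * SD of the original scores
    and percent = share of persons whose converted score is within the target."""
    target = 0.5 * stdev(original)
    hits = sum(1 for o, c in zip(original, converted) if abs(o - c) <= target)
    return target, 100.0 * hits / len(original)


if __name__ == "__main__":
    # Toy scores for five hypothetical persons.
    original = [10, 20, 30, 40, 50]
    converted = [12, 29, 31, 40, 41]
    target, percent = percent_within_half_sd(original, converted)
    print(round(target, 2), percent)
```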

Population Invariance

When the distribution of CARE scores was assessed by demographics, all demographic variables showed the same direction of effect for the original and converted scores in the training sample. Persons who were male, of mild/moderate severity, non-minority, over 35, and with non-vehicular etiologies scored higher than their dichotomous comparisons (see Table IV-15). These results did not hold in the validation sample. With the exception of age, the direction of effect was the same for the original and converted scores; however, sex and cause of injury showed a different direction than in the training sample.


Table IV-17: Demographic population invariance for CARE and CAREfromFIM using equipercentile method

Sample      Demographic                Outcome      Mean Difference  Pooled SD  Invariance  CI Low  CI High
Training    Female - Male              CARETotal    -0.69            33.12      -0.02       -0.20   0.16
                                       CAREfromFIM  -0.20            33.33      -0.01       -0.19   0.17
            Mild/moderate - Severe     CARETotal    4.68             33.06      0.14        -0.04   0.32
                                       CAREfromFIM  6.53             33.21      0.20        0.01    0.38
            Non-minority - Minority    CARETotal    -1.67            33.11      -0.05       -0.21   0.11
                                       CAREfromFIM  -1.61            33.32      -0.05       -0.21   0.12
            26+ - <26                  CARETotal    -2.04            33.11      -0.06       -0.24   0.12
                                       CAREfromFIM  -2.10            33.32      -0.06       -0.25   0.12
            Non-vehicular - Vehicular  CARETotal    -0.73            33.12      -0.02       -0.18   0.14
                                       CAREfromFIM  -1.14            33.33      -0.03       -0.19   0.12
Validation  Female - Male              CARETotal    2.24             31.22      0.07        -0.21   0.36
                                       CAREfromFIM  2.11             31.36      0.07        -0.22   0.35
            Mild/moderate - Severe     CARETotal    4.55             31.17      0.15        -0.12   0.41
                                       CAREfromFIM  4.52             31.30      0.14        -0.12   0.41
            Non-minority - Minority    CARETotal    -4.44            31.16      -0.14       -0.39   0.11
                                       CAREfromFIM  -5.01            31.28      -0.16       -0.41   0.09
            26+ - <26                  CARETotal    -0.43            31.23      -0.01       -0.29   0.26
                                       CAREfromFIM  2.19             31.36      0.07        -0.20   0.34
            Non-vehicular - Vehicular  CARETotal    3.88             31.18      0.12        -0.12   0.37
                                       CAREfromFIM  2.61             31.34      0.08        -0.16   0.33

Furthermore, Table IV-16 shows that all differences in SMD were under the threshold of 0.08 except for age in the validation sample, indicating that, with that one exception, the crosswalk met the criterion for population invariance for the selected demographic variables.


Table IV-18: Difference in SMD for demographic variables for CARE using equipercentile method

Sample      Demographic  Invariance Difference
Training    Sex          -0.015
            Severity     -0.055
            Race         -0.002
            Age          0.001
            Cause        0.012
Validation  Sex          0.005
            Severity     0.002
            Race         0.018
            Age          -0.084
            Cause        0.042
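The invariance check can be sketched in Python as follows. The sketch assumes that the Invariance statistic is a standardized mean difference (SMD) computed with a pooled SD, and that the tabled difference is the SMD for the original scores minus the SMD for the converted scores, which matches the signs in the tables above. The function names and toy data are hypothetical.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def smd(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Standardized mean difference (group_a - group_b) using the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def invariance_difference(orig_a: Sequence[float], orig_b: Sequence[float],
                          conv_a: Sequence[float], conv_b: Sequence[float],
                          threshold: float = 0.08) -> Tuple[float, bool]:
    """Return (SMD difference, met) where met is True if |difference| < threshold."""
    difference = smd(orig_a, orig_b) - smd(conv_a, conv_b)
    return difference, abs(difference) < threshold


if __name__ == "__main__":
    # Toy original and converted scores for two hypothetical subgroups.
    diff, met = invariance_difference([1, 2, 3], [2, 3, 4], [1, 2, 3], [2, 3, 4])
    print(diff, met)
```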

Similarly, Table IV-17 and Table IV-18 show that the FIM-M and FIMfromCARE had the same direction of effect and that the demographic SMD differences were below the threshold of 0.08 in the training sample. As with the CARE crosswalk, the direction of effect for sex and cause of injury in the validation sample was opposite to that found in the training sample.

Again in the validation sample, the direction of the age effect differed for the FIM-M crosswalk: older people scored higher on the original FIM-M, whereas younger people scored higher on average on the converted FIM-M scores. Because of this reversal and the magnitude of the difference, age did not meet the threshold of 0.08 in the validation sample.


Table IV-19: Demographic population invariance for FIM-M and FIMfromCARE using equipercentile method

Sample      Demographic                Outcome      Mean Difference  Pooled SD  Invariance  CI Low  CI High
Training    Female - Male              FIM-M        -0.33            22.92      -0.01       -0.20   0.17
                                       FIMfromCARE  -0.57            22.84      -0.03       -0.20   0.16
            Mild/moderate - Severe     FIM-M        4.44             22.84      0.20        0.01    0.38
                                       FIMfromCARE  3.22             22.79      0.14        -0.04   0.32
            Non-minority - Minority    FIM-M        -1.26            22.91      -0.06       -0.22   0.11
                                       FIMfromCARE  -1.09            22.83      -0.05       -0.21   0.12
            26+ - <26                  FIM-M        -1.43            22.91      -0.06       -0.25   0.12
                                       FIMfromCARE  -1.47            22.83      -0.06       -0.25   0.12
            Non-vehicular - Vehicular  FIM-M        -0.88            22.91      -0.04       -0.20   0.12
                                       FIMfromCARE  -0.53            22.84      -0.02       -0.18   0.13
Validation  Female - Male              FIM-M        1.44             21.60      0.07        -0.22   0.35
                                       FIMfromCARE  1.44             21.44      0.07        -0.22   0.35
            Mild/moderate - Severe     FIM-M        3.33             21.56      0.16        -0.11   0.42
                                       FIMfromCARE  3.22             21.40      0.15        -0.11   0.42
            Non-minority - Minority    FIM-M        -3.48            21.54      -0.16       -0.41   0.09
                                       FIMfromCARE  -2.97            21.40      -0.14       -0.39   0.11
            26+ - <26                  FIM-M        1.39             21.60      0.06        -0.21   0.34
                                       FIMfromCARE  -0.34            21.45      -0.02       -0.29   0.26
            Non-vehicular - Vehicular  FIM-M        1.86             21.59      0.09        -0.16   0.33
                                       FIMfromCARE  2.68             21.41      0.12        -0.12   0.37

Table IV-20: Difference in SMD for demographic variables for FIM-M using equipercentile method

Sample      Demographic  Invariance Difference
Training    Sex          0.011
            Severity     0.054
            Race         -0.007
            Age          0.002
            Cause        -0.015
Validation  Sex          0.000
            Severity     0.004
            Race         -0.023
            Age          0.081
            Cause        -0.039

Statistical Moments

The average score for the CARE was 78.108 (SD = 33.101), whereas the average for the CAREfromFIM was slightly lower at 78.050 (SD = 33.194), resulting in a mean difference of 0.058 and an SD difference of -0.093. Similarly close results were found for the FIM-M and FIMfromCARE crosswalk, where both the mean difference and the SD difference were negligible (0.005 and 0.002, respectively). Table IV-19 shows the values of statistical moments 1 and 2 for the CARE and FIM-M crosswalks. Overall, the means for the validation sample were slightly higher than those for the training sample, and the mean and SD differences were slightly larger, although the mean difference was well within one SD and the SD difference was within one unit.

Table IV-21: Statistical moments 1 and 2 for CARE and FIM-M using the equipercentile method

Sample      Crosswalk  Mean   SD     Mean Converted  SD Converted  Mean Difference  SD Difference
Training    CARE       78.11  33.10  78.05           33.19         0.06             -0.09
            FIM-M      51.12  22.83  51.11           22.83         0.00             0.00
Validation  CARE       81.64  31.07  80.36           31.43         1.28             -0.36
            FIM-M      52.72  21.68  53.76           21.33         -1.03            0.34

All distributions (training and validation) for the equipercentile crosswalks showed a slightly negative skew and negative kurtosis. With a common SE of 0.094 for the training sample and 0.142 for the validation sample, all crosswalked coefficients fell within the 95% CI of the original measure (see Table IV-20).


Table IV-22: Statistical moments 3 and 4 for CARE and FIM-M using the equipercentile method

Sample      Crosswalk    Moment    Coefficient  SE    Low    High
Training    CARE         Skewness  -0.10        0.09  -0.29  0.09
            CAREfromFIM  Skewness  -0.11        0.09  -0.30  0.08
            FIM-M        Skewness  -0.18        0.09  -0.36  0.01
            FIMfromCARE  Skewness  -0.18        0.09  -0.37  0.01
            CARE         Kurtosis  -1.10        0.09  -1.28  -0.91
            CAREfromFIM  Kurtosis  -1.09        0.09  -1.28  -0.90
            FIM-M        Kurtosis  -1.06        0.09  -1.25  -0.88
            FIMfromCARE  Kurtosis  -1.07        0.09  -1.26  -0.88
Validation  CARE         Skewness  -0.09        0.14  -0.38  0.19
            CAREfromFIM  Skewness  -0.10        0.14  -0.39  0.18
            FIM-M        Skewness  -0.18        0.14  -0.46  0.10
            FIMfromCARE  Skewness  -0.18        0.14  -0.46  0.10
            CARE         Kurtosis  -0.90        0.14  -1.18  -0.61
            CAREfromFIM  Kurtosis  -0.97        0.14  -1.25  -0.68
            FIM-M        Kurtosis  -0.91        0.14  -1.19  -0.62
            FIMfromCARE  Kurtosis  -0.83        0.14  -1.11  -0.55
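The criterion for moments 3 and 4 can be sketched as a simple interval check: a crosswalked coefficient meets the criterion when it lies inside the 95% CI of the original measure's coefficient. The Python sketch below takes the SE as given and assumes a normal-approximation multiplier of 1.96; the exact multiplier and SE estimator used for the table are not stated here, so the interval bounds may differ slightly from the tabled values. The function names are hypothetical.

```python
from typing import Tuple


def confidence_interval(coefficient: float, se: float,
                        z: float = 1.96) -> Tuple[float, float]:
    """Return (low, high) = coefficient -/+ z * SE (z = 1.96 is an assumption)."""
    return coefficient - z * se, coefficient + z * se


def within_original_ci(converted_coefficient: float,
                       original_coefficient: float,
                       original_se: float) -> bool:
    """True if the converted coefficient lies inside the original measure's CI."""
    low, high = confidence_interval(original_coefficient, original_se)
    return low <= converted_coefficient <= high


if __name__ == "__main__":
    # Training-sample skewness: CARE = -0.10 (SE 0.094), CAREfromFIM = -0.11.
    print(within_original_ci(-0.11, -0.10, 0.094))
```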

Effect Size

The effect size (Cohen’s D) for the CARE and CAREfromFIM comparison was 0.002, and for the FIM-M and FIMfromCARE comparison it was <0.001, indicating that the overall distributions were not meaningfully different for either crosswalk in the training sample (see Table IV-21). The validation sample showed slightly larger coefficients (0.04 for the CARE crosswalk and -0.05 for the FIM-M crosswalk), which would still be considered small.

Table IV-23: Effect sizes of CARE and FIM-M crosswalks by equipercentile method

Sample      Outcome             Mean Difference  Pooled SD  Cohen’s D  CI Low  CI High
Training    CARE - CAREfromFIM  0.06             33.15      0          -0.11   0.11
            FIM - FIMfromCARE   0.00             22.83      0          -0.11   0.11
Validation  CARE - CAREfromFIM  1.28             31.26      0.04       -0.12   0.21
            FIM - FIMfromCARE   -1.03            21.51      -0.05      -0.21   0.12
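The effect sizes in the table can be approximated from the summary statistics reported for moments 1 and 2. The Python sketch below assumes that the pooled SD is the root mean square of the two SDs, which is appropriate when both distributions come from the same persons and therefore have equal n; the function names are hypothetical, and small discrepancies from the tabled values reflect rounding of the inputs.

```python
import math


def pooled_sd(sd_original: float, sd_converted: float) -> float:
    """Pooled SD for two equally sized samples: sqrt of the mean of the variances."""
    return math.sqrt((sd_original ** 2 + sd_converted ** 2) / 2.0)


def cohens_d(mean_original: float, sd_original: float,
             mean_converted: float, sd_converted: float) -> float:
    """Cohen's d = (mean difference) / (pooled SD)."""
    return (mean_original - mean_converted) / pooled_sd(sd_original, sd_converted)


if __name__ == "__main__":
    # Validation sample, CARE vs CAREfromFIM (means and SDs from the moments table).
    d = cohens_d(81.64, 31.07, 80.36, 31.43)
    print(round(d, 2))  # 0.04
```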


Rasch

The Rating Scale Rasch model was chosen as the framework for converting each assessment (CARE and FIM-M) to logits. One of the assumptions in using Rasch to equate scores is unidimensionality. Using Winsteps, the standardized residual variance was calculated for both the CARE and the FIM. The results indicate some degree of multidimensionality in both measures. Eigenvalues for the unexplained variance in the first contrast were 3.1 for the CARE and 2.0 for the FIM, whereas a typical cutoff for meeting unidimensionality is an eigenvalue under two. However, both measures had an observed percent of variance in the first contrast below the 10% threshold (4.2% for the CARE and 4.3% for the FIM). Given the low percentage of residual variance explained and the relatively small eigenvalues, the Rasch assumption of unidimensionality was considered met.

The rating scale model was chosen for the separate analyses of the CARE and FIM-M because the items within each measure share the same rating scale (1-7 for the FIM-M and 1-6 for the CARE). From the separate Rasch analyses (FIM-M and CARE), a random equivalence equating method was applied in which both the mean and SD of the Rasch person-ability measures were used to re-scale onto the target assessment scale. The procedure is documented on the website of the Rasch software package Winsteps,¹⁰⁵ which details the common person-equating steps. To convert each assessment into logits, person and item maps were created for each measure individually. These maps made it possible to determine the “difficulty” of items and the “ability” of persons, which in turn provide the metrics for recoding scores into logits. Figure IV-7 and Figure IV-8 show the FIM-M Person and Item maps. The Item map reveals that the eating and grooming items are “easier” than navigating stairs and transferring to the tub or shower. The Person map shows a normal distribution of logit scores on the FIM-M assessment.
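The re-scaling by mean and SD described above can be illustrated with a generic mean-sigma linear transformation. The Python sketch below is an assumption-laden illustration, not the exact Winsteps procedure followed in this study: it assumes the same persons have logit measures on both assessments and maps the source measures onto the target scale so that the transformed values share the target mean and SD. The function names and toy logits are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def mean_sigma_link(source: Sequence[float],
                    target: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the linear map from the source scale to the target scale."""
    slope = stdev(target) / stdev(source)
    intercept = mean(target) - slope * mean(source)
    return slope, intercept


def rescale(measures: Sequence[float], slope: float, intercept: float) -> List[float]:
    """Apply the linear transformation to each person measure."""
    return [slope * m + intercept for m in measures]


if __name__ == "__main__":
    # Toy person measures (logits) for three hypothetical persons on two scales.
    source_logits = [-1.0, 0.0, 1.0]
    target_logits = [-2.5, -0.5, 1.5]
    slope, intercept = mean_sigma_link(source_logits, target_logits)
    print(slope, intercept, rescale(source_logits, slope, intercept))
```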

Figure IV-7: FIM-M person/item map

MEASURE PERSON - MAP - ITEM | 5 .# + .# | | 4 + | .## | | .# | | 3 .# T+ .### | ## | .# | . | .## | 2 .## + .### | .# | .#### S| .## | ##### | 1 .#### +T FIMStairs .#### | .###### | FIMTubTrans .########## |S ##### | FIMBath FIMLocomotion .#### | FIMDrsdwn FIMToilet 0 .####### M+M FIMBedTrans FIMToilTrans ######## | FIMBladMgt .######## | FIMDrup .########## |S FIMBwlMgt .#### | FIMGroom .####### | FIMFeed -1 .##### +T .### | #### | ##### S| ### | .### | -2 .### + .### | .## | ### | | .# | -3 T+ | .# | | -4 ############ + | EACH "#" IS 4: EACH "." IS 1 TO 3


Figure IV-8: FIM-M item map

MEASURE PERSON - MAP - ITEM - Measures for category scores (maximum probability of observing a category) | 1 2 3 4 5 6 7 5 .# + | | | .# | | 4 + FIMS.7 | FIMT.7 .## | | .# | FIMB.7 FIML.7 FIMT.7 FIMD.7 | FIMT.7 3 .# T+ FIMB.7 FIMB.7 .### | ## | FIMS.6 FIMD.7 .# | FIMT.6 FIMB.7 . | FIMG.7 .## | FIMF.7 2 .## + FIMB.6 FIML.6 FIMT.6 .### | FIMD.6 FIMT.6 .# | FIMB.6 FIMB.6 .#### S| FIMS.5 .## | FIMT.5 FIMD.6 ##### | FIMB.6 1 .#### +T FIMG.6 .#### | FIMB.5 FIMF.6 FIML.5 FIMT.5 .###### | FIMS.4 FIMD.5 FIMT.5 .########## |S FIMB.5 FIMB.5 ##### | FIMT.4 .#### | FIMD.5 0 .####### M+M FIMS.3 FIMB.4 FIMB.5 ######## | FIMT.3 FIML.4 FIMG.5 FIMT.4 FIMD.4 .######## | FIMT.4 FIMF.5 FIMB.4 .########## |S FIMS.2 FIMB.3 FIMB.4 .#### | FIMT.2 FIML.3 FIMD.4 FIMT.3 FIMD.3 .####### | FIMT.3 -1 .##### +T FIMB.3 FIMB.4 FIMB.3 .### | FIMB.2 FIMD.3 FIMG.4 FIML.2 FIMT.2 FIMD.2 #### | FIMS.1 FIMT.2 FIMF.4 ##### S| FIMB.2 FIMB.3 FIMB.2 ### | FIMT.1 FIMG.3


Figure IV-8 cont .### | FIMD.2 FIMF.3 -2 .### + FIMB.1 FIMB.2 FIML.1 .### | FIMT.1 FIMG.2 FIMD.1 FIMT.1 .## | FIMB.1 FIMF.2 ### | FIMB.1 | FIMD.1 .# | FIMB.1 -3 T+ FIMG.1 | FIMF.1 .# | | | | -4 ############ + | 1 2 3 4 5 6 7 EACH "#" IS 4: EACH "." IS 1 TO 3

The CARE Item and Person maps show similar findings. The person map (Figure IV-9) shows a normal distribution of person ability, with the Item map (Figure IV-10) indicating that navigating 12 steps and picking an object off the floor were more difficult items than eating and performing oral hygiene.

Figure IV-9: CARE person/item map

MEASURE PERSON - MAP - ITEM | 5 ###### + | | | . | | 4 + | #### | . | T| .# | 3 .### + # | .# | .## | .## | . | 2 ## + .### | .### S| # |T ### | MOB12Steps_r .#### | MOBPickUp_r 1 .#### + MOBWalkWCd_r


Figure IV-9 cont .####### |S MOB1StepCurb_r MOB4Steps_r MOBCarTran_r .######## | ########## | MOBWalkWCc_r .### | .###### | MOBWalkWCb_r SCShower_r 0 .###### M+M MOBWalkWCa_r SCFootwear_r SCLBDress_r SCToilet_r .######### | MOBToilettran_r .###### | MOBChairTran_r MOBSitStand_r .#### | SCUBDress_r .#### | #### |S -1 .##### + MOBLying_r MOBSit_r .### | MOBRoll_r SCOralHyg_r .#### | SCEat_r .##### |T .#### S| ### | -2 .### + .### | .### | .### | # | .### | -3 + .# | .# T| | | # | -4 .###### + | EACH "#" IS 4: EACH "." IS 1 TO 3

Figure IV-10: CARE item map

MEASURE PERSON - MAP - ITEM - Measures for category scores (maximum probability of observing a category) | 1 2 3 4 5 6 5 ###### + | | | . | | 4 + MOB12.6 | MOBPi.6 #### | MOBWa.6 . | MOB4S.6 MOB1S.6 MOBCa.6 T| .# | MOBWa.6 3 .### + # | MOB12.5 SCSho.6 MOBWa.6 .# | MOBPi.5 SCFoo.6 SCLBD.6 MOBWa.6 .## | MOBWa.5 SCToi.6 MOBTo.6 .## | MOB4S.5 MOBSi.6 MOB1S.5 MOBCh.6 MOBCa.5 86

Figure IV-10 cont . | 2 ## + MOBWa.5 SCUBD.6 .### | .### S| MOB12.4 SCSho.5 MOBLy.6 MOBWa.5 MOBSi.6 # |T MOBPi.4 SCFoo.5 MOBRo.6 SCLBD.5 SCOra.6 MOBWa.5 SCToi.5 ### | MOBWa.4 MOBTo.5 SCEat.6 .#### | MOB4S.4 MOBSi.5 MOB1S.4 MOBCh.5 MOBCa.4 1 .#### + SCUBD.5 .####### |S MOBWa.4 .######## | MOB12.3 SCSho.4 MOBLy.5 ########## | MOBPi.3 MOBWa.4 MOBSi.5 .### | MOBWa.3 SCFoo.4 MOBRo.5 SCLBD.4 SCOra.5 MOBWa.4 SCToi.4 .###### | MOB4S.3 MOBTo.4 SCEat.5 MOB1S.3 MOBCa.3 0 .###### M+M MOB12.2 MOBSi.4 MOBCh.4 .######### | MOBPi.2 MOBWa.3 SCUBD.4 .###### | MOBWa.2 .#### | MOB4S.2 SCSho.3 MOBLy.4 MOB1S.2 MOBWa.3 MOBCa.2 .#### | SCFoo.3 MOBSi.4 SCLBD.3 MOBWa.3 #### |S MOB12.1 MOBWa.2 SCToi.3 MOBRo.4 MOBTo.3 SCOra.4 -1 .##### + MOBSi.3 SCEat.4 MOBCh.3 .### | MOBPi.1 SCSho.2 SCUBD.3 MOBWa.1 MOBWa.2 .#### | MOB4S.1 SCFoo.2 MOB1S.1 SCLBD.2 MOBWa.2 .##### |T MOBCa.1 SCToi.2 MOBTo.2 .#### S| MOBWa.1 MOBSi.2 MOBLy.3 MOBCh.2 MOBSi.3 ### | SCUBD.2 MOBRo.3 SCOra.3 -2 .### + SCSho.1 SCEat.3 MOBWa.1 .### | SCFoo.1 .### | SCLBD.1 MOBLy.2 MOBWa.1 MOBSi.2 SCToi.1 MOBTo.1 .### | MOBSi.1 MOBRo.2 MOBCh.1 SCOra.2 # | SCEat.2 .### | SCUBD.1 -3 + .# | MOBLy.1 MOBSi.1 .# T| | MOBRo.1 SCOra.1 SCEat.1 | # | -4 .###### + | 1 2 3 4 5 6


Concordance Tables

Two concordance tables were produced. In the table for FIM-M (see Table IV-22), the original FIM-M score (column 3) was converted to a Rasch logit score (column 4), and the CARE score (column 1) was converted to a Rasch score calibrated to the FIM-M standard (column 2). Table IV-23 shows the same information for CARE. Converting scores to logits preserves the advantage of a true interval scale; thus, the analyses below assessed the crosswalks using the logit transformations.

Table IV-24: FIM concordance table using Rasch

CARE    CARE to FIM-M    FIM-M    FIM-M
Score   Rasch            Score    Rasch
22      -5.52            13       -4.46
23      -4.48            14       -3.40
24      -3.96            15       -2.86
25      -3.66            16       -2.58
26      -3.46            17       -2.38
27      -3.30            18       -2.24
28      -3.16            19       -2.11
29      -3.04            20       -2.01
30      -2.94            21       -1.92
31      -2.84            22       -1.83
32      -2.75            23       -1.76
33      -2.67            24       -1.68
34      -2.59            25       -1.62
35      -2.52            26       -1.55
36      -2.45            27       -1.49
37      -2.38            28       -1.44
38      -2.31            29       -1.38
39      -2.25            30       -1.33
40      -2.19            31       -1.27
41      -2.13            32       -1.22
42      -2.07            33       -1.17
43      -2.01            34       -1.12
44      -1.95            35       -1.08
45      -1.90            36       -1.03
46      -1.85            37       -0.98
47      -1.79            38       -0.94
48      -1.74            39       -0.89
49      -1.69            40       -0.85
50      -1.64            41       -0.80
51      -1.59            42       -0.75
52      -1.54            43       -0.71
53      -1.49            44       -0.66
54      -1.45            45       -0.62
55      -1.40            46       -0.57
56      -1.35            47       -0.52
57      -1.30            48       -0.47
58      -1.26            49       -0.43
59      -1.21            50       -0.38
60      -1.16            51       -0.32
61      -1.12            52       -0.27
62      -1.07            53       -0.22
63      -1.03            54       -0.16
64      -0.98            55       -0.11
65      -0.93            56       -0.05
66      -0.89            57       0.01
67      -0.84            58       0.08
68      -0.79            59       0.14
69      -0.75            60       0.21
70      -0.70            61       0.28
71      -0.66            62       0.35
72      -0.61            63       0.42
73      -0.56            64       0.50
74      -0.51            65       0.58
75      -0.47            66       0.66
76      -0.42            67       0.75
77      -0.37            68       0.83
78      -0.33            69       0.92
79      -0.28            70       1.01
80      -0.23            71       1.10
81      -0.18            72       1.19
82      -0.13            73       1.28
83      -0.08            74       1.37
84      -0.03            75       1.46
85      0.02             76       1.56
86      0.07             77       1.66
87      0.12             78       1.76
88      0.17             79       1.86
89      0.23             80       1.97
90      0.28             81       2.08
91      0.33             82       2.20
92      0.39             83       2.33
93      0.45             84       2.47
94      0.50             85       2.63
95      0.56             86       2.81
96      0.62             87       3.02
97      0.68             88       3.28
98      0.73             89       3.65
99      0.79             90       4.28
100     0.86             91       5.41
101     0.92
102     0.98
103     1.04
104     1.11
105     1.17
106     1.24
107     1.30
108     1.37
109     1.44
110     1.51
111     1.58
112     1.65
113     1.72
114     1.80
115     1.87
116     1.95
117     2.03
118     2.12
119     2.20
120     2.29
121     2.39
122     2.49
123     2.60
124     2.71
125     2.84
126     2.98
127     3.14
128     3.33
129     3.57
130     3.90
131     4.46
132     5.52

Table IV-25: CARE concordance table using Rasch

CARE    CARE             FIM-M    FIM-M to
Score   Rasch            Score    CARE Rasch
22      -4.77            13       -4.66
23      -3.83            14       -3.49
24      -3.35            15       -2.90
25      -3.09            16       -2.59
26      -2.90            17       -2.37
27      -2.76            18       -2.21
28      -2.64            19       -2.07
29      -2.53            20       -1.96
30      -2.44            21       -1.85
31      -2.35            22       -1.76
32      -2.27            23       -1.68
33      -2.19            24       -1.60
34      -2.12            25       -1.52
35      -2.06            26       -1.45
36      -1.99            27       -1.39
37      -1.93            28       -1.32
38      -1.87            29       -1.26
39      -1.81            30       -1.20
40      -1.76            31       -1.14
41      -1.70            32       -1.09
42      -1.65            33       -1.03
43      -1.60            34       -0.98
44      -1.55            35       -0.93
45      -1.50            36       -0.87
46      -1.45            37       -0.82
47      -1.40            38       -0.77
48      -1.35            39       -0.72
49      -1.31            40       -0.67
50      -1.26            41       -0.62
51      -1.22            42       -0.57
52      -1.17            43       -0.52
53      -1.13            44       -0.47
54      -1.09            45       -0.42
55      -1.04            46       -0.36
56      -1.00            47       -0.31
57      -0.96            48       -0.26
58      -0.92            49       -0.20
59      -0.87            50       -0.15
60      -0.83            51       -0.09
61      -0.79            52       -0.04
62      -0.75            53       0.02
63      -0.71            54       0.09
64      -0.66            55       0.15
65      -0.62            56       0.21
66      -0.58            57       0.28
67      -0.54            58       0.35
68      -0.50            59       0.42
69      -0.46            60       0.50
70      -0.41            61       0.57
71      -0.37            62       0.65
72      -0.33            63       0.74
73      -0.29            64       0.82
74      -0.24            65       0.91
75      -0.20            66       1.00
76      -0.16            67       1.09
77      -0.12            68       1.19
78      -0.07            69       1.28
79      -0.03            70       1.38
80      0.01             71       1.48
81      0.06             72       1.58
82      0.10             73       1.68
83      0.15             74       1.78
84      0.19             75       1.89
85      0.24             76       1.99
86      0.28             77       2.10
87      0.33             78       2.21
88      0.38             79       2.33
89      0.43             80       2.44
90      0.47             81       2.57
91      0.52             82       2.70
92      0.57             83       2.84
93      0.62             84       3.00
94      0.68             85       3.17
95      0.73             86       3.37
96      0.78             87       3.60
97      0.83             88       3.90
98      0.89             89       4.30
99      0.94             90       5.00
100     1.00             91       6.25
101     1.05
102     1.11
103     1.16
104     1.22
105     1.28
106     1.34
107     1.40
108     1.46
109     1.52
110     1.58
111     1.65
112     1.71
113     1.78
114     1.85
115     1.91
116     1.99
117     2.06
118     2.14
119     2.21
120     2.30
121     2.38
122     2.47
123     2.57
124     2.67
125     2.79
126     2.92
127     3.06
128     3.23
129     3.45
130     3.74
131     4.26
132     5.22

The figures below visualize the mapping of logits for both the FIM-M and CARE crosswalks. Both show somewhat normal distributions in the marginal plots for each measure; however, there is some dispersion at the upper scores of the measures. For the CARE crosswalk, there is a floor effect in the CAREfromFIM distribution and a ceiling effect in the original CARE scores. The opposite is true of the FIM-M crosswalk, in which there is a floor effect in the original FIM-M and a ceiling effect in the FIMfromCARE transformation.


Figure IV-11: Scatterplot of CARE using Rasch


Figure IV-12: Scatterplot of FIM-M using Rasch

Assessment of crosswalk

Reduction in Uncertainty

As in the previous crosswalk assessments, the correlations between the FIM-M and CARE (in logits), between the FIM-M and FIMfromCARE, and between the CARE and CAREfromFIM were high (r = 0.92) in the training sample. This translates into a RiU of 61.4% for both the original (logit-based) scores and the crosswalked scores (see Table IV-24). This held true for the validation sample as well (r = 0.914), with a RiU of 59.5%.

Table IV-26: Correlations and RiU between the assessments using Rasch

Sample      Crosswalk             Correlation  Correlation²  CoA   RiU   RiU Percent
Training    FIM-M - CARE          0.92         0.85          0.39  0.61  61.4
            FIM-M - FIMfromCARE   0.92         0.85          0.39  0.61  61.4
            CARE - CAREfromFIM    0.92         0.85          0.39  0.61  61.4
Validation  FIM-M - CARE          0.91         0.84          0.41  0.60  59.5
            FIM-M - FIMfromCARE   0.91         0.84          0.41  0.60  59.5
            CARE - CAREfromFIM    0.91         0.84          0.41  0.60  59.5

Percent scores within ½ SD

In the training sample, the Rasch-converted scores were within ½ SD of the original logit score 86% of the time for the CARE crosswalk and 77% of the time for the FIM-M crosswalk (see Table IV-25). Similar but slightly lower findings occurred in the validation sample: 85% of the converted CARE scores fell within ½ SD (1.02) of the original score, and 76% of the converted FIM-M scores fell within ½ SD (0.88) of the original score.

Table IV-27: Percent of the scores within ½ SD of original score using Rasch method

Sample      Crosswalk  N    Target  Percent
Training    CARE       635  1.06    0.86
            FIM-M      635  0.96    0.78
Validation  CARE       274  1.02    0.85
            FIM-M      274  0.88    0.76

Population Invariance

For the CARE crosswalk, mean differences among the demographic variables remained consistent between the original score and the converted score in the training sample. Persons who were female, severe, non-minority, over 35, and with non-vehicular etiologies scored higher than their dichotomous comparisons (Table IV-26). As with the population invariance statistics for the validation sample under the equipercentile method, the direction of effect differed from the training sample in that sex and cause of injury were reversed, and age showed a difference in direction between the CARE and CAREfromFIM distributions, which suggests that the conversion may not be invariant across the age and sex categories.

Table IV-28: Demographic population invariance for CARE and CAREfromFIM using Rasch method

Sample      Demographic                Outcome      Mean Difference  Pooled SD  Invariance  CI Low  CI High
Training    Female - Male              CARETotal    -0.10            2.12       -0.04       -0.22   0.14
                                       CAREfromFIM  -0.03            2.13       -0.01       -0.19   0.17
            Mild/moderate - Severe     CARETotal    0.32             2.12       0.15        -0.03   0.33
                                       CAREfromFIM  0.42             2.12       0.20        0.02    0.38
            Non-minority - Minority    CARETotal    -0.12            2.12       -0.06       -0.22   0.11
                                       CAREfromFIM  -0.19            2.13       -0.09       -0.26   0.07
            26+ - <26                  CARETotal    -0.19            2.12       -0.09       -0.27   0.10
                                       CAREfromFIM  -0.24            2.13       -0.11       -0.30   0.07
            Non-vehicular - Vehicular  CARETotal    -0.01            2.12       0.00        -0.16   0.16
                                       CAREfromFIM  -0.04            2.13       -0.02       -0.18   0.14
Validation  Female - Male              CARETotal    0.25             2.04       0.12        -0.16   0.40
                                       CAREfromFIM  0.21             1.93       0.11        -0.17   0.39
            Mild/moderate - Severe     CARETotal    0.26             2.04       0.12        -0.14   0.39
                                       CAREfromFIM  0.28             1.93       0.15        -0.12   0.41
            Non-minority - Minority    CARETotal    -0.21            2.04       -0.10       -0.35   0.15
                                       CAREfromFIM  -0.25            1.93       -0.13       -0.38   0.12
            26+ - <26                  CARETotal    -0.11            2.04       -0.05       -0.33   0.22
                                       CAREfromFIM  0.12             1.94       0.06        -0.21   0.34
            Non-vehicular - Vehicular  CARETotal    0.28             2.04       0.14        -0.10   0.38
                                       CAREfromFIM  0.21             1.93       0.11        -0.13   0.35

Table IV-27 shows very small differences in the invariance statistic (SMD) for all demographic variables. Only age in the validation sample showed a difference larger than 0.08, indicating that the crosswalk may not be equitable for people of different ages.


Table IV-29: Difference in SMD for demographic variables for CARE using Rasch method

Sample      Demographic  Invariance Difference
Training    Sex          -0.032
            Severity     -0.050
            Race         0.033
            Age          0.023
            Cause        0.016
Validation  Sex          0.013
            Severity     -0.021
            Race         0.027
            Age          -0.116
            Cause        0.030

The demographic breakdown for the FIM-M in both the training and validation samples followed the same pattern as for the CARE. The same demographic variables differed in the same direction as for the CARE (Table IV-28), and the absolute differences in invariance were all below 0.08 except for age in the validation sample (absolute difference of 0.116; see Table IV-29).

Table IV-30: Demographic population invariance for FIM and FIMfromCARE using Rasch method

Sample      Demographic                Outcome      Mean Difference  Pooled SD  Invariance  CI Low  CI High
Training    Female - Male              FIM-M        -0.03            1.93       -0.01       -0.19   0.17
                                       FIMfromCARE  -0.10            2.35       -0.04       -0.22   0.14
            Mild/moderate - Severe     FIM-M        0.38             1.92       0.20        0.02    0.38
                                       FIMfromCARE  0.35             2.34       0.15        -0.03   0.33
            Non-minority - Minority    FIM-M        -0.17            1.93       -0.09       -0.26   0.07
                                       FIMfromCARE  -0.14            2.35       -0.06       -0.22   0.11
            26+ - <26                  FIM-M        -0.21            1.93       -0.11       -0.30   0.07
                                       FIMfromCARE  -0.21            2.35       -0.09       -0.27   0.10
            Non-vehicular - Vehicular  FIM-M        -0.04            1.93       -0.02       -0.18   0.14
                                       FIMfromCARE  -0.01            2.35       0.00        -0.16   0.16
Validation  Female - Male              FIM-M        0.19             1.75       0.11        -0.17   0.39
                                       FIMfromCARE  0.27             2.26       0.12        -0.16   0.40
            Mild/moderate - Severe     FIM-M        0.26             1.75       0.15        -0.12   0.41
                                       FIMfromCARE  0.28             2.26       0.12        -0.14   0.39
            Non-minority - Minority    FIM-M        -0.23            1.75       -0.13       -0.38   0.12
                                       FIMfromCARE  -0.23            2.26       -0.10       -0.35   0.15
            26+ - <26                  FIM-M        0.11             1.75       0.06        -0.21   0.34
                                       FIMfromCARE  -0.12            2.26       -0.05       -0.33   0.22
            Non-vehicular - Vehicular  FIM-M        0.19             1.75       0.11        -0.13   0.35
                                       FIMfromCARE  0.31             2.25       0.14        -0.10   0.38

Table IV-31: Difference in SMD for demographic variables for FIM-M using Rasch method

Sample      Demographic  Invariance Difference
Training    Sex          0.032
            Severity     0.051
            Race         -0.032
            Age          -0.023
            Cause        -0.016
Validation  Sex          -0.012
            Severity     0.021
            Race         -0.027
            Age          0.116
            Cause        -0.029

Statistical Moments

As seen in Table IV-30, the mean differences for the CARE crosswalk (0.04) and the FIM-M crosswalk (-0.04) were small in the training sample, as were the differences in SDs (0.00 and -0.42, respectively). The table also shows similar but slightly larger differences in the validation sample.


Table IV-32: Statistical moments 1 and 2 for CARE and FIM-M using the Rasch method

Sample      Crosswalk  Mean   SD    Mean Converted  SD Converted  Mean Difference  SD Difference
Training    CARE       0.06   2.12  0.02            2.12          0.04             0.00
            FIM-M      -0.22  1.92  -0.18           2.34          -0.04            -0.42
Validation  CARE       0.31   2.03  0.21            1.95          0.10             0.09
            FIM-M      -0.05  1.76  0.10            2.25          -0.15            -0.49

There was a slight discrepancy in the shape of the CARE and CAREfromFIM distributions with respect to skewness in the training sample. The CARE distribution had a slightly positive skew, whereas the CAREfromFIM distribution showed a slightly negative skew. The reverse was true for the FIM-M and FIMfromCARE distributions: the skew was negative for the FIM-M and positive for the FIMfromCARE. Nevertheless, each coefficient fell within the CI of the other. All the distributions revealed a slightly positive kurtosis, ranging from 0.305 to 0.434, and again each fell within the CI of the other. The validation sample revealed small positive skew and kurtosis, each falling within the 95% CI of the other assessment.

Table IV-33: Statistical moments 3 and 4 for CARE and FIM using the Rasch method

Sample      Crosswalk    Moment    Coefficient  SE    Low    High
Training    CARE         Skewness  0.16         0.09  -0.03  0.34
            CAREfromFIM  Skewness  -0.17        0.09  -0.36  0.01
            FIM-M        Skewness  -0.17        0.09  -0.36  0.01
            FIMfromCARE  Skewness  0.16         0.09  -0.03  0.34
            CARE         Kurtosis  0.31         0.09  0.12   0.49
            CAREfromFIM  Kurtosis  0.43         0.09  0.24   0.62
            FIM-M        Kurtosis  0.43         0.09  0.25   0.62
            FIMfromCARE  Kurtosis  0.30         0.09  0.12   0.49
Validation  CARE         Skewness  0.33         0.14  0.04   0.61
            CAREfromFIM  Skewness  0.02         0.14  -0.26  0.31
            FIM-M        Skewness  0.02         0.14  -0.26  0.30
            FIMfromCARE  Skewness  0.32         0.14  0.04   0.61
            CARE         Kurtosis  0.57         0.14  0.28   0.85
            CAREfromFIM  Kurtosis  0.62         0.14  0.34   0.91
            FIM-M        Kurtosis  0.63         0.14  0.34   0.91
            FIMfromCARE  Kurtosis  0.56         0.14  0.28   0.85

Effect Size

The effect size (Cohen’s D) for the CARE and CAREfromFIM comparison was 0.020, and for the FIM-M and FIMfromCARE comparison it was 0.018, indicating that the overall distributions were not meaningfully different for either crosswalk in the training sample (see Table IV-32). The validation sample also revealed small effect sizes, albeit slightly larger than those of the training sample crosswalks.

Table IV-34: Effect sizes of CARE and FIM-M crosswalks by Rasch method

Sample      Outcome              Mean Difference  Pooled SD  Cohen’s D  CI Low  CI High
Training    CARE - CAREfromFIM   -0.04            2.13       -0.02      -0.13   0.09
            FIM-M - FIMfromCARE  0.04             2.12       0.02       -0.09   0.13
Validation  CARE - CAREfromFIM   -0.15            2.01       -0.07      -0.24   0.09
            FIM-M - FIMfromCARE  0.10             1.99       0.05       -0.12   0.21


CHAPTER V.

CONCLUSION

The results of this study, which developed and tested the accuracy of several crosswalks between the FIM-M and CARE, were positive overall. Most criteria were met by each of the three crosswalk types (i.e., Expert Opinion, Equipercentile, and Rasch) and for both directions of the crosswalks (i.e., CARE to FIM-M and FIM-M to CARE). Across the five crosswalks evaluated using the training dataset (i.e., CAREeo to FIMeo (Expert Opinion), FIM-M to CARE (Equipercentile), CARE to FIM-M (Equipercentile), FIM-M to CARE (Rasch), and CARE to FIM-M (Rasch)), only four of the 17 evaluation criteria were not met. First, for the Expert Opinion crosswalk, using the invariance method of evaluation, the direction of the race effect differed: the CAREeo crosswalk showed that minorities scored higher (i.e., better function) than non-minorities, whereas the FIMeo crosswalk indicated that minorities scored lower (i.e., worse function). This may indicate that the Expert Opinion crosswalk is not population invariant with regard to race/ethnicity; however, the difference in the SMD was small and did not exceed the threshold for population invariance (see Table V-1). Second, for the Rasch method, the evaluation of moments showed a difference in the direction of skewness: the skewness of the CARE scores was slightly positive (0.157), whereas the skewness of the CAREfromFIM scores was slightly negative (-0.173). Although the direction differed, both coefficients were between -0.5 and 0.5, which indicates that the distributions were approximately symmetric. Third, also for the Rasch method and the moments criterion, the same difference in the direction of skewness was seen for the FIM-M and FIMfromCARE: the FIM-M scores showed a negative skew (-0.175) and the FIMfromCARE scores showed a positive skew (0.155). As with the CARE and CAREfromFIM, neither coefficient indicated a skewed distribution. Furthermore, although the direction differed for these two skewness comparisons, neither difference exceeded the 95% CI, and thus both still met that evaluation criterion (see Table V-1). Fourth, the criterion for the percent of scores falling within ½ SD of the referenced assessment was not met (78%, target of 80%) for the FIM-M when using the Rasch methodology. The two crosswalks created by the Equipercentile method met all the evaluation criteria.

Table V-1: Results of the evaluation criteria for the training dataset

Training Dataset          Expert               Equipercentile                            Rasch
                          CAREeo - FIMeo       CARE - CAREfromFIM   FIM-M - FIMfromCARE  CARE - CAREfromFIM   FIM-M - FIMfromCARE
% RiU                     Met Criteria         Met Criteria         Met Criteria         Met Criteria         Met Criteria
% Scores within 1/2 SD    Met Criteria         Met Criteria         Met Criteria         Met Criteria         Different
Sex direction             Met Criteria         Met Criteria         Met Criteria         Met Criteria         Met Criteria
Severity direction        Met Criteria         Met Criteria         Met Criteria         Met Criteria         Met Criteria
Race direction            Different            Met Criteria         Met Criteria         Met Criteria         Met Criteria
Age direction             Met Criteria         Met Criteria         Met Criteria         Met Criteria         Met Criteria
Cause direction           Met Criteria         Met Criteria         Met Criteria         Met Criteria         Met Criteria
Sex SMD difference        Met Criteria         Met Criteria         Met Criteria         Met Criteria         Met Criteria
Severity SMD difference   Met Criteria         Met Criteria         Met Criteria         Met Criteria         Met Criteria
Race SMD difference       Met Criteria         Met Criteria         Met Criteria         Met Criteria         Met Criteria
Age SMD difference        Met Criteria         Met Criteria         Met Criteria         Met Criteria         Met Criteria
Cause SMD difference      Met Criteria         Met Criteria         Met Criteria         Met Criteria         Met Criteria
Moment 1                  Met Criteria         Met Criteria         Met Criteria         Met Criteria         Met Criteria
Moment 2                  Met Criteria         Met Criteria         Met Criteria         Met Criteria         Met Criteria
Moment 3                  Met Criteria         Met Criteria         Met Criteria         Different            Different
Moment 4                  Met Criteria         Met Criteria         Met Criteria         Met Criteria         Met Criteria
Effect size               Met Criteria         Met Criteria         Met Criteria         Met Criteria         Met Criteria

Interestingly, the results from the training dataset did not hold when the crosswalks were applied to the validation dataset. In the validation dataset, the Expert Opinion crosswalk met all of the evaluation criteria. However, the Equipercentile and Rasch crosswalks, in both directions (FIM-M to CARE and CARE to FIM-M), revealed an issue with population invariance as it related to age. For both methods, people younger than 26 scored higher on the CARE and FIM-M than people 26 and over; however, this direction was reversed for the CAREfromFIM and FIMfromCARE scores. More importantly, the difference in the SMD was greater than 0.08 in each case, indicating that the crosswalks may not be population invariant with regard to age (see Table V-2). Furthermore, for the FIM-M crosswalk using Rasch methodology, the percentage of scores falling within ½ SD (78%) failed to reach the 80% threshold.
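The two criteria discussed above (the percentage of scores within ½ SD and the subgroup SMD comparison) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the SMD is assumed to use a pooled standard deviation, and the ½ SD band is assumed to be based on the SD of the reference scores. The exact definitions used in this study are given in the methods chapter and may differ.

```python
from math import sqrt
from statistics import mean, stdev


def pct_within_half_sd(reference, crosswalked):
    """Percent of paired crosswalked scores within 1/2 SD of the reference score.

    Assumption: the SD is the sample SD of the reference scores.
    """
    half_sd = 0.5 * stdev(reference)
    hits = sum(abs(r - c) <= half_sd for r, c in zip(reference, crosswalked))
    return 100.0 * hits / len(reference)


def smd(group_a, group_b):
    """Standardized mean difference between two subgroups (pooled SD assumed)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * stdev(group_a) ** 2
                  + (nb - 1) * stdev(group_b) ** 2) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def invariance_check(ref_a, ref_b, cw_a, cw_b, threshold=0.08):
    """Compare subgroup SMDs on the reference and crosswalked scores.

    ref_a / ref_b: reference scores for two subgroups (e.g., two age groups).
    cw_a / cw_b: crosswalked scores for the same two subgroups.
    """
    smd_ref = smd(ref_a, ref_b)
    smd_cw = smd(cw_a, cw_b)
    difference = abs(smd_ref - smd_cw)
    return {
        "same_direction": (smd_ref >= 0) == (smd_cw >= 0),
        "smd_difference": difference,
        "within_threshold": difference <= threshold,
    }
```

A crosswalk would be flagged on the age criteria when `same_direction` is false or `within_threshold` is false, with `threshold=0.08` matching the cutoff cited in the text.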

Table V-2: Results of the evaluation criteria for the validation dataset

| Validation Dataset | Expert: CAREeo - FIMeo | Equipercentile: CARE - CAREfromFIM | Equipercentile: FIM-M - FIMfromCARE | Rasch: CARE - CAREfromFIM | Rasch: FIM-M - FIMfromCARE |
|---|---|---|---|---|---|
| % RiU | Met Criteria | Met Criteria | Met Criteria | Met Criteria | Met Criteria |
| % Scores within ½ SD | Met Criteria | Met Criteria | Met Criteria | Met Criteria | Different |
| Sex direction | Met Criteria | Met Criteria | Met Criteria | Met Criteria | Met Criteria |
| Severity direction | Met Criteria | Met Criteria | Met Criteria | Met Criteria | Met Criteria |
| Race direction | Met Criteria | Met Criteria | Met Criteria | Met Criteria | Met Criteria |
| Age direction | Met Criteria | Different | Different | Different | Different |
| Cause direction | Met Criteria | Met Criteria | Met Criteria | Met Criteria | Met Criteria |
| Sex SMD difference | Met Criteria | Met Criteria | Met Criteria | Met Criteria | Met Criteria |
| Severity SMD difference | Met Criteria | Met Criteria | Met Criteria | Met Criteria | Met Criteria |
| Race SMD difference | Met Criteria | Met Criteria | Met Criteria | Met Criteria | Met Criteria |
| Age SMD difference | Met Criteria | Different | Different | Different | Different |
| Cause SMD difference | Met Criteria | Met Criteria | Met Criteria | Met Criteria | Met Criteria |
| Moment 1 | Met Criteria | Met Criteria | Met Criteria | Met Criteria | Met Criteria |
| Moment 2 | Met Criteria | Met Criteria | Met Criteria | Met Criteria | Met Criteria |
| Moment 3 | Met Criteria | Met Criteria | Met Criteria | Met Criteria | Met Criteria |
| Moment 4 | Met Criteria | Met Criteria | Met Criteria | Met Criteria | Met Criteria |
| Effect size | Met Criteria | Met Criteria | Met Criteria | Met Criteria | Met Criteria |

Perhaps the relative success of all the crosswalk methods can be attributed to the very high correlation between the original FIM-M and CARE scores, which corresponded to a RiU of 69.3%. With such a high RiU, perhaps a simple linear crosswalk would also have performed well. As in other studies that compared various types of crosswalks, there was no clear indication as to which crosswalk was best.
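To make the link between the correlation and the RiU concrete, the sketch below assumes the reduction-in-uncertainty definition from the linking literature, RiU = 1 - sqrt(1 - r^2). Under that assumption, a RiU of 69.3% corresponds to a correlation of roughly 0.95. The `linear_link` function is a hypothetical mean-sigma linear conversion, shown only to illustrate what a "simple linear crosswalk" could look like; it was not one of the methods evaluated in this study.

```python
from math import sqrt
from statistics import mean, stdev


def riu(r):
    """Reduction in uncertainty (as a proportion) implied by a correlation r."""
    return 1.0 - sqrt(1.0 - r * r)


def r_from_riu(riu_value):
    """Correlation implied by a given RiU (inverse of riu for r >= 0)."""
    return sqrt(1.0 - (1.0 - riu_value) ** 2)


def linear_link(x, x_scores, y_scores):
    """Mean-sigma linear linking: place x on the Y scale by matching mean and SD."""
    slope = stdev(y_scores) / stdev(x_scores)
    return mean(y_scores) + slope * (x - mean(x_scores))


# Correlation implied by the reported RiU of 69.3% (approximately 0.95)
implied_r = r_from_riu(0.693)
```

A linear conversion of this kind preserves only the mean and spread of the target scale, whereas the Equipercentile method also matches the shape of the score distribution.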

While each crosswalk performed well, some differences still exist between them. The Expert Opinion crosswalk would be the easiest to apply. If a researcher had the requisite variables for each instrument, a simple recalculation would suffice. However, this type of rescaling has two main drawbacks. First, it reduces the power of the instruments by not including all the items, and, by reducing the scoring range, it does not retain all the information from either the FIM-M or CARE when the items are rescored 1-4. Second, since a true crosswalk was not performed, the resulting scale does not intrinsically have the same meaning as either the FIM-M or CARE, so a score by itself is not directly comparable. For example, a score of 25 on the FIMeo is not comparable to either a FIM-M score or a CARE score.

The crosswalks developed using Rasch also have the issue of translating the logit scores. While keeping the logit scoring is statistically powerful and opens the door for analytics that make use of a true interval scale, describing what that score means can be problematic for some. The idea of a logit score does not easily translate to clinical functional behavior, which is better understood when a person can link the score for a FIM-M item (e.g., eating) to a level of functional independence (e.g., complete independence). Furthermore, the validation sample revealed some differences in the age population invariance and SMD criteria, meaning that there is some evidence that a separate crosswalk for each age category may be necessary. Of course, this difference may be due to the definition of the dichotomous variable. The dichotomy cut point was chosen to achieve a similar sample size in each group; however, a different cut point might be a better way of defining the age categories. If a researcher were more concerned with the analytic properties of the measure, then a Rasch crosswalk converted to logits would be the most practical.
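As a rough illustration of what a logit means, the sketch below uses the simplest (dichotomous) form of the Rasch model, in which the probability of success depends only on the difference between a person's ability and an item's difficulty, both in logits. This is a simplification: FIM-M and CARE items are rated on multiple levels, so a polytomous Rasch model would be needed in practice, and the function name here is hypothetical.

```python
from math import exp


def rasch_probability(person_logit, item_logit):
    """Dichotomous Rasch model: probability of success on an item.

    Both arguments are on the logit scale; only their difference matters.
    """
    return 1.0 / (1.0 + exp(-(person_logit - item_logit)))


# A person whose ability equals the item difficulty has a 50% chance of success.
p_equal = rasch_probability(0.0, 0.0)

# A person 1 logit above the item difficulty has about a 73% chance of success.
p_one_logit_above = rasch_probability(1.0, 0.0)
```

Because equal logit differences always correspond to the same change in odds, the logit scale behaves as an interval scale, which is the analytic advantage noted above.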

The crosswalks developed using the Equipercentile method were the only crosswalks to recalibrate the actual scores to those of the other assessment. Therefore, direct comparisons with the existing scoring schemes remained intact. For example, a person who scores 100 on the CARE would, using the conversion in Table IV-11, receive a FIM-M score of 67.5, so the meaning of the FIM-M score would remain intact. Similar to the crosswalk created with Rasch, the Equipercentile crosswalk also revealed some age discrepancies on the population invariance evaluation criteria when analyzing the validation dataset, again indicating that a separate crosswalk for each age category may be necessary.
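The general idea behind an equipercentile conversion can be sketched as follows: find the percentile rank of a score in its own distribution, then return the score at that same percentile in the other distribution. This is a minimal, unsmoothed illustration with hypothetical function names and toy data. It is not the procedure or software used to produce Table IV-11, and it omits the smoothing steps that equating applications typically include.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(x, scores):
    """Mid-percentile rank of x within scores, as a proportion between 0 and 1."""
    ordered = sorted(scores)
    below = bisect_left(ordered, x)
    at = bisect_right(ordered, x) - below
    return (below + 0.5 * at) / len(ordered)


def quantile(p, scores):
    """Score at proportion p of the distribution, with linear interpolation."""
    ordered = sorted(scores)
    position = p * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def equipercentile(x, from_scores, to_scores):
    """Convert x from one scale to another by matching percentile ranks."""
    return quantile(percentile_rank(x, from_scores), to_scores)


# Toy data only; these are not FIM-M or CARE scores.
toy_from = [10, 20, 30, 40, 50]
toy_to = [1, 2, 3, 4, 5]
converted = equipercentile(30, toy_from, toy_to)
```

Because the converted value is expressed on the target instrument's own scale, it can be read with that instrument's existing scoring scheme, which is the advantage described above.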

If only the training sample were used, then based on the assessment criteria, the Equipercentile methodology would be preferable. However, in the validation dataset, the Expert Opinion methodology was the only crosswalk to meet all of the evaluation criteria. While there was not a clear “best” methodology for a crosswalk between the FIM-M score and the CARE, it is clear that all the methodologies were successful and that each has strengths and limitations. Choosing a methodology should be left to the researcher or statistician, who can select the crosswalk that best fits the needs of a particular study. What can be concluded from this endeavor is that examining and evaluating various crosswalk methodologies is a beneficial means of maximizing the use of data collected from different measures.

Study Limitations

The main limitation of this study is the restriction of the dataset to a moderate to severe TBI population that was specifically enrolled into the TBIMS longitudinal study, and to only the admission or discharge time-points. While the TBIMS dataset has been shown to be mostly representative of the US population of adults receiving inpatient rehabilitation for a primary diagnosis of TBI,106 the one characteristic in which the dataset is not as representative is older age. This further calls into question the applicability of a single crosswalk for people of different ages. While the results of this study support using crosswalks between FIM-M and CARE based on a variety of methodologies, further research is needed to determine whether these crosswalks would remain viable for other diagnoses that receive post-acute care (e.g., spinal cord injury, stroke), or even for a less severe TBI population. Furthermore, additional research is needed to determine whether the CARE tool can also be crosswalked to both the OASIS and MDS, in order to completely evaluate whether the CARE assessment is able to maintain comparability with all of the assessments it has replaced. Another limitation of the CARE as it relates to functional outcome is that it does not capture cognitive function in a manner that allows it to be crosswalked. Thus, there is no FIM Cognitive subscale crosswalk, which arguably would be vitally important for assessing persons with a diagnosis of TBI or stroke.

Summary

Understanding the measurement of functional outcome in a PAC setting remains critical for care management and treatment evaluation. When those measures change over time, it becomes difficult to assess the function of persons who are followed longitudinally with different assessment tools. However, creating a crosswalk between the assessments is one way to overcome the limitation of having different people assessed with different instruments while still maintaining comparability. This study created and evaluated three crosswalk methods between the FIM-M and CARE using a sample of people who had sustained a moderate to severe TBI. Upon evaluation, all three crosswalk methods met acceptable criteria for use. Therefore, functional outcome can be compared between cohorts that have been assessed using different measures. The results indicate that researchers wanting to compare such cohorts could use any of the crosswalks. However, the particular application would help determine which crosswalk methodology should be used.

CHAPTER VI.

BIBLIOGRAPHY

1. Tian W. An all-payer view of hospital discharge to postacute care, 2013: Statistical Brief# 205. 2006.
2. Lawton MP. The functional assessment of elderly people. Journal of the American Geriatrics Society. 1971;19(6):465-481.
3. Dittmar SS, Gresham GE, National Institute on D, et al. Functional assessment and outcome measures for the rehabilitation health professional. Gaithersburg, Md.: Aspen Publishers; 1997.
4. Wilkerson DL, Batavia AI, DeJong G. Use of functional status measures for payment of medical rehabilitation services. Archives of physical medicine and rehabilitation. 1992;73(2):111-120.
5. Üstün TB, Chatterji S, Kostansjek N, Bickenbach J. WHO's ICF and functional status information in health records. Health care financing review. 2003;24(3):77.
6. Cohen ME, Marino RJ. The tools of disability outcomes research functional status measures. Archives of physical medicine and rehabilitation. 2000;81(12 Suppl 2):S21-29.
7. Services CfM. IRF-PAI. https://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/InpatientRehabFacPPS/IRFPAI. Accessed May 4, 2019.
8. Services CfM. Minimum Data Set 3.0 Public Reports. https://www.cms.gov/Research-Statistics-Data-and-Systems/Computer-Data-and-Systems/Minimum-Data-Set-3-0-Public-Reports. Accessed May 4, 2019.
9. Services CfM. Home Health Quality Reporting Program. https://www.cms.gov/Medicare/Quality-Initiatives-Patient-Assessment-Instruments/HomeHealthQualityInits. Accessed May 4, 2019.
10. Office GA. OASIS Data Use, Cost, and Privacy Implications. January 2001 2001.
11. Thomas KS, Dosa D, Gozalo PL, et al. A methodology to identify a cohort of Medicare beneficiaries residing in large assisted living facilities using administrative data. Medical care. 2018;56(2):e10.
12. Morris JN, Hawes C, Fries BE, et al. Designing the national resident assessment instrument for nursing homes. Gerontologist. 1990;30(3):293-307.
13. Shaughnessy PW, Schlenker RE, Hittle DF. Home health care outcomes under capitated and fee-for-service payment. Health Care Financ Rev. 1994;16(1):187-222.
14. Granger CV, Brownscheidle CM. Outcome measurement in medical rehabilitation. International journal of technology assessment in health care. 1995;11(2):262-268.
15. Medicare Cf, Medicaid Services H. Medicare program; prospective payment system for inpatient rehabilitation facilities. Final rule. Federal Register. 2001;66(152):41315.
16. Haley SM, Coster WJ, Andres PL, et al. Activity outcome measurement for postacute care. Medical Care. 2004:I49-I61.
17. Gage B, Ingber M, Smith L, et al. Post-Acute Care Payment Reform Demonstration: Final Report Volume 4 of 4. 2012.
18. Williams BC, Li Y, Fries BE, Warren RL. Predicting patient scores between the functional independence measure and the minimum data set: Development and performance of a fim-mds “crosswalk”. Archives of physical medicine and rehabilitation. 1997;78(1):48-54.
19. Buchanan JL, Andres PL, Haley SM, Paddock SM, Zaslavsky AM. An assessment tool translation study. Health care financing review. 2003;24(3):45.

20. Holland DE. The Medicare post-acute care payment reform initiative: impact and opportunity for case management. Prof Case Manag. 2008;13(1):37-42.
21. Gage B, Stineman M, Deutsch A, et al. Perspectives on the state-of-the-science in rehabilitation medicine and its implications for Medicare postacute care policies. Archives of physical medicine and rehabilitation. 2007;88(12):1737-1739.
22. Dorans NJ. Linking scores from multiple health outcome instruments. Quality of Life Research. 2007;16(1):85-94.
23. Dorans NJ, Moses TP, Eignor DR. Principles and practices of test score equating. ETS Research Report Series. 2010;2010(2):i-41.
24. Holland PW, Dorans NJ. Linking and equating. Educational measurement. 2006;4:187-220.
25. Kolen MJ, Brennan RL. Test equating, scaling, and linking. 2004.
26. Dorans NJ, Pommerich M, Holland PW. Linking and aligning scores and scales. Springer Science & Business Media; 2007.
27. Patz RJ, Yao L. Methods and models for vertical scaling. In: Linking and aligning scores and scales. Springer; 2007:253-272.
28. Yen WM. Vertical scaling and no child left behind. In: Linking and aligning scores and scales. Springer; 2007:273-283.
29. Pommerich M, Dorans NJ. Linking scores via concordance: Introduction to the special issue. In: Sage Publications Sage CA: Thousand Oaks, CA; 2004.
30. Holland PW, Dorans NJ, Petersen NS. 6 Equating Test Scores. Handbook of statistics. 2006;26:169-203.
31. Granger CV, Markello SJ, Graham JE, Deutsch A, Reistetter TA, Ottenbacher KJ. The uniform data system for medical rehabilitation: report of patients with traumatic brain injury discharged from rehabilitation programs in 2000-2007. American journal of physical medicine & rehabilitation. 2010;89(4):265-278.
32. Linacre JM, Heinemann AW, Wright BD, Granger CV, Hamilton BB. The structure and stability of the Functional Independence Measure. Archives of physical medicine and rehabilitation. 1994;75(2):127-132.
33. Bogasky S, Deutsch A, Kline CT, et al. Analysis of Crosscutting Medicare Functional Status Quality Metrics Using the Continuity and Assessment Record and Evaluation (CARE) Item Set. 2012.
34. Smith RM, Taylor PA. Equating rehabilitation outcome scales: Developing common metrics. Journal of Applied Measurement. 2004.
35. Granger C, Hamilton B, Keith R, Zielezny M, Sherwin F. Guide for the uniform data set for medical rehabilitation (Adult FIM). Buffalo, NY: State University of New York at Buffalo. 1993.
36. Hamilton BB, Granger CV. Totaled functional score. Can be valid. Archives of physical medicine and rehabilitation. 1989;70(12):861-863.
37. Stineman MG, Jette A, Fiedler R, Granger C. Impairment-specific dimensions within the functional independence measure. Archives of physical medicine and rehabilitation. 1997;78(6):636-643.
38. Heinemann AW, Linacre JM, Wright BD, Hamilton BB, Granger C. Relationships between impairment and physical disability as measured by the functional independence measure. Archives of physical medicine and rehabilitation. 1993;74(6):566-573.
39. Fiedler CR, Granger VC, Post AL. The Uniform Data System for Medical Rehabilitation: Report of First Admissions for 1998. American journal of physical medicine & rehabilitation. 2000;79(1):87-92.

40. Hamilton BB, Granger CV. Disability outcomes following inpatient rehabilitation for stroke. Physical therapy. 1994;74(5):494-503.
41. Granger CV, Hamilton BB, Keith RA, Zielezny M, Sherwin FS. Advances in functional assessment for medical rehabilitation. Topics in Geriatric Rehabilitation. 1986;1(3):59-74.
42. Granger CV, Karmarkar AM, Graham JE, et al. The uniform data system for medical rehabilitation: report of patients with traumatic spinal cord injury discharged from rehabilitation programs in 2002-2010. American journal of physical medicine & rehabilitation. 2012;91(4):289-299.
43. Hall KM, Cohen ME, Wright J, Call M, Werner P. Characteristics of the functional independence measure in traumatic spinal cord injury. Archives of physical medicine and rehabilitation. 1999;80(11):1471-1476.
44. Granger CV. The emerging science of functional assessment: our tool for outcomes analysis. Archives of physical medicine and rehabilitation. 1998;79(3):235-240.
45. MR U. Uniform Data Center for Medical rehabilitation. 2019; https://www.udsmr.org/products/inpatient-rehab. Accessed 3/20/2019.
46. Hsueh I-P, Lin J-H, Jeng J-S, Hsieh C-L. Comparison of the psychometric characteristics of the functional independence measure, 5 item Barthel index, and 10 item Barthel index in patients with stroke. Journal of Neurology, Neurosurgery & Psychiatry. 2002;73(2):188-190.
47. Dodds TA, Martin DP, Stolov WC, Deyo RA. A validation of the functional independence measurement and its performance among rehabilitation inpatients. Archives of physical medicine and rehabilitation. 1993;74(5):531.
48. Ottenbacher KJ, Hsu Y, Granger CV, Fiedler RC. The reliability of the functional independence measure: a quantitative review. Archives of physical medicine and rehabilitation. 1996;77(12):1226-1232.
49. Stineman MG, Shea JA, Jette A, et al. The Functional Independence Measure: tests of scaling assumptions, structure, and reliability across 20 diverse impairment categories. Archives of physical medicine and rehabilitation. 1996;77(11):1101-1108.
50. Granger CV, Cotter AC, Hamilton BB, Fiedler RC. Functional assessment scales: a study of persons after stroke. Archives of physical medicine and rehabilitation. 1993;74(2):133-138.
51. Oczkowski WJ, Barreca S. The functional independence measure: its use to identify rehabilitation needs in stroke survivors. Archives of physical medicine and rehabilitation. 1993;74(12):1291-1294.
52. Heinemann AW, Linacre JM, Wright BD, Hamilton BB, Granger C. Prediction of rehabilitation outcomes with disability measures. Archives of physical medicine and rehabilitation. 1994;75(2):133-143.
53. Kramer A, Holthaus D, Gage B, Green J, Saliba D, Coleman E. Uniform patient assessment for post-acute care: Final report. Aurora: US Division of Health Care Policy and Research. 2006.
54. Johnston MV, Graves D, Greene M. The uniform postacute assessment tool: systematically evaluating the quality of measurement evidence. Archives of physical medicine and rehabilitation. 2007;88(11):1505-1512.
55. Gage B, Constantine R, Aggarwal J, et al. The Development and Testing of the Continuity Assessment Record and Evaluation (CARE) Item Set, Volume 1 of 3. 2012.
56. Gage B, Smith L, Ross J, et al. The Development and Testing of the Continuity Assessment Record and Evaluation (CARE) Item Set: Final Report on Reliability Testing, Volume 2 of 3. 2012.
57. Chang KV, Hung CY, Kao CW, et al. Development and Validation of the Standard Chinese Version of the CARE Item Set (CARE-C) for Stroke Patients. Medicine (Baltimore). 2015;94(42):e1828.

58. Flanagan JC. Units, scores, and norms. Educational measurement. 1951:695-763.
59. Angoff WH. Educational measurement. American Council on Education; 1971.
60. Dorans NJ. Using subpopulation invariance to assess test score equity. Journal of Educational Measurement. 2004;41(1):43-68.
61. Braun HH, PW Observed-score test equating: A mathematical analysis of some ETS equating procedures. In: P. W. Holland and D. B. Rubin (Eds.), Test equating. New York: Academic Press; 1982.
62. Crocker L, Algina J. Introduction to classical and modern test theory. ERIC; 1986.
63. Rasch G. On general laws and the meaning of measurement in psychology. Paper presented at: Proceedings of the fourth Berkeley symposium on mathematical statistics and probability; 1961.
64. Lord FM. Practical applications of item characteristic curve theory. Journal of Educational Measurement. 1977;14(2):117-138.
65. Wright BD. Solving measurement problems with the Rasch model. Journal of educational measurement. 1977;14(2):97-116.
66. Feuer MJ, Holland PW, Green BF, Bertenthal MW, Hemphill FC. Uncommon measures: Equivalence and linkage among educational tests. ERIC; 1999.
67. Harris DJ, Crouse JD. A study of criteria used in equating. Applied Measurement in Education. 1993;6(3):195-240.
68. Marco GL. Item characteristic curve solutions to three intractable testing problems 1. ETS Research Bulletin Series. 1977;1977(1):i-41.
69. Kolen MJ. Comparison of traditional and item response theory methods for equating tests. Journal of Educational Measurement. 1981;18(1):1-11.
70. Rentz RR, Bashaw W. The national reference scale for reading: An application of the Rasch model. Journal of Educational Measurement. 1977:161-179.
71. Slinde JA, Linn RL. A note on vertical equating via the Rasch model for groups of quite different ability and tests of quite different difficulty. Journal of Educational Measurement. 1979:159-165.
72. Marco GL, Petersen NS, Stewart EE. A test of the adequacy of curvilinear score equating models. In: New horizons in testing. Elsevier; 1983:147-177.
73. Jaeger RM. Some exploratory indices for selection of a test equating method. Journal of Educational Measurement. 1981:23-38.
74. Skaggs G, Lissitz RW. An exploration of the robustness of four test equating models. Applied Psychological Measurement. 1986;10(3):303-317.
75. Moses T, Zhang W. Standard errors of equating differences: Prior developments, extensions, and simulations. Journal of Educational and Behavioral Statistics. 2011;36(6):779-803.
76. Kendall M, Stuart A. The advanced theory of statistics. Vol. 1: Distribution theory. London: Griffin, 1977, 4th ed. 1977.
77. Liou M, Cheng PE. Asymptotic standard error of equipercentile equating. Journal of Educational and Behavioral Statistics. 1995;20(3):259-286.
78. Wang T. Standard errors of equating for the percentile rank–based equipercentile equating with log-linear presmoothing. Journal of educational and behavioral statistics. 2009;34(1):7-23.
79. Holland PW, Thayer DT. The kernel method of equating score distributions. ETS Research Report Series. 1989;1989(1):i-45.
80. Aşiret S, Sünbül SÖ. Investigating test equating methods in small samples through various factors. Educational Sciences: Theory & Practice. 2016;16(2).

81. Dorans NJ, Lawrence IM. Checking the statistical equivalence of nearly identical test editions. Applied Measurement in Education. 1990;3(3):245-254.
82. Livingston SA, Dorans NJ, Wright NK. What combination of sampling and equating methods works best? Applied Measurement in Education. 1990;3(1):73-95.
83. Moses T. A Comparison of Statistical Significance Tests for Selecting Equating Functions. Applied Psychological Measurement. 2009;33(4):285.
84. Fisher WP, Jr., Eubanks RL, Marier RL. Equating the MOS SF36 and the LSU HSI Physical Functioning Scales. J Outcome Meas. 1997;1(4):329-362.
85. Buchanan JL, Andres PL, Haley SM, Paddock SM, Zaslavsky AM. An assessment tool translation study. Health Care Financ Rev. 2003;24(3):45-60.
86. Velozo CA, Byers KL, Wang Y-C, Roberts Joseph B. Translating measures across the continuum of care: Using Rasch analysis to create a crosswalk between the Functional Independence Measure and the Minimum Data Set. The Journal of Rehabilitation Research and Development. 2007;44(3):467.
87. Wang YC, Byers KL, Velozo CA. Validation of FIM-MDS crosswalk conversion algorithm. J Rehabil Res Dev. 2008;45(7):1065-1076.
88. Haley SM, Ni P, Lai JS, et al. Linking the activity measure for post acute care and the quality of life outcomes in neurological disorders. Archives of physical medicine and rehabilitation. 2011;92(10 Suppl):S37-43.
89. ten Klooster PM, Voshaar MAO, Gandek B, et al. Development and evaluation of a crosswalk between the SF-36 physical functioning scale and Health Assessment Questionnaire disability index in rheumatoid arthritis. Health and quality of life outcomes. 2013;11(1):199.
90. Oude Voshaar MA, Ten Klooster PM, Taal E, et al. Linking physical function outcomes in rheumatology: performance of a crosswalk for converting Health Assessment Questionnaire scores to Short Form 36 physical functioning scale scores. Arthritis Care Res (Hoboken). 2014;66(11):1754-1758.
91. Noonan VK, Cook KF, Bamer AM, Choi SW, Kim J, Amtmann D. Measuring fatigue in persons with multiple sclerosis: creating a crosswalk between the Modified Fatigue Impact Scale and the PROMIS Fatigue Short Form. Qual Life Res. 2012;21(7):1123-1133.
92. Schalet BD, Rothrock NE, Hays RD, et al. Linking Physical and Mental Health Summary Scores from the Veterans RAND 12-Item Health Survey (VR-12) to the PROMIS(R) Global Health Scale. J Gen Intern Med. 2015;30(10):1524-1530.
93. Ghomrawi HM, Lee YY, Herrero C, et al. A Crosswalk Between UCLA and Lower Extremity Activity Scales. Clin Orthop Relat Res. 2017;475(2):542-548.
94. Hong I, Woo HS, Shim S, Li CY, Yoonjeong L, Velozo CA. Equating activities of daily living outcome measures: the Functional Independence Measure and the Korean version of Modified Barthel Index. Disability and rehabilitation. 2018;40(2):217-224.
95. International R. Inpatient Rehabilitation Facility Quality Reporting Program- Specifications for the Quality Measures Adopted through Fiscal Year 2016 Final Rule. 2016.
96. Hanson BA, Zeng L, Colton D. A Comparison of Presmoothing and Postsmoothing. American College Testing. 1994.
97. Albano AD. equate: An R package for observed-score linking and equating. Journal of Statistical Software. 2016;74(8):1-36.
98. Masters GN. Common-Person Equating with the Rasch Model. Applied Psychological Measurement. 1985;9(1):73-82.
99. Dorans NJ, Holland PW. Population invariance and the equatability of tests: Basic theory and the linear case. Journal of educational measurement. 2000;37(4):281-306.

100. von Davier AA, Liu M, Gao X, et al. Population invariance of test equating and linking: Theory extension and applications across exams. ETS Research Report Series. 2006;2006(2):i-197.
101. Cohen J. Statistical power analysis for the behavioral sciences. Routledge; 2013.
102. McNemar Q. Psychological statistics. 3d ed. New York: Wiley; 1962.
103. Lord FM, Wingersky MS. Comparison of IRT true-score and equipercentile observed-score "equatings". Applied Psychological Measurement. 1984;8(4):453-461.
104. Nakase-Richardson R, Stevens LF, Dillahunt-Aspillaga C, et al. Predictors of employment outcomes in veterans with traumatic brain injury: a VA traumatic brain injury model systems study. Journal of head trauma rehabilitation. 2017;32(4):271-282.
105. Linacre M. Equating and linking test. https://www.winsteps.com/winman/equating.htm. Accessed 6/23/2020, 2020.
106. Corrigan JD, Cuthbert JP, Whiteneck GG, et al. Representativeness of the traumatic brain injury model systems national database. The Journal of head trauma rehabilitation. 2012;27(6):391.
