ORIGINAL ARTICLE

Training and Interpreting Machine Learning Algorithms to Evaluate Fall Risk After Emergency Department Visits

Brian W. Patterson, MD, MPH,* Collin J. Engstrom, MS,† Varun Sah, MS,† Maureen A. Smith, MD, PhD,§∥ Eneida A. Mendonça, MD, PhD,¶# Michael S. Pulia, MD, MS,* Michael D. Repplinger, MD, PhD,* Azita G. Hamedani, MD, MBA, MPH,* David Page, PhD,† and Manish N. Shah, MD, MPH*§**

Background: Here we examine the potential utility of an automated risk stratification and referral intervention to screen older adults for fall risk after emergency department (ED) visits.

Objective: This study evaluated several machine learning methodologies for the creation of a risk stratification algorithm using electronic health record data and estimated the effects of a resultant intervention based on algorithm performance in test data.

Methods: Data available at the time of ED discharge were retrospectively collected and separated into training and test datasets. Algorithms were developed to predict the outcome of a return visit for fall within 6 months of an ED index visit. Models included random forests, AdaBoost, and regression-based methods. We evaluated models both by the area under the receiver operating characteristic (ROC) curve, also referred to as area under the curve (AUC), and by projected clinical impact, estimating number needed to treat (NNT) and referrals per week for a fall risk intervention.

Results: The random forest model achieved an AUC of 0.78, with slightly lower performance in regression-based models. Algorithms with similar performance, when evaluated by AUC, differed when placed into a clinical context with the defined real-world scenario.

Conclusion: The ability to translate the results of our analysis to the potential tradeoff between referral numbers and NNT offers decisionmakers the ability to envision the effects of a proposed intervention before implementation.

Key Words: falls, screening, electronic health record, machine learning, emergency medicine

(Med Care 2019;57:560-566)

From the *BerbeeWalsh Department of Emergency Medicine, University of Wisconsin School of Medicine and Public Health; †Department of Computer Sciences, University of Wisconsin-Madison; ∥Health Innovation Program; the Departments of Family Medicine, Medical Informatics, and Pediatrics, University of Wisconsin School of Medicine and Public Health; and **Division of Geriatrics and Gerontology, Department of Medicine, University of Wisconsin School of Medicine and Public Health, Madison, WI.

Supported by funding from the Agency for Healthcare Research and Quality (AHRQ), grant numbers K08HS024558 (B.W.P.), K08HS024342 (M.S.P.), and K08DK111234 (M.D.R.), as well as the National Institutes of Health (NIH), grant number K24AG054560 (M.N.S.). The research was also supported by the Clinical and Translational Science Award (CTSA) program, through the NIH National Center for Advancing Translational Sciences (NCATS), grant UL1TR000427.

The content is solely the responsibility of the authors and does not necessarily represent the official views of the Agency for Healthcare Research and Quality or the NIH.

The authors declare no conflict of interest.

Supplemental Digital Content is available for this article. Direct URL citations appear in the printed text and are provided in the HTML and PDF versions of this article on the journal's website, www.lww-medicalcare.com.

Reprints: Brian W. Patterson, MD, MPH, University of Wisconsin School of Medicine and Public Health, 800 University Bay Drive, Suite 310, Mail Code 9123, Madison, WI 53705. E-mail: [email protected].

Copyright © 2019 Wolters Kluwer Health, Inc. All rights reserved.
ISSN: 0025-7079/19/5707-0560

Falls among older adults are a major concern, associated with significant morbidity and mortality.1,2 Despite guidelines and quality measures recommending screening,3,4 screening for falls remains inconsistent in the primary care setting.5,6 Emergency department (ED) patients are generally at higher risk of falls than the general outpatient population, making the ED an important additional setting to identify high-risk patients.10 Although guidelines recommend screening for fall risk in the ED,12 this practice has not been widely implemented for many reasons, including the screening effort burden in the high-intensity, high-volume ED setting and the limited availability of referrals for intervention.13,14 Despite previous efforts, no existing screening tools satisfy the need for a scalable, adaptable, and measurable instrument suitable for widespread implementation.

One potential solution to increase screening rates without requiring significant additional resources in the ED is through the development and implementation of an algorithm to screen patients using the information present in the electronic health record (EHR) at the time of an ED visit. Recently, implementation of machine learning-derived algorithms for predicting health risk across a broad range of clinical scenarios has seen a sharp rise.15 Machine learning is increasingly used for risk stratification in health care. Achieving accurate predictive models, however, does not improve outcomes if they cannot be translated into efficacious interventions.

Often, performance of these algorithms is evaluated and compared using the area under the receiver operating characteristic (ROC) curve, also referred to as area under the curve (AUC) or C-statistic. Algorithms offering superior classification based on AUC are proposed as candidates suitable for implementation.19,20 A single number may, however, do a poor job of conveying an algorithm's performance for a predictive task in a particular clinical context, which may require a balance of sensitivity and specificity.21,22

Clinicians are generally interested in applying an algorithm to aid in risk stratification for a particular scenario, such as ruling out a rare diagnosis, confirming a particular diagnosis, or reducing population risk via an intervention, in this case referral for a fall risk reduction intervention.

Such an intervention already exists at our institution in the form of a multidisciplinary falls clinic. On the basis of prior literature, we estimate a 38% relative risk reduction in future falls for patients enrolled in such a program.23 Currently, very few referrals are made to the falls clinic from the ED. Before initiating an automated referral program, decisionmakers must understand both the anticipated number of referrals generated and the effectiveness of such referrals in preventing future falls. To do so, decisionmakers may be better served by extrapolations of a model's performance in a given population than by test characteristics such as AUC. This information would allow a clinical site to select the most appropriate risk stratification algorithm, and the most appropriate threshold point, to maximize patient benefit within the constraints of available resources and acceptable effectiveness. In this study, we developed several machine learning models to predict 6-month fall risk after an ED visit. We evaluated these models both using AUC analysis and by interpreting model performance to describe potential clinical tradeoffs more concretely in terms of referrals per day and numbers needed to treat (NNT) for prevention of a fall.

METHODS

Study Design and Setting

We performed a retrospective study using patient EHR data at a single academic medical center ED with level 1 trauma center accreditation and approximately 60,000 patient visits per year. The goal of developing the models was to create an alert at the time of an ED visit suggesting referral of patients who are at heightened risk of fall for an existing multidisciplinary falls intervention. In our case, based on discussions with our falls clinic, an estimated 10 referrals per week were seen as operationally feasible. Using the available EHR data, we created risk-stratification models for fall revisits to the ED. Our outcome of interest was a fall visit to the same ED in the 6 months after an index visit. Although this paper focuses on predicting fall revisits, the methodology we describe is robust and lends itself to any clinical risk stratification prediction task.

Data Selection and Retrieval

EHR data for patients aged 65 years and older who visited the study ED were acquired for a duration of 43 months starting January 2013, with an additional 6 months of follow-up data collected for outcome determination. Available EHR features were evaluated for inclusion under the conceptual framework of the Andersen Behavioral Model of Health Services Use, a well-established model which provides a context for characterizing the many factors which lead to health care utilization.24-27 This model has been used to frame numerous prior studies involving ED use and falls among older adults.28,29 For each visit, discrete data available within the EHR at the time of the ED visit were collected to create data features including patient demographics, historical visits, and visit patterns and diagnoses, as well as visit-specific information including timing, laboratory tests performed and results thereof, vital signs, chief complaint, and discharge diagnoses.

Features were selected based on their availability, clinical relevance, and potential to provide predictive value for fall revisit risk estimation. Another important criterion for feature selection was to exclude attributes that contained information obtained after an index visit.

The data were organized and analyzed at the level of an ED visit (as opposed to the patient level), as our objective was to stratify risk for a fall revisit based on index visit data alone. Visits by patients who were transferred from other health care facilities were rejected as part of our primary exclusion criteria. We excluded visits that resulted in hospital admissions, as our algorithm would only be implemented for patients who were discharged from the ED. Finally, we excluded patients who did not have a primary care provider (PCP) in our network, as our intervention was specifically aimed towards referring in-network patients. At the end of the exclusion procedures, we were left with 10,030 records.

Feature Preparation

The encoding process for features depended on whether they were numerical or categorical in nature. Numerical features such as age, vital signs during the index ED visit, duration of the index visit, and the number of primary care or hospital visits in the 6 months before the index visit were treated as continuous values. Attributes related to the Elixhauser comorbidity index, Hendrich II score, patients' demographics, medications, and laboratory results were treated as categorical variables. In the case of numerical features, we dropped records that had missing values, due to the relatively small number of records that were incomplete in this regard, which left us with 9687 records. For categorical variables, however, missing values were considered as a separate category; in general, the absence of most categorical features could be potentially informative for decision making by the predictive models. At the end of the feature engineering process, we obtained our final dataset, which comprised 725 features. The feature preparation phase was completely independent of outcome status.

Model Development

Once our features were selected and prepared, we created predictive models from the data. We tested several regression-based methodologies, including thresholded linear regression and logistic regression, both unregularized and including lasso30 and ridge31 penalties. We also included 2 tree-based methodologies: random forests32 and AdaBoost.33 Appendix A (Supplemental Digital Content 1, http://links.lww.com/MLR/B801) provides a nontechnical description of the methods used. Models were generated using the Scikit-learn package in Python.34 The dataset created at the end of feature preparation was split into training and test sets in a 3:1 ratio. We split the data chronologically, with the final 25% of visits kept as a holdout test set and the earliest 75% of data retained as a training set. The training set was further split, again chronologically in a 3:1 ratio, to create a tuning set for interim validation.

Models were initially trained on the smaller training set, where tunable parameters were varied using a grid search pattern to achieve best results within the tuning set. Finally, we picked the 6 models that performed best on the tuning set and trained 1 model of each type on the entire training set. These models were then evaluated on the test data that had been held out during the


previous phase. As our dataset was skewed, with more patients who did not fall than those who did, we upsampled the positive class records while training models to provide a weighting effect to incentivize correct classification of fall cases. This was achieved by randomly duplicating positive cases in the training set until their frequency equaled that of negative cases. Upsampling was carried out only after the training set had been split into a tuning set, to ensure that no duplicate records created as a result of upsampling were members of both the training and tuning sets. Further, the test set was not subjected to any upsampling, to maintain the true population distribution in the evaluation set and simulate performance assessment on future data.

Model Evaluation

Our initial evaluation of the trained models involved comparing the AUC. 95% confidence intervals (CIs) were generated in STATA (StataCorp., College Station, TX) using nonparametric bootstrapping with the Rocreg command and 1000 iterations.35 We then generated classification statistics for each model at each potential threshold value, consisting of performance within the evaluation set in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). We were able to use these data to extrapolate both referrals per week and NNT.36 Estimated referrals per week were calculated by taking the total percentage of TP and FP results (all patients flagged "positive") at a given threshold from each model and multiplying by the weekly visit volume. NNT was estimated by assuming that the falls reduction clinic would provide a relative risk reduction of 38% (95% CI: 21%-52%) based on the results of the PROFET randomized trial, which studied a similar intervention in practice and found the percentage of fallers decreased from 52% to 32% in a high-risk cohort of patients discharged from the ED.23 Relative risk reduction and CIs were generated from the reported PROFET data using STATA. The absolute fall risk for a population of patients above a given risk threshold in our models was calculated as the ratio of TPs (patients we predicted would fall who did go on to fall) to all model-identified positives at or above the risk threshold in the test dataset (TP/(TP+FP)). This absolute risk was multiplied by the relative risk reduction of 0.38 to estimate an absolute risk reduction, and the inverse of the absolute risk reduction was taken to generate the NNT.36 For instance, if the absolute fall risk in the flagged positive group was 60%, the estimated NNT was 1/(0.38 × 0.6) = 4.4 referrals per fall prevented. These projected performance measures were used to create plots that visually described the tradeoff between risk reduction gained per referral and number of referrals expected per day. We have created a toolkit for recreating these plots from ROC data, which is available at https://www.hipxchange.org/NNT.

FIGURE 1. Patient allocation. Once the study population was defined, it was split at a 3:1 ratio into training and test sets. The training set was further split to create an intermediate tuning set. ED indicates emergency department; PCP, primary care provider.

RESULTS

We had 32,531 visits to the ED during the study period by adults aged 65 and older, of which 9687 were both discharged and had a PCP in our network and full numerical data, making up our study population (Fig. 1). Within this population, 857 patients returned within 6 months for a fall-related visit; the overall return rate for fall within 6 months was 8.8%. Demographics of patients by the outcome are presented in Table 1. As compared with patients who did not return for falls, those with falls were similar with regards to sex and insurance status but were older, more likely to have fallen on their index visit, and more likely to have been brought to the ED by an ambulance.

TABLE 1. Characteristics of Analyzed Visits

                                          All Analyzed    Visits Without 180-Day    Visits With 180-Day
                                          Visits, n (%)   Return for Fall, n (%)    Return for Fall, n (%)
N                                         9687            8830                      857
Age [mean (SD)]                           76.0 (8.4)      75.7 (8.3)                79.3 (8.9)
Female                                    5863 (60.5)     5286 (59.9)               577 (67.3)
White race                                8980 (92.7)     8187 (92.7)               793 (92.5)
Insurance status
  Medicare                                8444 (87.2)     7705 (87.3)               739 (86.2)
  Commercial/workers' compensation        1210 (12.5)     1095 (12.4)               115 (13.4)
  Other/self-pay                          26 (0.3)        23 (0.3)                  3 (0.4)
Mode of arrival
  Family or self                          6641 (68.6)     6263 (70.9)               378 (44.1)
  Emergency medical services or police    3046 (31.4)     2567 (29.1)               479 (55.9)
Fall at index visit                       1543 (15.9)     1267 (14.4)               272 (31.7)

When comparing models based on AUC, the random forest model achieved an AUC of 0.78 (95% CI: 0.74-0.81), and AdaBoost also had an AUC of 0.78 (95% CI: 0.74-0.81). These tree-based models were the highest performers, followed by ridge-penalized logistic regression at 0.77 (95% CI: 0.73-0.80), lasso-penalized logistic regression at 0.76 (95% CI: 0.73-0.80), unpenalized linear regression at 0.74 (95% CI: 0.71-0.78), and unpenalized logistic regression at 0.72 (95% CI: 0.68-0.76). Figure 2 shows AUC plots for all tested machine learning models.
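The extrapolation described under Model Evaluation reduces to a few lines of arithmetic. The sketch below is illustrative only: the confusion-matrix counts and the weekly volume of roughly 1150 visits (about 60,000 per year) are invented stand-ins, not the study's actual counts.

```python
def projected_impact(tp, fp, n_test, weekly_volume, rrr=0.38):
    """Translate test-set counts at one model threshold into operational
    terms: expected referrals per week and number needed to treat (NNT)."""
    flagged = tp + fp                          # everyone the model would refer
    referrals_per_week = weekly_volume * flagged / n_test
    absolute_risk = tp / flagged               # fall-revisit risk among referrals
    nnt = 1.0 / (rrr * absolute_risk)          # worked example in text: 1/(0.38*0.6) = 4.4
    return referrals_per_week, nnt

# Hypothetical threshold flagging 120 of 9687 test visits, 25 of them true fallers:
refs, nnt = projected_impact(tp=25, fp=95, n_test=9687, weekly_volume=1150)
print(round(refs, 1), round(nnt, 1))  # ~14.2 referrals/week at an NNT of ~12.6
```

Lowering the threshold raises both numbers at once, which is exactly the tradeoff the plots in this study are meant to expose.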



Appendix B (Supplemental Digital Content 2, http://links.lww.com/MLR/B802) describes model parameters and variable importance.

FIGURE 2. Area under the receiver operating characteristic curve (AUC) for the random forest, elastic net regression, lasso regression, AdaBoost, ridge regression, and linear regression models.

Models were further characterized by estimating both the number of referrals per week from the study ED and the NNT of referred patients. Figure 3 shows these plots. In this analysis, we present the relationship between increasing the number of patients referred and the decrease in effectiveness per referral as the threshold for defining "high risk" is lowered. The plots additionally contain 2 fixed points for reference: a "refer all patients" scenario, in which all patients are marked as high risk, and a "perfect model" scenario, in which the model predicts with 100% accuracy which patients would go on to fall without the intervention and refers only these patients. In our case, the best achievable NNT is 2.6, in the case where a 38% relative risk reduction is applied to a population at 100% risk of falling. Table 2 illustrates model performance in terms of predicted NNT at various referrals per week. At the predefined threshold of 10 referrals per week (setting a high-risk threshold), the random forest model outperformed the other models, generating an NNT of 12.4. At other thresholds, ridge regression and AdaBoost outperformed the random forest model. The lasso and nonpenalized regression models had poorer performance across the spectrum of anticipated referrals.

DISCUSSION

The various machine learning models tested in this study differed in their ability to predict falls, with the random forest and AdaBoost models offering the best overall performance with an AUC of 0.78. On the basis of AUC alone, penalized regression-based models including ridge-penalized logistic regression offered similar performance with an AUC of 0.77. This result is consistent with other studies evaluating the performance of tree-based algorithms alongside regression-based methodologies.16,18,37-39 As opposed to traditional methods, tree-based methodologies have an improved ability to deal with complex variable interactions and nonlinear effects in large databases, which may explain their advantage in these instances.40

When translating the models into potential deliverable performance at individual thresholds, the random forest-based approach offers the best performance in terms of NNT versus referrals in the proposed operational scenario, offering the ability to refer 10 patients per week at an NNT of 12.4 referrals to reduce the risk of an ED revisit for fall. Although these data are technically inferable from the shape of the receiver operating characteristic curves, the degree of distinction between the models would likely not be apparent from visual inspection alone to a reader not already expert in machine learning or statistics.

Algorithms derived by machine learning have become increasingly common in medicine, with significant excitement surrounding their potential to improve the ability to risk stratify patients.41,42 Unfortunately, gaps still exist between the ability to predict a potentially avoidable event and specific actionable interventions.43 In the majority of studies evaluating machine learning techniques, model performance is reported based on AUC or test characteristics such as sensitivity and specificity.44 These test characteristics may be useful for establishing predictive performance generally but may be misleading when not set into clinical context.21 Once ROC curves have been generated for a given risk stratification model in test data, calculating additional information including NNT and anticipated referrals requires only an algebraic transformation of the data, as long as a proposed intervention has been identified along with its estimated effectiveness. The curves generated for this study communicate this tradeoff to policymakers and provide a basis for comparison of the anticipated "real-world" effects of model performance.

For any particular harm reduction intervention, there is a tradeoff when choosing a risk cutoff for referral. The most total harm reduction would be accomplished by simply referring all patients in a given population; however, such nonspecific referral would be costly in terms of time and resource use, and inefficient, as many low-risk patients would receive minimal benefit or potentially be exposed to harms of intervention. At the same time, selecting only those patients who are at extremely high risk of harm reduces the overall potential benefit of a risk reduction strategy by not offering it to a large proportion of patients who will go on to have the outcome of interest. In our example, where a set number of referral slots per week was available and the task was to select the highest risk patients to fill those slots, the random forest algorithm was the best performer. If there had been only 5 referral spots available, however, the ridge-penalized logistic regression model would have been the top performer. Despite an overall slightly lower AUC, the ridge-penalized model had better performance in selecting those 5 patients at highest risk, achieving an NNT of approximately 10 versus approximately 12 for the tree-based models. If the intervention tied to the algorithm were a referral to a less resource-intensive community-based falls prevention program with more availability, policymakers may be looking in a region of higher referrals per week and higher NNT; in this region, model performance was generally similar between the various models.

The projections of performance generated in this study were based on model performance on a set of test data which immediately followed the training data chronologically. Although these projections are expected to help policymakers envision potential operational performance, they are not intended to replace the evaluation of performance during and after implementation.
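The point argued above, that models with identical AUCs can imply very different workloads once a referral budget is fixed, can be made concrete with a toy calculation. The risk scores below are synthetic, chosen only so that both hypothetical models share an AUC of 0.75; the 0.38 relative risk reduction mirrors the PROFET-derived estimate used in this study.

```python
def auc(pos, neg):
    """AUC via the Mann-Whitney statistic: the probability that a random
    positive case outscores a random negative one (ties count half)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def nnt_at_budget(pos, neg, budget, rrr=0.38):
    """Refer the `budget` highest-scoring patients and compute
    NNT = 1 / (relative risk reduction x absolute risk among referrals)."""
    ranked = sorted([(s, 1) for s in pos] + [(s, 0) for s in neg], reverse=True)
    risk = sum(label for _, label in ranked[:budget]) / budget
    return 1.0 / (rrr * risk)

# Synthetic risk scores for 4 eventual fallers (pos) and 4 non-fallers (neg):
model_a = dict(pos=[0.9, 0.8, 0.4, 0.3], neg=[0.7, 0.6, 0.2, 0.1])
model_b = dict(pos=[0.95, 0.7, 0.65, 0.3], neg=[0.9, 0.5, 0.2, 0.1])

print(auc(**model_a), auc(**model_b))      # both 0.75
print(nnt_at_budget(**model_a, budget=2))  # ~2.6: both referrals are fallers
print(nnt_at_budget(**model_b, budget=2))  # ~5.3: same AUC, double the NNT
```

Model B ranks one non-faller above all but one faller, so at a tight budget half of its referrals are wasted even though its overall discrimination, as summarized by AUC, is identical to model A's.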



Machine learning models are tuned to specific population parameters, and subject to calibration drift as patient and data characteristics change over time,45 necessitating continued post-implementation monitoring to ensure effective results.

FIGURE 3. Number needed to treat (NNT) versus anticipated referrals per week. This line shows the tradeoff between rising NNT and a rising number of potential referrals as a lower threshold for risk is selected within the model. The square represents a potential scenario in which all patients are referred regardless of model risk. The triangle represents the performance of a model with perfect discrimination (one which only refers patients who would definitely fall in the future and no one else). Error bars represent the 95% confidence interval of the relative risk estimation.

To our knowledge, 3 ED-specific fall screening instruments have been examined: Carpenter et al7 examined a number of factors for association with future falls, proposing a screen of 4 independent factors and reporting a 4% probability of falling in their lowest risk group and 42% among the highest. Tiedemann et al9 developed and externally validated a screening instrument with an AUC of 0.70, and Greenberg et al46 utilized modified CAGE criteria but did not report fall outcomes in their pilot. As compared with these prior efforts, the machine learning-derived algorithms here offer improved performance in terms of test characteristics, and the advantage



TABLE 2. Model Performance at Various Referrals Per Week Thresholds

Referrals Per Week    Random    AdaBoost    Ridge Logistic    Lasso Logistic    Linear        Logistic
Threshold             Forest                Regression        Regression        Regression    Regression
5                     12.74     11.94       10.03*            10.70             10.70         11.08
10                    12.41*    13.82       13.13             13.13             13.70         14.01
15                    15.36     15.49       15.24*            15.75             17.18         17.50
20                    18.65     17.40*      18.52             18.38             18.79         20.15
25                    21.28     20.76       21.27*            21.71             22.32         22.97
30                    23.60*    24.00       24.68             24.52             25.52         26.22
35                    26.96*    27.21       27.19             27.02             28.06         28.79
40                    29.91     29.44*      29.79             30.14             30.51         31.27
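Rows like those in Table 2 can be produced by sweeping the decision threshold and converting each threshold's test-set counts into referrals per week and NNT. In this sketch the (TP, FP) pairs and the weekly volume of roughly 1150 visits are invented for illustration, not the study's counts.

```python
def nnt_vs_referrals(counts, n_test, weekly_volume, rrr=0.38):
    """For each threshold's (tp, fp) pair, return one (referrals_per_week,
    nnt) point of the tradeoff curve plotted in Figure 3."""
    curve = []
    for tp, fp in counts:
        flagged = tp + fp
        referrals = weekly_volume * flagged / n_test
        nnt = 1.0 / (rrr * (tp / flagged))
        curve.append((referrals, nnt))
    return curve

# Invented counts at three successively looser thresholds:
sweep = [(20, 60), (35, 150), (50, 300)]
for referrals, nnt in nnt_vs_referrals(sweep, n_test=9687, weekly_volume=1150):
    print(f"{referrals:5.1f} referrals/week -> NNT {nnt:4.1f}")
```

As the threshold loosens, referrals per week grow while the flagged group's absolute risk falls, so NNT rises: the same monotone pattern visible down each column of Table 2.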

*Best performing model (lowest number needed to treat) at each referral per week threshold.

of not requiring the devotion of scant ED resources to in-person screening.42

Limitations

When generating our NNT, we assumed that the relative risk reduction generated by our proposed intervention would remain constant across varying absolute risks. This assumption, while broadly made in the medical decision making literature,47,48 is a simplification that is often, but not always, true. Furthermore, for the sake of simplifying our calculations, we assumed that all patients referred for fall intervention would attend the required intervention. If an estimate of the likelihood of completed referral were available, it could be taken into account in the NNT calculation.

We presented our NNT versus anticipated referrals per week curves with error bars based on the effectiveness estimate from the PROFET trial. PROFET measured the effectiveness of an intervention similar to our own falls clinic, but on a somewhat different outcome (any reported fall vs. ED visit for fall) and with somewhat different inclusion criteria (only selected older adults reporting to the ED for fall as opposed to all older adults). Given the relatively wide CI of the PROFET results, we feel the included error bars provide a reasonable estimate of uncertainty; however, these could be widened to incorporate the estimated impact of other sources of potential variation in predicted effectiveness.

During model development, we chose to censor visits which were missing data features encoded as continuous variables (categorical variables were encoded to allow a "missing" category). Although the inclusion of only complete records has the potential to introduce bias,49 only 343 (3%) of records were dropped for incompleteness, suggesting minimal potential for change in algorithm performance if these data were imputed.

Our model was trained on an outcome of return visits to our ED for falls. Patients who fell may in some instances have presented to other EDs, in which case they were not captured by our definition. We limited our analysis to patients with a PCP in our system, and only analyzed patients who presented to our ED in an index visit, in an attempt to minimize this risk.

CONCLUSIONS

In this analysis, we developed an algorithm which had an AUC of 0.78 for prediction of a return visit to the ED for fall within 6 months of an index visit. Placed in the clinical context of harm reduction, this offered the ability to refer 10 patients per week to our fall clinic with a predicted NNT of 12 referrals to reduce the risk of a single fall. Our ability to translate the results of our analysis to the potential tradeoff between referral numbers and NNT offers decisionmakers the ability to envision the effects of a proposed intervention before implementation.

REFERENCES

1. Centers for Disease Control and Prevention (CDC). Fatalities and injuries from falls among older adults—United States, 1993-2003 and 2001-2005. MMWR Morb Mortal Wkly Rep. 2006;55:1221-1224.
2. Sterling DA, O'Connor JA, Bonadies J. Geriatric falls: injury severity is high and disproportionate to mechanism. J Trauma. 2001;50:116-119.
3. Kenny RA, Rubenstein LZ, Tinetti ME, et al. Summary of the updated American Geriatrics Society/British Geriatrics Society clinical practice guideline for prevention of falls in older persons. J Am Geriatr Soc. 2011;59:148-157.
4. Centers for Medicare & Medicaid Services. 2016 Physician Quality Reporting System (PQRS) measures groups specifications manual; 2016.
5. Phelan EA, Mahoney JE, Voit JC, et al. Assessment and management of fall risk in primary care settings. Med Clin North Am. 2015;99:281-293.
6. Landis SE, Galvin SL. Implementation and assessment of a fall screening program in primary care practices. J Am Geriatr Soc. 2016;62:2408-2414.
7. Carpenter C, Scheatzle M, D'Antonio J, et al. Identification of fall risk factors in older adult emergency department patients. Acad Emerg Med. 2009;16:211-219.
8. Carpenter CR, Avidan MS, Wildes T, et al. Predicting geriatric falls following an episode of emergency department care: a systematic review. Acad Emerg Med. 2014;21:1069-1082.
9. Tiedemann A, Sherrington C, Orr T, et al. Identifying older people at high risk of future falls: development and validation of a screening tool for use in emergency departments. Emerg Med J. 2013;30:918-922.
10. Weigand JV, Gerson LW. Preventive care in the emergency department: should emergency departments institute a falls prevention program for elder patients? A systematic review. Acad Emerg Med. 2001;8:823-826.
11. Kenny R, Rubenstein LZ, Tinetti ME, et al. Summary of the updated American Geriatrics Society/British Geriatrics Society clinical practice guideline for prevention of falls in older persons. J Am Geriatr Soc. 2011;59:148-157.
12. Rosenberg M, Carpenter CR, Bromley M, et al. Geriatric emergency department guidelines. Ann Emerg Med. 2014;63:e7-e25.
13. Carpenter CR, Griffey RT, Stark S, et al. Physician and nurse acceptance of technicians to screen for geriatric syndromes in the emergency department. West J Emerg Med. 2011;12:489-495.
14. Carpenter CR, Lo AX. Falling behind? Understanding implementation science in future emergency department management strategies for geriatric fall prevention. Acad Emerg Med. 2015;22:478-480.
15. Goldstein BA, Navar AM, Pencina MJ, et al. Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review. J Am Med Inform Assoc. 2018;24:198-208.
16. Churpek MM, Yuen TC, Winslow C, et al. Multicenter comparison of machine learning methods and conventional regression for predicting clinical deterioration on the wards. Crit Care Med. 2016;44:368-374.

Copyright © 2019 Wolters Kluwer Health, Inc. All rights reserved. www.lww-medicalcare.com | 565

Patterson et al  Medical Care • Volume 57, Number 7, July 2019

17. Ting DSW, Cheung CY, Lim G, et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA. 2017;318:2211–2223.
18. Li X, Liu H, Du X, et al. Integrated machine learning approaches for predicting ischemic stroke and thromboembolism in atrial fibrillation. AMIA Annu Symp Proc. 2016;2016:799–807.
19. Wu J, Roy J, Stewart WF. Prediction modeling using EHR data: challenges, strategies, and a comparison of machine learning approaches. Med Care. 2010;48:S106–S113.
20. Weng SF, Reps J, Kai J, et al. Can machine-learning improve cardiovascular risk prediction using routine clinical data? PLoS ONE. 2017;12:e0174944.
21. Lobo JM, Jiménez-Valverde A, Real R. AUC: a misleading measure of the performance of predictive distribution models. Glob Ecol Biogeogr. 2008;17:145–151.
22. Kruppa J, Ziegler A, Konig IR. Risk estimation and risk prediction using machine-learning methods. Hum Genet. 2012;131:1639–1654.
23. Close J, Ellis M, Hooper R, et al. Prevention of falls in the elderly trial (PROFET): a randomised controlled trial. Lancet. 1999;353:93–97.
24. Aday LA, Andersen R. A framework for the study of access to medical care. Health Serv Res. 1974;9:208–220.
25. Andersen RM. Revisiting the behavioral model and access to medical care: does it matter? J Health Soc Behav. 1995;36:1–10.
26. Ricketts TC, Goldsmith LJ. Access in health services research: the battle of the frameworks. Nurs Outlook. 2005;53:274–280.
27. Andersen RM, Rice TH, Kominski GF. Changing the U.S. Health Care System: Key Issues in Health Services Policy and Management. San Francisco, CA: Jossey-Bass; 2007.
28. Stephens CE, Newcomer R, Blegen M, et al. Emergency department use by nursing home residents: effect of severity of cognitive impairment. Gerontologist. 2012;52:383–393.
29. Chatterjee S, Chen H, Johnson ML, et al. Risk of falls and fractures in older adults using atypical antipsychotic agents: a propensity score–adjusted, retrospective cohort study. Am J Geriatr Pharmacother. 2012;10:83–94.
30. Tibshirani R. Regression shrinkage and selection via the lasso. J R Statist Soc B. 1996;58:267–288.
31. Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970;12:55–67.
32. Breiman L. Random forests. Mach Learn. 2001;45:5–32.
33. Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. In: Vitanyi P, ed. Computational Learning Theory. EuroCOLT 1995. Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence), Vol 904. Berlin, Heidelberg: Springer; 1995:23–37.
34. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–2830.
35. Janes H, Longton G, Pepe MS. Accommodating covariates in receiver operating characteristic analysis. Stata J. 2009;9:17–39.
36. Cook RJ, Sackett DL. The number needed to treat: a clinically useful measure of treatment effect. BMJ. 1995;310:452–454.
37. Kalscheur MM, Kipp RT, Tattersall MC, et al. Machine learning algorithm predicts cardiac resynchronization therapy outcomes: lessons from the COMPANION Trial. Circ Arrhythm Electrophysiol. 2018;11:e005499.
38. Karnik S, Tan SL, Berg B, et al. Predicting atrial fibrillation and flutter using electronic health records. Conf Proc IEEE Eng Med Biol Soc. 2012;2012:5562–5565.
39. Philip F, Gornik HL, Rajeswaran J, et al. The impact of renal artery stenosis on outcomes after open-heart surgery. J Am Coll Cardiol. 2014;63:310–316.
40. Cairney J, Veldhuizen S, Vigod S, et al. Exploring the social determinants of mental health service use using intersectionality theory and CART analysis. J Epidemiol Community Health. 2014;68:145–150.
41. Chen JH, Asch SM. Machine learning and prediction in medicine—beyond the peak of inflated expectations. N Engl J Med. 2017;376:2507–2509.
42. Deo RC. Machine learning in medicine. Circulation. 2015;132:1920–1930.
43. Bates DW, Saria S, Ohno-Machado L, et al. Big data in health care: using analytics to identify and manage high-risk and high-cost patients. Health Aff (Millwood). 2014;33:1123–1131.
44. Alanazi HO, Abdullah AH, Qureshi KN. A critical review for developing accurate and dynamic predictive models using machine learning methods in medicine and health care. J Med Syst. 2017;41:69–79.
45. Davis SE, Lasko TA, Chen G, et al. Calibration drift among regression and machine learning models for hospital mortality. AMIA Annu Symp Proc. 2017;2017:625–634.
46. Greenberg MN, Porter BG, Barracco RD, et al. Modified CAGE as a screening tool for mechanical fall risk assessment: a pilot survey. 2013;62:S107–S108.
47. Furukawa TA, Guyatt GH, Griffith LE. Can we individualize the ‘number needed to treat’? An empirical study of summary effect measures in meta-analyses. Int J Epidemiol. 2002;31:72–76.
48. Barratt A, Wyer PC, Hatala R, et al. Tips for learners of evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number needed to treat. CMAJ. 2004;171:353–358.
49. Rusanov A, Weiskopf NG, Wang S, et al. Hidden in plain sight: bias towards sick patients when sampling patients with sufficient electronic health record data for research. BMC Med Inform Decis Mak. 2014;14:51–60.
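The missing-data handling described in the Discussion above (an explicit "missing" level for categorical features, censoring of visits lacking any continuous feature) can be sketched as follows. This is an illustrative reconstruction, not the study's code; the column names and values are invented for the example:

```python
import pandas as pd

# Hypothetical EHR visit records; "age" and "systolic_bp" are continuous,
# "arrival_mode" is categorical. Column names are invented for illustration.
visits = pd.DataFrame({
    "age": [82, 79, None, 88],
    "systolic_bp": [140, None, 150, 132],
    "arrival_mode": ["ambulance", None, "walk-in", "ambulance"],
})

# Categorical variables: encode missingness as its own category.
visits["arrival_mode"] = visits["arrival_mode"].fillna("missing")

# Continuous variables: censor (drop) visits missing any value.
complete = visits.dropna(subset=["age", "systolic_bp"])
print(len(complete))  # the two incomplete visits are dropped, leaving 2
```

With only a small fraction of records censored (3% in the study), the complete-case and imputed analyses would be expected to perform similarly, as the Discussion argues.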
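The tradeoff between referral numbers and NNT noted in the Conclusions can be illustrated numerically using the standard relation NNT = 1/ARR (Cook and Sackett, reference 36). The risk levels and relative risk reduction below are hypothetical placeholders, not the study's estimates:

```python
# Sketch of the referral-threshold vs. NNT tradeoff: a stricter threshold
# refers fewer, higher-risk patients, so the absolute risk reduction rises
# and the NNT falls. All numbers here are hypothetical.

def nnt(risk_among_referred: float, relative_risk_reduction: float) -> float:
    """NNT = 1 / ARR, where ARR = pre-intervention risk x RRR."""
    absolute_risk_reduction = risk_among_referred * relative_risk_reduction
    return 1.0 / absolute_risk_reduction

# Compare referring the top 5% of predicted risk (hypothetical 6-month
# fall-return risk 30%) with the top 20% (risk 15%), assuming RRR = 0.3.
for top_fraction, risk in [(0.05, 0.30), (0.20, 0.15)]:
    print(f"refer top {top_fraction:.0%}: risk={risk:.0%}, "
          f"NNT={nnt(risk, 0.3):.1f}")
```

Presenting the algorithm's output on this scale, rather than as an AUC alone, lets decisionmakers see what a given referral volume buys in prevented events before implementation.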

ERRATUM

Psychometric Evaluation of an Instrument to Measure Prospective Pregnancy Preferences: The Desire to Avoid Pregnancy Scale: Erratum

In the February 2019 issue of Medical Care, there was an error in the article “Psychometric Evaluation of an Instrument to Measure Prospective Pregnancy Preferences: The Desire to Avoid Pregnancy Scale.” On page 156, the sentence “As hypothesized, women not using contraception scored significantly higher on the DAP than those using a method (0.75 logits higher) and those who had not had sex in 30 days (0.74 logits higher; P < 0.001)” should have read (corrected text in bold):

As hypothesized, women not using contraception scored significantly lower on the DAP than those using a method (0.75 logits lower) and those who had not had sex in 30 days (0.74 logits lower; P < 0.001).

REFERENCE
1. Rocca CH, Ralph LJ, Wilson M, et al. Psychometric evaluation of an instrument to measure prospective pregnancy preferences: the desire to avoid pregnancy scale. Med Care. 2019;57:152–158.

ISSN: 0025-7079/19/5707-0566

566 | www.lww-medicalcare.com
