
EDITORIAL

What is a performance outlier?

David M Shahian,1,2 Sharon-Lise T Normand3,4

1Department of Surgery, Center for Quality and Safety, Massachusetts General Hospital, Boston, Massachusetts, USA
2Harvard Medical School, Boston, Massachusetts, USA
3Department of Health Care Policy, Harvard Medical School, Boston, Massachusetts, USA
4Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, USA

Correspondence to Dr David M Shahian, Department of Surgery, Center for Quality and Safety, Massachusetts General Hospital, 55 Fruit St, Boston, MA 02114, USA; [email protected]

Accepted 5 January 2015

Healthcare performance measurement is a complex undertaking, often presenting a number of potential alternative approaches and methodological nuances. Important considerations include richness and quality of sources; data completeness; choice of metrics and target population; sample size; patient- and provider-level data collection periods; risk adjustment; statistical methodology (eg, logistic regression vs hierarchical models); model performance, reliability and validity; and classification of outliers. Given these many considerations, as well as the absence of nationally accepted standards for provider profiling, it is not surprising that different rating organisations and methodologies may produce divergent results for the same hospitals.1–6

Outlier classification, the last step in the measurement process, has particularly important ramifications. For patients, it may lead them to choose or avoid a particular provider. For providers, outlier status may positively or negatively impact referrals and reimbursement, and may influence how scarce hospital resources are deployed to address putative areas of concern. Misclassification is probably more common than generally appreciated. For example, partitioning of hospitals (eg, terciles, quartiles, quintiles, deciles) to determine outliers may lead to excessive false positives—hospitals labelled as having above or below average performance when, in fact, their results do not differ significantly from the mean based on appropriate statistical tests.7 8

THE CURRENT STUDY

In this issue, Paddock et al9 address a seemingly straightforward question—what precisely does it mean to be a performance outlier? Using Hospital Compare data, the authors demonstrate an apparently contradictory finding. When directly compared one to another, some individual hospitals in a given performance tier may not be statistically significantly different than individual hospitals in adjacent tiers, even when those tier assignments were made using appropriate tests of statistical significance. For instance, Paddock et al9 show that for each bottom-tier hospital, there was at least one mid-tier hospital with statistically indistinguishable performance. Among mid-tier ('average') hospitals, 60–75% had performance that was not statistically significantly different than that of some bottom-tier hospitals.

How can this be? On the one hand, hospitals appear to have been appropriately divided into three discrete groups based on their performance rankings—bottom, mid and top tiers. On the other hand, direct comparisons between specific pairs of hospitals in adjacent tiers often showed no statistically significant difference, which seems inconsistent with their original rankings. The answer to this apparent paradox illustrates several statistical concepts, some unfamiliar to non-statisticians but of fundamental importance to the correct interpretation of risk-adjusted outcomes and outlier status.
First and most fundamentally, Paddock et al9 use a completely different statistical methodology for their direct hospital-to-hospital comparisons than the approach used in the original Hospital Compare tier assignments.10–12 The latter employed Bayesian hierarchical regression models with 95% credible intervals (similar to CIs) to determine outliers. From the perspective of causal inference,13–17 the Hospital Compare approach considers the following unobservable counterfactual: 'What would the results have been if this hospital's patients had been cared for by an "average" hospital in the reference population?' This is often referred to as the 'expected' outcome. A level of statistical certainty for the hospital-level estimates is chosen (eg, 95% credible interval), the actual results of a given hospital are compared to the expected or counterfactual outcomes, and any hospital whose 95% credible interval for their risk-adjusted mortality rate excludes the expected mortality rate is designated an outlier.
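To make that classification rule concrete, the sketch below (plain Python, with entirely hypothetical hospitals, rates and intervals) flags a hospital only when the interval around its risk-standardised rate excludes the expected rate; it is not a reproduction of the Hospital Compare models, which are hierarchical regressions fit to patient-level data.

```python
# Illustrative sketch (hypothetical hospitals, rates and intervals): a hospital
# is an outlier only when the interval around its risk-standardised mortality
# rate excludes the expected (reference) rate. Hospital Compare's actual
# intervals come from Bayesian hierarchical models fit to patient-level data.

def classify(interval, expected_rate):
    """Return the performance category implied by an interval estimate."""
    low, high = interval
    if low > expected_rate:
        return "worse than expected"        # entire interval above the expected rate
    if high < expected_rate:
        return "better than expected"       # entire interval below the expected rate
    return "no different from expected"     # interval contains the expected rate

expected = 0.12  # hypothetical reference-population 30-day mortality rate

hospitals = {
    "Hospital A": (0.130, 0.175),  # hypothetical 95% interval estimates
    "Hospital B": (0.095, 0.140),
    "Hospital C": (0.060, 0.105),
}

for name, interval in hospitals.items():
    print(f"{name}: {classify(interval, expected)}")
```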

Because Paddock et al9 did not have access to the patient-level data on which the Hospital Compare analyses were based, they first converted CIs to SEs, then re-estimated performance tiers (presumably, though not stated, using one-sample z-tests), which were similar to the original Hospital Compare ratings. Finally, they performed two-sample z-tests using the results from various hospital combinations in adjacent performance tiers. Their counterfactual is not the expected outcome if a hospital's patients were cared for by an average hospital, but rather by one specific alternative hospital. Their corresponding null hypothesis is that the difference in mean mortality rates between the two hospitals being compared is zero (or, alternatively, that the ratio of their mean mortality rates is unity).
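The sketch below illustrates this kind of pairwise test with hypothetical rates and SEs; Paddock et al9 recovered SEs from the interval estimates published on Hospital Compare rather than assuming them directly.

```python
# Sketch of a two-sample z-test comparing two hospitals' risk-standardised
# mortality rates (hypothetical rates and SEs).
from math import sqrt, erfc

def two_sample_z(rate1, se1, rate2, se2):
    """z statistic and two-sided p value for H0: rate1 - rate2 = 0."""
    z = (rate1 - rate2) / sqrt(se1 ** 2 + se2 ** 2)
    p = erfc(abs(z) / sqrt(2))  # two-sided p value under the normal approximation
    return z, p

# Hypothetical bottom-tier hospital vs hypothetical mid-tier hospital
z, p = two_sample_z(0.155, 0.015, 0.125, 0.018)
print(f"z = {z:.2f}, two-sided p = {p:.2f}")  # large SEs frequently give p > 0.05
```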

Thus, the direct hospital–hospital comparisons performed by Paddock et al9 ask a different question than the original Hospital Compare analyses, with a different counterfactual statement and statistical approach. Viewed from this perspective, it is no longer paradoxical but completely logical that they found different results. In this particular study, failure to reject the null hypothesis of no difference in performance among pairs of hospitals from adjacent tiers was also driven by the large SEs (resulting from small hospital sample sizes—see below). Indistinguishable performance would be particularly likely for pairs of hospitals whose performance was close to the boundary between two adjacent performance categories. That the authors only required at least one hospital from an adjacent tier to be statistically indistinguishable is a relatively low bar.

DIRECT AND INDIRECT STANDARDISATION

Notwithstanding the results from this specific study, which are largely a function of small sample sizes, the authors do not address the more fundamental error of using indirectly standardised results to directly compare pairs of hospitals. The differences between direct and indirect standardisation13 17 18 remain unappreciated by most non-methodologists, resulting in their frequent misapplication and misinterpretation.

In direct standardisation, rates from each stratum of the study population are applied to a reference population. This type of standardisation is common in epidemiological studies where there may be only a few strata of interest (eg, age–sex strata). Directly standardised results estimate what the outcomes would have been in the reference population if these patients had been cared for by a particular study hospital. In causal inference terminology, this is the unobservable counterfactual. The results from many different hospitals can be applied to the reference population in exactly the same fashion, and it is therefore permissible to directly compare their directly standardised results.

The conditions that make direct standardisation possible are not found in most profiling applications because of the large number of risk factors and the fact that any given hospital may have no observations for patients having certain types of risk factors. Consequently, in virtually all healthcare profiling applications, risk adjustment is performed using indirect rather than direct standardisation. The incremental risks associated with each predictor variable (eg, a risk factor such as insulin-dependent diabetes) are derived from the reference population using regression. As in the original Hospital Compare approach discussed above, the expected outcomes in the study population reflect the anticipated results if those patients had been cared for by an average hospital in the reference population, a quite different counterfactual than in direct standardisation.17 Expected results for each patient of a given hospital are summed and compared with their observed results to estimate an O/E ratio (eg, standardised mortality ratio), which can be multiplied by the average mortality to yield a risk-adjusted or risk-standardised rate.
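A minimal sketch of indirect standardisation as just described, with a single hypothetical risk factor and made-up rates; real profiling models involve many covariates and, for Hospital Compare, hierarchical structure.

```python
# Sketch of indirect standardisation: expected risk per patient comes from the
# reference population, O/E is observed deaths over summed expected risk, and
# O/E times the reference rate gives a risk-standardised rate. All figures are
# hypothetical and the cohort is deliberately tiny.

reference_rate = {"low": 0.02, "high": 0.20}   # reference-population rates by stratum
overall_reference_rate = 0.08                  # average rate in the reference population

# One hospital's patients: (risk stratum, died)
patients = [("low", 0), ("low", 0), ("low", 0), ("high", 1), ("high", 0)]

observed = sum(died for _, died in patients)
expected = sum(reference_rate[stratum] for stratum, _ in patients)

oe_ratio = observed / expected
risk_standardised_rate = oe_ratio * overall_reference_rate

print(f"O = {observed}, E = {expected:.2f}, O/E = {oe_ratio:.2f}")
print(f"risk-standardised rate = {risk_standardised_rate:.3f}")
```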

COVARIATE OVERLAP

Direct hospital–hospital comparisons using indirectly standardised observational data are inappropriate in virtually all profiling scenarios. The only exception is a very specific circumstance—when all regions of the covariate space defined by patient risk factors contain observations from all hospitals being compared—which would be an uncommon and chance occurrence in most profiling applications.17 19 In the absence of covariate overlap, there may be patients from one hospital for whom there are no comparable patients in the other hospital (in causal inference parlance, there is no empirical counterfactual19), and thus no way to fairly compare performance in all patients cared for by the two hospitals. For example, it is unlikely that each hospital would have octogenarians with renal failure and chronic liver disease who underwent emergency aortic valve replacement (AVR), but one of them might. No adjustment (eg, model-based extrapolation) can reliably remedy the lack of data in the area of non-overlap, and statistical inferences should generally be limited to regions where there is overlap.

Thus, 'risk-adjusted' results derived using indirect standardisation cannot be used to directly compare two hospitals unless their patient mix has been demonstrated to be similar (eg, overlapping propensity score distributions).17 Indirectly standardised rates for each hospital are estimated only for the patients they actually treated, and their results only apply to their particular case mix. It cannot be assumed that a hospital achieving better than average results in a generally low risk population could do the same in a population of very high risk patients that it has never treated. Because their indirectly standardised rates were obtained by applying reference population rates to their low risk patients, assuming that they would have similar performance if confronted with a high-risk, tertiary patient population is optimistic and unwarranted.
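One simple way to make the overlap requirement concrete is sketched below, using hypothetical risk scores (for example, propensity scores or model-predicted risks) for the patients of two hospitals and reporting the region of common support.

```python
# Sketch of an overlap check before any direct two-hospital comparison
# (hypothetical risk scores; in practice these might be propensity scores or
# model-predicted risks built from many covariates).

hospital_a = [0.02, 0.03, 0.04, 0.05, 0.08, 0.10]   # mostly low-risk patients
hospital_b = [0.05, 0.12, 0.20, 0.35, 0.50, 0.65]   # includes much higher-risk patients

common_low = max(min(hospital_a), min(hospital_b))
common_high = min(max(hospital_a), max(hospital_b))
print(f"region of common support: [{common_low:.2f}, {common_high:.2f}]")

unmatched_b = [s for s in hospital_b if s > max(hospital_a)]
print(f"{len(unmatched_b)} of hospital B's patients have no counterpart at hospital A")
# Inference is best restricted to the region of common support; model-based
# extrapolation cannot reliably fill the non-overlapping region.
```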

COVARIATE IMBALANCE AND BIAS

Irrespective of whether there is overlap in their respective distributions of patient risk, these distributions may still vary across hospitals being compared (ie, the prevalence of relevant risk factors may be different) and this covariate imbalance19 may bias the interpretation of results and the determination of outliers.20 Covariate imbalance is a common problem in profiling using observational data because patients are not randomised (the method used to achieve covariate balance in clinical trials). Standard regression-based adjustment may not completely address bias when there is substantial lack of covariate balance. Covariate imbalance was the motivation for the development of propensity score approaches for matching, modelling or stratification in studies using observational data,21 22 and propensity approaches to profiling have been investigated.20

CASE MIX BIAS

Despite excellent patient-level risk adjustment, substantial case mix bias (eg, due to marked differences in the distributions of high and low risk cases between hospitals) may be present and may impact performance estimates and outlier status. For example, the target population (condition or procedure) may be very broadly defined, which is usually done in an effort to increase sample size. Instead of focusing only on isolated aortic valve replacement (AVR), a relatively homogeneous cohort, measure developers may include all patients with an AVR, even when this procedure has been combined with other operations (such as simultaneous coronary artery bypass grafting surgery).7 These combined procedures generally are associated with higher average mortality than their corresponding isolated procedures, so the resulting study population will have a heterogeneous range of expected mortality rates. Sometimes, completely dissimilar conditions or procedures with quite different inherent risk are aggregated into a heterogeneous composite measure to increase sample size or to give the appearance of being broadly representative. For example, the hospital standardised mortality ratio (HSMR) encompasses nearly all of the admissions at a given hospital.4

In all these examples, even with perfect patient-level risk adjustment, comparisons among providers may be biased and inaccurate unless differences in the relative distributions of higher and lower risk cases are properly accounted for,4 23 a profiling analogue of Simpson's paradox.24 25 The impact of this phenomenon is not uniform. Centres performing a greater proportion of more complex cases, with higher inherent risk of adverse outcomes, may falsely appear to have worse results.
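The hypothetical figures below illustrate this case mix effect: two hospitals with identical mortality within every risk stratum receive different indirectly standardised O/E ratios simply because one treats many more high-risk cases.

```python
# Numerical illustration (hypothetical figures) of case mix bias: identical
# stratum-specific mortality at both hospitals, yet different O/E ratios
# because the case mixes differ.

reference_rate = {"low": 0.02, "high": 0.20}   # reference-population rates
hospital_rate = {"low": 0.01, "high": 0.30}    # the same at BOTH hospitals

case_mix = {
    "Hospital A": {"low": 900, "high": 100},   # mostly low-risk cases
    "Hospital B": {"low": 100, "high": 900},   # mostly high-risk cases
}

for name, counts in case_mix.items():
    observed = sum(n * hospital_rate[s] for s, n in counts.items())
    expected = sum(n * reference_rate[s] for s, n in counts.items())
    print(f"{name}: O/E = {observed / expected:.2f}")

# Hospital A comes out near 1.03 and Hospital B near 1.49, despite identical
# performance within every risk stratum.
```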

THE LIMITATION OF SMALL SAMPLE SIZE

Small sample sizes are common in provider profiling, and this makes it difficult to reliably differentiate hospital performance and classify outliers. In a study of major surgical procedures, Dimick et al26 found that only coronary artery bypass grafting surgery was performed with sufficient volume by most providers to reliably allow detection of a doubling of the mortality rate. Krell et al27 found that most surgical outcomes measures estimated from the American College of Surgeons' National Surgical Quality Improvement Program (NSQIP) registry data had low reliability to detect performance differences for common procedures. Similar findings have been observed with common medical diagnoses.28–30

At volumes typically encountered in practice, and even assuming perfect patient-level risk adjustment, much of the variation in healthcare performance measures is random; the extent of random variation and potential misclassification is greater at lower volumes and event rates.31 As a consequence, there is substantial fluctuation from one sampling period to another in the rates of adverse events and performance rankings among providers.32 Longitudinal assessment of provider performance over longer periods of time and investigation of trends are more prudent approaches than relying on results in one sampling period.33
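A small simulation (hypothetical volumes and rates) makes the point: even when every hospital has exactly the same true mortality rate, small per-period volumes generate large swings in observed rates and in which hospital appears 'worst' from one period to the next.

```python
# Simulation sketch: identical true quality everywhere, yet small annual
# volumes alone produce wide, unstable spreads in observed rates.
import random

random.seed(1)
true_rate = 0.04        # identical underlying mortality at every hospital
annual_volume = 60      # small number of cases per hospital per period
n_hospitals = 20

for period in range(3):
    observed = {}
    for h in range(n_hospitals):
        deaths = sum(random.random() < true_rate for _ in range(annual_volume))
        observed[f"H{h:02d}"] = deaths / annual_volume
    worst = max(observed, key=observed.get)
    print(f"period {period}: observed rates span "
          f"{min(observed.values()):.1%}-{max(observed.values()):.1%}; "
          f"'worst' hospital is {worst} (true rate everywhere {true_rate:.0%})")
```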

Different approaches have been used to address the limitations of small sample size in provider profiling and outlier classification. These include establishing lower limits for sample size below which estimates are not calculated; collecting provider data over longer time periods to increase the number of observations; broadening the target population inclusion criteria (although this may lead to the aggregation issues discussed previously, including ecological bias); attribution of results to larger units (eg, hospitals rather than individual physicians); and use of composite measures that effectively increase the number of endpoints.34 Many statisticians also advocate the use of empirical Bayes or fully Bayesian approaches, which shrink sample estimates towards the population mean.35–38 This yields more accurate estimates of true underlying performance, with less chance of false positive outliers, especially in small samples.

STATISTICAL CERTAINTY

Closely related to these sample size concerns is the degree of statistical certainty chosen to classify a hospital as an outlier (eg, 90%, 95%, 99% CI). The overall health policy 'costs' of higher specificity and fewer false outliers versus higher sensitivity and more false outliers must be considered, and there is no one correct answer.39 Furthermore, the p values and CIs from traditional frequentist approaches may sometimes be misleading. With very small sample sizes, virtually no provider can be reliably identified as an outlier; conversely, with very large sample sizes, outlying results identified by statistical criteria may have little practical difference from the average. Bayesian approaches may provide more intuitive interpretation, such as estimating the probability that a hospital's performance exceeds some threshold.36 40–42
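The sketch below illustrates shrinkage and a posterior exceedance probability using a deliberately simple beta-binomial model with hypothetical numbers; actual profiling models are hierarchical regressions with patient-level risk adjustment.

```python
# Beta-binomial sketch of shrinkage and P(true rate > threshold); prior and
# data are hypothetical, and the posterior is evaluated on a grid so that no
# statistical libraries are required.
from math import exp, log

alpha0, beta0 = 2.0, 48.0   # hypothetical prior centred on a 4% population rate

def posterior_summary(deaths, cases, threshold, grid=4000):
    """Posterior mean and P(true rate > threshold) under a Beta prior."""
    a = alpha0 + deaths
    b = beta0 + (cases - deaths)
    ps = [(i + 0.5) / grid for i in range(grid)]
    dens = [exp((a - 1) * log(p) + (b - 1) * log(1 - p)) for p in ps]
    norm = sum(dens)
    mean = sum(p * d for p, d in zip(ps, dens)) / norm
    exceed = sum(d for p, d in zip(ps, dens) if p > threshold) / norm
    return mean, exceed

# Hypothetical small hospital: 4 deaths in 50 cases (raw rate 8%)
mean, exceed = posterior_summary(deaths=4, cases=50, threshold=0.06)
print(f"raw rate 8.0%, shrunken estimate {mean:.1%}, P(true rate > 6%) = {exceed:.2f}")
```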

THE IMPACT OF OUTLIERS ON THE REFERENCE POPULATION ('EXPECTED' VALUES)

Additional problems with outlier classification can arise if the expected outcome for a particular provider is derived from a relatively small reference population (eg, the cardiac surgery programmes in a particular state). Every provider's outcomes impact not only their own observed value but also the 'expected' value (the E in O/E) for their programme, which is based on the reference population to which they belong.13 A substantially aberrant result from one or two providers will expand the range of values that are considered average, and will reduce the likelihood of a truly abnormal outlying provider being correctly classified as such. Several approaches to this problem have been suggested, including replication with posterior predicted p values, and leave-one-out cross validation, in which the expected performance for each hospital is estimated from a model developed from all other hospitals.37
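The leave-one-out idea can be sketched with hypothetical programme-level counts: excluding a hospital's own results from its 'expected' value prevents an aberrant provider from diluting the standard against which it is judged.

```python
# Sketch of leave-one-out expected values (hypothetical deaths/cases per
# programme; Hospital E is the deliberately aberrant provider).

programmes = {"A": (8, 200), "B": (9, 210), "C": (7, 190), "D": (10, 220), "E": (40, 200)}

total_deaths = sum(d for d, _ in programmes.values())
total_cases = sum(c for _, c in programmes.values())

for name, (deaths, cases) in programmes.items():
    observed = deaths / cases
    pooled = total_deaths / total_cases                       # includes own results
    loo = (total_deaths - deaths) / (total_cases - cases)     # leave-one-out expected
    print(f"{name}: observed {observed:.1%}  expected(all) {pooled:.1%}  "
          f"expected(leave-one-out) {loo:.1%}")
```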
GRAPHICAL TOOLS FOR OUTLIER DETECTION

Finally, various graphical methods have also been used to monitor healthcare performance and to determine outliers.43 44 These include funnel plots, in which unadjusted or adjusted point estimates of provider performance are plotted against sample size (volume), with superimposed CIs around the population average to indicate warning or outlier status. Other methods include real-time graphical monitoring using cumulative sum (CUSUM) approaches, in which results are immediately updated with each patient or procedure.45–47
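A funnel plot's control limits can be approximated as sketched below (normal approximation to the binomial, with a hypothetical population rate and volumes); a hospital is flagged when its observed rate falls outside the limits corresponding to its volume.

```python
# Sketch of funnel-plot control limits around a population mortality rate.
from math import sqrt

population_rate = 0.04
z95, z998 = 1.96, 3.09     # approximate 95% warning and 99.8% alarm limits

print("volume   95% limits      99.8% limits")
for n in (25, 50, 100, 200, 400, 800):
    se = sqrt(population_rate * (1 - population_rate) / n)
    lo95, hi95 = max(population_rate - z95 * se, 0), population_rate + z95 * se
    lo998, hi998 = max(population_rate - z998 * se, 0), population_rate + z998 * se
    print(f"{n:6d}   {lo95:.3f}-{hi95:.3f}     {lo998:.3f}-{hi998:.3f}")
# The limits widen as volume falls, giving the characteristic funnel shape.
```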

CONCLUSION

Outlier determination, the final step in the performance measurement process, is a more complicated undertaking than most non-experts appreciate, with many nuances in implementation and interpretation. Those involved in provider profiling have a responsibility to explicitly state the approaches they use for outlier classification, and to explain the proper interpretation of outlier status to end users of varying statistical sophistication. For example, as demonstrated in the study of Paddock et al,9 it should be recognised that while the CMS website is named Hospital Compare, the statistically valid comparison is between each hospital and a hypothetical average hospital, not between pairs of hospitals.

Given the historical lack of comparative performance data in healthcare and the urgent need to foster informed consumer choice and performance improvement, it is understandable that various stakeholders (patients, payers, regulators) might be tempted to view the issue of outliers too simplistically, sometimes misinterpreting or unintentionally misusing outlier results. However, this may lead to consequences that are at least as undesirable as having no performance data at all. Misclassification of providers may misdirect consumers, unfairly discredit or commend certain providers, and lead to misallocation of scarce resources. Scientific rigour and sound judgment are required to accurately classify outliers and to constructively use this information to improve healthcare quality.

Competing interests None.

Provenance and peer review Not commissioned; internally peer reviewed.

REFERENCES
1 Healthcare Association of New York State. HANY's report on report cards: understanding publicly reported hospital quality measures. 2013. http://www.hanys.org/quality/data/report_cards/2013/docs/2013_hanys_report_card_book.pdf (accessed 5 Feb 2014).
2 Leonardi MJ, McGory ML, Ko CY. Publicly available hospital comparison web sites: determination of useful, valid, and appropriate information for comparing surgical quality. Arch Surg 2007;142:863–8.
3 Rothberg MB, Morsi E, Benjamin EM, et al. Choosing the best hospital: the limitations of public quality reporting. Health Aff (Millwood) 2008;27:1680–7.
4 Shahian DM, Wolf RE, Iezzoni LI, et al. Variability in the measurement of hospital-wide mortality rates. N Engl J Med 2010;363:2530–9.
5 Delong ER, Peterson ED, DeLong DM, et al. Comparing risk-adjustment methods for provider profiling. Stat Med 1997;16:2645–64.
6 Peterson ED, Delong ER, Muhlbaier LH, et al. Challenges in comparing risk-adjusted bypass surgery mortality results: results from the Cooperative Cardiovascular Project. J Am Coll Cardiol 2000;36:2174–84.
7 Shahian DM, He X, Jacobs JP, et al. Issues in quality measurement: target population, risk adjustment, and ratings. Ann Thorac Surg 2013;96:718–26.
8 Bilimoria KY, Cohen ME, Merkow RP, et al. Comparison of outlier identification methods in hospital surgical quality improvement programs. J Gastrointest Surg 2010;14:1600–7.
9 Paddock SM, Adams JL, Hoces de la Guardia F. Better-than-average and worse-than-average hospitals may not significantly differ from average hospitals: an analysis of Medicare Hospital Compare ratings. BMJ Qual Saf 2015;24:128–34.
10 Bratzler DW, Normand SL, Wang Y, et al. An administrative claims model for profiling hospital 30-day mortality rates for pneumonia patients. PLoS ONE 2011;6:e17401.
11 Krumholz HM, Wang Y, Mattera JA, et al. An administrative claims model suitable for profiling hospital performance based on 30-day mortality rates among patients with an acute myocardial infarction. Circulation 2006;113:1683–92.
12 Krumholz HM, Wang Y, Mattera JA, et al. An administrative claims model suitable for profiling hospital performance based on 30-day mortality rates among patients with heart failure. Circulation 2006;113:1693–701.
13 Draper D, Gittoes M. Statistical analysis of performance indicators in UK higher education. J R Stat Soc Series A (Statistics in Society) 2004;167:449–74.
14 Maldonado G, Greenland S. Estimating causal effects. Int J Epidemiol 2002;31:422–9.
15 Little RJ, Rubin DB. Causal effects in clinical and epidemiological studies via potential outcomes: concepts and analytical approaches. Annu Rev Public Health 2000;21:121–45.

16 Pearl J. Causality: models, reasoning, and inference. Cambridge, UK; New York: Cambridge University Press, 2000.
17 Shahian DM, Normand SL. Comparison of "risk-adjusted" hospital outcomes. Circulation 2008;117:1955–63.
18 Fleiss JL, Levin BA, Paik MC. Statistical methods for rates and proportions. Hoboken, NJ: J. Wiley, 2003.
19 Gelman A, Hill J. Data analysis using regression and multilevel/hierarchical models. Cambridge; New York: Cambridge University Press, 2007.
20 Huang IC, Frangakis C, Dominici F, et al. Application of a propensity score approach for risk adjustment in profiling multiple physician groups on asthma care. Health Serv Res 2005;40:253–78.
21 Rosenbaum PR. Observational studies. New York: Springer, 2002.
22 D'Agostino RB Jr. Propensity scores in cardiovascular research. Circulation 2007;115:2340–3.
23 Glance LG, Osler TM. Comparing outcomes of coronary artery bypass surgery: is the New York Cardiac Surgery Reporting System model sensitive to changes in case mix? Crit Care Med 2001;29:2090–6.
24 Manktelow BN, Evans TA, Draper ES. Differences in case-mix can influence the comparison of standardised mortality ratios even with optimal risk adjustment: an analysis of data from paediatric intensive care. BMJ Qual Saf 2014;23:782–8.
25 Marang-van de Mheen PJ, Shojania KG. Simpson's paradox: how performance measurement can fail even with perfect risk adjustment. BMJ Qual Saf 2014;23:701–5.
26 Dimick JB, Welch HG, Birkmeyer JD. Surgical mortality as an indicator of hospital quality: the problem with small sample size. JAMA 2004;292:847–51.
27 Krell RW, Hozain A, Kao LS, et al. Reliability of risk-adjusted outcomes for profiling hospital surgical quality. JAMA Surg 2014;149:467–74.
28 Hofer TP, Hayward RA. Identifying poor-quality hospitals. Can hospital mortality rates detect quality problems for medical diagnoses? Med Care 1996;34:737–53.
29 Hofer TP, Hayward RA, Greenfield S, et al. The unreliability of individual physician "report cards" for assessing the costs and quality of care of a chronic disease. JAMA 1999;281:2098–105.
30 Thomas JW, Hofer TP. Accuracy of risk-adjusted mortality rate as a measure of hospital quality of care. Med Care 1999;37:83–92.
31 Austin PC, Reeves MJ. Effect of provider volume on the accuracy of hospital report cards: a Monte Carlo study. Circ Cardiovasc Qual Outcomes 2014;7:299–305.
32 Siregar S, Groenwold RH, Jansen EK, et al. Limitations of ranking lists based on cardiac surgery mortality rates. Circ Cardiovasc Qual Outcomes 2012;5:403–9.
33 Bronskill SE, Normand SL, Landrum MB, et al. Longitudinal profiles of health care providers. Stat Med 2002;21:1067–88.
34 O'Brien SM, Shahian DM, Delong ER, et al. Quality measurement in adult cardiac surgery: part 2: statistical considerations in composite measure scoring and provider rating. Ann Thorac Surg 2007;83(4 Suppl):S13–26.
35 Ash A, Fienberg SE, Louis TA, et al. Statistical issues in assessing hospital performance. Commissioned by the Committee of Presidents of Statistical Societies. 2011. http://www.cms.gov/Medicare/Quality-Initiatives-Patient-Assessment-Instruments/HospitalQualityInits/Downloads/Statistical-Issues-in-Assessing-Hospital-Performance.pdf (accessed 18 Sep 2013).
36 Normand S-LT, Glickman ME, Gatsonis CA. Statistical methods for profiling providers of medical care: issues and applications. J Am Stat Assoc 1997;92:803–14.
37 Normand S-LT, Shahian DM. Statistical and clinical aspects of hospital outcomes profiling. Stat Sci 2007;22:206–26.
38 Goldstein H, Spiegelhalter DJ. League tables and their limitations: statistical issues in comparisons of institutional performance. J R Stat Soc (Series A) 1996;159:385–443.
39 Austin PC, Anderson GM. Optimal statistical decisions for hospital report cards. Med Decis Making 2005;25:11–19.
40 Austin PC. A comparison of Bayesian methods for profiling hospital performance. Med Decis Making 2002;22:163–72.
41 Christiansen CL, Morris CN. Improving the statistical approach to health care provider profiling. Ann Intern Med 1997;127(8 Pt 2):764–8.
42 Austin PC, Naylor CD, Tu JV. A comparison of a Bayesian vs a frequentist method for profiling hospital performance. J Eval Clin Pract 2001;7:35–45.
43 Spiegelhalter D. Funnel plots for institutional comparison. Qual Saf Health Care 2002;11:390–1.
44 Spiegelhalter DJ. Funnel plots for comparing institutional performance. Stat Med 2005;24:1185–202.
45 Rogers CA, Reeves BC, Caputo M, et al. Control chart methods for monitoring cardiac surgical performance and their interpretation. J Thorac Cardiovasc Surg 2004;128:811–19.
46 de Leval MR, Francois K, Bull C, et al. Analysis of a cluster of surgical failures. Application to a series of neonatal arterial switch operations. J Thorac Cardiovasc Surg 1994;107:914–23.
47 Grunkemeier GL, Wu YX, Furnary AP. Cumulative sum techniques for assessing surgical results. Ann Thorac Surg 2003;76:663–7.
