Time to Reality Check the Promises of Machine Learning-Powered
Viewpoint

Time to reality check the promises of machine learning-powered precision medicine

Jack Wilkinson, Kellyn F Arnold, Eleanor J Murray, Maarten van Smeden, Kareem Carr, Rachel Sippy, Marc de Kamps, Andrew Beam, Stefan Konigorski, Christoph Lippert, Mark S Gilthorpe, Peter W G Tennant

Machine learning methods, combined with large electronic health databases, could enable a personalised approach to medicine through improved diagnosis and prediction of individual responses to therapies. If successful, this strategy would represent a revolution in clinical research and practice. However, although the vision of individually tailored medicine is alluring, there is a need to distinguish genuine potential from hype. We argue that the goal of personalised medical care faces serious challenges, many of which cannot be addressed through algorithmic complexity, and call for collaboration between traditional methodologists and experts in medical machine learning to avoid extensive research waste.

Lancet Digital Health 2020
Published Online September 16, 2020
https://doi.org/10.1016/S2589-7500(20)30200-4

Centre for Biostatistics, Manchester Academic Health Science Centre, Division of Population Health, Health Services Research and Primary Care, University of Manchester, Manchester, UK (J Wilkinson PhD); Leeds Institute for Data Analytics (K F Arnold PhD, M de Kamps PhD, Prof M S Gilthorpe PhD, P W G Tennant PhD), Faculty of Medicine and Health (K F Arnold, Prof M S Gilthorpe, P W G Tennant), and School of Computing (M de Kamps), University of Leeds, Leeds, UK; Department of Epidemiology, Boston University School of Public Health, Boston, MA, USA (E J Murray ScD, A Beam PhD); Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, Netherlands (M van Smeden PhD); Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, MA, USA (K Carr MSc); Institute for Global Health and Translational Science, SUNY Upstate Medical University, Syracuse, NY, USA (R Sippy PhD); Department of Geography (R Sippy) and Emerging Pathogens Institute (R Sippy), University of Florida, Gainesville, FL, USA; Digital Health & Machine Learning Research Group, Hasso Plattner Institut for Digital Engineering, Potsdam, Germany (S Konigorski PhD, Prof C Lippert PhD); Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, NY, USA (S Konigorski, Prof C Lippert); and Alan Turing Institute, London, UK (Prof M S Gilthorpe, P W G Tennant)

Correspondence to: Dr Jack Wilkinson, Centre for Biostatistics, Manchester Academic Health Science Centre, Division of Population Health, Health Services Research and Primary Care, University of Manchester, Manchester M13 9PL, UK
jack.wilkinson@manchester.ac.uk

Introduction

Proponents of precision medicine make a compelling pitch: traditional approaches to health science have focused too much on comparing effectiveness in the average person and too little on the needs of actual individuals.1,2 The blame could lie with outdated statistical and epidemiological tools, which might offer decreasing relevance to the needs of contemporary clinical decision making.3 The proposed solution speaks to the zeitgeist: our newfound abundance of detailed and accessible longitudinal data on individuals combined with the practical realisation of various flexible machine learning approaches offer an exciting chance for revolution.2 At the apex sits the dream of precision medicine crafted by machine learning, a new framework that promises to revolutionise how we identify the best therapy for each person as an individual, while automating everyday tasks like diagnosis and prognostication with unprecedented accuracy.4

But how realistic are these claims? And when, if ever, can we expect them to be routinely realised? We consider the evidence underlying two of the most common claims about the potential of machine learning-powered precision medicine and call for a reality check of expectations.

Claim 1: machine learning will enable automated diagnoses with unprecedented accuracy

Machine learning is often heralded by health and medical commentators as a powerful prediction tool that will revolutionise disease screening and diagnosis. The inherent flexibility and scope for automation makes machine learning well suited to examining complex high-dimensional data (ie, with many variables or features) that would be challenging to model using conventional approaches. Such strategies have enabled the development of several innovative diagnostic algorithms: for example, to identify patients most in need of intervention from knee MRI,5 to detect cardiac arrhythmias from electrocardiograms,6 and to diagnose pneumonia from chest x-rays.7

Given such innovation, it is hard to dispute the revolutionary potential of machine learning for improving clinical diagnostics. However, acknowledging potential is a poor substitute for robust scientific evidence of actual benefit, and here research is lacking.8 Although news media is filled with enthusiastic stories about novel machine learning applications,9–12 a systematic review comparing the performance of deep learning versus health professional assessment in diagnosis of various diseases from medical images makes for sobering reading.13 Only 20 (24%) of the 82 studies identified evaluated the performance of their algorithm in an external cohort, and only 14 (17%) studies compared this out-of-sample performance with that of health professionals. This number is alarmingly small, especially given that many of the studies were flawed. The authors found that reporting standards were typically poor, internal validation was weak and, perhaps most worryingly, model performance was often evaluated under unrealistic conditions that had little relevance to routine clinical practice.13 For example, there is little use comparing the performance of a machine learning algorithm with health professional judgment for making diagnoses from medical images without providing further contextual information about the patient, as this would never happen in practice.14 A more recent systematic review of studies comparing deep learning with clinical judgement corroborated the widespread issues with poor study design and reporting, but identified some well designed randomised clinical trials that evaluated the technology.15

Emphasis on predictive performance over clinical utility is not unusual. The ability of machine learning to process high-dimensional data, for example, appears to be distracting from the often greater benefits of simple clinical variables. Volkmann and colleagues16 showed how predictions based on large dimensional omics data can easily be improved by including more common clinical information. Indeed, the added value of omics data for clinical prediction can be marginal once all relevant clinical variables are included. There is also an increasing focus on classifying patients into simple categories (eg, with and without the disease) rather than predicting a continuum of risk. This trend is an unfortunate departure from the supposed aim of increasing individual relevance and is particularly puzzling because it is not a requirement of most machine learning methods.

In terms of using machine learning to automate diagnoses, this remains more of a potential promise than a proven product, although notable exceptions exist, such as an automated system for detection of diabetic retinopathy.17 At present, the benefits of most proposals to automate clinical diagnoses with machine learning are unknown because they have not been meaningfully assessed. Novel machine learning studies are not unusual in this regard. Clinical journals are flooded with traditional prognostic models that were neither developed nor evaluated using appropriate methods.18 However, since few of these models ever end up being used, they are arguably benign. By contrast, there is tangible concern that the scale of enthusiasm around machine learning means substandard models might get unduly adopted by a clinical audience that is not equipped to assess them.

Regardless, the performance and utility of a machine learning algorithm is highly dependent on the quality and relevance of the data on which it is trained. Training

identify the best treatment for an individual person by machine learning.

This problem does not simply reflect the adage that correlation does not imply causation, nor the widely held belief that causal inference can only be achieved in experimental data. A suite of methodological approaches are available to aid estimation of causal effects in non-experimental data,23,24 such as electronic health records, which potentially offer more diverse and relevant samples than clinical trials (although representative samples might not be necessary to achieve representative results25). The limitation arises because almost all machine learning algorithms have been designed to make predictions (eg, to most accurately predict or classify those with a particular trait or prognosis) and this is fundamentally distinct from causal explanation and causal effect estimation.26,27

To estimate a causal effect, we instead need to estimate not just what is most likely to happen (ie, prediction) but what would most likely have happened if things had been different (ie, counterfactual prediction27). Accordingly, the prospect of automated