
ORIGINAL ARTICLE

Machine learning reduced workload with minimal risk of missing studies: development and evaluation of a randomized controlled trial classifier for Cochrane Reviews

James Thomas a,*, Steve McDonald b, Anna Noel-Storr c,d, Ian Shemilt a, Julian Elliott e, Chris Mavergames d, Iain J. Marshall f

a EPPI-Centre, UCL Social Research Institute, University College London, London, UK
b Cochrane Australia, School of Public Health and Preventive Medicine, Monash University, Melbourne, Australia
c Radcliffe Department of Medicine, University of Oxford, Oxford, UK
d Cochrane, London, UK
e Department of Infectious Diseases, Monash University and Alfred Hospital, Melbourne, Australia
f School of Population Health & Environmental Sciences, King's College London, London, UK

Accepted 3 November 2020; Published online xxxx

Abstract

Objectives: This study developed, calibrated, and evaluated a machine learning classifier designed to reduce study identification workload in Cochrane for producing systematic reviews.

Methods: A machine learning classifier for retrieving randomized controlled trials (RCTs) was developed (the "Cochrane RCT Classifier"), with the algorithm trained using a data set of title–abstract records from Embase, manually labeled by the Cochrane Crowd. The classifier was then calibrated using a further data set of similar records manually labeled by the Clinical Hedges team, aiming for 99% recall. Finally, the recall of the calibrated classifier was evaluated using records of RCTs included in Cochrane Reviews that had abstracts of sufficient length to allow machine classification.

Results: The Cochrane RCT Classifier was trained using 280,620 records (20,454 of which reported RCTs). A classification threshold was set using 49,025 records (1,587 of which reported RCTs), and our bootstrap validation found the classifier had recall of 0.99 (95% confidence interval 0.98–0.99) and precision of 0.08 (95% confidence interval 0.06–0.12) in this data set. The final, calibrated RCT classifier correctly retrieved 43,783 (99.5%) of 44,007 RCTs included in Cochrane Reviews but missed 224 (0.5%). Older records were more likely to be missed than those more recently published.

Conclusions: The Cochrane RCT Classifier can reduce manual study identification workload for Cochrane Reviews, with a very low and acceptable risk of missing eligible RCTs. This classifier now forms part of the Evidence Pipeline, an integrated workflow deployed within Cochrane to help improve the efficiency of the study identification processes that support systematic review production.

© 2020 Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Funding: This work received funding from Cochrane via Project Transform; Australian National Health & Medical Research Council (Partnership Project grant APP1114605); U.S. National Library of Medicine (Award 2R01-LM012086-05); I.J.M. was supported by a Medical Research Council (UK) fellowship (MR/N015185/1). A portion of James Thomas's time was supported by the National Institute for Health Research (NIHR) Collaboration for Leadership in Applied Health Research and Care North Thames at Barts Health NHS Trust. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR, or the Department of Health.

Ethics approval and consent to participate: Not applicable: study does not involve human subjects.

Conflict of interest: The authors declare that they have no competing interests.

Authors' contributions: J.T., A.N.S., S.M., and I.J.M. designed the study. J.T. and I.J.M. built the classifiers and calibration models. A.N.S. and S.M. worked on evaluation data sets. C.M. provided overall Cochrane direction and governance. I.S. and J.E. provided methodological input throughout. All authors read and approved the final article.

Consent for publication: Not applicable: no participant data presented.

Availability of data and materials: Hyperlinks to source code repositories are supplied in the text. Cochrane's CENTRAL database is referenced and is available at https://www.cochranelibrary.com/central. All Cochrane Crowd labels are open and available at http://crowd.cochrane.org/DownloadData.php.

* Corresponding author. University College London, UCL Social Research Institute, 18 Woburn Square, London WC1H 0NR, UK. Tel.: +44 207 612 6844. E-mail address: [email protected] (J. Thomas).

https://doi.org/10.1016/j.jclinepi.2020.11.003
0895-4356/© 2020 Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Keywords: Machine learning; Study classifiers; Searching; Information retrieval; Methods/methodology; Randomized controlled trials; Systematic reviews; Automation; Crowdsourcing; Cochrane Library

Key findings

- Manual workload can be saved by using a machine learning classifier.
- The risk of missing eligible studies is acceptably low.

What this adds to what was known?

- Machine learning tools are sufficiently mature to be used in real-world scenarios.
- It is possible to build a machine learning classifier to identify RCTs that is sufficiently reliable to be deployed in live workflows.

What is the implication and what should change now?

- Where possible, systematic reviewers should use this classifier to make their work more efficient.

1. Background

Cochrane is a leading producer of systematic reviews, with more than 8,000 currently published in the Cochrane Library [1]. These reviews incorporate the results of tens of thousands of randomized controlled trials (RCTs) and other primary studies. The manual effort invested in identifying primary studies eligible for inclusion in these and other systematic reviews is vast. Author teams and information specialists typically search a large number of bibliographic databases to find the comparatively small number of studies eligible to be included [2]. These searches are designed to be sensitive, to identify as many relevant studies as possible, but they therefore yield large numbers of irrelevant records, which are then screened manually by author teams. This is a time-consuming and therefore costly process, especially when all records are checked by at least two people to aid reliability. With the rapidly increasing volume of research being conducted and published [3], systematic reviews tend to be resource-intensive projects, which can take years to complete. As a consequence, many important research questions are not covered by systematic reviews, and it is increasingly difficult to maintain an up-to-date synthesized evidence base [4]. This is a waste of global investment in research, leading to suboptimal decision-making and poorer health outcomes [5].

Automation has been proposed as one possible solution to reduce the manual burden of many systematic review tasks [6]. For example, machine learning classification algorithms ("classifiers") can "learn" the eligibility criteria of a review through exposure to a manually classified set of documents, thus reducing the human effort required to find relevant studies [7].

To date, most automation approaches operate at the level of individual reviews [8,9], rather than addressing structural deficiencies in research curation [10]. This paper describes an important component in an alternative approach, which aims to improve the efficiency of study identification across multiple systematic reviews of RCTs. The system comprises (1) database searching, (2) machine learning, and (3) crowdsourcing (via the Cochrane Crowd citizen science project) to populate an existing database of RCTs (CENTRAL) [11]. The interlinked system or "workflow" is known as the Cochrane "Evidence Pipeline." Here we describe the machine learning component of the Pipeline workflow; the other components (the Cochrane Crowd and a Centralised Search Service) are detailed elsewhere [12,13].

The reason that this is so beneficial for Cochrane Reviews is twofold. First, on the basis that RCT study designs can be ethically implemented to produce results capable of supporting causal claims about the beneficial effects of the large majority of health care interventions evaluated in Cochrane Reviews, approximately 90% of Cochrane Reviews aim to include only RCTs. Thus, the capability to efficiently identify studies with these designs at scale will generate large corollary cost savings and efficiency gains in review production and updating systems across thousands of Cochrane Reviews, reducing research waste. Second, searches conducted for Cochrane and non-Cochrane health and nonhealth systematic reviews of RCT evidence also retrieve many records of studies that are not RCTs (often well over 50%). Thus, the capability to automatically exclude non-RCTs from manual checking in such reviews will reduce manual workload (because, even if they are about the right topic, the fact that they are not RCTs means that they are ineligible for inclusion), with corollary cost savings and efficiency gains.

We have previously described methods for automatically identifying RCTs from research databases [8]. In that evaluation, we found machine learning classification systems are more accurate than manually crafted Boolean string searches of databases (the current standard practice). Yet, showing higher accuracy in a validation study is not sufficient to ensure new technologies are adopted in practice. We have engaged with the Cochrane Information Retrieval Methods Group (IRMG), with whom we agreed additional requirements for this technology to be adopted by Cochrane. First, the classifier must recall at least 99% of RCTs (a more stringent threshold than we had applied in our previous work) [8]. Second, the classifier should provide an indicative probability score to users. Third, an additional assessment should be done on whether the classifier would be at risk of missing any of the studies included in existing Cochrane Reviews. In this article, we describe the development, calibration, and evaluation of a machine learning classifier designed to meet these requirements, which has subsequently been adopted by and deployed within Cochrane.

2. Materials and methods

2.1. Cochrane Evidence Pipeline workflow

Cochrane publishes a database of RCTs that are relevant for current or potential future reviews (CENTRAL), with an administrative interface for Cochrane users, known as the Cochrane Register of Studies [11]. Although a rapidly increasing minority of reviews synthesize nonrandomized research designs (including qualitative and quasiexperimental studies), CENTRAL focuses on RCTs, which currently remain the basis of the large majority of published Cochrane Reviews. We likewise focus our efforts on the discovery of RCTs.

We seek to benefit from efficiencies in two ways. First, current practice is to identify RCTs through searches of bibliographic databases using highly sensitive RCT filters. Such filters have low precision, retrieving as many as 20 non-RCTs for every true RCT [12]. These irrelevant articles then need to be manually screened and removed. Second, the same studies are retrieved and assessed multiple times by different people across the global systematic review workforce. The Pipeline therefore aims to avoid this duplication of effort by facilitating the reuse of previous assessments as to whether a given report describes, or does not describe, an RCT.

Fig. 1 depicts the role of machine learning within the Pipeline. To populate CENTRAL, Cochrane regularly searches a range of online resources (e.g., biomedical literature databases) through the "Centralised Search Service," which is described elsewhere [12]. Abstracts of these candidate articles (of which the majority are not RCTs) are "fed" into the top of the Pipeline. (The scope and detail of these searches is described here: https://www.cochranelibrary.com/central/central-creation). The machine learning classifier (described in this article) is used to filter out records that are highly unlikely to be an RCT study report. The remaining articles are then handed over to the Cochrane Crowd, which filters out all further records that do not report an RCT [13]. Finally, the remaining articles (which should all describe RCTs) are stored in CENTRAL. Crowd "labels" are also used to update the machine learning algorithm so that it becomes more accurate at distinguishing between relevant and irrelevant records (see Section 2.7).

Fig. 1. The Cochrane Evidence Pipeline workflow, depicting the flow of records from the centralized search service, through machine and crowd classification services to the CENTRAL database.
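The flow just described can be summarized in a few lines of Python. This is an illustrative sketch only: the function names (classifier_score, crowd_says_rct) and the threshold stand in for Cochrane's production services, which are not published in this form.

```python
def evidence_pipeline(records, classifier_score, crowd_says_rct, threshold):
    """Illustrative sketch of the record flow shown in Fig. 1.

    classifier_score and crowd_says_rct stand in for the machine learning
    and Cochrane Crowd services; they are assumptions, not the real API.
    """
    central = []
    for record in records:
        # Step 1: the classifier discards records highly unlikely to be RCTs.
        if classifier_score(record) < threshold:
            continue
        # Step 2: the Cochrane Crowd screens out the remaining non-RCTs.
        if crowd_says_rct(record):
            central.append(record)  # Step 3: confirmed RCTs are stored in CENTRAL
    return central
```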

2.2. Data sets and their role in this study

High-quality data sets are vital for the development, calibration, and evaluation of reliable machine learning classifiers. Most evaluations of such classifiers use a single data set, which is split at random between "training" and "test" data (e.g., with 70% of the data reserved for training). The training data are used to estimate the model parameters, and the test data are used to evaluate its performance. However, although a single data set evaluation can provide estimates of classifier performance that have strong internal validity, it cannot tell us how well a classifier will perform in the real world, where data may come from sources that differ in important ways from those used to produce this data set. As outlined by Nevin, it is important to consider the external validity of machine learning models before deployment [14]. Here, we examined external validity in terms of whether the use of our machine learning model would risk missing RCTs included in Cochrane Reviews.

We therefore used three distinct data sets in the present study:

- Training data, from which the machine learning models are built;
- Calibration data, on which the threshold for determining the cutoff between "RCT" and "non-RCT" classifications was based; and
- Validation data, on which the calibrated classifier was evaluated.

Fig. 2 summarizes the contribution made by each data set that we now describe in detail.

2.3. Training data

The data set used to train the classifier comprises a corpus of 280,620 title–abstract records retrieved from Embase using a highly sensitive search for RCTs (https://www.cochranelibrary.com/central/central-creation). This search has been carried out each month since January 2014 for the purpose of identifying relevant studies for inclusion in CENTRAL (see Section 2). In this study, we used records retrieved between January 2014 and July 2016 inclusive. During this period, any records indexed with the Emtree headings "Randomized controlled trial" or "Controlled clinical study" were automatically marked for inclusion in CENTRAL, without any manual checking, on the basis that this rule produced an error rate for identifying reports of RCTs that was judged sufficiently low. To account for the historical use of this rule, records with these specific Emtree headings were also excluded from our training data set. Because obvious RCTs and obvious non-RCTs had already been filtered out of this data set (using Emtree headings and the sensitive search filter for RCTs, respectively) before we used it to train the classifier, the data set therefore comprises records that are, on average, more difficult to classify according to whether or not they report an RCT, compared with an unfiltered sample from the raw database.

Next, each record in the training data set was labeled by Cochrane Crowd members according to whether it reported an RCT (n = 20,454) or not (n = 260,166). Each record was labeled by multiple Crowd members, with the final Crowd decision being determined by an agreement algorithm; Noel-Storr et al. report that the Crowd recall and precision for identifying RCTs both exceeded 99% [13]. This data set ("Cochrane Crowd data" in Fig. 2) has characteristics that make it highly suitable for training a machine learning classifier: it is both large (so represents a wide range of instances of both the positive and negative classes, i.e., RCTs and non-RCTs) and also very accurately labeled. However, it also comprised records added to Embase during a relatively short (<3 years) period, which could limit the generalizability of the resulting machine learning classifier.

Fig. 2. Development and evaluation of the classifier, showing where the various data sets were used in the classifier development process.

2.4. Calibration data

When machine learning classifiers are used for prediction, they output a score (often scaled to be bounded by 0 and 1) that gets assigned to each title–abstract record, with a higher value representing an increased likelihood that the record reports an RCT. However, to use the classifier to reduce manual screening workload, we also needed to set a threshold score below which records (unlikely to be RCTs) are discarded and conversely above which records (possible RCTs) are retained for manual screening. Higher score thresholds can be expected to lead to a higher proportion of reports of RCTs (true positives) among fewer retained records (i.e., higher precision) but at the expense of having discarded some reports of RCTs (false negatives) with scores below the threshold (i.e., lower recall). Conversely, a lower threshold can be expected to lead to lower precision but higher recall.

We sought advice from the Cochrane IRMG and were advised that the classifier would need to have a threshold score calibrated to retrieve at least 99% of relevant RCT study reports to be adopted for use in Cochrane and also that achieving this high level of recall should be prioritized over any reductions in manual screening workload. These specifications reflect the strong aversion that we have, when conducting systematic reviews, to inadvertently failing to identify studies that should be included.

The Clinical Hedges data set (Fig. 2) was built during 2000 and 2001 for the purposes of testing and validating sensitive search filters for RCTs [15]. It contains 49,028 title–abstract records manually identified and selected by information specialists using a combination of hand search and electronic search methods. Corresponding full-text reports of all records were manually checked to ascertain with confidence whether or not each reported an RCT, making this a highly accurate data set for our current purpose. Three records from this data set were no longer available in PubMed, so our final calibration data set comprised 49,025 PubMed title–abstract records, of which 1,587 reported an RCT (and the remaining 47,438 did not report an RCT).

It was more demanding to calibrate our RCT classifier on this data set (compared with using a proportion of records held back from the Cochrane Crowd data set) because (1) the records are older and are less likely to have a consistent reporting structure for RCTs, because the study reports were published only a few years after the CONSORT statement (and before the latter became widely used) [16], and (2) the Clinical Hedges Team's assessments were based on full-text reports, and there is no indication in some of the corresponding titles and abstracts that they actually report an RCT. We used this data set to identify the threshold for achieving 99% recall and thereby calibrate our classifier, and we also present results concerning the precision with which these records can be identified.
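In procedural terms, calibrating for 99% recall amounts to ranking the gold-standard records by classifier score and finding the highest cutoff that still retrieves at least 99% of the true RCTs. The following is a minimal sketch of that procedure, assuming NumPy arrays of scores and labels; it is an illustrative reimplementation, not the authors' code.

```python
import numpy as np

def threshold_for_recall(scores, is_rct, target_recall=0.99):
    """Return the highest score cutoff whose recall still meets target_recall.

    scores: classifier probabilities, one per record.
    is_rct: gold-standard labels (True = the record reports an RCT).
    """
    scores = np.asarray(scores, dtype=float)
    is_rct = np.asarray(is_rct, dtype=bool)
    order = np.argsort(-scores)                    # rank records from highest score down
    recall = np.cumsum(is_rct[order]) / is_rct.sum()
    k = int(np.argmax(recall >= target_recall))    # first rank meeting the target
    return scores[order[k]]                        # records scoring below this are discarded
```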
2.5. Validation data

As described previously, this machine learning classifier was primarily designed to identify records of study reports potentially eligible for inclusion in Cochrane Reviews. We therefore validated the classifier using a third data set to determine whether the desired level of 99% recall (calibrated using the Clinical Hedges data set) could be achieved in practice. This validation data set ("Cochrane Included Studies" in Fig. 2) comprises title and abstract records of all study reports included in Cochrane Reviews in which eligible study designs are restricted to "RCTs only," published up to April 2017. The data set comprises 94,305 records of 58,283 included studies across 4,296 Cochrane Reviews. Although it could be assumed that the vast majority of these records report an RCT, in practice, we found that some records of included study reports did not report an RCT (e.g., they reported a meta-analysis of RCTs, a related editorial, or personal correspondence). These records were retained in the validation set, as removing them all would have required the manual screening of all records.

2.6. Excluded data

Articles without an abstract (i.e., title-only records) may contain insufficient information for accurate machine (or human) classification. However, title-only records that include the words "randomised controlled trial" (as per CONSORT guidance) should be labeled correctly by a classifier. In consultation with the IRMG, and in the light of manual assessment of records with some content in their abstract field (but not a full abstract), we determined that pragmatic cutoffs for including a record in the training, calibration, or validation data sets would be set at 400 characters as a minimum abstract length and 15 characters as a minimum title length. These cutoffs aimed to balance the need for sufficient text to be present in the abstract field for the machine learning to operate while not referring too many records for manual assessment. It is important to note that, in the Cochrane Evidence Pipeline workflow (mentioned previously), all records with title and/or abstract fields that have fewer characters than the minimum cutoff are referred for manual screening by members of the Cochrane Crowd. When the minimum character cutoff is applied to the Cochrane Included Studies data set, the final number of studies in the evaluation falls to 44,007.
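As a concrete illustration, the eligibility rule reduces to a simple check; the field handling below is an assumption for illustration.

```python
MIN_ABSTRACT_CHARS = 400  # minimum abstract length (Section 2.6)
MIN_TITLE_CHARS = 15      # minimum title length (Section 2.6)

def has_enough_text(title, abstract):
    """True if a record has enough text for machine classification.

    Records failing this check are not scored by the classifier; in the
    Evidence Pipeline they are referred to the Cochrane Crowd instead.
    """
    return (len(title or "") >= MIN_TITLE_CHARS
            and len(abstract or "") >= MIN_ABSTRACT_CHARS)
```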
2.7. Machine learning methods for RCT identification

Machine learning describes a group of algorithms that seek to "learn" to do a task by exposure to (typically large amounts of) data. The approach we used here can be described as supervised machine learning, meaning that the algorithm is "trained" on articles for which the true label is already known. Although the current state-of-the-art approach for text classification is the use of neural network models, we have previously found that support vector machine (SVM) models, and specifically ensembles of multiple SVM models (ensembling describes a strategy of using multiple machine learning models together, with the aim of improving performance compared with any model individually), were similarly accurate for high-sensitivity document recall [8]. SVMs are less computationally intensive than neural models and therefore can run quickly and without the need for any special computer hardware. The final Cochrane RCT Classifier model also needed to be deployed in a live Web service that might need to cope with heavy user demand. For these reasons, we chose SVMs for the present study. We refer the interested reader to a detailed description of machine learning methods as applied to abstract classification [8].

In our previous work, we incorporated metadata from the database describing study design into our models (the Publication Type tag in MEDLINE, which is manually added by MEDLINE staff, often several months after publication). However, as the Evidence Pipeline retrieves mainly very new records, which usually lack this metadata, we used a model that uses titles and abstract text without additional metadata.

We used the bag-of-words approach, in which each title–abstract record is represented as a vector of 0s and 1s, depending on the presence or absence of each unique word from the article set vocabulary [8]. These vector representations are then used to "train" (i.e., find optimal parameters for) an SVM model. We preprocessed the records to remove commonly used words (e.g., "and," "the") that appear on the PubMed "stopword" list [8]. During our initial development phase, we found that an ensemble of two SVM models with minor differences resulted in greatest accuracy when evaluated on the training data. We therefore selected an ensemble of two SVM variants for use in the study (see Fig. 2). The first classifier (SVM1) represented the texts as individual words, pairs of words, and triplets of words (uni-, bi-, and tri-grams). This accounts for situations in which adjacent words affect document class (e.g., the text "randomized controlled trial" might be more strongly indicative of an RCT than any word individually). The second classifier (SVM2) used a unigram model (i.e., each word is considered individually) and used a strategy of oversampling. This strategy aims to reduce the likelihood of missing a "rare class" (here the "rare class" is RCTs, which account for ~5% of the data set) by artificially increasing the number of RCTs in the training data set by random sampling with replacement (this process is not repeated with the calibration or validation data sets, which are left in their original state).

The source code for building SVM1 is available at https://github.com/alan-turing-institute/DSSG19-Cochrane/blob/dev/analyses/partner_baseline/create_model.py. The source code for building SVM2 is available at https://github.com/ijmarshall/robotsearch.
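For readers who want the gist of the two variants without reading the linked repositories, the following scikit-learn sketch shows the general shape of SVM1 and SVM2. It is a simplified reconstruction under assumed names (texts, labels); details such as the stopword list (the paper uses the PubMed list, approximated here by scikit-learn's built-in English list) and regularization settings differ from the production code.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def train_svm1(texts, labels):
    """SVM1: binary bag-of-words over uni-, bi-, and tri-grams, so phrases
    such as 'randomized controlled trial' act as single features."""
    vec = CountVectorizer(ngram_range=(1, 3), binary=True, stop_words="english")
    model = LinearSVC().fit(vec.fit_transform(texts), labels)
    return vec, model

def train_svm2(texts, labels, seed=0):
    """SVM2: unigram features, with the rare class (RCTs, label 1) oversampled
    by random sampling with replacement until the classes are balanced."""
    vec = CountVectorizer(ngram_range=(1, 1), binary=True, stop_words="english")
    X = vec.fit_transform(texts)
    y = np.asarray(labels)
    rng = np.random.default_rng(seed)
    rct_rows = np.flatnonzero(y == 1)
    n_extra = int((y == 0).sum()) - len(rct_rows)      # copies needed for balance
    extra = rng.choice(rct_rows, size=n_extra, replace=True)
    X_bal = vstack([X, X[extra]])                      # duplicated RCT rows appended
    y_bal = np.concatenate([y, np.ones(n_extra, dtype=y.dtype)])
    model = LinearSVC().fit(X_bal, y_bal)
    return vec, model
```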
2.8. Generating calibrated probability estimates

SVMs estimate the distance between a given record and a "hyperplane" [17], which, in the current use, separates RCTs (the positive class) from non-RCTs (the negative class). The hyperplane distance metric is not readily interpretable (in our data set, this metric had a numeric value approximately between −1 and +8), and we therefore sought to add probability estimates to meet the needs of Cochrane users, who have found this feature to be particularly useful in understanding the output. To achieve this, we calibrated the ensemble SVM scores on the Clinical Hedges data set using a logistic regression model (known as Platt scaling) [18]. This generated a score for each "unseen" record in the calibration or validation data sets that is bounded by 0 and 1. These scores are closer to representing the true probability that a given record reports an RCT; as such they are readily interpretable, with higher scores representing a higher likelihood that the record reports an RCT (and vice versa). When viewed graphically, the distribution of scores is often U-shaped, with most records being assigned either a high (close to 1) or low (close to 0) probability score, and a smaller number of records in the middle that are more ambiguous in terms of class membership.

2.9. Evaluation metrics

In this article, we use the conventional information retrieval terminology recall and precision, which are synonymous with sensitivity and positive predictive value, respectively. As outlined previously, recall is the metric of primary concern in the current use scenario: that is, ensuring that eligible study reports are not incorrectly discarded from the Evidence Pipeline workflow. Cochrane required the system to have at least 99% recall. Recall is calculated as the proportion of relevant items (i.e., records describing an RCT) that are correctly identified by the Evidence Pipeline workflow compared with the total number of records genuinely reporting an RCT that should have been identified.

To evaluate the discriminative performance and quality of calibration of our machine learning strategy on the Clinical Hedges data, we used bootstrap validation as described by Steyerberg et al. [19]. In short, a series of artificial new data sets were "bootstrapped" by random sampling with replacement from the Clinical Hedges data set. Logistic regression models (which served the dual purposes of ensembling the individual SVM models and producing calibrated probability outputs) were trained on each sampled data set and evaluated on the original data set. This process was repeated 5,000 times and used to estimate performance metrics with 95% confidence intervals. Although the primary use of the system is for binary classification, a key secondary use is providing indicative probability scores to users. We evaluate the quality of the probabilities via a calibration plot and by calculation of the Brier score and C statistics [20].

Common practice in Cochrane Reviews is to find, use, and cite all published (and unpublished) reports of each included study ("study reports"). Many studies included in Cochrane Reviews comprised multiple study reports. This means that if the classifier "misses" one of several study reports of the same RCT, this does not necessarily mean the RCT study has been "lost." We therefore adopted the following approach. We first classified all study reports in Cochrane Reviews of RCTs using the machine learning classifier, and then we considered a study to be "lost" only if all reports of that study fell below the threshold. As such, the "study" is our unit of analysis rather than the "study report." We made this decision because we found many secondary citations in reviews referred to indirectly related non-RCT studies and also because we would expect the retrieval of a single article would alert the review team to the existence of the trial.
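The calibration and ensembling steps can likewise be sketched briefly: a logistic regression fitted to the two SVM margin scores against the Clinical Hedges labels both combines the models and maps their unbounded outputs onto [0, 1], which is essentially Platt scaling [18]. The variable names below are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_calibrator(svm_margins, gold_labels):
    """Fit the Platt-style calibrator on the Clinical Hedges data.

    svm_margins: array of shape (n_records, 2) holding the hyperplane
    distances from SVM1 and SVM2; gold_labels: 1 = RCT, 0 = non-RCT.
    The logistic regression serves two purposes at once: it ensembles
    the two SVM scores and squashes them into calibrated probabilities.
    """
    return LogisticRegression().fit(np.asarray(svm_margins), gold_labels)

def rct_probability(calibrator, svm_margins):
    """Calibrated probability that each record reports an RCT."""
    return calibrator.predict_proba(np.asarray(svm_margins))[:, 1]
```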

Table 1. The 2 × 2 table from which recall and precision are calculated

                                        RCTs (gold standard)    Non-RCTs (gold standard)
Machine learning classed RCTs           True positives          False positives
Machine learning classed non-RCTs       False negatives         True negatives

Abbreviation: RCT, randomized controlled trial.

Precision is also a metric of interest because it can be used to compute the number of articles requiring manual screening by Cochrane Crowd in the Evidence Pipeline workflow. Here, we were concerned with the number of irrelevant records (i.e., records not reporting an RCT) that are incorrectly classified by machine learning as relevant (i.e., records with an assigned probability score above the identified threshold score), which must then be filtered out manually by the Cochrane Crowd. Precision is calculated as the proportion of retrieved records that genuinely report an RCT.

Recall and precision were calculated from a 2 × 2 table representing positive/negative (relevant/irrelevant) classes and whether they were correctly or incorrectly classified (Table 1). Formulas used to calculate precision and recall are as follows:

precision = true positives / (true positives + false positives)

recall = true positives / (true positives + false negatives)

We computed statistics for precision at 99% recall against the Clinical Hedges calibration data set. As specified previously, recall was set at 99% by the IRMG. As the Cochrane Reviews we examined contain only RCTs (and the non-RCTs excluded during searches are not usually comprehensively recorded), we were not able to calculate precision on the Cochrane Reviews data set and report recall only.

For the primary analysis, the denominator was all articles in Cochrane Reviews meeting the minimum character length criteria described previously (i.e., very short titles and abstracts were excluded). We assume that manual assessment will yield 100% recall of these records. We also report results on the full data set, without removing articles with small or nonexistent abstracts, as a secondary analysis. The first figure can be interpreted as the recall of the overall workflow because it takes account of our decision to remove records with insufficient information for machine classification from the workflow.

We present absolute values of the total number of eligible studies "lost" to Cochrane Reviews. Finally, we also present the distribution of "lost" study reports according to the year of publication, because we hypothesize that the classifier may perform less well on older study reports because (1) it has been trained on newer reports and (2) trial reporting may have improved as a result of the CONSORT statement.

3. Results

The machine learning classifier for identifying reports of randomized trials (Cochrane RCT Classifier) was built as per the previously mentioned methods from the screening of 280,620 Embase records (January 2014 to July 2016) by Cochrane Crowd. Of these, 20,454 (7.3%) were deemed to be RCTs.

3.1. Threshold setting and binary classification performance

The 49,025 records from the Clinical Hedges data set were scored by the machine learning classifier. The records were ordered according to classifier score, and precision and recall statistics were calculated for every record in sequence. The classifier probability that corresponded with 99% recall was recorded and used as the classification threshold for the later validation (and the deployed system). The discriminative and calibrative performance of this strategy, estimated using bootstrap sampling, is presented in Table 2. We estimate that precision was 8.3%, meaning that one in every 12 records retrieved described an RCT. Setting the classifier at this level of recall resulted in 58% of records in this data set being automatically discarded as highly unlikely to be reporting a randomized trial.
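To make the arithmetic concrete, consider cell counts of roughly the magnitude seen in the calibration set (the counts below are illustrative, not figures reported in the paper): retrieving 1,571 of 1,587 true RCTs alongside about 17,300 false positives yields recall of about 0.99 and precision of about 0.08.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from the cell counts of Table 1."""
    return tp / (tp + fp), tp / (tp + fn)

# Illustrative counts only (not reported in the paper): 1,571 of 1,587
# RCTs retrieved, with ~17,300 non-RCTs wrongly retained.
precision, recall = precision_recall(tp=1571, fp=17300, fn=16)
print(f"precision={precision:.3f}, recall={recall:.3f}")  # ~0.083, ~0.990
```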

Table 2. Bootstrap estimates of model performance on the Clinical Hedges data set, with 95% confidence intervals

Validation precision    Validation recall    Validation specificity    C statistic          Brier score
0.08 (0.06, 0.12)       0.99 (0.98, 0.99)    0.63 (0.48, 0.76)         0.98 (0.98, 0.98)    0.05 (0.05, 0.05)

Estimates of the C statistic and Brier score were 0.978 and 0.048, respectively, indicating excellent discriminative performance. We present a calibration plot showing point estimates from the bootstrap evaluation and the final model (trained on the whole data set) in Fig. 3. We show how the predicted scores are distributed for RCTs and non-RCTs in Fig. 4.

Fig. 3. Calibration plot showing bootstrap estimates of predicted vs. observed probabilities of an article being an RCT in the Clinical Hedges data set (each blue point represents an estimate of a model generated from one bootstrap sample) and the performance of the final model (orange). (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

3.2. Validating the classifier recall on studies included in Cochrane Reviews

The title and abstract records of 58,283 studies included in 4,296 Cochrane Reviews were fed through the classifier. Records with a score equal to or above the threshold identified in the previous step were automatically classified as potentially reporting an RCT; those scoring below this threshold were automatically classified as not reporting an RCT. Table 3 summarizes the number of eligible studies that are "lost" to reviews as a result of all of their corresponding study reports scoring lower than the threshold.

Table 3. Number of included studies in Cochrane Reviews classified as RCTs

                                                      RCTs correctly identified      RCTs not identified
                                                      by the classifier (recall)     by the classifier
All studies (N = 58,283)                              54,683 (93.8%)                 3,600 (6.2%)
Studies with sufficient information for
machine classification (N = 44,007)                   43,783 (99.5%)                 224 (0.5%)

Abbreviation: RCT, randomized controlled trial.

Fig. 4. Distribution of classification scores for RCTs and non-RCTs in the Clinical Hedges data set. RCT, randomized controlled trial.

When records that contain insufficient information for machine classification are excluded from machine classification and assumed to be manually assessed (see Section 2), the classifier correctly identifies 99.5% (43,783 out of 44,007) of studies.

In our secondary analysis, when we include for machine classification data for all studies (including the subset of studies that contain insufficient information for accurate machine classification [see Section 2]), we find that 3,396 studies would be potentially "lost" to reviews (compared with 224 studies when only those with sufficient information are included).

Fig. 5 shows the 224 randomized trials "lost" by the classifier per 1,000 published, by year of publication, for all but one of the publications (the age of one publication could not be ascertained). These results show that older reports are much more likely to be misclassified by the machine learning classifier.

Fig. 5. RCTs "lost" by the classifier per 1,000 published, by year of publication, showing that the risk of "losing" a publication decreases over time.
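The study-level analysis behind Table 3 (Section 2.9) reduces to grouping report scores by study and testing the best-scoring report against the threshold: a study counts as "lost" only when every one of its reports falls below the cutoff. A minimal sketch, assuming an input of (study_id, score) pairs:

```python
def lost_studies(report_scores, threshold):
    """Identify studies whose every report scores below the threshold.

    report_scores: iterable of (study_id, score) pairs, one per study report.
    A single report at or above the threshold is enough to surface the study.
    """
    best = {}
    for study_id, score in report_scores:
        best[study_id] = max(best.get(study_id, 0.0), score)
    return [sid for sid, top in best.items() if top < threshold]
```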

4. Discussion

4.1. Summary of findings

We conducted a three-stage study that involved training, calibrating, and evaluating a machine learning classifier designed to distinguish between bibliographic title–abstract records that report an RCT and those that do not. Recall falls to an unacceptably low level (94%) if records with limited information in their titles and/or abstracts are submitted for machine classification. However, when these records are excluded, the classifier exceeds the standard required by Cochrane, with recall at 99.5% of all those records scored. It should be noted that this means that some records are unsuitable for machine learning and so must necessarily be checked manually; however, this mirrors current practice, whereby records with limited information in their titles and abstracts are retained for further assessment on the basis of their corresponding full-text reports.

We deem the recall level as "acceptable" for use in "live" reviews on the basis that (1) this exceeds the recall of validated RCT search filters that have been used in systematic review production for many years and (2) this threshold was agreed by methodologists in Cochrane for use in Cochrane Reviews.

Although the precision of 8% estimated against the Clinical Hedges data set appears low, this is partly because of the age of that data set and the relatively low prevalence of RCTs. In the Cochrane Evidence Pipeline workflow (Fig. 1), the classifier saved Cochrane Crowd from needing to check 185,039 records manually (out of a total of 449,480) during the 2018 calendar year, a very large saving in manual workload [13].

Systematic reviews are frequently used to support decision-making processes for both policymakers and practitioners and are also key sources of evidence in drug licensing regulation. Reviews need to be accurate representations of the state of current knowledge, as decisions that are based on their findings can affect people's lives. Reviews also need to be demonstrably correct, as the way in which evidence is synthesized can have implications, for example, for drug licensing, and can therefore be open to legal challenge. These joint imperatives (for systematic reviews to be correct and to be seen to be correct) generate the normative expectation that they should contain all relevant research evidence, and the corollary concern that review findings based on bodies of evidence that inadvertently exclude some eligible studies are potentially unreliable. To this end, our study provides data demonstrating the reliability of implementing what could be seen as a major innovation in study identification methods for systematic reviews (the automatic eligibility assessment of study reports, and the exclusion of a portion without any manual checking by humans), rolled out at scale across Cochrane: the largest producer of systematic reviews globally and an organization committed to minimizing the risk of bias in review production through methodological and editorial rigor. We note that the recall threshold set by Cochrane (99%) exceeds the performance of conventional search methods (e.g., the Cochrane Highly Sensitive Search Strategy was found to have recall of 98.5% by the Clinical Hedges team) [21], and we have demonstrated in previous work that our machine learning approach can exceed the precision achieved by conventional search filters [8].

Although our results indicated that 0.5% of studies could have been "lost" to Cochrane Reviews if authors had used this classifier (affecting 178 reviews, leaving 4,118 reviews unaffected), this is almost certainly an overestimate when considering the prospective use of this classifier to support the identification of newly published RCTs for new, updated, and/or living systematic reviews. First, other means of finding studies are routinely used in Cochrane Reviews alongside conventional electronic searching (such as checking reference lists and citing records or contacting researchers who are active in the topic area), so some of these "lost" studies would likely be found using these complementary search methods. Second, studies that are potentially lost are overwhelmingly older reports. Although we do not dismiss the potential importance of identifying older trials for consideration in systematic reviews, it is reassuring that more recent studies (relevant especially for newer treatments and review updates) are far less likely to be missed. One reason the classifier performs better for more recent studies could be improvement in the reporting of RCTs over time, for example, in response to the CONSORT statement [16,22]. Trialists are now widely expected to detail trial methodology in the report's abstract and to include the fact that they are reporting an RCT in its title.
4.2. Strengths and weaknesses of this evaluation

In this article, we have described a robust evaluation of the performance of an RCT classifier in a large data set of systematic reviews. We were fortunate in having three large, independently generated, high-quality data sets available to train, calibrate, and validate the classifier. This is an unusual position to be in, and there are probably few study designs other than RCTs with comparably high-quality data sets available. We note that this may limit the potential to evaluate the performance of similar workflows, created to identify other types of study design, using the same three-stage process.

The current classifier has been trained almost exclusively on records published in English, so it does not necessarily generalize to other languages. However, this important limitation is, in principle, surmountable, as machine learning technology is language agnostic and would therefore be capable of modeling any language, so long as sufficient training data were available.

The focus of this work has been to build a machine learning classifier for deployment in a specific workflow. The machine learning classifier we have developed meets required levels of recall but inevitably results in some studies being "lost" to reviews. This study does not attempt to ascertain the impact of these losses on the affected reviews' statistical and narrative results and findings, and a future extension of this study will investigate this important question. We also note that only 178 of 4,296 reviews were affected, leaving results unchanged in at least 96% of the reviews.

4.3. Next steps: the "Screen4Me" service

We are currently piloting an extension to the Evidence Pipeline for use with individual Cochrane Reviews. Authors using this service will compile their set of potentially eligible records from searches of multiple databases (including CENTRAL) as is typical for any systematic review. Given that the RCT classifier and Cochrane Crowd have already classified more than 800,000 study records (and increasing by >10,000 per month), it is likely that a proportion of the records retrieved and uploaded to the Classifier by authors will have already been classified according to whether they report an RCT or not. Where this is the case, the records that are already known not to describe RCTs will be removed from the workflow. The remaining studies will then be sent to the RCT classifier, and those records classified as not reporting an RCT (i.e., that fall below the 99% recall threshold) will be discarded. Finally, the records classified as potentially reporting an RCT will be screened by Cochrane Crowd. The review team is then left with a much smaller pool of records to examine, containing only RCTs. In early pilots, this new workflow reduced manual screening workload by between 40% and 70%, depending on the prevalence of RCTs in the search results of individual reviews.
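In outline, the Screen4Me triage can be expressed as below; the lookup of previously classified records and all function names are assumptions for illustration rather than the service's actual interface.

```python
def screen4me(author_records, known_labels, classifier_score, threshold, crowd_says_rct):
    """Sketch of the Screen4Me triage for one review's search results.

    known_labels: dict mapping record IDs to earlier 'rct'/'not_rct'
    decisions (the >800,000 records already labeled by the classifier
    and the Crowd). All names are illustrative, not the production API.
    """
    pool = []
    for rec in author_records:
        label = known_labels.get(rec["id"])
        if label == "not_rct":
            continue                                   # reuse the earlier decision
        if label == "rct":
            pool.append(rec)                           # already confirmed as an RCT
        elif classifier_score(rec) >= threshold and crowd_says_rct(rec):
            pool.append(rec)                           # survives both filters
    return pool  # a much smaller pool for the review team to screen
```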
5. Conclusions

The Cochrane RCT Classifier is now deployed by Cochrane for reducing screening workload in review production. As part of a wider workflow that includes prospective database searches and crowdsourcing to build a comprehensive database of RCTs, machine learning can reduce the manual screening burden associated with research synthesis while ensuring a very high level of recall that is acceptable for an organization that depends on having comprehensive access to the published research that falls within health care topics relevant to its scope.

Acknowledgments

The authors acknowledge the vital contribution that members of Cochrane Crowd have made to this work. Without their patient, accurate labeling of hundreds of thousands of records, this work would have been impossible. Likewise, the authors acknowledge the contribution that McMaster University has made to this field in compiling the Hedges data set and making it available for methodological development. The availability of these data means that successive developments can be measured against the same baseline. The authors would like to thank the members of the Cochrane Information Retrieval Methods Group Executive for their helpful engagement, as they determined the classifier thresholds and discussed its implementation within Cochrane. Their input enabled the authors to calibrate the classifier at a threshold considered viable for use in "live" reviews and also to construct the validation portion of the study to test that our threshold would be sufficiently generalizable. The authors also acknowledge the contribution that Byron Wallace (Northeastern University) has made to the field. As well as collaborating with us on many projects, he carried out a code review of the SVM1 classifier, which is derived from his own work available at https://github.com/bwallace/curious_snake. The authors acknowledge with thanks permission to use Cochrane data and Gordon Dooley from Metaxis for supplying these data.

References

[1] Cochrane. Cochrane Library. 2019. Available at https://www.cochranelibrary.com/. Accessed October 26, 2019.
[2] Lefebvre C, Glanville J, Briscoe S, Littlewood A, Marshall C, Metzendorf M, et al. Chapter 4: searching for and selecting studies. In: Higgins J, Thomas J, Chandler J, Cumpston M, Li T, Page M, et al, editors. Cochrane Handbook for Systematic Reviews of Interventions. 2nd ed. Chichester, UK: John Wiley & Sons; 2019:67–99.
[3] Bastian H, Glasziou P, Chalmers I. Seventy-five trials and eleven systematic reviews a day: how will we ever keep up? PLoS Med 2010;7(9):e1000326.
[4] Shojania KG, Sampson M, Ansari MT, Ji J, Doucette S. How quickly do systematic reviews go out of date? A survival analysis. Ann Intern Med 2007;147:224–33.
[5] Macleod MR, Michie S, Roberts I, Dirnagl U, Chalmers I, Ioannidis JPA, et al. Biomedical research: increasing value, reducing waste. Lancet 2014;383:101–4.
[6] Tsafnat G, Dunn A, Glasziou P, Coiera E. The automation of systematic reviews. BMJ 2013;346(1):f139.
[7] O'Mara-Eves A, Thomas J, McNaught J, Miwa M, Ananiadou S. Using text mining for study identification in systematic reviews: a systematic review of current approaches. Syst Rev 2015;4(1):5.
[8] Marshall I, Storr AN, Kuiper J, Thomas J, Wallace BC. Machine learning for identifying randomized controlled trials: an evaluation and practitioner's guide. Res Synth Methods 2018;9:602–14.
[9] Wallace BC, Noel-Storr A, Marshall IJ, Cohen AM, Smalheiser NR, Thomas J. Identifying reports of randomized controlled trials (RCTs) via a hybrid machine learning and crowdsourcing approach. J Am Med Inform Assoc 2017;24(6):1165–8.
[10] Thomas J, Noel-Storr A, Marshall I, Wallace B, McDonald S, Mavergames C, et al. Living systematic reviews: 2. Combining human and machine effort. J Clin Epidemiol 2017;91:31–7.
[11] Cochrane. About the CRS (Cochrane Register of Studies). Cochrane Community. 2019. Available at https://community.cochrane.org/help/tools-and-software/crs-cochrane-register-studies/about-crs. Accessed October 25, 2019.
[12] Noel-Storr A, Dooley G, Wisneiwski S, Glanville J, Thomas J, Cox S, et al. Cochrane Centralised Search Service showed high sensitivity identifying randomized controlled trials: a retrospective analysis. J Clin Epidemiol 2020;127:142–50.
[13] Noel-Storr A, Dooley G, Elliott J, Steele E, Shemilt I, Mavergames C, et al. An evaluation of Cochrane Crowd finds that crowdsourcing can help to address the challenge of information overload in evidence synthesis. J Clin Epidemiol.
[14] Nevin L. Advancing the beneficial use of machine learning in health care and medicine: toward a community understanding. PLoS Med 2018;15(11):4–7.
[15] Wilczynski NL, Douglas M, Haynes RB, and the Hedges Team. An overview of the design and methods for retrieving high-quality studies for clinical care. BMC Med Inform Decis Mak 2005;5(20):1–15.
[16] Schulz KF, Altman DG, Moher D. CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials. BMC Med 2010;8(18):1–9.

[17] Sain SR, Vapnik VN. The nature of statistical learning theory. Technometrics 2006:131.
[18] Platt JC. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Large Margin Classif 1999.
[19] Steyerberg EW, Harrell FE, Borsboom GJJM, Eijkemans MJC, Vergouwe Y, Habbema JDF. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol 2001.
[20] Brier G. Verification of forecasts expressed in terms of probability. Mon Weather Rev 1950.
[21] McKibbon KA, Lou WN, Haynes RB. Retrieving randomized controlled trials from MEDLINE: a comparison of 38 published search filters. Health Info Libr J 2009;26:187–202.
[22] Turner L, Shamseer L, Altman DG, Schulz KF, Moher D. Does use of the CONSORT statement impact the completeness of reporting of randomised controlled trials published in medical journals? A Cochrane review. Syst Rev 2012;29(1):60.