Identifying Clinical Features in Primary Care Electronic Health Record Studies: Methods for Codelist Development
Total Page:16
File Type:pdf, Size:1020Kb
Open Access Research BMJ Open: first published as 10.1136/bmjopen-2017-019637 on 22 November 2017. Downloaded from Identifying clinical features in primary care electronic health record studies: methods for codelist development Jessica Watson,1 Brian D Nicholson,2 Willie Hamilton,3 Sarah Price3 To cite: Watson J, ABSTRACT Strengths and limitations of this study Nicholson BD, Hamilton W, et al. Objective Analysis of routinely collected electronic health Identifying clinical features in record (EHR) data from primary care is reliant on the ► This paper presents rigorous reproducible methods primary care electronic health creation of codelists to define clinical features of interest. record studies: methods for for codelist generation to increase transparency and To improve scientific rigour, transparency and replicability, codelist development. BMJ Open replicability in electronic health record studies. we describe and demonstrate a standardised reproducible 2017;7:e019637. doi:10.1136/ ► Clear a priori definition of the feature of interest methodology for clinical codelist development. bmjopen-2017-019637 ensures clinical relevance and enables future Design We describe a three-stage process for developing researchers to assess the applicability of existing ► Prepublication history and clinical codelists. First, the clear definition a priori of codelists to future research questions. additional material for this the clinical feature of interest using reliable clinical paper are available online. To ► Generation of auditable, replicable and modifiable resources. Second, development of a list of potential codes view these files, please visit syntax for codelists enables replication and ‘future- using statistical software to comprehensively search all the journal online (http:// dx. doi. proofs’ codelists. available codes. Third, a modified Delphi process to reach org/ 10. 1136/ bmjopen- 2017- ► Using a Delphi approach to reach consensus on consensus between primary care practitioners on the most 019637). inclusion of codes allows sensitivity analysis to relevant codes, including the generation of an ‘uncertainty’ explore the impact of uncertainty in coding. Received 19 September 2017 variable to allow sensitivity analysis. ► Using multiple clinicians in a Delphi panel reviewing Revised 13 October 2017 Setting These methods are illustrated by developing a codes may be unfeasible and inefficient for studies Accepted 23 October 2017 codelist for shortness of breath in a primary care EHR with large numbers of codes; a compromise of using sample, including modifiable syntax for commonly used two clinicians per feature from a panel of six offers a statistical software. reasonable trade-off. Participants The codelist was used to estimate the frequency of shortness of breath in a cohort of 28 216 patients aged over 18 years who received an incident http://bmjopen.bmj.com/ diagnosis of lung cancer between 1 January 2000 and 30 resource for researchers and are increas- November 2016 in the Clinical Practice Research Datalink ingly used in epidemiological and medical (CPRD). research resulting in over 1500 publications Results Of 78 candidate codes, 29 were excluded as since 2000, increasing from ~80 in 2005 to inappropriate. Complete agreement was reached for 44 more than 450 in 2015/2016. (90%) of the remaining codes, with partial disagreement There are three well-established UK over 5 (10%). 13 091 episodes of shortness of breath primary care EHR databases: the Clin- were identified in the cohort of 28 216 patients. Sensitivity ical Practice Research Datalink (CPRD) on September 30, 2021 by guest. Protected copyright. analysis demonstrates that codes with the greatest including 4.4 million currently registered uncertainty tend to be rarely used in clinical practice. patients, covering 6.9% of the UK popula- Conclusions Although initially time consuming, using a 2 rigorous and reproducible method for codelist generation tion ; The Health Improvement Network ‘future-proofs’ findings and an auditable, modifiable syntax including 3.6 million currently registered 3 for codelist generation enables sharing and replication patients giving ~5.7% coverage of the nation ; of EHR studies. Published codelists should be badged and QResearch including approximately by quality and report the methods of codelist generation 5 million currently registered patients in the 1Centre for Academic Primary including: definitions and justifications associated with UK.4 All three databases record coded anony- Care, Bristol Medical School, each codelist; the syntax or search method; the number of mised information about patients: demo- University of Bristol, Bristol, UK candidate codes identified; and the categorisation of codes 2 graphics, diagnoses, symptoms, prescriptions, Nuffield Department of Primary after Delphi review. Care Health Sciences, University immunisation history, referral information of Oxford, Oxford, UK and test results. Linkages enable follow-up 3University of Exeter Medical of patients beyond the primary care setting, School, Exeter, UK INTRODUCTION for example, to data recorded by the Office Correspondence to Electronic health records (EHRs) have been for National Statistics (ONS), the National Dr Jessica Watson; used in routine primary care practice in the Cancer Registration Service and to Hospital jessica. watson@ bristol. ac. uk UK for at least 20 years.1 EHRs are a rich Episode Statistics. Integrated primary and Watson J, et al. BMJ Open 2017;7:e019637. doi:10.1136/bmjopen-2017-019637 1 Open Access BMJ Open: first published as 10.1136/bmjopen-2017-019637 on 22 November 2017. Downloaded from Figure 1 The method for codelist collation consists of three steps. secondary care databases are also being developed. For as it is not sufficient to address the issues of scientific example, ResearchOne includes data for over 5 million rigour and reproducibility in codelist development. patients from General Practice, Child Health, Commu- The problem is illustrated by brief examination of nity Health, Out-of-Hours, Palliative Hospital, Accident codelists recently deposited on the repository. Without and Emergency and Acute Hospital (http://www. resear- a clear definition of the clinical variable a codelist is chone. org/). designed to encapsulate, it is not possible to critique A key stage in EHR research is identifying exposures or evaluate it for peer review or to decide whether it and outcomes of interest. This apparently simple task is generalisable to other studies. For example, codelists is made more complicated by the fact that EHR clin- deposited for cancer (https:// clinicalcodes. rss. mhs. ical data is generally stored as codes, often including man. ac. uk/ medcodes/ article/ 50/) do not adhere to qualitative information, such as ‘abdominal pain’, ‘left the standardised International Classification of Diseases iliac fossa pain’ and ‘intermittent abdominal pain’. (ICD) definition of cancer, that is, ICD codes C00–C97, http://bmjopen.bmj.com/ These separate codes need to be grouped into codel- as 193 (~9%) of the 2254 Read codes related to carci- ists or thesauri, with the groups containing all the noma in situ (ICD D00–D09). Furthermore, 100 (~4%) codes pertaining to the variable of interest. However, codes were obsolete, or they indicated the absence of the methods used to develop codelists are not stan- cancer or they were completely unrelated to cancer. dardised and are often poorly reported. They are an This demonstrates the need to establish standardised increasingly recognised source of bias in EHR research, methods for codelist development. Currently, recom- owing to both inclusion of inappropriate codes and mended methods, for example, Davé and Petersen9 omission of important codes. To address this, the and CALIBERcodelists (http:// caliberanalysis. r- forge. r- on September 30, 2021 by guest. Protected copyright. REporting of studies Conducted using Observational project.org/), need updating. This is because they omit Routinely-collected health Data (RECORD) Statement steps to standardise the definition of clinical terms and states that ‘a complete list of codes and algorithms used to because they are based in the Read code system, which classify exposures, outcomes, confounders, and effect modifiers is being superseded by SNOMED CT codes (System- should be provided’.5 Clinicalcodes. org has been devel- atized Nomenclature of Medicine – Clinical Terms) in oped by the University of Manchester to encourage April 2018. researchers to publish clinical codelists used in EHR We have significant experience in EHR research, research,6 and some other universities are developing with ~40 published studies conducted in the CPRD their own open-access, citable repositories of codelists, since 2012. We have developed and refined rigorous for example, the University of Bristol7 and University methods for developing clinical codelists for use in of Cambridge.8 The current clinicalcodes. org reposi- CPRD studies independent of the Read code system. tory contains 72 916 clinical codes deposited within 432 The aim of this paper is to report a clear, standardised, codelists (https:// clinicalcodes. org), in the format of reproducible methodology and to increase scientific a list of papers and associated codes. This repository is rigour in conduct of EHR research. The method is illus- a necessary step forward towards addressing transpar- trated using the CPRD but applies equally well to other ency; however, it does not tackle the potential for bias,