PhUSE 2016

Paper RW04

An Algorithm To Distinguish Between COPD and Asthma

Berber T. Snoeijer, Jetty A. Overbeek, PHARMO Institute for Drug Outcomes Research, Utrecht, Netherlands

ABSTRACT

Data from out-patient pharmacies are a valuable source for real-world research. However, the indication for use is usually not registered in such databases. Because drugs for obstructive airway diseases are used for both COPD and asthma, distinguishing between the two is difficult, but often necessary. We developed a validated algorithm to distinguish between COPD and asthma. From PHARMO’s General Practitioner (GP) Database, all patients with a GP-recorded diagnosis of COPD, asthma or another respiratory disease were selected. These patients were linked to their drug dispensings in PHARMO’s Out-patient Pharmacy Database, and the indication was modeled based on: nasal preparations (ATC code R01), drugs for obstructive airway disease (ATC code R03), age, gender, duration of use, prescriber and number of dispensings. The algorithm was built on raw data so that it represents the actual situation as closely as possible.

INTRODUCTION

Drugs for treating COPD and asthma come from the same class, and many of them can be used for both diseases. However, the treatment pathways and drug combinations often differ, especially in different stages of the diseases. For this reason, it is important for pharmaceutical companies to know when their drug is used, for which disease, and in what combination. They therefore need to distinguish between COPD and asthma patients, which is difficult when only claims data are available and indications are not recorded. As COPD patients are almost always above 40 years of age, the age of the patient is often used as a rough proxy for the indication. However, asthma is also diagnosed in patients older than 40 years, as can be seen in Figure 1. Furthermore, there are patients using drugs from the same drug class who have neither diagnosis. This paper describes how a model was built to distinguish between asthma and COPD patients based on medication profile and basic characteristics.

Figure 1: Age at diagnosis of respiratory disease (based on Dutch GP database)

BUILDING A MODEL

For an accurate discrimination between asthma and COPD patients, the model was based on the linkage between GP data and out-patient pharmacy data. All patients with an ICPC code related to pulmonary diseases were selected from the GP Database and linked to their corresponding data in the Out-patient Pharmacy Database. Only patients who were selected from the GP Database and had data available in the Out-patient Pharmacy Database were included in the analysis. The patients were divided into 4 indication groups: patients with an ICPC code for asthma, patients with an ICPC code for COPD, patients with an ICPC code for chronic rhinitis/allergies, and patients with ICPC codes for other pulmonary diseases.

The analysis was performed in several steps:

1. Exploratory analysis to find discriminating factors between the groups
2. Logistic modelling to find the best predictive factors for having an asthma or COPD indication
3. Checking and defining the cut-offs to get the actual results with an accurate representation of the actual proportions of asthma and COPD patients.

EXPLORATORY ANALYSIS

The exploratory analysis is needed to arrive at a limited number of factors that can be tested in the model. In addition to standard demographic statistics such as age, gender and year of diagnosis, we checked which drugs were used by the patients within each of the indication groups. These drugs included drug substances, standard drug combinations and drug classes within the ATC groups R01 and R03, but also other drugs often used by elderly people. As can be seen from the example in Table 1, some drugs are clearly used much more often by one of the groups (e.g. almost half of the patients with an ICPC code for COPD were using tiotropium (48.7%)), while for other drugs the difference is smaller, or they are used less overall.
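The per-group usage counts behind a table such as Table 1 can be sketched as follows. This is a minimal Python illustration, not the SAS code used for the paper; the function name `usage_by_group`, the input layout (a patient-to-group mapping and a list of patient/ATC pairs) and the toy data are all assumptions made for the example.

```python
from collections import defaultdict


def usage_by_group(patients, dispensings):
    """Count, per ATC code and indication group, how many patients used the drug.

    patients:    dict mapping patient id -> indication group (hypothetical layout)
    dispensings: iterable of (patient id, ATC code) pairs (hypothetical layout)
    Returns {atc: {group: (n_users, percentage of the group)}}.
    """
    # Denominator: number of patients in each indication group.
    group_sizes = defaultdict(int)
    for group in patients.values():
        group_sizes[group] += 1

    # Numerator: distinct patients per ATC code and group
    # (repeated dispensings to the same patient count once).
    users = defaultdict(lambda: defaultdict(set))
    for patient_id, atc in dispensings:
        if patient_id in patients:  # keep linked patients only
            users[atc][patients[patient_id]].add(patient_id)

    table = {}
    for atc, per_group in users.items():
        table[atc] = {}
        for group, size in group_sizes.items():
            n_users = len(per_group.get(group, ()))
            table[atc][group] = (n_users, 100.0 * n_users / size)
    return table


# Toy data, invented for illustration only.
toy_patients = {1: "COPD", 2: "COPD", 3: "asthma", 4: "asthma"}
toy_dispensings = [
    (1, "R03BB04"), (1, "R03BB04"),  # same patient twice: counted once
    (2, "R03DC03"), (3, "R03DC03"),
    (9, "R03BB04"),                  # patient without GP record: ignored
]
toy_table = usage_by_group(toy_patients, toy_dispensings)
for atc, row in sorted(toy_table.items()):
    print(atc, row)
```

Counting distinct patients rather than dispensings matches the table footnote, which reports the number and percentage of patients within each indication group using the drug.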

Table 1: An overview of the summary statistics for a few of the tested drugs

WHO ATC   Drug name                Asthma       COPD           Chronic Allergies   Other Lung diseases
R03BB04   TIOTROPIUM   N (%)       601 (3.8%)   3886 (48.7%)   285 (3.2%)          710 (4.3%)
R03DC03                N (%)       590 (3.7%)   102 (1.3%)     204 (2.3%)          166 (1.0%)
R03DA04                N (%)       27 (0.2%)    122 (1.5%)     16 (0.2%)           40 (0.2%)

N (%): Number and percentage of patients within the indication group using the drug

LOGISTIC REGRESSION

The SAS procedure PROC LOGISTIC was used to perform the logistic regression modelling. The outcome variable was the original indication (as recorded by the GP), and the explanatory variables were age category, gender, all drugs and drug classes that proved to be relevant in the exploratory analysis, and other variables such as duration of use and prescriber. Because there were 2 different indications plus a group of patients with neither indication, we had to build two models: one for the indication COPD (yes/no) and one for the indication asthma (yes/no). This is because the logistic modelling was performed on a binary outcome variable.

    /* DESCENDING reverses the response ordering so that the */
    /* higher-ordered level (COPD = yes) is the modeled event */
    proc logistic data=lngWrk descending outest=TmpEst outmodel=data.model1;
       class agecat gender;
       /* backward elimination: a variable stays only if p <= 0.001 */
       model COPD = AgeCat gender Med1 Med2 Med3 … / selection=backward slstay=0.001;
       output out=TmpPred pred=pred1;
    run;

A detailed description of all options available for PROC LOGISTIC can be found on the SAS support site. The model dataset stored with the OUTMODEL= option can later be applied to a new dataset to obtain the predicted probability of the outcome (yes/no). For this model we used backward selection, in which variables were retained only if they were significant at p<=0.001. We could use such a strict significance threshold because the number of observations (N=21187) was very high compared to clinical research.

THE C-STATISTIC

Because we used individual drugs as well as drug classes and drug combinations, a number of the variables were correlated. Based on the results of the models for several consecutive years, the set of input variables for PROC LOGISTIC was adapted to obtain the best and simplest model. While doing so, the c-statistic of the model was closely monitored to keep the model as strong as possible.

The c-statistic measures how well the predictions discriminate between patients with and without the indication: of all pairs consisting of one patient with and one patient without the indication, it is the proportion in which the patient with the indication received the higher predicted probability (ties count as half). A c-statistic of 0.5 corresponds to the model randomly predicting the response, and a value of 1 corresponds to the model perfectly discriminating the response. A c-statistic of 0.8 or higher is generally accepted to be good. A description of this statistic can be found on the SAS support pages: https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_logistic_sect042.htm

For the COPD model we obtained a c-statistic of 0.9, which is quite high in model building. The asthma model was more difficult: there we obtained a c-statistic of 0.78. However, as we used real-world data without any further restrictions, this was accepted in combination with the COPD model.

Besides the c-statistic, we considered the validity and the number of explanatory variables in the model. The aim was to limit the number so that the model was simple and easy to use. Explanatory variables that were closely related, for example because a drug and its drug class were both in the model, were examined and the better of the two was kept. For this purpose we also compared the results of logistic regression models across consecutive years.
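The pairwise definition of the c-statistic can be made concrete with a small Python sketch. It is an illustration of the concordance calculation only, not the routine SAS uses internally; the function name `c_statistic` and the toy values are assumptions for the example.

```python
def c_statistic(outcomes, predictions):
    """Concordance (c) statistic for a binary outcome.

    outcomes:    sequence of 0/1 values (1 = patient has the indication)
    predictions: predicted probabilities, in the same order
    """
    with_indication = [p for y, p in zip(outcomes, predictions) if y == 1]
    without_indication = [p for y, p in zip(outcomes, predictions) if y == 0]
    if not with_indication or not without_indication:
        raise ValueError("need at least one patient in each outcome group")

    concordant = 0
    tied = 0
    # Compare every patient with the indication to every patient without it.
    for p_yes in with_indication:
        for p_no in without_indication:
            if p_yes > p_no:
                concordant += 1
            elif p_yes == p_no:
                tied += 1

    n_pairs = len(with_indication) * len(without_indication)
    return (concordant + 0.5 * tied) / n_pairs


# Toy values, invented for illustration only.
perfect = c_statistic([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
uninformative = c_statistic([1, 0], [0.5, 0.5])
partial = c_statistic([1, 1, 0, 0], [0.9, 0.3, 0.4, 0.1])
print(perfect, uninformative, partial)
```

The double loop is quadratic in the number of patients, which is fine for a demonstration but slow for large datasets; a rank-based calculation gives the same value more efficiently.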

THE CUT-OFF VALUE

Once the model is satisfactory, both in terms of the c-statistic and in terms of the number and validity of the variables, the resulting model dataset can be used to obtain, for each patient, a predicted probability of having the indication COPD and of having the indication asthma. Below is the code for the prediction of COPD. The input dataset must contain the same variables, with the same names and formats, as those used for model building.

    proc logistic inmodel=data.model1;
       score data=InputData out=plong(rename=(P_1=pcopd) drop=P_0);
    run;
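Conceptually, scoring applies the stored coefficients to each patient's variables and converts the resulting linear predictor into a probability with the logistic function. The Python sketch below shows only that arithmetic. The variable names, the intercept and the coefficient values are made up for illustration and are not the fitted model from this paper, and the way SAS codes CLASS variables is not reproduced here.

```python
import math


def predicted_probability(intercept, coefficients, patient):
    """Logistic-model probability for one patient.

    coefficients: dict mapping variable name -> coefficient (hypothetical values)
    patient:      dict mapping variable name -> value; missing variables count as 0
    """
    linear_predictor = intercept + sum(
        coef * patient.get(name, 0) for name, coef in coefficients.items()
    )
    # Logistic function: maps any real number to a value between 0 and 1.
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical coefficients and patient, invented for illustration only.
toy_intercept = -2.0
toy_coefficients = {"age_40_plus": 1.2, "tiotropium": 2.0}
toy_patient = {"age_40_plus": 1, "tiotropium": 1}

p_toy = predicted_probability(toy_intercept, toy_coefficients, toy_patient)
print(round(p_toy, 3))
```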

As a result of the scoring step, each row in the input dataset gets additional variables that give the predicted probability of being positive (P_1) or negative (P_0). In this case, P_1 is the probability that the patient had a COPD indication from the GP. This is a value between 0 and 1, so a cut-off point is needed to actually decide whether the patient is categorized as having the indication or not. Because there were two models, the predicted probabilities for both COPD and asthma were added to the dataset, and each cut-off value had to be chosen with regard to the other indication. The best cut-off values were chosen based on the sensitivity and specificity, and also on the positive predictive value (PPV) and negative predictive value (NPV). These measures are described in many statistical references. The total number of patients and the proportions of asthma and COPD patients were checked against published results and against the proportions in the GP Database, and proved to be comparable and representative.
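The four measures used for choosing a cut-off, and the way two probabilities might be turned into a single label, can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: the function names and toy values are invented, and the rule used in `assign_label` when both probabilities exceed their cut-offs (take the larger margin above the cut-off) is purely an assumption, since the paper does not state how such patients were handled.

```python
def classification_metrics(outcomes, probabilities, cutoff):
    """Sensitivity, specificity, PPV and NPV at a given cut-off.

    outcomes:      sequence of 0/1 values (1 = GP recorded the indication)
    probabilities: predicted probabilities, in the same order
    """
    tp = fp = tn = fn = 0
    for actual, prob in zip(outcomes, probabilities):
        predicted_positive = prob >= cutoff
        if predicted_positive and actual == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif actual == 1:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def assign_label(p_copd, p_asthma, cut_copd, cut_asthma):
    """Combine both model outputs into one label per patient."""
    is_copd = p_copd >= cut_copd
    is_asthma = p_asthma >= cut_asthma
    if is_copd and is_asthma:
        # ASSUMPTION (not from the paper): prefer the larger margin above the cut-off.
        return "COPD" if (p_copd - cut_copd) >= (p_asthma - cut_asthma) else "asthma"
    if is_copd:
        return "COPD"
    if is_asthma:
        return "asthma"
    return "other respiratory disease"


# Toy values, invented for illustration only.
toy_outcomes = [1, 1, 0, 0, 1]
toy_probabilities = [0.9, 0.6, 0.4, 0.7, 0.2]
toy_metrics = classification_metrics(toy_outcomes, toy_probabilities, cutoff=0.5)
print(toy_metrics)
print(assign_label(0.8, 0.3, 0.5, 0.5))
```

Evaluating `classification_metrics` over a range of cut-off values for each model shows the trade-off between the measures, which is the basis on which a cut-off can then be chosen.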

CONCLUSION

The method described above for estimating the indication based on medication profiles proved to be useful. The final model for the indication COPD is based on age, gender, duration of medication use and 5 individual or grouped medications. The model for the indication asthma is based on age, duration of medication use and 5 individual or grouped medications. Each patient who receives a drug for respiratory disease (ATC code R03) in a given year is labeled as having asthma, COPD or another respiratory disease. This model is used for market analysis for new drugs in the COPD/asthma market. Since the drug market is changing and new drugs keep emerging, the model is rebuilt yearly so that it remains accurate.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

The PHARMO Institute
Van Deventerlaan 30-40
3528 AE Utrecht
The Netherlands
T: +31 30 7440 800
E: [email protected]
