University of South Australia

School of Computer and Information Science
Bachelor of Software Engineering

Discover Patterns in Adverse Drug Reaction

Name: Ernst J Joham

ID Number: 10005126
SUPERVISORS: DR JIUYONG LI, DR JAN STANEK

7TH AUGUST 2009

ABSTRACT

This research will use medical data to investigate and discover patterns in adverse drug reactions through data mining. Wilson, Thabane and Holbrook (2003) define data mining as the extraction of valid, previously unknown and actionable information from databases. According to Furey (2005), 'each year 2.2 million Americans suffer serious adverse reactions to drugs', which are referred to as Adverse Drug Reactions (ADRs). The World Health Organization (2002) overview of adverse events clearly highlights their importance and describes these events as fatal, life-threatening, permanently or significantly disabling, or requiring or prolonging hospitalisation. By using data mining to discover patterns involving factors such as age, height and weight in combination with certain conditions, or the taking of different drugs together, the outcomes that cause adverse events can be identified. The purpose of the research is to attempt to discover patterns through data mining on a far from ideal data set that contains noise and missing values. Two core questions are explored: (1) is it possible to discover patterns in sparse data sets? and (2) what patterns can be identified through data mining for ADRs? This research project seeks answers to these questions using pre-recorded data, which provides real-world evidence for detecting adverse drug reactions. An interpretative quantitative methodology will be used. The research involves sorting through approximately twelve thousand existing records and selecting the relevant information. The R statistical package will be used to find patterns and interpret commonalities. R (R Project for Statistical Computing) is an open source package with functional language capabilities that allows graphical display and statistical exploration of data sets. Once the results are obtained, an in-depth analysis and interpretation of the data will take place. The conclusion to the research will determine whether a far from ideal data set can be mined with techniques that are more suitable for medical data sets.

DECLARATION

I declare the following to be my own work, unless otherwise referenced, as defined by the University’s policy on plagiarism.

Ernst J Joham

TABLE OF CONTENTS

1. INTRODUCTION
   1.1 BACKGROUND
   1.2 RESEARCH OBJECTIVE AND STUDY QUESTIONS
2. LITERATURE REVIEW
3. METHODOLOGY
   3.1 DATASET
   3.2 RESEARCH PROCESS
   3.3 DATA MINING TOOL
   3.4 ALGORITHMS
4. MEDICAL DATA MINING
   4.1 INTRODUCTION
   4.2 ISSUES
   4.3 DATA QUALITY
5. RESULTS
   5.1 DATA VISUALISATION
   5.2 LOGISTIC REGRESSION
   5.3 DECISION TREE
   5.4 RISK PATTERN ALGORITHM
6. DISCUSSION
REFERENCES
APPENDIX A: Results for the Logistic Regression Technique
APPENDIX B: Decision Tree Result
APPENDIX C: Representation of the Resulting Risk Pattern Algorithm

1. INTRODUCTION

1.1 Background

Discovering patterns in medical data sets remains difficult and challenging, but it is also very rewarding (Roddick & Graco 2003). Compared with other fields, there are far more constraints and issues that limit the way data mining can be undertaken on medical data sets; techniques that succeed on medical data are therefore likely to work on most other data sets as well. Some of the issues facing medical data are the way the data is collected, the accuracy of the data, and the ethical, legal and social issues that come with patient records (Cios & Moore 2002).

The World Health Organization (2002) reports that in some countries more than 10% of hospital admissions are due to ADRs. This growing problem of drug-related morbidity and mortality places a high financial burden on hospitals and needs to be addressed through monitoring systems and other alternatives.

Data mining can be one of these alternatives. By following a data mining process and applying suitable techniques to extract patterns from medical data sets, it can help identify the causes of adverse events that are life-threatening or that prolong hospitalisation.

Data mining techniques have improved considerably since the field began, helped by the introduction of databases, but a database does not benefit health professionals until its contents are turned into useful information. By using effective data mining tools and algorithms and a step-by-step data mining process, it is possible to produce new and useful information from a data set (Wilson, Thabane & Holbrook 2003).

This research attempts to explore the use of data mining techniques to discover patterns in medical data. There are many issues that make mining medical data difficult, and overcoming this complexity is important. Medical data sets push data mining techniques and technologies to their limits (Roddick & Graco 2003), and this aspect will test the effectiveness of the various algorithms used and of the evaluation of their results.

1.2 Research Objective and Study Questions

The aim of this research is to use data mining methods in an attempt to produce relevant results from real-world data. The interpretation of the results will determine whether data sets that face issues and constraints such as noise, incompleteness and a limited set of attributes can still produce patterns of interest.

The following research questions for this thesis will be addressed:

(1) Is it possible to discover patterns in sparse data sets?

(2) What patterns can be identified through data mining for ADRs?

2. LITERATURE REVIEW

With the growth of data mining and the search for informative patterns in data sets, it is not surprising that more research is needed into data quality and into effective data mining algorithms for detecting interesting relationships within data. There are still relatively few publications and little research on data mining, especially for medical data sets with noise and missing values. Several studies have focused on the problems encountered with such data sets and on the best techniques to use when data mining medical applications. For example, Cios & Moore (2002) address the difficulty and constraints of collecting medical data to mine, and the technical and social reasons behind missing values in a data set. A study by Brown & Kros (2003) focuses further on the impact of missing data and on how existing methods can help with the problems it causes. They categorise methods for dealing with missing data into:

• Use complete data only
• Delete selected cases or variables
• Data imputation
• Model-based approaches

Before any of these methods can be applied to a data set, the analyst must understand each type of missing value; only then can a decision be made on how to address them (Brown & Kros 2003). Missing values can be categorised as data missing at random, data missing completely at random, non-ignorable missing data, and outliers treated as missing data (Brown & Kros 2003).
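To make the first and third of these categories concrete, the following is a minimal R sketch contrasting complete-case analysis with simple mean imputation. The small data frame and its column names are invented for illustration and are not taken from the study's data set.

# Hypothetical example data frame with missing values; the column names are
# illustrative only and do not come from the study's data set.
adr_example <- data.frame(
  age_days  = c(350, NA, 4200, 900, NA, 6100),
  recovered = c(1, 0, 1, NA, 1, 0)
)

# 1. Use complete data only: keep rows with no missing values.
complete_cases <- na.omit(adr_example)

# 2. Simple data imputation: replace missing ages with the column mean.
imputed <- adr_example
imputed$age_days[is.na(imputed$age_days)] <- mean(adr_example$age_days, na.rm = TRUE)

nrow(complete_cases)   # fewer records are retained by complete-case analysis
summary(imputed)       # age_days no longer contains NA values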

Another approach to handling missing values is conceptual reconstruction, where only the conceptual aspects of the data are mined from the incomplete data set (Aggarwal & Parthasarathy 2001). They further argue that some methods, such as data imputation, are prone to errors. Aggarwal & Parthasarathy (2001) give an example, shown in Table 1, of data sets with 20% to 40% of their entries missing; using the conceptual reconstruction method, the first two data sets retained accuracies of 92% or more relative to the original data sets.

Dataset              Cao     CAM (20%)   CAM (40%)
Musk (1)             76.2    0.943       0.92
Musk (2)             95.0    0.96        0.945
Letter Recognition   84.9    0.825       0.62

Table 1: Conceptually reconstructed data sets (Aggarwal & Parthasarathy 2001)

Other studies have gone beyond the impact of missing values and explored the impact of noise and how it can influence the output of models. Zhu & Wu (2004) divide noise into class noise and attribute noise. Their research concentrated on attribute noise, as class noise turned out to be much cleaner than first thought (Zhu & Wu 2004). Attribute noise is more difficult to handle and includes:

(1) incorrect attribute values
(2) missing or "don't know" attribute values
(3) incomplete attributes or "don't care" values

Some researchers have focused on data cleansing tools to help eliminate noise, but these can only achieve a reasonable result (Zhu & Wu 2004). Noise handling methods can help to eliminate noise in data sets. Hulse et al (2007) introduce the Pairwise Attribute Noise Detection Algorithm (PANDA), which can detect attribute noise within data sets, allowing noisy data to be removed only if required. The other algorithm introduced is the distance-based outlier detection technique (DM), which is similar but not as good as PANDA at detecting attribute noise. Once noise is detected it can be removed; if it is left in, it may lead to a low-quality set of hypotheses. Table 2 displays the results on one data set using PANDA and DM; PANDA identifies more noise instances.

Instance category    1–10          11–20         21–30         1–30
                     PANDA   DM    PANDA   DM    PANDA   DM    PANDA   DM
Noise                6       6     7       4     8       8     21      18
Outliers             2       4     2       6     1       2     5       12
Exceptions           2       0     1       0     1       0     4       0
Typical              0       0     0       0     0       0     0       0

Table 2: The 30 most suspicious instances from 10% of a data set (Hulse et al 2007)
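As an illustration of the general idea behind distance-based detection (a toy sketch, not the PANDA or DM implementations evaluated by Hulse et al), instances can be ranked in R by their average distance to their nearest neighbours:

# Toy distance-based outlier scoring: instances far from most others are flagged.
set.seed(1)
x <- data.frame(a = c(rnorm(28), 8, 9),     # two injected extreme values
                b = c(rnorm(28), -7, 10))

d <- as.matrix(dist(scale(x)))              # pairwise distances on standardised data
score <- apply(d, 1, function(row) mean(sort(row)[2:6]))   # mean distance to 5 nearest neighbours

head(order(score, decreasing = TRUE))       # indices of the most suspicious instances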

Several researchers have focused on techniques that have built-in mechanisms for handling noise and missing values, and on which techniques are most appropriate for medical applications. Lavrač (1999) reviews a number of techniques that have been applied to, and are well suited to, medical data sets, including decision trees, logic programs, K-nearest neighbour, and Bayesian classifiers. Lavrač (1999) describes these as 'intelligent data analysis techniques in the extraction of knowledge, regularities, trends and representative cases from patient data stored in medical records'. Lee et al (2000) believe that techniques from which users can easily extract specific knowledge are the key to making medical decisions, and studies have concluded that Bayesian networks and decision trees are the primary techniques applied in medical information systems. Fayyad et al (cited in Lee et al 1999, p. 85) indicate that the diverse fields of knowledge discovery draw upon the main components and methods shown in Figure 1.

Figure 1: Main components of KDD and DM and their relationship (Lee et al 1999)

A study of drug discovery reported by Obenshain (2004) showed that neural networks performed better than logistic regression, but the decision tree did better at identifying the active compounds most likely to have biological activity.

Other researchers in medical data mining have focused on the data mining process itself, which includes dealing with missing values and noise and choosing the techniques for knowledge discovery. Cios & Moore (2002) acknowledge that it is important for medical data mining to follow a procedure in order to succeed at knowledge discovery. This can be a nine-step process or the DMKD process, which adds several steps to the CRISP-DM model and has been applied to several medical problem domains. Figure 2 shows how the process model works and how it can be semi-automated for medical applications (Cios & Moore 2002).

Figure 2 DMKD process model (Cios & Moore 2002)

Wang (2008) argues that most process models focus on the results rather than on gaining new knowledge. Medical data mining applications are expected to discover new knowledge and should follow a five-stage data mining development cycle: planning tasks, developing data mining hypotheses, preparing data, selecting data mining tools, and evaluating data mining results.

Current literature has focused on ways to improve data sets by applying methods for handling missing values and noise, but not many of these methods have been applied to medical data sets. The same is true of the mining techniques: tests have been done, yet there is still room for further research into which techniques work best when mining real-world medical data sets. This study will further investigate ways to achieve a successful outcome in discovering patterns in a medical data set. The CRISP-DM data mining process will be followed, and the R statistical package will be used for handling noise and missing values. Zhu & Wu (2004) indicate that powerful, cost-effective tools can greatly assist in the data cleansing process and may help to achieve the data quality level needed for data mining. A number of algorithms will also be tested to see how well they perform on a data set that contains noise.

3. METHODOLOGY

3.1. Dataset

The study is based on a pre-recorded data set provided by external clients who remain anonymous. Because of the confidentiality, ethical and legal issues surrounding the data, sensitive information had to be removed before we were able to view and use it. A total of 1286 records of patients with ADRs are used for the data mining project, covering data recorded from 1996 to 2008.

The data set included patient characteristics, drugs, and treatment for adverse drug reactions. The information made available in the data set includes:

• Date when the patient was admitted for the ADR
• Age, recorded in days
• Brand, which is the generic drug for the main drug
• Drug that was given to the patient
• Route of administration
• Probability of the drug being the cause of the ADR
• Severity of the ADR
• Recovered or not
• UR number, which includes the patient's details
• ATC (Anatomical Therapeutic Chemical), a classification system for drugs

It is worth noting that, due to the limited attributes and the incomplete and missing information, only attributes that could provide some sort of interesting relationship when combined were chosen for the research. The research process and the attributes kept for analysis are outlined in the following section.

3.2. Research process

The research follows the CRISP-DM data mining method, in which the consortium defines a six-step data mining process, shown in Figure 3. The first five phases are discussed below in terms of how they were used in the research into discovering patterns in ADRs.

Figure 3: CRISP-DM – six step process model (CRISP-DM, 2000)

Business understanding: in this phase the project was reviewed by the client, supervisor and team member to decide which direction to take and what the goal of the project was. The main aim of this research is to test techniques to see whether patterns can be found in a sparse data set.

Data understanding: at this stage the data set was reviewed using the Rattle tool, which gives a summary of the attributes as a whole and allows each attribute to be queried separately and visualised in various formats, to aid the decision on which attributes to keep for further analysis. Since the attributes in this data set were limited, a few attributes stood out and were carried forward to the next phase.

Data preparation: here the data went through two further processes, data cleaning and data transformation, both done in the R tool because its scripting makes it easy to scan the data set, correct mistakes and transform the data. The objective of this phase was to decide on the structure of the data for the next phase. Five attributes were chosen: Date, Age in days, Route, Recovered, and the ATC code for the drug. These attributes were chosen because they were considered most likely to give a good result in modelling. Table 3 shows the attribute abbreviations and the values assigned to them; a short R sketch of this recoding follows the table.

Variable                                                                     Abbreviation

Date when the patient was admitted to hospital for the ADR                  ADRDATE
(October-March = 1, April-September = 0)

Age of the patient, categorised into groups containing equal                AGE
numbers of records (0-2 years = 1, 2-5 years = 2, 5-11 years = 3,
11-16 years = 4, above 16 years = 5)

Route of administration of the medication that caused the ADR,              ROUTE
either oral or intravenous (Oral = 1, Intravenous = 0)

Recovered from the ADR or not (Recovered = 0, Not recovered = 1)             RECOV

Whether the drug given to the patient is classified as an antibiotic        ATC
(Antibiotic = 1, Not antibiotic = 0)

Table 3: The attributes used for modelling and the binary values assigned to them.
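The following is a minimal R sketch of the kind of recoding summarised in Table 3. The raw column names and example values are assumptions made for illustration; the actual field names of the supplied data set are not reproduced in this report.

# Minimal sketch of the recoding described in Table 3.
# 'raw' stands in for the cleaned data set; its column names here are assumed.
raw <- data.frame(
  admit_date = as.Date(c("2003-11-02", "2005-06-17")),
  age_days   = c(800, 4500),
  route      = c("Oral", "Intravenous"),
  recovered  = c("Yes", "No"),
  antibiotic = c("Yes", "No")
)

month <- as.integer(format(raw$admit_date, "%m"))
adr <- data.frame(
  ADRDATE     = ifelse(month >= 10 | month <= 3, 1, 0),          # October-March = 1
  AGE         = cut(raw$age_days / 365.25,                       # five age groups of Table 3
                    breaks = c(0, 2, 5, 11, 16, Inf),
                    labels = 1:5, right = FALSE),
  ROUTE       = ifelse(raw$route == "Oral", 1, 0),               # Oral = 1, Intravenous = 0
  RECOV       = ifelse(raw$recovered == "Yes", 0, 1),            # Recovered = 0, Not recovered = 1
  ANTIBIOTICS = ifelse(raw$antibiotic == "Yes", 1, 0)            # Antibiotic = 1, otherwise 0
)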

Modelling: this phase involved selecting the most appropriate algorithms for the research, which for this study were logistic regression, decision tree, and risk pattern mining.

Evaluation: in this final phase the models were interpreted and the results were used to determine whether the project objectives had been met. Due to time constraints, the results of the three techniques were used to answer the project objectives and the first three phases were completed only once.

3.3. Data mining tool

The data mining tools chosen for the project are R, a package for statistical computing and graphics with programming capabilities, and Rattle, a graphical user interface that runs on top of R. These tools run on a variety of platforms including UNIX, Windows and MacOS, and R also provides bindings to other languages and technologies such as Python, XML, SOAP and Perl. Both packages are free software and provide a sophisticated way of performing data mining. A screenshot of the R and Rattle tools is shown in Figure 4.

Figure 4: Screenshot of the R and Rattle data mining tools.

Rattle is used by many government and private organisations around the world, including the Australian Taxation Office, and is being adopted by a number of colleges and universities for teaching data mining.

R and Rattle combined provide a good set of data mining algorithms to choose from for modelling, including clustering, association rules, linear models, trees, and neural networks. Besides the models, there are a variety of ways to visualise the data, such as histograms and plots, and data from almost any source can be loaded and used.

Most of the data preparation was done in R using its scripting language, and the decision tree and logistic regression were modelled using Rattle. The only other algorithm used for the project was risk pattern mining; the software for this algorithm was run on a Linux 9.0 platform.
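As a small illustration of this workflow, the sketch below loads a data set into R and starts the Rattle interface. The file name is a placeholder rather than the actual data file used in the study.

# Load the (hypothetical) ADR data set and start the Rattle GUI.
# install.packages("rattle")   # uncomment if Rattle is not yet installed
library(rattle)

adr <- read.csv("adr_records.csv", stringsAsFactors = TRUE)   # placeholder file name
str(adr)       # inspect attribute types and missing values

rattle()       # opens the Rattle window; the data frame can then be chosen
               # as an "R Dataset" on the Data tab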

3.4. Algorithms

The data mining techniques adopted for the project were logistic regression, decision tree, and the risk pattern mining algorithm. Each of these techniques provides its own way of analysing the medical data set that was supplied.

Decision trees and logistic regression have been applied across a wide range of applications, including medical ones. Ji et al (2009, p. 2), reporting Andrews' study, emphasise the benefits of the logistic regression and decision tree methods for 'identifying commonalities and differences in medical database variables'. The risk pattern algorithm has also been applied to medical data, for patients on ACE inhibitors who had an allergic event (Li et al 2005). As this project explores the use of a medical data set to detect adverse drug reactions, it was important to use techniques that are reliable and have been proven to work in similar studies.

The differences between these techniques are as follows. Logistic regression is appropriate when variables have two possible values (0, 1) or multiple categories; this makes it useful for this study in determining whether the patients' medical details have any association with not recovering from adverse drug reactions. The decision tree is also well suited to binary values but can be modelled with more than two values, and it is easily understood because of its tree-like structure, whose leaf nodes can readily be analysed to determine the patterns found. The last algorithm 'makes use of the anti-monotone property to efficiently prune the search space' (Li et al 2005). Optimal risk pattern mining returns the highest relative risk patterns among the patterns discovered; the model is easily interpreted and shows the odds ratio, risk ratio and the fields associated with each pattern.

4. MEDICAL DATA MINING

4.1. Introduction

Data mining for medical applications, as for business applications, is about finding unknown patterns. For example, vast amounts of information are stored in the course of treating patients, and by using appropriate tools useful information can be retrieved from those records. Wang & Wang (2008) separate medical data mining into medical diagnosis and drug development; for medical diagnosis, data mining can improve medical treatment when new knowledge is discovered. Lee et al (2000) report that medical computer applications have been developing for about 40 years, but even with the improvements in computer technology it is still difficult to produce information that is considered to be of value, because of the complexity of medical information systems (Lee et al 2000). What makes medical information more complex to mine are the issues that come with medical records, from the data collection process through to the time when the data miner wants to use them. In the future, as more research is done on mining medical information systems and new tools are developed with improvements in data cleansing and techniques, knowledge discovery will become easier and more reliable for medical applications.

4.2. Issues

As discussed previously, medical data has issues which limit the way data mining can be done on medical information systems. The first issue is how medical data is collected. Medical data can come from physicians' notes, interviews with patients, and images, and all this information needs to be stored and put into a comprehensible format (Cios & Moore 2002). Doing this for medical data is far more difficult and complex than for other data sets, as much more verbal description is used, meanings can be similar across many records in the data set, and information which needs to be included may not be collected or entered into the records.

Ethical and legal issues are further concerns for medical data miners. Private patient information cannot be used, and some information needs to be removed before data mining may begin; this places a limitation on the mining, and sometimes it may be necessary to gain access to the original data. Gaining that access can take a lot of time, as policies must be put in place to protect patients' information, and this has to be considered before undertaking a medical data mining project.

4.3. Data Quality

Data quality is another concern, and it takes up a lot of time and budget in the data mining process (Zhu et al 2007). It is one of the key issues for data mining, and even with all the study into the importance of good quality data it is still impossible to find a data set that does not contain any errors (Hand et al 2000).

Cios & Moore (2002) describe medical data as unique for data mining because of the features that make it more demanding to analyse. These features include the way medical data is collected, and the fact that medical data is constantly updated: results may have to be repeated and updated for a patient, so over time some data becomes redundant, insignificant and inconsistent (Cios & Moore 2002).

The other main problem with medical information is that it is often incomplete: data was either accidentally not entered or purposely withheld for ethical or other reasons. Many mistakes are also found in medical data through ambiguity of definition and distortion of the data (Hand et al 2000).

To improve data quality, changes would have to be made to the way data is collected and recorded. At present it is still impossible to eliminate noise from medical data. Hand (2000) goes one step further, saying that poor data quality can lead to most of the patterns discovered being of no real interest.

5. RESULTS

5.1. Data Visualisation

In all, five categorical variables were used, as described in Section 3.2 (Table 3). Figure 5 shows that the distribution of ADRDATE is quite evenly split, with around 48% of records at 0 and 52% at 1 and no missing values. Figure 5 also shows the distribution of AGEDAYS; the X axis indicates the five age groups (1 to 5) and the number of patients in each group, again with no missing values. Next is the histogram for route of administration, which was split 45% (0) and 55% (1), with 570 entries missing. For RECOV, 90% of patients recovered (0) and 10% did not (1), with 344 values missing. The final variable, ANTIBIOTICS, shows 45% (0) and 55% (1), with 191 values not recorded.

Figure 5: Histograms showing the distribution of the five input variables used (ADRDATE, AGEDAYS, ROUTE, RECOV, and ANTIBIOTICS).
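The distributions and missing-value counts described above can be reproduced with a few lines of R. The sketch below assumes a data frame named adr holding the five recoded variables, the same assumed object as in the earlier data preparation sketch.

# Bar plots of the five recoded inputs plus a count of missing values per variable.
vars <- c("ADRDATE", "AGE", "ROUTE", "RECOV", "ANTIBIOTICS")

op <- par(mfrow = c(2, 3))                 # lay the five plots out in a grid
for (v in vars) {
  barplot(table(adr[[v]]), main = v,       # distribution of the recorded values
          xlab = v, ylab = "Patients")
}
par(op)

sapply(adr[vars], function(x) sum(is.na(x)))   # missing values per variable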

5.2. Logistic Regression

This is the first technique used in the study. Logistic regression was used to analyse whether the patient information in the data set is associated with the likelihood of patients not recovering from adverse drug reactions. The advantage of using regression analysis for this project is that it directly examines associations with the likelihood of not recovering from an adverse drug reaction, which is the factor of interest in this study.

The form of the logistic equation is as follows:

Logit (p) = b0 + b1x1 + b2x2 + b3x3 +...+ bkxk

Here p is the probability of the characteristic of interest, b0 is the intercept, x1 is ADRDATE, x2 is AGE, and so on.

The coefficients bi show the increase or decrease in logit(p) associated with each variable.
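A minimal R sketch of fitting such a model is given below. The exact call used to produce the output in Appendix A is not stated in this report, so the data frame and options here are assumptions; the variable names follow those printed in Appendix A.

# Binomial logistic regression for not recovering from an ADR (RECOV = 1).
# 'adr' is an assumed data frame whose columns match the names in Appendix A.
fit <- glm(RECOV ~ ADRDATE + AGEDAYS + ROUTE + ANTIBIOTICS,
           family = binomial(link = "logit"),
           data = adr)          # rows with missing values are dropped by default

summary(fit)                    # coefficient table, as reproduced in Appendix A
anova(fit, test = "Chisq")      # sequential analysis of deviance
exp(coef(fit))                  # coefficients expressed as odds ratios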

The logistic regression results are shown in Appendix A.

From the results, one of the coefficients, 0.002067 for AGE, is not significantly different from zero; this variable therefore contributes little to the prediction of patients not recovering from ADRs.

Also, looking at the analysis of deviance, ADRDATE and ANTIBIOTICS account for the largest changes in deviance, making these two attributes the more important ones.

The coefficients associated with ADRDATE (0.136312) and ROUTE (0.059532) are positive, indicating that not recovering from an ADR is more likely when the reaction occurs between October and March and when the drug is administered orally. The coefficient for the antibiotics indicator, ATC (-0.181255), is negative, indicating that taking non-antibiotic drugs increases the chance of not recovering, when considered together with the other variables for a patient with an adverse drug reaction.

5.3. Decision Tree

The next technique, the decision tree, was applied to the medical data set to determine whether we could predict factors that contribute to patients not recovering from ADRs. A decision tree is a tree structure that starts from a root and is built up from decision nodes and leaf nodes. Wilson (2008), citing Quinlan's decision tree algorithm, shows how this is accomplished:

• The algorithm operates over a set of training instances, C.
• If all instances in C are in class P, create a node P and stop; otherwise select a feature or attribute F and create a decision node.
• Partition the training instances in C into subsets according to the values of F.
• Apply the algorithm recursively to each of the subsets of C.

The decision tree for this study used the same attributes as the logistic regression model, recoded as follows.

ADRDATE: This variable was split into wet and dry periods. The months from October to March were set to the binary value 1 (dry), and the wet months, April to September, were set to 0.

AGE: This variable was categorised into five groups. Patients aged 0-2 years were set to 1, 2-5 years to 2, 5-11 years to 3, 11-16 years to 4, and above 16 years to 5.

ROUTE: The route of administration was reduced to oral or intravenous, with the binary value set to 1 for oral and 0 for intravenous.

RECOV: Patients either recovered or did not. Recovered = Yes and not recovered = No

ANTIBIOTICS: The ATC classification was used to classify the drugs as antibiotics or not; antibiotics were given the binary value 1 and non-antibiotics the value 0.
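Appendix B shows output in the format produced by R's rpart package, so a plausible way of fitting and printing the tree is sketched below. The exact call and options used in the study are not stated in this report, so this is an assumption; the target variable is taken to be the antibiotics indicator, matching the splits and class proportions visible in Appendix B.

# Classification tree consistent with the structure of the output in Appendix B.
# 'adr' is the assumed recoded data frame from the earlier sketches.
library(rpart)

tree <- rpart(factor(ANTIBIOTICS) ~ ADRDATE + AGE + ROUTE + RECOV,
              data = adr, method = "class")

print(tree)                            # node), split, n, loss, yval, (yprob)
plot(tree); text(tree, use.n = TRUE)   # quick plot of the fitted tree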

The results for the decision tree are shown in Appendix B. The tree used 1035 of the 1286 entities; 248 observations were not included because of excessive missing information.

The root node, which contained the 1035 entities, showed that around 46% of records were non-antibiotics and 54% were antibiotics.

The decision tree based its decisions on all of the variables, which were quite evenly spread across the tree. By following the tree as a series of simple if-else statements, interpretation is straightforward. Two examples from the decision tree are as follows.

The first path involves patients who recovered and patients who did not: node 5 (ADRDATE >= 0.5) covers the months from October to March, node 10 (AGE >= 4.5) covers ages greater than 16 years, node 21 (ROUTE < 0.5) covers patients receiving the drugs intravenously, node 42 (RECOV = Yes) contains 18 patients who recovered while taking non-antibiotic drugs (0), and node 43 (RECOV = No) contains 3 patients who did not recover while taking antibiotic drugs.

The second path involves patients who did not recover: node 3 (AGE < 3.5) covers ages under 11 years, node 6 (ROUTE < 0.5) covers patients receiving the drugs intravenously, and node 12 (RECOV = No) contains 24 patients who did not recover, where the drugs involved (0) were non-antibiotic drugs.

A limitation of using the decision tree for this project is that the most informative splits for defining patients who have not recovered are chosen first, and the tree keeps growing until all the attributes have been used, with no guarantee that the best overall combination of attributes will be chosen. The advantages of the decision tree are that it is easy to interpret because of the way the tree is structured, it does not need much data preparation for the classification to work, and missing values are easily accommodated.

5.4. Risk Pattern Algorithm

This is the newest of the three techniques. The aim of using the risk pattern algorithm is to analyse the risk factors for recovering or not recovering when a patient has an adverse reaction to drugs. Li et al (2005) explain that in medical applications risk patterns often exist in a small part of the population, whereas most algorithms uncover only the more frequent patterns.

The algorithm outlined below shows how one pattern is retained from among the highest risk patterns (Li et al 2005):

1) Set R′ = Ø
2) For each record r in D belonging to class a
3)     Find all patterns in R that are subsets of r
4)     Add the pattern with the highest relative risk to R′
5) Sort all patterns in R′ in decreasing order of relative risk (RR)
6) Return R′

The attributes used (date, age in days, route, recovered, and ATC) are the same attributes, in the same format, as used for the decision tree. The data contains 746 records, with 85 patients who did not recover and 661 patients who recovered from an adverse drug reaction. The full results from the risk pattern algorithm are shown in Appendix C.
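The retention step listed above can be illustrated with a small, self-contained R sketch. This is an invented toy re-implementation, not Li et al's actual code, and the candidate patterns and records are made up for the example.

# Toy re-implementation of the pattern-retention step (illustrative only).
# A pattern is a list holding named attribute values ('cond') and a relative risk ('RR').
patterns <- list(
  list(cond = c(ADRDATE = 1, AGE = 3, ANTIBIOTICS = 0), RR = 2.48),
  list(cond = c(AGE = 3, ANTIBIOTICS = 0),              RR = 2.55),
  list(cond = c(AGE = 4, ROUTE = 1),                    RR = 2.17)
)

# Toy records; RECOV = 1 marks the target class ("not recovered").
D <- data.frame(ADRDATE = c(1, 0, 1), AGE = c(3, 4, 3),
                ROUTE = c(1, 1, 0), ANTIBIOTICS = c(0, 0, 0),
                RECOV = c(1, 1, 0))

covers <- function(p, rec) all(unlist(rec[names(p$cond)]) == p$cond)

kept <- list()
for (i in which(D$RECOV == 1)) {                 # records r in the target class
  matching <- Filter(function(p) covers(p, D[i, ]), patterns)
  if (length(matching) > 0)                      # keep the highest-RR pattern covering r
    kept[[length(kept) + 1]] <- matching[[which.max(sapply(matching, function(p) p$RR))]]
}
kept <- unique(kept)
kept[order(sapply(kept, function(p) p$RR), decreasing = TRUE)]   # sorted by decreasing RR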

Four of the patterns discovered when the risk pattern algorithm was run on the medical data set are shown below; in total, 12 patterns were recorded.

Pattern 1 (Risk Ratio = 2.48):
  Agedays = between 5 and 11 years old
  Adrdate = between October and March
  Antibiotics = No

Pattern 2 (Risk Ratio = 2.55):
  Agedays = between 5 and 11 years old
  Antibiotics = No

Pattern 3 (Risk Ratio = 2.19):
  Adrdate = between October and March
  Agedays = between 11 and 16 years old
  Route = Oral administration

Pattern 4 (Risk Ratio = 2.17):
  Agedays = between 11 and 16 years old
  Route = Oral administration
  Antibiotics = No

The first four patterns appear closely linked: patients aged between 5 and 16 who were not taking antibiotics, or whose drugs were administered orally, show a higher relative risk of not recovering from an ADR. To establish a more precise reason behind these patterns, each pattern would have to be investigated further.
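As a sanity check on how a risk ratio such as the 2.48 reported for Pattern 1 arises, the sketch below computes the relative risk of the 'not recovered' class for records matching the pattern against those that do not, using one common definition of relative risk (pattern cohort versus the rest). The data frame and column names are the assumed recoded ones from the earlier sketches, not the algorithm's actual input format.

# Relative risk of "not recovered" (RECOV == 1) for one candidate pattern.
# Pattern 1: ADRDATE = 1 (October-March), AGE = 3 (5-11 years), ANTIBIOTICS = 0.
in_pattern <- with(adr, ADRDATE == 1 & AGE == 3 & ANTIBIOTICS == 0)

risk_in  <- mean(adr$RECOV[in_pattern]  == 1, na.rm = TRUE)   # risk inside the cohort
risk_out <- mean(adr$RECOV[!in_pattern] == 1, na.rm = TRUE)   # risk outside the cohort

risk_in / risk_out    # relative risk (risk ratio) for the pattern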

6. DISCUSSION

The purpose of the study was to take a real-world medical data set containing more noise and missing values than usual and to try to find patterns in it using a number of techniques. Three techniques were chosen from the many available, based on research into their use in previous medical data mining applications suggesting that they might produce the best results in finding patterns that cause ADRs, and in order to see how well they perform on a real-world medical data set with noise. All three techniques produced similar results and dealt with missing values well through their built-in mechanisms for handling the attributes used; the risk pattern algorithm, for instance, used only 746 records of the initial 1286.

Most of the common mistakes, such as duplicates or entry errors, were handled with the R tool, which is now used by a number of organisations for data mining. The data set had to be transformed, as in its original state it was impossible to use it to find any patterns. The final results of these techniques contained uncertainties, mostly because more attributes, such as patient characteristics and drug treatment, were needed. Even so, the results showed that, despite the limitations of the data set, patterns can still be produced if the data is transformed into values that can be used for data mining while the original information is preserved.

This does not mean that the data set is satisfactory. Data quality is still important, and ways to improve it for medical data sets still need to be addressed. Noise will never be completely eliminated from medical data sets, so tools and techniques will have to be able to deal with it. As Roddick & Graco (2003) note, medical data sets are much harder to data mine, but also more rewarding than any other application, because of their complexity and varying quality, and 'there exists a substantial medical knowledge base which demands a robust collaboration between the data miner and the health professional(s) if useful information is to be extracted'. This research points to useful further work, including studies into tools for medical data sets and ways to handle noise and missing values, so that the techniques most suitable for medical applications can produce patterns that are useful for medical experts.

REFERENCES

Aggarwal, CC & Parthasarathy, S 2001, 'Mining massively incomplete data sets by conceptual reconstruction', ACM, San Francisco, California.

Brown, ML & Kros, JF 2003, 'Data mining and the impact of missing data', Industrial Management & Data Systems, vol. 103, pp. 611-621.

Cios, KJ & Moore, GW 2002, 'Uniqueness of medical data mining', Artificial Intelligence in Medicine, vol. 26, no. 1-2, pp. 1-24.

CRISP_DM 2000, Cross Industry Standard Process for Data Mining, viewed 27 August 2008, .

Hand, DJ 2000, 'Data Mining: New Challenges for Statisticians', Social Science Computer Review, vol. 18, no. 4, November 1, 2000, pp. 442-449.

Hand, DJ, Gordon, B, Kelly, MG & Adams, NM 2000, 'Data Mining for Fun and Profit', Statistical Science, vol. 15, no. 2, pp. 111-126.

Li, J, Fu, AW-c, He, H, Chen, J, Jin, H, McAullay, D, Williams, G, Sparks, R & Kelman, C 2005, 'Mining risk patterns in medical data', ACM, Chicago, Illinois, USA.

Lavrač, N 1999, 'Selected techniques for data mining in medicine', Artificial intelligence in medicine, vol. 16, no. 1, pp. 3-23.

Lee, I-N, Liao, S-C & Embrechts, M 2000, 'Data mining techniques applied to medical information', Medical Informatics & the Internet in Medicine, vol. 25, no. 2, pp. 81-102.

Obenshain, MK 2004, ‘Application of Data Mining Techniques to Healthcare Data’, Infection Control and Hospital Epidemiology, vol.25, no 8, pp. 690-695.

Roddick, JF, Fule, P & Graco, WJ 2003, 'Exploratory medical knowledge discovery: experiences and issues', SIGKDD Explor. Newsl., vol. 5, no. 1, pp. 94-99.

Safety of Medicines 2002, A Guide to Detecting and Reporting Adverse Drug Reaction Why Health Professionals Need to Take Action, WHO publications, viewed 15 April 2008, < http://whqlibdoc.who.int/hq/2002/WHO_EDM_QSM_2002.2.pdf>.

Wang, H & Wang, S 2008, 'Medical knowledge acquisition through data mining', paper presented at the IEEE International Symposium on IT in Medicine and Education (ITME 2008), Xiamen.

Wilson, AM, Thabane, L & Holbrook, A 2003, 'Application of data mining techniques in pharmacovigilance', British Journal of Clinical Pharmacology, vol. 57, no. 2, pp. 127-134.

Zhu, X, Khoshgoftaar, T, Davidson, I & Zhang, S 2007, 'Editorial: Special issue on mining low-quality data', Knowledge and Information Systems, vol. 11, no. 2, pp. 131-136.

Zhu, X & Wu, X 2004, 'Class Noise vs. Attribute Noise: A Quantitative Study', Artificial Intelligence Review, vol. 22, no. 3, pp. 177-210.

Appendix A:

Results for the logistic regression technique

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.901353   0.466304  -4.077 4.55e-05 ***
ADRDATE      0.136312   0.285722   0.477    0.633
AGEDAYS      0.002067   0.115482   0.018    0.986
ROUTE        0.059532   0.290016   0.205    0.837
ANTIBIOTICS -0.181255   0.300150  -0.604    0.546
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 336.17 on 428 degrees of freedom
Residual deviance: 335.49 on 424 degrees of freedom
  (854 observations deleted due to missingness)
AIC: 345.49

Number of Fisher Scoring iterations: 4

==== ANOVA ====
Analysis of Deviance Table
Model: binomial, link: logit

Response: RECOV

Terms added sequentially (first to last)

            Df Deviance Resid. Df Resid. Dev
NULL                          428     336.17
ADRDATE      1     0.22       427     335.95
AGEDAYS      1     0.03       426     335.92
ROUTE        1     0.06       425     335.86
ANTIBIOTICS  1     0.37       424     335.49

Appendix B:

Decision Tree Result

Summary of the rpart model:
n = 1035 (248 observations deleted due to missingness)

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 1035 473 1 (0.4570048 0.5429952)
   2) AGE>=3.5 407 140 0 (0.6560197 0.3439803)
     4) ADRDATE< 0.5 203 61 0 (0.6995074 0.3004926) *
     5) ADRDATE>=0.5 204 79 0 (0.6127451 0.3872549)
      10) AGE>=4.5 100 35 0 (0.6500000 0.3500000)
        20) ROUTE>=0.5 79 27 0 (0.6582278 0.3417722) *
        21) ROUTE< 0.5 21 8 0 (0.6190476 0.3809524)
          42) RECOV=Yes 18 6 0 (0.6666667 0.3333333) *
          43) RECOV=NO 3 1 1 (0.3333333 0.6666667) *
      11) AGE< 4.5 104 44 0 (0.5769231 0.4230769)
        22) ROUTE< 0.5 77 30 0 (0.6103896 0.3896104) *
        23) ROUTE>=0.5 27 13 1 (0.4814815 0.5185185) *
   3) AGE< 3.5 628 206 1 (0.3280255 0.6719745)
     6) ROUTE< 0.5 236 109 1 (0.4618644 0.5381356)
      12) RECOV=NO 24 6 0 (0.7500000 0.2500000)
        24) AGE>=2.5 14 0 0 (1.0000000 0.0000000) *
        25) AGE< 2.5 10 4 1 (0.4000000 0.6000000) *
      13) RECOV=Yes 212 91 1 (0.4292453 0.5707547)
        26) AGE< 2.5 72 36 0 (0.5000000 0.5000000)
          52) ADRDATE< 0.5 37 15 0 (0.5945946 0.4054054) *
          53) ADRDATE>=0.5 35 14 1 (0.4000000 0.6000000)
           106) AGE>=1.5 17 8 0 (0.5294118 0.4705882) *
           107) AGE< 1.5 18 5 1 (0.2777778 0.7222222) *
        27) AGE>=2.5 140 55 1 (0.3928571 0.6071429) *
     7) ROUTE>=0.5 392 97 1 (0.2474490 0.7525510)
      14) ADRDATE>=0.5 180 58 1 (0.3222222 0.6777778)
        28) AGE>=2.5 29 13 1 (0.4482759 0.5517241)
          56) RECOV=NO 3 1 0 (0.6666667 0.3333333) *
          57) RECOV=Yes 26 11 1 (0.4230769 0.5769231) *
        29) AGE< 2.5 151 45 1 (0.2980132 0.7019868)
          58) AGE< 1.5 78 24 1 (0.3076923 0.6923077)
           116) RECOV=NO 2 0 0 (1.0000000 0.0000000) *
           117) RECOV=Yes 76 22 1 (0.2894737 0.7105263) *
          59) AGE>=1.5 73 21 1 (0.2876712 0.7123288) *
      15) ADRDATE< 0.5 212 39 1 (0.1839623 0.8160377) *

Appendix C:

Representation of the Resulting Risk Pattern Algorithm

#The number of data = 746

# 85 in class NO

# 661 in class YES

#Pattern number  length  Odds Ratio  Risk Ratio  Cohort size  class 0 size  class 1 size  field name / field value ...

#Risk patterns for NO

 1  3  3.032  2.485  26  19   7  ADRDATE 1  AGEDAYS 3  ANTIBIOTICS 0   4 2
 6  2  2.034  1.831  56  45  11  AGEDAYS 4  ROUTE 1                    8 6
 7  3  1.941  1.763  58  47  11  ADRDATE 0  AGEDAYS 4  ANTIBIOTICS 0   9 3
 8  2  1.733  1.602  28  23   5  AGEDAYS 1  ANTIBIOTICS 0              7 7
10  3  1.702  1.581  58  48  10  ADRDATE 1  ROUTE 1  ANTIBIOTICS 0     8 6

#The end of the report
