Data Mining the Genetics of Leukemia


Geoff Morton

A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science

Queen’s University Kingston, Ontario, Canada January 2010

Copyright © Geoff Morton, 2010

Abstract

Acute Lymphoblastic Leukemia (ALL) is the most common cancer in children under the age of 15. At present, diagnosis, prognosis and treatment decisions are made based upon blood and bone marrow laboratory testing. With advances in microarray technology it is becoming more feasible to perform genetic assessment of individual patients as well. We used Singular Value Decomposition (SVD) on Illumina SNP, Affymetrix and cDNA gene-expression data and performed aggressive attribute selection using random forests to reduce the number of attributes to a manageable size. We then explored clustering and prediction of patient-specific properties such as disease sub-classification and, especially, clinical outcome. We determined that integrating multiple types of data can provide more meaningful information than individual datasets, if combined properly. This method is able to capture the correlation between the attributes. The most striking result is an apparent connection between genetic background and patient mortality under existing treatment regimes. We find that we can cluster well using the mortality label of the patients. Also, using a Support Vector Machine (SVM) we can predict clinical outcome with high accuracy. This thesis will discuss the data-mining methods used and their application to biomedical research, as well as our results and how this will affect the diagnosis and treatment of ALL in the future.

Acknowledgments

I would like to thank my supervisor Prof. David Skillicorn for the opportunity to work on this project, for all of the guidance he has given me along the way and for the chance to continue my work in Australia. The School of Computing at Queen’s University has provided me not only with a wonderful education but also the funding that made this work possible and for that I am grateful. Thanks also to Dr. Daniel Catchpoole for providing a different view into my work and making sure all of our results were practical and applicable, as well as for his hospitality for my time in Australia. To my friends and colleagues at the Children’s Cancer Research Unit at The Children’s Hospital at Westmead, I thank you for making my transition so easy and making it fun to come into work every day. And finally I would like to thank my friends and family for their support throughout this whole process. Although my stories may sound like gibberish to you, you were always there to listen.

Table of Contents

Abstract
Acknowledgments
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction
1.1 Problem
1.2 My Contribution

Chapter 2: Background
2.1 Acute Lymphoblastic Leukemia (ALL)
2.2 The Datasets
2.3 Random Forests
2.4 Singular Value Decomposition (SVD)
2.5 Support Vector Machine (SVM)
2.6 Related Research

Chapter 3: Experiments
3.1 Pre-processing
3.2 Normalization
3.3 Experimental Model
3.4 Combination of Datasets
3.5 Attribute Selection
3.6 Analysis of Selected Attributes
3.7 Validation of Results
3.8 Further SNP Analysis

Chapter 4: Results
4.1 SVD Results
4.2 Combination of Datasets
4.3 Analysis of Selected Data
4.4 Attribute Selection
4.5 Validation of Results
4.6 Extended SNP Analysis
4.7 Discussion of the Nature of the Data

Chapter 5: Conclusion
5.1 Future Work

Bibliography

List of Tables

2.1 Symptoms of ALL
3.1 Random Forests Setup
4.1 Combination Method 1
4.2 Combination Method 2
4.3 SNP Subset SVM Results
4.4 cDNA Subset SVM Results
4.5 Affy Subset SVM Results
4.6 SNP and cDNA Subset SVM Results
4.7 SNP and Affy Subset SVM Results
4.8 cDNA and Affy Subset SVM Results
4.9 SNP, cDNA and Affy Subset SVM Results
4.10 Top 100 SNP Attributes
4.11 Top 100 cDNA Attributes
4.12 Top 100 Affy Attributes
4.13 Top 100 SNP-cDNA Attributes
4.14 Top 100 SNP-Affy Attributes
4.15 Top 100 Affy-cDNA Attributes
4.16 Top 100 SNP-cDNA-Affy Attributes
4.17 Label Shuffling SVM Results
4.18 SNP Relapse SVM Results
4.19 Comparison of Top Attributes

List of Figures

4.1 All SNP SVD
4.2 All cDNA SVD
4.3 All Affy SVD
4.4 All Clinical SVD
4.5 All SNP-cDNA SVD
4.6 All SNP-Affy SVD
4.7 All Affy-cDNA SVD
4.8 SNP-cDNA-Affy SVD
4.9 SNP Subset SVD
4.10 SNP Subset SVD
4.11 cDNA Subset SVD
4.12 Affy Subset SVD
4.13 SNP-cDNA Subset SVD
4.14 SNP-Affy Subset SVD
4.15 cDNA-Affy Subset SVD
4.16 All Combined Subset SVD
4.17 SNP Relapse SVD
4.18 SNP Graph Analysis SVD
4.19 Reformatted SNP Analysis SVD
4.20 250 SNP SVD for old and updated labels
4.21 SNP SVD for updated labels
4.22 Intersecting SNP SVD

Chapter 1: Introduction


1.1 Problem

Cancer, in all of its forms, is the second leading cause of death in the United States [15] and accounts for 13% of all deaths worldwide [25]. It is estimated that in the United States in 2009 approximately 1.5 million people will have been diagnosed with cancer and that, of these, approximately 560,000 will die from their disease [22]. It is also estimated that approximately 30% of these cancer deaths are preventable [25]. The National Cancer Institute in Washington spends approximately $4.8 billion per year on cancer research, with most of the funding going towards breast, prostate, lung, colorectal and leukemia research [21].

Leukemia is the most common malignancy affecting children under the age of 15, but it also affects many adults. There are four subtypes of leukemia: acute lymphoblastic leukemia, chronic lymphocytic leukemia, acute myelogenous leukemia and chronic myelogenous leukemia. It is estimated that approximately 45,000 new cases of leukemia will have been diagnosed in the United States in 2009 [16]. The survival rate for persons with leukemia has dramatically increased over the past four decades. In the 1960s the five-year event-free survival rate was a mere 14%. In more recent years figures as high as 80% have been quoted [12, 17, 31]. Although there has been a significant improvement in the treatment of this disease, 20% of all leukemia cases still result in death.

With the completion of the Human Genome Project, the understanding of genetics has increased significantly. As a result, many new technologies have been developed to study the genome in many different forms. One of these technologies is the microarray, a high-throughput device that allows thousands of gene expression levels to be analyzed simultaneously. Consequently, a wealth of data is being generated every day for many different purposes. The microarray has become a useful research tool and has allowed researchers to begin looking at problems on a much larger scale. As this technology evolves, so too do the applications for microarrays. In cancer research, researchers can now look at the expression levels of many thousands of genes as well as a description of an individual's genome given by a set of Single Nucleotide Polymorphisms (SNPs). The amount of data being generated is staggering, and there is a need to develop methods for analyzing these data efficiently.

These high-throughput technologies have led to many great discoveries which have had many clinical applications. With 20% of leukemia cases leading to death, there is an opportunity for this type of technology to have a positive effect in this area of research. The present method for diagnosis, prognosis and treatment decisions involves a series of clinical tests and assessments by physicians, who ultimately place the patient in a specific risk category that dictates the treatment they receive. Although this process has clearly improved over the past four decades, there is still a need for improvement.

1.2 My Contribution

The goal of this research is to explore the relationship between the genetics of individuals who have leukemia and whether or not they survive the disease. Our hypothesis is that an individual's genetics are related to their likelihood of surviving this disease. We use data from several microarray experiments that have generated both SNP profiles and gene expression data for patients. In order to analyze these complex data we have developed a data-mining process that involves filtering the data to remove uninformative attributes and then using a matrix-decomposition technique to cluster these data. We apply this data-mining technique to the individual datasets as well as all possible combinations of them, to see if combining data provides more useful information for the exploration. We use the clinical information that is available for each patient to develop an understanding of the results that our method has produced and attempt to see if any biological implications can be drawn.

These data are complex, high-dimensional and evolving. These properties make them difficult to work with, and standard data-mining methods must be modified in order to have the appropriate functionality. This research is preliminary work in this field and will provide a foundation for further experiments which will potentially lead to work with clinical implications. With the amount of data that is being generated through these high-throughput experiments, it is imperative that proper methods for analyzing these data be developed. We believe that this is the first step in that direction for this particular research area.

Chapter 2: Background


In this chapter, we provide the necessary background information for this thesis. First, we explain the specific type of leukemia that we are investigating and discuss the types of datasets that are used for this experiment. We also explain the techniques used and their role in the field of biomedical research.

2.1 Acute Lymphoblastic Leukemia (ALL)

Acute Lymphoblastic Leukemia (ALL) is the most common cancer in children under the age of 15 [12, 17, 31]. The term 'acute' refers to the short amount of time it takes to develop this disease. It is estimated that in the US in 2008, 5430 people were diagnosed with ALL, 60% of whom were children [16]. Overall this is a rare cancer, accounting for only 0.3% of all cancers diagnosed every year [32].

ALL is a cancer of the blood, but more specifically it affects cells known as lymphocytes. These cells are a type of white blood cell and are an integral part of the immune system. The bone marrow is responsible for producing blood cells, and in individuals with ALL, the bone marrow produces too many lymphocytes too quickly. As a result, the cells never fully mature, nor do they develop their proper functionality [17]. This becomes a problem for several reasons. First, lymphocytes are an important part of the immune response. If they do not function properly, the individual is unable to fight off infection or disease and can become ill. Second, if too many lymphocytes are being produced, there is less room in the bloodstream for the other vital blood cells: red blood cells and platelets. As a result, the individual may develop anemia or bleeding problems. Finally, the abnormal lymphocytes can build up in lymph nodes as well as the spleen, liver, brain and testicles and can cause swelling, which can also lead to complications [12, 17].

There are several risk factors that make it more likely that certain individuals will develop ALL than others. These include radiation exposure, benzene exposure, smoking, genetic predisposition, certain viruses and past chemotherapy [12, 16]. There are many symptoms associated with ALL, and they are listed in Table 2.1.

Table 2.1: Symptoms of ALL

General weakness
Fatigue
High fever
Weight loss
Frequent infections
Bruising easily with no obvious cause
Bleeding from the gums or nose
A rash of dark red spots
Blood in the urine or stool
Pain in the bones or joints
Breathlessness
Swollen lymph glands
A feeling of fullness in the abdomen caused by a swollen liver or spleen

At present, the method for diagnosis of the disease is as follows. Blood and bone-marrow laboratory tests are performed to look for any leukemia cells. A complete blood count is used to look at levels of white blood cells, platelets, etc. A bone-marrow aspirate is done to look for blast cells, while a bone-marrow biopsy is done to see how much disease is in the bone marrow. Also, immunophenotyping is done to determine if the disease is B-cell or T-cell leukemia [12, 17].

There are two stages to the treatment procedure. First, induction therapy is conducted with the goal of killing as many leukemia cells as possible, to get the blood counts back to normal and induce remission. This is done using various types of chemotherapy. In certain extreme cases, when the disease has spread to the brain and spinal cord, radiation therapy or a bone-marrow transplant is also necessary. Once the patient is in remission the second stage of treatment, called post-induction therapy, begins. This involves giving doses of treatment over a period of 2-3 years. This is necessary because not all of the leukemia cells will always be killed by induction. This treatment is usually different from the induction chemotherapy [12, 17].

Survival rates for children diagnosed with ALL have increased dramatically over only a few decades. In the 1960s the survival rate was a dismal 4%. Through improvements in both diagnosis and treatment, this number has risen to around 80% today, although survival rates for high-risk patients remain around 40% [12, 17, 31]. This is a remarkable increase; however, room for improvement still exists, and so it is necessary to continue to look for ways to improve diagnosis, prognosis and treatment.

2.2 The Datasets

2.2.1 SNP data

DNA is the blueprint from which all living creatures are created. It is the variations in this DNA that allow for the differences both between species and within a species. There are four nucleotides that make up DNA: adenine (A), cytosine (C), guanine (G), and thymine (T). Because DNA is double stranded, these nucleotides work in pairs; A binds with T and C binds with G. In the lifespan of a living being, this DNA will be replicated numerous times. The process of replication is subject to error, and although there are many error-checking processes, it is still possible for a mistake to happen. This is known as a mutation, and there are many different types of mutations. Some are harmful while some are not. Over the course of time these mutations are passed down from generation to generation, and are subject to their own errors as well. Mutations are the reason why there is so much interspecies individuality [2].

One variation in DNA that has become important to researchers is the Single Nucleotide Polymorphism (SNP). This is the term used to describe a single nucleotide base-pair difference between individuals. In order to be considered a SNP, this difference must occur in at least 1% of the population [2, 5]. SNPs are responsible for approximately 90% of all genetic variation in humans, and two out of three times the change in nucleotide is from a cytosine to a thymine. SNPs can occur in all parts of the DNA: in coding, non-coding and intergenic regions. Because SNPs appear in coding regions, it is believed that they have an effect on an individual's predisposition to disease as well as their response to drugs. SNPs have therefore become an important research tool, as they may explain why certain individuals do not respond well to treatment, or why they contracted a certain disease [2, 5].

For this research, the SNP data were collected on the Illumina HumanNS-12 Genotyping BeadChip platform. The SNPs analyzed in this research are all non-synonymous SNPs. This means that a change in the DNA results in a change in the amino acid coding sequence of the protein. This platform contains 13917 SNPs, and genotyping was performed for 137 patients. The samples used for this experiment come from peripheral blood during remission. The data generated from this analysis come in two forms: theta values and allele values. A theta value generates a B allele frequency, which has a value between 0 and 1. A value close to 1 represents a homozygous B allele, a value close to 0 represents a homozygous A allele and a value close to 0.5 represents a heterozygous allele [29]. The other form of these data is the allele values. There are three possible values these data can take: 0, 1 or 2. A value of 0 represents the homozygous major allele, 1 represents the heterozygous case and 2 represents the homozygous minor allele. Unlike the theta values, the allele values do not span a continuous range; each individual has exactly one of these three values for each SNP.
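The relationship between a continuous B allele frequency and a discrete genotype call can be illustrated with a minimal Python sketch. The function name `call_genotype` and the cut-offs 0.25 and 0.75 are hypothetical, chosen only for illustration; they are not the thresholds used by the genotyping platform. Converting an AA/AB/BB call to the 0/1/2 coding would additionally require knowing which allele is the major one, which this sketch does not attempt.

```python
def call_genotype(baf, low=0.25, high=0.75):
    """Map a B allele frequency in [0, 1] to a genotype call.

    The cut-offs `low` and `high` are hypothetical illustration values.
    """
    if not 0.0 <= baf <= 1.0:
        raise ValueError("B allele frequency must lie in [0, 1]")
    if baf < low:
        return "AA"  # close to 0: homozygous A allele
    if baf > high:
        return "BB"  # close to 1: homozygous B allele
    return "AB"      # close to 0.5: heterozygous


# Three made-up frequencies, one for each genotype class.
calls = [call_genotype(value) for value in (0.02, 0.48, 0.97)]
print(calls)
```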

2.2.2 cDNA data

DNA microarrays are high-throughput devices that allow for the collection of a large amount of information on a small glass slide. The technology relies on how DNA is transcribed. When a particular gene becomes activated, it is transcribed many times in order to produce the necessary proteins. This process involves creating a complementary strand of mRNA, which is then used as a template for building these proteins. Thus, a gene that is highly expressed will have many identical mRNA molecules within a cell [33]. This is the basis for a microarray experiment. Researchers are interested to know which genes are active under certain conditions.

To do this, a microarray slide is spotted with thousands of single-stranded probes representing particular genes. Then, fluorescently labeled mRNA molecules are put onto the microarray slide. Due to the complementarity of the strands, if a labeled mRNA molecule finds its match it will hybridize to the probe. The more mRNA molecules bind to a particular probe, the more fluorescence there is at that spot on the microarray. After this process is complete, the microarray is scanned using a special scanner which detects the amount of fluorescence. The intensity of each spot is translated into a numerical form, and this becomes the data from which researchers work. A spot with a large amount of fluorescence represents a gene that is active under those particular conditions. These intensity values are often compared to a "normal" population in order to look at intensity fold changes, which can tell researchers which genes are up-regulated or down-regulated under a certain condition [33]. For this study the platform contained 10027 genes for 68 patients. The samples used for the microarray experiment were taken from the patient's bone marrow during remission. As such, these samples should represent healthy bone marrow.
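The idea of an intensity fold change can be sketched in a few lines of Python. This is an illustration only: the gene names and intensity values are made up, and the use of a base-2 logarithm is an assumption (a common convention, not something stated in the text). Under that convention a positive value indicates up-regulation relative to the reference and a negative value indicates down-regulation.

```python
import math


def log2_fold_change(sample_intensity, reference_intensity):
    """Base-2 log of the ratio between a sample and a reference intensity."""
    if sample_intensity <= 0 or reference_intensity <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(sample_intensity / reference_intensity)


# Hypothetical (sample, reference) spot intensities for three genes.
intensities = {
    "gene_a": (800.0, 200.0),  # four times brighter than the reference
    "gene_b": (150.0, 600.0),  # four times dimmer than the reference
    "gene_c": (500.0, 500.0),  # unchanged
}

fold_changes = {
    gene: log2_fold_change(sample, reference)
    for gene, (sample, reference) in intensities.items()
}
print(fold_changes)
```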

2.2.3 Affymetrix data

The Affymetrix microarray, known as a GeneChip, works similarly to the cDNA microarray. The main difference between these two methods is the way in which they are created. An Affymetrix GeneChip is created through a process known as photolithography. This technique uses masks and ultraviolet light to build the DNA probes directly on the slide. This is different from the cDNA method, where the DNA probes are spotted into wells on a slide. Before the creation of the GeneChip, the researchers must decide the composition of the probes so that the masks can be created. The slide (known as a wafer) is sectioned off so that each probe can be created.

The process of photolithography begins by coating the wafer with silane, which binds to the glass. Next, a linker molecule with a photosensitive molecule binds to the silane. A mask is then applied to the wafer, protecting certain areas of the slide while leaving others unprotected. Next, ultraviolet light is shone down on the slide and the unprotected parts of the slide lose their photosensitive molecule. Then, one of the nucleotides (A, C, G or T) is added into the mix and combines with the unprotected molecules. This is the beginning of the DNA probe. Next, another mask is applied and another nucleotide is added. This process is repeated until each probe has the desired length and sequence. Creating the DNA probes in this way allows researchers to be specific about the DNA probes that they want on the slide without having to isolate all of the DNA as they would for a cDNA microarray [1]. Compared to the cDNA microarray, the Affymetrix probes are much shorter. To compensate for this, the Affymetrix chips contain many redundancies [1]. For this study the platform contained 22277 probes for 144 patients. As in the cDNA experiment, the samples used for the Affymetrix experiment were taken from bone marrow during remission.

From the above three datasets there were 49 patients who had both SNP and cDNA data, 118 who had both SNP and Affy, 55 who had both Affy and cDNA, and 49 who had all three types of data.

2.2.4 Clinical data

When it is suspected that an individual may have ALL, many clinical tests are run in order to confirm the diagnosis. These tests include full blood counts as well as bone-marrow biopsies. An increased white blood cell count and a decreased platelet count are the first signs of leukemia. Other clinical results, such as an enlarged spleen or liver, chromosomal abnormalities such as a translocation, cytogenetic counts and the subtype of the disease, as well as the patient's age and sex, are all used to diagnose this disease. The clinical data for each patient in this study were processed at the same facility and can therefore be considered comparable. Certain patients were missing various types of clinical data, but the mortality was known for every patient. All of these data are used in this study to represent the view of a patient from a clinical perspective. Since these are the data available to clinicians making the diagnosis, we use them with our techniques to see how effective these decisions were.

2.3 Random Forests

The random forests algorithm [8] is an ensemble classifier that consists of many binary decision trees. A binary decision tree uses nodes in a tree structure to test the attributes of a dataset. The results of these tests are used to split the training data into subsets, which are then passed on to the next layer of the tree. This continues until each subset at a node contains only one class. There are many popular decision-tree algorithms, including ID3, CART and C4.5 [23, 30]. Each of these is a supervised learning method, as they require that the data have class labels. One of the challenges with a decision-tree classifier is deciding, at each layer of the tree, which attribute will provide the best split of the data. Two popular choices for this task are information gain and the gini index. The gini index is the splitting method that is used in random forests and is defined by:

$$\mathrm{gini}(D) = 1 - \sum_{i=1}^{k} p_i^{2} \tag{2.1}$$

where $D$ is the dataset, $k$ is the number of classes and $p_i$ is the relative frequency of class $i$ in dataset $D$. After splitting on attribute $X$, the gini index is defined as

$$\mathrm{gini}_X(D) = \frac{n_1}{n}\,\mathrm{gini}(D_1) + \frac{n_2}{n}\,\mathrm{gini}(D_2) \tag{2.2}$$

where $D_1$ and $D_2$ are the subsets of the dataset at each branch, which contain $n_1$ and $n_2$ objects respectively, and $n$ is the number of objects in $D$. The splitting decision is based on the difference in the gini impurity from a node to its children, and is given by

$$\Delta\mathrm{gini}(D) = \mathrm{gini}(D) - \mathrm{gini}_X(D) \tag{2.3}$$
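Equations 2.1 to 2.3 can be sketched directly in Python. The function names and the toy labels below are illustrative assumptions, not taken from any random forests implementation. The example compares a split that separates two classes perfectly with one that leaves both children as mixed as the parent.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (Equation 2.1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_after_split(left, right):
    """Size-weighted impurity of the two child nodes (Equation 2.2)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def gini_reduction(parent, left, right):
    """Reduction in impurity achieved by the split (Equation 2.3)."""
    return gini(parent) - gini_after_split(left, right)


# Toy node with two equally frequent, made-up classes.
parent = ["alive"] * 4 + ["deceased"] * 4

# A split that separates the classes perfectly.
pure = gini_reduction(parent, ["alive"] * 4, ["deceased"] * 4)

# A split whose children are as mixed as the parent.
half = ["alive", "alive", "deceased", "deceased"]
mixed = gini_reduction(parent, half, half)

print(gini(parent), pure, mixed)
```

The perfect split removes all of the parent's impurity, while the uninformative split removes none, so a tree builder comparing the two would choose the first.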

The attribute that provides the largest reduction of impurity is the best attribute to split on.

The random forests algorithm was developed by Breiman and Cutler, and is known as one of the most robust classification algorithms developed to date [8]. For each tree that is grown, a training set is created by randomly selecting, with replacement, n objects from the original set of N objects, where n is less than N. Because the selection is made with replacement, about one third of the data will not be selected; these become the Out-Of-Bag (OOB) objects, which are used as a test set. If there are M attributes in total, a number m much smaller than M is chosen. At each internal node, m attributes are selected randomly and the gini index is used to determine which of them provides the best split. The choice of m is difficult, but a rule of thumb is to select m to be equal to $\sqrt{M}$. Each tree is grown to its full size; there is no pruning in this algorithm. The error rate is estimated at the end of each run. For each case, let j be the class that received the most votes among the trees for which that case was OOB. The proportion of cases for which j is not equal to the true class is the OOB error estimate [8].
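The "about one third" figure for the OOB objects can be checked with a small simulation. This sketch assumes the common setting in which the bootstrap sample has the same size as the original set, and the helper name `out_of_bag_fraction` is hypothetical. In that setting the probability that a given object is never drawn is (1 - 1/N)^N, which approaches 1/e (roughly 0.37) as N grows.

```python
import random


def out_of_bag_fraction(n_objects, rng):
    """Draw a bootstrap sample of size n_objects, with replacement,
    and return the fraction of objects that were never selected."""
    selected = {rng.randrange(n_objects) for _ in range(n_objects)}
    return 1.0 - len(selected) / n_objects


rng = random.Random(0)  # fixed seed so that the run is repeatable
fractions = [out_of_bag_fraction(1000, rng) for _ in range(200)]
mean_oob = sum(fractions) / len(fractions)

# Probability that one particular object is missed by all 1000 draws.
expected = (1 - 1 / 1000) ** 1000

print(round(mean_oob, 3), round(expected, 3))
```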

2.4 Singular Value Decomposition (SVD)

Singular Value Decomposition (SVD) is a matrix decomposition technique which is given by the formula

$$A = U S V^{T} \tag{2.4}$$

where A is a matrix of n rows (objects/patients) and m columns (attributes). If A has r linearly independent columns, then U is an $n \times r$ matrix, S is an $r \times r$ diagonal matrix whose positive singular values appear in non-increasing order, and $V^{T}$ is an $r \times m$ matrix. The columns of U and V are orthonormal. The U matrix captures the variation of the rows of A, which correspond to the objects. The first column of U captures the most variation, the second column the second most, and so on. The singular values on the diagonal of the S matrix give the importance of the variation captured in each column of U. The V matrix captures the variation along the columns of A, which correspond to the attributes.

One of the many useful properties of SVD is that the results can be visualized. Since the most variation in the data is captured in the first few columns of the U and V matrices, and the variation is captured in an orthogonal manner, it is possible to plot the first 2 or 3 columns of these matrices. The resulting image can show clusters or trends in the data that may otherwise have been difficult to see. It is a useful tool for finding structure in complicated datasets. It is especially useful since it can be used on very large datasets which are difficult to handle [28].

The easiest interpretation of SVD is the geometric interpretation. When the U matrix is plotted, the data points correspond to the objects plotted in a new space. Data points that lie close to each other in this space are correlated with each other and are therefore more alike. Points that lie opposite each other are negatively correlated and are less alike. Points that are orthogonal to each other have no association. Also, points that lie at the origin are either correlated with everything or correlated with nothing. Either way, these points can usually be discarded as not interesting. The power of SVD as a method of clustering can be seen from this explanation. It is able to find points that are interesting, points that are not interesting, as well as associations between points [28].
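To make the geometric interpretation concrete, the following minimal sketch approximates the first two columns of U for a tiny made-up matrix, using only the Python standard library. It relies on power iteration with deflation, which is an illustrative stand-in and not the method used in this thesis; the matrix, the function names and the iteration count are all assumptions. Rows with similar attribute patterns end up at nearly the same coordinates.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def leading_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value of `matrix` and its left and
    right singular vectors by power iteration on (matrix^T matrix)."""
    matrix_t = transpose(matrix)
    v = [1.0] * len(matrix[0])  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = matvec(matrix_t, matvec(matrix, v))
        length = norm(w)
        v = [x / length for x in w]
    av = matvec(matrix, v)
    sigma = norm(av)
    u = [x / sigma for x in av]
    return u, sigma, v


def top_components(matrix, k=2):
    """Return the k leading (u, sigma, v) triplets, removing each
    component from the matrix (deflation) before finding the next."""
    residual = [row[:] for row in matrix]
    components = []
    for _ in range(k):
        u, sigma, v = leading_singular_triplet(residual)
        components.append((u, sigma, v))
        residual = [
            [a - sigma * u_i * v_j for a, v_j in zip(row, v)]
            for row, u_i in zip(residual, u)
        ]
    return components


# Four hypothetical patients (rows) measured on three attributes (columns).
A = [
    [2.0, 2.0, 0.0],
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
]

components = top_components(A, k=2)
singular_values = [sigma for _, sigma, _ in components]

# Coordinates of each patient in the space of the first two columns of U.
coords = list(zip(components[0][0], components[1][0]))
print(singular_values)
print(coords)
```

In this toy case the first two patients share one position and the last two share another, so a plot of these coordinates shows two clusters. For real datasets a tested library routine would normally be preferred to hand-written power iteration.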

2.5 Support Vector Machine (SVM)

A Support Vector Machine (SVM) is a supervised learning method used for classification. This method uses a decision boundary to separate the classes in space. However, for two classes of data points that are linearly separable, there are an infinite number of boundary lines that can be drawn, and it is not obvious which of these boundaries is the best. The SVM resolves this by using what is known as the maximum-margin hyperplane. The idea is that a separating line is chosen so as to maximize the distance from the nearest data point on each side. The margin of the linear classifier is the width by which the boundary can be widened before hitting a point on each side. The support vectors are the points that lie on the edges of the margin, closest to the decision boundary, and these are the only points used in determining the best way to separate the classes [28].
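The notions of margin and support vectors can be illustrated without training an SVM at all. The sketch below takes two candidate separating lines for a small made-up dataset, computes the margin of each, and lists the points that lie closest to the chosen boundary. The data, the candidate lines and the function names are assumptions made for illustration; a real SVM finds the maximum-margin boundary by solving an optimization problem rather than by comparing hand-picked candidates.

```python
import math


def signed_distance(point, w, b):
    """Signed distance from a point to the hyperplane w.x + b = 0."""
    dot = sum(w_i * x_i for w_i, x_i in zip(w, point))
    return (dot + b) / math.sqrt(sum(w_i * w_i for w_i in w))


def margin(points, labels, w, b):
    """Smallest distance from any point to the boundary; a value that is
    zero or negative means the boundary does not separate the classes."""
    return min(y * signed_distance(p, w, b) for p, y in zip(points, labels))


def support_vectors(points, labels, w, b, tol=1e-9):
    """Points whose distance to the boundary equals the margin."""
    m = margin(points, labels, w, b)
    return [
        p for p, y in zip(points, labels)
        if abs(y * signed_distance(p, w, b) - m) <= tol
    ]


# Hypothetical two-class data in two dimensions.
points = [(1.0, 0.0), (2.0, 3.0), (3.0, -1.0),
          (-1.0, 0.0), (-2.0, 2.0), (-3.0, -1.0)]
labels = [1, 1, 1, -1, -1, -1]

vertical_margin = margin(points, labels, (1.0, 0.0), 0.0)  # line x1 = 0
tilted_margin = margin(points, labels, (2.0, 1.0), 0.0)    # line 2*x1 + x2 = 0

closest = support_vectors(points, labels, (1.0, 0.0), 0.0)
print(vertical_margin, tilted_margin, closest)
```

Both lines separate the classes, but the vertical one leaves a wider margin. For this particular dataset it is also the maximum-margin boundary, because the two classes are closest at (1, 0) and (-1, 0), and the line sits halfway between those two points.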

One common problem with complex data is that there is no simple linear boundary between classes. The idea behind SVM is to project these data into a higher-dimensional space using mathematical functions, known as kernels, to a point where a hyperplane can separate the data. If this is done properly, the necessary calculations can be carried out using only the original attributes, making it an efficient algorithm. It is possible to extend this algorithm to allow misclassifications while incurring a penalty for each. As such, there are several parameters that can be changed and that may require testing to determine the best setup for a particular experiment. Although this is primarily a two-class separator, it is possible to extend it to multiclass prediction. This is one of the most popular and effective classification algorithms to date [9].
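The claim that only the original attributes are needed can be illustrated with a degree-2 polynomial kernel, chosen here purely as an example; the text does not say which kernels were used. For two-dimensional inputs, squaring the ordinary dot product gives the same number as first mapping each point into a three-dimensional feature space and taking the dot product there. The function names and the sample points are assumptions.

```python
import math


def dot(x, z):
    return sum(x_i * z_i for x_i, z_i in zip(x, z))


def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed from the original attributes."""
    return dot(x, z) ** 2


def feature_map(x):
    """Explicit mapping of a 2-D point into the 3-D space that
    corresponds to the degree-2 polynomial kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x = (1.0, 2.0)
z = (3.0, -1.0)

implicit = poly_kernel(x, z)                     # never leaves 2-D
explicit = dot(feature_map(x), feature_map(z))   # dot product in 3-D

print(implicit, explicit)
```

The two values agree up to floating-point rounding, which is why the higher-dimensional coordinates never have to be computed explicitly.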

2.6 Related Research

2.6.1 Related research using SNP data

Yang and colleagues [35] state that considerable interindividual variation can be seen during the remission induction phase of therapy. They report that some patients drop from 100% to under 0.01% leukemia cells in the bone marrow with only 2 or 3 weeks of induction therapy. However, some individuals still exhibit high levels of leukemia cells in the bone marrow after 4 to 6 weeks of induction therapy.

They attribute most of this variation to host-related factors as opposed to tumor- related factors. In order to test this hypothesis, the team used two groups of patients consisting of 318 and 169 people respectively. They studied a total of 476 796 germline SNPs and after using several statistical tests were able to identify 102 SNPs that are believed to be related to this variation between individuals. It was also found that 63 of these SNPs are linked to early response, relapse and drug disposition. This demonstrates that some interindividual variation can be attributed to differences in individual genetics.

2.6.2 Related research using microarray data

Microarray experiments are ideal candidates for data mining. Their data are noisy, very large and complex, but are filled with a wealth of information. With the ability to capture so much information, it is imperative that data miners discover a way to extract it effectively. It is no surprise that there has been a lot of research in this field on both the biological side and the computational side.

As an example, Chopra et al. [11] looked at the problem of clustering as it applies to microarray data. They state that most of the available clustering algorithms only find clusters that are independent of the biological context of the analysis. The authors developed a novel clustering algorithm, SigCalc, which generates many different versions of clusters from one dataset, each of which provides a different insight into the dataset. They tested their algorithm on three yeast microarray datasets and discovered that they could find many of the same clusters of genes across all datasets. Being able to include biological data in the clustering process is critical in order to discover the best possible clusters of genes.

In another experiment, Hoffmann et al. [19] used gene expression profiles to attempt to predict long-term outcome of individuals with pre-B ALL. They used microarray data for 101 children diagnosed with pre-B ALL and, using statistical techniques as well as the random forest algorithm, were able to identify an 18-gene classifier which they state can predict the long-term outcome of these patients better than conventional methods. This demonstrates the power of using the random forest algorithm with microarray data.

2.6.3 Related research using random forest

The random forests algorithm is used extensively in the biomedical field of research and because of its design it is well suited for microarray data. These data are generally very large and contain a lot of noise. This algorithm can handle these two properties much better than many of the other classification algorithms that exist. Also, random forests can act not only as a powerful classifier, but as an attribute selection algorithm as well. All of these properties make this algorithm a popular choice for biomedical research.

Diaz-Uriarte and Alvarez de Andres [13] use random forests on a number of different microarray datasets in order to compare its performance with many other well known algorithms such as SVM, K-nearest neighbours, etc. The authors report that the classification accuracy of random forests is similar to that of the best algorithms that are already in use. On top of the classification accuracy, they explore the potential of random forests for attribute selection and propose a method for gene selection using the OOB error rates. The authors state that because of the algorithm’s ability to perform well as a classifier while also allowing for excellent attribute selection, this algorithm should become an essential part of the tool-box for prediction and gene selection with microarray data.

Archer and Kimes [4] performed a similar evaluation of the random forests algorithm and came to many of the same conclusions about the effectiveness of the algorithm. In this particular case they applied the algorithm to ALL microarray data as a means of trying to discover the genes which are responsible for the difference between subtypes of the disease. There has been much previous research on this topic and so they were able to compare their results using random forests to these. They were able to identify many genes which play a role in this and validated this with the previous research.

2.6.4 Related research using SVD

As previously discussed, microarrays are high throughput devices often containing many tens of thousands of gene expression values. It is difficult to analyze these data since most diseases or conditions only affect a small number of genes. It is common practice to reduce the number of genes in the analysis by discarding those that have a low expression level. However, this keeps genes that have a large difference in expression value but does nothing to account for correlations with other genes or other subtle connections. By performing a singular value decomposition on these data and sorting the columns of U based on the distance from the origin, it is possible to select the genes with the most interesting expression values, not simply the largest difference. An expression level for a gene that does not change across patients tends to lie near the origin and so these will not appear near the top of the sorted list [28].

Chapter 3

Experiments


In this chapter, we explain the experimental model used to explore these datasets. First, the preprocessing procedures that were performed are explained. Next, the different normalization techniques used are discussed. Finally, we present the setup of the various experiments that were performed.

3.1 Pre-processing

Careful preprocessing was required to ensure that the data used for the study was of the highest possible quality. Preprocessing includes such tasks as replacing missing values, excluding patients or attributes that are not useful, converting the data into appropriate numerical forms, and many other tasks. These steps were done on each of the data sets: SNP data, cDNA data, Affymetrix data and Clinical data. The samples for the SNP experiment were taken from the patient’s peripheral blood during remission. The samples for the cDNA and Affy microarray experiments were both taken from the patient’s bone marrow during remission. By taking the samples during remission the cells represent “normal” cells and therefore allow us to investigate the differences between patients’ genetics.

3.1.1 SNP data

The SNP dataset contained such information as the patient ID tag, the SNP names, the theta values of each SNP for each patient, as well as the genotype of each SNP for each patient. There were many missing values in this dataset and this is a fundamental challenge in data mining. The method for replacing missing values must be chosen carefully so as to not add any information to the data. Many solutions exist for replacing missing values; two of the more common approaches are to replace the missing values with either the column mean or a value of zero. For the purposes of this thesis, both methods were tested and there was little difference between the two methods. Therefore, the missing values were all replaced with a value of zero. Another problem encountered with the SNP dataset was that there were several duplicate subject records. All of the duplicates were removed from the dataset. After the preprocessing step, the dataset contained data for 137 patients and 13917 SNPs.
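The two replacement strategies compared above can be sketched as follows. This is a minimal illustration on a toy matrix, not the code used for the thesis; the function name `fill_missing` and the use of NaN to mark missing entries are assumptions made for the example.

```python
import numpy as np

def fill_missing(data, strategy="zero"):
    """Return a copy of `data` with NaN entries replaced.

    strategy="zero" writes 0 into every missing cell.
    strategy="mean" writes the mean of the observed values in that column.
    (Hypothetical helper; NaN as the missing-value marker is an assumption.)
    """
    filled = np.array(data, dtype=float)  # np.array makes a copy
    missing = np.isnan(filled)
    if strategy == "zero":
        filled[missing] = 0.0
    elif strategy == "mean":
        col_means = np.nanmean(filled, axis=0)  # ignores NaN entries
        # col_means broadcasts across rows, so each missing cell
        # receives the mean of its own column.
        filled = np.where(missing, col_means, filled)
    else:
        raise ValueError("strategy must be 'zero' or 'mean'")
    return filled

# Toy example: rows are patients, columns are attributes.
toy = np.array([[1.0, np.nan],
                [3.0, 4.0]])
zero_filled = fill_missing(toy, "zero")
mean_filled = fill_missing(toy, "mean")
```

When the column means are already close to zero, the two strategies give nearly identical matrices, which is consistent with the small difference reported above.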

3.1.2 cDNA data

The cDNA dataset went through a similar preprocessing step. The data that was received had already undergone some preprocessing and normalization to compensate for technical errors to do with the creation of the microarray. This is a standard procedure done for all microarray experiments. This dataset contained the patient ID tags, the gene names, and the microarray values. These values represent the expression ratio between the patient and a “normal” bone marrow sample. There were many missing values in this dataset as well, and again, both methods of replacement were tested. There was little difference between the two methods of replacement and so all of the missing values were replaced with a value of zero. This was not surprising as the mean values in a column were very close to zero. After the preprocessing step, the dataset contained data for 68 patients and 10027 genes.

3.1.3 Affymetrix data

The Affymetrix (Affy) dataset was received in three separate files, because the Affy data comprises three separate experiments. Two of these experiments were conducted at The Children’s Hospital at Westmead and the other experiment was performed in Washington. This presented a challenge as these experiments were each subject to their own sources of error, which would not be consistent across the experiments. One of the experiments had many more attributes than the other two, so these extra attributes were removed in order to combine all of the datasets. These datasets had already been preprocessed before we received them, and so there were no missing values to replace. The Affy dataset also contained “normal” patients in one of the experiments, that is, patients who do not have leukemia. These patients were removed from the combined datasets and kept separate for further testing. In the end, this dataset contained data for 144 patients and 22277 attributes.

3.1.4 Clinical data

The clinical dataset contained, for each patient, laboratory test results, such as initial white blood cell count and platelet counts, as well as patient information, such as treatment received, sex, age, etc. For these experiments, we used a selection of laboratory test results. The preprocessing of these data involved converting the attributes into appropriate numerical forms. The clinical tests which resulted in numerical results were left as they were. However, there were several attributes that had values that did not translate directly into linearly significant numbers. For these, nominal values were assigned to represent the different values of the attribute. For example, one attribute contained information about the size of the liver at diagnosis. The possible values were nil, 0-1cm, 1-5cm, and 6-10cm, and so these values were translated into 0, 1, 2 and 3. The final dataset contained data for 117 patients and 11 attributes.
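The liver-size translation described above amounts to a lookup table. A minimal sketch follows; the dictionary and function names are hypothetical, and only the four categories and their codes come from the text.

```python
# Codes for liver size at diagnosis, as listed in the text.
LIVER_SIZE_CODES = {"nil": 0, "0-1cm": 1, "1-5cm": 2, "6-10cm": 3}

def encode_liver_size(values):
    """Translate liver-size categories into their integer codes.

    Raises KeyError for a category that is not in the table, so that
    unexpected values are noticed rather than silently coded.
    """
    return [LIVER_SIZE_CODES[value] for value in values]

codes = encode_liver_size(["nil", "6-10cm", "1-5cm"])
```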

3.2 Normalization

Normalization of the data is important when using a technique such as SVD. It is important to ensure that the data is properly scaled and centered on the origin in order for SVD to function correctly. In this study, z-scores are calculated for each attribute which effectively centers the data on the origin and scales all of the values so that SVD is able to correctly capture the variation in the data. A z-score is calculated by subtracting the column mean from each value in the column, and dividing by the column standard deviation.
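The z-score calculation can be sketched as below. The function name is hypothetical, and two details are assumptions that the text does not specify: the population standard deviation is used, and constant columns are mapped to zero to avoid dividing by zero.

```python
import numpy as np

def zscore_columns(data):
    """Center each column on zero and scale it to unit standard deviation."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)      # population standard deviation (assumption)
    std[std == 0] = 1.0         # constant columns become all zeros (assumption)
    return (data - mean) / std

# Toy example: rows are patients, columns are attributes.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])
normalized = zscore_columns(toy)
```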

This technique works well; however, there are some inherent problems associated with it. Scaling data in this way assumes two things: that all of the attributes are equally important, and that all of the values have a linear relationship between significance and magnitude. With the SNP, cDNA and Affy datasets, all of the attributes are treated as both equal and linearly significant so this is not a concern for this particular study.

Another normalization concern is with the microarray data when it is collected. There are many visual artifacts which must be accounted for in the microarray scan, such as dust, uneven surfaces, poor washing, etc. Fortunately, the data which is used for this study was already normalized to account for these problems.

The clinical dataset contained attributes which were either linearly significant or nominal. For the attributes which were linearly significant, z-scores were calculated. For the attributes that were not, the logarithm of each value was calculated.

3.3 Experimental Model

In order to gain a better understanding of these datasets, the experimentation process began on a general level. This involved performing SVD on the entire dataset. Once some knowledge was gained about these datasets, attribute selection was then performed using the random forests algorithm, and then further exploration of the data was done using SVD and SVM. To see whether or not data integration would be beneficial, the three datasets were combined together, as well as in pairs.

3.3.1 SVD analysis of data

Using the geometric interpretation of SVD, the goal of these experiments was to see if there were significant clusters in the data. By labeling these images with different clinical features, e.g. mortality, we hoped to be able to understand what the clusters represented. By doing this for each of the datasets listed below, we wanted to see which datasets held the most structure and if combining them would give more information.

• SNP Dataset

• cDNA Dataset

• Affy Dataset

• Clinical Dataset

• SNP and cDNA Dataset

• SNP and Affy Dataset

• cDNA and Affy Dataset

• SNP, cDNA and Affy Dataset

The geometric interpretation of SVD is based upon plotting the U*S matrix to see if there are clusters in space. The farther away a point is from the origin, the more interesting that point is. Likewise, points that lie close together in space are more correlated with each other than points that lie farther apart. For all of the following experiments, the first three columns of the U*S matrix were used to plot the points.
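The coordinates used for plotting can be sketched as follows. The function name `svd_coordinates` is hypothetical and the input is assumed to be an already normalized patients-by-attributes matrix; `numpy.linalg.svd` is used as a stand-in for whatever SVD implementation was actually employed.

```python
import numpy as np

def svd_coordinates(data, k=3):
    """Return the first k columns of U*S for a patients-by-attributes matrix.

    Each row of the result is one patient's position in the
    k-dimensional space used for plotting.
    """
    u, s, _ = np.linalg.svd(np.asarray(data, dtype=float), full_matrices=False)
    # Multiplying column j of U by singular value s[j] gives U*S.
    return u[:, :k] * s[:k]

# Toy example: 6 patients, 5 attributes of random values.
rng = np.random.default_rng(0)
toy = rng.normal(size=(6, 5))
coords = svd_coordinates(toy, k=3)   # shape (6, 3)
```

Because the data matrix equals U*S*V', the same coordinates can be obtained by projecting the data onto the first k right singular vectors, which is a convenient check on the result.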

3.4 Combination of Datasets

There are two possible methods of combining datasets for performing attribute selection. First, the datasets can be combined and then the random forests algorithm can be run to select the best attributes. Second, attribute selection can be performed on the individual datasets and then the best attributes selected from each and combined together. In order to determine what the most appropriate method was for these experiments all possibilities were created. SVM was then run on all of the combined datasets to determine which approaches were the best.

3.5 Attribute Selection

Attribute selection is the process of removing attributes which contain less useful information for the task at hand. By choosing attributes that appear to provide the most useful information, the dimensionality of the problem decreases which helps to improve the quality of the experiments. This is especially true for datasets which are as large as these. For example, in the SNP dataset, not all 13917 SNPs are likely to be relevant to this problem. Having such a large number of attributes not only makes it difficult to perform accurate classification, but the tests themselves become very inefficient to run.

There are many ways of performing attribute selection, but because of the size of these datasets and the quality of the process a good choice is to use random forests. At the completion of the random forest algorithm, an output file contains the Gini index values for all of the attributes which were used for splitting. One problem that exists when using this algorithm with such a large dataset is that, in order for every attribute to be selected in the algorithm, a large number of trees must be built. However, because of the size of these datasets and the limitations of current hardware, it was not possible to run the algorithm long enough for every attribute to be considered. The solution to this was to run the algorithm several times and combine the results of each trial until all, or almost all, of the attributes had Gini index values.

The setup of the algorithm for each dataset is shown in Table 3.1. For each dataset the top 25, 50, 100, 250, 500, 1000, 2500 and 5000 genes were selected for further analysis. These subsets were chosen in order to capture the features of the datasets as they change from very few attributes to a large number of attributes.

Table 3.1: Attribute selection using Random Forests

Dataset              Patients  Attributes  Trees Built
SNP                  137       13917       4x30000
cDNA                 68        10027       4x30000
Affy                 144       22277       4x30000
SNP & cDNA           49        23944       8x20000
SNP & Affy           118       36194       10x14000
cDNA & Affy          55        32304       10x14000
SNP & cDNA & Affy    49        46221       12x11000
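Combining several random forest runs and then taking the top-ranked attributes can be sketched as below. The text does not say how the Gini values from separate runs were merged, so averaging over the runs in which an attribute appeared is an assumption made for this example; the function names and the SNP identifiers are hypothetical.

```python
def merge_importances(runs):
    """Merge per-run Gini values into one score per attribute.

    `runs` is a list of dicts mapping attribute name to the Gini value
    reported by one run. An attribute missing from a run was never used
    for splitting in that run. Scores are averaged over the runs in
    which the attribute appeared (an assumption, see the lead-in).
    """
    totals, counts = {}, {}
    for run in runs:
        for attribute, value in run.items():
            totals[attribute] = totals.get(attribute, 0.0) + value
            counts[attribute] = counts.get(attribute, 0) + 1
    return {a: totals[a] / counts[a] for a in totals}

def top_attributes(importances, k):
    """Return the k attribute names with the highest merged score."""
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(importances, key=lambda a: (-importances[a], a))
    return ranked[:k]

# Toy example with made-up attribute names and values.
runs = [{"rs1": 0.40, "rs2": 0.10},
        {"rs2": 0.30, "rs3": 0.15}]
merged = merge_importances(runs)
subsets = {k: top_attributes(merged, k) for k in (1, 2)}
```

In the experiments the same ranking would be cut at 25, 50, 100, 250, 500, 1000, 2500 and 5000 attributes.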

3.6 Analysis of Selected Attributes

By removing a large fraction of the attributes, we have discarded a lot of information that is not informative for these experiments. We are not taking a minimalist approach to this problem and assuming that there is a small set of genes or SNPs which can be used to distinguish between whether a patient will live or die. We believe that it is important to incorporate as much information as possible in order to build the best model. However, in datasets of this size, it is clear that there are likely to be many attributes which are irrelevant to the problem at hand. These are the attributes we are interested in removing. Any decision about how many attributes to select is to some degree arbitrary. However, it is possible to make a principled decision by testing the effect of adding more attributes and observing if they provide any new information. Including too many attributes in an analysis makes it difficult to find any interesting information due to noise and variation unrelated to the properties of interest. On the other hand, including too few attributes can make it difficult to find any useful information or to make the model generalizable. This is why we have chosen to test several subsets of varying size so we can find the subset of attributes that best describes the dataset. For each of the subsets, we performed classification using an SVM. This was done to gain an understanding about the effect of increasing the size of the subsets. Then, we performed SVD to see whether there were any significant clusters in the data when we labeled the images with mortality or subtype of disease.

We performed this analysis on each of the following datasets:

• SNP dataset

• cDNA dataset

• Affy dataset

• Clinical dataset

• SNP and cDNA dataset

• SNP and Affy dataset

• cDNA and Affy dataset

• SNP, cDNA and Affy dataset

3.7 Validation of Results

It is important to be able to validate results and show that they are not due to random chance or to overfitting the data. Because of the difficulty of obtaining new and relevant data in this field, it was necessary to use sophisticated techniques in order to validate the results we have obtained. We used random label shuffling to test our results.
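Random label shuffling can be sketched as a permutation test: the class labels are shuffled many times, the analysis is repeated on each shuffled copy, and the real result is compared with the shuffled ones. The sketch below is generic and not the thesis procedure; the function names, the number of permutations and the toy scoring function are all assumptions. In the experiments the score would be something like cross-validated SVM accuracy.

```python
import random

def permutation_p_value(score_fn, features, labels, n_permutations=1000, seed=0):
    """Estimate how often shuffled labels score at least as well as the real ones."""
    rng = random.Random(seed)
    observed = score_fn(features, labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real link between features and labels
        if score_fn(features, shuffled) >= observed:
            at_least_as_good += 1
    # The +1 terms count the observed labeling itself, so the estimate is never 0.
    return (at_least_as_good + 1) / (n_permutations + 1)

def mean_gap(features, labels):
    """Toy score: absolute difference between the two class means of one attribute."""
    ones = [x for x, y in zip(features, labels) if y == 1]
    zeros = [x for x, y in zip(features, labels) if y == 0]
    return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))

# Toy example: one attribute that clearly separates the two classes.
features = [0.10, 0.20, 0.15, 0.05, 0.12, 0.90, 1.00, 0.95, 1.10, 1.05]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
p_value = permutation_p_value(mean_gap, features, labels)
```

A small value indicates that the observed result is unlikely to arise from the label distribution alone.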

3.8 Further SNP Analysis

Because of the interesting results found for the SNP dataset, we decided to investigate this further. There are many different experimental pathways that can be explored during a data-mining experiment. We decided to look down five different pathways: predicting relapse, graph analysis, a reformatting of the data, updating the patient labels and a cross validation of the attributes.

3.8.1 Predicting relapse

The previous attribute selection was performed using the patient mortality as a class label. In order to select attributes that are predictive of relapse, the random forest algorithm was performed for the SNP data with the class label being whether or not the patient relapsed.

3.8.2 Graph analysis

The previous analysis using SVD focused on an individual’s attribute values to determine their place in the object space. Another way to view this problem is to compare patients to each other using the dot product to create a similarity matrix. This matrix is n by n and an entry at x(i,j) represents the similarity between patients i and j. By applying a threshold to these data, any value less than the threshold becomes a 0 and the matrix can now be viewed as an adjacency matrix for a graph. SVD is then applied to this matrix and plotted. All non-zero entries in the matrix represent an edge between the points in the graph.
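The construction of the thresholded similarity matrix can be sketched as follows. The function names and the toy threshold are hypothetical, the diagonal (each patient's similarity to itself) is left in place, and plotting the first k columns of U*S is an assumption carried over from the earlier SVD analysis.

```python
import numpy as np

def thresholded_similarity(data, threshold):
    """Dot-product similarity between patients, with small entries set to 0."""
    data = np.asarray(data, dtype=float)
    similarity = data @ data.T                # n-by-n, entry (i, j) compares i and j
    return np.where(similarity >= threshold, similarity, 0.0)

def graph_edges(adjacency):
    """List the (i, j) pairs with i < j that have a non-zero entry."""
    rows, cols = np.nonzero(np.triu(adjacency, k=1))
    return [(int(i), int(j)) for i, j in zip(rows, cols)]

def graph_coordinates(adjacency, k=3):
    """First k columns of U*S of the adjacency matrix, used as plot positions."""
    u, s, _ = np.linalg.svd(adjacency)
    return u[:, :k] * s[:k]

# Toy example: patients 0 and 1 are similar, patient 2 is different.
toy = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0]])
adjacency = thresholded_similarity(toy, threshold=0.5)
edges = graph_edges(adjacency)
coords = graph_coordinates(adjacency, k=2)
```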

3.8.3 Reformatting the data

As explained previously, the SNP data is coded to represent the three possible alleles.

The previous data had one attribute for each SNP which had three possible values. Another approach we developed was to split each SNP into three attributes; one for each allele. For each SNP, a patient would have a value of 1 for the allele they had and 0 for the other two alleles. We then performed attribute selection on this dataset with the hope that if there were any specific alleles for certain SNPs that were interesting, they would be selected using this method. We then applied SVD to the resulting datasets and observed the results.
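The reformatting can be sketched as an indicator encoding. The three codes AA, AB and BB are the ones mentioned later in the SVD results; the function name, the column naming scheme and the treatment of a missing call as three zeros are assumptions made for the example.

```python
GENOTYPES = ("AA", "AB", "BB")

def split_snps(genotype_rows, snp_names):
    """Expand each SNP column into three 0/1 columns, one per code.

    Returns the new column names and the encoded rows. A call that is
    not one of the three codes (for example None for a missing value)
    becomes three zeros (assumption).
    """
    columns = [f"{snp}_{code}" for snp in snp_names for code in GENOTYPES]
    encoded = []
    for row in genotype_rows:
        out = []
        for call in row:
            out.extend(1 if call == code else 0 for code in GENOTYPES)
        encoded.append(out)
    return columns, encoded

# Toy example: two patients, two SNPs with made-up names.
columns, encoded = split_snps([["AA", "BB"], ["AB", None]], ["snp1", "snp2"])
```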

3.8.4 Updating patient labels

Late in the development of this research, we were able to obtain updated data for the patients. This included five patients who had since died and seven who had relapsed. We were interested to see where these patients lie in our previous space, and so the SVD images were relabeled with the updated information. We also reran the random forest algorithm to see the effect on the attribute-selection process with updated labels.

3.8.5 Cross validation of the attributes

To further support the attribute-selection process, we divided the selected subsets into smaller subsets that we then ran through the random forest algorithm. We then compared the resulting attribute lists in order to see which attributes appeared in multiple lists or only in one.

Chapter 4

Results


In this chapter we report and explain the results of the experiments. First, we look at the clustering of the data based on the SVD images for each of the individual datasets as well as the combined datasets. Next, we explore the subsets of the data created through the attribute-selection process and evaluate these using SVD as well as SVM. Then, we discuss the results of attribute selection and present the top attributes from each dataset. We also explore the SNP dataset further in order to find the biological significance of the results.

4.1 SVD Results

Each of the following results was obtained by plotting the first three dimensions of the U matrix in order to visualize the data. All of the images are labeled with clinical information, such as the patient mortality outcome or the subtype of the disease. By doing SVD on the entire dataset as an early step, we learn some general structures that may exist in these data. We can also use these images as a benchmark for how our subsequent analysis performs.

4.1.1 SVD on SNP data

The resulting SVD image for the full SNP dataset can be seen in Figure 4.1. There appear to be two fairly well defined clusters in the data but they are clearly not related to mortality, seen in Figure 4.1(a), or subtype of disease, seen in Figure 4.1(b). Since this dataset includes 13917 SNPs, not all of these SNPs are expected to be associated with leukemia. Upon further investigation we were able to determine that the spread of the data is caused by the way in which the data is coded. This causes the data to tend to form three clusters based on whether patients have a majority of genes of type AA, AB or BB.


(a) SNP data labeled with mortality (b) SNP data labeled with subtype

Figure 4.1: SVD images of SNP data. (a) blue = alive, red = deceased. (b) red = T-cell, blue = B-cell, green = unknown

4.1.2 SVD on cDNA data

The SVD image for the cDNA dataset does not appear to contain any noticeable clusters based on the mortality label, as seen in Figure 4.2(a). When labeling this image with the patients’ subtype of the disease, more interesting results appear. As seen in Figure 4.2(b), there is a fairly well defined cluster of T-cell patients. This implies that the cDNA genes are able to capture a variation in the subtype of disease that the patients have. This is not surprising since it is well known that the difference between subtypes can be distinguished by the expression of only one gene [27].



(a) cDNA data labeled with mortality (b) cDNA data labeled with subtype

Figure 4.2: SVD images of cDNA. (a) blue = alive, red = deceased. (b) red = T-cell, blue = B-cell, green = unknown

4.1.3 SVD on Affy data

The SVD results of the Affy dataset did not provide much useful information. The images, labeled with mortality and subtype, are seen in Figures 4.3(a) and (b). It is clear that there are no separations in the data based on either label. However, when labeling by type it can be seen that there are no T-cell patients in the main cluster of points. This suggests that the Affy data contains some information regarding the subtype of the disease. Since there are so many points whose disease type is labeled as unknown it is difficult to be confident in this assessment. It is interesting to note that the cDNA data separates the data better than the Affy data does. This was moderately surprising as the Affymetrix technology is generally accepted to be more reliable than cDNA microarrays. We believe this is related to the way in which the data was collected and perhaps to issues combining datasets from different operators.



(a) Affy data labeled with mortality (b) Affy data labeled with subtype

Figure 4.3: SVD images of Affy. (a) blue = alive, red = deceased. (b) red = T-cell, blue = B-cell, green = unknown

4.1.4 SVD on clinical data

The shape of this dataset is interesting, as seen in Figure 4.4. There are four parallel clusters of data, but these cannot be explained by the mortality or subtype labels. Upon further investigation it is found that the three leftmost clusters consist of the data for patients who had a genetic translocation. More specifically the cluster farthest to the left contains data for patients who had a BCR-ABL translocation, and to the right of that is a cluster of data for patients who had a TCL-AML translocation. The cluster of data points on the far right can only be described as the data points related to patients who did not have a translocation. The spread of data from the bottom to the top of the cluster has been identified as being affected by the size of the liver at diagnosis, the size of the spleen at diagnosis as well as the initial platelet count. It is quite clear that any medical decisions about diagnosis, prognosis or treatment based on these data would probably be poor. This is the type of information that is presently being used for decisions regarding leukemia. Based on the results of the genetic datasets, we believe that it is important that this information be used to help support the diagnosis, prognosis and treatment decisions.

(a) Clinical data labeled with mortality (b) Clinical data labeled with subtype

Figure 4.4: SVD images of Clinical. (a) blue = alive, red = deceased. (b) red = T-cell, blue = B-cell, green = unknown

4.1.5 SVD on combined SNP and cDNA data

The combined datasets have many more attributes than the individual datasets. With a large dataset there are going to be many attributes which are irrelevant for our purposes and will not contain any useful information. If there are many more of these attributes than useful attributes, the SVD may not be able to discover any information from the useful attributes and the experiment will not be successful. The images in Figures 4.5(a) and (b) demonstrate this effect. When labeling by subtype, all but one of the T-cell patients are clustered together. However, there are also several B-cell patients in this cluster. This does support the theory that the cDNA dataset primarily contains information regarding the subtype of the disease.


(a) SNP-cDNA data labeled with mortality (b) SNP-cDNA data labeled with subtype

Figure 4.5: SVD images of combined SNP-cDNA. (a) blue = alive, red = deceased. (b) red = T-cell, blue = B-cell, green = unknown

4.1.6 SVD on combined SNP and Affy data

It is interesting to note that the shape of the data in the SVD image, shown in Figure 4.6, is similar to that of the Affy dataset alone. This shows that the SNP dataset does not have a powerful global effect when it is combined with the larger Affy dataset. As such, the evaluation of this image is similar to that of the previously described Affy dataset. Once again, the mortality and subtype labels, shown in Figures 4.6(a) and (b), do not provide any explanation for the shape of this dataset.

(a) SNP-Affy data labeled with mortality (b) SNP-Affy data labeled with subtype

Figure 4.6: SVD images of combined SNP-Affy. (a) blue = alive, red = deceased. (b) red = T-cell, blue = B-cell, green = unknown

4.1.7 SVD on combined Affy and cDNA data

This dataset was interesting, because the shape of the data is similar to that of the cDNA dataset alone. This is surprising because the Affy dataset is more than double the size of the cDNA dataset, and so in order for this to happen the cDNA dataset must contain many more globally powerful attributes. As seen in Figure 4.7(a), the mortality label does not provide any meaningful separation in the data, but in Figure 4.7(b), the subtype label does appear to be fairly well separated. It is clear that the cDNA data has the ability to capture variation in the patients based upon their subtype of the disease. It is important to note that the combination of these two datasets is different from combining either of them with the SNP data. These two datasets are intended to capture the same information, that is, the genetic expression levels of certain genes. This suggests that they should be similar to each other and the combination may not provide any interesting information.



(a) Affy-cDNA data labeled with mortality (b) Affy-cDNA data labeled with subtype

Figure 4.7: SVD images of combined cDNA-Affy. (a) blue = alive, red = deceased. (b) red = T-cell, blue = B-cell, green = unknown

4.1.8 SVD on combined SNP, cDNA and Affy data

The shape of this dataset is also similar to that of the cDNA dataset, suggesting again that the cDNA dataset contains the most obvious structure. The results of labeling by mortality are shown in Figure 4.8(a). It can be seen that there are no tight clusters of patients. When labeling by subtype, as shown in Figure 4.8(b), the T-cell patients appear to cluster together on the bottom left of the image. Since the separation based on subtype continues to appear with these large microarray datasets, it is clear that the genetic expression patterns for these two subtypes are quite distinct, which enables the SVD to discover the separation between them.

(a) SNP-Affy-cDNA data labeled with mortality (b) SNP-Affy-cDNA data labeled with subtype

Figure 4.8: SVD images of combined SNP-cDNA-Affy. (a) blue = alive, red = deceased. (b) red = T-cell, blue = B-cell, green = unknown

So far, we have seen that the entire datasets contain only weak clusters that are mostly related to the subtype of the disease. Next, we look at subsets of the attributes as determined by the random forests algorithm.

4.2 Combination of Datasets

To properly test which method of combination was better, we explored the combined SNP and cDNA dataset. The SVM results for the two methods of combination are shown in Table 4.1 and Table 4.2. It is quite clear that first combining the datasets and then performing attribute selection gives much better accuracy than performing attribute selection on the individual datasets and then combining the top attributes. The reason this method works better is that the random forests algorithm is able to find correlations between the two sets of data when selecting the best attributes for splitting. It is interesting that this SNP-cDNA dataset showed a significant improvement with this method of combination, because it means that the correlation between the SNP dataset and the cDNA dataset provides meaningful information about the mortality of patients. Because of this, we decided to use this method of combination for all experiments.

Table 4.1: SVM prediction accuracy of combining datasets and then doing attribute selection (6-fold cross validation)

Attributes  % Class Alive  % Class Deceased
25          100            100
50          100            100
100         100            100
250         100            100

Table 4.2: SVM prediction accuracy of attribute selection and then combining datasets (6-fold cross validation)

Attributes  % Class Alive  % Class Deceased
25          97.62          57.14
50          97.62          100
100         100            85.71
250         97.62          71.43

4.3 Analysis of Selected Data

Here we discuss the results of the experiments which look further into the selected datasets. For each of the following datasets, we first look at SVM results from each subset of attributes predicting mortality to develop an understanding about which subsets were going to be the most predictive. Then we used SVD to see if there were any new and interesting clusters of data that we could not see with the original datasets. We look for separation of the data for both mortality and subtype in order to attempt to explain the clusters. The goal is to try to find the subset of attributes for each dataset that best separates the patients based on their mortality. It is important to note that we are not looking for the smallest possible subset of genes to classify the data; rather we are looking for the number of genes that incorporates as much data as possible without including so much data that any interesting separations are lost. This is an arbitrary process. However, through careful use of visualizations, classification techniques and many test sets it is possible to make an educated decision about where this cut off should be.

4.3.1 Analysis of selected SNP data

The SVM results for the subsets of the SNP dataset are shown in Table 4.3. This dataset contained 137 patients, and of these 123 (89.8%) are alive and 14 (10.2%) are deceased.

Table 4.3: SVM prediction accuracy of SNP subsets (6-fold cross validation)

Attributes   % Class Alive   % Class Deceased
25           100             71.43
50           99.19           92.86
100          100             100
250          100             100
500          100             92.86
1000         100             92.86
2500         100             64.29
5000         100             21.43

The results of the SVM classification are quite promising. It is clear that 25 attributes is not sufficient to capture the necessary information, and that 2500 and 5000 attributes contain so much irrelevant information that any meaningful separation is lost. A range between 50 and 1000 attributes appears to be the best for a subset of attributes for this dataset.
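Accuracy is reported separately for each class because the classes are highly imbalanced. The sketch below computes per-class accuracy from true and predicted labels; `per_class_accuracy` is a hypothetical helper, not code from this study. It shows that a classifier that always predicts "alive" would reach about 89.8% overall accuracy on the 123/14 split while identifying none of the deceased patients.

```python
from collections import Counter

def per_class_accuracy(true_labels, predicted_labels):
    """Return {class: percentage of that class predicted correctly}."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: 100.0 * correct[label] / totals[label] for label in totals}

# Class sizes of the SNP dataset: 123 alive, 14 deceased.
true_labels = ["alive"] * 123 + ["deceased"] * 14
always_alive = ["alive"] * len(true_labels)  # trivial majority-class predictor

overall = 100.0 * sum(t == p for t, p in zip(true_labels, always_alive)) / len(true_labels)
by_class = per_class_accuracy(true_labels, always_alive)
print(round(overall, 1), by_class)
```

On this dataset each deceased patient accounts for roughly 7.14 percentage points of the deceased-class accuracy, so a value of 92.86% corresponds to 13 of the 14 deceased patients being classified correctly.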

The resulting SVD of the top 25 selected SNPs is shown in Figure 4.9(a). It is clear that there is clustering in this data based on the mortality label, as we expected from the SVM classification. However, it is not a perfect separation. Next, the top 50 attributes (not shown) show a similar situation with only slightly more distance between the two clusters. However, when looking at the top 100 attributes in Figure 4.9(b) we can now see clear separation between the two clusters. This is exactly what we expected to find based upon the SVM results. This becomes even more apparent when looking at the top 250 attributes in Figure 4.9(c). The distance between the two clusters has been maximized in this subset, and so we believe that most of the important attributes for mortality are contained in this subset of the data. In order to fully understand the classification power of this subset of 250 SNPs in separating data based on a mortality label, it is necessary to apply this to a new set of patients and see if the desired result is still obtained. However, due to the lack of available data for ALL patients this was not possible for this study.
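The SVD images place each patient at coordinates along the leading singular directions of the patient-by-attribute matrix. The following is a minimal, standard-library-only sketch of that idea, using power iteration to approximate the leading right singular vectors and then projecting each row onto them. The function names and the toy matrix are invented for illustration, no centering or scaling is applied (whether any was used in the study is not stated in this section), and in practice a library SVD routine would normally be used instead.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def leading_directions(rows, k=3, iterations=200, seed=0):
    """Approximate the top-k right singular vectors of a data matrix.

    rows: list of equal-length lists (patients x attributes).
    Uses power iteration on A^T A, removing directions already found.
    """
    rng = random.Random(seed)
    n_cols = len(rows[0])
    found = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
        for _ in range(iterations):
            for u in found:  # stay orthogonal to earlier directions
                c = dot(v, u)
                v = [x - c * y for x, y in zip(v, u)]
            scores = [dot(row, v) for row in rows]  # A v
            w = [sum(row[j] * s for row, s in zip(rows, scores))
                 for j in range(n_cols)]  # A^T (A v)
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        found.append(v)
    return found

def project(rows, directions):
    """Coordinates of each patient along the given directions."""
    return [[dot(row, d) for d in directions] for row in rows]

# Toy matrix: 4 "patients" x 3 "attributes" (values are made up).
data = [[2.0, 0.1, 0.0],
        [1.9, -0.1, 0.1],
        [-2.0, 0.0, -0.1],
        [-2.1, 0.1, 0.0]]
directions = leading_directions(data, k=2)
coords = project(data, directions)
```

In the toy matrix the first two rows and the last two rows fall on opposite sides of the first coordinate, which is the kind of two-group separation the figures are inspected for.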

[Panels: (a) 25 SNP, (b) 100 SNP, (c) 250 SNP, (d) 500 SNP, (e) 2500 SNP, (f) 5000 SNP]

Figure 4.9: SVD images of selected SNP data. blue = alive, red = deceased.

This dataset becomes interesting when looking at the top 500 attributes in Figure 4.9(d). The data is still clearly separated based on mortality. However, there appears to be another separation forming as the data begins to divide up into four clusters. This suggests that the data is becoming more like the clusters that we saw in the original SNP dataset. This can be seen even more clearly in the top 2500 and 5000 attribute subsets in Figure 4.9(e) and (f). By this time, the data has migrated back to the original clusters, but it is still possible to see separation based on mortality. However, when looking at the results of the SVM it is clear that for these two subsets it has become difficult to classify the deceased patients properly. This suggests that there are too many attributes included in these subsets for the relevant information to be discovered.

[Panels: (a) 250 SNP Treatment, (b) 250 SNP Risk; points in (b) carry risk labels such as Low, Std, Avg, Med and High]

Figure 4.10: SVD images of 250 SNP data. (a) blue = BFM98, red = Study8 (b) blue = alive, red = deceased, labeled by risk category

The most important and unexpected result that came from this analysis was that there appears to be a relationship between the genetics of an individual and their mortality under current treatment regimes. Although it is known that there are genetic factors associated with leukemia, it is not widely believed that there exist genetic factors which distinguish between those who will live and those who will die once they have this disease. It is also surprising that the separation that is seen in the data is clear. There is no spectrum where the risk is spread from low to high; there only appears to be extreme risk and lower risk. When labeling these images with such information as the treatment the patients received, or the risk classification they were given, as seen in Figure 4.10, it is seen that there is no structure to the data. This suggests that what is currently being done to treat leukemia is not enough. Based on these results we believe that this genetic information will provide physicians with another level of understanding which they can use to make better decisions about treatment. The current leukemia treatment regimes do not take into account patient genetics, but rather their clinical manifestations of the disease. What these results suggest is that certain individuals' genetics place them at extreme risk. Therefore, treatment plans for these higher risk individuals need to be tailored to their unique genetics. It is also possible that current treatment plans are insufficient for these genetically different patients.

4.3.2 Analysis of selected cDNA data

The SVM results for the subsets of the cDNA dataset are shown in Table 4.4. This dataset contained 68 patients, and of these 59 (86.8%) are alive and 9 (13.2%) are deceased.

Table 4.4: SVM prediction accuracy of cDNA subsets (6-fold cross validation)

Attributes   % Class Alive   % Class Deceased
25           96.61           77.78
50           100             66.67
100          100             77.78
250          100             77.78
500          100             66.67
1000         100             66.67
2500         100             44.44
5000         100             22.22

The SVM classification results for this dataset were not as good as they were for the SNP dataset. As such, we did not expect to see a clear separation in the data based on the mortality label from the SVD images. Through the exploration of the full cDNA dataset, we discovered that it forms two clusters based on the subtype of the disease. Although this is not what we are looking for, it was interesting to see that the expression level of a patient's genes can be used to distinguish between T-cell and B-cell.

From the first 25, 50 and 100 attribute subsets, shown in Figure 4.11(a),(b) and (c), we can see a general clustering based on the mortality label. When labeling these same images with the subtype label (not shown) there is no clear grouping of these patients. However, when the top 250 attributes are plotted, as seen in Figure 4.11(d), it is quite clear that there are two well formed clusters. These clusters are based on the subtype of the disease and this raises a critical point about this dataset. When looking at the SVM classification results, there is no drop off in accuracy between 100 and 250 attributes even though it is clear visually that they cluster differently.

Upon further investigation it was seen that all but two of the deceased patients in this dataset have T-cell ALL, which can create misleading results when only looking at the classification. If the data separates well based on subtype, and most of the deceased are of one type, then the classification results will suggest that the data separates better than it does. This is why it is important to understand the data being analyzed and to scrutinize every result that is found.

As more attributes are added, the separation based on mortality begins to disappear and eventually even the SVM does not perform well. From these results, we believe that there is important information about mortality contained within the first 100 attributes of this dataset. This again suggests that there is a link between the genetics of a patient and the outcome of their disease and since this observation has been made in two separate datasets our confidence in this conclusion has increased.
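The subtype confounding noted for this dataset can be made concrete with a small calculation. Taking only what is stated in the text (9 deceased patients, all but two of them with T-cell ALL), a rule that looks at nothing but subtype and predicts "deceased" for every T-cell patient would already identify 7 of the 9 deceased patients. The function name is hypothetical, and the alive-class accuracy of such a rule cannot be computed here because the number of surviving T-cell patients is not given in this section.

```python
def deceased_recall_of_subtype_rule(deceased_subtypes, flagged_subtype="T-cell"):
    """Percentage of deceased patients caught by 'predict deceased if subtype matches'."""
    caught = sum(1 for subtype in deceased_subtypes if subtype == flagged_subtype)
    return 100.0 * caught / len(deceased_subtypes)

# From the text: 9 deceased patients in the cDNA dataset, all but two with T-cell ALL.
# "other" stands for the two deceased patients who are not T-cell.
deceased_subtypes = ["T-cell"] * 7 + ["other"] * 2
recall = deceased_recall_of_subtype_rule(deceased_subtypes)
print(round(recall, 2))  # 77.78
```

A deceased-class accuracy of 77.78% is simply 7 of 9, so a figure of that size is not on its own evidence that a classifier has learned anything about mortality beyond subtype.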






[Panels: (a) 25 cDNA, (b) 50 cDNA, (c) 100 cDNA, (d) 250 cDNA]

Figure 4.11: SVD images of selected cDNA data. blue = alive, red = deceased.

4.3.3 Analysis of selected Affy data

The SVM results for the subsets of the Affy dataset are shown in Table 4.5. This dataset contains 144 patients, of which 128 (88.9%) are alive and 16 (11.1%) are deceased.

Table 4.5: SVM prediction accuracy of Affy subsets

Attributes   % Class Alive   % Class Deceased
25           100             0
50           99.92           0
100          93.75           31.25
250          94.53           25
500          96.09           31.25
1000         96.09           31.25
2500         92.19           25
5000         94.53           25

The results of the SVM classification were poor for this dataset. This suggested that the Affy dataset would not provide much structure to the data based on the mortality label. The SVD results from the entire Affy dataset did not show any significant clusters based on the patients' mortality or subtype of the disease. This continues to hold true for the subsets as well. The top 50 and 250 attribute subsets are shown in Figure 4.12 labeled with both mortality and subtype. It is quite clear that there are well-defined clusters of the data based on the subtype of the disease. There are many missing labels for some patients but it is easy to see what their subtype should be labeled as. Although there does appear to be clustering based on the mortality label, we run into the same problem as with the cDNA data. It is the nature of the disease that patients with T-cell ALL are at a higher risk, and it is therefore not surprising that more of the deceased patients have T-cell ALL.



[Panels: (a) 50 Affy Mortality, (b) 50 Affy Subtype, (c) 250 Affy Mortality, (d) 250 Affy Subtype]

Figure 4.12: SVD images of selected Affy data. (a/c) blue = alive, red = deceased. (b/d) blue = B-cell, red = T-cell, green = unknown

4.3.4 Analysis of combined SNP data and cDNA data

The SVM results for the subsets of the combined SNP and cDNA datasets are shown in Table 4.6. This dataset contained 49 patients, of which 42 (85.7%) are alive and 7 (14.3%) are deceased.

Table 4.6: SVM prediction accuracy of combined SNP and cDNA subsets

Attributes   % Class Alive   % Class Deceased   #SNP/#cDNA
25           100             100                10/15
50           100             100                21/29
100          100             100                46/54
250          100             100                114/136
500          100             100                232/268
1000         100             85.71              489/511
2500         100             85.71              1339/1161
5000         100             71.43              2856/2144

It was expected that this dataset would perform well since both of the individual datasets showed the ability to separate the data based on the mortality label. The SVM results for this dataset were impressive with the first five subsets all showing 100% classification of both classes.

The top 25, 500, 1000 and 2500 attribute subsets are shown in Figure 4.13. The separation is clear and it is interesting again to notice that there is no transition of points from a low risk to a higher risk but rather a stark separation of dead and alive. It is also interesting to note that the separation can be seen from 25 attributes all the way to 5000 attributes. Although the SVM prediction is not 100% for the largest 3 datasets, the images all show the same separation. This further goes to show that the minimalist approach to a biological problem such as this is not the most feasible solution. Although there is an excellent separation with only 25 genes, there is an equally good separation with 250 genes or even 500 genes. It is quite possible that there may be a genetic difference between the patients who live and who die based on the combination of many attributes. By keeping as many attributes as possible, more information about the patients is gained and more can be learned from them. Our approach is one of removing the poor attributes as opposed to finding the smallest possible subset of attributes that separate the data.

This result confirms that the combination of datasets can be beneficial if performed properly. The distribution of SNPs and cDNA within each subset in Table 4.6 is roughly equal but with slightly more cDNA entries in each of the subsets that perform well. The random forests algorithm is able to find the most informative attributes for classifying the data based on the label provided. By selecting the top attributes from the combined data we can see that we achieve a new level of separation spanning multiple subsets. Since we are not looking for the smallest possible subset of attributes, we can safely assume that the top 500 attributes are sufficient to separate the data accurately. Validating these results is difficult since there is little available data to test the subsets on. The problem of validation will be further explored in a later section.
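As a minimal sketch of the bookkeeping behind the #SNP/#cDNA column, the following joins two datasets on the patients they share, tags each attribute with its source, and counts how many of the top-ranked attributes come from each source. The patient IDs, attribute names and importance scores are invented for illustration; in the study the ranking comes from random forests, which is not reimplemented here.

```python
from collections import Counter

def combine_on_shared_patients(snp, cdna):
    """Join two {patient_id: {attribute: value}} tables on common patients.

    Attribute names are prefixed with their source so they stay distinguishable.
    """
    shared = sorted(set(snp) & set(cdna))
    combined = {}
    for pid in shared:
        row = {f"SNP:{name}": value for name, value in snp[pid].items()}
        row.update({f"cDNA:{name}": value for name, value in cdna[pid].items()})
        combined[pid] = row
    return combined

def source_counts(importance, top_k):
    """Count attributes per source among the top_k highest-scoring attributes."""
    ranked = sorted(importance, key=importance.get, reverse=True)[:top_k]
    return Counter(name.split(":", 1)[0] for name in ranked)

# Hypothetical toy data: three patients, only two present in both datasets.
snp = {"p1": {"rs1": 0, "rs2": 2}, "p2": {"rs1": 1, "rs2": 1}, "p3": {"rs1": 2, "rs2": 0}}
cdna = {"p1": {"geneA": 0.4, "geneB": -1.2}, "p2": {"geneA": 0.9, "geneB": 0.3}}
combined = combine_on_shared_patients(snp, cdna)

# Hypothetical importance scores standing in for a random forest ranking.
importance = {"SNP:rs1": 0.30, "SNP:rs2": 0.05, "cDNA:geneA": 0.25, "cDNA:geneB": 0.40}
counts = source_counts(importance, top_k=3)
print(sorted(combined), dict(counts))
```

Joining on shared patients is also why a combined dataset has fewer patients than either of its sources, and ranking the pooled attributes lets the split between sources vary from subset to subset rather than being fixed in advance.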

Our conclusion that the genetics of a patient affects their survivability is supported again by this result. We have seen that the SNP and cDNA datasets individually can separate the data based on the mortality label and now the combination of the two datasets provides an even better separation. This is an important finding as it suggests that there is more information that should be available to physicians when they diagnose and treat this disease. By incorporating as much useful information as possible it is believed that the higher risk patients can be identified earlier and treated appropriately, thereby reducing the number of deaths.


[Panels: (a) 25 SNP-cDNA, (b) 500 SNP-cDNA, (c) 1000 SNP-cDNA, (d) 2500 SNP-cDNA]

Figure 4.13: SVD images of selected SNP-cDNA data. blue = alive, red = deceased

4.3.5 Analysis of combined SNP data and Affy data

The SVM results for the subsets of the combined SNP and Affy datasets are shown in Table 4.7. This dataset contained 118 patients, of which 105 (89.0%) are alive and 13 (11.0%) are deceased.

Table 4.7: SVM prediction accuracy of combined SNP and Affy subsets

Attributes   % Class Alive   % Class Deceased   #SNP/#Affy
25           100             0                  11/14
50           100             0                  19/31
100          100             0                  47/53
250          91.43           0                  122/128
500          94.29           30.77              243/257
1000         94.29           7.69               502/498
2500         96.19           15.38              1215/1285
5000         94.29           7.69               2360/2640

Initially it was expected that this combination should behave similarly to the combined SNP and cDNA dataset since the cDNA and Affy measure the same type of information. However, it can be seen from the SVM results that this dataset did not perform well at all. Due to the poor classification, it was not expected that the SVD would show anything interesting. The top 25 and 100 attribute subset images are shown in Figures 4.14(a) and (b). It is quite clear that in these images there is no clear separation of the data. We can therefore conclude that this dataset does not provide any useful information for this study and the investigation of this dataset did not go any further.

[Panels: (a) 25 SNP-Affy, (b) 100 SNP-Affy]

Figure 4.14: SVD images of selected SNP-Affy data. blue = alive, red = deceased.

4.3.6 Analysis of combined cDNA data and Affy data

The SVM results for the subsets of the combined cDNA and Affy datasets are shown in Table 4.8. This dataset contains 55 patients, and of these 47 (85.4%) are alive and 8 (14.6%) are deceased.

Table 4.8: SVM prediction accuracy of combined cDNA and Affy subsets

Attributes   % Class Alive   % Class Deceased   #cDNA/#Affy
25           97.87           87.50              14/11
50           95.74           75                 29/21
100          100             100                49/51
250          97.87           62.50              128/132
500          100             75                 233/267
1000         100             62.50              479/521
2500         100             62.50              1172/1328
5000         100             62.50              2231/2769

It was expected that this dataset would be good at separating the patients based upon their subtype of the disease, since both of the individual datasets displayed an ability to do this. This did appear to be true for subsets which contained a larger number of attributes. However, from the results of the SVM we can see that the smaller datasets perform quite well, especially the 100 attribute subset. This can be seen in Figures 4.15(a) and (b).

This becomes interesting when looking at a subset of 250 attributes, seen in Figure 4.15(d). Clearly, there are two well defined clusters of the data when it is labeled with the subtype of the disease with only a few points which appear to be either mislabeled or misclassified. However, when looking at this same figure labeled with mortality in Figure 4.15(c), we see that the data appears to be somewhat separated within the clusters based upon the mortality label. One problem that exists with this dataset is there is only one deceased subject with B-cell leukemia so it becomes more difficult to see if the separation we see on the T-cell side of the image would hold true for the B-cell side. The larger subsets of the dataset lose this separation based upon mortality and only separate based upon the subtype of the disease, as seen in Figures 4.15(e) and (f). This is the first dataset in which we have any indication that the Affy dataset contains any useful information regarding the mortality of the patients.

[Panels: (a) 25 cDNA-Affy, (b) 100 cDNA-Affy, (c) 250 cDNA-Affy, (d) 250 cDNA-Affy Subtype, (e) 1000 Affy-cDNA Subtype, (f) 2500 Affy-cDNA Subtype]

Figure 4.15: SVD images of selected cDNA-Affy data. (a/b/c) blue = alive, red = deceased. (d/e/f) blue = B-cell, red = T-cell, green = Unknown

4.3.7 Analysis of combined SNP data, cDNA data and Affy data


The SVM results for the subsets of the combined SNP, cDNA and Affy datasets are shown in Table 4.9. This dataset contains 49 patients, and of those 42 (85.7%) are alive and 7 (14.3%) are deceased.

Table 4.9: SVM prediction accuracy of combined SNP, cDNA and Affy subsets

Attributes   % Class Alive   % Class Deceased   #SNP/#cDNA/#Affy
25           100             85.71              9/12/4
50           100             85.71              16/19/15
100          100             91.43              29/39/32
250          97.62           85.71              85/86/79
500          97.62           71.43              178/155/167
1000         100             62.50              368/283/349
2500         97.62           71.43              934/698/868
5000         97.62           71.43              1877/1329/1794

From the results of the SVM we see that there is fairly good classification for most of the smaller subsets. The top 25 and 100 attributes are shown in Figures 4.16(a) and (b). These images look similar to the results of the combined cDNA and Affy data. This suggests that the SNP data does not have a large influence on the data, which is surprising because the SNP data can separate the data well. Beyond the 250-attribute subset, the data begins to be separated based upon subtype label as we have seen in most of the other datasets. It is clear that by incorporating the Affy dataset, the useful information contained in both the SNP and the cDNA datasets is overpowered, resulting in a poorer separation.

[Panels: (a) 25 All Combined, (b) 100 All Combined]

Figure 4.16: SVD images of selected SNP, cDNA and Affy data. blue = alive, red = deceased.

4.4 Attribute Selection

Here we present the results of attribute selection for each of the datasets previously mentioned. We have analyzed these subsets in the previous section, and based on these results we have determined the best subset for each dataset. For each of the following lists, a bold-face entry represents an attribute that was found in multiple lists and the corresponding number in brackets is the number of times that attribute was seen, with a maximum of seven.
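The bracketed counts can be produced by tallying how many of the selected lists each gene appears in. The sketch below assumes each list is reduced to a set of gene symbols, so a gene reached through several probes or SNPs within one list is counted once for that list. This convention, the helper name and the list contents are illustrative assumptions, not taken from the study.

```python
from collections import Counter

def list_membership_counts(selected_lists):
    """Map each gene to the number of lists it appears in (counted once per list)."""
    counts = Counter()
    for genes in selected_lists.values():
        counts.update(set(genes))
    return counts

# Hypothetical miniature lists; the gene symbols are placeholders.
selected_lists = {
    "SNP": ["GENE_A", "GENE_B", "GENE_A"],
    "cDNA": ["GENE_B", "GENE_C"],
    "SNP-cDNA": ["GENE_B", "GENE_A"],
}
counts = list_membership_counts(selected_lists)
repeated = {gene: n for gene, n in counts.items() if n > 1}
print(repeated)
```

Only the genes in `repeated` would be shown in bold with their count; with seven lists in total, the count for any gene can be at most seven.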

4.4.1 Attribute selection on SNP data

Here we present the top 100 SNPs that were selected by the random forest algorithm. These attributes are listed in Table 4.10.

Table 4.10: Top 100 SNP Attributes

SNP         Gene          Description
rs735482    CD3EAP        Epsilon associated protein
rs17511668  N4BP2         NEDD4 binding protein
rs1533594   RTP4          Receptor (chemosensory) transporter protein
rs2808096   ARHGAP12      Rho GTPase activating protein (2)
rs4820853   SEC14L3       SEC14-like 3 (S. cerevisiae)
rs6077510   PLCB4         Phospholipase C, beta 4
rs4726514   Unknown
rs1140380   TMEM208       Transmembrane protein 208
rs12093154  FAM132A       Family with sequence similarity 132, member A
rs1551118   C12orf64      Chromosome 12 open reading frame 64
rs1109278   THA1P         Threonine aldolase 1 pseudogene
rs831510    MRAP          Melanocortin 2 receptor accessory protein
rs2657879   GLS2          Glutaminase 2 (liver, mitochondrial)
rs2491132   SDC3          Syndecan 3
rs5992917   LOC100129113  Hypothetical protein
rs2303063   SPINK5        Serine peptidase inhibitor, Kazal type 5
rs2289642   KIAA0753      Uncharacterized protein
rs4985404   PDPR          Pyruvate dehydrogenase phosphatase
rs2305830   CEP164        Centrosomal protein 164kDa
rs8113704   NFKBID        Nuclear factor of kappa light polypeptide
rs2288681   DIP2C         Disco-interacting protein 2 (2)
rs15702     NSL1          NSL1, MIND kinetochore complex component
rs3743044   USP8          Ubiquitin specific peptidase 8
rs6720173   ABCG5         ATP-binding cassette, sub-family G
rs6687605   LDLRAP1       Low density lipoprotein receptor adaptor
rs4642516   Unknown
rs11191274  GBF1          Golgi-specific brefeldin A resistant exchange factor 1
rs289723    NLRC5         NLR family, CARD domain containing 5
rs2306242   GAK           Cyclin G associated kinase
rs11543349  OGFR          Opioid growth factor receptor
rs10137972  SYNE2         Spectrin repeat containing, nuclear envelope 2
rs1265100   PSORS1C2      Psoriasis susceptibility 1 candidate 2
rs584367    PLA2G2D       Phospholipase A2, group IID
rs260462    ZNF544        Zinc finger protein 544
rs848210    SPEN          Spen homolog, transcriptional regulator 2
rs17080284  UQCRC1        Ubiquinol-cytochrome c reductase core protein I
rs2072770   RIBC2         RIB43A domain with coiled-coils 2
rs970547    COL12A1       Collagen, type XII, alpha 1
rs3181320   CASP5         Caspase 5, apoptosis-related cysteine peptidase
rs2178004   MGA           MAX gene associated
rs10408676  NOTCH3        Notch homolog 3 (Drosophila)
rs2043449   CYP20A1       Cytochrome P450, family 20, subfamily A, polypeptide 1
rs291102    PIGR          Polymeric immunoglobulin receptor
rs4865615   SLC38A9       Solute carrier family 38, member 9
rs2108622   CYP4F2        Cytochrome P450, family 4, subfamily F, polypeptide 2
rs17165906  VWDE          von Willebrand factor D and EGF domains
rs1042023   APOB          Apolipoprotein B (including Ag(x) antigen)
rs2295778   HIF1AN        Hypoxia inducible factor 1, alpha subunit inhibitor
rs35018800  TYK2          Tyrosine kinase 2
rs344140    SHROOM3       Shroom family member 3
rs1126642   GFAP          Glial fibrillary acidic protein
rs9928053   ACSM5         Acyl-CoA synthetase medium-chain family member 5
rs13058467  TTLL12        Tubulin tyrosine ligase-like family, member 12
rs4253301   KLKB1         Kallikrein-related peptidase 3
rs1050239   SMPD1         Sphingomyelin phosphodiesterase 1, acid lysosomal
rs4870      TNFRSF14      Tumor necrosis factor receptor superfamily, member 14
rs16971436  ZFHX3         Zinc finger homeobox 3
rs12625565  LPIN3         Lipin 3
rs848209    SPEN          Spen homolog, transcriptional regulator
rs1105879   UGT1A9        UDP glucuronosyltransferase 1 family, polypeptide A9
rs4073918   SLC6A18       Solute carrier family 6, member 18
rs854777    MYO15A        Myosin XVA
rs2297595   DPYD          Dihydropyrimidine dehydrogenase
rs2248490   WDR4          WD repeat domain 4
rs4987310   SELL          Selectin L
rs966384    LRG1          Leucine-rich alpha-2-glycoprotein 1
rs5745325   MSH4          MutL homolog 1, colon cancer, nonpolyposis type 2 (E. coli)
rs2281929   ZBTB46        Zinc finger and BTB domain containing 46
rs609320    RHCE          Rh blood group, CcEe antigens
rs500049    OBSCN         Obscurin, calmodulin and titin-interacting RhoGEF (2)
rs854800    MYO15A        Myosin XVA
rs597371    VWA2          von Willebrand factor A domain containing 2
rs6052      FGA           Fibrinogen alpha chain
rs2736155   BAT2          HLA-B associated transcript 2
rs3796318   FBLN2         Fibulin 2
rs6586179   LIPA          Lipase A, lysosomal acid, cholesterol esterase
rs3750904   SCN9A         Sodium channel, voltage-gated, type IX, alpha subunit
rs3848519   FECH          Ferrochelatase (protoporphyria)
rs6094752   NCOA3         Nuclear receptor coactivator 3
rs3800939   FBXL13        F-box and leucine-rich repeat protein 13
rs2240040   ZNF749        Zinc finger protein 749
rs2244492   TTN           Titin
rs292575    WDR91         WD repeat domain 91 (2)
rs3735035   PODXL         Podocalyxin-like
rs2234962   BAG3          BCL2-associated athanogene 3
rs4299811   Unknown
rs9550987   TNFRSF19      Tumor necrosis factor receptor superfamily, member 19
rs292592    WDR91         WD repeat domain 91
rs7259845   ZNF844        Zinc finger protein 844
rs474534    STK19         Serine/threonine kinase 19
rs11254408  TRDMT1        tRNA aspartic acid methyltransferase 1
rs435549    Unknown
rs4761944   Unknown
rs1052500   C2orf76       Chromosome 2 open reading frame 76
rs9423502   PITRM1        Pitrilysin metallopeptidase 1
rs1035442   MUC16         Mucin 16, cell surface associated
rs7995033   MTMR6         Myotubularin related protein 6
rs2305612   COLQ          Collagen-like tail subunit of asymmetric acetylcholinesterase
rs4407724   SYNE1         Spectrin repeat containing, nuclear envelope 1
rs6659553   POMGNT1       Protein O-linked mannose beta1,2-N-acetylglucosaminyltransferase

4.4.2 Attribute selection on cDNA data

Here we present the top 100 cDNA genes that were selected by the random forest algorithm. These top 100 attributes are listed below in Table 4.11.

Table 4.11: Top 100 cDNA Attributes

Gene       Description
WDR77      WD repeat domain 77 (4)
TRIM37     Tripartite motif-containing 37 (3)
PSMC4      Proteasome 26S (2)
CTNNA1     Catenin (cadherin-associated protein) (4)
HIST1H2AM  Histone cluster 1, H2am (2)
HIST1H2AL  Histone cluster 1, H2al (4)
Unknown
SLCO2A1    Solute carrier organic anion transporter (2)
LCP1       Lymphocyte cytosolic protein 1
MYH10      Myosin, heavy chain 10, non-muscle
PWP1       PWP1 homolog (S. cerevisiae) (3)
PVR        Poliovirus receptor (4)
PRG1       p53-responsive gene 1
ROD1       ROD1 regulator of differentiation 1 (3)
FTO        Fat mass and obesity associated (4)
FKBP5      FK506 binding protein 5 (4)
ZFHX1B     Zinc finger E-box binding homeobox 2
GNAS       Guanine nucleotide binding protein (3)
SIL        Endoplasmic reticulum chaperone (3)
FMO1       Flavin containing monooxygenase 1 (2)
Unknown
CLPP       ClpP caseinolytic peptidase (3)
KIF21A     Kinesin family member 21A (4)
PSME1      Proteasome activator subunit 1 (3)
Unknown
RPS4X      Ribosomal protein S4, X-linked
IFI44L     Interferon-induced protein 44-like
BNIP1      BCL2/adenovirus E1B 19kDa interacting protein 1 (3)
PSIP1      PC4 and SFRS1 interacting protein 1 (2)
HLADMA     Major histocompatibility complex, class II, DM alpha (4)
SMNDC1     Survival motor neuron domain containing 1 (4)
CEB1       Hect domain and RLD 5
MYL6       Myosin, light chain 6, alkali, smooth muscle and non-muscle
FKBP2      FK506 binding protein 2, 13kDa
FXR1       Fragile X mental retardation, autosomal homolog 1
MT1F       Metallothionein 1F
QDPR       Quinoid dihydropteridine reductase (2)
DNAJC4     DnaJ (Hsp40) homolog, subfamily C, member 4
PCM1       Pericentriolar material 1 (4)
METAP2     Methionyl aminopeptidase 2 (2)
KAT2B      K(lysine) acetyltransferase 2B (2)
HIST1H2AE  Histone cluster 1, H2ae (2)
DUSP12     Dual specificity phosphatase 12
G3BP       GTPase activating protein binding protein 1 (2)
GSPT1      G1 to S phase transition 1 (3)
IFIT2      Interferon-induced protein tetratricopeptide repeats 2 (2)
PML        Promyelocytic leukemia (2)
GNA11      Guanine nucleotide binding protein alpha 11
GNB2L1     Guanine nucleotide binding protein beta polypeptide 2-like 1
Unknown
NPY        Neuropeptide Y
AKAP13     A kinase (PRKA) anchor protein 13 (2)
MFAP4      Microfibrillar-associated protein 4
U1SNRNPBP  U11/U12 snRNP 35K (2)
RBPMS      RNA binding protein with multiple splicing
GABRG2     Gamma-aminobutyric acid A receptor, gamma 2 (3)
Unknown
KLK4       Kallikrein-related peptidase 4
RCP9       Calcitonin gene-related peptide-receptor
CD24       CD24 molecule
CASQ1      Calsequestrin 1
CKS2       CDC28 protein kinase regulatory subunit 2
GMFG       Glia maturation factor, gamma
MPP2       Membrane protein, palmitoylated 2 (MAGUK p55 subfamily member 2)
RAB35      RAB35, member RAS oncogene family (2)
KIAA1045   KIAA1045 (2)
DDX19A     DEAD (Asp-Glu-Ala-Asp) box polypeptide 19A
POU3F4     POU class 3 homeobox 4
EBI3       Epstein-Barr virus induced gene 3
ZNF294     Zinc finger protein 294 (2)
SNX22      Sorting nexin 22
MYOZ3      Myozenin 3
NGDN       Neuroguidin, EIF4E binding protein
THBS1      Thrombospondin 1 (4)
AQR        Aquarius homolog (mouse) (2)
ING1L      Inhibitor of growth family, member 2
BTK        Bruton agammaglobulinemia tyrosine kinase (3)
Unknown
Unknown
RARG       Retinoic acid receptor, gamma (2)
SIRT6      Sirtuin 6 (S. cerevisiae)
CNTNAP1    Contactin associated protein 1 (2)
NFS1       NFS1 nitrogen fixation 1 homolog (S. cerevisiae) (2)
NRL        Neural retina leucine zipper gene
DSCR1      Regulator of calcineurin 1
DTX4       Deltex 4 homolog (Drosophila) (4)
ZNF142     Zinc finger protein 142
BAT8       Euchromatic histone-lysine N-methyltransferase 2
ELAVL1     ELAV-like 1 (Hu antigen R) (2)
CA2        Carbonic anhydrase II
DLG1       Discs, large homolog 1 (Drosophila)
BRSK2      BR serine/threonine kinase 2
TP53I3     Tumor protein p53 inducible protein 3
PPP1R10    Protein phosphatase 1, regulatory (inhibitor) subunit 10
PLK1       Polo-like kinase 1 (Drosophila)
Unknown
SLC2A8     Solute carrier family 2 (facilitated glucose transporter), member 8
Unknown
IGF2BP2    Insulin-like growth factor 2 mRNA binding protein 2 (2)
IFI6       Interferon, alpha-inducible protein 6
Unknown

4.4.3 Attribute selection on Affy data

Although the analysis of the Affy subsets did not provide much useful information, we still believe that there are some useful attributes in this dataset for separating the data based on the mortality label. In Table 4.12 we present the top 100 attributes from the dataset.

Table 4.12: Top 100 Affy Attributes

Affy ID Gene Description

210249 s at NCOA1 Nuclear receptor coactivator 1 (3) 204689 at HHEX Hematopoietically expressed homeobox 207805 s at PSMD9 Proteasome 26S subunit 209644 x at CDKN2A Cyclin-dependent kinase inhibitor 2A 221569 at AHI1 Abelson helper integration site 1 220068 at VPREB3 Pre-B lymphocyte gene 3 200026 at RPL34 Ribosomal protein L34 217728 at S100A6 S100 calcium binding protein A6 (2) 207426 s at TNFSF4 Tumor necrosis factor superfamily, member 4 (2) 205548 s at BTG3 BTG family, member 3 212812 at Unknown 217373 x at MDM2 Mdm2 p53 binding protein homolog (mouse) 200855 at Unknown 213056 at FRMD4B FERM domain containing 4B (2) 206995 x at SCARF1 Scavenger receptor class F, member 1 (3) 214003 x at RPS20 Ribosomal protein S20 209995 s at TCL1A T-cell leukemia/lymphoma 1A (3) 212423 at ZCCHC24 Zinc finger, CCHC domain containing 24

Continued on Next Page. . . CHAPTER 4. RESULTS 69

Table 4.12 – Continued

Affy ID Gene Description 209808 x at ING1 Inhibitor of growth family, member 1 (3) 200032 s at RPL9 Ribosomal protein L9 202695 s at STK17A Serine/threonine kinase 17a 205726 at DIAPH2 Diaphanous homolog 2 (Drosophila) 204075 s at KIAA0562 Uncharacterized protein 217820 s at ENAH Enabled homolog (Drosophila) 200062 s at RPL30 Ribosomal protein L30 218820 at Unknown 203577 at GTF2H4 General transcription factor IIH, polypeptide 4, 204218 at C11orf51 Chromosome 11 open reading frame 51 203233 at IL4R Interleukin 4 receptor 203616 at POLB Polymerase (DNA directed), beta 200660 at S100A11 S100 calcium binding protein (3) 208438 s at FGR Gardner-Rasheed feline sarcoma viral oncogene (2) 215017 s at Unknown 222146 s at TCF4 Transcription factor 4 (5) 212810 s at SLC1A4 Solute carrier family 1 217939 s at AFTPH Aftiphilin (2) 209152 s at TCF3 Transcription factor 3 218281 at MRPL48 Mitochondrial ribosomal protein (2) 221543 s at ERLIN2 ER lipid raft associated 2 212324 s at VPS13D Vacuolar protein sorting 13 39318 at TCL1A T-cell leukemia/lymphoma 1A (3) 201094 at RPS29 Ribosomal protein S29 208690 s at PDLIM1 PDZ and LIM domain 1 212587 s at PTPRC Protein tyrosine phosphatase, receptor type, C (4) 206752 s at DFFB DNA fragmentation factor 207416 s at NFATC3 Nuclear factor of activated T-cells calcineurin-dependent 3 (4) 215000 s at FEZ2 Fasciculation and elongation protein zeta 2 213746 s at Unknown 203688 at PKD2 Polycystic kidney disease 2 205786 s at ITGAM Integrin, alpha M 217168 s at HERPUD1 Homocysteine and ER stress-inducible, ubiquitin-like domain member 1 218847 at IGF2BP2 Insulin-like growth factor 2 (2) 203414 at MMD Monocyte to macrophage differentiation 209107 x at NCOA1 Nuclear receptor coactivator 1 (3) 218380 at NLRP1 NLR family, pyrin domain containing 1 208645 s at Unknown 212386 at TCF4 Transcription factor 4 (5) 211991 s at HLADPA1 Major histocompatibility complex 217712 at Unknown 202016 at MEST Mesoderm specific transcript homolog (mouse) 204061 at 
PRKX Protein kinase, X-linked 212436 at TRIM33 Tripartite motif-containing 33


Affy ID Gene Description 212588 at PTPRC Protein tyrosine phosphatase, receptor type, C (4) 215411 s at Unknown 210555 s at NFATC3 Nuclear factor of activated T-cells calcineurin-dependent 3 (4) 217542 at MGC5370 Hypothetical protein MGC5370 (2) 201461 s at MGC5370 Hypothetical protein MGC5370 (2) 209035 at MDK Midkine 200951 s at CCND2 Cyclin D2 200025 s at RPL27 Ribosomal protein L27 220960 x at RPL22 Ribosomal protein L22 60471 at RIN3 Ras and Rab interactor 3 213434 at STX2 Syntaxin 2 201739 at SGK Serum/glucocorticoid regulated kinase 1 203753 at TCF4 Transcription factor 4 (5) 203279 at EDEM1 ER degradation enhancer 203434 s at MME Membrane metallo-endopeptidase (2) 201254 x at RPS6 Ribosomal protein S6 218434 s at AACS Acetoacetyl-CoA synthetase 206542 s at SMARCA2 SWI/SNF related, matrix associated, dependent regulator of chromatin 213891 s at TCF4 Transcription factor 4 (5) 208720 s at RBM39 RNA binding motif protein 39 (3) 200602 at APP Amyloid beta (A4) precursor protein 203435 s at MME Membrane metallo-endopeptidase (2) 206656 s at Unknown 212332 at RBL2 Retinoblastoma-like 2 217979 at TSPAN13 Tetraspanin 13 (2) 204552 at Unknown 210776 x at EST63624 Jurkat T-cells V Homo sapiens cDNA 5- end, mRNA sequence 210676 x at RGPD5 RANBP2-like and GRIP domain containing 5 208894 at HLADRA Major histocompatibility complex, class II, DR alpha (2) 204866 at PHF16 PHD finger protein 16 54037 at HPS4 Hermansky-Pudlak syndrome 4 217707 x at Unknown 201373 at PLEC1 Plectin 1 210982 s at HLADRA Major histocompatibility complex, class II, DR alpha (2) 209927 s at C1orf77 Chromosome 1 open reading frame 77 212480 at SPECC1L SPECC1-like KIAA0376 209269 s at Unknown 221865 at C9orf91 Chromosome 9 open reading frame 91 CHAPTER 4. RESULTS 71

4.4.4 Attribute selection on SNP data and cDNA data

In Table 4.13 we list the top 100 attributes for these data.

Table 4.13: Top 100 SNP-cDNA Attributes

SNP Gene Description

SMNDC1 Survival motor neuron (4) CLPP ClpP caseinolytic peptidase (3) WDR77 WD repeat domain 77 (4) rs16972193 SPTBN5 Spectrin, beta, non-erythrocytic 5 (2) rs3842787 PTGS1 Prostaglandin-endoperoxide synthase 1 (2) rs1132780 CAMKK2 Calcium/calmodulin-dependent protein kinase Unknown PVR Poliovirus receptor (4) GABRG2 GABA A receptor, gamma 2 (3) rs11153174 Unknown PROZ Protein Z, vitamin K-dependent plasma glycoprotein NOL7 Nucleolar protein 7, 27kDa KAT2B K(lysine) acetyltransferase 2B (2) CTNNA1 Catenin (cadherin-associated protein), alpha 1 (4) rs2277125 Unknown rs35760989 Unknown PWP1 PWP1 homolog (S. cerevisiae) (3) SIL Endoplasmic reticulum chaperone (3) DTX4 Deltex 4 homolog (Drosophila) (4) rs35835241 TBCKL Unknown rs869801 DOCK1 Dedicator of cytokinesis 1 rs4969258 Unknown TPMT Thiopurine S-methyltransferase (2) rs6020 F5 Coagulation factor V rs2276805 AADACL1 Arylacetamide deacetylase-like 1 BCAS2 Breast carcinoma amplified sequence 2 rs2511241 P2RY2 Purinergic receptor P2Y, G-protein coupled, 2 rs2270384 SLC7A4 Solute carrier family 7 member 4 GSPT1 G1 to S phase transition 1 (3) HLADMA Major histocompatibility complex (4) HTR6 5-hydroxytryptamine (serotonin) receptor 6 rs4969259 Unknown RSN CAPGLY domain containing linker protein 1 rs3751315 FBRSL1 Fibrosin-like 1 GNAS GNAS complex locus (3) RCAN1 Regulator of calcineurin 1


SNP Gene Description rs4987262 PTGIR Prostaglandin I2 (prostacyclin) receptor (IP) Unknown rs7578597 THADA Thyroid adenoma associated (2) Unknown rs1966265 FGFR4 Fibroblast growth factor receptor 4 NFS1 NFS1 nitrogen fixation 1 homolog (S. cerevisiae) (2) NPTX1 Neuronal pentraxin I rs7338333 ING1 Inhibitor of growth family, member 1 (3) rs248248 C5orf45 Chromosome 5 open reading frame 45 rs8027765 AEN Apoptosis enhancing nuclease CEP290 centrosomal protein 290kDa (2) BTK Bruton agammaglobulinemia tyrosine kinase (3) PPP1CA Protein phosphatase 1, catalytic subunit, alpha (2) rs1999663 C20orf114 Chromosome 20 open reading frame 114 (2) THBS1 Thrombospondin 1 (4) rs1137078 HLAA29.1 Major histocompatibility complex class I (2) rs3730947 LIG1 Ligase I, DNA, ATP-dependent rs1065761 CHIT1 Chitinase 1 (chitotriosidase) (2) SLC39A7 Solute carrier family 39 (zinc transporter), member 7 rs1122326 HSPB9 Heat shock protein, alpha-crystallin-related, B9 (2) rs2427536 SLC2A4RG SLC2A4 regulator (2) rs753381 PLCG1 Phospholipase C, gamma 1 rs8179070 PLIN Perilipin KIF21A Kinesin family member 21A (4) RARG Retinoic acid receptor, gamma (2) BCKDHB Branched chain keto acid dehydrogenase E1 CDR1 cerebellar degeneration-related protein 1 (2) SHFM1 Split hand/foot malformation (ectrodactyly) type 1 HIST1H2AL Histone cluster 1, H2al (4) FTO Fat mass and obesity associated (4) rs3179969 SPATA7 Spermatogenesis associated 7 rs11240604 ZC3H11A Zinc finger CCCH-type containing 11A FKBP5 FK506 binding protein 5 (4) rs2523720 TRIM26 Tripartite motif-containing 26 (2) rs13110318 TBC1D1 TBC1 domain family, member 1 AQP9 Aquaporin 9 (3) PCM1 Pericentriolar material 1 (4) BNIP1 BCL2/adenovirus E1B 19kDa interacting protein 1 (3) Unknown rs16833032 NID1 Nidogen 1 rs3803414 MEGF11 Multiple EGF-like-domains 11 rs3747243 Unknown Unknown ZAK Sterile alpha motif and leucine zipper


SNP Gene Description rs3741554 KIAA1602 Uncharacterized protein MAD2L2 MAD2 mitotic arrest deficient-like 2 (yeast) Unknown TIMP2 TIMP metallopeptidase inhibitor 2 MAPK8IP3 Mitogen-activated protein kinase 8 (2) QDPR Quinoid dihydropteridine reductase (2) MB Myoglobin (3) rs12090611 MEGF6 Multiple EGF-like-domains 6 rs11164066 Unknown rs2569491 KLK14 Kallikrein-related peptidase 14 rs3803641 KCNG4 Potassium voltage-gated channel, subfamily G, member 4 rs1143684 NQO2 NAD(P)H dehydrogenase, quinone 2 rs3747532 CER1 Chromosome 3 common eliminated region 1 rs3777721 RNASET2 Ribonuclease T2 ALAD Aminolevulinate, delta-, dehydratase (2) UPK1B Uroplakin 1B PSME1 Proteasome activator subunit 1 (3) HNRPU Heterogeneous nuclear ribonucleoprotein U rs11264581 PEAR1 Platelet endothelial aggregation receptor 1

4.4.5 Attribute selection on SNP data and Affy data

This dataset provided poor separation of the data. However, we still believe the top attributes from each dataset will contain important information for separating based upon mortality. In Table 4.14 the top 100 attributes are listed.

Table 4.14: Top 100 SNP-Affy Attributes

SNP Affy ID Gene Description

201425 at ALDH2 Aldehyde dehydrogenase 2 family rs1021580 CDC20B Cell division cycle 20 homolog B 221573 at C7orf25 Chromosome 7 open reading frame 25 204179 at MB Myoglobin (3) rs1051484 PREP Prolyl endopeptidase rs12932514 Unknown


SNP Affy ID Gene Description rs11016071 MKI67 Antigen identified by monoclonal antibody Ki-67 rs7201721 Unknown 201612 at ALDH9A1 Aldehyde dehydrogenase 9 family, member A1 rs8106130 OSCAR Osteoclast associated, immunoglobulin-like receptor rs4782591 TAF1C TATA box binding protein RNA polymerase I, C rs3794153 ST5 Suppression of tumorigenicity 5 201424 s at CUL4A Cullin 4A 212591 at RBM34 RNA binding motif protein 34 204079 at TPST2 Tyrosylprotein sulfotransferase 2 rs1611149 Unknown 218285 s at BDH2 3-hydroxybutyrate dehydrogenase, type 2 216981 x at SPN Sialophorin rs35187177 SGK2 Serum/glucocorticoid regulated kinase 2 215971 at Unknown 205552 s at OAS1 2’,5’-oligoadenylate synthetase 1 220498 at ACTL7B Actin-like 7B 218851 s at WDR33 WD repeat domain 33 rs2307289 MBD4 Methyl-CpG binding domain protein 4 212816 s at CBS Cystathionine-beta-synthase 200030 s at SLC25A3 Solute carrier family 25 218976 at DNAJC12 DnaJ (Hsp40) homolog, subfamily C, member 12 217732 s at ITM2B Integral membrane protein 2B 202146 at IFRD1 Interferon-related developmental regulator 1 204693 at CDC42EP1 CDC42 effector protein 203283 s at HS2ST1 Heparan sulfate 2-O-sulfotransferase 1 rs9500989 Unknown 207432 at BEST2 Bestrophin 2 55093 at CSGlcAT Chondroitin sulfate glucuronyltransferase 212526 at SPG20 Spastic paraplegia 20 209931 s at FKBP1B FK506 binding protein 1B rs7732300 Unknown rs940871 DKFZp547K054 Hypothetical protein 217716 s at SEC61A1 Sec61 alpha 1 subunit 209794 at SRGAP3 SLIT-ROBO Rho GTPase activating protein 3 208442 s at ATM Ataxia telangiectasia mutated (2) 208697 s at EIF3E Eukaryotic translation initiation factor 3 rs4371530 NAALADL2 N-acetylated alpha-linked acidic dipeptidase-like 2 222150 s at tcag7.1314 Hypothetical protein 215004 s at SF4 Splicing factor 4 rs4861066 NSUN7 NOL1/NOP2/Sun domain family, member 7 rs2297270 DSCAM Down syndrome cell adhesion molecule rs2239808 KCTD20 Potassium channel tetramerisation domain containing 20 rs17784583 DEF8 Differentially 
expressed in FDCP 8 (3) 202778 s at ZMYM2 Zinc finger, MYM-type 2


SNP Affy ID Gene Description rs2274670 FAM113A Family with sequence similarity 113, member A rs2734971 Unknown rs500049 OBSCN Obscurin calmodulin and titin-interacting RhoGEF (2) rs3918232 NOS3 Nitric oxide synthase 3 (endothelial cell) 200878 at EPAS1 Endothelial PAS domain protein 1 220072 at CSPP1 Centrosome and spindle pole associated protein 1 rs4304840 CLEC4D C-type lectin domain family 4, member D 201465 s at JUN Jun oncogene 219326 s at B3GNT2 UDP-GlcNAc:betaGal beta-1,3-N-acetylglucosaminyltransferase 2 rs1281013 C1orf127 Chromosome 1 open reading frame 127 207420 at COLEC10 Collectin sub-family member 10 207632 at MUSK Muscle, skeletal, receptor tyrosine kinase 202020 s at LANCL1 LanC lantibiotic synthetase component C-like 1 201611 s at ICMT Isoprenylcysteine carboxyl methyltransferase rs2243620 Unknown rs2844759 Unknown rs567083 MAK Male germ cell-associated kinase rs4785751 DEF8 Differentially expressed in FDCP 8 (3) 213895 at EMP1 Epithelial membrane protein 1 209131 s at SNAP23 Synaptosomal-associated protein 204048 s at PHACTR2 Phosphatase and actin regulator 2 rs2282284 FCRL3 Fc receptor-like 3 216945 x at PASK PAS domain containing serine/threonine kinase rs1042303 GPLD1 Glycosylphosphatidylinositol specific phospholipase D1 209397 at ME2 Malic enzyme 2 217976 s at DYNC1LI1 Dynein, cytoplasmic 1, light intermediate chain 1 rs3820071 RP11265F14.2 Elastase 2B 207811 at KRT12 Keratin 12 203466 at MPV17 MpV17 mitochondrial inner membrane protein 207430 s at MSMB Microseminoprotein rs2568023 C11orf16 Chromosome 11 open reading frame 16 rs11543211 PSMC5 Proteasome (prosome, macropain) 26S subunit, ATPase, 5 rs12199003 GFRAL GDNF family receptor alpha like rs4977196 KIAA1875 Protein similar to KIAA1875 rs3934462 ARL13A ADP-ribosylation factor-like 13A rs5744463 CD180 CD180 molecule 212525 s at Unknown rs2050189 C6orf10 Chromosome 6 open reading frame 10 rs10843438 OVCH1 Ovochymase 1 208708 x at EIF5 Eukaryotic translation initiation factor 5 rs10163657 
LOXHD1 Lipoxygenase homology domains 1 rs3820011 KIAA1751 Similar to KIAA1751 protein rs1886544 NEK5 NIMA (never in mitosis gene a)-related kinase 5 204933 s at TNFRSF11B Tumor necrosis factor receptor superfamily, member 11b


SNP Affy ID Gene Description rs2020860 FMO2 Flavin containing monooxygenase 2 (non-functional) 218137 s at SMAP1 Stromal membrane-associated GTPase-activating protein 1 rs910397 PXMP4 Peroxisomal membrane protein 4 rs4785766 GAS8 Growth arrest-specific 8 203418 at CCNA2 Cyclin A2 rs3873374 Unknown

4.4.6 Attribute selection on cDNA data and Affy data

Table 4.15 lists the top 100 attributes for this dataset.

Table 4.15: Top 100 Affy-cDNA Attributes

Affy ID Gene Description

WDR46 WD repeat domain 46 (2) SMNDC1 Survival motor neuron (4) Unknown DTX4 Deltex 4 homolog (Drosophila) (4) TRIM37 Tripartite motif-containing 37 (3) FKBP5 FK506 binding protein 5 (4) CDR1 Cerebellar degeneration-related protein (2) SNAPC1 Small nuclear RNA activating complex 200660 at S100A11 S100 calcium binding protein A11 (3) 201089 at ATP6V1B2 ATPase, H+ transporting 210555 s at NFATC3 Nuclear factor of activated T-cells (4) 201990 s at CREBL2 cAMP responsive element binding protein (2) 222146 s at TCF4 Transcription factor 4 (5) PPP1CA Protein phosphatase 1, catalytic subunit, alpha isoform (2) PSMC4 Proteasome (prosome, macropain) 26S subunit, ATPase, 4 (2) CTNNA1 Catenin (cadherin-associated protein), alpha 1 (4) 201105 at LGALS1 Lectin, galactoside-binding, soluble, 1 (2) Unknown IFIT2 Interferon-induced protein with tetratricopeptide repeats 2 (2) 209239 at NFKB1 Nuclear factor of kappa light polypeptide gene enhancer in B-cells 1 212587 s at PTPRC Protein tyrosine phosphatase, receptor type, C (4) THBS1 Thrombospondin 1 (4) 208620 at PCBP1 Poly(rC) binding protein 1 (2)


Affy ID Gene Description 221475 s at RPL15 Ribosomal protein L15 208720 s at RBM39 RNA binding motif protein 39 (3) RAB35 Member RAS oncogene family (2) 212423 at C10orf56 Chromosome 10 open reading frame 56 (2) HLA-DMA Major histocompatibility complex, class II, DM alpha (4) PVR Poliovirus receptor (4) KIF21A Kinesin family member 21A (4) 206050 s at RNH1 Ribonuclease/angiogenin inhibitor 1 GNAS GNAS complex locus (3) 204011 at SPRY2 Sprouty homolog 2 ELAVL1 ELAV (embryonic lethal, abnormal vision, Drosophila)-like 1 (2) 221269 s at SH3BGRL3 SH3 domain binding glutamic acid-rich protein like 3 PWP1 PWP1 homolog (S. cerevisiae) (3) 215621 s at IGHG1 Immunoglobulin heavy constant mu MAPK8IP3 Mitogen-activated protein kinase 8 interacting protein 3 (2) 211672 s at ARPC4 Actin related protein 2/3 complex, subunit 4 KIAA1045 Hypothetical protein (2) PSIP1 PC4 and SFRS1 interacting protein 1 (2) 206656 s at Unknown 213056 at FRMD4B FERM domain containing 4B (2) G3BP GTPase activating protein (SH3 domain) binding protein 1 (2) 216652 s at DR1 Down-regulator of transcription 1 HIST1H2AM Histone cluster 1, H2am (2) 212387 at TCF4 Transcription factor 4 (5) FTO Fat mass and obesity associated (4) CSF2RB Colony stimulating factor 2 receptor, beta Unknown Unknown GTL3 Gene trap locus 3 201012 at ANXA1 Annexin A1 203165 s at SLC33A1 Solute carrier family 33 (acetyl-CoA transporter), member 1 209653 at KPNA4 Karyopherin alpha 4 ROD1 ROD1 regulator of differentiation 1 (3) 208438 s at FGR Gardner-Rasheed feline sarcoma viral (v-fgr) oncogene (2) FMO1 Flavin containing monooxygenase 1 (2) HIST1H2AL Histone cluster 1, H2al (4) 213891 s at TCF4 Transcription factor 4 (5) PSG3 Pregnancy specific beta-1-glycoprotein 3 BNIP1 BCL2/adenovirus E1B 19kDa interacting protein 1 (3) 209864 at FRAT2 Frequently rearranged in advanced T-cell lymphomas 2 CNTNAP1 Contactin associated protein 1 (2) 202833 s at SERPINA1 Serpin peptidase inhibitor, clade A 217728 at S100A6 S100 calcium binding 
protein A6 (2) AQP9 Aquaporin 9 (3)


Affy ID Gene Description PCM1 Pericentriolar material 1 (4) 200872 at S100A10 S100 calcium binding protein A10 EID1 EP300 interacting inhibitor of differentiation 1 (2) 207654 x at DR1 Down-regulator of transcription 1 221497 x at EGLN1 Egl nine homolog 1 (C. elegans) 201932 at LRRC41 Leucine rich repeat containing 41 GABRG2 Gamma-aminobutyric acid (GABA) A receptor, gamma 2 (3) ZNF294 Zinc finger protein 294 (2) 214687 x at ALDOA Aldolase A 202741 at PRKACB Protein kinase, cAMP-dependent, catalytic, beta 218987 at ATF7IP Activating transcription factor 7 interacting protein (2) 203568 s at TRIM38 Tripartite motif-containing 38 217939 s at AFTPH Aftiphilin (2) TSC22 TSC22 domain family, member 3 GSPT1 G1 to S phase transition 1 (3) 202086 at MX1 Myxovirus (influenza virus) resistance 1 CEP290 Centrosomal protein 290kDa (2) 218281 at MRPL48 Mitochondrial ribosomal protein L48 (2) 220046 s at CCNL1 Cyclin L1 210249 s at NCOA1 Nuclear receptor coactivator (3) 219451 at MSRB2 Methionine sulfoxide reductase 209648 x at SOCS5 Suppressor of cytokine signaling 5 U1SNRNPBP U11/U12 snRNP (2) 201421 s at WDR77 WD repeat domain 77 (4) 212504 at DIP2C Disco-interacting protein 2 homolog C (Drosophila) (2) 210202 s at BIN1 Bridging integrator 1 (2) ALAD Aminolevulinate, delta-, dehydratase (2) 202081 at IER2 Immediate early response 2 207426 s at TNFSF4 Tumor necrosis factor (ligand) superfamily, member 4 (2) 208686 s at BRD2 Bromodomain containing 2 206995 x at SCARF1 Scavenger receptor class F, member 1 (3) SLCO2A1 Solute carrier organic anion transporter family, member 2A1 (2) CP Ceruloplasmin (ferroxidase)

4.4.7 Attribute selection on SNP data, cDNA data and Affy data


The top 100 attributes for this dataset are listed in Table 4.16.

Table 4.16: Top 100 SNP-cDNA-Affy Attributes

SNP Affy ID Gene Description

rs7578597 THADA Thyroid adenoma associated (2) rs7729440 Unknown CTNNA1 Catenin (cadherin-associated protein) (4) rs2427536 SLC2A4RG SLC2A4 regulator (2) FTO Fat mass and obesity associated (4) CLPP ClpP caseinolytic peptidase (3) rs1133090 DPEP2 Dipeptidase 2 rs2523720 TRIM26 Tripartite motif-containing 26 (2) WDR46 WD repeat domain 46 (2) PVR poliovirus receptor (4) rs4713380 Unknown 201990 s at CREBL2 cAMP responsive binding protein (2) PWP1 PWP1 homolog (S. cerevisiae) (3) Unknown rs2304380 RYR3 Ryanodine receptor 3 200660 at S100A11 S100 calcium binding protein A11 (3) rs16972193 SPTBN5 Spectrin, beta, non-erythrocytic 5 (2) rs3842787 PTGS1 Prostaglandin-endoperoxide Unknown PSME1 Proteasome activator subunit 1 (3) 222146 s at TCF4 Transcription factor 4 (5) 201421 s at WDR77 WD repeat domain 77 (4) 210448 s at P2RX5 Purinergic receptor P2X 212385 at TCF4 Transcription factor 4 (5) Unknown 202239 at PARP4 Poly (ADP-ribose) polymerase family, member 4 MB Myoglobin (3) 212423 at C10orf56 Chromosome 10 open reading frame 56 (2) 218983 at C1RL Complement component 1 rs6151428 ARSA Arylsulfatase A 217979 at TSPAN13 Tetraspanin 13 (2) 201487 at CTSC Cathepsin C rs1065761 CHIT1 Chitinase 1 (chitotriosidase) (2) 215543 s at LARGE Like-glycosyltransferase KIF21A Kinesin family member 21A (4) 204449 at PDCL Phosducin-like METAP2 Methionyl aminopeptidase 2 (2) ROD1 ROD1 regulator of differentiation 1 (S. pombe) (3) 201608 s at PWP1 PWP1 homolog (S. cerevisiae) (3) 218133 s at NIF3L1 IF3 NGG1 interacting factor 3-like 1 rs2808096 ARHGAP12 Rho GTPase activating protein 1 (2) rs16844401 HGFAC HGF activator


SNP Affy ID Gene Description rs1122326 HSPB9 Heat shock protein, alpha-crystallin-related, B9 (2) AP1B1 Adaptor-related protein complex 1, beta 1 subunit 212587 s at PTPRC Protein tyrosine phosphatase, receptor type, C (4) SNUPN Snurportin 1 208720 s at RBM39 RNA binding motif protein 39 (3) PCM1 Pericentriolar material 1 (4) rs1999663 C20orf114 Chromosome 20 open reading frame 114 (2) rs1137078 HLA-A29.1 Major histocompatibility complex class I (2) HLA-DMA Major histocompatibility complex, class II (4) 217737 x at C20orf43 Chromosome 20 open reading frame 43 PML Promyelocytic leukemia (2) 210202 s at BIN1 Bridging integrator 1 (2) THBS1 Thrombospondin 1 (4) rs12567377 CELSR2 Cadherin, EGF LAG seven-pass G-type receptor 2 FKBP5 FK506 binding protein 5 (4) 215273 s at TADA3L Transcriptional adaptor 3 AKAP13 A kinase (PRKA) anchor protein 13 (2) FASN Fatty acid synthase rs4723884 Unknown RAMP Receptor (G protein-coupled) activity modifying protein 1 rs16883930 SLC17A5 Solute carrier family 17 (anion/sugar transporter), member 5 BTK Bruton agammaglobulinemia tyrosine kinase (3) 218998 at C9orf6 Chromosome 9 open reading frame 6 rs2304237 ICAM3 Intercellular adhesion molecule 3 206995 x at SCARF1 Scavenger receptor class F, member 1 (3) 210555 s at NFATC3 Nuclear factor of activated T-cells (4) 208540 x at Unknown rs4978584 DFNB31 Deafness, autosomal recessive 31 (2) rs3731644 SH3BP4 SH3-domain binding protein 4 HSPB1 Heat shock 27kDa protein 1 39318 at TCL1A T-cell leukemia/lymphoma 1A (3) Unknown HIST1H2AL Histone cluster 1, H2al (4) rs1048786 PDIA2 Protein disulfide isomerase family A, member 2 rs3747243 Unknown AQP9 Aquaporin 9 (3) rs2274158 DFNB31 Deafness, autosomal recessive 31 (2) rs13894 SAT2 Spermidine/spermine N1-acetyltransferase family member 2 211792 s at CDKN2C Cyclin-dependent kinase inhibitor 2C 201105 at LGALS1 Lectin, galactoside-binding, soluble, 1 (2) 207198 s at LIMS1 LIM and senescent cell antigen-like domains 1 SIL SIL1 homolog, (S. 
cerevisiae) (3) 204971 at CSTA Cystatin A rs2287546 SART3 Squamous cell carcinoma antigen recognized by T cells 3


SNP Affy ID Gene Description 201461 s at MAPKAPK2 Nitogen-activated protein kinase-activated protein kinase 2 208620 at PCBP1 Poly(rC) binding protein 1 (2) DTX4 Deltex 4 homolog (Drosophila) (4) TRIM37 Tripartite motif-containing 37 (3) 218987 at ATF7IP Activating TF 7 interacting protein (2) EID1 EP300 interacting inhibitor of differentiation 1 (2) AQR Aquarius homolog (mouse) (2) 212387 at TCF4 Transcription factor 4 (5) rs1801516 ATM Ataxia telangiectasia mutated (2) HIST1H2AE Histone cluster 1, H2ae (2) rs10676 DHRS12 Dehydrogenase/reductase (SDR family) TPMT Thiopurine S-methyltransferase (2) SMNDC1 Survival motor neuron domain containing 1 (4) 217972 at CHCHD3 Coiled-coil-helix-coiled-coil-helix domain containing 3


It is interesting to note that few SNPs are repeated in the list for the individual SNP dataset. In the combined datasets, however, many more SNPs are repeated between those lists. The top cDNA genes consistently appear in multiple lists, and in many cases the top attributes remain near the top in the other lists. There are also a large number of repeats in the Affy dataset, although the combined SNP-Affy dataset showed very few repeats. This is consistent with the poor performance of that dataset.

None of the SNP profiles identified by Yang and colleagues [35] were identified in any of the previous lists. There was also no overlap with the genes identified by Hoffmann and colleagues [19] as being predictive of long-term clinical outcome.

4.5 Validation of Results

One of the most important aspects of any scientific experiment is validating the results. For a typical data mining experiment, this is done by obtaining new data and applying the model that has been built. If the model performs well on the new data, this is evidence that the model is valid. However, if the model does not perform well, then the model is thought to have been overfit to the training dataset and does not generalize well. Unfortunately, it is difficult to obtain new data for this type of experiment. This is due to a combination of the rarity of the disease, the availability of the data and the fact that not all experiments are done on the same platform. As such, this method of validation was not immediately available to us. Instead, we performed one method of validation based on the information that was available to us: label shuffling.

4.5.1 Label shuffling validation

One method of validation when using a classification technique is known as label shuffling. In this technique, the class labels for the dataset are randomly permuted. This newly labeled dataset is then used for classification and the resulting classification accuracy is compared to that of the original dataset. If the classification accuracy drops with the randomly permuted labels, then the original results are unlikely to be due to random chance. However, if the classification accuracy remains the same or increases, then the results obtained from the original dataset may well be due to random chance.
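The label shuffling procedure described above can be sketched in a few lines. This is a minimal illustration, not the code used in this work: a leave-one-out nearest-centroid classifier stands in for the SVM so that the example has no external dependencies, and the function names, the number of shuffles and the toy data are all assumptions made for the sketch.

```python
import random

def nearest_centroid_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    Assumption: this simple classifier is only a stand-in for the SVM.
    """
    correct = 0
    for i in range(len(X)):
        # Build one centroid per class from every sample except sample i.
        sums, counts = {}, {}
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(row))
            for k, value in enumerate(row):
                acc[k] += value
            counts[label] = counts.get(label, 0) + 1

        def squared_distance(label):
            return sum((X[i][k] - sums[label][k] / counts[label]) ** 2
                       for k in range(len(X[i])))

        # Predict the class whose centroid is closest to sample i.
        predicted = min(sums, key=squared_distance)
        correct += predicted == y[i]
    return correct / len(X)

def label_shuffle_test(X, y, n_shuffles=100, seed=0):
    """Compare accuracy on the true labels with accuracy on permuted labels."""
    rng = random.Random(seed)
    observed = nearest_centroid_accuracy(X, y)
    shuffled_scores = []
    for _ in range(n_shuffles):
        permuted = list(y)
        rng.shuffle(permuted)
        shuffled_scores.append(nearest_centroid_accuracy(X, permuted))
    # Fraction of shuffles that match or beat the true labelling.
    p_value = (1 + sum(s >= observed for s in shuffled_scores)) / (1 + n_shuffles)
    return observed, shuffled_scores, p_value

# Hypothetical toy data: two well-separated groups of ten samples each.
X = [[i * 0.1, 0.0] for i in range(10)] + [[5 + i * 0.1, 1.0] for i in range(10)]
y = ["alive"] * 10 + ["deceased"] * 10
observed, shuffled_scores, p_value = label_shuffle_test(X, y)
print(observed, sum(shuffled_scores) / len(shuffled_scores), p_value)
```

If the accuracy on the true labels is clearly higher than the accuracies on the shuffled labels, the original result is unlikely to be a chance artifact.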

As an example, this technique was performed on the SNP dataset. The original SVM results can be seen in Table 4.3. Table 4.17 displays the results of the permuted class label classification. Clearly, the classifier does not perform well on the shuffled labels. In fact, in most cases the classification accuracy drops below the baseline accuracy. The baseline accuracy is the overall accuracy that would be obtained if every patient were classified as the majority class.

This suggests that the results we have obtained are not due to random chance.
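To make the baseline comparison concrete, the following sketch computes the majority-class baseline and the per-class accuracies of a set of predictions. The function names and the class counts in the example are hypothetical and are not taken from our data.

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy obtained by assigning every sample to the majority class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of each class that was predicted correctly."""
    totals = Counter(true_labels)
    hits = Counter()
    for truth, prediction in zip(true_labels, predicted_labels):
        if truth == prediction:
            hits[truth] += 1
    return {label: hits[label] / totals[label] for label in totals}

# Hypothetical example: 90 surviving and 10 deceased patients.
truth = ["alive"] * 90 + ["deceased"] * 10
# A classifier that has learned nothing and always predicts the majority class.
always_alive = ["alive"] * len(truth)

print(baseline_accuracy(truth))
print(per_class_accuracy(truth, always_alive))
```

A classifier that assigns every patient to the majority class scores 100% on that class and 0% on the other, so its overall accuracy equals the baseline.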

Table 4.17: SVM prediction accuracy of Shuffled Label SNP subsets

Attributes   % Class Alive   % Class Deceased

25           100             0
50           100             0
100          100             0
250          95.93           0
500          97.56           0
1000         96.75           0
2500         98.37           14.29
5000         98.37           0

4.6 Extended SNP Analysis

One important property of SNP data is that they are unlikely to change. They describe the genome of an individual and, barring any genetic mutations, will remain the same throughout the lifetime of that individual. This is a useful property for data mining, as it means that any results found do not depend on when the data were collected. Because of this property, and due to the positive results seen with the SNP data analysis, we decided to perform an extended analysis using different data-mining techniques.

4.6.1 Predicting relapse

Random forests attribute selection is a supervised learning technique, meaning that a class label must be provided for each object in the dataset. For the previous analysis the class label was the mortality of the patients. The reasoning behind this was so the random forests algorithm could find the attributes which best discriminated based on that particular label. Another important factor for ALL treatment is whether or not the patient relapses. If a patient does relapse then the course of treatment needs to be changed. If it were possible to know which patients had a higher chance of relapsing, then it is possible that a relapse could be avoided.

In order to explore this possibility, the random forests algorithm was run on the SNP data with the class label being whether or not the patient relapsed. We expected that there would be some overlap between the top SNPs selected here and the previous set, as all of the patients who have passed away had also relapsed. However, several patients relapsed and survived, which emphasizes how complex these data are.

As in the previous analysis, the top 25, 50, 100, 250, 500, 1000, 2500 and 5000 SNPs were selected into subsets which were analyzed by SVD and SVM. Although the results were not as clear cut as those for the mortality label, they proved to be quite interesting. The SVD images for the top 250 and 1000 SNPs are shown in Figure 4.17. It can be seen in the 250 attribute subset that there is a fairly good separation of the data. This suggests that there is also a genetic reason for why some patients relapse. The separation in the 1000 attribute subset is not quite as clear, but it can be seen that the overall structure of the data is beginning to form two clusters, much like what we saw with the previous SNP analysis. This again suggests that the more SNP attributes are included, the more of the general structure of the patient population becomes visible.
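The subset-then-SVD step can be sketched as follows, assuming NumPy is available. The importance scores would come from the random forests run; here they are random placeholders, and the function names, matrix sizes and the choice not to center the data are all assumptions made for illustration.

```python
import numpy as np

def top_k_columns(importance, k):
    """Indices of the k attributes with the highest importance scores."""
    order = np.argsort(importance)[::-1]  # sort descending by importance
    return order[:k]

def svd_coordinates(X, n_components=3):
    """Coordinates of each row (patient) along the leading singular directions."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

# Placeholder data: 12 patients by 40 SNP attributes, with made-up importances.
rng = np.random.default_rng(0)
X = rng.random((12, 40))
importance = rng.random(40)

for k in (5, 10, 20):
    columns = top_k_columns(importance, k)
    coords = svd_coordinates(X[:, columns])
    print(k, coords.shape)
```

Plotting the three columns of `coords`, colored by class label, would give an image of the kind shown in the SVD figures.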

(a) 250 SNP Relapse subset    (b) 1000 SNP Relapse subset

Figure 4.17: SVD images of relapse selected SNP data. blue = alive, red = deceased, Y = relapsed

To further this analysis we also applied an SVM to the data to see how predictive each subset is for the relapse label. The results are shown in Table 4.18. Clearly these results are not as good as those of the previous analysis using the mortality information. This may suggest that, since almost all of the deceased patients also relapsed, the random forests algorithm is finding attributes which describe mortality more than relapse. Due to the interconnectivity of these two properties, it is difficult to fully understand what is going on with these data.

Table 4.18: SVM prediction accuracy of SNP subsets for relapse

Attributes   % Class Alive   % Class Deceased

25           97.3            69.23
50           98.2            61.54
100          100             76.92
250          100             81.54
500          100             50
1000         100             50
2500         100             26.92
5000         100             15.38

When comparing the top SNPs selected using each of the two class labels, mortality and relapse, we find that there are few common SNPs for the smaller subsets. For the top 25 and 50 subsets there is only 1 common SNP. As the subsets get larger, as would be expected, there are more and more common SNPs. Of the top 5000 SNPs for both labels, 2354 are shared. This is an encouraging result, since we expected to find some similarities due to the fact that relapse is a fairly good predictor of mortality. This analysis supports our hypothesis that there is a relationship between a subject’s genetics and their response to this disease.
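The comparison of the two rankings amounts to intersecting their top-k prefixes. A minimal sketch is given below; the function name and the SNP identifiers are hypothetical placeholders, not entries from our rankings.

```python
def shared_top_k(ranking_a, ranking_b, k):
    """Attributes that appear in the top k of both rankings."""
    return set(ranking_a[:k]) & set(ranking_b[:k])

# Hypothetical rankings, most important attribute first.
mortality_ranking = ["snp_a", "snp_b", "snp_c", "snp_d", "snp_e", "snp_f"]
relapse_ranking = ["snp_d", "snp_x", "snp_a", "snp_y", "snp_b", "snp_z"]

for k in (2, 4, 6):
    shared = shared_top_k(mortality_ranking, relapse_ranking, k)
    print(k, len(shared), sorted(shared))
```

As in our analysis, the number of shared attributes can only grow or stay the same as k increases, since each top-k set contains the smaller ones.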

4.6.2 Graph analysis of SNP data

Another way of looking at these data is to look at how each patient is similar or dissimilar to all other patients. Instead of describing a patient in space by the values of their attributes, it is possible to create a space where the position of the patients is based on their similarity to each other directly. This can be thought of as a graph approach in which two patients are connected if their similarity is above a threshold.

In order to accomplish this, the dot product is calculated for each pair of patients and the result is stored in an n × n matrix, with each diagonal entry x(i,i) set to zero. This can be regarded as an adjacency matrix and is the basis of the graph approach. Once this matrix is calculated, the next step is to determine a threshold value which defines the point at which two patients are considered similar or not. Any entry below this threshold is set to zero; otherwise the value remains unchanged. From here we chose to analyze these data using an SVD as before. The top 250 SNP attribute subset is used for this analysis and the resulting SVD image is shown in Figure 4.18. In this figure the connections are not drawn, so as to show the shape of the data more clearly. There is a structured U shape to the data, which suggests that there is an inherent structure within the data which this approach was able to find. We hypothesize that this shape is due to the way in which these data were coded. Each SNP value takes on a theta value of approximately 0, 0.5 or 1, as explained previously. Therefore, we believe that this had an effect on the shape of the data, since it can be thought of as each data point migrating to one of three positions. It can be seen that although the deceased patients are spread out around the U shape, they are still being separated out vertically. This is an important result, since we are finding the same separation in the data but through a new technique in which we compare individuals to each other instead of taking each individual on its own.
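The construction of the thresholded adjacency matrix and its SVD can be sketched as follows, assuming NumPy is available. The function names, the toy patient matrix and the threshold value are assumptions chosen for illustration; how the threshold was chosen in practice is not shown here.

```python
import numpy as np

def similarity_graph(X, threshold):
    """Thresholded dot-product adjacency matrix between the rows of X."""
    X = np.asarray(X, dtype=float)
    S = X @ X.T                 # dot product for every pair of patients
    np.fill_diagonal(S, 0.0)    # zero out each x(i,i) entry
    S[S < threshold] = 0.0      # drop edges below the similarity threshold
    return S

def svd_embedding(A, n_components=3):
    """Position each patient using the leading singular directions of A."""
    U, s, _ = np.linalg.svd(A)
    return U[:, :n_components] * s[:n_components]

# Toy example: three patients described by two theta-like attributes.
patients = [[1.0, 0.0],
            [1.0, 0.5],
            [0.0, 1.0]]
A = similarity_graph(patients, threshold=0.75)
print(A)
print(svd_embedding(A, n_components=2))
```

In this toy example only the first two patients remain connected, because the other pairwise dot products fall below the threshold.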




(a) 250 SNP graph analysis without connections

Figure 4.18: SVD images of SNP graph analysis. blue = alive, red = deceased

4.6.3 Reformatting SNP data

The previous SNP analysis used the theta values for all of the experiments. There is another representation based on the genotype of each patient, which has three possible values: 0, 1 and 2. These values represent the homozygous major allele, heterozygous allele and homozygous minor allele respectively. Another approach is to separate each SNP attribute into three attributes, one for each of these alleles, where each subject has a value of 1 for the allele they contain and a 0 for the other two positions. These data were then run through the random forests algorithm to find the most significant SNPs. The difference between this method and the previous method is that this allows the random forests algorithm to pick out specific alleles within a SNP attribute which may be more important than the others. One problem that had to be dealt with was the fact that some heterozygous minor alleles are so uncommon that none of these patients contained them at all. This would result in entire columns of values being 0, which can skew the results of running these data-mining algorithms. In order to prevent this, all of the 0 columns were removed before any further analysis was done.
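The reformatting described above amounts to an indicator (one-hot) encoding followed by removal of empty columns. The following is a minimal sketch of that idea; the function name `expand_genotypes`, the dictionary layout and the toy genotype table are hypothetical and chosen only for illustration.

```python
def expand_genotypes(genotypes):
    """Expand an (n_patients x n_snps) table of 0/1/2 genotype codes into
    one indicator column per (SNP, code) pair, dropping all-zero columns."""
    n_snps = len(genotypes[0])
    columns = {}  # (snp_index, genotype_code) -> indicator column
    for snp in range(n_snps):
        for code in (0, 1, 2):
            column = [1 if row[snp] == code else 0 for row in genotypes]
            if any(column):  # skip codes that no patient carries
                columns[(snp, code)] = column
    return columns

# Toy data: 3 patients, 2 SNPs (values are made up).
genotypes = [
    [0, 1],
    [0, 2],
    [1, 1],
]
indicators = expand_genotypes(genotypes)
```

In this toy table no patient has code 2 at the first SNP or code 0 at the second, so those two columns are never created, which mirrors the removal of the 0 columns.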

The analysis of these data was similar to what was done previously. First, the data were run through the random forests algorithm to discover the most important attributes. Then, these attribute rankings were used to select the top subsets, which were then analyzed by SVD. As before, the top 25, 50, 100, 250, 500, 1000, 2500 and 5000 attribute subsets were used. We expected that the attribute selection would select many of the same SNPs as with the previous dataset, as this random forests run was also labeled with the mortality of the patients. However, this method should identify specific alleles of each SNP, which may be more informative than simply knowing which SNPs are important. The SVD images for the top 250 and 2500 attributes are shown in Figure 4.19. The separation between deceased and alive is clearly evident in both. However, it is interesting to note that the separation becomes even clearer when more attributes are used. This is not what we observed with the previous dataset, but since there are many more attributes in this dataset it is not surprising. We also see with the 2500 subset that there appears to be a separation within the deceased patients as well, as they form two or even three clusters. It is important to note that the surviving patients are clustered around the origin while the deceased patients appear as the outliers. This is significant because it suggests that these patients are somehow different from the collective group of surviving patients. This is an encouraging result as it, again, supports our hypothesis that there is a link between the genetics of the individual and their outcome with the disease.

(a) 250 SNP reformatted (b) 2500 SNP reformatted

Figure 4.19: SVD images of reformatted SNP analysis. blue = alive, red = deceased
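The two-stage process of ranking attributes and then decomposing the top subsets can be sketched as follows. The importance scores here are random stand-ins for the random forests ranking, the data matrix is synthetic, and any normalization applied before the SVD is omitted, so this shows only the shape of the pipeline and not the study's implementation.

```python
import numpy as np

def top_k_svd(data, importances, k, n_components=3):
    """Keep the k highest-ranked attributes and project patients via SVD."""
    data = np.asarray(data, dtype=float)
    # Indices of the k largest importance scores, most important first.
    top = np.argsort(importances)[::-1][:k]
    subset = data[:, top]
    u, s, _ = np.linalg.svd(subset, full_matrices=False)
    r = min(n_components, len(s))
    return top, u[:, :r] * s[:r]

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(10, 20))  # stand-in for the indicator matrix
importances = rng.random(20)              # stand-in for random forest scores
for k in (5, 10):
    top, coords = top_k_svd(data, importances, k)
    print(k, coords.shape)
```

Running the same function over each subset size gives one set of patient coordinates per subset, which can then be plotted and coloured by the mortality label.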

When we compare the new set of top ranked attributes to the previous set we see that there is approximately a 60% similarity across most subsets. This suggests that our previous subsets included several attributes which are not globally predictive of mortality, but rather may have been correlated with those attributes that were. When we isolate only the attributes which are shared we see a similar separation.

4.6.4 Updated data labels

Late in the progress of this study, we were able to obtain updated patient information for our datasets. Compared to the previous labels, the updated data contained five more patients who are deceased as well as seven who have since relapsed. We were interested in using these labels in two ways: labeling the previous results with the new labels to see where these updated patients lie in space, and performing a new analysis with these new labels as the basis for attribute selection. For the purposes of this study, we have focussed on the SNP data results.

Relabeling previous results

We were interested to see where these newly updated patients would lie in the previously defined space of objects. Since there was a clear separation of the data, we did not expect to find that these new patients all clustered together as if to suggest that they were predicted to die, and this was in fact the case. Figure 4.20 shows a comparison of the top 250 SNP SVD for both the old and new labels. It can be seen that the newly labeled deceased patients are scattered throughout the large cluster of alive patients. This raises several questions about both the model that has been built and the nature of these data. Both of these points will be addressed in later sections.

(a) 250 SNP old labels (b) 250 SNP updated labels

Figure 4.20: SVD images of 250 SNP subset with old and new labels. blue = alive, red = deceased

Attribute selection

In order to obtain a better understanding of the implications of these new labels we decided to perform the same line of experiments as we did with the old data labels. Since the random forests algorithm selects the best attributes based on the data labels provided, we expected to find many new attributes being selected as compared to the previous list obtained. As we saw with the cross validation approach described previously, there are many attributes which are selected purely due to their correlation with more informative attributes. Following this same principle, we believe that the attributes which are found in both lists are the most predictive ones. As before, the top 25, 50, 100, 250, 500, 1000, 2500 and 5000 subsets were analyzed using SVD.

SVD analysis

Figure 4.21 shows the result of the SVD for the top 250 and 2500 attributes. It is clear that there is still a good separation of the data based on the mortality label. The 250 subset image also shows more separation within each class than was seen with the previous labels. We believe that this is due to the nature of the coding of the data. The 2500 subset image is reminiscent of the previous data, where the data form two clusters while still maintaining the separation based on mortality. This is the same separation we believe to be due to the coding of the data. This suggests that the top 2500 attribute subsets for both the old and new labels most likely contain many of the same attributes.


(a) 250 SNP new data (b) 2500 SNP new data

Figure 4.21: SVD images of 250 and 2500 SNP subsets with new labels. blue = alive, red = deceased

4.6.5 Cross validation of top attributes

One of the effects of changing data labels is that an attribute which may be truly predictive will have appeared to be less predictive under the earlier, incorrect labeling. This is a problem that cannot be avoided due to the nature of the data. However, if we develop a more intelligent methodology for attribute selection then we can compensate for this.

One way that this can be accomplished is by splitting the data into randomly generated subsets and then performing attribute selection on each. The idea is that by finding the attributes which appear in multiple lists, we are filtering out the attributes which may only be predictive of that particular subset or attributes which appear only by chance. Also, we believe that if an attribute appears in multiple lists then it is a more general predictor than one which only appeared in one list. As an example of this, we divided the SNP data into two subsets and ran each through the random forest algorithm. Table 4.19 shows the number of common SNPs between each of the subsets we created.

Table 4.19: Comparison of Top Attribute Lists

Attributes    No. Shared Attributes    % Shared Attributes
25            9                        34.61
50            18                       36
100           34                       34
250           96                       38.4
500           230                      46
1000          502                      50.2
2500          1442                     57.68
5000          3266                     65.32
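The comparison behind Table 4.19 can be sketched as a random split of the patients followed by an intersection of the two top-k attribute lists. In this minimal example the attribute rankings are hypothetical placeholders for the random forest output on each half, and the helper names `split_in_two` and `overlap` are assumptions made for illustration.

```python
import random

def split_in_two(n_patients, seed=0):
    """Randomly split patient indices into two halves."""
    indices = list(range(n_patients))
    random.Random(seed).shuffle(indices)
    half = n_patients // 2
    return indices[:half], indices[half:]

def overlap(ranking_a, ranking_b, k):
    """Count and percentage of attributes shared by the top-k of two rankings."""
    shared = set(ranking_a[:k]) & set(ranking_b[:k])
    return len(shared), 100.0 * len(shared) / k

half_a, half_b = split_in_two(10)

# Hypothetical rankings (most important first) obtained from the two halves.
ranking_a = ["rs1", "rs2", "rs3", "rs4", "rs5", "rs6"]
ranking_b = ["rs2", "rs7", "rs1", "rs8", "rs4", "rs9"]
for k in (2, 4, 6):
    count, percent = overlap(ranking_a, ranking_b, k)
    print(k, count, round(percent, 2))
```

With this definition of the percentage, 18 shared attributes in the top 50 corresponds to 36% and 96 shared attributes in the top 250 corresponds to 38.4%.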

These results are as we expected: for small subsets there is not much overlap, but as the size of the subsets increases there are more and more intersections. This is an interesting result because the previous SNP analysis showed a perfect linear separation for the 250 attribute subset, and yet when we create two subsets of these data we find only a 38.4% overlap. This suggests that most of the attributes included in that particular subset were either only informative for that specific dataset or not informative at all. The other side of that is that 38.4% of the attributes appear in both lists and we can therefore assume that they are more informative. In order to verify this we performed an SVD analysis as before.

SVD analysis of attribute intersection

To see the effect of removing the attributes that only appear in one of the lists, we performed an SVD on each of the subsets and used the updated labels. The results are shown in Figure 4.22 for the intersection of the top 250, 1000, 2500 and 5000 subsets. For the intersection of the top 250 SNPs it can be seen that the clear separation between the deceased and alive patients is no longer present. However, there is still a good clustering of deceased patients. Also, since there is not a clear separation it could suggest that the patients who lie closer to the deceased are at a much higher risk. This would be a much more beneficial result as it would provide more information about each individual patient rather than a model which is specific to the data. This also suggests that some of the attributes which have been removed may have been overfitting that particular dataset and thus made the resulting SVD images appear to have a much bigger separation. Looking next to the intersection of the top 1000 SNPs, it is interesting to note that the separation between the patients is clearer. This again supports the idea of wanting to keep as much information as possible. The separation is still not as clear as we have previously seen, which is the preferred result. For the intersection of the top 2500 SNPs we see a more familiar picture in that the data are beginning to form the two clusters we have seen previously. However, the data are still separated quite well based on the mortality label. Finally, for the intersection of the top 5000 SNPs we see the same separation as with the previous image, with more distance between the clusters. It is important to note that these images have shown almost exactly the same separation as the previously displayed results, but with approximately half the number of attributes. This confirms that there are many attributes in the previous subsets which are not powerful separators. By removing these attributes we get a much clearer picture of what is really going on.



(a) 250 SNP Intersection (b) 1000 SNP Intersection (c) 2500 SNP Intersection (d) 5000 SNP Intersection

Figure 4.22: SVD images of intersecting SNP attributes. blue = alive, red = deceased.

4.7 Discussion of the Nature of the Data

This type of data is complex and constantly changing, which makes it difficult to build accurate models. The complexity of the data is represented by the sheer amount of data that exists for each patient. When it is all combined, there are over 40000 attributes for each patient and this number could increase significantly with newer technology. It is difficult to accurately remove attributes that do not provide useful information and select those that do. It is necessary to take many different approaches and be clever with the available techniques to be able to discover anything useful from these data. Another challenge with these data is that they are constantly changing. As we saw from the updated labels, the model that we had built was completely changed due to only four patients having been updated. This is difficult to deal with since a model at one time may quickly become obsolete. This is why it is necessary to do such things as cross validation in order to try to isolate the attributes that are responsible for the separation and not the attributes that appear in a list due to their correlation in that particular dataset.

When dealing with the mortality labels, a patient is listed as either alive or deceased. However, not every patient with ALL who dies does so because of the disease. Due to the intensity of the treatment, the patient's immune system becomes compromised and so it is possible that the patient may have died due to an infection or some other health concern. This becomes a problem for this type of analysis since all deceased patients are treated as equal. Since the number of deceased patients is small in comparison to the number of surviving patients, this could quite easily skew the results. At present the cause of death is unavailable and so we must assume that all patients who have died have done so because of their disease.

One final challenge is what the data represent. Biological systems are complex and there are many levels of regulation within each system. In this study we are using both SNP and gene expression data. In a biological system, an individual's genome affects the genes, which in turn affect the proteins, which then affect the phenotype. By looking at the SNP data we are looking at the genome level. Any significant changes in the SNPs can affect the genes, which could affect the gene expression values. As a result, these datasets are connected and thus cannot be treated as independent of one another.

Chapter 5

Conclusion


The goal of this research was to investigate the relationship between an individual's genetics and whether or not they survived their battle with acute lymphoblastic leukemia. The data used for this study were produced from microarray analysis of the individual's SNPs and gene expression values. These data are complex and high dimensional, which provided many challenges for the analysis. We used data-mining techniques to analyze these data and created a process of attribute selection followed by clustering through the use of a Singular Value Decomposition (SVD). We used various clinical labels to understand the results that this technique produced.

This study has produced many conclusions about both the data and the techniques that were used. Our analysis has shown that a separation can be found between patients who live and who die based on both the SNP values and the gene expression values. This suggests that there may be a genetic explanation for why some patients die within the context of current treatment regimes. This is significant and novel as it is not widely accepted that there is a genetic factor which can distinguish patients who live from those who die. Rather, the genetic factors that are known are related to whether or not individuals develop this disease. We have not been able to pinpoint which attributes are responsible for this, but we believe that our attribute selection method creates subsets which contain these informative attributes. This finding was supported through many different analyses. The SNP, cDNA and the combined dataset analysis using our data-mining procedure showed a clear separation of the data based on the mortality label. Also, our further analyses of the SNP data using various techniques all showed similar results. The validation technique we used also showed that these results were not due to random chance. It would be ideal to obtain new data which could be run through our model, but at this time this is not possible. We believe that this finding has merit. However, it will take further research and fine tuning of the techniques to discover any biological significance.

The process of attribute selection is one which must be done carefully. It is unrealistic to assume that the attribute-selection algorithm, in this case the random forest algorithm, will be able to identify all of the biologically significant attributes with such a large dataset. We have shown that by evaluating the attribute selection process through a cross validation of attributes in smaller subsets, there are many attributes which are included in these subsets which may only be informative for that particular dataset and are not globally predictive. It is necessary to be more intelligent about the attribute selection process in order to distinguish between predictive attributes and those that only appear to be predictive.

We have also shown that the current process of using clinical data to make decisions about diagnosis, prognosis and treatment is not adequate. Although the survival rate is approximately 80%, our analysis shows that the genetics of these individuals do not appear to have any meaningful relationship to the risk classification the physicians have assigned, as seen in the SVD analysis of the clinical data.

These data are complex, high dimensional and constantly changing, and the datasets are biologically connected, all of which makes them difficult to work with. We believe that data mining provides the necessary tools to attempt to understand and learn from this type of data. The data-mining process is involved and requires the researchers to constantly scrutinize the results and to learn from them in order to develop a more intelligent process. With so much data being produced from these high throughput devices every day, it is necessary to develop intelligent and efficient methods of learning from these data and we believe that data mining is necessary to take advantage of the wealth of knowledge hidden in these datasets.

5.1 Future Work

This study is a step in a new direction of using data-mining techniques with microarray data for clinical applications to cancer treatment. We believe that we have shown the power of data mining and its uses in this field of research. This research will be the foundation for many other studies which use similar techniques and will be able to build on these results.

We have identified a need to develop a more intelligent methodology for the process of attribute selection. Although simply using the random forest algorithm was able to identify interesting attributes, we were able to demonstrate that a large proportion of these attributes were not generally predictive. We are working in this area, attempting to use techniques such as curve fitting, correlation, SVD and others to improve attribute selection. It is also important to add domain knowledge into this process. Based on what we know about the interdependence of SNPs and gene expression values, we are beginning to develop a method of filtering out attributes which do not appear to contain any useful information.

The ultimate goal of this project is to be able to create a clinical tool which can be used to assist physicians in assigning a patient to an appropriate risk category so that treatment will be more targeted to that particular individual. This is a form of personalized medicine which we believe will be the future of medical diagnosis. This will be possible by creating a “space” where these patients will lie based upon their genetic and clinical information. This space can then be labeled with such information as treatment received, outcome, risk, whether or not the patient relapsed, etc. When a new patient with ALL is received they can be put into this space and their position will be based on their genetics and clinical information. It is then possible to look in a neighbourhood around this patient and examine the neighbours, which will be biologically similar to this new patient. By observing the neighbours' treatment and outcome, more informed decisions about this new patient can be made. If the neighbours all received the same type of treatment and all survived, then it would make sense to prescribe this treatment for the new patient. However, if all of the neighbours received the same treatment and died, then it would be wise to explore the option of a different course of treatment. This is just an example of how this system could work once it is developed.

This methodology can be applied to many other types of diseases and we believe that this will lead to more informed treatment decisions resulting in a higher percentage of individuals who survive their disease. This is truly what personalized medicine is all about and we believe that this is the future of medicine.
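The neighbourhood lookup envisaged for such a tool can be sketched as a nearest-neighbour query over the labelled space. The example is purely illustrative: the coordinates, treatments and outcomes are invented, Euclidean distance and a neighbourhood of three patients are assumed, and the helper names are hypothetical. No such system has yet been built.

```python
import math
from collections import Counter

def nearest_neighbours(space, new_point, k):
    """Return the k patients in `space` closest to `new_point` (Euclidean)."""
    return sorted(space, key=lambda p: math.dist(p["coords"], new_point))[:k]

def summarise(neighbours):
    """Tally (treatment, outcome) pairs among the neighbours."""
    return Counter((p["treatment"], p["outcome"]) for p in neighbours)

# Hypothetical labelled space; all values are made up for illustration.
space = [
    {"coords": (0.1, 0.2), "treatment": "A", "outcome": "alive"},
    {"coords": (0.0, 0.3), "treatment": "A", "outcome": "alive"},
    {"coords": (0.2, 0.1), "treatment": "A", "outcome": "alive"},
    {"coords": (5.0, 5.1), "treatment": "A", "outcome": "deceased"},
    {"coords": (5.2, 4.9), "treatment": "B", "outcome": "alive"},
]
neighbours = nearest_neighbours(space, new_point=(0.1, 0.1), k=3)
summary = summarise(neighbours)
```

Here all three neighbours received treatment A and survived, which corresponds to the case in which prescribing the same treatment for the new patient would be reasonable.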
