Data Preparation
Initially, the data set contained 7071 columns: one for each gene, plus one holding a serial number for each instance. The information about each patient was recorded in a row. There were 70 rows: one for each of the 69 patients and one header row with the names of the genes.
Your first step is to normalize the data. Domain experts had previously established that only values in the range of 20 to 16000 are reliable for this experiment. You will probably need to write a small program (e.g., in Java) that reads the data set and changes any gene value less than 20 to 20 and any gene value greater than 16000 to 16000. The next step is to select the genes that seem to be correlated with the outcome.
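The thresholding step can be sketched in Java as follows. This is only a minimal illustration, assuming the expression values of one row have already been parsed into a double array; the class and method names (ExpressionClipper, clip, clipRow) are hypothetical and not part of the assignment.

```java
import java.util.Arrays;

final class ExpressionClipper {
    static final double FLOOR = 20.0;      // lower bound given in the text
    static final double CEILING = 16000.0; // upper bound given in the text

    // Clamp a single expression value into [FLOOR, CEILING].
    static double clip(double value) {
        return Math.max(FLOOR, Math.min(CEILING, value));
    }

    // Return a clipped copy of one row of expression values.
    // The serial-number column is assumed to have been removed already.
    static double[] clipRow(double[] row) {
        return Arrays.stream(row).map(ExpressionClipper::clip).toArray();
    }

    public static void main(String[] args) {
        double[] row = {-5.0, 150.0, 20000.0};
        System.out.println(Arrays.toString(clipRow(row))); // [20.0, 150.0, 16000.0]
    }
}
```

Values already inside the range pass through unchanged, so applying the method twice gives the same result as applying it once.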
Since many learning algorithms look for non-linear combinations of features, a data set with few records and many genes might give us false positives. Reducing the number of genes can therefore improve classification accuracy. To do that, you will probably need to write a small Java program to remove genes with a fold difference of less than 2. Fold difference is defined as (max-min)/2, where max and min are the maximum and minimum values of the gene expression across all the instances, respectively. A fold difference of less than 2 means that the gene value does not change much across the samples and, as such, cannot influence the class significantly.
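A possible sketch of the fold-difference filter is shown below. It follows the definition given above, (max-min)/2; a different definition would change only the foldDifference method. The array layout (samples[i][g] holds gene g for patient i) and the names FoldFilter, foldDifference and genesToKeep are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class FoldFilter {
    // Fold difference as defined in the text: (max - min) / 2 over all samples.
    static double foldDifference(double[] geneValues) {
        double max = Double.NEGATIVE_INFINITY;
        double min = Double.POSITIVE_INFINITY;
        for (double v : geneValues) {
            max = Math.max(max, v);
            min = Math.min(min, v);
        }
        return (max - min) / 2.0;
    }

    // Return the column indices of genes whose fold difference is at least
    // the threshold. samples[i][g] is gene g for patient i (assumed layout).
    static List<Integer> genesToKeep(double[][] samples, double threshold) {
        List<Integer> keep = new ArrayList<>();
        int geneCount = samples[0].length;
        for (int g = 0; g < geneCount; g++) {
            double[] column = new double[samples.length];
            for (int i = 0; i < samples.length; i++) {
                column[i] = samples[i][g];
            }
            if (foldDifference(column) >= threshold) {
                keep.add(g);
            }
        }
        return keep;
    }
}
```

Calling genesToKeep(samples, 2.0) on the thresholded data returns the indices of the genes that survive the filter; the remaining columns can then be dropped before the T-values are computed.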
Next, you need to calculate the T-value of each class for every gene. The T-value is a linear measure that helps with gene reduction, which, as observed above, helps classification accuracy. In this case, we take the absolute values of the T-values and use only the highest. The T-value is calculated as follows:
$$T = \frac{\mathrm{Avg}_1 - \mathrm{Avg}_2}{\sqrt{\mathrm{Stdev}_1^2 / N_1 + \mathrm{Stdev}_2^2 / N_2}}$$
Avg1 is the average of the gene's values over the samples in one class, and Avg2 is the average over the samples in the other 4 classes. Stdev1 is the standard deviation for that class and Stdev2 is the standard deviation for the other classes. Similarly, N1 is the number of samples that belong to the class whose T-value we are interested in, and N2 is the number of samples that do not belong to that class.
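For readers who prefer code to a spreadsheet, the one-class-versus-rest T-value for a single gene could be computed roughly as follows. This sketch assumes the sample standard deviation (n - 1 in the denominator), which the text does not specify, and at least two samples in each group; the names TValue, mean, variance and tValue are hypothetical.

```java
final class TValue {
    // Arithmetic mean of the values.
    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    // Squared standard deviation with an n - 1 denominator.
    // ASSUMPTION: the text does not say whether n or n - 1 is intended.
    static double variance(double[] x) {
        double m = mean(x);
        double sumSq = 0.0;
        for (double v : x) {
            sumSq += (v - m) * (v - m);
        }
        return sumSq / (x.length - 1);
    }

    // T = (Avg1 - Avg2) / sqrt(Stdev1^2 / N1 + Stdev2^2 / N2)
    // inClass: the gene's values for samples in the class of interest (N1 values).
    // rest:    the gene's values for all other samples (N2 values).
    static double tValue(double[] inClass, double[] rest) {
        double standardError = Math.sqrt(
                variance(inClass) / inClass.length + variance(rest) / rest.length);
        return (mean(inClass) - mean(rest)) / standardError;
    }
}
```

Since only the magnitude matters for ranking, Math.abs(TValue.tValue(inClass, rest)) is the quantity to compare across genes. A gene that is constant in both groups gives a zero denominator, so such genes should be removed beforehand or handled separately.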
You can use Microsoft Excel and the CSV file to calculate the T-values of all classes.
Then you can follow the instructions on the website.