MICROARRAY DATA ANALYSIS TOOL (MAT)

A Thesis

Presented to

The Graduate Faculty of The University of Akron

In Partial Fulfillment

of the Requirements for the Degree

Master of Science

Sudarshan Selvaraja

December, 2008

MICROARRAY DATA ANALYSIS TOOL (MAT)

Sudarshan Selvaraja

Thesis

Approved:                                   Accepted:

_________________________                   _________________________
Advisor                                     Department Chair
Dr. Zhong-Hui Duan                          Dr. Wolfgang Pelz

_________________________                   _________________________
Committee Member                            Dean of the College
Dr. Yingcai Xiao                            Dr. Ronald F. Levant

_________________________                   _________________________
Committee Member                            Dean of the Graduate School
Dr. Xuan-Hien Dang                          Dr. George R. Newkome

_________________________
Date


ABSTRACT

Microarray technology has been widely used by biologists to probe the presence of genes in DNA or RNA samples. Using this technology, oligonucleotide probes can be immobilized on a microarray chip in a massively parallel fashion, allowing biologists to check the expression levels of thousands of genes at once. This thesis develops a software system that includes a database repository for storing different microarray datasets and a microarray data analysis tool for analyzing the stored data. The repository currently accepts datasets in GenePix Pro format, although it can be expanded to include datasets in other formats. The user interface of the repository allows users to conveniently upload data files and perform their preferred data preprocessing and analysis. The analysis methods implemented include the traditional k-nearest neighbor (kNN) method and two new kNN methods developed in this study. Additional analysis methods can be added by future developers. The system was tested using a set of microRNA gene expression data. The design and implementation of the software tool are presented in the thesis along with the testing results from the microRNA dataset. The results indicate that the new weighted kNN method proposed in this study outperforms the traditional kNN method and the proposed mean method. We conclude that the system developed in this thesis effectively provides a structured microarray data repository, a flexible graphical user interface, and rational data mining methods.


ACKNOWLEDGEMENTS

I would like to thank my advisor Dr. Zhong-Hui Duan for giving me the opportunity to work on this project for my Master's thesis. I was motivated to choose this topic after taking an introductory course in bioinformatics. I would like to thank her for her invaluable suggestions and steady guidance during the entire course of the project.

I am thankful to my committee members Dr. Yingcai Xiao and Dr. Xuan-Hien Dang for their guidance, invaluable suggestions, and time.

I would like to thank my friends Shanth Anand and Prashanth Puliyadi for helping me pursue my Master's and change my career path. I could not have achieved this without their help.

I would like to thank my friend Manik Dhawan for his guidance in writing and formatting this report.

Finally, I would like to express my gratitude towards my parents and all my family members, who were always there for me, cheering me on in every situation, and who took great interest in my venture.


TABLE OF CONTENTS

Page

LIST OF TABLES...... viii

LIST OF FIGURES...... ix

CHAPTER

I. INTRODUCTION...... 1

1.1 Introduction to Bioinformatics...... 1

1.2 Introduction to Microarray Technology……………………………….. 2

1.2.1 Genepix Experimental Procedure………………………………. 3

1.3 Applications of Microarrays...... 5

1.4 Need for Automated Analysis…………………………………………. 6

1.5 Knowledge Discovery in Data………………………………………… 7

1.5.1 KDD Steps……………………………………………………... 8

1.6 Classification…………………………………………………………... 8

1.6.1 General Approach……………………………………………... 9

1.6.2 Decision Trees…………………………………………………. 10

1.6.3 k-Nearest Neighbor Classifiers………………………………. 12


1.7 Outline of the Current Study………………………………………… 14

II. LITERATURE REVIEW...... …...... 17

2.1 Previous Work...... …...... 17

2.2 Existing Tools for Normalizing GPR Datasets……………………… 20

2.3 Stanford Microarray Database (SMD)…...... 21

2.4 Microarray Tools…...... 21

2.5 Available Source for Microarray Data………………………………… 24

III. MATERIALS AND METHODS …...... …...... 25

3.1 Database Design…...... …...... 25

3.1.1 Schema Design………………………………………………… 25

3.1.2 Table Details…………………………………………………... 26

3.1.3 Attributes.…………………………………………………….... 28

3.2 Description of Genepix Data Format...... 29

3.2.1 Features and Blocks…………………………………………… 31

3.2.2 Sample Dataset………………………………………………… 32

3.2.3 Transferring Genepix Dataset to Database…………………….. 33

3.3 Data Selection …...... 34

3.3.1 Creation of Training and Testing Dataset…………………… 34

3.4 Preprocessing………………………………………………………….. 38

3.4.1 Preprocessing in MAT………………………………………… 39

3.5 Normalization …...... 41

3.6 Feature Selection …...... 42


3.6.1 Student T-Test…………………………………………………. 42

3.6.2 Implementation of T-Test in MAT…………………………….. 43

3.7 Classification ...... …...... 44

3.7.1 Classical kNN Method………………………………………… 44

3.7.2 Weighted kNN Method……………………………………… 46

3.7.3 Mean kNN Method…………………………………………… 47

IV. RESULTS AND DISCUSSIONS………………...... 49

4.1 A Case Study…………………………………………………………... 49

4.2 Results…………………………………………………………………. 53

4.3 Discussion……………………………………………………………... 55

V. CONCLUSIONS AND FUTURE WORK…...... 56

5.1 Conclusion……………………………………………………………... 56

5.2 Future Work…………………………………………………………… 56

REFERENCES…...... …...... 58

APPENDICES………………………………………………………………………... 61

APPENDIX A COPYRIGHT PERMISSION FOR FIGURE 1.2……... 62

APPENDIX B PERL SCRIPT FOR T-TEST - TTEST.PL…………… 63

APPENDIX C CLASSIFICATION ALGORITHMS…………………. 66


LIST OF TABLES

Table Page

1.1 Confusion matrix for a 2-class problem ……………………...... 9

1.2 Software used ……………………...... 16

2.1 List of microarray tools ...…………...... 22

2.2 Available source for microarray data…………………………………………………. 25

3.1 Tables used in MAT …………………………………………………...... 26

3.2 Attributes and their description………………………………………………... 29

3.3 List of default choices for feature selection…………………………………… 40

4.1 Training and testing samples – Experiment 1………………………………… 53

4.2 Accuracy of three classification methods for different N features…………….. 53

4.3 Training and testing samples – Experiment 2………………………………… 54

4.4 Accuracy of three classification methods for different N features…………….. 54


LIST OF FIGURES

Figure Page

1.1 Schematic view of a typical microarray experiment…………………………. 3

1.2 Genepix experimental procedure……………………………………………. 4

1.3 Overview of KDD process...... 7

1.4 Mapping an input attribute set x into its class label y………………………... 8

1.5 A decision tree for the mammal classification problem……………………... 11

1.6 Classifying an unlabeled vertebrate………………………………………….. 12

1.7 Schematic representation of k-NN classifier………………………………… 13

1.8 System diagram …...... 14

1.9 Application flow diagram……………………………………………………. 15

2.1 Sketch of the ProGene algorithm…………………………………………………… 19

3.1 Database schema…………………………………………………………….. 26

3.2 Genepix_version table design………………………………………………... 27

3.3 Genepix_header table design……………………………………………….... 28

3.4 Genepix_sequence table design……………………………………………… 28

3.5 Hypothetical arrays of blocks………………………………………………... 31

3.6 Sample dataset……………………………………………………………….. 32


3.7 Creation of repository………………………………………………………... 33

3.8 Selection of datasets………………………………………………………….. 34

3.9 Temporary table names for training and testing datasets ……………………. 35

3.10 Flowchart – Creation of dataset……………………………………………… 36

3.11 Replication of gene…………………………………………………………... 37

3.12 Sample training dataset with median intensity values……………………….. 37

3.13 Preprocessing in MAT……………………………………………………… 39

3.14 T-Test formulas……………………………………………………………… 42

3.15 Calculated p-values for the genes……………………………………………. 44

3.16 Pseudo code of kNN classical method……………………………………….. 45

3.17 Pseudo code of kNN mean method…………………………………………... 48

4.1 Training samples selected for the experiment……………………………….. 50

4.2 Testing samples selected for the experiment………………………………… 51

4.3 Attribute selection and constraint specification for normalization………… 51

4.4 Training datasets……………………………………………………………... 52

4.5 Testing datasets………………………………………………………………. 52

4.6 Feature selection and normalization………………………………………… 52


CHAPTER I

INTRODUCTION

1.1 Introduction to Bioinformatics

The central dogma of molecular biology is that DNA (deoxyribonucleic acid) acts as a template to replicate itself, DNA is transcribed into RNA, and RNA is translated into protein. DNA is the genetic material; it represents the answer to a question researchers and scientists asked for years: "What is the basis of inheritance?" The information stored in DNA allows the organization of inanimate molecules into functioning, living cells and organisms that are able to regulate their internal chemical composition, growth, and reproduction [1]. This is what allows us to inherit our parents' features, for example their curly hair or their nose. The units that govern those characteristics at the genetic level are called genes. The term bioinformatics refers to the use of computers to retrieve, process, analyze, and simulate biological information. Bioinformatics has driven a large body of research and has proven itself in the diagnosis, classification, and discovery of many aspects that lead to diseases. Although bioinformatics began with sequence comparison, it now encompasses a wide range of activity in modern scientific research. It requires mathematical, biological, physical, and chemical knowledge, and its implementation may furthermore require knowledge of computer science.


1.2 Introduction to Microarray Technology

A DNA microarray is an orderly arrangement of tens to hundreds of thousands of DNA fragments (probes) of known sequence. It provides a platform for probe hybridization to radioactively or fluorescently labeled cDNAs (targets). The intensity of the radioactive or fluorescent signals generated by the hybridization reveals the level of the cDNAs in the biological samples under study. Figure 1.1 shows the major processes in a typical microarray experiment. Microarray technology has been widely used to investigate gene expression levels on a genome-wide scale [1, 2, 5, 10]. It can be used to identify the genetic changes associated with diseases, drug treatments, or stages in cellular processes such as apoptosis or the cycle of cell growth and division [10]. The scientific tasks involved in analyzing microarray gene expression data include the identification of co-expressed genes, discovery of sample or gene groups with similar expression patterns, study of gene activity patterns under various stress conditions, and identification of genes whose expression patterns are highly discriminative for differentiating biological samples.

Microarray platforms include Affymetrix GeneChips, which use presynthesized oligonucleotides as probes, and cDNA microarrays, which use full-length cDNAs as probes. The array experiment uses slides or blotting membranes, usually containing thousands of spots, with spot sizes typically less than 200 microns in diameter. The spotted samples are known as probes. The spots can be DNA, cDNA, or oligonucleotides [2]. These are used to determine complementary binding of the unknown sequences, thus allowing parallel analysis for gene expression and gene discovery. An orderly arrangement of probes is important, as the location of each spot on the array is used for the identification of a gene. The diagram of the microarray experiment is shown in Figure 1.1.


Figure 1.1 Schematic view of a typical microarray experiment.

The current study uses microarray datasets generated through cDNA microarray experiments. The arrays were scanned using GenePix Pro. The following section explains the experimental procedure for the creation of the datasets.

1.2.1 Genepix Experimental Procedure

GenePix Pro is an automatic microarray slide scanner. It automatically loads slides, scans them, performs analysis, and saves the results, and it can accommodate up to 36 slides. The autoloader accommodates microarrays on micro slides labeled with up to four fluorescent dyes. These microarrays can contain from a few hundred to a few thousand spots, representing an entire genome.


Figure 1.2 Genepix experimental procedure [Copyright – Appendix A]

When the slide carrier is inserted into the scanner, sensors detect the position of the slides. The software is used to select the slides to be scanned. A graphical representation of the slides is shown on the screen for user selection, which makes it easier for the user to identify each slide. For each slide, or for a group of slides, we can specify the settings for the experiment. We can also choose the automatic analysis option in the software. If an email address is specified in the settings, the results are sent to that address when the experiment is done.

The robotic arm takes the first slide from the slide carrier, scans the bar code on the slide, and positions the slide for scanning. Genepix can be configured with four lasers. A laser power wheel is used to adjust the laser strength for especially bright samples. The laser excitation beam is delivered to the surface of the microarray slide and scans rapidly across the axis of the slide. As the robotic arm slowly moves the slide, the fluorescent signals emitted from the sample are collected by a photomultiplier tube. Sensors detect any non-uniformity in the slide surface, and the robotic arm is used to adjust the focus of the scan. Each channel is scanned sequentially, and the developing images are displayed on the monitor. The multichannel TIFF images are saved automatically according to file naming conventions specified by the user.

Once the scan has been completed, the robotic arm replaces the slide in the carrier and repeats the process for the other slides selected from the tray. Genepix automatically finds the spots, calculates up to 108 measures, and saves the results as GPR files. If the experiment is conducted with a single channel, the number of measures will be about 50; otherwise it will be between 50 and 108.

1.3 Applications of Microarrays

Having covered the basic workings of microarrays, we can now explore the different applications of microarray technology.

Gene discovery: Microarray technology helps in the identification of new genes and in learning about their functions and expression levels under different conditions.

Disease diagnosis: Microarray technology helps researchers learn more about different diseases such as heart disease, mental illness, infectious disease, and especially cancer. Different types of cancer have traditionally been classified on the basis of the organs in which the tumors develop. With the help of microarray technology, it will be possible for researchers to further classify the types of cancer on the basis of the patterns of gene activity in the tumor cells. This will help the pharmaceutical community develop more effective drugs, as treatment strategies will be targeted directly to the specific type of cancer.

Drug discovery: Pharmacogenomics is the study of correlations between therapeutic responses to drugs and the genetic profiles of patients [2]. Comparative analysis of the genes from diseased and normal cells helps identify the biochemical constitution of the proteins synthesized by the diseased genes. Researchers can use this information to synthesize drugs that combat these proteins and reduce their effect.

Toxicological research: Microarray technology provides a robust platform for research on the impact of toxins on cells and their passing on to the progeny [2]. Toxicogenomics establishes correlations between responses to toxicants and the changes in the genetic profiles of the cells exposed to such toxicants [2].

1.4 Need for Automated Analysis

The intrinsic problems of a typical data set produced by microarrays are the small sample size and the high dimensionality of the data. The datasets created by GenePix Pro contain various measures for thousands of genes, and there is no way to analyze the samples manually. In this study we propose a microarray analysis tool (MAT) that supports both existing and new methods for classification and class discovery. The tool follows the knowledge discovery in data (KDD) steps, which are explained in detail in the forthcoming section.

1.5 Knowledge Discovery in Data

The term knowledge discovery in data (KDD) refers to the process of finding knowledge in data through the application of particular data mining methods. It involves the evaluation and interpretation of the discovered patterns, known as knowledge. The unifying goal of the KDD process is to extract useful information from large databases. An overview of the KDD process is shown in Figure 1.3.

Figure 1.3 Overview of KDD process


1.5.1 KDD Steps

Data selection uses knowledge of the application domain to select the datasets relevant to the problem to be solved. The preprocessing step removes unwanted data from the database and finds strategies for filling in missing fields in the dataset. Transformation is the process of transforming data from one type to another; in this step we find useful features to represent the data, depending on the goal of the task, and normalize the dataset. In the data mining step we decide on the algorithms suitable for the study. The current study is mainly about classification, and hence we choose the classification algorithms to be implemented in this step. Interpretation and evaluation is the step in which the model is assessed: the model is tested with the test samples and the accuracy of the prediction is calculated.

1.6 Classification

Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y [4].


Figure 1.4 Mapping an input attribute set x into its class label y [4]


The input data for the classification model is a collection of records. Each record is characterized by a tuple (x, y), where x is the attribute set and y is a special attribute, designated as the class label. A classification model can also serve as an explanatory tool to distinguish between objects of different classes.

1.6.1 General Approach

Several approaches are used to build classification models, including decision trees, neural networks, kNN classifiers, and others. Each approach has a learning algorithm which creates a model based on the given input attribute set. The model generated by the learning algorithm should both fit the input data well and correctly predict the class labels of records it has never seen before.

The training set consists of records whose class labels are known. The classification model is built using the training set, and the model is then applied to the test data with unknown class labels. The evaluation of the classification model is done using a confusion matrix.

Table 1.1 Confusion matrix for a 2-class problem [4]

                            Predicted Class
                            Class = 1    Class = 0
Actual Class   Class = 1       f11          f10
               Class = 0       f01          f00


Each entry fij in this table denotes the number of records from class i predicted to be of class j. For instance, f01 is the number of records from class 0 incorrectly predicted as class 1. Based on the entries in the confusion matrix, the total number of correct predictions made by the model is (f11 + f00) and the total number of incorrect predictions is (f10 + f01). Accuracy is calculated using Eq. (1.1) and the error rate is calculated using Eq. (1.2).

$$\text{Accuracy} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}} \qquad (1.1)$$

$$\text{Error rate} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}} \qquad (1.2)$$
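To make the arithmetic concrete, here is a minimal sketch in Perl (the language of MAT's analysis scripts, though this snippet is illustrative and not part of MAT) that evaluates Eqs. (1.1) and (1.2) for a hypothetical confusion matrix:

#!/usr/bin/perl
use strict;
use warnings;

# Illustrative confusion-matrix counts f_ij: records of actual class i
# predicted as class j (hypothetical values, not from the thesis data).
my %f = ( '11' => 30, '10' => 5, '01' => 3, '00' => 42 );

my $total    = $f{'11'} + $f{'10'} + $f{'01'} + $f{'00'};
my $accuracy = ( $f{'11'} + $f{'00'} ) / $total;   # Eq. (1.1)
my $error    = ( $f{'10'} + $f{'01'} ) / $total;   # Eq. (1.2)

printf "Accuracy: %.3f  Error rate: %.3f\n", $accuracy, $error;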

1.6.2 Decision Trees

In data mining, a decision tree is a predictive model; that is, a mapping from observations about an item to conclusions about its target value [1]. More descriptive names for such tree models are classification tree or regression tree. In these tree structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications. The machine learning technique for inducing a decision tree from data is called decision tree learning or decision trees. The tree has three types of nodes [4].

• A root node that has no incoming edges and zero or more outgoing edges.

• Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.

• Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.

Figure 1.5 A decision tree for the mammal classification problem [4].

In the decision tree, each leaf node is assigned a class label. The non-terminal nodes, which include the root and other internal nodes, contain attribute test conditions to separate records that have different characteristics. For example, the root node shown in Figure 1.5 uses the attribute Body Temperature to separate warm-blooded from cold-blooded vertebrates. Since all cold-blooded vertebrates are non-mammals, a leaf node labeled Non-mammals is created as the right child of the root node. If the vertebrate is warm-blooded, a subsequent attribute, Gives Birth, is used to distinguish mammals from other warm-blooded creatures, which are mostly birds.

Classifying a test record is straightforward once a decision tree has been constructed. Starting from the root node, we apply the test condition to the record and follow the appropriate branch based on the outcome of the test. This leads us either to another internal node, for which a new test condition is applied, or to a leaf node. The class label associated with the leaf node is then assigned to the record. As an illustration, Figure 1.6 traces the path in the decision tree that is used to predict the class label of a flamingo. The path terminates at a leaf node labeled Non-mammals.

Figure 1.6 Classifying an unlabeled vertebrate [4].
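For readers who prefer code to diagrams, the following short Perl sketch (illustrative only, not part of MAT) hard-codes the two test conditions of the tree in Figure 1.5 and classifies a flamingo record exactly as the trace in Figure 1.6 does:

#!/usr/bin/perl
use strict;
use warnings;

# Classify a vertebrate with the decision tree of Figure 1.5:
# root tests Body Temperature, then Gives Birth for warm-blooded records.
sub classify_vertebrate {
    my ($record) = @_;
    return 'Non-mammal' if $record->{body_temperature} eq 'cold';
    return $record->{gives_birth} eq 'yes' ? 'Mammal' : 'Non-mammal';
}

# A flamingo is warm-blooded but lays eggs, so the path ends at Non-mammal.
my %flamingo = ( body_temperature => 'warm', gives_birth => 'no' );
print classify_vertebrate( \%flamingo ), "\n";   # prints "Non-mammal"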

1.6.3 k-Nearest Neighbor Classifiers

The k nearest neighbor method is a simple machine learning algorithm used for classification based on the training samples in the feature space. In this method, the target object is classified by the majority vote of its neighbors and is assigned to the class to which most of its neighbors belong (Figure 1.7). For the purpose of identifying neighbors, objects are represented by position vectors in a multidimensional feature space. The k training samples that are most similar to the attributes of the test sample are found; these are considered the nearest neighbors and are used to determine the class label of the test sample. The distance between samples x and y can be calculated using the Euclidean distance, Eq. (1.3), the Manhattan distance, Eq. (1.4), or other distance measures.

Euclidean distance:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (1.3)$$

Manhattan distance:
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i| \qquad (1.4)$$

where x_i is the expression level of gene i in sample x, y_i is the expression level of gene i in sample y, and n is the number of genes whose expression values are measured.
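Both distance measures reduce to a few lines of code. The Perl sketch below is illustrative (it is not taken from MAT's classify.pl) and assumes the two samples are passed as references to equal-length arrays of expression values:

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# Euclidean distance, Eq. (1.3): square root of summed squared differences.
sub euclidean {
    my ( $x, $y ) = @_;
    return sqrt( sum( map { ( $x->[$_] - $y->[$_] )**2 } 0 .. $#$x ) );
}

# Manhattan distance, Eq. (1.4): summed absolute differences.
sub manhattan {
    my ( $x, $y ) = @_;
    return sum( map { abs( $x->[$_] - $y->[$_] ) } 0 .. $#$x );
}

my @a = ( 1.2, 3.4, 0.5 );
my @b = ( 0.8, 2.9, 1.0 );
printf "Euclidean: %.3f  Manhattan: %.3f\n",
    euclidean( \@a, \@b ), manhattan( \@a, \@b );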

Figure 1.7 Schematic representation of k-NN classifier


1.7 Outline of the Current Study

The objective of this study is to create a database repository to store different microarray datasets and a microarray analysis tool (MAT) which can be used for the analysis of gene expression. The tool has been designed to follow the KDD steps. The database repository currently accepts Genepix datasets, although it can be expanded to include other formats. The analysis methods implemented include three different kNN methods: classical kNN, weighted kNN, and mean kNN. The system diagram, application flow diagram, and software used are shown below.

Figure 1.8 System diagram: the C++ user interface presents the screens for the data mining process; the SQL Server 2005 database runs dynamic scripts for the creation of the training and testing datasets; and text files following a defined schema (the training and testing datasets, the input file for the t-test, and the cls file identifying the type of the samples) are consumed by the Perl scripts for feature selection and classification.


Figure 1.9 Application flow diagram


Table 1.2 Software used

Design and Script              | Platform
User Interface                 | C++ (Visual Studio 2008)
Database                       | SQL Server 2005
Classification algorithms      | Perl
Feature selection algorithms   | Perl
Dataset creation               | SQL Server stored procedures


CHAPTER II

LITERATURE REVIEW

2.1 Previous Work

This chapter discusses previous work on classification techniques and gives details about standard microarray databases.

Molecular classification of cancer – Class discovery and class prediction by gene expression monitoring

One of the first sample classification studies using microarray data was done by Golub [10]. The initial leukemia data set consisted of 38 bone marrow samples (27 ALL, 11 AML) obtained from acute leukemia patients at the time of diagnosis [10]. RNA prepared from bone marrow mononuclear cells was hybridized to high-density oligonucleotide microarrays, produced by Affymetrix, containing probes for 6817 human genes [10]. For each gene a quantitative expression level was obtained. Samples were subjected to a priori quality control standards regarding the amount of labeled RNA and the quality of the scanned microarray image [10].

In the study, a generic approach for cancer classification based on gene expression monitoring by DNA microarrays was applied to human acute leukemia as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrated the feasibility of cancer classification based solely on gene expression monitoring and suggest a strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.

Improving classification of microarray data using prototype-based feature selection

This study of improving accuracy in the machine-learning task of classification from microarray data was done by Blaise Hanczar [11]. One of the known issues specifically related to microarray data is the large number of genes versus the small number of available samples, so the most important task is to identify the genes that are most relevant for classification. Classical feature selection methods are based on the notion of a prototype gene; each prototype represents a set of similar genes according to a given clustering method. Experimental evidence of the usefulness of combining prototype-based feature selection with statistical gene selection methods for the task of classifying adenocarcinoma from gene expressions was presented [11]. To improve the accuracy of a machine learning based classifier algorithm, reduction methods play a key role. A somewhat original dimension reduction method, which experimentally increases the classification accuracy of a support vector machine based classifier, was developed. Although the performance gain is comparable to that of classical reduction methods, combining them outperforms both methods.


The dimension reduction technique follows two steps. The first is to identify equivalence classes inside the gene space with respect to a given criterion, be it the gene expression, the known gene function, or any biologically relevant criterion [11]. The second step is to create gene prototypes that are good representatives of these classes [11]. The classification task is performed using one or more prototype genes that have been computed by an aggregation of the genes that best represent each class. A sketch of the algorithm is shown below.

1. CM <- Select method of clustering (default: k-means)
2. NBCLUST <- Select the desired number of clusters
3. For each iteration of the cross-validation:
   3.1 Define the train and test datasets.
   3.2 Form NBCLUST clusters of genes on the train set using CM.
   3.3 For each cluster C_u:
       3.3.1 Build prototype P_u <- mean of this cluster.
   3.4 Model <- training of SVM using the prototypes.
   3.5 accuracy <- prediction on the test set.
4. Compute the average accuracy.

Figure 2.1 Sketch of the ProGene algorithm [11].

Improved Gene Selection for classification of microarrays

This study of deriving methods for improving techniques for selecting informative genes from microarray data was done by J. Jaeger, R. Sengupta, and W. L. Ruzzo [12].

Genes of interest are typically selected by ranking genes according to a test statistic and then choosing the top k genes. A problem with this approach is that many of these genes are highly correlated; for classification purposes it would be ideal to have distinct but still highly informative genes. Three different pre-filter methods - two based on clustering and one based on correlation - were proposed to retrieve groups of similar genes. To these groups a test statistic was then applied to finally select the genes of interest. This filtered set of genes can be used to significantly improve existing classifiers.

2.2 Existing Tools for Normalizing GPR Datasets

There are various tools available online for analyzing GPR datasets. A few among them are GProcessor, GPR Normalizer, and the Microbial Diagnostic Array Workstation (MDAW).

GenePix Pro's built-in normalization method is simple linear normalization. The label effect caused by the two-channel experiment cannot be resolved by simple linear normalization, so GProcessor provides the option of user-defined normalization conditions. The most efficient nonlinear normalization method that can deal with the label effect is the lowess fit method, originally proposed by William S. Cleveland. Another method to analyze microarray data is the analysis of variance method. GProcessor uses these methods to perform the normalization.

GPR Normalizer performs preprocessing of the raw data, which includes background correction and normalization of the raw intensities. The statistical analyses are done using the Bioconductor package (lemma).

The Microbial Diagnostic Array Workstation is a web server for diagnostic array data storage, sharing, and analysis. It is not platform dependent, and GPR datasets can be analyzed by uploading them directly.


2.3 Stanford Microarray Database (SMD)

The Stanford Microarray Database serves as a microarray database for researchers and collaborators. It allows public login for data viewing and analysis. In addition, SMD functions as a resource for the entire scientific community by allowing users to download datasets, perform analyses, download source code, and use the various available tools to explore and analyze the data. The number of publicly accessible arrays is increasing by about 1000 per year. These data include experiments on twelve distinct organisms, including Homo sapiens, Caenorhabditis elegans, Arabidopsis thaliana, Drosophila melanogaster, and others [9]. SMD provides users the options of selecting experimental data, assessing data quality, filtering by individual spot characteristics and by expression pattern, and analyzing data using clustering techniques [9]. SMD's software is open source.

SMD's database server is currently an eight-processor Sun V880 with 32 GB of RAM installed [9]. The software used includes: database management system - SMD uses Oracle Server Enterprise Edition version 9i (9.2.0.1.0); system software - the machine on which SMD resides currently runs SunOS 5.9; other software - Perl 5.004_04 or later.

2.4 Microarray Tools

A list of available tools to work with microarray datasets is given in Table 2.1.

Each tool performs different tasks for data mining.


Table 2.1 List of microarray tools [9]

Software from other sources

Program | Description | Provider | Platform
Array Designer [13] | Tool assisting in primer design for microarray construction | Premier Biosoft International | JAVA
ArrayMiner [14] | Set of analysis tools using advanced algorithms to reveal the true structure of gene expression data | Optimal Design, Sprl. | Windows, MacOS
ArrayViewer [15] | Identification of statistically significant hybridization signals | National Human Genome Research Institute | JAVA
BAGEL [16] | Bayesian Analysis of Gene Expression Levels: a program for the statistical analysis of spotted microarray data | University of Connecticut | MacOS, Windows, Linux
BASE [17] | Microarray database and analysis platform | Lund University | Web
Cluster 3.0 [18] | An enhanced version of Mike Eisen's Cluster | University of Tokyo, Japan | UNIX, Linux, MacOS, Windows
Expression Profiler [19] | Analysis & clustering of gene expression data | European Bioinformatics Institute (EBI) | Web
GEDA [20] | Gene expression data analysis and simulation tools, offering a variety of options for processing and analyzing results | University of Pittsburgh and UPMC | Web
GeneCluster [21] | Self-organizing maps | Whitehead Institute/MIT Center for Genome Research | JAVA, Windows NT
GenMAPP [22] | Tools for visualizing data from gene expression experiments in the context of biological pathways | Conklin lab; Gladstone Institute & the UCSF | Windows


Table 2.1 List of microarray tools [9] (Continued)

GeneSifter [23] | Microarray data analysis system providing access to powerful statistical tools through a web interface, with integrated features for determining the biological significance of the data; works with any array format and is especially optimized for Affymetrix GeneChip users; free trial accounts available | GeneSifter | Web
GeneX [24] | Gene Expression Database: integrated toolset for data analysis and comparison | National Center for Genome Resources | Windows, Linux, SunOS/Solaris
GenMaths [25] | Analysis of high density microarrays and gene chips | Applied Maths | Windows
Genowiz™ [26] | A gene expression data analysis and management tool | Ocimum Biosolutions | Windows, Macintosh, Unix, Linux, Solaris
Partek Pattern Recognition [27] | Extracting and visualizing patterns in large multivariate data | Partek Incorporated | Linux, Unix, Windows
TIGR MultiExperiment Viewer [28] | Analysis and visualization of microarray data | TIGR | JAVA
TreeArrange and Treeps [29] | Software for displaying and manipulating hierarchically clustered data | University of Waterloo, Canada | Linux, Unix, Windows


2.5 Available Source for Microarray Data

A few sources for microarray data which are available online are tabulated below. These sources provide different types of microarray datasets produced by different biological experiments.

Table 2.2 Available source for microarray data

Name | URL
National Center for Biotechnology Information | http://www.ncbi.nlm.nih.gov/geo/
Stanford Microarray Database | http://genome-www5.stanford.edu/
University of Pittsburgh Microarray Dataset Collection | http://bioinformatics.upmc.edu/Help/UPITTGED.html
Kent Ridge Bio-medical Data Set Repository | http://sdmc.lit.org.sg/GEDatasets/Datasets.html


CHAPTER III

MATERIALS AND METHODS

3.1 Database Design

A database is a structured collection of records or data stored in a computer system. The structure is achieved by organizing the data according to a database model. The model in most common use today is the relational model. Other models, such as the hierarchical model and the network model, use a more explicit representation of relationships. A computer-based database relies upon software to organize the storage of data; this software is known as a database management system. This section describes the database schema design of MAT and the table details.

3.1.1 Schema Design

The schema of a database system is its structure described in a formal language supported by the database management system. In a relational database, the schema defines the tables, the fields in each table, and the relationships between fields and tables. The levels of database schema can be divided into conceptual schema, logical schema, and physical schema. A conceptual schema is a map of concepts and their relationships. A logical schema is a map of entities and their attributes and relations. A physical schema is a particular implementation of a logical schema. The diagram of the database schema for MAT is shown below.

Figure 3.1 Database schema

3.1.2 Table Details

The tables used in MAT and their usage are shown in the table below.

Table 3.1 Tables used in MAT

Table Name | Usage
Genepix_header | Stores the header information about the samples.
Genepix_version | Stores the gpr dataset record sets apart from the header information.
Genepix_sequence | Stores the gpr file names that are transferred to the repository.
Training_data_bank | Stores the samples for creation of the training dataset.
Testing_data_bank | Stores the samples for creation of the testing dataset.


The column names, field types, and lengths of the attributes for the three Genepix tables are shown in Figures 3.2 - 3.4.

Figure 3.2 Genepix_version table design


Figure 3.3 Genepix_header table design

Figure 3.4 Genepix_sequence table design

3.1.3 Attributes

We are using datasets generated by the GenePix Pro kit. Once a microarray image is analyzed using the kit, the results are saved in the GPR format. GPR files have approximately 50 different attributes, including feature intensities, background intensities, ratio types, sums of various sorts, threshold parameters, several different variations of these attributes, etc. To use all these data wisely, we need to know basic facts about the microarray attributes. The current database design supports Genepix formats; it can be further extended to any type of microarray dataset by adding tables to the schema and creating the relationships between the tables. A detailed description of the data types is given in the next section.


3.2 Description of Genepix Data Format

An explanation of the attributes of the sample file, and the biological relevance of a few of them, is given in Table 3.2.

Table 3.2 Attributes and their description

Attribute | Description
Block | A block is the unit that consists of a set number of rows and columns of arrayed spots. A block corresponds to a single pin in the array printer.
Column, Row | The column or row number of a feature.
Name, ID | The name and ID associated with each spot on the microarray. The Genepix Pro kit uses a GAL file for creation of the name and ID. If a GAL file is not used in the experiment, the Name column will be assigned null values or will contain the text "Blank". The names in the GAL file are created by the authors and the ID is taken from the GenBank database. The ID is used as the unique identifier in the analysis experiment [6].
X, Y | The physical location of the feature on the slide.
Diameter | The diameter of the feature-indicator ring, in microns. If the diameter exceeds a set limit, the feature is considered problematic.
F635 Median, F635 Mean, F532 Median, F532 Mean | The median or mean intensity values, without background subtraction, at both wavelengths for all the pixels that fall within the feature-indicator ring.
F635 SD, F532 SD | The standard deviation of the intensity values at wavelength 1 or 2 of all pixels that fall within the feature-indicator ring.


Table 3.2 Attributes and their description (Continued)

B635 Median, B635 Mean, B635 SD, B532 Median, B532 Mean, B532 SD | The median, mean, or standard deviation of the background intensity values at wavelength 1 or 2 of all pixels that meet the criteria to be assigned as background pixels for a given feature. The background intensities are used to judge the quality of the microarray; if the background intensity is high, we can assume that a high voltage has been used [6].
% > B635+1SD, % > B635+2SD, % > B532+1SD, % > B532+2SD | The percentage of feature pixels at wavelength 1 or 2 that have intensity values greater than 1 or 2 SD above the median background intensity value [6].
F635 % Saturated, F532 % Saturated | The percentage of feature pixels at wavelength 1 or 2 that have the maximum 16-bit intensity value of 65535.
Ratio of Medians, Ratio of Means | The 635 nm/532 nm ratio of the median or mean intensity values, calculated from the whole feature.
Median of Ratios, Mean of Ratios | The 635 nm/532 nm ratios are calculated for each pixel within the feature, and a median or mean value of these ratios is returned.
Ratios SD | The standard deviation of the intensity ratios of all pixels within a feature. This parameter is derived from the ratio values computed on a pixel-by-pixel basis.
Rgn Ratio | Plotting the fluorescence intensity values for each pixel against each other and finding the ratio gives the Rgn Ratio. This parameter does not use the background subtraction method [6].
Rgn R2 | The square of the correlation coefficient; ranges between 0 and 1.
Sum of Medians, Sum of Means | The sum of the background-subtracted median or mean pixel intensity values for both wavelengths.
F635 Median - B635, F532 Median - B532, F635 Mean - B635, F532 Mean - B532 | The median or mean pixel intensity at wavelength 1 or 2 for the feature, with the median background subtracted.
Flags | The software flags some features based on the default conditions set on the system. The possible values are -100, -75, and -50. Values greater than 0 are considered good features.


3.2.1 Features and Blocks

Figure 3.5 Hypothetical arrays of blocks [13]

Each individual spot in the array is called a feature, and each feature is assigned one feature indicator. A block is a collection of feature indicators; there can be several blocks in a single array. Features were replicated in all the datasets used for the analysis, and the median values were taken when doing the analysis. The number of blocks in the datasets used for the analysis was 32. Features and blocks are highlighted in Figure 3.5.


3.2.2 Sample Dataset

A portion of a sample file created by the Genepix experiment, with the header contents and the attribute details, is shown in Figure 3.6.

Figure 3.6 Sample dataset


3.2.3 Transferring Genepix Dataset to Database

All the samples used in the analysis are in GPR format. The initial step is to transfer the datasets from GPR format to relational format. The user has to select the samples to be transferred to the database; a SQL script does the transformation work. The stored procedure sp_ReadFile, written in SQL Server, receives two parameters, the path and the file name, and transfers the GPR file to the database. The transformation flow is depicted in the flow chart below.

Figure 3.7 Creation of repository
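For illustration, a loader with this signature could be driven from Perl through DBI; the DSN, credentials, file path, and the ODBC call syntax below are assumptions about the local SQL Server 2005 setup, not code shipped with MAT:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Connect to the MAT database (DSN and credentials are placeholders).
my $dbh = DBI->connect( 'dbi:ODBC:MAT', 'user', 'password',
                        { RaiseError => 1 } );

# Hand one GPR file to the repository loader described in the text.
my $sth = $dbh->prepare('{call sp_ReadFile(?, ?)}');
$sth->execute( 'C:\\gpr_samples\\', 'sp1.gpr' );

$dbh->disconnect;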


3.3 Data Selection

The following section explains how the training and testing datasets are created from the user-selected samples.

3.3.1 Creation of Training and Testing Dataset

The training and the testing datasets are created from the sample names selected by the user. The user is given the option of selecting the training and testing datasets. The screen for the selection of datasets is shown in Figure 3.8.

Figure 3.8 Selection of datasets


The user can either select samples by name or randomly select a number of normal and diseased samples. The column "N/D" shows the user whether a sample is normal or diseased. If the user selects, say, four sample files sp1, sp2, sp3, and sp4, the header IDs of all the samples are retrieved from the genepix_header table, and with each retrieved header_id the gene information is transferred to temporary tables. The same procedure is repeated for all the selected samples, so that the original tables are not modified throughout the experiment. If the user wants to change the type of a sample for analysis purposes, this can be done by manually editing the test_cls file in the folder where the classifier algorithm is saved; "0" is considered normal and "1" is considered diseased.

A sample collection of training and testing datasets is created. Samples of the temporary tables in the training and testing datasets are shown in Figure 3.9.

Figure 3.9 Temporary table names for training and testing datasets


The flow chart shown in Figure 3.10 depicts the process of creating the training and testing datasets.

Figure 3.10 Flowchart – Creation of datasets


Of all the variables, the median intensity is selected for the formation of the training and testing datasets. MAT uses the mean of the median intensities if a feature name is replicated in a sample. Figure 3.11 shows the duplication of features in a sample. Using a join query, MAT selects the features that are identical across all the samples and creates the training and testing datasets.

Figure 3.11 Replication of gene
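MAT performs this aggregation with a SQL join; the same idea, sketched here in Perl on hypothetical rows for clarity, is to group the rows by feature name and average the median intensities of the replicates:

#!/usr/bin/perl
use strict;
use warnings;

# Rows of (feature name, median intensity); replicated features repeat.
my @rows = ( [ 'gene_a', 812 ], [ 'gene_a', 790 ], [ 'gene_b', 455 ] );

# Group the intensities by feature name.
my %by_name;
push @{ $by_name{ $_->[0] } }, $_->[1] for @rows;

# Average the replicates for each feature.
for my $name ( sort keys %by_name ) {
    my @v = @{ $by_name{$name} };
    my $mean = 0;
    $mean += $_ / @v for @v;
    printf "%s\t%.1f\n", $name, $mean;
}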

The final training and testing tables have the median intensity attribute from all the samples appended together, as shown in Figure 3.12.

Figure 3.12 Sample training dataset with median intensity values


3.4 Preprocessing

Preprocessing removes unwanted data from the database and finds strategies for filling in missing fields in the dataset. Datasets are the units of analysis. In many cases, the user might want to create a dataset from only a subset of the available data, instead of from every feature on every microarray in an experiment. There are two main reasons for doing this:

1. To remove unreliable data from the dataset, for example data points derived from slides containing defects such as smears.

2. To remove uninteresting data from the dataset, for example features that do not show any interesting behavior, in order to make the analysis task more tractable.

One of the main challenges in microarray analysis is how to translate the quality judgments that we make confidently by eye when looking at microarray images into numerical descriptions that can be applied reproducibly to large numbers of microarrays. The easy way to do this is to make a list of common feature and slide defects. A few of the defects identified are:

• Feature is smeared into neighboring feature.

• Feature is very close to background.

• Feature has a hair or a scratch through it.


• Feature is in pieces.

• Feature is saturated.

• Feature pixels have highly non-uniform intensities.

• Feature has a highly non-uniform background.

Using these lists, constraints can be created and used for the calculation of the weights employed in the classification. The weight calculation procedure is explained in detail later in this chapter.

3.4.1 Preprocessing in MAT

The user has the option of selecting an attribute and setting constraints for that attribute. The screen for preprocessing is shown in Figure 3.13.

Figure 3.13 Preprocessing in MAT


The list of available attributes is shown to the user in the preprocessing step, and user-defined constraints can be given to filter the record sets. Table 3.3 shows the list of default choices for the feature selection criterion.

Table 3.3 List of default choices for feature selection

Attribute | Condition
F635 SD | <= 20000
B635 SD | <= 20000
SNR 635 | >= 0.8
% > B635+1SD | >= 40
F635 Median - B635 | >= 4.5

Apart from these choices, the genes named "Blank" and the attributes having null values are removed from the tables.

Once the constraints are identified and the conditions are given by the user, the conditions are passed as parameters to the stored procedure sp_UpdateFlag, which has the following parameters:

@filter_attributes varchar(3000),
@filter_values varchar(3000),
@sample_name varchar(10),
@flag varchar(25)

Using the filter attributes and the filter values, a dynamic update query is created, and the flags in the datasets are updated if the features satisfy the conditions.
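As with the repository loader, the flag update could be invoked from Perl. The sketch below passes the first few default constraints of Table 3.3 as comma-separated attribute and value lists; this encoding of the two varchar parameters, and the sample name and flag values, are assumptions for illustration, since the thesis does not show the exact call format:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:ODBC:MAT', 'user', 'password',
                        { RaiseError => 1 } );

# Default quality constraints from Table 3.3, encoded as parallel lists.
my $attrs  = 'F635 SD,B635 SD,SNR 635';
my $values = '<= 20000,<= 20000,>= 0.8';

# Sample name and flag value are placeholders for illustration.
my $sth = $dbh->prepare('{call sp_UpdateFlag(?, ?, ?, ?)}');
$sth->execute( $attrs, $values, 'sp1', 'good' );

$dbh->disconnect;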


3.5 Normalization

Normalization is the process of adjusting the experimental data so that the data from a single experiment are as accurate as possible and data from different experiments can be compared to each other [11]. In microarray experiments, this typically involves adjusting the data on a single array, and then adjusting the data across arrays.

One normalization method is based on the premise that most genes on the array will not be differentially expressed, and therefore the arithmetic mean of the ratios from every feature on a given array should be equal to 1. Another method is to choose a subset of the features on an image as control features. It was suggested that all substances change expression levels under different conditions. Therefore, normalization control features should be selected based on their consistent behavior in all experimental conditions used on the arrays, not simply on their historical use as “housekeeping genes” in other molecular biology techniques [11]. For example, the control spots might be such that each is expected to have a ratio of 1. Hence the mean of control features should be 1.

Assuming that variations are uniform across the array, a single normalization factor can be calculated from these features and then applied to the whole array.

In MAT, each sample is normalized such that the geometric mean of the sample signal is fixed to a global median. Since there are thousands of genes in the datasets, the median intensity normalization is done by taking the log values of the median intensities.
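A minimal sketch of this scheme, under our reading of the description above (log2 intensities, with each sample shifted so that the log of its geometric mean equals the global median of those values), is shown below; it is not MAT's exact code:

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# Median intensities per sample (hypothetical values).
my %samples = (
    sp1 => [ 820, 455, 1290 ],
    sp2 => [ 640, 510, 980 ],
);

sub log2 { log( $_[0] ) / log(2) }

# Mean log2 intensity per sample, i.e. the log2 of its geometric mean.
my %center;
for my $name ( keys %samples ) {
    my @logs = map { log2($_) } @{ $samples{$name} };
    $center{$name} = sum(@logs) / @logs;
}

# Global median of the per-sample centers.
my @c = sort { $a <=> $b } values %center;
my $global = @c % 2 ? $c[ $#c / 2 ]
                    : ( $c[ @c / 2 - 1 ] + $c[ @c / 2 ] ) / 2;

# Shift each sample's log2 intensities so its geometric mean hits the target.
for my $name ( keys %samples ) {
    my $shift = $global - $center{$name};
    $_ = log2($_) + $shift for @{ $samples{$name} };
}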


3.6 Feature Selection

Feature selection is the selection of a subset of genes by eliminating genes with little or no predictive information. Feature selection can improve the comprehensibility of the classifier models and help build a model that generalizes better to unseen samples.

3.6.1 Student T-Test

The t-test assesses whether the means of two groups are statistically different from each other. This analysis is appropriate when we want to compare the means of two groups.

The formula of the t-test is a ratio. The top part of the ratio is the difference between the two means or averages. The bottom part is a measure of the variability or dispersion of the samples [2]. Figure 3.14 presents a graphic view of the formula and the relation between the numerator and the denominator.

Figure 3.14 T-Test formulas


Consider two samples x and y from two populations. If the standard deviations of the two populations are the same, the value of the t-test statistic can be calculated using Eq. (3.1):

$$t = \frac{\bar{x} - \bar{y}}{s_p \sqrt{\dfrac{1}{n} + \dfrac{1}{m}}} \qquad (3.1)$$

where n and m are the sample sizes of x and y,

$$s_p = \sqrt{\frac{(n-1)\,s_x^2 + (m-1)\,s_y^2}{n + m - 2}}$$

is the pooled sample standard deviation, and s_x, s_y are the standard deviations of samples x and y.

If the standard deviations of the two populations are different, the value of the t-test statistic can be calculated using Eq. (3.2):

$$t = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{s_x^2}{n} + \dfrac{s_y^2}{m}}} \qquad (3.2)$$

where n and m are the sample sizes of x and y, and s_x, s_y are the standard deviations of samples x and y.
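For concreteness, a Perl sketch of Eq. (3.1) follows (Perl is the language of ttest.pl, but this is not the Appendix B script); converting t to a p-value with n+m-2 degrees of freedom is left to a statistics library:

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

sub mean { sum(@_) / @_ }

sub var {    # unbiased sample variance
    my $m = mean(@_);
    return sum( map { ( $_ - $m )**2 } @_ ) / ( @_ - 1 );
}

# Pooled two-sample t statistic, Eq. (3.1).
sub pooled_t {
    my ( $x, $y ) = @_;                # array refs: two groups for one gene
    my ( $n, $m ) = ( scalar @$x, scalar @$y );
    my $sp = sqrt( ( ( $n - 1 ) * var(@$x) + ( $m - 1 ) * var(@$y) )
                   / ( $n + $m - 2 ) );
    return ( mean(@$x) - mean(@$y) ) / ( $sp * sqrt( 1 / $n + 1 / $m ) );
}

my @normal   = ( 7.1, 6.8, 7.4, 7.0 );    # hypothetical log intensities
my @diseased = ( 8.2, 8.0, 7.9, 8.5 );
printf "t = %.3f\n", pooled_t( \@normal, \@diseased );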

3.6.2 Implementation of T-Test in MAT

In MAT, equation (3.1) is implemented using Perl. The p-values are calculated from the t-test statistic with n+m-2 degrees of freedom. The "ttest.pl" script (refer to Appendix B for the script) opens the test_data_set.txt file with the normalized data and calculates the p-value for each of the pre-processed genes. The results are saved under the file name "ttest_results.txt". A SQL procedure reads the result file and stores the result set in a table named ttest_results for further processing. A portion of sample output of "ttest_results.txt" is shown in Figure 3.15.

Figure 3.15 Calculated p-values for the genes

3.7 Classification

This section explains the three classification methods implemented in the tool: classical kNN, weighted kNN, and mean kNN.

3.7.1 Classical kNN Method

Once feature selection is done, in the classical kNN method the top N genes and their expression levels in the training dataset and the testing dataset are given to the classifier. The classify.pl script, written in Perl, reads each test sample and calculates its distance to all the training samples. The nearest k samples are considered, and based on the majority vote the sample is identified as normal or diseased. Distance is calculated using Eq. (3.3). The pseudo code of the algorithm is shown in Figure 3.16.

$$d(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2} \qquad (3.3)$$

where x_i, y_i are the median intensities of samples x and y for gene i.

Figure 3.16 Pseudo code of kNN classical method
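A compact Perl rendering of the procedure in Figure 3.16 is given below for illustration (the actual classify.pl ships with MAT); it uses the 0 = normal, 1 = diseased convention of the test_cls file:

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# Each training sample: [ class label (0 normal / 1 diseased), expression vector ].
sub knn_classify {
    my ( $test, $train, $k ) = @_;

    # Euclidean distance of the test vector to every training vector, Eq. (3.3).
    my @scored = map {
        my ( $label, $vec ) = @$_;
        [ $label, sqrt( sum( map { ( $vec->[$_] - $test->[$_] )**2 }
                             0 .. $#$test ) ) ];
    } @$train;

    # Majority vote among the k nearest neighbors.
    my @nearest = ( sort { $a->[1] <=> $b->[1] } @scored )[ 0 .. $k - 1 ];
    my $votes   = sum( map { $_->[0] } @nearest );
    return $votes > $k / 2 ? 1 : 0;
}

my @train = ( [ 0, [ 7.1, 6.9 ] ], [ 0, [ 7.3, 7.0 ] ], [ 1, [ 8.4, 8.1 ] ] );
print knn_classify( [ 7.2, 7.0 ], \@train, 3 ), "\n";   # prints 0 (normal)

With labels encoded as 0/1, the vote reduces to comparing the label sum with k/2; choosing an odd k avoids ties.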


3.7.2 Weighted kNN Method

In this approach, each gene is assigned a weight based on its importance. The lists of defects presented in Section 3.4 are evaluated for each gene's quality using a few attributes (attr_j) such as F635 % Sat., Circularity, SNR 635, % > B635+1SD, and F635 Median - B635 Median.

The steps for the calculation of the weight are shown below.

• Scale the values of the attributes selected as important between 0 and 1 using Eq. (3.4):

$$w_j = \frac{attr_j - \min(attr_j)}{\max(attr_j) - \min(attr_j)} \qquad (3.4)$$

• Calculate the geometric mean of the scaled values using Eq. (3.5):

$$\bar{w} = 2^{\frac{1}{n} \sum_{j=1}^{n} \log_2 w_j} \qquad (3.5)$$

• The final weight for gene i is calculated using Eq. (3.6):

$$w_i = \bar{w} \cdot (-\log p_i) \qquad (3.6)$$

where p_i is the p-value of gene i.

The weighted kNN method is a modification of the classical kNN shown in Figure 3.16. The weights are calculated using the training dataset, and the distance between samples x and y is calculated using Eq. (3.7):

$$d(x, y) = \sqrt{\sum_{i=1}^{N} w_i (x_i - y_i)^2} \qquad (3.7)$$

where x_i is the median intensity of gene i in training sample x, and y_i is the median intensity of gene i in test sample y.
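The weight computation of Eqs. (3.4)-(3.6) can be sketched as follows; the attribute matrix and p-values are hypothetical, and scaling each attribute across all genes is our reading of Eq. (3.4):

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum min max);

# Quality attributes per gene (rows) over a few attributes (columns);
# hypothetical values standing in for SNR 635, % > B635+1SD, etc.
my @attr = ( [ 3.1, 62, 5.0 ], [ 1.2, 35, 2.1 ], [ 4.0, 80, 6.3 ] );
my @pval = ( 0.002, 0.04, 0.0005 );    # per-gene p-values from the t-test

my $n_attr = @{ $attr[0] };
my ( @min, @max );
for my $j ( 0 .. $n_attr - 1 ) {
    $min[$j] = min( map { $_->[$j] } @attr );
    $max[$j] = max( map { $_->[$j] } @attr );
}

for my $i ( 0 .. $#attr ) {
    # Eq. (3.4): scale each attribute to [0,1] over all genes.
    my @w = map { ( $attr[$i][$_] - $min[$_] ) / ( $max[$_] - $min[$_] ) }
            0 .. $n_attr - 1;

    # Eq. (3.5): geometric mean via the mean of log2 values (guard against 0).
    my $gm = 2**( sum( map { log( $_ || 1e-6 ) / log(2) } @w ) / @w );

    # Eq. (3.6): final gene weight combines quality and significance.
    my $weight = $gm * ( -log( $pval[$i] ) );
    printf "gene %d: weight %.4f\n", $i, $weight;
}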

3.7.3 Mean kNN Method

This section explains the steps followed by the mean method and gives its pseudo code. The mean method follows these steps:

• Calculate the distance between the test sample j and each of the training samples using Eq. (3.3).

• Find the mean distance (djn) between the test sample and the normal samples.

• Find the mean distance (djd) between the test sample and the diseased samples.

• If djn < djd, sample j is considered "Normal"; otherwise it is considered "Diseased".

In MAT we have implemented the kNN mean method. The algorithm is implemented in Perl. The pseudo code for the kNN mean method is shown in Figure 3.17.


Figure 3.17 Pseudo code of kNN mean method
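As a textual companion to Figure 3.17 (not the MAT implementation), the mean method fits in a few lines of Perl:

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

sub dist {    # Euclidean distance, Eq. (3.3)
    my ( $x, $y ) = @_;
    return sqrt( sum( map { ( $x->[$_] - $y->[$_] )**2 } 0 .. $#$x ) );
}

# Classify by comparing the mean distance to each class (0 normal, 1 diseased).
sub mean_knn {
    my ( $test, $train ) = @_;
    my ( %sum, %cnt );
    for my $s (@$train) {
        my ( $label, $vec ) = @$s;
        $sum{$label} += dist( $test, $vec );
        $cnt{$label}++;
    }
    my $d_norm = $sum{0} / $cnt{0};
    my $d_dis  = $sum{1} / $cnt{1};
    return $d_norm < $d_dis ? 'Normal' : 'Diseased';
}

my @train = ( [ 0, [ 7.1, 6.9 ] ], [ 0, [ 7.2, 7.1 ] ], [ 1, [ 8.5, 8.2 ] ] );
print mean_knn( [ 8.3, 8.0 ], \@train ), "\n";   # prints "Diseased"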


CHAPTER IV

RESULTS AND DISCUSSIONS

4.1 A Case Study

In this chapter the tool is applied to a case study of forty biological samples. Three experiments were conducted with the samples. The results and discussion of the case study are provided to illustrate the advantages of using the tool and the proposed weighted kNN method. The screen shots for Experiment 1 and the step-by-step process of validating the tables in the database are shown in Figures 4.1 - 4.5.

As we can see, the dataset selection for training shown in Figure 4.1 allows the user to select a subset of samples for training. The next button on the window leads the user to select the samples for testing (Figure 4.2). The next button on the testing sample selection window leads the user to the preprocessing procedure.

Figure 4.3 shows the selection of the attributes and the constraints for the quality checking. When the update button is clicked, a dataset containing the samples selected for training and the samples selected for testing is created. This can be verified by viewing the tables training_data_bank and testing_data_bank. The queries used to view the datasets are

1. Select * from training_data_bank;

2. Select * from testing_data_bank;

The results of the queries for the selected samples are shown in Figures 4.4 and 4.5.

The samples in the dataset are then normalized with the specified normalization constraints. The training dataset can be verified by executing the query: Select * from train_data_set. The query returns the median intensities of all the samples.

Feature selection and classification are done using the feature selection method and the classification method selected by the user. Figure 4.6 shows the screen for feature selection and classification. The p-values are calculated for the genes and saved in the t-test result file. The number of features can be determined by the user. The t-test results are then transferred to the database table. This can be verified by executing the query: Select * from ttest_results.

Figure 4.1 Training samples selected for the experiment


Figure 4.2 Testing samples selected for the experiment

Figure 4.3 Attribute selection and constraint specification for normalization


Figure 4.4 Training data sets

Figure 4.5 Testing data sets

Figure 4.6 Feature selection and classification


4.2 Results

Experiment 1

The samples used for this experiment are tabulated in Table 4.1, and the results of classification by the three methods are tabulated in Table 4.2.

Table 4.1 Training and testing samples – Experiment 1

    Training Samples              Testing Samples
    SP2     Diseased              SP22    Normal
    SP3     Diseased              SP31    Normal
    SP4     Diseased              SP32    Normal
    SP11    Normal                SP33    Normal
    SP12    Normal                SP38    Diseased
    SP13    Normal                SP39    Diseased
    SP21    Normal                SP42    Diseased
    SP37    Normal                SP43    Diseased
    SP40    Diseased
    SP41    Diseased

Table 4.2 Accuracy of the three classification methods for different numbers of top features

                     Top 100 Features   Top 150 Features   Top 50 Features
    Method           Results %          Results %          Results %
    Classical kNN    62.5               62.5               75
    Weighted kNN     75                 62.5               87.5
    Mean kNN         62.5               62.5               62.5


Experiment 2

The samples used for this experiment are tabulated in Table 4.3, and the results of classification by the three methods are tabulated in Table 4.4.

Table 4.3 Training and testing samples – Experiment 2

    Training Samples              Testing Samples
    SP6     Diseased              SP1     Diseased
    SP11    Normal                SP2     Diseased
    SP12    Normal                SP3     Diseased
    SP14    Diseased              SP13    Normal
    SP15    Diseased              SP23    Normal
    SP21    Normal                SP28    Diseased
    SP22    Normal                SP31    Normal
    SP38    Diseased              SP33    Normal
    SP42    Diseased
    SP43    Diseased

Table 4.4 Accuracy of the three classification methods for different numbers of top features

                     Top 25 Features    Top 100 Features   Top 150 Features
    Method           Results %          Results %          Results %
    Classical kNN    75                 37.5               25
    Weighted kNN     87.5               50                 50
    Mean kNN         50                 50                 50


4.3 Discussion

The database repository makes it easier for the user to handle the large sample files. Using the GUI of the repository, users can easily upload their files to the server, and the datasets can be processed through the query analyzer. The dynamic creation of temporary tables leaves the original tables unmodified, and the temporary tables are cleared at the start of each experiment to limit the size of the database, since we are dealing with large datasets.

Preprocessing is done with user-defined constraints, and the user may define any number of them. The flexibility of the preprocessing screen allows the user to try different sets of constraints with different sets of samples.

Constraints for normalization can be chosen through careful study of the biological experiment and knowledge of the importance of the attributes. The case study shows that the proposed weighted kNN approach consistently performs better than the classical and the mean methods. In the first experiment, when the top 100 features were used, weighted kNN achieved an accuracy of 75%, which is 12.5% better than the regular kNN; when the top 50 features were used, the accuracy of weighted kNN was 87.5% compared with 62.5% for the regular kNN; when the top 150 features were used, we did not observe any accuracy difference. In the second experiment, when the top 25 features were used, weighted kNN achieved an accuracy of 87.5%, which is 12.5% better than the regular kNN; when the top 100 features were used, weighted kNN achieved 50% compared with 37.5% for the regular kNN; and when the top 150 features were used, the accuracy of weighted kNN was 50% compared with 25% for the regular kNN.


CHAPTER V

CONCLUSIONS AND FUTURE WORK

5.1 Conclusion

The database repository for microarray data has been successfully created, and the graphical user interface and the analysis tool have been implemented. The software tool allows users to conveniently upload microarray datasets, preprocess them, and analyze them. A case study with forty miRNA microarray samples was conducted with the tool, and the analysis results indicate that the weighted kNN method outperforms the classical kNN method and the mean kNN method. A step-by-step procedure for using the tool is presented in the thesis, and the convenience of using the tool can be readily seen.

5.2 Future Work

The tool currently supports Genepix datasets; it can be extended to other microarray dataset formats by modifying the database schema. Student's t-test is used for feature selection, and other feature selection methods can be added. Three classification methods have been implemented in the tool, and it can be further extended with other classification methods.


Other improvements, such as a visual representation of the distribution of the datasets, can also be included; such a representation is expected to help users select the preprocessing constraints. Currently, a simple normalization method based on geometric means of signal intensities is used in the tool; it can be extended with other normalization methods. A sketch of the geometric-mean idea follows.
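For reference, here is a minimal Perl sketch of one way a geometric-mean-based normalization can work: every sample is rescaled so that its signal intensities share a common geometric mean. This is an illustrative reading of the method named above, not the exact procedure implemented in MAT; the sample names, intensities, and target value are assumptions.

#!/usr/bin/perl
use strict;
use warnings;

# Illustrative data (assumed): median intensities for two samples.
my %samples = (
    SP2  => [520, 310, 875],
    SP22 => [490, 450, 860],
);

# Geometric mean of a list of positive intensities.
sub geo_mean {
    my @x = @_;
    my $log_sum = 0;
    $log_sum += log($_) for @x;
    return exp($log_sum / scalar(@x));
}

my $target = 500;    # assumed common reference value
for my $name (sort keys %samples) {
    my $gm     = geo_mean(@{ $samples{$name} });
    my @scaled = map { $_ * $target / $gm } @{ $samples{$name} };
    printf "%s: geometric mean %.1f -> scaled (%s)\n",
        $name, $gm, join(', ', map { sprintf('%.1f', $_) } @scaled);
}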

The dynamic scripts used in the database can also be modified so that the tool can handle a larger number of samples. Currently, SQL Server limits a variable to 8,000 characters, so problems will arise as the sample size increases. This can be solved by creating a new variable for each group of three samples or by changing the dynamic query creation procedure.



APPENDICES


APPENDIX A

COPYRIGHT PERMISSION FOR FIGURE 1.2

Hello Sudarshan,

You are hereby given permission to use the Axon GenePix Autoloader 4200AL animation and screenshots for your thesis work. Thank you very much for inquiring.

Best wishes,
Varshal

Varshal K. Davé
Director of Marketing
Molecular Devices (www.moldev.com)
(now part of MDS Analytical Technologies)
office: +1 408-747-1700 ext. 3776
fax: +1 408-548-6706
email: [email protected]


APPENDIX B

PERL SCRIPT FOR T-TEST - TTEST.PL

#!/usr/bin/perl -w
use strict;

use constant PI => 3.1415926536;

sub max {
    my ($a, $b) = @_;
    if ($a > $b) { return $a; }
    else         { return $b; }
}

# One-tailed p-value of Student's t distribution with $n degrees of freedom.
sub subtprob {
    my ($n, $x) = @_;
    my ($a, $b);
    my $w = atan2($x / sqrt($n), 1);
    my $z = cos($w) ** 2;
    my $y = 1;
    for (my $i = $n - 2; $i >= 2; $i -= 2) {
        $y = 1 + ($i - 1) / $i * $z * $y;
    }
    if ($n % 2 == 0) {
        $a = sin($w) / 2;
        $b = .5;
    }
    else {
        $a = ($n == 1) ? 0 : sin($w) * cos($w) / PI;
        $b = .5 + $w / PI;
    }
    return max(0, 1 - $b - $a * $y);
}

open INPUT, "< c:\\MicroRNA_APP\\Normal_Method\\testfile.data"
    or die "cannot open file > testfile.data $!";
open OUTPUT, "> c:\\MicroRNA_APP\\Normal_Method\\ttest_results.txt"
    or die "cannot open file > ttest_results.txt $!";

my $n1 = $ARGV[0];    # number of samples in group X; $n1 > 1
my $n2 = $ARGV[1];    # number of samples in group Y; $n2 > 1

# Copy the header line and append the result columns. (The original
# header text was garbled in extraction; the labels below are
# reconstructed from the values printed for each gene.)
my $line = <INPUT>;
chomp($line);
print OUTPUT $line, "\tp_value\tp_value2\tmeanX\tmeanY\n";

while ($line = <INPUT>) {
    chomp($line);
    my @item = split(/\t/, $line);
    print OUTPUT $line;

    my $meanX = my $meanY = 0;
    my $X2 = my $Y2 = 0;

    # Mean and variance of group X (columns 2 .. $n1+1).
    for (my $i = 0; $i < $n1; $i++) { $meanX += $item[$i + 2]; }
    $meanX /= $n1;
    for (my $i = 0; $i < $n1; $i++) {
        $X2 += ($item[$i + 2] - $meanX) * ($item[$i + 2] - $meanX);
    }
    $X2 /= ($n1 - 1);

    # Mean and variance of group Y (the next $n2 columns).
    for (my $i = $n1; $i < $n1 + $n2; $i++) { $meanY += $item[$i + 2]; }
    $meanY /= $n2;
    for (my $i = $n1; $i < $n1 + $n2; $i++) {
        $Y2 += ($item[$i + 2] - $meanY) * ($item[$i + 2] - $meanY);
    }
    $Y2 /= ($n2 - 1);

    # Pooled-variance two-sample t statistic and one-tailed p-values.
    my $sqrtVariance =
        sqrt((($n1 - 1) * $X2 + ($n2 - 1) * $Y2) / ($n1 + $n2 - 2)
             * (1 / $n1 + 1 / $n2));
    my $student  = ($meanX - $meanY) / $sqrtVariance;
    my $p_value  = subtprob($n1 + $n2 - 2, $student);   # tests meanX > meanY
    my $p_value2 = subtprob($n1 + $n2 - 2, -$student);  # tests meanX < meanY

    print OUTPUT "\t$p_value\t$p_value2\t$meanX\t$meanY\n";
}

close INPUT;
close OUTPUT;


APPENDIX C

kNN CLASSICAL METHOD ALGORITHM

open (INPUT, "C:\\MicroRNA_APP\\Normal_Method\\resultstop100.txt"); @lines = ; shift @lines;

$amlvsall=$lines[0]; @ainfos=split /\t/,$amlvsall;

#print $ainfos[1]; #<---aml's start at 1, not 0 shift @lines; chomp @lines;

@twoD=();

$i=0; foreach $line(@lines) { @currentElements=split /\t/, $line; $train_column = $#currentElements+1; for ($j=0;$j<$#currentElements+1;$j++){ $twoD[$i][$j]=$currentElements[$j]; } $i++;

}

open (INPUT2, "C:\\MicroRNA_APP\\Normal_Method\\test_data_set.txt"); @testdata = ; chomp @testdata; shift @testdata;

@twoDtest=();

66

$i=0; foreach $testline(@testdata){ @currentElements=split /\t/, $testline; $test_column = $#currentElements+1; for ($j=0;$j<$#currentElements+1;$j++){ $twoDtest[$i][$j]=$currentElements[$j]; } $i++; }

$test_column_index = 1;

while($test_column_index < $test_column) {

$column_index = 1; $dist = 0; @labeled = (); @temp = (); @min_list = ();

while($column_index < $train_column) { for ($k=0;$k<$#twoD+1;$k++) { $dist = $dist + ($twoD[$k][$column_index] - $twoDtest[$k][$test_column_index])*($twoD[$k][$column_index] - $twoDtest[$k][$test_column_index]);

} $dist = sqrt($dist); @labeled = (@labeled,$dist.":".$ainfos[$column_index]); $dist = 0; $column_index = $column_index + 1; }

@temp = sort {$a <=> $b} @labeled; $A = $temp[0]; $B = $temp[1]; $C = $temp[2];

67

$A = substr($A,index($A,":")+1,length($A)-index($A,":")); $B = substr($B,index($B,":")+1,length($B)-index($B,":")); $C = substr($C,index($C,":")+1,length($C)-index($C,":"));

@min_list = (@min_list,$A,$B,$C);

$N_COUNT = 0; $D_COUNT = 0; for($i=0;$i<3;$i++) { if($min_list[$i] eq "(N)") { $N_COUNT = $N_COUNT + 1; } elsif($min_list[$i] eq "(D)") { $D_COUNT = $D_COUNT + 1; } }

if($N_COUNT > $D_COUNT) { $predict = "(N)" } else { $predict = "(D)" }

@prediction = (@prediction,$predict);

$test_column_index = $test_column_index + 1;

}

open (INPUTCLS, "C:\\MicroRNA_APP\\Normal_Method\\test_cls.cls"); @cls = ; close INPUTCLS; #shift @cls;

@CLS_ALL = split /\s+/,$cls[0]; #Array of 0's and 1's.

68

my $correct = 0; my $Incorrect = 0;

@confusion = (); for($i=0;$i<2;$i++){ for($j=0;$j<2;$j++){ $confusion[$i][$j] = 0; } } for($i=0;$i<$test_column-1;$i++) {

if($CLS_ALL[$i] eq (0)){

if($prediction[$i] eq "(N)"){ $correct = $correct + 1; $confusion[0][0] = $confusion[0][0] + 1; } else { $confusion[0][1] = $confusion[0][1] + 1; $Incorrect = $Incorrect + 1; }

}

if($CLS_ALL[$i] eq (1)){

if($prediction[$i] eq "(D)"){ $correct = $correct + 1; $confusion[1][1] = $confusion[1][1] + 1; } else { $confusion[1][0] = $confusion[1][0] + 1; $Incorrect = $Incorrect + 1;

}

}

#print "Predicted Label:",$prediction[$i]," ","Actual Label:"; #if($CLS_ALL[$i] eq (0)){print "(ALL)"."\n";} #if($CLS_ALL[$i] eq (1)){print "(AML)"."\n";} 69

}

open OUTPUT, "> c:\\MicroRNA_APP\\Normal_Method\\Classification_Results.txt" or die "cannot open file > Classification_Results.txt $!"; print OUTPUT "Normal Method\n"; print OUTPUT $correct,"\n"; print OUTPUT "$Incorrect\n"; print OUTPUT ($correct/($test_column-1)) * 100,"%";

close OUTPUT;

kNN WEIGHTED METHOD ALGORITHM

open (INPUT, "C:\\MicroRNA_APP\\Weighted_Method\\resultstop100.txt"); @lines = ; shift @lines;

$amlvsall=$lines[0]; @ainfos=split /\t/,$amlvsall;

#print $ainfos[1]; #<---aml's start at 1, not 0

shift @lines; chomp @lines;

@twoD=(); @WeightsD=(); @pvalue=();

$i=0; foreach $line(@lines) { @currentElements=split /\t/, $line; $train_column = ($#currentElements/2)+1; for ($j=0;$j<($#currentElements/2);$j++){ $twoD[$i][$j]=$currentElements[$j]; #$WeightsD[$i][$j] = $currentElements[$j+($#currentElements/2)+1]; $pvalue[$i] = $currentElements[$#currentElements]; }

70

#Add the weights of the features to the p-value $t=0; for($k=($#currentElements/2)+1;$k<$#currentElements;$k++) { $WeightsD[$i][$t] = $currentElements[$k]; $t = $t + 1; } $i++; }

open (INPUT2, "C:\\MicroRNA_APP\\Weighted_Method\\test_data_set.txt"); @testdata = ; chomp @testdata; shift @testdata;

@twoDtest=();

$i=0; foreach $testline(@testdata){ @currentElements=split /\t/, $testline; $test_column = $#currentElements+1; for ($j=0;$j<$#currentElements+1;$j++){ $twoDtest[$i][$j]=$currentElements[$j]; } $i++; }

$test_column_index = 1;

while($test_column_index < $test_column) {

$column_index = 1; $dist = 0; @labeled = (); @temp = (); @min_list = (); $wt = 0.01; $dist_ALL = 0; $dist_AML = 0; $mean_ALL = 0; 71

$mean_AML = 0; $weight_index = 0; while($column_index < $train_column) { for ($k=0;$k<$#twoD+1;$k++) { $wt = $wt + ($pvalue[$k] + $WeightsD[$k][$weight_index]); $dist = $dist + ($wt * ($twoD[$k][$column_index] - $twoDtest[$k][$test_column_index]) * ($twoD[$k][$column_index] - $twoDtest[$k][$test_column_index])); $wt = 0.01; } $dist = sqrt($dist); @labeled = (@labeled,$dist.":".$ainfos[$column_index]); $dist = 0; $column_index = $column_index + 1; $weight_index = weight_index + 1;

}

@temp = sort {$a <=> $b} @labeled; $A = $temp[0]; $B = $temp[1]; $C = $temp[2];

$A = substr($A,index($A,":")+1,length($A)-index($A,":")); $B = substr($B,index($B,":")+1,length($B)-index($B,":")); $C = substr($C,index($C,":")+1,length($C)-index($C,":"));

@min_list = (@min_list,$A,$B,$C);

if($min_list[0] eq "(N)") { $predict = "(N)"; } else { $predict = "(D)"; }

@prediction = (@prediction,$predict);

72

$test_column_index = $test_column_index + 1;

}

open (INPUTCLS, "C:\\MicroRNA_APP\\Weighted_Method\\test_cls.cls"); @cls = ; close INPUTCLS; #shift @cls;

@CLS_ALL = split /\s+/,$cls[0]; #Array of 0's and 1's.

$correct = 0; $Incorrect = 0;

@confusion = (); for($i=0;$i<2;$i++){ for($j=0;$j<2;$j++){ $confusion[$i][$j] = 0; } } for($i=0;$i<$test_column-1;$i++) { if($CLS_ALL[$i] eq (0)){

if($prediction[$i] eq "(N)"){ $correct = $correct + 1; $confusion[0][0] = $confusion[0][0] + 1; } else { $confusion[0][1] = $confusion[0][1] + 1; $Incorrect = $Incorrect + 1; }

}

if($CLS_ALL[$i] eq (1)){

if($prediction[$i] eq "(D)"){ $correct = $correct + 1; $confusion[1][1] = $confusion[1][1] + 1; } 73

else { $confusion[1][0] = $confusion[1][0] + 1; $Incorrect = $Incorrect + 1;

}

}

}

open OUTPUT, "> c:\\MicroRNA_APP\\Weighted_Method\\Classification_Results.txt" or die "cannot open file > Classification_Results.txt $!"; print OUTPUT "Weighted Method\n"; print OUTPUT $correct,"\n"; print OUTPUT $Incorrect, "\n"; print OUTPUT ($correct/($test_column-1)) * 100,"%";

kNN MEAN METHOD ALGORITHM

open (INPUT, "C:\\MicroRNA_APP\\Mean_Method\\resultstop100.txt"); @lines = ; shift @lines;

$amlvsall=$lines[0]; @ainfos=split /\t/,$amlvsall;

#print $ainfos[1]; #<---aml's start at 1, not 0 0 = Normal 1 = Diseased

$no_of_D = 0; $no_of_N = 0; for($f=1;$f<$#ainfos;$f++) { if($ainfos[$f] eq "(N)") { $no_of_N = $no_of_N + 1; } elsif($ainfos[$f] eq "(D)") 74

{ $no_of_D = $no_of_D + 1; } } print $no_of_N; print $no_of_D;

shift @lines; chomp @lines;

@twoD=(); @WeightsD=(); @pvalue=();

$i=0; foreach $line(@lines) { @currentElements=split /\t/, $line; #$train_column = ($#currentElements/2)+1; $train_column = $#currentElements+1; for ($j=0;$j<($#currentElements/2)+1;$j++){ $twoD[$i][$j]=$currentElements[$j]; $pvalue[$i] = $currentElements[$#currentElements]; }

$i++;

}

open (INPUT2, "C:\\MicroRNA_APP\\Mean_Method\\test_data_set.txt"); @testdata = ; chomp @testdata; shift @testdata;

@twoDtest=();

$i=0; foreach $testline(@testdata){ @currentElements=split /\t/, $testline; $test_column = $#currentElements+1; for ($j=0;$j<$#currentElements+1;$j++){ $twoDtest[$i][$j]=$currentElements[$j]; } $i++; 75

}

$test_column_index = 1;

while($test_column_index < $test_column) {

$column_index = 1; $dist = 0; @labeled = (); @temp = (); @min_list = (); $wt = 1; $dist_N = 0; $dist_D = 0; $mean_N = 0; $mean_D = 0;

while($column_index < $train_column) { for ($k=0;$k<$#twoD+1;$k++) { $dist = $dist + ($twoD[$k][$column_index] - $twoDtest[$k][$test_column_index]) * ($twoD[$k][$column_index] - $twoDtest[$k][$test_column_index]);

} $dist = sqrt($dist); @labeled = (@labeled,$dist.":".$ainfos[$column_index]);

if($ainfos[$column_index] eq "(N)") { $dist_N = $dist_N + $dist; } else { $dist_D = $dist_D + $dist; }

$dist = 0; $column_index = $column_index + 1;

76

}

$mean_N = $dist_N / $no_of_N; $mean_D = $dist_D / $no_of_D;

print $mean_N , "\n"; print $mean_D;

if($mean_N < $mean_D) { $predict = "(N)"; } else { $predict = "(D)";

}

$dist = 0; $dist_N = 0; $dist_D = 0; $mean_N = 0; $mean_D = 0;

@temp = sort {$a <=> $b} @labeled; $A = $temp[0]; $B = $temp[1]; $C = $temp[2];

$A = substr($A,index($A,":")+1,length($A)-index($A,":")); $B = substr($B,index($B,":")+1,length($B)-index($B,":")); $C = substr($C,index($C,":")+1,length($C)-index($C,":"));

@min_list = (@min_list,$A,$B,$C);

@prediction = (@prediction,$predict);

$test_column_index = $test_column_index + 1;

}

77

open (INPUTCLS, "C:\\MicroRNA_APP\\Mean_Method\\test_cls.cls"); @cls = ; close INPUTCLS; #shift @cls;

@CLS_ALL = split /\s+/,$cls[0]; #Array of 0's and 1's.

$correct = 0; $Incorrect = 0;

@confusion = (); for($i=0;$i<2;$i++){ for($j=0;$j<2;$j++){ $confusion[$i][$j] = 0; } } for($i=0;$i<$test_column-1;$i++) {

if($CLS_ALL[$i] eq (0)){

if($prediction[$i] eq "(N)"){ $correct = $correct + 1; $confusion[0][0] = $confusion[0][0] + 1; } else { $confusion[0][1] = $confusion[0][1] + 1; $Incorrect = $Incorrect + 1; }

}

if($CLS_ALL[$i] eq (1)){

if($prediction[$i] eq "(D)"){ $correct = $correct + 1; $confusion[1][1] = $confusion[1][1] + 1; } else { $confusion[1][0] = $confusion[1][0] + 1; $Incorrect = $Incorrect + 1; 78

}

} } print "\n";

open OUTPUT, "> c:\\MicroRNA_APP\\Mean_Method\\Classification_Results.txt" or die "cannot open file > Classification_Results.txt $!"; print OUTPUT "Mean Method\n"; print OUTPUT $correct,"\n"; print OUTPUT $Incorrect, "\n"; print OUTPUT ($correct/($test_column-1)) * 100,"%";
