Microarray Data Analysis Tool (MAT)
A Thesis
Presented to
The Graduate Faculty of The University of Akron
In Partial Fulfillment
of the Requirements for the Degree
Master of Science

Sudarshan Selvaraja
December, 2008

Thesis Approved:
Advisor: Dr. Zhong-Hui Duan
Committee Member: Dr. Yingcai Xiao
Committee Member: Dr. Xuan-Hien Dang

Accepted:
Department Chair: Dr. Wolfgang Pelz
Dean of the College: Dr. Ronald F. Levant
Dean of the Graduate School: Dr. George R. Newkome
Date: _______________________

ABSTRACT

Microarray technology is widely used by biologists to probe for the presence of genes in a sample of DNA or RNA. With this technology, oligonucleotide probes can be immobilized on a microarray chip in a massively parallel fashion, which allows biologists to measure the expression levels of thousands of genes simultaneously. This thesis develops a software system that includes a database repository to store different microarray datasets and a microarray data analysis tool for analyzing the stored data. The repository currently accepts datasets in GenepixPro format, although it can be expanded to include datasets in other formats. The user interface of the repository allows users to conveniently upload data files and perform their preferred data preprocessing and analysis. The analysis methods implemented include the traditional k-nearest neighbor (kNN) method and two new kNN methods developed in this study. Additional analysis methods can be added by future developers. The system was tested using a set of microRNA gene expression data. The design and implementation of the software tool are presented in the thesis along with the testing results from the microRNA dataset.
The results indicate that the new weighted kNN method proposed in this study outperforms the traditional kNN method and the proposed mean method. We conclude that the system developed in the thesis effectively provides a structured microarray data repository, a flexible graphical user interface, and rational data mining methods.

ACKNOWLEDGEMENTS

I would like to thank my advisor, Dr. Zhong-Hui Duan, for giving me the opportunity to work on this project for my Master's thesis. I was motivated to choose this topic after I took the Introduction to Bioinformatics course. I would like to thank her for her invaluable suggestions and steady guidance during the entire course of the project. I am thankful to my committee members, Dr. Yingcai Xiao and Dr. Xuan-Hien Dang, for their guidance, invaluable suggestions, and time. I would like to thank my friends Shanth Anand and Prashanth Puliyadi for helping me pursue my Master's degree and change my career path. I couldn't have achieved this without their help. I would like to thank my friend Manik Dhawan for his guidance in writing and formatting this report. Finally, I would like to express my gratitude to my parents and all my family members, who were always there for me, cheering me on in all situations, and for their great interest in my venture.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER
I. INTRODUCTION
    1.1 Introduction to Bioinformatics
    1.2 Introduction to Microarray Technology
        1.2.1 Genepix Experiment Procedural
    1.3 Applications of Microarrays
    1.4 Need for Automated Analysis
    1.5 Knowledge Discovery in Data
        1.5.1 KDD Steps
    1.6 Classification
        1.6.1 General Approach
        1.6.2 Decision Trees
        1.6.3 k Nearest-Neighbor Classifiers
    1.7 Outline of the Current Study
II. LITERATURE REVIEW
    2.1 Previous Work
    2.2 Existing Tools for Normalizing GPR Datasets
    2.3 Stanford Microarray Database (SMD)
    2.4 Microarray Tools
    2.5 Available Source for Microarray Data
III. MATERIALS AND METHODS
    3.1 Database Design
        3.1.1 Schema Design
        3.1.2 Table Details
        3.1.3 Attributes
    3.2 Description of Genepix Data Format
        3.2.1 Features and Blocks
        3.2.2 Sample Dataset
        3.2.3 Transferring Genepix Dataset to Database
    3.3 Data Selection
        3.3.1 Creation of Training and Testing Dataset
    3.4 Preprocessing
        3.4.1 Preprocessing in MAT
    3.5 Normalization
    3.6 Feature Selection
        3.6.1 Student T-Test
        3.6.2 Implementation of T-Test in MAT
    3.7 Classification
        3.7.1 Classical kNN Method
        3.7.2 Weighted kNN Method
        3.7.3 Mean kNN Method
IV. RESULTS AND DISCUSSIONS
    4.1 A Case Study
    4.2 Results
    4.3 Discussion
V. CONCLUSIONS AND FUTURE WORK
    5.1 Conclusion
    5.2 Future Work
REFERENCES
APPENDICES
    APPENDIX A COPYRIGHT PERMISSION FOR FIGURE 1.2
    APPENDIX B PERL SCRIPT FOR T-TEST - TTEST.PL
    APPENDIX C CLASSIFICATION ALGORITHMS

LIST OF TABLES

Table
1.1 Confusion matrix for a 2-class problem
1.2 Software used
2.1 List of microarray tools
2.2 Available source for microarray data
3.1 Tables used in MAT
3.2 Attributes and their description
3.3 List of default choices for feature selection
4.1 Training and testing samples – Experiment 1
4.2 Accuracy of three classification methods for different N features
4.3 Training and testing samples – Experiment 2
4.4 Accuracy of three classification methods for different N features

LIST OF FIGURES

Figure
1.1 Schematic view of a typical microarray experiment
1.2 Genepix experimental procedure
1.3 Overview of KDD process
1.4 Mapping an input attribute set x into its class label y
1.5 A decision tree for the mammal classification problem
1.6 Classifying an unlabeled vertebrate
1.7 Schematic representation of k-NN classifier
1.8 System diagram
1.9 Application flow diagram
2.1 Sketch of the ProGene algorithm
3.1 Database schema
3.2 Genepix_version table design
3.3 Genepix_header table design
3.4 Genepix_sequence table design
3.5 Hypothetical arrays of blocks
3.6 Sample dataset
3.7 Creation of repository
3.8 Selection of datasets
3.9 Temporary table names for training and testing datasets
3.10 Flowchart – Creation of dataset
3.11 Replication of gene
3.12 Sample training dataset with median intensity values
3.13 Preprocessing in MAT
3.14 T-Test formulas
3.15 Calculated p-values for the genes
3.16 Pseudo code of kNN classical method
3.17 Pseudo code of kNN mean method
4.1 Training samples selected for the experiment
4.2 Testing samples