Genetic Algorithm-Neural Network: Feature
Total Page:16
File Type:pdf, Size:1020Kb
GENETIC ALGORITHM-NEURAL NETWORK: FEATURE EXTRACTION FOR BIOINFORMATICS DATA DONG LING TONG A thesis submitted in partial fulfilment of the requirements of Bournemouth University for the degree of Doctor of Philosophy July 2010 Copyright This copy of the thesis has been supplied on condition that anyone who consults it is understood to recognise that its copyright rests with its author and due acknowledgement must always be made of the use of any material contained in, or derived from, this thesis. i Abstract With the advance of gene expression data in the bioinformatics field, the questions which frequently arise, for both computer and medical scientists, are which genes are significantly involved in discriminating cancer classes and which genes are significant with respect to a specific cancer pathology. Numerous computational analysis models have been developed to identify informative genes from the mi- croarray data, however, the integrity of the reported genes is still uncertain. This is mainly due to the misconception of the objectives of microarray study. Furthermore, the application of various preprocess- ing techniques in the microarray data has jeopardised the quality of the microarray data. As a result, the integrity of the findings has been compromised by the improper use of techniques and the ill-conceived objectives of the study. This research proposes an innovative hybridised model based on genetic algorithms (GAs) and artificial neural networks (ANNs), to extract the highly differentially expressed genes for a specific cancer pathology. The proposed method can efficiently extract the informative genes from the original data set and this has reduced the gene variability errors incurred by the preprocessing techniques. The novelty of the research comes from two perspectives. Firstly, the research emphasises on extracting informative features from a high dimensional and highly complex data set, rather than to improve classifi- cation results. Secondly, the use of ANN to compute the fitness function of GA which is rare in the context of feature extraction. Two benchmark microarray data have been taken to research the prominent genes expressed in the tumour development and the results show that the genes respond to different stages of tumourigenesis (i.e. different fitness precision levels) which may be useful for early malignancy detection. The extraction ability of the proposed model is validated based on the expected results in the synthetic data sets. In addition, two bioassay data have been used to examine the efficiency of the proposed model to extract significant features from the large, imbalanced and multiple data representation bioassay data. ii Publications Resulting From Thesis 1. Microarray Gene Recognition Using Multiobjetive Evolutionary Techniques (Poster). By D.L. Tong and R. Mintram. In RECOMB'08: 12th Annual International Conference on Research in Computational Molecular Bi- ology, Poster Id: 74, 2008. 2. Hybridising genetic algorithm-neural network (GANN) in marker genes detection (Proceedings). By D.L. Tong. In ICMLC'09: 8th International Conference on Machine Learning and Cybernetics, proceedings, volume 2, pages 1082-1087, 2009. 3. Innovative Hybridisation of Genetic Algorithms and Neural Networks in Detecting Marker Genes for Leukaemia Cancer (Supplementary Proceedings). By D.L. Tong, K. Phalp, A. Schierz and R. Mintram. In PRIB'09: 4th IAPR International Conference on Pattern Recognition in Bioinformatics, suppl. proceedings, 2009. 4. Innovative Hybridisation of Genetic Algorithms and Neural Networks in Detecting Marker Genes for Leukaemia Cancer (Poster). By D.L. Tong, K. Phalp, A. Schierz and R. Mintram. In PRIB'09: 4th IAPR International Conference on Pattern Recognition in Bioinformatics, Poster Id: 2, 2009. 5. Extracting informative genes from unprocessed microarray data (Proceedings). By D.L. Tong. In ICMLC'10: 9th International Conference on Machine Learning and Cybernetics, proceedings, 2010. iii iv 6. Genetic Algorithm Neural Network (GANN): A study of neural network activation functions and depth of genetic algorithm search applied to feature selection (Journal). By D.L. Tong and R. Mintram. In the International Journal of Machine Learning and Cybernetics (IJMLC), in press 2010. 7. Genetic Algorithm Neural Network (GANN): Feature Extraction for Unprocessed Microarray Data. By D.L. Tong and A. Schierz. Submitted to the Journal of Artificial Intelligence in Medicine on April 2010. Abbreviations Biology ALL - Acute Lymphoblastic Leukaemia AML - Acute Myelogeneous Leukaemia APL - Acute Promyelocytic Leukaemia BL - Burkitt Lymphoma DDBJ - DNA Data Bank of Japan DNA - Deoxyribonucleic Acid cDNA - complementary DNA CML - Chronic Myelogeneous Leukaemia EBI - European Bioinformatics Institute EMBL - European Molecular Biology Laboratory EST - Expressed Sequence Tag EWS - Ewing's Sarcoma FAB - French-American-British FISH - Fluorescent In-Situ Hybridisation FPR - Formylpeptide Receptor GO - Gene Ontology ISCN - International System for Human Cytogenetic Nomenclature LIMS - Laboratory Information Management System v vi MGED - Microarray Gene Expression Data MIAME - Minimum Information About a Microarray Experiment MLL - Mixed Lineage Leukaemia MPSS - Massive Parallel Signature Sequencing NB - Neuroblastoma NCBI - National Center for Biotechnology Information NHGRI - National Human Genome Research Institute NIG - National Institute of Genetics PCR - Polymerase Chain Reaction RT-PCR - Reverse Transcription-PCR PNET - Primitive Neuroectodermal Tumour PMT - Photo-Multiplier Tube RMS - Rhabdomyosarcoma ARMS - Alveolar RMS ERMS - Embryonal RMS PRMS - Pleomorphic RMS RNA - Ribonucleic Acid mRNA - messenger RNA tRNA - transfer RNA SAGE - Serial Analysis of Gene Expression SNP - Single Nucleotide Polymorphism SRBCTs - Small Round Blue Cell Tumours vii Computing ANN - Artificial Neural Network AP - All-Pairs BGA - Between-Group Analysis BIC - Bayesian Information Criterion BSS/WSS - Between-Group and Within-Group CART - Classification And Regression Tree COA - Correspondence Analysis DAC - Divide-And-Conquer DT - Decision Tree EA - Evolutionary Algorithm FDA - Fisher's Discriminant Analysis FS - Feature Selection GA - Genetic Algorithm GANN - Genetic Algorithm-Neural Network GP - Genetic Programming HC - Hierarchical Clustering IG - Information Gain KNN - k-Nearest Neighbour LDA - Linear Discriminant Analysis LOOCV - Leave-One-Out Cross-Validation LRM - Logistic Regression Model MDS - Multidimensional Scaling NB - Naive Bayes NSC - Nearest Shrunken Centroid viii OBD - Optimal Brain Damage OVA - One-Versus-All PAM - Prediction Analysis of Microarrays PART - Projective Adaptive Resonance Theory PCA - Principal Component Analysis PLS - Partial Least Squares RA - Relief Algorithm RFE - Recursive Feature Elimination S2N - Signal-to-Noise SA - Simulated Annealing SAC - Separate-And-Conquer SOM - Self-Organising Map SVD - Singular Value Decomposition SVM - Support Vector Machine WEKA - Waikato Environment for Knowledge Analysis WV - Weighted Voting Table of Contents Copyright . .i Abstract . ii Publications Resulting from Thesis . iii Abbreviations . .v Table of Contents . ix List of Figures . xiv List of Tables . xvi Acknowledgements . xviii Declarations . xix 1 Introduction 1 1.1 Motivation . .2 1.2 Statement of the Problem . .7 1.2.1 Implicit Research Objective and Ill-conceived Hypothesis . .7 1.2.2 Data Normalisation . .8 1.2.3 Over-fitting Problem . .9 1.2.4 Omission on Features expressed in Lower Precision Level . .9 1.3 GANN: Feature Extraction Approach . 10 1.4 Research Question and Hypotheses . 12 1.5 Contributions . 13 1.6 Structure of Thesis . 14 2 Background and Literature Review 16 2.1 A Biological Perspective . 16 2.1.1 Array Design . 17 2.1.2 Fabrication Technology . 20 2.1.3 Labelling Systems . 22 ix TABLE OF CONTENTS x 2.1.4 Hybridisation . 23 2.1.5 Image Analysis . 24 2.1.6 Microarray Challenge . 25 2.2 A Computing Perspective . 27 2.2.1 Data Preprocessing . 27 2.2.1.1 Missing value estimation . 28 2.2.1.2 Data normalisation . 29 2.2.1.3 Feature selection/reduction . 30 2.2.2 Validation Mechanism . 33 2.2.3 Classification Design . 35 2.2.3.1 Supervise learning . 35 2.2.3.2 Unsupervised learning . 42 2.2.4 Feature Selection (FS) . 45 2.2.4.1 Filter selection . 45 2.2.4.2 Wrapper selection . 47 2.2.4.3 Embedded selection . 48 2.2.5 Computing Challenges . 53 2.3 Summary . 55 3 Experimental Methodology 57 3.1 Empirical Data Acquisition . 58 3.1.1 Microarray Data Sets . 58 3.1.1.1 Acute leukaemia (ALL/AML) . 58 3.1.1.2 Small round blue cell tumours (SRBCTs) . 61 3.1.2 Synthetic Data Sets . 62 3.1.2.1 Synthetic data set 1 . 63 3.1.2.2 Synthetic data set 2 . 63 3.1.3 Bioassay Data Sets . 64 3.1.3.1 AID362 . 65 3.1.3.2 AID688 . 65 3.2 Designing Feature Extraction Model using GAs And ANNs . 66 3.2.1 GA - An optimisation search method . 67 3.2.1.1 Population of potential solutions . 67 3.2.1.2 Fitness of potential solutions . 68 TABLE OF CONTENTS xi 3.2.1.3 Selecting potential solutions . 69 3.2.1.4 Evolving the potential solution . 71 3.2.1.5 Encoding evolutionary mechanism . 73 3.2.1.6 Elitism . 74 3.2.1.7 Exploration versus Exploitation . 75 3.2.2 ANN - A universal computational method . 75 3.2.2.1 The architecture of the network . 77 3.2.2.2 The training of the network . 78 3.2.2.3 The activation function of the network . 80 3.2.3 Hybridising.