
Combined Supervised and Unsupervised Learning in Genomic Data Mining

Jack Y. Yang Okan K. Ersoy

TR-ECE 03-10

School of Electrical and Computer Engineering
Purdue University
465 Northwestern Avenue
West Lafayette, IN 47907-2035


TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

1. INTRODUCTION
   1.1 MOTIVATION
   1.2 ADVANCEMENT OF CURRENT RESEARCH
   1.3 A SNAPSHOT OF OUR RESEARCH
2. BIOPHYSICAL ASPECTS OF BIOINFORMATICS
   2.1 ROLES OF PHYSICS IN BIOMEDICAL SCIENCES
       2.1.1 X-ray crystallography
       2.1.2 The Overhauser effects: Combined ESR with NMR
       2.1.3 DNA microarray technology
   2.2 ROLES OF BIOINFORMATICS
       2.2.1 Protein modeling
       2.2.2 Evolution, phylogenetic systematics and phylogenetic trees
3. DATA MINING AND COMPUTATIONAL INTELLIGENCE
   3.1 ORIGINATIONS OF BIOINFORMATICS AND ITS RELATIONSHIP TO COMPUTER ENGINEERING
   3.2 STATISTICAL LEARNING, DATA MINING AND COMPUTATIONAL INTELLIGENCE
   3.3 UNSUPERVISED LEARNING
       3.3.1 K-means clustering algorithm
       3.3.2 Self-organizing maps (SOMs)
           3.3.2.1 Neuron update rule from the energy function
           3.3.2.2 Implementation of the SOM in the optional first stage of UST
       3.3.3 Unsupervised decision trees
   3.4 SUPERVISED LEARNING
       3.4.1 Support vector machines
       3.4.2 Decision trees
       3.4.3 Neural networks
       3.4.4 Ersoy's parallel, self-organizing, hierarchical neural networks
       3.4.5 Nearest neighbor classifiers
   3.5 RESEARCH STRATEGY
4. THE UST ALGORITHM: A NEW WAY OF COMBINING SUPERVISED AND UNSUPERVISED LEARNING
   4.1 STRUCTURE OF THE UST
       4.1.1 The optional first stage: self-organizing maps (SOMs)
       4.1.2 Maximum Contrast Tree
   4.2 CONSTRUCTING THE MAXIMUM CONTRAST TREE
       4.2.1 Method I: MCT
       4.2.2 Method II: Balanced MCT
       4.2.3 The fixed point MCT
   4.3 FAST IMPLEMENTATION ALGORITHM
   4.4 THE ENERGY FUNCTION
   4.5 OVERVIEW OF UST RESULTS
5. DATA GENERATION
   5.1 REASONS FOR GENERATING OUR OWN DATA
   5.2 METHOD IN GENERATING PROTEIN PHYLOGENETIC PROFILES
   5.3 OBTAINING PROTEIN FUNCTIONAL LABELS
6. RESULTS OF UST ON BIOMEDICAL ASPECTS
   6.1 IDENTIFYING FUNCTIONALLY RELATED PROTEINS
   6.2 IDENTIFYING PROTEIN (GENE) FUNCTIONAL COMPLEXES WITHOUT SEQUENCE HOMOLOGY
   6.3 A SCENARIO OF EVOLUTIONARY PATHWAYS
   6.4 PREDICTING FUNCTIONS OF UNKNOWN PROTEINS
7. NEW CLASSIFIER TO HANDLE MULTIPLE-LABELED INSTANCES
   7.1 THEORETICAL PROPERTIES OF THE NEAREST NEIGHBOR CLASSIFIER
   7.2 PROBABILITY THAT TWO PHYLOGENETIC PROFILES MATCH DUE TO CHANCE
   7.3 NEW INSIGHT ON IMPROVING NEAREST NEIGHBOR CLASSIFIERS
   7.3 UST-BASED MULTIPLE-LABELED INSTANCE CLASSIFIER (MLIC)
   7.4 MULTIPLE FUNCTIONALITIES OF MLIC
   7.5 USING THE MAXIMUM CONTRAST TREE AS A CLASSIFIER
8. COMPARATIVE EXPERIMENTAL RESULTS
   8.1 ACCOMMODATING TO COMPLEX DATA BY UST
   8.2 CONSTRUCTING A LIBRARY OF YEAST PROTEIN PHYLOGENETIC PROFILES
   8.3 COMPARATIVE RESULTS
       8.3.1 Classification of complicated instances
       8.3.2 Comparative results for protein functional detectors
   8.4 ANALYSIS OF RESULTS: WHY UST OUTPERFORMS DECISION TREES AND SVMS
9. CONCLUSIONS AND FUTURE RESEARCH
   9.2 APPLICATION OF UST USING DNA MICROARRAY DATA
   9.3 RESEARCH OBJECTIVES
   9.4 CONCLUSIONS
BIBLIOGRAPHY


LIST OF TABLES

Table 5.1. 12 archaebacterial genomes.
Table 5.2. 45 eubacterial genomes.
Table 5.3. 3 eukaryotic genomes.
Table 5.4. S. cerevisiae multi-functional proteins (ORFs YBR243C and YMR319C).
Table 6.1. Functionally related ADP/ATP carrier genes classified by UST.
Table 6.2. Functionally related mannosyltransferase O-linked glycosylation proteins classified by UST.
Table 6.3. Functionally related DNA replication proteins classified by UST.
Table 6.4. A UST node showing a protein complex involved in different pathways in the mitochondria.
Table 6.5. A UST node showing the transporter-related proteins involved in drug resistance.
Table 7.1. Distance measurement by number of bits of difference.
Table 8.1. Performance comparisons of UST-based MLICs with decision trees and support vector machines.
Table 8.2. Results of support vector machines using our data.
Table 8.3. Protein 02.13 respiration with UST versus other results.
Table 8.4. Protein 01.20.01 metabolism of primary metabolic sugars.
Table 8.5. Protein 13.01 ionic homeostasis with UST versus other results.
Table 8.6. Protein 01.05.07 C-compound, carbohydrate transport with UST.
Table 8.7. Protein 13.01.01.01 homeostasis of metal ions (Na, K, Ca, etc.) with UST versus other results.
Table 8.8. Protein 05.01 ribosome biogenesis with UST versus other results.
Table 8.9. Protein 06.01 protein folding and stabilization with UST versus others.
Table 8.10. Protein 08.13 vacuolar transport with UST versus other results.
Table 8.11. Protein 01.06.13 lipid and fatty-acid transport with UST versus others.
Table 8.12. Protein 08.19 cellular import with UST versus other results.
Table 8.13. Protein 67.28 drug transporters with UST versus other results.
Table 8.14. Protein 67.04.07 anion transporters (Cl, SO4, PO4, etc.) with UST.


LIST OF FIGURES

Figure 2.1. Proteins and DNAs are interactive from the point of view of functional genomics.
Figure 2.2. Precession of spin in a magnetic field.
Figure 2.3. A phylogenetic tree built from hemoglobin sequences of humans and animals.
Figure 3.1. Gaussian neighborhood function.
Figure 3.2. Support vector machines.
Figure 3.3. Decision tree.
Figure 4.1. Overall procedure.
Figure 6.2. A scenario of evolutionary pathways for different sets of genes.
Figure 6.3. Prediction of unknown protein by UST.
Figure 7.1. PDF of 60-dimensional random protein phylogenetic profiles.
Figure 7.2. Distance measures by bits.
Figure 7.3. CDF of bits distribution.
Figure 7.3. Block diagram of the Multiple-Labeled Instance Classifier.
Figure 7.4. An example of MLIC.
Figure 8.1. Data with 4 classes.
Figure 8.2. First step of MCT.
Figure 8.3. How MCT splits a node.
Figure 8.4. Next step of MCT.
Figure 8.5. How MCT separates each class.
Figure 8.6. MCT construction.
Figure 8.7. The real boundaries of the MCT.
Figure 8.8. Data with classes inseparable by hyperplanes.
Figure 8.9. Real boundaries of MCT to separate the data.
Figure 8.10. How the MCT was built to classify each instance.
Figure 8.11. MCT can assign labels to unclassified instances.
Figure 8.12. How MCT can be used to predict unknown proteins (genes).
Figure 8.13. Results of support vector machines (from the publication of Pavlidis et al., 2002).
Figure 8.14. Comparative results on drug transport protein.
Figure 8.15. Comparative results on respiration and anion transport proteins.
Figure 8.16. Comparative results on ionic homeostasis and metabolic sugars proteins.
Figure 8.17. Comparative results on proteins of homeostasis of metal ions, etc.


ABSTRACT

Since the discoveries of the Overhauser effects and the DNA double helix structure, many protein structures have been determined experimentally, especially by utilizing the Overhauser effects. Biologists are now able not only to describe life phenomena but also to seek an understanding of life mechanisms at the molecular level. With the advent of high-throughput genome sequencing technology, more and more genomes are available; consequently, our ability to sequence genomes has outstripped our ability to analyze the resulting data in order to determine the functions and structures of the proteins encoded in the genomes. Determination of protein structures and functions using traditional laboratory methods is slow and expensive. Therefore, our goal is to develop an automated, machine learning based approach that provides information concerning multiple functional relations among a large group of proteins simultaneously through computational intelligence.

As of today, the functions of most proteins are either completely unknown or not completely known. This is due to the nature of complex protein-protein and protein-DNA interactions and the limitations of experimental approaches and data mining techniques. However, we are able to extract information concerning protein functional relationships with a new approach that performs a hierarchical decomposition of the feature space. This approach transforms the difficult problem into simpler sub-problems so that complex biomedical data can be utilized efficiently. We refer to this new approach as the unsupervised-supervised tree (UST) because it combines the advantages of both supervised and unsupervised learning. The core of the UST is the construction of a Maximum Contrast Tree (MCT) that allows us to establish many links among proteins of related functions.

Furthermore, we introduce a new machine learning classifier, called the Multiple-Labeled Instance Classifier (MLIC), that handles instances belonging to many classes, a case that has not been studied in previous computational intelligence approaches.


We built the most comprehensive protein phylogenetic profile library to date, based on 60 genomes; it is an improvement over other protein phylogenetic profile libraries based on 24 genomes. Experimental results show that USTs outperform other computational intelligence methods such as support vector machines and decision trees, and provide a viable alternative to supervised or unsupervised methods alone.


1. INTRODUCTION

1.1 Motivation

Bioinformatics is a burgeoning field that deals with the analysis of genomic information. The development of high-throughput genome sequencing technology has helped biology enter a new era in which one seeks an understanding of biological processes from complete genomes. At the present time, there are at least 61 completely sequenced genomes, including 12 archaebacterial, 45 eubacterial, and 4 eukaryotic species. This represents a total of more than 500,000 genes. Of this huge number of genes (proteins), only a few have completely known functions. Traditionally, protein structures and functions have been determined by sequence comparison, extensive genetic and biomedical analysis, and direct biological experiments such as nuclear Overhauser effect spectroscopy, enhanced Overhauser effect NMR, and X-ray crystallography. Assessing protein function using direct biological experiments is accurate and reliable but time-consuming and expensive, while assigning protein function by sequence comparison is limited to those proteins (genes) that have significant sequence similarity to functionally known proteins. As more and more (complete) genomes become available, biology has, as a matter of fact, evolved from a data-poor field into a data-rich field. Assigning protein functions and determining protein structures by traditional experimental methods cannot keep up with the speed of new genomic data generation. New opportunities for research in statistical learning, data mining and computational intelligence have grown in recent years. Therefore, it is important to develop a new, efficient method that can provide information concerning multiple functional relations among a large group of proteins or even all proteins inside a genome. Such a method should handle not only protein phylogenetic profile data but also DNA microarray data, due to their similarity in data patterns. The design of such a method should focus on quickly assessing protein functions on a large scale using computational intelligence effectively.


1.2 Advancement of Current Research

Recently, several authors have begun to apply supervised learning techniques to genomic data. Brown et al. (2000) applied a variety of classifiers (including support vector machines and decision trees) to the problem of predicting protein functions from DNA microarray data. Vert (2002) proposed a support vector machine (SVM) tree kernel and showed that it outperforms the "heterogeneous" SVM kernels that Pavlidis et al. (2002) used. Ersoy et al. developed rare event rule extraction by decision trees and neural networks in the Human Genome Project (Ersoy 2000). More recently, Yang and Ersoy (2002) developed gene finding and protein functional determination methods using Bayesian trees and physical trees. In its simplest form, a protein phylogenetic profile is a string of bits; given n species for which complete genomes are available, and given a target protein in the model organism, this

string has the form b1, b2, ..., bn, where a value of 1 in the ith bit position indicates that a homologue of the target protein is present in species i, while a value of 0 indicates that it is not. Thus, a protein phylogenetic profile can be represented by a vector of n dimensions (a small illustrative sketch is given at the end of this section). The idea is that functionally related proteins will tend to evolve together, and thus will have similar protein phylogenetic profile vectors, while proteins that are not functionally related will tend to evolve independently, and thus are likely to have disparate protein phylogenetic profile vectors. A DNA microarray experiment reveals the levels of expression of a group of proteins (genes) under many different environmental conditions. For each gene (protein), a vector of expression levels is obtained, where the ith element of this vector is the level of expression of that gene (protein) under the ith environmental condition. The idea here is that functionally related genes (proteins) are assumed to have similar expression level vectors. An effective approach to analyzing the results of either a protein phylogenetic profile analysis or a DNA microarray analysis is to apply unsupervised learning techniques such as clustering or self-organizing maps (Yang 2002), under the assumption that functionally related proteins will be mapped to the same cluster. Pavlidis et al. (2002) applied a supervised support vector machine to a mixture of protein phylogenetic profile data and DNA microarray data to predict the protein functional class. Because this heterogeneous data set consists of both

protein phylogenetic profiles and DNA microarray data, they used a support vector machine with an explicitly heterogeneous kernel to classify the data. Vert (2002) proposed a new support vector machine kernel (which he terms a tree kernel) and showed that it outperformed the standard SVM kernels on some protein classes. So far, the support vector machine is the only machine learning method that has been applied to protein phylogenetic profile data outside Ersoy's group. The experiments that Pavlidis et al. (2002) and Vert (2002) performed were based on 24-dimensional vectors from 24 species augmented with 79-dimensional DNA microarray vectors (heterogeneous data). Specifically, the dataset consists of protein phylogenetic profiles (computed using E-values) of their selected 2465 genes (which they called "learnable" genes) from the yeast genome, along with the corresponding microarray data, consisting of 79-element gene expression vectors. Fewer than 30 functional classes of proteins, which they also termed "learnable", were selected from the 285 yeast protein functional classes in their work. For each selected protein functional class, a support vector machine classifier is trained to distinguish proteins of that class from all other proteins. The report by Pavlidis et al. indicated that 22 out of 27 classifiers have error rates greater than 72% (the 5 best learnable classes are reported, with one having a 72.1% error rate). Vert extended this work using the same dataset by considering a support vector machine in which the traditional kernel is replaced by a tree kernel, using a different set of 16 protein functional classes. He concluded that the tree kernel performed better than the traditional kernel on 12 out of 16 protein functional classes, and worse on 4 out of 16 functional classes, which, he pointed out, tended to be larger and more general than the other classes. Proteins (genes) that belong to rare event classes (fewer than 10 occurrences) are ignored in their work due to insufficient instances for training a support vector machine. Those results may indicate that supervised learning methods are not successful at classifying protein phylogenetic profiles. The fact is that the selected 27 "learnable" classes constituted only a small portion of the 285 protein classes inside the yeast genome. This might indicate that few proteins can be learned by SVMs. Even proteins from the few "learnable" classes have large testing error rates with support vector machines (Pavlidis 2002). A probable reason that support vector machines have not performed

well is that most proteins are multi-functional. The SVMs they used converted the hundred-class dataset, with each instance carrying up to more than 10 classes, into a number of simple 2-class protein function detectors (Pavlidis 2002, Vert 2002). In their methods, a particular protein function was selected, and a support vector machine was trained to distinguish proteins whose functions include the target class from all other proteins. Therefore, information about a multi-functional protein was reduced to that of a single-function protein for the selected SVM functional detector. Evidently, information regarding the whole set of protein functions was partially lost in the process. One more possible explanation for the high test error rate is that a test protein not belonging to the class of proteins detected by a particular SVM may give a false positive if it has a phylogenetic profile similar to those of proteins in the target class; although it is true that such a test protein might be functionally related to proteins in the target class, it is not necessarily true that the internal protein sequence of the test protein is similar to that of any of the proteins in the target class, and hence the class labels may be different. In addition, mixing DNA microarray data with protein phylogenetic data may not detect functionally related proteins that have no sequence similarity, and therefore may reduce the effectiveness of the protein phylogenetic profile method. Moreover, the experimental noise from DNA microarray technology merges into the heterogeneous dataset. Regardless of the high experimental noise of DNA microarray technology, DNA microarray techniques along with protein phylogenetic profiles have been successfully used to assess the functions of proteins on a large scale (Yang 2002). This has enabled some discoveries of genes (Brown 2000) by sequencing the genomes of model organisms, an exhilarating reminder that much of the natural world remains to be explored at the molecular and genomic level. These technologies provide a natural vehicle for such explorations. Comprehensive genome-wide surveys of gene expression patterns or functions have led to computational models whose output can be viewed as maps that reflect the order and logic of the genetic program, rather than the physical order of genes on chromosomes. Exploration of the genome using DNA microarrays and other genome-scale technologies such as protein

phylogenetic profiles may narrow the gap in our knowledge of gene (protein) function and molecular biology between the currently favored model organisms and other species. Although computational and statistical learning methods have already been successfully applied to classify protein function using microarray data (Eisen 1998; Tamayo 1999; Brown 2000), DNA microarray data can only be used to determine protein function for proteins that have sequences similar to those of known proteins; it cannot be used to detect functionally related proteins if those proteins do not share a strong sequence similarity. Other limitations include the facts that the equipment is expensive and requires a highly skilled operator, and that the experimental data are noisy. On the other hand, assessing protein function by sequence comparison is limited to those proteins that have significant sequence similarity to proteins with known function. We need to develop an efficient way to assess protein function on a larger scale, over which the amino acid sequences may not be similar. The method of protein phylogenetic profiles has a distinct advantage over other methods of assessing protein function, such as Rosetta Stone sequences and gene neighbors (Marcotte 1999), since it is the only method of assessing protein function that does not rely on sequence homology directly.
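Before moving on, a minimal sketch of the phylogenetic-profile representation described above may be helpful; the profiles below are made-up eight-bit examples, not entries from the 60-genome library constructed later in this report:

    import numpy as np

    # A phylogenetic profile is a length-n bit vector: bit i is 1 if a homologue of
    # the target protein is detected in genome i, and 0 otherwise.
    profile_a = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # made-up protein A
    profile_b = np.array([1, 0, 1, 1, 0, 1, 1, 0])   # made-up protein B

    # Hamming distance: the number of genomes in which the presence/absence
    # patterns disagree. Functionally related proteins are expected to co-evolve,
    # and therefore to have a small distance (similar profiles).
    hamming = int(np.sum(profile_a != profile_b))
    print(f"{hamming} of {profile_a.size} bits differ")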

1.3 A snapshot of our research

Due to the complex nature of protein-protein interactions, global boundaries to separate the functional classes may be overly complex; we therefore focus on a multistage approach, which may divide up the feature space in such a way as to allow a simpler classifier to be used in place of a more complex one, which may improve the generalization ability of the resulting classifier. Our approach differs in several respects from the earlier work in that it uses a multi-stage decomposition that makes use of both unsupervised and supervised machine learning techniques; we refer to this as the Unsupervised-Supervised Tree (UST) algorithm. The difference between supervised and unsupervised learning is that supervised learning methods make use of class label information to predict the class labels of previously unseen instances, whereas unsupervised methods seek to discover structure in the data and do not rely on class labels

directly during the training processes. The advantages of supervised and unsupervised learning have been combined in the earlier work by Ersoy (1990, 1991, 1992, 1993, 1994, 1995, 1996, 1998, 1999, 2000) involving rare event rule extraction by neural networks and decision trees in the Human Genome Project, as well as parallel, self-organizing, hierarchical neural networks. The UST structure performs such a hierarchical decomposition of the feature space in a manner that groups proteins (genes) with similar profiles at each stage, thus transforming a complex problem into simpler subproblems and improving generalization. The typical (optional) first stage of the UST uses clustering algorithms such as neural network self-organizing maps (SOMs) or K-means; this is the unsupervised stage. Subsequent, indispensable stages typically involve constructing a Maximum Contrast Tree (MCT) so that protein functional relationships can be mapped onto the relational tree structure. The MCTs are a family of completely independent algorithms that can be used alone. Testing is based on a newly developed MLIC (Multiple-Labeled Instance Classifier), built on a supervised K-nearest-neighbor classifier over the tree structure. Performance has been compared with the decision tree C4.5 and C5 programs and with support vector machines. The UST approach yields better results than either supervised or unsupervised methods alone and provides improved performance in cases where instances belonging to more than 10 classes actually occur. Such "fuzzy" instances typically arise when dealing with biological systems, but have not been handled before by other machine learning methods. In this research, we generated our own data by building a complete yeast genome protein phylogenetic profile based on 61 complete genomes using E-values. The features were extracted by a Perl program from 2.1 gigabytes of raw data and were verified by a Visual Basic program. The profiles have been successfully studied using the unsupervised-supervised tree method through C++ software we developed. The first goal of this paper is to develop an automated machine-learning-based approach in order to explore protein functional relationships efficiently and to provide information concerning multiple functional relations among a large group of proteins, or even all proteins (genes) inside a genome, simultaneously.


In order to facilitate the testing and benchmarking of our algorithm, we have constructed a library of protein phylogenetic profiles for the proteins in the yeast Saccharomyces cerevisiae. As this protein phylogenetic profile library is constructed over 60 species (thus, the protein phylogenetic profile vectors have a dimension of 60), it is an improvement over existing protein phylogenetic profile libraries, which are constructed over 24 or fewer species. The next goal of this paper is to compare our algorithm against existing classification algorithms and, more broadly, to compare common classification algorithms over a more comprehensive protein phylogenetic profile library than has been used previously. To use the tree as a classifier, given a test instance, we find the K instances (protein phylogenetic profile vectors) in the training set on the UST tree structure that are nearest to the test instance, and assign to the test instance the majority class label of this set of K protein phylogenetic profile vectors. Here "nearest" refers to the nearest neighbors of the test instance at a particular node of the tree. Thus, this algorithm is a special development of the K-nearest-neighbor classifier, where the nearest neighbors are defined by the tree structure; a small illustrative sketch is given at the end of this chapter. Results showed that the attempt to combine unsupervised learning with supervised learning is successful; USTs compare favorably to the decision tree algorithms C4.5 and C5 and to support vector machines, and provide a viable alternative to supervised or unsupervised methods alone. This technical report is organized into nine chapters. Chapter 1 is a general introduction and snapshot of our research. Chapter 2 covers the biophysical aspects of bioinformatics pertinent to our research. Chapter 3 illustrates the computational intelligence utilized in our research. Chapter 4 presents our new algorithm. Chapter 5 is about data generation. Chapter 6 gives results on biomedical aspects. Chapter 7 introduces statistical analysis and a new classifier. Chapter 8 covers experimental results on the new classifier. Chapter 9 presents conclusions and suggests further research.
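As a concrete illustration of the tree-restricted nearest-neighbor voting described above, the following is a minimal sketch; it assumes profiles are binary vectors, that the training instances reachable at a given tree node have already been gathered, and that each instance carries a single label (the full MLIC handles multiple labels per instance). The data structures and names are our own illustrative choices, not the report's C++ implementation:

    from collections import Counter
    import numpy as np

    def knn_vote_at_node(node_profiles, node_labels, query, k=3):
        """Majority-vote K-nearest-neighbor classification in which the candidate
        set is restricted to the instances at (or reachable from) one tree node,
        rather than the whole training set."""
        node_profiles = np.asarray(node_profiles)
        # Hamming distance between the query profile and each candidate profile.
        dists = np.sum(node_profiles != np.asarray(query), axis=1)
        nearest = np.argsort(dists)[:k]
        votes = Counter(node_labels[i] for i in nearest)
        return votes.most_common(1)[0][0]

    # Illustrative call with made-up six-bit profiles and labels.
    profiles = [[1, 0, 1, 1, 0, 0], [1, 0, 1, 0, 0, 0],
                [0, 1, 0, 0, 1, 1], [0, 1, 1, 0, 1, 1]]
    labels = ["respiration", "respiration", "transport", "transport"]
    print(knn_vote_at_node(profiles, labels, [1, 0, 1, 1, 0, 1], k=3))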


2. BIOPHYSICAL ASPECTS OF BIOINFORMATICS

This chapter introduces the biophysical aspects of bioinformatics, current research activities in bioinformatics, and our research interests. Because protein modeling depends on templates determined by experimental methods that mainly originated from the Overhauser effects, the Overhauser effects and their consequent technologies are also briefly discussed. Our approach to bioinformatics is from the point of view of structural biophysicists and computer engineers; we are not able to imagine genes without associating them with their protein products. DNA sequences are not just sequences to us. To a structural biophysicist and computer engineer, genes imply corresponding proteins and their 3D structures and functions. We may be "blind" to details of the sequences themselves that are so important to the research of molecular geneticists. We believe that proteins and DNAs are interactive from the point of view of functional genomics, as shown in Figure 2.1.

2.1 Roles of physics in biomedical sciences

A hallmark of biomedical science and technology is its multidisciplinary character. In 1953, Crick and Watson introduced the DNA double helix structure, and in the same year, Overhauser discovered the Overhauser effects. Since then, biology has entered a new era through modern discoveries at the molecular and atomic levels. Many of those important discoveries were made by physicists such as Albert W. Overhauser, Erwin Schrödinger, Francis Crick and Max Delbrück. Biology today is where physics was at the beginning of the last century, when quantum mechanics and the theory of relativity were just being established. The history of modern biomedical sciences shows that physicists have made significant contributions. The discovery of the double helix structure of DNA by Crick and Watson was described as follows: "it took a physicist and a former ornithology student -- along with some unwitting help from a competitor -- to crack the secret of life". The recipient of the 1994 National Medal of Science, Purdue Stuart Distinguished Professor of Physics Albert W. Overhauser, has been quoted as saying: "For me, the discovery of 'dynamic nuclear polarization' was an academic exercise, that is, basic (physics) research into the mysteries of nature. I did not [foresee that it would become a] technique to determine the structures of proteins and other molecules of crucial biological interest" (Overhauser Symposium 1995). In 2002, Kurt Wüthrich won the Nobel Prize in Chemistry for making use of the nuclear Overhauser enhancement (NOE) effect to assign protons to their positions along a macromolecular backbone. He did this by combining the results of 2-D correlation spectroscopy (COSY) and 2-D nuclear Overhauser effect spectroscopy (NOESY). Richard R. Ernst won the 1991 Nobel Prize in Chemistry for his contributions to the development of the methodology of high-resolution nuclear magnetic resonance (NMR) spectroscopy (enhanced Overhauser effects). In the 1990s, Okan K. Ersoy developed parallel, self-organizing, hierarchical neural networks, and in 2000 he developed rare event rule extraction by neural networks and decision trees in the Human Genome Project. In 1995, the marriage of the semiconductor microfabrication technology of photolithography with genomics resulted in the GeneChip and DNA microarray technology, which can be regarded as physics chipping away at the mysteries of biological science and medicine. Bioinformatics, in terms of comprehensive genome-wide surveys of gene expression patterns or functions by DNA microarray technology and protein phylogenetic profiles, has led to the development of computational intelligence to explore gene function and disease genes on a large scale from a higher vantage point.


Figure 2.1 Proteins and DNAs are interactive from the point of view of functional genomics


2.1.1 X-ray crystallography

The X-ray method is a traditional physics method for determining the structures of molecules. Crystals are highly regular purified solids with a lattice structure. Crystals of a complex molecule, such as a protein, produce a complex X-ray diffraction pattern when placed in an X-ray beam whose size matches the crystal size and whose wavelength matches the scale of the crystal lattice structure. The positions of all atoms inside the protein or DNA may be determined by studying the resulting images from the scattering of X-rays. However, because protein and DNA molecules are abundant in protons (hydrogen nuclei), which are the lightest in atomic weight and smallest in size, the locations of protons are more difficult to detect from X-ray diffraction patterns than those of heavier atoms. Furthermore, it is very difficult to crystallize biological molecules because crystals are not the natural states of protein and DNA molecules. X-ray crystallography may therefore not always be feasible. Fortunately, many experimental methods that originated from the Overhauser effects can determine protein and DNA structures more efficiently, and the experiments can be performed in solution.

2.1.2 The Overhauser effects: Combined ESR with NMR

From the principles of general physics, if an external frequency of a "vibration" matches the internal frequency of a "physical structure", resonance occurs. This is true if the external "vibration" is an electromagnetic wave and the internal "physical structure" is a molecule. This section describes how the Overhauser effects, as "double resonances", are utilized in the dynamic structural determination of bio-molecules. Taking protons and electrons as an illustrative example, protons (hydrogen nuclei) and electrons are both fermions with spin $\frac{1}{2}$. The magnetic quantum number is

$$m = \pm \tfrac{1}{2},$$

which corresponds to the spin-up and spin-down states.

Figure 2.2 Precession of spin in a magnetic field


A particle with spin possesses a magnetic moment. It also produces a magnetic field, whose direction is determined by a right-hand rule (Figure 2.2). The spin angular momentum associated with the particle is

$$S_z = \pm \tfrac{1}{2}\hbar,$$

where "+" represents the spin-up state and "-" represents the spin-down state. The corresponding magnetic moment is

$$\boldsymbol{\mu} = \gamma \mathbf{S},$$

where $\gamma$ is the gyromagnetic ratio.

In an external magnetic field, the energy of the system is described by

$$E = -\boldsymbol{\mu} \cdot \mathbf{B}_0$$

When a magnetic field is present, most electrons and protons will have their spins aligned so that their magnetic moments point along the direction of the external magnetic field, since this state has lower energy. A smaller number of spins are aligned with their magnetic moments opposite to the external magnetic field, since this state has higher energy. Classically speaking, these states correspond to an angular momentum precession (Figure 2.2) along or opposite to the magnetic field with a frequency called the Larmor frequency,

$$\omega_0 = \gamma B_0,$$

where $\gamma$, the gyromagnetic ratio, is a constant for a specific nucleus, e.g., a proton.

The energy difference between the spin-up and spin-down states for a proton is

$$\Delta E = |E_{+1/2} - E_{-1/2}| = |m_{+1/2} - m_{-1/2}|\,\hbar\omega_0 = \hbar\gamma B_0.$$

The external radiation energy (i.e., the photon energy) is given by $\varepsilon = h\nu$. If $\varepsilon = \Delta E$, then resonance occurs, and the resonant frequency is

$$\nu = \frac{\gamma B_0}{2\pi}.$$
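As a quick numerical check of the resonance condition above: for a proton, the gyromagnetic ratio is $\gamma \approx 2.675 \times 10^{8}$ rad s$^{-1}$ T$^{-1}$, so in a field of about 11.7 T the resonance falls near 500 MHz, in the radio-frequency range. A minimal sketch of this calculation (the field strength is just an illustrative value typical of high-field NMR magnets):

    # Resonance frequency nu = gamma * B0 / (2 * pi) for a proton
    import math

    gamma_proton = 2.675e8   # proton gyromagnetic ratio, rad s^-1 T^-1
    B0 = 11.7                # external magnetic field in tesla (illustrative value)

    nu = gamma_proton * B0 / (2 * math.pi)
    print(f"Proton resonance frequency: {nu / 1e6:.0f} MHz")   # about 498 MHz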


This resonant frequency can be observed by scanning over a range of electromagnetic radiation frequencies $\nu$ (radio frequencies). Choosing a reference frequency, the resonance peaks of all the other hydrogens (protons) can be measured relative to this reference; therefore protein and DNA structures may be determined by studying the resonant frequencies. It should be noted that the energy difference between the spin-up and spin-down states is very small and that the external electromagnetic radiation produces an alternating magnetic field. Hence, the populations of the two states are nearly equal, and the nuclear magnetic resonance signal is small or even undetectable. The small energy gap between the nuclear spin energy levels (also called Zeeman energy levels or Overhauser nuclear spin energy levels) is the cause of the weak signal in NMR. Fortunately, weak or vanishing resonant peaks in the NMR spectrum can be enhanced by utilizing the Overhauser effects and can be measured using nuclear Overhauser effect spectroscopy. In 1953, Albert W. Overhauser discovered that a small change in the electron spin populations can produce a thousand-times larger spin polarization in the nuclei. This discovery was termed dynamic nuclear polarization, meaning that the population difference for the nuclear spins can be increased a thousand-fold without the need to increase the magnetic field. The first response from the physics and scientific community was to "welcome" this amazing discovery with great skepticism, challenging the discovery by arguing that Overhauser "violated" the second law of thermodynamics. But Overhauser was proven right immediately by the experimental results of Carver and Slichter in 1953. Skepticism, though it lingered for a while in the 1950s, was eventually dispelled by more and more solid experimental evidence. In fact, not only can electrons transfer polarization to nuclei; it turns out that polarization can also be transferred from nucleus to nucleus, leading to the commonly used nuclear Overhauser effect (NOE) spectroscopy for protein and DNA structure analysis. The nuclear Overhauser effects are the cross-relaxation of protons by the transfer of resonance energy from an excited proton to an unexcited one.


Utilizing the Overhauser effects is the dynamic aspect of NMR for determining protein and DNA structures in solution. Overhauser effect spectroscopy solved the problem of NMR failure when the populations of the spin-down and spin-up states are roughly even, which is the usual case for proteins and DNAs. The discovery of the Overhauser effects is a milestone in structural biology. Later, the development of NOE (nuclear Overhauser effect) spectroscopy and Overhauser-effect-enhanced NMR methods gave information about interactions between protons that are close in space, rather than just those that have a short through-bond distance. The main drawback of X-ray crystallography is that crystals are not the natural states of bio-molecules. The Overhauser-effect-derived technologies, performed in solution, became the basis of dynamic structural determination of proteins and DNAs. Protein folding and denaturing can be monitored in real time by studying the resulting spectra, while such a process (i.e., protein folding) cannot be observed using X-ray crystallography. Moreover, it is also very difficult to crystallize and solve a bio-molecular structure by X-ray crystallography. A protein or a DNA molecule contains many hydrogens (protons); because the proton has the lightest atomic weight and the smallest size among all atoms, the locations of protons are the most difficult to detect from X-ray diffraction patterns. Owing to Overhauser's remarkable discovery, over 90% of the protein and DNA structures currently determined utilized the Overhauser effects, performed in solution, where proteins and DNAs naturally reside. Since then, the conventional method of X-ray crystallography has become less important and less accurate than technologies based on the Overhauser effects for small and medium size proteins. The physics of the Overhauser effects is essentially a smart combination of electron spin resonance (ESR) with nuclear magnetic resonance (NMR). "Double resonance" is the only indispensable way to amplify the signals in determining the structures of proteins and DNAs dynamically (i.e., during protein folding).


2.1.3 DNA microarray technology

Because almost every cell of the body contains a full set of chromosomes and identical genes, it is possible to implant the nucleus of any cell from a sheep into a de-nucleated unfertilized egg of another sheep. That will lead to the formation of an embryo, and if it is transplanted into a female sheep, a lamb is born "naturally". Although the genetics inside different types of cells of the same species are the same, the cells are different. This means that even though all cells of the body contain the same full set of chromosomes and identical genes, only a fraction of these genes are turned on, which confers unique properties on each cell type. Basically, the DNAs (genes) determine messenger RNAs (mRNAs) that work with transfer RNAs (tRNAs) to determine the synthesis of proteins (Figure 2.1). Only the expressed genes take part in this process. Gene expression and regulation are responsible for the transcription of the information contained within the DNA, the repository of genetic information, into messenger RNA molecules that are then translated into the proteins that perform most of the critical functions of cells. DNA microarray technology is a way to monitor gene expression on a large scale. A microarray is a tool for analyzing gene expression that usually consists of a semiconductor slide, called a geneChip, containing samples of many genes or even a whole genome arranged in a regular pattern, usually in rectangular arrays. A microarray works by exploiting the ability of a given mRNA molecule to bind specifically to, or hybridize to, the DNA template from which it originated. By using an array containing many DNA samples, the expression levels of hundreds or thousands of genes, if not all genes inside a complete genome, within a cell can be determined in every experiment by measuring the amount of mRNA bound to each site on the geneChip. The gene expression level can then be obtained by measuring the fluorescent intensity, corresponding to the expression level, from laser scattering on each spot. A computer-controlled robot controls the movement of the probe to every desired spot on the geneChip to measure the exact amount of mRNA bound to that spot on the microarray. After all spots have been measured by the different matching probes, the gene expression profile of a cell can be obtained. This profile is very similar to the protein phylogenetic profiles we generated in our research. It should

be noted that every step of DNA microarray technology is performed precisely by a computerized automatic control system. The significance of DNA microarray technology is that it examines the integration of gene expression and function at the cellular level, revealing how multiple gene products work together in producing physical and chemical responses to both static and changing cellular environments. This technology has also successfully identified many unknown viral genes, such as those of SARS (DeRisi and Wang 2003). There are two main types of DNA microarrays: cDNA and oligonucleotide. cDNA means complementary DNA, used to study whole-gene expression, while an oligonucleotide, often abbreviated to just oligo, is a short fragment of single-stranded DNA, used especially in studies of gene mutations. DNA microarray techniques can be illustrated as follows:

· Design the geneChip, containing either a certain set of desired genes to be studied or even a whole genome.

· Mix the fluorescently dyed cDNAs into the hybridization solution.

· Incubate the fluorescently labeled cDNA hybridization mixture on the geneChip.

· Use computer-controlled probes to measure the fluorescent intensities induced by a laser.

· Use a computer to preprocess and store the data.

· Analyze the resulting data and uncover useful information using data mining techniques and computational intelligence.

DNA microarray technology, along with the technologies derived from the Overhauser effects, is considered among the most important experimental breakthroughs in molecular biology. Today, the generation of biomedical data has outstripped the ability to store and analyze the resulting data in a timely manner. Worse, much of the data is not made public. As more and more laboratories all over the world acquire these technologies, the problem will only get worse, blocking the rapid advance of the science. Facing this avalanche of data, more effort needs to be made toward standardizing storage formats, sharing, and publishing techniques. It is now critical for the scientific community to


publicize the resulting data and make them available in a uniform, universal standard. Furthermore, DNA microarray measurements contain high levels of noise. Physicists and engineers need to improve this technology further to reduce the measurement errors. The software developed in this paper pays specific attention to DNA microarray technology and can handle DNA microarray data as well.
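Because an expression profile, like a phylogenetic profile, is simply a vector indexed by conditions, functionally related genes can be screened by comparing their vectors; a correlation measure is the common choice in the microarray literature (e.g., Eisen 1998). A minimal sketch, using made-up expression values rather than real measurements:

    import numpy as np

    # Expression vectors for two genes across five conditions (made-up values).
    gene_x = np.array([2.1, 3.4, 0.5, 1.8, 4.0])
    gene_y = np.array([2.0, 3.6, 0.7, 1.5, 3.8])

    # Pearson correlation of the two expression vectors; values near 1 suggest
    # the genes respond similarly across conditions (co-expression).
    r = np.corrcoef(gene_x, gene_y)[0, 1]
    print(f"Correlation between expression profiles: {r:.2f}")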

2.2 Roles of bioinformatics

The first part of the word bioinformatics is "bio", meaning biology, specifically molecular biology and structural biology. The rest of the word is "informatics", meaning computer informatics, part of computer science and engineering. Bioinformatics implies solving problems arising from biomedical sciences using methodologies from computer science, engineering, statistics, physics and information technology. Since a genome is a "bible" of life, a genome contains the life history of an organism, and a genome also contains information regarding the evolutionary history of an organism. One important role of bioinformatics can be regarded as eventually formulating the theory of evolution at the molecular and genomic level, which may form a guideline for much biomedical research. Specifically, the rationale for applying computational intelligence to facilitate the understanding of various biological processes includes a global perspective in experimental design, and the use of data mining techniques to discover functions and structures of proteins (genes) of interest by identifying similar sequences in better-studied organisms, by DNA microarray technology, or by the methods presented in this paper. Bioinformatics is the key to evolutionary biology. Bioinformatics gives new insight, in model organisms, into the molecular basis of a disease by studying the function(s) of homologs of a disease gene (protein), because a disease gene (protein) may be mutated from a normal gene (protein), but the two are homologous. From the evolutionary point of view, if two genes (proteins) share a common evolutionary pathway, they are related and may be homologous in their sequences. A potential application of the research presented in this paper is to combine the study of

homologs among all genes (proteins) inside genomes to uncover the evolutionary relationships and patterns between different forms of life. With the help of high-throughput genomic sequencing technologies, more and more (complete) genomes are available; it is now high time to construct the theory of evolution at the genomic and molecular levels. Bioinformatics is also the key to phylogenetics, which deals with identifying and understanding the relationships among the different kinds of life that currently exist or have ever existed on the earth. This paper will present one type of unconventional phylogenetics that we are studying, called protein phylogenetic profiles. Another important role of bioinformatics is protein modeling.

2.2.1 Protein Modeling

Since the discovery of the Overhauser effects, many more protein structures have been determined. An interesting issue is whether an unknown protein structure can be determined from the "bank" of known structures. Protein modeling is a hot subject in bioinformatics. Since October 1st, 2000, more than 100,000 users worldwide have participated in "Folding@Home", a distributed computing project headquartered at Stanford University. Current methods use experimentally determined protein structures as templates to predict the structures of target proteins that are homologous in amino acid sequence. It is important to point out that the experimental methods, such as nuclear Overhauser effect spectroscopy and enhanced Overhauser effect NMR, are indispensable because they provide the templates for protein modeling. Compared to the experimental methods, molecular modeling is not considered either accurate or reliable in determining protein structures, yet it is useful in proposing and testing various biological hypotheses. Molecular modeling also provides useful preliminary information regarding a protein structure to be confirmed by experimental methods. Because the function(s) of a protein is (are) related to the three-dimensional structure of the protein, protein modeling will become an increasingly important subject in biophysics and bioinformatics to understand the

normal and disease-related processes in living organisms, because a disease protein structure may deviate from the structure of a healthy protein. In summary, protein modeling is the identification of the structures of proteins from known three-dimensional structures that are related to the target protein according to the results of sequence alignments. For this purpose, we acquire the related three-dimensional structures, which will be used as templates. Then, we construct a model for the target sequence based on the alignments with the template structures and evaluate the model against a variety of criteria to determine whether it is satisfactory.

2.2.2 Evolution, phylogenetic systematics and phylogenetic trees

Charles Darwin was the founder of the original theory of evolution and was the first to discover systematically that evolutionary relationships can be represented by a hierarchical structure. Similarity among species is attributable to common descent, or inheritance from a common ancestor. The discipline of phylogenetic systematics aims to identify and understand the evolutionary relationships among the whole spectrum of different kinds of species on the earth, both currently living and extinct. Phylogenetic systematics establishes relationships among all species based on their evolutionary history and is essentially phylogeny, the historical relationships amongst lineages and their parts. The theory of evolution is currently built on relationships based on descriptions of each species, mainly from anatomy. Our research approach stands at the genetic level, from a genomic viewpoint, using computational intelligence. In the human genome, there are 46 chromosomes: 22 autosome pairs and the X and Y sex chromosomes. There are 3,000,000,000 base pairs of DNA. This is equal to the number of seconds in 95 years. DNA contains a blueprint for the structure of living organisms. In humans, DNA differences between individuals are less than 0.1% regardless of race and sex. We are at least 99.9% similar to each other! We are all delighted that it is these small differences that account for diversity in human morphology such as race, sex, skin color, hair color, eye color, height and physical build.


However, we are also 98% similar to chimpanzees. Comparisons of protein sequences determined by DNA among different species give a lot of valuable information for understanding how species evolved and how they are related to each other. One mystery of proteins can be viewed as follows: if a small protein is made from 100 amino acids, there are 20^100 possible different combinations from which to build up such a small protein. This number is more than 100 orders of magnitude greater than the estimated age of the universe in seconds. This means that all the different polypeptides that have ever existed in an organism on the earth must constitute but a very, very tiny fraction of the total number of possible polypeptide structures. Patterns of relationships among species established by phylogenetic systematics help in understanding the evolutionary history of all life. But such history cannot be directly visualized by humans; it has happened once and leaves only clues as to the actual events. Those clues are encoded inside genomes. In phylogenetic systematics, the most common way to visualize the evolutionary relationships among different organisms is to build phylogenetic trees. A phylogenetic tree is constructed from relational nodes, which represent taxonomic units, and branches, in terms of ancestry and descent. Any two adjacent nodes are connected by only one branch. The length of a branch is usually defined by the alignment score from protein or gene sequences of different species. A phylogenetic tree is a topology representing the passage of time during evolution. Such trees have a theoretical basis in the particular gene or genes under analysis. In a rooted phylogenetic tree, each pair of adjacent nodes is derived from a root, representing a common ancestor. We collected a number of hemoglobin protein sequences in FASTA format, provided by NCBI, over a large number of animals and humans. We used multiple sequence alignment software called ClustalW, provided by the European Bioinformatics Institute. The resulting phylogenetic tree based on hemoglobin proteins is reported below (Figure 2.3). This gives an illustration of how a phylogenetic tree can be built based on one type of protein (gene) that is present across all the species in the tree. Since such a phylogenetic tree is built from one protein (gene) rather than from the information in (complete) genomes, there is skepticism as to whether this tree is a real measurement of the phylogenetic distance, since such a tree differs when

a different protein (gene) that is present in all the species is used to build it. So far there is no good way to measure the real distance according to whole genomes or true species, and no accurate phylogenetic tree has ever been built at a genuine genomic level. The Unsupervised-Supervised Tree we developed in this paper is a kind of "phylogenetic-like" tree that establishes relationships among all proteins (genes) inside a genome. Although the UST is built on the basis of the theory of evolution, it is not considered a genuine phylogenetic tree, which investigates the relationships among all species based on one or several type(s) of proteins (genes) that exist among those species on the tree.
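To illustrate, in code, the general idea of turning pairwise sequence distances into a tree, the sketch below clusters a made-up distance matrix with average linkage (UPGMA-style); it is only a generic stand-in, not the ClustalW procedure we actually used for the hemoglobin tree, and the taxa and distances are illustrative assumptions:

    import numpy as np
    from scipy.spatial.distance import squareform
    from scipy.cluster.hierarchy import linkage, dendrogram

    taxa = ["human", "chimp", "mouse", "chicken"]
    # Symmetric matrix of made-up pairwise distances (e.g., 1 - fraction of
    # identical aligned residues between hemoglobin sequences).
    D = np.array([[0.00, 0.02, 0.15, 0.40],
                  [0.02, 0.00, 0.16, 0.41],
                  [0.15, 0.16, 0.00, 0.42],
                  [0.40, 0.41, 0.42, 0.00]])

    # Average-linkage (UPGMA-style) hierarchical clustering of the distance matrix;
    # each row of Z records which clusters merge, at what distance, and the new size.
    Z = linkage(squareform(D), method="average")
    print(Z)
    dendrogram(Z, labels=taxa, no_plot=True)  # set no_plot=False to draw the tree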


Figure 2.3 A phylogenetic tree built from hemoglobin sequences of humans and animals.


3. DATA MINING AND COMPUTATIONAL INTELLIGENCE

The bioinformatics presented in this paper is information retrieval and data mining from massive, hidden biological data over a spectrum of different organisms, to discover useful information from the complexity and diversity of living organisms. It is based on the assumption that certain hidden hierarchical relationships exist among the structures of genes, their arrangement inside a genome, the functions of proteins, and the protein-protein and protein-DNA interactions within an organism that result in reproduction, cell mechanisms, regulation, energy metabolism, form, and so on. Our research philosophy is not only to indulge ourselves in the wonderful world of computer science but also to develop new computational intelligence to solve practical problems using real-life data, as computer engineers and biophysical scientists. We always have a genuine interest in uncovering laws of Nature, which are independent of humans.

3.1 Originations of bioinformatics and its relationship to computer engineering

Not until biology entered the molecular era, after the discoveries of the DNA double helix structure and the Overhauser effects and the recent advances in genomic sequencing technologies, did the explosive growth of biological information lead to the creation of bioinformatics. Bioinformatics originated from the demand for organizing and analyzing biologically related data from many contributing fields such as biophysics, molecular biology and computer engineering. The ultimate goal of bioinformatics is to enable the discovery of deeper biomedical insights and to foster a global perspective from which unifying principles in biology can be discerned. Computational intelligence is indispensable for bioinformatics because modern molecular biology and biophysics would be impossible without information storage and retrieval, statistical analysis, data fitting, and computer simulation. Given the deluge of biological information, it is impossible to process the data and discover useful information in a time-efficient manner without smart computer algorithms. Currently, bioinformatics is centered on the development and implementation of tools that enable efficient access to, and use and management of, various

types of biomedical information, and on the development of new computational intelligence and statistics with which to assess relationships among members of large datasets, such as methods to find genes inside a genome, predict protein structure and function(s), and classify and cluster proteins and genes into families of related proteins and genes. Bioinformatics relies on biological databases: large, organized bodies of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. Each field or record inside the database is biological data obtained either from physical experiments, from the results of sequence alignments, or from the data generation presented in this paper. Unlike typical engineering data, which are usually designed by humans or come from machines, biological data have a distinguishing feature: they are obtained from Nature and are usually not fabricated by machines or humans. The laws of biology, just like the laws of physics, are independent of humans. Therefore, engineering approaches toward bioinformatics may face challenges, because many types of biological data have not yet been studied by engineering scientists. It is worth understanding that the laws that govern biology are different from the laws that govern the engineering sciences. In general, biological data are more complex than the typical data that engineering science handles.

3.2 Statistical learning, data mining and computational intelligence

Data mining is defined as the automatic process of discovering patterns in data. It is estimated that the amount of data doubles each year; data mining is therefore a practical tool for learning from data, and in practice it often relies on statistical learning. Statistical learning is machine learning that incorporates statistical analysis techniques, where "machine" refers to the computer. A computer can make intelligent predictions after being trained on, or learning from, data. Computational intelligence is the study of designing intelligent algorithms or agents that make a computer behave intelligently. The goal is to understand the principles that make the computer solve problems intelligently, based on the hypothesis that reasoning is in the process of computation. Computational intelligence differs from artificial intelligence in that it is

not considered merely artificial; it exists in the reasoning process during computation. Statistical learning, data mining and computational intelligence are integrated and can be branched into supervised learning and unsupervised learning. Supervised learning requires samples with known labels to train the machine; it includes techniques such as decision trees, supervised neural networks, Ersoy's parallel self-organizing hierarchical neural networks, Ersoy's rare-event rule extraction by neural networks and decision trees, and Vapnik's support vector machines. Supervised learning achieves either classification or regression. In this paper, we study only classification, in which the goal is to predict the class labels of unclassified instances. Unsupervised learning does not need samples with known class labels; training is based on the features of the samples. Unsupervised learning mainly consists of clustering algorithms, including Kohonen's self-organizing maps, K-means, probabilistic clustering methods (the AutoClass algorithm) and hierarchical clustering. Supervised learning is highly efficient for well-defined datasets with known class labels and, especially, known underlying probability distributions, while unsupervised learning is efficient for quick data mining on "unknown" raw data with little or no knowledge about the underlying probability distributions.

3.3 Unsupervised Learning

Unsupervised learning does not use class labels to train the machine. Training is based on the features of the instances, and the goal is to cluster the data. It falls into the group of undirected data mining tools. The motivation of undirected data mining is to discover structure in the data as a whole. There is no target variable to be predicted, so no distinction is made between independent and dependent variables. Clustering techniques are used for combining observed examples into clusters (centers, groups, prototypes or neurons) that satisfy two main criteria:

· Each cluster is supposed to be homogeneous; instances that belong to the same group are similar to each other.


· Each cluster should be different from other clusters. This means instances that belong to one cluster should be different from the instances of other clusters. Depending on the clustering technique, clusters can be expressed in different ways:

· Identified clusters may be exclusive: an instance belongs to only one cluster;

· may be overlapping: an instance may belong to several clusters;

· may be probabilistic, whereby an instance belongs to each cluster with a certain probability.

· Clusters may have a hierarchical structure: instances are roughly divided at the highest level of the hierarchy and are then further divided into sub-clusters at lower levels. The procedure can be recursive. The UST algorithm developed in this paper utilizes this technique. K-means and self-organizing maps are among the most popular unsupervised learning techniques.

3.3.1 K-means clustering algorithm

The K-means algorithm uses an iterative procedure in which the crucial concept is the centroid (also called the clustering center, prototype or neuron). A centroid is an "artificial" point in the space of records that represents the average location of a particular cluster; its coordinates are the averages of the attribute values of all instances belonging to the cluster. The steps of the K-means algorithm are as follows.

1. Select K predefined prototypes (also called centers or neurons) to be the seeds for the centroids of the K clusters. They can initially be selected from K instances.

2. Assign each instance to the centroid closest to it, until all instances have been assigned to the K clusters exclusively.

3. Update the position of centroids of the clusters by averaging all attribute values of the instances belonging to the same cluster (centroid).

4. Check whether the position of the centroid of any cluster still changes. If so, repeat the procedure starting from step 2. Otherwise, all clusters are finalized, and each cluster has its defined member instances.

The K-means algorithm thus groups


different instances into different clusters: all instances within the same cluster are assumed similar, while instances belonging to different clusters are assumed different. Increasing the number of clusters enhances internal homogeneity but adds complexity.
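To make the four steps concrete, the following is a minimal C++ sketch of the K-means iteration on normalized profile vectors. The names kmeans, sqDist and Point are hypothetical, and seeding the centroids from the first K instances is just one possible choice; this is an illustrative sketch, not our actual implementation.

    #include <vector>
    #include <cmath>

    using Point = std::vector<double>;   // one instance (e.g., a 60-dimensional profile)

    static double sqDist(const Point& a, const Point& b) {
        double s = 0.0;
        for (size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;                          // squared Euclidean distance
    }

    // Returns the final centroids; assign[l] gives the cluster of instance l.
    std::vector<Point> kmeans(const std::vector<Point>& X, int K, std::vector<int>& assign) {
        std::vector<Point> W(X.begin(), X.begin() + K);   // step 1: seed centroids with K instances
        assign.assign(X.size(), -1);
        bool changed = true;
        while (changed) {                                  // step 4: repeat until assignments stabilize
            changed = false;
            for (size_t l = 0; l < X.size(); ++l) {        // step 2: assign each instance to nearest centroid
                int best = 0;
                for (int n = 1; n < K; ++n)
                    if (sqDist(X[l], W[n]) < sqDist(X[l], W[best])) best = n;
                if (assign[l] != best) { assign[l] = best; changed = true; }
            }
            for (int n = 0; n < K; ++n) {                  // step 3: recompute each centroid as the mean
                Point mean(X[0].size(), 0.0);
                int count = 0;
                for (size_t l = 0; l < X.size(); ++l)
                    if (assign[l] == n) { ++count; for (size_t i = 0; i < mean.size(); ++i) mean[i] += X[l][i]; }
                if (count > 0) { for (double& v : mean) v /= count; W[n] = mean; }
            }
        }
        return W;
    }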

We applied the K-means algorithm to the protein phylogenetic profiles. Our instances are 60-dimensional vectors derived from 60 genomes. The K-means algorithm we implemented in C++ groups the proteins (genes) that have similar features. Such similarity is measured by Euclidean distance because our dataset has been normalized. The update rule of the centers (neurons) is derived from the energy function in the Purdue ECE629 textbook (Ersoy 1999) as follows:

$$E = \frac{1}{2}\sum_{n=0}^{K-1}\sum_{l=0}^{N-1} h(n,l)\,\lVert \vec{X}_l - \vec{W}_n \rVert^2 = \frac{1}{2}\sum_{n=0}^{K-1}\sum_{l=0}^{N-1} h(n,l)\sum_i (X_{li} - W_{ni})^2$$

where $h(n,l)$ is the cluster membership function, $X$ denotes the training samples and $W$ is the location of a center:

$$\vec{X}_l \in C_n \quad \text{if} \quad \lVert \vec{X}_l - \vec{W}_n \rVert < \lVert \vec{X}_l - \vec{W}_k \rVert, \ \forall k \neq n$$

To find the minimum of the energy, let

$$\frac{\partial E}{\partial W_{nj}} = \frac{1}{2}\sum_{l=0}^{N-1} h(n,l) \cdot 2\,(X_{lj} - W_{nj})\cdot(-1) = 0$$

Solving for the center coordinates gives

$$W_{nj} = \frac{\sum_{l=0}^{N-1} h(n,l)\,X_{lj}}{\sum_{l=0}^{N-1} h(n,l)} = \frac{1}{N_n}\sum_{l=0}^{N-1} h(n,l)\,X_{lj}$$

or, in vector form,

$$\vec{W}_n = \frac{1}{N_n}\sum_{l=0}^{N-1} h(n,l)\,\vec{X}_l \qquad \text{(a)}$$

where $n = 0, 1, 2, \ldots, K-1$, and $N_n$ is the number of input vectors in the nth cluster. The K-means algorithm requires prior knowledge of a predefined, fixed number of clusters, which is the K in its name, while "means" stands for the average location of all instances

belonging to that particular cluster, measured in many cases by Euclidean distance. On the protein phylogenetic profile data, since more eubacterial genomes are currently available than eukaryotic genomes, such a "mean" is polarized toward a location on the eubacterial side of the feature space. To overcome these drawbacks of K-means, we introduce the UST (Unsupervised Supervised Tree) algorithm, which adapts to the complexity of protein phylogenetic profiles by constructing a tree structure that allows dynamic growth. Furthermore, the optional first stage of the UST uses a SOM (self-organizing map) to update the neurons' positions by combining Euclidean distance in the feature space with the locations of the neurons in a two-dimensional "unbiased" visualization space. More details on SOMs and the UST are given in the following sections and chapters.

3.3.2 Self-organizing maps (SOMs)

SOMs are unsupervised learning and data visualization techniques invented by Teuvo Kohonen (1982). SOMs handle high-dimensional data through the use of self-organizing neural networks that map high-dimensional measurements onto a lower dimensional network of neurons. Each neuron in the network is characterized by two parts: its position in the lower dimensional network and its weight in the high dimensional input space. In the training process, the winning neuron is the one whose weight vector is most similar to the training vector. The weights of the winning neuron and its neighboring neurons are updated according to a predefined learning rate. After training is complete, neighboring neurons have more similar weights than distant neurons. The update of a neuron thus involves both its coordinates in the high dimensional space and its location in the visualization space. SOMs reduce dimensionality and display similarities by relating the updated weights of the neurons to their locations in both the feature space and the visualization space. This approach may reduce the effect of heavily weighted features, which is the case in our protein phylogenetic profile data.

3.3.2.1 Update neuron's rule from energy function

The energy function of the SOM can be constructed (Ersoy 1999) as


$$E = \frac{1}{2}\sum_{n=0}^{K-1}\sum_{k=0}^{K-1}\sum_{l=0}^{N-1} g(n,k)\,h(k,l)\,\lVert \vec{X}_l - \vec{W}_n \rVert^2$$

where $h(k,l)$ is the cluster membership function (equal to 1 if neuron k is the winner for training sample l, and 0 otherwise), and $g(n,k)$ is the neighborhood function between neurons n and k. $X$ denotes the training samples and $W$ is the location (weight) of the neuron. From the gradient descent principle,

$$\Delta \vec{W}_n = -\alpha \frac{\partial E}{\partial \vec{W}_n}$$

where $0 < \alpha < 1$ and usually close to zero (e.g., $\alpha = 0.01$); $\alpha$ is called the learning rate. Differentiating the energy gives

$$\frac{\partial E}{\partial \vec{W}_n} = -\sum_{k=0}^{K-1}\sum_{l=0}^{N-1} g(n,k)\,h(k,l)\,(\vec{X}_l - \vec{W}_n)$$

$$\Delta \vec{W}_n = -\alpha \frac{\partial E}{\partial \vec{W}_n} = \alpha \sum_{k=0}^{K-1}\sum_{l=0}^{N-1} g(n,k)\,h(k,l)\,(\vec{X}_l - \vec{W}_n)$$

In the case of simple competitive learning, $g(n,k)$ equals $\delta_{nk}$, so that the sum over k drops (Ersoy 1999), and

$$\Delta \vec{W}_n = \alpha \sum_{l=0}^{N-1} h(n,l)\,(\vec{X}_l - \vec{W}_n)$$

The detailed algorithm for implementing our SOMs is discussed in the next section.

3.3.2.2 Implementation of the SOM in optional first stage of UST

Since the SOM is the typical optional first stage in the construction of the UST algorithm, we developed C++ software for the SOM. SOMs make use of two spaces: a typically high dimensional input space, along with a lower dimensional visualization space. The principal goal of a SOM is to map the high-dimensional input space into the lower dimensional space in a topologically ordered fashion. A self-organizing map is composed of neurons; each neuron is characterized by its location in the visualization space (which is fixed), along with a weight vector that describes its location in the input space (which is updated). In our research, the input samples are 60-dimensional phylogenetic profile vectors. Each element, corresponding to an attribute value, is the result of the search for the

best sequence match between the protein in the model organism and the protein in one of the 60 genomes. Each instance vector represents how likely a protein in the model organism is to be present in each of the 60 organisms. The visualization space in the SOM we implemented was chosen as two-dimensional, and the SOM neurons are assigned to a square grid in that two-dimensional visualization space. The self-organizing map is constructed as follows. A training instance is presented to the SOM at each training iteration, and the winning neuron is the one whose weight vector is most similar to the training instance according to a distance measurement. Then, the weights of the winning neuron and its neighboring neurons are updated according to the learning rate $\alpha(n)$. After the completion of training, neighboring neurons have more similar weight vectors than distant neurons. The procedure is given as follows. The mth input training gene vector is denoted by

$$\vec{G}^m = [x_1^m, x_2^m, x_3^m, \ldots, x_{60}^m]^T$$

The synaptic weight vector Wj of each neuron in the network has the same dimension as the input space. The synaptic weight vector of the jth neuron is denoted by

$$\vec{W}^j = [w_1^j, w_2^j, w_3^j, \ldots, w_{60}^j]^T, \quad j = 1, 2, 3, \ldots, N$$

where N is the total number of neurons in the network; j, indexed from 1 to N, is also the index of the weight vector of each neuron at the moment of the specific input $\vec{G}^m$. After calculating all the inner products between the weight vector of each neuron and the training instance, the weight of the winning neuron is determined by

$$\vec{W}^{winner} \Leftarrow \max_{j} \left( (\vec{W}^j)^T \vec{G}^m \right), \quad j = 1, 2, 3, \ldots, N$$

Identifying the winning neuron is equivalent to minimizing the Euclidean distance between the training vector and the weight vectors of the neurons (Euclidean distance can be used because our data have been normalized). We can also express the same result by using an index i(G) to identify the winning neuron:

$$i(\vec{G}^m) = \arg\min_{j} \lVert \vec{G}^m - \vec{W}^j \rVert, \quad j = 1, 2, 3, \ldots, N$$


This winning neuron is the center of a topological neighborhood of cooperating neurons. We use $h_j^i$ to represent the topological neighborhood function around the winning neuron. $h_j^i$ is chosen as a unimodal function of the lateral distance $d_j^i$. Here i refers to the winning neuron and j refers to an excited neuron nearby. We use a Gaussian neighborhood function, defined by

$$h_j^{i(\vec{G}^m)}(n) = \exp\left(-\frac{(d_j^i)^2}{2\sigma^2(n)}\right), \quad n = 0, 1, 2, 3, \ldots$$

where

$$\sigma(n) = \sigma_0 \exp\left(-\frac{n}{\tau}\right), \quad n = 0, 1, 2, 3, \ldots$$

n is the iteration index during training, $\tau$ is the time constant, and $\sigma_0$ is the value of $\sigma$ at initialization. The Gaussian form of this neighborhood function can be visualized as shown in Figure 3.1.

As the iterations increase, $h_j^i$ approaches 0 for large n; the neighborhood function approaches zero quickly for $i \neq j$, so the neighborhood shrinks in size and the amount by which the neuron weights are updated approaches zero.

In our two-dimensional lattice, $d_j^i$ is the distance between the ith and the jth neurons in the visualization space:

$$(d_j^i)^2 = \lVert \vec{r}_j - \vec{r}_i \rVert^2$$

In the discrete-time formalism, given the synaptic weight vector Wj(n) of the neuron j at the time n, the updated weight vector Wj(n +1) at time n + 1 is defined by

$$\vec{W}_j(n+1) = \vec{W}_j(n) + \alpha(n)\,h_j^{i(\vec{G}^m)}(n)\,(\vec{G}^m - \vec{W}_j(n))$$


Figure 3.1. Gaussian neighborhood function.

where

$$\alpha(n) = \alpha_0 \exp\left(-\frac{n}{\tau}\right)$$

$\alpha_0$ is the initial learning rate and $\tau$ is the time constant.

If j happens to be the winning neuron, $h_j^{i(\vec{G}^m)} = 1$.

Training stops either when the number of iterations exceeds a preset large number (for example, we set n = 1000) or when the measured change in the position of the winning neuron is sufficiently small. We then review the result to determine whether the training was successful. If it was, similar proteins (genes) are grouped together around a neuron, while dissimilar proteins (genes) are placed into different neurons.
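A minimal C++ sketch of one training iteration under the update rules above is given below. The grid layout, the use of a single time constant for both decay schedules, and the names somStep and Neuron are illustrative assumptions rather than our exact implementation.

    #include <vector>
    #include <cmath>

    struct Neuron {
        double row, col;              // fixed location in the 2-D visualization grid
        std::vector<double> w;        // weight vector in the 60-dimensional input space
    };

    // One training iteration n: present profile g, find the winner, update the neighborhood.
    void somStep(std::vector<Neuron>& net, const std::vector<double>& g,
                 int n, double alpha0, double sigma0, double tau) {
        // 1. Winner = neuron whose weight vector is closest to g (Euclidean distance).
        size_t win = 0; double best = 1e300;
        for (size_t j = 0; j < net.size(); ++j) {
            double d = 0.0;
            for (size_t i = 0; i < g.size(); ++i) d += (g[i] - net[j].w[i]) * (g[i] - net[j].w[i]);
            if (d < best) { best = d; win = j; }
        }
        // 2. Decaying learning rate and neighborhood width.
        double alpha = alpha0 * std::exp(-n / tau);
        double sigma = sigma0 * std::exp(-n / tau);
        // 3. Update every neuron, weighted by the Gaussian neighborhood in grid space.
        for (auto& nj : net) {
            double dr = nj.row - net[win].row, dc = nj.col - net[win].col;
            double h = std::exp(-(dr * dr + dc * dc) / (2.0 * sigma * sigma));
            for (size_t i = 0; i < g.size(); ++i)
                nj.w[i] += alpha * h * (g[i] - nj.w[i]);
        }
    }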


3.3.3 Unsupervised "decision trees"

This is the case in which "decision trees" are used to do unsupervised clustering. Bellot et al. (2000) use "decision tree" clustering to "classify" documents returned from a query as being "relevant" or "irrelevant", but this "classification" (actually clustering) does not use class labels, since a "decision tree" is generated to do the clustering. Traditionally, a decision tree is viewed as a supervised learning method, because the splitting is guided by an impurity measure that depends on the class labels of the data. However, the "decision tree" constructed by Bellot et al. does not look at class labels when it splits the data into clusters; instead, it uses a splitting criterion based on a measure they call "cluster relevance", which is based on "entropy" without using class labels. Bing Liu et al. (2001) also used a "decision tree" to cluster data. Given a set of data to be clustered, they give all the data the class label "Y". Then they generate new data points, uniformly distributed over the feature space, and give the new data the class label "N". They then use a decision tree algorithm to construct a partition that separates the "Y"-labeled points from the "N"-labeled points. The cells of the partition that contain the "Y"-labeled points constitute a clustering of the original data. The partition of the "N"-labeled data is meaningless, since it is artificial, uniformly distributed data, while no information can be obtained from the "Y"-labeled data regarding how the data itself should be split according to its features or labels; therefore, this type of unsupervised decision tree algorithm has no implicit significance for our application. Both algorithms are attempts to do clustering using a "decision tree".

3.4 Supervised learning

Supervised learning uses the information of class labels to train a learning machine. Supervised learning includes decision trees, neural networks, Vapnik's support vector machines, Ersoy's parallel self-organizing hierarchical neural networks and nearest neighbor classifiers. Currently, support vector machines and decision trees are the most popular supervised

classifiers, and Ersoy's parallel self-organizing hierarchical neural networks have the best performance on many types of data.

3.4.1 Support vector machines

Vapnik's support vector machines are classifiers that transform the input samples into a high- or even infinite-dimensional space by means of a kernel function, and then separate the two classes mapped into that space with a linear hyperplane defined by support vectors, which are selected vectors from the training samples (Figure 3.2). Support vector machines are considered a recent innovation in machine learning and draw a great deal of research effort from computer scientists and engineers. They are commonly regarded as today's best classifier, built on sophisticated statistical learning theory. This may well be true for some simple data for which an analytical form of the underlying joint probability distribution of each class exists. However, when the input data get more complicated, support vector machines may not be capable of learning from the data; they are a highly data-specific algorithm. We selected SVM-Light version 5.0 (Joachims 2002) as the support vector machine used in our research because SVM-Light is the support vector machine that Vert (2002) and Pavlidis et al. (2002) used in their research on classifying heterogeneous data mixing DNA microarray data with protein phylogenetic profiles; our results are therefore directly comparable to theirs. Furthermore, SVM-Light allows user-defined kernel functions, so our USTs can also be compared to SVMs with optimized kernel functions.

3.4.2 Decision trees

The decision tree algorithms used in our research are supervised classifiers that partition the sample space by the criterion of information gain computed from the class labels. A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a classification or decision. Decision trees split a node according to the attributes. Which attribute, or which combination of attributes, a decision tree should split on at each step is determined by the information gain. If S is the set of training data and $C_j$ is a class in the set,

the expected information for deciding a given class in S is given by

$$\mathrm{Info}(S) = -\sum_{j=1}^{k} \frac{|C_j, S|}{|S|} \log_2\!\left(\frac{|C_j, S|}{|S|}\right)$$

where k is the number of classes. When a feature X, with t possible values, has been chosen as the test feature, the expected information for deciding a class under the test is given by

$$\mathrm{Info}_X(S) = \sum_{i=1}^{t} \frac{|S_i|}{|S|}\,\mathrm{Info}(S_i)$$

The information gain is the difference between the expected information for deciding a class with and without the test feature X:

$$\mathrm{gain}(S, X) = \mathrm{Info}(S) - \mathrm{Info}_X(S)$$

The feature producing the highest information gain is chosen as the current split.
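As an illustration of these formulas (not code from our implementation), the following C++ sketch computes Info(S), Info_X(S) and the resulting gain for a single nominal test feature; the names entropy and infoGain and the pair-based instance representation are assumptions.

    #include <vector>
    #include <map>
    #include <utility>
    #include <cmath>

    // Instance: (class label, value of the candidate test feature X)
    using Instance = std::pair<int, int>;

    // Info(S): expected information (entropy) of the class labels in S.
    double entropy(const std::vector<Instance>& S) {
        std::map<int, int> counts;
        for (const auto& s : S) counts[s.first]++;
        double info = 0.0;
        for (const auto& c : counts) {
            double p = double(c.second) / S.size();
            info -= p * std::log2(p);
        }
        return info;
    }

    // gain(S, X) = Info(S) - Info_X(S), where Info_X(S) weights the entropy of each
    // subset S_i induced by a value of the test feature X by |S_i| / |S|.
    double infoGain(const std::vector<Instance>& S) {
        std::map<int, std::vector<Instance>> parts;
        for (const auto& s : S) parts[s.second].push_back(s);
        double infoX = 0.0;
        for (const auto& p : parts)
            infoX += (double(p.second.size()) / S.size()) * entropy(p.second);
        return entropy(S) - infoX;
    }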

There are several types of decision trees, including C4.5, C5, ID3, GINI, Codrington's Matlab toolbox (Codrington 1996) and CART (classification and regression trees). The procedure of the top-down induced decision tree used for data mining in the human genome project (Ersoy 2000) is as follows:

1. Divide a set of examples, called the training set S, into two or more subsets using a test on one or more features. If every subset of S is pure (all examples in the subset are from one class), then stop.

2. If not, investigate all tests that split S into subsets. Rank each test according to how well it divides the examples.

3. Select the test that ranks first.

4. Apply this operation recursively on each subset.

The result is a decision tree with test nodes and leaf nodes, where the leaf nodes contain class labels. Decision trees partition the feature or attribute space perpendicularly. Figure 3.3 shows a decision tree from the Matlab toolbox developed by Craig W. Codrington (1996). In order to

compare the performance of our new methods to the other algorithms, we chose the decision trees C4.5 and C5 (Quinlan 2000) in our research because C4.5 and C5 are the most popular decision trees.

3.4.3 Neural networks

Neural networks are commonly regarded as supervised classifiers and were initiated by McCulloch and Pitts in 1943. In 1982, Kohonen introduced an unsupervised neural network called the SOM (self-organizing map). Great enthusiasm for the development of multilayer backpropagation and all sorts of neural networks arose in the 1980s, when neural networks reached a peak of scientific endeavor. In the 1990s, Ersoy's parallel self-


Figure 3.2. Support vector machines.


Figure 3.3 Decision tree.

organizing hierarchical neural networks were constructed. Furthermore, in 2000, Ersoy introduced rare-event rule extraction by neural networks in the human genome project. The motivation of neural networks is to model human intelligence. Although neural networks can be trained to reason and make predictions, whether they can really simulate human intelligence remains an open philosophical debate.

3.4.4 Ersoy’s parallel, self-organizing, hierarchical neural networks Parallel, self-organizing hierarchical neural networks (PSHNN) was invented by Professor Okan Ersoy in 1990s. The PSHNN architecture is proposed for the purpose of increasing classification accuracy, reducing learning rates, and achieving true parallelism. It involves a self-organizing number of stages with error detection and rejects schemes. The number of stages on a PSHNN system grows to adapt to the difficulties of classification problems. The PSHNN systems have many attractive properties, such as fast learning time, parallel operation of self-organizing of neural networks during testing, and high performance in applications. Adaptation to optimize connection weights by adjusting the error-detection bounds and thereby resulting very high fault-tolerance and robustness are also achieved with theses systems (Leondes 1998).

3.4.5 Nearest neighbor classifiers

A nearest neighbor classifier assigns the classification of an instance according to the class information of its nearest neighbor. The K-nearest neighbor classifier makes its prediction according to the majority vote of the classes of the k nearest neighbors. A new type of nearest neighbor classifier, called the MLIC (Multiple-Labeled Instance Classifier), is developed in this paper to assess protein functions on a large scale using computational intelligence.
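A minimal C++ sketch of the plain K-nearest neighbor vote in the single-label case is given below; knnPredict is a hypothetical name, and the MLIC extension to multiple labels is developed in Chapter 7.

    #include <vector>
    #include <algorithm>
    #include <map>
    #include <cmath>

    // Predict the label of 'query' by majority vote among its k nearest training instances.
    int knnPredict(const std::vector<std::vector<double>>& X, const std::vector<int>& y,
                   const std::vector<double>& query, int k) {
        std::vector<std::pair<double, int>> dist;            // (distance, training index)
        for (size_t l = 0; l < X.size(); ++l) {
            double d = 0.0;
            for (size_t i = 0; i < query.size(); ++i)
                d += (query[i] - X[l][i]) * (query[i] - X[l][i]);
            dist.push_back({std::sqrt(d), int(l)});
        }
        std::partial_sort(dist.begin(), dist.begin() + k, dist.end());   // k nearest first
        std::map<int, int> votes;                             // label -> number of votes
        for (int j = 0; j < k; ++j) votes[y[dist[j].second]]++;
        int best = -1, bestVotes = 0;
        for (const auto& v : votes)
            if (v.second > bestVotes) { bestVotes = v.second; best = v.first; }
        return best;
    }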

3.5 Research strategy

The advantages of supervised learning for classification can be summarized as:


· If the underlying probability distribution of the data is completely known, we can design an optimal classifier, namely the Bayesian classifier.

· However, even without complete knowledge of the probability distribution of the data, it is still often possible to build a useful classifier. For example when designing a decision tree or a nearest-neighbor classifier, one rarely has complete knowledge of the underlying probability distribution of the data.

· The performance is easy to evaluate.

The disadvantages of supervised learning can be summarized as:

· It is difficult to estimate certain parameters, such as K in the nearest neighbor classifier and the number of hidden neurons in a neural network.

· It is hard to learn complex data.

The advantages of unsupervised learning (clustering) can be summarized as:

· Allows data mining on “unknown” data.

· There is no need to know the underlying probability distribution of the data.

The disadvantages of unsupervised learning can be summarized as:

· Learning is usually not robust.

· Result is usually not unique.

· It is difficult to estimate certain parameters; one usually needs prior knowledge of the number of neurons, cluster centers or prototypes.

· No unique way to evaluate the performance of a clustering algorithm.

· Hard to evaluate results.


4. THE UST ALGORITHM: A NEW WAY OF COMBINING SUPERVISED AND UNSUPERVISED LEARNING

In this chapter, we introduce a new algorithm called the Unsupervised-Supervised Tree (UST). The core of the UST is the MCT (Maximum Contrast Tree). Many machine learning methods, such as K-means, require predefined parameters to be chosen that determine the granularity of the result. However, if the underlying probability distribution of each class is unknown, then there is no good way to estimate how many parameters should be chosen or how they should be chosen. Furthermore, the resulting granularity cannot reflect the proteins' hierarchical functional relationships. After studying the only two papers currently available on machine learning approaches to protein phylogenetic profiles outside Ersoy's group (Pavlidis 2002, Vert 2002), we looked for an alternative to well-known techniques such as support vector machines (which were built on statistical theory for two-class data) for the following reasons:

1. Reducing the multifunctional protein classes (such as those shown in Table 5.4) down to two classes may result in a loss of information regarding protein functional relationships.

2. The complex nature of protein-protein interactions may mean that the learning problem is too difficult for a purely supervised learning method.

3. We are not convinced that the tree kernel is a biologically plausible model for a phylogenetic tree.

4. Protein functional classes with a small number of instances (rare event classes) pose a difficult learning problem, particularly for SVMs; however, such classes may be highly significant in medical science and should not be ignored.

In developing our algorithm, our goal has been to combine the advantages of supervised and unsupervised learning to build a relational tree that best reflects the proteins' hierarchical functional relationships.


4.1 Structure of the UST

We developed the UST algorithm, which yields a hierarchical decomposition of the feature space in a manner that groups proteins (genes) with similar profiles at each stage, thus transforming a complex problem into simpler subproblems to improve the performance of generalization. The optional first stage of the UST tree clusters the data using clustering algorithms such as SOMs and K-means. This is the unsupervised stage. In this paper, we used SOMs rather than K-means, because SOMs have more favorable features for clustering and analysis of gene expression patterns (Tamayo 1999). The next stage involves constructing a hierarchical tree so that protein functional relationships can be mapped onto the relational tree structure; this stage can be either supervised or unsupervised. Learning methods that can be used in the second stage include

· Maximum Contrast Trees (MCT)

· Decision Trees

· Bayesian Trees

· Hierarchical trees

· Support vector machine trees (Ersoy 2003)

The Maximum Contrast Tree is the one we have developed in our own work, and it has the best performance. Therefore, the MCT forms the core of the UST algorithm. It is called the Maximum Contrast Tree because it decomposes the feature space in a way that guarantees that the proteins (genes) with maximum separation end up in different nodes of the tree. The UST algorithm illustrated in this paper is based on the construction of the MCT with an optional first stage of SOM. Other types of USTs are also useful (Yang 2002). Since the implementation of the SOM was explained in the previous chapter, we focus on the construction of the MCT in detail in this chapter.

4.1.1 The optional first stage: self-organizing maps (SOMs)

Our motivation for using a SOM as the first stage is to reduce the computational time involved in the MCT second stage, since the SOM can roughly group similar instances at the first

stage. However, the resolution of a SOM is low and the result is not robust: similar instances may be assigned to different neurons, and such an assignment is not unique, even with the same data, when the initial locations of the neurons differ. Unfortunately, the SOM does not have a rule for how the neurons should be initialized. Consequently, the SOM has only been considered as an optional first stage of the UST. In the C++ SOM software we developed, the inputs are 60-dimensional phylogenetic profile vectors; each element corresponds to an attribute value derived from one of the 60 genomes. The visualization space is chosen as two-dimensional, and the neurons are initialized on a square grid.

4.1.2 Maximum Contrast Tree

The core of the UST is the Maximum Contrast Tree (MCT) and the way it is utilized as a supervised learning classifier. The tree is constructed by performing a hierarchical decomposition of the feature space; this step is performed regardless of the complex nature of the protein functional class labels. The MCTs are a family of completely independent algorithms and can be used alone, without the first stage; in such usage, the USTs are just one-stage MCTs. The following sections illustrate the MCTs. We explain the construction of the MCT as Method I and the balanced MCT as Method II. There are other types of MCTs, such as the fixed-point MCT.

4.2 Constructing the Maximum Contrast Tree

Based on the nearest neighbor scenario, the Maximum Contrast Trees (MCTs) are constructed so that at each stage the two training instances with maximum contrast (that is, the two training instances with maximum separation according to some distance measure) are used as seeds from which a partition of the feature space (into two sets) is grown (this is called splitting a node). Nearest neighbor instances should always go to the same node. The construction of the tree begins with a root node N that contains all the training data.

The procedure split(N) splits the root node into two child nodes NL and NR, which are then

recursively split using the same procedure. We have investigated several different ways of splitting a node; we refer to these as Method I, Method II, and so on. The difference between Method I and Method II is that Method II guarantees a balanced tree, whereas Method I does not. We refer to Method I as the traditional MCT, or just the MCT, and to Method II as the balanced MCT. Another type of MCT is the fixed-point MCT. We describe each method below.

4.2.1 Method I: MCT

Procedure split(N)

Let T be the set of training instances in node N. First we find the two instances $\vec{X}_L$, $\vec{X}_R$ in T that have maximum separation according to some distance measure $d(\cdot,\cdot)$, i.e.

$$d(\vec{X}_L, \vec{X}_R) = \max_{i,j} d(\vec{X}_i, \vec{X}_j)$$

Let $A_N^L$ be a set of training instances that initially contains only the element $\vec{X}_L$, and $A_N^R$ be a set of training instances that initially contains only the element $\vec{X}_R$. Let $d(\vec{X}, A_N^L)$ be the minimum distance of instance $\vec{X}$ to any instance in $A_N^L$, and similarly let $d(\vec{X}, A_N^R)$ be the minimum distance of instance $\vec{X}$ to any instance in $A_N^R$. We grow the sets $A_N^L$ and $A_N^R$ until they include all the training instances in T as follows:

Repeat until $A_N^L \cup A_N^R = T$:

· Let $\vec{X}_1$ be the instance in the set $T - (A_N^L \cup A_N^R)$ that minimizes the distance to any instance in $A_N^L$, i.e.

$$\vec{X}_1 = \arg\min_{\vec{X} \in T - (A_N^L \cup A_N^R)} d(\vec{X}, A_N^L),$$

and let $\vec{X}_2$ be the instance in the set $T - (A_N^L \cup A_N^R)$ that minimizes the distance to any instance in $A_N^R$, i.e.

$$\vec{X}_2 = \arg\min_{\vec{X} \in T - (A_N^L \cup A_N^R)} d(\vec{X}, A_N^R)$$

· If $d(\vec{X}_1, A_N^L) < d(\vec{X}_2, A_N^R)$, then add $\vec{X}_1$ to $A_N^L$; else add $\vec{X}_2$ to $A_N^R$.

Create two new nodes $N_L$ and $N_R$, which are, respectively, the left and right child of node N. Initialize node $N_L$ so that it contains all the training instances in the set $A_N^L$, and similarly initialize node $N_R$ so that it contains all the training instances in the set $A_N^R$. This completes the description of Method I.

4.2.2 Method II: Balanced MCT

Procedure split(N)

Let T be the set of training instances in node N. First we find the two instances $\vec{X}_L$, $\vec{X}_R$ in T that have maximum separation according to some distance measure $d(\cdot,\cdot)$, i.e.

$$d(\vec{X}_L, \vec{X}_R) = \max_{i,j} d(\vec{X}_i, \vec{X}_j)$$

Let $A_N^L$ be a set of training instances that initially contains only the element $\vec{X}_L$, and $A_N^R$ be a set of training instances that initially contains only the element $\vec{X}_R$. Let $d(\vec{X}, A_N^L)$ be the minimum distance of instance $\vec{X}$ to any instance in $A_N^L$, and similarly let $d(\vec{X}, A_N^R)$ be the minimum distance of instance $\vec{X}$ to any instance in $A_N^R$. We grow the sets $A_N^L$ and $A_N^R$ until they include all the training instances in T as follows:

Repeat until $A_N^L \cup A_N^R = T$:

· Let $\vec{X}_1$ be the instance in the set $T - (A_N^L \cup A_N^R)$ that minimizes the distance to any instance in $A_N^L$, i.e.

$$\vec{X}_1 = \arg\min_{\vec{X} \in T - (A_N^L \cup A_N^R)} d(\vec{X}, A_N^L)$$

· Add $\vec{X}_1$ to $A_N^L$.

· If $T - (A_N^L \cup A_N^R)$ is not empty, then let $\vec{X}_2$ be the instance in the set $T - (A_N^L \cup A_N^R)$ that minimizes the distance to any instance in $A_N^R$, i.e.

$$\vec{X}_2 = \arg\min_{\vec{X} \in T - (A_N^L \cup A_N^R)} d(\vec{X}, A_N^R)$$

· Add $\vec{X}_2$ to $A_N^R$.

Create two new nodes $N_L$ and $N_R$, which are, respectively, the left and right child of node N. Initialize node $N_L$ so that it contains all the training instances in the set $A_N^L$, and similarly initialize node $N_R$ so that it contains all the training instances in the set $A_N^R$. This completes the description of Method II.

While we have used a Euclidean distance measure in our experiments, other distance measures could be used as well; we list two distance measures below.

Euclidean distance:

$$d(\vec{X}, \vec{Y}) = \sqrt{\sum_{i=1}^{m} (X_i - Y_i)^2}$$

Correlation coefficient:

$$d(\vec{X}, \vec{Y}) = \frac{\sum_{i=1}^{m} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{m} (X_i - \bar{X})^2 \ \sum_{i=1}^{m} (Y_i - \bar{Y})^2}}$$
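Either measure can be plugged into the split procedure. The sketch below codes the two in C++, returning the correlation-based measure as 1 - r so that, like the Euclidean distance, smaller values indicate more similar profiles (a common convention, stated here as an assumption rather than as the report's exact choice).

    #include <vector>
    #include <cmath>

    // Euclidean distance between two profiles.
    double euclidean(const std::vector<double>& x, const std::vector<double>& y) {
        double s = 0.0;
        for (size_t i = 0; i < x.size(); ++i) s += (x[i] - y[i]) * (x[i] - y[i]);
        return std::sqrt(s);
    }

    // Correlation-based dissimilarity: 1 - Pearson correlation coefficient.
    double correlationDistance(const std::vector<double>& x, const std::vector<double>& y) {
        const double m = double(x.size());
        double mx = 0.0, my = 0.0;
        for (size_t i = 0; i < x.size(); ++i) { mx += x[i]; my += y[i]; }
        mx /= m; my /= m;
        double num = 0.0, sx = 0.0, sy = 0.0;
        for (size_t i = 0; i < x.size(); ++i) {
            num += (x[i] - mx) * (y[i] - my);
            sx  += (x[i] - mx) * (x[i] - mx);
            sy  += (y[i] - my) * (y[i] - my);
        }
        return 1.0 - num / std::sqrt(sx * sy);
    }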


4.2.3 The fixed point MCT

The fixed-point MCT is a modified version of the MCT. It is constructed by first selecting the two maximally separated seeds, which are placed in the left and right child nodes; we call them the left seed and the right seed. All instances in the mother node that are closer to the left seed than to the right seed are placed in the left node; otherwise, they are placed in the right node. The procedure is applied recursively until each terminal node contains only one instance.
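A C++ sketch of the fixed-point split is given below; it differs from Method I only in that each remaining instance is compared to the two fixed seeds rather than to the growing sets. The name fixedPointSplit is illustrative and Euclidean distance is assumed.

    #include <vector>

    // seedL and seedR index the two maximally separated instances in T
    // (found as in Method I). Each remaining instance goes to the side of its nearer seed.
    void fixedPointSplit(const std::vector<std::vector<double>>& T, int seedL, int seedR,
                         std::vector<int>& left, std::vector<int>& right) {
        left = {seedL}; right = {seedR};
        for (int i = 0; i < int(T.size()); ++i) {
            if (i == seedL || i == seedR) continue;
            double dl = 0.0, dr = 0.0;
            for (size_t k = 0; k < T[i].size(); ++k) {
                dl += (T[i][k] - T[seedL][k]) * (T[i][k] - T[seedL][k]);
                dr += (T[i][k] - T[seedR][k]) * (T[i][k] - T[seedR][k]);
            }
            if (dl < dr) left.push_back(i); else right.push_back(i);
        }
    }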

4.3 Fast implementation algorithm

A fast implementation is built on a lookup table that stores the distances between all pairs of proteins (genes) in sorted order. The processing is then reduced to information retrieval rather than time-consuming repetitive calculation. A pseudocode is provided below.

Build_MCT_Tree(GeneList)
    if Length(GeneList) <= 1 then return        > terminal node reached
    k ← 1
    for i = 1 to Length(GeneList) do
        for j = i + 1 to Length(GeneList) do
            Dist[k] ← Distance(GeneList[i], GeneList[j])
            IndexOne[k] ← i
            IndexTwo[k] ← j
            k ← k + 1
    Sort(Dist, IndexOne, IndexTwo)              > sort the triples by distance in ascending order;
                                                > the last entry is the pair of genes farthest apart
    last ← Length(Dist)
    h1 ← 1
    h2 ← 1
    TreeLeft[h1] ← IndexOne[last]
    TreeRight[h2] ← IndexTwo[last]
    GeneList[IndexOne[last]].Mark ← Left
    GeneList[IndexTwo[last]].Mark ← Right
    while Length(TreeLeft) + Length(TreeRight) ≠ Length(GeneList) do
        Find_Next_Gene(TreeLeft, TreeRight, IndexOne, IndexTwo, GeneList)
    Build_MCT_Tree(TreeLeft)
    Build_MCT_Tree(TreeRight)

Find_Next_Gene(TreeLeft, TreeRight, IndexOne, IndexTwo, GeneList)
    h1 ← Length(TreeLeft)
    h2 ← Length(TreeRight)
    for i = 1 to Length(Dist) do                > scan pairs from nearest to farthest
        if GeneList[IndexOne[i]].Mark = Blank and GeneList[IndexTwo[i]].Mark = Blank
            then skip (do nothing)
        else if GeneList[IndexOne[i]].Mark ≠ Blank and GeneList[IndexTwo[i]].Mark ≠ Blank
            then skip (do nothing)
        else if GeneList[IndexOne[i]].Mark = Left and GeneList[IndexTwo[i]].Mark = Blank
            then h1 ← h1 + 1
                 TreeLeft[h1] ← IndexTwo[i]
                 GeneList[IndexTwo[i]].Mark ← Left
                 return
        else if GeneList[IndexOne[i]].Mark = Blank and GeneList[IndexTwo[i]].Mark = Left
            then h1 ← h1 + 1
                 TreeLeft[h1] ← IndexOne[i]
                 GeneList[IndexOne[i]].Mark ← Left
                 return
        else if GeneList[IndexOne[i]].Mark = Right and GeneList[IndexTwo[i]].Mark = Blank
            then h2 ← h2 + 1
                 TreeRight[h2] ← IndexTwo[i]
                 GeneList[IndexTwo[i]].Mark ← Right
                 return
        else if GeneList[IndexOne[i]].Mark = Blank and GeneList[IndexTwo[i]].Mark = Right
            then h2 ← h2 + 1
                 TreeRight[h2] ← IndexOne[i]
                 GeneList[IndexOne[i]].Mark ← Right
                 return

In this way, the computational complexity is reduced from $O(n^6)$ to $O(n^3 \ln n)$, which is at least a million times faster than the regular implementation.

4.4 The energy function

Clustering algorithms can often be derived by first defining an appropriate energy function over the parameter space. Choosing the parameters that minimize the energy can then serve as the basis for defining clusters. Thus, when a newly arriving instance is placed next to its nearest neighbor, the energy is minimized, and new instances should enter the sub-nodes of the tree according to the nearest neighbor rule. This serves two purposes: first, the two child nodes are formed with maximum separation from each other, which maximizes the boundary that separates them; second, the instances inside a child node are maximally similar to each other. Therefore, the MCT is constructed to minimize the energy function at each local step. The split at each node of the Maximum Contrast Tree can be viewed as minimizing an energy function. Let T be the set of training instances at a node. For a given partition of the set T into two sets $A^L$ and $A^R$, we define the energy of this partition to be


$$E(\{A^L, A^R\}) = -\min_{\vec{X} \in A^L,\ \vec{Y} \in A^R} d(\vec{X}, \vec{Y})$$

Then the goal is to find the two sets $A^L$, $A^R$ that minimize $E(\{A^L, A^R\})$ subject to the constraint that $A^L \cup A^R = T$. A plausible algorithm for constructing the maximal partition is as follows. Define a third set, $A^U$, the set of undecided instances, and initially let this set consist of all instances in T. Let $A^L$ and $A^R$ initially be empty. To start off the algorithm, we choose two instances from $A^U$; one will be added to $A^L$, and the other will be added to $A^R$ (both instances are then deleted from $A^U$). Which two instances should we choose to give the minimum value of $E(\{A^L, A^R\})$? The answer is that we should choose the two most widely separated instances, just as we do in the maximum contrast algorithm. The algorithm is iterative: on each iteration, we move one instance from $A^U$ into either $A^L$ or $A^R$. We choose to move the instance that results in the smallest value of $E(\{A^L, A^R\})$. This instance turns out to be the instance in $A^U$ that is closest to either $A^L$ or $A^R$. If this instance is closest to $A^L$, we move it from $A^U$ to $A^L$; otherwise, if it is closest to $A^R$, we move it from $A^U$ to $A^R$. Since on each iteration we remove one instance from $A^U$, the algorithm eventually terminates when $A^U$ is empty, at which point $A^L \cup A^R = T$. Since at each step we have chosen the action that minimizes $E(\{A^L, A^R\})$, it is plausible that when the algorithm terminates, the resulting partition will be optimal. Further note that the steps above correspond exactly to the steps involved in splitting a node in the naturally growing Maximum Contrast Tree. The MCTs belong to the family of stepwise greedy algorithms.

4.5 Overview of UST results

All the 6357 proteins (derived from open reading frames) inside the yeast genome were

mapped onto a functionally related tree structure by the UST, so that protein functional relationships can be viewed at any layer and at any node (Figure 4.1). Such a tree structure preserves all complex protein functional information, including proteins belonging to rare event classes (classes of proteins with fewer occurrences inside a genome); such data occur commonly in biological datasets. It is inspiring that unknown proteins can be assigned functions from known proteins using the UST method. The UST allowed us to establish many links between proteins of related functions in the yeast Saccharomyces cerevisiae, as further discussed in the next chapters.


Figure 4.1. Overall procedure


5. DATA GENERATION

5.1 Reasons for generating our own data

We decided to generate our own dataset for the following reasons:

1. The heterogeneous data used by Pavlidis (2002) and Vert (2002) contain DNA microarray data that may eliminate the capacity to detect protein functional relationships without sequence homology, because DNA microarray data may not be usable for assigning a function to an unknown protein from a known protein in the absence of sequence homology.

2. DNA microarray data contain high experimental noise that ends up being merged into the heterogeneous data.

3. We wanted to work with a complete genome, rather than just 2465 genes, as more genes mean more functional relationships.

4. We wanted to increase the number of species in the profiles, to improve the performance.

5. We did not want to neglect rarely occurring genes (those belonging to rare event classes), because those classes may be of significant biological interest.

5.2 Method in generating protein phylogenetic profiles

Data generation and procedures are summarized in Figure 5.1. For each of the 6357 ORFs (open reading frames) in the yeast Saccharomyces cerevisiae genome¶, we constructed a protein phylogenetic profile over 60 complete genomes§, which include 12 archaeabacterial genomes (Table 5.1), 45 eubacterial genomes (Table 5.2) and 3 eukaryotic genomes (Table 5.3). The ith element of the phylogenetic profile for a given ORF A is obtained by using the BLASTp program


(Altschul 1997) to search the ith genome for the presence of ORF A. The score reported by BLASTp (the BLAST program for proteins) is given by:

$$\max_{\text{ORF } B \text{ in the } i\text{th genome}} \mathrm{Evalue}(A, B)$$

where Evalue(A,B) is a matching score that lies between 0 and infinity. The Evalue is obtained from another quantity called the P-value via the transformation

$$\mathrm{Evalue}(A,B) = -\ln\left(1 - \mathrm{Pvalue}(A,B)\right)$$

The Pvalue is a matching score (probability) that lies between 0 and 1; Pvalue(A,B) equals one if the internal amino-acid sequences of ORFs A and B are identical, and equals zero if they are completely different. The phylogenetic profiles were then normalized by subtracting off the expectation value of each attribute and dividing by the standard deviation of each attribute.

______
¶ downloaded from ftp://genome-ftp.stanford.edu.
§ downloaded from the National Center for Biotechnology Information (NCBI) site
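The normalization step can be sketched in C++ as follows; the name normalizeProfiles and the row-per-protein layout are assumptions made for illustration.

    #include <vector>
    #include <cmath>

    // profiles[p][g]: raw score of protein p against genome g.
    // Normalize each attribute (column g) to zero mean and unit standard deviation.
    void normalizeProfiles(std::vector<std::vector<double>>& profiles) {
        if (profiles.empty()) return;
        const size_t nGenomes = profiles[0].size();
        const double nProt = double(profiles.size());
        for (size_t g = 0; g < nGenomes; ++g) {
            double mean = 0.0;
            for (const auto& p : profiles) mean += p[g];
            mean /= nProt;
            double var = 0.0;
            for (const auto& p : profiles) var += (p[g] - mean) * (p[g] - mean);
            double sd = std::sqrt(var / nProt);
            if (sd == 0.0) sd = 1.0;               // guard against constant attributes
            for (auto& p : profiles) p[g] = (p[g] - mean) / sd;
        }
    }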

5.3 Obtaining protein functional labels

Protein function descriptions and class labels were assigned according to the MIPS Comprehensive Yeast Genome Database (CYGD)† using the Pedant (Protein Extraction, Description and Analysis Tools) program‡. Of the 6357 Open Reading Frames (ORFs) in the Saccharomyces cerevisiae genome, 3887 are functionally known, and 2470 are functionally unknown or questionable; only 1084 carry a unique label. Functionally unknown ORFs are assigned the label "98", and questionable ORFs are assigned the label "99". Functionally known ORFs may carry more than ten labels, owing to the multifunctional nature of proteins. There are altogether 285 hierarchical protein functional classes in the yeast genome (e.g., 01.01.07 amino acid transport, 01.20.17.03 biosynthesis of amines, 02.11.05 accessory proteins of electron transport and membrane-associated energy conservation), among which 20 are in the first layer (e.g., 01 metabolism, 02 energy, 67 transport facilitation). It is important to be aware of the hierarchical labeling scheme corresponding to the different levels of biological pathways in which proteins participate. For example, Table 5.4 shows the labels for two proteins (ORFs). The resulting dataset consists of 6357 labeled instances; each instance includes a 60-element phylogenetic profile vector, along with a set of labels specifying the functions of

proteins. The overall process resulting in the dataset to be fed into the UST tree and subsequent processing is shown in Figure 4.1.

Table 5.1. 12 archaeabacterial genomes.

1. [A] Aeropyrum pernix K1
2. [A] Archaeoglobus fulgidus
3. [A] Halobacterium sp
4. [A] Methanobacterium thermoautotrophicum
5. [A] Methanococcus jannaschii
6. [A] Pyrobaculum aerophilum
7. [A] Pyrococcus abyssi
8. [A] Pyrococcus horikoshii
9. [A] Sulfolobus solfataricus
10. [A] Sulfolobus tokodaii
11. [A] Thermoplasma acidophilum
12. [A] Thermoplasma volcanium


Table 5.2. 45 eubacterial genomes.

13. [B] Agrobacterium tumefaciens
14. [B] Aquifex aeolicus
15. [B] Bacillus halodurans C-125
16. [B] Bacillus subtilis
17. [B] Brucella melitensis
18. [B] Buchnera sp.
19. [B] Campylobacter jejuni
20. [B] Caulobacter crescentus
21. [B] Chlamydophila pneumoniae AR39
22. [B] Chlamydia trachomatis
23. [B] Chlamydia muridarum
24. [B] Clostridium acetobutylicum
25. [B] Clostridium perfringens
26. [B] Deinococcus radiodurans R1
27. [B] Escherichia coli K12
28. [B] Helicobacter pylori 26695
29. [B] Lactococcus lactis subsp. lactis
30. [B] Listeria monocytogenes EGD-e
31. [B] Listeria innocua
32. [B] Mesorhizobium loti
33. [B] Mycobacterium tuberculosis CDC1551
34. [B] Mycobacterium leprae
35. [B] Mycoplasma genitalium
36. [B] Mycoplasma pneumoniae
37. [B] Mycoplasma pulmonis
38. [B] Neisseria meningitidis MC58
39. [B] Nostoc sp. PCC 7120
40. [B] Pasteurella multocida
41. [B] Pseudomonas aeruginosa
42. [B] Rickettsia conorii
43. [B] Rickettsia prowazekii
44. [B] Ralstonia solanacearum
45. [B] Salmonella typhi
46. [B] Sinorhizobium meliloti
47. [B] Staphylococcus aureus Mu50
48. [B] Streptococcus pneumoniae R6
49. [B] Streptococcus pyogenes
50. [B] Synechocystis PCC6803
51. [B] Treponema pallidum
52. [B] Thermotoga maritima
53. [B] Ureaplasma urealyticum
54. [B] Vibrio cholerae
55. [B] Xylella fastidiosa
56. [B] Yersinia pestis
57. [B] Borrelia burgdorferi


Table 5.3. 3 eukaryotic genomes.

58. [W] Caenorhabditis elegans
59. [I] Drosophila
60. [P] Arabidopsis

Table 5.4. S. cerevisiae multi-functional proteins (ORFs YBR243C and YMR319C).

S. cerevisiae ORF name | Functional categories | Functional description
YBR243C | 01.04.01 | phosphate utilization
YBR243C | 01.05.01 | C-compound and carbohydrate utilization
YBR243C | 03.03.01 | mitotic cell cycle and cell cycle control
YBR243C | 06.07 | protein modification
YBR243C | 40.07 | endoplasmic reticulum
YMR319C | 13.01.01.01 | homeostasis of metal ions (Na, K, Ca etc.)
YMR319C | 67.04.01.01 | heavy metal ion transporters (Cu, Fe, etc.)
YMR319C | 08.19 | cellular import
YMR319C | 40.02 | plasma membrane

______
† HTTP://MIPS.GSF.DE/PROJ/YEAST/CYGD/.
‡ HTTP://MIPS.GSF.DE/PROJ/YEAST/CYGD/DB/INDEX.HTML



6. RESULTS OF UST ON BIOMEDICAL ASPECTS

The process of evolution has resulted in the production of DNA sequences that encode proteins with specific functions. With USTs built as relational trees, it appears possible to make many significant observations, some of which may lead to a deeper understanding of protein function and evolution. Examples are given below.

6.1 Identifying functionally related proteins

Functionally related proteins tend to be grouped together in the UST. Tables 6.1 - 6.3 show UST nodes, each of which contains only functionally related proteins.

Table 6.1. Functionally related ADP/ATP carrier genes classified by UST.

UST Gene ID | Gene Name | Description | Common Labels
#347 | YBR085W | ADP/ATP carrier protein (MCF), ATP/ADP exchange, ATP/ADP antiporter | 40.16 8.04 67.16 01.03.19
#4565 | YMR056C | ADP/ATP carrier protein (MCF), ATP/ADP exchange, ATP/ADP antiporter | 40.16 8.04 67.16 01.03.19
#168 | YBL030C | ADP/ATP carrier protein (MCF), ATP/ADP exchange, ATP/ADP antiporter | 40.16 8.04 67.16 01.03.19


Table 6.2 Functionally related mannosyltransferase O-linked glycosylation proteins classified by UST.

UST Gene ID | Gene Name | Description | Common Labels
#5778 | YOR321W | Mannosyltransferase, O-linked glycosylation | 40.07 6.07 01.05.01
#2474 | YGR199W | Putative mannosyltransferase, O-linked glycosylation, dolichyl-phosphate-mannose--pr | 6.07 01.05.01
#843 | YDL095W | Mannosyltransferase, O-linked glycosylation, dolichyl-phosphate-mannose--pr | 40.07 6.07 01.05.01
#49 | YBL023C | Mannosyltransferase, O-linked glycosylation, dolichyl-phosphate-mannose--pr | 40.07 6.07 01.05.01


Table 6.3 Functionally related DNA replication proteins classified by UST.

UST Gene ID | Name | Description | Common Labels
#2200 | YGL201C | Involved in replication, DNA replication initiation, chromatin binding | 40.1 03.01.03
#1853 | YER179W | meiosis-specific protein, meiosis, DNA binding | 40.1 03.03.01 03.01.05 03.03.02
#1609 | YEL032W | Replication initiation protein, DNA replication | 40.1 40.03 03.01.03
#160 | YBL023C | DNA replication initiation, chromatin binding | 40.1 40.03 03.01.03

6.2 Identifying protein (gene) functional complex without sequence homology

USTs have the capability to group together proteins without sequence homology that participate in different metabolic pathways but belong to the same functional group. An example is given in Table 6.4: a UST node contains 11 proteins that share the common functional class label "40.16", which means that they are mitochondrial proteins. Proteins "Q0085", "Q0250", "Q0275", "YEL024W", "YER141W" and "YOR065W" are all involved in the mitochondrial oxidative phosphorylation pathway, while proteins "Q0065", "Q0110", "Q0115" and "Q0120" are involved in another, DNA processing, pathway. These two pathways are not directly related, but both exist in mitochondria. So it is reasonable to conclude that

these two sets of proteins may have co-evolved.

Table 6.4 A UST node showing a protein complex involved in different pathways in the mitochondria.

UST Gene ID | Gene Description | Common functional class | Other multi-functional classes
Q0065 | DNA endonuclease | 40.16 | 04.05.05.01
Q0085 | F1F0-ATPase complex, FO A subunit | 40.16 | 13.01.01.03; 02.13; 67.50.22; 08.04
Q0110 | mRNA maturase BI2 | 40.16 | 04.05.05.01
Q0115 | mRNA maturase BI3 | 40.16 | 04.05.05.01
Q0120 | mRNA maturase BI4 | 40.16 | 04.05.05.01
Q0250 | cytochrome-c oxidase subunit II | 40.16 | 02.13
Q0275 | cytochrome-c oxidase chain III | 40.16 | 02.13
YEL024W | ubiquinol--cytochrome-c reductase iron-sulfur protein precursor | 40.16 | 02.13
YER141W | cytochrome oxidase assembly factor | 40.16 | 02.13; 06.10
YLR163C | mitochondrial processing peptidase | 40.16 | 06.07
YOR065W | cytochrome-c1 | 40.16 | 02.13


6.3 A scenario of evolutionary pathways

USTs can be used to extract the fundamental pattern for the first six classes of proteins (containing 4458 ORFs) based on their protein phylogenetic profiles. All 4458 ORFs were studied using a UST tree with 24 nodes at the top level, as shown in Figure 6.1. It is observed that all 24 nodes show distinct patterns. For example, genes inside nodes 1, 2, 3, 5, 6 and 7 show a low likelihood of existing in bacterial genomes and a high likelihood of existing in the last three eukaryotic organisms. On the other hand, genes inside nodes 13 through 24 show a low likelihood of existing in bacteria and an even lower likelihood of existing in the last three eukaryotic organisms. So the proteins in these nodes are probably unique to yeast, or to fungi. Based on the evolutionary tree, fungi evolved from bacteria but constitute a distinct kingdom compared with plants or animals. Therefore, it is reasonable that fungi have their own set of proteins. These proteins that are unique to fungi were probably obtained from a unique set of bacterial proteins but changed extensively during evolution; that is why they show a low likelihood of existing in bacteria. Interestingly, the plant and animal kingdoms inherited another set of proteins from bacteria, and these proteins also changed extensively during evolution; that is why the yeast-specific set of proteins has an even lower likelihood of existing in the plant and animal kingdoms, since the two sets are derived from different sets of bacterial proteins. The set of proteins in node 4 is particularly interesting; it shows that those proteins have a high likelihood of existing in archaebacteria and in the eukaryotic organisms, but a low likelihood of existing in eubacteria. Could it be that fungi, plants and animals all obtained this set of proteins from archaebacteria rather than from eubacteria? This suggests a puzzle, because eubacteria are much more closely related to eukaryotes than archaebacteria. Furthermore, the 186 ORFs in this node show a variety of biological functions, from basic metabolism to mRNA processing and cell cycle regulation. It is surprising to find that some cell cycle regulation proteins show a high likelihood of existing in archaebacteria, because archaebacteria do not have cell cycle regulation machinery. We speculate that evolution proceeds in cycles, but a cycle does not arrive back at exactly the same point, so it is better viewed as a helix; in this case archaebacteria and eukaryotes may be separated by one

cycle of the helix. More results and more detailed analysis from USTs suggest a division of the genes according to different evolutionary pathways, which may be helpful in better understanding the theory of evolution (Figure 6.2).

Figure 6.1. Yeast protein phylogenetic profile 4x6 UST nodes. The 4458 ORFs from the first 6 classes were classified into 24 clusters. For each cluster, the x-axis is the 60 genomes and the y-axis is the log of the highest E-value obtained from the BLAST search. The blue curve gives the average log E-value and the red curve gives the standard deviation. Note that adjacent clusters show similar curve patterns. Each node contains a number of similar ORFs.


Figure 6.2 A scenario of evolutionary pathways for different sets of genes.

6.4 Predicting functions of unknown proteins

Unknown proteins (genes) can in some cases be assigned functions by known proteins (genes) that belong to the same UST nodes, even when the unknown proteins (genes) have no

sequence homology to the known proteins (genes). For example, a UST node in Table 6.5 contains several drug resistance proteins: "YCL069W", "YDR119W", "YHL040C", "YHL047C" and "YOL158C". These are transporter proteins that integrate into the cell membrane. This node also contains the protein "YCR102C", labeled "99", whose function is unknown. However, it might be similar to the toxD gene, whose protein binds to cell surface transport proteins and increases their efficiency in pumping out the drug. So while neither the toxD protein nor the unknown protein has sequence similarity to the transport proteins, we can conclude that the unknown protein (gene) has a toxD-like function or is the toxD gene. Another interesting protein is "YCL073C", which is known to be involved in drug resistance, but whose precise role in drug resistance remains unknown. Based on the UST results, we speculate that it is either a transporter protein or it binds to a transporter protein. These results are illustrated in Figure 6.3, which shows that transporter genes may need the participation of an unknown toxD-like gene to pump out drugs.

Table 6.5 A UST node showing the transporter related proteins involved in drug resistance.

UST Gene ID | Gene Description | Common functional class in detoxification | Common functional class in drug transporters
YCL069W | Strong similarity to drug resistance protein SGE1 | 11.07 | 67.28
YCL073C | Strong similarity to subtelomeric encoded proteins | 11.07 | 67.28
YCR102C | Similarity to C. carbonum toxD gene | 99 |
YDR119W | Similarity to B. subtilis tetracycline resistance | 11.07 | 67.28
YHL040C | Ferrichrome-type siderophore transporter | 11.07 | 67.28
YHL047C | Siderophore transporter for triacetylfusarinine C | 11.07 | 67.28
YOL158C | A gene of the major facilitator superfamily that encodes a transporter for enterobactin | 11.07 | 67.28


Figure 6.3. Prediction of unknown protein by UST.



7. NEW CLASSIFIER TO HANDLE MULTIPLE-LABELED INSTANCES

7.1 Theoretical properties of the nearest neighbor classifier

The K-nearest neighbor decision rule determines the classification of an unclassified sample according to the known classifications of its k nearest samples. Patrick and Fischer (1970) called their generalized K-nearest neighbor rule, which utilizes local density estimation, K-NN3.

They called the traditional decision rule studied by Cover and Hart (1967) K-NN2, while they referred to the original introduction of this type of classifier by Fix and Hodges (1951,

1952) as K-NN1. Despite this arsenal of K-nearest neighbor classifiers, the MCT has a number of original features and can be used as both a two-class and a complex multiple-class classifier. We use the theory of Cover and Hart to estimate the performance of K-NN classifiers and to show why they can form the statistical foundation of the MCT. The nearest neighbor rule is independent of the underlying joint distribution of the sample points and their classifications. Only in the case of complete statistical knowledge of the underlying joint distribution of the observed samples and true classes does the probability of error R of the nearest neighbor classifier approach the Bayes probability of error R*, which is the minimum probability of error over all decision rules. For a sufficient number of training instances, even in the worst situation the misclassification rate of the nearest neighbor classifier will not be worse than twice the best possible Bayes probability of error. From this point of view, it is said that at least half the classification information in an infinite sample set can be obtained from nearest neighbor classifiers. Biological data are complex by nature, and it is unlikely that a complete probability distribution of the samples is known. Therefore, Bayesian classifiers are hard to construct. For such complex training data, it is highly useful to build a relational tree based on nearest neighbor rules. What kind of data is presented does matter in the classification problem. There are two extreme situations: either the data contain complete statistical information about the underlying joint distribution of the observations and the true class labels, or there is no information about the underlying distribution and only the samples themselves are presented. In the first extreme, which

72 are often seen in engineering data, a standard Bayes analysis gives an optimal decision procedure with a corresponding Bayesian (minimum) probability of error of classification R*. In the other extreme, which often happens in the biological data, a decision to classify an unknown sample into a category is allowed by a collection of a number of correctly classified samples called training data, and the decision procedure has not been solved clearly so far by any machine learning method. There exists no optimal classification procedure with respect to all underlying statistics and is in the domain of nonparametric statistics. Our data may be even worse than this type of so called worst data because instances can carry more than ten labels, which has not been studied in other machine learning algorithms. The theory of Cover and Hart (1967) proved that probability of error for a nearset- Neighbor classifier is bounded below by the Bayes probability of error R* and above by R*(2 - MR*/(M - 1)), where M is the number of classes. Therefore, any possible best decision rule based on the infinite data set can only reduce the probability of error by at most one half of nearset Neighbor rule provided the probability distribution is complete known. In this sense, given a huge amount of unknown proteins (genes) inside a genome, at least half of the available information in an infinite collection of classified samples is contained in the nearset Neighbor rule. Furthermore, the theory of Cover and Hart also proved, if the underlying probability distribution is completely unknown, the nearest neighbor decision rule is as good as best possible Bayes decision rule. Due to unknown probability distributions of protein functional classes in the feature space plus the fact, that the instance may carry more than ten labels, it is reasonable that the nearest neighbor rule may be considered as the best decision rule based on such type of data. This served as the statistical basis of constructing our maximum contrast tree (MCT).
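For reference, the Cover-Hart bound discussed above can be written compactly as follows; the numeric instance in the comments (M = 2, R* = 0.05) is only an illustrative example and does not correspond to a measurement on our data.

```latex
% Asymptotic bound on the nearest neighbor error rate R (Cover & Hart, 1967)
R^{*} \;\le\; R \;\le\; R^{*}\left(2 - \frac{M R^{*}}{M-1}\right)
% Example: for M = 2 classes and a Bayes error R^{*} = 0.05,
% R \le 0.05 \times (2 - 2 \cdot 0.05 / 1) = 0.05 \times 1.9 = 0.095,
% i.e. the nearest neighbor rule is asymptotically at most about twice as bad
% as the optimal (Bayes) rule.
```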

7.2 Probability that two phylogenetic profiles match due to chance

The assumption underlying the UST is that proteins (genes) with similar phylogenetic profiles are functionally related. A question can be asked: what is the probability that two proteins (genes) that are not functionally related will, by sheer chance, have similar phylogenetic profiles? We can approximate this probability by finding the probability that, for a given phylogenetic profile A, the distance to a phylogenetic profile B drawn at random from the set of all phylogenetic profiles is less than or equal to a number d. Let A be a given phylogenetic profile. Assuming that the phylogenetic profiles are of length n and are binary (i.e. consist of 0's and 1's), the probability that a randomly drawn phylogenetic profile B will differ from A in exactly m bit positions is

Pr{ d_h(A,B) = m } = C_n^m (1/2)^n

where d_h is the Hamming distance (the number of bit positions in which the two bit vectors differ) and C_n^m is the number of ways of choosing m objects out of n. Assume n equals 60, so each protein (gene) is stored as a 60-dimensional bit vector of 0's and 1's. The first element of a random protein phylogenetic profile is either 0 or 1, so the probability that it matches the first element of the test instance is exactly 1/2 (the profile is binary: 1 represents existence and 0 represents non-existence). The same holds for the second element, and so on up to the 60th element. Therefore the probability that two instances have exactly the same phylogenetic profile is (1/2) raised to the 60th power:

P_exact = (1/2)^{60} = 8.67 \times 10^{-19}

The age of the visible universe does not exceed 10^{18} seconds. This means that the probability that two proteins (genes) that are not functionally related have identical phylogenetic profiles due to chance alone is smaller than the ratio of one second to the age of the visible universe (here "visible" means appearing in any part of the electromagnetic spectrum). We now calculate the probability that two phylogenetic profiles differ by up to one bit. We first calculate the probability that they differ by exactly one bit, i.e. that the two profiles agree over all species but one:

P_{1-bit} = C_{60}^{1} (1/2)^{60} = 60 \times 8.67 \times 10^{-19} = 5.20 \times 10^{-17}

Therefore the probability that two phylogenetic profiles differ by up to one bit is the sum of the probability that they differ by exactly one bit and the probability that they are identical:

P_{up to 1 bit} = P_{1-bit} + P_exact = (60 + 1) \times P_exact = 61 \times 8.67 \times 10^{-19} = 5.29 \times 10^{-17}

This means that the probability that two unrelated proteins (genes) differ by up to one bit in their phylogenetic profiles due to chance alone is smaller than the ratio of one minute to the whole lifetime of the universe. The probability of two proteins (genes) differing in exactly two bits is

P_{2-bits} = C_{60}^{2} (1/2)^{60} = (1/2)^{60} \times (60 \times 59)/2

and the probability of two proteins (genes) differing by up to two bits is

P_{up to 2 bits} = P_{2-bits} + P_{1-bit} + P_exact = ((60 \times 59)/2 + 60 + 1) \times P_exact = 1831 \times 8.67 \times 10^{-19} = 1.588 \times 10^{-15}

This is smaller than the ratio of one hour to the whole age of the universe. The probability of two proteins (genes) differing in exactly three bits is

P_{3-bits} = C_{60}^{3} (1/2)^{60} = (1/2)^{60} \times (60 \times 59 \times 58)/(3 \times 2) = (58/3) \times C_{60}^{2} \times P_exact = 2.97 \times 10^{-14}

and the probability of up to a 3-bit difference is

P_{up to 3 bits} = P_{3-bits} + P_{up to 2 bits} = (29.7 + 1.588) \times 10^{-15} = 3.127 \times 10^{-14}

This is smaller than the ratio of nine hours to the whole age of the universe. We can see by now that these probabilities are very small and can be regarded as infinitesimal.


Generalizing these results, the probability of two proteins (genes) differing in exactly k bits in their phylogenetic profiles is given by

P_{k-bit} = C_n^k (1/2)^n

where n is the dimension of the sample vector. Therefore the probability that two genes differ by up to k bits is

P_{up to k bits} = \sum_{i=0}^{k} C_n^i (1/2)^n

If k = n, meaning that the two profiles are allowed to differ in every bit of the sample dimension, this probability should intuitively be exactly one. Let us prove it.

Proof: The binomial theorem gives

(a + b)^n = \sum_{k=0}^{n} C_n^k a^k b^{n-k}

If a = b = 1, the left side is 2^n and the right side is \sum_{k=0}^{n} C_n^k. Therefore

\sum_{k=0}^{n} C_n^k = 2^n

and so

P_{up to n bits} = \sum_{i=0}^{n} C_n^i (1/2)^n = 2^n \times (1/2)^n = 1

Now let us estimate the probability that two proteins (genes) differ in up to half of their bits. Differences of around half the bits are the most likely outcomes: the probability distribution over the number of differing bits is bell-shaped (Figure 7.1) with its peak at the middle point, and the probability that two profiles differ in up to roughly half of all bits is about 1/2 (Table 7.1). Let us give a proof for the case where n is odd.

Proof: By symmetry, for odd n,

\sum_{k=0}^{n} C_n^k (1/2)^n = 2 \times \sum_{k=0}^{(n-1)/2} C_n^k (1/2)^n

We just proved that the left side equals 1; therefore

\sum_{k=0}^{(n-1)/2} C_n^k (1/2)^n = 1/2

The number of differing bits actually follows a binomial distribution with expected value n \times p = n \times (1/2) = n/2; in our data the expected value is 30, since n = 60. The variance of the binomial distribution is n \times p \times q = n \times (1/2) \times (1/2); in our data it is 15. This distribution is plotted in Figure 7.1. The probability of a difference between 0 and 15 bits, or between 45 and 60 bits, is virtually zero; almost 100% of the probability is concentrated in the 15 to 45 bit range.
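As a quick sanity check of the probabilities above, the short script below (a minimal sketch in Python, not part of the original software) evaluates P(up to k bits) = sum over i <= k of C(60, i)/2^60 with exact integer arithmetic; it reproduces, for example, 8.67 x 10^-19 for an exact match, about 3.1 x 10^-14 for up to three differing bits, and about 0.45 and 0.55 for up to 29 and 30 bits.

```python
from math import comb

N = 60  # profile length (number of genomes)

def prob_up_to_k_bits(k, n=N):
    """Probability that two random binary profiles of length n
    differ in at most k bit positions (each bit matches with probability 1/2)."""
    return sum(comb(n, i) for i in range(k + 1)) / 2 ** n

if __name__ == "__main__":
    for k in (0, 1, 2, 3, 15, 29, 30, 45, 60):
        print(f"P(up to {k:2d} bits) = {prob_up_to_k_bits(k):.3e}")
```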


Figure 7.1. Probability distribution of the number of differing bits between two random 60-dimensional protein phylogenetic profiles.

The Euclidean distance between two protein phylogenetic profiles is

d(X,Y) = \sqrt{ \sum_{i=1}^{n} (X_i - Y_i)^2 }

If two profiles differ by 1 bit, then d = 1; for a difference of up to 2 bits we use the cumulative value 1 + \sqrt{2} = 2.4142; for up to 3 bits, 1 + \sqrt{2} + \sqrt{3} = 4.146; and for up to 4 bits, 1 + \sqrt{2} + \sqrt{3} + \sqrt{4} = 6.146. The more two instances differ in bits, the larger the distance that separates them. In this way, we can build a look-up reference table (Table 7.1). This table is stored in a computer program, and the cumulative probability distribution function (CDF) is shown in Figure 7.3.


Table 7.1. Distance measurement for up to k bits of difference and the corresponding probability of a random match

Up to k bits | Distance (cumulative sum of sqrt(i), i = 1..k) | Probability from random
0 bits       | 0      | 8.67 x 10^-19
Up to 1      | 1      | 5.29 x 10^-17
Up to 2      | 2.41   | 1.59 x 10^-15
Up to 3      | 4.15   | 3.13 x 10^-14
Up to 4      | 6.146  | 4.54 x 10^-13
Up to 5      | 8.38   | 5.19 x 10^-12
Up to 6      | 10.83  | 4.86 x 10^-11
Up to 7      | 13.48  | 3.84 x 10^-10
Up to 8      | 16.31  | 2.60 x 10^-9
Up to 9      | 19.31  | 1.54 x 10^-8
Up to 10     | 22.47  | 8.08 x 10^-8
Up to 11     | 25.79  | 3.78 x 10^-7
Up to 12     | 29.25  | 1.59 x 10^-6
Up to 13     | 32.85  | 6.07 x 10^-6
Up to 14     | 36.60  | 2.11 x 10^-5
Up to 15     | 40.47  | 6.73 x 10^-5
Up to 16     | 44.47  | 1.97 x 10^-4
...          | ...    | ...
Up to 29     | ...    | 0.4487
Up to 30     | ...    | 0.5513
...          | ...    | ...
Up to 60     | ...    | 1


Figure 7.2. Distance measures by bits


Figure 7.3. CDF of bits distribution

From these results, we established a way of measuring distance in bits (Figure 7.2). Instead of simple binary data (1's and 0's), our data use E-values, which are continuous and have been normalized. By definition,

E = -\ln(1 - P), so P = 1 - e^{-E}

When E goes to infinity, P is 1, and when E is zero, P is zero.


Scaling back from E-values to P-values, we are now able to estimate the equivalent bit difference from a distance. For example, a distance of 5.0 would be equivalent to roughly a 3.2-bit difference. Furthermore, Table 7.1 gives the range of bit differences between two protein phylogenetic profiles for a given distance measurement, together with the corresponding noise level. For example, if two similar proteins (genes) identified by UST have a scaled distance of 0.467, then these two proteins (genes) differ by between 0 and 1 bit, and the probability of a random match without UST lies between 8.67 x 10^-19 and 5.20 x 10^-17. This probability is effectively the background noise of UST. In another example, if two similar proteins (genes) identified by UST have a scaled distance of 1.727, then they differ by between 1 and 2 bits, and the probability of a random match without UST lies between 5.20 x 10^-17 and 1.59 x 10^-15, which is the noise level. Noise becomes "visible" (i.e. > 0.1%) only when the bit difference is in the range of 15 to 45 bits; however, since our purpose is to find the closest proteins (genes) by UST, it is unlikely that the closest proteins (genes) fall into that range. Experimental results showed that most pairs of similar proteins (genes) identified by UST have scaled distances smaller than 2.4, which means they differ by fewer than 2 bits, and the chance that such a pair could be drawn at random without UST is smaller than 1.59 x 10^-15, which is practically impossible. It should be noted that the above statistical analysis assumes randomly distributed infinite samples; the noise levels of actual protein phylogenetic profiles may be higher.
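To make the conversion from a continuous scaled distance to an equivalent bit-difference range concrete, the snippet below is a minimal sketch (the function names and the range-style lookup are our own illustrative choices, not the exact routine stored in the original program). It builds the cumulative distance thresholds of Table 7.1 and reports, for a given scaled distance, the bracketing bit range and the corresponding chance-match bounds.

```python
from math import comb, sqrt

N = 60  # profile length

# Cumulative distance threshold for "up to k bits": sum of sqrt(i) for i = 1..k
thresholds = [0.0]
for i in range(1, N + 1):
    thresholds.append(thresholds[-1] + sqrt(i))

def chance_match(k):
    """P(two random binary profiles differ in at most k bits)."""
    return sum(comb(N, i) for i in range(k + 1)) / 2 ** N

def bits_range(distance):
    """Return (k_low, k_high) such that the scaled distance falls between the
    'up to k_low bits' and 'up to k_high bits' thresholds of Table 7.1."""
    for k in range(1, N + 1):
        if distance <= thresholds[k]:
            return k - 1, k
    return N, N

# Example: a UST pair at scaled distance 1.727 falls between 1 and 2 bits, so its
# chance-match probability lies between P(<= 1 bit) and P(<= 2 bits).
lo, hi = bits_range(1.727)
print(lo, hi, chance_match(lo), chance_match(hi))
```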

7.3 New insight on improving nearest neighbor classifiers

These observations can be used to improve the performance of K-nearest neighbor type algorithms on protein phylogenetic profile data. The K-nearest neighbor classifier assigns a test instance the majority class of its K nearest training instances. However, the fact that a neighboring instance is one of the K nearest neighbors does not mean that it is in any sense near. To improve the performance of the K-nearest neighbor classifier, we may require that neighboring instances agree with the test instance to within a certain number of bit positions.


Furthermore, we can use the framework developed in this chapter to select a threshold on how many bit positions a test instance and a training instance may differ in. For example, suppose we want the probability that two phylogenetic profiles match due to chance alone to be smaller than 0.05. Looking at the table of binomial coefficients, this implies that if the number of bit positions in which the two phylogenetic profiles differ is between 27 and 35 inclusive, then the probability that the two profiles match due to chance alone exceeds our threshold, and therefore we should either reject the test instance or declare that the two profiles do not match. If the number of differing bit positions is smaller than 27, then we say that, to within our tolerance of chance matches, the two profiles suggest that the corresponding proteins are functionally related, while if the number of differing bit positions is larger than 35, we say that, to within our tolerance of chance matches, the two profiles suggest that the corresponding proteins are not functionally related. In conclusion, it is unlikely in practice that UST-based classifiers will place random pairs of proteins (genes) in the same class.

7.3 UST based multiple-labeled instance classifier (MLIC)

Instances carrying multiple labels have not been studied in computational intelligence. In fact, most proteins are multi-functional, and this poses a challenge for machine learning. We introduce a new machine learning classifier that can handle such situations, called the Multiple-Labeled Instance Classifier (MLIC). We found that a good approach for classification is to use the MCT as a tree-based, bottom-up K nearest neighbor classifier, with K chosen equal to 3. We divide all the known proteins (genes) into a training set and a testing set and evaluate MCTs by cross validation; the training set is used to generate the MCT. For each test instance, we determine the nearest protein (gene) at a terminal (leaf) node and move up one or two layers until the upper node contains at least 3 proteins (genes). If there are more than three proteins (genes), the three closest to the test protein (gene) on that node are chosen. Then the 3-nearest neighbor classifier with a majority voting scheme is performed. The output of such a classifier is the set of class labels shared by any two of the three nearest neighbors, where the 3 nearest neighbors are defined by the tree structure. An alternative would be to use the majority of the proteins (genes) at that node and compare them to the test protein (gene) to see whether they match. Testing matches the highest layer of the protein functional class to determine whether two proteins participate in a common biological pathway; such a test is performed by examining whether there is an overlap between the class labels of the test instance and those of the majority of its nearest proteins (genes). This classifier is illustrated in Figure 7.3 and an example is given in Figure 7.4. In 1993, Okan Ersoy (1993) introduced a majority voting method in Ersoy's parallel, self-organizing, hierarchical neural network with competitive learning and safe rejection schemes. Because most proteins are multi-functional, instances carrying multiple labels make classification more complex than ordinary multi-class classification. The approach discussed above makes this problem solvable, since multiple labels are accepted in each instance and cases with uncertain classification can be rejected.
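The following Python sketch illustrates the MLIC voting rule described above (collect the label sets of the three tree-defined neighbors, accept only if at least two of them share a label, and output every label shared by at least two neighbors); the function and variable names are ours, and the routine for walking the MCT to obtain the neighbors is assumed to exist elsewhere.

```python
from collections import Counter
from typing import Iterable, Optional, Set

def mlic_vote(neighbor_labels: Iterable[Set[str]]) -> Optional[Set[str]]:
    """MLIC decision for one test instance.

    neighbor_labels: label sets of the 3 nearest neighbors, taken from the
    parent (or grandparent) MCT node. Returns the set of labels shared by at
    least two neighbors, or None if no label is shared (instance rejected)."""
    counts = Counter(label for labels in neighbor_labels for label in labels)
    shared = {label for label, c in counts.items() if c >= 2}
    return shared or None

def is_correct(predicted: Optional[Set[str]], true_labels: Set[str]) -> bool:
    """A prediction counts as correct if any predicted label is a true label."""
    return predicted is not None and bool(predicted & true_labels)

# Example mirroring Figure 7.4: neighbors carry {5.01}, {5.01, 40.16}, {5.01, 40.16}
print(mlic_vote([{"5.01"}, {"5.01", "40.16"}, {"5.01", "40.16"}]))  # {'5.01', '40.16'}
```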


If 2 of the 3 neighbors nearest to the test gene share at least one common label with the test instance, then this is sufficient evidence to make a classification.

· If yes: accept this instance for testing; if 2 of the 3 nearest neighbors share a common class label, classify the test instance to that class.

· If no: reject the test instance, as there is insufficient evidence that it is functionally related to its neighbors.

Is the true label in the set of labels that the test instance has been classified to?

· If yes: count as correct.

· If no: count as error.

Figure 7.3. Block diagram of the Multiple-Labeled Instance Classifier.


Actual example: For a test gene L038W, the nearest neighbors are:

· Gene R116C: class label 5.01

· Gene L063C: class labels 5.01, 40.16

· Gene R113C: class labels 5.01, 40.16

As at least two of the three nearest neighbors share a common label, accept this gene for testing.

Set of labels assigned to test gene consists of any label shared by two or more of the three nearest neighbors. In this case, we assign the labels 5.01, 40.16 to the test gene

In this example, the set of true labels (i.e. 5.01, 40.16) of the test gene matches the assigned labels exactly.

In this example, the test gene has been classified correctly.

Figure 7.4. An example of MLIC.

7.4 Multiple functionalities of MLIC

MLICs are multi-functional themselves. Our experimental results showed that the reject rate tended to be low, and even if we treat the reject rate as part of the error rate to obtain an upper bound on the error rate, the results from UST were still better than the results of SVMs and decision trees. When we use the MLIC as a two-class classifier, the majority vote at the node always yields either a correct classification or a misclassification, so there is no reject rate. In this way, performance can be compared numerically in detail to the support vector machines and the decision trees using the same dataset. The advantage of MLICs is that they can be used as a two-class classifier, as a complex classifier involving instances that belong to more than one label, or as a regular multi-class classifier. In all cases, the results are better than those of SVMs and decision trees, as reported in the next chapter.

7.5 Using the maximum contrast tree as a classifier

In order to make use of classification rules with the maximum contrast tree, the MCT can be used as a classifier in two different modes, which we refer to as one-level-up and two-level-up. In one-level-up mode, given a test instance, we first search the leaves of the tree to find the leaf node that contains the closest training instance. We then look at the set of training instances in the parent of that leaf node, find the K instances in this set that are nearest to the test instance, and use the majority class label of these K training instances as the classification for the test instance. If the parent node does not contain K training instances, then we look at the parent of that parent node (the grandparent node) and apply the same voting strategy. Two-level-up mode is similar, but skips the parent of the leaf node and looks directly at the grandparent of the leaf node to find the K nearest training instances. In our experiments we used K = 3. The MCT is constructed in the spirit of the nearest neighbor classifier: at each node two training instances with maximum distance are chosen, and nearest neighbor instances are always split into the same node (except at a terminal node, where there is only one instance). The maximum contrast tree imposes a bias in how the feature space is divided; one of the aims of this paper is to discover whether this bias is favorable or unfavorable. The MLIC is a new development of the K-NN type of classifier, but rather than choosing the nearest neighbors purely by a distance measure, the nearest neighbors are defined by a tree structure. One of the goals of this paper is to determine whether the resulting bias imposed on the feature space has any advantages with respect to classification. As such, the MLIC performs a hierarchical decomposition similar to that used in the branch and bound implementation of the K-NN algorithm.
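A compact sketch of the two look-up modes is given below. It assumes each tree node stores its training instances and a parent pointer (hypothetical field names of our own choosing); selecting the K nearest instances within the returned node and voting then proceed as in the MLIC sketch of Section 7.3.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    instances: List[str] = field(default_factory=list)  # IDs of training proteins at this node
    parent: Optional["Node"] = None

def lookup_node(leaf: Node, levels_up: int, k: int = 3) -> Node:
    """One-level-up mode starts at the leaf's parent (levels_up=1); two-level-up
    mode starts directly at the grandparent (levels_up=2). In either case, keep
    climbing until the node holds at least k training instances or the root is hit."""
    node = leaf
    for _ in range(levels_up):
        if node.parent is not None:
            node = node.parent
    while len(node.instances) < k and node.parent is not None:
        node = node.parent
    return node

# Tiny demo: a leaf whose parent holds only 2 instances falls through to the grandparent.
root = Node(instances=["g1", "g2", "g3", "g4", "g5"])
parent = Node(instances=["g1", "g2"], parent=root)
leaf = Node(instances=["g1"], parent=parent)
print(len(lookup_node(leaf, levels_up=1).instances))  # 5
```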


In order to test our algorithms, we have constructed a library of protein phylogenetic profiles for the proteins of the yeast Saccharomyces cerevisiae, covering 285 different types of proteins. Our protein phylogenetic profile library is constructed from a set of 60 (complete) genomic sequences, so the protein phylogenetic profiles are 60-dimensional vectors. Results showed that this library is an improvement over existing protein phylogenetic profile libraries, which are constructed over 24 species (Pavlidis 2002, Vert 2002). We compare our algorithms against existing classification algorithms and, more broadly, against common classification algorithms such as decision trees and support vector machines.


8. COMPARATIVE EXPERIMENTAL RESULTS

8.1 Accommodating to complex data by UST

To illustrate how the MCT works, let us consider a dataset containing four different types of instances, as shown in Figure 8.1. Instances belonging to class one are marked as red circles; instances belonging to class two are represented as magenta dots; instances belonging to class three are denoted as blue crosses; and instances belonging to class four are labeled as green stars. The ideal approach is to classify or cluster these instances into four different partitions, each containing instances of one class exclusively (e.g. red circles). As shown in Figure 8.2, the MCT first generates two maximally separated seeds, which are marked at the counter-diagonal corners, and classifies all instances into two partitions, an upper partition and a lower partition. For illustration purposes, we use a red line as a pseudo partition boundary (not the real partition boundary; the real boundary is shown in Figure 8.7). The MCT puts the instances marked as green stars and blue crosses into the right node of the tree, while the instances marked as red circles and magenta dots are put into the left node, as shown in Figure 8.3. The next step of the MCT is to partition the instances further, as shown in Figures 8.4 and 8.5. The construction of the MCT is illustrated in Figure 8.6; the straight lines (hyperplanes) are only for illustration purposes. From Figure 8.7, which shows the actual partitions of the MCT, we can see that the MCT generates highly irregular partition boundaries, which accommodate the complexities of the data. Figure 8.8 shows a deviated dataset in which no pseudo lines or hyperplanes can be drawn to separate the instances of each class perfectly. However, the MCT can still partition the space to separate the instances of each class perfectly, as shown in Figure 8.9, and the construction of the MCT remains the same, as shown in Figure 8.10. We observe that the MCT accommodates the complexities of the data well. In comparison, Decision Trees only partition the feature space with perpendicular lines or hyperplanes (Figure 3.3); SVMs generate a partitioning hyperplane in the transformed high-dimensional space (Figure 3.2), but if the data are very complicated, even such a hyperplane may not be able to classify the data perfectly in the high-dimensional space. Figures 8.11 and 8.12 illustrate how the MCT can assign functions to unknown proteins. The instances marked with a purple "?" are of two types: test instances and unknown instances. If the instances marked with a purple "?" are test instances and are classified as green "*", then we look at their real labels; if they are really green "*", they are counted as correct classifications, otherwise they are counted as errors. This gives an estimate of the classification rate. If, instead, the instances marked with a purple "?" are unknown proteins (genes), those unknown genes can be predicted to be the same kind of proteins (genes) as the green "*" instances, and the reliability of such a prediction is given by the classification rate estimated by the testing scheme just described.
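The node-splitting step described above can be sketched as follows; this is a simplified illustration under our own naming (it uses plain Euclidean distance and assigns each instance to the nearer of the two maximally separated seeds), not the full MCT implementation with its fast update rules.

```python
import math
from typing import List, Tuple

Vector = List[float]

def dist(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def split_node(instances: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """One MCT split: pick the two instances with maximum mutual distance as
    seeds, then send every instance to the child of its nearer seed."""
    seed_a, seed_b = max(
        ((p, q) for i, p in enumerate(instances) for q in instances[i + 1:]),
        key=lambda pair: dist(pair[0], pair[1]),
    )
    left, right = [], []
    for x in instances:
        (left if dist(x, seed_a) <= dist(x, seed_b) else right).append(x)
    return left, right

def build_mct(instances: List[Vector], min_size: int = 1) -> dict:
    """Recursively split until nodes are small; returns a nested dict tree."""
    if len(instances) <= min_size:
        return {"instances": instances, "children": []}
    left, right = split_node(instances)
    if not left or not right:  # degenerate split (e.g. duplicate points)
        return {"instances": instances, "children": []}
    return {"instances": instances,
            "children": [build_mct(left, min_size), build_mct(right, min_size)]}
```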


Figure 8.1. Data with 4 classes.


Figure 8.2. First step of MCT.



Figure 8.3. How MCT splits a node.


Figure 8.4. Next step of MCT.


Figure 8.5. How MCT separates each class.



Figure 8.6. MCT construction.


Figure 8.7. The real boundaries of the MCT.


Figure 8.8. Data with inseparable classes by hyperplanes.


Figure 8.9. Real boundaries of MCT to separate the data.



Figure 8.10. How the MCT was built to classify each instance.


Figure 8.11. MCT can assign labels to unclassified instances.


Figure 8.12. How MCT can be used to predict unknown proteins (genes).

8.2 Constructing a library of yeast protein phylogenetic profiles

We have constructed a library of yeast protein phylogenetic profiles based on 60 genomes, covering the complete set of 6357 genes (ORFs) inside the yeast genome, and we have compared our results to those of Pavlidis et al (2002) and Vert (2002), both of which are based on 24 species and on the 2465 "learnable" genes (ORFs) selected from the same yeast genome. The results reported by Pavlidis et al. cover 27 types of proteins and Vert's results cover 16 types of proteins, whereas our library was constructed over 200 types of proteins; it is therefore an improvement over the results of others. Our algorithms compare favorably against existing classification algorithms and, more broadly, we compare common classification algorithms over a more comprehensive protein phylogenetic profile library than has been used previously. The results are described in Section 8.3.

8.3 Comparative results

We compared the UST-based MLIC classifier to the Decision Tree algorithms C4.5 and C5 and to Support Vector Machines, both in the complex multi-class case and in the two-class case, as reported below.

8.3.1 Classification of complicated instances

The UST has the flexibility to be used in the complex multi-class case, in which instances can carry multiple class labels; here we used the classification strategy described in Figure 7.4. The Decision Tree algorithms C4.5 and C5 cannot handle instances that belong to more than one class, and therefore we had to select only proteins belonging to a single functional class, which reduced the number of available proteins (genes) from 6357 to 1804. Although the training errors of both C4.5 and C5 were always less than 10%, both algorithms had difficulty learning this dataset, and the testing errors were always greater than 70%. In the case of Support Vector Machines, for which detectors were constructed for each protein functional class, 22 out of 27 protein classes gave testing error rates greater than 72%, as indicated in the publication of Pavlidis et al (2002) (Figure 8.13). The results of the SVMs using our data are shown in Table 8.2, in which 16 of 18 protein classes had error rates ranging between 25% and 50%. This suggests that our data is better suited than the data used by Pavlidis et al for such usage as protein detectors. Because most proteins are multi-functional, we randomly drew multi-functional proteins (genes) for testing. The testing error rates for the UST-based MLIC ranged from 8% to 19% over 100 runs (in each experiment, a different set of training and testing samples was drawn). On the basis of these results (Table 8.1), we conclude that the UST-based MLIC outperforms both Decision Trees and SVMs.


Table 8.1. Performance comparisons of UST-based MLICs with Decision Trees and Support Vector Machines.

Algorithms                  | Error rate on protein testing                                                      | Multiple classes?                                  | Dataset
Hierarchical USTs           | 8%-19%                                                                             | Yes, complex multi-class system                    | 3887 genes (a number of different training / test instances) on 60 genomes
Decision Tree C4.5          | 72-85%                                                                             | No, must be single class for each gene             | 1500 training / 304 test instances (total 1804) on 60 genomes
Decision Tree C5            | 71-84%                                                                             | No, must be single class for each gene             | 1500 training / 304 test instances (total 1804) on 60 genomes
SVMs (Pavlidis et al data)  | 22/27 proteins have error rates greater than 72% (SVM protein detectors)           | Yes, after converting into two classes (yes or no) | 2465 genes, mixed protein phylogenetic profile and DNA microarray data, on 24 genomes
SVMs (our data)             | 16/18 proteins have error rates of 25%-50% (SVM protein detectors, see Table 8.2)  | Yes, after converting into two classes (yes or no) | 3887 genes (a number of different training / test instances) on 60 genomes


Table 8.2. Results of Support Vector Machines using our data

Protein functional categories                          | Class | Correct | Error | Total | Error rate
METABOLISM                                             | 01    | 424     | 228   | 652   | 34.97%
ENERGY                                                 | 02    | 110     | 46    | 156   | 29.49%
CELL CYCLE AND DNA PROCESSING                          | 03    | 230     | 158   | 388   | 46.75%
TRANSCRIPTION                                          | 04    | 267     | 221   | 478   | 46.23%
PROTEIN SYNTHESIS                                      | 05    | 142     | 78    | 220   | 35.45%
PROTEIN FATE                                           | 06    | 224     | 142   | 366   | 42.26%
CELLULAR TRANSPORT AND TRANSPORT MECHANISMS            | 08    | 186     | 132   | 318   | 41.51%
CELLULAR COMMUNICATION / SIGNAL TRANSDUCTION MECH.     | 10    | 23      | 23    | 46    | 50.00%
CELL RESCUE, DEFENSE AND VIRULENCE                     | 11    | 94      | 80    | 174   | 45.98%
REGULATION OF / INTERACTION WITH CELLULAR ENVIRONMENT  | 13    | 64      | 56    | 120   | 46.67%
CELL FATE                                              | 14    | 157     | 103   | 260   | 39.62%
DEVELOPMENT (Systemic)                                 | 25    | 1       | 3     | 4     | 75%
TRANSPOSABLE ELEMENTS, VIRAL AND PLASMID PROTEINS      | 29    | 6       | 4     | 10    | 40%
CONTROL OF CELLULAR ORGANIZATION                       | 30    | 72      | 60    | 132   | 45.5%
SUBCELLULAR LOCALISATION                               | 40    | 698     | 470   | 1168  | 40.24%
PROTEIN ACTIVITY REGULATION                            | 62    | 8       | 0     | 8     | 0%
PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT  | 63    | 3       | 1     | 4     | 25%
TRANSPORT FACILITATION                                 | 67    | 117     | 79    | 196   | 44.89%


Figure 8.13. Results of Support Vector Machine (From the publication of Pavlidis et al.).

Note: These results are from Pavlidis et al (2002). For the fifth class, mitochondrial organization, one of the five most learnable MYGD classes, the average number of false positives is 84.8 and the average number of false negatives is 128.4. The overall error rate is (84.8 + 128.4)/296 = 72.1%.

8.3.2 Comparative results for protein functional detectors

For each protein functional class C, we also defined a two-class problem by defining all instances that include C as one of their labels to be of class 0 and all other instances to be of class 1. After training, the resulting two-class classifier functions as a detector for instances that belong to that class. This procedure is repeated for each protein functional class. We constructed more than 200 protein detectors; examples for the classes used to evaluate the performance of the UST when used as a classifier are given in Tables 8.3-8.14. Figures 8.14-8.17 plot some of these examples in comparison with Support Vector Machines and Decision Trees.
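The snippet below is a minimal sketch (with our own naming, not the original experiment code) of how such a per-class detector dataset can be derived from multi-labeled instances: every protein whose label set contains the target class becomes class 0 and everything else becomes class 1. The example gene "YPL001W" and the profile values are purely hypothetical.

```python
from typing import Dict, List, Set, Tuple

def detector_dataset(profiles: Dict[str, List[float]],
                     labels: Dict[str, Set[str]],
                     target_class: str) -> List[Tuple[List[float], int]]:
    """Build a two-class (one-versus-rest) dataset for one protein functional
    class: class 0 if the protein carries the target label, class 1 otherwise."""
    dataset = []
    for gene, vector in profiles.items():
        y = 0 if target_class in labels.get(gene, set()) else 1
        dataset.append((vector, y))
    return dataset

# Hypothetical example for the drug transporter class "67.28"
profiles = {"YCL069W": [0.9, 0.1, 0.8], "YPL001W": [0.2, 0.7, 0.1]}
labels = {"YCL069W": {"11.07", "67.28"}, "YPL001W": {"01"}}
print(detector_dataset(profiles, labels, "67.28"))
```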


The details of each example of comparative results are reported in Tables 8.3 through 8.14. These sample results are randomly drawn from the library of protein phylogenetic profiles of the yeast S. cerevisiae that we built. It is important to note that the UST can handle not only complex multiple classes but can also be used as a two-class classifier via the UST-based MLIC algorithm. First we specify the kind of protein (gene) on which we want to perform the test; then, for each test protein (gene) of that type, we identify the nearest protein (gene) at a terminal node of the UST. There are two modes of the MLIC, which we named one-level and two-level. In the one-level method, we first examine the node just above the identified nearest terminal node (the parent node) of the test protein (gene). If this 1-level-up node already contains at least 3 proteins (genes), we perform the MLIC on that node; otherwise, we perform the MLIC on the 2-level-up node (the grandparent node). In the two-level method, we perform the MLIC test directly on the node that is two levels above the terminal node (the grandparent) of the nearest protein (gene), without examining the 1-level-up node. For both the one-level-up and two-level-up methods, we performed 3-fold cross validation and repeated the test many times with different randomly drawn data to obtain a reliable average classification rate. In the tables, ">" denotes the number of runs in which the UST is better than the compared classifier, and "<" denotes the number of runs in which the UST is worse than the compared classifier.


Table 8.3. Protein 02.13 Respiration with UST versus others results

Average classification accuracy (UST label: 2.13)
        | C4.5     | C5       | SVM      | UST 1-Level | UST 2-Level
AVERAGE | 0.556992 | 0.567444 | 0.580289 | 0.653173    | 0.652118
STD     | 0.067531 | 0.074686 | 0.068411 | 0.060416    | 0.059277
VAR     | 0.004560 | 0.005578 | 0.004680 | 0.003650    | 0.003514

        | Balanced MCT | Nearest Neighbor classifier
AVERAGE | 0.630293     | 0.570383
STD     | 0.073529     | 0.066998
VAR     | 0.005407     | 0.004489

Comparisons of methods (number of runs)
1 Level > C4.5: 111 | 1 Level < C4.5: 14 | 1 Level > C5: 103 | 1 Level < C5: 20 | 1 Level > SVM: 107 | 1 Level < SVM: 14
2 Level > C4.5: 108 | 2 Level < C4.5: 15 | 2 Level > C5: 98  | 2 Level < C5: 22 | 2 Level > SVM: 103 | 2 Level < SVM: 14


Table 8.4. Protein 01.20.01 Metabolism of primary metabolic sugars derivatives with UST versus others results.

Average classification accuracy (UST label: 1.20.01)
        | C4.5     | C5       | SVM      | UST 1-Level | UST 2-Level
AVERAGE | 0.647228 | 0.651643 | 0.685763 | 0.711537    | 0.717163
STD     | 0.093500 | 0.092932 | 0.089360 | 0.100229    | 0.105281
VAR     | 0.008742 | 0.008636 | 0.007985 | 0.010046    | 0.011084

        | Balanced MCT | Nearest Neighbor classifier
AVERAGE | 0.690213     | 0.490517
STD     | 0.100369     | 0.058706
VAR     | 0.010074     | 0.003446

Comparisons of methods (number of runs)
1 Level > C4.5: 207 | 1 Level < C4.5: 87 | 1 Level > C5: 210 | 1 Level < C5: 84 | 1 Level > SVM: 142 | 1 Level < SVM: 83
2 Level > C4.5: 212 | 2 Level < C4.5: 81 | 2 Level > C5: 210 | 2 Level < C5: 83 | 2 Level > SVM: 158 | 2 Level < SVM: 77


Table 8.5. Protein 13.01 Ionic homeostasis with UST versus others results.

Average classification accuracy (UST label: 131)
        | C4.5     | C5       | SVM      | UST 1-Level | UST 2-Level
AVERAGE | 0.693767 | 0.689213 | 0.706512 | 0.735415    | 0.737424
STD     | 0.081842 | 0.079796 | 0.053468 | 0.066802    | 0.065093
VAR     | 0.006698 | 0.006367 | 0.002859 | 0.004462    | 0.004237

        | Balanced MCT | Nearest Neighbor classifier
AVERAGE | 0.724342     | 0.508886
STD     | 0.075842     | 0.050147
VAR     | 0.005752     | 0.002515

Comparisons of methods (number of runs)
1 Level > C4.5: 170 | 1 Level < C4.5: 73 | 1 Level > C5: 172 | 1 Level < C5: 67 | 1 Level > SVM: 54  | 1 Level < SVM: 55
2 Level > C4.5: 171 | 2 Level < C4.5: 70 | 2 Level > C5: 174 | 2 Level < C5: 63 | 2 Level > SVM: 152 | 2 Level < SVM: 52


Table 8.6. Protein 01.05.07 C-compound, carbohydrate transport with UST versus others results.

Average classification accuracy (UST label: 157)
        | C4.5     | C5       | SVM      | UST 1-Level | UST 2-Level
AVERAGE | 0.720587 | 0.722808 | 0.772268 | 0.773178    | 0.776172
STD     | 0.084490 | 0.082181 | 0.065284 | 0.062435    | 0.065022
VAR     | 0.007139 | 0.006754 | 0.004262 | 0.003898    | 0.004228

        | Balanced MCT | Nearest Neighbor classifier
AVERAGE | 0.754297     | 0.443547
STD     | 0.076992     | 0.054197
VAR     | 0.005928     | 0.002937

Comparisons of methods (number of runs)
1 Level > C4.5: 168 | 1 Level < C4.5: 65 | 1 Level > C5: 163 | 1 Level < C5: 70 | 1 Level > SVM: 97 | 1 Level < SVM: 110
2 Level > C4.5: 171 | 2 Level < C4.5: 63 | 2 Level > C5: 166 | 2 Level < C5: 68 | 2 Level > SVM: 97 | 2 Level < SVM: 105


Table 8.7. Protein 13.01.01.01 homeostasis of metal ions (Na, K, Ca etc.) with UST versus others results.

Average classification accuracy (UST label: 13111)
        | C4.5     | C5       | SVM      | UST 1-Level | UST 2-Level
AVERAGE | 0.537677 | 0.553849 | 0.576719 | 0.611462    | 0.617491
STD     | 0.074537 | 0.082375 | 0.064964 | 0.073410    | 0.070193
VAR     | 0.005556 | 0.006786 | 0.004220 | 0.005389    | 0.004927

        | Balanced MCT | Nearest Neighbor classifier
AVERAGE | 0.582747     | 0.512568
STD     | 0.072639     | 0.054813
VAR     | 0.005276     | 0.003005

Comparisons of methods (number of runs)
1 Level > C4.5: 137 | 1 Level < C4.5: 39 | 1 Level > C5: 127 | 1 Level < C5: 51 | 1 Level > SVM: 118 | 1 Level < SVM: 51
2 Level > C4.5: 141 | 2 Level < C4.5: 35 | 2 Level > C5: 128 | 2 Level < C5: 49 | 2 Level > SVM: 124 | 2 Level < SVM: 48


Table 8.8. Protein 05.01 ribosome biogenesis with UST versus others results.

Average classification accuracy (UST label: 5.1)
        | C4.5     | C5       | SVM      | UST 1-Level | UST 2-Level
AVERAGE | 0.700833 | 0.695237 | 0.657565 | 0.682350    | 0.685155
STD     | 0.033982 | 0.032216 | 0.035005 | 0.036841    | 0.038382
VAR     | 0.001155 | 0.001038 | 0.001225 | 0.001357    | 0.001473

        | Balanced MCT | Nearest Neighbor classifier
AVERAGE | 0.686836     | 0.505293
STD     | 0.034637     | 0.037653
VAR     | 0.001200     | 0.001418

Comparisons of methods (number of runs)
1 Level > C4.5: 16 | 1 Level < C4.5: 27 | 1 Level > C5: 17 | 1 Level < C5: 25 | 1 Level > SVM: 33 | 1 Level < SVM: 9
2 Level > C4.5: 16 | 2 Level < C4.5: 27 | 2 Level > C5: 17 | 2 Level < C5: 25 | 2 Level > SVM: 33 | 2 Level < SVM: 9


Table 8.9. Protein 06.01 protein folding and stabilization with UST versus others results.

Average classification accuracy (UST label: 6.1)
        | C4.5     | C5       | SVM      | UST 1-Level | UST 2-Level
AVERAGE | 0.561222 | 0.568070 | 0.585753 | 0.578137    | 0.576069
STD     | 0.049356 | 0.046013 | 0.034965 | 0.040868    | 0.043029
VAR     | 0.002436 | 0.002117 | 0.001223 | 0.001670    | 0.001851

        | Balanced MCT | Nearest Neighbor classifier
AVERAGE | 0.570079     | 0.482850
STD     | 0.052678     | 0.035760
VAR     | 0.002775     | 0.001279

Comparisons of methods (number of runs)
1 Level > C4.5: 47 | 1 Level < C4.5: 24 | 1 Level > C5: 42 | 1 Level < C5: 29 | 1 Level > SVM: 47 | 1 Level < SVM: 43
2 Level > C4.5: 46 | 2 Level < C4.5: 26 | 2 Level > C5: 37 | 2 Level < C5: 33 | 2 Level > SVM: 31 | 2 Level < SVM: 39


Table 8.10. Protein 08.13 vacuolar transport with UST versus others results.

Average classification accuracy (UST label: 813)
        | C4.5     | C5       | SVM      | UST 1-Level | UST 2-Level
AVERAGE | 0.544490 | 0.555609 | 0.560525 | 0.560824    | 0.561862
STD     | 0.075523 | 0.072518 | 0.061885 | 0.075320    | 0.074384
VAR     | 0.005704 | 0.005259 | 0.003830 | 0.005673    | 0.005533

        | Balanced MCT | Nearest Neighbor classifier
AVERAGE | 0.569497     | 0.537411
STD     | 0.081242     | 0.066285
VAR     | 0.006600     | 0.004394

Comparisons of methods (number of runs)
1 Level > C4.5: 101 | 1 Level < C4.5: 88 | 1 Level > C5: 90  | 1 Level < C5: 96 | 1 Level > SVM: 85 | 1 Level < SVM: 89
2 Level > C4.5: 110 | 2 Level < C4.5: 81 | 2 Level > C5: 105 | 2 Level < C5: 86 | 2 Level > SVM: 90 | 2 Level < SVM: 83


Table 8.11. Protein 01.06.13 lipid and fatty-acid transport with UST versus others results.

Average classification accuracy (UST label: 1.6.13)
        | C4.5     | C5       | SVM      | UST 1-Level | UST 2-Level
AVERAGE | 0.533150 | 0.535718 | 0.536973 | 0.586261    | 0.589970
STD     | 0.126575 | 0.122040 | 0.078373 | 0.125472    | 0.125213
VAR     | 0.016021 | 0.014894 | 0.006142 | 0.015743    | 0.015678

        | Balanced MCT | Nearest Neighbor classifier
AVERAGE | 0.583475     | 0.506442
STD     | 0.051044     | 0.042328
VAR     | 0.002605     | 0.001792

Comparisons of methods (number of runs)
1 Level > C4.5: 314 | 1 Level < C4.5: 176 | 1 Level > C5: 312 | 1 Level < C5: 179 | 1 Level > SVM: 283 | 1 Level < SVM: 144
2 Level > C4.5: 315 | 2 Level < C4.5: 180 | 2 Level > C5: 319 | 2 Level < C5: 179 | 2 Level > SVM: 296 | 2 Level < SVM: 143


Table 8.12. Protein 08.19 cellular import with UST versus others results.

Average classification accuracy (UST label: 8.19)
        | C4.5     | C5       | SVM      | UST 1-Level | UST 2-Level
AVERAGE | 0.630333 | 0.629118 | 0.654826 | 0.668303    | 0.669005
STD     | 0.055713 | 0.053053 | 0.034129 | 0.048694    | 0.045544
VAR     | 0.003104 | 0.002815 | 0.001165 | 0.002371    | 0.002074

        | Balanced MCT | Nearest Neighbor classifier
AVERAGE | 0.649810     | 0.496718
STD     | 0.044717     | 0.053503
VAR     | 0.002000     | 0.002863

Comparisons of methods (number of runs)
1 Level > C4.5: 71 | 1 Level < C4.5: 31 | 1 Level > C5: 76 | 1 Level < C5: 26 | 1 Level > SVM: 59 | 1 Level < SVM: 34
2 Level > C4.5: 74 | 2 Level < C4.5: 27 | 2 Level > C5: 74 | 2 Level < C5: 28 | 2 Level > SVM: 61 | 2 Level < SVM: 31


Table 8.13. Protein 67.28 drug transporters with UST versus others results.

Average classification accuracy (UST label: 67.28)
        | C4.5     | C5       | SVM      | UST 1-Level | UST 2-Level
AVERAGE | 0.780112 | 0.782809 | 0.729279 | 0.826751    | 0.829124
STD     | 0.084681 | 0.086370 | 0.101071 | 0.069167    | 0.066971
VAR     | 0.007171 | 0.007460 | 0.010215 | 0.004784    | 0.004485

        | Balanced MCT | Nearest Neighbor classifier
AVERAGE | 0.795330     | 0.485911
STD     | 0.081643     | 0.057323
VAR     | 0.006666     | 0.003286

Comparisons of methods (number of runs)
1 Level > C4.5: 221 | 1 Level < C4.5: 109 | 1 Level > C5: 216 | 1 Level < C5: 114 | 1 Level > SVM: 250 | 1 Level < SVM: 33
2 Level > C4.5: 226 | 2 Level < C4.5: 104 | 2 Level > C5: 224 | 2 Level < C5: 106 | 2 Level > SVM: 252 | 2 Level < SVM: 31


Table 8.14. Protein 67.04.07 anion transporters (Cl, SO4, PO4, etc.) with UST versus others results.

Average classification accuracy (UST label: 67.4.7)
        | C4.5     | C5       | SVM      | UST 1-Level | UST 2-Level
AVERAGE | 0.591504 | 0.588087 | 0.598894 | 0.629478    | 0.637114
STD     | 0.134662 | 0.126594 | 0.108112 | 0.131731    | 0.130738
VAR     | 0.018134 | 0.016026 | 0.011688 | 0.017353    | 0.017092

        | Balanced MCT | Nearest Neighbor classifier
AVERAGE | 0.575289     | 0.513994
STD     | 0.128958     | 0.086184
VAR     | 0.016630     | 0.007428

Comparisons of methods (number of runs)
1 Level > C4.5: 290 | 1 Level < C4.5: 208 | 1 Level > C5: 298 | 1 Level < C5: 200 | 1 Level > SVM: 261 | 1 Level < SVM: 164
2 Level > C4.5: 302 | 2 Level < C4.5: 195 | 2 Level > C5: 312 | 2 Level < C5: 187 | 2 Level > SVM: 273 | 2 Level < SVM: 157


Figure 8.14. Comparative results on drug transport protein.


Figure 8.15. Comparative results on respiration and anion transport proteins.


Figure 8.16. Comparative results on ionic homeostasis and metabolic sugars proteins.


Figure 8.17. Comparative results on proteins of homeostasis of metal ions, cellular import and lipid transport


8.4. Analysis of results: Why UST outperforms Decision Trees and SVMs

From the tables of results for the various classes, it is evident that Decision Trees do not perform well on this dataset. A possible explanation is that Decision Trees only split on one feature at a time, that is, on one bit position in the phylogenetic profiles. However, the fact that two phylogenetic profiles agree or disagree in one bit position says almost nothing about the overall similarity of one phylogenetic profile to another. By contrast, an MLIC takes into account the overall similarity of one whole phylogenetic profile to another. This may explain why MLIC methods appear to outperform Decision Trees on this dataset. The philosophy of the Support Vector Machine is to transform the original (low-dimensional) data into a high-dimensional space, in the hope that data that are inseparable in the original space may become separable in the high-dimensional space. A Support Vector Machine relies on a hyperplane in the transformed high-dimensional space to separate the data, and such a hyperplane cannot accommodate the complexity of the data if the real boundary separating the data does not resemble a hyperplane even in the transformed space. Therefore, Support Vector Machines may still yield high misclassification rates.


9. CONCLUSIONS AND FUTURE RESEARCH

9.1 An insight into biology

In 1973, the great geneticist Theodosius Dobzhansky stated that "nothing in biology makes sense except in the light of evolution". But evolution cannot be visualized by human eyes; it has happened once and leaves only clues as to the actual events, and those clues are hidden and encoded inside genomes. It is the evolution of protein functions that is responsible for the development of all forms of life. The discovery of the Overhauser effects has made it possible to determine protein structures largely dynamically (i.e. protein folding) by experimental methods; however, the functions of most proteins are still either not completely known or completely unknown. In response to the vast amount of genomic data, the science of bioinformatics, using data mining techniques, is now burgeoning. Our ultimate goal is to utilize computational intelligence to understand evolution at the genomic level, which is the key to the mystery of life on earth. It is expected that computational intelligence techniques, such as USTs and Ersoy's rare-event rule extraction by neural networks and decision trees in the human genome project, will be developed further to learn more valuable information from the human genome and other genomes efficiently.

9.2 Application of UST using DNA microarray data

Although we do not support mixing DNA microarray data with protein phylogenetic profile data, due to the lack of biological plausibility, the UST algorithm we developed can also be applied to DNA microarray data. In the design of our software, we paid special attention to making it applicable to DNA microarray data as well. We will further apply USTs to microarray data and combine the results with those from protein phylogenetic profiles. We believe this will improve the accuracy of protein (gene) functional predictions. Combining the results from both kinds of data (rather than combining the data themselves) will also enable us to interpret and analyze protein functional relationships on a larger scale more precisely and informatively. It is important to realize that the generation of both kinds of data, the algorithms we designed, and the software we developed are the key tools for fulfilling our ultimate goal, which is the plausible interpretation and practical application of combined DNA microarray and protein phylogenetic profile information to living systems.

9.3 Research objectives

Bioinformatics is a newly developed scientific discipline in which biology, biophysics, biochemistry, computer science, electrical engineering, biomedical engineering, statistics, information technology, and other fields merge to form a single discipline that deals with all biomedically related information in a systematic manner, from a higher, more global viewpoint than any of the individual disciplines that contribute the information. Our research motivation includes two types of goals. Our ultimate research goal is to build up the theory of evolution at the genomic and molecular levels. We believe this bird's-eye view of biology may provide a guideline for much biomedical research. In order to achieve this ultimate goal, we are interested in several related goals, including

· Promoting DNA microarray technologies

· Protein function determination

· Gene finding

· Gene design and synthesis

· DNA and protein sequencing and alignment

· Protein modeling

· Recombinant protein production

· Biophysical characterization

· Computational molecular modeling and simulation

· Intelligent software development

· Data mining and computational intelligence


9.4 Conclusions

The complexity of biological systems will probably bring great challenges to statistical learning methods and computational intelligence in the near future. In the light of such difficulties, it can be expected that USTs and new computational intelligence techniques have the potential to provide opportunities to discover biological functions in the global study of genomic information, in place of detailed investigations of each individual protein's primary structure and spatial configuration. The high-dimensional feature space, in relation to the number of available training instances, as well as the fact that instances can belong to more than one class (a consequence of the multi-functional nature of proteins), makes classification a challenging problem for traditional supervised learning methods. The UST algorithms introduced in this paper perform a hierarchical decomposition of the feature space into more localized feature spaces, for which the density of training instances in relation to the size of the localized feature space may be higher. The advantage of this approach over conventional unsupervised methods such as neural network SOMs and K-means is that with SOMs and K-means the number of SOM neurons or K-means centers must be specified in advance, whereas with the UST hierarchical approach the number of tree nodes can increase adaptively, matching the complexity of the problem. As there are fewer tree nodes at each level in the UST hierarchical scheme than would be the case with a flat clustering model, the solution achieved at each stage of the MCT is unique, resulting in a more robust algorithm. Furthermore, the hierarchical structure improves the scaling properties of the algorithm by making it more computationally efficient. Based on our experiments, UST algorithms appear to perform considerably better than the decision tree algorithms C4.5 and C5 and than support vector machines, and can provide a viable alternative to supervised or unsupervised methods alone. In addition, UST and MLIC classifiers are capable of handling protein functional classes with a small number of proteins (rare events), as well as multifunctional proteins. The ability of the USTs and MLICs to handle such cases means that a larger dataset can be used, which may provide deeper insight into protein functional relationships at the genomic level, and thus may lead to a better understanding of evolution at the molecular and genomic levels.


BIBLIOGRAPHY

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402.

Bellot et al (2000) “Clustering by means of Unsupervised Decision Trees” CitiSeer

Brown, M. P. S., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T. S., Ares M. J., and Haussler, D. (2000), Knowledge-based analysis of microarray gene expression data by using support vector machines, PNAS 97, p.262-267.

Chipman, H., George, E., & McCulloch, R. (1997). Bayesian CART. Tech. rep., Department of Statistics, University of Chicago

Codrington, Craig W. (2001) “Boosting with Confidence Information”, ICML, June 29 2001

Codrington, C. W., & Brodley, C.E. (1999). "On the qualitative behavior of impurity-based splitting rules I: The minima-free property". Purdue University Tech. Report 1997

Codrington C. W. (1996) “Decision Tree Matlab Tool Box” Purdue University.

Cook, John (2002) "Overhauser Enhanced MRI for Tumor Oximetry" http://www.dartmouth.edu/~eprctr/workshop2001/pdf/abstracts/4.pdf

Cover, T. M. and Hart, P. E. (1967) "Nearest Neighbor Pattern Classification" IEEE Trans. Information Theory, Vol. 13, No. 1, pp. 21-27, 1967.

DeRisi, Joseph, Wang, D. (2003) “Report of DNA microarray results on SARS” University of California San Francisco School of Medicine

Eisen M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998), Cluster analysis and display of genome-wide expression patterns, PNAS 95, p.14863-14868.

Ersoy,O.K, Choe W, Bina M (2000) “Neural network schemes for detecting rare events in human genomic DNA” Bioinformatics Vol. 16 no 12 Pages 1062-1072

Ersoy, O.K., Deng, S.W. (1995). “Parallel, Self-Organizing Neural Networks with Continuous Inputs and Outputs”, IEEE Transactions on Neural Networks Volume 6 Number 4, pp 1037- 1044.


Ersoy, Okan and Valafar, F. ''PNS Modules for the Synthesis of Parallel, Self-Organizing, Hierarchical Neural Networks,'' Journal of Circuits, Systems and Signal Processing, Vol. 15, No. 1, 1996.

Ersoy, O. K. and Choe W (2000) “Rare Event Detection in Genomic Sequences by Neural Network and Sample Stratification” Nonlinear Biomedical Processing (M. Akay editor), IEEE Press.

Ersoy, O.K. (1999) “Neural Network” Purdue University “ECE629 Textbook”

Ersoy, O. K. et al (1990) “Parallel, self-organizing, hierarchical neural network” IEEE Trans. Neural Networks. Vol. 1 Page 167-178, 1990.

Ersoy, Okan K (2001) “ Multidimensional Fourier Related Transforms and Their Applications” Prentice Hall

Ersoy, O. K. et al (1993) "Parallel, self-organizing, hierarchical neural network II" IEEE Trans. Industrial Electron. Vol. 40, pp. 218-227, 1993.

Ersoy, O.K.(2001) “Probability and Random Process” Purdue University “ECE302 textbook”

Ersoy, O. K. et al (1993) “Parallel, self-organizing, hierarchical neural network with competitive learning and safe rejection schemes” IEEE Trans. Circuit System. Vol. 40 Page 556-567, 1993.

Ersoy, O. K. (1985) “Real discrete fourier transform” IEEE Trans. Acoustics, Speech, Signal Processing ASSP-33:880-882

Ersoy, O. K. et al (1993) "Parallel, self-organizing, hierarchical neural network with forward-backward training" J. Circuits, Systems, Signal Processing, Vol. 12, pp. 223-236, 1993.

Ersoy, O. K et al. (2000) “Rare Event Detection (with combined classifiers) in Genomic Sequences by Neural Networks and Sample Stratification” Purdue University School of Electrical and Computer Engineering Technical Report.

Fix, E and Hodges, J. L. Jr. (1951) “Discriminatory analysis, non-parametric discrimination,” USAF School of Aviation Medicine, Randolph Field, Tex., Project 21-49-004, Rept. 4, Contract AF41(128)-31, February 1951.

Fix, E and Hodges, J. L. Jr. (1952) "Discriminatory analysis, small sample performance," USAF School of Aviation Medicine, Randolph Field, Tex., Project 21-49-004, Rept. 11, under Contract AF41(128)-31, August 1952.


Haykin, Simon (1999) “Neural Networks A comprehensive Foundation” Prentice Hall.

Joachins, T (2000) SVM-light version 5.0 http://svmlight.joachims.org/

Joensuu, Raimo (2000) “Design and Evaluation of Overhauser Enhanced MRI Visible Markers” Dissertation for the degree of Doctor of Technology at Helsinki University of Technology (Espoo, Finland)

Krishna, M. C. et al (2002) "Overhauser enhanced magnetic resonance imaging for tumor oximetry" Proc Natl Acad Sci U S A. 99: 2216-21, 2002.

Leondes, C. T. (1998) “Algorithm and Architectures ” Page 364- 401. Academic Press 1998 (ISBN: 012443861X)

Liu, Bin (2000) “Clustering Through Decision Tree Construction” http://www.ahpcrc.org/conferences/MSD/second/ppt/bing/cikm2000.pdf

Lo, Z.-P.,M. Fujita, and B. Bavarian (1999) “Analysis of neighborhood interaction in Kohonen neural networks” 6th International Parallel Processing Symposium Proceedings, pp. 247-249.

Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T. O., and Eisenberg, D. (1999), A combined algorithm for genome-wide prediction of protein function, Nature 402, pp.83-86.

Neuhaus, David and Michael P. Williamson (2002) “The Nuclear Overhauser Effect in Structural and Conformational Analysis,” John Wiley & Sons.

Patrick, E. A. and Fischer, E. P. III (1970) “A generalized k-Nearest Neighbor Rule,” Information and Control Vol. 16, pp. 128-152, 1970.

Pavlidis, Paul, Weston, Jason, Cai, Jinsong and Grundy, William Noble (2002) “Learning Gene Functional Classifications from Multiple Data Types.” Journal of Computational Biology, Vol. 9, pp. 401-411.

Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D., and Yeates, T. O. (1999), Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles, PNAS 96, p.4285-4288.

Quinlan, J. (2000) Decision Tree C5, C4.5 "http://www.rulequest.com/download.html"

Ritter, H., T. Martinetz, and K. Schulten (1992) "Neural Computation and Self-Organizing Maps: An Introduction," Reading, MA: Addison-Wesley.


Tamayo, P, Slonim, D, Mesirov, J, Zhu, Q, Kitareewan, S, Dmitrovsky, E, Lander, E, and Golub, T(1999), Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation, PNAS 96, pp.2907-2912.

Yang, Jack, Yang, Mary and Ersoy,O.K. (2002) “Gene finding and protein functional determination by protein phylogenetic profile and computational intelligence.” Intelligent Engineering Systems, Vol. 12, pp. 733-740, ASME Press.

Yang, Jack et al. (1999-2003) “K-Means – LVQ Matlab Toolbox and C++ software on SOMs and K-Means” Purdue University.

Vert, J. (2002) "A tree kernel to analyze phylogenetic profiles," Bioinformatics, Vol. 18, Suppl. 1, pp. S276-S28.