Combined Supervised and Unsupervised Learning in Genomic Data Mining Jack Y

Total Page:16

File Type:pdf, Size:1020Kb

Combined Supervised and Unsupervised Learning in Genomic Data Mining Jack Y Purdue University Purdue e-Pubs ECE Technical Reports Electrical and Computer Engineering 1-1-2003 Combined Supervised and Unsupervised Learning in Genomic Data Mining Jack Y. Yang Okan K. Ersoy Follow this and additional works at: http://docs.lib.purdue.edu/ecetr Yang, Jack Y. and Ersoy, Okan K. , "Combined Supervised and Unsupervised Learning in Genomic Data Mining" (2003). ECE Technical Reports. Paper 152. http://docs.lib.purdue.edu/ecetr/152 This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] for additional information. Combined Supervised and Unsupervised Learning in Genomic Data Mining Jack Y. Yang Okan K. Ersoy TR-ECE 03-10 School of Electrical and Computer Engineering 465 Northwestern Avenue Purdue University West Lafayette, IN 47907-2035 ii iii TABLE OF CONTENTS LIST OF TABLES .......................................................................................................................................................... V LIST OF FIGURES .......................................................................................................................................................VII ABSTRACT....................................................................................................................................................................IX 1. INTRODUCTION.......................................................................................................................................................1 1.1 MOTIVATION.......................................................................................................................................................1 1.2 ADVANCEMENT OF CURRENT RESEARCH....................................................................................................2 1.3 A SNAPSHOT OF OUR RES EARCH..................................................................................................................5 2. BIOPHYSICAL ASPECTS OF BIOINFORMATICS ..........................................................................................9 2.1 ROLES OF PHYSICS IN BIOMEDICAL SCIENCES .........................................................................................9 2.1.1 X-ray crystallography..............................................................................................................................12 2.1.2 The Overhauser effects: Combined ESR with NMR..........................................................................12 2.1.3 DNA microarray technology..................................................................................................................17 2.2 ROLES OF BIOINFORMATICS ........................................................................................................................19 2.2.1 Protein Modeling .....................................................................................................................................20 2.2.2 Evolution, phylogenetic systematics and phylogenetic trees..........................................................21 3. DATA MINING AND COMPUTATIONAL INTELLIGENCE...........................................................................25 3.1 ORIGINATIONS OF BIOINFORMATICS AND ITS RELATIONSHIP TO COMPUTER ENGINEERING.....25 3.2 STATISTICAL LEARNING, DATA MINING AND COMPUTATIONAL INTELLIGENCE..............................26 3.3 UNSUPERVISED LEARNING.............................................................................................................................27 3.3.1 K-means clustering algorithm ..............................................................................................................28 3.3.2 Self-organizing maps (SOMs) ................................................................................................................30 3.3.2.1 Update neuron’s rule from energy function.................................................................................... 30 3.3.2.2 Implementation of the SOM in optional first stage of UST........................................................ 31 3.3.3 Unsupervised decision trees ..................................................................................................................35 3.4 SUPERVISED LEARNING...................................................................................................................................35 3.4.1 Support vector machines........................................................................................................................36 3.4.2 Decision trees ...........................................................................................................................................36 3.4.3 Neural networks.......................................................................................................................................38 3.4.4 Ersoy’s parallel, self-organizing, hierarchical neural networks...................................................41 3.4.5 Nearset neighbor classifiers..................................................................................................................41 3.5 RESEARCH STRATEGY.....................................................................................................................................41 4. THE UST ALGORITHM: A NEW WAY OF COMB INING SUPERVISED AND UNSUPERVISED LEARNING.....................................................................................................................................................................43 4.1 STRUCTURE OF THE UST................................................................................................................................44 4.1.1 The optional first stage: self-organizing maps (SOMs)...................................................................44 4.1.2 Maximum Contrast Tree........................................................................................................................45 4.2 CONSTRUCTING THE MAXIMUM CONTRAST TREE...................................................................................45 4.2.1 Method I: MCT........................................................................................................................................46 4.2.2 Method II: Balanced MCT....................................................................................................................47 4.2.3 The fixed point MCT ..............................................................................................................................49 4.3 FAST IMPLEMENTATION ALGORITHM.........................................................................................................49 4.4. THE ENERGY FUNCTION................................................................................................................................51 iv 4.5 OVERVIEW OF UST RESULTS ........................................................................................................................52 5. DATA GENERATION..............................................................................................................................................55 5.1 REASONS FOR GENERATING OUR OWN DATA..........................................................................................55 5.2 METHOD IN GENERATING PROTEIN PHYLOGENETIC PROFILES ............................................................55 5.3 OBTAINING PROTEIN FUNCTIONAL LABELS..............................................................................................56 6. RESULTS OF UST ON BIOMEDICAL ASPECTS ............................................................................................61 6.1 IDENTIFYING FUNCTIONAL RELATED PROTEINS .......................................................................................61 6.2 IDENTIFYING PROTEIN (GENE) FUNCTIONAL COMPLEX WITHOUT SEQUENCE HOMOLOGY ........63 6.3 A SCENARIO OF EVOLUTIONARY PATHWAYS .........................................................................................65 6.4 PREDICTING FUNCTIONS OF UNKNOWN PROTEINS................................................................................67 7. NEW CLASSIFIER TO HANDLE MULTIPLE-LABELED INSTANCES .......................................................71 7.1 THEORETICAL PROPERTIES OF THE NEAREST NEIGHBOR CLASSIFIER ..............................................71 7.2 PROBABILITY THAT TWO PHYLOGENETIC PROFILES MATCH DUE TO CHANCE.............................72 7.3 NEW INSIGHT ON IMPROVING NEAREST NEIGHBOR CLASSIFIERS .....................................................82 7.3 UST BASED MULTIPLE-LABELED INSTANCE CLASSIFIER (MLIC)..........................................................83 7.4 MULTIPLE FUNCTIONALITIES OF MLIC ......................................................................................................86 7.5 USING THE MAXIMUM CONTRAST TREE AS A CLASSIFIER......................................................................87 8. COMARATIVE EXPERIMENTAL RESULTS ....................................................................................................89 8.1 ACCOMMODATING TO COMPLEX DATA BY UST.....................................................................................89 8.2 CONSTRUCTING A LIBRARY OF YEAST PROTEIN PHYLOGENETIC PROFILES ................................ 102 8.3 COMPARATIVE RESULTS ............................................................................................................................
Recommended publications
  • The Coming Era of Integrative Bioinformatics, Systems Biology and Intelligent Computing for Functional Genomics and Personalized Medicine Research
    UC Berkeley UC Berkeley Previously Published Works Title 2K09 and thereafter : the coming era of integrative bioinformatics, systems biology and intelligent computing for functional genomics and personalized medicine research Permalink https://escholarship.org/uc/item/60k9m880 Journal BMC Genomics, 11(Suppl 3) ISSN 1471-2164 Authors Yang, Jack Y Niemierko, Andrzej Bajcsy, Ruzena et al. Publication Date 2010-12-01 DOI http://dx.doi.org/10.1186/1471-2164-11-S3-I1 Peer reviewed eScholarship.org Powered by the California Digital Library University of California Yang et al. BMC Genomics 2010, 11(Suppl 3):I1 http://www.biomedcentral.com/1471-2164/11/S3/I1 RESEARCH Open Access 2K09 and thereafter : the coming era of integrative bioinformatics, systems biology and intelligent computing for functional genomics and personalized medicine research Jack Y Yang1,2,3*, Andrzej Niemierko1*, Ruzena Bajcsy4, Dong Xu5*, Brian D Athey6, Aidong Zhang7, Okan K Ersoy2, Guo-zheng Li8, Mark Borodovsky9, Joe C Zhang11,12*, Hamid R Arabnia10, Youping Deng11,12, A Keith Dunker3, Yunlong Liu3, Arif Ghafoor2* From The ISIBM International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing (IJCBS) Shanghai, China. 3-8 August 2009 Abstract Significant interest exists in establishing synergistic research in bioinformatics, systems biology and intelligent computing. Supported by the United States National Science Foundation (NSF), International Society of Intelligent Biological Medicine (http://www.ISIBM.org), International Journal of Computational Biology and Drug Design (IJCBDD) and International Journal of Functional Informatics and Personalized Medicine, the ISIBM International Joint Conferences on Bioinformatics, Systems Biology and Intelligent Computing (ISIBM IJCBS 2009) attracted more than 300 papers and 400 researchers and medical doctors world-wide.
    [Show full text]
  • Genomics, Molecular Imaging, Bioinformatics, and Bio-Nano-Info
    BMC Genomics BioMed Central Introductory Review Open Access Genomics, molecular imaging, bioinformatics, and bio-nano-info integration are synergistic components of translational medicine and personalized healthcare research Jack Y Yang*1, Mary Qu Yang*2, Hamid R Arabnia*3 and Youping Deng*4 Address: 1Harvard Medical School, Harvard University, P.O.Box 400888, Cambridge, Massachusetts 02115 USA, 2National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20852 USA, 3Department of Computer Science, University of Georgia, Athens, Georgia, 30602, USA and 4Department of Biological Science, University of Southern Mississippi, Hattiesburg, 39406, USA Email: Jack Y Yang* - [email protected]; Mary Qu Yang* - [email protected]; Hamid R Arabnia* - [email protected]; Youping Deng* - [email protected] * Corresponding authors from IEEE 7th International Conference on Bioinformatics and Bioengineering at Harvard Medical School Boston, MA, USA. 14–17 October 2007 Published: 16 September 2008 BMC Genomics 2008, 9(Suppl 2):I1 doi:10.1186/1471-2164-9-S2-I1 <supplement> <title> <p>IEEE 7<sup>th </sup>International Conference on Bioinformatics and Bioengineering at Harvard Medical School</p> </title> <editor>Mary Qu Yang, Jack Y Yang, Hamid R Arabnia and Youping Deng</editor> <note>Research</note> <url>http://www.biomedcentral.com/content/pdf/1471-2164-9-S2-info.pdf</url> </supplement> This article is available from: http://www.biomedcentral.com/1471-2164/9/S2/I1 © 2008 Yang et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
    [Show full text]
  • 2K09 and Thereafter : the Coming Era
    2K09 and Thereafter : The Coming Era of Integrative Bioinformatics, Systems Biology and Intelligent Computing for Functional Genomics and Personalized Medicine Research The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation Yang, Jack Y., Andrzej Niemierko, Ruzena Bajcsy, Dong Xu, Brian D. Athey, Aidong Zhang, Okan K. Ersoy, et al. 2010. 2K09 and thereafter : the coming era of integrative bioinformatics, systems biology and intelligent computing for functional genomics and personalized medicine research. BMC Genomics 11(Suppl 3): I1. Published Version doi:10.1186/1471-2164-11-S3-I1 Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:4874784 Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http:// nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#LAA Yang et al. BMC Genomics 2010, 11(Suppl 3):I1 http://www.biomedcentral.com/1471-2164/11/S3/I1 RESEARCH Open Access 2K09 and thereafter : the coming era of integrative bioinformatics, systems biology and intelligent computing for functional genomics and personalized medicine research Jack Y Yang1,2,3*, Andrzej Niemierko1*, Ruzena Bajcsy4, Dong Xu5*, Brian D Athey6, Aidong Zhang7, Okan K Ersoy2, Guo-zheng Li8, Mark Borodovsky9, Joe C Zhang11,12*, Hamid R Arabnia10, Youping Deng11,12, A Keith Dunker3, Yunlong Liu3, Arif Ghafoor2* From The ISIBM International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing (IJCBS) Shanghai, China. 3-8 August 2009 Abstract Significant interest exists in establishing synergistic research in bioinformatics, systems biology and intelligent computing.
    [Show full text]
  • Genomics, Molecular Imaging, Bioinformatics, and Bio-Nano-Info
    The University of Southern Mississippi The Aquila Digital Community Faculty Publications 1-1-2007 Genomics, Molecular Imaging, Bioinformatics, and Bio-Nano-Info Integration are Synergistic Components of Translational Medicine and Personalized Healthcare Research Jack Y. Yang Harvard University, [email protected] Mary Qu Yang National Institutes of Health, [email protected] Hamid R. Arabnia National Institutes of Health, [email protected] Youping Deng University of Southern Mississippi, [email protected] Follow this and additional works at: https://aquila.usm.edu/fac_pubs Part of the Genetics and Genomics Commons Recommended Citation Yang, J. Y., Yang, M. Q., Arabnia, H. R., Deng, Y. (2007). Genomics, Molecular Imaging, Bioinformatics, and Bio-Nano-Info Integration are Synergistic Components of Translational Medicine and Personalized Healthcare Research. BMC Genomics, 9(S2), 1-22. Available at: https://aquila.usm.edu/fac_pubs/8471 This Article is brought to you for free and open access by The Aquila Digital Community. It has been accepted for inclusion in Faculty Publications by an authorized administrator of The Aquila Digital Community. For more information, please contact [email protected]. BMC Genomics BioMed Central Introductory Review Open Access Genomics, molecular imaging, bioinformatics, and bio-nano-info integration are synergistic components of translational medicine and personalized healthcare research Jack Y Yang*1, Mary Qu Yang*2, Hamid R Arabnia*3 and Youping Deng*4 Address: 1Harvard Medical School, Harvard University,
    [Show full text]