Ontology Applications in Systems Biology: a Machine Learning Approach
Ontology Applications in Systems Biology: a Machine Learning Approach

by Elma Hussanna Akand

Master of Information Technology, University of Queensland, Australia, 2001.
BSc. Electrical & Electronic Engineering, Bangladesh University of Engineering & Technology, 1999.

A thesis submitted in partial fulfillment for the degree of Doctor of Philosophy
in the Faculty of Engineering
School of Computer Science and Engineering
The University of New South Wales
July 2014

PLEASE TYPE
THE UNIVERSITY OF NEW SOUTH WALES
Thesis/Dissertation Sheet

Surname or Family name: Akand
First name: Elma
Other name/s: Hussanna
Abbreviation for degree as given in the University calendar: PhD
School: School of Computer Science and Engineering
Faculty: Faculty of Engineering
Title: Ontology Applications in Systems Biology: a Machine Learning Approach

Abstract 350 words maximum: (PLEASE TYPE)

Biology is flooded with an overwhelming accumulation of data and biologists require methods to apply their knowledge to explain biological networks of interacting genes or proteins in comprehensible terms. Therefore the focus of modern bioinformatics has shifted towards systems-wide analysis to understand mechanisms such as those underlying important diseases. Knowledge acquisition from such exponentially growing, inherently noisy and unstructured data is only likely to be achieved by combining bioinformatics, machine learning and semantic technologies such as ontologies. The major contribution of this thesis is on novel ontology applications to integrate complex multi-relational data towards learning models of biological systems. First we examined machine learning using ontology annotations to integrate heterogeneous data on systems biology. A series of propositional learning tasks to learn predictive models of intra-cellular expression in cells showed that feature construction and selection improved performance.
Learning to predict phenotype is harder than predicting protein or gene expression, since identifying systems responses requires the integration of multiple potential causes and effects. In this thesis we applied Formal Concept Analysis (FCA) to integrate multiple experiments and identify common subsets of genes that share common systemic behaviour. Visual analytics was then applied to enable users to navigate concept lattices and generate training sets for further analysis by Inductive Logic Programming (ILP). This showed learned rules with biological background knowledge contained potentially interesting relations when validated. However, these rules are not always verifiable by humans. To address this issue a novel method called “visual closure”, by analogy to the closure of formal concepts, was implemented. Rules, viewed as concepts, can be expanded by conversion to Datalog queries which then are used to search for additional knowledge in biological databases. The visual closure technique is then applied to complete these expanded concepts for visualization by domain specialists. This thesis has demonstrated novel ontology applications in systems biology. However, the question of how to acquire ontologies remains. Ontologies in systems biology often require relational representations due to the importance of network structures. Therefore, as our final step, an initial version of automated ontology construction in a first order representation is demonstrated.

Declaration relating to disposition of project thesis/dissertation

I hereby grant to the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or in part in the University libraries in all forms of media, now or here after known, subject to the provisions of the Copyright Act 1968. I retain all property rights, such as patent rights.
I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only).

Signature ……………………  Witness ……………………  Date ……………………

The University recognises that there may be exceptional circumstances requiring restrictions on copying or conditions on use. Requests for restriction for a period of up to 2 years must be made in writing. Requests for a longer period of restriction may be considered in exceptional circumstances and require the approval of the Dean of Graduate Research.

FOR OFFICE USE ONLY
Date of completion of requirements for Award:

THIS SHEET IS TO BE GLUED TO THE INSIDE FRONT COVER OF THE THESIS

COPYRIGHT STATEMENT

‘I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or here after known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.’

Signed ……………………
Date ……………………
AUTHENTICITY STATEMENT

‘I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.’

Signed ……………………
Date ……………………

ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.’

Signed ……………………
Date ……………………

“Research! A mere excuse for idleness; it has never achieved, and will never achieve any results of the slightest value.”
Benjamin Jowett

Abstract

Biology is flooded with an overwhelming accumulation of data and biologists require methods to apply their knowledge to explain biological networks of interacting genes or proteins in comprehensible terms. Therefore the focus of modern bioinformatics has shifted towards systems-wide analysis to understand mechanisms such as those underlying important diseases. Knowledge acquisition from such exponentially growing, inherently noisy and unstructured data is only likely to be achieved by combining bioinformatics, machine learning and semantic technologies such as ontologies.
The major contribution of this thesis is on novel ontology applications to integrate complex multi-relational data towards learning models of biological systems. First we examined machine learning using ontology annotations to integrate heterogeneous data on systems biology. A series of propositional learning tasks to learn predictive models of intra-cellular expression in cells showed that feature construction and selection improved performance.

Learning to predict phenotype is harder than predicting protein or gene expression, since identifying systems responses requires the integration of multiple potential causes and effects. In this thesis we applied Formal Concept Analysis (FCA) to integrate multiple experiments and identify common subsets of genes that share common systemic behaviour. Visual analytics was then applied to enable users to navigate concept lattices and generate training sets for further analysis by Inductive Logic Programming (ILP). This showed learned rules with biological background knowledge contained potentially interesting relations when validated.

However, these rules are not always verifiable by humans. To address this issue a novel method called “visual closure”, by analogy to the closure of formal concepts, was implemented. Rules, viewed as concepts, can be expanded by conversion to Datalog queries which then are used to search for additional knowledge in biological databases. The visual closure technique is then applied to complete these expanded concepts for visualization by domain specialists.

This thesis has demonstrated novel ontology applications in systems biology. However, the question of how to acquire ontologies remains. Ontologies in systems biology often require relational representations due to the importance of network structures. Therefore, as our final step, an initial version of automated ontology construction in a first order representation is demonstrated.
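
The "closure of formal concepts" that the abstract uses as an analogy can be illustrated with a minimal sketch. The Python below is not taken from the thesis: the toy gene and experiment names, the CONTEXT table, and the function names extent, intent and closure are all hypothetical, chosen only to show the standard FCA closure operator (take the objects sharing a set of attributes, then take every attribute those objects have in common).

```python
from typing import Dict, FrozenSet, Iterable, Set

# Hypothetical toy formal context (an assumption for illustration only):
# each gene (object) maps to the experiments (attributes) in which it responds.
CONTEXT: Dict[str, Set[str]] = {
    "geneA": {"exp1", "exp2"},
    "geneB": {"exp1", "exp2", "exp3"},
    "geneC": {"exp2", "exp3"},
}


def extent(attrs: Iterable[str], context: Dict[str, Set[str]] = CONTEXT) -> FrozenSet[str]:
    """Return the objects that have every attribute in attrs."""
    wanted = set(attrs)
    return frozenset(obj for obj, have in context.items() if wanted <= have)


def intent(objs: Iterable[str], context: Dict[str, Set[str]] = CONTEXT) -> FrozenSet[str]:
    """Return the attributes shared by every object in objs."""
    # Start from all attributes so that the intent of the empty object set
    # is the full attribute set, as in standard FCA.
    shared: Set[str] = set().union(*context.values())
    for obj in objs:
        shared &= context[obj]
    return frozenset(shared)


def closure(attrs: Iterable[str], context: Dict[str, Set[str]] = CONTEXT) -> FrozenSet[str]:
    """Close an attribute set: the intent of its extent."""
    return intent(extent(attrs, context), context)


if __name__ == "__main__":
    # Genes responding in exp1 also all respond in exp2 in this toy context,
    # so closing {exp1} adds exp2.
    print(sorted(closure({"exp1"})))
```

In this toy context, closing the attribute set {exp1} adds exp2, because every gene that responds in exp1 also responds in exp2. A pair (extent, closed intent) is a formal concept; the thesis's "visual closure" is described only as an analogy to this operation, so the sketch should not be read as its implementation.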