<<

Ontology Applications in Systems Biology: A Machine Learning Approach

by Elma Hussanna Akand

Master of Information Technology, University of Queensland, Australia, 2001
BSc. Electrical & Electronic Engineering, Bangladesh University of Engineering & Technology, 1999

A thesis submitted in partial fulfillment for the degree of Doctor of Philosophy

in the Faculty of Engineering School of Computer Science and Engineering The University of New South Wales

July 2014

THE UNIVERSITY OF NEW SOUTH WALES Thesis/Dissertation Sheet

Surname or Family name: Akand

First name: Elma Other name/s: Hussanna

Abbreviation for degree as given in the University calendar: PhD

School: School of Computer Science and Engineering Faculty: Faculty of Engineering

Title: Ontology Applications in Systems Biology: a Machine Learning Approach

Abstract (350 words maximum):

Biology is flooded with an overwhelming accumulation of data and biologists require methods to apply their knowledge to explain biological networks of interacting genes or proteins in comprehensible terms. Therefore the focus of modern bioinformatics has shifted towards systems-wide analysis to understand mechanisms such as those underlying important diseases. Knowledge acquisition from such exponentially growing, inherently noisy and unstructured data is only likely to be achieved by combining bioinformatics, machine learning and semantic technologies such as ontologies. The major contribution of this thesis lies in novel ontology applications to integrate complex multi-relational data towards learning models of biological systems.

First we examined machine learning using ontology annotations to integrate heterogeneous data on systems biology. A series of propositional learning tasks to learn predictive models of intra-cellular expression in cells showed that feature construction and selection improved performance.

Learning to predict phenotype is harder than predicting protein or gene expression, since identifying systems responses requires the integration of multiple potential causes and effects. In this thesis we applied Formal Concept Analysis (FCA) to integrate multiple experiments and identify common subsets of genes that share common systemic behaviour. Visual analytics was then applied to enable users to navigate concept lattices and generate training sets for further analysis by Inductive Logic Programming (ILP). This showed that rules learned with biological background knowledge contained potentially interesting relations when validated.

However, these rules are not always verifiable by humans. To address this issue a novel method called “visual closure”, by analogy to the closure of formal concepts, was implemented. Rules, viewed as concepts, can be expanded by conversion to Datalog queries which then are used to search for additional knowledge in biological databases. The visual closure technique is then applied to complete these expanded concepts for visualization by domain specialists.

This thesis has demonstrated novel ontology applications in systems biology. However, the question of how to acquire ontologies remains. Ontologies in systems biology often require relational representations due to the importance of network structures. Therefore, as our final step, an initial version of automated ontology construction in a first order representation is demonstrated.

Declaration relating to disposition of project thesis/dissertation

I hereby grant to the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or in part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all property rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only).

Signature ……………………………………  Witness ……………………………………  Date ……………………………………

The University recognises that there may be exceptional circumstances requiring restrictions on copying or conditions on use. Requests for restriction for a period of up to 2 years must be made in writing. Requests for a longer period of restriction may be considered in exceptional circumstances and require the approval of the Dean of Graduate Research.

FOR OFFICE USE ONLY Date of completion of requirements for Award:

THIS SHEET IS TO BE GLUED TO THE INSIDE FRONT COVER OF THE THESIS

COPYRIGHT STATEMENT

‘I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.’

Signed ……………………………………………......

Date ……………………………………………......

AUTHENTICITY STATEMENT

‘I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.’

Signed ……………………………………………......

Date ……………………………………………......

ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.’

Signed ……………………………………………......

Date ……………………………………………......

“Research! A mere excuse for idleness; it has never achieved, and will never achieve, any results of the slightest value.”

Benjamin Jowett

Abstract

Biology is flooded with an overwhelming accumulation of data and biologists require methods to apply their knowledge to explain biological networks of interacting genes or proteins in comprehensible terms. Therefore the focus of modern bioinformatics has shifted towards systems-wide analysis to understand mechanisms such as those underlying important diseases. Knowledge acquisition from such exponentially growing, inherently noisy and unstructured data is only likely to be achieved by combining bioinformatics, machine learning and semantic technologies such as ontologies. The major contribution of this thesis lies in novel ontology applications to integrate complex multi-relational data towards learning models of biological systems.

First we examined machine learning using ontology annotations to integrate heterogeneous data on systems biology. A series of propositional learning tasks to learn predictive models of intra-cellular expression in cells showed that feature construction and selection improved performance.

Learning to predict phenotype is harder than predicting protein or gene expression, since identifying systems responses requires the integration of multiple potential causes and effects. In this thesis we applied Formal Concept Analysis (FCA) to integrate multiple experiments and identify common subsets of genes that share common systemic behaviour. Visual analytics was then applied to enable users to navigate concept lattices and generate training sets for further analysis by Inductive Logic Programming (ILP). This showed that rules learned with biological background knowledge contained potentially interesting relations when validated.

However, these rules are not always verifiable by humans. To address this issue a novel method called “visual closure”, by analogy to the closure of formal concepts, was implemented. Rules, viewed as concepts, can be expanded by conversion to Datalog queries which then are used to search for additional knowledge in biological databases. The visual closure technique is then applied to complete these expanded concepts for visualization by domain specialists.

This thesis has demonstrated novel ontology applications in systems biology. However, the question of how to acquire ontologies remains. Ontologies in systems biology often require relational representations due to the importance of network structures. Therefore, as our final step, an initial version of automated ontology construction in a first order representation is demonstrated.

Publications

Parts of the work presented in Chapter 3 were published in:

Akand, E., Bain, M., Temple, M.: “Learning with gene ontology annotation using feature selection and construction”, Applied Artificial Intelligence 24(1-2), 5-38 (2010).

Akand, E., Bain, M., Temple, M.: “Learning from Ontological Annotation: an Application of Formal Concept Analysis to Feature Construction in the Gene Ontology”. In: Proc. Third Australasian Ontology Workshop (AOW 2007), Gold Coast, Australia, 2007. CRPIT, 85. Meyer, T. and Nayak, A. C., Eds. ACS. 15-23.

Parts of the work presented in Chapter 4 appeared as:

Akand, E., Bain, M., Temple, M.: “Learning responsive cellular networks by integrating genomics and proteomics data”, 19th International Conference on Inductive Logic Programming (ILP’09), Katholieke Universiteit Leuven, Belgium, 2009. Oral and poster presentation.

Parts of the work presented in Chapter 5 were published in:

Akand, E., Bain, M., Temple, M.: “A Visual Analytics Approach to Augmenting Formal Concepts with Relational Background Knowledge in a Biological Domain”. In: Proc. Sixth Australasian Ontology Workshop, Adelaide, Australia, 2010. (eds. Meyer, T., Orgun, M. & Taylor, K.).

Additional publications for Chapters 6 and 7, in preparation:

Akand, E., Bain, M., Temple, M.: “Augmenting concept lattices with first-order learning and visual analytics”.

Akand, E., Bain, M., Temple, M.: “Towards First-order Ontology Learning”.

Acknowledgements

I gratefully thank everyone who contributed to my thesis in their different ways, all with truly helpful intentions. I deeply acknowledge the extraordinary and meticulous support of my supervisor Dr. Mike Bain. His in-depth knowledge and visionary understanding of the field of machine learning shaped my work into a meaningful one.

I would like to thank Dr. Mark Temple for providing biological data and critical knowledge feedback that enriched my thesis with appropriate context.

Special thanks to Karin Avnit, my colleague and friend, who graciously supported me with thesis comments, LaTeX tips and, most of all, humorous coffee-time discussions.

My husband Enayet Karim’s enormous support to complete this work is simply invaluable. My little kids Aryana and Ishan unknowingly supported me with their heavenly smiles that kept me going, at times against all odds. I am also indebted to my parents and sisters for their unfailing support and endless inspiration.

Contents

Originality Statement iii

Copyright Statement iv

Authenticity Statement v

Abstract vii

Publications ix

Acknowledgements x

List of Figures xvi

List of Tables xix

List of Algorithms xix

1 Introduction 1
  1.1 Aims of Thesis ...... 3
  1.2 Contributions of Thesis ...... 4
  1.3 Thesis Outline ...... 4

2 Ontology, systems biology and machine learning 7
  2.1 Ontology ...... 7
    2.1.1 Ontology applications in bioinformatics ...... 9
    2.1.2 Ontology engineering in bioinformatics ...... 10
  2.2 Systems biology ...... 12
    2.2.1 Post-genome biology ...... 12
    2.2.2 From genomics to systems biology ...... 18

    2.2.3 Bioinformatics databases ...... 24
    2.2.4 Annotations and ontologies in ...... 27
  2.3 Machine learning ...... 29
    2.3.1 Supervised and unsupervised learning ...... 30
    2.3.2 Decision tree classifiers ...... 32
    2.3.3 Inductive logic programming (ILP) ...... 34
    2.3.4 Relative least general generalization (RLGG) ...... 35
    2.3.5 Inverse resolution ...... 36
    2.3.6 ILP in a first-order (multi-relational) setting ...... 38
    2.3.7 Association rule mining ...... 39
      2.3.7.1 CHARM ...... 41
  2.4 Ontology learning ...... 42
    2.4.1 Formal Concept Analysis and ontology learning ...... 45
    2.4.2 Applications of machine learning in functional genomics ...... 48
      2.4.2.1 Ontology driven approaches ...... 48
      2.4.2.2 Feature selection in microarray data ...... 51
      2.4.2.3 Inductive logic programming on multi-relational genomics data ...... 52
  2.5 Summary ...... 53

3 Propositional learning from ontological annotation 55
  3.1 Motivation ...... 55
  3.2 Formalization of over-representation analysis ...... 56
    3.2.1 Optimistic bias in probability estimates ...... 57
    3.2.2 Bioinformatics applications of over-representation analysis ...... 59
    3.2.3 Coverage matrix ...... 59
  3.3 Ontological annotation for machine learning ...... 61
    3.3.1 Feature selection ...... 63
      3.3.1.1 Feature ranking ...... 63
      3.3.1.2 Feature subset selection ...... 65
    3.3.2 Feature construction ...... 65
      3.3.2.1 Formal concept analysis ...... 66
      3.3.2.2 Feature construction from concept lattices ...... 67
      3.3.2.3 Selecting predictive constructed features ...... 68
  3.4 Experimental results ...... 69
    3.4.1 Biological background: cellular network response to stress ...... 69
    3.4.2 Overview of experiments ...... 70
    3.4.3 Case study 1: predicting protein expression ...... 71
      3.4.3.1 Experiment 1: protein expression given gene expression ...... 71
      3.4.3.2 Experiment 2: protein expression given GO features ...... 74
      3.4.3.3 Experiment 3: protein expression given microarray data and GO features ...... 75
    3.4.4 Case study 2: predicting general vs. specific stress-response ...... 77
  3.5 Discussion ...... 79
    3.5.1 Case study 1: predicting protein expression ...... 80

    3.5.2 Case study 2: predicting general vs. specific stress-response ...... 82
    3.5.3 Related work ...... 83
      3.5.3.1 Graph-based similarity and over-representation analysis ...... 83
      3.5.3.2 Inductive bias of closed concept ...... 83
      3.5.3.3 Feature-extraction for machine learning using closed concepts ...... 84
      3.5.3.4 Description Logics ...... 84
      3.5.3.5 Bioinformatics approaches ...... 85
    3.5.4 Some ontological analysis of the Gene Ontology ...... 86
  3.6 Conclusions ...... 89

4 Learning responsive cellular networks by integrating genomics and proteomics data 91
  4.1 Biological background ...... 91
  4.2 Learning tasks ...... 94
    4.2.1 Learning the logic of responsive cellular networks ...... 94
  4.3 Predicting an intra-cellular response phenotype ...... 95
    4.3.1 Data sets ...... 95
    4.3.2 Method ...... 96
    4.3.3 Theory post-processing ...... 97
    4.3.4 Experimental results ...... 98
      4.3.4.1 Validation ...... 99
  4.4 Predicting an extra-cellular response phenotype ...... 106
  4.5 Discussion ...... 107
    4.5.1 Related work ...... 107
  4.6 Conclusions ...... 108

5 Visualising concept lattices for learning from integrated multi-relational data 111
  5.1 Introduction ...... 111
  5.2 Closed itemset mining ...... 113
    5.2.1 Formal Concept Analysis ...... 113
    5.2.2 A concept lattice algorithm to support visual analytics ...... 114
    5.2.3 Definitions and techniques ...... 115
    5.2.4 Properties used for search-space pruning ...... 116
    5.2.5 Algorithm Design ...... 117
    5.2.6 BioLattice compared to CHARM-L ...... 122
  5.3 A web-based browser for BioLattice ...... 123
    5.3.1 BioLattice Application Window ...... 124
    5.3.2 Tabular Lattice Display ...... 124
    5.3.3 Lattice Manipulation ...... 125
    5.3.4 Comparative Ontology Display ...... 126
    5.3.5 Visualizing Protein-Protein Interaction Network ...... 126
  5.4 Mapping to Gene Ontology terms ...... 127
  5.5 Case study: yeast systems biology ...... 131

    5.5.1 Concept ranking by gene interactions ...... 131
    5.5.2 Relational learning of multiple-stress rules ...... 134
      5.5.2.1 Method ...... 134
  5.6 Conclusions ...... 136

6 Augmenting formal concepts with first-order learning and visual analytics 137
  6.1 Introduction ...... 137
  6.2 Background: Semantics of Datalog queries ...... 139
    6.2.1 Rules ...... 139
    6.2.2 Facts ...... 140
    6.2.3 Negation ...... 140
    6.2.4 Queries ...... 140
  6.3 Algorithm Design ...... 141
  6.4 Worked Example ...... 142
  6.5 Visual closure of the rule ...... 147
    6.5.1 Approach ...... 149
    6.5.2 Background data ...... 150
    6.5.3 Visualization layouts ...... 151
    6.5.4 Protein complex architecture visualization ...... 151
    6.5.5 Force directed interaction visualization ...... 152
  6.6 Two biological application case studies ...... 153
    6.6.1 Case study 1 ...... 153
    6.6.2 Case study 2 ...... 157
  6.7 Conclusions ...... 160

7 Towards First-order Ontology Learning 163
  7.1 Introduction ...... 163
  7.2 A restricted first-order representation ...... 164
  7.3 First-order concept lattice construction ...... 165
  7.4 Examples ...... 165
    7.4.1 Animals ...... 166
    7.4.2 Pathways ...... 169
  7.5 Discussion ...... 171
    7.5.1 First-order logic based ontology representations ...... 171
    7.5.2 Ontology learning from first-order concept lattices ...... 172
      7.5.2.1 Limitations of the Theory Revision approach ...... 172
      7.5.2.2 Adapting a Description Logic approach ...... 173
    7.5.3 Related work ...... 173
  7.6 Summary ...... 175

8 Conclusions 177
  8.1 Propositional learning from ontological annotation ...... 178
  8.2 Learning responsive cellular networks by integrating genomics and proteomics data ...... 179

  8.3 Visualising concept lattices for learning from integrated multi-relational data ...... 180
  8.4 Augmenting formal concepts with first-order learning and visual analytics ...... 181
  8.5 Towards First-order Ontology Learning ...... 183
  8.6 Future work ...... 183

A Database and queries 185
  A.1 Datasets ...... 185
  A.2 Databases ...... 187

B Aleph parameter settings 191

C Learning networks 193
  C.1 Microarray data preparation ...... 193
  C.2 Ground clauses for protein expression network ...... 194
  C.3 Significant annotations from FunSpec ...... 202

D Workflow of a rule’s visual closure 207

List of Figures

2.1 Annotation for a yeast protein from GO biological process ontology ...... 8
2.2 DNA structure ...... 13
2.3 Gene expression – see text for details ...... 16
2.4 Network showing yeast protein-protein interactions [BB03] ...... 23
2.5 A generic machine learning system ...... 30
2.6 C4.5 decision tree example ...... 33
2.7 Hypothesis construction in inductive logic programming ...... 35
2.8 “W” operator on propositional definite clauses ...... 37
2.9 A generic ontology learning framework adopted from [CMSV09] ...... 43
2.10 Formal context ...... 46
2.11 Concept lattice for the formal context in Figure 2.10 ...... 47

3.1 Example of a DAG-structured ontology ...... 60
3.2 IG Ranker – feature selection by information gain ranking ...... 65
3.3 Concept lattice for the coverage matrix of Table 3.2 ...... 68
3.4 Experimental setup for machine learning with ontological annotation ...... 72
3.5 Accuracy of predicting protein expression ...... 74
3.6 Accuracy of predicting protein expression ...... 75
3.7 A decision tree ...... 76
3.8 Accuracy of predicting general vs. specific deletant sensitivity ...... 80
3.9 Transitive closures in a GO-type ontology can show redundancy ...... 87

4.1 A generic diagram of cellular response to environmental stress ...... 93
4.2 Clauses learned for expression of a protein A induced by exposure of the cell to H2O2 ...... 98
4.3 Network diagram drawn by human expert from ground clauses of theory for protein expression under H2O2 stress ...... 103

5.1 BioLattice intent-extent search tree. Candidate concepts are shown as intent, extent pairs. See Section 5.2.5 for details ...... 115
5.2 Steps to build formal concept lattice ...... 117

5.3 Weight function values for frequent 1-itemsets in Figure 5.1 ...... 118
5.4 Lattice Restructure ...... 122
5.5 Concept Lattice ...... 123
5.6 BioLattice input file ...... 124
5.7 BioLattice application index window ...... 125
5.8 Tabular list of concept details ...... 125
5.9 Over-represented Gene Ontology categories ...... 127
5.10 Protein-protein interaction network ...... 128
5.11 Ontology categories in response to stress ...... 129
5.12 Concept ratio ...... 130
5.13 Variation in concept ratio ...... 130
5.14 Example of a network structure ...... 133

6.1 Multi-layered information processing filter for augmenting concept lattices with first-order learning and visual analytics ...... 138
6.2 Example of the predicate ordering in the search space ...... 143
6.3 Step-by-step flow-chart for processing of a rule to give a query ...... 145
6.4 Completing the closure of the rule ...... 147
6.5 Rule set learned for formal concept using background knowledge ...... 154
6.6 Extension of a learned rule selected from the set in Figure 6.5 ...... 155
6.7 V0 vacuolar ATPase complex from rule 1 in Figure 6.6 ...... 156
6.8 V0 vacuolar ATPase complex from rule 2 in Figure 6.6 ...... 156
6.9 Mediator complex SRB subcomplex of RNA polymerase II ...... 158
6.10 SAGA complex ...... 158
6.11 Gene response measured by microarray data of ADA2 under the environmental stressors Heat and Alkali ...... 160

A.1 RulesDB schema ...... 188
A.2 Gene Ontology relational schema ...... 189

D.1 Data flow diagram for visual closure of a rule ...... 208
D.2 Tabular list of concept details in a BioLattice browser window ...... 209
D.3 Set of first order rules generated by Aleph ...... 209
D.4 A learned rule, at the top of the window, and below, the rows of its extensional representation ...... 210
D.5 Visualization of all relations ...... 210

List of Tables

2.1 Selection of systems biology techniques ...... 20
2.2 Selected list of Bioinformatics Databases ...... 26
2.3 List of web-based functional annotation tools accessed on 11th May, 2013 ...... 50

3.1 Tests on GO categories where one is more general than the other are not independent — see text for details ...... 58
3.2 Coverage matrix for the example in Figure 3.1 ...... 61
3.3 Accuracies of predicting general vs. specific deletant sensitivity to multiple stresses with GO biological process features. Note that ◦, • denote statistically significant improvement or degradation, respectively, with respect to “No Selection” ...... 79

5.1 Concepts ranked by synthetic lethality...... 132

6.1 Inter-complex protein relationships ...... 159

7.1 Objects (input examples) for the animals domain are ground clauses ...... 166
7.2 First-order concept lattice for the animals domain ...... 166
7.3 Objects (input examples) for the pathways domain are ground clauses ...... 170
7.4 First-order concept lattice for the pathways domain ...... 170
7.5 Comparison of first-order ontology learning approaches ...... 174

A.1 Biological datasets...... 185

List of Algorithms

1 BioLattice(data G, minSup)...... 117

2 BioLattice Extend(G, IntentCid IC, minSup, lattice L, prefix p, root Lr) 118

3 DoSpecialization(concept N, xi, xj, branch, p, G) ...... 119
4 ComputeClosure(candidateset P, IntentCid IC) ...... 120

5 AddConceptToLattice(candidate c, L, IntentCid IC, Lr, closureS) ...... 121
6 Rule explorer(Concept extent E, extensional database D, set of rules R) ...... 141

7 Order literal search space(ri, S, V) ...... 142
8 Process query(E, D, S, V) ...... 142

Dedicated to my living God for fitting my puzzle pieces graciously, to my late father for loving all my imperfections, to my mother for ongoing silent yet committed love and dedication, to my husband for partnership in crime and completion.

1 Introduction

“The real problem is not whether machines think but whether men do.”

Burrhus Frederic Skinner

Ontologies¹ are becoming increasingly important as activity in Web-based business, Government, technology, science and arts continues to grow, particularly with the effort towards building knowledge-based applications as envisaged by the Semantic Web [BLHL01]. A key application area for ontologies is post-genome biology [Kan00], which is faced with rapid growth in data from advances in gene sequencing and other biotechnologies. More than a decade after the first complete draft of the human genome [L+01, VAM+01] a key challenge is to understand how the interactions of the complete set of genes and the proteins they encode combine to regulate the living cell, known as “systems biology” [IGH01, Kit02]. This thesis addresses several problems of applying and extending ontologies in systems biology. The techniques we develop are designed to aid biologists in understanding heterogeneous datasets [Qua06], but are sufficiently general to have potential applications to other complex systems. To achieve this we propose a novel combination of methods, using symbolic machine learning [Mit97, Rae08], formal concept analysis [GW99a], visualization [GOB+10] and ontologies [SS09]. This combined approach can be thought of as a new form of ontology-based visual analytics [KMS+08].

¹In this thesis ontologies are viewed from a computer science perspective, and can be thought of as defining “what is to be represented” [Smi03]; see Chapter 2.

Ontology applications in systems biology are typically deductive, where the objective is to use semantic knowledge representations to integrate data, thus enabling query-answering applications [CYC13]. In this thesis we emphasise that ontologies are also important for inductive applications in systems biology, where the objective is to use some form of machine learning to enable hypothesis formation from integrated data. This topic has been less explored, but in fact in any application of machine learning there is at least an implicit ontology being used for the representation of data and hypotheses.

Making ontology structures first-class citizens in applications of machine learning, as in this thesis, has a number of advantages, leading to an improved understanding of the issues in learning with ontologies, and in using learning for the acquisition of ontology structures. Three aspects of ontology applications when using machine learning are important. First, ontology “enables semantic interoperability by presenting information consistently across organizations and domains and machines” [Usc11]. Machine learning makes this concrete by using ontology to give a standardised set of domain terms that can be used in data and hypothesis representation — frequently, obtaining such terms requires major effort as part of the data mining process. Second, ontologies can represent hierarchical knowledge, with concepts ordered by generality. Third, ontologies should be understandable and usable by human domain specialists. The last two aspects are often missing from representation languages used in machine learning, although Inductive Logic Programming [Rae08] is one exception. In areas that are challenging for current machine learning, such as systems biology, all three aspects are important.

A central problem for machine learning in systems biology is comprehensibility. In a recent paper [FWA10] three reasons for learning comprehensible models in the domain of protein function prediction are given: (i) to improve the biologist’s confidence in the prediction; (ii) to give the biologist new insights about the data and ideas for new hypothesis creation; and (iii) to detect errors in the model or the data. This leads to a different way of using machine learning than usual, with the goal of learning concept specifications rather than just concept classifiers. To see the difference, we adopt the terminology of descriptive induction, or learning a characteristic description of a class of objects. This was informally defined by Michalski [Mic83] as learning a description “that specifies all common properties of known objects in the class, and by that defines the class in the context of an unlimited number of other object classes”. Michie [Mic91] expanded this to a definition of a machine learning system as being one that “uses sample data to generate an updated basis for improved classification of subsequent data from the same source and expresses the new basis in intelligible symbolic form”.

Pazzani et al. [PMS01] showed in a neurological domain that even though learned rules were accurate they were unlikely to be accepted by domain experts unless they were (i) simple enough to be easily understood and (ii) consistent with background medical knowledge.

1.1 Aims of Thesis

This thesis will explore the use of ontology in the domain of systems biology for new approaches to descriptive induction for knowledge acquisition. In particular we will focus on an aspect of ontology taken from formal concept analysis, which is the idea of closure. This links ontology and machine learning by defining a form of completeness that includes both the intensional (model) and extensional (data) aspects of the induction problem. We will use this to implement descriptive induction approaches in propositional and first-order representations, in the automated construction of hierarchical concepts from integrated systems biology data sets, and for visualization. The unifying idea of the thesis is that approaches using closure operators provide solutions to the problem of descriptive induction.
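To fix intuitions about closure, a minimal sketch follows, over a toy formal context in which objects are genes and attributes are annotations; all names are illustrative only, not drawn from the datasets used later in the thesis.

```python
# Minimal sketch of the closure operator from formal concept analysis (FCA).
# The formal context maps objects (here, toy gene names) to attribute sets.

context = {
    "gene1": {"stress_response", "nuclear"},
    "gene2": {"stress_response", "nuclear", "dna_repair"},
    "gene3": {"stress_response", "cytoplasmic"},
}

def extent(attrs):
    """All objects that have every attribute in attrs (the ' operator)."""
    return {g for g, a in context.items() if attrs <= a}

def intent(objs):
    """All attributes shared by every object in objs (the ' operator)."""
    sets = [context[g] for g in objs]
    return set.intersection(*sets) if sets else set()

def closure(attrs):
    """attrs'' : the largest attribute set with the same extent as attrs."""
    return intent(extent(attrs))

# "nuclear" alone is not closed: every nuclear gene here is also
# stress-responsive, so the closure adds "stress_response" without
# changing the extent.
print(closure({"nuclear"}))  # -> {'stress_response', 'nuclear'}
```

A pair (extent(B), B) with B closed is a formal concept; it is this intension-extension completeness that the thesis carries over, by analogy, to learned rules as “visual closure” in later chapters.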

From a high-level perspective, we can identify the following problems which have motivated this thesis:

1. how to learn comprehensible models of biological systems from data on complex interaction networks and ontological annotation;

2. how to acquire comprehensible hierarchically structured models from such data using automated or semi-automated methods; and

3. how to combine visualization with the above techniques.

1.2 Contributions of Thesis

We claim the following original contributions are made in this thesis:

1. a demonstration of learning comprehensible models of biological systems using a combination of machine learning techniques;

2. an implementation of formal concept analysis to structure complex datasets into sub-problems to which machine learning can be applied; and

3. the implementation of concept lattice browsing in systems biology and the development of a form of “visual closure” for learned models.

1.3 Thesis Outline

The outline of this thesis is as follows:

Chapter 2 reviews important concepts and applications of ontology, systems biology and machine learning required for this dissertation. To apply and extend ontologies in systems biology using machine learning approaches, we discuss ontology concepts for data mining, along with key challenges such as gaps in propositional and relational learning frameworks.

Chapter 3 discusses the limitations of standard over-representation analyses of biological data, and investigates an integrative “systems biology” approach where formal concept analysis is applied for feature construction in supervised learning.

Chapter 4 investigates issues in learning models of the intra- and extra-cellular network response of genes by combining multi-relational heterogeneous data sources using inductive logic programming. A semi-automated visualization method based around the integration of biological interaction data is also described.

Chapter 5 develops a visualization tool, the BioLattice browser, that applies efficient closed itemset mining to generate a concept lattice, and subsequently integrates these concepts with additional heterogeneous data sources. User interaction is enabled to navigate the lattice and select concepts that are then provided to a first-order learning system to generate explanatory rules.

Chapter 6 extends the work of Chapter 5 by introducing the “visual closure” of learned rules. A user interface enables the selection of rules and generation of their extensional representations with respect to background knowledge and domain-specific properties. Visual analytic tools provided with the system demonstrate that rules can identify comprehensible models of dynamic cell behaviour.

Chapter 7 describes a preliminary version of an automated ontology construction algorithm using a first-order representation. A syntactically-restricted version of Datalog is proposed for such a representation, and a more formalised approach is planned as part of future work.

Chapter 8 summarises the results of this thesis and proposes some future directions.

2 Ontology, systems biology and machine learning

“You do not really understand something unless you can explain it to your grandmother.”

Albert Einstein

This chapter reviews background on relevant aspects of ontology, systems biology and machine learning in the context of the thesis. The goal is to make the thesis as self-contained as possible. In each area we do not attempt to give a complete review but focus instead on problems and techniques that relate to those in the thesis. We introduce definitions of the basic concepts required in each area and also describe the important literature and applications.

2.1 Ontology

Many attempts have been made to define ontology in computer science and AI (e.g., [Gru93]). We give the following definition that covers our work and much of the relevant research:

An ontology is an explicit description of semantic information: an understandable common representation for communication among people or organizations, providing a source of machine-readable background knowledge to facilitate automated inference, and expressing comprehensive domain knowledge that can be extended, integrated, or reused in other domains.

Effort in ontology can be broadly divided into Upper Ontologies and Domain Ontologies. Upper Ontologies are non-domain-specific and provide very general descriptions of fundamental properties of the world. DOLCE (http://www.loa-cnr.it/DOLCE.html) and BFO (http://www.ifomis.org/bfo) are examples of Upper Ontologies that support semantic integration of scientific research terms for different domain-specific ontologies. In contrast, Domain Ontologies are restricted to the systematic description of concepts which are required in a particular application area.

A second dimension that is often used distinguishes ontologies based on how much formal languages such as mathematical logic are used to define and reason with an ontology. Formal ontology uses logical definitions and automated inference, whereas informal ontology typically uses statements in natural language and data structures like directed graphs. For example, the Gene Ontology, or GO for short (www.geneontology.org), is an informal and domain-specific ontology that provides controlled vocabularies and classifications over the genes from a wide range of organisms including humans. The Gene Ontology comprises sub-ontologies molecular function, cellular component and biological process, and is widely used by biologists.

[Figure 2.1 shows a fragment of the GO biological process ontology as a directed acyclic graph, from the root term “biological process” through terms such as “metabolic process”, “RNA processing”, “RNA splicing” and “mRNA processing” down to “nuclear mRNA splicing, via spliceosome”, with edges labelled “Is a” and “Part of”.]

Figure 2.1: Annotation for a yeast protein from the GO biological process ontology.

An example of an application of an informal domain ontology is the annotation curated from GO for the yeast “Sm-like” protein (LSM2) in Figure 2.1. This example demonstrates that the protein (not shown) is an instance of the class “mRNA splicing via the spliceosome”, which is a subclass of “mRNA processing”. Directed “Is a” or “Part of” relations from subclasses to parent classes hold when the subclasses (individual terms) have specific properties in addition to those that hold for the parent.

There is a difference in terminology between approaches deriving from philosophy and those, such as bioinformatics, that are based on computer science. In both areas, “ontology” refers to the entire discipline [Smi03], but in the latter it can also be used in the singular to refer to a particular ontology, e.g., [Ash00] or in the plural to a collection of ontologies, e.g., [BR04]. In this thesis we use the terminology from computer science.

2.1.1 Ontology applications in bioinformatics

Bioinformatics is the study of information content and information flow in biological systems and processes [Lie95]. With the overwhelming explosion of biological data fueled by high-throughput technologies, bioinformatics has grown rapidly in the past two decades in an attempt to supply software tools and applications for working biologists, e.g., in the analysis of large volumes of data gathered by genome sequencing projects.

Shah and Musen [SM09] note that “the biomedical community is perhaps the farthest along in recognizing the need and starting an organized effort for the creation of ontologies” for the structured formalization of scientific knowledge. They identify three levels of ontologies in biomedicine. First is the controlled vocabulary which is a list of terms with defined meanings. The most widely-used of these is the Gene Ontology [Ash00]. The second is the information model such as used to specify data models for repositories of microarray data and so on. The third is an ontology as a formal knowledge representation defining the concepts and relationships that exist in a domain. This is currently the least developed of the three.

The predominant application of ontology in bioinformatics is in interpreting gene expression data, where ontology is regarded as a common reference source (controlled vocabulary) that enables querying, clustering and further analysis of the data. For example, the Gene Ontology (GO) facilitates interoperability between heterogeneous databases and community resources.

Ontology as a “sharable conceptualization” is reflected in the second prevalent application area (information model). An example of this kind is the Microarray Gene Expression Object Model (MAGE-OM – see www.mged.org) that uses an ontology to describe different microarray experimental setups, as well as systems for data management, storage and mining. Thereby ontology becomes an integral part of bioinformatics applications to enable information sharing as a source from which to draw meaningful inferences by application of standards-compliant tools.

2.1.2 Ontology engineering in bioinformatics

Two main problems in ontology engineering in bioinformatics are addressed in this thesis. The first is: how can we use ontology annotation for large, heterogeneous data sets with existing techniques for knowledge discovery for biologists? The second is: how can we use approaches from ontologies and machine learning to improve such knowledge discovery techniques?

Datasets in modern biology are typically characterized by huge numbers of records with very large numbers of attributes. Moreover, “dirty” biological databases containing experimental errors, human errors, standardization errors, etc., make learning from these datasets a great challenge. Machine learning, one of the key areas of artificial intelligence, offers various techniques to classify efficiently and predict accurately in such environments. Examples include cancer prediction, gene regulatory networks, protein structures and functions, protein interactions, etc. [ZR09].

The predictive accuracy and information contributed by machine-learned models are highly desirable but are limited when biologists need to infer knowledge about the application domain. We realize that knowledge about the application domain is required so that different techniques can be applied equally, and so that their results can be compared using the same conceptualizations of the domain. This is complicated, since the properties of complete systems, such as the networks of biological activities in a cell, are organized using highly structured conceptual knowledge.

For example, in a genomic application, genes may be described at many different levels, as in Figure 2.1. A coarse description may require only more general levels of description (“mRNA processing”), whereas a more refined description may need more specific terms (“mRNA splicing via the spliceosome”). The availability of a hierarchically structured taxonomy or ontology, which contains concept names and the relations between them, can provide machine-learning methods with such a source of multi-level descriptions. This is the basis of our work in Chapter 3.

However, a number of difficulties may exist with applying such ontologies. Since currently they are created and maintained by human curators, this may entail considerable ongoing effort in editing the ontology. Typical operations required are adding and deleting concepts, and editing the relations between concepts. This may lead to errors, with incorrect or missing concepts or relations in the ontology [Mar03].

There may also be problems with the use of ontology for annotating objects of interest in a domain. We have observed this, for example, with the use of GO, where different authors annotate the same gene with different GO terms, one of which is a specialization of the other. The general problem here is redundancy. For example, we have observed the following kind of redundancy in annotations using the Gene Ontology. Say a and b are concepts (such as “mRNA binding” or “pre-mRNA splicing factor activity”) and x is a gene. Then we can have the following relations: (i) x ⊂ a, (ii) x ⊂ b and (iii) b ⊂ a. Evidently (i) is redundant here, since it can be inferred from (ii) and (iii), where “⊂” represents “is-a sub-category of”. This is discussed in Chapter 3.
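This inference can be made concrete with a small sketch. The term names below are simplified stand-ins for GO identifiers, and the is-a table is a toy fragment, not real GO data.

```python
# Detecting redundant annotations: annotating gene x with term a is
# redundant if x is also annotated with some b where b is-a ... is-a a.

is_a = {  # child term -> parent terms (toy fragment)
    "pre_mRNA_splicing_factor_activity": {"mRNA_binding"},
    "mRNA_binding": {"RNA_binding"},
}

def ancestors(term):
    """All terms reachable from term via is-a edges (transitive closure)."""
    seen, stack = set(), [term]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def redundant(annotations):
    """Annotations already implied by a more specific co-annotation."""
    return {a for a in annotations
            if any(a in ancestors(b) for b in annotations if b != a)}

gene_x = {"mRNA_binding", "pre_mRNA_splicing_factor_activity"}
print(redundant(gene_x))  # -> {'mRNA_binding'}, i.e. relation (i) above
```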

The existence of such problems has shown a need for software tools to support the construction of ontologies. This has led, for example, to the development of ontology editors [GMF+03]. The use of an editor can help with avoiding certain problems in ontology construction. The onus, however, is still on the user to provide most of the concepts and their structures. In the case of domain ontologies in particular this may be quite a demanding task, since the domain may be very specialized, and the structures not well defined or partly unknown. What may be available in such cases, however, is large amounts of data. Although this will typically only be described with low-level properties and relations, additional background knowledge gives the basis for the definition of ontological structure. Aspects of this work are addressed in Chapter 4.

Since building such ontologies by hand is a resource-intensive and costly task, data-driven ontology learning, which processes and represents the structured concepts latent in any kind of information, including biological databases, seems more logical. However, for systems biology applications with large data sets on the complex behaviours shown by cells, fully automated ontology learning is likely to be a difficult task. A better approach is initially to study methods of structuring the data, and then to apply machine learning techniques to model the components of these structures. We adopt this approach in Chapters 5 and 6.

With particular reference to domain ontologies, machine learning can be used to search for those structures in the data. A number of authors have studied ontology learning [CMSV09]. Bain has previously demonstrated an approach which can successfully discover potentially interesting and useful ontologies in attribute-value (i.e., propositional) representations [Bai03a]. Following this earlier work, Bain has laid the foundations for using these techniques in the much more expressive representation of first-order logic [Bai04]. This is the starting point to significantly extend the approach to ontology learning with a new method described in Chapter 7.

2.2 Systems biology

In this section we review some of the biological basis of the applications discussed in the remainder of the thesis, with the aim of making the material in the thesis as self-contained as possible. We first describe the current rapid accumulation of biological data following research to determine the complete set of genes for a wide range of organisms. We then outline the goal of systems biology and the requirement for other sources of related data.

2.2.1 Post-genome biology

The term “genome” was first introduced in the 1920s to refer to the complete set of genes in the cells of an organism. However, the era of genomics could be said to begin with Sanger’s first complete sequence of an organism, the virus (bacteriophage) φX174, in 1977. “Genomics” as a term describes the scientific discipline of mapping, sequencing and analyzing a genome, i.e., an organism’s complete set of genes. The genomes from a wide range of organisms have now been obtained, with the most well-known being the Human Genome Project [L+01, VAM+01]. As the number and extent of genome projects increases, so does the data generated by them. We are now in the era of “big biology” [Bak13].

With the increasing availability of technology to identify the complete sequence of DNA in a genome at relatively low cost, researchers have now turned to identify the mechanisms by which the genome, in a sense, determines the structure and function of cells. This requires obtaining other types of “omics” data, e.g., transcriptomics, proteomics, metabolomics, and so on [GLJ+01]. The suffix “omics” has now become widely applied; it is used to indicate completeness of the data for a particular area of biology: for example, transcriptomics aims to determine all gene expression behaviours for an organism whose genome is known. This has led to the current post-genome [Kan00] or “omics” era in biology [Qua06]. This will be discussed further below in Section 2.2.2 on systems biology.

However, to date most work involving machine learning and ontologies has been in the area of functional genomics. Functional genomics aims to characterize the cell at the DNA level to explain the behaviors of its biological systems. This is increasingly done by investigating all the genes, or proteins, at once in a systematic high-throughput approach, e.g., to study differences in gene expression at the whole-genome level. We now outline some basic aspects of the biology of functional genomics, where our treatment is based on material from the book by Hunt and Livesey [HL00].

DNA

Figure 2.2: DNA structure: the regular structure encoded in the classic “double helix” enables the bioinformatics analysis of nucleic acid and amino acid sequences.

Deoxy-ribonucleic acid (DNA) is the cellular material that carries the hereditary or genetic information that is passed on through generations of a species, found in the genomes of all living cells, and some viruses. This genetic information determines the role of different cells in multi-cellular organisms, containing the instructions that differentiate, say, heart cells from immune system cells, and, in single-cell organisms such as the baker’s and brewer’s yeast Saccharomyces cerevisiae, all the mechanisms to survive and reproduce.

DNA is a double-stranded helical molecule where each strand has four nucleotides — adenine (A), guanine (G), cytosine (C), and thymine (T) — which form complementary base pairs, with a backbone made of deoxyribose sugars and phosphates. The complementary strands of DNA are bound together by hydrogen bonds between the base pairs adenine with thymine and guanine with cytosine, as shown in Figure 2.2. The resultant molecule can be extremely complex, for example DNA in the human haploid genome is made up of approximately 3 × 10⁹ base pairs.

Genetic information is stored in the sequence of “letters” from the four nucleotide “alphabet” in a DNA molecule, and from this sequence a second alphabet, that of the amino acids, is used by cells to manufacture corresponding sequences that form proteins. Each amino acid position in a protein is identified by a corresponding sequence of three bases in a DNA molecule, referred to as a codon. Particular regions of DNA responsible for specific RNA coding are called exons, and between exons there are non-coding regions called introns (see Figure 2.3).

Even though either of the two strands of DNA can carry the sequence information, only one strand encodes a particular gene. The complementary strand is further used to transcribe more genes and for replication of the molecule. During the process of transcription, DNA is copied in one chemical direction, known as 5′ to 3′. The region towards the 5′ end, or upstream, of the gene is called the promoter region, which is involved in the control or regulation of transcription.

Physically, DNA is located as a component of the sub-cellular structures named chromosomes, residing in the nucleus in eukaryotic cells. Smaller amounts also appear in other organelles such as mitochondria or chloroplasts.

RNA and mRNA

Similar to DNA, ribonucleic acid (RNA) has four nucleotides in a chain. However, RNA has a significantly different structure from DNA: uracil replaces thymine and, instead of double strands with a deoxyribose backbone, RNA consists of a single strand with a ribose backbone. Biologically, RNA molecules are produced as a result of gene transcription from one of the two strands of a DNA molecule and can be of three different types: mRNA (messenger RNA), tRNA (transfer RNA) or rRNA (ribosomal RNA).

Messenger RNA is a chemically unstable molecule that is copied from one strand of DNA to carry the sequence information. Before translation, mRNA is edited by elimination of introns via RNA splicing, and only exons are transported to the ribosome (see Figure 2.3). The information in mRNA is organized as a series of codons, each containing three bases and translated into a particular protein by transfer RNA (tRNA). Because of the introns and the splicing mechanism, alternative mRNA transcripts are possible, and thereby multiple proteins can be manufactured from a single gene.

Genes

The information encoded in the sub-sequence of a DNA molecule required to specify a single protein molecule or amino acid sequence is called a gene. A gene is split by non-coding regions (introns) and coding regions (exons). The exons that encode proteins define an open reading frame (ORF), i.e., a sequence of DNA marked by start and stop codons. Thus each exon can carry a portion of an ORF responsible for a specific part of the protein sequence. The introns contain the sequences to determine the state of the gene or when that gene is expressed. Interestingly, most of the DNA in known genomes does not encode genes corresponding to proteins; less than 2% of DNA constitutes the approximately 20,000 genes in the human genome, although around 80% of the remainder is judged to have some functional role. Historically, genes were identified before DNA as responsible for the transfer of discrete traits from one generation to the next. The genetic controls of different traits are normally maintained in a repressed state and only characterized when expressed in various cells. The key factor in such selective gene expression is in the proteins that bind to the promoter region of a gene and control its activity to increase or decrease expression.
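As a simple illustration of the ORF idea, the toy scan below finds a start codon and reads in-frame to the first stop codon; it deliberately ignores the complementary strand, alternative frames and splicing.

```python
# Toy open-reading-frame scan on a coding-strand DNA string.
START, STOPS = "ATG", {"TAA", "TAG", "TGA"}

def find_orf(seq):
    start = seq.find(START)          # first start codon
    if start == -1:
        return None
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:    # first in-frame stop codon
            return seq[start:i + 3]  # ORF including the stop codon
    return None

print(find_orf("CCATGAAATTTGGGTAACC"))  # -> 'ATGAAATTTGGGTAA'
```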

Gene expression

Figure 2.3: Gene expression – see text for details.

In most organisms, DNA stores all the genomic information required for the cell to operate. Functional genomics is the study of the dynamic aspects of gene expression — how the stored information is processed and transmitted to implement the functions and reproduction of the cell. Genetic information in DNA is transmitted by the cell when a portion of DNA is copied into RNA and processed to produce proteins. This is referred to as “the central dogma of molecular biology” and is now known to be an incomplete characterisation, although it is still widely applied in thinking about systems biology. The information flow can be summarized as:

DNA → RNA → Protein

In Figure 2.3, gene expression has transcription as its first step. Here, one strand of the DNA double helix serves as a template for the formation of RNA, and complementary bases in DNA are lined up to be copied, the template being read in the direction 3′ to 5′.

In post-transcriptional processing, pre-mRNA is edited to contain only coding sections (exons) to form a mature mRNA transcript. The mRNA is then transported to the cytoplasm (in eukaryotes) where it is bound to ribosomes for protein synthesis.

The tRNA attaches specific amino acids to the mRNA to form complete polypeptide chains, i.e., proteins. Thus the genetic information in DNA is translated into proteins. Correctness of the translation is important as any variation in nucleotide sequence, such as a point mutation in a coding region (where a nucleotide in the sequence undergoes a change to another nucleotide) can result in an alteration in the amino acid sequence, or a mutation in an intronic region may result in altered expression or splicing of genes.
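The transcription and translation steps above can be illustrated in miniature. The codon table below is deliberately truncated to the four codons used, and directionality conventions are glossed over; this is a toy, not a bioinformatics tool.

```python
# Toy illustration of the information flow DNA -> mRNA -> protein.

COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}  # template base -> RNA base

CODON_TABLE = {"AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UAA": "STOP"}  # partial

def transcribe(template):
    """Copy a DNA template strand into complementary mRNA."""
    return "".join(COMPLEMENT[base] for base in template)

def translate(mrna):
    """Read mRNA three bases (one codon) at a time into amino acids."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

mrna = transcribe("TACAAACCGATT")
print(mrna, translate(mrna))  # AUGUUUGGCUAA ['Met', 'Phe', 'Gly']
```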

Regulation of gene expression

Bioinformatics applications are often based on the concept of understanding the mechanisms that regulate gene expression and their role in determining the function of genes.

Gene expression requires a series of steps, including mRNA processing, protein translation and post-translational modifications, all of which must be actively regulated. Studying this requires measuring the amounts of RNA or proteins in a particular cell or tissue type under a set of physiological conditions. This expression is controlled at the promoter region of the DNA for the gene in question, which is bound by regulatory proteins called transcription factors, and also by protein-protein interactions.

Thus we can see the importance of the genomics level, for example, the potential for mutation or other changes in the function of gene and protein expression to lead to phenotypes that have implications in genetic and other diseases. Chapter 2. Background 18

2.2.2 From genomics to systems biology

Initial methods of determining biological functions focused on characterizing genomes, i.e., physically mapping and sequencing the DNA of different organisms, identifying gene locations and their organization, building databases of genes by type, etc. Comparative analysis of similar gene regions provides significant insight into gene function, associated genetic diseases and new drug discovery. Additionally, genome sequencing can contribute indirect information on the post-translational changes that form the final functional protein, and on the evolutionary context of different species [Mus07].

Following in the footsteps of sequencing, investigation leads to a finer level that identifies patterns of gene expression and protein abundance under different conditions. With the added stimulus of technical advancement, e.g., gene expression profiling or microarray analysis, scientists are now able to study thousands of genes simultaneously in a single RNA sample. As active genes are being transcribed into mRNA, the set of mRNA molecules present in a particular tissue under a set of physiological conditions indicates which subset of genes is active. Thus one can infer not only the possible protein formations, their functions and cell types, but also dynamic information such as gene regulation in response to different environmental stresses or a particular disease state. There are a variety of methods for measuring mRNA levels, or the transcriptome, including spotted microarrays (“gene chips”), serial analysis of gene expression (SAGE) [VZVK95], total gene expression analysis (TOGA) [SFE+00], and more recently the use of high-throughput sequencing (RNA-Seq).

The final stage of genomics concentrates on the larger role of discovering the mechanisms of gene expression regulation in biochemical pathways and networks. The key factors that regulate gene expression depend on DNA binding sites that bind regulatory proteins, and on protein-protein interactions. Techniques such as the gel shift assay [GR81], footprinting [HRBHF07] and ChIP-chip [L+02] are commonly used to identify transcription factor binding sites, and a range of techniques such as two-hybrid screening [FS89] are used to study protein-protein interactions. In general, since “mRNA transcription does not accurately represent the concentration of their proteins” ([AA05], [RGK+08]), this has led to the growth of “proteomics”. Several investigations showed that the correlation of mRNA levels with their protein levels varies with different experimental approaches ([GRFA99], [FLM+99], [ALI+00]).

Furthermore, post-translational modification and the other various stages of protein synthesis are hard to predict. Therefore more reliable and accurate quantification requires identifying proteins and their peptide complexes in a more direct fashion. Much of the progress has involved biotechnological developments, with less emphasis on computational and statistical approaches. Commonly used high-throughput techniques for characterizing proteins and detecting protein-protein interactions are two-dimensional gel electrophoresis (2-DE) [Lin99], mass spectrometry (MS) ([FMM+89], [ZKF+99]) and protein microarrays [FLM+99]. See Table 2.1.

"The reductionist approach has successfully identified most of the components and many of the interactions but, unfortunately, offers no convincing concepts or methods to understand how system properties emerge . . . the pluralism of causes and effects in biological networks is better addressed by observing, through quantitative measures, multiple components simultaneously and by rigorous data integration with mathematical models" [SHZ07].

Thus a new field has emerged, termed "systems biology": a holistic approach that is based on systematically perturbing the biological workings of components of an organism [IGH01] and relating the observed effects to a computational or mathematical model. The proposed methodology of systems biology [IGH01] is:

1. Define the components (“parts list”) and model the interactions of the system

• identify all genes, proteins, reactions, pathways, compartments, etc.
• construct an initial executable, predictive model
• run the model to simulate the system and generate testable hypotheses

2. Systematically perturb the real system and monitor results

• use experimental manipulation and record the global system response
• use high-throughput (genome-wide) assay biotechnology

3. Reconcile the experimental data with output from model simulations

• how well does the model fit the data?
• revise, extend and re-evaluate the model, i.e., go back to step 2 or 1 as required and iterate

This thesis aims to conform to this methodology, although our focus is on the use of ontology, machine learning and visualization to achieve greater automation of the process.

According to O'Malley and Dupré [OD05], "Systems biology involves the study of interacting molecular phenomena through the integration of multilevel data and models". Such comprehensive integrative analysis requires not only the study of genomics and proteomics, but also information from the metabolome and other "omics" experiments to explain their interrelationships in complex regulatory networks. It has been known for some time that the integration of multiple "omics" data sets can show evidence of coordinated biological systems. Raamsdonk et al. [RTB+01] demonstrated that genotypes and metabolic profiles from yeast deletion mutants are strongly correlated and can uncover the function of unknown genes. Similar works [Fie02, OWKB98, ADB+03] also emphasise the importance of such integrative studies to gain understanding of cellular components in order to predict phenotypic behavior, derive information on pathways and regulatory networks, etc. More recently, evidence from proteomics has shown how the precise timing of proteins activating and deactivating other proteins in signalling networks adds a complex layer of control of cellular processes [Gun10].

A selection of biological techniques used in systems biology is in Table 2.1. Some of the key challenges identified are: integrating functional genomics data with proteomics; and applying mathematical structures to organize data that can facilitate navigation from the abstract to specific details. The most common approach is based on network representations, where the combinations of genes, proteins and other components are represented as a graph. For example, the protein-protein interaction network in Figure 2.4 has nodes corresponding to the proteins and edges to the binding relationships among them. Our research interest lies in applying an integrated approach based on "systems biology" data to investigate the cellular response of genes using standard machine learning tools, with ontologies used to set the level of generality.

Table 2.1: Some of the main experimental techniques used for systems biology. These are broadly categorised into “omics” classes. Each technique is described in terms of the type of data produced, the experimental methods used and a brief description of the technique with a sample reference to the literature.

Data | Methods | Description

Transcriptomics
mRNA transcription | Spotted cDNA microarray | Experiments typically involve hybridising two mRNA samples where each is converted into cDNA and labelled on a single glass slide, spotted with thousands of cDNA probes [SSDB95].
mRNA transcription | High density microarray (GeneChip) | High-density arrays contain tens of thousands of synthetic oligonucleotides, which are usually hybridised with only one sample, thus providing direct information about the expression levels of an mRNA [LBW+04].
mRNA transcription | Serial Analysis of Gene Expression (SAGE) | Allows analysis of overall gene expression patterns; commonly used for quantitative study of new genes as well as known genes [VZVK95].
mRNA transcription | Differential display | Requires a relatively small amount of RNA and is able to simultaneously detect multiple differences (up or down regulation) in gene expression [LP93].
mRNA transcription | Gene expression fingerprinting (GEF) | Can identify new genes, as well as analyse their exact level of mRNA expression in different mature cell types [IB95].
mRNA transcription | Northern blot analysis | A simple method to study expression of one gene at a time [AKS77].
Protein-DNA interactions | ChIP-chip | Combines chromatin immunoprecipitation (ChIP) with microarray technology (chip) to identify potential transcription factor binding sites at promoter regions [L+02].

Proteomics
Protein expression | 2D-gel electrophoresis mass spectrometry (MS/LC-MS) | Strategy to characterize the proteome of a cell by separating proteins from each other followed by separation according to mass [O'F75, Lee01].
Protein-protein interactions | Yeast two-hybrid screening | Identifies protein interactions by splitting a transcription factor into a binding domain and an activating domain, followed by fusion to a cDNA library to produce a hybrid protein. Function is restored if the two hybrid proteins physically interact [FS89].
Protein-protein interactions | Tandem-affinity purification (TAP) with mass spectrometry | Identifies protein interactions by combining TAP-tagging with mass spectrometry; designed to detect stable binding of proteins in complexes [A-C02].
Protein-protein interactions | Fluorescence resonance energy transfer (FRET) | Detects an interaction between electronically excited states of two dye molecules in terms of the extent of energy transfer between them [Ken01].
Protein-protein interactions, protein-DNA interactions, protein activity | Protein arrays/protein chips | Protein arrays contain full-length functional proteins or domains to enable study of the biochemical activities of an entire proteome in a single experiment [LHE+99].

Metabolomics
Cellular metabolite concentration | Metabolic fingerprinting (internal metabolites) | Method to measure tissue samples directly to generate information on the intracellular metabolome under a given set of physiological conditions [Fie02].
Cellular metabolites | Metabolic footprinting (exometabolome) | A non-invasive approach to measure extracellular metabolites by monitoring patterns of excreted metabolites while strains are subjected to sub-lethal concentrations of growth inhibitors [OWKB98].

Phenomics (phenotypes)
Deletion (mutant) phenotype (Saccharomyces Genome Deletion Project; EUROFAN) | Gene deletion/disruption | This method studies functions of proteins encoded by uncharacterized ORFs by deleting ORFs from the genome one at a time and measuring the expression profile for each mutant strain under a set of different growth conditions [WSA+99, Oli96].

Figure 2.4: Network showing yeast protein-protein interactions [BB03]. Red nodes denote “hub” proteins, essential for survival, that tend to be more densely connected. See text for details.

Figure 2.4 represents a network map of yeast protein-protein interactions obtained by two techniques, yeast two-hybrid and protein complexes identified by affinity purification and mass spectrometry [BB03, ZGS07]. In this network the nodes are coloured according to the phenotype of the yeast cell when the corresponding gene is deleted (the mutant, or gene deletant, phenotype). Red nodes denote a lethal phenotype (removal of the gene will cause the cell to die); orange nodes denote some importance (gene deletion will cause slow growth); and green and yellow denote proteins of non-lethal or unknown phenotypes, respectively. The finding that connectivity characteristics alone can be predictive of protein function, in particular, that "hub" or highly linked proteins appear to be critical for cell survival, led to the characterization of the global distribution of node degree as following a power-law [BB03].
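To make the notion of node degree concrete, a degree distribution can be computed directly from an interaction edge list, as in the short sketch below. The protein names and edges are invented for illustration; for a power-law network, the printed fractions P(k) would fall off roughly as k^-gamma.

    from collections import Counter

    # Hypothetical protein-protein interaction edges (illustration only).
    edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"),
             ("P3", "P4"), ("P3", "P5"), ("P3", "P6")]

    degree = Counter()                  # node -> number of incident edges
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1

    n = len(degree)
    for k, count in sorted(Counter(degree.values()).items()):
        print(k, count / n)             # P(k): fraction of nodes of degree k

In this toy example P3 plays the role of a hub; in the yeast network it is such highly connected nodes that tend to be essential for survival.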

2.2.3 Bioinformatics databases

Bioinformatics applications for statistical and computational analyses play a key role in understanding genes, their associated proteins, the roles they play in various biochemical processes in cell biology and, ultimately, in helping to relate them to properties such as cell growth, disease, and so on.

High-throughput biological experiments containing substantial genomic information, fueled by the requirement for searching and data mining, have resulted in vast web-accessible databases. Bioinformatics database sites that organize and store sets of curated data, maintained in a consistent state, nowadays form an indispensable resource for biologists. In particular, such databases are now a key resource for scientific knowledge on model organisms such as the yeast Saccharomyces cerevisiae, hosted at the Saccharomyces Genome Database (SGD – www.yeastgenome.org/).

The 2013 Nucleic Acids Research Database Issue [FSG13] lists 1512 databases in molecular and cell biology. A comprehensive overview is out of our scope, so only well-known resources with a bioinformatics perspective are discussed in brief. In the beginning of the bioinformatics era, around 1980, data was mainly understood to be DNA and protein sequences, gathered from various sequencing projects. GenBank, the EMBL (European Molecular Biology Laboratory) nucleotide sequence database, DDBJ (DNA Data Bank of Japan), PIR (Protein Information Resource) and SWISS-PROT are well-known databases storing such data.

A wide range of technological advancements in DNA microarrays, mass spectroscopy and yeast two-hybrid systems generated specialized data sets such as expressed sequence tags (ESTs), single-nucleotide polymorphisms (SNPs), etc. TGI (TIGR Genome Indices), dbEST (Database of Expressed Sequence Tags) and dbSNP (Database of Single Nucleotide Polymorphisms) are example databases for such data.

Proteomics studies required storing not only molecular data but also structures. ExPASy (Expert Protein Analysis System) is one such repository. Other 3D structural databases are PDB (Protein Data Bank) and EBI-MSD (Macromolecular Structure Database at EBI). In addition, various knowledge sources inferred from these experiments are archived in different databases. Examples include protein interactions in BioGRID (General Repository for Interaction Datasets) and metabolic pathways in KEGG (Kyoto Encyclopedia of Genes and Genomes). KEGG is also referred to as a systems biology database, as it infers the systematic behaviors or complex cellular processes (metabolism, signal transduction, cell cycle) of cells or organisms based on a model of gene or protein similarity. This provides a means to infer knowledge from organism similarity to generate corresponding networks of interacting molecules and graphical displays of pathway diagrams.

Some databases are model-organism databases, i.e., information is archived for individual species widely used in laboratories. FlyBase (the Drosophila genetic and molecular database), SGD (the Saccharomyces Genome Database) and EcoCyc (for E. coli K-12) are representative. Some of these databases provide interfaces to query details not only on individual organisms but also to compare and integrate information on a genome-wide scale, to find genes with similar features within or across different organisms. For example, SGD stores nucleotide sequences for S. cerevisiae, as well as genes, their products, phenotype characteristics and supporting literature. An important interface on this site is the "fungal alignment viewer", which displays information on S. cerevisiae protein sequences. For a particular entry for a gene (ORF) name, the interface gives rich summary information, e.g., similar sequences across fungal genomes, a dendrogram of relationships between the sequences, colored highlighting for the degree of sequence conservation, etc. [CWB+04]. Sites like SGD have invested heavily in visualization and high quality user interfaces for browsing and searching.

Collections of published genomics articles are recorded in bibliographic databases, and links to them are provided by all major databases. MEDLINE has recorded bibliographic citations and abstracts from about 5,000 biomedical journals. PubMed contains biomedical literature citations with links to molecular resources at the NCBI. Data and knowledge from the published literature play a critical role, especially in pathway databases (KEGG, BioCyc), as a source of reference data for annotating genomes and phenotype screens, etc. [KB03]. Selected bioinformatics databases are listed in Table 2.2.

Table 2.2: Selected list of Bioinformatics Databases (websites accessed on 11th May, 2013)

Name | Description | Size

Non-organism-specific nucleotide databases
GenBank | Contains DNA sequences, their protein translations, phylogenetic classification and references to published literature | 185+ million sequence entries comprising 281+ billion nucleotides from 260,000 species
Website: http://www.ncbi.nlm.nih.gov/genbank
EMBL (European Molecular Biology Laboratory nucleotide sequence database) | Collection of DNA and RNA sequences from the scientific literature, patent applications, and direct submissions from researchers | 185+ million sequence entries comprising 281+ billion nucleotides from 260,000 species
Website: http://www.ebi.ac.uk/ena/home
DDBJ (DNA Data Bank of Japan) | Contains nucleotide sequence data | 185+ million sequence entries comprising 281+ billion nucleotides from 260,000 species
Website: http://www.ddbj.nig.ac.jp/index-e.html
TGI (TIGR Genome Indices) | Contains DNA and protein sequence, gene expression, cellular role, protein family and taxonomic data | includes 57 animals, 60 plants, 10 fungi and 15 protists
Website: http://compbio.dfci.harvard.edu/tgi/
UniProtKB/Swiss-Prot | Contains protein sequences, their functions, domain structures, post-translational modifications, variants, etc. | 12,980 species, 540,052 sequence entries, comprising 191+ million amino acids
Website: http://www.expasy.org/

Model organism databases
SGD (Saccharomyces Genome Database) | Contains Saccharomyces cerevisiae genes, ontology terms, sequences, protein domains, expression data, mutant phenotypes, physical and genetic interactions, etc. | 6,607 ORFs, 14,840 Ontology terms
Website: http://www.yeastgenome.org/
FlyBase | Contains Drosophila species genomic maps, gene products, phenotypes, genetic interactions, expression patterns, etc. | 1.1+ million genes from 12 Drosophila species
Website: http://flybase.org/

Miscellaneous databases
KEGG (Kyoto Encyclopedia of Genes and Genomes) | Integrated systems biology database of metabolic pathways, ontology terms, genetic diseases, drugs, genome maps and more | 442 pathways, 140 functional hierarchies, 568 pathway modules
Website: http://www.genome.jp/kegg/
GO (Gene Ontology consortium database) | Contains ontology terms to describe gene product characteristics and their functional annotations | 39,311 ontology terms: 25,505 biological processes, 3,310 cellular components, 10,452 molecular functions
Website: http://www.geneontology.org/

2.2.4 Annotations and ontologies in functional genomics

The goal of any genome sequencing project is to annotate as many genes as possible into putative functional families and thus to predict their functions. Homology-based (sequence similarity) approaches established the foundation for predicting gene function, where a newly discovered gene is annotated by aligning its sequence to similar well-studied gene sequences. However, characterizing genes solely based on annotations of molecular function is prone to various errors [BDDL+98, DV01, PPK05] and is often insufficient to describe or predict biological functions. Different methods can fail in different ways: clustering of genes may incorrectly assign homologs as functionally related, or unrelated proteins as homologs; clustering of orthologous genes [TGNK00] may find homologs but with incorrectly assigned orthologs (evolved from common ancestral genes in different species) and paralogs (evolved from common ancestral genes in the same species); and finally, although phylogenetic classification of genes may correctly identify homologs, prediction of protein function may still be incorrect due to errors in sorting homologs into orthologs and paralogs [Mus07]. Moreover, homology studies indicate that there is no rigorous definition of biological function for each particular protein, hence it may be impossible to distinguish whether two proteins have the "same" or "similar" functions.

Several classification schemes have been proposed to address these issues. An early classification scheme by Riley [RL97] extends gene search by physiological roles in GenProtEC (the E. coli genome and proteome database), where networks of interacting genes and proteins are categorized into eight distinct categories of gene types (enzymes, transport elements, regulations, membranes, structural elements, protein factors, leader peptides and carriers), and gene products into 118 hierarchically arranged functional categories. The Enzyme Commission [TB00] applied numerical hierarchical classifications of enzymes based on the biochemical reactions they catalyze. Combined studies of sequence-related pairs of E. coli genes from the two systems by Riley et al. [Ril98] suggested that sequence similarity is often a good indicator of biochemically related functions. However, enzymes performing the same biochemical tasks may arise by duplication or divergence of genes in ancestral genomes with a different evolutionary history.

Functional classification based on biological processes was investigated by Ouzounis [AOS+99], where the three main protein categories are energy (various biosynthesis, metabolism, phospholipids, transport), communication (cell processes, cell envelope, regulatory functions) and information (translation, transcription, replication). According to this hierarchical scheme, energy-related proteins are represented in all three domains of life (Bacteria, Eukarya and Archaea), communication-related functions tend to be the most distinctive for each of the three domains, and information-related proteins have a more complex distribution.

The most well-known collaborative project of this type is the Gene Ontology (GO) project [Ash00], which adopts a systems-level approach where all biological objects and relations are refined hierarchically into more specific components, processes and functions. Key roles for the application of GO are characterizing unknown homologous genes based on GO categories, facilitating communication among databases by providing a common platform, and providing structured controlled vocabularies (ontologies) to annotate genes. Thus, GO can be used as a tool for automated inference such as statistical analysis or machine learning.

The Gene Ontology is represented as a directed acyclic graph (DAG) where vertices or nodes have a unique ID of the form "GO:N", where N is a natural number, with text characterizing the biological properties of the corresponding GO term, and edges represent relationships among the GO terms, either of type "is a" or "part of". Currently, the Gene Ontology provides annotations and query access for many different model organism databases, the main ones of which are the Saccharomyces Genome Database (SGD), FlyBase (Drosophila) and the Mouse Genome Database (MGD). As of ontology version 1.3499, dated September 19, 2012, GO contained 38137 terms: 23928 to describe biological process, 3050 terms for cellular component, and 9467 for molecular function, with as many as 18 levels within the hierarchy of terms. Even though the Gene Ontology is the most widely used ontology resource in biology, with the growing complexity of the knowledge domain it also suffers issues such as interoperability with other species-specific annotation schemes, limited relationship types ("is a", "part of") that sometimes do not correctly define the relation between two categories, and so on.
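As an illustration of how this graph structure supports automated inference: annotations propagate upwards, so a gene annotated with a term is implicitly annotated with all of that term's ancestors. The following minimal sketch shows this with hypothetical term IDs and edges (illustrative only, not real GO content):

    # Parent links of a toy GO-like DAG; edges are typed "is_a" or "part_of".
    # The term IDs and relationships below are hypothetical.
    parents = {
        "GO:0000002": [("GO:0000001", "is_a")],
        "GO:0000003": [("GO:0000001", "part_of"), ("GO:0000002", "is_a")],
    }

    def ancestors(term):
        """All terms reachable from 'term' by following parent edges."""
        seen, stack = set(), [term]
        while stack:
            for parent, _rel in parents.get(stack.pop(), []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    # A gene annotated with GO:0000003 inherits annotations to both ancestors.
    print(sorted(ancestors("GO:0000003")))   # ['GO:0000001', 'GO:0000002']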

2.3 Machine learning

Learning, in philosophy, is defined as "to acquire, or to gain, knowledge of a subject or skill through education or experience". Given an explicit representation of knowledge, the learning ability of a machine is understood as its autonomous capacity to acquire and integrate knowledge from experience, analytic observation or in some manner that can enhance knowledge or refine skill for future performance. Learning for enhancing knowledge has an emphasis on acquiring knowledge of domain concepts and their relations to each other, where the decision structure is inferred from data. This form of learning is known as knowledge acquisition, or informally "learning what". In contrast, refining skills is based on repetition or practice that gradually improves motor or cognitive skills; this is "learning how".

Since our research interest lies in the theories and methodologies of various machine learning approaches for exploring, analyzing and representing biological systems, it clearly belongs to the realm of knowledge acquisition. Successful applications of such machine learning from as early as the 1950s include game-playing, credit card fraud detection, speech recognition, etc., where methods are often based on a multidisciplinary convergence including statistics, artificial intelligence, information theory, control theory, computational complexity theory, philosophy and other diverse fields [Mit97]. This section reviews various machine learning algorithms under different paradigms, followed by techniques from Formal Concept Analysis oriented towards ontology applications in a learning setting. Since machine learning is by now a very large area [Alp10], in this section we restrict attention to methods relevant to the thesis.

In the absence of a precise formal definition of machine learning, perhaps a reasonable one is "a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance on tasks in T, as measured by P, improves with experience E" [Mit97]. Thus careful consideration is required in the selection of training experience, direct or indirect feedback as a performance measure, specification of the target function, a representation language for that function and, finally, an efficient learning algorithm to approximate the target function from training examples. Therefore, the problem of improving performance is reduced to the problem of approximating a particular target function (see Figure 2.5).


Figure 2.5: A generic machine learning system: the central mechanism shown in the blue box is a hypothesis generation algorithm, where each hypothesis hk is an approximation of the target function X → Y using some method of fitting hypotheses to the data ⟨xi, yi⟩, 1 ≤ i ≤ n.
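Figure 2.5's view of learning as target-function approximation can be made concrete with a deliberately tiny sketch: an explicit hypothesis space is enumerated and the hypothesis with the lowest training error on the pairs ⟨xi, yi⟩ is selected. The data and hypothesis space below are invented for illustration; real learners search far larger, implicitly defined spaces.

    # A toy hypothesis space H and training data; learning = choosing the
    # h in H that minimizes error on the observed (x_i, y_i) pairs.
    data = [(0, 0), (1, 1), (2, 4), (3, 9)]

    H = {"identity": lambda x: x,
         "double":   lambda x: 2 * x,
         "square":   lambda x: x * x}

    def training_error(h):
        return sum((h(x) - y) ** 2 for x, y in data)   # squared-error loss

    best = min(H, key=lambda name: training_error(H[name]))
    print(best)                                        # "square" fits exactly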

2.3.1 Supervised and unsupervised learning

Machine learning methods can be viewed as falling into one of two major paradigms, supervised or unsupervised learning, depending on the class of learning tasks.

Supervised learning Let data D = {d1, d2, . . . , dn} be a set of n examples where di = ⟨xi, yi⟩, and x and y are input and desired output (e.g., class label) vectors, respectively. Thus, knowing the target function f for the n values in D, the task of the learner is to search for a hypothesis h that approximates f, i.e., ∀x ∈ X, h(x) = f(x), where H, the set of all h, represents the hypothesis space.

Thus supervised learning, as the name suggests, is guided by training instances to discover relations between input and output values specified by the target function, to be represented in a structure, often called the model. A common application setting involves predicting the class membership of previously unseen instances, known as classification. In the case of continuous characteristics of given instances and class labels, techniques are often referred to as regression analysis. Classification, probably the most widely studied setting, can be seen as learning models represented in many different forms: decision trees, neural networks, Naïve Bayes classifiers, Support Vector Machines, logical clauses (rules), and so on [Alp10].

Unsupervised learning Let data D = {d1, d2, . . . , dn} be a set of n examples where di = ⟨xi⟩ and x is an input vector without any specified desired output (class label). The task of the learner is to find interesting structures or relations in D.

Unlike the supervised setting, it focuses on the underlying distribution of the instances and, through partitioning, unveils new theories or concepts. Two common techniques to learn descriptions or properties of data in an unsupervised manner are clustering and association rule discovery. Clustering can be seen as "unsupervised classification" where data is not labeled with class values and the aim is to discover potential classes. On the other hand, association rule mining reveals interesting dependencies or other relations among attributes of the data. Due to the discrete representation of most of the data and scalability issues when mining large datasets, association rule mining is preferred to clustering in this thesis.

Additionally, since unsupervised learning is usually part of exploratory data analysis, its contribution is mainly in the discovery of potentially useful and interesting structures in the data that can be used in further stages of data mining and analysis. For example, ontology learning would typically be viewed as unsupervised learning, whereas use of ontology, such as in annotating data from an experiment in which genes were over- or under-expressed, typically would be part of supervised learning.

In the following sections we discuss algorithms relevant to the context of the thesis falling under the two paradigms. In the supervised setting, two algorithms are reviewed. First, one of the most popular approaches, learning decision tree classifiers, is covered. Then we discuss inductive logic programming in both propositional and multi-relational (first-order) representations. In the unsupervised setting, association rule mining, in particular frequent closed itemset mining algorithms, is discussed.

2.3.2 Decision tree classifiers

Decision trees are widely-applied classifier learners that recursively partition datasets to identify tree-structured models from the data with the aim of maximising predictive (classification) accuracy. In the late 1950s Hunt pioneered "concept learning systems" in several works (CLS-1 to CLS-9) [HMS66] that formalized models of human learning as the acquisition of structured knowledge in the form of concepts, represented as decision trees. The definition of a concept is either the set of instances it contains (the extensional definition) or a tree partitioning instances into such sets (the intensional definition). Concepts are distinguished in the data by the value of the "class" variable; in Figure 2.6 this is either yes or no. A similar approach was implemented in Quinlan's machine learning system ID3 [Qui79] for discriminating between two classes in a domain; this was later extended as C4.5 [Qui93]. Decision tree learning will be our chosen method of supervised learning for propositional representations in this thesis.

The decision tree representation consists of a single rooted tree with internal nodes that have exactly one incoming edge; terminal or leaf nodes have no outgoing edges. Induction in decision tree learning follows a top-down "divide and conquer" strategy: at the initial stage, all the labelled training instances are assigned to the root node, and instances are partitioned to specialize the tree based on some discrete function (or heuristic measure) of their attribute values. A node is not partitioned further if all its instances belong to the same class, and thus it becomes a terminal node. Recursive application of this process to each internal node finally generates a discriminating tree. Leaf nodes denote one class value that is usually over-represented, compared to the other classes, in the instances the leaf classifies, based on some loss function. Instances are classified by navigating from the root node down to a leaf, following the path on which the test on the instance's attributes at each internal node is true.

An example of an ID3 decision tree [Qui79] is in Figure 2.6. Examples (instances labelled with class values) are listed in the table. The generated tree has internal nodes (square boxes) showing tests on an attribute (e.g., humidity or windy), edges showing the outcome of a particular test (e.g., outlook = sunny, overcast or rain), and leaf nodes (triangles) representing the class label distribution (yes/no).

Learning algorithms for ID3

Decision trees can be built in different ways based on the choice of key criteria for the algorithm: a) the quantitative measure of "goodness" for selecting attributes to recursively split the tree on; b) the hypothesis search strategy, including overfitting avoidance; c) the method used to classify unseen examples, such as how to handle missing data, or whether probabilistic classification is used.

No | Outlook | Temp | Humidity | Windy | Class
1 | sunny | hot | high | false | no
2 | sunny | hot | high | true | no
3 | overcast | hot | high | false | yes
4 | rain | mild | high | false | yes
5 | rain | cool | normal | false | yes
6 | rain | cool | normal | true | no
7 | overcast | cool | normal | true | yes
8 | sunny | mild | high | false | no
9 | sunny | cool | normal | false | yes
10 | rain | mild | normal | false | yes
11 | sunny | mild | normal | true | yes
12 | overcast | mild | high | true | yes
13 | overcast | hot | normal | false | yes
14 | rain | mild | high | true | no

Tree: outlook = sunny → test humidity (high: no; normal: yes); outlook = overcast → yes; outlook = rain → test windy (true: no; false: yes).

Figure 2.6: Decision tree example. Examples are listed in the table; blue square boxes are internal test nodes, yellow edges show the outcome of a particular attribute test, and green triangular boxes are leaf nodes showing class labels. The decision or class variable is yes or no, for the concept "enjoy sport" (example from [Mit97]).

Various "goodness" measures used in the selection of attributes include information gain, gain ratio, imprecise Info-Gain and the Gini index. Information gain (a mutual information criterion) based on Shannon's information entropy [Sha48] was first introduced by Quinlan as a metric for ID3, where the best-splitting attribute is the one that results in the maximal reduction in entropy. This measure has a natural bias that favours attributes with many values, giving them high information gain, and thus for missing values or noisy data this may cause overfitting in the resulting tree. Quinlan's C4.5 has gain ratio as an alternative measure that penalizes attributes with many values, is able to handle missing values, and uses post-pruning mechanisms to reduce overfitting. Breiman et al.'s Gini index [Bre84] for decision trees statistically computes each partition's degree of purity. A partition is considered "pure" if it has only one associated class value, and the attribute with the highest Gini index is selected as a decision attribute.
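To make the information gain calculation concrete, the short sketch below computes the class entropy and the gain of splitting on outlook for the 14 examples of Figure 2.6; it reproduces the familiar value of approximately 0.247 bits for this attribute.

    from math import log2
    from collections import Counter

    # (outlook, class) for the 14 examples of Figure 2.6.
    data = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
            ("rain", "yes"), ("rain", "yes"), ("rain", "no"),
            ("overcast", "yes"), ("sunny", "no"), ("sunny", "yes"),
            ("rain", "yes"), ("sunny", "yes"), ("overcast", "yes"),
            ("overcast", "yes"), ("rain", "no")]

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_gain(data):
        base = entropy([y for _, y in data])          # H(class), about 0.940 bits
        splits = {}
        for x, y in data:
            splits.setdefault(x, []).append(y)
        remainder = sum(len(s) / len(data) * entropy(s)
                        for s in splits.values())
        return base - remainder                       # mutual information

    print(round(info_gain(data), 3))                  # 0.247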

While decision tree learning is a simple and generally efficient method, learning involves dealing with issues such as handling missing values, handling noisy data, and terminal node subsets containing too few training examples. A sufficiently complex and deep decision tree can perfectly classify examples, but is then vulnerable to overfitting in the presence of random noise or limited training examples. Overfitting of a hypothesis occurs if there is another hypothesis that fits the training data less well but outperforms it over the entire distribution of instances. Pre-pruning the tree before it perfectly classifies the data is an approach to deal with such cases. However, post-pruning is generally more successful, as it is often difficult to identify the correct stopping criteria in pre-pruning, since the greedy search strategy of tree growing may miss multiple attribute tests required for the tree to fit the target concept well [Qui93].

Inductive bias of decision tree learning

Although the decision tree representation is sufficiently expressive to allow any concept to be defined, the search strategy of decision tree learning algorithms shows an inductive bias towards learning small trees where attributes with the highest mutual information with the class appear closest to the root. This leads to the claim that decision tree learning generates comprehensible models. However, there are limitations to this claim. First, some target functions may not be easily expressible in a decision tree using the input attributes. This motivates predicate invention [Bai03b]. Second, the inductive bias of decision tree learning greedily maximizes predictive accuracy. This could leave good explanatory attributes out of the tree, since they are "masked" by other attributes with higher mutual information with the class. This motivates the use of closed concepts, whose intents by definition include all attributes relevant to the subset of instances in the concept's extent. Therefore, in Chapter 3 we investigate a combination of decision trees and closed concepts.

In this thesis the version of C4.5 called J48 from the Weka machine learning toolkit [WF05] is used. Unless otherwise indicated, default parameter settings are used, with the standard method for assessing classification accuracy: ten randomized replications of ten-fold cross-validation on each learning task.

2.3.3 Inductive logic programming (ILP)

"Inductive logic programming" was defined by Muggleton [Mug91a] as the intersection of machine learning and logic programming. ILP adopts techniques and tools from machine learning for the induction of hypotheses from examples, while using logic programming for representation and formalization.

Formally, an ILP problem can be stated as follows. Let E be a set of examples containing the subset E+ labelled as positives and E− as negatives. Given background knowledge B, the task of the learner is to find a hypothesis H that explains E+ in terms of B such that:

B ∧ H ⊨ E+ (hypothesis H covers all positive examples; H is complete)

Positive examples E+:
    grandfather(henry, john).
    grandfather(henry, alice).

Negative examples E−:
    grandfather(john, henry).
    grandfather(alice, john).

Background B:
    father(henry, jane).
    mother(jane, john).
    mother(jane, alice).
    parent(X, Y) :- mother(X, Y).

Hypothesis H:
    grandfather(X, Y) :- father(X, Z), parent(Z, Y).

Figure 2.7: Hypothesis construction in inductive logic programming.

and

B ∧ H ∧ E− ⊭ □ (hypothesis H covers no negative examples; H is consistent).

An example of learning in ILP is given in Figure 2.7.
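To make the two coverage conditions concrete, the following sketch encodes the clauses of Figure 2.7 and checks completeness and consistency with a naive forward-chaining evaluator. It is a simplification for function-free clauses and ground examples, not a general theorem prover; variables are written as upper-case strings.

    # Ground background facts B (extensional part).
    facts = {("father", "henry", "jane"),
             ("mother", "jane", "john"),
             ("mother", "jane", "alice")}
    # Intensional clauses: the background rule for parent/2 and the
    # learned hypothesis H; each clause is a (head, body) pair.
    rules = [(("parent", "X", "Y"), [("mother", "X", "Y")]),
             (("grandfather", "X", "Y"), [("father", "X", "Z"),
                                          ("parent", "Z", "Y")])]

    def match(literal, fact, binding):
        """Extend 'binding' so that 'literal' matches 'fact', or return None."""
        if literal[0] != fact[0]:
            return None
        b = dict(binding)
        for t, v in zip(literal[1:], fact[1:]):
            if t.isupper():                    # variable
                if b.setdefault(t, v) != v:
                    return None
            elif t != v:                       # constant mismatch
                return None
        return b

    def consequences(facts, rules):
        """All ground facts derivable from facts and rules (naive iteration)."""
        derived, changed = set(facts), True
        while changed:
            changed = False
            for head, body in rules:
                bindings = [{}]
                for lit in body:
                    bindings = [b2 for b in bindings for f in derived
                                if (b2 := match(lit, f, b)) is not None]
                for b in bindings:
                    new = (head[0],) + tuple(b[t] if t.isupper() else t
                                             for t in head[1:])
                    if new not in derived:
                        derived.add(new)
                        changed = True
        return derived

    model = consequences(facts, rules)
    e_pos = [("grandfather", "henry", "john"), ("grandfather", "henry", "alice")]
    e_neg = [("grandfather", "john", "henry"), ("grandfather", "alice", "john")]
    print(all(e in model for e in e_pos))       # complete: True
    print(not any(e in model for e in e_neg))   # consistent: True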

Depending on the representation language adopted for the task of building general descriptions (hypothesis induction) from a set of examples, the two principal paradigms are bottom-up generalization and top-down specialization. In well-founded representations like the logic programs of ILP, bottom-up generalization starts from "specific" hypotheses and learns descriptions at a more "general" or "higher" level, whereas in top-down techniques the search is conducted in the opposite order, by means of specialization. Classical generalization techniques include relative least general generalization (RLGG), inverse resolution and inverse entailment, a unifying framework for generalization. In the following sections we briefly describe several ILP methods for first-order learning.

2.3.4 Relative least general generalization (RLGG)

A significant contribution in this area was Plotkin's introduction of relative subsumption [Plo71] to define the generality relationship between clauses and the relative least general generalization.

Least general generalization under θ-subsumption: A clause C θ-subsumes a clause D if and only if there exists a substitution θ such that Cθ ⊆ D (written as C ≺ D). Thus C is the least general generalization of D under θ-subsumption if C ≺ D and, for every other E such that E ≺ D, it is also the case that E ≺ C. For example,
mother(X, Y) ∨ ¬parent(X, Y) ∨ ¬female(X)

θ-subsumes
mother(jane, john) ∨ ¬parent(jane, john) ∨ ¬parent(jane, alice) ∨ ¬female(jane) ∨ ¬male(john)
with θ = {X/jane, Y/john}.
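For function-free clauses of this size, θ-subsumption can be checked by brute-force enumeration of substitutions, as in the sketch below; this naive search is only illustrative, since deciding θ-subsumption is NP-complete in general. Literals are encoded as sign/predicate/argument tuples, with upper-case strings as variables.

    from itertools import product

    C = [("+", "mother", "X", "Y"), ("-", "parent", "X", "Y"),
         ("-", "female", "X")]
    D = [("+", "mother", "jane", "john"), ("-", "parent", "jane", "john"),
         ("-", "parent", "jane", "alice"), ("-", "female", "jane"),
         ("-", "male", "john")]

    def subsumes(C, D):
        """True iff some substitution theta makes C*theta a subset of D."""
        vars_ = sorted({t for lit in C for t in lit[2:] if t.isupper()})
        terms = {t for lit in D for t in lit[2:]}
        for values in product(terms, repeat=len(vars_)):
            theta = dict(zip(vars_, values))
            image = {lit[:2] + tuple(theta.get(t, t) for t in lit[2:])
                     for lit in C}
            if image <= set(D):
                return True
        return False

    print(subsumes(C, D))   # True, with theta = {X/jane, Y/john}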

Plotkin [Plo71] extended θ-subsumption to relative least general generalization (RLGG), where the RLGG of two clauses C and D is more general than each of them relative to some background knowledge. However, it is a well-known problem that the RLGG can contain an infinite number of literals and is thus impractical. GOLEM, proposed by Muggleton and Feng [MF92] to implement RLGG, uses ground atomic facts to represent both examples and background theory, and is hence tractable. Successful applications to its credit are predicting protein secondary structure in bioinformatics and structure-activity relations in drug design. GOLEM introduced the language bias of ij-determinacy, a restriction to produce unique and finite RLGGs. Since the hypothesis generated may be inconsistent (a conjunction of literals may cover negative examples) and clauses are learnt independently of each other, GOLEM had a limited ability to learn theories.

2.3.5 Inverse resolution

Inverse resolution can be seen as the "inverse of deductive inference" [Mug87] in ILP. Muggleton's rule-based DUCE system represents rules in the theory by attribute-value pairs and searches for "interesting" concepts in the theory as the "maximal common subset of literals". Given as input a set of conjunctive productions of literals, DUCE applies six operators repeatedly to compress and restructure the set of clauses. Four of the operators are variants of "V" (generalization operators) or "W" (predicate invention operators) that compress theories. Absorption and identification are "V" operators, whereas intra-construction and inter-construction are "W" operators. Consider examples from [BG96]:

Absorption Given parent clause C1 and resolvent C, the absorption operator builds C2 by absorbing the body of C1 into C.

C1 = fly(X) :- feathered(X), has_wings(X).

C = bird(tweety) :- feathered(tweety), has_wings(tweety), has_beak(tweety).

A possible solution

C2 = bird(tweety) :- fly(tweety), has_beak(tweety).
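In the propositional case, absorption reduces to simple set operations on clause bodies, as the following sketch shows (a simplification for illustration, with the ground atoms of the example above abstracted to propositions):

    # Propositional absorption ("V" operator), a simplified sketch:
    # given C1 = h1 :- B1 and C = h :- B with B1 a subset of B,
    # build C2 = h :- (B - B1) + {h1}.
    def absorb(c1, c):
        h1, b1 = c1
        h, b = c
        if not b1 <= b:                 # C1's body must occur in C's body
            return None
        return (h, (b - b1) | {h1})

    C1 = ("fly", {"feathered", "has_wings"})
    C  = ("bird", {"feathered", "has_wings", "has_beak"})
    print(absorb(C1, C))                # ('bird', {'fly', 'has_beak'})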

Identification Given parent clause C2 and resolvent C, the identification operator builds C1 by identifying the body of C2 in the body of C.

C2 = bird(tweety) :- fly(tweety), has_beak(tweety).

C = bird(tweety) :- feathered(tweety), has_wings(tweety), has_beak(tweety).

A possible solution

C1 = fly(X) :- feathered(X), has_wings(X).

Intra-construction and Inter-construction A common literal l within clause A resolves with clauses C1 and C2 to obtain clauses B1 and B2. Thus the "W" operator builds clauses A, C1 and C2 given clauses B1 and B2. Depending on the sign of the common literal, the operation is called intra-construction if the literal is negative and inter-construction if it is positive. An example is given in Figure 2.8.

A  = bird :- beak, tail, legs(2), wings, homeothermic.
C1 = sparrow :- bird, brown.
C2 = eagle :- bird, golden.

B1 = sparrow :- beak, tail, legs(2), wings, homeothermic, brown.
B2 = eagle :- beak, tail, legs(2), wings, homeothermic, golden.

Figure 2.8: “W” operator on propositional definite clauses.

Following its predecessor DUCE, Bain built the Conduce system [Bai03c] to automate the induction of structured theories by applying "V" and "W" operators to clauses in the theory, guided by the set of closed concepts in a concept lattice. Like DUCE, it uses theoretical terms for structural induction and borrows inverse resolution operators to invent new predicates in propositional logic. But Conduce's novelty is to exploit Formal Concept Analysis (subsection 2.4.1), a technique to construct hierarchical concept definitions from theories. Conduce formalizes a propositional context where data and rules can be represented as examples (observational terms), and theoretical terms are introduced containing conjunctions of descriptor variables labeled with a predictor class. This is equivalent to a propositional form of theories containing definite clauses. Conduce looks for common subsets of descriptors hidden in theories and applies FCA to build a general-to-specific ordering of the conceptual hierarchy. In the lattice, each concept corresponding to a set of propositional clauses has the common intersection of clause bodies as its intent (or attributes) and clause indices as its extent (or objects). The lattice is then scanned to detect significant concepts based on an information compression measure, followed by inverse resolution operators that revise (or restructure) both the theory and the lattice in a way that reduces overall complexity. Thus automatic structured induction is carried out by a repeated process of revising lattices and theories.

2.3.6 ILP in a first-order (multi-relational) setting

Many real-life biological databases are multi-relational, where data is stored in multiple tables linked via entity relationships. Typical data mining approaches convert such multi-relational data into a single huge relational file and apply propositional learning. This is termed propositionalization, and examples include systems like LINUS [LDG91]. While this technique may reduce the cost of data preparation for mining, it has several disadvantages: (i) the restriction to an attribute-value or Boolean input representation may be subject to a loss of information with a poor learning outcome; (ii) propositionalization increases the input dimension space, and hence may not scale well. A possible solution is therefore to apply first-order learning with the ability to mine directly from multi-relational tables, where relations can reside in relational or deductive databases, i.e., training examples are represented extensionally (by a relational table) and background knowledge is represented either extensionally (by relational tables) or intensionally (in the form of rules) [DL94]. General background knowledge can represent properties of objects, relations among them or useful auxiliary concepts, for example, inter-atomic structure for molecular data, or existing metadata in database mining domains. Declarative language biases in ILP background knowledge may include restrictions to learning non-recursive Horn clauses, restrictions to learn only functions, etc. Thus ILP in first-order logic offers powerful approaches to learn concepts that involve relations among attributes.

For our research we have investigated TILDE [Blo99] and ALEPH [Sri99], two well-known approaches to first-order learning. TILDE is a first-order variant of a propositional learner that uses a top-down divide-and-conquer strategy to induce logical binary decision trees from examples, inspired by Quinlan's C4.5 [Qui93]. As a language bias, Prolog is used to represent training examples and background theories. Hypothesis search is carried out based on a generalization ordering using θ-subsumption, where the resulting TILDE tree has conjunctions of literals as internal nodes and classes as leaf nodes. Unlike TILDE, under its default search strategy ALEPH uses mode-directed inverse entailment for predicate induction. It employs top-down induction and a sequential covering strategy to generate a hypothesis as a set of clauses. A positive example is selected as a seed and the background knowledge is searched to form the most specific clause. Once this most-specific clause with respect to the seed example is built, ALEPH carries out a top-down search from the most-general clause and gradually specializes it by adding literals selected from the most specific one. Clauses are evaluated by several possible metrics and the best one is selected from the search. All the positive examples covered by the clause are removed and, finally, the solution is complete when the constraint is satisfied that all the positive examples are covered and none of the negatives. We have selected ALEPH as our first-order learning technique in several applications for its rule-based representation, over TILDE's decision trees, since we believe rules are easier to represent as networks and are more likely to be comprehensible to domain specialists.

2.3.7 Association rule mining

Association rule mining, or more specifically frequent itemset mining, belongs to the paradigm of unsupervised learning and discovers interesting dependencies or associations among sets of items in a database. This technique is classically used in various decision-making processes in business, for example in customer "market basket analysis". Here, shopping behavior is analyzed based on associations among the items customers place in their baskets. The following is Agrawal's [AIS93] problem statement for market basket analysis:

Let R be a set of m distinct binary items, R = {i1, i2, i3, . . . , im}, and let T be a database of transactions with unique ids, where t[k] = 1 if transaction t contains item ik, and t[k] = 0 otherwise. Let X be a set of some items in R, called a k-itemset, where |X| = k.

A transaction t satisfies a k-itemset X if t[k] = 1 for all items ik ∈ X. An association rule is defined as r : X → Y with X, Y ⊂ R and X ∩ Y = ∅. The statistical significance of the rule r is measured by its support s, the fraction of transactions in T that contain both X and Y, thus s = support(X ∪ Y); the confidence factor 0 ≤ c ≤ 1 measures a rule's strength, denoted c = support(X ∪ Y)/support(X). Thus the association rule learning problem can be decomposed into searching for all frequent itemsets with support greater than or equal to a minimum support, and then generating rules with confidence greater than or equal to a minimum confidence.

Apriori [AIS93] was proposed to solve the task of searching for frequent itemsets using a bottom-up breadth-first search. Instead of storing all transactions in memory, transactions are retrieved by scanning through the database, and frequent itemsets are computed iteratively in increasing order of itemset size. At each stage, candidates are generated based on the downward closure property, i.e., k-itemsets are considered potentially frequent only if all their (k-1)-subsets are also frequent. The bottleneck with such mining is that extensive database scans are involved while mining for long patterns. For example, for a database of |R| = m items, there are possibly 2^m frequent itemsets, and a frequent itemset of size k implies the presence of 2^k − 2 other frequent itemsets as well. Thus Apriori-type algorithms perform well for sparse datasets with short frequent patterns, such as occur in market basket analysis. However, when mining dense datasets with many long frequent patterns, performance degrades significantly. The complexity of finding frequent itemsets has been studied extensively [AMS+96, MTV94, PCY95, SON95, BMUT97]. All these studies follow a "generate and test" approach like Apriori and are focused on reducing the number of database scans to generate candidate sets, based on dynamic hashing or partitioning techniques, sampling approaches, dynamic itemset counting, and so on. A different approach that avoids candidate generation, called the FP-growth tree, was proposed by Han et al. [HPYM04]. This is a divide-and-conquer method that decomposes the mining task into smaller subdivisions and stores the relative frequency information in a tree-like structure for generating frequent patterns recursively.
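A minimal sketch of this levelwise "generate and test" scheme may help: each pass makes one scan over the database to count the candidates of the current size, and the downward closure property prunes any candidate with an infrequent subset. The transactions and threshold below are invented for illustration.

    from itertools import combinations

    transactions = [{"bread", "milk"}, {"bread", "beer", "eggs"},
                    {"milk", "beer", "bread"}, {"milk", "beer"}]
    minsup = 2   # absolute minimum support

    def apriori(transactions, minsup):
        items = {i for t in transactions for i in t}
        frequent = {}                              # itemset -> support
        level = [frozenset([i]) for i in sorted(items)]
        while level:
            # one database scan counts every candidate of the current size
            counts = {c: sum(1 for t in transactions if c <= t) for c in level}
            survivors = [c for c in level if counts[c] >= minsup]
            frequent.update((c, counts[c]) for c in survivors)
            # join step: merge k-itemsets sharing k-1 items, then prune any
            # candidate with an infrequent k-subset (downward closure)
            candidates = {a | b for a in survivors for b in survivors
                          if len(a | b) == len(a) + 1}
            level = [c for c in candidates
                     if all(frozenset(s) in frequent
                            for s in combinations(c, len(c) - 1))]
        return frequent

    freq = apriori(transactions, minsup)
    for itemset in sorted(freq, key=lambda s: (len(s), sorted(s))):
        print(sorted(itemset), freq[itemset])
    # confidence of the rule {bread} -> {milk}, as defined above:
    print(freq[frozenset({"bread", "milk"})] / freq[frozenset({"bread"})])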

While all the above approaches are still based on mining frequent patterns, Pasquier et al. [PBTL99a] proposed an interesting solution that prunes the candidate search space based on the computation of frequent closed itemsets. The advantage of this approach is that, while the number of closed itemsets is of a smaller order of magnitude than the number of frequent itemsets, they can be used to uniquely determine the exact frequency of all itemsets. An itemset X is a closed itemset in a set of database transactions T if X is frequent and there exists no proper superset S with the same support as X. To explore the itemset space and the transaction space simultaneously, Formal Concept Analysis (FCA) has been adopted as a framework in several techniques such as A-Close [PBTL99a], TITANIC [STB+02] and CHARM [ZH02]. A-Close and TITANIC follow an Apriori-like algorithm and conduct a breadth-first search, therefore suffering the bottleneck of generating too many frequent candidates. CHARM seems a promising approach, with a vertical representation technique to reduce the search space. The following section briefly discusses the condensed representation technique of CHARM, related to our work in chapter 5.

2.3.7.1 CHARM

CHARM, proposed by Zaki et al. [ZH02], is a well-known closed itemset mining algorithm particularly suited to dense datasets. The algorithm is designed to perform better than other Apriori-like algorithms by utilizing a "vertical" data format, hash storage techniques and a two-way pruning strategy. The horizontal data format in Apriori-like algorithms is Tid-Itemset based indexing, i.e., a transaction (Tid) is represented by the set of items (Itemset) it contains. Alternatively, the vertical format in CHARM is indexed by Item-Tidset, i.e., an item is represented by the set of transactions containing that item. Search is carried out using this approach over an Itemset-Tidset tree (IT-tree), where each node is an Itemset-Tidset pair with child nodes sharing the same prefix class as the parent. A prefix class over itemsets indicates that they share a common prefix of arbitrary length. Versions of CHARM also use a diffset technique that keeps track of the differences between the Tidsets of a child node and its parent to reduce memory storage.

To extend the search, any two itemsets are combined by computing the intersection of their Tidsets, and the result is tested against four closure properties, of which two are variations of the others. An empty set difference of Tidsets indicates that the two itemsets occur in identical transactions and their union has the same closure, so one itemset is pruned. A non-empty set difference, with one Tidset a subset of the other, results in the itemset with the subset of Tids being replaced by their union, as it continues to hold the same closure, while the itemset with the superset of Tids is retained. Finally, both itemsets are considered frequent itemsets if they do not have any common Tids. To speed up the mining process, CHARM adopts the sum of supports as a weight function that sorts the Itemset-Tidset search space so that higher-support itemsets appear later. This technique helps early pruning of non-closed itemsets, generating a tree with fewer search levels, and boosts the mining process. The version of CHARM without lattice generation [ZH02] uses the sum of the Tids in a Tidset as a hash function to store closed itemsets, which are then easy to retrieve, and this is used for subsumption checking to eliminate non-closed itemsets. However, as lattice generation [ZH05] also requires recording minimal elements or a hierarchy among them, the latter algorithm does not benefit from hash functions and adopts an intersection-based subsumption check. For our work we have implemented a variant of the CHARM algorithm that generates the formal concept lattice, with nodes as concepts having intents as itemsets and extents as Tidsets.
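The vertical representation and the closure test underlying this scheme can be sketched as follows. The Tidsets are invented for illustration, and only the Item-Tidset intersection and the closure operator are shown, not CHARM's IT-tree search, pruning or diffsets.

    # Vertical format: each item maps to its Tidset (ids of transactions
    # containing it). The data below is illustrative.
    tidsets = {"a": {1, 3, 4}, "b": {1, 2, 3}, "c": {1, 3}, "d": {4}}

    def tidset(itemset):
        """Transactions containing every item: intersect the items' Tidsets."""
        return set.intersection(*(tidsets[i] for i in itemset))

    def closure(itemset):
        """The largest itemset with the same Tidset (hence the same support)."""
        t = tidset(itemset)
        return frozenset(i for i, s in tidsets.items() if t <= s)

    # {c} occurs in exactly the transactions {1, 3}; every one of those also
    # contains a and b, so {c} is not closed: its closure is {a, b, c}.
    print(tidset({"c"}), sorted(closure({"c"})))   # {1, 3} ['a', 'b', 'c']

This closure corresponds exactly to a formal concept's intent, which is why the concept lattice can be built from the closed itemsets found during mining.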

2.4 Ontology learning

Since building an ontology manually is a labor- and resource-intensive, costly task, data-driven ontology learning, by processing and representing concepts from all kinds of available information such as unstructured (text), semi-structured (HTML/XML) or structured (database) data, seems more logical. Ontology learning, in general, refers to techniques that automatically extract meaningful concepts and their relations from input resources and build an ontology from them, or extend the definitions of existing ontology concepts in a semi-automatic manner using heterogeneous data sources [CMSV09]. Thus the process of ontology learning combines techniques from a variety of disciplines including data mining, machine learning, natural language processing, etc.

There have been considerable efforts to automate ontology learning, mostly from unstructured or semi-structured data; however, only a few address learning from multi-relational data [CMSV09]. A generic ontology learning architecture (Figure 2.9) has an ontology management component, a coordination component, a resource processing component and an algorithm library component. The management component plays the role of an integrated platform to access algorithms that process resources, combine newly generated ontology concepts and relations into the model and, finally, evolve, reason with and evaluate ontologies. The coordination component facilitates interactions among the resource processing components and the algorithm library component. The resource processing component contains various techniques for discovering, importing, analyzing and transforming input data, depending on the availability of their structure. Finally, the algorithm library component contains the machine learning algorithms essential for extracting and maintaining the ontology model.

The above learning framework further devises methods to extract ontological structures out of data in an unsupervised manner, followed by classification or clustering in order to add newly learned ontology concepts. The tasks involved are preprocessing or extracting ontological terms, constructing an initial taxonomy, identifying and labelling non-taxonomic relations, detecting instances, populating the ontology and, finally, evaluating the model. Although we do not directly use this framework, our methods could fit into this kind of approach. We differ in the role of the human in Figure 2.9, where the only human role is that of ontology engineer. However, this implies a user is also able to engineer ontologies, which is demanding of a domain specialist. Instead, we take the approach adopted in the RDR methodology for Knowledge Acquisition: the user is the domain expert, who leads the system to build a knowledge structure by giving cases.

Figure 2.9: A generic ontology learning framework, adopted from [CMSV09]. The figure shows resource processing components (for web documents, DTDs and XML schemas, legacy databases, lexicons such as WordNet, and existing ontologies) connected through an ontology learning coordination component to the algorithm library and the ontology management backend, with a GUI for the ontology engineer.

While many techniques have been devised, free text mining approaches are more focused on developing various resource processing strategies in linguistics and natural language processing. Free text contains a large number of lexical relations among the language concepts, and thereby requires more preprocessing compared to semi-structured or structured data. An example of such an application is Text-To-Onto [MS04], which requires six different natural language processing components for detecting words, morphological analysis, syntactic annotation, regular expression matching and syntactic parsing. Applications of learning from structured data such as databases, existing ontologies, etc., are investigated in [Kas99, WT00].

Machine learning algorithms that have been used for ontology learning can be classified into four types; depending on the input structure, multiple types of machine learning can be adopted by any ontology learner. Association rule discovery, hierarchical clustering, classification, inductive logic programming and conceptual clustering are possible approaches for this purpose.

Association rule discovery (subsection 2.3.7) is used to explore correlations among conceptual terms and represent them as a set of generalized propositional rules. An application of such a method is Text-To-Onto [MS04]. This application aims to extract concepts, and non-taxonomic and taxonomic relations. It uses the frequency of co-occurring words to discover non-taxonomic relations with background knowledge such as WordNet [Fel98]. WordNet is a popularly used source for semantic relationships between words (a taxonomy), and contains syntactic and morphological data (a lexicon). Finally, association rules are applied to derive statistically significant relations based on support and confidence, where the background taxonomy is used to give the appropriate level of abstraction.
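As a toy illustration of support and confidence (the "transactions" of co-occurring terms below are invented), a candidate rule lhs ⇒ rhs can be scored as follows:

    # invented co-occurrence transactions over ontology-like terms
    transactions = [{"cell", "membrane"}, {"cell", "nucleus"},
                    {"cell", "membrane", "nucleus"}, {"membrane"}]

    def support(itemset):
        """Fraction of transactions containing every item in the itemset."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs):
        """Of the transactions containing lhs, the fraction also containing rhs."""
        return support(lhs | rhs) / support(lhs)

    print(support({"cell", "membrane"}))        # 0.5
    print(confidence({"membrane"}, {"cell"}))   # 0.666...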

Hierarchical clustering can be defined as a process of organizing objects into groups based on the similarity of their representations. ASIUM [FNR98] uses this technique for extracting taxonomic relations from syntactic parsing, and aggregates clusters to form concepts as a generality graph or domain ontology. Hierarchies can be generated either bottom up or top down. Bottom-up construction starts with individuals and builds groups by merging similar clusters, whereas top-down construction starts with all objects and divides them into groups. Classification is a possible application to refine an existing taxonomy, by learning a classifier which is later used to classify new relevant terms. As mentioned earlier, WordNet can be used as a source of basic level categories for such classification tasks.

Logic-based methods are more suitable for extracting concepts and relations from structured or multi-relational data. Inductive logic programming, first-order rule learning and propositional learning have been applied in several applications. ASIUM [FNR98], OntoTermExtraction [FLV08], TextStorm and Clouds are applications that learn ontologies from text data semi-automatically. Throughout the thesis we use different logic-based methods for learning ontologies.

Formal concept analysis (subsection 2.4.1) can be seen as a conceptual clustering technique, but one with strong theoretical foundations, which can be applied to learn concepts and their hierarchy at the same time. Our work investigates the combined approach of FCA and various techniques in inductive logic programming to automatically learn ontologies.

The next section reviews details of formal concept analysis and related ILP techniques for an ontology learning framework.

2.4.1 Formal Concept Analysis and ontology learning

Formal Concept Analysis (FCA) is based on mathematical order theory and is used to derive conceptual structures from data. Barbut and Monjardet [BM70] first introduced Galois lattices for this purpose, which Wille [GW99b] proposed as the basis for Formal Concept Analysis, naming them concept lattices. According to Wille, a concept can be seen as a node of a Galois lattice. In other words, FCA recognises the "concept" as a fundamental unit of thought, constituted by its intension and extension [GW99b]. The advantages of FCA are a good formalization of structures as conceptual hierarchies and a guideline for analysing dependencies within the data.

This is formalised by defining a formal context as a triple K = ⟨D, O, R⟩ where D and O are finite sets of descriptors (attributes) and objects respectively, and R ⊆ D × O is a binary relation [GW99b]. The inclusion ⟨x, y⟩ ∈ R denotes the fact that "descriptor x is a property of an object y". A formal concept is an ordered pair of sets ⟨X, Y⟩ where X ⊆ D, Y ⊆ O, such that ⟨X, Y⟩ is maximal with respect to the property X × Y ⊆ R. This is the closure property. The set X is called the intent and Y the extent of the concept ⟨X, Y⟩. The formalism is further extended with a binary order or subsumption relation. A partial order is introduced to characterise the concept lattice L of context K, defined as the set of all complete pairs of intents and extents ordered by ⟨X1, Y1⟩ ≤ ⟨X2, Y2⟩ ↔ X1 ⊆ X2. The fundamental property of a Galois lattice as a complete lattice is defined in terms of the least upper bound (lub) and the greatest lower bound (glb) of all pairs of concepts as follows:

$$\mathrm{lub}_{j \in J}\,\langle X_j, Y_j \rangle = \Bigl\langle \bigl(\bigcup_{j \in J} X_j\bigr)'',\ \bigcap_{j \in J} Y_j \Bigr\rangle$$

$$\mathrm{glb}_{j \in J}\,\langle X_j, Y_j \rangle = \Bigl\langle \bigcap_{j \in J} X_j,\ \bigl(\bigcup_{j \in J} Y_j\bigr)'' \Bigr\rangle$$

where the double application of the derivation operator ('') denotes the closure.

A formal concept lattice is shown in Figure 2.11 for the context in Figure 2.10.
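The two derivation operators and the closure property can be illustrated in a few lines of Python over the context of Figure 2.10. This sketch is ours, not the thesis implementation, and it assumes the reconstruction of the context given in the figure:

    # the formal context of Figure 2.10 as object -> attribute sets
    context = {
        "cats":     {"four-legged", "hair-covered"},
        "dogs":     {"four-legged", "hair-covered"},
        "dolphins": {"intelligent", "marine"},
        "gibbons":  {"hair-covered", "intelligent", "thumbed"},
        "humans":   {"intelligent", "thumbed"},
        "whales":   {"intelligent", "marine"},
    }
    attributes = set().union(*context.values())

    def extent(descriptors):
        """Objects having every descriptor in the set (one derivation step)."""
        return {o for o, attrs in context.items() if descriptors <= attrs}

    def intent(objects):
        """Descriptors shared by every object in the set (one derivation step)."""
        return {a for a in attributes if all(a in context[o] for o in objects)}

    # <X, Y> is a formal concept iff X'' = X; double application is the closure
    Y = extent({"marine"})   # {'dolphins', 'whales'}
    X = intent(Y)            # {'intelligent', 'marine'}, the closure of {'marine'}
    print(X, Y)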

Many methods have been proposed to build concept lattices, by incremental or by batch construction [KO01]. In the context of machine learning, incremental construction of such a concept hierarchy can be viewed as conceptual clustering [MS83]. Here formal concept analysis is used as a tool for restricting, restructuring and ordering the set of concepts that can be induced over a collection of objects [MG95, CR93, MS83]. In association rule mining, the critical issues are speeding up the computation of frequent itemsets, and their subsequent retrieval. Several applications have been developed to address these issues using FCA [PBTL99b, TPBL00, VMGM02].

              four-legged   hair-covered   intelligent   marine   thumbed
  cats             x              x
  dogs             x              x
  dolphins                                      x           x
  gibbons                         x             x                    x
  humans                                        x                    x
  whales                                        x           x

Figure 2.10: Formal context. Columns represent domain attributes or features, rows represent objects in the domain, and an "x" indicates that the row object has the corresponding attribute.

But the techniques in this framework are restricted to rule learning, such as the use of association rule learning to extract co-occurring pairs of terms. The clear drawback to the use of rule learning is that inference over the resulting knowledge bases is limited.

In the domain of ontology engineering, ontologies and formal concept analysis can complement each other from an application point of view [CHST04]. Three main application areas identified are: (i) building ontologies, where FCA can support learning [STB+02, Stu04]; (ii) analysing and navigating ontologies with the techniques of FCA [CS00]; and (iii) improving FCA applications for ontologies represented in logics [Bai03a, Pre97]. The last application area needs to deal with various complexities of FCA techniques in terms of scalability (i.e., deriving FCA from large data sets) and the use of expressive formalisms of first order logic.

Several authors have investigated FCA in first order logic (see references in [FR00] and in Chapter 7). However, detailed investigation of the application of ILP techniques is limited [Bai04, Bai03a]. The former [Bai04] is important for dealing with large data sets and is useful for large experimental evaluations on data sets with known properties. The latter [Bai03a] is anticipated to be important for user interactivity. Further research is needed to improve areas of ILP techniques, specifically predicate invention methods [Bai04], based on the large body of mathematical results in FCA.

Our approach to ontology learning is based on formal concept analysis [GW99b] and the advantages it can provide when used in combination with machine learning algorithms. FCA takes a set of objects, a set of attributes and their binary relationship, and builds the concept lattice with a concept inclusion ordering.

Figure 2.11: Concept lattice for the formal context in Figure 2.10. The lattice is shown as a Hasse diagram, and each node represents a formal concept: a set of attributes common to the set of objects in the concept.

Scalability is the main impediment to the task of ontology learning. The search space for a set of N objects in a domain contains up to 2^N subsets of objects representing possible concepts, leading to an even larger space of sets of concept inter-relationships defining possible ontologies. Bain has previously shown [Bai03a] that intensive analysis of basic concepts inherent in a domain, enabled by FCA, can provide a useful representation bias for the search space of possible ontologies. The approach taken is to go from propositional to first order representations of intensional concept definitions and to use first order logic to represent concepts in formal concept analysis. Because of the richer formalism, this allows coverage of a wider range of multi-relational concepts with greater structure and more scope for domain specifics (e.g., in our case, interactions in biological networks), although it leads to more difficulty in searching this expanded space for the "best" ontology. In order to address this, our approach is to tackle each of the key sub-problems, employing a range of novel methods and adaptations of existing techniques from related areas.

2.4.2 Applications of machine learning in functional genomics

Systems biology requires integrated analysis of high-throughput data such as sequence, structure, function, gene and protein expression, pathway and phenotype data, which are exponentially growing in size, inherently noisy and lacking in generalisation structures. Machine learning offers various techniques such as decision trees, neural networks, Markov models, support vector machines, graphical models and inductive logic programming to analyze these complex data sets through data preprocessing, classification, prediction, probabilistic modeling etc. The following is a brief discussion of recent applications of machine learning approaches in functional genomics.

2.4.2.1 Ontology driven approaches

The objective of any microarray experiment is to study the underlying biological phenomena of genes that are differentially expressed over time or under modified conditions of interest. The two complementary paradigms of machine learning are unsupervised and supervised learning. Unsupervised learning is the most commonly applied exploratory framework; it attempts to cluster data to identify similarly expressed genes, i.e., to discover the classes. Examples are hierarchical clustering [ESBB98], K-means clustering [THC+99], self-organizing maps [TKWC99], principal components analysis [RSA00] and Bayesian network learning [FLNP00]. Supervised learning, on the other hand, works from conditional distribution assumptions, i.e., given the classes of interest, it predicts gene activity.

Independent of any model, ontological analysis has become an essential part of microarray data analysis, to infer the functional annotations of genes and their products. In unsupervised learning, the clusters require a post-hoc analysis to evaluate correlated genes based on the existing ontological annotation available in the databases; in supervised learning, Gene Ontology features may be combined with other features, and classifiers learned to predict gene activity from expression data.

Over-representation analysis is the most commonly applied statistical method to estimate the statistical significance of categories within a study set of genes. Many statistical hypothesis tests based on Gene Ontology terms have been proposed to translate clusters into relevant biological information. For example, Al-Shahrour et al. [ASDUD04] applied Fisher's exact test to extract relevant GO categories in a group of genes compared against a reference group. Similar approaches based on single or multiple combinations of the χ², hypergeometric, binomial and Fisher's exact tests are implemented in [SF04], [KDOK02], [ZFW+03] and [BS04]; see also [KD05] and Table 2.3.

GO functional annotations combined with other information sources have been investigated in supervised frameworks to improve prediction. Hvidsten et al. [HLK03] applied supervised learning based on rough set theory to categorize genes from expression patterns into GO biological processes. Barutcuoglu et al. [BST06] selected an SVM classifier to generate a multi-label training set based on GO processes, which was later combined with a Bayesian scheme to predict function. An emerging role of ontology is knowledge representation, where machine learning methods offer techniques to predict and integrate complex annotations, mappings and literature across data sources. King et al. [KFD+03] investigated methods to standardize the GO functional vocabulary by predicting new GO terms based on existing SGD and Flybase annotations, applying both decision trees and Bayesian networks.

All the statistical methods mentioned above rest on assumptions, with corresponding advantages and limitations. For example, the use of the hypergeometric distribution to calculate P-values, a widely applied test, assumes: a) the set of all genes in the genome is the reference set, so the calculated probability includes genes which have no chance of being selected (thereby the resulting P-values may differ from those that would be obtained under the correct statistical model); and b) GO categories are uncorrelated, so P-values are computed for each independently. Even with corrections for multiple testing, these methods have the limitation that they do not have the correct probabilistic model to approximate the relationships existing among categories in the hierarchical structure of the ontology. Therefore, they may fail to correctly infer knowledge about genes with unknown function.

An alternative knowledge-driven approach is "semantic similarity", which takes into account similarity among genes by considering the information content and structure of the ontology. To quantify similarity, two common approaches are edge counting [RMBB89] and information theoretic principles [Res95]. Edge counting methods measure the similarity of two given ontology terms, t1 and t2, by counting edges on the graph path between them in the ontology. The shorter the distance, the higher the similarity. In the case of multiple paths, the shortest or average distance is taken into account. The assumption of this method, that links in the taxonomy represent uniform distances, does not hold for a biological ontology, where single links can cover widely variable distances. Weighting edges as a function of hierarchical depth, or considering density

with link types, are among several approaches used to correct this assumption. Resnik et al. [Res95] showed that a node-based method not only overcomes these issues, but also performs better. The information content (negative log likelihood) based method measures the similarity between two terms via the common terms they share in the taxonomy: the more they share, the more similar they are. As the Gene Ontology has a subsumption hierarchy, term probability increases as we move up and decreases towards the bottom, where terms are more specific. Thus the topmost node has probability 1 and is least informative, while more specific nodes carry more information. Quantitative associations between GO-driven semantic similarity and gene expression correlation have been studied in [AB04] and [SSP+05]. Such approaches, combined with Carey's [Car04] formalization, have been used in a supervised framework for feature discriminant analysis in our work [ABT10].
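A minimal sketch of such a node-based measure is given below, under the simplifying assumption that term probabilities and ancestor sets have already been computed; both inputs and helper names are hypothetical:

    import math

    def information_content(term, prob):
        """Information (in bits) of a term, given its annotation probability."""
        return -math.log2(prob[term])

    def resnik_similarity(t1, t2, ancestors, prob):
        """Information content of the most informative ancestor shared by
        t1 and t2 (each term is counted among its own ancestors)."""
        common = ancestors[t1] & ancestors[t2]
        return max(information_content(t, prob) for t in common)

    # toy taxonomy: the root r covers everything, so its information content is 0
    prob = {"r": 1.0, "x": 0.25, "y": 0.5}
    ancestors = {"x": {"x", "y", "r"}, "y": {"y", "r"}}
    print(resnik_similarity("x", "y", ancestors, prob))  # IC("y") = 1.0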

Table 2.3: List of web-based functional annotation tools (websites accessed on 11th May, 2013)

Tool                          Statistical test(s)                         Functional labels
FatiGO, FatiGO+ [ASDUD04]     Fisher's Exact test                         GO, KEGG, protein domains, Swissprot keywords, Transfac motifs, CisRed motifs, chromosomal location, tissues
                              Website: http://bioinfo.cipf.es/babelomicswiki/tool:fatigo
DAVID, EASEonline [DJSH+03]   Hypergeometric and Fisher's Exact tests     GO, pathways, diseases, protein domains, interactions
                              Website: http://david.abcc.ncifcrf.gov/
FunSpec [RGMH02]              Hypergeometric test                         GO, phenotypes, protein interactions (yeast only)
                              Website: http://funspec.med.utoronto.ca/
GoMiner [ZFW+03]              Fisher's Exact test                         GO
                              Website: http://discover.nci.nih.gov/gominer/index.jsp
GOstat [BS04]                 χ² and Fisher's Exact tests                 GO
                              Website: http://gostat.wehi.edu.au/
GOToolBox [MBR+04]            Hypergeometric, Binomial and Fisher's       GO
                              Exact tests
                              Website: http://genome.crg.es/GOToolBox/

2.4.2.2 Feature selection in microarray data

Microarray data is characterized by large dimensionality, small sample sizes and inherent noise, which results in over-fitting and poor performance for many learning models [SDB03]. Gene selection corresponds to feature selection in machine learning, and it becomes an increasingly demanding task to discriminate significant features. The objectives are to increase efficiency by reducing computational runtime and to improve the performance or predictive accuracy of both classifiers and clustering techniques [LM98]. Two difficulties in applying GO-based approaches in such a setting are: a) multiple category annotation, since a gene can have several different functions in the cell, and may be found in several cellular processes and in different locations; and b) multiple depth of annotation, since ontology categories are organized in a hierarchy, so two genes may have related functions that are annotated at different levels of generality. Propositional machine learning has limitations in learning from such data due to its attribute-value representation of instances.

There have been several approaches to pre-processing GO annotation for incorporation in a probabilistic framework. Commonly applied techniques in the supervised learning setting are filter methods and wrapper methods. The basic difference between the two is that filter methods select subsets of features independently of any knowledge or induction algorithm when assigning discriminative scores to genes, whereas wrapper methods search for a subset of features using an induction algorithm itself as part of the evaluation function (for more details see [KJ97]).

Examples of filter methods based on statistical approaches are the t-test (ANOVA [JA06], [BL01]), Fisher's Discriminant Analysis ([JSR03], [DP05]) and information gain ([GST+99], [ABT10]). Heuristic-based filter methods suffer from evaluating individual features to predict the class label while ignoring the true underlying distribution of the expression. Alternatively, nonparametric methods have been investigated, such as Wilcoxon's rank sum test ([TGB+02], [TOTZ01]). Examples of wrapper methods include genetic algorithms [JUA05]. Unlike wrapper methods, embedded methods utilize the available data set to select features during training; when used in a specific algorithm this reduces the training time of the predictor and reaches a solution faster. Examples utilize SVMs ([GE03], [WMC+01]) and decision trees [WTH+05] (for more see [SIL07]). We examined various approaches, and correlation-based feature selection (CFS) [Hal00], a filter-based method, was found appropriate for us. This method assumes that a subset of good features is highly correlated with the class but uncorrelated with each other. Feature subsets are then searched efficiently using best-first-search heuristics while considering their individual predictive power combined with inter-correlation among themselves.

2.4.2.3 Inductive logic programming on multi-relational genomics data

Biological data sources often contain multi-relational data such as gene interactions, regulatory pathways, metabolic pathways, etc. These are often organized in a hierarchy, a DAG, or another structured data type. Propositional learning approaches often fail to generate accurate and comprehensible models from such data, due to the limitations of attribute-value representations of instances. King et al. [Kin04] pointed to three problem areas that strongly suggest the suitability of applying inductive logic programming ([Mug91b], [DL94]) in functional genomics: a) to uncover significant logical relationships by deductive inference from multi-relational and hierarchical data; b) to reduce the cost of querying heterogeneous data sources by offering direct analysis over multiple relations; and c) to produce comprehensible and interpretable results, unlike the rules from a propositional learner.

King et al. applied ILP to predict protein functions by homology induction, and also their functional classes. Heath et al. [HRS+02] studied gene response to long-term stress in loblolly pine, applying ILP to functionally categorize genes with systematic variations in expression levels. Gene regulatory network learning with ILP was studied in ([OTPC07], [FK08], [PC03]). An approach incorporating available structural annotation (ontology) as background knowledge with ILP was examined by Badea et al. [Bad03], where ILP was used with Gene Ontology and ProteomHumanPSD data to induce functional discrimination rules for genes in microarray analysis of adenocarcinoma of the lung. Such representations offer a comprehensible functional interpretation for microarray data analysis. We have applied a similar approach to learn a responsive cellular network in yeast [ABT09].

In recent years there has been more emphasis on probabilistic ILP (abduction and induction) with applications to systems biology data where, since background information is missing or incomplete, concepts cannot be learned directly from ground facts but must instead be inferred indirectly. This framework works by abduction to generate a hypothesis from observable facts, followed by induction to learn rules from the abduced hypothesis. Examples include utilizing Bayesian networks [TNCKM06] to infer enzyme inhibition from an abduced hypothesis generated from observed metabolic concentrations. A similar framework was deployed utilizing SVMs [Mug05] to predict the effect of toxins in rat metabolic pathways.

2.5 Summary

This chapter has reviewed ontology, the basic biology of functional genomics and systems biology, and bioinformatics tools and databases. This was followed by a review of machine learning algorithms from the supervised and unsupervised paradigms. The current state of the art for ontology learning within the framework of formal concept analysis, and techniques used in condensed representations of itemsets and multi-relational concepts, have been described. Combining the benefits of these techniques will be the basis of our approaches, and this may in turn lead to benefits in the areas from which we derive our methods.

3 Propositional learning from ontological annotation

“It’s a mistake to think you can solve any major problems just with potatoes.”

Douglas Adams

This chapter reviews standard over-representation analysis of high-throughput biological data and the bias problem that arises from the currently used probabilistic model when the dependency of ontological terms in the structure is considered. We propose an integrative "systems biology" approach to combine and explore heterogeneous data sources, in which formal concept analysis is applied for feature construction in a supervised learning methodology. Parts of this work have been previously published in [ABT07, ABT10].

3.1 Motivation

Ontologies are of growing importance in biomedical informatics and their uptake in this area may constitute one of the most successful applications to date of ontological engineering. Resources from open-source projects such as the Gene Ontology (GO)¹ are now nearly ubiquitous tools in bioinformatics [BR04]. A recent check² showed the original GO paper by Ashburner et al. [Ash00] had over 4500 citations in PubMed Central. A number of reasons can be identified for this success from the standpoints of biology and computer science; for reviews see Bada et al. [BSG+04] and Bodenreider and Stevens [BS06]. In this chapter we are concerned with two such aspects of the Gene Ontology that led us to study its applicability for machine learning problems in bioinformatics.

¹ At www.geneontology.org.
² As of May, 2013, at www.pubmedcentral.nih.gov.

First, the use of ontologies such as GO provides a standard terminology for functional genomics — the objective of which is to describe the function of all genes in the genome of an organism — for example, in the analysis of gene expression data [BH02]. Thus it is important for machine learning tools to work with such data. Second, the category definitions and hierarchical structure of GO represent a “pre-Semantic Web” view of ontology, where the goal of the representation was as a tool for human inspection rather than as a formal knowledge structure suitable for automated inference. The question is then how the structure of such a resource can be handled by automated systems.

In previous work [Bai02], M. Bain developed a feature construction approach using Formal Concept Analysis that demonstrated bias shift, leading to improved predictive accuracy with standard machine learning algorithms. Feature construction as a pre-processing step is a promising approach for handling non-vector-based data as input to attribute-vector-based machine learning methods. This chapter investigates the applicability of this approach to learning from GO annotation data, and evaluates the strengths and limitations of the approach.

The process of feature construction can be computationally expensive, and so in this chapter we extend our approach to investigate methods of feature selection as an alternative. In this approach the GO categories are pre-processed using standard methods to select those likely to lead to good predictive accuracy [GE03].

The results we obtain suggest that an integrative analysis of heterogeneous genome-wide data on the behaviour of cellular systems [Kan00] is possible based on standard machine learning tools. The remainder of the chapter is organised as follows. Related background on over-representation analysis and the limitations of its traditional setting in bioinformatics applications are discussed in Section 3.2, followed by our alternative statistical approach to handling ontological annotation appropriate for machine learning methods. A series of machine learning experimental settings and their results are discussed in Sections 3.4 and 3.5, with conclusions in Section 3.6.

3.2 Formalization of over-representation analysis

A common setting for over-representation analysis is computing p-values using a particular statistical test on gene sets resulting from a genome-wide high-throughput assay, typically a gene expression microarray experiment [BH02]. In this context most tools adopt a standard probabilistic model for the number of genes that would be found by chance to be annotated to the particular GO category of interest. Typical choices for such models, and hence significance tests, include the hypergeometric or binomial distributions, or the χ² or Fisher's exact tests [KD05].

A typical application of the hypergeometric distribution is the following. We have a set of genes annotated to a particular GO term and we are interested in knowing the probability of finding that number of genes annotated to the term simply by chance³. We assume a "background distribution" of genes, typically the total number of genes in the genome with GO annotations, or the total number of genes in the experiment. This number is n. Of this set of genes, the size of the subset annotated to the GO term of interest is m ≤ n. In the results of the experiment the size of our set of genes is s and the number of genes in that set annotated to our term of interest is r. The probability of finding by chance r genes from s annotated to the term of interest is given by the hypergeometric distribution:

$$P(r, s, m, n) = 1 - \sum_{i=0}^{r-1} \frac{\binom{m}{i}\binom{n-m}{s-i}}{\binom{n}{s}} \qquad (3.1)$$
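Equation 3.1 is an upper-tail hypergeometric probability, so it can be checked directly with scipy; the counts below are invented for illustration:

    from scipy.stats import hypergeom

    n, m = 6000, 40   # background genes; background genes annotated to the term
    s, r = 50, 5      # sample size; annotated genes observed in the sample

    # sf(r - 1) = P(X >= r), i.e. the 1 - sum_{i=0}^{r-1} form of Equation 3.1
    p_value = hypergeom.sf(r - 1, n, m, s)
    print(f"P(at least {r} of {s} genes annotated by chance) = {p_value:.3g}")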

3.2.1 Optimistic bias in probability estimates

A well known problem in the analysis of GO annotation using a probabilistic model to estimate statistical significance is that of multiple testing; in this case, each GO term is tested separately using Equation 3.1. As the number of applications of a statistical test on a data set increases, so the probability of obtaining an apparently significant result (p-value below the selected threshold) increases. This is usually allowed for by essentially lowering the effective significance threshold based on the number of tests, using the Bonferroni correction or alternatives [BWG+04].

In the case of over-representation analysis of GO annotation the issue turns out to be more complicated. The Bonferroni correction is usually thought of as being a conservative correction, i.e., the effective significance threshold is lowered more than necessary. However, Boyle et al. [BWG+04] report that in their experiments the Bonferroni correction is not conservative enough, leading to an optimistic bias in estimating GO categories as statistically significant annotation for gene sets.

³ This treatment follows that of Boyle et al. [BWG+04], although we have corrected an error in the formula they present.

No. of genes annotated     Total     In sample
  to parent GO term         ≥ m        ≥ r
  to child GO term           m          r

Table 3.1: Tests on GO categories where one is more general than the other are not independent — see text for details.

A simple qualitative argument can be developed to suggest why this is the case, and it has important consequences for the use of ontological annotation in over-representation analysis. The Gene Ontology, like many ontologies, is a generalisation hierarchy. This means that any object annotated to a GO term is also implicitly annotated to all of its ancestor (i.e., more general) terms. Just by considering the relation of a node to its parents, as in Table 3.1, the effect of this in terms of multiple testing is obvious. Using the notation from Equation 3.1 we see from Table 3.1 that the hierarchical structure of the Gene Ontology implies a dependency between the total number m out of n genes annotated to a GO term and the number (≥ m) annotated to any of its parents and hence all of its ancestors. The number of genes r in any sample of size s annotated to a GO term and its parents (≥ r) shows a similar dependency relation.

However, the Bonferroni correction assumes that each statistical test applied to the outcomes of an experiment (here, each set of genes annotated to a GO term) is independent. This can be expressed as an assumption that the values for r and m occurring in one statistical test have no relation to the values in any other test (s and n are fixed for any particular experiment). In the case of GO terms, as seen in the table, this is clearly incorrect. Once we have a GO term T annotating r genes, the probability of having other terms (i.e., the ancestors of T) annotating ≥ r genes is increased. A similar argument applies in the case of values of m.

In order to deal with this bias Boyle et al. [BWG+04] implemented an alternative correction factor based on randomly sampling s genes from the background set of n and applying Equation 3.1 to each GO term annotating any of this set. Repeating this procedure 1000 times gives, for each GO term, the proportion of apparently significant gene sets under the null hypothesis of random selection, which may then be used as an adjusted p-value. Although this way of computing a correction for p-values avoids independence assumptions, it took three orders of magnitude longer to compute.
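A hedged sketch of this resampling correction follows; the data structures and function are ours, not Boyle et al.'s code:

    import random
    from scipy.stats import hypergeom

    def resampled_significance(background, annotations, s, trials=1000, alpha=0.05):
        """For each GO term, the fraction of random s-gene samples in which
        the term appears nominally significant under Equation 3.1; this
        fraction can serve as an adjusted p-value.

        background: list of gene ids; annotations: dict gene -> set of terms."""
        n = len(background)
        term_m = {}                        # term -> annotated count in background
        for g in background:
            for t in annotations.get(g, ()):
                term_m[t] = term_m.get(t, 0) + 1
        hits = dict.fromkeys(term_m, 0)
        for _ in range(trials):
            counts = {}
            for g in random.sample(background, s):
                for t in annotations.get(g, ()):
                    counts[t] = counts.get(t, 0) + 1
            for t, r in counts.items():
                if hypergeom.sf(r - 1, n, term_m[t], s) < alpha:
                    hits[t] += 1
        return {t: h / trials for t, h in hits.items()}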

3.2.2 Bioinformatics applications of over-representation analysis

Despite the known issues in using statistical tests for over-representation analysis of GO categories, in a recent survey covering over 20 widely-used GO analysis bioinformatics tools, nearly all used either the hypergeometric test or the closely related Fisher's exact test [Riv07]. Rivals et al. [Riv07] conclude that the hypergeometric or two-sided Fisher's exact tests are appropriate, but do not address the issue of suitable corrections for multiple testing.

From discussion with biologists it is clear that they are familiar with the hypergeometric test or its alternatives for over-representation analysis and they are aware of multiple testing issues. Their approach accordingly is not to consider any such results of interest unless the p-values are well below the usual thresholds of 0.05 or 0.01.

For these reasons, at various points in this thesis we have used hypergeometric tests for statistical significance, usually with the Bonferroni correction for multiple testing, either in our own implementation in PHP based on the GO Perl library⁴, or the version for S. cerevisiae at FunSpec⁵; at some points further checking was also done using DAVID⁶, which uses Fisher's exact test, usually with the Benjamini and Hochberg correction for multiple testing (data not shown).

3.2.3 Coverage matrix

An approach developed by Carey [Car04] aims to account for GO structure to obtain an information measure for GO annotation. In this approach GO edge types are ignored and edges are regarded as instances of the single relation refines(C,P ) where C and P are child and parent nodes.

Terms to which a gene is annotated are referred to as the "associated terms" or simply associations for the gene. The set of associated terms for a gene is then its "ontological annotation". Figure 3.1 shows the separation between the ontology and the associations. In this diagram the ontology is a DAG, with the edge direction going from child to parent, i.e., an edge from term Ti to term Tj denotes an instance of the relation refines(Ti, Tj). In this case the "terms" are the letters {a, b, c, d, e, f, g, h}, and the ontology has a single edge type, shown by the solid arrows.

⁴ http://search.cpan.org/~cmungall/go-perl/
⁵ FunSpec (an acronym for "Functional Specification") is at http://funspec.med.utoronto.ca/
⁶ The Database for Annotation, Visualization and Integrated Discovery (DAVID) is at http://david.abcc.ncifcrf.gov

Figure 3.1: Example of a DAG-structured ontology (such as GO, but with a single edge type) and associated objects (such as genes) annotated to its terms. [The figure shows the ontology DAG over terms a–h, with the associations linking genes g1–g5 to their terms drawn below it.]

The associations in Figure 3.1 are shown in a separate box below the ontology. Here the set of "genes" {g1, g2, g3, g4, g5} are linked with their associated ontology terms by the dotted arrows. Note that the same gene can be annotated with multiple terms (e.g., g1), which we refer to as multiple category annotation, and that genes can be annotated to a term at any level in the ontology hierarchy (e.g., d as well as its parent b), referred to as multiple depth of annotation. These issues are discussed in more detail in Section 3.3.

Carey introduces the idea of an object-ontology complex in which the refinement relation from the GO DAG is represented as a binary (0,1) matrix Γ. Γ is a square V × V matrix where V is the number of terms in the ontology. For two terms Ti, Tj in the ontology, Γij = 1 if refines(Ti, Tj) for i > j, otherwise Γij = 0. Matrix powers Γ^k represent k-step refinements of ontology terms.

A second matrix M maps P objects (here genes) to V ontology terms. In accordance with GO annotation policy it is assumed that genes are annotated to their most specific terms. All 1-step refinements of the object annotations can then be computed as C1 = MΓ. This is generalised using the idea of coverage, where a term covers an object if that term or any refinement of it is associated with the object via the matrix M. The binary coverage matrix C (P × V) contains all such covers.

        a  b  c  d  e  f  g  h
  g1    1  1  1  1  0  1  0  1
  g2    1  1  0  1  0  0  0  0
  g3    1  1  0  0  0  0  0  0
  g4    1  0  1  0  0  1  0  1
  g5    1  0  1  0  0  1  0  1

Table 3.2: Coverage matrix for the example in Figure 3.1.

Table 3.2 shows the coverage matrix for the example in Figure 3.1. It is interesting to note that the coverage matrix corresponds to the inter-relation between the terminological and assertion components of an ontology represented in a Description Logic, where they are known as the TBox and ABox, respectively [BCM+03]. In this setting, specific genes from experiments can be thought of as ABox individuals, while the Gene Ontology itself forms a TBox.
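The coverage computation can be sketched with integer matrix arithmetic. The refines edges below are a partial reconstruction of Figure 3.1 (edges involving the unused terms e and g are omitted), and the gene associations are assumptions chosen so that the result reproduces Table 3.2:

    import numpy as np

    terms = list("abcdefgh")
    V = len(terms)
    idx = {t: i for i, t in enumerate(terms)}

    # refines(child, parent) edges, partially reconstructed from Figure 3.1
    Gamma = np.zeros((V, V), dtype=int)
    for child, parent in [("b", "a"), ("c", "a"), ("d", "b"), ("f", "c"), ("h", "f")]:
        Gamma[idx[child], idx[parent]] = 1

    # reach[i, j] = 1 iff term j is term i or an ancestor of it: the sum
    # I + Gamma + Gamma^2 + ... accumulates all k-step refinements
    reach = np.eye(V, dtype=int)
    power = Gamma.copy()
    for _ in range(V):
        reach = ((reach + power) > 0).astype(int)
        power = ((power @ Gamma) > 0).astype(int)

    # M: genes annotated to their most specific terms
    genes = ["g1", "g2", "g3", "g4", "g5"]
    assoc = {"g1": "dh", "g2": "d", "g3": "b", "g4": "h", "g5": "h"}
    M = np.zeros((len(genes), V), dtype=int)
    for g, ts in assoc.items():
        for t in ts:
            M[genes.index(g), idx[t]] = 1

    C = ((M @ reach) > 0).astype(int)   # rows match Table 3.2
    print(C)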

The coverage matrix can be used to calculate the probability of a term in the context of a specific object-ontology complex, i.e., a specific set of genes and their annotation in the Gene Ontology. The sum of column i for a term Ti is ni, i.e., the number of genes annotated to Ti or one of its refinements. The probability of that term appearing in the annotation of the gene set is P(Ti) = ni / n, where n is the number of occurrences of the most frequent term in the annotation, typically the root node of the ontology.

Carey proposes an information-based similarity measure between terms. The information (in bits) of term Ti is −log₂ P(Ti). However, to the best of our knowledge no statistical tests for over-representation analysis based on the coverage matrix approach have been developed.

3.3 Ontological annotation for machine learning

Over-representation analysis is a commonly used approach that fits well within the paradigm of exploratory data analysis. This is appropriate when the purpose of a biological experiment is "data-driven" rather than "hypothesis-driven", e.g., to group together similarly behaving genes in a microarray experiment by clustering expression profiles [BH02]. However, other experimental designs can be used that instead lead to results in which the data is divided into two or more groups. Then the task is to find a model or "hypothesis" that is the "best fit" to the data according to some criterion. In statistics this is known as discriminant analysis and in machine learning as supervised or classifier learning [Alp10].

If the experimental setting is appropriate then discriminant analysis can have advantages compared to over-representation analysis. For example, in the context of functional genomics experiments, there are problems with the approach of developing a probabilistic model and then estimating statistical significance. First, as discussed in Section 3.2.1, simple probabilistic models may fail to take into account dependencies due to structure in a data set. Second, the difficulty of constructing correct probabilistic models increases rapidly with the number and diversity of data sets to be used for integrative analysis, where the goal is to combine multiple sources of data from different experiments to obtain a more complete picture of the operation of biological systems.

However, there are also problems in the use of this kind of heterogeneous data with discriminant analysis or classifier learning. Algorithms of this type require example data in the form of fixed-width vectors of attribute values. Much of the data used, for example the graph data in the yeast data set analysis tools developed in Chapters 5 and 6, is not in this format. In this chapter we focus on Gene Ontology annotation data.

In order to handle this type of annotation as data for classifier learning there are two problems to be dealt with, namely multiple category annotation and multiple depth of annotation (see Figure 3.1). The problem of multiple category annotation is that a given gene may have several different functions in a cell, and it may be found in several cellular processes and in different locations. Therefore, any gene is annotated with all of the categories with which it has been associated in the published scientific literature. In any particular experimental setting, however, only a subset of the known annotations of a gene will be relevant. This is known as a multi-instance problem. For propositional machine learning algorithms this is not an easy problem to solve.

The problem of multiple depth of annotation arises from the hierarchical arrangement of ontology categories and the way in which these are used to annotate genes. For example, two genes may have a related function but be annotated at different levels of generality. Unfortunately, to a learning algorithm this relationship is not apparent; the categories are simply different. The learning algorithm could be modified to deal with the concept hierarchy; however, this is not straightforward and would have to be done over again for each learning algorithm to be used, which is impractical. This problem is compounded since not all genes have been annotated to the same extent through biological experimentation, leading to variations in multiple depth of annotation.

A solution to both problems may be provided by adapting the coverage matrix of Section 3.2.3 and using the ontological annotation as input to a supervised machine learning algorithm. Viewed in terms of graph theory, the coverage matrix represents the induced graph for a set of genes and their associations with respect to the DAG structure of the Gene Ontology. Non-zero entries on each row denote all terms on the set of paths from the associated terms for that gene to the root node of the ontology. The columns can be seen as Boolean attributes or features.

Given the large number of GO terms⁷ it is necessary to apply either machine learning techniques specialised for very high-dimensional data, or methods to select or construct "good" features. In this work we are concerned with learning human-comprehensible models, so we adopt the latter approach. The idea is to pre-process the data containing multiple category and multiple depth annotation and use properties of the probability distribution on the annotation categories to select subsets of "good" features, or to construct new "intermediate" features based on such subsets.

3.3.1 Feature selection

Feature selection (see [GE03] for a recent review) is typically a pre-processing step carried out on a data set prior to the application of machine learning. It can have a number of objectives, such as increasing the efficiency (e.g., faster runtimes) or the effectiveness (e.g., predictive accuracy) of learning. It can also contribute towards better understanding of the domain, due to simplification of data sets and learned models. In this work we focus on supervised learning, and we have used two types of feature selection: feature ranking and feature subset selection.

3.3.1.1 Feature ranking

In feature ranking for classifier learning the goal is to evaluate each feature separately in terms of its ability to predict the class. This approach uses an information theory-based heuristic to evaluate the usefulness of individual GO terms for predicting the class label. Specifically, the method seeks to maximize information gain (similar to the method used in decision-tree learning [Qui93]). This is important, as a good feature should partition the search space into separable classes. For a feature (or attribute) with values A = {a1, a2, ...} and a class with values C = {c1, c2, ...}, equations (3.2) and (3.3) define the entropy of the class C before and after observing feature A [HH03]:

⁷ Over 38,000 as of September, 2012.

$$H(C) = -\sum_i p(c_i) \log_2 p(c_i) \qquad (3.2)$$

$$H(C|A) = -\sum_i p(a_i) \sum_j p(c_j \mid a_i) \log_2 p(c_j \mid a_i) \qquad (3.3)$$

Here p(ci) is the prior probability of class value ci, and p(cj|ai) is the conditional probability of class value cj given feature value ai. The value of the entropy H ranges from 0 to 1. If all the values of a feature belong to the same class then the entropy is zero. Otherwise, if the values of a feature are random with respect to the class, the entropy is one.

The mutual information I(C; A) is computed as:

$$I(C; A) = H(C) - H(C \mid A) \qquad (3.4)$$

In Equation (3.4), the amount by which the entropy of C decreases after observing A is called the information gain (or mutual information). As this is a symmetric function, the amount of information gained about C after observing A equals that gained about A after observing C. High information gain indicates strong interdependence between the feature values and the class values, and that the feature may be useful in a classifier. Once the information gain is computed for all features, they can be ranked and features with low ranks eliminated.

The aim of feature selection is to include only a subset of the original features, namely those more relevant to the target concept or containing more information about the class label. For each feature in the coverage matrix (i.e., the induced ontology graph) for a gene set, the algorithm computes the information gain. The set of features covering at least a minimum number of genes M is ranked and the top K among them are selected for use in the data set. The parameters M and K filter out features with too few genes or with too little information gain. For example, in the experiments reported in Section 3.4 we used M = 2 and K was set to include the top 75% of the ranked features. The algorithm for feature selection by information gain ranking, called "IG Ranker", is shown in Figure 3.2.

Input: coverage matrix Γ, M, K, class C
Output: subset E of K top-ranked terms
Begin
  E = ∅
  For each term Ti in Γ
    Let Ni be the number of genes annotated to Ti
    If Ni ≥ M then
      Compute mutual information I(C; Ti)
      If I(C; Ti) ≥ 0 then
        Add Ti to E
      EndIf
    EndIf
  EndFor
  Rank all terms in E in decreasing order of I(C; Ti)
  Return top K terms
End

Figure 3.2: IG Ranker – feature selection by information gain ranking.
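A runnable Python rendering of IG Ranker is sketched below; the input encoding (terms mapped to sets of gene indices, class labels as a list) is ours, not the thesis implementation:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def ig_ranker(coverage, classes, M=2, K=10):
        """coverage: dict term -> set of gene indices; classes: list of labels."""
        base = entropy(classes)
        scored = []
        for term, genes in coverage.items():
            if len(genes) < M:                  # too few annotated genes
                continue
            inside = [classes[g] for g in genes]
            outside = [c for g, c in enumerate(classes) if g not in genes]
            n = len(classes)
            cond = (len(inside) / n) * entropy(inside)
            if outside:
                cond += (len(outside) / n) * entropy(outside)
            gain = base - cond                  # I(C; T) for the Boolean feature
            if gain >= 0:
                scored.append((gain, term))
        scored.sort(reverse=True)
        return [term for _, term in scored[:K]]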

3.3.1.2 Feature subset selection

The goal here is to select a "good" subset of features for a data set. Clearly a search for the "best" subset is not practical in general, and so many heuristic approaches have been proposed. In this work we selected the correlation-based feature selection (CFS) method of Hall [Hal00].

Details of the method are beyond the scope of this thesis. However, the intuitive basis for the heuristic measure used is that good subsets of features should be highly correlated with the class (i.e., likely to be relevant) while inter-correlation between features should be low (i.e., unlikely to be redundant).

The CFS algorithm implements forward selection, a greedy (hill-climbing) search method which starts with an empty set of features and at each iteration adds the feature with the highest value of the heuristic measure. The search terminates when no increase in this value can be obtained by adding further features to the subset.
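The greedy search itself is straightforward to sketch. Since CFS's particular correlation-based merit function is beyond the scope of this chapter, it is passed in as a parameter here; the code is illustrative, not Hall's implementation:

    def forward_selection(features, merit):
        """Grow a feature subset greedily while the merit score improves.

        features: iterable of candidate features;
        merit: function mapping a set of features to a score."""
        selected, best = set(), float("-inf")
        improved = True
        while improved:
            improved = False
            for f in set(features) - selected:
                score = merit(selected | {f})
                if score > best:
                    best, best_feature, improved = score, f, True
            if improved:
                selected.add(best_feature)
        return selected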

3.3.2 Feature construction

The coverage matrix is a bipartite graph, since it denotes a set of edges between genes and GO terms. It can therefore be represented as a "cross table" or formal context in the framework of formal concept analysis [GW97]. A method of feature construction can then be derived, based on our previous results [Bai02]. The example genes in the data set can then be re-expressed in terms of new intermediate features which define combinations of GO categories. Constructed features can then be selected by the learning algorithm based on their utility in forming accurate models to predict the class of the genes.

3.3.2.1 Formal concept analysis

Detailed coverage of Formal Concept Analysis (FCA) is given in [GW97]. In this section we follow the treatments of [GM94a, CR93], since they are more oriented towards machine learning; however, some naming and other conventions have been changed.

Definition 3.1. Formal context. A formal context is a triple ⟨D, O, R⟩, where D is a set of descriptors (attributes), O is a set of objects and R is a binary relation such that R ⊆ D × O.

The notation ⟨x, y⟩ ∈ R, or alternatively xRy, is used to express the fact that a descriptor x ∈ D is a property of an object y ∈ O.

Definition 3.2. Formal concept. A formal concept is an ordered pair of sets, written ⟨X, Y⟩, where X ⊆ D and Y ⊆ O. Each pair must be complete with respect to R, which means that X′ = Y and Y′ = X, where X′ = {y ∈ O | ∀x ∈ X, xRy} and Y′ = {x ∈ D | ∀y ∈ Y, xRy}.

The set of descriptors of a formal concept is called its intent, while the set of objects of a formal concept is called its extent. For a set of descriptors X ⊆ D, X is the intent of a formal concept if and only if X″ = X, by composition of the ′ operator from Definition 3.2. A dual condition holds for the extent of a formal concept. This means that any formal concept can be uniquely identified by either its intent or its extent alone. Intuitively, the intent corresponds to a kind of maximally specific description of all the objects in the extent.

The correspondence between intents and extents of complete concepts is a Galois connection between the power set P(D) of the set of descriptors and the power set P(O) of the set of objects. The Galois lattice L for the binary relation is the set of all complete pairs of intents and extents, with the following partial order. Given two concepts N1 = ⟨X1, Y1⟩ and N2 = ⟨X2, Y2⟩, N1 ≤ N2 ↔ X1 ⊇ X2. The dual nature of the Galois connection means we have the equivalent relationship N1 ≤ N2 ↔ Y1 ⊆ Y2.

The formal context ⟨D, O, R⟩ together with ≤ defines an ordered set which gives rise to a complete lattice. The following version of a theorem from [GM94a] characterizes concept lattices.

Theorem 3.1. Fundamental theorem on concept lattices [GM94a]. Let ⟨D, O, R⟩ be a formal context. Then ⟨L; ≤⟩ is a complete lattice⁸ for which the least upper bound (Sup) and greatest lower bound (Inf) are given by

$$\mathrm{Sup}_{j \in J}(X_j, Y_j) = \Bigl\langle \bigcap_{j \in J} X_j,\ \bigl(\bigcup_{j \in J} Y_j\bigr)'' \Bigr\rangle$$

$$\mathrm{Inf}_{j \in J}(X_j, Y_j) = \Bigl\langle \bigl(\bigcup_{j \in J} X_j\bigr)'',\ \bigcap_{j \in J} Y_j \Bigr\rangle$$

Since we are concerned with concepts formed from sets of descriptors, the partial order as well as the Sup and Inf definitions are given so as to relate to lattices in machine learning, rather than the convention typical in formal concept analysis. That is, the supremum Sup of all nodes in the lattice is the "most general" or top (⊤) node and the infimum Inf is the "most specific" or bottom (⊥).

3.3.2.2 Feature construction from concept lattices

Treating the coverage matrix as a formal context (Definition 3.1), where genes are objects and GO terms are descriptors, enables the construction of concept lattices in which the formal concepts (Definition 3.2) contain sets of GO terms that group together sets of genes. The terms shared by such a group of genes indicate the biological properties that they have in common.

In our previous work [Bai02] we investigated the use of concept lattices for both unsupervised and supervised learning. Both cases required the use of an information-based measure on formal concepts, similar to that of Carey [Car04] discussed in Section 3.2.3. While both approaches are based on probability, our measure was motivated by the compressibility or algorithmic complexity [Cha87] of a concept. The compressibility of structured data objects, such as strings in a formal language or, as in this case, formal concepts in a lattice, is inversely related to the probability of finding such objects by chance.

This approach was later used by us for ontology learning, essentially by extracting concepts from a concept lattice and combining them in a structured (propositional) logic program. However, while this is suitable for unsupervised learning, it ignores the distribution of classes within the set of objects.

⁸ Given a non-empty ordered set P, if for all S ⊂ P there exists a least upper bound and a greatest lower bound then P is a complete lattice.

[Figure 3.3 shows the lattice with top concept ⟨{a}, {g1, g2, g3, g4, g5}⟩; below it ⟨{a, b}, {g1, g2, g3}⟩ and ⟨{a, c, f, h}, {g4, g5}⟩; then ⟨{a, b, d}, {g1, g2}⟩; and at the bottom ⟨{a, b, c, d, f, h}, {g1}⟩.]

Figure 3.3: Concept lattice for the coverage matrix of Table 3.2.

Therefore, to use concepts in supervised learning we used a pseudo-MDL measure in which compressibility was combined with the entropy of the class distribution of the examples in the extent of the concept. The intuition behind this is that concepts that will be useful in supervised learning will tend to have high compressibility and low class entropy; i.e., they will potentially lead to high accuracy classifiers.

3.3.2.3 Selecting predictive constructed features

For the current work, where we are focused on the specific problem of handling Gene Ontology annotation in supervised learning, we implemented a simpler approach to feature construction. This method focuses on selecting discriminative concepts; those that discriminate well between different class values are said to be predictive.

The procedure to construct features from a concept lattice for use in supervised learning was as follows. The method shares two parameters with information gain ranking (Section 3.3.1.1): M, the minimum number of examples to be covered by any selected concept, and K, the maximum number of concepts to be selected.

1) for each of the genes in the training set, generate the gene's GO coverage as described in Section 3.2.3. For the example of Figure 3.1 this gives a set of "objects", i.e., genes:

   g1 ← a, b, c, d, f, h.
   g2 ← a, b, d.
   g3 ← a, b.
   g4 ← a, c, f, h.
   g5 ← a, c, f, h.

2) construct a concept lattice L from the objects shown as clauses in step 1).

3) for each formal concept in L with extent containing ≥ M objects, evaluate the class distribution of the objects in the concept.

4) sort the concepts identified at step 3) in decreasing order of predictive accuracy for the majority class of objects in the concept.

5) select the top K concepts in the order identified at step 4), or simply all of those with accuracy above that of the frequency of the majority class (e.g., for a two-class problem, > 0.5).

6) construct a table containing, for each gene, a row noting whether, for each of the concepts identified at step 5) the gene is in the concept (i.e., feature is true) or not (feature is false).

7) join the table constructed at step 6) with the class values (further attributes can also be joined in the same way) to form the training set for supervised machine learning.
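The steps above can be summarised compactly. The following Python rendering is our sketch, not the thesis's Conduce implementation: the lattice is enumerated naively (adequate only for small contexts), and the class labels are hypothetical, supplied purely to exercise steps 3)-7).

from collections import Counter
from itertools import combinations

def concepts(coverage):
    # Enumerate formal concepts (extent, intent) by closing the common
    # attributes of every non-empty gene subset; naive but fine for toy data.
    genes = list(coverage)
    found = set()
    for r in range(1, len(genes) + 1):
        for subset in combinations(genes, r):
            intent = frozenset.intersection(*(frozenset(coverage[g]) for g in subset))
            extent = frozenset(g for g in genes if intent <= set(coverage[g]))
            found.add((extent, intent))
    return found

def select_concept_features(coverage, labels, M=2, K=None):
    # Steps 3)-5): keep concepts with extent >= M, rank by majority-class
    # accuracy within the extent, then take the top K (or all above baseline).
    baseline = Counter(labels.values()).most_common(1)[0][1] / len(labels)
    scored = []
    for extent, intent in concepts(coverage):
        if len(extent) < M:
            continue
        acc = Counter(labels[g] for g in extent).most_common(1)[0][1] / len(extent)
        scored.append((acc, extent, intent))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:K] if K else [s for s in scored if s[0] > baseline]

def feature_table(coverage, selected):
    # Steps 6)-7): one Boolean column per selected concept; the class column
    # would then be joined on for supervised learning.
    return {g: [g in extent for _, extent, _ in selected] for g in coverage}

coverage = {"g1": "abcdfh", "g2": "abd", "g3": "ab", "g4": "acfh", "g5": "acfh"}
labels = {"g1": "ind", "g2": "ind", "g3": "rep", "g4": "rep", "g5": "rep"}  # hypothetical
print(feature_table(coverage, select_concept_features(coverage, labels, M=2)))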

This procedure is actually much more efficient than that of [Bai02] since it only involves constructing the concept lattice once. When this is done, concepts can be evaluated quickly, and there is no expensive lattice revision as in the earlier method.

3.4 Experimental results

3.4.1 Biological background: cellular network response to stress

The number of organisms for which the complete genome (DNA sequence) is available continues to grow at an increasing rate. Meanwhile there have also been major advances in laboratory techniques to analyse complex cellular processes. This has led to a new era in cell biology, termed systems biology, in which responsive phenotypes, i.e., the measurable characteristics of the organism in response to environmental or genetic perturbations, can be investigated genome-wide, i.e., by collecting data on the activity of all the organism’s genes simultaneously. In this way we can investigate the cellular network response of the genes that give rise to an observed phenotype as the downstream effect of an external stimulus through signal transduction.

For example, when cells adapt to sudden changes in the environment, cellular network responses include the action of sets of transcription factors (proteins) to activate sets of genes involved in biochemical pathways. Such a responsive sub-network of the cell is referred to as the genetic regulatory network (GRN). The protein products of co-expressed genes in a GRN combine to form interacting molecular machines that produce a responsive cellular phenotype. This responsive sub-network is partly described by the protein-protein interaction (PPI) network. In turn, proteins act to regulate cellular metabolism in pathways of biochemical reactions and, by subtle feedback mechanisms, their own GRNs and PPI networks.

The baker’s and brewer’s yeast Saccharomyces cerevisiae is a key model organism for systems biology, due to the ease with which genetic manipulation can be carried out. Virtually all areas of cell biology have benefited from the use of yeast as a model organism to study processes and pathways relevant to higher eukaryotes. Importantly, many fundamental processes in yeast are conserved through to humans. Data describing yeast cellular network responses are derived from high-throughput genome-wide experimental techniques, the development of which continues unabated. However, although a decade has passed since the sequencing of the complete yeast genome, there is still a sizeable subset of yeast genes with no known molecular function. Even for those with a designated function this often denotes only part of their likely cellular role. Yet yeast is one of the most intensively studied organisms, with a relatively small genome. A key reason that knowledge on gene function is not greater is that the computational techniques and tools that will provide biologists with systematic ways to integrate the data resources and generate hypotheses about the function of cellular networks are not yet in place. This study aims to contribute towards that goal.

3.4.2 Overview of experiments

We evaluated our approach to learning from ontological annotation with a comparative study of different methods on two biological data sets. We selected two problems on yeast systems biology and formulated them as classification problems. In both cases the task was to learn a classifier to predict a selected behaviour or phenotype of interest for genes (or their products), given a profile of other measures for those genes, i.e., a set of attributes or features.

We chose real problems, in the sense that any progress on them may be of interest to biologists. This enabled us both to obtain quantitative results on the performance of different methods and validate the learned classifiers in biological terms. The first case study was on the prediction of protein expression in response to addition of hydrogen peroxide (H2O2) to yeast cells to produce “oxidative stress”. The second task was more difficult — predicting the behaviour of genes when yeast cells were challenged by a range of multiple stress agents.

Machine learning typically is focused on the generation of predictive models, e.g., classifiers with high predictive accuracy on data sets. However, such models can sometimes lack explanatory power. In this work we are concerned with learning predictive models, but we also require them to be comprehensible; hence our approach uses decision tree learning [Qui93, WF05].

Since this was a preliminary study, and we wanted to focus on the effects of representation, i.e., attributes used, rather than the choice of learning algorithm, we limited our attention to a single algorithm. We also had a requirement that the classifiers should be comprehensible to biologists. This indicated an approach such as decision-tree learning. We selected the Weka implementation [WF05] of the C4.5 decision-tree induction system [Qui93], called J48. For these experiments and those in Section 3.4.4 we used the default parameters for J48 and 10 replicates of 10-fold cross-validation to obtain a mean predictive accuracy. An overview of this setting is in Figure 3.4.
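The evaluation protocol is easy to mirror outside Weka. A minimal sketch follows, assuming scikit-learn's CART decision tree as a stand-in for J48 (the two are similar but not identical) and random Boolean data in place of the real feature tables:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def repeated_cv_accuracy(X, y, replicates=10, folds=10, seed=0):
    # 10 replicates of 10-fold cross-validation, averaged, as in the text.
    accs = []
    for r in range(replicates):
        cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed + r)
        accs.extend(cross_val_score(DecisionTreeClassifier(random_state=r), X, y, cv=cv))
    return float(np.mean(accs))

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(92, 6))  # stand-in: 92 genes x 6 stress attributes
y = rng.integers(0, 2, size=92)       # stand-in: induced/repressed labels
print(repeated_cv_accuracy(X, y))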

3.4.3 Case study 1: predicting protein expression

A preliminary “proof of principle” experiment was carried out to test the main aims of this work. First, we aimed to investigate whether a supervised learning approach using a standard machine learning algorithm was suitable to perform an integrative analysis of high-throughput results from multiple molecular biology experiments. A second aim was to compare feature selection and construction approaches as described above to investigate the use of Gene Ontology annotation in such integrated data sets.

3.4.3.1 Experiment 1: protein expression given gene expression

As an example application of integrating data sets on yeast genes, we uploaded two sets of data on the same 92 genes to our database. These two data sets give different “snapshots” of the yeast cellular network response to the addition of hydrogen peroxide to the cells’ growth environment. This environmental oxidant places the yeast cells under oxidative stress, causing the suppression of some normal functioning and the activation of cellular defence mechanisms. This provides a useful experimental tool for studying cellular responses to environmental stress. As an additional test, we also included data on other stresses to see if any effect of a common stress response system would be observed.

Figure 3.4: Experimental setup for machine learning with ontological annotation. A MySQL database version of GO is downloaded that is updated every month. This allows the coverage matrix to be generated efficiently via SQL queries and produce integrated data sets. (The flowchart shows the source data (genomics, proteomics and screen data) and the Gene Ontology held in the MySQL database; extraction of genes for the selected class and construction of the coverage matrix; feature construction via Formal Concept Analysis, or feature selection via information gain ranking, CFS, or a p-value filter (< 0.01); assembly of the training data; and decision tree learning (J48) evaluated by predictive accuracy.)

Classification task

In the paper by Godon et al. [GLL+98] the authors identified 56 proteins whose synthesis was stimulated and 36 that were repressed under oxidative stress caused by exposure of yeast cells to hydrogen peroxide. This was a proteomics study, i.e., the results obtained reflected changes in the total composition of proteins in the cell using comparative two-dimensional gel electrophoresis.

Attributes

In the paper by Causton et al. [CRK+01] microarray data was collected on the cellular network response by yeast to six different environmental stresses, including hydrogen peroxide. In contrast to the proteomics study above, this data was on the transcriptional response in terms of mRNA levels in the cell observed at 8-10 time points over a period of around 2 hours. This reflects genes that are “turned on” or “turned off” in response to the addition of the stress-inducing agent to the cellular environment. Data for each stress condition comprises a time series, with mRNA levels recorded at irregular intervals following initial exposure to the stressor.

Results

The results are summarised in the histogram of Figure 3.5. These show that gene expression under H2O2 is a good predictor of protein expression (mean accuracy 85.3%). Of the remaining five conditions, three (heat, acid and alkali) show some predictivity (mean accuracies 69.0%-74.2%). The other two conditions, salt and sorbitol, have mean accuracies of 65.8% and 64.8%, respectively. However, this is only slightly above the baseline accuracy of simply predicting the majority class for all examples (60.9%), shown as the dotted line across the histogram in Figure 3.5.

Figure 3.5: Accuracy of predicting protein expression given six microarray data sets (alkali, acid, H2O2, heat, salt and sorbitol, from the Causton et al. study; see text for details). The dotted horizontal line shows the baseline accuracy of 61% obtained by simply predicting the majority class for all genes.

3.4.3.2 Experiment 2: protein expression given GO features

Classification task

The classification task was the same as that of Section 3.4.3.1.

Attributes

The set of attributes was limited to those Gene Ontology categories in the coverage matrix of the 92 genes to be classified. Each of the three GO sub-ontologies biological process (BP), cellular component (CC) and molecular function (MF) was used to generate separate coverage matrices. Each set of GO categories was then used as the basis for four methods of feature selection or construction, as follows.

No Selection no feature selection or construction; used as a baseline for comparison

IG Ranker the information gain ranking method as described in Section 3.3.1.1, with M = 2 and K set to select the top-ranked 75% of features (a sketch of the gain computation follows this list)

CFS correlation-based feature selection as described in Section 3.3.1.2; the Weka implementation was used with default parameter settings

Conduce uses formal concept analysis to construct new features ranked by their predictive ability as described in Section 3.3.2.3 with M = 2 and K set to select all features
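For reference, the gain computed by the IG Ranker on a Boolean GO feature reduces to the class entropy minus the weighted entropy on either side of the annotation. A small sketch with invented toy data (details of Section 3.3.1.1, such as the handling of M and the 75% cut-off, are omitted):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_on, labels):
    # IG of a Boolean GO feature: H(class) minus the weighted entropy of the
    # class within genes having / not having the annotation.
    on = [l for f, l in zip(feature_on, labels) if f]
    off = [l for f, l in zip(feature_on, labels) if not f]
    h = entropy(labels)
    for part in (on, off):
        if part:
            h -= (len(part) / len(labels)) * entropy(part)
    return h

# Hypothetical data: one GO-term column over five genes with two classes.
print(information_gain([1, 1, 0, 0, 0], ["ind", "ind", "rep", "rep", "rep"]))  # ~0.971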

Results

The results are summarised in the histogram of Figure 3.6. These show that Gene Ontology categories can provide predictive power, although less so than gene expression under H2O2. The highest performance was obtained with the CFS method of feature subset selection on BP (mean accuracy 78.8%). However, the performance range between sub-ontologies for each selection or construction method is relatively large. Conduce has the smallest range (mean accuracy 70.8% - 76.8%) and No Selection the largest (mean accuracy 61.3% - 75.5%). For each selection method, the ordering in terms of decreasing accuracy was BP, MF and CC.

Figure 3.6: Accuracy of predicting protein expression given cellular component (C), molecular function (F) and biological process (P) sub-ontologies of the Gene Ontology. Four feature selection and construction methods are compared for each sub-ontology (see text for details). The dotted horizontal line shows the baseline accuracy of 61% obtained by simply predicting the majority class for all genes.

3.4.3.3 Experiment 3: protein expression given microarray data and GO features

With the addition of features from the molecular function (MF) ontology in this experiment the predictive accuracy of decision tree learning showed a slight increase to 87% correct, with a slightly smaller tree size (5 leaves, 9 nodes), compared to the use of the Causton et al. [CRK+01] microarray data alone. The tree learned is shown in Figure 3.7.

Figure 3.7: A decision tree for protein induction/repression learned with Gene Ontology features. Attribute tests are at internal nodes (“Peroxide t” means microarray data at time t), classifications are at leaves. See text for details. Rendered textually:

node_25 = t: Protein repressed by H2O2 (2.0)
node_25 = f:
|   Peroxide 0 > 3881.5: Protein repressed by H2O2 (17.0/1.0)
|   Peroxide 0 <= 3881.5:
|   |   Peroxide 60 > 1161.6: Protein induced by H2O2 (42.0)
|   |   Peroxide 60 <= 1161.6:
|   |   |   Peroxide 120 <= 252.4: Protein induced by H2O2 (8.0)
|   |   |   Peroxide 120 > 252.4: Protein repressed by H2O2 (23.0/5.0)

In this tree the feature “node 25” stands for the following set of molecular function GO terms:

%molecular function ; GO:0003674
%catalytic activity ; GO:0003824
%transferase activity ; GO:0016740
%transferase activity, transferring alkyl or aryl (other than methyl) groups ; GO:0016765
%methionine adenosyltransferase activity ; GO:0004478

Although this is biologically plausible, the question of whether it has significance in the context of this data set is left to further work.

We also investigated adding features from the other sub-ontologies, biological process (BP) and cellular component (CC) separately, in all pairwise combinations, and all together using Conduce (see above). However, at around 79-80%, predictive accuracy was not as high for these combinations.

3.4.4 Case study 2: predicting general vs. specific stress-response

The Saccharomyces Genome Deletion Project (Winzeler et al. [WSA+99]) is a set of yeast strains in each of which exactly one gene from the genome has been systematically removed. Use of this set of strains [G+02] has led to the discovery that a relatively small subset of genes in the genome are essential for growth under favourable conditions, with the rest presumably remaining in the genome due to their evolved role in enabling the organism to survive a range of stresses. About 4800 of these deletant strains are viable (in the remainder the deleted gene is essential for life). Strains from this set can then be studied to see the effects of genes under different conditions. Roughly, the methodology enables the inference of a functional role for a gene in response to some stress if its deletant shows some growth defect under that stress.

Biologists have carried out many “screens” of the deletant set — selecting a subset of genes and subjecting each of the corresponding deletants to that stress. Unlike our first case study (Section 3.4.3), where a fairly clear relationship between the target class and the microarray data would be expected, the screen data presents a more interesting yet more difficult problem, since many mechanisms could be hypothesised to be the cause of the observed effect of a gene deletion. In a nutshell, case study 1 had the aim of learning to predict an intra-cellular response given measurements of intra-cellular concentration of mRNA plus GO annotation, whereas in this case study we aim to learn to predict an extra-cellular response, given the same data. This is in fact more representative of a typical laboratory analysis.

Our collaborating yeast biologist Dr. Mark Temple assembled a dataset of 26 screens on 1094 genes from various different laboratories. Each screen represents deletant sensitivity to a chemical or environmental stress: the “phenotype” is that cells are sensitive (to a given stress) when the gene is deleted. These included multiple screens from Thorpe et al. [TFA+04] and Tucker and Fields [TF04]. Initial results on individual screens from both these studies indicated two issues with the approach we developed in case study 1.

First, for each individual screen we have, in most cases, simply a set of genes deemed “sensitive”. For supervised learning, it is not clear what the set of “not-sensitive” genes should be. Since only a subset of the genes are tested in a screen, one possibility is to take those deletants tested but not found to be sensitive. But in most cases this is a much smaller proportion than those that are sensitive, leading to a very skewed class distribution. It also ignores the fact that other genes not tested in the screen could be “sensitive” or “not-sensitive”.

Second, Thorpe et al. [TFA+04] noted that for many of the screens in their study there was little correlation between microarray data and the deletant sensitivity phenotype. This was borne out in our attempt to use the microarray data under various stress conditions to predict the response to H2O2 stress in screens from both Thorpe et al. [TFA+04] and Tucker and Fields [TF04], where we found no improvement in predictive accuracy over predicting the majority class.

Classification task

Since many of the 26 screens contain genes sensitive only to that stress, we defined instead a “meta-level” classification problem, in terms of the number of screens under which a deletant was found to be sensitive. We found 382 deletants sensitive to exactly one screen and 290 deletants sensitive to exactly two screens. While no gene was sensitive to more than eleven screens, we found that 422 out of 1094 were sensitive to 3 or more screens. We hypothesised that these could be “key” deletants — part of a general cellular stress response system — whereas the remainder could be involved in responses more specific to the particular stress-inducing agent. The training set therefore contained genes sensitive to 3 or more screens labelled “general-responders” and the rest labelled “specific-responders”. We will refer to this task as predicting the “response-class”.
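The labelling rule is simple enough to state in a few lines. A sketch follows, with an invented mapping from genes to the screens in which their deletants were sensitive (gene and screen names are illustrative only):

def response_class(screens_sensitive, threshold=3):
    # Genes whose deletants are sensitive in >= threshold screens are labelled
    # general-responders; the rest are specific-responders.
    return {gene: ("general-responder" if len(hits) >= threshold
                   else "specific-responder")
            for gene, hits in screens_sensitive.items()}

print(response_class({"gene1": {"s1", "s2", "s3"}, "gene2": {"s1"}}))
# {'gene1': 'general-responder', 'gene2': 'specific-responder'}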

Attributes

We envisaged three experiments for this classification task. In the first experiment the task was to predict the response-class given the microarray data sets as in Section 3.4.3.1. However, as we found with the H2O2 screens, and as observed by Thorpe et al. [TFA+04], there was no correlation between the microarray data and the response-class. This resulted in no decision tree being found that exceeded the baseline accuracy of 60% obtained by simply predicting the majority class for all genes. This led us to abandon the experiment.

The second experiment involved predicting the response-class given GO features as in Section 3.4.3.2. We used the same approach as before, with the exception of Conduce, which was unable to run to completion on the enlarged training set. However, we also investigated a method of feature selection based on ranking all GO terms by their p-value computed by the hypergeometric distribution of Equation 3.1, as follows. For all genes in the general-responders (resp. specific-responders) class, we collected all GO terms with p-value less than 0.1. The union of these sets of GO terms was then used as a set of Boolean features — a feature has the value true if a gene is annotated to the term, otherwise it is false.
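The per-term p-value in this ranking is the upper tail of a hypergeometric distribution. Assuming Equation 3.1 takes the usual over-representation form, it can be computed as below (the worked numbers are illustrative, not results from the study):

from scipy.stats import hypergeom

def go_term_pvalue(N, K, n, k):
    # P(X >= k): the chance that k or more of the n genes in a class carry a
    # GO term that annotates K of the N genes overall.
    return float(hypergeom.sf(k - 1, N, K, n))

# e.g. 1016 annotated genes, a term covering 40 of them, 409 general-responders
# of which 30 carry the term (illustrative numbers only):
print(go_term_pvalue(1016, 40, 409, 30))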

Thirdly, we aimed to investigate the combined representation of microarray data plus GO features. However, as with the first experiment, we found the microarray data gave no increase in predictive accuracy, and the results were not significantly different from using GO features alone.

In fact 78 genes did not have GO annotations, so they were removed from the dataset, giving 409 genes labelled “general-responders” and 607 labelled “specific-responders”, for a total of 1016.

Results

The results are summarised in the histogram of Figure 3.8 and Table 3.3. We also obtained some initial results on the method of feature ranking using the hypergeometric distribution p-values. We selected the biological process sub-ontology, as that was the best-performing, and achieved a 10-fold cross-validation accuracy of 82.5% in a tree of only 19 leaves. Since these were not obtained using the 10 replicates of 10-fold cross-validation methodology we have omitted them from the table.

               No Selection   IG Ranker   CFS
Accuracy (%)   62.4           66.7 ◦      67.4 ◦

Table 3.3: Accuracies of predicting general vs. specific deletant sensitivity to multiple stresses with GO biological process features. Note that ◦, • denote statistically significant improvement or degradation, respectively, with respect to “No Selection”.

3.5 Discussion

We have investigated two separate but related problems empirically to test our approach to incorporating GO terms into machine learning. In both we compared the use of gene expression data alone, GO terms alone, and both together. The first problem was one of predicting an “intra-cellular” phenotype, the second predicting an “extra-cellular” phenotype. In the first case, we found that both types of information in isolation were comparable in terms of predictive accuracy, and their combination yielded no significant extra performance. The second case was a considerably more difficult problem, and here the use of GO terms enabled some gains in predictive accuracy, although gene expression data, separately or in combination, gave no improvement.

Figure 3.8: Accuracy of predicting general vs. specific deletant sensitivity to multiple stresses given cellular component (C), molecular function (F) and biological process (P) sub-ontologies of the Gene Ontology. Three feature selection and construction methods are compared for each sub-ontology (see text for details). The dotted horizontal line shows the baseline accuracy of 60% obtained by simply predicting the majority class for all genes.

We conclude that our results support the hypothesis that GO annotation can be useful in the context of supervised machine learning.

3.5.1 Case study 1: predicting protein expression

There is a relationship between protein synthesis and gene expression. For example, Godon et al. [GLL+98] noted that the alterations they observed in the expression of proteins in response to hydrogen peroxide would be likely to involve a transcriptional component. In particular, the observed genomic response to oxidative stress strongly suggested an element of transcriptional control. Accordingly, we designed a classification task to predict the protein response observed in the Godon et al. [GLL+98] data in terms of attributes involving transcriptional response to hydrogen peroxide, or any of the other stressors.

Two effects are suggested by the results of Figure 3.5. One is a specific response to H2O2 by transcription of genes leading to expression of corresponding proteins. The other is that some non-specific response, varying over the non-H2O2 stresses, may be occurring in the transcriptional data and is correlated with a subset of the proteins expressed under H2O2 stress.

The effect of using Gene Ontology categories alone for predicting protein expression as seen in Figure 3.6 is interesting. First, the highest accuracy obtained using microarray data alone is better than the highest accuracy obtained using any of the feature selection or construction methods on GO categories alone. Second, none of the feature selection or construction methods is clearly superior to any other. Although CFS with BP has the highest accuracy, Conduce on BP has the next highest accuracy, and has the least variability over all three sub-ontologies.

Third, it is noticeable that accuracy for each of the sub-ontologies with all four different methods follows the ordering BP, MF and CC. Interestingly, this is the same ordering as the number of terms in each — in fact, BP had around twice the number of terms as MF (approximately 16,000 vs. 8,000) at the time of the experiments (with approximately 2,000 terms for CC).

One interpretation of this is that GO categories are predictive of a reasonably large subset of the genes in the training set, as long as the ontology has sufficient “resolution” — in other words, that there is sufficient detail and coverage within the ontology to provide accurate definitions of the target class. Based on the number of terms alone, the BP would a priori be the most likely to have the highest resolution.

For BP and MF there is relatively little difference between the accuracy obtained by using all GO terms in the training set and the complicated methods of feature subset selection and construction, CFS and Conduce. Using the simpler feature ranking by information gain actually reduced accuracy relative to using all the features, presumably since this leaves out some useful features. However, on CC it is interesting that the best accuracy was obtained with Conduce. This suggests that there may be some advantage in feature construction when the ontology may have lower resolution, since the “formal concepts plus decision tree” representation is the most powerful of those used in our approach.

3.5.2 Case study 2: predicting general vs. specific stress-response

There are two principal messages from the results. First, as expected, microarray data alone or in combination with GO features makes no significant contribution to being able to predict which response class a gene is in. Second, GO features do enable a significant improvement in accuracy over predicting the default class, although only when some method of feature subset selection is used. Taken together with the results for predicting protein expression, this supports the central hypothesis of the chapter.

Although further work is needed to confirm the results when using a ranking based on the hypergeometric test, it is instructive to examine the tree learned from the biological process ontology. There are 53 GO terms with a p-value below 0.01 for the general responder class, and only 8 for the specific responders. Only one of these terms is in common between the two subsets. This suggests that using these GO terms as features will lead to good predictive accuracy, and that is the case.

Examining the tree, the top 6 out of 19 leaves each contain ≥ 10 genes. The corresponding GO terms are:

GO:0006350 transcription [66]
GO:0006974 response to DNA damage stimulus [40]
GO:0015031 protein transport [38]
GO:0006417 regulation of translation [31]
GO:0006897 endocytosis [13]
GO:0006811 ion transport [10]

These leaves correctly classify 198, or 48.4% of the general responders. The remainder of the leaves correctly classify a further 42 general responders. All other genes are classified as specific responders, for an overall accuracy of 82.5%.

It might seem that the importance of the “transcription” category contradicts the finding that microarray data is not predictive for the response-class problem. However, initial inspection suggests that this category denotes genes involved in the control of transcription, rather than being themselves changed in expression. An example of this would be the transcription factor Yap1p (the protein corresponding to the gene YAP1). Yap1p is the best-characterized component of cellular response to reactive oxygen species [TPD05].

3.5.3 Related work

3.5.3.1 Graph-based similarity and over-representation analysis

The problem of bias in over-representation analysis of GO annotation is still under research; see, for example, [ARL06, GBRV07]. However, the more general problem of dealing with dependencies in complex data types such as interaction networks in an integrative setting remains.

In the context of machine learning, an alternative to the pre-processing approach we have described in this chapter is to build ontology-handling directly into the machine learning algorithm. This is a long-standing idea in machine learning; a recent approach was implemented by Zhang et al. [ZCH05]. They incorporated “attribute-value trees” (AVTs) into a decision tree learning system, and later into a Naive Bayes learner called AVT-NBL [ZKSH06]. Their work focuses on learning classifiers from partially defined data and their AVTs. An AVT is a tree-like hierarchy which represents generalization/specialization orders among the attribute values in a particular domain; thus an AVT defines an abstraction hierarchy over values of attributes. AVT-NBL uses AVTs as background information with the given data to further specialize the learned classifier, and it is claimed to generate more compact and accurate classifiers compared to simple NBL over a broad range of data.

AVT-NBL’s performance was tested over three data sets, and overall it showed lower error rates in most cases compared to the NB classifier, based on 10-fold cross-validation with 90% confidence. However, the experiments were performed on data sets where the AVTs were quite simple and restricted to tree-structured taxonomies, generated using AVT-Learner (a hierarchical agglomerative clustering algorithm). It will be interesting to see how well the method performs where AVTs have hierarchical DAG structures, as in the Gene Ontology. For ontologies of the size and complexity of the Gene Ontology, it is not clear how well this approach will scale. Additionally, building in ontology handling requires modifying each machine learning algorithm one wishes to use, whereas pre-processing the training data into a standard format allows the use of any standard algorithm.

3.5.3.2 Inductive bias of closed concept

A potential problem with our approach lies in the use of Formal Concept Analysis as the basis for our feature construction approach. Since each concept in the lattice has a set of descriptors that is “closed” with respect to the objects in its extent, the features that can be constructed are, in a sense, maximally specific. However, this is a form of inductive bias [Mit97] that may not be appropriate. In particular, it is not clear that this is an appropriate bias for the often noisy data that gene sets constitute. Since more general concepts may also be included as features this may not be a critical problem, but it should be investigated as part of future work. There are also known issues with the scalability of Formal Concept Analysis, as we found in Section 3.4.4, and these will also need to be investigated.

3.5.3.3 Feature-extraction for machine learning using closed concepts

In general, the reason GO annotation causes problems for vector-based machine learning is that it is a graph-based data representation. For example, the coverage matrix approach could be used directly as a set of attribute-vectors for such algorithms. This would lead, though, to very high-dimensional data sets (there are close to 30,000 terms in the current Gene Ontology, leading to a coverage matrix with up to 30,000 columns). However, it is possible that kernel methods could be used, since they are often appropriate for high-dimensional data. We have chosen not to take this approach at this stage since we seek comprehensible models rather than “black-boxes”. Furthermore, with the addition of multiple graph-based data sources, such as protein-protein interactions, the data dimensionality could quickly rise to the order of 10^6 or even higher. Nonetheless, this could be investigated as part of further work.

A general property of graph-based data representations for machine learning is their sparseness when converted to vector format. Other data sources used in later chapters of this thesis also have this property, such as annotation of genes by the biological pathways in which they are involved. A preliminary experiment using the feature construction method of Section 3.4.3.2 with KEGG pathway annotation [KG02] replacing GO annotation resulted in successful incorporation of pathway features in the learned decision tree. The use of graph-type data is investigated further in later chapters.

3.5.3.4 Description Logics

Our approach may be seen as deduction followed by induction: first, the transitive closure of the GO refines relation is computed relative to the set of genes in the training set; then a machine learning algorithm searches for a hypothesis with good estimated predictive accuracy. In principle the same approach should be possible using a more powerful ontology language. In this setting, however, it is likely to be necessary to more strictly control both the deductive and inductive phases of the process, via devices such as language restrictions, search constraints and heuristics. Restrictions to subsets of first-order logic are well-studied in Description Logics [BCM+03], and these are widely proposed as a basis for representation of and reasoning in ontologies.

Wroe et al. [WSGA03] was an early proposal to move the Gene Ontology to a description logic framework. This was motivated by the need for semantic analysis of the Gene Ontology by the use of description logic to enable validation, extension and classification tasks. Currently multiple formats of GO are available for download, including OWL, MySQL and Prolog; we are using the latter two in our work.

Computing the coverage matrix can be done by bottom-up breadth-first traversal of the Gene Ontology from the gene associations. Then constructing a concept lattice by treating the coverage matrix as a formal context amounts to finding the minimal common graph paths for gene subsets. This appears to be related to the classification problem in description logics, and we plan to investigate possible connections as part of further work. Since GO is available in OWL format it makes sense to pursue this approach. Although the simple formalism of GO itself probably does not make this worthwhile, the possibility of applying this approach to ontologies in richer representations is one of our research goals.

The Gene Ontology Next Generation Project (GONG) is developing a staged methodology to evolve the current representation of the Gene Ontology into DAML+OIL in order to take advantage of the richer formal expressiveness and the reasoning capabilities of the underlying description logic. Each stage provides a step level increase in formal explicit semantic content with a view to supporting validation, extension and multiple classification of the Gene Ontology.

However, it is not clear that description logics are always the best choice for ontology construction tasks; Stevens et al. [SEAW+07] found that using OWL for modelling complex biological knowledge was only partially successful due to limitations of the formalism.

3.5.3.5 Bioinformatics approaches

Antonov and Mewes [AM06] have proposed a method of finding combinations of ontology terms that is related to our method of feature construction and selection. Their method searches for algebraic combinations of basic categories, such as GO terms. Since they use set intersection, union and difference their representation is quite powerful. The exact relation to our approach is not clear, but given the general nature of the hypothesis representation language of decision trees it should be possible to accomplish the same or more powerful combinations in our approach using feature construction.

Guo et al. [GZL+05] integrate gene expression profiles from microarray data with GO categories by reducing the vector of measurements to a point score. They can then learn a decision tree to classify samples based on combinations of the profile-based GO-derived attributes. This is clearly related to our approach, except their aim is to classify samples (e.g., microarrays taken from patients), whereas we are aiming to classify genes.

3.5.4 Some ontological analysis of the Gene Ontology

One outcome of generating the coverage matrix for ontological annotation of the data before machine learning is the discovery that, although GO annotation can be redundant (as explained below), the application of FCA for feature construction can remove this redundancy.

Limitations of the GO representation

The important relational mappings in GO are the Term, Term2Term and Graph_path tables (see Figure A.2 in Appendix A). The lexical representation of GO categories (all the terms and their possible relationship types) is in the Term table, whereas Term2Term holds the direct parent-child links, and Graph_path stores the transitive closure of a relationship, i.e., the paths between each term and all of its ancestors.

Note that GO relies on the pre-computed transitive closure stored in the Graph_path table due to issues with recursive queries in RDBMS (Relational Database Management Systems). This technique enhances query performance, which would otherwise be compromised, but at the price of a limited set of predefined relationship types: transitive properties such as “is-a” and “part-of” cannot be intensionally represented in a relational database. Therefore, GO has significant drawbacks in terms of its restricted representation of concepts and their transitive properties.
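In miniature, what the Graph_path table pre-computes is the transitive closure of the direct Term2Term links. A naive fixpoint sketch of ours (fine for illustration, far too slow for the full ontology):

def transitive_closure(direct):
    # direct: set of (term, ancestor) pairs from Term2Term-style links.
    closure = set(direct)
    while True:
        derived = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if derived <= closure:
            return closure
        closure |= derived

print(transitive_closure({("A", "C"), ("C", "B")}))
# contains the derived ('A', 'B') path in addition to the two direct links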

Figure 3.9: Transitive closures in a GO-type ontology can show redundancy. In this toy example the directed edge from “A” to “B” is redundant, given the other two edges. This redundancy can occur in the Gene Ontology.

Redundancy in GO annotation

We have found that there are redundant edges in the GO structure. In general, there may exist relations where a term “A” is subsumed by two terms such as “B” and “C”, and “C” is subsumed by the term “B”. However, knowing that “A” is subsumed by the term “C”, and “C” is subsumed by the term “B”, we can derive that “A” is also subsumed by the term “B”; see Figure 3.9. Here the relation “is subsumed by” could be any edge in the GO graph such as “is-a” or “part-of”. In fact, the FCA-based feature construction method of Section 3.3.2 is able to handle this for GO annotated gene sets, since it will construct a new feature “merging” such categories, giving the most specific set of categories.
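The triangle pattern of Figure 3.9 is also easy to test for directly. A sketch, treating “A is subsumed by B” as a directed edge (A, B); the function name is ours:

def redundant_edges(edges):
    # An asserted edge (a, b) is redundant if some midterm c yields asserted
    # edges (a, c) and (c, b), from which (a, b) is derivable.
    eset = set(edges)
    terms = {t for e in edges for t in e}
    return {(a, b) for (a, b) in eset
            if any(c not in (a, b) and (a, c) in eset and (c, b) in eset
                   for c in terms)}

print(redundant_edges({("A", "B"), ("A", "C"), ("C", "B")}))  # {('A', 'B')}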

However, it is not clear how common this problem of transitive closure in GO indicating redundant relationships among the terms actually is. To investigate in detail, we conducted a series of MySQL database queries over the three relational tables Graph_path, Term2Term and Term in GO.9 The two consecutive relational views View_TriangularDist and View_Triangular flatten out the whole ontology graph by recursively generating all the intermediate terms stored in the Graph_path table, followed by a single query to list the redundant triangular relations in GO. The following SQL queries show our method.

9 GO version Jan ’06.

Table 1:

CREATE TABLE View_TriangularDist
SELECT DISTINCT
    t.term1_id AS term1_id,
    tm1.acc AS parentGoId,
    t.term2_id AS term2_id,
    tm2.acc AS childGoId,
    t.relationship_type_id
FROM
    Graph_path AS g1,
    Graph_path AS g2,
    Term2Term AS t,
    Term AS tm1,
    Term AS tm2
WHERE
    g1.distance = 1 AND
    g2.distance = 2 AND
    g1.term1_id = g2.term1_id AND
    g1.term2_id = g2.term2_id AND
    g1.term1_id = t.term1_id AND
    g1.term2_id = t.term2_id AND
    tm1.id = t.term1_id AND
    tm2.id = t.term2_id;

Table 2:

CREATE TABLE View_Triangular
SELECT
    t1.term1_id AS parent,
    m1.parentGoId,
    t1.term2_id AS midterm,
    t2.term2_id AS child,
    m1.childGoId,
    t1.relationship_type_id AS parent_mid_Type,
    t2.relationship_type_id AS mid_child_Type,
    m1.relationship_type_id AS parent_child_Type
FROM
    Term2Term AS t1,
    Term2Term AS t2,
    View_TriangularDist AS m1
WHERE
    t1.term1_id = m1.term1_id AND
    t2.term2_id = m1.term2_id AND
    t1.term2_id = t2.term1_id;

Query:

SELECT parent, parentGoId, midterm, child, childGoId, count(*)
FROM View_Triangular v
GROUP BY parent, parentGoId, child, childGoId
HAVING count(*) >= 1;

The queries resulted in 60 such cases, where almost every case is a combination of a “part-of” and an “is-a” relation. None of these cases had a combination of only “is-a” relations, and only a few cases have all three vertices connected through “part-of”.

Further investigation indicated the presence of “diamond”-like structures, i.e., where there are two paths between the top and the bottom terms of a diamond, which is formed from two triangles. In fact, diamond structures can be formed from multiple triangles.

The reason is that, in GO, ontologies are structured as directed acyclic graphs and more-specific (or child) terms can have multiple more-general terms (or parents), and vice-versa. We suggest that this is due to repeated annotations using “is-a” and “part-of”. A linear search from the graph path relation found 6,049 such diamond structures. Therefore, approximately 1/3 of terms in GO participate directly in diamond relations, and it is a very common phenomenon in the GO structure.

3.6 Conclusions

Over-representation analysis applied to Gene Ontology annotation of gene sets obtained from high-throughput experiments was reviewed and the problem with bias resulting from dependencies due to the structure of the ontology was described.

We proposed an alternative approach that avoids the need for development of a probabilistic model by which significance can be assessed. By adopting a discriminant or supervised learning methodology we enable both the integration of heterogeneous data sources in a common “systems biology” framework and the use of Gene Ontology annotation without directly relying on statistical tests to compute p-values.

We implemented a method for feature construction based on concept lattices to compute the common GO annotations for subsets of genes. Features selected from the concept lattice by a simple discriminative measure were then supplied to a decision tree learning algorithm. We further examined a number of feature selection algorithms. In two case studies we found that GO annotation can be incorporated into a learned tree to enable prediction significantly above default accuracy. This was found both with and without the integration of microarray data.

In principle we could extend the use of the approach to other data sets to enable a more detailed evaluation, both in application to GO annotation and other non-vector based attributes for machine learning. However, this raises issues of combinatorial explosion in generating the coverage matrix, for example, when using large graph datasets with highly-connected nodes, which exist in biological networks.

As outlined in Section 3.4.2, one of the reasons for using decision tree learning in this work is that decision trees are designed to be comprehensible classifiers. This is an important advantage over “black-box” methods that are opaque to human inspection — even though such methods may be highly accurate, biologists will have less confidence in their outputs. However, decision trees have an inbuilt tendency to return the smallest tree giving good classification performance on the data. When interpreting the output of a decision tree, there may be a question as to which other attributes of the data, not appearing in the relevant part of the tree, may be associated with an output classification.

The requirements for better comprehensibility and the use of network data in learning led us to the approaches that are investigated in the next chapter.

Chapter 4

Learning responsive cellular networks by integrating genomics and proteomics data

“Essentially, all models are wrong. But some are useful.”

George Edward Pelham Box

This work leads to the study of cellular network response of genes that give rise to an observed phenotype as the downstream effect of an external stimulus through signal transduction. For example, when cells adapt to sudden changes in the environment, cellular network responses include the action of sets of transcription factors (proteins) that activate sets of genes involved in biochemical pathways. Since the underlying biology is network-based, relational machine learning is an appropriate technique that can directly handle graph-type relations encoded as background knowledge. Parts of this work have been published as Akand et al. [ABT09].

4.1 Biological background

A. Cellular response to stress in the eukaryotic cell

Most organisms can respond to sudden changes or stresses in their environment. “Stress” can be defined as an abrupt environmental force that disrupts an organism’s cell functions and thereby causes it to grow at a declining rate, or perhaps to cease growth permanently or die. “Adaptation” is the response mechanism to a particular stress that an organism’s cells make to continue its pre-stress growth rate. This study is extremely important because adaptation physiology also includes growth physiology, and characterizing gene expression under environmental stress can provide answers to which genes respond and how the responses change activities in the cell.

A variety of stresses are possible, including physical or chemical. Physical stresses include heat, radiation, osmotic pressure, etc., whereas chemical stresses include changes in the concentration of oxygen and its derivatives such as H2O2, acid, alcohol, phenols, etc. Cells respond to a particular stress in three different phases. In the preliminary stage, immediate cellular changes occur as a result of the onset of stress; in the second stage various defence mechanisms are triggered; and in the final stage, adapted cells resume normal growth. The following is a generic outline of how a cell responds to an environmental stress, which is also shown in the diagram of Figure 4.1.

• A eukaryotic cell senses stress as an alteration in the outside world. In the cell membrane, sensors (receptors) which are proteins interpret stress as an extra-cellular signal. Often this mechanism leads to phosphorylation of proteins that cause conformational changes in the cell.

• The signal is transmitted into the intracellular domain by cascades of receptor proteins residing in the intracellular domain.

• Finally, the signal is relayed to regulator (or transcription modulator) proteins that modulate a set of effector operons to produce responding proteins that coordinate the cellular responses.

• To respond appropriately to the extent of the stress, feedback control further modulates the above response pathways of the cell.

Some stages of the stress response within the cell can be measured, such as the change in proteins due to signalling — this is termed an intra-cellular phenotype. Otherwise, the effects of stress can only be measured as the overall effect on the cell, such as whether its growth is reduced or the cell dies — this is an extra-cellular phenotype. Since effects on cell growth can be due to many factors, discovery of the underlying response network of an extra-cellular phenotype is generally considered a harder problem for machine learning.

B. Saccharomyces cerevisiae

The baker’s and brewer’s yeast Saccharomyces cerevisiae is a single-celled organism that has the basic structure and organization of a eukaryotic cell (genetic material contained in a nucleus). Many fundamental processes in yeast are conserved through to humans [SSD+02] and, due to the ease of genetic manipulation possible in the lab, yeast is a key model organism for systems biology. In 1996 [GAAC+97] S. cerevisiae was the first eukaryotic organism to have its full genome sequenced and since then many studies have been carried out to investigate functional organization in yeast cells, e.g., [Oli96, WSA+99, ESH+99]. However, although nearly two decades have passed since the sequencing of the complete yeast genome, in a recent estimate around 25% of yeast genes still do not have a known molecular function [PCH07].

Figure 4.1: A generic diagram of cellular response to environmental stress. (The diagram shows an environmental stress arriving at the plasma membrane of the cell; receptor proteins converting it to extra-cellular signals; cascades of intracellular receptor proteins, with feedback control; regulator and effector proteins acting on metabolic enzymes, gene regulatory proteins and cytoskeletal proteins; and the resulting cell response: altered metabolism, altered gene expression, and altered cell shape or movement.)

4.2 Learning tasks

Data sets describing yeast cellular network responses derived from new high-throughput genome-wide experimental techniques are increasingly available [DB05], and are often inherently relational. Therefore the aim of this work is to apply ILP to uncover significant logical relationships that govern cellular network responses, such as those involved in the onset of oxidative stress-related phenotypic responses that are important in many human diseases, through the integration of genome-wide data sets. In this chapter we investigate two problems of modelling responsive phenotypes in yeast using ILP: protein expression under oxidative stress, and sensitivity of gene deletion mutants to multiple stresses. The basic setting is described in Section 4.2.1, initial results are in Sections 4.3.4 and 4.4, and discussion in Section 4.5.

4.2.1 Learning the logic of responsive cellular networks

A responsive sub-network of a cell that involves control of gene transcription is referred to as a genetic regulatory network (GRN) [Dav99]. The protein products of co-expressed genes in a GRN combine to form interacting molecular machines that produce a responsive cellular phenotype. This responsive sub-network is partly described by the protein-protein interaction (PPI) network. In turn, proteins act to regulate cellular metabolism in pathways of biochemical reactions, and, by subtle feedback mechanisms, their own GRNs and PPI networks.

We do not expect to be able to learn an entire cellular response network from data on its behaviour. That would be pointless since, in some sense, it is implicit in the empirical data on the GRNs and PPI networks, although this is typically incomplete and incorrect. Instead, we aim to learn theories on network components that may be predictive and explanatory of an observed cellular response. These may be used, for example, in visualization or further learning.

We assume a logical language LNet to represent cellular networks, as follows. In this chapter we use as constants only gene symbols (genes represent proteins in certain contexts). Function symbols are not currently used. Predicate symbols express properties or relations, such as gene expression or protein interactions. Similar relational representations have previously been used in a number of studies, for example, by [Bad03, TZLT08, FK08].

We do not necessarily assume a supervised learning framework, and adopt a simpler setting than is typical in ILP. The task of learning a logical network will be to discover a theory T defined in LNet which is descriptive with respect to a data set E and background knowledge B. We assume there is a function fE,B(T) to evaluate candidate theories, and that some threshold can be set by the biologist on this function to decide if the network may be of interest, i.e., it is descriptive. Note that in the work reported here, theories are constructed by learning individual clauses separately, but the evaluation could be applied per clause, or to the complete theory.

To develop and test our approach we selected two problems of learning elements of the cellular response network in yeast. The first was to learn a protein expression network from integrated data sets. This is the same problem setting as Case Study 1 (protein expression) from Section 3.4.3 in Chapter 3. The second was a typical laboratory problem, where multiple stresses were applied to yeast cells which were then screened for changes in growth. This is the same problem setting as Case Study 2 from Section 3.4.4 in Chapter 3. However, in this chapter we use a first-order instead of propositional representation for both problems.

4.3 Predicting an intra-cellular response phenotype

4.3.1 Data sets

In this experiment proteomics and genomics data were integrated to learn to predict protein expression in yeast in response to the environmental addition of hydrogen peroxide (H2O2), a condition known to produce “oxidative stress”.

A. Proteomics data

The response was taken from a proteomics experiment by Godon et al. [GLL+98]. Two-dimensional gel electrophoresis was used as a method to characterize proteins with altered expression under H2O2. This technique has drawn significant interest for its ability to accurately quantify proteins after posttranslational modification (the final stage for products of gene expression). In this experiment there were 56 proteins whose synthesis was stimulated and 36 that were repressed under oxidative stress. As background knowledge, we took independently generated data sets, as follows.

B. Genomics data

Microarray data on the cellular network response by yeast to six different environmental stresses, a study by Causton et al. [CRK+01], was used in this experiment. Unlike the proteomics study above, this data contains the transcriptional response of mRNA levels in the cell observed at 6 time points over a period of 2 hours. Each time series under a particular stress has the mRNA levels recorded at irregular intervals following initial exposure to the stressor. We discretized this data from time-courses into the values “up” or “down” for each of the conditions (details in Section C.1 of Appendix C).
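A minimal sketch of such a discretization is given below; the thesis's actual rule is specified in Appendix C, so the sign-of-the-mean criterion here is only an assumed stand-in:

def discretize(log_ratios):
    # Collapse one stress time-course to a single "up"/"down" value.
    mean = sum(log_ratios) / len(log_ratios)
    return "up" if mean > 0 else "down"

print(discretize([0.2, 0.8, 1.1, 0.9, 0.4, 0.1]))  # up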

C. Transcription factor binding data

Transcription factor binding data partially defines a regulatory network by indicating transcription factors and their binding target genes. Identifying transcription factors that control gene expression can provide significant insight into misregulated expression and thereby potential links between transcription (genomics) data and post-translational (proteomics) data. For this work, we have used transcription factor binding (ChIP-chip) data from the study by Harbison et al. [HGL+04].

D. Protein-protein interaction data

Protein-protein interaction data provides a partial picture of cellular pathways and cascaded responses to signals in the cell at the molecular level. Yeast protein interaction data was downloaded from BioGRID.1

4.3.2 Method

The basis of any Inductive Logic Programming (ILP) algorithm is that it represents facts and rules, formalized as logic programs, and induces hypotheses from them. Unlike propositional learning, ILP has the ability to represent multi-relational data such as inter-atom structure in molecular data. In particular, for biological data as in the present case, ILP offers a powerful first-order language in which to learn concepts that can unveil significant logical relationships in cellular network responses. We used Aleph2 to learn clauses to predict whether genes in the Godon et al. data had their protein expression induced or repressed.

Given a set of clauses, Aleph uses a top-down sequential covering strategy to generate hypotheses. Initially, a positive example is selected as a seed and inference over background knowledge is applied to form a most specific or bottom clause. A top-down search is then performed from the most general clause to specialize it using literals from the bottom clause. Once a “good” subset of the positive examples is covered by the clause, they are removed and the clause is added to the theory. Finally, the solution is complete when no more positive examples can be covered by clauses achieving a threshold of the evaluation function relative to the positives and negatives.

1 The Biological General Repository for Interaction Datasets (BioGRID) is an actively maintained collection of protein and genetic interactions available at www.thebiogrid.org.
2 ‘A Learning Engine for Proposing Hypotheses’, available from www.comlab.ox.ac.uk/activities/machinelearning/Aleph/aleph.html.
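The covering loop just described can be rendered schematically as follows; the callable parameters stand in for Aleph's internals and are our assumptions, not Aleph's actual API:

def sequential_cover(positives, negatives, build_bottom, search_clause, covers):
    theory = []
    remaining = set(positives)
    while remaining:
        seed = next(iter(remaining))
        bottom = build_bottom(seed)                           # most specific clause
        clause = search_clause(bottom, remaining, negatives)  # general-to-specific search
        if clause is None:
            remaining.discard(seed)  # seed kept as an ungeneralized example
            continue
        theory.append(clause)
        remaining = {p for p in remaining if not covers(clause, p)}
    return theory

# Degenerate demo with trivial stand-ins (each seed becomes its own "clause"):
print(sequential_cover({1, 2, 3}, set(),
                       build_bottom=lambda s: s,
                       search_clause=lambda b, pos, neg: b,
                       covers=lambda c, p: c == p))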

The experimental setting of Aleph requires three files; a minimal sketch of their contents follows this list.

• the background file (.b) contains relevant intensional or extensional information about the domain. In addition, language restrictions and search restrictions are included in the file. Types, modes and determinations are commonly used to specify such restrictions. Types are facts describing arguments in predicates. Modes define meta-predicates or relations between object and data types, and determination statements declare meta-predicates required to construct valid relationships between target and background predicates.

• the positive example file (.f) contains positive ground facts of a concept (the target predicate or relation) to be learned by Aleph.

• the negative example file (.n) contains negative ground facts of a target predicate to be learned by Aleph. This file is optional as Aleph can also learn using only positive examples.
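As an illustration, a minimal sketch of the three files for the task in this section is shown below. The mode declarations, type names and facts are hypothetical examples written in Aleph's directive syntax, not the actual files used in our experiments.

    % stress.b -- background file (sketch)
    :- modeh(1, induced(+protein)).           % target predicate
    :- modeb(*, ppi(+protein, -protein)).     % protein-protein interaction
    :- modeb(1, peroxide(+protein, #updown)). % discretized expression level
    :- determination(induced/1, ppi/2).
    :- determination(induced/1, peroxide/2).
    ppi(trx2, ahp1).                          % extensional background facts
    peroxide(ctt1, up).

    % stress.f -- positive examples
    induced(ctt1).

    % stress.n -- negative examples
    induced(gene_x).                          % gene_x is a placeholder

Here protein and updown are assumed type names and gene_x a hypothetical repressed gene; real runs used the full sets of interaction and expression facts described above.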

A Prolog compiler is required for Aleph to run. Two readily available open source systems we used are Yap (www.dcc.fc.up.pt/~vsc/YAP/) and SWI-Prolog (www.swi-prolog.org).

4.3.3 Theory post-processing

Although Prolog has a simple and clean syntax for rules, the domain specialist requested that the theory be translated into a more human-readable syntax. Following the approach of Finn et al. [FMPS98], we initially implemented a translation method based on a simple “pseudo-natural language” approach. However, to make learned network models more understandable, the domain specialist further requested that actual gene or protein names be shown in the output, to make clear the “biological logic” (Dr. Mark Temple proposed this terminology, referring to the fact that it would be of biological interest to discover patterns from the rules relating protein expression to the background knowledge). Extending the approach in [FMPS98], we implemented a method to generate all ground instantiations of gene and protein names that could appear in a valid rule. These ground rules, representing sub-graphs of the stress response network, were then output in pseudo-natural language for the complete set of valid instantiations, which was acceptable to the domain specialist.
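To illustrate the translation mechanism, SWI-Prolog's portray/1 hook, which is honoured by print/1, can render individual literals in English. The clauses below are a simplified sketch of the idea, not our full translator; the predicate ground_instances/2 is likewise our illustration.

    % Sketch: pseudo-natural language rendering of ground literals.
    portray(ppi(X, Y)) :-
        format("protein ~w interacts with protein ~w", [X, Y]).
    portray(peroxide(G, D)) :-
        format("the mRNA of gene ~w is ~w-regulated under peroxide stress", [G, D]).

    % Sketch: all ground instantiations of a clause body against the
    % background knowledge, one network sub-graph per solution.
    ground_instances(Body, Instances) :-
        findall(Body, call(Body), Instances).

Printing each literal of a ground clause with print/1 then emits one English phrase per literal.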

induced(A) :-                 % Rule 1: Pos cover = 2
    ppi(A, A), 'ACID'(B, A).
induced(A) :-                 % Rule 2: Pos cover = 3
    'H2O2LO'(B, A), ppi(C, A), 'ACID'(D, C).
induced(A) :-                 % Rule 3: Pos cover = 25
    peroxide(A, up).
induced(A) :-                 % Rule 4: Pos cover = 2
    'YPD'(B, A), ppi(C, A), 'RAPA'(B, C).
induced(A) :-                 % Rule 5: Pos cover = 3
    'RAPA'(B, A), peroxide(B, down), ppi(A, C).
induced(A) :-                 % Rule 6: Pos cover = 6
    'H2O2LO'(B, A), ppi(A, C), 'ACID'(D, C).
induced(A) :-                 % Rule 7: Pos cover = 9
    'H2O2LO'(B, A), peroxide(B, down).
induced(A) :-                 % Rule 8: Pos cover = 8
    ppi(A, B), 'BUT14'(C, B).

Figure 4.2: Clauses learned for expression of a protein A induced by exposure of the cell to H2O2. Only clauses generalized to cover at least two proteins are shown. Each clause defines cellular conditions on A or other genes or proteins (B, C, ...) under which the experimentally observed protein expression can occur (see text for details).

4.3.4 Experimental results

A theory was learned using Aleph that covered all 56 positive examples and none of the 36 negative examples. This theory comprised 8 generalized clauses, shown in Figure 4.2, and 17 ungeneralized positive examples.

This theory can be seen as a successful result in terms of descriptive induction, since it is a compact representation of 40 out of 56, or about 71%, of the positive training examples, without covering any negatives. The 40 unique proteins are as follows (some are covered by more than one clause):

ALD5 ARG1 ARO4 BGL2 CCP1 CDC48 CTT1 DDR48 ENO1 GLK1 GLR1 GPD1 HIS4 HSP104 HSP12 HSP26 HSP42 HSP78 HSP82 LYS20 PEP4 PGM2 PRE1 PRE3 PRE8 PRE9 PUP2 SCL1 SOD2 SSA1 SSA3 TKL2 TPS1 TRX2 TSA1 UBA1 UGP1 YNL134C YNL274C ZWF1

4.3.4.1 Validation

The following steps were carried out to validate the theory of Figure 4.2. First we evaluated the theory by testing the functional annotation of the proteins it covers for statistical significance. Next we converted the theory to ground rules, as outlined above in Section 4.3.3. These were passed to our collaborating yeast biologist Dr. Mark Temple, who developed the visualization in Figure 4.3 to confirm the validity of the underlying interaction graph. Then, for each clause in the theory, the set of all proteins appearing in its ground instantiations was analysed for statistically significant annotation categories. The set of significant annotations found appears in Section C.3 of Appendix C. Finally, we used cross-validation to estimate the predictive accuracy of theories learned on the same positive and negative example sets but with a slightly modified and extended version of the background knowledge.

A. Statistical significance of theory

We analysed the 40 proteins listed in Section 4.3.4 above, covered by the theory in Figure 4.2, using FunSpec (http://funspec.med.utoronto.ca). This tool analyses a gene set for statistically significant over-representation against the main curated data sets in yeast biology, including the Gene Ontology and the MIPS functional catalogue (Munich Information Center for Protein Sequences, http://mips.helmholtz-muenchen.de/genre/proj/yeast), plus a number of sources of protein interactions and other resources (details at the FunSpec website).

FunSpec uses a hypergeometric test for significance, as discussed in Section 3.2.2. We applied this type of analysis since it is widely used by biologists. The p-values were obtained with a cutoff of 0.01 and with the Bonferroni correction for multiple testing applied.
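In outline, the test computes the probability of observing at least k category members in a sample of n genes drawn from a genome of N genes, f of which are annotated to the category. The following is a minimal Prolog sketch of this computation (our own illustration, not FunSpec code):

    % choose(N, K, C): binomial coefficient C(N, K), exact integer arithmetic.
    choose(_, 0, 1) :- !.
    choose(N, K, C) :-
        K > 0, K =< N,
        N1 is N - 1, K1 is K - 1,
        choose(N1, K1, C1),
        C is C1 * N // K.          % exact: C1 * N is always divisible by K

    % hyper_p(N, F, Smp, K, P): one-sided p-value Pr(X >= K) for a category
    % of size F and a sample of size Smp from a genome of size N.
    hyper_p(N, F, Smp, K, P) :-
        Max is min(F, Smp),
        choose(N, Smp, Total),
        findall(T, ( between(K, Max, I),
                     choose(F, I, A),
                     NF is N - F, SI is Smp - I,
                     choose(NF, SI, B),
                     T is A * B ),
                Ts),
        sum_list(Ts, Num),
        P is Num / Total.

For instance, hyper_p(6000, 14, 40, 6, P) approximates the first row of the GO Molecular Function table below, taking the yeast genome size to be roughly 6000 genes (an assumption for illustration) and before the Bonferroni correction.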

For each category type the tables below show: the name and identifier (if any) of the significant category, the p-value, the set of proteins in the category from the data set, the size (k) of this set, and the size (f) of the set of all proteins or genes annotated to this category from the complete genome.

GO Molecular Function

Category | p-value | In Category | k | f
threonine-type endopeptidase activity [GO:0004298] | 9.68678e-11 | PRE1 SCL1 PRE9 PUP2 PRE3 PRE8 | 6 | 14
endopeptidase activity [GO:0004175] | 1.21752e-09 | PRE1 SCL1 PRE9 PUP2 PRE3 PRE8 | 6 | 20
oxidoreductase activity [GO:0016491] | 3.77581e-07 | HIS4 GPD1 ALD5 CTT1 SOD2 CCP1 TSA1 YNL134C ZWF1 GOR1 GLR1 | 11 | 272
unfolded protein binding [GO:0051082] | 6.53868e-07 | SSA1 SSA3 HSP26 HSP42 HSP104 TSA1 HSP82 | 7 | 86
peptidase activity [GO:0008236] | 3.48335e-06 | PRE1 SCL1 PRE9 PUP2 PRE3 PRE8 PEP4 | 7 | 110

GO Biological Process

Category | p-value | In Category | k | f
response to stress [GO:0006950] | 2.1374e-12 | SSA1 SSA3 HSP26 TPS1 ARO4 GPD1 HSP42 HSP78 HSP12 CTT1 HSP104 DDR48 HSP82 | 13 | 152
proteolysis involved in cellular protein catabolic process [GO:0051603] | 5.21982e-12 | PRE1 SCL1 PRE9 PUP2 PRE3 PRE8 PEP4 | 7 | 18
proteasomal ubiquitin-independent protein catabolic process [GO:0010499] | 9.68678e-11 | PRE1 SCL1 PRE9 PUP2 PRE3 PRE8 | 6 | 14
proteasomal ubiquitin-dependent protein catabolic process [GO:0043161] | 2.69912e-08 | PRE1 SCL1 PRE9 PUP2 PRE3 PRE8 | 6 | 32
oxidation-reduction process [GO:0055114] | 3.77581e-07 | HIS4 GPD1 ALD5 CTT1 SOD2 CCP1 TSA1 YNL134C ZWF1 GOR1 GLR1 | 11 | 272
metabolic process [GO:0008152] | 4.62643e-06 | TKL2 ARO4 HIS4 GPD1 LYS20 ALD5 BGL2 UGP1 UBA1 YNL134C ZWF1 GOR1 | 12 | 425

GO Cellular Component

Category | p-value | In Category | k | f
proteasome core complex [GO:0005839] | 1.60734e-10 | PRE1 SCL1 PRE9 PUP2 PRE3 PRE8 | 6 | 15
proteasome storage granule [GO:0034515] | 7.04224e-09 | PRE1 SCL1 PRE9 PUP2 PRE3 PRE8 | 6 | 26
cytoplasm [GO:0005737] | 9.15273e-09 | SSA1 SSA3 HSP26 TKL2 TPS1 ARO4 GPD1 LYS20 HSP42 PRE1 HSP12 SCL1 CTT1 PRE9 TRX2 PUP2 ENO1 PRE3 UGP1 UBA1 HSP104 TSA1 PRE8 PGM2 DDR48 YNL134C ZWF1 GOR1 GLR1 HSP82 | 30 | 2026
proteasome core complex, alpha-subunit complex [GO:0019773] | 3.98941e-08 | SCL1 PRE9 PUP2 PRE8 | 4 | 7
proteasome complex [GO:0000502] | 2.62206e-07 | PRE1 SCL1 PRE9 PUP2 PRE3 PRE8 | 6 | 46
nuclear outer membrane-endoplasmic reticulum membrane network [GO:0042175] | 4.1922e-06 | SCL1 PRE9 PUP2 PRE8 | 4 | 19
cytosol [GO:0005829] | 1.52819e-05 | SSA3 GLK1 GPD1 CDC48 HSP12 TRX2 TSA1 ARG1 | 8 | 192

MIPS Functional Classification

Category | p-value | In Category | k | f
stress response [32.01] | 2.69194e-08 | SSA3 TPS1 PRE1 PUP2 PRE3 UBA1 HSP104 DDR48 ZWF1 HSP82 | 10 | 162
oxidative stress response [32.01.01] | 2.83073e-08 | HSP12 CTT1 TRX2 SOD2 CCP1 TSA1 GLR1 | 7 | 55
protein processing (proteolytic) [14.07.11] | 7.45078e-08 | PRE1 SCL1 PRE9 PUP2 PRE3 PRE8 PEP4 | 7 | 63
proteasomal degradation (ubiquitin/proteasomal pathway) [14.13.01.01] | 7.31623e-07 | CDC48 PRE1 SCL1 PRE9 PUP2 PRE3 UBA1 PRE8 | 8 | 128
protein folding and stabilization [14.01] | 1.11781e-06 | SSA1 SSA3 HSP26 HSP42 HSP78 HSP104 HSP82 | 7 | 93
glycolysis and gluconeogenesis [02.01] | 4.0254e-06 | GLK1 ENO1 PGM2 YNL134C ZWF1 | 5 | 41
sugar, glucoside, polyol and carboxylate catabolism [01.05.02.07] | 7.77819e-06 | TKL2 TPS1 ENO1 UGP1 PGM2 ZWF1 | 6 | 81
ATP binding [16.19.03] | 1.47095e-05 | SSA1 SSA3 CDC48 HSP78 UBA1 HSP104 DDR48 HSP82 | 8 | 191

MIPS Subcellular Localization

Category | p-value | In Category | k | f
cytoplasm [725] | 5.30018e-05 | SSA1 SSA3 HSP26 TKL2 TPS1 ARO4 HIS4 GLK1 GPD1 CDC48 HSP42 HSP12 SCL1 CTT1 PRE9 TRX2 PUP2 ENO1 UGP1 UBA1 HSP104 TSA1 PGM2 DDR48 YNL134C ZWF1 GOR1 ARG1 GLR1 HSP82 | 30 | 2879
nuclear envelope [750.01] | 5.38344e-05 | PRE1 SCL1 PRE9 PUP2 BGL2 PRE3 PRE8 | 7 | 167
ER membrane [735.01] | 0.000119949 | PRE1 SCL1 PRE9 PUP2 PRE3 PRE8 | 6 | 131

MIPS Protein Complexes

Category | p-value | In Category | k | f
Complex Number 60, 20S Proteosome (13) [550.3.60] | 5.55983e-11 | PRE1 SCL1 PRE9 PUP2 PRE3 PRE8 | 6 | 13
Complex Number 238 [550.2.238] | 1.60734e-10 | PRE1 SCL1 PRE9 PUP2 PRE3 PRE8 | 6 | 15
20S proteasome [360.10.10] | 1.60734e-10 | PRE1 SCL1 PRE9 PUP2 PRE3 PRE8 | 6 | 15
Complex Number 110, probably protein synthesis turnover [550.1.110] | 1.54173e-06 | PRE1 SCL1 PRE9 PRE3 PRE8 | 5 | 34
Complex Number 111, probably protein synthesis turnover [550.1.111] | 1.79074e-06 | PRE1 SCL1 PRE9 PRE3 PRE8 | 5 | 35
Complex Number 170, probably signalling [550.1.170] | 7.2017e-06 | PRE1 SCL1 PRE9 PUP2 PRE8 | 5 | 46

MIPS Protein Classes

Category | p-value | In Category | k | f
other ATPases [41.61] | 1.83297e-05 | SSA1 HSP104 DDR48 HSP82 | 4 | 27

SMART Domains

Category | p-value | In Category | k | f
Proteasome A N | 3.98941e-08 | SCL1 PRE9 PUP2 PRE8 | 4 | 7

PFam-A Domains

Category | p-value | In Category | k | f
Proteasome | 9.68678e-11 | PRE1 SCL1 PRE9 PUP2 PRE3 PRE8 | 6 | 14
Proteasome A N | 3.98941e-08 | SCL1 PRE9 PUP2 PRE8 | 4 | 7

MDS Proteomics Complexes

Category | p-value | In Category | k | f
YER012W (PRE1) | 1.21202e-08 | SCL1 PRE9 PUP2 PRE3 PRE8 | 5 | 14

Cellzome Complexes

Category | p-value | In Category | k | f
YDL188C (PPH22) | 2.62109e-08 | PRE1 SCL1 PRE9 PUP2 PRE8 | 5 | 16
YGL011C (SCL1) | 5.52016e-07 | SCL1 PRE9 PRE3 PRE8 | 4 | 12
YHR200W (RPN10) | 1.33279e-05 | PRE1 SCL1 PRE9 PRE8 | 4 | 25

Proteome Localization–Observed

Category | p-value | In Category | k | f
cyto | 2.04085e-05 | SSA3 HSP26 TPS1 ARO4 HIS4 GLK1 GPD1 HSP42 HSP12 PRE9 PUP2 ENO1 HSP104 TSA1 PGM2 DDR48 YNL134C ZWF1 GOR1 HSP82 | 20 | 1321

Comparison of over-representation categories for the 40 proteins covered by the theory with those for all of the positive example set showed that nearly all appeared in both. That is, the proteins covered by the theory include nearly all of those that appear in the curated biological databases accessed by FunSpec. Also, there was little overlap in categories with the negative examples, the proteins that were repressed under H2O2. This confirms that the induced theory represents a good approximation to the known biology of the target predicate in this domain.

B. Expert validation

One advantage of ILP is that its hypothesis language of first-order clauses is straightforward to render into a form of controlled natural language. Section 4.3.3 outlined how we extended an earlier approach to translate a learned theory from Prolog to a pseudo-natural language format, in consultation with the yeast biologist Dr. Mark Temple. The set of ground rules was expressed in a form of English using Prolog “portray” definitions. These rules, in the form they were given to Dr. Temple, are shown in Section C.2 of Appendix C. By examining this rule set he then manually assembled the network diagram of Figure 4.3, representing key components of the responsive network.

Since the data sets generated by high-throughput biological methods are known to contain many false positives and false negatives, integrating them in a systems biology framework can help detect interactions that are wrong and show where others are missing [GWV03]. In Figure 4.3 the goal of assembling the network by hand was to relate the learned model to the expert’s domain knowledge. It is known that using machine learning for knowledge acquisition is credible to experts when the learned theory is consistent with their domain knowledge [PMS01]. Initial attempts at visualisation with automated graph-drawing algorithms were unsuccessful, since the diagrams were too complex. The approach taken to develop Figure 4.3 was to structure the diagram by interaction type, focused on transcriptional regulation (the grid-like structure in the centre of the diagram), which allowed the biologist to produce a comprehensible structure. Note that not all interactions in the ground theory were added into the network, since only those consistent with the expert’s existing knowledge were selected.

When completed, the biologist reported that the patterns of predicate usage in learned clauses identified relationships across diverse heterogeneous data, and enabled him to filter these according to the inherent “biological logic”. Clauses connect genes or their proteins according to this biological logic, so that expected relationships between two genes, two proteins, or a protein and a gene can be represented, and the pathways leading to the observed response can be captured. For example, if a protein’s expression is shown to be induced in response to hydrogen peroxide treatment, then the biological logic is (i.e., an informed biologist would suggest) that the expression of that gene may also be up-regulated in response to a similar condition. By compiling the set of ground clauses into a network we discover other genes that exhibit similar or related behaviour, and this provides validation of the relationships that are revealed.

For example, one of the clauses in Figure 4.2 (Rule 6) is:

induced(A) :- 'H2O2LO'(B,A), ppi(A,C), 'ACID'(D,C).

Figure 4.3: Network diagram drawn by a human expert from ground clauses of the theory for protein expression under H2O2 stress. The horizontal bold line represents the intergenic (promoter) region of the gene named on the left-hand side of the diagram. The filled circle on each promoter links vertically (dotted line) to the transcribed mRNA, indicated by the right-pointing arrowhead; a bold vertical line indicates that the transcript is up-regulated in the microarray data [CRK+01]. The circle around each arrowhead indicates the response to hydrogen peroxide; a grey filled circle indicates that the protein is induced [GLL+98]. A curved arrow connecting two proteins indicates a protein-protein interaction recorded in BioGRID. The downward-pointing triangle represents a transcription factor (protein) bound to the gene promoter (DNA sequence). The identity of this factor is indicated by the bold vertical line attached to the labelled circle below. The condition under which the transcription factor is bound is indicated by the box to the left of the labelled factor [HGL+04].

This clause says that a gene A has its protein induced under H2O2 addition because two transcription factors, B and D, bind the promoter regions of the genes A and C respectively, and there is a protein-protein interaction between A and C. This relationship is plausible, but only in a general way. However, in one of the instantiations of this clause we see the following. The gene TRX2 is bound by the transcription factor Msn2/Msn4 (and others, as indicated in Figure 4.3 and Section C.2 of Appendix C) under H2O2 treatment conditions, the TRX2 mRNA is up-regulated under similar treatment conditions, and the protein itself is induced upon H2O2 treatment. In addition, the Trx2 protein exhibits a protein-protein interaction with Ahp1 (as shown in Figure 4.3). In turn, further instantiations of the learned clauses describe the AHP1 gene’s relationships to others in the network. Since TRX2 is known to have a central role in one of the cell’s oxidative stress response pathways, this supports the validity of the learned theory.
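For concreteness, this instantiation corresponds to a ground clause along the following lines. The factor binding AHP1 under acid conditions is left as the hypothetical placeholder tf_x, since the specific identities are given in Section C.2 of Appendix C rather than repeated here.

    % Sketch of one ground instantiation of Rule 6 (names illustrative).
    induced('TRX2') :-
        'H2O2LO'('MSN2', 'TRX2'),  % Msn2 binds the TRX2 promoter under H2O2
        ppi('TRX2', 'AHP1'),       % Trx2 interacts with the Ahp1 protein
        'ACID'(tf_x, 'AHP1').      % some factor tf_x binds AHP1 under acid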

This is, to the best of our knowledge, the first network model of yeast response to H2O2 learned from empirical data sets integrating protein expression, transcription factor binding, protein-protein interactions and gene expression data.

Note that the set of ground instantiations of a clause is its extensional representation, so this represents a form of “closure” operation that we develop further in Section 6.3 of Chapter 6, and link to a semi-automated network visualization approach in Section 6.5 of that chapter.

C. Statistical significance of clauses

Given the confidence in the theory expressed by the domain expert following construction of the network, we repeated the significance analysis using FunSpec as above, but this time treating the set of proteins or genes in the ground instantiations of each clause as a separate sample. Output of the analysis appears in Section C.3 of Appendix C.

This showed some differences between the biological bases of different clauses. For example, clauses 3 and 7 represent most of the categories that are over-represented for the proteins in the positive training examples. However, clause 8 represents some more specific cellular systems related to cell replication (e.g., budding and pheromone response) that appear as statistically significant, although these do not appear as significant for the proteins in the training set.

We conclude that this shows an advantage of using a relational representation in learning the model, since it can bring out links to genes not directly observed in the phenotype (as induced or repressed) but having a role in the network response. Also, some clauses have few over-represented categories; these dependencies between observed phenotype and network interactions could indicate new explanations which could be experimentally tested.

D. Predictive accuracy

The theory discussed above can be considered a “proof-of-principle” that enabled the follow-up applications in later chapters. Following this work, we re-ran the learning algorithm using a slightly modified and extended version of the background knowledge to obtain estimates of predictive accuracy and other measures using 10-fold cross-validation.

The background knowledge used was essentially the same as that of Section 5.5.2, with the addition of protein interactions from the “Yeast Interactome” dataset [YBY+08] (available from interactome.dfci.harvard.edu).

Measure | Clause length 6 | Clause length 8
Accuracy | 0.7391 | 0.7391
Error | 0.2609 | 0.2609
Precision | 0.9444 | 0.9211
Recall | 0.6071 | 0.6250
F1 | 0.7391 | 0.7447
Mean runtime (per fold) | 0.42 min | 115.5 min

Given a default accuracy of 0.6087 for predicting everything positive (every protein’s expression is induced; 56 of the 92 examples), we obtain a small lift of 1.2142 (0.7391/0.6087).

The general-to-specific search for clauses leads to a preference for short clauses. According to our yeast biologist, longer clauses should be more informative, since they would include more properties and relations. We therefore tried increasing the “clauselength” parameter. However, although longer clauses were discovered, they were not found to be more comprehensible; for the reasons discussed above, non-ground clauses were difficult for the domain expert to interpret.

From the results, longer clauses required significantly longer runtime, for no change in accuracy. However, there is a slight increase in recall and decrease in precision. This is counter-intuitive, since increasing the possible clause length suggests discovery of more specific clauses, which would tend to increase precision at the expense of recall. However, since the background knowledge includes all Gene Ontology categories, which allows inclusion of literals at many levels of generality into clauses, this is not necessarily the case.

4.4 Predicting an extra-cellular response phenotype

The S. cerevisiae Genome Deletion Project [WSA+99] set of deletant yeast strains was introduced in Section 3.4.4 of Chapter 3. In screens of the deletant set, biologists have searched for a “sensitive” phenotype (e.g., abnormal growth) that would suggest a role for the deleted gene in the cellular response to that stress. This is known to be a hard problem in functional genomics, since there is very little correlation between these screens and microarray data [TFA+04].

As in Section 3.4.4 of Chapter 3, we used the dataset of 26 screens on 1094 genes, obtained from various laboratories and assembled by the yeast biologist Dr. Mark Temple. Of these, 422 deletants were sensitive to three or more screens. These may have a general role in cellular stress response, whereas the remaining 672 are implicated in specific responses. The classification problem was to learn to discriminate these “general” response genes (the positive class) from those sensitive to only one or two screens (since many stresses actually had two screens, each from a different study, this is roughly equivalent to being sensitive to one stress).

Measure | Background 1 | Background 2
Accuracy | 0.6225 | 0.6033
Error | 0.3775 | 0.3967
Precision | 0.5302 | 0.4615
Recall | 0.1872 | 0.1706
F1 | 0.2767 | 0.2491
Lift | 1.6139 | 1.5642
Mean runtime (per fold) | 62.5 min | 52.9 min

The background data was essentially the same as that used in Section 4.3. We give results on two versions. “Background 1” contains the same background set as in the protein expression cross-validation accuracy estimates above, whereas “Background 2” also has 5600 additional facts on the empirical localisation of yeast proteins into 23 location categories [HFG+03] (available from yeastgfp.yeastgenome.org). The total number of facts (ground atoms) in “Background 1” was 284,593 and in “Background 2” it was 290,193.

Since the screen data is known to be noisy, it is not surprising that the predictive accuracies are not high; they are around the same value as propositional learning using only GO categories as attributes with no feature selection (see Table 3.3). Although the accuracy is not high, with a default accuracy of 0.3857 for predicting positive we see a small lift in both settings, which is actually larger than for the intracellular phenotype prediction task above. Also, the clauses learned contain the relational structure representing the biological networks involved in the stress responses, which cannot be represented in the propositional models.

However, the size of the training dataset in this task is over an order of magnitude larger than the intracellular phenotype prediction task, and there is a large amount of background data. A typical theory in any cross-validation run could contain 20-30 clauses, each of similar complexity to those of Figure 4.2. This meant that the 2-stage visualization method used there was not likely to be possible and was not attempted.

4.5 Discussion

This chapter applied first-order learning to systems biology problems and introduced several innovations. However, we are not the first to apply ILP to systems biology tasks of this kind. We first review some of the previous work, then conclude the chapter.

4.5.1 Related work

Inductive logic programming to learn hypotheses in first-order logic using biological background knowledge has been successfully applied to several systems biology-related tasks [Bad03, Kin04, TZLT08, FK08]. Badea [Bad03] appears to have been the first to learn predictive theories of gene expression using ILP. He applied ILP using Gene Ontology categories and gene annotations from the HumanPSD Proteome database to induce functional discrimination rules for gene expression in three subgroups of adenocarcinoma of the lung based on microarray data. In his approach he used Progol (www.doc.ic.ac.uk/~shm/progol.html) with a standard cover-set algorithm, but additionally generated the set of alternative hypotheses having the same coverage as that returned by Progol. In his study all hypotheses covered all positive but no negative examples. For each of these hypotheses the set of covered genes and associated GO categories was given to a human expert for validation. Although similar to our approach, he did not use any interaction data as background knowledge, and so could not learn the responsive network.

More recently, Trajkovski et al. [TZLT08] argued that Badea’s cover-set algorithm is inappropriate for generating interesting subgroups. Subgroup discovery aims at searching for interesting subgroups or patterns in the instance space, guided by utility functions. The discovery method depends on the description language, search heuristic and quality function used. One search heuristic is weighted relative accuracy, a form of binomial test function that favours significant rules with greater emphasis on coverage.

Trajkovski et al.’s work used propositionalization of relational features by first-order feature construction. A subgroup discovery algorithm was then applied as a propositional learner to learn interesting rule sets. They found that a weighted covering algorithm performed better than more traditional covering algorithms, which suffered from proposing overly general hypotheses. Trajkovski et al. included data on gene regulatory network interactions, combined with Gene Ontology categories, to search for relations among genes. Recently this work has been updated in several ways, for example to operate as a web service and to enable the use of external OWL ontologies [VL13].

Fröhler and Kramer [FK08] included more biological background knowledge when applying ILP to the task of predicting the regulatory state of a gene in yeast cells under various stress conditions. In their background knowledge they included information such as binding site sequences, protein interactions and FunCat (mips.helmholtz-muenchen.de/proj/funcatDB/) categories. Instead of Aleph, their work used TILDE to induce first-order logical decision trees. They achieved high accuracy using boosted first-order decision trees, but reported very large search spaces when attempting Progol-type learning with hypotheses constructed from the bottom clause.

In these studies, the task is typically the easier problem of predicting intra-cellular rather than extra-cellular phenotype. They also use much less background knowledge; for example, they often consider fewer interactions.

Two recent studies have incorporated the hypergeometric test (or Fisher’s exact test) into hypothesis search, rather than using it in validation as we have done above and in Chapter 5. Jiline et al. [JMT11] apply the hypergeometric test in an ILP setting, but note that it is not practical for large datasets with a relational hypothesis space due to the combinatoric complexity. Langohr et al. [LPP+13] apply Fisher’s exact test in searching for subgroups of genes, but using a propositional representation.

4.6 Conclusions

Peña-Castillo and Hughes [PCH07], contrasting the large number of studies in yeast bioinformatics with the fact that the functions of many yeast genes were still not well understood, noted:

There is clearly a need for human inference and domain knowledge in the creation of new approaches for specific problems, including the characterization of individual genes and their roles in nature.

We believe our ILP approach to learning networks developed in this chapter contributes to this goal, providing a path from large multi-relational data sets to comprehensible diagrams or other biologist-oriented applications of learned theories.

In this chapter we have extended and expanded the methods of the previous chapter towards the goals of this thesis. The representation language has been lifted from propositional to first-order, thus enabling the use of large biological network data sets which are inherently relational. We have expanded theory validation to include external reference data sets to identify significant connections with biological knowledge. We demonstrated that first-order theories can be understood and validated by a domain expert using a two-stage approach of translating ground clauses to natural language and then constructing a structured network visualization.

However, we found that it appears harder to learn to predict phenotype at the cellular level (such as sensitivity to environmental stresses) than a quantitative intra-cellular measure such as protein or gene expression. Additionally, it was not possible to scale up our approach to visualization for the larger gene sets collected for the extra-cellular phenotype prediction task. In the next two chapters of the thesis we address these issues by using techniques from formal concept analysis and closed itemset mining to first structure the problem, then apply a version of our ILP learning to selected concepts. We also apply visualization techniques to both stages, i.e., initial structuring of the problem, then validation of the learned theories.

5 Visualising concept lattices for learning from integrated multi-relational data

“Create your own visual style... let it be unique for yourself and yet identifiable for others.”

Orson Welles

In this chapter, Formal Concept Analysis provides a rigorous framework to combine visualization and machine learning for a complex application in systems biology. We develop a concept lattice construction algorithm based on techniques from frequent itemset mining, and adapt local browsing of concept lattices as the visualization approach. A web-based interface enables the user to navigate the concept lattice easily, to inspect formal concepts for further processing. This includes integration of concepts with external data sources, including relational data representing biological networks, and machine learning using Inductive Logic Programming. Initial results are presented on a significant real-world data set, showing that the approach can generate biologically promising hypotheses for further analysis.

5.1 Introduction

Visual analytics [KMS+08] combines automated methods of machine learning and knowledge discovery with visualization techniques to enable human domain experts to apply their background knowledge and decision-making skills in the exploration of demanding data analysis problems. Although data visualization and model visualization are usually considered separately, in this chapter we use Formal Concept Analysis (FCA) [GW99b], a data analysis framework appropriate for discrete domains, that combines both data objects and their descriptive attributes in a unified concept lattice, integrated with a browser interface and machine learning tool.

Our approach is motivated by applications in molecular systems biology [IGH01], which is developing rapidly in the post-genome era [KB03]. It will enable biologists to evaluate the results of a genome-wide phenotypic study (such as a screen of the yeast deletion library, the set of single-gene knock-out mutants, against a treatment to determine which genes are necessary for survival) in context with large-scale data sets encompassing hundreds of thousands of protein-protein or genetic interactions.

Our long-term goal is to enrich the process of data analysis and mining with domain knowledge from users, to which this work contributes as follows.

First, although the space of formal concepts or closed itemsets is reduced relative to that of all itemsets [PBTL99b], this still results in large numbers of formal concepts on typical real-world data sets. In this chapter we develop an improved algorithm for closed itemset mining that generates a browsable concept lattice designed for systems biology applications. Additional criteria to select or rank concepts from the space can be useful [ABT10], and in our experiments we explore statistically motivated criteria to evaluate potentially interesting concepts.

Second, integration of multiple sources of knowledge that can be used for the intents of formal concepts is often useful. However, this can cause problems, either in expanding the attribute set, which leads to increased computation time in lattice construction, or in extending the expressiveness of the representation, which leads to greater complexity in the FCA framework and algorithms [CR96]. An alternative method we investigate is integration of knowledge after construction of the lattice using an initial set of attributes.

Third, for human domain experts, visualization of the concept lattice enables an overview of the structure of the domain and assists in data analysis and mining. Since the concept lattice is usually too large to be viewed in its entirety, only localized viewing and navigation of the space is possible [Pri06], and we take a similar approach.

Fourth, following the visual analytics approach we add a machine learning capability. MINER [KLS+09] is an interactive method of guiding data mining for genomics data, but is restricted to propositional learning. In this chapter we apply Inductive Logic Programming to learn definitions of selected concepts in the lattice; this allows the integration of relational background knowledge in a data-driven post-processing phase of concept analysis.

Finally, viewing the construction of formal concepts and their lattice as one stage of an analytics process, which we describe below for an application to a systems biology domain, suggests a possible role for ontology construction using methods from M. Bain’s previous work [Bai03a].

The remainder of the chapter is structured as follows. Section 5.2 reviews closed itemset mining and its suitability as a model class for visual analytics, and describes the method we use to construct a lattice of closed itemsets. In Section 5.3 we outline our approach to visualization of the space of closed itemsets, based on ideas from Formal Concept Analysis for browsing a lattice. Section 5.5 discusses the systems biology task investigated in this work, with conclusions and further work in Section 5.6.

5.2 Closed itemset mining

A standard problem in data mining is finding all itemsets and their associations, usually expressed in the form of association rules, given certain parameters [AMS+96]. However, it is well known that this can result in a very large number of itemsets. In practice this can typically be reduced by using a non-redundant representation of the set of itemsets, such as that provided by the set of closed itemsets [PBTL99b]. This framework is closely related to Formal Concept Analysis (FCA) [GW99b], which also provides an approach to visualization of the resulting lattices of itemsets, and hence we will use both FCA and itemset mining notation and terminology interchangeably throughout this chapter.

5.2.1 Formal Concept Analysis

We apply Formal Concept Analysis, a well-known and widely-used method to explore concepts and their hierarchical relations, given a set of observations organized in a formal context. Here we formalise the context for a set of extra-cellular multiple response phenotypes in yeast.

Following the FCA formalization (details in Section 2.4.1), our context has stresses as attributes and genes as objects; therefore the concept lattice induced by the order relation ≤ (a partial order by subsumption) is a complete lattice with tractable encoding structures.

For example, given two concepts

C1 = ⟨ {H2O2 Thorpe, CHP Thorpe, GHS homeostasis, Diamide Thorpe}, {EOS1, HFI1, PAF1, URE2} ⟩

C2 = ⟨ {H2O2 Thorpe, CHP Thorpe, GHS homeostasis, Diamide Thorpe, Sorbate}, {HFI1, PAF1} ⟩

we have the order C1 ≤ C2 ↔ X1 ⊇ X2 on the extents and the equivalent relationship C1 ≤ C2 ↔ Y1 ⊆ Y2 on the intents, where C1 is the more general concept and C2 the more specific. This structure allows the user to manipulate the combined response in genomic terms by adding or removing stresses: as we see, adding a stress such as Sorbate to the intent of C1 removes the genes {EOS1, URE2} from the combined response, giving C2, and vice versa.
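A minimal sketch of this order check in Prolog, with concepts represented as concept(Intent, Extent) terms and subset/2 from the SWI-Prolog library(lists) (an illustration, not BioLattice code):

    % more_general(C1, C2): C1 =< C2 in the order used here, i.e. the
    % intent of C1 is contained in that of C2 and the extent of C1
    % contains that of C2.
    more_general(concept(I1, E1), concept(I2, E2)) :-
        subset(I1, I2),
        subset(E2, E1).

    % ?- more_general(concept([h2o2_thorpe, chp_thorpe, ghs, diamide_thorpe],
    %                         [eos1, hfi1, paf1, ure2]),
    %                 concept([h2o2_thorpe, chp_thorpe, ghs, diamide_thorpe,
    %                          sorbate],
    %                         [hfi1, paf1])).
    % true.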

This formalisation therefore provides the basis for a method to extract the concepts, to visualize them, and to relate them to other sources of data, such as annotation of gene function using the Gene Ontology [Ash00], and data on activity such as microarray measurements and protein-protein interactions.

5.2.2 A concept lattice algorithm to support visual analytics

Many different algorithms have been proposed to construct concept lattices: see [KO01] for a review. All of these algorithms are subject to the bottleneck of generating all of the concepts in the lattice. It is a well-known problem that the computational complexity of generating all concepts becomes critical, as the number of concepts increases exponentially with the size of the input context. The complexity of mining closed itemsets (formal concepts) for a given level of support has been shown to be #P-complete [Yan04], i.e., intractable in the worst case.

Therefore, work on FCA and closed-itemset mining has tended to focus either on making mining more efficient or on enabling conceptual analysis. On one hand there are algorithms to speed up closed itemset mining for conceptual knowledge discovery or data mining, such as TITANIC [STB+02], GALICIA [VGRH03] or CHARM [ZH05]. On the other hand, implementations of formal concept analysis tend to focus on elegant visualizations of entire concept lattices, but are restricted to smaller datasets, such as TOSCANA [EGSW00] for conceptual information systems.

Figure 5.1: BioLattice intent-extent search tree. Candidate concepts are shown as intent-extent pairs; the tree is divided into prefix-based equivalence classes, with branch processing at successive search levels. See Section 5.2.5 for details.

However, in systems biology applications the problem is that while visualization is typically a requirement, the data sets are quite large and heterogeneous, drawn from multiple sources. To address this problem we have developed a system called BioLattice, a full-scale implementation of a concept lattice building algorithm that also enables visualization. We have carefully selected a lattice building algorithm, as the complexity depends strongly on the input context. A variant of CHARM [ZH02] is implemented to take advantage of its vertical data format representation; its computational efficiency is based on: (a) search performed over an intent-extent tree search space; and (b) pruning based on both non-frequent itemsets (as in association rule mining) and non-closed itemsets.

5.2.3 Definitions and techniques

The horizontal data format is the most commonly applied data format in Apriori [AMS+96] or FP-growth [HPYM04] based data mining techniques. The vertical data format used in this work differs from the horizontal format in that it stores the set of transaction indexes for each single item, instead of rows containing the list of items in each transaction. Several methods, such as CHARM [ZH05], have demonstrated the merits of the vertical format in outperforming the horizontal format, for example a reduced number of I/O overhead operations for candidate generation and support counting at each step, and improved internal data structures with hash/search trees. In the vertical format, mining reduces to intersecting the readily available transaction-index sets of itemsets, with support computed as the length of the resulting index set. Moreover, this approach enhances pruning of irrelevant transaction indexes at each intersection operation and offers scope for compression.

Fast computation of frequent closed concepts is achieved by dividing the search space into individual prefix-based equivalence classes. As in Figure 5.1, each node or concept in the tree is an intent-extent pair ⟨Pi⟩, ⟨Pt⟩, defining a prefix-based equivalence class Pi over all of its children {C1, C2, ..., Cn}.

For example, at the root each of the frequent intents from the set {F, I, E, H, D, G, B, C, A} is a possible extension of the equivalence class. Starting with the left-most child with class [F], all the right siblings or frequent extensions such as {FD, FC, FA} (see Figure 5.3) share the prefix F. In such a framework, mining frequent concepts is straightforward: given a node or prefix class [P] = {C1, C2, ..., Cn}, a new frequent class [PCi] = {Cj} can be mined by intersecting PCi⟨extent⟩ with all PCj⟨extent⟩, where j > i in the ordering and the itemset {PCiCj} is frequent.
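One step of this extension can be sketched in Prolog over the vertical format as follows; the item/2 representation and predicate name are our illustration:

    % extend_class(+Node, +Sibling, +MinSup, -Child): combine a prefix-class
    % node with a right sibling by intersecting their extents.
    extend_class(item(Pi, Ei), item(Cj, Ej), MinSup, item(NewIntent, E)) :-
        intersection(Ei, Ej, E),   % extent of the combined itemset
        length(E, Sup),
        Sup >= MinSup,             % keep only frequent extensions
        append(Pi, Cj, NewIntent).

For example, with a minimum support of 1, extend_class(item([f], [g7]), item([d], [g5, g6, g7, g8]), 1, X) yields X = item([f, d], [g7]), matching the node {F, D}{g7} of Figure 5.1.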

5.2.4 Properties used for search-space pruning

Pruning is carried out based on four properties introduced in [ZH02]. Let Ci and Cj be two candidate concepts derived from the context. The properties are as follows; a sketch of the case analysis in Prolog follows the list.

1. If Ci⟨extent⟩ = Cj⟨extent⟩, then replace Ci⟨intent⟩ with the more specialized intent Ci⟨intent⟩ ∪ Cj⟨intent⟩ and remove candidate concept Cj from the context.

2. If Ci⟨extent⟩ ⊂ Cj⟨extent⟩, then replace Ci⟨intent⟩ with the more specialized intent Ci⟨intent⟩ ∪ Cj⟨intent⟩.

3. If Ci⟨extent⟩ ⊃ Cj⟨extent⟩, then add a specialized branch Cb under Ci with intent Cj⟨intent⟩ and extent Ci⟨extent⟩ ∩ Cj⟨extent⟩, and remove candidate concept Cj from the context.

4. If Ci⟨extent⟩ ⊄ Cj⟨extent⟩, Ci⟨extent⟩ ⊅ Cj⟨extent⟩ and Ci⟨extent⟩ ∩ Cj⟨extent⟩ ≠ ∅, then add a specialized branch Cb under Ci with intent Cj⟨intent⟩ and extent Ci⟨extent⟩ ∩ Cj⟨extent⟩.
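A sketch of this case analysis in Prolog (our simplified rendering; the real DoSpecialization in Algorithm 3 below also mutates the context and branch list):

    % specialize(+Ei, +Ej, -Action): decide the action for extents Ei of Ci
    % and Ej of Cj, assuming extents are kept as sorted lists.
    specialize(Ei, Ej, Action) :-
        intersection(Ei, Ej, E),
        (   Ei == Ej        -> Action = merge_and_drop_cj   % property 1
        ;   subset(Ei, Ej)  -> Action = extend_intent       % property 2
        ;   subset(Ej, Ei)  -> Action = branch_and_drop_cj  % property 3
        ;   E \== []        -> Action = branch(E)           % property 4
        ;   Action = skip                                   % disjoint extents
        ).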

5.2.5 Algorithm Design

This section presents an overview of the algorithm to search for closed frequent concepts over the intent-extent search space, organizing them in a concept lattice built incrementally at the same time (Figure 5.2). The top level of the method is given in Algorithm 1.

Figure 5.2: Steps to build the formal concept lattice: preprocess the context to vertical format; generate frequent 1-intent candidates, ordered by increasing weight; divide the intent-extent search space into prefix-based equivalence classes; find frequent concepts local to each class; compare closures globally to prune non-closed concepts; add newly found closed concepts to the lattice and update the hierarchy; and recursively process branches.

Algorithm 1 BioLattice(data G, minSup)
1: Generate frequent 1-intent candidates and sort them based on the weight function
2: Initialize root Lr = φ, IntentCid IC, prefix p = φ, lattice L
3: BioLattice_Extend(G, IC, minSup, L, p, Lr)
4: return L // formal concept lattice

In general, the search space of formal concepts or closed itemsets is reduced by eliminating candidates which have less than the minimum support specified by the user, in common with other itemset mining methods. As a pre-processing step, a weight function is computed for candidate 1-itemset concepts as in [ZH02]. Once frequent 1-itemset candidates are generated, they are sorted based on this weight function, which is defined for an item x as the sum of the support of all frequent 2-itemsets containing x. This sorted data becomes the input set G. The weights for the example in Figure 5.1 are shown in Figure 5.3; note that the weights determine the left-to-right sibling ordering in the search tree of that figure.
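A sketch of this weight computation, again over item/2 terms (our illustration):

    % weight(+X, +Items, +MinSup, -W): W is the sum of the supports of all
    % frequent 2-itemsets containing item X.
    weight(X, Items, MinSup, W) :-
        member(item(X, Ex), Items),
        findall(S, ( member(item(Y, Ey), Items),
                     Y \== X,
                     intersection(Ex, Ey, E),
                     length(E, S),
                     S >= MinSup ),
                Ss),
        sum_list(Ss, W).

For item F in Figure 5.1 (extent {g7}), only the 2-itemsets FD, FC and FA are frequent at minimum support 1, each with support 1, giving the weight 3 shown in Figure 5.3.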

A look-up table “IC” records each intent and its occurrences in closed concepts, and is thereafter used for subsumption checks. An example of such a table is shown at the top left of Figure 5.4. The concept lattice is stored in a table called “L” containing information on all concepts and their hierarchical relations in the lattice.

Whenever a frequent concept is found, i.e., one with support above the user-supplied parameter minSup, IC and L are accessed to check subsumption and, if the concept is found to be closed, a unique concept index is generated using a hash function. Based on the subsumption information, the closed concept is added into the lattice while maintaining the general-to-specific order.

descriptor | objects | weight
F | g7 | 3
I | g4 | 4
E | g5, g6, g8 | 10
H | g2, g3, g4 | 11
D | g5, g6, g7, g8 | 13
G | g1, g2, g3, g4 | 13
B | g1, g2, g3, g5, g6 | 16
C | g3, g4, g6, g7, g8 | 18
A | g1, g2, g3, g4, g5, g6, g7, g8 | 26

Figure 5.3: Weight function values for the frequent 1-itemsets of Figure 5.1.

Generating concept sets

Algorithm 2 (BioLattice_Extend) generates closed concepts for each of the equivalence classes, i.e., classes of intents sharing a common prefix set of items. The resulting search tree for such a class (see Figure 5.1) can be seen as a locally generated, general-to-specific ordered set of concepts, where the top concept is the most general, with an intent containing the prefix shared by all of its specialized sub-concepts.

Algorithm 2 BioLattice_Extend(G, IntentCid IC, minSup, lattice L, prefix p, root Lr)
1: for all xi in G do
2:   c⟨intent⟩ = {p} ∪ xi⟨intent⟩
3:   c⟨extent⟩ = xi⟨extent⟩
4:   branch = φ and closure = φ
5:   for all xj in G do
6:     concept⟨intent⟩ = xj⟨intent⟩
7:     concept⟨extent⟩ = xi⟨extent⟩ ∩ xj⟨extent⟩
8:     if support(concept) ≥ minSup then
9:       DoSpecialization(concept, xi, xj, branch, c, G)
10:    end if
11:  end for
12:  ComputeClosure(c, IC)
13:  Ln = AddConceptToLattice(c, L, IC, Lr, closure)
14:  if Ln ≠ Lr then
15:    sort branch in order of increasing support
16:    BioLattice_Extend(branch, IC, minSup, L, c, Ln)
17:  end if
18:  delete c
19: end for

A working example of Algorithm 2 follows. Given the frequent 1-itemset candidates listed in Figure 5.3, each of them forms an initial equivalence class with a 1-intent prefix. The first candidate, F, with the lowest weight, is selected and combined in turn with each subsequent element in the list to check whether the number of their co-occurrences (the intersection of extents) satisfies the required minimum support. Subsets such as {F,I}, {F,E}, {F,H}, {F,G} and {F,B} result in infrequent candidates and are thereby pruned. Frequent candidates such as {F,D}, {F,C} and {F,A} are checked against the four properties (line 9; the DoSpecialization subroutine defined in Algorithm 3). Two types of local specialization are possible: properties 1 & 2 specialize by replacing a node with an increased number of intents, whereas properties 3 & 4 specialize by branching out a node with a reduced number of extents. As we see for the candidate {D},{g5,g6,g7,g8}, property 2 is satisfied and the node {F},{g7} is replaced by the increased intent, i.e., {F, D},{g7}, as shown in Figure 5.1.

Algorithm 3 DoSpecialization(concept N, xi, xj, branch, p, G)
1: if xi⟨extent⟩ = xj⟨extent⟩ then
2:   delete xj from G
3:   p⟨intent⟩ = p⟨intent⟩ ∪ xj⟨intent⟩
4: else if xi⟨extent⟩ ⊂ xj⟨extent⟩ then
5:   p⟨intent⟩ = p⟨intent⟩ ∪ xj⟨intent⟩
6: else if xi⟨extent⟩ ⊃ xj⟨extent⟩ then
7:   delete xj from G
8:   add N to branch
9: else if xi⟨extent⟩ ≠ xj⟨extent⟩ then
10:  add N to branch
11: end if

An example of branching out a new node is shown as a highlighted circle in Figure 5.1. Suppose a top node has the prefix {E, D} with extent {g5, g6, g8} and a possible candidate is {B},{g1, g2, g3, g5, g6}. Following property 4, a branch with the intersection of their extents, {E, D, B},{g5, g6}, is possible. Note that, unlike CHARM, we record only the difference of their intents, i.e., {B},{g5, g6}, at this point. Processing the next element {C},{g3, g4, g6, g7, g8} adds another branch {C},{g6, g8}. Finally, the last element {A},{g1, g2, g3, g4, g5, g6, g7, g8} satisfies property 2, and the change is recorded only at the top node, replacing {E, D},{g5, g6, g8} by {E, D, A},{g5, g6, g8}. Unlike CHARM, our algorithm does not visit all branches to reorder and update details every time property 1 or 2 is satisfied.

After subsumption checks and lattice updates are carried out for {E, D, A},{g5, g6, g8}, BioLattice_Extend is recursively called for the next search level in a depth-first manner. From the information available in the prefix {E, D, A} (line 2 of Algorithm 2), the branch {B},{g5, g6} forms the full concept {E, D, A, B},{g5, g6}. Combining it with the only right sibling, {C},{g6, g8}, satisfies property 4 and forms the branch {C},{g6}. Before the search is extended another level, the closure of the concept {E, D, A, B},{g5, g6} is examined and the concept is added to the lattice. As the search continues with the prefix {E, D, A, B},{g5, g6}, the only branch {C},{g6} forms the concept {E, D, A, B, C},{g6}, which is added to the lattice. Finally, the algorithm backtracks to the root with the prefix {E, D, A},{g5, g6, g8} to process the remaining branch {C},{g6, g8}, which forms {E, D, A, C},{g6, g8} and is placed in the lattice.

To promote occurrences of properties 1 & 2, branches are sorted in increasing order of support. This ordering is done at line 15 of BioLattice_Extend, before branch processing is initiated recursively. This step reduces the number of levels to be searched, with the possibility of creating fewer branches.

Algorithm 2 calls two subroutines (at lines 12 and 13), ComputeClosure (Algorithm 4) and AddConceptToLattice (Algorithm 5), to carry out the tasks of pruning non-closed concepts and building the concept lattice. The idea is that a list of more specific concepts is generated and their supports are examined; if any such concept has equal or greater support than the candidate concept, the candidate is not closed and is therefore pruned.

Algorithm 4 ComputeClosure(candidateset P, IntentCid IC)
1: for all pi in P do
2:   closures = closures ∩ IC⟨pi⟩
3: end for
4: return closures

ComputeClosure (Algorithm 4) generates a list of all possible more specific concepts from the IntentCid table. Based on the intersection of their concept indices and the corresponding lattice entries, their support can be computed. For example, from Figure 5.1 the concept {D, A, B},{g5, g6} (line 2, Algorithm 4) generates, by intersecting the IntentCid entries for D, A and B respectively:

{C5, C6} = {C1, C3, C4, C5, C6, C11}
         ∩ {C1, C2, C3, C4, C5, C6, C7, C8, C9, C10, C11}
         ∩ {C5, C6, C9, C10}

As we see in the lattice table L,

C5 = {E, D, A, C, B}, {g6} and
C6 = {E, D, A, B}, {g5, g6}

therefore the concept {D, A, B},{g5, g6} is clearly subsumed by C6, but not by C5, and is pruned as a non-closed concept. Line 14 of Algorithm 2 ensures that all of its children listed under the branch are also pruned, as they would generate further non-closed concepts.
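The subsumption test itself can be sketched as follows, with IC and L as association tables from the SWI-Prolog library(assoc); this illustrates the idea rather than the actual implementation:

    % closed_concept(+Intent, +Sup, +IC, +L): succeeds if no stored concept
    % containing every item of (non-empty) Intent has support >= Sup.
    closed_concept(Intent, Sup, IC, L) :-
        findall(Ids, ( member(I, Intent), get_assoc(I, IC, Ids) ),
                [Ids0|Rest]),
        foldl(intersection, Rest, Ids0, Shared),  % candidate subsumers
        \+ ( member(Cid, Shared),
             get_assoc(Cid, L, concept(_Intent2, Extent2)),
             length(Extent2, S2),
             S2 >= Sup ).

Here foldl/4 and intersection/3 come from the SWI-Prolog libraries apply and lists respectively.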

Algorithm 5 AddConceptToLattice(candidate c, L, IntentCid IC, Lr, closure S)
1: for all si in S do
2:   if supp(si) ≥ supp(c) then
3:     return Lr // if subsumed, return current lattice node
4:   end if
5: end for
6: key ← newid // generate unique id
7: for all intent d in c do
8:   IC⟨d⟩.add(key) // insert into IntentCid
9: end for
10: L⟨key⟩⟨intent⟩ = c⟨intent⟩ // add concept to the lattice
11: L⟨key⟩⟨extent⟩ = c⟨extent⟩
12: L⟨key⟩⟨parent⟩ = Lr // add root node as parent
13: L⟨Lr⟩⟨child⟩.add(key) // add new concept as child
14: Lr = key // update root node
15: S ← sort in descending order based on support
16: for all si in S do
17:   for all sj in S do
18:     if sj⟨extent⟩ ⊂ si⟨extent⟩ then
19:       S.delete(sj) // remove lower transitive closure
20:     end if
21:   end for
22:   L⟨Lr⟩⟨child⟩.add(si) // add more specific concept as child
23:   L⟨si⟩⟨parent⟩.add(Lr) // update parents
24:   P = parents(si)
25:   for all p in P do
26:     if Lr⟨extent⟩ ⊂ p⟨extent⟩ then
27:       L⟨p⟩⟨child⟩.delete(si) // remove upper transitive closure
28:       L⟨si⟩⟨parent⟩.delete(p)
29:     end if
30:   end for
31: end for

A unique concept index is generated for each closed concept and is recorded in the IntentCid table by Algorithm 5, AddConceptToLattice (line 6). The final step of adding a concept to the lattice requires correctly identifying all of its immediate more general and more specific concepts. The root variable Lr keeps track of the immediate more general concept and is added as a parent of the newly found closed concept (lines 10-14 of Algorithm 5). With the list of readily available more specific concepts from the previous steps, we need to eliminate any transitive closures, if they exist, and afterwards add them as children of the current concept. Any closed concept found to be more specific than another closed concept by lower transitive closure is discarded (lines 16-21, Algorithm 5). This step is followed by the addition of the remaining more specific concepts as children of the newly generated closed concept. Any upper transitive closures, as shown in the steps of Figure 5.4, are also removed (lines 25-30, Algorithm 5).

Intent | Concept_ids
A | C_1, C_2, C_3, C_4, C_5, C_6
B | C_5, C_6
C | C_1, C_2, C_4, C_5
D | C_1, C_3, C_4, C_5, C_6
E | C_3, C_4, C_5, C_6
F | C_1
G | C_2
H | C_2
I | C_2

Figure 5.4: Lattice restructure. The candidate frequent concept {H, G, A},{g2, g3, g4} is not subsumed by the more specific concept C_2 = {I, H, G, C, A},{g4}, so a new concept id C_7 is generated, C_2 is added as a child of C_7, the current root Lr = C_0 is added as its parent, and the upper transitive closure between C_2 and C_0 is removed. The IntentCid look-up table for this example is shown above.

5.2.6 BioLattice compared to CHARM-L

Most closed-set mining algorithms, including CHARM [ZH02], generate only intents, and in systems biology applications this is inadequate, because the objects (usually genes) together with the complete lattice are required for visualization. An extended version, CHARM-L, includes lattice generation, but building the complete lattice with closed concepts containing extents was left out. Therefore, the only solution to this problem was to develop our own implementation.

BioLattice has noteworthy implementation differences from CHARM-L in the way it structures the search space and builds the lattice. Figure 5.1 shows the BioLattice search tree, where each of the equivalence classes can be seen as a locally ordered, general-to-specific set of concepts that share a common prefix. A candidate concept is specialized either by increasing its intent (no change in extent), computed from a right sibling with higher support (Properties 1 & 2), or by creating a new branch with reduced extent (Properties 3 & 4). During this process we propose that storing the intersection of the extents Cj⟨extent⟩ ∩ Ci⟨extent⟩ is sufficient to process branches in the next recursive call (see Algorithm 2). Replacement occurs strictly at the prefix and, unlike CHARM, this technique does not require visiting all branches every time property 1 or 2 is satisfied, leading to replacement of a more general by a more specialised concept; this significantly speeds up the search process. In bioinformatics domains, especially in our case where many genes (or transactions) can share common attributes or common subsets of attributes, this compression technique not only speeds up the search but also reduces the memory storage requirements.

Thus we have expanded the approach to store extra information while taking advantage of an intersection-based subsumption check over concept indexes. BioLattice checks subsumption at the parent before it generates a next-level concept, and thus builds a top-down lattice faster, with less restructuring.
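The intersection-based step can be sketched as follows (illustrative names, assuming intents and extents are stored as frozensets): a candidate built from two siblings needs only the intersection of their extents, and the subsumption check at the prefix is simply whether that intersection leaves the prefix extent unchanged.

# Sketch of the intersection-based specialization (Properties 1-4): a new
# candidate from siblings C_i and C_j stores only the intersection of their
# extents; if the extent is unchanged, the prefix concept is replaced in place.

def specialize(ci, cj):
    """ci, cj: (intent, extent) frozenset pairs sharing a common prefix.
    Returns the specialized candidate and whether it replaces the prefix."""
    (int_i, ext_i), (int_j, ext_j) = ci, cj
    extent = ext_i & ext_j                # only the intersection is stored
    replaces_prefix = (extent == ext_i)   # subsumption check at the prefix
    return (int_i | int_j, extent), replaces_prefix

# e.g. combining {B}{g5,g6} with {C}{g6,g8} opens a new branch {B,C}{g6}
cand, replaced = specialize((frozenset({'B'}), frozenset({'g5', 'g6'})),
                            (frozenset({'C'}), frozenset({'g6', 'g8'})))
print(cand, replaced)   # extent shrank to {'g6'}, so replaced is False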

5.3 A web-based browser for BioLattice

Figure 5.5: Concept Lattice

The ability to generate a complete lattice over concepts that can be visualized is the hallmark of Formal Concept Analysis. This hierarchical structure supports reasoning for classification, clustering, implication discovery, rule learning, etc. However, the size of a lattice grows exponentially with the number of attributes, and systems biology applications for complex networks with such large numbers of attributes are common. Visualization of such lattices in their entirety therefore tends not to be comprehensible for users [Pri06].

To minimize the cost of scalable visualization we propose an "incremental exploration technique" [HMM00]: placing a visible "window" on the lattice and moving this window along lattice edges to navigate by the concept order. This is related to previous work on concept lattice navigation by Godin et al. [GMA93], Carpineto and Romano [CR96], and Kim and Compton [KC04]. The underlying strategy is to explore the concept in current focus together with the relative information of adjacent concepts (parents or children) in the lattice hierarchy.

To simplify the browser implementation, a tabular format is used for intents, extents and integrated background information, as seen in Figure 5.5. For each concept, i.e., a set of stress responses, over-represented ontological categories (Gene Ontology, www.geneontology.org) and protein interactions from BioGRID (www.thebiogrid.org) are displayed.

5.3.1 BioLattice Application Window

The application window for BioLattice (see Figure 5.7) offers two options in order to build the formal concept lattice: i) the input formal context file and ii) the desired minimum support (default = 2). The input file is a .txt file whose header is a list of stresses (descriptors/intents). Subsequent rows contain genes or ORFs in matrix form, with binary value '1' for up-regulated genes and '0' for down-regulated genes in the corresponding descriptor (stress) column. Once the file is uploaded, the context is preprocessed into the vertical format required for frequent closed concept mining, as discussed in Section 5.2.5; a sketch of this conversion follows Figure 5.6. A sample input file is shown in Figure 5.6.

header('ORF', ['Farnesol', 'Sterols uptake', 'Cisplatin res', 'Intron splicing', 'Cisplatin sen', ...]).
'YBL044W', [1, 0, 0, 1, 0, ...].
'YDR006C', [0, 0, 1, 1, 0, ...].
'YDR470C', [0, 1, 0, 0, 0, ...].
...

Figure 5.6: BioLattice input file
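As a sketch of this preprocessing (the thesis does not show its parser; the rows below are taken from Figure 5.6, truncated to four stresses), the binary context can be inverted into a map from each stress to its set of up-regulated genes:

# Sketch: invert the binary context of Figure 5.6 into the vertical format
# (stress -> set of up-regulated genes) used by frequent closed concept mining.
# The rows are inlined for brevity; a real run would parse the uploaded .txt.

stresses = ['Farnesol', 'Sterols uptake', 'Cisplatin res', 'Intron splicing']
rows = {
    'YBL044W': [1, 0, 0, 1],
    'YDR006C': [0, 0, 1, 1],
    'YDR470C': [0, 1, 0, 0],
}

vertical = {s: {orf for orf, bits in rows.items() if bits[i]}
            for i, s in enumerate(stresses)}
print(vertical['Intron splicing'])   # {'YBL044W', 'YDR006C'}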

5.3.2 Tabular Lattice Display

The lattice is graphically represented as a table indexed by concept ids as rows. The columns for a particular concept give its set of stresses as "INTENT", the set of genes responding to those stresses as "EXTENT", more general concepts (reduced number of stresses) as "PARENT" and more specific concepts (increased number of stresses) as "CHILD". This representation provides an overview of the lattice complexity given a minimum support, the size of the genomic response to a set of stresses and, finally, the flexibility to pick an individual concept of interest and initiate browsing for further details.

Figure 5.7: BioLattice application index window

[Figure: tabular lattice display with columns CONCEPT_INDEX, INTENT, EXTENT, PARENT and CHILD. Sample rows show concepts C_592, C_593, C_583, C_585 and C_586, whose intents are stress combinations (e.g., CHP_Thorpe, BLM, LoaOOH_Thorpe, Diamide_Thorpe, Mefloquine_Fields, Sorbate), whose extents are gene sets (e.g., AKR1, ARV1, GAL11, PHO85, REG1, SLX8, SNF6, SRB5, ERG4, ROX3), and whose PARENT and CHILD columns list linked concept ids.]

Figure 5.8: Tabular list of concept details

5.3.3 Lattice Manipulation

Selection of a concept link from the tabular display opens an interactive browser page that lets the user examine concepts with the following functionality, as shown in Figure 5.5.

1. The center of the navigation page has a "Main" pane that displays the set of gene responses to the corresponding set of stresses for the selected concept.

2. The group of more specific concepts on the right hand side, under the heading "More Specific", represents the effect of additional stresses ("more") in terms of fewer sensitive genes ("less").

3. To browse the lattice bottom-up to view more general concepts, the user is provided with the link "more general concept". Similarly a "more specific concept" link is provided when browsing top-down (not shown).

4. By default, the more specific concepts are displayed for the current "Main" concept in the pane (i.e., bottom-up navigation). The link "Browse", when clicked, will make that concept the new focus of the browser.

5. Each of the displayed genes has a gene product query link to the Gene Ontology entry for that gene.

6. All the concepts have ontology information via the “GO” link (not shown). More general or more specific concepts have comparative ontology information with regards to the current focus concept. The Gene Ontology MySQL database has been integrated with the lattice browser for this purpose and all three possible ontologies (process, component and function) with significantly over-represented categories are presented.

7. A visual display of the concept-related protein-protein interaction network is provided for all the concepts via the "PPI" link. The BioGRID database has been integrated for this.

5.3.4 Comparative Ontology Display

Figure 5.9 displays two alternative ontology views: common ontology terms shared by the target and its more general/specific concepts, and over-represented ontology categories in each of them. Categories can be viewed by "Molecular Function", "Biological Process" or "Cellular Component", as in Figure 5.9.

5.3.5 Visualizing the Protein-Protein Interaction Network

Understanding protein-protein interactions as part of the global regulatory network is essential to comprehend the dynamics of the underlying cellular response. Pairwise interactions among yeast proteins were gathered from BioGRID and graphically displayed for each subgraph induced by the concept. Figure 5.10 represents a protein-protein interaction network as an undirected graph, where nodes represent proteins and edges their interactions. Nodes filled with a red colour are the proteins extracted from concepts. Even though the "interaction type" is not available, solid lines indicate direct interactions and dashed lines indirect interactions.

Figure 5.9: Over-represented Gene Ontology categories

5.4 Mapping to Gene Ontology terms

The problem of mapping the genes in a concept to their functional domain can be stated as follows. Given an ordered set of formal concepts, or lattice, L, where each concept ⟨X, Y⟩ has X as a set of responses and Y as a set of genes, find the related set of ontology terms Z, with some evaluation function evl(Z) which will rank the top GO terms from the context F.

Figure 5.10: Protein-protein interaction network. a) Interaction network for the combined stress Menadione_thorpe, LoaOOH_thorpe. b) With the additional stress CHP_thorpe (a and b).

The context F, a coverage matrix (details in Chapter 3), is used to account for GO structure to obtain an information measure for GO annotation. In this approach GO edge types are ignored and edges are regarded as instances of the single relation refines(c, p), where c and p are child and parent nodes. The coverage matrix may then be used for over-representation analysis of the set of genes covered by each of the concepts in the lattice.
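A small sketch may make the coverage matrix concrete (assuming, as an illustration, that the four categories of Figure 5.11 form a refinement chain given as refines(child, parent) edges): a gene annotated to a category is counted as covered by that category and all of its ancestors.

# Sketch: propagate annotations up refines(child, parent) edges to obtain the
# coverage of each gene, i.e., the rows of the coverage matrix.

from collections import defaultdict

refines = [('GO:0016455', 'GO:0016251'),
           ('GO:0016251', 'GO:0003702'),
           ('GO:0003702', 'GO:0030528')]                      # child -> parent
annot = {'geneX': {'GO:0016455'}, 'geneY': {'GO:0003702'}}    # direct annotations

parents = defaultdict(set)
for c, p in refines:
    parents[c].add(p)

def ancestors(cat, seen=None):
    seen = set() if seen is None else seen
    for p in parents[cat] - seen:
        seen.add(p)
        ancestors(p, seen)
    return seen

coverage = {g: set().union(*({c} | ancestors(c) for c in cats))
            for g, cats in annot.items()}
print(sorted(coverage['geneX']))   # all four categories on the chain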

To understand how well the coverage matrix approach relates to more standard over-representation analysis methods we carried out an initial comparison. We again use the hypergeometric distribution due to our domain expert's familiarity with its use in over-representation analysis, but here the setting differs from that of Equation 3.1 in Chapter 3, so we give the equation again with the new notation.

From the definition of the hypergeometric distribution,

\[
P(m, n, M, N) = 1 - \sum_{i=0}^{m-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}} \tag{5.1}
\]

where N is the total number of genes in the background distribution (i.e., all those in S. cerevisiae), M is the total number of genes in that distribution annotated to the category, n is the number of genes in the current concept, and m is the number of genes in the current concept annotated to the category.
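Under these definitions, Equation 5.1 is the upper tail of the standard hypergeometric distribution and can be evaluated directly, for instance with SciPy's survival function (a sketch, not the thesis code):

# Sketch: evaluate Equation 5.1 with SciPy. hypergeom.sf(m - 1, N, M, n)
# gives P(X >= m) for X drawn from a population of N genes of which M are
# annotated, sampling n genes (the concept extent).

from scipy.stats import hypergeom

def over_representation_pvalue(m, n, M, N):
    return hypergeom.sf(m - 1, N, M, n)

# Worked numbers from Section 5.5.1: 220 of the 249 concept genes are among
# the 2893 synthetic-lethality genes out of 6140 in total, giving p ~ 2.4e-44.
print(over_representation_pvalue(220, 249, 2893, 6140))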

General to specific ontology categories

Concept intent (stress)                              GO:0030528        GO:0003702        GO:0016251        GO:0016455
                                                     ratio p(x10-2)    ratio p(x10-2)    ratio p(x10-3)    ratio p(x10-4)
BLM                                                  0.056 0.845       0.051 0.155       0.040 0.137       -     -
BLM+Sorbate                                          0.111 0.062       0.095 0.070       0.095 0.004       -     -
BLM+Sorbate+CHP_thorpe                               0.429 0.575       0.429 0.041       0.429 0.047       0.429 0.033
BLM+Sorbate+CHP_thorpe+diamide_thorpe                0.500 0.343       0.500 0.024       0.500 0.027       0.500 0.019
BLM+Sorbate+CHP_thorpe+diamide_thorpe+H2o2_thorpe    0.750 0.075       0.750 0.005       0.750 0.006       0.750 0.004

Figure 5.11: Ontology categories in response to stress. Shown is the relation of values for over-representation analyses on two measures for four GO categories (columns) on five concepts (rows). Categories in columns are ordered by decreasing generality from left to right. Concepts in rows are ordered by decreasing generality from top to bottom.

This approach takes each concept as a set of genes and evaluates whether related functional categories are highly significant to that particular concept or occur just by chance. The lower the p-value, the higher the significance. Unlike our work in Chapter 3, here the size of the sample is replaced by the size of the set of genes in the concept extent. Applying the hypergeometric distribution at the concept level may not only detect functions involved in various stresses but also indicate an increase or decrease in significance of particular functions.

To test this we investigated how tightly the mappings of ontology terms followed the original formal concept lattice. Figure 5.11 has sample data showing variations in concept ratio (m/n) and hypergeometric distribution in response to 5 different stresses for 4 hierarchically related ontology categories. Concept ratio is included as a heuristic approximation to the hypergeometric distribution, since it is faster to evaluate and may be more robust to multiple testing issues.

Monotonicity of concept ratio in categories

This is a good example showing that the concept ratio (m/n) in ontology categories satisfies monotonicity. With added stresses, the size (n) of the set of genes involved in a response decreases in a concept and, thereby, the concept ratio in any category in the respective concept also decreases, as we see in Figure 5.12.

[Figure: line chart "Concept ratio vs. stress in ontology categories"; the y-axis is concept ratio, the x-axis is environmental stress from BLM through BLM+Sorbate to BLM+Sorbate+CHP_thorpe+diamide_thorpe+H2o2_thorpe, with one line per category (GO:0030528, GO:0003702, GO:0016455).]

Figure 5.12: Concept ratio (no. of genes annotated to the category / no. of genes in the concept) in ontology categories in response to stress. Generality orderings as in Figure 5.11.

[Figure: dual-axis line chart "Concept ratio, Pvalue by stress in ontology categories"; the left y-axis is concept ratio, the right y-axis is p-value, and the x-axis is environmental stress from BLM to BLM+Sorbate+CHP_thorpe+diamide_thorpe+H2o2_thorpe, with concept-ratio and p-value series for GO:0030528, GO:0003702, GO:0016251 and GO:0016455 (p-values scaled by 10^-2, 10^-2, 10^-3 and 10^-4 respectively).]

Figure 5.13: Variation in concept ratio (no. of genes annotated to the category / no. of genes in the concept) and p-value in ontology categories in response to stress. Generality orderings as in Figure 5.11.

Hypergeometric distribution in categories

Figure 5.13 shows hypergeometric distribution values in categories as more stresses are added to concept intents. While the distribution often does not satisfy monotonicity, the variation in its values preserves the ontology hierarchy, i.e., a more general category has a higher p-value (less significant) compared to a more specific category, irrespective of stress. Thus, unlike other GO-based annotation methods, the coverage matrix representation overcomes the "multiple category annotation" and "multiple depth of annotation" problems (discussed in Chapter 3) and preserves the ontology hierarchy.

5.5 Case study: yeast systems biology

The problem of understanding the causal mechanisms leading to observable phenotypes in the cell has in recent times been re-orientated away from single-gene models to focus more on multiple-gene systems. This is driven on the one hand by biological need (single-gene models do not explain sufficient functionality) and on the other by technological innovations enabling genome-wide high-throughput experimental data to be gathered.

Thus focus has shifted to the study of networks as the common representation for biological systems. For example, protein interactions are an increasingly important source of information on gene behaviour [Cag09, Cos10], and have a potential medical role [CC05]. Network data are relational, which causes problems for standard FCA and machine learning.

To assess the potential of a concept lattice for the evaluation and generation of biological hypotheses, we devised two tasks using a dataset selected for its relevance to the application domain. The data to be mined was taken from the Saccharomyces Genome Deletion Project [WSA+99] as described above in Section 3.4.4 of Chapter 3.

5.5.1 Concept ranking by gene interactions

As a first test of the possible biological significance of the stress combinations in the lattice we ran all genes in each stress combination against the set of genes occurring in the collection of synthetic lethal interactions at BioGRID (www.thebiogrid.org, downloaded 15 July, 2010).

Table 5.1: Concepts ranked by synthetic lethality.

p-value    Intent – stress sensitivities
2.4E-44    Menadione
9.3E-38    BLM
2.4E-29    H2O2, Menadione
2.1E-28    H2O2
1.6E-25    Ibuprofen
1.7E-22    BLM, H2O2
1.9E-21    Sorbate
2.8E-21    MMS
1.2E-20    Ibuprofen, Menadione
1.4E-20    TPZ, anticancer
5.4E-19    IR2
3.7E-18    H2O2, Sorbate
6.0E-18    H2O2, Ibuprofen
6.8E-18    H2O2, Mefloquine
7.0E-18    Mefloquine, Menadione
1.3E-17    Mefloquine
5.1E-17    GHS, homeostasis
1.1E-16    BLM, Sorbate
2.2E-16    Menadione, Sorbate
1.9E-15    BLM, Menadione
3.9E-15    H2O2, Ibuprofen, Menadione
4.4E-15    BLM, H2O2, Menadione

The cellular systems of an organism can be perturbed by challenges such as the addition of a stress-inducing agent to the environment or a mutation such as the deletion of a gene, and this can affect the observable phenotype. The stress data set used in this chapter can be seen as a set of such perturbations, where a cellular system is rendered sensitive to an environmental perturbation in combination with the deletion of a gene. An external reference point for such system interactions is the set of "synthetic lethality" pairs of gene deletions, where joint deletion of two genes is lethal to the cell, although the single deletion of each gene individually is not [TEP+01].

This leads to the following hypothesis: genes implicated in stress sensitivity should be more likely to appear in synthetic lethality interactions than genes from the complete genome. We used a hypergeometric test (Equation 5.1), with a background number of 6140 yeast genes, and 2893 genes occurring in a synthetic lethality interaction. Then for each concept in the lattice, we counted the number of genes in a synthetic lethality interaction and the number of genes in the concept extent, and assessed the probability of seeing that proportion by chance. We found 22 concepts with p-value below 1.0 × 10⁻¹⁴, shown in Table 5.1 ranked in increasing order of p-value. For instance, 220 of 249 genes that show the sensitivity "Menadione Fields" are also in a synthetic lethality interaction, giving a p-value of 2.4 × 10⁻⁴⁴, which is likely to be highly significant.

[Figure: network diagram of the mitochondrial ribosomal small subunit complex, partitioned into core and attachment components and including MRPS16, UBP10, MRPS17, MRPS8, RSM24, RSM22, MET6, RSM25 and RSM19; the transcription factors GCN4, BAS1, CAD1, MET32 and UME6 are linked below.]

Figure 5.14: Example of a network structure learned for genes in the extent of a formal concept. In this case the structure relates gene and protein interactions in the mitochondrial ribosomal protein complex in response to H2O2 stress.

Since the synthetic lethality test is on a pair of deletants, this is an indirect indication of common functionality, but it is not clear what this might be. This leads to the interesting question: if two genes (individual deletants) are sensitive to a stress, are the respective genes more likely to appear in a synthetic lethality interaction? To test this we compared the number of synthetic lethality interactions appearing between pairs of genes in concept extents with those in randomly selected sets of the same size. Applying a log-odds ratio, we found a positive score, indicating that 72 (resp. 119) concepts out of over 1300 were more likely to contain synthetic lethal pairs than would be expected by chance at the 99% (resp. 95%) confidence levels.

This indicates that such a concept is likely to denote some interesting biological function worthy of further study. It demonstrates how the application of a statistical evaluation criterion external to the concept lattice construction can lead to a significant reduction of the set of concepts. This makes apparent how post-processing of the lattice can have a dramatic effect on the possible usefulness of the concept set, since a lattice of around 100 concepts is much more likely to be visually navigated, with further analysis applied, than one with an order of magnitude more concepts.
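The comparison against random gene sets described above can be sketched as an empirical permutation test (illustrative only; the thesis reports a log-odds ratio score rather than this exact form):

# Sketch: compare synthetic-lethal pair counts within a concept extent against
# random gene sets of the same size.

import random
from itertools import combinations

def sl_pairs(genes, sl_set):
    """Count pairs from `genes` that appear as synthetic-lethal pairs;
    sl_set holds frozenset({gene_a, gene_b}) pairs."""
    return sum(frozenset(p) in sl_set for p in combinations(genes, 2))

def excess_over_random(extent, all_genes, sl_set, trials=1000):
    """all_genes: list of genome gene ids. Returns the observed count and the
    fraction of random same-size sets with at least as many pairs."""
    observed = sl_pairs(extent, sl_set)
    null = [sl_pairs(random.sample(all_genes, len(extent)), sl_set)
            for _ in range(trials)]
    return observed, sum(n >= observed for n in null) / trials

sl = {frozenset(p) for p in [('g1', 'g2'), ('g2', 'g3')]}   # toy pairs
print(excess_over_random({'g1', 'g2', 'g3'}, [f'g{i}' for i in range(1, 9)], sl))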

5.5.2 Relational learning of multiple-stress rules

In this experiment we adopted a machine learning approach to model dynamic cell behavior in response to multiple-stress concepts from the lattice. Various data sets from genomics, protein-protein interactions, transcription factor binding, pathways and the Gene Ontology were integrated to learn first order multiple-stress rules. The concept extents or genes sensitive to a common set of stresses comprise the positive examples while the other data is used as background knowledge.

A. Genomics data This data has been used in Chapter 4; details are in Section 4.3.1.

B. Protein-protein interaction data Protein-protein interaction data provides a picture of cellular pathways and cascaded responses in the cell at the molecular level. We used a simplified “flattened” version of the Gavin et al. [A-C06] protein complex data. Protein-protein interactions are represented as true for any pair of genes whose proteins appear in the same core, attachment or module of one of the stable complexes.

C. Transcription factor binding data For this work we used a version of the transcription factor binding (ChIP-chip) data from the study by Harbison et al. [HGL+04], as in Section 4.3.1.

D. Biochemical pathway data Pathways are an invaluable resource for understanding dynamic cell behavior, containing biochemical evidence for a set of interacting genes, and can thereby unveil interesting relationships among them. Yeast biochemical pathway data was downloaded from SGD (www.yeastgenome.org).

E. Ontology data Finally, Gene Ontology [Ash00] data was integrated to relate all three categories (molecular function, cellular component and biological process) to complete the background knowledge required for multiple-stress rule learning.

5.5.2.1 Method

The Inductive Logic Programming system Aleph¹ is used for the relational learning task in this chapter. The above data types are represented as Prolog facts. Since concepts from the lattice contain only positive examples, i.e., genes sensitive to a given set of stresses, we have used the positive-only learning setting introduced in [Mug96]. The following Aleph settings are used for our experiment, where the learning mode is "positive examples only": the maximum number of literals acceptable in a clause is set to 6; the number of randomly generated examples is set to be equal to the size of the set of positive examples; the minimum number of positive examples to be covered by an acceptable clause is set to 4; the search strategy is set to be a heuristic best-first (branch-and-bound) search; and the maximum number of clauses to be generated when searching for an acceptable clause is set to 50000.

¹www.comlab.ox.ac.uk/activities/machinelearning/Aleph/aleph.html

The following clause is learned for a concept whose intent/genes are sensitive to oxidative stress:

concept(A) :- ppi(B,A,C), tfbinds(D,C), ppi(B,C,E), tfbinds(F,E), ppi(B,A,E).

This denotes that gene A, which is sensitive to oxidative stress, interacts with two other genes/proteins C and E that are in the same protein complex. Proteins/genes C and E also have interactions among themselves, and each of them is bound by transcription factors D and F. We extracted the set of genes covered by this clause and mapped them onto the protein complex architecture from Gavin et al. [A-C06] to assemble the network diagram in Figure 5.14. In stable protein complexes those with the greatest degree of functional similarity and physical association are grouped as the "core" or "modules" of a complex, and proteins with greater heterogeneity are grouped as attachments. Thus an attachment specifies a particular function for a protein complex. Proteins highlighted in Figure 5.14 as red circles are sensitive to the oxidative stress (known from the concept definition) and proteins in green circles are likely to be under regulatory control of the transcription factors linked to them by the vertical dotted lines.

Interestingly, in one instantiation of the clause above, the gene RSM19, which is sensitive to oxidative stress in our data set, interacts with RSM22 and MRPS17, where RSM22 and MRPS17 also have interactions among themselves within the same complex ("mitochondrial ribosomal small subunit"). In addition, RSM22 is bound by the transcription factor CAD1 and MRPS17 is bound by GCN4 and BAS1. Although RSM22 and MRPS17 were not found to be sensitive to the oxidative stress in the concept intent, they were found to be sensitive to another oxidative stress, not in the concept from which the rule was learned. This supports the possible inference of this sensitivity from the relational structure of the clause as applied to the protein complex in Figure 5.14.

5.6 Conclusions

In this chapter we studied the application of Formal Concept Analysis as a framework for visual analytics. We presented a new algorithm for concept lattice construction based on ideas from frequent closed itemset mining. This improves on some aspects of previous algorithms and has been implemented with a browser interface as an environment integrating multiple data sources for visual analytics. In the BioLattice system we have developed several ways to control concept lattice construction to enable better visualization, such as using a support threshold (preventing concepts being added) and post-processing concepts using external data sources, particularly relational data, to refine the concept space. Owing to the highly relational nature of the application domain, systems biology, the integration of machine learning into the visual analytics framework used Inductive Logic Programming. It was shown that domain relevant concepts identified from the lattice could be used in learning potentially useful rules defining the concept in a first-order relational representation.

Simplifying a concept lattice by selecting a subset of formal concepts is not as straightforward an operation as, say, cutting the lattice at a given level of support (extent cardinality). Owing to the inter-dependence of concepts in the lattice, removal (deletion) of one concept will often lead to the need to delete, revise or even add other concepts. To see this, suppose we have three concepts, each with an intent containing two out of a possible three attributes, call them A, B and C. The concept to be deleted has intent {A, B}, while the other two concepts are not to be deleted. If the concept with intent {A, B} is deleted, however, any other concept whose intent contains either A or B will also be affected. In previous work [Bai03a] we developed approaches to incrementally revise the lattice without reconstructing the entire Hasse diagram, and this can be viewed as selection of concepts for ontology construction.

Although the tabular browser interface shown in Figure 5.5 includes useful functionality, this may be improved by adopting a richer graphical interface; the work of Eklund et al. [EGW+09] provides an example of high quality graphics in a lattice browser environment, and [WE11] shows how the use of tagging and similarity metrics can improve navigation options in lattice browsing.

6 Augmenting formal concepts with first-order learning and visual analytics

“It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.”

Abraham Maslow

This chapter is based on the idea of “visual closure”, which is analogous to the closure of formal concepts based on the Galois relation between intents and extents. The visual closure takes the intent, produces its dual definition in the extent, then instead of finalising the closure by translating back to the corresponding intent the translation is into a graphical representation that can be visualised by domain specialists.

6.1 Introduction

This chapter is an extension of the work on augmenting formal concepts with relational background knowledge in a biological domain. In the previous chapter we used the Inductive Logic Programming system Aleph to discover significant logical relationships as a set of first order rules representing cellular network responses. Such learning involves inducing rules by generalizing facts over large background data. While rules were learned by search using a Prolog-like automated-inference engine, this had two significant drawbacks. First, to reduce the complexity of learning, a flattened version of the protein complex data was used as background knowledge; thus the richer structure, which is crucial for understanding the functional organization of expressed proteins in a particular complex, had been lost. Second, as part of the inference process, all the intermediate results (trillions of rows) became redundant and were discarded. This limited the ability of biologists, in the application area of this thesis, to further investigate the rules in detail. In summary, this research aims to investigate methods for biologists to answer the question: "What do these rules mean?".

The approach taken in this part of the research is to enrich learned rules further by focusing on a particular aspect of the biological domain, namely the architecture of stable protein complexes. We devise user-interface guided tools based on a data driven search algorithm to “ground” rules, and develop visual analytic tools that are applied to comprehend the dynamic cell behavior.

Owing to the complexity of this task we adopt a multi-stage approach. This incorporates Formal Concept Analysis followed by first-order rule learning, then deductive query processing and user-driven visualization. This has the effect of multi-layered filtering, starting from a genome-wide data set and ending by focusing on a user-selected protein complex. An overview of the process is shown in the diagram of Figure 6.1.

Figure 6.1: Multi-layered information processing filter for augmenting concept lattices with first-order learning and visual analytics.

For this work we have adopted an integrated deductive database approach that supports analysis using deductive rules to infer additional biological information from the facts, or heterogeneous data, stored in a relational database. Formally, this kind of rule-based query language is known as Datalog and the technique can be viewed as a generalization of the logical model of databases [UGMW01]. Alternatively, Datalog can be viewed as a programming language with Prolog-like syntax but simpler semantics. Datalog is used in this chapter as a bridge from rules learned by ILP, as shown in the previous chapter, to systems biology databases, and in the following chapter to learn rules defining first-order formal concepts.

This chapter is organized as follows. To start, the necessary background on Datalog, and the problem of rule closure relative to a set of examples and background knowledge, are discussed. The next section describes query algorithms to implement the approach efficiently, with an example from the biological application. Visualization methods for biologists are required, and these are described in the following section using a domain-specific method referred to as visual closure. This is illustrated by two selected case studies, followed by conclusions.

6.2 Background: Semantics of Datalog queries

First we describe the relevant concepts required for a deductive database system using the representation language Datalog [CGT89, Gen10]. This is a domain-specific logic programming language designed for integration with relational databases. The syntax of Datalog is a subset of Prolog. The two main types of information in a deductive database system are rules and facts. The corresponding parts of a Datalog program are the intensional and extensional parts: these are also referred to as the intensional database (IDB) and the extensional database (EDB). In this thesis our representation is based on [CGT89].

6.2.1 Rules

A rule is a function-free definite (Horn) clause with two parts: a single literal called the head of the rule and a conjunction of literals called the body of the rule. This can be written as:

A ← B1,B2,...,Bm

where each literal Bi, 1 ≤ i ≤ m, has the form p(t1, t2, ..., tk), where p is the name of the predicate (relation name) applied to the terms tj, 1 ≤ j ≤ k, and each term can be either a variable or a constant. The left hand side of the symbol ← is the head of the rule and the right side is the body.

A rule can be seen as a statement that enables the use of implication to deduce facts from other facts in a way that the head (conclusion, or consequent) is true whenever the body (conditions, or antecedents) is true. In database terminology rules define how a relation, such as A, can be computed from other relations, such as the Bi.
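This computational reading of a rule can be illustrated with a tiny conjunctive-query evaluation over hypothetical relations (a sketch, not the deductive database described later):

# Sketch: the rule A(x, z) <- B1(x, y), B2(y, z) computes its head relation
# from the body relations by a join on the shared variable y.

b1 = {('a', 1), ('b', 2)}             # B1(x, y)
b2 = {(1, 'p'), (2, 'q'), (2, 'r')}   # B2(y, z)

head = {(x, z) for (x, y) in b1 for (y2, z) in b2 if y == y2}
print(sorted(head))   # [('a', 'p'), ('b', 'q'), ('b', 'r')]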

6.2.2 Facts

Facts are assertions of relevant world knowledge, formally represented by instances of a predicate with constant arguments. These are ground atoms, of the form p(t1, t2, ..., tk), where p is a predicate symbol and each term ti, 1 ≤ i ≤ k, is a constant; atoms containing variables are not facts.

6.2.3 Negation

In this thesis the Datalog representation used has all literals as positive literals, i.e., there is no negation. This is not a restriction for the biological domains we have investigated, but for future work it may be necessary to include negation, for example, to represent certain SQL queries.

6.2.4 Queries

For a rule learned by an ILP system to efficiently generate the data for visualization containing all the relevant information, the full extension of the rule must be generated. Since the information is typically stored in a relational database, each rule must be converted to a SQL query and executed on the actual database. In order to run efficiently as a backend to a web server, care must be taken in such conversions to avoid unacceptably long response times for users. This query planning and optimization is crucial and the approach developed in this thesis is described in Section 6.3. Chapter 6. Augmenting formal concepts with learning and visualization 141

In logic programming a query can be viewed as inference. For a query q, the system searches for the facts that make q true. The problem is that in the worst case exponentially many tuples may be returned by a query, e.g., protein-protein interactions with k variables, each with domain size up to N, may return N^k tuples. Therefore, query optimization (reordering of literals) is crucial. Also, a naive representation of the ontology would involve recursive definitions. Hence we use a flattened version of the ontology (see previous chapters, and also [GS12]).

6.3 Algorithm Design

This section presents an overview of the rule explorer query algorithm, which adds a layer of knowledge of the yeast genome by searching the extensional database for a given rule and its positive cover.

The main idea of the algorithm is to reduce the number of candidate keys used to join relations in the query. For example, literals representing interactions (such as protein-protein or protein-DNA), which can lead to a blow-up in the set of tuples retrieved, are delayed by sorting them to the end of the query, whereas literals with constant values (such as the transcriptomics or pathway data) are prioritised for execution on the database. Since the queries are conjunctive, this has the effect that the domain size of any variable can only be reduced or stay the same as the remaining literals in the query are executed.

We have adopted both SQL and Datalog notation to explain the example that follows the algorithm description.

Algorithm 6 Rule explorer(Concept extent E, extensional database D, set of rules R)

1: for all rules ri in R do
2:     Initialize ordered literal search space S = ∅, variable set V = ∅
3:     Order literal search space(ri, S, V)    // Algorithm 7
4:     Process query(E, D, S, V)               // Algorithm 8
5:     return query result Q
6:     Do visualization                        // Section 6.5
7: end for

Note that at line 15 in Algorithm 7 the effect of adding constant terms to the variable set V is that these can be used as restriction terms (i.e., in the WHERE clause) in the corresponding SQL query.

Algorithm 7 Order literal search space(ri,S,V)

1: Split ri into head H and body B
2: for all literals li in B do
3:     si<name> = predicate name in li
4:     si<term> = terms in li
5:     add si to literal search space S
6:     if li has a constant term then
7:         shift si to the beginning of the literal search space S
8:     end if
9:     for all variables tj in si<term> do
10:        if tj is NOT in V then
11:            // create new index variable
12:            initialize V<tj><data> = ∅, V<tj><feature> = ∅
13:        end if
14:        if si<term> has constant c then
15:            update V<tj><feature> by adding c
16:        end if
17:    end for
18: end for
19: add H as the first item at the beginning of the literal search space S
20: return updated V, S

Note that we have assumed that the clause head H contains variable(s) representing the target concept; therefore the domain or extension of this concept will be bounded by definition and will not affect the tuples found by executing the query.

Algorithm 8 Process query(E, D, S,V)

1: for all literals li in S do
2:     for all variables vi in li do
3:         retrieve S<vi><data>
4:         retrieve S<vi><feature>
5:         if S<vi><data> is not empty then
6:             form SQL query using S<vi><data> in the where clause as a constraint
7:         end if
8:         if S<vi><feature> is not empty then
9:             form SQL query using S<vi><feature> in the where clause as a constraint
10:        end if
11:        with the result of the query, update S<vi><data>
12:    end for
13: end for
14: compute candidate keys for join
15: apply join operator to form total coverage of the rule
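Before turning to the worked example, a condensed Python sketch of the constants-first ordering (Algorithm 7) may be helpful; the literal encoding and helper names are illustrative. Following the Prolog convention, a term is treated as a constant unless it starts with an upper-case letter:

# Sketch of the constants-first ordering (Algorithm 7): literals containing a
# constant are shifted to the front as they are encountered, so they act as
# selections; all-variable interaction literals sink to the end.

def is_variable(term):
    return term[0].isupper()          # Prolog convention for variables

def order_literals(body):
    """body: list of (predicate, terms) pairs from the rule body."""
    ordered = []
    for lit in body:
        if all(is_variable(t) for t in lit[1]):
            ordered.append(lit)
        else:
            ordered.insert(0, lit)    # shift to the beginning (line 7)
    return ordered

body = [('gc_ppi', ('B', 'A', 'C')), ('tfbinds_lone', ('D', 'C')),
        ('acid', ('D', 'down')), ('nacl', ('C', 'up')),
        ('bp', ('C', "'GO:0003674'"))]
print([p for p, _ in order_literals(body)])
# ['bp', 'nacl', 'acid', 'gc_ppi', 'tfbinds_lone'] -- matches Figure 6.2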

6.4 Worked Example

The following steps in processing a rule show how the algorithm solves the problem.

concept(A) :- gc_ppi(B,A,C), tfbinds_lone(D,C), acid(D,down), nacl(C,up), bp(C,'GO:0003674').

In the above rule, the head consists of the variable A, which appears at least once in the body of the rule. There are five different predicates in the body of the rule: gc_ppi/3, tfbinds_lone/2, acid/2, nacl/2 and bp/2, applied to seven different terms: four variables A, B, C, D and three constants up, down and 'GO:0003674'. All the data or facts are stored extensionally, that is, in the database. Intensional predicates in the rule body can be seen as a view in a typical database system.

This rule can be interpreted as: there is a gene A, which is sensitive to sorbate, that interacts with gene C in some protein complex B. Furthermore, the rule indicates that gene C has two other properties: that it is up-regulated by salt (NaCl) and involved in the Gene Ontology biological process GO:0003674; and is bound by transcription factor D, which in turn is down regulated under exposure to acid. Clearly, this captures the complexity of the underlying fragment of biology.

Figure 6.2: Example of the predicate ordering in the search space.

This rule corresponds in traditional SQL to a selection from each table that reduces the number of vertical joins. For our example above, literals with constants, such as bp, nacl and acid, are placed higher in the ordering than literals without any constants, such as gc_ppi and tfbinds_lone, by Algorithm 7, as shown in Figure 6.2.

In Figure 6.3, each variable A, B, C and D is initialized to contain a data-feature pair, defining a top-down generalization of the data as the query proceeds, followed by the selection operator. The data field corresponding to each variable stores newly generated facts that satisfy the variable and is updated each time the selection operator is applied to that particular variable. The feature field, on the other hand, stores the conditions for the selection operator, which are the constants in the literal.

The search begins with the head literal L0, querying ?- concept(A) to retrieve the facts in the positive cover of this concept. Note that this is the complete positive cover of the rule, in order to ensure that we obtain all relevant tuples in the concept for the visualization of this rule. In Figure 6.3, 204 genes are in the positive cover, and these are stored as data under the variable A. The first body literal in the ordered literal search space is bp(C,'GO:0003674'). Therefore, the query retrieves all the values of variable C that satisfy the constraint, or feature, for the biological process ontology category 'GO:0003674'. This can be seen as a selection process in a relational query that reduces the number of tuples to be joined for the predicate bp based on the key variable C. In SQL, the query can be written as:

SELECT DISTINCT id, go_cat FROM bp WHERE go_cat = 'GO:0003674';

An alternate Datalog query is:

?- bp(Id, Go_cat), Go_cat = 'GO:0003674'.

This results in 3015 new facts, or genes, that are stored as data under the variable C. The next body literal, L2, is nacl(C,up). The constant 'up' is added as a selection condition alongside the existing ontology category condition 'GO:0003674' for the variable C. This achieves a significant reduction in gene number, in our case from 3015 down to 212 genes. The SQL query is:

SELECT DISTINCT id, response AS nacl_response FROM nacl
WHERE response = 'up' AND (id = gc1 OR id = gc2 OR ... OR id = gc3015);

In Datalog, the query can be written as:

?- nacl(Id, Response), Response = 'up', member(Id, [gc1, gc2, ..., gc3015]).

Figure 6.3: Step-by-step flow-chart for processing of a rule to give a query.

The next literal, acid(D,down), queries the facts that satisfy the condition acid response 'down' and gathers 479 genes under the variable D. In SQL, the query is:

SELECT DISTINCT id, response AS acid_response FROM acid WHERE response = 'down';

In Datalog, the query can be written as:

?- acid(Id, Response), Response = 'down'.

The fourth literal, L4, in the search space is gc_ppi(B,A,C), which has no constants, only variables. This literal states that in protein complex B there is a set of proteins A that interacts with another set of proteins C. Our algorithm inspects the number of proteins listed under each variable: 204 proteins under A and 212 proteins under C. The resulting query retrieves new facts, a set of 34 protein complex ids that contain such interactions, and reduces the number of interacting genes in A from 204 to 41, and in C from 212 to 21.

The equivalent query in SQL is:

SELECT DISTINCT gid AS B, proteinA_id AS A, proteinB_id AS C
FROM gc_ppi WHERE (proteinA_id = ga1 OR ... OR proteinA_id = ga204)
AND (proteinB_id = gc01 OR ... OR proteinB_id = gc0212);

In Datalog:

?- gc_ppi(Complex, Id_A, Id_C), member(Id_A, [ga1, ga2, ga3, ..., ga204]),
   member(Id_C, [gc01, gc02, ..., gc0212]).

The last literal, L5, is tfbinds_lone(D,C), which depicts the transcription factor D binding to C under a single stress condition. To process this part of the query, the algorithm inspects the proteins listed under both D and C, and the resulting search space further reduces the number of proteins from 21 to 3 for the variable C and from 479 to 2 for the variable D.

The equivalent query in SQL is: Chapter 6. Augmenting formal concepts with learning and visualization 147

Figure 6.4: Completing the closure of the rule. The final step is the join of the two relations gc_ppi(B,A,C) and tfbinds_lone(D,C). For each of these relations, the leftmost and centre columns show the total number of tuples (respectively, distinct values of each variable). The rightmost column shows the corresponding values once the query evaluation has terminated.

SELECT DISTINCT factor_id AS D, promoter_id AS C
FROM tfbinds_lone WHERE (promoter_id = gc001 OR ... OR promoter_id = gc0021)
AND (factor_id = gd1 OR ... OR factor_id = gd479);

In Datalog:

?- tfbinds_lone(Id_D, Id_C), member(Id_D, [gd1, gd2, gd3, ..., gd479]),
   member(Id_C, [gc001, gc002, ..., gc0021]).

Finally, the algorithm counts the number of occurrences of each variable in the join conditions to identify the candidate join key. In this case there are two candidate literals, gc_ppi(B,A,C) and tfbinds_lone(D,C), and the key is C. After the last processing step we have only 3 key values for C, which significantly reduces the size of the final table. The query size of gc_ppi(B,A,C) is 96 and of tfbinds_lone(D,C) is 3, which gives a total number of rule tuples of 26, as in Figure 6.4.
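The final join can be sketched in a few lines (placeholder tuples, not real identifiers): the shared variable C acts as the join key.

# Sketch of the final join on the shared key C; tuple values are placeholders.

gc_ppi = [('cplx1', 'ga7', 'gc003'), ('cplx2', 'ga9', 'gc007')]   # (B, A, C)
tfbinds_lone = [('gd2', 'gc003'), ('gd5', 'gc007')]               # (D, C)

closure = [(b, a, c, d)
           for (b, a, c) in gc_ppi
           for (d, c2) in tfbinds_lone
           if c == c2]
print(closure)   # each row is one tuple of the rule's extension (cf. Figure 6.4)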

6.5 Visual closure of the rule

In recent years there has been a flourishing of studies on the visualization of biological networks as a means to analyze the complexity of high-throughput data. For example, visualization can be used to illustrate interconnected protein networks, with the aim of giving biologists the ability to extract information, investigate details, validate hypotheses and act on their decisions, e.g., by further editing to revise the visualization and the hypotheses it represents.

Extensive background knowledge, or biological context, and the scalability of the interconnected genetic network are well-known challenges for modelling the dynamic behavior of the cell and its effective visualization. Several software packages can visualize a large network at once: a detailed list can be found in [GOB+10]. Most of these, such as Cytoscape [SMO+03] and VisANT [HSD08], use force directed algorithms to map interactions among proteins resulting from high-throughput data, where nodes represent proteins and edges the interactions among them. Such visualizations can be quite helpful for simple and small networks, while for complex networks they can identify features such as highly connected hub proteins and visual outlines of the protein complexes.

However, such visualization is limited by a number of issues. First, scalability: as the number of network proteins and their inter-connections increases, the complexity of the visualized network leads to reduced comprehensibility. Second, by visualizing the complete set of interactions, other information on the structure of the network is often lost. For example, simply visualizing all protein interactions at once from the data of Gavin et al. [A-C06] loses the essential known modular information about the complexes. There are also limitations on how to integrate into visualizations such knowledge as, for example, biological process categories from an ontology, or information about the physical structure of interactions in proteins.

As an alternative, several clustering visualization tools, such as MCODE [BH03] and GEOMI [ADF+06], have been proposed. In general, these tools use different graph-theoretic cluster prediction methods to identify dense subgraphs, representing such biological features as protein complexes. This information is later applied for layout and network visualization. Further methods then need to be applied to integrate additional information on the biological systems being represented, such as gene expression or metabolic pathways; for details see [GOB+10].

However, in our work we take a different approach, which may be termed "bottom-up" as opposed to the above "top-down" approaches. A major advantage of our approach is that rules that may denote significant biological sub-systems, as outlined in this thesis, can form the basis of an integrated visualization. Additionally, we can further extend the rule representation by enriching the background information, injecting knowledge about structures that could not easily have been included during the learning phase due to complexity overheads.

In this section, we illustrate this technique by adding extra structure to the data before visualization, i.e., knowledge of the module, core and attachment components of the protein complexes of Gavin et al. [A-C06], plus more refined transcription-factor binding condition information from the data set of Harbison et al. [HGL+04]. In this way we offer an integrated approach to model systems biology in incremental steps, finally incorporating visual analytics as a tool for further investigation with additional knowledge, in a process termed the "visual closure" of a rule. Figure 6.6 shows the rows of data in the extension of the rule from case study 1 in Section 6.6.1; the extension of this rule with respect to the background knowledge contains several complexes, one of which is selected by the user to be the focus of the visualization.

6.5.1 Approach

It is clear from our work that the use of protein interaction data is critical to the success of the learning. However, the complexity of the interaction graph makes understanding of the learned models difficult. Therefore a common method in biological network visualization is to focus on smaller subnetworks involving only a subset of the proteins [GOB+10]. Typically, such visualizations are done manually by expert biologists, as in the manual approach of Figure 4.3 earlier in this thesis.

In order to automate such visualization we adopt the following method, based on the suggestion in [GOB+10], as a first step identifying the interactions based on protein complexes.

The key idea is the following: a biological structure of significance is selected as the "focus" of the visualization, in order to reduce the complexity. Initial investigation showed the importance of the stable complex data of Gavin et al. [A-C06], and it was decided to adopt this as the focus for the visualization. We refer to this choice as "complex-centred visualization". Note that in general this selection will be decided jointly by bioinformatics engineers and biologists.

The "visual closure" then follows as a series of design choices from this initial selection. In what sense is this a "closure"? By itself, the information on a given protein complex, even including its structure formed as core, module or attachment, is insufficient to explain its connections and hence potential functions in the cellular system. Therefore we add all available information, as determined by the coverage of the learned rules with respect to the background knowledge, to the visualization.

To avoid over-complicating the visualization, these two information sources were shown as separate, but related, in a side-by-side diagram. This gives a two-view structure – in one, the focus is on the relevant protein complex (left-hand side), and in the other on the related sub-network (right-hand side). See, e.g., Figure 6.7.

Why is a single complex chosen as the starting point for the visualization? This was not our first choice, given that (i) a single clause may have in its extensional definition several complexes, and (ii) even a single extensional definition (table row) may include literals denoting more than one complex. However, visualization of more than one complex at a time, together with its connections to the rest of the system, was found to be too complicated.

To see this, a worst-case complexity analysis shows that, for N proteins, since within a complex each protein is connected to all the others, each complex forms a clique, and any protein may occur in more than one clique, i.e., the cliques are interconnected; hence the number of possible edges in a graph visualizing the complexes is O(N^4) (all proteins within a clique is O(N^2), and each clique to all other cliques is also O(N^2)).

6.5.2 Background data

The following additional data was used to augment the table data for visualization.

A. Protein complex data

Data from Gavin et al. [A-C06] was used for protein complex visualization. There were 491 complexes localized in 12 distinct cellular sites. Among them 377 complexes had attachments with 148 different modules, in total 115,164 interactions. There were 1,487 distinct proteins either in core, module or attachments.

B. Transcription factor data

This is an expanded version of the Harbison et al. [HGL+04] data used in the learning (which was simplified by aggregation for the ILP experiments). There were 13 different stress conditions (with 36 different combinations of multiple stresses) under which proteins were found to bind to regulatory regions of the DNA with high confidence; all of the resulting 9075 interactions were added.

The remainder of the background data was as used in Chapter 5 and as detailed in Appendix A.

6.5.3 Visualization layouts

This work combined two different layout paradigms. The protein complex architecture from Gavin et al. [A-C06] was visualized using a radial space-filling approach, and the connected sub-network induced by the complex was visualized using a force directed layout. It is well known that proteins rarely act alone; on the contrary, a set of interacting proteins grouped as a complex often shares functional homogeneity. In Gavin et al.'s study, proteins in such complexes are further partitioned into "core" or "modules" based on their greatest degree of functional similarity and physical association, whereas proteins with greater heterogeneity are grouped as "attachments", indicating a particular function for a protein complex.

There are numerous graph visualization software packages; details can be found in [HC12]. The inevitable trade-offs associated with selecting a visualization software package stem from the three broad categories available: information visualization, graph analysis, and statistical packages. After careful inspection, Infovis (http://philogb.github.com/jit/index.html), which offers rich interactive visualization plus analysis abilities, was selected to leverage our previous visual analytics tool for systems biology visualization.

Infovis is implemented in Javascript, which can be easily integrated with our previous PHP-based software, uses a static JSON graph structure adequate to represent rich graph structures and other data, and provides various dynamic functionalities on this graph for interactive visualization. Taken together, these features provided the best open-source solution for our visualization requirements.

Our design for this work uses two graph views from Infovis: “Sunburst”, a library of algorithms for radial space-filling visualizations, used for protein complex architecture; and a standard force-directed library used for all other interactions and properties.

6.5.4 Protein complex architecture visualization

Selection of a particular complex from the interactive tabular display browser window in Figure 6.6 opens a visualization window as shown in Figure 6.7.

The Sunburst visualization uses a radial layout of three ring segments. Input data for this graph is organized hierarchically: the top of the hierarchy, i.e., the complex name, is at the center; the next level contains the categories of core, module and attachment; and the last level (furthest from the center) has the set of proteins or genes associated with each category. The circular angle swept out by each group corresponds to the size of the set of proteins or genes in it.
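The hierarchical input can be sketched as below (built as a Python dictionary here; the id/name/children fields follow the JIT's usual JSON convention, though the exact schema used in the thesis is not shown, and protein names are placeholders):

# Sketch: three-level hierarchy for the Sunburst view (complex -> core/module/
# attachment -> proteins), serialized as JSON.

import json

def sunburst(complex_name, groups):
    return {'id': complex_name, 'name': complex_name,
            'children': [{'id': cat, 'name': cat,
                          'children': [{'id': p, 'name': p, 'children': []}
                                       for p in sorted(genes)]}
                         for cat, genes in groups.items()]}

doc = sunburst('complex_1', {'core': ['P2', 'P1'],
                             'module': ['P3'],
                             'attachment': ['P5', 'P4']})
print(json.dumps(doc, indent=1))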

Additionally:

1. for each complex, its name and localizations, in single or multiple sites, are dis- played at the top of the window;

2. each group of proteins or genes is sorted alphabetically inside the core, module or attachment categories;

3. to maximize the view at different circular levels, each protein or gene is highlighted and zoomed by the “on mouse-over” function;

4. in addition, other properties such as module name, response to particular stress, ontological information covered by the rule are also displayed by the “on-click” function;

5. further, on clicking on any component, the complex will rotate to focus that component in the centre of the window and orient the text to the horizontal for the user to see the details.

6.5.5 Force directed interaction visualization

To present all the interactions and properties in a single complex covered by a rule we have used a standard force-directed graph library. The advantage of the force directed method is that it uses straight edge drawing with spring-like flexible properties to stabilize the node connections, which tends to produce layouts with few edge crossings.

Additionally:

1. to maximize the view, each protein or gene is zoomed by the “on mouse-over” function;

2. on clicking on any protein within the force-directed layout graph, the protein is highlighted, its immediate connections (graph edges) to other proteins are also highlighted, and a list of the connected proteins is shown to the upper-left of the window in text format to enable the user to copy and paste them into other tools for further analysis.

6.6 Two biological application case studies

Since this work aims to model and visualize the dynamic behaviour of cells under combinations of particular stress conditions, we now describe two selected applications of the rule extension and its visual closure. These were chosen on the basis of illustrating different aspects of the integration of multiple data sources.

The sequential steps were:

1. we gathered gene responses to 26 different stresses for 1094 genes and applied FCA over them (see above); concepts in FCA have a set of stresses as intent and a set of genes as extent (a minimal sketch of this step follows the list);

2. we extracted concepts from FCA and applied ILP to uncover significant logical relationships in cellular network response data;

3. we developed an integrated deductive database front-end to query rules with respect to background knowledge, which was then enriched with the protein complex architecture, transcription factor binding conditions, etc.;

4. we applied visual analytics to enable the user to investigate further details of the complex architecture and its connections to other integrated data, and to support their decisions about further analysis.
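To make step 1 concrete, here is a minimal sketch (with invented gene and stress names) of the two FCA derivation operators over a stress-by-gene context; a formal concept is a pair (extent, intent) that is closed under their composition.

    # Toy stress-response context: which genes respond to which stresses.
    # Gene and stress names here are invented for illustration.
    context = {
        ("g1", "h2o2"), ("g1", "heat"),
        ("g2", "h2o2"), ("g2", "heat"),
        ("g3", "heat"),
    }
    genes = {g for g, _ in context}
    stresses = {s for _, s in context}

    def intent_of(gene_set):
        """Stresses shared by every gene in the set (derivation on extents)."""
        return {s for s in stresses if all((g, s) in context for g in gene_set)}

    def extent_of(stress_set):
        """Genes responding to every stress in the set (derivation on intents)."""
        return {g for g in genes if all((g, s) in context for s in stress_set)}

    # Closing {g1} yields the formal concept ({g1, g2}, {h2o2, heat}).
    intent = intent_of({"g1"})
    extent = extent_of(intent)
    print(sorted(extent), sorted(intent))   # ['g1', 'g2'] ['h2o2', 'heat']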

Here we give the two case studies as an application of this framework.

6.6.1 Case study 1: adaptation to a single oxidative stress

The mechanism of cell response and adaptation to a particular stress is extremely complex, as the process may involve all aspects of cell biology. The model can be described in three phases: in the preliminary phase, immediate cellular changes occur at stress onset; in the second phase, various defense mechanisms are triggered; and in the final phase, adapted cells resume normal growth.

In certain eukaryotes such as Saccharomyces cerevisiae, the stress response gains further significance because the cell may enter particular developmental stages, e.g., sporulation. Sporulation is a protective mechanism in yeast which is crucial for its survival when it is exposed to various environmental stresses resulting in a lack of the nitrogen or glucose required for cellular activity.

Figure 6.5: Rule set learned for a formal concept using background knowledge. Shown is the set of rules for the set of gene deletants sensitive to oxidative stress from [TF04]. The user can select a rule for visualization from this interface.

Another factor contributing to the complexity of modelling is that multiple responses often overlap, or some genes can be induced by multiple stresses.

This study gathers the genes or proteins sensitive to oxidative stress found in a single concept using FCA, where the extents come from the Tucker and Fields [TF04] study data set. We selected two rules (on the fifth and sixth rows of the theory shown in Figure 6.5) that cover these genes or proteins to demonstrate and compare the adaptation of the yeast cell under a single stress, i.e., hydrogen peroxide stress.

rule1: concept(A) :- gc_ppi(B,A,C).
rule2: concept(A) :- gc_ppi(B,A,C), heat(C,down), gc_ppi(B,C,D),
                     bp(D,'GO:0030435'), bp(A,'GO:0006810').

Rule 1 is more general than rule 2; it simply notes that the deleted gene A is in a stable complex B with another gene/protein C. Rule 2 captures the addition of an effect on

Figure 6.6: Extension of a learned rule selected from the set in Figure 6.5. The rule is shown at the top of the window, and below are the rows of its extensional representation generated by the rule closure method for the complex selected by the user from the search box at the top right.

gene C, which is down-regulated in response to heat stress and is also in the complex with another gene D; GO categories are specified for D and for the target gene A.

One of the several complexes covered by both rules is the V0 vacuolar ATPase complex, localized at vacuoles in the cytoplasm. Variable A in rule 1 denotes six genes or proteins (VPH2, DRS2, VMA2, VMA6, VMA7, VMA10) in the complex; these are related to the biological process ‘transport’ (GO:0006810) and are sensitive to the single oxidative stress. However, by adding a condition on the occurrence of transcriptional regulation induced by heat (rule 2), four of them (DRS2, VMA2, VMA7, VMA10) interact with two other genes or proteins (VTC4, LYS9) represented by variable C, which in turn are down-regulated and interact with EMI2 (variable D) to participate in sporulation (GO:0030435).

The full denotation of rule 2 coverage is shown in Figure 6.6. These rules lead to the possible hypothesis that the reason that gene A is required to survive exposure to oxidative stress may relate to its functional role in transport, and its membership of complex B, together with the gene C which is involved in the cell’s heat-response system, and D which is involved in sporulation, a technique for adaptation to stress.
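To illustrate how such a rule is turned into its extensional coverage, the following sketch evaluates the body of rule 2 by nested joins over a handful of hypothetical facts. The complex and gene names follow the V0 ATPase example, but the specific tuples below are invented for illustration; the actual implementation issues Datalog queries against the deductive database front-end, although the join logic is analogous.

    # Hypothetical facts: gc_ppi(B,A,C) means A and C interact in complex B.
    gc_ppi = [
        ("v0_atpase", "VMA2", "VTC4"),
        ("v0_atpase", "VTC4", "EMI2"),
        ("v0_atpase", "VPH2", "VMA6"),
    ]
    heat = {("VTC4", "down")}                      # heat(C,down) facts
    bp = {("EMI2", "GO:0030435"),                  # bp/2: GO process labels
          ("VMA2", "GO:0006810")}

    def rule2_coverage(gc_ppi, heat, bp):
        """Enumerate the bindings of A satisfying the body of rule 2."""
        covered = set()
        for b, a, c in gc_ppi:                     # gc_ppi(B,A,C)
            if (c, "down") not in heat:            # heat(C,down)
                continue
            for b2, c2, d in gc_ppi:               # gc_ppi(B,C,D), same B and C
                if (b2, c2) == (b, c) \
                   and (d, "GO:0030435") in bp \
                   and (a, "GO:0006810") in bp:    # bp(D,...), bp(A,...)
                    covered.add(a)
        return sorted(covered)

    print(rule2_coverage(gc_ppi, heat, bp))        # ['VMA2']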

Figure 6.7: V0 vacuolar ATPase complex from rule 1 in Figure 6.6.

Figure 6.8: V0 vacuolar ATPase complex from rule 2 in Figure 6.6.

Note that rule 1 (Figure 6.7) will contain all interactions from the complex that appear in rule 2 (Figure 6.8), which leads to quite a complicated set of dependencies. However, since rule 2 captures more information and is more specialized, it leads to a simpler and more understandable visualization than rule 1. This shows that our approach of integration of relevant information can enable the user to achieve the correct level of generality in their visualizations.

6.6.2 Case study 2: inter-protein complex relations under multiple stresses

Our second case study demonstrates the ability of our method to mine interesting generalised inter-protein complex relations defined by a rule; the biological significance revealed by the visual analytics tool indicates that the result is not due to random chance.

From Gavin et al. [A-C06], core proteins are hypothesised to be involved in central functions, together with a set of highly interacting proteins, whereas proteins in attachments interact with core proteins as mediators to perform a particular function. Therefore, an attachment may contain a number of auxiliary complexes, each with many sub-unit proteins. One example is the RNA polymerase II holoenzyme, which acts as transcription machinery to recruit basic transcription factors: it has 12 sub-units, some of whose proteins appear as subsets of other complexes [BK03].

For this work, the concept extracted from the BioLattice browser had 174 genes which are sensitive to both hydrogen peroxide and menadione stress. According to the rule below, there is a set of genes or proteins A, sensitive to oxidative and menadione stresses, that interact with a gene C regulated by the transcription factor D. The rule also indicates that C is up-regulated under heat exposure and down-regulated under alkali exposure.

concept(A) :- gc_ppi(B,A,C), alkali(C,down), heat(C,up),
              gc_ppi(B,C,A), tfbinds_mult(D,C).

It is known that peroxide and menadione stress share a common set of sensitive genes and both generate reactive oxygen species (ROS) that can be lethal to a cell. However, the mechanism of response to menadione stress is completely different from that of oxidative stress [Jam92]. Therefore, the above rule suggests a more universal or general response to any specific stress or combination of stresses. By examining the rule we found that it covers 12 protein complexes, including the “SAGA complex” and the “Mediator complex SRB subcomplex of RNA polymerase II”.

Figure 6.9: Mediator complex SRB subcomplex of RNA polymerase II.

Figure 6.9 depicts, on the left, the inter-protein complex network for the Mediator complex and, on the right, the transcription regulatory interactions (indicated by green triangular nodes) while the cell is subjected to multiple stresses.

Figure 6.10: SAGA complex.

In the protein complex architecture, the Mediator complex has the SRB unit at the core, and the attachment contains other auxiliary complexes, including SAGA. It is clearly evident in Figure 6.10 that SAGA has genes/proteins which are a subset of those in the attachments of the Mediator complex SRB subcomplex.

Table 6.1: Inter-complex protein relationships. Note that all proteins appearing as attachments in the visualization of Figure 6.9 are listed in their corresponding complexes.

The detailed list of genes/proteins in the core and attachments of the Mediator complex is in Table 6.1. This shows the set of proteins in attachments for the Mediator complex SRB subcomplex and where they appear in other complexes or sub-units.

It is well established that the Mediator complex SRB subcomplex is composed of certain general transcription factors that are required to activate transcription as a general response to stress. Furthermore, several studies [RW97] indicate that the SRB subcomplex performs functions related to the SAGA and Snf/Swi subcomplexes. Therefore, our rules have identified significant inter-complex relationships present in the data.

The final part of this case study involves the nature of the stress response of genes appearing in the Mediator complex SRB subcomplex. From the force-directed graphs in both Figures 6.9 and 6.10 we can see that ADA2 (labelled C in Figures 6.9 and 6.10) plays a central role as a co-activator of the genes in the complex while binding to the transcription factor FHL1 (labelled D). A detailed investigation of the global gene expression from the study [GKT+04] shows that most of these genes, including ADA2, in the Mediator complex SRB subcomplex are transcription factors that regulate genes under non-specific general environmental stresses.

Figure 6.11: Gene response of ADA2 measured by microarray data under the environmental stressors Heat and Alkali, over a time course of 0 to 120 minutes.

The rule also indicates a property of ADA2: it is up-regulated under exposure to heat and down-regulated under exposure to alkali in the extra-cellular environment. The plot of time-course microarray data for this gene from Causton et al. [CRK+01], shown in Figure 6.11, confirms this difference in behaviour.

This may be evidence of distinct adaptation techniques for peroxide and menadione stress; the rule therefore represents both a general pattern of interactions in stress response and a specific pattern of differences in transcription under different stresses. Note that without integration of these data sets such a rule could not have been found.

6.7 Conclusions

In this chapter we have developed:

1. a new method to identify clauses based on protein complexes;

2. an efficient algorithm to find the complete coverage (extent) of the learned clauses by fixing each protein complex in turn;

3. different visualization layouts to integrate heterogeneous systems biology data.

Our visualization was demonstrated for a number of complexes to collaborating yeast biologist Dr. Mark Temple, who confirmed its usefulness in improving comprehensibility of learned rules. As further work, an alternative method of visualization could be based on overlaying the complete structure using different colours on the force-directed graph. Also, it should be investigated how to move from visualization of one complex to other (related) complexes. This could possibly be done by an overlay or other hyperlinking using location or dynamic data from associated processes.

7 Towards First-order Ontology Learning

“If the facts don’t fit the theory, change the facts.”

Albert Einstein

7.1 Introduction

In this chapter we conclude the thesis by describing an initial version of automated ontology construction in a first-order representation. This extends the approach described in Chapter 3. We begin by describing a restricted first-order representation for formal concepts and an algorithm to construct concept lattices. This representation is essentially Datalog, but as shown in Chapter 6 this is sufficient to represent useful relational concepts such as biological interactions. Two illustrative examples are given to show how the algorithm works.

Next we discuss how the concept lattice can enable a formal ontology to be defined based on subsumption relationships between concepts. This idea is based on the approach to ontologies taken in Description Logics. The advantage is that we avoid the need for lattice revision as used in the propositional method applied in Chapter 3. Our method also has the advantage of using background knowledge in concept construction, as in Inductive Logic Programming. We discuss how this approach relates to other work in Description Logics and Inductive Logic Programming.

Since this work is preliminary the treatment is informal, although there is an implementation from which we take examples to illustrate how it works. A more formalised approach is planned as part of future work.

7.2 A restricted first-order representation

We propose a syntactically restricted version of Datalog [CGT89, Gen10]. Such clauses can represent formal concepts with respect to background knowledge, as in Inductive Logic Programming. This representation combines ideas from [FH95, Bai04]. For ontology representations the Datalog clauses can be further restricted to contain literals with only one or two arguments. This enables the representation of concepts and roles as in Description Logics (DLs) [BCM+03], although this is not used in the examples in this chapter. It is nonetheless sufficiently expressive to represent, e.g., the protein interactions from Chapter 6 in bioinformatics applications (see footnote 1), plus others (see footnote 2).

Syntax

A term is either a variable or a constant. An atom has a predicate symbol and zero or more arguments; with 1 or 2 arguments the atom is referred to as a “monadic predicate” (or property) or “dyadic predicate” (or relation), respectively (see footnote 3). A literal is either an atom or its negation. A clause is a set of literals. A Horn clause contains at most one positive (non-negated) literal. A definite clause contains exactly one positive literal, and a query contains no positive literals. A set of clauses is called a program or theory. Each clause in a theory is assumed to have a unique index i ∈ N, the natural numbers, as in [Bai04].

Example

This shows how we could use this representation to define a clause relating deletion of a gene X and the response of the organism to the stressor H2O2 of the gene Y:

h2o2_sensitive(X,Y) ← interaction(X,Y), interaction(Y,Z), deleted(X), transcription_factor(Z), up(h2o2,Y).

1. The 3-place relation gc_ppi/3 can be represented by two 2-place relations in_gc/2. For example, if A and C are in complex B, we can replace gc_ppi(B,A,C) by the conjunction in_gc(B,A), in_gc(B,C).
2. Many applications in social networks are based on 2-place relations, as are Semantic Web approaches based on RDF triples.
3. Although the terms “unary” and “binary” can be used, we adopt “monadic” and “dyadic”, respectively, in relation to the relevant literature. In this work we will not use atoms with zero arguments.
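As an illustration of this syntax, the following is a minimal sketch (not the thesis implementation) of how indexed, function-free Datalog clauses might be represented:

    from dataclasses import dataclass
    from typing import Tuple

    # Terms are plain strings: uppercase first letter = variable,
    # otherwise constant, following Prolog convention.
    @dataclass(frozen=True)
    class Atom:
        pred: str
        args: Tuple[str, ...]        # 1 arg = monadic, 2 args = dyadic

    @dataclass(frozen=True)
    class Clause:
        index: int                   # unique index i in N, as in [Bai04]
        head: Atom
        body: Tuple[Atom, ...]

    # The example clause from the text:
    c = Clause(
        index=1,
        head=Atom("h2o2_sensitive", ("X", "Y")),
        body=(Atom("interaction", ("X", "Y")),
              Atom("interaction", ("Y", "Z")),
              Atom("deleted", ("X",)),
              Atom("transcription_factor", ("Z",)),
              Atom("up", ("h2o2", "Y"))),
    )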

7.3 First-order concept lattice construction

To construct a first-order concept lattice we use a version of Ganter’s standard Next-Closure algorithm [GR91] (see footnote 4), modified to generate the intent of a concept as the least-general generalisation (LGG) [Plo71] of the set of ground clauses in its extent. This approach was first proposed in [Bai04] but was not fully described in that paper.

The use of background knowledge B to generate RLGGs is enabled by (1) using a standard ILP system such as Aleph to generate a most-specific clause ⊥(c_i) for each ground example fact c_i with respect to B, and (2) applying Buntine’s Generalised Subsumption [Bun88] to generate LGG(⊥(c_i), ⊥(c_i′)) for any pair of such clauses.

The idea behind Ganter’s algorithm [GR91] is to apply a linear order < on the elements of the input set, either descriptors or objects (see Section 3.1). In our approach this order is on the indices of ground clauses comprising the input data set. Subsets of these indices are then ordered using a ‘lectic’ order ≤, the lexicographic ordering of incidence vectors representing clause subsets [GR91].

Note that our algorithm Next-Closure-RLGG (NC-RLGG) generates only the set of first-order formal concepts, not the concept lattice itself. However, the lattice can be generated by ordering the concepts on their extents. The NC-RLGG algorithm has been implemented as a proof-of-concept only, without optimizations such as pointers to the most-general intent containing an attribute or literal, or binning of concepts by size of extent, as in the incremental algorithm of Godin and Missaoui [GM94b, Bai03a] used for propositional concept lattice construction in Chapter 3.
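The following is a minimal sketch of the classical Next-Closure loop over extents, parameterised by an arbitrary closure operator on sets of indices 1..n. In NC-RLGG the closure of a set of clause indices would be the extent of the RLGG of those clauses; the toy demo at the end instead closes sets via the derivation operators of a small, invented formal context. This is the textbook algorithm, not the thesis implementation.

    def next_closure(A, n, closure):
        """Return the lectically next closed set after A, or None if A is last."""
        for i in range(n, 0, -1):
            if i in A:
                A = A - {i}                  # strip elements >= i as we descend
            else:
                B = closure(A | {i})
                if min(B - A) >= i:          # B adds no element smaller than i
                    return B
        return None

    def all_closed_sets(n, closure):
        A = closure(set())                   # the lectically first closed set
        while A is not None:
            yield A
            A = next_closure(A, n, closure)

    # Toy demo: three objects with attribute sets; closure = extent(intent(.)).
    rows = {1: {"a", "b"}, 2: {"a"}, 3: {"b"}}
    def closure(S):
        common = set.intersection(*(rows[i] for i in S)) if S else {"a", "b"}
        return {i for i in rows if common <= rows[i]}

    for C in all_closed_sets(3, closure):
        print(sorted(C))                     # [1], [1, 3], [1, 2], [1, 2, 3]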

7.4 Examples

In this section we show two examples of the lattice construction algorithm’s input and output on toy problems. For real applications these would not be shown in full due to their complexity; instead, the visualization techniques described in previous chapters would be used.

4. Specifically, we use the Next-Extent version, since the lectic ordering is easier to define for extents, which are ordered sets of clause indices, than for first-order intents.

Table 7.1: Objects (input examples) for the animals domain are ground clauses. Clauses can be generated by a standard ILP system relative to background knowledge.

Index  Clause
1   class(a1,mammal) ← animal(a1,dog),covering(a1,hair),legs(a1,yes),habitat(a1,land),homeothermic(a1,yes).
2   class(a2,mammal) ← animal(a2,dolphin),covering(a2,none),legs(a2,no),habitat(a2,water),homeothermic(a2,yes).
3   class(a3,fish) ← animal(a3,trout),covering(a3,scales),legs(a3,no),habitat(a3,water),homeothermic(a3,no).
4   class(a4,fish) ← animal(a4,shark),covering(a4,none),legs(a4,no),habitat(a4,water),homeothermic(a4,no).
5   class(a5,fish) ← animal(a5,herring),covering(a5,scales),legs(a5,no),habitat(a5,water),homeothermic(a5,no).
6   class(a6,bird) ← animal(a6,eagle),covering(a6,feathers),legs(a6,yes),habitat(a6,air),homeothermic(a6,yes).
7   class(a7,bird) ← animal(a7,penguin),covering(a7,feathers),legs(a7,yes),habitat(a7,water),homeothermic(a7,yes).
8   class(a8,reptile) ← animal(a8,lizard),covering(a8,scales),legs(a8,yes),habitat(a8,land),homeothermic(a8,no).
9   class(a9,reptile) ← animal(a9,snake),covering(a9,scales),legs(a9,no),habitat(a9,land),homeothermic(a9,no).
10  class(a10,reptile) ← animal(a10,turtle),covering(a10,scales),legs(a10,yes),habitat(a10,water),homeothermic(a10,no).

7.4.1 Animals

The first example is a standard ILP hierarchical representation problem, a toy taxonomy of animals [Deh97]. This shows how predicates defining an object’s membership in a taxonomy of hierarchical classes can be learned. Input data is shown in Table 7.1 and the output concept set is in Table 7.2.

Table 7.2: First-order concept lattice for the animals domain. Concepts are listed in decreasing order of generality (extent size). Each intent is a first-order definite clause which is the Least-General Generalization (LGG) of the ground clauses in the extent. Concepts with fewer than two objects in their extent are omitted.

Extent          Intent

Concept with extent size = 10
{1,2,3,4,5,6,7,8,9,10}  class(A,B) ← animal(A,C),covering(A,D),habitat(A,E),homeothermic(A,F),legs(A,G).

Concept with extent size = 7
{1,3,4,5,6,7,9}  class(A,B) ← animal(A,C),covering(A,D),habitat(A,E),homeothermic(A,F),legs(A,F).

Concepts with extent size = 6
{2,3,4,5,7,10}  class(A,B) ← animal(A,C),covering(A,D),habitat(A,water),homeothermic(A,E),legs(A,F).
{3,4,5,8,9,10}  class(A,B) ← animal(A,C),covering(A,D),habitat(A,E),homeothermic(A,no),legs(A,F).

Concepts with extent size = 5
{1,6,7,8,10}  class(A,B) ← animal(A,C),covering(A,D),habitat(A,E),homeothermic(A,F),legs(A,yes).
{2,3,4,5,9}   class(A,B) ← animal(A,C),covering(A,D),habitat(A,E),homeothermic(A,F),legs(A,no).
{3,5,8,9,10}  class(A,B) ← animal(A,C),covering(A,scales),habitat(A,D),homeothermic(A,no),legs(A,E).

Concepts with extent size = 4
{1,2,6,7}   class(A,B) ← animal(A,C),covering(A,D),habitat(A,E),homeothermic(A,yes),legs(A,F).
{2,3,4,5}   class(A,B) ← animal(A,C),covering(A,D),habitat(A,water),homeothermic(A,E),legs(A,no).
{3,4,5,7}   class(A,B) ← animal(A,C),covering(A,D),habitat(A,water),homeothermic(A,E),legs(A,E).
{3,4,5,9}   class(A,B) ← animal(A,C),covering(A,D),habitat(A,E),homeothermic(A,no),legs(A,no).
{3,4,5,10}  class(A,B) ← animal(A,C),covering(A,D),habitat(A,water),homeothermic(A,no),legs(A,E).

Concepts with extent size = 3
{1,6,7}   class(A,B) ← animal(A,C),covering(A,D),habitat(A,E),homeothermic(A,yes),legs(A,yes).
{1,8,9}   class(A,B) ← animal(A,C),covering(A,D),habitat(A,land),homeothermic(A,E),legs(A,F).
{3,4,5}   class(A,fish) ← animal(A,B),covering(A,C),habitat(A,water),homeothermic(A,no),legs(A,no).
{3,5,9}   class(A,B) ← animal(A,C),covering(A,scales),habitat(A,D),homeothermic(A,no),legs(A,no).
{3,5,10}  class(A,B) ← animal(A,C),covering(A,scales),habitat(A,water),homeothermic(A,no),legs(A,D).
{8,9,10}  class(A,reptile) ← animal(A,B),covering(A,scales),habitat(A,C),homeothermic(A,no),legs(A,D).

Concepts with extent size = 2
{1,2}   class(A,mammal) ← animal(A,B),covering(A,C),habitat(A,D),homeothermic(A,yes),legs(A,E).
{1,8}   class(A,B) ← animal(A,C),covering(A,D),habitat(A,land),homeothermic(A,E),legs(A,yes).
{1,9}   class(A,B) ← animal(A,C),covering(A,D),habitat(A,land),homeothermic(A,E),legs(A,E).
{2,4}   class(A,B) ← animal(A,C),covering(A,none),habitat(A,water),homeothermic(A,D),legs(A,no).
{2,7}   class(A,B) ← animal(A,C),covering(A,D),habitat(A,water),homeothermic(A,yes),legs(A,E).
{3,5}   class(A,fish) ← animal(A,B),covering(A,scales),habitat(A,water),homeothermic(A,no),legs(A,no).
{6,7}   class(A,bird) ← animal(A,B),covering(A,feathers),habitat(A,C),homeothermic(A,yes),legs(A,yes).
{7,10}  class(A,B) ← animal(A,C),covering(A,D),habitat(A,water),homeothermic(A,E),legs(A,yes).
{8,9}   class(A,reptile) ← animal(A,B),covering(A,scales),habitat(A,land),homeothermic(A,no),legs(A,C).
{8,10}  class(A,reptile) ← animal(A,B),covering(A,scales),habitat(A,C),homeothermic(A,no),legs(A,yes).

In Table 7.1 we can see that the input representation is essentially propositional, since each animal is described in terms of the values it has for a set of attributes. However, in Table 7.2 the concepts are clearly first-order.

For example, looking at the subset of concepts with extent size = 4, we see the following two concepts. The first of these could be represented by a propositional concept, since the common feature of these animals is that they are not homeothermic and do not have legs:

{3,4,5,9} class(A,B) ← animal(A,C), covering(A,D), habitat(A,E), homeothermic(A,no), legs(A,no).

However, the second concept intent expresses the fact that the values for the attributes “homeothermic” and “legs” are the same in each case, and this requires a first-order representation:

{3,4,5,7} class(A,B) ← animal(A,C), covering(A,D), habitat(A,water), homeothermic(A,E), legs(A,E).

In particular, first-order refinement between concepts can be seen in the restriction from the most general concept (with extent size 10) to the next most general (with extent size 7), which is the LGG of all animals having the same Boolean value (either “yes” or “no”) for the attributes “homeothermic” and “legs”.
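To see where such shared variables come from, here is a minimal sketch of Plotkin-style anti-unification for function-free atoms, as used pairwise inside the LGG of Section 7.3: equal terms are kept, and each differing pair of terms is consistently replaced by the same fresh variable. It is applied to body atoms of clauses 3 and 7 from Table 7.1.

    def lgg_terms(s, t, mapping):
        """Anti-unify two terms; identical pairs always get the same variable."""
        if s == t:
            return s
        if (s, t) not in mapping:
            mapping[(s, t)] = f"V{len(mapping)}"   # fresh shared variable
        return mapping[(s, t)]

    def lgg_atoms(a1, a2, mapping):
        """Anti-unify two atoms with the same predicate, else None."""
        (p, args1), (q, args2) = a1, a2
        if p != q or len(args1) != len(args2):
            return None                            # incompatible literals
        return (p, tuple(lgg_terms(s, t, mapping) for s, t in zip(args1, args2)))

    m = {}
    print(lgg_atoms(("legs", ("a3", "no")), ("legs", ("a7", "yes")), m))
    print(lgg_atoms(("homeothermic", ("a3", "no")), ("homeothermic", ("a7", "yes")), m))
    # -> ('legs', ('V0', 'V1')) and ('homeothermic', ('V0', 'V1')):
    # the shared pair (no, yes) maps to one variable in both atoms, which is
    # exactly the legs(A,E), homeothermic(A,E) pattern seen in Table 7.2.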

7.4.2 Pathways

In this toy example there is a domain of gene names separated into a source set {x1, x2, x3, x4} and a target set {y1, y2, y3, y4}. The data is a set of “one-hop” pathways from a source to a target gene. The hypothesis language allows clauses to be defined where each source gene is labelled with a property px, each target gene is labelled with a property py, and there is a labelled relation rxy on the single source-to-target “hop”. The problem is to find one or more definite clauses generalising the input pathway examples. In a real-world application this would require a domain-relevant objective function on clauses and a more powerful representation, but this example is sufficient to show the method working.

Table 7.3: Objects (input examples) for the pathways domain are ground clauses. Clauses can be generated by a standard ILP system relative to background knowledge. Properties px (resp. py) of genes in the source set (resp. target set) can have the values up, stable or down. Relations rxy can have the values “phospho”, “factor” or “complex” referring to typical biological interaction types.

Index  Clause
1  path(x1,y2) ← px(x1,stable), py(y2,down), rxy(x1,y2,phospho).
2  path(x2,y1) ← px(x2,up), py(y1,up), rxy(x2,y1,factor).
3  path(x4,y1) ← px(x4,stable), py(y1,up), rxy(x4,y1,phospho).
4  path(x3,y2) ← px(x3,down), py(y2,down), rxy(x3,y2,factor).
5  path(x2,y3) ← px(x2,up), py(y3,stable), rxy(x2,y3,complex).
6  path(x2,y4) ← px(x2,up), py(y4,down), rxy(x2,y4,complex).

Table 7.4: First-order concept lattice for the pathways domain. Concepts are listed in decreasing order of generality (extent size). Each intent is a first-order definite clause which is the Least-General Generalization (LGG) of the ground clauses in the extent. Concepts with fewer than two objects in their extent are omitted.

Extent  Intent

Concept with extent size = 6
{1,2,3,4,5,6}  path(A,B) ← px(A,C), py(B,D), rxy(A,B,E).

Concepts with extent size = 3
{2,5,6}  path(x2,A) ← px(x2,up), py(A,B), rxy(x2,A,C).
{1,4,6}  path(A,B) ← px(A,C), py(B,down), rxy(A,B,D).

Concepts with extent size = 2
{5,6}  path(x2,A) ← px(x2,up), py(A,B), rxy(x2,A,complex).
{2,4}  path(A,B) ← px(A,C), py(B,C), rxy(A,B,factor).
{2,3}  path(A,y1) ← px(A,B), py(y1,up), rxy(A,y1,C).
{1,4}  path(A,y2) ← px(A,B), py(y2,down), rxy(A,y2,C).
{1,3}  path(A,B) ← px(A,stable), py(B,C), rxy(A,B,phospho).

We can see that the lattice contains formal concepts with potentially interesting patterns. For example, in Table 7.4 the concept with extent {2, 4} shows that when source and target genes are in a “factor” interaction then both genes have the same property, either “up” or “down” together, i.e., they are co-expressed.
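The same mechanism reproduces this co-expression pattern: reusing lgg_terms/lgg_atoms from the sketch in Section 7.4.1 on clauses 2 and 4 of Table 7.3 yields the shared-variable intent of the concept with extent {2, 4}.

    m = {}
    print(lgg_atoms(("px", ("x2", "up")), ("px", ("x3", "down")), m))
    print(lgg_atoms(("py", ("y1", "up")), ("py", ("y2", "down")), m))
    print(lgg_atoms(("rxy", ("x2", "y1", "factor")),
                    ("rxy", ("x3", "y2", "factor")), m))
    # -> ('px', ('V0', 'V1')), ('py', ('V2', 'V1')), ('rxy', ('V0', 'V2', 'factor')):
    # the pair (up, down) maps to the single variable V1 in both px and py,
    # giving px(A,C), py(B,C), rxy(A,B,factor) with A=V0, B=V2, C=V1.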

7.5 Discussion

7.5.1 First-order logic based ontology representations

Currently there is no single ontology representation based on first-order logic. Most often proposed are Description Logics (DLs) [BCM+03], which are the basis for ontologies in the Semantic Web (http://www.w3.org/standards/semanticweb). DLs are a family of languages based on different fragments of first-order logic. Usually DL representations are restricted in expressive power to achieve decidability. For example, the currently proposed web ontology language, OWL 2 (http://www.w3.org/TR/owl2-overview), has three specialised “profiles”, or syntactic restrictions, intended for certain types of application:

OWL 2 EL: efficient inference on very large ontologies;

OWL 2 QL: settings where large relational data sets may be queried efficiently using database technology;

OWL 2 RL: settings where large sets of RDF triples may be queried efficiently using rule-based inference.

The restrictions in these profiles are motivated by the fact that representation choice typically involves application-dependent trade-offs between expressivity and efficiency. This allows DLs to have several important properties, including the ability to prove complexity results for important operations such as subsumption checking, and the ability to represent hierarchical definitions. These have led to the adoption of DLs as the basis of standards for the Semantic Web.

In this chapter we avoid commitment to a particular Semantic Web standard and select Datalog as a representation for ontology learning. A key advantage of Datalog is its Logic Programming semantics [Llo87], which allows the use of rich background knowledge in learning using ILP [Rae08]. On the other hand, Datalog has some disadvantages. One is the lack of function symbols, which can be useful particularly for compactly representing structured objects like proteins. This is typically handled in ILP by “flattening” representations, but that can lead to longer clauses which are harder to learn.

From Chapter 6 it is clear that Datalog can represent both relational queries and rules in a systems biology domain. However, by restricting Datalog clauses to contain only unary and binary predicates, the concepts and roles used in Description Logics can be represented and defined for ontologies.

7.5.2 Ontology learning from first-order concept lattices

In Chapter 3 we applied a method from previous work on ontology learning in a propositional representation [Bai03a] to the problem of feature construction. However, this approach has some limitations making it difficult to move to a first-order representation. In this chapter we propose to avoid these by adapting two approaches from Description Logics. We first outline these limitations and then show how two representational “tricks” can be used with the first-order concept lattices generated by the proposed algorithm to enable ontology learning.

7.5.2.1 Limitations of the Theory Revision approach

Ontology learning in the propositional representation of [Bai03a] is based on two types of propositional clauses or rules. Training examples are represented as rules, where each example has the class as the rule head and the attribute-values as the literals of the rule body. Predicate invention based on the Duce operators [Mug87] is used to induce structure from regularities in the data. This is an incremental theory revision approach where the introduction of invented predicates requires all affected clauses to be revised. A second type of rule, referred to as “ontology clauses” [Bai03a], is then used to order the invented predicates into a hierarchical terminology.

There are two key limitations of this approach. The first is that at every step both the theory and the concept lattice are revised. Concept lattice revision is complex and time consuming. Also, since revision uses a greedy algorithm, some structural information in the concept lattice may be lost at each step. The second limitation is that the set of ontology clauses defining the hierarchical conceptual structure learned by the algorithm is separate from the theory clauses that define the concepts. Therefore it is not clear how they should be linked for combined inference. However, both limitations can be overcome with techniques adapted from Description Logics to our setting.

7.5.2.2 Adapting a Description Logic approach

The common aspect of DLs is to restrict representations to concepts (unary predicates) and roles (binary predicates). These representations are extended to define particular DLs with different types of quantification, negation, etc. The main advantage of DLs for ontologies, and hence for ontology learning, is that they are designed to represent hierarchical knowledge. This is done by separating knowledge bases into two disjoint parts. The terminological part of the knowledge base is called the TBox and consists of definitions of concepts and roles and their subsumption relations (see footnote 7). These can be quite general or particular to a given domain. Knowledge about individuals in a domain is encoded by assertions in the ABox of the knowledge base. Most ontology learning in DLs focuses on TBox learning [CMSV09].

We overcome the first limitation above by restricting our first-order concept lattice to contain only Datalog clauses defining unary or binary predicates, and by defining the ontology learning problem to be (i) learning concept and role definitions and (ii) learning subsumption relations, or inclusion axioms, between two concepts or two roles. All the information necessary for TBox definitions is contained in the concept lattice and can be extracted from it: (i) concept or role definitions are given by the clauses in the intents; and (ii) subsumption relations hold between pairs of subsuming concepts. Particular definitions can then be extracted from the concept lattice as required, given a suitable selection function for “good” definitions. Note that subsumption relations correspond to the idea of ontology clauses in [Bai03a]. If we ensure that new terms for concepts and roles are introduced in a namespace disjoint from the input set of Datalog clauses, no theory or lattice revision operations are required.

The second limitation is overcome simply by placing all definitions and inclusion axioms in the TBox, and the original data, asserting clauses for individuals, in the ABox.
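As a minimal sketch of this extraction (assuming concepts arrive as (extent, intent) pairs, as produced by NC-RLGG, and that new concept names live in a namespace disjoint from the input clauses; the names 'fish' and 'cold_blooded' below are purely illustrative), subsumption axioms can be read directly off extent containment:

    def tbox_inclusions(concepts):
        """concepts: dict mapping a fresh concept name to (extent, intent).
        Emit a subsumption axiom C1 subClassOf C2 whenever the extent of
        C1 is properly contained in the extent of C2."""
        axioms = []
        for n1, (e1, _) in concepts.items():
            for n2, (e2, _) in concepts.items():
                if n1 != n2 and e1 < e2:          # proper extent containment
                    axioms.append((n1, "subClassOf", n2))
        return axioms

    # From Table 7.2: 'fish' ({3,4,5}) is subsumed by the size-6 concept of
    # non-homeothermic animals ({3,4,5,8,9,10}); intents abbreviated here.
    concepts = {
        "fish": (frozenset({3, 4, 5}), "class(A,fish) <- ..."),
        "cold_blooded": (frozenset({3, 4, 5, 8, 9, 10}), "class(A,B) <- ..."),
    }
    print(tbox_inclusions(concepts))   # [('fish', 'subClassOf', 'cold_blooded')]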

7.5.3 Related work

Although our approach is motivated by the structured data types in systems biology such as graphs, most work on ontology learning has focused on learning ontologies from text [CMSV09]. A full discussion of these approaches is beyond the scope of this thesis. However, the two aspects of ontology learning to which this thesis contributes are: (i) the use of Logic Programs as a first-order representation; and (ii) the use of a closure operator in generalisation.

7. These are unary and binary predicate subsumption relations, and in DLs are referred to as “general concept inclusion (GCI)” and role inclusion axioms, respectively.

Table 7.5: Comparison of first-order ontology learning approaches. Novelty of our approach is identified on two dimensions: one is the use of a logic programming representation, the other is the use of a closure operator.

                  First-order representation
Closure           Description Logics      Logic Programs
FCA               E.g., [Dis11]           NC-RLGG
None              E.g., [LH08]            E.g., [Lis13]

We therefore focus on related work along these two dimensions. In particular, we categorised the research as shown in Table 7.5.

The research in this chapter is, to the best of our knowledge, the first combining general ILP learning with a closure operator based on Formal Concept Analysis, as described above for the NC-RLGG algorithm. This appears in the top-right entry of Table 7.5. The advantages of ILP for ontology learning include: efficient search; many different learning algorithms and techniques; and the use of a general-purpose programming language to encode general background knowledge, including not just domain knowledge but also declarative bias mechanisms to control the hypothesis space [Mug95]. In this thesis the motivation for adopting a closure operator based on Formal Concept Analysis is descriptive induction. As outlined in previous chapters, it provides a form of completeness in concept formation by including all descriptors relevant to a subset of domain objects, eliminating redundant concepts and enabling visualization.

In formal concept analysis a number of authors have investigated relational or logical extensions to the basic representation [PW99, FR00]; recently this has been developed for multi-relational data mining applications [RHHNV13]. These approaches, however, do not use a logic programming representation.

A recent proposal [Hit11] suggests research directions on FCA for Semantic Web applications, which typically require relational representations. Towards this is work on generating concept lattices where the intents are expressions in EL-based Description Logics and the extents are the set of elements in the interpretation [Dis11, Bor13]. Such approaches are in the top-left entry of Table 7.5. Note that they do not use general background knowledge as in ILP; however, this thesis has shown the importance of using rich background knowledge in machine learning applications in systems biology.

A separate research direction for first-order ontology learning has been the use of refinement operators for Description Logics, e.g., [LH08]. Both top-down and bottom-up refinement operators have been proposed. DL expressions are refined with respect to a DL knowledge base such as an ontology. This type of approach is in the bottom-left category of Table 7.5. However, these refinement approaches do not use closure operators, which we have motivated in this thesis for descriptive induction, and they can only use DL knowledge bases rather than general background knowledge as in ILP.

Towards this goal, [Lis13] uses a representation that combines logic programming (e.g., Datalog) and a Description Logic (e.g., SHIQ). An ILP approach using a refinement operator based on FOIL is used to learn clauses in this representation. However, it is not clear that this is sufficient for ontology learning; for example, there is no mechanism to introduce new predicates, nor any use of closure operators for descriptive induction. This research is in the bottom-right category of Table 7.5.

7.6 Summary

In this chapter we introduced a new method of ontology learning in a first-order representation that is sufficient to represent concepts and relations in systems biology. The proposed NC-RLGG algorithm combines the use of a closure operator, as in FCA, to enable descriptive induction with an efficient generalization method from ILP. Although the approach has been implemented as a proof-of-concept only, we have shown its operation on two toy domains and described how it overcomes key limitations of the propositional approach applied in previous chapters.

One limitation of our algorithm is that all concepts are generated during construction, which can lead to large memory requirements. For future work we will investigate the possibility of using a more efficient approach, such as a modification of the BioLattice algorithm of Chapter 5 to work with first-order concepts. Although the Next-Closure method generates only closed concepts and not the concept lattice, we could adapt Algorithm 5 from Chapter 5 for this task. The lattice could then be browsed interactively by the user.

Obviously we need a method to select “interesting” concepts. Lattice concepts can be evaluated using either supervised or unsupervised measures, depending on whether the data has class labels. One approach could be to extend the information-based measures used in earlier chapters to a first-order representation. For example, Quinlan’s FOIL showed how mutual information measures in rule learning could be upgraded to learning first-order clauses [Qui90]. Finally, we will test the approach on some of the large-scale relational datasets used in previous chapters.

8 Conclusions

“With every new answer unfolded, science has consistently discovered at least three new questions.”

Wernher Von Braun

This chapter summarises the results of this thesis and proposes some future directions. Throughout the thesis, a combination of techniques from machine learning, using both propositional learning and Inductive Logic Programming (ILP), together with Formal Concept Analysis (FCA) for ontologies and visualization techniques, was implemented to attack the central problem of modelling complex systems biology in terms of meaningful interacting gene or protein networks.

Chapter 3 investigated machine learning applications using ontology annotations to build predictive systems biology models. Formal Concept Analysis was adopted to derive ontological concepts as a pre-processing step for propositional learning.

Chapter 4 described an alternative framework to propositional learning, using an application of ILP to learn integrative models of systems biology networks. Limitations of such an approach were identified and a foundation for a multi-step knowledge integration and structuring approach was laid.

Chapter 5 introduced “visual analytics” as an analytical reasoning tool powered by visualization that assists domain experts in applying their background knowledge and decision-making skills to data analysis problems. Here FCA was applied as a data analysis framework for discrete biological domains, where the lattice facilitates navigation to identify important concepts with the addition of domain knowledge. ILP was then introduced for first-order learning to construct explanatory models from ontology concepts determined by FCA.

Chapter 6 demonstrated a novel “visual closure” method by extending the idea of closure in the FCA framework to integrate knowledge in a multi-step process using ILP and visual analytics. Datalog queries from ILP rules, “closed” with respect to additional layers of structural and other relevant background knowledge, were then visualized for inspection by a domain specialist.

Finally, Chapter 7 described an initial version of automated ontology construction in a first-order representation.

In the remainder of this conclusion we give a summary of each chapter, covering results and possible improvements or extensions.

8.1 Propositional learning from ontological annotation

Chapter 3 reviewed issues such as bias in applying standard Gene Ontology over-representation analysis to high-throughput biological data. To address the bias problem we proposed and implemented an alternative machine learning technique in a supervised learning framework. The two key problems for such implementations were:

1. to ensure a correct representation of dependencies among the annotated terms due to the DAG-like ontology structure;

2. to meaningfully integrate multiple sources of noisy and unstructured data into classifier learning, to predict a comprehensible model of systems biology behaviour.

Therefore, we implemented a number of feature construction and feature selection methods in a supervised learning framework. As a pre-processing step for feature construction the input data set was treated as a formal context and formalised by the coverage matrix approach to account for structural information in the Gene Ontology categories. Concept lattices were built using FCA, where concepts were extracted and combined as features for machine learning. The selection of subsets of “good” features or concepts was done by feature selection methods based on predictive accuracy.

For evaluation, two different problems in yeast systems biology were empirically tested, where the task was to learn a classifier to predict a selected behaviour or phenotype of interest for genes or proteins, given a set of known attributes or features. We compared quantitative results on the performance of different methods obtained by a decision tree learning algorithm.

Our initial outcome of improved performance validated our method of applying standard supervised learning algorithms with no prior biological knowledge, since this was able to discover significant network relations among genes or proteins as a response to stress.

However we note that:

1. extending to a large attribute set for more expressive representation will lead to greatly increased complexity for FCA lattice construction;

2. biological data is inherently relational and propositional learning is restricted to attribute vector-based learning;

3. learning to predict phenotype at the cellular level is a harder problem than protein or gene expression at the intra-cellular level.

8.2 Learning responsive cellular networks by integrating genomics and proteomics data

Motivated by the outcomes of the work above, Chapter 4 investigated an application of first-order learning to systems biology problems using ILP. We proposed and implemented a supervised learning framework with the goal of learning a logical network model by discovering potentially significant theories with respect to the data set and relational background knowledge.

Two biological tasks were selected for learning elements of the cellular response network in yeast to investigate our method. The first task was to learn predictive theories of a protein expression network from integrated data sets, and the second one was learning to predict observable multiple response phenotypes at the cellular level.

The outcome of our work established the contributions of first-order learning of theories in systems biology domains using our representation, which thereafter can be used for visualization or further ontology applications. However, we note that:

1. while our method could learn complex clauses, they were short, with limited biological information;

2. learning of theories from larger data sets in ILP showed increased learning time with respect to large multi-relational structured background knowledge;

3. the representation of structured background knowledge, such as the Gene Ontology, was an open issue to be addressed.

However, biologists required more relevant information than we could incorporate in straightforward applications of ILP techniques to directly learn predictive models on the complete data set representing a biological system such as stress response. This led us to explore further the use of knowledge integration methods in a multi-step process, described next.

8.3 Visualising concept lattices for learning from integrated multi-relational data

Based partly on the outcomes of the work in previous sections, it was realized that biologists require more powerful tools to investigate common subsets of genes that are observed to have a common experimental outcome, i.e., that share common systemic behaviour. This led to the design of a comprehensive visual analytics tool called the BioLattice browser (Chapter 5). The core of this implementation was FCA, with several further contributions, including:

1. its use as an explicit conceptualization tool to integrate data from multiple, het- erogeneous data sources;

2. the concept lattice provided a means to overview the domain structure and interactively navigate implicit domain ontologies;

3. concepts were extracted from the lattice and then used for further learning to provide generalized concept definitions with respect to additional background knowledge, using inductive logic programming.

To support the work we implemented an efficient concept lattice building algorithm called BioLattice. Its computational efficiency was achieved using a vertical data format representation of the formal context, with search performed over an intent-extent tree search space, and by avoiding the generation of all itemsets through pruning of both non-frequent and non-closed itemsets. Building the concept lattice was followed by its integration into a visual analytics tool. We reviewed several lattice viewing techniques, where the key challenges identified were (i) scalability, due to biological data having large attribute sets, and (ii) the incomprehensibility of viewing such a large graph of concepts at once. To work around these issues, we implemented an “incremental exploration technique” to navigate through one concept at a time, while utilising the general-to-specific ordering of the ontology concepts in the lattice. Each concept in the lattice was also augmented with additional background knowledge, such as Gene Ontology categories, protein-protein interactions, etc., for further interactive analysis.

As a part of the biological validation we selected two biological tasks to test the potential of using concept lattices for learning gene response to multiple stresses:

1. assessment of the biological significance of lattice concepts was carried out by a hypergeometric test of the genes extracted from a particular concept against an external set of gene interaction data (a minimal sketch of such a test follows this list). The outcome clearly showed their biological significance; ontology concepts in the lattice were therefore likely to denote interesting biological functions worthy of further study.

2. ILP was used to model dynamic cell behaviour by learning rules from multiple-stress concepts extracted from the lattice. A large amount of multi-relational and ontology data was used as background knowledge. The subset of genes covered by a rule was then mapped to known protein complex architectures to assemble their interaction networks in the cell. The outcome supported the potential for inference of stress sensitivity from the relational structure of the clause as applied to the protein complexes.
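As an illustration of such a hypergeometric over-representation test, the following sketch computes the p-value for the overlap between a concept's extent and an external gene set using SciPy; all counts below except the 1094-gene screen size are invented for this example.

    from scipy.stats import hypergeom

    # Hypothetical counts for illustration only:
    M = 1094   # genes in the whole screen (population size)
    n = 80     # genes in the external interaction set (successes in population)
    N = 40     # genes in the concept's extent (sample size)
    k = 12     # overlap between the extent and the external set

    # P(overlap >= k) under random sampling without replacement:
    p_value = hypergeom.sf(k - 1, M, n, N)
    print(f"p = {p_value:.3g}")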

In summary, our work in this chapter showed that, due to the highly relational nature of the systems biology domain, the integration of ILP into the visual analytics framework was realistic and promising. Domain-relevant concepts identified from the lattice could be used in learning potentially useful rules defining the concepts in a first-order representation. This laid the foundation for the next stage of the work.

8.4 Augmenting formal concepts with first-order learning and visual analytics

The motivation for multi-step ontology learning came from observing the following outcomes of our previous work:

1. straightforward application of propositional learning has limitations in learning from relational biological data;

2. learning rules directly by ILP application is limited by bounds on knowledge inference, computational time complexity and the restricted representation of structural background knowledge;

3. Chapter 5 showed that the combination of domain structuring in an FCA framework, followed by application of ILP, has potential in biological domains for visualization or further learning.

Chapter 6 proposed and implemented a novel multi-step ontology application method called “visual closure”, which is analogous to the closure of formal concepts in terms of the relation between intents and extents. Visual closure takes the extent of a concept and produces the dual definition as an intent, by first using ILP to learn descriptive rules and then finalising the closure by translating them to a graphical representation that can be visualised by domain specialists.

As a realisation of this method, we implemented it using efficient Datalog queries with the ability to expand first-order rules for concepts by searching additional knowledge in extensional biological databases. Finally, the application of visual closure completed the final layer of knowledge augmentation by injecting additional structural information in the domain that could not have been easily integrated during the learning phase due to complexity overheads. We used the Javascript-based Infovis technology with static graph structures to represent the data, and two interactive graphical views were presented: Sunburst, i.e., a radial space-filling algorithm, for protein complex architecture visualization; and force-directed graphs for visualization of all other interactions and properties.

To evaluate our method, two case studies of biological significance were presented. The first task was to learn the complex mechanism of cell response and adaptation to a particular stress, whereas the second demonstrated the ability of our method to mine interesting high-order inter-protein complex relations defined by the rules. As future work we proposed an alternative method of visualization by overlaying the complete structure using different colours on the force-directed graph. It would be interesting to explore methods of moving from the visualization of one complex to other (related) complexes, possibly by an overlay using location or dynamic data from associated processes.

8.5 Towards First-order Ontology Learning

In this final chapter (Chapter 7) we described an initial version of automated ontology construction in a restricted first-order representation. The main contribution of this chapter was lifting the FCA approach from a propositional to a first-order representation and enabling the use of general background knowledge, including declarative bias, using an ILP system to learn the concept intents. To serve this purpose, a restricted first-order language with sufficient ability to represent useful relational concepts in systems biology data was described. To complete the work, an FCA algorithm to construct the lattice and a method for representing first-order intents were outlined. We illustrated this preliminary implementation with two toy examples; a more formalised approach is planned as part of future work. This will include addressing the learning of concept and role inclusion axioms suitable for a basic biomedical ontology language, and investigating objective functions to determine “good” concepts from the lattice.

8.6 Future work

The goal of this thesis was to show that machine learning approaches can be effectively applied in various ontology applications in systems biology, where the problem domains can be extremely complex to comprehend and integrative analysis of multi-relational, noisy and unstructured data from heterogeneous sources is required.

The future focus for this research will be on ontologies and bioinformatics for systems biology, building on the machine learning approaches we have described. Ontologies are a basic component of the Semantic Web [BLHL01], which is important to the future of information and communication technologies (ICT) due to the rapidly growing volume of unstructured data. The main goal of the Semantic Web can be summarized as the representation of content (data in different media) in a form which can have meaning for artificial as well as human agents. A key aspect of this effort is that, like the World Wide Web and HTML, it is based on standards; these include RDF (Resource Description Framework) and OWL (Web Ontology Language). Bioinformatics has been identified as a science of increasing research priority. It provides both infrastructure and analysis methodologies for large-scale biological research, and has been identified as an important component in the development of research capabilities in biology and biotechnology. Currently, bioinformatics is increasingly directed towards systems biology, with the goal of understanding, predicting and re-engineering the mechanisms underlying complex biological behaviours for significant applications [HBA12].

Although we have used FCA in this thesis, all of our novel contributions have been based on the standard approach. However, there are a number of additional methods based on FCA that could usefully extend our approaches. Several authors [Len00, For06, EDD12] have investigated similar and related concept retrieval from FCA using concept similarity measures (e.g., feature-based similarity, information content similarity, etc.). We see several points for further work starting from such ideas.

In Chapter 3, a similarity-based measure could help in feature selection. It is important that lattice navigation follows the monotonicity of concept lattices, which follows from the partial order on concepts that is inherent in FCA (Section 2.4.1). Also, such a measure could help in the type of ontological analysis used in Section 3.5.4.

Based on the ideas of [EDD12], additional functionality could be added to BioLattice by implementing a method to “jump” from the current concept to related concepts or categories during lattice navigation (Chapter 5). This could be achieved by displaying to the user locally similar concepts retrieved by a concept similarity measure.

Information-based measures as used in [For06] may have advantages over set-based similarity measures [EDD12], due to the uncertain characteristics of real data (e.g., noise, missing values, etc.). The work in Chapter 3 on information-based measures to rank concepts supports this claim, and it should be possible to extend this to an information-based concept similarity measure since it is based on probability.

The Datalog queries in Chapter 6 used in the construction of the “visual closure” of ILP rules could be further improved by similarity-based search on related protein complexes. Additionally, the use of concept similarity as a preference order on concepts is highlighted for future work in Chapter 7, to potentially reduce the much larger space of first-order concepts that need to be considered. These ideas need to be investigated; however, we note that in a first-order representation similarity measures become harder to define due to the difficulty of mapping between first-order formulae.

A Database and queries

This appendix describes data sources and the relational database schemas used in various implementations throughout the thesis.

A.1 Datasets

In Table A.1 we summarise the biological datasets used throughout the thesis.

Table A.1: Biological datasets.

Dataset | Citation | Description
Protein expression | Godon et al. [GLL+98] | Protein expression whose synthesis was stimulated or repressed under oxidative stress. Of a total of 92 proteins, 56 were up-expressed and 32 were down-expressed.
Gene expression (microarray) | Causton et al. [CRK+01] | mRNA transcriptional response data collected from the yeast cellular network. Cells were exposed to six different environmental stresses (Alkali, Acid, H2O2, Heat, Salt and Sorbitol); the response to each stress was observed at 8-10 time points over a period of 2 hours.
Phenotype data | Saccharomyces Genome Deletion Project [WSA+99] | Set of yeast deletion strains where in each strain exactly one gene from the genome has been systematically removed. 26 such screens, assembled on 1094 genes from various different laboratories, are the input to our method. This data was curated and supplied to us by M. Temple and is described in Chapter 5.
Transcription factor binding data | Harbison et al. [HGL+04] | This data is described in Chapter 6 and represents regulatory networks of 9075 interactions of transcription factors and their target binding genes.
Protein-protein interaction data | Gavin et al. [A-C06]; Vidal Lab [YBY+08] | Protein-protein interaction data provides a picture of cellular pathways and cascaded responses in the cell at the molecular level. Two sets of data have been used: protein complex architecture data from affinity purification coupled to mass spectrometry (Gavin et al.), and the high-quality (Y2H-union) yeast two-hybrid binary interaction set (Vidal Lab). Our data was downloaded from both BioGRID (thebiogrid.org) and the Yeast Interactome Database (interactome.dfci.harvard.edu). The Gavin et al. data has 115164 interactions among 2671 proteins; Y2H-union has 2930 interactions among 2018 proteins.
Biochemical pathways data | Karp [Kar01] | YeastCyc biochemical pathway data was downloaded from SGD (www.yeastgenome.org). This data has 725 genes in 142 pathways.
Gene Ontology data | Ashburner et al. [Ash00] | Molecular function, biological process and cellular component categories for Saccharomyces cerevisiae were downloaded from Gene Ontology (www.geneontology.org). This data consists of a total of 146459 gene annotations to 4041 gene ontology categories occurring in our experimental data set, curated as described above.
Yeast gene IDs (ORFs) and synonyms | Cherry et al. [C+12] | To resolve ambiguity in gene or gene product synonyms, Saccharomyces Genome Database (SGD) data has been downloaded from www.yeastgenome.org.
Genetic interactions | Tong et al. [TEP+01] | Synthetic lethality interactions between pairs of genes have been downloaded from www.thebiogrid.org. Details of this data set are in Chapter 5.
FunSpec | Robinson et al. [RGMH02] | We have used the FunSpec tool (funspec.med.utoronto.ca) for Gene Ontology over-representation analysis.

A.2 Databases

The yeast gene data model from SGD was adopted as the standard nomenclature for all genes or proteins used in this work (see tables yeastreference and yeastreference_synonyms in Figure A.1), in order to integrate Gene Ontology annotation data with the other sources of experimental data. Figure A.2 shows the relational schema for GO that was used in various implementations.

[Figure A.1 shows the entity-relationship diagram of the RulesDB tables, including yeastreference and yeastreference_synonyms (SGD), tfbinds_mult and tfbinds_lone, biochemical_pathway and in_pathway, gc_ppi, the per-stress response tables (acid, alkali, heat, peroxide, nacl, sorbitol), go, and the complex_core, complex_module, complex_module_list, complex_attachments, complex_sites and complex_localizations tables, with primary and foreign keys marked.]

Figure A.1: RulesDB schema.

[Figure A.2 shows the entity-relationship diagram of the Gene Ontology database tables, including term, term2term, graph_path, association, evidence, dbxref, term_dbxref, evidence_dbxref, gene_product, gene_product_synonym, gene_product_count, species and db, with primary and foreign keys marked.]

Figure A.2: Gene Ontology relational schema.

B Aleph parameter settings

This appendix describes the Aleph settings and mode declarations used to learn rules as described in Chapters 4, 5 and 6.
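For reference, the following is a minimal sketch (the file stem "stress" is hypothetical) of how such a settings file is used. With Aleph loaded under SWI-Prolog or YAP, the background knowledge and example files are read and the search for a theory is started:

% A minimal sketch of an Aleph session; the file stem 'stress' is hypothetical.
:- [aleph].                      % load the Aleph system
:- read_all(stress).             % reads stress.b, stress.f and stress.n
:- induce.                       % run the search for a theory
:- write_rules('stress.rules').  % save the learned rules to a file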

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Settings
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

:- set(record,true).             % log the run
:- set(recordfile,'aleph.out.1').
:- set(i,2).                     % layers of new variables in clause construction
:- set(verbosity,20).

% Experimental settings
:- set(evalfn,posonly).          % learn from positive examples only
:- set(search,heuristic).
:- set(clauselength,6).          % maximum literals per clause
:- set(noise,100).
:- set(gsamplesize,442).         % size of the randomly generated example sample
:- set(minpos,4).                % minimum positives covered by a clause
:- set(nodes,50000).             % bound on nodes explored in the search

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Mode declarations
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

:- modeh(1,concept(+orf)).

% Harbison "aggregated"
:- modeb(3,tfbinds_mult(-orf,+orf)).
:- modeb(3,tfbinds_mult(+orf,-orf)).
:- modeb(3,tfbinds_lone(-orf,+orf)).
:- modeb(3,tfbinds_lone(+orf,-orf)).

% Causton discretized
:- modeb(1,acid(+orf,#qual_val)).
:- modeb(1,alkali(+orf,#qual_val)).
:- modeb(1,heat(+orf,#qual_val)).
:- modeb(1,peroxide(+orf,#qual_val)).
:- modeb(1,nacl(+orf,#qual_val)).
:- modeb(1,sorbitol(+orf,#qual_val)).

% Gavin complexes flattened version
:- modeb(3,gc_ppi(+gcid,+orf,-orf)).
:- modeb(3,gc_ppi(+gcid,-orf,+orf)).
:- modeb(3,gc_ppi(-gcid,+orf,-orf)).
:- modeb(3,gc_ppi(-gcid,-orf,+orf)).

% YeastCyc pathways
:- modeb(5,in_pathway(#biochemical_pathway,+orf)).

% Gene Ontology flattened
:- modeb(4,bp(+orf,#go_cat)).

C Learning networks

This appendix contains additional material relevant to Chapter 4.

C.1 Microarray data preparation

We used six sets of time series microarray data analyzing yeast gene expression as a response to one of six different stress conditions [CRK+01] measured on 6191 ORFs (genes). Each time series contained 6 measurements, although time intervals varied for different stresses.

Stress | up | down | no change | Total
Acid | 722 | 494 | 3210 | 4426
H2O2 | 303 | 1566 | 1808 | 3677
Heat | 2046 | 140 | 1965 | 4151
Alkali | 611 | 773 | 2101 | 3485
NaCl | 358 | 771 | 1652 | 2781
Sorbitol | 323 | 385 | 1671 | 2379

The data were discretized using a simple non-parametric method based on a kth-order statistic of the time series. First, any time series with more than m% missing data was removed. Next, each time series was sorted by expression value into two lists, one ascending and the other descending. The value v0 of the baseline (t = 0) data point was determined. Then the following steps were applied:

1. if v0 is missing, the time series is categorised as 'no change'; or

2. if v0 is among the k lowest values in the ascending list, and there is at least one value vi such that vi − v0 > δ, then the time series is categorised as 'up'; or

3. if v0 is among the k highest values in the descending list, and there is at least one value vi such that v0 − vi > δ, then the time series is categorised as 'down';

4. otherwise the time series is categorised as 'no change'.

Values for the parameters were: m = 10, k = 1 and δ = 100.
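A minimal Prolog sketch of this rule, with k = 1 and δ = 100 fixed and the predicate name discretise/2 hypothetical, is as follows. A series is represented as a list [V0|Rest] of numeric values whose first element is the baseline v0; series with too many missing values are assumed to have been filtered out already.

% discretise(+Series, -Label): Series = [V0|Rest], with k = 1 and delta = 100.
discretise([missing|_], no_change) :- !.
discretise([V0|Rest], up) :-
    min_list([V0|Rest], V0),      % v0 is among the k = 1 lowest values
    member(Vi, Rest),
    Vi - V0 > 100, !.             % some vi exceeds v0 by more than delta
discretise([V0|Rest], down) :-
    max_list([V0|Rest], V0),      % v0 is among the k = 1 highest values
    member(Vi, Rest),
    V0 - Vi > 100, !.
discretise(_, no_change).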

C.2 Ground clauses for protein expression network

In this section and the next we give further details on the analysis of the theory from Figure 4.2 discussed in Section 4.3.4.1.

All ground instantiations of the clauses in the learned theory were generated in Prolog by computing the answer substitutions for the positive examples, executed as queries on the clauses with respect to the ground relations from the Aleph background file.
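As an illustration, the following is a minimal sketch of this procedure for Clause 1 of the theory below; the predicate names positive/1 and clause1_groundings/2 are hypothetical, and the background relations ppi/2 and 'ACID'/2 are assumed to be loaded.

% Collect every grounding of the body of Clause 1 for an example A.
clause1_groundings(A, Gs) :-
    findall(B, ( ppi(A,A), 'ACID'(B,A) ), Gs).

% Pair each positive example with its (non-empty) sets of groundings.
all_clause1_instances(Pairs) :-
    findall(A-Gs,
            ( positive(A), clause1_groundings(A, Gs), Gs \= [] ),
            Pairs).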

The target predicate was "induced/1", which is true if a protein is expressed under H2O2 stress. Three types of relation on yeast systems biology were integrated in the background file:

• gene expression data under H2O2 stress (discussed above in Section C.1)

• protein-protein interactions from the BioGRID repository (thebiogrid.org)

• transcription factor binding data under a number of different stress conditions [HGL+04]

The corresponding Prolog predicates were:

% microarray data
peroxide/2.

% protein-protein interaction data
ppi/2.

% transcription factor binding data
'ACID'/2.   'H2O2HI'/2.   'ALPHA'/2.   'H2O2LO'/2.
'BUT14'/2.  'HEAT'/2.     'BUT90'/2.   'PI-'/2.
'GAL'/2.    'RAFF'/2.     'RAPA'/2.    'THI-'/2.
'SM'/2.     'YPD'/2.

To format the ground rule set for the domain expert, each ground clause was expressed in a form of English using Prolog "portray" definitions. The syntax "X -- Pred -> Y" is read: "X" has relation "Pred" to "Y". The Prolog file given to the domain expert, from which he constructed the network diagram of Figure 4.3, follows. Ground instantiations are in 'C'-style comments below each clause.
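A minimal sketch of such a formatter (the predicate name show_edge/3 is hypothetical) is:

% Print one ground literal in the "X -- Pred -> Y" notation.
show_edge(Pred, X, Y) :-
    format("~w -- ~w -> ~w~n", [X, Pred, Y]).

% e.g. show_edge('ACID', 'MSN4', 'SSA1') prints: MSN4 -- ACID -> SSA1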

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Clause 1
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
induced(A) :-
    ppi(A,A),
    'ACID'(B,A).

/*******
SSA1 is induced since
SSA1 -- ppi -> SSA1
MSN4 -- ACID -> SSA1
or
SSA1 -- ppi -> SSA1
MSN2 -- ACID -> SSA1

HSP104 is induced since
HSP104 -- ppi -> HSP104
MSN2 -- ACID -> HSP104
*******/

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Clause 2
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
induced(A) :-
    'H2O2LO'(B,A),
    ppi(C,A),
    'ACID'(D,C).

/*******
SSA1 is induced since
HSF1 -- H2O2LO -> SSA1
SSA1 -- ppi -> SSA1
MSN4 -- ACID -> SSA1
(further groundings via MSN2, HSP104 and SIS1 omitted)

SSA3 is induced since
HSF1 -- H2O2LO -> SSA3
SSA1 -- ppi -> SSA3
MSN4 -- ACID -> SSA1
(further groundings via MSN2 and SSA2 omitted)

HSP42 is induced since
HSF1 -- H2O2LO -> HSP42
SIS1 -- ppi -> HSP42
MSN2 -- ACID -> SIS1
(further groundings via SKN7 and MSN2 omitted)
*******/

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Clause 3
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
induced(A) :-
    peroxide(A,up).

/*******
Each of the following proteins is induced since it has
"-- peroxide -> up":
HSP26, TPS1, GLK1, GPD1, LYS20, HSP42, HSP78, HSP12, SCL1, TRX2,
PUP2, ENO1, SOD2, PRE3, CCP1, HSP104, TSA1, PRE8, PGM2, DDR48,
YNL134C, ZWF1, YNL274C, GLR1, HSP82
*******/

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Clause 4
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
induced(A) :-
    'YPD'(B,A),
    ppi(C,A),
    'RAPA'(B,C).

/*******
TKL2 is induced since
FHL1 -- YPD -> TKL2
VMA6 -- ppi -> TKL2
FHL1 -- RAPA -> VMA6

ARO4 is induced since
GCN4 -- YPD -> ARO4
RPC40 -- ppi -> ARO4
GCN4 -- RAPA -> RPC40
*******/

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Clause 5
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
induced(A) :-
    'RAPA'(B,A),
    peroxide(B,down),
    ppi(A,C).

/*******
HIS4 is induced since
GCN4 -- RAPA -> HIS4
GCN4 -- peroxide -> down
HIS4 -- ppi -> HHF1
(alternative groundings via DHH1, SPT3, SPT2, PDC1, SAM1 and TUB3)

HSP78 is induced since
GCN4 -- RAPA -> HSP78
GCN4 -- peroxide -> down
HSP78 -- ppi -> SSC1

ALD5 is induced since
GCN4 -- RAPA -> ALD5
GCN4 -- peroxide -> down
ALD5 -- ppi -> MDH3
(alternative grounding via YPL257W)
*******/

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Clause 6
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
induced(A) :-
    'H2O2LO'(B,A),
    ppi(A,C),
    'ACID'(D,C).

/*******
SSA1 is induced since
HSF1 -- H2O2LO -> SSA1
SSA1 -- ppi -> SSA1
MSN4 -- ACID -> SSA1
(further groundings via MSN2, SSA2 and SIS1 omitted)

PRE1 is induced since
RPN4 -- H2O2LO -> PRE1
PRE1 -- ppi -> SSA1
MSN4 -- ACID -> SSA1
(further groundings via MSN2 and SSA2 omitted)

PRE9 is induced since
RPN4 -- H2O2LO -> PRE9
PRE9 -- ppi -> SIS1
MSN2 -- ACID -> SIS1

TRX2 is induced since
RPN4 -- H2O2LO -> TRX2
TRX2 -- ppi -> AHP1
MSN2 -- ACID -> AHP1
(further groundings via SKN7, MSN4, YAP1, MSN2, YAP7 and CIN5 omitted)

PRE8 is induced since
RPN4 -- H2O2LO -> PRE8
PRE8 -- ppi -> SIS1
MSN2 -- ACID -> SIS1

HSP82 is induced since
HSF1 -- H2O2LO -> HSP82
HSP82 -- ppi -> SSA1
MSN4 -- ACID -> SSA1
(further groundings via TDH3, MGA1, SSA2 and DDR2 omitted)
*******/

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Clause 7
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
induced(A) :-
    'H2O2LO'(B,A),
    peroxide(B,down).

/*******
HSP26 is induced since
RIM101 -- H2O2LO -> HSP26
RIM101 -- peroxide -> down

LYS20 is induced since
SKN7 -- H2O2LO -> LYS20
SKN7 -- peroxide -> down

(similarly HSP42, CTT1, TRX2, UGP1, TSA1 and ARG1 via SKN7,
and DDR48 via RIM101 or SKN7)
*******/

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Clause 8
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
induced(A) :-
    ppi(A,B),
    'BUT14'(C,B).

/*******
GLK1 is induced since
GLK1 -- ppi -> HXK1
SOK2 -- BUT14 -> HXK1

CDC48 is induced since
CDC48 -- ppi -> CDC48
TEC1 -- BUT14 -> CDC48
(further groundings via CDC34 and FAR1 omitted)

PRE1 is induced since
PRE1 -- ppi -> UBC4
TEC1 -- BUT14 -> UBC4
(further groundings via SLI15 and CDC48 omitted)

BGL2 is induced since
BGL2 -- ppi -> CHS3
STE12 -- BUT14 -> CHS3
(further grounding via SCW10 omitted)

UBA1 is induced since
UBA1 -- ppi -> UBC4
TEC1 -- BUT14 -> UBC4

ZWF1 is induced since
ZWF1 -- ppi -> ZMS1
SOK2 -- BUT14 -> ZMS1

PEP4 is induced since
PEP4 -- ppi -> PEP1
STE12 -- BUT14 -> PEP1

HSP82 is induced since
HSP82 -- ppi -> ACT1
TEC1 -- BUT14 -> ACT1
(many further groundings via MGA1, RPL42B, MRS1, HOC1, SOR1, UTH1,
CBF5, SST2, GAS1, TCB2, CLA4, NCE4, CAM1, NCA2, ADE1, BEM1, YCL056C,
FIG2, SOR2, HSP31, MIG3, SHO1 and BEM2 omitted)
*******/

C.3 Significant annotations from FunSpec

The tables below were constructed using the same approach as in Section 4.3.4.1. However, here annotation categories were analysed separately for each clause, by collecting the set of all protein or gene names appearing in its ground instantiations, as shown in Section C.2 above. For each clause the number (n) of elements in this set is given.

The p-values were obtained with a cutoff of 0.01 and with the Bonferroni correction for multiple testing applied. For each category type the table shows: the name and identifier (if any) of the category, the p-value, the set of proteins in the category from the data set, the size (k) of this set, and the size (f) of the set of all proteins or genes annotated to this category from the complete genome.
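For orientation, over-representation p-values of this kind are conventionally computed from the hypergeometric distribution. The following is a minimal Prolog sketch (not the FunSpec implementation; the predicate names are hypothetical) of the one-sided tail probability of observing at least k annotated genes in a set of n genes drawn from a genome of N genes, of which f are annotated to the category:

% hyper_p(+N, +F, +Sn, +K, -P):
%   P = sum over i = K..min(Sn,F) of C(F,i)*C(N-F,Sn-i)/C(N,Sn),
% computed in log space to avoid overflow.
log_fact(0, 0.0) :- !.
log_fact(N, L) :- N > 0, N1 is N-1, log_fact(N1, L1), L is L1 + log(N).

log_choose(N, K, L) :-
    K >= 0, K =< N, M is N-K,
    log_fact(N, LN), log_fact(K, LK), log_fact(M, LM),
    L is LN - LK - LM.

hyper_p(N, F, Sn, K, P) :-
    Max is min(Sn, F),
    log_choose(N, Sn, L3),
    findall(T, ( between(K, Max, I),
                 log_choose(F, I, L1),
                 NF is N-F, NI is Sn-I,
                 log_choose(NF, NI, L2),
                 T is exp(L1 + L2 - L3) ),
            Ts),
    sum_list(Ts, P).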

Clause 1 annotation (n = 4):

GO Biological Process:
Category | p-value | In Category | k | f
response to stress [GO:0006950] | 2.70102e-07 | SSA1 MSN4 HSP104 MSN2 | 4 | 152
response to hydrostatic pressure [GO:0051599] | 2.75274e-07 | MSN4 MSN2 | 2 | 2
response to freezing [GO:0050826] | 2.75274e-07 | MSN4 MSN2 | 2 | 2
heat acclimation [GO:0010286] | 2.75274e-07 | MSN4 MSN2 | 2 | 2
age-dependent response to oxidative stress involved in chronological cell aging [GO:0001324] | 2.75107e-06 | MSN4 MSN2 | 2 | 5

Clause 2 annotation (n = 10):

GO Molecular Function:
Category | p-value | In Category | k | f
unfolded protein binding [GO:0051082] | 8.24484e-10 | SSA1 SSA3 HSP42 SSA2 HSP104 SIS1 | 6 | 86

GO Biological Process:
Category | p-value | In Category | k | f
response to stress [GO:0006950] | 1.25455e-14 | SSA1 SSA3 HSP42 HSF1 SKN7 MSN4 SSA2 HSP104 MSN2 | 9 | 152
SRP-dependent cotranslational protein targeting to membrane, translocation [GO:0006616] | 2.98585e-07 | SSA1 SSA3 SSA2 | 3 | 10
response to freezing [GO:0050826] | 2.06455e-06 | MSN4 MSN2 | 2 | 2
heat acclimation [GO:0010286] | 2.06455e-06 | MSN4 MSN2 | 2 | 2
response to hydrostatic pressure [GO:0051599] | 2.06455e-06 | MSN4 MSN2 | 2 | 2

GO Cellular Component:
Category | p-value | In Category | k | f
chaperonin-containing T-complex [GO:0005832] | 4.10227e-07 | SSA1 HSP42 SSA2 | 3 | 11

MIPS Functional Classification:
Category | p-value | In Category | k | f
protein folding and stabilization [14.01] | 1.18587e-07 | SSA1 SSA3 HSP42 SSA2 HSP104 | 5 | 93

MIPS Protein Complexes:
Category | p-value | In Category | k | f
Complex Number 89, probably protein synthesis turnover [550.1.89] | 6.18865e-06 | HSP42 SIS1 | 2 | 3

MIPS Protein Classes:
Category | p-value | In Category | k | f
DnaK subfamily [101.21.11] | 4.10227e-07 | SSA1 SSA3 SSA2 | 3 | 11
other ATPases [41.61] | 7.18011e-06 | SSA1 SSA2 HSP104 | 3 | 27
Heat shock factors [201.31.31] | 2.05955e-05 | HSF1 SKN7 | 2 | 5

SMART Domains:
Category | p-value | In Category | k | f
HSF | 2.05955e-05 | HSF1 SKN7 | 2 | 5

PFam-A Domains:
Category | p-value | In Category | k | f
HSP70 | 9.02827e-07 | SSA1 SSA3 SSA2 | 3 | 14

Cellzome Complexes:
Category | p-value | In Category | k | f
YNL007C (SIS1) | 6.18865e-06 | HSP42 SIS1 | 2 | 3

Clause 3 annotation (n = 25):

GO Molecular Function:
Category | p-value | In Category | k | f
threonine-type endopeptidase activity [GO:0004298] | 1.55989e-07 | SCL1 PUP2 PRE3 PRE8 | 4 | 14
endopeptidase activity [GO:0004175] | 7.43549e-07 | SCL1 PUP2 PRE3 PRE8 | 4 | 20
oxidoreductase activity [GO:0016491] | 4.39183e-06 | GPD1 SOD2 CCP1 TSA1 YNL134C ZWF1 GOR1 GLR1 | 8 | 272

GO Biological Process:
Category | p-value | In Category | k | f
response to stress [GO:0006950] | 2.1371e-09 | HSP26 TPS1 GPD1 HSP42 HSP78 HSP12 HSP104 DDR48 HSP82 | 9 | 152
proteasomal ubiquitin-independent protein catabolic process [GO:0010499] | 1.55989e-07 | SCL1 PUP2 PRE3 PRE8 | 4 | 14
proteolysis involved in cellular protein catabolic process [GO:0051603] | 4.72011e-07 | SCL1 PUP2 PRE3 PRE8 | 4 | 18
cellular response to oxidative stress [GO:0034599] | 4.19879e-06 | HSP12 TRX2 CCP1 TSA1 GLR1 | 5 | 67
oxidation-reduction process [GO:0055114] | 4.39183e-06 | GPD1 SOD2 CCP1 TSA1 YNL134C ZWF1 GOR1 GLR1 | 8 | 272

GO Cellular Component:
Category | p-value | In Category | k | f
cytoplasm [GO:0005737] | 5.00641e-08 | HSP26 TPS1 GPD1 LYS20 HSP42 HSP12 SCL1 TRX2 PUP2 ENO1 PRE3 HSP104 TSA1 PRE8 PGM2 DDR48 YNL134C ZWF1 GOR1 GLR1 HSP82 | 21 | 2026
proteasome core complex [GO:0005839] | 2.12171e-07 | SCL1 PUP2 PRE3 PRE8 | 4 | 15
proteasome core complex, alpha-subunit complex [GO:0019773] | 1.66178e-06 | SCL1 PUP2 PRE8 | 3 | 7
proteasome storage granule [GO:0034515] | 2.2595e-06 | SCL1 PUP2 PRE3 PRE8 | 4 | 26

MIPS Functional Classification:
Category | p-value | In Category | k | f
oxidative stress response [32.01.01] | 3.95943e-08 | HSP12 TRX2 SOD2 CCP1 TSA1 GLR1 | 6 | 55
glycolysis and gluconeogenesis [02.01] | 3.48018e-07 | GLK1 ENO1 PGM2 YNL134C ZWF1 | 5 | 41
stress response [32.01] | 1.55769e-06 | TPS1 PUP2 PRE3 HSP104 DDR48 ZWF1 HSP82 | 7 | 162
C-compound and carbohydrate metabolism [01.05] | 1.30412e-05 | GLK1 GPD1 LYS20 HSP12 ENO1 YNL134C GOR1 | 7 | 223
protein folding and stabilization [14.01] | 2.11514e-05 | HSP26 HSP42 HSP78 HSP104 HSP82 | 5 | 93

MIPS Protein Complexes:
Category | p-value | In Category | k | f
Complex Number 60, 20S Proteosome (13) [550.3.60] | 1.11705e-07 | SCL1 PUP2 PRE3 PRE8 | 4 | 13
Complex Number 238 [550.2.238] | 2.12171e-07 | SCL1 PUP2 PRE3 PRE8 | 4 | 15
20S proteasome [360.10.10] | 2.12171e-07 | SCL1 PUP2 PRE3 PRE8 | 4 | 15

MIPS Protein Classes:
Category | p-value | In Category | k | f
HSP100/Clp family [101.41] | 4.11951e-05 | HSP78 HSP104 | 2 | 3

SMART Domains:
Category | p-value | In Category | k | f
Proteasome A N | 1.66178e-06 | SCL1 PUP2 PRE8 | 3 | 7

PFam-A Domains:
Category | p-value | In Category | k | f
Proteasome | 1.55989e-07 | SCL1 PUP2 PRE3 PRE8 | 4 | 14
Proteasome A N | 1.66178e-06 | SCL1 PUP2 PRE8 | 3 | 7

MDS Proteomics Complexes:
Category | p-value | In Category | k | f
YER012W (PRE1) | 1.55989e-07 | SCL1 PUP2 PRE3 PRE8 | 4 | 14

Cellzome Complexes:
Category | p-value | In Category | k | f
YGL011C (SCL1) | 1.03156e-05 | SCL1 PRE3 PRE8 | 3 | 12

Proteome Localization (Observed):
Category | p-value | In Category | k | f
cyto | 1.96353e-06 | HSP26 TPS1 GLK1 GPD1 HSP42 HSP12 PUP2 ENO1 HSP104 TSA1 PGM2 DDR48 YNL134C ZWF1 GOR1 HSP82 | 16 | 1321

Clause 4 annotation (n = 6): no categories were significant.

Clause 5 annotation (n = 14):

MIPS Protein Complexes:
Category | p-value | In Category | k | f
Complex Number 54, probably intermediate and energy metabolism [550.1.54] | 1.9762e-14 | HIS4 DHH1 PDC1 SAM1 TUB3 | 5 | 5

Cellzome Complexes:
Category | p-value | In Category | k | f
YCL030C (HIS4) | 1.9762e-14 | HIS4 DHH1 PDC1 SAM1 TUB3 | 5 | 5

Clause 6 annotation (n = 20):

GO Molecular Function:
Category | p-value | In Category | k | f
sequence-specific DNA binding transcription factor activity [GO:0003700] | 8.09599e-11 | RPN4 HSF1 MGA1 SKN7 MSN4 YAP1 MSN2 YAP7 CIN5 | 9 | 138
sequence-specific DNA binding [GO:0043565] | 4.05562e-10 | RPN4 HSF1 MGA1 SKN7 MSN4 YAP1 MSN2 YAP7 CIN5 | 9 | 165
DNA binding [GO:0003677] | 1.90595e-07 | RPN4 HSF1 MGA1 SKN7 MSN4 YAP1 MSN2 SIS1 YAP7 CIN5 | 10 | 449

GO Biological Process:
Category | p-value | In Category | k | f
response to stress [GO:0006950] | 6.53881e-09 | SSA1 HSF1 SKN7 MSN4 SSA2 MSN2 DDR2 HSP82 | 8 | 152
response to osmotic stress [GO:0006970] | 1.38524e-06 | SKN7 MSN4 MSN2 HSP82 | 4 | 29

GO Cellular Component:
Category | p-value | In Category | k | f
proteasome core complex [GO:0005839] | 1.05673e-05 | PRE1 PRE9 PRE8 | 3 | 15

MIPS Functional Classification:
Category | p-value | In Category | k | f
stress response [32.01] | 5.79541e-06 | PRE1 MGA1 MSN4 MSN2 DDR2 HSP82 | 6 | 162
transcriptional control [11.02.03.04] | 1.77674e-05 | HSF1 MGA1 SKN7 MSN4 YAP1 MSN2 YAP7 CIN5 | 8 | 426
oxidative stress response [32.01.01] | 1.89107e-05 | TRX2 SKN7 AHP1 YAP1 | 4 | 55

MIPS Subcellular Localization:
Category | p-value | In Category | k | f
nucleus [750] | 3.44114e-08 | SSA1 RPN4 PRE1 HSF1 PRE9 TDH3 TRX2 MGA1 SKN7 MSN4 SSA2 AHP1 YAP1 PRE8 MSN2 SIS1 YAP7 CIN5 | 18 | 1976

MIPS Protein Complexes:
Category | p-value | In Category | k | f
Complex Number 60, 20S Proteosome (13) [550.3.60] | 6.66804e-06 | PRE1 PRE9 PRE8 | 3 | 13

MIPS Protein Classes:
Category | p-value | In Category | k | f
Heat shock factors [201.31.31] | 2.36783e-07 | HSF1 MGA1 SKN7 | 3 | 5
bZIP (Yap-like proteins) [201.11.11.11] | 2.81405e-06 | YAP1 YAP7 CIN5 | 3 | 10
other ATPases [41.61] | 6.63747e-05 | SSA1 SSA2 HSP82 | 3 | 27

SMART Domains:
Category | p-value | In Category | k | f
HSF | 2.36783e-07 | HSF1 MGA1 SKN7 | 3 | 5
BRLZ | 1.05673e-05 | YAP1 YAP7 CIN5 | 3 | 15

PFam-A Domains:
Category | p-value | In Category | k | f
HSF DNA-bind | 2.36783e-07 | HSF1 MGA1 SKN7 | 3 | 5

Cellzome Complexes:
Category | p-value | In Category | k | f
YDL188C (PPH22) | 1.29808e-05 | PRE1 PRE9 PRE8 | 3 | 16

Clause 7 annotation (n = 11):

GO Biological Process:
Category | p-value | In Category | k | f
response to stress [GO:0006950] | 2.50133e-06 | HSP26 HSP42 CTT1 SKN7 DDR48 | 5 | 152

MIPS Functional Classification:
Category | p-value | In Category | k | f
oxidative stress response [32.01.01] | 1.36186e-06 | CTT1 TRX2 SKN7 TSA1 | 4 | 55

PFam-A Domains:
Category | p-value | In Category | k | f
HSP20 | 2.52334e-06 | HSP26 HSP42 | 2 | 2

Clause 8 annotation (n = 47):

MIPS Functional Classification:
Category | p-value | In Category | k | f
budding, cell polarity and filament formation [43.01.03.05] | 1.78185e-09 | CHS3 TEC1 BEM1 SHO1 BEM2 ACT1 MGA1 STE12 FAR1 HOC1 ASH1 SOK2 MSS11 CLA4 DIG1 | 15 | 312
pheromone response, mating-type determination, sex-specific proteins [34.11.03.07] | 6.00029e-07 | CHS3 UBC4 BEM1 FIG2 STE12 FAR1 ASH1 SST2 DIG1 HSP82 | 10 | 189
stress response [32.01] | 1.53834e-05 | UBC4 PRE1 SHO1 MGA1 UBA1 UTH1 ZWF1 HSP82 | 8 | 162

MIPS Phenotypes:
Category | p-value | In Category | k | f
Pseudohyphae formation [52.10.20] | 2.17332e-06 | TEC1 STE12 ASH1 SOK2 DIG1 | 5 | 31

Synthetic Genetic Array Analysis:
Category | p-value | In Category | k | f
YDL029W (ARP2) | 5.38696e-07 | CHS3 BEM1 BEM2 HOC1 UTH1 CLA4 | 6 | 44
YBR234C (ARC40) | 8.02343e-06 | CHS3 BEM1 BEM2 HOC1 CLA4 | 5 | 40
YNL271C (BNI1) | 0.000440832 | CHS3 BEM1 BEM2 CLA4 | 4 | 51

D Workflow of a rule's visual closure

This appendix outlines the workflow that has been implemented for generating a rule's visual closure, as described in Chapter 6. The overall process is summarised in Figure D.1.

The steps are as follows:

1. apply formal concept analysis to build a concept lattice from multiple stress response data, as in Figure D.2;

2. extract a concept from the lattice to learn first-order multiple-stress Prolog rules using Aleph. Genes/proteins in a concept extent are included as positive examples, while the other relational data is used as background knowledge. The following is example data for a concept with intent "sensitivity to hydrogen peroxide stress" and extent the set of genes/proteins sensitive to that stress. This data (the extent) is automatically saved into an Aleph-format input file from the BioLattice browser when the concept is selected by the user.

% intents:H2O2_Fields extentSize:442 minSup_100
concept('YLR027C').
concept('YMR072W').
concept('YDR448W').
concept('YDR226W').
concept('YMR064W').
concept('YBL082C').
...

[Figure D.1 shows the data sources (phenotype data from the Saccharomyces Genome Deletion Project, genomics data (Causton et al.), transcription factor data (Harbison et al.), protein-protein interaction data (Gavin et al. and the Vidal lab), biochemical pathway data, SGD (yeastgenome.org) and the GeneOntology DB (Ashburner et al.)) being pre-processed by FCA into Prolog facts, RulesDB and the SGD knowledge base, which the visual analytics tool interface accesses via SQL queries, extracted rules and extracted concepts.]

Figure D.1: Data flow diagram for visual closure of a rule.

[Figure D.2 shows a table of concepts with columns CONCEPT_INDEX, INTENT, EXTENT, PARENT and CHILD; for example, concept C_592 has intent {CHP_Thorpe, BLM, LoaOOH_Thorpe} and extent {AKR1, ARV1, GAL11, PHO85, REG1, SLX8, SNF6, SRB5}.]

Figure D.2: Tabular list of concept details in a BioLattice browser window. Each concept index in the left-hand column is clickable by the user. Clicking on the concept index pops up a dialog box to allow download of the concept's extent in Prolog format. Each object is a ground instance of the target predicate "concept/1".

The set of rules generated by Aleph is shown in Figure D.3.

Figure D.3: Set of first-order rules generated by Aleph. Each rule relates some of the genes in the selected concept to predicates in the background knowledge. A rule can be selected by the user to generate its "visual closure".

3. run Datalog queries to generate the full extension of a selected rule. An example is shown in Figure D.4 for the following rule (the 6th rule in Figure D.3):

concept(A) :-
    gc_ppi(B,A,C),
    heat(C,down),
    gc_ppi(B,C,D),
    bp(D,'GO:0030435'),
    bp(A,'GO:0006810').

Figure D.4: A learned rule at the top of the window with, below it, the rows of its extensional representation.
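A minimal sketch of how such a rule is executed as a Datalog-style query over the RulesDB relations to enumerate its full extension (the predicate name rule_extension/1 is hypothetical):

% Enumerate every binding of the selected rule's body.
rule_extension(Rows) :-
    findall(row(A,B,C,D),
            ( gc_ppi(B,A,C),
              heat(C,down),
              gc_ppi(B,C,D),
              bp(D,'GO:0030435'),
              bp(A,'GO:0006810') ),
            Rows).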

4. select a complex to facilitate interactive visualization centered on that complex, as shown in Figure D.5. This enables the user to investigate additional details such as functional similarities among protein complex subgroups, transcription factor binding conditions, etc.

Figure D.5: Visualization of all relations comprising the "visual closure" of the selected rule with respect to the V0 vacuolar ATPase complex.

Bibliography

[A-C02] A-C. Gavin, M. Bösche, R. Krause, et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415:141–147, 2002.

[A-C06] A-C. Gavin, P. Aloy, P. Grandi, R. Krause, M. Bösche, et al. Proteome survey reveals modularity of the yeast cell machinery. Nature, 440:631–636, 2006.

[AA05] N.L. Anderson and N.G. Anderson. Proteome and proteomics: new technologies, new concepts, and new words. Electrophoresis, 19(11):1853–1861, 2005.

[AB04] F. Azuaje and O. Bodenreider. Incorporating Ontology-Driven Similarity Knowledge into Functional Genomics: An Exploratory Study. In BIBE 2004: Proc. Fourth IEEE Symposium on Bioinformatics and Bioengineering, pages 317–324, 2004.

[ABT07] E. Akand, M. Bain, and M. Temple. Learning from Ontological Annotation: an Application of Formal Concept Analysis to Feature Construction in the Gene Ontology. In Proc. Third Australasian Ontology Workshop (AOW-2007), pages 15–23, 2007.

[ABT09] E. Akand, M. Bain, and M. Temple. Learning Responsive Cellular Networks by Integrating Genomics and Proteomics Data. ILP 2009: 19th Intl. Conference on Inductive Logic Programming. Poster and online paper. http://dtai.cs.kuleuven.be/ilp-mlg-srl/papers/ILP09-44, 2009.

[ABT10] E. Akand, M. Bain, and M. Temple. Learning with Gene Ontology Annotation using Feature Selection and Construction. Applied Artificial Intelligence, 24(1):5–38, 2010.

[ADB+03] J. Allen, H.M. Davey, D. Broadhurst, J.K. Heald, J.J. Rowland, S.G. Oliver, and D.B. Kell. High-throughput classification of yeast mutants for functional genomics using metabolic footprinting. Nature Biotechnology, 21(6):692–696, 2003.

[ADF+06] A. Ahmed, T. Dwyer, M. Forster, X. Fu, J. Ho, S.H. Hong, D. Koschützki, C. Murray, N. Nikolov, R. Taib, et al. GEOMI: Geometry for maximum insight. In Graph Drawing, pages 468–479. Springer, 2006.

[AIS93] R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. ACM SIGMOD Record, 22(2):207–216, 1993.

[AKS77] J.C. Alwine, D.J. Kemp, and G.R. Stark. Method for detection of specific RNAs in agarose gels by transfer to diazobenzyloxymethyl-paper and hybridization with DNA probes. Proceedings of the National Academy of Sciences of the United States of America, 74(12):5350, 1977.

[ALI+00] S.M. Arfin, A.D. Long, E.T. Ito, L. Tolleri, M.M. Riehle, E.S. Paegle, and G. Hatfield. Global Gene Expression Profiling in Escherichia coli K12. Journal of Biological Chemistry, 275(38):29672, 2000.

[Alp10] E. Alpaydin. Introduction to Machine Learning (2nd Edn.). MIT Press, Cambridge, MA, 2010.

[AM06] A.V. Antonov and H.W. Mewes. Complex functionality of gene groups identified from high-throughput data. Journal of molecular biology, 363(1):289–296, 2006.

[AMS+96] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, A.I. Verkamo, et al. Fast discovery of association rules. Advances in knowledge discovery and data mining, 12:307–328, 1996.

[AOS+99] M.A. Andrade, C. Ouzounis, C. Sander, J. Tamames, and A. Valencia. Functional classes in the three domains of life. Journal of Molecular Evolution, 49(5):551–557, 1999.

[ARL06] A. Alexa, J. Rahnenfuhrer, and T. Lengauer. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics, 22(13):1600, 2006.

[ASDUD04] F. Al-Shahrour, R. Díaz-Uriarte, and J. Dopazo. FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics, 20(4):578, 2004.

[Ash00] Ashburner, M. and the Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nature Genetics, 25(1):25–29, 2000.

[Bad03] L. Badea. Functional discrimination of gene expression patterns in terms of the gene ontology. In Pacific Symposium on Biocomputing (PSB 2003), pages 565–576, 2003.

[Bai02] M. Bain. Structured features from concept lattices for unsupervised learning and classification. AI 2002: Advances in Artificial Intelligence, pages 557–568, 2002.

[Bai03a] M. Bain. Inductive Construction of Ontologies from Formal Concept Analysis. In T. Gedeon and L. Fung, editors, AI 2003: Proc. of the 16th Australian Joint Conference on Artificial Intelligence, pages 88–99, Berlin, 2003. Springer.

[Bai03b] M. Bain. Inductive Construction of Ontologies from Formal Concept Analysis. In AI 2003: Advances in Artificial Intelligence: 16th Australian Conference on AI, Perth, Australia, December 3-5, 2003: Proceedings. Springer, 2003.

[Bai03c] M. Bain. Learning Ontologies from Concept Lattices. Using Conceptual Structures: Contributions to ICCS, pages 199–212, 2003.

[Bai04] M. Bain. Predicate Invention and the Revision of First Order Concept Lattices. In P. Eklund, editor, ICFCA 2004: Proc. of the 2nd Intl. Conference on Formal Concept Analysis, pages 329–326, Berlin, 2004. Springer (LNAI 2961).

[Bak13] M. Baker. Big biology: The ’omes puzzle. Nature, 494(7438):416–419, 2013.

[BB03] A-L. Barabási and E. Bonabeau. Scale-free networks. Scientific American, 288:60–69, 2003.

[BCM+03] F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, Cambridge, UK, 2003.

[BDDL+98] P. Bork, T. Dandekar, Y. Diaz-Lazcoz, F. Eisenhaber, M. Huynen, and Y. Yuan. Predicting function: from genes to genomes and back. Journal of Molecular Biology, 283(4):707–725, 1998.

[BG96] F. Bergadano and D. Gunetti. Inductive logic programming: from machine learning to software engineering. The MIT Press, 1996.

[BH02] P. Baldi and G.W. Hatfield. DNA microarrays and gene expression: from experiments to data analysis and modeling. Cambridge University Press, 2002.

[BH03] G.D. Bader and C.W.V. Hogue. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4(1):2, 2003.

[BK03] B. Bushnell and A. Kornberg. Complete, 12-subunit RNA polymerase II at 4.1-Å resolution: Implications for the initiation of transcription. Proceedings of the National Academy of Sciences, 100(12):6969–6973, 2003.

[BL01] P. Baldi and A.D. Long. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics, 17(6):509, 2001.

[BLHL01] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, May 2001.

[Blo99] H. Blockeel. Top-down induction of first order logical decision trees. AI Communications, 12(1):119–120, 1999.

[BM70] M. Barbut and B. Monjardet. Ordre et Classification: algèbre et combinatoire. Hachette, 1970.

[BMUT97] S. Brin, R. Motwani, J.D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD 1997: Proc. of the ACM Intl. Conference on Management of Data, pages 255–264, 1997.

[Bor13] D. Borchmann. Towards an Error-Tolerant Construction of EL⊥-Ontologies from Data Using Formal Concept Analysis. In P. Cellier, F. Distel, and B. Ganter, editors, ICFCA 2013: 11th International Conference on Formal Concept Analysis, volume 7880 of LNCS, pages 60–75. Springer, 2013.

[BR04] J.B.L. Bard and S.Y. Rhee. Ontologies in biology: design, applications and future challenges. Nature Reviews Genetics, 5(3):213–222, 2004.

[Bre84] L. Breiman. Classification and regression trees. Chapman & Hall/CRC, 1984.

[BS04] T. Beissbarth and T. Speed. GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, page 881, 2004.

[BS06] O. Bodenreider and R. Stevens. Bio-ontologies: current trends and future directions. Briefings in Bioinformatics, 7(3):256, 2006.

[BSG+04] M. Bada, R. Stevens, C. Goble, Y. Gil, M. Ashburner, J.A. Blake, J.M. Cherry, M. Harris, and S. Lewis. A short study on the success of the Gene Ontology. Web Semantics: Science, Services and Agents on the World Wide Web, 1(2):235–240, 2004.

[BST06] Z. Barutcuoglu, R.E. Schapire, and O.G. Troyanskaya. Hierarchical multi-label prediction of gene function. Bioinformatics, 22(7):830, 2006.

[Bun88] W. Buntine. Generalised subsumption and its applications to induction and redundancy. Artificial Intelligence, 36(2):149–176, 1988.

[BWG+04] E.I. Boyle, S. Weng, J. Gollub, H. Jin, D. Botstein, J.M. Cherry, and G. Sherlock. GO::TermFinder: open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics, 20(18):3710, 2004.

[C+12] J. Cherry et al. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Research, 40:D700–5, 2012.

[Cag09] G. Cagney. Interaction networks: Lessons from large-scale studies in yeast. Proteomics, 9:4799–4811, 2009.

[Car04] V. J. Carey. Ontology concepts and tools for statistical genomics. Journal of Multivariate Analysis, 90:213–228, 2004.

[CC05] A. Chaudhuri and J. Chant. Protein-interaction mapping in search of effective drug targets. BioEssays, 27:958–969, 2005.

[CGT89] S. Ceri, G. Gottlob, and L. Tanca. What You Always Wanted to Know About Datalog (And Never Dared to Ask). IEEE Transactions On Knowledge And Data Engineering, 1(1):337–350, 1989.

[Cha87] G. Chaitin. Information, Randomness and Incompleteness - Papers on Algorithmic Information Theory. World Scientific Press, Singapore, 1987.

[CHST04] P. Cimiano, A. Hotho, G. Stumme, and J. Tane. Conceptual knowledge processing with formal concept analysis and ontologies. Concept Lattices, pages 199–200, 2004.

[CMSV09] P. Cimiano, A. Maedche, S. Staab, and J. Voelker. Ontology Learning. In S. Staab and R. Studer, editors, Handbook on Ontologies (Second Edition), pages 245–267. Springer, Berlin, 2009.

[Cos10] Costanzo, M. et al. The Genetic Landscape of a Cell. Science, 327(5964):425–431, 2010.

[CR93] C. Carpineto and G. Romano. Galois: An order-theoretic approach to conceptual clustering. In Proc. of the 10th Conference on Machine Learning, Amherst, MA, pages 33–40. Kaufmann, 1993.

[CR96] C. Carpineto and G. Romano. A lattice conceptual clustering system and its application to browsing retrieval. Machine Learning, 24(2):95–122, 1996.

[CRK+01] H.C. Causton, B. Ren, S.S. Koh, C.T. Harbison, E. Kanin, E.G. Jennings, T.I. Lee, H.L. True, E.S. Lander, and R.A. Young. Remodeling of yeast genome expression in response to environmental changes. Molecular Biology of the Cell, 12(2):323, 2001.

[CS00] R. Cole and G. Stumme. CEM - a conceptual email manager. Conceptual Structures: Logical, Linguistic, and Computational Issues, pages 438–452, 2000.

[CWB+04] K.R. Christie, S. Weng, R. Balakrishnan, M.C. Costanzo, K. Dolinski, S.S. Dwight, S.R. Engel, B. Feierbach, D.G. Fisk, J.E. Hirschman, et al. Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Research, 32(Database Issue):D311, 2004.

[CYC13] H. Chen, T. Yu, and J. Chen. Semantic Web meets Integrative Biology: a survey. Briefings in Bioinformatics, 14(1):109–125, 2013.

[Dav99] E.H. Davidson. A view from the genome: spatial control of transcription in sea urchin development. Current opinion in genetics & development, 9(5):530–541, 1999.

[DB05] K. Dolinski and D. Botstein. Changing perspectives in yeast research nearly a decade after the genome sequence. Genome Research, 15(12):1611, 2005.

[Deh97] L. Dehaspe. Maximum entropy modeling with clausal constraints. In N. Lavrac and S. Dzeroski, editors, Proc. of Inductive Logic Programming, volume 1297 of Lecture Notes in Computer Science, pages 109–124. Springer, Berlin, 1997.

[Dis11] F. Distel. Learning Description Logic Knowledge Bases from Data Using Methods from Formal Concept Analysis. PhD thesis, Technical University of Dresden, 2011.

[DJSH+03] G. Dennis Jr, B.T. Sherman, D.A. Hosack, J. Yang, W. Gao, H.C. Lane, and R.A. Lempicki. DAVID: database for annotation, visualization, and integrated discovery. Genome Biol, 4(5):P3, 2003.

[DL94] S. Dzeroski and N. Lavrac. Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1994.

[DP05] C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology, 3(2):185–206, 2005.

[DV01] D. Devos and A. Valencia. Intrinsic errors in genome annotation. Trends in Genetics, 17(8):429–431, 2001.

[EDD12] P. Eklund, J. Ducrou, and F. Dau. Concept similarity and related categories in information retrieval using formal concept analysis. International Journal of General Systems, 41(8):826–846, 2012.

[EGSW00] P. Eklund, B. Groh, G. Stumme, and R. Wille. A Contextual-Logic Extension of TOSCANA. In Conceptual Structures: Logical, Linguistic, and Computational Issues, Proceedings of the 8th International Conference on Conceptual Structures (ICCS 2000), Darmstadt, pages 453–467. Springer-Verlag, 2000.

[EGW+09] P. Eklund, P. Goodall, T. Wray, B. Bunt, A. Lawson, L. Christidis, V. Daniel, and M. Van Olffen. Designing the Digital Ecosystem of the Virtual Museum of the Pacific. In DEST’09: 3rd IEEE International Conference on Digital Ecosystems and Technologies, 2009, pages 377– 383, 2009.

[ESBB98] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863, 1998.

[ESH+99] K. Entian, T. Schuster, J. Hegemann, D. Becher, H. Feldmann, U. Güldener, R. Götz, et al. Functional analysis of 150 deletion mutants in Saccharomyces cerevisiae by a systematic approach. Molecular and General Genetics MGG, 262(4):683–702, 1999.

[Fel98] C. Fellbaum. WordNet: An electronic lexical database. The MIT press, 1998.

[FH95] T. Fruewirth and P. Hanschke. Terminological reasoning with constraint handling rules. In P. van Hentenryck and V. Saraswat, editors, Principles and Practice of Constraint Programming. MIT Press, Cambridge, MA, 1995.

[Fie02] O. Fiehn. Metabolomics–the link between genotypes and phenotypes. Plant Molecular Biology, 48(1):155–171, 2002.

[FK08] Sebastian Fröhler and Stefan Kramer. Inductive logic programming for gene regulation prediction. Machine Learning, 70(2-3):225–240, 2008.

[FLM+99] B. Futcher, G.I. Latter, P. Monardo, C.S. McLaughlin, and J.I. Garrels. A sampling of the yeast proteome. Molecular and Cellular Biology, 19(11):7357, 1999.

[FLNP00] N. Friedman, M. Linial, I. Nachman, and D. Pe'er. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7(3-4):601–620, 2000.

[FLV08] B. Fortuna, N. Lavrač, and P. Velardi. Advancing topic ontology learning through term extraction. PRICAI 2008: Trends in Artificial Intelligence, pages 626–635, 2008.

[FMM+89] J.B. Fenn, M. Mann, C.K. Meng, S.F. Wong, and C.M. Whitehouse. Electrospray ionization for mass spectrometry of large biomolecules. Science, 246(4926):64, 1989.

[FMPS98] P. Finn, S. Muggleton, D. Page, and A. Srinivasan. Pharmacophore discovery using the Inductive Logic Programming system Progol. Machine Learning, 30:241–271, 1998.

[FNR98] D. Faure, C. Nédellec, and C. Rouveirol. Acquisition of Semantic Knowledge using Machine learning methods: The System "ASIUM". Universite Paris Sud. Citeseer, 1998.

[For06] A. Formica. Ontology-based concept similarity in formal concept analysis. Information Sciences, 176(18):2624–2641, 2006.

[FR00] S. Ferré and O. Ridoux. A Logical Generalization of Formal Concept Analysis. In Guy Mineau and Bernhard Ganter, editors, Proc. Eighth Intl. Conf. on Conceptual Structures, pages 371–384, Berlin, 2000. Springer.

[FS89] S. Fields and O. Song. A novel genetic system to detect protein-protein interactions. Nature, 340(6230):245–246, 1989.

[FSG13] X. Fernández-Suárez and M. Galperin. The 2013 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Research, 48:D1–D7, 2013.

[FWA10] A. Freitas, D. Wieser, and R. Apweiler. On the Importance of Comprehensible Classification Models for Protein Function Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(1):172–182, 2010.

[G+02] G. Giaever et al. Functional profiling of the Saccharomyces cerevisiae genome. Nature, 418(6896):387–391, 2002.

[GAAC+97] A. Goffeau, R. Aert, M.L. Agostini-Carbone, A. Ahmed, M. Aigle, L. Alberghina, K. Albermann, M. Albers, M. Aldea, D. Alexandraki, et al. The yeast genome directory. Nature, 387(6632):5–6, 1997.

[GBRV07] S. Grossmann, S. Bauer, P.N. Robinson, and M. Vingron. Improved detection of overrepresentation of Gene-Ontology annotations with parent-child analysis. Bioinformatics, 23(22):3024, 2007.

[GE03] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182, 2003.

[Gen10] M. Genesereth. Data Integration: The Relational Logic Approach. Mor- gan and Claypool, San Francisco, CA, 2010.

[GKT+04] W. Gunji, T. Kai, Y. Takahashi, Y. Maki, W. Kurihara, T. Utsugi, F. Fujimori, and Y. Murakami. Global analysis of the regulatory network structure of gene expression in Saccharomyces cerevisiae. DNA Research, 11(3):163–177, 2004.

[GLJ+01] D. Greenbaum, N. Luscombe, R. Jansen, Q. Jiang, and M. Gerstein. Interrelating Different Types of Genomic Data, from Proteome to Secretome: 'Oming in on Function. Genome Research, 11:1463–1468, 2001.

[GLL+98] C. Godon, G. Lagniel, J. Lee, J.M. Buhler, S. Kieffer, M. Perrot, H. Boucherie, M.B. Toledano, and J. Labarre. The H2O2 stimulon in Saccharomyces cerevisiae. Journal of Biological Chemistry, 273(35):22480, 1998.

[GM94a] R. Godin and R. Missaoui. An incremental concept formation approach for learning from databases. Theoretical Computer Science, 133(2):387–419, 1994.

[GM94b] R. Godin and R. Missaoui. An incremental concept formation approach for learning from databases. Theoretical Computer Science, 133:387–419, 1994.

[GMA93] R. Godin, R. Missaoui, and A. April. Experimental comparison of navigation in a Galois lattice with conventional information retrieval methods. International Journal of Man-Machine Studies, 38:747–767, 1993.

[GMF+03] J.H. Gennari, M.A. Musen, R.W. Fergerson, W.E. Grosso, M. Crubézy, H. Eriksson, N.F. Noy, and S.W. Tu. The evolution of Protégé: an environment for knowledge-based systems development. International Journal of Human-Computer Studies, 58(1):89–123, 2003.

[GOB+10] N. Gehlenborg, S.I. O'Donoghue, N.S. Baliga, A. Goesmann, M.A. Hibbs, H. Kitano, O. Kohlbacher, H. Neuweger, R. Schneider, D. Tenenbaum, et al. Visualization of 'omics data for systems biology. Nature Methods, 7:S56–S68, 2010.

[GR81] M.M. Garner and A. Revzin. A gel electrophoresis method for quantifying the binding of proteins to specific DNA regions: application to components of the Escherichia coli lactose operon regulatory system. Nucleic Acids Research, 9(13):3047, 1981.

[GR91] B. Ganter and K. Reuter. Finding All Closed Sets: A General Approach. Order, 8:283–290, 1991.

[GRFA99] S.P. Gygi, Y. Rochon, B.R. Franza, and R. Aebersold. Correlation between protein and mRNA abundance in yeast. Molecular and Cellular Biology, 19(3):1720, 1999.

[Gru93] T. R. Gruber. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 6(2):199–201, 1993.

[GS12] G. Gottlob and T. Schwentick. Rewriting Ontological Queries into Small Nonrecursive Datalog Programs. In G. Brewka, T. Eiter, and S. McIlraith, editors, Proc. 13th Intl. Conference on Knowledge Representation and Reasoning (KR 2012), 2012.

[GST+99] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531, 1999.

[Gun10] J. Gunawardena. Models in Systems Biology: The Parameter Problem and the Meanings of Robustness. In H. Lodhi and S. Muggleton, editors, Elements of Computational Systems Biology, pages 21–47. Wiley, Hoboken, NJ, 2010.

[GW97] B. Ganter and R. Wille. Applied lattice theory: formal concept analysis. Preprints, http://wwwbib.mathematik.tu-darmstadt.de/Math-Net/Preprints/Listen/pp97.html, 1997.

[GW99a] B. Ganter and R. Wille. Formal Concept Analysis: Mathematical Foun- dations. Springer, Berlin, 1999.

[GW99b] B. Ganter and R. Wille. Formal Concept Analysis: Mathematical Foundations. Springer, Berlin, 1999.

[GWV03] H. Ge, A. Walhout, and M. Vidal. Integrating ‘omic’ information: a bridge between genomics and systems biology. Trends in Genetics, 19(10):551–560, 2003.

[GZL+05] Z. Guo, T. Zhang, X. Li, Q. Wang, J. Xu, H. Yu, J. Zhu, H. Wang, C. Wang, E.J. Topol, et al. Towards precise classification of cancers based on robust gene functional expression profiles. BMC Bioinformatics, 6(1):58, 2005.

[Hal00] M.A. Hall. Correlation-based feature selection for discrete and numeric class machine learning. In Proc. of the 17th International Conference on Machine Learning, pages 359–366, 2000.

[HBA12] L. Hood, R. Balling, and C. Auffray. Revolutionizing medicine in the 21st century through systems approaches. Biotechnology Journal, 7(8):992– 1001, 2012.

[HC12] J.R. Harger and P.J. Crossno. Comparison of open-source visual analytics toolkits. In Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, volume 8294, page 11, 2012.

[HFG+03] W. Huh, J. Falvo, L. Gerke, A. Carroll, R. Howson, J. Weissman, and E. O'Shea. Global analysis of protein localization in budding yeast. Nature, 425(6959):686–691, 2003.

[HGL+04] C.T. Harbison, D.B. Gordon, T.I. Lee, N.J. Rinaldi, K.D. Macisaac, T.W. Danford, N.M. Hannett, J.B. Tagne, D.B. Reynolds, J. Yoo, et al. Transcriptional regulatory code of a eukaryotic genome. Nature, 431(7004):99–104, 2004.

[HH03] M. Hall and G. Holmes. Benchmarking attribute selection techniques for discrete class data mining. IEEE Transactions on Knowledge and Data Engineering, 15(6):1437–1447, 2003.

[Hit11] P. Hitzler. What's happening in Semantic Web ... and what FCA could have to do with it. In P. Valtchev and R. Jaschke, editors, Proc. 9th Intl. Conf. on Formal Concept Analysis (ICFCA 2011), volume 6628 of Lecture Notes in Artificial Intelligence, pages 18–23. Springer, Berlin, 2011.

[HL00] S.P. Hunt and F.J. Livesey. Functional genomics. Oxford University Press, 2000.

[HLK03] T.R. Hvidsten, A. Laegreid, and J. Komorowski. Learning rule-based models of biological process from gene expression time profiles using Gene Ontology. Bioinformatics, 19(9):1116, 2003.

[HMM00] I. Herman, G. Melançon, and M.S. Marshall. Graph visualization and navigation in information visualization: A survey. IEEE Transactions on Visualization and Computer Graphics, 6(1):24–43, 2000.

[HMS66] E.B. Hunt, J. Marin, and P.J. Stone. Experiments in Induction. Academic Press, New York, 1966.

[HPYM04] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data mining and knowledge discovery, 8(1):53–87, 2004.

[HRBHF07] A. Hampshire, D. Rusling, V. Broughton-Head, and K. Fox. Footprinting: a method for determining the sequence selectivity, affinity and kinetics of DNA-binding ligands. Methods, 42(2):128–140, 2007.

[HRS+02] L.S. Heath, N. Ramakrishnan, R.R. Sederoff, R.W. Whetten, B.I. Chevone, C.A. Struble, V.Y. Jouenne, D. Chen, L. Van Zyl, and R. Grene. Studying the functional genomics of stress responses in loblolly pine with the Expresso microarray experiment management system. Comparative and functional genomics, 3(3):226, 2002.

[HSD08] Z. Hu, E.S. Snitkin, and C. DeLisi. VisANT: an integrative framework for networks in systems biology. Briefings in Bioinformatics, 9(4):317–325, 2008.

[IB95] N.B. Ivanova and A.V. Belyavsky. Identification of differentially expressed genes by restriction endonuclease-based gene expression fingerprinting. Nucleic Acids Research, 23(15):2954, 1995.

[IGH01] T. Ideker, T. Galitski, and L. Hood. A new approach to decoding life: systems biology. Ann. Rev. Genom. Hum. Genet, 2:343–372, 2001.

[JA06] P. Jafari and F. Azuaje. An assessment of recently published gene expression data analyses: reporting experimental design and statistical factors. BMC Medical Informatics and Decision Making, 6(1):27, 2006.

[Jam92] D.J. Jamieson. Saccharomyces cerevisiae has distinct adaptive responses to both hydrogen peroxide and menadione. Journal of bacteriology, 174(20):6678–6681, 1992.

[JMT11] M. Jiline, S. Matwin, and M. Turcotte. Annotation concept synthesis and enrichment analysis: a logic-based approach to the interpretation of high-throughput experiments. Bioinformatics, 27(17):2391–2398, 2011.

[JSR03] J. Jaeger, R. Sengupta, and W.L. Ruzzo. Improved Gene Selection for Classification of Microarrays. In Pacific Symposium on Biocomputing, pages 53–64, 2003.

[JUA05] T. Jirapech-Umpai and S. Aitken. Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes. BMC bioinformatics, 6(1):148, 2005.

[Kan00] M. Kanehisa. Post-genome Informatics. Oxford University Press, Oxford, 2000.

[Kar01] P. Karp. Pathway Databases: A Case Study in Computational Symbolic Theories. Science, 293:2040–2044, 2001.

[Kas99] V. Kashyap. Design and creation of ontologies for environmental information retrieval. In Proceedings of the 12th Workshop on Knowledge Acquisition, Modeling and Management, 1999.

[KB03] M. Kanehisa and P. Bork. Bioinformatics in the post-sequence era. Nature Genetics, 33:305–310, 2003.

[KC04] M. Kim and P. Compton. Evolutionary document management and retrieval for specialized domains on the web. International Journal of Human-Computer Studies, 60(2):201–241, 2004.

[KD05] P. Khatri and S. Draghici. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics, 21(18):3587, 2005.

[KDOK02] P. Khatri, S. Draghici, G.C. Ostermeier, and S.A. Krawetz. Profiling gene expression using Onto-Express. Genomics, 79(2):266–270, 2002.

[Ken01] A.K. Kenworthy. Imaging protein-protein interactions using fluorescence resonance energy transfer microscopy. Methods, 24(3):289–296, 2001.

[KFD+03] O.D. King, R.E. Foulger, S.S. Dwight, J.V. White, and F.P. Roth. Predicting gene function from patterns of annotation. Genome Research, 13(5):896, 2003.

[KG02] M. Kanehisa and S. Goto. KEGG for computational genomics. In T. Jiang, Y. Xu, and M.Q. Zhang, editors, Current Topics in Computational Molecular Biology, pages 301–315. MIT Press, Cambridge, MA, 2002.

[Kin04] R.D. King. Applying inductive logic programming to predicting gene function. AI Magazine, 25(1):57–68, 2004.

[Kit02] H. Kitano. Systems biology: a brief overview. Science, 295:1662–1664, 2002.

[KJ97] R. Kohavi and G.H. John. Wrappers for feature subset selection. Artificial intelligence, 97(1-2):273–324, 1997.

[KLS+09] S. Kadupitige, K. Leung, J. Sellmeier, J. Sivieng, D. Catchpoole, M. Bain, and B. Gaeta. MINER: Exploratory Analysis of Gene Interaction Networks by Machine Learning from Expression Data. BMC Genomics, 10(Suppl 3):S17, 2009.

[KMS+08] D. Keim, F. Mansmann, J. Schneidewind, J. Thomas, and H. Ziegler. Visual analytics: Scope and challenges. Visual Data Mining, pages 76–90, 2008.

[KO01] S. Kuznetsov and S. Obiedkov. Algorithms for the construction of concept lattices and their diagram graphs. Principles of Data Mining and Knowledge Discovery, pages 289–300, 2001.

[L+01] E. Lander et al. Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, 2001.

[L+02] T. Lee et al. Transcriptional Regulatory Networks in Saccharomyces cerevisiae. Science, 298:799–804, 2002.

[LBW+04] D.J. Lockhart, E.L. Brown, G.G. Wong, M. Chee, and T.R. Gingeras. Expression monitoring by hybridization to high density oligonucleotide arrays, November 23 2004. US Patent App. 10/998,518.

[LDG91] N. Lavrač, S. Džeroski, and M. Grobelnik. Learning nonrecursive definitions of relations with LINUS. In Y. Kodratoff, editor, EWSL-91: Proc. of the European Working Session on Learning, pages 265–281, Berlin, 1991. Springer.

[Lee01] K.H. Lee. Proteomics: a technology-driven and technology-limited discovery science. Trends in Biotechnology, 19(6):217–222, 2001.

[Len00] K. Lengnink. Ähnlichkeit als Distanz in Begriffsverbänden. In Begriffliche Wissensverarbeitung, pages 57–71. Springer, 2000.

[LH08] J. Lehmann and P. Hitzler. A Refinement Operator Based Learning Algorithm for the ALC Description Logic. In Proc. of the 17th Intl. Conference on Inductive Logic Programming, ILP’07, pages 147–160, Berlin, 2008. Springer.

[LHE+99] A. Lueking, M. Horn, H. Eickhoff, K. Büssow, H. Lehrach, and G. Walter. Protein microarrays for gene expression and antibody screening. Analytical Biochemistry, 270(1):103–111, 1999.

[Lie95] M. Liebman. Bioinformatics: An Editorial Perspective. Network Science Web Site, 1995.

[Lin99] A.J. Link. 2-D Proteome Analysis Protocols. Humana Press, 1999.

[Lis13] F. Lisi. Learning Onto-Relational Rules with Inductive Logic Programming. In J. Völker and J. Lehmann, editors, Perspectives of Ontology Learning. IOS Press, Amsterdam, 2013.

[Llo87] J. W. Lloyd. Logic Programming, 2nd Edition. Springer-Verlag, Berlin, 1987.

[LM98] H. Liu and H. Motoda. Feature selection for knowledge discovery and data mining. Springer, 1998.

[LP93] P. Liang and A. Pardee. Distribution and cloning of eukaryotic mRNAs by means of differential display: refinements and optimization. Nucleic Acids Research, 21(14):3269, 1993.

[LPP+13] L. Langohr, V. Podpečan, M. Petek, I. Mozetič, K. Gruden, N. Lavrač, and H. Toivonen. Contrasting Subgroup Discovery. Computer Journal, 56(3):289–303, 2013.

[Mar03] P. Martin. Correction and extension of WordNet 1.7. Lecture Notes in Computer Science, pages 160–173, 2003.

[MBR+04] D. Martin, C. Brun, E. Remy, P. Mouren, D. Thieffry, and B. Jacq. GOToolBox: functional analysis of gene datasets based on Gene Ontology. Genome Biology, 5(12):R101, 2004.

[MF92] S. Muggleton and C. Feng. Efficient induction of logic programs. In S. Muggleton, editor, Inductive Logic Programming, pages 281–298. Academic Press, London, 1992.

[MG95] G.W. Mineau and R. Godin. Automatic structuring of knowledge bases by conceptual clustering. IEEE Transactions on Knowledge and Data Engineering, 7(5):824–829, 1995.

[Mic83] R. S. Michalski. A theory and methodology of inductive learning. In R. Michalski, J. Carbonell, and T. Mitchell, editors, Machine Learning: An Artificial Intelligence Approach, pages 83–134. Tioga, Palo Alto, CA, 1983.

[Mic91] D. Michie. Methodologies from Machine Learning in Data Analysis and Software. The Computer Journal, 34(6):559–565, 1991.

[Mit97] T.M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.

[MS83] R.S. Michalski and R.E. Stepp. Learning from observation: Conceptual clustering. Machine Learning: An Artificial Intelligence Approach, 1:331–363, 1983.

[MS04] A. Maedche and S. Staab. Ontology learning. Handbook on Ontologies, pages 173–190, 2004.

[MTV94] H. Mannila, H. Toivonen, and A.I. Verkamo. Efficient algorithms for discovering association rules. In AAAI Workshop on Knowledge Discovery in Databases, 1994.

[Mug87] S. Muggleton. Duce, an oracle based approach to constructive induction. In Proceedings of the Tenth International Joint Conference on Artificial Intelligence (IJCAI-87), pages 287–292, 1987.

[Mug91a] S. Muggleton. Inductive logic programming. New Generation Computing, 8(4):295–318, 1991.

[Mug91b] S. Muggleton. Inductive Logic Programming. New Generation Comput- ing, 8:295–318, 1991.

[Mug95] S. Muggleton. Inverse Entailment and Progol. New Generation Comput- ing, 13:245–286, 1995.

[Mug96] S. Muggleton. Learning from positive data. In Proc. of the International Workshop on Inductive Logic Programming, 1996.

[Mug05] S.H. Muggleton. Machine learning for systems biology. In Proc. 15th Intl. Conference on Inductive Logic Programming, pages 416–423. Springer, 2005.

[Mus07] A. Mushegian. Foundations of comparative genomics. Academic Press, 2007.

[OD05] M.A. O’Malley and J. Dupré. Fundamental issues in systems biology. BioEssays, 27(12):1270–1276, 2005.

[O’F75] P. O’Farrell. High resolution two-dimensional electrophoresis of proteins. Journal of Biological Chemistry, 250(10):4007, 1975.

[Oli96] S. Oliver. A network approach to the systematic analysis of yeast gene function. Trends in Genetics, 12(7):241, 1996.

[OTPC07] I. Ong, S. Topper, D. Page, and V. Costa. Inferring regulatory networks from time series expression data and relational data via inductive logic programming. Inductive Logic Programming, pages 366–378, 2007.

[OWKB98] S.G. Oliver, M.K. Winson, D.B. Kell, and F. Baganz. Systematic functional analysis of the yeast genome. Trends in Biotechnology, 16(9):373–378, 1998.

[PBTL99a] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In Database Theory: ICDT99, pages 398–416. Springer, 1999.

[PBTL99b] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient mining of association rules using closed itemset lattices. Information Systems, 24(1):25–46, 1999.

[PC03] D. Page and M. Craven. Biological applications of multi-relational data mining. ACM SIGKDD Explorations Newsletter, 5(1):69–79, 2003.

[PCH07] L. Peña-Castillo and T. Hughes. Why Are There Still Over 1000 Uncharacterized Yeast Genes? Genetics, 176(1):7–14, 2007.

[PCY95] J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules. ACM SIGMOD Record, 24(2):186, 1995.

[Plo71] G. D. Plotkin. Automatic Methods of Inductive Inference. PhD thesis, Edinburgh University, August 1971.

[PMS01] M. Pazzani, S. Mani, and W. Shankle. Acceptance of rules generated by machine learning among medical experts. Methods of Information in Medicine, 40(5):380–385, 2001.

[PPK05] Y.R. Park, C.H. Park, and J.H. Kim. GOChase: correcting errors from Gene Ontology-based annotations for gene products. Bioinformatics, 21(6):829, 2005.

[Pre97] S. Prediger. Logical scaling in formal concept analysis. Conceptual Structures: Fulfilling Peirce’s Dream, pages 332–341, 1997.

[Pri06] U. Priss. Formal concept analysis in information science. Annual Review of Information Science and Technology, 40:521, 2006.

[PW99] S. Prediger and R. Wille. The lattice of concept graphs of a relationally scaled context. In W. Tepfenhart and W. Cyre, editors, Proc. 7th Intl. Conf. on Conceptual Structures (ICCS’99), volume 1640 of Lecture Notes in Computer Science, pages 401–414. Springer, Berlin, 1999.

[Qua06] J. Quackenbush. From ‘omes to biology. Animal Genetics, 37:48–56, 2006.

[Qui79] J.R. Quinlan. Discovering rules by induction from large collections of examples. In D. Michie, editor, Expert Systems in the Micro-Electronic Age, pages 168–201. Edinburgh University Press, 1979.

[Qui90] J. Quinlan. Learning logical definitions from relations. Machine Learning, 5(3):239–266, 1990.

[Qui93] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[Rae08] L. De Raedt. Logical and Relational Learning. Springer, Berlin, 2008.

[Res95] P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In International Joint Conference on Artificial Intelligence, volume 14, pages 448–453, 1995.

[RGK+08] S. Rogers, M. Girolami, W. Kolch, K.M. Waters, T. Liu, B. Thrall, and H.S. Wiley. Investigating the correspondence between transcriptomic and proteomic expression profiles using coupled cluster models. Bioinformatics, 24(24):2894, 2008.

[RGMH02] M.D. Robinson, J. Grigull, N. Mohammad, and T.R. Hughes. FunSpec: a web-based cluster interpreter for yeast. BMC bioinformatics, 3(1):35, 2002.

[RHHNV13] M. Rouane-Hacene, M. Huchard, A. Napoli, and P. Valtchev. Relational concept analysis: mining concept lattices from multi-relational data. Annals of Mathematics and Artificial Intelligence, 67(1):81–108, 2013.

[Ril98] M. Riley. Genes and proteins of Escherichia coli K-12. Nucleic Acids Research, 26(1):54, 1998.

[Riv07] I. Rivals, L. Personnaz, L. Taing, and M.-C. Potier. Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics, 23(4):401–407, 2007.

[RL97] M. Riley and B. Labedan. Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module. Journal of Molecular Biology, 268(5):857–868, 1997.

[RMBB89] R. Rada, H. Mili, E. Bicknell, and M. Blettner. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics, 19(1):17–30, 1989.

[RSA00] S. Raychaudhuri, J.M. Stuart, and R.B. Altman. Principal components analysis to summarize microarray experiments: application to sporulation time series. In Pacific Symposium on Biocomputing, page 455, 2000.

[RTB+01] L.M. Raamsdonk, B. Teusink, D. Broadhurst, N. Zhang, A. Hayes, M.C. Walsh, J.A. Berden, K.M. Brindle, D.B. Kell, J.J. Rowland, et al. A functional genomics strategy that uses metabolome data to reveal the phenotype of silent mutations. Nature Biotechnology, 19(1):45–50, 2001.

[RW97] S.M. Roberts and F. Winston. Essential functional interactions of SAGA, a Saccharomyces cerevisiae complex of Spt, Ada, and Gcn5 proteins, with the Snf/Swi and Srb/mediator complexes. Genetics, 147(2):451–465, 1997.

[SDB03] R.L. Somorjai, B. Dolenko, and R. Baumgartner. Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics, 19(12):1484, 2003.

[SEAW+07] R. Stevens, M. Egaña Aranguren, K. Wolstencroft, U. Sattler, N. Drummond, M. Horridge, and A. Rector. Using OWL to model biological knowledge. International Journal of Human-Computer Studies, 65(7):583–594, 2007.

[SF04] N.H. Shah and N.V. Fedoroff. CLENCH: a program for calculating Cluster ENriCHment using the Gene Ontology. Bioinformatics, 20(7):1196, 2004.

[SFE+00] J.G. Sutcliffe, P.E. Foye, M.G. Erlander, B.S. Hilbush, L.J. Bodzin, J.T. Durham, and K.W. Hasel. TOGA: an automated parsing technology for analyzing expression of nearly all genes. Proceedings of the National Academy of Sciences, 97(5):1976, 2000.

[Sha48] C. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.

[SHZ07] U. Sauer, M. Heinemann, and N. Zamboni. Getting closer to the whole picture. Science, 316(5824):550–551, 2007.

[SIL07] Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507, 2007.

[SM09] N. Shah and M. Musen. Ontologies for Formal Representation of Biological Systems. In S. Staab and R. Studer, editors, Handbook on Ontologies (Second Edition), pages 445–461. Springer, Berlin, 2009.

[Smi03] B. Smith. Ontology. In L. Floridi, editor, Blackwell Guide to the Philosophy of Computing and Information, pages 155–166. Blackwell, Oxford, 2003.

[SMO+03] P. Shannon, A. Markiel, O. Ozier, N.S. Baliga, J.T. Wang, D. Ramage, N. Amin, B. Schwikowski, and T. Ideker. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research, 13(11):2498–2504, 2003.

[SON95] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. of the Intl. Conference on Very Large Data Bases, pages 432–444, 1995.

[Sri99] A. Srinivasan. The Aleph manual. Computing Laboratory, Oxford University, 1999.

[SS09] S. Staab and R. Studer. Handbook on Ontologies (Second Edition). Springer, Berlin, 2009.

[SSD+02] L. Steinmetz, C. Scharfe, A. Deutschbauer, D. Mokranjac, Z. Herman, T. Jones, A. Chu, G. Giaever, H. Prokisch, P. Oefner, et al. Systematic screen for human disease genes in yeast. Nature Genetics, 31(4):400–404, 2002.

[SSDB95] M. Schena, D. Shalon, R.W. Davis, and P.O. Brown. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270(5235):467, 1995.

[SSP+05] J.L. Sevilla, V. Segura, A. Podhorski, E. Guruceaga, J.M. Mato, L.A. Martinez-Cruz, F.J. Corrales, and A. Rubio. Correlation between gene expression and GO semantic similarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 2(4):338, 2005.

[STB+02] G. Stumme, R. Taouil, Y. Bastide, N. Pasquier, and L. Lakhal. Computing iceberg concept lattices with TITANIC. Data & Knowledge Engineering, 42(2):189–222, 2002.

[Stu04] G. Stumme. Iceberg Query Lattices for Datalog. ICCS2004: Proc. Intl. Conference on Conceptual Structures, pages 234–234, 2004.

[TB00] K. Tipton and S. Boyce. History of the enzyme nomenclature system. Bioinformatics, 16(1):34, 2000.

[TEP+01] A. Tong, M. Evangelista, A. Parsons, H. Xu, G. Bader, N. Page, M. Robinson, S. Raghibizadeh, C. Hogue, H. Bussey, B. Andrews, M. Tyers, and C. Boone. Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science, 294:2364–2368, 2001.

[TF04] C. Tucker and S. Fields. Quantitative genome-wide analysis of yeast deletion strain sensitivities to oxidative and chemical stress. Comparative and Functional Genomics, 5(3):216–224, 2004.

[TFA+04] G.W. Thorpe, C.S. Fong, N. Alic, V.J. Higgins, and I.W. Dawes. Cells have distinct mechanisms to maintain protection against different reactive oxygen species: oxidative-stress-response genes. Proceedings of the National Academy of Sciences of the United States of America, 101(17):6564, 2004. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC404085/.

[TGB+02] O.G. Troyanskaya, M.E. Garber, P.O. Brown, D. Botstein, and R.B. Altman. Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics, 18(11):1454, 2002.

[TGNK00] R.L. Tatusov, M.Y. Galperin, D.A. Natale, and E.V. Koonin. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Research, 28(1):33, 2000.

[THC+99] S. Tavazoie, J.D. Hughes, M.J. Campbell, R.J. Cho, and G.M. Church. Systematic determination of genetic network architecture. Nature Genetics, 22:281–285, 1999.

[TKWC99] P. Törönen, M. Kolehmainen, G. Wong, and E. Castren. Analysis of gene expression data using self-organizing maps. FEBS Letters, 451(2):142–146, 1999.

[TNCKM06] A. Tamaddoni-Nezhad, R. Chaleil, A. Kakas, and S. Muggleton. Application of abductive ILP to learning metabolic network inhibition from temporal data. Machine Learning, 64(1):209–230, 2006.

[TOTZ01] J.G. Thomas, J.M. Olson, S.J. Tapscott, and L.P. Zhao. An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Research, 11(7):1227, 2001.

[TPBL00] R. Taouil, N. Pasquier, Y. Bastide, and L. Lakhal. Mining Bases for Association Rules using Closed Sets. In ICDE2000: Proc. of the Intl. Conference on Data Engineering, pages 307–307. IEEE, 2000.

[TPD05] M.D. Temple, G.G. Perrone, and I.W. Dawes. Complex cellular responses to reactive oxygen species. Trends in Cell Biology, 15(6):319–326, 2005.

[TZLT08] I. Trajkovski, F. Zelezny, N. Lavrac, and J. Tolar. Learning relational descriptions of differentially expressed gene groups. IEEE Transactions on Systems, Man and Cybernetics: Part C Applications and Reviews, 38(1):16–25, 2008.

[UGMW01] J.D. Ullman, H. Garcia-Molina, and J. Widom. Database Systems: The Complete Book. Prentice Hall, 2001.

[Usc11] M. Uschold. Making the Case for Ontology. Applied Ontology, 6(4):377–385, 2011.

[VAM+01] J.C. Venter, M.D. Adams, E.W. Myers, et al. The sequence of the human genome. Science, 291(5507):1304–1351, 2001.

[VGRH03] P. Valtchev, D. Grosser, C. Roume, and M.R. Hacene. Galicia: an open platform for lattices. In Using Conceptual Structures: Contributions to 11th Intl. Conference on Conceptual Structures (ICCS03), pages 241–254, 2003.

[VL13] A. Vavpetič and N. Lavrač. Semantic Subgroup Discovery Systems and Workflows in the SDM-Toolkit. Computer Journal, 56(3):304–320, 2013.

[VMGM02] P. Valtchev, R. Missaoui, R. Godin, and M. Meridji. Generating frequent itemsets incrementally: two novel approaches based on Galois lattice theory. Journal of Experimental & Theoretical Artificial Intelligence, 14(2):115–142, 2002.

[VZVK95] V. Velculescu, L. Zhang, B. Vogelstein, and K. Kinzler. Serial analysis of gene expression. Science, 270(5235):484, 1995.

[WE11] T. Wray and P. Eklund. Exploring the Information Space of Cultural Collections Using Formal Concept Analysis. In P. Valtchev and R. Jäschke, editors, Formal Concept Analysis, volume 6628 of Lecture Notes in Computer Science, pages 251–266. Springer, Berlin, 2011.

[WF05] I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

[WMC+01] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs. In Advances in Neural Information Processing Systems, pages 668–674, 2001.

[WSA+99] E.A. Winzeler, D.D. Shoemaker, A. Astromoff, H. Liang, K. Anderson, B. Andre, R. Bangham, R. Benito, J.D. Boeke, H. Bussey, et al. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science, 285(5429):901–906, 1999.

[WSGA03] C.J. Wroe, R. Stevens, C.A. Goble, and M. Ashburner. A methodology to migrate the Gene Ontology to a description logic environment using DAML+OIL. In Pacific Symposium on Biocomputing, volume 8, pages 624–635, 2003.

[WT00] A. Williams and C. Tsatsoulis. An instance-based approach for identifying candidate ontology relations within a multi-agent system. In OL2000: Proc. of the First Workshop on Ontology Learning, 2000.

[WTH+05] Y. Wang, I.V. Tetko, M.A. Hall, E. Frank, A. Facius, K.F.X. Mayer, and H.W. Mewes. Gene selection from microarray data for cancer classification–a machine learning approach. Computational Biology and Chemistry, 29(1):37–46, 2005.

[Yan04] G. Yang. The complexity of mining maximal frequent itemsets and maximal frequent patterns. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 344–353. ACM, 2004.

[YBY+08] H. Yu, P. Braun, M. Yildirim, I. Lemmens, K. Venkatesan, J. Sahalie, T. Hirozane-Kishikawa, F. Gebreab, N. Li, N. Simonis, et al. High-Quality Binary Protein Interaction Map of the Yeast Interactome Network. Science, 322(5898):104–110, 2008.

[ZCH05] J. Zhang, D. Caragea, and V. Honavar. Learning ontology-aware classifiers. In Discovery Science, pages 308–321. Springer, 2005.

[ZFW+03] B.R. Zeeberg, W. Feng, G. Wang, M.D. Wang, A.T. Fojo, M. Sunshine, S. Narasimhan, D.W. Kane, W.C. Reinhold, S. Lababidi, et al. GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol, 4(4):R28, 2003.

[ZGS07] X. Zhu, M. Gerstein, and M. Snyder. Getting connected: analysis and principles of biological networks. Genes & Development, 21:1010–1024, 2007.

[ZH02] M.J. Zaki and C.J. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In 2nd SIAM International Conference on Data Mining, pages 457–473, 2002.

[ZH05] M.J. Zaki and C. Hsiao. Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Transactions on Knowledge and Data Engineering, 17(4):462–478, 2005.

[ZKF+99] R.A. Zubarev, N.A. Kruger, E.K. Fridriksson, M.A. Lewis, D.M. Horn, B.K. Carpenter, and F.W. McLafferty. Electron capture dissociation of gaseous multiply-charged proteins is favored at disulfide bonds and other sites of high hydrogen atom affinity. J. Am. Chem. Soc, 121(12):2857–2862, 1999.

[ZKSH06] J. Zhang, D-K. Kang, A. Silvescu, and V. Honavar. Learning accurate and concise naive bayes classifiers from attribute value taxonomies and data. Knowledge and Information Systems, 9(2):157–179, 2006.

[ZR09] Y-Q. Zhang and J. Rajapakse. Machine Learning in Bioinformatics. John Wiley & Sons, Hoboken, NJ, 2009.