Ontology based analysis of experimental

Andrea Splendiani Institute Pasteur, Unité de Biologie Systémique, rue du dr. Roux 25-28, 75015 Paris, France University of Milano-Bicocca, DISCO, via Bicocca degli Arcimboldi 8, 20126 Milano, Italy [email protected]

Abstract relating such a large scale experimental evidence to the exist- ing biological is essential in order to understand We address the problem of linking observations the phenomenon under study. from reality to a semantic web based knowledge Given the vast amount of data generated by high through- base. Concepts in the biological domain are in- put technologies, this necessitates automated support. creasingly being formalized through ontologies, At the same time there is an increasing availability of struc- with an increasing adoption of semantic web stan- tured biological , in the semantic web framework dards. Atthe same time biologyis becominga data- in particular. The Gene Ontology[Ashburner et al., 2000] ini- centric , since the increasing availability of tiative provides ontologies describing functions of gene prod- high throughput technologies yields a humanly in- ucts. It currently encompasses more than 17000 terms linked tractable amount of data describing the behavior by relations of inheritance and containment and it is avail- of biological systems at the molecular level. This able in RDF. MGED-ontology1 provides an ontology to de- creates the need for automated support to interpret scribe attributes relevant to mRNA experiments in OWL, and biological data given the pre-existing knowledge the BioPAX[The BioPAX workgroup] initiative is defining a about the biological systems under study. While common standard to represent biological pathways and inter- this is currently addressed through the analysis of action networks in OWL. This last initiative is of particular attributes associated to biological entities, the avail- interest since it provides a common ontological framework ability of ontologies that represent biological sys- for the unification of different resources. The availability tems makes it possible to improve the extent to of such resources makes it possible to partially automatize which pre-existing knowledge can be used. The se- the association between experimental evidence and existing mantic web, in particular, provides a framework to knowledge to effectively lead data analysis. integrate and create a formalized biological knowl- edge base. Linking ontological knowledge to ob- served data is inherently approximate, because of 2 Ontologies and data analysis the quality of observations, the relation between Focusing on ontologies that describe the behavior of biologi- observed data and entities and the classification of cal systems at the molecular level, there is a range of ontolo- entities. We present an overall framework project gies that vary in scope and depth. While available knowledge and its current development status. of some biological systems is enough to build causal mod- els, in general such knowledge is limited and most of ontolo- 1 Introduction gies have a low ontological commitment. When dealing with system-wide observations, this second class of ontologies is Scientific exploration constantly involves relating experimen- most relevant. tal evidence to existing knowledge. In the Life Science fields For instance, in the case of mRNA profiling, the behavior existing knowledge is commonly encoded in a corpus of sci- of thousands of genes in a cell in response to some sort of entific publications. This knowledge is used by scientists to stimuli is observed. For each gene, measures of its activity are design and interpret experiments that in turn lead to new dis- provided. These data are usually related to Gene Ontology to coveries. Traditionally this meant to relate a limited amount derive a functional characterization of the cell response. of experimental evidence to pre-defined hypotheses. Associations of genes to specific classes in Gene Ontology Recently, the availability of high-throughput technologies, are determined based on available knowledge. By its seman- such as DNA sequencing, mRNA and proteomic profiling is tics, association of a gene to a class implies association of a challangingchallenging this existing paradigm. Such tech- gene to its super classes too. Thus a gene is annotated with a nologies allow the observation of the behavior of biological set of classes that act as attributes describing specific biolog- systems at the molecular level on a system-wide basis. This ical functions. means that a humanly intractable amount of data is available, most of which does not relate to previous hypotheses. Thus 1www.mged.org It is common practice to define a subset of relevant genes date the plausibility of nodes associated to concepts given ev- from experimental data and to study the incidence of these idence encoded in roots nodes. This updates involves consid- attributes derived from Gene Ontology through statistical ering both the experimental evidence, and uncertainty assess- tests[Beissbarth et al., 2004; Maere et al., 2005]. ment related to it. Sometimes relations of inheritance and independence are used to measure “conceptual distances” among genes[Joslyin 6 Conclusion et al., 2004]. The Life Science community is one of the early adopters of semantic web technologies. The need to represent and 3 Uncertainty integrate a vast amount of different information is pushing Uncertainty plays a key role in the task of associating experi- the development of this technology. The analysis of high- mental evidence to ontological knowledge, at several levels. throughput data poses naturally the need of approximate rea- Uncertainty in the definition and relations between classes soning and uncertainty representation. plays a limited role. There is not a specific support for uncer- tainty in OWL, and the definition of ontologies is an ongoing References task where crisp definitions are valuable. [Ashburner et al., 2000] Ashburner M, et al. Gene on- Association between instances and classes is one point tology: tool for the unification where uncertainty plays a critical role. Almost every on- of biology. The Gene Ontology tology encodes a confidence in the association through “ev- Consortium. Nat Genetics 2000 idence codes” (describing the kind of supporting evidence) May;25(1):25-9 and eventually a p-value or citations of relevant scientific lit- erature. See [Karp et al., 2004] for an example of an ontology [The BioPAX workgroup] The BioPAX workgroup. for experimental evidence. BioPAX: Biological Pathways Association between data and ontologies is then inherently Exchange. www.biopax.org uncertain. Uncertainty may come not only from the experi- [Beissbarth et al., 2004] Beissbarth T, Speed TP. GOstat: mental setup and measurements, but also from the biological find statistically overrepresented source of variability, and from misconceptions or omissions Gene Ontologies within a group in the available knowledge. of genes. Bioinformatics 2004 Jun12;20(9):1464-5 4 Our project [Maere et al., 2005] Maere S, Heymans K, Kuiper The way experimental data are associated to existing ontolo- M. BiNGO: a Cytoscape plugin gies now does not take into account all the information en- to assess overrepresentation of coded in ontologiesand does not providea way to reason over Gene Ontology categories in bio- related uncertainty. We plan to overcome these limitations by logical networks. Bioinformatics providing a framework for approximate reasoning based in 2005 Aug 21; 3448-3449 ontologies. [Karp et al., 2004] Karp PD, Paley S, Krieger CJ, In particular, we focus on OWL ontologies describing bi- Zhang P. An evidence ontol- ological pathways and on mRNA data. Given an ontology ogy for use in pathway/genome representing a collection of pathways and related concepts databases. Pac Symp Biocomput. (including evidence support), and a set of experimental data, 2004; 190-201 we define a new ontology as the union of the two, represent- [Joslyin et al., 2004] Joslyn CA, Mniszewski SM, Ful- ing observed evidence and the previous knowledge. mer A, Heaton G. The gene on- Thus we plan to use the structure of previous knowledge to tology categorizer. Bioinformat- compute plausibility of concepts being pertinent to observed ics 2004 Aug 4;20 Suppl 1:I169- conditions. This can be done through a rule based approach, I177 where inherent structure of pathways ontologies would en- sure convergence of plausibility distributions. [Shannon et al., 2003] Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ram- 5 Current development age D, Amin N, Schwikowski B, Ideker Y. Cytoscape: a soft- We have developed an infrastructure where ontologies ware environment for integrated can be merged and represented. This is based on the models of biomolecular interac- Cytoscape[Shannon et al., 2003] software for molecular in- tion networks. Genome Res. 2003 teraction analysis which is used as a link to experimental data Nov;13(11):2498-504 and an interactive visualizer for RDF ontologies. Rule sys- tems for unifying the ontologies and graph transformations to represent views of ontologies are also provided. Based on this, we plan to providea Bayesian network along with the ontology, or possible derivation of that, and to up-