From: ISMB-97 Proceedings. Copyright © 1997, AAAI (www.aaai.org). All rights reserved.

A knowledge base for D. melanogaster interactions involved in pattern formation

J~rSme Euzenat+’*, Christophe Chemla"’# and Bernard Jacq"

÷ INRIARhrne-Alpes ’Laboratoirede G~nrtiqueet Physiologiedu l~veloppement 655 avenuede I’Europe, Parc scientifique de Luminy,CNRS Case 907, 38330Montbonnot Saint-Martin, France 13288Marseille cedex 9, France [email protected] jacq @Igpd.univ-mrs. fr

Abstract macromolecules, data concerning precise molecular The understanding of pattern formation in Drosophila interactions between them are underrepresented in these requires the handling of the manygenetic and molecular databases. Themajority of these databases can be classified interactions whichoccur betweendevelopmental . For as "mainly structural" because the core of their that purpose,a knowledgebase (KNIFE)has been developed in order to structure and manipulatethe interaction data. informational content is based on various aspects of DNA, KNIFEcontains data about interactions published in the RNAor sequence and/or structure. Relatively few literature and gatheredfrom various databases. Thesedata databases have a content and an organisation which are are structured in an object knowledgerepresentation system oriented towards the biological function of the genes and into various interrelated entities. KNIFEcan be browsed the relationships between structure and function. In the through a WWWinterface in order to select, classify and examinethe objects and their referencesin other bases. It field of genetic diseases, OMIM,a catalogue of human also providesspecialised biological tools suchas interaction genes and genetic disorders [OMIM1996] provides the networkmanipulation and diagnosis of missinginteractions. user with both structural (e.g. molecular genetics, biochemistry, genetic mapping) and functional data (e.g. We are interested in the biological process of pattern clinical features, diagnosis, inheritance) on humangenes. formation in Drosophila and in understanding the basis of specific identity acquisition by the different body parts In order to cope with problemsof genetic regulation, it is [Fasano et al. 1991; Rrder, Vola and Kerridge 1992; first essential to be able to describe and organise biological Alexandreet al. 1996]. In Drosophila, different classes of facts such as those described above in a standardised way. genes involved in the segmentation processes (maternal, For this purpose, we recently described [Jacq et al. 1997] gap, pair-rule and segment polarity genes) divide the the concepts, organisation, content and use of GIF-DB embryo along the antero-posterior axis into repeated (Gene Interactions in the Fly Database), a new WWW homologousunits [Niisslein-Volhard and Wieschaus1980; repository for data on gene interactions involved in Gaul and J~ickle 1990] which will develop specific Drosophila embryonic development and the regulatory identities and morphogeneticfeatures under the control of networksin whichthey are involved. GIF-DBis a collection homeotic genes [Lewis 1978]. Specific interactions within of hypertext files, each of them describing an interaction and between these gene families are essential for the between two molecular partners (Protein/DNA, establishment of a correct body pattern. Being able to Protein/RNA, Protein/Protein). All data found in GIF-DB access, query and manipulate the data on these comefrom the literature. developmental genes and their functional interactions The production of GIF-DBis a first step in the process of within specific regulatory networks is nowan important managing scientific information concerning genetic requirement for developmental and molecular biologists interactions. Althoughquite complete and simple to use, a studying gene regulation. collection of hypertext files is only imperfectly suited to Gene molecular interactions, i.e. direct molecular represent and query the knowledge we already have on interactions involving DNA,RNAand , play an such complex biological problems as molecular essential role in all knownbiological processes. Although interactions. different databases exist for each of these three types of Afurther step in that direction requires a system providing # Both authors contributed equally to this work. more structure (e.g. objects and class hierarchies) and Copyright © 1997, American Association for Artificial Intelligence (www.aaai.org).All rights reserved. manipulation capabilities (e.g. classification, network

~o8 ISMB-97 traversing) than a relational database or a flat WWWserver. concepts (an object is an instance of one and only one Such an organisation will provide the opportunity to gather concept). As an example, the proteinconcept concerns all interactions into networks and to compare, check and the individual proteins. Ontological prerogatives are simulate these networks. attached to the concept. They warrant the integrity of an The KNIFE system (Knowledge on Networks of object (i.e. that the object cannot be modified in a way Interactions in the Fly Embryo)l, which is presented here, which would lead it to no longer be an instance of the is a knowledge base putting together data and methods concept) and its identity (i.e. it can always be uniquely about interactions gathered in other databases. It has been identified). These prerogatives play an important role when developed using the TROPESknowledge representation the object is created and registered. system, is directly accessible from the WWWand provides The conceptalso defines the structure of its instances. The several specific utilities such as network traversing and structure of an object is uniquely determined by a set of analysis. fields and their basic domainsindependently of the classes The choice of an object-based knowledge representation to which the object can be attached. For example, the system has already been madeby several teams working on instances of the protein concept have a name, a size and metabolism [Karp and Mavrovouniotis 1994] or genetic a sequence field. The basic domainof a field is either a regulation [Perri~re et al. 1993; Hoogland and Bi4mont primitive type (string for the name,integer for the size), 1997]. The more closely related work to this one is the concept (protein-sequence for sequence) or a type work around the EcoCYCknowledge base on Escherichia constructed from primitive types and concepts with the coli genes and metabolism [Karp et al. 1996]. It is an help of set and list constructors. example of a knowledgebase which integrates functional The field values are part of the objects and do not depend aspects since one can find both data on gene structure and uponthe classes to which the objects are attached. For their function in the regulation of biochemical pathways. instance, the protein B!COIDis represented by an object Although it focuses on metabolism instead of genetic instance of the protein class. Its namefield has the string regulation, there are several features commonwith the ~BICO’rD"for value, its size field has integer 494 for value work presented here: use of knowledge base technology; and its sequence field has the instance of the protein- availability through WWW,including for graph drawing; sequenceconcept also namedB’rco’rD for value. developmentof specific algorithms for using the networks (pathways). An important difference is that the context Viewpoints whichgenetic interactions are valid is not well knownyet. Objects can be seen under several viewpoints, For instance, BICOID is classified as a transcription-factorunder The TROPESobject-based knowledge representation system the biochemical-function viewpoint and as a nuclei is first described. Then, the objects, modellingKNIFE data, protein under the initial- and final-sub-cellular- are presented at length. The functions offered by KNIFEare location. The viewpoints allow to restrict the view on exposed in the remainder: first the general functions instances and to organise the concept into particular available in TROPES(WWW browsing, classification, taxonomies. A viewpoint determines: filtering) are presented; then, specific methodstied to the ¯ The set of fields whichare relevant under the viewpoint exploration of genetic interactions (graph traversing and (the sequenceis not relevant from the sub-cellular- diagnosis of missing interactions) are detailled. The location viewpoints as well as the size under the discussion focuses on a long term goal: the simulation of biochemical-function one, and thus the interaction networks. corresponding fields are hidden under the respective viewpoints). Short presentation of TROPES ¯ A hierarchy of classes under whichthe instances of the concept can be classified. Each viewpoint offers to the TROPESis an object-based representation system which user a new taxonomy under which the classification favours classification. It is here presented throughits basic operation depends on different criteria. Theyallow to notions, while more specific descriptions will be found in focus on particular aspects of objects without being the remaining sections. disturbed by others. Classes are related through the specialisation relation and determine progressive Objects and concepts subsets of the set of instances of the concept. Underthe biochemical-function viewpoint, proteins are In TROPES[Marifio et al. 1990; Sherpa 1995] individuals are represented as objects. The objects are partitioned into divided into enzymes, DNA-associated, growth- factor, and other classes; DNA-associated can be 1 http : //www-biol. univ-mrs, fr/-igpd/knife.html

Euzenat 1o9

...... -.- ..-.....-.-....-.-.-.-.- -.-.-.- - ¯ -.-...... l initial Viewpoint- biochemical ize ~k -sub-cellular b~ -function I\-lOcatiOn I ~ ~~A. I pr°tein L--~ Concep[,,~protein ~- .~ " "~_" ]: - ~ "~ - ~.:" " "[ " - ~ protein protein I I II I -celIula~IA I ~ ~ ] large~ II I -matrix V I~I 1 DNA " l~ I v %.| I ,/ I | | -associatedD~’ average ~ ~ I I ~ ~ % 1 I /~ 0 growth-factor Is~[1u I plasmic-r~l I| I,//\ I k I I 0 |/chromatin U I 1% . . I \ I cytoplasmic %/-compaction/ ~kt~ac~SCnpt~°n I ~ I | 0 I _factor / l~la for ~Ob

BICOID

Figure I. The protein concept is visible under the size, initial-sub-cellular-locationand biochemical- functionviewpoints. Each of themdetermines a hierarchyof classeswhose root classis namedprotein. For instance,under the biochemical-functionviewpoint there is a decompositionof the set of proteinsfollowing their functions.The B I CO I D objectis attachedto the averageclass under the s i z e viewpoint,to the nuc i ear classunder the initial-sub- cellular-locationviewpoint and to the DNA-associatedclass under the biochemical-functionviewpoint ¯ vided into transcription- factor and chromatin- members of its super-classes (extension inclusion compaction- factor. property); second, the constraints defined in a class apply to all the membersof that class (and thus to the objects Classes and taxonomy attached to all of the more specific classes). In the above A class defines constraints that objects must satisfy in example,it meansthat: ¯ all transcription-factorSare DNA-associateciand order to belong to the class. It is a projection of the structure of the conceptretaining only relevant fields and a ¯ all transcription-factors, as DNA-associated, restriction of the possible values of these fields. This is inherit their constraints (e.g. being located in the achieved with the help of: nucleus). ¯ Primitive domain restriction provided by domain So the strengthening of the constraints from class to sub- enumeration, exclusion or bounding (the effect field classes is parallel to the restriction of the extension. As of an interaction has a string value enumerated by opposedto instanciation, objects can be attached to a class "activation", "repression" and ~activation-and- and can be detached from it at anytime. Classes also allow repression"; the size of an average protein can be the expression of hypothetical knowledge in terms of between3 0 0 and 10 0 0 an]ino-acids). default values or default inference methods. ¯ Attachmentrestrictions for concepts: the field values must not only be instances of a particular concept but TROPES has other features which are not relevant here. can be constrainedto belongto particular classes of that Some of them will be presented when needed in the concept (the value of the sequence field of a protein remainingsections. must be memberof the protein-sequence class). ¯ Constraints on field values. These constraints are Biological knowledgerepresentation membershipconstraints or constraints between fields (the s i z e of a RNAcannot exceed that of its DNA- The core of the systemis described hereafter. It is madeof sequence). a repository (expressed in TROPES)of concepts, classes and ¯ Cardinality restriction on sets (rasp. lists) by bounding instances representing biological entities. The data their numberof elements. originates from other standard repositories and GIF-DB. Anobject is attached to (to be opposedto "is memberof") Figures given here are those at the date of DecemberI st, only one more specific sub-class under a viewpoint, but is 1996. member of all the classes of which this class is a specialisation. The interpretation of the spccialisation General overview relation is twofold: first, the membersof a class are The knowledge base, while still incomplete, has been

uo ISMB-97 carefully designed in order to reflect the complexity of field and the binding protein through the overlaps- gene interaction. The KNIFEbase contains 16 concepts with-proteinfield. related with manydifferent aspects of interactions. Figure precursor-rna(27): The precursors of a gene are the 2 presents these concepts and the relationships between RNAproducts which have not been spliced. It was them through fields referring to each others. The main necessary to introduce precursors because the same concepts, with regard to the present paper, are network. precursor can be spliced differently or two different interaction, expression, gene and protein. Theywill precursors can give rise to the same product after be detailed in depth below. splicing. The other concepts, although being useful to the biologist rna(29): rnaobjects record the information concerning looking for where, howand whyan interaction can happen, RNA.An obvious one is the sequence field referring to are not yet used by the automatic facilities of KNIFEThey the rna-sequence itself, the product field referring to can be briefly described in the following manner(numbers the protein expressed by the RNAstrand. The in parentheses indicate the number of instances in the expression field is meantto describe in which context actual knowledgebase): this particular RNAstrand is to be expressed. These rna mutant (0): the mutant concept describes the kind of objects are referred to by the gene concept, allowing mutations which can be applied to a gene and the them to determine their corresponding expression resulting phenotype.It is not presently used. patterns. binding- site- feature (43): describes the binding-sites dna-sequence (51), rna-sequence (4), protein- involved in the interaction betweenbiological objects. sequence (25): the individual sequences of DNAare They refer the actual sequence through their sequence described by subsequences (called dna-part)and

> --> dependency single-valued field maintained ~,- set-valued field automatically inding-sites

binding-site-feature

target-rna precursor

--.gene

sequence expression product

expression

[ expression rna-sequenc~] I protein-sequencel / pans signals signals features]

[dna-sequence-part-feature~{dna-signal-featureI rna-sequence-part-featur~lrna-signal-featur~ I protein-feature~ Figure 2. KNIFEconcepts and their dependencies.Boxes represent concepts while labelled arrowsrepresent object fields.

Euzenat 111 signals detected in the sequences (dna-signal);these The two main elements involved in interactions are fields contain objects knownas dna-sequence-part- proteins and genes. Figure 1 displays three viewpoints on feature and dna-s ignal-feature. Tile same applies the protein concept which has been largely described withslight variations to rna-sequenceand protein- above. This concept introduces views depending on the sequence (with no signal/part distinction). They initial and final locations of a protein involved, for originate from the international sequence databases instance, in the signal transduction pathways; the (EMBL[Rice et al. 1993] and SWlSSPROT[Bairoch and biochemical-function view considers the families of Apweiler 1996]). protein biochemicalfunctions. dna- signal - feature(2), rna- s igna i - lea ture (4): The gene concept refers to the sequence of the gene, the describes the various signals (binding sites for precursor RNAand the protein it codes for through the regulation proteins, splicing sites) that can be found in corresponding fields. It can be viewed under four the regulatory region of a gene or precursor. viewpoints: its cytological-location,its phenotypic dna-sequence-part-feature (2), rna-sequence- class, its number of precursors and the size of its part-feature(4): describe the functional parts that transcription unit. This is an important issue since the time can be found in a gene. For RNAthese are just the parts required by the R N A-polymerase to transcript D NA corresponding to that of DNA. depends on the size of the unit. This can thus be used in protein-feature (152): the various structural and order to impose temporal constraints to the models of functional features (zinc-finger, leucine-zipper, embryogenesis(for somehomeotic genes, the transcription homeodomainand helix) of a protein. can last morethan one hour). At start up, KNIFEcomputes the reverse links from genes Interactions, proteins and genes and proteins. It is then possible to refer to interactions The main concept in KNIFEis the interactionconcept regulating the gene (incoming)and to interactions in which containing currently interactions between proteins and the gene product is regulator (outgoing). genes (type I). Each interaction connects an effector Genes and proteins originate mainly from FLYBASE (which is a protein, instance of the protein concept) to [Flybase 1996] and SWlSSPROT[Bairoch and Apweiler target (which is a gene, instance of the gene concept). In 1996] and their representation obey the FLYBASEcontrol addition, the interaction has a effect field whichcontains vocabulary. At the moment,there are 25 genes, 26 proteins a string qualifying the interaction effect as "activation’, and 62 interactions. "repression" or both. The interactions also mention the following fields: Expression patterns binding-s i tes: refers to the set of binding sites (concept The expression concept provides, for a particular binding-site-feature) involved in the regulation; developmentstage and a particular mutation (or wild type structure-of-target-product, structure-of- individual), the Iocalisation of the expression of a gene, effector:refers by a string to the knownstructural protein or a particular RNA.Expression patterns are thus patterns in the products (e.g. zinc-fingers, identifiable throughthe followingfields : homeodomains); developmental-stage is a string which denotes the dose-dependence:indicates if the interaction depends on stage at whichthe expression is found; the concentration of the effector. This is not currently gene-name contains the name of the corresponding gene used by the algorithms but will be useful in the future; object; protein-partners: contains a set of proteins which genetic-context is a string denoting the kind of are also involvedin the regulation; mutation(or "WT"for wild type). other fields contain information from the literature among It provides a set of segments of the embryoin which the which the experimental methods used for pointing out product of the gene is found. These patterns are encodedin the interaction or the regions of the embryowhere there various ways. The main one is the egg-length-domain is evidencefor the interaction [Jacq et aL 1997]. field which contains the boundsof the intervals of the egg Protein-RNAand protein-protein interactions can also be (expressed in percentage of the length of the embryo) represented. All these interactions are viewed through wherethe product is found. For instance, the expression of viewpoints depending on their types (whether they imply the fushitarazu gene at the cellular blastoderm stage for protein, RNAor DNAtarget), the structure of effectors and a wild individual is madeof 7 stripes (shownon Figure 3): target product, the effect of the interaction and the classes 11-16, 20-24, 28-32, 36-40, 44-48, 52-56 and 60-63%of of the involved products (classified after [Pankratz and the egg length. The egg-length-domainis not the only Jiickle 1993] for gap genes). The interactions in the base field containing this information: it is encodedin other comefrom GIF-DB[Jacq et al. 1997]. common formats (regions, segments and

ISMB-97 Figure 3. Theexpression patterns of the expres s i on conceptare displayedby a Java applet. This allows the users to quickly identifythe patternson the fly egg. parasegments). interactions,is described below. There is no network This information can be found in the two main views stored as such in the base: building a meaningful network which are offered to the user and which decomposesthe is the goal of manipulatingthe data. expression with regard to the expression-regior~ (the place of expression: thorax, abdomen, terminal- anterior,etc.) and theexpression-domains (thepatter. KNn~access ofexpression: s tripe, gradient, etc.). An important benefit from the modelling of the knowledge To date, there are 46 expression patterns comingfrom the in TROPESis the possibility to automatically obtain a web DEXIFLYdatabase [Horn et aL, submitted]. server, allowing, without any effort from the developer, to manipulate the base. This is the subject of that section Interaction networks while the next one concentrates on the specific algorithms In the KNIFEsense, a networkis nothing more than a set of developedfor KNIFEand integrated in the website. interactions. Its interactions field contains the set of interaction objects involved in the network. It is Webavailability identified by a name. Ideally the networks should be bound TROeF_.~can be transformed into a HTTPserver [Euzenat to a context expressedthrough a set of fields: 1996], which means that the knowledge base is type is the genetic context of the embryo(wild type or automatically turned into a Website. Each TROPESentity mutant for a particular gene); (e.g. object, concept, concept field or class) is provided development-stage speaks for itself; with an URLand a display function which generates a beginning and end are the be~nningand end of egg- HTMLpage (which may contain references to the other length wherethe network is supposedto occur. entities through their own URL). Whensuch a page is The use of the exceptions field, containing a subset of the displayed by a HTI’Pclient (or browser), if the user selects

Euzenat 113 Figure4. Anetwork. The list of fields is displayedabove a graphic representationof the network.Normal arrows (-->) indicate activatinginteraction, T-endingarrows (--I) andinhibiting interactionand others (-->1) a twofoldinteraction. Thegraph applet can manipulatedby the usersso that the displaysuits their needs. a particular hypertext link in the page, this will fetch the program which is run in the H’I’FP browser) which draws page corresponding to the embeddedURL and this page graphic pictures of the considered entity. An important will be loaded in the client. Each page is generated on aspect of this is the possibility to automatically generate a demandfrom the object in the base. call to an applet with the correspondingdata. This function This capability alone has several advantages since it also offers the opportunity to generate automatically forms generates a site free of dangling links and allows the that will send a specific query to an independent database remote access to the knowledge base without special without the burdenfor the user to construct the query. This training or special computer model. But other, more is used in KNIFEfor providing access to the remote elaborate, advantages comewith someadditional work. resources familiar to the user community(FLYBASE for genetics, EMBLand SWlSSPROTfor gene and protein First, it is possible to provide annotations to TROPES sequences and GIF-DBfor interactions). entities. These annotations can be HTMLfiles which will be incorporated in the usual page generated by TROPESin Queries order to documentthe displayed object. This provides a From a knowledge base described according to the first level of enrichment of the Webpages. The included presentation above, it is possible to apply operations page can be arbitrarily complex (containing U RL or provided by the knowledgerepresentation system. So far, images). TROPESprovides two main operations: classification and Moreover,if the default display pattern of the pages is not filtering. Classification consists, for a particular object, in suited to the current task, it is possible to redefine it. Figure identifying the classes to whichit can be attached; filtering 3 and Figure 4 showsuch redefinition with the call, during consists, froma particular class, in identifying the objects the generation of the page, to a specialised applet (a which could be attached to it. For their respective

~14 ISMB-97 Figure5. Afiltering: the result of filtering the proteinsof averagesize whosebiochemical function is a transcriptionfactor is: DECAPENTAPLEGIC,SEX COMBSREDUCED, WINGLESS, TAILLESS, RUNT, DORSALand BICOID. purposes, these operations comparethe values in the object interactions. fields (whatever they are: string, number,objects...)to the constraints attached to the class fields. Networkcreation and traversing TROPESallows to use these reasoning mechanismsin order The KNIFEknowledge base page contains a panel with to issue queries against the base through the Web. For several operations which apply to the whole base and are example, filters are used for selecting objects which are specific to the application. First several buttons allow the attached to certain classes and contain particular field interactive creation of new networks through the direct values. These filters are accessible through the HTrPserver selection of the involved interactions or the selection of and allow a moresophisticated search than full text search. products which must be source or/and target of the For instance, it is possible to ask for all the instances of interactions. This allows to create and store sub-networks protein, average size which are classified as in and of particular interest (e.g. the gap-genenetwork). whose biochemical-functionis transcription- KNIFEalso allows the enumeration of the paths in a factor. The result is provided in Figure 5. networkbetween a particular product and a particular gene (selected interactively). For instance, there are 51 possible Network manipulation ways for the BICOIDprotein to regulate the hunchback gene from the interactions stored in the base (Figure 6). The manipulation features described so far are provided in The implemented algorithm is a classical three-passes a standard way by TROPES.No particular programmingis traversal which proceeds in O(INI) where N is the set required, but the knowledgebase description. In order to nodes in the graph (proteinobjets). It is generally faster go further somealgorithms have been developed specially than the time required for displaying the result. for the needs of KNIFE.They consist in a graph traversal algorithm and an algorithm for diagnosis of missing

Euzenat 115 Figure6. Graphtraversing providesall the paths of length less than or equal to 7 betweenthe BICOIDprotein and the hunchback gene. Then, the system will build a network for each interval Diagnosis of missing interactions obtained and each network will contain the interactions The diagnosis of missing interactions is a new algorithm whose source is a product expressed in the concerned whichcould be invaluable for building networks. Its aim is region. For instance, and for the interaction contained in the detection of interactions which should have activated the KNIFEbase, the interactions involved between48% and (resp. inhibited) the expression of a gene while this 49%of the egg length are (--> meansactivation, --I means expression is not (resp. is) observed. For that purpose, the repression): user selects the type of fly (e.g. wild type or mutant) and GF2 i: even-skipped--> even-skipped the developmental stage considered (syncitial or cellular 0 10 20 30 40 50 60 70 80 90 100 blastoderm in the current state of the base). It is also I I F possible to select a particular area (expressed by a segment ’i’ili~2iii:.~ii!ii:ii~ Deform’ed i of the egg) of the fly. The algorithm will then fetch the L expression of all the knowngenes at that stage for that runt ’ individual and construct a division of the embryobased on fushi tarazu ,NI ’ I all the patterns. For instance, if we consider the genes n Deformed,runt, fushi tarazu, Antennapediaand ¯~.,.:.::-.: f-:.:.: :a I Antennapedia even-skipped, whose expression patterns (for the wild i , even-skipped animalat cellular blastoderrn stage) are shownin Figure 7, ... I ~ I I I the result is: 4849% 0-9-11-16-17-20-24-25-28-32-33-36-40-41-44-45-48-49- 52-56-57-60-63-64-65-68-75-100 Figure7. Expressionpatterns andthe consideredinterval between 48 and 49%of the egglength. n6 ISMB-97 GF22: even-skipped--] fushi tarazu KNIFEis the first knowledgebase which is devoted to the GF64: Antennapedia --[ Sex combs reduced description of gene interactions and their networks. Upto GF20" even-skipped --> engrailed now, the amountof biological data present in the base is GF63: Antennapedia--> Sex combs reduced not yet sufficient to get completelysignificant results with GF59: even-skipped--> Deformed the algorithms. This conclusion is in fact the result of the GF40: runt --I even-skipped use of the algorithms themselves, since they are able to Afterward, for each network, the system will point out the detect inconsistencies in the data. Another problemis that products which are expressed although they are inhibited the missing interactions algorithm is too simplistic at the by an active interaction (and activated by no active momentsince biological interactions cannot be completely interaction) or those whichare not expressed although they described using simple boolean formulas. However, the are activated by an active interaction (and inhibited by no above algorithm has been designed as a tool for pointing active interactions). In the formerexample, the interactions out possible problems and should not be considered as an GF20 and GF59are in such a case. For instance, GF20tells interaction simulator. that if even-skipped is present, it will activate the Asa matter of fact, the simulation of interactions in the fly expression of engrailed, but engrailed is not present embryois a long-term objective of such a research. No and no other product seems to inhibit it. This meansthat simulation algorithm has been developed so far in the the base is incomplete on that interaction: either KNIFE knowledge base and this is due to a double engrailed is expressed (and so its expression pattern is requirement:in order to be tractable, the problemshould be wrong) or somethingrepresses its expression. simplified; in order to be useful, it must not be simplified On another hand, there are no exception on Sex combs too much. There are several possible approaches that one reduced which is activated by Antennapedia because could envision in order to tackle the problem. there is another repression interaction (from Antennapedia tOOin that case). The first one is boolean simulation whichconsists, from a These interactions are put in the exceptions field of the network and a set of products (supposed to be present at network. This only meansthat the content of the base does the beginning), in generating for each possible product the not explain the observations. This is an invitation for the status: present, absent or indeterminate. This type of researcher to add new knowledgeabout interactions in the simulation does not raise any problem but complexity. system or to plan an experimentin order to gather evidence However,it is not really accurate since it does not account for this lack of knowledge. for threshold, efficiency and time, which are important If there exists a bound to the number of intervals in a issues in developmentalbiology. pattern,2) the complexity of the algorithm is in O(INI*IAI The second one does take into account the fact that the whereA is the set of interactions. Thereare several waysin interactions do not act in a uniform waydepending on the which this algorithm could be improved: by targeting the concentration of the source in the cell. Moreover, the diagnosis on a particular gene or a particular network. efficiency and result of interactions is not all-or-nothing but can vary depending on that concentration. Simulating the networkon that basis wouldrequire the introduction in Discussion the base of the knowledge about the thresholds and The KNIFEknowledge base described above presents efficiency as well as the initial quantity of product. The several noticeable features: simulation could then be produced in a straightforward ¯ the data is type-checked(detection of type-mismatchor manner(provided that we are able to combinethe threshold misspelling for instance); and efficiency of two interactions on the sameproduct) or ¯ it can be browsedand madeavailable in a client-server in a more sophisticated way [Thomas1991 ]. However,this fashion (and this is due to the use of a formal solution has the flaw of not considering the time necessary representation language); for producing the product and then fails to understand the ¯ it allows the linking of these data with other knowledge completedevelopment of interactions. sources (other bases, raw data or bibliographic data) The deeper possible simulation takes the temporal while offering its ownperspective on the data; phenomenon into account. It considers not only the ¯ it provides efficient ways of manipulating the data threshold effects and interaction efficiency but also the throughfilters or classification; different delays necessary first to transcribe a gene and to ¯ it allows the usage of specialised algorithms whichtake obtain a functional protein and, second, for that protein, to their input in the base and deliver their output to the regulate a downstreamtarget. A problematic point with this base. approach is the present scarcity of biological data. It is Froma biological point of view, it has to be noted that perhaps also interesting to note that the problem of the simulation of regulatory networks has somesimilarities,

Euzenat n7 from a formal point of view, with that of metabolism backbones. In proceedings 5th WWWworkshop on simulation. Since this latter problemis at the momentthe "artificial intelligence-based tools to help W3users", Paris object of intense research work [Karp and Mavrovouniotis fFR) 1994; Karpet al. 1996a; Karp et al. 1996b]it is hopedthat http : //~aw. inrialpes,fr/sherpa/papers / euzenat9 6 future progress in this area will benefit to regulatory a. html network simulation research. Fasano, L., Rrder, L., Corr, N., Alexandre, E., Vola, C., Jacq, B., and Kerridge, S. 1991. The gene teashirt is There are three main streams in the future developmentof required for the development of Drosophila embryonic trunk segments and encodes a protein with widely spaced the system. The most important one is the feeding of the zinc finger motifs. Cell 64:63-79 base with more data. This new data will allow to test the implemented algorithms in context and to evaluate the The Flybase consortium 1996. FlyBase: a Drosophila coverage of the data available. The second direction for database. Nucleic acid research 24:53-56 improvementis in the user interface: at the moment,a Gaul, U., and J/ickle, H. 1990. Role of gap genes in early generic interface is proposed by the TROPESsystem, but it development. Adv. Genet. 27:239-275 wouldbe better to re-design the actual pages so that they Hoogland, C., and Bi4mont, C. 1997. Drosoposon: a are more adapted to a biologist end-user. This means knowledge base on chromosomal localization of inclusion of some biological documentation pages and transposable element insertions in Drosophila. CABios direct links to bibliographic sources for instance. Thethird i 3(1):61-68 important aspect is the development of new algorithms. Jacq, B., Horn, F., Janody, F., Gompel,N., Serralbo, O., This covers algorithms for simulating various aspects of Mohr, E., Leroy, C., Bellon, B., Fasano, L., Laurenti, P., genetic regulation and also algorithms for comparing and Rrder, L. 1997. GIF-DB: a WWWdatabase on gene regulation networks between two different development interactions on gene interactions involved in Drosophila stages or two different organisms. melanogaster development. Nucleic Acids Res. 25:67-72 Karp, P., and Mavrovouniotis, M. 1994. Representing, analyzing, and synthetizing biochemical pathways. IEEE Acknowledgements Expert 9(2): 1 1-22 http://www.ai.sri.com/cgi- bin/pubs/papers / Karp9 4- This work is supported by the ACC-SVI3of the French 11 : Representing/document.ps ministry of education and research (MENESR),by CNRS and by INRIA. Many thanks to Ulfar Erlingsson and Karp, P., Riley, M., Paley, S., and Pelligrini-Toole, A. Mukkai Krishnamoorthy (Rensselaer, Troy) for the 1996a. Nucleic Acids Res. 24:32-40 graphdraw Java applet used in the KNIFE base Karp, P., Ouzounis, C., and Paley, S. 1996b. HinCyc: a (http : //’~v~. cs. rpi. edu/pro jects/pb/graphdraw)and knowledge base of the complete genome and metabolic Sylvain Chicois for the adaptation of this applet to the pathways of H. influenzae. In proceedings 4th ISMB, St Louis (MO US)http://www.ai.sri.com/cgi- needs of KNIFEand a first implementationof the diagnosis bin/pubs/ papers /Karp96 : HinCyc / document,ps algorithm. Wealso would like to thank Florence Horn and Laurence Rrder for all gene spatial expression data, Lewis, E. 1978. A gene complexcontrolling segmentation Laurent Fasano and Michel Page for helpful comments. in Drosophila. Nature, 276:565-570 Finally, the authors would like to thank Fran~;ois Marifio, O., Rechenmann, F., and Uvietta, P. 1990. Rechenmannfor continuous support, fruitful discussions Multiple perspectives and classification mechanismin and helpful comments. Object-oriented Representation. In proceedings 9th ECAI, Stockholm (SE), pp425-430 OMIMTeam 1996. Online Mendelian Inheritance in Man References (OMIM),a catalog of humangenes and genetic disorders. Johns Hopkins University, Baltimore (MD US) and Alexandre,E., Graba, Y., Fasano, L., Gallet, A., Perrin, L., De Zulueta, P., Pradel, J., Kerridge, S., and Jacq, B. 1996. National Center for Biotechnology Information, Bedestha (MD US) http : //www3. ncbi. nlm. nih. gov/omim/ The Drosophila teashirt homeotic protein is a DNA- binding protein and modulo, a HOM-Cregulated modifier Niisslein-Volhard, C., and Wieschaus, E. 1980. Mutations of variegation, is a likely candidatefor being a direct target affecting segment number and polarity in Drosophila. gene. Mech. dev. 59:19 !-204 Nature 287:795-801 Bairoch, A., and Apweiler, R. 1996. The SWISS-PROT Pankratz, M., and J~ickle, H. 1993. Blatoderm protein sequence data bank and its new supplement segmentation. In Bate, M., Martinez-Arias, A., eds., The TREMBL.Nucleic Acich" Res. 24:21-25 development of Drosophila melanogaster, pp467-516 Cold Euzenat, J. 1996. Knowledge bases as Web page Spring Harbor Laboratory press

118 ISMB-97 Perri~re, G., Dorkeld, F., Rechenmann,F., and Gautier, C. SHERPA project 1995. TROPES1.0 reference manual, 1993. Object-oriented knowledgebases for the analysis of Internal report, INRIA Rhbne-Alpes, Grenoble (FR), prokaryotic and eukaryotic genomes. In proceedings 1st 85p. ftp : / / ftp. inrialpes,fr/pub/sherpa/rapports / t ISMB, Bethesda (MDUS), pp319-327 ropes-manual,ps. gz Rice, C., Fuchs, R., Higgins, D., Stoehr, P., and Cameron, Thomas, R. 1991. Regulatory networks seen as G. 1993. The EMBLNucleotide sequence Database. asynchronous automata: a logical description. J. theor. Nucleic Acids Res. 21:2967-2971 Biol. 153:1-23 Rrder, L., Vola, C., and Kerridge, S. 1992. The role of the teashirt gene in trunk segmental identity in Drosophila development. Development 115:1017-1033

Euzenat u9