From: ISMB-98 Proceedings. Copyright © 1998, AAAI (www.aaai.org). All rights reserved.

TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources.

Patricia G. Baker (a), Andy Brass (a), Sean Bechhofer (b), [...] (b), Robert Stevens (b)

(a) School of Biological Sciences, Stopford Building, University of Manchester, Oxford Road, Manchester, M13 9PT, U.K. Telephone: 44 (161) 275 6142. Fax: 44 (161) 275 6236.
(b) Department of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PT, U.K. Telephone: 44 (161) 275 2000. Fax: 44 (161) 275 5082.

pbaker@manchester.ac.uk, [email protected], abrass@manchester.ac.uk, norm@cs.man.ac.uk, seanb@cs.man.ac.uk, stevensr@cs.man.ac.uk

Abstract

The TAMBIS project aims to provide transparent access to disparate biological databases and analysis tools, enabling users to utilize a wide range of resources with the minimum of effort. A prototype system has been developed that includes a knowledge base of biological terminology (the biological Concept Model), a model of the underlying data sources (the Source Model) and a 'knowledge-driven' user interface. Biological concepts are captured in the knowledge base using a description logic called GRAIL. The Concept Model provides the user with the concepts necessary to construct a wide range of multiple-source queries, and the user interface provides a flexible means of constructing and manipulating those queries. The Source Model provides a description of the underlying sources and mappings between terms used in the sources and terms in the biological Concept Model. The Concept Model and Source Model provide a level of indirection that shields the user from source details, providing a high level of source transparency. Source independent, declarative queries formed from terms in the Concept Model are transformed into a set of source dependent, executable procedures. Query formulation, translation and execution is demonstrated using a working example.

Introduction

The biological community is a distributed one with a culture of sharing and rapid dissemination of information. Each separate area of molecular biology generates its own data and therefore its own information sources, including those for protein sequences, genome projects, DNA sequences, protein structures and motifs. Also available are a range of specialist interrogation and analysis tools, each typically associated with a particular format. Frequently the information sources have different structures, content and query languages; the tools have no common user interface and often only work on a limited subset of the data.

When biologists need to ask questions of multiple sources they must perform the following tasks during query formulation and execution:
• identify sources and their locations
• identify the content/function of sources
• recognise components of a query and target them to appropriate sources in the optimal order
• communicate with sources
• transform data between source formats
• express syntactically complex queries
• merge results from different sources.

Many biologists still use collections of stand-alone resources (many of which are Web-based) to formulate and execute queries. This means that all of the tasks listed above must be carried out by the user. This places a burden on biologists, most of whom are not Bioinformatics experts, and limits the use that can be made of the available information. The greater the number of the above tasks that are taken on by the system, the greater the transparency of the overall task of query formulation and execution. There are examples in the biological community of systems which seek to relieve the user of some of this burden by easing access to multiple, heterogeneous information sources; however, these systems vary in their degree of transparency.

The Sequence Retrieval System (SRS) (Etzold 1996), for example, attempts source interoperation using predefined hypertext links with which the user can navigate between sources. A form-based interface allows the user to ask complex, although restricted, queries over multiple sources that are executed simultaneously. Queries composed of sub-queries, which have to be executed in a given order, must be issued separately by the user, and the results of one sub-query piped into another by hand. While SRS provides the user with transparency from communication (i.e. location, connection protocols and query language) with sources, it does not hide them, and provides no guidance as to which source is most appropriate for a given query.

The Collection Programming Language (CPL) is a functional programming language that allows data to be described and manipulated as complex data types such as sets, lists and records. These data types are suitable for modelling biological data and have been used to do so in the BioKleisli system (Buneman 1995), where biological sources are given a CPL driver and a set of functions to manipulate data [...]

The Transparent Access to Multiple Bioinformatics Information Sources (TAMBIS) project aims to provide the user with the maximum source transparency using (i) a canonical representation of biological terminology against which the user can formulate queries and (ii) mappings from terms in the representation onto terms in external sources. TAMBIS therefore provides a level of indirection between the user and the external sources which removes from the user the necessity to perform the tasks listed above. In order to do this TAMBIS adopts the three layer model of the classical mediator/wrapper architecture (Wiederhold 1992), a schematic view of which is illustrated in figure 1.

Layer 1 comprises a knowledge base (conceptual model*) of biological terminology, and a knowledge-driven user interface. Using the interface, the user combines terms from the knowledge base to form declarative, source independent queries. Layer 2 is a mediation layer that (i) identifies the appropriate sources to satisfy a query and (ii) rewrites the query to a series of source dependent ordered procedures. Layer 3 comprises external sources wrapped with a consistent structural model, providing a common interface that affords communication and network transparency. Where possible TAMBIS exploits existing technologies. Layer 3, therefore, currently utilises [...]

* Because the biological knowledge base is a conceptual model of biological terminology, the words 'concept' and 'term' are used interchangeably in this paper.

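The division of labour between the three layers can be illustrated with a small Python sketch. It is not TAMBIS code: the class names (Wrapper, Mediator), the toy sources and their data are all hypothetical, and the declarative query of Layer 1 is reduced here to an ordered list of concept names. The sketch only shows wrapped sources sharing one call interface (Layer 3) and a mediator that maps concepts to sources and pipes results from one call into the next (Layer 2).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Wrapper:
    """Layer 3 (hypothetical): an external source behind one common interface."""
    name: str
    answers: str                    # the concept this source can supply
    fetch: Callable[[list], list]   # takes upstream results, returns new results


class Mediator:
    """Layer 2 (hypothetical): maps concepts to wrapped sources and runs them."""

    def __init__(self, wrappers: List[Wrapper]):
        self.by_concept: Dict[str, Wrapper] = {w.answers: w for w in wrappers}

    def plan(self, concepts_in_order: List[str]) -> List[Wrapper]:
        # Fail early when no wrapped source is mapped to a requested concept.
        missing = [c for c in concepts_in_order if c not in self.by_concept]
        if missing:
            raise LookupError(f"no wrapped source for: {missing}")
        return [self.by_concept[c] for c in concepts_in_order]

    def execute(self, concepts_in_order: List[str]) -> list:
        results: list = []
        for wrapper in self.plan(concepts_in_order):
            # The user never pipes results by hand; the mediator does it.
            results = wrapper.fetch(results)
        return results


# Toy stand-ins for real sources; the data is invented for illustration only.
protein_source = Wrapper("toy-protein-db", "Protein", lambda _: ["P1", "P2"])
motif_source = Wrapper(
    "toy-motif-db", "Motif",
    lambda proteins: [f"motif-of-{p}" for p in proteins],
)

mediator = Mediator([motif_source, protein_source])
# Layer 1 would supply a declarative query; here it is an ordered concept list.
answer = mediator.execute(["Protein", "Motif"])
```

In this simplification the evaluation order is supplied by the caller; in the architecture described above, choosing that order is itself part of the mediation layer's job.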
[Figure 1 diagram. Layer 1, query formulation: knowledge-driven graphical user interface and biological Concept Model, producing a declarative query. Layer 2, query planning and translation (source mediation): query transformation using the Source Model, producing an ordered execution. Layer 3, query execution: wrapped sources.]

Figure 1. TAMBIS three layer, mediator/wrapper architecture.

In order to share standardised and unambiguous information, controlled vocabularies, or terminologies, can be used as a framework for expressing and communicating ideas in a consistent manner. The TAMBIS biological Concept Model describes such a terminology. This knowledge base covers terms associated with proteins and nucleic acids, their component parts and their structures, biological functions and processes, tissues and taxonomy. The terminology has two key aspects:
• it is compositional, resembling a dictionary of elementary terms that are assembled according to a restricted grammar to form new complex composite terms. These composite terms can in turn be components in new compositional terms, so the terminology is recursive. For example, the term 'Motif' can be combined with the terms 'isComponentOf' and 'Protein' to create a new composite term 'Motif which isComponentOf Protein'. This in turn could be combined with the terms 'hasFunction' and 'Hydrolase' to form a composite term 'Motif which isComponentOf Protein and hasFunction Hydrolase'; this term is both a concept and a query.
• it is a classification scheme that organises terms into a hierarchy based on the 'isa' relationship (also known as the subsumption relationship). For example, ProteinSequence 'isa' more specialised kind of Sequence.

To be truly effective, such a terminology needs to be represented in a scheme that can reason about the inferred relationships between terms and their components, can control the formation of terms, and can automatically classify terms based on their components so that the hierarchy takes care of itself. As terms are changed the scheme should also dynamically reclassify them to ensure the hierarchy's correctness.

Description Logics (DL), also known as Terminology Logics, are a family of logics explicitly designed to represent taxonomic and conceptual knowledge of an application domain on an abstract level; for an overview see (Borgida 1995). DLs are usually given a Tarski style declarative semantics, which allows them to be seen as sub-languages of first order predicate logic. In the TAMBIS project we use the GRAIL DL (Rector 1996), developed at Manchester. Briefly, a DL is an 'isa'-based classification system that allows a recursive, compositional model to be built from terms and binary relations. A base term can be combined with any number of relation-term pairs (or criteria) to create a more complex term. Any of these terms can be composite (complex) or elementary. Figure 2 gives a small fragment of the GRAIL classification, omitting the term constructors. In this example 'Motif' is the base term and 'isComponentOf Protein' is the criterion with which it is combined.

GRAIL supports the automatic classification of concepts into 'isa' hierarchies by reasoning about the component descriptions of the concepts. Therefore, 'Protein Motif' would be classified automatically as a child of 'Motif' and a parent of 'Poecilia Reticulata Protein Motif' based on its definition. Only 3 of the 11 'isa' relationships shown in figure 2 have been hand-crafted by the knowledge modeller. DLs support multi-dimensional classification so that the same concept can be classified in many ways, thus allowing for the different user views of a concept. The classification is dynamic, so as the description of a concept is further elaborated it is automatically [...]

[...] expressive than most other DLs, but it compensates for this by supporting a powerful set of assertion axioms and a multi-layer sanctioning mechanism. These sanctions decree whether two concepts are permitted to be related via some relationship and so constrain the construction of complex concepts. Sanctions ensure that only semantically valid concepts are formed and that a large number of complex concepts can be inferred from a sparse model. As only reasonable concepts can be inferred from the model, the user is allowed to construct only those queries that it is reasonable to ask. For example, in figure 2, asserting that 'SequenceComponent isComponentOf Protein' is legal is sufficient to infer 'Motif isComponentOf Protein' without having to create it or position it until it is asked for. Therefore, only a small number of constraints need be asserted in order that a large number of concepts can be inferred.

In TAMBIS the biological Concept Model is used to:
• describe the metadata of the underlying data sources, representing an over-arching universal schema
• express queries in the modelling language
• drive a GUI user interface for query formulation
• mediate between the various data sources by exploiting the biological concept hierarchy to assist in the identification and resolution of equivalences or near equivalences - similar approaches have been taken in non-biological projects, for example SIMS (Arens 1993).
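The interplay of composition, sanctioning and automatic classification described above can be made concrete with a toy Python sketch. It is an illustration under strong assumptions, not GRAIL: the two hand-asserted 'isa' links and two sanctions are a guessed subset of figure 2, criteria targets are restricted to elementary terms, and the names is_a, sanctioned, specialise and subsumes are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

# Hand-asserted elementary 'isa' links, child -> parent (assumed toy subset).
ISA = {"Motif": "SequenceComponent", "Hydrolase": "Function"}

# Sanctions: (base, relation, target) combinations declared legal (assumed).
SANCTIONS = {
    ("SequenceComponent", "isComponentOf", "Protein"),
    ("SequenceComponent", "hasFunction", "Function"),
}


def is_a(term: str, ancestor: str) -> bool:
    """True if term equals ancestor or reaches it by following 'isa' links."""
    current: Optional[str] = term
    while current is not None:
        if current == ancestor:
            return True
        current = ISA.get(current)
    return False


@dataclass(frozen=True)
class Concept:
    base: str
    criteria: FrozenSet[Tuple[str, str]] = frozenset()  # (relation, target)

    def __str__(self) -> str:
        if not self.criteria:
            return self.base
        parts = " and ".join(f"{r} {t}" for r, t in sorted(self.criteria))
        return f"{self.base} which {parts}"


def sanctioned(base: str, relation: str, target: str) -> bool:
    """A sanction on general terms is inherited by their specialisations."""
    return any(
        rel == relation and is_a(base, b) and is_a(target, t)
        for b, rel, t in SANCTIONS
    )


def specialise(concept: Concept, relation: str, target: str) -> Concept:
    """Add one criterion, refusing combinations that are not sanctioned."""
    if not sanctioned(concept.base, relation, target):
        raise ValueError(f"not sanctioned: {concept.base} {relation} {target}")
    return Concept(concept.base, concept.criteria | {(relation, target)})


def subsumes(general: Concept, specific: Concept) -> bool:
    """Structural 'isa' test: every criterion of general is met by specific."""
    if not is_a(specific.base, general.base):
        return False
    return all(
        any(r2 == r and is_a(t2, t) for r2, t2 in specific.criteria)
        for r, t in general.criteria
    )


motif = Concept("Motif")
protein_motif = specialise(motif, "isComponentOf", "Protein")
hydrolase_motif = specialise(protein_motif, "hasFunction", "Hydrolase")
```

Only the sanction on 'SequenceComponent' is asserted, yet 'Motif which isComponentOf Protein' can be built and is placed between 'Motif' and the more specialised hydrolase motif by the subsumes test; an unsanctioned combination such as specialise(Concept("Protein"), "isComponentOf", "Motif") raises ValueError.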

[Figure 2 diagram. Elementary terms: Protein, Function, Organism, SequenceComponent, Motif, Hydrolase, Poecilia reticulata. Relationships: isComponentOf, hasFunction, hasOrganismSource. Composite terms: SequenceComponent isComponentOf Protein, SequenceComponent hasFunction Hydrolase, Motif isComponentOf Protein, Motif hasFunction Hydrolase.]

Figure 2. A simplified fragment of the TAMBIS GRAIL model showing the power of auto-classification: the only 'isa' relationships that have been 'hand-crafted' by the knowledge worker are indicated by the solid arrows. All the other terms are implied by the sanctioning scheme and automatically and dynamically classified upon request, as indicated by the broken arrows. The solid lines indicate the sanctioned relationships between terms. It is these relationships that allow the construction of all of the composite terms shown.

[...] tool. Figure 3 shows the navigator focused on the concept 'Protein Structure'. The concept currently in focus occupies the center of the frame and related concepts from the Knowledge Base are displayed around it. The model may be browsed by promoting any of the related concepts to be the central concept. The new central concept is then surrounded by all its related concepts.

Figure 3. TAMBIS prototype user interface navigation tool showing the navigation of the concept 'Protein Structure'. The central term is surrounded by related terms. Each related term is coloured according to its relationship with the central term. There are four possible relationships: parent terms - concepts immediately above it in the hierarchy with which it has an 'isa' relationship, e.g. 'Structure'; child terms - concepts immediately below it in the hierarchy which have 'isa' relationships with it, e.g. 'Protein Tertiary Structure'; defining terms - relation-term pairs that form part of its definition, e.g. 'is structure Protein'; sanctioned terms - concepts with which it has appropriately sanctioned relationships but which do not form part of the concept's definition, e.g. 'is determined by Method of Determining Structure'.

Having identified a concept of interest, for example 'motif', the user may want to form a query based on that concept. A Query Manipulation tool gives the user an option to add more information about the concept (or specialise the concept) by presenting all the legitimate criteria that can be applied to the concept 'motif' (see figure 4).

Figure 4. An example from the TAMBIS user interface prototype showing the relationships that can be used to specialise the concept of 'motif'.

The user may choose one or many of these criteria. If they choose, for example, 'isComponentOf Protein', the query is equivalent to the English expression "find all protein motifs". Having constructed the query, the user may manipulate the whole query or any of its component sub-queries by (i) the addition or removal of criteria or (ii) replacement of terms with more specialized or more general terms. Figure 5a shows a query that has been built by further specialisation of the term 'Protein' in the above query by addition of the criterion 'hasOrganismSource PoeciliaReticulata'. The query is equivalent to the English expression "find all motifs occurring in guppy proteins".

It is important to appreciate that in TAMBIS the term concept is interchangeable with the term query. Therefore, in constructing a concept (a description of what the user wants) the user is constructing a query ("what things exist that fit the description I have just given?").

Query Planning and Translation

Queries expressed in GRAIL are declarative and source independent. GRAIL queries thus specify what information is required, but neither how it should be obtained nor from where. It is the role of the query planning and translation layer to provide this additional information. This layer takes as input a GRAIL query and generates as output an execution plan in CPL. The planning and translation process is broken into three main steps:
• Translation into a Query Internal Form (QIF): The GRAIL query is unnested and certain query constructs are simplified.
• Query Planning: A search algorithm considers alternative evaluation orders for the components of the QIF generated at step 1, with a view to identifying both valid and efficient ways of evaluating the query.
• Code Generation: The query plan that results from the planning phase is converted into a CPL program for execution.

The following subsections elaborate on the above steps, both detailing what is done at each stage and outlining the auxiliary data structures that are required.

Translation into Query Internal Form (QIF). GRAIL queries are intrinsically nested structures. However, nested language structures generally imply some evaluation order, so we follow a number of earlier query planners in unnesting the source query prior to query optimisation (Paton 1990, Fegaras 1997). The QIF is a list of query components, each of which is a tuple (Base, Variable, Criteria, Cost, Cardinality) representing the evaluation of part of the query. Base is the base concept of the component, Variable is the name of the variable used to store values retrieved as a result of evaluation of the component, Criteria represents the set of criteria associated with Base, Cost is an estimate of the cost of evaluating the component, and Cardinality is the size of the collection that it is anticipated will result from evaluating the component. Values for Cost and Cardinality are computed by the planner. Figure 5a shows an example query that is equivalent to the English query "find all motifs in Poecilia reticulata (guppy) proteins". The GRAIL representation of this query is 'Motif which isComponentOf (Protein which hasOrganismSource PoeciliaReticulata)' (figure 5b). The initial QIF of this query is shown in figure 5c.

Each term (concept) in a criterion is itself represented as a query component and is associated with the variable used to store instances that result from the evaluation of the component. The other form of mapping that takes place during the translation to QIF is the simplification of components where appropriate - for example the removal of query components which exist only to support certain modelling strategies employed by the knowledge worker. The mapping into the QIF is defined as a set of rewrite rules of the form:

rewrite <concept template>
as <QIF component>
if <condition>

The concept template is capable of matching concepts with specific structures in the biological Concept Model, the QIF component is as described above, and the condition makes tests involving the Concept Model and the Source Model. However, conditions never refer to the specific functions that may be used to evaluate a query, as planning is the sole preserve of the planner described below.

Query Planning. The query planner seeks to identify both legal and efficient ways of evaluating queries given the available CPL functions. The planner exploits the augmentation heuristic (Swami 1989), which essentially involves examining all the query components in a query, selecting the most promising for initial evaluation, and repeating the process for the remaining components. The Source Model is central to the planning process, as it indicates which CPL functions can be used to evaluate which query components. Lack of space prevents a detailed description of the Source Model, but the following are the principal components:
• Concept Iteration: Concept iteration information is a triple (Concept, FunctionSignature, ArgumentMapping), where Concept is a concept from the Concept Model, FunctionSignature is the signature of a CPL function, and ArgumentMapping is a description of how input parameters for the CPL function should be obtained.
• Criterion Evaluation: Criterion evaluation information indicates how the criteria of a concept can be evaluated in CPL. This is described using tuples of the form (Concept, Criterion, FunctionSignature, ArgumentMapping), where Concept is the base concept to which the criterion is applied, Criterion is the criterion in question and FunctionSignature and ArgumentMapping are as described for concept iteration.
• Coercion: CPL functions may retrieve values of different CPL types to represent the same concept. For example, retrieval of protein information from a specialist protein database such as SWISSPROT yields a complex record structure that contains significant amounts of information about the protein. Retrieval of information from a motif database such as PROSITE, however, is likely to yield only the accession numbers. This means that the query planner needs to know things like how to obtain a detailed description from an accession number and vice versa. Such relationships are described using tuples of the form: (CPL_type, CPL_type, mapping_function).
• Costing: Information on the anticipated cost of evaluating a CPL function and the likely cardinality of the result is stored using tuples of the form: (FunctionSignature, Cost, Cardinality).

The planner has two principal components, the search algorithm described at the start of this sub-section and a list of rules that indicate under which circumstances specific techniques may be used to evaluate a query component. Such rules are of the form:

rewrite <QIF Component>
as <Function List, Cost, Cardinality>
if <condition>
given <bound variables>

The QIF Component is as described above, the Function List is a list of CPL functions with bound arguments, the Cost is an estimate of how long it will take to evaluate each of the functions in the Function List, and the Cardinality is the total number of concepts given the set of bound variables. The condition invariably refers to the functions that are available in the Source Model and the set of bound variables.

Code Generation. The code generator takes as input an ordered list of query components and their associated functions, and generates a single CPL program that binds together the CPL functions. The code generator is straightforward, and makes a single pass through its inputs in generating the execution plan (figure 5d).

For result presentation, TAMBIS makes use of a CPL function that transforms its data structures into HTML for display using a WWW browser (figure 5e).

Project Status

The prototype Biological Model is well populated by concepts describing those areas required for the construction of common queries, such as queries about protein structure and nucleic acid coding signals. The model currently contains around 1500 concepts and has the capability to infer many more. The biological concept model will become better populated as the prototype system is used to elicit user requirements. The prototype user interface is currently implemented in SmallTalk. It has much of the functionality that it is envisaged will be needed in the final system, although the look and behaviour of the interface is likely to change as the final implementation will be in Java to facilitate its use on the World Wide Web. We are currently eliciting general user requirements from academia and industry by means of a questionnaire. This is ensuring that the Concept Model allows the formulation of the kinds of questions that biologists want to ask. No formal user evaluation of the prototype knowledge-driven user interface has yet been performed, although it is envisaged to play a major part in the development of the system. A Java implementation of the query transformation module is in place, although its accuracy has not yet been evaluated. A more sophisticated planner will be needed in the future. There are currently 15 wrapped sources including the BLAST suite of programs, SwissProt, Prosite, BLOCKS and PRINTS. The Source Model has mappings between a range of CPL functions acting on these sources and the corresponding concepts in the biological model. These mappings dictate the number of queries that can be answered by TAMBIS and so the development of a more comprehensive Source Model is the next priority task. As suggested by (Davidson 1995), this approach is high cost but high benefit, and there are still many challenges to address - issues such as: tools for adding new sources; changes in sources; incorporation of CORBA sources; dynamic query optimisation based on network performance; user intervention and results attribution.

Summary

The TAMBIS project is pursuing a novel approach that will yield an integrated solution to the problem of disparate biological databases and analysis tools. The common schema (Biological Knowledge Base) is represented in a Description Logic, presenting the user with a rich description of the domain from which they may flexibly and intuitively construct and modify queries. The queries are deconstructed, rewritten into a common query language and dispatched to one or more wrapped resources. The use of a knowledge base and wrapped resources removes the need for the user to know (i) which are the appropriate resources to use and (ii) how to access them, thus greatly reducing the time taken to analyze their data.

Acknowledgements

The TAMBIS project is funded jointly by the EPSRC/BBSRC Bioinformatics Programme and by Zeneca Pharmaceuticals, whose support we are pleased to acknowledge.

a) [image: query constructed in the knowledge-driven GUI]

b) Motif which isComponentOf (Protein which hasOrganismSource PoeciliaReticulata)

c) [ (Motif, Motif-1, [(isComponentOf Protein, Protein-1)], -1, -1),
     (Protein, Protein-1, [(hasSourceOrganism PoeciliaReticulata, null)], -1, -1) ]

d) { Motif-1 | \Protein-1 <- get-sp-entry-by-os("POECILIA+RETICULATA"),
               Motif-1 <- do-prosite-scan-by-entry-rec(Protein-1) }

e) [image: results displayed in a Web browser]

• name ASN_GLYCOSYLATION
  prositeid PS00001
  docid PDOC00001
  description N-glycosylation site
  pattern N[^P][ST][^P]
  match NFSR
• name CK2_PHOSPHO_SITE
  prositeid PS00006
  docid PDOC00006
  description Casein kinase II phosphorylation site
  pattern [ST].{2}[DE]
  match [...]

Figure 5. An example showing the stages in the information retrieval process using TAMBIS. a) The knowledge-driven GUI allows the user to construct a declarative, conceptual and source independent query. b) The query formulated at the interface is represented in GRAIL. c) The single GRAIL query is transformed into query internal form (QIF). d) The QIF is transformed into a functional, source-dependent query in CPL. e) The results from the CPL wrapped sources are presented to the user via a Web browser.
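The step from panel b) to panel c), and the evaluation order visible in panel d), can be mimicked with a short Python sketch. It is a simplified illustration, not the TAMBIS planner: the Query and Component classes and the functions unnest and order_for_evaluation are hypothetical, terms without criteria of their own are treated as constants with no variable (as the null entry in panel c suggests), and the ordering uses only variable dependencies, with no cost model, as a stand-in for the augmentation heuristic.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Query:
    """A nested GRAIL-like query: a base term plus (relation, sub-query) pairs."""
    base: str
    criteria: List[Tuple[str, "Query"]] = field(default_factory=list)


@dataclass
class Component:
    """One QIF tuple: (Base, Variable, Criteria, Cost, Cardinality)."""
    base: str
    variable: str
    criteria: List[Tuple[str, str, Optional[str]]]  # (relation, target, variable)
    cost: int = -1          # -1 marks "not yet estimated", as in panel c
    cardinality: int = -1


def unnest(query: Query) -> List[Component]:
    """Flatten a nested query into a list of components linked by variables."""
    counters: Dict[str, int] = {}
    components: List[Component] = []

    def fresh(base: str) -> str:
        counters[base] = counters.get(base, 0) + 1
        return f"{base}-{counters[base]}"

    def visit(q: Query, variable: str) -> None:
        component = Component(q.base, variable, [])
        components.append(component)
        for relation, target in q.criteria:
            if target.criteria:
                # A composite target becomes its own component with a variable.
                target_variable = fresh(target.base)
                component.criteria.append((relation, target.base, target_variable))
                visit(target, target_variable)
            else:
                # An elementary target is kept as a constant (assumption).
                component.criteria.append((relation, target.base, None))

    visit(query, fresh(query.base))
    return components


def order_for_evaluation(components: List[Component]) -> List[Component]:
    """Repeatedly pick a component whose referenced variables are all bound."""
    bound = set()
    remaining = list(components)
    ordered: List[Component] = []
    while remaining:
        ready = [c for c in remaining
                 if all(v is None or v in bound for _, _, v in c.criteria)]
        if not ready:
            raise ValueError("no evaluable component (cyclic dependency)")
        chosen = ready[0]
        ordered.append(chosen)
        bound.add(chosen.variable)
        remaining.remove(chosen)
    return ordered


guppy_motifs = Query("Motif", [
    ("isComponentOf", Query("Protein", [
        ("hasOrganismSource", Query("PoeciliaReticulata")),
    ])),
])
qif = unnest(guppy_motifs)
plan = order_for_evaluation(qif)
```

For this query the sketch yields the components Motif-1 and Protein-1 in that order, and the plan evaluates Protein-1 before Motif-1, matching the order of the two generators in panel d). A fuller planner would choose among the ready components using the Cost and Cardinality estimates rather than taking the first one.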

References

Arens Y., Chee C.Y., Hsu C-H., Knoblock C.A. Retrieving and Integrating Data from Multiple Information Sources, in Journal on Intelligent and Cooperative Information Systems, 2:127-158, 1993.

Borgida A. Description Logics in Data Management. IEEE Transactions on Knowledge and Data Engineering, 7(5): 671-682, 1995.

Buneman P., Davidson S.B., Hart K., Overton C. and Wong L. A Data Transformation System for Biological Data Sources. In Proceedings of VLDB, Sept. 1995 (Zurich, Switzerland).

Davidson S.B., Overton C., Buneman P. Challenges in Integrating Biological Data Sources, Journal of Computational Biology, Vol 2, No 4, 1995.

Donini F., Lenzerini M., Nardi D., Nutt W. The Complexity of Concept Languages, KR-91, pp 151-162, 1991.

Etzold T., Ulyanov A., Argos P. SRS: information retrieval system for molecular biology data banks. Methods Enzymol. 1996, 266: 114-128.

Fegaras L. An experimental optimizer for OQL. Technical Report TR-CSE-97-007, CSE, University of Texas at Arlington, 1997.

Karp P. A Strategy for Database Interoperation, in Journal of Computational Biology, 1996.

Kemp G.J.L. and Gray P.M.D. Using the Functional Data Model to Integrate Distributed Biological Data Sources, Proc. 8th Int. Conf. on Scientific and Statistical Database Management, IEEE Press, 176-195, 1996.

Markowitz V.M. and Ritter O. Characterizing Heterogeneous Molecular Biology Database Systems, Journal of Computational Biology, 2(4), 1995.

Paton N.W. and Gray P.M.D. Optimising and Executing Daplex Queries Using Prolog, The Computer Journal, Vol 33, No 6, 547-555, 1990.

Rector A.L., Bechhofer S., Goble C.A., Horrocks I., Nowlan W.A., Solomon W.D. The GALEN modelling language for medical terminology, in AI in Medicine, 1996.

Rector A. and Horrocks I. Experience building a Large, Re-usable Medical Ontology using a Description Logic with Transitivity and Concept Inclusions. AAAI Spring Symposium on Ontological Engineering, 1997.

Rodriguez-Tome P., Helgesen C., Lijnzaad P., Jungfer K. A CORBA server for the radiation hybrid database. Proceedings of the ISMB 1997, 5:250-253.

Wiederhold G. Mediators in the Architecture of future Information Systems, IEEE Computer 21(3), March 1992, pp. 38-50.
