TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources
From: ISMB-98 Proceedings. Copyright © 1998, AAAI (www.aaai.org). All rights reserved.

Patricia G. Baker (a), Andy Brass (a), Sean Bechhofer (b), Carole Goble (b), Norman Paton (b), Robert Stevens (b)

(a) School of Biological Sciences, Stopford Building, University of Manchester, Oxford Road, Manchester, M13 9PT, U.K.
Telephone: 44 (161) 275 6142
Fax: 44 (161) 275 6236
pbaker@manchester.ac.uk
abrass@manchester.ac.uk

(b) Department of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PT, U.K.
Telephone: 44 (161) 275 2000
Fax: 44 (161) 275 5082
[email protected]
norm@cs.man.ac.uk
seanb@cs.man.ac.uk
stevensr@cs.man.ac.uk

Abstract

The TAMBIS project aims to provide transparent access to disparate biological databases and analysis tools, enabling users to utilize a wide range of resources with the minimum of effort. A prototype system has been developed that includes a knowledge base of biological terminology (the biological Concept Model), a model of the underlying data sources (the Source Model) and a 'knowledge-driven' user interface. Biological concepts are captured in the knowledge base using a description logic called GRAIL. The Concept Model provides the user with the concepts necessary to construct a wide range of multiple-source queries, and the user interface provides a flexible means of constructing and manipulating those queries. The Source Model provides a description of the underlying sources and mappings between terms used in the sources and terms in the biological Concept Model. The Concept Model and Source Model provide a level of indirection that shields the user from source details, providing a high level of source transparency. Source independent, declarative queries formed from terms in the Concept Model are transformed into a set of source dependent, executable procedures. Query formulation, translation and execution is demonstrated using a working example.

Introduction

The biological community is a distributed one with a culture of sharing and rapid dissemination of information. Each separate area of molecular biology generates its own data and therefore its own information sources, including those for protein sequences, genome projects, DNA sequences, protein structures and motifs. Also available are a range of specialist interrogation and analysis tools, each typically associated with a particular database format. Frequently the information sources have different structures, content and query languages; the tools have no common user interface and often only work on a limited subset of the data.

When biologists need to ask questions of multiple sources they must perform the following tasks during query formulation and execution:
• identify sources and their locations
• identify the content/function of sources
• recognise components of a query and target them to appropriate sources in the optimal order
• communicate with sources
• transform data between source formats
• express syntactically complex queries
• merge results from different sources.

Many biologists still use collections of stand-alone resources (many of which are Web-based) to formulate and execute queries. This means that all of the tasks listed above must be carried out by the user. This places a burden on biologists, most of whom are not Bioinformatics experts, and limits the use that can be made of the available information. The greater the number of the above tasks that are taken on by the system, the greater the transparency of the overall task of query formulation and execution. There are examples in the biological community of systems which seek to relieve the user of some of this burden by easing access to multiple, heterogeneous information sources; however, these systems vary in their degree of transparency.

The Sequence Retrieval System (SRS) (Etzold 1996), for example, attempts source interoperation using predefined hypertext links with which the user can navigate between sources. A form-based interface allows the user to ask complex, although restricted, queries over multiple sources that are executed simultaneously. Queries composed of sub-queries, which have to be executed in a given order, must be issued separately by the user, and the results of one sub-query piped into another by hand. While SRS provides the user with transparency from communication (i.e. location, connection protocols and query language) with sources, it does not hide them, and provides no guidance as to which source is most appropriate for a given query.

The Collection Programming Language (CPL) is a functional programming language that allows data to be described and manipulated as complex data types such as sets, lists and records. These data types are suitable for modelling biological data and have been used to do so in the BioKleisli system (Buneman 1995) where biological sources are given a CPL driver and a set of functions to manipulate data. CPL/BioKleisli thus acts as a multidatabase language and provides ways of manipulating and piping results, allowing the user to formulate complex, ad hoc queries. The details of location and access to these data sources are hidden; however, the identification of which data source to use and the construction of the query in CPL is still left to the user. Comparable facilities are provided by P/FDM (Kemp 1996), which also uses a functional language, although P/FDM has a more object-oriented type system and has its own local database.

(Markowitz 1995) uses an object model, the OPM, as the common data model for the sources and a suite of OPM-based tools for exploring them. Each source either has an OPM schema or is retro-fitted with one via a view mechanism. A multidatabase directory describes how each database is linked to another. However, there is no attempt at hiding the databases from the user, who is still expected to identify them and navigate through them. Queries can be specified via a multidatabase query language OPM-QL, or using a Web interface.

The Transparent Access to Multiple Bioinformatics Information Sources (TAMBIS) project aims to provide the user with the maximum source transparency using (i) a canonical representation of biological terminology against which the user can formulate queries and (ii) mappings from terms in the representation onto terms in external sources. TAMBIS therefore provides a level of indirection between the user and the external sources which removes from the user the necessity to perform the tasks listed above. In order to do this TAMBIS adopts the three layer model of the classical mediator/wrapper architecture (Wiederhold 1992), a schematic view of which is illustrated in figure 1.

Layer 1 comprises a knowledge base (conceptual model*) of biological terminology, and a knowledge-driven user interface. Using the interface, the user combines terms from the knowledge base to form declarative, source independent queries. Layer 2 is a mediation layer that (i) identifies the appropriate sources to satisfy a query and (ii) rewrites the query to a series of source dependent ordered procedures. Layer 3 comprises external sources wrapped with a consistent structural model, providing a common interface that affords communication and network transparency. Where possible TAMBIS exploits existing technologies. Layer 3, therefore, currently utilizes CPL/BioKleisli although the long-term intention is to use CORBA wrapped services (Rodriguez-Tomé 1997).

* Because the biological knowledge base is a conceptual model of biological terminology, the words 'concept' and 'term' are used interchangeably in this paper.

Figure 1. TAMBIS three layer, mediator/wrapper architecture. [Diagram: Layer 1, query formulation (Knowledge-Driven Graphical User Interface, Biological Concept Model); Layer 2, query planning and translation (source mediation) (Query Transformation, Source Model); Layer 3, query execution (wrapped sources). Arrow labels: declarative query; ordered execution.]

The conceptual model is central to the TAMBIS architecture. Its use in driving query formulation and facilitating source integration, is novel in the biological domain. The emphasis on the model in this paper is, therefore, commensurate with its importance.

The Architecture

Although a detailed description of the TAMBIS architecture is outside the scope of this paper, a general overview is appropriate. The five main components of the TAMBIS architecture are:
• The biological Concept Model (knowledge base)
• The knowledge-driven graphical user interface (GUI)
• The Source Model
• The Query Transformation Module
• The Query Execution Module

The Biological Concept Model

Some bioinformatics researchers recognise that semantic schema and data matching would be greatly aided by a comprehensive thesaurus of terms (Davidson 1995) or reference ontology of biological concepts (Karp 1996). In order to share standardised and unambiguous information, controlled vocabularies, or terminologies, can be used as a framework for expressing and communicating ideas in a consistent manner. The TAMBIS biological Concept Model describes such a terminology. This knowledge base covers terms associated with proteins and nucleic acids, their component parts and their structures, biological functions and processes, tissues and taxonomy. The terminology has two key aspects:
• it is compositional, resembling a dictionary of elementary terms that are assembled according to a restricted grammar to form new complex composite terms. [...] terms 'hasFunction' and 'Hydrolase' to form a composite term 'Motif which isComponentOf Protein and hasFunction Hydrolase': this term is both a concept and a query.
• it is a classification scheme that organises terms into a hierarchy based on the 'isa' relationship (also known as the subsumption relationship). For example, ProteinSequence 'isa' more specialised kind of Sequence.

To be truly effective, such a terminology needs to be represented in a scheme that can reason about the inferred