A Proposal for a Standard CORBA Interface for Genome Maps

Bioinformatics, Vol. 15 no. 2, 1999, pages 157-169
© Oxford University Press

Emmanuel Barillot (1,2), Ulf Leser (3), Philip Lijnzaad (4), Christophe Cussat-Blanc (1,2), Kim Jungfer (4), Frédéric Guyon (1), Guy Vaysseix (1,2), Carsten Helgesen (4,5) and Patricia Rodriguez-Tomé (4)

(1) GIS Infobiogen, 7 rue Guy Môquet, BP 8, 94801 Villejuif cedex, France
(2) Généthon, 1 rue de l'Internationale, BP 60, 91002 Évry cedex, France
(3) Technische Universität Berlin, FB 13, CIS, Einsteinufer 17, D-10587 Berlin, Germany
(4) The European Bioinformatics Institute, EMBL Outstation, Hinxton Hall, Hinxton, Cambridge CB10 1SD, UK
(5) Department of Informatics, University of Bergen, HiB, N-5020 Bergen, Norway

Received on October 8, 1998; revised on December 16, 1998; accepted on December 17, 1998

Abstract

Motivation: The scientific community urgently needs to standardize the exchange of biological data. This is helped by the use of a common protocol and the definition of shared data structures. We have based our standardization work on CORBA, a technology that has become a standard in recent years and allows interoperability between distributed objects.

Results: We have defined an IDL specification for genome maps and present it to the scientific community. We have implemented CORBA servers based on this IDL to distribute RHdb and HuGeMap maps. The IDL will co-evolve with the needs of the mapping community.

Availability: The standard IDL for genome maps is available at http://corba.ebi.ac.uk/RHdb/EUCORBA/MapIDL.html. The IORs to browse maps from Infobiogen and EBI are at http://www.infobiogen.fr/services/Hugemap/IOR and http://corba.ebi.ac.uk/RHdb/EUCORBA/IOR.

Contact: [email protected], [email protected]

Introduction

By providing a common basis to all the biological sciences, genetics and molecular biology are at the crossroads of the study of life. This situation and the recent advances in molecular biology have set the stage for the numerous large-scale genomic projects that have been launched during the past ten years.

This has led to a huge amount of data of different kinds: for Homo sapiens, the mapping projects have reached completion, the sequencing phase is under way, and the scientific community is now turning its attention to gene function. Other species have been analyzed more deeply or, more commonly, hardly studied. All these data present a great number of internal relationships (e.g. orthology, gene mapping, gene regulation...), most of them still unknown.

The success of the genomic projects now depends greatly on our ability to manipulate and interact with the data they have yielded and to discover these unknown relationships. In that respect, the scientific community of geneticists and molecular biologists still lacks a facility that would offer a single, operational view of these data. Currently the way we exchange biological data is far from standardized. There is a need to:

• integrate data of different kinds (mapping, sequence, function, proteins, metabolic pathways...),
• integrate data from different species (human, mammals, micro-organisms...),
• integrate data from different sources (even the data of a given kind on a given species are often disseminated among several databases with heterogeneous formats).

These issues are non-trivial. We propose to address them by using:

(1) a common protocol that offers transparent access to remote databases,
(2) a common application programming interface,
(3) a common schema describing all the data and their relationships.

This problem of standardization is the subject of this paper. In the second section, we describe how to integrate the data physically and offer a common language to programmers. In the third section, we propose a common interface for genome maps. In the fourth section, we explain how to use this common interface.
An integrated view on biological data

The Common Object Request Broker Architecture (CORBA) from the OMG (Object Management Group, 1996) offers solutions to the three problems listed above.

• CORBA offers a common communication protocol: it specifies the protocol by which Object Request Brokers (ORBs) exchange data, the Internet Inter-ORB Protocol (IIOP). IIOP is used by any CORBA 2.0 compliant ORB, and therefore ORBs from different vendors can inter-operate transparently to the user. The user sends requests only to its own ORB, and this ORB contacts other ORB servers if necessary and handles all the communication tasks.
• CORBA introduces the Interface Definition Language (IDL) to model data, which provides a language-independent way of describing the public interface of objects. The IDL is independent of the machine architecture. Mappings of the IDL to a large variety of programming languages (C++, Java, Smalltalk, Fortran...) are defined.
• IDL is a language to model the data. It solves the syntactic problem, but not the semantic problem: one still has to agree on a common schema to obtain full interoperability.

It is beyond the scope of this paper to describe CORBA in more depth, and we refer the reader to the Object Management Group (1996), Siegel (1996), Achard and Barillot (1997) or to the special issue of the ACM (1998) for an introduction to CORBA. CORBA has been adopted by several groups in the bioinformatics community. For example, a CORBA layer has been developed for several databases by different groups: RHdb (Rodriguez-Tomé et al., 1997) at the European Bioinformatics Institute, HuGeMap (Barillot et al., 1998) and Virgil (Achard and Barillot, 1998; Achard et al., 1998) at Infobiogen, ArkDB (Hu et al., 1998) at the Roslin Institute, and Micado (Biaudet et al., 1997) at INRA. Other databases, such as IXDB (Leser et al., 1998c), are planning to implement it in the near future (Leser et al., 1998a).

Other solutions can be envisaged to address the problem of technical inter-operability. The main ones are the Java Remote Method Invocation (RMI) and Microsoft's Distributed Component Object Model (DCOM). These solutions are proprietary products and lack cross-language (RMI) or cross-platform (DCOM) support. CORBA is an open solution specified by a consortium of academic and commercial organizations. Moreover, the Java Development Kit 2.0 will include a CORBA compliant ORB, and DCOM-CORBA gateways already exist.

We propose to use CORBA as the basis for interoperability in two stages. First, the CORBA architecture and services directly provide a solution to many of the technical problems by means of IDL and ORBs. Secondly, the remaining semantic problem is tackled by prescribing a standard interface to related data in different databases. For instance, there are many databases that store genome maps, but, for many reasons, they use different schemas to describe their content. Transforming one representation into another, a task that is necessary to compare and integrate the data, is difficult and requires detailed knowledge of every database. We solve this problem by proposing a well-defined standard representation written in IDL. If each database provides an interface based on this IDL, clients only need to know this interface and can thereby retrieve data from heterogeneous databases in a homogeneous fashion.

The EBI and Infobiogen groups have worked together for a year to define a common schema to represent the genome maps from their respective databases.

A common interface for genome maps

A common interface for genome maps has been defined in IDL and is given in the appendix. Figure 1 presents a synoptic description in UML (Unified Modeling Language, see http://www.rational.com/uml).

The genome map IDL is presented in the remainder of this section. We have kept this IDL as simple as possible. Our aim is to describe the representation of a map, and not the other data used in the mapping process or in the map computation. Any object in the IDL inherits from an ancestor ‘MapObject’, which contains as attributes the name of the database where the object resides, the object identifier in this database and the human-readable name used by this database. It also has a method to get the cross-references of the object (to cope with aliases in other databases).

Fig. 1. Graphical representation of the Genome Map IDL in UML notation.

Mappable objects

We use an inheritance hierarchy to describe the different objects to be placed on genome maps. The root of the hierarchy is the class ‘Mappable’, which inherits from ‘MapObject’ and stores:

• A generic object identification (inherited from ‘MapObject’).
• The species and the chromosome. This information is currently stored as strings, but should in future versions be replaced by links to taxonomy and cytogenetic databases to ensure consistency.
• The ‘type’ of the object (for markers: STS, EST, etc.; for clones: YAC, BAC, P1, etc.; for maps: genetic, physical, radiation hybrid, sequence, etc.).
• A method to get all the maps containing this object on the server. Note that this method ‘getMaps’ does not return a list of maps, but a list of ‘MapElement’ objects that give the positional information and a pointer to the map (see below).

‘Mappable’ is an abstract class which is not intended to be instantiated. Currently two subclasses are derived from ‘Mappable’:

• ‘Point’; points are mappable objects that have a point location. Examples are STSs, amplimers and, in general, loci if they are not further defined.
• ‘Segment’; [...] or left and right positions. Examples are clones or chromosome bands.

‘Map’ and ‘Clone’ are derived from ‘Segment’. Therefore, objects on a map are either markers, clones, or other maps. This yields a recursive data structure, allowing for maps that contain maps which themselves contain maps, etc. Because we are concentrating on the mapping aspect of genomic data, we [...]
