Vol. 15 no. 2 1999 BIOINFORMATICS Pages 157-169

A proposal for a standard CORBA interface for genome maps

Emmanuel Barillot1,2, Ulf Leser3, Philip Lijnzaad4, Christophe Cussat-Blanc1,2, Kim Jungfer4, Frédéric Guyon1, Guy Vaysseix1,2, Carsten Helgesen4,5 and Patricia Rodriguez-Tomé4

1GIS Infobiogen, 7 rue Guy Môquet - BP 8, 94801 Villejuif cedex, France, 2Généthon, 1 rue de l'Internationale - BP 60, 91002 Évry cedex, France, 3Technische Universität Berlin, FB 13 - CIS, Einsteinufer 17, D-10587 Berlin, Germany, 4The European Bioinformatics Institute, EMBL Outstation, Hinxton Hall, Hinxton, Cambridge CB10 1SD, UK and 5Department of Informatics, University of Bergen, HiB, N-5020 Bergen, Norway

Received on October 8, 1998; revised on December 16, 1998; accepted on December 17, 1998

Abstract

Motivation: The scientific community urgently needs to standardize the exchange of biological data. This is helped by the use of a common protocol and the definition of shared data structures. We have based our standardization work on CORBA, a technology that has become a standard in the past years and allows interoperability between distributed objects.

Results: We have defined an IDL specification for genome maps and present it to the scientific community. We have implemented CORBA servers based on this IDL to distribute RHdb and HuGeMap maps. The IDL will co-evolve with the needs of the mapping community.

Availability: The standard IDL for genome maps is available at http://corba.ebi.ac.uk/RHdb/EUCORBA/MapIDL.html. The IORs to browse maps from Infobiogen and EBI are at http://www.infobiogen.fr/services/Hugemap/IOR and http://corba.ebi.ac.uk/RHdb/EUCORBA/IOR.

Contact: [email protected], [email protected]

Introduction

By providing a common basis to all the biological sciences, genetics and molecular biology are at the crossroads of the study of life. This situation and the recent advances in molecular biology have set the stage for the numerous large scale genomic projects that have been launched during the past ten years.

This has led to a huge amount of data of different nature: for Homo sapiens, the mapping projects have reached completion, the sequencing phase is on its way, and the scientific community now turns its attention to gene function. Other species have been analyzed more deeply, or, more commonly, hardly studied. All these data present a great number of internal relationships (e.g. orthology, mapping, gene regulation...), most of them still unknown.

The success of the genomic projects now depends greatly on our ability to manipulate and interact with the data they have yielded and to discover these unknown relationships. In that respect, the scientific community of geneticists and molecular biologists still lacks the facility that would offer a unique and operational view on these data. Currently the way we exchange biological data is far from standardized. There is a need to:
• integrate data of different nature (mapping, sequence, function, metabolic pathways...),
• integrate data from different species (human, mammals, micro-organisms...),
• integrate data from different sources (even the data of a given nature on a given species are often disseminated among several databases with heterogeneous formats).

These issues are non-trivial. We propose to address them by using:
(1) a common protocol that offers a transparent access to remote databases,
(2) a common application programming interface,
(3) a common schema describing all the data and their relationships.

This problem of standardization constitutes the subject of this paper. In the second section, we explain how to integrate the data physically and offer a common language to the programmers. In the third section, we propose a common interface for the genome maps. In the fourth section, we explain how to use this common interface.

© Oxford University Press

An integrated view on biological data

The Common Object Request Broker Architecture (CORBA) from the OMG (Object Management Group, 1996) offers solutions to the three problems listed above.
• CORBA offers a common communication protocol: it specifies the protocol by which Object Request Brokers (ORB) exchange data, the Internet Inter-ORB Protocol (IIOP). IIOP is used by any CORBA 2.0 compliant ORB, and therefore ORBs from different vendors can inter-operate transparently to the user. The user only sends requests to its own ORB, and this ORB contacts other ORB servers if necessary and handles all the communication tasks.
• CORBA introduces the Interface Definition Language (IDL) to model data, which provides a language-independent way of describing the public interface of objects. The IDL is independent from the machine architecture. Mappings of the IDL to a large variety of programming languages (C++, Java, Smalltalk, Fortran...) are defined.
• IDL is a language to model the data. It solves the syntactic problem, but not the semantic problem: one still has to agree on a common schema to obtain full interoperability.

It is beyond the scope of this paper to describe CORBA more deeply and we refer the reader to the Object Management Group (1996), Siegel (1996), Achard and Barillot (1997) or to the special issue of the ACM (1998) for an introduction to CORBA. CORBA has been adopted by several groups in the bioinformatic community. For example, a CORBA layer has been developed for several databases by different groups: RHdb (Rodriguez-Tomé et al., 1997) at the European Bioinformatics Institute, HuGeMap (Barillot et al., 1998) and Virgil (Achard and Barillot, 1998; Achard et al., 1998) at Infobiogen, ArkDB (Hu et al., 1998) at the Roslin Institute, Micado (Biaudet et al., 1997) at INRA. Other databases, such as IXDB (Leser et al., 1998c), are planning to implement it in the near future (Leser et al., 1998a).

Other solutions can be envisaged to address the problem of technical inter-operability. The main ones are the Java Remote Method Invocation (RMI) and Microsoft’s Distributed Component Object Model (DCOM). These solutions are proprietary products and lack cross-language (RMI) or cross-platform support (DCOM). CORBA is an open solution specified by a consortium of academic and commercial organizations. Moreover the Java Development Kit 2.0 will include a CORBA compliant ORB and DCOM-CORBA gateways already exist.

We propose to use CORBA as the basis for interoperability in two stages: first, the CORBA architecture and services directly provide a solution to many of the technical problems by means of IDL and ORBs. Secondly, the remaining semantic problem is tackled by prescribing a standard interface to related data in different databases. For instance, there are many databases that store genome maps, but, for many reasons, they use different schemas to describe their content. Transforming one representation into another, a task that is necessary to compare and integrate the data, is difficult and requires detailed knowledge of every database. We solve this problem by proposing a well-defined standard representation written in IDL. If each database provides an interface based on this IDL, clients only need to know this interface and can thereby retrieve data from heterogeneous databases in a homogeneous fashion.

The EBI and Infobiogen groups have worked together for a year to define a common schema to represent the genome maps from their respective databases.

A common interface for genome maps

A common interface for genome maps has been defined in IDL and is given in the appendix. Figure 1 presents a synoptic description in UML (Unified Modeling Language, see http://www.rational.com/uml).

The genome map IDL is presented in the remainder of this section. We have kept this IDL as simple as possible. Our aim is to describe the representation of a map, and not the other data used in the mapping process or in the map computation.

Any object in the IDL inherits from an ancestor ‘MapObject’, which contains as attributes the name of the database where the object resides, the object identifier in this database and the human readable name used by this database. It also has a method to get the cross-references of the object (to cope with aliases in other databases).

Mappable objects

We use an inheritance hierarchy to describe the different objects to be placed on genome maps. The root of the hierarchy is the class ‘Mappable’, which inherits from ‘MapObject’ and stores:
• A generic object identification (inherited from ‘MapObject’).
• The species and the chromosome. This information is currently stored as strings, but should in future versions be replaced by links to taxonomy and cytogenetic databases to ensure consistency.
• The ‘type’ of the object (for markers: STS, EST, etc.; for clones: YAC, BAC, P1, etc.; for maps: genetic, physical, radiation hybrid, sequence, etc.).
• A method to get all the maps containing this object on the server. Note that this method ‘getMaps’ does not return a list of maps, but a list of ‘MapElement’ objects


Fig. 1. Graphical representation of the Genome Map IDL in UML notation.
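As a reading aid for the hierarchy summarized in Fig. 1, the following Python sketch mirrors a reduced part of the model. It is not the IDL from the appendix and not code from the RHdb or HuGeMap servers. The class names follow the text; the Python method names `get_maps` and `place`, the single `coordinate` field standing in for the specialized position interfaces, and all sample data (`DemoDB`, `D21S1`, the map names) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class MapObject:
    """Ancestor of every object: database name, identifier, readable name."""
    database: str
    identifier: str
    name: str


@dataclass(eq=False)
class Mappable(MapObject):
    """Root of everything that can be placed on a map (abstract in the IDL)."""
    species: str = ""
    chromosome: str = ""
    type: str = ""  # e.g. "STS" for a marker, "YAC" for a clone
    placements: List["MapElement"] = field(default_factory=list, repr=False)

    def get_maps(self) -> List["MapElement"]:
        # Like 'getMaps' in the text: returns placements (position plus a
        # pointer to the map), not the maps themselves.
        return list(self.placements)


class Point(Mappable):
    """Mappable object with a point location."""


class Marker(Point):
    """The only 'Point' subclass described in the text."""


class Segment(Mappable):
    """Mappable object with some extent."""


class Clone(Segment):
    """A clone is a segment."""


@dataclass(eq=False)
class Map(Segment):
    """A map is itself a segment, so maps can contain other maps."""
    elements: List["MapElement"] = field(default_factory=list, repr=False)

    def place(self, obj: Mappable, coordinate: Optional[float] = None) -> "MapElement":
        # Hypothetical helper: record the placement on both sides of the link.
        element = MapElement(contained=obj, container=self, coordinate=coordinate)
        self.elements.append(element)
        obj.placements.append(element)
        return element


@dataclass(eq=False)
class MapElement:
    """Link between a mapped object and the map that contains it."""
    contained: Mappable
    container: Map
    coordinate: Optional[float] = None  # simplification of the position types


# Invented sample data: a marker placed on a regional map and on a
# chromosome-wide map that also contains the regional map.
chr_map = Map("DemoDB", "map1", "chr21 physical",
              species="Homo sapiens", chromosome="21", type="physical")
region = Map("DemoDB", "map2", "21q22 regional",
             species="Homo sapiens", chromosome="21", type="physical")
marker = Marker("DemoDB", "mk1", "D21S1",
                species="Homo sapiens", chromosome="21", type="STS")

region.place(marker, 120.0)
chr_map.place(region)            # a map placed on another map
chr_map.place(marker, 34000.0)   # the same marker placed on a second map

print([e.container.name for e in marker.get_maps()])
```

Keeping the placement in a separate link object is what lets one marker appear on several maps, and a map appear on another map, without duplicating either object.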

that give the positional information and a pointer to the map (see below).

‘Mappable’ is an abstract class which is not intended to be instantiated. Currently two subclasses are derived from ‘Mappable’:
• ‘Point’; points are mappable objects that have a point location. Examples are STSs, amplimers and, in general, loci if they are not further defined. We currently have defined only one subclass of ‘Point’: ‘Marker’. This is sufficient for the data in our databases which, from a map representation point of view, treat different types of markers (STS, EST...) homogeneously. Other implementations of map IDL will define other subclasses with more specific attributes and methods.
• ‘Segment’; segments are mappable objects that have some extent, defined for instance by flanking markers or left and right positions. Examples are clones or chromosome bands.

‘Segment’ is specialized into ‘Map’ and ‘Clone’. Therefore, objects on a map are either markers, clones, or other maps. This yields a recursive data structure, allowing for maps that contain maps which themselves contain maps, etc. Because we are concentrating on the mapping aspect of genomic data, we do not store extensive information about the objects themselves apart from their positions. Hence, markers are simply objects with a name and a type.

Genome maps are in the first place simply segments. The location of all mappable objects is defined in a separate data structure (see below). This construction has a number of advantageous properties:
• Maps can span entire chromosomes or only certain regions. It is commonplace to define such sub-maps by


flanking markers, often without having an exact position in mind.
• Maps can contain other maps as objects. This is for instance useful if one wants to dynamically assemble a chromosome-wide map from a number of regional maps.
• Segments do not necessarily exist physically. For instance, one can use segments to model bins, which are regions between two mapped markers, where objects in this bin do not have any further position information attached.
• Cytogenetic information is treated homogeneously with respect to the other types of data. Therefore it is possible to have maps that do not contain any chromosome bands, which is often the case if sub-maps are only imprecisely placed.
• The superclass ‘Mappable’ offers a uniform interface to access mapped objects.

‘Mappable’ and each class inheriting from ‘Mappable’ offer their specific methods. For instance, ‘Mappable’ offers a method to retrieve all maps containing this ‘Mappable’ object. ‘Map’ currently offers four methods: retrieve all contained objects (‘getAllElements’), retrieve the number of contained objects (‘getNrOfElements’), retrieve all contained objects between two objects (‘getRangeBetweenObjects’) and retrieve all objects around a certain position (‘getElementsInSegment’). These methods can throw specific exceptions, for instance if one tries to retrieve a range that is not contained in the map.

Certainly, other types will be added to address more specific applications. These should be modeled as subclasses of the current generic ‘Map’ class to ensure portability of client applications.

‘Map’ is specialized into ‘LinearMap’ and ‘Bin’. ‘LinearMap’ maps have a numeric and linear coordinate system, such as base pairs for physical maps or centimorgans for genetic maps. The borders of this coordinate system are stored in ‘maxCoordinate’ and ‘minCoordinate’. They have methods to get a map range defined from two positions or from an object and a maximum distance to this object. ‘Bin’ is useful to model any map based on a framework map (e.g. radiation hybrid maps, cytogenetic-based maps, etc.). In this case, each interval defined by the framework is modeled by a ‘Bin’. A ‘Bin’ contains other ‘Mappable’ objects (typically markers). This relationship is modeled by a ‘MapElement’ (see below).

Cytogenetic elements, that is chromosomes, arms, telomeres, centromeres, bands and sub-bands, are modeled by a ‘CytogeneticElement’ derived from ‘Bin’. This was done to cope with the specificities of cytogenetic elements: they form a hierarchy and a partition (at each level they cover the whole genome with no overlap). Hence ‘CytogeneticElement’ has some other methods to get the contained cytogenetic elements, the containing cytogenetic element, and the neighboring cytogenetic elements having the same parent in the cytogenetic hierarchy.

Mapped objects and map positions

Mapped objects are linked to maps through the interface ‘MapElement’. ‘MapElement’ is an abstract interface which is specialized into five different children: ‘PointPosition’, ‘OrderedPosition’, ‘RangePosition’, ‘VaguePosition’ and ‘IntervalPosition’. The choice of the derived interface to use depends on the nature of the mapped element and on the principle of its positioning.

We introduced ‘MapElement’ mainly for two reasons. First, it allows arbitrary N:M relationships between maps and objects. That means that each object can be placed on several maps, even many times on the same map. Secondly, the objects on a map can be stored in different databases. The server holding the map would only store the map object itself and all location objects; these location objects would point to objects in other databases that can contain more object-specific data. This can be extremely helpful for example for the placement of genes, which can have extensive information attached. Instead of copying this data, a mapping database can simply store the link. Then each client can access the most current data by following this link.

Each interface inherited from ‘MapElement’ stores the following information:
• Pointers to the contained object and to the containing object. The containing object can only be a map. We thereby prevent the temptation to store other types of data in this class, for instance hybridization results. This would lead to a misuse of our interface definitions and to unnecessarily complex code.
• The location information itself. It is specific to the derived interface:
– For ‘PointPosition’, it is a single coordinate (for instance for exactly positioned markers). ‘PointPosition’ attributes also include a boolean that indicates if the element was on the framework used to build the map.
– For ‘OrderedPosition’, it is an order information (a rank, for maps where objects have only an order, but no absolute position).
– For ‘RangePosition’, it is a left and a right flanking ‘Mappable’ object.
– For ‘VaguePosition’, there is no location information.


– For ‘IntervalPosition’, it is a left and a right position (for instance for clones, which have a left and a right end coordinate).
• The ‘precision’ of the location information. It consists of the confidence in the mapping of a marker (the ‘lodscore’) and the ‘error’ on the position.

Some examples for the usage of the data model in typical situations follow:
• Typically, a physical map is represented in our model as a ‘LinearMap’. Clones are connected to it through an ‘IntervalPosition’, while markers are connected through a ‘PointPosition’.
• A genetic map will be treated like a physical map of markers, but with a different unit (centiMorgan instead of base pair).
• A radiation hybrid map is represented as a ‘LinearMap’, but the choice of the ‘MapElement’ is more elaborate, because such a map generally contains framework markers and non-framework markers. The former are connected to the map through a ‘PointPosition’, while the latter can be connected in two ways:
– through a ‘RangePosition’, with their neighboring framework markers as flanking ‘Mappable’ objects and no rank value.
– through a ‘VaguePosition’ connected to an intermediary ‘Bin’. A bin is created for each interval between two adjacent framework markers, and connected to the map through an ‘IntervalPosition’. Then the non-framework markers are connected to their corresponding bins through a ‘VaguePosition’.

Factories

Factories are objects whose only function is to build other objects (for example from a given name). Clients invoke a method of the factory which creates an object and returns the Inter-operable Object Reference (IOR) of the object. IORs are universal references to a CORBA object and can be used by any client ORB. IORs are the CORBA equivalent to the Uniform Resource Locators (URLs) of the World Wide Web. Usually, generic factories are specialized into more specific ones, each implementing the creation of a more derived type of object.

Factories are the access points into the database. Therefore only the main types of objects in the data model (those that will be queried) need a factory. When browsing genome maps, the access points are maps, markers, clones or cytogenetic elements. We therefore propose the following factories in our genome map IDL: ‘Factory’ is a generic and abstract interface, which is specialized into ‘MapFactory’ and ‘MappableFactory’ (markers, clones and cytogenetic elements inherit from ‘Mappable’). ‘MapFactory’ has methods to get an object from its name or from its identifier. ‘MappableFactory’ also has a method to get a ‘Mappable’ from a cross-reference (to cope with aliases, such as GDB D-numbers (Letovsky et al., 1998), EMBL accession numbers (Stoesser et al., 1998), RH numbers (Lijnzaad et al., 1998)...). ‘MapFactory’ also has a method to get a list of maps from a type of map, a species name, a chromosome name and/or a list of mapped objects.

These factories are accessed through IORs. Each factory in the IDL has its own IOR. By publicizing them, database managers allow client programs to access their servers.

Another solution would be to use the CORBA Naming and/or Trading service. In CORBA, the user will mainly work with references to objects inside a computer program. However, the object references have to be obtained from somewhere, and this is the role of the so-called Naming and Trading services. The Naming service functions as a directory lookup, by name string, of named objects; the Trading service is a more general mechanism that allows lookup by service type, and can be likened to a yellow pages lookup. Both are OMG standards; descriptions are available from the OMG web site (http://www.omg.org/) and they are provided with most ORB products. These services address a major problem in biological databases: locating the objects. They have no equivalent in other external data exchange systems, such as ASN.1 (Steedman, 1993), XML (http://www.w3.org/XML) or even Java RMI.

Implementation

In this section we report experiences from implementing the previously described IDL. The general architecture is depicted in Figure 2. Both sites (EBI and Infobiogen) have implemented CORBA servers that access existing databases and clients that use these servers to query the underlying databases through the methods offered in the IDL.

Fig. 2. Heterogeneous clients access heterogeneous databases through a homogeneous view based on the use of CORBA and the common genome map IDL.

Database access through CORBA

CORBA offers several ways to represent database objects and queries in IDL (Leser et al., 1998b). Choosing a strategy has great impact on the performance and flexibility of the service. In particular, database objects can be represented either as CORBA objects, belonging to an IDL interface class, or as structs:
• When a client requests a CORBA object, it will get back only a reference. The client can from then on use the object as if it were local, with all necessary translations being carried out transparently inside the ORB. But accessing an attribute value always requires a call to the server, which leads to increased network traffic. If many values are requested, the load may become unacceptable. On the other hand, if the data is often updated on the server, the client will always get back the most current data.
• IDL structs are passed by value. Accessing attribute values is therefore a local and fast operation, once the whole struct has been transmitted. But clients will always obtain the whole struct even if only single fields are required. Besides, if the data is updated on the server after a client request, the client will keep outdated values until it issues a new request.

Our IDL models most object classes as interfaces, because interfaces are conceptually clearer, contain methods and can be organized in hierarchies. However, for performance reasons we have implemented some classes as structs. This applies to classes which are accessed frequently but contain rather stable data. For instance, in a radiation hybrid mapping project, object positions are defined by bins which will not change during the experiments, though the number of objects being placed in these bins is continuously increasing. We therefore implement for instance the classes ‘Mappable’ and ‘Map’ as interfaces, but ‘MapElement’ as a struct.

Our IDL does not contain a method to pose arbitrary ad-hoc queries against the underlying map database because this would be incompatible with the general objective, which is to serve as a standard interface to mapping data. There is no query language for such queries that is understood by all database systems in use. Furthermore, different databases will always have their own, internal data schema. We therefore take the approach of defining methods that encapsulate parameterized, fixed queries.

All attributes in our IDL are read-only, since our purpose is only to distribute mapping data, but not to update a database.

Server implementation

We have implemented the presented IDL to provide access to two existing databases: RHdb (Lijnzaad et al., 1998) at the EBI and HuGeMap (Barillot et al., 1998) at Infobiogen. Implementing a server for this IDL is rather simple once the general mechanism is understood. It proceeds in the following steps:
• Every ORB comes with a compiler which translates IDL code into a server program skeleton. Translations are


defined for many programming languages. This skeleton contains all declarations and method stubs for the necessary data conversion.
• A programmer fills code into this skeleton. This usually requires only the initialization of a database gateway in the server startup code and the appropriate queries in the method bodies.
• Finally, the server is registered with the local ORB, and the IORs of initial objects, usually factory objects, are made public.

From the client’s perspective it is completely transparent which system actually holds the data. For instance, the RHdb database uses the Oracle 7.3 Relational DBMS. The CORBA server has been developed using OrbixWeb 3.0 from IONA Inc and is written in Java using Oracle’s JDBC driver. In contrast, the HuGeMap database is built on top of the EyeDB Object-Oriented DBMS (http://www.infobiogen.fr/services/eyedb). The server has been developed using Orbix 2.2MT and is written with the EyeDB C++ Application Programming Interface. These differences are completely hidden behind the server interface defined in IDL.

We have also ported the code of the RHdb server to another database, IXDB (Leser et al., 1998c). IXDB also uses ORACLE 7.3, but has a significantly different schema. Porting the code was very simple, as mostly only the actual SQL queries had to be changed. Only a few methods required more work. Finishing a prototype took less than three days.

The IORs of both servers are publicly available at http://www.infobiogen.fr/services/Hugemap/IOR and http://corba.ebi.ac.uk/RHdb/EUCORBA/IOR respectively.

Client implementation

We have adapted a number of existing clients to use our IDL. Developing a client comprises the following steps (this description only covers those parts of a client that are concerned with data access, but not visualization, user interface, calculations, etc.):
• First, the client needs to compile the IDL into a client program skeleton.
• Then a programmer adds code for establishing a connection to the server. This usually implies loading an IOR, for instance from a WWW page or from a local file, and then calling methods of the referenced object to obtain references to further objects.
• These references are continuously used to pose queries by calling object methods.

Note that IORs are in general not stable and change over time, for example when the CORBA server is shut down and restarted. We therefore publish our initial IORs as WWW pages. Further CORBA object references need to be obtained again in every session.

We have implemented three client programs at EBI and at Infobiogen that fetch the maps in the two databases through CORBA interfaces:
• Mapplet is a Java applet developed by Kim Jungfer for the display of multiple maps (Jungfer and Rodriguez-Tomé, 1998).
• MappetShow is a Java map viewer that compares several maps and gives a clear view of very dense maps (several hundreds of markers); it uses visualization concepts such as (multi-)focus+context techniques and controlled information density (Leung and Apperley, 1994; Rao and Card, 1994; Woodruff et al., 1998).
• The GDB map viewer (Fasman et al., 1997), developed at the Genome DataBase (Baltimore, USA), was interfaced to our CORBA servers. Note that this map viewer is file-driven and does not use a direct database access. We have therefore developed an adapter which first queries a database and then generates a file in the necessary format.

Clients access server objects by using their IOR. Our IDL does not support a way to access data in multiple servers with a single method. Although clients can follow links to objects on other servers, they cannot ask a query such as: ‘Give me all maps on any server that contain the gene ‘DMD’’. Implementing such a method is the task of a trader (Thissen and Linnhoff-Popien, 1996).

Since all our clients are implemented as applets, they can be used with any WWW browser.

Performance

Our system demonstrates that interoperability can be achieved through CORBA, and how. However, response times are not negligible and range from milliseconds to a few seconds, depending on the amount of data that needs to be transferred and the network delay. They are thereby of the same order of magnitude as the currently predominant WWW access, but are considerably slower than direct TCP/IP connections.

There is currently ample discussion about the performance of CORBA-based systems in genome research. We think that this discussion misses an important point, because we do not believe that performance is the only pressing problem for genomic databases today. The more general problem is the lack of flexible, comfortable and comprehensive access methods to the available data. It is still extremely difficult for an average biologist to get a complete overview of the available information in his area of interest (Gelbart, 1998; Robbins, 1995), although this problem has been addressed since the early days of the Human Genome Project (Robbins, 1992).

We therefore face a situation where we have to trade performance for flexibility and expressiveness of interfaces to a certain degree. Our IDL allows a client application to access data from many, completely heterogeneous and worldwide distributed systems in a homogeneous fashion.

Another major issue concerns the scalability of a CORBA-based system. CORBA itself was designed to be scalable (Henning, 1998), but it is highly dependent on: (i) the granularity of the data exchanged and (ii) the bandwidth of the networks. The development of high speed connections is now recognized as a high priority for science and economy and its advent will address the problem of bandwidth (ACM, 1997; Bell and Gray, 1997). Regarding the granularity, the use of structs instead of interfaces as explained above keeps the network traffic within a reasonable limit. The Object-by-Value specification (Vinoski, 1998) has recently been released and will offer another solution: it allows the server to pass the value of all the attributes of an interface in a single call instead of passing them individually when requested.

Related work

The OMG addresses the access of databases with the ‘Object Query Service’. However, the OQS has several deficiencies (Wells and Thompson, 1994), and in particular requires SQL or OQL as query language. Many systems in genomic research do not support these languages. We therefore chose a number of predefined queries and encapsulated them into methods. Thereby, clients can access the data without knowing about the underlying query language or schema. To address the problem of accessing multiple servers from a single interface we are currently investigating OMG’s ‘Object Trading Service’.

Our object-oriented representation of maps is probably not new. This is however difficult to judge as such detailed schemas are rarely published. The GDB map representation (Fasman et al., 1996) also uses the separation of location and objects; however, it concentrates strongly on marker maps and does not mention recursive maps. The schema of IXDB (Leser et al., 1998a) is also similar, but supports only physical and genetic maps with absolute coordinates.

The OMG has recently formed the ‘Life Science Research Domain Task Force’ to foster the definition and usage of standard IDL interfaces for life science. Until now, this task force has issued two requests for proposals: one for sequence analysis, and one for map representation. The authors are actively involved in the latter and have already expressed their intention to submit a proposal which will be based on the IDL presented in this work.

Conclusions

We have presented a proposal for a standard interface to genomic maps through CORBA. This interface has been defined as a consensus between laboratories involved in maintaining databases for genome mapping (Infobiogen and EBI). Further groups have expressed their interest in standardizing the access to genome maps through the LSR (see http://www.omg.org/lsr/ and ftp://ftp.omg.org/pub/docs/corbamed/98-03-16.ps). We strongly encourage mappers to actively contribute to the definition of this standard and to propose improvements to the existing version.

Our IDL has been implemented and tested on two distant and different genome map databases: the relational database RHdb and the object oriented database HuGeMap. Several clients have been developed or adapted to show the soundness of our approach. A CORBA server for a third database, IXDB, is currently under development.

We expect that our genome map IDL will evolve further in the near future:
• as a result of the feedback from the community;
• to model some parts more precisely, such as the representation of coordinates;
• to include further modules, such as comparative mapping; and
• to make use of the capabilities of CORBA that are not yet implemented or that will be specified in the 3.0 norm, such as the Naming Service, the Portable Object Adapter or the Object-by-Value specification (Vinoski, 1998). The Portable Object Adapter allows an implicit registration of large amounts of CORBA objects in an ORB independent way. This is especially important in the context of CORBA wrappers for databases, where there can be millions of objects in the database and it is impossible to register each of them individually.

Defining common access methods for biological data and interconnecting biological databases is essential to future genomic research. Although the suggestion of standards to reach these goals is not new, we believe that our approach, based on the industry-proven middleware technology CORBA, has convincing benefits. Clients and servers can be written in many different programming languages and run on different hardware platforms. Additionally, CORBA allows structured access in contrast to the WWW, where data has to be parsed out of HTML pages. We stress, however, that CORBA defines an application programming interface, and is not suitable for direct human access, as the WWW is.

We encourage the other genome map database managers to use our IORs to access our data, and also to offer such services themselves.

Acknowledgements

We would like to thank all our colleagues at the G.I.S. Infobiogen and at EBI for their support and help in our work. At

164 A standard CORBA interface for genome maps

Infobiogen, we would like to thank Philippe Gesnouin our Leser,U., Tai,S. and Busse,S. (1998b) Design issues of database access system engineer and Frédéric Achard for discussions and in a CORBA environment. In Conrad,S. (ed.), Workshop on comments about the article. At EBI, we would like to thank Integration of Heterogeneous Software Systems, pp. 74–87, Magde- Jeremy Parsons and Anastasia Spiridou for their constructive burg, Germany. participation in the discussions. This paper has benefited Leser,U., Wagner,R., Grigoriev,A., Lehrach,H. and Crollius,H.R. (1998c) IXDB, an X chromosome integrated database. Nucleic from a thorough review whose authors are sincerely ac- Acids Res., 26, 108–111. knowledged. This work was supported in part by the Euro- Letovsky,S.I., Cottingham,R.W., Porter,C.J. and Li,P.W.D. (1998) pean Union contracts BIO4-CT95-0037 and GDB: the Human Genome Database. Nucleic Acids Res., 26, 94–99. BIO4-CT96-0346. Ulf Leser was supported by a short-term Leung,Y.K. and Apperley,M.D. (1994) A review and taxonomy of EMBO fellowship (ALTF ASTF). distortion-oriented presentation techniques. ACM Trans. Com- puter–Human Interaction, 1, 126–160. Lijnzaad,P., Helgesen,C. and Rodriguez-Tomé,P. (1998) The radiation References hybrid database. Nucleic Acids Res., 26, 102–105. Achard,F. and Barillot,E. (1997) Ubiquitous distributed objects with Object Management Group (1996) CORBA Architecture and Specifi- CORBA. In Altman,R., Dunker,K., Hunter,L. and Klein,T. (eds), cations. OMG publications. http://www.omg.org/store/ Pacific Symposium on Biocomputing ‘97, pp. 39–50. World publications.html Scientific, Singapore. Rao,R. and Card,S.K. (1994) The table lens: merging graphical and Achard,F. and Barillot,E. (1998) Virgil: a database of rich links symbolic representations in an interactive focus+context visualiz- between GDB and GenBank. Nucleic Acids Res., 26, 100–101. ation for tabular information. In CHI ‘94. 
Conference Proceedings Achard,F., Cussat-Blanc,C., Viara,B. and Barillot,E. (1998) The new on Human Factors in Computing Systems: ‘Celebrating Interdepen- Virgil database: a service of rich links. Bioinformatics, 14, 342–348. dence‘, pp. 318–322 and 481–482. ACM Press, Boston. ACM (1997) The next 50 years. Commun ACM, 40. Robbins,R.J. (1992) Challenges in the human genome project: ACM (1998) The CORBA connection. Commun.ACM, 41, 34–79. progress hinges on resolving database and computational factors. Barillot,E., Guyon,F., Cussat-Blanc,C., Viara,E. and Vaysseix,G. IEEE Eng. Medicine Biol., March 1992, 25–34. (1998) HuGeMap: a distributed and integrated Human Genome Robbins,R.J. (1995) Information infrastructure for the human genome Map database. Nucleic Acids Res., 26, 106–107. project. IEEE Eng. Medicine Biol., 14, 746–759. Bell,G. and Gray,J.N. (1997) The revolution yet to happen. In Denning, Rodriguez-Tomé,P., Helgesen,C., Lijnzaad,P. and Jungfer,K. (1997) A P.J. and Metcalfe,R.M. (eds), Beyond Calculation, pp. 5–32. CORBA server for the radiation hybrid database. Proceedings of the Copernicus, Springer, New York. Fifth ISMB, pp. 250–253. AAAI Press, Halkidiki, Greece. Biaudet,V., Samson,F. and Bessières,P. (1997) Micado — a network- Siegel,J. (1996) CORBA, Fundamentals and Programming. Wiley oriented database for microbial genomes. CABIOS, 13, 431–438. Computer Publishing Group, New York. Fasman,K.H., Letovsky,S.I., Robert,W.C. and Kingsbury,D.T. (1996) Steedman,D. (1993) ASN.1 The Tutorial and Reference. Technology Improvements to the GDB human genome data base. Nucleic Acids Appraisals Ltd, Twickenham, UK. Res., 24, 57–63. Stoesser,G., Moseley,M.A., Sleep,J., McGowran,M., Garcia-Pas- Fasman,K.H., Letovsky,S.I., Robert,W,C. and Kingsbury,D.T. (1997) tor,M. and Sterk,P. (1998) The EMBL nucleotide sequence database. The gdb human genome database anno 1997. Nucleic Acids Res., 25, Nucleic Acids Res., 26, 8–15. 72–81. Thissen,D. and Linnhoff-Popien,C. 
(1996) Finding optimal services Gelbart,W.M. (1998) Databases in genomic research. Science, Oct. within a corba trader. In Spaniol,O., Linnhoff-Popien,C. and 1998, 659–661. Meyer,B. (eds), 1st Conference on Trends in Distributed Systems, Henning,M. (1998) Binding, migration, and scalability in CORBA. pp. 200–213. Springer, Aachen, Germany. Commun.ACM, 41, 62–71. Vinoski,S. (1998) New Features for CORBA 3.0. Commun. ACM, 41, Hu,J., Mungall,C., Nicholson,D. and Archibald,A. (1998) Design and 44–52. implementation of a CORBA-based genome mapping system Wells,D.L. and Thompson,C.W. (1994) Evaluation of the object query prototype. Bioinformatics, 14, 112–120. service submissions to the object management group. IEEE Jungfer,K. and Rodriguez-Tomé,P. (1998) Mapplet: a CORBA-based Quarterly Bull. Data Eng., 17, 36–45. genome map viewer. Bioinformatics, 14, 734–738. Woodruff,A., Landay,J. and Stonebraker,M. (1998) Constant informa- Leser,U., Lehrach,H. and Crollius,H.R. (1998a) Issues in developing tion density in zoomable interfaces. In Proceedings of the Working integrated genomic databases and application to the human X Conference on Advanced Visual Interfaces AVI ‘98, pp. 57–65. chromosome. Bioinformatics, 14, 583–590. ACM Press, L’Aquila, Italy.


Annex: the common genome map IDL

module GenomeMaps {
  typedef sequence<string> Strings;

  interface MapObject;
  interface MapElement;
  interface Mappable;
  interface CytogeneticElement;

  typedef sequence<MapElement> MapElementList;
  typedef sequence<Mappable> MappableList;
  typedef sequence<CytogeneticElement> CytogeneticElementList;
  typedef sequence<MapObject> MapObjectList;

  interface MapObject {
    readonly attribute string database;
    readonly attribute string oid;
    readonly attribute string Name;
    MapObjectList getCrossReferences ();
  };

  interface Mappable : MapObject {
    readonly attribute string species;
    readonly attribute string chromosome;
    readonly attribute string type;
    MapElementList getMaps ();
  };

  interface Point : Mappable { };

  interface Marker : Point { };

  interface Segment : Mappable {
    readonly attribute float length;
    readonly attribute string unit;
  };

  interface Clone : Segment { };

  interface Map : Segment {
    exception positionOutOfRange { string reason; };
    exception objectNotContained { string reason; };
    long getNrOfElements ();
    MapElementList getElements ();
    MapElementList getRangeBetweenObjects (in string obj1, in string obj2)
      raises (objectNotContained);
    MapElementList getElementsInSegment (in Segment segment)
      raises (objectNotContained);
  };

  typedef sequence<Map> MapList;

  interface LinearMap : Map {
    readonly attribute float maxCoordinate;
    readonly attribute float minCoordinate;
    MapElementList getMapRange (in float fromPosition, in float toPosition)
      raises (positionOutOfRange);
    MapElementList getMapRangeAroundObject (in string OID, in float range)
      raises (objectNotContained);
  };

  interface Bin : Map { };

  interface CytogeneticElement : Bin {
    readonly attribute short rank;
    exception noSuperBand { string reason; };
    exception noSubBands { string reason; };
    CytogeneticElement getSuperBand () raises (noSuperBand);
    CytogeneticElementList getSubBands () raises (noSubBands);
    CytogeneticElementList getSiblings ();
  };

  struct Precision {
    float error;
    float lodscore;
    string description;
  };

  interface MapElement {
    readonly attribute Precision positionPrecision;
  };

  interface PointPosition : MapElement {
    readonly attribute Point mappedObj;
    readonly attribute LinearMap onMap;
    readonly attribute float position;
    readonly attribute boolean frameworkElement;
  };

  interface OrderedPosition : MapElement {
    readonly attribute Mappable mappedObj;
    readonly attribute Bin onMap;
    readonly attribute short rank;
  };

  interface RangePosition : MapElement {
    readonly attribute Mappable mappedObj;
    readonly attribute LinearMap onMap;
    readonly attribute Mappable leftFlankingObj;
    readonly attribute Mappable rightFlankingobj;
  };

  interface VaguePosition : MapElement {
    readonly attribute Mappable mappedObj;
    readonly attribute Bin onMap;
  };

  interface IntervalPosition : MapElement {
    readonly attribute Segment mappedObj;
    readonly attribute LinearMap onMap;
    readonly attribute float leftEnd;
    readonly attribute float rightEnd;
  };

  interface Factory {
    exception objectNotFound { string reason; string objName; };
    MapObject getByOid (in string oid) raises (objectNotFound);
    MapObjectList getByName (in string name) raises (objectNotFound);
  };

  interface MappableFactory : Factory {
    MappableList getObjectByExternalID (in MapObject externalID)
      raises (objectNotFound);
  };

  interface MapFactory : Factory {
    MapList getMapByExternalID (in MapObject externalID)
      raises (objectNotFound);
    MapList getMapList (in string mapType, in string species,
                        in string chromosome, in Strings mappedObjects)
      raises (objectNotFound);
  };
};
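To make the intended use of the range operations more concrete, the following Python sketch mimics the LinearMap operations getMapRange and getMapRangeAroundObject with a purely in-memory stand-in. It is only an illustration under stated assumptions, not the servers described in this paper: no ORB is involved, the class and method names are Python renderings chosen here, the marker data are invented, and the boundary behaviour (inclusive ends, clipping the window around an object to the map limits) is an assumed policy that the IDL itself does not specify. Precision is modelled as a plain value object, in the spirit of the IDL struct, which is passed by value rather than by reference.

```python
from dataclasses import dataclass
from typing import List


class PositionOutOfRange(Exception):
    """Stand-in for the IDL exception positionOutOfRange."""


class ObjectNotContained(Exception):
    """Stand-in for the IDL exception objectNotContained."""


@dataclass(frozen=True)
class Precision:
    """Mirrors the IDL struct Precision: a plain value copied to the caller."""
    error: float
    lodscore: float
    description: str


@dataclass(frozen=True)
class PointPosition:
    """Simplified PointPosition: holds an OID instead of an object reference."""
    mapped_oid: str
    position: float
    framework_element: bool
    position_precision: Precision


class LinearMap:
    """In-memory stand-in for the LinearMap interface (hypothetical, no ORB)."""

    def __init__(self, elements: List[PointPosition]):
        if not elements:
            raise ValueError("a map needs at least one element")
        self._elements = sorted(elements, key=lambda e: e.position)
        self.min_coordinate = self._elements[0].position
        self.max_coordinate = self._elements[-1].position

    def get_nr_of_elements(self) -> int:
        return len(self._elements)

    def get_map_range(self, from_position: float,
                      to_position: float) -> List[PointPosition]:
        # Assumed policy: reject reversed or out-of-bounds coordinates.
        if (from_position > to_position
                or from_position < self.min_coordinate
                or to_position > self.max_coordinate):
            raise PositionOutOfRange(
                f"[{from_position}, {to_position}] is outside "
                f"[{self.min_coordinate}, {self.max_coordinate}]")
        return [e for e in self._elements
                if from_position <= e.position <= to_position]

    def get_map_range_around_object(self, oid: str,
                                    range_: float) -> List[PointPosition]:
        centre = next((e for e in self._elements if e.mapped_oid == oid), None)
        if centre is None:
            raise ObjectNotContained(f"{oid} is not placed on this map")
        # Assumed policy: clip the window to the map limits instead of raising.
        low = max(self.min_coordinate, centre.position - range_)
        high = min(self.max_coordinate, centre.position + range_)
        return self.get_map_range(low, high)


def build_demo_map() -> LinearMap:
    """Builds a small map from invented markers, for illustration only."""
    p = Precision(error=0.5, lodscore=3.0, description="illustrative value")
    return LinearMap([
        PointPosition("M1", 0.0, True, p),
        PointPosition("M2", 12.5, False, p),
        PointPosition("M3", 30.0, True, p),
        PointPosition("M4", 47.0, False, p),
    ])


if __name__ == "__main__":
    demo = build_demo_map()
    for element in demo.get_map_range_around_object("M3", 20.0):
        print(element.mapped_oid, element.position)
```

A real client would obtain a LinearMap object reference from a MapFactory via the ORB and call the operations generated from the IDL; the sketch only shows the predefined-query style of access, in which the client names a coordinate window or an object identifier and never sees the underlying query language or schema.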
