Integration of Metabolic Pathway Linked Data: Lessons Learned

Maciej Rybinski, María Jesús García-Godoy, Ismael Navas-Delgado and José F. Aldana-Montes Universidad de Málaga, Spain

Abstract. In the last few years, the Life Science domain has experienced a rapid growth in the number of available biological databases. The heterogeneity of these databases makes data integration a challenging issue. Some integration challenges refer to locating resources, relations, data formats, synonyms or ambiguity. The Linked Data approach partly solves the heterogeneity problems (mainly the syntactic ones). Linked Data refers to a set of best practices for publishing and connecting structured data on the Web. However, the Linked Data approach is not a solution in itself. This paper illustrates a data integration and interlinking process for metabolic pathway databases such as Uniprot, Kegg, Brenda and Reactome. This process was an important part of the development of a tool for biological end-users that integrates metabolic pathway information by using Linked Data technology. The integration process and the tool development show that heterogeneity problems have not been solved in this domain, and so this paper discusses the problems that arise from the use of data and links published as Linked Data and the possible approaches to solve these issues.

Keywords: Linked Data, Knowledge integration, Bioinformatic databases

1. Introduction

Over the last two decades, the biological database community has witnessed a rapid growth in the number of available data sources. The growth is a result of an enormous increase in the availability and accessibility of biological data. In the beginning, the information was used by a few specialized disciplines but today, these databases have become essential resources used by biologists around the world.

In the metabolism area, there are some examples of databases, such as Kegg [1], whose major components are pathways (of the most known biochemical and regulatory networks) shown in graphical diagrams. Brenda [2] is one of the largest available information systems, which provides biochemical and molecular information about classified enzymatic activity. Reactome [3] captures all the chemical reactions and pathways that occur in different organisms.

Many biologists use these databases on a daily basis in order to extract biological information needed in their work. For example, consider a user who searches the Kegg database for the enzyme code 5.4.2.2 (phosphoglucomutase, Homo sapiens) in order to obtain information about this enzyme in glycolysis/gluconeogenesis pathways. The result is a list of pathways in which the target enzyme works. The user has to browse the map to reach the link to the enzyme. The information at this link includes genes, pathways, classes of pathways, orthology and biochemical details. However, the user has to go to another database to discover kinetic and proteomic information about the target enzyme. For example, the user can use the Brenda database for the information related to the enzyme code including the components of chemical reactions, the reaction type, pathways, synonyms, kinetic information (kcat, Ki, pH, temperature) and cofactors, tissue source, cellular location and information about molecular structure. Additionally, the user can access the Uniprot database [4] to see information on corresponding protein attributes, sequence annotations, alternative splicing of the enzymes, literature and general information. For each database the search has to be repeated independently. This scenario reflects the main issue with these data sources, namely that each database contains a subset of the biological knowledge that might be of relevance. These knowledge islands cause a problem for biologists who have to repeatedly browse different databases to obtain the answer for a cross-domain information need.

To address the problem of information integration, the Semantic Web community led by the W3C proposed a set of standards such as RDF [5] and OWL [6]. Since 2007, there has been a lot of effort made to provide data sets from different areas using the Semantic Web technologies. In this context, a set of best practices has been proposed for sharing, publishing and connecting data, information and knowledge by using RDF and URIs. These practices are known as Linked Data principles [7].

The movement towards the publication and linking of the data has been continuously growing and the number of triples stored in the Linked Data Cloud has grown from 2 billion in 2007 to 31 billion in 2011 [8]. In Life Sciences the amount of RDF information published has increased enormously due to the efforts of several projects. Some of the relevant integration platforms providing data in this domain are Bio2RDF [9] and Linked Life Data [10]. These two platforms have converted relational information into RDF and linked the heterogeneous data between biological databases with the objective of publishing billions of triples representing biological knowledge. Linked Life Data is a data integration platform achieved through using a massive RDF warehouse solution extended with reasoning service inference and semantic annotations. It is supported by the OWLIM semantic repository [11] and stores 20 billion statements. Bio2RDF, in contrast, is a mash-up application that combines data from different relevant biological databases.

The Linked Data community seems to be switching from the data publication stage to the data consumption stage. In many domains the systems for Linked Data consumption are very few or none; however, the paradigm itself has been proven useful in practice by efforts from BBC music [12] and the Linked Open Drug Data (LODD) [13]. In Life Sciences, one of the main integration problems is related to the complexity of biological data. This feature makes it difficult to develop 'wide-span' applications for end-users.

This paper presents an integration solution (demo version available at http://150.214.214.5/metabolicpathways/ ) in the context of the metabolic pathways that uses Linked Data as source of information. On top of the integration part we have implemented functionalities such as navigation, search and visualization to make the integrated data easily accessible to users with a biological background. During the implementation process, several problems with the real-world usage of biological integrated data were detected. We have categorized these problems and a discussion on how to face obstacles belonging to each respective category is presented in this paper.

The paper is divided as follows: Section 2 describes the process to provide access to heterogeneous biological information and the results obtained. Section 3 presents a detailed description of the problems and discussion on: the quality of the data available within the Linked Data Cloud, the heterogeneity of the data and reusability bottlenecks. Section 4 describes some of the related work and approaches that address the maintenance issues of the Linked Data Cloud. Section 5 includes the conclusions resulting from this integration experiment and the consequent improvements to be implemented in the Biochemical Pathway application.

2. Providing access to heterogeneous information

The integration of biological databases in Life Sciences has become a challenge for researchers. One of the main problems of biological data is its syntactic and semantic heterogeneity.

Additionally, over the last decade the number of biological databases has increased. The annual NAR journal supplement listed 96 new online databases in 2001, 800 in 2007 and 92 in the year 2012 (Figure 1). The increase in the number of databases each year is due to the ease of publishing them on the Web. There are many databases and each one has its own repositories and community based on it. Consequently, between the data sources a variety of resources are replicated, overlap or are presented from different points of view.

Fig. 1. The growth rate of biological databases published in the NAR in the last decade. The X-axis represents the years and the Y-axis represents the number of databases that appeared per year in the NAR database supplement.

To solve (to some extent) the integration problem in bioinformatics, Linked Data is increasingly being published, as it gets wider recognition throughout the biological community. However, some challenges have to be faced in order to reach useful solutions. For example, one of the common challenges in Life Sciences integration is synonymy. There are many synonyms for the same biological entity as a consequence of naming entities independently. In the area of metabolism, most of the redundant information results from using different identifiers to denote the same metabolite. The ambiguity is sometimes also caused by conceptual heterogeneity; for example, ethanol is a drug but also a metabolite.

We have addressed the challenges of integration in Life Sciences during the development of an end-user tool that shows data provided by different biological data sets that store metabolic information. This tool allows users to access the heterogeneous and interlinked biological information. It takes advantage of the semantic links expressed in RDF format. To implement this tool, we have selected a set of metabolic databases (described in Section 2.1) and the standardized BioPAX ontology (Section 2.2) for representing the metabolic pathway knowledge [14]. A core for the database was created by using the Kegg database (the reasons for selecting the Kegg database are presented in Section 2.1) and was interlinked to the other selected databases by using well established tools and link types such as owl:sameAs and rdfs:seeAlso. Once that data set was created, we designed a user-friendly interface for searching, browsing and visualizing the integrated metabolic data. In the process of implementing an integrated data set some obstacles were detected. These problems refer to: the data quality (nonexistent references, missing objects, incorrect literal values etc.), incompatibilities due to the heterogeneity of the data that characterizes the biological objects, the reusability of the data set and the possibility of a (semi-)automated data curation to improve the data quality, and the correct selection of a standardized ontology as the vocabulary.

2.1. Datasets

Our integration approach would involve integrating several datasets in order to cater to the user's need for easy access to cross-database information. As our system is to be set in the Linked Data picture, integration does not necessarily involve incorporating all the relevant information into our local database, nor is that desired in terms of providing a complete knowledge base for the application, as it is unrealistic to assume that it is possible to create a 'database of everything'. In the most basic case, integration of two knowledge sources would simply mean interlinking our dataset with an existing relevant one (regardless of whether it follows Linked Data principles or it is a 'traditional' biological database). Integration in this sense follows Linked Data principles and provides the user with 'navigational' freedom of information access. Nonetheless, in order to be able to perform the interlinking process one still needs the core data to start with. As automatic merging of the pathway data is a very complex task to say the least (the topic is covered in more detail in Section 3.2), it seems accurate to expect that by choosing a specific dataset for our core data we also approximately establish the domain coverage of the final database.

Comparison of the most important pathway databases shows that using Kegg as a single source for core data ensures the widest possible coverage in terms of organisms and pathways. Additionally, publication of Kegg data with respect to Linked Data principles was an important part of the Bio2rdf initiative, so we could reasonably assume that reusing this open data and interlinking it with other important datasets from our application domain would ensure an improvement in the general condition of open biological data, without republishing a non-open dataset as Linked Data. The coverage comparison for important databases is presented in Table 1. Table 2 presents a more detailed comparison between Kegg (as it was published within the Bio2rdf project) and a potential second-best choice in terms of open datasets coverage, Reactome (also provided by Bio2rdf).

The detailed comparison between those two databases provides more rationale to back up the decision of building the core of the dataset around the Kegg data, as it shows that the Reactome database seems to have more populated classes mainly due to the different level of abstraction used. A more careful examination of the data explains why: for example, Reactome may have an instance of the BioPAX class SmallMolecule that represents carbon dioxide in a specific organism in a specific cellular location and another that represents the same compound in a different setting. As a result, common molecules will have several instances in Reactome. In Kegg a real-world compound is represented with one instance of the Kegg class Compound, regardless of its location, hence the proximity of the numbers in respective columns. This is an important problem when we consider its implications to the task of data integration

and interlinking. With regard to the comparison between Kegg and Reactome, we would also like to point out that Kegg data relates more to metabolic pathways and Reactome seems to be focused rather more on the representation of signaling pathways. This fact also justifies the choice of Kegg as the source for our core data, as biosignaling pathways are beyond the scope of our tool.

In order to perform the integration of metabolic information, we have narrowed our scope to several relevant data sources. These data sets cover a wide variety of possible user needs that might arise when dealing with pathway data (e.g. details of biochemical reactions, metabolites, enzymes etc.). They also provide other potentially interesting information related with the objects of interest (e.g. targets of drugs, genomic information etc.):

• The Kegg database presents a part of the metabolism represented by graphical diagrams. This database has information on enzymes and metabolites whose identifiers are the EC numbers and a Kegg identifier, respectively.
• The Uniprot database is a central database of protein sequences with annotations and functional information. This database provides proteomic and genomic information on the enzymes that catalyze biochemical reactions in the metabolism that will be used in the data set.
• The DrugBank is an annotated resource that combines detailed drug data with drug target and drug action information [15]. This database was used in the Linked Open Drug Data (LODD). Therefore, we proposed this drug database to cross reference information on drugs, their targets and the metabolites.
• NCBI taxonomy is a database that stores the names and classifications for all organisms that are represented in genetic databases with at least one nucleotide or protein sequence. This database represented 10% of the existent species on the planet [16].
• The Reactome database is a resource for human biological processes. The Reactome database interface permits representing processes in the human system, including the pathways of intermediary and regulatory metabolism. This database was selected to enrich the process of cross referencing metabolic entities and complete the information of pathways included in the Kegg database.
• The Integrating Network Objects with Hierarchies (INOH) is a structured and curated database [17]. This data source offers metabolism and signaling transduction pathways information in a data scheme. This metabolic database includes over 6155 interactions and 3395 protein entities available in INOH and BioPAX.

The RDFs of Kegg, Uniprot, NCBI Taxonomy and Reactome are provided by the Bio2RDF project. The DrugBank RDF is provided by the service of the research group of Anja Jentzsch et al. [18].

2.2. BioPAX Ontology

At this point it is also vital to choose the right vocabulary to represent the data. Most of the prominent biological databases by design use their own data structures (examples include Kegg, Reactome, INOH). This trend was often also propagated to their RDF publications. At some point however, the need for standardization was recognized by the community, which resulted in two standout initiatives: BioPAX and SBML [19], standards used for pathway representation and pathway bio-modeling respectively. The impact of these two is clearly visible as a lot of major bio-databases additionally publish their data in a standardized manner, at least through on-demand generated reports. Figure 2 illustrates how our dataset can be viewed in a general manner, with our core data expressed through the BioPAX ontology and surrounded with links that point to other web resources. Additionally, the BioPAX standard has been used in various Linked Data projects for Life Sciences (including Bio2Rdf), therefore this choice was also justified in terms of conforming to previously employed practices.

BioPAX is a standardized language for representing biological pathway knowledge from molecular to organism level with the objective of facilitating the exchange of information. This ontology permits representing molecular pathways and interactions. BioPAX level 2 is focused on metabolic pathways as networks that contain biochemical reactions and enzymes, which convert reactants to products. In a holistic view, these networks are involved in many biological and evolutionary processes in the organisms. Given that, it seemed reasonable to express the data with respect to some version of the BioPAX standard, which is meant for pathway representation and seems well suited to Linked Data, as it is RDF/XML and OWL conformant.

Fig. 2. Core of the data follows the BioPAX Level 2 scheme and the links that point to biological databases included in the LOD cloud.

Figure 3 presents the domain concepts as they are used in the BioPAX level 2 standard to provide the reader with the vocabulary involved and what is represented by our data.

Fig. 3. Detailed view of the BioPAX classes and properties used for the data representation.

2.3. Dataset aggregation

Having chosen the prerequisites, the ontology and source for the core data, we moved on to design the dataset aggregation process itself. Figure 4 presents schematically the process step-by-step, the way it was originally projected. The first step involved extraction, cleaning and transformation (in order to reach BioPAX conformance) of the Kegg data. Subsequent steps were to involve owl:sameAs links generation for the RDF datasets. The obtained links were to be incorporated directly into the result dataset along with the links pointing to original databases. The latter were then generated through purely data-level transformations. This method of link generation was also considered for the Brenda database, as its data were not to be accessed programmatically. Figure 4 also shows the tools that were originally to be used for the specific integration steps. Section 3 discusses the results of the execution of the process, namely the obtained dataset. It will also provide insights on the problems encountered and situations that made us reconsider our original aggregation project.
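The data-level transformation described above, which turns a link to a Linked Data resource into a link to the original database, can be sketched as follows. This is only an illustration under stated assumptions: the URI prefixes, the URL templates and the function names are hypothetical and are not taken from the actual process, which relied on LDIF and scripts.

```python
from urllib.parse import quote

# Hypothetical mapping from a Linked Data URI prefix to a URL template of the
# original database. Both sides are assumptions made for this sketch.
ORIGINAL_DB_TEMPLATES = {
    "http://bio2rdf.org/uniprot:": "http://www.uniprot.org/uniprot/{id}",
    "http://bio2rdf.org/kegg:": "http://www.genome.jp/dbget-bin/www_bget?{id}",
}


def to_original_link(linked_data_uri):
    """Return a URL in the original database for a Linked Data URI, or None."""
    for prefix, template in ORIGINAL_DB_TEMPLATES.items():
        if linked_data_uri.startswith(prefix):
            local_id = linked_data_uri[len(prefix):]
            if not local_id:
                return None  # a prefix without an identifier cannot be resolved
            return template.format(id=quote(local_id, safe=""))
    return None  # unknown dataset: no link is produced


def derive_original_links(links):
    """Turn (subject, linked_data_target) pairs into rdfs:seeAlso triples
    that point to the original databases."""
    triples = []
    for subject, target in links:
        url = to_original_link(target)
        if url is not None:
            triples.append((subject, "rdfs:seeAlso", url))
    return triples
```

Because the transformation only rewrites identifiers, it works only when the Linked Data version and the original database share an identifier convention; links to unknown datasets are silently skipped in this sketch.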

Table 1. A comparative study of the content of each database involved.

Comparative study of the integrated databases

          | Kegg (Bio2RDF) | Reactome (Bio2RDF) | INOH           | Pathways Commons | Kegg     | Reactome
Pathways  | 99062          | 10682              | 102            | 1668             | 158792   | 14090
Organisms | 954            | 65                 | N/A            | 414              | 1779     | 43
Enzymes   | 4245           | 3854               | N/A            | N/A              | 5708     | N/A
Reactions | 7755           | 23716              | less than 1000 | N/A              | 8888     | 33487
Open Data | yes            | yes                | licensed       | as in sources    | licensed | licensed

Table 2. A comparative study between Kegg and Reactome databases.

Column headers: Databases Names; Distinct small molecules (compounds); Distinct catalysts (identified by EC Number).
Reactome (Bio2rdf): 396; 15918.
Kegg: 3618; 14071 compounds and 10965 glycans.

As indicated in Figure 4, the first step involved the extraction of metabolic pathway data from the Kegg dataset published as RDF. The objective seemed straightforward and well defined as initial research showed that transition from the Bio2rdf Kegg vocabulary to the BioPAX Level 2 standard would not be complicated, as the most relevant classes and predicates have their 1:1 counterparts in the respective datasets. This fact is reflected in Figure 4. All the mappings are included in Table 3. The extraction was smooth enough until we realized that Bio2rdf's Kegg, despite what is suggested by the data structure, does not relate Kegg internal organism codes to any taxonomy database. Links of this nature are expected to be included in a BioPAX conformant dataset. It made the core data extraction impossible with the use of standard tools (LDIF) as we had to work around the data deficiency. The missing data was directly extracted from the Kegg web page dedicated to publishing the mapping from Kegg codes to the NCBI Taxonomy identifiers¹. The problem of this solution is that it makes the entire process dependent on a single web resource that additionally may be subject to changes, which results in a fragile dependency structure that undermines the reusability of the process.

With the core data extracted and enriched, the interlinking of data with other relevant datasets started with respect to the databases that were published as Linked Data. We assumed a plan of interlinking that consisted of two steps: linking out to the open dataset, and producing links to the original dataset using links obtained in the first step and links included in the open dataset (alternatively we would transform the obtained links in case both datasets use a similar identifier convention). Special cases of a successful and reusable interlinking process included INOH and Brenda, databases that do not have their Linked Data versions. In the case of INOH we relied entirely on relationships specified within the INOH data to Kegg data, which in turn was originally linked to from our core data. As owl:sameAs is transitive and reflexive, we used the information to create a new link set pointing from our core data to INOH data.

Fig. 4. The project of the integration process. Steps (with the tool planned for each): extraction, cleaning and mapping of Kegg (LDIF); linking Reactome (Silk); linking Uniprot (Silk); linking INOH (Silk); linking Brenda (LDIF); linking DrugBank (Silk).

With the Brenda database the case was much simpler as both our core data and Brenda employ the same convention to identify enzymes. Knowing that, direct links to Brenda were created from our enzyme resources using their identifiers to generate a proper Brenda webpage URL. To finish the process, a utility was implemented to verify the created links by checking Brenda's front-end response. All the links that used a properly defined EC number (identifier) were correct. As an interesting offshoot, the utility helped us discover the 'faulty' enzymes with a malformed EC number within the data extracted from Bio2rdf Kegg.

The proposed interlinking scenario did not work in the case of pathway data (interlinking to Reactome and Pathway Commons pathways), as it is not trivial to find a generic solution for the interlinking of related pathways. This problem is due mainly to the ambiguous use of ontologies and the different granularity of the databases.

In Table 3, the results of the extraction and vocabulary mapping are shown. For several classes the number of instances in the derived set exceeds the number of instances in the original dataset. This peculiarity is caused by manual and semiautomatic data modifications needed to maintain higher data consistency. These actions would for example include

¹ Names and phylogenetic lineages of more than 160,000 organisms that have molecular data in the NCBI.

the creation of missing enzyme instances (original data contained 'hanging' links to non-existent instances). It is also worth noticing that the BioPAX standard imposes a structure that uses utility classes on numerous occasions and their instances were generated automatically during the extraction process. What is alarming is the fact that despite adding very little new information (only about 1MB of organism related information), fitting the original data to the BioPAX frame resulted in a growth of the repository from the 100MB of the original dump to around 1GB.

Table 3. Results of the core data extraction: extraction of Kegg data, its cleaning and vocabulary mapping to BioPAX.

Kegg Class | BioPAX class              | Class instance intended meaning                      | No. of instances in Kegg | No. of instances in the result set
Pathway    | Pathway                   | Represents a pathway                                 | 99062                    | 99062
Reaction   | BiochemicalReaction       | Represents a reaction                                | 7755                     | 7779
Enzyme     | Protein                   | Represents a protein that acts as enzyme             | 4245                     | 4285
Glycan     | SmallMolecule             | Represents a metabolite                              | 10965                    | 25065
Compound   | SmallMolecule             | Represents a metabolite                              | 14071                    | 25065
-          | BioSource                 | Represents an organism                               | 954                      | 896
-          | UnificationXref           | A cross reference to an external DB (Taxonomy)       | -                        | 896
-          | PathwayStep               | A logically separable set of events within a pathway | -                        | 916350
-          | Catalysis                 | Represents catalytic activity of an enzyme           | -                        | 6468
-          | PhysicalEntityParticipant | Represents an interaction of a concrete entity       | -                        | 39648
-          | DataSource                | Information about data provenance                    | -                        | 1

Table 4 illustrates the results of the executed interlinking process. The results show that the rich link set generated will allow the end user to browse the pathway data in an organized and intuitive manner. Table 4 also shows the heuristics involved, as well as the basic statistics and tools used in the process execution.

The interlinking was achieved by using standard tools and techniques. As a result, the process is highly reusable, given that the respective datasets maintain their vocabularies.

2.4. Tool for browsing metabolic data

In this section we briefly introduce the tool built for accessing the assembled dataset. It provides navigation, search and visualization facilities. The aim was to enable intuitive access to complex data and yet to retain multiple ways of access that benefit from the multidimensionality of the data. Those goals were achieved through implementation of a multilayered metabolism navigation interface. The resulting tool consumes Linked Data directly, expressed accordingly to the BioPAX level 2 standard. As shown in Figure 5 the user can navigate to a pathway and then access the integrated information related to any of its components. The implementation and functionality details are beyond the scope of this paper; nonetheless the readers might want to try the demo version, which is available at http://www.150.214.214.5/metabolicpathways/. It provides a convenient way to examine the properties of the dataset and the results of the integration effort.

3. Discussion

As stated previously, in our work we managed to achieve an interlinked set of data intended for biological pathway representation. The core pathway data was enriched with links to external data sources in order to provide the end user with a possibility to reach relevant information in an intuitive manner. For example, if we show the data related with a biological entity of interest, we also display the links to the related resources. Despite our inability to provide a pathway interlinking heuristic, we can describe the interlinking as fairly successful, as we managed to include the most important datasets in a reusable process.

Nevertheless, we have encountered some non-trivial issues that are relevant as discussion points. We believe that finding solutions for these problems is an important step towards building better Linked Data applications, systems and datasets in Life Sciences.

Table 4. Results of the interlinking process.

Target dataset    | Class (% of interlinked instances of the class)             | Total links | Link type        | Links per linked instance (avg) | Interlinking heuristic | Used tools
Brenda            | Enzyme (99%)                                                | 4245        | rdfs:seeAlso     | 1     | Enzymes were matched by their EC Numbers | Script
Uniprot (Bio2rdf) | Enzyme (58.6%)                                              | 190832      | rdfs:seeAlso     | 76.07 | Enzymes were matched with corresponding proteins by their EC Number aliases | Silk
Uniprot           | Enzyme (58.6%)                                              | 190832      | rdfs:seeAlso     | 76.07 | Links generated automatically with the link set pointing to Bio2rdf Uniprot | LDIF/Script
Drugbank          | Enzyme (25.9%)                                              | 2597        | rdfs:seeAlso     | 2.33  | Enzymes were matched with the use of synonyms referring to EC Numbers | Silk
Drugbank          | Small Molecule (5.1%)                                       | 1280        | owl:sameAs       | 1     | Metabolites needed to have identical formulas; name similarity was taken into account | Silk
Kegg (Bio2rdf)    | Pathway, Reaction, Enzyme, Small Molecule (>99% each)       | 168736      | owl:sameAs       | 1     | Links generated automatically during the core data extraction | LDIF/Script
Kegg              | as above                                                    | as above    | rdfs:seeAlso     | 1     | Links generated automatically for the links generated to Bio2rdf Kegg | Script
INOH              | Pathway                                                     | 451         | BioPAX reference | 9.2   | Links generated automatically through analysis of existing INOH to Kegg references | Script
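The first row of Table 4 (Brenda links built from EC numbers) can be illustrated with a minimal sketch. It assumes a particular URL template for Brenda enzyme pages and treats only fully numeric four-field EC numbers as well formed; both are assumptions of this example, and the names `is_valid_ec` and `brenda_links` are hypothetical. Unlike the utility described in the text, the sketch performs only an offline format check and does not query Brenda's front end.

```python
import re

# Four dot-separated numeric fields, e.g. 5.4.2.2. Treating every other form
# as malformed is a simplifying assumption of this sketch.
EC_PATTERN = re.compile(r"\d+\.\d+\.\d+\.\d+")

# Assumed URL template for a Brenda enzyme page.
BRENDA_TEMPLATE = "https://www.brenda-enzymes.org/enzyme.php?ecno={ec}"


def is_valid_ec(ec_number):
    """Check that the identifier is a fully specified EC number."""
    return EC_PATTERN.fullmatch(ec_number.strip()) is not None


def brenda_links(enzymes):
    """Split (enzyme_uri, ec_number) pairs into rdfs:seeAlso triples for
    well-formed EC numbers and a list of 'faulty' enzymes for the rest."""
    links, faulty = [], []
    for uri, ec in enzymes:
        if is_valid_ec(ec):
            url = BRENDA_TEMPLATE.format(ec=ec.strip())
            links.append((uri, "rdfs:seeAlso", url))
        else:
            faulty.append((uri, ec))
    return links, faulty
```

Returning the rejected pairs alongside the links mirrors the side effect reported in Section 2.3, where link verification exposed enzymes with malformed EC numbers in the source data.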

We have divided the obstacles into four categories:

• Data quality-related problems
• Issues related with semantic incompatibility between datasets
• Maintainability of the created data set
• Adequacy of standard ontologies

The proposed division resulted in the subsections presented below. Each of the following subsections treats a category of the problems in more detail, presenting the background and outlining possible solutions.

3.1. Data quality

Resolving data quality issues might be the key enabling factor for the 'Web of Data'. Unfortunately [...] is a major obstacle in building integrated datasets and [...] nonexistent or missing objects, to publication of malformed literal values, splitting single literal values into several values, etc.

In an already mentioned example it was found that organism codes from Bio2rdf Kegg are not related with any organism names, nor taxonomy codes. In other words, the dataset contained pathways grouped by organism, but did not provide the information on what the organisms were. Users would have to look up this basic information on the Web in an additional effort. We managed to locate and integrate the missing information automatically but it affected our ability to extract the core data using only standard tools.

Another example includes splitting literal values. We came across cases of entity synonyms split into two within Bio2rdf Kegg. So, for example, instead of having a triple [...], we would encounter a pair of triples: [...] and [...].

Fig.5. The metabolic pathways interface.

In the domain of Life Science, entities are referred to with various alternative names. This kind of publication error seriously affects our ability to consume the data as it complicates even further the processes of instance matching, interlinking and identification (disambiguation).

When it comes to tackling the problem, it seems that linked open biological data on metabolic pathways is often erroneous or incomplete and, to present real value to the users, the data needs to be corrected and completed. Certainly, there is an open field for research in automatic curation of the data. Our vision however aims instead to use the community aspect and to establish a peer data curation platform for Linked Biological Data. This kind of initiative, in combination with additional integration effort, is a possible solution to turn currently available open data on metabolic pathways into a high quality usable knowledge base.

3.2. Overcoming heterogeneity

One of the main problems in the field of data integration throughout the body of Linked Data is dealing with highly heterogeneous data. Over the past few years there has been a lot of effort put into developing new methods for coping with this issue. As the data on the 'ideal' Web of Data is expressed via ontologies, the work in the integration domain has been concentrated on ontology matching, instance matching and enabling technology. Various approaches stressed the importance of publishing mappings between frequently used vocabularies (on the schema level) and reusing existing, standardized ontologies/vocabularies.

It is clear that by following the reusability guidelines, by employing schema and ontology matching tools, one is able to reduce the overall heterogeneity that has to be dealt with. Additionally, schema level matching usually can be done manually in a reasonable time. It is the data level matching (instance matching) that seems to be especially essential for the data consumption but also much more complicated.

As mentioned before, in our case it was desirable to match instances between datasets in order to generate appropriate links. Conveniently, we could practically dismiss the problem of schema level mismatches, as mapping the relevant structures was not a challenging issue. Figure 6 shows the simplified schema level dependencies between the datasets involved in the 'big picture'. We can reasonably assume that the data in the 'shared' classes overlaps at least partially between datasets.

The results of the matching process are summarized in Table 5. The established links are related with the cases of matching simple data objects. As already mentioned, we have failed to interlink related pathways as the task proved to be especially difficult in terms of reaching satisfying results. There are

three main reasons that make this interlinking job so challenging: (1) complexity of pathway objects, as they are collections of simpler objects (reactions), which in turn relate other objects (reactants, controllers, etc.); (2) complexity of cross-pathway relations: pathways may overlap, one pathway may include or equal another pathway, and pathways are equivalent only for the same organisms; (3) vocabularies are used differently in different databases.

Fig.6. Schema level dependencies between databases.

Reasons (1) and (2) are closely related, as (2) can be seen as a consequence of (1). As there are no strict rules regarding pathway modeling and classification, we can easily imagine a situation in which a larger pathway is represented as a network of smaller interconnected pathways (nesting pathways is also a possibility with respect to established representation norms). Considering just conditions (1) and (2), it can be imagined that pathway matching is in fact a difficult process, as a matching algorithm would consist of representing pathways in a flattened form and comparing their structure and elements in order to produce a similarity score. Additionally, such an approach would have to consider a series of rule-based constraints (for example, to ensure that the organisms match in the pathways). The presence of the third condition, (3), additionally complicates the task of pathway matching. For example, when the Kegg and Reactome RDF distributions are considered, important differences can be observed in the general approach towards domain modeling. As mentioned before, in Kegg an instance of the Compound class denotes a certain chemical compound in general (regardless of where it is and how it interacts), whereas in Reactome an instance of the respective SmallMolecule class denotes a certain compound in a specific situation (for example, as a product of a specific reaction in the human body).

By adapting the Kegg data to the BioPAX standard, the dataset is expressed through a vocabulary compatible with the Reactome data. However, mapping to a compatible vocabulary did not resolve the problem; the intended use of entities remained different across the databases. The problem is presented visually in Figure 7. It shows that in the resultant dataset distinct pathway steps may ‘share’ the same Reaction instance, whereas in Reactome a Reaction instance practically denotes an occurrence of a reaction within a pathway step.

In order to be able to compare pathway structures automatically in this situation, we would need to consider a scenario in which we first map corresponding Reactome instances to the more general Kegg-based instances and consider the matched instances equal during the rest of the pathway matching process. Later we would have to execute the comparison of the ‘flattened’ pathway structures. It is more than likely that the first part of the process is not achievable using fully automated techniques, so it would need a great deal of human input. Additionally, this is not a general issue and it requires special treatment for each specific interlinking scenario. Matching pathways is therefore not possible with standard tools, nor with simple scripting.

The answer to this issue could be found in using a social data curation platform, as mentioned in the previous subsection. Experts and other specialists could match the related objects in a community-based effort with respect to the heterogeneity issue presented in this paper. Such a platform would certainly need some non-standard interlinking operators that could denote object inclusion, specification, etc.

Apart from that, the pathway matching itself also seems to be an interesting problem. To start with, we could assume an ‘ideal’ conditions scenario, in which both datasets express their data in a unified (compatible) manner.
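As an illustration only, the flatten-and-compare idea under the ‘ideal’ conditions scenario might be sketched as follows. This is not the procedure used in the project: the `Pathway` class, the reaction identifiers (`R1`, `R2`, ...), the organism codes and the choice of Jaccard overlap as the similarity score are all assumptions made for the example, and reactions are assumed to already share identifiers across datasets.

```python
from dataclasses import dataclass, field


@dataclass
class Pathway:
    """Hypothetical in-memory pathway: own reactions plus nested sub-pathways."""
    name: str
    organism: str
    reactions: set = field(default_factory=set)
    subpathways: list = field(default_factory=list)


def flatten(pathway):
    """Collect the reactions of a pathway and of all nested sub-pathways."""
    collected = set(pathway.reactions)
    for sub in pathway.subpathways:
        collected |= flatten(sub)
    return collected


def similarity(a, b):
    """Jaccard overlap of the flattened reaction sets.

    Rule-based constraint: pathways of different organisms never match.
    """
    if a.organism != b.organism:
        return 0.0
    reactions_a, reactions_b = flatten(a), flatten(b)
    if not reactions_a and not reactions_b:
        return 0.0
    return len(reactions_a & reactions_b) / len(reactions_a | reactions_b)


# Placeholder data: one dataset nests two smaller pathways,
# the other models the same process as a single flat pathway.
upper = Pathway("upper part", "hsa", {"R1", "R2"})
lower = Pathway("lower part", "hsa", {"R3"})
nested = Pathway("glycolysis (nested)", "hsa", subpathways=[upper, lower])
flat = Pathway("glycolysis (flat)", "hsa", {"R1", "R2", "R3", "R4"})
other_organism = Pathway("glycolysis", "mmu", {"R1", "R2", "R3"})

print(similarity(nested, flat))            # 3 shared of 4 distinct reactions
print(similarity(nested, other_organism))  # organisms differ
```

The sketch covers only conditions (1) and (2); it says nothing about condition (3), where the same vocabulary terms are used with different intent and shared reaction identifiers cannot be assumed.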

Fig.7. Different (contextual) usage of the same ontology concepts.

3.3. Process reusability and its weak links

The aim was to make the process of dataset generation reusable and as maintainable as possible. To achieve this goal, well-established tools were used: LDIF [20] and SILK. The upside of this choice was that these tools are configured declaratively and it is easy to reflect vocabulary changes in those configurations.

Additional data cleaning was implemented through simple Java programs that would read the whole dataset, detect a predefined type of error (missing type expressions, malformed URIs, etc.) and rewrite the corrected model.

Unfortunately, some of the data cleaning had to be done manually (detection of erroneous synonyms, splitting entangled instances, etc.), so the dataset generation was not fully automated.

Reaching conformance to the BioPAX standard required us to retrieve the missing organism information from the Kegg database webpage and then relate this information to the Taxonomy database codes by accessing the source of the Taxonomy database webpage. Establishing this kind of relation was not possible using only the available Open Linked Data. Thus, it was retrieved through a non-standard process, which cannot be highly regarded in terms of maintaining high standards in information engineering. It is enough that one of the databases involved changes the presentation of its data and the information that supports the links will disappear from the Web. Additionally, the Kegg organism codes sometimes differ from their Bio2rdf counterparts, which makes it impossible to fill in all the information gaps (semi)automatically. Yet again, the real problem revolves around the quality and availability of biological data in the Open Linked Data domain.

As in the case of the previously presented problems, we believe that the solution lies in establishing a platform to improve the data quality. This would enable us to treat the generated dataset as a base for future improvement. In this scenario, the dataset generation process would be used for coverage expansion (incorporating new pathways through periodic searches performed on the sources), while the previously extracted data would ‘live on’ independently, without the need to review its structure against external data sources. The independent evolution of the Linked Data sources can be an important step forward, especially as the natural links between Linked Data and traditional databases seem to go on with little support.

3.4. Making use of the BioPAX ontology

Development of a data source calls for a decision regarding its ontology. In the context of Linked Data, the most important implication is that the choice defines the way the data is represented.

As stated before, in terms of data integration, choosing the right ontology might increase data interoperability when a reusable and standardized vocabulary is chosen.

Our experience shows that using a standardized vocabulary (in our case BioPAX level 2) is useful in terms of choosing a basic set of concepts within the domain of interest. This is particularly important in the context of cross-domain science, where a standard vocabulary can be seen as a conceptual reference map of the domain. This is obviously an important advantage, but the problem described in Section 3.2 shows that the value of standardizing the vocabulary should not be overestimated, as the standardization itself does not necessarily lead to solving the heterogeneity issues, especially as accurate vocabulary level mappings were relatively easy to obtain throughout the project.

Additionally, migrating to the BioPAX standard is costly when we consider the size of the output (ca. 1GB) data compared to the size of the input data (ca.

100MB). As mentioned earlier, the standard vocabulary uses a relatively complex model in order to be able to reflect more of the potentially relevant information. Nonetheless, this relevant information is potentially present when we consider the ‘real world’, but often it will not be possible to derive it from the original data. This means that in order to achieve standard conformance, we agree to employ a more complex (perhaps overly complex) model to represent the original information (which in terms of completeness does not really need the rich model). Figure 3 presents the domain concepts as they are used in the BioPAX level 2 standard. Furthermore, the ontological connections present in the standard are harder to process automatically in terms of taking advantage of their underlying semantics. When we consider the ‘intuitive’ model, it can be seen that reactions are related to their reactants through a single property, whereas in the actual BioPAX model there is a utility instance ‘in-between’. The middle instance is interpreted as an activity of a certain metabolite in a reaction and permits the addition of associative properties (for example, stoichiometric information). As convenient as it is to use these constructions for domain modeling, this approach has at least two undesirable characteristics. Firstly, the middle instance is hardly interpretable as a real world object and, in terms of eventual automatic processing, it hides the truly important and meaningful relationship between a reaction and its reactant. Secondly, it leads to representing information, which was originally represented in the Kegg vocabulary with one RDF triple per reaction, with at least four triples per reaction participant. The original convention is not perfect either, as it needs a specific parsing method in order to process the information it conveys. Nonetheless, for a moderately sized reaction with four participants, BioPAX will need 16 triples to express information originally expressed with 1 triple. This tendency is presented in Figure 8.

In conclusion, we need to stress that in our opinion reusing standard vocabularies is a good practice when we want to increase interoperability. It is, however, worth noting that reusing standard ontologies does not ensure full interoperability as such, although it does facilitate important tasks that are carried out across different databases (such as, for example, entity disambiguation and resolution). Furthermore, it seems right to evaluate the representation standards with regard to their suitability for the Linked Data domain. In our opinion, such an evaluation may lead the community to consider introducing substantial changes in the standards to make sure that they are well suited for the Web of Data.

Fig.8. Increasing number of RDF triples associated with the use of the BioPAX vocabulary.

4. Related Work

The idea of data integration in Life Sciences has come about mainly due to the large and increasing number of resources. Compared to other disciplines such as physics or astronomy, biological datasets are not very large. The main characteristic of biological data is its complexity. The data complexity is caused by a number of features: the diversity and contextualization of the data, the ambiguity caused by the presence of multiple synonyms, the large number of ‘complex’/non-atomic objects, and the different data granularity across different sets.

Since 1995, there have been numerous attempts to establish an integration system in Life Sciences. Davidson et al. proposed a procedure for data integration and presented two sets to integrate YAC (Yeast artificial chromosome) and Alu-PCR databases [21]. In 2003, Köhler et al. implemented SEMEDA (Semantic Meta Database), which provides semantically enhanced access to databases in a collaborative environment for maintaining, editing and controlling ontologies and controlled vocabularies [22]. In 2003, Stein presented some of the integration issues and different ways to integrate information: link integration seems to be the most successful way of integrating heterogeneous data, despite its shortcomings [23].

Recent years brought projects that advocate using OWL-based representations. TAMBIS (Transparent Access to Multiple Bioinformatics Information Sources) was the first attempt to use OWL for the conceptualization and contextualization of the knowledge from five data sources in the molecular biology domain [24]. BioPAX is another project that uses

an OWL-based format to represent biosignaling and metabolic pathways. The Uniprot consortium has used RDF standardization to represent the proteomic knowledge included in its well-established relational database.

Yeast Hub [25] and Fungal Web [26] are two other projects which focus on the data integration issue. The Yeast Hub project has developed a domain dedicated ontology for yeast genomics. Cheung KH et al. used the Sesame system [27] to provide inference services for the integrated knowledge. Fungal Web is a data-centric infrastructure catering for the Web. It was designed to represent fungal genomic knowledge from five data sources. Like the Yeast Hub, it also provides the end-user with reasoning services [28]. In the process of information integration, these research projects also point to the heterogeneity and granularity related issues.

Stephens et al. proposed a data integration model and applied it to the field of drug discovery for chemotherapy resistant patients. This integration platform was based on the Oracle RDF Data Model and SeaMark Navigation from Siderean Software [29]. This work is a demonstration of the usefulness of the integrated biological information obtained by using a combination of Web technologies and semantic standards.

In the context of the Linked Data technology, there have been several important projects set in the domain of Life Sciences. Bio2RDF and Linked Life Data are projects whose objective was to share the RDFized information from biological data sources. However, these projects are not very attractive for users due to the lack of user friendly views and interfaces and the questionable quality of RDF data.

In the area of pharmacology, Chem2Bio2RDF is a drug repository that is cross-referenced to Bio2RDF and LODD (Linked Open Drug Data). This project integrates PubChem BioAssay, DrugBank, Kegg Ligand, CTD, BindingDB, PharmGKB, MATADOR, and QSAR sets.

Using the experience provided by the initiatives mentioned as a starting point, we carried out our project work. Despite some similarities, our project differs substantially from the ones presented. Firstly, as in the case of Bio2Rdf Reactome and other pathway related RDF publications, we chose one of the BioPAX ontologies in order to provide a standardized representation; however, we provide a critical assessment of its applicability in the Linked Data context. Secondly, the previous projects integrated biological information either without focusing on a specific domain (Bio2RDF), or presenting only very specific data (for example, limited only to yeast metabolism). We chose to experiment with data integration in a broad (over 900 organisms), yet clearly bounded (to metabolic pathways) field. Additionally, biological pathways present a special case in which the information is very difficult to aggregate. To address this, we attempted to develop a platform that integrates biological information from seven databases. Our goal was to ensure the widest possible information coverage for a researcher working with metabolic information.

We pointed out several difficulties encountered during the integration process. The literature describes some approaches for the maintenance of the data and RDF links [30] by using tools (such as Silk) to recalculate the links at regular intervals through data sources publishing update feeds. An alternative approach is to provide information on data source changes via subscription models to central registries that keep track of the changes of data items [31]. We also tried to address the maintainability issue in terms of designing a sustainable integration process, but yet again we came across the data quality bump.

In the context of metabolic pathways, we had experience developing the Systems Biology Metabolic Modeling Assistant (SBMM Assistant), a tool built using an ontology-based mediator and designed to facilitate metabolic modeling through the integration of data from metabolic repositories [32].

5. Conclusions & Future work

The Linked Data technology was supposed to solve most of the problems that integration processes present. Some of the solutions include: the conversion of the biological data to semantic web formats (e.g. RDF or OWL), the use of standards, RDF/OWL languages and the large-scale reasoning over these data. With the objective of applying this emerging technology to the Life Sciences, we have tried to build an integration solution based on biochemical pathway information, making the process of dataset generation reusable and maintainable. To guarantee interoperability in this domain, the BioPAX level 2 ontology has been used.

The application built on top of the integrated dataset is a data access module intended for a systems biology researcher. It enables querying and browsing different detail layers of the metabolism, from the global maps down to the information related to specific low level entities. This information contains metabolic, kinetic, genomic and proteomic integrated information from the Uniprot, INOH, DrugBank, Brenda, Reactome and Kegg databases. This pathway platform provides an application ready, consolidated view of the domain.

However, in the course of the project, some obstacles were encountered. One of the main issues in the integration process was the lack of data quality. Most of the issues were related to non-existent references, missing objects, incorrect literal values, etc. Another observation was related to data heterogeneity and complexity of the domain, the complexity of relationships between entities, inter-pathway relationships (overlapping pathways or different pathway boundaries) and the use of different terms for a unique biological concept (e.g., drug or metabolite). Furthermore, we detected and corrected some data errors through semi-automated scripting. All the main challenges encountered have been discussed in order to share our experience; we have also presented outlines of the solutions to the problems we came across. Those solutions and discussion points might enable faster growth of the Linked Data community in the Life Sciences domain, as the Linked Data community awaits some key improvements, better reasoning services, etc.

One of the considered future lines of research is to improve the approach in order to enable data integration through a dynamic process, in which we would consider crawling the biological web of data and incorporating updated and corrected information on pathways. This could result in improving the reusability of the process.

Acknowledgments

The Project Grant TIN2011-25840 (Spanish Ministry of Education and Science) and P11-TIC-7529 (Innovation, Science and Enterprise Ministry of the regional government of the Junta de Andalucía) have supported this work.

References

[1] Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., Hattori, M.: The KEGG Resource for Deciphering the Genome. Nucleic Acids Research 32: 277-280 (2004)
[2] Schomburg, I., Hofmann, O., Baensch, C., Chang, A., Schomburg, D.: Enzyme data and metabolic information: BRENDA, a resource for research in biology, biochemistry, and medicine. Gene Funct. Dis. 3:109-118 (2000)
[3] Joshi-Tope, G., Vastrik, I., Gopinathrao, G., Matthews, L., Schmidt, E., Gillespie, M., D'Eustachio, P., Jassal, B., Lewis, S., Wu, G., Birney, E., Stein, L.: The Genome Knowledgebase: A Resource for Biologists and Bioinformaticists. Cold Spring Harb Symp Quant Biol 68:237-243 (2003)
[4] Jain, E., Bairoch, A., Duvaud, S., Phan, I., Redaschi, N., Suzek, B.E., Martin, M.J., McGarvey, P., Gasteiger, E.: Infrastructure for the life sciences: design and implementation of the UniProt website. BMC Bioinformatics 10:136 (2009)
[5] Manola, F., Miller, E.: RDF Primer. W3C Recommendation. World Wide Web Consortium (2004)
[6] Dean, M., Schreiber, G.: OWL web ontology language reference. W3C recommendation, W3C (2004)
[7] Bizer, C., Heath, T., Berners-Lee, T.: Linked data - the story so far. International Journal on Semantic Web and Information Systems (IJSWIS) 5:1-22 (2009)
[8] Linking Open Data cloud diagram: http://richard.cyganiak.de/2007/10/lod/
[9] Belleau, F., Nolin, M.A., Tourigny, N., Rigault, P., Morisette, J.: Bio2RDF: Towards a Mashup to Build Bioinformatics Knowledge Systems. Journal of Biomedical Informatics 41: 706-717 (2008)
[10] Momtchev, V., Peychev, D., Primov, T., Georgiev, G. 2009. Expanding the Pathway and Interaction Knowledge in Linked Life Data. In: 8th Proceedings of International Semantic Web Challenge, Washington, USA (2008)
[11] Kiryakov, A., Ognyanov, D., Manov, D.: OWLIM - A Pragmatic Semantic Repository for OWL. Lecture Notes in Computer Science 3807, 182-192 (2005)
[12] Kobilarov, G., Scott, T., Raimond, Y., Oliver, S., Sizemore, C., Smethurst, M., Bizer, C., Lee, R.: The Semantic Web: Research and Applications. Lecture Notes in Computer Science 5554: 723-737 (2009)
[13] Chen, B., Dong, X., Jiao, D., Wang, H., Zhu, Q., Ding, Y., Wild, D.: Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data 11: 1471-2105 (2010)
[14] BioPAX consortium.: Nature Biotechnology 28, 935-942 (2010)

[15] Wishart, D.S., Knox, C., Guo, A.C., Shrivastava, S., Hassanali, M., Stothard, P., Chang, Z., Woolsey, J.: DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res 1:668-672 (2006)
[16] Federhen, S.: The NCBI Taxonomy database. Nucleic Acids Res 40: 136-143 (2012)
[17] Yamamoto, S., Sakai, N., Nakamura, H., Fukagawa, H., Fukuda, K., Takagi, T.: INOH: ontology-based highly structured database of signal transduction pathways. Database (2011)
[18] DrugBank: http://www4.wiwiss.fu-berlin.de/drugbank/
[19] Hucka, M. et al.: The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19: 524-531 (2003)
[20] Schultz, A., Matteini, A., Isele, R., Mendes, P., Bizer, C., Becker, C.: LDIF - A Framework for Large-Scale Linked Data Integration. In: 21st International World Wide Web Conference (WWW2012), Developers Track. Lyon, France, April 2012
[21] Davidson, S.B., Overton, C., Buneman, P.: Challenges in Integrating Biological Data Sources. Journal of Computational Biology 2:557-572 (1995)
[22] Köhler, J., Philippi, S., Lange, M.: SEMEDA: ontology based semantic integration of biological databases. Bioinformatics 19: 2420-2427 (2003)
[23] Stein, L.D.: Integrating biological databases. Nat Rev Genet 4:337-345 (2003)
[24] Stevens, R., Baker, P.G., Bechhofer, S., Ng, G., Jacoby, A., Paton, N.W., Goble, C.A., et al.: TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources. Bioinformatics 16:184-185 (2000)
[25] Cheung, K.H., Yip, K.Y., Smith, A., Deknikker, R., Masiar, A., Gerstein, M.: YeastHub: a semantic web use case for integrating data in the life sciences domain. Bioinformatics 21:85-96 (2005)
[26] Baker, C.J.O., Shaban-Nejad, A., Su, X., Haarslev, V., Butler, G.: Semantic web infrastructure for fungal enzyme biotechnologists. Web Semant. 4: 168-180 (2006)
[27] Cheung, K.H., Yip, K.Y., Smith, A., Deknikker, R., Masiar, A., Gerstein, M.: YeastHub: a semantic web use case for integrating data in the life sciences domain. Bioinformatics 21(1):85-96 (2005)
[28] Sesame: http://www.openrdf.org/news.jsp
[29] Stephens, S., LaVigna, D., DiLascio, M., Luciano, J.: Aggregation of bioinformatics data using Semantic Web technology. Web Semant. 4, 216-221 (2006)
[30] Volz, J., Bizer, C., Gaedke, M., Kobilarov, G.: Discovering and Maintaining Links on the Web of Data. In: Proceedings of the 8th International Semantic Web Conference (ISWC '09), Abraham Bernstein, David R. Karger, Tom Heath, Lee Feigenbaum, Diana Maynard, Enrico Motta, and Krishnaprasad Thirunarayan (Eds.). Springer-Verlag, Berlin, Heidelberg, 650-665 (2009)
[31] Auer, S., Dietzold, S., Lehmann, J., Hellmann, S., Aumueller, D.: Triplify: light-weight linked data publication from relational databases. In: Proceedings of the 18th international conference on World wide web (WWW '09). ACM, New York, NY, USA, 621-630 (2009)
[32] Reyes-Palomares, A., Montañez, R., Real-Chicharro, A., Chniber, O., Kerzazi, A., Navas-Delgado, I., Medina, M.M., Aldana-Montes, J.F., Sánchez-Jiménez, F.: Systems biology metabolic modeling assistant: an ontology-based tool for the integration of metabolic data in kinetic modeling 6: 834-835 (2009)