Integration of Metabolic Pathway Linked Data: Lessons Learned
Maciej Rybinski, María Jesús García-Godoy, Ismael Navas-Delgado and José F. Aldana-Montes
Universidad de Málaga, Spain

Abstract. In the last few years, the Life Science domain has experienced rapid growth in the number of available biological databases. The heterogeneity of these databases makes data integration a challenging issue. Some integration challenges concern locating resources, relations, data formats, synonyms or ambiguity. The Linked Data approach partly solves the heterogeneity problems (mainly the syntactic ones). Linked Data refers to a set of best practices for publishing and connecting structured data on the Web. However, the Linked Data approach is not a solution in itself. This paper illustrates a data integration and interlinking process for metabolic pathway databases such as Uniprot, Kegg, Brenda and Reactome. This process was an important part of the development of a tool for biological end-users that integrates metabolic pathway information by using Linked Data technology. The integration process and the tool development show that heterogeneity problems have not been solved in this domain, and so this paper discusses the problems that arise from the use of data and links published as Linked Data and the possible approaches to solving these issues.

Keywords: Linked Data, Knowledge integration, Bioinformatic databases

1. Introduction

Over the last two decades, the biological database community has witnessed rapid growth in the number of available data sources. The growth is a result of an enormous increase in the availability and accessibility of biological data. In the beginning, the information was used by a few specialized disciplines, but today these databases have become essential resources used by biologists around the world.

In the metabolism area, there are several examples of such databases. Kegg [1] has as its major components pathways (of the best-known biochemical and regulatory networks) shown in graphical diagrams. Brenda [2] is one of the largest available information systems, and provides biochemical and molecular information about classified enzymatic activity. Reactome [3] captures all the chemical reactions and pathways that occur in different organisms.

Many biologists use these databases on a daily basis in order to extract the biological information needed in their work. For example, consider a user who searches the Kegg database for the enzyme code 5.4.2.2 (phosphoglucomutase, Homo sapiens) in order to obtain information about this enzyme in glycolysis/gluconeogenesis pathways. The result is a list of pathways in which the target enzyme works. The user has to browse the map to reach the link to the enzyme. The information at this link includes genes, pathways, classes of pathways, orthology and biochemical details. However, the user has to go to another database to discover kinetic and proteomic information about the target enzyme. For example, the user can use the Brenda database for information related to the enzyme code, including the components of chemical reactions, the reaction type, pathways, synonyms, kinetic information (kcat, Ki, pH, temperature) and cofactors, tissue source, cellular location and information about molecular structure. Additionally, the user can access the Uniprot database [4] to see information on corresponding protein attributes, sequence annotations, alternative splicing of the enzymes, literature and general information. For each database the search has to be repeated independently. This scenario reflects the main issue with these data sources, namely that each database contains a subset of the biological knowledge that might be of relevance.
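The scenario above can be made concrete with a small sketch. It is only an illustration of the "repeat the search in every database" problem and of what an integrated view keyed on the enzyme code might look like; it is not the tool described in this paper. The dictionaries standing in for Kegg, Brenda and Uniprot, their field names, the `<placeholder>` values and the function `integrated_view` are all assumptions introduced for this example, not real schemas or query results.

```python
# Toy stand-ins for three independent database lookups.
# Field names and "<placeholder>" values are hypothetical, not real records.
KEGG = {
    "5.4.2.2": {
        "name": "phosphoglucomutase",
        "pathways": ["glycolysis/gluconeogenesis"],
    }
}
BRENDA = {"5.4.2.2": {"kinetics": "<placeholder>", "cofactors": "<placeholder>"}}
UNIPROT = {"5.4.2.2": {"protein_attributes": "<placeholder>"}}

SOURCES = {"kegg": KEGG, "brenda": BRENDA, "uniprot": UNIPROT}


def integrated_view(ec_number, sources=SOURCES):
    """Collect what each source holds for one enzyme code into a single record."""
    view = {"ec": ec_number, "missing_in": []}
    for source_name, table in sources.items():
        record = table.get(ec_number)
        if record is None:
            # Remember which sources had nothing for this enzyme code.
            view["missing_in"].append(source_name)
            continue
        for field, value in record.items():
            # Prefix each field with its source so provenance is kept
            # and identically named fields cannot overwrite each other.
            view[f"{source_name}.{field}"] = value
    return view


if __name__ == "__main__":
    for key, value in integrated_view("5.4.2.2").items():
        print(key, "=", value)
```

A single call such as `integrated_view("5.4.2.2")` replaces three separate searches in this toy setting. The sketch assumes that all sources share the same key, which is precisely the assumption that breaks down in practice when sources use different identifiers for the same entity.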
These knowledge islands cause a problem for biologists, who have to repeatedly browse different databases to obtain the answer to a cross-domain information need.

To address the problem of information integration, the Semantic Web community, led by the W3C, proposed a set of standards such as RDF [5] and OWL [6]. Since 2007, a lot of effort has been made to provide data sets from different areas using Semantic Web technologies. In this context, a set of best practices has been proposed for sharing, publishing and connecting data, information and knowledge by using RDF and URIs. These practices are known as the Linked Data principles [7].

The movement towards the publication and linking of data has been growing continuously, and the number of triples stored in the Linked Data Cloud has grown from 2 billion in 2007 to 31 billion in 2011 [8]. In Life Sciences the amount of RDF information published has increased enormously due to the efforts of several projects. Some of the relevant integration platforms providing data in this domain are Bio2RDF [9] and Linked Life Data [10]. These two platforms have converted relational information into RDF and linked the heterogeneous data between biological databases with the objective of publishing billions of triples representing biological knowledge. Linked Life Data is a data integration platform built on a massive RDF warehouse solution extended with inference (reasoning) services and semantic annotations. It is supported by the OWLIM semantic repository [11] and stores 20 billion statements. Bio2RDF, by contrast, is a mash-up application that combines data from different relevant biological databases.

The Linked Data community seems to be switching from the data publication stage to the data consumption stage. In many domains there are few or no systems for Linked Data consumption; however, the paradigm itself has proven useful in practice through efforts such as BBC Music [12] and Linked Open Drug Data (LODD) [13].

In Life Sciences, one of the main integration problems is related to the complexity of biological data. This complexity makes it difficult to develop 'wide-span' applications for end-users. This paper presents an integration solution (demo version available at http://150.214.214.5/metabolicpathways/ ) in the context of metabolic pathways that uses Linked Data as its source of information. On top of the integration part we have implemented functionalities such as navigation, search and visualization to make the integrated data easily accessible to users with a biological background. During the implementation process, several problems with the real-world usage of integrated biological data were detected. We have categorized these problems, and this paper discusses how to face the obstacles belonging to each category.

The paper is organized as follows: Section 2 describes the process used to provide access to heterogeneous biological information and the results obtained. Section 3 presents a detailed description of the problems and a discussion of the quality of the data available within the Linked Data Cloud, the heterogeneity of the data and reusability bottlenecks. Section 4 describes some of the related work and approaches that address the maintenance issues of the Linked Data Cloud. Section 5 includes the conclusions resulting from this integration experiment and the consequent improvements to be implemented in the Biochemical Pathway application.

2. Providing access to heterogeneous information

The integration of biological databases in Life Sciences has become a challenge for researchers. One of the main problems of biological data is its syntactic and semantic heterogeneity.

Additionally, over the last decade the number of biological databases has increased. The annual NAR journal supplement listed 96 new online databases in 2001, 800 in 2007 and 92 in the year 2012 (Figure 1). The increase in the number of databases each year is due to the ease of publishing them on the Web. There are many databases, and each one has its own repositories and a community built around it. Consequently, across the data sources a variety of resources are replicated, overlap or are presented from different points of view.

Fig. 1. The growth rate of biological databases published in the NAR in the last decade. The X-axis represents the years and the Y-axis represents the number of databases that appeared per year in the NAR database supplement.

To solve (to some extent) the integration problem in bioinformatics, Linked Data is increasingly being published as it gains wider recognition throughout the biological community. However, some challenges have to be faced in order to reach useful solutions. For example, one of the common challenges in Life Sciences integration is synonymy. There are many synonyms for the same biological entity as a consequence of naming entities independently. In the area of metabolism, most of the redundant information results from using different identifiers to denote the same metabolite. Ambiguity is sometimes also caused by conceptual heterogeneity; for example, ethanol is a drug but also a metabolite.

We have addressed the challenges of integration in Life Sciences during the development of an end-user tool that shows data provided by different biological data sets that store metabolic information.

nor is that desired in terms of providing a complete knowledge base for the application, as it is unrealistic to assume that it is possible to create a 'database of everything'. In the most basic case, integration of two knowledge sources would simply mean interlinking our dataset with an existing relevant one (regardless of whether it follows Linked Data principles or is a 'traditional' biological database). Integration in this sense follows Linked Data principles and provides the user with 'navigational' freedom of information access. Nonetheless, in order to be able to perform the interlinking process one still needs the core data to start with. As automatic merging of the pathway data is a very complex task, to say the least (the topic is covered in more detail in Section 3.2), it seems accurate to expect that by choosing a specific dataset for our core data we also approximately es-