Open PHACTS

Deliverable 6.24

Integrating OPS data with Human Disease maps.

Prepared by J. Pinero, N. Queralt, L. Furlong (PSMAR), C. Chichester (SIB)
Approved by AZ, UNIVIE, Novartis

August 2013
Version 1.0

Project title: An open, integrated and sustainable chemistry, biology and pharmacology knowledge resource for drug discovery
Instrument: IMI JU
Contract no: 115191

Start date: 01 March 2011
Duration: 3 years

Nature of the Deliverable Report x Prototype Other Dissemination level Public dissemination level For internal use only x

______© Copyright 2011 Open PHACTS Consortium

Open Deliverable: Integrating OPS data with Deliverable: 6.24 PHACTS Human Disease maps. Author: J. Pinero, N. Queralt, L. Furlong IMI - 115191 Version: 1.0 2 / 7 (PSMAR), C. Chichester (SIB)

Definitions

- Partners of the Open PHACTS Consortium are referred to herein according to the following codes:

Pfizer – Pfizer limited – Coordinator
UNIVIE – Universität Wien – Managing entity of IMI JU funding
DTU – Technical University of Denmark – DTU
UHAM – University of Hamburg, Center for
BIT – BioSolveIT GmbH
PSMAR – Consorci Mar Parc de Salut de Barcelona
LUMC – Leiden University Medical Centre
RSC – Royal Society of Chemistry
VUA – Vrije Universiteit Amsterdam
CNIO – Spanish National Cancer Research Centre
UNIMAN – University of Manchester
UM – University of Maastricht
ACK – ACKnowledge
USC – University of Santiago de Compostela
UBO – Rheinische Friedrich-Wilhelms-Universität Bonn
AZ – AstraZeneca
GSK – GlaxoSmithKline
Esteve – Laboratorios del Dr. Esteve, S.A.
Novartis – Novartis
ME – Merck Serono
HLU – H. Lundbeck A/S
E.Lilly – Eli Lilly
NBIC – Stichting Netherlands Bioinformatics Centre
SIB – Swiss Institute of Bioinformatics
ConnDisc – Connected Discovery
EBI – European Bioinformatics Institute
Janssen – Janssen Pharmaceutica
OGL – OpenLink Software

- Grant Agreement: The agreement signed between the beneficiaries and the IMI JU for the undertaking of the Open PHACTS project.
- Project: The sum of all activities carried out in the framework of the Grant Agreement.
- Work plan: Schedule of tasks, deliverables, efforts, dates and responsibilities corresponding to the work to be carried out, as specified in the Grant Agreement.
- Consortium: The Open PHACTS Consortium composed of the above-mentioned legal entities.
- Project Agreement: Agreement concluded amongst Open PHACTS participants for the implementation of the Grant Agreement. Such an agreement shall not affect the parties’ obligations to the Community and/or to one another arising from the Grant Agreement.


1 The DisGeNET database

1.1 Description of the resource

DisGeNET is a gene-disease database created to promote understanding of the mechanisms underlying complex diseases and adverse drug reactions. Several databases collect associations between genes and diseases, but each focuses on a different aspect of the phenotype-genotype relationship. DisGeNET is a comprehensive gene-disease database designed to contain the current knowledge of human genetic diseases, including Mendelian, complex and environmental diseases.

DisGeNET integrates gene-disease associations from several publicly available, expert-curated data sources, together with associations derived from the literature by text-mining tools. The integration is performed by mapping gene and disease vocabularies, and by harmonizing the description of gene-disease associations with the DisGeNET gene-disease association ontology [1]. DisGeNET also integrates pathway information for disease genes, extracted from Reactome, and provides the SNPs associated with gene-disease relationships. Together these give a more complete picture of the biological processes underlying a disorder and of the correlation of specific genomic variants with disease susceptibility. For more information, please see the original publications [2] and [3]. The current version of DisGeNET (DisGeNET v2.0, July 2012) contains 100,729 associations between 9313 genes and 6029 diseases.
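The integration by vocabulary mapping described above can be illustrated with a minimal sketch. It is not the DisGeNET pipeline: the mapping tables, source names and identifiers are all placeholders invented for illustration, and the function name `integrate` is hypothetical. The sketch only shows the general idea of translating source-specific gene and disease identifiers into shared ones and then merging records that describe the same association.

```python
from collections import defaultdict

# Hypothetical source-specific vocabularies mapped to shared identifiers.
# Every identifier below is a placeholder, not a real database entry.
GENE_MAP = {
    "srcA:g1": "GENE_1",
    "srcB:gene-one": "GENE_1",
    "srcB:gene-two": "GENE_2",
}
DISEASE_MAP = {
    "srcA:d9": "DISEASE_X",
    "srcB:dis-x": "DISEASE_X",
}


def integrate(records):
    """Merge (source, gene_id, disease_id) records on shared identifiers.

    Returns a dict mapping (gene, disease) to the sorted list of sources
    reporting that association. Records with an unmapped gene or disease
    identifier are skipped.
    """
    merged = defaultdict(set)
    for source, gene_id, disease_id in records:
        gene = GENE_MAP.get(gene_id)
        disease = DISEASE_MAP.get(disease_id)
        if gene is None or disease is None:
            continue
        merged[(gene, disease)].add(source)
    return {pair: sorted(sources) for pair, sources in merged.items()}


records = [
    ("srcA", "srcA:g1", "srcA:d9"),
    ("srcB", "srcB:gene-one", "srcB:dis-x"),
    ("srcB", "srcB:gene-two", "srcB:dis-x"),
    ("srcB", "srcB:gene-two", "srcB:unknown"),  # unmapped disease, skipped
]
integrated = integrate(records)
```

Keeping the set of contributing sources per association is one possible design choice; it allows the number of distinct associations to be counted per source as well as overall.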

1.1.1 Primary data sources

DisGeNET v2.0 integrates human and mouse gene-disease associations from the following open data sources:

Human data

UniProt: UniProt/Swiss-Prot contains a range of information about proteins. Only protein-disease entries and protein-gene links were retained. This database provided 2525 distinct gene-disease associations for 1754 genes and 2243 diseases.

CTD™: The Comparative Toxicogenomics Database™ contains curated gene-disease associations focusing on the effects of environmental chemicals on human health. It includes associations from the literature and from OMIM. For Homo sapiens, this database provided 16382 distinct gene-disease associations for 6065 genes and 4403 diseases.

GAD: The Genetic Association Database stores human genetic association studies of complex diseases (studies from papers published in peer-reviewed journals and GWAS). This database provided 12798 distinct gene-disease associations for 2461 genes and 1395 diseases.

LHGDN: The Literature-derived Human Gene-Disease Network is a text-mining-derived database focused on extracting and classifying gene-disease associations with respect to several biomolecular conditions. This database provided 59274 distinct gene-disease associations for 6140 genes and 1847 diseases.


Mouse data

CTD_mouse™: For Mus musculus, the CTD™ database provided 118 distinct gene-disease associations for 47 genes and 79 diseases.

MGD: The Mouse Genome Database provides annotation of phenotypes and human disease associations for mouse models (genotypes). This database provided 1749 distinct gene-disease associations for 1253 genes and 1016 diseases.

DisGeNET is publicly available through a web interface [4], which also offers several tools to explore and analyze the data. For network biology studies involving gene-disease associations, a Cytoscape plugin is available [3]. The database is available in SQLite and RDF.

1.2 The RDF conversion

The DisGeNET MySQL database has been converted to RDF to extend the Linked Data space with pharmacologically relevant data about genes that are associated with diseases and about their interactions through biological pathways. Importantly, the RDF version of DisGeNET is planned to be integrated into release 1.5 of the OPS platform, making it possible to include disease-gene-pathway concepts and to integrate them with OPS compound-protein data.

Figure 1. Simplified RDF schema.


The RDF version of DisGeNET is a set of triples centered on the gene-disease association concept. Additional information, such as genomic variation or the pathways in which disease genes are known to be involved, is linked to this main concept (see Figure 1). The conversion was done using the RDF and OWL languages and common ontologies and vocabularies, following the Linked Data principles of open access and interoperability of the data. In addition, a SPARQL endpoint [5] has been implemented to provide open access to the data. The RDF schema, the data dump, the VoID description file and the SPARQL endpoint can be accessed via a web interface [6]. Notably, our RDF disease instances can be linked out to OMIM RDF disease data. As the Open PHACTS project is co-developing and exploiting the RDF-based nanopublication format, an adaptation of the DisGeNET RDF data to the OPS nanopublication format is currently being tackled.
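As an illustration of how a SPARQL endpoint such as [5] can be queried programmatically, the following minimal sketch builds a standard SPARQL Protocol GET request with the Python standard library. The query is deliberately schema-agnostic (it simply lists a few triples), because the DisGeNET RDF classes and predicates are not reproduced in this report. The function names `build_request` and `run_query` are illustrative, and it is an assumption that the endpoint is reachable and returns SPARQL JSON results.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://rdf.imim.es/sparql"  # endpoint cited as reference [5]

# Schema-agnostic query: makes no assumption about DisGeNET predicates.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(endpoint, query):
    """Build a SPARQL Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(endpoint, query, timeout=30):
    """Send the query and return the result bindings as a list of dicts."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]


request = build_request(ENDPOINT, QUERY)
# To execute against a live endpoint: rows = run_query(ENDPOINT, QUERY)
```

Separating request construction from execution keeps the sketch usable offline; only `run_query` touches the network.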

1.3 Database update

In the first release of DisGeNET, DisGeNET v1.0, the primary sources integrated were CTD™, UniProt, OMIM, PharmGKB, and LHGDN. To enrich the database content, an update was launched in July 2012. DisGeNET v2.0 includes additional sources: GAD, MGD, and mouse (Mus musculus) information from CTD that was not previously available. In the current release, the disease identifiers from OMIM and MeSH were changed to concept unique identifiers from the Unified Medical Language System (UMLS), a system that integrates many health and biomedical vocabularies and standards. For licensing reasons, the OMIM and PharmGKB data sources had to be removed and are not currently included in DisGeNET v2.0. Nevertheless, as these two datasets were already adapted for their integration in DisGeNET v1.0, they could easily be reintegrated into DisGeNET if the licensing issues are overcome.

1.4 Future objectives

1. The DisGeNET team is currently working on a new update, planned for release by the end of this year, that will refresh the human and mouse information and add information for Rattus norvegicus, using the Rat Genome Database and the Rattus norvegicus associations in CTD. In addition, we are planning to update the set of text-mining associations. In the next release, the updated gene-disease association ontology developed in our group and integrated in the Semanticscience Integrated Ontology (SIO) [7] will be incorporated into the data. The RDF version of DisGeNET will be updated accordingly. We also plan to improve the browser and add new functionality at users' request.

2. The DisGeNET RDF will be incorporated into the Linked Data Cache (LDC) in one of the Open PHACTS releases subsequent to the 1.3 release, to integrate and make available the disease-gene associations extracted from the resources described previously. The information supplied in the DisGeNET RDF has database cross-references to other data presently in the LDC, notably UniProt. The neXtProt RDF, which will also be added to the LDC after the 1.3 release, also has cross-references to UniProt and will supply tissue expression data. The link between tissue expression and disease data will therefore be exposed in the Open PHACTS API as part of future releases.

3. As a first step in the effort to produce a “visualization of transporter interactions applied in the prediction of tissue distribution, in the context of human disease map”, a demonstration of the target pharmacology-tissue expression relationships will be given at the IMI workshop for Management of Tissue Knowledge resources (see reference [8] for details). This demonstration has required the generation and loading of the neXtProt tissue expression RDF data into a triple store, the development of queries over this data, and the manual combination of the results with data already present in the LDC.
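The link between tissue expression and disease data through shared UniProt cross-references (objective 2) amounts to a join on a common protein identifier. The sketch below illustrates that idea only. It does not use the Open PHACTS API or the actual RDF data: the accessions, disease labels and tissue annotations are placeholders, and the function name `link_tissue_to_disease` is hypothetical.

```python
# Placeholder records keyed by a shared protein accession.
# Accessions and labels are illustrative, not real data.
disease_by_protein = {
    "ACC_0001": ["DISEASE_X"],
    "ACC_0002": ["DISEASE_Y", "DISEASE_Z"],
}
tissue_by_protein = {
    "ACC_0001": ["liver", "kidney"],
    "ACC_0003": ["brain"],
}


def link_tissue_to_disease(diseases, tissues):
    """Pair tissues with diseases for proteins present in both datasets.

    Returns a sorted list of (tissue, disease, accession) tuples.
    """
    links = set()
    # Only accessions cross-referenced by both datasets can be joined.
    for accession in diseases.keys() & tissues.keys():
        for disease in diseases[accession]:
            for tissue in tissues[accession]:
                links.add((tissue, disease, accession))
    return sorted(links)


links = link_tissue_to_disease(disease_by_protein, tissue_by_protein)
```

In a triple store the same join would typically be expressed as a single query over both datasets; the in-memory version is used here only to keep the example self-contained.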

1.5 References

[1] Furlong, L.I. (2012, Oct 11). DisGeNET gene-disease association ontology. Retrieved from http://ibi.imim.es/DisGeNET-Dev/ontologies/GeneDiseaseAssociation_v4.owl

[2] Bauer-Mehren et al., Gene-disease network analysis reveals functional modules in mendelian, complex and environmental diseases. PLOS One 6, 6 (2011), e20284.

[3] Bauer-Mehren et al., DisGeNET: a cytoscape plugin to visualize, integrate, search and analyze gene-disease networks. Bioinformatics 26, 22 (2010), 2924-2926.

[4] Furlong, L.I. (2012). DisGeNET web interface. Available from http://ibi.imim.es/DisGeNET

[5] Furlong, L.I. (2013). DisGeNET RDF SPARQL endpoint. Available from http://rdf.imim.es/sparql

[6] Furlong, L.I. (2013). DisGeNET RDF web interface. Available from http://rdf.imim.es/DisGeNET.html

[7] Dumontier et al., The Semanticscience Integrated Ontology (SIO) for Biomedical Research and Knowledge Discovery. 2013. (Submitted).

[8] Tissue Knowledge Management Workshop: The workshop is jointly organised by partners on IMI projects DDMoRe, Open PHACTS and eTox, and hosted by the Innovative Medicines Initiative (IMI). This event will address the challenge of encoding, organizing, classifying and sharing knowledge about tissues in support of pharmaceutical R&D. In particular, this workshop will focus on the following five R&D resource types to enable the cross-linking, querying and visualisation of tissue knowledge embedded across IMI efforts:

1) development: clinical measurement data (e.g. radiological biomarkers, fluid biochemistry, physiology measurements flow, electrophysiology);
2) modelling: mechanism-based models of disease- or drug-related processes in pharmacokinetics (e.g. absorption, distribution, metabolism and excretion) and pharmacodynamics;
3) discovery: measurement of gene expression and gene product localisation, as well as tissue-specific information about drug receptors and transporters;
4) safety/tox: tissue-specific toxicology assessments, histopathology reports, and anatomy-specific pharmacovigilance signals;
5) terms: terminological resources for disease and clinical trials (e.g. CDISC).


To that end, the two primary activities of this workshop are to:

[Day 1] survey and report on the ‘state of the art’ in tissue knowledge representation (KR) and management (KM) across ongoing IMI projects. In particular, various groups will hold demonstrations of their methods and tools that enable the functional integration across any of the above resource types through standardized annotation, automated inferencing or multiscale visualisation.

[Day 2] identify stakeholders to roadmap the sustainable development of (i) communal KR interoperability standards, (ii) tools that leverage these standards, resulting in (iii) shared tissue KM practices for the IMI community. Three key topics for discussion and co-ordination are community development, publications and funding strategies.
