TERA: the Toxicological Effect and Risk Assessment Knowledge Graph
Total Page:16
File Type:pdf, Size:1020Kb
TERA: the Toxicological Effect and Risk Assessment Knowledge Graph Erik B. Myklebust1;2, Ernesto Jim´enez-Ruiz2;3, Jiaoyan Chen4, Raoul Wolf1, and Knut Erik Tollefsen1;5 1 Norwegian Institute for Water Research, Oslo, Norway 2 Department of Informatics, University of Oslo, Norway 3 City, University of London, United Kingdom 4 Department of Computer Science, University of Oxford, United Kingdom 5 Faculty of Environmental Sciences and Natural Resource Management, Norwegian University of Life Sciences, As,˚ Norway Abstract. Ecological risk assessment requires large amounts of chemi- cal effect data from laboratory experiments. Due to experimental effort and animal welfare concerns it is desired to extrapolate data from ex- isting sources. To cover the required chemical effect data several data sources need to be integrated to enable their interoperability. In this pa- per we introduce the Toxicological Effect and Risk Assessment (TERA) knowledge graph, which aims at providing such integrated view, and the data preparation and steps followed to construct this knowledge graph. We also present the applications of TERA for chemical effect prediction and the potential applications within the Semantic Web community. Keywords: Ecotoxicology, Risk Assessment, Knowledge Graph 1 Introduction Expanding the scope of ecological risk assessment models is a key goal in compu- tational ecotoxicological research. However, the limiting factor in risk assessment is often the availability of toxicological effect data for a given compound and a given organism (species). The potential use of ten to hundreds of test organisms becomes ethically questionable. Moreover, collection of these data is labour- and cost-intensive and often requires extensive laboratory experiments. arXiv:1908.10128v5 [cs.AI] 12 Dec 2019 One major challenge in risk assessment processes is the interoperability of data. In this paper we present the Toxicological Effect and Risk Assessment (TERA) Knowledge Graph that aims at providing an integrated view of the rel- evant data sources.y The data sources that TERA integrates vary from tabular, to RDF files and SPARQL queries over public linked data. Certain sources are y This paper focuses and extends on the construction of the TERA knowledge graph as a resource for the ecotoxicological and Semantic Web domains, while our paper in [1] had a special focus on the use of knowledge graph embeddings and machine learning for chemical effect prediction. 2 E. B. Myklebust et. al. very large and frequently updated, therefore TERA is materialized upon request via a series of APIs that are created to interact with TERA. The main contributions of this paper are summarized as follows: (i) We have released the TERA knowledge graph. A partially materialized snapshot among other relevant resources are publicly available (see Zenodo repository [2]). (ii) TERA also includes the mappings between ECOTOX and NCBI for which there was not a complete and public alignment. (iii) We have created a series of APIs to access, update and extend TERA (see GitLab repository [3]). (iv) We describe several applications of the TERA knowledge graph both in the ecotoxicology domain (e.g., chemical effect prediction) and the Seman- tic Web domain (e.g., embedding of hierarchical biased knowledge graphs, large scale ontology alignment). This paper is organized as follows. Section 2 provides an introduction to ecotoxicological risk assessment, while Section 3 provides some background to facilitate the understanding of the subsequent sections. In Section 4 we first present the data sources used to construct TERA, before we show the integration of these sources, and finally show several ways of accessing TERA. Section 5 describes the potential applications of TERA, while Section 6 elaborates on the contribution of this work. 2 Background Ecotoxicology is a multidisciplinary field that studies the potentially adverse toxicological effects of chemicals on individuals, sub-populations, communities and ecosystems. In this context, risk is the result of the intrinsic hazards of a substance on species, populations or ecosystems, combined with an estimate of the environmental exposure, i.e., the product of exposure and effect (hazard). Figure 1 shows a simplified risk assessment pipeline. Exposure data is gath- ered from analysis of environmental concentrations of one or more chemicals, while effects (hazards) are characterized for a number of species in the labora- tory as a proxy for more ecologically relevant organisms. These two data sources are used to calculate risk, using so-called assessment factors to extrapolate a risk quotient (RQ; ratio between exposure and effects). The RQ for one chemical or the mixture of many chemicals is used to identify chemicals with the highest RQs (risk drivers), susceptible species (or taxa), identify relevant modes of action1 (MoA) and characterize detailed toxicity mechanisms for one or more species (or taxa). Results from these predictions can generate a number of new hypotheses that can be investigated in the laboratory or studied in the environment. The effect data is obtained from available data, or in the case of no avail- able data, during laboratory experiments, where the sub-population of a single 1 The functional or anatomical change in an organism due to exposure to a compound is called MoA. TERA: the Toxicological Effect and Risk Assessment Knowledge Graph 3 Risk Quotient Risk Hot spot Exposure identification Susceptible Effects Species Risk Driver Fig. 1: Simplified ecological risk assessment pipeline. species is exposed to a gradient of concentrations of a chemical. Most commonly, mortality rate, growth, development or reproductive output are measured over time. To give a good indication of the toxicity to a species, these experiments are conducted with a concentration range spanning from no effect (0%) to complete effect (100%) when this is pragmatically possible. Hence some compounds will be more toxic than others and variance in susceptibility between species may provide a distribution of the effective concentration for one specific compound. Ecological risk assessment requires large amounts of effect data to efficiently predict risk for the ecosystems and ecosystem components (e.g., species and taxa). The data must cover a minimum number of the chemicals found when analysing environmental samples, along with covering species and taxa present in the ecosystem. This leads to an immense search space that is close to impos- sible to encompass in its entirety and risk assessment is thus often limited by lack of sufficiently high quality effect data. It becomes essential to extrapolate from known to unknown combinations of chemical{species pairs, which in some degree can be overcome by predicting the effects themselves through the use of quantitative structure{activity relationship models (QSARs). These models have shown promising results for use in risk assessment, e.g., [4], but have lim- ited application domain (coverage), both in terms of compounds and species. Use of read-across and selection of proxy compounds that are chemically sim- ilar, display similar toxicity or have similar MoA and toxicity mechanisms are therefore becoming an attractive solution with increasing popularity (e.g., [5,6]). Development of computational approaches that identify data that can be used for identifying proxy compounds to be used for read-across and data gap filling, is key to facilitate rapid, cost-effective, reliable and transparent predictions of 4 E. B. Myklebust et. al. new effects. We contribute in this regard by creating a semantic layer, i.e., a knowledge graph, to enable extraction and integration of this high quality data. 3 Preliminaries Knowledge graphs. We follow [7] in the notion of a RDF-based knowledge graph which is represented as a set of RDF triples hs; p; oi, where s represents a subject (a class or an instance), p represents a predicate (a property) and o represents an object (a class, an instance or a data value e.g., text, date and number). RDF entities (i.e., classes, properties and instances) are repre- sented by URIs (Uniform Resource Identifier). A knowledge graph consits of a terminology and an assertions box (TBox and ABox). The TBox is com- posed by RDF Schema constructs like class subsumption (e.g., ncbi:taxon/6668 rdfs:subClassOf ncbi:taxon/6657) and property domains and ranges (e.g., et:concentration rdfs:domain et:Chemical).2 The ABox contains relation- ships among instances and type definitions (e.g., et:taxon/28868 rdf:type et:Taxon). SPARQL Queries. RDF-based knowledge graphs can be accessed by SPARQL query language.3 Next we summarise the SPARQL constructs used in this work: (i) Select queries are used when the desired output is tabular. (ii) Construct queries can be used if the purpose of the query is to create a new graph. We use this in Listing 3 to create equivalence triples. (iii) Property paths express multiple edges in a graph. e.g., alternate paths (e.g., rdfs:label | foaf:name), inverse relations (e.g., ^rdf:type), path se- quences (e.g., rdf:type / rdfs:subClassOf), and any combination of these. (iv) A blank node is a node where the identifier is not explicitly given. This al- lows the use of temporary nodes in queries. e.g., Listing 4 uses [rdfs:label "Oslofjorden"@no] to represent a node with label Oslofjorden. Moreover, the extended syntax of SPARQL enables the use of complex property paths (e.g., a path of minimum 1 to maximum n rdfs:subClassOf relations is represented as rdfs:subClassOff1,ng), concatenating variables (e.g., Listing 3), aggregations and more.4 Ontology alignment. Finding the corresponding mappings between a source and a target ontology or knowledge graph is called ontology alignment [9]. In this work, computed mappings are represented in the knowledge graph as triples among the entities of the source and target (e.g., ncbi:taxon/13402 owl:sameAs et:taxon/Carya). 2 The OWL 2 ontology language provides more expressive constructors. Note that the graph projection of an OWL 2 ontology can be seen as a knowledge graph (e.g., [8]).