Ontology Alignment in the Biomedical Domain Using Entity Definitions and Context

Lucy Lu Wang†, Chandra Bhagavatula, Mark Neumann, Kyle Lo, Chris Wilhelm, and Waleed Ammar
Allen Institute for Artificial Intelligence
†Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, Washington, USA
[email protected]

Abstract

Ontology alignment is the task of identifying semantically equivalent entities from two given ontologies. Different ontologies have different representations of the same entity, resulting in a need to de-duplicate entities when merging ontologies. We propose a method for enriching entities in an ontology with external definition and context information, and use this additional information for ontology alignment. We develop a neural architecture capable of encoding the additional information when available, and show that the addition of external data results in an F1-score of 0.69 on the Ontology Alignment Evaluation Initiative (OAEI) largebio SNOMED-NCI subtask, comparable with the entity-level matchers in a SOTA system.

1 Introduction

Ontologies are used to ground lexical items in various NLP tasks including entity linking, question answering, semantic parsing and information retrieval.[1] In biomedicine, an abundance of ontologies (e.g., MeSH, Gene Ontology) has been developed for different purposes. Each ontology describes a large number of concepts in healthcare, public health or biology, enabling the use of ontology-based NLP methods in biomedical applications. However, since these ontologies are typically curated independently by different groups, many important concepts are represented inconsistently across ontologies (e.g., "Myoclonic Epilepsies, Progressive" in MeSH is a broader concept that includes "Dentatorubral-pallidoluysian atrophy" from OMIM).

This poses a challenge for bioNLP applications where multiple ontologies are needed for grounding, but each concept must be represented by only one entity. For instance, in www.semanticscholar.org, scientific publications related to carpal tunnel syndrome are linked to one of multiple entities derived from UMLS terminologies representing the same concept,[2] making it hard to find all relevant papers on this topic. To address this challenge, we need to automatically map semantically equivalent entities from one ontology to another. This task is referred to as ontology alignment or ontology matching.

Several methods have been applied to ontology alignment, including rule-based and statistical matchers. Existing matchers rely on entity features such as names, synonyms, as well as relationships to other entities (Shvaiko and Euzenat, 2013; Otero-Cerdeira et al., 2015). However, it is unclear how to leverage the natural language text associated with entities to improve predictions. We address this limitation by incorporating two types of natural language information (definitions and textual contexts) in a supervised learning framework for ontology alignment. Since the definition and textual contexts of an entity often provide complementary information about the entity's meaning, we hypothesize that incorporating them will improve model predictions. We also discuss how to automatically derive labeled data for training the model by leveraging existing resources. In particular, we make the following contributions:

• We propose a novel neural architecture for ontology alignment and show how to effectively integrate natural language inputs such as definitions and contexts in this architecture (see §2 for details).[3]
• We use the UMLS Metathesaurus to extract large amounts of labeled data for supervised training of ontology alignment models (see §3.1). We release our data set to help future research in ontology alignment.
• We use external resources such as Wikipedia and scientific articles to find entity definitions and contexts (see §3.2 for details).

[1] Ontological resources include ontologies, knowledgebases, terminologies, and controlled vocabularies. In the rest of this paper, we refer to all of these with the term 'ontology' for consistency.
[2] See https://www.semanticscholar.org/topic/Carpal-tunnel-syndrome/248228 and https://www.semanticscholar.org/topic/Carpal-Tunnel-Syndrome/3076
[3] Implementation and data available at https://www.github.com/allenai/ontoemma/
Figure 1: OntoEmma consists of three modules: a) candidate selection (see §2.2 for details), b) feature generation (see §2.2 for details), and c) prediction (see §2.3 for details). OntoEmma accepts two ontologies (a source and a target) as inputs, and outputs a list of alignments between their entities. When using a neural network, the feature generation and prediction model are combined together in the network.

2 OntoEmma

In this section, we describe OntoEmma, our proposed method for ontology matching, which consists of three stages: candidate selection, feature generation, and prediction (see Fig. 1 for an overview).

2.1 Problem definition and notation

We start by defining the ontology matching problem: given a source ontology O_s and a target ontology O_t, each consisting of a set of entities, find all semantically equivalent entity pairs, i.e., {(e_s, e_t) ∈ O_s × O_t : e_s ≡ e_t}, where ≡ indicates semantic equivalence. For consistency, we preprocess entities from different ontologies to have the same set of attributes: a canonical name (e_name), a list of aliases (e_aliases), a textual definition (e_def), and a list of usage contexts (e_contexts).[4]

[4] Some of these attributes may be missing or have low coverage. See §3.2 for coverage details.

2.2 Candidate selection and feature generation

Many ontologies are large, which makes it computationally expensive to consider all possible pairs of source and target entities for alignment. For example, the number of all possible entity pairs in our training ontologies is on the order of 10^11. To reduce the number of candidates, we use an inexpensive low-precision, high-recall candidate selection method based on the inverse document frequency (idf) of word tokens appearing in entity names and definitions. For each source entity, we first retrieve all target entities that share a token with the source entity. Given the set of shared word tokens w_{s+t} between a source and target entity, we sum the idf of each token over the set, yielding idf_total = Σ_{i ∈ w_{s+t}} idf(i). Tokens with higher idf values appear less frequently overall in the ontology and presumably contribute more to the meaning of a specific entity. We compute the idf sum for each target entity and output the K = 50 target entities with the highest value for each source entity, yielding |O_s| × K candidate pairs.
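To make this step concrete, the following is a minimal sketch of the idf-based filter described above, not the released OntoEmma implementation: it assumes entities are plain dictionaries with "name" and "definition" strings, uses a simple whitespace tokenizer, and computes idf over the target ontology only.

```python
import math
from collections import Counter, defaultdict

def tokens(entity):
    """Word tokens drawn from an entity's name and definition (whitespace tokenizer)."""
    text = (entity.get("name", "") + " " + entity.get("definition", "")).lower()
    return set(text.split())

def select_candidates(source_onto, target_onto, k=50):
    """For each source entity, return the k target entities with the highest
    summed idf over shared tokens (a low-precision, high-recall filter)."""
    # Document frequency of each token over the target ontology's entities.
    df = Counter()
    inverted_index = defaultdict(set)  # token -> ids of target entities containing it
    for t_id, t_ent in target_onto.items():
        toks = tokens(t_ent)
        df.update(toks)
        for tok in toks:
            inverted_index[tok].add(t_id)
    n = len(target_onto)
    idf = {tok: math.log(n / df[tok]) for tok in df}

    candidates = {}
    for s_id, s_ent in source_onto.items():
        scores = Counter()
        for tok in tokens(s_ent):
            for t_id in inverted_index.get(tok, ()):  # only targets sharing a token
                scores[t_id] += idf[tok]              # accumulate idf_total
        candidates[s_id] = [t_id for t_id, _ in scores.most_common(k)]
    return candidates

# Example with two toy ontologies keyed by entity id.
src = {"S1": {"name": "carpal tunnel syndrome", "definition": "compression of the median nerve"}}
tgt = {"T1": {"name": "carpal tunnel syndrome", "definition": ""},
       "T2": {"name": "myoclonic epilepsy", "definition": "a seizure disorder"}}
print(select_candidates(src, tgt))  # -> {'S1': ['T1']}
```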
For each candidate pair (e_s, e_t), we precompute a set of 32 features commonly used in the ontology matching literature, including the token Jaccard distance, stemmed token Jaccard distance, character n-gram Jaccard distance, root word equivalence, and other boolean and probability values over the entity name, aliases, and definition.[5]

[5] Even though neural models may obviate the need for feature engineering, feeding highly discriminative features into the neural model improves the inductive bias of the model and reduces the amount of labeled data needed for training.

2.3 Prediction

Given a candidate pair (e_s, e_t) and the precomputed features f(e_s, e_t), we train a model to predict the probability that the two entities are semantically equivalent. Figure 2 illustrates the architecture of our neural model for estimating this probability, which resembles a siamese network (Bromley et al., 1993). At a high level, we first encode each of the source and target entities, then concatenate their representations and feed them into a multi-layer perceptron ending with a sigmoid function for estimating the probability of a match. Next, we describe this architecture in more detail.

Figure 2: Siamese network architecture for computing entity embeddings for each source and target entity in a candidate entity pair.

Entity embedding. As shown in Fig. 2 (left), we encode the attributes of each entity as follows:

• A canonical name e_name is a sequence of tokens, each encoded using pretrained word2vec embeddings concatenated with a character-level convolutional neural network (CNN). The token vectors feed into a bi-directional long short-term memory network (LSTM), and the hidden layers at both ends of the bi-directional LSTM are concatenated and used as the name vector v_name.
• Each alias in e_aliases is independently embedded using the same encoder used for canonical names (with shared parameters), yielding a set of alias vectors v_{alias-i} for i = 1, ..., |e_aliases|.
• An entity definition e_def is a sequence of tokens, each encoded using pretrained embeddings then fed into a bi-directional LSTM. The definition vector v_def is the concatenation of the final hidden states in the forward and backward LSTMs.
• Each context in e_contexts is independently embedded using the same encoder used for definitions (with shared parameters), then averaged, yielding the context vector v_contexts.

The name, alias, definition, and context vectors are appended together to create the entity embedding; e.g., the embedding of the source entity e_s is v^s = [v^s_name; v^s_{alias-i*}; v^s_def; v^s_contexts]. In order to find representative aliases for a given pair of entities, we pick the source and target aliases with the smallest Euclidean distance, i.e., (i*, j*) = argmin_{i,j} ‖v^s_{alias-i} − v^t_{alias-j}‖_2.

Siamese network. After the source and target entity embeddings are computed, they are fed into two subnetworks with shared parameters followed by a parameterized function for estimating similarity. Each subnetwork is a two-layer feedforward network with ReLU non-linearities and dropout (Srivastava et al., 2014).
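As a rough illustration of how the entity embeddings and the siamese scorer fit together, here is a sketch in PyTorch. It assumes the per-attribute vectors (name, alias, definition, context) have already been produced by the encoders in Figure 2 and share a single dimension; the layer sizes, dropout rate, and the concatenate-then-score head are illustrative assumptions rather than the authors' exact configuration, and the hand-crafted features of §2.2 are omitted.

```python
import torch
import torch.nn as nn

class SiameseMatcher(nn.Module):
    """Shared two-layer ReLU/dropout subnetwork applied to each entity embedding,
    followed by an MLP + sigmoid over the concatenated outputs (illustrative sizes)."""

    def __init__(self, attr_dim=100, hidden_dim=128, dropout=0.3):
        super().__init__()
        entity_dim = 4 * attr_dim  # [name; alias; definition; context]
        self.subnetwork = nn.Sequential(  # parameters shared between source and target
            nn.Linear(entity_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout),
        )
        self.scorer = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Sigmoid(),
        )

    @staticmethod
    def pick_alias_pair(src_aliases, tgt_aliases):
        """Pick the source/target alias vectors with the smallest Euclidean distance,
        mirroring the argmin over pairwise distances described in the text."""
        dists = torch.cdist(src_aliases, tgt_aliases)  # (num_src_aliases, num_tgt_aliases)
        i, j = divmod(torch.argmin(dists).item(), dists.shape[1])
        return src_aliases[i], tgt_aliases[j]

    def forward(self, src, tgt):
        # src / tgt: dicts of precomputed vectors; 'name', 'definition', 'contexts' are
        # 1-D tensors of size attr_dim, 'aliases' is 2-D (num_aliases x attr_dim).
        a_s, a_t = self.pick_alias_pair(src["aliases"], tgt["aliases"])
        v_s = torch.cat([src["name"], a_s, src["definition"], src["contexts"]], dim=-1)
        v_t = torch.cat([tgt["name"], a_t, tgt["definition"], tgt["contexts"]], dim=-1)
        h_s, h_t = self.subnetwork(v_s), self.subnetwork(v_t)  # shared parameters
        return self.scorer(torch.cat([h_s, h_t], dim=-1))      # probability of a match

# Example usage with random attribute vectors for a single candidate pair.
model = SiameseMatcher()
ent = lambda n_alias: {"name": torch.randn(100), "aliases": torch.randn(n_alias, 100),
                       "definition": torch.randn(100), "contexts": torch.randn(100)}
print(model(ent(3), ent(5)).item())  # match probability in [0, 1]
```

Sharing the subnetwork parameters between the source and target branches is what makes the model siamese: both entities are projected by the same learned function before their similarity is scored.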