Biomedical Entity Representations with Synonym Marginalization
Mujeen Sung  Hwisang Jeon  Jinhyuk Lee†  Jaewoo Kang†
Korea University
{mujeensung, j hs, jinhyuk lee, [email protected]
†Corresponding authors

arXiv:2005.00239v1 [cs.CL] 1 May 2020

Abstract

Biomedical named entities often play important roles in many biomedical text mining tools. However, due to the incompleteness of provided synonyms and numerous variations in their surface forms, normalization of biomedical entities is very challenging. In this paper, we focus on learning representations of biomedical entities solely based on the synonyms of entities. To learn from the incomplete synonyms, we use model-based candidate selection and maximize the marginal likelihood of the synonyms present in the top candidates. Our model-based candidates are iteratively updated to contain more difficult negative samples as our model evolves. In this way, we avoid the explicit pre-selection of negative samples from more than 400K candidates. On four biomedical entity normalization datasets having three different entity types (disease, chemical, adverse reaction), our model BIOSYN consistently outperforms previous state-of-the-art models, almost reaching the upper bound on each dataset.

1 Introduction

Biomedical named entities are frequently used as key features in biomedical text mining. From biomedical relation extraction (Xu et al., 2016; Li et al., 2017a) to literature search engines (Lee et al., 2016), many studies utilize biomedical named entities as a basic building block of their methodologies. While the extraction of biomedical named entities has been studied extensively (Sahu and Anand, 2016; Habibi et al., 2017), the normalization of extracted named entities is also crucial for improving the precision of downstream tasks (Leaman et al., 2013; Wei et al., 2015).

Unlike named entities from general domain text, typical biomedical entities have several different surface forms, making the normalization of biomedical entities very challenging. For instance, while the two chemical entities ‘motrin’ and ‘ibuprofen’ belong to the same concept ID (MeSH:D007052), they have completely different surface forms. On the other hand, mentions having similar surface forms can also have different meanings (e.g., ‘dystrophinopathy’ (MeSH:D009136) and ‘bestrophinopathy’ (MeSH:C567518)). These examples show a strong need for building latent representations of biomedical entities that capture the semantic information of the mentions.

In this paper, we propose a novel framework for learning biomedical entity representations based on the synonyms of entities. Previous works on entity normalization mostly train binary classifiers that decide whether two input entities are the same (positive) or different (negative) (Leaman et al., 2013; Li et al., 2017b; Fakhraei et al., 2019; Phan et al., 2019). Our framework, called BIOSYN, uses the synonym marginalization technique, which maximizes the probability of all synonym representations in the top candidates. We represent each biomedical entity using both sparse and dense representations to capture morphological and semantic information, respectively. The candidates are iteratively updated based on our model’s representations, removing the need for explicit negative sampling from a large number of candidates. Also, the model-based candidates help our model learn from more difficult negative samples. Through extensive experiments on four biomedical entity normalization datasets, we show that BIOSYN achieves new state-of-the-art performance on all datasets, outperforming previous models by 0.8%∼2.6% top-1 accuracy. Further analysis shows that our model’s performance has almost reached the performance upper bound of each dataset.

The contributions of our paper are as follows: First, we introduce BIOSYN for biomedical entity representation learning, which uses synonym marginalization, dispensing with the explicit need for negative training pairs. Second, we show that iterative candidate selection based on our model’s representations is crucial for improving performance together with synonym marginalization. Finally, our model outperforms strong state-of-the-art models by up to 2.6% on four biomedical normalization datasets.¹

¹ Code available at https://github.com/dmis-lab/BioSyn.

2 Related Works

Biomedical entity representations have largely relied on biomedical word representations. Right after the introduction of Word2Vec (Mikolov et al., 2013), Pyysalo et al. (2013) trained Word2Vec on biomedical corpora such as PubMed. Their biomedical version of Word2Vec has been widely used for various biomedical natural language processing tasks (Habibi et al., 2017; Wang et al., 2018; Giorgi and Bader, 2018; Li et al., 2017a), including the biomedical normalization task (Mondal et al., 2019). Most recently, BioBERT (Lee et al., 2019) has been introduced for contextualized biomedical word representations. BioBERT is pre-trained on biomedical corpora using BERT (Devlin et al., 2019), and numerous studies use BioBERT for building state-of-the-art biomedical NLP models (Lin et al., 2019; Jin et al., 2019; Alsentzer et al., 2019; Sousa et al., 2019). Our model also uses pre-trained BioBERT for learning biomedical entity representations.

The intrinsic quality of biomedical entity representations is often verified by the biomedical entity normalization task (Leaman et al., 2013; Phan et al., 2019). The goal of the biomedical entity normalization task is to map an input mention from a biomedical text to its associated CUI (Concept Unique ID) in a dictionary. The task is also referred to as entity linking or entity grounding (D’Souza and Ng, 2015; Leaman and Lu, 2016). However, the normalization of biomedical entities is more challenging than the normalization of general domain entities due to the large number of synonyms. Also, the variations of synonyms depend on their entity types, which makes building a type-agnostic normalization model difficult (Leaman et al., 2013; Li et al., 2017b; Mondal et al., 2019). Our work is generally applicable to any type of entity and is evaluated on four datasets having three different biomedical entity types.

While traditional biomedical entity normalization models are based on hand-crafted rules (D’Souza and Ng, 2015; Leaman et al., 2015), recent approaches to biomedical entity normalization have improved significantly with various machine learning techniques. DNorm (Leaman et al., 2013) is one of the first machine learning-based entity normalization models, which learns pair-wise similarity using tf-idf vectors. Another machine learning-based study is the CNN-based ranking method (Li et al., 2017b), which learns entity representations using a convolutional neural network. The works most similar to ours are NSEEN (Fakhraei et al., 2019) and BNE (Phan et al., 2019), which map mentions and concept names in dictionaries to a latent space using LSTM models and refine the embeddings using the negative sampling technique. However, most previous works adopt a pair-wise training procedure that explicitly requires making negative pairs. Our work is based on marginalizing positive samples (i.e., synonyms) from iteratively updated candidates and avoids the problem of choosing a single negative sample.

In our framework, we represent each entity with sparse and dense vectors, which is largely motivated by techniques used in information retrieval. Models in information retrieval often utilize both sparse and dense representations (Ramos et al., 2003; Palangi et al., 2016; Mitra et al., 2017) to retrieve relevant documents given a query. Similarly, we can think of the biomedical entity normalization task as retrieving relevant concepts given a mention (Li et al., 2017b; Mondal et al., 2019). In our work, we use maximum inner product search (MIPS) to retrieve the concepts represented as sparse and dense vectors, whereas previous models could suffer from the error propagation of a pipeline approach.

3 Methodology

[Figure 1: The overview of BIOSYN. An input mention and all synonyms in a dictionary are embedded by a shared encoder, and the nearest synonym is retrieved by the inner product. The top candidates used for training are iteratively updated as we train our encoders.]

3.1 Problem Definition

We define an input mention m as an entity string in a biomedical corpus. Each input mention has its own CUI c, and each CUI has one or more synonyms defined in the dictionary. The set of synonyms for a CUI is also called a synset. We denote the union of all synonyms in a dictionary as N = [n1, n2, ...], where n ∈ N is a single synonym string. Our goal is to predict the gold CUI c* of the input mention m as follows:

    c* = CUI(argmax_{n∈N} P(n|m; θ))    (1)

where CUI(·) returns the CUI of the synonym n and θ denotes the trainable parameters of our model.

Dense Entity Representation  While the sparse representation encodes the morphological information of given strings, the dense representation encodes the semantic information. Learning effective dense representations is the key challenge in the biomedical entity normalization task (Li et al., 2017b; Mondal et
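The inference rule in Eq. (1) can be sketched in a few lines: embed the mention and every dictionary synonym, score each pair by inner product, and return the CUI of the best-scoring synonym. This is a toy illustration only; the character-bigram `embed` below is a hypothetical stand-in, not the paper's sparse+dense encoder, and the tiny dictionary reuses the MeSH examples from the introduction.

```python
# Toy sketch of Eq. (1): c* = CUI(argmax_n <embed(m), embed(n)>).
from collections import Counter

def embed(text):
    """Hypothetical stand-in encoder: character-bigram count vector."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))

def inner_product(u, v):
    """Inner product of two sparse count vectors."""
    return sum(u[k] * v[k] for k in u if k in v)

def predict_cui(mention, dictionary):
    """dictionary: list of (synonym_string, cui) pairs."""
    m_vec = embed(mention)
    best = max(dictionary, key=lambda nc: inner_product(m_vec, embed(nc[0])))
    return best[1]  # CUI of the argmax synonym

dictionary = [
    ("ibuprofen", "MeSH:D007052"),
    ("motrin", "MeSH:D007052"),
    ("dystrophinopathy", "MeSH:D009136"),
    ("bestrophinopathy", "MeSH:C567518"),
]
print(predict_cui("motrin tablets", dictionary))  # → MeSH:D007052
```

Note how the mention maps to the concept even though another synonym of the same CUI ('ibuprofen') shares no surface form with it; in BIOSYN that generalization comes from the learned dense representation rather than from character overlap.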
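The candidate retrieval step in Figure 1 ("Sort", "Top k") reduces to maximum inner product search over all dictionary synonyms. A brute-force sketch with plain lists is enough to show the idea; a real system with 400K+ candidates would use a MIPS index instead. The 2-d vectors below are illustrative, not model outputs.

```python
# Brute-force MIPS: rank all synonym vectors by inner product with the
# mention vector and keep the k best as training candidates.
def top_k_candidates(mention_vec, synonym_vecs, k):
    """Return indices of the k synonyms with the largest inner product."""
    scores = [sum(a * b for a, b in zip(mention_vec, v)) for v in synonym_vecs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

syn_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
print(top_k_candidates([1.0, 0.2], syn_vecs, k=2))  # → [0, 1]
```

Because the candidates are recomputed with the current encoder at each iteration, the same routine naturally surfaces harder negatives as training progresses, which is the iterative update the paper describes.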
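The synonym marginalization objective can likewise be sketched: rather than contrasting each positive against one chosen negative, softmax-normalize the scores of the model's current top-k candidates and maximize the summed (marginal) probability of every candidate that is a true synonym of the mention. The scores and labels below are made up for illustration.

```python
# Minimal sketch of a marginal likelihood loss over top-k candidates.
import math

def marginal_nll(candidate_scores, is_synonym):
    """-log of the summed softmax probability of all synonym candidates."""
    m = max(candidate_scores)                      # stabilize the softmax
    exps = [math.exp(s - m) for s in candidate_scores]
    z = sum(exps)
    marginal = sum(e for e, pos in zip(exps, is_synonym) if pos) / z
    return -math.log(marginal)

# Top-4 candidates for one mention: the first two are true synonyms.
scores = [4.0, 3.5, 1.0, 0.5]
labels = [True, True, False, False]
print(round(marginal_nll(scores, labels), 4))  # → 0.0486
```

The loss is small here because the synonyms already dominate the top-k; early in training, when hard negatives outscore synonyms, the marginal probability is low and the gradient pushes all synonym representations toward the mention at once.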