BENNERD: A Neural Named Entity Linking System for COVID-19

Mohammad Golam Sohrab†,*, Khoa N. A. Duong†,*, Makoto Miwa†,‡, Goran Topić†, Masami Ikeda†, and Hiroya Takamura†
†Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Japan
‡Toyota Technological Institute, Japan
{sohrab.mohammad, khoa.duong, [email protected], {ikeda-masami, [email protected], [email protected]

*Both authors contributed equally.

Abstract

We present a biomedical entity linking (EL) system, BENNERD, that detects named entities in text and links them to entries in the Unified Medical Language System (UMLS) knowledge base (KB) to facilitate coronavirus disease 2019 (COVID-19) research. BENNERD mainly covers the biomedical domain, especially new entity types (e.g., coronavirus, viral proteins, immune responses), by addressing the CORD-NER dataset. It includes several NLP tools to process biomedical texts, including tokenization, flat and nested entity recognition, and candidate generation and ranking for EL, which have been pre-trained using the CORD-NER corpus. To the best of our knowledge, this is the first attempt to address NER and EL for COVID-19-related entities, such as the COVID-19 virus, potential vaccines, and spreading mechanisms, and it may benefit research on COVID-19. We release an online system to enable real-time entity annotation with linking for end users. We also release a manually annotated test set and the CORD-NERD dataset for leveraging the EL task. The BENNERD system is available at https://aistairc.github.io/BENNERD/.

1 Introduction

In response to the call for the global research community to apply recent advances in natural language processing (NLP) to coronavirus disease 2019 (COVID-19), the COVID-19 Open Research Dataset (CORD-19)¹ is an emerging research challenge with a resource of over 181,000 scholarly articles related to the infectious disease COVID-19, which is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). To facilitate COVID-19 studies, since NER is considered a fundamental step in text mining systems, Xuan et al. (2020b) created the CORD-NER dataset with comprehensive NE annotations. The annotations are based on distant or weak supervision. The dataset includes 29,500 documents from the CORD-19 corpus. The CORD-NER dataset sheds light on NER, but it does not address the linking task, which is important for COVID-19 research. For example, in the example sentence in Figure 1, the mention SARS-CoV-2 needs to be disambiguated. Since the term SARS-CoV-2 in this sentence refers to a virus, it should be linked to the entry of a virus in the knowledge base, not to the entry 'SARS-CoV-2 vaccination', which corresponds to a therapeutic or preventive procedure to prevent a disease.

We present the BERT-based Exhaustive Neural Named Entity Recognition and Disambiguation (BENNERD) system. The system is composed of four models: an NER model (Sohrab and Miwa, 2018) that enumerates all possible spans as potential entity mentions and classifies them into entity types, the masked language model BERT (Devlin et al., 2019), a candidate generation model that finds a list of candidate entities in the Unified Medical Language System (UMLS) knowledge base (KB) for entity linking (EL), and a candidate ranking model that disambiguates the entity for concept indexing. The BENNERD system provides a web interface to facilitate the process of text annotation and its disambiguation without any training for end users. In addition, we introduce the CORD-NERD (COVID-19 Open Research Dataset for Named Entity Recognition and Disambiguation) dataset, an extended version of CORD-NER, for leveraging the EL task.

2 System Description
¹https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge

Proceedings of the 2020 EMNLP (Systems Demonstrations), pages 182–188, November 16–20, 2020. © 2020 Association for Computational Linguistics

[Figure 1: Workflow of the BENNERD system. The NER model (BERT-based, trained on the labelled CORD-19 corpus with semi-supervised learning) detects mentions in text such as "Angiotensin-converting enzyme 2 ( ACE2 ) as a SARS-CoV-2 receptor"; the entity linking system then generates candidates from the UMLS knowledge base (via SimString and a mention encoder) and ranks them by cosine similarity, e.g., CUI C0960880 (Angiotensin-converting enzyme 2, semantic type T116) and CUI C1422064 (ACE2 Gene, semantic type T028).]

The main objective of this work is to support research on the recent COVID-19 pandemic. To facilitate COVID-19 studies, we introduce the BENNERD system, which finds nested named entities and links them to a UMLS knowledge base (KB). BENNERD mainly comprises two platforms: a web interface and a back-end server. The overall workflow of the BENNERD system is illustrated in Figure 1.

2.1 BENNERD Web Interface

The user interface of our BENNERD system is a web application with an input panel, a load-a-sample tab, an annotation tab, a gear box tab, and .TXT and .ANN tabs. Figure 2 shows the users' input interface of BENNERD. For a given text from a user, or a sample text loaded from the sample list, the annotation tab shows the annotations on the text based on the best NER- and EL-based trained model. Figure 3 shows an example of text annotation based on BENNERD's NER model. Different colors represent different entity types, and when the cursor floats over a coloured box representing an entity above the text, the corresponding concept unique identifier (CUI) in the UMLS is shown. Figure 3 also shows an example where the entity mention SARS-CoV-2 links to its corresponding CUI. Users can save the machine-readable text in txt format and the annotation files in ann format, where the ann annotation file provides standoff annotation output in brat (Stenetorp et al., 2012)² format.

²https://brat.nlplab.org

[Figure 2: BENNERD users' input interface.]

2.1.1 Data Flow of Web Interface

We provide a quick inside look at our BENNERD web interface (BWI). The data flow of BWI is presented as follows:

Server-side initialization (a) The BWI configuration, concept embeddings, and NER and EL models are loaded (b) GENIA sentence splitter and BERT basic tokenizer instances are initialized (c) Concept embeddings are indexed by Faiss (Johnson et al., 2019)

When a text is submitted (a) The text is split into sentences and tokens (b) Token and sentence standoffs are identified (c) The NER model is run on the tokenized sentences (d) The EL model is run on the result (e) The identified token spans are translated into text standoffs (f) The identified concepts' names are looked up in the UMLS database (g) A brat document is created (h) The brat document is translated into JSON and sent to the client side (i) The brat visualizer renders the document

2.2 BENNERD Back-end

The BENNERD back-end implements a pipeline of tools (e.g., NER, EL), following the data flow described in Section 2.1.1. This section provides implementation details of our back-end modules for NER and EL.

2.2.1 Neural Named Entity Recognition

We build the mention detection, a.k.a. NER, based on the BERT model (Devlin et al., 2019). The BERT layer assigns a vector v_{i,j} to the j-th subword of the i-th word. Then, we generate the vector embedding v_i for each word x_i by computing the unweighted average of its subword embeddings v_{i,j}. We generate mention candidates based on the same idea as the span-based model (Lee et al., 2017; Sohrab and Miwa, 2018; Sohrab et al., 2019a,b), in which all continuous word sequences are generated given a maximal size L_x (span width). The representation x_{b,e} ∈ R^{d_x} for the span from the b-th word to the e-th word in a sentence is calculated from the embeddings of the first word, the last word, and the weighted average of all words in the span as follows:

    x_{b,e} = [v_b; Σ_{i=b}^{e} α_{b,e,i} v_i; v_e],    (1)

where α_{b,e,i} denotes the attention value of the i-th word in the span from the b-th word to the e-th word, and [·; ·; ·] denotes concatenation.

[Figure 3: Entity annotation and linking with BENNERD.]

2.2.2 Entity Linking

In our EL component, for every mention span x_{b,e} of a concept in a document, we are supposed to identify its ID in the target KB.³ Let us call the ID a concept unique identifier (CUI). The input is all predicted mention spans M = {m_1, m_2, ..., m_n}, where m_i denotes the i-th mention and n denotes the total number of predicted mentions. The list of entity mentions {m_i}_{i=1,...,n} needs to be mapped to a list of corresponding CUIs {c_i}_{i=1,...,n}. We decompose EL into two subtasks: candidate generation and candidate ranking.

Candidate Generation To find a list of candidate entities in the KB to link with a given mention, we build a candidate generation layer adapting a dual-encoder model (Gillick et al., 2019). Instead of normalizing entity definitions to disambiguate entities, we simply normalize the semantic types on both the mention and entity sides from the UMLS. The representation of a mention m in a document with semantic type t_m can be denoted as:

    v_m = [w_m; t_m],    (2)

where t_m ∈ R^{d_{t_m}} is the mention type embedding. For the entity (concept) side with semantic type
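The ann files described in Section 2.1 use the brat standoff format (Stenetorp et al., 2012). As a rough illustration only, suppose example.txt contains the Figure 1 sentence "Angiotensin-converting enzyme 2 ( ACE2 ) as a SARS-CoV-2 receptor"; the accompanying example.ann could then look like the following, where the entity-type labels and the normalization lines are hypothetical rather than BENNERD's actual output:

```
T1	Gene 0 31	Angiotensin-converting enzyme 2
T2	Gene 34 38	ACE2
N1	Reference T1 UMLS:C0960880	Angiotensin-converting enzyme 2
N2	Reference T2 UMLS:C1422064	ACE2 Gene
```

Each T line carries an entity type and character-offset standoffs into the text, and each N line records the link from a mention to a KB identifier, here a UMLS CUI.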
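The "when a text is submitted" steps (a)-(h) of Section 2.1.1 can be sketched end to end. This is a minimal stand-in, not the actual back-end: sentence splitting and token standoffs use naive string operations in place of the GENIA splitter and BERT tokenizer, `ner` and `link` are placeholder callables for the real models, and the JSON layout only approximates what the brat visualizer consumes.

```python
import json

def process_text(text, ner, link):
    """Sketch of the submit pipeline. `ner` maps a token list to
    (start_tok, end_tok, type) triples; `link` maps a mention string to a CUI."""
    # (a) split the text into sentences and tokens (naive stand-ins)
    sentences = [s for s in text.split('. ') if s]
    entities, normalizations = [], []
    for sent in sentences:
        offset = text.find(sent)
        tokens = sent.split()
        # (b) identify token standoffs (character offsets) in the original text
        standoffs, pos = [], 0
        for tok in tokens:
            start = sent.find(tok, pos)
            standoffs.append((offset + start, offset + start + len(tok)))
            pos = start + len(tok)
        # (c) run NER on the tokenized sentence, (d) run EL on each mention
        for b, e, etype in ner(tokens):
            # (e) translate token spans into text standoffs
            start, end = standoffs[b][0], standoffs[e][1]
            cui = link(text[start:end])  # (f) look up the concept
            # (g) collect brat-style entity and normalization records
            tid = "T%d" % (len(entities) + 1)
            entities.append([tid, etype, [[start, end]]])
            normalizations.append(["N%d" % (len(normalizations) + 1),
                                   "Reference", tid, cui])
    # (h) serialize for the client-side brat visualizer
    return json.dumps({"text": text, "entities": entities,
                       "normalizations": normalizations})
```

A toy NER that tags the first token and a dummy linker are enough to exercise the flow; the real system plugs the trained models into the same two slots.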
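Equation (1) of Section 2.2.1 can be made concrete with a small NumPy sketch. Here `scores` stands in for the learned attention scoring (a trained layer in the real model), and `enumerate_spans` mirrors the exhaustive span generation up to the maximal width L_x; both names are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def span_representation(v, scores, b, e):
    """Eq. (1): x_{b,e} = [v_b; sum_i alpha_{b,e,i} v_i; v_e] for the span v[b..e].

    v:      (n_words, d) word embeddings (subword-averaged BERT vectors)
    scores: (n_words,) unnormalized attention scores for the words
    """
    alpha = softmax(scores[b:e + 1])              # alpha_{b,e,i}, normalized over the span
    head = (alpha[:, None] * v[b:e + 1]).sum(axis=0)  # attention-weighted average
    return np.concatenate([v[b], head, v[e]])     # concatenation [.; .; .]

def enumerate_spans(n_words, max_width):
    """All continuous word sequences up to the maximal size L_x."""
    return [(b, e) for b in range(n_words)
                   for e in range(b, min(b + max_width, n_words))]
```

The resulting span vector has dimension 3d: first-word embedding, weighted average, and last-word embedding concatenated, matching the three blocks of Eq. (1).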
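The candidate generation step (Eq. (2) plus the cosine-similarity ranking shown in Figure 1) can be sketched as follows. The real system indexes concept embeddings with Faiss; a brute-force NumPy search over a toy concept table stands in here, and all embeddings, dimensions, and the helper names are illustrative:

```python
import numpy as np

def mention_representation(w_m, t_m):
    """Eq. (2): v_m = [w_m; t_m] -- mention embedding concatenated with the
    embedding of its UMLS semantic type."""
    return np.concatenate([w_m, t_m])

def top_k_candidates(v_m, concept_matrix, cuis, k=3):
    """Rank KB concepts by cosine similarity to the mention representation
    (brute-force stand-in for the Faiss index lookup)."""
    sims = concept_matrix @ v_m / (
        np.linalg.norm(concept_matrix, axis=1) * np.linalg.norm(v_m) + 1e-12)
    order = np.argsort(-sims)[:k]
    return [(cuis[i], float(sims[i])) for i in order]
```

The top-k list produced here is what the candidate ranking model then disambiguates into a single CUI per mention.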