The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

HAMNER: Headword Amplified Multi-Span Distantly Supervised Method for Domain Specific Named Entity Recognition∗

Shifeng Liu,¹ Yifang Sun,¹ Bing Li,¹ Wei Wang,¹,² Xiang Zhao³
¹University of New South Wales, ²Dongguan University of Technology, ³National University of Defence Technology
{shifeng.liu, bing.li}@unsw.edu.au, {yifangs, weiw}@cse.unsw.edu.au, [email protected]

Abstract

To tackle Named Entity Recognition (NER) tasks, supervised methods need to obtain sufficient cleanly annotated data, which is labor and time consuming. On the contrary, distantly supervised methods acquire automatically annotated data using dictionaries to alleviate this requirement. Unfortunately, dictionaries hinder the effectiveness of distantly supervised methods for NER due to their limited coverage, especially in specific domains. In this paper, we aim at the limitations of the dictionary usage and mention boundary detection. We generalize the distant supervision by extending the dictionary with headword based non-exact matching. We apply a function to better weight the matched entity mentions. We propose a span-level model, which classifies all the possible spans and then infers the selected spans with a proposed dynamic programming algorithm. Experiments on three benchmark datasets demonstrate that our method outperforms previous state-of-the-art distantly supervised methods.

1 Introduction

Named entity recognition (NER) is a task that extracts entity mentions from sentences and classifies them into pre-defined types, such as person, location, disease, chemical, etc. It is a vital task in natural language processing (NLP), which benefits many downstream applications including relation extraction (Mintz et al. 2009), event extraction (Nguyen and Grishman 2018), and co-reference resolution (Chang, Samdani, and Roth 2013).

With a sufficient amount of cleanly annotated texts (i.e., the training data), supervised methods (Lample et al. 2016; Ma and Hovy 2016) have shown their ability to achieve high-quality performance in general domain NER tasks and benchmarks. However, obtaining cleanly annotated texts is labor-intensive and time-consuming, especially in specific domains, such as the biomedical domain and the technical domain. This hinders the usage of supervised methods in real-world applications.

Distantly supervised methods circumvent the above issue by generating pseudo annotations according to domain specific dictionaries. A dictionary is a collection of ⟨entity mention, entity type⟩ pairs. Distantly supervised methods firstly find entity mentions by exact string matching (Giannakopoulos et al. 2017; Shang et al. 2018b) or regular expression matching (Fries et al. 2017) with the dictionary, and then assign the corresponding types to the mentions. A model is trained on the training corpus with the pseudo annotations. As a result, distant supervision significantly reduces the annotation cost, while, not surprisingly, the accuracy (e.g., precision and recall) drops.

In this paper, we aim to reduce the gap between distantly supervised methods and supervised methods. We observed two limitations in distantly supervised methods. The first limitation is that the information in the dictionary has not been fully mined and used. For example, consider a newly discovered disease named ROSAH syndrome, which is unlikely to exist in the dictionaries, and hence cannot be correctly extracted and annotated if we use simple surface matching. However, human beings can easily recognize it as a disease, since there are many disease entity mentions in the dictionaries that end with syndrome. This motivates us to use headwords of entity mentions as indicators of entity types, and thus improve the quality of the pseudo annotations.

The second limitation in distantly supervised methods is that most of the errors come from incorrect boundaries.¹ Most methods (including supervised methods) model the NER problem as a sequence labeling task and use popular architectures/classifiers like BiLSTM-CRF (Ma and Hovy 2016). However, CRFs suffer from sparse boundary tags (Li, Ye, and Shang 2019), and pseudo annotations can only be more sparse and noisy. In addition, CRFs focus more on word-level information and thus cannot make full use of span-level information (Zhuo et al. 2016). Some methods choose to fix the entity boundaries before predicting the entity type. Apparently, any incorrect boundary will definitely lead to incorrect output, no matter how accurate the subsequent classifier is. Therefore, we propose to decide entity boundaries after predicting entity types. As such, there is more information, such as the types and confidence scores of entity mentions, which can help to determine more accurate entity boundaries.

∗ S. Liu and W. Wang are the co-corresponding authors.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
¹ For example, even the state-of-the-art distantly supervised method (Shang et al. 2018b) has at least 40% of its errors coming from incorrect boundaries, on all three benchmarks that are evaluated in this paper.

Based on the above two ideas, we propose a new distantly supervised method named HAMNER (Headword Amplified Multi-span NER) for NER tasks in specific domains. We first introduce a novel dictionary extension method based on the semantic matching of headwords. To account for the possible noise introduced by the extended entity mentions, we also use the similarity between the headwords of the extended entity mentions and the original entity mentions to represent the quality of the extended entity mentions. The extended dictionary will be used to generate pseudo annotations. We then train a model to estimate the type of a given span from a sentence based on its contextual information. Given a sentence, HAMNER uses the trained model to predict the types of all the possible spans subject to a pre-defined maximum number of words, and uses a dynamic programming based inference algorithm to select the most proper boundaries and types of entity mentions while suppressing overlapping and spurious entity mentions.

The main contributions of this paper are:
• We generalize the distant supervision idea for NER by extending the dictionaries using semantic matching of headwords. Our extension is grounded in linguistic and distributional semantics. We use the extended dictionary to improve the quality of the pseudo annotations.
• We propose a span-level model with both span information and contextual information to predict the type for a given span. We propose a dynamic programming inference algorithm to select the spans which are the most likely to be entity mentions.
• Experiments on three benchmark datasets have demonstrated that HAMNER achieves the best performance with dictionaries only and no human efforts. Detailed analysis has shown the effectiveness of our designed method.

2 Related Work

Named entity recognition (NER) attracts researchers and has been tackled by both supervised and semi-supervised methods. Supervised methods, including feature based methods (Ratinov and Roth 2009; Liu et al. 2018; Sun et al. 2019) and neural network based methods (Lample et al. 2016; Ma and Hovy 2016), require cleanly annotated texts. Semi-supervised methods either utilize more unlabeled data (Peters et al. 2018) or generate annotated data gradually (Tomanek and Hahn 2009; Han, Kwoh, and Kim 2016; Brooke, Hammond, and Baldwin 2016).

Distant supervision was proposed to alleviate human efforts in data annotation (Mintz et al. 2009; Wallace et al. 2016). Mintz et al. (2009) propose distant supervision to handle the limitation of human annotated data for the relation extraction task. They use Freebase (Bollacker et al. 2008), instead of humans, to generate training data with heuristic matching rules. For each pair of entity mentions with some Freebase relations, sentences with these entity mentions are regarded as sentences with such relations. Beyond the heuristic matching rules, Wallace et al. (2016) learn a matching function with a small amount of manually annotated data.

Recently, distant supervision has been explored for NER tasks (Fries et al. 2017; Giannakopoulos et al. 2017; Shang et al. 2018b). SwellShark (Fries et al. 2017) utilizes a collection of dictionaries, ontologies, and, optionally, heuristic rules to generate annotations and predict entity mentions in the biomedical domain without human annotated data. Shang et al. (2018b) use exact string matching to generate pseudo annotated data, and apply high-quality phrases (i.e., mined in the same domain without assigning any entity types) to reduce the number of false negative annotations. However, their annotation quality is limited by the coverage of the dictionary, which leads to a relatively low recall. In the technical domain, Distant-LSTM-CRF (Giannakopoulos et al. 2017) applies syntactic rules and pruned high-quality phrases to generate pseudo annotated data for distant supervision. The major differences between HAMNER and other distantly supervised methods are: 1. HAMNER extends the coverage of dictionaries without human efforts. 2. HAMNER makes predictions at the span level with both entity type and boundary information.

3 Problem Definition

We represent a sentence as a word sequence (x_1, x_2, ..., x_N). For a span (x_i, ..., x_j) from the sentence, we use ⟨i, j⟩ to denote its boundaries, and use l ∈ L to denote its type, where L represents the list of pre-defined types (e.g., Disease, Chemical) and the none type (i.e., None). We let None be the last element in L (i.e., L_|L|).

We tackle the problem with distant supervision. Unlike supervised and semi-supervised methods, we require no human annotated training data. Instead, we only require a dictionary as the input in addition to the raw text. The dictionary is a collection of ⟨entity mention, entity type⟩ tuples. We use the dictionary in the training phase to help generate pseudo annotations on the training corpus.

We argue that dictionaries are easy to obtain, either from publicly available resources, such as Freebase (Bollacker et al. 2008) and SNOMED Clinical Terms², or by crawling terms from some domain-specific high-quality websites³.

² http://www.snomed.org
³ E.g., https://www.computerhope.com/jargon.htm for computer terms, and https://www.medilexicon.com/abbreviations for medical terms.
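For concreteness, the objects defined above can be sketched with a few Python type aliases; the names below are illustrative only and are not part of HAMNER:

```python
from typing import List, Set, Tuple

Sentence = List[str]                     # a word sequence (x_1, ..., x_N)
TYPES = ["Disease", "Chemical", "None"]  # an example type list L; None is the last element
Span = Tuple[int, int]                   # span boundaries <i, j> into the sentence
Dictionary = Set[Tuple[str, str]]        # a collection of (entity mention, entity type) pairs

example_dictionary: Dictionary = {
    ("liver cancer", "Disease"),
    ("aspirin", "Chemical"),
}
```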

4 The Proposed Method

Figure 1 illustrates the key steps in HAMNER. In training, our method firstly generates pseudo annotations according to a headword-based extended dictionary (details in Section 4.1). Then a neural model is trained using the pseudo annotated data. The model takes a span and its context in the sentence as input, and predicts the type of the span. The structure of the neural model is introduced in Section 4.2. Prediction is performed in two phases. Given a sentence, in phase one, we generate all the possible spans whose length is no more than a specified threshold, and use the trained neural model to predict the types of these spans. In phase two, we apply a dynamic programming based inference algorithm to determine the entity mentions and their types. The details are presented in Section 4.3.

Figure 1: The process of the proposed method. Red arrows show the training steps, including dictionary extension, pseudo annotation generation, and model training. Blue arrows show the prediction phases, including span generation, model prediction, and dynamic programming (DP) based inference.

Unlike previous works (Peters et al. 2018; Giannakopoulos et al. 2017), we solve NER at the span level, where information has not been fully explored. HAMNER is based on all the possible spans up to a certain length, instead of only considering those based on the prediction of a model (Shang et al. 2018b). While this is an effective solution for purely supervised methods in semantic role labeling (SRL) (Ouchi, Shindo, and Matsumoto 2018) and nested/non-overlapping NER (Sohrab and Miwa 2018; Xia et al. 2019), we are the first to apply this to distantly supervised NER. Specifically, we generate all the spans containing up to M words as candidate spans, where M is a pre-defined parameter. Therefore, a sentence with N words will generate M(2N − M + 1)/2 spans, which is still linear in the length of the sentence.

4.1 Generating Pseudo Annotations

In this section, we aim to improve the quality of the pseudo annotations. While most of the existing distantly supervised methods use domain specific dictionaries to annotate texts, we find that they struggle because of the following two cases.

• Out-of-dictionary entity mentions. It is common that the dictionaries are not frequently updated. However, new entities and concepts are generated every day. Therefore, the coverage of a dictionary is generally not high.

• Synonyms and spelling differences. Most dictionaries may not contain both terms in a pair of synonyms, and they usually stick with one spelling form (e.g., British or American spelling). Both of these cases can enrich the annotations and thus should not be ignored.

Therefore, we propose to extend the dictionary by adding a set of domain specific high-quality phrases. The high-quality phrases can be obtained in multiple ways. For example, we can use an unsupervised phrase mining method (Shang et al. 2018b; Li et al. 2017; 2018; 2019) on a large corpus in the same domain, or collect unclassified terms from resources such as mediLexicon⁴.

⁴ www.medilexicon.com

However, these phrases do not have types. To assign types to these high-quality phrases, we propose to make use of headwords. It is well-known that headwords of entity mentions play an important role in information extraction tasks (Surdeanu et al. 2003). In a similar vein, the headword of a noun phrase is usually highly indicative of the type of the noun phrase. For example, appendix cancer shares the same headword cancer with liver cancer; hence, if we know that the type of the former is Disease, then we can infer that the type of the latter is probably the same.

We also consider non-exact string matching on the headwords to deal with synonyms and spelling differences. Our idea is based on the hypothesis that similar words tend to occur in similar contexts (Harris 1954) and hence may have a high probability of belonging to the same type. Thanks to word embedding methods, we can measure such distributional similarity based on the cosine similarity between the corresponding word embeddings. For example, since the word embeddings of tumor and tumour are close, their types should be the same.

Algorithm 1: Dictionary Extension
Input: Dictionary D, high-quality phrase set P, headword frequency threshold τ1, headword similarity threshold τ2
Output: Extended dictionary D_ext
1   H ← set of headwords that occur more than τ1 times in D;
2   D_ext ← ∅;
3   for each p ∈ P do
       /* s: the max cosine similarity; T: the set of types */
4      s ← 0;
5      T ← ∅;
6      for each ⟨e_i, t_i⟩ ∈ D do
7         if (hw(e_i) ∈ H) ∧ (sim(hw(p), hw(e_i)) ≥ τ2) then
8            if sim(hw(p), hw(e_i)) > s then
9               s ← sim(hw(p), hw(e_i));
10              T ← {t_i};
11           else if sim(hw(p), hw(e_i)) = s then
12              T ← T ∪ {t_i};
13     for each t ∈ T do
14        D_ext ← D_ext ∪ {(p, t, s)};
15  for each ⟨e_i, t_i⟩ ∈ D do
16     D_ext ← D_ext ∪ {(e_i, t_i, 1)};
17  return D_ext

Algorithm 1 presents the procedure of dictionary extension in HAMNER, where hw(x) is the headword of the phrase/entity x, and sim(x, y) is the cosine similarity between the word embeddings of x and y. We have noticed that while the word embedding based non-exact string matching improves the coverage of the dictionary, it also brings some noise. Therefore, we use τ1 to prune infrequent headwords (i.e., Line 1), and use τ2 to avoid matching with dissimilar headwords (i.e., Line 7). We also only keep the types with the highest cosine similarity (i.e., Lines 8-12).

Through Algorithm 1 we obtain a large set of entity mentions, along with their types and the cosine similarities. We then annotate the unlabeled data via exact string matching of these entity mentions.
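A minimal Python sketch of Algorithm 1 is given below. The headword() helper implements the simple last-word / last-word-before-preposition heuristic described in Section 5.1, and cos_sim is assumed to be a user-supplied cosine similarity over pre-trained word embeddings; both names and the preposition list are illustrative, not the exact implementation used in the paper.

```python
from collections import Counter

PREPOSITIONS = {"of", "in", "on", "for", "with"}  # illustrative, not an exhaustive list

def headword(phrase):
    """Headword heuristic: the last word, or the last word before a preposition
    (e.g., 'cancer of the liver' -> 'cancer')."""
    tokens = phrase.lower().split()
    for k, tok in enumerate(tokens):
        if tok in PREPOSITIONS and k > 0:
            return tokens[k - 1]
    return tokens[-1]

def extend_dictionary(dictionary, phrases, cos_sim, tau1, tau2):
    """Algorithm 1 sketch.
    dictionary: iterable of (mention, type) pairs; phrases: unlabeled high-quality phrases;
    cos_sim(x, y): cosine similarity between the word embeddings of headwords x and y."""
    # Line 1: keep only headwords occurring more than tau1 times in the dictionary.
    counts = Counter(headword(m) for m, _ in dictionary)
    frequent = {h for h, c in counts.items() if c > tau1}

    extended = set()
    for p in phrases:
        best_sim, best_types = 0.0, set()
        for mention, etype in dictionary:
            h = headword(mention)
            if h not in frequent:
                continue
            sim = cos_sim(headword(p), h)
            if sim < tau2:                     # Line 7: prune dissimilar headwords
                continue
            if sim > best_sim:                 # Lines 8-10: keep only the best similarity
                best_sim, best_types = sim, {etype}
            elif sim == best_sim:              # Lines 11-12: ties keep all tied types
                best_types.add(etype)
        for t in best_types:                   # Lines 13-14: add weighted extended entries
            extended.add((p, t, best_sim))

    # Lines 15-16: original dictionary entries keep similarity 1.
    extended.update((m, t, 1.0) for m, t in dictionary)
    return extended
```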

We follow the existing work that favors the longest matching entity mentions; hence we sort the entity mentions in the dictionary by their length, and start matching the spans in the sentences greedily from the longest one. We do not allow the annotations to be nested or overlapping.

But we do allow one span to have multiple types, as two entities of different types may share the same entity mention in the dictionary.

In addition to the types, we also assign a weight to each annotation. Assume that a span is matched by an entity mention in the extended dictionary with pre-defined type l and the corresponding cosine similarity s. While we set annotation weight 1 for entity mentions themselves or their headwords appearing in the original dictionary, we use a proposed sigmoid function to control the noise introduced by non-exact matched entity mentions. Thus, we define the annotation weight w_l for type l as follows:

    w_l = sigmoid(θ_1 · s + θ_2), if s < 1;    w_l = 1, if s = 1.    (1)

where θ_1 and θ_2 are hyper-parameters. The annotation weight can be interpreted as a confidence score of the annotation and will be used in the neural model in Section 4.2.

For each span, we use ⟨i, j⟩ to indicate its boundaries in the sentence, and hence it will be represented as (⟨i, j⟩, w_{L_1}, ..., w_{L_|L|}). If a span is annotated with pre-defined types, then we use Equation (1) to compute the weights for the corresponding annotated types, while setting the weights of the remaining types to 0. Otherwise, only w_{L_|L|} (i.e., the weight of the None type) is set to 1 while the remaining weights are set to 0. These two kinds of spans will serve as positive and negative samples during training.
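A small sketch of the annotation weight in Equation (1), assuming the hyper-parameters θ1 and θ2 have been chosen (e.g., on the development set):

```python
import math

def annotation_weight(s, theta1, theta2):
    """Equation (1): exact (in-dictionary) matches with s = 1 receive full weight;
    headword-extended matches are down-weighted by a sigmoid of their cosine similarity s."""
    if s >= 1.0:
        return 1.0
    return 1.0 / (1.0 + math.exp(-(theta1 * s + theta2)))
```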
4.2 Modelling Type Distribution of Spans

We design a novel neural network model to model the type distribution of given spans, i.e., given any span ⟨i, j⟩, the model will predict p(l | ⟨i, j⟩) for all types l ∈ L. Our model exploits typographical, semantic, and contextual features, and employs a new loss function to deal with multiple weighted annotated types. The architecture of our proposed neural model is shown in Figure 2.

Figure 2: Overview of the neural model. The example shows the span (i.e., granulosa cell tumors) and the corresponding context. This span is annotated as Disease during the preprocessing mentioned in Section 4.1. Words sos (start of a sentence) and eos (end of a sentence) are added to each sentence to construct contexts for spans at the start and at the end of a sentence.

Word Representation. To extract morphological information, such as prefixes and suffixes of a word, we use a CNN to encode the character representation following (Ma and Hovy 2016). We also apply pre-trained ELMo (Peters et al. 2018) for the contextualized word representation. We then concatenate the pre-trained word embedding, the ELMo word representation, and the character representation for each word, and feed them into the Bidirectional LSTM layer (Lample et al. 2016). The outputs of both directions are concatenated together and form the word representations [h_1, h_2, ..., h_n].

Context-Sensitive Span Representation. Given a span c = ⟨i, j⟩, it separates the sentence into three parts, whose corresponding hidden representations from the BiLSTM layer are: (1) the left context of the span H_lc = [h_1, ..., h_{i−1}]; (2) the inner content of the span H_c = [h_i, ..., h_j]; and (3) the right context of the span H_rc = [h_{j+1}, ..., h_n].

To obtain a fixed representation for any H_z (z ∈ {lc, c, rc}), we adopt the self-attention mechanism (Lin et al. 2017). Specifically, we calculate the attention vector a_z over the variable-length input H_z, and then obtain the weighted hidden state m_z as m_z = H_z a_z, where a_z = Softmax(w_z tanh(W H_z)). Finally, we obtain the final representation m by concatenating m_lc, m_c, and m_rc together.

Loss Function. Given the final span representation m, we multiply it with the type embedding l of each l ∈ L, and obtain the probability distribution over all types via the Softmax function:

    p(l | m) = exp(l^T m) / Σ_{l̃ ∈ L} exp(l̃^T m)    (2)

The type embeddings l will be part of the parameters of the model. As one span can be annotated with multiple pre-defined types l in L with the corresponding weights w_l, we modify the cross-entropy loss to take these into consideration. The loss function J is defined as

    J = − Σ_{c ∈ C} Σ_{l ∈ L} w_l · log p(l | m),

where C indicates all the candidate spans, and w_l indicates the annotation weight of type l mentioned in Section 4.1.
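The span scoring and weighted loss can be sketched in PyTorch as follows. This is a simplified illustration (a single attention module is shared across the three parts, and the character-CNN/ELMo/BiLSTM encoder is assumed to have already produced the hidden states H); it is not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentivePool(nn.Module):
    """Self-attentive pooling: a = Softmax(w tanh(W H)), m = H a (Lin et al. 2017)."""
    def __init__(self, hidden_dim, attn_dim):
        super().__init__()
        self.W = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.w = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, H):                                    # H: (length, hidden_dim)
        a = F.softmax(self.w(torch.tanh(self.W(H))), dim=0)  # attention over positions
        return (H * a).sum(dim=0)                            # weighted hidden state m_z

class SpanTypeScorer(nn.Module):
    """Concatenates pooled left-context, content, and right-context representations,
    then scores them against learned type embeddings (Equation (2))."""
    def __init__(self, hidden_dim, attn_dim, num_types):
        super().__init__()
        self.pool = AttentivePool(hidden_dim, attn_dim)   # shared across lc / c / rc (simplification)
        self.type_emb = nn.Parameter(torch.randn(num_types, 3 * hidden_dim))

    def forward(self, H, i, j):
        # H: BiLSTM outputs for the whole sentence, shape (n, hidden_dim); the sos/eos
        # tokens guarantee non-empty left and right contexts for every span <i, j>.
        m = torch.cat([self.pool(H[:i]), self.pool(H[i:j + 1]), self.pool(H[j + 1:])])
        return F.log_softmax(self.type_emb @ m, dim=0)    # log p(l | m)

def span_loss(log_probs, weights):
    """Per-span term of J: -sum_l w_l * log p(l | m); summed over all candidate spans."""
    return -(weights * log_probs).sum()
```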

4.3 Inference

Our neural network based model provides probability estimates for all the candidate spans and all the types, and we need to perform two tasks in the inference process: (1) find a non-overlapping span partition of the sentence, and (2) assign a type to each span. The second step is straightforward, as we just assign the type with the largest probability predicted by the model. Hence, we mainly focus on the first step below.

We model this problem as an optimization problem, where the goal is to find a non-overlapping span partition such that the joint probability of each span being the None type is minimized. The intuition is that we want to encourage the algorithm to find more entity mentions of types other than None, hence increasing the recall of the method. More specifically, our objective function is:

    arg min_V Σ_{c ∈ V} log p(None | c),

where each V is a partitioning of the sentence into a set of disjoint and complete spans [c_1, c_2, ...], and the minimization is over all possible partitionings.

Algorithm 2: DP Based Inference
Input: Candidate set U, sentence length N
Output: Sentence S with labeled spans
1  r[0] ← 0;
2  for j = 1 to N do
3     r[j] ← min_{1≤i≤j} (r[i − 1] + log p(None | ⟨i, j⟩));
4  V ← construct_partitioning(r);
5  S ← label_spans(V);
6  return S

A naive optimization algorithm is to enumerate all possible partitionings and compute the objective function. This is extremely costly, as there are O(2^N) such partitionings (N is the length of the sentence). In contrast, we propose a dynamic programming algorithm, whose time complexity is only O(NM) (M is the maximal number of words in a span), to find the optimal partitioning. As shown in Algorithm 2, where the candidate set U stores all spans and the predicted probabilities of the types, we use r[j] to store the minimum log probability of the first j tokens in the sentence. We sequentially update r[j] as j increases from 1 to N. Note that p(None | ⟨i, j⟩) can be computed in our model by computing the corresponding m for the span ⟨i, j⟩ and applying Equation (2). Once we have obtained r[N], we backtrack and reconstruct the optimal partitioning (i.e., construct_partitioning), and assign each span the type with the highest probability (i.e., label_spans).
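A compact Python rendering of Algorithm 2 is shown below, assuming the model's log p(None | ⟨i, j⟩) has been pre-computed for every candidate span of length at most M (1-based, inclusive boundaries); the container name none_logprob is illustrative.

```python
import math

def dp_inference(none_logprob, N, M):
    """Algorithm 2 sketch: find the non-overlapping, complete partition of tokens 1..N
    that minimises the summed log-probability of the None type."""
    r = [0.0] * (N + 1)        # r[j]: best objective for the first j tokens
    back = [0] * (N + 1)       # back[j]: start index of the last span in the best partition
    for j in range(1, N + 1):
        r[j] = math.inf
        for i in range(max(1, j - M + 1), j + 1):   # spans are at most M words long
            cand = r[i - 1] + none_logprob[(i, j)]
            if cand < r[j]:
                r[j], back[j] = cand, i
    spans, j = [], N                                # backtrack (construct_partitioning)
    while j > 0:
        i = back[j]
        spans.append((i, j))
        j = i - 1
    return list(reversed(spans))   # each span is then labeled with its highest-probability type
```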

5 Experiments

In this section, we evaluate our method and compare it with other supervised and distantly supervised methods on three benchmark datasets. In addition, we investigate the effectiveness of the designed method with a detailed analysis.

5.1 Experiment Setting

Datasets. An overview of the datasets and dictionaries is shown in Table 1.

Table 1: Dataset overview
              | BC5CDR            | NCBI-Disease | LaptopReview
Domain        | Biomedical        | Biomedical   | Technical Review
Entity Types  | Disease, Chemical | Disease      | AspectTerm
Dictionary    | MeSH + CTD        | MeSH + CTD   | Computer Terms

The BC5CDR dataset (Li et al. 2016) consists of 1,500 PubMed articles, which have been separated into a training set (500), development set (500), and test set (500). The dataset contains 15,935 Chemical and 12,852 Disease mentions.

The NCBI-Disease dataset (Dogan, Leaman, and Lu 2014) consists of 793 PubMed abstracts, which have been separated into a training set (593), development set (100), and test set (100). The dataset contains 6,881 Disease mentions.

The LaptopReview dataset (Pontiki et al. 2014) refers to Subtask 1 for laptop aspect term (e.g., disk drive) recognition. It consists of 3,845 review sentences, which contain 3,012 AspectTerm mentions. Following previous work (Giannakopoulos et al. 2017), we separated them into a training set (2,445), development set (609), and test set (800).

Dictionary and High-Quality Phrases. To fairly compare with previous methods, we construct and process dictionaries and high-quality phrases in the same way as in (Shang et al. 2018b). For the BC5CDR and NCBI-Disease datasets, we combine the MeSH database⁵ and the CTD Chemical and Disease vocabularies⁶ as the dictionary. The phrases are mined from titles and abstracts of PubMed papers using the phrase mining method proposed in (Shang et al. 2018a). For the LaptopReview dataset, we crawl dictionary terms from a public website⁷. The phrases are mined from the Amazon laptop review dataset⁸ using the same phrase mining method. As suggested in (Shang et al. 2018b), we apply tailored dictionaries to reduce false positive matching.

⁵ https://www.nlm.nih.gov/mesh/download mesh.html
⁶ http://ctdbase.org/downloads/
⁷ https://www.computerhope.com/jargon.htm
⁸ http://times.cs.uiuc.edu/~wang296/Data/

Headword. We use the rule based method proposed in (Zhou et al. 2005) to extract the headwords of phrases. The headword of a phrase is generally the last word of the phrase. If there is a preposition in the phrase, the headword is the last word before the preposition. For example, the headword of the phrase cancer of the liver is cancer.

Metric. Following the standard setting, we evaluate the methods using the micro-averaged F1 score, and also report the precision (Pre) and recall (Rec) in percentage. All the reported scores are averaged over five different runs.

5.2 Implementation

We use the same word embeddings for the neural model and for the cosine similarity mentioned in Section 4.1. For the BC5CDR and NCBI-Disease datasets, we use pre-trained 200-dimensional word embeddings (Moen and Ananiadou 2013) and ELMo⁹ trained on a PubMed corpus.

⁹ https://allennlp.org/elmo

Table 2: NER performance comparison. The bold-faced scores represent the best results among distantly supervised methods.

Method           | Human Effort other than Dictionary | BC5CDR (Pre / Rec / F1) | NCBI-Disease (Pre / Rec / F1) | LaptopReview (Pre / Rec / F1)
Supervised model | Gold Annotations           | 88.17 / 88.39 / 88.28 | 85.34 / 90.94 / 88.05 | 85.14 / 80.58 / 82.80
SwellShark       | Regex Design + Case Tuning | 86.11 / 82.39 / 84.21 | 81.6 / 80.1 / 80.8    | - / - / -
SwellShark       | Regex Design               | 84.98 / 83.49 / 84.23 | 64.7 / 69.7 / 67.1    | - / - / -
Dictionary Match | None                       | 93.93 / 58.35 / 71.98 | 90.59 / 56.15 / 69.32 | 90.68 / 44.65 / 59.84
AutoNER*         | None                       | 83.08 / 82.16 / 82.70 | 76.98 / 74.65 / 75.78 | 68.72 / 59.39 / 63.70
HAMNER           | None                       | 86.01 / 86.34 / 86.17 | 82.03 / 83.56 / 82.79 | 74.02 / 62.02 / 67.46

For the LaptopReview dataset, we use pre-trained 100-dimensional GloVe word embeddings (Pennington, Socher, and Manning 2014) and ELMo trained on a news corpus. We select the hyper-parameters that achieve the best performance on the development sets.

5.3 Compared Methods

We compare HAMNER with methods from three different categories: supervised models, distantly supervised methods with human effort, and distantly supervised methods without human effort.

Supervised model. We use a supervised model to demonstrate the competitiveness of HAMNER. We use the state-of-the-art model architecture proposed in (Peters et al. 2018) with ELMo trained in the corresponding domains.

SwellShark (Fries et al. 2017) is a distantly supervised method designed for the biomedical domain, especially for the BC5CDR and NCBI-Disease datasets. It requires human efforts to customize regular expression rules and hand-tune candidates.

AutoNER (Shang et al. 2018b) is the previous state-of-the-art distantly supervised method without human efforts on all the datasets. Similar to our method, it only requires domain specific dictionaries and a set of high-quality phrases. To make a fair comparison with HAMNER, we have re-implemented AutoNER to add ELMo as part of the input representation, which brings a performance improvement. AutoNER is originally trained with all the raw texts from the dataset (i.e., the training corpus is the union of the training set, development set, and test set). To make a fair comparison with all other methods, we have re-trained AutoNER with raw texts only from the training set. This brings a lower performance than that in the original paper. Therefore, we use AutoNER* to denote AutoNER with ELMo trained on the training set only, and report the evaluation results of AutoNER* in our paper. The readers are referred to (Shang et al. 2018b) for the original performance of AutoNER. Nevertheless, we have tried to train AutoNER with ELMo using all the raw texts, and the performance is still lower than HAMNER.

We also include a purely dictionary matching method. We directly apply the original dictionaries to the test sets and get entity mentions and their types by exact string matching.

5.4 Overall Performance

Table 2 shows the performance of different methods on all the datasets. HAMNER achieves the best recall and F1 among all the distantly supervised methods. It surpasses SwellShark by around 2% F1 score on the BC5CDR and NCBI-Disease datasets, even though SwellShark is specially designed for the biomedical domain and needs human efforts.

The purely dictionary matching method achieves the highest precision among all the methods, which is not surprising as dictionaries always contain accurate entity mentions. However, it suffers from low recall, which is due to the low coverage of the dictionary and the strict matching rule.

On the contrary, the success of HAMNER owes to its ability to generalize to out-of-dictionary entity mentions without losing prediction accuracy. As a result, HAMNER achieves a significant improvement over the previous state-of-the-art method (i.e., AutoNER*) on both precision and recall. The F1 scores are improved by 3.47%, 7.01%, and 3.76% on the three datasets, respectively. Detailed analysis is shown in the following sections.

5.5 Effectiveness of Dictionary Extension

We explore the effectiveness of the dictionary extension proposed in Section 4.1. We use precision and recall to evaluate the accuracy and the coverage of annotations on the training set. As shown in Table 4, using the extended dictionary boosts the recall (coverage) by a large margin (e.g., on the BC5CDR and NCBI-Disease datasets, the recall increases by more than 20%), while slightly reducing the precision (which is inevitable due to the additional noise introduced by the extended dictionary terms).

The increase of recall on the LaptopReview dataset is not as significant as that on the other two datasets. This is due to the impact of the similarity threshold mentioned in Section 4.1. On the LaptopReview dataset, we use a higher similarity threshold to avoid introducing too much noise into the extended dictionary, which brings less improvement to the coverage of the extended dictionary.

We also investigate the effectiveness of the weight function (i.e., Equation (1)) in the loss function. We assign all annotations weight 1 to eliminate the effect of the weight function (denoted HAMNER without weight), and show the results in Table 5. It can be seen that the weight function helps HAMNER achieve better precision (e.g., 0.55% and 1.4% improvement on the BC5CDR and NCBI-Disease datasets, respectively) with a slight decrease of recall.

The reason why we have not observed the improvement on the LaptopReview dataset is that we set a higher similarity threshold. We observe the same trend as on the other two datasets when we set a lower similarity threshold.

Table 3: Effectiveness of each component. We do not apply annotation weights on AutoNER* as it does not support weighted annotations.

Component                                  | BC5CDR (Pre / Rec / F1) | NCBI-Disease (Pre / Rec / F1) | LaptopReview (Pre / Rec / F1)
Neural Model                               | 83.28 / 83.80 / 83.54 | 79.55 / 76.56 / 78.02 | 69.74 / 57.16 / 62.78
Neural Model + Dictionary Extension        | 83.04 / 85.07 / 84.04 | 80.13 / 81.81 / 80.95 | 69.39 / 59.42 / 64.00
Neural Model + ELMo                        | 85.98 / 85.95 / 85.97 | 79.91 / 80.25 / 80.08 | 71.78 / 59.82 / 65.22
Neural Model + Dictionary Extension + ELMo | 86.01 / 86.34 / 86.17 | 82.03 / 83.56 / 82.79 | 74.02 / 62.02 / 67.46
AutoNER* + Dictionary Extension            | 81.53 / 83.03 / 82.28 | 80.12 / 83.02 / 81.54 | 69.61 / 59.45 / 64.11

Table 4: Annotation quality on the training set.

                    | BC5CDR (Pre / Rec) | NCBI-Disease (Pre / Rec) | LaptopReview (Pre / Rec)
Dictionary          | 94.92 / 72.23 | 93.88 / 60.56 | 89.01 / 51.65
Extended Dictionary | 91.89 / 84.48 | 92.61 / 83.01 | 87.73 / 54.51

Table 5: Effectiveness of weighted annotations.

Method                 | BC5CDR (Pre / Rec / F1) | NCBI-Disease (Pre / Rec / F1) | LaptopReview (Pre / Rec / F1)
HAMNER with weight     | 86.01 / 86.34 / 86.17 | 82.03 / 83.56 / 82.79 | 74.02 / 62.02 / 67.46
HAMNER without weight  | 85.46 / 86.78 / 86.12 | 80.63 / 83.94 / 82.24 | 74.02 / 62.02 / 67.46

Table 6: Comparison of F1 scores on in-dictionary (ID) and out-of-dictionary (OOD) entity mentions.

                          | BC5CDR (ID / OOD) | NCBI-Disease (ID / OOD) | LaptopReview (ID / OOD)
Number of entity mentions | 5734 / 3991 | 539 / 416 | 292 / 362
F1, HAMNER                | 93.53 / 76.05 | 91.96 / 70.32 | 92.86 / 45.51
F1, AutoNER*              | 92.80 / 66.41 | 91.88 / 54.05 | 89.23 / 37.64

5.6 Effectiveness of Model

To demonstrate the effectiveness of each component in the neural model, we conduct experiments by disabling different components of the model on all the datasets. From the results in Table 3, we observe that each component improves the model from a different aspect.

It is worth mentioning that even without using ELMo or the extended dictionaries, the neural model still shows competitive performance compared with AutoNER*, which extracts mention boundaries and then predicts the types. This reveals that span-level modeling and the inference algorithm are more suitable for the NER task than the pipeline design.

Dictionary extension improves the coverage of the pseudo annotations, and hence boosts the coverage of predictions (i.e., recall). In addition, dictionary extension does not bring in many false-positive predictions, and hence does not hurt precision. For example, on the NCBI-Disease dataset, dictionary extension improves the precision and recall by 0.58% and 5.25%, respectively. On the other hand, ELMo focuses on the accuracy of predictions. For example, the precision increases from 83.28% to 85.98% on the BC5CDR dataset. ELMo also encourages the model to extract more entity mentions.

With both components, HAMNER achieves the best overall performance. It seems that the improvements from dictionary extension and ELMo are additive, which implies that dictionary extension and ELMo improve the model from different aspects.

In order to show the effectiveness of dictionary extension on other models, as well as to show the effectiveness of our proposed framework, we apply the extended dictionaries in AutoNER*. We observe that the performance of AutoNER* improves with the extended dictionaries; however, it is still much lower than our proposed method HAMNER.

5.7 Generalization on OOD Entity Mentions

We perform an analysis on the test set for the out-of-dictionary entity mentions to investigate the generalization of HAMNER. Specifically, we partition entity mentions into two subsets: in-dictionary (ID) entity mentions and out-of-dictionary (OOD) entity mentions. An entity mention is considered an ID mention if it fully or partially appears in the dictionary, and an OOD mention otherwise.

Table 6 shows the performance of HAMNER and AutoNER* on the ID and OOD entity mentions. HAMNER surpasses AutoNER* on the ID entity mentions, and it also shows significant improvement on the OOD entity mentions over all the datasets. For example, on the NCBI-Disease dataset, it boosts the F1 score by at least 16% on the OOD entity mentions. This explains why HAMNER has a higher overall performance and also demonstrates the stronger generalization ability of HAMNER.

6 Conclusion

In this paper, we presented a new method to tackle NER tasks in specific domains using distant supervision. Our method exploits several ideas, including headword-based dictionary extension, a span-level neural model, and dynamic programming inference.

Experiments show that HAMNER significantly outperforms the previous state-of-the-art methods.

Acknowledgement

This work is supported by ARC DPs 170103710 and 180103411, D2DCRC DC25002 and DC25003, NSFC under grant No. 61872446, and PNSF of Hunan under grant No. 2019JJ20024. The Titan V used for this research was donated by the NVIDIA Corporation.

References

Bollacker, K. D.; Evans, C.; Paritosh, P.; Sturge, T.; and Taylor, J. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD Conference, 1247–1250. ACM.
Brooke, J.; Hammond, A.; and Baldwin, T. 2016. Bootstrapped text-level named entity recognition for literature. In ACL (2). The Association for Computer Linguistics.
Chang, K.; Samdani, R.; and Roth, D. 2013. A constrained latent variable model for coreference resolution. In EMNLP.
Dogan, R. I.; Leaman, R.; and Lu, Z. 2014. NCBI disease corpus: A resource for disease name recognition and concept normalization. Journal of Biomedical Informatics 47:1–10.
Fries, J. A.; Wu, S.; Ratner, A.; and Ré, C. 2017. SwellShark: A generative model for biomedical named entity recognition without labeled data. CoRR abs/1704.06360.
Giannakopoulos, A.; Musat, C.; Hossmann, A.; and Baeriswyl, M. 2017. Unsupervised aspect term extraction with B-LSTM & CRF using automatically labelled datasets. In WASSA@EMNLP.
Han, X.; Kwoh, C. K.; and Kim, J. 2016. Clustering based active learning for biomedical named entity recognition. In IJCNN, 1253–1260. IEEE.
Harris, Z. S. 1954. Distributional structure. Word 10(2-3):146–162.
Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; and Dyer, C. 2016. Neural architectures for named entity recognition. In HLT-NAACL.
Li, J.; Sun, Y.; Johnson, R. J.; Sciaky, D.; Wei, C.; Leaman, R.; Davis, A. P.; Mattingly, C. J.; Wiegers, T. C.; and Lu, Z. 2016. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database 2016.
Li, B.; Yang, X.; Wang, B.; and Cui, W. 2017. Efficiently mining high quality phrases from texts. In AAAI.
Li, B.; Yang, X.; Wang, B.; Wang, W.; Cui, W.; and Zhang, X. 2018. An adaptive hierarchical compositional model for phrase embedding. In IJCAI, 4144–4151. ijcai.org.
Li, B.; Yang, X.; Zhou, R.; Wang, B.; Liu, C.; and Zhang, Y. 2019. An efficient method for high quality and cohesive topical phrase mining. IEEE Trans. Knowl. Data Eng. 31(1).
Li, J.; Ye, D.; and Shang, S. 2019. Adversarial transfer for named entity boundary detection with pointer networks. In IJCAI, 5053–5059. ijcai.org.
Lin, Z.; Feng, M.; dos Santos, C. N.; Yu, M.; Xiang, B.; Zhou, B.; and Bengio, Y. 2017. A structured self-attentive sentence embedding. In ICLR. OpenReview.net.
Liu, S.; Sun, Y.; Wang, W.; and Zhou, X. 2018. A CRF-based stacking model with meta-features for named entity recognition. In PAKDD (2), Lecture Notes in Computer Science.
Ma, X., and Hovy, E. H. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In ACL (1).
Mintz, M.; Bills, S.; Snow, R.; and Jurafsky, D. 2009. Distant supervision for relation extraction without labeled data. In ACL/IJCNLP. The Association for Computer Linguistics.
Moen, S., and Ananiadou, T. S. S. 2013. Distributional semantics resources for biomedical text processing. In Proc. of Languages in Biology and Medicine.
Nguyen, T. H., and Grishman, R. 2018. Graph convolutional networks with argument-aware pooling for event detection. In AAAI, 5900–5907. AAAI Press.
Ouchi, H.; Shindo, H.; and Matsumoto, Y. 2018. A span selection model for semantic role labeling. In EMNLP.
Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In EMNLP.
Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In NAACL-HLT.
Pontiki, M.; Galanis, D.; Pavlopoulos, J.; Papageorgiou, H.; Androutsopoulos, I.; and Manandhar, S. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In SemEval@COLING.
Ratinov, L., and Roth, D. 2009. Design challenges and misconceptions in named entity recognition. In CoNLL.
Shang, J.; Liu, J.; Jiang, M.; Ren, X.; Voss, C. R.; and Han, J. 2018a. Automated phrase mining from massive text corpora. IEEE Trans. Knowl. Data Eng. 30(10).
Shang, J.; Liu, L.; Gu, X.; Ren, X.; Ren, T.; and Han, J. 2018b. Learning named entity tagger using domain-specific dictionary. In EMNLP.
Sohrab, M. G., and Miwa, M. 2018. Deep exhaustive model for nested named entity recognition. In EMNLP.
Sun, Y.; Liu, S.; Wang, Y.; and Wang, W. 2019. Extracting definitions and hypernyms with a two-phase framework. In DASFAA (3), Lecture Notes in Computer Science.
Surdeanu, M.; Harabagiu, S. M.; Williams, J.; and Aarseth, P. 2003. Using predicate-argument structures for information extraction. In ACL.
Tomanek, K., and Hahn, U. 2009. Semi-supervised active learning for sequence labeling. In ACL/IJCNLP.
Wallace, B. C.; Kuiper, J.; Sharma, A.; Zhu, M. B.; and Marshall, I. J. 2016. Extracting PICO sentences from clinical trial reports using supervised distant supervision. Journal of Machine Learning Research 17.
Xia, C.; Zhang, C.; Yang, T.; Li, Y.; Du, N.; Wu, X.; Fan, W.; Ma, F.; and Yu, P. S. 2019. Multi-grained named entity recognition. In ACL (1).
Zhou, G.; Su, J.; Zhang, J.; and Zhang, M. 2005. Exploring various knowledge in relation extraction. In ACL.
Zhuo, J.; Cao, Y.; Zhu, J.; Zhang, B.; and Nie, Z. 2016. Segment-level sequence modeling using gated recursive semi-markov conditional random fields. In ACL (1).
