Concept Extraction Using Pointer-Generator Networks and Distant Supervision for Data Augmentation

Alexander Shvets¹ [0000-0002-8370-2109] and Leo Wanner¹,² [0000-0002-9446-3748]

¹ NLP Group, Pompeu Fabra University, Roc Boronat, 138, Barcelona, Spain
{alexander.shvets, [email protected]}
² Catalan Institute for Research and Advanced Studies (ICREA)

Abstract. Concept extraction is crucial for a number of downstream applications. However, surprisingly enough, straightforward single token/nominal chunk-concept alignment or dictionary lookup techniques such as DBpedia Spotlight still prevail. We propose a generic, open-domain-oriented extractive model that is based on distant supervision of a pointer-generator network leveraging bidirectional LSTMs and a copy mechanism, and that is able to cope with the out-of-vocabulary phenomenon. The model has been trained on a large annotated corpus compiled specifically for this task from 250K Wikipedia pages, and tested on regular pages, where the pointers to other pages are considered as ground-truth concepts. The outcome of the experiments shows that our model significantly outperforms standard techniques and, when used on top of DBpedia Spotlight, further improves its performance. The experiments furthermore show that the model can be readily ported to other datasets, on which it equally achieves state-of-the-art performance.

Keywords: Open-domain discourse texts · Concept extraction · Pointer-generator neural network · Distant supervision

1 Introduction

In knowledge discovery and representation, the notion of concept is most often used to refer to sense, i.e., 'abstract entity' or 'abstract object' in the Fregean dichotomy of sense vs. reference [10]. In Natural Language Processing (NLP), the task of detecting surface forms of concepts, namely Concept Extraction (CE), deals with the identification of the language side of the concept coin, i.e., Frege's reference. Halliday [16] offers a syntactic interpretation of reference: in his terminology, it is a "classifying nominal group". For instance, renewable energy or nuclear energy are classifying nominal groups: they denote a class (or type) of energy, while, e.g., cheap energy or affordable energy are not: they do not typify, but rather qualify energy (and are thus "qualifying nominal groups").

CE is crucial for a number of downstream applications, including, e.g., language understanding, ontology population, semantic search, and question answering; it is also the key to entity linking [22]. In generic, open-domain, subject-neutral discourse across different (potentially unrelated) subjects, indexing the longest possible nominal chunks and their head words located in sequences of tokens between specified "break words" [34], as well as dictionary lookups such as DBpedia Spotlight [6] and WAT [28], are very common techniques. They generally reach outstanding precision, but low recall, due to the constant evolution of the language vocabulary. Advanced deep learning models, which already dominate CE in specialized, closed-domain discourse on one or a limited range of related subjects, e.g., biomedical discourse [14, 33], and which are also standard in keyphrase extraction [2, 25], are an alternative. However, such models need a tremendous amount of labeled data for training.

We present an operational CE model that utilizes pointer-generator networks with bidirectional long short-term memory (LSTM) units [12, 30] to retrieve concepts from general discourse textual material.³ Furthermore, since for a generic, open-domain concept extraction model we need a sufficiently large training corpus that covers a vast variety of topics, and no such annotated corpora are available, we opt for distant supervision to create a sufficiently large and diverse dataset. Distant supervision consists in the automatic labeling of potentially useful data by an easy-to-handle (not necessarily accurate) algorithm to obtain an annotation that is likely to be noisy but, at the same time, contains enough information to train a robust model [26]. Two labeling schemes are considered; a sketch of the general idea is given below.

³ We adopt Halliday's notion of classifying nominal group as the definition of a concept.
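To illustrate what such a distantly supervised labeling step could look like, the following is a minimal, hypothetical sketch, not the paper's actual schemes: Wikipedia anchor texts are treated as ground-truth concepts and projected onto sentences to yield noisy source/target pairs for a sequence-to-sequence extractor. The `<sep>` separator token and the naive string matching are illustrative assumptions.

```python
# Minimal, hypothetical sketch of distant supervision for CE: treat
# Wikipedia anchor texts as ground-truth concepts and derive noisy
# (source sentence, target concept sequence) pairs for a seq2seq model.
# The <sep> token and the naive matching are illustrative assumptions.
from typing import List, Tuple

def make_training_pair(sentence: str, anchors: List[str]) -> Tuple[str, str]:
    """Return (source, target), where the target lists the anchors
    occurring in the sentence, joined by a separator token."""
    found = [a for a in anchors if a.lower() in sentence.lower()]
    return sentence, " <sep> ".join(found)

src, tgt = make_training_pair(
    "Solar power is a form of renewable energy used worldwide.",
    ["solar power", "renewable energy", "nuclear energy"],
)
print(tgt)  # solar power <sep> renewable energy
```

Labels produced this way are noisy (substring matches may be spurious, and unlinked concepts stay unlabeled), but at Wikipedia scale they carry enough signal to train a robust model.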
Experiments carried out on a dataset of 250K+ Wikipedia pages show that copies of our model, trained differently and joined in an ensemble, significantly outperform standard techniques and, when used on top of DBpedia Spotlight, further improve its performance by nearly 10%.

2 Related work

In this section, we focus on the review of generic discourse CE; for a comprehensive review of the large body of work on specialized discourse CE, and, in particular, on biomedical CE, see, e.g., [15]. We also do not discuss recent advances in keyphrase extraction [2], because their applicability to generic concept extraction is limited, due to the specificity of the task.

The traditional CE techniques interpret any single- or multiple-token nominal chunk as a concept [34] or do a dictionary lookup, as, e.g., DBpedia Spotlight [6], which matches and links identified nominal chunks with DBpedia entries (6.6M entities, 13 billion RDF triples)⁴, based on the Apache OpenNLP⁵ models for phrase chunking and named entity recognition (NER). Given the large coverage of DBpedia, the performance of DBpedia Spotlight is rather competitive. However, the presence of an entry obviously cannot always be ensured. Consider, e.g., a paper title "Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing", where DBpedia Spotlight does not detect "Bloom embeddings" or "incremental parsing", as there are no such entries in DBpedia.

⁴ https://wiki.dbpedia.org/develop/datasets/dbpedia-version-2016-10
⁵ https://opennlp.apache.org/
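As a concrete reference point, DBpedia Spotlight can be queried through its public REST web service; the sketch below follows the endpoint URL and JSON fields as publicly documented, with the confidence threshold being an arbitrary illustrative choice.

```python
# Query the public DBpedia Spotlight REST endpoint for annotations.
# The endpoint URL and JSON fields follow Spotlight's public web
# service; the confidence threshold of 0.5 is an arbitrary choice.
import requests

def spotlight_annotate(text: str, confidence: float = 0.5):
    resp = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()
    # Each resource carries the matched surface form and its DBpedia URI
    return [(r["@surfaceForm"], r["@URI"])
            for r in resp.json().get("Resources", [])]

print(spotlight_annotate(
    "Natural language understanding with Bloom embeddings, "
    "convolutional neural networks and incremental parsing"))
```

Since "Bloom embeddings" and "incremental parsing" have no DBpedia entries, no threshold tuning will make the lookup return them; this is precisely the recall gap that an extractive model targets.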
Like DBpedia Spotlight, AIDA [35] relies on an RDF repository, YAGO2. WAT and its predecessor TagMe [28] use a repository of possible spots made of wiki-anchors, titles, and redirect pages. Both TagMe and WAT rely on statistical attributes called link probability and commonness; WAT furthermore draws on a set of statistics to prune a set of mentions using an SVM classifier. Wikifier [4] focuses on relation extraction, relying on a NER that uses gazetteers extracted from Wikipedia and simple regular expressions to combine several mentions into a single one. All of them are used for state-of-the-art entity linking and (potentially nested) entity mention detection and typing [17, 36]. FRED [11] also focuses on the extraction of relations between entities, with frames [9] as the underlying theoretical constructs. Unlike Wikifier and FRED, OLLIE [24] does not rely on any precompiled repository. It outperforms its strong predecessor REVERB [8] in relation extraction by expanding the set of possible relations and by including contextual information from the sentence from which the relations are extracted.

A number of works focus on the recognition of named entities, which are the most prominent concept type. NER models work at the sentence level and aim at labeling all occurring instances. Among them, Lample et al. [20] provide a state-of-the-art NER model that avoids the traditionally heavy use of hand-crafted features and domain-specific knowledge. The model is based on bidirectional LSTMs and Conditional Random Fields (CRFs) that rely on two sources of information on words: character-based word representations learned from an annotated corpus, and unsupervised word representations learned from unannotated corpora. Another promising approach to NER is the fine-tuning of a language representation model such as, e.g., BERT [7]. The pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, including NER, without substantial task-specific architecture modifications.

Pointer-generator networks, which are generally applied in summarization contexts, have also been experimented with in information extraction; cf., e.g., [27], where they have been successfully applied to the detection of term definitions in sentences with a specific structure, and to their translation into Description Logics formulae using syntactic transformations. As a matter of fact, this similar task partially motivated our choice of pointer-generator networks for CE.

3 Description of the model

We implement a deep learning model and a large-scale annotation scheme for distant supervision to cope autonomously with dictionary-independent generic CE and to complement state-of-the-art lookup-based approaches in order to increase their recall. In addition, we would like our model to perform reasonably well on pure NER tasks, with a small gap to models specifically tuned for the NER datasets. The model follows the well-established tendency in information extraction adopted for NER and extractive summarization and envisages CE as an attention-based sequence-to-sequence learning problem.

[Fig. 1. The neural architecture for concept extraction]

3.1 Overview of the model

As the basis of our model, we use the pointer-generator network proposed in [30], which aids the creation of summaries with an accurate reproduction of information. In each generation step $t$, the pointer allows for copying words $w_i$ from the source sequence to the target sequence using the distribution of the attention layer $a^t$, while the generator samples tokens from the learned vocabulary distribution $P_{vocab}$, conditioned on a context vector $h^*_t$ produced by the same attention layer, which is built as the attention-weighted sum of the encoder hidden states, $h^*_t = \sum_i a^t_i h_i$.
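To make the interplay of copying and generation explicit, here is a minimal NumPy sketch of the final output distribution of the pointer-generator network of [30], on which our model builds: the vocabulary distribution and the attention-based copy distribution are mixed by a generation probability $p_{gen}$ (computed in [30] from the context vector, decoder state, and decoder input, here taken as given); the extended-vocabulary handling of out-of-vocabulary source tokens is omitted for brevity.

```python
# Minimal NumPy sketch of the pointer-generator output distribution of
# See et al. [30]: P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of the
# attention weights over source positions i with w_i = w. OOV handling
# via an extended vocabulary is omitted for brevity.
import numpy as np

def final_distribution(p_vocab: np.ndarray,    # (V,) generator distribution
                       attention: np.ndarray,  # (n,) attention over source tokens
                       src_ids: np.ndarray,    # (n,) vocabulary ids of source tokens
                       p_gen: float) -> np.ndarray:
    p_final = p_gen * p_vocab
    # scatter-add the copy probabilities onto the source tokens' vocab ids
    np.add.at(p_final, src_ids, (1.0 - p_gen) * attention)
    return p_final

p_vocab = np.array([0.7, 0.2, 0.1])   # toy vocabulary of three types
attention = np.array([0.9, 0.1])      # attention over two source tokens
src_ids = np.array([2, 1])            # their vocabulary ids
print(final_distribution(p_vocab, attention, src_ids, p_gen=0.6))
# [0.42 0.16 0.42] -- probability mass shifts onto the attended source token
```

Because the copy term assigns mass to whatever token the attention currently points at, the network can emit source words, including out-of-vocabulary ones, verbatim, which is exactly what concept extraction from open-domain text requires.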
