Syntactically Aware Neural Architectures for Definition Extraction
Luis Espinosa-Anke and Steven Schockaert
School of Computer Science and Informatics, Cardiff University
{espinosa-ankel, [email protected]}

Abstract

Automatically identifying definitional knowledge in text corpora (Definition Extraction or DE) is an important task with direct applications in, among others, Automatic Glossary Generation, Taxonomy Learning, Question Answering and Semantic Search. It is generally cast as a binary classification problem between definitional and non-definitional sentences. In this paper we present a set of neural architectures combining Convolutional and Recurrent Neural Networks, which are further enriched by incorporating linguistic information via syntactic dependencies. Our experimental results in the task of sentence classification, on two benchmarking DE datasets (one generic, one domain-specific), show that these models obtain consistent state-of-the-art results. Furthermore, we demonstrate that models trained on clean Wikipedia-like definitions can successfully be applied to noisier domain-specific corpora.

1 Introduction

Dictionaries and glossaries are among the most important sources of meaning for humankind. Compiling, updating and translating them has traditionally been left mostly to domain experts and professional lexicographers. However, the last two decades have witnessed a growing interest in automating the construction of lexicographic resources.

Analogously, in Natural Language Processing (NLP), lexicographic resources have proven useful for a myriad of tasks, for example Word Sense Disambiguation (Banerjee and Pedersen, 2002; Navigli and Velardi, 2005; Agirre and Soroa, 2009; Camacho-Collados et al., 2015), Taxonomy Learning (Velardi et al., 2013; Espinosa-Anke et al., 2016b) or Information Extraction (Richardson et al., 1998; Delli Bovi et al., 2015). Moreover, lexicographic information such as definitions constitutes the cornerstone of important language resources for NLP, such as WordNet (Miller et al., 1990), BabelNet (Navigli and Ponzetto, 2012), Wikidata (Vrandečić and Krötzsch, 2014) and basically any Wikipedia-derived resource.

In this context, systems able to address the problem of Definition Extraction (DE), i.e., identifying definitional information in free text, are of great value both for computational lexicography and for NLP. In the early days of DE, rule-based approaches leveraged linguistic cues observed in definitional data (Rebeyrolle and Tanguy, 2000; Klavans and Muresan, 2001; Malaisé et al., 2004; Saggion and Gaizauskas, 2004; Storrer and Wellinghoff, 2006). However, in order to deal with problems like language dependence and domain specificity, machine learning was incorporated in more recent contributions (Del Gaudio et al., 2013), which focused on encoding informative lexico-syntactic patterns in feature vectors (Cui et al., 2005; Fahmi and Bouma, 2006; Westerhout and Monachesi, 2007; Borg et al., 2009), both in supervised and semi-supervised settings (Reiplinger et al., 2012; Faralli and Navigli, 2013).

On the other hand, while encoding definitional information using deep learning techniques has been addressed in the past (Hill et al., 2015; Noraset et al., 2016), to the best of our knowledge no previous work has tackled the problem of DE by reconciling both the linguistic lessons learned in the past decades (e.g., the importance of lexico-syntactic patterns or long-distance relations between definiendum and definiens)[1] and the processing potential of neural networks.

[1] Traditionally, the definiendum is the term being defined, whereas the definiens refers to its differentiable characteristics.

Thus, we propose to bridge this gap by learning high-level features over candidate definitions via convolutional filters, and then applying recurrent neural networks to learn long-term dependencies over these feature maps. Without preprocessing and only taking pretrained embeddings as input, it is already possible to consistently obtain state-of-the-art results on two benchmarking datasets for DE (one generic, one domain-specific). Further improvements over this simple model are obtained by incorporating syntactic information, namely by composing and embedding head-modifier syntactic dependencies and dependency labels. One interesting side result of our experiments is the observation that a model trained only on canonical Wikipedia-like definitions performs significantly better in a domain-specific academic setting than a model that has been trained on that domain, which somewhat contradicts previously assumed notions about the creativity of academic authors when presenting and describing novel terminology.[2]

[2] Code available at bitbucket.org/luisespinosa/neural_de

2 Method

The impact of deep learning methods in NLP is today indisputable. The utilization of neural networks has improved the state of the art almost systematically in a wide range of tasks, from language modeling (Bengio et al., 2003; Yih et al., 2011; Mikolov et al., 2013) to text classification (Kim, 2014) or machine translation (Bahdanau et al., 2014), among many others.

In this paper we leverage two of the most popular architectures in deep learning for NLP with the goal of predicting, given an input sentence, its probability of including definitional knowledge. In our best performing model we take advantage of Convolutional Neural Networks (CNNs) to learn local features via convolved filters (LeCun et al., 1998), and then apply a Bidirectional Long Short Term Memory (BLSTM) network (Hochreiter and Schmidhuber, 1997) to the learned feature maps. In this way, we aim at capturing ngram-wise features (Zhou et al., 2015), which may be strong indicators of definitional patterns (e.g., the classic "X is a Y" pattern), combined with the learning of long-term sequential dependencies over these learned feature maps.

Following standard notation for sentence modeling via CNNs (Kim, 2014), we let x_i ∈ R^k be the k-dimensional word vector associated with the i-th word in an input sentence S. As pretrained embeddings we use the word2vec (Mikolov et al., 2013) vectors trained with negative sampling on the Google News corpus.[3] Each sentence is represented as an n × k matrix S, where n is the size of the longest sentence in the corpus (using padding where necessary). The convolution layer applies a filter w_j ∈ R^{(h+1)k} to each ngram window of h+1 tokens. Specifically, writing x_{i:i+h} for the concatenation of the word vectors x_i, x_{i+1}, ..., x_{i+h}, we have:

    c^i_j = f(w_j · x_{i:i+h} + b_j)

where b_j ∈ R is a bias term and f is the ReLU activation function (Nair and Hinton, 2010). In total, we use 100 such convolutional features, i.e., we use the vector c^i = (c^i_1, c^i_2, ..., c^i_100) to encode the i-th ngram. We empirically set the length h+1 of each ngram to 3. To reduce the size of the representation, we then use a max pooling layer with a pool size of 4. Let us write d^i = (d^i_1, d^i_2, ..., d^i_100), where d^i_j = max(c^i_j, c^{i+1}_j, c^{i+2}_j, c^{i+3}_j). The input sentence S is then represented as the sequence d^1, d^5, d^9, ..., d^{n-3}, which is used as the input to a bidirectional LSTM (BLSTM) layer. Finally, the output vectors of the final states for both directions of this BLSTM are connected to a single neuron with a sigmoid activation function. In all the experiments reported in this paper, we classify a sentence as definitional when the output of this neuron yields a value of at least 0.5.

[3] code.google.com/archive/p/word2vec/

2.1 Incorporating Syntactic Information

The role of syntax has been extensively studied for improving semantic modeling of domain terminologies. Examples where syntactic cues are leveraged include medical acronym expansion (Pustejovsky et al., 2001), hyponym-hypernym extraction and detection (Hearst, 1992; Shwartz et al., 2016), and definition extraction, either from the web (Saggion and Gaizauskas, 2004), from scholarly articles (Reiplinger et al., 2012), or, more recently, from Wikipedia-like definitions (Boella et al., 2014).

However, the interplay between syntactic information and the generalization potential of neural networks remains unexplored in definition modeling, although intuitively it seems reasonable to assume that a syntax-informed architecture should have more tools at its disposal for discriminating between definitional and non-definitional knowledge. As an example of the importance of syntax in encyclopedic definitions, among the definitions contained in the WCL definition corpus (see Section 3.1), 71% of them include the lexico-syntactic pattern noun --is--> noun.

Figure 1: Architecture of our proposed definition extraction model. Input may be either simple pretrained embeddings or syntactically enriched representations (separated by the dotted line).

To explore the potential of syntactic information, we represent dependency-based phrases by embedding them in the same vector space as the pretrained word embeddings introduced above. This approach draws from previous work on modeling phrases by composing their parts and the relations that link them (Socher et al., 2011, 2013, 2014).

Specifically, let S_d be the list of head-modifier relations obtained by parsing[4] sentence S. Each relation r in S_d is a head-modifier tuple <h, m, l>, where l denotes the dependency label of the relation (e.g., nsubj). We represent such a relation as the vector r = (h + m)/2, with h and m the vector representations of words h and m, respectively. This setting for composing first-order head-modifier relations is similar to the one proposed in Dyer et al.

3 Evaluation

3.1 Evaluation data

WCL: The WCL (World-Class Lattices) dataset (Navigli et al., 2010) consists of manually anno-
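The convolution and pooling stages of the model described in Section 2 can be sketched in NumPy. This is a minimal forward-pass illustration only: the filter weights and sentence matrix below are random stand-ins for learned parameters and pretrained word2vec rows, and the BLSTM and sigmoid stages are omitted.

```python
import numpy as np

def conv_features(S, W, b, h=2):
    """Compute c^i_j = f(w_j . x_{i:i+h} + b_j) with f = ReLU.
    S: (n, k) sentence matrix; W: (100, (h+1)*k) filters; b: (100,) biases.
    Returns C of shape (n - h, 100): one 100-dim vector per ngram window."""
    n, k = S.shape
    # x_{i:i+h}: concatenation of h+1 consecutive word vectors.
    ngrams = np.stack([S[i:i + h + 1].ravel() for i in range(n - h)])
    return np.maximum(ngrams @ W.T + b, 0.0)  # ReLU activation

def max_pool(C, pool=4):
    """d^i_j = max(c^i_j, ..., c^{i+pool-1}_j), taken with stride = pool size."""
    m = (C.shape[0] // pool) * pool  # drop the ragged tail, as padding would avoid
    return C[:m].reshape(-1, pool, C.shape[1]).max(axis=1)

rng = np.random.default_rng(0)
n, k = 20, 300                      # padded sentence length, embedding dimension
S = rng.normal(size=(n, k))         # stand-in for pretrained embedding rows
W = rng.normal(size=(100, 3 * k))   # 100 filters over trigrams (h + 1 = 3)
b = np.zeros(100)

C = conv_features(S, W, b)          # (18, 100): one vector per trigram
D = max_pool(C)                     # (4, 100): the sequence d^1, d^5, ... fed to the BLSTM
print(C.shape, D.shape)
```

With n = 20 and trigram windows there are 18 ngram positions, and pooling with size and stride 4 leaves a sequence of 4 pooled vectors, each still 100-dimensional (one component per filter).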
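The head-modifier composition of Section 2.1, r = (h + m)/2, is equally simple to sketch. The tiny embedding table and the parse triples below are invented for illustration; a real pipeline would take the vectors from the pretrained word2vec space and the (head, modifier, label) tuples from a dependency parser.

```python
import numpy as np

def relation_vector(emb, head, modifier):
    """Embed a head-modifier relation as the average of its two word vectors."""
    return 0.5 * (emb[head] + emb[modifier])

# Toy 4-dimensional "pretrained" embeddings (illustrative values only).
emb = {
    "oxygen":  np.array([1.0, 0.0, 2.0, 0.0]),
    "is":      np.array([0.0, 1.0, 0.0, 1.0]),
    "element": np.array([2.0, 1.0, 0.0, 3.0]),
}

# Hypothetical S_d for "oxygen is an element": (head, modifier, label) tuples.
S_d = [("is", "oxygen", "nsubj"), ("is", "element", "attr")]

vectors = {label: relation_vector(emb, h, m) for h, m, label in S_d}
print(vectors["nsubj"])
```

Each relation thus lives in the same vector space as the word embeddings themselves, which is what allows these composed representations to be fed to the network alongside (or instead of) plain word vectors.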