Capturing Long-range Contextual Dependencies with Memory-enhanced Conditional Random Fields

Fei Liu, Timothy Baldwin, Trevor Cohn
School of Computing and Information Systems, The University of Melbourne, Victoria, Australia
[email protected] [email protected] [email protected]

Abstract

Despite successful applications across a broad range of NLP tasks, conditional random fields ("CRFs"), in particular the linear-chain variant, are only able to model local features. While this has important benefits in terms of inference tractability, it limits the ability of the model to capture long-range dependencies between items. Attempts to extend CRFs to capture long-range dependencies have largely come at the cost of computational complexity and approximate inference. In this work, we propose an extension to CRFs by integrating external memory, taking inspiration from memory networks, thereby allowing CRFs to incorporate information far beyond neighbouring steps. Experiments across two tasks show substantial improvements over strong CRF and LSTM baselines.

1 Introduction

While long-range contextual dependencies are prevalent in natural language, for tractability reasons, most statistical models capture only local features (Finkel et al., 2005). Take the sentence in Figure 1, for example. Here, while it is easy to determine that Interfax in the second sentence is a named entity, it is hard to determine its semantic class, as there is little context information. The usage in the first sentence, on the other hand, can be reliably disambiguated due to the post-modifying phrase news agency. Ideally, we would like to be able to share such contexts across all usages (and variants) of a given named entity for reliable and consistent identification and disambiguation.

A related example is forum thread discourse analysis. Previous work has largely focused on linear-chain Conditional Random Fields (CRFs) (Wang et al., 2011; Zhang et al., 2017), framing the task as one of sequence tagging. Although CRFs are adept at capturing local structure, the problem does not naturally suit a linear sequential structure, i.e., a post may be a reply to either a neighbouring post or one posted far earlier within the same thread. In both cases, contextual dependencies can be long-range, necessitating the ability to capture dependencies between arbitrarily distant items. Identifying this key limitation, Sutton and McCallum (2004) and Finkel et al. (2005) proposed the use of CRFs with skip connections to incorporate long-range dependencies. In both cases the graph structure must be supplied a priori, rather than learned, and both techniques incur the need for costly approximate inference.

Recurrent neural networks (RNNs) have been proposed as an alternative technique for encoding sequential inputs; however, plain RNNs are unable to capture long-range dependencies (Bengio et al., 1994; Hochreiter et al., 2001), and variants such as LSTMs (Hochreiter and Schmidhuber, 1997), although more capable of capturing non-local patterns, still exhibit a significant locality bias in practice (Lai et al., 2015; Linzen et al., 2016).

In this paper, taking inspiration from the work of Weston et al. (2015) on memory networks (MEMNETs), we propose to extend CRFs by integrating external memory mechanisms, thereby enabling the model to look beyond localised features and have access to the entire sequence. This is achieved with attention over every entry in the memory. Experiments on named entity recognition and forum thread discourse analysis, both of which involve long-range contextual dependencies, demonstrate the effectiveness of the proposed model, achieving state-of-the-art performance on the former, and outperforming a number of strong baselines in the case of the latter. A full implementation of the model is available at: https://github.com/liufly/mecrf.

[Figure 1: Interfax/B-ORG news/O agency/O said/O ··· | ··· Interfax/B-ORG quoted/O Russian/B-MISC military/O ···]
Figure 1: A NER example with long-range contextual dependencies. The vertical dashed line indicates a sentence boundary.

The paper is organised as follows: after reviewing previous studies on capturing long-range contextual dependencies and related models in Section 2, we detail the elements of the proposed model in Section 3. Sections 4 and 5 present the experimental results on two different datasets, one for thread discourse structure prediction and the other for named entity recognition (NER), with analyses and visualisation in their respective sections. Lastly, Section 6 concludes the paper.

2 Related Work

In this section, we review the different families of models that are relevant to this work, in capturing long-range contextual dependencies in different ways.

Conditional Random Fields (CRFs). CRFs (Lafferty et al., 2001), in particular linear-chain CRFs, have been widely adopted and applied to sequence labelling tasks in NLP, but have the critical limitation that they only capture local structure (Sutton and McCallum, 2004; Finkel et al., 2005), despite non-local structure being common in structured language classification tasks. In the context of named entity recognition ("NER"), Sutton and McCallum (2004) proposed skip-chain CRFs as a means of alleviating this shortcoming, wherein distant items are connected in a sequence based on a heuristic such as string identity (to achieve label consistency across all instances of the same string). The idea of label consistency and exploiting non-local features has also been explored in the work of Finkel et al. (2005), who take long-range structure into account while maintaining tractable inference with Gibbs sampling (Geman and Geman, 1984), by performing approximate inference over factored probabilistic models. While both of these lines of work report impressive results on such tasks, they come at the price of high computational cost and incompatibility with exact inference.

Similar ideas have also been explored by Krishnan and Manning (2006) for NER, where they apply two CRFs, the first of which makes predictions based on local information, and the second combines named entities identified by the first CRF in a single cluster, thereby enforcing label consistency and enabling the use of a richer set of features to capture non-local dependencies. Liao and Grishman (2010) make a strong case for going beyond sentence boundaries and leveraging document-level information for event extraction.

While we take inspiration from these earlier studies, we do not enforce label consistency as a hard constraint, and additionally do not sacrifice inference tractability: our model is capable of incorporating non-local features, and is compatible with exact inference methods.

Recurrent Neural Networks (RNNs). Recently, the broad adoption of deep learning methods in NLP has given rise to the prevalent use of RNNs. Long short-term memories ("LSTMs": Hochreiter and Schmidhuber (1997)), a particular variant of RNN, have become particularly popular, and have been successfully applied to a large number of tasks: speech recognition (Graves et al., 2013), sequence tagging (Huang et al., 2015), document categorisation (Yang et al., 2016), and machine translation (Cho et al., 2014; Bahdanau et al., 2014). However, as pointed out by Lai et al. (2015) and Linzen et al. (2016), RNNs, including LSTMs, are biased towards immediately preceding (or neighbouring, in the case of bidirectional RNNs) items, and perform poorly in contexts which involve long-range contextual dependencies, despite the inclusion of memory cells. This is further evidenced by the work of Cho et al. (2014), who show that the performance of a basic encoder–decoder deteriorates as the length of the input sentence increases.

Memory networks (MEMNETs). More recently, Weston et al. (2015) proposed memory networks, and showed that the augmentation of memory is crucial to performing inference requiring long-range dependencies, especially when document-level reasoning between multiple supporting facts is required. Of particular interest to our work are so-called "memory hops" in memory networks, which are guided by an attention mechanism based on the relevance between a question and each supporting context sentence in the memory hop. Governed by the attention mechanism, the ability to access the entire sequence is similar to the soft alignment idea proposed by Bahdanau et al. (2014) for neural machine translation. In this work, we borrow the concept of memory hops and integrate it into CRFs, thereby enabling the model to look beyond localised features and have access to the whole sequence via an attention mechanism.

3 Methodology

In the context of sequential tagging, we assume the input is in the form of sequence pairs: D = {x^{(n)}, y^{(n)}}_{n=1}^{N}, where x^{(n)} is the input of the n-th example in dataset D and consists of a sequence {x_1^{(n)}, x_2^{(n)}, ..., x_T^{(n)}}. Similarly, y^{(n)} is of the same length as x^{(n)} and consists of the corresponding labels {y_1^{(n)}, y_2^{(n)}, ..., y_T^{(n)}}. For notational convenience, hereinafter we omit the superscript denoting the n-th example.

In the case of NER, each x_t is a word in a sentence, with y_t being the corresponding NER label. For forum thread discourse analysis, x_t represents the text of an entire post, whereas y_t is the dialogue act label for the t-th post.

The proposed model extends CRFs by integrating external memory and is therefore named a Memory-Enhanced Conditional Random Field ("ME-CRF"). We take inspiration from Memory Networks ("MEMNETs": Weston et al. (2015)) and incorporate so-called memory hops into CRFs, thereby allowing the model to have unrestricted access to the whole sequence rather than localised features as in RNNs (Lai et al., 2015; Linzen et al., 2016).

As illustrated in Figure 2, ME-CRF can be divided into two major parts: (1) the memory layer; and (2) the CRF layer. The memory layer can be further broken down into three main components: (a) the input memory m_{1:t}; (b) the output memory c_{1:t}; and (c) the current input u_t, which represents the current step (also known as the "question" in the context of MEMNET). The input and output memory representations are connected via an attention mechanism whose weights are determined by measuring the similarity between the input memory and the current input. The CRF layer, on the other hand, takes the output of the memory layer as input. In the remainder of this section, we detail the elements of ME-CRF.

3.1 Memory Layer

3.1.1 Input memory

Every element (word/post) in a sequence x is encoded with x_t = Φ(x_t), where Φ(·) can be any encoding function mapping the input x_t into a vector x_t ∈ R^d. This results in the sequence {x_1, ..., x_T}. While this new sequence can be seen as the memory in the context of MEMNETs, one major drawback of this approach, as pointed out by Seo et al. (2017), is the insensitivity to temporal information between memory cells. We therefore follow Xiong et al. (2016) in injecting a temporal signal into the memory using a bi-directional GRU encoding (Cho et al., 2014):

\overrightarrow{m}_t = \overrightarrow{\mathrm{GRU}}(x_t, \overrightarrow{m}_{t-1}) \qquad (1)

\overleftarrow{m}_t = \overleftarrow{\mathrm{GRU}}(x_t, \overleftarrow{m}_{t+1}) \qquad (2)

m_t = \tanh(\overrightarrow{W}_m \overrightarrow{m}_t + \overleftarrow{W}_m \overleftarrow{m}_t + b_m) \qquad (3)

where \overrightarrow{W}_m, \overleftarrow{W}_m and b_m are learnable parameters.

3.1.2 Current input

This is used to represent the current step x_t, be it a word or a post. As in MEMNETs, we want to enforce the current input to be in the same space as the input memory, so that we can determine the attention weight of each element in the memory by measuring the relevance between the two. We denote the current input by u_t = m_t.

3.1.3 Attention

To determine the attention value of each element in the memory, we measure the relevance between the current step u_t and m_i for i ∈ [1, t] with a softmax function:

p_{t,i} = \mathrm{softmax}(u_t^\top m_i) \qquad (4)

where \mathrm{softmax}(a_i) = e^{a_i} / \sum_j e^{a_j}.
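To make the memory layer concrete, the following is a minimal numpy sketch of Equations (1)–(4), assuming a standard GRU cell and toy dimensions. Parameter shapes, initialisation and the gate convention are illustrative choices, not the released implementation at https://github.com/liufly/mecrf.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

class GRUCell:
    """Minimal GRU cell (Cho et al., 2014): h_t = GRU(x_t, h_{t-1})."""
    def __init__(self, d_in, d_hid):
        def p(*shape):
            return 0.1 * rng.standard_normal(shape)
        self.Wz, self.Uz, self.bz = p(d_hid, d_in), p(d_hid, d_hid), np.zeros(d_hid)
        self.Wr, self.Ur, self.br = p(d_hid, d_in), p(d_hid, d_hid), np.zeros(d_hid)
        self.Wh, self.Uh, self.bh = p(d_hid, d_in), p(d_hid, d_hid), np.zeros(d_hid)

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h + self.bz)              # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h + self.br)              # reset gate
        h_tilde = np.tanh(self.Wh @ x + self.Uh @ (r * h) + self.bh)  # candidate state
        return (1.0 - z) * h + z * h_tilde

def encode_memory(X, fwd, bwd, Wf, Wb, b):
    """Eqs. (1)-(3): run a bi-directional GRU over X[t] = Phi(x_t) and combine."""
    T, d_hid = len(X), b.shape[0]
    m_fwd, m_bwd = np.zeros((T, d_hid)), np.zeros((T, d_hid))
    h = np.zeros(d_hid)
    for t in range(T):                    # forward pass, Eq. (1)
        h = fwd.step(X[t], h)
        m_fwd[t] = h
    h = np.zeros(d_hid)
    for t in reversed(range(T)):          # backward pass, Eq. (2)
        h = bwd.step(X[t], h)
        m_bwd[t] = h
    return np.tanh(m_fwd @ Wf.T + m_bwd @ Wb.T + b)   # Eq. (3)

def attention(u_t, M):
    """Eq. (4): p_{t,i} = softmax(u_t . m_i) over the memory entries m_1..m_t."""
    return softmax(M @ u_t)

# toy usage: T inputs of dimension d, memory cells of size h
T, d, h = 5, 8, 6
X = rng.standard_normal((T, d))
M = encode_memory(X, GRUCell(d, h), GRUCell(d, h),
                  0.1 * rng.standard_normal((h, h)),
                  0.1 * rng.standard_normal((h, h)), np.zeros(h))
t = 3
p = attention(M[t], M[: t + 1])           # current input u_t = m_t (Section 3.1.2)
print(p.round(3), p.sum())                # a distribution over positions 1..t
```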

[Figure 2 diagram: at each time step, the current input u_t is compared with the input memory m_{1:t} via inner products and a softmax to produce the attention p_{1:t}; this gives a weighted sum over the output memory c_{1:t}, producing o_t, which feeds the CRF layer to predict y_t.]
Figure 2: Illustration of ME-CRFs with a single memory hop, showing the network architecture at time step t and t + 1.

3.1.4 Output memory

Similar to m_t, c_t is the output memory, and is calculated analogously, but with a different set of parameters in the GRUs and tanh layers of Equations (1), (2) and (3). The output memory is used to generate the final output of the memory layer, and is fed as input to the CRF layer.

3.1.5 Memory layer output

Once the attention weights have been computed, the memory access controller receives the response o in the form of a weighted sum over the output memory representations:

o_t = \sum_{i} p_{t,i} c_i \qquad (5)

This allows the model to have unrestricted access to elements in previous steps, as opposed to a single vector h_t in RNNs, thereby enabling ME-CRFs to detect and effectively incorporate long-range dependencies.

3.1.6 Extension

For more challenging tasks requiring complex reasoning capabilities with multiple supporting facts from the memory, the model can be further extended by stacking multiple memory hops, in which case the output of the k-th hop is taken as input to the (k+1)-th hop:

u_t^{k+1} = o_t^{k} + u_t^{k} \qquad (6)

where u_t^{k+1} encodes not only information at the current step (u_t^{k}) but also relevant knowledge from the memory (o_t^{k}). In the scope of this work, we limit the number of hops to 1.

3.2 CRF Layer

Once the representation of the current step u_t^{K+1} is computed, incorporating relevant information from the memory (assuming the total number of memory hops is K), it is fed into a CRF layer:

s(x, y) = \sum_{t=0}^{T} A_{y_t, y_{t+1}} + \sum_{t=1}^{T} P_{t, y_t} \qquad (7)

where A ∈ R^{|Y|×|Y|} is the CRF transition matrix, |Y| is the size of the label set, and P ∈ R^{T×|Y|} is a linearly transformed matrix from u_t^{K+1} such that P_{t,:}^\top = W_s u_t^{K+1}, where W_s ∈ R^{|Y|×h} with h being the size of m_t. Here, A_{i,j} represents the score of the transition from the i-th tag to the j-th tag, whereas P_{i,j} is the score of the j-th tag at time i.
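To make the memory layer output and the CRF layer concrete, below is a hedged numpy sketch of Equations (5)–(7), together with the standard forward-algorithm normalisation and Viterbi decoding used immediately below (Equations (8) and (10)). Function and variable names are illustrative assumptions rather than the released implementation, and the dedicated start/end transitions implied by the t = 0 and t = T terms of Equation (7) are omitted for brevity.

```python
import numpy as np

def memory_output(p_t, C):
    """Eq. (5): o_t = sum_i p_{t,i} c_i, a weighted sum over the output memory C."""
    return p_t @ C                               # (i,) x (i, h) -> (h,)

def hop_update(o_t, u_t):
    """Eq. (6): u_t^{k+1} = o_t^k + u_t^k (this work uses a single hop, K = 1)."""
    return o_t + u_t

def emissions(U, W_s):
    """P in Eq. (7): per-step tag scores, P[t] = W_s u_t^{K+1}."""
    return U @ W_s.T                             # (T, h) -> (T, |Y|)

def sequence_score(P, A, y):
    """Eq. (7): sum of transition scores A[y_t, y_{t+1}] and emission scores P[t, y_t]."""
    emit = sum(P[t, tag] for t, tag in enumerate(y))
    trans = sum(A[y[t], y[t + 1]] for t in range(len(y) - 1))
    return emit + trans

def log_partition(P, A):
    """Log of the denominator in Eq. (8), computed with the forward algorithm."""
    alpha = P[0].copy()                          # log-forward scores at t = 0
    for t in range(1, P.shape[0]):
        scores = alpha[:, None] + A              # scores[i, j] = alpha[i] + A[i, j]
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0)) + P[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

def viterbi(P, A):
    """Eq. (10): highest-scoring tag sequence under emissions P and transitions A."""
    T, n = P.shape
    delta, backptr = P[0].copy(), np.zeros((T, n), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + A
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + P[t]
    y = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        y.append(int(backptr[t, y[-1]]))
    return y[::-1]

# toy usage
rng = np.random.default_rng(1)
T, h, n_tags = 4, 6, 3
C = rng.standard_normal((T, h))                  # output memory c_1..c_T
p_t = np.full(T, 1.0 / T)                        # attention weights from Eq. (4)
u_next = hop_update(memory_output(p_t, C), rng.standard_normal(h))   # Eqs. (5)-(6)

U = rng.standard_normal((T, h))                  # u_t^{K+1} for every time step
A = 0.1 * rng.standard_normal((n_tags, n_tags))  # CRF transition matrix
P = emissions(U, 0.1 * rng.standard_normal((n_tags, h)))
y = [0, 2, 1, 1]
log_prob = sequence_score(P, A, y) - log_partition(P, A)   # log p(y | x), Eq. (8)
print(viterbi(P, A), log_prob)
```

In this sketch, the log-probability of a tag sequence from Equation (8) is simply the sequence score minus the log-partition term, which is the quantity the training loss in Equation (9) sums over the data.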

Using the scoring function in Equation (7), we calculate the score of the sequence y, normalised by the sum of scores of all possible sequences ỹ, and this becomes the probability of the true sequence:

p(y \mid x) = \frac{\exp(s(x, y))}{\sum_{\tilde{y} \in Y_x} \exp(s(x, \tilde{y}))} \qquad (8)

We train the model to maximise the probability of the gold label sequence with the following loss function:

\mathcal{L} = \sum_{n=1}^{N} \log p(y^{(n)} \mid x^{(n)}) \qquad (9)

where p(y^{(n)} \mid x^{(n)}) is calculated using the forward–backward algorithm. Note that the model is fully end-to-end differentiable.

At test time, the model predicts the output sequence with maximum a posteriori probability:

y^{*} = \arg\max_{\tilde{y} \in Y_x} p(\tilde{y} \mid x) \qquad (10)

Since we are only modelling bigram interactions, we adopt the Viterbi algorithm for decoding.

4 Thread Discourse Structure Prediction

In this section, we describe how ME-CRFs can be applied to the task of thread discourse structure prediction, wherein we attempt to predict which post(s) a given post directly responds to, and in what way(s) (as captured by dialogue acts). This is a novel approach to this problem, and is capable of natively handling both tasks within the same network architecture.

4.1 Dataset and Task

In this work, we adopt the dataset of Kim et al. (2010),[1] which consists of 315 threads and 1,332 posts, collected from the Operating System, Software, Hardware and Web Development sub-forums of CNET.[2] Every post has been manually linked to the preceding post(s) in the thread that it is a direct response to (in the form of "links"), along with the nature of the response for each link (in the form of "dialogue acts", or "DAs"). In this dataset, it is not uncommon to see messages respond to posts which occur much earlier in the thread (based on the chronological ordering of posts). In fact, 18% of the posts link to posts other than their immediately preceding post.

[1] http://people.eng.unimelb.edu.au/tbaldwin/resources/conll2010-thread/
[2] http://forums.cnet.com/

The task is defined as follows: given a list of preceding posts x_1, ..., x_{t-1} and the current post x_t, classify which posts it links to (l_t) and the dialogue act (y_t) of each such link. In the scope of this work, ME-CRFs are capable of modelling both tasks natively, and are therefore a natural fit for this problem.

4.2 Experimental Setup

In this dataset, in addition to the body of text, each post is also associated with a title. We therefore use two encoders, Φ_title(·) and Φ_text(·), to process them separately, and then concatenate x_t = [Φ_title(x_t); Φ_text(x_t)]^⊤. Here, Φ_title(·) and Φ_text(·) take word embeddings as input, processing each post at the word level, as opposed to the post-level bi-directional GRU in Equations (1) and (2), and the representation of a post x_t (either title or text) is obtained by transforming the last and first hidden states of the forward and backward word-level GRUs, similar to Equation (3). Note that Φ_title(·) and Φ_text(·) do not share parameters. As in Tang et al. (2016), we further restrict m_i^k = c_i^k to curb overfitting.

In keeping with Wang (2014), we complement the textual representations with hand-crafted structural features Φ_s(x_t) ∈ R^2:
• initiator: a binary feature indicating whether the author of the current post is the same as the initiator of the thread;
• position: the relative position of the current post, as a ratio over the total number of posts in the thread.

Also drawing on Wang (2014), we incorporate punctuation-based features Φ_p(x_t) ∈ R^3: the number of question marks, exclamation marks and URLs in the current post. The resultant feature vectors are projected into an embedding space by W_s and W_p, and concatenated with x_i, resulting in the new x_i'. Subsequently, x_i' is fed into the bi-directional GRUs to obtain m_i.

For link prediction, we generate a supervision signal from the annotated links, guiding the attention to focus on the correct memory position:

\mathcal{L}_{\mathrm{LNK}} = \sum_{n=1}^{N} \sum_{t=1}^{T} \mathrm{CrossEntropy}(l_t^{(n)}, p_t^{(n)}) \qquad (11)

where l_t^{(n)} is a one-hot vector indicating where the link points to for the t-th post in the n-th thread, and p_t^{(n)} = \{p_{t,1}, ..., p_{t,t}\} is the predicted distribution of attention over the t posts in the memory.

To accommodate the first post in a thread, as it points to a virtual "head" post, we set a dummy post, m_0 = 0, of the same size as m_i. While the dataset contains multi-headed posts (posts with more than one outgoing link), following Wang (2014), we only include the most recent linked post during training, but evaluate over the full set of labelled links.

For this task, ME-CRF is jointly trained to predict both the link and the dialogue act with the following loss function:

\mathcal{L}' = \alpha \mathcal{L}_{\mathrm{DA}} + (1 - \alpha) \mathcal{L}_{\mathrm{LNK}} \qquad (12)

where \mathcal{L}_{\mathrm{DA}} is the CRF likelihood defined in Equation (9), and α is a hyper-parameter for balancing the emphasis between the two tasks.

Training is carried out with Adam (Kingma and Ba, 2015) over 50 epochs with a batch size of 32. We use the following hyper-parameter settings: word embedding size of 20, W_p ∈ R^{100×3}, W_s ∈ R^{50×2}, α = 0.5, hidden size of Φ_title and Φ_text of 20, and hidden size of GRU_→ and GRU_← of 50. Dropout is applied to all GRU recurrent units on the input and output connections, with a keep rate of 0.7.

Lastly, we also explore the idea of curriculum learning (Bengio et al., 2009), by fixing the CRF transition matrix A = 0 for the first e = 20 epochs, after which we train the parameters for the remainder of the run. This allows the ME-CRF to learn a good strategy for DA and link prediction, as an independent "maxent"-type classifier, before attempting to learn the sequence dynamics. We refer to this variant as "ME-CRF+".

4.3 Evaluation

Following Wang (2014), we evaluate based on post-level micro-averaged F-score. All experiments were carried out with 10-fold cross validation, stratifying at the thread level.

We benchmark against the following previous work: the feature-rich CRF-based approach of Kim et al. (2010), where the authors trained independent models for each of link and DA classification ("CRF_KIM"); the feature-rich CRF-based approach of Wang (2014), where the author further extended the feature set of Kim et al. (2010) and jointly trained a CRF over the link and DA prediction tasks ("CRF_WANG"); and the dependency parser-based approach of Wang (2014), where the author treated the discourse structure prediction task as a constrained dependency parsing problem, with posts as nodes in the dependency graph, and the constraint that links must connect to preceding posts in the thread ("DEPPARSER").[3] In addition to the CRF/parser-based systems, we also build a MEMNET-based baseline (named MEMNET), where MEMNET shares the architecture of the memory layer in ME-CRF but excludes the use of the CRF layer. Instead, MEMNET, following the work of Sukhbaatar et al. (2015), predicts the final answer by:

\hat{y} = \mathrm{softmax}(W_{\mathrm{DA}} u^{K+1}) \qquad (13)

where ŷ is the predicted DA distribution, W_DA ∈ R^{|Y|×d} is a parameter matrix for the model to learn, and K = 1 is the total number of hops. This is equivalent to classifying link and DA independently at each time step t, without taking transitions between DA labels into account.

[3] Note that a mistake was found in the results in the original paper (Wang et al., 2011), and we use the corrected results from Wang (2014).

4.4 Results

The experimental results are presented in Table 1, wherein the first three rows are the three baseline systems.

Model       Link   DA     Joint
CRF_KIM     86.3   75.1   —
CRF_WANG    82.3   73.4   66.5
DEPPARSER   85.0   75.7   70.6
MEMNET      85.8   76.0   69.5
ME-CRF      86.4   77.5   70.9
ME-CRF+     86.3   77.4   71.2

Table 1: Post-level Link and DA F-scores. Performance for ME-CRF and ME-CRF+ is macro-averaged over 5 runs.

State-of-the-art post-level results. ME-CRFs achieve state-of-the-art results in terms of joint post-level F-score, substantially better than the baselines. While ME-CRF slightly outperforms the current state of the art (DEPPARSER), ME-CRF+ improves the performance and achieves a further 0.3% absolute gain.
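As a hedged sketch of the joint objective in Equation (12) and the ME-CRF+ curriculum schedule described above (the CRF transition matrix A is held at zero for the first e = 20 epochs), the helper names below are hypothetical rather than taken from the released code.

```python
def joint_loss(loss_da, loss_lnk, alpha=0.5):
    """Eq. (12): L' = alpha * L_DA + (1 - alpha) * L_LNK."""
    return alpha * loss_da + (1.0 - alpha) * loss_lnk

def transitions_trainable(epoch, freeze_epochs=20):
    """ME-CRF+ curriculum: keep the CRF transition matrix A fixed at 0 for the
    first freeze_epochs epochs, and only train it for the remainder of the run."""
    return epoch >= freeze_epochs

print(joint_loss(1.2, 0.8))                                   # 1.0
print([transitions_trainable(e) for e in (0, 19, 20, 49)])    # [False, False, True, True]
```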


Curriculum learning improves joint prediction. Despite the slight performance drop on the DA and link prediction tasks, ME-CRF+, with the CRF transition matrix frozen for the first 20 epochs, achieves a ∼0.3% absolute gain in joint F-score over ME-CRF. This suggests that the sequence dynamics between posts, while difficult to capture, are beneficial to the overall task (resulting in more coherent DA and link predictions) if trained with proper initialisation.

MEMNET vs. ME-CRFs. We see consistent gains across all three tasks when the CRF layer is added. Although not presented in Table 1, the difference is most notable at the thread level (i.e. a thread is correct iff all posts are tagged correctly), highlighting the importance of sequential transitional information between posts.

CRF vs. ME-CRFs. Note that CRF_KIM is not trained jointly on the two component tasks, but individually on each task. Without additional data, jointly training on the two tasks generates results that are comparable to, or substantially better than, training on the individual tasks. This highlights the effectiveness of ME-CRF, especially with the link prediction performance comparable to that of the single-task model CRF_KIM, and surpassing it in the case of DA.

[Figure 3 plot: post-level Joint F-score against post depth ([1,2], [3,4], [5,6], [7,8], [9,∞)) for the baselines and ME-CRF+]

Figure 3: Breakdown of post-level Joint F-scores by post depth, where e.g. "[1,2]" is the joint F-score over posts of depth 1–2, i.e. the first or second post in the thread. Note that we take the reported performance of CRF_WANG and DEPPARSER from Wang (2014).

[Figure 4 plots: post-level Link F-score and DA F-score against post depth]

Figure 4: Breakdown of post-level Link and DA F-scores by post depth.

4.5 Analysis

We break the performance down by the depth of each post in a given thread, and present the results in Figure 3. Although below the baselines for the interval [1,2], ME-CRF+ consistently outperforms CRF_WANG from depth 3 onwards, and is superior to DEPPARSER for depths [7,∞). Breaking the performance down further to the individual tasks of Link and DA prediction, as displayed in Figure 4, we observe a similar trend. In line with the findings in the work of Wang (2014), this confirms that prediction becomes progressively more difficult as threads grow longer, which is largely due to the increased variability in discourse structure. Despite the escalated difficulty, ME-CRF+ is substantially superior to the baselines when classifying deeper posts.

Between the CRF-based models, it is worth noting that despite the lower performance for [1,2], ME-CRF+ benefits from having global access to the entire sequence, and consistently outperforms CRF_WANG for depths [3,∞), highlighting the effectiveness of the memory mechanism. Overall, these results validate our hypothesis that having unrestricted access to the whole sequence is beneficial, especially for long-range dependencies, offering further evidence of the power of ME-CRFs.

5 Named Entity Recognition

In this section, we present experiments in a second setting: named entity recognition, over the CoNLL 2003 English NER shared task dataset (Tjong Kim Sang and De Meulder, 2003). Our interest here is in evaluating the ability of ME-CRF to capture document context, to aid in the identification and disambiguation of NEs.

5.1 Dataset and Task

The CoNLL 2003 NER shared task dataset consists of 14,041/3,250/3,453 sentences in the training/development/test set, resp., extracted from 946/216/231 Reuters news articles from the period 1996–97. The goal is to identify individual token occurrences of NEs, and tag each with its class (e.g. LOCATION or ORGANISATION). Here, we use the IOB tagging scheme: while some have reported marginal improvements with the more expressive IOBES scheme (Ratinov and Roth, 2009; Dai et al., 2015), we stick with BIO for simplicity, and in light of the observation by Lample et al. (2016) of little difference between the two schemes.

5.2 Experimental Setup

We choose Φ(x_t) to be a lookup function, returning the corresponding embedding x_t of the word x_t. In addition to the word features, we employ a subset of the lexical features described in Huang et al. (2015), based on whether the word:
• starts with a capital letter;
• is composed of all capital letters;
• is composed of all lower case letters;
• contains non-initial capital letters;
• contains both letters and digits;
• contains punctuation.
These features are all binary, and are referred to as Φ_l(x_t). Similar to the thread structure prediction experiments, we concatenate Φ_l(x_t) with x_t to generate the new input x_t' to the bi-directional GRUs in Equations (1) and (2).

In order to incorporate information in the document beyond sentence boundaries, we encode every word sequentially in a document with Φ, GRU_→ and GRU_←, and store them in the memory m_i and c_i for i ∈ [1, t'], where t' is the index of the current word t in the document.

Training is carried out with Adam, over 100 epochs with a batch size of 32. We use the following hyper-parameter settings: word embedding size = 50; hidden size of GRU_→ and GRU_← = 50; →W_m, ←W_m ∈ R^{50×50}; and b_m ∈ R^{50}. Dropout is applied to all GRU recurrent units on the input and output connections, with a keep rate of 0.8. We initialise ME-CRF with pre-trained word embeddings and keep them fixed during training. While we report results only on the test set, we use early stopping based on the development set.

5.3 Evaluation

Evaluation is based on span-level NE F-score, using the official CoNLL evaluation script.[4] We compare against the following baselines:
1. a CRF over hand-tuned lexical features ("CRF": Huang et al. (2015));
2. an LSTM and a bi-directional LSTM ("LSTM" and "BI-LSTM", resp.: Huang et al. (2015));
3. a CRF taking features from a convolutional neural network as input ("CONV-CRF": Collobert et al. (2011));
4. a CRF over the output of either a simple LSTM or a bidirectional LSTM ("LSTM-CRF" and "BI-LSTM-CRF", resp.: Huang et al. (2015)).
Note that for our word embeddings, while we observe better performance with GLOVE (Pennington et al., 2014), for fair comparison purposes, we adopt the same SENNA embeddings (Collobert et al., 2011) as are used in the baseline methods.[5]

[4] http://www.cnts.ua.ac.be/conll2000/chunking/conlleval.txt
[5] Lample et al. (2016) report a higher result of 90.9 using a BI-LSTM-CRF architecture, but augmented with skip n-grams (Ling et al., 2015) and character embeddings. Due to the differing underlying representation, we exclude it from the comparison.

5.4 Results

The experimental results are presented in Table 2. Results for the baseline methods are based on the published results of Huang et al. (2015) and Collobert et al. (2011). Note that none of the systems in Table 2 use external gazetteers, to make the comparison fair. As can be observed, ME-CRF achieves the best performance, beating all of the baselines.
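Looking back at the lexical features Φ_l(x_t) listed in Section 5.2, the following is a minimal sketch of one plausible implementation; the precise definitions (e.g. what counts as punctuation) are assumptions rather than the feature extractor of Huang et al. (2015).

```python
import string

def lexical_features(word):
    """Six binary indicators over the surface form of a word (Section 5.2)."""
    return [
        float(word[:1].isupper()),                          # starts with a capital letter
        float(word.isupper()),                              # all capital letters
        float(word.islower()),                              # all lower case letters
        float(any(c.isupper() for c in word[1:])),          # non-initial capital letters
        float(any(c.isalpha() for c in word)
              and any(c.isdigit() for c in word)),          # both letters and digits
        float(any(c in string.punctuation for c in word)),  # contains punctuation
    ]

print(lexical_features("Interfax"), lexical_features("U.N."))
```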

Model        F-score
CRF          86.1
LSTM         83.7
BI-LSTM      85.2
CONV-CRF     88.7
LSTM-CRF     88.4
BI-LSTM-CRF  88.8
ME-CRF       89.5

Table 2: NER performance on the CoNLL 2003 English NER shared task dataset.

Memory       p_{t,i}  |  Memory      p_{t,i}
...                   |  ...
Manchester   0.23     |  ,           0.00
United       0.00     |  Interfax    0.32
face         0.00     |  news        0.00
Juventus     0.65     |  agency      0.00
in           0.00     |  said        0.00
Europe       0.00     |  .           0.00
...                   |  - - - - - - - - -
                      |  Interfax    0.68
European champions    |  Interfax quoted
Juventus ...          |  Russian ...

Table 3: An NER example showing learned attention to long-range contextual dependencies. The last row is the current sentence. The underlined words indicate the target word x_t, and the dashed line indicates a sentence boundary.

To gain a better understanding of what the model has learned, Table 3 presents two examples where ME-CRF focuses on words beyond the current sentence boundaries. In the example on the left, where the target word is Juventus (an Italian soccer team), ME-CRF directs the attention mainly to the occurrence of the same word in a previous sentence, and a small fraction to Manchester (a UK soccer team, in this context). Note that it does not attend to the other NE (Europe) in that sentence, which is of a different NE class. In the example on the right, on the other hand, ME-CRF allocates attention to the same words as the target word in the current sentence. Note that the second occurrence of Interfax in the memory is the same occurrence as the first word in the current sentence. While more weight is placed on the second Interfax, close to one third of the attention is also assigned to the first occurrence. Given that the memory, m_i and c_i, is encoded with bi-directional GRUs, the first Interfax should, to some degree, capture the succeeding neighbouring elements: news agency.

This is reminiscent of label consistency in the works of Sutton and McCallum (2004) and Finkel et al. (2005), but differs in that the consistency constraint is soft, as opposed to hard in previous studies, and is automatically learned without the use of heuristics.

6 Conclusion

In this paper, we have presented ME-CRF, a model extending linear-chain CRFs by including external memory. This allows the model to look beyond neighbouring items and access long-range context. Experimental results demonstrate the effectiveness of the proposed method over two tasks: forum thread discourse analysis, and named entity recognition.

Acknowledgments

We thank the anonymous reviewers for their valuable feedback, and gratefully acknowledge the support of an Australian Government Research Training Program Scholarship and the National Computational Infrastructure (NCI Australia). This work was also supported in part by the Australian Research Council.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015). San Diego, USA.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009). Montreal, Canada, pages 41–48.

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2):157–166.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar, pages 103–111.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12:2493–2537.

Hong-Jie Dai, Po-Ting Lai, Yung-Chun Chang, and Richard Tzong-Han Tsai. 2015. Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization. Journal of Cheminformatics 7(1):S14.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005). Ann Arbor, USA, pages 363–370.

Stuart Geman and Donald Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6:721–741.

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2013). Vancouver, Canada, pages 6645–6649.

Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

Su Nam Kim, Li Wang, and Timothy Baldwin. 2010. Tagging and linking web forum posts. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning (CoNLL 2010). Uppsala, Sweden, pages 192–202.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015). San Diego, USA.

Vijay Krishnan and Christopher D. Manning. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (ACL 2006). Sydney, Australia, pages 1121–1128.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001). San Francisco, USA, pages 282–289.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI 2015). Austin, USA, volume 333, pages 2267–2273.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2016). San Diego, USA, pages 260–270.

Shasha Liao and Ralph Grishman. 2010. Using document level cross-event inference to improve event extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010). Uppsala, Sweden, pages 789–797.

Wang Ling, Yulia Tsvetkov, Silvio Amir, Ramon Fermandez, Chris Dyer, Alan W. Black, Isabel Trancoso, and Chu-Cheng Lin. 2015. Not all contexts are created equal: Better word representations with variable attention. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015). Lisbon, Portugal, pages 1367–1372.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics (TACL 2016) 4:521–535.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014). Doha, Qatar, pages 1532–1543.

Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009). Boulder, USA, pages 147–155.

Minjoon Seo, Sewon Min, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Query-reduction networks for question answering. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017). Toulon, France.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS 2015). Montréal, Canada, pages 2440–2448.

Charles Sutton and Andrew McCallum. 2004. Collective segmentation and labeling of distant entities in information extraction. Technical report, University of Massachusetts Amherst, Dept. of Computer Science.

Duyu Tang, Bing Qin, and Ting Liu. 2016. Aspect level sentiment classification with deep memory network. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016). Austin, USA, pages 214–224.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2003). Edmonton, Canada, pages 142–147.

Li Wang. 2014. Knowledge discovery and extraction of domain-specific web data. Ph.D. thesis, The University of Melbourne.

Li Wang, Marco Lui, Su Nam Kim, Joakim Nivre, and Timothy Baldwin. 2011. Predicting thread discourse structure over technical web forums. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011). Edinburgh, UK, pages 13–25.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory networks. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015). San Diego, USA.

Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016). New York, USA, pages 2397–2406.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2016). San Diego, USA, pages 1480–1489.

Amy Zhang, Bryan Culbertson, and Praveen Paritosh. 2017. Characterizing online discussion using coarse discourse sequences. In Proceedings of the 11th AAAI International Conference on Web and Social Media (ICWSM 2017). Palo Alto, USA, pages 357–366.
