Capturing Long-range Contextual Dependencies with Memory-enhanced Conditional Random Fields

Fei Liu, Timothy Baldwin, Trevor Cohn
School of Computing and Information Systems, The University of Melbourne, Victoria, Australia
[email protected] [email protected] [email protected]

Abstract

Despite successful applications across a broad range of NLP tasks, conditional random fields ("CRFs"), in particular the linear-chain variant, are only able to model local features. While this has important benefits in terms of inference tractability, it limits the ability of the model to capture long-range dependencies between items. Attempts to extend CRFs to capture long-range dependencies have largely come at the cost of computational complexity and approximate inference. In this work, we propose an extension to CRFs by integrating external memory, taking inspiration from memory networks, thereby allowing CRFs to incorporate information far beyond neighbouring steps. Experiments across two tasks show substantial improvements over strong CRF and LSTM baselines.

1 Introduction

While long-range contextual dependencies are prevalent in natural language, for tractability reasons, most statistical models capture only local features (Finkel et al., 2005). Take the sentence in Figure 1, for example. Here, while it is easy to determine that Interfax in the second sentence is a named entity, it is hard to determine its semantic class, as there is little context information. The usage in the first sentence, on the other hand, can be reliably disambiguated due to the post-modifying phrase news agency. Ideally, we would like to be able to share such contexts across all usages (and variants) of a given named entity for reliable and consistent identification and disambiguation.

A related example is forum thread discourse analysis. Previous work has largely focused on linear-chain Conditional Random Fields (CRFs) (Wang et al., 2011; Zhang et al., 2017), framing the task as one of sequence tagging. Although CRFs are adept at capturing local structure, the problem does not naturally suit a linear sequential structure, i.e., a post may be a reply to either a neighbouring post or one posted far earlier within the same thread. In both cases, contextual dependencies can be long-range, necessitating the ability to capture dependencies between arbitrarily distant items. Identifying this key limitation, Sutton and McCallum (2004) and Finkel et al. (2005) proposed the use of CRFs with skip connections to incorporate long-range dependencies. In both cases the graph structure must be supplied a priori, rather than learned, and both techniques incur the need for costly approximate inference.

Recurrent neural networks (RNNs) have been proposed as an alternative technique for encoding sequential inputs; however, plain RNNs are unable to capture long-range dependencies (Bengio et al., 1994; Hochreiter et al., 2001), and variants such as LSTMs (Hochreiter and Schmidhuber, 1997), although more capable of capturing non-local patterns, still exhibit a significant locality bias in practice (Lai et al., 2015; Linzen et al., 2016).

In this paper, taking inspiration from the work of Weston et al. (2015) on memory networks (MEMNETs), we propose to extend CRFs by integrating external memory mechanisms, thereby enabling the model to look beyond localised features and have access to the entire sequence. This is achieved with attention over every entry in the memory. Experiments on named entity recognition and forum thread discourse analysis, both of which involve long-range contextual dependencies, demonstrate the effectiveness of the proposed model, achieving state-of-the-art performance on the former, and outperforming a number of strong baselines in the case of the latter. A full implementation of the model is available at: https://github.com/liufly/mecrf.

[Figure 1: Interfax/B-ORG news/O agency/O said/O ··· | ··· Interfax/B-ORG quoted/O Russian/B-MISC military/O ···]
Figure 1: A NER example with long-range contextual dependencies. The vertical dashed line indicates a sentence boundary.

The paper is organised as follows: after reviewing previous studies on capturing long-range contextual dependencies and related models in Section 2, we detail the elements of the proposed model in Section 3. Sections 4 and 5 present the experimental results on two different datasets, one for thread discourse structure prediction and the other for named entity recognition (NER), with analyses and visualisation in their respective sections. Lastly, Section 6 concludes the paper.

2 Related Work

In this section, we review the different families of models that are relevant to this work, in capturing long-range contextual dependencies in different ways.

Conditional Random Fields (CRFs). CRFs (Lafferty et al., 2001), in particular linear-chain CRFs, have been widely adopted and applied to sequence labelling tasks in NLP, but have the critical limitation that they only capture local structure (Sutton and McCallum, 2004; Finkel et al., 2005), despite non-local structure being common in structured language classification tasks. In the context of named entity recognition ("NER"), Sutton and McCallum (2004) proposed skip-chain CRFs as a means of alleviating this shortcoming, wherein distant items are connected in a sequence based on a heuristic such as string identity (to achieve label consistency across all instances of the same string). The idea of label consistency and exploiting non-local features has also been explored in the work of Finkel et al. (2005), who take long-range structure into account while maintaining tractable inference with Gibbs sampling (Geman and Geman, 1984), by performing approximate inference over factored probabilistic models. While both of these lines of work report impressive results on such tasks, they come at the price of high computational cost and incompatibility with exact inference.

Similar ideas have also been explored by Krishnan and Manning (2006) for NER, where they apply two CRFs, the first of which makes predictions based on local information, and the second combines named entities identified by the first CRF in a single cluster, thereby enforcing label consistency and enabling the use of a richer set of features to capture non-local dependencies. Liao and Grishman (2010) make a strong case for going beyond sentence boundaries and leveraging document-level information for event extraction.

While we take inspiration from these earlier studies, we do not enforce label consistency as a hard constraint, and additionally do not sacrifice inference tractability: our model is capable of incorporating non-local features, and is compatible with exact inference methods.

Recurrent Neural Networks (RNNs). Recently, the broad adoption of deep learning methods in NLP has given rise to the prevalent use of RNNs. Long short-term memories ("LSTMs": Hochreiter and Schmidhuber (1997)), a particular variant of RNN, have become particularly popular, and have been successfully applied to a large number of tasks: speech recognition (Graves et al., 2013), sequence tagging (Huang et al., 2015), document categorisation (Yang et al., 2016), and machine translation (Cho et al., 2014; Bahdanau et al., 2014). However, as pointed out by Lai et al. (2015) and Linzen et al. (2016), RNNs, including LSTMs, are biased towards immediately preceding (or neighbouring, in the case of bidirectional RNNs) items, and perform poorly in contexts which involve long-range contextual dependencies, despite the inclusion of memory cells. This is further evidenced by the work of Cho et al. (2014), who show that the performance of a basic encoder–decoder deteriorates as the length of the input sentence increases.

Memory networks (MEMNETs). More recently, Weston et al. (2015) proposed memory networks, and showed that the augmentation of memory is crucial to performing inference requiring long-range dependencies, especially when document-level reasoning between multiple supporting facts is required. Of particular interest to our work are so-called "memory hops" in memory networks, which are guided by an attention mechanism based on the relevance between a question and each supporting context sentence in the memory hop. Governed by the attention mechanism, the ability to access the entire sequence is similar to the soft alignment idea proposed by Bahdanau et al. (2014) for neural machine translation. In this work, we borrow the concept of memory hops and integrate it into CRFs, thereby enabling the model to look beyond localised features and have access to the whole sequence via an attention mechanism.

3 Methodology

In the context of sequential tagging, we assume the input is in the form of sequence pairs: D = {x^{(n)}, y^{(n)}}_{n=1}^{N}, where x^{(n)} is the input of the n-th example in dataset D and consists of a sequence {x_1^{(n)}, x_2^{(n)}, ..., x_T^{(n)}}. Similarly, y^{(n)} is of the same length as x^{(n)} and consists of the corresponding labels {y_1^{(n)}, y_2^{(n)}, ..., y_T^{(n)}}. For notational convenience, hereinafter we omit the superscript denoting the n-th example.

In the case of NER, each x_t is a word in a sentence, with y_t being the corresponding NER label. For forum thread discourse analysis, x_t represents the text of an entire post, whereas y_t is the dialogue act label for the t-th post.

The proposed model extends CRFs by integrating external memory and is therefore named a Memory-Enhanced Conditional Random Field ("ME-CRF"). We take inspiration from Memory Networks ("MEMNETs": Weston et al. (2015)) and incorporate so-called memory hops into CRFs, thereby allowing the model to have unrestricted access to the whole sequence rather than localised features as in RNNs (Lai et al., 2015; Linzen et al., 2016).

As illustrated in Figure 2, ME-CRF can be divided into two major parts: (1) the memory layer; and (2) the CRF layer. The memory layer can be further broken down into three main components: (a) the input memory m_{1:t}; (b) the output memory c_{1:t}; and (c) the current input u_t, which represents the current step (also known as the "question" in the context of MEMNET). The input and output memory representations are connected via an attention mechanism whose weights are determined by measuring the similarity between the input memory and the current input. The CRF layer, on the other hand, takes the output of the memory layer as input. In the remainder of this section, we detail the elements of ME-CRF.

3.1 Memory Layer

3.1.1 Input memory

Every element (word/post) in a sequence x is encoded with x_t = Φ(x_t), where Φ(·) can be any encoding function mapping the input x_t into a vector x_t ∈ R^d. This results in the sequence {x_1, ..., x_T}. While this new sequence can be seen as the memory in the context of MEMNETs, one major drawback of this approach, as pointed out by Seo et al. (2017), is the insensitivity to temporal information between memory cells. We therefore follow Xiong et al. (2016) in injecting a temporal signal into the memory using a bi-directional GRU encoding (Cho et al., 2014):

\overrightarrow{m}_t = \overrightarrow{\mathrm{GRU}}(x_t, \overrightarrow{m}_{t-1}) \qquad (1)

\overleftarrow{m}_t = \overleftarrow{\mathrm{GRU}}(x_t, \overleftarrow{m}_{t+1}) \qquad (2)

m_t = \tanh(\overrightarrow{W}_m \overrightarrow{m}_t + \overleftarrow{W}_m \overleftarrow{m}_t + b_m) \qquad (3)

where \overrightarrow{W}_m, \overleftarrow{W}_m and b_m are learnable parameters.

3.1.2 Current input

This is used to represent the current step x_t, be it a word or a post. As in MEMNETs, we want to enforce the current input to be in the same space as the input memory, so that we can determine the attention weight of each element in the memory by measuring the relevance between the two. We denote the current input by u_t = m_t.

3.1.3 Attention

To determine the attention value of each element in the memory, we measure the relevance between the current step u_t and m_i for i ∈ [1, t] with a softmax function:

p_{t,i} = \mathrm{softmax}(u_t^\top m_i) \qquad (4)

where \mathrm{softmax}(a_i) = e^{a_i} / \sum_j e^{a_j}.
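To make the memory layer concrete, the following is a minimal numpy sketch of Equations (1)–(4), assuming a standard GRU cell and toy dimensions. Parameter shapes, initialisation and the gate convention are illustrative choices, not the released implementation at https://github.com/liufly/mecrf.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

class GRUCell:
    """Minimal GRU cell (Cho et al., 2014): h_t = GRU(x_t, h_{t-1})."""
    def __init__(self, d_in, d_hid):
        def p(*shape):
            return 0.1 * rng.standard_normal(shape)
        self.Wz, self.Uz, self.bz = p(d_hid, d_in), p(d_hid, d_hid), np.zeros(d_hid)
        self.Wr, self.Ur, self.br = p(d_hid, d_in), p(d_hid, d_hid), np.zeros(d_hid)
        self.Wh, self.Uh, self.bh = p(d_hid, d_in), p(d_hid, d_hid), np.zeros(d_hid)

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h + self.bz)              # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h + self.br)              # reset gate
        h_tilde = np.tanh(self.Wh @ x + self.Uh @ (r * h) + self.bh)  # candidate state
        return (1.0 - z) * h + z * h_tilde

def encode_memory(X, fwd, bwd, Wf, Wb, b):
    """Eqs. (1)-(3): run a bi-directional GRU over X[t] = Phi(x_t) and combine."""
    T, d_hid = len(X), b.shape[0]
    m_fwd, m_bwd = np.zeros((T, d_hid)), np.zeros((T, d_hid))
    h = np.zeros(d_hid)
    for t in range(T):                    # forward pass, Eq. (1)
        h = fwd.step(X[t], h)
        m_fwd[t] = h
    h = np.zeros(d_hid)
    for t in reversed(range(T)):          # backward pass, Eq. (2)
        h = bwd.step(X[t], h)
        m_bwd[t] = h
    return np.tanh(m_fwd @ Wf.T + m_bwd @ Wb.T + b)   # Eq. (3)

def attention(u_t, M):
    """Eq. (4): p_{t,i} = softmax(u_t . m_i) over the memory entries m_1..m_t."""
    return softmax(M @ u_t)

# toy usage: T inputs of dimension d, memory cells of size h
T, d, h = 5, 8, 6
X = rng.standard_normal((T, d))
M = encode_memory(X, GRUCell(d, h), GRUCell(d, h),
                  0.1 * rng.standard_normal((h, h)),
                  0.1 * rng.standard_normal((h, h)), np.zeros(h))
t = 3
p = attention(M[t], M[: t + 1])           # current input u_t = m_t (Section 3.1.2)
print(p.round(3), p.sum())                # a distribution over positions 1..t
```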

[Figure 2 diagram: at each time step, the current input u_t is compared with the input memory m_{1:t} via inner products and a softmax to produce the attention p_{1:t}; this gives a weighted sum over the output memory c_{1:t}, producing o_t, which feeds the CRF layer to predict y_t.]
Figure 2: Illustration of ME-CRFs with a single memory hop, showing the network architecture at time step t and t + 1.

3.1.4 Output memory

Similar to m_t, c_t is the output memory, and is calculated analogously, but with a different set of parameters in the GRUs and tanh layers of Equations (1), (2) and (3). The output memory is used to generate the final output of the memory layer, and is fed as input to the CRF layer.

3.1.5 Memory layer output

Once the attention weights have been computed, the memory access controller receives the response o in the form of a weighted sum over the output memory representations:

o_t = \sum_{i} p_{t,i} c_i \qquad (5)

This allows the model to have unrestricted access to elements in previous steps, as opposed to a single vector h_t in RNNs, thereby enabling ME-CRFs to detect and effectively incorporate long-range dependencies.

3.1.6 Extension

For more challenging tasks requiring complex reasoning capabilities with multiple supporting facts from the memory, the model can be further extended by stacking multiple memory hops, in which case the output of the k-th hop is taken as input to the (k+1)-th hop:

u_t^{k+1} = o_t^{k} + u_t^{k} \qquad (6)

where u_t^{k+1} encodes not only information at the current step (u_t^{k}) but also relevant knowledge from the memory (o_t^{k}). In the scope of this work, we limit the number of hops to 1.

3.2 CRF Layer

Once the representation of the current step u_t^{K+1} is computed, incorporating relevant information from the memory (assuming the total number of memory hops is K), it is fed into a CRF layer:

s(x, y) = \sum_{t=0}^{T} A_{y_t, y_{t+1}} + \sum_{t=1}^{T} P_{t, y_t} \qquad (7)

where A ∈ R^{|Y|×|Y|} is the CRF transition matrix, |Y| is the size of the label set, and P ∈ R^{T×|Y|} is a linearly transformed matrix from u_t^{K+1} such that P_{t,:}^\top = W_s u_t^{K+1}, where W_s ∈ R^{|Y|×h} with h being the size of m_t. Here, A_{i,j} represents the score of the transition from the i-th tag to the j-th tag, whereas P_{i,j} is the score of the j-th tag at time i.
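To make the memory layer output and the CRF layer concrete, below is a hedged numpy sketch of Equations (5)–(7), together with the standard forward-algorithm normalisation and Viterbi decoding used immediately below (Equations (8) and (10)). Function and variable names are illustrative assumptions rather than the released implementation, and the dedicated start/end transitions implied by the t = 0 and t = T terms of Equation (7) are omitted for brevity.

```python
import numpy as np

def memory_output(p_t, C):
    """Eq. (5): o_t = sum_i p_{t,i} c_i, a weighted sum over the output memory C."""
    return p_t @ C                               # (i,) x (i, h) -> (h,)

def hop_update(o_t, u_t):
    """Eq. (6): u_t^{k+1} = o_t^k + u_t^k (this work uses a single hop, K = 1)."""
    return o_t + u_t

def emissions(U, W_s):
    """P in Eq. (7): per-step tag scores, P[t] = W_s u_t^{K+1}."""
    return U @ W_s.T                             # (T, h) -> (T, |Y|)

def sequence_score(P, A, y):
    """Eq. (7): sum of transition scores A[y_t, y_{t+1}] and emission scores P[t, y_t]."""
    emit = sum(P[t, tag] for t, tag in enumerate(y))
    trans = sum(A[y[t], y[t + 1]] for t in range(len(y) - 1))
    return emit + trans

def log_partition(P, A):
    """Log of the denominator in Eq. (8), computed with the forward algorithm."""
    alpha = P[0].copy()                          # log-forward scores at t = 0
    for t in range(1, P.shape[0]):
        scores = alpha[:, None] + A              # scores[i, j] = alpha[i] + A[i, j]
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0)) + P[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

def viterbi(P, A):
    """Eq. (10): highest-scoring tag sequence under emissions P and transitions A."""
    T, n = P.shape
    delta, backptr = P[0].copy(), np.zeros((T, n), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + A
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + P[t]
    y = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        y.append(int(backptr[t, y[-1]]))
    return y[::-1]

# toy usage
rng = np.random.default_rng(1)
T, h, n_tags = 4, 6, 3
C = rng.standard_normal((T, h))                  # output memory c_1..c_T
p_t = np.full(T, 1.0 / T)                        # attention weights from Eq. (4)
u_next = hop_update(memory_output(p_t, C), rng.standard_normal(h))   # Eqs. (5)-(6)

U = rng.standard_normal((T, h))                  # u_t^{K+1} for every time step
A = 0.1 * rng.standard_normal((n_tags, n_tags))  # CRF transition matrix
P = emissions(U, 0.1 * rng.standard_normal((n_tags, h)))
y = [0, 2, 1, 1]
log_prob = sequence_score(P, A, y) - log_partition(P, A)   # log p(y | x), Eq. (8)
print(viterbi(P, A), log_prob)
```

In this sketch, the log-probability of a tag sequence from Equation (8) is simply the sequence score minus the log-partition term, which is the quantity the training loss in Equation (9) sums over the data.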

Using the scoring function in Equation (7), we calculate the score of the sequence y, normalised by the sum of scores of all possible sequences ỹ, and this becomes the probability of the true sequence:

p(y \mid x) = \frac{\exp(s(x, y))}{\sum_{\tilde{y} \in Y_x} \exp(s(x, \tilde{y}))} \qquad (8)

We train the model to maximise the probability of the gold label sequence with the following loss function:

\mathcal{L} = \sum_{n=1}^{N} \log p(y^{(n)} \mid x^{(n)}) \qquad (9)

where p(y^{(n)} \mid x^{(n)}) is calculated using the forward–backward algorithm. Note that the model is fully end-to-end differentiable.

At test time, the model predicts the output sequence with maximum a posteriori probability:

y^{*} = \arg\max_{\tilde{y} \in Y_x} p(\tilde{y} \mid x) \qquad (10)

Since we are only modelling bigram interactions, we adopt the Viterbi algorithm for decoding.

4 Thread Discourse Structure Prediction

In this section, we describe how ME-CRFs can be applied to the task of thread discourse structure prediction, wherein we attempt to predict which post(s) a given post directly responds to, and in what way(s) (as captured by dialogue acts). This is a novel approach to this problem, and is capable of natively handling both tasks within the same network architecture.

4.1 Dataset and Task

In this work, we adopt the dataset of Kim et al. (2010),[1] which consists of 315 threads and 1,332 posts, collected from the Operating System, Software, Hardware and Web Development sub-forums of CNET.[2] Every post has been manually linked to the preceding post(s) in the thread that it is a direct response to (in the form of "links"), along with the nature of the response for each link (in the form of "dialogue acts", or "DAs"). In this dataset, it is not uncommon to see messages respond to posts which occur much earlier in the thread (based on the chronological ordering of posts). In fact, 18% of the posts link to posts other than their immediately preceding post.

[1] http://people.eng.unimelb.edu.au/tbaldwin/resources/conll2010-thread/
[2] http://forums.cnet.com/

The task is defined as follows: given a list of preceding posts x_1, ..., x_{t-1} and the current post x_t, classify which posts it links to (l_t) and the dialogue act (y_t) of each such link. In the scope of this work, ME-CRFs are capable of modelling both tasks natively, and are therefore a natural fit for this problem.

4.2 Experimental Setup

In this dataset, in addition to the body of text, each post is also associated with a title. We therefore use two encoders, Φ_title(·) and Φ_text(·), to process them separately, and then concatenate x_t = [Φ_title(x_t); Φ_text(x_t)]^⊤. Here, Φ_title(·) and Φ_text(·) take word embeddings as input, processing each post at the word level, as opposed to the post-level bi-directional GRU in Equations (1) and (2), and the representation of a post x_t (either title or text) is obtained by transforming the last and first hidden states of the forward and backward word-level GRUs, similar to Equation (3). Note that Φ_title(·) and Φ_text(·) do not share parameters. As in Tang et al. (2016), we further restrict m_i^k = c_i^k to curb overfitting.

In keeping with Wang (2014), we complement the textual representations with hand-crafted structural features Φ_s(x_t) ∈ R^2:
• initiator: a binary feature indicating whether the author of the current post is the same as the initiator of the thread;
• position: the relative position of the current post, as a ratio over the total number of posts in the thread.

Also drawing on Wang (2014), we incorporate punctuation-based features Φ_p(x_t) ∈ R^3: the number of question marks, exclamation marks and URLs in the current post. The resultant feature vectors are projected into an embedding space by W_s and W_p, and concatenated with x_i, resulting in the new x_i'. Subsequently, x_i' is fed into the bi-directional GRUs to obtain m_i.

For link prediction, we generate a supervision signal from the annotated links, guiding the attention to focus on the correct memory position:

\mathcal{L}_{\mathrm{LNK}} = \sum_{n=1}^{N} \sum_{t=1}^{T} \mathrm{CrossEntropy}(l_t^{(n)}, p_t^{(n)}) \qquad (11)

where l_t^{(n)} is a one-hot vector indicating where the link points to for the t-th post in the n-th thread, and p_t^{(n)} = \{p_{t,1}, ..., p_{t,t}\} is the predicted distribution of attention over the t posts in the memory.

To accommodate the first post in a thread, as it points to a virtual "head" post, we set a dummy post, m_0 = 0, of the same size as m_i. While the dataset contains multi-headed posts (posts with more than one outgoing link), following Wang (2014), we only include the most recent linked post during training, but evaluate over the full set of labelled links.

For this task, ME-CRF is jointly trained to predict both the link and the dialogue act with the following loss function:

\mathcal{L}' = \alpha \mathcal{L}_{\mathrm{DA}} + (1 - \alpha) \mathcal{L}_{\mathrm{LNK}} \qquad (12)

where \mathcal{L}_{\mathrm{DA}} is the CRF likelihood defined in Equation (9), and α is a hyper-parameter for balancing the emphasis between the two tasks.

Training is carried out with Adam (Kingma and Ba, 2015) over 50 epochs with a batch size of 32. We use the following hyper-parameter settings: word embedding size of 20, W_p ∈ R^{100×3}, W_s ∈ R^{50×2}, α = 0.5, hidden size of Φ_title and Φ_text of 20, and hidden size of GRU_→ and GRU_← of 50. Dropout is applied to all GRU recurrent units on the input and output connections, with a keep rate of 0.7.

Lastly, we also explore the idea of curriculum learning (Bengio et al., 2009), by fixing the CRF transition matrix A = 0 for the first e = 20 epochs, after which we train the parameters for the remainder of the run. This allows the ME-CRF to learn a good strategy for DA and link prediction, as an independent "maxent"-type classifier, before attempting to learn the sequence dynamics. We refer to this variant as "ME-CRF+".

4.3 Evaluation

Following Wang (2014), we evaluate based on post-level micro-averaged F-score. All experiments were carried out with 10-fold cross validation, stratifying at the thread level.

We benchmark against the following previous work: the feature-rich CRF-based approach of Kim et al. (2010), where the authors trained independent models for each of link and DA classification ("CRF_KIM"); the feature-rich CRF-based approach of Wang (2014), where the author further extended the feature set of Kim et al. (2010) and jointly trained a CRF over the link and DA prediction tasks ("CRF_WANG"); and the dependency parser-based approach of Wang (2014), where the author treated the discourse structure prediction task as a constrained dependency parsing problem, with posts as nodes in the dependency graph, and the constraint that links must connect to preceding posts in the thread ("DEPPARSER").[3] In addition to the CRF/parser-based systems, we also build a MEMNET-based baseline (named MEMNET), where MEMNET shares the architecture of the memory layer in ME-CRF but excludes the use of the CRF layer. Instead, MEMNET, following the work of Sukhbaatar et al. (2015), predicts the final answer by:

\hat{y} = \mathrm{softmax}(W_{\mathrm{DA}} u^{K+1}) \qquad (13)

where ŷ is the predicted DA distribution, W_DA ∈ R^{|Y|×d} is a parameter matrix for the model to learn, and K = 1 is the total number of hops. This is equivalent to classifying link and DA independently at each time step t, without taking transitions between DA labels into account.

[3] Note that a mistake was found in the results in the original paper (Wang et al., 2011), and we use the corrected results from Wang (2014).

4.4 Results

The experimental results are presented in Table 1, wherein the first three rows are the three baseline systems.

Model       Link   DA     Joint
CRF_KIM     86.3   75.1   —
CRF_WANG    82.3   73.4   66.5
DEPPARSER   85.0   75.7   70.6
MEMNET      85.8   76.0   69.5
ME-CRF      86.4   77.5   70.9
ME-CRF+     86.3   77.4   71.2

Table 1: Post-level Link and DA F-scores. Performance for ME-CRF and ME-CRF+ is macro-averaged over 5 runs.

State-of-the-art post-level results. ME-CRFs achieve state-of-the-art results in terms of joint post-level F-score, substantially better than the baselines. While ME-CRF slightly outperforms the current state of the art (DEPPARSER), ME-CRF+ improves the performance and achieves a further 0.3% absolute gain.
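As a hedged sketch of the joint objective in Equation (12) and the ME-CRF+ curriculum schedule described above (the CRF transition matrix A is held at zero for the first e = 20 epochs), the helper names below are hypothetical rather than taken from the released code.

```python
def joint_loss(loss_da, loss_lnk, alpha=0.5):
    """Eq. (12): L' = alpha * L_DA + (1 - alpha) * L_LNK."""
    return alpha * loss_da + (1.0 - alpha) * loss_lnk

def transitions_trainable(epoch, freeze_epochs=20):
    """ME-CRF+ curriculum: keep the CRF transition matrix A fixed at 0 for the
    first freeze_epochs epochs, and only train it for the remainder of the run."""
    return epoch >= freeze_epochs

print(joint_loss(1.2, 0.8))                                   # 1.0
print([transitions_trainable(e) for e in (0, 19, 20, 49)])    # [False, False, True, True]
```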


Curriculum learning improves joint prediction. Despite the slight performance drop on the DA and link prediction tasks, ME-CRF+, with the CRF transition matrix frozen for the first 20 epochs, achieves a ∼0.3% absolute gain in joint F-score over ME-CRF. This suggests that the sequence dynamics between posts, while difficult to capture, are beneficial to the overall task (resulting in more coherent DA and link predictions) if trained with proper initialisation.

MEMNET vs. ME-CRFs. We see consistent gains across all three tasks when the CRF layer is added. Although not presented in Table 1, the difference is most notable at the thread level (i.e. a thread is correct iff all posts are tagged correctly), highlighting the importance of sequential transitional information between posts.

CRF vs. ME-CRFs. Note that CRF_KIM is not trained jointly on the two component tasks, but individually on each task. Without additional data, jointly training on the two tasks generates results that are comparable to, or substantially better than, training on the individual tasks. This highlights the effectiveness of ME-CRF, especially with the link prediction performance comparable to that of the single-task model CRF_KIM, and surpassing it in the case of DA.

[Figure 3 plot: post-level Joint F-score against post depth ([1,2], [3,4], [5,6], [7,8], [9,∞)) for the baselines and ME-CRF+]

Figure 3: Breakdown of post-level Joint F-scores by post depth, where e.g. "[1,2]" is the joint F-score over posts of depth 1–2, i.e. the first or second post in the thread. Note that we take the reported performance of CRF_WANG and DEPPARSER from Wang (2014).

[Figure 4 plots: post-level Link F-score and DA F-score against post depth]

Figure 4: Breakdown of post-level Link and DA F-scores by post depth.

4.5 Analysis

We break the performance down by the depth of each post in a given thread, and present the results in Figure 3. Although below the baselines for the interval [1,2], ME-CRF+ consistently outperforms CRF_WANG from depth 3 onwards, and is superior to DEPPARSER for depths [7,∞). Breaking the performance down further to the individual tasks of Link and DA prediction, as displayed in Figure 4, we observe a similar trend. In line with the findings in the work of Wang (2014), this confirms that prediction becomes progressively more difficult as threads grow longer, which is largely due to the increased variability in discourse structure. Despite the escalated difficulty, ME-CRF+ is substantially superior to the baselines when classifying deeper posts.

Between the CRF-based models, it is worth noting that despite the lower performance for [1,2], ME-CRF+ benefits from having global access to the entire sequence, and consistently outperforms CRF_WANG for depths [3,∞), highlighting the effectiveness of the memory mechanism. Overall, these results validate our hypothesis that having unrestricted access to the whole sequence is beneficial, especially for long-range dependencies, offering further evidence of the power of ME-CRFs.

5 Named Entity Recognition

In this section, we present experiments in a second setting: named entity recognition, over the CoNLL 2003 English NER shared task dataset (Tjong Kim Sang and De Meulder, 2003). Our interest here is in evaluating the ability of ME-CRF to capture document context, to aid in the identification and disambiguation of NEs.

5.1 Dataset and Task

The CoNLL 2003 NER shared task dataset consists of 14,041/3,250/3,453 sentences in the training/development/test set, resp., extracted from 946/216/231 Reuters news articles from the period 1996–97. The goal is to identify individual token occurrences of NEs, and tag each with its class (e.g. LOCATION or ORGANISATION). Here, we use the IOB tagging scheme: while some have reported marginal improvements with the more expressive IOBES scheme (Ratinov and Roth, 2009; Dai et al., 2015), we stick with BIO for simplicity, and in light of the observation by Lample et al. (2016) of little difference between the two schemes.

5.2 Experimental Setup

We choose Φ(x_t) to be a lookup function, returning the corresponding embedding x_t of the word x_t. In addition to the word features, we employ a subset of the lexical features described in Huang et al. (2015), based on whether the word:
• starts with a capital letter;
• is composed of all capital letters;
• is composed of all lower case letters;
• contains non-initial capital letters;
• contains both letters and digits;
• contains punctuation.
These features are all binary, and are referred to as Φ_l(x_t). Similar to the thread structure prediction experiments, we concatenate Φ_l(x_t) with x_t to generate the new input x_t' to the bi-directional GRUs in Equations (1) and (2).

In order to incorporate information in the document beyond sentence boundaries, we encode every word sequentially in a document with Φ, GRU_→ and GRU_←, and store them in the memory m_i and c_i for i ∈ [1, t'], where t' is the index of the current word t in the document.

Training is carried out with Adam, over 100 epochs with a batch size of 32. We use the following hyper-parameter settings: word embedding size = 50; hidden size of GRU_→ and GRU_← = 50; →W_m, ←W_m ∈ R^{50×50}; and b_m ∈ R^{50}. Dropout is applied to all GRU recurrent units on the input and output connections, with a keep rate of 0.8. We initialise ME-CRF with pre-trained word embeddings and keep them fixed during training. While we report results only on the test set, we use early stopping based on the development set.

5.3 Evaluation

Evaluation is based on span-level NE F-score, using the official CoNLL evaluation script.[4] We compare against the following baselines:
1. a CRF over hand-tuned lexical features ("CRF": Huang et al. (2015));
2. an LSTM and a bi-directional LSTM ("LSTM" and "BI-LSTM", resp.: Huang et al. (2015));
3. a CRF taking features from a convolutional neural network as input ("CONV-CRF": Collobert et al. (2011));
4. a CRF over the output of either a simple LSTM or a bidirectional LSTM ("LSTM-CRF" and "BI-LSTM-CRF", resp.: Huang et al. (2015)).
Note that for our word embeddings, while we observe better performance with GLOVE (Pennington et al., 2014), for fair comparison purposes, we adopt the same SENNA embeddings (Collobert et al., 2011) as are used in the baseline methods.[5]

[4] http://www.cnts.ua.ac.be/conll2000/chunking/conlleval.txt
[5] Lample et al. (2016) report a higher result of 90.9 using a BI-LSTM-CRF architecture, but augmented with skip n-grams (Ling et al., 2015) and character embeddings. Due to the differing underlying representation, we exclude it from the comparison.

5.4 Results

The experimental results are presented in Table 2. Results for the baseline methods are based on the published results of Huang et al. (2015) and Collobert et al. (2011). Note that none of the systems in Table 2 use external gazetteers, to make the comparison fair. As can be observed, ME-CRF achieves the best performance, beating all of the baselines.
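Looking back at the lexical features Φ_l(x_t) listed in Section 5.2, the following is a minimal sketch of one plausible implementation; the precise definitions (e.g. what counts as punctuation) are assumptions rather than the feature extractor of Huang et al. (2015).

```python
import string

def lexical_features(word):
    """Six binary indicators over the surface form of a word (Section 5.2)."""
    return [
        float(word[:1].isupper()),                          # starts with a capital letter
        float(word.isupper()),                              # all capital letters
        float(word.islower()),                              # all lower case letters
        float(any(c.isupper() for c in word[1:])),          # non-initial capital letters
        float(any(c.isalpha() for c in word)
              and any(c.isdigit() for c in word)),          # both letters and digits
        float(any(c in string.punctuation for c in word)),  # contains punctuation
    ]

print(lexical_features("Interfax"), lexical_features("U.N."))
```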

Model        F-score
CRF          86.1
LSTM         83.7
BI-LSTM      85.2
CONV-CRF     88.7
LSTM-CRF     88.4
BI-LSTM-CRF  88.8
ME-CRF       89.5

Table 2: NER performance on the CoNLL 2003 English NER shared task dataset.

Memory       p_{t,i}  |  Memory      p_{t,i}
...                   |  ...
Manchester   0.23     |  ,           0.00
United       0.00     |  Interfax    0.32
face         0.00     |  news        0.00
Juventus     0.65     |  agency      0.00
in           0.00     |  said        0.00
Europe       0.00     |  .           0.00
...                   |  - - - - - - - - -
                      |  Interfax    0.68
European champions    |  Interfax quoted
Juventus ...          |  Russian ...

Table 3: An NER example showing learned attention to long-range contextual dependencies. The last row is the current sentence. The underlined words indicate the target word x_t, and the dashed line indicates a sentence boundary.

To gain a better understanding of what the model has learned, Table 3 presents two examples where ME-CRF focuses on words beyond the current sentence boundaries. In the example on the left, where the target word is Juventus (an Italian soccer team), ME-CRF directs the attention mainly to the occurrence of the same word in a previous sentence, and a small fraction to Manchester (a UK soccer team, in this context). Note that it does not attend to the other NE (Europe) in that sentence, which is of a different NE class. In the example on the right, on the other hand, ME-CRF allocates attention to the same words as the target word in the current sentence. Note that the second occurrence of Interfax in the memory is the same occurrence as the first word in the current sentence. While more weight is placed on the second Interfax, close to one third of the attention is also assigned to the first occurrence. Given that the memory, m_i and c_i, is encoded with bi-directional GRUs, the first Interfax should, to some degree, capture the succeeding neighbouring elements: news agency.

This is reminiscent of label consistency in the works of Sutton and McCallum (2004) and Finkel et al. (2005), but differs in that the consistency constraint is soft, as opposed to hard in previous studies, and is automatically learned without the use of heuristics.

6 Conclusion

In this paper, we have presented ME-CRF, a model extending linear-chain CRFs by including external memory. This allows the model to look beyond neighbouring items and access long-range context. Experimental results demonstrate the effectiveness of the proposed method over two tasks: forum thread discourse analysis, and named entity recognition.

Acknowledgments

We thank the anonymous reviewers for their valuable feedback, and gratefully acknowledge the support of an Australian Government Research Training Program Scholarship and the National Computational Infrastructure (NCI Australia). This work was also supported in part by the Australian Research Council.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015). San Diego, USA.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009). Montreal, Canada, pages 41–48.

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2):157–166.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar, pages 103–111.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12:2493–2537.

Hong-Jie Dai, Po-Ting Lai, Yung-Chun Chang, and Richard Tzong-Han Tsai. 2015. Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization. Journal of Cheminformatics 7(1):S14.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005). Ann Arbor, USA, pages 363–370.

Stuart Geman and Donald Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6:721–741.

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2013). Vancouver, Canada, pages 6645–6649.

Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

Su Nam Kim, Li Wang, and Timothy Baldwin. 2010. Tagging and linking web forum posts. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning (CoNLL 2010). Uppsala, Sweden, pages 192–202.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015). San Diego, USA.

Vijay Krishnan and Christopher D. Manning. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (ACL 2006). Sydney, Australia, pages 1121–1128.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001). San Francisco, USA, pages 282–289.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI 2015). Austin, USA, volume 333, pages 2267–2273.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2016). San Diego, USA, pages 260–270.

Shasha Liao and Ralph Grishman. 2010. Using document level cross-event inference to improve event extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010). Uppsala, Sweden, pages 789–797.

Wang Ling, Yulia Tsvetkov, Silvio Amir, Ramon Fermandez, Chris Dyer, Alan W. Black, Isabel Trancoso, and Chu-Cheng Lin. 2015. Not all contexts are created equal: Better word representations with variable attention. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015). Lisbon, Portugal, pages 1367–1372.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics (TACL 2016) 4:521–535.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014). Doha, Qatar, pages 1532–1543.

Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009). Boulder, USA, pages 147–155.

Minjoon Seo, Sewon Min, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Query-reduction networks for question answering. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017). Toulon, France.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS 2015). Montréal, Canada, pages 2440–2448.

Charles Sutton and Andrew McCallum. 2004. Collective segmentation and labeling of distant entities in information extraction. Technical report, University of Massachusetts Amherst, Dept. of Computer Science.

Duyu Tang, Bing Qin, and Ting Liu. 2016. Aspect level sentiment classification with deep memory network. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016). Austin, USA, pages 214–224.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2003). Edmonton, Canada, pages 142–147.

Li Wang. 2014. Knowledge discovery and extraction of domain-specific web data. Ph.D. thesis, The University of Melbourne.

Li Wang, Marco Lui, Su Nam Kim, Joakim Nivre, and Timothy Baldwin. 2011. Predicting thread discourse structure over technical web forums. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011). Edinburgh, UK, pages 13–25.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory networks. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015). San Diego, USA.

Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016). New York, USA, pages 2397–2406.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2016). San Diego, USA, pages 1480–1489.

Amy Zhang, Bryan Culbertson, and Praveen Paritosh. 2017. Characterizing online discussion using coarse discourse sequences. In Proceedings of the 11th AAAI International Conference on Web and Social Media (ICWSM 2017). Palo Alto, USA, pages 357–366.
