Locally-Contextual Nonlinear CRFs for Sequence Labeling
Locally-Contextual Nonlinear CRFs for Sequence Labeling

Harshil Shah 1  Tim Z. Xiao 1  David Barber 1 2

1 Department of Computer Science, University College London  2 Alan Turing Institute

Abstract

Linear chain conditional random fields (CRFs) combined with contextual word embeddings have achieved state of the art performance on sequence labeling tasks. In many of these tasks, the identity of the neighboring words is often the most useful contextual information when predicting the label of a given word. However, contextual embeddings are usually trained in a task-agnostic manner. This means that although they may encode information about the neighboring words, it is not guaranteed. It can therefore be beneficial to design the sequence labeling architecture to directly extract this information from the embeddings. We propose locally-contextual nonlinear CRFs for sequence labeling. Our approach directly incorporates information from the neighboring embeddings when predicting the label for a given word, and parametrizes the potential functions using deep neural networks. Our model serves as a drop-in replacement for the linear chain CRF, consistently outperforming it in our ablation study. On a variety of tasks, our results are competitive with those of the best published methods. In particular, we outperform the previous state of the art on chunking on CoNLL 2000 and named entity recognition on OntoNotes 5.0 English.

1. Introduction

In natural language processing, sequence labeling tasks involve labeling every word in a sequence of text with a linguistic tag. These tasks were traditionally performed using shallow linear models such as hidden Markov models (HMMs) (Kupiec, 1992; Bikel et al., 1999) and linear chain conditional random fields (CRFs) (Lafferty et al., 2001; McCallum & Li, 2003; Sha & Pereira, 2003). These approaches model the dependencies between adjacent word-level labels. However, when predicting the label for a given word, they do not directly incorporate information from the surrounding words in the sentence (known as ‘context’). As a result, linear chain CRFs combined with deeper models which do use such contextual information (e.g. convolutional and recurrent networks) gained popularity (Collobert et al., 2011; Graves, 2012; Huang et al., 2015; Ma & Hovy, 2016).

Recently, contextual word embeddings such as those provided by pre-trained language models have become more prominent (Peters et al., 2018; Radford et al., 2018; Akbik et al., 2018; Devlin et al., 2019). Contextual embeddings incorporate sentence-level information, therefore they can be used directly with linear chain CRFs to achieve state of the art performance on sequence labeling tasks (Yamada et al., 2020; Li et al., 2020b).

Contextual word embeddings are typically trained using a generic language modeling objective. This means that the embeddings encode contextual information which can generally be useful for a variety of tasks. However, because these embeddings are not trained for any specific downstream task, there is no guarantee that they will encode the most useful information for that task. Certain tasks such as sentiment analysis and textual entailment typically require global, semantic information about a sentence. In contrast, for sequence labeling tasks it is often the neighboring words in a sentence which are most informative when predicting the label for a given word. Although the contextual embedding of a given word may encode information about its neighboring words, this is not guaranteed. It can therefore be beneficial to design the sequence labeling architecture to directly extract this information from the embeddings (Bhattacharjee et al., 2020).

We therefore propose locally-contextual nonlinear CRFs for sequence labeling. Our approach extends the linear chain CRF in two straightforward ways.
Firstly, we directly incorporate information from the neighboring embeddings when predicting the label for a given word. This means that we no longer rely on each contextual embedding to have encoded information about the neighboring words in the sentence. Secondly, we replace the linear potential functions with deep neural networks, resulting in greater modeling flexibility. Locally-contextual nonlinear CRFs can serve as a drop-in replacement for linear chain CRFs, and they have the same computational complexity for training and inference.

We evaluate our approach on chunking, part-of-speech tagging and named entity recognition. On all tasks, our results are competitive with those of the best published methods. In particular, we outperform the previous state of the art on chunking on CoNLL 2000 and named entity recognition on OntoNotes 5.0 English. We also perform an ablation study which shows that both the local context and nonlinear potentials consistently provide improvements compared to linear chain CRFs.

2. Linear chain CRFs

Linear chain CRFs (Lafferty et al., 2001) are popular for various sequence labeling tasks. These include chunking, part-of-speech tagging and named entity recognition, all of which involve labeling every word in a sequence according to a predefined set of labels.

We denote the sequence of words x = x_1, ..., x_T and the corresponding sequence of labels y = y_1, ..., y_T. We assume that the words have been embedded using either a non-contextual or contextual embedding model; we denote the sequence of embeddings h = h_1, ..., h_T. If non-contextual embeddings are used, each embedding is a function only of the word at that time step, i.e. h_t = f^{emb−NC}(x_t). If contextual embeddings are used, each embedding is a function of the entire sentence, i.e. h_t = f_t^{emb−C}(x).

[Figure 1. The graphical model of the linear chain CRF.]

The linear chain CRF is shown graphically in Figure 1; ‘linear’ here refers to the graphical structure. The conditional distribution of the sequence of labels y given the sequence of words x is parametrized as:

p(y | x; θ) = ∏_t ψ_t η_t / Σ_y ∏_t ψ_t η_t    (1)

where θ denotes the set of model parameters. The potentials ψ_t and η_t are defined as:

ψ_t = ψ(y_{t−1}, y_t; θ)    (2)
η_t = η(y_t, h_t; θ)    (3)

The potentials are constrained to be positive and so are parametrized in log-space. Linear chain CRFs typically use log-linear potential functions:

log ψ_t = i(y_{t−1})^T A i(y_t)    (4)
log η_t = i(y_t)^T B h_t    (5)

where i(y) denotes the one-hot encoding of y, and A and B are parameters of the model. Note that ‘log-linear’ here refers to linearity in the parameters. Henceforth, we refer to linear chain CRFs simply as CRFs.
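As a concrete illustration of Equations (1)-(5), below is a minimal sketch of the log-linear potentials and of the normalizer in Equation (1), computed with the standard forward recursion. The use of PyTorch, the function names and the treatment of the first position (no transition potential into t = 1) are assumptions made for this example; it is not the authors' implementation.

```python
import torch

def crf_log_potentials(h, A, B):
    """Log-linear CRF potentials (Equations 4 and 5).

    h: (T, D) tensor of embeddings h_1..h_T
    A: (K, K) transition parameters; log psi[y_prev, y] = A[y_prev, y]
    B: (K, D) emission parameters;   log eta_t[y] = B[y] . h_t
    """
    log_psi = A          # i(y_{t-1})^T A i(y_t) simply indexes into A
    log_eta = h @ B.T    # (T, K); i(y_t)^T B h_t picks one component of B h_t
    return log_psi, log_eta

def crf_log_normalizer(log_psi, log_eta):
    """Forward algorithm: log of the denominator of Equation (1), i.e. the
    logsumexp over all label sequences of sum_t (log psi_t + log eta_t)."""
    T, K = log_eta.shape
    alpha = log_eta[0]   # assume no transition potential into the first position
    for t in range(1, T):
        # alpha_new[y] = logsumexp_{y'}(alpha[y'] + log_psi[y', y]) + log_eta[t, y]
        alpha = torch.logsumexp(alpha.unsqueeze(1) + log_psi, dim=0) + log_eta[t]
    return torch.logsumexp(alpha, dim=0)

def crf_log_likelihood(y, h, A, B):
    """log p(y | x; theta) for one sequence; y is a (T,) LongTensor of label ids."""
    log_psi, log_eta = crf_log_potentials(h, A, B)
    score = log_eta[torch.arange(len(y)), y].sum() + log_psi[y[:-1], y[1:]].sum()
    return score - crf_log_normalizer(log_psi, log_eta)
```

The recursion sums over the previous label at every step, so computing the normalizer (and hence training and Viterbi-style inference) costs O(T K²) for K labels.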
2.1. Incorporating contextual information

When predicting the label at a given time step y_t, it is often necessary to use information from words in the sentence other than only x_t; we refer to this as contextual information. For example, below are two sentences from the CoNLL 2003 named entity recognition training set (Tjong Kim Sang & De Meulder, 2003):

  One had threatened to blow it up unless it was refuelled and they were taken to [LOC London] where they intended to surrender and seek political asylum.

  [ORG St Helens] have now set their sights on taking the treble by winning the end-of-season premiership which begins with next Sunday's semifinal against [ORG London].

In each example, the word “London” (x_t) has a different label (y_t) and so it is necessary to use contextual information to make the correct prediction.

In Figure 1, we see that there is no direct path from any h_j with j ≠ t to y_t. This means that in order to use contextual information, CRFs rely on either/both of the following:

• The transition potentials ψ(y_{t−1}, y_t; θ) and ψ(y_t, y_{t+1}; θ) carrying this information from the labels at positions t − 1 and t + 1.

• The (contextual) embedding h_t encoding information about words x_j with j ≠ t.

Relying on the transition potentials may not always be effective because knowing the previous/next label is often not sufficient when labeling a given word. In the above examples, the previous/next labels indicate that they are not part of a named entity (i.e. y_{t−1} = O, y_{t+1} = O). Knowing this does not help to identify whether “London” refers to a location, an organization, or even a person's name.

In the first example, knowing that the previous word (x_{t−1}) is “to” and the next word (x_{t+1}) is “where” indicates that “London” is a location. In the second example, knowing that the previous word (x_{t−1}) is “against” indicates that “London” is a sports organization. Therefore, to label these sentences correctly, a CRF relies on the embedding h_t to encode the identities of the neighboring words.

However, contextual embedding models are usually trained in a manner that is agnostic to the downstream tasks they will be used for. Tasks such as sentiment analysis and textual entailment typically require global sentence-level context whereas sequence labeling tasks usually require local contextual information from the neighboring words in the sentence, as shown in the example above. Because different downstream tasks require different types of contextual information, it is not guaranteed that task-agnostic contextual embeddings will encode the most useful contextual information for any specific task.

[Figure 2. The graphical model of the CRF⊗.]

[Figure 3. The graphical model of the CRF⊗ combined with a bidirectional LSTM.]

We also evaluate a wider contextual window in Appendix C but find that it does not provide improved results.

Since the labels y_t are discrete, the parametrization of log ψ_t remains the same as in Equation (4). However, unlike the log-linear parametrization in Equation (5), log η_t, along with log φ_t and log ξ_t, are each parametrized by feedforward networks which take as input the embeddings h_t, h_{t−1} and h_{t+1} respectively (Peng et al., 2009):
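The equations for these networks are cut off above, so the following is only a rough sketch of one way to realize them, continuing the PyTorch example: each potential is read off as a per-label score produced by a small feedforward network applied to a single embedding (so that, e.g., log φ_t is the score a network assigns to label y_t given h_{t−1}). The class name, hidden size, activation and zero-padding at the sequence boundaries are all assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class LocalContextPotentials(nn.Module):
    """Sketch of nonlinear, locally-contextual emission potentials.

    Three separate feedforward networks score the label at step t against the
    previous, current and next embeddings (h_{t-1}, h_t, h_{t+1}), giving
    log phi_t, log eta_t and log xi_t respectively. Illustrative only."""

    def __init__(self, emb_dim, num_labels, hidden_dim=256):
        super().__init__()
        def mlp():
            return nn.Sequential(
                nn.Linear(emb_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, num_labels),
            )
        self.f_phi, self.f_eta, self.f_xi = mlp(), mlp(), mlp()

    def forward(self, h):
        """h: (T, emb_dim). Returns three (T, num_labels) matrices of log-potentials."""
        # Zero-pad the boundaries so h_{t-1} at t = 1 and h_{t+1} at t = T exist
        # (an assumed boundary treatment; others are possible).
        pad = torch.zeros(1, h.size(1), dtype=h.dtype, device=h.device)
        h_prev = torch.cat([pad, h[:-1]], dim=0)
        h_next = torch.cat([h[1:], pad], dim=0)
        log_phi = self.f_phi(h_prev)   # scores y_t from h_{t-1}
        log_eta = self.f_eta(h)        # scores y_t from h_t
        log_xi = self.f_xi(h_next)     # scores y_t from h_{t+1}
        return log_phi, log_eta, log_xi
```

In this reading, the per-step emission score in the earlier CRF sketch would become log φ_t + log η_t + log ξ_t in place of log η_t alone, while the transition term and the forward recursion are unchanged; that would keep the same training and inference complexity as the CRF, consistent with the claim made in the introduction.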