
MWU-aware Part-of-Speech Tagging with a CRF model and lexical resources

Matthieu Constant, Anthony Sigogne
Université Paris-Est, LIGM
5, bd Descartes - Champs/Marne, 77454 Marne-la-Vallée cedex 2, France
[email protected], [email protected]

Abstract

This paper describes a new part-of-speech tagger including multiword unit (MWU) identification. It is based on a Conditional Random Field model integrating language-independent features, as well as features computed from external lexical resources. It was implemented in a finite-state framework composed of a preliminary finite-state lexical analysis and a CRF decoding using weighted finite-state transducer composition. We show that our tagger reaches state-of-the-art results for French in the standard evaluation conditions (i.e. where each multiword unit is already merged in a single token). The evaluation of the tagger integrating MWU recognition clearly shows the interest of incorporating features based on MWU resources.

1 Introduction

Part-of-speech (POS) tagging reaches excellent results thanks to powerful discriminative multi-feature models such as Conditional Random Fields (Lafferty et al., 2001), Support Vector Machines (Giménez and Márquez, 2004) and Maximum Entropy (Ratnaparkhi, 1996). Some studies like (Denis and Sagot, 2009) have shown that featuring these models by means of external morphosyntactic resources still improves accuracy. Nevertheless, current taggers rarely take multiword units into account, whereas they form very frequent lexical units with strong syntactic and semantic particularities (Sag et al., 2001; Copestake et al., 2002) and their identification is crucial for applications requiring semantic processing. Indeed, taggers are generally evaluated on perfectly tokenized texts where multiword units (MWU) have already been identified.

Our paper presents a MWU-aware POS tagger (i.e. a POS tagger including MWU recognition [1]). It is based on a Conditional Random Field (CRF) model that integrates features computed from large-coverage morphosyntactic lexicons and fine-grained MWU resources. We implemented it in a finite-state framework composed of a finite-state lexical analyzer and a CRF decoder using weighted transducer composition.

In section 2, we first describe statistical tagging based on CRF. Then, in section 3, we show how to adapt the tagging models in order to also identify multiword units. Next, section 4 presents the finite-state framework used to implement the tagger. Section 5 focuses on the description of our working corpus and of the set of lexical resources used. In section 6, we finally evaluate our tagger on French.

[1] This strategy somewhat resembles the popular approach of joint segmentation and part-of-speech tagging for Chinese, e.g. (Zhang and Clark, 2008). Moreover, other similar experiments on the same task for French are reported in (Constant et al., 2011).


2 Statistical POS tagging with Linear Chain Conditional Random Fields

Linear chain Conditional Random Fields (CRF) are discriminative probabilistic models introduced by (Lafferty et al., 2001) for sequential labelling. Given an input sequence x = (x_1, x_2, ..., x_N) and an output sequence of labels y = (y_1, y_2, ..., y_N), the model is defined as follows:

    P(y|x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{N} \sum_{k=1}^{K} \lambda_k f_k(t, y_t, y_{t-1}, x) \right)

where Z(x) is a normalization factor depending on x. The model is based on K features, each of them defined by a binary function f_k depending on the current position t in x, the current label y_t, the preceding one y_{t-1} and the whole input sequence x. The feature is activated if a given configuration between t, y_t, y_{t-1} and x is satisfied (i.e. f_k(t, y_t, y_{t-1}, x) = 1). Each feature f_k is associated with a weight λ_k. The weights are the parameters of the model; they are estimated during the training process by maximizing the conditional log-likelihood on a set of already labeled examples (the training data). The decoding procedure consists in labelling a new input sequence with respect to the model, by maximizing P(y|x) (or, equivalently, minimizing −log P(y|x)). Dynamic programming procedures such as the Viterbi algorithm allow all labelling possibilities to be explored efficiently.

Features are defined by combining different properties of the tokens in the input sequence and the labels at the current position and the preceding one. Properties of tokens can be either binary or textual: e.g. token contains a digit, token is capitalized (binary properties); form of the token, suffix of size 2 of the token (textual properties). Most taggers exclusively use language-independent properties, e.g. (Ratnaparkhi, 1996; Toutanova et al., 2003; Giménez and Márquez, 2004; Tsuruoka et al., 2009). It is also possible to integrate language-dependent properties computed from an external broad-coverage morphosyntactic lexicon, namely the POS tags found in the lexicon for the given token, e.g. (Denis and Sagot, 2009). This is of great interest to deal with unknown words [2], as most of them are covered by the lexicon, and to somewhat filter the list of candidate tags for each token. We therefore added to our system a language-dependent property: a token is associated with the concatenation of its possible tags in an external lexicon, i.e. the ambiguity class of the token (AC).

In practice, we can divide the features f_k into two families: unigram features (u_k) do not depend on the preceding tag, i.e. f_k(t, y_t, y_{t-1}, x) = u_k(t, y_t, x), while bigram features (b_k) depend on both the current and the preceding tags, i.e. f_k(t, y_t, y_{t-1}, x) = b_k(t, y_t, y_{t-1}, x). In our practical case, bigram features exclusively depend on the two tags, i.e. they are independent from the input sequence and the current position, as in a Hidden Markov Model (HMM) [3]. Unigram features can be sub-divided into internal and contextual ones. Internal features provide solely characteristics of the current token w0: lexical form (i.e. its character sequence), lowercase form, suffix, prefix, ambiguity classes in the external lexicons, whether it contains a hyphen or a digit, whether it is capitalized, all capitalized, or multiword. Contextual features indicate characteristics of the surroundings of the current token: token unigrams at relative positions -2, -1, +1 and +2 (w-2, w-1, w+1, w+2); token bigrams w-1w0, w0w+1 and w-1w+1; ambiguity classes at relative positions -2, -1, +1 and +2 (AC-2, AC-1, AC+1, AC+2). The different feature templates used in our tagger are given in table 1.

Internal unigram features
  w0 = X                                         & t0 = T
  Lowercase form of w0 = L                       & t0 = T
  Prefix of w0 = P with |P| < 5                  & t0 = T
  Suffix of w0 = S with |S| < 5                  & t0 = T
  w0 contains a hyphen                           & t0 = T
  w0 contains a digit                            & t0 = T
  w0 is capitalized                              & t0 = T
  w0 is all capital                              & t0 = T
  w0 is capitalized and BOS                      & t0 = T
  w0 is multiword                                & t0 = T
  Lexicon tags AC0 of w0 = A & w0 is multiword   & t0 = T
Contextual unigram features
  wi = X, i ∈ {-2, -1, 1, 2}                     & t0 = T
  wiwj = XY, (i, j) ∈ {(-1, 0), (0, 1), (-1, 1)} & t0 = T
  ACi = A & wi is multiword, i ∈ {-2, -1, 1, 2}  & t0 = T
Bigram features
  t-1 = T'                                       & t0 = T

Table 1: Feature templates

[2] Unknown words are words that did not occur in the training data.
[3] Hidden Markov Models of order n use strong independence assumptions: a word only depends on its corresponding tag, and a tag only depends on its n previous tags. In our case, n = 1.
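To make the templates of table 1 concrete, here is a minimal Python sketch of the unigram feature extraction. The function and feature names are ours for illustration only (the actual system expresses these templates in the CRF++ format, cf. section 6), and each returned feature string is implicitly conjoined with the candidate tag t0 = T by the training toolkit.

```python
def unigram_features(tokens, acs, i):
    """Feature strings for position i, mirroring table 1.

    tokens -- token forms of the sentence (an MWU is one token here)
    acs    -- ambiguity class of each token, i.e. the concatenation of
              its possible tags in the external lexicon, e.g. "ADJ|NC"
    """
    w = tokens[i]
    feats = [
        "form=" + w,
        "lower=" + w.lower(),
        "hyphen=%d" % ("-" in w),
        "digit=%d" % any(c.isdigit() for c in w),
        "cap=%d" % w[0].isupper(),
        "allcap=%d" % w.isupper(),
        "cap+bos=%d" % (i == 0 and w[0].isupper()),
        "mw=%d" % (" " in w),                  # token is multiword
        "ac=" + acs[i],
    ]
    for k in range(1, min(5, len(w) + 1)):     # prefixes/suffixes, |P|, |S| < 5
        feats.append("pref=" + w[:k])
        feats.append("suff=" + w[-k:])
    n = len(tokens)
    for d in (-2, -1, 1, 2):                   # contextual unigrams and ACs
        if 0 <= i + d < n:
            feats.append("w%+d=%s" % (d, tokens[i + d]))
            feats.append("ac%+d=%s" % (d, acs[i + d]))
    for a, b in ((-1, 0), (0, 1), (-1, 1)):    # token bigrams
        if i + a >= 0 and i + b < n:
            feats.append("w%+d/%+d=%s/%s" % (a, b, tokens[i + a], tokens[i + b]))
    return feats
```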

3 MWU-aware POS tagging

MWU-aware POS tagging consists in identifying and labelling lexical units, including multiword ones. It is somewhat similar to segmentation tasks like chunking or Named Entity Recognition, which identify the limits of chunk or Named Entity segments and classify these segments. By using an IOB [5] scheme (Ramshaw and Marcus, 1995), this task is then equivalent to labelling simple tokens. Each token is labeled by a tag of the form X+B or X+I, where X is the POS labelling the lexical unit the token belongs to. Suffix B indicates that the token is at the beginning of the lexical unit; suffix I indicates an internal position. Suffix O is useless, as the end of a lexical unit corresponds to the beginning of another one (suffix B) or the end of a sentence. Such a procedure therefore determines lexical unit limits, as well as their POS.

A simple approach is to relabel the training data in the IOB scheme and to train a new model with the same feature templates. With such a method, most multiword units present in the training corpus will be recognized as such in a new text. The main issue resides in the identification of unknown multiword units. It is well known that statistically inferring new multiword units from a rather small training corpus is very hard. Most studies in the field prefer finding methods to automatically extract multiword lexicons from very large corpora, e.g. (Dias, 2003; Caseli et al., 2010), to be integrated in Natural Language Processing tools.

In order to improve the number of new multiword units detected, it is necessary to plug the tagger into multiword resources (either manually built or automatically extracted). We incorporate new features computed from such resources. The resources that we use (cf. section 5) include three exploitable features. Each MWU encoded is obligatorily assigned a part-of-speech, and optionally an internal surface structure and a semantic feature. For instance, the organization name Banque de Chine (Bank of China) is a proper noun (NPP) with the semantic feature ORG; the compound noun pouvoir d'achat (purchasing power) has a syntactic form NPN because it is composed of a noun (N), a preposition (P) and a noun (N). By applying these resources to texts, it is therefore possible to add four new properties for each token that belongs to a lexical multiword unit: the part-of-speech of the lexical multiword unit (POS), its internal structure (STRUCT), its semantic feature (SEM) and its relative position in the IOB scheme (POSITION). Table 2 shows the encoding of these properties in an example. The property extraction is performed by a longest-match context-free lookup in the resources (a sketch in Python is given at the end of this section). From these properties, we use 3 new unigram feature templates, shown in table 3: (1) one combining the MWU part-of-speech with the relative position; (2) another one depending on the internal structure and the relative position; and (3) a last one composed of the semantic feature.

FORM      POS   STRUCT   POSITION   SEM   Translation
un        -     -        O          -     a
gain      -     -        O          -     gain
de        -     -        O          -     of
pouvoir   NC    NPN      B          -     purchasing
d'        NC    NPN      I          -
achat     NC    NPN      I          -     power
de        -     -        O          -     of
celles    -     -        O          -     the ones
de        -     -        O          -     of
la        -     -        O          -     the
Banque    NPP   -        B          ORG   Bank
de        NPP   -        I          ORG   of
Chine     NPP   -        I          ORG   China

Table 2: New token properties depending on multiword resources

New internal unigram features
  POS0/POSITION0     & t0 = T
  STRUCT0/POSITION0  & t0 = T
  SEM0               & t0 = T

Table 3: New features based on the MW resources

[5] I: Inside (segment); O: Outside (segment); B: Beginning (of segment).
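As an illustration, the following Python sketch reproduces the longest-match context-free lookup and the encoding of table 2; the two-entry lexicon is a toy stand-in for the resources of section 5.

```python
MW_LEXICON = {                                  # (POS, STRUCT, SEM)
    ("pouvoir", "d'", "achat"): ("NC", "NPN", None),
    ("Banque", "de", "Chine"):  ("NPP", None, "ORG"),
}
MAX_LEN = max(len(key) for key in MW_LEXICON)

def mwu_properties(tokens):
    """Return (POS, STRUCT, SEM, POSITION) per token, longest match first."""
    props = [(None, None, None, "O")] * len(tokens)
    i = 0
    while i < len(tokens):
        for length in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            entry = MW_LEXICON.get(tuple(tokens[i:i + length]))
            if entry is not None:
                pos, struct, sem = entry
                for j in range(length):
                    props[i + j] = (pos, struct, sem, "B" if j == 0 else "I")
                i += length
                break
        else:                                   # no multiword unit starts at i
            i += 1
    return props

# mwu_properties("un gain de pouvoir d' achat".split()) tags the last three
# tokens (NC, NPN, B), (NC, NPN, I), (NC, NPN, I), as in table 2.
```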
4 A Finite-state Framework

In this section, we describe how we implemented a unified finite-state framework for our MWU-aware POS tagger. It is organized in two separate classical stages: a preliminary resource-based lexical analyzer followed by a CRF-based decoder. The lexical analyzer outputs an acyclic finite-state transducer (noted TFST) representing the candidate tagging sequences for a given input. The decoder is in charge of selecting the most probable one (i.e. the path in the TFST which has the best probability).
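For concreteness, the TFST can be viewed as a lattice over the inter-token positions of the sentence. The sketch below, with toy lexicons standing in for the resources of section 5, anticipates the lexical analysis of section 4.2 by building the candidate arcs that the decoder will later weight.

```python
SIMPLE_LEX = {"une": ["DET"], "augmentation": ["NC"], "de": ["P", "DET"]}
MW_LEX = {("par", "rapport", "au"): ["P+D"]}
TAGSET = ["ADJ", "DET", "NC", "P", "P+D"]       # fallback for unknown tokens

def build_tfst(tokens):
    """Candidate-analysis lattice: arcs (start, end, segment, tag)
    between inter-token positions 0 .. len(tokens)."""
    arcs = []
    for i, w in enumerate(tokens):
        # tag filtering: a known token only receives its lexicon tags
        for tag in SIMPLE_LEX.get(w, TAGSET):
            arcs.append((i, i + 1, w, tag))
        # segment filtering: one longer arc per multiword unit found at i
        for segment, tags in MW_LEX.items():
            if tuple(tokens[i:i + len(segment)]) == segment:
                arcs.extend((i, i + len(segment), " ".join(segment), tag)
                            for tag in tags)
    return arcs
```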

4.1 Weighted finite-state transducers

Finite-state technology is a very powerful machinery for Natural Language Processing (Mohri, 1997; Kornai, 1999; Karttunen, 2001), and in particular for POS tagging, e.g. (Roche and Schabes, 1995). It is indeed very convenient because it has simple factorized representations and interesting well-defined mathematical operations. For instance, weighted finite-state transducers (WFST) are often used to represent probabilistic models such as Hidden Markov Models. In that case, they map input sequences into output sequences associated with weights, following a probability semiring (R+, +, ×, 0, 1) or a log semiring (R ∪ {−∞, +∞}, ⊕log, +, +∞, 0) for numerical stability [6]. A WFST is a finite-state automaton in which each transition is composed of an input symbol, an output symbol and a weight. A path in a WFST is a sequence of consecutive transitions going from an initial state to a final state: it puts in binary relation an input sequence and an output sequence, with a weight that is the product of the weights of the path transitions in the probability semiring (their sum in the log semiring). Note that a finite-state transducer is a WFST with no weights. A very useful operation on WFSTs is composition (Salomaa and Soittola, 1978). Let T1 be a WFST mapping an input sequence x into an output sequence y with a weight w1(x,y), and T2 be another WFST mapping a sequence y into a sequence z with a weight w2(y,z). The composition of T1 with T2 results in a WFST T mapping x into z with a weight w1(x,y)·w2(y,z) in the probability semiring (w1(x,y) + w2(y,z) in the log semiring).

[6] A semiring K is a 5-tuple (K, ⊕, ⊗, 0̄, 1̄) where the set K is equipped with two operations ⊕ and ⊗; 0̄ and 1̄ are their respective neutral elements. The log semiring is an image of the probability semiring via the −log function.

4.2 Lexical analysis and decoding

The lexical analyzer is driven by lexical resources represented by finite-state transducers, as in (Silberztein, 2000) (cf. section 5), and generates a TFST containing the candidate analyses. Transitions of the TFST are labeled by a simple token (as input) and a POS tag (as output). This stage allows for reducing the global ambiguity of the input sentence in two different ways: (1) tag filtering, i.e. each token is only assigned its possible tags in the lexical resources; (2) segment filtering, i.e. we only keep the lexical multiword units present in the resources. This implies the use of large-coverage and fine-grained lexical resources.

The decoding stage selects the most probable path in the TFST. This implies that the TFST should be weighted by CRF-based probabilities in order to apply a shortest-path algorithm. Our weighting procedure consists in composing a WFST encoding the sentence unigram probabilities (unigram WFST) with a WFST encoding the bigram probabilities (bigram WFST). The two WFSTs are defined over the log semiring. The unigram WFST is computed from the TFST: each transition corresponds to a (x_t, y_t) pair at a given position t in the sentence x, so each transition is weighted by summing the weights of the unigram features activated at this position. In our practical case, bigram features are independent from the sentence x. The bigram WFST can therefore be constructed once and for all for the whole tagging process, in the same way as for order-1 HMM transition diagrams (Nasr and Volanschi, 2005).
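To make the decoding step concrete, here is a minimal sketch over such a lattice: weights are −log probabilities, path weights combine with + (the ⊗ of the log semiring), and a Viterbi-style shortest-path search keeps the best incoming path per (position, tag) pair. The functions unigram_w and bigram_w stand in for the CRF feature-weight sums described above; log_plus (the ⊕ of the log semiring) is shown only for completeness, as it is what composition uses to sum alternative paths (e.g. for Z(x)).

```python
import math

def log_plus(a, b):
    """oplus of the log semiring: -log(exp(-a) + exp(-b))."""
    lo, hi = min(a, b), max(a, b)
    return lo - math.log1p(math.exp(lo - hi))

def decode(arcs, n, unigram_w, bigram_w):
    """Best path through a TFST whose arcs are (start, end, segment, tag);
    unigram_w(segment, tag) and bigram_w(prev_tag, tag) return -log weights
    (prev_tag is None at the beginning of the sentence)."""
    best = [{} for _ in range(n + 1)]           # best[i][tag] = path weight
    back = [{} for _ in range(n + 1)]
    best[0][None] = 0.0
    for i in range(n):                          # arcs only go forward
        for start, end, seg, tag in arcs:
            if start != i:
                continue
            for prev, w in best[i].items():
                cand = w + unigram_w(seg, tag) + bigram_w(prev, tag)
                if cand < best[end].get(tag, math.inf):
                    best[end][tag] = cand
                    back[end][tag] = (i, prev, seg)
    tag = min(best[n], key=best[n].get)         # assumes position n is reached
    path, i = [], n
    while i > 0:
        i, prev, seg = back[i][tag]
        path.append((seg, tag))
        tag = prev
    return path[::-1]
```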
5 Linguistic resources

5.1 French Treebank

The French Treebank (FTB) is a syntactically annotated corpus [7] of 569,039 tokens (Abeillé et al., 2003). Each token can be either a punctuation marker, a number, a simple word or a multiword unit. At the POS level, it uses a tagset of 14 categories and 34 sub-categories. This tagset has been optimized to 29 tags for syntactic parsing (Crabbé and Candito, 2008) and reused as a standard in a POS tagging task (Denis and Sagot, 2009). Below is a sample of the FTB version annotated in POS.

,               PONCT   ,
soit            CC      i.e.
une             DET     a
augmentation    NC      raise
de              P       of
1,2             DET     1,2
%               NC      %
par rapport au  P+D     compared with the
mois            NC      preceding
précédent       ADJ     month

[7] It is made of journalistic texts from Le Monde newspaper.

Multiword tokens encode multiword units of different types: compound words and named entities. Compound words mainly include nominals such as acquis sociaux (social benefits), verbs such as faire face à (to face), adverbials like dans l'immédiat (right now), and prepositions such as en dehors de (beside). Some Named Entities are also encoded: organization names like Société suisse de microélectronique et d'horlogerie, family names like Strauss-Kahn, location names like Afrique du Sud (South Africa) or New York. For the purpose of our study, this corpus was divided in three parts: 80% for training (TRAIN), 10% for development (DEV) and 10% for testing (TEST).

5.2 Lexical resources

The lexical resources are composed of both morphosyntactic dictionaries and strongly lexicalized local grammars. Firstly, there are two general-language dictionaries of simple and multiword forms: DELA (Courtois, 1990; Courtois et al., 1997) and Lefff (Sagot, 2010). DELA has been developed by a team of linguists. Lefff has been automatically acquired and then manually validated; it also resulted from the merge of different lexical sources. In addition, we applied specific manually built lexicons: Prolex (Piton et al., 1999), containing toponyms, and others including organization names and first names (Martineau et al., 2009). Figures on these dictionaries are detailed in table 4.

Name           # simple forms   # MW forms
DELA           690,619          272,226
Lefff          553,140          26,311
Prolex         25,190           97,925
Organizations  772              587
First names    22,074           2,220

Table 4: Morphosyntactic dictionaries

This set of dictionaries is completed by a library of strongly lexicalized local grammars (Gross, 1997; Silberztein, 2000) that recognize different types of multiword units such as Named Entities (organization names, person names, location names, dates), locative prepositions and numerical determiners. A local grammar is a graph representing a recursive finite-state transducer which recognizes sequences belonging to an algebraic language. Practically, they describe regular grammars and, as a consequence, can be compiled into equivalent finite-state transducers. We used a library of 211 graphs, manually constructed from those available in the online library GraalWeb (Constant and Watrin, 2007).
5.3 Lexical resources vs. French Treebank

In this section, we compare the content of the resources described above with the encodings in the FTB-DEV corpus. We observed that around 97.4% of the lexical units encoded in the corpus (excluding numbers and punctuation markers) are present in our lexical resources (in particular, 97% are in the dictionaries). While 5% of the tokens are unknown (i.e. not present in the training corpus), only 1.5% of the tokens are unknown and absent from the lexical resources, which shows that 70% of unknown words are covered by our lexical resources.

The segmentation task is mainly driven by the multiword resources. Therefore, they should match as much as possible the multiword units encoded in the FTB. Nevertheless, this is practically very hard to achieve, because the definition of MWU can never be the same between different people, as there exists a continuum between compositional and non-compositional sequences. In our case, we observed that 75.5% of the multiword units in the FTB-DEV corpus are in the lexical resources (87.5% including the training lexicon). This means that 12.5% of the multiword tokens are totally unknown and, as a consequence, will hardly be recognized. Another significant issue is that many multiword units present in our resources are not encoded in the FTB. For instance, many Named Entities like dates, person names, mail addresses and complex numbers are absent. By applying our lexical resources [8] in a longest-match context-free manner with the platform Unitex (Paumier, 2011), we manually observed that 30% of the multiword units found were not considered as such in the FTB-DEV corpus.

[8] We excluded local grammars recognizing dates, person names and complex numbers.

6 Experiments and Evaluation

We firstly evaluated our system for standard tagging without MWU segmentation, comparing it with other available statistical taggers, all trained on the FTB-TRAIN corpus.

We tested the well-known TreeTagger (Schmid, 1994), based on probabilistic decision trees, as well as TnT (Brants, 2000), implementing second-order Hidden Markov Models. We also compared our system with two existing discriminative taggers: SVMTool (Giménez and Márquez, 2004), based on Support Vector Machines with language-independent features, and MElt (Denis and Sagot, 2009), based on a Maximum Entropy model also incorporating language-dependent features computed from an external lexicon. The lexicon used to train and test MElt included all the lexical resources [9] described in section 5. For our CRF-based system, we trained two models with CRF++ [10]: (a) STD, using language-independent feature templates (i.e. excluding AC-based features); (b) LEX, using all the feature templates described in table 1. We note CRF-STD and CRF-LEX the two related taggers when no preliminary lexical analysis is performed, and CRF-STD+ and CRF-LEX+ when a lexical analysis is performed. The lexical analysis in this experiment consists in assigning to each token its possible tags found in the lexical resources [11]. Tokens not found in the resources are assigned all possible tags of the tagset in order to ensure the robustness of the system. If no lexical analysis is applied, our system constructs a TFST representing all possible analyses over the tagset.

The results obtained on the TEST corpus are summed up in table 5. Column ACC indicates the tagger accuracy in percentage. We can observe that our system (CRF-LEX+) outperforms the other existing taggers, especially MElt, whose authors claimed state-of-the-art results for French. We can also notice the great interest of a lexical analysis, as CRF-STD+ reaches results similar to a MaxEnt model based on features from an external lexicon.

Tagger      Model           ACC
TnT         HMM             96.3
TreeTagger  Decision trees  96.4
SVMTool     SVM             97.2
CRF-STD     CRF             97.4
MElt        MaxEnt          97.6
CRF-STD+    CRF             97.6
CRF-LEX     CRF             97.7
CRF-LEX+    CRF             97.7

Table 5: Comparison of different taggers for French

[9] Dictionaries were all put together, as well as the result of the application of the local grammars on the corpus.
[10] CRF++ is an open-source toolkit to train and test CRF models (http://crfpp.sourceforge.net/). For training, we set the cut-off threshold for features to 2 and the C value to 1. We also used the L2 regularization algorithm.
[11] Practically, as the tagsets of the lexical resources and the FTB were different, we had to first map the tags used in the dictionaries into tags belonging to the FTB tagset.
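As an aside on the experimental setup, CRF++ (footnote 10) reads its training data as one token per line, with tab-separated columns addressed by the feature templates and the gold label in the last column, sentences being separated by a blank line. The sketch below is only indicative: the column layout is illustrative, not the exact one we used.

```python
def write_crfpp_data(sentences, path):
    """sentences: lists of (token, ambiguity_class, gold_tag) triples."""
    with open(path, "w", encoding="utf-8") as out:
        for sentence in sentences:
            for token, ac, tag in sentence:
                # columns read by the templates; the gold tag comes last
                out.write("\t".join((token, token.lower(), ac, tag)) + "\n")
            out.write("\n")                     # sentence separator
```

Training then amounts to running crf_learn with the cut-off threshold and C value given in footnote 10.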
We then evaluated our MWU-aware tagger, trained on the TRAIN corpus whose complex tokens have been decomposed into sequences of simple tokens and relabeled in the IOB representation. We used three different sets of feature templates, leading to three CRF models: CRF-STD, CRF-LEX and CRF-MWE. The first two (STD and LEX) use the same feature templates as in the previous experiment; MWE includes all the feature templates described in sections 2 and 3. CRF-MWE+ indicates that a preliminary lexical analysis is performed before applying CRF-MWE decoding. The lexical analysis is achieved by assigning all possible tags of the simple tokens found in our lexical resources, as well as adding, in the TFST, new transitions corresponding to the MWU segments found in the lexical resources. We compared the three models with a baseline and with SVMTool, both learnt on the same training corpus. The baseline is a simple context-free lookup in the training MW lexicon, after a standard CRF-based tagging with no MW segmentation. We evaluated each MWU-aware tagger on the decomposed TEST corpus and computed the f-score, combining precision and recall [12]. The results are synthesized in table 6. The SEG column shows the segmentation f-score, solely taking into account the segment limits of the identified lexical units; the TAG column also accounts for the label assigned. The first observation is that there is a general drop in performance for all taggers, which is not a surprise given the complexity of MWU recognition (97.7% for the best standard tagger vs. 94.4% for the best MWU-aware tagger). Clearly, MWU-aware taggers whose models incorporate features based on external MWU resources outperform the others. Nevertheless, the scores for the identification and the tagging of the MWUs are still rather low: 91% precision and 71% recall. We can also see that a preliminary lexical analysis slightly lowers the scores, which is due to missing MWUs in the resources and is a side effect of missing encodings in the corpus.

[12] f-score f = 2pr/(p+r), where p is precision and r is recall.
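The metric of footnote 12 can be sketched as follows, assuming gold and predicted sequences of X+B/X+I labels on the decomposed corpus: a segment counts as correct for SEG when its span matches a gold segment, and for TAG when both its span and its label match. The actual evaluation script may of course differ in details.

```python
def segments(labels):
    """Turn X+B / X+I token labels into (start, end, POS) segments."""
    segs, start, pos = [], None, None
    for i, lab in enumerate(labels):
        x, bio = lab.rsplit("+", 1)             # handles tags like "P+D+B"
        if bio == "B":
            if start is not None:
                segs.append((start, i, pos))
            start, pos = i, x
    if start is not None:
        segs.append((start, len(labels), pos))
    return segs

def f_score(gold, pred, with_tag=True):
    cut = None if with_tag else 2               # drop the POS for SEG
    g = {s[:cut] for s in segments(gold)}
    p = {s[:cut] for s in segments(pred)}
    correct = len(g & p)
    prec = correct / len(p) if p else 0.0
    rec = correct / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```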

Tagger    Model  TAG   SEG
Baseline  CRF    91.2  93.6
SVMTool   SVM    92.1  94.7
CRF-STD   CRF    93.7  95.8
CRF-LEX   CRF    93.9  95.9
CRF-MWE   CRF    94.4  96.4
CRF-MWE+  CRF    94.3  96.3

Table 6: Evaluation of MWU-aware tagging

With respect to the statistics given in section 5.3, it appears clearly that the evaluation of MWU-aware taggers is somewhat biased by the fact that the definition of the multiword units encoded in the FTB and that of the ones listed in our lexical resources are not exactly the same. Nevertheless, this evaluation, the first in this context, brings new evidence on the importance of multiword unit resources for MWU-aware tagging.

7 Conclusions and Future Work

This paper presented a new part-of-speech tagger including multiword unit identification. It is based on a CRF model integrating language-independent features, as well as features computed from external lexical resources. It was implemented in a finite-state framework composed of a preliminary finite-state lexical analysis and a CRF decoding using weighted finite-state transducer composition. The tagger is freely available under the LGPL license [13]. It allows users to incorporate their own lexicons in order to easily integrate it in their own applications. We showed that the tagger reaches state-of-the-art results for French in the standard evaluation environment (i.e. where each multiword unit is already merged in a single token). The evaluation of the tagger integrating MWU recognition clearly shows the interest of incorporating features based on MWU resources. Nevertheless, as there exist some differences between the MWU definitions of the lexical resources and of the working corpus, this first experiment requires further investigations. First of all, we could test our tagger by incorporating lexicons of MWUs automatically extracted from large raw corpora, in order to deal with the low recall. We could as well combine the lexical analyzer with a Named Entity Recognizer. Another step would be to modify the annotations of the working corpus in order to cover all MWU types and to make it more homogeneous with our definition of MWU. Another future work would be to test semi-CRF models, which are well suited for segmentation tasks.

[13] http://igm.univ-mlv.fr/~mconstan/research/software

References

A. Abeillé, L. Clément and F. Toussenel. 2003. Building a treebank for French. In A. Abeillé (ed.), Treebanks, Kluwer, Dordrecht.

T. Brants. 2000. TnT - A Statistical Part-of-Speech Tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP 2000), 224–231.

H. Caseli, C. Ramisch, M. das Graças Volpe Nunes and A. Villavicencio. 2010. Alignment-based extraction of multiword expressions. Language Resources and Evaluation, Springer, vol. 44(1), 59–77.

M. Constant, I. Tellier, D. Duchier, Y. Dupont, A. Sigogne and S. Billot. 2011. Intégrer des connaissances linguistiques dans un CRF : application à l'apprentissage d'un segmenteur-étiqueteur du français. In Actes de la Conférence sur le traitement automatique des langues naturelles (TALN'11).

M. Constant and P. Watrin. 2007. Networking Multiword Units. In Proceedings of the 6th International Conference on Natural Language Processing (GoTAL'08), Lecture Notes in Artificial Intelligence, Springer-Verlag, vol. 5221: 120–125.

A. Copestake, F. Lambeau, A. Villavicencio, F. Bond, T. Baldwin, I. A. Sag and D. Flickinger. 2002. Multiword expressions: linguistic precision and reusability. In Proceedings of the Third Conference on Language Resources and Evaluation (LREC'02), 1941–1947.

B. Courtois. 1990. Un système de dictionnaires électroniques pour les mots simples du français. Langue Française, vol. 87.

B. Courtois, M. Garrigues, G. Gross, M. Gross, R. Jung, M. Mathieu-Colas, A. Monceaux, A. Poncet-Montange, M. Silberztein and R. Vivès. 1997. Dictionnaire électronique DELAC : les mots composés binaires. Technical report, LADL, University Paris 7, vol. 56.

B. Crabbé and M.-H. Candito. 2008. Expériences d'analyse syntaxique statistique du français. In Proceedings of Traitement Automatique des Langues Naturelles (TALN 2008).

P. Denis and B. Sagot. 2009. Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort. In Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation (PACLIC 2009).

G. Dias. 2003. Multiword Unit Hybrid Extraction. In Proceedings of the Workshop on Multiword Expressions of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003), 41–49.

J. Giménez and L. Márquez. 2004. SVMTool: A general POS tagger generator based on Support Vector Machines. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04).

M. Gross. 1997. The construction of local grammars. In E. Roche and Y. Schabes (eds.), Finite-State Language Processing, The MIT Press, Cambridge, Mass., 329–352.

L. Karttunen. 2001. Applications of Finite-State Transducers in Natural Language Processing. In Proceedings of the 5th International Conference on Implementation and Application of Automata (CIAA 2000), Lecture Notes in Computer Science, vol. 2088, Springer, 34–46.

A. Kornai (ed.). 1999. Extended Finite State Models of Language. Cambridge University Press.

J. Lafferty, A. McCallum and F. Pereira. 2001. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), 282–289.

C. Martineau, T. Nakamura, L. Varga and S. Voyatzi. 2009. Annotation et normalisation des entités nommées. Arena Romanistica, vol. 4: 234–243.

M. Mohri. 1997. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2): 269–311.

A. Nasr and A. Volanschi. 2005. Integrating a POS Tagger and a Chunker Implemented as Weighted Finite State Machines. In Finite-State Methods and Natural Language Processing, Lecture Notes in Computer Science, vol. 4002, Springer, 167–178.

S. Paumier. 2011. Unitex 2.1 user manual. http://igm.univ-mlv.fr/~unitex.

O. Piton, D. Maurel and C. Belleil. 1999. The Prolex Data Base: Toponyms and gentiles for NLP. In Proceedings of the Third International Workshop on Applications of Natural Language to Data Bases (NLDB'99), 233–237.

L. A. Ramshaw and M. P. Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the 3rd Workshop on Very Large Corpora, 88–94.

A. Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 1996), 133–142.

E. Roche and Y. Schabes. 1995. Deterministic part-of-speech tagging with finite-state transducers. Computational Linguistics, MIT Press, vol. 21(2), 227–253.

I. A. Sag, T. Baldwin, F. Bond, A. Copestake and D. Flickinger. 2001. Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), 1–15.

B. Sagot. 2010. The Lefff, a freely available, accurate and large-coverage lexicon for French. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC'10).

A. Salomaa and M. Soittola. 1978. Automata-Theoretic Aspects of Formal Power Series. Springer-Verlag.

H. Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing.

M. Silberztein. 2000. INTEX: an FST toolbox. Theoretical Computer Science, vol. 231(1): 33–46.

K. Toutanova, D. Klein, C. D. Manning and Y. Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of HLT-NAACL 2003, 252–259.

Y. Tsuruoka, J. Tsujii and S. Ananiadou. 2009. Fast Full Parsing by Linear-Chain Conditional Random Fields. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2009), 790–798.

Y. Zhang and S. Clark. 2008. Joint Word Segmentation and POS Tagging Using a Single Perceptron. In Proceedings of ACL 2008, 888–896.
