Using Constraint Grammar for Retokenization

Eckhard Bick
University of Southern Denmark
[email protected]

Proceedings of the NoDaLiDa 2017 Workshop on Constraint Grammar - Methods, Tools and Applications

Abstract

This paper presents a Constraint Grammar-based method for changing the tokenization of existing annotated data, establishing standard space-based ("atomic") tokenization for corpora that otherwise use MWE fusion and contraction splitting for the sake of syntactic transparency or for semantic reasons. Our method preserves ingoing and outgoing dependency arcs and allows the addition of internal tags and structure for MWEs. We discuss rule examples and evaluate the method against both a Portuguese treebank and live news text annotation.

1 Introduction

In an NLP framework, tokenization can be defined as the identification of the smallest meaningful lexical units in running text. Tokens can be words, symbols or numerical expressions, but there is no general consensus on what constitutes a token boundary. For instance, are "instead of" or "Peter Madsen" 1 or 2 tokens? Should German "z. B." (for example) be 2 tokens and English "e.g." 1 token, just because the former contains a space? What about a word that allows an optional space (insofar as vs. in so far as)? Far from being a merely theoretical issue, tokenization conventions strongly influence annotation schemes and results (e.g. Grefenstette & Tapanainen 1994). Thus, contextual rules become simple (and therefore safer) when faced with single-token names, conjunctions and prepositions rather than complex ones. Conversely, contractions such as Portuguese "na" (= em [in] + a [the]) can only be assigned a meaningful syntactic analysis when split into multiple tokens, in this case allowing the second part (the article) to become part of a separate np.

Tokenization is often regarded as a necessary evil best treated by a preprocessor with an abbreviation list, but it has also been the subject of methodological research, e.g. related to finite-state transducers (Kaplan 2005). However, there is little research into changing the tokenization of a corpus once it has been annotated, limiting the comparability and alignment of corpora, and the evaluation of parsers. The simplest solution to this problem is to make conflicting systems compatible by changing them into "atomic tokenization", where all spaces are treated as token boundaries, independently of syntactic or semantic concerns. This approach is widely used in the machine-learning (ML) community, e.g. for the Universal Dependencies initiative (McDonald et al. 2013). The method described in this paper can achieve such atomic tokenization of annotated treebank data without information loss, but it can also be used for grammar-based tokenization changes in ordinary annotation tasks, such as NER.
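To make the notion concrete: under atomic tokenization, every space becomes a token boundary. The sketch below is my own illustration, not part of the paper's CG toolchain, and it assumes the VISL-style convention (used by the rules later in this paper) of marking MWE-internal spaces with "=":

```python
# Minimal sketch of atomic (space-based) retokenization for fused MWEs.
# Assumption: MWE-internal spaces are marked with "=" in the input,
# as in the VISL-style tokens targeted by the rules in section 3.

def atomic_tokens(tokens):
    """Split each fused MWE token into its space-separated ("=") parts."""
    out = []
    for tok in tokens:
        out.extend(tok.split("="))
    return out

print(atomic_tokens(["Presidente", "do", "Conselho=de=Administração"]))
# ['Presidente', 'do', 'Conselho', 'de', 'Administração']
```

Note that split contractions move in the opposite direction: since "do" contains no space, atomic tokenization requires fusing its treebank parts (de + o) back into a single token (cf. section 3.2).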
2 Retokenization challenges

What exactly atomic (space-based) retokenization implies is language-dependent, and it may involve both splitting and fusion of tokens, for fused multi-word expressions (MWEs) and split contractions, respectively. While the former, not least for NER, is a universal issue, the latter is rare in Germanic languages (e.g. aren't, won't), but common in Romance languages. In both cases, the retokenization method should conserve existing information, i.e. MWE boundaries in one case, and the morphosyntactic tags of contraction parts in the other. Linguistically, token-splitting is the bigger problem, because it needs added information: (a) partial POS tags, (b) additional internal dependency links and (c) new internal hook-up points for existing outgoing and incoming dependency links. Unlike simple tag conversion for, say, morphological features, this cannot be achieved with a conversion table.
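The three kinds of added information can be made concrete with a small data structure. The following sketch is my own illustration (the paper's actual mechanism is the CG3 rules shown in section 3), and the class and field names are invented for clarity:

```python
# Illustration of the information a token split must add, using
# Portuguese "na" = em (PRP) + a (DET) as the example. The class and
# field names are invented for this sketch; the paper itself encodes
# the same information in CG3 SPLITCOHORT rules.

from dataclasses import dataclass

@dataclass
class SplitPart:
    form: str   # surface form of the new token
    pos: str    # (a) partial POS tag for this part
    head: str   # (b) internal dependency link to a sibling part, or
                # (c) a hook-up point: "inherit" = take over the original
                # token's incoming/outgoing arcs; "np-head" = attach
                # externally (here, the article joins the following np)

na_split = [
    SplitPart("em", "PRP", head="inherit"),  # keeps the token's old arcs
    SplitPart("a",  "DET", head="np-head"),  # becomes part of a separate np
]

assert [p.pos for p in na_split] == ["PRP", "DET"]
```

A plain POS-to-POS conversion table could map tags, but it has nowhere to express the per-part links and hook-up points above, which is why a context-aware formalism is needed.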
3 CG retokenization

Our solution is based on two unique features of the CG3 compiler (Bick & Didriksen 2015). The first allows context-based insertion, deletion and substitution of cohorts (a token plus one or more readings), and was originally intended for spell- and grammar-checking. Thus, we implemented token fusion either by inserting a (new) fused token and then removing all original tokens, or by substituting a token with a larger, fused one containing the subsequent token (rules 2), then removing the latter (rule 3). The other feature introduces cohort splitting rules and was added specifically for retokenization. Such a rule can specify how to split a target token and manipulate its parts using regular expression matching (rules 1). In a separate rule field (TARGET), a regular expression identifies the token to be split.

3.1 Multi-word expressions

How an MWE is to be split obviously depends on its POS and composition. A simple case is name chains entirely made up of proper nouns. Here, (part) lemmas equal (part) tokens, and the internal structure is simply a left- (or right-) leaning dependency chain. With other word classes, however, there may be inflection and complex internal structure. The Portuguese proper noun-splitting rule (1a), for instance, breaks up TARGET named entities (NEs) of the type PROP+PRP+PROP (e.g. "(Presidente do) Conselho de Administração" [Administrative Council President]) - if necessary, iteratively. The asterisk for part 1 means that the first part inherits all tags (POS, edge label, features) from the NE as a whole, while c->p means that it also inherits incoming child (c) and outgoing parent (p) dependencies. For parts 2 and 3, independent new POS tags (PRP, PROP) and syntactic function labels (@N<, @P<) are provided. All parts receive a numbered MWE id, and the original MWE token is retained in a separate tag. Note that the new parts may themselves be MWEs, needing further splits.

(1a) SPLITCOHORT:multipart-prop (
      "<$1>"v "$1"v * c->p
      "<$2>"v "$2"v PRP @N< 2->3
      "<$3>"v "$3"v PROP @P< 3->1)
     TARGET ("<(.+?)=(aos?|às?|com|contra|d[eao]s?|em|n[ao]s?|para)=(.*)>"r PROP /\(@.*\)/r) ;

Rule (1b) targets a complex adverb, dali para diante [from here onward, from now on], performing not only a 3-way split on space, but also splitting the contraction dali (de+ali, PRP+ADV). The '*' on the first part means that it will inherit form and function tags from the MWE as a whole, and "c->p" means it will also inherit both incoming (child) and outgoing (parent) dependencies.

(1b) SPLITCOHORT:three->fourpart-adv (
      "<$1e>"v "de"v PRP VSTR:$5 1->p
      "<$2>"v "$2"v <-sam> ADV @P< 2->1
      "<$3>"v "$3"v PRP VSTR:$5 @P< 3->1
      "<$4>"v "$4"v ADV @P< c->3)
     TARGET ("<([dD])(ali)=(para)=([^=]+?)>"r ADV /\(@.*\)/r) ;

Contractions contained in a NE (do [of the_sg_m], pelas .. [by the_pl_f]) need to be split (1c), in order to be treated like other, "free" contractions in the corpus. (1c) starts with a default male singular reading which is "corrected" by (1d) into female and/or plural where necessary.

(1c) SPLITCOHORT (
      "<por>" "por" PRP @N< c->p
      "<$1>"v "o" <-sam> DET M S @>N 2->p)
     TARGET ("<pel([ao]s?)>"r) (0 PRP OR N/PROP) ;

(1d) SUBSTITUTE (M) (F)
     TARGET ("<.*[aà]s?>"r) ;
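Outside CG, the effect of a SPLITCOHORT rule like (1a) can be approximated with ordinary regular-expression matching. The sketch below is a simplified analogue of my own (the dict keys and the "inherited" placeholder are invented; real CG3 cohorts carry full readings and numeric dependency indices):

```python
# Rough Python analogue of rule (1a): match a fused PROP+PRP+PROP name
# on its "=" separators, emit three parts, let part 1 inherit the MWE's
# tags and dependencies (* c->p), and give parts 2-3 new POS and
# function labels plus internal links (2->3, 3->1).

import re

MWE = re.compile(r"(.+?)=(aos?|às?|com|contra|d[eao]s?|em|n[ao]s?|para)=(.*)")

def split_prop(token, tags):
    m = MWE.fullmatch(token)
    if not m:
        return None
    p1, p2, p3 = m.groups()
    return [
        {"form": p1, "tags": tags, "head": "inherited"},   # * c->p
        {"form": p2, "tags": ["PRP", "@N<"], "head": p3},  # 2->3
        {"form": p3, "tags": ["PROP", "@P<"], "head": p1}, # 3->1
    ]

parts = split_prop("Conselho=de=Administração", ["PROP", "@P<"])
assert [p["form"] for p in parts] == ["Conselho", "de", "Administração"]
```

Unlike this standalone sketch, the CG3 rule also renumbers all other dependency indices in the sentence automatically, which is what makes in-place retokenization of a full treebank practical.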
3.2 Contractions

Fusion of tokens does not add linguistic information, and a function tag can simply be inherited from the head token of the to-be-fused words. Still, CG rules like (2-3) are an effective option for this purpose, too, because the formalism will automatically handle the resulting dependency number adjustments for the rest of the tree, and morphophonetic changes can be addressed where necessary. Here, we use fusion rules to reassemble Portuguese contractions that were split into lemma parts in the original treebank (marked <sam-> for the first part and <-sam> for the second part). Thus, (2a) creates a compound POS for the contraction, substituting it for the preposition POS of the contraction's first part. (2b-c) then fuse the tokens "por" and "as" into "pelas", and (2d) creates a compound lemma for the contraction. (3), finally, removes the now-superfluous second-part token.

(2a) SUBSTITUTE (PRP) (PRP_DET)
     TARGET (<sam->)
     (1 (<-sam> DET)) ;

(2b) SUBSTITUTE ("<$1>"v) (VSTR:"<$1$2>")
     TARGET ("<(.*)>"r PRP_DET)
     (1 ("<(.*)>"r <-sam>)) ;

(2c) SUBSTITUTE ("<poras>"v) (VSTR:"<pelas>")
     TARGET ("<poras>"r PRP_DET) ;

(2d) SUBSTITUTE ("$1"v) (VSTR:"$1+$2")
     TARGET ("([^<]+)"r PRP_DET)
     (1 ("([^<]+)"r <-sam>)) ;

(3) REMCOHORT REPEAT (<-sam>)
    (-1 (/^PRP_.*$/r) OR (PERS_PERS)) ;
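The fusion sequence (2a)-(3) can be sketched in plain Python as follows. This is my own analogue, not the CG3 engine: combine the POS of both parts into a compound tag, concatenate the word forms, repair the morphophonetics (por + as -> pelas), build a compound lemma, and drop the second token:

```python
# Hedged sketch of contraction fusion, mirroring rules (2a)-(3).
# The dict layout is invented for illustration; CG3 operates on cohorts
# and renumbers the surrounding dependency tree automatically.

import re

def fuse_contraction(first, second):
    """first/second: dicts with 'form', 'lemma', 'pos' (the <sam->/<-sam> pair)."""
    form = first["form"] + second["form"]             # (2b) "por"+"as" = "poras"
    form = re.sub(r"^por([ao]s?)$", r"pel\1", form)   # (2c) morphophonetics: "pelas"
    return {
        "form": form,
        "lemma": first["lemma"] + "+" + second["lemma"],  # (2d) compound lemma
        "pos": first["pos"] + "_" + second["pos"],        # (2a) compound POS
    }  # (3): the caller simply discards the second token afterwards

fused = fuse_contraction(
    {"form": "por", "lemma": "por", "pos": "PRP"},
    {"form": "as", "lemma": "o", "pos": "DET"},
)
assert fused == {"form": "pelas", "lemma": "por+o", "pos": "PRP_DET"}
```

Note that fusion, unlike splitting, loses no information here: the compound POS and lemma keep both parts recoverable.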
4 Evaluation and statistics

We evaluated the CG-based retokenization method on the Portuguese Floresta Sintá(c)tica treebank (Afonso et al. 2002), a 239,899-token treebank covering the European as well as the Brazilian varieties of Portuguese. The treebank is available in both constituent and dependency formats, both adhering to the cross-language VISL annotation standard (http://visl.sdu.dk). Since the treebank's native format does not adhere to atomic tokenization, as advocated by the Universal Dependencies initiative, retokenization has become an issue for ML users of the Floresta Sintá(c)tica. Our retokenizer proved capable of addressing this problem, resolving all 8,779 MWEs in the treebank into their 21,954 parts (2.50 per MWE), and reestablishing all 15,912 contractions. The process took 33.6 seconds on a 2-core laptop, amounting to a processing speed of 7,140 words/sec.

In combination with a live parser run, on a Portuguese newstext corpus with ~1.1 million tokens, the method handled 44,826 MWEs of similar complexity (109,320 parts, 2.44 per MWE), missing out on only 273 (0.6%) MWEs. The failure rate for contractions was a negligible 0.01% (with 76,610 successful fusions).

5 Conclusions and outlook

We have shown that (CG3-level) Constraint Grammar can be used for retokenization, and that our method can establish space-based tokenization for treebanks or parser output that for syntactic or semantic reasons use different tokenization strategies. Thus, both the splitting of fused multi-word expressions and the fusion of split contractions can be handled with a high degree of accuracy. In addition to retokenization itself, the method specifically supports tree structures in dependency treebanks, preserving ingoing and outgoing dependency arcs and allowing the addition of internal tags and dependency structure for MWEs.

Apart from the treebank conversion discussed here, we hope that the technique will prove useful at various stages of NLP pipelines, supporting grammar- and context-driven tokenization after the preprocessing stage, or as post-processing for named entities, numerical expressions or compound-related spelling errors. As a specialized application, we are currently experimenting with target-language retokenization in machine translation.

References

Afonso, Susana & Eckhard Bick & Renato Haber & Diana Santos. 2002. Floresta sintá(c)tica: A treebank for Portuguese. In Proceedings of LREC'2002, Las Palmas. pp. 1698-1703. Paris: ELRA.

Bick, Eckhard & Tino Didriksen. 2015. CG-3 - Beyond Classical Constraint Grammar. In: Beáta Megyesi (ed.): Proceedings of NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania. pp. 31-39. Linköping: LiU Electronic Press. ISBN 978-91-7519-098-3.

Grefenstette, Gregory & Pasi Tapanainen. 1994. What is a word, what is a sentence? Problems of tokenization. In Proceedings of the 3rd Conference on Computational Lexicography and Text Research (COMPLEX'94), Budapest. pp. 79-87.

Kaplan, Ronald M. 2005. A method for tokenizing text. In: Festschrift in Honor of Kimmo Koskenniemi’s 60th anniversary. CSLI Publications, Stanford, CA. pp. 55-64.

McDonald, Ryan et al. 2013. Universal dependency annotation for multilingual parsing. In Proceedings of ACL 2013, Sofia. pp. 92-98.
