
Diacritic error detection and restoration via part-of-speech tags

Jerid Francom, Måns Hulden

Wake Forest University, 323 Greene Hall, Winston-Salem, NC 27103, USA
[email protected]

University of Helsinki, P.O. Box 24 (Unioninkatu 40), FI-00014 Helsinki, Finland
[email protected]

Abstract

In this paper we address the problem of diacritic error detection and restoration—the task of identifying and correcting missing accents in text. In particular, we evaluate the performance of a simple part-of-speech tagger-based technique, comparing it to other well-established methods for error detection/restoration: unigram frequency, decision lists and grapheme-based approaches. In languages such as Spanish, the current focus, diacritics play a key role in disambiguation, and our results show that a straightforward modification to an n-gram tagger can be used to achieve good performance in diacritic error identification without resorting to any specialized machinery. Our method should be applicable to any language where diacritics distribute comparably and perform similar roles of disambiguation.

Keywords: Diacritic restoration, Part-of-speech tagging, Romance languages, Spanish

1. Introduction

Lexical disambiguation is key to developing robust natural language processing applications in a variety of domains such as grammar and spell checking (Tufiş and Ceauşu, 2008) and text-to-speech applications (Ungurean et al., 2008), among other direct applications. Effective diacritic restoration, usually a pre-processing task, is also essential to the accuracy and reliability of any subsequent text processing, and it becomes ever more important as NLP investigations are applied to 'real-world' contexts in which normalized text cannot be assumed.

The primary areas of investigation on lexical disambiguation have focused on syntactic, or part-of-speech, ambiguity (house/noun vs. house/verb) and semantic, or word sense, ambiguity (bug/insect vs. bug/small microphone). Less attention has been paid to ambiguities that arise from orthographic errors. Although diacritic markings are an important component of the orthography of many of the world's languages, they are often stripped or appear inconsistently due to technical error—OCR errors, 8-bit conversion/stripping, ill-equipped keyboards, etc.—and human error—native speaker errors related to level of formality and/or orthographic knowledge. For example, email communication in Spanish more often than not lacks systematic diacritic marking, most probably for convenience. Also, of all the lexical errors made by high school and early college students in Spain, orthographic accentuation errors are the most common (Paredes, 1999).

Whereas identified diacritic errors can be restored quite trivially in Spanish—a language in which all words carry maximally one diacritic mark and in which we rarely find more than two-way ambiguous words in the lexicon—reliable detection of diacritic errors is less straightforward. On the one hand, stripping diacritics from some words produces unambiguous non-words such as *segun/según (according to) or *quizas/quizás (perhaps), which can be resolved easily with access to a large lexicon and/or morphological analysis resources. On the other hand, stripping diacritics from other words produces semantic or syntactic ambiguity, as in the pairs hablo[1p/pres]/habló[3p/past] and hablara[3p/past/subj]/hablará[3p/fut] for the verb hablar (to speak), the noun pair secretaria/secretaría (secretary vs. secretariat), the demonstrative/pronoun pair estas/éstas (these), or the demonstrative/verb pair estas/estás (these vs. are[2p/pres])—real-world, context-dependent errors which are much more difficult to reliably identify.

In this paper we focus on a novel, potentially informative angle on the problem of diacritic error detection, which leverages the observation that the disambiguation a modern HMM part-of-speech tagger performs can be extended to give a judgment on whether a particular word is correctly diacriticized. For our main target language, Spanish, the correction itself is a relatively straightforward task, as almost all words have maximally one diacritic mark and we rarely find more than two-way ambiguous words in the lexicon. Hence, in the following, we shall focus on the identification of incorrectly diacriticized words and assume the availability of auxiliary techniques to perform the correction.

2. Prior work

There are three key studies relevant to the current investigation, spanning grapheme-, word- and tag-based approaches to diacritic error detection/restoration. First, Yarowsky (Yarowsky, 1994; Yarowsky, 1999) advocates a decision-list approach combining simple word form frequency, morphological regularity and collocational information to restore diacritics.[1] This decision-list implementation is motivated by the observation that the individual strategies in the decision list show complementary strengths and weaknesses. Results from Yarowsky's study, looking at Spanish in particular, show high levels of accuracy for the individual strategies and even higher levels when the methods are combined. Of note, these impressive results may, in part, be enhanced by the fact that the training and evaluation data are drawn from the same genre/register and most likely show a homogeneity that cannot be expected in all training/evaluation settings. Moreover, the strategies employed require fairly large training sets of orthographically correct(ed) text—a characteristic that potentially limits the application of these methods to diacritic restoration for less-resourced languages in which correctly marked text is lacking.

[1] The suffix-based approach described by Yarowsky (Yarowsky, 1994; Yarowsky, 1999) is not considered here, as morphological form merely serves as a proxy for part-of-speech category, the primary variable of the current POS-tag approach.

A second approach to diacritic restoration explored in the literature, partly motivated by the resource-scarcity problem (Scannell, 2007), hypothesizes that the local graphemic distribution can provide sufficient information to cue effective diacritic restoration, even from small training sets (Mihalcea, 2002). An implementation of a memory-based learning (MBL) algorithm (De Pauw et al., 2007) reports mixed restoration accuracy rates for a variety of African, Asian and European languages, including the Romance language French, using a grapheme-based learning approach. In the authors' assessment, memory-based restoration effectiveness hinges on LexDiff, a metric of orthographic ambiguity calculated as the ratio of the number of unique latinized forms to the total number of correctly marked forms. However, it is not clear how robust the technique is, as reported performance varies across languages matched for LexDiff at various training corpus sizes.

Most closely related to the current work, Simard and Deslauriers (Simard and Deslauriers, 2001) present a POS-tagger-based approach to the restoration of French diacritics. In essence, the proposed solution is to try different candidate diacritizations of an input sentence and use a Hidden Markov Model (HMM) tagger to produce a probability score for each candidate restoration. The combination of diacritics with the highest likelihood, as judged by the POS tagger, is finally chosen as the correct restoration. As in the current approach, POS-tag information is harnessed to provide cues for diacritic error detection. However, this POS-candidate approach requires more machinery and is somewhat more involved than the current POS-tag approach, which, all else being equal, would be preferred on implementation grounds.

3. Current proposal

The core idea behind our POS-based method is to train an n-gram tagger on a tagged corpus which is augmented with information about diacritic placement. More concretely, given a corpus C, we divide the corpus into two copies C1 and C2—one with diacritics stripped, and another with the correct diacritic placement intact. For each word/tag pair in the stripped corpus, we then augment the corresponding tag with information about whether the word is correctly diacriticized (by adding the sequence BAD or OK to the original tag). Naturally, only some words in the stripped corpus will be marked BAD, while the rest will all carry the OK tag. See Table 1 for an illustration with the hypothetical sentence 'According to them, (he/she/you[formal]) (did) not buy anything there' from both corpora.[2] From this new corpus, which is twice the size of the original corpus, we train a Hidden Markov Model tagger in the usual way.

[2] The widely-adopted EAGLES tagset for Spanish is used for part-of-speech annotation (see Freeling: http://nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html).

    (1)                        (2)
    Según    SPS00-OK          Segun    SPS00-BAD
    ellos    PP3MP000-OK       ellos    PP3MP000-OK
    no       RN-OK             no       RN-OK
    compró   VMIS3S0-OK        compro   VMIS3S0-BAD
    nada     RG-OK             nada     RG-OK
    allí     RG-OK             alli     RG-BAD
    .        Fp-OK             .        Fp-OK

Table 1: Example of the generation of a diacritic-stripped variant corpus for training, from original sentence (1). Tags are simply augmented with information about whether words are correctly diacriticized.

After the tagger is trained on this corpus, we have in effect produced a POS tagger that not only is able to tag words whose diacritics are missing, but that also marks each word as having either correct or incorrect accent marks. That is, the output of the tagger looks like the two columns in Table 1.

The motivation behind such an approach is threefold. First, there is the obvious direct connection between part-of-speech and diacritic-induced ambiguity: some words are ambiguous entirely along these lines (completo [N] (complete) / completó [V] (completed), esta [PRON] (this) / está [V] (is), etc.). Secondly, as has been noted in the literature on decision lists, POS sequences themselves are very good indicators of correct accent placement, even when the ambiguity is not distinguishable by local POS class alone: for example, the sequence (1) PREP (2) que (3) ...-ara is a very strong clue that word (3) is in the subjunctive mood and thus diacriticized -ara, as opposed to the future -ará (para que terminara / comprara / empezara, etc.). Contrariwise, the sequence (1) NOUN (2) que (3) ...-ara is usually indicative of the opposite choice, the future tense (cosa que terminará / acabará, etc.).[3] Thirdly, assuming the tagset is fine-grained enough—distinguishing between person and tense of verbs, for instance—we can evaluate common diacritic placement errors that are distinguishable along those lines: hablo [1p/pres] vs. habló [3p/past] (to speak), toco [1p/pres] vs. tocó [3p/past] (to touch), etc.

[3] We assume here that the relevant n-gram tagger has some suffix-based mechanism for guessing the parts-of-speech of unknown words, as is generally the case with better-performing taggers such as TnT (Brants, 2000) or HunPos (Halácsy et al., 2007).

It should be noted that such discriminative power can also be obtained by first part-of-speech tagging a text (keeping in mind that the tagger needs to be able to handle incorrectly diacriticized input), and then applying a separate decision list trained to identify such sequences as described above. However, by including the information about correct accent placement in the tags themselves, we can produce equally effective results without the decision list.

As the output of the system does not tell us—if there is an incorrectly accented word—how the correct word should read, further processing is necessary: by examining the augmented POS tag we only know whether a word is correctly or incorrectly diacriticized. The task of actually restoring the diacritics may require further methods that depend on the language and the types of ambiguity it contains. In the case of Spanish, one can restore most words correctly even without access to a dictionary. Since Spanish stress placement is predictable in the absence of diacritics, one can rule out a number of implausible corrections using only knowledge of Spanish stress rules and ambiguity types (such as the -ara/-ará and -o/-ó pairs mentioned above). Having access to a simple dictionary or morphological analyzer would resolve the restoration problem almost perfectly, because almost all words in Spanish are maximally two-way ambiguous in diacritization; only a few exceptions such as esta/está/ésta exist. However, more elaborate treatment would be required for other languages, such as French, where diacritization behaves differently and is less predictable.

It is worth noting that while a standard trigram tagger would not be expected to yield accuracies over 97% on a plain POS tagging task, the diacritic restoration task is much simpler, and we can expect to exceed POS tagging accuracies by a fair margin.
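The corpus-augmentation step just described can be sketched as follows. This is a minimal illustration rather than the authors' actual pipeline: the function names are ours, and ñ is deliberately preserved, since it is a distinct letter rather than a stress mark. The OK/BAD tag convention follows Table 1.

```python
import unicodedata

def strip_diacritics(word):
    """Remove combining accent marks (e.g. 'compró' -> 'compro').
    The letter ñ is kept intact: it is not a stress-marking diacritic."""
    out = []
    for ch in word:
        if ch in "ñÑ":
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)

def augment(tagged_sentence):
    """Given one sentence as (word, tag) pairs, return the two training
    variants: the intact copy (all tags marked OK) and the stripped copy,
    in which any word that lost a diacritic is marked BAD."""
    intact = [(w, t + "-OK") for w, t in tagged_sentence]
    stripped = []
    for w, t in tagged_sentence:
        s = strip_diacritics(w)
        stripped.append((s, t + ("-BAD" if s != w else "-OK")))
    return intact, stripped

# Sentence (1) of Table 1:
sent = [("Según", "SPS00"), ("ellos", "PP3MP000"), ("no", "RN"),
        ("compró", "VMIS3S0"), ("nada", "RG"), ("allí", "RG"), (".", "Fp")]
intact_copy, stripped_copy = augment(sent)
```

Concatenating the two returned variants over a whole corpus yields the doubled training corpus described above; for the example sentence, the two copies reproduce columns (1) and (2) of Table 1.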
4. Evaluation

4.1. Alternative restoration methods

To evaluate the POS-based method, we compare its performance to a baseline frequency model, a decision list model and a grapheme-based algorithm, each documented in the previous literature and implemented as follows.

As a baseline model we have chosen the simple approach of choosing the most frequent orthographic word form whenever a form is ambiguous. In the case that an input word is already diacriticized, we do not remove accent marks, working on the assumption that anything diacriticized is reliable. This baseline has been shown to be quite difficult to improve upon substantially for many languages, including Spanish (Yarowsky, 1994) and French (Simard and Deslauriers, 2001).

As a second alternative, we have implemented the decision list strategy described in Yarowsky (Yarowsky, 1994). The method involves collecting collocational information for each ambiguous word in a corpus, using pre-specified contexts, and then creating a decision list from the collocation counts. The decision lists are arranged so that the most reliable collocations are applied first. The types of collocations considered in the learning task are the following:

• Word to the left (-1w)
• Word to the right (+1w)
• The previous two words (-2w,-1w)
• The following two words (+1w,+2w)
• Any word in a ±20 word window (±20w)

Once the collocation counts are collected from a corpus, the decision list is sorted by log-likelihood ratio so that more reliable rules are applied before less reliable ones. In the absence of any applicable rule, a 'default' rule chooses the most frequent accent marking for ambiguous words.

As an example of the types of rules output by the decision list method, one highly-ranked rule from the induced rule set reads as follows:

    inicio (-1w,se) => inició    4.11087386417331

That is, inicio should be corrected to inició if preceded by the word se.

For comparison, we have also evaluated the performance of a character-based restoration algorithm, available directly through the Charlifter program (Scannell, 2011) and based on the description in Wagacha et al. (Wagacha et al., 2006).

4.2. Training and evaluation data sets

In order to evaluate the performance of these various methods and contrast them with our current POS-tag approach, we trained each on the same data set and then evaluated each resulting model on three separate evaluation sets constituting varying types of diacritic error detection challenges. Our training data set was the POS-tagged portion of the Gigaword Corpus which consists of AFP newswire text (Graff, 2011), henceforth AFP. In total, it contains over 1.2 million word tokens (1,202,339). The AFP training set includes a somewhat simplified part-of-speech tag set that provides major grammatical category information—key for the POS-tag error detection algorithm.

The evaluation sets include a small subsection of the AFP data not included in the training set (23,078 tokens) and a subsection of a film dialogue corpus, currently under development, based on dialogue drawn from Spanish films (29,191 word tokens). The film data was selected from the larger film dialogue corpus because, in the original data acquired on the web, diacritic markings were only partially correct and reflect errors typically made by native speakers. The original film data set was hand-corrected by the authors in order to produce a gold-standard version for evaluation purposes. Finally, diacritic-stripped versions of both the AFP and film evaluation data sets were created, and the original, partially-correct film data set was retained and included in the evaluation as a measure of the effectiveness of these restoration strategies in a real-world test case.

4.3. POS tagger method details

To apply the method proposed in this paper, we used the freely available HunPos tagger (Halácsy et al., 2007). First, the AFP corpus was augmented with the tags that mark stress placement correctness as described above, after which a trigram tagger was trained using the default options.[4] This tagger was then used to provide information about whether a word in the test set was correctly or incorrectly stressed—i.e. the output of the tagger is simply a sequence of tags, where each tag also indicates correctness, as in Table 1 above.

[4] The default options train a trigram tagger that also builds a separate suffix tree to help tag previously unseen words. This obviously has great bearing on the ability to generalize to errors in unseen words, as many competing diacritizations differ in the suffixes of words, as in the -ara/-ará subjunctive vs. future example given above. The HunPos tagger, unlike standard trigram taggers, which condition POS-word pairs only on the previous two tags, also conditions its tagging choice on the previous word.
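Once the tagger has been run, error detection reduces to reading off the augmented tags. The sketch below is our own illustration, not part of HunPos: it assumes the tagger output has already been parsed into (word, tag) pairs, with the -OK/-BAD suffix convention of Table 1.

```python
def flag_errors(tagged_tokens):
    """Given tagger output as (word, augmented_tag) pairs, return the
    0-based positions of words judged incorrectly diacriticized, along
    with the plain POS tag (the -OK/-BAD suffix removed)."""
    flagged = []
    for i, (word, tag) in enumerate(tagged_tokens):
        pos, _, verdict = tag.rpartition("-")
        if verdict == "BAD":
            flagged.append((i, word, pos))
    return flagged

# Sentence (2) of Table 1, as the trained model would tag it:
output = [("Segun", "SPS00-BAD"), ("ellos", "PP3MP000-OK"), ("no", "RN-OK"),
          ("compro", "VMIS3S0-BAD"), ("nada", "RG-OK"), ("alli", "RG-BAD"),
          (".", "Fp-OK")]
errors = flag_errors(output)  # positions 0, 3 and 5 need restoration
```

The flagged positions are then handed to whatever restoration component is available (stress rules, a dictionary lookup, or a morphological analyzer, as discussed in Section 3).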
5. Results

We applied the trained models from each of the four detection/restoration methods (baseline, Charlifter, decision list and POS) to each of the three evaluation data sets (AFP, Film and Film (original)) and compared the restored versions to the normalized versions, reporting overall accuracy as a percentage.

    Method          Evaluation        Accuracy   Improvement
    Baseline        Film              91.97%     5.31%
                    Film (original)   95.27%     0.05%
                    AFP               98.79%     7.98%
    Charlifter      Film              92.03%     5.37%
                    Film (original)   92.03%     -3.19%
                    AFP               98.54%     9.75%
    Decision List   Film              93.26%     6.6%
                    Film (original)   95.88%     0.66%
                    AFP               98.92%     10.13%
    POS             Film              93.27%     6.61%
                    Film (original)   95.57%     0.35%
                    AFP               99.05%     10.26%

Table 2: Accuracy scores for all restoration methods and evaluation sets.

Base scores for the three evaluation sets differ: the original film data already starts at over 95% accuracy; as expected, the stripped film set shows the most diacritic ambiguity, with only 87% of all words correctly stressed; and the AFP evaluation set starts closer to 89% accuracy, as seen in Table 2. Applying simple unigram frequency as a baseline, accuracy rates show relatively high improvement scores for the stripped data sets (AFP: 7.98%, film: 5.31%). In fact, the improvement is quite dramatic for the AFP evaluation set, which reaches 98.79% accuracy with the baseline strategy, whereas the partially correct film data improves only minimally over its starting base score.

Comparing the other three methods to the baseline scores, we see that both the decision list and POS approaches match or improve on baseline accuracy, whereas Charlifter is at or below baseline accuracy overall. When evaluated on the stripped AFP and film data sets, Charlifter fares somewhat better, in particular on the AFP evaluation set. Its performance on the original, partially-correct film data, however, actually increases the number of orthographic errors (-3.19%), suggesting that the language contained in the AFP training set fails to inform the grapheme-based algorithm precisely where residual human-made errors occur.

For the decision list and POS-tag methods, overall accuracy scores are quite similar. Performance differences are negligible on the stripped film set, at 93.26% and 93.27% respectively, whereas on the AFP data set the POS method (99.05%) shows a slight advantage over the decision list (98.92%), with the opposite being true for the original film data, a difference of less than 1% (0.31%).

5.1. POS-method sensitivity to POS-tag granularity

The findings from this set of experiments suggest that, minimally, the POS-tag approach achieves accuracy levels similar to those of the decision list approach on a large training data set with a relatively shallow (AFP) POS tag set. As an exploratory step to assess the extent to which part-of-speech information can be leveraged from training data with richer POS tag specifications, the same POS-tag implementation (POS) was trained on the ANCORA corpus (Taulé et al., 2008), a manually corrected part-of-speech tagged corpus of Spanish drawn from newswire text and consisting of roughly 100k word tokens, and evaluated on the same three sets as in the previous round of experiments.

    Training   Evaluation        Accuracy   Improvement
    AFP        Film              93.27%     6.61%
               Film (original)   95.57%     0.35%
               AFP               99.05%     10.26%
    ANCORA     Film              94.53%     7.87%
               Film (original)   96.89%     1.67%
               AFP               97.4%      8.61%

Table 3: Accuracy and improvement scores for the POS-restoration method trained on the AFP and ANCORA data sets.
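The accuracy and improvement figures reported in the tables can be computed with a small helper. This is our own sketch, under one assumption: 'improvement' is taken to be restored accuracy minus the accuracy of the unrestored input against the gold standard, consistent with the base scores discussed above.

```python
def accuracy(hypothesis, gold):
    """Token-level accuracy of a token sequence against the gold standard."""
    assert len(hypothesis) == len(gold)
    return sum(h == g for h, g in zip(hypothesis, gold)) / len(gold)

def improvement(raw, restored, gold):
    """Gain of the restoration over the unrestored (raw) input,
    as a difference in token-level accuracy."""
    return accuracy(restored, gold) - accuracy(raw, gold)

# Toy example: a stripped input, a partially restored output, and the gold text.
raw      = ["segun", "ellos", "no", "compro", "nada", "alli", "."]
restored = ["Según", "ellos", "no", "compró", "nada", "alli", "."]
gold     = ["Según", "ellos", "no", "compró", "nada", "allí", "."]
```

Multiplying both quantities by 100 gives the percentage figures used in Tables 2 and 3.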

Notably, the annotation of the ANCORA corpus is much more fine-grained (including tense and person information missing from AFP), and thus one expects better results on the error detection/correction task. Naturally, this advantage will be somewhat neutralized by the small size of the training data (100k tokens vs. 1.2 million tokens in AFP).

Comparing the accuracy and improvement results (Table 3) of the AFP- and ANCORA-trained POS-based restorers, we see that while the AFP training/evaluation scores show very high improvement (10.26%) and accuracy (99.05%), the POS-tagger restorer trained on ANCORA outperforms the AFP-trained restorer in accuracy and improvement on the film data sets by 1.29%. This latter result is informative: the lower performance on the AFP test set is to be expected because of a genre mismatch—naturally, a restoration strategy trained on a separate part of the same corpus on which it is evaluated will yield better results—but the better performance on the 'real-world' film corpus restoration task is directly attributable to the more fine-grained nature of the POS tags in the ANCORA corpus.

6. Discussion and further work

As the results show, leveraging a POS tagger to provide judgments on the correctness of accentuation marks and diacritics is a strategy that appears to be competitive with other approaches to diacritic restoration. One naturally expects the best results when targeting a language that has a high correlation between different diacritizations and the information provided by a POS tagger. In this respect, Spanish (and potentially other Romance languages, including French, Italian, and Portuguese) appears well suited to this approach. The foremost advantage of the method is that it is very simple to apply, given access to a tagged corpus and a POS tagger. Also, the size of the language model induced from the training data may be substantially smaller: our POS-tagger model occupied 10 MB, while the decision list induced from the same data set occupied 257 MB. The adaptability of the method to other languages warrants further testing. Similarly, combinations of decision-list and POS-tagger-based methods may be profitable, if the resources are available.

Additionally, many of the remaining errors made by both the POS-tagger-based method and the decision list method are easily disambiguated by humans using strictly local contextual information. In other words, despite a fairly large training set, many 'obvious' generalizations remain unseen in the training data. This suggests that a hybrid method based on a set of hand-written rules may be profitably combined with the existing methods.

Also, the POS-based method presented here appears to generalize better across genres than a decision-list-based method. This robustness is seen in the second experiment, where training on a very small tagged corpus with a fine-grained tagset outperformed the other methods when applied to the film dialogue restoration task. This flexibility is useful in that many of the scenarios where diacritic error detection is desirable involve a genre or sublanguage that is unpredictable and for which little genre-specific training data is available.

7. References

Brants, T., 2000. TnT: a statistical part-of-speech tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing. Association for Computational Linguistics.

De Pauw, Guy, Peter W. Wagacha, and Gilles-Maurice de Schryver, 2007. Automatic diacritic restoration for resource-scarce languages. Text, Speech and Dialogue:170–179.

Graff, David, 2011. Spanish Gigaword Third Edition (LDC2011T12). Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA.

Halácsy, P., A. Kornai, and C. Oravecz, 2007. HunPos: an open source trigram tagger. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics.

Mihalcea, Rada F., 2002. Diacritics restoration: Learning from letters versus learning from words. Computational Linguistics and Intelligent Text Processing.

Paredes, Florentino, 1999. La ortografía en las encuestas de disponibilidad léxica. Reale, 11:75–97.

Scannell, Kevin P., 2007. The Crúbadán Project: Corpus building for under-resourced languages. In Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop, incorporating Cleaneval:5.

Scannell, Kevin P., 2011. Statistical unicodification of African languages. Language Resources and Evaluation.

Simard, Michel and Alexandre Deslauriers, 2001. Real-time automatic insertion of accents in French text. Natural Language Engineering, 7(02).

Taulé, M., M.A. Martí, and M. Recasens, 2008. AnCora: Multilevel annotated corpora for Catalan and Spanish. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC-2008).

Tufiş, Dan and Alexandru Ceauşu, 2008. DIAC+: A professional diacritics recovering system. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC).

Ungurean, Cătălin, Dragoş Burileanu, Vladimir Popescu, Cristian Negrescu, and Aurelian Dervis, 2008. Automatic diacritic restoration for a TTS-based e-mail reader application. Bulletin, Series C, 70:3–12.

Wagacha, Peter W., Guy De Pauw, and Pauline W. Githinji, 2006. A grapheme-based approach for accent restoration in Gĩkũyũ. In Proceedings of the Fifth International Conference on Language Resources and Evaluation:1937–1940.

Yarowsky, David, 1994. Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics:88–95.

Yarowsky, David, 1999. Corpus-based techniques for restoring accents in Spanish and French text. In Natural Language Processing Using Very Large Corpora.