Word normalization in Twitter using finite-state transducers

Jordi Porta and José Luis Sancho
Centro de Estudios de la Real Academia Española
c/ Serrano 197-198, Madrid 28002
{porta,sancho}@rae.es

Abstract: This paper presents a linguistic approach based on weighted finite-state transducers for the lexical normalisation of Spanish Twitter messages. The system developed consists of transducers that are applied to out-of-vocabulary tokens. Transducers implement linguistic models of variation that generate sets of candidates according to a lexicon. A statistical language model is used to obtain the most probable sequence of words. The article includes a description of the components and an evaluation of the system and some of its parameters.
Keywords: Tweet Messages. Lexical Normalisation. Finite-state Transducers. Statistical Language Models.

1 Introduction

Text messaging (or texting) exhibits a considerable degree of departure from the writing norm, including spelling. There are many reasons for this deviation: the informality of the communication style, the characteristics of the input devices, etc. Although many people consider that these communication channels are "deteriorating" or even "destroying" languages, many scholars claim that even in this kind of channel communication obeys maxims and that spelling is also principled. Moreover, it seems that, in general, the processes underlying variation are not new to languages. It is under these considerations that the modelling of spelling variation, and also its normalisation, can be addressed. Normalisation of text messaging is seen as a necessary preprocessing task before applying other natural language processing tools designed for standard language varieties.

Few works dealing with Spanish text messaging can be found in the literature. To the best of our knowledge, the most relevant and recent published works are Mosquera and Moreda (2012), Pinto et al. (2012), Gómez Hidalgo, Caurcel Díaz, and Íñiguez del Río (2013) and Oliva et al. (2013).

2 Architecture and components of the system

The system has three main components that are applied sequentially: an analyser performing tokenisation and the identification of standard word forms and of other expressions like numbers, dates, etc.; a component generating word candidates for out-of-vocabulary (OOV) tokens; a statistical language model used to obtain the most likely sequence of words; and finally, a truecaser giving proper capitalisation to common words assigned to OOV tokens.

Freeling (Atserias et al., 2006), with a special configuration designed for this task, is used to tokenise the message and to identify, among other tokens, standard word forms. The generation of candidates, i.e., the confusion set of an OOV token, is performed by components inspired by other modules used to analyse words found in historical texts, where other kinds of spelling variation can be found (Porta, Sancho, and Gómez, 2013). The approach to historical variation was based on weighted finite-state transducers over the tropical semiring implementing linguistically motivated models. Some experiments were conducted in order to assess the task of assigning to old word forms their corresponding modern lemmas. For each old word, lemmas were assigned via the possible modern forms predicted by the model. Results were comparable to the results obtained with the Levenshtein distance (Levenshtein, 1966) in terms of recall, but were better in terms of accuracy, precision and F.

As for old words, the confusion set of an OOV token is generated by applying the shortest-paths algorithm to the following expression:

    W ◦ E ◦ L

where W is the automaton representing the OOV token, E is an edit transducer generating possible variations on tokens, and L is the set of target words. The composition of these three modules is performed using an on-line implementation of the efficient three-way composition algorithm of Allauzen and Mohri (2008).

3 Resources employed

In this section, the resources employed by the components of the system are described: the edit transducers, the lexical resources and the language model.

3.1 Edit transducers

We follow the classification of Crystal (2008) for texting features also present in Twitter messages. In order to deal with these features, several transducers were developed. Transducers are expressed as regular expressions and context-dependent rewrite rules of the form α → β / γ _ δ (Chomsky and Halle, 1968) that are compiled into weighted finite-state transducers using the OpenGrm Thrax tools (Tai, Skut, and Sproat, 2011).

3.1.1 Logograms and Pictograms

Some letters are found used as logograms, with a phonetic value. They are dealt with by optional rewrites altering the orthographic form of tokens:

    ReplaceLogograms = (x (→) por) ◦ (2 (→) dos) ◦ (@ (→) a|o) ◦ ...

Laughs, which are very frequent, are also considered logograms, since they represent sounds associated with actions. The multiple ways they are realised, including plurals, are easily described with regular expressions.

Pictograms like emoticons entered by means of ready-to-use icons in input devices are not treated by our system, since they are not textual representations. However, textual representations of emoticons like :DDD or xDDDDDD are recognised by regular expressions and mapped to their canonical form by means of simple transducers.

3.1.2 Initialisms, shortenings, and letter omissions

The string operations for initialisms (or acronymisation) and shortenings are difficult to model without incurring an overgeneration of candidates. For this reason, only common initialisms, e.g., sq (es que), tk (te quiero) or sa (se ha), and common shortenings, e.g., exam (examen) or nas (buenas), are listed.

For the omission of letters, several transducers are implemented. The simplest and most conservative one is a transducer introducing just one letter in any position of the token string. Consonantal writing is a special case of letter omission. This kind of writing relies on the assumption that consonants carry much more information than vowels do, which in fact is the norm in some languages, like the Semitic languages. Some rewrite rules are applied to OOV tokens in order to restore vowels:

    InsertVowels = invert(RemoveVowels)
    RemoveVowels = Vowels (→) ε

3.1.3 Standard non-standard spellings

We consider non-standard spellings standard when they are widely used. These include spellings for representing regional or informal speech, or choices sometimes conditioned by input devices, such as non-accented writing. In the case of accents and tildes, they are restored using a cascade of optional rewrite rules like the following:

    RestoreAccents = (n|ni|ny|nh (→) ñ) ◦ (a (→) á) ◦ (e (→) é) ◦ ...

Also, words containing k instead of c or qu, which appear frequently in protest writings, are standardised with simple transducers. Some other changes are made to some endings to recover the standard ending. There are complete paradigms like the following, which relates non-standard to standard endings:

    -a    -ada
    -as   -adas
    -ao   -ado
    -aos  -ados

We also consider phonetic writing as a kind of non-standard writing in which a phonetic form of a word is alphabetically and syllabically approximated. The transducers used for generating standard words from their phonetic and graphical variants are:

    DephonetiseWriting = invert(PhonographemicVariation)
    PhonographemicVariation =
        GraphemeToPhoneme ◦
        PhoneConflation ◦
        PhonemeToGrapheme ◦
        GraphemeVariation

In the previous definitions, PhoneConflation makes phonemes equivalent, as for example the IPA phonemes /ʎ/ and /ʝ/. Linguistic phenomena such as seseo and ceceo, in which several phonemes were conflated by the 16th century, still remain in spoken variants and are also reflected in texting. The GraphemeVariation transducer models, among others, the writing of ch as x, which could be due to the influence of other languages.

3.1.4 Juxtapositions

Spacing in texting is also non-standard. In the normalisation task, some OOV tokens are in fact juxtaposed words. The possible decompositions of a word into a sequence of possible words are given by shortest-paths(W ◦ SplitConjoinedWords ◦ L(␣L)+), where W is the word to be analysed, L(␣L)+ represents the valid sequences of words and SplitConjoinedWords is a transducer introducing blanks (␣) between letters and optionally undoing possible fused vowels:

    SplitConjoinedWords = invert(JoinWords)
    JoinWords = (a␣a (→) a<1>) ◦ ... ◦ (u␣u (→) u<1>) ◦ (␣ (→) ε)

Note that in the previous definition, some rules are weighted with a unit cost <1>. These costs are used by the shortest-paths algorithm as a preference mechanism to select non-fused over fused sequences when both cases are possible.

3.1.5 Other transducers

Expressive lengthening, which consists in repeating a letter in order to convey emphasis, is dealt with by means of rules removing a varying number of consecutive occurrences of the same letter. An example of a rule dealing with letter a repetitions is a (→) ε / a _. A transducer is generated for the alphabet.

Because messages are keyboarded, some errors found in words are due to letter transpositions and confusions between adjacent letters in the same row of the keyboard. These changes are also implemented with a transducer.

Finally, a Levenshtein transducer with a maximum distance of one has also been implemented.

3.2 The lexicon

The lexicon for OOV token normalisation contains mainly Spanish standard words, proper names and some frequent English words. These constitute the set of target words. We used the DRAE (RAE, 2001) as the source for Spanish standard words in the lexicon. Besides inflected forms, we have added verbal forms with clitics attached and derivative forms not found as entries in the DRAE: -mente adverbs, appreciatives, etc. The list of proper names was compiled from many sources and contains first names, surnames, aliases, cities, country names, brands, organisations, etc. Special attention was paid to hypocorisms, i.e., shorter or diminutive forms of a given name, as well as nicknames or calling names, since communication in channels such as Twitter tends to be interpersonal (or between members of a group) and affective. A list of common hypocorisms is provided to the system. For English words, we have selected the 100,000 most frequent words of the BNC (BNC, 2001).

3.3 Language model

We use a language model to decode the word graph and thus obtain the most probable word sequence. The model is estimated from a corpus of webpages compiled with Wacky (Baroni et al., 2009). The corpus contains about 11,200,000 tokens coming from about 21,000 URLs. We used as seeds the types found in the development set (about 2,500). Backed-off n-gram models, used as language models, are implemented with the OpenGrm NGram toolkit (Roark et al., 2012).

3.4 Truecasing

The restoration of case information in badly-cased text has been addressed in Lita et al. (2003) and has been included as part of the normalisation task. Part of this process, for proper names, is performed by the application of the language model to the word graph. Words at message-initial position are not always uppercased, since doing so yielded contradictory results after some experimentation. A simple heuristic is implemented to uppercase a normalisation candidate when the OOV token is also uppercased.

4 Settings and evaluation

In order to generate the confusion sets, we used two edit transducers applied in a cascade. If neither of the two is able to relate a token with a word, the token is assigned to itself.

The first transducer generates candidates according to the expansion of abbreviations, the identification of acronyms, pictograms and words which result from the following composition of edit transducers combining some of the features of texting:

    RemoveSymbols ◦
    LowerCase ◦
    Deaccent ◦
    RemoveReduplicates ◦
    ReplaceLogograms ◦
    StandardiseEndings ◦
    DephonetiseWriting ◦
    Reaccent ◦
    MixCase

The second edit transducer analyses tokens that did not receive analyses from the first editor. This second editor implements consonantal writing, typing-error recovery, approximate matching using a Levenshtein distance of one, and the splitting of juxtaposed words. In all cases, case, accents and reduplications are also considered. This second transducer makes use of an extended lexicon containing sequences of simple words.

Several experiments were conducted in order to evaluate some parameters of the system, in particular the effect of the order of the n-grams in the language model, and the effect of generating confusion sets for OOV tokens only versus generating confusion sets for all tokens. For all the experiments we used the test set provided with the tokenisation delivered by Freeling.

For the first series of experiments, tokens identified as standard words by Freeling receive the same token as analysis, and OOV tokens are analysed with the system. Recall on OOV tokens is 89.40%. Confusion set size follows a power-law distribution, with an average size of 5.48 for OOV tokens that goes down to 1.38 if we average over the rest of the tokens.¹ Precision for 2- and 4-gram language models is 78.10%, but the best result is obtained with 3-grams, with a precision of 78.25%. There are a number of non-standard forms that were wrongly recognised as in-vocabulary words because they clash with other standard words. In the second series of experiments, a confusion set is generated for each word in order to correct potentially wrong assignments. The average size of confusion sets increases to 5.56. Precision for the 2-gram language model is 78.25%, but the 3- and 4-gram models both reach a precision of 78.55%.

¹ We removed from the calculation the token mes de abril, which receives 308,017 different analyses due to the combination of multiple editions and segmentations.

From a quantitative point of view, it seems that slightly better results are obtained using a 3-gram language model and generating confusion sets not only for OOV tokens but for all the tokens in the message. In a qualitative evaluation of errors, several categories show up. The most populated categories are those having to do with case restoration and wrong decoding by the language model. Some errors are related to particularities of the DRAE, from which the lexicon was derived (dispertar or malaleche). Non-standard morphology is observed in tweets, as in derivatives (tranquileo or loquendera). Lack of abbreviation expansion is also observed (Hum). Faulty application of segmentation accounts for a few errors (mencantaba). Finally, some errors are not in our output but in the reference (Hojo).

5 Conclusions and future work

No attention has been paid to multilingualism, since the task explicitly excluded tweets from bilingual areas of Spain. However, given that not a few Spanish speakers (both in Europe and America) are bilingual or live in bilingual areas, mechanisms should be provided to deal with languages other than English to make the system more robust.

We plan to build a corpus of lexically standard tweets via the Twitter streaming API to determine whether n-grams observed in a Twitter-only corpus improve decoding or not, as a side effect of syntax being also non-standard.

Qualitative analysis of results showed that there is room for improvement experimenting with selective deactivation of items in the lexicon and further development of the segmenting module.

However, initialisms and shortenings remain features of texting that are difficult to model without causing overgeneration. Acronyms like FYQ, which corresponds to the school subject of Física y Química, are domain-specific and difficult to foresee, and therefore to have listed in the resources.

References

Allauzen, Cyril and Mehryar Mohri. 2008. 3-way composition of weighted finite-state transducers. In Proc. of the 13th Int. Conf. on Implementation and Application of Automata (CIAA-2008), pages 262–273, San Francisco, California, USA.

Atserias, Jordi, Bernardino Casas, Elisabet Comelles, Meritxell González, Lluís Padró, and Muntsa Padró. 2006. FreeLing 1.3: Syntactic and semantic services in an open-source NLP library. In Proc. of the 5th Int. Conf. on Language Resources and Evaluation (LREC-2006), pages 48–55, Genoa, Italy, May.

Baroni, Marco, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

BNC. 2001. The British National Corpus, version 2 (BNC World). Distributed by Oxford University Computing Services on behalf of the BNC Consortium. http://www.natcorp.ox.ac.uk.

Chomsky, Noam and Morris Halle. 1968. The Sound Pattern of English. Harper & Row, New York.

Crystal, David. 2008. Txtng: The Gr8 Db8. Oxford University Press.

Gómez Hidalgo, José María, Andrés Alfonso Caurcel Díaz, and Yovan Íñiguez del Río. 2013. Un método de análisis de lenguaje tipo SMS para el castellano. Linguamática, 5(1):31–39, July.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710.

Lita, Lucian Vlad, Abe Ittycheriah, Salim Roukos, and Nanda Kambhatla. 2003. tRuEcasIng. In Proc. of the 41st Annual Meeting on ACL - Volume 1, ACL '03, pages 152–159, Stroudsburg, PA, USA.

Mosquera, Alejandro and Paloma Moreda. 2012. TENOR: A lexical normalisation tool for Spanish Web 2.0 texts. In Petr Sojka, Aleš Horák, Ivan Kopeček, and Karel Pala, editors, Text, Speech and Dialogue, volume 7499 of LNCS, pages 535–542. Springer.

Oliva, J., J. I. Serrano, M. D. Del Castillo, and Á. Iglesias. 2013. A SMS normalization system integrating multiple grammatical resources. Natural Language Engineering, 19(1):121–141.

Pinto, David, Darnes Vilariño Ayala, Yuridiana Alemán, Helena Gómez, Nahun Loya, and Héctor Jiménez-Salazar. 2012. The Soundex phonetic algorithm revisited for SMS text representation. In Petr Sojka, Aleš Horák, Ivan Kopeček, and Karel Pala, editors, Text, Speech and Dialogue, volume 7499 of LNCS, pages 47–55. Springer.

Porta, Jordi, José-Luis Sancho, and Javier Gómez. 2013. Edit transducers for spelling variation in Old Spanish. In Proc. of the Workshop on Computational Historical Linguistics at NODALIDA 2013, NEALT Proc. Series 18, pages 70–79, Oslo, Norway, May 22-24.

RAE. 2001. Diccionario de la lengua española. Espasa, Madrid, 22nd edition.

Roark, Brian, Richard Sproat, Cyril Allauzen, Michael Riley, Jeffrey Sorensen, and Terry Tai. 2012. The OpenGrm open-source finite-state grammar software libraries. In Proc. of the ACL 2012 System Demonstrations, pages 61–66, Jeju Island, Korea, July.

Tai, Terry, Wojciech Skut, and Richard Sproat. 2011. Thrax: An open source grammar compiler built on OpenFst. In ASRU 2011, Waikoloa Resort, Hawaii, December.
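The candidate-generation scheme, shortest paths over W ◦ E ◦ L, can be illustrated without an FST toolkit by enumerating the strings a distance-one edit transducer E would produce from the token automaton W, and intersecting them with the lexicon L. The following Python sketch is ours, not the paper's implementation; the function names and the toy lexicon are illustrative assumptions:

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzáéíóúñü"

def edit1(token, alphabet=ALPHABET):
    """All strings at Levenshtein distance one from the token
    (deletions, substitutions, insertions), playing the role of
    applying the edit transducer E to the token automaton W."""
    splits = [(token[:i], token[i:]) for i in range(len(token) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    subs = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return deletes | subs | inserts

def confusion_set(token, lexicon):
    """Intersecting the variants with the target lexicon stands in
    for the final composition with L; the token itself is kept as a
    candidate if it happens to be a word."""
    candidates = edit1(token) | {token}
    return sorted(candidates & lexicon)

lexicon = {"que", "casa", "cosa", "teatro"}
print(confusion_set("cas", lexicon))   # ['casa']
print(confusion_set("kasa", lexicon))  # ['casa']
```

Unlike the real system, this sketch has no weights and no three-way composition; it only shows why intersecting an edit neighbourhood with a lexicon keeps overgeneration in check.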
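The InsertVowels = invert(RemoveVowels) pattern for consonantal writing can be simulated in plain Python by indexing the lexicon by consonant skeleton, so that inverting vowel removal becomes a dictionary lookup. A minimal sketch under that assumption (names and toy lexicon are ours):

```python
VOWELS = set("aeiouáéíóú")

def remove_vowels(word):
    """The RemoveVowels direction: Vowels (→) ε."""
    return "".join(ch for ch in word if ch not in VOWELS)

def build_skeleton_index(lexicon):
    """Group lexicon words by their consonant skeleton."""
    index = {}
    for word in lexicon:
        index.setdefault(remove_vowels(word), set()).add(word)
    return index

def insert_vowels(token, index):
    """InsertVowels = invert(RemoveVowels): every lexicon word whose
    vowel-free skeleton equals the token is a restoration candidate."""
    return sorted(index.get(token, set()))

index = build_skeleton_index({"mañana", "buenas", "gracias", "menos"})
print(insert_vowels("mñn", index))   # ['mañana']
print(insert_vowels("grcs", index))  # ['gracias']
```

The real rules also accept tokens that keep some of their vowels; handling those would mean matching skeletons as subsequences rather than exact keys.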
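The ending paradigms relating non-standard to standard endings (-ao to -ado, and so on) amount to suffix rewrites filtered by the lexicon. A small Python sketch of that idea, with our own function names and a toy lexicon:

```python
# Non-standard -> standard ending pairs, longest suffixes first so
# '-aos' is tried before '-ao' and '-as' before '-a'.
ENDINGS = [("aos", "ados"), ("ao", "ado"), ("as", "adas"), ("a", "ada")]

def standardise_endings(token, lexicon):
    """Apply each ending rewrite and keep only results attested in
    the lexicon, as the composition with L would do."""
    candidates = set()
    for nonstd, std in ENDINGS:
        if token.endswith(nonstd):
            candidate = token[: -len(nonstd)] + std
            if candidate in lexicon:
                candidates.add(candidate)
    return sorted(candidates)

lexicon = {"cansado", "cansados", "pesada"}
print(standardise_endings("cansao", lexicon))   # ['cansado']
print(standardise_endings("cansaos", lexicon))  # ['cansados']
```

In the actual system these rewrites are optional weighted rules inside a larger cascade, so a token can keep its original ending as well.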
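The juxtaposition analysis, shortest-paths(W ◦ SplitConjoinedWords ◦ L(␣L)+) with unit costs on fused-vowel boundaries, can be sketched as a dynamic program: split the token into lexicon words, allow a boundary to re-use a shared vowel (undoing a fusion such as me + encantaba giving mencantaba) at cost 1, and return the cheapest split. This is our illustrative reconstruction, not the paper's code:

```python
import functools

VOWELS = set("aeiouáéíóú")

def split_conjoined(token, lexicon):
    """Lowest-cost decomposition of token into lexicon words.
    A boundary that re-uses a fused vowel costs 1, so, like the
    weighted rules <1> in the paper, the shortest path prefers
    non-fused splits when both readings are possible."""
    INF = float("inf")

    @functools.lru_cache(maxsize=None)
    def best(i):
        if i >= len(token):
            return (0, ())
        top = (INF, ())
        for j in range(i + 1, len(token) + 1):
            word = token[i:j]
            if word not in lexicon:
                continue
            # Normal boundary (cost 0); fused-vowel boundary (cost 1)
            # restarts one character earlier, sharing the vowel.
            starts = [(j, 0)]
            if word[-1] in VOWELS and j < len(token):
                starts.append((j - 1, 1))
            for nxt, cost in starts:
                tail_cost, tail_words = best(nxt)
                top = min(top, (cost + tail_cost, (word,) + tail_words))
        return top

    cost, words = best(0)
    return list(words) if cost < INF else None

print(split_conjoined("mencantaba", {"me", "encantaba"}))
# ['me', 'encantaba']
```

With men and cantaba also in the lexicon, the zero-cost split ['men', 'cantaba'] would win instead, which is exactly the non-fused preference the unit costs encode; the language model is what later arbitrates between such competing analyses.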
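Decoding the word graph with an n-gram language model is a Viterbi search over the lattice of confusion sets. A self-contained sketch with a toy bigram model (the pair scores and function names are invented for illustration; the real system uses backed-off models estimated from the web corpus):

```python
def decode(confusion_sets, bigram_logp, start="<s>"):
    """Viterbi search for the most probable word sequence through a
    lattice where each position offers one confusion set."""
    # Best (log-probability, sequence) ending in each word so far.
    beams = {start: (0.0, [])}
    for options in confusion_sets:
        new_beams = {}
        for word in options:
            best = None
            for prev, (lp, seq) in beams.items():
                score = lp + bigram_logp(prev, word)
                if best is None or score > best[0]:
                    best = (score, seq + [word])
            new_beams[word] = best
        beams = new_beams
    return max(beams.values())[1]

# Toy model: favour seen pairs, penalise unseen ones heavily.
PAIRS = {("<s>", "que"): -0.5, ("que", "haces"): -0.7,
         ("<s>", "qué"): -0.4, ("qué", "haces"): -0.3}

def bigram_logp(prev, word):
    return PAIRS.get((prev, word), -5.0)

print(decode([["que", "qué"], ["haces", "ases"]], bigram_logp))
# ['qué', 'haces']
```

Edit-transducer weights from candidate generation could simply be added into each word's score, which is how the shortest-path costs and the language model combine in a single decoding.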
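The first transducer's cascade (RemoveSymbols ◦ LowerCase ◦ Deaccent ◦ RemoveReduplicates ◦ ...) composes simple edit steps. A deterministic string-function sketch of that composition, with our own step names; real transducers are nondeterministic and weighted, and the rules are optional rather than obligatory:

```python
import unicodedata

def lower_case(s):
    return s.lower()

def deaccent(s):
    """The Deaccent step: strip combining accents via NFD."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def remove_reduplicates(s):
    """Collapse runs of a repeated letter ('holaaa' -> 'hola').
    A real rule would also propose outputs keeping legitimate
    doubles such as 'll' or 'rr'; this sketch collapses all runs."""
    out = []
    for ch in s:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

def replace_logograms(s):
    """Unconditional version of the optional ReplaceLogograms rewrites."""
    return s.replace("x", "por").replace("2", "dos").replace("@", "a")

def cascade(*steps):
    """Compose the steps left to right, mirroring the ◦ cascade."""
    def apply(token):
        for step in steps:
            token = step(token)
        return token
    return apply

normalise = cascade(lower_case, deaccent, remove_reduplicates)
print(normalise("HOLAAA"))  # 'hola'
```

Because each step here returns a single string, the cascade yields one candidate per token; the FST cascade instead yields a whole set of weighted candidates, later filtered by the lexicon and ranked by the language model.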