Word normalization in Twitter using finite-state transducers

Jordi Porta and José Luis Sancho
Centro de Estudios de la Real Academia Española
c/ Serrano 197-198, Madrid 28002
{porta,sancho}@rae.es

Abstract: This paper presents a linguistic approach based on weighted finite-state transducers for the lexical normalisation of Spanish Twitter messages. The system developed consists of transducers that are applied to out-of-vocabulary tokens. Transducers implement linguistic models of variation that generate sets of candidates according to a lexicon. A statistical language model is used to obtain the most probable sequence of words. The article includes a description of the components and an evaluation of the system and some of its parameters.
Keywords: Tweet Messages. Lexical Normalisation. Finite-state Transducers. Statistical Language Models.

1 Introduction

Text messaging (or texting) exhibits a considerable degree of departure from the writing norm, including spelling. There are many reasons for this deviation: the informality of the communication style, the characteristics of the input devices, etc. Although many people consider that these communication channels are "deteriorating" or even "destroying" languages, many scholars claim that even in this kind of channel communication obeys maxims and that spelling is also principled. Moreover, it seems that, in general, the processes underlying variation are not new to languages. It is under these considerations that the modelling of spelling variation, and also its normalisation, can be addressed. Normalisation of text messaging is seen as a necessary preprocessing task before applying other natural language processing tools designed for standard language varieties.

Few works dealing with Spanish text messaging can be found in the literature. To the best of our knowledge, the most relevant and recent published works are Mosquera and Moreda (2012), Pinto et al. (2012), Gómez Hidalgo, Caurcel Díaz, and Íñiguez del Río (2013) and Oliva et al. (2013).

2 Architecture and components of the system

The system has three main components that are applied sequentially: an analyser performing tokenisation and the identification of standard word forms and of other expressions like numbers, dates, etc.; a component generating word candidates for out-of-vocabulary (OOV) tokens; a statistical language model used to obtain the most likely sequence of words; and finally, a truecaser giving proper capitalisation to common words assigned to OOV tokens.

Freeling (Atserias et al., 2006), with a special configuration designed for this task, is used to tokenise the message and to identify, among other tokens, standard word forms. The generation of candidates, i.e., the confusion set of an OOV token, is performed by components inspired by other modules used to analyse words found in historical texts, where other kinds of spelling variation can be found (Porta, Sancho, and Gómez, 2013). The approach to historical variation was based on weighted finite-state transducers over the tropical semiring implementing linguistically motivated models. Some experiments were conducted in order to assess the task of assigning to old word forms their corresponding modern lemmas. For each old word, lemmas were assigned via the possible modern forms predicted by the model. Results were comparable to the results obtained with the Levenshtein distance (Levenshtein, 1966) in terms of recall, but were better in terms of accuracy, precision and F.

As for old words, the confusion set of an OOV token is generated by applying the shortest-paths algorithm to the following expression:

    W ◦ E ◦ L

where W is the automaton representing the OOV token, E is an edit transducer generating possible variations on tokens, and L is the set of target words. The composition of these three modules is performed using an on-line implementation of the efficient three-way composition algorithm of Allauzen and Mohri (2008).

3 Resources employed

In this section, the resources employed by the components of the system are described: the edit transducers, the lexical resources and the language model.

3.1 Edit transducers

We follow the classification of Crystal (2008) for texting features also present in Twitter messages. In order to deal with these features, several transducers were developed. Transducers are expressed as regular expressions and context-dependent rewrite rules of the form α → β / γ _ δ (Chomsky and Halle, 1968) that are compiled into weighted finite-state transducers using the OpenGrm Thrax tools (Tai, Skut, and Sproat, 2011).

3.1.1 Logograms and Pictograms

Some letters are found used as logograms, with a phonetic value. They are dealt with by optional rewrites altering the orthographic form of tokens:

    ReplaceLogograms = (x (→) por) ◦ (2 (→) dos) ◦ (@ (→) a|o) ◦ ...

Laughs, which are very frequent, are also considered logograms, since they represent sounds associated with actions. The multiple ways they are realised, including plurals, are easily described with regular expressions.

Pictograms like emoticons entered by means of ready-to-use icons in input devices are not treated by our system, since they are not textual representations. However, textual representations of emoticons like :DDD or xDDDDDD are recognised by regular expressions and mapped to their canonical form by means of simple transducers.

3.1.2 Initialisms, shortenings, and letter omissions

The string operations for initialisms (or acronymisation) and shortenings are difficult to model without incurring an overgeneration of candidates. For this reason, only common initialisms, e.g., sq (es que), tk (te quiero) or sa (se ha), and common shortenings, e.g., exam (examen) or nas (buenas), are listed.

For the omission of letters, several transducers are implemented. The simplest and most conservative one is a transducer introducing just one letter in any position of the token string. Consonantal writing is a special case of letter omission. This kind of writing relies on the assumption that consonants carry much more information than vowels do, which in fact is the norm in some languages, like the Semitic languages. Some rewrite rules are applied to OOV tokens in order to restore vowels:

    InsertVowels = invert(RemoveVowels)
    RemoveVowels = Vowels (→) ε

3.1.3 Standard non-standard spellings

We consider non-standard spellings standard when they are widely used. These include spellings for representing regional or informal speech, or choices sometimes conditioned by input devices, such as non-accented writing. In the case of accents and tildes, they are restored using a cascade of optional rewrite rules like the following:

    RestoreAccents = (n|ni|ny|nh (→) ñ) ◦ (a (→) á) ◦ (e (→) é) ◦ ...

Also, words containing k instead of c or qu, which appear frequently in protest writings, are standardised with simple transducers. Some other changes are made to some endings to recover the standard ending. There are complete paradigms like the following, which relates non-standard to standard endings:

    -a    -ada
    -as   -adas
    -ao   -ado
    -aos  -ados

We also consider phonetic writing as a kind of non-standard writing in which a phonetic form of a word is alphabetically and syllabically approximated. The transducers used for generating standard words from their phonetic and graphical variants are:

    DephonetiseWriting = invert(PhonographemicVariation)
    PhonographemicVariation =
        GraphemeToPhoneme ◦
        PhoneConflation ◦
        PhonemeToGrapheme ◦
        GraphemeVariation

In the previous definitions, PhoneConflation makes phonemes equivalent, as for example the IPA phonemes /ʎ/ and /ʝ/. Linguistic phenomena such as seseo and ceceo, in which several phonemes were conflated by the 16th century, still remain in spoken variants and are also reflected in texting. The GraphemeVariation transducer models, among others, the writing of ch as x, which could be due to the influence of other languages.

3.1.4 Juxtapositions

Spacing in texting is also non-standard. In the normalisation task, some OOV tokens are in fact juxtaposed words. The possible decompositions of a word into a sequence of possible words are given by shortest-paths(W ◦ SplitConjoinedWords ◦ L(␣L)+), where W is the word to be analysed, L(␣L)+ represents the valid sequences of words and SplitConjoinedWords is a transducer introducing blanks (␣) between letters and optionally undoing possible fused vowels:

    SplitConjoinedWords = invert(JoinWords)
    JoinWords = (a␣a (→) a<1>) ◦ ... ◦ (u␣u (→) u<1>) ◦ (␣ (→) ε)

Note that in the previous definition, some rules are weighted with a unit cost <1>. These costs are used by the shortest-paths algorithm as a preference mechanism to select non-fused over fused sequences when both cases are possible.

3.1.5 Other transducers

Expressive lengthening, which consists in repeating a letter in order to convey emphasis, is dealt with by means of rules removing a varying number of consecutive occurrences of the same letter. An example of a rule dealing with letter a repetitions is a (→) ε / a _. A transducer is generated for the alphabet.

Because messages are keyboarded, some errors found in words are due to letter transpositions and confusions between adjacent letters in the same row of the keyboard. These changes are also implemented with a transducer.

Finally, a Levenshtein transducer with a maximum distance of one has also been implemented.

3.2 The lexicon

The lexicon for OOV token normalisation contains mainly Spanish standard words, proper names and some frequent English words. These constitute the set of target words. We used the DRAE (RAE, 2001) as the source for Spanish standard words in the lexicon. Besides inflected forms, we have added verbal forms with clitics attached and derivative forms not found as entries in the DRAE: -mente adverbs, appreciatives, etc. The list of proper names was compiled from many sources and contains first names, surnames, aliases, cities, country names, brands, organisations, etc. Special attention was paid to hypocorisms, i.e., shorter or diminutive forms of a given name, as well as nicknames or calling names, since communication in channels such as Twitter tends to be interpersonal (or between members of a group) and affective. A list of common hypocorisms is provided to the system. For English words, we have selected the 100,000 most frequent words of the BNC (BNC, 2001).

3.3 Language model

We use a language model to decode the word graph and thus obtain the most probable word sequence. The model is estimated from a corpus of webpages compiled with Wacky (Baroni et al., 2009). The corpus contains about 11,200,000 tokens coming from about 21,000 URLs. We used as seeds the types found in the development set (about 2,500). Backed-off n-gram models, used as language models, are implemented with the OpenGrm NGram toolkit (Roark et al., 2012).

3.4 Truecasing

The restoration of case information in badly-cased text has been addressed in Lita et al. (2003) and has been included as part of the normalisation task. Part of this process, for proper names, is performed by the application of the language model to the word graph. Words at message-initial position are not always uppercased, since doing so yielded contradictory results after some experimentation. A simple heuristic is implemented to uppercase a normalisation candidate when the OOV token is also uppercased.

4 Settings and evaluation

In order to generate the confusion sets, we used two edit transducers applied in a cascade. If neither of the two is able to relate a token with a word, the token is assigned to itself.

The first transducer generates candidates according to the expansion of abbreviations, the identification of acronyms, pictograms and words which result from the following composition of edit transducers combining some of the features of texting:

    RemoveSymbols ◦
    LowerCase ◦
    Deaccent ◦
    RemoveReduplicates ◦
    ReplaceLogograms ◦
    StandardiseEndings ◦
    DephonetiseWriting ◦
    Reaccent ◦
    MixCase

The second edit transducer analyses tokens that did not receive analyses from the first editor. This second editor implements consonantal writing, typing-error recovery, approximate matching using a Levenshtein distance of one, and the splitting of juxtaposed words. In all cases, case, accents and reduplications are also considered. This second transducer makes use of an extended lexicon containing sequences of simple words.

Several experiments were conducted in order to evaluate some parameters of the system, in particular the effect of the order of the n-grams in the language model, and the effect of generating confusion sets for OOV tokens only versus generating confusion sets for all tokens. For all the experiments we used the test set provided with the tokenisation delivered by Freeling.

For the first series of experiments, tokens identified as standard words by Freeling receive the same token as analysis, and OOV tokens are analysed with the system. Recall on OOV tokens is 89.40%. Confusion set size follows a power-law distribution, with an average size of 5.48 for OOV tokens that goes down to 1.38 if we average over the rest of the tokens.¹ Precision for 2- and 4-gram language models is 78.10%, but the best result is obtained with 3-grams, with a precision of 78.25%. There are a number of non-standard forms that were wrongly recognised as in-vocabulary words because they clash with other standard words. In the second series of experiments, a confusion set is generated for each word in order to correct potentially wrong assignments. The average size of confusion sets increases to 5.56. Precision for the 2-gram language model is 78.25%, but the 3- and 4-gram models both reach a precision of 78.55%.

¹ We removed from the calculation the token mes de abril, which receives 308,017 different analyses due to the combination of multiple editions and segmentations.

From a quantitative point of view, it seems that slightly better results are obtained using a 3-gram language model and generating confusion sets not only for OOV tokens but for all the tokens in the message. In a qualitative evaluation of errors, several categories show up. The most populated categories are those having to do with case restoration and wrong decoding by the language model. Some errors are related to particularities of the DRAE, from which the lexicon was derived (dispertar or malaleche). Non-standard morphology is observed in tweets, as in derivatives (tranquileo or loquendera). Lack of abbreviation expansion is also observed (Hum). Faulty application of segmentation accounts for a few errors (mencantaba). Finally, some errors are not in our output but in the reference (Hojo).

5 Conclusions and future work

No attention has been paid to multilingualism, since the task explicitly excluded tweets from bilingual areas of Spain. However, given that not a few Spanish speakers (both in Europe and America) are bilingual or live in bilingual areas, mechanisms should be provided to deal with languages other than English to make the system more robust.

We plan to build a corpus of lexically standard tweets via the Twitter streaming API to determine whether n-grams observed in a Twitter-only corpus improve decoding or not, as a side effect of syntax being also non-standard.

Qualitative analysis of results showed that there is room for improvement experimenting with selective deactivation of items in the lexicon and further development of the segmenting module.

However, initialisms and shortenings remain features of texting that are difficult to model without causing overgeneration. Acronyms like FYQ, which corresponds to the school subject of Física y Química, are domain-specific and difficult to foresee, and therefore to have listed in the resources.

References

Allauzen, Cyril and Mehryar Mohri. 2008. 3-way composition of weighted finite-state transducers. In Proc. of the 13th Int. Conf. on Implementation and Application of Automata (CIAA-2008), pages 262–273, San Francisco, California, USA.

Atserias, Jordi, Bernardino Casas, Elisabet Comelles, Meritxell González, Lluís Padró, and Muntsa Padró. 2006. FreeLing 1.3: Syntactic and semantic services in an open-source NLP library. In Proc. of the 5th Int. Conf. on Language Resources and Evaluation (LREC-2006), pages 48–55, Genoa, Italy, May.

Baroni, Marco, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

BNC. 2001. The British National Corpus, version 2 (BNC World). Distributed by Oxford University Computing Services on behalf of the BNC Consortium. http://www.natcorp.ox.ac.uk.

Chomsky, Noam and Morris Halle. 1968. The Sound Pattern of English. Harper & Row, New York.

Crystal, David. 2008. Txtng: The Gr8 Db8. Oxford University Press.

Gómez Hidalgo, José María, Andrés Alfonso Caurcel Díaz, and Yovan Íñiguez del Río. 2013. Un método de análisis de lenguaje tipo SMS para el castellano. Linguamática, 5(1):31–39, July.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710.

Lita, Lucian Vlad, Abe Ittycheriah, Salim Roukos, and Nanda Kambhatla. 2003. tRuEcasIng. In Proc. of the 41st Annual Meeting on ACL - Volume 1, ACL '03, pages 152–159, Stroudsburg, PA, USA.

Mosquera, Alejandro and Paloma Moreda. 2012. TENOR: A lexical normalisation tool for Spanish Web 2.0 texts. In Petr Sojka, Aleš Horák, Ivan Kopeček, and Karel Pala, editors, Text, Speech and Dialogue, volume 7499 of LNCS, pages 535–542. Springer.

Oliva, J., J. I. Serrano, M. D. Del Castillo, and Á. Iglesias. 2013. A SMS normalization system integrating multiple grammatical resources. Natural Language Engineering, 19(1):121–141.

Pinto, David, Darnes Vilariño Ayala, Yuridiana Alemán, Helena Gómez, Nahun Loya, and Héctor Jiménez-Salazar. 2012. The Soundex phonetic algorithm revisited for SMS text representation. In Petr Sojka, Aleš Horák, Ivan Kopeček, and Karel Pala, editors, Text, Speech and Dialogue, volume 7499 of LNCS, pages 47–55. Springer.

Porta, Jordi, José-Luis Sancho, and Javier Gómez. 2013. Edit transducers for spelling variation in Old Spanish. In Proc. of the Workshop on Computational Historical Linguistics at NODALIDA 2013, NEALT Proc. Series 18, pages 70–79, Oslo, Norway, May 22-24.

RAE. 2001. Diccionario de la lengua española. Espasa, Madrid, 22nd edition.

Roark, Brian, Richard Sproat, Cyril Allauzen, Michael Riley, Jeffrey Sorensen, and Terry Tai. 2012. The OpenGrm open-source finite-state grammar software libraries. In Proc. of the ACL 2012 System Demonstrations, pages 61–66, Jeju Island, Korea, July.

Tai, Terry, Wojciech Skut, and Richard Sproat. 2011. Thrax: An open source grammar compiler built on OpenFst. In ASRU 2011, Waikoloa Resort, Hawaii, December.
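The candidate-generation scheme, shortest paths over W ◦ E ◦ L, can be illustrated without an FST toolkit by enumerating the strings a distance-one edit transducer E would produce from the token automaton W, and intersecting them with the lexicon L. The following Python sketch is ours, not the paper's implementation; the function names and the toy lexicon are illustrative assumptions:

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzáéíóúñü"

def edit1(token, alphabet=ALPHABET):
    """All strings at Levenshtein distance one from the token
    (deletions, substitutions, insertions), playing the role of
    applying the edit transducer E to the token automaton W."""
    splits = [(token[:i], token[i:]) for i in range(len(token) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    subs = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return deletes | subs | inserts

def confusion_set(token, lexicon):
    """Intersecting the variants with the target lexicon stands in
    for the final composition with L; the token itself is kept as a
    candidate if it happens to be a word."""
    candidates = edit1(token) | {token}
    return sorted(candidates & lexicon)

lexicon = {"que", "casa", "cosa", "teatro"}
print(confusion_set("cas", lexicon))   # ['casa']
print(confusion_set("kasa", lexicon))  # ['casa']
```

Unlike the real system, this sketch has no weights and no three-way composition; it only shows why intersecting an edit neighbourhood with a lexicon keeps overgeneration in check.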
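The InsertVowels = invert(RemoveVowels) pattern for consonantal writing can be simulated in plain Python by indexing the lexicon by consonant skeleton, so that inverting vowel removal becomes a dictionary lookup. A minimal sketch under that assumption (names and toy lexicon are ours):

```python
VOWELS = set("aeiouáéíóú")

def remove_vowels(word):
    """The RemoveVowels direction: Vowels (→) ε."""
    return "".join(ch for ch in word if ch not in VOWELS)

def build_skeleton_index(lexicon):
    """Group lexicon words by their consonant skeleton."""
    index = {}
    for word in lexicon:
        index.setdefault(remove_vowels(word), set()).add(word)
    return index

def insert_vowels(token, index):
    """InsertVowels = invert(RemoveVowels): every lexicon word whose
    vowel-free skeleton equals the token is a restoration candidate."""
    return sorted(index.get(token, set()))

index = build_skeleton_index({"mañana", "buenas", "gracias", "menos"})
print(insert_vowels("mñn", index))   # ['mañana']
print(insert_vowels("grcs", index))  # ['gracias']
```

The real rules also accept tokens that keep some of their vowels; handling those would mean matching skeletons as subsequences rather than exact keys.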
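The ending paradigms relating non-standard to standard endings (-ao to -ado, and so on) amount to suffix rewrites filtered by the lexicon. A small Python sketch of that idea, with our own function names and a toy lexicon:

```python
# Non-standard -> standard ending pairs, longest suffixes first so
# '-aos' is tried before '-ao' and '-as' before '-a'.
ENDINGS = [("aos", "ados"), ("ao", "ado"), ("as", "adas"), ("a", "ada")]

def standardise_endings(token, lexicon):
    """Apply each ending rewrite and keep only results attested in
    the lexicon, as the composition with L would do."""
    candidates = set()
    for nonstd, std in ENDINGS:
        if token.endswith(nonstd):
            candidate = token[: -len(nonstd)] + std
            if candidate in lexicon:
                candidates.add(candidate)
    return sorted(candidates)

lexicon = {"cansado", "cansados", "pesada"}
print(standardise_endings("cansao", lexicon))   # ['cansado']
print(standardise_endings("cansaos", lexicon))  # ['cansados']
```

In the actual system these rewrites are optional weighted rules inside a larger cascade, so a token can keep its original ending as well.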
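The juxtaposition analysis, shortest-paths(W ◦ SplitConjoinedWords ◦ L(␣L)+) with unit costs on fused-vowel boundaries, can be sketched as a dynamic program: split the token into lexicon words, allow a boundary to re-use a shared vowel (undoing a fusion such as me + encantaba giving mencantaba) at cost 1, and return the cheapest split. This is our illustrative reconstruction, not the paper's code:

```python
import functools

VOWELS = set("aeiouáéíóú")

def split_conjoined(token, lexicon):
    """Lowest-cost decomposition of token into lexicon words.
    A boundary that re-uses a fused vowel costs 1, so, like the
    weighted rules <1> in the paper, the shortest path prefers
    non-fused splits when both readings are possible."""
    INF = float("inf")

    @functools.lru_cache(maxsize=None)
    def best(i):
        if i >= len(token):
            return (0, ())
        top = (INF, ())
        for j in range(i + 1, len(token) + 1):
            word = token[i:j]
            if word not in lexicon:
                continue
            # Normal boundary (cost 0); fused-vowel boundary (cost 1)
            # restarts one character earlier, sharing the vowel.
            starts = [(j, 0)]
            if word[-1] in VOWELS and j < len(token):
                starts.append((j - 1, 1))
            for nxt, cost in starts:
                tail_cost, tail_words = best(nxt)
                top = min(top, (cost + tail_cost, (word,) + tail_words))
        return top

    cost, words = best(0)
    return list(words) if cost < INF else None

print(split_conjoined("mencantaba", {"me", "encantaba"}))
# ['me', 'encantaba']
```

With men and cantaba also in the lexicon, the zero-cost split ['men', 'cantaba'] would win instead, which is exactly the non-fused preference the unit costs encode; the language model is what later arbitrates between such competing analyses.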
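Decoding the word graph with an n-gram language model is a Viterbi search over the lattice of confusion sets. A self-contained sketch with a toy bigram model (the pair scores and function names are invented for illustration; the real system uses backed-off models estimated from the web corpus):

```python
def decode(confusion_sets, bigram_logp, start="<s>"):
    """Viterbi search for the most probable word sequence through a
    lattice where each position offers one confusion set."""
    # Best (log-probability, sequence) ending in each word so far.
    beams = {start: (0.0, [])}
    for options in confusion_sets:
        new_beams = {}
        for word in options:
            best = None
            for prev, (lp, seq) in beams.items():
                score = lp + bigram_logp(prev, word)
                if best is None or score > best[0]:
                    best = (score, seq + [word])
            new_beams[word] = best
        beams = new_beams
    return max(beams.values())[1]

# Toy model: favour seen pairs, penalise unseen ones heavily.
PAIRS = {("<s>", "que"): -0.5, ("que", "haces"): -0.7,
         ("<s>", "qué"): -0.4, ("qué", "haces"): -0.3}

def bigram_logp(prev, word):
    return PAIRS.get((prev, word), -5.0)

print(decode([["que", "qué"], ["haces", "ases"]], bigram_logp))
# ['qué', 'haces']
```

Edit-transducer weights from candidate generation could simply be added into each word's score, which is how the shortest-path costs and the language model combine in a single decoding.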
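The first transducer's cascade (RemoveSymbols ◦ LowerCase ◦ Deaccent ◦ RemoveReduplicates ◦ ...) composes simple edit steps. A deterministic string-function sketch of that composition, with our own step names; real transducers are nondeterministic and weighted, and the rules are optional rather than obligatory:

```python
import unicodedata

def lower_case(s):
    return s.lower()

def deaccent(s):
    """The Deaccent step: strip combining accents via NFD."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def remove_reduplicates(s):
    """Collapse runs of a repeated letter ('holaaa' -> 'hola').
    A real rule would also propose outputs keeping legitimate
    doubles such as 'll' or 'rr'; this sketch collapses all runs."""
    out = []
    for ch in s:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

def replace_logograms(s):
    """Unconditional version of the optional ReplaceLogograms rewrites."""
    return s.replace("x", "por").replace("2", "dos").replace("@", "a")

def cascade(*steps):
    """Compose the steps left to right, mirroring the ◦ cascade."""
    def apply(token):
        for step in steps:
            token = step(token)
        return token
    return apply

normalise = cascade(lower_case, deaccent, remove_reduplicates)
print(normalise("HOLAAA"))  # 'hola'
```

Because each step here returns a single string, the cascade yields one candidate per token; the FST cascade instead yields a whole set of weighted candidates, later filtered by the lexicon and ranked by the language model.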