
A Rule-Based Normalization System for Greek Noisy User-Generated Text

Marsida Toska

Uppsala University
Department of Linguistics and Philology
Master Programme in Language Technology
Master's Thesis in Language Technology, 30 ECTS credits
November 9, 2020

Supervisor: Eva Pettersson

Abstract

The ever-growing usage of social media platforms generates vast amounts of textual data daily, which could potentially serve as a great source of information. Therefore, mining user-generated data for commercial, academic, or other purposes has already attracted the interest of the research community. However, the informal writing which often characterizes online user-generated texts poses a challenge for automatic processing with Natural Language Processing (NLP) tools. To mitigate the effect of noise in these texts, lexical normalization has been proposed as a preprocessing method; in short, it is the task of converting non-standard word forms into canonical ones. The present work aims to contribute to this field by developing a rule-based normalization system for Greek tweets. We perform an analysis of the categories of the out-of-vocabulary (OOV) word forms identified in the dataset and define hand-crafted rules, which we combine with an edit-distance (Levenshtein) approach to tackle noise in the cases under scope. To evaluate the performance of the system we perform both an intrinsic and an extrinsic evaluation, the latter exploring the effect of normalization on part-of-speech tagging. The results of the intrinsic evaluation suggest that our system has an accuracy of approx. 95%, compared to approx. 81% for the baseline. In the extrinsic evaluation, we observe a boost of approx. 8% in tagging performance when the text has been preprocessed through lexical normalization.

Contents

Acknowledgements

1 Introduction
  1.1 Purpose
  1.2 Outline

2 Background
  2.1 Text Normalization
    2.1.1 Normalization of Historical Texts
    2.1.2 Normalization of Texts for Text-to-Speech Systems
  2.2 Characteristics of Noisy User-Generated Texts
  2.3 Methods for the Normalization of Noisy User-Generated Texts
    2.3.1 The Rule-Based Approach
    2.3.2 Levenshtein Distance
    2.3.3 Comparison of Phonetic Similarity
    2.3.4 Statistical Machine Translation and Neural Methods
  2.4 Greek Language and Greek Tweets
    2.4.1 Overview of Greek Spelling
    2.4.2 Linguistic Phenomena in Greek Tweets
  2.5 Part-of-Speech Tagging

3 Data & Resources
  3.1 The Dataset
  3.2 Resources
    3.2.1 ...
    3.2.2 PANACEA n-gram Corpus
    3.2.3 UDPipe

4 Preprocessing
  4.1 Cleaning the Dataset
  4.2 Test Set
  4.3 Systematic Analysis of Greek Tweets
    4.3.1 Sentence Structure
    4.3.2 Lower/Upper Case
    4.3.3 Neologisms
    4.3.4 Greeklish and Engreek
    4.3.5 Stress, Contractions, Elongations, ...
    4.3.6 Non-Standard Abbreviations
    4.3.7 Misspellings
  4.4 Scope Definition

5 System Architecture
  5.1 Module 1: Regular Expressions
    5.1.1 Non-Standard Abbreviations
    5.1.2 Elongations
    5.1.3 Contractions
    5.1.4 Misjoined ...
    5.1.5 Truecasing
  5.2 Module 2: Rule about Stress Restoration
    5.2.1 Rule Overview
    5.2.2 Rule Analysis
    5.2.3 Handling of Special Cases
  5.3 Module 3: Edit Distance
    5.3.1 Extraction of IV Subset
    5.3.2 Extraction of Candidates
    5.3.3 Final Selection of the IV Counterpart

6 Evaluation and Results
  6.1 Performance of the Rule-Based System
  6.2 Error Analysis
  6.3 Effect of Normalization on Tagging

7 Discussion

8 Conclusion

Acknowledgements

For her guidance, constructive feedback, continuous support and encouragement, I would like to warmly thank my supervisor Eva Pettersson. I also wish to thank my absolutely unique family, my boyfriend and my friends spread out all over the world for always being there for me, even when they were not.

1 Introduction

User-generated texts, such as social media texts (e.g. tweets), constitute a vast source of information for opinion and event extraction. However, most of this information is composed in a language that is notorious for its high variability in sentence structure, the extensive usage of non-standard word forms and the presence of ungrammatical linguistic units (e.g. misspelled words), resulting in noise (Sikdar and Gambäck, 2016). Since most Natural Language Processing (NLP) tools are trained on formal texts, such as news text, it has been observed that the performance of such tools declines when run over informal text (Gimpel et al., 2011; O’Connor et al., 2010). Good tagging and parsing results are essential for applications such as opinion mining, information retrieval, etc. (Kong et al., 2014). Therefore, there is a need to preprocess noisy user-generated texts (NUGT) through normalization, so that the performance of preprocessing NLP tasks, such as Part-of-Speech (POS) tagging and parsing, and of other subsequent ones, is not compromised significantly, if at all (Clark and Araki, 2011).

1.1 Purpose

Normalization could concisely be defined as the task of converting word tokens into their standard form (Han et al., 2013). Depending on the text genre (formal vs. informal) and its purpose (preprocessing text for Text-to-Speech (TTS) systems, information retrieval or other NLP tasks such as tagging and parsing), it poses different challenges and consequently requires a different approach as well. The purpose of the present work is to explore the extent to which the rule-based normalization of Greek social media texts, specifically Greek tweets (tweets written in the Greek language), can lead to more accurate tagging results. We opted for the rule-based approach because it has proven to deliver optimal results in other languages, at least when straightforward mappings suffice (Ruiz et al., 2014; Sidarenka et al., 2013). Additionally, including the Levenshtein-distance algorithm allows us to deal with any type of spelling deviation, a phenomenon which we expect to be abundant in user-generated texts such as tweets. Therefore, the research questions which we will attempt to answer in this project could be formulated as follows: 1) What are the core categories of non-standard forms in Greek tweets that could potentially be in need of normalization? 2) How well does a rule-based system combined with Levenshtein distance perform in normalizing Greek tweets? 3) What is the effect, if any, of normalization on the tagging results of Greek tweets?

1.2 Outline

The chapters are organized as follows: Chapter 2 provides information on the background of the normalization task. As it is not a new field, a few of the most common areas of application and methods are described. An introduction to Greek NUGT is also provided, as well as a brief overview of POS tagging.

Chapter 3 introduces the dataset along with the resources that were used for the purposes of this project. Part of the data was used for the analysis of the phenomena occurring in Greek tweets and for the creation of an annotated test set that was used for the evaluation of the system in the end. Chapter 4 contains information about the preprocessing steps, such as cleaning the data, performing an analysis and categorization of non-standard word forms, and annotating the test set. Chapter 5 describes the actual implementation of the system by giving an overview of the architecture and illustrating approaches through examples. Chapter 6 gives an overview of the evaluation results, both of the system itself and with regard to POS tagging; at this point, the answers to the research questions, as they resulted from the work, are summarized. Chapters 7 and 8 discuss the results and provide a brief conclusion, respectively.

2 Background

2.1 Text Normalization

Text normalization (or canonicalization or standardization) is a prevalent task in the NLP pipeline. It includes normalization subtasks such as tokenization, lemmatization, or sentence segmentation, but it can also be encountered in a more complex form in situations where, for instance, out-of-vocabulary (OOV) words must be addressed by converting them into a standard (lexicon-approved) form. Lexical normalization, where the focus of this work lies, consists in preprocessing a text at the word level and transforming it into a form that can be easily analyzed and processed by other tasks in the NLP pipeline (e.g. taggers) or by downstream applications, so that these produce consistent results (Han et al., 2013; Jurafsky and Martin, 2009). As one may already suspect, there is no single correct way to normalize a text. On the contrary, text normalization is a domain-dependent and purpose-specific task. In the subsections below, we list some of the areas of application of text normalization, such as the normalization of historical texts and of texts for text-to-speech systems, illustrating how differences in domain and purpose pose different challenges for, and even give a different meaning to, normalization.

2.1.1 Normalization of Historical Texts

Historical documents are a significant source of information regarding the past and as such a significant resource for researchers in digital humanities. With time, many of them have been digitized, making historical texts easily accessible and suitable for instant searching (Pettersson, 2016). Nevertheless, they still cannot be fully studied, analyzed or exploited computationally and remain challenging within NLP due to the high degree of variability they exhibit in terms of spelling, syntax, morphology, vocabulary and semantics (Bollmann et al., 2011; Pettersson, 2016). In an attempt to provide a solution to this problem, and taking into consideration that most NLP tools are trained on contemporary texts, researchers have proposed the normalization of historical texts to modern linguistic standards, so that they can be processed by the same tools and rendered suitable for further processing (Piotrowski, 2012). In her study about historical spelling normalization, Pettersson (2016) supports this approach as well, as opposed to adapting NLP tools to the historical domain. An example of a sentence found in a historical text, before and after normalization, can be seen in Example 1 below, taken from the Innsbruck Letter Corpus1:

(1) þe quene was ryght gretly displisyd with us both
    the queen was right greatly displeased with us both

In this sentence, the normalization consists mainly, if not entirely, in substituting certain non-standard word forms with their equivalent modern ones. For instance, “ryght” with “right”, “displisyd” with “displeased” etc. without performing changes

1https://www.uibk.ac.at/anglistik/research/projects/icamet/availability/

in the ordering of the words or other kinds of modifications. This is anticipated, considering that spelling stands out as the most systematic and striking difference between old and modern texts (Pettersson, 2016; Bollmann, 2018; Piotrowski, 2012). However, normalizing the spelling of historical texts is more challenging than one would expect, due to its variability between different genres, periods or different authors within the same language. Inconsistent spelling has also been observed within the same document, suggesting that in the past not all words were standardized in a single form (Piotrowski, 2012). Consequently, creating a correspondence between different lexical forms that appear in historical texts and present-day spelling is not as straightforward and adds to the normalization challenge. In her study on the normalization of historical texts, Pettersson (2016) suggests the following methods for spelling normalization:

a) Rule-based Normalization
b) Levenshtein-based Normalization
c) Memory-based Normalization
d) SMT-based Normalization

Experiments suggest that the statistical machine translation (SMT) model, which sees normalization as a translation task, achieves better results compared to other approaches. The proposed pipeline is applicable to several European languages and serves as a good preprocessing step for historical texts, before feeding them to taggers or parsers or using them to extract information (information retrieval). Other researchers (Bollmann and Søgaard, 2016; Korchagina, 2017a; Pettersson, 2016; Tang et al., 2018) have shown that neural machine translation (NMT) can surpass SMT in the normalization of historical texts, provided that large data sizes are available.

2.1.2 Normalization of Texts for Text-to-Speech Systems

Contrary to the normalization of historical texts, where the focus mainly lies in mapping or converting word forms to present-day standard word forms, the normalization of texts destined for Text-to-Speech (TTS) systems has a quite different configuration. In this case, the aim of the normalization process is to prepare the text in such a way that it can be passed on to speech synthesizers and be read aloud (Zhang et al., 2019). The normalization component is usually among the first steps in the TTS pipeline and indispensable, since its absence can compromise the meaning and the comprehension of the spoken message and easily lead to a poorly performing TTS system.

Category             Before Normalization    After Normalization
acronyms             WHO                     World Health Organization
abbreviations        Mr                      Mister
numbers (ordinals)   May 7                   May seventh
dates                in 1994                 in nineteen ninety-four
money                £12,000                 twelve thousand pounds

Table 2.1: Example of TTS phenomena where normalization is needed.

In practice, this means expanding acronyms, abbreviated words and similar numerical instances into their spelled-out form, as in the examples listed in Table 2.1. On the other hand, simply expanding certain tokens into their spoken form is not always sufficient, as in certain cases the expansions may require prior contextual disambiguation (Flint et al., 2017). For example, it is appropriate to verbalize numbers

as digits in certain contexts, i.e. 5 as five, 2 as two and 3 as three in “the pin is 523”, but this does not suffice in other cases where numerical values have to be read out as cardinals, as in the expression “523 people”, where the appropriate expansion would be “five-hundred twenty-three people” instead of the spell-out of single digits, “five two three people”. The degree of normalization complexity is language-dependent, considering that in highly inflected languages such as Greek, certain cardinals are subject to case, number and gender inflectional endings, as they act as adjectival modifiers when preceding a noun. Consequently, while the cardinal 523 would be pronounced as “πεντακόσια είκοσι τρία” in Greek in isolation, when context is available some of its endings would have to be adjusted accordingly, as illustrated in Example 2.

(2) πεντακόσιοι είκοσι τρείς άνθρωποι Five hundred.M.PL. twenty three.M.PL. people.M.PL. ’Five hundred twenty-three people’
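The contextual disambiguation described above can be sketched in a few lines of Python. The cue-word heuristic and the minimal sub-1000 English verbalizer below are illustrative assumptions only, not part of any TTS system discussed here:

```python
def cardinal_en(n):
    """Spell out an integer 0-999 in English words (minimal sketch)."""
    units = ["zero", "one", "two", "three", "four", "five", "six", "seven",
             "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
             "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
    tens = ["", "", "twenty", "thirty", "forty", "fifty",
            "sixty", "seventy", "eighty", "ninety"]
    if n < 20:
        return units[n]
    if n < 100:
        return tens[n // 10] + ("-" + units[n % 10] if n % 10 else "")
    rest = n % 100
    head = units[n // 100] + " hundred"
    return head + (" " + cardinal_en(rest) if rest else "")

def expand_number(token, context):
    """Read digit by digit after cue words like 'pin'; otherwise as a cardinal."""
    cues = {"pin", "code", "number"}  # hypothetical cue list
    if cues & {w.lower() for w in context}:
        return " ".join(cardinal_en(int(d)) for d in token)
    return cardinal_en(int(token))

print(expand_number("523", ["the", "pin", "is"]))  # five two three
print(expand_number("523", ["saw"]))               # five hundred twenty-three
```

A real system would of course need inflection handling for languages like Greek, where the verbalized cardinal must agree with the following noun.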

Initially, most text normalization components were implemented through a rule-based approach, where hand-written rules would specify the right expansion based on linguistic knowledge, while finite-state transducers (Ebden and Sproat, 2014) have also been deployed as a method. Recently, there have been several attempts to approach text normalization for TTS systems from a data-driven perspective by using machine learning methods, e.g. deep neural networks. The main challenge faced by the latter is the lack of directly available parallel corpora, as there are no texts where instances of semiotic classes and abbreviations occur naturally in a verbalized form (Zhang et al., 2019).

2.2 Characteristics of Noisy User-Generated Texts

User-generated texts posted on social media platforms provide users with the possibility to share information, informally express an opinion on current affairs, exchange ideas, and comment on, review or recommend products and services (Steinskog et al., 2017). Twitter posts, or tweets, are known for their brevity, as each post is subject to a character limit, currently amounting to 280 characters per tweet (Alegria et al., 2015). In view of this restriction, and given the informal context of text generation in social media (since posts are usually generated by and intended for individuals), they are characterized by certain idiosyncratic features, such as free or elliptical sentence structure, use of non-standard words and abbreviations, ungrammaticalities (intended or not), unreliable capitalization and code-switching, which often account for the noise found in such texts (Clark and Araki, 2011; Han et al., 2013).

(3) sry, dont know but will DM u 2mr about it sorry, don’t know but will direct message you tomorrow about it ’I am sorry, I don’t know, but I will let you know through a private message tomorrow about it’

In the above example there appear non-standard words, such as “sry”, “u” and “2mr”, that are not included in any English lexicon and thus qualify as OOV words. Besides this lexical-level idiosyncrasy, punctuation marks, such as periods, are also among the features that are irregularly used in social media texts and can likewise impact text processing negatively. In addition to these characteristics, which informal writing in social media is generally notorious for, Twitter posts also exhibit the following domain-specific features:

URLs, hashtags and mentions (Wikström, 2017). These can appear either as a syntactic part of a sentence or be independent of the sentence structure.

(4) a) its getting outta hand . Students MUST be given masks for #exams.
    b) staying home and chilling./ #luvmylife.

In the example of tweet a, the hashtag “#exams” is an integral part of the sentence; removing it would make the sentence ungrammatical. On the contrary, the hashtag “#luvmylife”, which summarizes the mood of the statement and follows the period, could easily be omitted from a grammatical point of view. The deviation of such texts from newswire or similar texts in terms of governing rules accounts for the fact that social media posts are often referred to as noisy user-generated texts.

2.3 Methods for the Normalization of Noisy User-Generated Texts

The normalization of user-generated texts resembles to some extent that of historical texts, since the main focus in both cases lies in the conversion of certain words into canonical forms, so that these do not count towards OOV words and lead in turn to a drop in the performance of NLP tools run over such texts. However, historical texts fall into the category of formal texts written using non-modern language, while social media texts are classified as modern texts which incorporate numerous features associated with informal writing. Hence, tweets need to be normalized with regard to informal language use in order to approach, to some extent, the formal writing of the texts which NLP tools are mainly trained on (Chrupała, 2014; Han et al., 2013; Wang et al., 2017). A significant part of normalization consists in restoring the canonical form of words, which in application resembles techniques used in traditional spelling correction. Initially, the correction of spelling was viewed through a noisy channel model, where it was assumed that the user knew what they wanted to type but accidentally introduced an error when typing (Church and Gale, 1991), an approach dating back to Damerau (1964). Consequently, the aim of the spelling correction task was to recover the intended character(s). This was formalized as follows by Church and Gale (1991):

Pr(t | c)

The above formula expresses the probability (Pr) of the typo t occurring, provided that c was the intended character. Along the same lines, Brill and Moore (2000) proposed an improved error model where positional information, that is, information about the position of the error in a word, was also deployed. Later, Li et al. (2006) utilized distributional similarity to perform spelling corrections. Focusing on the normalization of text messages, Choudhury et al. (2007) used a hidden Markov model to model the transformations and emissions of characters. The use of finite-state transducers (FST) has also been a popular method (Hulden and Francom, 2013; Porta and Sancho-Gómez, 2013), while the rule-based approach, comparison of phonetic similarity and machine translation methods have been implemented as well and will be presented in more detail in the subsequent sections.
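The noisy-channel view of Church and Gale (1991) can be illustrated in a few lines: given an observed typo t, choose the candidate correction c that maximizes Pr(t|c) · Pr(c). The probabilities below are invented for illustration, not estimated from any corpus:

```python
# Toy noisy-channel spelling correction for the observed typo "teh".
error_model = {           # P(typo "teh" | intended word c) -- illustrative
    "the": 0.010,
    "ten": 0.001,
    "tech": 0.002,
}
prior = {                 # language-model prior P(c) -- illustrative
    "the": 0.070,
    "ten": 0.005,
    "tech": 0.001,
}

def correct(candidates):
    """Return argmax over c of P(t|c) * P(c)."""
    return max(candidates, key=lambda c: error_model[c] * prior[c])

print(correct(["the", "ten", "tech"]))  # the
```

In a real system the error model would be estimated from confusion matrices over observed typos, and the prior from corpus frequencies.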

2.3.1 The Rule-Based Approach

Rule-based systems consist primarily in the creation of rules with conditions of the form “if X then do Y” in order to tackle a problem (Bollmann, 2018). In text normalization, Bollmann et al. (2011) deployed a rule-based approach in the form of context-aware rewrite rules in order to transform non-modern word forms in

historical texts to their present-day counterparts. In the domain of user-generated texts, Sidarenka et al. (2013) proposed a normalization method for German that consists in the analysis of the unknown tokens and the development of rules that target specific cases. For Twitter-specific phenomena (e.g. mentions, urls etc.), they introduced the use of artificial tokens, replacing the unknown word with “%” and its category, for instance replacing “https://www.example.com” with “%Link”, while under certain conditions these were removed in initial and final sentence positions. For the major class of unknown tokens, those with spelling deviations, they created transformation rules, while also incorporating statistical information for those generating errors. Specifically, they set as a prerequisite that the sum of log probabilities of the current token (i.e. w_i) in its immediate context (w_(i-1) and w_(i+1)) be lower than that of the proposed in-vocabulary word form (i.e. w_i*) for a replacement to take place. This condition is formally illustrated by the authors (Sidarenka et al., 2013) through the inequality below:

log(P(w_(i-1), w_i)) + log(P(w_i)) + log(P(w_i, w_(i+1))) <
log(P(w_(i-1), w_i*)) + log(P(w_i*)) + log(P(w_i*, w_(i+1)))        (2.1)
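The replacement condition of Sidarenka et al. (2013) can be sketched as follows, with toy unigram and bigram probabilities standing in for real corpus statistics (the German example words and all numbers are illustrative):

```python
import math

# Toy sketch of the replacement condition: substitute w_i with the
# in-vocabulary candidate w* only if the candidate scores a higher
# contextual log-probability. All probabilities are invented.
unigram = {"nicht": 1e-2, "nich": 1e-4}
bigram = {("das", "nicht"): 2e-3, ("nicht", "gut"): 3e-3,
          ("das", "nich"): 1e-5, ("nich", "gut"): 2e-5}

FLOOR = 1e-9  # crude smoothing for unseen events

def context_logprob(prev_w, w, next_w):
    """Sum of log P(prev_w, w) + log P(w) + log P(w, next_w)."""
    return (math.log(bigram.get((prev_w, w), FLOOR))
            + math.log(unigram.get(w, FLOOR))
            + math.log(bigram.get((w, next_w), FLOOR)))

def should_replace(prev_w, w, candidate, next_w):
    return context_logprob(prev_w, w, next_w) < context_logprob(prev_w, candidate, next_w)

print(should_replace("das", "nich", "nicht", "gut"))  # True
```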

Ruiz et al. (2014) rely on rule-based methods to normalize Spanish microtext, focusing on Twitter data. Making use of regular expressions, they create three different sets of rules for rule-based preprocessing that respectively handle a) abbreviations, b) tokens lacking whitespace as a delimiter and c) emoticons and OOV words with character elongations. For unresolved cases, the authors deploy minimum edit distance (Damerau, 1964), where frequently occurring “errors” are assigned a lower weight than rare ones. Beckley (2015) introduced a normalization approach where, among other components (a substitution list and a sentence-level ranker), a rule-based component was also included. More specifically, they created the “ing” and the “coool” rules to handle frequent categories of OOV tokens that require normalization in terms of non-standard endings and elongated characters respectively. Rule-based implementations have been deployed by further researchers (Pathak and Joshi, 2019; Samsudin et al., 2013) for normalization purposes.
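Regex rules of the three kinds described for Ruiz et al. (2014) might be sketched as below. The abbreviation list and the exact patterns are illustrative assumptions, not the authors' actual rules:

```python
import re

# Illustrative Spanish abbreviation expansions (not Ruiz et al.'s real list).
ABBREVIATIONS = {"q": "que", "xq": "porque", "tb": "también"}

def expand_abbreviations(text):
    """Rule set a): replace known abbreviations token by token."""
    return " ".join(ABBREVIATIONS.get(tok.lower(), tok) for tok in text.split())

def split_misjoined(text):
    """Rule set b): insert a space after a period glued to the next word.
    Naive: a real rule would guard against decimals like '3.5'."""
    return re.sub(r"\.(?=[^\s.])", ". ", text)

def collapse_elongations(text):
    """Rule set c): cap any run of 3+ identical characters at 2 ('coool' -> 'cool')."""
    return re.sub(r"(\w)\1{2,}", r"\1\1", text)

print(expand_abbreviations("xq no"))       # porque no
print(split_misjoined("vale.bien"))        # vale. bien
print(collapse_elongations("coool"))       # cool
```

Collapsing to two characters rather than one is a common compromise, since English and Spanish both have legitimate double letters; ambiguous cases still need a lexicon check afterwards.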

2.3.2 Levenshtein Distance

The Levenshtein distance (Levenshtein, 1966), also simply referred to as edit distance (Navarro, 1999), is an algorithm that measures the similarity between two strings a and b based on the number of character edits required (i.e. insertions, deletions, substitutions) to transform a into b. Although in its original formulation by Levenshtein (1966) all the edit operations were equally assigned the weight 1, today it is encountered in multiple variations and adjustments and is widely used in spell-checking programs for the correction of typographical errors. A formal representation of it can be seen in Figure 2.1. The Levenshtein distance has also found application within the context of the normalization task, where the conversion of a word into a canonical form is seen as the correction of its spelling. However, normalization differs from spell checking, since here the “misspellings” might be intended rather than typographical (Han and Baldwin, 2011; Peterson, 1980).
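A minimal dynamic-programming implementation of the algorithm with unit costs might look like this:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance with unit costs
    for insertion, deletion, and substitution (Levenshtein, 1966)."""
    m, n = len(a), len(b)
    # dist[i][j] = edits needed to turn a[:i] into b[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i            # delete all of a[:i]
    for j in range(n + 1):
        dist[0][j] = j            # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]

print(levenshtein("2getha", "together"))  # 4
```

The weighted variations mentioned above simply replace the constant costs with operation- or character-pair-specific weights.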

Figure 2.1: The Levenshtein distance algorithm summary by Pettersson (2016).

2.3.3 Comparison of Phonetic Similarity

Given the increasing and irregular variance in spelling, especially in user-generated texts, phonetic algorithms are deemed advantageous due to the possibility of directly mapping multiple spelling variations of the same word to their in-vocabulary (IV) counterpart through their similar or common pronunciation, as shown in Figure 2.2.

Figure 2.2: OOV spelling variances of the word “together” from the Edinburgh Twitter Corpus by Liu et al. (2011).

For instance, taking “2getha”, one of the most frequent non-standard spelling variants of the word “together” as shown in Figure 2.2, its edit distance from “together” would be 4 (2 substitutions and 2 insertions) when using edit-distance algorithms. On the contrary, deriving the pronunciation [tuːˈɡɛðə(r)] for “2getha”, its similarity to [təˈɡɛðə(r)], the IV word form, is more apparent and can easily lead to the normalization “together”. In social media text normalization, Han et al. (2013) make use of a phonetic algorithm (Philips, 2000) to get the phonetic encoding of an OOV word and search for its IV candidate based on it. Ahmed (2014), on the other hand, proposes the combination of Soundex, refined Soundex or other phonetic algorithms along with edit-distance ones to maximize results for the lexical normalization of tweets. Most phonetic algorithms that exist today are based on Soundex, which was designed for the normalization of names by producing a phonetic representation consisting of the initial letter of a word and three digits. Assuming the aim would be to normalize Erique to Eric, the following Soundex steps would be followed (Knuth, 1998): a) Identify and retain the initial letter of the token, here “E” b) Drop all vowels and the semivowels “h”, “y” and “w”, therefore “Erq” c) Substitute the consonants, except for the initial letter, with numbers based on the defined correspondence (see Table 2.2), therefore “E62”, which is identical to the code for “Eric”. Relying on Soundex, similar algorithms were introduced, such as Metaphone (Philips, 1990), DoubleMetaphone (Philips, 2000) and NYSIIS (Rajkovic and Jankovic, 2007). However, phonetic models like Soundex cannot deal with number occurrences, as they are designed to only recognise letters of the alphabet. Consequently, Soundex would not be useful in the normalization of the example word “2getha” provided above.

Letters to be substituted    Substitution number
b, f, p, v                   1
c, g, j, k, q, s, x, z       2
d, t                         3
l                            4
m, n                         5
r                            6

Table 2.2: Soundex substitution of consonant classes with numbers by Knuth (1998).
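The three steps above can be sketched as a simplified Soundex. This version omits the zero-padding and adjacent-code collapsing of the full algorithm, but reproduces the E62 example:

```python
# Consonant classes from Table 2.2.
CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
         **dict.fromkeys("dt", "3"), "l": "4",
         **dict.fromkeys("mn", "5"), "r": "6"}

def simple_soundex(word):
    """Simplified Soundex: keep the initial letter, drop vowels and the
    semivowels h/y/w, and map the remaining consonants to digits."""
    word = word.lower()
    code = word[0].upper()
    for ch in word[1:]:
        code += CODES.get(ch, "")  # vowels and h/y/w map to nothing
    return code

print(simple_soundex("Erique"))  # E62
print(simple_soundex("Eric"))    # E62
```

As the text notes, a digit-bearing token such as “2getha” would defeat this scheme, since “2” is neither the retained initial letter of an alphabetic word nor a mappable consonant.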

2.3.4 Statistical Machine Translation and Neural Methods

Statistical machine translation (SMT) systems rely on statistical information derived from bilingual corpora to render a source text into the target language. In text normalization, the SMT approach sees normalization as a translation task where the non-standard word form is the source and the normalized version the target text. Aw et al. (2006) were among the first to deploy this method for the normalization of noisy user-generated texts, such as SMS messages, achieving good results. Later, focusing on the Twitter domain, Kaufmann (2010) introduced a model consisting in the preprocessing of tweets and the training of an SMT system. The main challenge faced in the implementation of SMT systems is the lack of available parallel corpora, i.e. user-generated texts along with their normalized version, to train the systems (Han and Baldwin, 2011). A partial alleviation to this has been provided by the proposal of character-based SMT, which requires considerably fewer annotated resources for training, as the statistical model is based on characters and the character-based context, which automatically results in more data and fewer possible combinations to learn – since letters are limited, while words are not – as opposed to word-based and phrase-based SMT. This approach has found application mainly in the normalization of historical texts, as in Scherrer and Erjavec (2013), Pettersson (2016) and Korchagina (2017b). Neural approaches have also been deployed for lexical normalization, delivering promising results. Bollmann and Søgaard (2016) introduce a character-level bi-LSTM neural network model combined with multi-task learning for the lexical normalization of Early New High German texts. On Twitter data, there have been attempts to exploit neural approaches as well, while combining them with existing ones (Goker and Can, 2018; Leeman-Munk, 2016).

2.4 Greek Language and Greek Tweets

Although the Twitter domain has been studied and approached from a normalization point of view for several languages, e.g. English (Han and Baldwin, 2011), German (Sidarenka et al., 2013), Spanish (Ruiz et al., 2014) and Turkish (Eryiğit and Torunoğlu-Selamet, 2017), no work has been done concerning the Greek language, to the best of our knowledge. Most research regarding Greek tweets has been focusing on main NLP tasks, such as sentiment analysis (Kalamatianos et al., 2015), detection of irony (Charalampakis et al., 2015) etc. The preprocessing of Twitter texts

in terms of normalization has not attracted much attention, despite its importance for subsequent NLP tasks.

2.4.1 Overview of Greek Spelling

The Modern Greek alphabet consists of 24 letters, and the language has 5 vowel sounds. Three of these vowels have more than one graphic representation (Daniels and Bright, 2010; Protopapas et al., 2012), e.g. /i/ can be spelled as ι, η, ει, οι or υ with no phonological differences, the choice of which is prescribed by grammatical rules or is predetermined by etymology and historical orthography (Pittas and Nunes, 2014). All vowels can also appear with diacritics, either the accent mark (´), e.g. ά, or the diaeresis (¨), which can only be used with ι and υ, i.e. ϊ and ϋ; the omission or misuse of these can result in unorthographical and consequently non-standard word forms or lead to changes in meaning, e.g. /'nomos/ (νόμος, law) vs. /no'mos/ (νομός, county). Moreover, the accent mark is obligatory – as well as pronunciation-motivated and therefore predictable – for all lowercased polysyllabic word forms (with few exceptions), e.g. /kan/ (καν, even) but /'kano/ (κάνω, to do) (Ktori et al., 2008).

2.4.2 Linguistic Phenomena in Greek Tweets

Similar to tweets in English and most languages, tweets in Greek also exhibit a frequent use of non-standard word forms (e.g. ευχαριστώωω instead of ευχαριστώ, thanks) and abbreviations (τλκ instead of τελικά, finally), irregular use of punctuation (e.g. Κρίμα μόλις έφυγε!!! instead of Κρίμα, μόλις έφυγε., pity, he/she just left) and capitalization (Δεν ΞΕΡΩ instead of Δεν ξέρω, I don’t know), loose sentence structure and so forth. However, in view of the language-specific characteristics, there are also phenomena that are specific to Greek. A notorious example is Greeklish, a term that describes Greek transliterated with Latin characters through a correspondence based on visual similarity (e.g. substituting χ with x, δ with d, etc.); the phenomenon is also discussed by Chalamandaris et al. (2006).

(5) Greek: Δεν νομίζω να έρθω. Greeklish: Den nomizw na erthw. ’I don’t think I will come.’

Apart from Greeklish, code-switching is also a common phenomenon in computer-mediated communication (Georgakopoulou, 2007), where Greek Twitter users mainly alternate between English and Greek. Users often switch between the two languages, either intentionally or because certain words are not as widespread in Greek (Balamoti, 2010). Such sentences may get even more complex to approach computationally when neologisms are also included.

(6) Σκρολάρω και κάνω share διάφορα posts scrolling.neologism and doing share.EN various posts.EN ’I am scrolling and sharing various posts.’

The opposite phenomenon has also been observed, where English is transliterated into Greek letters, spanning either part of or the whole sentence.

(7) Θενκ γιου για το σχόλιο
    thank.EN you.EN for.EL the.EL comment.EL
    ’Thank you for the comment.’

Another commonly observed feature of Greek tweets (or social media texts in Greek generally) is the frequent omission of the accent mark, resulting in OOV forms. For instance, /simfoˈno/ (συμφωνώ, to agree) could appear as συμφωνω, where the final vowel ω lacks the accent mark (ώ). As such, the word counts towards the non-standard word forms requiring normalization.
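The effect of the missing accent mark can be made concrete with a minimal sketch (not part of the thesis's system): Unicode decomposition separates a Greek letter from its combining accent, so stripping the combining marks shows why an unaccented form no longer matches its dictionary entry, and why accent removal also collapses minimal pairs like νόμος/νομός.

```python
import unicodedata

def strip_accents(word: str) -> str:
    """Remove Greek accent marks by decomposing characters (NFD)
    and dropping the combining marks, then recomposing (NFC)."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

print(strip_accents("συμφωνώ"))                          # συμφωνω
print(strip_accents("νόμος") == strip_accents("νομός"))  # True: the pair collapses
```

The second line illustrates the disambiguation problem described above: once the accent is gone, law and county become the same string.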

2.5 Part-of-Speech Tagging

Part-of-Speech (POS) tagging is the process of assigning to each word in a sentence the corresponding part-of-speech label (Jurafsky and Martin, 2009). Labeling words with their POS tags is not a straightforward process, as the same word may fall into a different word category depending on the context in which it appears. Therefore, efficient taggers use contextual information to resolve ambiguity. Consider “Answer to me” vs. “I gave an answer”, where the word “answer” fulfills a different syntactic role in each case and requires the tag VERB in the first and NOUN in the second expression. POS tagging is a key step in the NLP pipeline as it facilitates other NLP tasks and applications, such as word sense disambiguation, lemmatization, statistical machine translation and language modelling (Jurafsky and Martin, 2009). There are various methods employed for tagging. Some of them are:

• rule-based methods that assign a POS tag based on rules that exploit linguistic knowledge (Brill, 1992)

• probabilistic methods using CRF (Conditional Random Field) or HMM (Hidden Markov Model) that are based on the probability of a sequence of tags occurring in order (Kupiec, 1992)

• lexical-based methods that rely on the frequency of a word and its tag in the training corpus (Hall, 2003)

There are also approaches that make use of neural networks for tag prediction (Schmid, 1994). When it comes to user-generated texts, a drop in tagging performance has been noted, which can in turn negatively affect the performance of subsequent NLP tasks relying on the tagging results. As such, normalization is deemed a crucial preprocessing step for non-standard texts, such as tweets, prior to POS tagging (Van der Goot et al., 2017). Therefore, we perform an extrinsic evaluation of the system with regard to POS tagging, since it is among the first tasks in the NLP pipeline, and any positive effect on it could conceivably lead to better performance of subsequent tasks.
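The context-dependence of tag assignment can be illustrated with a deliberately tiny toy tagger (this is not any of the cited methods, and the lexicon and rule are invented for the example): an ambiguous word such as "answer" is resolved by looking at the tag of the preceding token.

```python
# Toy lexicon: each word maps to its possible tags, most likely first.
LEXICON = {
    "answer": ["VERB", "NOUN"],
    "an": ["DET"],
    "i": ["PRON"],
    "gave": ["VERB"],
    "to": ["ADP"],
    "me": ["PRON"],
}

def tag(tokens):
    """Pick the first listed tag, except that an ambiguous word
    directly after a determiner is tagged NOUN (context rule)."""
    tags = []
    for i, tok in enumerate(tokens):
        options = LEXICON.get(tok.lower(), ["X"])
        if len(options) > 1 and i > 0 and tags[i - 1] == "DET":
            tags.append("NOUN")
        else:
            tags.append(options[0])
    return tags

print(tag("Answer to me".split()))      # ['VERB', 'ADP', 'PRON']
print(tag("I gave an answer".split()))  # ['PRON', 'VERB', 'DET', 'NOUN']
```

The same surface form "answer" receives VERB in one sentence and NOUN in the other, which is exactly the ambiguity real taggers resolve with richer context models.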

3 Data & Resources

3.1 The Dataset

The dataset1 used in this project is a collection of Greek tweets that was retrieved – originally for a project concerning sentiment analysis of Greek tweets – and made publicly available for research2 purposes by a team of researchers at the Democritus University of Thrace in Greece, and was presented in Kalamatianos et al. (2015). It was extracted via the Twitter API using the Python programming language. An overview of the dataset statistics is provided in Table 3.1.

Size                     832.1 MB
Number of Tweets         4,373,197
Number of Users          30,778
Number of Hashtags       54,354
Hashtags (>1000 tweets)  41
Time Span                2008–2014

Table 3.1: Statistics of the Collection of Tweets by Kalamatianos et al. (2015) used as dataset in the present work.

Although the tweets span from 2008 to 2014, we did not consider it necessary to work with a more recent dataset as the phenomena we were expecting to encounter partly pertain to the nature of the Greek language itself (e.g. omitting the stress mark) and partly to practices associated with online text generation (e.g. abbreviating words due to character brevity that Twitter imposes, faster typing etc.).

3.2 Resources

3.2.1 Hunspell

Hunspell is a spell-checking library that supports several languages with a lexical writing system. It is suitable even for languages with rich morphology, complex compounding and character encoding. Except for spell checking through dictionary lookups, it also provides the possibility to perform tokenization, stemming, morphological analysis and more. It is freely available for download and licensed under the LGPL/GPL/MPL tri-license (Németh, 2008). In the present project, it was only used as a lexicon, in order to exclude from normalization words that were IV. The Hunspell dictionary for Greek has 828,805 entries. Except for common words, it also includes a wide array of proper names (mainly Greek ones) and acronyms, rendering it suitable for classifying even words from these categories as in-vocabulary, which along with its size justifies our choice.
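The role Hunspell plays here reduces to a membership test: a token is in-vocabulary (IV) if the dictionary accepts it, otherwise it is out-of-vocabulary (OOV) and becomes a normalization candidate. The sketch below illustrates that gate with a toy set standing in for the 828,805-entry Greek dictionary (a real run would query Hunspell itself, e.g. through its Python bindings).

```python
# Toy stand-in for the Greek Hunspell dictionary (assumed entries).
LEXICON = {"είμαι", "καλά", "Ελλάδα", "ΕΚΠΑ"}

def is_iv(token: str) -> bool:
    """A token is in-vocabulary if it matches a dictionary entry exactly."""
    return token in LEXICON

tokens = ["ειμαι", "καλά"]
oov = [t for t in tokens if not is_iv(t)]
print(oov)  # ['ειμαι'] – the missing stress mark makes the form OOV
```

Note that ειμαι fails the lookup solely because of the missing accent, which is precisely the most frequent OOV category identified later in the analysis.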

1Link to the dataset used in this project: http://hashtag.nonrelevant.net/downloads.html
2The tweet collection was the work of the undergraduate students Kalamatianos Georgios and Mallis Dimitrios, Ph.D. candidate Symeon Symeonidis and Assistant Professor Avgerinos Arampatzis at the Democritus University of Thrace.

3.2.2 PANACEA n-gram Corpus

The PANACEA n-gram corpus for Greek consists of Greek n-grams and word/tag/lemma n-grams along with their corpus frequency. Specifically, the n-grams range from unigrams to 5-grams. The data is from the domain “environment” and was collected through a web crawl within the framework of the EU project PANACEA.3 The corpus is available under the CC-BY-SA licence. Despite the different domain, the unigram and bigram frequencies were utilized in the project for the calculation of the context probabilities at the stage of candidate selection, which will be described thoroughly in chapter 5. An overview of the specifics of this corpus is shown in Table 3.2.

Tokens     31.71 million
Sentences  1,185,312
Unigrams   435,189
Bigrams    3,860,716
Trigrams   9,767,383
4-grams    13,683,940
5-grams    14,954,020

Table 3.2: Statistics of the PANACEA n-gram Corpus.

In this project, we made use only of the unigram and bigram frequencies, as required in our calculations. Despite the difference in domain (i.e. environment vs. tweets), we decided to rely on this n-gram corpus due to the high number of the frequencies provided and the assumption that these would remedy the difference in terms of domain.

3.2.3 UDPipe

Today, there are many algorithms and tools readily available for tagging texts in several languages, such as UDPipe. It is a web-based tool which can be used to tag, lemmatize or parse text, but also to train taggers, lemmatizers and dependency parsers on new corpora, while it also allows choosing the exact model, all of which are based on the Universal Dependencies framework. It comprises the 17 core tags as defined within that framework (Straka and Straková, 2017). This tool will be used in this project for the extrinsic evaluation of our rule-based normalization system. Our choice was motivated by the diversity of texts used for training UDPipe for Greek, as opposed to a single Twitter-irrelevant genre (the underlying treebank, UD Greek GDT, includes news text, Wikipedia entries and spoken language), and by the fact that the majority of the POS tags occur in it (16 out of 17 tags). In the project, a test text will be tagged before and after normalization in order to evaluate the effect of normalization of social media texts on the tagging accuracy.

3Link to PANACEA project: http://panacea-lr.eu/

4 Preprocessing

4.1 Cleaning the Dataset

The dataset originally consisted of a high number of retweets, which were all removed so as to deal exclusively with unique content. As a result, the size of the dataset shrank from 4,373,197 tweets to 1,918,105. Following this, we removed all instances of links (https://example.com), as well as hashtags (#example) and mentions (@example) occurring at the very beginning or end of a tweet. While links were removed invariably, hashtags and mentions were left untouched when appearing amid a sentence, so as to avoid removing elements that might be part of the sentence structure. As for emojis and emoticons, these were removed regardless of their position in the tweet, their presence being irrelevant to lexical normalization. Repeated punctuation marks (e.g. !!!!) were reduced to one, even in the case of ellipsis. Finally, tweets missing a final punctuation mark were given one, in order to prevent potential errors in sentence segmentation. Figure 4.1 provides an overview of these steps.

Figure 4.1: Preprocessing Steps.
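The cleaning steps above can be sketched with a few regular expressions. The patterns below are assumptions for illustration, not the thesis's actual code, and they simplify some details (e.g. emoji removal is omitted): links are removed anywhere, hashtags and mentions only at the edges of the tweet, repeated punctuation is collapsed, and a final punctuation mark is ensured.

```python
import re

def clean_tweet(tweet: str) -> str:
    tweet = re.sub(r"https?://\S+", "", tweet)        # links, anywhere
    tweet = re.sub(r"^([#@]\S+\s*)+", "", tweet)      # leading hashtags/mentions
    tweet = re.sub(r"(\s*[#@]\S+)+\s*$", "", tweet)   # trailing hashtags/mentions
    tweet = re.sub(r"([!;,.?])\1+", r"\1", tweet)     # !!!! -> !  (also Greek ';')
    tweet = tweet.strip()
    if tweet and tweet[-1] not in ".!;?":
        tweet += "."                                  # ensure a final punctuation mark
    return tweet

print(clean_tweet("@filos Κρίμα μόλις έφυγε!!! https://example.com #twitter"))
# Κρίμα μόλις έφυγε!
```

Hashtags and mentions in the middle of the text deliberately survive this function, mirroring the decision to keep potentially syntactic material intact.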

Our motivation behind these cleaning techniques is that Twitter-specific phenomena (e.g. emojis) and noise in the form of extra whitespace or repeated punctuation would not only fail to enhance POS-tagging performance, but would also obstruct measuring the actual normalization performance of the system. Cleaning the data in this way – the straightforward removal of what could lead to noise – is a common process within NLP, especially when working with noisy texts. Therefore, it is with a clean dataset that the normalization and tagging performance were tested.

4.2 Test Set

A random sample of 500 tweets was extracted from the dataset and reserved for testing purposes. Its annotation was conducted at the very end, after the completion of the implementation. Each tweet was manually normalized with regard to the normalization scope of the present work. Consequently, out-of-scope OOV words and potentially problematic features were not corrected, in order to be able to measure the efficiency of the system by the same standard. Our annotated test set is therefore not a universal gold standard, as its normalization degree is constrained by the scope definition of the project.

4.3 Systematic Analysis of Greek Tweets

In order to gain an overview of the specifics of Greek Tweets and create an outline of the OOV word groups that would be targeted by the normalization system, we performed a manual analysis of 1,000 randomly selected tweets from the dataset (disjoint from the preselected and set aside test set).

4.3.1 Sentence Structure

The vast majority of tweets exhibit a relatively loose sentence structure, such that the verb is often elided, interjections are frequently used and interposed, while sentence boundaries are sometimes not detectable due to the lack of periods. Example 1 below illustrates this.

(1) Original tweet: χαχαχ δε ξερω γιατι παει το χασα τωρα την επομενη παλι!
    Potential boundaries: χαχαχ | δε ξερω γιατι | παει το χασα τωρα | την επομενη παλι!
    English: hahaha | don’t know why | now it’s gone though | next time again!

4.3.2 Lower/Upper Case

As with punctuation, inconsistent capitalization was another common phenomenon. For instance, words were not always capitalized when at the beginning of a sentence or following a full stop. More rarely, this practice was also observed with proper names. Furthermore, there were lowercase instances of acronyms, while on the contrary there were posts consisting of common words in capitalized letters. Table 4.1 provides examples of unconventional character cases.

4.3.3 Neologisms

Neologisms, as words recently coined into the vocabulary of a language, may also count towards OOV words in a tweet. Some of them have already made it into spelling dictionaries (e.g. the word ίντερνετ for internet – although there already exists the Greek variant διαδίκτυο). However, there are others that often constitute part of slang words and expressions, such as σετάρω (to match), and have not yet been included in official lexical resources.

Category        Example                                                      Standard Form
Initial letter  χθες έδειξαν την ταινια (yesterday they showed the movie)    Χθες...
Proper names    δεν έχω δει την μαρια (I haven’t seen mary)                  ...Μαρία
Acronyms        περασα στο εκπα ευτυχως (I got admitted to ekpa thankfully)  ...ΕΚΠΑ...
Capitalization  ΠΟΛΛΑ ΜΠΡΑΒΟ (MANY CONGRATS)                                 πολλά μπράβο

Table 4.1: Instances of non-standard case (upper vs. lower).

4.3.4 Greeklish and Engreek

As a general phenomenon of Greek online texts, Greeklish – Greek words written with Latin characters – is present in Twitter texts as well. Usually, as revealed by the analysis of example tweets, users opt to encode the entire Greek tweet with Latin characters instead of just part of it. Some such posts may also contain English words or exhibit further features of non-standard writing, such as character elongations, e.g. klaiwwww (“cryinggg”, usually standing for “I am laughing out loud”). Not as common, but nevertheless sporadically present, are English words or expressions transcribed with Greek characters. Most often, this is the case with single words which are syntactically fully integrated in the otherwise Greek sentence.

(2) με έκανε φόλοου me do.3rd.SG.PST follow ’He followed me (on social media)’

In the above example, the word φόλοου is inevitably an OOV word: although seemingly Greek, it is a foreign word (follow) and will therefore be treated as an unknown token by the system.

4.3.5 Stress, Contractions, Elongations, Space

No stress mark: By far the majority of the non-standard tokens detected in Greek tweets are those lacking the stress mark (e.g. ειμαι – I am – as opposed to είμαι), which is an integral part of the correct spelling of a word. The high frequency of such stress-lacking forms suggests that they are intentional (probably owing to the typing brevity gained by omitting the extra keystroke for stress). As a result, the number of non-standard word forms increases, as there is no direct match with dictionary entries. Bringing the stress back is not always straightforward: depending on its placement – above one of the vowels – it can result in more than one valid word, as explained in section 2.4.1. Such instances appear in a more challenging form when spelling errors (usually unintended ones) are also present, e.g. διαλειμα (break), where the conversion into a standard form would mean converting α to ά and adding an extra μ, therefore: διάλειμμα.
Elongations: Vowel or even consonant elongation, at any position within the word, is another feature that renders tokens non-standard. In Greek, up to two repetitions of certain vowels and consonants can be valid, meaning that any repetitions above that boundary have to be truncated. There are also instances of words that incorporate more than one feature responsible for their irregular form, such as those that both lack the stress mark and contain an elongated character, e.g. επιτεεεελους (finally),

in which case the superfluous repetitions of ε would need to be removed, while the remaining one would need to bear the stress mark and consequently become έ. Table 4.2 provides some more examples of this and other categories.

Category                        Example                          Standard Form
No stress mark                  χρονια (a. years / b. year)      χρόνια (a) or χρονιά (b)
(disambiguation may be needed)  φορα (a. time / b. momentum)     φορά (a) or φόρα (b)
                                ειναι (it is)                    είναι
Elongations                     οοοοολα (everything)             όλα
(any position)                  ναιιιιιιι (yes)                  ναι
                                βεεεεεβαια (of course)           βέβαια
Contractions                    θα χε (must have had)            θα είχε
(lacking apostrophe)            το ξερα (I knew it)              το ήξερα
                                τα γραψα (I wrote them)          τα έγραψα
No space                        δενξέρω (don’t know)             δεν ξέρω
(accidentally joined words)     νακαναμε (to do)                 να κάναμε
                                πεςτης (tell her)                πες της

Table 4.2: Categories of non-standard word forms.

Contractions: Another common phenomenon is word contractions appearing without the obligatory apostrophe that denotes the contracted form. This concerns mainly verbs in the past tense, when the past form starts with a vowel. This vowel is then elided because the preceding particle or weak personal pronoun also ends in a vowel (e.g. το χα instead of το είχα, θα φυγε instead of θα έφυγε, etc.).
No space delimiter: Probably owing to the hastiness that characterizes the composition of most online posts, there often appear two or more words lacking a space as a delimiter. As a result, they count as a single word with a non-standard form. For instance, πεςμου (tellme) instead of πες μου (tell me). As with other phenomena, here as well the new token gets more challenging when one of the two forms, if not both, also lacks the obligatory stress, e.g. δενειπα (instead of δεν είπα), in which case it is not straightforward for the system to identify the individual lexical units.

4.3.6 Non-Standard Abbreviations

Except for the official abbreviations that exist in a language, Twitter users seem to like using non-standard word abbreviations as well. Just like several of the characteristics that account for the non-standard nature of OOV tokens, informally established abbreviations are quicker to type, which may explain their popularity. Based on the sample tweets, non-standard abbreviations can be distinguished into two groups: a) initialisms and b) disemvoweled cases. An example of an initialism in Greek is οτκ (ό,τι καλύτερο, simply the best), where the original expression has been replaced with the initial letters of each of its words. Similar examples in English are lmk (let me know) and lol (laughing out loud). Contrary to English users, who are prolific with initialisms, not that many seem to exist in Greek. However, some widespread English initialisms, such as lol, are used by Greek users as λολ, or with varying elongation λοοολ, where the initialism has only been graphemically localized. Disemvoweled words, i.e. words that have been stripped of their vowels and left only with consonants, are more common in Greek tweets. An indicative short list with disemvowelments (and semi-disemvowelments) is provided in Table 4.3.

No vowels  Transc.  Variants  Expansion  Translation
κ          /k/      –         και        and
σμρ        /smr/    σμρα      σήμερα     today
τλκ        /tlk/    τλκα      τελικά     eventually/finally
τπτ        /tpt/    τπτα      τίποτα     nothing
κλμ, κλμρ  /klmr/   κλμρα     καλημέρα   good morning
τν         /tn/     –         την/τον    the: def. article, accusative, Fem/Masc
π          /p/      –         που        where

Table 4.3: Non-standard abbreviations: (semi)disemvowelment examples.

4.3.7 Misspellings

Misspelled words are another example of non-standard tokens. Misspellings can be either intended or – most of the time – unintended, and the resulting forms can be classified as a) valid dictionary entries or b) random misspellings. For instance, in the tweet ποιο συχνά (which often), although both words are IV entries, the first one, ποιο (which), is, as the context suggests, not intended and counts rather as a misspelling of the presumably intended word πιο (more). An equivalent example in English would be writing their is instead of there is. Random misspellings are the ones occurring either due to a typo or due to lack of knowledge of spelling rules or of the unique and specific spelling of a word. In Greek, most random misspellings concern the spelling of the vowels /i/, /e/ and /o/, which despite identical pronunciation can be transcribed in more than one way, e.g. επιρεάζω instead of επηρεάζω /epireazo/, where /i/ has been misspelled as ι instead of η.

4.4 Scope Definition

In view of the identified OOV word categories, only certain of these have been included in the normalization scope. The main criteria for selection were their relevance to the normalization task and their frequency of occurrence in the analyzed sample, along with the assumption that their non-standard form could lead to incorrect assignment of the POS tag. For instance, while code-switching was a quite common phenomenon, foreign words were not selected for normalization, as rendering them into a “standard” form would most likely mean translating them, which is by itself a different task. Similarly, instances of Greeklish and Engreek were left untouched, while neologisms were tackled only partially, by manually enriching the dictionary with a few more entries, due to the lack of machine-readable resources on neologisms. As for misspellings, those that accidentally result in IV forms are not handled by the system either (as opposed to those resulting in OOV word forms). Last, loose sentence structure is not dealt with, since the focus of the present work is normalization on the lexical level. Therefore, the performance of the system will not be tested with regard to the aforementioned categories, which consequently have been left as is in the annotated test set as well. In summary, the categories below are in scope for normalization: a) Words lacking the stress mark b) Elongations c) Contractions

d) Non-standard abbreviations e) True-casing (to some extent) f) Misjoined words (to some extent) g) Misspellings (only when OOV)

The order in which these categories are handled by the system is crucial to correct normalization; more details about this follow in chapter 6.

5 System Architecture

Our system consists mainly of a series of rule-based components which hierarchically attempt to convert an OOV word into its standard form. Our choice is motivated by the lack of annotated (i.e. normalized) Greek user-generated data, which would be required to consider other approaches, such as neural machine translation (NMT), but also by our assumption that linguistically motivated rules and methods can efficiently tackle OOV tokens bearing commonly used non-standard features. This is why we make use of regular expressions, self-created dictionary mappings and rules which, among others, exploit linguistic information. The very last rule, which calculates the edit distance of an OOV token from certain – depending on the defined conditions – dictionary entries, is only activated when none of the preceding components has produced a normalized variant. As can be seen in Figure 5.1, which provides an overview of the system’s main components, the OOV token is handled step by step and either returned normalized, if converted into an IV word, or passed on further when no match could be produced.

Figure 5.1: Overview of the System Architecture.

The order of the components is relevant to the normalization process and performance, as it determines how likely it is for a token to be handled properly. For instance, checking for a match with non-standard abbreviations is only meaningful when no changes have yet been performed on the OOV token, such as truncation of elongated characters, which could otherwise result in no match with the dictionary entries. Similarly, restoring missing stress on an OOV token as an attempt to convert it into an IV entry has no function if placed after the last component calculating the edit distance, as that rule handles stress as well. However, if only stress is missing, the stress restoration rule is the better choice, as it is more time-efficient and also makes it more likely to retrieve the intended IV entry, since the changes performed are straightforward.

The system can be divided into three main modules.

5.1 Module 1: Regular Expressions

In this module, regular expressions – the main technique – scan the tweets to detect patterns and perform substitutions, expansions and other modifications. At this stage, the tweets are not tokenized, but rather handled as plain text.

5.1.1 Non-Standard Abbreviations

The first step for the system is to examine whether the OOV words in the tweet are listed in the dictionary of non-standard abbreviations. This dictionary contains 60 entries, which we collected through the analysis of 1,000 random tweets. For instance, if the OOV word is τλκ and found in the aforementioned dictionary, it will be replaced with the corresponding standard form τελικά (finally).
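A lookup of this kind can be sketched as a word-boundary-aware substitution over the raw tweet text. The three entries below are taken from Table 4.3; the full 60-entry dictionary is not reproduced, and the pattern-building is an assumption about the implementation, not the thesis's actual code.

```python
import re

# Small excerpt of the 60-entry abbreviation dictionary.
ABBREVIATIONS = {"τλκ": "τελικά", "τπτ": "τίποτα", "κ": "και"}

def expand_abbreviations(tweet: str) -> str:
    """Replace whole-word occurrences of known abbreviations."""
    pattern = r"\b(" + "|".join(ABBREVIATIONS) + r")\b"
    return re.sub(pattern, lambda m: ABBREVIATIONS[m.group(1)], tweet)

print(expand_abbreviations("τλκ δεν ήρθε"))  # τελικά δεν ήρθε
```

The `\b` boundaries matter: single-letter entries such as κ (και, and) must only match as standalone tokens, not inside other words.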

5.1.2 Elongations

Next in the process is the handling of character elongations. Generally, it is legitimate to encounter up to two repetitions of the same character in Greek. However, different conditions apply to each letter, which is why we distinguish between four different categories:
a) A small group of characters which can never appear more than once in sequence, regardless of the position of their occurrence in a word, i.e. [δζθξφχψΔΖΘΞΦΧΨ]. These are always truncated to 1 if elongated. For instance, the OOV όχχι would be normalized as όχι (no), but the IV entry γράμμα (letter) would, as intended, be left as is, as μ is excluded from this group.
b) A group of all Greek characters when at initial position in a word. In this case, repetitions are always truncated to 1. For instance, δδδεν would be captured and rendered as δεν (not).
c) A group of all Greek characters when at final position in a word. In this case, sequentially repeated characters are truncated to 1, with a few exceptions (e.g. when the word ends with α, ε or ο): for instance, the word αέναα (continuous) will remain as is, while ναιιι will be normalized to ναι (yes).
d) A group of all Greek characters that are truncated to 2 instances when in a middle position, if occurring more than twice in sequence, based on the idea that usually only up to two repetitions can be legitimate in Greek. This will not always produce a normalized variant, but in most cases it will reduce the number of superfluous characters to a minimum, which at a later stage may facilitate the match to an IV entry. For instance, the OOV word καλλλλό will be modified to καλλό, which is only one edit away from the standard counterpart καλό (good).

5.1.3 Contractions

Contractions are next to be searched for. However, the focus lies only on verbal contractions. The contracted forms can be either in the present or in the past tense, in which case the initial vowel is omitted. This is in most cases grammatically legal as long as the contracted forms are preceded by an apostrophe. Nevertheless, we capture

all forms regardless of the presence of the apostrophe, under the assumption that the full verbal form would later on enhance POS-tagging. In order to capture only the forms in scope, we define the context to the left by listing the relevant pronominal clitics and particles, as in: ([σΣ]?[τΤ][οα]|θα|[μΜσΣτΤ]ου). For instance, το χω or το’χω (I’ve it, in the sense of I got it) gets expanded into το έχω (I have it).
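The left-context pattern quoted above can be combined with a mapping from contracted to full verb forms; the sketch below shows this with a hypothetical three-entry mapping (the thesis does not list its full set), matching the form with or without the apostrophe.

```python
import re

# Hypothetical mapping from contracted to full verb forms.
CONTRACTIONS = {"χω": "έχω", "χα": "είχα", "ξερα": "ήξερα"}
# Left context: pronominal clitics and particles, as quoted in the text.
LEFT_CONTEXT = r"([σΣ]?[τΤ][οα]|θα|[μΜσΣτΤ]ου)"

def expand_contractions(tweet: str) -> str:
    """Expand verbal contractions whether or not an apostrophe is present."""
    pattern = LEFT_CONTEXT + r"\s*[’']?\s*(" + "|".join(CONTRACTIONS) + r")\b"
    return re.sub(pattern,
                  lambda m: f"{m.group(1)} {CONTRACTIONS[m.group(2)]}",
                  tweet)

print(expand_contractions("το χω δει"))  # το έχω δει
print(expand_contractions("το’χω δει"))  # το έχω δει
```

Requiring the clitic/particle context keeps the rule from firing on arbitrary character sequences that merely resemble a contracted verb.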

5.1.4 Misjoined Words

Incorrectly joined words, i.e. words lacking a space as a delimiter, are tackled by the system only to some extent. At this stage the tweet is tokenized, and a rule checks whether the OOV word in question is made up of two valid (IV) sequential substrings; if yes, it inserts a space between them. For instance, the accidentally joined words διαφορετικέςκαι are returned as διαφορετικές και (different and). However, should the string bear additional irregularities, e.g. lack of stress or a misspelling, as in διαφορετικεςκαι, the rule will not deal with it, as διαφορετικες is not an IV entry, although και is.
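The split check is a linear scan over possible break points: at each position, both substrings must be IV for a split to be accepted. The sketch below uses a toy lexicon standing in for the Hunspell dictionary.

```python
# Toy lexicon standing in for the Hunspell dictionary.
LEXICON = {"διαφορετικές", "και", "δεν", "ξέρω"}

def split_joined(token: str) -> str:
    """Return 'left right' if the token splits into two IV words,
    otherwise return the token unchanged."""
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in LEXICON and right in LEXICON:
            return f"{left} {right}"
    return token

print(split_joined("διαφορετικέςκαι"))  # διαφορετικές και
print(split_joined("διαφορετικεςκαι"))  # unchanged: διαφορετικες is OOV
```

The second call reproduces the limitation noted above: a missing stress mark in one half blocks the split, since that half fails the IV check.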

5.1.5 Truecasing

Truecasing checks whether an OOV word is in the dictionary in a different case than the one it appears in. For instance, acronyms that incorrectly appear lowercased are successfully matched against the IV entry when tested in their uppercase variant and returned as such (ικα changed into ΙΚΑ). Furthermore, proper names that do not occur with an uppercase initial letter are similarly matched to their standard case form (e.g. άννα into Άννα). Finally, common words appearing in uppercase are converted into their lowercase form (e.g. ΝΑΙ into ναι). However, simply lowercasing uppercase occurrences of common words and rendering them that way as IV words is possible only to a limited extent, due to the missing stress mark (obligatory as it is for almost all polysyllabic lowercase words in Greek, while superfluous in uppercase variants): e.g. ΒΙΒΛΙΟ is not automatically IV through βιβλιο, but rather through the stress-bearing βιβλίο, a step included in the next module.
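The case variants described above can be tried in a fixed order against the dictionary; the sketch below illustrates this with a toy lexicon (the probing order is an assumption for the example).

```python
# Toy lexicon: an acronym, a proper name, and a common word.
LEXICON = {"ΙΚΑ", "Άννα", "ναι"}

def truecase(token: str) -> str:
    """Return the dictionary-cased variant of the token, if one exists."""
    for variant in (token.upper(), token.title(), token.lower()):
        if variant in LEXICON:
            return variant
    return token

print(truecase("ικα"))  # ΙΚΑ
print(truecase("ΝΑΙ"))  # ναι
```

A word like ΒΙΒΛΙΟ would fall through all three probes here, since the lowercase IV entry is the stress-bearing βιβλίο; as stated above, such cases are deferred to the stress restoration module.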

5.2 Module 2: Rule about Stress Restoration

This module deals exclusively with OOV words that lack the stress mark. Such words can either be uppercase, where the stress mark is by default not part of the correct spelling (but it will have to be restored, since uppercase words will be returned lowercased, based on our assumption about POS-tag assignment), or lowercase instances (for the reasons explained in the background chapter).

5.2.1 Rule Overview

Here as well, the input is a tokenized tweet, and only tokens consisting of Greek characters are examined. The rule first identifies and “remembers” the original case of the OOV token (i.e. whether fully lowercase, e.g. xxx, or with an uppercase initial letter and the rest lowercase, e.g. Xxx) before lowercasing it, so that the token can be returned in its original form in the end. Similarly, even in cases where a proper name lacks both stress and the capitalization of the initial letter, the stress restoration rule will still match it to its IV counterpart. In summary, the following cases are generally covered:
a) Uppercase common words are converted into their lowercase stress-bearing variant, e.g. ΣΠΙΤΙ as σπίτι (house)
b) Proper names with an uppercase initial letter but no stress mark are returned with

the stress restored, e.g. Αννα as Άννα (Anna)
c) Proper names fully lowercase and without stress are returned with both the stress and the case restored, e.g. αννα as Άννα (Anna)
d) Common words with an uppercase initial letter (i.e. due to their position in the sentence) are returned with the stress restored and the case untouched, e.g. Σημερα as Σήμερα (today)

5.2.2 Rule Analysis

In practice, a word (or at least a seemingly Greek word, consisting of Greek characters) that is not in vocabulary, lacks the stress mark and has at least one vowel will be handled by the rule. It is split into its individual letters, and if the current letter is a vowel, it is replaced with its stressed counterpart (e.g. ε with έ, etc.). Then the system checks whether this modification makes the word a valid dictionary entry. If yes, the specific stressed variant of the word is stored in a list of possible replacements. At the same time, the vowel in question is reverted to its original (unstressed) state and the testing continues with the rest of the vowels, so that potentially further valid stressed variants are considered as well, as shown in Figure 5.2.

for token in tweet do
    if token is Greek and OOV and has at least 1 vowel and is unstressed then
        List_of_possible_replacements = empty list;
        for character in token do
            if character is a vowel then
                replace vowel with stressed counterpart;
                update token;
                if token is IV then
                    append token to List_of_possible_replacements;
                end
                revert changes on token and continue;
            end
        end
    end
end

Figure 5.2: Snippet of pseudocode from the stress restoration rule.
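The pseudocode in Figure 5.2 can be rendered as a short runnable function. The lexicon below is a toy stand-in for the Hunspell dictionary; the vowel-to-stressed mapping covers the seven Greek vowel letters.

```python
# Toy lexicon standing in for the Hunspell dictionary.
LEXICON = {"χώρια", "χωριά", "είμαι"}
STRESSED = {"α": "ά", "ε": "έ", "η": "ή", "ι": "ί",
            "ο": "ό", "υ": "ύ", "ω": "ώ"}

def stress_candidates(token: str) -> list:
    """Stress each vowel in turn; collect the variants that become IV."""
    candidates = []
    for i, ch in enumerate(token):
        if ch in STRESSED:
            variant = token[:i] + STRESSED[ch] + token[i + 1:]
            if variant in LEXICON:
                candidates.append(variant)
    return candidates

print(stress_candidates("χωρια"))  # ['χώρια', 'χωριά'] – two valid candidates
print(stress_candidates("ειμαι"))  # ['είμαι']
```

The first call reproduces the ambiguous case discussed next in the text: two valid stressed variants, between which the context probabilities must decide.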

When the List_of_possible_replacements contains more than one suggestion – for example, the stress restoration of χωρια has two valid candidates, χώρια (separately) and χωριά (villages) – then, in order to determine which one should be selected for the substitution of the OOV token, we obtain the probabilities of the immediate context for each candidate and select the one with the highest probability. In detail, we compute for each the unigram (candidate) and bigram (previous token and candidate, candidate and next token) log probabilities (based on the n-gram frequencies of the PANACEA corpus) and compare their sums, inspired by the method of Sidarenka et al. (2013), as shown in example 2.1. To extract the bigram probabilities, e.g. of

the previous token and a stressed candidate, we base our calculation on the formula below:

P(candidate | previous) = c(previous, candidate) / c(previous)

For unigram probabilities, we follow the same principle, relying on the formula below:

P(candidate) = c(candidate) / c(vocabulary_entries)
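The candidate-selection step can be sketched with toy counts standing in for the PANACEA frequencies (the numbers below are invented, and the add-one smoothing is an assumption added so that unseen bigrams do not yield zero probabilities).

```python
import math

# Toy unigram/bigram counts standing in for the PANACEA frequencies.
UNIGRAMS = {"χώρια": 10, "χωριά": 40, "δύο": 100}
BIGRAMS = {("δύο", "χωριά"): 8, ("δύο", "χώρια"): 1}
TOTAL = sum(UNIGRAMS.values())

def score(candidate: str, previous: str) -> float:
    """Sum of unigram and bigram log probabilities (add-one smoothed)."""
    p_uni = (UNIGRAMS.get(candidate, 0) + 1) / (TOTAL + len(UNIGRAMS))
    p_bi = (BIGRAMS.get((previous, candidate), 0) + 1) / \
           (UNIGRAMS.get(previous, 0) + len(UNIGRAMS))
    return math.log(p_uni) + math.log(p_bi)

# Choose between the two stressed variants of χωρια after δύο (two).
best = max(["χώρια", "χωριά"], key=lambda c: score(c, "δύο"))
print(best)  # χωριά – "δύο χωριά" (two villages) outscores "δύο χώρια"
```

A real run would also add the log probability of the bigram formed with the following token, as the text describes; the sketch keeps only the left context for brevity.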

5.2.3 Handling of Special Cases

Although not as common, there are tokens that, aside from lacking the stress mark, also bear a misspelling. In this case, most of the time, the system will forward the token to the next module, as expected. For instance, the word μυνημα will not be restored by the stress restoration rule, as no stress-bearing alternative of the token in question is included in the vocabulary. In other words, the system will check whether μύνημα or μυνήμα (trying out stress on each of the vowels in turn) exists and, after getting no match, will pass it over to the edit distance rule, which ideally should return the IV counterpart μήνυμα. However, such tokens (lacking the stress mark and including a misspelling) can in certain cases incorrectly be converted into an IV entry, when one is available. For example, in the expression δεν ξερο (I don’t know), the word ξερο will by default be handled by the stress restoration rule (because it lacks the stress mark) and a match with the IV entry ξερό (dry) will be identified, while the actual standardization – based on the left context δεν (here: don’t) – would be to correct the misspelling (by replacing ο with ω) and restore the stress on ε, so that ξέρω (know) is returned. In order to limit the chances of this happening, we do not automatically accept the suggested candidate (when it is proposed as a single candidate by the stress restoration rule), but rather first check whether it appears at all in the immediate context (with the token before or after); if yes, we return it, otherwise we pass the OOV word to the third and final module, which in the present case would be expected to return the correct normalization ξέρω. When no context is available, we base the selection on the unigram probabilities of the proposed candidate and the OOV token itself and keep the one with the higher probability.

5.3 Module 3: Edit distance

All the OOV words that were not normalized in the previous modules end up in this, the last, module. For the normalization of an OOV word, we compute its edit distance from certain vocabulary entries. Based on it, a number of candidates is selected for which we retrieve the logarithmic probabilities. The candidate with the best probability is then chosen as the normalized version.

5.3.1 Extraction of IV subset

We decided to compute the edit distance of an OOV word only from a subset of entries of the Hunspell dictionary. Computing the edit distance of an OOV word from all dictionary entries would be expensive and take up a considerable amount of time. On top of this, such an attempt might not necessarily prove effective, as it would only complicate the process by including candidates that could not possibly be the normalized counterpart the system is looking for. An example of this

would be considering as candidates for the normalization of a four-letter-long OOV token even dictionary entries that have double its length. As a result, computing the edit distance only from the entries of a subset of the Hunspell dictionary is both quicker and more effective. The size of this subset is determined each time by the length of the OOV word, as well as by its starting and final characters. We distinguished between four conditions:

1. If len(OOV word) > 8: the Hunspell subset should consist of IV entries that:
   a) either have the same length as the OOV token AND (start with the same 4 OR end with the same 3 characters),
   b) have len(OOV)+1 AND (start with the same 4 OR end with the same 3 characters), or
   c) have len(OOV)-1 AND (start with the same 4 OR end with the same 3 characters).
2. If 6 <= len(OOV word) <= 8: the Hunspell subset should consist of IV entries that:
   a) either have the same length as the OOV token AND (start with the same 3 OR end with the same 2 characters),
   b) have len(OOV)+1 AND (start with the same 3 OR end with the same 2 characters), or
   c) have len(OOV)-1 AND (start with the same 3 OR end with the same 2 characters).
3. If len(OOV word) = 4 or len(OOV word) = 5: the Hunspell subset should consist of IV entries that:
   a) either have the same length as the OOV token AND (start with the same 2 OR end with the same 2 characters),
   b) have len(OOV)+1 AND (start with the same 2 OR end with the same 2 characters), or
   c) have len(OOV)-1 AND (start with the same 2 OR end with the same 2 characters).
4. If len(OOV word) < 4: the Hunspell subset should consist of IV entries that:
   a) either have the same length as the OOV token AND (start with the same 1 OR end with the same 1 character),
   b) have len(OOV)+1 AND (start with the same 1 OR end with the same 1 character), or
   c) have len(OOV)-1 AND (start with the same 1 OR end with the same 1 character).
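The four length bands above can be condensed into a single filter function. The sketch below follows the stated conditions (length within ±1, shared prefix or suffix of the band-specific size); it is an illustration under our own naming, not the thesis code.

```python
def hunspell_subset(oov, dictionary):
    """Filter IV entries by length (same, +1, or -1) and by a shared
    prefix or suffix, following the four length bands described above."""
    n = len(oov)
    if n > 8:
        pre, suf = 4, 3
    elif 6 <= n <= 8:
        pre, suf = 3, 2
    elif n in (4, 5):
        pre, suf = 2, 2
    else:  # n < 4
        pre, suf = 1, 1
    return [w for w in dictionary
            if abs(len(w) - n) <= 1
            and (w[:pre] == oov[:pre] or w[-suf:] == oov[-suf:])]
```

For the six-letter OOV token μυνημα, for example, the filter keeps same-length entries sharing the final two characters (μήνυμα) and length-minus-one entries such as μνήμα, while discarding entries whose length differs by more than one.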

5.3.2 Extraction of Candidates

After the Hunspell subset has been created, we compute the edit distance of the OOV word in question from all the entries of the subset. We use a partially weighted implementation1 of the Levenshtein distance, where certain character substitutions are assigned customized weights, while all other operations, including substitutions of the majority of the characters, are assigned a common weight. Tables 5.1 and 5.2 show all the customized weights for vowel substitutions.

1https://github.com/luozhouyang/python-string-similarity

Char   Sub   Cost
α      ά     0.25
ε      έ     0.25
η      ή     0.25
ι      ί     0.25
ϊ      ΐ     0.25
υ      ύ     0.25
ϋ      ΰ     0.25
ο      ό     0.25
ω      ώ     0.25

Table 5.1: Substitution weight for restoring stress on same vowel

Char   Sub   Cost
ο      ω     0.5
ό      ώ     0.5
η      ι     0.5
ή      ί     0.5
η      υ     0.5
ή      ύ     0.5
ι      η     0.5
ί      ή     0.5
ι      υ     0.5
ί      ύ     0.5
υ      η     0.5
ύ      ή     0.5
υ      ι     0.5
ύ      ί     0.5

Table 5.2: Substitution weight for vowels commonly mixed up

The default weight assigned to an edit operation is 1. However, for the substitutions of certain vowels, we have manually assigned lower costs of 0.25 and 0.5. The customization of the substitution weight is based on the idea that specific characters are more likely to be mistyped, intentionally or due to uncertainty, as certain other characters. For instance, substituting Greek vowels that lack the stress mark with their stressed counterparts should be a low-cost substitution, since, as explained in the Background chapter, omitting the stress mark in online content creation is a widespread phenomenon. Once the edit distance of all Hunspell subset entries has been computed, the system excludes from selection all entries with an edit distance above 3. This upper limit was defined heuristically, after experimenting with different values and concluding that only extremely rarely was the gold normalized counterpart excluded from consideration. As a result, the selection of the IV entry is made among only a short list of normalization candidates.
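A partially weighted Levenshtein distance of this kind can be implemented with standard dynamic programming. The sketch below is our own pure-Python illustration (the thesis uses the python-string-similarity library instead), with only a few of the Table 5.1/5.2 vowel pairs included in the cost map.

```python
# A small subset of the vowel substitution costs from Tables 5.1 and 5.2;
# the full system defines one entry per listed pair.
SUB_COST = {("ο", "ό"): 0.25, ("ε", "έ"): 0.25, ("υ", "ύ"): 0.25, ("η", "ή"): 0.25,
            ("η", "ι"): 0.5, ("ή", "ί"): 0.5, ("υ", "ι"): 0.5, ("ο", "ω"): 0.5}

def sub_cost(a, b):
    """Look up a customized substitution cost; default operation cost is 1."""
    if a == b:
        return 0.0
    return SUB_COST.get((a, b), SUB_COST.get((b, a), 1.0))

def weighted_levenshtein(s, t):
    """Dynamic-programming edit distance: insertions and deletions cost 1,
    substitutions use the customized cost table above."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                              # deletion
                          d[i][j - 1] + 1,                              # insertion
                          d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))  # substitution
    return d[m][n]
```

Under these weights, restoring stress (ξερο → ξερό) costs only 0.25, and correcting both the stress mark and a commonly confused vowel (ξερο → ξέρω) costs 0.75, well below the default cost of two full substitutions.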

5.3.3 Final Selection of the IV Counterpart

The final selection of the IV entry that will replace the OOV token is performed by initially considering IV words with an edit distance below or equal to 1.5. Only when none of the IV words has an edit distance below this threshold will the other candidates be considered. In either case, should there be more than one option, the IV entry with the highest sum of log probabilities of the immediate context, if available, is selected; otherwise, only the unigram probability is taken into consideration for the selection. For example, assuming that the OOV token μύνημα, appearing in the context έστειλα μύνημα χθες, is about to be processed, it will be handled as follows:

a) The Hunspell subset of IV candidates, from which the system will compute the edit distance, is created. Based on the length and the starting and ending characters of the word μύνημα, this subset contains 871 entries.

b) Given this subset of IV entries and their edit distance from the OOV entry μύνημα, only those are kept whose edit distance is lower than the predetermined

threshold of 3. In this example, these are the following 33 candidates:

Candidate extraction with edit distance below 3: {’εύσημα’: 2.2, ’μέλημα’: 2.2, ’εύχυμα’: 2.45, ’κούνημα’: 2.1, ’μύθευμα’: 2.35, ’μύρωμα’: 2.2, ’μύξωμα’: 2.2, ’μάθημα’: 2.2, ’μήνυμα’: 1.35, ’μάσημα’: 2.2, ’κύημα’: 2.1, ’φώνημα’: 2.2, ’μάδημα’: 2.2, ’πόνημα’: 2.2, ’κίνημα’: 2.2, ’ομώνυμα’: 2.35, ’γόνιμα’: 2.45, ’μύωμα’: 2.1, ’μάνιωμα’: 2.35, ’ρίνημα’: 2.2, ’μίλημα’: 2.2, ’μύρισμα’: 2.35, ’μνήμα’: 1.25, ’φύσημα’: 2.2, ’εύρημα’: 2.2, ’μόνιμα’: 1.35, ’ούρημα’: 2.2, ’μάχιμα’: 2.45, ’ξύπνημα’: 2.1, ’σύνθημα’: 2.1, ’εύθυμα’: 2.45, ’κύλημα’: 2.2, ’εύφημα’: 2.2}

c) If there are IV entries with an edit distance below or equal to 1.5, a subgroup with these is created. The result would be:

Candidates with edit distance below or equal to 1.5: {’μήνυμα’: 1.35, ’μνήμα’: 1.25, ’μόνιμα’: 1.35}

d) The logarithmic sum of probabilities of each IV candidate in the above subgroup together with the immediate context in the tweet is calculated. Finally, the one with the highest probability is returned.

Final candidates with log sums: {(’μήνυμα’, 1.931853990), (’μνήμα’, -0.20860722411955), (’μόνιμα’, 1.2660461067248)} Based on these results, the word μήνυμα would be returned here, as it has the highest probability (1.93) of the three candidates.
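The selection step above can be sketched as a scoring function over the shortlisted candidates. This is an illustration under our own assumptions: we use raw bigram counts with add-one smoothing to avoid log(0), whereas the thesis sums log probabilities from the PANACEA bigram frequencies; the counts below are made up.

```python
import math

def select_candidate(candidates, prev_tok, next_tok, bigram_counts):
    """Pick the candidate maximizing the sum of log bigram scores with the
    immediate context (add-one smoothing is our own simplification)."""
    def score(cand):
        left = bigram_counts.get((prev_tok, cand), 0)
        right = bigram_counts.get((cand, next_tok), 0)
        return math.log(left + 1) + math.log(right + 1)
    return max(candidates, key=score)

# Hypothetical counts: in a real run these come from the n-gram corpus.
bigrams = {("έστειλα", "μήνυμα"): 5, ("μήνυμα", "χθες"): 2}
best = select_candidate(["μήνυμα", "μνήμα", "μόνιμα"], "έστειλα", "χθες", bigrams)
```

In this toy setting only μήνυμα has been observed next to έστειλα and χθες, so it receives the highest score, mirroring the worked example in the text.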

6 Evaluation and Results

6.1 Performance of the Rule-Based System

The performance of the rule-based system was tested in terms of accuracy. This was measured by comparing each token of the normalized test data with the equivalent gold token. As a baseline, we considered the deviation of the unnormalized (i.e. clean) test data from the gold dataset. The results, as shown in Table 6.1, indicate that the system has an accuracy of 95.4%, which in turn means a lower error rate: from 19% down to 4.5%.

                       Word Accuracy %   Error %
Baseline                     80.9          19.0
Normalization System         95.4           4.5

Table 6.1: Performance of the normalization system.

Given that the normalization of certain OOV tokens could in some cases result in more than one word (thereby increasing the total number of tokens in the normalized dataset), for the purpose of performance testing we modified such cases by outputting them as a single underscore-joined token, in order to ensure alignment and comparability between the different versions of the test set.
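The accuracy measurement described above reduces to comparing two aligned token sequences. A minimal sketch, assuming multi-word normalizations have already been joined with an underscore as described (function name and example tokens are our own):

```python
def token_accuracy(system_tokens, gold_tokens):
    """Token-level accuracy over two aligned sequences. Multi-word outputs
    are assumed to be underscore-joined so alignment is preserved."""
    if len(system_tokens) != len(gold_tokens):
        raise ValueError("sequences must be aligned one-to-one")
    correct = sum(s == g for s, g in zip(system_tokens, gold_tokens))
    return correct / len(gold_tokens)
```

The underscore convention is what makes this one-to-one comparison valid: without it, a one-to-many normalization would shift every subsequent token out of alignment.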

Figure 6.1: The distribution of OOV words in scope for normalization, based on the test set.

In order to gain an overview of the different groups of OOV words, Figure 6.1 illustrates the distribution of each group in the dataset. It was observed that words lacking the stress mark made up the largest category of OOV words found in the test set, with capitalized tokens, misspellings, abbreviations and other groups following in smaller percentages. Finally, we provide further insight into the normalization performance by distinguishing between the percentage of tokens that were handled correctly by the system, either by being normalized into the expected form or by being left as is, and the other way around.

              Normalized %   Not Normalized %
Correctly        83.44             98.20
Incorrectly      16.55              1.79

Table 6.2: Percentage of tokens that were handled correctly and incorrectly by the system, through modification or not.

The figures in Table 6.2 show that when modifying a token, the system did so correctly 83% of the time and incorrectly 17% of the time, while when leaving a token unmodified, it was successful in 98% of cases and unsuccessful in 2%.
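The breakdown behind these figures can be computed by splitting tokens into those the system modified and those it left unchanged, and counting correct and incorrect decisions within each group. A sketch with our own naming and made-up example tokens:

```python
def normalization_breakdown(original, system, gold):
    """Count correct/incorrect decisions separately for tokens the system
    modified and tokens it left unchanged (the two-by-two view above)."""
    stats = {"norm_ok": 0, "norm_bad": 0, "keep_ok": 0, "keep_bad": 0}
    for o, s, g in zip(original, system, gold):
        if s != o:  # the system modified this token
            stats["norm_ok" if s == g else "norm_bad"] += 1
        else:       # the system left it as is
            stats["keep_ok" if s == g else "keep_bad"] += 1
    return stats

stats = normalization_breakdown(
    original=["ξερο", "γεια", "μυνημα"],
    system=["ξερό", "γεια", "μήνυμα"],
    gold=["ξέρω", "γεια", "μήνυμα"],
)
```

Dividing each count by its column total (modified vs. unmodified tokens) yields precision- and recall-like percentages of the kind reported in Table 6.2.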

6.2 Error Analysis

Misidentification as OOV: Despite the restrictions placed on what the system identifies as OOV, there are also cases of words that were sent for normalization despite being in a perfectly valid form. This concerned mostly newly coined acronyms and neologisms not included in any of the used or self-created resources.

Failure to split: Incorrectly joined words (i.e. lacking a space between word boundaries) were only restored when a) consisting of at most two words1, and b) bearing no misspellings besides the missing space. When at least one of these conditions did not apply, the OOV token was sent for normalization to the edit distance rule. However, since that rule only generates a single token, the result was by default an incorrect normalization.

Misrestoring stress: Although we tried to some extent to limit this already when developing the system, incorrect stress restoration is in some cases still inevitable. Words lacking stress can sometimes also bear a misspelling, in which case the Levenshtein distance from valid dictionary entries should be computed. However, since the first step is to check whether stress restoration can convert an OOV token into an IV one, some OOV words get converted through stress restoration into an IV token that was unintended by the user, as in δεν ξερο, where ξερο gets converted to ξερό (dry, adjective) instead of ξέρω (I know, verb) given δεν (don't). Whether this can be tackled successfully depends on how informative the context is each time (if there is any at all) and how good the n-gram corpus used for this purpose is.

Wrong candidate selected: When more than one candidate was available to replace an OOV token, in some cases higher context probability was assigned to an incorrect one. This was mostly the case either when the n-gram corpus contained context frequencies that were too genre-specific, or when the context, once again, was not particularly informative, either because it consisted of function words or of words that were themselves in a non-standard form.

Correct candidate excluded: Although not as common, but still an issue, the correct candidate was in some cases excluded from consideration when creating the Hunspell subset for which the Levenshtein distance would be computed. This was the case for

1Splitting any number of potentially misjoined words requires a more complex solution which takes care of disambiguation as well, as seen in the example of μεγάλακαι which could be μεγάλα και (big and) or με γάλα και (with milk and).

IV entries that neither met the length restrictions with regard to the OOV word nor had the same starting or ending characters.

6.3 Effect of Normalization on Tagging

We used accuracy as a measure for testing the effect that lexical normalization has on tagging. We first tagged the clean test data and then, after running the data through the normalization system, retagged it. Each of these versions (i.e. the clean tagged data and the normalized tagged data) was compared to the tagging of the gold test set. However, since the tagging of the gold test set was performed automatically using UDPipe (and not manually), it constitutes an upper bound rather than the ground truth.

                  Tagging accuracy %   Error %
Baseline                88.33           11.67
Normalized Text         97.43            2.56

Table 6.3: Results of tagging clean vs normalized data.

The results, as shown in Table 6.3, suggest that normalization impacts the tagging process positively, as it leads to more accurate tagging results. Compared to the tagging results on the clean text, which we take as the baseline, tagging after normalization reaches an accuracy of 97%, an increase of 9%. Consequently, the error percentage drops from around 12% to around 3%.

7 Discussion

As per the statistics in Figure 6.1 in Chapter 6, where the distribution of the OOV tokens in the test set is shown, the majority of OOV tokens were lacking stress (approx. 10.5%), while a further 7.5% pertained to categories of OOV tokens addressed in the first module. Only 1.12% of OOV tokens were of miscellaneous nature. The third module, as a catch-all component, was in place to deal with misspellings, with tokens that failed to be normalized in the previous two modules, or with OOV tokens of unidentified category.

According to the results, the rule-based approach in combination with the edit distance rule confirmed our initial assumption that it would be sufficient to a great extent in tackling phenomena in Greek noisy user-generated texts. The three modules, the first two dedicated to specific categories and the last one to any token deviating from IV entries, managed, in the chosen hierarchy, to successfully normalize the vast majority of tokens (baseline approx. 81% vs. system approx. 95%).

However, the statistics in Table 6.2, showing the precision- and recall-like performance of the system, reveal that out of all the normalization attempts of the system, almost 16.5% were unsuccessful. This figure indicates that there is room for improvement, in terms of either the efficiency of the rules or of simply leaving an OOV token as is. The latter case could concern OOV tokens such as proper names, which are often OOV but not non-standard, acronyms, or different categories of named entities, which are usually only partly included in dictionaries. As of now, the system filters OOV tokens only based on the alphabet they are written in (excluding that way foreign words or, generally, words composed in non-Greek letters). On the other hand, there is another step in the system which helps prevent normalizing OOV tokens which are not in a non-standard form. This is the restriction we have set in the edit distance rule, where only OOV tokens with an edit distance below 3 are dealt with, so that, for instance, a proper name such as Ιντερλάκεν (Interlaken, a place in Switzerland) is correctly left untouched and returned as is, due to its edit distance being greater than 3. Nevertheless, the system performance could be boosted further by developing finer criteria for deciding which OOV tokens should be sent for normalization and which ones should be excluded from the process and left untouched.

Additionally, as the error analysis in Chapter 6 revealed, there were cases of OOV tokens being restored to the wrong IV entry. This observation pertains to the edit distance rule, where the probability of this happening is not trivial, and suggests that higher probability was often assigned to the unintended IV word. In view of the fact that the log probabilities for each candidate were based on the n-gram frequencies of the PANACEA corpus, which is from the environment domain, we can deduce that, despite the high number of n-gram frequencies in the corpus, it might often have provided the wrong context. Consequently, n-gram frequencies from a more relevant domain could provide better statistics.

Overall, our normalization results are in line with those of similar work on other languages. For instance, Ruiz et al. (2014) conclude from their work on the normalization of Spanish Tweets that rule-based preprocessing with edit distance for candidate generation led to higher accuracy compared to the baseline. A similar conclusion is also reached by Sidarenka et al. (2013) regarding rule-based components

on the normalization of German Tweets. Moreover, their finding that normalization positively impacts part-of-speech tagging was something we could confirm in our experiments for Greek as well.

8 Conclusion

Text normalization, and more specifically lexical normalization, describes the process of standardizing text on the word level by replacing OOV instances with their IV counterparts. In this work, we focused on the normalization of tweets, as user-generated texts, composed in the Modern Greek language. The proposed approach is a rule-based system combined with Levenshtein distance. It addresses lexical forms that are non-standard in terms of stress omission, elongations, contractions, arbitrary abbreviations, incorrect case, lack of space as a word boundary, and finally misspellings. These categories were identified as the most representative of non-standard word forms in Greek based on a manual analysis of a sample of 1,000 tweets. We used Twitter data that was collected by other researchers within a different project, and randomly extracted from it another sample which served as the test set. This sample was normalized manually for system evaluation purposes, as it was used as a gold dataset.

The system features three main modules. The first one uses regular expressions and resources such as dictionaries to perform straightforward and unambiguous modifications. The second one handles stress restoration on OOV words where stress is omitted, while the third and last one uses the Levenshtein distance for the generation of normalization candidates, the final selection among which is made by computing the logarithmic sum of bigram context probabilities based on the bigram frequencies of the PANACEA corpus.

In the end, we performed both an intrinsic evaluation, testing the performance of the system itself, and an extrinsic evaluation, checking the effect of lexical normalization on POS tagging, for which we used UDPipe. The results suggest that our normalization system performs better than the baseline by around 15%. This leads to the conclusion that a rule-based approach combined with Levenshtein distance is quite effective in handling the normalization of user-generated texts, at least when it comes to Greek. However, since the majority of OOV tokens in the test set turned out to be words lacking the stress mark, it may not be just as effective when dealing with a different OOV word distribution. As for the tagging results, they indicate an improvement of approximately 9% in the assignment of POS tags when lexical normalization has been performed. Finally, although the results are very promising, we performed an error analysis in order to gain an overview of the system's main weaknesses and thereby propose ideas for future work. For instance, experimenting with different n-grams, using a more relevant n-gram corpus or even creating one from the Twitter dataset, and working out concepts that could tackle the issues discussed in 7.1 could help make the system more robust and effective.

Bibliography

Ahmed, Bilal (2014). “Lexical Normalisation of Twitter Data” (Sept. 2014).
Alegria, Iñaki, Nora Aranberri, Pere R. Comas, Víctor Fernández, Pablo Gamallo, Lluís Padró, Iñaki Vicente, Jordi Turmo, and Arkaitz Zubiaga (2015). “TweetNorm: a benchmark for lexical normalization of Spanish tweets”. Language Resources and Evaluation 49 (Aug. 2015), pp. 1–23. DOI: 10.1007/s10579-015-9315-6.
Aw, AiTi, Min Zhang, Juan Xiao, and Jian Su (2006). “A Phrase-Based Statistical Model for SMS Text Normalization”. In: Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions. Sydney, Australia: Association for Computational Linguistics, July 2006, pp. 33–40. URL: https://www.aclweb.org/anthology/P06-2005.
Balamoti, Alexandra (2010). “Code-switching as a conversational strategy: evidence from Greek students in Edinburgh”. PhD thesis.
Beckley, Russell (2015). “Bekli: A Simple Approach to Twitter Text Normalization.” In: Proceedings of the Workshop on Noisy User-generated Text, pp. 82–86.
Bollmann, Marcel (2018). “Normalization of historical texts with neural network models”. In:
Bollmann, Marcel, Florian Petran, and Stefanie Dipper (2011). “Rule-Based Normalization of Historical Texts”. In: Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage. Hissar, Bulgaria: Association for Computational Linguistics, Sept. 2011, pp. 34–42. URL: https://www.aclweb.org/anthology/W11-4106.
Bollmann, Marcel and Anders Søgaard (2016). “Improving historical spelling normalization with bi-directional LSTMs and multi-task learning”. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan: The COLING 2016 Organizing Committee, Dec. 2016, pp. 131–139. URL: https://www.aclweb.org/anthology/C16-1013.
Brill, Eric (1992). “A Simple Rule-Based Part of Speech Tagger”. In: Proceedings of the Third Conference on Applied Natural Language Processing. ANLC ’92.
Trento, Italy: Association for Computational Linguistics, pp. 152–155. DOI: 10.3115/974499.974526. URL: https://doi.org/10.3115/974499.974526.
Brill, Eric and Robert C. Moore (2000). “An Improved Error Model for Noisy Channel Spelling Correction”. In: Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics. Hong Kong: Association for Computational Linguistics, Oct. 2000, pp. 286–293. DOI: 10.3115/1075218.1075255. URL: https://www.aclweb.org/anthology/P00-1037.
Chalamandaris, Aimilios, Athanassios Protopapas, Pirros Tsiakoulis, and Spyros Raptis (2006). “All Greek to me! An automatic Greeklish to Greek system” (Jan. 2006).
Charalampakis, Basilis, Dimitris Spathis, Elias Kouslis, and Katia Kermanidis (2015). “Detecting Irony on Greek Political Tweets: A Approach”. In: Proceedings of the 16th International Conference on Engineering Applications of Neural Networks (INNS). EANN ’15. , Island, Greece: Association for Computing Machinery. ISBN: 9781450335805. DOI: 10.1145/2797143.2797183. URL: https://doi.org/10.1145/2797143.2797183.

Choudhury, Monojit, Rahul Saraf, Vijit Jain, Animesh Mukherjee, Sudeshna Sarkar, and Anupam Basu (2007). “Investigation and modeling of the structure of texting language”. International Journal of Document Analysis and Recognition (IJDAR) 10, pp. 157–174.
Chrupała, Grzegorz (2014). “Normalizing tweets with edit scripts and recurrent neural embeddings”. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Baltimore, Maryland: Association for Computational Linguistics, June 2014, pp. 680–686. DOI: 10.3115/v1/P14-2111. URL: https://www.aclweb.org/anthology/P14-2111.
Church, Kenneth Ward and William A. Gale (1991). “Probability scoring for spelling correction”. Statistics and Computing 1, pp. 93–103.
Clark, Eleanor and Kenji Araki (2011). “Text Normalization in Social Media: Progress, Problems and Applications for a Pre-Processing System of Casual English”. Procedia - Social and Behavioral Sciences 27 (Dec. 2011), pp. 2–11. DOI: 10.1016/j.sbspro.2011.10.577.
Damerau, Fred J. (1964). “A Technique for Computer Detection and Correction of Spelling Errors”. Commun. ACM 7.3 (Mar. 1964), pp. 171–176. ISSN: 0001-0782. DOI: 10.1145/363958.363994. URL: https://doi.org/10.1145/363958.363994.
Daniels, P. and W. Bright (2010). The World’s Writing Systems. Oxford University Press, Incorporated, p. 275. ISBN: 9780195386929. URL: https://books.google.se/books?id=DTvNkQEACAAJ.
Ebden, Peter and Richard Sproat (2014). “The Kestrel TTS text normalization system”. Natural Language Engineering 21 (May 2014), pp. 333–353. DOI: 10.1017/S1351324914000175.
Eryiğit, Gülşen and Dilara Torunoğlu-Selamet (2017). “Social media text normalization for Turkish”. Natural Language Engineering 23.6, pp. 835–875. DOI: 10.1017/S1351324917000134.
Flint, Emma, Elliot Ford, Olivia Thomas, Andrew Caines, and Paula Buttery (2017). “A Text Normalisation System for Non-Standard English Words”. In: Jan. 2017, pp. 107–115.
DOI: 10.18653/v1/W17-4414.
Georgakopoulou, Alexandra (2007). “Self-presentation and interactional alliances in e-mail discourse: The style- and code-switches of Greek messages”. International Journal of Applied Linguistics 7 (Apr. 2007), pp. 141–164. DOI: 10.1111/j.1473-4192.1997.tb00112.x.
Gimpel, Kevin, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah Smith (2011). “Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments”. In: vol. 2. Jan. 2011, pp. 42–47.
Goker, Sinan and Burcu Can (2018). “Neural Text Normalization for Turkish Social Media”. In: Sept. 2018, pp. 161–166. DOI: 10.1109/UBMK.2018.8566406.
Hall, Johan (2003). “A probabilistic part-of-speech tagger with suffix probabilities”. MSI report 3015.
Han, Bo and Timothy Baldwin (2011). “Lexical Normalisation of Short Text Messages: Makn Sens a #twitter.” In: Jan. 2011, pp. 368–378.
Han, Bo, Paul Cook, and Timothy Baldwin (2013). “Lexical Normalization for Social Media Text”. ACM Trans. Intell. Syst. Technol. 4.1 (Feb. 2013). ISSN: 2157-6904. DOI: 10.1145/2414425.2414430. URL: https://doi.org/10.1145/2414425.2414430.
Hulden, Mans and Jerid Francom (2013). “Weighted and unweighted transducers for tweet normalization”. In: Jan. 2013.
Jurafsky, Daniel and James H. Martin (2009). Speech and Language Processing (2nd Edition). USA: Prentice-Hall, Inc. ISBN: 0131873210.

Kalamatianos, Georgios, Dimitrios Mallis, Symeon Symeonidis, and Avi Arampatzis (2015). “Sentiment Analysis of Greek Tweets and Hashtags Using a Sentiment Lexicon”. In: Proceedings of the 19th Panhellenic Conference on Informatics. PCI ’15. , Greece: Association for Computing Machinery, pp. 63–68. ISBN: 9781450335515. DOI: 10.1145/2801948.2802010. URL: https://doi.org/10.1145/2801948.2802010.
Kaufmann, Max (2010). “Syntactic Normalization of Twitter Messages”. In:
Knuth, Donald E. (1998). The Art of Computer Programming, Volume 3: (2nd Ed.) Sorting and Searching. USA: Addison Wesley Longman Publishing Co., Inc., pp. 395–400. ISBN: 0201896850.
Kong, Lingpeng, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A. Smith (2014). “A Dependency Parser for Tweets”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1001–1012. DOI: 10.3115/v1/D14-1108. URL: https://www.aclweb.org/anthology/D14-1108.
Korchagina, Natalia (2017a). “Normalizing Medieval German Texts: from rules to deep learning”. In: May 2017.
Korchagina, Natalia (2017b). “Normalizing Medieval German Texts: from rules to deep learning”. In: Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language, pp. 12–17.
Ktori, Maria, Walter van Heuven, and Nicola Pitchford (2008). “GreekLex: A lexical database of Modern Greek”. Behavior research methods 40 (Sept. 2008), pp. 773–83. DOI: 10.3758/BRM.40.3.773.
Kupiec, Julian (1992). “Robust part-of-speech tagging using a hidden Markov model”. Computer Speech & Language 6, pp. 225–242.
Leeman-Munk, Samuel Paul (2016). “Morphosyntactic Neural Analysis for Generalized Lexical Normalization”. PhD thesis. North Carolina.
Levenshtein, VI (1966). “Binary Codes Capable of Correcting Deletions, Insertions and Reversals”. Soviet Physics Doklady 10, p. 707.
Li, , Muhua Zhu, Yang Zhang, and Ming Zhou (2006).
“Exploring Distributional Similarity Based Models for Query Spelling Correction”. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Sydney, Australia: Association for Computational Linguistics, July 2006, pp. 1025–1032. DOI: 10.3115/1220175.1220304. URL: https://www.aclweb.org/anthology/P06-1129.
Liu, Fei, Fuliang Weng, Bingqing Wang, and Yang Liu (2011). “Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision”. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics, June 2011, pp. 71–76. URL: https://www.aclweb.org/anthology/P11-2013.
Navarro, Gonzalo (1999). “A Guided Tour to Approximate String Matching”. ACM COMPUTING SURVEYS 33, p. 2001.
Németh, László (2008). Hunspell. https://github.com/hunspell/hunspell.
O’Connor, Brendan, Ramnath Balasubramanyan, Bryan Routledge, and Noah Smith (2010). “From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series”. In: vol. 11. Jan. 2010.
Pathak, Varsha and Manish Joshi (2019). Rule based Approach for Word Normalization by resolving Transcription Ambiguity in Transliterated Search Queries. arXiv: 1910.07233 [cs.IR].

Peterson, James L. (1980). “Computer Programs for Detecting and Correcting Spelling Errors”. Commun. ACM 23.12 (Dec. 1980), pp. 676–687. ISSN: 0001-0782. DOI: 10.1145/359038.359041. URL: https://doi.org/10.1145/359038.359041.
Pettersson, Eva (2016). “Spelling Normalisation and Linguistic Analysis of Historical Text for Information Extraction.” PhD thesis. Uppsala.
Philips, Lawrence (1990). “Hanging on the Metaphone”. Computer Language Magazine 7.12 (Dec. 1990). Accessible at http://www.cuj.com/documents/s=8038/cuj0006philips/, pp. 39–44.
Philips, Lawrence (2000). “The Double Metaphone Search Algorithm”. C/C++ Users Journal 18 (June 2000), pp. 38–43.
Piotrowski, Michael (2012). Natural Language Processing for Historical Texts. Morgan & Claypool Publishers. ISBN: 1608459462.
Pittas, Evdokia and Terezinha Nunes (2014). “The relation between morphological awareness and reading and spelling in Greek: A longitudinal study”. Reading and Writing 27 (Sept. 2014), pp. 1507–1527. DOI: 10.1007/s11145-014-9503-6.
Porta, Jordi and José-Luis Sancho-Gómez (2013). “Word Normalization in Twitter Using Finite-state Transducers”. In: Tweet-Norm@SEPLN.
Protopapas, Athanassios, Aikaterini Fakou, Styliani Drakopoulou, Christos Skaloumbakas, and Angeliki Mouzaki (2012). “What do spelling errors tell us? Classification and analysis of errors made by Greek schoolchildren with and without dyslexia”. Reading and Writing 26 (May 2012). DOI: 10.1007/s11145-012-9378-3.
Rajkovic, P and D Jankovic (2007). “Adaptation and application of Daitch-Mokotoff Soundex algorithm on Serbian names”. In:
Ruiz, Pablo, Montse Cuadros, and Thierry Etchegoyhen (2014). “Lexical Normalization of Spanish Tweets with Rule-Based Components and Language Models”. Procesamiento del Lenguaje Natural 52, pp. 45–52.
Samsudin, N., Mazidah Puteh, Abdul Hamdan, and Mohd Zakree Ahmad Nazri (2013). “Normalization of noisy texts in Malaysian online reviews”.
Journal of Information and Communication Technology 12 (Jan. 2013), pp. 147–159.
Scherrer, Yves and Tomaž Erjavec (2013). “Modernizing historical Slovene words with character-based SMT”. In: BSNLP 2013 - 4th Biennial Workshop on Balto-Slavic Natural Language Processing. Sofia, Bulgaria, Aug. 2013. URL: https://hal.inria.fr/hal-00838575.
Schmid, Helmut (1994). “Part-of-Speech Tagging with Neural Networks”. In: Proceedings of the 15th Conference on Computational Linguistics - Volume 1. COLING ’94. Kyoto, Japan: Association for Computational Linguistics, pp. 172–176. DOI: 10.3115/991886.991915. URL: https://doi.org/10.3115/991886.991915.
Sidarenka, Uladzimir, Tatjana Scheffler, and Manfred Stede (2013). “Rule-Based Normalization of German Twitter Messages”. In: Proceedings of the Conference of the German Society for Computational Linguistics (GSCL 2013). Darmstadt, Germany: European Language Resources Association (ELRA), Sept. 2013.
Sikdar, Utpal Kumar and Björn Gambäck (2016). “Feature-Rich Twitter Named Entity Recognition and Classification”. In: Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT). Osaka, Japan: The COLING 2016 Organizing Committee, Dec. 2016, pp. 164–170. URL: https://www.aclweb.org/anthology/W16-3922.
Steinskog, Asbjørn, Jonas Therkelsen, and Björn Gambäck (2017). “Twitter Topic Modeling by Tweet Aggregation”. In: Proceedings of the 21st Nordic Conference on Computational Linguistics. Gothenburg, Sweden: Association for Computational Linguistics, May 2017, pp. 77–86. URL: https://www.aclweb.org/anthology/W17-0210.
Straka, Milan and Jana Straková (2017). “Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe”. In: Proceedings of the CoNLL 2017 Shared

Task: Multilingual Parsing from Raw Text to Universal Dependencies. Vancouver, Canada: Association for Computational Linguistics, Aug. 2017, pp. 88–99. DOI: 10.18653/v1/K17-3009. URL: https://www.aclweb.org/anthology/K17-3009.
Tang, Gongbo, Fabienne Cap, Eva Pettersson, and Joakim Nivre (2018). “An Evaluation of Neural Machine Translation Models on Historical Spelling Normalization”. In: Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics, Aug. 2018, pp. 1320–1331. URL: https://www.aclweb.org/anthology/C18-1112.
Van der Goot, Rob, Barbara Plank, and Malvina Nissim (2017). To Normalize, or Not to Normalize: The Impact of Normalization on Part-of-Speech Tagging. arXiv: 1707.05116 [cs.CL].
Wang, Q., J. Bhandal, S. Huang, and B. Luo (2017). “Classification of Private Tweets Using Tweet Content”. In: 2017 IEEE 11th International Conference on Semantic Computing (ICSC), pp. 65–68.
Wikström, Peter (2017). “I tweet like I talk. Aspects of speech and writing on Twitter”. PhD thesis. Dec. 2017.
Zhang, Hao, Richard Sproat, Axel Ng, Felix Stahlberg, Xiaochang Peng, Kyle Gorman, and Brian Roark (2019). “Neural Models of Text Normalization for Speech Applications”. Computational Linguistics 45 (Mar. 2019), pp. 1–49. DOI: 10.1162/COLI_a_00349.
