A Rule-Based Normalization System for Greek Noisy User-Generated Text
Marsida Toska
Uppsala University
Department of Linguistics and Philology
Master Programme in Language Technology
Master's Thesis in Language Technology, 30 ECTS credits
November 9, 2020
Supervisor: Eva Pettersson

Abstract

The ever-growing usage of social media platforms generates vast amounts of textual data daily, which could potentially serve as a rich source of information. Therefore, mining user-generated data for commercial, academic, or other purposes has already attracted the interest of the research community. However, the informal writing which often characterizes online user-generated texts poses a challenge for automatic text processing with Natural Language Processing (NLP) tools. To mitigate the effect of noise in these texts, lexical normalization has been proposed as a preprocessing method, which, in short, is the task of converting non-standard word forms into a canonical form. The present work aims to contribute to this field by developing a rule-based normalization system for Greek tweets. We analyze the categories of out-of-vocabulary (OOV) word forms identified in the dataset and define hand-crafted rules, which we combine with edit distance (the Levenshtein distance approach) to tackle noise in the cases under scope. To evaluate the performance of the system, we conduct both an intrinsic and an extrinsic evaluation, the latter exploring the effect of normalization on part-of-speech tagging. The results of the intrinsic evaluation suggest that our system reaches an accuracy of approx. 95%, compared to approx. 81% for the baseline. In the extrinsic evaluation, a boost of approx. 8% in tagging performance is observed when the text has first been preprocessed through lexical normalization.

Contents

Acknowledgements
1 Introduction
  1.1 Purpose
  1.2 Outline
2 Background
  2.1 Text Normalization
    2.1.1 Normalization of Historical Texts
    2.1.2 Normalization of Texts for Text-to-Speech Systems
  2.2 Characteristics of Noisy User-Generated Texts
  2.3 Methods for the Normalization of Noisy User-Generated Texts
    2.3.1 The Rule-Based Approach
    2.3.2 Levenshtein Distance
    2.3.3 Comparison of Phonetic Similarity
    2.3.4 Statistical Machine Translation and Neural Methods
  2.4 Greek Language and Greek Tweets
    2.4.1 Overview of Greek Spelling
    2.4.2 Linguistic Phenomena in Greek Tweets
  2.5 Part-of-Speech Tagging
3 Data & Resources
  3.1 The Dataset
  3.2 Resources
    3.2.1 Hunspell
    3.2.2 PANACEA n-gram Corpus
    3.2.3 UDPipe
4 Preprocessing
  4.1 Cleaning the Dataset
  4.2 Test Set
  4.3 Systematic Analysis of Greek Tweets
    4.3.1 Sentence Structure
    4.3.2 Lower/Upper Case
    4.3.3 Neologisms
    4.3.4 Greeklish and Engreek
    4.3.5 Stress, Contractions, Elongations, Space
    4.3.6 Non-Standard Abbreviations
    4.3.7 Misspellings
  4.4 Scope Definition
5 System Architecture
  5.1 Module 1: Regular Expressions
    5.1.1 Non-Standard Abbreviations
    5.1.2 Elongations
    5.1.3 Contractions
    5.1.4 Misjoined Words
    5.1.5 Truecasing
  5.2 Module 2: Rule about Stress Restoration
    5.2.1 Rule Overview
    5.2.2 Rule Analysis
    5.2.3 Handling of Special Cases
  5.3 Module 3: Edit Distance
    5.3.1 Extraction of IV Subset
    5.3.2 Extraction of Candidates
    5.3.3 Final Selection of the IV Counterpart
6 Evaluation and Results
  6.1 Performance of the Rule-Based System
  6.2 Error Analysis
  6.3 Effect of Normalization on Tagging
7 Discussion
8 Conclusion

Acknowledgements

For her guidance, constructive feedback, continuous support and encouragement, I would like to warmly thank my supervisor Eva Pettersson. I also wish to thank my absolutely unique family, my boyfriend, and my friends spread out all over the world for always being there for me, even when they were not.

1 Introduction

User-generated texts, such as social media texts (e.g. tweets), constitute a vast source of information for opinion and event extraction. However, most of this information is composed in a language that is notorious for its high variability in sentence structure, its extensive use of non-standard word forms, and the presence of ungrammatical linguistic units (e.g. misspelled words), all of which result in noise (Sikdar and Gambäck, 2016). Since most Natural Language Processing (NLP) tools are trained on formal texts, such as news text, it has been observed that the performance of such tools declines when they are run on informal text (Gimpel et al., 2011; O'Connor et al., 2010). Good tagging and parsing results are essential for applications such as opinion mining and information retrieval (Kong et al., 2014). There is therefore a need to preprocess noisy user-generated texts (NUGT) through normalization, so that the performance of NLP preprocessing tasks, such as Part-of-Speech (POS) tagging and parsing, and of the tasks that build on them, is not compromised significantly, if at all (Clark and Araki, 2011).

1.1 Purpose

Normalization can concisely be defined as the task of converting word tokens into their standard form (Han et al., 2013). Depending on the text genre (formal vs. informal) and the purpose of normalization (preprocessing text for Text-to-Speech (TTS) systems, for information retrieval, or for other NLP tasks such as tagging and parsing), it poses different challenges and consequently requires a different approach. The purpose of the present work is to explore the extent to which rule-based normalization of Greek social media texts, specifically Greek tweets (tweets written in Modern Greek), can lead to more accurate tagging results. We opted for the rule-based approach because it has proven to deliver good results in other languages, at least when straightforward mappings suffice (Ruiz et al., 2014; Sidarenka et al., 2013). Additionally, including the Levenshtein distance algorithm allows us to deal with arbitrary spelling deviations, a phenomenon which we expect to be abundant in user-generated texts such as tweets.
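As a point of reference for the discussion that follows, the standard dynamic-programming formulation of Levenshtein distance is sketched below. This is a minimal Python illustration only, not the exact implementation used in Module 3 of our system; the Greek word pair at the end is an invented example of a single-substitution misspelling (η written for ί), not an item from the dataset.

def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn source into target."""
    m, n = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # i deletions empty the source prefix
    for j in range(n + 1):
        dist[0][j] = j  # j insertions build the target prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,        # deletion
                dist[i][j - 1] + 1,        # insertion
                dist[i - 1][j - 1] + sub,  # substitution (or match)
            )
    return dist[m][n]

# An invented misspelling: η written in place of ί, one substitution away.
assert levenshtein("μαζη", "μαζί") == 1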
Therefore, the research questions which we will attempt to answer in this project can be formulated as follows:

1) What are the core categories of Greek tweets that are potentially in need of normalization?
2) How well does a rule-based system combined with Levenshtein distance perform in normalizing Greek tweets?
3) What is the effect, if any, of normalization on the tagging results of Greek tweets?

1.2 Outline

The chapters are organized as follows: Chapter 2 provides background on the task of normalization. As it is not a new field, a few of the most common areas of application and methods are described. An introduction to Greek NUGT is also provided, as well as a brief overview of POS tagging.

Chapter 3 introduces the dataset along with the resources that were used for the purposes of this project. Part of the data was used for the analysis of the phenomena occurring in Greek tweets and for the creation of an annotated test set, which was used for the evaluation of the system.

Chapter 4 contains information about the preprocessing steps, such as cleaning the data, performing an analysis and categorization of non-standard word forms, and annotating the test set.

Chapter 5 describes the actual implementation of the system by giving an overview of the architecture and illustrating the approaches through examples.

Chapter 6 gives an overview of the evaluation results, both of the system itself and with regard to POS tagging, and summarizes the answers to the research questions as they emerge from the work.

Chapters 7 and 8 discuss the results and provide a brief conclusion, respectively.

2 Background

2.1 Text Normalization

Text normalization (also canonicalization or standardization) is a prevalent task in the NLP pipeline. It includes normalization subtasks such as tokenization, lemmatization, stemming and sentence segmentation, but it can also be encountered in a more complex form in situations where, for instance, out-of-vocabulary (OOV) words must be addressed by converting them into a standard (lexicon-approved) form. Lexical normalization, where the focus of this work lies, consists in preprocessing a text at the word level and transforming it into a form that can be easily analyzed and processed by other tasks in the NLP pipeline (e.g. taggers) or by downstream applications (e.g. machine translation systems), so that these produce consistent results (Han et al., 2013; Jurafsky and Martin, 2009). As one may already
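To make the word-level view of lexical normalization concrete, the sketch below shows the dictionary-lookup step that separates in-vocabulary (IV) tokens from OOV ones before any rewriting takes place. The three-word lexicon and the function name split_iv_oov are toy illustrations of our own; the system itself performs this kind of check with Hunspell and a Greek dictionary (Section 3.2.1).

# Toy illustration: flag OOV tokens against a reference lexicon.
# A real lexicon would come from a spell-checking dictionary;
# the three word forms below are only for demonstration.
LEXICON = {"καλημέρα", "σε", "όλους"}

def split_iv_oov(tokens: list[str]) -> tuple[list[str], list[str]]:
    """Return (in-vocabulary, out-of-vocabulary) token lists."""
    iv = [t for t in tokens if t.lower() in LEXICON]
    oov = [t for t in tokens if t.lower() not in LEXICON]
    return iv, oov

iv, oov = split_iv_oov(["Καλημέρα", "σε", "ολουςςς"])
print(oov)  # ['ολουςςς'] -> an elongated form, a candidate for normalization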