RESOURCED LANGUAGES Dialectal Arabic Short Texts

DEPARTMENT OF PHILOSOPHY, LINGUISTICS AND THEORY OF SCIENCE AUTOMATIC DETECTION OF UNDER- RESOURCED LANGUAGES Dialectal Arabic Short Texts Wafia Adouane Master’s Thesis: 30 credits Programme: Master’s Programme in Language Technology Level: Advanced level Semester and year: Spring, 2016 Supervisor: Richard Johansson, Nasredine Semmar and Alan Said Examiner: Staffan Larsson Report number: (number will be provided by the administrators) Keywords: Under-resourced languages, Discrimination between similar languages and language varieties, Arabic varieties, Linguistic resource building Abstract Automatic Language Identification (ALI) is the first necessary step to do any language-dependent natural language processing task. It is the identification of the natural language of the input content by a machine. Being a well-established task in computational linguistics since early 1960's, various methods have been successfully applied to a wide range of languages. The state-of-the-art automatic language identifiers are based on character n-gram models trained on huge corpora. However, there are many natural languages which are not yet automatically processed. For instance, minority languages or informal forms of standard languages (general purpose languages used only in media/administration and taught at schools). Some of these languages are only spoken and do not exist in a written format. The use of social media platforms and new technologies have facilitated the emergence of written format for these spoken languages based on pronunciation. These new written languages are under- resourced, hence the current ALI tools fail to properly recognize them. In this study, we revisit the problem of ALI with the focus on discriminating under-resourced similar languages. We deal with the case of dialectal Arabic (informal Arabic varieties) used in social media, and we consider each Arabic dialect/variety as a stand-alone language. Our main purpose is to investigate the performance of the ALI standard methods, namely machine learning and dictionary- based methods, on distinguishing Arabic varieties. Given the fact that discriminating between Arabic varieties is a nontrivial linguistic task because of the absence of any clear-cut borderlines between the variants, we can conclude that machine learning models are well suited for Arabic dialects identification. Support vector machines, namely the LinearSVC method combining the character- based 5-6-grams with dialectal vocabulary as features, outperforms all the other methods. The dictionary-based method suffers mainly from the shortage in the vocabulary coverage. Acknowledgements I would like to thank very much my supervisors, Richard Johansson, Nasredine Semmar and Alan Said, for accepting to work on this topic and for their invaluable guidance and useful feedback. I would like also to thank Victoria Bobicev for her help in explaining and implementing the Prediction by Partial Matching (PPM) method. Special thanks go to all my lovely friends who collected the dialectal data and annotated it for free. Well that's friendship! Sorry, for not naming you, one by one, but I'm sure you all know whom I'm talking about. My thanks go also to the students of the Arabic linguistics and literature department at Boumerdès University for accepting to do the annotation for us. This project would not exist without you 'Anjad, choukran ktir ilkon'. I would like also to thank all my teachers and classmates at the MLT program. Last but not least, I would like to express my deepest gratitude to my parents 'la raison de mon existence et ma source d'insperation', my brothers and sisters and my husband for their support despite the long distance. Contents 1 Introduction....................................................................................................................................... 1 1.1 Motivation................................................................................................................................. 1 1.2 Goals and contributions............................................................................................................. 2 1.3 Thesis organization.................................................................................................................... 3 2 Background........................................................................................................................................ 4 2.1 Automatic language identification............................................................................................. 4 2.2 Discriminating similar languages and language varieties.......................................................... 4 2.3 Arabic Natural Language Processing......................................................................................... 5 2.4 Applications of dialectal Arabic identification........................................................................... 6 3 Arabic variants.................................................................................................................................. 7 3.1 Modern Standard Arabic ........................................................................................................... 7 3.2 Arabic dialects/languages/varieties............................................................................................ 7 3.3 Arabic dialects classification......................................................................................................8 3.4 Characteristics of Arabic dialects.............................................................................................10 4 System implementation................................................................................................................... 13 4.1 Linguistic resources................................................................................................................. 13 4.1.1 Dataset............................................................................................................................... 13 4.1.1.1 Dataset building and annotation................................................................................... 13 4.1.1.2 evaluation of the dataset annotation............................................................................ 17 4.1.1.3 Data pre-processing..................................................................................................... 21 4.1.2 Dialectal lexicons...............................................................................................................21 4.2 Approaches............................................................................................................................. 23 4.2.1 Machine learning.............................................................................................................. 23 4.2.2 Dictionary-based method.................................................................................................. 24 5 Experiments and result analysis..................................................................................................... 26 5.1 Cavnar's Text Categorization Character-based n-grams........................................................... 26 5.2 Scikit-learn classifiers.............................................................................................................. 29 5.2.1 Character-based n-gram..................................................................................................... 29 5.2.2 Word-based n-gram............................................................................................................ 32 5.2.3 Dialectal vocabulary.......................................................................................................... 35 5.2.4 Feature combination.......................................................................................................... 36 5.2.4.1 Combining word-based unigram with dialectal vocabulary.........................................36 5.2.4.2 Combining character-based 5-6-grams with dialectal vocabulary................................ 38 5.2.5 Regional dialect grouping.................................................................................................. 39 5.2.6 Learning curves................................................................................................................. 40 5.2.7 Introducing the 'Unknown' category.................................................................................. 41 5.2.8 Using the full-length-document......................................................................................... 42 5.3 Prediction by Partial Matching method.....................................................................................43 5.4 Dictionary-based method.......................................................................................................... 44 5.5 Summary of the results............................................................................................................. 45 6 Conclusions...................................................................................................................................... 47 6.1 General findings......................................................................................................................... 47 6.2 Future directions......................................................................................................................... 49 References..........................................................................................................................................

Load more