Mawdoo3 AI at MADAR Shared Task: Arabic Tweet Dialect Identification Bashar Talafha Wael Farhan Ahmed Altakrouri Hussein T. Al-Natsheh Mawdoo3 Ltd, Amman, Jordan fbashar.talafha, wael.farhan, ahmed.altakrouri,
[email protected] Abstract et al., 2018)(Bouamor et al., 2019). The task of Arabic dialect identification is an inherently user dialect identification can be seen as a text complex problem, as Arabic dialect taxonomy classification problem, where we predict the prob- is convoluted and aims to dissect a continu- ability of a dialect given a sequence of words and ous space rather than a discrete one. In this other features provided by the task organizers. Be- work, we present machine and deep learning sides reporting the results from different models, approaches to predict 21 fine-grained dialects we show how the provided dataset for the task is form a set of given tweets per user. We adopted not straightforward and requires additional analy- numerous feature extraction methods most of which showed improvement in the final model, sis, feature engineering, and post-processing tech- such as word embedding, Tf-idf, and other niques. tweet features. Our results show that a sim- In the next sections, we describe the methods ple LinearSVC can outperform any complex followed to achieve our best model. Section2 deep learning model given a set of curated fea- lists previous work done, Section3 analyses the tures. With a relatively complex user voting dataset, while Section4 describes models and dif- mechanism, we were able to achieve a Macro- Averaged F1-score of 71.84% on MADAR ferent approaches.