Mawdoo3 AI at MADAR Shared Task: Arabic Tweet Dialect Identification


Bashar Talafha, Wael Farhan, Ahmed Altakrouri, Hussein T. Al-Natsheh
Mawdoo3 Ltd, Amman, Jordan
{bashar.talafha, wael.farhan, ahmed.altakrouri, [email protected]

Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 239–243, Florence, Italy, August 1, 2019. © 2019 Association for Computational Linguistics

Abstract

Arabic dialect identification is an inherently complex problem, as Arabic dialect taxonomy is convoluted and aims to dissect a continuous space rather than a discrete one. In this work, we present machine and deep learning approaches to predict 21 fine-grained dialects from a set of given tweets per user. We adopted numerous feature extraction methods, such as word embeddings, Tf-idf, and other tweet features, most of which improved the final model. Our results show that a simple LinearSVC can outperform any complex deep learning model given a set of curated features. With a relatively complex user voting mechanism, we were able to achieve a Macro-Averaged F1-score of 71.84% on MADAR shared subtask-2. Our best submitted model ranked second out of all participating teams.

1 Introduction

In recent years, the growing use of social media platforms such as Facebook and Twitter has led to exponential growth in user-generated content. The nature of this data is diverse. It comprises different expressions, languages, and dialects, which has attracted researchers to understand and harness language semantics in tasks such as sentiment, emotion, dialect identification, and many other Natural Language Processing (NLP) tasks. Arabic is one of the most widely spoken languages in the world; its use by many nations across multiple geographical locations has led to language variations (i.e., dialects) (Zaidan and Callison-Burch, 2014).

In this paper, we tackle the problem of predicting a user's dialect from a set of that user's tweets. We describe our work on exploring different machine and deep learning methods in our attempt to build a classifier for user dialect identification as part of MADAR (Multi-Arabic Dialect Applications and Resources) shared subtask-2 (Bouamor et al., 2018; Bouamor et al., 2019). The task of user dialect identification can be seen as a text classification problem, where we predict the probability of a dialect given a sequence of words and other features provided by the task organizers. Besides reporting the results from different models, we show how the provided dataset for the task is not straightforward and requires additional analysis, feature engineering, and post-processing techniques.

In the next sections, we describe the methods followed to achieve our best model. Section 2 lists previous work, Section 3 analyses the dataset, and Section 4 describes the models and the different approaches. Section 5 compares and discusses empirical results, and Section 6 concludes.

2 Related Work

Recent work on the Arabic language tackles the task of dialect identification. Fine-grained dialect identification models were proposed by Salameh et al. (2018) to classify 26 specific Arabic dialects with an emphasis on feature extraction. They trained multiple models using Multinomial Naive Bayes (MNB) and achieved a Macro-Averaged F1-score of 67.9% for their best model.

In addition to traditional models, deep learning methods tackle the same problem. The research by Elaraby and Abdul-Mageed (2018) shows an enhancement in accuracy when compared to machine learning methods. Huang (2015) used weakly supervised data, or distant supervision techniques, combining data crawled from Facebook posts with a labeled dataset to increase the accuracy of dialect identification by 0.5%.

In this paper, we build on top of methods from Salameh et al. (2018) and Elaraby and Abdul-Mageed (2018) by exploring machine and deep learning models to tackle the problem of fine-grained Arabic dialect identification.

3 Dataset

The dataset used in this work represents information about tweets posted from the Arabic region, where each tweet is associated with its dialect label (Bouamor et al., 2018; Bouamor et al., 2019). This dataset is collected from 21 countries: Algeria, Bahrain, Djibouti, Egypt, Iraq, Jordan, Kuwait, Lebanon, Libya, Mauritania, Morocco, Oman, Palestine, Qatar, Saudi Arabia, Somalia, Sudan, Syria, Tunisia, United Arab Emirates, and Yemen.

As shown in Figure 1, the distribution of tweets among countries is unbalanced. Around 35% of the tweets belong to Saudi Arabia, whereas only 0.08% belong to Djibouti.

Figure 1: The distribution of 21 Arabic dialects in MADAR Twitter corpus

The dataset contains 6 features for each user: the username of the Twitter user, the tweet ID, the language of the tweet as automatically detected by Twitter, probability scores of 25 city dialects and MSA (Modern Standard Arabic) for each tweet, obtained by running the best model described in (Salameh et al., 2018), and, most importantly, the tweet text.

Each user has at most 100 tweets, all labeled with the same dialect, and appears in exactly one set. For example, if a user is listed in the training set, then that user will not appear in the development or test set. Moreover, the maximum length of each tweet is 280 characters including spaces, URLs, hashtags, and mentions, which makes it challenging to identify the dialects automatically (Twitter).

Another challenge of the dataset is that around 61% of the tweets are retweets, as shown in Table 1. This means that the majority of the tweets are re-posts from other Twitter users. For example, the tweet "RT @Bedoon120: [Arabic text] https://t.co/sIKqXCUSAn" for the user @abushooooq8 is an Egyptian tweet, but it has a label of Kuwait because the user who retweeted it is Kuwaiti (i.e., @abushooooq8), whereas the original author is Egyptian (i.e., @Bedoon120).

Table 1 shows the distribution of available and unavailable data across the different sets. It is also worth mentioning that around 10% of the data is missing; some tweets are not accessible because they were deleted by the author or the owner's account was deactivated.

              Train     Dev      Test
Available     195227    26750    43918
Unavailable   22365     3119     6044
Total         217592    29869    49962
Retweet       135388    17612    29868

Table 1: Distribution of the Train, Dev and Test sets used in our experiments.

4 Models

In this section, we explain our feature extraction methodology, then we go over the various approaches we experimented with.

4.1 Feature Extraction

As a preprocessing step, normalization of Arabic letters is common when dealing with Arabic text. We adopted the same preprocessing methodology used in (Soliman et al., 2017).

AraVec: Pre-trained word embedding models proposed by (Soliman et al., 2017) for the Arabic language using three different datasets: Twitter tweets, World Wide Web pages, and Wikipedia Arabic articles. Those models were trained using Word2Vec skip-gram and CBOW (Mikolov et al., 2013). In our experiments, we used the 300-dimensional Twitter Skip-gram AraVec model.

fastText: An extension to the Word2Vec model proposed by (Bojanowski et al., 2017). The model feeds an input based on sub-words rather than passing entire words. In our experiments, a model with 300 dimensions was trained on a combination of 5 different datasets: AutoTweet (Almerekhi and Elsayed, 2015), ArapTweet (Zaghouani and Charfi, 2018), DART (Alsarsour et al., 2018), PADIC (Parallel Arabic DIalectal Corpus) (Harrat et al., 2014), and the MADAR shared tasks (Bouamor et al., 2018; Bouamor et al., 2019).

Tf-idf: Tf-idf has been shown to be efficient at encoding textual information into real-valued vectors that represent the importance of a word to a document (Salameh et al., 2018). One of the drawbacks of the Tf-idf vectorized representation of text is that it loses the information of the word order (i.e., syntactical information). Considering n-grams at both the word and character levels reduces the effect of that drawback. Accordingly, unigram and bigram word-level Tf-idf vectors were extracted, in addition to character-level Tf-idf vectors with n-gram values ranging from 1 to 5.

Features specific to tweets: There are features that are unique to Twitter, such as user mentions (e.g., @abushooooq8) and emojis. Using the username as a feature was found to help the model understand the user dialect; for instance, it can easily find that the users @7abeeb_ksa, @a_ksa2030, and @alanzisaudi have a Saudi Arabian dialect. Character-level unigram Tf-idf has been extracted from each of the mentioned features.

[...] features and lm is the language model probabilities.

4.2.2 Deep Learning Approaches

fastText Classification: The word embeddings of the words in an input sentence are fed into a weighted average layer, whose output is then fed to a linear classifier with a softmax output layer (Joulin et al., 2017).

SepCNN: Stands for Separable Convolutional Neural Networks (Denk, 2017), and is composed of two consecutive convolutional layers. The first operates on the spatial dimension and is applied to each channel separately, while the second convolves over the channel dimension only. Word embeddings of the sentences are looked up from AraVec (Soliman et al., 2017). Then, the embedding of each word in the sentence (i.e., the tweet) is passed into a number of SepCNN blocks followed by a max pooling layer.

Word-level LSTM: A traditional deep learning classification method. The word sequence is passed into an AraVec layer to look up word embeddings and then fed into a number of LSTM layers. The final word sequence is used as an input to a softmax layer to predict the dialect (Liu et al., 2016).
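As a rough illustration of the n-gram Tf-idf features described in Section 4.1 (word-level unigrams and bigrams, character-level n-grams from 1 to 5), the following pure-Python sketch may help. It is not the authors' implementation: the function names (word_ngrams, char_ngrams, tfidf), the whitespace tokenizer, the smoothed idf formula, and the three toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Return word-level n-grams (unigrams and bigrams by default).

    Assumption: tokens are obtained by simple whitespace splitting.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def char_ngrams(text, n_min=1, n_max=5):
    """Return character-level n-grams for every n in [n_min, n_max]."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def tfidf(docs, analyzer):
    """Map each document to a sparse {term: tf * idf} dictionary.

    Assumption: tf is the raw count and idf is the smoothed variant
    ln((1 + N) / (1 + df)) + 1; the paper does not state which
    weighting scheme was used.
    """
    counts = [Counter(analyzer(doc)) for doc in docs]
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }
    return [{term: tf * idf[term] for term, tf in c.items()} for c in counts]


# Hypothetical toy corpus: three short phrases, not taken from the dataset.
docs = ["شلونك اليوم", "ازيك النهارده", "شلونك حبيبي"]
word_vectors = tfidf(docs, word_ngrams)   # word unigrams + bigrams
char_vectors = tfidf(docs, char_ngrams)   # character 1- to 5-grams
```

In this toy corpus, the word that appears in two documents receives a lower weight than a word that appears in only one, which is the behaviour Tf-idf is meant to provide. In practice the two sparse representations would be concatenated with the other features before being passed to a classifier.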