A Survey on Sentiment Analysis in Urdu: A Resource-Poor Language

Egyptian Informatics Journal 22 (2021) 53–74

Review

Asad Khattak a, Muhammad Zubair Asghar b,*, Anam Saeed b, Ibrahim A. Hameed c,*, Syed Asif Hassan d, Shakeel Ahmad d

a College of Technological Innovation, Zayed University, 144534, Abu Dhabi Campus, UAE
b Institute of Computing and Information Technology, Gomal University, D.I.Khan (KP), Pakistan
c Department of ICT and Natural Sciences, Faculty of Information Technology and Electrical Engineering, Hovedbygget, B316, Ålesund, Norway
d Faculty of Computing and Information Technology in Rabigh (FCITR), King Abdulaziz University, Jeddah, Saudi Arabia

* Corresponding authors.
E-mail addresses: shassan1@kau.edu.sa (S. Asif Hassan).

https://doi.org/10.1016/j.eij.2020.04.003
1110-8665/© 2020 THE AUTHORS. Published by Elsevier BV on behalf of Faculty of Computers and Artificial Intelligence, Cairo University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Article info

Article history:
Received 27 December 2019
Revised 7 March 2020
Accepted 23 April 2020
Available online 15 May 2020

Keywords:
Urdu sentiment analysis
Pre-processing
Sentiment lexicon
Datasets
Corpus
Urdu sentiment classification
Semantic orientation

Abstract

Background/introduction: The dawn of the internet opened the doors to the easy and widespread sharing of information on subject matters such as products, services, events and political opinions. While the volume of studies conducted on sentiment analysis is rapidly expanding, these studies mostly address English language concerns. The primary goal of this study is to present a state-of-the-art survey that identifies the progress and shortcomings of Urdu sentiment analysis and proposes rectifications.

Methods: We describe the advancements made thus far in this area by categorising the studies along three dimensions, namely text pre-processing, lexical resources and sentiment classification. The pre-processing operations include word segmentation, text cleaning, spell checking and part-of-speech tagging. An evaluation of sophisticated lexical resources, including corpora and lexicons, was carried out, and investigations were conducted on sentiment analysis constructs such as opinion words, modifiers and negations.

Results and conclusions: Performance is reported for each of the reviewed studies. Based on the experimental results and the proposals put forward, this paper provides the groundwork for further studies on Urdu sentiment analysis.

Contents

1. Introduction
   1.1. Need of Urdu SA
   1.2. Research motivation
   1.3. Our contributions
   1.4. Relation to the previous work
2. Survey methodology
   2.1. Survey protocol
   2.2. Research questions
   2.3. Search strategy and inclusion & exclusion criteria
   2.4. Study quality assessment
   2.5. Conducting the survey
3. Survey classification
   3.1. RQ1: What are the text pre-processing techniques used in Urdu SA and what are the techniques used by researchers as reported in the published articles?
      3.1.1. Urdu words segmentation
      3.1.2. Text cleaning
      3.1.3. Urdu spell checking & correction, part of speech tagging and named entity recognition
   3.2. RQ2: What are the different lexical resources used for Urdu SA and which techniques are used for creating such resources?
      3.2.1. Urdu corpus
      3.2.2. Sentiment lexicon construction
   3.3. RQ3: Which techniques have been used for the sentiment classification of Urdu text and what are the recommended methods for efficient classification of sentiments in Urdu reviews?
      3.3.1. Subjectivity analysis
      3.3.2. Semantic orientation
      3.3.3. Modifier management
      3.3.4. Negation handling
      3.3.5. Levels of sentiment classification
4. Comparison between various approaches
   4.1. Summary of several investigations
   4.2. Open problems of Urdu SA
      4.2.1. Scarcity of sentiment lexicons and lack of precision in opinion word rating
      4.2.2. Emoticon and slang stockpile
      4.2.3. Management of modifiers and negations
      4.2.4. Categorisation of domain-centric words
      4.2.5. Categorization of slang
      4.2.6. Categorization of emoticons
5. Results and discussion
   5.1. Answers to posed research questions
   5.2. Qualitative and quantitative evaluation
   5.3. Trends in Urdu sentiment analysis
6. Conclusions
7. Informed consent
8. Human and animal rights
Funding
Declaration of Competing Interest
References

1. Introduction

The surfacing of social media sites has allowed and encouraged the wide dissemination of knowledge and opinions on issues related to merchandise, guidelines, facilities, and dilemmas [18]. The sharing of information on social networks has led to the development of high-tech appliances to facilitate good decision-making by firms and individuals [42].

The English language is loaded with sentiment analysis (SA) [...] stance gives rise to obstacles during efforts to structure a corpus that is machine readable. The fundamental component for the crafting of an SA system in any language is the sentiment lexicon. The resource-rich English language comes with a substantial number of well-established sentiment lexicons (such as SentiWordNet). Urdu, on the other hand, is a resource-deprived language sorely lacking in sentiment lexicons.

Issues related to word segmentation, dissimilarities in morphology, inconsistencies
