
ACTSA: Annotated Corpus for Telugu Sentiment Analysis

Sandeep Sricharan Mukku and Radhika Mamidi
Language Technologies Research Center, KCIS, IIIT Hyderabad
[email protected], [email protected]

Abstract

Sentiment analysis deals with the task of determining the polarity of a document or sentence and has received a lot of attention in recent years for the English language. With the rapid growth of social media these days, a lot of data is available in regional languages besides English. Telugu is one such regional language with abundant data available in social media, but it is hard to find labelled data of sentences for Telugu Sentiment Analysis. In this paper, we describe an effort to build a gold-standard annotated corpus of Telugu sentences to support Telugu Sentiment Analysis. The corpus, named ACTSA (Annotated Corpus for Telugu Sentiment Analysis), has a collection of Telugu sentences taken from different sources which were then pre-processed and manually annotated by native Telugu speakers using our annotation guidelines. In total, we have annotated 5410 sentences, which makes our corpus the largest resource currently available. The corpus and annotation guidelines are made publicly available.

1 Introduction

Nowadays, people commonly write comments, reviews, and blog posts in social media about trending activities in their regional languages. Unlike English, many regional languages lack the NLP tools and resources to analyze these activities. Moreover, English has many datasets available; however, the same is not true for Telugu.

The annotation of Telugu data has not received a lot of attention in the sentiment analysis community. While there is a wealth of raw corpora with opinionated information, no corpora with annotated sentences in Telugu are publicly available as far as we know.

Telugu has a special status as an official standard language in the twin states of Andhra Pradesh and Telangana of India. There are a large variety of dialects that constitute the mother tongues of Telugu speakers. Major Telugu print media, journalism, and electronic media follow the dialects of Krishna and Godavari, since these have been conceived as arguably standard and as easily reaching the rest of the Telugu speakers (Krishnamurthi, 1961). We built our corpus on this dialect as it is the most prominent and has a strong online presence today on news websites, blogs, forums, and user/reader commentaries.

In this work, we present a dedicated gold-standard corpus of polarity-annotated Telugu sentences. To our knowledge, our corpus is the largest source of polarity-annotated Telugu sentences to date. This data also motivates the development of new techniques for Telugu sentiment analysis. The corpus and annotation guidelines are publicly available at https://goo.gl/M9rkUX.

2 Related Work

There is a growing interest within the Natural Language Processing community to build corpora for Indian languages from the data available on the web. Kaur and Gupta (2013) surveyed sentiment analysis for different Indian languages including Telugu, but did not mention the corpora used. Mukku et al. (2016) performed sentiment classification of Telugu text using various ML techniques, but no data was made publicly available.

Wiebe et al. (2005) describe a corpus annotation project to study issues in the manual annotation of opinions, emotions, sentiments, speculations, evaluations, and other private states in language. This was the first attempt to manually annotate a 10,000-sentence corpus of articles from the news. Alm et al. (2005) manually annotated 1580 sentences extracted from 22 Grimms' tales for the task of emotion annotation at the sentence level.

Arora (2013) performed the sentiment analysis task for the Hindi language with a limited corpus manually annotated by native Hindi speakers. Das and Bandyopadhyay (2010b) aim to manually annotate the sentences in a web-based Bengali blog corpus with emotional components such as emotional expression (word/phrase), intensity, associated holder, and topic(s).

Das and Bandyopadhyay (2010a) built a lexicon of words to support the task of Telugu sentiment analysis and made it available to the public. Das and Bandyopadhay (2010) created an interactive gaming technology (Dr. Sentiment) to create and validate SentiWordNet for Telugu.

3 Data Collection

In this section, we describe the different resources from which raw data was obtained and how that data was processed, as shown in Figure 1.

Figure 1: Process of building the resource

Currently, most of the corpora available for Sentiment Analysis are harvested from sources like review data from e-commerce websites, where customers express their opinion on products freely, and posts from social networking sites like Twitter and Facebook. Although the news genre has received much less attention within the Sentiment Analysis community, news plays an important role in exhibiting reality and has a strong influence on social practices. Also, a lot of Telugu data is available, mostly on news websites. These reasons motivated us to select the news genre for building our corpus. We scraped and harvested our raw data from five different Telugu news websites, viz. Andhrabhoomi (http://www.andhrabhoomi.net/), Andhrajyothi (http://www.andhrajyothy.com/), Eenadu (http://www.eenadu.net/), Kridajyothi (http://www.andhrajyothy.com/pages/sports), and Sakshi (http://www.sakshi.com/). In total we collected over 453 news articles and filtered them down to 321 which were relevant to our work.

The extracted data was cleaned in a pre-processing step, e.g. by removing headings and sub-headings, eliminating sentences with non-Telugu words, and cleaning any extra dots, extra spaces, URLs, and other garbage values. Later, sentence segmentation was done, where this data was split into individual sentences.

The sentences thus obtained were then tested for objectivity manually. Objective sentences are sentences where no sentiment, opinion, etc. is expressed. They state a fact confidently and have evidence to support it. For example, sentence (1) is an objective sentence as it is a verifiable fact with evidence.

(1) అబ్దుల్ కలాం భారతదేశ అధ్యక్షుడిగా పనిచేశారు
Transliteration: Abdul kalāṁ bhāratadēśa adhyakṣuḍigā panicēśāru
English: Abdul Kalam served as the president of India
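To make the pre-processing step above concrete, the following is a minimal sketch of such a cleaning and sentence-segmentation pass. It is not the authors' actual script; the regular expressions, the Telugu Unicode range check, and the helper name preprocess_article are illustrative assumptions.

```python
import re

TELUGU_BLOCK = re.compile(r'[\u0C00-\u0C7F]')    # Telugu Unicode range
NON_TELUGU_WORD = re.compile(r'\b[A-Za-z]+\b')   # crude Latin-script word check
URL = re.compile(r'https?://\S+|www\.\S+')

def preprocess_article(text):
    """Clean one scraped article and split it into candidate Telugu sentences.

    A rough approximation of the pre-processing described in Section 3:
    strip URLs and extra dots/spaces, segment on sentence-final punctuation,
    and drop sentences containing non-Telugu words.
    """
    text = URL.sub(' ', text)                 # strip URLs
    text = re.sub(r'\.{2,}', '.', text)       # collapse runs of dots
    text = re.sub(r'\s+', ' ', text).strip()  # collapse extra whitespace

    # naive sentence segmentation on sentence-final punctuation
    sentences = re.split(r'(?<=[.!?।])\s+', text)

    cleaned = []
    for sent in sentences:
        sent = sent.strip()
        if not sent or not TELUGU_BLOCK.search(sent):
            continue                          # skip empty / non-Telugu lines
        if NON_TELUGU_WORD.search(sent):
            continue                          # drop sentences with non-Telugu words
        cleaned.append(sent)
    return cleaned
```

A call like preprocess_article(article_text) would then yield the candidate sentences that are passed to the manual objectivity check.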

Table 1: Example annotations

ID | Original Sentence | English Translation | A1 | A2 | V | F
1 | అమె అధ ట ం వరణ ఒపందం ం అమె వైలం | US President Donald Trump withdrew the US from the Paris Climate Agreement | Neg | Obj | Obj | Obj
2 | ఇం ఎవ అభంతరం ఉండనవసరం | There is no need for any objection to anyone in this | Neu | Neu | NA | Neu
3 | రత ప నమం నంద అల రపై సంం | India's Prime Minister Narendra Modi has reacted severely to the Kashmir riots | Neg | Neg | NA | Neg
4 | ఫలపై మం సంషం ఉ | The minister is happy on the results | Pos | Pos | NA | Pos

Pos = Positive, Neg = Negative, Neu = Neutral, Obj = Objective, NA = Not Applicable, A1 = Annotator 1, A2 = Annotator 2, V = Validation, F = Final Result

These sentences do not contain any sentiment/polarity and are not useful for sentiment analysis. The objective sentences thus separated by the objectivity test are removed from the data.

4 Annotation

In this section, we describe the process followed for annotating the sentences (refer to Figure 1). First, we built a team of seven educated native Telugu speakers for the task of polarity tagging of the extracted Telugu sentences. Then, we developed an annotation schema for this task, and the annotators were instructed to thoroughly understand the concepts mentioned in the schema for a precise annotation. Each sentence is annotated by two annotators.

The annotators were required to tag the sentences with three polarities: positive, negative, neutral. For example, sentence (2) should be tagged positive as it expresses positive sentiment through the use of కృతజ్ఞత (gratitude).

(2) మంత్రి, ఆయనను ఎన్నుకున్నందుకు, ప్రజలకు కృతజ్ఞత వ్యక్తం చేశారు
Transliteration: Mantri, āyananu ennukunnanduku, prajalaku kr̥tajñata vyaktaṁ cēśāru
English: The minister expressed gratitude to the people for electing him

On the other hand, sentence (3) should be tagged negative because it expresses negative sentiment with ఆందోళన (concern).

(3) నిరంతర విద్యుత్ కోతలపై ప్రజలు ఆందోళన వ్యక్తం చేశారు
Transliteration: Nirantara vidyut kōtalapai prajalu āndōḷana vyaktaṁ cēśāru
English: People have expressed concern over continuous power cuts

However, sentence (4) is a neutral sentence as it is a speculation about the future. Even though it does not contain any sentiment, it is not an objective sentence, because it is not a verifiable fact, nor something which happened in the past; it speculates about something that may happen in the future.

(4) ప్రధాని వచ్చే నెలలో చైనాను సందర్శించనున్నారు
Transliteration: Pradhāni vaccē nelalō cainānu sandarśin̄canunnāru
English: The prime minister is expected to visit next month

If in any case annotators were unsure or felt ambiguous about the polarity of a sentence, they can label it uncertain. If they feel the sentence is objective but was not removed in the pre-processing step, they can mark it objective.

Annotators labelled all the sentences, with each sentence annotated by exactly two annotators. We call this the annotation step. The sentences marked uncertain by at least one annotator were discarded to avoid any ambiguous sentences in the corpus.

The sentences which had a clash between the two annotators' labels were sent for a third independent annotation, which we call the validation step. The most common label among the three annotators was considered the final label for the sentence. If the disagreement prevailed even after the third annotation, such sentences were discarded, as we considered them too ambiguous for having received three different labels from three different annotators. If there were any objective sentences left after the validation step, they were discarded.

Table 1 shows some example annotations from the corpus.

5 Agreement Study

After the annotation task, we measured how reliable our annotation scheme was. To measure the reliability of our polarity annotation scheme, we conducted an inter-annotator agreement study on the annotated sentences. Table 2 shows the agreement between the two annotators' judgments for each sentence.

Table 2: Agreement for Sentences in ACTSA

Annotator 1 \ Annotator 2 | Positive | Negative | Neutral | Total
Positive | 1463 | 31 | 103 | 1597
Negative | 23 | 1421 | 116 | 1560
Neutral | 112 | 127 | 2427 | 2666
Total | 1598 | 1579 | 2646 | 5823

We used Cohen's kappa, κ, which is calculated using formula (5):

κ = (p_o − p_e) / (1 − p_e)    (5)

where p_o is the relative observed agreement and p_e is the agreement expected by chance. In general, κ values between 0.6 and 0.8 are considered a substantial agreement. To our surprise, we got a κ value of 0.87, which is in perfect agreement and is an indication of the reliability of the annotations.

6 Corpus Statistics

In this section, we present the statistics about our data, from raw data collection to final sentences. We scraped several websites for the data. We collected 453 news articles and filtered them down to 321 which were relevant for our work. After pre-processing this raw data, we had 11952 sentences. We tested the sentences for subjectivity (as explained in Section 3) and removed 4327 objective sentences, after which we were left with 7812 sentences. These sentences were given to the annotators for annotation as mentioned in Section 4. 1802 sentences were removed where at least one annotator marked them uncertain. Of the remaining 5823 sentences, 512 had disagreement and were sent for a third independent annotation. After the third annotation, 413 sentences were discarded because the disagreement prevailed or because they were objective. The final 5410 sentences form the required annotated corpus, ACTSA. Statistics about our complete corpus can be found in Table 3.

Table 3: Statistics about the data

News articles | 321
Cleaned Sentences | 11952
Objective Sentences (Removed) | 4327
Uncertain Sentences (Removed) | 1802
Disagreement Sentences | 512
  Classified | 99
  Removed | 413
Positive sentences | 1489
Negative sentences | 1441
Neutral sentences | 2475
Total sentences | 5410
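As a concrete check of formula (5), the following minimal Python sketch computes κ directly from the Table 2 confusion matrix. The function name cohen_kappa is our own; the hard-coded counts are taken from Table 2, and the script is illustrative rather than part of the released corpus tooling.

```python
def cohen_kappa(matrix):
    """Cohen's kappa from a square agreement matrix (formula (5)).

    matrix[i][j] = number of sentences labelled i by Annotator 1
    and j by Annotator 2.
    """
    total = sum(sum(row) for row in matrix)
    # relative observed agreement p_o: fraction of sentences with identical labels
    p_o = sum(matrix[i][i] for i in range(len(matrix))) / total
    # chance agreement p_e from the two annotators' marginal label distributions
    row_marginals = [sum(row) / total for row in matrix]
    col_marginals = [sum(col) / total for col in zip(*matrix)]
    p_e = sum(r * c for r, c in zip(row_marginals, col_marginals))
    return (p_o - p_e) / (1 - p_e)

# Counts from Table 2 (rows: Annotator 1, columns: Annotator 2),
# label order: Positive, Negative, Neutral
table2 = [
    [1463, 31, 103],
    [23, 1421, 116],
    [112, 127, 2427],
]
print(round(cohen_kappa(table2), 2))  # ~0.86, in line with the reported 0.87
```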

7 Experiments and Evaluation

A strategy that can give very useful hints about the reliability of annotated data is the comparison between the results of automated classification and human annotation.

Mukku et al. (2016) described a method to perform automated classification of Telugu sentences into polarity tags: positive, negative and neutral. We followed this method to evaluate our data; a schematic sketch of this evaluation protocol is given after Section 8. We used 2000 sentences from our human-annotated corpus to train the model for automated classification.

To test the reliability of our annotated data, we compared the classification produced by humans with that of the automated classifier trained above. The testing was done on the remaining 3410 sentences, and the error rate was observed to be 12.3%, which hints at the quality and reliability of the annotated corpus, ACTSA.

8 Conclusion

In this work, we presented a gold-standard corpus of Telugu sentences taken from different resources, which were then cleaned and annotated by native Telugu speakers. Each sentence has a polarity label attached to it. We described our annotation process and gave an overview of our annotation schema. The results from our evaluation study show that our corpus has a reasonable inter-annotator agreement. The corpus and guidelines are publicly available. In the future, we plan to automate the task of annotation for new sentences with the help of ACTSA. We would also like to perform the sentiment analysis task for Telugu using this corpus.
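The following is a hedged sketch of the evaluation protocol described in Section 7. It is not the pipeline of Mukku et al. (2016): the character n-gram TF-IDF features and the logistic regression classifier are generic stand-ins, and the file name actsa.tsv and its tab-separated sentence/label format are assumptions.

```python
# Hypothetical evaluation: train on 2000 annotated sentences, test on the rest,
# and report the error rate, mirroring the protocol in Section 7.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Assumed format: one sentence per line, "sentence<TAB>label",
# with label in {positive, negative, neutral}.
with open("actsa.tsv", encoding="utf-8") as f:
    rows = [line.rstrip("\n").split("\t") for line in f if line.strip()]
sentences = [r[0] for r in rows]
labels = [r[1] for r in rows]

train_sents, test_sents = sentences[:2000], sentences[2000:]
train_labels, test_labels = labels[:2000], labels[2000:]

# Character n-grams side-step the lack of robust Telugu tokenizers
# (an assumption, not the feature set used in the paper).
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X_train = vectorizer.fit_transform(train_sents)
X_test = vectorizer.transform(test_sents)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_labels)

error_rate = 1.0 - accuracy_score(test_labels, clf.predict(X_test))
print(f"error rate: {error_rate:.1%}")  # the paper reports 12.3% with its own classifier
```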

Acknowledgements

We would like to thank all the members of the annotation group (native Telugu speakers) for their tireless effort in this process. The work reported in this paper was supported by the Kohli Center on Intelligent Systems.

References

Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 579–586.

Piyush Arora. 2013. Sentiment analysis for Hindi language. MS by Research in Computer Science, IIIT Hyderabad.

Amitava Das and S. Bandyopadhay. 2010. Dr Sentiment creates SentiWordNet(s) for Indian languages involving Internet population. In Proceedings of the Indo-WordNet Workshop.

Amitava Das and Sivaji Bandyopadhyay. 2010a. SentiWordNet for Indian languages. Asian Federation for Natural Language Processing, China, pages 56–63.

Dipankar Das and Sivaji Bandyopadhyay. 2010b. Labeling emotion in Bengali blog corpus – a fine grained tagging at sentence level. In Proceedings of the 8th Workshop on Asian Language Resources, page 47.

Amandeep Kaur and Vishal Gupta. 2013. A survey on sentiment analysis and opinion mining techniques. Journal of Emerging Technologies in Web Intelligence 5(4):367–371.

Bh. Krishnamurthi. 1961. Telugu verbal bases: A comparative and descriptive study.

Sandeep Sricharan Mukku, Nurendra Choudhary, and Radhika Mamidi. 2016. Enhanced sentiment classification of Telugu text using ML techniques. In 4th Workshop on Sentiment Analysis where AI meets Psychology, 25th International Joint Conference on Artificial Intelligence, page 29.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation 39(2):165–210.
