
ACTSA: Annotated Corpus for Telugu Sentiment Analysis

Sandeep Sricharan Mukku and Radhika Mamidi
Language Technologies Research Center, KCIS, IIIT Hyderabad
[email protected], [email protected]

Abstract

Sentiment analysis deals with the task of determining the polarity of a document or sentence and has received a lot of attention in recent years for the English language. With the rapid growth of social media these days, a lot of data is available in regional languages besides English. Telugu is one such regional language with abundant data available in social media, but it is hard to find labelled data of sentences for Telugu Sentiment Analysis. In this paper, we describe an effort to build a gold-standard annotated corpus of Telugu sentences to support Telugu Sentiment Analysis. The corpus, named ACTSA (Annotated Corpus for Telugu Sentiment Analysis), has a collection of Telugu sentences taken from different sources which were then pre-processed and manually annotated by native Telugu speakers using our annotation guidelines. In total, we have annotated 5410 sentences, which makes our corpus the largest resource currently available. The corpus and annotation guidelines are made publicly available.

1 Introduction

Nowadays, people commonly write comments, reviews, and blog posts in social media about trending activities in their regional languages. Unlike English, many regional languages lack the NLP tools and resources to analyze these activities. Moreover, English has many datasets available; however, the same is not true for Telugu.

The annotation of Telugu data has not received a lot of attention in the sentiment analysis community. While there is a wealth of raw corpora with opinionated information, no corpora with annotated sentences in Telugu are publicly available as far as we know.

Telugu has a special status as an official standard language in the twin states of Andhra Pradesh and Telangana of India. There are a large variety of dialects that constitute the mother tongues of Telugu speakers. Major Telugu print media, journalism, and electronic media follow the dialects of Krishna and Godavari, since these have been conceived as arguably standard and as easily reaching the rest of the Telugu speakers (Krishnamurthi, 1961). We built our corpus on this dialect as it is the most prominent and has a strong online presence today on news websites, blogs, forums, and user/reader commentaries.

In this work, we present a dedicated gold-standard corpus of polarity-annotated Telugu sentences. To our knowledge, our corpus is the largest source of polarity-annotated Telugu sentences to date. This data also motivates the development of new techniques for Telugu sentiment analysis. The corpus and annotation guidelines are publicly available at https://goo.gl/M9rkUX.

2 Related Work

There is a growing interest within the Natural Language Processing community to build corpora for Indian languages from the data available on the web. Kaur and Gupta (2013) surveyed sentiment analysis for different Indian languages including Telugu, but did not mention the corpora used. Mukku et al. (2016) performed sentiment classification of Telugu text using various ML techniques, but no data was made publicly available.

Wiebe et al. (2005) describe a corpus annotation project to study issues in the manual annotation of opinions, emotions, sentiments, speculations, evaluations, and other private states in language. This was the first attempt to manually annotate a 10,000-sentence corpus of articles from the news. Alm et al. (2005) manually annotated 1580 sentences extracted from 22 Grimms' tales for the task of emotion annotation at the sentence level.

Arora (2013) performed the sentiment analysis task for the Hindi language with a limited corpus manually annotated by native Hindi speakers. Das and Bandyopadhyay (2010b) aim to manually annotate the sentences in a web-based Bengali blog corpus with emotional components such as emotional expression (word/phrase), intensity, associated holder, and topic(s).

Das and Bandyopadhyay (2010a) built a lexicon of words to support the task of Telugu sentiment analysis and made it available to the public. Das and Bandyopadhay (2010) created an interactive gaming technology (Dr. Sentiment) to create and validate SentiWordNet for Telugu.

3 Data Collection

In this section, we describe the different resources from which raw data was obtained and how that data was processed, as shown in Figure 1.

Figure 1: Process of building the resource

Currently, most of the corpora available for Sentiment Analysis are harvested from sources like review data from e-commerce websites, where customers express their opinion on products freely, and posts from social networking sites like Twitter and Facebook. Although the news genre has received much less attention within the Sentiment Analysis community, news plays an important role in exhibiting reality and has a strong influence on social practices. Also, a lot of Telugu data is available, mostly on news websites. These reasons motivated us to select the news genre for building our corpus. We scraped and harvested our raw data from five different Telugu news websites, viz. Andhrabhoomi (http://www.andhrabhoomi.net/), Andhrajyothi (http://www.andhrajyothy.com/), Eenadu (http://www.eenadu.net/), Kridajyothi (http://www.andhrajyothy.com/pages/sports), and Sakshi (http://www.sakshi.com/). In total we collected over 453 news articles and filtered them down to 321 which were relevant to our work.

The extracted data was cleaned in a pre-processing step, e.g. by removing headings and sub-headings, eliminating sentences with non-Telugu words, and cleaning any extra dots, extra spaces, URLs, and other garbage values. Later, sentence segmentation was done, where this data was split into individual sentences.

The sentences thus obtained were then tested for objectivity manually. Objective sentences are sentences where no sentiment, opinion, etc. is expressed. They state a fact confidently and have evidence to support it. For example, sentence (1) is an objective sentence as it is a verifiable fact with evidence.

(1) అబ్దుల్ కలాం భారతదేశ అధ్యక్షుడిగా పనిచేశారు
Transliteration: Abdul kalāṁ bhāratadēśa adhyakṣuḍigā panicēśāru
English: Abdul Kalam served as the president of India
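To make the pre-processing step above concrete, the following is a minimal sketch of such a cleaning and sentence-segmentation pass. It is not the authors' actual script; the regular expressions, the Telugu Unicode range check, and the helper name preprocess_article are illustrative assumptions.

```python
import re

TELUGU_BLOCK = re.compile(r'[\u0C00-\u0C7F]')    # Telugu Unicode range
NON_TELUGU_WORD = re.compile(r'\b[A-Za-z]+\b')   # crude Latin-script word check
URL = re.compile(r'https?://\S+|www\.\S+')

def preprocess_article(text):
    """Clean one scraped article and split it into candidate Telugu sentences.

    A rough approximation of the pre-processing described in Section 3:
    strip URLs and extra dots/spaces, segment on sentence-final punctuation,
    and drop sentences containing non-Telugu words.
    """
    text = URL.sub(' ', text)                 # strip URLs
    text = re.sub(r'\.{2,}', '.', text)       # collapse runs of dots
    text = re.sub(r'\s+', ' ', text).strip()  # collapse extra whitespace

    # naive sentence segmentation on sentence-final punctuation
    sentences = re.split(r'(?<=[.!?।])\s+', text)

    cleaned = []
    for sent in sentences:
        sent = sent.strip()
        if not sent or not TELUGU_BLOCK.search(sent):
            continue                          # skip empty / non-Telugu lines
        if NON_TELUGU_WORD.search(sent):
            continue                          # drop sentences with non-Telugu words
        cleaned.append(sent)
    return cleaned
```

A call like preprocess_article(article_text) would then yield the candidate sentences that are passed to the manual objectivity check.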

Table 1: Example annotations

ID | Original Sentence | English Translation | A1 | A2 | V | F
1 | అమె అధ ట ం వరణ ఒపందం ం అమె వైలం | US President Donald Trump withdrew the US from the Paris Climate Agreement | Neg | Obj | Obj | Obj
2 | ఇం ఎవ అభంతరం ఉండనవసరం | There is no need for any objection to anyone in this | Neu | Neu | NA | Neu
3 | రత ప నమం నంద అల రపై సంం | India's Prime Minister Narendra Modi has reacted severely to the Kashmir riots | Neg | Neg | NA | Neg
4 | ఫలపై మం సంషం ఉ | The minister is happy on the results | Pos | Pos | NA | Pos

Pos = Positive, Neg = Negative, Neu = Neutral, Obj = Objective, NA = Not Applicable, A1 = Annotator 1, A2 = Annotator 2, V = Validation, F = Final Result

These sentences do not contain any sentiment/polarity and are not useful for sentiment analysis. The objective sentences thus separated by the objectivity test are removed from the data.

4 Annotation

In this section, we describe the process followed for annotating the sentences (refer to Figure 1). First, we built a team of seven educated native Telugu speakers for the task of polarity tagging of the extracted Telugu sentences. Then, we developed an annotation schema for this task, and the annotators were instructed to thoroughly understand the concepts mentioned in the schema for a precise annotation. Each sentence is annotated by two annotators.

The annotators were required to tag the sentences with three polarities: positive, negative, neutral. For example, sentence (2) should be tagged positive as it expresses positive sentiment through the use of కృతజ్ఞత (gratitude).

(2) మంత్రి, ఆయనను ఎన్నుకున్నందుకు, ప్రజలకు కృతజ్ఞత వ్యక్తం చేశారు
Transliteration: Mantri, āyananu ennukunnanduku, prajalaku kr̥tajñata vyaktaṁ cēśāru
English: The minister expressed gratitude to the people for electing him

On the other hand, sentence (3) should be tagged negative because it expresses negative sentiment with ఆందోళన (concern).

(3) నిరంతర విద్యుత్ కోతలపై ప్రజలు ఆందోళన వ్యక్తం చేశారు
Transliteration: Nirantara vidyut kōtalapai prajalu āndōḷana vyaktaṁ cēśāru
English: People have expressed concern over continuous power cuts

However, sentence (4) is a neutral sentence as it is a speculation about the future. Even though it does not contain any sentiment, it is not an objective sentence, because it is not a verifiable fact, nor something which happened in the past; it speculates about something that may happen in the future.

(4) ప్రధాని వచ్చే నెలలో చైనాను సందర్శించనున్నారు
Transliteration: Pradhāni vaccē nelalō cainānu sandarśin̄canunnāru
English: The prime minister is expected to visit next month

If in any case annotators were unsure or felt ambiguous about the polarity of a sentence, they can label it uncertain. If they feel the sentence is objective but was not removed in the pre-processing step, they can mark it objective.

Annotators labelled all the sentences, with each sentence annotated by exactly two annotators. We call this the annotation step. The sentences marked uncertain by at least one annotator were discarded to avoid any ambiguous sentences in the corpus.

The sentences which had a clash between the two annotators' labels were sent for a third independent annotation, which we call the validation step. The most common label among the three annotators was considered the final label for the sentence. If the disagreement prevailed even after the third annotation, such sentences were discarded, as we considered them too ambiguous for having received three different labels from three different annotators. If there were any objective sentences left after the validation step, they were discarded.

Table 1 shows some example annotations from the corpus.

5 Agreement Study

After the annotation task, we measured how reliable our annotation scheme was. To measure the reliability of our polarity annotation scheme, we conducted an inter-annotator agreement study on the annotated sentences. Table 2 shows the agreement between the two annotators' judgments for each sentence.

Table 2: Agreement for Sentences in ACTSA

Annotator 1 \ Annotator 2 | Positive | Negative | Neutral | Total
Positive | 1463 | 31 | 103 | 1597
Negative | 23 | 1421 | 116 | 1560
Neutral | 112 | 127 | 2427 | 2666
Total | 1598 | 1579 | 2646 | 5823

We used Cohen's kappa, κ, which is calculated using formula (5):

κ = (p_o − p_e) / (1 − p_e)    (5)

where p_o is the relative observed agreement and p_e is the agreement expected by chance. In general, κ values between 0.6 and 0.8 are considered a substantial agreement. To our surprise, we got a κ value of 0.87, which is in perfect agreement and is an indication of the reliability of the annotations.

6 Corpus Statistics

In this section, we present the statistics about our data, from raw data collection to final sentences. We scraped several websites for the data. We collected 453 news articles and filtered them down to 321 which were relevant for our work. After pre-processing this raw data, we had 11952 sentences. We tested the sentences for subjectivity (as explained in Section 3) and removed 4327 objective sentences, after which we were left with 7812 sentences. These sentences were given to the annotators for annotation as mentioned in Section 4. 1802 sentences were removed where at least one annotator marked them uncertain. Of the remaining 5823 sentences, 512 had disagreement and were sent for a third independent annotation. After the third annotation, 413 sentences were discarded because the disagreement prevailed or because they were objective. The final 5410 sentences form the required annotated corpus, ACTSA. Statistics about our complete corpus can be found in Table 3.

Table 3: Statistics about the data

News articles | 321
Cleaned Sentences | 11952
Objective Sentences (Removed) | 4327
Uncertain Sentences (Removed) | 1802
Disagreement Sentences | 512
  Classified | 99
  Removed | 413
Positive sentences | 1489
Negative sentences | 1441
Neutral sentences | 2475
Total sentences | 5410
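As a concrete check of formula (5), the following minimal Python sketch computes κ directly from the Table 2 confusion matrix. The function name cohen_kappa is our own; the hard-coded counts are taken from Table 2, and the script is illustrative rather than part of the released corpus tooling.

```python
def cohen_kappa(matrix):
    """Cohen's kappa from a square agreement matrix (formula (5)).

    matrix[i][j] = number of sentences labelled i by Annotator 1
    and j by Annotator 2.
    """
    total = sum(sum(row) for row in matrix)
    # relative observed agreement p_o: fraction of sentences with identical labels
    p_o = sum(matrix[i][i] for i in range(len(matrix))) / total
    # chance agreement p_e from the two annotators' marginal label distributions
    row_marginals = [sum(row) / total for row in matrix]
    col_marginals = [sum(col) / total for col in zip(*matrix)]
    p_e = sum(r * c for r, c in zip(row_marginals, col_marginals))
    return (p_o - p_e) / (1 - p_e)

# Counts from Table 2 (rows: Annotator 1, columns: Annotator 2),
# label order: Positive, Negative, Neutral
table2 = [
    [1463, 31, 103],
    [23, 1421, 116],
    [112, 127, 2427],
]
print(round(cohen_kappa(table2), 2))  # ~0.86, in line with the reported 0.87
```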

7 Experiments and Evaluation

A strategy that can give very useful hints about the reliability of annotated data is the comparison between the results of automated classification and human annotation.

Mukku et al. (2016) described a method to perform automated classification of Telugu sentences into polarity tags: positive, negative and neutral. We followed this method to evaluate our data; a schematic sketch of this evaluation protocol is given after Section 8. We used 2000 sentences from our human-annotated corpus to train the model for automated classification.

To test the reliability of our annotated data, we compared the classification produced by humans with that of the automated classifier trained above. The testing was done on the remaining 3410 sentences, and the error rate was observed to be 12.3%, which hints at the quality and reliability of the annotated corpus, ACTSA.

8 Conclusion

In this work, we presented a gold-standard corpus of Telugu sentences taken from different resources, which were then cleaned and annotated by native Telugu speakers. Each sentence has a polarity label attached to it. We described our annotation process and gave an overview of our annotation schema. The results from our evaluation study show that our corpus has a reasonable inter-annotator agreement. The corpus and guidelines are publicly available. In the future, we plan to automate the task of annotation for new sentences with the help of ACTSA. We would also like to perform the sentiment analysis task for Telugu using this corpus.
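The following is a hedged sketch of the evaluation protocol described in Section 7. It is not the pipeline of Mukku et al. (2016): the character n-gram TF-IDF features and the logistic regression classifier are generic stand-ins, and the file name actsa.tsv and its tab-separated sentence/label format are assumptions.

```python
# Hypothetical evaluation: train on 2000 annotated sentences, test on the rest,
# and report the error rate, mirroring the protocol in Section 7.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Assumed format: one sentence per line, "sentence<TAB>label",
# with label in {positive, negative, neutral}.
with open("actsa.tsv", encoding="utf-8") as f:
    rows = [line.rstrip("\n").split("\t") for line in f if line.strip()]
sentences = [r[0] for r in rows]
labels = [r[1] for r in rows]

train_sents, test_sents = sentences[:2000], sentences[2000:]
train_labels, test_labels = labels[:2000], labels[2000:]

# Character n-grams side-step the lack of robust Telugu tokenizers
# (an assumption, not the feature set used in the paper).
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X_train = vectorizer.fit_transform(train_sents)
X_test = vectorizer.transform(test_sents)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_labels)

error_rate = 1.0 - accuracy_score(test_labels, clf.predict(X_test))
print(f"error rate: {error_rate:.1%}")  # the paper reports 12.3% with its own classifier
```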

Acknowledgements

We would like to thank all the members of the annotation group (native Telugu speakers) for their tireless effort in this process. The work reported in this paper was supported by the Kohli Center on Intelligent Systems.

References

Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 579–586.

Piyush Arora. 2013. Sentiment analysis for Hindi language. MS by Research in Computer Science, IIIT Hyderabad.

Amitava Das and S. Bandyopadhay. 2010. Dr Sentiment creates SentiWordNet(s) for Indian languages involving Internet population. In Proceedings of the Indo-WordNet Workshop.

Amitava Das and Sivaji Bandyopadhyay. 2010a. SentiWordNet for Indian languages. Asian Federation for Natural Language Processing, China, pages 56–63.

Dipankar Das and Sivaji Bandyopadhyay. 2010b. Labeling emotion in Bengali blog corpus – a fine grained tagging at sentence level. In Proceedings of the 8th Workshop on Asian Language Resources, page 47.

Amandeep Kaur and Vishal Gupta. 2013. A survey on sentiment analysis and opinion mining techniques. Journal of Emerging Technologies in Web Intelligence 5(4):367–371.

Bh. Krishnamurthi. 1961. Telugu verbal bases: A comparative and descriptive study.

Sandeep Sricharan Mukku, Nurendra Choudhary, and Radhika Mamidi. 2016. Enhanced sentiment classification of Telugu text using ML techniques. In 4th Workshop on Sentiment Analysis where AI meets Psychology, 25th International Joint Conference on Artificial Intelligence, page 29.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation 39(2):165–210.
