Automatic Sentiment and Viewpoint Analysis of Slovenian News Corpus

EMBEDDIA hackathon report: Automatic sentiment and viewpoint analysis of Slovenian news corpus on the topic of LGBTIQ+

Matej Martinc, Jožef Stefan Institute, Jamova 39, Ljubljana, [email protected]
Nina Perger, Faculty of Social Sciences, Kardeljeva ploščad 5, Ljubljana, [email protected]
Andraž Pelicon, Jožef Stefan Institute, Jamova 39, Ljubljana, andraz.pelicon@ijs.si
Matej Ulčar, Faculty of Computer Science, Večna pot 113, Ljubljana, [email protected]
Andreja Vezovnik, Faculty of Social Sciences, Kardeljeva ploščad 5, Ljubljana, [email protected]
Senja Pollak, Jožef Stefan Institute, Jamova 39, Ljubljana, [email protected]

Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, pages 121–126, April 19, 2021. © Association for Computational Linguistics

Abstract

We conduct automatic sentiment and viewpoint analysis of a newly created Slovenian news corpus containing articles related to the topic of LGBTIQ+ by employing a state-of-the-art news sentiment classifier and a system for semantic change detection. The focus is on the differences in reporting between quality news media with a long tradition and news media with financial and political connections to SDS, a Slovene right-wing political party. The results suggest that the political affiliation of the media can affect the sentiment distribution of articles and the framing of specific LGBTIQ+ topics, such as same-sex marriage.

1 Introduction

Quantitative content analysis of news related to LGBTIQ+ in general, and to marriage equality debates specifically, shows that distinctions can be drawn between media articles that express a positive, neutral or negative stance towards same-sex marriage. Articles that express a positive stance are grounded in human rights/civil equality discourses and access to benefits (Zheng and Chan, 2020; Colistra and Johnson, 2019; Paterson and Coffey-Glover, 2018), and frame marriage equality as an inevitable path towards equality, as a civil rights issue that would reduce existing prejudices and discrimination and protect the threatened LGBTIQ+ minority (Zheng and Chan, 2020).

In media articles that express a negative stance towards marriage equality, distinctive discursive elements are present, such as "equal, but separate" (marriage equality should be implemented, but differentiating labels should be kept in the name of protecting the institution of marriage) (Kania, 2020; Zheng and Chan, 2020; Paterson and Coffey-Glover, 2018), references to procreation/welfare of children (Kania, 2020; Zheng and Chan, 2020), public objection (Kania, 2020) and church-state opposition (Paterson and Coffey-Glover, 2018). The related work also shows that the differences between "liberal" and "conservative" arguments are not emphasised, mostly because both sides refer to each other's arguments, if only to negate them; yet political orientation can be identified through the tone of the article (Zheng and Chan, 2020).

When it comes to methods employed for automatic analysis of the LGBTIQ+ topic, most recent approaches rely on embeddings. Hamilton et al. (2016) employed embeddings to research how words (among them the word gay) change meaning through time. They built static embedding models for each time slice of the corpus and then made these representations comparable through vector space alignment, optimising a geometric transformation. This research was recently expanded by Shi and Lei (2020), who employed embeddings to explore semantic shifts of six descriptive LGBTIQ+ words from the 1860s to the 2000s: homosexual, lesbian, gay, bisexual, transgender, and queer.

There are also several general news analysis techniques that can be employed for the task at hand. Azarbonyad et al. (2017) developed a system for semantic shift detection for viewpoint analysis of political and media discourse. A recent study by Spinde et al. (2021) tried to identify biased terms in news articles by comparing word embeddings specific to individual news media outlets. Pelicon et al. (2020), on the other hand, developed a system for analysing the sentiment of news media articles.

While the analyses described above in a large majority of cases covered news in English-speaking countries, in this research we expand the quantitative analysis to Slovenian news, in order to determine whether attitudes towards LGBTIQ+ differ in different cultural environments. We created a corpus of LGBTIQ+ related news and conducted an automatic analysis of its content covering several aspects:

• Sentiment of news reporting, where we focused on the differences in reporting between well established media with a long tradition of news reporting and more recently established media characterised by their financial and political connections to the Slovene conservative political party SDS.

• Usage of words, where we tried to identify the words that are used differently in different news sources and would indicate the difference in the prevailing discourse on the topic of LGBTIQ+ in the specific liberal and conservative media.

The research was performed in the scope of the EMBEDDIA Hackashop (Hackathon track) at EACL 2021 and employs several of the proposed resources and tools (Pollak et al., 2021).

2 Methodology

For sentiment analysis we used a multilingual news sentiment analysis tool. The tool was trained using a two-step approach, described in Pelicon et al. (2020). For training, a corpus of sentiment-labeled news articles in Slovenian was used (Bučar et al., 2018), with news covering predominantly the financial and political domains. This model was subsequently applied to the LGBTIQ+ corpus, where each news article was labeled with one of the sentiment labels, namely negative, neutral or positive. This allowed us to generate a sentiment distribution of articles for each media source in the corpus.

For word usage viewpoint analysis, we applied a system originally employed for diachronic shift detection (Martinc et al., 2020b). Our word usage detection pipeline follows the procedure proposed in previous work (Martinc et al., 2020a,b; Giulianelli et al., 2020): the created LGBTIQ+ corpus is split into two slices containing news from different news sources according to the procedure described in Section 3. Next, the corpus is lemmatized, using the Stanza library (Qi et al., 2020), and lowercased. For each lemma that appears more than 100 times in each slice and is not considered a stopword, we generate a slice-specific set of contextual embeddings using BERT (Devlin et al., 2019) pretrained on Slovenian, Croatian and English texts (Ulčar and Robnik-Šikonja, 2020). These representations are clustered using k-means, and the derived cluster distributions are compared across slices by employing the Wasserstein distance (Solomon, 2018). The distance is assumed to reflect the relative degree of usage change, so words are ranked according to it.

Once the most changed words are identified, the next step is to understand how their usage differs in the distinct corpus slices. The hypothesis is that specific clusters of BERT embeddings correspond to specific usages of a word. The problem is that these clusters may consist of several hundreds or even thousands of word usages, i.e. sentences, so manual inspection of these usages would be time-consuming. For this reason, we extract the most discriminating unigrams, bigrams, trigrams and fourgrams for each cluster using the following procedure: we compute the term frequency - inverse document frequency (tf-idf) score of each n-gram, and the n-grams appearing in more than 80% of the clusters are excluded to ensure that the selected keywords are the most discriminant. This gives us a ranked list of keywords for each cluster, and the top-ranked keywords (according to tf-idf) are used for the interpretation of the cluster.

3 Experiments

3.1 Dataset

The corpus was collected from the Event Registry (Leban et al., 2014) dataset by searching for Slovenian articles from 2014 to 2020 (inclusive) containing any of the 125 manually defined keywords (83 unigrams and 42 bigrams), or their inflected forms, connected to the subject of LGBTIQ+. The resulting corpus contains news articles on the LGBTIQ+ topic from 23 media sources. The corpus statistics are described in Table 1.

Table 1: Corpus statistics.

Source                                Num. articles   Num. words
MMC RTV Slovenija                     1790            1,555,977
Delo                                  1194            1,064,615
Nova24TV                              844             683,336
Večer                                 667             552,195
24ur.com                              661             313,794
Dnevnik                               592             262,482
Siol.net Novice                       549             460,561
Slovenske novice                      501             236,516
Svet24                                430             286,429
Mladina                               394             275,506
Tednik Demokracija                    361             350,742
Domovina                              327             283,478
Primorske novice                      255             183,624
Druzina.si                            253             149,761
Vestnik                               242             263,737
Časnik.si - Spletni magazin z mero    239             280,339
Žurnal24                              172             79,953
PortalPolitikis                       157             111,683

Out of this corpus, we extracted a subcorpus appropriate for the viewpoint analysis. The subcorpus included the following online news media: Delo, Večer, Dnevnik, Nova24TV, Tednik Demokracija and PortalPolitikis. The sources were divided into two groups. The first group, namely Delo, Večer and Dnevnik, represents the category of daily quality news media that are published online and in print and have a long tradition in the Slovene media landscape. These three media are relatively highly trusted by readers and have the highest readership amongst Slovene dailies. The second group of news media, namely Nova24TV, Tednik Demokracija and PortalPolitikis, [...]

[...] the regional media landscape. Nevertheless, not all conservative media are characterized by more negative reporting about the LGBTIQ+ topic. For example, the source with the second lowest share of negative news is Druzina.si, which is strongly connected to the Roman Catholic Church.

3.3 Viewpoint Analysis

The viewpoint analysis was conducted by finding words whose usage varies the most between the two groups of media sources selected for the analysis (i.e. Delo, Dnevnik, Večer vs. Nova24TV, Tednik Demokracija and PortalPolitikis). The 10 most changed words are presented in Table 2.
