Exploratory Analysis of News Sentiment Using Subgroup Discovery

Anita Valmarska
Jožef Stefan Institute, Jamova cesta 39, and University of Ljubljana, Faculty of Computer and Information Science, Večna pot 113, Ljubljana, Slovenia
[email protected]

Luis Adrián Cabrera-Diego and Elvys Linhares Pontes
L3i laboratory, University of La Rochelle, La Rochelle, France
luis.cabrera diego, elvys.linhares pontes [email protected]

Senja Pollak
Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia
[email protected]

Proceedings of the 8th BSNLP Workshop on Balto-Slavic Natural Language Processing, pages 66–72, April 20, 2021. ©2021 Association for Computational Linguistics

Abstract

In this study, we present an exploratory analysis of a Slovenian news corpus, in which we investigate the association between named entities and sentiment in the news. We propose a methodology that combines Named Entity Recognition and Subgroup Discovery, a descriptive rule learning technique for identifying groups of examples that share the same class label (sentiment) and pattern (features, here Named Entities). The approach is used to induce positive and negative sentiment class rules that reveal interesting patterns related to different Slovenian and international politicians, organizations, and locations.

1 Introduction

Traditionally, sentiment analysis refers to the use of natural language processing to systematically identify, extract, quantify, and study affective states and subjective information. Most frequently, it is used as a predictive technique for modelling social media (Beigi et al., 2016), more specifically to predict or summarize opinions, attitudes and emotions in tweets, comments, online reviews etc., where the main focus is on predicting attitudes expressed towards a specific entity (Mejova, 2009). Another line of research applies sentiment analysis to news text, where the focus has shifted from analyzing sentiment towards a specific target to analyzing the intrinsic mood of the text itself (Pelicon et al., 2020). Authors have aimed to model the feelings (positive, negative, or neutral) that readers experience while reading a certain piece of news (Bučar et al., 2018; Liu, 2012; Pelicon et al., 2020), also in relation to news covering Covid-19 (Aslam et al., 2020), or have modelled news sentiment in relation to the stock market and economic conditions (Van de Kauter et al., 2015; Bowden et al., 2019; Rambaccussing and Kwiatkowski, 2020). Sentiment analysis has also been used in fake news identification (Bhutani et al., 2019) and in media bias analysis (El Ali et al., 2018).

In the current trend of natural language processing research (Rogers and Augenstein, 2020), the main focus is on improving predictive performance over the state of the art, especially using deep learning-based methods. The drawback of these models is their very limited interpretability. In contrast, several data and text mining techniques have been developed to improve domain understanding and support exploratory analysis of data, with a focus on explainable models, which is crucial e.g. in medical applications, but also interesting for interdisciplinary research in the field of digital humanities and digital social sciences. Our research falls under this line of research.

The aim of our study is to gain a better understanding of news sentiment by analysing named entities in a manually annotated corpus of Slovenian news articles (Bučar, 2017). More specifically, our aim is to identify groups of topics with negative or positive sentiment in Slovenian news, where topics are identified by named entities and their interaction forms the context of the reported stories. We propose the employment of subgroup discovery, a descriptive rule learning technique for the identification of groups of examples sharing the same class label (sentiment) and the same pattern (features). The task of subgroup discovery is a combination of predictive and descriptive rule induction. The result of subgroup discovery is a set of understandable descriptions of subgroups of individuals which share a common target property of interest. Subgroup discovery methods have traditionally been successfully applied in different medical applications (e.g. detecting groups of patients at risk for atherosclerotic cardiovascular disease (Gamberger and Lavrač, 2002), supporting factors for brain ischemia (Gamberger and Lavrač, 2007), and psychiatric emergency (Carmona et al., 2011)), but have only rarely been applied to model textual data. The closest to our study is the work by Vavpetič et al. (2013), who used the subgroup discovery system Hedwig for analyzing news articles about Portugal, focusing on interesting vocabulary patterns that reflect credit default swap. The authors focused on financial entities, geographical entities and a specialized vocabulary of the European sovereign debt crisis.

The main contributions of this paper are twofold. First, we propose a novel approach using named entity recognition and linking in a subgroup discovery setting. Next, we apply the method to the Slovenian news dataset, gaining new insights into Slovenian news reporting in terms of news sentiment, and showcase the potential of our approach for digital social science research.

The paper is structured as follows: in Section 2, we present the data used in the experimental work. Section 3 presents a short outline of the employed methodology. In Section 4 and Section 5, we present our results and offer our conclusions and ideas for further work.

2 Data

In our experiments, we used the manually sentiment annotated Slovenian news corpus SentiNews 1.0 (Bučar et al., 2018; Bučar, 2017). The data is available at https://www.clarin.si/repository/xmlui/handle/11356/1110. The corpus consists of Slovene web-crawled news containing more than 250,000 documents with political, business, economic and financial content from five Slovene media resources on the web. The data covers the period from 1 September 2007 to 31 December 2013. The data used in the experiments is a manually sentiment annotated stratified random sample of 10,427 documents from the news portals 24ur, Dnevnik, Finance, Rtvslo, and Žurnal24. The data was independently annotated by 2-6 annotators, using the five-level Likert scale (1 – very negative, 2 – negative, 3 – neutral, 4 – positive, and 5 – very positive) on three levels of granularity, i.e. on document, paragraph, and sentence level. The sentiment of an instance is defined as the average of the sentiment scores given by the different annotators, where an instance labeled as negative has received an average score less than or equal to 2.4 and an instance labeled as positive has received an average score greater than or equal to 3.6. Instances with an average score in between were labeled as neutral. The analysis of the agreement between annotators is available in Bučar et al. (2018). The annotator agreement on document level, as measured by Cronbach's alpha, is 0.903.

In this paper we are interested only in documents with either positive or negative sentiment, which corresponds to 1665 positive and 3337 negative articles, respectively. Note that the dataset is thus imbalanced towards the negative class, which also matches the observations of media researchers that attention to negative news is disproportionate (e.g. Van der Meer et al., 2019; Soroka et al., 2019).

3 Methodology

The methodology for named entity-based sentiment subgroup discovery consists of three steps.

3.1 Named entity recognition and linking

For each document from the corpus, we perform named entity recognition (NER) and named entity linking (NEL) using the approaches described in Boros et al. (2020) and Linhares Pontes et al. (2020), respectively (code available at https://github.com/EMBEDDIA/stacked-ner and https://github.com/EMBEDDIA/multilingual_entity_linking). Specifically, for the NER system we fine-tuned CroSloEngual BERT (Ulčar and Robnik-Šikonja, 2020) with two stacked Transformer blocks on top. For the NEL system we used an architecture based on Multilingual End-to-End Entity Linking with match correction and candidate filtering. Both systems, NER and NEL, were trained using the Slovene WikiANN dataset (Pan et al., 2017). The dataset was split into three partitions: train, development and test. The evaluation on the test partition showed that the NER system has a micro F-score of 0.954, while the NEL system has an F-score of 0.705.

In SentiNews 1.0, we identified 914 person names, 699 organizations and 476 locations with assigned NEL identifiers. We used the NEL codes to extract the nominative case of the named entities from the Slovenian Wikipedia.

3.2 Data transformation

As the state-of-the-art algorithms for subgroup discovery work on structured data, the second step of the methodology is to transform the discovered (and linked) named entities from step 1 into a tabular form suitable for subgroup discovery. The resulting tables were constructed by representing each non-neutral sentiment document from the corpus as a row in a table. The documents are described by the values of the identified entities: yes if the respective entity was identified in the document and no if the entity was not present in the document. The document's sentiment represents the class label.

The result of data transformation is a table with 2645 rows (i.e. documents with a positive or negative class label and at least one identified linked named entity) and 2089 columns (i.e. named entities corresponding to person, organisation and location names) as attributes.

Persons               Organizations              Locations
Borut Pahor           Ljubljanska borza          Združeno kraljestvo
Janez Janša           Evropska komisija          Luka Koper
Dow Jones             Evropska unija             New York
Danilo Türk           Telekom Slovenije          Nova Gorica
Igor Bavčar           Košarkarski klub Zlatorog  Murska Sobota
Karl Erjavec          Radiotelvizija Slovenija   Ljubljanska borza
Gregor Virant         Newyorška borza            Slovenj Gradec
Alenka Bratušek       Luka Koper                 Mestna občina Ljubljana MOL
Katarina Kresal       Adria Airways              Slovenija
Nogometni klub Koper  Dow Jones                  Črna gora
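The labelling rule from Section 2 and the transformation from Section 3.2 can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the dictionary keys `scores` and `entities`, and the toy documents are hypothetical, and the sketch assumes that both thresholds are inclusive (average score at most 2.4 for negative, at least 3.6 for positive) and that documents without any linked entity are dropped.

```python
from statistics import mean


def label_sentiment(scores):
    """Map a document's annotator scores (1-5) to a class label via their average."""
    avg = mean(scores)
    if avg <= 2.4:
        return "negative"
    if avg >= 3.6:  # assumed inclusive threshold
        return "positive"
    return "neutral"


def build_table(documents):
    """Turn documents into a yes/no entity table with a sentiment class label.

    `documents` is a list of dicts with the hypothetical keys
    'scores' (annotator scores) and 'entities' (linked named entities).
    """
    # Keep only non-neutral documents that contain at least one linked entity.
    kept = []
    for doc in documents:
        label = label_sentiment(doc["scores"])
        if label != "neutral" and doc["entities"]:
            kept.append((set(doc["entities"]), label))

    # One column per named entity observed in the kept documents.
    columns = sorted({entity for entities, _ in kept for entity in entities})

    rows = []
    for entities, label in kept:
        row = {c: ("yes" if c in entities else "no") for c in columns}
        row["sentiment"] = label
        rows.append(row)
    return columns, rows


# Toy input, invented for illustration only.
docs = [
    {"scores": [1, 2, 2], "entities": ["Borut Pahor", "Evropska unija"]},
    {"scores": [4, 5], "entities": ["Luka Koper"]},
    {"scores": [3, 3, 4], "entities": ["Slovenija"]},  # neutral, dropped
    {"scores": [2, 2], "entities": []},                # no linked entity, dropped
]
columns, rows = build_table(docs)
```

Each resulting row is a document, each column a named entity with the value yes or no, and the extra `sentiment` field holds the class label that a subgroup discovery algorithm would use as its target.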
