No NLP Task Should Be an Island: Multi-Disciplinarity for Diversity in News Recommender Systems

Myrthe Reuver♥, Antske Fokkens♥♣, Suzan Verberne♦
♥ CLTL, Dept. of Language, Literature & Communication, Vrije Universiteit Amsterdam
♣ Dept. of Mathematics and Computer Science, Eindhoven University of Technology
♦ Leiden Institute of Advanced Computer Science, Leiden University
{myrthe.reuver,antske.fokkens}@vu.nl, [email protected]

Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, pages 45–55, April 19, 2021. © Association for Computational Linguistics

Abstract

Natural Language Processing (NLP) is defined by specific, separate tasks, each with their own literature, benchmark datasets, and definitions. In this position paper, we argue that for a complex problem such as the threat to democracy by non-diverse news recommender systems, it is important to take into account a higher-order, normative goal and its implications. Experts in ethics, political science and media studies have suggested that news recommendation systems could be used to support a deliberative democracy. We reflect on the role of NLP in recommendation systems with this specific goal in mind and show that this theory of democracy helps to identify which NLP tasks and techniques can support this goal, and what work still needs to be done. This leads to recommendations for NLP researchers working on this specific problem as well as researchers working on other complex multidisciplinary problems.

1 Introduction

The field of Natural Language Processing (NLP) uses specific, self-defined definitions for separate tasks – each with their own leaderboards, benchmark datasets, and performance metrics. When dealing with complex societal problems, it may however be better to take into account a broader view, starting from the actual needs to solve the overall societal problem. In particular, this paper addresses the complex issue of non-diverse news recommenders potentially threatening democracy (Helberger, 2019). We focus on a theory of democracy and its role in news recommendation, as described in Helberger (2019), and reflect on which NLP tasks may help address this issue. In doing so, we consider work by experts on the problem and domain, such as political scientists, recommender system experts, philosophers and media and communication experts.

News recommender systems play an increasingly important role in online news consumption (Karimi et al., 2018). Such systems recommend several news articles from a large pool of possible articles whenever the user wishes to read news. Recommender systems usually attempt to make the recommended articles increase the user’s interaction and engagement. In a news recommender system, this typically means optimizing for the individual user’s “clicks” or “reading time” (Zhou et al., 2010). These measures are considered a proxy for reader interest and engagement, but other metrics could also be used, including the time spent on a page or article ratings.

Recommender systems are tailored to individual user interests. For other types of recommender systems, e.g. entertainment systems (recommending music or movies), this is less of a problem. However, news recommendation is connected to society and democracy, because news plays an important role in keeping citizens informed on recent societal issues and debates (Helberger, 2019). Personalization to user interest in the news recommendation domain can lead to a situation where users are increasingly unaware of different ideas or perspectives on current issues. The dangers of such news ‘filter bubbles’ (Pariser, 2011) and online ‘echo chambers’ (Jamieson and Cappella, 2008) due to online (over)personalization have been pointed out before (Bozdag, 2013; Sunstein, 2018).

Political theory provides several models of democracy, each of which implies a different role for news recommendation. We follow the deliberative model of democracy, which states that citizens of a functioning democracy need to get access to different ideas and viewpoints, and engage with these and with each other (Manin, 1987; Helberger, 2019) (a further explanation of this model is given in Section 2). A uniform news diet and personalization to only personal interests can, in theory if not in practice, lead to a narrow view on current issues and a lack of deliberation in democracy. When considering this model, it becomes clear that news personalization on user interest alone is potentially harmful for democracy. The normative goal of a recommender system then becomes: supporting a deliberative democracy by showing a diverse set of views to users. NLP can play a role here, by automatically identifying viewpoints, arguments, or claims in news texts. Output of such trained models can help recommend articles that show a diverse set of views and arguments, and thus support a deliberative democracy.

The explicit goals and underlying values of democracy expressed in the model of deliberative democracy can help in defining what NLP tasks and analyses are relevant for tackling the potential harmful effects of news recommendation. This can increase the societal impact of relevant NLP tasks. We believe considering such theories and normative models can also help work on other complex concepts and societal problems where NLP plays a role. In this paper, we outline societal challenges and a theoretical model of the role of non-diverse news recommenders in democracy, as developed by experts such as political scientists and media experts. We then argue that argument mining, viewpoint detection, and related NLP tasks can make a valuable contribution to the effort in diversifying news recommendation and thereby supporting a deliberative democracy.

This position paper provides the following contributions to the discussion: We argue that taking normative and/or societal goals into account can provide insights into the usefulness of specific NLP tasks for complex societal problems. As such, we believe that approaching such problems from an interdisciplinary point of view can help define NLP tasks better and/or increase their impact. In particular, we outline the normative and societal goals for diversifying news recommendation systems and illustrate how these goals relate to various NLP tasks. This results in a discussion on how, on the one hand, news recommendation can make better use of NLP and, on the other hand, how the goal of diversifying news provides inspiration for improving existing tasks or developing new ones.

This paper is structured as follows: We first describe the problem that personalized news recommendation could pose for democracy, as well as the importance of an interdisciplinary approach to solving this problem in Section 2. Section 3 provides an overview of literature tackling diversity in news recommendation as a solution to this problem, and points out remaining gaps in these efforts, specifically connected to the idea of a deliberative democracy. Section 4 outlines several related NLP tasks and their connection to this overarching normative goal. In Section 5, we discuss what we think the NLP community should take away from this reflection, and in Section 6 we will conclude our paper.

2 Personalization in the News, Theories of Democracy, and Interdisciplinarity

The online news domain has increasingly moved towards personalization (Karimi et al., 2018). In the news domain, such personalization comes with specific issues and challenges. A combination of personalization and (political) news can lead to polarization, Filter Bubbles (Pariser, 2011), and Echo Chambers (Jamieson and Cappella, 2008). This trend to personalize leads to shared internet spaces becoming much more tailored to the individual user rather than being a shared, public space (Papacharissi, 2002). Such phenomena could negatively impact a citizen’s right to information and right not to be discriminated against (Eskens et al., 2017; Wachter, 2020). Evidence for filter bubbles is under discussion (Borgesius et al., 2016; Bruns, 2019), but empirical work does indicate that especially fringe groups holding extreme political or ideological opinions may end up in such a conceptual bubble (Boutyline and Willer, 2017).

Helberger (2019) points out that a lack of diversity in news recommendation can also harm democracy. This clearly holds for the deliberative model of democracy. This model assumes that democracy functions on deliberation, and the exchange of points of view. A fundamental assumption in this model is that individuals need access to diverse and conflicting viewpoints and argumentation to participate in these discussions (Manin, 1987). News recommendations supporting a deliberative democracy should then play a role in providing access to these different viewpoints, ideas, and issues in the news (Helberger, 2019).

The threat to democracy of non-diverse news recommenders is a complex problem. It requires input from different academic disciplines, from media studies and computer science to political science and philosophy (Bernstein et al., 2020). Political theory can provide a framework that helps define what is needed from more empirical and technical researchers to address this problem. In the next section, we will discuss recent work in diversity in news recommendation. We point out remaining gaps in these efforts, specifically connected to the idea of a deliberative democracy.

3 Diversity in News Recommendation

3.1 Recent Diversity Efforts

Previous work on diversity in news recommender systems has mainly focused on assessing the current state of diversity in news recommendation (Möller et al., 2018), or on assessing diversity especially at the end of a computational pipeline, in the form of (evaluation) metrics (Vrijenhoek et al., 2021; Kaminskas and Bridge, 2016), or on computational implementations of diversity (Lu et al., 2020). Less attention has been given to defining and identifying the viewpoints, entities, or perspectives that are being diversified, or to the underlying values and goals of diversification.

Within the recommender systems field, there are several ideas and concepts related to diversity, especially where it concerns evaluation or optimization metrics. Diversity, serendipity, and unexpectedness are all metrics used in the recommender systems literature that go beyond mere click accuracy (Kaminskas and Bridge, 2016). There are two gaps we see in many of these earlier metrics. Firstly, these metrics rarely focus on linguistic or conceptual features or representations of (aspects of) diversity in the news articles. Or, when they do, the NLP approaches are simplified (e.g. topic models in Draws et al. (2020b)) to centralize the recommendation algorithm and its optimization. Secondly, such “beyond user interest” optimization in recommender systems is usually not connected to normative goals and societal gains, but still geared towards user interest and the idea that users react positively to unexpected or previously unseen items. However, several fairly recent works (Lu et al., 2020; Vrijenhoek et al., 2021) have attempted to go beyond “click accuracy” for user interest and tackle the diversity in news recommendation problem while also explicitly considering normative values.

Lu et al. (2020) discuss how to implement “editorial values” in a news recommender for a Dutch online newspaper. Editorial values were defined as journalistic missions or ideals found important by the newspaper’s editors and journalists. One of these values is diversity, but their case study concerns implementing and optimizing for “dynamism” – a diversity-related metric the authors define as “how much a list changes between updates”. The authors note the computational difficulty of measuring and optimizing for diversity, and propose a proxy. They define “intra-list diversity” as the inverse of the similarity of a recommendation set. This similarity is calculated over pre-defined news categories of the articles, such as ‘sports’ and ‘finance’, as well as over different authors. Viewpoints or perspectives are not mentioned. Lu et al. (2020)’s “editorial values” seem to correspond to the public values mentioned in Bernstein et al. (2020), and implicitly also relate to the democratic values described by Helberger (2019). Both mention diversity as a central important aspect, but Lu et al. (2020) still centralize the user’s satisfaction, rather than public values or democracy.

Vrijenhoek et al. (2021) connect several democratic models to computational evaluative metrics of news recommender diversity. The paper discusses several metrics that could be used as optimization and evaluation functions for diversity for news recommender systems supporting a deliberative democracy, such as one to measure and optimize for the “representation” of different societal opinions and voices, and another to measure the “fragmentation”: whether different users receive different news story chains. These evaluation metrics are, to our knowledge, the first to explicitly consider normative values and models of democracy in news recommender system design. However, this work does not discuss how to represent or identify different voices in news articles. The NLP-related components discussed are limited to annotating different named entities.

We argue that the inclusion of more fine-grained and state-of-the-art NLP methods allows more precise identification of different “voices” and viewpoints in support of diverse news recommender systems. The connection of these NLP tasks to diversifying news recommendation is as follows. We compare the building of diverse news recommenders in support of a deliberative democracy to building a tower, with the identification of the different voices or viewpoints as the base of that tower. When an approach can reliably and consistently identify different viewpoints or arguments, we can also diversify these viewpoints in recommendations. A solid definition of viewpoints and reliable methods to detect them thus form the foundation of our diverse news recommendation tower, and build it towards the goal of a functioning deliberative democracy.

3.2 Technical and Conceptual Challenges

The news is a specific domain for recommender systems, with much faster-changing content than for instance movie or e-commerce recommendation. This leads to a number of unique technical challenges.

Two specific technical and conceptual challenges for (diverse) news recommendation have been addressed in previous work. The first is the cold start problem (Zhou et al., 2010), which occurs when a news recommender needs data on articles to decide whether to recommend the article to a (new) user. Recommendation, in news as well as in other domains, often uses the interaction data of similar users to recommend items to new users, such as in the method “collaborative filtering”. Such data is missing on the large volumes of new articles added in the news domain every day, which makes such approaches less useful in this domain. This leads to other recommendation techniques being more common in the news recommendation domain.

The second challenge specific to our problem is the continuous addition of new and many different topics, issues, and entities in public discussion and in the news. This makes detecting viewpoints with one automated, single model and one set of training data difficult. Previous work often explores one well-known publicly debated topic, such as abortion (Draws et al., 2020a) or misinformation related to COVID-19 (Hossain et al., 2020). However, in an ideal solution we would also be able to continuously identify all kinds of new debates and related views.

We believe that a combination of state-of-the-art NLP techniques such as neural language models can help address this problem without resorting to manual or unsupervised techniques. A possible interesting research direction is zero-shot or one-shot learning as in Allaway and McKeown (2020), where a model with the help of large(-scale) language models learns to identify new debates and viewpoints not seen at training time. In our case, this would mean identifying new debates and new viewpoints without explicit training on these when training for our task. We elaborate on potentially useful NLP tasks to focus on for our problem in the following section.

4 Relevant NLP Tasks

Within the NLP, text mining, and recommender systems literature, there are several (related) tasks that deal with identifying viewpoints, perspectives, and arguments in written language. We define a task in NLP as a clearly defined problem such as “stance detection”, with each task having connected methods, benchmark datasets, leaderboards and literature. The literature is currently fragmented across different related tasks and definitions of viewpoint, argument or claim, and perspective. Researchers also use different datasets and content types (tweets and microblogs, internet discussions on websites like debate.org, or news texts).

In this section we discuss NLP tasks that are related to viewpoint and argumentation diversity as defined in relation to the normative goal of a healthy deliberative democracy. Recall that a deliberative model assumes that participants of a democracy need access to a variety of (conflicting) viewpoints and lines of argumentation. As such, we focus on NLP tasks that help identify what claims, stances, and argumentation are present in news articles, and how specific items in the news are presented or framed.

An important distinction that needs to be made is the one between stance and sentiment: a negative sentiment does not necessarily mean a negative stance or viewpoint on an issue, and vice versa. An example would be someone who supports the use of face masks as COVID-19 regulation (positive stance), and expresses negative sentiment towards the topic by criticizing the shortage of face masks available for caregivers. In this paper, we concern ourselves with stance on issues (being in favor of masks) rather than with sentiment expressed about such issues (being negative about their shortage).

The remainder of this section is structured as follows. We first describe work on recommender systems that explicitly refers to detecting viewpoints. We then address three relatively established NLP tasks: argument mining; stance detection; and polarization, frames, and propaganda. We then briefly address work that refers to ‘perspectives’.

4.1 Viewpoint Detection and Diversity

The recommender systems literature specifically uses the term ‘viewpoint’ in relation to diversifying recommendation. In these viewpoint-based papers, we notice a systems-focused tendency. Neither defining a viewpoint nor evaluating the viewpoint detection is a central concern. Instead, researchers centralize viewpoint presentation to users, or how these respond to more diverse news, as in Lu et al. (2020) and Tintarev (2017). As a result, there is no standard definition of ‘viewpoint’ and the concept is operationalized differently by various authors.

Draws et al. (2020a) use topic models to extract and find viewpoints in news texts with an unsupervised method, with the explicit goal to diversify a news recommender. They explicitly connect different sentiments to different viewpoints or perspectives. For this study, they use clearly argumentative text on abortion from a debating website. The words ‘viewpoint’ and ‘perspective’ are used interchangeably in this study.

Carlebach et al. (2020) also address what they call “diverse viewpoint identification”. Here as well, we see a wide range of definitions and terms related to viewpoints and perspectives (e.g. ‘claim’, ‘hypothesis’, ‘entailment’). The authors use state-of-the-art methods including large neural language models, but the study does not seem to consider carefully defining their task, term definitions, and the needs of the problem. As such, it is unclear what they detect exactly. This is mainly due to the detection itself not being the main focus of their paper.

With the more NLP-based tasks and definitions in the following sections, we explore how NLP tasks relate to this ‘viewpoints’ idea from the recommender systems community, and see what ideas and techniques these other tasks can add to diversity in news recommendation.

4.2 Argument Mining

Argument Mining is the automatic extraction and analysis of specific units of argumentative text. It usually involves user-generated texts, such as comments, tweets, or blogposts. Such content is often highly argumentative by design, with high sentiment scores. In some studies, arguments are related to stances, as in the Dagstuhl ArgQuality Corpus (Wachsmuth et al., 2017), where 320 arguments cover 16 (political or societal) topics, and are balanced for different stances on the same topic. These arguments are from websites specifically aimed at debating.

Stab and Gurevych (2017) identify the different sub-tasks in argumentation mining, and use essays as the argumentative texts in question. For instance, one sub-task is separating argumentative from non-argumentative text units. Then, their pipeline involves classifying argument components into claims and premises, and finally it involves identifying argument relations. This first sub-task is also sometimes called claim detection, and is related to detecting stances and viewpoints when connecting claims to issues.

For a deliberative democracy, the work on distinguishing argumentative from non-argumentative text in argument mining is useful, since our goal requires the highlighting of deliberations and arguments, and not statements on facts. Identifying this distinction might enable us to identify viewpoints in news texts. The precise identification of claims and premises may also prove valuable, because supporting a deliberative democracy requires the detection of different deliberations and arguments in news texts.

4.3 Stance Detection

Stance detection is the computational task of detecting “whether the author of the text is in favor of, against, or neutral towards a proposition or target” (Mohammad et al., 2017, p. 1). This task usually involves social media texts and, once again, user-generated content. Commonly, these are short texts such as tweets. For instance, Mohammad et al. (2017) provide a frequently used Twitter dataset that strongly connects stances with sentiment and/or emotional scores of the text. Another common trend in stance detection is to use text explicitly written in the context of an (online) debate, such as the website debate.org and social media discussions.

A recent study on Dutch social media comments highlights the difficulties in annotating stances on vaccination (Bauwelinck and Lefever, 2020). The authors identify the need to annotate topics, but also topic aspects and whether units are expressing an argument or not. Getting to good inter-annotator agreement (IAA) is difficult, showing that these concepts related to debate and stance are not uniform to all annotators even after extensive training. The same is found by Morante et al. (2020): annotating Dutch social media text as well as other debate text on the vaccination debate, they find obtaining a high IAA is no easy task.
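To make the notion of inter-annotator agreement concrete, the following is a minimal sketch of Cohen's kappa, one common chance-corrected agreement statistic for two annotators. The choice of kappa is an assumption made here for illustration; the studies cited above may report other IAA measures. The function name `cohens_kappa` and the six example stance labels are hypothetical and not taken from any of the cited datasets.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items that received identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single, identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical stance annotations for six comments (illustrative only).
annotator_1 = ["favor", "against", "neutral", "favor", "against", "favor"]
annotator_2 = ["favor", "against", "favor", "favor", "neutral", "favor"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy example the annotators agree on four of six items, yet kappa is only 3/7 (about 0.43) because part of that agreement is expected by chance. This illustrates why raw agreement alone can overstate how uniformly annotators understand concepts such as stance.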

49 Other work related to stance detection is more viewpoints must be addressed. One shot learn- related to the news domain. The Fake News Clas- ing may provide means to deal with new topics in sification Task (Hanselowski et al., 2018b) has a the every-changing news landscape. The focus on sub-task that concerns itself with predicting the longer, less explicitly argumentative text is helpful stance of a news article towards the news headline. for our goal, and exists in for instance the first sub- In their setup stances can be ‘Unrelated’, ‘Discuss’, tasks of fake news detection (Hanselowski et al., ‘Agree’ or ‘Disagree’. The Fake News Classifica- 2018a) and other recent news-focused datasets and tion tasks also introduces claim verification as a papers (Conforti et al., 2020; Allaway and McKe- sub-task. This task is also related to the claim de- own, 2020). tection task: in order to verify claims, one needs to detect them first. 4.4 Polarization, Frames, and Propaganda Several papers specifically aim at stance detec- Some work already explicitly takes into account tion in the news domain. Conforti et al.(2020) note the more complex political dimension of news texts that different types of news events, from wars to when defining an NLP task. This work is often economic issues, might lead to stance classes that interdisciplinary in nature, with NLP researchers are not uniform across events. As a response, they working with political scientists or media scholars. decide to annotate stance on one specific type of The idea of (political) perspectives is prominent in news event: company acquisitions. The authors these papers, though researchers in this subfield use explicitly note here that textual entailment and sen- different definitions and names for similar tasks. 
timent analysis are different tasks from stance de- ‘Frames’, ‘propaganda’, and ‘polarization’ are tection, but acknowledge that all these tasks are loaded terms, with less nuance than terms such as related. However, as stated before, in the news do- ‘stance’ and ‘argument’. Terms like ‘polarization’ main new topics or issues occur constantly. Data are (ironically) more polarizing due to their politi- on only one type of news event is less representa- cal connotations. An explicitly political aspect in tive of all texts in the news domain. Some recent the task definition can be useful for our societal work aims to address this through one-shot or zero- problem – as stated, the deliberative democracy shot learning for detecting issues and viewpoints goal is also inherently connected to political de- on issues (Allaway and McKeown, 2020). In such bates. However, it can also lead to a confusion an approach, unseen topics or viewpoints would of terminology or the use of (accidentally) loaded be detected even when they are very different from terminology, for instance terms that are controver- what is annotated or seen at training time. sial in related disciplines such as communication Based on the above, there are three challenges science or media studies. involved in applying previous approaches on stance An example is a recent shared task on Propa- detection for diversifying news: First, most work ganda techniques (Da San Martino et al., 2019). on stance detection aims at short, high-sentiment It distinguishes 18 classes of what the authors user-generated texts with one specific stance. News call ‘rhetorical strategies’ that are not synonymous articles are more complex. News texts might high- with, but related to, propaganda. These include light a debate with several viewpoints of different ‘whataboutism’, ‘bandwagon’, and ‘appeal to fear people, with the emphasis on one rather than the and prejudice’, as well as ‘Hitler-comparisons’. other. 
Secondly, the authors of news articles gen- These terms are, incidentally, also known as cog- erally do not express opinions explicitly, unlike nitive biases (the bandwagon effect) or framing authors of tweets or blogs. News articles can ex- (appeal to fear) and argumentation flaws (Hitler- press viewpoints in more subtle ways, in the way comparisons, on the internet known as Godwin’s a story is told or framed. Additionally, training Law). Such confusion of terminology, especially in data that does come from the news domain may not a politically sensitive context, makes it less straight- generalize well to new topics. forward to see how this task can be used for view- We conclude that stance detection is, in princi- point diversification in support of a deliberative ple, a relevant task when aiming to ensure news democracy. recommendation supports a deliberative democ- Sometimes, the task of identifying different racy, but the challenges generalizing to new topics viewpoints on an issue or event in the news is and dealing with more subtle ways of expressing translated to ‘political bias’. In such work, the

viewpoints are related to a certain ideology or political party (Roy and Goldwasser, 2020) or ‘media frames’. However, we would argue that a viewpoint in the public debate does not have to be a political standpoint related to a specific political ideology. Limiting ourselves to detecting only debates and viewpoints explicitly related to political parties would also limit the view on public debate and deliberative democracy, and thus would not support our normative goal to its full extent.

Other NLP work that addresses the political nature of news texts and perspectives is Fokkens et al. (2018). In this work, stereotypes of Muslims are detected with a self-defined method known as ‘micro-portrait extraction’. This paper is an example of work where experts from other disciplines (communication and media studies) are heavily involved in task definition and execution, which aids clear and careful definitions that do justice to the complex societal issue at hand (stereotypes in the news).

‘Fake news’ related tasks are also connected to the political content of news. The Fake News Classification Task (Hanselowski et al., 2018b) has the explicit goal to identify fake news. It consists of several sub-tasks related to argument mining and stance detection. The debate on (fake) news has recently shifted away from the simple label ‘fake news’, since it is not only the simple distinction between fake and true that is interesting. This again shows the importance of multi-disciplinary work: computational tasks are often aimed at a simple classification such as ‘true’ versus ‘false’, while social scientists and media experts call for different labels not directly related to the truth of an entire article or claim, such as ‘false news’, ‘misleading news’, ‘junk news’ (Burger et al., 2019), or ‘clickbait’. All these are terms for a media diet with lower quality (or with fewer ‘editorial values’, to use the term from Lu et al. (2020)).

It can be useful for a diverse news recommender that supports a deliberative democracy when tasks already incorporate the political dimension of news texts. However, it can also be harmful when the political or social science definitions are not clear and uniform, or when the political dimension actually narrows what a deliberative democracy is by only considering explicitly political viewpoints, or only views tied to political parties or ideologies.

4.5 Perspectives

In NLP, definitions of ‘perspective’ range from ‘a relation between the source of a statement (i.e. the author or another entity introduced in the text) and a target in that statement (i.e. an entity, event, or (micro-)proposition)’ (Van Son et al., 2016) to stances on specific (political) claims in text (Roy and Goldwasser, 2020). These definitions are similar to those seen in the Stance Detection literature. Sometimes, it is unclear what the difference is between a stance and a perspective.

Common debate content used for the analysis and task definition of perspectives includes political elections (Van Son et al., 2016), vaccination (Morante et al., 2020), and societally debated topics like abortion. Perspectives are especially useful for our goal, since they assume different groups in society are seeing one issue from different angles. This allows us to identify an active debate in society, which explicitly supports a deliberative democracy.

5 Discussion

In the previous section, we have outlined a number of relevant NLP tasks, and made explicit their possible contribution to the support of a deliberative democracy through diverse news recommendation. In this section, we discuss the implications and considerations following from these separate tasks for diversity in news recommendations, and provide some advice for NLP researchers.

5.1 Evaluation

There has been a general push in NLP evaluation to go “beyond accuracy” (Ribeiro et al., 2020) and in recommender systems to go “beyond click accuracy” (Lu et al., 2020; Zhou et al., 2010) in evaluation and optimization. We believe that going beyond these evaluations might also mean looking at normative, societal goals and values, and the implications for the task and its effect on these goals and values. A possible advantage of a higher-level evaluation with a normative goal is that it allows the measurement of real-world impact. One explicit problem, however, is how to evaluate whether support of a deliberative democracy has been achieved.

Recent work by Vrijenhoek et al. (2021) has identified evaluation metrics to evaluate whether a recommender system supports specific models of democracy, one of which is the deliberative model. They propose a number of evaluation metrics for recommender system diversity that are explicitly
connected to different models of democracy. These metrics could be used to evaluate different aspects of diversity related to a (deliberative) democracy. The aspects discussed are the representation of different groups in the news, whether alternative voices from minority groups are represented in the recommendations, whether the recommendations activate users to take action, and the degree of fragmentation between different users.

However, Vrijenhoek et al. (2021) do not address the evaluation of the NLP tasks involved. While specific, clearly defined NLP tasks can generally be evaluated through hand-labelled evaluation sets, such sets do not provide the necessary insights to determine their role in supporting a deliberative democracy. In the end, we need to find a way to connect the accuracy of NLP technologies to the overall increased diversity of news offers. Ideally, we would then also measure the ultimate impact on the users of a diverse recommender system that diversifies viewpoints or stances with an NLP method. Such an evaluation is highly complex and clearly requires expertise from various fields (including technology, user studies and methods for investigating social behavior). It could for instance involve longitudinal studies on user knowledge of issues and viewpoints.

5.2 No NLP Task is An Island

We argue that NLP tasks have a clear role in the development of diverse recommender systems. Recent developments in the field in particular, such as the use of pre-trained language models and neural models, could be used to obtain reliable and useful representations of issues in the news, as well as viewpoints and perspectives on these issues. Such approaches are possibly more fine-grained and can be more reliable than the now commonly used unsupervised methods such as topic models.

Benchmarking with separate datasets, definitions, and shared tasks and challenges has brought our field far, and much progress has been achieved in this manner. However, we feel that work on complex societal issues should be aimed at achieving a societal goal rather than only evaluated on a task-specific benchmark dataset. When considering issues such as diversity in news recommendation and its effects on democracy and public debate, we are at the limit of what separate NLP tasks could bring us. We should dare to look past the limits of separate tasks, and attempt to oversee the over-arching normative goals and tasks related to such problems, especially when working on real-world impact.

As discussed in Section 4, the NLP field has many related tasks that seem to be relevant to the problem of news recommender diversity and especially the support of a deliberative democracy. However, we note that NLP tends to use its own definitions, and does not consider other fields or even sub-fields, when designing these tasks. This means the field covers a wide array of different implementations and definitions related to perspectives and viewpoints in the news. We therefore urge NLP researchers to not only consider and evaluate their systems on their own definitions and tasks, but also consider the wider societal and normative goals their task connects to, and what other related tasks could be used to achieve the same or similar goals.

5.3 NLP and Other Disciplines

NLP, especially NLP working on societal real-world problems, should involve other fields, and expertise in other fields. This is especially true when working on complex problems like viewpoint diversity in news recommendation. This recommendation has also been made at the Dagstuhl perspectives workshop “Diversity, fairness, and data-driven personalization in (news) recommender systems” (Bernstein et al., 2020), but we would like to emphasize it more specifically for the NLP field.

One example where a lack of interdisciplinarity sometimes seems to lead to issues for our problem is the Polarization, Frames, and Propaganda set of NLP tasks outlined in Section 4.4. Definitions of ‘frame’, ‘propaganda’, and ‘polarization’ are sometimes seemingly made without consulting relevant experts, or without considering earlier theoretical work defining these terms. This leads to definitions that are easy to measure computationally with existing NLP techniques, such as classification. However, these definitions do not necessarily do justice to the complex problem the model or task is aimed at. Such work also does not consult earlier theoretical and empirical considerations of these terms and definitions.

We argue for the inclusion of experts from the social sciences and humanities in every step of the process: designing the tasks and definitions, evaluating task success and usefulness, and tying the result to broader implications. For diversity in news recommenders, this means discussing and engaging with experts on political theory and philosophy,
ethics of technology, and media studies and communication science (Bernstein et al., 2020).

5.4 Ethical and Normative Considerations

When our goal is to foster a healthy democratic debate, we should consider whether we should highlight or recommend content with fringe opinions that might be dangerous to individuals or the debate itself, e.g. the anti-vaxxing argument in the vaccination debate, conspiracy theories on the state of democracy, or inherently violent arguments. The deliberative model of democracy values rational and calm debate, not emotional or affective language. While this is a question of whether to recommend such views, not whether to detect them, we find it important to stress such considerations here. In a complex problem with a high-level normative goal, it is important to make such considerations explicit, as these also influence whether we are actually fostering a healthy deliberative debate. This means a simple computational solution, e.g. maximizing the diversity of viewpoints and debates, might not always be the best manner to reach the normative goal (e.g. fostering a healthy deliberative democracy).

Such nuanced and complex issues come to light when we consider public values such as diversity and the normative goal of a deliberative democracy. They are less explicit when only considering the NLP task as a separate task, which only needs to be evaluated by its performance on a benchmark dataset. However, questions such as these are especially important when considering that NLP and its technology are contributing to the solution of a societal problem. The attention to an over-arching normative goal helps NLP researchers to consider their responsibility and the implications of their work when it is used in real-world settings. This has been argued before by researchers in the NLP community (Fokkens et al., 2014; Bender et al., 2021), and we think it is a positive development when NLP researchers consider the wider ethical and normative considerations of their tasks and goals.

6 Conclusion

In this paper, we have provided an overview of several separate NLP tasks related to news recommender system diversity, especially considering the normative goal of a deliberative democracy. An explicit incorporation of such over-arching normative goals is currently missing in these tasks, while this is conceptually very useful and societally relevant. As such, taking this end goal into account can help improve the social relevance of NLP and support NLP researchers in defining specific goals and next steps in their research.

Research on recommendation systems could benefit from more specific work that operationalizes the theoretical concepts in democratic theory. Such operationalizations should start with the groundwork laid by NLP tasks such as stance detection, argumentation mining and tasks aiming at detecting frames, propaganda and polarization. However, current NLP tasks do not yet address problems related to viewpoint diversity in news recommendation in their full complexity. NLP should take the complexities of news and the news recommendation domain into account. News texts often contain more than one stance or argument, and they tend to have more implicitly expressed viewpoints than other texts. Moreover, news comes with the challenge that new topics constantly appear, and training data for detecting viewpoints on some issues may not generalize well to new data on other topics or issues.

This leads us to the following two concrete steps for future work, specifically in NLP: (1) researchers should further advance methods that aim to identify more subtle ways in which viewpoints occur in real-world news text; (2) methods should address the issue of constant changes in data, with one possible solution being one-shot learning. Last but not least, in order to find out how these tasks can truly be used to improve a deliberative democracy, we face the challenge of evaluating beyond assigning correct labels to pieces of text. This brings us back to the main message of this paper: answering this question goes beyond the expertise of NLP researchers. In order to maximize the impact of our technologies for addressing this complex problem, we need expertise from other disciplines.

Acknowledgments

This research is funded through Open Competition Digitalization Humanities and Social Science grant nr 406.D1.19.073 awarded by the Netherlands Organization of Scientific Research (NWO). We would like to thank our interdisciplinary team members, and the anonymous reviewers whose comments helped improve the paper. All opinions and remaining errors are our own.

References

Emily Allaway and Kathleen McKeown. 2020. Zero-shot stance detection: A dataset and model using generalized topic representations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8913–8931.

Nina Bauwelinck and Els Lefever. 2020. Annotating topics, stance, argumentativeness and claims in dutch social media comments: A pilot study. In Proceedings of the 7th Workshop on Argument Mining, pages 8–18.

Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big. Proceedings of FAccT.

Abraham Bernstein, Claes De Vreese, Natali Helberger, Wolfgang Schulz, and Katharina A Zweig. 2020. Diversity, fairness, and data-driven personalization in (news) recommender system (dagstuhl perspectives workshop 19482).

Frederik J Zuiderveen Borgesius, Damian Trilling, Judith Moller, Balázs Bodó, Claes H De Vreese, and Natali Helberger. 2016. Should we worry about filter bubbles? Internet Policy Review, 5(1).

Andrei Boutyline and Robb Willer. 2017. The social structure of political echo chambers: Variation in ideological homophily in online networks. Political psychology, 38(3):551–569.

Engin Bozdag. 2013. Bias in algorithmic filtering and personalization. Ethics and information technology, 15(3):209–227.

Axel Bruns. 2019. Are filter bubbles real? John Wiley & Sons.

Peter Burger, Soeradj Kanhai, Alexander Pleijter, and Suzan Verberne. 2019. The reach of commercially motivated junk news on facebook. PloS one, 14(8):e0220446.

Mark Carlebach, Ria Cheruvu, Brandon Walker, Cesar Ilharco Magalhaes, and Sylvain Jaume. 2020. News aggregation with diverse viewpoint identification using neural embeddings and semantic understanding models. In Proceedings of the 7th Workshop on Argument Mining, pages 59–66.

Costanza Conforti, Jakob Berndt, Mohammad Taher Pilehvar, Chryssi Giannitsarou, Flavio Toxvaerd, and Nigel Collier. 2020. Stander: An expert-annotated dataset for news stance detection and evidence retrieval. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 4086–4101.

Giovanni Da San Martino, Alberto Barrón-Cedeño, and Preslav Nakov. 2019. Findings of the NLP4IF-2019 shared task on fine-grained propaganda detection. In Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda, pages 162–170, Hong Kong. Association for Computational Linguistics.

Tim Draws, Jody Liu, and Nava Tintarev. 2020a. Helping users discover perspectives: Enhancing opinion mining with joint topic models. In Proceedings of SENTIRE’20.

Tim Draws, Nava Tintarev, Ujwal Gadiraju, Alessandro Bozzon, and Benjamin Timmermans. 2020b. Assessing viewpoint diversity in search results using ranking fairness metrics. In Informal Proceedings of the Bias and Fairness in AI Workshop at ECML-PKDD (BIAS 2020).

Sarah Eskens, Natali Helberger, and Judith Moeller. 2017. Challenged by news personalisation: five perspectives on the right to receive information. Journal of Media Law, 9(2):259–284.

Antske Fokkens, Serge ter Braake, Niels Ockeloen, Piek Vossen, Susan Legêne, and Guus Schreiber. 2014. Biographynet: Methodological issues when nlp supports historical research. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 3728–3735.

Antske Fokkens, Nel Ruigrok, Camiel Beukeboom, Sarah Gagestein, and Wouter van Atteveldt. 2018. Studying muslim stereotyping through microportrait extraction. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Andreas Hanselowski, PVS Avinesh, Benjamin Schiller, Felix Caspelherr, Debanjan Chaudhuri, Christian M Meyer, and Iryna Gurevych. 2018a. A retrospective analysis of the fake news challenge stance-detection task. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1859–1874.

Andreas Hanselowski, Avinesh PVS, Benjamin Schiller, Felix Caspelherr, Debanjan Chaudhuri, Christian M. Meyer, and Iryna Gurevych. 2018b. A retrospective analysis of the fake news challenge stance-detection task. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1859–1874, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Natali Helberger. 2019. On the democratic role of news recommenders. Digital Journalism, 7(8):993–1012.

Tamanna Hossain, Robert L Logan IV, Arjuna Ugarte, Yoshitomo Matsubara, Sean Young, and Sameer Singh. 2020. Covidlies: Detecting covid-19 misinformation on social media. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020.

Kathleen Hall Jamieson and Joseph N Cappella. 2008. Echo chamber: Rush Limbaugh and the conservative media establishment. Oxford University Press.

Marius Kaminskas and Derek Bridge. 2016. Diversity, serendipity, novelty, and coverage: a survey and empirical analysis of beyond-accuracy objectives in recommender systems. ACM Transactions on Interactive Intelligent Systems (TiiS), 7(1):1–42.

Mozhgan Karimi, Dietmar Jannach, and Michael Jugovac. 2018. News recommender systems–survey and roads ahead. Information Processing & Management, 54(6):1203–1227.

Feng Lu, Anca Dumitrache, and David Graus. 2020. Beyond optimizing for clicks: Incorporating editorial values in news recommendation. In Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization, pages 145–153.

Bernard Manin. 1987. On legitimacy and political deliberation. Political theory, 15(3):338–368.

Saif M Mohammad, Parinaz Sobhani, and Svetlana Kiritchenko. 2017. Stance and sentiment in tweets. ACM Transactions on Internet Technology (TOIT), 17(3):1–23.

Judith Möller, Damian Trilling, Natali Helberger, and Bram van Es. 2018. Do not blame it on the algorithm: an empirical assessment of multiple recommender systems and their impact on content diversity. Information, Communication & Society, 21(7):959–977.

Roser Morante, Chantal Van Son, Isa Maks, and Piek Vossen. 2020. Annotating perspectives on vaccination. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 4964–4973.

Zizi Papacharissi. 2002. The virtual sphere: The internet as a public sphere. New media & society, 4(1):9–27.

Eli Pariser. 2011. The filter bubble: What the Internet is hiding from you. Penguin UK.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics.

Shamik Roy and Dan Goldwasser. 2020. Weakly supervised learning of nuanced frames for analyzing polarization in news media. EMNLP Findings.

Christian Stab and Iryna Gurevych. 2017. Parsing argumentation structures in persuasive essays. Computational Linguistics, 43(3):619–659.

Cass R Sunstein. 2018. #Republic: Divided democracy in the age of social media. Princeton University Press.

Nava Tintarev. 2017. Presenting diversity aware recommendations: Making challenging news acceptable. In The FATREC Workshop on Responsible Recommendation.

Chantal Van Son, Tommaso Caselli, Antske Fokkens, Isa Maks, Roser Morante, Lora Aroyo, and Piek Vossen. 2016. GRaSP: A multilayered annotation scheme for perspectives. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 1177–1184.

Sanne Vrijenhoek, Mesut Kaya, Nadia Metoui, Judith Möller, Daan Odijk, and Natali Helberger. 2021. Recommenders with a mission: assessing diversity in news recommendations. In SIGIR Conference on Human Information Interaction and Retrieval (CHIIR) Proceedings.

Henning Wachsmuth, Nona Naderi, Yufang Hou, Yonatan Bilu, Vinodkumar Prabhakaran, Tim Alberdingk Thijm, Graeme Hirst, and Benno Stein. 2017. Computational argumentation quality assessment in natural language. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 176–187.

Sandra Wachter. 2020. Affinity profiling and discrimination by association in online behavioural advertising. Berkeley Technology Law Journal, 35(2).

Tao Zhou, Zoltán Kuscsik, Jian-Guo Liu, Matúš Medo, Joseph Rushton Wakeling, and Yi-Cheng Zhang. 2010. Solving the apparent diversity-accuracy dilemma of recommender systems. Proceedings of the National Academy of Sciences, 107(10):4511–4515.