Arxiv:1908.02322V1
Total Page:16
File Type:pdf, Size:1020Kb
DpgMedia2019: A Dutch News Dataset for Partisanship Detection Chia-Lun Yeh,1;2 Babak Loni,∗3 Marielle¨ Hendriks,1 Henrike Reinhardt,1 and Anne Schuth1 1DPG Media, The Netherlands 2TU Delft, The Netherlands 3TezolMarket, Iran [email protected], [email protected] fmarielle.hendriks, henrike.reinhardt, [email protected] Abstract contains a large number of articles that were la- beled using the partisanship of publishers. The We present a new Dutch news dataset with la- second part contains a few hundreds of articles that beled partisanship. The dataset contains more were annotated by readers who were asked to read than 100K articles that are labeled on the pub- lisher level and 776 articles that were crowd- each article and answer survey questions. In the sourced using an internal survey platform and following sections, we describe the collection and labeled on the article level. In this paper, we annotation of both parts of the dataset. document our original motivation, the collec- tion and annotation process, limitations, and 2 Dataset description applications. DpgMedia20191 is a Dutch dataset that was col- 2 1 Introduction lected from the publications within DPG Media. We took 11 publishers in the Netherlands for In a survey across 38 countries, the Pew Research the dataset. These publishers include 4 national Center reported that the global public opposed publishers, Algemeen Dagblad (AD), de Volk- partisanship in news media (Center, 2018b). It is, skrant (VK), Trouw, and Het Parool, and 7 re- however, challenging to assess the partisanship gional publishers, de Gelderlander, Tubantia, Bra- of news articles on a large scale. We thus made bants Dagblad, Eindhovens Dagblad, BN/De Stem an effort to create a dataset of articles annotated PZC, and de Stentor. The regional publishers are with political partisanship so that content analysis collectively called Algemeen Dagblad Regionaal systems can benefit from it. (ADR). A summary of the dataset is shown in Ta- ble1. To construct a dataset of news articles labeled with partisanship, it is required that some annota- 2.1 Publisher-level data tors read each article and decide whether it is parti- We used an internal database that stores all articles san. This is an expensive annotation process. An- written by journalists and ready to be published to other way to derive a label for an article is by using collect the articles. From the database, we queried the partisanship of the publisher of the article. Pre- all articles that were published between 2017 and vious work has used this method (Potthast et al., 2019. We filtered articles to be non-advertisement. 2018; Kulkarni et al., 2018; Kiesel et al., 2019). We also filtered on the main sections so that the arXiv:1908.02322v1 [cs.CL] 6 Aug 2019 This labeling paradigm is premised on that parti- articles were not published under the sports and san publishers publish more partisan articles and entertainment sections, which we assumed to be non-partisan publishers publish more non-partisan less political. After collecting, we found that a lot articles. Although there would be non-partisan ar- of the articles were published by several publish- ticles published by partisan publishers (and vice ers, especially a large overlap existed between AD versa), and thus labeled wrongly, the assumption and ADR. To deal with the problem without los- ensures more information than noise. Once the ing many articles, we decided that articles that ap- partisanship of a publisher is known, the labels of peared in both AD and its regional publications be- all its articles are known, which is fast and cheap. longed to AD. Therefore, articles were processed We created a dataset of two parts. The first part 1The dataset is released at https://github.com/ This∗ research was carried out while this author was dpgmedia/partisan-news2019 working at DPG Media. 2https://www.persgroep.nl/ Label Publisher-level Article-level Number of articles 100K 766 Percentage of partisan articles 50% 26% Number of publishers 11 11 Annotation method audience-based crowdsource Inter-rater agreement X Krippendorf’s alpha: 0.18 Table 1: Summary of the two parts of dpgMedia2019 dateset. in the following steps: Publisher #annotators Score 1. Remove any article that was published by vk 780 -0.6372 more than one national publisher (VK, AD, trouw 491 -0.5438 Trouw, and Het Parool). This gave us a list of parool 410 -0.4341 unique articles from the largest 4 publishers. degelderlander 208 -0.2452 tubantia 236 -0.1864 2. Remove any article from ADR that over- drabantsdagblad 188 -0.1862 lapped with the articles from national pub- eindhovensdagblad 194 -0.0722 lishers. destem 151 -0.0464 3. Remove any article that was published by pzc 202 -0.0248 more than one regional publisher (ADR). ad 695 0.0302 destentor 55 0.0727 The process assured that most of the articles are unique to one publisher. The only exceptions were total: 3,610 mean: -0.2067 the AD articles, of which some were also pub- Table 2: Publisher, number of people of whom the po- lished by ADR. This is not ideal but acceptable litical leaning we know, and the computed partisanship as we show in the section 2.1.1 that AD and ADR score. publishers would have the same partisanship la- bels. In the end, we have 103,812 articles. media report from the Pew Research Center in 2.1.1 Annotation of publisher partisanship 2018 (Center, 2018a), which found that VK is left- To our knowledge, there is no comprehensive re- leaning and partisan while AD is less partisan. search about the partisanship of Dutch publishers. Table3 shows the final publisher-level dataset We thus adopted the audience-based method to de- of dpgMedia2019, with the number of articles and cide the partisanship of publishers. Within the sur- class distribution. vey that will be explained in section 2.2, we asked the annotators to rate their political leanings. The 2.2 Article-level data question asked an annotator to report his or her po- To collect article-level labels, we utilized a litical standpoints to be extreme-left, left, neutral, platform in the company that has been used by right, or extreme-right. We mapped extreme-left the market research team to collect surveys from to -2, left to -1, center to 0, right to 1, extremely- the subscribers of different news publishers. right to 2, and assigned the value to each anno- The survey works as follows: The user is first tator. Since each annotator is subscribed to one presented with a set of selected pages (usually 4 of the publishers in our survey, we calculated the pages and around 20 articles) from the print paper partisanship score of a publisher by averaging the the day before. The user can select an article scores of all annotators that subscribed to the pub- each time that he or she has read, and answer lisher. The final score of the 11 publishers are some questions about it. We added 3 questions listed in Table2, sorted from the most left-leaning to the existing survey that asked the level of to the most right-leaning. partisanship, the polarity of partisanship, and We decided to treat VK, Trouw, and Het Parool which pro- or anti- entities the article presents. as partisan publishers and the rest non-partisan. We also asked the political standpoint of the user. This result largely accords with that from the news The complete survey can be found in Appendices. Partisanship Partisan Non-partisan Publisher de Volkskrant Trouw Het Parool AD ADR Article Num. 11,761 21,614 19,498 40,029 10,910 Total 52,873 (50.9%) 50,939 (49.1%) Table 3: Number of articles per publisher and class distribution of publisher-level part of dpgMedia2019. right and conservative ones. This shows that these The reason for using this platform was two-fold. left-leaning annotators were able to identify their First, the platform provided us with annotators partisanship and rate the articles accordingly. with a higher probability to be competent with We suspected that the annotators would induce the task. Since the survey was distributed to bias in ratings based on their political leaning and subscribers that pay for reading news, it’s more we might want to normalize it. To check whether likely that they regularly read newspapers and this was the case, we grouped annotators based on are more familiar with the political issues and their political leaning and calculate the percentage parties in the Netherlands. On the other hand, of each option being annotated. In Figure1, we if we use crowdsourcing platforms, we need to grouped options and color-coded political leanings design process to select suitable annotators, for to compare whether there are differences in the an- example by nationality or anchor questions to notation between the groups. We observe that the test the annotator’s ability. Second, the platform ”extreme-right” group used less ”somewhat parti- gave us more confidence that an annotator had san”, ”partisan”, and ”extremely-partisan” annota- read the article before answering questions. Since tions. This might mean that articles that were con- the annotators could choose which articles to sidered partisan by other groups were considered annotate, it is more likely that they would rate an ”non-partisan” or ”impossible to decide” by this article that they had read and had some opinions group. We didn’t observe a significant difference about. between the groups. Figure2 shows the same for the second question. Interestingly, the ”extreme- The annotation task ran for around two months right” group gave a lot more ”right” and slightly in February to April 2019.