NewsEdits: A Dataset of Revision Histories for News Articles (Technical Report: Data Processing)

Alexander Spangher and Jonathan May
Information Sciences Institute, University of Southern California
[email protected], [email protected]

arXiv:2104.09647v1 [cs.CL] 19 Apr 2021

Abstract

News article revision histories have the potential to give us novel insights across varied fields of linguistics and the social sciences. In this work, we present, to our knowledge, the first publicly available dataset of news article revision histories, or NewsEdits.

Our dataset is multilingual; it contains 1,278,804 articles with 4,609,430 versions from over 22 English- and French-language newspaper sources based in three countries. Across version pairs, we count 10.9 million added sentences, 8.9 million changed sentences, and 6.8 million removed sentences. Within the changed sentences, we derive 72 million atomic edits. NewsEdits is, to our knowledge, the largest corpus of revision histories in any domain.

1 Introduction

Revision histories gathered from natural language domains like Wikipedia (Grundkiewicz and Junczys-Dowmunt, 2014), WikiHow (Faruqui et al., 2018), and student learner essays (Zhang and Litman, 2015) have been studied widely in NLP. These corpora have been used for tasks as varied as language modeling (Yin et al., 2018), sentence modification (Shah et al., 2020), argumentation design (Afrin et al., 2020), and even group-collaboration dynamics (Murić et al., 2019).

By utilizing a novel domain for revision histories, namely news article revision histories, we see the potential for strengthening methods for these aforementioned tasks, as well as for introducing a novel set of tasks. Because news covers events (or world-states) that are constantly updating, we hypothesize that many edits in news either (1) incorporate new information, (2) update events, or (3) broaden perspectives.1 Each of these categories of edits poses new questions that news edits are uniquely positioned to answer: What information is likely to change in the current state of a document? What must be incorporated? What perspectives need to be layered in, and when?

We offer, to our knowledge, the first publicly available corpus of news article revision histories, called NewsEdits. We compile our corpus from various subcorpora collected by internet activists who track news article version changes. Each subcorpus is collected by monitoring article URLs and downloading article text when a new version of the same news article is published.2

Our dataset consists of 1,278,804 articles and 4,609,430 versions from over 22 English- and French-language newspaper outlets. These outlets are based in the U.S., Great Britain, and Canada. As in Tamori et al. (2017), we compare article versions to each other on the sentence level and then, for matched sentences, on the word level. We count 10,929,051 added sentences, 8,908,254 changed sentences, and 6,807,624 removed sentences.

Our contributions are the following:

1. We introduce NewsEdits, the first publicly available academic corpus of news article version histories, as well as, to our knowledge, the largest version-history corpus.

2. We process the dataset to identify various forms of structural edits, in order to facilitate a wide range of possible analyses. We provide simple, lightweight visualization tools to render article comparisons in an intuitive format.

In the remainder of the paper, we start by comparing this work to other edit corpora. We then discuss the dataset curation and processing. In upcoming work, we will present a schema for characterizing edits in news articles, developed with journalists and graduate students in the communications field.

1 This is in contrast with the distribution of edits in other domains like Wikipedia (Yang et al., 2017), which tend to focus on counter-vandalism and syntax edits.
2 As such, our dataset only contains differences between published versions of news articles, not intermediate drafts.

2 Related Work

Previous work in natural language revision histories has focused primarily on two domains: student learner essays and Wikipedia. There is, additionally, emerging work using WikiHow for similar ends as Wikipedia (Anthonio et al., 2020; Bhat et al., 2020).

2.1 Wikipedia Revisions

Wikipedia is a resource that is often used in text-revision research. Many tasks have benefited from studying Wikipedia revisions, such as text simplification (Yatskar et al., 2010), textual entailment (Zanzotto and Pennacchiotti, 2010), and discourse learning in edits (Daxenberger and Gurevych, 2012, 2013; Fong and Biuk-Aghai, 2010). Discourse learning for edits, as in other branches of discourse learning, focuses on developing schemas to elucidate the function or purpose of each edit, and then performing classification with these schemas (Faigley and Witte, 1981). One such work, by Yang et al. (2017), developed a schema for Wikipedia edit intentions that included Wikipedia-specific edit categories, such as Counter-Vandalism and Wikification, as well as general categories like Elaboration and Refactoring. We take the general categories as a starting point for our own work.

2.2 Student Learner Essays

Student learner essays are another area of focus in revisions research. Such research focuses on revisions made during student essay writing, particularly by non-English speakers. Because of the difficulty of collecting data in this domain, most natural datasets tend to be small (Leacock et al., 2010). In recent work, researchers created an editing platform, instructed students to write 3 drafts of an essay using the platform, and gathered the revisions made during the writing process (Zhang et al., 2017). They collect 60 essays with 180 versions, or 3 drafts per essay, and focus on classifying the discursive purpose of each edit (i.e., what the edit introduces, such as Evidence or Claims).

In this vein, researchers have constructed Automated Writing Evaluation (AWE) systems and used them to evaluate writing when a student submits a draft (Wang et al., 2020; Zhang, 2020; Zhang and Litman, 2015). In one recent work, Afrin et al. (2020) compile 143 essays with 286 versions, or 2 versions each, and develop an annotation scheme focused on improving the writing (e.g., Evidence: Relevant or Irrelevant). While such work is interesting because the generation process is fully controlled (students can be instructed to write about similar topics and be given similar feedback between rounds), the corpora collected are small by modern machine-learning standards.
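The sentence-level-then-word-level comparison described in the introduction can be sketched as follows. This is a minimal illustration, not the released NewsEdits pipeline: the greedy matching strategy, the 0.6 similarity threshold, and all function names here are assumptions made for exposition.

```python
import difflib

def align_sentences(old_sents, new_sents, threshold=0.6):
    """Greedily match each sentence in the old version to its most
    similar unmatched sentence in the new version; sentences left
    unmatched are counted as removed (old) or added (new)."""
    matched, removed = [], []
    unused = list(range(len(new_sents)))
    for old in old_sents:
        best_j, best_score = None, threshold
        for j in unused:
            score = difflib.SequenceMatcher(None, old, new_sents[j]).ratio()
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            removed.append(old)  # sentence deleted in the new version
        else:
            matched.append((old, new_sents[best_j]))
            unused.remove(best_j)
    added = [new_sents[j] for j in unused]  # sentences new to this version
    return matched, removed, added

def atomic_edits(old_sent, new_sent):
    """Word-level insert/delete/replace spans within a matched pair,
    in the spirit of 'atomic edits'."""
    old_w, new_w = old_sent.split(), new_sent.split()
    sm = difflib.SequenceMatcher(None, old_w, new_w)
    return [(op, old_w[i1:i2], new_w[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes()
            if op != "equal"]
```

For example, aligning ["The storm hit the coast.", "Ten people were injured."] against ["The storm hit the coast on Tuesday.", "Officials announced a curfew."] yields one matched (changed) sentence, one removed sentence, and one added sentence; the matched pair then contains a single word-level replace edit.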
The two largest-scale corpora to be processed and released, to our knowledge, are the WikEd Error Corpus, in English, which contains 12 million sentences and 14 million revisions (Grundkiewicz and Junczys-Dowmunt, 2014), and WikiAtomicEdits, which contains 43 million "atomic edits".4 The WikEd Error Corpus has been used primarily for Grammatical Error Correction (GEC), while WikiAtomicEdits has been used primarily for language modeling (Yin et al., 2018; Shah et al., 2020).

While our work has roughly the same number of changed sentences as the Wikipedia-based corpora (8.9 million), we have roughly twice as many atomic edits (72 million), as well as added and removed sentences, which neither Wikipedia corpus reports.

4 The authors are unclear about how many sentences this corresponds to, but in our work we found, on average, roughly 4 atomic edits per changed sentence.

2.3 News

2.3.1 Academic Work

Despite the centrality of published news articles as sources for some of the most fundamental corpora in NLP (Marcus et al., 1993; Carlson et al., 2003; Pustejovsky et al., 2003; Walker, 2006), there is, to our knowledge, no revision-histories corpus based on news edits currently published in the academic literature. For research on linguistics, corpora from multiple domains can capture the relevant mechanisms, but for research on language describing real-world events, such as event extraction or temporality, news is still the gold standard.

We are aware of just one academic work, and its followup, that focuses on news edits (Tamori et al., 2017; Hitomi et al., 2017). The authors analyzed news edits to predict the quality of news articles, as well as to predict the editing operations performed during text generation.
| Corpus             | # Revisions                                                                                                          | Language           | Source           | Goal                                                          |
|--------------------|----------------------------------------------------------------------------------------------------------------------|--------------------|------------------|---------------------------------------------------------------|
| WikEd Error Corpus | 12 million changed sentences                                                                                         | English            | Wikipedia        | Grammatical Error Correction (GEC)                            |
| WikiAtomicEdits    | 43 million "atomic edits"3                                                                                           | 8 languages        | Wikipedia        | Language modeling                                             |
| WiCoPaCo           | 70,000 changed sentences                                                                                             | French             | Wikipedia        | GEC and sentence paraphrasing                                 |
| WikiHowToImprove   | 2.7 million changed sentences                                                                                        | English            | WikiHow          | Version prediction, article improvement                       |
| NewsEdits          | 8.9 million changed sentences, 10.9 million added sentences, 6.7 million removed sentences; 72 million atomic edits  | English and French | 22 media outlets | Language modeling, event sequencing, computational journalism |

Table 1: A comparison of natural language revision history corpora.

A dataset of news revisions from a Japanese newspaper was used, but it was not released publicly.5 In our work, in contrast, we publicly release a large dataset and outline several tasks for which, in our opinion, news corpora are best suited. In previously studied revision corpora, researchers typically assume, implicitly, that the writing focuses on a relatively static world-state; both Wikipedia and student learner essays tend to focus on historical events or arguments (Yang et al., 2017). Thus, in previous corpora, the nature of the edits is primarily argumentative or corrective.

… the wiki structure of WikiNews, including its revision histories. We considered using WikiNews as an additional source, but were concerned that its generation process (i.e., as a community-generated resource) differed significantly from that of the other sources we compiled (i.e., professionally developed articles). In future work, when we are better able to do a more extensive comparison, we will consider including it in our collection of news revision corpora.

2.3.2 Nonacademic Work