wikiHowToImprove: A Resource and Analyses on Edits in Instructional Texts
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 5721–5729, Marseille, 11–16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC.

Talita Rani Anthonio∗, Irshad Ahmad Bhat∗, Michael Roth
University of Stuttgart, Institute for Natural Language Processing

Abstract
Instructional texts, such as articles in wikiHow, describe the actions necessary to accomplish a certain goal. In wikiHow and other resources, such instructions are subject to revision edits on a regular basis. Do these edits improve instructions only in terms of style and correctness, or do they provide clarifications necessary to follow the instructions and to accomplish the goal? We describe a resource and first studies towards answering this question. Specifically, we create wikiHowToImprove, a collection of revision histories for about 2.7 million sentences from about 246,000 wikiHow articles. We describe human annotation studies on categorizing a subset of sentence-level edits and provide baseline models for the task of automatically distinguishing "older" from "newer" versions of a sentence.

Keywords: Corpus creation, Semantics, Other

1. Introduction
The ever increasing size of the World Wide Web has made it possible to find instructional texts, or how-to guides, on practically any topic or activity. wikiHow is an online platform on which a community of users collaboratively write such guides. As of March 2020, wikiHow consists of more than 246,000 articles.[1] Factors that constitute good instructional texts have been studied across various disciplines for decades, including inference requirements in cognitive science (Britton et al., 1990); document design in educational research (Misanchuk, 1992); and motivational processes in sociology (Guthrie et al., 2004). Yet, it remains open what linguistic phenomena are involved in these factors and whether they can be detected and handled automatically.

A first step towards filling this gap is to compare changes made across multiple versions of the same set of instructions, under the assumption that later versions are improvements over a first version.[2] Recent work on revisions in Wikipedia has shown that changes indeed serve a clarifying function (Faruqui et al., 2018). According to that study, however, most changes in Wikipedia provide new information (43%), with refinements only ranking second (24%).

The function and information provided by revisions is potentially different in the context of how-to guides, as their content is largely independent of factual knowledge that changes over time. Therefore, we perform a study similar to that by Faruqui et al. (2018) on wikiHow articles. As our goal is to find observable patterns that reflect potential improvements, we further attempt to sub-categorize edits according to the information changed between two versions of a wikiHow article. For this step, we create a dataset of sentence-level revisions for each article in wikiHow. An example is shown in Table 1.

Text                                                           Timestamp
1. Cut strips of paper and then write ... nouns on them.      (...)
2. Put the pieces of paper into a hat or bag.                 (...)
3. Have the youngest player choose the first piece of paper.  (...)
4. Have the other players determine the chosen noun.          2007-04-02T13:43:10Z
4. Have the other players guess the chosen noun.              2007-04-19T22:54:49Z
4. Have all other players try to guess the chosen noun.       2007-05-04T17:15:00Z

Table 1: Example instruction steps from wikiHow, including different versions (and their timestamps) of the last sentence (bottom half); the example represents one of approximately 2.7 million revision groups in our dataset.

On this dataset, we carry out two types of studies: first, we perform annotation experiments to find out more about the types of changes and proportions thereof; second, we attempt to model edits computationally by developing a system to distinguish "older" from "newer" versions of instructional sentences from wikiHow.

In summary, we make the following main contributions:[3]
• We introduce and motivate the task of distinguishing older and newer versions of instructions (Section 3).
• We create wikiHowToImprove, a dataset of over 2.7 million sentences and their revision histories (Section 4).
• We design and report on two annotation experiments that investigate the types of edits made and their proportions in a sample of revision histories (Section 5).
• We develop and evaluate benchmark models that distinguish different versions of a sentence (Section 6).

∗ equal contribution
[1] https://www.wikihow.com/wikiHow:About-wikiHow
[2] This assumption is supported, for example, by wikiHow's claim that articles are "changed 9 times per year" on average and are continually reworked "till they are the most helpful and reliable how-to guides on the web".
[3] Data and code are available at: https://github.com/irshadbhat/wikiHowToImprove
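As a concrete illustration of the revision groups shown in Table 1, the following Python sketch gives one plausible in-memory representation of a base version together with its time-stamped revised versions. The class and field names (RevisionGroup, article, base, revisions) and the article title are illustrative assumptions, not the released data format.

```python
# A minimal sketch of one revision group: a base sentence plus its
# time-stamped revised versions. Names are illustrative assumptions,
# not the format of the released wikiHowToImprove data.
from dataclasses import dataclass, field

@dataclass
class RevisionGroup:
    article: str                 # wikiHow article the sentence comes from
    base: str                    # original ("base") version of the sentence
    revisions: list = field(default_factory=list)  # (timestamp, text), oldest first

group = RevisionGroup(
    article="Play a Guessing Game",  # hypothetical article title
    base="Have the other players determine the chosen noun.",
    revisions=[
        ("2007-04-19T22:54:49Z", "Have the other players guess the chosen noun."),
        ("2007-05-04T17:15:00Z", "Have all other players try to guess the chosen noun."),
    ],
)

# The latest revision, falling back to the base version if none exists.
latest = group.revisions[-1][1] if group.revisions else group.base
print(latest)
```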
2. Related Work
In this section, we present studies conducted within two related lines of research. In Section 2.1, we discuss previous work on revisions in the English Wikipedia. Currently available wikiHow corpora are described in Section 2.2.

2.1. Revisions in English Wikipedia
There are a number of studies on revision histories from Wikipedia articles for various NLP tasks, such as sentence simplification and linguistic bias detection (Recasens et al., 2013). A study particularly similar to ours has been carried out by Faruqui et al. (2018) on Wikipedia articles. Faruqui et al. investigate differences between phrases inserted during a revision and the general language observed in Wikipedia texts. They approach this task through annotation experiments and linguistic analyses. The latter revealed that nouns, adjectives and adverbs occur considerably more often in edited, inserted text than in non-edited text. In their computational experiments, Faruqui et al. (2018) model and analyze edits that insert information through language models based on sequence-to-sequence methods: one trained on Wikipedia texts and one trained on their own WikiEdits corpus. The task of these models is to generate a phrase which would be appropriate to insert into a sentence at a specific position. Their results show that a language model trained on article edits is more successful in proposing phrases that capture the same discourse function as human insertions than a language model trained on Wikipedia more generally. Faruqui et al. (2018) concluded that the supervision provided by article edits encodes aspects of language distinct from non-edited text.

Another study which uses the revision history of the English Wikipedia is the work of Daxenberger and Gurevych (2012). They build a corpus of 1,995 edits from 891 article revisions of English Wikipedia texts and propose a 21-category classification scheme of edit types. The categories fall into three top layers:

• Wikipedia Policy: invalid edits as defined by internal Wikipedia policies and respective defense mechanisms (e.g. VANDALISM).
• Surface Edits: edits not affecting the meaning of the text (e.g. SPELLING/GRAMMAR).
• Text-Base: edits affecting the meaning of the text (e.g. INFORMATION-INSERT).

Three annotators annotated the data and obtained an agreement in terms of Krippendorff's alpha (Krippendorff, 1970) of 0.67. A follow-up analysis of the frequency distribution of edit types showed that most edits belong to the category Text-Base (51.19%), whereas 25.64% are Surface Edits.

2.2. Existing wikiHow Corpora
To the best of our knowledge, the only available corpus of wikiHow articles is from Koupaee and Wang (2018). The authors collected a large-scale summarization dataset consisting of 204,004 wikiHow articles to evaluate existing summarization systems. The structure of wikiHow articles is well suited for this task: each article is divided into paragraphs, and each paragraph starts with a summary sentence. The authors showed that the diversity of the topics and the uniqueness of n-grams (i.e., the abstraction level) in their wikiHow dataset create interesting challenges for summarization systems. For our study, the corpus of Koupaee and Wang (2018) is unsuitable since we need a collection of how-to guides that contains edited sentences as well as their earlier versions.
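As a side note on the agreement figure cited in Section 2.1: Krippendorff's alpha over multiple annotators can be computed with standard tooling. The sketch below uses NLTK's AnnotationTask on invented toy labels from three annotators; it is a minimal illustration of the metric, not a reproduction of the annotation data of Daxenberger and Gurevych (2012).

```python
# Computing Krippendorff's alpha for three annotators labeling the same
# items, as in the agreement study cited above. Labels are invented toy data.
from nltk.metrics.agreement import AnnotationTask

# Triples of (annotator, item, label).
data = [
    ("a1", "edit1", "Text-Base"), ("a2", "edit1", "Text-Base"), ("a3", "edit1", "Text-Base"),
    ("a1", "edit2", "Surface"),   ("a2", "edit2", "Surface"),   ("a3", "edit2", "Text-Base"),
    ("a1", "edit3", "Policy"),    ("a2", "edit3", "Policy"),    ("a3", "edit3", "Policy"),
]

task = AnnotationTask(data=data)
print(f"Krippendorff's alpha: {task.alpha():.2f}")
```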
3. Problem Statement and Motivation
The objective of this work is to categorize potential improvements made to instructional texts and to investigate to what extent they can be modelled computationally. Towards this objective, we examine to what extent how-to guides in wikiHow change over time. We make the simplifying assumption that changes are usually made for the better and therefore represent improvements to the original version of an article. Based on this assumption, we cast the modeling of improvements as a supervised learning problem, which requires the distinction between "older" and "newer" versions of a text. For simplicity, we focus on edits at the sentence level. That is, we consider all articles in wikiHow for which a revision history is available and examine each original sentence, henceforth the base version, and how it is changed at subsequent points in time, henceforth the revised versions.

In Section 4, we first present wikiHowToImprove, a dataset of revision histories derived from wikiHow. We describe a set of simple methods that we put together in order to automatically download and extract sentence-level revisions for articles from wikiHow. Based on a small sample of these revision histories, we attempt to categorize different types of edits in two annotation studies. These studies, presented in Section 5, provide us with potential explanations for why edits are made, thereby indicating to what extent our assumption that edits represent actual improvements is reasonable. First steps to test whether such potential improvements can be modelled computationally are presented in Section 6.
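To make the supervised framing above concrete, here is a minimal sketch of the "older vs. newer" task as binary classification over ordered sentence pairs. The toy pairs, the pair encoding with a "|||" separator, and the TF-IDF/logistic-regression pipeline are illustrative choices only, not the benchmark models evaluated in Section 6.

```python
# A minimal sketch of the "older vs. newer" setup: given a sentence pair
# (a, b) from one revision group, predict whether b is the newer version.
# Features and data are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy (older, newer) revision pairs in the style of Table 1.
pairs = [
    ("Have the other players determine the chosen noun.",
     "Have the other players guess the chosen noun."),
    ("Have the other players guess the chosen noun.",
     "Have all other players try to guess the chosen noun."),
]

# Each pair yields two ordered instances: label 1 if the second sentence
# in the concatenation is the newer version, 0 otherwise.
X, y = [], []
for old, new in pairs:
    X.append(old + " ||| " + new)
    y.append(1)
    X.append(new + " ||| " + old)
    y.append(0)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(X, y)
print(clf.predict([X[0]]))  # expected: [1]
```

Presenting each pair in both orders keeps the label distribution balanced and mirrors the distinction between base and revised versions described above.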