Classifying Wikipedia Article Quality with Revision History Networks
Total Page:16
File Type:pdf, Size:1020Kb
Classifying Wikipedia Article Quality With Revision History Networks Narun Raman∗ Nathaniel Sauerberg∗ Carleton College Carleton College [email protected] [email protected] Jonah Fisher Sneha Narayan Carleton College Carleton College [email protected] [email protected] ABSTRACT long been interested in maintaining and investigating the quality We present a novel model for classifying the quality of Wikipedia of its content [4][6][12]. articles based on structural properties of a network representation Editors and WikiProjects typically rely on assessments of article of the article’s revision history. We create revision history networks quality to focus volunteer attention on improving lower quality (an adaptation of Keegan et. al’s article trajectory networks [7]), articles. This has led to multiple efforts to create classifiers that can where nodes correspond to individual editors of an article, and edges predict the quality of a given article [3][4][18]. These classifiers can join the authors of consecutive revisions. Using descriptive statistics assist in providing assessments of article quality at scale, and help generated from these networks, along with general properties like further our understanding of the features that distinguish high and the number of edits and article size, we predict which of six quality low quality Wikipedia articles. classes (Start, Stub, C-Class, B-Class, Good, Featured) articles belong While many Wikipedia article quality classifiers have focused to, attaining a classification accuracy of 49.35% on a stratified sample on assessing quality based on the content of the latest version of of articles. These results suggest that structures of collaboration an article [1, 4, 18], prior work has suggested that high quality arti- underlying the creation of articles, and not just the content of the cles are associated with more intense collaboration among editors article, should be considered for accurate quality classification. [8, 10, 21]. To this end, we introduce a novel model for article qual- ity classification, inspired by the network-based article trajectory CCS CONCEPTS model proposed by Keegan et al. [7], in order to predict an article’s quality by analyzing the structure of its collaboration network. • Human-centered computing ! Empirical studies in collab- Using statistics generated from each article’s revision history orative and social computing; Wikis. network, in conjunction with basic article characteristics like article KEYWORDS size and edit count, we run a multinomial logisitic regression (MLR) and attain a classification accuracy of 49.35% on a stratified sample Wikipedia, network analysis, quantitative methods, article quality, of 6000 articles. classification, collaboration ACM Reference Format: 2 COLLABORATION AND QUALITY Narun Raman, Nathaniel Sauerberg, Jonah Fisher, and Sneha Narayan. 2020. Classifying Wikipedia Article Quality With Revision History Networks. Structures of collaboration have been shown to affect outcomes in In 16th International Symposium on Open Collaboration (OpenSym 2020), traditional organizations, as well as volunteer groups online [13]. August 25–27, 2020, Virtual conference, Spain. ACM, New York, NY, USA, Structural properties of internal collaborative networks can influ- 7 pages. https://doi.org/10.1145/3412569.3412581 ence outcomes across a wide variety of contexts, from the creation of Broadway musicals [16], to open-source software development 1 INTRODUCTION [15]. In particular, small-world networks (i.e. highly clustered net- Wikipedia has become the largest and most popular online ref- works with small characteristic path length [19]) are believed to erence encyclopedia in the world1, and each of its articles is the enhance productivity and creativity [13, 15, 16]. product of collaboration between many volunteer editors working On Wikipedia in particular, leadership behavior and an interme- together to create a coherent and accurate resource [7][11]. Due diate level of small-wordliness in the social networks of WikiProject to Wikipedia’s widespread adoption, editors and researchers have talk pages have been found to be positively related to the efficiency and productivity of a project (measured in terms of edit count and ∗Both authors contributed equally to this research. edit longevity) [13]. Additionally, prior work has shown that arti- 1https://www.alexa.com/siteinfo/wikipedia.org cle quality is positively impacted by more communication among editors on talk pages, as well as a more centralized collaboration OpenSym 2020, August 25–27, 2020, Virtual conference, Spain structure [8]. While articles with many editors are more likely to © 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM. This is the author’s version of the work. It is posted here for your personal use. Not be of higher quality than articles with fewer editors [8, 9, 20], the for redistribution. The definitive Version of Record was published in 16th International addition of editors to a page only seems to improve its quality Symposium on Open Collaboration (OpenSym 2020), August 25–27, 2020, Virtual confer- ence, Spain, https://doi.org/10.1145/3412569.3412581. when those editors are collaborating appropriately [8]. Preliminary results also suggest that structural parameters of collaboration OpenSym 2020, August 25–27, 2020, Virtual conference, Spain Narun Raman, Nathaniel Sauerberg, Jonah Fisher, and Sneha Narayan networks generated from article revision histories can help dif- quality is subjective and classification judgments may vary among ferentiate controversial articles from featured articles [2]. Given editors. Second, approximately 7.52% (as of June 13, 2020) of all that these structures of collaboration have been related to individ- Wikipedia articles have not been assessed 4. This may impact the dis- ual article quality [2][8][10], as well as the overall productivity of tribution of assessed Wikipedia articles. Finally, because Wikipedia WikiProjects [13], we believe that characteristics of the collabora- is a constantly evolving platform, an article’s quality class rating tion networks of individual articles may be beneficial in predicting could be outdated because of new edits and revisions that occurred the quality level of a Wikipedia article. after its assessment. Despite these limitations, we use them as the ground-truth for 3 PREDICTING ARTICLE QUALITY ON article quality in our classifier because of their frequent use in WIKIPEDIA discussions of article quality on Wikipedia [5][10] as well as in similar classification models [4][6][14][18]. Existing models for predicting article quality on Wikipedia have typically used features related to the content of the article at a 4.2 The Revision History Network Model particular moment in time [1, 4, 18]. One such feature is the length of the article. A model with article word count as its only predictor The page history of a Wikipedia article can be viewed as an ordered was able to achieve a 97.15% accuracy rate in completing a binary list of revisions, each of which stores the state of the article at a classification task of featured and non-featured articles, beating particular point in time. Each revision is associated with an editor, out several more complex models [1]. Warncke-Wang et al. then identified by a username or IP address, depending on whether the introduced the ‘Actionable Model’, which uses article size along editor created an account. with features like the number of internal links, headings, images We define the revision history network of an article as an undi- and references to perform a more fine-grained classification task rected network whose nodes are the editors of the corresponding on all seven quality classes (including the now uncommon A-Class) article, and whose edges join the editors of consecutive revisions. with 42.5% accuracy on an imbalanced corpus of articles [18]. ORES, In other words, we can construct the revision history network of a machine-learning based web service that estimates Wikipedia an article from its revision history by iterating over the revisions article quality, uses a modified version of the Actionable Model chronologically and, for each revision 8 after the first, creating an with some additional predictor variables, including the number edge between the editors of revisions 8 − 1 and 8 if one does not of “[citation needed]” templates and the number of “Main article” already exist (see Figure 1). Note that this holds even if an editor linking templates [4]. This version was able to achieve an accuracy authors two consecutive revisions; in this case, we create an edge of 62.9% on their own corpus of articles using the six current quality joining the corresponding node to itself. Our model is adapted from classes. the article trajectory network model introduced in [7], which rep- However, these models do not take into account how the ar- resents each article’s revision history as a directed network with ticle was developed by its editors and the way they collaborated multiple edges. with one another. In light of work outlined in the previous section that suggests a relationship between the structure of editor collab- orations and article quality, we present a novel model that uses characteristics of an article’s revision history network to predict the quality of an article. 4 MODELING APPROACH 4.1 Operationalizing Article Quality In order to measure the baseline quality of an article, we make use of Wikipedia’s content