Language-agnostic Topic Classification for Wikipedia

Isaac Johnson, Martin Gerlach, Diego Sáez-Trumper
Wikimedia Foundation, United States

arXiv:2103.00068v1 [cs.CY] 26 Feb 2021

ABSTRACT

A major challenge for many analyses of Wikipedia dynamics (e.g., imbalances in content quality, geographic differences in what content is popular, what types of articles attract more editor discussion) is grouping the very diverse range of Wikipedia articles into coherent, consistent topics. This problem has been addressed using various approaches based on Wikipedia’s category network, WikiProjects, and external taxonomies. However, these approaches have always been limited in their coverage: typically, only a small subset of articles can be classified, or the method cannot be applied across (the more than 300) languages on Wikipedia. In this paper, we propose a language-agnostic approach based on the links in an article for classifying articles into a taxonomy of topics that can be easily applied to (almost) any language and article on Wikipedia. We show that it matches the performance of a language-dependent approach while being simpler and having much greater coverage.

CCS CONCEPTS

• Human-centered computing → Empirical studies in collaborative and social computing.

KEYWORDS

Wikipedia, language-agnostic, topic classification

ACM Reference Format:
Isaac Johnson, Martin Gerlach, and Diego Sáez-Trumper. 2021. Language-agnostic Topic Classification for Wikipedia. In Companion Proceedings of the Web Conference 2021 (WWW ’21 Companion), April 19–23, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3442442.3452347

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW ’21 Companion, April 19–23, 2021, Ljubljana, Slovenia
© 2021 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-8313-4/21/04.
https://doi.org/10.1145/3442442.3452347

1 INTRODUCTION

As of January 2021, Wikipedia has over 300 language editions with 55.7 million articles¹ about 20.4 million distinct entities² and an additional 250 thousand articles created every month.³ These articles cover a very wide range of content and it can be difficult to track and understand these dynamics (e.g., what types of content are attracting the interest of editors or readers?), especially across the different language editions.

Wikipedia itself has a number of editor-curated annotation systems that bring some order to all of this content. The most direct is perhaps the category network, but content can also be categorized based on properties stored in Wikidata, tagging by WikiProjects (groups of editors who focus on a specific topic), or inclusion of templates such as infoboxes. While these annotation systems are quite powerful and extensive, they ultimately are human-generated and semi-structured and thus have many edge cases and under-coverage in languages or communities that do not have editors who can maintain these annotations (see [10]). Researchers have developed many approaches to improve these annotation systems: using Wikipedia’s category network [17, 22] or WikiProjects [2, 30], relying on DBPedia’s manually curated taxonomy, which assigns topics based on infobox templates [17], or discarding the editor-based annotation systems completely and learning a fixed number of topics through unsupervised techniques such as topic modeling [18, 20, 23, 27].

However, these approaches generally suffer from two limitations related to coverage in terms of language and article. First, not all language communities have the editor base or need to maintain these annotations to the same degree. For example, Arabic Wikipedia has the most categories per article at 25.5, and English Wikipedia, with its approximately 40,000 monthly active editors,⁴ has 1.5M categories that are collectively applied 66M times across its 6.2 million articles.⁵ Wu Chinese Wikipedia, however, with approximately 20 active editors and 41,231 articles, has only 6,990 categories that are applied 26,396 times (leaving many articles with no categories at all). Similar variation is seen with other systems as well: linkage to Wikidata is higher but still as low as 87.5% in Cebuano Wikipedia with its 5.5M articles,⁶ only 92 Wikipedia languages have a page describing WikiProjects,⁷ and many articles lack infoboxes.⁸ Second, approaches that seek to expand article coverage by also predicting topics for non-annotated articles depend on hand-labeling of topics (which requires language expertise) [20, 23, 27] or language modeling that does not easily scale to all languages on Wikipedia [2].

In this paper, we make the following contributions:

• We present an approach to automatically labeling (almost) all Wikipedia articles across every language of Wikipedia with a consistent set of topics. Specifically, we build on work by Asthana & Halfaker [2] that uses 64 topics derived from WikiProject tags and extend their language-dependent approach to all Wikipedia languages. The main innovation of our approach is to represent articles in a language-agnostic way using article links that have been mapped to Wikidata items (similar to Piccardi and West [23]).
• We demonstrate through quantitative and qualitative evaluations that our language-agnostic approach performs as well as or better than alternative approaches.
• We release the code and trained model, a dataset of every Wikipedia article and its predicted topics, and APIs for interacting with the models.

¹ https://wikistats.wmcloud.org/display.php?t=wp
² Personal calculation based on 4 January 2021 Wikidata JSON dump: https://dumps.wikimedia.org/wikidatawiki/entities/20210104/
³ https://stats.wikimedia.org/#/all-wikipedia-projects/contributing/new-pages/normal|bar|2-year|~total|monthly
⁴ https://stats.wikimedia.org/#/en.wikipedia.org/contributing/active-editors/normal|line|2-year|(page_type)~content*non-content|monthly
⁵ Personal calculations using the December 2020 dumps and categorylinks table (https://www.mediawiki.org/wiki/Manual:Categorylinks_table) filtered by page table to articles in namespace 0 (https://www.mediawiki.org/wiki/Manual:Page_table)
⁶ https://wikidata-analytics.wmcloud.org/app/WD_percentUsageDashboard
⁷ https://www.wikidata.org/wiki/Q4234303
⁸ Only about one-third of English Wikipedia articles have infoboxes per DBPedia’s statistics: https://wiki.dbpedia.org/services-resources/ontology

2 RELATED WORK

Three general approaches have been taken to classifying Wikipedia articles into a consistent and coherent set of topics: 1) directly applying existing editor-generated annotations on Wikipedia, 2) linking Wikipedia articles to an external taxonomy, and 3) learning unsupervised topics and manually labeling them. This work pulls most directly from Section 2.1 with additional modeling similar to Section 2.3.

2.1 Annotations on Wikipedia

The most common and simplest strategy for classifying Wikipedia articles by topic is using existing annotations that editors have added to articles. For instance, Wikipedia has a category network that roughly forms a tree with approximately 40 root topics.⁹ Not all language communities of Wikipedia, however, use the same set of high-level categories or label articles with categories to the same extent [17]. The category network also is messy, requiring careful

other ontologies, coverage is limited by how many infobox templates are present and mapped to DBPedia’s ontology.

Wikidata offers another ontology that is much more closely linked to Wikipedia. Wikidata items often have either an instance-of property (P31) or a subclass-of property (P279), the network of which can be used to categorize Wikidata items (and their corresponding Wikipedia articles) into a set of high-level topics [24]. However, Wikidata’s ontology contains loops, dead ends, and other inconsistencies that limit its use for applying coherent topics to articles [4, 24]. It is also a step removed from Wikipedia articles, which removes the direct connection and feedback loop between the topics applied to an article and what content is included in the article.

2.3 Unsupervised Approaches

Some researchers have also avoided these existing ontologies in favor of unsupervised learning of topics and post-hoc labeling. These topics are generally learned via topic models, most notably Latent Dirichlet Allocation (LDA), with article text as input [18, 20, 27]. These unsupervised approaches have the benefit of generating continuous topic vectors that can be valuable for modeling and having high coverage because they do not rely on annotations. However, there are limitations of these approaches for topic labeling of articles systematically. First, the identified latent topics cannot always easily be interpreted in terms of their content. Second, text-based approaches usually require custom adaptations when being applied across all languages due to issues arising from parsing and pre-processing different scripts.

The approach by Piccardi and West [23], which learns a topic model over articles as represented by their links mapped to the language-agnostic Wikidata vocabulary, is very