Growing Wikipedia Across Languages Via Recommendation
Total Page:16
File Type:pdf, Size:1020Kb
Growing Wikipedia Across Languages via Recommendation Ellery Wulczyn Robert West Leila Zia Jure Leskovec Wikimedia Foundation Stanford University Wikimedia Foundation Stanford University [email protected] [email protected] [email protected] [email protected] ABSTRACT editors to find articles missing in their language that they would The different Wikipedia language editions vary dramatically in how like to contribute to. One approach to helping editors in this pro- comprehensive they are. As a result, most language editions con- cess is to generate personalized recommendations for the creation tain only a small fraction of the sum of information that exists of important missing articles in their areas of interest. across all Wikipedias. In this paper, we present an approach to Although Wikipedia volunteers have sought to increase the con- filling gaps in article coverage across different Wikipedia editions. tent coverage in different languages, research on identifying miss- Our main contribution is an end-to-end system for recommending ing content and recommending such content to editors based on articles for creation that exist in one language but are missing in an- their interests is scarce. Wikipedia’s SuggestBot [6] is the only other. The system involves identifying missing articles, ranking the end-to-end system designed for task recommendations. However, missing articles according to their importance, and recommending SuggestBot focuses on recommending existing articles that require important missing articles to editors based on their interests. We improvement and does not consider the problem of recommending empirically validate our models in a controlled experiment involv- articles that do not yet exist. ing 12,000 French Wikipedia editors. We find that personalizing Here we introduce an empirically tested end-to-end system to recommendations increases editor engagement by a factor of two. bridge gaps in coverage across Wikipedia language editions. Our Moreover, recommending articles increases their chance of being system has several steps: First, we harness the Wikipedia knowl- created by a factor of 3.2. Finally, articles created as a result of our edge graph to identify articles that exist in a source language but not recommendations are of comparable quality to organically created in a target language. We then rank these missing articles by impor- articles. Overall, our system leads to more engaged editors and tance. We do so by accurately predicting the potential future page faster growth of Wikipedia with no effect on its quality. view count of the missing articles. Finally, we recommend missing articles to editors in the target language based on their interests. In particular, we find an optimal matching between editors and miss- 1. INTRODUCTION ing articles, ensuring that each article is recommended only once, General encyclopedias are collections of information from all that editors receive multiple recommendations to choose from, and branches of knowledge. Wikipedia is the most prominent online that articles are recommended to the the most interested editors. encyclopedia, providing content via free access. Although the web- We validated our system by conducting a randomized experi- site is available in 291 languages, the amount of content in different ment, in which we sent article creation recommendations to 12,000 languages differs significantly. While a dozen languages have more French Wikipedia editors. We find that our method of personaliz- than one million articles, more than 80% of Wikipedia language ing recommendations doubles the rate of editor engagement. More editions have fewer than one hundred thousand articles [24]. It is important, our recommendation system increased the baseline ar- fair to say that one of the most important challenges for Wikipedia ticle creation rate by a factor of 3.2. Also, articles created via our is increasing the coverage of content across different languages. recommendations are of comparable quality to organically created Overcoming this challenge is no simple task for Wikipedia vol- articles. We conclude that our system can lead to more engaged unteers. For many editors, it is difficult to find important miss- editors and faster growth of Wikipedia with no effect on its quality. ing articles, especially if they are newly registered and do not have The rest of this paper is organized as follows. In Sec. 2 we years of experience with creating Wikipedia content. Wikipedians present the system for identifying, ranking, and recommending miss- have made efforts to take stock of missing articles via collections ing articles to Wikipedia editors. In Sec. 3 we describe how we of “redlinks”1 or tools such as “Not in the other language” [15]. evaluate each of the three system components. In Sec. 4 we discuss Both technologies help editors find missing articles, but leave ed- the details of the large scale email experiment in French Wikipedia. itors with long, unranked lists of articles to choose from. Since We discuss some of the opportunities and challenges of this work editing Wikipedia is unpaid volunteer work, it should be easier for and some future directions in Sec. 5 and share some concluding remarks in Sec. 6. 1Redlinks are hyperlinks that link from an existing article to a non- existing article that should be created. 2. SYSTEM FOR RECOMMENDING MISS- Copyright is held by the International World Wide Web Conference Com- mittee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the ING ARTICLES author’s site if the Material is used in electronic media. We assume we are given a language pair consisting of a source WWW 2016, April 11–15, 2016, Montréal, Québec, Canada. ACM 978-1-4503-4143-1/16/04. language S and a target language T. Our goal is to support the http://dx.doi.org/10.1145/2872427.2883077. creation of important articles missing in T but existing in S. missing in German—something we want to avoid, since the topic Find Rank Match is already covered in the TUMOR article. missing articles by editors and In order to solve this problem, we need a way to partition the articles importance articles set of Wikidata concepts into groups of near-synonyms. Once we c Sec. 2.1 & 3.1 Sec. 2.2 & 3.2 Sec. 2.3 & 3.3 have such a partitioning, we may define a concept to be missing in language T if c’s group contains no article in T. In order to group Wikidata concepts that are semantically nearly Figure 1: System overview. Sec. 2.1, 2.2, 2.3 describe the com- identical, we leverage two signals. First, we extract inter-language ponents in detail; we evaluate them in Sec. 3.1, 3.2, 3.3. links which Wikipedia editors use to override the mapping speci- fied by Wikidata and to directly link articles across languages (e.g., in Fig. 2, English NEOPLASM is linked via an inter-language link Q133212 Q1216998 to German NEOPLASMA). We only consider inter-language links Wikidata added between S and T since using links from all languages has Wikipedia zh:肿瘤 de:Tumor en:Neoplasm et:Kasvaja been shown to lead to large clusters of non-synonymous concepts Redirect [3]. Second, we extract intra-language redirects. Editors have the fr:Tumeur fr:Néoplasie de:Neoplasma ability to create redirects pointing to other articles in the same language (e.g., as shown in Fig. 2, German Wikipedia contains a redirect from NEOPLASMA to TUMOR). We use these two addi- Figure 2: Language-independent Wikidata concepts (oval) tional types of links—inter-language links and redirects—to merge linking to language-specific Wikipedia articles (rectangular). the original concept clusters defined by Wikidata, as illustrated by Clusters are merged via redirects and inter-language links Fig. 2, where articles in the red and blue cluster are merged under hand-coded by Wikipedia editors. the same concept by virtue of these links. In a nutshell, we find missing articles by inspecting a graph whose nodes are language-independent Wikidata concepts and lan- Our system for addressing this task consists of three distinct guage-specific articles, and whose edges are Wikidata’s concept- stages (Fig. 1). First, we find articles missing from the target lan- to-article links together with the two additional kinds of link just guage but existing in the source language. Second, we rank the set described. Given this graph, we define that a concept c is miss- of articles missing in the target language by importance, by build- ing in language T if c’s weakly connected component contains no ing a machine-learned model that predicts the number of views an article in T. article would receive in the target language if it existed. Third, we match missing articles to well-suited editors to create those articles, 2.2 Ranking missing articles based on similarities between the content of the missing articles and of the articles previously edited by the editors. The steps are Not all missing articles should be created in all languages: some explained in more details in the following sections. may not be of encyclopedic value in the cultural context of the target language. In the most extreme case, such articles might be 2.1 Finding missing articles deleted shortly after being created. We do not want to encourage the creation of such articles, but instead want to focus on articles For any pair of languages (S;T), we want to find the set of arti- that would fill an important knowledge gap. cles in the source language S that have no corresponding article in A first idea would be to use the curated list of 10,000 articles ev- the target language T. We solve this task by leveraging the Wiki- ery Wikipedia should have [23]. This list provides a set of articles data knowledge base [20], which defines a mapping between lan- that are guaranteed to fill an important knowledge gap in any Wi- guage-independent concepts and language-specific Wikipedia arti- kipedia in which they are missing.