What Is Trending on Wikipedia?
Total Page:16
File Type:pdf, Size:1020Kb
What is Trending on Wikipedia? Capturing Trends and Language Biases Across Wikipedia Editions Volodymyr Miz, Joëlle Hanna, Nicolas Aspert, Benjamin Ricaud, and Pierre Vandergheynst LTS2, EPFL, Station 11, CH-1015 Lausanne, Switzerland [email protected] ABSTRACT of a topic, while others strive for an in-depth understanding of all In this work, we propose an automatic evaluation and comparison related facts. of the browsing behavior of Wikipedia readers that can be applied Due to these features, Wikipedia has become a global and uni- to any language editions of Wikipedia. As an example, we focus on fied information medium for people having different backgrounds, English, French, and Russian languages during the last four months hobbies, religious affiliations, political views, and languages. This of 2018. The proposed method has three steps. Firstly, it extracts makes Wikipedia an open window on cultural differences across the most trending articles over a chosen period of time. Secondly, it different languages and populations. Analysing WikipediaâĂŹs performs a semi-supervised topic extraction and thirdly, it compares viewership statistics could help identify collective interests of so- topics across languages. The automated processing works with the cially diverse communities of people speaking different languages data that combines Wikipedia’s graph of hyperlinks, pageview and compare them over chosen periods of time. statistics and summaries of the pages. The structure and the content of the online encyclopedia differ The results show that people share a common interest and curios- across languages. Some editions are more developed than others. ity for entertainment, e.g. movies, music, sports independently of When it comes to the coverage of certain topics, the level of detail their language. Differences appear in topics related to local events and the content varies largely from one language to another. This or about cultural particularities. Interactive visualizations showing difference is especially noticeable when we look at controversial clusters of trending pages in each language edition are available articles related to culture, politics or history, as it is shown in [16] online https://wiki-insights.epfl.ch/wikitrends for the web in general and in [2, 10] for Wikipedia in particular. For instance, the Italian version of the article about Leonardo da Vinci is CCS CONCEPTS much more detailed than the one in English. The content difference is also apparent in fairly non-controversial topics. For example, • Applied computing → Sociology; • Human-centered com- the article "Cat" in Spanish focuses on the animal’s diseases and puting → Social content sharing; Social navigation; Wikis. covers this aspect in greater detail than the same article written KEYWORDS in English. In addition to the different texts, different hyperlinks Wikipedia, languages, networks, data mining, society are also present pointing to different concepts or related pages [10]. ACM Reference Format: Such differences may trigger diverging associations depending on Volodymyr Miz, Joëlle Hanna, Nicolas Aspert, Benjamin Ricaud, and Pierre the language in which one reads the very same article. Vandergheynst. 2020. What is Trending on Wikipedia? Capturing Trends These observations lead to an interesting question: is it possible and Language Biases Across Wikipedia Editions. In Companion Proceedings to quantify differences between languages automatically? of the Web Conference 2020 (WWW ’20 Companion), April 20–24, 2020, Taipei, In this work, we identify and compare the most popular top- Taiwan. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3366424. ics among readers of three Wikipedia language editions, English, 3383567 French, and Russian, over the last 4 months of 2018. We focus on articles that have unusual spikes in the number of views and as- 1 INTRODUCTION sociate those spikes of activity with real-world events. We discuss arXiv:2002.06885v1 [cs.SI] 17 Feb 2020 Wikipedia has become one of the most popular sources of informa- similarities and differences in the preferences of the readers ofthe tion in the world. As of January 2020, it is written in 309 languages. three editions. Wikipedia readers have different motivations [19]. The range and We define trends as popular topics having a rise of interest from depth of their interests are very diverse. Some readers may look readers during a limited period of time. We rely on the bursty for an up-to-date source of information related to an event that behavior of page views, noticed and modeled in previous works [14, appeared in the news, while others are interested in solving work or 15, 22]. We require the trend to correspond to a group of Wikipedia school-related tasks. Some may be satisfied with a quick overview articles connected by hyperlinks and having a correlated increase in viewership. This allows discarding isolated pages that get a high This paper is published under the Creative Commons Attribution 4.0 International number of hits due to ’flash crowd’ coming from a popular page (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution. pointing to a Wikipedia article, which is a very common use case. WWW ’20 Companion, April 20–24, 2020, Taipei, Taiwan We rely on the method presented in [14] to extract the trends. © 2020 IW3C2 (International World Wide Web Conference Committee), published From this method, we obtain well-separated clusters of Wikipedia under Creative Commons CC-BY 4.0 License. pages and their number of visits over time. This trend extraction ACM ISBN 978-1-4503-7024-0/20/04. https://doi.org/10.1145/3366424.3383567 WWW ’20 Companion, April 20–24, 2020, Taipei, Taiwan Volodymyr Miz, et al. English George H. W. Miss Universe Scott September Emmy Hurricane Lion Air Stan Lee Morrison 11 attacks award Michael Bush Flight 610 US (polititian) crash elections French Nicolas Quebec Men in September Nulle part Siège de Miss Hulot Formula 1 elections Stan Lee Black 3 11 attacks ailleurs (TV Homs France (politician) show) Russian Joseph September Soyuz Columbine Ukrainian George H. W. Figure skating Kobzon 11 attacks Formula 1 spacecraft High School Stan Lee Navy Bush (women) (singer) Massacre Activity level (normalized) level Activity Figure 1: Trends over time and across languages. Each colored curve, together with a short description, is associated to a trend. They represent the number of visits on the pages belonging to the trend over time. The associated keyword is the title of the most central page of the trend. Red dashed lines highlight the moments when readers’ interests in multiple language editions coincide. is illustrated on Fig. 1 where the most trending events and their differences due to political and cultural influences associated to popularity are displayed on a timeline. languages. Depending on the language some parts within an article Our first contribution is a flexible method that automatically: are more detailed, biased or more debated. As a consequence of • extracts the most popular topics during a chosen period of this editorial behavior, page contents in different languages con- time (trends) tain noticeable variations when comparing articles about famous • labels each trend according to the summaries of Wikipedia people[5], food-related pages [11] or history of states [17]. Not only pages related to it. the text is different but also the hyperlink structure [8]. The method is language-independent and relies solely on Wikipedia With the help of a large-scale poll, Lemmerich et al. [13] ana- pageview data and graph of hyperlinks. lyzed visitors’ motivations across 14 languages. The authors demon- Our second contribution is the analysis of trends, their similari- strated that Wikipedia viewership patterns and use cases vary in ties and differences across three language editions of Wikipedia. different language editions and connected their findings tothe socio-economic characteristics in certain countries, such as the 2 RELATED WORK Human Development Index. To our knowledge, this is the first study focusing on the auto- The first studies on the interest of readers [12, 21] used Wikipedia matic detection of viewership trends and biases in multiple language page view counts to identify the most popular articles. They found editions of Wikipedia. In contrast to previous works, topics are not out that entertainment (movies, music, sports, etc.) and people’s predefined but extracted automatically according to their popularity biographies are the most popular topics that interest Wikipedia (number of visits on the related pages). We provide an overview of readers. Recently, a more advanced method based on the dynamic the dynamical evolution of the most popular topics among the read- evolution of popularity [14] confirmed this tendency and shed light ers of English, French, and Russian language editions of Wikipedia on the relationship between events that appear in the news, trends and reveal differences and commonalities between them. and Wikipedia article popularity. Another topic of recent investigations is the motivation of Wiki- pedia readers [19]. It was reported that among the main motivations 3 DATASET for reading about a particular topic is that it was referenced in the In this work, we use the graph-structured dataset for Wikipedia media (30%), in a conversation (22%) or it is a current event (13%). research described in [1] where nodes represent Wikipedia articles This indicates a strong influence of trends and news on the readers’ and edges correspond to hyperlinks between them. The associated consumption of articles. Trends are an essential part of the search open source repository provides tools for pre-processing the official for information and that is what we want to extract and analyse. Wikipedia SQL dumps (hyperlinks and pageviews records) in all They are highly influenced by the readers’ environment and hence available languages. should reflect similarities and differences across languages andcul- We studied English, French, and Russian language editions of tures.