The Sum of Human Knowledge? Not in One Wikipedia Language Edition

Wikipedia @ 20
Marc Miquel-Ribé
Published on: May 15, 2019
Updated on: Nov 26, 2019

Image credit: Denis Schroeder (WMDE), Wikidata Items Map 2014–2017.

“The sum of human wisdom is not contained in any one language, and no single language is capable of expressing all forms and degrees of human comprehension.” Ezra Pound

Though I had used Wikipedia for years, it was only ten years ago that I discovered how each language edition community can freely organize its content, as there is no central editorial board. The Catalan version of the encyclopedia, in my native tongue, can have pages dedicated to its culture without impediment. Some might take this for granted, but I cherished this principle because of my memories of my grandfather, who was forbidden to speak his language in public during the forty years of Franco’s dictatorship, and of my mother, who did not have the chance to be educated in her mother tongue. I did not immediately become a contributor, but I wanted to learn more and, hopefully, one day give back. Today, I am doing so as a researcher with the Wikipedia Cultural Diversity Observatory (WCDO).

Though the English Wikipedia has brought much attention to the larger Wikimedia project, that project’s future and potential growth lie in many smaller languages and cultures, which are often overlooked and under threat, as many human languages are likely to disappear by the end of the century. The poet Ezra Pound said that “the sum of human wisdom is not contained in any one language, and no single language is capable of expressing all forms and degrees of human comprehension.”1 Obviously, the same is true of Wikipedia. At the observatory, we work to discover the knowledge that is local to each language, the cultural pearls from every place in the world, and to promote its exchange.
I believe this can be advanced using a model assessing project cultural diversity. Such a model will then allow us to better encourage Wikipedia language communities to raise awareness, organize events, adopt tools, and incorporate cultural diversity as part of their strategic plans.

Researching the Cultures in Wikipedia Language Editions

Although cultural diversity now appears to be a crystal-clear priority for the movement, it was not that obvious in 2011, when I attended my first Wikimania. At the most popular and crowded Wikipedia conference, the multitude of nationalities reminded me of an encyclopedic version of the United Nations. Our apparent differences were in clothing, colors, gestures, and many other details. Before the conference, a friend of mine asked me a key question: if English Wikipedia has most of the articles, why should there be hundreds of other language editions? I hesitated a bit, and my answer was that for the different language editions to exist, they had to be different.

Finding these differences became my main interest in Wikipedia. Even though I was initially more focused on the Catalan Wikipedia, I found an exciting quest in using algorithms to compare the contents of any language edition. I could see the extent and particularities of the coverage of each topic in each language as if they were patterns revealed in an aerial view, imperceptible to the eyes of other editors. Analyzing the editors’ behavior and the extent of topics in articles became the object of my Master’s thesis and later of my Ph.D. thesis. By understanding how this editing process unfolds in the data and in other researchers’ work, I found many reasons to justify the need for multiple language editions. I will try to summarize them into three.
The first aspect I saw during my research was that the articles of every language edition are limited to specific groups of points of view, or have a “linguistic point of view.” This was something intuitive to any Wikipedia user. Some topics are dealt with very differently in the Catalan and Spanish Wikipedias, especially those concerning politics and culture. Hecht and Gergle showed us that these variations in points of view between the language versions of the same article could be measured by taking into account the outgoing links in the text they have in common.2 Even in general topics, like “Psychology,” one can find that 20% of the links point at different articles. Massa and Scrinzi pointed out that topics that elicit controversy, for instance, articles about the terrorist “Osama Bin Laden” or the international struggle “Israeli-Palestinian conflict,” showed the fewest links in common.3

This led me to think that even though Wikipedia asks for a neutral point of view (NPOV), that is, a fair representation of the different available points of view on a topic, we know this is an ideal. Since a language edition is a community phenomenon, group interests and power dynamics tend to reinforce or undermine certain points of view. Some perspectives are unknown or simply ignored, and very few are novel or exclusive to that particular group of speakers. This latter category is very valuable: such novelty and uniqueness should be seen as a complement to other language editions.

Linguists sometimes defend a linguistic perspective by saying that every language is a specific worldview, or at least, one of a particular context. Each language you speak gives you concepts to map things and situations, and classify them according to the experience of generations.
Any language accumulates knowledge in the vocabulary used to label the species of plants, the nouns to describe climatological changes in the natural environment, and the idioms and adjectives that have originated to understand human character and history in a specific way. Being able to compare linguistic differences and observe from multiple perspectives allows you to contrast and understand reality better.

The eminent linguist Benjamin Lee Whorf went a bit further with this perspective and reinforced the idea that we need more than one language to gain depth in thinking. He claimed that all knowledge is provisional, and therefore, multilingual competencies allow you to advance faster in its development. “Western culture has made, through language, a provisional analysis of reality and, without correctives, holds resolutely to that analysis as final. The only correctives lie in all those other tongues which by aeons of independent evolution have arrived at different, but equally logical, provisional analyses.”4 This quote inevitably reminded me of how Wikipedia allows us to compare the different points of view, jumping between the parallel versions of an article that exists in several language editions.

The second aspect I saw during my research was that the language editions are influenced by the territories where the language is spoken, and that each edition is the most complete at creating content about them. Hecht and Gergle measured in several language editions the number of links directed to articles geolocated in the territories where the language is spoken.5 With such a simple metric they could determine that each Wikipedia tends to be self-focused, as results indicated that these articles received many more links than other geolocated articles, i.e., they were more prominent in the link graph structure.
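
The two link-based measurements mentioned so far can be approximated with simple set arithmetic. The following is only a minimal sketch under stated assumptions, not the procedure used in the cited studies: it assumes that links have already been mapped to language-independent concept labels, it uses the Jaccard index as one possible way to quantify "links in common," and it compares average incoming links as one possible reading of self-focus. The function names `link_overlap` and `self_focus_ratio` and all the data are hypothetical.

```python
from __future__ import annotations

from statistics import mean


def link_overlap(links_a: set[str], links_b: set[str]) -> float:
    """Share of outgoing links that two versions of an article have in common.

    Assumption: this is the Jaccard index (shared links / all distinct links),
    chosen for illustration; the cited studies may define overlap differently.
    """
    union = links_a | links_b
    if not union:
        return 1.0  # two versions without links are trivially identical
    return len(links_a & links_b) / len(union)


def self_focus_ratio(inlinks: dict[str, int], home_articles: set[str]) -> float:
    """Mean incoming links of home-territory geolocated articles divided by
    the mean incoming links of all other geolocated articles.

    A value above 1 suggests the edition links more heavily to its own territories.
    """
    home = [n for title, n in inlinks.items() if title in home_articles]
    other = [n for title, n in inlinks.items() if title not in home_articles]
    if not home or not other or mean(other) == 0:
        raise ValueError("need both groups, and at least one link to an outside article")
    return mean(home) / mean(other)


# Toy data: concepts linked from the same article in two language editions.
version_one = {"mind", "behavior", "cognition", "emotion", "philosophy"}
version_two = {"mind", "behavior", "cognition", "emotion", "medicine"}
overlap = link_overlap(version_one, version_two)  # 4 shared out of 6 distinct links

# Toy data: incoming link counts for geolocated articles in one edition.
inlinks_toy = {"Barcelona": 900, "Girona": 300, "Paris": 400, "Tokyo": 200}
ratio = self_focus_ratio(inlinks_toy, {"Barcelona", "Girona"})  # 600 / 300

print(f"overlap: {overlap:.2f}, self-focus ratio: {ratio:.1f}")
```

With these invented numbers, the two versions share two thirds of their links, and home-territory articles receive on average twice as many links as the rest. The figures are illustrative only and say nothing about any real edition.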
Even though geolocated articles show relevant language differences, one could argue that this is only a small portion of each Wikipedia. The articles about many other topics such as traditions, history, organizations, politics, and so on can explain the idiosyncrasies of any culture and the territories where the language is spoken. This way, by collecting all the articles about these topics, I thought we could get a better idea of what is genuine in the cultural and geographical contexts of every language edition. I therefore proposed an algorithm to collect such articles, and I named the selection of articles “Cultural Context Content” (or CCC).

My first questions were (1) how many articles would each Wikipedia dedicate to its cultural context and, more importantly, (2) what would be the extent of this group of articles. As far as the Catalan Wikipedia was concerned, my hypothesis was that it would overcompensate for the linguistic and cultural genocide suffered during the past century and that it would also be influenced by the current political self-determination struggle. This might result in an exaggerated number and proportion of articles set in this cultural context, which would be centered around Catalonia, Valencia, the Balearic Islands, Andorra, and a few scattered territories in the south of France and in the Aragonese autonomous community. Surprisingly, the proportion was only 20%, and since the first measurement it has decreased to the current 17.09%.6 Taking into account the top forty language editions, the average proportion of content dedicated to their cultural context is a quarter of each Wikipedia.7 Some, like the English and the Japanese, dedicated more than half of their articles to it. Others, like the German, French, and Italian, had lower proportions (33.7%, 26.9%, and 18.8% respectively).

Figure 1. This is Cultural Context Content (CCC), i.e. the articles related to the editors’ cultural contexts in each language edition (traditions, language, politics, agriculture, biographies, places, events, etcetera). Each Wikipedia has its CCC and cultural diversity depends on how well it covers the other languages’ CCC.

It is difficult to answer why some Wikipedia language editions dedicate more articles to their context than others, as it may depend on many factors.
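
The two quantities discussed here, the share of an edition devoted to its own CCC and how well it covers another language’s CCC, are plain proportions once the CCC selection exists. The sketch below only illustrates that arithmetic. It does not reproduce the algorithm that selects CCC articles, and it assumes articles are identified by language-independent identifiers so that editions can be compared. The function names `ccc_share` and `ccc_coverage` and the toy data are hypothetical.

```python
from __future__ import annotations


def ccc_share(edition: set[str], own_ccc: set[str]) -> float:
    """Percentage of an edition's articles that belong to its own CCC."""
    if not edition:
        raise ValueError("the edition has no articles")
    return 100 * len(edition & own_ccc) / len(edition)


def ccc_coverage(edition: set[str], other_ccc: set[str]) -> float:
    """Percentage of another edition's CCC for which this edition has an article."""
    if not other_ccc:
        raise ValueError("the other edition has an empty CCC")
    return 100 * len(edition & other_ccc) / len(other_ccc)


# Toy data only: identifiers are invented and stand for language-independent items.
edition_a = {"a1", "a2", "x1", "x2", "x3", "x4", "b1", "b2"}
ccc_a = {"a1", "a2"}                    # articles tied to edition A's cultural context
ccc_b = {"b1", "b2", "b3", "b4", "b5"}  # articles tied to edition B's cultural context

share_a = ccc_share(edition_a, ccc_a)           # 2 of 8 articles
coverage_of_b = ccc_coverage(edition_a, ccc_b)  # 2 of 5 articles

print(f"{share_a:.1f}% own CCC, {coverage_of_b:.1f}% of B's CCC covered")
```

In this invented example, edition A dedicates a quarter of its articles to its own context and covers 40% of edition B’s CCC.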
