The Tower of Babel Meets Web 2.0: User-Generated Content and Its Applications in a Multilingual Context Brent Hecht* and Darren Gergle† Northwestern University Dept

The Tower of Babel Meets Web 2.0: User-Generated Content and Its Applications in a Multilingual Context Brent Hecht* and Darren Gergle*† Northwestern University *Dept. of Electrical Engineering and Computer Science, † Dept. of Communication Studies [email protected], [email protected] ABSTRACT the goal of this research to illustrate the splintering effect of This study explores language’s fragmenting effect on user- this “Web 2.0 Tower of Babel”1 and to explicate the generated content by examining the diversity of knowledge positive and negative implications for HCI and AI-based representations across 25 different Wikipedia language applications that interact with or use Wikipedia data. editions. This diversity is measured at two levels: the concepts that are included in each edition and the ways in We begin by suggesting that current technologies and which these concepts are described. We demonstrate that applications that rely upon Wikipedia data structures the diversity present is greater than has been presumed in implicitly or explicitly espouse a global consensus the literature and has a significant influence on applications hypothesis with respect to the world’s encyclopedic that use Wikipedia as a source of world knowledge. We knowledge. In other words, they make the assumption that close by explicating how knowledge diversity can be encyclopedic world knowledge is largely consistent across beneficially leveraged to create “culturally-aware cultures and languages. To the social scientist this notion applications” and “hyperlingual applications”. will undoubtedly seem problematic, as centuries of work have demonstrated the critical role culture and context play Author Keywords in establishing knowledge diversity (although no work has Wikipedia, knowledge diversity, multilingual, hyperlingual, yet measured this effect in Web 2.0 user-generated content Explicit Semantic Analysis, semantic relatedness (UGC) on a large scale). Yet in practice, many of the ACM Classification Keywords technologies and applications that rely upon Wikipedia data H.5.3. [Information Interfaces and Presentation]: Group and structures adopt this view. In doing so, they make many Organization Interfaces – collaborative computing, incorrect assumptions and miss out on numerous computer-supported cooperative work technological design opportunities. General Terms To demonstrate the pitfalls of the global consensus Human Factors hypothesis – and to provide the first large-scale census of the effect of language in UGC repositories – we present a INTRODUCTION novel methodology for assessing the degree of world A founding principle of Wikipedia was to encourage knowledge diversity across 25 different Wikipedia language consensus around a single neutral point of view [18]. For editions. Our empirical results suggest that the common instance, its creators did not want U.S. Democrats and U.S. encyclopedic core is a minuscule number of concepts Republicans to have separate pages on concepts like (around one-tenth of one percent) and that sub-conceptual “Barack Obama”. However, this single-page principle knowledge diversity is much greater than one might broke down in the face of one daunting obstacle: language. initially think—drawing a stark contrast with the global Language has recently been described as “the biggest consensus hypothesis. barrier to intercultural collaboration” [29], and facilitating In the latter half of this paper, we show how this knowledge consensus formation across speakers of all the world’s diversity can affect core technologies such as information languages is, of course, a monumental hurdle. Consensus retrieval systems that rely upon Wikipedia-based semantic building around a single neutral point of view has been relatedness measures. We do this by demonstrating fractured as a result of the Wikipedia Foundation setting up knowledge diversity’s influence on the well-known over 250 separate language editions as of this writing. It is technology of Explicit Semantic Analysis (ESA). In this paper, our contribution is four-fold. First, we show that the quantity of the world knowledge diversity in Wikipedia is much greater than has been assumed in the literature. Second, we demonstrate the effect this diversity 1 See [18] for original use of the analogy. can have on important technologies that use Wikipedia as a language edition that is missing it, which, if not qualified, is source of world knowledge. Third, this work is the first a strong case of the global consensus hypothesis. We argue large-scale and large-number-of-language study to describe that although this “information arbitrage” model in many some of the effects of language on user-generated content. cases provides risk-free informational profit, it does not Finally, we conclude by describing how a more realistic always do so. In fact, if such an approach is applied perspective on knowledge diversity can open up a whole heedlessly it runs the risk of tempering cultural diversity in new area of applications: culturally aware applications, and knowledge representations or introducing culturally its important sub-area, hyperlingual applications. irrelevant/culturally false information. Our results can be used in tandem with approaches such as information BACKGROUND Wikipedia has in recent years earned a prominent place in arbitrage to provide a framework for separating “helpful” the literature of HCI, CSCW, and AI. It has served as a arbitrage, which can have huge benefits, from “injurious” laboratory for understanding many aspects of collaborative arbitrage, which has negative qualities. work [6, 16, 17, 23, 26] and has also become a game- Evidence Against the Global Consensus Hypothesis changing source for encyclopedic world knowledge in While fewer in number, some recent papers have focused many AI research projects [11, 20, 28, 30]. Yet the vast on studying the differences between language editions. majority of this work has focused on a single language These papers fall more in line with social science positions edition of Wikipedia, nearly always English. Only recently regarding the nature of knowledge diversity. Our prior work has the multilingual character of Wikipedia begun to be [14] demonstrated that each language edition of Wikipedia leveraged in the research community, ushering in a focuses content on the geographic culture hearth of its potential second wave of Wikipedia-related research. language, a phenomenon called the “self-focus” of user- generated content. In a similar vein, Callahan and Herring This new work has been hailed as having great potential for [7] examined a sample of articles on famous persons in the solving language- and culture-related HCI problems [14]. English and Polish Wikipedia and found that cultural Unfortunately, while pioneering, most of this research has differences are evident in the content. proceeded without a full understanding of the multilingual nature of Wikipedia. In the applied area, Adafre and de OUR APPROACH Rijke [1] developed a system to find similar sentences In order to question the veracity of the global consensus between the English and Dutch Wikipedias. Potthast and hypothesis, quantify the degree of knowledge diversity, and colleagues [25] extended ESA [11, 12] for multilingual demonstrate the importance of this knowledge diversity, a information retrieval. Hassan and Mihalcea [13] used an large number of detailed analyses are necessary. The ESA-based method to calculate semantic relatedness following provides an overview of the various stages of our measurements between terms in two different languages. study. Adar and colleagues [2] built Ziggurat, a system that Step One (“Data Preparation and Concept Alignment”): propagates structured information from a Wikipedia in one First, we develop a methodology similar to previous language into that of another in a process they call literature [2, 13, 25] that allows us to study knowledge “information arbitrage”. A few papers [22, 27] attempted to diversity. This methodology – highlighted by a simple add more interlanguage links between Wikipedias, a topic algorithm we call CONCEPTUALIGN – aligns concepts across we cover in detail below. different language editions, allowing us to formally The Global Consensus of World Knowledge understand that “World War II” (English), “Zweiter Though innovative, much of the previous work implicitly or Weltkrieg” (German), and “Andre verdenskrig” even explicitly adopts a position consistent with the global (Norwegian) all describe the same concept. Since the consensus hypothesis. The global consensus hypothesis methodology leverages user-contributed information, we posits that every language’s encyclopedic world knowledge also provide an evaluation to demonstrate its effectiveness. representation should cover roughly the same set of Step Two (“World Knowledge Diversity in Wikipedia”): concepts, and do so in nearly the same way. Any large After describing the concept alignment process, we can differences between language editions are treated as bugs analyze the extent of diversity of knowledge representations that need to be fixed, discrepancies that will go away in in different language editions of Wikipedia. These analyses time (and possibly need help doing so), or a problem that take place at both the conceptual (article) level and the sub- should simply be ignored. For example, Sorg and Cimiano conceptual (content) level, and serve two purposes: (1) they [27] state that since English does not cover a large majority

The Tower of Babel Meets Web 2.0: User-Generated Content and Its Applications in a Multilingual Context Brent Hecht* and Darren Gergle† Northwestern University Dept

D2.8 GLAM-Wiki Collaboration Progress Report 2

12€€€€Collaborating on the Sum of All Knowledge Across Languages

The Case of 13 Wikipedia Instances

Europe's Online Encyclopaedias

Is Wikipedia Really Neutral? a Sentiment Perspective Study of War-Related Wikipedia Articles Since 1945

Parallel-Wiki: a Collection of Parallel Sentences Extracted from Wikipedia

Wikipedia @ 20

A Systematic Review of Scholarly Research on the Content of Wikipedia

1 Wikipedia As an Arena and Source for the Public. a Scandinavian

Template for Phd Dissertations

CCURL 2014: Collaboration and Computing for Under- Resourced Languages in the Linked Open Data Era

The Role of Local Content in Wikipedia: a Study on Reader and Editor Engagement1