The Wikipedia Diversity Observatory: A Project to Identify and Bridge Content Gaps in Wikipedia
Total Page:16
File Type:pdf, Size:1020Kb
Recommended publications
Librarians As Wikimedia Movement Organizers in Spain: an Interpretive Inquiry Exploring Activities and Motivations
By Laurie Bridges (Oregon State University, https://orcid.org/0000-0002-2765-5440) and Clara Llebot (Oregon State University, https://orcid.org/0000-0003-3211-7396)
Citation: Bridges, L., & Llebot, C. (2021). Librarians as Wikimedia Movement Organizers in Spain: An interpretive inquiry exploring activities and motivations. First Monday, 26(6/7). https://doi.org/10.5210/fm.v26i3.11482
Abstract: How do librarians in Spain engage with Wikipedia (and Wikidata, Wikisource, and other Wikipedia sister projects) as Wikimedia Movement Organizers? And what motivates them to do so? This article reports on findings from 14 interviews with 18 librarians. The librarians interviewed were multilingual and contributed to Wikimedia projects in Castilian (commonly referred to as Spanish), Catalan, Basque, English, and other European languages. They reported planning and running Wikipedia events, developing partnerships with local Wikimedia chapters, motivating citizens to upload photos to Wikimedia Commons, identifying gaps in Wikipedia content and filling those gaps, transcribing historic documents and adding them to Wikisource, and contributing data to Wikidata. Most were motivated by their desire to preserve and promote regional languages and culture, and a commitment to open access and open education.
Introduction: This research started with an informal conversation in 2018 about the popularity of Catalan Wikipedia, Viquipèdia, between two library coworkers in the United States, the authors of this article. Our conversation began with a sense of wonder about Catalan Wikipedia, which is ranked twentieth by number of articles, out of 300 different language Wikipedias (Meta contributors, 2020).
Cross-Lingual Entity Extraction and Linking for 300 Languages
© 2020 Xiaoman Pan
Dissertation by Xiaoman Pan, submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2020. Urbana, Illinois.
Doctoral Committee: Dr. Heng Ji (Chair), Dr. Jiawei Han, Dr. Hanghang Tong, Dr. Kevin Knight
Abstract: Information provided in languages which people can understand saves lives in crises. For example, the language barrier was one of the main difficulties faced by humanitarian workers responding to the Ebola crisis in 2014. We propose to break language barriers by extracting information (e.g., entities) from a massive variety of languages and grounding that information in an existing Knowledge Base (KB) which is accessible to a user in their own language (e.g., a reporter from the World Health Organization who speaks English only). The ambitious goal of this thesis is to develop a Cross-lingual Entity Extraction and Linking framework for 1,000 fine-grained entity types and 300 languages that exist in Wikipedia. Given a document in any of these languages, our framework is able to identify entity name mentions, assign a fine-grained type to each mention, and link it to an English KB if it is linkable. Traditional entity linking methods rely on costly human-annotated data to train supervised learning-to-rank models that select the best candidate entity for each mention. In contrast, we propose a novel unsupervised represent-and-compare approach that accurately captures the semantic meaning representation of each mention and directly compares it with the representation of each candidate entity in the target KB.
Gender Bias and Under-Representation in Natural Language Processing Across Human Languages
Yan Chen¹, Christopher Mahoney¹, Isabella Grasso¹, Esma Wali¹, Abigail Matthews², Thomas Middleton¹, Mariama Njie³, Jeanna Matthews¹
¹Clarkson University, ²University of Wisconsin-Madison, ³Iona College
Abstract: Natural Language Processing (NLP) systems are at the heart of many critical automated decision-making systems making crucial recommendations about our future world. However, these systems reflect a wide range of biases, from gender bias to a bias in which voices they represent. In this paper, a team including speakers of 9 languages - Chinese, Spanish, English, Arabic, German, French, Farsi, Urdu, and Wolof - reports and analyzes measurements of gender bias in the Wikipedia corpora for these 9 languages. In the process, we also document how our work exposes crucial gaps in the NLP-pipeline for many languages. Despite substantial investments in multilingual support, the modern NLP-pipeline still systematically and dramatically under-represents the majority of …
1 Introduction: Corpora of human language are regularly fed into machine learning systems as a key way to learn about the world. Natural Language Processing plays a significant role in many powerful applications such as speech recognition, text translation, and autocomplete and is at the heart of many critical automated decision systems making crucial recommendations about our future world [20][2][7]. Systems are taught to identify spam email, suggest medical articles or diagnoses related to a patient’s symptoms, sort resumes based on relevance for a given position, and many other tasks that form key components of critical decision making systems in areas such as criminal justice, credit, housing, allocation of public resources, and more.
Master Thesis
MASTER THESIS
Title: Cultural configuration of Wikipedia: measuring Autoreferentiality in different languages
Master degree: Master of Science in Telecommunication Engineering & Management
Author: Marc Miquel Ribé
Director: Horacio Rodríguez Hontoria
Tutor: Sebastià Sallent Ribes
Date: March 31, 2011
Abstract: "Wikipedia is a multilingual, collaborative, web-based, non-profit encyclopedia project driven by the Wikimedia Foundation": this is how Wikipedia describes itself in the definition of the article that bears its name. It means that the encyclopedia can be modified at any time, by anyone, and from anywhere. These premises and its high level of participation make it an excellent social object of study which, being a technological artifact, also lends itself to techniques from natural language processing, data retrieval, and data mining. However, current research clearly lacks software that can approach it in a comprehensive way. With this gap in mind, we characterize Wikipedia in order to understand in depth which information elements and structures it contains and how they can then be obtained with an analytical tool. We start from the existing API called wikAPIdia, which we extend with new functionality and prepare to address multiple scenarios and problems in the social sciences.
Measuring Self-Focus Bias in Community-Maintained Knowledge Repositories
Brent Hecht† and Darren Gergle†‡
†Dept. of Electrical Engineering and Computer Science, ‡Dept. of Communication Studies, Northwestern University
Abstract: Self-focus is a novel way of understanding a type of bias in community-maintained Web 2.0 graph structures. It goes beyond previous measures of topical coverage bias by encapsulating both node- and edge-hosted biases in a single holistic measure of an entire community-maintained graph. We outline two methods to quantify self-focus, one of which is very computationally inexpensive, and present empirical evidence for the existence of self-focus using a “hyperlingual” approach that examines 15 different language editions of Wikipedia. We suggest applications of our methods and discuss the risks of ignoring self-focus bias in technological applications.
Categories and Subject Descriptors: H.5.3 [Information Systems]: Group and Organization Interfaces – collaborative computing, computer-supported cooperative work, theory and models.
Introduction (excerpt): In this paper, we explore this question by introducing and measuring self-focus, a new way of understanding a type of bias in the graph structures that underlie many of these community-maintained repositories, including one of the largest: Wikipedia. We define self-focus bias as occurring when contributors to a knowledge repository encode information that is important and correct to them and a large proportion of contributors to the same repository, but not important and correct to contributors of similar repositories. Self-focus bias is similar to topical coverage biases [8, 11] in that it seeks to describe the semantic makeup of knowledge repositories. Topical coverage bias studies explicitly or implicitly compare the distribution of articles (or a similar measure) in particular semantic categories in Wikipedia to that of a more traditional knowledge repository, generally in an effort to show that Wikipedia describes in more detail semantic areas that are of …
The Impact of Culture on Smart Community Technology: The Case of 13 Wikipedia Instances
Interaction Design and Architecture(s) Journal - IxD&A, N.22, 2014, pp. 34-47
Zinayida Petrushyna¹, Ralf Klamma¹, Matthias Jarke¹,²
¹Advanced Community Information Systems Group, Information Systems and Databases Chair, RWTH Aachen University, Ahornstrasse 55, 52056 Aachen, Germany
²Fraunhofer Institute for Applied Information Technology FIT, 53754 St. Augustin, Germany
{petrushyna, klamma}@dbis.rwth-aachen.de
Abstract: Smart communities provide technologies for monitoring social behaviors inside communities. Technologies that support knowledge building should consider the cultural background of community members. Studies of the influence of culture on knowledge building are limited. Only a few works consider the digital traces of individuals and explain them using cultural values and beliefs. In this work, we analyze 13 Wikipedia instances where users with different cultural backgrounds build knowledge in different ways. We compare the edits of users. Using social network analysis, we build and analyze co-authorship networks and observe the networks' evolution. We explain the differences we have found using Hofstede dimensions and Schwartz cultural values and discuss implications for the design of smart community technologies. Our findings provide insights into requirements for technologies used for smart communities in different cultures.
Keywords: Social network analysis, Wikipedia communities, Hofstede dimensions, Schwartz cultural values
1 Introduction: People prefer to live in smart cities where their needs are satisfied [1]. The development of smart cities depends on the collaboration of individuals. The investigation of the flow [1] of knowledge created by the individuals allows the monitoring of city smartness.
DLDP Digital Language Survival Kit
The Digital Language Diversity Project
Digital Language Survival Kit: The DLDP Recommendations to Improve Digital Vitality
Imprint: The DLDP Digital Language Survival Kit
Authors: Klara Ceberio Berger, Antton Gurrutxaga Hernaiz, Paola Baroni, Davyth Hicks, Eleonore Kruse, Valeria Quochi, Irene Russo, Tuomo Salonen, Anneli Sarhimaa, Claudia Soria
This work has been carried out in the framework of The Digital Language Diversity Project (www.dldp.eu), funded by the European Union under the Erasmus+ Programme (Grant Agreement no. 2015-1-IT02-KA204-015090).
© 2018. This work is licensed under a Creative Commons Attribution 4.0 International License. Cover design: Eleonore Kruse
Disclaimer: This publication reflects only the authors’ view and the Erasmus+ National Agency and the Commission are not responsible for any use that may be made of the information it contains.
www.dldp.eu, www.facebook.com/digitallanguagediversity, www.twitter.com/dldproject
Recommendations at a Glance
Digital Capacity Recommendations (columns: Indicator, Level, Recommendations)
Indicator: Digital Literacy
Level 2,3: Increasing digital literacy among your native language-speaking community
Level 2,3: Promote the upskilling of language mentors, activists or disseminators
Level 2,3: Establish initiatives to inform and educate speakers about how to acquire and use particular communication and content creation skills
Level 2: Teaching digital literacy to children in your language community through
Gebiotoolkit: Automatic Extraction of Gender-Balanced Multilingual Corpus of Wikipedia Biographies
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4081–4088, Marseille, 11–16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC
Marta R. Costa-jussà, Pau Li Lin, Cristina España-Bonet∗
TALP Research Center, Universitat Politècnica de Catalunya, Barcelona; ∗DFKI GmbH and Saarland University, Saarbrücken
Abstract: We introduce GeBioToolkit, a tool for extracting multilingual parallel corpora at sentence level, with document and gender information, from Wikipedia biographies. Despite the gender inequalities present in Wikipedia, the toolkit has been designed to extract a corpus balanced in gender. While our toolkit is customizable to any number of languages (and to domains other than biographical entries), in this work we present a corpus of 2,000 sentences in English, Spanish and Catalan, which has been post-edited by native speakers to become a high-quality dataset for machine translation evaluation. While GeBioCorpus aims at being one of the first non-synthetic gender-balanced test datasets, GeBioToolkit aims at paving the path to standardized procedures for producing gender-balanced datasets.
Keywords: corpora, gender bias, Wikipedia, machine translation
1. Introduction: Gender biases are present in many natural language processing applications (Costa-jussà, 2019). This comes as an undesired characteristic of deep learning architectures …
… by volunteers. The toolkit is customizable for languages and gender balance. We take advantage of Wikipedia multilinguality to extract a corpus of biographies, each biography being a document available in all the selected languages.
Primary-School-Feasibility Study
WikiAfrica Primary School Feasibility Study
Draft, November 2012.
The WikiAfrica Primary School Feasibility Study was produced in 2012 within the frame of WikiAfrica, a cross-continental collaboration that aims to increase the quantity and quality of African content on the world’s most referenced online encyclopaedia, Wikipedia. WikiAfrica is promoted by lettera27 Foundation and the Africa Centre and it was initiated by lettera27 Foundation in 2006.
The WikiAfrica Primary School Feasibility Study is edited by Iolanda Pensa, scientific director of WikiAfrica; Cristina Perillo, WikiAfrica project manager for lettera27; and Isla Haddow-Flood, WikiAfrica project manager for the Africa Centre. WikiAfrica Primary School is a project conceived by Iolanda Pensa under a Creative Commons Attribution-ShareAlike license. The WikiAfrica Primary School Feasibility Study was made possible thanks to volunteers and the support of lettera27 Foundation. The study is under a Creative Commons Attribution-ShareAlike license. WikiAfrica, 2012.
“Education is the most powerful weapon which you can use to change the world” (Nelson Mandela)
Table of contents
1. Introduction
Key findings
2. Wikipedia
Outreach projects and education
Languages used in the African educational systems
General overview of involved Wikipedia projects
General overview on the
Gender Bias in Natural Language Processing Across Human Languages
Abigail Matthews (University of Wisconsin-Madison); Isabella Grasso, Christopher Mahoney, Yan Chen, Esma Wali, Thomas Middleton, Jeanna Matthews (Clarkson University); Mariama Njie (Iona College)
Abstract: Natural Language Processing (NLP) systems are at the heart of many critical automated decision-making systems making crucial recommendations about our future world. Gender bias in NLP has been well studied in English, but has been less studied in other languages. In this paper, a team including speakers of 9 languages - Chinese, Spanish, English, Arabic, German, French, Farsi, Urdu, and Wolof - reports and analyzes measurements of gender bias in the Wikipedia corpora for these 9 languages. We develop extensions to profession-level and corpus-level gender bias metric calculations originally designed for English and apply them to 8 other languages, including languages that have grammatically gendered …
Introduction (excerpt): … relationship of profession words like doctor, nurse, or teacher relative to this gendered vector space. They demonstrated that word embedding software trained on a corpus of Google news could associate men with the profession computer programmer and women with the profession homemaker. Systems based on such models, trained even with “representative text” like Google news, could lead to biased hiring practices if used to, for example, parse resumes and suggest matches for a computer programming job. However, as with many results in NLP research, this influential result has not been applied beyond English.
In some earlier work from this team, “Quantifying Gender Bias in Different Corpora”, we applied Bolukbasi et al.’s methodology to computing and comparing corpus-level gender bias metrics across different corpora of the English text (Babaeianjelodar 2020).
The Sum of Human Knowledge? Not in One Wikipedia Language Edition
Wikipedia @ 20
Marc Miquel-Ribé
Published on: May 15, 2019. Updated on: Nov 26, 2019.
Image credit: Denis Schroeder (WMDE), Wikidata Items Map 2014–2017.
“The sum of human wisdom is not contained in any one language, and no single language is capable of expressing all forms and degrees of human comprehension.” Ezra Pound
Though I had used Wikipedia for years, it was only ten years ago that I discovered how each language edition community can freely organize its content, as there is no central editorial board. The Catalan version of the encyclopedia, in my native tongue, can have pages dedicated to its culture without impediment. Some might take this for granted, but I cherished this principle because of my memories of my grandfather, who was forbidden to speak his language in public during the forty years of Franco’s dictatorship, and of my mother, who did not have the chance to be educated in her mother tongue. I did not immediately become a contributor, but I wanted to learn more and, hopefully, one day give back. Today, I am doing so as a researcher with the Wikipedia Cultural Diversity Observatory (WCDO). Though the English Wikipedia has brought much attention to the larger Wikimedia project, that project’s future and potential growth lie in many smaller languages and cultures, which are often overlooked and under threat, as many human languages are likely to disappear by the end of the century.
Mining Cross-Cultural Relations from Wikipedia - a Study of 31 European Food Cultures
Paul Laufer (Graz University of Technology, Graz, Austria); Claudia Wagner (GESIS & U. of Koblenz, Cologne, Germany); Fabian Flöck (GESIS, Cologne, Germany); Markus Strohmaier (GESIS & U. of Koblenz, Cologne, Germany)
Abstract: For many people, Wikipedia represents one of the primary sources of knowledge about foreign cultures. Yet, different Wikipedia language editions offer different descriptions of cultural practices. Unveiling diverging representations of cultures provides an important insight, since they may foster the formation of cross-cultural stereotypes, misunderstandings and potentially even conflict. In this work, we explore to what extent the descriptions of cultural practices in various European language editions of Wikipedia differ on the example of culinary practices and propose an approach to mine cultural relations between different language communities through their description of and interest in their own and other communities' food culture.
Introduction (excerpt): … the editor community of the Romanian-language Wikipedia could either have a deviant mental picture of the French cuisine – or it might estimate the priorities of Romanian-speaking readers to rather be on meat-based French delicatessen than on wine and baking goods. Further, the general interest of the Romanian-speaking readers in the French cuisine (for example measured by the number of views of the article about French cuisine in the Romanian language edition) might serve to potentially displease any Francophile, since the Romanian speaking community might show notably less interest in the French kitchen than in the Russian or Hungarian one. This hypothetical scenario serves as an example for numerous similar real-world cases (which can …