Text Corpora and the Challenge of Newly Written Languages


Proceedings of the 1st Joint SLTU and CCURL Workshop (SLTU-CCURL 2020), pages 111–120, Language Resources and Evaluation Conference (LREC 2020), Marseille, 11–16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC.

Alice Millour*, Karën Fort*†
*Sorbonne Université / STIH, 28, rue Serpente, 75006 Paris, France
†Université de Lorraine, CNRS, Inria, LORIA, 54000 Nancy, France
[email protected], [email protected]

Abstract
Text corpora represent the foundation on which most natural language processing systems rely. However, for many languages, collecting or building a text corpus of a sufficient size still remains a complex issue, especially for corpora that are accessible and distributed under a clear license allowing modification (such as annotation) and further resharing. In this paper, we review the sources of text corpora usually called upon to fill the gap in low-resource contexts, and how crowdsourcing has been used to build linguistic resources. Then, we present our own experiments with crowdsourcing text corpora and an analysis of the obstacles we encountered. Although the results obtained in terms of participation are still unsatisfactory, we advocate that the effort towards a greater involvement of the speakers should be pursued, especially when the language of interest is newly written.

Keywords: text corpora, dialectal variants, spelling, crowdsourcing

1. Introduction
Speakers from various linguistic communities are increasingly writing their languages (Outinoff, 2012), and the Internet is increasingly multilingual.¹ Many of these languages are less-resourced: these linguistic productions are not sufficiently documented, while there is an urgent need to provide tools that sustain their digital use. What is more, when a language is not standardized, we need to build adapted resources and tools that embrace its diversity.

Including these linguistic productions into natural language processing (NLP) pipelines hence requires efforts on two complementary fronts: (i) collecting or building resources that represent the use of the language, and (ii) developing tools that can cope with the variation mechanisms. In all cases, the very first resource that is needed for further processing is a text corpus.

After presenting the challenges related to the processing of oral languages when they come to be written, we introduce in Section 3. the existing multilingual sources that are commonly used to collect corpora, as well as their main shortcomings.

In Section 4., we show how crowdsourcing has been used in the past to involve the members of linguistic communities in collaboratively building resources for their languages. We argue that this method is all the more reasonable when it comes to collecting meaningful data for non-standardized languages.

In Section 5., we present the existing initiatives to crowdsource text corpora. Based on our own experiments and on the results of a survey regarding the digital use of a non-standardized language, we explain why collecting this particular type of resource is challenging.

¹ See, for instance, the reports provided by w3techs, such as https://w3techs.com/technologies/history_overview/content_language/ms/y.

2. The Need for Text Corpora
The rise in use of SMS, online chat and social media in general has created a new space of expression for an increasing number of speakers (see, for instance, the studies on specific languages carried out by Rivron (2012) or Soria et al. (2018)). Although linguistic communities are being threatened all over the world, this represents a valuable opportunity to observe, document and equip with appropriate tools an increasing number of languages. These new spaces of written conversation have been taken over by linguistic communities whose practice had been mainly oral until then (van Esch et al., 2019). When no orthography has been defined for a given language, or when one (or various) conventions exist but are not consistently used by the speakers, spellings may vary from one speaker to another. Indeed, when spellings are not standardized by an arbitrary convention, speakers may transcribe the language based on how they speak it.

In fact, standardizing spelling is not confined to defining the orthography, as it usually also acts as a process of unifying potential linguistic variants towards a sole written form. By contrast, the absence of such a standardization process authorizes the raw transcription of a multitude of linguistic variants of a given language. These variants being transcribed according to the spelling habits and linguistic backgrounds of each speaker, the Internet, especially in its conversational uses, has become a breeding ground for the expression and observation of linguistic diversity.

Situations of spelling variation observed on the Internet are documented in diverse linguistic contexts, such as the ones of:

• The Zapotec and Chatino communities in Oaxaca, Mexico, as detailed in (Lillehaugen, 2016) in the context of the Voces del Valle program. During this program, speakers were encouraged to write tweets in their languages. They were provided with spelling guidance they were not compelled to follow. As stated by the author: "The result was that, for the most part, the writers were non-systematic in their spelling decisions—but they were writing".

• Some of the communities speaking Tibetan dialects outside China, which develop a "written form based on the spoken language" independently of Classical Literary Tibetan (Tournadre, 2014).

• The Eton ethnic group, in Cameroon, about which Rivron (2012) observes that the Internet is the support of "the extension of a mother tongue outside its habitual context and uses, and the correlated development of its graphic system".

• Speakers of Javanese dialects who "have their own way of writing down the words they use according to the pronunciation they understand", regardless of the official spelling. Each dialect developing its own spelling, a dialect that was "originally only recognizable through its oral narratives (pronunciation) is now easily recognizable through the spelling used in social media" (Fauzi and Puspitorini, 2018).

• Communities using Arabizi to transcribe Arabic online: as reported by Tobaili et al. (2019), Arabizi allows multiple mappings between Arabic and Roman alphanumerical characters, and thus makes apparent dialectal variations usually hidden in the traditional writing.

• Communities transliterating Indian dialects with the Roman alphabet without observing systematic conventions for transliteration (Shekhar et al., 2018).

• Regional European languages such as Alsatian, a continuum of Alemannic dialects, for which a great diversity of spellings is reported (Millour and Fort, 2019), even though a flexible spelling system, Orthal (Crévenat-Werner and Zeidler, 2008), has been developed.

In the following, we will refer to these proteiform languages as "multi-variant". The variation observed is indeed the result of (at least) two simultaneous mechanisms: dialectal and scriptural variation. Both of these degrees of freedom may be impacted by the usual dimensions of variation (diachronic, diatopic, diastratic, and diamesic).

From an NLP perspective, these linguistic productions represent a challenge. In fact, they push us to deal with the issues that variation processes imply, either because these productions account for most of the written existence of a non-standardized language, or because …

• an insufficient coverage to constitute the basis for further linguistic resource development. These corpora are unlikely to be balanced in terms of representativeness of the existing practices.

• their nature and license sometimes require operations that result in a loss of information, such as the metadata necessary to identify the languages or the structure of the document.

• using them requires additional linguistic resources (to perform language identification, for instance).

In the following, we first present the Wikipedia project, which, with 306 active Wikipedias distributed under Creative Commons licenses, undoubtedly provides the largest freely available multilingual corpus.
Second, we present how the Web can more generally be used as a source of text corpora. We focus on describing how Web crawling has been used to gather corpora for less-resourced languages, and briefly comment on the use of social network-based corpora.

3.1. Wikipedia as a Corpus
Wikipedia is an online collaborative multilingual encyclopedia supported by the Wikimedia Foundation, a non-profit organization.
Along with providing structured information from which lexical semantic resources or ontologies can be derived, Wikipedia is an easily accessible source of text corpora widely used in the NLP community, and from which both well- and less-resourced languages benefit.
Its popularity and its collaborative structure make it the most natural environment to foster collaborative text production. We discuss in this section to what extent Wikipedia represents a valuable source of text corpora for less-resourced and non-standardized languages, in terms of further NLP processing.
After a short introduction on the size and quality of the existing Wikipedias, we describe how the issue of language identification and the purpose of Wikipedia prevent it from being the most appropriate virtual place to host dialectal and scriptural diversity.
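Language identification, mentioned above as a prerequisite for exploiting Web-derived corpora, can itself be bootstrapped from small seed corpora. As a minimal sketch only (the toy training sentences and the frequency-overlap scoring are our own assumptions, not a resource or method from the paper), a character trigram profile classifier in the spirit of classic n-gram language identifiers might look like this:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Extract overlapping character n-grams, padding with spaces."""
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NgramLanguageIdentifier:
    """Naive language identifier: each language gets a character
    trigram frequency profile; a text is assigned to the language
    whose profile best covers its trigrams."""

    def __init__(self, n=3):
        self.n = n
        self.profiles = {}

    def train(self, language, texts):
        counts = Counter()
        for text in texts:
            counts.update(char_ngrams(text, self.n))
        self.profiles[language] = counts

    def identify(self, text):
        grams = char_ngrams(text, self.n)

        def score(profile):
            # Sum of relative frequencies of the text's n-grams
            # in the candidate language's profile.
            total = sum(profile.values()) or 1
            return sum(profile[g] / total for g in grams)

        return max(self.profiles, key=lambda lang: score(self.profiles[lang]))
```

With a handful of sentences per language, such a profile already separates typologically distant languages; distinguishing close dialectal variants, as the paper discusses, is far harder and is precisely where this kind of approach breaks down.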
Recommended publications
  • Compilation of a Swiss German Dialect Corpus and Its Application to PoS Tagging
    Compilation of a Swiss German Dialect Corpus and its Application to PoS Tagging. Nora Hollenstein, Noëmi Aepli, University of Zurich. [email protected], [email protected]. Abstract: Swiss German is a dialect continuum whose dialects are very different from Standard German, the official language of the German part of Switzerland. However, when dealing with Swiss German in natural language processing, the detour through Standard German is usually taken. As writing in Swiss German has become more and more popular in recent years, we would like to provide data to serve as a stepping stone for automatically processing the dialects. We compiled NOAH's Corpus of Swiss German Dialects, consisting of various text genres, manually annotated with Part-of-Speech tags. Furthermore, we applied this corpus as a training set for a statistical Part-of-Speech tagger and achieved an accuracy of 90.62%. 1 Introduction: Swiss German is not an official language of Switzerland; rather, it comprises dialects of Standard German, which is one of the four official languages. However, it differs from Standard German in terms of phonetics, lexicon, morphology and syntax. Swiss German cannot be divided into a few dialects; in fact, it is a dialect continuum with a huge variety. Swiss German is not only spoken but increasingly used in written form, especially in less formal text types: Swiss German speakers often write text messages, emails and blogs in their dialect, and in recent years authors have also begun publishing in it. Nonetheless, there is neither a writing standard nor an official orthography, which increases the variation dramatically, since people write as they please, in their own style.
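The abstract above reports token-level tagging accuracy on NOAH's corpus. As a hedged illustration of that evaluation setup only (the toy tagged sentences and tags below are invented, and a most-frequent-tag baseline stands in for the statistical tagger actually used), accuracy can be computed as:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Most-frequent-tag baseline: map each word form to the tag it
    most often carries in training; fall back to the overall most
    frequent tag for unknown words."""
    tag_counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word.lower()][tag] += 1
            all_tags[tag] += 1
    default_tag = all_tags.most_common(1)[0][0]
    model = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}
    return model, default_tag

def accuracy(model, default_tag, tagged_sentences):
    """Token-level accuracy of the baseline against gold tags."""
    correct = total = 0
    for sentence in tagged_sentences:
        for word, gold in sentence:
            predicted = model.get(word.lower(), default_tag)
            correct += predicted == gold
            total += 1
    return correct / total
```

On a non-standardized dialect continuum, the weakness of such a lexicalized baseline is exactly the spelling variation discussed in the main paper: the same word written three ways counts as three unknown words.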
  • Librarians As Wikimedia Movement Organizers in Spain: an Interpretive Inquiry Exploring Activities and Motivations
    Librarians as Wikimedia Movement Organizers in Spain: An interpretive inquiry exploring activities and motivations by Laurie Bridges and Clara Llebot. Laurie Bridges, Oregon State University, https://orcid.org/0000-0002-2765-5440. Clara Llebot, Oregon State University, https://orcid.org/0000-0003-3211-7396. Citation: Bridges, L., & Llebot, C. (2021). Librarians as Wikimedia Movement Organizers in Spain: An interpretive inquiry exploring activities and motivations. First Monday, 26(6/7). https://doi.org/10.5210/fm.v26i3.11482. Abstract: How do librarians in Spain engage with Wikipedia (and Wikidata, Wikisource, and other Wikipedia sister projects) as Wikimedia Movement Organizers? And what motivates them to do so? This article reports on findings from 14 interviews with 18 librarians. The librarians interviewed were multilingual and contributed to Wikimedia projects in Castilian (commonly referred to as Spanish), Catalan, Basque, English, and other European languages. They reported planning and running Wikipedia events, developing partnerships with local Wikimedia chapters, motivating citizens to upload photos to Wikimedia Commons, identifying gaps in Wikipedia content and filling those gaps, transcribing historic documents and adding them to Wikisource, and contributing data to Wikidata. Most were motivated by their desire to preserve and promote regional languages and culture, and a commitment to open access and open education. Introduction: This research started with an informal conversation in 2018 about the popularity of Catalan Wikipedia, Viquipèdia, between two library coworkers in the United States, the authors of this article. Our conversation began with a sense of wonder about Catalan Wikipedia, which is ranked twentieth by number of articles, out of 300 different language Wikipedias (Meta contributors, 2020).
  • Combating Attacks and Abuse in Large Online Communities
    University of California, Santa Barbara. Combating Attacks and Abuse in Large Online Communities. A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science by Gang Wang. Committee in charge: Professor Ben Y. Zhao (Chair), Professor Haitao Zheng, Professor Christopher Kruegel. September 2016. Copyright © 2016 by Gang Wang. Acknowledgements: I would like to thank my advisors Ben Y. Zhao and Haitao Zheng for mentoring me throughout the PhD program. They were always there for me, giving me timely and helpful advice in almost all aspects of both research and life. I also want to thank my PhD committee member Christopher Kruegel for his guidance in my research projects and job hunting. Finally, I want to thank my mentors in previous internships: Jay Stokes, Cormac Herley, Weidong Cui and Helen Wang from Microsoft Research, and Vicente Silveira from LinkedIn. Special thanks to Janet Kayfetz, who has helped me greatly with my writing and presentation skills. I am very much thankful to my collaborators for their hard work, without which none of this research would have been possible. First and foremost, to the members of SAND Lab at UC Santa Barbara: Christo Wilson, Bolun Wang, Tianyi Wang, Manish Mohanlal, Xiaohan Zhao, Zengbin Zhang, Xia Zhou, Ana Nika, Xinyi Zhang, Shiliang Tang, Alessandra Sala, Yibo Zhu, Lin Zhou, Weile Zhang, Konark Gill, Divya Sambasivan, Xiaoxiao Yu, Troy Steinbauer, Tristan Konolige, Yu Su and Yuanyang Zhang. Second, to the employees at Microsoft: Jack Stokes, Cormac Herley and David Felstead. Third, to Miriam Metzger from the Department of Communications at UC Santa Barbara, and Sarita Y.
  • The Scam Filter That Fights Back Final Report
    The scam filter that fights back: Final Report, by Aurél Bánsági (4251342), Ruben Bes (4227492), Zilla Garama (4217675), Louis Gosschalk (4214528). Project Duration: April 18, 2016 – June 17, 2016. Project Owner: David Beijer, Peeeks. Project Coach: Geert-Jan Houben, TU Delft. Preface: This report concludes the project comprising the software development of a tool that makes replying to scam mails fast and easy. This project was part of the course TI3806 Bachelor Project in the fourth quarter of the academic year 2015-2016 and was commissioned by David Beijer, CEO of Peeeks. The aim of this report is to document the complete development process and present the results to the Bachelor Project Committee. During the past ten weeks, research was conducted and assessments were made to arrive at a design, which was then followed by an implementation. We would like to thank our coach, Geert-Jan Houben, for his advice and supervision during the project. June 29, 2016. Abstract: Filtering out scam mails and deleting them does not hurt a scammer in any way, but wasting their time by sending fake replies does. As the number of replies grows, the amount of time a scammer has to spend responding to e-mails increases. This bachelor project has created a system that recommends replies that can be sent to answer a scam mail. This makes it very easy for users to waste a scammer's time and destroy their business model. The project has a unique adaptation of the Elo rating system, a very popular algorithm used by systems ranging from chess to League of Legends.
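The report mentions an adaptation of the Elo rating system for ranking candidate replies. The standard Elo update rule underlying it (shown here as a generic sketch, not the project's specific adaptation) is:

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under Elo: a logistic curve
    in the rating difference, scaled so 400 points ~ 10x odds."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32):
    """Return both players' updated ratings after one match.
    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

In a reply-ranking setting, one plausible reading is that candidate replies "play" against each other whenever users pick one over another, so ratings converge toward the replies that users prefer; the report's exact adaptation may differ.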
  • Social Turing Tests: Crowdsourcing Sybil Detection
    Social Turing Tests: Crowdsourcing Sybil Detection. Gang Wang, Manish Mohanlal, Christo Wilson, Xiao Wang‡, Miriam Metzger†, Haitao Zheng and Ben Y. Zhao. Department of Computer Science, U. C. Santa Barbara, CA, USA; †Department of Communications, U. C. Santa Barbara, CA, USA; ‡Renren Inc., Beijing, China. Abstract: As popular tools for spreading spam and malware, Sybils (or fake accounts) pose a serious threat to online communities such as Online Social Networks (OSNs). Today, sophisticated attackers are creating realistic Sybils that effectively befriend legitimate users, rendering most automated Sybil detection techniques ineffective. In this paper, we explore the feasibility of a crowdsourced Sybil detection system for OSNs. We conduct a large user study on the ability of humans to detect today's Sybil accounts, using a large corpus of ground-truth Sybil accounts from the Facebook and Renren networks. We analyze detection accuracy by both "experts" and "turkers" under a variety of conditions, and … The research community has produced a substantial number of techniques for automated detection of Sybils [4, 32, 33]. However, with the exception of SybilRank [3], few have been successfully deployed. The majority of these techniques rely on the assumption that Sybil accounts have difficulty friending legitimate users, and thus tend to form their own communities, making them visible to community detection techniques applied to the social graph [29]. Unfortunately, the success of these detection schemes is likely to decrease over time as Sybils adopt more sophisticated strategies to ensnare legitimate users. First, early user studies on OSNs such as Facebook show that users are often careless about who they accept friendship requests from [2].
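Aggregating the judgments of "experts" and "turkers" into per-account verdicts can be sketched with a simple majority vote (an illustrative baseline with invented labels, not the paper's actual analysis, which examines richer conditions):

```python
from collections import Counter

def majority_vote(labels):
    """Aggregate independent crowd judgments into one verdict."""
    return Counter(labels).most_common(1)[0][0]

def detection_accuracy(judgments, ground_truth):
    """Fraction of accounts whose majority verdict matches ground truth.
    judgments: {account_id: [label, ...]}; ground_truth: {account_id: label}."""
    correct = sum(majority_vote(votes) == ground_truth[account]
                  for account, votes in judgments.items())
    return correct / len(ground_truth)
```

A key appeal of majority voting, consistent with the paper's premise, is that individual raters can be noisy while the aggregate remains accurate, provided errors are roughly independent.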
  • Semantic Approaches in Natural Language Processing
    Semantic Approaches in Natural Language Processing: Proceedings of the 10th Conference on Natural Language Processing (KONVENS 2010). Edited by Manfred Pinkal, Ines Rehbein, Sabine Schulte im Walde, Angelika Storrer. This book contains state-of-the-art contributions to the 10th conference on Natural Language Processing, KONVENS 2010 (Konferenz zur Verarbeitung natürlicher Sprache), with a focus on semantic processing. The KONVENS in general aims at offering a broad perspective on current research and developments within the interdisciplinary field of natural language processing. The central theme draws specific attention towards addressing linguistic aspects of meaning, covering deep as well as shallow approaches to semantic processing. The contributions address both knowledge-based and data-driven methods for modelling and acquiring semantic information, and discuss the role of semantic information in applications of language technology. The articles demonstrate the importance of semantic processing, and present novel and creative approaches to natural language processing in general. Some contributions put their focus on developing and improving NLP systems for tasks like Named Entity Recognition or Word Sense Disambiguation, or focus on semantic knowledge acquisition and exploitation with respect to collaboratively built resources, or harvesting semantic information in virtual games. Others are set within the context of real-world applications, such as Authoring Aids, Text Summarisation and Information Retrieval.
  • Read a Sample
    Intelligent Computing for Interactive System Design provides a comprehensive resource on what has become the dominant paradigm in designing novel interaction methods, involving gestures, speech, text, touch and brain-controlled interaction, embedded in innovative and emerging human-computer interfaces. These interfaces support ubiquitous interaction with applications and services running on smartphones, wearables, in-vehicle systems, virtual and augmented reality, robotic systems, the Internet of Things (IoT), and many other domains that are now highly competitive, both in commercial and in research contexts. This book presents the crucial theoretical foundations needed by any student, researcher, or practitioner working on novel interface design, with chapters on statistical methods, digital signal processing (DSP), and machine learning (ML). These foundations are followed by chapters that discuss case studies on smart cities, brain-computer interfaces, probabilistic mobile text entry, secure gestures, personal context from mobile phones, adaptive touch interfaces, and automotive user interfaces. The case studies chapters also highlight an in-depth look at the practical application of DSP and ML methods used for processing of touch, gesture, biometric, or embedded sensor inputs. A common theme throughout the case studies is ubiquitous support for humans in their daily professional or personal activities. In addition, the book provides walk-through examples of different DSP and ML techniques and their use in interactive systems. Common terms are defined, and information on practical resources is provided (e.g., software tools, data resources) for hands-on project work to develop and evaluate multimodal and multi-sensor systems. 
In a series of in-chapter commentary boxes, an expert on the legal and ethical issues explores the emergent deep concerns of the professional community, on how DSP and ML should be adopted and used in socially appropriate ways, to most effectively advance human performance during ubiquitous interaction with omnipresent computers.
  • Crowdsourcing Real-Time Viral Disease and Pest Information: A
    The Sixth AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2018). Crowdsourcing Real-Time Viral Disease and Pest Information: A Case of Nation-Wide Cassava Disease Surveillance in a Developing Country. Daniel Mutembesa,1 Christopher Omongo,2 Ernest Mwebaze1. 1Makerere University, P.O Box 7062, Kampala, Uganda. 2National Crops Resources Research Institute, P.O Box 7084, Kampala, Uganda. ([email protected], [email protected], [email protected]) Abstract: In most developing countries, a huge proportion of the national food basket is supported by small subsistence agricultural systems. A major challenge to these systems is disease and pest attacks, which have a devastating effect on the smallholder farmers that depend on these systems for their livelihoods. A key component of any proposed solution is a good disease surveillance network. However, current surveillance efforts are unable to provide sufficient data for monitoring such phenomena over a vast geographic area efficiently and effectively due to limited resources, both human and financial. Because they have to work within limited budgets, this survey may be delayed or fewer regions may be sampled in a particular year. These expert surveys, whilst providing an annual snapshot of the health of cassava across the country, are limited in their ability to provide real-time actionable surveillance data. In many other fields, citizen science has proven to be a good intervention to supplement long-term focused systems such as the surveillance system employed in Uganda (Silvertown 2009). Citizen science, and more specifically crowdsourcing, can be employed to supplement such a system by …
  • Reciprocal Enrichment Between Basque Wikipedia and Machine Translation
    Reciprocal Enrichment Between Basque Wikipedia and Machine Translation. Iñaki Alegria, Unai Cabezon, Unai Fernandez de Betoño, Gorka Labaka, Aingeru Mayor, Kepa Sarasola and Arkaitz Zubiaga. Abstract: In this chapter, we define a collaboration framework that enables Wikipedia editors to generate new articles while they help the development of Machine Translation (MT) systems by providing post-edition logs. This collaboration framework was tested with editors of Basque Wikipedia. Their post-editing of Computer Science articles has been used to improve the output of a Spanish to Basque MT system called Matxin. For the collaboration between editors and researchers, we selected a set of 100 articles from the Spanish Wikipedia. These articles would then be used as the source texts to be translated into Basque using the MT engine. A group of volunteers from Basque Wikipedia reviewed and corrected the raw MT translations. This collaboration ultimately produced two main benefits: (i) the change logs that would potentially help improve the MT engine by using an automated statistical post-editing system, and (ii) the growth of Basque Wikipedia. The results show that this process can improve the accuracy of a Rule-Based MT (RBMT) system by nearly 10%, benefiting from the post-edition of 50,000 words in the Computer Science … Iñaki Alegria, Ixa Group, University of the Basque Country UPV/EHU, e-mail: [email protected]. Unai Cabezon, Ixa Group, University of the Basque Country, e-mail: [email protected]. Unai Fernandez de Betoño, Basque Wikipedia and University of the Basque Country, e-mail: [email protected]. Gorka Labaka, Ixa Group, University of the Basque Country, e-mail: [email protected]. Aingeru Mayor, Ixa Group, University of the Basque Country, e-mail: [email protected]. Kepa Sarasola, Ixa Group, University of the Basque Country, e-mail: [email protected]. Arkaitz Zubiaga, Basque Wikipedia and Queens College, CUNY, CS Department, Blender Lab, New York, e-mail: [email protected].
  • Same Gap, Different Experiences
    SAME GAP, DIFFERENT EXPERIENCES: An Exploration of the Similarities and Differences Between the Gender Gap in the Indian Wikipedia Editor Community and the Gap in the General Editor Community. By Eva Jadine Lannon, 997233970. An Undergraduate Thesis Presented to Professor Leslie Chan, PhD. & Professor Kenneth MacDonald, PhD. in the Partial Fulfillment of the Requirements for the Degree of Honours Bachelor of Arts in International Development Studies (Co-op), University of Toronto at Scarborough, Ontario, Canada, June 2014. Acknowledgements: I would like to express my deepest gratitude to all of those people who provided support and guidance at various stages of my undergraduate research, and particularly to those individuals who took the time to talk me down off the ledge when I was certain I was going to quit. To my friends, and especially Jennifer Naidoo, who listened to my grievances at all hours of the day and night, no matter how spurious. To my partner, Karim Zidan El-Sayed, who edited every word (multiple times) with spectacular patience. And finally, to my research supervisors: Prof. Leslie Chan, you opened all the doors; Prof. Kenneth MacDonald, you laid the foundations. Without the support, patience, and understanding that both of you provided, it would have never been completed. Thank you. Executive Summary: According to the second official Wikipedia Editor Survey, conducted in December of 2011, female-identified editors comprise only 8.5% of Wikipedia's contributor population (Glott & Ghosh, 2010). This significant lack of women and women's voices in the Wikipedia community has led to systemic bias towards male histories and culturally "masculine" knowledge (Lam et al., 2011; Gardner, 2011; Reagle & Rhue, 2011; Wikimedia Meta-Wiki, 2013), and an editing environment that is often hostile and unwelcoming to women editors (Gardner, 2011; Lam et al., 2011; Wikimedia Meta-Wiki, 2013).
  • Analyzing Wikidata Transclusion on English Wikipedia
    Analyzing Wikidata Transclusion on English Wikipedia. Isaac Johnson, Wikimedia Foundation, [email protected]. Abstract: Wikidata is steadily becoming more central to Wikipedia, not just in maintaining interlanguage links, but in automated population of content within the articles themselves. It is not well understood, however, how widespread this transclusion of Wikidata content is within Wikipedia. This work presents a taxonomy of Wikidata transclusion from the perspective of its potential impact on readers and an associated in-depth analysis of Wikidata transclusion within English Wikipedia. It finds that Wikidata transclusion that impacts the content of Wikipedia articles happens at a much lower rate (5%) than previous statistics had suggested (61%). Recommendations are made for how to adjust current tracking mechanisms of Wikidata transclusion to better support metrics and patrollers in their evaluation of Wikidata transclusion. Keywords: Wikidata, Wikipedia, Patrolling. 1 Introduction: Wikidata is steadily becoming more central to Wikipedia, not just in maintaining interlanguage links, but in automated population of content within the articles themselves. This transclusion of Wikidata content within Wikipedia can help to reduce maintenance of certain facts and links by shifting the burden to maintain up-to-date, referenced material from each individual Wikipedia to a single repository, Wikidata. Current best estimates suggest that, as of August 2020, 62% of Wikipedia articles across all languages transclude Wikidata content. This statistic ranges from Arabic Wikipedia (arwiki) and Basque Wikipedia (euwiki), where nearly 100% of articles transclude Wikidata content in some form, to Japanese Wikipedia (jawiki) at 38% of articles and many small wikis that lack any Wikidata transclusion.
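A heuristic for spotting Wikidata transclusion in raw wikitext can be sketched as follows. The pattern list is an assumption-level subset of the mechanisms MediaWiki offers (parser-function calls such as `{{#property:...}}` and Lua module invocations), not the tracking method used in the paper, which discusses why such surface counts can mislead:

```python
import re

# Surface markers commonly associated with pulling Wikidata values
# into wikitext; assumed illustrative subset, not exhaustive.
WIKIDATA_PATTERNS = re.compile(
    r"\{\{#(property|statements|invoke:wikidata)",
    re.IGNORECASE,
)

def transcludes_wikidata(wikitext):
    """Heuristically flag wikitext that appears to transclude
    Wikidata content via parser functions or a Wikidata Lua module."""
    return bool(WIKIDATA_PATTERNS.search(wikitext))
```

Note that transclusion buried inside templates is invisible to this page-level scan, which is one reason the paper's taxonomy distinguishes transclusion that actually reaches readers from transclusion that is merely present in the dependency graph.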