BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding


BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding

Abhik Bhattacharjee¹, Tahmid Hasan¹, Kazi Samin¹, Md Saiful Islam², Anindya Iqbal¹, M. Sohel Rahman¹, Rifat Shahriyar¹
Bangladesh University of Engineering and Technology (BUET)¹, University of Rochester²
{abhik,samin}@ra.cse.buet.ac.bd, [email protected], [email protected], {anindya,msrahman,rifat}@cse.buet.ac.bd

arXiv:2101.00204v2 [cs.CL] 28 Aug 2021

Abstract

In this paper, we introduce "Embedding Barrier", a phenomenon that limits the monolingual performance of multilingual models on low-resource languages having unique typologies. We build 'BanglaBERT', a Bangla language model pretrained on 18.6 GB Internet-crawled data, and benchmark it on five standard NLU tasks. We discover a significant drop in the performance of the state-of-the-art multilingual model (XLM-R) from BanglaBERT and attribute this to the Embedding Barrier through comprehensive experiments. We identify that a multilingual model's performance on a low-resource language is hurt when its writing script is not similar to that of any of the high-resource languages. To tackle the barrier, we propose a straightforward solution by transcribing languages to a common script, which can effectively improve the performance of a multilingual model for the Bangla language. As a by-product of the standard NLU benchmarks, we introduce a new downstream dataset on natural language inference (NLI) and show that BanglaBERT outperforms previous state-of-the-art results on all tasks by up to 3.5%. We are making the BanglaBERT language model and the new Bangla NLI dataset publicly available in the hope of advancing the community. The resources can be found at https://github.com/csebuetnlp/banglabert.

[Figure 1 shows one example sentence in each of the three languages:
en: Joe went there today for the first time since 2010.
bn: জো ২০১০ সালের পর আজ প্রথমবারের মত সেখানে গেল।
sw: Joe alienda pale leo kwa mara ya kwanza tangu 2010.]

Figure 1: We present one sentence from each of the three languages we study: English (en), Bangla (bn), and Swahili (sw). Swahili and English have the same typology, allowing Swahili to use better (subword) tokenizations of common entities like names, numerals, etc. (colored blue) by readily borrowing them from English in a multilingual vocabulary. Some tokens (colored orange), despite having different meanings, may also be borrowed and later contextualized for the corresponding language. In contrast, Bangla, having a completely different script, does not enjoy this benefit.

1 Introduction

Self-supervised pretraining (Devlin et al., 2019; Radford et al., 2019; Liu et al., 2019) has become a standard practice in natural language processing. This pretraining stage allows language models to learn general linguistic representations (Jawahar et al., 2019) that can later be fine-tuned for different downstream natural language understanding (NLU) tasks (Wang et al., 2018; Rajpurkar et al., 2016; Tjong Kim Sang and De Meulder, 2003). However, the NLU field is yet to be democratized, as most notable works are restricted to only a few high-resource languages. For example, despite Bangla being the sixth most spoken language, there has not been any comprehensive study on Bangla NLU. With these shortcomings in mind, we build BanglaBERT – a Transformer-based (Vaswani et al., 2017) NLU model for Bangla, pretrained on 18.6 GB of data we crawled from 60 popular Bangla websites – introduce a new NLI dataset (a task previously unexplored in Bangla), and evaluate BanglaBERT on five downstream tasks. BanglaBERT outperforms multilingual baselines and previous state-of-the-art results on all tasks by up to 3.5%.¹

Although there have been efforts to pretrain multilingual NLU models (Pires et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020a) that can generalize to hundreds of languages, low-resource languages are not their primary focus, leaving them underrepresented. Moreover, these models often trail in performance compared to language-specific models (Canete et al., 2020; Antoun et al., 2020; Nguyen and Tuan Nguyen, 2020), resulting in a surge of interest in pretraining language-specific models in the NLP community. However, pretraining is highly expensive in terms of data and computational resources, rendering it very difficult for low-resource communities.

To systematically analyze the performance gap in the multilingual models, we use BanglaBERT as a pivot and compare it with the best-performing multilingual model, XLM-RoBERTa (XLM-R) (Conneau et al., 2020a), on the Bangla NLU tasks. We identify that even though XLM-R was pretrained with a similar objective, a comparable amount of text, and the same architecture as BanglaBERT, the vocabulary dedicated to the Bangla script in its embedding layer was less than 1% of the total vocabulary (less than a tenth of BanglaBERT's), which limited the representation ability of XLM-R for Bangla. Also, since Bangla does not share its vocabulary or writing script with any other high-resource language, the vocabulary is strictly limited within that bound. We name this phenomenon the 'Embedding Barrier'. Furthermore, we conduct a series of experiments to establish that the Embedding Barrier is indeed a key contributing factor to the performance degradation of XLM-R. Through our experiments, we show that:

1. There is a considerable loss in performance (up to 4% accuracy) because of the Embedding Barrier when vocabularies are limited.
2. Even a distant low-resource language having the same typology as a high-resource one may benefit from borrowing (subword) tokens (albeit having different meanings).
3. On the contrary, linguistically similar languages may be hindered from contributing to each other's monolingual performance because of being written in different scripts.
4. Sister languages with identical scripts can boost each other's monolingual performance even when they are low-resource.

In addition to addressing the embedding barrier, we propose a simple solution to overcome it – transcribing texts into a common script. Our proposed method alleviates the embedding barrier by increasing the effective vocabulary size beyond the native scripts. Although not a perfect solution, it establishes that having an uncommon typology indeed limits the performance of multilingual models. We call for new ways to represent vocabularies of low-resource languages in multilingual models that can better lift this barrier.

¹We are releasing the BanglaBERT model and the NLI dataset to advance the Bangla NLP community.

2 BanglaBERT

In this section, we describe the pretraining data, pre-processing steps, model architecture, objectives, and downstream task benchmarks of BanglaBERT.

2.1 Pretraining Data

A high volume of good-quality text data is a prerequisite for pretraining large language models. For instance, BERT (Devlin et al., 2019) is pretrained on the English Wikipedia and the Books corpus (Zhu et al., 2015), containing 3.3 billion tokens. Subsequent works like RoBERTa (Liu et al., 2019) and XLNet (Yang et al., 2019) used an order of magnitude more web-crawled data that had passed through heavy automatic filtering and cleaning. Bangla is a rather resource-constrained language in the web domain; for example, the Bangla Wikipedia dump from June 2020 is only 350 megabytes in size, two orders of magnitude smaller than the English Wikipedia. As a result, we had to crawl the web extensively to collect our pretraining data. We selected 60 Bangla websites by their Amazon Alexa website rankings² and volume of extractable text. The contents included encyclopedias, news, blogs, e-books, and story sites. The amount of collected data totaled around 30 GB.

There are other noisy data dumps publicly available, a prominent one being OSCAR (Suárez et al., 2019). However, the OSCAR dataset contained a lot of offensive and repetitive content, which we found too difficult to clean thoroughly, and hence opted not to use it.

²www.alexa.com/topsites/countries/BD

2.2 Pre-processing

We performed thorough deduplication on the pretraining data, removed non-textual content (e.g., HTML/JavaScript tags), and filtered out non-Bangla pages using a language classifier (Joulin et al., 2017). After processing, the dataset was reduced to 18.6 GB in size. A detailed description of the corpus can be found in the Appendix.

We trained a WordPiece (Wu et al., 2016) vocabulary of 32k tokens on the resulting corpus with a 400-character alphabet, kept larger than the native Bangla alphabet to capture code-switching (Poplack, 1980) and allow romanized Bangla content for better generalization. While creating a training sample, we limited the sequence length to 512 tokens and did not cross document boundaries (Liu et al., 2019).

2.3 Pretraining Objective and Model

Language models are pretrained with self-supervised objectives that can take advantage of unannotated text data. BERT (Devlin et al., 2019) was pretrained with masked language modeling (MLM) and next sentence prediction (NSP). […] p3.16xlarge instance on AWS. We used the Adam (Kingma and Ba, 2015) optimizer with a 2e-4 learning rate and a linear warmup of 10,000 steps.

2.4 Downstream Tasks and Results

Model/Baseline               | SC (Acc.) | EC (Weighted F-1) | DC (Acc.) | NER (Macro F-1) | NLI (Acc.)
Previous SoTA                | 85.67     | 69.73             | 72.50     | 57.44           | N/A
mBERT                        | 83.39     | 56.02             | 98.64     | 67.40           | 71.41
XLM-R                        | 89.49     | 66.70             | 98.71     | 70.63           | 76.76
monsoon-nlp/bangla-electra   | 73.54     | 34.55             | 97.64     | 52.57           | 64.43
sagorsarker/bangla-bert-base | 87.30     | 61.51             | 98.79     | 70.97           | 69.18
BanglaBERT (ours)            | 92.18     | 72.75             | 99.07     | 71.54           | 80.15

Table 1: Performance comparison on different downstream tasks.
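The token-borrowing effect illustrated in Figure 1 can be made concrete with a toy greedy longest-match (WordPiece-style) tokenizer. This is our own minimal sketch, not the paper's code, and the vocabulary below is hypothetical: a Latin-script vocabulary covers the shared numeral "2010" and the Swahili word "tangu", but the same numeral written in Bangla digits has no representation at all.

```python
def wordpiece_tokenize(word, vocab, unk="<unk>"):
    """Greedily match the longest known prefix; continuation pieces use '##'."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # WordPiece marks non-initial pieces
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece of the word is representable
        pieces.append(match)
        start = end
    return pieces

# Hypothetical shared (Latin-script) subword vocabulary
vocab = {"joe", "2010", "tangu", "tan", "##gu", "##10", "20"}
print(wordpiece_tokenize("2010", vocab))   # ['2010'] -- numeral borrowed as-is
print(wordpiece_tokenize("tangu", vocab))  # ['tangu']
print(wordpiece_tokenize("২০১০", vocab))   # ['<unk>'] -- Bangla digits unrepresented
```

The same entity ("2010") is a single shared token for English and Swahili but falls out of the vocabulary entirely when written in Bangla script, which is the barrier the paper describes.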
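The "less than 1% of the total vocabulary" observation about XLM-R can be checked in spirit by counting vocabulary entries that contain characters from the Bangla Unicode block (U+0980–U+09FF). A minimal sketch on a hypothetical toy vocabulary (not XLM-R's actual vocabulary file):

```python
def is_bangla_token(token: str) -> bool:
    """A token counts toward Bangla if any character is in the Bangla block."""
    return any('\u0980' <= ch <= '\u09ff' for ch in token)

def bangla_vocab_share(vocab) -> float:
    """Fraction of vocabulary entries containing Bangla-script characters."""
    if not vocab:
        return 0.0
    return sum(is_bangla_token(t) for t in vocab) / len(vocab)

# Toy vocabulary mixing English, Bangla, and shared continuation pieces
toy_vocab = ["the", "##ing", "২০১০", "বাংলা", "##লা", "2010", "Joe", "##oe"]
print(f"{bangla_vocab_share(toy_vocab):.2%}")  # → 37.50% (3 of 8 tokens)
```

Run against a real multilingual vocabulary file, the same counter quantifies how little embedding capacity a script receives.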
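The pre-processing of Section 2.2 (deduplication plus language filtering) could be sketched as below. These are assumed details, not the paper's pipeline: the paper uses a fastText-style classifier (Joulin et al., 2017), whereas this stand-in simply thresholds the share of Bangla-script characters, and deduplication here is exact paragraph hashing.

```python
import hashlib

def bangla_ratio(text: str) -> float:
    """Share of non-whitespace characters that are in the Bangla block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum('\u0980' <= ch <= '\u09ff' for ch in chars) / len(chars)

def clean_corpus(pages, min_bangla=0.5):
    """Keep mostly-Bangla paragraphs, dropping exact duplicates."""
    seen, kept = set(), []
    for page in pages:
        for para in page.split("\n"):
            para = para.strip()
            if not para or bangla_ratio(para) < min_bangla:
                continue  # empty or mostly non-Bangla: filtered out
            digest = hashlib.sha1(para.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # exact duplicate paragraph
            seen.add(digest)
            kept.append(para)
    return kept

pages = ["বাংলা ভাষা\nবাংলা ভাষা", "hello world\nবাংলা ভাষা"]
print(clean_corpus(pages))  # duplicates and the English line are dropped
```

A production pipeline would add near-duplicate detection (e.g., MinHash) and a trained classifier, but the control flow is the same.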
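The MLM objective named in Section 2.3 corrupts inputs with the standard 80/10/10 recipe of Devlin et al. (2019). A self-contained sketch of that recipe (our illustration, not BanglaBERT's training code):

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, rng=None):
    """Return (corrupted tokens, labels); labels are None where no loss applies."""
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # model must reconstruct the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append(mask_token)         # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep unchanged
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

rng = random.Random(0)
tokens = ["আমি", "বাংলা", "##য়", "গান", "গাই"] * 20
corrupted, labels = mask_tokens(tokens, vocab=tokens, rng=rng)
print(sum(l is not None for l in labels), "of", len(tokens), "positions masked")
```

On average about 15% of positions carry a reconstruction label; the 10% "keep unchanged" case forces the model to also represent uncorrupted tokens faithfully.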
Recommended publications
  • Universality, Similarity, and Translation in the Wikipedia Inter-Language Link Network
  • A Topic-Aligned Multilingual Corpus of Wikipedia Articles for Studying Information Asymmetry in Low Resource Languages
  • Modeling Communities in Information Systems: Informal Learning Communities in Social Media
  • Same Gap, Different Experiences
  • Critical Point of View: A Wikipedia Reader
  • The Case of 13 Wikipedia Instances
  • A Wikipedia Reader
  • Understanding Wikipedia Practices Through Hindi, Urdu, and English
  • Wikipedia @ 20
  • Inspire New Readers Campaign
  • Wikimania Is an Annual Global Event Devoted to Wikimedia Projects
  • Linguistic Neighbourhoods: Explaining Cultural Borders on Wikipedia Through Multilingual Co-Editing Activity