Are All Good Word Vector Spaces Isomorphic?

Ivan Vulić 1∗, Sebastian Ruder 2∗, Anders Søgaard 3,4∗
1 Language Technology Lab, University of Cambridge
2 DeepMind
3 Department of Computer Science, University of Copenhagen
4 Google Research, Berlin
∗ All authors contributed equally to this work.

arXiv:2004.04070v2 [cs.CL] 20 Oct 2020

Abstract

Existing algorithms for aligning cross-lingual word vector spaces assume that vector spaces are approximately isomorphic. As a result, they perform poorly or fail completely on non-isomorphic spaces. Such non-isomorphism has been hypothesised to result from typological differences between languages. In this work, we ask whether non-isomorphism is also crucially a sign of degenerate word vector spaces. We present a series of experiments across diverse languages which show that variance in performance across language pairs is not only due to typological differences, but can mostly be attributed to the size of the monolingual resources available, and to the properties and duration of monolingual training (e.g. “under-training”).

[Figure 1 (scatter plot; x-axis: # of tokens in Wikipedia, log scale; y-axis: BLI scores (Accuracy)): Performance of a state-of-the-art BLI model mapping from English to a target language and the size of the target language Wikipedia are correlated. Linear fit shown as a blue line (log scale).]

1 Introduction

Word embeddings have been argued to reflect how language users organise concepts (Mandera et al., 2017; Torabi Asr et al., 2018). The extent to which they really do so has been evaluated, e.g., using semantic word similarity and association norms (Hill et al., 2015; Gerz et al., 2016), and word analogy benchmarks (Mikolov et al., 2013c). If word embeddings reflect more or less language-independent conceptual organisations, word embeddings in different languages can be expected to be near-isomorphic. Researchers have exploited this to learn linear transformations between such spaces (Mikolov et al., 2013a; Glavaš et al., 2019), which have been used to induce bilingual dictionaries, as well as to facilitate multilingual modeling and cross-lingual transfer (Ruder et al., 2019).

In this paper, we show that near-isomorphism arises only with sufficient amounts of training. This is of practical interest for applications of linear alignment methods for cross-lingual word embeddings. It furthermore provides us with an explanation for reported failures to align word vector spaces in different languages (Søgaard et al., 2018; Artetxe et al., 2018a), which has so far been largely attributed only to inherent typological differences.

In fact, the amount of data used to induce the monolingual embeddings is predictive of the quality of the aligned cross-lingual word embeddings, as evaluated on bilingual lexicon induction (BLI). Consider, for motivation, Figure 1; it shows the performance of a state-of-the-art alignment method, RCSLS with iterative normalisation (Zhang et al., 2019), on mapping English embeddings onto embeddings in other languages, and its correlation (ρ = 0.72) with the size of the tokenised target language Polyglot Wikipedia (Al-Rfou et al., 2013).

We investigate to what extent the amount of data available for some languages and corresponding training conditions provide a sufficient explanation for the variance in reported results; that is, whether it is the full story or not: The answer is 'almost', that is, its interplay with inherent typological differences does have a crucial impact on the 'alignability' of monolingual vector spaces.

We first discuss current standard methods of quantifying the degree of near-isomorphism between word vector spaces (§2.1). We then outline training settings that may influence isomorphism (§2.2) and present a novel experimental protocol for learning cross-lingual word embeddings that simulates a low-resource environment, and also controls for topical skew and differences in morphological complexity (§3). We focus on two groups of languages: 1) Spanish, Basque, Galician, and Quechua, and 2) Bengali, Tamil, and Urdu, as these are arguably spoken in culturally related regions, but have very different morphology. Our experiments, among other findings, indicate that a low-resource version of Spanish is as difficult to align to English as Quechua, challenging the assumption from prior work that the primary issue to resolve in cross-lingual word embedding learning is language dissimilarity (instead of, e.g., procuring additional raw data for embedding training). We also show that by controlling for different factors, we reduce the gap between aligning Spanish and Basque to English from 0.291 to 0.129. Similarly, under these controlled circumstances, we do not observe any substantial performance difference between aligning Spanish and Galician to English, or between aligning Bengali and Tamil to English.

We also investigate the learning dynamics of monolingual word embeddings and their impact on BLI performance and near-isomorphism of the resulting word vector spaces (§4), finding training duration, amount of monolingual resources, preprocessing, and self-learning all to have a large impact. The findings are verified across a set of typologically diverse languages, where we pair English with Spanish, Arabic, and Japanese.

We will release our new evaluation dictionaries and subsampled Wikipedias controlling for topical skew and morphological differences to facilitate future research at: https://github.com/cambridgeltl/iso-study.

2 Isomorphism of Vector Spaces

Studies analyzing the qualities of monolingual word vector spaces have focused on intrinsic tasks (Baroni et al., 2014), correlations (Tsvetkov et al., 2015), and subspaces (Yaghoobzadeh and Schütze, 2016). In the cross-lingual setting, the most important indicator for performance has been the degree of isomorphism, that is, how (topologically) similar the structures of the two vector spaces are.

Mapping-based approaches. The prevalent way to learn a cross-lingual embedding space, especially in low-data regimes, is to learn a mapping between a source and a target embedding space (Mikolov et al., 2013a). Such mapping-based approaches assume that the monolingual embedding spaces are isomorphic, i.e., that one can be transformed into the other via a linear transformation (Xing et al., 2015; Artetxe et al., 2018a). Recent unsupervised approaches rely even more strongly on this assumption: They assume that the structures of the embedding spaces are so similar that they can be aligned by minimising the distance between the transformed source language and the target language embedding space (Zhang et al., 2017; Conneau et al., 2018; Xu et al., 2018; Alvarez-Melis and Jaakkola, 2018; Hartmann et al., 2019).

2.1 Quantifying Isomorphism

We employ measures that quantify isomorphism in three distinct ways: based on graphs, metric spaces, and vector similarity.

Eigenvector similarity (Søgaard et al., 2018). Eigenvector similarity (EVS) estimates the degree of isomorphism based on properties of the nearest neighbour graphs of the two embedding spaces. We first length-normalise embeddings in both embedding spaces and compute the nearest neighbour graphs on a subset of the top most frequent N words. We then calculate the Laplacian matrices $L_1$ and $L_2$ of each graph. For $L_1$, we find the smallest $k_1$ such that the sum of its $k_1$ largest eigenvalues $\sum_{i=1}^{k_1} \lambda_{1i}$ is at least 90% of the sum of all its eigenvalues. We proceed analogously for $k_2$ and set $k = \min(k_1, k_2)$. The eigenvector similarity metric $\Delta$ is now the sum of the squared differences of the $k$ largest Laplacian eigenvalues:

$$\Delta = \sum_{i=1}^{k} (\lambda_{1i} - \lambda_{2i})^2$$

The lower $\Delta$, the more similar are the graphs and the more isomorphic are the embedding spaces.

Gromov-Hausdorff distance (Patra et al., 2019). The Hausdorff distance is a measure of the worst case distance between two metric spaces $\mathcal{X}$ and $\mathcal{Y}$ with a distance function $d$:

$$\mathcal{H}(\mathcal{X}, \mathcal{Y}) = \max\Big\{ \sup_{x \in \mathcal{X}} \inf_{y \in \mathcal{Y}} d(x, y),\; \sup_{y \in \mathcal{Y}} \inf_{x \in \mathcal{X}} d(x, y) \Big\}$$

Intuitively, it measures the distance between the nearest neighbours that are farthest apart. The Gromov-Hausdorff distance (GH) in turn minimises this distance over all isometric transforms (orthogonal transforms in our case as we apply mean centering) of $\mathcal{X}$ and $\mathcal{Y}$ as follows:

$$\mathcal{GH}(\mathcal{X}, \mathcal{Y}) = \inf_{f, g} \mathcal{H}(f(\mathcal{X}), g(\mathcal{Y}))$$

In practice, GH is calculated by computing the Bottleneck distance between the metric spaces (Chazal et al., 2009; Patra et al., 2019).

Relational similarity. As an alternative, we consider a simpler measure inspired by Zhang et al. (2019). This measure, dubbed RSIM, is based on the intuition that the similarity distributions of translations within each language should be similar.

Corpus size. It has become standard to align monolingual word embeddings trained on Wikipedia (Glavaš et al., 2019; Zhang et al., 2019). As can be seen in Figure 1, and also in Table 1, Wikipedias of low-resource languages are more than a magnitude smaller than Wikipedias of high-resource languages. Corpus size has been shown to play a role in the performance of monolingual embeddings (Sahlgren and Lenci, 2016), but it is unclear how it influences their structure and isomorphism.

Training duration. As it is generally too expensive to tune hyper-parameters separately for each language, monolingual embeddings are typically trained for the same number of epochs in large-scale studies. As a result, word embeddings of low-resource languages may be “under-trained”.