A Wikipedia-Based Corpus for Contextualized Machine Translation


Jennifer Drexler*, Pushpendre Rastogi†, Jacqueline Aguilar‡, Benjamin Van Durme‡, Matt Post‡
*Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
†Center for Language and Speech Processing, Johns Hopkins University
‡Human Language Technology Center of Excellence, Johns Hopkins University
[email protected], {pushpendre, jacqui.aguilar, vandurme, post}@[email protected]

Abstract

We describe a corpus for and experiments in target-contextualized machine translation (MT), in which we incorporate language models from target-language documents that are comparable in nature to the source documents. The corpus comprises (i) a set of curated English Wikipedia articles describing news events, along with (ii) their comparable Spanish counterparts, (iii) a number of the Spanish source articles cited within them, and (iv) English reference translations of all the Spanish data. In experiments, we evaluate the effect on translation quality of including language models built over these English documents and interpolated with other, separately derived, more general language model sources. We find that even under this simplistic baseline approach, we achieve significant improvements as measured by BLEU score.

Keywords: Machine Translation, Domain Adaptation, Corpus

1. Introduction

We describe a corpus for target-contextualized machine translation (MT). Here, the task is to improve the translation of source documents using language models built over presumably related, comparable documents in the target language. As a motivating example, this could be useful in a situation where there is a collection of in-language documents related to a topic, but a specific document of interest is available only in another language. Our corpus comprises (i) a set of curated English Wikipedia articles describing news events, along with (ii) their Spanish counterparts, (iii) a number of the Spanish source articles cited within them, and (iv) human-produced reference translations of all the Spanish documents. In experiments, we translated these Spanish documents using a general translation system built on out-of-domain data. We then built multiple "in-domain" language models on the comparable English documents, one for each pair of documents, and interpolated them with the larger model, evaluating the effect on translation quality and effectively casting this task as one of (hyper) domain adaptation for MT. We find that even under this simplistic baseline approach, we achieve significant improvements as measured by BLEU score (Papineni et al., 2002).

This work relates to previous efforts in domain adaptation for machine translation (Eidelman et al., 2012; Langlais et al., 2000; Langlais and Lapalme, 2002; Brousseau et al., 1995; Dymetman et al., 1994). Many of these previous problem formulations were concerned with recognizing the general domain, or topic, of the material to be translated. In comparison, we are concerned here with adapting the translation via access to information related to the exact events being described, presented in the form of a language model. We hope this might help lead to richer approaches to MT, where coreference and semantics might be measurably brought to bear.

Figure 1: The data configuration. We collected comparable English and Spanish documents along with sources cited within the Spanish articles. Reference translations were then collected for all Spanish documents; in the experiments, we use both the original (comparable) English articles (§3.3) and the translated articles (§3.1) to build a language model to aid translation of each set of corresponding source documents.
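Concretely, each topic in the corpus bundles several document types, as pictured in Figure 1. The sketch below shows one possible in-memory representation of a topic record; the class and field names are illustrative only and are not part of the released data.

```python
from dataclasses import dataclass, field

@dataclass
class TopicRecord:
    """One corpus topic, mirroring the configuration in Figure 1.
    All names here are illustrative, not the corpus's actual schema."""
    topic: str
    english_article: list[str]        # comparable English Wikipedia article (sentences)
    spanish_article: list[str]        # comparable Spanish Wikipedia article (sentences)
    spanish_sources: list[list[str]]  # sources cited within the Spanish article
    # human English reference translations of the Spanish article and each source
    reference_translations: dict[str, list[str]] = field(default_factory=dict)

# Hypothetical example; the sentence counts are those reported in Table 1.
record = TopicRecord(
    topic="Chilean mining accident",
    english_article=["..."],    # 474 sentences in the actual data
    spanish_article=["..."],    # 140 sentences
    spanish_sources=[["..."]],  # 144 sentences across all cited sources
)
```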
2. Dataset

Our dataset consists of 9 English articles from Wikipedia on events of interest to the Spanish-speaking portion of the world. We manually extracted and sentence-segmented each article, along with the Spanish Wikipedia article on the same topic.¹ It is important to note that these are not parallel documents; they are instead at best "comparable corpora", existing in the form of independently written articles on the same topic. Many such articles on Wikipedia are in fact written by authors working independently in their native languages, each covering the same topic from their own perspective, in contrast to simply filling out the inter-language resources of the encyclopedia by translating documents from high- to low-resource languages. Our dataset leverages this fact, providing a relatively small corpus with high information content on the Spanish side, and comparable but impoverished versions on the target (English) side. This is pictured in Figure 1.

In addition to the comparable document pairs, we reached into the Spanish documents and collected many of the sources cited within them. Many of these citations linked outside of Wikipedia, often to news websites. These sources were then also manually extracted and sentence-segmented. In total, we collected 695 sentences from these sources; Table 1 contains a list of the Wikipedia articles along with sentence-level extraction statistics.

  Topic                              Articles       Sources
                                     Eng.    Spa.   Spa.
  Bolivia gas war                      236    114     21
  Catalan autonomy protest              31     45     87
  Chavez death                         122     52     49
  Chilean mining accident              474    140    144
  Comayagua prison fire                 53     11     50
  H1N1 Chile                           151     92     66
  Papal election                       147     70     82
  Pemex explosion                       29     82    117
  Venezuelan presidential election     110     44     79
  Total                              1,343    650    695

Table 1: The topics and sentence counts of the Wikipedia articles used as the basis for data collection. For each topic, we have the comparable Wikipedia articles (both English and Spanish) and the sources cited within the Spanish articles.

All Spanish data was then manually translated by a single bilingual speaker to produce reference translations.

¹ Documents were taken from Wikipedia on July 2, 2013.

3. Experiments

We used the Joshua Machine Translation Toolkit² (Post et al., 2013). We include both baseline results from translating the Spanish source articles and results from a method of adapting the baseline machine translation system to the domain of a specific document. The baseline models were trained on the Spanish–English portion of version 7 of the Europarl corpus (Koehn, 2005). From this, we extracted Hiero grammars (Chiang, 2005) using the standard extraction settings (Chiang, 2007). We then built a 5-gram Kneser-Ney-smoothed language model from the English side of this data using KenLM (Heafield, 2011). The parameters of the decoder's linear model were tuned with Z-MERT (Och, 2003; Zaidan, 2009), which is included in the Joshua toolkit.

We describe three experiments. In the first, we use the reference translations of the articles as our set of target knowledge. Since they were produced from the Spanish versions, these reference translations can be thought of as parallel Wikipedia documents, in contrast to the comparable ones that we extracted. In the second, the English Wikipedia articles serve this role. We explore a range of interpolation weights in order to determine whether any setting results in an improvement. The purpose of these two experiments is to determine whether in-domain language models built even on very small, targeted pieces of data might be useful in improving machine translation quality: first, in a parallel setting, and second, in a comparable setting. Finally, in a third set of experiments, we test whether these ideas generalize when we have very small amounts of data, using a proxy setting to determine the interpolation weights.

² joshua-decoder.org

3.1. Experiment 1: Can the contextual data help (parallel setting)?

The contextualized target-side language models are built on very small amounts of data, numbering at most a few hundred sentences (Table 1). This made them too unreliable to use as separate language models during decoding (with their own weight assigned in the decoder's linear model). Instead, our approach was to interpolate the large, general language model trained on millions of sentences (Europarl) with the domain-specific one, as sketched below.
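The paper does not name its interpolation tooling; the following is a minimal query-time sketch of the idea, assuming two pre-built ARPA-format models (the file paths below are hypothetical) loaded with the kenlm Python bindings, with λ weighting the large general-purpose model as in Table 2.

```python
import math
import kenlm

# Hypothetical paths: a large general-domain LM (e.g. trained on Europarl)
# and a tiny in-domain LM built on the comparable English document.
GENERAL_LM = "europarl.en.5gram.arpa"
DOMAIN_LM = "wikipedia_topic.en.arpa"

general = kenlm.Model(GENERAL_LM)
domain = kenlm.Model(DOMAIN_LM)

def interpolated_logprob(sentence: str, lam: float) -> float:
    """Log10 probability of `sentence` under the word-level linear mixture
    lam * p_general + (1 - lam) * p_domain, where lam is the weight on the
    large general-purpose LM (the lambda of Table 2)."""
    total = 0.0
    # full_scores yields one (log10 prob, ngram length, oov flag) per token,
    # including end-of-sentence; both models tokenize on whitespace, so the
    # two iterators stay aligned.
    for (g, _, _), (d, _, _) in zip(general.full_scores(sentence),
                                    domain.full_scores(sentence)):
        total += math.log10(lam * 10.0 ** g + (1.0 - lam) * 10.0 ** d)
    return total

# The grid explored in the paper; lam = 1.0 is the general-LM-only baseline.
for lam in (1.0, 0.7, 0.3, 0.0):
    print(lam, interpolated_logprob("the miners were rescued after 69 days", lam))
```

In the experiments themselves the interpolated model replaces the decoder's single LM; scoring at query time as above is shown only to make the mixture explicit.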
For this first set of experiments, instead of using the (comparable) English Wikipedia documents to build the contextualized language model, we looked at what should be an easier setting: using the translations of the articles to build a language model when translating the sources, and vice versa. Table 2 presents results for four settings of the interpolation weight λ, which determines how much weight is given to the large LM. With a weight of λ = 0.7, we see improvements in BLEU score from including the in-domain LM. To be clear, when translating the articles, the contextualized LM is built on the sources, and when translating the sources, the LM is built on the articles. Note that the setting λ = 1.0 is the baseline setting, with all the weight assigned to the large, out-of-domain language model.

  λ           1.0     0.7     0.3     0.0
  Articles    28.97   31.45   31.27   23.69
  Sources     26.76   27.89   27.50   18.01

Table 2: BLEU scores (averaged across each article or group of source articles) from changing the interpolation weight λ, which denotes the weight assigned to the large, general-purpose language model. "Articles" denotes the average score from translating the Spanish articles, while "Sources" denotes the source documents cited within those articles.

3.2. Experiment 2: Can contextualized LMs help (comparable setting)?

For the second set of experiments, we use the comparable English Wikipedia articles to build the in-domain language model. Table 3 contains the results. Note that the scores are all slightly lower than in the parallel setting, which makes sense, since the comparable data are not certain to be as close. Importantly, however, the best setting […]

  λ          1.0     0.7     0.3     0.0
  Sources    26.76   27.32   26.30   16.18

Table 3: BLEU scores from translating the source documents, with the in-domain LM built on the comparable English Wikipedia articles.

[…] and enables research in this specific domain-adaptation scenario, and the results suggest that further research in this area could be productive.
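For reference, document-level BLEU scores of the kind reported in Tables 2 and 3 can be recomputed with any standard implementation. The sketch below uses the sacrebleu package, a later tool than whatever scorer the paper used (an assumption here), on hypothetical hypothesis/reference pairs for one article.

```python
import sacrebleu

# Hypothetical decoder outputs and human reference translations for one
# article; in the experiments, scores are averaged across each article or
# group of cited source documents (Table 2).
hypotheses = [
    "the gas conflict began in bolivia in 2003 .",
    "protesters blocked the roads to la paz .",
]
references = [
    "the gas war started in bolivia in 2003 .",
    "demonstrators blocked the highways into la paz .",
]

# sacrebleu expects one inner list per reference set; here there is a
# single human reference per segment.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```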