Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, Pages 1–10, Dublin, Ireland, August 23 2014
Recommended publications
Compilation of a Swiss German Dialect Corpus and Its Application to PoS Tagging
Nora Hollenstein and Noëmi Aepli, University of Zurich. Abstract: Swiss German is a dialect continuum whose dialects are very different from Standard German, the official language of the German part of Switzerland. However, dealing with Swiss German in natural language processing, usually the detour through Standard German is taken. As writing in Swiss German has become more and more popular in recent years, we would like to provide data to serve as a stepping stone to automatically process the dialects. We compiled NOAH’s Corpus of Swiss German Dialects consisting of various text genres, manually annotated with Part-of-Speech tags. Furthermore, we applied this corpus as training set to a statistical Part-of-Speech tagger and achieved an accuracy of 90.62%. 1 Introduction: Swiss German is not an official language of Switzerland, rather it includes dialects of Standard German, which is one of the four official languages. However, it is different from Standard German in terms of phonetics, lexicon, morphology and syntax. Swiss German is not dividable into a few dialects, in fact it is a dialect continuum with a huge variety. Swiss German is not only a spoken dialect but increasingly used in written form, especially in less formal text types. Often, Swiss German speakers write text messages, emails and blogs in Swiss German. However, in recent years it has become more and more popular and authors are publishing in their own dialect. Nonetheless, there is neither a writing standard nor an official orthography, which increases the variations dramatically due to the fact that people write as they please with their own style.
Librarians As Wikimedia Movement Organizers in Spain: an Interpretive Inquiry Exploring Activities and Motivations
Laurie Bridges (Oregon State University, https://orcid.org/0000-0002-2765-5440) and Clara Llebot (Oregon State University, https://orcid.org/0000-0003-3211-7396). Citation: Bridges, L., & Llebot, C. (2021). Librarians as Wikimedia Movement Organizers in Spain: An interpretive inquiry exploring activities and motivations. First Monday, 26(6/7). https://doi.org/10.5210/fm.v26i3.11482 Abstract: How do librarians in Spain engage with Wikipedia (and Wikidata, Wikisource, and other Wikipedia sister projects) as Wikimedia Movement Organizers? And, what motivates them to do so? This article reports on findings from 14 interviews with 18 librarians. The librarians interviewed were multilingual and contributed to Wikimedia projects in Castilian (commonly referred to as Spanish), Catalan, Basque, English, and other European languages. They reported planning and running Wikipedia events, developing partnerships with local Wikimedia chapters, motivating citizens to upload photos to Wikimedia Commons, identifying gaps in Wikipedia content and filling those gaps, transcribing historic documents and adding them to Wikisource, and contributing data to Wikidata. Most were motivated by their desire to preserve and promote regional languages and culture, and a commitment to open access and open education. Introduction: This research started with an informal conversation in 2018 about the popularity of Catalan Wikipedia, Viquipèdia, between two library coworkers in the United States, the authors of this article. Our conversation began with a sense of wonder about Catalan Wikipedia, which is ranked twentieth by number of articles, out of 300 different language Wikipedias (Meta contributors, 2020).
The Wikipedia Diversity Observatory a Project to Identify and Bridge Content Gaps in Wikipedia
Marc Miquel-Ribé (Universitat Pompeu Fabra, Barcelona, Catalonia) and David Laniado (Eurecat, Centre Tecnològic de Catalunya). Abstract: In this paper we present the Wikipedia Diversity Observatory, a project aimed to increase diversity within Wikipedia language editions. The project includes dashboards with visualizations and tools which show the gaps in terms of concepts not represented or not shared across languages. The dashboards are built on datasets generated for each of the more than 300 language editions, with features that label each article according to different categories relevant to overall content diversity. Through various examples, we show how the tools encourage and help editors to bridge the gaps in Wikipedia content. Finally, we discuss the project's impact on the communities and implications for the Wikimedia movement, in a moment in which covering diversity is considered strategic. 1 Introduction: Wikipedia is among the largest information repositories on the Internet that are both multilingual and created through collaborative effort. Its prime objective is to "give free access to the sum of all human knowledge" and, consequently, it exists in as many as 309 languages. Even though the language communities make the projects grow on a constant basis, the content does not represent the existing diversity in peoples, places, and cultures of the world; furthermore, there is a gap between language editions and articles often are not shared, or remain even unique to one language [1]. The creation of articles in Wikipedia language editions is spontaneous and non-directed. Several studies showed that cultural and geographical
Master Thesis
MASTER THESIS. Title: Cultural configuration of Wikipedia: measuring Autoreferentiality in different languages. Master degree: Master of Science in Telecommunication Engineering & Management. Author: Marc Miquel Ribé. Director: Horacio Rodríguez Hontoria. Tutor: Sebastià Sallent Ribes. Date: March 31, 2011. Abstract (Resum, translated from Catalan): "Wikipedia is a multilingual, collaborative, web-based, non-profit encyclopedia project supported by the Wikimedia Foundation": this is how Wikipedia describes itself in the definition of the article that bears its name. This means that the encyclopedia can be modified at any time, by anyone, from anywhere. These premises and its high level of participation make it an excellent social object of study which, being at the same time a technological artifact, also allows the use of natural language processing, data retrieval and data mining techniques. However, current research clearly lacks software that can approach it in an integral way. With this gap in mind, we characterize Wikipedia in order to understand in depth which information elements and structures it contains and how they can then be obtained with an analytical tool. We start from the existing API called wikAPIdia, which we extend with new functionality and prepare to tackle multiple scenarios and problems in the social sciences.
Semantic Approaches in Natural Language Processing
10th Conference on Natural Language Processing (KONVENS). Proceedings of the Conference on Natural Language Processing 2010. Edited by Manfred Pinkal, Ines Rehbein, Sabine Schulte im Walde and Angelika Storrer. This book contains state-of-the-art contributions to the 10th conference on Natural Language Processing, KONVENS 2010 (Konferenz zur Verarbeitung natürlicher Sprache), with a focus on semantic processing. The KONVENS in general aims at offering a broad perspective on current research and developments within the interdisciplinary field of natural language processing. The central theme draws specific attention towards addressing linguistic aspects of meaning, covering deep as well as shallow approaches to semantic processing. The contributions address both knowledge-based and data-driven methods for modelling and acquiring semantic information, and discuss the role of semantic information in applications of language technology. The articles demonstrate the importance of semantic processing, and present novel and creative approaches to natural language processing in general. Some contributions put their focus on developing and improving NLP systems for tasks like Named Entity Recognition or Word Sense Disambiguation, or focus on semantic knowledge acquisition and exploitation with respect to collaboratively built resources, or harvesting semantic information in virtual games. Others are set within the context of real-world applications, such as Authoring Aids, Text Summarisation and Information Retrieval.
Text Corpora and the Challenge of Newly Written Languages
Proceedings of the 1st Joint SLTU and CCURL Workshop (SLTU-CCURL 2020), pages 111–120, Language Resources and Evaluation Conference (LREC 2020), Marseille, 11–16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC. Alice Millour* and Karën Fort*† (*Sorbonne Université / STIH, 28, rue Serpente, 75006 Paris, France; †Université de Lorraine, CNRS, Inria, LORIA, 54000 Nancy, France). Abstract: Text corpora represent the foundation on which most natural language processing systems rely. However, for many languages, collecting or building a text corpus of a sufficient size still remains a complex issue, especially for corpora that are accessible and distributed under a clear license allowing modification (such as annotation) and further resharing. In this paper, we review the sources of text corpora usually called upon to fill the gap in low-resource contexts, and how crowdsourcing has been used to build linguistic resources. Then, we present our own experiments with crowdsourcing text corpora and an analysis of the obstacles we encountered. Although the results obtained in terms of participation are still unsatisfactory, we advocate that the effort towards a greater involvement of the speakers should be pursued, especially when the language of interest is newly written. Keywords: text corpora, dialectal variants, spelling, crowdsourcing. 1. Introduction: Speakers from various linguistic communities are increasingly writing their languages (Outinoff, 2012), and the In- [...] increasing number of speakers (see for instance, the studies on specific languages carried by Rivron (2012) or Soria et al.
Measuring Self-Focus Bias in Community-Maintained Knowledge Repositories
Brent Hecht† and Darren Gergle†‡ (†Dept. of Electrical Engineering and Computer Science, ‡Dept. of Communication Studies, Northwestern University). Abstract: Self-focus is a novel way of understanding a type of bias in community-maintained Web 2.0 graph structures. It goes beyond previous measures of topical coverage bias by encapsulating both node- and edge-hosted biases in a single holistic measure of an entire community-maintained graph. We outline two methods to quantify self-focus, one of which is very computationally inexpensive, and present empirical evidence for the existence of self-focus using a “hyperlingual” approach that examines 15 different language editions of Wikipedia. We suggest applications of our methods and discuss the risks of ignoring self-focus bias in technological applications. Categories and Subject Descriptors: H.5.3 [Information Systems]: Group and Organization Interfaces – collaborative computing, computer-supported cooperative work, theory and models. Introduction (excerpt): In this paper, we explore this question by introducing and measuring self-focus, a new way of understanding a type of bias in the graph structures that underlie many of these community-maintained repositories, including one of the largest: Wikipedia. We define self-focus bias as occurring when contributors to a knowledge repository encode information that is important and correct to them and a large proportion of contributors to the same repository, but not important and correct to contributors of similar repositories. Self-focus bias is similar to topical coverage biases [8, 11] in that it seeks to describe the semantic makeup of knowledge repositories. Topical coverage bias studies explicitly or implicitly compare the distribution of articles (or a similar measure) in particular semantic categories in Wikipedia to that of a more traditional knowledge repository, generally in an effort to show that Wikipedia describes in more detail semantic areas that are of
The Case of 13 Wikipedia Instances
Interaction Design and Architecture(s) Journal - IxD&A, N.22, 2014, pp. 34-47. The Impact of Culture On Smart Community Technology: The Case of 13 Wikipedia Instances. Zinayida Petrushyna (1), Ralf Klamma (1), Matthias Jarke (1, 2). (1) Advanced Community Information Systems Group, Information Systems and Databases Chair, RWTH Aachen University, Ahornstrasse 55, 52056 Aachen, Germany; (2) Fraunhofer Institute for Applied Information Technology FIT, 53754 St. Augustin, Germany. {petrushyna, klamma}@dbis.rwth-aachen.de Abstract: Smart communities provide technologies for monitoring social behaviors inside communities. The technologies that support knowledge building should consider the cultural background of community members. Studies of the influence of culture on knowledge building are limited. Just a few works consider digital traces of individuals that they explain using cultural values and beliefs. In this work, we analyze 13 Wikipedia instances where users with different cultural backgrounds build knowledge in different ways. We compare edits of users. Using social network analysis we build and analyze co-authorship networks and watch the networks' evolution. We explain the differences we have found using Hofstede dimensions and Schwartz cultural values and discuss implications for the design of smart community technologies. Our findings provide insights into requirements for technologies used for smart communities in different cultures. Keywords: Social network analysis, Wikipedia communities, Hofstede dimensions, Schwartz cultural values. 1 Introduction: People prefer to live in smart cities where their needs are satisfied [1]. The development of smart cities depends on the collaboration of individuals. The investigation of the flow [1] of knowledge created by the individuals allows the monitoring of city smartness.
World Languages Using Latin Script
Source: http://www.omniglot.com/writing/langalph.htm https://www.ethnologue.com/browse/names
Sort order: Language status, ISO 639-3
No. | Language name | ISO 639-3 | Classification | Population | Language status (EGIDS) | Language map | Comment
1. | Afrikaans | afr | Indo-European, Germanic, West, Low Saxon-Low Franconian, Low Franconian | 7,096,810 | 1 | Botswana, Lesotho, South Africa and Swaziland, Namibia |
2. | Azeri, Azerbaijani | azj | Turkic, Southern, Azerbaijani | 24,226,940 | 1 | Azerbaijan, Georgia, Iraq, Jordan and Syria |
3. | Czech, Bohemian, Cestina | ces | Indo-European, Balto-Slavic, Slavic, West, Czech-Slovak | 10,619,340 | 1 | Czech Republic |
4. | Chamorro, Chamorru, Tjamoro | cha | Austronesian, Malayo-Polynesian, Chamorro | 94,700 | 1 | Guam and Northern Mariana Islands |
5. | Seychelles Creole, Seselwa Creole, Creole, Ilois, Kreol, Kreol Seselwa, Seselwa, Seychelles Creole French, Seychellois Creole | crs | Creole, French based | 72,700 | 1 | Seychelles |
6. | Danish, Dansk, Rigsdansk | dan | Indo-European, Germanic, North, East Scandinavian, Danish-Swedish, Danish-Riksmal, Danish | 5,520,860 | 1 | Denmark, Finland, Norway and Sweden |
7. | German, Deutsch, Tedesco | deu | Indo-European, Germanic, West, High German, German, Middle German, East Middle German | 69,800,000 | 1 | Austria, Belgium, Luxembourg and Netherlands, Denmark, Finland, Norway and Sweden |
8. | Estonian, eesti, eesti keel | ekk | Uralic, Finnic | 1,132,500 | 1 | Estonia, Latvia and Lithuania |
9. | English | eng | Indo-European, Germanic, West, English | 341,000,000 | 1 | over 140 countries |
10. | Filipino | fil | Austronesian, Malayo-Polynesian, Philippine, Greater Central Philippine, Central Philippine, Tagalog | 45,000,000 | 1 | Philippines | L2 users population
11. | | | | | | Denmark, Finland, Norway |
DLDP Digital Language Survival Kit
The Digital Language Diversity Project. Digital Language Survival Kit: The DLDP Recommendations to Improve Digital Vitality. Imprint: The DLDP Digital Language Survival Kit. Authors: Klara Ceberio Berger, Antton Gurrutxaga Hernaiz, Paola Baroni, Davyth Hicks, Eleonore Kruse, Valeria Quochi, Irene Russo, Tuomo Salonen, Anneli Sarhimaa, Claudia Soria. This work has been carried out in the framework of The Digital Language Diversity Project (www.dldp.eu), funded by the European Union under the Erasmus+ Programme (Grant Agreement no. 2015-1-IT02-KA204-015090). © 2018. This work is licensed under a Creative Commons Attribution 4.0 International License. Cover design: Eleonore Kruse. Disclaimer: This publication reflects only the authors’ view and the Erasmus+ National Agency and the Commission are not responsible for any use that may be made of the information it contains. www.dldp.eu www.facebook.com/digitallanguagediversity www.twitter.com/dldproject Recommendations at a Glance. Digital Capacity Recommendations (indicator, level, recommendation). Digital Literacy, levels 2,3: Increasing digital literacy among your native language-speaking community; levels 2,3: Promote the upskilling of language mentors, activists or disseminators; levels 2,3: Establish initiatives to inform and educate speakers about how to acquire and use particular communication and content creation skills; level 2: Teaching digital literacy to children in your language community through
Parsing Approaches for Swiss German
Master’s thesis presented to the Faculty of Arts and Social Sciences of the University of Zurich for the degree of Master of Arts UZH. Author: Noëmi Aepli. Student ID Nr.: 10-719-250. Examiner: Prof. Dr. Martin Volk. Supervisor: Dr. Simon Clematide. Institute of Computational Linguistics. Submission date: 10.01.2018. Abstract: This thesis presents work on universal dependency parsing for Swiss German. Natural language parsing describes the process of syntactic analysis in Natural Language Processing (NLP) and is a key area as many applications rely on its output. Building a statistical parser requires expensive resources which are only available for a few dozen languages. Hence, for the majority of the world’s languages, other ways have to be found to circumvent the low-resource problem. Triggered by such scenarios, research on different approaches to cross-lingual learning is going on. These methods seem promising for closely related languages and hence especially for dialects and varieties. Swiss German is a dialect continuum of the Alemannic dialect group. It comprises numerous varieties used in the German-speaking part of Switzerland. Although mainly oral varieties (Mundarten), they are frequently used in written communication. On the basis of their high acceptance in the Swiss culture and with the introduction of digital communication, Swiss German has undergone a spread over all kinds of communication forms and social media. Considering the lack of standard spelling rules, this leads to a huge linguistic variability because people write the way they speak.
Gebiotoolkit: Automatic Extraction of Gender-Balanced Multilingual Corpus of Wikipedia Biographies
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4081–4088, Marseille, 11–16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC. Marta R. Costa-jussà, Pau Li Lin, Cristina España-Bonet* (TALP Research Center, Universitat Politècnica de Catalunya, Barcelona; *DFKI GmBH and Saarland University, Saarbrücken). Abstract: We introduce GeBioToolkit, a tool for extracting multilingual parallel corpora at sentence level, with document and gender information from Wikipedia biographies. Despite the gender inequalities present in Wikipedia, the toolkit has been designed to extract corpus balanced in gender. While our toolkit is customizable to any number of languages (and to other domains than biographical entries), in this work we present a corpus of 2,000 sentences in English, Spanish and Catalan, which has been post-edited by native speakers to become a high-quality dataset for machine translation evaluation. While GeBioCorpus aims at being one of the first non-synthetic gender-balanced test datasets, GeBioToolkit aims at paving the path to standardize procedures to produce gender-balanced datasets. Keywords: corpora, gender bias, Wikipedia, machine translation. 1. Introduction: Gender biases are present in many natural language processing applications (Costa-jussà, 2019). This comes as an undesired characteristic of deep learning architectures [...] by volunteers. The toolkit is customizable for languages and gender-balance. We take advantage of Wikipedia multilinguality to extract a corpus of biographies, being each biography a document available in all the selected languages.