Text Corpora and the Challenge of Newly Written Languages Alice Millour, Karën Fort
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
ALEMANA GERMAN, ALEMÁN, ALLEMAND Language
ALEMANA GERMAN, ALEMÁN, ALLEMAND Language family: Indo-European, Germanic, West, High German, German, Middle German, East Middle German. Language codes: ISO 639-1 de ISO 639-2 ger (ISO 639-2/B) deu (ISO 639-2/T) ISO 639-3 Variously: deu – Standard German gmh – Middle High german goh – Old High German gct – Aleman Coloniero bar – Austro-Bavarian cim – Cimbrian geh – Hutterite German kksh – Kölsch nds – Low German sli – Lower Silesian ltz – Luxembourgish vmf – Main-Franconian mhn – Mócheno pfl – Palatinate German pdc – Pennsylvania German pdt – Plautdietsch swg – Swabian German gsw – Swiss German uln – Unserdeutssch sxu – Upper Saxon wae – Walser German wep – Westphalian Glotolog: high1287. Linguasphere: [show] Beste izen batzuk (autoglotonimoa: Deutsch). deutsch alt german, standard [GER]. german, standard [GER] hizk. Alemania; baita AEB, Arabiar Emirerri Batuak, Argentina, Australia, Austria, Belgika, Bolivia, Bosnia-Herzegovina, Brasil, Danimarka, Ekuador, Errumania, Errusia (Europa), Eslovakia, Eslovenia, Estonia, Filipinak, Finlandia, Frantzia, Hegoafrika, Hungaria, Italia, Kanada, Kazakhstan, Kirgizistan, Liechtenstein, Luxenburgo, Moldavia, Namibia, Paraguai, Polonia, Puerto Rico, Suitza, Tajikistan, Uzbekistan, Txekiar Errepublika, Txile, Ukraina eta Uruguain ere. Dialektoa: erzgebirgisch. Hizkuntza eskualde erlazionatuenak dira Bavarian, Schwäbisch, Allemannisch, Mainfränkisch, Hessisch, Palatinian, Rheinfränkisch, Westfälisch, Saxonian, Thuringian, Brandenburgisch eta Low saxon. Aldaera asko ez dira ulerkorrak beren artean. high -
Compilation of a Swiss German Dialect Corpus and Its Application to Pos Tagging
Compilation of a Swiss German Dialect Corpus and its Application to PoS Tagging Nora Hollenstein Noemi¨ Aepli University of Zurich University of Zurich [email protected] [email protected] Abstract Swiss German is a dialect continuum whose dialects are very different from Standard German, the official language of the German part of Switzerland. However, dealing with Swiss German in natural language processing, usually the detour through Standard German is taken. As writing in Swiss German has become more and more popular in recent years, we would like to provide data to serve as a stepping stone to automatically process the dialects. We compiled NOAH’s Corpus of Swiss German Dialects consisting of various text genres, manually annotated with Part-of- Speech tags. Furthermore, we applied this corpus as training set to a statistical Part-of-Speech tagger and achieved an accuracy of 90.62%. 1 Introduction Swiss German is not an official language of Switzerland, rather it includes dialects of Standard German, which is one of the four official languages. However, it is different from Standard German in terms of phonetics, lexicon, morphology and syntax. Swiss German is not dividable into a few dialects, in fact it is a dialect continuum with a huge variety. Swiss German is not only a spoken dialect but increasingly used in written form, especially in less formal text types. Often, Swiss German speakers write text messages, emails and blogs in Swiss German. However, in recent years it has become more and more popular and authors are publishing in their own dialect. Nonetheless, there is neither a writing standard nor an official orthography, which increases the variations dramatically due to the fact that people write as they please with their own style. -
Tonal Categories in Prosodic Annotation of Dialectal Variation
Tonal categories in prosodic annotation of dialectal variation Frank Kügler University of Cologne / Goethe University of Frankfurt The aim of this talk is to discuss prosodic categories from the point of view of phonological and pre-phonological analysis. The pre-phonological analysis refers to the workshop’s theme of labelling surface intonation contours. I will illustrate the discussion by the annotation of similar intonation contours of two different German dialects. The approach to discover tonal categories in a language variety or in a dialect is similar to the approach to develop the tonal inventory of a hitherto unknown or undescribed language. Different ways of establishing tonal categories have been proposed and have successfully been shown to pin down the phonological oppositions of a tonal grammar: (i) contrasting prosodically minimal pairs which for instance was applied for German dialects [1, 2], (ii) going from phonetic characteristics of F0-contours to tonal categories [1, 3–6], (iii) using perceptual tests to express the functional load of a prosodic or tonal category [7, 8]. The variation in F0-contours between different dialects is hard to grasp unless a phonological analysis contrasts relevant contours. On a pre- phonological level similar contours may misleadingly be described by one and the same tonal label (or category, depending on the level of description). This talk will discuss data from two German dialects (Swabian and Upper Saxon German) analysed in [1]. A rising-falling contour serves for illustration. The rise-fall in Swabian German is phonologically analysed as a rising pitch accent L*H followed by a low tone L, which represents a tonal affix (cf. -
Modeling Popularity and Reliability of Sources in Multilingual Wikipedia
information Article Modeling Popularity and Reliability of Sources in Multilingual Wikipedia Włodzimierz Lewoniewski * , Krzysztof W˛ecel and Witold Abramowicz Department of Information Systems, Pozna´nUniversity of Economics and Business, 61-875 Pozna´n,Poland; [email protected] (K.W.); [email protected] (W.A.) * Correspondence: [email protected] Received: 31 March 2020; Accepted: 7 May 2020; Published: 13 May 2020 Abstract: One of the most important factors impacting quality of content in Wikipedia is presence of reliable sources. By following references, readers can verify facts or find more details about described topic. A Wikipedia article can be edited independently in any of over 300 languages, even by anonymous users, therefore information about the same topic may be inconsistent. This also applies to use of references in different language versions of a particular article, so the same statement can have different sources. In this paper we analyzed over 40 million articles from the 55 most developed language versions of Wikipedia to extract information about over 200 million references and find the most popular and reliable sources. We presented 10 models for the assessment of the popularity and reliability of the sources based on analysis of meta information about the references in Wikipedia articles, page views and authors of the articles. Using DBpedia and Wikidata we automatically identified the alignment of the sources to a specific domain. Additionally, we analyzed the changes of popularity and reliability in time and identified growth leaders in each of the considered months. The results can be used for quality improvements of the content in different languages versions of Wikipedia. -
Review of Louden, Mark. 2016. Pennsylvania Dutch: the Story of an American Language
Journal of Amish and Plain Anabaptist Studies Volume 4 Issue 2 Special section: Old Colony Mennonites Article 10 2016 Review of Louden, Mark. 2016. Pennsylvania Dutch: The Story of an American Language. Baltimore, MD: Johns Hopkins University Press. Pp. 473. Steve Hartman Keiser Follow this and additional works at: https://ideaexchange.uakron.edu/amishstudies Part of the German Language and Literature Commons Please take a moment to share how this work helps you through this survey. Your feedback will be important as we plan further development of our repository. Recommended Citation Keiser, Steve Hartman. 2016. "Review of Louden, Mark. 2016. Pennsylvania Dutch: The Story of an American Language. Baltimore, MD: Johns Hopkins University Press. Pp. 473." Journal of Amish and Plain Anabaptist Studies 4(2):222-27. This Book Reviews is brought to you for free and open access by IdeaExchange@UAkron, the institutional repository of The University of Akron in Akron, Ohio, USA. It has been accepted for inclusion in Journal of Amish and Plain Anabaptist Studies by an authorized administrator of IdeaExchange@UAkron. For more information, please contact [email protected], [email protected]. 222 Journal of Amish and Plain Anabaptist Studies 4(2) helpfully peaceful role, during the formation of the EPMC, of such bishops as David Thomas and Raymond Charles of the Lancaster Conference, and Homer Bomberger and Isaac Sensenig among those withdrawing. (3) There is due recognition of and appreciation for the remarkableness of the Lancaster Conference’s “amiable” -
Librarians As Wikimedia Movement Organizers in Spain: an Interpretive Inquiry Exploring Activities and Motivations
Librarians as Wikimedia Movement Organizers in Spain: An interpretive inquiry exploring activities and motivations by Laurie Bridges and Clara Llebot Laurie Bridges Oregon State University https://orcid.org/0000-0002-2765-5440 Clara Llebot Oregon State University https://orcid.org/0000-0003-3211-7396 Citation: Bridges, L., & Llebot, C. (2021). Librarians as Wikimedia Movement Organizers in Spain: An interpretive inQuiry exploring activities and motivations. First Monday, 26(6/7). https://doi.org/10.5210/fm.v26i3.11482 Abstract How do librarians in Spain engage with Wikipedia (and Wikidata, Wikisource, and other Wikipedia sister projects) as Wikimedia Movement Organizers? And, what motivates them to do so? This article reports on findings from 14 interviews with 18 librarians. The librarians interviewed were multilingual and contributed to Wikimedia projects in Castilian (commonly referred to as Spanish), Catalan, BasQue, English, and other European languages. They reported planning and running Wikipedia events, developing partnerships with local Wikimedia chapters, motivating citizens to upload photos to Wikimedia Commons, identifying gaps in Wikipedia content and filling those gaps, transcribing historic documents and adding them to Wikisource, and contributing data to Wikidata. Most were motivated by their desire to preserve and promote regional languages and culture, and a commitment to open access and open education. Introduction This research started with an informal conversation in 2018 about the popularity of Catalan Wikipedia, Viquipèdia, between two library coworkers in the United States, the authors of this article. Our conversation began with a sense of wonder about Catalan Wikipedia, which is ranked twentieth by number of articles, out of 300 different language Wikipedias (Meta contributors, 2020). -
Masarykova Univerzita Filozofická Fakulta Ústav
Masarykova univerzita Filozofická fakulta Ústav jazykovědy a baltistiky Magisterská diplomová práce 2019 Ekaterina Smirnova Masaryk University Faculty of Arts Department of Linguistics and Baltic Studies General Linguistics Bc. Ekaterina Smirnova The Etymological Characteristics of Basic Food Names in English, Russian, Czech and Norwegian Master’s Diploma Thesis 2019 I hereby declare that I am the sole author of this work and that I have used only the sources listed in the bibliography. Brno 30.11.2019 3 I would like to thank my supervisor, PhDr. Pavla Valčáková, CSc., for her assistance and guidance with my work. I would also like to thank my family for endless love and incredible support all the way from Murmansk. 4 Table of Contents INTRODUCTION 8 RUSSENORSK 14 FOOD 20 Language and etymology 22 I Russian 22 II Czech 26 III English 28 IV Norwegian 30 CEREALS 32 CEREAL FOODS (PORRIDGE) 33 Language and Etymology 34 I, II Russian, Czech 34 III English 35 IV Norwegian 36 SPELT 40 Language and etymology 41 I Russian 41 II Czech 42 III English 42 IV Norwegian 42 PANICUM / MILLET 44 Language and etymology 45 I Russian 45 II Czech 45 III English 46 IV Norwegian 47 BREAD 50 Language and etymology 51 I, II Russian and Czech 51 III, IV English and Norwegian 52 VEGETABLES 55 Language and etymology 55 5 I Russian 55 II Czech 56 III English 56 IV Norwegian 57 RADISH 59 Language and etymology 60 I Russian 60 II Czech 61 III English 61 IV Norwegian 62 ONION 63 Language and etymology 63 I Russian 63 II Czech 65 III English 65 IV Norwegian 67 CARROT 68 Language and etymology. -
Inflectional Paradigms Have a Base: Evidence from S-Dissimilation in Southern German Dialects
Morphology (2007) 17:151–178 DOI 10.1007/s11525-007-9112-z ORIGINAL PAPER Inflectional paradigms have a base: evidence from s-Dissimilation in Southern German dialects T. A. Hall Æ John H. G. Scott Received: 7 March 2007/Accepted: 4 June 2007/Published online: 15 September 2007 © Springer Science+Business Media B.V. 2007 Abstract In many varieties of Southern German the contrast between /s/ and /∫/ is neutralized to [∫] before /p t/ anywhere within a word (e.g. Post [po∫t] ÔmailÕ), but neutralization does not occur before inflectional suffixes (e.e. ku¨ss-t [kyst] ‘kiss (3 SG)’). It will be argued that the underapplication of neutralization before inflectional suffixes is an example of a Paradigm Uniformity effect: Neutralization is blocked from applying to the final /s/ of a stem so that it will retain a constant shape in a paradigm. Underapplication in examples like [kyst] follows from a requirement that the stem in a derived word be identical to the unaffixed base. By contrast, the German data will be shown to be problematic for the Optimal Paradigms model (McCarthy 2005), since this approach does not allow for a base in inflectional paradigms. Keywords Paradigm uniformity Æ Base identity Æ Optimality Theory Æ German Æ Swabian Æ Realize Morpheme Æ Homophony T. A. Hall (&) Æ John H. G. Scott Department of Germanic Studies, Indiana University, Ballantine Hall 644, 1020 Kirkwood Avenue, Bloomington, IN 47405-7103, USA e-mail: [email protected] J. H. G. Scott [email protected] 123 152 T. A. Hall, John H. G. Scott 1 Introduction In many southern German dialects (e.g. -
Amplifying the Impact of Open Access: Wikipedia and the Diffusion of Science
(forthcoming in the Journal of the Association for Information Science and Technology) Amplifying the Impact of Open Access: Wikipedia and the Diffusion of Science Misha Teplitskiy Grace Lu Eamon Duede Dept. of Sociology and KnowledgeLab Computation Institute and KnowledgeLab University of Chicago KnowledgeLab University of Chicago [email protected] University of Chicago [email protected] (773) 834-4787 [email protected] (773) 834-4787 5735 South Ellis Avenue (773) 834-4787 5735 South Ellis Avenue Chicago, Illinois 60637 5735 South Ellis Avenue Chicago, Illinois 60637 Chicago, Illinois 60637 Abstract With the rise of Wikipedia as a first-stop source for scientific knowledge, it is important to compare its representation of that knowledge to that of the academic literature. Here we identify the 250 most heavi- ly used journals in each of 26 research fields (4,721 journals, 19.4M articles in total) indexed by the Scopus database, and test whether topic, academic status, and accessibility make articles from these journals more or less likely to be referenced on Wikipedia. We find that a journal’s academic status (im- pact factor) and accessibility (open access policy) both strongly increase the probability of its being ref- erenced on Wikipedia. Controlling for field and impact factor, the odds that an open access journal is referenced on the English Wikipedia are 47% higher compared to paywall journals. One of the implica- tions of this study is that a major consequence of open access policies is to significantly amplify the dif- fusion of science, through an intermediary like Wikipedia, to a broad audience. Word count: 7894 Introduction Wikipedia, one of the most visited websites in the world1, has become a destination for information of all kinds, including information about science (Heilman & West, 2015; Laurent & Vickers, 2009; Okoli, Mehdi, Mesgari, Nielsen, & Lanamäki, 2014; Spoerri, 2007). -
Language-Agnostic Relation Extraction from Abstracts in Wikis
information Article Language-Agnostic Relation Extraction from Abstracts in Wikis Nicolas Heist, Sven Hertling and Heiko Paulheim * ID Data and Web Science Group, University of Mannheim, Mannheim 68131, Germany; [email protected] (N.H.); [email protected] (S.H.) * Correspondence: [email protected] Received: 5 February 2018; Accepted: 28 March 2018; Published: 29 March 2018 Abstract: Large-scale knowledge graphs, such as DBpedia, Wikidata, or YAGO, can be enhanced by relation extraction from text, using the data in the knowledge graph as training data, i.e., using distant supervision. While most existing approaches use language-specific methods (usually for English), we present a language-agnostic approach that exploits background knowledge from the graph instead of language-specific techniques and builds machine learning models only from language-independent features. We demonstrate the extraction of relations from Wikipedia abstracts, using the twelve largest language editions of Wikipedia. From those, we can extract 1.6 M new relations in DBpedia at a level of precision of 95%, using a RandomForest classifier trained only on language-independent features. We furthermore investigate the similarity of models for different languages and show an exemplary geographical breakdown of the information extracted. In a second series of experiments, we show how the approach can be transferred to DBkWik, a knowledge graph extracted from thousands of Wikis. We discuss the challenges and first results of extracting relations from a larger set of Wikis, using a less formalized knowledge graph. Keywords: relation extraction; knowledge graphs; Wikipedia; DBpedia; DBkWik; Wiki farms 1. Introduction Large-scale knowledge graphs, such as DBpedia [1], Freebase [2], Wikidata [3], or YAGO [4], are usually built using heuristic extraction methods, by exploiting crowd-sourcing processes, or both [5]. -
Defending Democracy & Kristin Skare Orgeret Nordicom-Information
Defending Defending Oslo 8–11 August 2013 Democracy The 2013 NordMedia conference in Oslo marked the 40 years that had passed since the very first Nordic media conference. To acknowledge this 40-year anniversary, it made sense to have a conference theme that dealt with a major and important topic: Defending Democracy. Nordic Defending and Global Diversities in Media and Journalism. Focusing on the rela- tionship between journalism, other media practices and democracy, the plenary sessions raised questions such as: Democracy & Edited by What roles do media and journalism play in democratization • Kristin Skare Orgeret processes and what roles should they play? Nordic and Global Diversities How does the increasingly complex and omnipresent media in Media and Journalism • Hornmoen Harald field affect conditions for freedom of speech? This special issue contains the keynote speeches of Natalie Fenton, Stephen Ward and Ib Bondebjerg. A number of the conference papers have been revised and edited to become articles. Together, the articles presented should give the reader an idea of the breadth and depth of Edited by current Nordic scholarship in the area. Harald Hornmoen & Kristin Skare Orgeret SPECIAL ISSUE Nordicom Review | Volume 35 | August 2014 Nordicom-Information | Volume 36 | Number 2 | August 2014 Nordicom-Information Nordicom Review University of Gothenburg 2014 issue Special Box 713, SE 405 30 Göteborg, Sweden Telephone +46 31 786 00 00 (op.) | Fax +46 31 786 46 55 www.nordicom.gu.se | E-mail: [email protected] SPECIAL ISSUE Nordicom Review | Volume 35 | August 2014 Nordicom-Information | Volume 36 | Number 2 | August 2014 Nordicom Review Journal from the Nordic Information Centre for Media and Communication Research Editor NORDICOM invites media researchers to contri- Ulla Carlsson bute scientific articles, reviews, and debates. -
Semantic Approaches in Natural Language Processing
Cover:Layout 1 02.08.2010 13:38 Seite 1 10th Conference on Natural Language Processing (KONVENS) g n i s Semantic Approaches s e c o in Natural Language Processing r P e g a u Proceedings of the Conference g n a on Natural Language Processing 2010 L l a r u t a This book contains state-of-the-art contributions to the 10th N n Edited by conference on Natural Language Processing, KONVENS 2010 i (Konferenz zur Verarbeitung natürlicher Sprache), with a focus s e Manfred Pinkal on semantic processing. h c a o Ines Rehbein The KONVENS in general aims at offering a broad perspective r on current research and developments within the interdiscipli - p p nary field of natural language processing. The central theme A Sabine Schulte im Walde c draws specific attention towards addressing linguistic aspects i t of meaning, covering deep as well as shallow approaches to se - n Angelika Storrer mantic processing. The contributions address both knowledge- a m based and data-driven methods for modelling and acquiring e semantic information, and discuss the role of semantic infor - S mation in applications of language technology. The articles demonstrate the importance of semantic proces - sing, and present novel and creative approaches to natural language processing in general. Some contributions put their focus on developing and improving NLP systems for tasks like Named Entity Recognition or Word Sense Disambiguation, or focus on semantic knowledge acquisition and exploitation with respect to collaboratively built ressources, or harvesting se - mantic information in virtual games. Others are set within the context of real-world applications, such as Authoring Aids, Text Summarisation and Information Retrieval.