The Development of a Comprehensive Data Set for Systematic Studies of Machine Translation
Total Page:16
File Type:pdf, Size:1020Kb

Load more
Recommended publications
-
ALEMANA GERMAN, ALEMÁN, ALLEMAND Language
ALEMANA GERMAN, ALEMÁN, ALLEMAND Language family: Indo-European, Germanic, West, High German, German, Middle German, East Middle German. Language codes: ISO 639-1 de ISO 639-2 ger (ISO 639-2/B) deu (ISO 639-2/T) ISO 639-3 Variously: deu – Standard German gmh – Middle High german goh – Old High German gct – Aleman Coloniero bar – Austro-Bavarian cim – Cimbrian geh – Hutterite German kksh – Kölsch nds – Low German sli – Lower Silesian ltz – Luxembourgish vmf – Main-Franconian mhn – Mócheno pfl – Palatinate German pdc – Pennsylvania German pdt – Plautdietsch swg – Swabian German gsw – Swiss German uln – Unserdeutssch sxu – Upper Saxon wae – Walser German wep – Westphalian Glotolog: high1287. Linguasphere: [show] Beste izen batzuk (autoglotonimoa: Deutsch). deutsch alt german, standard [GER]. german, standard [GER] hizk. Alemania; baita AEB, Arabiar Emirerri Batuak, Argentina, Australia, Austria, Belgika, Bolivia, Bosnia-Herzegovina, Brasil, Danimarka, Ekuador, Errumania, Errusia (Europa), Eslovakia, Eslovenia, Estonia, Filipinak, Finlandia, Frantzia, Hegoafrika, Hungaria, Italia, Kanada, Kazakhstan, Kirgizistan, Liechtenstein, Luxenburgo, Moldavia, Namibia, Paraguai, Polonia, Puerto Rico, Suitza, Tajikistan, Uzbekistan, Txekiar Errepublika, Txile, Ukraina eta Uruguain ere. Dialektoa: erzgebirgisch. Hizkuntza eskualde erlazionatuenak dira Bavarian, Schwäbisch, Allemannisch, Mainfränkisch, Hessisch, Palatinian, Rheinfränkisch, Westfälisch, Saxonian, Thuringian, Brandenburgisch eta Low saxon. Aldaera asko ez dira ulerkorrak beren artean. high -
Introduction
INTRODUCTION Hindi, an Indo-European language, is the official language of India, spoken by about 40% of the more than one billion citizens of India. It is also the official language in certain of the territories where a sizeable diaspora settled such as the Republic of Mauritius1. But the label ‘Hindi’ covers considerably distinct speeches, as evidenced by the number of regional varieties covered by the category Hindi in the various Censuses of India: 91 mother tongues in the 1961 Census, up to 331 languages or speeches according to Srivastava 1994, totalizing to six hundred millions speakers (Bhatia 1987: 9). On the other end, the notion that Standard Hindi, more or less corresponding to the variety used in written literature and media, does not exist as a mother tongue, has gained currency during the seventies and eighties: according to Gumperz & Naim (1960), modern standard Hindi is a second or third speech for most of its speakers, being a native language for a small section of the urban population, which until recently was itself a small minority of the global Indian population. However, due to the rapid increase in this urban population, Hindi native speakers can no longer be considered as a non-significant minority, as noticed by Ohala (1983) and Singh & Agnihotri (1997). Many scholars still consider that the modern colloquial language, used for instance in popular movies, does not differ from modern colloquial Urdu. Similarly, many consider that the higher registers of both languages are still only two styles of one and the same language. According to Abdul Haq (1961), “the language we speak and write and call by the name Urdu today is derived from Hindi and constituted of Hindi”. -
Prioritizing African Languages: Challenges to Macro-Level Planning for Resourcing and Capacity Building
Prioritizing African Languages: Challenges to macro-level planning for resourcing and capacity building Tristan M. Purvis Christopher R. Green Gregory K. Iverson University of Maryland Center for Advanced Study of Language Abstract This paper addresses key considerations and challenges involved in the process of prioritizing languages in an area of high linguistic di- versity like Africa alongside other world regions. The paper identifies general considerations that must be taken into account in this process and reviews the placement of African languages on priority lists over the years and across different agencies and organizations. An outline of factors is presented that is used when organizing resources and planning research on African languages that categorizes major or crit- ical languages within a framework that allows for broad coverage of the full linguistic diversity of the continent. Keywords: language prioritization, African languages, capacity building, language diversity, language documentation When building language capacity on an individual or localized level, the question of which languages matter most is relatively less complicated than it is for those planning and providing for language capabilities at the macro level. An American anthropology student working with Sierra Leonean refugees in Forecariah, Guinea, for ex- ample, will likely know how to address and balance needs for lan- guage skills in French, Susu, Krio, and a set of other languages such as Temne and Mandinka. An education official or activist in Mwanza, Tanzania, will be concerned primarily with English, Swahili, and Su- kuma. An administrator of a grant program for Less Commonly Taught Languages, or LCTLs, or a newly appointed language authori- ty for the United States Department of Education, Department of Commerce, or U.S. -
Corpus Collection for Under-Resourced Languages with More Than One Million Speakers
Corpus collection for under-resourced languages with more than one million speakers Dirk Goldhahn1, Maciej Sumalvico1, Uwe Quasthoff1,2 Natural Language Processing Group, University of Leipzig, Germany Department of African Languages, University of South Africa, South Africa Email: { dgoldhahn, janicki, quasthoff, }@informatik.uni-leipzig.de Abstract For only 40 of about 350 languages with more than one million speakers, the situation concerning text resources is comfortable. For the remaining languages, the number of speakers indicates a need for both corpora and tools. This paper describes a corpus collection initiative for these languages. While random Web crawling has serious limitations, native speakers with knowledge of web pages in their language are of invaluable help. The aim is to give interested scholars, language enthusiasts the possibility to contribute to corpus creation or extension by simply entering a URL into the Web Interface. Using a Web portal URLs of interest are collected with the help of the respective communities. A standardized corpus processing chain for daily newspaper corpora creation is adapted to append newly added web pages to an increasing corpus. As a result we will be able to collect larger corpora for under-resourced languages by a community effort. These corpora will be made publicly available. Keywords: corpora, under-resourced languages, Web portal, community 1. Introduction links require the execution of JavaScript code by the There are about 350 languages with more than one million crawler which often produces errors. So, if a website speakers1. For about 40 of them, the situation concerning heavily uses JavaScript links and there are no other links text resources is comfortable: there are corpora of pointing to special pages (coming from another website, reasonable size and also tools like POS taggers adapted to for instance), then all but the main page might be these languages. -
Ling 51: Indo-European Language and Society Germanic Languages I
Ling 51: Indo-European Language and Society Germanic Languages I. Proto-Germanic A. East Germanic 1. Gothic 2. Crimean Gothic B. North Germanic 1. Runic 2. Old Norse and Old Icelandic 3. Old Swedish C. West Germanic 1. Continental a. Old High German b. Old Saxon c. Frankish 2. Ingvaeonic a. Old Frisian b. Old English Numerous dialects. West Saxon is the standard, although modern English descends principally from others Germanic Sound Changes 1. Grimm’s Law Perhaps the most famous of all of the ‘Laws of Indo-European’, Grimm’s Law was established by Jacob Grimm (one of the Grimm brothers), an early 19th century German philologist and collector of German folk materials. PIE Proto-Germanic *p > *f The voiceless stops become *t > * þ the corresponding fricatives *k > *χ (*h) þ = [θ] and χ = [x] *kʷ > *χʷ (*hʷ) *b > *p The voiced stops become *d > *t the corresponding voiceless stops *g > *k *gʷ > *kʷ *bʰ > *β ~ *b The voiced aspirated stops *dʰ > *ð ~ *d become the voiced fricatives, *gʰ > *ɣ ~ g and these usually turn to stops *gʷʰ > *ɣʷ ~ gʷ eventually. Examples of Grimm’s Law *bʰréh₂-ter > *brōþor ‘brother’ *pék ̑u > *feχu ‘wealth on four legs’ *ok ̑tōu̯ > *aχtōu ‘eight’ 2. Verner’s Law Verner’s Law is probably the second most famous IE sound law. After Grimm’s Law was proposed, linguists noticed many cases in which the expected result was not observed for the voiceless fricatives. Sometimes instead of being voiced they emerged as voiceless instead. For some time this was assumed to be because sound changes were not entirely regular, and the idea of reconstructing a proto-language would therefore be impossible. -
Definitionen in Wörterbuch Und Text
Definitionen in W¨orterbuch und Text: Zur manuellen Annotation, korpusgestutzten¨ Analyse und automatischen Extraktion definitorischer Textsegmente im Kontext der computergestutzten¨ Lexikographie Inauguraldissertation zur Erlangung des Grades eines Doktors der Philosophie an der kulturwissenschaftlichen Fakult¨at der Technischen Universit¨at Dortmund vorgelegt im Mai 2010 von Irene Magdalena Cramer geboren in Frankfurt am Main Stand Februar 2011 Inhaltsverzeichnis 1 Einleitung3 2 Zum Wort Definition, seiner Bedeutung und Verwendung9 2.1 Zum Einstieg in die Definitionstheorie................ 12 2.2 Ursprunge¨ der Definition oder: Wozu?................ 14 2.2.1 Zur Etymologie........................ 14 2.2.2 Platon............................. 14 2.2.3 Aristoteles........................... 16 2.3 Definitionen als Werkzeug des Erkenntnisgewinns.......... 17 2.4 Mehr zum Wozu: Blaise Pascal und John Locke........... 19 2.4.1 Blaise Pascal.......................... 20 2.4.2 John Locke.......................... 21 2.4.3 Zusammenfassung....................... 22 2.5 Vom Wozu und Wann: Ludwig Wittgenstein, Definitionen und Spracherwerb........ 23 2.5.1 Ludwig Wittgenstein und die verweisende Definition..... 23 2.5.2 Definitionen und Spracherwerb................ 25 2.5.3 Zusammenfassung....................... 27 2.6 Wozu und Wann in Lexikographie, Terminographie, den Wissenschaften und im Alltag. 29 2.6.1 Die Funktion der terminologischen Definition........ 29 2.6.2 Die Funktion der wissenschaftlichen Definition........ 31 2.6.3 Die Funktion der lexikographischen Definition........ 36 2.6.4 Die Funktion der Alltagsdefinition.............. 44 i ii 2.6.5 Zusammenfassung....................... 48 2.7 Die Pragmatik des Definierens.................... 49 2.8 Was definieren Definitionen?..................... 55 2.9 Bestandteile.............................. 57 2.9.1 Der Definitor: grundlegende Annahmen........... 57 2.9.2 Definiendum und Definiens: grundlegende Annahmen.... 63 2.10 Definitionstypen von Isidor de Sevilla bis heute........... -
Czech Language and Literature Peter Zusi
chapter 17 Czech Language and Literature Peter Zusi Recent years have seen a certain tendency to refer to Kafka as a ‘Czech’ author – a curious designation for a writer whose literary works, without exception, are composed in German. As the preceding chapter describes, Kafka indeed lived most of his life in a city where Czech language and society gradually came to predominate over the German-speaking minor- ity, and Kafka – a native German-speaker – adapted deftly to this changing social landscape. Referring to Kafka as Czech, however, is inaccurate, explicable perhaps only as an attempt to counterbalance a contrasting simplification of his complicated biography: the marked tendency within Kafka scholarship to investigate his work exclusively in the context of German, Austrian or Prague-German literary history. The Czech socio-cultural impulses that surrounded Kafka in his native Prague have primarily figured in Kafka scholarship through sociological sketches portraying ethnic animosity, lack of communication and, at times, open violence between the two largest lin- guistic communities in the city. These historical realities have given rise to the persistent image of a ‘dividing wall’ between the Czech- and German- speaking inhabitants of Prague, with the two populations reading different newspapers, attending separate cultural institutions and congregating in segregated social venues. This image of mutual indifference or antagonism has often made the question of Kafka’s relation to Czech language and cul- ture appear peripheral. Yet confronting the perplexing blend of proximity and distance, famil- iarity and resentment which characterized inter-linguistic and inter- cultural contact in Kafka’s Prague is a necessary challenge. -
UTOPIAN for BEGINNERS an Amateur Linguist Loses Control of the Language He Invented
ANNALS OF LINGUISTICS DECEMBER 24 & 31, 2012 IUE UTOPIAN FOR BEGINNERS An amateur linguist loses control of the language he invented. By Joshua Foer Quijada’s invented language has two seemingly incompatible ambitions: to be maximally precise but also maximally concise. here are so many ways for speakers of English to see the world. We can T glimpse, glance, visualize, view, look, spy, or ogle. Stare, gawk, or gape. Peek, watch, or scrutinize. Each word suggests some subtly different quality: looking implies volition; spying suggests furtiveness; gawking carries an element of social judgment and a sense of surprise. When we try to describe an act of vision, we consider a constellation of available meanings. But if thoughts and words exist on different planes, then expression must always be an act of compromise. Languages are something of a mess. They evolve over centuries through an unplanned, democratic process that leaves them teeming with irregularities, quirks, and words like “knight.” No one who set out to design a form of communication would ever end up with anything like English, Mandarin, or any of the more than six thousand languages spoken today. “Natural languages are adequate, but that doesn’t mean they’re optimal,” John Quijada, a fty-three-year-old former employee of the California State Department of Motor Vehicles, told me. In 2004, he published a monograph on the Internet that was titled “Ithkuil: A Philosophical Design for a Hypothetical Language.” Written like a linguistics textbook, the fourteen-page Web site ran to almost a hundred and sixty thousand words. It documented the grammar, syntax, and lexicon of a language that Quijada had spent three decades inventing in his spare time. -
Gothic Introduction – Part 1: Linguistic Affiliations and External History Roadmap
RYAN P. SANDELL Gothic Introduction – Part 1: Linguistic Affiliations and External History Roadmap . What is Gothic? . Linguistic History of Gothic . Linguistic Relationships: Genetic and External . External History of the Goths Gothic – Introduction, Part 1 2 What is Gothic? . Gothic is the oldest attested language (mostly 4th c. CE) of the Germanic branch of the Indo-European family. It is the only substantially attested East Germanic language. Corpus consists largely of a translation (Greek-to-Gothic) of the biblical New Testament, attributed to the bishop Wulfila. Primary manuscript, the Codex Argenteus, accessible in published form since 1655. Grammatical Typology: broadly similar to other old Germanic languages (Old High German, Old English, Old Norse). External History: extensive contact with the Roman Empire from the 3rd c. CE (Romania, Ukraine); leading role in 4th / 5th c. wars; Gothic kingdoms in Italy, Iberia in 6th-8th c. Gothic – Introduction, Part 1 3 What Gothic is not... Gothic – Introduction, Part 1 4 Linguistic History of Gothic . Earliest substantively attested Germanic language. • Only well-attested East Germanic language. The language is a “snapshot” from the middle of the 4th c. CE. • Biblical translation was produced in the 4th c. CE. • Some shorter and fragmentary texts date to the 5th and 6th c. CE. Gothic was extinct in Western and Central Europe by the 8th c. CE, at latest. In the Ukraine, communities of Gothic speakers may have existed into the 17th or 18th century. • Vita of St. Cyril (9th c.) mentions Gothic as a liturgical language in the Crimea. • Wordlist of “Crimean Gothic” collected in the 16th c. -
Interference of Bhojpuri Language in Learning English As a Second Language
ELT VOICES – INDIA INTERNATIONAL JOURNAL FOR TEACHERS OF ENGLISH FEBRUARY 2014 | VOLUME 4, ISSUE 1 | ISSN 2230-9136 (PRINT) 2321-7170 (ONLINE) Interference of Bhojpuri Language in Learning English as a Second Language V. RADHIKA1 ABSTRACT Studying the special features of any language is interesting for exploring many unrevealed facts about that language. The purpose of this paper is to create interest in the mind of readers to identify the special features of Bhojpuri language, which is gaining importance in India in the recent years among the spoken languages and make the readers to get rid of the problem of interference of this language in learning English as a second language. 1. Assistant Professor in English, AVIT, VMU, Paiyanoor, India. ELT VOICES – INDIA February 2014 | Volume 4, Issue 1 1. Introduction The paper proposes to make an in-depth study of Bhojpuri language to enable learners to compare and contrast between two languages, Bhojpuri and English by which communicative skill in English can be properly perceived. Bhojpuri language has its own phonological and morphological pattern. The present study is concerned with tracing the difference between two languages (i.e.) English and Bhojpuri, with reference to their grammatical structure and its relevant information. Aim of the Paper Create interest in the mind of scholars to make the detailed study about the Bhojpuri language with reference to their grammar pattern and to find out possible solution to eliminate the problem of mother tongue interference. 2. Origin of Bhojpuri Language Bhojpuri language gets its name from a place called ‘Bhojpur’ in Bihar. It is believed that Ujjain Rajputs claimed their descent from Raja Bhoj of Malwa in the sixteenth century. -
Czech Literature Guide
CZECH CZECH LITERATURE LITERATURE GUIDE GUIDE GUIDE LITERATURE CZECH CZECH LITERATURE GUIDE Supported by the Ministry of Culture of the Czech Republic © Institut umění – Divadelní ústav (Arts and Theatre Institute) First edition ISBN 978-80-7008-272-0 All rights of the publication reserved CONTENT ABOUT THE CZECH REPUBLIC 12 A CONCISE HISTORY OF CZECH LITERATURE 13 LITERATURE 1900–45 13 LITERATURE AFTER 1945 13 CONTEMPORARY CZECH LITERATURE 1995–2010 20 PROSE 20 POETRY 26 ESSAY 31 LITERATURE FOR CHILDREN AND YOUNG PEOPLE 37 THE BOOK MARKET 41 THE TEACHING OF WRITING IN THE CZECH REPUBLIC 48 CONTEMPORARY LITERARY LIFE 51 ORGANIZATIONS AND INSTITUTIONS 51 LITERARY AND BOOK AWARDS 56 FESTIVALS AND FAIRS 58 EDUCATION 59 LEGISLATION AND LITERARY AGENCIES 62 LIBRARIES AND ARCHIVES 63 GRANTS AND SCHOLARSHIPS 65 LITERARY CAFÉS AND TEA-ROOMS 68 ANTIQUARIAN BOOKSHOPS 71 MEDIA 77 LITERARY PERIODICALS 77 CZECH LITERATURE ON THE WEB 82 LITERARY PROGRAMMES ON TV 83 LITERARY PROGRAMMES ON RADIO 83 CZECH LITERATURE ABROAD 87 VARIOUS LINKS 87 OVERVIEW OF FOREIGN CZECH STUDIES SCHOLARS, TRANSLATORS AND FRIENDS OF CZECH CULTURE 87 DEAR READERS WITH AN INTEREST IN CZECH LITERATURE Allow us to draw your attention to our Czech Literature Guide. It presents a panorama of the contemporary life of Czech literature with a short historical overview. It has been produced for everyone who has an interest in understanding Czech literary culture and its milieu, from the specialist and scholarly to the active and practical. 11 ABOUT THE CZECH REPUBLIC CZECH LITERATURE GUIDE 12 ABOUT THE CZECH REPUBLIC The Czech Republic (CR) is a landlocked country The GDP per capita in CZK in 2010 was 361,986 with a territory of 78,865 m2 lying in the centre (exchange rate EUR 1 = CZK 24.5) and the infl ation of Europe. -
National Minorities, Minority and Regional Languages in Germany
National minorities, minority and regional languages in Germany National minorities, minority and regional languages in Germany 2 Contents Foreword . 4 Welcome . 6 Settlement areas . 8 Language areas . 9 Introduction . 10 The Danish minority . 12 The Frisian ethnic group . 20 The German Sinti and Roma . 32 The Sorbian people . 40 Regional language Lower German . 50 Annex I . Institutions and bodies . 59 II . Legal basis . 64 III . Addresses . 74 Publication data . 81 Near the Reichstag building, along the Spree promenade in Berlin, Dani Karavan‘s installation “Basic Law 49” shows the articles of Germany‘s 1949 constitution on 19 glass panes. Photo: © Jens Kalaene/dpa “ No person shall be favoured or disfavoured because of sex, parentage, race, language, homeland and origin, faith, or religious or political opinions.” Basic Law for the Federal Republic of Germany, Art. 3 (3), first sentence. 4 Foreword Four officially recognized national minorities live in Germany: the Danish minority, the Frisian ethnic group, the German Sinti and Roma, and the Sorbian people. The members of national minorities are German na- tionals and therefore part of the German legal order. They enjoy all rights and freedoms granted under the Basic Law without any restrictions. This brochure describes the history, the settlement areas and the organizations of the national minorities in Germany and explores how they see themselves Dr Thomas de Maizière, Member and how they live while trying to preserve their cultural of the German Bundestag roots. Each of the four minorities identifies itself in Federal Minister of the Interior particular through its own language. As language is an Photo: © Press and Information Office of the Federal Government important part of their identity, it deserves particular protection.