NRC Systems for Low Resource German-Upper Sorbian Machine Translation 2020: Transfer Learning with Lexical Modifications

Total Page:16

File Type:pdf, Size:1020Kb

NRC Systems for Low Resource German-Upper Sorbian Machine Translation 2020: Transfer Learning with Lexical Modifications NRC Systems for Low Resource German-Upper Sorbian Machine Translation 2020: Transfer Learning with Lexical Modifications Rebecca Knowles and Samuel Larkin and Darlene Stewart and Patrick Littell National Research Council Canada fRebecca.Knowles, Samuel.Larkin, Darlene.Stewart, [email protected] Abstract 2 Approaches We describe the National Research Council 2.1 General System Notes of Canada (NRC) neural machine translation In both translation directions, our final systems con- systems for the German–Upper Sorbian super- sist of ensembles of multiple systems, built using vised track of the 2020 shared task on Unsu- transfer learning (Section 2.2), BPE-Dropout (Sec- pervised MT and Very Low Resource Super- tion 2.3), alternative preprocessing of Czech data vised MT. Our models are ensembles of Trans- former models, built using combinations of (Section 2.4), and backtranslation (Section 2.5). BPE-dropout, lexical modifications, and back- We describe these approaches and related work in translation. the following sections, providing implementation details for reproducibility in Sections3,4 and5. 1 Introduction 2.2 Transfer Learning We describe the National Research Council of Zoph et al.(2016) proposed a transfer learning Canada (NRC) neural machine translation systems approach for neural machine translation, using lan- for the shared task on Unsupervised MT and Very guage pairs with larger amounts of data to pre-train Low Resource Supervised MT. We participated a parent system, followed by finetuning a child sys- in the supervised track of the low resource task, tem on the language pair of interest. Nguyen and building Upper Sorbian–German neural machine Chiang(2017) expand on that, showing improved translation (NMT) systems in both translation direc- performance using BPE and shared vocabularies tions. Upper Sorbian is a minority language spoken between the parent and child. We follow this ap- in Germany. We built baseline systems (standard proach: we build disjoint source and target BPE Transformer (Vaswani et al., 2017) with a byte-pair models and vocabularies, with one vocabulary for encoding vocabulary (BPE; Sennrich et al., 2016b)) German (DE) and one for the combination of Czech trained on all available parallel data (60,000 lines), (CS) and Upper Sorbian (HSB); see Section4. which resulted in unusually high BLEU scores for We chose to use Czech–German data as the par- a language pair with such limited data. ent language pair due to the task suggestions, rela- In order to improve upon this baseline, we used tive abundance of data, and the close relationship transfer learning with modifications to the training between Czech and Upper Sorbian (cf. Lin et al., lexicon. We did this in two ways: by experiment- 2019; Kocmi and Bojar, 2018). While Czech and ing with the application of BPE-dropout (Provilkov Upper Sorbian cognates are often not identical at et al., 2020) to the transfer learning setting (Sec- the character level (Table1), there is a high level of tion 2.3), and by modifying Czech data used for character-level overlap; trying to take advantage of training parent systems with word and character that overlap without assuming complete character- replacements in order to make it more “Upper level identity is a motivation for the explorations Sorbian-like” (Section 2.4). in subsequent sections (Section 2.3, Section 2.4). Our final systems were ensembles of systems Another relatively high-resource language related built using transfer learning and these two ap- to Upper Sorbian is Polish, but while the Czech proaches to lexicon modification, along with it- and Upper Sorbian orthographies are fairly simi- erative backtranslation. lar, mostly using the same characters for the same 1112 Proceedings of the 5th Conference on Machine Translation (WMT), pages 1112–1122 Online, November 19–20, 2020. c 2020 Association for Computational Linguistics sounds (with a few notable exceptions), Polish or- presence of the former during pre-training can ini- thography is more distinct. This, combined with tialize at least some of the subwords that the model the lack of a direct Polish–German parallel dataset will later see in Upper Sorbian. However, other in the constrained condition, led us to choose Czech times the character-level differences lead to seg- as our transfer language for these experiments. mentations where no subwords are shared (e.g. CS hospoda´r@@ˇ ska´ vs. HSB hospodar@@ Czech Upper Sorbian sce or potom vs. HSB po@@ tym). Consider- analyzovat analyzowac´ ing a wider variety of segmentations would, we donesl donjesł hypothesized, mean that Upper Sorbian subwords extern´ıch eksternych would have more chance of being initialized during hospoda´rskˇ a´ hospodarsce Czech pre-training (see AppendixC). kreativn´ı kreatiwne Rather than modifying the NMT system itself to okres wokrjes reapply BPE-dropout during training, we treated potom potym BPE-dropout as a preprocessing step. Additionally, projekt projekt we experimented with BPE-dropout in the context semantick´ a´ semantisku of transfer learning, examining the effects of us- velkym´ wulkim ing source-side, both-sides, or no dropout in both parent and child systems. Table 1: A sample of probable Czech–Upper Sorbian cognates and shared loanwords, mined from the Czech– 2.4 Pseudo-Sorbian German and German–Upper Sorbian parallel corpora and filtered by Levenshtein distance. For the Upper Sorbian–German direction, we also experimented with two techniques for modifying Other work on transfer learning for low-resource the Czech–German parallel data so that the Czech machine translation includes multilingual seed side is more like Upper Sorbian. In particular, we models (Neubig and Hu, 2018), dynamically concentrated on modification methods that require adding to the vocabulary when adding languages neither large amounts of data, nor in-depth knowl- (Lakew et al., 2018), and using a hierarchical archi- edge of the historical relationships between the tecture to use multiple language pairs (Luo et al., languages, since both of these are often lacking for 2019). the lower-resourced language. We considered two variations of this idea: 2.3 BPE-Dropout We apply the recently-proposed approach of per- • word-level modification, in which some fre- forming BPE-dropout (Provilkov et al., 2020), quent Czech words (e.g. prepositions) are which takes an existing BPE model and randomly replaced by likely Upper Sorbian equivalents, drops some merges at each merge step when ap- and plying the model to text. The goal of this, beside • character-level modification, where we at- leading to more robust subword representations tempt to convert Czech words at the character in general, is to produce subword representations level to forms that may more closely resemble that are more likely to overlap between the pre- Upper Sorbian words. training (Czech–German) and finetuning (Upper Sorbian–German) stages. We hypothesized that, in Note that in neither case do we know what the same way that BPE-Dropout leads to robust- particular conversions are correct; we ourselves ness against accidental spelling errors and variant do not know enough about historical Western spellings (Provilkov et al., 2020), it could likewise Slavic to predict the actual Upper Sorbian cog- lead to robustness to the kind of spelling variations nates of Czech words. Rather, we took inspiration we see between two related languages. from stochastic segmentation methods like BPE- For example, consider the putative Czech–Upper Dropout (Provilkov et al., 2020) and SentencePiece Sorbian cognates and shared loanwords presented (Kudo and Richardson, 2018): when we have an in Table1. Sometimes a fixed BPE segmenta- idea of the possible solutions to the segmentation tion happens to separate shared characters into problem but do not know which one is the correct shared subwords (e.g. CS analy@@ z@@ ovat one, we can sample randomly from the possible vs. HSB analy@@ z@@ owac´), such that the segmentations as a sort of regularization, with the 1113 goal of discouraging the model from relying too correspondences (e.g. CS v to HSB w, CS d to heavily on a single segmentation scheme and giv- HSB dz´ before front vowels, etc.), we wrote a pro- ing it some exposure to a variety of possible seg- gram to randomly replace the appropriate Czech mentations. Whereas BPE-dropout and Sentence- character sequences with possible correspondences Piece focus on possible segmentations of the word, in Upper Sorbian. Again, as Czech-Upper Sorbian our pseudo-Sorbian experiments focus on possi- correspondences are not entirely predictable (CS ble word- and character-level replacements. The e might happen to correspond, in a particular cog- goal was to discourage the parent Czech–German nate, to HSB e or ej or i or a or o, etc.), we model from relying too heavily on regularities in cannot expect that any given result is correct Upper Czech (e.g. the presence of particular frequent Sorbian. Rather, we can think of this process as words, the presence of particular Czech character attempting to train a system that can respond to in- n-grams) and perhaps also gain some prior expo- puts from a variety of possible (but not necessarily sure to Upper Sorbian words and characters that actual) Western Slavic languages, rather than just a will occur in the genuine Upper Sorbian data; we system that can respond to precisely-spelled Czech can also think of this as a form of low-resource and only Czech. data augmentation (Fadaee et al., 2017; Wang et al., 2018). See AppendixC for an analysis of increased 2.4.3 Combined pseudo-Sorbian subword overlap between pseudo-Sorbian and test data, as compared to BPE-dropout and the baseline In initial testing, we determined that a combina- approach. tion of word-level and character-level modification performed best; we ran each process on the Czech– 2.4.1 Word-level pseudo-Sorbian German corpus separately, then concatenated the To generate the word-level pseudo-Sorbian, we ran resulting corpora and trained a parent model on it.
Recommended publications
  • Sorbian Languages
    Hornjoserbsce Sorbian languages Dolnoserbski Lusatia Sorbian: Location: Germany - Lusatia Users: 20 – 30 thousand Lower Sorbian: Upper Sorbian: Location:Niederlausitz - Location: Upper Saxony Lower Lusatia (Dolna state, Bautzen (Budysin), Luzica), Cottbus (Chósebuz) and Kamenz main town Population: around 18 Population: around 7 thousand thousand Gramma ● Dual for nouns, pronouns, Case Upper Sorb. Lower Sorb. ajdectives and verbs. Nom. žona žeńska Hand - Ruka (one) - Ruce Gen. žony žeńske (two) - Ruki (more than two) Dat. žonje žeńskej ● Upper Sorbian - seven Acc. žonu žeńsku cases Instr. ze žonu ze žeńskeju ● Lower Sorbian - six cases (no Vocativus) Loc. wo žonje wó žeńskej Voc. žono Sounds in comparison to Polish Polish sounds ć and dź in To be - Być - Biś Lower Sorbian change to ś Children - Dzieci - Źisi and ź. Group of polish sounds tr and Right - Prawy – Pšawy, pr change into tš, pś and pš. Scary - Straszny – Tšašny Pronunciation Sorbian Polish 1. č 1. cz 2. dź 2. soft version of dz 3. ě 3. beetwen polish e and i 4. h 4. mute before I and in the end of a word (bahnity) 5. kh 5. ch 6. mute in the end of the word (niósł) 6. ł 7. beetwen polish o and u 7. ó 8. rz (křidło) 8. ř 9. sz 9. š 10. before a consonant is mute (wzdać co - wyrzec 10. w się), otherwise we read it as u ( Serbow) 11. ž 11. z Status Sorbian languages are recognize by the German goverment. They have a minority language status. In the home areas of the Sorbs, both languages are officially equal to German.
    [Show full text]
  • Text and Audio Corpus of Native Lower Sorbian Tekstowy a Zukowy Korpus Maminorěcneje Dolnoserbšćiny
    Lower Sorbian Text and audio corpus of native Lower Sorbian Tekstowy a zukowy korpus maminorěcneje dolnoserbšćiny Name of language: Lower Sorbian Generic affiliation: Indo‐European, Slavic, West Slavic, Sorbian Country and region: Germany, Brandenburg, Lower Lusatia Number of speakers of native Lower Sorbi‐ an: a few hundred History Sorbian tribes were first mentioned in 631 AD and the ancestors of today’s Sorbs have settled in the region to become known as ‘Lusatia’ as early as the 6th century AD. The first written document in (Eastern) Lower Sorbian is the New Testament (in the version of Martin Luther) translated by Mikławš Jaku‐ bica in 1548. Sorbian (used as a generic term 10 years Witaj Kindergarten Sielow/Žylow, 2008; by courtesy of W. Meschkank for both Sorbian languages, Lower Sorbian and Upper Sorbian) comprises a large number of dialects. Since the 16th century, in the wake Permanent project team Challenges and importance of the of the Reformation, both languages began to Sorbian Institute (Germany) project develop a literary variety. Because of natural and forced assimilation, the language area of Dr. Hauke Bartels The biggest challenge for the Lower Sorbian Sorbian has shrunk considerably over the project head DoBeS project is the small and very fast de‐ course of the centuries. creasing number of native speakers. The con‐ Kamil Thorquindt‐Stumpf tinuity of spoken Lower Sorbian dialects is Although many dialects are already extinct or project coordinator most likely to end soon, when the few still almost extinct, today’s native dialect‐based existing native speakers have passed away. Lower Sorbian shows significant differences to Jan Meschkank With all native speakers being of the oldest the literary language taught in a few schools generation, gaining access to them or even in Lower Lusatia.
    [Show full text]
  • The Role of Morphosyntactic Feature Economy in the Evolution of Slavic Number
    The Role of Morphosyntactic Feature Economy in the Evolution of Slavic Number Tatyana Slobodchikoff Troy University 1. Introduction The Slavic dual pronouns exhibit two different patterns of diachronic change. The first pattern of diachronic change is exemplified in three languages - Slovenian, Upper, and Lower Sorbian. In these languages, dual pronouns continue to be used by the speakers in a bi-morphemic morphological structure consisting of a plural pronominal stem and the numeral dva ('two') or the dual suffix -j (1a). The second pattern of diachronic change is more pervasive and occurred in all of the Slavic languages including Russian and Kashubian. In Russian and Kashubian, dual pronouns were reanalyzed by the speakers as plural (1b). (1) Two Patterns of Diachronic Change in the Slavic Dual a. dual → plural + dva/-j Slovenian & Upper, Lower Sorbian b. dual → plural Russian & Kashubian The diachronic changes in the Slavic dual, although well documented, remain a puzzling problem for morphological theory. Some scholars attribute diachronic changes in the Slavic dual to morphological markedness of dual number. Derganc (2003) shows that the nominal dual, as a more morphologically marked category, was replaced by the plural in Ljubljana Slovenian and some of its other dialects. In a detailed study of the Old Russian dual, Žolobov (2001) proposes that dual pronouns were reanalyzed as plural due to markedness of the dual as a more restricted category of number as opposed to the plural. Nevins (2011) argues that morphological markedness of the abstract features of the dual in Slovenian and Sorbian can trigger either postsyntactic deletion of the marked dual features themselves or deletion of other phi-features, such as gender.
    [Show full text]
  • Mutual Intelligibility and Slavic Constructed Interlanguages : a Comparative Study of Ruski Jezik and ​ ​ Interslavic
    Mutual intelligibility and Slavic constructed interlanguages : a comparative study of Ruski Jezik and ​ ​ Interslavic ** Jade JOANNOT M.A Thesis in Linguistics Leiden university 2020-2021 Supervisor : Tijmen Pronk TABLE OF CONTENTS Jade Joannot M.A Thesis Linguistics 24131 words 1.1. Abstract 1.2. Definitions 1.2.1. Constructed languages 1.2.2. Interlanguage 1.2.3. Mutual intelligibility 1.3. Object of study 1.3.1. History of Slavic constructed languages Pan-Slavic languages (19th century) Esperanto-inspired projects Contemporary projects 1.3.2. Ruski Jezik & Interslavic Ruski Jezik (17th century) Interslavic (21th century) 1.3.3. Shared aspects of Ruski Jezik and Interslavic 1.4. Relevance of the study 1.4.1. Constructed languages and mutual intelligibility 1.4.2. Comparative study of Ruski Jezik and Interslavic 1.4.3. Historical linguistics 1.5. Structure of the thesis 1.5.1. Research question 1.8. Description of the method 1.8.1. Part 1 : Approaches to Slavic mutual intelligibility and their conclusions 1.8.2. Part 2 : Study of Ruski Jezik and Interslavic I.1. Factors of mutual intelligibility I.1.1. Extra-linguistic factors I.1.2. Linguistic predictors of mutual intelligibility I.1.2.1. Lexical distance I.1.2.2. Phonological distance I.1.2.3. Morphosyntactic distance I.1.2.3.1. Methods of measurements I.1.2.3.2. The importance of morphosyntax I.1.3. Conclusions I.2. Mutual intelligibility in the Slavic area I.2.1. Degree of mutual intelligibility of Slavic languages I.2.2. The case of Bulgarian 2 Jade Joannot M.A Thesis Linguistics 24131 words I.2.3.
    [Show full text]
  • Stereotyp Tożsamości 27 Magdalena Pytlak Swoje, Obce, Skolonizowane
    Tożsamość Słowian zachodnich i południowych w świetle XX ‑wiecznych dyskusji i polemik Tożsamość Słowian zachodnich i południowych w świetle XX ‑wiecznych dyskusji i polemik T. 1 Konteksty filologiczne i kulturoznawcze pod redakcją Katarzyny Majdzik i Józefa Zarka Wydawnictwo Uniwersytetu Śląskiego • Katowice 2017 Redaktor serii: Historia Literatur Słowiańskich Piotr Fast Recenzent Patrycjusz Pająk Spis treści Wstęp (Katarzyna Majdzik, Józef Zarek) 7 Barbara Czapik ‑Lityńska Tożsamość jako wartość 13 Bożena Tokarz Stereotyp tożsamości 27 Magdalena Pytlak Swoje, obce, skolonizowane. Uwagi na temat wewnętrznych sporów o model kultury bułgarskiej w ostatniej dekadzie XX wieku 43 Dorota Gołek ‑Sepetliewa Kwestie tożsamości narodowej i kulturowej w twórczości poetyckiej Ani Iłkowa 53 Vesna Grahovac ‑Pražić, Sanja Vrcić ‑Mataija Zavičajni identitet u nastavi 65 Cvijeta Pavlović Granice europskoga/slavenskoga istoka i zapada: poljska književnost u kanonu svjetske književnosti iz hrvatske perspektive 83 Antonina Kurtok Chorwacka mniejszość narodowa na Węgrzech – w poszukiwaniu defi‑ nicji przynależności 103 Agata Jawoszek Problemy tożsamościowe bośniackiej diaspory w Turcji. Zarys proble‑ matyki 123 Joanna Czaplińska Jak pozostać sobą w obcym języku? Strategie konwersji językowej pisa‑ rzy pochodzenia słowiańskiego 139 6 Spis treści Anita Gostomska Od Kultury kłamstwa do Kultury karaoke. Obraz przemian utrwalony w eseistyce Dubravki Ugrešić 155 Divna Mrdeža Antonina O protonacionalnom identitetu Jurja Barakovića u povijesti književnosti 173 Zvjezdana Rados Problematika hrvatskoga nacionalnog identiteta u Krležinu romanu Zastave 193 Natalia Wyszogrodzka ‑Liberadzka W obronie chorwackości. Kwestie tożsamościowe na łamach czasopisma „Krugovi” (1952–1958) 211 Ana Dalmatin Dijalektička perspektiva južnoslavenskih književno ‑jezičnih odnosa 221 Silvija Borovnik Problem tożsamości Słoweńców w esejach Draga Jančara 233 Marta Buczek Tożsamość jednostki w słowackiej prozie naturyzmu 247 Tomasz Derlatka Opisy podróży po krajach słowiańskich z lat 50.
    [Show full text]
  • Slovenski Jezik Slovene Linguistic Studies
    Slovenski jezik Slovene Linguistic Studies 6 2007 POSEBNI ODTIS – OFFPRINT Ljubljana – Lawrence D.Slovenski F. Reindl, jezik Slovene – Slovene Ultra-Formal Linguistic Studies Address: 6 (2007): Borrowing, 151–168 Innovation, and Analysis 151 Donald F. Reindl Filozofska fakulteta, Ljubljana Slovene Ultra-Formal Address: Borrowing, Innovation, and Analysis Slovenščina ima ogovorni sistem, ki se od osnovnega dvojnega ogovornega sistema mnogih ev- ropskih jezikov loči v tem, da oblikovno razlikuje do štirih ravni formalnosti (neformalno/tikanje, polformalno/napol vikanje, formalno/vikanje in ultraformalno/onikanje). Do nedavnega je bilo onikanje v redni uporabi tako v neposrednem kot v posrednem ogovoru (oz. govorjenjem o odsot- ni osebi). Čeprav bi lahko slovnične značilnosti onikanja izvirale iz stika z nemščino, se zdi, da predstavlja slovenska uporaba onikanja v posrednem ogovoru samostojen izum. Avtor analizira Linhartovo veseloigro Županova Micka z namenom, da razišče in prikaže vzajemno delovanje teh ogovornih oblik. Podobne raziskave onikanja v drugih jezikih (češčina, slovaščina) bi lahko bolje osvetlile pojav, ki je prisoten v več slovanskih jezikih. Slovene has a system of address that differs from the basic binary address system of many Europe- an languages by grammatically distinguishing up to four levels of formality (informal, semiformal, formal, and ultra-formal). Until recently, ultra-formal address was regularly used in direct as well as indirect address (i.e., reference to absent persons). Although the grammatical characteristics of Slovene ultra-formal address (3rd plural) appear to have been the result of contact with German, the Slovene application of this form to indirect address appears to have been an independent innova- tion. Anton Tomaž Linhart’s play Županova Micka is analyzed in order to explore and illustrate the interaction of these various address forms.
    [Show full text]
  • 1 Font Technical Documentation
    Euclid Mono 1 font Technical Documentation Swiss Typefaces Euclid Mono Typeface Technical Documentation www.swisstypefaces.com/lab/ 1/9 Euclid Mono Typeface Overview 1 font, Latin Euclid Mono When a typeface is expanded to include a monos- paced style, this usually means: Letterforms get squeezed into a straitjacket, with added bars on ‘i’ and Euclid ‘l’ – and that’s it. To Swiss Typefaces, “usually” is a fo- reign word, though. Euclid Mono Vanguard brings a distinct style to the table, expanding the palette of typographic expres- sion. It redefines what a geometric mono can be, while Mono staying true to the Euclid spirit. The accelerating forms for ‘S’/‘s’ and ‘Z’/‘z’ thwart any rigidity that may be caused by the fixed widths. ‘M’ and ‘W’/‘w’ have diago- nals, but not like you’d think. Together with the capital ligatures, they represent a contemporary nod to the special features of Avant Garde Gothic, a milestone of Vanguard geometric type design that keeps providing inspiration half a century after its release. There are alts for those dancing glyphs, but don’t ex- pect to find a safe fallback. Circle + right angle: ‘Q’ sums up the Euclidian mindset. Swiss Typefaces Euclid Mono Typeface Technical Documentation www.swisstypefaces.com/lab/ 2/9 Euclid Mono Typeface Overview Euclid Mono cs MAX ri ez WVY we Swiss Typefaces Euclid Mono Typeface Technical Documentation www.swisstypefaces.com/lab/ 3/9 Euclid Mono Glyph Overview Uppercase (Latin) Figures OT Stylistic Set 1: Mirrored ABCDEFGHIJKLMNOP 1234567890 ABCDEFGHIJKLMNOP QRSTUVWXYZ QRSTUVWXYZ ÀÁÂÃÄÅĀĂĄÆÇĆČĈĊĎ Standard Ligatures (Latin) abcdefghijklmnop ĐÈÉËĔÊĚĒĘĖĞĜĠĢĦĤ fi fl qrstuvwxyz ÌÍÎÏĬİĪĮĨỊĴĶĹĽĻĿ Punctuation ŁÑŊŃŇŅÒÓÔÕÖØŎŐŌỌ OT Stylistic Set 2: Capital Ligatures ŒÙÚÛÜŬŮŰŪŲŨỤŔŘŖŚ !?¡¿*#%&@/),-–— GOMOOCOGOMOOOWWO ŠŞŜȘŦŤŢẂŴẄẀẄÝŶŸȲ _.:;…·[\]|«»‹›‘ ŹŽŻÞẞƏ ’“”‚„°•'" Euclid Mono offers two Stylistic Sets.
    [Show full text]
  • From Lesser Taught Languages to Lesser Taught Cultures
    The pagination of this PDF does not match the published version. Slovak Studies Program, University of Pittsburgh votruba “at” pitt “dot” edu http://www.pitt.edu/~votruba Herder and Modernity: From Lesser Taught Languages to Lesser Taught Cultures Martin Votruba East/West: Journal of Ukrainian Studies, 4: 1. ISSN 2292-7956 Abstract: The typical North American curriculum of a lesser taught Slavic language implicitly relies on the legacy of Johann Gottfried von Herder’s assertion that language contains national (ethnic) culture. At the same time, enrollments are dwindling even in courses in the most commonly taught Slavic languages. Post-Millennials’ (Generation Z’s) understandable focus on the practicality of the courses they take make it unlikely for the lesser taught languages to resist the slump. Foreign culture courses appear to hold their ground more successfully. Slavic departments may reconsider Herder’s dictum as they try to maintain or establish programs in lesser taught languages and cultures. The frequency with which a Slavic language is taught at North American universities does not go directly hand-in-hand with the number of its speakers in the country of its main use or its speakers in the world. While a small number of speakers is, understandably, a strong factor to shift the status of a language to “lesser taught,” historical reasons have contributed to the cur- rent situation as well. Ukrainian is the most striking instance. In sheer number of its speakers,1 it should be taught about as often as Polish2 and be well ahead of Czech. Yet, its teaching has been affected by circumstances similar to those that have also impacted the teaching of Slovak and Slovenian, languages with but a small fraction of the speakers of Ukrainian.
    [Show full text]
  • Is Yugoslav Really a President Tito Yugoslav?
    UNCLASSIFIED (b) (3)-P.L. 86-36 Is Yugoslav President Tito Really a Yugoslav? The Yugoslav President, Josip Tito, appears to language which does not have this phenomenon at­ speak Serbo-Croatian, allegedly his native language, tempts to speak English, he will frequently fail to with a foreign accent. This article will analyze certain aspirate his word initial voiceless stops. This will be phonological 1 and morphological 2 features in his speech 3 one of the numerous features which will characterize which point to that conclusion and which cannot be that speaker as having a "foreign accent." A similar explained by his advanced age and the consequent situation exists in the case of dental vs. alveolar stops5 possible loss of faculties. (orthographically represented as t and d). Native First of all, it may be useful to define what exactly speakers of English articulate them at the alveolar 6 is meant by a "foreign accent." While most everyone ridge • Native speakers of Russian articulate them at can recognize this phenomenon upon encountering it, the back of the upper incisors. While the difference is most of us are usually hard pressed to define it beyond minute, it is nonetheless perceptible and constitutes a general impressionistic statement, such as "he talks yet another feature of the foreign accent. 7 In view of funny." What that means, of course, is that the the above, a foreign accent may be defined as an person has failed to fully master the phonology of the attempt to substitute the phonology of one's native target language. language for that of the target language or, to put it In general, linguists maintain that each language another way, as the presence of features which nor­ has its unique phonology, a finite number of phonemes4 mally do not appear in the pronunciation of native selected from the theoretically infinite number of speakers.
    [Show full text]
  • Definiteness in a Language Without Articles – a Study on Polish
    Definiteness in a Language without Articles – A Study on Polish Adrian Czardybon Hana Filip, Peter Indefrey, Laura Kallmeyer, Sebastian Löbner, Gerhard Schurz & Robert D. Van Valin, Jr. (eds.) Dissertations in Language and Cognition 3 Adrian Czardybon 2017 Definiteness in a Language without Articles – A Study on Polish BibliograVsche Information der Deutschen Nationalbibliothek Die Deutsche Nationalbibliothek verzeichnet diese Publikation in der Deutschen NationalbibliograVe; detaillierte bibliograVsche Daten sind im Internet über http://dnb.dnb.de abrufbar. D 61 © düsseldorf university press, Düsseldorf 2017 http://www.dupress.de Einbandgestaltung: Doris Gerland, Christian Horn, Albert Ortmann Satz: Adrian Czardybon, Thomas Gamerschlag Herstellung: docupoint GmbH, Barleben Gesetzt aus der Linux Libertine ISBN 978-3-95758-047-4 Für meine Familie Contents 1 Introduction 1 2 Theoretical basis 9 2.1 The distribution of the definite article in English and German . 9 2.2 Approaches to definiteness . 13 2.2.1 Familiarity . 13 2.2.2 Uniqueness . 16 2.3 Löbner’s approach to definiteness . 18 2.3.1 Inherent uniqueness and inherent relationality . 18 2.3.2 Concept types . 19 2.3.3 Shifts and determination . 20 2.3.4 Semantic vs. pragmatic uniqueness . 22 2.3.5 Scale of uniqueness . 25 2.4 Mass/count distinction . 30 2.5 Definiteness strategies discussed in the Slavistic literature . 35 3 Demonstratives 43 3.1 Criteria for the grammaticalization of definite articles . 44 3.2 Polish determiners and the paradigm of ten . 46 3.3 Previous studies on demonstratives in Polish . 49 3.4 My analysis of ten . 55 3.4.1 The occurrence of ten with pragmatic uniqueness .
    [Show full text]
  • The Role of Languages in Intercultural Communication Rolo De Lingvoj En Interkultura Komunikado Rola Języków W Komunikacji Międzykulturowej
    Cross-linguistic and Cross-cultural Studies 1 The Role of Languages in Intercultural Communication Rolo de lingvoj en interkultura komunikado Rola języków w komunikacji międzykulturowej Editors – Redaktoroj – Redakcja Ilona Koutny & Ida Stria & Michael Farris Poznań 2020 The Role of Languages in Intercultural Communication Rolo de lingvoj en interkultura komunikado Rola języków w komunikacji międzykulturowej 1 2 Uniwersytet im. Adama Mickiewicza – Adam Mickiewicz University Instytut Etnolingwistyki – Institute of Ethnolinguistics The Role of Languages in Intercultural Communication Rolo de lingvoj en interkultura komunikado Rola języków w komunikacji międzykulturowej Editors – Redaktoroj – Redakcja Ilona Koutny & Ida Stria & Michael Farris Poznań 2020 3 Cross-linguistic and Cross-cultural Studies 1 Redaktor serii – Series editor: Ilona Koutny Recenzenci: Věra Barandovská-Frank Probal Dasgupta Nicolau Dols Salas Michael Farris Sabine Fiedler Federico Gobbo Wim Jansen Kimura Goro Ilona Koutny Timothy Reagan Ida Stria Bengt-Arne Wickström Projekt okładki: Ilona Koutny Copyright by: Aŭtoroj – Authors – Autorzy Copyright by: Wydawnictwo Rys Wydanie I, Poznań 2020 ISBN 978-83-66666-28-3 DOI 10.48226/978-83-66666-28-3 Wydanie: Wydawnictwo Rys ul. Kolejowa 41 62-070 Dąbrówka tel. 600 44 55 80 e-mail: [email protected] www.wydawnictworys.com 4 Contents – Enhavtabelo – Spis treści Foreword / Antaŭparolo / Przedmowa ................................................................................... 7 1. Intercultural communication:
    [Show full text]
  • Electronic Corpora of Slavic Micro-Languages at Their Threshold – the State of the Art and Its Further Prospects
    Mira Načeva-Marvanová (J. E. Purkyně University) Electronic Corpora of Slavic Micro-languages at Their Threshold – the State of the Art and its Further Prospects After the fall of the iron curtain, since the middle of the 1990s, new written and spoken language digital corpora have been developed for almost all of the standard Slavic literary languages. One should add that this quite appropriate, useful, and above all, extremely beneficial approach is the best method of language documentation of micro-languages as well. It has also been applied to a certain extent to some other sociolinguistic varieties of language. This presentation is focused on the already available current Slavic micro-language corpora, such as: (1) the two monolingual "large size" written text corpora of Upper Sorbian – HOTKO (of 32 million words) and of Lower Sorbian – DOTKO (of 12 million words) - both these corpora are hosted by the Czech National Corpus using its corpus managers and interface; (2) the series of five oral digital multimedia corpora of: Burgenland Croatian, colloquial Upper Sorbian, Molise Slavic (Na-našu) in Southern Italy, Nashta (Liti, Northern Greece) and the language of Edesa (Northern Greece, called by its researcher Bulgaro-Macedonian) – all these corpora are a part of the French-German research Euroslav 2010 and LaCiTo (France); (3) the WORTSCHATZ corpora of Kashubian, Lower Sorbian, Upper Sorbian, Silesian and Transcarpathian Rusyn (a project of the University of Leipzig); (4) the digital archive of Lower Sorbian language data at the DoBeS (Documentation Bedrohter Sprachen = Documentation of Endangered Languages); (5) the Transdanubian electronic corpus of Bulgarian dialects (from 38 locations) in Southern Romania (of the University of Calgary and University of Sofia).
    [Show full text]