Speeding up Corpus Development for Linguistic Research: Language Documentation and Acquisition in Romansh Tuatschin Géraldine Walther, Benoît Sagot

Speeding up corpus development for linguistic research: language documentation and acquisition in Romansh Tuatschin Géraldine Walther, Benoît Sagot To cite this version: Géraldine Walther, Benoît Sagot. Speeding up corpus development for linguistic research: language documentation and acquisition in Romansh Tuatschin. Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Aug 2017, Vancouver, Canada. pp.89 - 94, 10.18653/v1/W17-2212. hal-01570614 HAL Id: hal-01570614 https://hal.inria.fr/hal-01570614 Submitted on 31 Jul 2017 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Speeding up corpus development for linguistic research: language documentation and acquisition in Romansh Tuatschin Geraldine´ Walther Benoˆıt Sagot University of Zurich¨ Inria Plattenstrasse 54 2 rue Simone Iff 8032 Zurich,¨ Switzerland 75 012 Paris, France [email protected] [email protected] Abstract Similarly, studies on language acquisition are based on recorded, transcribed, and annotated data In this paper, we present ongoing work of parent-child interactions. Language acquisition for developing language resources and ba- research has produced significant databases, such sic NLP tools for an undocumented vari- as the CHILDES project (McWhinney, 2000), yet ety of Romansh, in the context of a lan- mainly manually and at enormous costs. guage documentation and language acqui- Relying solely on manual annotators is too sition project. Our tools are designed to costly an option. Minimising resource develop- improve the speed and reliability of cor- ment costs is crucial. Manual language resource pus annotations for noisy data involving development for language documentation and lan- large amounts of code-switching, occur- guage acquisition projects should be sped up, as rences of child speech and orthographic soon as technically possible, by employing NLP noise. Being able to increase the efficiency tools such as spelling correction/normalisation of language resource development for lan- tools and part-of-speech (POS) taggers. For in- guage documentation and acquisition re- stance, POS taggers used as pre-annotators have search also constitutes a step towards solv- been shown to increase both annotation speed and ing the data sparsity issues with which re- annotation quality, even when trained on limited searchers have been struggling. amounts of data (Marcus et al., 1993; Fort and Sagot, 2010). 1 Introduction Yet language acquisition and language docu- Contemporary linguistic research relies more and mentation data presents specific challenges for au- more heavily on the exploration of statistical pat- tomatic linguistic annotation. Firstly, such data terns in language. The non-categorical distribution usually consists of transcriptions of spontaneous and variety of linguistic units has become a focus speech. Secondly, previously undescribed lan- for understanding complex variations across pat- guages are often not written and lack established terns, such as dative alternations (Bresnan et al., orthographies, resulting in noisy transcriptions. 2007) or cases of optionality (Wasow et al., 2011). Thirdly, acquisition data consists of recordings of Studies like these require the availability of large child-parent interactions. The recorded target chil- consistently annotated corpora. dren’s language production can differ dramatically For lesser or undescribed languages however, from adult language, adding another layer of lin- such resources are not readily available. What guistic variation. Finally, as new data is usually is worse, the current rate of language extinction still being collected, available raw and even more could lead to the disappearance of 20-90% of to- so annotated data is rare, which significantly limits day’s spoken languages by the end of the 21st the available training data for annotation tools. century (Krauss, 1992). Documenting endangered In this paper, we show the interaction be- languages will allow us to preserve traces of the tween manual resource development (morpholog- current language diversity, a part of the world’s ical lexicon, spelling and POS-annotated corpus) cultural heritage. Building reliable linguistic re- and automatic tools on current annotation experi- sources will allow us to study them according to ments for a language documentation and acquisi- fast evolving research standards. tion project on the undocumented and previously non-written Romansh dialect of Tuatschin. ical items or full utterances is ubiquitous. As a result, our recorded Tuatschin data com- 2 Romansh Tuatschin prises a high amount of German and standard Sur- silvan. This is even more acute in the acquisition The term Romansh denotes a set of Romance corpus data used within present experiments, as languages with important Germanic lexical and the children’s families tend to be natively bilin- grammatical influence, mostly spoken in the can- gual. While the children’s mothers are all na- ton of the Grisons in South-Eastern Switzerland. tive Tuatschin speakers, the fathers are Sursilvan, Although Romansh is considered one of the four Swiss German or Italian speakers. The children official national languages of Switzerland, the therefore produce a significant amount of mixed term Romansh covers in fact a variety of languages utterances. In addition to the high amount of code- and dialects with significantly differing features. switching due to this particular language setting, The dialect we focus on in present paper corre- language acquisition data also comprises intrin- sponds to a previously undocumented dialect of sic noise due to the differences between child and the Romansh Sursilvan variety called Tuatschin. It adult language, including nonce-words or specific is spoken by approximately 1,500 speakers in the child-speech. The variation observed in our cor- Val Tujetsch area. Contrary to the neighbouring pus data ranges from language and dialectal varia- main Sursilvan dialect, which is also the main lan- tion to adult/child register differences. Developing guage in local schools, Tuatschin is at this point automated tools for such a corpus requires coping an unwritten language. It is however still natively with noisy, heterogeneous corpus data.1 spoken and transmitted both in the local area and within families who have left and settled in larger 3 Developing guidelines cities within the country. Speakers are proud of their language and culture and promote it through 3.1 Orthography a local cultural association and occasional, non- The first challenge in developing linguistic re- normalised, publications in the local newspaper. sources and automated tools for a previously un- The development of the resources described written language like Tuatschin consists in devel- here is part of a project which combines language oping an orthography that can be used for tran- documentation and acquisition research. One aim scribing recorded data, and training future tran- of this project is to gain better understanding of scribers (native speakers) in using this new or- intergenerational language transmission in endan- thography. We wanted this orthography to also be gered language contexts, which might contribute usable by native speakers outside the project. In to eventually slowing down, if not reversing, lin- collaboration with two native speakers within the guistic and cultural erosion of minority languages. Val Tujetsch, we developed a new orthography for The data used in this project is mostly original Tuatschin, which is mainly based on the orthog- data from fieldwork in the Val Tujetsch, original raphy for the neighbouring written dialect Sur- recordings within five Tuatschin speaking families silvan, but accommodates the phonetic and mor- with at least one child aged between 2 and 3 years, phological differences of Tuatschin, from the pro- and updated and normalised lexical data from the nunciation of specific vowels and diphthongs to Tuatschin word-list by Caduff(1952). diacritic marking of infinitive forms. Once the Adult speakers of Tuatschin are usually multi- main principles of the orthography had been estab- lingual, natively speaking at least two, if not more, lished, we started training our native corpus tran- Romansh dialects (Tuatschin and Sursilvan), as scribers. However, without complete resources well as German (both High German and the lo- (such as a full lexicon or grammar) at their dis- cal variety of Swiss German), and often a fair posal, each one of them still had their own in- amount of Italian and French. Everyday conver- terpretation of the overall principles for the tran- sations in the Tuatschin speaking area comprise a 1Note that while Sursilvan does have an established or- high amount of code-switching. Conversations be- thography and is used a a language of instruction in schools, tween speakers of neighbouring dialects more of- there are no automatic tools available for the language. The ten than not result in each speaker speaking their existing online lexicon can

Speeding up Corpus Development for Linguistic Research: Language Documentation and Acquisition in Romansh Tuatschin Géraldine Walther, Benoît Sagot

Language Contact at the Romance-Germanic Language Border

Switzerland 4Th Periodical Report

X. Romansh Studies

Developments of the Lateral in Occitan Dialects and Their Romance and Cross-Linguistic Context Daniela Müller

By Chiara Frigeni a Thesis Submitted in Conformity with the Requirements

1 Hoch Über Der Vallumnezia 2 Von Der Neuzeit Zurück In

Mesemna, Ils 6 Da Mars 2013

Agenda 24.10.2020 Bis 01.11.2020 Highlights

Dapli Rg Che Sursilvan En Scola Ein Aunc Adina Vacants Ils Idioms Da Las Emprimas Classas

(Iv) Zum “Niev Vocabulari Romontsch Sursilvan- Tudestg” (NVRST)1

Gliesta Alfabetica

The Interaction of Vowel Deletion and Syllable Structure Constraints