Speeding up corpus development for linguistic research: language documentation and acquisition in Romansh Tuatschin Géraldine Walther, Benoît Sagot

To cite this version:

Géraldine Walther, Benoît Sagot. Speeding up corpus development for linguistic research: language documentation and acquisition in Romansh Tuatschin. Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Aug 2017, Vancouver, Canada. pp.89 - 94, ￿10.18653/v1/W17-2212￿. ￿hal-01570614￿

HAL Id: hal-01570614 https://hal.inria.fr/hal-01570614 Submitted on 31 Jul 2017

HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Speeding up corpus development for linguistic research: language documentation and acquisition in Romansh Tuatschin

Geraldine´ Walther Benoˆıt Sagot University of Zurich¨ Inria Plattenstrasse 54 2 rue Simone Iff 8032 Zurich,¨ 75 012 Paris, France [email protected] [email protected]

Abstract Similarly, studies on language acquisition are based on recorded, transcribed, and annotated data In this paper, we present ongoing work of parent-child interactions. Language acquisition for developing language resources and ba- research has produced significant databases, such sic NLP tools for an undocumented vari- as the CHILDES project (McWhinney, 2000), yet ety of Romansh, in the context of a lan- mainly manually and at enormous costs. guage documentation and language acqui- Relying solely on manual annotators is too sition project. Our tools are designed to costly an option. Minimising resource develop- improve the speed and reliability of cor- ment costs is crucial. Manual language resource pus annotations for noisy data involving development for language documentation and lan- large amounts of code-switching, occur- guage acquisition projects should be sped up, as rences of child speech and orthographic soon as technically possible, by employing NLP noise. Being able to increase the efficiency tools such as spelling correction/normalisation of language resource development for lan- tools and part-of-speech (POS) taggers. For in- guage documentation and acquisition re- stance, POS taggers used as pre-annotators have search also constitutes a step towards solv- been shown to increase both annotation speed and ing the data sparsity issues with which re- annotation quality, even when trained on limited searchers have been struggling. amounts of data (Marcus et al., 1993; Fort and Sagot, 2010). 1 Introduction Yet language acquisition and language docu- Contemporary linguistic research relies more and mentation data presents specific challenges for au- more heavily on the exploration of statistical pat- tomatic linguistic annotation. Firstly, such data terns in language. The non-categorical distribution usually consists of transcriptions of spontaneous and variety of linguistic units has become a focus speech. Secondly, previously undescribed lan- for understanding complex variations across pat- guages are often not written and lack established terns, such as dative alternations (Bresnan et al., orthographies, resulting in noisy transcriptions. 2007) or cases of optionality (Wasow et al., 2011). Thirdly, acquisition data consists of recordings of Studies like these require the availability of large child-parent interactions. The recorded target chil- consistently annotated corpora. dren’s language production can differ dramatically For lesser or undescribed languages however, from adult language, adding another layer of lin- such resources are not readily available. What guistic variation. Finally, as new data is usually is worse, the current rate of language extinction still being collected, available raw and even more could lead to the disappearance of 20-90% of to- so annotated data is rare, which significantly limits day’s spoken languages by the end of the 21st the available training data for annotation tools. century (Krauss, 1992). Documenting endangered In this paper, we show the interaction be- languages will allow us to preserve traces of the tween manual resource development (morpholog- current language diversity, a part of the world’s ical lexicon, spelling and POS-annotated corpus) cultural heritage. Building reliable linguistic re- and automatic tools on current annotation experi- sources will allow us to study them according to ments for a language documentation and acquisi- fast evolving research standards. tion project on the undocumented and previously non-written Romansh dialect of Tuatschin. ical items or full utterances is ubiquitous. As a result, our recorded Tuatschin data com- 2 Romansh Tuatschin prises a high amount of German and standard Sur- silvan. This is even more acute in the acquisition The term Romansh denotes a set of Romance corpus data used within present experiments, as languages with important Germanic lexical and the children’s families tend to be natively bilin- grammatical influence, mostly spoken in the can- gual. While the children’s mothers are all na- ton of the in South-Eastern Switzerland. tive Tuatschin speakers, the fathers are , Although Romansh is considered one of the four Swiss German or Italian speakers. The children official national languages of Switzerland, the therefore produce a significant amount of mixed term Romansh covers in fact a variety of languages utterances. In addition to the high amount of code- and dialects with significantly differing features. switching due to this particular language setting, The dialect we focus on in present paper corre- language acquisition data also comprises intrin- sponds to a previously undocumented dialect of sic noise due to the differences between child and the Romansh Sursilvan variety called Tuatschin. It adult language, including nonce-words or specific is spoken by approximately 1,500 speakers in the child-speech. The variation observed in our cor- Val area. Contrary to the neighbouring pus data ranges from language and dialectal varia- main Sursilvan dialect, which is also the main lan- tion to adult/child register differences. Developing guage in local schools, Tuatschin is at this point automated tools for such a corpus requires coping an unwritten language. It is however still natively with noisy, heterogeneous corpus data.1 spoken and transmitted both in the local area and within families who have left and settled in larger 3 Developing guidelines cities within the country. Speakers are proud of their language and culture and promote it through 3.1 Orthography a local cultural association and occasional, non- The first challenge in developing linguistic re- normalised, publications in the local newspaper. sources and automated tools for a previously un- The development of the resources described written language like Tuatschin consists in devel- here is part of a project which combines language oping an orthography that can be used for tran- documentation and acquisition research. One aim scribing recorded data, and training future tran- of this project is to gain better understanding of scribers (native speakers) in using this new or- intergenerational language transmission in endan- thography. We wanted this orthography to also be gered language contexts, which might contribute usable by native speakers outside the project. In to eventually slowing down, if not reversing, lin- collaboration with two native speakers within the guistic and cultural erosion of minority languages. Val Tujetsch, we developed a new orthography for The data used in this project is mostly original Tuatschin, which is mainly based on the orthog- data from fieldwork in the Val Tujetsch, original raphy for the neighbouring written dialect Sur- recordings within five Tuatschin speaking families silvan, but accommodates the phonetic and mor- with at least one child aged between 2 and 3 years, phological differences of Tuatschin, from the pro- and updated and normalised lexical data from the nunciation of specific vowels and diphthongs to Tuatschin word-list by Caduff(1952). diacritic marking of infinitive forms. Once the Adult speakers of Tuatschin are usually multi- main principles of the orthography had been estab- lingual, natively speaking at least two, if not more, lished, we started training our native corpus tran- Romansh dialects (Tuatschin and Sursilvan), as scribers. However, without complete resources well as German (both High German and the lo- (such as a full lexicon or grammar) at their dis- cal variety of Swiss German), and often a fair posal, each one of them still had their own in- amount of Italian and French. Everyday conver- terpretation of the overall principles for the tran- sations in the Tuatschin speaking area comprise a 1Note that while Sursilvan does have an established or- high amount of code-switching. Conversations be- thography and is used a a language of instruction in schools, tween speakers of neighbouring dialects more of- there are no automatic tools available for the language. The ten than not result in each speaker speaking their existing online lexicon can only be queried online but is not freely available. Despite the languages’ similarity there are own variety, accommodating the other person’s di- no existing resources that could be leveraged to facilitate au- alect to varying degrees. Insertion of German lex- tomatic work on Tuatschin at this point. scription of individual words, adding an additional tilingual parser development, cross-lingual learn- transcriber-related layer to the variation within the ing, and parsing research from a language typol- data. Developing the new orthography also re- ogy perspective.” In the current version (2.0), sev- quired several passes, based on feedback from our eral dozen languages are already covered, some- transcribers and progress in our own understand- times with more than one treebank. What is more, ing of the language data through our ongoing field the UD website announces a Romansch and a (dis- work. Subsequent changes contributed to varia- tinct) Sursilvan Romansch treebank. Our initia- tion even within an individual transcriber’s orthog- tive is not related to these treebank development raphy, however reducing the differences between efforts—we had independently decided to develop different transcribers’ orthographic strategies. a deterministic mapping between our language- specific label inventory and a UD-compatible an- 3.2 Annotation tagset inventories notation scheme, in order to pave the way for a For our corpus annotation, we developed two sep- future Romansch-Tuaschin UD corpus. Together arate tagsets, one for part-of-speech (POS) anno- with the dependency annotation of our data, this tation and one for morphosyntactic features. Just will be the focus of future work. as for our orthographic conventions, our tagset evolved alongside field work progress while anno- 4 Manual ressource development tation was already ongoing, adding further noise 4.1 TuatLex to our corpus data. This kind of noise is a com- mon problem in language documentation projects We manually developed a morphological lexi- where data collection, annotation, and analysis con for Tuatschin. For that, we devised an ex- are conducted in parallel. Without the availabil- plicit grammatical description of Tuatschin nom- ity of automatic tools, it normally requires several inal and verbal morphology based on our own passes of manual post-cleaning and adds to the field work data. We implemented this descrip- overall cost of language resource development. tion within the AlexinaP ARSLI lexical framework (Sagot and Walther, 2013). In addition to the Our POS tagset comprises a fine-grained and implemented morphological description, our Tu- a coarse POS inventory. The full coarse-grained atschin lexicon, TuatLex, comprises a list of 2,176 inventory, used for our POS tagging experi- lemmas, based on an updated version of the Tu- ments (see Section 5.2) is the following: ADJ, atschin word-list by Caduff(1952) complemented ADV(erb), COMPL(ementiser), CONJ(unction), with our newly collected data, among which 780 DET(erminer), INTER(jection), N(oun), PN verbs, 949 nouns, and 146 adjectives. The Alex- (proper noun), PREP(osition), PRN (pronoun), ina inflection tool uses the grammar to produce PUNCT(uation), QW (question word), SOUND, 46,089 inflected entries, among which 29,361 ver- V(erb). Fine-grained tags add information such bal, 15,137 nominal, and 762 adjectival forms. as DET def for definite articles or PREP loc for locative prepositions. It also comprises a specific 4.2 Manual corpus annotation childspeech refinement of all POS for words that are specific to child-speech. The morphosyntactic After devising the tagset, we trained advanced lin- features follow the Leipzig Glossing Rule conven- guistic undergraduate students to manually anno- tions commonly used in language documentation tate our corpus data. The students were no na- projects and comprise distinctions for number and tive speakers of Romansh. They were asked to gender, as well as tense, mood and person. annotate each token for (fine-grained) POS and morphosyntactic information and to indicate an Although we have designed this language- 3 specific inventory as a means to better model English translation for each token’s base form. the morphological and morphosyntactic proper- Using the WebAnno annotation tool (Eckart de ties of Tuatschin, we recognise the relevance and Castilho et al., 2014), annotators took approxi- importance of the Universal Dependency (UD) mately 15 to 25 hours for annotating files contain- initiative,2 whose aim is to “develo[p] cross- ing typically 1,000 to 2,000 words. Difficulties for linguistically consistent treebank annotation for annotators mainly came from orthographic vari- many languages, with the goal of facilitating mul- 3For example, for the Tuatschin token nursasˆ , they would have provided the following annotations: POS = N, Mor- 2http://universaldependencies.org phosyn = pl, and the English translation ‘EWE’. ation, variation in the representation of word to- in the form of a Tuatschin-specific parametrisa- kens,4 and dialectal variation.5 Weekly meetings tion of the SXPipe shallow processing architecture between two trained linguists, among whom one (Sagot and Boullier, 2008). Sursilvan native speaker, and all annotators were Note that the spelling standardisation/correction organised to compare notes, discuss recurrent dif- tool is not meant to be used as a standalone tool. It ficulties, and adapt the tagsets whenever neces- has been designed and developed only for speed- sary.6 ing up future corpus development and for serving as a cleaning step before applying the POS tagger, 4.3 Manual corpus normalisation whose results are obviously better on (even only In order to simplify the annotation task, we set partially) normalised data. up a procedure for systematic orthographic cor- Our spelling standardisation/correction tool re- rection. We first asked one of the native speakers, lies on a list of deterministic rewriting patterns who had been involved in the development of the that we automatically extracted and manually se- orthographic conventions, to manually correct al- lected based on the data described in Section 4.3. ready transcribed corpus data for orthographic er- More precisely, we applied standard alignment rors and individual-word-based code-switching in techniques for extracting n-to-m correspondences a separate normalised tier within the corpus.7 We between raw and normalised text. Among the ex- then manually introduced an additional tier indi- tracted rules, 695 were deterministic, which means cating for each ‘normalised’ token whether it had that there was a unique output in the corpus for been corrected for orthography, code-switching, a given input. Out of these 695 candidate rules, child-speech, or actual pronunciation errors. This we manually selected 603, whose systematic ap- intermediate layer, in addition to being useful for plication could not result in any over-correction.9 subsequent acquisition or code-switching studies, Several others are non-deterministic. However, is meant to help the development of an automatic a careful manual examination of these candidate spell-checker. Aside from variation in the usage ambiguous rules showed that most of the ambi- of diacritics, some of the most frequent errors in guity is an artifact of the rule extraction process, the transcribed data involved the amalgamation of and that true ambiguities can be resolved in con- words meant to be written as separate tokens.8 text.10 As a result, contextual information was in- cluded in our standardisation/correction patterns, 5 Automation and tools thus resulting in a fully deterministic standardisa- 5.1 Tokenisation and orthographic correction tion/correction module.

In order to speed up the manual development of or- 5.2 POS tagger thographically normalised corpora based on new transcriptions, but also to prepare the fully auto- In order to assess and improve the quality of the matic processing of non-annotated text, we first POS annotation, but also to have a POS pre- developed an automatic tokenisation and spelling annotation tool for speeding up future manual an- standardisation/correction tool. It is implemented notation, we trained the MElt POS tagger (Denis and Sagot, 2012) on the manually POS-annotated 4 E.g. vegni instead of vegn i ‘it comes’ (lit. ‘comes it’). data available so far, using coarse POS to re- 5For example, unexpected morphological forms that pre- vented the recognition of inflected forms such as verbs end- duce sparsity (see Section 3.2). The 2,571 al- ing in Sursilvan -el instead of Tuatschin -a in the first person ready annotated sentences, containing 9,927 to- singular. kens, were divided in training, development and 6The project being mainly a documentation project on a previously undescribed language, data collection, annotation, test sets by randomly selecting sentences with re- and data analysis are currently being carried out in parallel. Data annotation in particular has been performed as a collab- 9A few examples: stu`→stu, sel` →se’l` , schia→scheia´ . orative task rather than as a task conducted by individual an- 10For instance, vegn`ı and all other verbs from TuatLex’s notators, that would have to be evaluated for inter-annotator inflection class Vi can produce ambiguous tokens such as agreement. vegni, which must be changed into vegn`ı (infinitive) before 7The purpose of this normalised layer is solely to help pronouns such as ju, te´, el, ella, el, i, nus, vus, els, ellas, automatic (and, to a lesser extent, manual) annotation of the ins, but, in most other contexts, must be rewritten as vegn i data. It is not meant to replace the original transcription layer, (V+PRN) with an expletive pronoun i. We applied the latter which remains the relevant layer for subsequent linguistic to those verb tokens likely to appear with expletive subjects analysis. like vegn`ı ’to come’ while inserting the infinitive diacritic on 8Cf. the vegni/vegn i example from previous footnote. all other verb instances (such as cap`ı ’understand’). spective probabilities 0.8, 0.1 and 0.1. We also ex- on the information comprised within the interme- tracted from the TuatLex lexicon described above diate tier introduced during the normalisation pro- a set of 19,771 unique (form, POS) pairs, to be cedure (see Section 4.3). This intermediate layer used by MElt as a complementary source of infor- is meant to be ultimately automatically generated mation, in addition to the training corpus. by SXPipe. The richer and more accurate informa- We trained a first version of MElt, and ap- tion that our tools will be able to provide will also plied the resulting POS tagger on the training data facilitate subsequent quantitative linguistic studies themselves. Annotation mismatches allowed us to on Romansh Tuatschin, its acquisition by children identify several systematic errors in the training and its influences by surrounding languages, espe- data. Some of them came from individual errors cially Sursilvan and (Swiss) German. or annotator disagreement. But most were due to changes made in the annotation guidelines while 7 Acknowledgements manual annotation was already ongoing, and of The authors gratefully acknowledge the SNF which some had not yet been correctly retroac- project number 159544 ”The morphosyntax of tively applied to already annotated data.11 We agreement in Tuatschin: acquisition and contact” applied a series of POS correction rules to the whose funding has been instrumental to the work whole corpus (train+dev+test) as a result of this presented here. We also thank our project col- training-corpus-based study, and re-trained MElt laborators and especially our annotators Nathalie on the corrected data. The result is a POS tag- Schweizer, Michael Redmond, and Rolf Hotz for ger trained on 2,062 sentences (7,901 words) with their help in annotating the data, and Claudia Cath- a 91.7% accuracy overall and a 65.3% accuracy omas for her extremely valuable help and native on words unknown to the training corpus. Inter- expertise on Romansh data. For their help in de- estingly, if trained without TuatLex lexical infor- veloping a new Tuatschin orthography we would mation, accuracy figures do not drop as much as like to express our warmest thanks to Ciril Monn usually observed (Denis and Sagot, 2012): respec- and Norbert Berther. tively 91.6% and 62.5%. This suggests that lexical information might not be as important for improv- ing POS taggers for child-related speech as it is References for tagging standard text, a fact likely related to the more limited vocabulary size in such corpora. Joan Bresnan, Anna Cueni, Tatiana Nikitina, and Har- ald Baayen. 2007. Predicting the dative alterna- 6 Conclusion tion. In G. Boume, I. Kraemer, and J. Zwarts, editors, Cognitive Foundations of Interpretation, We have described ongoing efforts for developing Royal Netherlands Academy of Science, Amster- language resources (lexicon, annotated corpus)12 dam, pages 69–94. and basic NLP tools for the Romansh variety of Leonard´ Caduff. 1952. Essai sur la phonetique´ du Tuatschin, in the context of a project on language parier rhetoroman´ de la Vallee´ de Tavetsch (canton description and language acquisition dedicated to des Grisons - Suisse). Francke, Bern. a previously non-written language. Our next step will consist in using our tools for pre-annotating Pascal Denis and Benoˆıt Sagot. 2012. Coupling an annotated corpus and a lexicon for state-of-the-art new raw data, in order to speed up annotation POS tagging. Language Resources and Evaluation while increasing its quality and consistency. They 46(4):721–736. will also be used for creating automatically anno- tated data, which will complement the manually Richard Eckart de Castilho, Chris Biemann, Iryna Gurevych, and Seid Muhie Yimam. 2014. Webanno: annotated corpus. a flexible, web-based annotation tool for clarin. On a longer term, our tools are also meant to In Proceedings of the CLARIN Annual Conference be used for automatically categorising tokens and (CAC) 2014. CLARIN ERIC, Utrecht, Netherlands, sequences of tokens into occurrences of child- page online. Extended abstract. speech or various types of code-switching, relying Karen¨ Fort and Benoˆıt Sagot. 2010. Influence of pre- 11For instance, the previously defined POS “V particle” annotation on POS-tagged corpus development. In had been discarded at some point during corpus development, Proceedings of the Fourth ACL Linguistic Annota- yet it still had a number of occurrences in the training data. tion Workshop (LAW IV). Uppsala, Sweden, pages 12The lexicon and tools will all be made freely available. 56–63. Michael Krauss. 1992. The worlds language in crisis. Language (68). Mitchell Marcus, Beatrice´ Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computa- tional Linguistics 19(2):313–330. Bryan McWhinney. 2000. The CHILDES Project: Tools for analyzing talk.. Lawrence Erlbaum Asso- ciates, Mahwah, NJ, 3 edition.

Benoˆıt Sagot and Pierre Boullier. 2008. SxPipe 2: ar- chitecture pour le traitement pre-syntaxique´ de cor- pus bruts. Traitement Automatique des Langues 49(2):155–188. Benoˆıt Sagot and Geraldine´ Walther. 2013. Implement- ing a formal model of inflectional morphology. In Cerstin Mahlow and Michael Piotrowski, editors, Proceedings of the Third International Workshop on Systems and Frameworks for Computational Morphology (SFCM 2013). Humboldt-Universitat,¨ Springer-Verlag, Berlin, Germany, volume 380 of Communications in Computer and Information Sci- ence (CCIS), pages 115–134. Thomas Wasow, T. Florian Jaeger, and David Orr. 2011. Lexical variation in relativizer frequency. In H. Simon and H. Wiese, editors, Expecting the Unexpected: Exceptions in Grammar, de Gruyter, Berlin, pages 175–195.