Towards a Corsican Basic Language Resource Kit
Total Page:16
File Type:pdf, Size:1020Kb
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 2726–2735 Marseille, 11–16 May 2020 c European Language Resources Association (ELRA), licensed under CC-BY-NC Towards a Corsican Basic Language Resource Kit Laurent Kevers, Stella Retali-Medori UMR CNRS 6240 LISA, Universita` di Corsica - Pasquale Paoli Avenue Jean Nicoli, 20250 Corte, France fkevers l, medori [email protected] Abstract The current situation regarding the existence of natural language processing (NLP) resources and tools for Corsican reveals their virtual non-existence. Our inventory contains only a few rare digital resources, lexical or corpus databases, requiring adaptation work. Our ob- jective is to use the Banque de Donnees´ Langue Corse project (BDLC) to improve the availability of resources and tools for the Corsican language and, in the long term, provide a complete Basic Language Ressource Kit (BLARK). We have defined a roadmap setting out the actions to be undertaken: the collection of corpora and the setting up of a consultation interface (concordancer), and of a language detection tool, an electronic dictionary and a part-of-speech tagger. The first achievements regarding these topics have already been reached and are presented in this article. Some elements are also available on our project page (http://bdlc.univ-corse.fr/tal/). Keywords: less-resourced languages, Corsican, corpora, lexical resources, language identification, POS tagging We will first contextualize our research and present some a polynomic approach (Marcellesi, 1984) that encompasses general information about the Corsican language (1.1., its all dialectal variants, the writing of the language is not stan- digital presence (1.2.) and the BDLC project (1.3.), as pre- dardised, which is an obstacle for its automatic processing. viously done in Kevers et al. (2019) and Kevers and Retali- Medori (2020). Not without having drawn up a brief state 1.2. Digital presence of the art (2.1.), we will present the achievements concern- Nowadays, Corsican is, with French, part of a diglossic lan- ing the NLP resources and tools for Corsican (2.2.). In par- guage environment, and its use is declining. The devel- ticular, we will address corpora collection (2.3.), the set up opment of tools is necessary for its preservation, enhance- of a consultation interface in the form of a concordancer ment, transmission and promotion1. Public policy encour- (2.4.) and of a language detection tool (2.5.), the electronic ages the exploitation of new technologies to this end. Sev- dictionary building (2.6.) and POS tagging (2.7.). These eral tools and resources exist for the learning and linguistic developments are guided by a roadmap that we have de- description of Corsican dialects, but their inclusion in the fined (Kevers et al., 2019) and that is largely in line with digital humanities domain remains insufficient2. the steps proposed by Ceberio Berger et al. (2018). In particular, sites and applications dedicated to translation, 1. The Corsican language and its digital lexicon and syntax contain little data in comparison with the richness and complexity of the language. On the other presence hand, this wealth is found on databases such as the Banque 1.1. Corsican language de Donnees´ Langue Corse (BDLC) and Infcor3 (Banca di Corsican is a Latin language and is part of the Italo- dati di a lingua corsa). The latter was conceived in an Romance domain. It has known various contacts and associative context by ADECEC, an organisation active in linguistic influences. Due to the Pisan domination (9th- the field of Corsican language and culture. It has been de- 13th centuries), it is particularly linked to medieval Tuscan veloped and was made available to the general public as which constitutes its superstratum. Corsican has borrowed a website as well as a smartphone application wich offer 4 from other Italo-Romance and even Romance varieties as numerous lexical records and multi-criteria query modes . well as from Germanic and Arabic languages. Regarding the BDLC, it is a tool designed in a scientific From a dialectal point of view, four or even five ar- context, associated with the production of a linguistic atlas, eas are identifiable (Dalbera-Stefanaggi, 2002; Dalbera- the NALC. Stefanaggi, 2007). The extreme southern area even crosses the borders of Corsica, extending into Gallura, in the north 1Recommendation of the UNESCO (Brenzinger et al., 2003) of Sardinia. However, these five areas constitute a contin- 2Corsican language can be found in various forms on the uum and do not prevent interunderstanding between speak- web: interfaces in Corsican language (Qwant and Google search ers, even with those of the central and southern varieties of engines, Facebook social network), websites or application like Italy. Google Translate, Wikipedia or Wiktionary, smartphone apps for tourism, education or translation, online educational content such The genetic and historical relation between Corsican and as Corsican language learning websites, blogs about Corsican lan- Tuscan languages led the latter to be the writing language guage and culture. of the islanders from the Middle Ages until the emergence 3http://infcor.adecec.net of a conscious writing in Corsican language in the 19th 4Altough all the lexical data would require a revision of the century. The spelling of Corsican is therefore, with some information related to variation, meanings, morphology and ety- adaptations, based on the Italian graphic system (Retali- mology, the material contained in this database is essential and Medori, 2015). However, despite the implementation of could help in the development of new tools. 2726 1.3. The BDLC project els) or durm`ıa / durmia / “he was sleeping” (hiatus The Nouvel Atlas Linguistique et ethnographique de la accentuation); Corse (NALC) was initiated by the CNRS in 1975 and transmitted to Marie-Jose´ Dalbera-Stefanaggi in 1981 at the 2. Development of NLP resources and tools opening of the University of Corsica. In 1986, in response for Corsican to a request from the Regional Assembly of Corsica, the 2.1. State of the art 5 Banque de Donnees´ Langue Corse (BDLC) was created To our knowledge, there are very few resources and tools 6 and was naturally linked to the atlas . designed for Natural Language Processing (NLP) in Corsi- The NALC-BDLC hosts linguistic data related to Corsi- can. The ELDA 2014 report on linguistic resources dedi- can know-how and cultural traditions throughout the is- cated to the languages of France (Leixa et al., 2014) lists land. During field surveys with local speakers, these data 93 resources for Corsican. More than a third of these are 7 are collected through thematic questionnaires made up recordings and transcriptions from the BDLC project. The 8 of French wordlists : for example la vigne (“the vine”), rest is made up of various documents: blogs, scientific pa- le cep (“the vine stock”), tailler la vigne (“pruning the pers, institutional sites, newspapers sites, etc. There are vine”), le tonneau (“the barrel”). Based on a question also some lexicons, including Infcor or the Corsican ver- such as Cumu si dici “tailler la vigne” in corsu? (“How sion of the Wiktionary. In addition to this inventory, there 9 do we say ‘to prune the vine’ in Corsican language?” ), are a few other contributions, including the Speaking Atlas the corresponding translations into Corsican are recorded of the Regional Languages of France (Boula de Mareuil¨ et and a semi-directed interview entirely in Corsican, and re- al., 2018), the W2C corpus (Majlis and Zabokrtsky,´ 2012), lating to practices, is undertaken in order to collect eth- the BabelNet semantic network (Navigli and Ponzetto, notexts (testimonies). This data is then processed, lin- 2012), which offers a number of units in Corsican, and the guistically analysed and put online on the website http: Corsican resources created within the Crubad´ an´ Project by //bdlc.univ-corse.fr. Scannell (2007). Except for the last three, the majority of If this database constitutes a real asset for the development available resources are not directly usable for NLP. of NLP applied to Corsican, one of the major difficulties Corsican therefore falls into the category of less-resourced comes from the rich variation characterising the Corsican languages. These languages are an active area of re- language. According to the objects, significant lexical vari- search. The TALN conference hosted several events ded- ations are documented: for example, 25 lemmas were col- icated to this issue, including the workshop Traitement au- lected to describe the act of pricking the vine out. These tomatique des langues minoritaires et des petites langues lemmas will in turn be affected by variable transcriptions, (Streiter, 2003), as well as the TALaRE workshops (Morin particularly as a result of the non-standardisation of the and Esteve,` 2013; Vergez-Couret et al., 2015). Similarly, Corsican language and as a production carried out by vari- the LREC conference hosted multiple workshops, the most ous transcribers during the 30 years of the project existence. recent of which are SaLTMiL (Alegria et al., 2010; De Pauw The different writing choices meet objectives such as: et al., 2012) and CCURL (Pretorius et al., 2014; Soria et • keep the dialectal richness, for example to name ”the al., 2016; Soria et al., 2018). Recently, the TAL journal also jar”, according to the pronunciation in the localities published a thematic issue on the subject (Bernhard and So- surveyed, we will find the forms cerra and gerra from ria, 2018). However, the place of the Corsican language in the same etymology; these publications is almost non-existent. • in some cases, explicitly indicate the placement of the 2.2. Initiatives for Corsican tonic accent as well as the opening of the vowels, Given this observation, we have decided to work to im- for example in trovula´ rather than trovula / “bowl”, prove the situation of the Corsican language with regard compulu` / compulu / “shelter”, pergula` / pergula / “ar- to its place in the digital world, and more particularly in the bour”, tepidu´ / tepidu / “tepid” (proparoxytones vow- field of Natural Language Processing.