Alsatian, Occitan and Picard

Corpora with Part-of-Speech Annotations for Three Regional Languages of France: Alsatian, Occitan and Picard

Delphine Bernhard¹, Anne-Laure Ligozat², Fanny Martin³, Myriam Bras⁴, Pierre Magistry⁵, Marianne Vergez-Couret⁶, Lucie Steiblé¹, Pascale Erhart¹, Nabil Hathout⁴, Dominique Huck¹, Christophe Rey⁷, Philippe Reynés³, Sophie Rosset⁵, Jean Sibille⁴, Thomas Lavergne⁸

¹ LiLPa, Université de Strasbourg, France
² LIMSI, CNRS, ENSIIE, Université Paris-Saclay, F-91405 Orsay, France
³ Laboratoire Habiter le Monde - HM - EA 4278, Université de Picardie Jules Verne, Amiens, France
⁴ CLLE, Université de Toulouse, CNRS, UT2J, France
⁵ LIMSI, CNRS, Université Paris-Saclay, F-91405 Orsay, France
⁶ Queen's University, Belfast
⁷ LT2D (Lexiques, Textes, Discours, Dictionnaires), Université de Cergy-Pontoise, IUF
⁸ LIMSI, CNRS, Univ. Paris-Sud, Université Paris-Saclay, F-91405 Orsay, France

Abstract

This article describes the creation of corpora with part-of-speech annotations for three regional languages of France: Alsatian, Occitan and Picard. These manual annotations were performed in the context of the RESTAURE project, whose goal is to develop resources and tools for these under-resourced French regional languages. The article presents the tagsets used in the annotation process as well as the resulting annotated corpora.

Keywords: corpus, annotation, part-of-speech, Alsatian, Occitan, Picard

1. Introduction

Since the constitutional amendment (Article 75-1) published in 2008, regional languages are officially part of the heritage of France, although the only official language in France is French (Article 2). As such, it is essential to implement modern means for their preservation and transmission, relying particularly on digital technologies. Regional languages of France can be considered low-resourced, that is to say there are few or no electronic resources (corpora, lexicons, dictionaries) and tools. All languages with few resources have in common that their computerisation has a low financial profitability which does not compensate for considerable development costs. However, endowing these languages with electronic resources and tools is a major concern for their dissemination, protection and teaching (including for new speakers). Automatic tools help ensure data collection (scanning), storage in standardized formats, categorization and retrieval. Moreover, the availability of digital data and automatic processing tools can transform the attitude of speakers towards regional languages, in particular by increasing the use of written material, which often remains marginal. In a broader perspective, the diversity of world languages would be better preserved and the amount of data available to researchers in human and social sciences (linguistics, sociology, anthropology, literature, history, ...) would increase (Soria et al., 2013).

The overall objective of the RESTAURE¹ project is to provide computational resources and processing tools for three regional languages of France: Alsatian, Occitan and Picard. The three of them belong to the languages of France listed in Cerquiglini's 1999 report (Cerquiglini, 1999), which inventories the regional languages of France within the meaning of the European Charter for Regional or Minority Languages. They have no official status in France and, as such, have suffered from a lack of institutional support until recently. The goal of this project is to bring these languages to the fore and foster NLP research on them. Processing natural language is complex and the development of NLP tools requires significant resources, both human and financial. This explains the lack of such tools for regional languages of France.

In this work, we present the methodology used to create corpora annotated with part-of-speech information for the three regional languages considered in RESTAURE. The main challenge for writing the annotation guidelines was the lack of comprehensive grammatical descriptions encompassing all the dialectal variants found in our corpora. In addition, the annotation triggered further discussion on tokenisation issues and the use of POS tagsets and taggers developed for closely related languages. The whole process was made possible thanks to the close cooperation between linguists and NLP specialists, as well as the parallel and collaborative work on three different languages facing similar challenges.

¹ http://restaure.unistra.fr/

2. Description of Alsatian, Picard and Occitan

In this section, we briefly describe the three French regional languages considered in the project, in particular with respect to their morpho-syntactic properties.

2.1. Alsatian

Alsatian is spoken in North-Eastern France and is part of the High German dialects, which are subdivided into Central German and Upper German. The majority of the Alsatian dialects belong to (Low) Alemannic, an Upper German dialect. A small part of the dialectal space (North-West) belongs to Central German Rhine Franconian. Like all dialectal, phonological and, partially, lexical spaces, the Alsatian dialects are characterized by spatial variation, which is the main characteristic of a dialect. Since the second half of the 20th century, the writers and speakers of Alsatian have had, in their vast majority, a plurilingual repertoire, with French taking more and more importance and driving them to use, in Alsatian, linguistic strategies and calques (loan translations) from French. The fundamental morphosyntactic characteristics are little affected by this phenomenon. They are common to all Alsatian dialects and are very similar to those of standard German. Finite verbs are marked with morphemes of tense, mode and person-number, and noun phrases are marked with number, case and gender. The tense and mode system is much simpler than in standard German and there is only one person-number marker for the plural persons (Huck, to appear). However, the surface form of the morphemes can present intradialectal variations.

The Alsatian dialects are used mainly orally and written production is limited to some literary works (poetry, theater plays), linguistic descriptions (dictionaries, lexicons), small contributions to otherwise French publications (chronicles in newspapers) and online texts (Wikipedia, social networks). What is more, several spelling conventions have been proposed, but none of them can be considered a widely accepted and used standard.

2.2. Occitan

Occitan, or Oc language, is spoken in a large area in the south of France, in several valleys of Italy and the Aran valley in Spain. Occitan is not a unitary language: it has several varieties, organized in 6 large dialects (Auvernhàs, Gascon, Lengadocian, Lemosin, Provençau, and Vivaro-aupenc). It is a Romance language: as such, it shares many morpho-syntactic properties with other Romance languages (e.g., number and gender inflection marks on all the items of the noun phrase; tense, person, number inflection marks on finite verbs). [...] dialects (e.g., verbal inflection varies from one dialect to the other). As far as spelling is concerned, Occitan is not standardized as a whole but has two major spelling standards: the classical system, inspired by the troubadours' medieval spelling, and another system, closer to French conventions (Sibille, 2002).

2.3. Picard

The linguistic area of Picard includes the Hauts-de-France administrative region and the Hainaut province in Belgium. Picard is an oïl language, which also belongs to the larger Romance language group. It differs from French in several respects. Word order can be different: for example, il o foait keud assé in Picard translates to il a fait assez chaud (it has been quite hot). In French, the adverb is placed before the adjective, while in Picard the adjective is placed before the adverb. Even if Picard and Occitan are both Romance languages, Picard is closer to French concerning inflection. As in French and the other langues d'oïl, gender and number are mainly marked in Picard by means of determiners at the level of the noun phrase. Another device is shared with the other langues d'oïl: in the verbal phrase, a subject personal pronoun must be used in order to express personal rank in a deflective way.

Picard does not have a unitary standardized spelling system and Picard texts can contain a dot which must not be considered as a word boundary, e.g., lon.mint (for a long time), erwet.tent (look), fin.mes (women). More often than in French, words can also contain apostrophes, as in c'min (path), and hyphens, as in gardin-neux (gardeners).

3. Elaboration of the Tagsets

As we wanted to exploit the proximity to better-resourced languages, an important issue is that of the tagset, i.e. the list of part-of-speech categories used for the manual and, afterwards, automatic annotation. One solution would have been to use the tagsets from annotated corpora and part-of-speech annotation tools for closely related languages, such as the German TreeTagger or Stanford Tagger for Alsatian. However, these tagsets are usually very detailed: the German TreeTagger and Stanford Tagger use a set of 54 tags,² the French TreeTagger identifies 33 tags³ etc. Such a level of detail is not necessarily needed and entails several drawbacks: higher cost for training annotators and reduced performance for the part-of-speech (POS) taggers, which have to discriminate between very similar categories. Furthermore, we wanted to create corpora annotated with the same standard tagset if possible, in order to facilitate the diffusion of the corpora, to be able to compare our experiments with state-of-the-art work and to enable comparison between the languages of the project. We thus chose to base the tagsets for Alsatian, Occitan and Picard on the universal POS tags defined in the context of the Universal Dependencies project (Nivre et al., 2016).⁴ A first issue was to
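
The Picard spelling conventions described in Section 2.3 have a direct consequence for tokenisation: a dot, apostrophe or hyphen that sits between two letters should stay inside the word. The following Python sketch illustrates one possible way to encode that rule with a regular expression. The pattern and the `tokenize` function are illustrative assumptions; they are not the tokeniser used in the RESTAURE project.

```python
import re

# Assumption: a word is a run of letters that may contain internal dots,
# apostrophes or hyphens, provided each of them is followed by a letter.
# [^\W\d_] matches any Unicode letter (word character minus digits and "_").
WORD = r"[^\W\d_]+(?:[.'’\-][^\W\d_]+)*"

# A token is a word, a run of digits, or a single punctuation character.
TOKEN_RE = re.compile(rf"{WORD}|\d+|[^\w\s]")


def tokenize(text):
    """Split text into tokens, keeping word-internal '.', "'" and '-'."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    for sample in ["lon.mint", "c'min", "gardin-neux", "erwet.tent, fin.mes."]:
        print(sample, "->", tokenize(sample))
```

With this pattern, a sentence-final dot is split off because no letter follows it, whereas the dot in lon.mint is kept. The sketch has known limits: it would also merge two words when a full stop is directly followed by a letter without a space, and it keeps elided clitics attached to the following word, so a real tokeniser would probably need lexical information in addition to the pattern.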
