Building an Old Occitan Corpus Via Cross-Language Transfer
Total Page:16
File Type:pdf, Size:1020Kb
Building An Old Occitan Corpus via Cross-Language Transfer Olga Scrivner Sandra Kubler¨ Indiana University Indiana University Bloomington, IN, USA Bloomington, IN, USA [email protected] [email protected] Abstract the goal of this project is to implement a resource- light approach, exploiting existing resources and This paper describes the implementation of common characteristics shared by Romance lan- a resource-light approach, cross-language guages. transfer, to build and annotate a histori- It is well known that Romance languages share cal corpus for Old Occitan. Our approach transfers morpho-syntactic and syntactic many lexical and syntactic properties. The fol- annotation from resource-rich source lan- lowing example illustrates the similarity in word guages, Old French and Catalan, to a ge- order and lexicon of Old Occitan, Old Catalan, netically related target language, Old Oc- and Old French2: citan. The present corpus consists of three sub-corpora in XML format: 1) raw text; 2) (1) Oc: Dedins la cambra son part-of-speech tagged text; and 3) syntacti- Cat: Dins la cambra son´ cally annotated text. Fr: Dans la chambre elles sont vengudas, dejosta lui son 1 Introduction vingudas, davant ell son´ entrees,´ devant lui elles se sont In the past decade, a number of annotated corpora assegudas. have been developed for Medieval Romance lan- assegudas. guages, namely corpora of Old Spanish (Davies, assises. 2002), Old Portuguese (Davies and Ferreira, ’They came into the room, they sat down 2006), and Old French (Stein, 2008; Martineau, next to him.’ 2010). However, annotated data are still sparse for less-common languages, such as Old Occi- Several recent experiments have demonstrated tan. For example, the only available electronic that genetically related languages can share database “The Concordance of Medieval Occi- their knowledge through cross-language transfer 1 tan” , published in 2001, is not free and is limited (Hana et al., 2006; Feldman and Hana, 2010) to lexical search. between closely related languages. That is, a While the majority of historical corpora are resource-rich language can be used to process built by means of specific tools developed for unannotated data in other genetically related lan- each language and project, such as TWIC for guages. For example, Hana et al. (2006) have NCA (Nouveau Corpus d’Amsterdam) (Stein, used Spanish morphological and lexical data for 2008), or a probabilistic parser for the GTRC automatic tagging of Brazilian Portuguese. While project (MCVF Corpus) (Martineau et al., 2007), the idea of cross-language transfer is not new and 1 http://www.digento.de/titel/100553. 2Catalan and French translation is ours. We approxi- html mated Old Catalan and Old French word order as far as pos- sible. 392 Proceedings of KONVENS 2012 (LThist 2012 workshop), Vienna, September 21, 2012 is mainly used with parallel corpora and large bilingual lexicons (Yarowsky and Ngai, 2001; Hwa et al., 2005), experiments by Hana et al. (2006) have demonstrated the usability of this method in situations where there are no parallel corpora but instead resources for a closely related language. Our goal is to build a corpus of Old Occitan in a resource-light manner, by using cross-language transfer. This approach will be used not only for part-of-speech tagging, but also for syntactic an- notation. The organization of the remainder of the pa- per is as follows: Section 2 provides a brief de- scription of Old Occitan. Section 3 reviews the Figure 1: Linguistic map of France, from (Bec, 1973) concept of a resource-light approach in corpus linguistics. Section 4 provides details on corpus pre-processing. The methods for cross-linguistic part-of-speech (POS) tagging and cross-linguistic parsing are described in Sections 5 and 6. Finally, Case Old Occitan Old French the conclusions and directions for further work (2) Nominative lo murs li murs are presented in Section 7. Accusative lo mur lo mur 2 Old Occitan (Provenc¸al) On the other hand, Occitan has syntactic traits similar to Catalan, such as a relatively free word Occitan, often referred to as Provenc¸al, consti- order and null subjects, illustrated in (3) and (4). tutes an important element of the literary, lin- guistic, and cultural heritage in the history of (3) Gran honor nos fai Romance languages. Provenc¸al (Occitan) poetry great honor usDat does was a predecessor of French lyrics. Moreover, ‘He grants us a great honor’ (Old Occitan - Occitan was the only administrative language in Flamenca) Medieval France, besides Latin (Belasco, 1990). While the historical importance of this language (4) molt he gran desig is indisputable, Occitan, as a language, remains much have big desire linguistically understudied. Compared to Old ‘I have a lot of great desire...’ (Old Catalan - French, Provenc¸al is still lacking digitized copies Ramon Llull)3 of scanned manuscripts, as well as annotated cor- pora for morpho-syntactic or syntactic research. The close relationship between these three lan- Typologically, Old Occitan is classified as one guages is also marked geographically. The north- of the Gallo-Roman languages, together with ern border of the Occitan-speaking area is ad- French and Catalan (Bec, 1973). If one examines jacent to the French linguistic domain, whereas Old Occitan, Old French, and Old Catalan, on the in the south, Occitan borders on the Catalan- one hand, it is striking how many lexical and mor- speaking area, as shown in Figure 1. phological characteristics these languages share. This project focuses on the 13th century Old For example, French and Occitan have rich verbal Occitan romance “Flamenca”, from the edition inflection and a two-case nominal system (nomi- by Meyer (1901). Apart from a very intrigu- native and accusative), illustrated with an exam- ing fable of beautiful Flamenca imprisoned in a ple of the word ‘wall’ in (2): tower by her jealous husband, this story presents a 3http://orbita.bib.ub.edu/ramon/velec. asp 393 Proceedings of KONVENS 2012 (LThist 2012 workshop), Vienna, September 21, 2012 very interesting linguistic document consisting of That is, words in a source language are mapped 8097 lines of the “universally acknowledged mas- to words in a target language. Dien et al. (2004) terpiece of Old Occitan narrative” (Fleischmann, used this method on English-Vietnamese corpus. 1995). Multiple styles, such as internal mono- They extracted a syntactic tree set from English logues, dialogues and narratives, provide a rich and transferred it into the target language. Ob- lexical, morphological and syntactic database of tained two sets of parsed trees, English and Viet- a language spoken in southern France. namese, were further used as training data to ex- tract transfer rules. In contrast, Hanneman et 3 Linguistic Annotation via al. (2009) extracted unique grammar rules from Cross-Language Transfer English-French parallel parsed corpus and se- Corpus-based approaches often require a great lected high-frequency rules to reorder position amount of parallel data or manual labor. In con- of constituents. Hwa et al. (2005) describe an trast, the cross-language transfer, as proposed by approach that focuses on syntax projection per Hana et al. (2006), is a resource-light approach. se, but their approach also relies on word align- That is, this method does not involve any re- ment in a parallel corpus. They show that the sources in the target language, neither training approach works better for closely related lan- data, a large lexicon, nor time-consuming manual guages (English to Spanish) than for languages annotation. as different as English and Chinese. It is widely While cross-language transfer has been previ- agreed that word alignment and thus, syntac- ously applied to languages with parallel corpora tic transfer, is best applied in similar languages and bilingual lexica (Yarowsky and Ngai, 2001; due to their word order pattern (Watanabe and Hwa et al., 2005), Hana et al. (2006) introduced Sumita, 2003). Therefore, genetically related Ro- a method in the area where these additional re- mance languages should be well suited for syn- sources are not available. Feldman and Hana tactic cross language transfer. McDonald et al. (2010) performed several experiments with Ro- (2011) and Naseem et al. (2012) describe novel mance and Slavic languages. The only resources approaches that use more than one source lan- they used were i) POS tagged data of the source guage, reaching results similar to those of a su- language, ii) raw data in the target language, and pervised parser for the source language. iii) a resource-light morphological analyzer for While cross-language transfer has been applied the target language. For POS tagging, the Markov successfully to modern languages, we decided to model tagger TnT (Brants, 2000) was trained on use it to transfer linguistic annotation to a his- a source language, namely Spanish and Czech, in torical corpus. The choice of source languages order to obtain transition probabilities. The rea- was based on the availability of annotated re- soning is that the word order patterns of source sources and the similarity of language character- and target languages are very similar so that given istics. Thus, Old French corpus (Martineau et al., the same tagset, the transition probabilities should 2007) was selected as a source for the morpho- be similar, too. Since the languages differ in their syntactic annotation of Occitan. However, to morphological characteristics, a direct transfer of transfer syntactic information, we used the Cata- the lexical probabilities was not possible. Instead, lan dependency treebank (Civit et al., 2006) since a shallow morphological analyzer was developed modern Catalan displays a pro-drop feature and a for the target languages, using cognate informa- relatively free word order, similarly to Old Occi- tion, among other similarities. The trained mod- tan. els were then applied to the target languages, Por- 4 Corpus Pre-Processing tuguese, Catalan, and Russian.