<<

Building An Corpus via Cross-Language Transfer

Olga Scrivner Sandra Kubler¨ Indiana University Indiana University Bloomington, IN, USA Bloomington, IN, USA [email protected] [email protected]

Abstract the goal of this project is to implement a resource- light approach, exploiting existing resources and This paper describes the implementation of common characteristics shared by Romance lan- a resource-light approach, cross-language guages. transfer, to build and annotate a histori- It is well known that share cal corpus for Old Occitan. Our approach transfers morpho-syntactic and syntactic many lexical and syntactic properties. The fol- annotation from resource-rich source lan- lowing example illustrates the similarity in word guages, and Catalan, to a ge- order and lexicon of Old Occitan, , netically related target language, Old Oc- and Old French2: citan. The present corpus consists of three sub-corpora in XML format: 1) raw text; 2) (1) Oc: Dedins la cambra son part-of-speech tagged text; and 3) syntacti- Cat: Dins la cambra son´ cally annotated text. Fr: Dans la chambre elles sont vengudas, dejosta lui son 1 Introduction vingudas, davant ell son´ entrees,´ devant lui elles se sont In the past decade, a number of annotated corpora assegudas. have been developed for Medieval Romance lan- assegudas. guages, namely corpora of (Davies, assises. 2002), Old Portuguese (Davies and Ferreira, ’They came into the room, they sat down 2006), and Old French (Stein, 2008; Martineau, next to him.’ 2010). However, annotated data are still sparse for less-common languages, such as Old Occi- Several recent experiments have demonstrated tan. For example, the only available electronic that genetically related languages can share database “The Concordance of Medieval Occi- their knowledge through cross-language transfer 1 tan” , published in 2001, is not free and is limited (Hana et al., 2006; Feldman and Hana, 2010) to lexical search. between closely related languages. That is, a While the majority of historical corpora are resource-rich language can be used to process built by means of specific tools developed for unannotated data in other genetically related lan- each language and project, such as TWIC for guages. For example, Hana et al. (2006) have NCA (Nouveau Corpus d’Amsterdam) (Stein, used Spanish morphological and lexical data for 2008), or a probabilistic parser for the GTRC automatic tagging of . While project (MCVF Corpus) (Martineau et al., 2007), the idea of cross-language transfer is not new and 1 http://www.digento.de/titel/100553. 2Catalan and French translation is ours. We approxi- html mated Old Catalan and Old French word order as far as pos- sible.

392

Proceedings of KONVENS 2012 (LThist 2012 workshop), Vienna, September 21, 2012 is mainly used with parallel corpora and large bilingual lexicons (Yarowsky and Ngai, 2001; Hwa et al., 2005), experiments by Hana et al. (2006) have demonstrated the usability of this method in situations where there are no parallel corpora but instead resources for a closely related language. Our goal is to build a corpus of Old Occitan in a resource-light manner, by using cross-language transfer. This approach will be used not only for part-of-speech tagging, but also for syntactic an- notation. The organization of the remainder of the pa- per is as follows: Section 2 provides a brief de- scription of Old Occitan. Section 3 reviews the Figure 1: Linguistic map of France, from (Bec, 1973) concept of a resource-light approach in corpus linguistics. Section 4 provides details on corpus pre-processing. The methods for cross-linguistic part-of-speech (POS) tagging and cross-linguistic parsing are described in Sections 5 and 6. Finally, Case Old Occitan Old French the conclusions and directions for further work (2) Nominative lo murs li murs are presented in Section 7. Accusative lo mur lo mur 2 Old Occitan (Provenc¸al) On the other hand, Occitan has syntactic traits similar to Catalan, such as a relatively free word Occitan, often referred to as Provenc¸al, consti- order and null subjects, illustrated in (3) and (4). tutes an important element of the literary, lin- guistic, and cultural heritage in the history of (3) Gran honor nos fai Romance languages. Provenc¸al (Occitan) poetry great honor usDat does was a predecessor of French lyrics. Moreover, ‘He grants us a great honor’ (Old Occitan - Occitan was the only administrative language in Flamenca) Medieval France, besides (Belasco, 1990). While the historical importance of this language (4) molt he gran desig is indisputable, Occitan, as a language, remains much have big desire linguistically understudied. Compared to Old ‘I have a lot of great desire...’ (Old Catalan - French, Provenc¸al is still lacking digitized copies Ramon Llull)3 of scanned manuscripts, as well as annotated cor- pora for morpho-syntactic or syntactic research. The close relationship between these three lan- Typologically, Old Occitan is classified as one guages is also marked geographically. The north- of the Gallo-Roman languages, together with ern border of the Occitan-speaking area is ad- French and Catalan (Bec, 1973). If one examines jacent to the French linguistic domain, whereas Old Occitan, Old French, and Old Catalan, on the in the south, Occitan borders on the Catalan- one hand, it is striking how many lexical and mor- speaking area, as shown in Figure 1. phological characteristics these languages share. This project focuses on the 13th century Old For example, French and Occitan have rich verbal Occitan romance “Flamenca”, from the edition inflection and a two-case nominal system (nomi- by Meyer (1901). Apart from a very intrigu- native and accusative), illustrated with an exam- ing fable of beautiful Flamenca imprisoned in a ple of the word ‘wall’ in (2): tower by her jealous husband, this story presents a 3http://orbita.bib.ub.edu/ramon/velec. asp

393

Proceedings of KONVENS 2012 (LThist 2012 workshop), Vienna, September 21, 2012 very interesting linguistic document consisting of That is, words in a source language are mapped 8097 lines of the “universally acknowledged mas- to words in a target language. Dien et al. (2004) terpiece of Old Occitan narrative” (Fleischmann, used this method on English-Vietnamese corpus. 1995). Multiple styles, such as internal mono- They extracted a syntactic tree set from English logues, dialogues and narratives, provide a rich and transferred it into the target language. Ob- lexical, morphological and syntactic database of tained two sets of parsed trees, English and Viet- a language spoken in southern France. namese, were further used as training data to ex- tract transfer rules. In contrast, Hanneman et 3 Linguistic Annotation via al. (2009) extracted unique grammar rules from Cross-Language Transfer English-French parallel parsed corpus and se- Corpus-based approaches often require a great lected high-frequency rules to reorder position amount of parallel data or manual labor. In con- of constituents. Hwa et al. (2005) describe an trast, the cross-language transfer, as proposed by approach that focuses on syntax projection per Hana et al. (2006), is a resource-light approach. se, but their approach also relies on word align- That is, this method does not involve any re- ment in a parallel corpus. They show that the sources in the target language, neither training approach works better for closely related lan- data, a large lexicon, nor time-consuming manual guages (English to Spanish) than for languages annotation. as different as English and Chinese. It is widely While cross-language transfer has been previ- agreed that word alignment and thus, syntac- ously applied to languages with parallel corpora tic transfer, is best applied in similar languages and bilingual lexica (Yarowsky and Ngai, 2001; due to their word order pattern (Watanabe and Hwa et al., 2005), Hana et al. (2006) introduced Sumita, 2003). Therefore, genetically related Ro- a method in the area where these additional re- mance languages should be well suited for syn- sources are not available. Feldman and Hana tactic cross language transfer. McDonald et al. (2010) performed several experiments with Ro- (2011) and Naseem et al. (2012) describe novel mance and Slavic languages. The only resources approaches that use more than one source lan- they used were i) POS tagged data of the source guage, reaching results similar to those of a su- language, ii) raw data in the target language, and pervised parser for the source language. iii) a resource-light morphological analyzer for While cross-language transfer has been applied the target language. For POS tagging, the Markov successfully to modern languages, we decided to model tagger TnT (Brants, 2000) was trained on use it to transfer linguistic annotation to a his- a source language, namely Spanish and Czech, in torical corpus. The choice of source languages order to obtain transition probabilities. The rea- was based on the availability of annotated re- soning is that the word order patterns of source sources and the similarity of language character- and target languages are very similar so that given istics. Thus, Old French corpus (Martineau et al., the same tagset, the transition probabilities should 2007) was selected as a source for the morpho- be similar, too. Since the languages differ in their syntactic annotation of Occitan. However, to morphological characteristics, a direct transfer of transfer syntactic information, we used the Cata- the lexical probabilities was not possible. Instead, lan dependency treebank (Civit et al., 2006) since a shallow morphological analyzer was developed modern Catalan displays a pro-drop feature and a for the target languages, using cognate informa- relatively free word order, similarly to Old Occi- tion, among other similarities. The trained mod- tan. els were then applied to the target languages, Por- 4 Corpus Pre-Processing tuguese, Catalan, and Russian. Tagging accura- cies for Catalan, Portuguese, and Russian yielded The romance ‘Flamenca’ is available in scanned 70.7%, 77.2%, and 78.6% respectively. images format, therefore, the initial step included In contrast, syntactic transfer is mainly used conversion to an electronic version via OCR and in machine translation. This approach requires manual correction. Figure 2 shows a sample of a bilingual corpus aligned on a sentence level. the manuscript.

394

Proceedings of KONVENS 2012 (LThist 2012 workshop), Vienna, September 21, 2012 Corrections by the editor were omitted. As can Tag Description ADJ adjective be seen in the first line of the document (see Fig- ADJR comparative form of adjective ure 2), the editor (Meyer, 1901) enclosed a silent ADV adverb ADVR comparative form of adverb letter ‘s’ in brackets which we have excluded from AG gerundive of auxiliary ’to have’ our pre-processed text. In addition, we detached AJ present of auxiliary ’to have’ APP past participle of auxiliary ’to have’ the clitic pronouns that are joined to the verbs, as AX infinitive of auxiliary ’to have’ in (5), as shown in (6). CONJO coordinative conjunction CONJS subordinate conjunction (5) Per son anel dominim manda Que COMP comparative adverb D determiner (indefinite, definite, demonstrative) for his ring coat of arms sends that DAT dative Flamenca penra sim voil. DZ possessive determiner EG gerundive of auxiliary ’to be’ Flamenca takes if me want EJ present of auxiliary ’to be’ ‘He is sending his family ring as a guarantee EPP past participle of auxiliary ’to be’ EX infinitive of auxiliary ’to be’ that, if I want, he will marry Flamenca’ ITJ interjection MDG gerundive of modal verb (6) Per son anel domini m manda Que MDJ present of modal verb MDPP past participle of modal verb Flamenca penra si m voil. MDX infinitive of modal verb NCPL noun common plural NCS noun common singular NEG negation NPRPL noun proper plural NPRS noun proper singular NUM numeral P preposition PON punctuation inside the clause PONFP the end of the sentence PRO pronoun Q quantifier VG gerundive of the main verb VJ present of the main verb VPP past participle of the main verb VX infinitive of the main verb WADV interrogative, relative or exclamative adverb Figure 2: Sample of page from ‘Flamenca’ WD interrogative, relative or exclamative determiner WPRO interrogative, relative or exclamative pronoun

Currently, we have pre-processed and format- Table 1: Occitan tagset ted 3 095 lines, which corresponds to 12 573 to- kens. The text file was then converted to XML format using EXMARaLDA4. While EXMAR- aLDA is mostly used for transcriptions, it also im- ports files from several formats, such as plain text English (PPCME) (Kroch and Taylor, 2000), or tab format, and exports them as EXMARaLDA which was modified to represent French morpho- XML files. This XML is timeline-based and sup- syntactically, as illustrated in (7). ports the annotation of different linguistic levels (7) Les/D petites/ADJ filles/NCPL in different tiers. the/Det. little/Adj. girls/Noun-Cmn-pl. 5 Cross-Linguistic POS Tagging ‘the little girls’ While the MCVF tagset consists of 55 tags, we Since Old French and Old Occitan share many have decreased the tagset to 39 tags for our cor- morphological features, we have adopted the pus. The Occitan tagset is shown in Table 1. The POS tagset from the MCVF corpus of Old simplification included joining certain subclasses French (Martineau et al., 2007). The MCVF into one class. The reason for this modification tagset is based on the annotation scheme of lies in the particularities of Occitan. First, as a the Penn-Helsinki Parsed Corpus of Middle pro-drop language, Occitan omits an impersonal 4www.exmaralda.org pronoun (8), in contrast to Old French (9).

395

Proceedings of KONVENS 2012 (LThist 2012 workshop), Vienna, September 21, 2012 (8) No m’ o cal dir Accuracy N% Not me itAcc must say All Words 640/1000 64.00 ‘it is not necessary for me to tell this’ Known Words 539/742 72.64 Unknown Words 101/258 39.15 (9) Que te faut il en ce Table 2: POS evaluation using the unmodified Me- what youAcc must itimpersonal in this dieval French lexicon pais? country ‘What do you need in this country?’ (MCVF corpus) reached an accuracy of 64.00% for all words and Furthermore, the grammar of Old Occitan 72.64% for known words, cf. Table 2. (Anglade, 1921) does not use “near future” tense, A manual analysis of randomly selected 20 which is common in Old French and is formed by sentences revealed that a number of errors are the verb aller ‘to go’ and the infinitive of a main caused by lexical duplicates that have different verb (10). Therefore, the specific labels LJ, LX, meanings in each language. For example, in (13) LPP for the auxiliary aller from the corpus of Old no is a possessive determiner ‘our’ in Old French, French are mapped to the corresponding tags of while in Occitan no is a negation. The second the main verb, such as VJ, VX, VPP, VG (see Ta- type of errors is the result of TnT’s algorithm for ble 1). handling unknown words by a suffix trie. That is, unknown words are assigned to an ambiguity (10) Et que iroi ge faire? class depending on their suffix. For example, the and what will go I do unknown Occitan word ancar ‘yet’ is recognized ‘What am I going to do?’ as an infinitive (VX), based on the ending -ar which is common for French infinitives; whereas Finally, the French tagset contains a label FP entremes ‘involves’ receives higher probability as for focus particle, such as seullement/seulement an adjective (ADJ) because of its ending -es (see and ne...que ‘only’ (11). The following compari- (13)) . son shows that the latter construction ne..que does not convey focus in Old Occitan (12): (13) Ancar d’ amor no s’ TnT: VX P NCS DZ ADV (11) le jeune homme ne le fait que gold: ADV P NCS NEG PRO the young man not it does only entremes pour l’ avarice ADJ for the greed VJ ‘ Young man does it only for greed’ (MCVF ‘he is not yet involved in the love affair’ corpus) Therefore, to improve the fit of the lexicon ex- (12) Don non cug que ja mais reveinha tracted from Medieval French with words specific of it not think that ever returns to Occitan, we added 171 manually annotated ‘I do not think that he ever recovers from it’ sentences from ‘Flamenca’ to the training data. We trained TnT on 28 265 sentences from The validation on the test set yielded 78.10% the Medieval French texts (MCVF). This trained accuracy for all words, 81.10% for known, and model was used to POS tag Old Occitan with- 57.48% for unknown words, see Table 3. The re- out any modification to the lexicon. For the per- sults prove that adding even a small set of high formance evaluation we extracted 50 sentences quality, manually annotated sentences in the tar- (1000 tokens) from the corpus and annotated get languages improves POS tagging quality con- them manually. Then, the tagger output was siderably, bringing the tagger’s performance close compared to the gold standard. The POS tagger to results reached for modern languages (given

396

Proceedings of KONVENS 2012 (LThist 2012 workshop), Vienna, September 21, 2012 Accuracy N% All Words 781/1000 78.10 Known Words 708/873 81.10 Unknown Words 73/127 57.48

Table 3: Evaluation with the Occitan-enriched lexicon

Figure 3: Example of Dependency Relation from the Catalan Treebank (Ancora) a morphologically richer language and a small ID Word Lemma Pos Head Dep training set), thus validating our approach. 1 Li ell PP3CSD00 2 ci Furthermore, in the course of our project we 2 agrada agradar VMIP3S0 0 sent have compiled an Occitan dictionary, consist- 3 l’ el DA0CS0 4 spec 4 actualitzacio´ actualitzacio´ NCFS000 2 suj ing of 2 800 entries. The glossary to Flamenca 5 que que PR0CC000 8 cd (Meyer, 1901) was used as a reference guide. 6 Barea Barea NP00000 8 suj 7 ha haver VAIP3S0 8 v Thus, our tagged corpus was augmented with 8 fet fer VMP00SM 4 S lemmas from the dictionary. We have further 9 del del SPCMS 8 creg 10 text text NCMS000 9 sn manually checked and corrected 8 284 tags in the 11 . . Fp 2 f corpus. Finally, the POS tagged version was con- verted to XML, again using EXMARaLDA. Table 4: Annotation in AnCora 6 Syntactic Cross-Linguistic Parsing While it has been widely accepted that syntac- tic annotation in terms of constituent trees pro- et al., 2006). For training a dependency parser, we vides a rich internal tree structure, recent years use Catalan rather than modern French since the have shown an increased interest in dependency syntactic characteristics of Catalan are more sim- graphs (Civit et al., 2006). Dependency graphs ilar to Occitan than French. For example, Old Oc- provide an immediate access to lexical informa- citan is a pro-drop language while French is not. tion for words, word pairs, and their grammatical We used the Catalan treebank from AnCora5. relations. For example, each word in the Cata- The Catalan treebank consists of 16 591 sen- lan sentence (14) has exactly one head, as demon- tences extracted from newspapers and annotated strated in Figure 3. The arcs show dependencies syntactically and semantically. The dependency from heads to dependents. treebank has been converted automatically from (14) Li agrada l’actalizacio´ que a constituency format with the help of a table of to him likes the modernisation that head finding rules (Civit et al., 2006). The sen- Barea ha fet del text. tence in (14), for example, is annotated in the tree- Barea has made of the text. bank format as shown in Table 4. ‘He likes how Barea modernized the text’. As shown in Table 4, the Catalan tagset de- scribes information about the major POS classes, In addition, it is argued that dependency gram- represented as letters, and morphological fea- mars “deal especially well with languages involv- tures, such as gender, number, case, person, time ing relatively free word order” (Bamman and and mode, represented as digits. The tagset has Crane, 2011). Since Old Occitan has relatively a total of 280 different labels (Taule´ et al., 2008). free word order, we aim for a syntactic annotation The rich morphological information allowed us to in form of dependency graphs. At present, de- map the Catalan tags to our Occitan tagset with pendency treebanks are available only for modern high accuracy. For example, in the sentence from Romance languages, namely French (Abeille´ et 5http://clic.ub.edu/corpus/en/ al., 2003), Catalan, and Spanish (AnCora), (Civit ancora-descarregues

397

Proceedings of KONVENS 2012 (LThist 2012 workshop), Vienna, September 21, 2012 Table 4, verbal features such as VMIP (main verb Occitan Relation type Catalan ROOT Main clause Sentence indicative present) and VMP (main verb past par- S(bar) Infinitival sentence infinitiu ticiple) were mapped to Occitan tags VJ and VPP, Subordinate complement ao SBJ Subject suj respectively. The example of the mapped Catalan OBJ Nominal Direct Object cd tags is shown in (15). CL Pronominal direct object cd Pronominal morpheme morfema.pron (15) Li agrada l’actalizacio´ que Impersonal impers Passive ‘se’ pass PRO VJ NCS WPRO PRED Predicative cpred Barea ha fet del text . Atribute atr NMOD Determiner d NPC AJ VPP P NCS PONFP Numbers z ‘He likes how Barea modernized the text’. Prepositional phrase sp Adjectival phrase s.a Determiner spec The Catalan dependency representation con- Second adj inside sn grup.a tains a large set of grammatical relations. We VMOD Prepositional complement creg Adverbial phrase sadv found 48 different labels in the AnCora Cor- Verb adjunct cc pus. We decided to map these dependencies Indirect object ci to the Penn Treebank core dependencies, such PMOD Noun phrase sn Adverbial modifier r as subject (SBJ), direct object (OBJ), predicate Second noun phrase inside sp grup.nom (PRED), nominal modifier (NMOD), verbal mod- Direct object ci V Auxilliary v ifier (VMOD). In addition, we added a language NEG Verb modifier ’no’ mod specific relation - CL (clitic) (16). The complete CONJ Conjunction que conj list of Occitan dependency labels and their corre- COORD Coordination c REL Relative pronoun relatiu sponding labels in Catalan is shown in Table 5. INC Inserted phrase inc P Punctuation f (16) Pero vostre sen m’ en digas but your opinion me about it tell Table 5: Dependency Labels ’But tell me you opinion about it.’ Deprel precision recall P 91.2 86.7 It is necessary to note that verbal head selec- CL 89.3 78.1 tion in the AnCora Corpus differs from the con- NEG 77.8 77.8 stituency head assignment (Civit et al., 2006). In NMOD 86.1 87.3 the dependency annotation, the head is assigned PMOD 76.9 78.9 to the rigthmost element in the verbal phrase. PRED 42.9 16.7 For example, past participles, and gerunds are ROOT 44.8 78.9 heads, whereas auxiliaries are dependents. In ad- OBJ 42.3 34.4 dition, the treebank is automatically augmented S 45.8 16.2 by empty elements to represent null subjects. SBJ 37.9 55.0 In order to annotate our Old Occitan texts, V 53.3 57.1 we trained a transition-based dependency parser, VMOD 46.2 56.3 MaltParser (Nivre et al., 2007) on the Catalan treebank with the reduced tagset. We then used Table 6: Parsing results the trained model to parse our corpus. We manually annotated 30 sentences to eval- uate the accuracy of the parser. For the evalu- ation we used MaltEval6, an evaluation tool for dependency trees. The results yielded 63.1% of nominal modifiers, prepositional modifiers, clitic, label accuracy and 55.8% of labeled attachment. negation and punctuation, cf. Table 6. The highest score of precision and recall was for As can be seen from Table 6, subject, object and predicate relations are the least accurate. This 6http://w3.msi.vxu.se/users/jni/ malteval/ is due to a relatively free word order in Old Occi-

398

Proceedings of KONVENS 2012 (LThist 2012 workshop), Vienna, September 21, 2012 $<$sentence id="19" user="" date=""$>$ $<$word id="1" form="Le" postag="D" head="2" deprel="NMOD"/$>$ $<$word id="2" form="coms" postag="NCS" head="3" deprel="SBJ"/$>$ $<$word id="3" form="fes" postag="VJ" head="0" deprel="ROOT"/$>$ $<$word id="4" form="sa" postag="DZ" head="5" deprel="NMOD"/$>$ $<$word id="5" form="mollier" postag="NCS" head="3" deprel="OBJ"/$>$ $<$word id="6" form="venir" postag="VX" head="5" deprel="S"/$>$ $<$word id="7" form="." postag="PONFP" head="3" deprel="P"/$>$ $<$/sentence$>$

Figure 4: An example of the syntactic annotation tan and its pro-drop feature. The example in (17) parsing results yielded 63.4% of labeled attach- illustrates that in some cases, the noun, here Fla- ment. menca, can be ambiguous between subject or ob- For the future, we plan to annotate the whole ject of the sentence. corpus of 8 000 lines, manually correct them, and make it available on the web. At present, 3 095 (17) Que Flamenca penra lines of tagged and parsed XML files are available that Flamenca will take upon request. ’that Flamenca will take’ / ’that he will take Flamenca’ References In contrast, the low precision of ROOT is due Anne Abeille,´ Lionel Clement,´ and Franc¸ois Tou- to MaltParser’s design which may lead to in- ssenel. 2003. Building a treebank for French. In complete syntactic annotations. To repair this Treebanks. Kluwer. type of errors, we performed post-editing, which Joseph Anglade. 1921. Grammaire de l’ancien readjusts ROOT labels, increasing its accuracy to provenc¸al ou ancienne langue d’oc. Librairie C. 71%, instead of 44.8%. In addition, it yielded bet- Klincksieck. ter results for label accuracy - 76.4% and label at- David Bamman and Gregory Crane. 2011. The An- tachment score - 63.4%. cient Greek and Latin dependency treebanks. In Language Technology for Cultural Heritage: Se- Finally, the parsed output was converted to lected Papers from the LaTeCH Workshop Series 7 , XML format using Malt Converter . An example pages 79–98. Springer. of the resulting XML file is presented in Figure 4. Pierre Bec. 1973. La langue occitane. Number 1059 in Que sais-je? Paris: Presses universitaires 7 Conclusion and Future Work de France. While the annotation of historical corpora re- Simon Belasco. 1990. France’s rich relation: The Oc connection. The French Review, 63(6):996–1013. quires precision, the first steps in building a syn- Thorsten Brants. 2000. TnT–a statistical part-of- tactically annotated corpus can be resource-light. speech tagger. In Proceedings of the 1st Confer- If we use a cross-language transfer, we can profit ence of the North American Chapter of the Asso- from existing resources for historical or modern ciation for Computational Linguistics and the 6th languages sharing similar morphological and syn- Conference on Applied Natural Language Process- tactic features. We have shown that Old French ing (ANLP/NAACL), pages 224–231, Seattle, WA. presents a good source for POS tagging of Old Monserat Civit, Antonia` Mart´ı, and Nuria Buf´ı. 2006. Occitan while modern Catalan is a good source Cat3LB and Cast3LB: From constituents to depen- dencies. In Advances in Natural Language Pro- for the syntactic annotation of Occitan. cessing, pages 141–153. Springer. The POS tagger model trained on the Old Mark Davies and Michael Ferreira. 2006. Cor- French Corpus MCVF (Martineau et al., 2007) pus do Portugues: 45 million words, 1300s- yielded 78% accuracy when we added a small Oc- 1900s. Available online at http://www. citan lexicon into training, whereas dependency corpusdoportugues.org. Mark Davies. 2002. Corpus del Espanol:˜ 100 million 7http://w3.msi.vxu.se/ nivre/research/ ˜ http: MaltXML.html words, 1200s-1900s. Available online at //www.corpusdelespanol.org.

399

Proceedings of KONVENS 2012 (LThist 2012 workshop), Vienna, September 21, 2012 Dinh Dien, Thuy Ngan, Xuan Quang, and Chi Nam. Joakim Nivre, Johan Hall, Jens Nilsson, Atanas 2004. The parallel corpus approach to building Chanev, Guls¸en¨ Eryigit,ˇ Sandra Kubler,¨ Svetoslav the syntactic tree transfer set in the English-to- Marinov, and Erwin Marsi. 2007. MaltParser: Vietnamese machine translation. In 2004 Interna- A language-independent system for data-driven de- tional Conference on Electronics, Information, and pendency parsing. Natural Language Engineering, Communications (ICEIC2004), pages 382–386, Ha 13(2):95–135. Noi, Vietnam. Achim Stein. 2008. Syntactic annotation of Old Anna Feldman and Jirka Hana. 2010. A Resource- French text corpora. Corpus, 7:157–161. Light Approach to Morpho-Syntactic Tagging. Mariona Taule,´ M. Antonia` Mart´ı, and Marta Re- Rodopi. casens. 2008. Ancora: Multilevel annotated cor- Suzanne Fleischmann. 1995. The non-lyric texts. pora for Catalan and Spanish. In Proceedings of the In F.R.P. Akehurst and Judith M. Davis, editors, Sixth International Conference on Language Re- A Handbook of the , pages 176–184. sources and Evaluation (LREC’08), pages 3405– University of California Press. 3408, Marrakech, Morocco. Jirka Hana, Anna Feldman, Chris Brew, and Luiz Taro Watanabe and Eiichiro Sumita. 2003. Example- Amaral. 2006. Tagging Portuguese with a Spanish based decoding for statistical machine translation. tagger using cognates. In Proceedings of the EACL In Proceedings of HLT-NAACL 2004: Short Papers, Workshop on Cross-Language Knowledge Induc- pages 9–12, Boston, MA. tion, pages 33–40, Trento, Italy. David Yarowsky and Grace Ngai. 2001. Inducing Greg Hanneman, Vamshi Ambati, Jonathan H. Clark, multilingual POS taggers and NP bracketers via Alok Parlikar, and Alon Lavie. 2009. An robust projection across aligned corpora. In Pro- improved statistical transfer system for French– ceedings of the 2nd Meeting of the North American English machine translation. In Proceedings of the Chapter of the Association for Computational Lin- Fourth Workshop on Statistical Machine Transla- guistics (NAACL), Pittsburgh, PA. tion, pages 140–144. Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11(3):311–325. Anthony Kroch and Ann Taylor. 2000. The Penn-Helsinki Parsed Corpus of Middle English (PPCME2). Department of Linguistics, University of Pennsylvania. France Martineau, Constanta Diaconescu, and Paul Hirschbuhler.¨ 2007. Le corpus ‘voies du franc¸ais’: De l’elaboration´ a` l’annotation. In Pierre Kunst- mann and Achim Stein, editors, Le Nouveau Cor- pus d’Amsterdam, pages 121–142. Steiner. France Martineau. 2010. Corpus MCVF, modeliser´ le changement: les voies du franc¸ais. http://www.arts.uottawa.ca/voies/ voies_fr.html. Ryan McDonald, Slav Petrov, and Keith Hall. 2011. Multi-source transfer of delexicalized dependency parsers. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Process- ing, pages 62–72, Edinburgh, UK. Paul Meyer. 1901. Le Roman de Flamenca. Librarie Emile Bouillon. Tahira Naseem, Regina Barzilay, and Amir Glober- son. 2012. Selective sharing for multilingual de- pendency parsing. In Proceedings of the 50th An- nual Meeting of the Association for Computational Linguistics, pages 629–637, Jeju Island, Korea.

400

Proceedings of KONVENS 2012 (LThist 2012 workshop), Vienna, September 21, 2012