Development of a Universal Dependencies Treebank for Welsh
Johannes Heinecke
Orange Labs
2 rue Pierre Marzin, F - 22307 Lannion cedex

Francis M. Tyers
Department of Linguistics, Indiana University
Bloomington, IN

Proceedings of the Celtic Language Technology Workshop 2019, Dublin, 19–23 Aug., 2019

© 2019 The authors. This article is licensed under a Creative Commons 4.0 licence, no derivative works, attribution, CC-BY-ND.

Abstract

This paper describes the development of the first syntactically-annotated corpus of Welsh within the Universal Dependencies (UD) project. We explain how the corpus was prepared, and some Welsh-specific constructions that require attention. The treebank currently contains 10 756 tokens. A 10-fold cross-evaluation shows that the results of both tagging and dependency parsing are similar to those of other treebanks of comparable size, notably the other Celtic language treebanks within the UD project.

1 Introduction

The Welsh Treebank is the third Celtic language within the Universal Dependencies project (Nivre et al., 2016), after Irish (Lynn and Foster, 2016) and Breton (Tyers and Ravishankar, 2018). The main goal of the Universal Dependencies treebanks is to have many different languages annotated with identical guidelines and a universally defined set of POS tags and dependency relations. These cross-linguistically consistent grammatical annotations can be used, for instance, for typological language comparison or for developing and evaluating cross-linguistic tagging and parsing tools.

The motivation was twofold: to have a Welsh treebank annotated using the same guidelines as many existing treebanks, which permits language comparison, and to have a (start for a) treebank which can be used to train dependency parsers. Since the UD project already contains 146 treebanks for 83 languages and provides annotation principles which have been used in typologically very different languages, we chose to develop the Welsh treebank within the UD project. At the time of writing, 601 sentences with 10 756 tokens in total have been annotated and released with UD version 2.4.

The paper is laid out as follows: in Section 2 we give a short typological overview of Welsh. Section 3 briefly describes prior work for Welsh in computational linguistics and syntax as well as existing resources. In Sections 4 and 5 we describe the annotated corpus and the preprocessing steps, and present some particularities of Welsh and how we annotate them. Section 6 explains the validation process. We conclude with a short evaluation in Section 7 and some remarks on future work (Section 8).

2 Welsh

Welsh is a Celtic language of the Brythonic branch¹ of the Insular Celtic languages. There are about 500 000 (Jones, 2012) native speakers in Wales (United Kingdom). Apart from very young children, all speakers are bilingual with English. There are also a few thousand Welsh speakers in the province of Chubut, in Argentina, who are the descendants of Welsh emigrants of the 1850s, and who are now all bilingual with Spanish. A short overview of Welsh grammar is given in Thomas (1992) and Thorne (1992); more detailed information can be found in Thorne (1993) and King (2003).

Even though Welsh is closely related to Breton (and Cornish), it is different from a typological point of view. Like Breton (and the Celtic languages of the Goidelic branch), it has initial consonant mutations, inflected prepositions (ar “on”, arnaf i “on me”), genitive constructions with a single determiner (tŷ’r frenhines, lit. “house the queen”: “the house of the queen”) and impersonal verb forms. However, unlike Breton, Welsh has a predominantly verb-subject-object (VSO) word order, does not have composed tenses with an auxiliary corresponding to “have”, and makes extensive use of periphrastic verbal clauses to convey tense and aspect (Heinecke, 1999). It has only verbnouns instead of infinitives (direct objects become genitive modifiers or possessives). Like Irish but unlike Breton, Welsh does not have a verb “to have” to convey possession. It uses a preposition “with” instead: Mae dau fachgen gen i “I have two sons”, lit. “There is two son(SG) with me”. Another feature of Welsh is the vigesimal number system (at least in the formal registers of the language) and non-contiguous numerals (tri phlentyn ar hugain “23 children”, lit. “three child(SG) on twenty”).

¹ Together with Breton and Cornish.

3 Related Work

Welsh has been the object of research in computational linguistics, notably for speech recognition and speech synthesis (Williams, 1999; Williams et al., 2006; Williams and Jones, 2008), as well as spell checking and machine translation. An overview can be found in Heinecke (2005); more detailed information about existing language technology for Welsh is accessible at http://techiaith.cymru. Research on Welsh syntax within various frameworks is very rich: Awbery (1976), Rouveret (1990), Sadler (1998), Sadler (1999), Roberts (2004), Borsley et al. (2007), Tallerman (2009), Borsley (2010), to cite a few.

The only available annotated corpus to our knowledge is Cronfa Electroneg o Gymraeg (CEG) (Ellis et al., 2001), which contains about one million tokens, annotated with POS and lemmas. The CEG corpus contains texts from novels and short stories, religious texts and non-fictional texts in the fields of education, science, business or leisure activities. It also contains texts from newspapers and magazines and some transcribed spoken Welsh.

Currently work is under way for the National Corpus of Contemporary Welsh². It is a very large corpus which contains spoken and written Welsh. Currently the existing data is not annotated for syntax (dependency or other). Members of the CorCenCC project also work on WordNet Cymraeg, a Welsh version of WordNet. Further corpora (including CEG) are available at University of Bangor’s Canolfan Bedwyr³.

Other important resources are the proceedings of the Third Welsh Assembly⁴ and Eurfa, a full form dictionary⁵ with about 10 000 lemmas (210 578 forms). There is also the full form list for Welsh of the Unimorph project⁶. Currently, however, this list contains only 183 lemmas (10 641 forms).

The Welsh treebank is comparable in size to the Breton treebank (10 348 tokens, 888 sentences). The Irish treebank is twice as big with 23 964 tokens (1 020 sentences). UD treebanks vary very much in size. Currently the largest UD treebank is the German-HDT with 3 055 010 tokens. The smallest is Tagalog with just 292 words. The average size for all treebanks is 150 827 tokens; the median size is 43 754 tokens.

² http://www.corcencc.org/
³ cf. Canolfan Bedwyr, University of Bangor, https://www.bangor.ac.uk/canolfanbedwyr/ymchwil_TI.php.en and http://corpws.cymru
⁴ http://cymraeg.org.uk/kynulliad3/
⁵ http://eurfa.org.uk/
⁶ http://www.unimorph.org/

4 Corpus

Like every language, Welsh has formal and informal registers. All of those are written, which makes it difficult to compile a homogeneous corpus. The differences are not only a question of style, but are also of a morphological and sometimes of a syntactic nature. For the written language, a distinction is usually made between Literary Welsh (cf. grammars by Williams (1980) and Thomas (1996)) and Colloquial Welsh (Uned Iaith Genedlaethol (1978)), including an attempted new standard, Cymraeg Byw “Living Welsh” (Education Department, 1964; Davies, 1988). Cymraeg Byw, however, has since fallen out of fashion. For the UD Welsh treebank, we chose sentences of Colloquial written Welsh.

The sentences of the initial version of the treebank have been taken from varying sources, like the Welsh language Wikipedia, mainly from pages on topics related to Wales, such as the Urdd Gobaith Cymru, the Eisteddfodau or Welsh places, since it is much more probable that native Welsh speakers contributed to these pages. Other sources for individual sentences were the Welsh Assembly corpus mentioned above, Welsh grammars (in order to cover syntactic structures less frequently seen) or web sites from Welsh institutions (Welsh universities, Cymdeithas yr Iaith). Finally, some sentences originate from Welsh language media (Y Golwg, BBC Cymru) and blogs. Even though a few of the sentences from Wikipedia may look awkward or incorrect to native speakers, these sentences are the reality of written Welsh and are therefore included in the treebank.

The different registers of Welsh mean that theoretically “identical” forms may appear in diverging surface representations: so the very formal yr ydwyf i “I am” (lit. “(affirmative) am I”) can take the following (more or less contracted) forms in written Welsh: rydwyf (i), rydwi, rydw (i), dwi. In the treebank, we annotate these forms as multi-token words. For layout reasons, we do not split these forms in all examples in this paper. Where it is the case, we mark multi-token words with a box around the corresponding words. The same applies for the negation particle ni(d), which is often contracted with the following form of bod if the latter has an initial vowel: nid oedd “(he) was not” > doedd.

tically relevant information, and some subtypes for dependency relations, also frequently used in other languages: acl:relcl (relative clauses) and flat:name (flat structures for multi-word named entities).

UPOS     Welsh specific XPOS
ADJ      pos, cmp, eq, ord, sup
ADP      prep, cprep
AUX      aux, impf, ante, post, verbnoun
NOUN     noun, verbnoun
PRON     contr, dep, indp, intr, pron, refl, rel
PROPN    org, person, place, propn, work

Table 1: Welsh specific XPOS

5 Dependencies

The POS-tagged corpus was then manually annotated⁷ and all layers were validated: lemmas, UPOS, XPOS (see section 5), and dependency relations using the annotation guidelines of Univer-