Bulgarian-English Parallel Treebank: Word and Semantic Level Alignment
Total Page:16
File Type:pdf, Size:1020Kb
Bulgarian-English Parallel Treebank: Word and Semantic Level Alignment Kiril Simov Petya Osenova LMD, IICT-BAS LMD, IICT-BAS [email protected] [email protected] Laska Laskova Aleksandar Savkov LMD, IICT-BAS LMD, IICT-BAS [email protected] [email protected] Stanislava Kancheva LMD, IICT-BAS [email protected] Abstract ative context for the application of a rule. For more details on the transfer rules consult (Oepen The paper describes the basic strategies behind 2008). This type of rules allows for the ex- the word and semantic level alignment in the tremely flexible transfer of factual and linguistic Bulgarian-English treebank. The word level knowledge between the source and the target lan- alignment has taken into consideration the ex- guages. Thus the treebank has to contain parallel perience within other NLP groups in the con- sentences, their syntactic and semantic analyses text of the Bulgarian language specific fea- and correspondences on the level of MRS. tures. The semantic level alignment builds on the word level alignment and is represented in In the development of such a parallel treebank the framework of the Minimal Recursion Se- we rely on the Bulgarian HPSG resource gram- mantics. mar BURGER, and on a dependency parser (Malt Parser – Nivre et al. 2006), trained on the 1 Introduction BulTreeBank data. Both parsers produce semant- ic representations in terms of MRS. The treebank Manually created aligned bi- or multilingual cor- is a parallel resource aligned first on a sentence pora have proven to be useful resources in vari- level. Then the alignment is done on the level of ety of tasks, e.g. for the development of automat- MRS. This level of abstraction makes possible ic alignment tools, but also for lexicon extrac- the usage of different tools for producing these tion, word sense disambiguation, machine trans- alignments, since MRS is meant to be compatible lation, annotation transfer and others. with various syntactic frameworks. The chosen In this paper we describe the word level align- procedure is as follows: first, the Bulgarian sen- ment of the Bulgarian-English Parallel HPSG tences are parsed with BURGER. If it succeeds, Treebank (BulEngTreebank) and its connection then the produced MRSes are used for the align- to the semantic level alignment. The aim of con- ment. In case BURGER fails, the sentences are structing such a treebank is to use it as a source parsed with Malt Parser, and then MRSes are for learning of statistical transfer rules for Bul- constructed on the base of the dependency ana- garian-English machine translation along the lysis. The latter MRSes are created via a set of lines of (Bond et al. 2011 to appear). The transfer transfer rules (see Simov and Osenova 2011). In rules in this framework are rewriting rules over both cases we keep the syntactic analyses for the MRS (Minimal Recursion Semantics) structures. parallel sentences. The basic format of the transfer rules is: With respect to the MRS alignments, a very [C:] I [!F] → O pragmatic approach has been adopted – namely, where I is the input of the rule, O is the output. the MRS alignments originated from the word C determines the context and F is the filter of the level alignment. This approach is based on the rule. C selects positive context and F selects neg- following observations and requirements: 29 Proceedings of Recent Advances in Natural Language Processing, pages 29–38, Hissar, Bulgaria, 12-14 September 2011. • Both approaches for generation of MRS over tradition established by the guidelines used in the sentences are lexicalized; similar projects, aiming at the creation of golden • Non-experts in linguistics can do the align- standards for different language pairs, such as the ments successfully on word level; Blinker project for English-French alignment • Different rules for generation/testing are pos- (Melamed 1998), the alignment task for the sible. Prague Czech-English Dependency Treebank 1.0 Both parsers (for Bulgarian and English), (Kruijff-Korbayová et al. 2006), the Dutch paral- which we use for the creation of MRSes, are lex- lel Corpus project (Macken 2010), among others. icalized in their nature. Thus, they first assign As Lambert et al. (2006) point out, the align- elementary predicates to the lexical elements in ment decisions presented in the guidelines reflect the sentences, and then, on the base of the syn- different tasks. There are projects such as AR- tactic analysis, these elementary predicates are CADE (Vèronis, 2000) and PLUG (Ahrenberg et composed into MRSes for the corresponding al., 2000), which aim at building a reference cor- phrases, and finally of the whole sentence. pora with word, not sentence pairs, and have a Our belief is that having alignments on word different annotation strategy in contrast to those level, syntactic analyses and the rules for com- that focus on sentence level. Different linguistic position of MRS, we will be able to determine theoretical backgrounds appear to be another correspondences between bigger MRSes than source of divergence that affects the rules of only lexical level MRSes, using the ideas of phrase alignments as well as the specific gram- (Tinsley et al, 2009). They first establish the matical techniques. This holds especially in cor- mapping on word level (automatically), then for respondences between synsemantic words (like candidate phrases they calculate the rank of the prepositions, determiners, particles, auxiliary correspondences on the base of the word level verbs) and synsemantic and/or autosemantic alignment. Thus, our idea is to score the corres- words (Macken 2010). In addition, some tools pondences between two MRSes on the base of for manual word alignment, e.g. HandAlign1, al- involved elementary predicates as well as the low the user to link both phrases and their ele- syntactic structure of the parallel sentences. ments with different kind of links, which might As it was mentioned, the alignment on word be simulated in other tools, which are more re- level allows us to do more reliable alignments strictive. Finally, the use of the so called possible using annotators who are non-experts in linguist- (also ambiguous, fuzzy or weak) links that signal ics. Currently, the inter-annotator agreement is correspondence between semantically and/or 92 %. Also this kind of alignment does not re- structurally nonequivalent words or phrases is quire any initial knowledge of MRS from the an- also a matter of dispute. While some argue that notators. Another advantage is that the result alignment with possible links should be determ- might be used for training tools for automatic ined by unambiguous rules, formulated with con- word alignment, and thus automatic extension of sideration of inter-annotation agreement, others the treebank can be performed. Additionally, the (Lambert et al. 2006) allow for different de- word level alignment might be done before the cisions to be kept, which is true to the role ori- actual analysis of the sentences. This is espe- ginally ascribed to this kind of links: “P (pos- cially useful in case of Bulgarian, where the sible) alignment which is used for alignments BURGER grammar is underdeveloped in com- which might or might not exist” (Och and Ney parison with the English grammar. 2000). The paper is structured as follows: the next sec- tion discusses the related works on word align- 3 Word Level Alignment ment strategies. Section 3 focuses on the basic The word level alignment was performed by principles behind the word alignment between the WordAligner2 – a web-based tool for word Bulgarian and English. Section 4 describes the alignment, built on top of the word alignment in- level of MRS alignments. Section 5 outlines the terface developed by C. Callison-Burch. It allows conclusions. the user to provide parallel input of non-aligned text through the interface or to upload file(s) with 2 Previous Work on Word Level sentence level aligned texts. Editing and/or com- Alignment pletion of alignments is also supported. Each pair The annotation guidelines for Bulgarian-English 1 word alignment, presented here, gained from the Available at http://www.cs.utah.edu/~hal/HandAlign/ 2 http://www.bultreebank.bas.bg/aligner/index.php 30 of sentences is represented as a grid of squares Idioms and free translations present a special (Fig. 1). For convenience English is considered case. If two autosemantic words or phrases refer to be the source and Bulgarian – the target lan- to the same object, but do not share the same guage, but that has no implications for the trans- meaning, they are aligned with a P link, e.g.: lation direction. Correspondence between two (1) this animal tokens is marked by clicking on a square – once това куче [‘this dog’] (black square) or twice (dark grey square). Ori- ginally, the two colours were introduced to allow the annotator to mark his/her degree of certainty about the alignment decision: sure link (S link, black) or possible link (P link, dark grey). It is worth noting that in an alignment there can be only one type of link between two tokens or, more precisely, there is no distinction between The same rule holds when there is a synse- phrase and word levels. mantic – autosemantic correspondence: (2) Ivan 's mother called. Неговата майка се oбади. [‘His mother called.’] Fig 1. Aligner interface. Mapping is done by P link: Ivan 's ~ Неговата clicking on the squares. P link is used when a lexical item is para- Subsequently the colours were used to distin- phrased in the other language: guish between strong and weak alignment (3) these non-Serbs (Kruijff-Korbayová et al. 2006), thus P link (dark тези лица от несръбски произход [‘persons grey) represents either weak alignment, or that from a non-Serbian origin’] the annotator is uncertain about the pairing, or both.