Parsing Aligned Parallel Corpus by Projecting Syntactic Relations from Annotated Source Corpus
Total Page:16
File Type:pdf, Size:1020Kb
Parsing Aligned Parallel Corpus by Projecting Syntactic Relations from Annotated Source Corpus Shailly Goyal Niladri Chatterjee Department of Mathematics Indian Institute of Technology Delhi Hauz Khas, New Delhi - 110 016, India {shailly goyal, niladri iitd}@yahoo.com Abstract approaches either use examples from the same lan- guage, e.g., (Bod et al., 2003; Streiter, 2002), or Example-based parsing has already been they try to imitate the parse of a given sentence proposed in literature. In particular, at- using the parse of the corresponding sentence in tempts are being made to develop tech- some other language (Hwa et al., 2005; Yarowsky niques for language pairs where the source and Ngai, 2001). In particular, Hwa et al. (2005) and target languages are different, e.g. have proposed a scheme called direct projection Direct Projection Algorithm (Hwa et al., algorithm (DPA) which assumes that the relation 2005). This enables one to develop parsed between two words in the source language sen- corpus for target languages having fewer tence is preserved across the corresponding words linguistic tools with the help of a resource- in the parallel target language. This is called Di- rich source language. The DPA algo- rect Correspondence Assumption (DCA). rithm works on the assumption of Di- However, with respect to Indian languages we rect Correspondence which simply means observed that the DCA does not hold good all the that the relation between two words of time. In order to overcome the difficulty, in this the source language sentence can be pro- work, we propose an algorithm based on a vari- jected directly between the correspond- ation of the DCA, which we call pseudo Direct ing words of the parallel target language Correspondence Assumption (pDCA). Through sentence. However, we find that this as- pDCA the syntactic knowledge can be transferred sumption does not hold good all the time. even if not all syntactic relations may be projected This leads to wrong parsed structure of the directly from the source language to the target lan- target language sentence. As a solution guage in toto. Further, the proposed algorithm we propose an algorithm called pseudo projects the relations between phrases instead of DPA (pDPA) that can work even if Direct projecting relations between words. Keeping in Correspondence assumption is not guaran- line with (Hwa et al., 2005), we call this algorithm teed. The proposed algorithm works in a as pseudo Direct Projection Algorithm (pDPA). recursive manner by considering the em- The present work discusses the proposed pars- bedded phrase structures from outermost ing scheme for a new (target) language with the level to the innermost. The present work help of a parser that is already available for a discusses the pDPA algorithm, and illus- language (source) and using word-aligned paral- trates it with respect to English-Hindi lan- lel corpus of the two languages under considera- guage pair. Link Grammar based pars- tion. We propose that the syntactic relationships ing has been considered as the underlying between the chunks of the input sentence T (of parsing scheme for this work. the target language) are given depending upon the 1 Introduction relationships of the corresponding chunks in the translation S of T . Along with the parsed struc- Example-based approaches for developing parsers ture of the input, the system also outputs the con- have already been proposed in literature. These stituent structure (phrases) of the given input sen- 301 Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 301–308, Sydney, July 2006. c 2006 Association for Computational Linguistics tence. ture and constituent representation of the sentence In this work, we first discuss the proposed given below. scheme in a general framework. We illustrate the +--------Ss--------+ scheme with respect to parsing of Hindi sentences | +----Jp---+ | using the Link Grammar (LG) based parser for En- +--Ds-+-Mp-+ +-Dmc-+ +-Pa-+ | | | | | | | glish and the experimental results are discussed. the teacher of the boys is good Before that in the following section we discuss Link Grammar briefly. (S (NP (NP The teacher) (PP of (NP the boys))) (VP is) 2 Link Grammar and Phrases (ADJP good).) Link grammar (LG) is a theory of syntax which It may be noted that in the phrase structure of builds simple relations between pairs of words, the above sentence, verb phrase as obtained from rather than constructing constituents in tree-like the phrase parser has been modified to some ex- hierarchy. For example, in an SVO language like tent. The algorithm discussed in this work as- English, the verb forms a subject link (S-) to some sumes verb phrases as the main verb along with word on its left, and an object link (O+) with some all the auxiliary verbs. word on its right. Nouns make the subject link For ease of presentation and understanding, we (S+) to some word (verb) on its right, or object classify phrase relations as Inter-Phrase and Intra- link (O-) to some word on its left. phrase relations. Since the phrases are often em- The English Link Grammar Parser (Sleator and bedded, different levels of phrase relations are ob- Temperley, 1991) is a syntactic parser of English tained. From the outermost level to the innermost, based on LG. Given a sentence, the system as- we call them as “first level”, “second level” of re- signs to it a syntactic structure, which consists of lations and so on. One should note that an ith level a set of labeled links connecting pairs of words. Intra-phrase relation may become Inter-phrase re- The parser also produces a “constituent” represen- lation at a higher level. tation of a sentence (showing noun phrases, verb As an example, consider the parsing and phrase phrases, etc.). It is a dictionary-based system in structure of the English sentence given above. which each word in the dictionary is associated In the first level the Inter-phrase relations (cor- with a set of links. Most of the links have some responding to the phrases “the teacher of associated suffixes to provide various information the boys”, “is” and “good”) are Ss and Pa (e.g., gender (m/f), number (s/p)), describing and the remaining links are Intra-phrase relations. some properties of the underlying word. The En- In the second level the only Inter-phrase rela- glish link parser lists total of 107 links. Table tionship is Mp (connecting “the teacher” and 1 gives a list of some important links of English “the boys”), and the Intra-phrase relations are LG along with the information about the words on Ds, Jp and Dmc. In third and the last level, Jp is their left/right and some suffixes. the Inter-phrase relationship and Dmc is the Intra- phrase relation (corresponding to “of” and “the Link Word in Left Word in Right Suffixes A Premodifier Noun - boys”). D Determiners Nouns s/m,c/u The algorithm proposed in Section 4 uses J Preposition Object of the prepo- s/p pDCA to first establish the relations of the tar- sition M Noun Post-nominal Modi- p/v/g/a get language corresponding to the first-level Inter- fier phrase relations of the source language sentence. MV Verbs/adjectives Modifying phrase p/a/i/ Then recursively it assigns the relations corre- l/x O Transitive verb Direct or indirect ob- s/p sponding to the inner level relations. ject P Forms of “be” Complement of “be” p/v/g/a 3 DCA vis-a-vis` pDCA PP Forms of “have” Past participle - S Subject Finite verb s/p, i, g Direct Correspondence Assumption (DCA) states that the relation between words in source language Table 1: Some English Links and Their Suffixes sentence can be projected as the relations between corresponding words in the (literal) translation in As an example, consider the syntactic struc- the target language. Direct Projection Algorithm 302 (DPA), which is based on DCA, is a straightfor- from the source language sentence. Thus we pro- ward projection procedure in which the dependen- pose pseudo Direct Correspondence Assumption cies in an English sentence are projected to the (pDCA) where not all relations can be projected sentence’s translation, using the word-level align- directly. The projection algorithm needs to take ments as a bridge. DPA also uses some monolin- care of the following three categories of links: gual knowledge specific to the projected-to lan- guage. This knowledge is applied in the form of Category 1: Relationship between two chunks Post-Projection transformation. in the source language can be projected to the tar- However with respect to many language pairs get language with minor or no changes (for ex- syntactic relationships between the words cannot ample, subject-verb, object-verb relationships in always be imitated to project a parse structure the above illustration). It may be noted that since from source language to target language. For il- except for some suffix differences (due to mor- lustration consider the sentence given in Figure 1. phological variations), the relation is same in the We try to project the links from English to Hindi source and the target language. in Figure 1(a) and Hindi to Bangla in Figure 1(b). Category 2: Relationship between two chunks For Hindi sentence, links are given as discussed by in the source language can be projected to the Goyal and Chatterjee (2005a; 2005b). target language with major changes. For ex- ample, in the English sentence given in Figure 2(a), the relationship between the girl and in the white dress is Mp, i.e. “nominal mod- ifier (preposition phrase)”. In the corresponding phrases ladkii and safed kapde waalii of Hindi, although the relationship is same, i.e., “nominal modifier”, the type of nominal modifier is chang- ing to waalaa/waale/waalii-adjective.