A Syntactic Resource for Thai: CG Treebank
Taneth Ruangrajitpakorn, Kanokorn Trakultaweekoon, Thepchai Supnithi
Human Language Technology Laboratory
National Electronics and Computer Technology Center
112 Thailand Science Park, Phahonyothin Road, Klong 1, Klong Luang, Pathumthani, 12120, Thailand
+66-2-564-6900 Ext.2547, Fax.: +66-2-564-6772
{taneth.ruangrajitpakorn, kanokorn.trakultaweekoon, thepchai.supnithi}@nectec.or.th

Proceedings of the 7th Workshop on Asian Language Resources, ACL-IJCNLP 2009, pages 96–102, Suntec, Singapore, 6-7 August 2009. © 2009 ACL and AFNLP

Abstract

This paper presents a Thai syntactic resource, the Thai CG treebank, which takes a categorial approach to language resources. Since there are very few Thai syntactic resources, we set out to create a treebank based on the CG formalism. A Thai corpus was parsed with an existing CG syntactic dictionary and an LALR parser. The correctly parsed trees were collected as a preliminary CG treebank. It consists of 50,346 trees from 27,239 utterances. The trees can be split into three grammatical types: 12,876 sentential trees, 13,728 noun phrasal trees, and 18,342 verb phrasal trees. There are 17,847 utterances that obtain a single tree, and the average number of trees per utterance is 1.85.

1 Introduction

Syntactic lexical resources such as POS-tagged corpora and treebanks play an important role in NLP tools, for instance machine translation (MT), automatic POS taggers, and statistical parsers. Because of the heavy workload and the lack of linguistic expertise needed to manually assign syntactic annotation to sentences, we are currently limited to a few syntactic resources. A few studies (Satayamas and Kawtrakul, 2004) have focused on developing systems to build treebanks. Unfortunately, there has been no further report on an existing treebank for Thai so far. Thai is an analytic language, which means that grammatical information is carried by words rather than by inflection (Richard, 1964). Function words represent grammatical information such as tense, aspect, and modality. Therefore, recognising word order is a key to syntactic analysis for Thai. Categorial Grammar (CG) is a formalism which focuses on the principles of syntactic behaviour. It can be applied to solve word order issues in Thai. To apply CG in machine learning and statistical approaches, a CG treebank is required first.

CG is a base concept that can be extended to advanced grammars such as Combinatory Categorial Grammar (CCG) (Steedman, 2000). Moreover, CCG tags have been reported to be superior to POS tags, both because they consist of fine-grained lexical categories and in their accuracy rate (Curran et al., 2006; Clark and Curran, 2007).

Nowadays, CG and CCG have become popular in NLP research. Several studies in Asia use them as their main theoretical approach. For example, research in China used CG with Type Lifting (Dowty, 1988) to find feature interpretations of undefined words as syntactic-semantic analysis (Jiangsheng, 2000). In Japan, researchers have also worked on Japanese categorial grammar (JCG), which provides a foundation for semantic parsing of Japanese (Komatsu, 1999). Moreover, research in Japan has improved CG to handle the Japanese particle-shifting phenomenon and used CG to focus on Japanese particles (Nishiguchi, 2008).

This paper is organised as follows. Section 2 reviews categorial grammar and its function. Section 3 explains the resources for building the Thai CG treebank. Section 4 describes the experimental results. Section 5 discusses issues of the Thai CG treebank. Last, Section 6 summarises the paper and lists future work.

2 Categorial Grammar

Categorial grammar (also known as CG or classical categorial grammar) (Ajdukiewicz, 1935; Carpenter, 1992; Buszkowski, 1998; Steedman, 2000) is a formalism in natural language syntax motivated by the principle of compositionality and organised according to syntactic elements. The syntactic elements are categorised in terms of their ability to combine with one another to form larger constituents, as functions or according to a function-argument relationship. All syntactic elements in CG are distinguished by a syntactic category identifying them as one of the following two types:

1. Argument: this type is a basic category, such as s (sentence) and np (noun phrase).
2. Functor (or functor category): this category type is a combination of arguments and the operators '/' and '\'. A functor is assigned to a complex lexical item that helps an argument complete a sentence; for example, s\np (intransitive verb) requires a noun phrase from the left side to complete a sentence.

CG captures this information by associating a functional type or category with every grammatical entity. The notation α/β is a rightward-combining functor over a domain of β into a range of α. The notation α\β is a leftward-combining functor over β into α. α and β are both argument syntactic categories (Hockenmaier and Steedman, 2002; Baldridge and Kruijff, 2003).

The basic concept is to find the core of the combination and replace the grammatical modifiers and complements with a set of categories, following the same idea as fractions. For example, an intransitive verb needs to combine with a subject to complete a sentence; it is therefore written as s\np, which means that it needs a noun phrase from the left side to complete a sentence. If a noun phrase exists on the left side, the rule of fraction cancellation is applied as np * s\np = s. With CG, each lexical item can be annotated with its own syntactic category. However, a lexical item can have more than one syntactic category if it can be used in different contexts.

Furthermore, CG not only constructs a purely syntactic structure but also delivers a compositional interpretation. The identification of derivation with interpretation is an advantage over other formalisms.

An example of a CG derivation of a Thai sentence is illustrated in Figure 1.

Recently, there has been much research on combinatory categorial grammar (CCG), which is an improved version of CG. Because the concept and notation are based on CG, a CG resource can easily be upgraded to the more advanced formalism. However, Thai syntax still remains unclear, since several points of Thai grammar have not yet been completely researched and have no definitive solution (Ruangrajitpakorn et al., 2007). Therefore, CG is currently used for Thai in order to significantly reduce the over-generation caused by complex composition or ambiguous usage.

Figure 1. CG derivation tree of Thai sentence

3 Resources

To collect the CG treebank, a CG dictionary and a parser are required. First, the Thai corpus was parsed with the parser, using the CG dictionary as a syntactic resource. Then the correct trees of each sentence were manually determined by linguists and collected together as the treebank.

3.1 Thai CG Dictionary

Recently, we developed a Thai CG dictionary to serve as a syntactic dictionary for several purposes, since CG is new to Thai NLP. CG was adopted for our syntactic dictionary because of its focus on lexical behaviour and its fine-grained lexicalised grammar. CG suits the nature of the Thai language, since Thai belongs to the analytic language typology; that is, its syntax and meaning depend on the use of particles and word order rather than inflection (Boonkwan and Supnithi, 2008). Moreover, pronouns and other grammatical information, such as tense, aspect, number, and voice, are expressed by function words such as determiners, auxiliary verbs, adverbs and adjectives, which occur in a fixed word order. With CG, it is possible to capture Thai grammatical information well. Currently we aim only to improve the accuracy of Thai syntactic parsing, since unresearched ambiguities still remain in Thai syntax. A list of grammatical Thai word orders which are handled with CG is shown in Table 1.

Thai utilisation           Word order
Sentence                   - Subject + Verb + (Object)¹ [rigid order]
Compound                   - Core noun + Attachment noun
Adjective modification     - Noun + Adjective²
Predicate adjective        - Noun + Adjective³
Determiner                 - Noun + (Classifier) + Determiner
Numeral expression         - Noun + (Modifier) + Number + Classifier + (Modifier)
Adverb modification        - Sentence + Adverb
                           - Adverb + Sentence
Several auxiliary verbs    - Subject + (Aux verbs) + VP + (Aux verbs)
Negation                   - Subject + Negator + VP
                           - Subject + (Aux verb) + Negator + (Aux verb) + VP
                           - Subject + VP + (Aux verb) + Negator + (Aux verb)
Passive                    - Actee + Passive marker + (Actor) + Verb
Ditransitive               - Subject + Ditransitive verb + Direct object + Indirect object

In addition, there are many multi-sense words in Thai. These words have the same surface form but different meanings and different usages. This issue can be solved with the CG formalism: the different usages are separated by the annotation of syntactic information. For example, the Thai word “เกาะ” /kɔ̀ʔ/ can refer to a noun meaning 'island', in which case it is marked as np; it can also denote an action meaning 'to cling' or 'to attach', in which case it is marked as s\np/np.

After observing Thai word usage, the list of CG categories was created according to the CG theory explained in Section 2.

Thai argument syntactic categories were created first. For the Thai language, six argument syntactic categories were determined. The Thai CG arguments are listed with definitions and examples in Table 2. Additionally, np, num, and spnum are Thai CG arguments that can be tagged directly to a word; the others cannot, and can only be used in combination with other arguments.

With the arguments, other types of words are created as functors by combining the arguments according to their behaviour and environmental requirements. The first argument in a functor is the result of the combination. There are only two main operators in CG, slash '/' and backslash '\', each placed before an argument. A slash '/' refers to an argument required from the right, and a backslash '\' refers to an argument required from the left.
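The slash convention and the fraction-style cancellation described in this paper can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the dictionary format or the LALR parser used to build the treebank: the names Atom, Functor, parse and combine are hypothetical, and a category such as s\np/np is assumed to read left-associatively as (s\np)/np, with the result category written first.

```python
import re
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Atom:
    """Argument (basic) category such as s or np."""
    name: str

    def __str__(self) -> str:
        return self.name


@dataclass(frozen=True)
class Functor:
    """Functor category: result first, then the slash, then the required argument."""
    result: "Category"
    slash: str  # '/' = argument expected on the right, '\' = on the left
    arg: "Category"

    def __str__(self) -> str:
        arg = f"({self.arg})" if isinstance(self.arg, Functor) else str(self.arg)
        return f"{self.result}{self.slash}{arg}"


Category = Union[Atom, Functor]


def parse(text: str) -> Category:
    """Parse a category string; slashes are assumed to be left-associative."""
    tokens = re.findall(r"[A-Za-z]+|[/\\()]", text)
    pos = 0

    def primary() -> Category:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            cat = expr()
            pos += 1  # skip the closing ')'
            return cat
        return Atom(tok)

    def expr() -> Category:
        nonlocal pos
        cat = primary()
        while pos < len(tokens) and tokens[pos] in ("/", "\\"):
            slash = tokens[pos]
            pos += 1
            cat = Functor(cat, slash, primary())
        return cat

    return expr()


def combine(left: Category, right: Category) -> Optional[Category]:
    """Apply the cancellation rule to two adjacent categories, if possible."""
    # Forward application: X/Y followed by Y gives X.
    if isinstance(left, Functor) and left.slash == "/" and left.arg == right:
        return left.result
    # Backward application: Y followed by X\Y gives X.
    if isinstance(right, Functor) and right.slash == "\\" and right.arg == left:
        return right.result
    return None


# The two categories the paper gives for the word "เกาะ".
island = parse("np")           # noun reading: 'island'
cling = parse(r"s\np/np")      # verb reading: 'to cling', 'to attach'

# Subject + Verb + Object: the verb first takes its object on the right,
# then the resulting s\np takes the subject on the left.
verb_phrase = combine(cling, parse("np"))
sentence = combine(parse("np"), verb_phrase)
print(verb_phrase, sentence)
```

In this sketch the noun reading np cannot combine with a neighbouring np at all, while the verb reading s\np/np reduces a Subject + Verb + Object sequence to s, which is how the category annotation keeps the two usages of the same surface form apart.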