Integrating Collocation Features in Chinese Word Sense Disambiguation
Wanyin Li, Qin Lu, Wenjie Li
Department of Computing, The Hong Kong Polytechnic University
Hung Hom, Kowloon, HK
[email protected], [email protected], [email protected]

Abstract

The selection of features is critical in providing discriminative information for classifiers in Word Sense Disambiguation (WSD). Uninformative features degrade the performance of classifiers. Based on the strong evidence that an ambiguous word expresses a unique sense in a given collocation, this paper reports our experiments on automatic WSD using collocations as local features, based on a corpus extracted from People's Daily News (PDN) as well as the standard SENSEVAL-3 data set. Using the Naïve Bayes classifier as our core algorithm, we have implemented a classifier with a feature set combining both local collocation features and topical features. The average precision on the PDN corpus shows a 3.2% improvement over the 81.5% of the baseline system, in which collocation features are not considered. For the SENSEVAL-3 data, we reached a precision rate of 37.6% by integrating collocation features into contextual features, a 37% improvement over the 26.7% precision of the baseline system. Our experiments have also shown that collocation features can be used to reduce the size of the human-tagged corpus.

1 Introduction

WSD tries to resolve lexical ambiguity, which refers to the fact that a word may have multiple meanings, such as the word "walk" in "Walk or Bike to school" and "BBC Education Walk Through Time", or the Chinese word "ഄᮍ" in "ഄᮍᬓᑰ" ("local government") and "Ҫг᳝ᇍⱘഄᮍ" ("he is also partly right"). WSD tries to automatically assign an appropriate sense to an occurrence of a word in a given context.

Various approaches have been proposed to deal with the word sense disambiguation problem, including rule-based, knowledge- or dictionary-based, corpus-based, and hybrid approaches. Among these, the supervised corpus-based approach has been applied and discussed by many researchers ([2-8]). According to [1], corpus-based supervised machine learning methods are the most successful approaches to WSD; in these methods, contextual features have been used mainly to distinguish ambiguous words. However, word occurrences in the context are too diverse to capture the right pattern, which means that the dimension of contextual words becomes very large when all words in the training samples are used for WSD [14]. Uninformative features weaken the discriminative power of a classifier and result in a lower precision rate. To narrow down the context, we propose to use collocations as contextual information, as defined in Section 3.1.2. It is generally understood that the sense of an ambiguous word is unique in a given collocation [19]. For example, "ࣙ㺅" means "burden" but not "baggage" when it appears in the collocation "ᗱᛇࣙ㺅" ("burden of thought").

In this paper, we apply a classifier that combines the local features of collocations containing the target word with other contextual features to discriminate the ambiguous words. The intuition is that when the target context captures a collocation, the influence of other dimensions of contextual words can be reduced or even ignored. For example, in the expression "ᘤᗪߚᄤ⛮↕њᅸ" ("terrorists burned down the gene laboratory"), the influence of the contextual word "" ("gene") should be reduced when disambiguating the target word "ߚᄤ", because "ᘤᗪߚᄤ" is a collocation, whereas "ߚᄤ" and "" are not a collocation even though they do co-occur.

Our intention is not to replace contextual information by collocations altogether. Rather, we would like to use collocation as an additional feature in WSD. We still make use of other contextual features for the following reasons. Firstly, contextual information has proven effective for WSD in previous research. Secondly, collocations may be independent of the training corpus, and a sentence under consideration may not contain any collocation. Thirdly, collocations help resolve tie cases such as "ᘤᗪߚᄤ⌟䆩" ("terrorists' gene checking"), where "ߚᄤ" means "human" when it appears in the collocation "ᘤᗪߚᄤ" but "particle" in the collocation "ߚᄤ".

The primary purpose of using collocations in WSD is to improve the precision rate without any sacrifice in the recall rate. We also want to investigate whether the use of collocation as an additional feature can reduce the size of the hand-tagged sense corpus.

The rest of this paper is organized as follows. Section 2 summarizes the existing word sense disambiguation techniques based on annotated corpora. Section 3 describes the classifier and the features in our proposed WSD approach. Section 4 describes the experiments and the analysis of our results. Section 5 is the conclusion.

2 Related Work

Many methods for automating word sense disambiguation based on annotated corpora have been proposed. Examples of supervised learning methods for WSD appear in [2-4] and [7-8]. The learning algorithms applied include decision trees, decision lists [15], neural networks [7], naïve Bayesian learning ([5], [11]), and maximum entropy [10]. Among these learning methods, the most important issue is what features are used to construct the classifier. It is common in WSD to use contextual information that can be found in the neighborhood of the ambiguous word in the training data ([6], [16-18]). It is generally true that when words are used in the same sense, they have similar context and co-occurrence information [13]. It is also generally true that the nearby context words of an ambiguous word give more effective patterns and feature values than those far from it [12].

Existing methods consider feature selection for context representation covering both local and topical features, where local features refer to information pertaining only to the given context and topical features are statistically obtained from a training corpus. Most recent works on English corpora, including [7] and [8], combine both local and topical information to improve performance. An interesting study on feature selection for Chinese [10] considered topical features as well as local collocational, syntactic, and semantic features using the maximum entropy model. In Dang's [10] work, collocational features refer to local PoS information and bi-gram co-occurrences of words within two positions of the ambiguous word. A useful result from this work, based on the tagged People's Daily News (about one million words), is that adding features from richer levels of linguistic information such as PoS tagging yielded no significant improvement (less than 1%) over using only bi-gram co-occurrence information. Another similar study for Chinese [11] is based on the Naïve Bayes classifier model and takes into consideration PoS with position information and bi-gram templates in the local context. That system reported 60.40% in both precision and recall on the SENSEVAL-3 Chinese training data.

Even though both approaches use statistically significant bi-gram co-occurrence information, the bi-grams are not necessarily true collocations. For example, in the expression "ᥠᦵⲥ㾚ᴀᎲഄᮄ㒇㊍ߚᄤⱘ⌏ࡼᚙމ", the bi-grams in their systems are (ᥠᦵ ⲥ㾚, ⲥ㾚ᴀᎲ, ᴀᎲഄ, ഄᮄ㒇㊍, ᮄ㒇㊍ⱘ, ⱘ⌏ࡼ, ⌏ࡼᚙމ). Some bi-grams such as ⌏ࡼᚙމ may have a high frequency but may introduce noise when used as features in disambiguating the senses "human|Ҏ" and "symbol|ヺো", as in the example "∈ߚᄤ⌏ࡼᚙމ". In our system, we do not rely on co-occurrence information. Instead, we utilize true collocation information, here (ᮄ㒇㊍, ߚᄤ), which falls within the window size of (-5, +5), as a feature, and the sense "human|Ҏ" can then be decided clearly. The collocation information is a pre-prepared collocation list obtained from a collocation extraction system and verified with syntactic and semantic methods ([21], [24]).

Yarowsky [9] used the one-sense-per-collocation property as an essential ingredient of an unsupervised word sense disambiguation algorithm, bootstrapping from it toward a more general high-recall disambiguation. A few recent research works have begun to pay attention to collocation features in WSD. Domminic [19] used three different methods, called the bilingual method, the collocation method, and the UMLS (Unified Medical Language System) relation-based method, to disambiguate English and German medical documents without supervision. As expected, the collocation method achieved good precision, around 79% for English and 82% for German, but very low recall: 3% for English and 1% for German.

3.1.1 Topical Features

We set the contextual window size to 10 in our system. Each of the Chinese words inside the window range, except the stop words, is considered as one topical feature. Their frequencies are calculated over the entire corpus with respect to each sense of an ambiguous word w. The sense definitions are obtained from HowNet.

3.1.2 Local Collocation Features

We chose collocations as the local features. A collocation is a recurrent and conventional fixed expression of words that holds syntactic and semantic relations [21]. Collocations can be classified as fully fixed collocations, fixed collocations, strong collocations, and loose collocations. In a fixed collocation, the appearance of one word implies the co-occurrence of the other, such as "ग़ࣙ㺅" ("burden of history"), while strong collocations allow very limited substitution of the components, for example "ഄᮍ䰶᷵" ("local college"), or "ഄᮍ
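The feature scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' system: a Naïve Bayes sense classifier whose features combine topical words inside a (-5, +5) window with a collocation feature looked up in a pre-prepared list. The window size, stop-word list, collocation list, and English stand-in tokens (echoing the "terrorist"/"particle" senses of the paper's example word) are all hypothetical, and the add-one smoothing is a common default rather than anything the paper specifies.

```python
# Sketch of Naive Bayes WSD with topical + collocation features.
# All data below (WINDOW, STOP_WORDS, COLLOCATIONS, training examples)
# are toy stand-ins, not taken from the paper.
import math
from collections import Counter, defaultdict

WINDOW = 5                                  # (-5, +5) context window
STOP_WORDS = {"the", "a", "of"}             # stand-in stop-word list
COLLOCATIONS = {("terrorist", "element")}   # stand-in verified collocation list

def extract_features(tokens, target):
    """Topical features: non-stop words in the window; collocation features:
    (neighbor, target) pairs found in the pre-prepared list."""
    lo, hi = max(0, target - WINDOW), min(len(tokens), target + WINDOW + 1)
    feats = [("topic", tokens[i]) for i in range(lo, hi)
             if i != target and tokens[i] not in STOP_WORDS]
    feats += [("colloc", (tokens[i], tokens[target])) for i in range(lo, hi)
              if i != target and (tokens[i], tokens[target]) in COLLOCATIONS]
    return feats

class NaiveBayesWSD:
    def __init__(self):
        self.sense_count = Counter()
        self.feat_count = defaultdict(Counter)   # sense -> feature counts
        self.vocab = set()

    def train(self, examples):
        """examples: iterable of (tokens, target_index, sense)."""
        for tokens, idx, sense in examples:
            self.sense_count[sense] += 1
            for f in extract_features(tokens, idx):
                self.feat_count[sense][f] += 1
                self.vocab.add(f)

    def classify(self, tokens, idx):
        feats = extract_features(tokens, idx)
        total = sum(self.sense_count.values())
        def log_prob(sense):
            # add-one smoothing over the training feature vocabulary
            denom = sum(self.feat_count[sense].values()) + len(self.vocab) + 1
            return math.log(self.sense_count[sense] / total) + sum(
                math.log((self.feat_count[sense][f] + 1) / denom) for f in feats)
        return max(self.sense_count, key=log_prob)

clf = NaiveBayesWSD()
clf.train([
    (["terrorist", "element", "burned", "laboratory"], 1, "human"),
    (["water", "element", "moves", "fast"], 1, "particle"),
])
print(clf.classify(["terrorist", "element", "checking"], 1))  # -> human
```

Note how the collocation lookup realizes the paper's tie-breaking idea: in the ambiguous "terrorist element checking" context, the matched collocation adds sense-specific evidence on top of the topical words, pushing the decision toward the sense fixed by the collocation.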