
Leveraging Morpho-semantics for the Discovery of Relations in Chinese Wordnet

Shu-Kai Hsieh
Graduate Institute of Linguistics, National Taiwan University, Taipei, Taiwan
[email protected]

Yu-Yun Chang
Graduate Institute of Linguistics, National Taiwan University, Taipei, Taiwan
[email protected]

Abstract

Semantic relations of different types have played an important role in [...], and have been widely recognized in various fields. In recent years, with the growing interest in constructing [...] in support of intelligent systems, automatic semantic relation discovery has become an urgent task. This paper aims to extract semantic relations by relying on the in situ morpho-semantic structure of Chinese, which dispenses with outside sources such as corpus or web data. Manual evaluation of thousands of pairs shows that most relations can be successfully predicted. We believe that the method can serve as a valuable starting point that complements other approaches, and that it holds promise for robust lexical relation acquisition.

1 Introduction

Semantic relations are at the core of WordNet-like architectures, and constitute an essential and integral part of the formalization of linguistic and conceptual knowledge. However, labeling semantic relations manually is very laborious.

To minimize the labor, automatic ways of extracting semantic relations from textual data have been proposed in recent years. Among these methods, extensive work has been done on the so-called pattern-based approaches pioneered by Hearst (1992). The patterns, predefined or plucked out of a corpus, are often referred to as lexico-syntactic patterns; they serve as information markers for a certain relation between two concepts. Later representative works using such approaches include (Cimiano et al., 2005) and (Pantel and Pennacchiotti, 2006). Pattern-based extraction has shown quite reasonable success, characterized by a (relatively) high precision rate, but it suffers from very low recall because the patterns are rare in corpora. Remedies for this problem involve exploiting scaled data from the web (Cimiano et al., 2005), but this runs the risk of being influenced by the web genre (Auger and Barrière, 2010).

To enrich the relation coverage of Chinese Wordnet (CWN), in this paper we propose an in situ approach that exploits morpho-semantic information. This method, simple and straightforward as it seems, does not incur the difficulties associated with lexical gaps in cross-language mapping that any translation-based model would encounter; it is also economical and complementary to previous approaches in that it dispenses with an outside corpus resource.

In what follows, Section 2 gives a brief summary of lexical semantic relation acquisition from two perspectives. Section 3 explains the proposed methods for the automatic discovery of semantic relations, which are the main focus of this study. Section 4 shows the experiment results and discussion. Finally, we conclude this paper in Section 5.

2 Relations in Chinese Wordnet

Modelled on the English WordNet, CWN was launched by Academia Sinica in 2006 and has continuously broadened its scope (Huang et al., 2010).¹ The initial version of CWN contains a manually created, fine-grained sense repository but only sparse relations. Semantic relation labeling, however, is a time-consuming and labor-demanding task. Two main methods have therefore been employed for automatic relation acquisition.

¹ Freely available at http://lope.linguistics.ntu.edu.tw/cwn

2.1 Bilingual Bootstrapping Approach

Though lexical semantic relations (LSRs) could be presumed to be more universal than word senses across human languages, directly copying or simply porting LSRs from one wordnet to another could lead to invalid relations in the target wordnet. A broader view of the underlying inference logic of cross-language LSRs, with 26 rules, was first proposed by (Huang et al., 2002) and formally introduced in (Hsieh, 2009). A series of large-scale bilingual bootstrapping experiments showed substantial improvements (55% precision) over the baseline model (47%). However, it was also reported that among the correctly predicted LSRs, a large portion (ca. 60%) belongs to non-lexical relations such as similar to, pertainym, and also see.

To look deeper into the issues, a second experiment focusing only on hypernymy-troponymy among verbs was conducted. The bootstrapping model returned a total of 12214 verb pairs mapped from WordNet 3.0, which were manually evaluated. The analysis shows that around 50% of the verb pairs can be recognized as fitting in CWN; however, two main error types are identified:

[1] Lexicalization of verbs: similar to the lexical gap problem in cross-language sense mapping, a single word in English often has meanings that require several words in Chinese to explain. The analysis of the results shows that many verbs could not be described by a single word in Chinese.

[2] Mismatch of synsets: apart from the above, there are cases in which the hypernymy-troponymy relation of a verb pair is approved, but the synset that CWN chooses is not the same as that of Princeton WordNet (PWN). This could be due to the different semantic ranges of CWN and PWN hypernymy-troponymy pairs, or due to the subtlety of sense division when the sense levels are similar.

The bilingual bootstrapping experiments showed that lexical relations are not amenable to automatic importing and would still require tremendous human effort in validation.

2.2 Pattern-based Approach

There has been a variety of studies on the automatic acquisition of lexical semantic relations. Hearst (1992) first proposed a lexico-syntactic pattern-based method for the automatic acquisition of hyponymy from unrestricted texts, and since then finding semantic relations automatically with various pattern-based algorithms has become the most common approach.

We (Lo et al., 2008) have tried to define some patterns (e.g., a manner of) to extract troponymy among verbs in Chinese. To avoid the interference of unnecessary contextual information, such as modal verbs, hedging and negation, which often occur in different corpus genres, we applied the proposed patterns to the glosses of CWN. The results were evaluated with substitution tests. The substitution test is commonly used in the linguistic literature (Tsai et al., 2002); EuroWordNet provides linguistic tests for each semantic relation to examine its validity. In (Tsai et al., 2002), sentence formulae were created following the EuroWordNet frame to examine the validity of certain semantic relations in Chinese. Linguistic semantic tests help researchers check whether two word meanings stand in a certain semantic relation, and further ensure the quality and consistency of the database. Therefore, following the previous framework, a set of sentence formulae based on the properties of troponymy was created to verify the correctness of hypernymy-troponymy verb pairs. However, due to data sparseness, the system achieves high precision but only low recall.

3 Morpho-semantic Linkage

Instead of assuming any external context in which the words to be linked appear, we propose to exploit the language-internal evidence manifested at the morpho-syntactic level in Chinese, which is presumably guided by the underlying semantic composition of [...].

3.1 Morpho-semantics in [...]

The idea of exploiting morpho-semantic information for the enrichment of WordNet has been discussed and implemented in the WordNet community for a while. (Miller and Fellbaum, 2003) first described the importance of adding "morphosemantic links" to WordNet, with later work (Fellbaum et al., 2009) on the classification of regular polysemous patterns of morphosemantic V-N pairs related via -er affixation (e.g., build-builder).

The notion of morpho-semantic links (MSLs) has been applied to other (morphologically rich) languages such as Czech (Pala and Hlaváčková, 2007) (in terms of D-relations), Turkish (Bilgin et al., 2004) and Bantu languages (Bosch et al., 2008). It is worth mentioning that the proposed morpho-semantic or derivational relations hold among literals (lemmas) rather than synsets, which leaves some room for discussion about the extra level at which these relations should be anchored, because neither paradigmatic nor syntagmatic relations would fit.

It should be noted that for morphologically poor languages like Chinese, the MSLs are quite different in that they hold not between stems and suffixes, but between word-to-be/word-used-to-be morphemes. This has practical advantages for the enrichment of existing paradigmatic relations, as we will show in the following.

3.2 Probing Morpho-Semantic Relations in Chinese

The vast majority of Chinese characters represent morphemes. The notion of wordhood has always been controversial in the lexical history of Chinese. In a way, any Chinese character can be seen as a word-to-be or word-used-to-be morpheme. Given that the relative predominance of monosyllabic words in ancient Chinese has shifted to bi-syllabic words in modern Chinese, the huge semantic weight carried by the morphemes has made the idea of a character-centered lexicon deeply ingrained in the Chinese mind. Orthographically, the lack of word delimiters (such as spaces) in texts makes it harder to reach consensus on the distinction between words, compounds and phrases, and thus makes segmentation a long-standing and heated topic in Chinese NLP.

We follow the cognitive-functional stance in that [...] and [...] form a continuum rather than two strictly separated modules. We argue that Morpho-Semantic Relations (MSRs), i.e., the ways morphemes combine to form composite meanings, can function as the organic linkage that reveals the composition mechanism along the continuum of different lexical units in varied contexts. In terms of WordNet's paradigmatic relations, this means that morpho-semantic information in Chinese can be used to identify these relations based on the position and semantic role of morphemes in modification.

In the case of Verb-Verb (VV) compound words, where the word is composed of two verbal morphemes, linguists have sorted out different types resulting from the interplay of the morphemes within (Li and Thompson, 1981). For instance, in the so-called 'parallel' VV compounds, V1 (the verb in the first position) and V2 (the verb in the second position) share a similar meaning (near synonyms), as in bang-zhu 'help-assist' (help) and fang-qi 'loosen-abandon' (give up). With a fine-grained sense analysis, we can label the troponymy between V1 and V1V2, where V1 is widely recognized as the component that carries the heavier semantic load in a VV compound (a.k.a. left-headedness).

In the case of Noun-Noun (NN) compound words, e.g., mian-dian 'noodle-shop' (noodle shop), where the word is composed of two nominal morphemes, the N1 modifier - N2 head structure is prevalently observed (a.k.a. right-headedness). The linkage between N1N2 and N2 can be labeled as hypernymy-hyponymy.

4 From MSL to Lexical Semantic Relations

4.1 Hypothesis

As argued in the previous section, morpho-semantic linkage abounds in relational knowledge. In this study, we aim to enrich CWN with relations obtained by operationalizing MSLs.

The automatic labeling of lexical semantic relations on word pairs is quite straightforward. For N1N2 compounds, ≺N1N2, N2≻ pairs are labeled with hypernymy-hyponymy, and ≺N1N2, N1≻ pairs are labeled with meronymy-holonymy.

The cases of VV compounds are trickier; the flow of judgement is shown in Algorithm 1. When V1 is synonymous or nearly synonymous with V2, V1V2 is a troponym of both V1 and V2. If V2 is one of 完住掉開壞成, the compound belongs to the subclass of VV compounds often called resultative compounds, and the relation is labeled as causality, for there is a causal relation between the event represented by the first component of such a compound and the event/state represented by the second component.

    Data: VV compounds
    Result: Labeled relations between V1V2 and V1/V2
    initialization (POS tagging);
    if V1 is (near-)synonymous with V2 then
        return troponymy;
    else
        if V2 is in 完住掉開壞成 then
            return causality;
        else if V2 is in 上下來去進回出落入向往過起 then
            return directional;
        else
            return pertainymy;
        end
    end

Algorithm 1: Pseudo code for relations labeling between V1V2 and V1/V2

4.2 Experiments

In this section, we discuss the experiment we designed, the evaluation and the error analysis.

The first step is to create a list of term pairs from a total of 561,703 words covered in CWN², Sinica BOW³, and the Ministry of Education Online Chinese Dictionary⁴. In this experiment, we focus only on bi-syllabic words represented by two characters, which constitute the largest proportion of the Chinese vocabulary.

² See http://lope.linguistics.ntu.edu.tw/cwn/
³ See http://bow.sinica.edu.tw/
⁴ See http://dict.revised.moe.edu.tw/

To obtain a coarse-grained bi-syllabic word list, only those bi-syllabic words whose two characters can both be found in the big word list are preserved. Additionally, four principles are applied to construct a more fine-grained word list: [1] the part-of-speech tags of the two characters within a bi-syllabic word should be NN or VV; [2] bi-syllabic words containing metaphors are excluded; [3] bi-syllabic morphemic words (e.g., 齷齪 (sordid)) and archaic words (e.g., 搗家) are not included; and [4] proper nouns (e.g., 成龍 (Jackie Chan)) are not considered. As a result, a list of 1482 bi-syllabic words is produced. Using the hypotheses proposed in Section 4.1, the relations are automatically labeled on the related word pairs.

A manual evaluation of the resulting lists of semantic relations was conducted. We have created a wiki-based collaborative platform⁵ on which registered users can contribute to CWN by adding new entries, editing existing ones and rating one another's contributions to ensure the quality of collective intelligence (Lee et al., 2013). Figure 1 shows a snapshot of the system.

⁵ See http://lope.linguistics.ntu.edu.tw/cwikin/

Figure 1: User graphical interface of CWIKIN

With three linguistics graduate students judging the correctness, inter-annotator agreement was measured by Fleiss' kappa (Fleiss, 1971), which is defined as:

    \[ k = \frac{P - P_e}{1 - P_e} \]

where the numerator expresses the degree of agreement actually achieved above chance, and the denominator the degree of agreement that is attainable above chance. It is interesting to see that agreement among the three raters is very poor (k = −0.7069972) on the predicted relations of ≺W1 − W1W2≻, which also get a low precision rate, while agreement reaches a moderate degree (k = 0.5835113) on the predicted relations of ≺W2 − W1W2≻, which also perform well in precision.⁶ Figure 2 shows the enrichment of relations through the experiment.

⁶ The results will be accessible at http://140.112.147.131/

Figure 2: Relations added

4.3 Discussion

The experiment we carried out gives rise to some issues for discussion. Table 1 shows the performance for each predicted relation.

    Word Pairs   Type of Relations     Precision   Observations
    W1-W1W2      Meronymy              33%         956
                 Pertainymy            25%         352
                 Troponymy             29%         174
    W2-W1W2      Causality             90%         73
                 Directional           90%         43
                 Hypernymy/Hyponymy    95%         956
                 Pertainymy            93%         241
                 Troponymy             92%         169

Table 1: Inter-annotator agreement across all relations

When we scrutinize the portion with a low precision rate, we find that the problematic cases come mostly from the predicted meronymy-holonymy relations between NN compounds, i.e., ≺N1N2 − N1≻. This is in fact not surprising, in that the definition of part-whole is not easily stated, and the judgement criteria in the previous literature are not unproblematic either. For instance, given the restrictive rules that Cruse (1986) sets on the meronymy relation, requiring the co-existence of both the 'N1N2 is part of N1' and 'N1 has N1N2' paraphrases, the raters did not all agree that the relation holds between 黨部 (party headquarters) and 黨 (political party).

Another main source of errors is the predicted troponymy-hypernymy relations between ≺V1V2 − V1≻. Recall our hypothesis that if V1 and V2 are synonymous, then V1V2 is automatically labeled as a troponym of V1. The errors that arise here can be mainly ascribed to the lack of a consistent Chinese thesaurus. In this experiment, the CWN synset (fine-grained synonym determination) and the CILIN semantic class (coarse-grained synonym determination) are integrated for prediction, but the two have different criteria regarding the sameness or nearness of senses between two verbs. In addition, the lack of proper rules for the raters' evaluation of troponymy adds to the difficulty.

Furthermore, two points can be made. [1] The experiment of relation discovery is conducted at the level of the word lemma, not the concept (word sense). In terms of wordnet, the generic label 'semantic relations' is here regarded as a relation between linguistic units rather than between concepts (i.e., synsets). Currently, the predicted relations are presumably connected with the first sense of the word lemma in CWN; a fine-grained annotation is left for future work. [2] In the evaluation task, when the raters did not agree with the predicted relation type, they also provided proper relation types for the pair, some of which are not named relations explicitly defined in WordNet, for example the qualia modification between certain N1N2 and N2, such as 肉醬 (meat sauce) - 醬 (sauce). This is different from pattern-based approaches, where a bottom-up methodology is taken because named and explicitly defined semantic relations of interest are presumed before lexico-syntactic patterns are extracted and utilized to search for instances of the relations.

5 Conclusion

Lexical semantic relations offer rich linguistic and conceptual knowledge and are the most [...] to fill in for wordnets. Semantic relation extraction has been one of the most important tasks in many fields. The challenges pertaining to this task are multifaceted. The most active pattern-based approaches provide a reasonable solution, but pose difficulties as well. In this paper, we have presented a linguistic alternative for this task in Chinese by resorting to the resources of the language itself. Rather than focusing on the pattern design - relation extraction model, a notion of morpho-semantic links is proposed to support the extraction and labeling of a wide variety of semantic relations in Chinese. The experiment shows that it is possible to discover semantic relations without being influenced by corpus size and genre. This simple strategy can also serve as a linguistic baseline for related work.

Future work includes: [1] extending to VN and NV compounds (Song and Qiu, 2013) and a more fine-grained classification of the semantic relations among these word pairs, and [2] mapping to the Japanese Wordnet, where a number of Chinese characters are employed, for advanced cross-linguistic validation. We also hope that the work presented here will shed new light on the understanding of the morpho-semantic representation of natural languages.

References

Alain Auger and Caroline Barrière (eds). 2010. Probing Semantic Relations. John Benjamins Publishing, Amsterdam/Philadelphia.

Orhan Bilgin, Özlem Çetinoglu and Kemal Oflazer. 2004. Morphosemantic Relations In and Across Wordnets: A Study Based on Turkish. In: Proceedings of the Second Global WordNet Conference, 60–66.

Sonja Bosch, Christiane Fellbaum and Karel Pala. 2008. Enhancing WordNets with Morphological Relations: A Case Study from Czech, English and Zulu. In: Proceedings of the Fourth Global WordNet Conference, 74–90.

Philipp Cimiano, Aleksander Pivk, Lars Schmidt-Thieme and Steffen Staab. 2005. Learning Taxonomic Relations from Heterogeneous Sources of Evidence. In: Buitelaar (eds), Ontology Learning from Text: Methods, Evaluation and Applications, 55–73.

Christiane Fellbaum, Anne Osherson, and Peter E. Clark. 2009. Putting Semantics into WordNet's "Morphosemantic" Links. In: Zygmunt Vetulani and Hans Uszkoreit (eds). Human Language Technology. Challenges of the Information Society, 350–358.

Joseph L. Fleiss. 1971. Measuring Nominal Scale Agreement among Many Raters. Psychological Bulletin, 76(5):378–382.

Marti A. Hearst. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. In: Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING), 539–545.

Shu-Kai Hsieh. 2009. Formal Description of Lexical Semantic Relations. Concentric: Studies in Linguistics, 35(1):87–109.

Chu-Ren Huang, I-Ju Tseng, and Dylan Tsai. 2002. Translating Lexical Semantic Relations. In: SEMANET '02 Proceedings of the 2002 Workshop on Building and Using Semantic Networks, 1–7.

Chu-Ren Huang, Shu-Kai Hsieh, Jia-Fei Hong, Yun-Zhu Chen, I-Li Su, Yong-Xiang Chen and Sheng-Wei Huang. 2010. Chinese Wordnet: Design, Implementation, and Application of an Infrastructure for Cross-Lingual Knowledge Processing. Journal of Chinese Information Processing, 24(2):14–23.

Chih-Yao Lee, Yu-Yun Chang, Shu-Kai Hsieh, Jia-Fei Hong and Chu-Ren Huang. 2013. CWIKIN: A wiki that Helps Quicken the Development of Chinese Wordnet. The 8th International Conference of the Asian Association for Lexicography. Bali, Indonesia.

Chiao-Shan Lo, Yi-Rung Chen, Chih-Yu Lin, and Shu-Kai Hsieh. 2008. Automatic Labeling of Troponymy for Chinese Verbs. In: Proceedings of the 20th Conference on Computational Linguistics and Speech Processing (ROCLING).

George Miller and Christiane Fellbaum. 2003. Morphosemantic Links in WordNet. Traitement Automatique de Langue, 44(2):69–80.

Karel Pala and Dana Hlaváčková. 2007. Derivational Relations in Czech WordNet. In: ACL '07 Proceedings of the Workshop on Balto-Slavonic Natural Language Processing: Information Extraction and Enabling Technologies. Stroudsburg, PA, USA.

Patrick Pantel and Marco Pennacchiotti. 2006. Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations. In: Proceedings of Conference on Computational Linguistics / Association for Computational Linguistics (COLING/ACL-06).

Gerardo Sierra, Rodrigo Alarcón, César Aguilar and Carme Bach. 2010. Definitional Verbal Patterns for Semantic Relation Extraction. In: Alain Auger and Caroline Barrière (eds). Probing Semantic Relations. John Benjamins Publishing, Amsterdam/Philadelphia.

Zuoyan Song and Qiu Likun. 2013. Qualia Relations in Chinese Nominal Compounds Containing Verbal Elements. International Journal of Knowledge and Language Processing, 28(1):114–133.

Dylan B. Tsai, Chu-Ren Huang, Shu-Chuan Tseng, Elanna J.I. Lin, Keh-jiann Chen and Yuan-hsun Chuang. 2002. Chinese Lexical Semantic Relations: Definition and Classification Criteria. Journal of Chinese Information Processing, 2002(4):21–31.

Charles N. Li and Sandra A. Thompson. 1981. Mandarin Chinese: A Functional Reference Grammar. Berkeley: University of California Press.
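
As a supplementary illustration of the labeling hypothesis in Section 4.1, the branch order of Algorithm 1 and the two N1N2 rules can be sketched in Python. This is a minimal sketch, not the implementation used in the experiment: the function names, the relation strings and the `is_synonym` callback (a stand-in for the CWN synset / CILIN lookup) are assumptions made for illustration, and the toy synonym table covers only the bang-zhu example.

```python
from typing import Callable, Dict, Tuple

# V2 morphemes listed in Algorithm 1.
RESULTATIVE_V2 = set("完住掉開壞成")
DIRECTIONAL_V2 = set("上下來去進回出落入向往過起")


def label_vv(v1: str, v2: str, is_synonym: Callable[[str, str], bool]) -> str:
    """Return the relation label for a V1V2 compound, in the order of Algorithm 1."""
    if is_synonym(v1, v2):      # "V1 is V2": (near-)synonymous morphemes
        return "troponymy"
    if v2 in RESULTATIVE_V2:    # resultative compounds
        return "causality"
    if v2 in DIRECTIONAL_V2:
        return "directional"
    return "pertainymy"         # fallback branch of Algorithm 1


def label_nn(n1: str, n2: str) -> Dict[Tuple[str, str], str]:
    """Label both pairs of an N1N2 compound as stated in Section 4.1."""
    compound = n1 + n2
    return {
        (compound, n2): "hypernymy-hyponymy",
        (compound, n1): "meronymy-holonymy",
    }


# Hypothetical toy thesaurus: only 幫/助 (bang-zhu) are treated as synonyms.
TOY_SYNONYMS = {frozenset(("幫", "助"))}


def toy_is_synonym(a: str, b: str) -> bool:
    """Assumed stand-in for a real synonymy lookup."""
    return frozenset((a, b)) in TOY_SYNONYMS


if __name__ == "__main__":
    print(label_vv("幫", "助", toy_is_synonym))  # synonymous morphemes
    print(label_vv("打", "開", toy_is_synonym))  # V2 is in the resultative list
    print(label_nn("肉", "醬"))
```

Because every branch returns, the nested conditions of Algorithm 1 are written here as a flat chain; the synonymy test is kept as a parameter because Section 4.3 attributes many troponymy errors to the choice of thesaurus.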
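
For readers who wish to compute the agreement statistic used in Section 4.2, the following sketch evaluates Fleiss' kappa, k = (P − Pe) / (1 − Pe), from a table of rating counts. The function name and the toy ratings are hypothetical; the raw judgments of the three raters are not given in the paper, so the example does not reproduce the reported values.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for counts[i][j] = number of raters putting item i in category j.

    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")

    # P: mean over items of the proportion of agreeing rater pairs.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_mean = sum(p_items) / n_items

    # Pe: chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_chance = sum(p * p for p in proportions)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when only one category is used")

    return (p_mean - p_chance) / (1.0 - p_chance)


if __name__ == "__main__":
    # Toy data: 3 raters, categories (correct, incorrect), 4 predicted pairs.
    unanimous = [[3, 0], [0, 3], [3, 0], [0, 3]]
    split = [[2, 1], [1, 2]]
    print(fleiss_kappa(unanimous))
    print(fleiss_kappa(split))
```

In this layout each row would correspond to one predicted pair and each column to a judgment category, which is an assumption about how the raters' decisions were recorded.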
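
The construction of the candidate list in Section 4.2 (a coarse filter on the two characters, followed by the four principles) can likewise be sketched as a simple filter. The `Candidate` record, its flags and the `known_chars` set are hypothetical: the paper does not say how metaphorical, morphemic, archaic or proper-noun status was recorded, so these are modeled here as manually supplied booleans, and the part-of-speech values attached to excluded words are placeholders.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Candidate:
    word: str                            # bi-syllabic word, two characters
    pos_pattern: str                     # POS of the two characters, e.g. "NN", "VV", "VN"
    is_metaphor: bool = False            # principle [2]
    is_morphemic_or_archaic: bool = False  # principle [3]
    is_proper_noun: bool = False         # principle [4]


def filter_candidates(cands: Iterable[Candidate], known_chars: Set[str]) -> List[Candidate]:
    """Keep the words that pass the coarse filter and the four principles."""
    kept = []
    for c in cands:
        if len(c.word) != 2:
            continue  # only two-character words are considered
        if not all(ch in known_chars for ch in c.word):
            continue  # coarse filter: both characters must be in the big word list
        if c.pos_pattern not in {"NN", "VV"}:
            continue  # principle [1]
        if c.is_metaphor or c.is_morphemic_or_archaic or c.is_proper_noun:
            continue  # principles [2] to [4]
        kept.append(c)
    return kept


if __name__ == "__main__":
    known = set("肉醬幫助成龍齷齪吃飯")  # assumed single-character entries
    candidates = [
        Candidate("肉醬", "NN"),
        Candidate("幫助", "VV"),
        Candidate("吃飯", "VN"),                                # fails principle [1]
        Candidate("齷齪", "NN", is_morphemic_or_archaic=True),  # placeholder POS
        Candidate("成龍", "NN", is_proper_noun=True),           # placeholder POS
        Candidate("麵店", "NN"),                                # characters not in `known`
    ]
    print([c.word for c in filter_candidates(candidates, known)])
```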