Measuring Synonyms in Erya by Probabilistic Semantics

Kam Tang LAU 1, Yan SONG 2, Pang Fei KWOK 3

1. College of Professional and Continuing Education, The Hong Kong Polytechnic University, Hong Kong, China. Email: [email protected]

2. Microsoft Search Technology Center Asia, Beijing, China. Email: [email protected]

3. Department of Chinese and History, City University of Hong Kong, Hong Kong, China. Email: [email protected]

Abstract

Erya is the earliest Chinese dictionary; it was compiled and circulated before the Qin Dynasty (221 BC). It was used for annotating the Confucian Classics, and therefore the words collected in Erya come from the pre-Qin Confucian Classics. It is recognized as one of the most important works of Chinese philology. Erya was later listed as one of the Shisan Jing (Thirteen Confucian Classics), which shows that it is regarded as being as important as the other Confucian Classics compiled in the pre-Qin period.

Erya is organized like a synonym thesaurus. It puts synonyms in parallel and explains them by the last word followed by "Ye". ("Ye" is used as the verb-to-be "is" in Classical Chinese and is placed after the word being explained.) There are two main explanation patterns in Erya: one is "A, B, C, D Ye." and the other is "E, F Ye. F, G Ye." Many Chinese philology scholars have discussed why the editors did not simply record "E, F, G Ye". They explained this phenomenon with extensive evidence and examples from the Confucian Classics, and they believed that although both E and F are synonyms of G, F is semantically closer to G than E is.
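To make the two patterns concrete, the sketch below (in Python) shows how such entries could be parsed into (explained words, headword) groups. The entry strings are schematic placeholders built from the letters above, not actual Erya text.

    # Minimal sketch: parse the two Erya explanation patterns into
    # (explained words, headword) groups. The strings are schematic
    # placeholders (A, B, C ... stand for Chinese words), not real entries.

    def parse_entry(entry):
        """Split an entry such as "E, F Ye, F, G Ye" into synonym groups."""
        groups = []
        # Each clause ends with "Ye" ("is"); the last word of a clause
        # is the headword that explains the words before it.
        for clause in entry.split("Ye"):
            words = [w for w in clause.replace(",", " ").split() if w]
            if len(words) >= 2:
                groups.append((words[:-1], words[-1]))
        return groups

    # Pattern 1: A, B and C are all explained directly by D.
    print(parse_entry("A, B, C, D Ye"))     # [(['A', 'B', 'C'], 'D')]
    # Pattern 2: E is explained by F, and F in turn by G.
    print(parse_entry("E, F Ye, F, G Ye"))  # [(['E'], 'F'), (['F'], 'G')]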

This research aims to verify, by computational methods, whether the conclusions of these Chinese philology scholars are valid. This offers a more scientific and precise way to measure the distance between the words in Erya. We use the prevailing neural-network-based word-to-vector modeling, word2vec (Mikolov et al., 2013), to build a vector for every word in Erya and investigate the distances (probabilistic semantics) among the words in question. The continuous bag-of-words (CBOW) framework is used to train the word vectors so that we can manipulate them in further computation, such as vector summation and concatenation, for investigating word contexts. To this end, the entire text of Erya is preprocessed by word segmentation before our investigation, following the process we applied to Huainanzi (Lau et al., 2013).
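As an illustration of this pipeline, here is a minimal sketch assuming the gensim library (version 4.x) for CBOW training; the corpus file name and the use of cosine similarity as the distance measure are our assumptions, and E, F, G are the placeholder words from above.

    # Sketch: train CBOW vectors on the segmented Erya text and compare
    # word distances. Assumes gensim 4.x; "erya_segmented.txt" is a
    # hypothetical file with one segmented entry per line, words
    # separated by spaces.
    from gensim.models import Word2Vec

    with open("erya_segmented.txt", encoding="utf-8") as f:
        sentences = [line.split() for line in f]

    model = Word2Vec(
        sentences,
        vector_size=100,  # dimensionality of the word vectors
        window=5,         # context window size
        min_count=1,      # keep rare words; a classical corpus is small
        sg=0,             # sg=0 selects the CBOW architecture
    )

    # Cosine similarity as a proxy for semantic closeness: the scholars'
    # claim predicts similarity(F, G) > similarity(E, G).
    print(model.wv.similarity("E", "G"))
    print(model.wv.similarity("F", "G"))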

References

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the Workshop at ICLR 2013.

Kam Tang Lau, Yan Song, and Fei Xia. 2013. The Construction of a Segmented and Part-of-speech Tagged Archaic Chinese Corpus: A Case Study on Huainanzi (in Chinese). In Proceedings of the 12th China National Conference on Computational Linguistics (CNCCL 2013), Suzhou, China, October 2013.