International Journal of Pattern Recognition and Artificial Intelligence Vol. 29, No. 2 (2015) 1559003 (26 pages) © World Scientific Publishing Company DOI: 10.1142/S021800141559003X
Document Representation with Statistical Word Senses in Cross-Lingual Document Clustering
Guoyu Tang* and Yunqing Xia† Department of Computer Science and Technology TNList, Tsinghua University Beijing 100084, P. R. China *[email protected] †[email protected]
Erik Cambria School of Computer Engineering Nanyang Technological University 50 Nanyang Avenue, Singapore 639798, Singapore [email protected]
Peng Jin School of Computer Science Leshan Normal University Leshan 614000, P. R. China [email protected]
Thomas Fang Zheng Department of Computer Science and Technology TNList, Tsinghua University Beijing 100084, P. R. China [email protected]
Received 23 May 2014 Accepted 16 October 2014 Published 18 February 2015
Cross-lingual document clustering is the task of automatically organizing a large collection of multi-lingual documents into a few clusters according to their content or topic. It is well known that the language barrier and translation ambiguity are two challenging issues for cross-lingual document representation. To this end, we propose to represent cross-lingual documents through statistical word senses, which are automatically discovered from a parallel corpus through a novel cross-lingual word sense induction model and a sense clustering method. On top of these senses, two document representation models are built: a sense-based vector space model and a sense-based latent
†Corresponding author.
Dirichlet allocation. Evaluation on benchmark datasets shows that the proposed models outperform two state-of-the-art methods for cross-lingual document clustering.
Keywords: Word sense; cross-lingual document representation; cross-lingual document clustering.
1. Introduction

Economic globalization and the internationalization of business urge organizations to handle an increasing number of documents written in different languages. As an important technology for cross-lingual information access, cross-lingual document clustering (CLDC) seeks to automatically organize a large collection of multi-lingual documents into a small number of clusters, each of which contains semantically similar cross-lingual documents.

Various document representation (DR) models have been proposed to deal with mono-lingual documents. The classic DR model is the vector space model (VSM),32 which typically uses words as the feature space. However, words are in fact not independent of each other. Two semantic word relations are worth mentioning: synonymy and polysemy. Synonymy indicates that different words can carry identical or similar meanings, and polysemy implies that a single word can have two or more senses. To address these issues, previous research attempted to represent documents through either explicit or latent semantic spaces.3,6,14,16,20,46

In the cross-lingual case, DR models present two main issues: the language barrier and translation ambiguity. As for the former, a term in one language and its counterparts in other languages should be viewed as a single feature in cross-lingual DR. In some earlier systems, dictionaries were used to map cross-lingual terms.12,25 However, such systems all suffered from the latter issue, which implies that one term can be translated into different terms in another language, especially when such terms entail common-sense knowledge.7 Two translation ambiguity scenarios are worth noting. In the first scenario, the term carries different meanings (namely, senses). For example, the word arm has two general meanings: (1) the part of the body from shoulder to hand, and (2) a thing that is used for fighting.
Accordingly, the word arm should be translated into (shou3 bi4, arm) in a context relating to the human body, but as (zhuang1 bei4, arm) in a military context. The second scenario applies when we have to select one of many possible translations to convey a specific meaning. For example, as a part of the human body, the word arm can also be translated into (ge1 bo2, arm), which is not quite the same as (shou3 bi4, arm). This is a common problem in natural language processing (NLP) research even for mono-lingual documents, e.g. when switching between different domains.43 In the context of CLDC, popular approaches explore word co-occurrence statistics within parallel/comparable corpora.18,23,35,45 Recent works improved clustering performance by aligning terms from different languages at the topic level.4,27,29,41 Nonetheless, cross-lingual topic alignment remains an open challenge.
In this work, we treat translation ambiguity of terms, e.g. (shou3 bi4, arm) and (zhuang1 bei4, arm), as polysemy, and translation choices, e.g. (shou3 bi4, arm) and (ge1 bo2, arm), as synonymy. As the synonymy and polysemy problems are closely related to word senses, we propose to represent documents with cross-lingual statistical senses. Unlike previous approaches, which extract word senses from dictionaries, we propose to induce word senses statistically from corpora. To deal with the cross-lingual case, a novel cross-lingual word sense induction (WSI) model, referred to as CLHDP, is proposed to learn senses for each word (referred to as local word senses) from parallel corpora. A sense clustering method is then adopted to discover global word senses from the semantic relatedness between senses of different words. On this basis, two cross-lingual DR models are proposed: a sense-based VSM and a sense-based latent Dirichlet allocation (LDA) model. Two advantages of the proposed models are worth noting. Firstly, synonymy can be naturally addressed when word senses are involved. Words in one language that carry the same meaning can be organized under one word sense; in the cross-lingual case, words across languages can likewise be organized under one cross-lingual word sense. With synonymy addressed, cross-lingual documents can be represented more accurately. As a result, more accurate cross-lingual document similarity can be obtained and, hence, CLDC improves. Secondly, polysemy can also be well addressed, as the translation ambiguity of polysemous words can be resolved within cross-lingual contexts. Consequently, cross-lingual document similarity can be calculated more accurately once cross-lingual word sense disambiguation is achieved.
By jointly addressing synonymy and polysemy, the proposed cross-lingual DR models work at a more semantic level and are thus able to outperform bag-of-words models.8 Compared to topic-level DR models, moreover, the proposed models are more fine-grained and, hence, more accurate. The structure of the paper is as follows: Section 2 introduces related work in the field of CLDC. Sections 3 and 4 illustrate the proposed models in detail. Section 5 presents evaluation and discussion. Section 6, finally, gives concluding remarks and future directions.
2. Related Work

2.1. DR models

This work is closely related to DR models. The traditional VSM assumes that terms are independent of each other and thus ignores any semantic relations between them. Previous works used concepts or word clusters10,30 as features, or used similarities of words,13,42 but they still failed to handle the polysemy problem. To address both the synonymy and polysemy issues, some DR models build on lexical ontologies such as WordNet or Wikipedia to represent documents in a concept space.14,16,17 However, lexical ontologies are difficult to construct and are
also hardly complete; moreover, they tend to over-represent rare word senses while missing corpus-specific senses. Representative extensions of the classic VSM are latent semantic analysis (LSA)20 and LDA.3 LSA decomposes the term-document matrix by applying singular value decomposition, so that each feature is a linear combination of all words. However, LSA cannot solve the polysemy problem. LDA has been successfully used for topic discovery3,21 but, according to Ref. 24, it may not perform well by itself in text mining tasks, especially those requiring fine-grained discrimination, e.g. document clustering. Most of these semantic models, moreover, are designed for mono-lingual document sets and cannot be used directly in cross-lingual scenarios.
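As a rough illustration of the LSA step just described, the following sketch applies a truncated SVD to a toy term-document count matrix; the vocabulary, counts, and number of latent dimensions are invented for illustration:

```python
import numpy as np

# Toy term-document count matrix (rows: terms, columns: documents).
# Documents 0-1 are "body" documents, documents 2-3 are "military" documents.
X = np.array([
    [2, 1, 0, 0],   # arm (body-sense contexts)
    [1, 2, 0, 0],   # limb
    [0, 0, 3, 1],   # weapon
    [0, 0, 1, 2],   # war
], dtype=float)

# LSA: truncated singular value decomposition of the term-document matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                        # latent dimensions kept
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T    # documents in the latent space

def cos(a, b):
    """Cosine similarity between two latent document vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

Documents that load on the same latent dimension (here, the two body-related documents) come out close in the reduced space even with little surface overlap; note, however, that a polysemous term still occupies a single row, which is why LSA alone cannot separate its senses.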
2.2. Cross-lingual document clustering

The main issue of CLDC is dealing with the cross-language barrier. The straightforward solution is document translation. In TDT3, four systems attempted to use machine translation systems.22 Results show that using a machine translation tool leads to around 50% performance loss compared with mono-lingual topic tracking, which is ascribed mainly to the poor accuracy of machine translation systems. Dictionaries and corpora are two popular sources of cross-language information. Some researchers use dictionaries to translate documents.12 Others use dictionaries to translate features or keywords. Mathieu et al. use bi-lingual dictionaries to translate named entities and keywords and modify the cosine similarity formula to calculate similarity between bi-lingual documents.25 Pouliquen et al. rely on a multi-lingual thesaurus called Eurovoc to create cross-lingual article vectors.31 However, it is hard to select the proper translation of ambiguous words in different contexts. To address this problem, some researchers leverage word co-occurrence frequencies from corpora.12,25 However, they still need a dictionary, and such human-defined lexical resources are difficult to construct and hardly complete. Wei et al. use LSA to construct a multi-lingual semantic space onto which words and documents in either language can be mapped, with dimensions reduced again according to the documents to be clustered.41 Yogatama and Tanaka-Ishii use a propagation algorithm to merge multi-lingual spaces from a comparable corpus and a spectral method to cluster documents.45 Li and Shawe-Taylor use kernel canonical correlation analysis, a method that finds the maximally correlated projections of documents in two languages, for cross-language Japanese-English patent retrieval and document classification.23 Unlike document classification, document clustering usually lacks training data.
Hence, semantic spaces are constructed from parallel/comparable corpora, and dimensions are selected on the basis of their importance in such corpora, which usually differ from the target multi-lingual documents. Mimno et al. introduce a poly-lingual topic model that discovers topics aligned across multiple languages.27 However, topics generated from a parallel corpus may not align well with the
topics discovered from the target documents. Tang et al. use cross-lingual word similarity, but ignore the translation ambiguity problem.39 In this work, we view the language barrier and translation ambiguity as synonymy and polysemy problems and propose to use statistical word senses to represent documents in different languages. Our proposed model can concurrently deal with the problems of synonymy and polysemy and, hence, outperforms state-of-the-art CLDC methods.
2.3. WSI and disambiguation

Many approaches have been proposed to address the word sense disambiguation (WSD) task.11,26,28 The use of word senses has been shown to enhance performance on many NLP tasks.38 However, it requires manually compiled large lexical resources such as WordNet. In many other cases, word senses are learned from corpora in an unsupervised manner, a task known as WSI. Many WSI algorithms have been proposed in the literature.9 The Bayesian model proposed in Ref. 5 uses an extended LDA model to induce word senses. It outperforms the state-of-the-art systems in the SemEval-2007 evaluation1 by using a hierarchical Dirichlet process (HDP)40 to induce word senses. Unlike LDA, which requires a specified number of topics, HDP is able to infer this number automatically. Apidianaki uses a bi-lingual corpus and takes translation equivalence clusters as word senses.2 This assumes that word instances with equivalent translations carry the same meaning, which is not always true, as instances of the same word with different meanings may be translated into the same word in another language. WSI algorithms have already been integrated into information retrieval.34,37 However, to the best of our knowledge, the above-mentioned works only consider senses of query words, while in document clustering the sense of every word in the documents should be identified. In this paper, we propose to induce cross-lingual word senses from a parallel corpus by means of a novel Bayesian sense induction model, termed CLHDP, which is also exploited for WSD.
3. CLDC System

3.1. An overview

Figure 1 presents the workflow of our sense-based CLDC system. Firstly, senses of individual words (referred to as local word senses) in each language are induced from the parallel corpus by means of a cross-lingual WSI (CL-WSI) algorithm. As a result, we obtain a set of local word senses, each of which is represented by a distribution over cross-language words. Secondly, after grouping the cross-language local word senses into one set, a clustering algorithm is used to partition this set and, hence, to obtain a few word sense subsets, each of which contains semantically similar
[Figure 1: workflow diagram. Parallel corpus → WSI → local word senses → word sense clustering → global word senses; cross-lingual documents → WSD → bag of cross-lingual senses → sense-based DR → cross-lingual document models → document clustering → cross-lingual document clusters.]
Fig. 1. Workflow of the CLDC system.

cross-language word senses. By using one sense from each subset to represent the different subsets, we obtain a few cross-lingual global word senses. Thirdly, cross-lingual documents are represented through these cross-lingual global word senses. Finally, the clustering algorithm is executed on the cross-lingual documents.
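The four steps above can be sketched as a pipeline skeleton; every function name below is a hypothetical stand-in for the corresponding module, not an interface from the actual system:

```python
# Hypothetical skeleton of the sense-based CLDC workflow in Fig. 1.
# Every callable passed in stands for one module of the system.

def cldc_pipeline(parallel_corpus, documents, induce_senses, cluster_senses,
                  disambiguate, represent, cluster_documents):
    # Step 1: CL-WSI induces local word senses from the parallel corpus.
    local_senses = induce_senses(parallel_corpus)
    # Step 2: clustering groups synonymous local senses into global senses.
    global_senses = cluster_senses(local_senses)
    # Step 3: WSD turns each document into a bag of senses, which the
    # sense-based DR model maps onto the global senses.
    doc_models = [represent(disambiguate(doc, local_senses), global_senses)
                  for doc in documents]
    # Step 4: the clustering algorithm runs on the sense-based models.
    return cluster_documents(doc_models)
```

Any concrete WSI, WSD, representation, or clustering component can be slotted into these positions; the paper's choices (CLHDP, the sense clustering method, sense-based VSM/LDA) fill them in the actual system.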
3.2. Summary of novelty

Two novel points in the CLDC system are worth noting.
(1) We propose a cross-lingual WSI algorithm by adapting mono-lingual HDP40 to the cross-lingual scenario (CLHDP), and use a clustering method to discover semantic relatedness between senses of different words.
(2) Cross-lingual DR models are proposed to represent cross-lingual documents with the cross-lingual word senses, which are learnt by means of the CLHDP algorithm on a parallel corpus.
In the next sections, we show how and why the proposed models improve on existing DR models and explain the modules of the system in detail.
4. Theory and Algorithms

4.1. Definitions
Definition: Local word sense
A local word sense s_w of word w is statistically represented by a set of discrete distributions over context words, one for a specific language l, i.e.:
s_w = \{c_i^l : p(c_i^l \mid s_w)\}, \quad i = 1, \ldots, N, \qquad (1)

where s_w denotes a local sense of word w, c_i^l is a context word in language l, and p(c_i^l \mid s_w) is the probability of c_i^l under s_w.
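A minimal sketch of Definition (1) in code, assuming raw co-occurrence counts have already been collected for one induced sense; the counts below are invented for illustration:

```python
# Sketch of Definition (1): a local word sense s_w represented as a discrete
# distribution p(c_i | s_w) over context words, normalized from raw
# co-occurrence counts (the counts below are invented).

def make_local_sense(context_counts):
    """Normalize context-word counts into a probability distribution."""
    total = sum(context_counts.values())
    return {c: n / total for c, n in context_counts.items()}

# A hypothetical "body" sense of the word arm.
arm_sense_1 = make_local_sense({"limb": 16, "forelimb": 7, "sleeve": 2})
```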
To obtain word senses, previous work relied on thesauri, which are time- and resource-consuming to construct. In this work, instead, we use context words, together with their probabilities, to reflect word senses. Using the word arm again as an example, the following two local word senses can be learnt from the corpus.
. arm#1 = {limb: 0.159, forelimb: 0.069, sleeve: 0.019}
. arm#2 = {weapon: 0.116, war: 0.039, battle: 0.026}
The example indicates that a local sense of the word arm involves specific context words and their probability values, which are estimated from the corpus through a WSI algorithm. Obviously, local word senses can address the polysemy issue.

Definition: Cross-lingual local word sense

In the cross-lingual scenario, we extend the local word sense definition so that it involves multi-lingual context words, which are extracted from a parallel corpus, i.e.:

s_w = \begin{bmatrix} \{c_i^{l_1} : p(c_i^{l_1} \mid s_w)\}, & i = 1, \ldots, N_{l_1} \\ \cdots \\ \{c_i^{l_L} : p(c_i^{l_L} \mid s_w)\}, & i = 1, \ldots, N_{l_L} \end{bmatrix}, \qquad (2)
where c_i^{l_k} is a context word in language l_k, and p(c_i^{l_k} \mid s_w) is the probability of c_i^{l_k} under s_w within texts in language l_k. For the word arm in the English-Chinese scenario, for example, the following two cross-lingual local word senses are illustrative.
. arm#1 = {limb: 0.159, forelimb: 0.069, sleeve: 0.019; : 0.137, : 0.079, : 0.017}
. arm#2 = {weapon: 0.116, war: 0.039, battle: 0.026; : 0.153, : 0.027, : 0.026}
With an English-Chinese parallel corpus, cross-lingual local word senses can be obtained through the CL-WSI algorithm. As seen in the above example, cross-lingual local word senses can address the polysemy issue in cross-lingual scenarios. However, local word senses are induced for every word separately, so it is very common that a large number of synonymous word senses exist. Hence, we further propose to learn global word senses, which represent universally exclusive word senses.

Definition: Cross-lingual global word sense

A global word sense g is a virtual word sense generalized from a group of synonymous local word senses, formalized as follows.
g = \{s_w^j\}, \quad j = 1, \ldots, M, \qquad (3)

where s_w^j represents a local word sense. When the local word senses are induced from a cross-lingual scenario, the global word sense becomes cross-lingual naturally. In our CLDC system, the global word senses are discovered through a clustering algorithm
that uses context words as features in calculating the semantic similarity between local word senses. Again, we use the word arm as an example to illustrate the global word sense:
. g#1 = {arm#1, #1} = { {limb: 0.159, forelimb: 0.069, sleeve: 0.019; : 0.137, : 0.079, : 0.017}, {arm: 0.189, forelimb: 0.058, sleeve: 0.025; : 0.159, : 0.089, : 0.014} }
. g#2 = {arm#2, weapon#1, #1} = { {weapon: 0.116, war: 0.039, battle: 0.026; : 0.153, : 0.027, : 0.026}, {arm: 0.12, battle: 0.04, war: 0.016; : 0.133, : 0.035, : 0.028}, {arm: 0.14, weapon: 0.12, war: 0.016; : 0.133, : 0.035, : 0.028} }
As shown in the above examples, the senses arm#1 and #1 are organized under the global word sense g#1 because their context distributions are similar. In this way, synonymous word senses in both languages can be organized under one global word sense, and synonymy is thus successfully addressed. In the following sections, we present how the cross-lingual word senses are learned from the parallel corpus.
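The grouping of synonymous local senses into global senses can be sketched with a simple greedy threshold clustering over context-word distributions. This stands in for the paper's actual clustering method; the French sense bras#1, the threshold, and all probabilities are invented for illustration:

```python
import math

# Sketch of global-sense discovery: local senses whose context-word
# distributions are similar are merged into one global sense. A greedy
# threshold clustering stands in for the paper's clustering step.

def cosine(p, q):
    """Cosine similarity between two sparse probability distributions."""
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in keys)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

def group_senses(local_senses, threshold=0.5):
    """Greedily assign each local sense to the first similar-enough group."""
    groups = []  # each group is a list of (name, distribution) pairs
    for name, dist in local_senses.items():
        for g in groups:
            if cosine(dist, g[0][1]) >= threshold:
                g.append((name, dist))
                break
        else:
            groups.append([(name, dist)])
    return groups

local = {
    "arm#1":  {"limb": 0.6, "forelimb": 0.3, "sleeve": 0.1},
    "bras#1": {"limb": 0.5, "forelimb": 0.4, "sleeve": 0.1},  # invented French sense
    "arm#2":  {"weapon": 0.6, "war": 0.2, "battle": 0.2},
}
groups = group_senses(local)
```

Here the body senses of arm and bras fall into one group (one global sense) because their context distributions nearly coincide, while the military sense of arm starts its own group.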
4.2. Learning the cross-lingual word senses

Two steps are required in learning the cross-lingual word senses:
(1) The local word senses are first induced from a parallel corpus;
(2) The global word senses are generalized from the local word senses.
4.2.1. Local WSI

A Bayesian model is adopted to achieve the task of local word sense learning. To be more specific, we extend HDP40 to the cross-lingual scenario, referred to as CLHDP. The theory of HDP is briefly introduced first.

HDP for WSI

HDP was proposed for text modeling. Yao and Van Durme employ HDP for WSI.44 HDP is performed on each word separately, which means each word has its own HDP model. In this paper, we call the word on which the WSI algorithm is performed the target word, and the words in the context of a target word its context words. HDP is a generative model, which can randomly generate observed data. For each context v_i of the target word w, the sense s_ij for each word c_ij in v_i has a nonparametric prior G_i, which is sampled from a base distribution G_w. H_w is a Dirichlet distribution with hyperparameter \beta_w. The context word distribution \phi_{s_w} given a sense s_w is generated from H_w: \phi_{s_w} \sim H_w. The generative process of a target
word w is given as follows:
(1) Choose G_w \sim DP(\gamma_w, H_w).
(2) For each context window v_i of word w:
    (a) choose G_i \sim DP(\alpha_w, G_w);
    (b) for each context word c_ij of target word w:
        (i) choose s_ij \sim G_i;
        (ii) choose c_ij \sim Mult(\phi_{s_ij}).
Hyperparameters \gamma_w and \alpha_w are the concentration parameters of the DPs, controlling the variability of the distributions G_w and G_i, respectively. HDP is illustrated in Fig. 2, where the shaded circle represents the observed variable, the context word c_ij. HDP can be constructed via the stick-breaking process and the Chinese restaurant process.40

CLHDP model

CLHDP models word senses through cross-lingual context tuples. Each tuple is a set of contexts that are equivalent to each other but written in different languages. Two assumptions are made in CLHDP. Firstly, contexts in a tuple share the same tuple-specific distribution over senses. Secondly, each sense consists of a set of discrete distributions over context words, one for each language l = 1, \ldots, L. In other words, rather than using one \phi_s for each sense s, as in HDP, there are L language-specific sense-context word distributions \phi_s^1, \ldots, \phi_s^L, each of which is drawn from a language-specific symmetric Dirichlet H_w^l with concentration parameter \beta_w^l. CLHDP is illustrated in Fig. 3. As shown in Fig. 3, the generative process of a target word w is given as follows:
(1) Choose G_w \sim DP(\gamma_w, H).
(2) For each context window v_i of w:
    (a) choose G_i \sim DP(\alpha_w, G_w);
    (b) for each context word c_ij^l in language l of target word w:
        (i) choose s_ij^l \sim G_i;
        (ii) choose c_ij^l \sim Mult(\phi_{s_ij}^l).
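To make the generative story concrete, the following toy forward simulation draws sense weights by a truncated stick-breaking construction and then emits one bilingual context window. It illustrates the sampling steps only, not the inference procedure; the sense distributions are invented, the Chinese context words are shown in pinyin, and the per-word sense draw is simplified:

```python
import random

random.seed(0)

def stick_breaking(gamma, truncation):
    """Draw truncated mixture weights beta from a stick-breaking construction."""
    betas, remaining = [], 1.0
    for _ in range(truncation):
        b = random.betavariate(1.0, gamma)
        betas.append(remaining * b)
        remaining *= 1.0 - b
    betas.append(remaining)  # leftover mass for unseen senses
    return betas

def sample_discrete(weights):
    """Sample an index proportionally to the given weights."""
    r, acc = random.random() * sum(weights), 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1

# Two toy senses of "arm"; each holds one context-word distribution per language.
phi = [
    {"en": {"limb": 0.8, "sleeve": 0.2}, "zh": {"shou3bi4": 1.0}},     # body sense
    {"en": {"weapon": 0.7, "war": 0.3},  "zh": {"zhuang1bei4": 1.0}},  # military sense
]

beta = stick_breaking(gamma=1.0, truncation=1)  # weights over the two senses
window = []                                      # one bilingual context window
for lang in ("en", "zh"):
    s = sample_discrete(beta)                     # s_ij^l ~ G_i (simplified)
    words = list(phi[s][lang])
    probs = [phi[s][lang][w] for w in words]
    window.append(words[sample_discrete(probs)])  # c_ij^l ~ Mult(phi_s^l)
```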
Fig. 2. Illustration of the HDP model.
Fig. 3. Illustration of the CLHDP model.
Hyperparameters \gamma_w and \alpha_w are the concentration parameters of the DPs, controlling the variability of the distributions G_w and G_{v_i}.

Inference for the CLHDP model

Teh et al. use collapsed Gibbs sampling to infer the latent variables in HDP.40 Gibbs sampling initializes all hidden variables randomly. In each iteration, hidden variables are sequentially sampled from the distribution conditioned on all other variables. Three sampling schemes can be used in HDP: posterior sampling in the Chinese restaurant franchise, posterior sampling with an augmented representation, and posterior sampling by direct assignment. For CLHDP, we use the direct assignment scheme because it is easy to implement. There are three steps in the sampling scheme:

(1) Given s = \{s_ij^l\} and m = \{m_kj\} in the Chinese restaurant process, sample \{G_v\} and G_w, where m_kj represents the number of tables in restaurant k serving dish j. The process is similar to that described in Ref. 40.
The prior distribution G_w for each target word is a Dirichlet process with concentration parameter \gamma_w and base probability measure H_w. It can be expressed using a stick-breaking representation,
G_w = \sum_{s_w=1}^{\infty} \beta_{s_w} \delta_{(\phi_{s_w}^1, \ldots, \phi_{s_w}^L)}, \qquad (4)

where \phi_{s_w}^1, \ldots, \phi_{s_w}^L are generated from H_w^1, \ldots, H_w^L respectively and are given in this step, \delta_{(\phi_{s_w}^1, \ldots, \phi_{s_w}^L)} is a probability measure concentrated at (\phi_{s_w}^1, \ldots, \phi_{s_w}^L), and \{\beta_{s_w}\} are mixture weights over senses, sampled from a stick-breaking construction. In the sampling process, suppose that we have seen S_w senses for the target word w. The context word distributions \{\phi_{s_w}^l\} are generated and assigned to the context words in the corpus after some sampling iterations. G_w can then be expressed as
G_w = \sum_{s_w=1}^{S_w} \beta_{s_w} \delta_{(\phi_{s_w}^1, \ldots, \phi_{s_w}^L)} + \beta_u G_w^u, \qquad (5)
where G_w^u is distributed as a Dirichlet process DP(\gamma_w, H_w). Thus, G_w depends on \beta_w = \{\beta_{s_w}\}, and the sampling equation for \beta_w is as follows: