International Journal of Pattern Recognition and Artificial Intelligence Vol. 29, No. 2 (2015) 1559003 (26 pages) © World Scientific Publishing Company DOI: 10.1142/S021800141559003X
Document Representation with Statistical Word Senses in Cross-Lingual Document Clustering
Guoyu Tang* and Yunqing Xia† Department of Computer Science and Technology TNList, Tsinghua University Beijing 100084, P. R. China *[email protected] †[email protected]
Erik Cambria School of Computer Engineering Nanyang Technological University 50 Nanyang Avenue, Singapore 639798, Singapore [email protected]
Peng Jin School of Computer Science Leshan Normal University Leshan 614000, P. R. China [email protected]
Thomas Fang Zheng Department of Computer Science and Technology TNList, Tsinghua University Beijing 100084, P. R. China [email protected]
Received 23 May 2014 Accepted 16 October 2014 Published 18 February 2015
Cross-lingual document clustering is the task of automatically organizing a large collection of multi-lingual documents into a few clusters according to their content or topic. It is well known that the language barrier and translation ambiguity are two challenging issues for cross-lingual document representation. To this end, we propose to represent cross-lingual documents through statistical word senses, which are automatically discovered from a parallel corpus through a novel cross-lingual word sense induction model and a sense clustering method. On top of these senses, two document representation models are built: a sense-based vector space model and a sense-based latent
†Corresponding author.
Dirichlet allocation. Evaluation on benchmark datasets shows that the proposed models outperform two state-of-the-art methods for cross-lingual document clustering.
Keywords: Word sense; cross-lingual document representation; cross-lingual document clustering.
1. Introduction

Economic globalization and the internationalization of business urge organizations to handle an increasing number of documents written in different languages. As an important technology for cross-lingual information access, cross-lingual document clustering (CLDC) seeks to automatically organize a large collection of multi-lingual documents into a small number of clusters, each of which contains semantically similar cross-lingual documents.

Various document representation (DR) models have been proposed to deal with mono-lingual documents. The classic DR model is the vector space model (VSM),32 which typically uses words as the feature space. However, words are in fact not independent of each other. Two semantic word relations are worth mentioning: synonymy and polysemy. Synonymy indicates that different words can carry identical or similar meanings, and polysemy implies that a single word can have two or more senses. To address these issues, previous research attempted to represent documents through either explicit or latent semantic spaces.3,6,14,16,20,46

In the cross-lingual case, DR models present two main issues: the language barrier and translation ambiguity. As for the former, a term in one language and its counterparts in other languages should be viewed as a single feature in cross-lingual DR. In some earlier systems, dictionaries were used to map cross-lingual terms.12,25 However, such systems all suffered from the latter issue, which implies that one term can be translated into different terms in another language, especially when such terms entail common-sense knowledge.7 Two translation ambiguity scenarios are worth noting. In the first scenario, the term carries different meanings (namely, senses). For example, the word arm has two general meanings: (1) the part of the body from shoulder to hand, and (2) a thing that is used for fighting.
Accordingly, the word arm should be translated into (shou3 bi4, arm) in a context relating to the human body, but as (zhuang1 bei4, arm) in a military context. The second scenario applies when we have to select one of many possible translations to convey a specific meaning. For example, as a part of the human body, the word arm can also be translated into (ge1 bo2, arm), which is not quite the same as (shou3 bi4, arm). This is a common problem in natural language processing (NLP) research even for mono-lingual documents, e.g. when switching between different domains.43 In the context of CLDC, popular approaches explore word co-occurrence statistics within parallel/comparable corpora.18,23,35,45 Recent works improved clustering performance by aligning terms from different languages at the topic level.4,27,29,41 Nonetheless, cross-lingual topic alignment remains an open challenge.
In this work, we treat translation ambiguity of terms, e.g. (shou3 bi4, arm) and (zhuang1 bei4, arm), as polysemy, and translation choices, e.g. (shou3 bi4, arm) and (ge1 bo2, arm), as synonymy. As the synonymy and polysemy problems are closely related to word senses, we propose to represent documents with cross-lingual statistical senses. Unlike previous approaches, which extract word senses from dictionaries, we propose to induce word senses statistically from corpora. To deal with the cross-lingual case, a novel cross-lingual word sense induction (WSI) model, referred to as CLHDP, is proposed to learn senses for each word (referred to as local word senses) from parallel corpora. A sense clustering method is then adopted to discover global word senses from the semantic relatedness between senses of different words. On this basis, two cross-lingual DR models are proposed: a sense-based VSM and a sense-based latent Dirichlet allocation (LDA) model. Two advantages of the proposed models are worth noting. Firstly, synonymy can be naturally addressed when word senses are involved. Words in one language that carry the same meaning can be organized under one word sense; in the cross-lingual case, words across languages can likewise be organized under one cross-lingual word sense. With synonymy addressed, cross-lingual documents can be represented more accurately. As a result, more accurate cross-lingual document similarity can be obtained and, hence, CLDC improves. Secondly, polysemy can also be well addressed, as the translation ambiguity of polysemous words can be resolved within cross-lingual contexts. Consequently, cross-lingual document similarity can be calculated more accurately once cross-lingual word sense disambiguation is achieved.
By jointly addressing synonymy and polysemy, the proposed cross-lingual DR models work at a more semantic level and are thus able to outperform bag-of-words models.8 Compared to topic-level DR models, moreover, the proposed models are more fine-grained and, hence, more accurate. The structure of the paper is as follows: Section 2 introduces related work in the field of CLDC. Sections 3 and 4 illustrate the proposed models in detail. Section 5 presents evaluation and discussion. Section 6, finally, gives concluding remarks and future directions.
2. Related Work

2.1. DR models

This work is closely related to DR models. The traditional VSM assumes that terms are independent of each other and thus ignores any semantic relations between them. Previous works used concepts or word clusters10,30 as features, or used similarities of words,13,42 but they still failed to handle the polysemy problem. To address both the synonymy and polysemy issues, some DR models build on lexical ontologies such as WordNet or Wikipedia to represent documents in a concept space.14,16,17 However, lexical ontologies are difficult to construct and are
also hardly complete; moreover, they tend to over-represent rare word senses while missing corpus-specific senses. Representative extensions of the classic VSM are latent semantic analysis (LSA)20 and LDA.3 LSA decomposes the term-document matrix by applying singular value decomposition, so that each feature is a linear combination of all words. However, LSA cannot solve the polysemy problem. LDA has been successfully used for topic discovery3,21 but, according to Ref. 24, it may not perform well by itself in text mining tasks, especially those requiring fine-grained discrimination, e.g. document clustering. Most of these semantic models, moreover, are designed for mono-lingual document sets and cannot be used directly in cross-lingual scenarios.
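As a rough illustration of the LSA step just described, the following sketch applies a truncated SVD to a toy term-document count matrix; the vocabulary, counts, and number of latent dimensions are invented for illustration:

```python
import numpy as np

# Toy term-document count matrix (rows: terms, columns: documents).
# Documents 0-1 are "body" documents, documents 2-3 are "military" documents.
X = np.array([
    [2, 1, 0, 0],   # arm (body-sense contexts)
    [1, 2, 0, 0],   # limb
    [0, 0, 3, 1],   # weapon
    [0, 0, 1, 2],   # war
], dtype=float)

# LSA: truncated singular value decomposition of the term-document matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                        # latent dimensions kept
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T    # documents in the latent space

def cos(a, b):
    """Cosine similarity between two latent document vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

Documents that load on the same latent dimension (here, the two body-related documents) come out close in the reduced space even with little surface overlap; note, however, that a polysemous term still occupies a single row, which is why LSA alone cannot separate its senses.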
2.2. Cross-lingual document clustering

The main issue of CLDC is dealing with the cross-language barrier. The straightforward solution is document translation. In TDT3, four systems attempted to use machine translation systems.22 Results show that using a machine translation tool leads to around 50% performance loss compared with mono-lingual topic tracking, which is ascribed mainly to the poor accuracy of machine translation systems. Dictionaries and corpora are two popular sources of cross-language information. Some researchers use dictionaries to translate documents.12 Others use dictionaries to translate features or keywords. Mathieu et al. use bi-lingual dictionaries to translate named entities and keywords and modify the cosine similarity formula to calculate similarity between bi-lingual documents.25 Pouliquen et al. rely on a multi-lingual thesaurus called Eurovoc to create cross-lingual article vectors.31 However, it is hard to select the proper translation of ambiguous words in different contexts. To address this problem, some researchers leverage word co-occurrence frequencies from corpora.12,25 However, they still need a dictionary, and such human-defined lexical resources are difficult to construct and hardly complete. Wei et al. use LSA to construct a multi-lingual semantic space onto which words and documents in either language can be mapped, with dimensions reduced again according to the documents to be clustered.41 Yogatama and Tanaka-Ishii use a propagation algorithm to merge multi-lingual spaces from a comparable corpus and a spectral method to cluster documents.45 Li and Shawe-Taylor use kernel canonical correlation analysis, a method that finds the maximally correlated projections of documents in two languages, for cross-language Japanese-English patent retrieval and document classification.23 Unlike document classification, document clustering usually lacks training data.
Hence, semantic spaces are constructed from parallel/comparable corpora, and dimensions are selected on the basis of their importance in such corpora, which usually differ from the target multi-lingual documents. Mimno et al. introduce a poly-lingual topic model that discovers topics aligned across multiple languages.27 However, topics generated from a parallel corpus may not align well with the
topics discovered from the target documents. Tang et al. use cross-lingual word similarity, but ignore the translation ambiguity problem.39 In this work, we view the language barrier and translation ambiguity as synonymy and polysemy problems and propose to use statistical word senses to represent documents in different languages. Our proposed model can concurrently deal with the problems of synonymy and polysemy and, hence, outperforms state-of-the-art CLDC methods.
2.3. WSI and disambiguation

Many approaches have been proposed to address the word sense disambiguation (WSD) task.11,26,28 The use of word senses has been shown to enhance performance on many NLP tasks.38 However, it requires manually compiled large lexical resources such as WordNet. In many other cases, word senses are learned from corpora in an unsupervised manner, a task known as WSI. Many WSI algorithms have been proposed in the literature.9 The Bayesian model proposed in Ref. 5 uses an extended LDA model to induce word senses. It outperforms the state-of-the-art systems in the SemEval-2007 evaluation1 by using a hierarchical Dirichlet process (HDP)40 to induce word senses. Unlike LDA, which requires a specified number of topics, HDP is able to infer this number automatically. Apidianaki uses a bi-lingual corpus and takes translation equivalence clusters as word senses.2 This assumes that word instances with equivalent translations carry the same meaning, which is not always true, as instances of the same word with different meanings may be translated into the same word in another language. WSI algorithms have already been integrated into information retrieval.34,37 However, to the best of our knowledge, the above-mentioned works only consider senses of query words, while in document clustering the sense of every word in the documents should be identified. In this paper, we propose to induce cross-lingual word senses from a parallel corpus by means of a novel Bayesian sense induction model, termed CLHDP, which is also exploited for WSD.
3. CLDC System

3.1. An overview

Figure 1 presents the workflow of our sense-based CLDC system. Firstly, senses of individual words (referred to as local word senses) in each language are induced from the parallel corpus by means of a cross-lingual WSI (CL-WSI) algorithm. As a result, we obtain a set of local word senses, each of which is represented by a distribution over cross-language words. Secondly, after grouping the cross-language local word senses into one set, a clustering algorithm is used to partition this set and, hence, to obtain a few word sense subsets, each of which contains semantically similar
[Figure 1: workflow diagram. Parallel corpus → WSI → local word senses → word sense clustering → global word senses; cross-lingual documents → WSD → bag of cross-lingual senses → sense-based DR → cross-lingual document models → document clustering → cross-lingual document clusters.]
Fig. 1. Workflow of the CLDC system.

cross-language word senses. By using one sense from each subset to represent the different subsets, we obtain a few cross-lingual global word senses. Thirdly, cross-lingual documents are represented through these cross-lingual global word senses. Finally, the clustering algorithm is executed on the cross-lingual documents.
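The four steps above can be sketched as a pipeline skeleton; every function name below is a hypothetical stand-in for the corresponding module, not an interface from the actual system:

```python
# Hypothetical skeleton of the sense-based CLDC workflow in Fig. 1.
# Every callable passed in stands for one module of the system.

def cldc_pipeline(parallel_corpus, documents, induce_senses, cluster_senses,
                  disambiguate, represent, cluster_documents):
    # Step 1: CL-WSI induces local word senses from the parallel corpus.
    local_senses = induce_senses(parallel_corpus)
    # Step 2: clustering groups synonymous local senses into global senses.
    global_senses = cluster_senses(local_senses)
    # Step 3: WSD turns each document into a bag of senses, which the
    # sense-based DR model maps onto the global senses.
    doc_models = [represent(disambiguate(doc, local_senses), global_senses)
                  for doc in documents]
    # Step 4: the clustering algorithm runs on the sense-based models.
    return cluster_documents(doc_models)
```

Any concrete WSI, WSD, representation, or clustering component can be slotted into these positions; the paper's choices (CLHDP, the sense clustering method, sense-based VSM/LDA) fill them in the actual system.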
3.2. Summary of novelty

Two novel points in the CLDC system are worth noting.
(1) We propose a cross-lingual WSI algorithm by adapting mono-lingual HDP40 to the cross-lingual scenario (CLHDP), and use a clustering method to discover semantic relatedness between senses of different words.
(2) Cross-lingual DR models are proposed to represent cross-lingual documents with the cross-lingual word senses, which are learnt by means of the CLHDP algorithm on a parallel corpus.
In the next sections, we show how and why the proposed models improve on existing DR models and explain the modules of the system in detail.
4. Theory and Algorithms

4.1. Definitions
Definition: Local word sense
A local word sense s_w of word w is statistically represented by a set of discrete distributions over context words, one for a specific language l, i.e.:
s_w = \{c_i^l : p(c_i^l \mid s_w)\}, \quad i = 1, \ldots, N, \qquad (1)

where s_w denotes a local sense of word w, c_i^l is a context word in language l, and p(c_i^l \mid s_w) is the probability of c_i^l under s_w.
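A minimal sketch of Definition (1) in code, assuming raw co-occurrence counts have already been collected for one induced sense; the counts below are invented for illustration:

```python
# Sketch of Definition (1): a local word sense s_w represented as a discrete
# distribution p(c_i | s_w) over context words, normalized from raw
# co-occurrence counts (the counts below are invented).

def make_local_sense(context_counts):
    """Normalize context-word counts into a probability distribution."""
    total = sum(context_counts.values())
    return {c: n / total for c, n in context_counts.items()}

# A hypothetical "body" sense of the word arm.
arm_sense_1 = make_local_sense({"limb": 16, "forelimb": 7, "sleeve": 2})
```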
To obtain word senses, previous work relied on thesauri, which are time- and resource-consuming to construct. In this work, instead, we use context words, together with their probabilities, to reflect word senses. Using the word arm again as an example, the following two local word senses can be learnt from the corpus.
. arm#1 = {limb: 0.159, forelimb: 0.069, sleeve: 0.019}
. arm#2 = {weapon: 0.116, war: 0.039, battle: 0.026}
The example indicates that a local sense of the word arm involves specific context words and their probability values, which are estimated from the corpus through a WSI algorithm. Obviously, local word senses can address the polysemy issue.

Definition: Cross-lingual local word sense

In the cross-lingual scenario, we extend the local word sense definition so that it involves multi-lingual context words, which are extracted from a parallel corpus, i.e.:

s_w = \begin{bmatrix} \{c_i^{l_1} : p(c_i^{l_1} \mid s_w)\}, & i = 1, \ldots, N_{l_1} \\ \cdots \\ \{c_i^{l_L} : p(c_i^{l_L} \mid s_w)\}, & i = 1, \ldots, N_{l_L} \end{bmatrix}, \qquad (2)
where c_i^{l_k} is a context word in language l_k, and p(c_i^{l_k} \mid s_w) is the probability of c_i^{l_k} under s_w within texts in language l_k. For the word arm in the English-Chinese scenario, for example, the following two cross-lingual local word senses are illustrative.
. arm#1 = {limb: 0.159, forelimb: 0.069, sleeve: 0.019; : 0.137, : 0.079, : 0.017}
. arm#2 = {weapon: 0.116, war: 0.039, battle: 0.026; : 0.153, : 0.027, : 0.026}
With an English-Chinese parallel corpus, cross-lingual local word senses can be obtained through the CL-WSI algorithm. As seen in the above example, cross-lingual local word senses can address the polysemy issue in cross-lingual scenarios. However, local word senses are induced for every word separately, so it is very common that a large number of synonymous word senses exist. Hence, we further propose to learn global word senses, which represent universally exclusive word senses.

Definition: Cross-lingual global word sense

A global word sense g is a virtual word sense generalized from a group of synonymous local word senses, formalized as follows.
g = \{s_w^j\}, \quad j = 1, \ldots, M, \qquad (3)

where s_w^j represents a local word sense. When the local word senses are induced from a cross-lingual scenario, the global word sense becomes cross-lingual naturally. In our CLDC system, the global word senses are discovered through a clustering algorithm
that uses context words as features in calculating the semantic similarity between local word senses. Again, we use the word arm as an example to illustrate the global word sense:
. g#1 = {arm#1, #1} = { {limb: 0.159, forelimb: 0.069, sleeve: 0.019; : 0.137, : 0.079, : 0.017}, {arm: 0.189, forelimb: 0.058, sleeve: 0.025; : 0.159, : 0.089, : 0.014} }
. g#2 = {arm#2, weapon#1, #1} = { {weapon: 0.116, war: 0.039, battle: 0.026; : 0.153, : 0.027, : 0.026}, {arm: 0.12, battle: 0.04, war: 0.016; : 0.133, : 0.035, : 0.028}, {arm: 0.14, weapon: 0.12, war: 0.016; : 0.133, : 0.035, : 0.028} }
As shown in the above examples, the senses arm#1 and #1 are organized under the global word sense g#1 because their context distributions are similar. In this way, synonymous word senses in both languages can be organized under one global word sense, and synonymy is thus successfully addressed. In the following sections, we present how the cross-lingual word senses are learned from the parallel corpus.
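The grouping of synonymous local senses into global senses can be sketched with a simple greedy threshold clustering over context-word distributions. This stands in for the paper's actual clustering method; the French sense bras#1, the threshold, and all probabilities are invented for illustration:

```python
import math

# Sketch of global-sense discovery: local senses whose context-word
# distributions are similar are merged into one global sense. A greedy
# threshold clustering stands in for the paper's clustering step.

def cosine(p, q):
    """Cosine similarity between two sparse probability distributions."""
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in keys)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

def group_senses(local_senses, threshold=0.5):
    """Greedily assign each local sense to the first similar-enough group."""
    groups = []  # each group is a list of (name, distribution) pairs
    for name, dist in local_senses.items():
        for g in groups:
            if cosine(dist, g[0][1]) >= threshold:
                g.append((name, dist))
                break
        else:
            groups.append([(name, dist)])
    return groups

local = {
    "arm#1":  {"limb": 0.6, "forelimb": 0.3, "sleeve": 0.1},
    "bras#1": {"limb": 0.5, "forelimb": 0.4, "sleeve": 0.1},  # invented French sense
    "arm#2":  {"weapon": 0.6, "war": 0.2, "battle": 0.2},
}
groups = group_senses(local)
```

Here the body senses of arm and bras fall into one group (one global sense) because their context distributions nearly coincide, while the military sense of arm starts its own group.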
4.2. Learning the cross-lingual word senses

Two steps are required in learning the cross-lingual word senses:
(1) The local word senses are first induced from a parallel corpus;
(2) The global word senses are generalized from the local word senses.
4.2.1. Local WSI

A Bayesian model is adopted to achieve the task of local word sense learning. To be more specific, we extend HDP40 to the cross-lingual scenario, referred to as CLHDP. The theory of HDP is briefly introduced first.

HDP for WSI

HDP was proposed for text modeling. Yao and Van Durme employ HDP for WSI.44 HDP is performed on each word separately, which means each word has its own HDP model. In this paper, we call the word on which the WSI algorithm is performed the target word, and the words in the context of a target word its context words. HDP is a generative model, which can randomly generate observed data. For each context v_i of the target word w, the sense s_ij for each word c_ij in v_i has a nonparametric prior G_i, which is sampled from a base distribution G_w. H_w is a Dirichlet distribution with hyperparameter \beta_w. The context word distribution \phi_{s_w} given a sense s_w is generated from H_w: \phi_{s_w} \sim H_w. The generative process of a target
word w is given as follows:
(1) Choose G_w \sim DP(\gamma_w, H_w).
(2) For each context window v_i of word w:
    (a) choose G_i \sim DP(\alpha_w, G_w);
    (b) for each context word c_ij of target word w:
        (i) choose s_ij \sim G_i;
        (ii) choose c_ij \sim Mult(\phi_{s_ij}).
Hyperparameters \gamma_w and \alpha_w are the concentration parameters of the DPs, controlling the variability of the distributions G_w and G_i, respectively. HDP is illustrated in Fig. 2, where the shaded circle represents the observed variable, the context word c_ij. HDP can be constructed via the stick-breaking process and the Chinese restaurant process.40

CLHDP model

CLHDP models word senses through cross-lingual context tuples. Each tuple is a set of contexts that are equivalent to each other but written in different languages. Two assumptions are made in CLHDP. Firstly, contexts in a tuple share the same tuple-specific distribution over senses. Secondly, each sense consists of a set of discrete distributions over context words, one for each language l = 1, \ldots, L. In other words, rather than using one \phi_s for each sense s, as in HDP, there are L language-specific sense-context word distributions \phi_s^1, \ldots, \phi_s^L, each of which is drawn from a language-specific symmetric Dirichlet H_w^l with concentration parameter \beta_w^l. CLHDP is illustrated in Fig. 3. As shown in Fig. 3, the generative process of a target word w is given as follows:
(1) Choose G_w \sim DP(\gamma_w, H).
(2) For each context window v_i of w:
    (a) choose G_i \sim DP(\alpha_w, G_w);
    (b) for each context word c_ij^l in language l of target word w:
        (i) choose s_ij^l \sim G_i;
        (ii) choose c_ij^l \sim Mult(\phi_{s_ij}^l).
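To make the generative story concrete, the following toy forward simulation draws sense weights by a truncated stick-breaking construction and then emits one bilingual context window. It illustrates the sampling steps only, not the inference procedure; the sense distributions are invented, the Chinese context words are shown in pinyin, and the per-word sense draw is simplified:

```python
import random

random.seed(0)

def stick_breaking(gamma, truncation):
    """Draw truncated mixture weights beta from a stick-breaking construction."""
    betas, remaining = [], 1.0
    for _ in range(truncation):
        b = random.betavariate(1.0, gamma)
        betas.append(remaining * b)
        remaining *= 1.0 - b
    betas.append(remaining)  # leftover mass for unseen senses
    return betas

def sample_discrete(weights):
    """Sample an index proportionally to the given weights."""
    r, acc = random.random() * sum(weights), 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1

# Two toy senses of "arm"; each holds one context-word distribution per language.
phi = [
    {"en": {"limb": 0.8, "sleeve": 0.2}, "zh": {"shou3bi4": 1.0}},     # body sense
    {"en": {"weapon": 0.7, "war": 0.3},  "zh": {"zhuang1bei4": 1.0}},  # military sense
]

beta = stick_breaking(gamma=1.0, truncation=1)  # weights over the two senses
window = []                                      # one bilingual context window
for lang in ("en", "zh"):
    s = sample_discrete(beta)                     # s_ij^l ~ G_i (simplified)
    words = list(phi[s][lang])
    probs = [phi[s][lang][w] for w in words]
    window.append(words[sample_discrete(probs)])  # c_ij^l ~ Mult(phi_s^l)
```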
Fig. 2. Illustration of the HDP model.
Fig. 3. Illustration of the CLHDP model.
Hyperparameters \gamma_w and \alpha_w are the concentration parameters of the DPs, controlling the variability of the distributions G_w and G_{v_i}.

Inference for the CLHDP model

Teh et al. use collapsed Gibbs sampling to infer the latent variables in HDP.40 Gibbs sampling initializes all hidden variables randomly. In each iteration, hidden variables are sequentially sampled from the distribution conditioned on all other variables. Three sampling schemes can be used in HDP: posterior sampling in the Chinese restaurant franchise, posterior sampling with an augmented representation, and posterior sampling by direct assignment. For CLHDP, we use the direct assignment scheme because it is easy to implement. There are three steps in the sampling scheme:

(1) Given s = \{s_ij^l\} and m = \{m_kj\} in the Chinese restaurant process, sample \{G_v\} and G_w, where m_kj represents the number of tables in restaurant k serving dish j. The process is similar to that described in Ref. 40.
The prior distribution G_w for each target word is a Dirichlet process with concentration parameter \gamma_w and base probability measure H_w. It can be expressed using a stick-breaking representation,
G_w = \sum_{s_w=1}^{\infty} \beta_{s_w} \delta_{(\phi_{s_w}^1, \ldots, \phi_{s_w}^L)}, \qquad (4)

where \phi_{s_w}^1, \ldots, \phi_{s_w}^L are generated from H_w^1, \ldots, H_w^L respectively and are given in this step, \delta_{(\phi_{s_w}^1, \ldots, \phi_{s_w}^L)} is a probability measure concentrated at (\phi_{s_w}^1, \ldots, \phi_{s_w}^L), and \{\beta_{s_w}\} are mixture weights over senses, sampled from a stick-breaking construction. In the sampling process, suppose that we have seen S_w senses for the target word w. The context word distributions \{\phi_{s_w}^l\} are generated and assigned to the context words in the corpus after some sampling iterations. G_w can then be expressed as
G_w = \sum_{s_w=1}^{S_w} \beta_{s_w} \delta_{(\phi_{s_w}^1, \ldots, \phi_{s_w}^L)} + \beta_u G_w^u, \qquad (5)
where G_w^u is distributed as a Dirichlet process DP(\gamma_w, H_w). Thus, G_w depends on \beta_w = \{\beta_{s_w}\}, and the sampling equation for \beta_w is as follows: