Semantic-Based Multilingual Document Clustering Via Tensor Modeling
Salvatore Romeo, Andrea Tagarelli
DIMES, University of Calabria
Arcavacata di Rende, Italy

Dino Ienco
IRSTEA, UMR TETIS, Montpellier, France
LIRMM, Montpellier, France

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 600–609, October 25-29, 2014, Doha, Qatar. © 2014 Association for Computational Linguistics

Abstract

A major challenge in document clustering research arises from the growing amount of text data written in different languages. Previous approaches depend on language-specific solutions (e.g., bilingual dictionaries, sequential machine translation) to evaluate document similarities, and the required transformations may alter the original document semantics. To cope with this issue, we propose a new document clustering approach for multilingual corpora that (i) exploits a large-scale multilingual knowledge base, (ii) takes advantage of the multi-topic nature of the text documents, and (iii) employs a tensor-based model to deal with high dimensionality and sparseness. Results have shown the significance of our approach and its better performance w.r.t. classic document clustering approaches, in both a balanced and an unbalanced corpus evaluation.

1 Introduction

Document clustering research was initially focused on the development of general purpose strategies to group unstructured text data. Recent studies have started developing new methodologies and algorithms that take into account both linguistic and topical characteristics, where the former include the size of the text and the type of language, and the latter focus on the communicative function and targets of the documents.

A major challenge in document clustering research arises from the growing amount of text data that are written in different languages, also due to the increased popularity of a number of tools for collaboratively editing through contributors across the world. Multilingual document clustering (MDC) aims to detect clusters in a collection of texts written in different languages. This can aid a variety of applications in cross-lingual information retrieval, including statistical machine translation and corpora alignment.

Existing approaches to MDC can be divided into two broad categories, depending on whether a parallel corpus or a comparable corpus is used (Kumar et al., 2011c). A parallel corpus is typically comprised of documents with their related translations (Kim et al., 2010). These translations are usually obtained through machine translation techniques based on a selected anchor language. Conversely, a comparable corpus is a collection of multilingual documents written over the same set of classes (Ni et al., 2011; Yogatama and Tanaka-Ishii, 2009) without any restriction about translation or perfect correspondence between documents. To mine this kind of corpus, external knowledge is employed to map concepts or terms from one language to another (Kumar et al., 2011c; Kumar et al., 2011a), which enables the extraction of cross-lingual document correlations. In this case, a major issue lies in the definition of a cross-lingual similarity measure that can fit the extracted cross-lingual correlations. Also, from a semi-supervised perspective, other works attempt to define must-link constraints to detect cross-lingual clusters (Yogatama and Tanaka-Ishii, 2009). This implies that, for each different dataset, the set of constraints needs to be redefined; in general, the final results can be negatively affected by the quantity and the quality of involved constraints (Davidson et al., 2006).

To the best of our knowledge, existing clustering approaches for comparable corpora are customized for a small set (two or three) of languages (Montalvo et al., 2007). Most of them are not generalizable to many languages as they employ bilingual dictionaries and the translation is performed sequentially considering only pairs of languages. Therefore, the order in which this process is done can seriously impact the results. Another common drawback concerns the way most of the recent approaches perform their analysis: the various languages are analyzed independently of each other (possibly by exploiting external knowledge like Wikipedia to enrich documents (Kumar et al., 2011c; Kumar et al., 2011a)), and then the language-specific results are merged. This two-step analysis, however, may fail to profitably exploit cross-language information from the multilingual corpus.

Contributions. We address the problem of MDC by proposing a framework that features three key elements, namely: (1) to model documents over a unified conceptual space, with the support of a large-scale multilingual knowledge base; (2) to decompose the multilingual documents into topically-cohesive segments; and (3) to describe the multilingual corpus under a multi-dimensional data structure.

The first key element prevents loss of information due to the translation of documents from different languages to a target one. It enables a conceptual representation of the documents in a language-independent way preserving the content semantics. BabelNet (Navigli and Ponzetto, 2012a) is used as the multilingual knowledge base. To the best of our knowledge, this is the first work in MDC that exploits BabelNet.

The second key element, document segmentation, enables us to simplify the representation of documents according to their multi-topic nature. Previous research has demonstrated that a segment-based approach can significantly improve document clustering performance (Tagarelli and Karypis, 2013). Moreover, the conceptual representation of the document segments enables the grouping of linguistically different (portions of) documents into topically coherent clusters.

The latter aspect is leveraged by the third key element of our proposal, which relies on a tensor-based model (Kolda and Bader, 2009) to effectively handle the high dimensionality and sparseness in text. Tensors are considered as a multi-linear generalization of matrix factorizations, since all dimensions or modes are retained thanks to multi-linear structures which can produce meaningful components. The applicability of tensor analysis has recently attracted growing attention in information retrieval and data mining, including document clustering (e.g., (Liu et al., 2011; Romeo et al., 2013)) and cross-lingual information retrieval (e.g., (Chew et al., 2007)).

The rest of the paper is organized as follows. Section 2 provides an overview of BabelNet and basic notions on tensors. We describe our proposal in Section 3. Data and experimental settings are described in Section 4, while results are presented in Section 5. We summarize our main findings in Section 6; finally, Section 7 concludes the paper.

2 Background

2.1 BabelNet

BabelNet (Navigli and Ponzetto, 2012a) is a multilingual semantic network obtained by linking Wikipedia with WordNet, that is, the largest multilingual Web encyclopedia and the most popular computational lexicon. The linking of the two knowledge bases was performed through an automatic mapping of WordNet synsets and Wikipages, harvesting multilingual lexicalizations of the available concepts through human-generated translations provided by the Wikipedia inter-language links or through machine translation techniques.

[...] relation type (is-a, part-of, etc.), while each node corresponds to a BabelNet synset, i.e., a set of lexicalizations of a concept in different languages.

BabelNet can be accessed and easily integrated into applications by means of a Java API provided by the toolkit described in (Navigli and Ponzetto, 2012b). The toolkit also provides functionalities for graph-based word sense disambiguation (WSD) in a multilingual context. Given an input set of words, a semantic graph is built by looking for related synset paths and by merging them all into a single graph. Once the semantic graph is built, the graph nodes can be scored with a variety of algorithms. Finally, this graph with scored nodes is used to rank the input word senses by a graph-based approach.

2.2 Tensor model representation

A tensor is a multi-dimensional array $\mathcal{T} \in \Re^{I_1 \times I_2 \times \cdots \times I_M}$. The number of dimensions $M$, also known as ways or modes, is called the order of the tensor, so that a tensor of order $M$ is also called an $M$-way or $M$-order tensor. A higher-order tensor (i.e., a tensor with order three or higher) is denoted by boldface calligraphic letters, e.g., $\mathcal{T}$; a matrix (2-way tensor) is denoted by boldface capital letters, e.g., $\mathbf{U}$; a vector (1-way tensor) is denoted by boldface lowercase letters, e.g., $\mathbf{v}$. The generic entry $(i_1, i_2, i_3)$ of a third-order tensor $\mathcal{T}$ is denoted by $t_{i_1 i_2 i_3}$, with $i_1 \in [1..I_1]$, $i_2 \in [1..I_2]$, $i_3 \in [1..I_3]$.

A one-dimensional fragment of a tensor, defined by varying one index and keeping the others fixed, is a 1-way tensor called a fiber. A third-order tensor has column, row and tube fibers. Analogously, a two-dimensional fragment of a tensor, defined by varying two indices and keeping the rest fixed, is a 2-way tensor called a slice. A third-order tensor has horizontal, lateral and frontal slices.

The mode-$m$ matricization of a tensor $\mathcal{T}$, denoted by $\mathbf{T}_{(m)}$, is obtained by arranging the mode-$m$ fibers as columns of a matrix. A third-order tensor $\mathcal{T} \in \Re^{I_1 \times I_2 \times I_3}$ is all-orthogonal if

$\sum_{i_1 i_2} t_{i_1 i_2 \alpha}\, t_{i_1 i_2 \beta} = \sum_{i_1 i_3} t_{i_1 \alpha i_3}\, t_{i_1 \beta i_3} = \sum_{i_2 i_3} t_{\alpha i_2 i_3}\, t_{\beta i_2 i_3} = 0$ whenever $\alpha \neq \beta$.

The mode-$m$ product of a tensor $\mathcal{T} \in \Re^{I_1 \times I_2 \times \cdots \times I_M}$ with a matrix $\mathbf{U} \in \Re^{J \times I_m}$, denoted by $\mathcal{T} \times_m \mathbf{U}$, is a tensor of dimension $I_1 \times \cdots \times I_{m-1} \times J \times I_{m+1} \times \cdots \times I_M$ and can be expressed in terms of matrix product as $\mathcal{Y} = \mathcal{T} \times_m \mathbf{U}$, whose mode-$m$ matricization is $\mathbf{Y}_{(m)} = \mathbf{U}\mathbf{T}_{(m)}$.

3 Our Proposal

3.1 Multilingual Document Clustering
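The tensor notions of Section 2.2 (fibers, slices, mode-m matricization and the mode-m product) can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the implementation used in the paper: the helper names `unfold`, `fold` and `mode_product` are hypothetical, modes are indexed from 0 rather than 1, and the column ordering of the matricization follows NumPy's row-major reshape, which may differ from the ordering adopted in the tensor literature. The identity Y_(m) = U T_(m) holds for any ordering that is used consistently.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-m matricization: the mode-m fibers become columns of a matrix."""
    # Bring the chosen mode to the front, then flatten all remaining modes.
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def fold(matrix, mode, shape):
    """Inverse of unfold: rebuild a tensor of the given shape."""
    front_shape = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(matrix.reshape(front_shape), 0, mode)

def mode_product(tensor, matrix, mode):
    """Mode-m product T x_m U, computed through Y_(m) = U T_(m)."""
    new_shape = list(tensor.shape)
    new_shape[mode] = matrix.shape[0]  # I_m is replaced by J
    return fold(matrix @ unfold(tensor, mode), mode, new_shape)

rng = np.random.default_rng(0)
T = rng.standard_normal((3, 4, 5))  # third-order tensor: I1=3, I2=4, I3=5
U = rng.standard_normal((2, 4))     # J x I2 matrix (J=2), for the second mode

# Fibers: vary one index, fix the others.
column_fiber = T[:, 0, 0]
row_fiber = T[0, :, 0]
tube_fiber = T[0, 0, :]

# Slices: vary two indices, fix the remaining one.
horizontal_slice = T[0, :, :]
lateral_slice = T[:, 0, :]
frontal_slice = T[:, :, 0]

# Mode product on the second mode (index 1 with 0-based indexing).
Y = mode_product(T, U, mode=1)
print(Y.shape)
```

The second dimension of the result changes from I2 to J while the other modes are untouched, which matches the dimension stated in Section 2.2. The direct summation `np.einsum('jb,abc->ajc', U, T)` can serve as an independent check of the matricized computation.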