A Knowledge-based Representation for Cross-Language Document Retrieval and Categorization

Marc Franco-Salvador 1,2, Paolo Rosso 2 and Roberto Navigli 1
1 Department of Computer Science, Sapienza Università di Roma, Italy
{francosalvador,navigli}@di.uniroma1.it
2 Natural Language Engineering Lab - PRHLT Research Center, Universitat Politècnica de València, Spain
{mfranco,prosso}@dsic.upv.es

Abstract

Current approaches to cross-language document retrieval and categorization are based on discriminative methods which represent documents in a low-dimensional vector space. In this paper we propose a shift from the supervised to the knowledge-based paradigm and provide a document similarity measure which draws on BabelNet, a large multilingual knowledge resource. Our experiments show state-of-the-art results in cross-lingual document retrieval and categorization.

1 Introduction

The huge amount of text that is available online is becoming increasingly multilingual, providing an additional wealth of useful information. Most of this information, however, is not easily accessible to the majority of users because of language barriers which hamper the cross-lingual search and retrieval of knowledge.

Today's search engines would benefit greatly from effective techniques for the cross-lingual retrieval of valuable information that can satisfy a user's needs by not only providing (Landauer and Littman, 1994) and translating (Munteanu and Marcu, 2005) relevant results into different languages, but also by reranking the results in a language of interest on the basis of the importance of search results in other languages.

Vector-based models are typically used in the literature for representing documents both in monolingual and cross-lingual settings (Manning et al., 2008). However, because of the large size of the vocabulary, having each term as a component of the vector makes the document representation very sparse. To address this issue, several approaches to dimensionality reduction have been proposed, such as Principal Component Analysis (Jolliffe, 1986), Latent Semantic Indexing (Hull, 1994), Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and variants thereof, which project these vectors into a lower-dimensional vector space. In order to enable multilinguality, the vectors of comparable documents written in different languages are concatenated, making up the document matrix which is then reduced using linear projection (Platt et al., 2010; Yih et al., 2011). However, to do so, comparable documents are needed as training data. Additionally, the lower-dimensional representations are not easy to interpret.

The availability of wide-coverage lexical knowledge resources extracted automatically from Wikipedia, such as DBpedia (Bizer et al., 2009), YAGO (Hoffart et al., 2013) and BabelNet (Navigli and Ponzetto, 2012a), has considerably boosted research in several areas, especially where multilinguality is a concern (Hovy et al., 2013). Among these are cross-language plagiarism detection (Potthast et al., 2011; Franco-Salvador et al., 2013), multilingual semantic relatedness (Navigli and Ponzetto, 2012b; Nastase and Strube, 2013) and semantic alignment (Navigli and Ponzetto, 2012a; Matuschek and Gurevych, 2013). One main advantage of knowledge-based methods is that they provide a human-readable, semantically interconnected representation of the textual item at hand (be it a sentence or a document).

Following this trend, in this paper we provide a knowledge-based representation of documents which goes beyond the lexical surface of the text, while at the same time avoiding the need for training in a cross-language setting. To achieve this we leverage a multilingual semantic network, i.e., BabelNet, to obtain language-independent representations, which contain concepts together with the semantic relations between them, and also include semantic knowledge which is only implied by the input text.
The integration of our multilingual graph model with a vector representation enables us to obtain state-of-the-art results in comparable document retrieval and cross-language text categorization.

2 Related Work

The mainstream representation of documents for monolingual and cross-lingual document retrieval is vector-based. A document vector, whose components quantify the relevance of each term in the document, is usually highly dimensional, because of the variety of terms used in a document collection. As a consequence, the resulting document matrices are very sparse. To address the data sparsity issue, several approaches to the reduction of the dimensionality of document vectors have been proposed in the literature. A popular class of methods is based on linear projection, which provides a low-dimensional mapping from a high-dimensional vector space. A historical approach to linear projection is Principal Component Analysis (PCA) (Jolliffe, 1986), which performs a singular value decomposition (SVD) on a document matrix D of size n × m, where each row in D is the term vector representation of a document. PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components, which make up the low-dimensional vector. Latent Semantic Analysis (LSA) (Deerwester et al., 1990) is very similar to PCA but performs the SVD using the correlation matrix instead of the covariance matrix, which implies a lower computational cost. LSA preserves the amount of variance in an eigenvector v by maximizing its Rayleigh ratio (v^T C v) / (v^T v), where C = D^T D is the correlation matrix of D.

A generalization of PCA, called Oriented Principal Component Analysis (OPCA) (Diamantaras and Kung, 1996), is based on a noise covariance matrix to project the similar components of D closely. Other projection models such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) are based on the extraction of generative models from documents. Another approach, named Explicit Semantic Analysis (ESA) (Gabrilovich and Markovitch, 2007), represents each document by its similarities to a document collection. Using a document collection with low domain specificity, such as Wikipedia, the model has proven to obtain competitive results.

Not only have these methods proven to be successful in a monolingual scenario (Deerwester et al., 1990; Hull, 1994), but they have also been adapted to perform well in tasks at a cross-language level (Potthast et al., 2008; Platt et al., 2010; Yih et al., 2011). Cross-language Latent Semantic Indexing (CL-LSI) (Dumais et al., 1997) was the first linear projection approach used in cross-lingual tasks. CL-LSI provides a cross-lingual representation for documents by reducing the dimensionality of a matrix D whose rows are obtained by concatenating comparable documents from different languages. Similarly, PCA and OPCA can be adapted to a multilingual setting. LDA was also adapted to a multilingual scenario with models such as Polylingual Topic Models (Mimno et al., 2009), Joint Probabilistic LSA and Coupled Probabilistic LSA (Platt et al., 2010), which, however, are constrained to using word counts, instead of better weighting strategies, such as log(tf)-idf, known to perform better with large vocabularies (Salton and McGill, 1986). Another variant, named Canonical Correlation Analysis (CCA) (Thompson, 2005), uses a cross-covariance matrix of the low-dimensional vectors to find the projections. Cross-language Explicit Semantic Analysis (CL-ESA) (Potthast et al., 2008; Cimiano et al., 2009; Potthast et al., 2011), instead, adapts ESA to the cross-language level by exploiting the comparable documents across languages from Wikipedia. CL-ESA represents each document written in a language L by its similarities with a document collection in language L. Using a multilingual document collection with comparable documents across languages, the resulting vectors from different languages can be compared directly.
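To make the linear projection family concrete, the following Python sketch mimics a CL-LSI-style setup under our own simplifying assumptions: each row of the matrix D concatenates the term vectors of two comparable documents over a joint vocabulary, a truncated SVD supplies the projection, and projected vectors are compared with cosine similarity. The function names are illustrative, not code from any of the cited systems.

```python
import numpy as np

# Minimal CL-LSI-style sketch (cf. Dumais et al., 1997). Assumption:
# row i of D concatenates the term vectors of two comparable documents,
# one per language, over a joint vocabulary of size m.

def train_cl_lsi(D, k):
    """Learn a rank-k projection from the concatenated document matrix D."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)  # D ~ U diag(s) Vt
    return Vt[:k]                                     # k x m projection

def project(doc_vec, Vt_k):
    """Map a document vector (zero-padded to length m) into the latent space."""
    return Vt_k @ doc_vec

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Usage: pad an L-language vector with zeros over the L' vocabulary block
# (and vice versa), project both sides, then rank candidates by cosine.
```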
An alternative unsupervised approach, Cross-language Character n-Grams (CL-CNG) (Mcnamee and Mayfield, 2004), does not draw upon linear projections and represents documents as vectors of character n-grams. It has proven to obtain good results in cross-language document retrieval (Potthast et al., 2011) between languages with lexical and syntactic similarities.
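For contrast with the projection methods, here is a hedged sketch of a CL-CNG-style comparison: documents are reduced to bags of character 3-grams and compared with the cosine measure. The helper names are ours, not from the cited work.

```python
from collections import Counter
import math

# CL-CNG-style sketch: bag-of-character-n-grams document representation.

def char_ngrams(text, n=3):
    text = " ".join(text.lower().split())  # light normalization
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(c1, c2):
    dot = sum(c1[g] * c2[g] for g in c1.keys() & c2.keys())
    norm = math.sqrt(sum(v * v for v in c1.values())) \
         * math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

# Lexically related languages share many n-grams ("pla", "det", ...),
# which is exactly the signal the method relies on:
print(cosine(char_ngrams("plagiarism detection"),
             char_ngrams("detección de plagio")))
```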
Recently, a novel supervised linear projection model based on Siamese Neural Networks (S2Net) (Yih et al., 2011) achieved state-of-the-art performance in comparable document retrieval. S2Net performs a linear combination of the terms of a document vector d to obtain a reduced vector r, which is the output layer of a neural network. Each element in r has a weight which is a linear combination of the original weights of d, and captures relationships between the original terms.

However, linear projection approaches need a high number of training documents to achieve state-of-the-art performance (Platt et al., 2010; Yih et al., 2011). Moreover, although they are good at identifying a few principal components, the representations produced are opaque, in that they cannot explicitly model the semantic content of documents with a human-interpretable representation, thereby making data analysis difficult. In this paper, instead, we propose a language-independent knowledge graph representation for documents which is obtained from a large multilingual semantic network, without using any training information. Our knowledge graph representation explicitly models the semantics of the document in terms of the concepts and relations evoked by its co-occurring terms.

3 A Knowledge-based Document Representation

We propose a knowledge-based document representation aimed at expanding the terms in a document's bag of words by means of a knowledge graph which provides concepts and semantic relations between them. Key to our approach is the use of a graph representation which does not depend on any given language, but, indeed, is multilingual. To build knowledge graphs of this kind we utilize BabelNet, a multilingual semantic network that we present in Section 3.1. Then, in Section 3.2, we describe the five steps needed to obtain our graph-based multilingual representation of documents. Finally, we introduce our knowledge graph similarity measure in Section 3.3.

3.1 BabelNet

BabelNet (Navigli and Ponzetto, 2012a) is a multilingual semantic network whose concepts and relations are obtained from the largest available semantic lexicon of English, WordNet (Fellbaum, 1998), and the largest wide-coverage collaboratively-edited encyclopedia, Wikipedia, by means of an automatic mapping algorithm. BabelNet is therefore a multilingual "encyclopedic dictionary" that combines lexicographic information with wide-coverage encyclopedic knowledge. Concepts in BabelNet are represented similarly to WordNet, i.e., by grouping sets of synonyms in the different languages into multilingual synsets. Multilingual synsets contain lexicalizations from WordNet synsets, the corresponding Wikipedia pages and additional translations output by a statistical machine translation system. The relations between synsets are collected from WordNet and from Wikipedia's hyperlinks between pages.

We note that, in principle, we could use any multilingual network providing a similar kind of information, e.g., EuroWordNet (Vossen, 2004). However, in our work we chose BabelNet because of its larger size, its coverage of both lexicographic and encyclopedic knowledge, and its free availability.1 In our work we used BabelNet 1.0, which encodes knowledge for six languages, namely: Catalan, English, French, German, Italian and Spanish.

1 http://babelnet.org

3.2 From Document to Knowledge Graph

We now introduce our five-step method for representing a given document d from a collection D of documents written in language L as a language-independent knowledge graph.

Building a Basic Vector Representation. Initially we transform a document d into a traditional vector representation. To do this, we score each term t_i ∈ d with a weight w_i. This weight is usually a function of term and document frequency. Following the literature, one method that works well is the log tf-idf weighting (Salton et al., 1983; Salton and McGill, 1986):

w_i = log2(f_i + 1) · log2(n / n_i),   (1)

where f_i is the number of times term t_i occurs in document d, n is the total number of documents in the collection and n_i is the number of documents that contain t_i. We then create a weighted term vector v = (w_1, ..., w_n), where w_i is the weight corresponding to term t_i. We exclude stopwords from the vector.

Selecting the Relevant Document Terms. We then create the set T of base forms, i.e., lemmas,2 of the terms in the document d. In order to keep only the most relevant terms, we sort the terms in T according to their weight in vector v and retain a maximum number K of terms, obtaining a set of terms T_K.3 The value of K is calculated as a function of the vector size, as follows:

K = (log2(1 + |v|))^2.   (2)

The rationale is that K must be high enough to ensure a good conceptual representation but not too high, so as to avoid as much noise as possible in the set T_K.

2 Following the setup of (Platt et al., 2010), our initial data is represented using term vectors. For this reason we lemmatize in this step.
3 Since the vector v provides weights for all the word forms, and not only the lemmas, occurring in d, we take the best weight among the word forms of the considered lemma.
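These first two steps are simple enough to pin down in a few lines of Python; the sketch below follows Eqs. (1) and (2) directly, with helper names of our own choosing.

```python
import math

# Steps 1-2 of the paper's pipeline, under its own definitions.

def log_tfidf(tf, n_docs, doc_freq):
    """Eq. (1): w_i = log2(f_i + 1) * log2(n / n_i)."""
    return math.log2(tf + 1) * math.log2(n_docs / doc_freq)

def top_k_terms(weights):
    """weights: lemma -> w_i, with stopwords already removed.
    Eq. (2): K = (log2(1 + |v|))^2, where |v| is the vector size."""
    k = int(math.log2(1 + len(weights)) ** 2)
    return sorted(weights, key=weights.get, reverse=True)[:k]
```

For instance, a 1,000-term vector keeps K = int((log2 1001)^2) = 99 terms, so the graph is seeded only with the most heavily weighted lemmas.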

Populating the Graph with Initial Concepts. Next, we create an initially-empty knowledge graph G = (V, E), i.e., such that V = E = ∅. We populate the vertex set V with the set S_K of all the synsets in BabelNet which contain any term in T_K in the document language L, that is:

S_K = ∪_{t ∈ T_K} Synsets_L(t),   (3)

where Synsets_L(t) is the set of synsets in BabelNet which contain a term t in the language of interest L. For example, in Figure 1(a) we show the initial graph obtained from the set T_K = {"European", "apple", "tree", "Malus", "species", "America"}. Note, however, that each retrieved synset is multilingual, i.e., it contains lexicalizations for the same concept in other languages too. Therefore, the nodes of our knowledge graph provide a language-independent representation of the document's content.

Figure 1: (a) initial graph from T_K = {"European", "apple", "tree", "Malus", "species", "America"}; (b) knowledge graph obtained by retrieving all paths from BabelNet. Gray nodes are the original concepts.

Creating the Knowledge Graph. Similarly to Navigli and Lapata (2010), we create the knowledge graph by searching BabelNet for paths connecting pairs of synsets in V. Formally, for each pair v, v' ∈ V such that v and v' do not share any lexicalization in T_K,4 and for each path v → v_1 → ... → v_n → v' in BabelNet, we set V := V ∪ {v_1, ..., v_n} and E := E ∪ {(v, v_1), ..., (v_n, v')}, that is, we add all the path vertices and edges to G. After prototyping, we limited paths to a maximum length of 3, so as to avoid excessive semantic drift.

As a result of populating the graph with intermediate edges and vertices, we obtain a knowledge graph which models the semantic context of document d. We point out that our knowledge graph might have several isolated components. We view each component as a different interpretation of document d. To select the main interpretation, we keep only the largest component, i.e., the one with the highest number of vertices, which we consider the most likely semantic representation of the document content.

Figure 1(b) shows the knowledge graph obtained for our example term set. Note that our approach retains, and therefore weights, only the subgraph focused on the "apple fruit" meaning.

4 This prevents different senses of the same term from being connected via a path in the resulting knowledge graph.

Knowledge Graph Weighting. The final step consists of weighting all the concepts and semantic relations of the knowledge graph G. For weighting relations we use the original weights from BabelNet, which provide the degree of relatedness between the synset end points of each edge (Navigli and Ponzetto, 2012a). As for concepts, we weight them on the basis of the original weights of the terms in the vector v. In order to score each concept in our knowledge graph G, we apply the topic-sensitive PageRank algorithm (Haveliwala et al., 2003) to G. While the well-known PageRank algorithm (Page et al., 1998) calculates the global importance of vertices in a graph, topic-sensitive PageRank is a variant in which the importance of vertices is biased using a set of representative "topics". Formally, the topic-sensitive PageRank vector p is calculated by means of an iterative process until convergence as follows: p = cMp + (1 − c)u, where c is the damping factor (conventionally set to 0.85), 1 − c represents the probability of a surfer randomly jumping to any node in the graph, M is the transition probability matrix of graph G, with M_ji = degree(i)^(-1) if an edge from i to j exists and 0 otherwise, u is the random-jump transition probability vector, where each u_i represents the probability of jumping randomly to node i, and p is the resulting PageRank vector which scores the nodes of G. In contrast to vanilla PageRank, the "topic-sensitive" variant gives more probability mass to some nodes in G and less to others. In our case we perturb u by concentrating the probability mass on the vertices in S_K, which are the synsets corresponding to the document terms T_K (cf. Formula 3).
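The weighting step amounts to a standard power iteration with a biased jump vector. The sketch below is a generic topic-sensitive PageRank in the spirit of Haveliwala et al. (2003), not the authors' implementation; the dense adjacency matrix, the uniform seeding over S_K and the convergence tolerance are our assumptions.

```python
import numpy as np

# Topic-sensitive PageRank sketch: adjacency[i, j] = 1 iff edge i -> j;
# seed_nodes holds the indices of the synsets in S_K.

def topic_sensitive_pagerank(adjacency, seed_nodes, c=0.85, tol=1e-8):
    A = adjacency.astype(float)
    n = A.shape[0]
    out_deg = A.sum(axis=1)
    # Column-stochastic transitions: M[j, i] = 1/degree(i) if i -> j.
    M = np.divide(A.T, out_deg, out=np.zeros((n, n)), where=out_deg > 0)
    u = np.zeros(n)
    u[list(seed_nodes)] = 1.0 / len(seed_nodes)  # jump mass only on S_K
    p = np.full(n, 1.0 / n)
    while True:                                  # iterate to convergence
        p_next = c * (M @ p) + (1 - c) * u
        if np.abs(p_next - p).sum() < tol:
            return p_next                        # concept scores w(c)
        p = p_next
```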
3.3 Similarity between Knowledge Graphs

We can now determine the similarity between two documents d, d' ∈ D in terms of the similarity of their knowledge graph representations G and G'. Following the literature (Montes y Gómez et al., 2001), we calculate the similarity between the vertex sets of the two graphs using Dice's coefficient (Jackson et al., 1989):

S_c(G, G') = (2 · Σ_{c ∈ V(G) ∩ V(G')} w(c)) / (Σ_{c ∈ V(G)} w(c) + Σ_{c ∈ V(G')} w(c)),   (4)

where w(c) is the weight of a concept c (see Section 3.2). Likewise, we calculate the similarity between the two edge sets as:

S_r(G, G') = (2 · Σ_{r ∈ E(G) ∩ E(G')} w(r)) / (Σ_{r ∈ E(G)} w(r) + Σ_{r ∈ E(G')} w(r)),   (5)

where w(r) is the weight of a semantic relation edge r.

We combine the two above measures of conceptual (S_c) and relational (S_r) similarity to obtain an integrated measure S_g(G, G') between knowledge graphs:

S_g(G, G') = (S_c(G, G') + S_r(G, G')) / 2.   (6)

Notably, since we are working with a language-independent representation of documents, this similarity measure can be applied to knowledge graphs built from documents written in any language. In Figure 2 we show two knowledge graphs for comparable documents written in different languages (for clarity, labels are in English in both graphs). As expected, the graphs share several key concepts and relations.

Figure 2: Knowledge graph examples from two comparable documents in different languages.

4 A Multilingual Vector Representation

4.1 From Document to Multilingual Vector

Since our knowledge graphs will only cover the most central concepts of a document, we complement this core representation with a more traditional vector-based representation. However, as we are interested in the cross-language comparison of documents, we translate our monolingual vector v_L of a document d written in language L into its corresponding vector v_L' in language L', using BabelNet as our multilingual dictionary. We detail the document-vector translation process in Algorithm 1.

Algorithm 1 Dictionary-based term-vector translation
Input: a weighted document vector v_L = (w_1, ..., w_n), a source language L and a target language L'
Output: a translated vector v_L'
 1: v_L' ← (0, ..., 0) of length n
 2: for i = 1 to n
 3:   if w_i = 0 continue
 4:   // let t_i be the term corresponding to w_i in v_L
 5:   S_L ← Synsets_L(t_i)
 6:   for each synset s ∈ S_L
 7:     T ← getTranslations(s, L')
 8:     if T ≠ ∅ then
 9:       for each tr ∈ T
10:         w_new = w_i · confidence(tr, t_i)
11:         // let index(tr) be the index of tr in v_L'
12:         if ∃ index(tr) then
13:           v_L'(index(tr)) = w_new
14: return v_L'

The translated vector v_L' is obtained as follows: for each term t_i with non-zero weight in v_L we obtain all the possible meanings of t_i in BabelNet (see line 5) and, for each of these, we retrieve all the translations (line 7), i.e., lexicalizations of the concept, in language L' available in the synset. We set a non-zero value in the translation vector v_L',5 in correspondence with each such translation tr, proportional to the weight of t_i in the original vector and to the confidence of the translation (line 10), as provided by the BabelNet semantic network.6

5 To make the translation possible, while at the same time keeping the same number of dimensions in our vector representation, we use a shared vocabulary which covers both languages. See Section 6 for details on the experimental setup.
6 Non-English lexicalizations in BabelNet have confidence 1 if originating from Wikipedia inter-language links and ≤ 1 if obtained by means of statistical machine translation (Navigli and Ponzetto, 2012a).
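For readers who prefer code to pseudocode, Algorithm 1 can be rendered in Python roughly as follows. The callables synsets, translations and confidence stand in for BabelNet lookups (with the source and target languages folded in), and index maps each term of the shared vocabulary to its vector position; these names are assumptions of the sketch, not an actual BabelNet API.

```python
# Python rendering of Algorithm 1 (dictionary-based term-vector translation).

def translate_vector(v_L, terms, synsets, translations, confidence, index):
    v_T = [0.0] * len(v_L)                      # line 1: zero target vector
    for i, w in enumerate(v_L):                 # line 2
        if w == 0:
            continue                            # line 3: skip absent terms
        t_i = terms[i]                          # line 4
        for s in synsets(t_i):                  # line 5: meanings of t_i
            for tr in translations(s):          # lines 7-9: L' lexicalizations
                w_new = w * confidence(tr, t_i) # line 10
                j = index.get(tr)               # line 11: shared vocabulary
                if j is not None:               # line 12
                    v_T[j] = w_new              # line 13
    return v_T                                  # line 14
```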
In order to increase the amount of information available in the vector and to counterbalance possible wrong translations, we avoid translating all vectors into one language. Instead, in the present work we create a multilingual vector representation of a document d written in language L by concatenating the corresponding vector v_L with the translated vector v_L' of d for language L'. As a result, we obtain a multilingual vector v_LL', which contains lexicalizations in both languages.

4.2 Similarity between Multilingual Vectors

Following common practice for document similarity in the literature (Manning et al., 2008), we use the cosine similarity as the similarity measure between multilingual vectors:

S_v(v_LL', v'_LL') = (v_LL' · v'_LL') / (||v_LL'|| ||v'_LL'||).   (7)

5 Knowledge-based Document Similarity

Given a source document d and a target document d', we calculate the similarities between the respective knowledge-graph and multilingual-vector representations, and combine them to obtain a knowledge-based similarity as follows:

KBSim(d, d') = c(G) · S_g(G, G') + (1 − c(G)) · S_v(v_LL', v'_LL'),   (8)

where c(G) is an interpolation factor calculated as the edge density of knowledge graph G:

c(G) = |E(G)| / (|V(G)| · (|V(G)| − 1)).   (9)

Note that, by using the factor c(G) to interpolate the two similarities in Eq. 8, we determine the relevance of the knowledge graphs and the multilingual vectors in a dynamic way. Indeed, c(G) makes the contribution of graph similarity depend on the richness of the knowledge graph.
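Putting the pieces together, a compact sketch of the KBSim combination might look as follows. Graphs are passed as plain weight maps, which is our choice of data structure; when the two graphs agree on the weight of a shared element, the Dice numerator below reduces to the 2 · w(x) of Eqs. (4) and (5).

```python
import math

# KBSim sketch (Eqs. 4-6, 8, 9). A graph G is a dict with weight maps
# G["concepts"][c] = w(c) and G["edges"][e] = w(e).

def dice(w1, w2):
    common = w1.keys() & w2.keys()
    num = sum(w1[x] + w2[x] for x in common)          # Eqs. (4)/(5) numerator
    den = sum(w1.values()) + sum(w2.values())
    return num / den if den else 0.0

def cosine(v1, v2):                                   # Eq. (7)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

def kbsim(G, G2, v, v2):
    s_g = 0.5 * (dice(G["concepts"], G2["concepts"])  # Eq. (6), combining
               + dice(G["edges"], G2["edges"]))       # Eqs. (4) and (5)
    n = len(G["concepts"])
    c = len(G["edges"]) / (n * (n - 1)) if n > 1 else 0.0   # Eq. (9)
    return c * s_g + (1 - c) * cosine(v, v2)          # Eq. (8)
```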
6 Evaluation

In this section we compare our knowledge-based document similarity measure, KBSim, against state-of-the-art models on two different tasks: comparable document retrieval and cross-lingual text categorization.

6.1 Comparable Document Retrieval

In our first experiment we determine the effectiveness of our knowledge-based approach in a comparable document retrieval task. Given a document d written in language L and a collection D_L' of documents written in another language L', the task of comparable document retrieval consists of finding the document in D_L' which is most similar to d, under the assumption that there exists one document d' ∈ D_L' which is comparable with d.

6.1.1 Corpus and Task Setting

Dataset. We followed the experimental setting described in (Platt et al., 2010; Yih et al., 2011) and evaluated KBSim on the Wikipedia dataset made available by the authors of those papers. The dataset is composed of comparable Wikipedia encyclopedic entries in English and Spanish. For each document in English there exists a "real" pair in Spanish which was defined as a comparable entry by the Wikipedia user community. The dataset of each language was split into three parts: 43,380 training, 8,675 development and 8,675 test documents. The documents were tokenized, without lemmatization, and represented as vectors using log(tf)-idf weighting (Salton and Buckley, 1988). The vocabulary of the corpus was restricted to 20,000 terms, which were the most frequent terms in the two languages after removing the top 50 terms.

Methodology. To evaluate the models we compared each English document against the Spanish dataset and vice versa. Following the original setting, the results are given as the average performance between these two experiments. For evaluation we employed the averaged top-1 accuracy and Mean Reciprocal Rank (MRR) at finding the real comparable document in the other language. We compared KBSim against the state-of-the-art supervised models S2Net, OPCA, CCA, and CL-LSI (cf. Section 2). In contrast to these models, KBSim does not need a training step, so we applied it directly to the testing partition.

In addition, we also included the results of CL-ESA,7 CL-C3G8 and two simple vector-based models which translate all documents into English on a word-by-word basis and compare them using cosine similarity: the first model (CosSim_E) uses a statistical dictionary trained with Europarl using word-dependent Hidden Markov Models (He, 2007), a model similar to IBM Model 4; the second model (CosSim_BN) instead uses Algorithm 1 to translate the vectors with BabelNet.

7 Document collections with sizes higher than 10^5 provide high performance (Potthast et al., 2008). Here we used 15k documents from the training set to index the test documents.
8 CL-C3G is CL-CNG using character 3-grams, which has proven to be the best length (Mcnamee and Mayfield, 2004).

6.1.2 Results

Model     | Dimension | Accuracy | MRR
S2Net     | 2000      | 0.7447   | 0.7973
KBSim     | N/A       | 0.7342   | 0.7750
OPCA      | 2000      | 0.7255   | 0.7734
CosSim_E  | N/A       | 0.7033   | 0.7467
CosSim_BN | N/A       | 0.7029   | 0.7550
CCA       | 1500      | 0.6894   | 0.7378
CL-LSI    | 5000      | 0.5302   | 0.6130
CL-ESA    | 15000     | 0.2660   | 0.3305
CL-C3G    | N/A       | 0.2511   | 0.3025

Table 1: Test results for comparable document retrieval in Wikipedia. S2Net, OPCA, CosSim_E, CCA and CL-LSI are from (Yih et al., 2011).

As we can see from Table 1,9 the CosSim_BN model, which uses BabelNet to translate the document vectors, achieves better results than CCA and CL-LSI. We hypothesize that this is due to these linear projection models losing information during the projection. CosSim_E yields results similar to CosSim_BN, showing that BabelNet is a good alternative to a statistical dictionary. In contrast to CCA and CL-LSI, OPCA performs better thanks to its improved projection method using a noise covariance matrix, which enables it to obtain the main components in a low-dimensional space.

9 In this work, statistically significant results according to a χ2 test are highlighted in bold.

CL-C3G and CL-ESA obtain the lowest results. Considering that English and Spanish do not have many lexical similarities, the low performance of CL-C3G is justified because these languages do not share many character n-grams.
The reason behind the low results of CL-ESA can be explained by the low number of intersecting concepts between Spanish and English in Wikipedia, as confirmed by Potthast et al. (2008). Despite both using Wikipedia in some way, KBSim obtains much higher performance than CL-ESA thanks to the use of our multilingual knowledge graph representation of documents, which makes it possible to expand and semantically relate the original concepts. As a result, in contrast to CL-ESA, KBSim can integrate conceptual and relational similarity functions which provide more accurate performance. Interestingly, KBSim also outperforms OPCA which, in contrast to our system, is supervised, and in terms of accuracy it is only 1 point below S2Net, the supervised state-of-the-art model using neural networks.

6.2 Cross-language Text Categorization

The second task on which we tested the different models was cross-language text categorization. The task is defined as follows: given a document d_L in a language L, a corpus D_L' with documents in a different language L', and C possible categories, a system has to classify d_L into one of the categories C using the labeled collection D_L'.

6.2.1 Corpus and Task Setting

Dataset. To perform this task we used the Multilingual Reuters Collection (Amini et al., 2009), which is composed of five datasets of news from five different languages (English, French, German, Spanish and Italian) classified into six possible categories. In addition, each dataset of news is translated into the other four languages using the Portage translation system (Sadat et al., 2005). As a result, we have five different multilingual datasets, each containing source news documents in one language and four sets of translated documents in the other languages. Each of the languages has an independent vocabulary. Document vectors in the collection are created using TF-IDF-based weighting.

Methodology. To evaluate our approach we used the English and Spanish news datasets. From the English news dataset we randomly selected 13,131 news documents for training and 1,875 as test documents. From the Spanish news dataset we selected all 12,342 news documents as test documents. To classify both test sets we used the English news training set. We performed the experiment at the cross-lingual level using the Spanish and English versions available for both the Spanish and English news datasets; we therefore classified each test set both by selecting the documents in English and by using the Spanish documents in the training dataset, and vice versa. We followed Platt et al. (2010) and averaged the values obtained from the two comparisons for each test set to obtain the final result. To categorize the documents we applied k-NN to the ranked list of documents according to the similarity measure employed by each model, as sketched below. We evaluated each model by estimating its accuracy in the classification of the English and Spanish test sets.

We compared our approach against the state-of-the-art supervised models in this task: OPCA, CCA and CL-LSI (Platt et al., 2010). In addition, we include the results of the CosSim_BN and CosSim_E models that we introduced in Section 6.1.1, as well as the results of a full statistical machine translation system trained with Europarl and post-processed with LSA (Full MT), as reported by Platt et al. (2010).
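Since the categorization step is a plain k-NN vote over the similarity ranking, a minimal sketch is straightforward. Note that the value of k and the majority-vote tie-breaking are our assumptions, as the paper does not report them.

```python
from collections import Counter

# k-NN categorization over a KBSim-style similarity ranking.

def knn_classify(test_doc, train_docs, train_labels, similarity, k=5):
    ranked = sorted(range(len(train_docs)),
                    key=lambda i: similarity(test_doc, train_docs[i]),
                    reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]  # majority category among the top k
```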
6.2.2 Results

Model     | Dim. | EN News Accuracy | ES News Accuracy
KBSim     | N/A  | 0.8189 | 0.6997
Full MT   | 50   | 0.8483 | 0.6484
CosSim_BN | N/A  | 0.8023 | 0.6737
OPCA      | 100  | 0.8412 | 0.5954
CCA       | 150  | 0.8388 | 0.5323
CL-LSI    | 5000 | 0.8401 | 0.5105
CosSim_E  | N/A  | 0.8046 | 0.4481

Table 2: Test results for cross-language text categorization. Full MT, OPCA, CCA, CL-LSI and CosSim_E are from (Platt et al., 2010).

Table 2 shows the cross-language text categorization accuracy. CosSim_E obtained the lowest results. This is because there is a significant number of untranslated terms in the translation process that the statistical dictionary cannot cover. This is not the case for the CosSim_BN model, which achieves higher results using BabelNet as a statistical dictionary, especially on the Spanish news corpus.

On the other hand, the linear projection methods, as well as Full MT, obtained the highest results on the English corpus. The differences between the linear projection methods are evident when looking at the Spanish corpus results; OPCA performed best with a considerable improvement, which indicates again that it is one of the most effective linear projection methods. Finally, our approach, KBSim, obtained competitive results on the English corpus, performing best among the unsupervised systems, and the highest results on the Spanish news, surpassing all alternatives.

Since KBSim does not need any training for document comparison, and because it is based, moreover, on a multilingual semantic network, we performed an additional experiment to demonstrate its ability to carry out the same text categorization task in many languages. To do this, we used the Multilingual Reuters Collection to create a 3,000-document test dataset and a 9,000-document training dataset10 for five languages: English, German, Spanish, French and Italian. Then we calculated the classification accuracy on each test set using each training set. Results are shown in Table 3.

10 Note that training is needed for the k-NN classifier, but not for document comparison.

Testing  |                Training datasets
datasets |   DE   |   EN   |   ES   |   FR   |   IT
DE       | 0.8053 | 0.6872 | 0.5373 | 0.6417 | 0.5920
EN       | 0.5827 | 0.8463 | 0.5540 | 0.6530 | 0.5820
ES       | 0.5883 | 0.6153 | 0.8707 | 0.6237 | 0.7010
FR       | 0.6867 | 0.7103 | 0.6667 | 0.8227 | 0.6887
IT       | 0.5973 | 0.5487 | 0.6263 | 0.5973 | 0.8317

Table 3: KBSim accuracy in a multilingual setup.

The best results for each language were obtained when working at the monolingual level, which suggests that KBSim might be a good untrained alternative in monolingual tasks, too. In general, cross-language comparisons produced similar results, demonstrating the general applicability of KBSim to arbitrary language pairs in multilingual text categorization. However, we note that the German, Italian and Spanish training partitions produced low results compared to the others. After analyzing the length of the documents in the different datasets, we discovered that they have different average lengths in words: 79 (EN), 76 (FR), 75 (DE), 60 (ES) and 55 (IT). German, Spanish and especially Italian documents have the lowest average length, which makes it more difficult to build a representative knowledge graph of the content of each document when performing at the cross-language level.

7 Conclusions

In this paper we introduced a knowledge-based approach to represent and compare documents written in different languages. The two main contributions of this work are: i) a new graph-based model for the language-independent representation of documents based on the BabelNet multilingual semantic network; ii) KBSim, a knowledge-based cross-language similarity measure between documents, which integrates our multilingual graph-based model with a traditional vector representation.

In two different cross-lingual tasks, i.e., comparable document retrieval and cross-language text categorization, KBSim has proven to perform on a par with, or better than, the supervised state-of-the-art models which make use of linear projections to obtain the main components of the term vectors. We remark that, in contrast to the best systems in the literature, KBSim does not need any parameter tuning phase, nor does it use any training information. Moreover, when scaling to many languages, supervised systems need to be trained on each language pair, which can be very costly.

The gist of our approach is in the knowledge graph representation of documents, which relates the original terms using expanded concepts and relations from BabelNet. The knowledge graphs also have the nice feature of being human-interpretable, a feature that we want to exploit in future work. We will also explore the integration of linear projection models, such as OPCA and S2Net, into our multilingual vector-based similarity measure. Also, to ensure a level playing field, following the competing models, in this work we did not use multi-word expressions as vector components.
We will study their impact on KBSim in future work.

Acknowledgments

The authors gratefully acknowledge the support of the ERC Starting Grant MultiJEDI No. 259234, EC WIQ-EI IRSES (Grant No. 269180) and MICINN DIANA-Applications (TIN2012-38603-C02-01). Thanks go to Yih et al. for their support and Jim McManus for his comments.

References

Massih-Reza Amini, Nicolas Usunier, and Cyril Goutte. 2009. Learning from multiple partially observed views - an application to multilingual text categorization. In Advances in Neural Information Processing Systems 22 (NIPS 2009), pages 28–36.

Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. 2009. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Philipp Cimiano, Antje Schultz, Sergej Sizov, Philipp Sorg, and Steffen Staab. 2009. Explicit versus latent concept models for cross-language information retrieval. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), volume 9, pages 1513–1518.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Konstantinos I. Diamantaras and Sun Y. Kung. 1996. Principal Component Neural Networks. Wiley, New York.

Susan T. Dumais, Todd A. Letsche, Michael L. Littman, and Thomas K. Landauer. 1997. Automatic cross-language retrieval using latent semantic indexing. In Proceedings of the AAAI Spring Symposium on Cross-Language Text and Speech Retrieval, pages 18–24.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Bradford Books.

Marc Franco-Salvador, Parth Gupta, and Paolo Rosso. 2013. Cross-language plagiarism detection using a multilingual semantic network. In Proceedings of the 35th European Conference on Information Retrieval (ECIR '13), volume LNCS 7814, pages 710–713. Springer-Verlag.

Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), pages 1606–1611.

Taher Haveliwala, Sepandar Kamvar, and Glen Jeh. 2003. An analytical comparison of approaches to personalizing PageRank. Technical Report 2003-35, Stanford InfoLab, June.

Xiaodong He. 2007. Using word-dependent transition models in HMM-based word alignment for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 80–87. Association for Computational Linguistics.

Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, and Gerhard Weikum. 2013. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194:28–61.

Eduard H. Hovy, Roberto Navigli, and Simone Paolo Ponzetto. 2013. Collaboratively built semi-structured content and Artificial Intelligence: The story so far. Artificial Intelligence, 194:2–27.

David Hull. 1994. Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 282–291. Springer.

Donald A. Jackson, Keith M. Somers, and Harold H. Harvey. 1989. Similarity coefficients: measures of co-occurrence and association or simply measures of occurrence? American Naturalist, pages 436–453.

Ian T. Jolliffe. 1986. Principal Component Analysis. Springer-Verlag, New York.

Thomas K. Landauer and Michael L. Littman. 1994. Computerized cross-language document retrieval using latent semantic indexing, April 5. US Patent 5,301,109.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.

Michael Matuschek and Iryna Gurevych. 2013. Dijkstra-WSA: A graph-based approach to word sense alignment. Transactions of the Association for Computational Linguistics (TACL), 1:151–164.
on Cross-language Text and Speech Retrieval, pages Dijkstra-WSA: A graph-based approach to word 18–24. sense alignment. Transactions of the Association for Computational Linguistics (TACL), 1:151–164. Christiane Fellbaum. 1998. WordNet: An electronic lexical database. Bradford Books. Paul Mcnamee and James Mayfield. 2004. Charac- ter n-gram tokenization for european language text Marc Franco-Salvador, Parth Gupta, and Paolo Rosso. retrieval. Information Retrieval, 7(1-2):73–97. 2013. Cross-language plagiarism detection using a multilingual semantic network. In Proc. of the David Mimno, Hanna M. Wallach, Jason Naradowsky, 35th European Conference on Information Retrieval David A. Smith, and Andrew McCallum. 2009. (ECIR’13), volume LNCS(7814), pages 710–713. Polylingual topic models. In Proceedings of the Springer-Verlag. 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2-Volume 2, pages Evgeniy Gabrilovich and Shaul Markovitch. 2007. 880–889. Association for Computational Linguis- Computing semantic relatedness using wikipedia- tics. based explicit semantic analysis. In Proc. of the 20th International Joint Conference on Artifical Intelli- Manuel Montes y Gomez,´ Alexander F. Gelbukh, Au- gence (IJCAI), pages 1606–1611. relio Lopez-L´ opez,´ and Ricardo A. Baeza-Yates. 2001. Flexible comparison of conceptual graphs. Taher Haveliwala, Sepandar Kamvar, and Glen Jeh. In Proc. of the 12th International Conference on 2003. An analytical comparison of approaches to Database and Expert Systems Applications (DEXA), personalizing pagerank. Technical Report 2003-35, pages 102–111. Stanford InfoLab, June. Dragos Stefan Munteanu and Daniel Marcu. 2005. Im- Xiaodong He. 2007. Using word dependent transition proving machine translation performance by exploit- models in hmm based word alignment for statistical ing non-parallel corpora. Computational Linguis- machine translation. In Proceedings of the Second tics, 31(4):477–504. Vivi Nastase and Michael Strube. 2013. Transform- Piek Vossen. 2004. EuroWordNet: A multilin- ing wikipedia into a large scale multilingual concept gual database of autonomous and language-specific network. Artificial Intelligence, 194:62–85. connected via an inter-lingual index. In- ternational Journal of Lexicography, 17(2):161– Roberto Navigli and Mirella Lapata. 2010. An ex- 173. perimental study of graph connectivity for unsuper- vised word sense disambiguation. IEEE Transac- Wen-tau Yih, Kristina Toutanova, John C. Platt, and tions on Pattern Analysis and Machine Intelligence, Christopher Meek. 2011. Learning discriminative 32(4):678–692. projections for text similarity measures. In Proceed- ings of the Fifteenth Conference on Computational Roberto Navigli and Simone Paolo Ponzetto. 2012a. Natural Language Learning, pages 247–256. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual se- mantic network. Artificial Intelligence, 193:217– 250.

Roberto Navigli and Simone Paolo Ponzetto. 2012b. BabelRelate! A joint multilingual approach to computing semantic relatedness. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI-12), pages 108–114, Toronto, Canada.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1998. The PageRank citation ranking: Bringing order to the Web. Technical report, Stanford Digital Library Technologies Project.

John C. Platt, Kristina Toutanova, and Wen-tau Yih. 2010. Translingual document representations from discriminative projections. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 251–261.

Martin Potthast, Benno Stein, and Maik Anderka. 2008. A Wikipedia-based multilingual retrieval model. In Advances in Information Retrieval, pages 522–530. Springer.

Martin Potthast, Alberto Barrón-Cedeño, Benno Stein, and Paolo Rosso. 2011. Cross-language plagiarism detection. Language Resources and Evaluation, 45(1):45–62.

Fatiha Sadat, Howard Johnson, Akakpo Agbago, George Foster, Joel Martin, and Aaron Tikuisis. 2005. Portage: A phrase-based machine translation system. In Proceedings of the ACL Workshop on Building and Using Parallel Texts, Ann Arbor, USA.

Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523.

Gerard Salton and Michael J. McGill. 1986. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA.

Gerard Salton, Edward A. Fox, and Harry Wu. 1983. Extended Boolean information retrieval. Communications of the ACM, 26(11):1022–1036.

Bruce Thompson. 2005. Canonical correlation analysis. In Encyclopedia of Statistics in Behavioral Science.

Piek Vossen. 2004. EuroWordNet: A multilingual database of autonomous and language-specific wordnets connected via an inter-lingual index. International Journal of Lexicography, 17(2):161–173.

Wen-tau Yih, Kristina Toutanova, John C. Platt, and Christopher Meek. 2011. Learning discriminative projections for text similarity measures. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 247–256.