A Knowledge-based Representation for Cross-Language Document Retrieval and Categorization

Marc Franco-Salvador 1,2, Paolo Rosso 2 and Roberto Navigli 1
1 Department of Computer Science, Sapienza Università di Roma, Italy
{francosalvador,navigli}@di.uniroma1.it
2 Natural Language Engineering Lab - PRHLT Research Center, Universitat Politècnica de València, Spain
{mfranco,prosso}@dsic.upv.es

Abstract

Current approaches to cross-language document retrieval and categorization are based on discriminative methods which represent documents in a low-dimensional vector space. In this paper we propose a shift from the supervised to the knowledge-based paradigm and provide a document similarity measure which draws on BabelNet, a large multilingual knowledge resource. Our experiments show state-of-the-art results in cross-lingual document retrieval and categorization.

1 Introduction

The huge amount of text that is available online is becoming increasingly multilingual, providing an additional wealth of useful information. Most of this information, however, is not easily accessible to the majority of users because of language barriers which hamper the cross-lingual search and retrieval of knowledge.

Today's search engines would benefit greatly from effective techniques for the cross-lingual retrieval of valuable information that can satisfy a user's needs by not only providing (Landauer and Littman, 1994) and translating (Munteanu and Marcu, 2005) relevant results into different languages, but also by reranking the results in a language of interest on the basis of the importance of search results in other languages.

Vector-based models are typically used in the literature for representing documents both in monolingual and cross-lingual settings (Manning et al., 2008). However, because of the large size of the vocabulary, having each term as a component of the vector makes the document representation very sparse. To address this issue, several approaches to dimensionality reduction have been proposed, such as Principal Component Analysis (Jolliffe, 1986), Latent Semantic Indexing (Hull, 1994), Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and variants thereof, which project these vectors into a lower-dimensional vector space. In order to enable multilinguality, the vectors of comparable documents written in different languages are concatenated, making up the document matrix which is then reduced using linear projection (Platt et al., 2010; Yih et al., 2011). However, to do so, comparable documents are needed as training data. Additionally, the lower-dimensional representations are not easy to interpret.

The availability of wide-coverage lexical knowledge resources extracted automatically from Wikipedia, such as DBpedia (Bizer et al., 2009), YAGO (Hoffart et al., 2013) and BabelNet (Navigli and Ponzetto, 2012a), has considerably boosted research in several areas, especially where multilinguality is a concern (Hovy et al., 2013). Among these are cross-language plagiarism detection (Potthast et al., 2011; Franco-Salvador et al., 2013), multilingual semantic relatedness (Navigli and Ponzetto, 2012b; Nastase and Strube, 2013) and semantic alignment (Navigli and Ponzetto, 2012a; Matuschek and Gurevych, 2013). One main advantage of knowledge-based methods is that they provide a human-readable, semantically interconnected representation of the textual item at hand (be it a sentence or a document).

Following this trend, in this paper we provide a knowledge-based representation of documents which goes beyond the lexical surface of the text, while at the same time avoiding the need for training in a cross-language setting. To achieve this we leverage a multilingual semantic network, i.e., BabelNet, to obtain language-independent representations, which contain concepts together with the semantic relations between them, and also include semantic knowledge which is only implied by the input text.
The integration of our multilingual graph model with a vector representation enables us to obtain state-of-the-art results in comparable document retrieval and cross-language text categorization.

2 Related Work

The mainstream representation of documents for monolingual and cross-lingual document retrieval is vector-based. A document vector, whose components quantify the relevance of each term in the document, is usually highly dimensional, because of the variety of terms used in a document collection. As a consequence, the resulting document matrices are very sparse. To address the data sparsity issue, several approaches to the reduction of the dimensionality of document vectors have been proposed in the literature. A popular class of methods is based on linear projection, which provides a low-dimensional mapping from a high-dimensional vector space. A historical approach to linear projection is Principal Component Analysis (PCA) (Jolliffe, 1986), which performs a singular value decomposition (SVD) on a document matrix D of size n × m, where each row in D is the term vector representation of a document. PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components, which make up the low-dimensional vector. Latent Semantic Analysis (LSA) (Deerwester et al., 1990) is very similar to PCA but performs the SVD using the correlation matrix instead of the covariance matrix, which implies a lower computational cost. LSA preserves the amount of variance in an eigenvector v by maximizing its Rayleigh ratio (v^T C v) / (v^T v), where C = D^T D is the correlation matrix of D.

A generalization of PCA, called Oriented Principal Component Analysis (OPCA) (Diamantaras and Kung, 1996), is based on a noise covariance matrix to project the similar components of D closely. Other projection models such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) are based on the extraction of generative models from documents. Another approach, named Explicit Semantic Analysis (ESA) (Gabrilovich and Markovitch, 2007), represents each document by its similarities to a document collection. Using a document collection with low domain specificity, such as Wikipedia, the model has proven to obtain competitive results.

Not only have these methods proven to be successful in a monolingual scenario (Deerwester et al., 1990; Hull, 1994), but they have also been adapted to perform well in tasks at a cross-language level (Potthast et al., 2008; Platt et al., 2010; Yih et al., 2011). Cross-language Latent Semantic Indexing (CL-LSI) (Dumais et al., 1997) was the first linear projection approach used in cross-lingual tasks. CL-LSI provides a cross-lingual representation for documents by reducing the dimensionality of a matrix D whose rows are obtained by concatenating comparable documents from different languages. Similarly, PCA and OPCA can be adapted to a multilingual setting. LDA was also adapted to a multilingual scenario with models such as Polylingual Topic Models (Mimno et al., 2009), Joint Probabilistic LSA and Coupled Probabilistic LSA (Platt et al., 2010), which, however, are constrained to using word counts, instead of better weighting strategies, such as log(tf)-idf, known to perform better with large vocabularies (Salton and McGill, 1986). Another variant, named Canonical Correlation Analysis (CCA) (Thompson, 2005), uses a cross-covariance matrix of the low-dimensional vectors to find the projections. Cross-language Explicit Semantic Analysis (CL-ESA) (Potthast et al., 2008; Cimiano et al., 2009; Potthast et al., 2011), instead, adapts ESA to the cross-language level by exploiting the comparable documents across languages from Wikipedia. CL-ESA represents each document written in a language L by its similarities with a document collection in language L. Using a multilingual document collection with comparable documents across languages, the resulting vectors from different languages can be compared directly.
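To make the linear projection family concrete, the following Python sketch mimics a CL-LSI-style setup under our own simplifying assumptions: each row of the matrix D concatenates the term vectors of two comparable documents over a joint vocabulary, a truncated SVD supplies the projection, and projected vectors are compared with cosine similarity. The function names are illustrative, not code from any of the cited systems.

```python
import numpy as np

# Minimal CL-LSI-style sketch (cf. Dumais et al., 1997). Assumption:
# row i of D concatenates the term vectors of two comparable documents,
# one per language, over a joint vocabulary of size m.

def train_cl_lsi(D, k):
    """Learn a rank-k projection from the concatenated document matrix D."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)  # D ~ U diag(s) Vt
    return Vt[:k]                                     # k x m projection

def project(doc_vec, Vt_k):
    """Map a document vector (zero-padded to length m) into the latent space."""
    return Vt_k @ doc_vec

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Usage: pad an L-language vector with zeros over the L' vocabulary block
# (and vice versa), project both sides, then rank candidates by cosine.
```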
An alternative unsupervised approach, Cross-language Character n-Grams (CL-CNG) (Mcnamee and Mayfield, 2004), does not draw upon linear projections and represents documents as vectors of character n-grams. It has proven to obtain good results in cross-language document retrieval (Potthast et al., 2011) between languages with lexical and syntactic similarities.
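For contrast with the projection methods, here is a hedged sketch of a CL-CNG-style comparison: documents are reduced to bags of character 3-grams and compared with the cosine measure. The helper names are ours, not from the cited work.

```python
from collections import Counter
import math

# CL-CNG-style sketch: bag-of-character-n-grams document representation.

def char_ngrams(text, n=3):
    text = " ".join(text.lower().split())  # light normalization
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(c1, c2):
    dot = sum(c1[g] * c2[g] for g in c1.keys() & c2.keys())
    norm = math.sqrt(sum(v * v for v in c1.values())) \
         * math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

# Lexically related languages share many n-grams ("pla", "det", ...),
# which is exactly the signal the method relies on:
print(cosine(char_ngrams("plagiarism detection"),
             char_ngrams("detección de plagio")))
```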
Recently, a novel supervised linear projection model based on Siamese Neural Networks (S2Net) (Yih et al., 2011) achieved state-of-the-art performance in comparable document retrieval. S2Net performs a linear combination of the terms of a document vector d to obtain a reduced vector r, which is the output layer of a neural network. Each element in r has a weight which is a linear combination of the original weights of d, and captures relationships between the original terms.

However, linear projection approaches need a high number of training documents to achieve state-of-the-art performance (Platt et al., 2010; Yih et al., 2011). Moreover, although they are good at identifying a few principal components, the representations produced are opaque, in that they cannot explicitly model the semantic content of documents with a human-interpretable representation, thereby making data analysis difficult. In this paper, instead, we propose a language-independent knowledge graph representation for documents which is obtained from a large multilingual semantic network, without using any training information. Our knowledge graph representation explicitly models the semantics of the document in terms of the concepts and relations evoked by its co-occurring terms.

3 A Knowledge-based Document Representation

We propose a knowledge-based document representation aimed at expanding the terms in a document's bag of words by means of a knowledge graph which provides concepts and semantic relations between them. Key to our approach is the use of a graph representation which does not depend on any given language, but, indeed, is multilingual. To build knowledge graphs of this kind we utilize BabelNet, a multilingual semantic network that we present in Section 3.1. Then, in Section 3.2, we describe the five steps needed to obtain our graph-based multilingual representation of documents. Finally, we introduce our knowledge graph similarity measure in Section 3.3.

3.1 BabelNet

BabelNet (Navigli and Ponzetto, 2012a) is a multilingual semantic network whose concepts and relations are obtained from the largest available semantic lexicon of English, WordNet (Fellbaum, 1998), and the largest wide-coverage collaboratively-edited encyclopedia, Wikipedia, by means of an automatic mapping algorithm. BabelNet is therefore a multilingual "encyclopedic dictionary" that combines lexicographic information with wide-coverage encyclopedic knowledge. Concepts in BabelNet are represented similarly to WordNet, i.e., by grouping sets of synonyms in the different languages into multilingual synsets. Multilingual synsets contain lexicalizations from WordNet synsets, the corresponding Wikipedia pages and additional translations output by a statistical machine translation system. The relations between synsets are collected from WordNet and from Wikipedia's hyperlinks between pages.

We note that, in principle, we could use any multilingual network providing a similar kind of information, e.g., EuroWordNet (Vossen, 2004). However, in our work we chose BabelNet because of its larger size, its coverage of both lexicographic and encyclopedic knowledge, and its free availability.1 In our work we used BabelNet 1.0, which encodes knowledge for six languages, namely: Catalan, English, French, German, Italian and Spanish.

1 http://babelnet.org

3.2 From Document to Knowledge Graph

We now introduce our five-step method for representing a given document d from a collection D of documents written in language L as a language-independent knowledge graph.

Building a Basic Vector Representation. Initially we transform a document d into a traditional vector representation. To do this, we score each term t_i ∈ d with a weight w_i. This weight is usually a function of term and document frequency. Following the literature, one method that works well is the log tf-idf weighting (Salton et al., 1983; Salton and McGill, 1986):

w_i = log2(f_i + 1) · log2(n / n_i),   (1)

where f_i is the number of times term t_i occurs in document d, n is the total number of documents in the collection and n_i is the number of documents that contain t_i. We then create a weighted term vector v = (w_1, ..., w_n), where w_i is the weight corresponding to term t_i. We exclude stopwords from the vector.

Selecting the Relevant Document Terms. We then create the set T of base forms, i.e., lemmas,2 of the terms in the document d. In order to keep only the most relevant terms, we sort the terms in T according to their weight in vector v and retain a maximum number K of terms, obtaining a set of terms T_K.3 The value of K is calculated as a function of the vector size, as follows:

K = (log2(1 + |v|))^2.   (2)

The rationale is that K must be high enough to ensure a good conceptual representation but not too high, so as to avoid as much noise as possible in the set T_K.

2 Following the setup of (Platt et al., 2010), our initial data is represented using term vectors. For this reason we lemmatize in this step.
3 Since the vector v provides weights for all the word forms, and not only the lemmas, occurring in d, we take the best weight among the word forms of the considered lemma.
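These first two steps are simple enough to pin down in a few lines of Python; the sketch below follows Eqs. (1) and (2) directly, with helper names of our own choosing.

```python
import math

# Steps 1-2 of the paper's pipeline, under its own definitions.

def log_tfidf(tf, n_docs, doc_freq):
    """Eq. (1): w_i = log2(f_i + 1) * log2(n / n_i)."""
    return math.log2(tf + 1) * math.log2(n_docs / doc_freq)

def top_k_terms(weights):
    """weights: lemma -> w_i, with stopwords already removed.
    Eq. (2): K = (log2(1 + |v|))^2, where |v| is the vector size."""
    k = int(math.log2(1 + len(weights)) ** 2)
    return sorted(weights, key=weights.get, reverse=True)[:k]
```

For instance, a 1,000-term vector keeps K = int((log2 1001)^2) = 99 terms, so the graph is seeded only with the most heavily weighted lemmas.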

Populating the Graph with Initial Concepts. Next, we create an initially-empty knowledge graph G = (V, E), i.e., such that V = E = ∅. We populate the vertex set V with the set S_K of all the synsets in BabelNet which contain any term in T_K in the document language L, that is:

S_K = ∪_{t ∈ T_K} Synsets_L(t),   (3)

where Synsets_L(t) is the set of synsets in BabelNet which contain a term t in the language of interest L. For example, in Figure 1(a) we show the initial graph obtained from the set T_K = {"European", "apple", "tree", "Malus", "species", "America"}. Note, however, that each retrieved synset is multilingual, i.e., it contains lexicalizations for the same concept in other languages too. Therefore, the nodes of our knowledge graph provide a language-independent representation of the document's content.

Figure 1: (a) initial graph from T_K = {"European", "apple", "tree", "Malus", "species", "America"}; (b) knowledge graph obtained by retrieving all paths from BabelNet. Gray nodes are the original concepts.

Creating the Knowledge Graph. Similarly to Navigli and Lapata (2010), we create the knowledge graph by searching BabelNet for paths connecting pairs of synsets in V. Formally, for each pair v, v' ∈ V such that v and v' do not share any lexicalization in T_K,4 and for each path v → v_1 → ... → v_n → v' in BabelNet, we set V := V ∪ {v_1, ..., v_n} and E := E ∪ {(v, v_1), ..., (v_n, v')}, that is, we add all the path vertices and edges to G. After prototyping, we limited paths to a maximum length of 3, so as to avoid excessive semantic drift.

As a result of populating the graph with intermediate edges and vertices, we obtain a knowledge graph which models the semantic context of document d. We point out that our knowledge graph might have several isolated components. We view each component as a different interpretation of document d. To select the main interpretation, we keep only the largest component, i.e., the one with the highest number of vertices, which we consider the most likely semantic representation of the document content.

Figure 1(b) shows the knowledge graph obtained for our example term set. Note that our approach retains, and therefore weights, only the subgraph focused on the "apple fruit" meaning.

4 This prevents different senses of the same term from being connected via a path in the resulting knowledge graph.

Knowledge Graph Weighting. The final step consists of weighting all the concepts and semantic relations of the knowledge graph G. For weighting relations we use the original weights from BabelNet, which provide the degree of relatedness between the synset end points of each edge (Navigli and Ponzetto, 2012a). As for concepts, we weight them on the basis of the original weights of the terms in the vector v. In order to score each concept in our knowledge graph G, we apply the topic-sensitive PageRank algorithm (Haveliwala et al., 2003) to G. While the well-known PageRank algorithm (Page et al., 1998) calculates the global importance of vertices in a graph, topic-sensitive PageRank is a variant in which the importance of vertices is biased using a set of representative "topics". Formally, the topic-sensitive PageRank vector p is calculated by means of an iterative process until convergence as follows: p = cMp + (1 − c)u, where c is the damping factor (conventionally set to 0.85), 1 − c represents the probability of a surfer randomly jumping to any node in the graph, M is the transition probability matrix of graph G, with M_ji = degree(i)^(-1) if an edge from i to j exists and 0 otherwise, u is the random-jump transition probability vector, where each u_i represents the probability of jumping randomly to node i, and p is the resulting PageRank vector which scores the nodes of G. In contrast to vanilla PageRank, the "topic-sensitive" variant gives more probability mass to some nodes in G and less to others. In our case we perturb u by concentrating the probability mass on the vertices in S_K, which are the synsets corresponding to the document terms T_K (cf. Formula 3).
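The weighting step amounts to a standard power iteration with a biased jump vector. The sketch below is a generic topic-sensitive PageRank in the spirit of Haveliwala et al. (2003), not the authors' implementation; the dense adjacency matrix, the uniform seeding over S_K and the convergence tolerance are our assumptions.

```python
import numpy as np

# Topic-sensitive PageRank sketch: adjacency[i, j] = 1 iff edge i -> j;
# seed_nodes holds the indices of the synsets in S_K.

def topic_sensitive_pagerank(adjacency, seed_nodes, c=0.85, tol=1e-8):
    A = adjacency.astype(float)
    n = A.shape[0]
    out_deg = A.sum(axis=1)
    # Column-stochastic transitions: M[j, i] = 1/degree(i) if i -> j.
    M = np.divide(A.T, out_deg, out=np.zeros((n, n)), where=out_deg > 0)
    u = np.zeros(n)
    u[list(seed_nodes)] = 1.0 / len(seed_nodes)  # jump mass only on S_K
    p = np.full(n, 1.0 / n)
    while True:                                  # iterate to convergence
        p_next = c * (M @ p) + (1 - c) * u
        if np.abs(p_next - p).sum() < tol:
            return p_next                        # concept scores w(c)
        p = p_next
```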
3.3 Similarity between Knowledge Graphs

We can now determine the similarity between two documents d, d' ∈ D in terms of the similarity of their knowledge graph representations G and G'. Following the literature (Montes y Gómez et al., 2001), we calculate the similarity between the vertex sets of the two graphs using Dice's coefficient (Jackson et al., 1989):

S_c(G, G') = (2 · Σ_{c ∈ V(G) ∩ V(G')} w(c)) / (Σ_{c ∈ V(G)} w(c) + Σ_{c ∈ V(G')} w(c)),   (4)

where w(c) is the weight of a concept c (see Section 3.2). Likewise, we calculate the similarity between the two edge sets as:

S_r(G, G') = (2 · Σ_{r ∈ E(G) ∩ E(G')} w(r)) / (Σ_{r ∈ E(G)} w(r) + Σ_{r ∈ E(G')} w(r)),   (5)

where w(r) is the weight of a semantic relation edge r.

We combine the two above measures of conceptual (S_c) and relational (S_r) similarity to obtain an integrated measure S_g(G, G') between knowledge graphs:

S_g(G, G') = (S_c(G, G') + S_r(G, G')) / 2.   (6)

Notably, since we are working with a language-independent representation of documents, this similarity measure can be applied to knowledge graphs built from documents written in any language. In Figure 2 we show two knowledge graphs for comparable documents written in different languages (for clarity, labels are in English in both graphs). As expected, the graphs share several key concepts and relations.

Figure 2: Knowledge graph examples from two comparable documents in different languages.

4 A Multilingual Vector Representation

4.1 From Document to Multilingual Vector

Since our knowledge graphs will only cover the most central concepts of a document, we complement this core representation with a more traditional vector-based representation. However, as we are interested in the cross-language comparison of documents, we translate our monolingual vector v_L of a document d written in language L into its corresponding vector v_L' in language L', using BabelNet as our multilingual dictionary. We detail the document-vector translation process in Algorithm 1.

Algorithm 1 Dictionary-based term-vector translation
Input: a weighted document vector v_L = (w_1, ..., w_n), a source language L and a target language L'
Output: a translated vector v_L'
 1: v_L' ← (0, ..., 0) of length n
 2: for i = 1 to n
 3:   if w_i = 0 continue
 4:   // let t_i be the term corresponding to w_i in v_L
 5:   S_L ← Synsets_L(t_i)
 6:   for each synset s ∈ S_L
 7:     T ← getTranslations(s, L')
 8:     if T ≠ ∅ then
 9:       for each tr ∈ T
10:         w_new = w_i · confidence(tr, t_i)
11:         // let index(tr) be the index of tr in v_L'
12:         if ∃ index(tr) then
13:           v_L'(index(tr)) = w_new
14: return v_L'

The translated vector v_L' is obtained as follows: for each term t_i with non-zero weight in v_L we obtain all the possible meanings of t_i in BabelNet (see line 5) and, for each of these, we retrieve all the translations (line 7), i.e., lexicalizations of the concept, in language L' available in the synset. We set a non-zero value in the translation vector v_L',5 in correspondence with each such translation tr, proportional to the weight of t_i in the original vector and to the confidence of the translation (line 10), as provided by the BabelNet semantic network.6

5 To make the translation possible, while at the same time keeping the same number of dimensions in our vector representation, we use a shared vocabulary which covers both languages. See Section 6 for details on the experimental setup.
6 Non-English lexicalizations in BabelNet have confidence 1 if originating from Wikipedia inter-language links and ≤ 1 if obtained by means of statistical machine translation (Navigli and Ponzetto, 2012a).
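For readers who prefer code to pseudocode, Algorithm 1 can be rendered in Python roughly as follows. The callables synsets, translations and confidence stand in for BabelNet lookups (with the source and target languages folded in), and index maps each term of the shared vocabulary to its vector position; these names are assumptions of the sketch, not an actual BabelNet API.

```python
# Python rendering of Algorithm 1 (dictionary-based term-vector translation).

def translate_vector(v_L, terms, synsets, translations, confidence, index):
    v_T = [0.0] * len(v_L)                      # line 1: zero target vector
    for i, w in enumerate(v_L):                 # line 2
        if w == 0:
            continue                            # line 3: skip absent terms
        t_i = terms[i]                          # line 4
        for s in synsets(t_i):                  # line 5: meanings of t_i
            for tr in translations(s):          # lines 7-9: L' lexicalizations
                w_new = w * confidence(tr, t_i) # line 10
                j = index.get(tr)               # line 11: shared vocabulary
                if j is not None:               # line 12
                    v_T[j] = w_new              # line 13
    return v_T                                  # line 14
```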
In order to increase the amount of information available in the vector and to counterbalance possible wrong translations, we avoid translating all vectors into one language. Instead, in the present work we create a multilingual vector representation of a document d written in language L by concatenating the corresponding vector v_L with the translated vector v_L' of d for language L'. As a result, we obtain a multilingual vector v_LL', which contains lexicalizations in both languages.

4.2 Similarity between Multilingual Vectors

Following common practice for document similarity in the literature (Manning et al., 2008), we use the cosine similarity as the similarity measure between multilingual vectors:

S_v(v_LL', v'_LL') = (v_LL' · v'_LL') / (||v_LL'|| ||v'_LL'||).   (7)

5 Knowledge-based Document Similarity

Given a source document d and a target document d', we calculate the similarities between the respective knowledge-graph and multilingual-vector representations, and combine them to obtain a knowledge-based similarity as follows:

KBSim(d, d') = c(G) · S_g(G, G') + (1 − c(G)) · S_v(v_LL', v'_LL'),   (8)

where c(G) is an interpolation factor calculated as the edge density of knowledge graph G:

c(G) = |E(G)| / (|V(G)| · (|V(G)| − 1)).   (9)

Note that, by using the factor c(G) to interpolate the two similarities in Eq. 8, we determine the relevance of the knowledge graphs and the multilingual vectors in a dynamic way. Indeed, c(G) makes the contribution of graph similarity depend on the richness of the knowledge graph.
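Putting the pieces together, a compact sketch of the KBSim combination might look as follows. Graphs are passed as plain weight maps, which is our choice of data structure; when the two graphs agree on the weight of a shared element, the Dice numerator below reduces to the 2 · w(x) of Eqs. (4) and (5).

```python
import math

# KBSim sketch (Eqs. 4-6, 8, 9). A graph G is a dict with weight maps
# G["concepts"][c] = w(c) and G["edges"][e] = w(e).

def dice(w1, w2):
    common = w1.keys() & w2.keys()
    num = sum(w1[x] + w2[x] for x in common)          # Eqs. (4)/(5) numerator
    den = sum(w1.values()) + sum(w2.values())
    return num / den if den else 0.0

def cosine(v1, v2):                                   # Eq. (7)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

def kbsim(G, G2, v, v2):
    s_g = 0.5 * (dice(G["concepts"], G2["concepts"])  # Eq. (6), combining
               + dice(G["edges"], G2["edges"]))       # Eqs. (4) and (5)
    n = len(G["concepts"])
    c = len(G["edges"]) / (n * (n - 1)) if n > 1 else 0.0   # Eq. (9)
    return c * s_g + (1 - c) * cosine(v, v2)          # Eq. (8)
```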
6 Evaluation

In this section we compare our knowledge-based document similarity measure, KBSim, against state-of-the-art models on two different tasks: comparable document retrieval and cross-lingual text categorization.

6.1 Comparable Document Retrieval

In our first experiment we determine the effectiveness of our knowledge-based approach in a comparable document retrieval task. Given a document d written in language L and a collection D_L' of documents written in another language L', the task of comparable document retrieval consists of finding the document in D_L' which is most similar to d, under the assumption that there exists one document d' ∈ D_L' which is comparable with d.

6.1.1 Corpus and Task Setting

Dataset. We followed the experimental setting described in (Platt et al., 2010; Yih et al., 2011) and evaluated KBSim on the Wikipedia dataset made available by the authors of those papers. The dataset is composed of comparable Wikipedia encyclopedic entries in English and Spanish. For each document in English there exists a "real" pair in Spanish which was defined as a comparable entry by the Wikipedia user community. The dataset of each language was split into three parts: 43,380 training, 8,675 development and 8,675 test documents. The documents were tokenized, without lemmatization, and represented as vectors using log(tf)-idf weighting (Salton and Buckley, 1988). The vocabulary of the corpus was restricted to 20,000 terms, which were the most frequent terms in the two languages after removing the top 50 terms.

Methodology. To evaluate the models we compared each English document against the Spanish dataset and vice versa. Following the original setting, the results are given as the average performance between these two experiments. For evaluation we employed the averaged top-1 accuracy and Mean Reciprocal Rank (MRR) at finding the real comparable document in the other language. We compared KBSim against the state-of-the-art supervised models S2Net, OPCA, CCA, and CL-LSI (cf. Section 2). In contrast to these models, KBSim does not need a training step, so we applied it directly to the testing partition.

In addition, we also included the results of CL-ESA,7 CL-C3G8 and two simple vector-based models which translate all documents into English on a word-by-word basis and compare them using cosine similarity: the first model (CosSim_E) uses a statistical dictionary trained with Europarl using word-dependent Hidden Markov Models (He, 2007), a model similar to IBM Model 4; the second model (CosSim_BN) instead uses Algorithm 1 to translate the vectors with BabelNet.

7 Document collections with sizes higher than 10^5 provide high performance (Potthast et al., 2008). Here we used 15k documents from the training set to index the test documents.
8 CL-C3G is CL-CNG using character 3-grams, which has proven to be the best length (Mcnamee and Mayfield, 2004).

6.1.2 Results

Model     | Dimension | Accuracy | MRR
S2Net     | 2000      | 0.7447   | 0.7973
KBSim     | N/A       | 0.7342   | 0.7750
OPCA      | 2000      | 0.7255   | 0.7734
CosSim_E  | N/A       | 0.7033   | 0.7467
CosSim_BN | N/A       | 0.7029   | 0.7550
CCA       | 1500      | 0.6894   | 0.7378
CL-LSI    | 5000      | 0.5302   | 0.6130
CL-ESA    | 15000     | 0.2660   | 0.3305
CL-C3G    | N/A       | 0.2511   | 0.3025

Table 1: Test results for comparable document retrieval in Wikipedia. S2Net, OPCA, CosSim_E, CCA and CL-LSI are from (Yih et al., 2011).

As we can see from Table 1,9 the CosSim_BN model, which uses BabelNet to translate the document vectors, achieves better results than CCA and CL-LSI. We hypothesize that this is due to these linear projection models losing information during the projection. CosSim_E yields results similar to CosSim_BN, showing that BabelNet is a good alternative to a statistical dictionary. In contrast to CCA and CL-LSI, OPCA performs better thanks to its improved projection method using a noise covariance matrix, which enables it to obtain the main components in a low-dimensional space.

9 In this work, statistically significant results according to a χ2 test are highlighted in bold.

CL-C3G and CL-ESA obtain the lowest results. Considering that English and Spanish do not have many lexical similarities, the low performance of CL-C3G is justified because these languages do not share many character n-grams.
The reason behind the low results of CL-ESA can be explained by the low number of intersecting concepts between Spanish and English in Wikipedia, as confirmed by Potthast et al. (2008). Despite both using Wikipedia in some way, KBSim obtains much higher performance than CL-ESA thanks to the use of our multilingual knowledge graph representation of documents, which makes it possible to expand and semantically relate the original concepts. As a result, in contrast to CL-ESA, KBSim can integrate conceptual and relational similarity functions which provide more accurate performance. Interestingly, KBSim also outperforms OPCA which, in contrast to our system, is supervised, and in terms of accuracy it is only 1 point below S2Net, the supervised state-of-the-art model using neural networks.

6.2 Cross-language Text Categorization

The second task on which we tested the different models was cross-language text categorization. The task is defined as follows: given a document d_L in a language L, a corpus D_L' with documents in a different language L', and C possible categories, a system has to classify d_L into one of the categories C using the labeled collection D_L'.

6.2.1 Corpus and Task Setting

Dataset. To perform this task we used the Multilingual Reuters Collection (Amini et al., 2009), which is composed of five datasets of news from five different languages (English, French, German, Spanish and Italian) classified into six possible categories. In addition, each dataset of news is translated into the other four languages using the Portage translation system (Sadat et al., 2005). As a result, we have five different multilingual datasets, each containing source news documents in one language and four sets of translated documents in the other languages. Each of the languages has an independent vocabulary. Document vectors in the collection are created using TF-IDF-based weighting.

Methodology. To evaluate our approach we used the English and Spanish news datasets. From the English news dataset we randomly selected 13,131 news documents for training and 1,875 as test documents. From the Spanish news dataset we selected all 12,342 news documents as test documents. To classify both test sets we used the English news training set. We performed the experiment at the cross-lingual level using the Spanish and English versions available for both the Spanish and English news datasets; we therefore classified each test set both by selecting the documents in English and by using the Spanish documents in the training dataset, and vice versa. We followed Platt et al. (2010) and averaged the values obtained from the two comparisons for each test set to obtain the final result. To categorize the documents we applied k-NN to the ranked list of documents according to the similarity measure employed by each model, as sketched below. We evaluated each model by estimating its accuracy in the classification of the English and Spanish test sets.

We compared our approach against the state-of-the-art supervised models in this task: OPCA, CCA and CL-LSI (Platt et al., 2010). In addition, we include the results of the CosSim_BN and CosSim_E models that we introduced in Section 6.1.1, as well as the results of a full statistical machine translation system trained with Europarl and post-processed with LSA (Full MT), as reported by Platt et al. (2010).
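Since the categorization step is a plain k-NN vote over the similarity ranking, a minimal sketch is straightforward. Note that the value of k and the majority-vote tie-breaking are our assumptions, as the paper does not report them.

```python
from collections import Counter

# k-NN categorization over a KBSim-style similarity ranking.

def knn_classify(test_doc, train_docs, train_labels, similarity, k=5):
    ranked = sorted(range(len(train_docs)),
                    key=lambda i: similarity(test_doc, train_docs[i]),
                    reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]  # majority category among the top k
```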
6.2.2 Results

Model     | Dim. | EN News Accuracy | ES News Accuracy
KBSim     | N/A  | 0.8189 | 0.6997
Full MT   | 50   | 0.8483 | 0.6484
CosSim_BN | N/A  | 0.8023 | 0.6737
OPCA      | 100  | 0.8412 | 0.5954
CCA       | 150  | 0.8388 | 0.5323
CL-LSI    | 5000 | 0.8401 | 0.5105
CosSim_E  | N/A  | 0.8046 | 0.4481

Table 2: Test results for cross-language text categorization. Full MT, OPCA, CCA, CL-LSI and CosSim_E are from (Platt et al., 2010).

Table 2 shows the cross-language text categorization accuracy. CosSim_E obtained the lowest results. This is because there is a significant number of untranslated terms in the translation process that the statistical dictionary cannot cover. This is not the case for the CosSim_BN model, which achieves higher results using BabelNet as a statistical dictionary, especially on the Spanish news corpus.

On the other hand, the linear projection methods, as well as Full MT, obtained the highest results on the English corpus. The differences between the linear projection methods are evident when looking at the Spanish corpus results; OPCA performed best with a considerable improvement, which indicates again that it is one of the most effective linear projection methods. Finally, our approach, KBSim, obtained competitive results on the English corpus, performing best among the unsupervised systems, and the highest results on the Spanish news, surpassing all alternatives.

Since KBSim does not need any training for document comparison, and because it is based, moreover, on a multilingual semantic network, we performed an additional experiment to demonstrate its ability to carry out the same text categorization task in many languages. To do this, we used the Multilingual Reuters Collection to create a 3,000-document test dataset and a 9,000-document training dataset10 for five languages: English, German, Spanish, French and Italian. Then we calculated the classification accuracy on each test set using each training set. Results are shown in Table 3.

10 Note that training is needed for the k-NN classifier, but not for document comparison.

Testing  |                Training datasets
datasets |   DE   |   EN   |   ES   |   FR   |   IT
DE       | 0.8053 | 0.6872 | 0.5373 | 0.6417 | 0.5920
EN       | 0.5827 | 0.8463 | 0.5540 | 0.6530 | 0.5820
ES       | 0.5883 | 0.6153 | 0.8707 | 0.6237 | 0.7010
FR       | 0.6867 | 0.7103 | 0.6667 | 0.8227 | 0.6887
IT       | 0.5973 | 0.5487 | 0.6263 | 0.5973 | 0.8317

Table 3: KBSim accuracy in a multilingual setup.

The best results for each language were obtained when working at the monolingual level, which suggests that KBSim might be a good untrained alternative in monolingual tasks, too. In general, cross-language comparisons produced similar results, demonstrating the general applicability of KBSim to arbitrary language pairs in multilingual text categorization. However, we note that the German, Italian and Spanish training partitions produced low results compared to the others. After analyzing the length of the documents in the different datasets, we discovered that they have different average lengths in words: 79 (EN), 76 (FR), 75 (DE), 60 (ES) and 55 (IT). German, Spanish and especially Italian documents have the lowest average length, which makes it more difficult to build a representative knowledge graph of the content of each document when performing at the cross-language level.

7 Conclusions

In this paper we introduced a knowledge-based approach to represent and compare documents written in different languages. The two main contributions of this work are: i) a new graph-based model for the language-independent representation of documents based on the BabelNet multilingual semantic network; ii) KBSim, a knowledge-based cross-language similarity measure between documents, which integrates our multilingual graph-based model with a traditional vector representation.

In two different cross-lingual tasks, i.e., comparable document retrieval and cross-language text categorization, KBSim has proven to perform on a par with, or better than, the supervised state-of-the-art models which make use of linear projections to obtain the main components of the term vectors. We remark that, in contrast to the best systems in the literature, KBSim does not need any parameter tuning phase, nor does it use any training information. Moreover, when scaling to many languages, supervised systems need to be trained on each language pair, which can be very costly.

The gist of our approach is in the knowledge graph representation of documents, which relates the original terms using expanded concepts and relations from BabelNet. The knowledge graphs also have the nice feature of being human-interpretable, a feature that we want to exploit in future work. We will also explore the integration of linear projection models, such as OPCA and S2Net, into our multilingual vector-based similarity measure. Also, to ensure a level playing field, following the competing models, in this work we did not use multi-word expressions as vector components.
We will study their impact on KBSim in future work.

Acknowledgments

The authors gratefully acknowledge the support of the ERC Starting Grant MultiJEDI No. 259234, EC WIQ-EI IRSES (Grant No. 269180) and MICINN DIANA-Applications (TIN2012-38603-C02-01). Thanks go to Yih et al. for their support and Jim McManus for his comments.

References

Massih-Reza Amini, Nicolas Usunier, and Cyril Goutte. 2009. Learning from multiple partially observed views - an application to multilingual text categorization. In Advances in Neural Information Processing Systems 22 (NIPS 2009), pages 28–36.

Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. 2009. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Philipp Cimiano, Antje Schultz, Sergej Sizov, Philipp Sorg, and Steffen Staab. 2009. Explicit versus latent concept models for cross-language information retrieval. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), volume 9, pages 1513–1518.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Konstantinos I. Diamantaras and Sun Y. Kung. 1996. Principal Component Neural Networks. Wiley, New York.

Susan T. Dumais, Todd A. Letsche, Michael L. Littman, and Thomas K. Landauer. 1997. Automatic cross-language retrieval using latent semantic indexing. In Proceedings of the AAAI Spring Symposium on Cross-Language Text and Speech Retrieval, pages 18–24.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Bradford Books.

Marc Franco-Salvador, Parth Gupta, and Paolo Rosso. 2013. Cross-language plagiarism detection using a multilingual semantic network. In Proceedings of the 35th European Conference on Information Retrieval (ECIR '13), volume LNCS 7814, pages 710–713. Springer-Verlag.

Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), pages 1606–1611.

Taher Haveliwala, Sepandar Kamvar, and Glen Jeh. 2003. An analytical comparison of approaches to personalizing PageRank. Technical Report 2003-35, Stanford InfoLab, June.

Xiaodong He. 2007. Using word-dependent transition models in HMM-based word alignment for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 80–87. Association for Computational Linguistics.

Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, and Gerhard Weikum. 2013. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194:28–61.

Eduard H. Hovy, Roberto Navigli, and Simone Paolo Ponzetto. 2013. Collaboratively built semi-structured content and Artificial Intelligence: The story so far. Artificial Intelligence, 194:2–27.

David Hull. 1994. Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 282–291. Springer.

Donald A. Jackson, Keith M. Somers, and Harold H. Harvey. 1989. Similarity coefficients: measures of co-occurrence and association or simply measures of occurrence? American Naturalist, pages 436–453.

Ian T. Jolliffe. 1986. Principal Component Analysis. Springer-Verlag, New York.

Thomas K. Landauer and Michael L. Littman. 1994. Computerized cross-language document retrieval using latent semantic indexing, April 5. US Patent 5,301,109.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.

Michael Matuschek and Iryna Gurevych. 2013. Dijkstra-WSA: A graph-based approach to word sense alignment. Transactions of the Association for Computational Linguistics (TACL), 1:151–164.
on Cross-language Text and Speech Retrieval, pages Dijkstra-WSA: A graph-based approach to word 18–24. sense alignment. Transactions of the Association for Computational Linguistics (TACL), 1:151–164. Christiane Fellbaum. 1998. WordNet: An electronic lexical database. Bradford Books. Paul Mcnamee and James Mayfield. 2004. Charac- ter n-gram tokenization for european language text Marc Franco-Salvador, Parth Gupta, and Paolo Rosso. retrieval. Information Retrieval, 7(1-2):73–97. 2013. Cross-language plagiarism detection using a multilingual semantic network. In Proc. of the David Mimno, Hanna M. Wallach, Jason Naradowsky, 35th European Conference on Information Retrieval David A. Smith, and Andrew McCallum. 2009. (ECIR’13), volume LNCS(7814), pages 710–713. Polylingual topic models. In Proceedings of the Springer-Verlag. 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2-Volume 2, pages Evgeniy Gabrilovich and Shaul Markovitch. 2007. 880–889. Association for Computational Linguis- Computing semantic relatedness using wikipedia- tics. based explicit semantic analysis. In Proc. of the 20th International Joint Conference on Artifical Intelli- Manuel Montes y Gomez,´ Alexander F. Gelbukh, Au- gence (IJCAI), pages 1606–1611. relio Lopez-L´ opez,´ and Ricardo A. Baeza-Yates. 2001. Flexible comparison of conceptual graphs. Taher Haveliwala, Sepandar Kamvar, and Glen Jeh. In Proc. of the 12th International Conference on 2003. An analytical comparison of approaches to Database and Expert Systems Applications (DEXA), personalizing pagerank. Technical Report 2003-35, pages 102–111. Stanford InfoLab, June. Dragos Stefan Munteanu and Daniel Marcu. 2005. Im- Xiaodong He. 2007. Using word dependent transition proving machine translation performance by exploit- models in hmm based word alignment for statistical ing non-parallel corpora. Computational Linguis- machine translation. In Proceedings of the Second tics, 31(4):477–504. Vivi Nastase and Michael Strube. 2013. Transform- Piek Vossen. 2004. EuroWordNet: A multilin- ing wikipedia into a large scale multilingual concept gual database of autonomous and language-specific network. Artificial Intelligence, 194:62–85. connected via an inter-lingual index. In- ternational Journal of Lexicography, 17(2):161– Roberto Navigli and Mirella Lapata. 2010. An ex- 173. perimental study of graph connectivity for unsuper- vised word sense disambiguation. IEEE Transac- Wen-tau Yih, Kristina Toutanova, John C. Platt, and tions on Pattern Analysis and Machine Intelligence, Christopher Meek. 2011. Learning discriminative 32(4):678–692. projections for text similarity measures. In Proceed- ings of the Fifteenth Conference on Computational Roberto Navigli and Simone Paolo Ponzetto. 2012a. Natural Language Learning, pages 247–256. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual se- mantic network. Artificial Intelligence, 193:217– 250.

Roberto Navigli and Simone Paolo Ponzetto. 2012b. BabelRelate! A joint multilingual approach to computing semantic relatedness. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI-12), pages 108–114, Toronto, Canada.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1998. The PageRank citation ranking: Bringing order to the Web. Technical report, Stanford Digital Library Technologies Project.

John C. Platt, Kristina Toutanova, and Wen-tau Yih. 2010. Translingual document representations from discriminative projections. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 251–261.

Martin Potthast, Benno Stein, and Maik Anderka. 2008. A Wikipedia-based multilingual retrieval model. In Advances in Information Retrieval, pages 522–530. Springer.

Martin Potthast, Alberto Barrón-Cedeño, Benno Stein, and Paolo Rosso. 2011. Cross-language plagiarism detection. Language Resources and Evaluation, 45(1):45–62.

Fatiha Sadat, Howard Johnson, Akakpo Agbago, George Foster, Joel Martin, and Aaron Tikuisis. 2005. Portage: A phrase-based machine translation system. In Proceedings of the ACL Workshop on Building and Using Parallel Texts, Ann Arbor, USA.

Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523.

Gerard Salton and Michael J. McGill. 1986. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA.

Gerard Salton, Edward A. Fox, and Harry Wu. 1983. Extended Boolean information retrieval. Communications of the ACM, 26(11):1022–1036.

Bruce Thompson. 2005. Canonical correlation analysis. In Encyclopedia of Statistics in Behavioral Science.

Piek Vossen. 2004. EuroWordNet: A multilingual database of autonomous and language-specific wordnets connected via an inter-lingual index. International Journal of Lexicography, 17(2):161–173.

Wen-tau Yih, Kristina Toutanova, John C. Platt, and Christopher Meek. 2011. Learning discriminative projections for text similarity measures. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 247–256.