Web Search Result Clustering with BabelNet
Marek Kozlowski (OPI-PIB), Maciej Kazula (OPI-PIB)

Abstract

In this paper we present a well-known search result clustering method enriched with BabelNet information. The goal is to verify how BabelNet/Babelfy can improve clustering quality in the web search results domain. During the evaluation we tested three algorithms (Bisecting K-Means, STC, Lingo). In the first stage, we performed experiments using only textual features coming from snippets. Next, we introduced new semantic features from BabelNet (such as disambiguated synsets, categories and glosses describing synsets, or semantic edges) in order to verify how they influence the quality of search result clustering. The algorithms were evaluated on the AMBIENT dataset in terms of clustering quality.

1 Introduction

In recent years, Web clustering engines (Carpineto, 2009) have been proposed as a solution to the issue of lexical ambiguity in Information Retrieval. These systems group search results by providing a cluster for each specific topic of the input query. Users navigate through the clusters in order to retrieve the pertinent results. Most clustering engines group search results on the basis of their lexical similarity, and therefore suffer from a lack of semantics, e.g. they cannot handle polysemy (different user needs expressed with the same words).

In this paper we present a well-known search result clustering method enriched with BabelNet information. The goal is to verify how BabelNet/Babelfy can improve clustering quality in the web search results domain. Our approach is evaluated on the AMBIENT dataset using four distinct measures, namely Rand Index (RI), Adjusted Rand Index (ARI), Jaccard Index (JI) and F1 measure.

2 Related Work

2.1 Search Result Clustering

The goal of text clustering in information retrieval is to discover groups of semantically related documents. Contextual descriptions (snippets) of documents returned by a search engine are short, often incomplete, and highly biased toward the query, so establishing a notion of proximity between documents is a challenging task, called Search Result Clustering (SRC). SRC is a specific area of document clustering.

Approaches to search result clustering can be classified as data-centric or description-centric (Carpineto, 2009).

The data-centric approach (e.g., Bisecting K-Means) focuses more on the problem of data clustering than on presenting the results to the user. Other data-centric methods use hierarchical agglomerative clustering (Maarek, 2000) that replaces single terms with lexical affinities (2-grams of words) as features, or exploit link information (Zhang, 2008).

Description-centric approaches (e.g., Lingo, STC) are more focused on the description that is produced for each cluster of search results. Accurate and concise cluster descriptions (labels) let the user search through the collection's content faster and are essential for various browsing interfaces. The task of creating descriptive, sensible cluster labels is difficult: typical text clustering algorithms rely on samples of keywords for describing discovered clusters. Among the most popular and successful approaches are phrase-based ones, which form clusters based on recurring phrases instead of numerical frequencies of isolated terms. The STC algorithm employs frequently recurring phrases as both the document similarity feature and the final cluster description (Zamir, 1998).
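
As a rough illustration of the phrase-based idea (not the STC algorithm itself, which builds a suffix tree), the sketch below counts word n-grams shared by at least two snippets and treats each shared phrase as both a grouping criterion and a candidate label. The function names, the n-gram limit and the toy snippets are assumptions made for this example only.

```python
from collections import defaultdict


def shared_phrases(snippets, max_len=3, min_docs=2):
    """Map each word n-gram (1..max_len words) to the set of snippet
    indices containing it, keeping phrases found in >= min_docs snippets."""
    phrase_docs = defaultdict(set)
    for idx, text in enumerate(snippets):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for start in range(len(words) - n + 1):
                phrase = " ".join(words[start:start + n])
                phrase_docs[phrase].add(idx)
    return {p: docs for p, docs in phrase_docs.items() if len(docs) >= min_docs}


def base_clusters(snippets, max_len=3, min_docs=2):
    """Return (phrase, snippet_ids) pairs; longer phrases come first,
    then phrases shared by more snippets, then alphabetical order."""
    shared = shared_phrases(snippets, max_len, min_docs)
    return sorted(
        ((phrase, sorted(docs)) for phrase, docs in shared.items()),
        key=lambda item: (-len(item[0].split()), -len(item[1]), item[0]),
    )


# Hypothetical snippets for the ambiguous query "jaguar".
snippets = [
    "jaguar car dealer in london",
    "used jaguar car prices",
    "jaguar big cat habitat",
    "big cat conservation news",
]
clusters = base_clusters(snippets)
for phrase, doc_ids in clusters:
    print(phrase, doc_ids)
```

On this toy input the two-word phrases "big cat" and "jaguar car" are ranked ahead of the single query word "jaguar", which hints at why recurring phrases can make more useful cluster labels than isolated terms.
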
Clustering in STC is treated as finding groups of documents sharing a high ratio of frequent phrases. The Lingo algorithm combines common phrase discovery and latent semantic indexing techniques to separate search results into meaningful groups. Lingo uses singular value decomposition of the term-document matrix to select good cluster labels among candidates extracted from the text (frequent phrases). The algorithm was designed to cluster results from Web search engines (short snippets and fragmented descriptions of original documents) and proved to provide diverse, meaningful cluster labels (Osinski, 2004).

2.2 Babel eco-system

The creation of very large knowledge bases has been made possible by the availability of collaboratively-edited online resources such as Wikipedia and Wiktionary. Although these resources are only partially structured, they provide a great deal of valuable knowledge which can be harvested and transformed into structured form.

BabelNet [1] (Navigli, 2014; Flati, 2014) is a multilingual encyclopedic dictionary and semantic network, which currently covers more than 270 languages and provides both lexicographic and encyclopedic knowledge thanks to the seamless integration of WordNet, Wikipedia, Wiktionary, OmegaWiki, Wikidata and the Open Multilingual WordNet. BabelNet encodes knowledge in a labeled directed graph G=(V,E), where V is the set of nodes (concepts) and E is the set of edges connecting pairs of concepts. Each edge is labeled with a semantic relation. Each node contains a set of lexicalizations of the concept in different languages. The multilingually lexicalized concepts are Babel synsets. At its core, concepts and relations in BabelNet are harvested from the largest available semantic lexicon of English, WordNet, and the biggest open encyclopedia, Wikipedia. BabelNet preserves the organizational structure of WordNet, i.e., it encodes concepts and named entities as sets of synonyms (synsets), but the information typical for WordNet is complemented with wide Wikipedia encyclopedic coverage, resulting in an intertwined network of concepts and named entities.

BabelNet is available online as (1) a public web user interface, (2) a public SPARQL endpoint or (3) an HTTP REST API.

Babelfy [2] (Navigli, 2014; Flati, 2014) is a unified graph-based approach that leverages BabelNet to jointly perform word sense disambiguation and entity linking in arbitrary languages. Babelfy is based on a loose identification of candidate meanings coupled with a densest subgraph heuristic, which selects high-coherence semantic interpretations. In WSD performance evaluations, Babelfy outperforms the state-of-the-art supervised systems.

3 Approach

The goal is to verify how BabelNet/Babelfy can improve clustering quality in the web search results domain. The evaluation was performed in three steps. First, we tested three algorithms (Bisecting K-Means, STC, Lingo). At this stage, we performed experiments using only textual features coming from snippets. Those methods do not exploit any external corpora or knowledge resources in order to overcome the lack of data. This stage serves to choose a representative algorithm for the following phases. Next, we introduced new semantic features from BabelNet/Babelfy (such as disambiguated synsets, categories/glosses describing synsets, or semantic edges) in order to verify how they influence the clustering quality of the search result clustering algorithm. Finally, we verified the idea of clustering snippets without a specialized algorithm, namely using only the BabelNet/Babelfy systems.

4 Experiments

4.1 Experimental Setup

Test sets. We conducted our experiments on the AMBIENT data set. AMBIENT (AMBIguous ENTries [3]) consists of 44 topics, each with a set of subtopics and a list of 100 ranked documents (Carpineto, 2008).

Reference algorithms. We compared the following search result clustering methods: (1) Lingo (Osinski, 2004), (2) Suffix Tree Clustering (Zamir, 1998) and (3) Bisecting K-means (Steinbach, 2000).

BabelNet modules. We used the HTTP API provided by BabelNet and Babelfy. Babelfy was used to disambiguate the given text [4]. The extracted synsets were then processed with the BabelNet API to get more information about them [5] (such as categories and glosses), or to get graph relations such as hypernyms [6].

[1] http://babelnet.org
[2] http://babelfy.org
[3] http://credo.fub.it/ambient/
[4] https://babelfy.io/v1/disambiguate

4.2 Scoring

Following (Di Marco, 2009), the methods were evaluated in terms of clustering quality. Clustering evaluation is a difficult issue. Many evaluation measures have been proposed in the literature, so in order to get exhaustive results we calculated four distinct measures, namely Rand Index (RI), Adjusted Rand Index (ARI), Jaccard Index (JI) and F1 measure. These measures are described in detail in (Di Marco, 2009).

4.3 Results

We conducted three-level experiments on the AMBIENT data set. First, we compared three search result clustering algorithms (i.e. Lingo, STC, and Bisecting K-means [7]). Our goal was to estimate their quality parameters (details in Table 1). Lingo and STC outperform

• synsets+ - the snippet's text is disambiguated using Babelfy, and the retrieved synset ids are added as additional tokens to the snippet textual data

• categories+ - the snippet's text is disambiguated using Babelfy, and the retrieved synset ids are processed with BabelNet in order to get their categories; the retrieved categories are added as additional tokens to the snippet textual data

• categories+1 - the snippet's text is disambiguated using Babelfy, and the retrieved synset ids are processed with BabelNet in order to get their categories; the categories occurring more than once are added as additional tokens to the snippet textual data

• categories+2 - the snippet's text is disambiguated using Babelfy, and the retrieved synset ids are processed with BabelNet in or-