
Result Clustering for Keyword Search on Graphs

Madhulika Mohanty, Maya Ramanath

December 1, 2016

Abstract

Graph structured data on the web is now massive as well as diverse, ranging from social networks and web graphs to knowledge bases. Effectively querying this graph structured data is non-trivial and has led to research in a variety of directions – structured queries, keyword and natural language queries, automatic translation of these queries to structured queries, etc. In this project, we are concerned with a class of queries called relationship queries, which are usually expressed as a set of keywords (each keyword denoting a named entity). The results returned are a set of ranked trees, each of which denotes relationships among the various keywords. The result list could consist of hundreds of answers. The problem of keyword search on graphs has been explored for over a decade now, but an important aspect that is not as extensively studied is that of user experience. In this project, we focus on the problem of presenting the results of a keyword search to users. Our approach is to present a clustering of results rather than single results. The novelty of our approach lies in the fact that we use language models to represent result trees, and our clustering algorithm uses the JS-divergence as its distance measure. We compare our approach with the existing approaches based on isomorphism between trees and tree-edit distance. The LM-based clusters are highly cohesive and clearly separated from each other.

Chapter 1

Introduction

Many current, state-of-the-art information systems deal with large graphs. These graphs could be entity-relationship graphs extracted from textual sources, relational databases modelled as graphs using foreign-key relationships among tuples, biological networks, social networks, or a combination of different kinds of graphs. Typically, these graphs are both node-labeled and edge-labeled; the labels provide semantic information, while edge weights denote the strength of the relationship between nodes. Since these graphs are often massive, querying and analysing them efficiently is non-trivial. An example of such a massive and fast-growing graph is the Linked Open Data1 (LOD) graph [10]. As of September 2011, the LOD graph was estimated to contain 31 billion RDF triples2 [41].

While these graphs can be analysed and queried in a variety of ways, we are interested in a specific class of queries called relationship queries: given a set of two or more entities, what are the best relationships that connect all of them? For example, given the entities “Angela Merkel”, “Indira Gandhi”, “Julia Gillard” and “Sheikh Hasina”, apart from the obvious interconnection that they are all women, the other (perhaps more interesting) interconnection is that they are/were all heads of their respective governments. Similar queries could range from fun and entertaining ones on graphs of movies and actors to more serious discoveries on graphs of politicians and corporations. In all these cases, we expect the system to return multiple, ranked results [9, 27, 22, 20, 4, 25, 18, 23, 21]. The results are in the form of trees which express the relationships among the queried entities. This is keyword search, since the user specifies her query using multiple keywords that denote entities in the graph.

A key problem of keyword search on graphs is that it could potentially return a huge number of results. As a simple example, consider the query consisting of just the two keywords “Rekha” and “Bachchan” on the IMDB dataset. Example results are shown in Figure 1.1. However, we find that the total number of results returned by this query is 105. While it is certainly possible to restrict the number of results returned to just k (where k is typically 10 or 20), this is potentially restrictive for the following reasons:

1 http://linkeddata.org
2 An RDF triple consists of (subject, predicate, object), where the predicate describes the relationship between the subject and object.


Figure 1.1: Results of query “Rekha”,“Bachchan” over IMDB.

• If the query is ambiguous, there are different flavours of answers, which means that the user may never find an answer that is interesting to her. For example, for the search query “Bachchan” over IMDB data, the matching node could be an actor (“Amitabh” or “Abhishek” Bachchan, say) or a movie that has “Bachchan” as a substring of its name, e.g., “Bol Bachchan”. Hence, one term could denote two altogether different entities, both of which could be equally important; restricting the number of results could mean missing out on one of them, or intermingling results of both types and confusing the user.

• Displaying only a few answers without any categorization makes the results less informative. It could also prevent the user from discovering interesting patterns that exist in the results. For example, for the query “Rekha”, “Bachchan”, many results were of the form where the two actors have acted in the same movie, i.e., “Rekha” and “Bachchan” have been co-actors in many movies.

Clustering results, on the other hand, helps us avoid exactly the problems highlighted above. Different flavours of answers are clustered together, and the answers on a single page can encompass all of them. This gives users a bird’s eye view of all the different kinds of answers possible and lets them analyse interesting patterns that are now visible.

In this paper, we propose a novel, language-model-based [37] clustering of result trees. Each result tree is represented using a language model (LM) and the distance between two result trees is measured using the Jensen-Shannon divergence (JS-divergence). The LM for a result tree is estimated as follows. Since a result tree is nothing more than a set of entities (corresponding to nodes) and the relationships among these entities (corresponding to edges), we first estimate LMs for each entity and relationship in the result tree separately. The LM for the result tree is then estimated by combining these individual LMs through linear interpolation. We then use a standard hierarchical clustering algorithm to cluster the results, using LMs as our objects of interest and JS-divergence as the measure of distance between two LMs. The clusters so generated are then ranked using various simple heuristics based on the underlying keyword search algorithm’s ranking of the result trees.

The main advantage of using LMs over standard tree similarity measures such as tree-edit distance and tree isomorphism is that our LMs can take into account the neighbourhood of entities in the result tree. In the example of “Rekha” and “Bachchan”, even if a given node in the result tree contains just the word “Bachchan”, we can still differentiate between the “Bachchan”s by looking at the neighbourhood of that node. For example, “Amitabh Bachchan” would more likely have neighbours which are movies he acted in, while “Bol Bachchan” would have neighbours corresponding to the directors, producers or actors associated with the movie. Note that our work is orthogonal to any specific keyword search algorithm and can therefore be used as a post-processing step on the list of results returned by any keyword search algorithm.

In summary, in this paper we propose a novel LM-based method for representing result trees returned from keyword search on graphs. We show through user evaluations that clustering results based on this representation provides users with better results than clustering based on tree-edit distance or graph isomorphism.

The rest of the paper is organized as follows: Chapter 2 discusses related work, Chapter 3 describes our technique to cluster keyword search results, Chapter 4 outlines the evaluation of our system against the two major competing techniques and, finally, Chapter 5 concludes the paper with a summary of our contributions and directions for future work.

Chapter 2

Related Work

We discuss the related work under the following categories:

Graph Summarization and Minimization There has been a lot of work on graph compression and summarization techniques, which aim to reduce the graph to a smaller size with minimum loss of information. [40] proposes an algorithm, SNAP, to approximately describe the structure and content of graph data by grouping nodes based on user-selected node attributes and relationships; this summarization technique exploits similarities in the non-numerical attributes only. [44] describes an algorithm, CANAL, which extends SNAP to summarize and categorize numerical attributes into buckets, with an option to drill down or roll up the resolution of the dataset. [12] describes an algorithm to generate a summary graph using a random walk model, and [36] compresses graphs using the Minimum Description Length (MDL) principle from information theory. All these summarization techniques aim at finding a representative graph for a huge data graph; they cannot be used in our situation, where we have multiple result graphs and want to group similar results together. Other summarization techniques include those based on the bisimulation equivalence relation [16] and simulation based minimization [11], which summarize graphs by reducing/merging similar nodes and edges. [42] uses the dominance relation defined in [19] to create a summary graph by merging a number of nodes.

Graph Clustering Graph clustering is the process of grouping many graphs into different clusters such that each cluster contains similar graphs. [2] is an extensive survey of algorithms for clustering graphs and XML data, all of which are based on tree-edit distance; our comparison with a tree-edit distance based technique therefore covers this family. Tree-edit distance based clustering suffers from the drawback that it cannot group results with similar context together, which language models capture easily.

Tree-edit distance based clustering This approach has mainly been used for clustering XML documents, which have a natural tree representation. Our result trees can be treated in the same way, with the nodes and relationships labelled by their textual content. The problem of computing the distance between two trees, also known as the tree editing problem, is the generalization to labelled trees of the problem of computing the distance between two strings [28]. The editing operations available in the tree editing problem are changing (i.e., relabelling), deleting, and inserting a node/relationship; a cost is assigned to each of these operations. The edit distance problem is to find a sequence of such operations transforming a tree T1 into a tree T2 with minimum cost. The distance between T1 and T2, i.e., TED(T1, T2), is then defined as the cost of such a sequence. We have shown a comparison of our approach with a tree-edit distance based clustering approach whose basic steps, adopted from [5], are as follows. Let {T_1, ..., T_n} be the set of all the trees.

1. Compute the similarity matrix M. This is an n × n matrix whose (i, j) entry (denoted M_{i,j}) is es(T_i, T_j), the edit-distance similarity between T_i and T_j, where

es(T_i, T_j) = 1 − TED(T_i, T_j) / (nodes(T_i) + nodes(T_j))    (2.1)

The function nodes(T) returns the number of nodes in tree T. es(T_i, T_j) is a similarity value between 0 and 1 for T_i and T_j.

2. The column similarity between T_i and T_j, denoted cs(T_i, T_j), is defined as the inverse of the average absolute error between the columns corresponding to T_i and T_j in the similarity matrix:

cs(T_i, T_j) = 1 − (1/n) · Σ_{k=1..n} |M_{i,k} − M_{j,k}|    (2.2)

Therefore, to consider two subtrees as similar, the column similarity measure requires their columns in the similarity matrix to be very similar: the two subtrees must have roughly the same edit-distance similarity with respect to the rest of the subtrees in the set. Column similarity has been found to be more robust than directly using es(T_i, T_j) for estimating the similarity between T_i and T_j in the clustering process.

3. Now, using these values as a distance measure, any clustering algorithm can be used to cluster the trees. We have used the same algorithm as the one used for our approach (Section 3.3.2). A minimal sketch of steps 1 and 2 follows.
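The sketch below is our own Python illustration of steps 1 and 2; tree_edit_distance and node_count are hypothetical helpers standing in for a real tree-edit-distance implementation.

    def similarity_matrix(trees, tree_edit_distance, node_count):
        """Step 1: the n x n edit-distance similarity matrix M (Eq. 2.1)."""
        n = len(trees)
        M = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                ted = tree_edit_distance(trees[i], trees[j])
                M[i][j] = 1.0 - ted / (node_count(trees[i]) + node_count(trees[j]))
        return M

    def column_similarity(M, i, j):
        """Step 2: cs(Ti, Tj) (Eq. 2.2), one minus the average absolute
        difference between the entries for Ti and Tj in M."""
        n = len(M)
        return 1.0 - sum(abs(M[i][k] - M[j][k]) for k in range(n)) / n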

Text/XML Summarization These techniques summarize textual documents and XML trees. Grouper [43] and [6] propose techniques to cluster textual document results from a keyword search on unstructured data. Tag clouds are another very popular way to summarize web search results [30]. Emails can be summarized using summary keywords generated via Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) [13]. These techniques for clustering and summarizing text cannot be directly applied to trees. [24] generates summaries for XML results (as trees), where a snippet is produced for each result tree. Snippets with similar structures can be grouped together for

better understanding [32]. Snippet generation, however, does not take into account the grouping together of similar results; such techniques could be applied on top of a clustering technique to generate summaries for each cluster. Many isomorphism based result clustering algorithms have been proposed for clustering XML results. TreeCluster [1] uses two-level summarization: it first divides the results into groups using graph isomorphism and then subdivides the groups on the basis of keyword frequencies. Such techniques are limited by their isomorphism based approach, which considers only high structural and content similarity and ignores context. Other approaches include XSeek [33], which splits the results into k clusters based on the ancestors of the search keywords, and [17], which clusters the trees using cosine similarity between the tf-idf vectors of results. XSeek is a very XML-structure-specific algorithm and will not perform well for general graphs with varied contexts.

Keyword Search Result Enhancement This class of algorithms aims at refining a keyword query to get better results. [29] describes algorithms to find and show the most relevant keywords from the results, which can be used for further query refinement. Faceted search [8, 39] provides query refinement in the form of navigational search: it starts by providing results classified into different categories, and the user gets more refined results when she selects one of the categories. Given a set of clustered results, [34] suggests a query expansion scheme based on the F-measure. These techniques aim at improving the quality of the preliminary results themselves, but they do not address the issue of easy comprehension and understandability of the results.

Coverage-aware approaches Coverage-aware search techniques aim at finding a set of answers of the most diverse nature, so as to have a high probability of satisfying the user. DisC [14] introduces the concept of DisC diversity and proposes a greedy algorithm for finding diverse results. Other greedy techniques are given in [3], [38] and [7]. These techniques can be used to find cluster representatives. We use the simple strategy of picking the best ranked result as the representative of a cluster, but these concepts can be adapted to obtain better representatives. Since our main focus currently is on finding good clusters rather than good representatives, we do not compare our results with these techniques.

Chapter 3

Clustering Keyword Search Results

3.1 Keyword Search

Any graph structured data is represented by a directed labelled graph G = (V, E), where V is the vertex set and E is the edge set. A relational dataset can be expressed as such a graph as follows: each primary key forms a resource node r, with the other entries in the tuple forming its attribute nodes A(r). The vertex set is formed by V = r ∪ A(r). The edges F_k connect resources (via foreign keys), while the edges F_a connect a resource to its attributes; the edge set is formed by E = F_k ∪ F_a. In the graph structured data, each directed link from a subject to an object is considered a triple of the form ⟨S, P, O⟩, where S is the subject node, P is the edge annotation and O is the object node.

One of the means to query such data is to use “keyword” queries. A keyword query is a manifestation of the user's information need. It consists of n keywords, Q = {k_1, ..., k_n}, where k_i refers to the i-th keyword. Each result denotes a “relationship” between the query keywords and takes the form of a directed subtree T of the data graph G, such that T contains all the keywords of Q and is minimal. Thus, the subtree T must include the nodes for all the keywords of Q. Minimality of T with respect to Q means that T contains the nodes for all the keywords of Q but has no proper directed subtree that also includes all of them. The root node r of T is such that every other node of T is reachable from r through a unique directed path.

Many algorithms have been proposed for keyword search on graph structured data. A few notable ones are BANKS [9], Bidirectional [27], DISCOVER [22, 20], DBXplorer [4] and ObjectRank [25]. Some algorithms for keyword search on graphs also consider graphs as results [31]. Though we consider trees as results, our technique can easily be applied to graphs as well without any significant changes.
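As a toy illustration of this setting (our own sketch, not the paper's implementation), the snippet below stores the data graph and a candidate result tree as (S, P, O) triples and checks keyword coverage; the minimality and root-reachability conditions of a real keyword search engine are omitted.

    def covers_keywords(tree_triples, keywords):
        """True if every query keyword matches some node label of the tree."""
        nodes = {s for s, _, _ in tree_triples} | {o for _, _, o in tree_triples}
        return all(any(kw.lower() in node.lower() for node in nodes)
                   for kw in keywords)

    tree = [("Amitabh Bachchan", "ActedIn", "Silsila"),
            ("Rekha Ganesan", "ActedIn", "Silsila")]
    print(covers_keywords(tree, ["Rekha", "Bachchan"]))  # True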

Our paper proposes a technique to cluster the answers produced by such keyword search algorithms. The clustering is based on language models (LMs). We first briefly explain the various LMs involved and then describe our clustering technique based on these LMs.

3.2 Language Models

Language models for unstructured text documents were first proposed by Ponte and Croft [37]. A statistical language model assigns a probability to a sequence of m words, P(w_1, ..., w_m), by means of a probability distribution estimated using the maximum likelihood estimate and smoothed using a background model. Let w be a word in the document D. Then its probability in the document model is given by:

P(w) = λ · P(w|D) + (1 − λ) · P(w|C)    (3.1)

where P(w|D) is the probability of w in the document D, P(w|C) is the probability of w in the entire corpus C and λ is the smoothing parameter. We first estimate the models of entities and relationships in the graph. Based on these, we estimate the language model of each result tree and use it to cluster the results. We adopt the approach introduced in [15] for estimating the language models. A minimal sketch of the smoothed estimate in Equation 3.1 follows.
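In this sketch we assume, as a simplification, that the document and corpus are given as plain token lists:

    def mle(w, tokens):
        """Maximum likelihood estimate of w over a token list."""
        return tokens.count(w) / len(tokens) if tokens else 0.0

    def smoothed_prob(w, doc_tokens, corpus_tokens, lam=0.8):
        """Eq. 3.1: P(w) = lambda * P(w|D) + (1 - lambda) * P(w|C)."""
        return lam * mle(w, doc_tokens) + (1 - lam) * mle(w, corpus_tokens)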

Silsila
⟨Silsila, DirectedBy, Yash Chopra⟩
⟨Amitabh Bachchan, ActedIn, Silsila⟩
⟨Rekha Ganesan, ActedIn, Silsila⟩
⟨Silsila, hasLanguage, Hindi⟩

Bachchan
⟨Bachchan, ActedIn, Kabhi Kabhie⟩
⟨Bachchan, ActedIn, Silsila⟩
⟨Bachchan, ActedIn, Suhaag⟩

Table 3.1: Documents of Entities Silsila and Bachchan

3.2.1 LMs of Entities

Document of an entity

Language models are estimated for a document over a set of terms. Therefore, estimating an LM for an entity first requires us to define a “pseudo-document” for the entity and then the terms in this document. We do this as follows. The document D of an entity E is the set of all triples from our dataset in which this entity occurs:

D(E) = {⟨E, P, O⟩ : ⟨E, P, O⟩ ∈ D(G)} ∪ {⟨S, P, E⟩ : ⟨S, P, E⟩ ∈ D(G)}    (3.2)

where S is the subject, P is the predicate, O is the object and D(G) is the data graph. Table 3.1 shows the documents corresponding to the entities Silsila and Bachchan. We define two types of terms, unigrams and bigrams, as follows.

• Unigrams:

U(E) = {O : ⟨E, P, O⟩ ∈ D(G)} ∪ {S : ⟨S, P, E⟩ ∈ D(G)}    (3.3)

Silsila
Unigrams: {Yash Chopra, Amitabh Bachchan, Rekha Ganesan, Hindi}
Bigrams: {(DirectedBy, Yash Chopra), (Amitabh Bachchan, ActedIn), (Rekha Ganesan, ActedIn), (hasLanguage, Hindi)}

Bachchan
Unigrams: {Kabhi Kabhie, Silsila, Suhaag}
Bigrams: {(ActedIn, Kabhi Kabhie), (ActedIn, Silsila), (ActedIn, Suhaag)}

Table 3.2: Unigrams and Bigrams in the Document for entities Silsila and Bachchan

Entity/term    Rekha Ganesan    Suhaag    ...    (hasLanguage, Hindi)    (ActedIn, Suhaag)    ...
Silsila        0.08             0.02      ...    0.09                    0.02                 ...
Bachchan       0.06             0.08      ...    0.03                    0.07                 ...

Table 3.3: Final LMs For Entities Silsila and Bachchan

That is, the set of objects occurring in triples where E is the subject, and the set of subjects occurring in triples where E is the object.

• Similarly, bigrams are defined as:

B(E) = {(P, O) : ⟨E, P, O⟩ ∈ D(G)} ∪ {(S, P) : ⟨S, P, E⟩ ∈ D(G)}    (3.4)

Table 3.2 shows the set of unigrams and bigrams for the entities Silsila and Bachchan.
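A sketch of the document and term construction of Equations 3.2-3.4, under the assumption that the data graph is a list of (subject, predicate, object) triples (the toy triples below mirror Table 3.1):

    def entity_document(E, data_graph):
        """D(E), Eq. 3.2: all triples in which E occurs."""
        return [t for t in data_graph if t[0] == E or t[2] == E]

    def entity_unigrams(E, data_graph):
        """U(E), Eq. 3.3."""
        return ({o for s, p, o in data_graph if s == E} |
                {s for s, p, o in data_graph if o == E})

    def entity_bigrams(E, data_graph):
        """B(E), Eq. 3.4."""
        return ({(p, o) for s, p, o in data_graph if s == E} |
                {(s, p) for s, p, o in data_graph if o == E})

    graph = [("Silsila", "DirectedBy", "Yash Chopra"),
             ("Amitabh Bachchan", "ActedIn", "Silsila"),
             ("Rekha Ganesan", "ActedIn", "Silsila"),
             ("Silsila", "hasLanguage", "Hindi")]
    print(entity_unigrams("Silsila", graph))
    # e.g. {'Yash Chopra', 'Amitabh Bachchan', 'Rekha Ganesan', 'Hindi'} (set order varies)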

Estimating LM of entities

The LM corresponding to the document D(E) of an entity E is a mixture model of two LMs: P_{U(E)}, the unigram LM for E, and P_{B(E)}, the bigram LM for E. That is,

P_E(w) = µ · P_{U(E)}(w) + (1 − µ) · P_{B(E)}(w)    (3.5)

where w is a term over which the language model of E is calculated and µ controls the influence of each component (equal weights were assigned to both components in our experiments). Both the unigram and the bigram LM are estimated in the standard way, using the maximum likelihood estimate with smoothing from the data corpus. Let w be a term. Then the probability of w in the document unigram language model is given by:

P_{U(E)}(w) = λ · c(w, D(E)) / Σ_{w′∈U(E)} c(w′, D(E)) + (1 − λ) · c(w, D(G)) / Σ_{w′∈U(E)} c(w′, D(G))    (3.6)

where c(w, D(E)) is the count of term w in the document of entity E and Σ_{w′∈U(E)} c(w′, D(E)) is the total count of all unigram terms of E in that document, so the first summand is the probability of w in the document with respect to unigram terms only. Similarly, c(w, D(G)) is the count of term w in the data graph and Σ_{w′∈U(E)} c(w′, D(G)) is the total count of the unigram terms of E in the data graph, so the second summand is the probability of w in the corpus D(G) with respect to unigram terms only. The bigram LM is estimated analogously:

P_{B(E)}(w) = λ · c(w, D(E)) / Σ_{w′∈B(E)} c(w′, D(E)) + (1 − λ) · c(w, D(G)) / Σ_{w′∈B(E)} c(w′, D(G))    (3.7)

Table 3.3 shows the final LMs for the entities Silsila and Bachchan. A minimal sketch of the estimation in Equations 3.5-3.7 follows.
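In this sketch the term counts are supplied as collections.Counter objects; how the counts are gathered is our assumption:

    from collections import Counter

    def smoothed(term, doc_counts, corpus_counts, lam=0.8):
        """Eqs. 3.6/3.7: MLE over the entity document, smoothed with the
        corpus-level counts of the same term set."""
        def mle(counts):
            total = sum(counts.values())
            return counts[term] / total if total else 0.0
        return lam * mle(doc_counts) + (1 - lam) * mle(corpus_counts)

    def entity_lm_prob(term, uni_doc, uni_corpus, bi_doc, bi_corpus, mu=0.5):
        """Eq. 3.5: mix the unigram and bigram components with weight mu."""
        return (mu * smoothed(term, uni_doc, uni_corpus) +
                (1 - mu) * smoothed(term, bi_doc, bi_corpus))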

ActedIn
⟨Amitabh Bachchan, ActedIn, Silsila⟩
⟨Rekha Ganesan, ActedIn, Khatron Ke Khiladi⟩
⟨Amitabh Bachchan, ActedIn, Kabhi Kabhie⟩

Table 3.4: Document of Relationship ActedIn

3.2.2 LMs of Relationships

Document of a relationship

We build a document for each relationship in the dataset. The document D of a relationship R is the set of all triples from our dataset in which this relationship occurs:

D(R) = {⟨S, R, O⟩ : ⟨S, R, O⟩ ∈ D(G)}    (3.8)

where S refers to the subject, R refers to the relationship, O refers to the object and D(G) refers to the data graph. There are three types of terms over each relationship R:

• Subject unigrams: S(R) = {S : ⟨S, R, O⟩ ∈ D(G)}
• Object unigrams: O(R) = {O : ⟨S, R, O⟩ ∈ D(G)}
• Bigrams: B(R) = {(S, O) : ⟨S, R, O⟩ ∈ D(G)}

Table 3.4 shows the document of the relationship ActedIn and Table 3.5 shows the terms to be considered for its LM.

ActedIn
Subject unigrams: {Amitabh Bachchan, Rekha Ganesan}
Object unigrams: {Kabhi Kabhie, Khatron Ke Khiladi, Silsila}
Bigrams: {(Amitabh Bachchan, Silsila), (Amitabh Bachchan, Kabhi Kabhie), (Rekha Ganesan, Khatron Ke Khiladi)}

Table 3.5: Terms in the document of relationship ActedIn

Estimating LM of relationships

The LM corresponding to the document D(R) of a relationship R is a mixture model of three LMs: P_{S(R)}, the subject unigram LM of R; P_{O(R)}, the object unigram LM of R; and P_{B(R)}, the bigram LM of R. That is,

P_R(w) = µ_s · P_{S(R)}(w) + µ_o · P_{O(R)}(w) + (1 − µ_s − µ_o) · P_{B(R)}(w)    (3.9)

where µ_s and µ_o control the influence of each component; we gave equal weight to each component in our evaluations. The unigram and bigram LMs are estimated in the same way as the LMs of entities, that is,

P_{S(R)}(w) = λ · c(w, D(R)) / Σ_{w′∈S(R)} c(w′, D(R)) + (1 − λ) · c(w, D(G)) / Σ_{w′∈S(R)} c(w′, D(G))    (3.10)

P_{O(R)}(w) = λ · c(w, D(R)) / Σ_{w′∈O(R)} c(w′, D(R)) + (1 − λ) · c(w, D(G)) / Σ_{w′∈O(R)} c(w′, D(G))    (3.11)

P_{B(R)}(w) = λ · c(w, D(R)) / Σ_{w′∈B(R)} c(w′, D(R)) + (1 − λ) · c(w, D(G)) / Σ_{w′∈B(R)} c(w′, D(G))    (3.12)
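A sketch of the relationship term sets and the three-way mixture of Equation 3.9; the component probabilities ps, po and pb are assumed to be precomputed with the smoothed estimator of Eqs. 3.10-3.12:

    def relationship_terms(R, data_graph):
        """S(R), O(R) and B(R) for a relationship R over (S, P, O) triples."""
        subj = {s for s, p, o in data_graph if p == R}
        obj = {o for s, p, o in data_graph if p == R}
        bi = {(s, o) for s, p, o in data_graph if p == R}
        return subj, obj, bi

    def relationship_lm_prob(ps, po, pb, mu_s=1/3, mu_o=1/3):
        """Eq. 3.9: P_R(w) = mu_s*P_S + mu_o*P_O + (1 - mu_s - mu_o)*P_B."""
        return mu_s * ps + mu_o * po + (1 - mu_s - mu_o) * pb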

3.2.3 LMs of Result Trees

As described earlier, a result tree is a rooted tree having entities as nodes and relationships as edges. The LM of a result tree comprises two kinds of LMs, an entity LM and a relationship LM. We use a mixture model of the component entity LMs to estimate the entity LM of each result tree: we first get the LM vectors of all the entities present in the tree and then mix them term-wise. The entity LM of a tree T containing entities E_1, E_2, ... is given by:

P_{E(T)} = δ_1 · P_{E_1} + δ_2 · P_{E_2} + ...    (3.13)

where δ_1, δ_2, ... refer to the weights of the respective entities; the entity LM of a tree is thus the term-wise weighted sum of the LMs of all the entities present in the tree. We give equal weight to all the entities in our calculations, hence all the δs are equal and sum to 1. The tree's relationship LM is estimated similarly from the LMs of its relationships. A sketch of this term-wise mixture follows.
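In this sketch each entity LM is assumed to be a dict mapping terms to probabilities, with all weights δ equal:

    def tree_entity_lm(entity_lms):
        """Eq. 3.13: term-wise, equally weighted mixture of the entities' LMs."""
        delta = 1.0 / len(entity_lms)
        mixed = {}
        for lm in entity_lms:
            for term, prob in lm.items():
                mixed[term] = mixed.get(term, 0.0) + delta * prob
        return mixed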

3.3 Clustering Result Trees

Once the LMs of the result trees are calculated, the trees can be clustered using any clustering algorithm. We use the JS-divergence as the distance measure.

3.3.1 JS-Divergence between result trees

The divergence between two trees is calculated as the weighted sum of the JS-divergences of the entity and relationship LMs of the trees:

JS(T_1 || T_2) = γ · JS_E(T_1 || T_2) + (1 − γ) · JS_R(T_1 || T_2)    (3.14)

where γ controls the relative weights given to the entity LM and the relationship LM. We gave equal weight to both LMs, i.e., γ = 0.5 in our experiments. A sketch of this computation follows.
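Here the LMs are represented as dicts of term probabilities, and the JS-divergence is computed from its standard definition via the mixture distribution M = (P + Q)/2:

    import math

    def js_divergence(p, q):
        """Jensen-Shannon divergence of two term distributions."""
        m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in set(p) | set(q)}
        def kl(a):  # KL(a || m); m covers a's support, so the log is safe
            return sum(pw * math.log2(pw / m[t]) for t, pw in a.items() if pw > 0)
        return 0.5 * kl(p) + 0.5 * kl(q)

    def tree_divergence(ent1, ent2, rel1, rel2, gamma=0.5):
        """Eq. 3.14 with the entity and relationship LMs of two trees."""
        return (gamma * js_divergence(ent1, ent2) +
                (1 - gamma) * js_divergence(rel1, rel2))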

3.3.2 Clustering Algorithm

We use the agglomerative complete-link clustering technique, which builds a hierarchy of clusters [35]. The agglomerative strategy is a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. Complete-link clustering takes the similarity of two clusters to be the similarity of their most dissimilar members; this is equivalent to choosing the cluster pair whose merge has the smallest diameter, and it tends to create tightly centered clusters. The optimum number of clusters is determined using the Calinski-Harabasz (CH) index. For N trees and K clusters, it is computed as follows:

W(K) = (1/2) · Σ_{k=1..K} Σ_{T_i ∈ k-th cluster} Σ_{T_j ∈ k-th cluster} JS(T_i || T_j)    (3.15)

B(K) = (1/2) · Σ_{k=1..K} Σ_{T_i ∈ k-th cluster} Σ_{T_j ∉ k-th cluster} JS(T_i || T_j)    (3.16)

CH(K) = ((N − K) · B(K)) / ((K − 1) · W(K))    (3.17)

where W(K) is the within-cluster scatter, B(K) is the between-cluster scatter and CH(K) is the CH index value for K clusters. Over a range of values of K, the one with the highest CH index value is chosen as the best clustering. A sketch of this selection procedure follows.
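This sketch assumes a precomputed pairwise JS-divergence matrix D (symmetric, zero diagonal); scipy's complete-link implementation stands in for the LingPipe-based one actually used in our experiments (Chapter 4):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    def ch_index(D, labels, K):
        """Eqs. 3.15-3.17 for a given cluster labelling."""
        N = len(labels)
        pairs = [(i, j) for i in range(N) for j in range(N)]
        W = sum(D[i][j] for i, j in pairs if labels[i] == labels[j]) / 2.0
        B = sum(D[i][j] for i, j in pairs if labels[i] != labels[j]) / 2.0
        return (N - K) * B / ((K - 1) * W) if K > 1 and W > 0 else 0.0

    def best_clustering(D, k_max=15):
        """Cut the complete-link dendrogram at the K with the highest CH index."""
        Z = linkage(squareform(np.asarray(D)), method="complete")
        best = (None, None, -1.0)
        for k in range(2, k_max + 1):
            labels = fcluster(Z, t=k, criterion="maxclust")
            K = len(set(labels))
            score = ch_index(D, labels, K)
            if score > best[2]:
                best = (K, labels, score)
        return best[0], best[1]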

3.3.3 Ranking clusters

The result trees returned by the keyword search implementation are ranked, and these ranks can be used to derive a ranking on the clusters. Four variations of ranking have been used (a sketch follows the list):

1. Best ranked tree based: the best ranked tree is picked from each cluster and the clusters are ranked on the basis of these representatives.

2. Worst ranked tree based: the worst ranked tree is picked from each cluster and the clusters are ranked on the basis of these trees; the cluster whose worst ranked tree is best is ranked higher than the others.

3. Average rank based: the average rank of all the trees in a cluster is computed and the clusters are ranked on the basis of it.

4. Size of the trees based: the largest tree size is computed for each cluster, and the cluster with the largest such tree is ranked lowest.
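A minimal sketch of these four heuristics, assuming each cluster is given as a list of (rank, size) pairs for its trees, with a smaller rank meaning a better result:

    def rank_clusters(clusters, heuristic="best"):
        """Sort clusters so the highest-ranked cluster comes first."""
        keys = {
            "best": lambda c: min(r for r, _ in c),        # best ranked tree
            "worst": lambda c: max(r for r, _ in c),       # worst ranked tree
            "average": lambda c: sum(r for r, _ in c) / len(c),
            "size": lambda c: max(s for _, s in c),        # largest tree last
        }
        return sorted(clusters, key=keys[heuristic])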

Chapter 4

Experiments

In this chapter, we describe our experiments with respect to the following: (a) the performance of our method compared to tree-edit distance based and graph isomorphism based clustering, and (b) the quality of the ranking achieved by the various heuristics.

4.1 Setup

Dataset We used the IMDB dataset available on the IMDB website1, converted to a relational database with the Python imdbpy2sql script2; it has 72925 tuples. When represented as a graph, it has about 2 million nodes and 9 million edges.

Keyword Queries We manually chose 10 queries as our benchmark (see Table 4.1). The number of keywords per query ranges from 2 to 4, and each keyword corresponds to an entity. The queries are formed by mixing different types of entities: for example, Query 1 has two Actors, while Query 7 has a Director, an Actor and a Movie. We constructed these queries manually (rather than generating them automatically) to make sure that the results would be interpretable by users.

Competitors Our baseline is a clustering based on a simple distance measure using tree isomorphism, where two trees are similar only if their structures are identical (no content in the form of node and edge labels is considered). Our other competitor is the tree-edit distance measure [5], which takes into account both the structure and the content of the trees. A more detailed explanation of this measure can be found in Chapter 2.

Methodology and Metrics We generated the set of answer trees for each keyword query using the Bidirectional algorithm [27]. Out of all the trees given as output, we picked the top 25, since the later results became increasingly complex to interpret. In the case of isomorphism, two trees were clustered together iff they were isomorphic to each other.

1 http://www.imdb.com/interfaces
2 http://imdbpy.sourceforge.net/

Table 4.1: Queries used

Sl. No.  Query
1        “Brad Pitt” “Angelina Jolie”
2        “Brad Pitt” “David Fincher”
3        “Christian Bale” “Morgan Freeman” “Clint Eastwood” “Christopher Nolan”
4        “Christopher Nolan” “Michael Caine”
5        “Elijah Wood” “Ian Mckellen” “Christopher Lee”
6        “Fight Club” “Brad Pitt” “Tom Cruise” “Top Gun”
7        “Gore Verbinski” “Johnny Depp” “Pirates of the Caribbean”
8        “Interview with the Vampire” “Brad Pitt” “Tom Cruise”
9        “Martin Scorsese” “Leonardo DiCaprio”
10       “The Fellowship Of Ring” “The Two Towers” “The Return Of The King”

In the case of the tree-edit distance and LM based approaches, we used agglomerative complete-link clustering to construct the dendrogram. The clustering algorithm was implemented in Java using LingPipe3. The optimum number of clusters (K) was determined using the CH index: we considered values of K up to 15 and chose the one with the highest CH index value. The generated clusters were ranked using each of the heuristics: Best, Worst, Average and Largest tree based. We evaluated the following:

1. Clustering Quality: We need to evaluate the performance of our LM-based distance measure against isomorphism and tree-edit distance. To do this, we asked 10 volunteers to evaluate the quality of the clusters generated. We chose the maximum and minimum similarity (distance) pairs within each cluster generated by the three algorithms (only one pair per cluster for isomorphism, since within a cluster all the trees are isomorphic to each other). We also picked the representative of each cluster (the tree with the highest rank within the cluster) and constructed pairs for each combination of two clusters. The pairs so generated were shown to the users, who rated each pair on a scale of 1 (Highly Different) to 5 (Highly Similar) on the basis of whether they thought the interpretations of the results were similar. The pairs were shown in random order and the users were not told whether a pair belonged to the same cluster or to different clusters. The users were instructed to first interpret the results and then judge whether they conveyed similar information. The scores were then used to judge the semantic similarity within clusters and the semantic separation between clusters. The higher the

3 http://alias-i.com/lingpipe/

score for pairs within the same cluster, the better the similarity between the trees within a cluster; and the lower the score for pairs across clusters, the better the separation between the clusters.

2. Cluster Ranking Quality: For the ranking evaluations, the users were shown the representative of each cluster for every query. They were asked to give a score based on the interestingness of the result, given the query, on a scale of 1 (Very Uninteresting) to 5 (Highly Interesting). We then used these scores to compute the Normalized Discounted Cumulative Gain (NDCG) [26] and judged the ranking techniques. In summary, we evaluate semantic similarity within clusters, semantic separation between clusters, and cluster ranking. A sketch of the NDCG computation follows.
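This sketch computes NDCG over the user scores in the order produced by a ranking heuristic; it uses the standard log-discounted form ([26] admits other discount choices):

    import math

    def dcg(scores):
        return sum(s / math.log2(i + 2) for i, s in enumerate(scores))

    def ndcg(ranked_scores):
        """NDCG of user scores (1-5) listed in the heuristic's cluster order."""
        ideal = dcg(sorted(ranked_scores, reverse=True))
        return dcg(ranked_scores) / ideal if ideal > 0 else 0.0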

Chapter 5

Conclusion

In this paper, we have proposed an LM-based clustering technique for the results of keyword search on graph structured data. We have also conducted exhaustive user evaluations of the clusters generated by our technique against two well-known techniques, namely isomorphism and tree-edit distance based clustering. These evaluations verified that our technique groups similar results together and thereby enhances user experience and result understanding, and also reveals interesting patterns. For future work, we would like to design a good technique for generating summary snippets for each cluster, and to find a better cluster representative than simply the best ranked tree.

Bibliography

[1] TreeCluster: Clustering results of keyword search over databases. In Jeffrey Xu Yu, Masaru Kitsuregawa, and Hong Va Leong, editors, Advances in Web-Age Information Management, volume 4016 of Lecture Notes in Computer Science. 2006.

[2] Charu C. Aggarwal and Haixun Wang. A survey of clustering algorithms for graph data. In Managing and Mining Graph Data. 2010.
[3] Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. Diversifying search results. In Proceedings of the Second International Conference on Web Search and Web Data Mining, WSDM 2009, Barcelona, Spain, February 9-11, 2009, 2009.
[4] Sanjay Agrawal, Surajit Chaudhuri, and Gautam Das. DBXplorer: A system for keyword-based search over relational databases. In ICDE, 2002.
[5] Manuel Álvarez, Alberto Pan, Juan Raposo, Fernando Bellas, and Fidel Cacheda. Using clustering and edit distance techniques for automatic web data extraction. In WISE, 2007.
[6] David C. Anastasiu, Byron J. Gao, and David Buttler. A framework for personalized and collaborative clustering of search results. In CIKM, 2011.
[7] Albert Angel and Nick Koudas. Efficient diversity-aware search. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12-16, 2011, 2011.
[8] Ori Ben-Yitzhak, Nadav Golbandi, Nadav Har’El, Ronny Lempel, Andreas Neumann, Shila Ofek-Koifman, Dafna Sheinwald, Eugene J. Shekita, Benjamin Sznajder, and Sivan Yogev. Beyond basic faceted search. In WSDM, 2008.

[9] Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE, 2002.
[10] Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data - the story so far. Int. J. Semantic Web Inf. Syst., 5(3), 2009.
[11] Doron Bustan and Orna Grumberg. Simulation based minimization. In CADE, 2000.

[12] Rose Catherine and S. Sudarshan. Graph clustering for keyword search. In COMAD, 2009.
[13] Mark Dredze, Hanna M. Wallach, Danny Puller, and Fernando Pereira. Generating summary keywords for emails using topics. In IUI, 2008.
[14] Marina Drosou and Evaggelia Pitoura. DisC diversity: result diversification based on dissimilarity and coverage. PVLDB, 6(1), 2012.
[15] Shady Elbassuoni, Maya Ramanath, and Gerhard Weikum. Query relaxation for entity-relationship search. In ESWC (2), 2011.
[16] Raffaella Gentilini, Carla Piazza, and Alberto Policriti. From bisimulation to simulation: Coarsest partition problems. J. Autom. Reasoning, 31(1), 2003.
[17] Orestis Gkorgkas, Kostas Stefanidis, and Kjetil Nørvåg. A framework for grouping and summarizing keyword search results. In ADBIS, 2013.
[18] Lin Guo, Feng Shao, Chavdar Botev, and Jayavel Shanmugasundaram. XRank: Ranked keyword search over XML documents. In SIGMOD Conference, 2003.
[19] Monika Rauch Henzinger, Thomas A. Henzinger, and Peter W. Kopke. Computing simulations on finite and infinite graphs. In FOCS, 1995.
[20] Vagelis Hristidis, Luis Gravano, and Yannis Papakonstantinou. Efficient IR-style keyword search over relational databases. In VLDB, 2003.
[21] Vagelis Hristidis, Nick Koudas, Yannis Papakonstantinou, and Divesh Srivastava. Keyword proximity search in XML trees. IEEE Trans. Knowl. Data Eng., 18(4), 2006.
[22] Vagelis Hristidis and Yannis Papakonstantinou. DISCOVER: Keyword search in relational databases. In VLDB, 2002.
[23] Vagelis Hristidis, Yannis Papakonstantinou, and Andrey Balmin. Keyword proximity search on XML graphs. In ICDE, 2003.
[24] Yu Huang, Ziyang Liu, and Yi Chen. Query biased snippet generation in XML search. In SIGMOD Conference, 2008.
[25] Heasoo Hwang, Vagelis Hristidis, and Yannis Papakonstantinou. ObjectRank: a system for authority-based search on databases. In SIGMOD Conference, 2006.
[26] Kalervo Järvelin and Jaana Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In SIGIR, 2000.
[27] Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan, Rushi Desai, and Hrishikesh Karambelkar. Bidirectional expansion for keyword search on graph databases. In VLDB, 2005.
[28] Philip N. Klein. Computing the edit-distance between unrooted ordered trees. In Algorithms - ESA '98, 6th Annual European Symposium, Venice, Italy, August 24-26, 1998, Proceedings, 1998.

[29] Georgia Koutrika, Zahra Mohammadi Zadeh, and Hector Garcia-Molina. Data clouds: summarizing keyword search results over structured data. In EDBT, 2009.
[30] Byron Yu-Lin Kuo, Thomas Hentrich, Benjamin M. Good, and Mark D. Wilkinson. Tag clouds for summarizing web search results. In WWW, 2007.
[31] Guoliang Li, Beng Chin Ooi, Jianhua Feng, Jianyong Wang, and Lizhu Zhou. EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, 2008.
[32] Ziyang Liu and Yi Chen. Query results ready, now what? IEEE Data Eng. Bull., 33(1), 2010.
[33] Ziyang Liu and Yi Chen. Return specification inference and result clustering for keyword search on XML. ACM Trans. Database Syst., 35(2), 2010.
[34] Ziyang Liu, Sivaramakrishnan Natarajan, and Yi Chen. Query expansion based on clustered results. PVLDB, 4(6), 2011.
[35] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
[36] Saket Navlakha, Rajeev Rastogi, and Nisheeth Shrivastava. Graph summarization with bounded error. In SIGMOD Conference, 2008.
[37] Jay M. Ponte and W. Bruce Croft. A language modeling approach to information retrieval. In SIGIR, 1998.
[38] Sayan Ranu, Minh X. Hoang, and Ambuj K. Singh. Answering top-k representative queries on graph databases. In International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, 2014.
[39] Senjuti Basu Roy, Haidong Wang, Gautam Das, Ullas Nambiar, and Mukesh K. Mohania. Minimum-effort driven dynamic faceted search in structured databases. In CIKM, 2008.
[40] Yuanyuan Tian, Richard A. Hankins, and Jignesh M. Patel. Efficient aggregation for graph summarization. In SIGMOD Conference, 2008.
[41] W3C. SweoIG/TaskForces/CommunityProjects/LinkingOpenData – W3C wiki, 2013. [Online; accessed 18-August-2014].
[42] Yinghui Wu, Shengqi Yang, Mudhakar Srivatsa, Arun Iyengar, and Xifeng Yan. Summarizing answer graphs induced by keyword queries. PVLDB, 6(14), 2013.
[43] Oren Zamir and Oren Etzioni. Grouper: A dynamic clustering interface to web search results. Computer Networks, 31(11-16), 1999.
[44] Ning Zhang, Yuanyuan Tian, and Jignesh M. Patel. Discovery-driven graph summarization. In ICDE, 2010.
