Open Comput. Sci. 2019; 9:103–127

Research Article Open Access

José Devezas* and Sérgio Nunes

Hypergraph-of-entity: A unified representation model for the retrieval of text and knowledge

https://doi.org/10.1515/comp-2019-0006
Received December 5, 2018; accepted April 17, 2019

Abstract: Modern search is heavily powered by knowledge bases, but users still query using keywords or natural language. As search becomes increasingly dependent on the integration of text and knowledge, novel approaches for a unified representation of combined data present the opportunity to unlock new ranking strategies. We have previously proposed the graph-of-entity as a purely graph-based representation and retrieval model, however this model would scale poorly. We tackle the scalability issue by adapting the model so that it can be represented as a hypergraph. This enables a significant reduction of the number of (hyper)edges, in regard to the number of nodes, while nearly capturing the same amount of information. Moreover, such a higher-order data structure presents the ability to capture richer types of relations, including n-ary connections such as synonymy or subsumption. We present the hypergraph-of-entity as the next step in the graph-of-entity model, where we explore a ranking approach based on biased random walks. We evaluate the approaches using a subset of the INEX 2009 Wikipedia Collection. While performance is still below the state of the art, we were, in part, able to achieve a MAP score similar to TF-IDF and greatly improve indexing efficiency over the graph-of-entity.

Keywords: semantic search, hypergraph-based models, collection-based representation, text and knowledge unification

*Corresponding Author: José Devezas: INESC TEC and Faculty of Engineering, University of Porto, Portugal; E-mail: [email protected]
Sérgio Nunes: INESC TEC and Faculty of Engineering, University of Porto, Portugal; E-mail: [email protected]

Open Access. © 2019 José Devezas and Sérgio Nunes, published by De Gruyter. This work is licensed under the Creative Commons Attribution alone 4.0 License.
1 Introduction

Entity-oriented and semantic search are centered around the integration of unstructured data, in the form of text, and structured data, in the form of knowledge. We have frequently used data structures like graphs as a representation that promotes the integration of heterogeneous data. Hypergraphs take it even further, by providing a more expressive data structure that can, at the same time, capture both the relations and the intersections of nodes. We propose that hypergraphs should be used as an alternative data structure for indexing, not only because of their expressiveness — a document might be modeled as a hyperedge with terms and entities, potentially subsuming other relations between entities —, but also because they have the potential to scale better than a graph-based approach — in particular, relations like synonymy or co-occurrence can be modeled with only one hyperedge as opposed to creating a complete subgraph for all synonyms or co-occurring nodes. It is also clearer, from a semantics perspective, to model synonymy or co-occurrence as a single hyperedge. Even visually, a hypergraph can, through transparency, provide further insights regarding intersections and subsumptions [1, Figures 2 and 5].

Let us for instance assume a labeled hyperedge, related_to, which connects four entities mentioned in a document entitled "Cat": Carnivora, Mammal, Felidae and Pet. In subsumption theory, this hyperedge would represent an extension of Carnivora, Mammal and Felidae, a set of entities that could be a part of a hyperedge present for instance in a document entitled "Lion". While a lion is not a pet, it is still related to cat through the remaining three entities, so this information is useful for retrieval. While such an example illustrates the potential of a hypergraph-based model, it only just scratches the surface. Using a hypergraph we can represent n-ary relations — linking more than two nodes (undirected hyperedge) or two sets of nodes (directed hyperedge) —, but also hierarchical relations (hyperedges contained within other hyperedges) and any partial combination of the two. This means we can, for instance, model synonyms as an undirected hyperedge (e.g., {result, consequence, effect, outcome}) and even introduce hypernyms/hyponyms as directed hyperedges (e.g., {cat, lion} → {feline}).
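As a minimal illustration of these two kinds of relations — a sketch under our own naming conventions, not code from the paper — undirected and directed hyperedges could be represented as follows:

# Sketch (hypothetical types, for illustration only): the two hyperedge kinds above.
from dataclasses import dataclass

@dataclass(frozen=True)
class UndirectedHyperedge:
    label: str
    nodes: frozenset            # a single relation over a whole node set

@dataclass(frozen=True)
class DirectedHyperedge:
    label: str
    tail: frozenset             # source node set
    head: frozenset             # target node set

# One hyperedge covers the entire synonym set, instead of a complete subgraph of pairs.
synonyms = UndirectedHyperedge("synonym", frozenset({"result", "consequence", "effect", "outcome"}))
# A directed hyperedge expressing a hypernym/hyponym relation between two node sets.
hypernym = DirectedHyperedge("hypernym", tail=frozenset({"cat", "lion"}), head=frozenset({"feline"}))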

To sustain our argument, let us consider an alternative approach based on a tree, which is a type of directed graph, to represent hierarchical relations. With a basic tree we lose the ability to simultaneously represent hierarchical and n-ary relations. We argue that a bipartite graph or an edge-labeled mixed graph would allow for the representation of both n-ary and hierarchical relations, but, while conceptually it would contain the same information as the hypergraph, it would also be harder to read and use. We wouldn't be able, for instance, to naturally identify intersections or subsumptions. Independently of whether or not we translate the hypergraph to an equivalent graph, at the very least the theoretical modeling power of the hypergraph is clear. Nonetheless, in practice there are also some advantages to using hypergraphs over graphs, for instance the fact that a single hyperedge can store all synonyms at once, requiring a single step to retrieve the synonyms for a single term — O(|V|) for term nodes V. Conversely, the same operation on a graph would require as many steps as the number of synonyms for a single term — O(|V| + |E|) = O(b^d), assuming a breadth-first search approach for term nodes V and synonym edges E or, equivalently, for outdegree b (the branching factor of the graph) and distance d (where d would be the same as the graph diameter).

Another advantage of hypergraphs includes the attempt to more closely model the human cognitive process. When we think, we inherently relate, generalize, particularize or overlap concepts. Most of us also translate natural language (sequences of terms) into concepts (entities), supporting the thought process on language. What we propose to do with the hypergraph-of-entity is to attempt to develop a kind of cognitive search engine (or at least the foundation for one). The way the (hyper)graph is traversed, including the selection of the point of origin, determines the kind of process over the "brain" of the engine. As a result, generalization becomes possible. With only slight adjustments to the search process, we can add support for multiple tasks from entity-oriented search. This includes ad hoc document and entity retrieval — the point of origin might be term nodes from a keyword query —, as well as related entity finding and entity list completion — the point of origin might be one or several example entity nodes; both tasks can also be considered a type of recommendation [2]. In order to avoid a combinatorial explosion while still taking advantage of structural features, we propose that we model each process using random walks over the hypergraph — each step is based on the random selection of an incident hyperedge and the subsequent random selection of one of its nodes. In this work, we focus on the task of ad hoc document retrieval, a process that we implement by modeling documents as hyperedges of terms and entities, and ranking document hyperedges through random walks. If we instead ranked entity nodes using the same strategy, we would have generalized the problem to ad hoc entity retrieval.

In Section 2, we clearly present the challenges and opportunities leading to this work, in particular distinguishing between considerations that impact the future of the model and the actual work presented in this paper. In Section 3, we present relevant work in entity-oriented and semantic search, as well as graph-based and hypergraph-based approaches. In Section 4, we describe the hypergraph-of-entity representation and retrieval model, introducing the base model, as well as three optional index features that can be combined as desired to extend the base model: synonyms, context and weights. We also describe the ranking approach, based on seed node selection and random walks. In Section 5, we present the test collection used for evaluating the ad hoc document retrieval performance, following with a study of rank stability, given our nondeterministic ranking approach, and the performance assessment for six models combining different index features of the hypergraph-of-entity. Finally, in Section 6, we conclude with some final remarks and directions for future work.

2 Problem statement

When answering a user's information need, entity-oriented search reconciles results from unstructured, semi-structured and structured data. This problem is frequently approached by establishing separate tasks, where the information need is solved as a combination of different subsystems. While each subsystem can use information from the other subsystems, they usually have their own central representation and retrieval model. For example, the inverted index is one of the main representation models in ad hoc document retrieval. And while structured information can be integrated into the inverted index to improve retrieval effectiveness, the rich and complex relations from knowledge bases are seldom transposed to the inverted index in an effective manner. For instance, related entities can be represented as text through a description or a profile. This way they can then be indexed in a field of the inverted index and contribute to the ranking function as any other field would. Another approach is to separately query the inverted index and the knowledge base and somehow combine document and entity weights. So, the approach is to either combine the output of two models or to translate one type of data to fit a chosen model and always work in that domain.
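Since ranking is built on this traversal, a minimal sketch may help fix ideas. The following is our own simplification — it assumes the hypergraph is stored as two adjacency maps, which is not necessarily how the authors implemented it — of the walk just described: at each step, pick a random incident hyperedge and then a random node within it.

import random

def random_walk(node, node_to_hyperedges, hyperedge_to_nodes, length):
    """Perform `length` steps from `node`, returning the hyperedges visited along the way."""
    visited = []
    for _ in range(length):
        incident = node_to_hyperedges.get(node)
        if not incident:
            break
        hyperedge = random.choice(list(incident))                   # random incident hyperedge
        visited.append(hyperedge)
        node = random.choice(list(hyperedge_to_nodes[hyperedge]))   # random node within it
    return visited

For ad hoc document retrieval, one could then start many short walks at the query's term (and entity) nodes and rank document hyperedges by how often they are visited; a bias can be introduced by replacing the uniform choices with weighted ones.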

Both approaches represent a missed opportunity to cross-reference units of information from unstructured and structured sources. A similar case can be made for knowledge bases where indexes over triples can be queried through SPARQL, sometimes taking advantage of full-text search to filter fields. There is clearly an opportunity for a joint representation model, with ranking approaches that are generalizable to different units of information, and even to different tasks over those units. In this work, we focus on developing the groundwork for such a model.

2.1 Tackled problems

Although there is already work where unstructured and structured data are combined, models have been overly centered on one or the other type of data, frequently considering one of them as the external signal. It is in this lack of a balanced middle ground that we find the opportunity for a contribution. Our hypothesis is that, by proposing a representation and retrieval model where text and knowledge are seamlessly considered, we will be able to:
1. Jointly represent terms, entities and their relations in a single index;
2. Propose a generalized ranking function for multiple entity-oriented search tasks;
3. Improve overall retrieval effectiveness through the unification of information sources.

In the following sections, we further detail how we tackle items 1 and 2, proposing the hypergraph as the central data structure for representation, along with a general ranking strategy over that data structure.

2.1.1 Joint representation model

Graphs are a proven data structure for modeling heterogeneous data. They have been used to index text documents, as the underlying abstraction of hypertext and for the representation of knowledge bases. Their ubiquity across the relevant areas of entity-oriented search led us to propose a graph-based representation and retrieval model that combines terms, entities and their relations. In particular, we propose to use a weighted mixed hypergraph, since it can simultaneously and clearly express:
1. Undirected n-ary relations (e.g., bag-of-words, sentence, synonymy, context similarity);
2. Directed n-ary relations (e.g., a set of terms pointing to an entity, or a set of entities belonging to a category);
3. Hierarchical relations (e.g., subsumption);
4. Ontological relations (e.g., "Donald Duck" is both a duck and a character in a comic book);
5. Intersections (i.e., overlap is naturally captured by hyperedges and their shared nodes);
6. Uncertainty (e.g., knowing that there is 80% certainty that a set of terms are contextually similar can be translated into a weight in the respective context hyperedge);
7. User preference (e.g., a user rating "Back to the Future" with 5 stars can be translated into a higher weight for the corresponding entity node).

Moreover, hypergraphs enable us to decrease the number of (hyper)edges in relation to the number of nodes, by prioritizing n-ary relations over binary relations. This is an advantage in reducing the complexity and improving retrieval efficiency.

In this work, we propose a hypergraph-based representation model, called hypergraph-of-entity, for the joint indexing of terms, entities and their relations. We explore most of the items that we previously listed, for expressing different relations in our model. In particular, we do not explore hierarchical relations, and we capture entity co-occurrence rather than explicit ontological relations, introducing uncertainty only on the weighted version of the model. This is further detailed in Section 4.1.

2.1.2 Generalized ranking function

Regarding the retrieval model, we propose a generalized ranking function over the hypergraph-of-entity. One of our ongoing goals is to design a function that can be used independently of the unit of information, as well as for multiple different tasks. We suggest this should be done by controlling:
– Input and output, e.g.:
  – Input term nodes to output a ranking of document hyperedges;
  – Input entity nodes to output a ranking of other entity nodes.
– Parameter configuration, e.g.:
  – Longer traversals will be more exploratory;
  – Shorter traversals will be more precise or on-topic.

In particular, the approach we propose is based on:
1. Finding a representation for the query in the hypergraph-of-entity;
2. Ranking nodes and hyperedges based on traversals around those representations;

3. Collecting only the relevant nodes or hyperedges to the last few years there has also been work in graph-based present to the user. approaches for [4, 5], and a growing exploration of unified models [6, 7]. In this work, we propose a generalized ranking function, Hybrid collections containing text, entities and their which, at this stage, we only evaluate for the ad hoc doc- relations are essential to the study of joint representa- ument retrieval task. The model, however, is designed to tion models, in particular when accompanied by relevance easily support tasks like ad hoc entity retrieval, related judgments for multiple tasks. These hybrid collections are entity finding or entity list completion, which we planto also called combined data. Bast et al. [8, Definition 2.3] assess in the future. The following section further details have defined combined data based on two principles: link each retrieval task and presents material to support our — text annotated with (or linked to) entities from a knowl- vision for unification over (hyper)graph-based data struc- edge base (e.g., through named entity recognition and dis- tures. ambiguation, or based on hyperlinks); and mult — com- bined knowledge bases with different naming schemes (e.g., through automatic ontology matching, or based on 3 Reference work a manually curated high-level ontology). In this work, we define a joint representation model that works for combined data. Moreover, we use it as an In this section, we introduce entity-oriented and seman- index data structure for entity-oriented search, designed tic search. We first cover query-dependent and query- to support multiple tasks in this domain. The approach independent evidence, from classical models to entity- we propose eliminates the need for multiple models or oriented models. We then present an overview on the main even the need to change the model depending on the task. retrieval tasks in entity-oriented search, that are the focus We focus on assessing ad hoc document retrieval, propos- of the generalized model we propose in this work. Next, ing a hypergraph-based model to extend text-based re- we discuss the opportunity for unified models, also com- trieval with information from entities and their relations. menting on the need for combined data with associated rel- In order to assess effectiveness, we take advantage of the evance judgments over multiple tasks. Finally, we present INEX 2009 Wikipedia Collection [9], which includes semi- graph-based and hypergraph-based models and their ap- structured data from Wikipedia (text from Wikipedia ar- plications in information retrieval and the representation ticles, annotated with links to related Wikipedia articles, of documents. We close the section with a discussion on which we use as a knowledge base). the usefulness of hypergraphs to model cognitive func- tions in the brain, which we associate with a generalized model for entity-oriented search. 3.1.1 Query-dependent and query-independent evidence 3.1 Entity-oriented and semantic search The goal of search engines is to help users solve their infor- mation needs, which are usually expressed as keyword or Entity-oriented and semantic search branches into mul- natural language queries. Documents or entities are then tiple tasks, where each task takes one or multiple types weighted according to the query and a list of ranked re- of queries — unstructured, semi-structured, structured sults is provided to the user. 
Many of the classic weighting — and returns one or several types of results — docu- functions used in traditional search engines can also be ments, entities (based on relevance, relatedness, etc.). For applied to entity-oriented search. instance, semantic information can be used to improve While in search engines the relevance of a document is ad hoc document retrieval, (where queries are keywords definitely query-dependent, query-independent evidence and results are documents, but knowledge is used to in- can still be used to better discriminate documents. No- form retrieval), or it can be used for entity list comple- table query-dependent approaches include TF-IDF [10, 11], tion (where queries can be keywords or a selection of en- BM25 [12], language models [13] or divergence from ran- tities, either way representing example entities, and re- domness [14]. In Section 3.1.2 we show an example of how sults are entities). With over 80% of queries mentioning the task of ad hoc entity retrieval can be mapped to the task entities [3], entity-oriented search has become a relevant of ad hoc document retrieval, as a way to reuse these ex- problem within information retrieval. While the inverted isting ranking functions. Query-independent approaches, index has remained central in tackling this challenge, in on the other hand, include document priors like document Hypergraph-of-entity Ë 107 length, number of incoming links or URL depth [15, 16], Ad hoc document retrieval as well as the well-known PageRank algorithm [17], along It is frequent for modern search engines to return a rich with social signals based on the number of likes, shares assortment of results, including documents, entity lists or bookmarks in different social media platforms [18], or and information cards, direct answers, etc. Ad hoc docu- even based on specific Facebook emotions [19]. Other work ment retrieval has, however, remained a central task in has also used documents as indicators of expertise [20], the area, improving its effectiveness through semantics. slightly reversing the roles (i.e., documents as entity pri- Entities and their relations can be harnessed to improve ors; see also Fang and Si [21]). Additionally, there has been the traditional process of document retrieval by further- work showing that closely linked documents usually cover ing informing retrieval. Raviv et al. [29] have proposed similar topics [22]. This is also know as the cluster hypoth- such a method, where they have enhanced document re- esis, which has been shown to be true for web-based col- trieval using entity-based language models. Interestingly, lections [23]. It is a necessary condition to be able to model their model accounted for the uncertainty that is inher- relevance using proximity-based traversals over graphs or ent to entity linking, which we are also interested in ex- hypergraphs. The work we describe here takes advantage ploring as part of the general ranking approach over the of links in semi-structured data from Wikipedia, modeling hypergraph-of-entity. They also explored the balance be- the collection as a hypergraph, and retrieval as short ran- tween term-based and entity-based information. In partic- dom walks in that hypergraph. ular, they experimented with cluster-based document re- One of the approaches for ranking over (hyper)graphs trieval, while testing several combinations of term-based is to use random walks based strategies. 
PageRank is an and entity-based language models to induce clusters, as example of a query-independent feature that can model well as document-query similarities. the influence of a page in the web graph. Over time, sev- eral PageRank variants have been proposed and, in par- ticular for entity-oriented search, there are some interest- Ad hoc entity retrieval ing applications worth mentioning. These include ReCon- Ad hoc entity retrieval is one of the fundamental tasks Rank [24], ObjectRank [25], PopRank [26], HubRank [27] in entity-oriented search. It consists of taking a keyword and DatasetRank [28]. Overall, these algorithms aim at pro- query, potentially containing natural language segments, viding better link analysis for the semantic web, by inte- and transforming it into a ranked list of entities — some- grating information from the web graph (or some other times also called objects or “things” [30]. The challenge context-establishing element like “dataset”) with informa- is combining associated textual passages with underlying tion from a knowledge graph (links between objects or knowledge that accompanies the mentioned entities, not entities, usually with different weights for different types only to improve effectiveness, but also to provide novel of relations). The work we present here shares some of ways of harnessing all available information. This is fre- the ideas introduced by these approaches, but proposes quently achieved through virtual documents [3], learning a hypergraph-based model, where context is established to rank [31–33] or the integration of signals separately ob- by hyperedges and ranking is done through simulated ran- tained from an inverted index and a triplestore [34]. Our dom walks. hypothesis is that we are missing out on the opportunity to follow the leads across the boundaries of text, through the space of knowledge and back, as needed, by using sep- 3.1.2 Retrieval tasks arate models that only coalesce during ranking.

In order to propose a generalized model, we must first iden- tify the commonalities between the elements we are at- Related entity finding tempting to model. In entity-oriented search, this means Another central task in entity-oriented search is related en- identifying which elements we should represent, which tity finding. Given a source entity, a target type and thena- elements should be used to query and which elements ture of the relation (e.g., [ bands like Slayer ], where the should be ranked and retrieved. In this section, we pro- source entity is given by Slayer, the target type by bands vide an overview on the retrieval tasks from entity-oriented and the relation by like), find other entities of the given tar- search, in particular describing some of the representation get type that respect the specified relation (e.g., Anthrax, and retrieval models used to tackle each task. Metallica and Kreator are all bands that share a common genre or are like Slayer). Cao et al. [35] proposed a bipartite graph based method for related entity finding, built from 108 Ë José Devezas and Sérgio Nunes the co-occurrence of entities in unstructured and struc- Fang and Si [21] have proposed two unified probabilis- tured lists. They first identified candidate entities ofthe tic models for related entity finding, one of which they also given target type, calculating an initial relevance score. At applied to entity list completion (ELC). While, for this sec- that point, some related but unpopular entities would rank ond task, they ignored the example entities provided in the lower than expected. In order to solve this issue, they used topics, they were still able to reach the best performance, the bipartite graph containing two disjoint sets, one for according to MAP, for the ELC task in TREC 2010 Entity candidate entities (initially scored) and another one for track. Both models considered probabilistic components the instances where candidate entities occurred (the lists). for candidate entity type, expected entity type and type This means that, whenever a candidate entity occurred in matching. Model A contained probabilistic components a given instance, the respective nodes would be linked. In for entity relevance, as well as for source and target en- order to compute the final relevance score, they used a pro- tity co-occurrence, while Model B contained a probabilis- cess analogous to heat diffusion over the graph, in order tic component for entity relevance without considering the to propagate the initial relevance scores until convergence. source entity, as well as an entity prior component. The dif- This process relied on the idea that entities similar to rele- ference was in Model B ignoring the source entity. Experi- vant entities are also relevant (according to the cluster hy- ments, however, showed a better performance for Model A, pothesis for entity-oriented search [36]). This resulted in when compared to Model B, supporting the importance of the boosting of unpopular but related entities using list the source entity in the modeling process. The good over- co-occurrence as an indicator of similarity. This is an inter- all performance of such a holistic probability framework esting approach, regarding what we propose in this work, illustrates the importance of a unified model that is able for two reasons: first, there is a relation between bipartite to capture the complex relations between the units of in- graphs and hypergraphs, since an instance node could be formation. 
represented as a hyperedge of candidate entity nodes in- stead; secondly, the authors propose an alternate method to random walks for propagating weights over the graph 3.1.3 Why a unified model? (heat diffusion). Pound et al. [38] have proposed five query categories for entity-oriented search — entity query, type query, attribute Entity list completion query, relation query, and other keyword query. These, in Another important task in entity-oriented search is entity many cases, can be mapped into specific tasks of entity- list completion. This task is similar to related entity find- oriented search. For instance, an entity query could be ing, but also takes into consideration a given set of ex- solved through ad hoc entity retrieval, while a type or re- ample entities that serve as relevance feedback. The goal lation query might be solved through related entity find- is to rank and retrieve other similar entities (e.g., given ing or entity list completion. Devezas et al. [39] have also Slayer as the source entity, bands as the target type and like shown that it is possible to use a graph to implement multi- as the relation, as well as Anthrax, Sepultura and Metal- ple recommendation tasks that are not unlike some of the lica as the examples, the list could be completed with tasks we have presented in this section. This adds to the entities like Kreator, Megadeth and Lamb of God, which evidence that a joint representation and retrieval model are thrash metal bands like the source and example en- might be viable and generalization might be possible. tities). Bron et al. [37] compared text-based and structure- Furthermore, when using combined data for multiple based approaches for entity list completion, finding that different retrieval tasks, the question of generalization in both approaches were effective, despite returning differ- information retrieval intensifies. Is it possible to devise a ent results. This led them to experiment with linear com- representation model that is capable of integrating hetero- binations of both approaches, as well as a method that geneous information (or, more simply, text, entities and switched between either approach depending on the pre- their relations), as well as an associated retrieval model dicted effectiveness (using example entities as relevance that is able to, with the configuration of some parame- judgment). Their experiments have shown that combin- ters (or through sequential output), return each of the el- ing text and knowledge outperformed either one of the ap- ements of the expected rich results? We already know that proaches when independently used. One question that re- learning to rank [31, 33] is a good method to integrate fea- mains is whether highly hybrid methodologies, that indis- tures from potentially heterogeneous data, but we would criminately take advantage of either terms or entities and need to train a model for each of the tasks (e.g., one to their relations, are able to perform better. predict document relevance based on the query, and an- Hypergraph-of-entity Ë 109 other one to predict related entities based on one or sev- 3.2 Graph-based models eral of the mentioned entities). Training separate models is completely acceptable and it might even improve modular- Graphs are particularly good for integrating data and, ity from an engineering perspective, however the question while they are inherently used to model knowledge (e.g., of generalization remains. Is it possible? 
Moreover, if it is through ontologies), they can also be used to represent possible, what type of gain or unpredicted consequences text [4, 5, 45, 46] and their relations [4, 5, 45, 47, 48]. would arise from such a unified model? Will it result in im- Blanco and Lioma [4] have proposed a graph of terms proved retrieval effectiveness or decreased performance? able to capture context by linking co-occurring terms We explore this idea, beginning with the proposal of a within a window of size N, using either undirected edges, generalized hypergraph-based model, designed to support or directed edges to express grammatical constraints. entity-oriented search tasks, such as ad hoc document re- Rousseau and Vazirgiannis [5] have also explored this trieval when enhanced with structured knowledge. idea by proposing that directed edges should instead be used to link each term at the beginning of the window to its following terms within that window, thus captur- 3.1.4 Test collections ing term dependency instead of grammatical constraints. Both approaches defined a graph of terms based on co- We built our evaluation framework on top of the INEX occurrence, but they did not include any entity-based in- 2009 Wikipedia Collection [9]. This collection, which is fur- formation. More recently, Zhu et al. [49] proposed a natu- ther described in Section 5.1, provides a way to assess the ral language interface to a graph-based bibliographic in- performance of ad hoc document retrieval, through the formation retrieval system. Through named entity recogni- INEX Ad Hoc track [40]. It also provides relevance judg- tion and dependency parsing, they were able to generate ments from INEX XML Entity Ranking track [41], which can a graph query that was capable of correctly interpreting 39 be used for assessing entity ranking, as well as entity list out of 40 natural language queries of varied complexities. completion, in our future experiments. There are, however, Despite some domain-dependent limitations, introduced other test collections that can be considered for the eval- for higher performance, they presented an interesting re- uation of generalized retrieval models in entity-oriented sult: graphs have the potential to significantly aid in query search. These include ClueWeb09¹ with relevance judg- interpretation, thus improving overall retrieval effective- ments from TREC Web track [42] and TREC Entity track [43]; ness. Sindice-2011 Dataset [44] with relevance judgments from On the other side of the spectrum, focusing on knowl- TREC Entity track; and TREC Washington Post Corpus² edge instead of text, Blanco et al. [50] have explored the with relevance judgments from TREC Common Core track³ problems of effectiveness and efficiency for ad hoc entity and TREC News track⁴. However, for the reasons we enu- retrieval over RDF data. Their ranking approach was based merate next, we didn’t use any of these collections for eval- on BM25F, experimenting with three representation mod- uation. ClueWeb09 is a large collection, with web-scale els: (i) an horizontal index, where fields token, property challenges that are out of the scope of this work. Despite and subject respectively stored terms, RDF properties and its recognizable relevance in early entity-oriented search terms from the subject URI; (ii) a vertical index, where research, Sindice-2011 Dataset is no longer easily available each field represented a separate RDF property contain- (we could not find it on the web). 
Furthermore, no asso- ing terms from the respective literals; and (iii) a reduced ciated relevance judgments for ad hoc document retrieval version of the vertical index where fields represented im- were found. TREC Washington Post Corpus is a fairly re- portant, neutral and unimportant values according to their cent dataset, which had not yet been released at the time weight. While this approach enabled search over a knowl- of preparing our experimental framework. However, it is edge base and, to some degree, integrated text and knowl- an appropriate and easily available test collection that we edge, it still missed on the opportunity to capture the im- will consider in the future. plicit relations that we so naturally use as part of our cogni- tive process. We can easily derive new knowledge from text, alternating between potentially incomplete fragments of text and knowledge and following them as leads to our des- tination. 1 http://lemurproject.org/clueweb09/ 2 https://trec.nist.gov/data/wapost/ In this work, not only we attempt to integrate text and 3 https://trec-core.github.io/2018/ knowledge in a single representation model, but we refrain 4 http://trec-news.org/ from prematurely moving into the inverted index based 110 Ë José Devezas and Sérgio Nunes on the extraction of features from the (hyper)graph-based In his last lecture, von Neumann [59] discussed how representation. Despite obvious efficiency issues, we are, the brain can be viewed as a computing machine, thus re- at this stage, purely focused on exploring the potential of inforcing the relevance for cooperation between computer the (hyper)graph as the sole data structure behind the re- science and neuroscience. Interestingly, there is evidence trieval process. Hypergraphs, in particular, have the poten- in cognitive science of the relevance of hypergraphs in tial to take even further what graphs already provide by modeling functional connectivity in the brain [60–63], as modeling not only binary relations, but also n-ary and hier- well as learning and memory [64]. In particular, the work archical relations, while capturing intersections between by Gu et al. [63] in neuroscience has led to three hyperedge groups of nodes. archetypes — stars, bridges and clusters —, two of which we also use in this work — clusters of terms and entities as documents, and bridges established by synonyms and 3.3 Hypergraph-based models context. Hypergraphs have also been proposed as a model for the creation of artificial general intelligence [65], which While hypergraphs have been previously used in informa- would be fundamental for a cognitive search engine. As- tion retrieval, they still don’t play a major role in well- suming that we would be able to effectively represent text known tasks, despite their potential. The most notable and knowledge using a hypergraph, then we might be able work we found was the query hypergraph proposed by to take advantage of both set theory, using metrics like the Bendersky and Croft [51], where vertices represent con- Jaccard index to measure similarities, or random walks in cepts from the query, and edges represent the dependen- hypergraphs [66], where we might use hyperedge weights, cies between subsets of those vertices. Based on the idea but also node weights to control the traversal. 
These are of a factor graph (a bipartite graph or a kind of hyper- ideas that we explore in this work, aiming at understand- graph), they proposed a ranking function to obtain a rel- ing if hypergraphs have the potential to improve retrieval evance score of a document given a query, based on local effectiveness, assuming text and knowledge as heteroge- and global factors, which worked as hyperedge weights. neous but strongly related data, that power the process of Their hypergraph was able to represent higher-order term entity-oriented search. dependencies, therefore modeling dependencies between We have seen that graphs can be used to represent term dependencies, which neither the Markov random field both unstructured text (e.g., graph-of-word) and struc- model or the linear discriminant model were able to do, tured knowledge (e.g., DBpedia). Hypergraphs can go even despite similarities within the ranking functions. They de- further, capturing for instance synonyms as a single hyper- fined two types of hyperedges: local, between individual edge. In previous work [67], we have already explored the concepts and the document, and global, between the en- unification of terms, entities and their relations as a graph, tire set of concepts and the document. Their methodical proposing the graph-of-entity as a representation model approach can be regarded as a fundamental step in sup- for entity-oriented search. In our experiments, we encoun- porting hypergraph-based work in information retrieval. tered both indexing and retrieval challenges, in particu- In this work, we explore multiple types of relations be- lar regarding efficiency. With the hypergraph-based model tween concepts, however our concepts consist not only of we describe here, some of those issues were mitigated. In terms, but also of entities, and we mention weights as op- particular, it was possible to greatly reduce the number posed to factors. of edges in the original graph-of-entity. With only slight Hypergraphs have also been recently used for summa- modifications, we were able to group a larger number of rization [52], as an XML alternative for the semi-structured related nodes through hyperedges. This is yet another ad- representation of text as a graph [53] or to model folk- vantage of hypergraph-based representations, that is fur- sonomies, promoting the serendipitous discovery of new ther detailed in Section 5.4.1. content [54]. Even in 1981, in the area of social network analysis, Seidman [55] had noticed the inability of anthro- pologists and sociologists to study social networks based 4 Hypergraph-of-entity only on dyadic relationships, proposing hypergraphs as a way of better modeling non-dyadic relationships, such as Ad hoc document retrieval is traditionally a text retrieval group membership. Moreover, hypergraphs have already task. Semantic search, however, frequently takes advan- been particularly useful in music recommendation [6, 56– tage of annotated collections, where entities are recog- 58] through unified approaches for modeling heteroge- nized and linked to external knowledge bases to improve neous data or through the use of random walks. Hypergraph-of-entity Ë 111

Wikipedia Article for "Semantic search"

Unique Identifier — doc_id: 11246715

Text Block — "Semantic search seeks to improve search accuracy by understanding the searcher's intent and the contextual meaning of terms as they appear in the searchable dataspace, whether on the Web or within a closed system, to generate more relevant results. [...]" Corresponding to the traditional structure of a text document as indexed in an inverted index, such as Apache Lucene.

Knowledge Block — (Semantic search, related_to, Search Engine Technology); (Semantic search, related_to, Intention); (Semantic search, related_to, Contextual (language use)); (Semantic search, related_to, World Wide Web); [...] A set of triples with information associated with the document. There can be redundancy among different documents. Information can be automatically extracted from the text or hyperlinks in the document, linked to external knowledge bases, etc.

Figure 1: Extended document definition for combined data. Example based on the Wikipedia article about "Semantic search".

document retrieval. In this work, we assume that, like entity annotations, relevant relations are also a part of the annotated document, extending it. Given a document containing a text block of unstructured data, as well as a knowledge block of structured information (i.e., entities and relations that are relevant to the document), our goal is to propose a joint representation model able to provide seamless integration, as well as support for entity-oriented search tasks, from ad hoc document retrieval to related entity finding. A regular document usually contains multiple text fields (e.g., title, content, etc.), which corresponds to the text block in the extended document. However, we also include a knowledge block, in the form of triples, that are usually available as structured data in the original document. The knowledge block can be directly extracted from a semi-structured document (e.g., building triples based on links to other documents), but it might also be obtained from an information extraction pipeline. There is no restriction about the source of the knowledge block, except that it should represent a set of triples related to the document. For example, the triples might represent co-occurring entities in a sentence or paragraph, or statements obtained from a dependency parser, or they could represent external knowledge about identified entities, from an external knowledge base. Figure 1 illustrates such an extended document based on the Wikipedia article for "Semantic search", and it includes a unique identifier, the text block describing the entity, and the knowledge block containing triples based on hyperlinks (i.e., using Wikipedia as the knowledge base).
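Such an extended document can be pictured as a small record with three fields; the following is a hypothetical rendering (field and type names are ours, purely for illustration):

from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]        # (subject, predicate, object)

@dataclass
class ExtendedDocument:
    doc_id: str                      # unique identifier
    text: str                        # text block (unstructured data)
    triples: List[Triple]            # knowledge block (structured data)

doc = ExtendedDocument(
    doc_id="11246715",
    text="Semantic search seeks to improve search accuracy by understanding ...",
    triples=[("Semantic search", "related_to", "Search Engine Technology"),
             ("Semantic search", "related_to", "Intention")],
)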
We propose that a hypergraph would be the ideal data structure to represent a collection of extended documents, effectively capturing the dependencies and higher-order dependencies between terms and entities in relation to the documents. Take for example a document hyperedge created to associate all the elements within a document, including its terms and entities. Through higher-order dependencies we are, for instance, able to capture subsumption, where documents subsume (i.e., are more general than) relations between entities — we might interpret it as "document d1 explains the relations between entities e1, e2 and e3". The hypergraph-based model, detailed in Section 4.1, is able to capture multiple levels of information about the text, the entities and their relations, providing a more unified and insightful view over all available information. Although in this contribution we do not explicitly assess the impact of subsumption or hierarchical relations, but only of n-ary relations based on synonymy and context, we do highlight the ability for the model to capture such complex relations. Moreover, regarding ranking, document relevance scoring is based on biased random walks over the hypergraph, departing from a set of term and entity nodes that represent the query. This is a ranking approach that closely depends on the structure of the hypergraph, making it easier to track the impact that changes to the representation have in retrieval performance. The retrieval model is detailed in Section 4.2.

4.1 Representation

In this section, we introduce the variations of the hypergraph-of-entity representation model. This includes the base model, as well as multiple extensions based on synonyms, context and weights.

4.1.1 Base model

The hypergraph-of-entity is, in many ways, a simplification of the graph-of-entity [67]. In the graph-of-entity, the sequence of terms in a document was captured through term–term edges with a doc_id attribute, while in the hypergraph-of-entity we discarded term dependency in order to be able to model the terms within a document as a single hyperedge. This model is analogous to the bag-of-words, in the sense that term dependency is not captured by hyperedges (sets of nodes). Besides this major difference (one hyperedge per document), there are three other notable differences between the hypergraph-of-entity and the graph-of-entity: (i) each document hyperedge also contains nodes for entities mentioned within the document; (ii) sets of entities can be linked through a related_to hyperedge; and (iii) sets of terms can be related to an entity through a contained_in hyperedge. We use a mixed hypergraph to represent a collection of documents. This means that hyperedges can be directed — from a set of terms to an entity (contained_in) — or undirected — sets of terms and entities (document), and sets of related entities (related_to).

In undirected hypergraphs, a set of nodes is a hyperedge. In directed hypergraphs, a hyperedge (or hyperarc) contains two sets of nodes — the set of source nodes is called tail, while the set of target nodes is called head. In the hypergraph-of-entity, we always have tail sets with cardinality one (for directed contained_in hyperedges) — this characteristic might be useful for defining a tensor representation of the hypergraph. Figure 2 provides a basic illustration of this model, without capturing hyperedge direction. In the figure, pink nodes represent terms and green nodes represent entities. All term and entity nodes are linked by a yellow undirected hyperedge that represents the document as the set of its terms and entities. Entity nodes are linked by green undirected hyperedges, when the entities are related (e.g., through a property in an ontology). Sets of term nodes are linked to an entity by a pink directed hyperedge, whenever the terms are a good representation of the entity (e.g., through substring matching).

Figure 2: Hypergraph-of-entity base model, representing the first sentence of the Wikipedia article for "Semantic Search".

In particular, Figure 2 represents a single document based on the first sentence of the "Semantic Search" Wikipedia article:

    Semantic search seeks to improve search [Search Engine Technology] accuracy by understanding the searcher's intent [Intention] and the contextual [Contextual (language use)] meaning of terms as they appear in the searchable dataspace, whether on the Web [World Wide Web] or within a closed system, to generate more relevant results.

Underlined terms within the text block represent links to other Wikipedia entities (shown in square brackets). This establishes the knowledge block, already depicted in Figure 1. Each term obtained from the tokenization of the text block is represented only once within a document hyperedge, regardless of its frequency within that document. The same happens when multiple links to the same entity are found — the entity is always represented by the same, unique node. A similar structure is found in the documents of INEX 2009 Wikipedia Collection, which we use to evaluate our model (cf. Section 5).

In order to better visualize the differences between this representation and the graph-of-entity, we recommend comparing Figure 2 with the right side of Fig. 1 from Devezas and Nunes [68], which indexes the same document we illustrate here. For that particular instance of the graph-of-entity, we had 22 term nodes and 5 entity nodes (the same number as the hypergraph-of-entity), but we also had 32 edges as opposed to only 7 hyperedges in the model we propose here. Such a significant edge reduction was in part possible because of the loss of term dependencies, when switching to the hypergraph-of-entity from the graph-of-entity.
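The base model can be sketched, under our own simplifications (this is not the authors' implementation), as a routine that turns a document's terms and entities into the three hyperedge types just described — co-occurring entities are linked directly through related_to, which matches the entity co-occurrence capture mentioned in Section 2.1.1, and contained_in is approximated here by splitting the entity name into terms:

def index_document(doc_id, terms, entities, hyperedges):
    # One undirected "document" hyperedge over all term and entity nodes.
    nodes = frozenset(terms) | frozenset(entities)
    hyperedges.append({"type": "document", "doc_id": doc_id, "nodes": nodes})
    # One undirected "related_to" hyperedge over the document's entities.
    if len(entities) > 1:
        hyperedges.append({"type": "related_to", "nodes": frozenset(entities)})
    # One directed "contained_in" hyperedge per entity, from its name terms to the entity.
    for entity in entities:
        name_terms = frozenset(entity.lower().split())
        hyperedges.append({"type": "contained_in",
                           "tail": name_terms, "head": frozenset({entity})})

hyperedges = []
index_document("11246715",
               terms={"semantic", "search", "seeks", "improve", "accuracy"},
               entities={"Search Engine Technology", "Intention"},
               hyperedges=hyperedges)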

4.1.2 Extensions

We provide three types of extensions to the base model — synonyms, context and weights —, which we can combine and reorder in any way we want. Extending the base model with synonymy and contextual relations provides a kind of "organic" query expansion. We usually depend on query expansion to retrieve previously unreachable documents that did not match the user's vocabulary. With the hypergraph-of-entity, this becomes an inherent part of the ranking process, as it simply requires the addition of new hyperedges linking to related terms. On the other hand, introducing node weights enables term and entity boosting, and introducing hyperedge weights enables document boosting and the assignment of certainty to the information represented by the hyperedge, thus constricting the flow of random walks and directing the walker through the most probable paths. In this section, we provide further details on how synonyms, context and weights were obtained and added to the hypergraph-of-entity.

4.1.2.1 Synonyms

We used WordNet 3.0, through JWI, the MIT Java Wordnet Interface⁵, to obtain the synset for the first sense of each term (i.e., the sense that is more frequently used), assuming that the term is a noun. We integrated synonyms into the hypergraph-of-entity by adding missing term nodes (i.e., that were not originally a part of the collection's vocabulary) and linking all terms from the synset using a synonym hyperedge. For example, if a document contained the term "results", we would search WordNet as follows:

    $ wn results -synsn

    Synonyms/Hypernyms (Ordered by Estimated Frequency) of noun result

    4 senses of result

    Sense 1
    consequence, effect, outcome, result, event, issue, upshot
           => phenomenon

    Sense 2
    solution, answer, result, resolution, solvent
           => statement

    Sense 3
    result, resultant, final result, outcome, termination
           => ending, conclusion, finish

    Sense 4
    resultant role, result
           => semantic role, participant role

For this particular case, we would obtain four senses. We would then take the synonyms (i.e., the synset) from Sense 1 and link all the terms using a synonym hyperedge consisting of the following set of terms: "results" (the original term), "consequence", "effect", "outcome", "result", "event", "issue" and "upshot". At the same time, we stored information about the number of senses for each term, as it is useful to compute the weight of the synonym hyperedges.

5 https://projects.csail.mit.edu/jwi/
6 https://radimrehurek.com/gensim/models/word2vec.html
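A rough Python equivalent of this step, using NLTK's WordNet interface as a stand-in for JWI (which the paper used from Java), could look like the following sketch:

# Requires the WordNet corpus: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def add_synonym_hyperedge(term, nodes, hyperedges):
    senses = wn.synsets(term, pos=wn.NOUN)
    if not senses:
        return
    # First (most frequent) noun sense; multiword lemmas use underscores in WordNet.
    synset_terms = {lemma.replace("_", " ") for lemma in senses[0].lemma_names()} | {term}
    nodes.update(synset_terms)                       # add missing term nodes
    hyperedges.append({"type": "synonym",
                       "nodes": frozenset(synset_terms),
                       "senses": len(senses)})       # kept for the synonym weight later on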

4.1.2.2 Context

Two terms were considered contextually similar whenever they were frequently surrounded by similar terms. We used word2vec [69] word embeddings to establish context, based on the implementation provided by Gensim⁶. The model can either be trained with the same collection that is being indexed, or use an external text collection that might be more relevant to impose context within the given domain. Several hyperparameters can be tuned to control word2vec. We extracted word embeddings of size 100, considering moving windows with 5 words and discarding words with a frequency below 2 in the collection. Once we extracted the embeddings for all terms in the collection, we used the cosine similarity to find the k-nearest neighbors, with k = 2, building a similarity network where an edge was created between a term and each of its nearest neighbors, but only when the similarity was above a given threshold (we used 0.5). Throughout this article, we call this the Word2Vec SimNet.

Figure 3: Word2Vec SimNet: "musician" ego network, with a depth of three. Node size is proportional to the betweenness centrality and colors identify clusters of densely connected terms.

Figure 3 shows the neighborhood of "musician" (its context), up to a maximum of three nodes in distance, in the Word2Vec SimNet for the INEX 2009 subset (see Section 5.1). As we can see, even if a query for "guitarist" or "bassist" is issued, documents containing only "musician" can also be considered, although expanding from "guitarist" should result in a higher weight to documents containing "musician" than expanding from "bassist", since "musician" is adjacent to "guitarist", but "bassist" can only reach "musician" through "guitarist". This is the kind of rationale that a graph-based design supports, simultaneously allowing for a better explanation and the promotion of transparency. We integrated this graph-based information into the hypergraph by creating an undirected context hyperedge, linking each term to all of its contextually adjacent terms. Were the user to require an explanation as to why a particular ranking was provided for a given query, we would be able to list the paths traversed from the seed nodes representing the query. We could either do it exhaustively (i.e., list all paths), or based on descriptive statistics, like the number of paths leading to ranked nodes, along with a few examples. Either way, graph-based or hypergraph-based models are easily traceable.

Figure 4: Hypergraph-of-entity model partial view, showing some of the new synonym (red) and context (blue) hyperedges. Term nodes from the original document are displayed with a stronger border stroke.

Figure 4 illustrates the hypergraph-of-entity revision, showing only synonym and context hyperedges, both examples of n-ary relations between multiple term nodes. We also added any missing term nodes that were external to the document, but present in the list of synonyms or contextually similar terms (in the figure, we only included some of the original terms to illustrate the different patterns). All nodes within blue context hyperedges were already a natural part of the hypergraph (i.e., contained in the original collection), since word2vec was trained with the same collection. However, synonyms might be external to the collection, therefore resulting in the addition of new term nodes that are not a part of any document. As we have seen before, both synonyms and contextually similar terms establish bridges between potentially disconnected, but related, documents, increasing the chances of improving recall over the base model. In the figure, nodes that are a part of the document are visually identified by a stronger border. Most of the document term nodes are not synonyms or contextually similar to one another. However, the terms "semantic" and "contextual" are both connected, since "contextual" is one of the top-2 most similar terms to "semantic", according to their word embeddings. Other interesting subhypergraphs include for instance the neighborhood of term "results", that contains appropriate synonyms like "consequence", "result" (the singular) or "effect", but also less clear synonyms like "issue" or "upshot"; contextually, however, we are able to reach both "outcome" and "outcomes", with "outcome" already covered by the synonyms (but not its plural), an indicator that relevant related terms might only be reachable through context.
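The SimNet construction can be sketched with Gensim as follows (parameter names assume Gensim 4.x; this is an illustration, not the exact code used for the paper):

from gensim.models import Word2Vec

def build_simnet(tokenized_docs, k=2, threshold=0.5):
    # Embeddings of size 100, window of 5 words, minimum term frequency of 2.
    model = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, min_count=2)
    simnet = {}
    for term in model.wv.index_to_key:
        # k nearest neighbors by cosine similarity, kept only above the threshold.
        neighbors = [t for t, sim in model.wv.most_similar(term, topn=k) if sim >= threshold]
        if neighbors:
            simnet[term] = neighbors
    return simnet

Each term and its contextually adjacent terms in the SimNet can then be linked by a single undirected context hyperedge, as described above.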

aim seeker of new term nodes that are not a part of any document. upshot searcher design riv As we have seen before, both synonyms and contextually issue effect quester event similar terms establish bridges between potentially discon- results result nected, but related, documents, increasing the chances of conse- systems quence outcome system improving recall over the base model. In the figure, nodes that are a part of the document are visually identified by outcomes scheme a stronger border. Most of the document term nodes are not synonyms or contextually similar to one another. How- Figure 4: Hypergraph-of-entity model partial view, showing some of ever, the terms “semantic” and “contextual” are both con- the new synonym (red) and context (blue) hyperedges. Term nodes from the original document are displayed with a stronger border nected, since “contextual” is one of the top-2 most similar stroke. terms to “semantic”, according to their word embeddings. Other interesting subhypergraphs include for instance the neighborhood of term “results”, that contains appropriate neighbors, but only when the similarity was above a given synonyms like “consequence”, “result” (the singular) or threshold (we used 0.5). Throughout this article, we call “effect”, but also less clear synonyms like “issue” or“up- this the Word2Vec SimNet. Figure 3 shows the neighbor- shot”; contextually, however, we are able to reach both hood of “musician” (its context), up to a maximum of three “outcome” and “outcomes”, with “outcome” already cov- nodes in distance, in the Word2Vec SimNet for the INEX ered by the synonyms (but not its plural), an indicator that 2009 subset (see Section 5.1). As we can see, even if a query relevant related terms might only be reachable through for “guitarist” or “bassist” is issued, documents contain- context. ing only “musician” can also be considered, although ex- panding from “guitarist” should result in a higher weight to documents containing “musician” than expanding from 4.1.2.3 Weights “bassist”, since “musician” is adjacent to “guitarist”, but By default, all nodes and hyperedges were unweighted. As “bassist” can only reach “musician” through “guitarist”. another extension to the index, we assigned probabilis- This is the kind of rationale that a graph-based design tic weights to nodes and hyperedges. We did this for two supports, simultaneously allowing for a better explana- main reasons. First, not all terms or entities (our nodes) tion and the promotion of transparency. We integrated this are equally relevant, from a query-independent perspec- graph-based information into the hypergraph by creating tive; the same happens for related entities, contextual context an undirected hyperedge, linking each term to all terms, or synonyms that depend on word sense (our hy- of its contextually adjacent terms. Were the user to require peredges). Secondly, assigning weights might serve as a an explanation as to why a particular ranking was pro- base for pruning in the future, which we predict might im- vided for a given query, we would be able to list the paths prove overall performance. Regarding effectiveness, con- traversed from the seed nodes representing the query. We stricting available paths is a way of increasing focus in the could either do it exhaustively (i.e., list all paths), or based model and thus of guiding random walks. 
Regarding effi- on descriptive statistics, like the number of paths leading ciency, a lower number of nodes and hyperedges result in a to ranked nodes, along with a few examples. Either way, lesser amount of used memory, but also in a faster conver- graph-based or hypergraph-based models are easily trace- gence of random walk visit probability, thus requiring less able. CPU cycles to reach an optimal result. On the other side, Figure 4 illustrates the hypergraph-of-entity revision, the non-uniform random selection of a node or hyperedge synonym context showing only and hyperedges, both ex- during random walks is more expensive than selecting an n amples of -ary relations between multiple term nodes. incident node or hyperedge uniformly at random, which We also added any missing term nodes that were external means this requires experimentation. to the document, but present in the list of synonyms or The aim of the weights assigned to nodes and hyper- contextually similar terms (in the figure, we only included edges was to provide discriminative power, thus requir- Hypergraph-of-entity Ë 115
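To make the construction of context hyperedges concrete, the following is a minimal sketch of deriving one undirected context hyperedge per term from word embedding similarities. The 0.5 similarity threshold comes from the text; the use of gensim's KeyedVectors, the top-2 neighborhood size and the function name are assumptions for illustration, not the original implementation.

```python
# Sketch: undirected context hyperedges from the Word2Vec SimNet.
from gensim.models import KeyedVectors

SIM_THRESHOLD = 0.5  # minimum similarity accepted for a contextual neighbor (from the text)
TOP_K = 2            # assumed neighborhood size, based on the "top-2" example above

def build_context_hyperedges(vectors: KeyedVectors, vocabulary):
    """Return one undirected context hyperedge per term: the term plus its
    contextually adjacent terms in the similarity network."""
    hyperedges = []
    for term in vocabulary:
        if term not in vectors:
            continue
        neighbors = [other
                     for other, sim in vectors.most_similar(term, topn=TOP_K)
                     if sim >= SIM_THRESHOLD]
        if neighbors:
            # an undirected hyperedge is just a set of term nodes
            hyperedges.append(frozenset([term] + neighbors))
    return hyperedges
```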

Table 1: Hypergraph-of-entity weighting functions.

(a) Nodes.

Node | Weight | Description
term | 2 · S(α · (N − n_t) / n_t) − 1  (always < 1) | We used a variation of the IDF, with a tunable α parameter to control how fast the function decreases. S is the sigmoid function, N is the number of documents in the collection and n_t is the number of documents where a given term t occurs. We used α = N^(−0.75).
entity | Same as term. | In the future, we will experiment with different values of α for terms and entities, in particular alternative exponents to −0.75.

(b) Hyperedges.

Hyperedge | Weight | Description
document | 0.5 | Linking a term or entity simply through document co-occurrence is weak, so we use a constant weight lower than one.
related_to | (1/|E|) · Σ_{v ∈ E} |{u ∈ F : F ∈ 𝓔 \ {E} ∧ v ∈ F}| / |𝓔| | For each entity within the hyperedge, we calculate the fraction of reachable other entities and average all results. 𝓔 is the set of all related_to hyperedges and E ∈ 𝓔 is the specific related_to hyperedge for which we are calculating the weight.
contained_in | 1/|terms| | We want links with fewer terms to be more frequently followed, since the certainty that any term within the hyperedge leads to the entity is higher.
synonym | 1/|senses| | The higher the number of possible senses, the less certain we are about the hyperedge, since we use the first (and most probable) sense according to WordNet.
context | (1/|terms|) · Σ_i (sim(term_k, term_i) − minsim) / minsim | A context is only as good as the average of all similarities between the original term_k and all other term_i. We normalize the weight taking into account the threshold minsim used to create the Word2Vec SimNet.

The aim of the weights assigned to nodes and hyperedges was to provide discriminative power, thus requiring uniform distributions with well dispersed values. In this work, we provide an initial approach to weighting in the hypergraph. Table 1 provides an overview of the probabilistic weighting functions that we propose, based on the characteristics of each individual node and hyperedge type. For this first experiment with a weighted version of the hypergraph-of-entity, we selected weighting functions that we could compute exclusively using information internal to the model. In an attempt to ensure the generalization of the model, we also restricted the weights to probabilities, in order to facilitate the eventual integration of elements from probabilistic information retrieval or language models in the future.

In particular, for the weighting of terms and entities, we used the probabilistic IDF [70], but replaced the log function with the sigmoid function, to ensure that IDF would always range between zero and one. In the sigmoid IDF we provide a parameter α that controls the function's decrease speed. We manually experimented with multiple values for α, finding that the behavior would significantly change for collections of a different dimension. We, therefore, introduced a dependence on a fraction of the collection size N.
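As a concrete reference for this node weighting, the following is a minimal sketch of the sigmoid IDF under the reconstruction w(t) = 2 · S(α · (N − n_t) / n_t) − 1 with α = N^(−0.75); the exact layout of the formula in Table 1a is partly an assumption recovered from the surrounding description, and the function names are illustrative.

```python
# Sketch: sigmoid IDF node weight (probabilistic IDF with log replaced by a sigmoid).
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_idf(n_t: int, n_docs: int, exponent: float = -0.75) -> float:
    """Weight in [0, 1): alpha = N^exponent controls how fast the IDF decreases."""
    alpha = n_docs ** exponent
    return 2.0 * sigmoid(alpha * (n_docs - n_t) / n_t) - 1.0

# Example: a term occurring in 10 out of 2,000 documents.
print(round(sigmoid_idf(10, 2000), 4))
```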

Figure 5: Selecting α for sigmoid IDF, when compared to the probabilistic IDF. (The plot shows IDF as a function of the number of documents, for the probabilistic IDF and for the sigmoid IDF with α = N^−0.5, α = N^−0.75 and α = N^−1.)

Figure 5 illustrates the behavior of the probabilistic IDF when compared to the sigmoid IDF for base N and exponents −0.5, −0.75 and −1. As we can see, using an exponent of −1 results in IDF values always being above 0.5 and a slow decrease behavior. On the other hand, using an exponent of −0.5 will result in a fast decrease, with a large fraction of the collection with an IDF closer to zero. Finally, using −0.75 results in a decrease speed that is closer to the behavior of the probabilistic IDF, assigning a more diverse range of values to different documents in the collection. While we did not specifically tune α to the best approximation to the probabilistic IDF, the value that we selected provides a good enough discriminative power.

4.2 Retrieval

We propose a ranking function, based on random walks, that strongly captures the structural features of the hypergraph-of-entity. We compare this function with two baselines from a traditional Lucene⁷ inverted index: TF-IDF and BM25 (with default parameters k1 = 1.2 and b = 0.75). Both during indexing and querying, text is preprocessed using an analyzer similar to Lucene's StandardAnalyzer, with two main differences: (i) stopwords are selected based on the language-detector library, using the corresponding dictionaries for the detected language as provided by PostgreSQL 9.6, instead of the default set of English stopwords; (ii) tokens with a length inferior to 3 characters are discarded. The ranking function we propose, Random Walk Score, requires the preselection of a set of seed nodes that represent the query. In this section, we describe the seed node selection process, and the Random Walk Score computation approach.

4.2.1 Seed node selection

The seed node selection process can be seen as part of the "organic" process that enables a kind of stochastic semantic tagging of query parts, akin to named entity recognition in queries. Thus, the first step in calculating the Random Walk Score is to map a keyword query to nodes in the hypergraph-of-entity. This process is similar to the graph-of-entity [67], that is, we tokenize the query into unigrams, mapping them to the corresponding term node (if no match exists, the unigram is simply ignored). The term nodes are then expanded to adjacent entity nodes (the seed nodes), which replace them, unless no adjacent entity node exists, resulting in the term node becoming its own seed node. A confidence weight is then calculated for each seed node, measuring the certainty of the node representing the query. See Devezas et al. [67, Section 3.2, Retrieval] for further details.

Ambiguity is not dealt with during seed node selection, but instead during ranking. During seed node selection, we attempt to reach the whole universe of possibilities (i.e., we find the most complete set of candidate entities that might represent the query). During ranking, however, we rely on the overall relations, naturally stored in the hypergraph, for disambiguation. It is not infrequent to do entity linking based on a graph of entities (and sometimes mentions) and their relations [71–73]. What we do here is to use basic substring matching to find a large number of candidates (many times we can have over 1,000 candidate nodes per query). Then, during ranking, while capturing the structure of the hypergraph based on random walks, each candidate will be visited a given number of times, depending on the link density of the neighborhood of each seed node. Seed nodes act as an open representation of the query. Ambiguity is then solved by cross-referencing all available information through paths in the graph that depart from the seed nodes. Since we also include synonyms in the hypergraph, we aren't even required to consider multiple word senses, as these are naturally solved based on the knowledge of the model. This is why we simply use substring matching. Although such a naive approach to term-entity linking can be improved, we argue that, based on the described strategy, this is only one step towards entity linking. Moreover, based on the cited literature, this is a step that makes sense for our model, where we already capture links between entities and terms (i.e., mentions), which might even be weighted with different degrees of certainty.

7 https://lucene.apache.org/
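A minimal sketch of this seed node selection step is given below. The data structures (plain dictionaries standing in for hypergraph adjacency) and the confidence heuristic (query coverage of an entity's terms) are illustrative assumptions; the original confidence weight is defined in Devezas et al. [67].

```python
# Sketch: mapping a keyword query to seed nodes with confidence weights.
def select_seed_nodes(query, vocabulary, term_to_entities, entity_to_terms):
    """vocabulary: set of term nodes; term_to_entities: term -> adjacent entity nodes;
    entity_to_terms: entity -> its term nodes. Returns {seed node: confidence}."""
    query_terms = [t for t in query.lower().split() if t in vocabulary]
    seeds = {}
    for term in query_terms:
        entities = term_to_entities.get(term, set())
        if not entities:
            seeds[term] = 1.0  # no adjacent entity: the term node is its own seed
            continue
        for entity in entities:
            # assumed heuristic: fraction of the entity's terms that appear in the query
            entity_terms = entity_to_terms.get(entity, set())
            overlap = entity_terms & set(query_terms)
            seeds[entity] = max(seeds.get(entity, 0.0),
                                len(overlap) / max(len(entity_terms), 1))
    return seeds

# Toy usage with a one-entity vocabulary.
print(select_seed_nodes("musician guitar",
                        vocabulary={"musician", "guitar"},
                        term_to_entities={"musician": {"Musician"}, "guitar": set()},
                        entity_to_terms={"Musician": {"musician"}}))
```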

4.2.2 Random Walk Score (RWS)

In random walks, steps can be chosen uniformly at random, but we can also establish a bias through weighted hyperedges (which we can also do for graphs) and weighted nodes (used for a random, non-uniform selection of nodes within a hyperedge). We can also vary the length of the walk ℓ ∈ {ℓ1, ℓ2, ..., ℓn}, as well as the number of repeats (or iterations) r ∈ {r1, r2, ..., rm}. In particular, we experimented with the configurations given by ℓ × r ∈ {(ℓ1, r1), (ℓ1, r2), ..., (ℓn, rm)}. The goal was to measure the impact of increasing the walk length, the number of repeats, or both. The length ℓ constricts or liberates the random walker to wander closer or further apart from the concepts that best represent the query (the seed nodes), while the repeats r improve the certainty of the computed ranking. For evaluation, we used ℓ ∈ {2, 3, 4} and r ∈ {10², 10³}.

For each seed node, we launched a number r of random walks of length ℓ, storing the number of visits to each hyperedge. Per seed node, we then normalized the number of visits by dividing by the maximum and then multiplying by the seed node confidence weight. These individual scores were then summed for each hyperedge, obtaining a final document hyperedge score. Random walks respected hyperedge direction, as well as node and hyperedge weights, which introduced bias. In practice, this means that we modeled ad hoc document retrieval as a hyperedge ranking problem using biased random walks for ranking. It also means that a similar strategy can be applied to ad hoc entity retrieval by modeling this task as a node ranking problem instead. This demonstrates how the hypergraph-of-entity is a generalizable model that is easily extensible to other entity-oriented search tasks.
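The following is a minimal sketch of this scoring scheme under simplifying assumptions: traversal is reduced to a generic neighbor function, steps are uniform (the weighted, biased selection of nodes and hyperedges is omitted), and visits are counted only for document hyperedges. The function and structure names are illustrative, not the original implementation.

```python
# Sketch: Random Walk Score over document hyperedges.
import random
from collections import Counter

def random_walk_score(seeds, neighbors, is_document, length=2, repeats=1000):
    """seeds: {node: confidence}; neighbors: node -> list of (next node, hyperedge);
    is_document: hyperedge -> bool. Returns {document hyperedge: score}."""
    scores = Counter()
    for seed, confidence in seeds.items():
        visits = Counter()
        for _ in range(repeats):
            node = seed
            for _ in range(length):
                options = neighbors(node)
                if not options:
                    break
                node, hyperedge = random.choice(options)
                if is_document(hyperedge):
                    visits[hyperedge] += 1
        if visits:
            max_visits = max(visits.values())
            for hyperedge, count in visits.items():
                # per-seed normalization by the maximum, scaled by seed confidence
                scores[hyperedge] += confidence * count / max_visits
    return dict(scores)

# Toy usage: two documents reachable from the seed term "cat".
graph = {"cat": [("Cat", "d1"), ("Felidae", "d1")],
         "Cat": [("cat", "d1")],
         "Felidae": [("cat", "d1"), ("lion", "d2")],
         "lion": [("Felidae", "d2")]}
print(random_walk_score({"cat": 1.0},
                        neighbors=lambda n: graph.get(n, []),
                        is_document=lambda h: h.startswith("d"),
                        length=2, repeats=100))
```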
5 Evaluation

We experimented with multiple variations of the hypergraph-of-entity, resulting in six different models: (i) the base model; (ii) the base model extended with synonym undirected hyperedges; (iii) the base model extended with undirected context hyperedges based on word embedding similarities; (iv) the base model extended with synonyms and then context; (v) the base model extended with context and then synonyms; and (vi) the base model extended with synonyms, context and node and hyperedge weights. Table 2 provides an overview of the tested models, showing which nodes and hyperedges were enabled for each model. In particular, it is relevant to notice that the integration order of synonyms and context matters — if we introduce synonyms and only then context, the term vocabulary might increase and word embeddings will also be computed for the synonym terms; on the other hand, if we introduce context and only then synonyms, the opposite might happen, given the word embeddings model has been trained with an external collection whose term vocabulary does not coincide with that from the original collection (this is not the case in the experiments we present here).

In the remainder of this section, we present the INEX 2009 subset we used for evaluation (Section 5.1), we characterize an instance of the hypergraph-of-entity, with all the extensions, including synonyms, context and weights (Section 5.2), we study rank stability, since Random Walk Score is not deterministic (Section 5.3) and, finally, we assess the performance of the hypergraph-of-entity representation and retrieval model, measuring effectiveness, as well as indexing and querying efficiency (Section 5.4).

5.1 INEX 2009 Wikipedia collection

Schenkel et al. [9] have provided an XML collection of Wikipedia articles, annotated with over 5,800 entity classes from the YAGO ontology. The INEX 2009 Wikipedia Collection contains over 2.6 million articles and requires 50.7 GB of disk space for storage, when uncompressed. The INEX Ad Hoc Track also provided 115 topics from the 2009 occurrence, with 50,725 individual relevance judgments [74], and 107 topics from the 2010 occurrence, with 39,031 individual relevance judgments [40]. Each individual relevance judgment contains the query identifier, the document identifier, the number of relevant characters, the offset of the best entry point (usually the first relevant passage) and offset–length pairs for the relevant passages. For the INEX 2009 and 2010 Ad Hoc tracks, both topics and relevance judgments were produced by participants. In 2010, only a fraction of the topics (52 out of 107) had associated relevance judgments. Furthermore, only 7 topics were judged by more than one individual, with the remaining topics being judged by a single individual. Despite the reduced number of judges per topic, it is important to notice that individual passages (not documents) were the main object of judgement. This means that every considered document was explored in detail, which was somewhat transposed to the relevance judgments, since we know exactly which passages led each judge to assign relevance to the document.

Table 2: Hypergraph-of-entity model overview. Superscript numbers beside the check marks indicate the integration order of a node or hyperedge in the model.

(a) Model nodes. Term nodes can be created based on the document, as well as expanded with synonyms external to the collection and contextually similar terms based on any corpus.

Model | term (Doc.) | term (Syn.) | term (Cont.) | entity
Base Model | ✓ | | | ✓
Syns | ✓ | ✓ | | ✓
Context | ✓ | | ✓ | ✓
Syns+Context | ✓ | ✓¹ | ✓² | ✓
Context+Syns | ✓ | ✓² | ✓¹ | ✓
Syns+Context+Weights | ✓ | ✓¹ | ✓² | ✓

(b) Model hyperedges. Each hyperedge is depicted with either a tuple of sets (directed) or a single set (undirected). Elements that can be repeated are displayed with a subscript and elements that only appear once have no subscript. We use t to represent term nodes and e to represent entity nodes.

Model | document {tn, em} | contained_in ({tn}, {e}) | related_to {em} | synonym {tn} | context {tn} | weight
Base Model | ✓ | ✓ | ✓ | | |
Syns | ✓ | ✓ | ✓ | ✓ | |
Context | ✓ | ✓ | ✓ | | ✓ |
Syns+Context | ✓ | ✓ | ✓ | ✓¹ | ✓² |
Context+Syns | ✓ | ✓ | ✓ | ✓² | ✓¹ |
Syns+Context+Weights | ✓ | ✓ | ✓ | ✓¹ | ✓² | ✓

Due to the lack of memory for indexing the complete INEX 2009 Wikipedia Collection with the hypergraph-of-entity, which was supported on the serialization of in-memory data structures, we were forced to lower the scale to a smaller subset of the INEX 2009 collection. Accordingly, we prepared a sampling method, based on the topics used for relevance assessment in the INEX 2010 Ad Hoc Track. In order to create the subset, we selected all the 52 topics with relevance judgments, filtering out documents that were not mentioned in the relevance judgments and obtaining a collection of 37,788 documents. While this limits comparison with existing approaches based on the same collection, it still enables us to position the hypergraph-of-entity in regard to Lucene TF-IDF and BM25 baselines.

5.2 Hypergraph characterization

In this section, we characterize the hypergraph-of-entity representation for the INEX 2009 subset, indexed using the base model, with undirected document hyperedges, along with synonym, context and weight extensions. We begin by providing overall statistics regarding the number of nodes and hyperedges in the graph. We then analyze the connectivity power of synonym and context hyperedges, that is, their ability to establish new paths between documents. We finish by providing an overview of the weight distributions for different types of nodes and hyperedges.

Regarding disk space, the base model (the smallest index) required a total of 654 MB for a collection of 203 MB (compressed). Out of the 654 MB, 540 MB were used to store the hypergraph, 100 MB to store node metadata and 15 MB to store hyperedge metadata. On the other hand, the base model extended with synonyms, context and weights (the largest index) required a total of 715 MB of space, out of which 582 MB were used to store the hypergraph, 102 MB to store node metadata, 19 MB to store hyperedge metadata, 8 MB to store node weights and 4.5 MB to store hyperedge weights.

Table 3 shows the number of nodes and hyperedges for each type, also discriminating against direction. As we can see, the total number of hyperedges is significantly lower (almost half) than the number of nodes.

Table 3: Number of nodes and hyperedges of the largest index.

(a) Nodes.

Node | Count
term | 1,126,685
entity | 905,163
Total | 2,031,848

(b) Hyperedges.

Hyperedge | Count
contained_in | 784,672
Total directed | 784,672
document | 37,775
related_to | 37,608
synonym | 13,749
context | 268,505
Total undirected | 357,637
Total | 1,142,309

This is the opposite behavior that we had found in the graph-of-entity, which didn't even include synonyms or contextual information. Most of the nodes in the hypergraph are used to represent terms, closely followed by entities. Most of the hyperedges are directed, specifically used to link terms and entities. Out of the undirected hyperedges, most are used to establish context — we might consider increasing the acceptance threshold for contextually similar terms, when building the word2vec similarity network, in order to lower the number of context hyperedges.

Relations of synonymy and contextual similarity were responsible for establishing new connections between documents, which in turn had the potential to improve recall over the base model. We analyzed the base model with synonyms and we found that synonyms established 6,968 new paths between documents, with 219.90 documents linked on average per synonym, with each synonym ranging between 1 and 12,839 linked documents. We did a similar analysis for the base model with context and we found that contextual similarity established 125,333 new paths between documents, with 53.71 documents linked on average through contextual similarity, ranging from 1 to 29,333 linked documents. The significantly higher number of new paths introduced by context, when compared to synonyms, might be explained by the fact that, despite only considering noun synonyms, potentially every word was a candidate for context extraction. On the other hand, we notice that, on average, context established a smaller number of links between documents than synonyms, despite the higher number of paths between each linked document.

Figure 6: Hypergraph-of-entity weight distributions for INEX 2009. (Histograms of weight frequencies for term and entity nodes, and for contained_in, context, related_to and synonym hyperedges.)

Figure 6 illustrates the distribution of node and hyperedge weights. As we can see, the selected weights are generally left skewed, showing a long left tail with most of the values within a range of 0.95 and 1.00 (note that we used a bin width of 0.05). This is less evident for contained_in hyperedges and does not happen for document hyperedges (not shown in the figure), since their weight is constant (0.5). Both contained_in and synonym hyperedge weight distributions have multiple missing ranges of values. This means that, granularity-wise, these weighting functions are not ideal, regarding their discriminative power. In fact, this is also true for the remaining weighting functions.

5.3 Studying rank stability

While methods based on random walks usually converge to a limiting distribution, there is still a nondeterministic nature to this retrieval approach.

Table 4: Measuring the stability of Random Walk Score using Kendall's coefficient of concordance (W), for different parameter configurations.

(a) INEX 2009 subset (52 topics; 37,788 documents).

ℓ | r | W
2 | 10 | 0.8719
2 | 50 | 0.8465
2 | 100 | 0.8450
3 | 10 | 0.8572
3 | 50 | 0.8312
3 | 100 | 0.8327
4 | 10 | 0.8439
4 | 50 | 0.8196
4 | 100 | 0.8224

(b) INEX 2009 smaller subset (3 topics; 2,234 documents).

ℓ | r | W | W′
2 | 100 | 0.7670 | 0.8386
2 | 1000 | 0.7646 | 0.9428
2 | 10000 | 0.9020 | 0.9857
3 | 100 | 0.7356 | 0.8733
3 | 1000 | 0.7881 | 0.9617
3 | 10000 | 0.9124 | 0.9901
4 | 100 | 0.7144 | 0.8957
4 | 1000 | 0.8178 | 0.9698
4 | 10000 | 0.9203 | 0.9930

This means that the probability distribution of visiting a set of nodes, given a departing set of seed nodes, where random walkers start from, will eventually reach similar values for repeated experiments, given a sufficiently large number of iterations r. Measuring the performance of the Random Walk Score only makes sense when in context with a rank stability analysis, through the measurement of rank convergence, for different runs with the same topic and parameter configuration.

We measured rank stability based on Kendall's coefficient of concordance (Kendall's W), using fixed configurations of the Random Walk Score as the ranking function. We repeated the same query multiple times, for a given configuration of ℓ and r, and then normalized each ranking list to ensure that they all contained the same set of documents. Missing documents were added to the end of the list, sorted by doc_id to ensure consistency in the calculation of Kendall's W, for equivalent rankings, with the same set of missing documents.

Table 4a summarizes the results for ℓ ∈ {2, 3, 4} and r ∈ {10, 50, 100}, using the geometric mean⁸ over 100 repeats for each of the 52 topics of the INEX 2009 subset. For such low values of r, we did not find a significant difference in concordance, beyond a slight indication that a higher walk length ℓ tends to lower the concordance W. This is expected, since the longer the length of the walk, the higher the number of available path choices. More importantly, we found that, even for low values of r, we already achieve a concordance of over 80%. Nonaggregated values for Kendall's W, for each topic and parameter configuration, ranged from 0.7547 to 0.9521, with the first quartile already reaching 0.8030. Standard deviations were under 0.0521, showing stability over different topics. In order to better understand the behavior of concordance for higher values of r, we also replicated the experiment for the smaller subset with r ∈ {100, 1000, 10000}. Results, shown in Table 4b, illustrate the overall effect of increasing r — higher values of r result in a higher concordance. Even for low values of r, the results given by the Random Walk Score are already considerably stable, which increases trust that a performance assessment should remain fairly unchanged for different runs with the same parameter configuration. Both tables show the concordance coefficient for r = 100, which is lower for the smaller subset. Given the geometric mean was calculated over only three topics, the influence of a single topic was quite impactful. In particular, we found that topic 2010023 ([retirement age]) resulted in a much lower concordance coefficient, ranging from 0.4544 to 0.7905. Further analysis of the remaining two topics showed that their concordance coefficients were in fact higher than the geometric mean depicts, ranging from 0.8189 to 0.9936. These values were also more in agreement with the experiment for the larger subset, as we can see from the geometric mean W′, calculated after removing topic 2010023. Based on the limited but consistent evidence of this analysis, where an incremental behavior of concordance was found for increasing values of r, we chose r = 10³ as a good compromise that should provide an evaluation reliability of approximately 95%.

8 We used the geometric mean, since it is less sensitive to outliers and always smaller than the arithmetic mean, thus providing a more conservative result. However, for this particular case, the difference between arithmetic and geometric means was negligible.
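For reference, the concordance measurement used in this section can be sketched as follows. The sketch assumes the repeated rankings already share the same document set (missing documents appended and sorted by doc_id, as described above); the helper name is illustrative and no ties correction is applied.

```python
# Sketch: Kendall's coefficient of concordance (W) over repeated rankings.
import numpy as np

def kendalls_w(rankings):
    """rankings: list of m lists, each ordering the same n documents from best
    to worst. Returns W in [0, 1]."""
    m, n = len(rankings), len(rankings[0])
    docs = sorted(rankings[0])
    ranks = np.zeros((m, n))
    for i, ranking in enumerate(rankings):
        position = {doc: r + 1 for r, doc in enumerate(ranking)}  # 1 = best
        ranks[i] = [position[doc] for doc in docs]
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Three repeated runs over four documents.
print(round(kendalls_w([["d1", "d2", "d3", "d4"],
                        ["d1", "d3", "d2", "d4"],
                        ["d2", "d1", "d3", "d4"]]), 4))
```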
5.4 Assessing model performance

In the previous section, we have measured the stability of a ranking approach based on random walks. In this section, we will understand how similar parameter configurations affect the performance of the retrieval model. We will then compare the graph-of-entity with the hypergraph-of-entity, regarding effectiveness and efficiency, but also illustrate the difference in number of nodes and edges, particularly regarding the node-edge ratio.

In order to evaluate retrieval over the hypergraph-of-entity, we used the title of each topic from the INEX 2010 Ad Hoc Track as a search query. We then assessed effectiveness based on whether or not retrieved documents contained relevant passages, according to the provided relevance judgments. In order to measure efficiency, we also collected indexing and search times, so as to understand the cost of using such a hypergraph-based representation, as well as different parameter configurations for the Random Walk Score.

We tested each of the variations presented in Table 2, assessing the effectiveness of the Random Walk Score, using a combination of parameter configurations based on low walk lengths and high walk repeats, according to the intuition that nodes closer to the seeds (and therefore to the query) lead to more relevant documents/entities and that a higher number of repeats leads to convergence and therefore trustworthy results. We obtained the best hypergraph-of-entity MAP for the base model extended with synonyms, contextually similar terms and weights, with ℓ = 2 and r = 10³ (cf. Table 5a) — we verified that increasing values of r suggested an increasing and plateauing performance. None of the hypergraph-of-entity variations were able to surpass the Lucene baselines, reaching MAP values between 0.0811 and 0.0884, when compared to 0.1689 for TF-IDF and 0.3269 for BM25. The best hypergraph-of-entity model according to GMAP, MAP and precision was "Syns+Context+Weights", however the "Base Model" without extensions was able to reach the best results for NDCG@10 and P@10. We carried out a Student's t-test for the 28 pairs of models, comparing average precisions for MAP and individual P@10 values per topic, using a p-value of 0.05. Results showed that the difference in MAP, as well as P@10, was statistically significant for TF-IDF and BM25, as well as for any Lucene baseline and any hypergraph-of-entity model, but not among different versions of our model.

The introduction of weights shows the flexibility of the model, in the sense that it is able to easily support the boosting of terms and entities, as well as the boosting of documents and other relations, in order to assign, for instance, a degree of certainty to each piece of information. Experiments also showed that the higher the walk length ℓ, the worse the retrieval effectiveness. This is, by design, the expected behavior, since the further apart nodes are from the seed nodes (which represent the query), the less related to the query they are and thus the less relevant they are. The best recall for the hypergraph-of-entity was obtained for "Context+Syns" (0.8148), which was close to the baselines (0.8476 for TF-IDF and 0.8598 for BM25). The Geometric Mean Average Precision (GMAP) was included in Table 5a because it is less affected by outliers than MAP, thus providing additional insight. Through the comparison of GMAP and MAP, it becomes evident that a small number of topics are driving MAP up for the hypergraph-of-entity, despite many individual topics resulting in a low average precision — in some cases achieving values as low as zero (e.g., for topic 2010006 on the best "Context" model).

In Table 5b, we find the indexing and search times for the runs with the best MAP per variation. The hypergraph-of-entity took 2.9 times longer to index than Lucene, when comparing "Syns" with "Lucene", as well as between 18.8 times longer to query (best case scenario, for the "Syns" model with ℓ = 2 and r = 10² and Lucene TF-IDF) and 1127 times longer to query (worst case scenario, for "Syns+Context+Weights" with ℓ = 4 and r = 10³ and Lucene BM25 with k1 = 1.2 and b = 0.75). Given the notable difference in efficiency between the weighted and non-weighted versions, it might be a good compromise to use the "Base Model" with ℓ = 2 and r = 10³, which is the most effective model when considering the top 10. Overall, search time was shown to range roughly between 9 and 23 minutes for ℓ = 4 and r = 10³ runs, with MAP scores between 0.06 and 0.08 and a coefficient of concordance around 0.82. However, if we consider ℓ = 2 and a lower value r = 10², search time will drop to a range roughly between 22 seconds and 1 minute, with MAP scores of roughly 0.06 and a coefficient of concordance dropping to around 0.77. This means that we can achieve comparable effectiveness, while significantly increasing efficiency, despite compromising the concordance of multiple similar runs with the same parameter configuration (i.e., the ranking function won't converge).

5.4.1 Comparing graph-of-entity and hypergraph-of-entity

In order to better understand the differences in performance between the graph-of-entity and the hypergraph-of-entity, we were required to further reduce the size of the test collection. In particular, we used a smaller subset of the INEX 2009 Wikipedia Collection, so that we were able to generate the graph-of-entity in a timely manner. Sampling was based on a selection of 10 topics uniformly at random, filtering out documents that were not mentioned in the relevance judgments and obtaining a collection of 7,487 documents (80% smaller than the subset based on 52 topics).

Table 6 compares the effectiveness and efficiency of graph-of-entity and hypergraph-of-entity, using Lucene as a baseline. For the graph-of-entity, we used the Entity Weight (EW) as the ranking function. For the hypergraph-of-entity, we used the base model (i.e., without synonyms, context or weights) and the Random Walk Score (RWS) as the ranking function. As we can see in Table 6a, hypergraph-of-entity is overall more effective than graph-of-entity, except when considering the macro averaged precision (Prec.).
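The significance testing referred to above (a paired Student's t-test over per-topic scores at p < 0.05) can be sketched as follows; the per-topic score lists are placeholders and the helper name is illustrative, with scipy's ttest_rel providing the paired test.

```python
# Sketch: paired Student's t-test over per-topic average precision (or P@10).
from scipy.stats import ttest_rel

def significantly_different(scores_a, scores_b, alpha=0.05):
    """scores_a, scores_b: per-topic scores for the same topics, in the same order."""
    statistic, p_value = ttest_rel(scores_a, scores_b)
    return p_value < alpha, p_value

# Toy example with five topics.
different, p = significantly_different([0.10, 0.20, 0.15, 0.08, 0.12],
                                       [0.05, 0.18, 0.10, 0.02, 0.07])
print(different, round(p, 4))
```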

Table 5: Best overall parameter configuration according to the mean average precision.

(a) Effectiveness (highest values for Lucene and Hypergraph-of-Entity in bold; differences in MAP are not statistically significant, except between the Lucene baselines and the hypergraph-of-entity indexes).

Index | Ranking | GMAP | MAP | Prec. | Rec. | NDCG@10 | P@10
Lucene | TF-IDF | 0.1345 | 0.1689 | 0.0650 | 0.8476 | 0.2291 | 0.2346
Lucene | BM25 | 0.2740 | 0.3269 | 0.0647 | 0.8598 | 0.5607 | 0.5250

Hypergraph-of-Entity: Random Walk Score (ℓ = 2, r = 10³)
Base Model | RWS | 0.0285 | 0.0864 | 0.0219 | 0.8003 | 0.1413 | 0.1269
Syns | RWS | 0.0281 | 0.0840 | 0.0225 | 0.8099 | 0.1301 | 0.1231
Context | RWS | 0.0134 | 0.0811 | 0.0220 | 0.8027 | 0.1218 | 0.1192
Syns+Context | RWS | 0.0299 | 0.0837 | 0.0236 | 0.8069 | 0.1310 | 0.1231
Context+Syns | RWS | 0.0296 | 0.0814 | 0.0242 | 0.8148 | 0.1256 | 0.1250
Syns+Context+Weights | RWS | 0.0313 | 0.0884 | 0.0274 | 0.8059 | 0.1256 | 0.1154

(b) Efficiency (lowest times for Lucene and Hypergraph-of-Entity in bold).

Index | Ranking | Indexing Time (Avg./Doc) | Indexing Time (Total) | Search Time (Avg./Query) | Search Time (Total)
Lucene | TF-IDF | 2.16ms | 1m 21s 382ms | 1s 148ms | 59s 698ms
Lucene | BM25 | 2.16ms | 1m 21s 382ms | 1s 220ms | 1m 03s 461ms

Hypergraph-of-Entity: Random Walk Score (ℓ = 2, r = 10³)
Base Model | RWS | 6.52ms | 4m 05s 612ms | 3m 22s 826ms | 2h 55m 47s
Syns | RWS | 6.22ms | 3m 54s 587ms | 3m 31s 038ms | 3h 02m 54s
Context | RWS | 6.35ms | 3m 59s 446ms | 3m 35s 623ms | 3h 06m 52s
Syns+Context | RWS | 6.29ms | 3m 57s 264ms | 3m 33s 000ms | 3h 04m 36s
Context+Syns | RWS | 6.33ms | 3m 58s 659ms | 3m 36s 487ms | 3h 07m 37s
Syns+Context+Weights | RWS | 6.52ms | 4m 05s 984ms | 10m 55s 590ms | 9h 28m 11s

As shown in Table 6b, hypergraph-of-entity is also considerably more efficient than graph-of-entity, taking only 53s 992ms to index when compared to 1h 38m for graph-of-entity. When analyzing search time for the hypergraph-of-entity, we can see that there is a trade-off between effectiveness and efficiency that can be controlled through parameter r. For higher values, like r = 10⁴, we reach a MAP score of 0.1689 but search time is a lot higher than the graph-of-entity (13m 4s when compared to 21s 557ms). On the other hand, for lower values, like r = 10¹ or even r = 10², where we reach MAP scores of 0.0485 and 0.1118 respectively, search time is lower than the graph-of-entity (943ms and 11s 134ms when compared to 21s 557ms). Additionally, by lowering the value of r, we also lower the rank stability, but even for r ∈ {10, 50, 100} we were able to achieve coefficients of concordance of around 0.85 (cf. Table 4a), which might be an acceptable compromise efficiency-wise. The gain in indexing speed is particularly influenced by the growth in the number of (hyper)edges when compared to the number of nodes. While the graph-of-entity has 10 times more edges than nodes, the hypergraph-of-entity has 2.4 times fewer edges than nodes. We also carried out a Student's t-test for the 21 pairs of models, comparing average precisions for MAP and individual P@10 values per topic, using a p-value of 0.05. Results showed that the difference in MAP was statistically significant for TF-IDF and BM25, for BM25 and any hypergraph-of-entity model (except for r = 10), and for graph-of-entity and the Lucene baselines. When considering P@10, behavior was similar, except for TF-IDF and any hypergraph-of-entity model, where the difference in P@10 was not statistically significant.

Table 6: Graph-of-entity (GoE) vs hypergraph-of-entity (HGoE) with ℓ = 2.

(a) Effectiveness (highest values for Lucene and graph-based models in bold).

Index | Ranking | GMAP | MAP | Prec. | Rec. | NDCG@10 | P@10
Lucene | TF-IDF | 0.1540 | 0.1710 | 0.1389 | 0.8007 | 0.2671 | 0.2800
Lucene | BM25 | 0.2802 | 0.2963 | 0.1396 | 0.8241 | 0.5549 | 0.5000
GoE | EW | 0.0003 | 0.0399 | 0.1771 | 0.2233 | 0.1480 | 0.1500
HGoE | RWS (r = 10¹) | 0.0000 | 0.0485 | 0.0734 | 0.3085 | 0.1229 | 0.1200
HGoE | RWS (r = 10²) | 0.0546 | 0.1118 | 0.0342 | 0.7554 | 0.1474 | 0.1500
HGoE | RWS (r = 10³) | 0.1017 | 0.1492 | 0.0199 | 0.9122 | 0.2074 | 0.2200
HGoE | RWS (r = 10⁴) | 0.1224 | 0.1689 | 0.0167 | 0.9922 | 0.1699 | 0.1700

(b) Efficiency (lowest times for Lucene and graph-based models in bold).

Index | Ranking | Indexing Time (Total) | Search Time (Avg./Query) | Nodes | Edges
Lucene | TF-IDF | 27s 769ms | 209ms | N/A | N/A
Lucene | BM25 | 27s 769ms | 316ms | N/A | N/A
GoE | EW | 1h 38m | 21s 557ms | 981,647 | 9,942,647
HGoE | RWS (r = 10¹) | 53s 922ms | 943ms | 607,213 | 253,154
HGoE | RWS (r = 10²) | 53s 922ms | 11s 134ms | 607,213 | 253,154
HGoE | RWS (r = 10³) | 53s 922ms | 1m 17s 540ms | 607,213 | 253,154
HGoE | RWS (r = 10⁴) | 53s 922ms | 13m 04s 057ms | 607,213 | 253,154

6 Conclusion

We have proposed a unified representation model for the representation and retrieval of text and knowledge, assessing it in regard to the task of ad hoc document retrieval. We have provided a comprehensive survey of entity-oriented and semantic retrieval tasks, proposing a joint representation for text and knowledge, capable of supporting multiple retrieval tasks, without the need to change the model. We have used the hypergraph data structure as an alternative solution for capturing higher-order dependencies in documents, entities and their relations. We presented and assessed the performance of a base model, as well as multiple combinable extensions, using synonyms provided by WordNet, context provided by word2vec word embedding similarity, and node and hyperedge weighting functions. We proposed the Random Walk Score as a method for relevance scoring and as a retrieval model that closely depends on the structure of the hypergraph, thus providing the flexibility to change and improve the representation model without the need to repeatedly revise the ranking function. Finally, we evaluated several aspects of the model, characterizing the obtained hypergraph, studying rank stability and identifying the parameter configurations that best ensure the concordance of repeated queries with the same configuration. In some cases, we obtained MAP scores comparable to Lucene TF-IDF, while capturing and integrating heterogeneous information in a generalized representation model that provides explicit semantics and extreme flexibility in the definition of n-ary relations, such as synonyms and context, as well as subsumption or hierarchical relations. We also showed that the hypergraph-of-entity is significantly more efficient than the graph-of-entity in indexing time and that it can also be configured, through parameter r, for faster search times with only a small penalty in effectiveness. One of the more evident limitations of the model is the lack of consideration for document length. Although verbosity is mitigated (term repetitions are not considered), vocabulary diversity in long documents that cover multiple topics is still a problem (a kind of pivoted document length normalization is required). Despite its performance limitations, particularly when compared to state-of-the-art approaches, hypergraph-based representations have the potential to more naturally model our cognition process, unlocking increasingly intelligent information retrieval systems as we study and approach the brain.

6.1 Future work

There are several pending improvements over the hypergraph-of-entity, even regarding basic tasks, such as plural removal, stemming or lemmatization. At this stage, however, we used an inclusive policy to avoid discarding potentially relevant information. Therefore, in the future, we will focus on ways to decrease the size of the model. For instance, we would like to measure the impact of pruning nodes and hyperedges from the hypergraph, independently for each type of node and hyperedge, as well as based on different weight thresholds. Another alternative would be to prune similar hyperedges (e.g., based on the Jaccard index). The goal is to understand how much we can improve efficiency until performance starts to decrease. In the same line, we would also like to explore alternative approaches to generating the context similarity network, using different word embedding strategies, as well as avoiding non-optimal algorithms, like k-nearest neighbors, to obtain the top similar terms. An alternative would easily be the usage of pivots for an approximated measurement of similarity [75].

Apart from reducing the model by pruning redundancies, we would, on the other hand, like to extend it with synonyms for verbs, adjectives and adverbs, measuring the impact in effectiveness, and understanding whether the usage of synsets for nouns had been sufficient. Another interesting idea, that has been proven to improve query understanding [76], is the usage of dependency parsing. It would be interesting to extract term dependencies from the documents in a collection, building a dependency graph and integrating these relations into the hypergraph (like we did for the word2vec similarity network). The idea is that it might indirectly improve query understanding, even for simple keyword queries, and thus positively impact the overall retrieval effectiveness.

While we have focused on improving the efficiency of the graph-of-entity by defining a new hypergraph-of-entity model, there are still scalability issues to be tackled. In particular, we would like to assess how the model scales over datasets like the complete INEX 2009 Wikipedia Collection or even the DBLP co-authorship network. We predict that, as the size of the collection increases, efficiency problems will become more prominent and we think this can be mitigated with different approaches to the computation of random walks, for instance based on fingerprinting, as described by Fogaras et al. [77] or Chakrabarti [27].

While we have proposed the usefulness of a hypergraph-based model to capture subsumption and hierarchical relations, we haven't properly assessed the impact that such a decision has in retrieval effectiveness. This is something that we will focus on in the future. Additionally, in order to assess the generality of the model, we intend to also implement and measure the effectiveness of other tasks from entity-oriented search, including ad hoc entity retrieval, entity list completion and related entity finding, over a common representation model. Finally, regarding node and hyperedge weighting functions, there are still many open questions that we aim to answer. In particular, it is not clear what the best approach to weighting is, whether weights can be learned automatically and whether such weighting models should be dependent on the target domain or query intent. In the future, we will also explore these questions, in particular in conjunction with the tasks of entity list completion and related entity finding, which always provide a target type for querying.

Acknowledgements: José Devezas is supported by research grant PD/BD/128160/2016, provided by the Portuguese national funding agency for science, research and technology, Fundação para a Ciência e a Tecnologia (FCT), within the scope of Operational Program Human Capital (POCH), supported by the European Social Fund and by national funds from MCTES.

We would like to thank our colleagues at FEUP InfoLab, João Rocha for suggesting a switch to memory-based representations and Joana Rodrigues for patiently listening to rambles about hypergraphs.

References

[1] Gomes F., Devezas J., Figueira Á., Temporal visualization of a multidimensional network of news clips, In: Advances in Information Systems and Technologies, Springer, 2013, 157–166
[2] Belkin N. J., Croft W. B., Information filtering and information retrieval: Two sides of the same coin?, In: Communications of the ACM, 1992, 35(12), 29–38
[3] Bautin M., Skiena S., Concordance-based entity-oriented search, In: IEEE/WIC/ACM Conference on Web Intelligence (WI'07), 2007, 2–5
[4] Blanco R., Lioma C., Graph-based term weighting for information retrieval, In: Information Retrieval, 2012, 15(1), 54–92
[5] Rousseau F., Vazirgiannis M., Graph-of-word and TW-IDF: New approach to ad hoc IR, In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, ACM, 2013, 59–68
[6] Bu J., Tan S., Chen C., Wang C., Wu H., Zhang L., He X., Music recommendation by unified hypergraph: Combining social media information and music content, In: Proceedings of the 18th International Conference on Multimedia, Firenze, Italy, October 25-29, 2010, 391–400

[7] Xiong C., Callan J., Liu T., Word-entity duet representations for and Development in Information Retrieval, Athens, Greece, July document ranking, In: Proceedings of the 40th International ACM 24-28, 2000, 272–279 SIGIR Conference on Research and Development in Information [23] Raiber F., Kurland O., Exploring the cluster hypothesis, and Retrieval, Shinjuku, Tokyo, Japan, August 7-11, 2017, 763–772 cluster-based retrieval, over the web, In: Proceedings of the 21st [8] Bast H., Buchhold B., Haussmann E., Semantic search on text and ACM International Conference on Information and Knowledge knowledge bases, In: Foundations and Trends® in Information Management, Maui, HI, USA, October 29 - November 02, 2012, Retrieval, 2016, 10(2-3), 119–271 2507–2510 [9] Schenkel R., Suchanek F. M., Kasneci G., YAWN: A semantically [24] Hogan A., Harth A., Decker S., ReConRank: A scalable ranking annotated Wikipedia XML corpus, In: Datenbanksysteme in Busi- method for semantic web data with context, In: Proceedings ness, Technologie und Web (BTW 2007), 12. Fachtagung des GI- of Second International Workshop on Scalable Semantic Web Fachbereichs "Datenbanken und Informationssysteme" (DBIS), Knowledge Base Systems (SSWS 2006), in conjunction with In- Proceedings, 7.-9. März 2007, Aachen, Germany, 2007, 277–291 ternational Semantic Web Conference (ISWC 2006), 2006 [10] Luhn H. P., A statistical approach to mechanized encoding and [25] Balmin A., Hristidis V., Papakonstantinou Y., ObjectRank: searching of literary information, In: IBM Journal of Research and Authority-based keyword search in databases, In: (e)Proceedings Development, 1957, 1(4), 309–317 of the Thirtieth International Conference on Very Large Data [11] Sparck Jones K., A statistical interpretation of term specificity Bases, Toronto, Canada, August 31 - September 3, 2004, 564–575 and its application in retrieval, In: Journal of Documentation, [26] Nie Z., Zhang Y., Wen J., Ma W., Object-level ranking: Bringing 1972, 28(1), 11–21 order to web objects, In: Proceedings of the 14th International [12] Robertson S. E., Walker S., Jones S., Hancock-Beaulieu M., Gat- Conference on World Wide Web, Chiba, Japan, May 10-14, 2005, ford M., Okapi at TREC-3, In: Proceedings of The Third Text Re- 567–574 trieval Conference, TREC 1994, Gaithersburg, Maryland, USA, [27] Chakrabarti S., Dynamic personalized PageRank in entity- November 2-4, 1994, 109–126 relation graphs, In: Proceedings of the 16th International Con- [13] Ponte J. M., Croft W. B., A language modeling approach to in- ference on World Wide Web, Banff, Alberta, Canada, May 8-12, formation retrieval, In: Proceedings of the 21st Annual Interna- 2007, 571–580 tional ACM SIGIR Conference on Research and Development in [28] Delbru R., Toupikov N., Catasta M., Tummarello G., Decker S., Information Retrieval, Melbourne, Australia, August 24-28 1998, Hierarchical link analysis for ranking web data, In: The Semantic 275–281 Web: Research and Applications, 7th Extended Semantic Web [14] Amati G., van Rijsbergen C. 
J., Probabilistic models of informa- Conference, ESWC 2010, Heraklion, Crete, Greece, May 30 – June tion retrieval based on measuring the divergence from random- 3, 2010, Proceedings, Part II, 2010, 225–239 ness, In: ACM Transactions on Information Systems, 2002, 20(4), [29] Raviv H., Kurland O., Carmel D., Document retrieval using entity- 357–389 based language models, In: Proceedings of the 39th Interna- [15] Kraaij W., Westerveld T., Hiemstra D., The importance of prior tional ACM SIGIR conference on Research and Development in probabilities for entry page search, In: Proceedings of the 25th Information Retrieval, Pisa, Italy, July 17-21, 2016, 65–74 Annual International ACM SIGIR Conference on Research and [30] Neumayer R., Balog K., Nřrvĺg K., On the modeling of entities Development in Information Retrieval, Tampere, Finland, August for ad-hoc entity search in the web of data, In: Advances in In- 11-15, 2002, 27–34 formation Retrieval - 34th European Conference on IR Research, [16] Westerveld T., KraaijW., Hiemstra D., Retrieving web pages using ECIR 2012, Barcelona, Spain, April 1-5, 2012, Proceedings, 2012, content, links, URLs and anchors, In: Proceedings of The Tenth 133–145 Text Retrieval Conference, TREC 2001, Gaithersburg, Maryland, [31] Lin B., Rosa K. D., Shah R., Agarwal N., LADS: Rapid develop- USA, November 13-16, 2001 ment of a learning-to-rank based related entity finding system [17] Brin S., Page L., The anatomy of a large-scale hypertextual web using open advancement, In: The First International Workshop search engine, In: Computer Networks, 1998, 30(1-7), 107–117 on Entity-Oriented Search (EOS), 2011 [18] Badache I., Boughanem M., A priori relevance based on qual- [32] Schuhmacher M., Dietz L., Ponzetto S. P., Ranking entities for ity and diversity of social signals, In: Proceedings of the 38th web queries through text and knowledge, In: Proceedings of the International ACM SIGIR Conference on Research and Develop- 24th ACM International Conference on Information and Knowl- ment in Information Retrieval, Santiago, Chile, August 9-13, 2015, edge Management, Melbourne, VIC, Australia, October 19-23, 731–734 2015, 1461–1470 [19] Badache I., Boughanem M., Emotional social signals for search [33] Chen J., Xiong C., Callan J., An empirical study of learning to rank ranking, In: Proceedings of the 40th International ACM SIGIR Con- for entity search, In: Proceedings of the 39th International ACM ference on Research and Development in Information Retrieval, SIGIR Conference on Research and Development in Information Shinjuku, Tokyo, Japan, August 7-11, 2017, 1053–1056 Retrieval, Pisa, Italy, July 17-21, 2016, 737–740 [20] Macdonald C., Ounis I., Voting for candidates: Adapting data [34] Tonon A., Demartini G., Cudré-Mauroux P., Combining inverted fusion techniques for an expert search task, In: Proceedings indices and structured search for ad-hoc object retrieval, In: Pro- of the 15th ACM International Conference on Information and ceedings of the 35th International ACM SIGIR Conference on Knowledge Management, Arlington, Virginia, USA, November Research and Development in Information Retrieval, Portland, 6-11, 2006, 387–396 OR, USA, August 12-16, 2012, 125–134 [21] Fang Y., Si L., Related entity finding by unified probabilistic mod- [35] Cao L., Guo J., Cheng X., Bipartite graph based entity ranking for els, In: World Wide Web, 2015, 18(3), 521–543 related entity finding, In: Proceedings of the 2011 IEEE/WIC/ACM [22] Davison B. 
D., Topical locality in the web, In: Proceedings of the International Conference on Web Intelligence, Campus Scien- 23rd Annual International ACM SIGIR Conference on Research tifique de la Doua, Lyon, France, August 22-27, 2011, 2011, 126 Ë José Devezas and Sérgio Nunes

130–137 Proceedings, Part I, 2011, 83–97 [36] Raviv H., Kurland O., Carmel D., The cluster hypothesis for entity [51] Bendersky M., Croft W. B., Modeling higher-order term depen- oriented search, in: Proceedings of the 36th International ACM dencies in information retrieval using query hypergraphs, In: SIGIR Conference on Research and Development in Information Proceedings of the 35th International ACM SIGIR Conference on Retrieval, 2013, 841–844 Research and Development in Information Retrieval, Portland, [37] Bron M., Balog K., de Rijke M., Example based entity search in Oregon, USA, 2012, 941–950 the web of data, In: Advances in Information Retrieval – 35th Eu- [52] Xiong S., Ji D., Query-focused multi-document summarization ropean Conference on Information Retrieval, ECIR 2013, Moscow, using hypergraph-based ranking, In: Information Processing & Russia, March 24-27, 2013, Proceedings, 2013, 392–403 Management, 2016, 52(4), 670–681 [38] Pound J., Mika P., Zaragoza H., Ad-hoc object retrieval in the web [53] Haentjens Dekker R., Birnbaum D. J., It’s more than just overlap: of data, In: Proceedings of the 19th International Conference on Text As Graph, In: Proceedings of Balisage: The Markup Confer- World Wide Web, ACM, 2010, 771–780 ence 2017, 19, 2017 [39] Devezas J., Coelho F., Nunes S., Ribeiro C., Music Discovery: [54] Cattuto C., Schmitz C., Baldassarri A., Servedio V. D. P., Loreto Exploiting TF-IDF to boost results in the long tail of the tags dis- V., Hotho A., Grahl M., Stumme G., Network properties of folk- tribution, 2013 sonomies, In: AI Communications, 2007, 20(4), 245–262 [40] Arvola P., Geva S., Kamps J., Schenkel R., Trotman A., Vainio J., [55] Seidman S. B., Structures induced by collections of subsets: A Overview of the INEX 2010 ad hoc track, In: Comparative Evalu- hypergraph approach, In: Mathematical Social Sciences, 1981, ation of Focused Retrieval - 9th International Workshop of the 1(4), 381–396 Inititative for the Evaluation of XML Retrieval, INEX 2010, Vugh, [56] Tan S., Bu J., Chen C., Xu B., Wang C., He X., Using rich social The Netherlands, December 13-15, 2010, Revised Selected Pa- media information for music recommendation via hypergraph pers, 2010, 1–32 model, In: ACM Transactions on Multimedia Computing, Commu- [41] Demartini G., Iofciu T., de Vries A. P., Overview of the INEX 2009 nications, and Applications (TOMM) – Special Section on ACM entity ranking track, In: Focused Retrieval and Evaluation, 8th Multimedia 2010 Best Paper Candidates, and Issue on Social International Workshop of the Initiative for the Evaluation of XML Media, 2011, 7S(1), 22 Retrieval, INEX 2009, Brisbane, Australia, December 7-9, 2009, [57] McFee B., Lanckriet G. R. G., Hypergraph models of playlist di- Revised and Selected Papers, 2009, 254–264 alects, In: Proceedings of the 13th International Society for Music [42] Clarke C. L. A., Craswell N., Soboroff I., Overview of the TREC Information Retrieval Conference, Mosteiro S.Bento Da Vitória, 2009 web track, In: Proceedings of The Eighteenth Text Retrieval Porto, Portugal, October 8-12, 2012, 343–348 Conference, TREC 2009, Gaithersburg,Maryland, USA, November [58] Theodoridis A., Kotropoulos C., Panagakis Y., Music recommen- 17-20, 2009 dation using hypergraphs and group sparsity, In: 2013 IEEE Inter- [43] Balog K., Serdyukov P., de Vries A. 
P., Overview of the TREC 2011 national Conference on Acoustics, Speech and Signal Processing, entity track, In: Proceedings of The Twentieth Text REtrieval Con- Vancouver, BC, Canada, May 26-31, 2013, 56–60 ference, TREC 2011, Gaithersburg, Maryland, USA, November [59] von Neumann J., The computer and the brain, Yale University 15-18, 2011 Press, 2012 [44] Campinas S., Ceccarelli D., Perry T. E., Delbru R., Balog K., Tum- [60] Sporns O., Networks of the brain, MIT press, 2010 marello G., The Sindice-2011 dataset for entity-oriented search [61] Davison E. N., Schlesinger K. J., Bassett D. S., Lynall M.-E., Miller in the web of data, In: The First International Workshop on Entity- M. B., Grafton S. T., Carlson J. M., Brain network adaptability Oriented Search (EOS), 2011, 26–32 across task states, In: PLOS Computational Biology, 2015, 11(1), [45] Dkaki T., Mothe J., Truong Q. D., Passage retrieval using graph 1–14 vertices comparison, In: Proceedings of the 2007 Third Inter- [62] Jie B., Wee C.-Y., Shen D., Zhang D., Hyper-connectivity of func- national IEEE Conference on Signal-Image Technologies and tional networks for brain disease diagnosis, In: Medical Image Internet-Based System, Shanghai, China, December 16-18, 2007, Analysis, 2016, 32, 84–100 71–76 [63] Gu S., Yang M., Medaglia J. D., Gur R. C., Gur R. E., Satterthwaite T. [46] Page L., Brin S., Motwani R., Winograd T., The PageRank citation D., Bassett D. S., Functional hypergraph uncovers novel covariant ranking: Bringing order to the web, Technical report, Stanford structures over neurodevelopment, In: Human Brain Mapping, InfoLab, 1999 2017, 38(8), 3823–3835 [47] Khurana U., Deshpande A., Eflcient snapshot retrieval over his- [64] Zhang B. T., Random hypergraph models of learning and memory torical graph data, In: Proceedings of the 2013 IEEE International in biomolecular networks: Shorter-term adaptability vs. longer Conference on Data Engineering (ICDE 2013), Brisbane, Australia, term persistency, In: 2007 IEEE Symposium on Foundations of April 8-12, 2013, 997–1008 Computational Intelligence, 2007, 344–349 [48] Martins B., Silva M. J., A Graph-Ranking Algorithm for Geo- [65] Goertzel B., Patterns, hypergraphs and embodied general intelli- Referencing Documents, In: Proceedings of the Fifth IEEE Interna- gence, In: The 2006 IEEE International Joint Conference on Neural tional Conference on Data Mining, Houston, Texas, USA, 27-30 Network Proceedings, 2006, 451–458 November, 2005, 741–744 [66] Bellaachia A., Al-Dhelaan M., Random walks in hypergraph, In: [49] Zhu Y., Yan E., Song I., A natural language interface to a graph- Proceedings of the 2013 International Conference on Applied based bibliographic information retrieval system, In: Data & Mathematics and Computational Methods, Venice Italy, 2013, Knowledge Engineering, 2017, 111, 73–89 187–194 [50] Blanco R., Mika P.,Vigna S., Effective and eflcient entity search in [67] Devezas J., Lopes C. T., Nunes S., FEUP at TREC 2017 OpenSearch RDF data, In: The Semantic Web – ISWC2011 – 10th International track: Graph-based models for entity-oriented search, In: The Semantic Web Conference, Bonn, Germany, October 23-27, 2011, Twenty-Sixth Text REtrieval Conference Proceedings (TREC 2017), Hypergraph-of-entity Ë 127

Gaithersburg, MD, USA, 2017 [73] Moro A., Raganato A., Navigli R., Entity linking meets word sense [68] Devezas J., Nunes S., Graph-based entity-oriented search: imi- disambiguation: A unified approach, In: Transactions of the As- tating the human process of seeking and cross referencing infor- sociation for Computational Linguistics, 2014, 2, 231–244 mation, In: ERCIM News, Special Issue: Digital Humanities, 2017, [74] Geva S., Kamps J., Lehtonen M., Schenkel R., Thom J. A., Trotman 111, 13–14 A., Overview of the INEX 2009 ad hoc track, In: Focused Retrieval [69] Mikolov T., Sutskever I., Chen K., Corrado G. S., Dean J., Dis- and Evaluation, 8th International Workshop of the Initiative for tributed representations of words and phrases and their compo- the Evaluation of XML Retrieval, INEX 2009, Brisbane, Australia, sitionality, In: Proceedings of the 26th International Conference December 7-9, 2009, Revised and Selected Papers, 2009, 4–25 on Neural Information Processing Systems, Lake Tahoe, Nevada, [75] Coelho F., Ribeiro C., Automatic illustration with cross-media re- United States, 2013, 2, 3111–3119 trieval in large-scale collections, In: 2011 9th International Work- [70] Robertson S., Understanding inverse document frequency: On shop on Content-Based Multimedia Indexing (CBMI), 2011, 25–30 theoretical arguments for IDF, In: Journal of Documentation, 2004, [76] Liu J., Pasupat P.,Wang Y., Cyphers S., Glass J., Query understand- 60(5), 503–520 ing enhanced by hierarchical parsing structures, In: 2013 IEEE [71] Alhelbawy A., Gaizauskas R., Graph ranking for collective named Workshop on Automatic Speech Recognition and Understanding, entity disambiguation, In: Proceedings of the 52nd Annual Meet- 2013, 72–77 ing of the Association for Computational Linguistics (Volume 2: [77] Fogaras D., Rácz B., Csalogány K., Sarlós T., Towards scaling Short Papers), 2014, 2, 75–80 fully personalized PageRank: Algorithms, lower bounds, and [72] Hoffart J., Yosef M. A., Bordino I., Fürstenau H., Pinkal M., Spaniol experiments, In: Internet Mathematics, 2011, 2(3), 333–358 M., Taneva B., Thater S., Weikum G., Robust disambiguation of named entities in text, In: Proceedings of the Conference on Em- pirical Methods in Natural Language Processing, 2011, 782–792