On the Representation and Embedding of Knowledge Bases Beyond Binary Relations
Total Page:16
File Type:pdf, Size:1020Kb
Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16) On the Representation and Embedding of Knowledge Bases Beyond Binary Relations 1,2 1,2 3 1 1,2 Jianfeng Wen , Jianxin Li , Yongyi Mao , Shini Chen , and Richong Zhang⇤ 1 State Key Laboratory of Software Development Environment, Beihang University 2 School of Computer Science and Engineering, Beihang University 3 School of Electrical Engineering and Computer Science, University of Ottawa Abstract to link prediction [Bordes et al., 2013] or question answering [Bordes et al., 2014a] in knowledge bases, in which the prob- The models developed to date for knowledge base lems are usually of a combinatorial nature in their original embedding are all based on the assumption that discrete settings. the relations contained in knowledge bases are bi- Despite their promising successes, existing embedding nary. For the training and testing of these embed- techniques are all developed based on the assumption that ding models, multi-fold (or n-ary) relational data knowledge data are instances of binary relations, namely in- are converted to triples (e.g., in FB15K dataset) stances each involving two entities (such as “Beijing is the and interpreted as instances of binary relations. capital of China”). In reality, however, a large portion of the This paper presents a canonical representation of knowledge data are from non-binary relations (such as “Bene- knowledge bases containing multi-fold relations. dict Cumberbatch played Alan Turing in the movie The Imita- We show that the existing embedding models on tion Game”). For example, we observe that in Freebase[Bol- the popular FB15K datasets correspond to a sub- lacker et al., 2008], more than 1/3 of the entities participate in optimal modelling framework, resulting in a loss of non-binary relations. This calls for a careful investigation of structural information. We advocate a novel mod- embedding techniques for knowledge bases containing non- elling framework, which models multi-fold rela- binary relations. tions directly using this canonical representation. Using this framework, the existing TransH model is In this paper, we first present a clean mathematical defini- generalized to a new model, m-TransH. We demon- tion of multi-fold relations, also known as n-ary relation in [ ] strate experimentally that m-TransH outperforms the literature Codd, 1970 . Using this notion, we propose a TransH by a large margin, thereby establishing a canonical representation for multi-fold (binary or non-binary) new state of the art. relational data, which we call instance representation. Ex- isting knowledge bases usually organize their data using the W3C Resource Description Framework (RDF) [Nickel et al., 1 Introduction 2015], in which relational data are represented as a collection of (subject, predicate, object) triples. Although such a triple The emerging of knowledge bases such as YAGO[Suchanek representation, like the instance representation, is capable of et al., 2007], DBpedia[Auer et al., 2007] and Freebase[Bol- capturing the structures of multi-fold relations [Codd, 1970; lacker et al., 2008] has inspired intense research interest in Nguyen et al., 2014; Rouces et al., 2015; Grewe, 2010; this area, from completing and improving knowledge bases Krieger and Willms, 2015; W3C, 2016], we show that manip- (e.g., [Baader et al., 2007; Angeli and Manning, 2013]) to ulating multi-fold relational data into triples (as in Freebase) developing applications that retrieve information from the results in an heterogeneity of the predicates, unfavourable for knowledge data (e.g., [Marin et al., 2014; Xiong and Callan, embedding. As such, we advocate that the starting point of 2015a; 2015b]). Recently knowledge base embedding[Bor- embedding multi-fold relations should be recovering the re- des et al., 2013; Wang et al., 2014; Lin et al., 2015b; lational data in its instance representation. Bordes et al., 2014b; 2011; Socher et al., 2013; Lin et al., 2015a] has stood out as an appealing and generic methodol- We then formulate the embedding problem on the instance ogy to access various research problems in this area. Briefly, representation and suggest that the heart of the problem is this methodology sets out to represent entities in a knowledge modelling each cost function that defines a constraint in the base as points in some Euclidean space while preserving the embedding space. We examine the popular FB15K[Bordes structures of the relational data. This approach turns the dis- et al., 2013] datasets, used in all existing embedding mod- crete topology of the relations into a continuous one, enabling els, and point out that the triple-based data format of FB15K the design of efficient algorithms and potentially benefitting results from applying a particular “star-to-clique” (S2C) con- many applications. For example, embedding can be applied version procedure to the filtered Freebase data. This proce- dure can be verified to be irreversible, which causes a loss of ⇤Corresponding author: [email protected] structural information in the multi-fold relations. 1300 Interestingly, we discover that all existing embedding mod- (R) given below is then the instance of R stating els on such S2C-converted datasets can be unified under a “BenedictM ! CumberbatchN played Alan Turing in the movie The “decomposition” modelling framework. In this framework, Imitation Game”: J the cost function associated with a -fold relation is modelled t(ACTOR) = BenedictCumberbatch, as the sum of J bi-variate functions. Suggesting that the 2 t(CHARACTER) = AlanTuring, decomposition framework is fundamentally limited, we pro- t(MOVIE) = TheImitationGame. pose a “direct modelling” framework for embedding multi- fold relations. As an example in this framework, we general- 2.2 Instance Representation ize TransH[Wang et al., 2014] to a new model for multi-fold Let index a set of distinct multi-fold relations on . More relations. Although TransH is known to perform comparably R N to the best performing models but with lower time complex- precisely put, for each r , there is a relation Rr on with roles (R ). One may2R identify the index r of R withN ity, our experiments demonstrate that this new model, desig- M r r nated m-TransH, has even lower complexity, and outperforms the type of the relation Rr, and in this view, is a collection of relation types. Let be the set of instancesR of relation TransH by an astonishing margin. Tr In summary, this paper takes a fundamental look at the Rr that are included in the KB, then the KB can be speci- fied as ( , , ,r ). We call such specification an knowledge base embedding problem when non-binary rela- N R {Tr 2R} tions exist. We advocate the instance representation and the instance representation. Since a KB is usually incomplete, each set is expected to be strictly contained in R . As a direct modelling framework on such representation. Con- Tr r strained by the length requirement, we are unable to elaborate consequence, relation Rr is in fact unknown, and all infor- at places and certain details are omitted. mation about Rr is revealed via the set r, sampled from Rr. Clearly, the instance representation containsT all information in the KB pertaining to the structures of the relations and how 2 Multi-Fold Relations and Knowledge Base they interact. Therefore, instance representations are legit- Representations imately canonical, at least from the embedding perspective, where only such information matters. 2.1 Multi-Fold Relations A well-known algebraic concept, a binary relation [Hunger- 2.3 Fact Representation ford, 2003] on a set is defined as a subset of the cartesian For various implementation considerations, practical KBs product , or N2. This understanding allows an imme- N⇥N N such as Freebase [Fre, 2016] organize relational data in a dif- diate generalization of binary relation to multi-fold relation ferent format, the core of which is a notion related to but dif- (also known as n-ary relation, see, e.g. [Codd, 1970]) where 2 J ferent from the multi-fold relation we define. We call this is replaced with the J-fold cartesian product for an notion meta-relation and define it next. arbitraryN integer J 2. From a knowledge base (KB)N point ≥ Given a set of entities and a set of roles, a (multi- of view, we argue however that such algebraic definitions are fold) meta-relationN Q on with rolesM is a subset of incomplete, in the sense that the role of each coordinate in the N M M cartesian product is not specified. For example, let R be the 2N , where 2N is the power set of . That is, each ele- ment in the meta-relation Q, which willN be called a fact of Q, binary relation relating a country with its capital city. Then ambiguity exists in whether the instance “Paris is the capital is a function mapping to 2N . Similar to relations, we call Q a J-fold meta-relationM if = J and often write (Q) of France” should be written as (Paris, France) R or as |M| M (France, Paris) R. Although this issue is usually2 resolved in place of . 2 M by suitable data structures, we now formulate a clean mathe- Example 2 Let Q be a 3-fold meta-relation about “who matical notion of multi-fold relation which also specifies the played what instruments in the recording of what music roles of involved entities. piece”. The roles (Q) consists of RECORDING (i.e., the Throughout the paper, the following notations will be used. music piece), INSTRUMENTM ROLE (i.e., the music instru- − For any set and , we denote by A the set of all functions ment), and CONTRIBUTOR (i.e., the person). A B B mapping to , as is standard in mathematics [Hungerford, Let u and u be two facts of Q, which respectively state A B 1 2 2003].