Knowledge Graph Embedding with Shared Gate Structure
The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

TransGate: Knowledge Graph Embedding with Shared Gate Structure

Jun Yuan,1,2,3 Neng Gao,1,2 Ji Xiang1,2∗
1Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
2State Key Laboratory of Information Security, Chinese Academy of Sciences, Beijing, China
3School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
{yuanjun, gaoneng, [email protected]}

∗Corresponding author
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Embedding knowledge graphs (KGs) into continuous vector spaces is an essential problem in knowledge extraction. Current models keep improving embeddings by focusing on discriminating relation-specific information from entities with increasingly complex feature engineering. We note that they ignore the inherent relevance between relations and try to learn a unique discriminate parameter set for each relation. As a result, these models potentially suffer from high time complexity and large numbers of parameters, which prevents them from being applied efficiently to real-world KGs. In this paper, we follow the idea of parameter sharing to simultaneously learn more expressive features, reduce parameters, and avoid complex feature engineering. Based on the gate structure from LSTM, we propose a novel model, TransGate, and develop a shared discriminate mechanism, resulting in almost the same space complexity as indiscriminate models. Furthermore, to develop a more effective and scalable model, we reconstruct the gate with weight vectors, giving our method time complexity comparable to that of indiscriminate models. We conduct extensive experiments on link prediction and triplet classification. Experiments show that TransGate not only outperforms state-of-the-art baselines, but also reduces parameters greatly. For example, TransGate outperforms ConvE and R-GCN with 6x and 17x fewer parameters, respectively. These results indicate that parameter sharing is a superior way to further optimize embeddings and that TransGate finds a better trade-off between complexity and expressivity.

Introduction

Knowledge graphs (KGs) such as WordNet (Miller 1995) and Freebase (Bollacker et al. 2008) have been widely adopted in various applications, like web search, Q&A, etc. KGs are large-scale multi-relational structures and are usually organized in the form of triplets. Each triplet is denoted as (h, r, t), where h and t are the head and tail entities, respectively, and r is the relation between h and t. For instance, (Beijing, CapitalOf, China) denotes the fact that the capital of China is Beijing.

Although existing KGs already consist of billions of triplets, they are far from complete. Thus, knowledge graph completion (KGC) has emerged as an important open research problem (Nickel, Tresp, and Kriegel 2011). For example, it is shown by (Dong et al. 2014) that more than 66% of the person entities are missing a birthplace in Freebase and DBLP. The aim of the KGC task is to predict relations and determine the specific relation type between entities based on existing triplets. Due to the extremely large data size, an effective and scalable solution is crucial for the KGC task (Liu, Wu, and Yang 2017).

Various KGC methods have been developed in recent years, and the most successful ones are representation learning based methods, which encode KGs into low-dimensional vector spaces. Among these methods, TransE (Bordes et al. 2013) is a simple and effective model that regards every relation as a translation between heads and tails. We denote embedding vectors by the same letters in boldface. When (h, r, t) holds, the embedding $\mathbf{h}$ is close to the embedding $\mathbf{t}$ after adding the embedding $\mathbf{r}$, that is, $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$.
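For concreteness, a minimal NumPy sketch of this translation-based score follows; the function name, embedding size, and toy data are illustrative assumptions rather than details from the model itself.

```python
import numpy as np

def transe_score(h, r, t, norm=1):
    """TransE-style score ||h + r - t||: low for plausible triplets,
    high otherwise. norm=1 selects the L1 norm, norm=2 the L2 norm."""
    return np.linalg.norm(h + r - t, ord=norm)

# Toy check: a tail built to satisfy h + r ≈ t scores lower
# than a randomly corrupted tail.
rng = np.random.default_rng(0)
h, r = rng.normal(size=50), rng.normal(size=50)
t_good = h + r + rng.normal(scale=0.01, size=50)
t_bad = rng.normal(size=50)
assert transe_score(h, r, t_good) < transe_score(h, r, t_bad)
```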
Indiscriminate models like TransE can scale to real-world KGs (Xu et al. 2017a), owing to their limited number of parameters and low computational cost. However, indiscriminate models often learn less expressive features (Dettmers et al. 2018), leading to either low parameter efficiency or low accuracy. In fact, an entity may have multiple aspects, which may be related to different relations (Lin et al. 2015). Therefore, relation-specific information discrimination of entities is the key to improving embeddings. Many discriminate models (Lin et al. 2015; Ji et al. 2015; 2016) have been proposed recently, and they usually outperform indiscriminate models on public data sets.

We note that current methods always assume independence between relations and try to learn a unique discriminate parameter set for each relation. Although there are often thousands of relations in a KG, they are not independent of each other (Bordes et al. 2013). For example, in WordNet the most frequently encoded relation is the hierarchical relation, and each hierarchical relation has its inverse relation, e.g., hyponym/hypernym.

As a result, current discriminate models potentially suffer from the following three problems, all of which prevent them from being applied efficiently to real-world KGs. First, current discriminate models potentially suffer from large numbers of parameters. For instance, when applying TransD (Ji et al. 2015) to Freebase with an embedding size of 100, the parameters will take more than 33GB, and half of them are discriminate parameters. Second, current discriminate models explore increasingly complex feature engineering to improve embeddings, resulting in high time complexity. In reality, one nature of correlated relations is semantic sharing (Trouillon et al. 2016), which can be used to improve the discrimination. Third, due to the large parameter size and complex design, current models have to introduce more hyper-parameters and adopt pre-training to prevent overfitting. Please refer to Table 1 and "Related Work" for details.

To optimize embeddings and reduce parameters simultaneously, we follow the idea of parameter sharing and propose a novel method, TransGate. Unlike previous methods that try to learn a relation-specific discriminate parameter set, TransGate discriminates relation-specific information with only two gates shared by all relations. Hence, TransGate has almost the same space complexity as TransE on real-world KGs. Furthermore, the shared gates share statistical strength across different relations, which optimizes the embeddings without complex feature engineering.
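To make the shared-gate idea concrete, here is a minimal sketch, assuming each gate is a sigmoid built from element-wise weight vectors over the entity and relation embeddings; the parameter names and the exact gate form are assumptions for illustration, not the formal definition of the model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def shared_gate(e, r, v_e, v_r, b):
    """A gate shared by ALL relations: it filters an entity embedding
    conditioned on the relation embedding. Using weight *vectors*
    (element-wise products) instead of weight matrices keeps the cost
    per triplet linear in the embedding dimension."""
    return e * sigmoid(v_e * e + v_r * r + b)

def gated_translation_score(h, r, t, p):
    """Translation score on gated embeddings: one shared gate for heads,
    one for tails, with no relation-specific discriminate parameters."""
    h_r = shared_gate(h, r, p["v_h"], p["v_hr"], p["b_h"])
    t_r = shared_gate(t, r, p["v_t"], p["v_tr"], p["b_t"])
    return np.linalg.norm(h_r + r - t_r, ord=1)
```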
We conduct extensive experiments on link prediction and triplet classification on large-scale public KGs, namely Freebase and WordNet. TransGate achieves better performance than ConvE (Dettmers et al. 2018) and R-GCN (Michael Sejr et al. 2018) with 6x and 17x fewer parameters, respectively. Unlike many of the related models that require pre-training, TransGate is a self-contained model and does not need any extra hyper-parameters to prevent overfitting.

Our specific contributions are as follows:

• We identify the significance of the inherent relevance between relations, which is largely overlooked by the existing literature. We follow the idea of parameter sharing to learn more expressive features and reduce parameters at the same time.

• Based on the gate structure from LSTM (Hochreiter and Schmidhuber 1997), we propose TransGate and develop a shared discriminate mechanism to avoid large numbers of discriminate parameters.

• To develop a more effective and scalable model, we reconstruct the standard gate with weight vectors. In this way, we reduce the computation brought by matrix-vector multiplication operations and decrease the time complexity to the same order as TransE.

• Experiments show that TransGate not only delivers significant improvements over state-of-the-art baselines, but also reduces parameters greatly. These results indicate that parameter sharing is a superior way to further optimize embeddings.

Related Work

Indiscriminate Models

TransE. TransE (Bordes et al. 2013) embeds entities and relations in the same space $\mathbb{R}^m$. It regards each relation as a translation between heads and tails and wants $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$ when (h, r, t) holds. Hence, TransE assumes the score function $f_r(h, t) = \|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_{L_1/L_2}$ is low if (h, r, t) holds, and high otherwise.

DistMult. DistMult (Yang et al. 2015) has the same time and space complexity as TransE. DistMult uses a weighted element-wise dot product to define the score function $f_r(h, t) = \sum_k \mathbf{h}_k \mathbf{r}_k \mathbf{t}_k$. Although DistMult has better overall performance than TransE, it is unable to model asymmetric relations.

ComplEx. ComplEx (Trouillon et al. 2016) makes use of complex-valued embeddings and the Hermitian dot product to address the antisymmetry problem of DistMult. As mentioned in its paper, though ComplEx doubles the parameter size of TransE, TransE and DistMult still perform better on symmetric relations.

CombinE. CombinE (Tan, Zhao, and Wang 2017) considers triplet features from two aspects: relation, $\mathbf{r}_p \approx \mathbf{h}_p + \mathbf{t}_p$, and entity, $\mathbf{r}_m \approx \mathbf{h}_m - \mathbf{t}_m$. The score function is $f_r(h, t) = \|\mathbf{h}_p + \mathbf{t}_p - \mathbf{r}_p\|^2_{L_1/L_2} + \|\mathbf{h}_m - \mathbf{t}_m - \mathbf{r}_m\|^2_{L_1/L_2}$. CombinE doubles the parameter size of TransE, but does not yield a significant boost in link prediction performance.

Discriminate Models

Discriminate models focus more on precision. They usually contain two stages: relation-specific information discrimination and score computation; a short sketch of this pattern follows the model summaries below.

TransH. TransH (Wang et al. 2014) maps entity embeddings into relation hyperplanes to discriminate relation-specific information. The score function is defined as $f_r(h, t) = \|(\mathbf{h} - \mathbf{w}_r^\top\mathbf{h}\,\mathbf{w}_r) + \mathbf{r} - (\mathbf{t} - \mathbf{w}_r^\top\mathbf{t}\,\mathbf{w}_r)\|_{L_1/L_2}$, where $\mathbf{w}_r$ is the normal vector of r's relation hyperplane.

TransR/CTransR. TransR (Lin et al. 2015) learns a mapping matrix $M_r$ for each relation. It maps entity embeddings into the relation space to realize the discrimination. The score function is $f_r(h, t) = \|M_r\mathbf{h} + \mathbf{r} - M_r\mathbf{t}\|_{L_1/L_2}$. CTransR is an extension of TransR. It clusters diverse head-tail entity pairs into groups and sets a relation vector for each group.

TranSparse. TranSparse (Ji et al. 2016) replaces the transfer matrix in TransR with two sparse matrices for each relation. It uses two sparse-degree hyper-parameters, $\theta_r^h$ and $\theta_r^t$, to reduce the parameters of the mapping matrices.

TransD. TransD (Ji et al. 2015) constructs two mapping matrices dynamically for each triplet by setting a projection vector for each entity and relation, that is, $M_{rh} = \mathbf{r}_p\mathbf{h}_p^\top + I^{m\times n}$ and $M_{rt} = \mathbf{r}_p\mathbf{t}_p^\top + I^{m\times n}$.
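As a concrete illustration of the two-stage pattern above (discriminate, then score), here is a minimal NumPy sketch of TransH's hyperplane projection taken directly from its score function; the function name and the explicit re-normalization of $\mathbf{w}_r$ are our additions.

```python
import numpy as np

def transh_score(h, r, t, w_r, norm=1):
    """TransH: project h and t onto the hyperplane with normal w_r
    (discrimination stage), then score the translation between the
    projections (score stage)."""
    w_r = w_r / np.linalg.norm(w_r)      # keep the normal vector unit-length
    h_perp = h - np.dot(w_r, h) * w_r    # component of h in the hyperplane
    t_perp = t - np.dot(w_r, t) * w_r
    return np.linalg.norm(h_perp + r - t_perp, ord=norm)
```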