Joint Embeddings of Hierarchical Categories and Entities
Yuezhang Li, Ronghuo Zheng, Tian Tian, Zhiting Hu, Rahul Iyer, Katia Sycara
Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA
[email protected]

Abstract

Due to the lack of structured knowledge applied in learning distributed representations of categories, existing work cannot incorporate category hierarchies into entity information. We propose a framework that embeds entities and categories into a semantic space by integrating structured knowledge and taxonomy hierarchies from large knowledge bases. The framework allows us to compute meaningful semantic relatedness between entities and categories. Compared with the previous state of the art, our framework can handle both single-word concepts and multiple-word concepts, with superior performance in concept categorization and semantic relatedness.

1 Introduction

Hierarchies, most commonly represented as tree or DAG structures, provide a natural way to categorize and locate knowledge in large knowledge bases (KBs). For example, WordNet, Freebase and Wikipedia use hierarchical taxonomies to organize entities into category hierarchies. These hierarchical categories can benefit applications such as concept categorization (Rothenhäusler and Schütze, 2009), document categorization (Gopal and Yang, 2013), object categorization (Verma et al., 2012), and link prediction in knowledge graphs (Lin et al., 2015). In all of these applications, it is essential to have a good representation of categories and entities as well as a good semantic relatedness measure.

Existing work does not use the structured knowledge of KBs to embed representations of entities and categories into a semantic space. Current entity embedding methods cannot provide a relatedness measure between entities and categories, although they successfully learn entity representations and relatedness between entities (Hu et al., 2015). Knowledge graph embedding methods (Wang et al., 2014; Lin et al., 2015) give embeddings of entities and relations, but lack category representations and a relatedness measure. Though taxonomy embedding methods (Weinberger and Chapelle, 2009; Hwang and Sigal, 2014) learn category embeddings, they primarily target document classification, not entity representation.

In this paper, we propose two models that simultaneously learn entity and category vectors from large-scale knowledge bases (KBs): the Category Embedding model and the Hierarchical Category Embedding model. The Category Embedding model (CE model) extends the entity embedding method of (Hu et al., 2015) by using category information together with entities to learn entity and category embeddings. The Hierarchical Category Embedding model (HCE model) extends the CE model to integrate the hierarchical structure of categories; it considers all ancestor categories of an entity. The learned entity and category vectors capture meaningful semantic relatedness between entities and categories.

We train the category and entity vectors on Wikipedia, and then evaluate our methods on two applications: concept categorization (Baroni and Lenci, 2010) and semantic relatedness (Finkelstein et al., 2001). The organization of the research elements that comprise this paper, summarizing the above discussion, is shown in Figure 1.

The main contributions of our paper are summarized as follows. First, we incorporate category information into entity embeddings with the proposed CE model to obtain entity and category embeddings simultaneously. Second, we add category hierarchies to the CE model and develop the HCE model to enhance the quality of the embeddings. Third, we propose a concept categorization method based on nearest neighbor classification that avoids the issues arising from disparity in the granularity of categories that plague traditional clustering methods. Fourth, our model can handle multiple-word concepts/entities¹ such as "hot dog" and distinguish them from "dog". Overall, our model outperforms state-of-the-art methods on both concept categorization and semantic relatedness.

¹ In this paper, concepts and entities denote the same thing.

Figure 1: The organization of research elements comprising this paper.

2 Hierarchical Category Embedding

In order to find representations for categories and entities that capture their semantic relatedness, we use existing hierarchical categories and entities labeled with these categories, and explore two methods: 1) the Category Embedding model (CE Model), which replaces the entities in the context with their directly labeled categories to build the categories' context; 2) the Hierarchical Category Embedding model (HCE Model), which further incorporates all ancestor categories of the context entities to utilize the hierarchical information.

2.1 Category Embedding (CE) Model

Our category embedding (CE) model is based on the skip-gram word embedding model (Mikolov et al., 2013). The skip-gram model aims at generating word representations that are good at predicting the context words surrounding a target word in a sliding window. Previous work (Hu et al., 2015) extends the entity's context to the whole article that describes the entity and acquires a set of entity pairs $D = \{(e_t, e_c)\}$, where $e_t$ denotes the target entity and $e_c$ denotes the context entity.

Our CE model extends these approaches by incorporating category information. In KBs such as Wikipedia, category hierarchies are usually given as DAG or tree structures, and entities are categorized into one or more categories as leaves. Thus, in KBs, each entity $e_t$ is labeled with one or more categories $(c_1, c_2, \ldots, c_k)$, $k \geq 1$, and described by an article containing other context entities (see Data in Figure 1).

To learn embeddings of entities and categories simultaneously, we adopt a method that incorporates the labeled categories into the entities when predicting the context entities, similar to the TWE-1 model (Liu et al., 2015). For example, if $e_t$ is the target entity in the document, its labeled categories $(c_1, c_2, \ldots, c_k)$ are combined with the entity $e_t$ to predict context entities such as $e_{c_1}$ and $e_{c_2}$ (see CE Model in Figure 1). For each target-context entity pair $(e_t, e_c)$, the probability of $e_c$ being the context of $e_t$ is defined as the following softmax:

$$P(e_c \mid e_t) = \frac{\exp(e_t \cdot e_c)}{\sum_{e \in E} \exp(e_t \cdot e)}, \qquad (1)$$

where $E$ denotes the set of all entity vectors, and $\exp$ is the exponential function. For convenience, we abuse the notation $e_t$ and $e_c$ here to denote a target entity vector and a context entity vector respectively.
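As a concrete illustration of Eq. (1), the following minimal sketch (Python with NumPy; our own illustration rather than the authors' code, with hypothetical names and a toy dimensionality) computes $P(e_c \mid e_t)$ for every candidate context entity:

```python
import numpy as np

def context_probabilities(target_vec, entity_matrix):
    """Softmax of Eq. (1): P(e_c | e_t) over all entities in E.

    target_vec    : (d,)   vector of the target entity e_t
    entity_matrix : (|E|, d) matrix whose rows are context entity vectors
    """
    scores = entity_matrix @ target_vec      # dot product e_t . e for every e in E
    scores -= scores.max()                   # stabilize exp() numerically
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()     # normalize into a distribution

# toy usage: 5 entities embedded in 4 dimensions
rng = np.random.default_rng(0)
E = rng.normal(size=(5, 4))
p = context_probabilities(E[0], E)
print(p, p.sum())                            # probabilities summing to 1
```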
Similar to the TWE-1 model, we learn the target and context vectors by maximizing the average log probability:

$$L = \frac{1}{|D|} \sum_{(e_c, e_t) \in D} \Big[ \log P(e_c \mid e_t) + \sum_{c_i \in C(e_t)} \log P(e_c \mid c_i) \Big], \qquad (2)$$

where $D$ is the set of all entity pairs, we abuse the notation $c_i$ to denote a category vector, and $C(e_t)$ denotes the categories of entity $e_t$.

2.2 Hierarchical Category Embedding (HCE) Model

With the design above, we can get the embeddings of all categories and entities in KBs, but without capturing the semantics of the hierarchical structure of categories. In a category hierarchy, the categories at lower layers cover fewer but more specific concepts than categories at upper layers. To capture this feature, we extend the CE model to further incorporate the ancestor categories of the target entity when predicting the context entities (see HCE Model in Figure 1). If a category is near an entity, its ancestor categories should also be close to that entity. On the other hand, an increasing distance of a category from the target entity decreases the power of that category in predicting context entities. Therefore, given the target entity $e_t$ and the context entity $e_c$, the objective is to maximize the following weighted average log probability:

$$L = \frac{1}{|D|} \sum_{(e_c, e_t) \in D} \Big[ \log P(e_c \mid e_t) + \sum_{c_i \in A(e_t)} w_i \log P(e_c \mid c_i) \Big], \qquad (3)$$

where $A(e_t)$ represents the set of ancestor categories of entity $e_t$, and $w_i$ is the weight of each category in predicting the context entity. To implement the intuition that a category is more relevant to its closer ancestors (for example, "NBC Mystery Movies" is more relevant to "Mystery Movies" than to "Entertainment"), we set $w_i \propto \frac{1}{l(c_c, c_i)}$, where $l(c_c, c_i)$ denotes the average number of steps going down from category $c_i$ to category $c_c$, under the constraint $\sum_i w_i = 1$.

Figure 2 presents the results of our HCE model on the DOTA-all data set (see Section 4.1.1). The visualization shows that our embedding method is able to clearly separate entities into distinct categories.

Figure 2: Category and entity embedding visualization of the DOTA-all data set (see Section 4.1.1). We use the t-SNE algorithm (Van der Maaten and Hinton, 2008) to map vectors into a 2-dimensional space. Labels with the same color are entities belonging to the same category. Labels surrounded by a box are category vectors.

2.3 Learning

Learning the CE and HCE models follows the optimization scheme of the skip-gram model (Mikolov et al., 2013). We use negative sampling to reformulate the objective function, which is then optimized through stochastic gradient descent (SGD).

Specifically, the likelihood of each context entity of a target entity is defined with the softmax function in Eq. (1), which iterates over all entities and is thus computationally intractable. We apply the standard negative sampling technique to transform the objective function in Equation (3) into Equation (4) below, and then optimize it through SGD:

$$L = \sum_{(e_c, e_t) \in D} \Big[ \log \sigma(e_c \cdot e_t) + \sum_{c_i \in A(e_t)} w_i \log \sigma(e_c \cdot c_i) \Big] + \sum \Big[ \log \sigma(-e'_c \cdot e_t) + \cdots \Big], \qquad (4)$$
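As a rough sketch of how one such SGD step might look, the code below (our own simplification in Python/NumPy, not the authors' released code) computes the normalized ancestor weights $w_i$ and applies one gradient-ascent update for a single $(e_t, e_c)$ pair; the negative-sampling terms of Eq. (4) are elided in this excerpt, so only the $\log \sigma(-e'_c \cdot e_t)$ part shown above is handled, and real implementations accumulate gradients more carefully:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ancestor_weights(steps_down):
    """w_i proportional to 1 / l(c_c, c_i), normalized so that sum_i w_i = 1."""
    w = 1.0 / np.asarray(steps_down, dtype=float)
    return w / w.sum()

def sgd_step(e_t, e_c, ancestors, weights, negatives, lr=0.025):
    """One gradient-ascent step on the visible terms of Eq. (4).

    e_t, e_c  : target / context entity vectors (modified in place)
    ancestors : (k, d) rows are ancestor category vectors c_i of e_t
    weights   : w_i for each ancestor row
    negatives : (n, d) rows are sampled negative context vectors e_c'
    """
    # positive pair: maximize log sigma(e_c . e_t)
    g = 1.0 - sigmoid(e_c @ e_t)
    d_et, d_ec = g * e_c, g * e_t
    # weighted ancestor terms: maximize w_i * log sigma(e_c . c_i)
    for i, c_i in enumerate(ancestors):
        g_i = weights[i] * (1.0 - sigmoid(e_c @ c_i))
        d_ec += g_i * c_i
        ancestors[i] += lr * g_i * e_c
    # sampled negatives: maximize log sigma(-e_c' . e_t)
    for e_neg in negatives:
        d_et -= sigmoid(e_neg @ e_t) * e_neg
    e_t += lr * d_et
    e_c += lr * d_ec

# toy usage: 4-dim vectors, two ancestors at 1 and 2 steps, two negatives
rng = np.random.default_rng(0)
w = ancestor_weights([1, 2])                  # -> [2/3, 1/3]
sgd_step(rng.normal(size=4), rng.normal(size=4),
         rng.normal(size=(2, 4)), w, rng.normal(size=(2, 4)))
```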
3.1.2 Nearest Neighbor (NN) Classification

Although the methods described above can generate vectors carefully designed for capturing relations and attributes, the number of vector dimensions can be very large for some of them: from 10,000+ to 1,000,000+, e.g., (Almuhareb and Poesio, 2005; Rothenhäusler and Schütze, 2009).
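Following the contribution stated in the introduction, where a concept is categorized by nearest neighbor classification over the learned vectors, a minimal sketch of the idea looks as follows (our own illustration; the cosine similarity measure and all names are assumptions rather than details fixed by this excerpt):

```python
import numpy as np

def nearest_category(entity_vec, category_matrix, category_names):
    """Assign an entity to the category whose vector is most similar.

    entity_vec      : (d,) learned entity/concept vector
    category_matrix : (m, d) matrix of learned category vectors
    category_names  : list of m category labels
    """
    e = entity_vec / np.linalg.norm(entity_vec)
    C = category_matrix / np.linalg.norm(category_matrix, axis=1, keepdims=True)
    return category_names[int(np.argmax(C @ e))]   # cosine nearest neighbor

# toy usage with random stand-ins for learned embeddings
rng = np.random.default_rng(1)
categories = rng.normal(size=(3, 4))
print(nearest_category(rng.normal(size=4), categories,
                       ["food", "animals", "vehicles"]))
```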