A Comprehensive Survey of Graph Embedding: Problems, Techniques and Applications
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XX, NO. XX, SEPT 2017
arXiv:1709.07604v3 [cs.AI] 2 Feb 2018

Hongyun Cai, Vincent W. Zheng, and Kevin Chen-Chuan Chang

Abstract—Graphs are an important data representation, appearing in a wide diversity of real-world scenarios. Effective graph analytics gives users a deeper understanding of what is behind the data and can thus benefit many useful applications, such as node classification, node recommendation, and link prediction. However, most graph analytics methods suffer from high computation and space costs. Graph embedding is an effective yet efficient way to solve the graph analytics problem: it converts graph data into a low-dimensional space in which the graph structural information and graph properties are maximally preserved. In this survey, we conduct a comprehensive review of the graph embedding literature. We first introduce the formal definition of graph embedding as well as related concepts. We then propose two taxonomies of graph embedding, corresponding to the challenges that exist in different graph embedding problem settings and to how existing work addresses these challenges. Finally, we summarize the applications that graph embedding enables and suggest four promising future research directions in terms of computation efficiency, problem settings, techniques, and application scenarios.

Index Terms—Graph embedding, graph analytics, graph embedding survey, network embedding

• H. Cai is with Advanced Digital Sciences Center, Singapore. E-mail: [email protected].
• V. Zheng is with Advanced Digital Sciences Center, Singapore. E-mail: [email protected].
• K. Chang is with University of Illinois at Urbana-Champaign, USA. E-mail: [email protected].

1 INTRODUCTION

Graphs naturally exist in a wide diversity of real-world scenarios, e.g., social graphs/diffusion graphs in social media networks, citation graphs in research areas, user interest graphs in electronic commerce, knowledge graphs, etc. Analysing these graphs provides insights into how to make good use of the information hidden in them, and such analysis has therefore received significant attention over the last few decades. Effective graph analytics can benefit many applications, such as node classification [1], node clustering [2], node retrieval/recommendation [3], link prediction [4], etc. For example, by analysing the graph constructed from user interactions in a social network (e.g., retweet/comment/follow on Twitter), we can classify users, detect communities, recommend friends, and predict whether an interaction will happen between two users.

Although graph analytics is practical and essential, most existing graph analytics methods suffer from high computation and space costs. A lot of research effort has been devoted to performing these expensive graph analytics efficiently. Examples include distributed graph data processing frameworks (e.g., GraphX [5], GraphLab [6]), new space-efficient graph storage that reduces I/O and computation costs [7], and so on.

In addition to the above strategies, graph embedding provides an effective yet efficient way to solve the graph analytics problem. Specifically, graph embedding converts a graph into a low-dimensional space in which the graph information is preserved. By representing a graph as one (or a set of) low-dimensional vector(s), graph algorithms can then be computed efficiently. There are different types of graphs (e.g., homogeneous graphs, heterogeneous graphs, attribute graphs, etc.), so the input of graph embedding varies across scenarios. The output of graph embedding is a low-dimensional vector representing a part of the graph (or the whole graph). Fig. 1 shows a toy example of embedding a graph into 2D space at different granularities, i.e., depending on the need, we may represent a node, an edge, a substructure, or the whole graph as a low-dimensional vector. More details about the different types of graph embedding input and output are provided in Sec. 3.

[Fig. 1. A toy example of embedding a graph into 2D space at different granularities: (a) Input Graph, (b) Node Embedding, (c) Edge Embedding, (d) Substructure Embedding, (e) Whole-Graph Embedding. G{1,2,3} denotes the substructure containing nodes v1, v2, v3.]

In the early 2000s, graph embedding algorithms were mainly designed to reduce the high dimensionality of non-relational data by assuming the data lie on a low-dimensional manifold. Given a set of non-relational high-dimensional data features, a similarity graph is constructed based on pairwise feature similarity; each node in the graph is then embedded into a low-dimensional space where connected nodes are closer to each other. Examples of this line of research are introduced in Sec. 4.1. Since 2010, with the proliferation of graphs in various fields, research in graph embedding has started to take a graph as the input and to leverage auxiliary information (if any) to facilitate the embedding. On the one hand, some of this work focuses on representing a part of the graph (e.g., a node, edge, or substructure) as one vector (Figs. 1(b)-1(d)). To obtain such embeddings, these methods either adopt state-of-the-art deep learning techniques (Sec. 4.2) or design an objective function that optimizes the edge reconstruction probability (Sec. 4.3). On the other hand, some work concentrates on embedding the whole graph as one vector (Fig. 1(e)) for graph-level applications; graph kernels (Sec. 4.4) are usually designed to meet this need.

The problem of graph embedding is related to two traditional research problems, i.e., graph analytics [8] and representation learning [9]. In particular, graph embedding aims to represent a graph as low-dimensional vectors while the graph structures are preserved. On the one hand, graph analytics aims to mine useful information from graph data. On the other hand, representation learning obtains data representations that make it easier to extract useful information when building classifiers or other predictors [9]. Graph embedding lies in the overlap of the two problems and focuses on learning low-dimensional representations. Note that we distinguish graph representation learning from graph embedding in this survey: graph representation learning does not require the learned representations to be low dimensional. For example, [10] represents each node as

In observation of the challenges faced in different problem settings, we propose two taxonomies of graph embedding work, categorizing the graph embedding literature by problem setting and by embedding technique. These two taxonomies correspond to what challenges exist in graph embedding and how existing studies address these challenges. In particular, we first introduce the different settings of the graph embedding problem as well as the challenges faced in each setting. We then describe how existing studies address these challenges in their work, including their insights and their technical solutions.