Motif Aware Node Representation Learning for Heterogeneous Networks
motif2vec: Motif Aware Node Representation Learning for Heterogeneous Networks

Manoj Reddy Dareddy* (University of California, Los Angeles; Los Angeles, CA, USA)
Mahashweta Das (Visa Research; Palo Alto, CA, USA)
Hao Yang (Visa Research; Palo Alto, CA, USA)

*Work done as an intern at Visa Research.

arXiv:1908.08227v1 [cs.SI] 22 Aug 2019

Abstract: Recent years have witnessed a surge of interest in machine learning on graphs and networks, with applications ranging from vehicular network design to IoT traffic management to social network recommendations. Supervised machine learning tasks in networks, such as node classification and link prediction, require us to perform feature engineering, which is known and agreed to be the key to success in applied machine learning. Research efforts dedicated to representation learning, especially representation learning using deep learning, have shown us ways to automatically learn relevant features from vast amounts of potentially noisy, raw data. However, most of these methods are not adequate to handle heterogeneous information networks, which represent most real-world data today. The methods cannot preserve the structure and semantics of multiple types of nodes and links well enough, capture higher-order heterogeneous connectivity patterns, or ensure coverage of nodes for which representations are generated. In this paper, we propose a novel efficient algorithm, motif2vec, that learns node representations or embeddings for heterogeneous networks. Specifically, we leverage higher-order, recurring, and statistically significant network connectivity patterns in the form of motifs to transform the original graph to motif graph(s), conduct biased random walks to efficiently explore higher-order neighborhoods, and then employ a heterogeneous skip-gram model to generate the embeddings. Unlike previous efforts that use different graph meta-structures to guide the random walk, we use graph motifs to transform the original network and preserve the heterogeneity. We evaluate the proposed algorithm on multiple real-world networks from diverse domains and against existing state-of-the-art methods on multi-class node classification and link prediction tasks, and demonstrate its consistent superiority over prior work.

Author Terms: heterogeneous information networks, network embedding, network representation learning, feature learning, motifs

I. INTRODUCTION

Recent years have witnessed a surge of interest in machine learning on graphs and networks, with applications ranging from vehicular network design to IoT traffic management to drug discovery to social network recommendations. Graph-based data representation enables us to understand objects with respect to the neighboring world instead of just observing them in isolation. Thus, there is an increasing trend of representing data that is not naturally connected as graphs. Examples include an item graph constructed from users' behavior history that is originally sequential in nature [35], a product review graph constructed from reviews written by users for stores [34], a credit card fraud network constructed from fraudulent and non-fraudulent transaction activity data [33], etc.

Supervised machine learning tasks over nodes and links in networks¹, such as node classification and link prediction, require us to perform feature engineering, which is known and agreed to be the key to success in applied machine learning. However, feature engineering is challenging and tedious since the traditional process relies on domain knowledge, intuition, data manipulation, and manual intervention. Research efforts dedicated to representation learning, i.e., learning representations of the data that make it easier to extract useful information when training classifiers or other predictors, have shown us ways to automatically learn relevant features from vast amounts of potentially noisy, raw data. Of particular interest to the academic and industry research community has been representation learning using deep learning [2], in which representations are formed by the composition of multiple non-linear transformations with the goal of yielding more useful representations. There has been a series of works over the past half decade that focus on graph node representation or graph embedding algorithms [7]. The common goal of these works is to obtain a low-dimensional feature representation of each node of the graph such that the method is scalable and the vector representation preserves some structure and connectivity pattern between individual nodes in the graph. Graph embedding methods are broadly classified into three categories, namely factorization based, random walk based, and deep learning based, with applications in network compression, visualization, clustering, link prediction, and node classification [7]. Among the three categories, random walk based graph embedding techniques have emerged as the most popular since they help approximate many network properties, are useful when the network is too large to measure in its entirety, and can work with partially observable networks. The popular random walk based methods include DeepWalk [18], node2vec [8], LINE [29], HARP [5], etc.

¹ We use the terms network (nodes, links) and graph (vertices, edges) interchangeably throughout the paper.

However, most of these methods are designed for homogeneous networks and are inadequate to handle heterogeneous information networks, i.e., networks with multiple types of nodes and links, which represent most real-world data today. Contemporary information networks like Facebook, DBLP, Yelp, Flickr, etc. contain multi-type interacting components. For example, the social network Facebook has different types of objects (nodes) such as users, posts, and photos, as well as different kinds of associations (links) such as user-user friendship, person-photo tagging relationship, post-post replying relationship, etc. Researchers today acknowledge that heterogeneous networks fuse more information and support richer semantic representation of the real world [22] [24]. They also emphasize that data mining approaches designed for homogeneous graphs are not well suited to handle heterogeneous graphs. For example, classification in homogeneous networks is traditionally done on objects of the same entity type, makes strong assumptions on the network structure, and assumes that data is independently and identically distributed (i.i.d.). In contrast, classification in heterogeneous networks needs to simultaneously classify multiple types of objects, which may be organized arbitrarily and may violate the i.i.d. assumption. Thus, there is an innate need to develop graph embedding methods for heterogeneous networks.

Dong et al. formally introduced the problem and proposed a novel algorithmic solution, metapath2vec [6], that leverages the metapath, the most popular graph meta-structure for heterogeneous network mining [24]. A more recent work proposed metagraph2vec [37], which leverages the metagraph in order to capture richer structural contexts and semantics between distant nodes. Other heterogeneous network embedding methods include PTE [28], a semi-supervised representation learning method for text data; HNE [4], which learns a representation for each modality of the network separately and then unifies them into a common space using linear transformations; LANE [11], which generates embeddings for attributed networks; and ASPEM [23] that captures the incompatibility in heterogeneous

In this paper, we propose a novel efficient algorithm, motif2vec, that learns node representations or embeddings for heterogeneous information networks. Specifically, we leverage higher-order, recurring, and statistically significant network connectivity patterns in the form of motifs to learn higher quality embeddings. Motifs are one of the most common higher-order data structures for understanding complex networks and have been popularly recognized as fundamental units of networks [3]. They have been successfully used in many network mining tasks such as clustering [32] [36], anomaly detection [31], and convolution [21]. However, no prior work has investigated the scope and impact of motifs in learning node embeddings for heterogeneous networks. Rossi et al. introduced the problem of higher-order network representation learning using motifs for homogeneous networks [20], but the method cannot be extended to handle heterogeneous networks. HONE [20] does not combine the best of both worlds, as we do: a random walk based method that accounts for local neighborhood structure and a motif-aware method that accounts for higher-order global network connectivity patterns. In addition, HONE (as well as other existing methods) does not include the original network in the learning process, as we do. Including the original network ensures higher coverage of connected nodes.

Our algorithm motif2vec transforms the original graph to motif graph(s), conducts biased random walks to efficiently explore higher-order neighborhoods, and then employs a heterogeneous skip-gram model to generate the embeddings. Related efforts in heterogeneous network node embedding, namely metapath2vec [6] and metagraph2vec [37], are limited to exploring only the neighborhoods, nodes, and links participating in the meta-structure of interest. motif2vec leverages motifs to transform the original graph to a motif representation and conducts regular random walks on the entire transformed graph. We evaluate our algorithm on multiple real-world networks from diverse domains and