Graph Convolutional Networks with Markov Random Field Reasoning for Social Spammer Detection
Yongji Wu,1 Defu Lian,1∗ Yiheng Xu,2 Le Wu,3 Enhong Chen1
1University of Science and Technology of China
2Harbin Institute of Technology
3Hefei University of Technology
fwuyongji317, dove.ustc, hi.ranpox, [email protected], [email protected]
∗Corresponding author
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

The recent growth of social networking platforms has also led to the emergence of social spammers, who overwhelm legitimate users with unwanted content. Existing social spammer detection methods fall into two categories: feature-based ones and propagation-based ones. Feature-based methods mainly rely on matrix factorization over tweet text features, with regularization derived from social graphs. However, these methods are fully supervised and can only utilize the labeled part of social graphs, so they fail to work in a real-world semi-supervised setting. Propagation-based methods primarily employ Markov Random Fields (MRFs) to capture human intuitions about user following relations, but they cannot take advantage of rich text features. In this paper, we propose a novel social spammer detection model based on Graph Convolutional Networks (GCNs) that operate on directed social graphs by explicitly considering three types of neighbors. Furthermore, inspired by the propagation-based methods, we propose an MRF layer with refining effects to encapsulate these human insights into social relations. This layer can be formulated as an RNN through mean-field approximate inference and stacked on top of the GCN layers to enable end-to-end training. We evaluate our proposed method on two real-world social network datasets, and the results demonstrate that it outperforms the state-of-the-art approaches.

Introduction

Online Social Networks (OSNs) such as Facebook and Twitter have gained increasing popularity in recent years as places for users to interact and communicate. Nowadays, they have become a universal platform for users to discuss events and share personal experiences. However, with this growing popularity, a new kind of malicious user, known as the social spammer, has surfaced (Webb, Caverlee, and Pu 2008). These spammers launch various attacks on social networks with fake accounts, for instance spreading advertisements to promote sales, posting tweets containing links to pornographic sites (Singh, Bansal, and Sofat 2016), or hijacking trending topics (VanDam and Tan 2016). A recent study (Varol et al. 2017) estimated that between 9% and 15% of active Twitter accounts are fake. The malicious behavior of social spammers poses a severe threat to the quality of user experience; hence, effectively identifying these spammers is of great real-world importance in the development of OSNs.

A number of social spammer detection methods have been proposed, following spam detection in traditional environments like email (Blanzieri and Bryl 2008) and Web pages (Gyongyi and Garcia-Molina 2005). Most existing models can be divided into two categories: feature-based approaches and propagation-based ones. Feature-based methods (Zhu et al. 2012; Hu et al. 2013; 2014; Hu, Tang, and Liu 2014; Shen et al. 2017) generally exploit text features mined from tweets posted by users. Matrix factorization is performed on these features, and social graphs are used in regularization. However, these matrix factorization based methods are fully supervised; they can only utilize the labeled part of the social graph and hence need a large number of labeled samples to work successfully. Recently, Li et al. (2018) proposed a semi-supervised model using an autoencoder framework. This model uses node2vec and doc2vec embeddings as input features for the text view and the social graph view. However, it fails to capture the interactions between users in social graphs explicitly. The propagation-based methods (Wang, Zhang, and Gong 2017; Wang, Gong, and Fu 2017), also called guilt-by-association methods, assume some sort of correlation between a pair of users and model these intuitions using a Markov Random Field (MRF). However, these methods simply compute the posterior distributions of MRFs using a pre-defined pairwise influence weight, and cannot benefit from features in tweet text.

To this end, we propose a novel model for social spammer detection that takes advantage of both feature-based and propagation-based methods. Since Graph Convolutional Networks (GCNs), developed in recent years (Defferrard, Bresson, and Vandergheynst 2016; Kipf and Welling 2016), can combine both graph structures and node features for semi-supervised learning, we use them as the building block of our model. Besides, the ability of GCNs to propagate information layer by layer allows them to learn localized patterns at different scales. In our scenario, we consider that different directions in users' following relations entail different underlying behaviors of users. Hence we assign an independent weight matrix to each type of neighbor in the message-passing process of GCNs. We further propose to stack an MRF layer with refining effects on top of the GCN. The MRF layer captures human insights into neighbors' influences on a user's identity (for instance, spammers tend to follow a large number of users). It is able to fix incorrect predictions made by the GCN. We use the mean-field approximation to compute posterior distributions of the MRF and formulate it as a Recurrent Neural Network (RNN) which performs multi-step inference to ensure convergence.

The main contributions of this paper are listed as follows:
1. We propose a novel end-to-end deep learning model for social spammer detection based on GCNs that operate on directed social graphs, and an MRF layer that captures human insights into user following relations to refine predictions made by the GCN. To the best of our knowledge, this is the first semi-supervised social spammer detection model that seamlessly integrates both feature-based methods and propagation-based ones.
2. We formulate the computation of the posterior distributions of the MRF as an RNN which computes the result of each time step based on outputs from the previous time step using the same weight matrix, and stack it on top of the GCN layers. We empirically investigate the indispensability of multi-step inference through the RNN in the MRF layer.
3. We conduct extensive experiments on two real-world Twitter datasets and achieve superior performance. We illustrate the refining effects of the MRF layer by giving concrete examples. We also demonstrate the vital role of explicitly considering three kinds of neighbors in the GCN, as well as the importance of jointly training the GCN and the MRF.

Related Work

Social spammer detection. There are many studies on social spammer detection. Zhu et al. (2012) proposed one of the earliest social spammer detection approaches based on matrix factorization, where an undirected graph Laplacian is used to incorporate the topology information and multi-label informed latent semantic indexing is used to model the context information. Hu, Tang, and Liu (2014) extended this method to exploit the direction information of user following relations. Sentiment information is considered to assist matrix factorization in (Hu et al. 2014). Fu et al. (2017) investigated the carefulness of users in social networks and how the robustness of the detection algorithms can be improved with the aid of user carefulness.

… GCN layers and can be jointly trained with GCN; while they simply compute the posterior distribution using pre-defined weights through loopy belief propagation, given a few labeled nodes.

Graph convolutional networks. In recent years, considerable efforts have been devoted to extending traditional convolutional neural networks (CNNs), which operate on Euclidean structures, to arbitrary graphs. Bruna et al. (2014) first proposed the convolution operation on graphs based on spectral graph theory, which was extended by Defferrard, Bresson, and Vandergheynst (2016) using Chebyshev polynomials to approximate filters. Kipf and Welling (2016) further applied the first-order approximation to develop a layer-wise linear model for fast and scalable semi-supervised node classification. GCNs have since been utilized in many fields. Wu et al. (2019) proposed GCNs on User Mobility Heterogeneous Graphs to infer social relations from trajectory data. Wang, Lian, and Ge (2019) distilled the ranking information derived from GCN into binarized collaborative filtering to improve the efficiency of online recommendation. Jin et al. (2019) integrated MRF and GCN for semi-supervised community detection. However, our method is quite different from theirs. While they use a fully-connected MRF for community detection, we propose a novel sparse MRF for spammer detection by modeling intuitions about different types of neighbors' influences on a user's label, which can be effectively implemented using sparse-dense matrix multiplication with linear time complexity. Furthermore, Jin et al. (2019) perform only one-step inference when computing posterior distributions of the MRF, in which case convergence cannot be guaranteed and poor performance would result. We fix this problem by formulating the MRF layer as an RNN and conducting multi-step inference.

Proposed Method

Social spammer detection is essentially a two-class classification problem. We aim to build a classifier that accurately assigns identity labels to users in the test set, given a training user set, the social network, and/or features of each user. Our proposed social spammer detection model is built on the basis of GCN and MRF. First, graph convolution is performed on directed social graphs by explicitly considering different types of neighbors. Then we present three intuitions of neighbors' influences on a user's label. These intuitions are