Collective Classification of Spam Campaigners on Twitter: a Hierarchical Meta-Path Based Approach
Total Page:16
File Type:pdf, Size:1020Kb
Collective Classification of Spam Campaigners on Twitter: A Hierarchical Meta-Path Based Approach Srishti Gupta Abhinav Khattar Arpit Gogia IIIT-Delhi IIIT-Delhi DTU [email protected] [email protected] [email protected] Ponnurangam Kumaraguru Tanmoy Chakraborty IIIT-Delhi IIIT-Delhi [email protected] [email protected] ABSTRACT KEYWORDS Cybercriminals have leveraged the popularity of a large user base Spam Campaign, phone number, heterogeneous network, meta- available on Online Social Networks (OSNs) to spread spam cam- path, Twitter, Online Social Networks paigns by propagating phishing URLs, attaching malicious contents, ACM Reference Format: etc. However, another kind of spam attacks using phone numbers Srishti Gupta, Abhinav Khattar, Arpit Gogia, Ponnurangam Kumaraguru, has recently become prevalent on OSNs, where spammers adver- and Tanmoy Chakraborty. 2018. Collective Classification of Spam Campaign- tise phone numbers to attract users’ attention and convince them ers on Twitter: A Hierarchical Meta-Path Based Approach. In WWW 2018: to make a call to these phone numbers. The dynamics of phone The 2018 Web Conference, April 23–27, 2018, Lyon, France. ACM, New York, number based spam is different from URL-based spam due toan NY, USA, 10 pages. https://doi.org/https://doi.org/10.1145/3178876.3186119 inherent trust associated with a phone number. While previous work has proposed strategies to mitigate URL-based spam attacks, 1 INTRODUCTION phone number based spam attacks have received less attention. Online Social Networks (OSNs) are becoming more and more popu- In this paper, we aim to detect spammers that use phone num- lar in the recent years, used by millions of users. As a result, OSNs bers to promote campaigns on Twitter. To this end, we collected are being abused by spam campaigners to carry out phishing and information (tweets, user meta-data, etc.) about 3; 370 campaigns spam attacks [18]. While attacks carried using URLs [8, 16, 18, 42, spread by 670; 251 users. We model the Twitter dataset as a hetero- 47] has been extensively explored in the literature, attacks via a new geneous network by leveraging various interconnections between action token, i.e., a phone number is mostly unexplored. Tradition- different types of nodes present in the dataset. In particular, we ally, spammers have been exploiting telephony system in carrying make the following contributions – (i) We propose a simple yet out social engineering attacks either by calling victims or sending effective metric, called Hierarchical Meta-Path Score (HMPS) to mea- SMS [45]. Recently, spammers have started abusing OSNs where sure the proximity of an unknown user to the other known pool of they float phone numbers controlled by them. Besides exploiting spammers. (ii) We design a feedback-based active learning strategy trust associated with a phone number, spammers save efforts in and show that it significantly outperforms three state-of-the-art reaching out their victims themselves. baselines for the task of spam detection. Our method achieves 6.9% Present Work: Problem definition. In this paper, we aim to and 67.3% higher F1-score and AUC, respectively compared to the detect spam campaigners (aka, spammers) spreading spam cam- best baseline method. (iii) To overcome the problem of less train- paigns using phone numbers on Twitter. We here define spammers ing instances for supervised learning, we show that our proposed as user accounts that use phone numbers to aggressively promote feedback strategy achieves 25.6% and 46% higher F1-score and AUC products, disseminate pornography, entice victims for lotteries and respectively than other oversampling strategies. Finally, we per- discounts, or simply mislead victims by making false promises. form a case study to show how our method is capable of detecting Discovering the correspondence between the spammer accounts arXiv:1802.04168v1 [cs.SI] 12 Feb 2018 those users as spammers who have not been suspended by Twitter and the resources (such as URL or phone number) used for spam (and other baselines) yet. activities is a crucial task. As the phone numbers are being prop- agated by spammers, and their monetization revenue starts once CCS CONCEPTS people call them, it is fair to assume that these phone numbers • Information systems; • Security and privacy; • Applied com- would be under their control. As an added advantage of this, if puting; we can identify the spammer accounts in Twitter and bring them down, the entire campaign would get disintegrated. To identify spammers, we model the Twitter dataset as a heterogeneous graph This paper is published under the Creative Commons Attribution 4.0 International where there are different connections between heterogeneous type (CC BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution. of entities: users, campaigns, and the action tokens (phone number WWW 2018, April 23–27, 2018, Lyon, France or URL) as shown in Figure 1. Heterogeneous networks have been © 2018 IW3C2 (International World Wide Web Conference Committee), published proposed for data representation in a variety of datasets like path under Creative Commons CC BY 4.0 License. ACM ISBN 978-1-4503-5639-8. similarity in scholar data [39], link prediction in social network https://doi.org/https://doi.org/10.1145/3178876.3186119 data [22], etc. Objects of different types and links carry different User We collected tweets and other meta-data information of users propagating propagated-by from April-October, 2016, and identified 3; 370 campaigns, contain- Phno shared-by ing 670; 251 users (Section 2). Each tweet carries a phone number. sharing using We consider user accounts suspended by Twitter as ground-truth used-by Camp spammers. However, due to the lack of enough training samples per using using campaign, we introduce a novel feedback-based active learning mech- used-by used-by anism that uses a SVM-based one-class classifier for each campaign. Phno User URL Over multiple iterations, it keeps accumulating evidences from dif- sharing ferent campaigns to enrich the training set for each campaign. This, shared-by in turn, enhances the prediction performance of individual classi- fiers. The process terminates when there is no chance of finding Figure 1: Twitter modeled as a heterogeneous network. the label of the unknown users across iterations. Summary of the evaluation. We compare our model with three state-of-the-art baselines used for spam detection (Section 5.3). semantics. For instance, a phone number being a more stable re- We design various experimental setup to perform a thorough eval- source would help in connecting user accounts over an extended uation of our proposed method. We observe that our model out- period. Physical identity verification is required to purchase phone performs the best baseline method by achieving 44.8%, 16.7%, 6.9%, numbers, while only e-mail verification is sufficient to purchase 67.3% higher performance in terms of accuracy, precision, F1-score domains. Studying similarity between a pair of nodes keeping the and AUC (Section 5.3). We further demonstrate how / why one-class heterogeneous nature of the network helps in distinguishing the classifier (Section 5.4), active learning (Section 5.5) and feedback- semantics of different types of paths connecting these two nodes. based learning (Section 5.6) are better than 2-class classifier, general To distinguish the semantics among paths connecting two nodes, learning and other oversampling method, respectively. Moreover, we introduce a meta-path based similarity framework for nodes we conduct a case study and present an intuitive justification why of the same type in the heterogeneous network. A meta-path is our method is superior to the other methods (Section 5.3). a sequence of relations between node types, which defines a new composite relation between its starting type and ending type. It provides a powerful mechanism to classify objects sharing similar semantics appropriately. 2 DATASET Present Work: Motivation of the work. The problem of iden- We collected tweets containing phone numbers from Twitter based tifying spammers on Twitter that use phone numbers is useful in on an exhaustive list of 400 keywords via Twitter streaming API. many aspects. Attacks using phone numbers and URLs are different We chose Twitter due to easy availability of data. The data was in some aspects: in URL based spam, the campaign propagates and collected from April - October, 2016. Since we intended to detect spreads on the same medium, i.e., OSNs, while in case of phone campaigns around phone numbers, the keywords we chose were number based spam, the attacking medium is a telephone and the specific to phone number such as ‘call’, ‘SMS’, ‘WA’, ‘ring’ etc.We propagating medium is OSNs. As a result, it is challenging for OSN accumulated ∼ 22 million tweets, each of which containing at service providers to track down the accounts spreading these spam least one phone number. The reason behind collecting only tweets campaigns. In addition, there is no meta-data available for phone containing phone numbers is that they are found to be a stable numbers, unlike URLs where landing page information, length of resource, i.e., spammers use them for a long period due to attached URLs, obfuscation, etc. can be checked. Perhaps due to the chal- cost. Moreover, the phone numbers are known to help in forming lenges associated with finding spam phone numbers, there have better user communities [9], which is the basis of the approach been several attacks and financial losses caused by the phone-based adopted in this work. attacks [31]. Using the collective classification approach proposed Campaign identification: We define a campaign as a group of in the paper, Twitter will be able to find potential spammers and sus- similar posts shared by a set of users propagating multiple phone pend the accounts, thereby restricting phone number based spam numbers.