Social Network Analysis and Mining, Volume 7, Issue 1 (Author's version)

Facebook Inspector (FbI): Towards Automatic Real Time Detection of Malicious Content on Facebook

Prateek Dewan · Ponnurangam Kumaraguru

Accepted: April 12, 2017

Abstract Online Social Networks (OSNs) witness a rise in user activity whenever a major event makes news. Cyber criminals exploit this spur in user engagement levels to spread malicious content that compromises system reputation, causes financial losses and degrades user experience. In this paper, we collect and characterize a dataset of 4.4 million public posts generated on Facebook during 17 news-making events (natural calamities, sports, terror attacks, etc.) over a 16-month time period. From this dataset, we filter out two sets of malicious posts, one using URL blacklists and another using human annotations. Our observations reveal some characteristic differences between malicious posts obtained from the two methodologies, thus demanding a two-fold filtering process for a more complete and robust filtering system. We empirically confirm the need for this two-fold filtering approach by cross validating supervised learning models obtained from the two sets of malicious posts. These supervised learning models include Naive Bayesian, Decision Tree, Random Forest, and Support Vector Machine (SVM) based models. Based on this learning, we implement Facebook Inspector (FbI), a REST API based browser plug-in for identifying malicious Facebook posts in real time. Facebook Inspector uses class probabilities obtained from two independent supervised learning models based on a Random Forest classifier to identify malicious posts in real time. These supervised learning models are based on a feature set comprising of 44 features, and achieve an accuracy of over 80% each, using only publicly available features. During the first nine months of its public deployment (August 2015 - May 2016), Facebook Inspector processed 0.97 million posts at an average response time of 2.6 seconds per post, and was downloaded over 2,500 times. We also evaluate Facebook Inspector in terms of performance and usability to identify further scope for improvement.

Keywords Facebook · Malicious Content · Machine Learning · Realtime System

P. Dewan
Precog, Indraprastha Institute of Information Technology - Delhi (IIITD), India
Tel.: +91-11-26907479, Fax: +91-11-26907405
E-mail: [email protected]

P. Kumaraguru
Precog, Indraprastha Institute of Information Technology - Delhi (IIITD), India
Tel.: +91-11-26907468, Fax: +91-11-26907405
E-mail: [email protected]

1 Introduction

Social network activity rises considerably during events that make news, like sports, natural calamities, etc. [Szell et al., 2014]. For example, the 2014 FIFA World Cup final inspired more than 618,000 tweets per minute, a new record for Twitter. Facebook also saw 350 million users generating over 3 billion posts, comments and likes during the 32 days of the world cup. 1 This enormous magnitude of activity during sports and other news events makes OSNs lucrative venues for malicious entities to compromise system reputation and seek monetary gains. Facebook, being the most preferred OSN for users to get news [Holcomb et al., 2013], is potentially the most attractive platform for malicious entities to launch cyber-attacks.

1 http://edition.cnn.com/2014/07/14/tech/social-media/world-cup-social-media/
These attacks have become more sophisticated over the years, and are no longer limited to unsolicited bulk messages (spam and promotional campaigns), drive-by downloads, etc. Recently, cyber criminals exploited the context of various news events to spread hoaxes and misinformation, luring victims into scams, and attacks on Facebook [Marca.com, 2014, Zech, 2014]. It has been claimed that Facebook spammers make $200 million just by posting links [TheGuardian, 2013]. Such activities not only degrade user experience but also violate Facebook's terms of service. In rare cases, hoaxes have reached the extent of claiming human lives. 2 Facebook has acknowledged spam and hoaxes as a serious issue, and taken steps to reduce malicious content in users' newsfeed [Owens and Turitzin, 2014, Owens and Weinsberg, 2015].

2 http://news.discovery.com/human/psychology/social-media-ebola-hoax-causes-deaths-14100.htm

Researchers have been studying malicious content in the form of spam and phishing for over two decades. However, the definition and scope of what should be labeled as "malicious content" on the Internet has been constantly evolving since the birth of the Internet. With respect to Online Social Networks, state-of-the-art techniques have become efficient in automatically detecting spam campaigns [Gao et al., 2010, Zhang et al., 2012], and phishing [Aggarwal et al., 2012] without human involvement. Meanwhile, new classes of malicious content pertaining to appropriateness, authenticity, trustworthiness, and credibility of content have emerged in the recent past. Some researchers have studied these classes of malicious content on OSNs and shown their implications in the real world [Castillo et al., 2011, Gupta and Kumaraguru, 2012, Gupta et al., 2012, Mendoza et al., 2010]. All of these studies, however, resorted to human expertise to identify untrustworthy and inappropriate content and establish ground truth, due to the absence of efficient automated techniques to identify such content. We focus on a similar class of malicious content pertaining to trustworthiness and appropriateness in this work (in addition to traditional spam, and phishing), which currently requires human expertise to identify.

In this paper, we address the problem of automatic real-time detection of malicious content generated during news-making events, that is currently evading Facebook's detection techniques [Stein et al., 2011]. To this end, we collected 4.4 million public posts generated by 3.3 million unique entities (users / pages) during 17 news-making events that took place between April 2013 and July 2014. We constructed two ground truth datasets for malicious Facebook posts, using URL blacklists and human-annotation. Comparing the two datasets revealed various differences (and some similarities) amongst the malicious posts obtained using the two methodologies. Thus, we propose a two-fold scheme to identify malicious posts using two separate supervised learning models. We propose an extensive feature set consisting of 44 publicly available features to automatically distinguish malicious content from legitimate content in real time. Unlike prior work [Gao et al., 2010, Gao et al., 2012, Rahman et al., 2012], our technique does not rely on message similarity features which have been heavily used to detect spam campaigns in the past. In addition, we do not rely on the engagement level achieved by posts (likes, comments, etc.), since these attributes build up over time, and are unavailable at zero-hour.

Our experiments show that prior clustering based spam detection techniques are able to detect less than half the number of malicious posts as compared to our supervised learning model. We use our models to deploy Facebook Inspector (FbI), a REST API 3 based browser plug-in, that can be used to identify malicious content on Facebook in real time. FbI is freely available for both Google Chrome and Mozilla Firefox browsers, and has been downloaded over 2,500 times in the first nine months of its deployment. During this period, FbI received over 2.7 million requests and has evaluated slightly over 0.97 million unique public Facebook posts. Using this data, we evaluated FbI in terms of response time, and found that the response time for approximately 80% of all public posts analyzed by FbI was under 3 seconds. Our contributions are as follows:

– Characterization of malicious content generated on Facebook during news-making events. Our dataset of 4.4 million public posts is one of the biggest datasets of public Facebook posts in literature.
– Extensive feature set for identifying malicious content in real time, excluding features like likes, comments, shares, etc. which are absent at post creation time.
– Two-fold filtering approach using models trained on separate ground truth datasets obtained through human-annotation and URL blacklists.
– Publicly available end-user solution (Facebook Inspector) in the form of a REST API and a browser plug-in to identify malicious posts in real time. We also evaluated FbI with real users and found it fast and useful in most cases.

3 http://multiosn.iiitd.edu.in/fbapi/endpoint/?version=2.0&fid=hpost_id
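To make the two-fold scheme above concrete, the sketch below (ours, not the authors' released code) trains one Random Forest per ground truth dataset and flags a post when either model assigns it a high malicious-class probability. The 0.5 threshold and the max-combination rule are illustrative assumptions; the paper only states that FbI combines class probabilities obtained from the two models.

from sklearn.ensemble import RandomForestClassifier

model_blacklist = RandomForestClassifier(n_estimators=100, random_state=42)
model_annotation = RandomForestClassifier(n_estimators=100, random_state=42)

def fit_models(X_blacklist, y_blacklist, X_annotation, y_annotation):
    # Dataset I (URL-blacklist labels) and Dataset II (human-annotation labels).
    model_blacklist.fit(X_blacklist, y_blacklist)
    model_annotation.fit(X_annotation, y_annotation)

def is_malicious(features, threshold=0.5):
    # features: one 44-dimensional vector (Table 6); labels assumed 0 = legitimate, 1 = malicious.
    p_blacklist = model_blacklist.predict_proba([features])[0][1]
    p_annotation = model_annotation.predict_proba([features])[0][1]
    # Combining via max over the two probabilities is our assumption, not a rule stated by the authors.
    return max(p_blacklist, p_annotation) >= threshold, (p_blacklist, p_annotation)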

The rest of the paper is organized as follows. We discuss the related work in Section 2. The methodology adopted for data collection and labeled dataset creation is described in Section 3. Section 4 discusses the characterization and analysis of our datasets. The results of our automatic detection techniques are described in Section 5. Section 6 describes the implementation details and evaluation of Facebook Inspector. Section 7 discusses the challenges we faced in working with Facebook. Section 8 explains the limitations of our research, and we conclude our work in Section 9.

2 Related work

Facebook has its own immune system to safeguard its users from unwanted malicious content [Stein et al., 2011]. This system performs real-time checks and classifications on every read and write action. Facebook's immune system is based on four design principles, viz. quick detection and response, covering a broad and evolving interface, sharing signals across channels, and classification in real time. This system uses a set of components and techniques to differentiate between legitimate actions and spam. These components are standard classifiers like Random Forest, Support Vector Machines, Logistic Regression, a feature extraction language, dynamic model loading, a policy engine, and feature loops. Despite this immune system deployed by Facebook, unwanted spam, phishing, and other malicious content continues to exist on Facebook.

2.1 Detection of malicious content on Facebook

Gao et al. presented an initial study to quantify and characterize spam campaigns launched using accounts on Facebook [Gao et al., 2010]. They studied a large anonymized dataset of 187 million asynchronous "wall" messages between Facebook users, and used a set of automated techniques to detect and characterize coordinated spam campaigns. The authors detected roughly 200,000 malicious wall posts with embedded URLs, originating from more than 57,000 user accounts. Following up their work, Gao et al. presented an online spam filtering system that could be deployed as a component of the OSN platform to inspect messages generated by users in real time [Gao et al., 2012]. Their approach focused on reconstructing spam messages into campaigns for classification rather than examining each post individually. They were able to achieve a true positive rate of slightly over 80% using this technique on their dataset of 187 million Facebook wall posts. The authors reported an average throughput of 1,580 messages/sec and an average processing latency of 21.5 ms. However, the clustering approach used by the authors always marked a new cluster as legitimate and was unable to detect malicious posts if the system had not seen a similar post before. We overcome this drawback by eliminating dependency on post similarity and using classification instead of clustering.

In an attempt to protect Facebook users from malicious posts, Rahman et al. designed a social malware detection method which took advantage of the social context of posts [Rahman et al., 2012]. The authors were able to achieve a maximum true positive rate of 97%, using an SVM based classifier trained on 6 features. The classifier required 46 ms to classify a post. This model was then used to develop MyPageKeeper, 4 a Facebook app to protect users from malicious posts. Similar to Gao et al.'s work [Gao et al., 2010], this work was also targeted at detecting spam campaigns, and relied on message similarity features. Such techniques are efficient in detecting content which they have seen in the past, for example, campaigns. However, none of these techniques are capable of detecting malicious posts in real time, which their systems haven't seen in the past.

4 https://apps.facebook.com/mypagekeeper/

2.2 Detection of malicious content on other OSNs

Building on their prior work on detecting spam campaigns on Facebook, Gao et al. proposed Tangram, a system to detect template-based spam on Twitter in real time [Gao et al., 2014]. The authors empirically analyzed the textual pattern of a large collection of Twitter spam, and reported that the majority (63.0%) of the collected spam is generated with underlying templates. By extracting and exploiting these templates, the authors reported a true positive rate of 85.4% for detecting template based spam on Twitter. The proposed algorithm is also tested on comments generated on Facebook, and yields a lower true positive rate of 77.8% [Zhu et al., 2016]. Although Tangram achieves decent detection rates for template based spam, it is important to note that less than two-thirds (63.0%) of all spam messages, as reported by the authors, are template based. In contrast, our approach behind Facebook Inspector enables it to detect malicious content including template based spam across the entire landscape of malicious content, and still achieve a higher true positive detection rate.

In addition to template based spam detection, multiple machine learning models have been proposed in the past to detect malicious content on other social networks such as Twitter and YouTube [Benevenuto et al., 2010, Grier et al., 2010, Wang, 2010]. The efficiency of such models comes from features like age of the account, number of social connections, past messages of the user, etc. [Benevenuto et al., 2010], which are not available publicly on Facebook. Other techniques make use of OSN specific features like user replies, user mentions, retweets (Twitter) [Grier et al., 2010], views and ratings (YouTube) [Benevenuto et al., 2009], which are absent in Facebook. Blacklists have been shown to be ineffective, capturing less than 20% URLs at zero-hour [Sheng et al., 2009].

For efficient real time detection of malicious posts on Facebook, our approach utilizes various Facebook-specific features (for example, page category, and post type) which are publicly available at the time of post creation. This approach eliminates any dependency on post similarity. In addition, our model does not rely on features which are generated over time, for example likes, comments, shares, etc. This makes our approach robust for detection in real time, as soon as a post appears on the social network.

3 Methodology

There exists a wide range of malicious content on OSNs today. These include phishing, advertising campaigns, content originating from compromised profiles, artificial reputation gained through fake likes, etc. We do not intend to address all such attacks. We focus our research on identifying two kinds of malicious posts, a) posts containing a malicious URL, and b) posts which violate Facebook's terms 5 or community standards 6 (containing hate speech, profanity, etc.) or are spam with respect to the event under consideration. We attempt to create automated means to detect such posts in real time, without looking at the landing pages of the URLs (if any) and ignoring features which are absent at post creation time (likes, comments, etc.). We stress on not visiting the landing pages of URLs since this process induces time lag and increases the time taken by real-time systems to make a judgment on a post. Figure 1 represents the high level flow diagram of the methodology we followed in this paper.

Fig. 1: High level flow diagram capturing the methodology followed in this work. (Facebook Graph API → ground truth extraction using URL blacklists and human annotation, and feature vector generation → supervised learning for model generation → browser extension for deployment and evaluation.)

3.1 Data collection

We collected data using Facebook's Graph API Search endpoint [Facebook Developers, 2013] during 17 news-making events that took place between April 2013 and July 2014. We used event specific terms for each of the 17 events (see Table 1) to collect relevant public posts. All events we picked for our analysis made headlines in international news. To maintain diversity, we selected events covering various domains like political, sport, natural hazards, terror strikes and entertainment news. For all 17 events, we started data collection from the time the event took place, and stopped about two weeks after the event ended.

Unlike other social networks like Twitter, Facebook does not provide an API endpoint to collect a continuous random sample of public posts in real time. Thus, we used the search API to collect data. A drawback of the search method is that if a post is deleted or removed (either by the user herself, or by Facebook) before our data collection module queries the API, it would not appear in the search results. We repeated the search every 15 minutes to overcome this drawback as much as possible. In all, we collected over 4.4 million public Facebook posts generated by over 3.3 million unique users and pages. Table 2 shows the descriptive statistics of our final dataset.

3.2 Labeled dataset creation

Multiple techniques have been used in the past to obtain ground truth data for spam, fake profiles and other malicious content. These include looking up third party URL blacklists for phishing, malware, spam, and other kind of malicious URLs present in content, crowdsourcing methods such as human annotations through services like Amazon's Mechanical Turk 7, CrowdFlower 8, annotation / coding by topic experts, etc. However, most work in the past has employed only one of these techniques to obtain ground truth data for solving a problem. Although each of these techniques are widely accepted, they may not necessarily achieve completeness when used individually. URL blacklists, for exam-

5 https://www.facebook.com/legal/terms
6 https://www.facebook.com/communitystandards
7 https://www.mturk.com/mturk/welcome
8 http://www.crowdflower.com/
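A minimal sketch of the collection loop described in Section 3.1: a keyword search repeated every 15 minutes per event. The endpoint shown is the public post search that the Graph API offered at the time of the study (it has since been removed); the access token, keyword list, and store() helper are placeholders, not values from the paper.

import time
import requests

SEARCH_URL = "https://graph.facebook.com/search"   # public post search of that API era
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"                 # placeholder
EVENT_KEYWORDS = ["mh17", "malaysia airlines"]     # event-specific terms, cf. Table 1

def search_posts(keyword):
    """Fetch one page of public posts matching an event keyword."""
    params = {"q": keyword, "type": "post", "limit": 100, "access_token": ACCESS_TOKEN}
    response = requests.get(SEARCH_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json().get("data", [])

def store(post):
    """Placeholder persistence helper."""
    pass

while True:
    for keyword in EVENT_KEYWORDS:
        for post in search_posts(keyword):
            store(post)
    time.sleep(15 * 60)  # repeat the search every 15 minutes (Section 3.1)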

Table 1: Event name, keywords used as search queries, number of posts, and description for the 17 events in our dataset of public Facebook posts.

Event (keywords) # Posts Description Missing Air Algerie Flight AH5017 6,767 Air Algerie flight 5017 disappeared from radar 50 minutes after take (ah5017; air algerie) off on July 24, 2014. Found crashed near Mali; no survivors. Boston Marathon Blasts (prayforboston; 1,480,467 Two pressure cooker bombs exploded during the Boston Marathon marathon blasts; boston marathon) at 2:49 pm EDT, April 15, 2013, killing 3 and injuring 264. Cyclone Phailin (phailin; cyclonephailin) 60,016 Phailin was the second-strongest tropical cyclone ever to make land- fall in India on October 11, 2013. FIFA World Cup 2014 (worldcup; 67,406 20th edition of FIFA world cup, began on June 12, 2014. Germany fifaworldcup) beat Argentina in the final to win the tournament. Unrest in Gaza (gaza) 31,302 Israel launched Operation Protective Edge in the Hamas-ruled Gaza Strip on July 8, 2014. Heartbleed bug in OpenSSL (heartbleed) 8,362 Security bug in OpenSSL disclosed on April 1, 2014. About 17% of the world’s web servers found to be at risk. IPL 2013 (ipl; ipl6; ipl2013) 708,483 Edition 6 of IPL cricket tournament hosted in India, April-May 2013. IPL 2014 (ipl; ipl7) 59,126 Edition 7 of IPL cricket tournament jointly hosted by United Arab Emirates and India, April-May 2013. Lee Rigby’s murder in Woolwich 86,083 British soldier Lee Rigby attacked and murdered by Michael Adebo- (woolwich; londonattack) lajo and Michael Adebowale in Woolwich, London on May 22, 2013. Malaysian Airlines Flight MH17 shot 27,624 Malaysia Airlines Flight 17 crashed on 17 July 2014, presumed to down (mh17) have been shot down, killing all 298 on board. Metro-North Train Derailment (bronx 1,165 A Metro-North Railroad Hudson Line passenger train derailed near derailment; metro north derailment; the Spuyten Duyvil station in the New York City borough of the metronorth) Bronx on December 1, 2013. Four killed, 59 injured. Washington Navy Yard Shootings 4,562 Lone gunman Aaron Alexis killed 12 and injured 3 in a mass shooting (washington navy yard; navy yard at the Naval Sea Systems Command (NAVSEA) headquarters inside shooting; NavyYardShooting) the Washington Navy Yard in Washington, D.C. on Sept. 16, 2013. Death of Nelson Mandela (nelson; 1,319,745 Nelson Mandela, the first elected President of South Africa, died on mandela; nelsonmandela; madiba) December 5, 2013. He was 95. Birth of the fist Royal Baby 90,096 Prince George of Cambridge, first son of Prince William, and Cather- (RoyalBabyWatch; kate middleton; ine (Kate Middleton), was born on July 22, 2013. royalbaby) Typhoon Haiyan (haiyan; yolanda; 486,325 Typhoon Haiyan (Yolanda), one of the strongest tropical cyclones typhoon philippines) ever recorded, devastated parts of Southeast Asia on Nov. 8, 2013. T20 Cricket World Cup (wt20; wt2014) 25,209 Fifth ICC World Twenty20 cricket competition, hosted in Bangladesh during March-April, 2014. Sri Lanka won the tournament. Wimbledon Tennis 2014 (wimbledon) 2,633 128th Wimbledon Tennis championship held between June 23, and July 6, 2014. Novak Djokovic from Serbia won the championship. ple, are likely to miss out on self-promoting or advertis- separate datasets of malicious posts. We now discuss ing spam generated by Facebook applications since the the methodologies for each of these datasets in detail. facebook.com domain never appears on a URL black- list. This kind of content, however, can easily be iden- 3.2.1 Dataset I: Using URL blacklists tified and marked by human annotators. 
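Section 3.2.1 resolves every extracted URL by following redirects with Python's Requests package (falling back to the LongURL API on failure) and then labels it using six blacklists plus the Web of Trust rule of Algorithm 1 (reputation below 60 with confidence of at least 10). The sketch below illustrates that pipeline under stated assumptions: blacklisted() and wot_components() are hypothetical stand-ins for the individual service APIs, and the LongURL fallback is omitted.

import requests
from urllib.parse import urlparse

def expand_url(url, timeout=10):
    """Resolve shortened URLs by following redirects; the paper falls back to the
    LongURL API when this fails (fallback omitted here)."""
    try:
        return requests.head(url, allow_redirects=True, timeout=timeout).url
    except requests.RequestException:
        return url

def blacklisted(url):
    """Placeholder: should query SURBL, Google Safebrowsing, SpamHaus, PhishTank,
    VirusTotal and the WOT categories; each service exposes its own API."""
    return False

def wot_components(domain):
    """Placeholder: should return {component: (reputation, confidence)} from the
    WOT API for the Trustworthiness and Child Safety components."""
    return {}

def is_malicious_url(url):
    final_url = expand_url(url)
    if blacklisted(final_url):
        return True
    domain = urlparse(final_url).netloc
    # Algorithm 1: reputation < 60 (unsatisfactory) with confidence >= 10 on any
    # tested component marks the URL as malicious.
    return any(rep < 60 and conf >= 10
               for rep, conf in wot_components(domain).values())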
On the other hand, using human annotation alone causes scalability To create a labeled dataset of Facebook posts contain- issues and cannot be used to identify ground truth from ing a malicious URL, we started by filtering out all large datasets containing millions of posts. This tech- posts containing one or more URLs. These URLs were nique is often used on small samples drawn from large added to the set of URLs present in the link field (if populations, making it susceptible to missing out on available) for each post. These extracted URLs were true positives which may not be captured in the sam- then visited using Python’s Requests package. 9 In case ple under consideration. Using automated techniques the Requests package failed, the URLs were visited us- like machine learning on such small datasets poses fur- ing LongURL API. 10 Visiting the landing pages of ther complications like over-fitting etc., which need to the URLs helped us to eliminate invalid URLs and be addressed separately. capture the correct final destination URLs correspond- ing to shortened URLs. After the extraction and val- To obtain ground truth data for malicious Facebook posts, we employed two methodologies, viz. URL black- 9 http://docs.python-requests.org/en/latest/ list lookup, and human annotation, and created two 10 http://longurl.org/api 6 P. Dewan, P. Kumaraguru idation process, we were left with a total of 480,407 malicious if it was marked as malicious by SURBL, unique URLs across 1,222,137 unique posts (see Ta- Google Safebrowsing, SpamHaus or PhishTank. ble 2). Each URL was then subjected to six black- list lookups, viz. Google Safebrowsing [Google, 2014], Table 3: Category labels and descriptions returned SURBL [SURBL, URI, 2011], PhishTank [OpenDNS, by the WOT API. Source: WOT API Wiki 2014], SpamHaus [SpamHaus, 2014], VirusTotal [His- (https://www.mywot.com/wiki/API), last retrieved on pasec Sistemas S.L., 2013], and Web of Trust [WOT, May 31, 2016. 2014] in October 2014. This methodology of identifying malicious content using URL blacklists has also been Category Description used in past research [Chu et al., 2012, Gao et al., 2010]. Negative Malware, viruses, poor customer experi- ence, phishing, scam, potentially illegal, adult content Questionable Misleading claims, unethical, privacy Table 2: Descriptive statistics of complete dataset col- risks, suspicious, hate, discrimination, lected over April 2013 - July 2014. spam, potentially unwanted programs, ads, pop-ups, incidental nudity, gruesome / shocking Unique posts 4,465,371 Unique entities 3,373,953 - Unique users 2,983,707 - Unique pages 390,246 Unique URLs 480,407 Algorithm 1 Detecting malicious posts from WOT Unique posts with URLs 1,222,137 Unique entities posting URLs 856,758 reputation scores Unique posts with malicious URLs 11,217 for all posts do Unique entities posting malicious URLs 7,962 for all URL domains do Unique malicious URLs 4,622 components = GetComponentsFromWOT API for all components do if reputation < 60 and confidence ≥ 10 then post = malicious end if For a given domain, the scan results returned by end for the VirusTotal API contain domain information from end for end for multiple services like TrendMicro, BitDefender, Web- Sense ThreatSeeker, etc. We marked a URL as mali- cious if one or more of these services categorized the The reason for including WOT reputation scores domain of the URL as spam, malicious, or phishing. in our labeled dataset of malicious posts was two-fold. 
The Web of Trust (WOT) API returns a reputation Firstly, to study Facebook’s current techniques to counter score for a given domain. Reputations are measured for malicious content. Facebook partnered with WOT to domains in several components, for example, trustwor- protect its users from malicious URLs [Facebook De- thiness, and child safety. For each {domain, compo- velopers, 2011]. Secondly, during news-making events, nent} pair, the system computes two values: a repu- malicious entities tend to engage in spreading fake, un- tation estimate and the confidence in the reputation. trustworthy and adult content to degrade user expe- Together, these indicate the amount of trust in the do- rience [Gupta et al., 2013]. This kind of information, main in the given component. A reputation estimate of despite being malicious, is not captured by blacklists below 60 indicates unsatisfactory. The WOT browser like Google Safebrowsing and SURBL, since they do add-on requires a confidence value of ≥ 10 before it not fall under the more obvious kinds of threats like presents a warning about a website. We tested the do- malware and phishing. WOT scores helped us to iden- main of each URL in our dataset for two components, tify and tag such content. In all, we found 4,622 unique viz. Trustworthiness and Child Safety. For our experi- malicious URLs across 11,217 unique Facebook posts ment, a URL was marked as malicious if both the afore- (see Table 2). We refer to this dataset as Dataset I for mentioned conditions were satisfied (Algorithm 1). In the rest of the paper. addition to reputations, the WOT rating system also computes categories for websites based on votes from 3.2.2 Dataset II: Using human annotation users and third parties. We marked a URL as malicious if it fell under the Negative (including malware, scams We took help from human annotators to obtain another etc.) or Questionable (including hate, incidental nudity ground truth dataset regarding whether a post is ma- etc.) category (Table 3). Further, a URL was marked licious or not, in context of the event under considera- Facebook Inspector (FbI) 7 tion. Human annotation for understanding the ground tection. We then study the differences between mali- truth is a well-established research methodology and cious and legitimate content on Facebook using Dataset has been widely used to categorize OSN data [Ahmed I and Dataset II. and Abulaish, 2012, Benevenuto et al., 2010, Castillo et al., 2011]. We recruited annotators through word- of-mouth publicity in multiple educational institutions. 4.1 Efficiency of Facebook’s current techniques All our annotators were undergraduate computer sci- ence students between the age of 17 and 21, and were Facebook’s immune system uses multiple URL black- regular Facebook users. For the 17 selected events, we lists to detect malicious URLs in real time and prevent picked a random sample of 500 posts per event for anno- them from entering the social graph [Stein et al., 2011]. tation. We developed a web interface for the annotation Understandably, the inefficiency of blacklists to detect task and assigned unique login credentials (username URLs at zero-hour limits the effectiveness of this tech- and password) to each annotator. We offered a mone- nique [Sheng et al., 2009]. We queried the Graph API tary reward of approximately US $4 to each annotator in November 2014 to check if Facebook removed any for annotating one event (500 posts). 
We asked the an- of the 11,217 malicious posts 11 from our Dataset I af- notators to select one of three options for each post: ter being posted. We found that only 3,921 out of the This is spam, This is not spam, or Can’t say (skip). 11,217 (34.95%) malicious posts had been deleted. It To help the annotators make a decision, we gave them was surprising to note that almost two thirds of all ma- the following definition of spam in general, and spam licious posts (65.05%) which got past Facebook’s real- in terms of online social networks: time detection filters remained undetected even after “Any irrelevant or unsolicited messages sent over 4 (or more) months (July - November, 2014) from the the Internet, typically to large numbers of users, for the date of post. Collectively, these posts had gathered likes purposes of advertising, phishing, spreading malware, from 52,169 unique users and comments from 8,784 etc. are categorized as spam. In terms of online social unique users at the time we recollected the data. Using media, social spam is any content which is irrelevant the URL endpoint of the Graph API 12, we also found / unrelated to the event under consideration, and / or that the 4,622 unique URLs present in the 11,217 mali- aimed at spreading phishing, malware, advertisements, cious posts had been shared on Facebook over 37 million self promotion etc., including bulk messages, profanity, times. Figure 2 shows one such malicious post from our insults, hate speech, malicious links, fraudulent reviews, dataset which went undetected by Facebook. The short scams, fake information etc.” URL in the post points to a scam website which asks Figure 4 shows a screenshot of the annotation por- users to like posts on Facebook to earn money. tal. Each post was annotated by three different annota- tors. A total of 25,500 judgments were made (17 events × 500 posts × 3 annotations) to establish the ground truth dataset. For all our analysis, we only selected posts for which, all three annotators agreed upon the same label. This was done to ensure that our ground truth dataset is of best quality. After eliminating posts with partial agree- ment, we were left with a final dataset of 4,412 posts Fig. 2: One of the 7,296 malicious posts from our (571 spam and 3,841 not spam). We refer to this dataset dataset which were not deleted by Facebook. We re- as Dataset II for the rest of the paper. visited this post after 11 months of being posted. We have made all our datasets (the 4.4 million posts, Dataset I, and Dataset II) publicly available for re- search purposes. The data is available as a list of unique Above analysis suggests that a large portion of ma- Facebook post IDs and annotations (wherever applica- licious content which goes undetected by Facebook’s ble) for all posts in our datasets. The same can be down- filters not only stays undetected, but thrives on users’ loaded from http://precog.iiitd.edu.in/resources.html. likes, comments and shares. With 4.75 billion posts gen- erated on Facebook every day [Facebook et al., 2013], 4 Analysis 11 We refer to a post as malicious if it contains a malicious URL. We first present our findings about the efficiency of 12 https://developers.facebook.com/docs/graph- Facebook’s current techniques of malicious content de- api/reference/v2.2/url 8 P. Dewan, P. Kumaraguru re-scanning all posts to check for malicious content can to exploit users’ curiosity about news-making events in be computationally expensive. 
This demands for an al- addition to hijacking trends and posting unrelated con- ternative real-time detection technique which does not tent (like promoting free iPhone, illegal drugs, cheap rely on blacklists to identify malicious content. pills, free ringtones, etc.) using topic specific keywords WOT warning pages: Facebook partnered with Web to spread malicious content. of Trust in 2011 to protect its users from malicious Investigating further, we found that the most com- URLs [Facebook Developers, 2011]. According to this mon type of malicious posts (52.0%) in our dataset partnership, Facebook claims to show a warning page were the ones with URLs pointing to adult content to the user whenever she clicks on a link which has been and incidental nudity, and marked unsafe for children reported for spam, malware, phishing or any other kind by WOT. The second most common type of malicious of abuse on WOT (Figure 3). To verify the existence of posts comprised of negative and questionable category this warning page, we manually visited a random sam- URLs. These categories comprised of malware, phish- ple of 1,000 posts containing a URL marked as mali- ing, scam, misleading claims, unethical content, spam, cious by WOT, and clicked on the URL. Surprisingly, hate, discrimination, potentially unwanted programs, the warning page did not appear even once. We also no- etc., and accounted for 45.2% of all posts. Posts con- ticed that over 88% of all malicious URLs in our dataset taining untrustworthy sources of information (38.22%) (4,077 out of 4,622) were marked as malicious by WOT. were the third most common type of malicious posts. In- terestingly, only 325 malicious posts (2.9%) advertised a phishing URL. This is a drastic drop as compared to the observations made by Gao et al. in 2010, where authors found that over 70% of all malicious posts in their dataset advertised phishing [Gao et al., 2010]. We also found that 18.4% of the malicious posts in our dataset (2,064 posts out of 11,217) advertised one or more shortened URLs. Past literature has shown wide usage of shortened URLs to spread malicious content on microblogging platforms [Antoniades et al., 2011, Chhabra et al., 2011]. Use of short URLs has signifi- Fig. 3: Warning page supposed to be shown by Face- cantly increased mostly due to restriction of message book whenever a user clicks on a link reported as abu- length on OSNs like Twitter. However, restriction on sive on WOT. message length does not apply on Facebook. This im- plies that the primary reason behind usage of shortened Next, we analyzed both Dataset I and Dataset II in URLs on Facebook is obfuscation of actual malicious three aspects – a) textual content and URLs, b) entities URLs. who post malicious content, and c) metadata associated In addition to post categories, we also looked at with malicious content. We now look at all these three the most common URL domains in our dataset. We aspects individually.
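Section 4.1 re-queries the Graph API months after collection to see which Dataset I posts Facebook had removed. A minimal version of that check is sketched below; the token is a placeholder, and treating any error response as "removed" is a simplification of the verification actually performed.

import requests

GRAPH_URL = "https://graph.facebook.com/v2.2/{post_id}"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"  # placeholder

def post_still_exists(post_id):
    """Return False if the Graph API no longer serves the post. A post can also
    become unreachable due to privacy changes rather than deletion."""
    response = requests.get(GRAPH_URL.format(post_id=post_id),
                            params={"access_token": ACCESS_TOKEN}, timeout=30)
    return "error" not in response.json()

dataset1_post_ids = []  # the 11,217 Dataset I post IDs, loaded elsewhere
removed = [pid for pid in dataset1_post_ids if not post_still_exists(pid)]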

Table 4: Top 10 most common posts in our dataset of 4.2 Dataset I: URL Blacklists malicious posts.

Textual content and URLs: We first looked at the Post Summary Count Sexy Football Worldcup - Bodypainting 155 most commonly appearing posts in Dataset I. Similar 10 Things Nelson Mandela Said That Everyone 154 to past work [Gao et al., 2010], we found various cam- Should Know paigns promoting a particular entity or event. However, Was Bishop Desmond Tutu Frozen Out of Nelson 105 campaigns in our dataset were very different than those Mandela’s Funeral? Nude images of Kate Middleton 73 discussed in the past. Table 4 shows the top 10 cam- The Gospel Profoundly Affected Nelson Man- 72 paigns in our dataset of malicious posts. We found that dela’s Life After Prison most campaigns in our dataset were event specific, and Promotion of Obamacare (Affordable Care Act) 67 talked about celebrities and famous personalities who through Nelson Mandela’s death Radical post about Nelson Mandela 54 were part of the event. Although this seems fairly obvi- Was Nelson Mandela a Christian? 41 ous because of our event based dataset, such campaigns R.I.P. Nelson Mandela: How he changed the 36 reflect the attackers’ preferences of using the context world of an event to target OSN users. Attackers now prefer Easy free cash 29 Facebook Inspector (FbI) 9

Fig. 4: Snapshot of the web interface presented to the annotators. The interface also presented a short description of each of the 17 events. observed that Facebook and YouTube constituted al- velopers, 2013] during the initial data collection pro- most 60% of all legitimate URLs shared during the cess. The category field is specific to pages; we used 17 events. The remaining legitimate URLs largely be- this field to differentiate between pages and user pro- longed to news websites (cnn.com, bbc.co.uk, etc.). On files. We found that pages were more active in posting the contrary, malicious URLs were more evenly dis- malicious URLs as compared to legitimate URLs. Pages tributed across a mixture of news, blogs, sports, en- were observed to constitute 21% (1,676 out of 7,962) of tertainment, etc. websites. Our dataset revealed that all malicious entities, while only 10% of all legitimate a large fraction of malicious content comprised of un- URL posting entities were pages. A similar percentage trustworthy sources of information, which may have in- of pages (12%) was found to constitute all legitimate appropriate implications in the real world, especially entities in our dataset. We also found 43 verified pages during events like elections, riots, etc. Most previous and one verified user among entities who posted mali- studies on detecting malicious content on online so- cious content. The most common type of verified pages cial networks have concentrated on identifying more were radio station pages (12), website pages (5) and obvious threats like malware and phishing [Benevenuto public figure pages (4). Combined together, the 43 ver- et al., 2010, Grier et al., 2010, Wang, 2010]. There exists ified pages had over 71 million likes. some work on studying trustworthiness of information It is important to note that most of the past at- on micro-blogging platforms like Twitter [Castillo et al., tempts at studying malicious content on Facebook did 2011, Gupta and Kumaraguru, 2012]. However, to the not capture content posted by pages, and concentrated best of our knowledge, no past work addresses the prob- only on users [Ahmed and Abulaish, 2012, Gao et al., lem of identifying untrustworthy content on Facebook. 2010, Stringhini et al., 2010]. Malicious content origi- Entities posting malicious content: Content nating from pages in our dataset brings out a new di- on Facebook is generated by two types of entities – mension, which hasn’t been studied in the past. Face- users and pages. Pages are public profiles specifically book limits the number of friends a user can have, but created for businesses, brands, celebrities, causes, and there is no limit on the number of people who can like other organizations. Unlike users, pages gain “fans,” (subscribe to) a page. Content posted by a page can people who choose to like a page. In our dataset, we thus, have much larger audience than that of a user, identified pages by the presence of category field in the making malicious content posted by pages potentially response returned by Graph API search [Facebook De- more widespread and dangerous than that posted by 10 P. Dewan, P. Kumaraguru individual users. We found that in our dataset, pages book itself. This was one of the main reasons for face- posting malicious content had 123,255 likes on average book.com being the most common legitimate domain (min. = 0, max. = 50,034,993), whereas for legitimate in our dataset. 
We used these, and some other features pages, the average number of likes per page was only to train multiple machine learning algorithms for auto- 45,812 (min. = 0, max. = 502,938,006). matic detection of malicious content. The results of our Metadata: experiments are presented in Section 5. We now present Facebook allows users to post content through a va- our findings from our analysis of Dataset II. riety of platforms (applications), viz. web (browsers), mobile devices (apps for Android, iOS etc.), cross plat- form and media sharing applications (HootSuite, Pho- 4.3 Dataset II: Human annotations tos, Videos, etc.), and other third party applications. Figure 5 shows the distribution of the top 25 applica- Similar to our findings from Dataset I, we characterized 13 tions used to post content in Dataset I. We observed Dataset II and observed some key characteristics which that over 51% of all legitimate content was posted separated human-annotated malicious posts from legit- through mobile apps. This percentage dropped to be- imate posts. We focus on a similar set of characteristics low 15% for malicious content. Third party applications as Dataset I. were used to generate 6% of all malicious content in Textual content: Similar to Dataset I, we found our dataset as compared to only 0.11% of all legitimate three campaigns in Dataset II advertising different URLs content being generated by such applications. This be- from the same domain. Interestingly, the top two cam- havior reflects that malicious entities make use of web paigns comprised of posts containing URLs pointing and third party applications (possibly for automation) to Facebook applications. Table 5 shows the details of to spread malicious content, and can be an indicator these campaigns we found in Dataset II. We observed of malicious activity. Legitimate entities, on the other that only 11% of the posts which were part of the hand, resort to standard mobile platforms to post. Hyundai Nepal campaign (first row in Table 5) con- Although Facebook has more web users than mobile tained a text message, and none of the posts in the users [Facebook, 2014], our observations may be biased other two campaigns contained a text message. Fur- towards mobile users due to our event specific dataset. ther, unlike Dataset I, we did not find any campaigns Past literature has shown high social network activity based on textual similarity in Dataset II. The main rea- through mobile devices during such events [Gupta et al., son for this finding can be attributed to the fact that 2013]. more than half the posts in Dataset II did not have any text messages at all. 80" Malicious"with"URLs" We noticed that 66.19% of malicious posts in Dataset 70" 60" LegiEmate"with"URLs" II (378 out of 571) did not contain any text message, 50" LegiEmate"All" this percentage dropped to 14.16% (544 out of 3,841) for 40" 30" legitimate posts. For Dataset I, the percentage of posts Percentage) 20" containing text messages was considerably higher, and 10" 0" did not vary much across malicious (73.6%) and legit- Web" Mobile" Content"Sharing" Third"Party" imate posts (88.6%). We also noticed that majority of legitimate posts in Dataset II comprised of short text Fig. 5: Sources of malicious content, legitimate content messages containing “prayers” for victims (in case of with URLs, and all legitimate content. Mobile plat- crisis events), updates and personal opinions (in case of forms were preferred over web for posting legitimate non-crisis events). 
On the other hand, spam posts were content. often long pieces of text containing advertisements, pro- motional content, and propaganda. On average, text messages in spam posts in Dataset II were 6.1 times We also observed significant difference in the con- longer than legitimate posts. We observed a similar tent types that constituted malicious and legitimate trend for Dataset I, where text messages in malicious content. Over 50% of legitimate posts containing a URL posts were approximately 3 times longer than legiti- were photos or videos whereas this percentage dropped mate posts. These numbers indicate rudimentary dif- to below 6% for malicious content. A large propor- ferences between spam and legitimate posts generated tion of these photos and videos were uploaded on Face- during news-making events, and highlight that posts 13 The top 25 applications were used to generate over 95% containing short text messages are indicators of legiti- of content in all three categories we analysed. mate content. Facebook Inspector (FbI) 11

Table 5: Campaigns found in Dataset II. Two of the three campaigns existed across more than one event.

Campaign (URL) App. name Size in Total No. of events Dataset II size affected https://www.facebook.com/hyundainepal/app 652950454797250 Hyundai 32 5,112 1 Worldcup Prediction https://apps.facebook.com/sskampot/?source=feed&like book=*. Smile 100 1,650 7 user likes * http://www.iforum.gr/?arthro=* iForum.gr 23 481 4

It was interesting to note that 250 out of the 571 SURBL, and VirusTotal for each URL present in the malicious posts (43.78%) in Dataset II contained URLs 469 malicious posts containing a URL. We found that pointing to Facebook apps, as compared to zero le- only three out of the 469 posts contained a malicious gitimate posts containing URLs pointing to Facebook URL captured by at least one of the six URL black- apps. We found no malicious posts pointing to Face- lists we looked up. Prior approaches to detect spam book apps in Dataset I, since Dataset I was obtained us- and other types of malicious content on Facebook have ing URL blacklists and facebook.com or apps.facebook.com heavily relied on URLs blacklisted by third parties, and domains never appeared on any URL blacklist. This spam keywords in posts to identify spam posts [Gao characteristic highlights the inability of URL blacklists et al., 2010, Rahman et al., 2012]. Our observations re- in detecting malicious content, and hence a drawback veal that these approaches would fail in our case due of detection models built using ground truth data ob- to the absence of blacklisted URLs, and text content in tained from URL blacklists [Chu et al., 2012, Gao et al., posts generated during news-making events. 2010]. This also signifies that Facebook posts contain- Figure 6 shows the distribution of all applications ing URLs to Facebook apps are a strong indicator of used to post content in Dataset II. We observed that malice during news-making events. 45.23% of all legitimate content was posted through Entities posting malicious content: Observa- mobile devices, followed by 35.95% content originating tions from Dataset II revealed a uniform distribution of from the web (Figure 5). These percentages dropped male users, female users, and pages across malicious and significantly to 15.93% and 20.84% respectively in case legitimate posts. We observed that approximately 22% of malicious posts. Over 51% of malicious posts origi- of all content originated from pages across both mali- nated from third party applications, which was a drastic cious and legitimate posts. Similarly, the male:female increase from 6.74% in case of legitimate content. This gender ratio was found to be approximately 3:1 for behavior contradict our observations from Dataset I, both malicious and legitimate posts. These observa- where majority of malicious posts originated from the tions contradict our observations in Dataset I where web. However, the increased participation of third party male:female ratio and user:page ratio were different applications in generating malicious posts was in line across malicious and legitimate posts. with our observations from Dataset I. Metadata: Facebook posts have a separate link field, which captures URLs (if any) present in text mes- 60" sages. We noticed that 82.13% of all malicious posts Legi=mate"Legitimate (469 out of 571) in Dataset II contained a URL, whereas 50" Spam"Malicious only 34.44% of legitimate posts (1,323 out of 3,841) con- 40" tained a URL. This behavior was contrasting as com- 30" pared to text messages (discussed previously), where 20" Percentage) majority of malicious posts in Dataset II did not con- 10" tain any text. Combined, we observed that 66.19% ma- 0" licious posts in Dataset II were isolated URLs posted Web" Mobile" Content"Sharing" Third"Party" with no supporting textual content, whereas this per- centage dropped to 13.66% for legitimate posts. This Fig. 
6: Distribution of platforms used to post malicious behavior highlights that posts containing isolated URLs and legitimate content in our dataset. Mobile was the are indicators of malicious content. most preferred platform for posting legitimate content. To see if URLs contained in posts in Dataset II were Third party applications were most common in mali- malicious, we looked up six URL blacklists, viz. Web cious posts. of Trust, Google Safebrowsing, Phishtank, SpamHaus, 12 P. Dewan, P. Kumaraguru

5 Detecting malicious content automatically sion trees, ensemble methods (Random Forests), and Support Vector Machines. We now present our analy- Past efforts for automatic detection of spam and ma- sis on Dataset I and Dataset II separately. The same licious content on Facebook largely focus on detecting set of 44 features was used to train separate models campaigns [Gao et al., 2012, Gao et al., 2010], and rely on Dataset I and Dataset II. All experiments were per- heavily on message similarity features to detect ma- formed using Python Scikit-learn library. 14 licious content [Rahman et al., 2012]. Researchers us- ing this approach have reported consistent accuracies of over 80% using small feature sets comprising of 6-7 fea- 5.1 Dataset I: URL Blacklists tures. However, this approach is ineffective for detecting newly emerged malicious content since the aforemen- We trained four supervised learning models using 11,217 tioned models require to have seen similar spam mes- unique malicious posts as the positive class and 11,217 sages in the past. To overcome this inability, we propose unique legitimate posts, randomly drawn from the an extensive set of 44 features (see Table 6) to detect 1,210,920 unique legitimate posts in Dataset I (see Ta- malicious content, excluding features like message sim- ble 2) as the negative class. We performed 10-fold cross ilarity, likes, comments, shares etc., which are absent validation on the training set for evaluation. In addi- when new malicious posts surface. We group these 44 tion, each experiment was performed ten times, each features into four categories based on their source; En- time using a random sample drawn from the negative tity, Text content, Metadata, and Link. class (legitimate posts), and keeping the positive class (malicious posts) constant. The results were averaged out across all ten observations. Standard deviation in Table 6: Features used for machine learning experi- accuracy across all ten observations was less than 1.5% ments. We extracted features from four sources, viz. for all experiments. Table 8 describes the results aver- entity, content, metadata, and link. Link features were aged across all ten experiments in detail. The Random averaged if multiple URLs were present in a post. Forest Classifier performed the best and achieved an Source Features average accuracy of 85.05%, with an average recall of Entity (9) is a page / user, gender, page category, has 86.89%. We also performed classification experiments username, username length, name length, using the four category features separately, and ob- no. of words in name, locale, likes on page served that link features performed the best, yielding an Text content Presence of !, ?, !!, ??, emoticons (smile, (18) frown), no. of words, avg. word length, no. accuracy of 82.56% using the Random Forest Classifier. of sentences, avg. sentence length, no. of A combination of all four category features, however, English dictionary words, no. of , outperformed the individual category scores, signifying hashtags per word, no. of characters, no. that none of the category features individually could of URLs, no. of URLs per word, no. of up- percase characters, no. of words / no. of identify malicious posts as accurately as their combina- unique words tion. 
Metadata App, has facebook.com URL, has Facebook To study the effect of the number of features on (10) app URL, has Facebook event URL, has the model, we calculated 10-fold cross validation accu- message, has story, has link, has picture, type, link length racy of the Random Forest Classifier using 1-44 fea- Link (7) has HTTP / HTTPS, hyphen count, tures, adding features one by one in decreasing order paramters count, parameter length, no. of of their feature importance value. Feature importance subdomains, path length was calculated using the feature importances attribute in Python Sklearn’s RandomForestClassifier method. 15 Using our observations from Section 4 and Section 4.3, We found that the accuracy peaked to 86.62% at the top we included features like App (platform used to post), seven features (Figure 7(a)). All four classifiers achieved post type, entity gender, page category, and entity class higher accuracy when trained on the top seven features, (entity is page or user), which we found to be differen- as compared to the accuracy achieved when trained on tiating the malicious and benign class. In addition, we all 44 features (see Table 8). Table 7 shows the infor- tried to capture as much publicly available information mation gain value and source of the top 10 features. corresponding to a post as possible while constructing Presence of a Facebook.com URL was found to be the feature set, in order to make our model robust. the most important feature in our model (see Table 7). We performed experiments using multiple supervised 14 http://scikit-learn.org learning techniques on both our datasets, including prob- 15 http://scikit-learn.org/stable/modules/generated/ abilistic classification techniques (Naive Bayes), deci- sklearn.ensemble.RandomForestClassifier.html Facebook Inspector (FbI) 13
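Section 5.1 reports 10-fold cross validation with scikit-learn and ranks features by the Random Forest's feature_importances_ attribute (footnote 15). The sketch below reproduces that workflow; the feature matrix is synthetic placeholder data standing in for the 44 features of Table 6, and the hyperparameters are library defaults, not values reported by the authors.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((1000, 44))     # placeholder; replace with the real 44-feature matrix
y = rng.integers(0, 2, 1000)   # placeholder labels; 1 = malicious, 0 = legitimate

clf = RandomForestClassifier(n_estimators=100, random_state=42)
accuracy = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
roc_auc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean()
print(f"10-fold accuracy: {accuracy:.4f}, ROC AUC: {roc_auc:.4f}")

# Rank features by importance and re-evaluate on top-k subsets
# (the paper observes an accuracy peak at the top 7 features).
clf.fit(X, y)
ranked = np.argsort(clf.feature_importances_)[::-1]
for k in (1, 3, 5, 7, 10, 20, 44):
    acc_k = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42),
                            X[:, ranked[:k]], y, cv=10, scoring="accuracy").mean()
    print(k, round(acc_k, 4))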

Table 7 shows the information gain value and source of the top 10 features. Presence of a Facebook.com URL was found to be the most important feature in our model (see Table 7). This was mainly because the facebook.com domain appeared in approximately half of all legitimate posts in Dataset I. However, facebook.com never appeared in malicious posts, since it never appeared on any of the URL blacklists from which we obtained our ground truth for Dataset I (as discussed previously). We also found the "post type" feature to be the third most important feature, conforming to our initial observations from Section 4.2, where we found considerable differences between malicious and legitimate post types. Figure 7(b) shows the receiver operating characteristic (ROC) curve for the Random Forest classifier using all 44 features, where we achieved a maximum area under the curve (AUC) of 0.935.

Table 7: Source and information gain value of the top 10 features.

Feature                          Source     Info. gain
Presence of Facebook.com URL     Metadata   0.240
Length of parameter(s) in URL    Link       0.222
Post type                        Metadata   0.221
Length of link field             Metadata   0.209
No. of parameters in URL         Link       0.177
No. of sub-domains in URL        Link       0.101
Length of URL path               Link       0.095
No. of hyphens in URL            Link       0.083
Application used to post         Metadata   0.079
Presence of story field          Metadata   0.071

Fig. 7: (a) Accuracy values of the Random Forest Classifier trained on Dataset I for one through 44 features. Accuracy peaked at 86.62% with the top 7 features. (b) Receiver operating characteristic (ROC) curve for the Random Forest Classifier trained on Dataset I using all 44 features. The Area Under Curve (AUC) was 0.935.

5.1.1 General model versus event-specific model

It can be argued that malicious Facebook posts in general may not be different from malicious posts appearing during news-making events, and thus, a general model trained on a random sample of Facebook posts may suffice to mitigate all malicious posts, irrespective of whether they are generated during an event or not. However, from our initial observations, we hypothesize that this approach would be inadequate, since malicious posts in general are different from malicious posts generated during news-making events. To confirm our hypothesis, we collected a separate, non event-specific dataset of Facebook posts from April 2013 – March 2015. This dataset was collected by querying the Graph API using "http" as a search keyword, and consisted of 267,517 posts. We used the same methodology to find malicious posts from this dataset as we did for Dataset I (Section 3.2.1), and obtained 5,446 posts containing malicious URLs. We refer to this dataset as Dataset III. We compared these 5,446 malicious posts from Dataset III with the 11,217 malicious posts from Dataset I and found some drastic differences.

Textual content and URLs: We found multiple spam campaigns in Dataset III. The majority of these campaigns advertised adult content, followed by malicious app downloads, online money-making scams, and politics. Unlike Dataset I, none of the campaigns in Dataset III involved any celebrities or events. Figure 8 shows tag clouds of the most frequently occurring terms in the message field of posts in Dataset I and Dataset III. As evident from the tag clouds, there was minimal overlap between malicious content posted during news-making events and malicious content posted in general. We also observed that a large proportion of malicious posts in Dataset III (38.49%) contained a facebook.com URL. Although the facebook.com domain never appeared in a URL blacklist, a large proportion of malicious posts in Dataset III contained a facebook.com URL in addition to a blacklisted URL in the same post. This behavior affected the "no. of URLs per word" feature in our feature set, which was thus found to be the fourth most important feature during our supervised learning experiments.

Entities posting malicious content: Observations from Dataset I revealed that about 21% of all posts containing a malicious URL originated from Facebook pages. We observed that this percentage rose sharply to 65.86% in Dataset III. 16 This rise increased the importance of the pageLikes feature, which turned out to be the third most important feature in our supervised learning model trained on Dataset III. In contrast, this feature was amongst the least important features in Dataset I.

16 Overall, only 24.74% of all posts in Dataset III originated from pages.

Table 8: Ten-fold cross validation accuracies for four classifiers over six different feature sets.

                                      Feature Set
Classifier        Metric      Entity    Text   Metadata    Link     All    Top 7
Naive Bayesian    Accuracy     54.79   52.41     71.60    69.25   56.15    74.72
                  Precision    76.53   70.61     64.30    73.89   78.62    68.05
                  Recall       13.82    8.29     97.23    59.57   16.91    93.93
                  ROC AUC      0.552   0.636     0.784    0.766   0.566    0.799
Decision Tree     Accuracy     63.02   64.78     80.56    82.34   84.67    86.17
                  Precision    65.00   69.84     77.40    77.57   84.20    83.81
                  Recall       56.45   52.00     86.36    91.03   85.47    89.65
                  ROC AUC      0.663   0.645     0.872    0.885   0.847    0.883
Random Forest     Accuracy     63.47   66.25     80.67    82.56   85.05    86.62
                  Precision    64.94   73.85     76.86    77.48   83.76    83.97
                  Recall       58.25   50.70     87.45    91.66   86.89    90.56
                  ROC AUC      0.696   0.705     0.879    0.901   0.935    0.935
SVM (rbf)         Accuracy     61.77   64.89     78.75    81.45   75.89    83.66
                  Precision    74.80   67.74     72.42    75.36   69.46    78.28
                  Recall       35.53   56.86     92.93    93.49   92.45    93.20
                  ROC AUC      0.665   0.693     0.843    0.856   0.860    0.894

Metadata: We observed that less than 1% (53 out of 5,446) of the malicious posts in Dataset III were generated from third party applications. This was a considerable drop as compared to Dataset I, where 6% of all malicious posts were generated from third party applications. Malicious posts originating from the web also increased sharply, from about 70% in Dataset I to 87.49% in Dataset III. Additionally, 36.83% of all malicious posts in Dataset III were photos; this percentage dropped to a bare 2.51% in case of Dataset I.

To consolidate and quantify support against our null hypothesis, we performed further experiments using supervised learning. We picked a data sample comprising of the 5,446 malicious posts and a random sample of 5,446 legitimate posts drawn from the aforementioned non event-specific dataset (Dataset III) of 267,517 posts containing URLs. Using this data sample, we trained a Random Forest Classifier with the same feature set as described in Table 6. A 10-fold cross validation on this model yielded an accuracy of 90.39%. However, when this model was tested on Dataset I (the event-specific dataset comprising 11,217 malicious and 11,217 legitimate posts), the accuracy plummeted to 55.64%. This drop in performance disproved our null hypothesis and confirmed that a model trained to detect malicious posts in general is not capable of detecting event-specific malicious posts efficiently. Malicious posts generated during news-making events need to be addressed separately, which we attempt to achieve in this work.
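The cross-dataset test described above can be reproduced with a few lines of scikit-learn. This is a simplified sketch, not the original experiment code; X_general/y_general (the balanced Dataset III sample) and X_event/y_event (the balanced Dataset I sample) are assumed to be pre-computed feature matrices built from the 44 features of Table 6.

```python
# Sketch: train a "general" model on the non event-specific sample and measure
# how badly it transfers to the event-specific sample.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

def general_vs_event(X_general, y_general, X_event, y_event):
    clf = RandomForestClassifier(n_estimators=100, random_state=42)

    # In-domain performance: 10-fold cross validation on the general sample.
    in_domain = cross_val_score(clf, X_general, y_general, cv=10).mean()

    # Cross-domain performance: fit on the general sample, test on event data.
    clf.fit(X_general, y_general)
    cross_domain = accuracy_score(y_event, clf.predict(X_event))

    return in_domain, cross_domain
```

A large gap between the two numbers (90.39% versus 55.64% in our experiments) is what supports the claim that event-specific malicious content needs its own model.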

Fig. 8: Tag clouds of the top 75 most frequently occurring terms in (a) Dataset I and (b) Dataset III. The size of a word is proportional to its frequency. Stop words and URLs have been removed.

5.1.2 Performance over time

To check the effectiveness of our technique over time, we trained 3 models (M1, M2, and M3) using all 44 features by splitting our dataset into 3 equal-sized subsets across time (M1 trained on D1 = April - July, 2013; M2 trained on D2 = August - December, 2013; M3 trained on D3 = January - July, 2014). We also collected test data about the Ebola outbreak in Africa during August - October, 2014, consisting of 3,248 malicious and 3,248 randomly picked legitimate posts (D4). Each model was evaluated a) using 10-fold cross validation, and b) by testing on all data subsets from future time intervals. For example, M1 was tested on D2, D3, and D4; M2 was tested on D3 and D4, etc. Figure 9 represents the performance of all models over time. True positive rates obtained from 10-fold cross validation on all models were consistently high, and varied between 88.6% and 94.3%. However, we noticed a gradual overall decrease in the true positive rates of all models over time (except M1). Model M1 (trained on April - July, 2013 data) showed a considerable rise in performance when tested on D4 (Ebola dataset).

Fig. 9: True positive rates of all models generated from Dataset I over time. Encircled points indicate true positive rates on 10-fold cross validation. M1 reached the lowest true positive rate of 72.1% over the Jan - July, 2014 test data.

The drop in true positive rates possibly indicates attackers' changing strategies over time. However, to maintain high true positive rates, the model can easily be retrained by collecting data using the methodology described in Section 3.2.

5.1.3 Performance comparison with previous models

Gao et al. [Gao et al., 2012] used a clustering approach to identify spam campaigns and used features from these clusters to train a supervised machine learning model. In contrast, our technique treats each post separately and also takes individual malicious posts (which are not part of a campaign) into consideration. Since Gao et al.'s system was designed to detect spam messages which were part of a campaign, the authors could not estimate the amount of malicious posts that this approach missed (false negatives). To this end, we applied the same clustering technique and threshold values used by Gao et al. [Gao et al., 2010] on our dataset to get an estimate of the false negatives of their approach. Since our entire dataset was already labeled (as opposed to Gao et al.'s dataset), we did not apply clustering on our entire dataset to find malicious posts. Instead, we applied clustering only on the malicious posts in our dataset, and compared how many of those clusters met the distributed and bursty threshold values previously used (>5 users per cluster, and <90 minutes median time between consecutive posts, respectively). Applying clustering on our 11,217 malicious posts yielded a total of 4,306 clusters. Out of these, only 183 clusters (containing 4,294 posts) met the distributed and bursty thresholds, yielding a high false negative rate of 61.7%. These results indicate that existing clustering techniques may not be the best approach to detect malicious posts which are not part of a campaign. Our supervised learning models eliminate the dependency on message and time proximity, which makes our technique more robust and efficient at detecting malicious posts irrespective of whether they are part of a campaign or not.

Due to the unavailability of Gao et al.'s dataset, and the absence of social degree and interaction history features in our dataset, we were not able to do an ideal comparison of our detection techniques. Also, while Gao et al.'s system was designed to be deployed at the OSN service provider side, our approach uses completely public features and is deployed at the client side. We believe that the performance of our model will increase if it is deployed at the OSN service provider side and supplied with more user and post metadata which is not available publicly.

We were unable to compare our model accuracy results with other previous work due to two major reasons: a) absence of features like likes, comments, message similarity, etc. at zero-hour (used by Rahman et al.), and b) public unavailability of features like number of friends, messages sent, friend choice, active friends, page likes, etc. (used by Stringhini et al. [Stringhini et al., 2010] and Ahmed et al. [Ahmed and Abulaish, 2012]). We believe that the addition of features like likes, comments, and shares to our feature set is likely to increase the accuracy of our model. However, as discussed earlier, using these features hampers the real time aspect of our detection technique.
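For reference, the cluster filter discussed earlier in this section (the "distributed" and "bursty" thresholds of Gao et al.) reduces to a simple predicate over each cluster. The sketch below assumes each cluster is available as a list of (user_id, timestamp) pairs for the posts it contains; it is an illustration of the quoted thresholds, not Gao et al.'s original implementation.

```python
# Sketch: apply the "distributed" (>5 distinct users) and "bursty" (<90 minutes
# median gap between consecutive posts) thresholds to pre-computed clusters.
# Each cluster is assumed to be a list of (user_id, timestamp) tuples, with
# timestamps as datetime objects.
from statistics import median

def is_campaign_cluster(cluster, min_users=5, max_median_gap_minutes=90):
    users = {user_id for user_id, _ in cluster}
    distributed = len(users) > min_users

    times = sorted(ts for _, ts in cluster)
    if len(times) < 2:
        return False
    gaps = [(b - a).total_seconds() / 60.0 for a, b in zip(times, times[1:])]
    bursty = median(gaps) < max_median_gap_minutes

    return distributed and bursty
```

Counting the malicious posts that fall in clusters passing this filter (4,294 of 11,217 in our data) gives the false negative estimate reported above.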

5.1.4 Crisis versus Non-crisis events

Extensive research has been conducted to highlight the use and importance of OSNs during crisis events [Palen, 2008, Semaan and Mark, 2012, Mendoza et al., 2010, Acar and Muraki, 2011, Rudra et al., 2016]. This prompted us to analyze crisis and non-crisis events in our dataset separately, and perform experiments to see if our feature set could be used more effectively for identifying malicious content during crisis events specifically. Table 9 shows the events in our dataset which we marked as crisis 17 and non-crisis.

17 We marked events as crisis based on the following definition of crisis from the Oxford English Dictionary: "A time of intense difficulty or danger".

Table 9: Crisis and non-crisis events in our dataset. We used the Oxford Dictionary definition of crisis to mark events as crisis or non-crisis.

Crisis (9)      Air Algerie missing plane, Boston Blasts, Cyclone Phailin, Unrest in Gaza, Malaysian Airlines Flight MH17 crash, Metro North Train Derailment, Typhoon Haiyan, London Terror Attacks, Washington Navy Yard shooting
Non-crisis (8)  FIFA Worldcup 2014, Heartbleed bug in OpenSSL, IPL 2013, IPL 2014, Death of Nelson Mandela, Birth of the Royal baby, T20 cricket world cup, Wimbledon 2014

We split Dataset I into two components, comprising of posts related to a) crisis, and b) non-crisis events. We used Random Forests to train models for each of these components separately using the same 44 features, and performed 10-fold cross validation for evaluation. The experiment produced an accuracy of 83.79% (ROC area under curve = 0.921) for crisis events, and an accuracy of 85.93% (ROC area under curve = 0.947) for non-crisis events. These results indicated that, although marginally, it was harder to identify malicious posts during crisis events as compared to non-crisis events.

Table 10 shows the information gain values corresponding to the top 10 most important features for our crisis and non-crisis event specific models separately. We noticed that 8 out of the top 10 features were common across both models, signifying similarity within classes across crisis and non-crisis events. This similarity pointed to the presence of common traits within posts containing malicious URLs across crisis and non-crisis events. Given that the accuracies for the crisis and non-crisis event specific models were also similar, we conjecture that a common model can be used for identifying malicious content across both crisis and non-crisis events efficiently.

Table 10: Most important features for identifying malicious content during crisis and non-crisis events separately.

Crisis Events
Feature                          Source     Info. gain
Presence of Facebook.com URL     Metadata   0.203
Length of parameter(s) in URL    Link       0.189
Post type                        Metadata   0.184
Length of link field             Link       0.179
No. of parameters in URL         Link       0.117
Length of URL path               Link       0.092
No. of hyphens in URL            Link       0.059
Link is HTTP                     Link       0.051
Application used to post         Metadata   0.040
Locale                           User       0.039

Non-crisis Events
Feature                          Source     Info. gain
Presence of Facebook.com URL     Metadata   0.251
Length of link field             Metadata   0.245
Length of parameter(s) in URL    Link       0.242
Post type                        Metadata   0.234
No. of parameters in URL         Link       0.205
No. of subdomains in URL         Link       0.133
Application used to post         Metadata   0.105
No. of hyphens in URL            Link       0.096
Length of URL path               Link       0.093
Presence of story field          Metadata   0.072

5.2 Dataset II: Human annotations

Similar to our machine learning analysis for Dataset I, we performed extensive analysis on Dataset II using multiple supervised learning techniques. For Dataset II, we generated our training set using 571 malicious posts and 571 legitimate posts randomly drawn from the 3,841 legitimate posts we had obtained during the annotation process. Similar to Dataset I, we performed 10-fold cross validation on the training set for evaluation. In addition, each experiment was performed ten times, each time using a random sample drawn from the legitimate class, and keeping the malicious class constant. The results were averaged out across all ten observations. The standard deviation in accuracy across all ten observations was less than 4% for all experiments. Table 11 shows the average performance we achieved for all our experiments.

The Random Forest Classifier performed slightly better than the other classifiers and achieved an accuracy of 80.56% with an area under the ROC curve (ROC AUC) of 0.873. Although the SVM classifier yielded a higher precision than the Random Forest Classifier, a better recall and higher ROC AUC value prompted us to use the Random Forest Classifier for further experiments.

Link based features performed better than a combination of all features for Dataset II. This observation was unlike our observation from Dataset I, where a combination of all features performed better than Entity, Text, Metadata or Link features individually.

We studied the effect of the number of features on the model by adding features one by one and observing the 10-fold cross validation accuracy of the Random Forest Classifier trained on Dataset II. The accuracy remained fairly consistent throughout, and peaked at all 44 features.

Table 11: Ten-fold cross validation scores averaged across ten experiments for each classifier. Although the SVM classifier gave better precision, the Random Forest Classifier had better recall and ROC AUC.

                                      Feature Set
Classifier        Metric      Entity    Text   Metadata    Link     All
Naive Bayesian    Accuracy     51.67   51.60     72.45    77.58   67.63
                  Precision    50.74   52.57     98.64    85.39   71.52
                  Recall       80.28    8.28     45.81    66.71   76.46
                  ROC AUC      0.530   0.738     0.803    0.835   0.728
Decision Tree     Accuracy     51.66   73.16     79.01    81.04   76.17
                  Precision    51.51   70.36     86.85    88.41   75.67
                  Recall       48.62   80.52     68.60    71.47   76.88
                  ROC AUC      0.525   0.678     0.785    0.833   0.763
Random Forest     Accuracy     52.86   76.56     79.87    81.49   80.56
                  Precision    53.20   77.27     86.99    89.05   87.59
                  Recall       51.38   74.92     70.19    72.56   70.76
                  ROC AUC      0.537   0.806     0.832    0.856   0.873
SVM (rbf)         Accuracy     53.16   76.52     78.18    80.37   73.79
                  Precision    52.48   81.50     83.57    89.29   88.49
                  Recall       63.89   68.84     70.42    69.20   54.95
                  ROC AUC      0.544   0.780     0.836    0.845   0.810

Table 12 shows the top 10 features in Dataset II. Interestingly, even though Link features performed the best, no Link features appeared in the top 10 features. This implies that while individual Link features might be weak, a combination of these features was the strongest combination for distinguishing malicious posts from legitimate ones. The top 10 features also conformed to some of our observations from Section 4.3, where we found "message length" (captured by no. of characters in text message, no. of sentences, etc.) and "application used to post" to be the discriminating features. Further, as compared to Dataset I, we observed a heavy shift from Metadata and Link features towards Text features among the most important features in Dataset II.

Table 12: Source and information gain value of the top 10 features in Dataset II.

Feature                                 Source     Info. gain
Length of URL path                      Link       0.250
Length of link field                    Metadata   0.245
No. of characters in text message       Text       0.242
No. of URLs per word in text message    Text       0.236
No. of sentences                        Text       0.233
Presence of https in URL                Link       0.230
Presence of Facebook App URL            Metadata   0.223
Avg. sentence length in text message    Text       0.218
Length of parameter(s) in URL           Link       0.216
Avg. word length in text message        Text       0.215

5.2.1 Crisis versus Non-crisis events

Similar to the experiments we performed in Section 5.1.4, we split Dataset II into crisis and non-crisis event related posts and repeated the experiments for this dataset. A 10-fold cross validation on a model trained using Random Forests and the same feature set resulted in an accuracy and ROC AUC value of 76.34% and 0.841 respectively for crisis events. In case of non-crisis events, these values increased to 84.06% and 0.922 respectively. These results were in line with the results obtained during experiments performed on Dataset I, and reflected that it was harder to detect malicious posts during crisis events as compared to non-crisis events.

Table 13 shows the information gain values associated with the top 10 features for the crisis and non-crisis event specific models separately. We observed that the two models had only 4 features in common in the top 10, as opposed to our models corresponding to Dataset I, where 8 out of the top 10 features were common across crisis and non-crisis event specific models. We noticed higher information gain values for the top features corresponding to non-crisis events as compared to crisis events, signifying that features in non-crisis event related data were more discriminating as compared to crisis event related data. This difference in information gain values across the two datasets is partly indicative of the reason for the learning algorithm performing better on non-crisis events.

In general, we observed the presence of repetitive content, for example in the form of advertising campaigns, being popular during non-crisis events. Such repetitive content tends to convert to a feature distribution with low variance, making it easier for predictors to learn and discriminate between poor quality content (with low variance) and benign content (with higher variance). We believe that this was one of the probable reasons for our models to perform better on non-crisis events as compared to crisis events.

Table 13: Most important features for identifying malicious content during crisis and non-crisis events separately, in Dataset II.

Crisis Events
Feature                                 Source     Info. gain
Length of parameter(s) in URL           Link       0.309
No. of parameters in URL                Link       0.257
Length of link field                    Metadata   0.247
Presence of Facebook App URL            Metadata   0.239
No. of sentences                        Text       0.213
No. of characters                       Text       0.209
Avg. word length                        Text       0.195
Presence of message field               Metadata   0.182
Avg. sentence length                    Text       0.182
No. of words                            Text       0.180

Non-crisis Events
Feature                                 Source     Info. gain
Length of URL path                      Link       0.374
Length of link field                    Link       0.359
No. of URLs per word in text message    Text       0.316
Presence of https in URLs               Link       0.297
No. of characters                       Text       0.288
No. of sentences                        Text       0.271
No. of subdomains in URL                Link       0.271
Application used to post                Metadata   0.259
No. of words                            Text       0.257
No. of URLs                             Text       0.257

5.3 Dataset I versus Dataset II

Observations from our two ground truth datasets revealed a series of differences in the characteristics of the malicious content present in both these datasets. To quantify these differences, we performed cross validation of the models obtained from Dataset I and Dataset II separately. To this end, we tested the performance of the Random Forest Classifier model obtained from Dataset I on Dataset II. Similarly, we tested the performance of the Random Forest Classifier model obtained from Dataset II on Dataset I. We performed a total of 100 experiments (using the 10 equally balanced randomly drawn training datasets from Dataset I, tested on the 10 equally balanced randomly drawn testing datasets from Dataset II, and vice versa) and averaged out the results of all experiments. The performance of both models dropped significantly when tested across each other. The Random Forest Classifier model trained on Dataset I achieved an accuracy of 39.42% when tested on Dataset II. The Random Forest Classifier model trained on Dataset II was only 43.03% accurate when tested on Dataset I. These observations confirmed the intrinsic differences in the two types of malicious content we found in the two ground truth datasets, and highlighted that a single model is not enough to solve the problem of real time malicious content detection on Facebook during news-making events.

We thus propose a two-fold approach wherein each post goes through two separate supervised learning models to decide whether it is malicious or not. The post is marked as malicious if either of the two models classifies the post as malicious. Since the two models are independent of each other, both models can work in parallel, thus minimizing the time delay in the final decision and maintaining real time efficiency.

6 Facebook Inspector: Implementation and Evaluation

To provide an easy-to-use, real world solution for the problem of detecting malicious content on Facebook, we deployed Facebook Inspector in the form of a REST API (Representational State Transfer Application Programming Interface) and a browser plug-in. We utilized both models we obtained (from Dataset I and Dataset II) in the final deployed version.

6.1 Implementation

The implementation of Facebook Inspector includes a back-end (comprising of two pre-trained models) and a front-end (browser plug-in) which communicate over HTTP RESTful APIs. Figure 10 shows the basic architecture and flow of the system.

6.1.1 Back-end

The flow of data and information in Facebook Inspector is as follows: A user opens https://www.facebook.com in a browser with Facebook Inspector installed and enabled, and logs into her Facebook account. Once all the posts have loaded on the page, the browser plug-in parses the page and extracts the post IDs of all posts displayed on the page.
These post IDs are then sent by the plug-in to our back-end server using the REST protocol. This communication takes place in parallel, i.e. the communication for each post is independent of the others. We do not scrape the post or user information from the raw HTML source of the webpage; we merely pass the post IDs to our back-end server. This is done for two reasons: a) the webpage does not provide enough information about the post as required by our model; we make authenticated requests to the Graph API, which provides more information about the post than what is available on the webpage; and b) to maintain user privacy.

Fig. 10: Architecture / flow diagram for Facebook Inspector. For each post, the plug-in sends the post ID to the back-end via an HTTP request to the API. While the post is being processed at the back end, the plug-in displays the text "Analysing" next to the post on the user's Facebook page. Once processed, this text disappears, and a "red alert marker" is displayed next to the post, if the post is deemed malicious.

While querying the Graph API, we are only able to access public posts. This restricts the system from getting access to any private content, which users do not intend to share publicly.

Once the back-end server receives a post ID, it makes an authenticated request to Facebook's Graph API to fetch data about the post. This data is used to generate a feature vector for the post, and two prediction probability scores are computed for the post, corresponding to the two pre-trained models obtained from Dataset I and Dataset II (Section 5). We use the predict_proba method of Python Sklearn's RandomForestClassifier module (since the Random Forest Classifier achieved the highest accuracy; see Table 8 and Table 11) to predict the class probability for the post instead of obtaining a class label directly. This is done to provide flexibility to external third-party applications using only the API, and to enable them to choose their own thresholds (depending on the use case) while making a decision on whether a post is malicious or not. The system thus returns two scores corresponding to each of the two models separately, where each score (between 0.0 and 1.0) captures the probability of the post being malicious. A score greater than 0.5 signifies that the model deems the post as malicious. In addition to probability scores, the system also computes a label (Malicious / Benign) and a confidence level (high / low) in this label, for the front-end (browser plug-in). Table 14 shows the label and confidence level corresponding to the classification probability scores produced by Model I (trained on Dataset I) and Model II (trained on Dataset II). The final result (probability scores, label, and confidence level) is output by the back-end in JSON format along with the original Facebook post ID, and sent to the front-end.

Table 14: Label and confidence level corresponding to probability scores produced by Model I and Model II.

Model I score           Model II score          Label       Confidence
Above 0.7               Above 0.7               Malicious   High
Mean score above 0.5                            Malicious   Low
Below 0.3               Below 0.3               Benign      High
Mean score below 0.5                            Benign      Low

Facebook Inspector's REST API is publicly accessible and can be queried by sending an HTTP GET request containing a Facebook post ID (fid) and a version parameter (example: http://multiosn.iiitd.edu.in/fbapi/endpoint/?version=2.0&fid=369318403277259). We used Python's webpy framework to implement the API. 18 All feature extraction and computation scripts are written in Python, with MongoDB as the database. The hardware used for the implementation is a mid-range server running Ubuntu (server edition) with 64 Gigabytes of RAM, an Intel Xeon CPU E5-2640 0 @ 2.50GHz with 8 physical cores, and hyperthreading enabled. Due to Facebook's API limitations, our API currently works only for public posts which are accessible through Facebook's Graph API. 19
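The mapping from the two class probabilities to the label and confidence level of Table 14 is straightforward to express in code. The sketch below illustrates that decision rule; model_1, model_2 (the two fitted RandomForestClassifier objects) and feature_vector are assumed to be available, and it is not the deployed back-end code.

```python
# Sketch: combine the two per-model probabilities into the label / confidence
# scheme of Table 14. Each model returns P(malicious) via predict_proba.
def score_post(model_1, model_2, feature_vector):
    # Index 1 assumes the classes are ordered (benign, malicious); in general
    # the column should be looked up via model.classes_.
    p1 = model_1.predict_proba([feature_vector])[0][1]  # Model I score
    p2 = model_2.predict_proba([feature_vector])[0][1]  # Model II score

    if p1 > 0.7 and p2 > 0.7:
        label, confidence = "Malicious", "High"
    elif p1 < 0.3 and p2 < 0.3:
        label, confidence = "Benign", "High"
    elif (p1 + p2) / 2.0 > 0.5:
        label, confidence = "Malicious", "Low"
    else:
        label, confidence = "Benign", "Low"

    return {"model_1_score": p1, "model_2_score": p2,
            "label": label, "confidence": confidence}
```

Third-party users of the API can of course ignore the computed label and apply their own thresholds to the two raw scores, which is exactly why the back-end exposes the probabilities.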
Open access to the Facebook Inspector API ensures reproducibility and enables researchers to test our models on new data and build upon our results. The API also enables other researchers to easily compare their results with ours, which has otherwise been non-trivial in the past.
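As an illustration of such third-party use, a single post can be checked against the public endpoint with a plain HTTP GET request. This is a usage sketch built on the example endpoint and post ID quoted earlier and the Python requests library; the exact field names in the returned JSON depend on the API's response schema.

```python
# Sketch: query the public Facebook Inspector REST API for one post ID.
import requests

API_ENDPOINT = "http://multiosn.iiitd.edu.in/fbapi/endpoint/"

def check_post(post_id):
    response = requests.get(API_ENDPOINT,
                            params={"version": "2.0", "fid": post_id},
                            timeout=30)
    response.raise_for_status()
    # JSON containing the two probability scores, label, and confidence level;
    # callers may apply their own thresholds to the raw scores.
    return response.json()

if __name__ == "__main__":
    print(check_post("369318403277259"))
```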

6.1.2 Front-end

Google Chrome and Mozilla Firefox are the two most popular browsers, and account for over 87% of the total market share. 20 These two browsers are thus the ideal target for the first version of our light-weight browser plug-in. In order to minimize the computation load on the web browser, all heavy computations were offloaded to the back-end server. The task of the plug-in is therefore limited to making a GET request (to our back-end server), and injecting a small image indicator next to malicious posts. As a result, the browser plug-in has a minimalistic memory and CPU footprint. This design ensures that the system does not result in any performance bottleneck on the client's web browser.

Facebook Inspector displays a small indicator image next to the post, if the post is adjudged as malicious (as shown in Figure 11). Hovering over the indicator image shows the confidence level in the judgement. Users seeking more details about how the judgement was made can click on the image, which takes them to a web page 21 with details about the entire framework. For posts that are adjudged as benign, there is no indication. In addition, since the plug-in analyzes only public posts, no indicator image is present for posts that are not public.

Fig. 11: Sample Facebook post marked as malicious by Facebook Inspector. The small red image next to the time of the post indicates that the post may be malicious. Hovering over the red image gives additional details to the user.

Facebook Inspector is currently available for download publicly on the Google Chrome store 22 and the Mozilla store. 23 As of May 2016, Facebook Inspector had been downloaded over 2,500 times, and had over 150 daily active users on average. This information helped us evaluate Facebook Inspector in terms of response time, performance, and usability. We now present the findings of our evaluation.

18 http://webpy.org/
19 https://developers.facebook.com/docs/graph-api/reference/v2.5/post
20 As of May, 2016. Source: http://www.w3schools.com/browsers/browsers_stats.asp
21 http://precog.iiitd.edu.in/osm.html#fbi
22 https://chrome.google.com/webstore/detail/facebook-inspector/jlhjfkmldnokgkhbhgbnmiejokohmlfc

6.2 Evaluation

During the first nine months of its deployment (August 2015 - May 2016), Facebook Inspector received over 2.76 million incoming requests. Out of these requests, the system processed and evaluated 0.97 million unique public posts. The remaining requests were for private posts that were not publicly accessible via the Facebook Graph API. Table 15 shows the summary statistics of the usage of Facebook Inspector since its date of first release.

6.2.1 Response Time

We analyzed the response time of the browser plug-in, measured as the time elapsed between the moment our back-end server received a request, and the moment the server returned a response for this request.

This technique eliminated unwanted network delays induced because of variable Internet speeds across different Internet Service Providers and browser versions. Figure 12 shows the CDF of response times for approximately 374,000 posts analyzed by the latest version of the Facebook Inspector framework. We did not capture the response time information for posts during the initial few weeks of the deployment, until the framework stabilized and some initial bugs were fixed. The average response time for analyzing a post was 2.635 seconds, with a lowest response time of 1.212 seconds. As visible from the figure, the response time was under 3 seconds for approximately 80% of all posts, and under 10 seconds for 99.3% of all the posts. We further broke down the total response time into the various steps (fetching data from the Graph API, feature vector creation, classification, etc.) that comprise the complete processing sequence of a post. This analysis helped us to better understand the entire process in terms of time consumed, and to identify potential bottlenecks and scope for further improvement (if any) in terms of response time. We now discuss the results of this analysis.

Table 15: Summary statistics for the usage of Facebook Inspector.

Date of first release of Facebook Inspector    Aug. 23, 2015
Total incoming requests                        2,765,211
Total posts processed                          974,426
Total unique browsers (user-agent strings)     602
Total downloads                                2,500+
Total daily active users                       150+
Posts marked as malicious                      188,650
Posts marked as benign                         785,776
Average processing time per post               2.635 seconds

Fig. 12: CDF of the response time (in seconds) of Facebook Inspector. The response time was 3 seconds or less for approximately 80% of all posts. For over 99.3% of the posts, the response time was under 10 seconds.

6.2.2 Performance

We recorded and analyzed the end-to-end response time of Facebook Inspector as a combination of the time consumed by five main components of the framework: a) classification (model I), b) classification (model II), c) Graph API call (post data), d) Graph API call (user data), and e) URL resolution. All other processing steps, like database entries, variable assignments, etc., accounted for less than 0.8% of the total time. Table 16 shows the mean and median time (in seconds) consumed by these components.

Table 16: Breakdown of time consumed by the main components of Facebook Inspector.

Step                                     Mean time (seconds)    Median time (seconds)
Classification – Model I                 0.108                  0.106
Classification – Model II                0.106                  0.106
GET request to Graph API – post data     0.755                  0.633
GET request to Graph API – user data     0.636                  0.592
URL resolution                           1.005                  0.811

As evident from Table 16, URL resolution was the most time consuming step, followed by the GET requests made to the Facebook Graph API for fetching user and post data. Classification on the pre-trained models took approximately 1/10th of a second each, and was the least time consuming. Although a total mean response time of 2.635 seconds makes Facebook Inspector faster than some of the existing real time malicious content detection systems [Gupta et al., 2014], we observed further scope for improvement by introducing parallel processing for the Graph API calls and classification steps. Executing the Graph API calls and classification steps in parallel further improved the response time by 0.742 seconds on average, reducing the total response time to 1.893 seconds.
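The parallelization mentioned above is the standard pattern of issuing independent, I/O-bound calls concurrently. A minimal sketch with Python's concurrent.futures is shown below; fetch_post_data, fetch_user_data, build_feature_vector, model_1, and model_2 are hypothetical stand-ins for the corresponding framework components, not the actual deployed code.

```python
# Sketch: issue the two Graph API calls concurrently, then run the two
# independent classifications concurrently as well.
from concurrent.futures import ThreadPoolExecutor

def process_post_parallel(post_id):
    # The two Graph API calls are independent and I/O-bound.
    with ThreadPoolExecutor(max_workers=2) as pool:
        post_future = pool.submit(fetch_post_data, post_id)  # Graph API: post data
        user_future = pool.submit(fetch_user_data, post_id)  # Graph API: user data
        post_data, user_data = post_future.result(), user_future.result()

    # Feature extraction (including URL resolution) still happens once, in sequence.
    features = build_feature_vector(post_data, user_data)

    # The two classifications are independent of each other.
    with ThreadPoolExecutor(max_workers=2) as pool:
        s1 = pool.submit(model_1.predict_proba, [features])
        s2 = pool.submit(model_2.predict_proba, [features])
        return s1.result()[0][1], s2.result()[0][1]
```

Since the dominant costs are network round trips (see Table 16), overlapping them is where most of the 0.742-second saving comes from.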

6.2.3 Usability

The usability of any computer system is an important aspect of its overall evaluation. Low usability not only hampers user experience, but can also drive users away from the system. It is therefore desirable for researchers to test the usability of any new system they propose, and work towards having an easy-to-use system for end users.

To assess the overall usability and usefulness of Facebook Inspector, we conducted an online survey among its users. The survey contained the 10 standard questions of the System Usability Scale (SUS) [Brooke, 1996]. We recruited participants through word-of-mouth, mailing lists, and Facebook groups to participate in the usability study. The participants were offered a small cash prize as an incentive for the time they spent on the study. Participants were asked to download and install the Facebook Inspector plug-in, use it, and fill the survey. The entire process took approximately 12-15 minutes per participant. A total of 53 participants completed the study.

We obtained an overall SUS score of 81.36 for Facebook Inspector, which translates to grade A on the SUS scale, indicating that Facebook Inspector had higher perceived usability than over 90% of all products tested using the SUS. 24 Almost all participants (98.1%) found Facebook Inspector easy to use (agree / strongly agree), and none of the participants (0%) found the system to be unnecessarily complex. Also, 67.9% of the participants said that they would like to use this system frequently (agree / strongly agree). However, in terms of consistency, only 66% of the participants found the system to be consistent (agree / strongly agree).

In addition to the usability survey, we also collected qualitative feedback for the plug-in. We asked a group of 36 undergraduate students studying computer science to download and use the plug-in, and report their findings in terms of the usefulness of the plug-in. Most student participants found the plug-in useful and easy to use. Upon being asked about their initial reaction to the plug-in, one of the participants said, "Saves your time spent on spam links and hence enhances user experience." Another participant spoke about the utility of the plug-in, and how she found it to be a means to mitigate spam; "A very interesting tool to curb the problem of users clicking random posts and spreading spam-like behavior." Some participants mentioned that they would have liked the plug-in to analyze private posts as well. Participants also commented on the response time of the tool. One participant mentioned, "[Facebook Inspector] Has good response time. Definitely useful tool but false positive rate can be reduced." The number of false positives generated by the plug-in was a recurring concern among some other participants too. One participant remarked, "The false positives were high and the product can be made more useful by incorporating user feedback which will further lead to decrease in false positives." This finding was in line with the consistency score we obtained for the plug-in during the usability evaluation. Participants expressed that some of the posts in their timeline were marked as malicious even though they were published by a verified page. For instance, one of the participants said, "Pages like Humans of New York and The Logical Indian are very reputed, verified and legit accounts who post useful and interesting stuff. A lot of their posts were marked as malicious." This happened because Facebook Inspector does not mark a post as benign just because its publisher is verified by Facebook, but looks at a combination of features to decide whether a post is malicious or not. The 2013 Wall Street crash based on a hoax tweet by a verified Associated Press account is a fitting example of why it is not sufficient to trust content based on its source alone. 25 Moreover, the framework is trained to work on posts related to news-making events. General posts not relating to an event are expected to produce unpredictable results in some cases.
General scale, indicating that Facebook Inspector had higher posts not relating to an event are expected to produce perceived usability than over 90% of all products tested unpredictable results in some cases. using the SUS. 24 Almost all participants (98.1%) found Some participants also expressed the need for a sep- Facebook Inspector easy to use (agree / strongly agree), arate label (probably green in color) for indicating be- and none of the participants (0%) found the system nign posts. One participant stated that the plug-in could to be unnecessarily complex. Also, 67.9% of the par- be of particular help for minors; “[Facebook Inspector] ticipants said that they would like to use this system Can be useful for minors and people who lack the judge- frequently (agree / strongly agree). However in terms ment to decide how the post is.” While most partici- of consistency, only 66% of the participants found the pants found some use for the plug-in, one participant system to be consistent (agree / strongly agree). differed in opinion, and stated, “Generally, my feed con- In addition to the usability survey, we also collected sists of friends activities and articles of my interest, qualitative feedback for the plug-in. We asked a group usually from verified sources. So, it really didnt help in of 36 undergraduate students studying computer sci- that case.” ence to download and use the plug-in, and report their findings in terms of the usefulness of the plug-in. Most student participants found the plug-in useful and easy 7 Challenges to use. Upon being asked about their initial reaction on the plug-in, one of the participants said, “Saves your We came across multiple challenges during the course time spent on spam links and hence enhances user ex- of our research. First, gathering Facebook data is a perience.” Another participant spoke about the utility challenging task. Prior to April 30, 2015, the search of the plug-in, and how she found it to be a means to endpoint of version 1.0 of the Graph API provided a mitigate spam; “A very interesting tool to curb the prob- search method to collect keyword-specific public Face- lem of users clicking random posts and spreading spam- book posts. However, the amount of data returned by like behavior.” Some participants mentioned that they this API was minuscule. While Facebook reported 3 bil- would have liked the plug-in to analyze private posts as lion interactions during the 2014 FIFA world cup 26, we well. Participants also commented on the response time were able to collect only 67,406 posts about the event of the tool. One participant mentioned, “[Facebook In- from the API. Although we understand that part of spector] Has good response time. Definitely useful tool these interactions would not be public, Facebook does but false positive rate can be reduced.” The number of not provide any information about the sampling tech- false positives generated by the plug-in was a recurring niques used by the APIs to provide data. concern among some other participants too. One partic- Secondly, the amount of information available via ipant remarked, “The false positives were high and the Facebook APIs is limited. Unlike other OSNs like Twit- product can be made more useful by incorporating user ter, Facebook does not provide profile or network de- feedback which will further lead to decrease in false pos- tails of users, unless the user explicitly grants permis- itives.” This finding was in line with the consistency sion. 
The friends network, number of posts, location, score we obtained for the plug-in during the usabil- interaction history, work details, etc. of a Facebook user ity evaluation. Participants expressed that some of the are not available publicly via APIs. Scraping this infor- posts in their timeline were marked as malicious even though they were published by a verified page. For in- 25 http://www.bbc.com/news/world-us-canada-21508660 26 http://newsroom.fb.com/news/2014/07/world-cup- 24 https://measuringu.com/sus.php breaks-facebook-records/ Facebook Inspector (FbI) 23 mation through browser automation techniques is pos- best of our knowledge, our dataset of 4.4 million pub- sible, but extremely time consuming. Such behavior is lic posts and 3.3 million users is the biggest dataset in also against Facebook’s terms of service. 27 Absence of literature, collected purely using Facebook APIs. The profile and network details depletes the potential fea- only bigger dataset of Facebook posts used in research ture set associated with a Facebook user, which could in the past was collected by Gao et al. in 2010 [Gao be key to identifying malicious posts. et al., 2010]. This dataset was collected by scraping Further, the limited amount of information available Facebook users’ walls in 2010, and consists of 187 mil- from Facebook is, in most cases, not sufficiently rich for lion wall posts from 3.5 million users. However, this automated techniques to work effectively. As discussed dataset is not available publicly. There exist some other in Section 4 and 4.3, large proportion of Facebook posts anonymized datasets consisting of Facebook’s network do not contain text. This eliminates the use of text structure [McAuley and Leskovec, 2012, Catanese et al., processing and natural language processing techniques 2012, Traud et al., 2012, Opsahl, 2013, Viswanath et al., for extracting text level features from a post. Other 2009], and Facebook usernames [Security, 2010], but we pieces of information comprising a Facebook post may were not able to find any free datasets of user generated include likes and comments, which can potentially be Facebook content (posts) in the public domain. rich features for spam identification. However, likes and We understand that the WOT ratings that we used comments are generated over time, and are absent at to create Dataset I are obtained through crowd sourc- the time a post appears on the social network, making ing, and may suffer biases. However, WOT states that them futile for real time detection. Given the velocity in order to keep ratings more reliable, the system tracks of information spread on OSNs, especially during news- each user’s rating behavior before deciding how much making events, the real time aspect of malicious content it trusts the user. In addition, the meritocratic nature detection is of utmost importance. of WOT makes it far more difficult for spammers to With all the aforementioned challenges, it is safe to abuse. conclude that devising automated techniques for real Preliminary experiments with crisis and non-crisis time detection of malicious content on Facebook is a event related malicious posts separately revealed that non-trivial task. Although prior approaches have shown it was harder to identify malicious posts during crisis promising results, the enormous volume and velocity of events as compared to non-crisis events. 
In future, we information flow on Facebook demand simple, fast, and would like to explore these results more, and attempt robust approaches for quick and accurate detection of to better understand the reasons behind these results. malice. The current architecture of Facebook Inspector is restricted to public Facebook posts only. In future, we would like to add support for analysing private posts, such that users can opt to grant access to their private 8 Limitations and Future Work content to Facebook Inspector for analysis. The dataset of Facebook posts we collected is specific to news-making events. We would like to emphasize that 9 Conclusion and Discussion the aim of our research is to identify malicious posts generated during news-making events. Several efforts Online Social Network users are exposed to a plethora have aimed to detect malicious content on Facebook of information from a variety of sources. The volume, in the past [Gao et al., 2010, Rahman et al., 2012]. diversity, and velocity of content and sources that gen- However, none of them specifically addressed content erate this content make it hard for most users to ac- specific to news-making events. Our results highlighted curately assess the quality of content they consume. that malicious content generated during news-making As OSNs gain more significance as a valid resource for events differs from malicious content in general. We also news, especially during popular events, it becomes vital showed that models for detecting malicious content in to provide automatic techniques to assess the quality of general do not perform well on event-specific data (Sec- online information. tion 5.1.1). In this paper, we propose and evaluate Facebook We do not claim that our dataset is representative Inspector (FbI), a freely available real world solution of the entire Facebook population. Facebook does not for automatic real time detection of malicious posts on provide any information about what fraction of public Facebook. Our large scale measurement study revealed posts is returned by Graph API search. However, to the substantial presence of malicious content which evades Facebook’s existing immune system. Based on empiri- 27 https://www.facebook.com/legal/terms cal findings, Facebook Inspector works on pre-trained 24 P. Dewan, P. Kumaraguru models capturing characteristic differences between ma- japan’s tsunami disaster. International Journal of Web Based licious and benign posts, and computes class labels us- Communities, 7(3):392–402. ing a two-fold approach based on two diverse ground [Aggarwal et al., 2012] Aggarwal, A., Rajadesingan, A., and Kumaraguru, P. (2012). Phishari: Automatic realtime truth datasets. Facebook Inspector distinguishes from phishing detection on twitter. In eCrime Researchers Sum- existing malicious content detection techniques in that mit (eCrime), 2012, pages 1–12. IEEE. it does not rely on message similarity features, which [Ahmed and Abulaish, 2012] Ahmed, F. and Abulaish, M. have been largely used in the past to detect campaigns. (2012). An mcl-based approach for spam profile detection in online social networks. In IEEE TrustCom, pages 602–608. Facebook Inspector detects malicious posts in real time IEEE. without depending on any engagement metrics asso- [Antoniades et al., 2011] Antoniades, D., Polakis, I., Kon- ciated with a post (likes, comments, or shares). Per- taxis, G., Athanasopoulos, E., Ioannidis, S., Markatos, formance and user evaluation suggests that Facebook E. P., and Karagiannis, T. 
(2011). we. b: The web of short urls. In Proceedings of WWW, pages 715–724. ACM. Inspector is not only fast, but also easy to understand [Benevenuto et al., 2010] Benevenuto, F., Magno, G., Ro- and use for its end users. drigues, T., and Almeida, V. (2010). Detecting spammers The insights obtained from our work can be useful on twitter. In CEAS, volume 6, page 12. for researchers to better understand the characteristics [Benevenuto et al., 2009] Benevenuto, F., Rodrigues, T., Almeida, V., Almeida, J., and Gon¸calves, M. (2009). De- of malicious content on Online Social Networks in gen- tecting spammers and content promoters in online video eral. Findings from our characterization and analysis of social networks. In Proceedings of ACM SIGIR, pages 620– malicious, poor quality content augment existing social 627. ACM. media research on credibility of user generated content. [Brooke, 1996] Brooke, J. (1996). SUS-a quick and dirty us- ability scale. Usability evaluation in industry, 189:194. Our results also provide conclusive proof that malicious [Castillo et al., 2011] Castillo, C., Mendoza, M., and Poblete, content in general is different from context-specific ma- B. (2011). Information credibility on twitter. In WWW, licious content generated in context of news events, and pages 675–684. ACM. that models trained to detect malicious content in gen- [Catanese et al., 2012] Catanese, S., De Meo, P., Ferrara, E., eral are not effective for identifying context-specific ma- Fiumara, G., and Provetti, A. (2012). Extraction and anal- ysis of facebook friendship relations. In Computational Social licious content. Further, the system built as part of this Networks, pages 291–324. Springer. research is effectively used by hundreds of Facebook [Chhabra et al., 2011] Chhabra, S., Aggarwal, A., Ben- users everyday to make informed and intelligent deci- evenuto, F., and Kumaraguru, P. (2011). Phi.sh/$ocial: sions about content they consume on the network. the phishing landscape through short urls. In CEAS, pages 92–101. ACM. Malicious entities keep evolving; they change strate- [Chu et al., 2012] Chu, Z., Widjaja, I., and Wang, H. (2012). gies and exhibit a drift in behavior to evade detection Detecting social spam campaigns on twitter. In Applied over time. Our experiments showed that this drift prob- Cryptography and Network Security, pages 455–472. Springer. lem can be tackled to an extent by our strategy of col- [Facebook, 2014] Facebook (2014). http://newsroom.fb. com/company-info/. Facebook Company Info. lecting data over a long period of time. However, this [Facebook et al., 2013] Facebook, Ericsson, and Qualcomm technique of gathering data over extensive periods of (2013). A focus on efficiency. Whitepaper, Internet.org. time may not always be favorable. In addition to the [Facebook Developers, 2011] Facebook Developers time and resources required, this technique may not (2011). Keeping you safe from scams and spam. https:// www.facebook.com/ notes/ facebook-security/ keeping- be particularly helpful for addressing problems that re- you-safe-from-scams-and-spam/ 10150174826745766 . quire immediate attention, for example, zero-day at- [Facebook Developers, 2013] Facebook Developers (2013). tacks. Facebook graph api search. https:// developers.facebook. com/ docs/ graph-api/ using-graph-api/ v1.0#search. [Gao et al., 2012] Gao, H., Chen, Y., Lee, K., Palsetia, D., 10 Acknowledgements and Choudhary, A. N. (2012). Towards online spam filtering in social networks. In NDSS. 
[Gao et al., 2010] Gao, H., Hu, J., Wilson, C., Li, Z., Chen, We would like to thank Manik Panwar for helping with Y., and Zhao, B. Y. (2010). Detecting and characterizing the development of Facebook Inspector, and Bhavna social spam campaigns. In Internet Measurement Conference, Nagpal for helping with conducting the usability survey. pages 35–47. ACM. We would also like to thank the members of Precog Re- [Gao et al., 2014] Gao, H., Yang, Y., Bu, K., Chen, Y., Downey, D., Lee, K., and Choudhary, A. (2014). Spam ain’t search Group at IIIT-Delhi for their constant support as diverse as it seems: throttling osn spam with templates and feedback. underneath. In Proceedings of the 30th Annual Computer Se- curity Applications Conference, pages 76–85. ACM. [Google, 2014] Google (2014). Safe browsing api. https:// References developers.google.com/ safe-browsing/ . [Grier et al., 2010] Grier, C., Thomas, K., Paxson, V., and [Acar and Muraki, 2011] Acar, A. and Muraki, Y. (2011). Zhang, M. (2010). @ spam: the underground on 140 char- Twitter for crisis communication: lessons learned from acters or less. In CCS, pages 27–37. ACM. Facebook Inspector (FbI) 25

[Gupta and Kumaraguru, 2012] Gupta, A. and Ku- [Stein et al., 2011] Stein, T., Chen, E., and Mangla, K. maraguru, P. (2012). Credibility ranking of tweets (2011). Facebook immune system. In Workshop on Social during high impact events. In PSOSM. ACM. Network Systems, page 8. ACM. [Gupta et al., 2014] Gupta, A., Kumaraguru, P., Castillo, C., [Stringhini et al., 2010] Stringhini, G., Kruegel, C., and Vi- and Meier, P. (2014). Tweetcred: Real-time credibility as- gna, G. (2010). Detecting spammers on social networks. In sessment of content on twitter. In Social Informatics, pages ACSAC, pages 1–9. ACM. 228–243. Springer. [SURBL, URI, 2011] SURBL, URI (2011). Reputation data. [Gupta et al., 2013] Gupta, A., Lamba, H., and Ku- http:// www.surbl.org/ surbl-analysis. maraguru, P. (2013). $1.00 per rt #bostonmarathon #pray- [Szell et al., 2014] Szell, M., Grauwin, S., and Ratti, C. forboston: Analyzing fake content on twitter. In eCRS, (2014). Contraction of online response to major events. page 12. IEEE. PLoS ONE 9(2): e89052, MIT. [Gupta et al., 2012] Gupta, M., Zhao, P., and Han, J. (2012). [TheGuardian, 2013] TheGuardian (2013). Facebook spam- Evaluating event credibility on twitter. In SDM, pages 153– mers make $200m just posting links, researchers say. 164. SIAM. http:// www.theguardian.com/ technology/ 2013/ aug/ 28/ [Hispasec Sistemas S.L., 2013] Hispasec Sistemas S.L. facebook-spam-202-million-italian-research. (2013). VirusTotal Public API. https:// www.virustotal. [Traud et al., 2012] Traud, A. L., Mucha, P. J., and Porter, com/ en/ documentation/ public-api/ . M. A. (2012). Social structure of facebook networks. Physica [Holcomb et al., 2013] Holcomb, J., Gottfried, J., and A: Statistical Mechanics and its Applications, 391(16):4165– Mitchell, A. (2013). News use across social media plat- 4180. forms. Technical report, Pew Research Center. [Viswanath et al., 2009] Viswanath, B., Mislove, A., Cha, [Marca.com, 2014] Marca.com (2014). Luis suarez used as M., and Gummadi, K. P. (2009). On the evolution of user bait for facebook scam. http:// www.marca.com/ 2014/ 07/ interaction in facebook. In Proceedings of the 2nd ACM work- 18/ en/ football/ barcelona/ 1405709402.html. shop on Online social networks, pages 37–42. ACM. [McAuley and Leskovec, 2012] McAuley, J. J. and Leskovec, [Wang, 2010] Wang, A. H. (2010). Don’t follow me: Spam J. (2012). Learning to discover social circles in ego networks. detection in twitter. In SECRYPT, pages 1–10. IEEE. In NIPS, volume 2012, pages 548–56. [WOT, 2014] WOT (2014). Web of trust api. https:// www. [Mendoza et al., 2010] Mendoza, M., Poblete, B., and mywot.com/ en/ api. Castillo, C. (2010). Twitter under crisis: Can we trust what [Zech, 2014] Zech, M. (2014). Flight 17 spam scams on face- we rt? In Proceedings of the first workshop on social media book, twitter. http:// www.nltimes.nl/ 2014/ 07/ 22/ flight- analytics, pages 71–79. ACM. 17-spam-scams-facebook-twitter/ . [OpenDNS, 2014] OpenDNS (2014). Phishtank api. http: [Zhang et al., 2012] Zhang, X., Zhu, S., and Liang, W. // www.phishtank.com/ api info.php. (2012). Detecting spam and promoting campaigns in the [Opsahl, 2013] Opsahl, T. (2013). Triadic closure in two- twitter social network. In Data Mining (ICDM), 2012 IEEE mode networks: Redefining the global and local clustering 12th International Conference on, pages 1194–1199. IEEE. coefficients. Social Networks, 35(2):159–167. [Zhu et al., 2016] Zhu, T., Gao, H., Yang, Y., Bu, K., Chen, [Owens and Turitzin, 2014] Owens, E. and Turitzin, C. 
Y., Downey, D., Lee, K., and Choudhary, A. N. (2016). (2014). News feed fyi: Cleaning up news feed Beating the artificial chaos: Fighting osn spam using its spam. http://newsroom.fb.com/news/2014/04/news-feed-fyi- own templates. IEEE ACM Transactions on Networking. cleaning-up-news-feed-spam/. [Owens and Weinsberg, 2015] Owens, E. and Weins- berg, U. (2015). News feed fyi: Showing fewer hoaxes. https:// newsroom.fb.com/ news/ 2015/ 01/ news-feed-fyi- showing-fewer-hoaxes/ . [Palen, 2008] Palen, L. (2008). Online social media in crisis events. Educause Quarterly, 31(3):76–78. [Rahman et al., 2012] Rahman, M. S., Huang, T.-K., Mad- hyastha, H. V., and Faloutsos, M. (2012). Efficient and scalable socware detection in online social networks. In USENIX Security Symposium, pages 663–678. [Rudra et al., 2016] Rudra, K., Banerjee, S., Ganguly, N., Goyal, P., Imran, M., and Mitra, P. (2016). Summarizing situational tweets in crisis scenario. In Proceedings of the 27th ACM Conference on Hypertext and Social Media, pages 137–147. ACM. [Security, 2010] Security, R. B. S. (2010). Facebook names dataset. https:// blog.skullsecurity.org/ 2010/ return-of-the- facebook-snatchers. [Semaan and Mark, 2012] Semaan, B. and Mark, G. (2012). ’facebooking’towards crisis recovery and beyond: disruption as an opportunity. In Proceedings of the ACM 2012 conference on computer supported cooperative work, pages 27–36. ACM. [Sheng et al., 2009] Sheng, S., Wardman, B., Warner, G., Cranor, L., Hong, J., and Zhang, C. (2009). An empiri- cal analysis of phishing blacklists. In Sixth Conference on Email and Anti-Spam (CEAS). [SpamHaus, 2014] SpamHaus (2014). Domain block list. http://www.spamhaus.org/dbl/.