Detecting Spam Classification on Twitter Using URL Analysis, Natural Language Processing and Machine Learning
International Journal of Innovative and Emerging Research in Engineering
Volume 3, Special Issue 1, ICSTSD 2016

Ms. Priyanka Chorey, P.R.M.I.T.&R. Badnera, Amravati, India
Ms. Sonika A. Chorey, P.R.M.I.T.&R. Badnera, Amravati, India
Ms. Prof. R.N. Sawade, P.R.M.I.T.&R. Badnera, Amravati, India
Ms. P.V. Mamanka, P.R.M.I.T.&R. Badnera, Amravati, India

Abstract- People today are heavily accustomed to social networks, which makes it very easy to spread spam content through them. The details of any person can be accessed very easily through these sites, and no one is safe on social media. In this paper we propose an application that uses an integrated approach to spam classification on Twitter. The integrated approach comprises URL analysis, natural language processing and supervised machine learning techniques; in short, it is a three-step process. We consider the problem of detecting spammers on Twitter. We construct a large labeled collection of users, manually classified into spammers and non-spammers. We then identify a number of characteristics related to tweet content and user social behavior which could potentially be used to detect spammers.

Keywords- natural language processing; tweets; machine learning; URLs.

I. INTRODUCTION

Online Social Networks (OSNs) are becoming very popular these days. Some of the popular OSNs are Twitter, Facebook, MySpace and LinkedIn. They are platforms through which people can share their ideas and thoughts. With the increasing popularity of these sites, the attacks on them are also increasing. These sites have millions of users, and not all users are legitimate: each of these OSNs hosts many illegitimate (spam) accounts.

Over the last few years, social networking sites have become one of the main ways for users to keep track of and communicate with their friends online. Sites such as Facebook, MySpace, and Twitter are consistently among the top 20 most-viewed web sites of the Internet. Moreover, statistics show that, on average, users spend more time on popular social networking sites than on any other site [1]. Most social networks provide mobile platforms that allow users to.

In this paper, we concentrate on spammers on Twitter. Twitter is a microblogging site which allows a maximum of 140 characters in each tweet (message). The four major types of spammers on Twitter that we consider in this paper are:

1) Malware Propagators: These are users who tweet malicious links which, when clicked, lead to the downloading of malware. Malware is malicious software.

2) Phishers: These are account users who spread malicious URLs through their tweets. Legitimate users who click on these links end up on the wrong websites. This leads to the theft of passwords, credit card information and so on.

3) Adult Content Propagators: These spammers broadcast links containing adult content. On clicking those links, the user is redirected to malicious sites.

4) Marketers: These are spammers who concentrate on spreading advertisements. They try to make different products trend. Marketers are normally harmless because the only thing they do is popularize their products, but sometimes these users can mislead legitimate users.

In this paper we propose an application which can classify a Twitter user as spam or legitimate. To achieve this, an integrated approach is used which combines URL analysis, Natural Language Processing and Machine Learning techniques. These techniques are applied in the order given here.

A tweet may or may not contain a URL. Since Twitter supports only 140 characters in a tweet, a long URL is normally shortened. Many URL shortening services are available these days, for example Google URL Shortener, Bitly and the Twitter URL shortener. These shorteners generate short URLs on domains such as goo.gl, bit.ly and t.co. Moreover, Twitter has different features such as @ mentions, # tags and RT. @ mentions are used to address a user, # tags are used for trending a topic, and RT shows that the tweet is a retweet.

The rest of the paper is organized as follows. Section II discusses related work in this field. Section III describes the system design and methodology. Section IV explains the experiments done. Section V gives the experiment results. Finally, the paper is concluded in Section VI.

II. RELATED WORK AND BACKGROUND

Social networks offer a way for users to keep track of their friends and communicate with them. This network of trust typically regulates which personal information is visible to whom. In our work, we looked at the different ways in which social networks manage the network of trust and the visibility of information between users. This is important because the nature of the network of trust provides spammers with different options for sending spam messages, learning information about their victims, or befriending someone (to appear trustworthy and make it more difficult to be detected as a spammer).

2.1 The MySpace Social Network

MySpace was the first social network to gain significant popularity among Internet users. The basic idea of this network is to provide each user with a web page, which the user can then personalize with information about herself and her interests. Even though MySpace also has the concept of "friendship," like Facebook, MySpace pages are public by default. Therefore, it is easier for a malicious user to obtain sensitive information about a user on MySpace than on Facebook.

2.2 The Twitter Social Network

Twitter is a much simpler social network than Facebook and MySpace. It is designed as a microblogging platform, where users send short text messages (i.e., tweets) that appear on their friends' pages. Unlike Facebook and MySpace, no personal information is shown on Twitter pages by default. Users are identified only by a username and, optionally, by a real name. To profile a user, it is possible to analyze the tweets she sends and the feeds to which she is subscribed. However, this is significantly more difficult than on the other social networks.

A Twitter user can start "following" another user. As a consequence, she receives the user's tweets on her own page. The user who is "followed" can, if she wants, follow the other one back. Tweets can be grouped by hash tags, which are popular words beginning with a "#" character. This allows users to efficiently search who is posting topics of interest at a certain time. When a user likes someone's tweet, she can decide to retweet it. As a result, that message is shown to all her followers. By default, profiles on Twitter are public, but a user can decide to protect her profile. By doing that, anyone wanting to follow the user needs her permission. According to the same statistics, Twitter is the social network that has the fastest growing rate on the Internet. During the last year, it reported a 660% increase in visits [2].

... request to B, who has to acknowledge that she knows A. When B confirms the request, a friendship connection with A is established. However, the users' perception of Facebook friendship is different from their perception of a relationship in real life. Most of the time, Facebook users accept friendship requests from persons they barely know, while in real life, the person asking to be a friend would undergo more scrutiny.

A lot of research has been done in this field. The concept of social honeypots is used in [1], along with machine learning, for spam detection in OSNs. Social honeypots are fake profiles or accounts which are created deliberately to gain the attention of a spammer.

A method of detecting pharmaceutical spam on Twitter is discussed in [2]. This is done by applying text mining techniques and data mining tools. The paper addresses how to decide whether a new incoming spam message is pharmaceutical spam or not. The authors used the decision tree (J48) algorithm and the Naïve Bayes algorithm, and finally compared the output obtained by both classifiers. A set of 65 words related to pharmaceuticals was used as the training set. If at least one of these 65 words is present in a tweet in the test set, the tweet is classified as spam.

Online spam filtering is presented in [3]. This is a real-time system. It can inspect a message (a tweet in the case of Twitter, a post in the case of Facebook) and drop it if it is found to be spam. Spam messages are dropped even before the intended recipient gets them. Everything happens in real time, and such messages are not stored in the database. The paper uses machine learning techniques. Millions of tweets and posts were collected from both Twitter and Facebook as datasets. The authors used two supervised machine learning algorithms, namely Support Vector Machine (SVM) and Decision Tree.

The evaluation of context-aware spam that could result from information shared on social networks is dealt with in [4]. Mitigation techniques are also discussed there. The authors performed their analysis on Facebook and concluded that context-aware e-mail attacks have a high rate of success. The paper also mentions the defence strategies taken by other social networks such as LinkedIn and MySpace.

A harvested Twitter dataset and its links are examined in [6]. Here the authors found features by which content polluters can be easily identified.
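
The tweet elements named in the Introduction (shortened URLs, @ mentions, # tags and the RT marker) can be pulled out of raw tweet text with a few regular expressions. The following is only a minimal sketch under stated assumptions, not the system proposed in this paper: the function name `tweet_features`, the regular expressions and the three-entry shortener list are all illustrative choices, and a practical system would need a much longer list of shortening services.

```python
import re
from urllib.parse import urlparse

# Shortener hosts taken from the examples in the text (assumed, not exhaustive).
SHORTENER_HOSTS = {"goo.gl", "bit.ly", "t.co"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")
RETWEET_RE = re.compile(r"^RT\b")


def tweet_features(text):
    """Return the surface features of one tweet as a dictionary."""
    urls = URL_RE.findall(text)
    # A URL counts as shortened when its host is one of the known shorteners.
    short_urls = [u for u in urls if urlparse(u).hostname in SHORTENER_HOSTS]
    # Blank out URLs so that '#' or '@' inside a link is not miscounted.
    rest = URL_RE.sub(" ", text)
    return {
        "urls": urls,
        "short_urls": short_urls,
        "mentions": MENTION_RE.findall(rest),
        "hashtags": HASHTAG_RE.findall(rest),
        "is_retweet": bool(RETWEET_RE.match(text)),
    }


if __name__ == "__main__":
    print(tweet_features("RT @alice: win a prize http://bit.ly/abc #free"))
```

Features of this kind would only be the input to the later URL analysis and classification steps; the sketch does not resolve the short URL or judge whether its target is malicious.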
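
The keyword rule described for the pharmaceutical-spam work in [2] (a tweet is labeled spam when at least one word from a fixed list occurs in it) can be sketched as follows. This is an illustration under assumptions: the 65 words used in [2] are not listed in this paper, so `PHARMA_WORDS` is a hypothetical three-word stand-in, and the tokenizer and the function name `is_pharma_spam` are likewise invented for the example. The sketch covers only the keyword rule, not the J48 or Naïve Bayes classifiers that [2] compares.

```python
import re

# Hypothetical stand-in for the 65-word pharmaceutical list used in [2];
# the actual words are not given in the text.
PHARMA_WORDS = {"pills", "pharmacy", "prescription"}

TOKEN_RE = re.compile(r"[a-z0-9]+")


def is_pharma_spam(tweet, vocabulary=PHARMA_WORDS):
    """Return True when at least one vocabulary word occurs in the tweet."""
    tokens = set(TOKEN_RE.findall(tweet.lower()))
    return not tokens.isdisjoint(vocabulary)


if __name__ == "__main__":
    print(is_pharma_spam("Cheap PILLS online, no prescription needed!"))
    print(is_pharma_spam("Lunch with friends today"))
```

Matching whole lowercase tokens rather than substrings is a design choice made here so that a list word does not fire inside an unrelated longer word.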