International Journal of Innovative and Emerging Research in Engineering Volume 3, Special Issue 1, ICSTSD 2016 Detecting Spam Classification Using URL Analysis, Natural Language Processing and Machine Learning

Ms. Sonika A. Chorey, P.R.M.I.T.&R. Badnera, Amravati, India, [email protected]
Ms. Priyanka Chorey, P.R.M.I.T.&R. Badnera, Amravati, India, [email protected]

Prof. R. N. Sawade, P.R.M.I.T.&R. Badnera, Amravati, India, [email protected]
Ms. P. V. Mamanka, P.R.M.I.T.&R. Badnera, Amravati, India, [email protected]

Abstract- In the present day world, people are very much habituated to social networks. Because of this, it is very easy to spread spam content through them. One can access the details of any person very easily through these sites. No one is safe inside social media. In this paper we propose an application which uses an integrated approach to spam classification in Twitter. The integrated approach comprises the use of URL analysis, natural language processing and supervised machine learning techniques; in short, it is a three-step process. We consider the problem of detecting spammers on Twitter. We construct a large labeled collection of users, manually classified into spammers and non-spammers. We then identify a number of characteristics related to tweet content and user social behavior, which could potentially be used to detect spammers.

Keywords—natural language processing; tweets; machine learning; URLs.

I. INTRODUCTION

Online Social Networks (OSNs) are becoming very popular these days. Some of the popular OSNs are Twitter, Facebook, MySpace, LinkedIn etc. With the increasing popularity of these sites, the attacks on them are also increasing. An OSN is a platform through which people can share their ideas and thoughts. These sites have millions of users, and not all users are legitimate: each of these OSNs has a large number of illegitimate (or spam) accounts.

Over the last few years, social networking sites have become one of the main ways for users to keep track of and communicate with their friends online. Sites such as Facebook, MySpace, and Twitter are consistently among the top 20 most-viewed web sites of the Internet. Moreover, statistics show that, on average, users spend more time on popular social networking sites than on any other site [1]. Most social networks also provide mobile platforms.

In this paper, we concentrate on spammers in Twitter. Twitter is a microblogging site, which allows only a maximum of 140 characters in each tweet (message). The four major types of spammers on Twitter that we have considered in this paper are:

1) Malware Propagators: These are users who tweet malicious links which, on clicking, lead to the downloading of malware. Malware is malicious software.

2) Phishers: These are account users who spread malicious links through their tweets. Legitimate users who click on these links end up in wrong websites. This leads to the stealing of passwords, credit card information and so on.

3) Adult Content Propagators: These spammers broadcast links containing adult content. On clicking those links, the user will be redirected to malicious sites.

4) Marketers: These are spammers who concentrate on spreading advertisements. They try to trend different products. Marketers are normally harmless because the only thing they do is popularize their products, but sometimes these users can mislead legitimate users.

In this paper we propose an application which can classify a Twitter user as spam or legitimate. To achieve this, an integrated approach, which combines URL analysis, Natural Language Processing and Machine Learning techniques, is used. These techniques are applied in the same order as given above.

A tweet may or may not contain a URL. Since Twitter supports only 140 characters in a tweet, a long URL is normally shortened, and many URL shortening services are available these days, for example Google's URL shortener, Bitly and Twitter's own URL shortener. These shorteners generate short URLs which end with goo.gl, bit.ly, t.co etc. Moreover, Twitter has different features like @ mentions, # tags and RT: @ mentions are used to address a user, # tags are used for trending a topic, and RT shows that the tweet is retweeted.

The rest of the paper is organized as follows. Section II talks about the related work done in this field. Section III describes the system design and methodology. Section IV explains the experiments done. Section V gives the
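The Twitter-specific features just described (@ mentions, # tags, the RT marker and shortened URLs) can be pulled out of a tweet's text with simple pattern matching. A minimal sketch in Python; the paper itself shows no code, so the regular expressions and the function name are our own illustration:

```python
import re

# Hypothetical helper: extract the Twitter features described above
# (@ mentions, # tags, a retweet marker, and URLs) from a tweet's text.
def extract_features(tweet):
    return {
        "mentions": re.findall(r"@(\w+)", tweet),
        "hashtags": re.findall(r"#(\w+)", tweet),
        "is_retweet": tweet.startswith("RT "),
        "urls": re.findall(r"https?://\S+", tweet),
    }

tweet = "RT @alice: check this out http://bit.ly/abc123 #deals #sale"
features = extract_features(tweet)
print(features["mentions"])    # ['alice']
print(features["hashtags"])    # ['deals', 'sale']
print(features["is_retweet"])  # True
print(features["urls"])        # ['http://bit.ly/abc123']
```

Counting these matches (and their unique values) is exactly what yields the six numeric features used later for classification.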

experiment results. Finally, the paper is concluded in Section VI.

II. RELATED WORK AND BACKGROUND

Social networks offer a way for users to keep track of their friends and communicate with them. This network of trust typically regulates which personal information is visible to whom. In our work, we looked at the different ways in which social networks manage the network of trust and the visibility of information between users. This is important because the nature of the network of trust provides spammers with different options for sending spam messages, learning information about their victims, or befriending someone (to appear trustworthy and make it more difficult to be detected as a spammer).

2.1 The MySpace Social Network

MySpace was the first social network to gain significant popularity among Internet users. The basic idea of this network is to provide each user with a web page, which the user can then personalize with information about herself and her interests. Even though MySpace also has the concept of "friendship," like Facebook, MySpace pages are public by default. Therefore, it is easier for a malicious user to obtain sensitive information about a user on MySpace than on Facebook.

2.2 The Twitter Social Network

Twitter is a much simpler social network than Facebook and MySpace. It is designed as a microblogging platform, where users send short text messages (i.e., tweets) that appear on their friends' pages. Unlike Facebook and MySpace, no personal information is shown on Twitter pages by default. Users are identified only by a username and, optionally, by a real name. To profile a user, it is possible to analyze the tweets she sends, and the feeds to which she is subscribed. However, this is significantly more difficult than on the other social networks. A Twitter user can start "following" another user; as a consequence, she receives the user's tweets on her own page. The user who is "followed" can, if she wants, follow the other one back. Tweets can be grouped by hash tags, which are popular words beginning with a "#" character. This allows users to efficiently search who is posting topics of interest at a certain time. When a user likes someone's tweet, he can decide to retweet it; as a result, that message is shown to all her followers. By default, profiles on Twitter are public, but a user can decide to protect her profile. By doing that, anyone wanting to follow the user needs her permission. According to the same statistics, Twitter is the social network with the fastest growing rate on the Internet: during the last year, it reported a 660% increase in visits [2].

2.3 The Facebook Social Network

Facebook is currently the largest social network on the Internet. On their website, the Facebook administrators claim to have more than 400 million active users all over the world, with over 2 billion media items (videos and pictures) shared every week [3]. Usually, user profiles are not public, and the right to view a user's page is granted only after having established a relationship of trust (paraphrasing the Facebook terminology, becoming friends) with the user. When a user A wants to become friends with another user B, the platform first sends a request to B, who has to acknowledge that she knows A. When B confirms the request, a friendship connection with A is established. However, the users' perception of Facebook friendship is different from their perception of a relationship in real life. Most of the time, Facebook users accept friendship requests from persons they barely know, while in real life, the person asking to be a friend would undergo more scrutiny.

A lot of research has been done in this field. The authors of [1] have used the concept of social honeypots, along with machine learning, for spam detection in OSNs. Social honeypots are fake profiles or accounts which are created deliberately to gain the attention of a spammer.

A method for detecting pharmaceutical spam in Twitter is discussed in [2]. This is done by applying text mining techniques and data mining tools. The paper addresses how to classify a new incoming spam message as pharmaceutical spam or not. The authors used the decision tree (J48) algorithm and the Naïve-Bayes algorithm, and finally compared the output obtained by both these classifiers. A set of 65 words (which were related to pharmaceuticals) was used as the training set. If at least one word out of these 65 was present in a tweet in the test set, it was classified as spam.

Online spam filtering is presented in [3]. This is a real-time system which can inspect a message (a tweet in the case of Twitter, a post in the case of Facebook) and drop it if it is found to be spam. The spam messages are dropped even before the intended recipient gets them; everything happens in real time, and such messages are not stored in the database. The paper uses machine learning techniques. Millions of tweets and posts were collected from both Twitter and Facebook as datasets. The authors used two supervised machine learning algorithms, namely Support Vector Machine (SVM) and Decision Tree.

An evaluation of the context-aware spam that could result from information shared on social networks is dealt with in [4], where mitigation techniques are also discussed. The authors carried out their analysis on Facebook and concluded that context-aware e-mail attacks have a high rate of success. The paper also mentions the defence strategies taken by other social networks like LinkedIn and MySpace.

A harvested Twitter dataset and its links are examined in [6]. Here the authors found features using which content polluters can be easily identified, and proposed a long-term study of protecting social networks using honeypots. Almost 60 honeypots were deployed for seven months, which resulted in the harvesting of more than 30,000 spam items. The spam classification was done using machine learning algorithms.

Detecting spam bots in Online Social Networks, especially in Twitter, is the purpose of [7]. Spammers use Twitter to post multiple duplicate updates. In this paper, the suspicious behaviour of spam bots is studied. The authors used classification methods like decision tree, neural networks, Support Vector Machine (SVM), k-nearest neighbour and the Naïve-Bayes
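Several of the works surveyed here ([2], [3], [7]) reduce messages to simple word features and feed them to off-the-shelf classifiers, with Naïve-Bayes repeatedly coming out on top. As a concrete toy illustration of that pipeline (the training data, function names and word choices below are our own invention, not taken from any of the cited papers), here is a minimal bag-of-words Naïve-Bayes classifier with Laplace smoothing:

```python
import math
from collections import Counter

# Invented toy training data: (tweet text, label).
TRAIN = [
    ("buy cheap viagra now", "spam"),
    ("discount drugs buy now", "spam"),
    ("follow me back for discount pills", "spam"),
    ("great game last night with friends", "legitimate"),
    ("reading a nice book this weekend", "legitimate"),
    ("meeting friends for coffee later", "legitimate"),
]

def train(examples):
    """Count per-label word frequencies, label priors and the vocabulary."""
    word_counts = {}          # label -> Counter of words
    label_counts = Counter()  # label -> number of examples
    vocab = set()
    for text, label in examples:
        words = text.split()
        word_counts.setdefault(label, Counter()).update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(text, word_counts, label_counts, vocab):
    """Pick the label maximizing log P(L) + sum of log P(word | L),
    with Laplace (add-one) smoothing for unseen words."""
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total)
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

wc, lc, vocab = train(TRAIN)
print(classify("buy discount viagra", wc, lc, vocab))         # spam
print(classify("coffee with friends tonight", wc, lc, vocab)) # legitimate
```

The independence assumption behind the per-word product is exactly the Naïve-Bayes model used later in this paper's own classification step.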

algorithm. The authors manually labelled 500 accounts as spam and non-spam for the training set. All the algorithms used were compared with each other, and Naïve-Bayes was found to be the best.

Our application is a combination of the tasks discussed so far.

III. SYSTEM DESIGN AND METHODOLOGY

We have come up with an application which can classify a Twitter account as spam or legitimate. The input to the application is the username of the Twitter account to be checked, and the output is either "spam" or "legitimate". The last 10 tweets of the entered username are used for the whole process, and a user interface is provided. The entire work uses three techniques:

A. URL Analysis

The first step in this application is URL analysis. URL analysis has been done in [11] and [12]. For this, the URLs are extracted from the tweets. The extracted URLs are normally shortened ones, so they are first converted to their long form. For doing this we use the HttpURLConnection class, which helps in finding the page to which a particular URL is redirected. When a URL is redirected to another, the response code is 301; so if the header contains 301, we take that location as the long URL. URL analysis involves two steps:

1) Comparison with a set of blacklisted URLs

A set of blacklisted URLs was downloaded from http://urlblacklist.com. This data set consists of lakhs of URLs from different categories. We chose 4 categories: URLs related to advertisements, malware, adult content and phishing. This set contains almost 15,000 URLs.

The URLs extracted from the tweets are compared with the blacklisted URLs. If n URLs are extracted from the tweets and even one among the n URLs is present in the blacklist, the user is regarded as spam. This is because a legitimate user will never tweet a URL which is blacklisted. If the user is classified as spam, the whole process stops here; else, the process continues.

2) Comparison with a set of already identified expressions

The next task is to identify a set of expressions or words in a URL which can prove that the URL is spam. Some of the expressions were obtained from www.urlblacklist.com; the rest were identified after thorough research. A total of 33 words were identified: /ads, /realmedia/ads/, /pics/banner/, adultos, adultsight, adultsite, adultsonly, adultweb, blowjob, bondage, centerfold, cumshot, cyberlust, cybercore, hardcore, incest, masturbate, obscene, pedophil, pedofil, playmate, pornstar, sexdream, showgirl, softcore, striptease, adultsight, adultsite, adultsonly, adultweb, penis, vagina, xxx.

The presence of even one of these expressions is enough to conclude that the URL is spam, and if the URL is spam, the user is classified as spam. If the user is not classified as spam in this step, or if he hasn't tweeted any URLs, then the next technique, natural language processing, is applied.

B. Natural Language Processing

Natural Language Processing (NLP) is a technique which enables a machine to process a natural language (like English) and do the things that a human can do with it; in short, NLP helps in automating things. A similar approach is used in [8], [9] and [10]. Extracting information from unstructured data using NLP is discussed in [8]. Malicious tweets are identified in [9], where NLP is also used. In [10], NLP is used in sentiment analysis of a subject.

Before going into deeper concepts of NLP, a set of incomplete sentences which normally appear in tweets was identified. After researching on Twitter, 11 common sentences in spam tweets were found. They are: 'add me at', 'take me on a date', 'you'll laugh when you see this pic of you', 'You look different in this photo', 'my friend sent me this pic with you in it', 'my friend showed me this pic of you', 'follow me back', 'discount drugs', 'I found you in this video', 'Is that you in this picture', 'buy now'. If these expressions are found in a tweet, the user is classified as spam.

In this paper, two concepts of NLP have been used: removal of stop words and stemming. For processing, English stop words like "I", "about", "above" etc. are not needed, so all these words are removed and only the keywords are extracted. The next step is to find the root word, or stem, of each keyword; for this, stemming techniques are used. Examples of stemming:

Complexity ------> Complex
Possessive ------> Possess

A simple stemming algorithm has been used in this paper. A set of spam words that can appear in a tweet was identified, like 'porn', 'Viagra' etc. The stemmed keywords are compared with the set of identified spam words; if the words match, the user is regarded as spam. If at this stage the user is still not found to be spam, the third technique, machine learning, is used.

C. Machine Learning Techniques

Using machine learning techniques, a machine can learn on its own, so no human intervention is required. These algorithms use a training set: labelled examples obtained after analysing data manually. The training set is of the form (a1, a2, ..., an, L), where a1, ..., an are attributes and L is the label. The test set contains a set of n attributes of the form {a1, a2, ..., an}.

In this paper we have used Naïve-Bayes, which is a supervised machine learning algorithm.
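The URL and NLP checks of Section III can be sketched end to end in a few lines. This is an illustrative Python sketch, not the authors' implementation: the word lists are shortened stand-ins, the stemmer is a naive suffix-stripper, and URL expansion (done in the paper via HTTP 301 redirects with Java's HttpURLConnection) is stubbed out:

```python
# Illustrative sketch of the Section III pipeline (not the paper's code).
# Shortened stand-ins for the blacklist, the 33 URL expressions,
# the stop words and the spam keywords described above.
BLACKLISTED_URLS = {"http://bad.example/malware"}
URL_SPAM_EXPRESSIONS = ["/ads", "hardcore", "xxx"]
STOP_WORDS = {"i", "about", "above", "me", "at"}
SPAM_KEYWORDS = {"porn", "viagra"}

def expand_url(url):
    # Stub: the paper follows 301 redirects to get the long URL.
    return url

def stem(word):
    # Naive suffix-stripping stand-in for the paper's "simple
    # stemming algorithm" (e.g. complexity -> complex).
    for suffix in ("ity", "ive", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def classify_user(tweets, urls):
    # Step A: URL analysis - blacklist, then known spam expressions.
    for url in map(expand_url, urls):
        if url in BLACKLISTED_URLS:
            return "spam"
        if any(expr in url for expr in URL_SPAM_EXPRESSIONS):
            return "spam"
    # Step B: NLP - drop stop words, stem, compare with spam keywords.
    for tweet in tweets:
        keywords = [w for w in tweet.lower().split() if w not in STOP_WORDS]
        if any(stem(w) in SPAM_KEYWORDS for w in keywords):
            return "spam"
    # Step C (not shown here): fall through to the Naive-Bayes classifier.
    return "undecided"

print(classify_user(["buy viagra today"], []))                   # spam
print(classify_user(["hello world"], ["http://x.example/xxx1"])) # spam
print(classify_user(["nice weather today"], []))                 # undecided
```

The early returns mirror the paper's ordering: the cheap, high-precision URL checks run first, and the machine learning step is reached only when the earlier stages are inconclusive.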

The dataset used here was first tested using two algorithms: Naïve-Bayes and SVM. Of these, Naïve-Bayes was found to be more accurate, which is why we chose it.

Naïve-Bayes

Naïve-Bayes is a probabilistic classifier which uses Bayes' Theorem, under the assumption that each feature is independent of the others. Consider a test set T with attributes (features) a1, a2, ..., an:

T = {a1, a2, ..., an}, and a set of labels L = {Spam, Legitimate}.

Then,

P(L | a1, a2, ..., an) ∝ P(L) × ∏ P(ai | L), for i = 1 to n.

Whichever label has the higher probability is the label of that particular test set T.

Requirements for the implementation are a training set and a test set. The most important thing for developing an efficient classifier is to construct a good training set: the success of the classifier lies in the quality of the training set, and an inefficient training set will lead to a classifier with low accuracy.

IV. EXPERIMENT

Our experiment uses a dataset crawled from Twitter.

A. Training Set

The training set was obtained from the 10 most recent tweets of 100 users. Six features were used for classification: number of @ mentions, number of unique @ mentions, number of # tags, number of unique # tags, number of URLs and number of unique URLs.

The training set has 100 instances with 6 features and a label, i.e. the set contains 100 rows and 7 columns. This set is read into the program as a CSV file.

B. Test Set

The test set contains the same 6 features, but without a label; the aim is to find the label. It is read as a text file.

If the Naïve-Bayes algorithm were applied first, then, since a machine learning algorithm depends solely on the accuracy of the training set, we could expect an error rate of 2-10%, and it is not wise to get errors at the early stage itself. Comparison with the URL blacklist, on the other hand, predicts spam more accurately, with very little expected error, and NLP is also a strong method for accurate spam classification. So the order (URL analysis, Natural Language Processing and then Machine Learning) is significant in this application.

V. RESULTS

A confusion matrix is drawn as follows:

                    Classified as Spam    Classified as Legitimate
Actual Spam                 a                        b
Actual Legitimate           c                        d

where a, b, c and d are the number of spam users classified as spam, spam classified as legitimate, legitimate classified as spam and legitimate classified as legitimate, respectively.

The accuracy, true positive rate and false positive rate are calculated as follows:

Accuracy = (a + d) / (a + b + c + d)
True Positive = a / (a + b)
False Positive = c / (c + d)

The results obtained using Naïve-Bayes and SVM are given below:

              Accuracy    True Positive    False Positive
Naïve-Bayes     94%           0.9               0.03
SVM             92%           0.875             0.05

VI. CONCLUSION

In this paper we have proposed an integrated approach for the classification of a Twitter user as spam or legitimate. The combined approach, which includes URL analysis, Natural Language Processing and Machine Learning techniques, could successfully do the classification, and it gives more accuracy than each of these methods applied alone. We have also identified different sets of expressions, tweets, words and other features which can show whether a user is spam or legitimate. The integrated approach is found to be more accurate than machine learning used alone.

As future work, we plan to make this application work in real time. We also plan to increase the dataset used for training as well as testing; this helps in checking whether we get more accuracy. Crowdsourcing can also be integrated into this work. In this paper, we showed that spam on social networks is a problem. For our study, we created a population of 900 honey-profiles on three major social networks and observed the traffic they received. We then developed techniques to identify single spam bots, as well as large-scale campaigns. We also showed how our techniques help to detect spam profiles even when they do not contact a honey-profile. We believe that these techniques can help social networks to improve their security and detect malicious users.

REFERENCES

[2] Uncovering Social Spammers: Social Honeypots + Machine Learning. Kyumin Lee. 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval.

[3] The Impact of Natural Language Processing Based Textual Analysis of Social Media Interaction on Decision Making. Keri Larson and Richard T. Watson. Proceedings of the 21st European Conference on Information Systems.

[4] Detecting Malicious Tweets in Trending Topics Using a Statistical Analysis of Language. Juan Martinez-Romo and Lourdes Araujo. Expert Systems with Applications: An International Journal.

[5] Sentiment Analyzer: Extracting Sentiments about a Given Topic Using Natural Language Processing Techniques. Jeonghee Yi (IBM Almaden Research Center, San Jose, CA, USA), T. Nasukawa, R. Bunescu and W. Niblack. Third IEEE International Conference on Data Mining (ICDM 2003).

[6] Information Assurance: Detection of Web Spam Attacks in Social Media. Pang-Ning Tan, Feilong Chen and Anil K. Jain. Proceedings of the 27th Army Science Conference, Orlando, Florida (2010).

[7] Design and Evaluation of a Real-Time URL Spam Filtering Service. Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson and Dawn Song. Proceedings of the IEEE Symposium on Security and Privacy.

[8] Detecting Spammers on Twitter. Fabrício Benevenuto, Gabriel Magno, Tiago Rodrigues and Virgílio Almeida. In Anti-Abuse and Spam Conference (CEAS), July 2010.

[9] Seven Months with the Devils: A Long-Term Study of Content Polluters on Twitter. Kyumin Lee, Brian David Eoff and James Caverlee. In Fifth International AAAI Conference on Weblogs and Social Media, July 2011.

[10] Detecting Spammers on Social Networks. Gianluca Stringhini, Christopher Kruegel and Giovanni Vigna. Annual Computer Security Applications Conference, 2010.

[11] Design and Evaluation of a Real-Time URL Spam Filtering Service. Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson and Dawn Song.

[12] Towards Online Spam Filtering. Hongyu Gao et al. NDSS Symposium, 2012.