International Journal of Advances in Electronics and Computer Science, ISSN: 2393-2835 Volume-4, Issue-11, Nov.-2017 http://iraj.in SPAM DETECTION APPROACHES AND STRATEGIES: A PHENOMENON

1BALOGUN ABIODUN KAMORU, 2ALI HASSAN MATLAB, 3AZMI JAAFAR

Faculty of Computer Science and Information Technology Department of Information System and Software Engineering Universiti Putra Malaysia, Serdang 43300, Selangor Malaysia. Faculty of Computer System & Software Engineering University Malaysia Pahang, 26300 Gambang, Kuantan, Pahang Darul Makmur, Malaysia E-mail: [email protected], [email protected], [email protected]

Abstract- The massive increase of spam is posing a very dangerous and serious threat to our email and social networks. It is pertinent and imperative to step up the spam detection approach and strategies in email and social networks. In recent years online spam has become a major problem for the sustainability of the internet globally, Excessive amounts of spam are not only reducing the quality of information available on the email and social networks but also creating concern among the users of email and various social networks. This paper aims to analyze existing research works in spam detection strategies and approaches, state of art, the phenomenon of spam detection, to explore the rudiment of spam detection, to proposed detection scheme and potential online mitigation schemes. The paper will surveys various anti spam strategies for email and social networking. In the literature we have studied that many anti spam strategies have been discovered and work on but they are still open challenges to these different approaches and techniques, while some of them are highlighted in this articles. It is very important to work on spam detection and reposition it for the better of the world. General Terms Spam detection, Security, Protection

Keywords- Spam; Anti-spam strategies; social networks, state of art, Phenomenon, Unsolicited Commercial email (UCE), Unsolicited Bulk email (UBE)

I. INTRODUCTION networking spam; (6) Blogging and live stream spam[9]. Spam is Unsolicited Commercial Email (UCE), is not Various legal means of anti-spam attempts have been a new problem causing complaints from many work on by previous research[10] [11]. Legislation internet users globally. is the act of specifically targeted at as well as sending unsolicited commercial email, involves the unwanted messages in general have been introduced sending of nearly identical emails to thousand or even in some countries, such as the United State of millions of recipients without the recipients’ prior America. Before targeted legislations are introduced, consent or even violates recipients explicit refusal some existing laws are sought for fighting spam. [1] [2] [3]. Internet is used on a daily basis to search Possible approaches and techniques are based on laws for information and acquire knowledge [4]. Spam is and statutes that combat fraud antiracketeering and increasingly being used to distribute virus, , anti harassment. These approaches are considered links to websites, etc. The problem of spam ineffective as they require considerable costs and is not only threat but also annoyance that has become efforts for the prosecutor to prove the relevance a dangerous phenomenon to our existence. between the spam messages and the law. Another Unsolicited Bulk email (UBE) is another category of challenging problem to the legal approach is the emails that can be considered spam.. As suggested in limited jurisdiction of the law concerned. Also, many recent reports by Spamhaus and Symantec [7][8]. legislator are forced to leave loopholes in the There is an increase trend for both UCE and UBE. legislations to avoid infringing the freedom of speech For instance , Symantec has detected 44% increase in [10]. These often allow spammers to slip through and phishing attempts from the first half of 2016 to the the restriction merely becomes a burden to legitimate second half. Statistics from the Distributed Check senders. sum Clearinghouse (DCC) Project [6] shows that To reduce or mitigate spams, various anti spam 54% of the email messages checked by the DCC methods have been proposed in state-of-the-art network in 2016 are likely to be from bulk email. research[2] Heymann et al, classified anti-spam According to MX logic [5] shows that an average strategies into three categories: (i) Prevention Based; 80.78% of the email messages delivered to their (ii) Detection Based; (iii) Demotion Based. There are clients during week of March 24-30,2016 are various anti-spam strategies as content based, link considered to be spam email, with peaks more than based, graph analysis, clocking scheme but in spite of 90%. having various anti-spam strategies there are various There are six main forms of spam, and they have open challenges to these anti-spam strategies and different effects on End users globally like: (1) E- approaches, phenomenon which need to be addressed. mail spam; (2) Comment spam; (3) Instant Messenger To prevent users from being overwhelmed by spam, Spam;(4) Unsolicited text messages; (5) social many internet service providers (ISP) and

Spam Detection Approaches and Strategies: A Phenomenon

40 International Journal of Advances in Electronics and Computer Science, ISSN: 2393-2835 Volume-4, Issue-11, Nov.-2017 http://iraj.in organizations deploy spam filters at the email server email mainly focuses on the following : (1) Anomaly level. The family of Naive Bayes (NB) classifiers detection;(2) Fault detection; (3) [13] is probably one of the most commonly detection;(4) Intrusion detection. If a considerable implemented, which is also embedded in many effort is not made to find a technological solution to popular email and social networks clients. They the menace of spam. The internet email and social extract keywords and other indicators from email email is in danger as an important medium of messages and determine whether the messages are communication. spam using some statistical or heuristics scheme. Ahmed and Abulaish,2013 [15] present that However, spam senders (spammers) nowadays are spammers are trying to a new approach to gain access using increasingly sophisticated techniques and through social media and email. While most of the approaches to trick content based filters by clever previous work on social spam and email spam has manipulation of the spam content [14]. for analysis. focused on spam prevention on a single email or Also , words with scrambled character order can social spam. Social spam and email spam is relatively render vocabulary-based detection techniques new research area and the literature is still sparse ineffective, yet human can still understand the [4][5][7]. A large number of social spam and email scrambled words. As a consequence, content-based spam classifiers have been used in spam detection but filters are becoming less effective and hence other choosing the right classifier and the most efficient approaches are being explored to complement them, combination of them is still problem with previous One popular approach is based on blacklists and scholar work. Although there are still limited studies whitelists. A blacklist is a list of senders whose on spam detection. emails are blocked from getting through to the Taylor[16] discussed the domain reputation system recipients. A whitelist is just the exact opposite. deployed in Google’s Gmail System, the reputation While a blacklist specifies who is to be kept out maintains the reputation for each domain that sends to allowing all others to pass, a whitelist only allows email to Gmail. This reputations are calculated based those who are already on the list to get through . on previous results from statistical filters and user Since Spammers almost always spoof the “ From” feedback. If the reputation of a domain is good, the field of spam messages, blacklists usually keep IP domain will be white listed and the reverse will be addresses rather than email addresses. For incoming blacklisted. The emails from senders in neither lists messages from senders not on the lists, content-based are further processed with statistical anti-spam filters filters may be applied so that the approaches can for making the final decision. Email classification complement each other. results are logged as auto spam or auto non-spam In this paper, we propose to analyze existing research events. Users can send feedback to the system by works in spam detection strategies and approaches, clicking on a button in the webmail interface for state of art, the phenomenon of spam detection, to reporting misclassification. These events are also explore the rudiment of spam detection. The paper logged and used during the next update reputations. will surveys various anti spam strategies for email Taylor also discussed the problem of spoofed source and social networking. In the literature we have addresses which can affect sender-based detection studied that many anti spam strategies have been systems. The sender policy framework (SPF) and discovered and work on but they are still open Domain-based (Domain Keys) challenges to these different approaches and mechanisms are used to authenticate whether an techniques, to look into the online mitigation methods email is really sent from the domain that it claims to . be from. Besides reputation systems, heuristics-based The rest of this paper is organized as follows; section approaches have also been explored. 2 review various form of related work on spam Harris proposed a heuristic method called Greylisting detection method in email and social networks, [13] to avoid receiving spam at the recipient’s email section 3 describes the anti -spam strategies and need transfer agent (MTA). when a recipient MTA that for spam detection addressed by this paper. Section 4 uses Grey listing receives a delivery attempt, the details the spam detection techniques and section 5 MTA will respond with an SMTP temporary error explores the possible challenges and spam mitigation messages. The recipient MTA will record the identity strategies and to proposed detection scheme. Section of the recent attempts of delivery so that the next 6 present the conclusion and final part of this paper attempt will be accepted. Legitimate senders that is section 7 which is our reference. conform to the standard will have their message delivered. Whereas spammers, who concern more II. RELATED WORK ON SPAM DETECTION about coding simplicity and speed of the spamming IN EMAIL AND SOCIAL NETWORKS engine, ignore any error message and move on to the next recipient in the list instead of retrying. Thus, 2.1 RELATED WORK Spam can be avoided. A Spam detection method tries to determine whether Structural features in email social networks may also a sender is a spammer or legitimate sender. Balogun be exploited for sender-based spam detection. Gomes et al,2017[14] Spam detection on social networks and et al presented a graphic-theoretic analysis of email

Spam Detection Approaches and Strategies: A Phenomenon

41 International Journal of Advances in Electronics and Computer Science, ISSN: 2393-2835 Volume-4, Issue-11, Nov.-2017 http://iraj.in traffic and proposed several features that can serve to through interfaces (Such as CAPTCHA which stands distinguish between spam and legitimate email. for “completely automated public turing test to tell Although they did not present any spam detection computers and human parts”) or through usage limits study in this paper, the features proposed can be used (such as tagging quot e.g Flickr introduced a limit of for spam detection. In particular. 75 tags per photo. Boykin and Roychowdhury [3] proposed an  Detection Based: Detection based automated anti-spam tool that exploits the properties approaches identify likely spams either manually or of a , the nodes in the network are automatically by making use of machine learning clustered to form spam and non-spam component (such as text classification) or statistical analysis ( automatically. such as link analysis) and then deleting the spam content or visibly marking hidden to the user. For III. ANTI-SPAM STRATEGIES AND NEED these methods, we can treat the corpus as set of FOR SPAM DETECTION objects with associated attributes. In e-mail spam, the messages ar objects and the headers are attributes. In 3.1 ANTI-SPAM STRATEGIES web spam, the web pages are objects and attributes Anti-spam strategies are part of the strategies to might be inlinks, outlinks, page content and various mitigate spam in email and social spam, the following meta data. are anti-spam:(i) Prevention Based (ii) Detection  Demotion Based: The approach reduces Based (iii) Demotion Based. the prominence of content likely to be spam. For  Prevention Based: This approach aims at instance rank based methods produce ordering of a making it difficult for spam content to contribute to system’s , tags or users based on trust score. social tagging system by restricting certain access

Figure 1: Anti-Spam Strategies

3.2 NEED FOR SPAM DETECTION 1. Content Based Spam is a threat to the users of internet globally. Due 2. Link Based to the challenges for the service providers because of 3. Algorithms that exploits click stream the following negative effects are [2]: 4. Semantic based spam detection  Spam deteriorates the quality of search 4.1.1. Content Based: results and deprive legitimate websites of revenue. Spam detection techniques which analyze content  Spam have economic impact since highest features such as word count or language models and ranking provides large free advertising and so an content duplication. Fetterly et al proposed that web increase in web traffic volume. spam pages exhibit some anomalous properties as :  User trust is weaken due to the search engine (1) URL of spam pages have exceptional number of provider which is especially tangible issue to zero dots, dashes, digits and length, (2) Most spam pages cost of switching from one search provider to that resides on the same host have very low word another. count variance,(3) Content of spam pages changes  Spam websites are means of malware and very rapidly.[5] Features based on HTML page adult content dissemination and phishing attack. structure to detect script generated spam pages. In  Identification of the most appropriate tags this preprocessing is made by removing all the for the given content and to eliminate the spam tag. content and considering only layout of the page. They applied finger printing technique with subsequent IV. SPAM DETECTION TECHNIQUES AND clustering to find groups of structurally near spam APPROACHES pages [5]. Mishne et al proposed a line of work on language modelling for spam detection. They 4.1 SPAM DETECTION TECHNIQUES proposed an approach of spam detection in blogs by The spam detection techniques can be categorized comparing the language models for blog comments into 4: and page linked with these comments. They use KL

Spam Detection Approaches and Strategies: A Phenomenon

42 International Journal of Advances in Electronics and Computer Science, ISSN: 2393-2835 Volume-4, Issue-11, Nov.-2017 http://iraj.in divergence as a measure of discrepancy [7]. in other In literature review we have studied that many anti- work by Sydow lingusitic features were analyzed for spam strategies have been discovered but still there web spam detection by considering lexical validity are some possible challenges to these techniques. and content diversity, syntactical diversity and Some of them highlighted below: entropy, usage of active and passive voices and  It is observed that interaction across social various other NLP feature [8]. network become popular. For e.g. users can use their account or email account to log in some 4.1.2. Link Based: other social network services. Thus future challenges The link based approach analyze link based is to investigate how trust model across domains can information such as neighbor graph connectivity. be effectively connected and shared. Based on identification of suspicious nodes and links  However social network services are used by and their subsequent down weighting. Extracting link people from various countries , so various languages based features for each node and use various machine simultaneously appears in tags and comment. In such learning algorithm to detect spam. Graph cases some text information may be regarded as regularization technique for spam detection. In this wrong or considered as spam due to language spam. link information is used to compute global Therefore incorporating multilingual in trust importance scores for all pages pi to page pj. modelling would solve this problem. Algorithm follows repeated improvement principle i.e  Most of the existing approaches based on the true score is computed as convergence point of an text information assuming monolingual environment. iterative updating process [5]. Algorithms belonging  Trust modelling most of the current to this category represent pages as feature vectors techniques for noise and spam reduction focus only and perform standard classification or clustering on textual tag processing and user profile analysis analysis. Studies link-based feature to perform while audio and visual content features of website categorization based on their functionality, multimedia content can also provide useful their assumption is that sites sharing similar structural information about the relevance of the content tag pattern, such as average page level or number of out relation. links per leaf page, share similar roles on web. For  In trust modelling system user’s trust tends e.g web directories mostly consists of pages while to vary over time according to the users’ experience spam site have specific topology aimed to optimize and involvement of social networks. Only a few page Rank boost and demonstrate high content approaches deals with the dynamics of trust by duplication. Overall, eacg website is represented as a distinguishing between recent and old tags. Future vector of 16 connectivity and a clustering is work considering dynamics of trust would lead to performed using cosine as a similarity measure [5]. better modelling in real world application. 4.1.3. Algorithms that exploits click stream: Data and user behaviour of data , query popularity 5.2. SPAM MITIGATION STRATEGIES data or information and HTTP session information, SCHEME since click spam aims to push “ Malicious noise” into A detection scheme needs a mitigation strategy to a query log with the intention to corrupt data, used for react to spam in e-mail and social networks. There are the ranking function construction, most of the counter more than one way to use the legitimacy avenue methods study the ways to make learning algorithm provided by the social network based detection robust to this noise. Other anti-click- fraud methods strategy scheme to mitigate spam. One of the more are driven by the analysis of the economic factors straight forward ways is to apply a threshold to the underlying the spammers ecosystem. Interesting idea score which email from the sender will be filtered. to prevent click spam is proposed.. The author While this approach is simple, we observe that it is suggests using personalized ranking functions, as unlikely that social network based detection alone is being more robust, to prevent click fraud accurate enough for the purpose. Also, existing manipulation[3]. content-based schemes and rule-based schemes are still performing reasonably well. We prefer to use the 4.1.4. Semantic Based Spam detection: social network based detection strategy scheme to To overcome the drawback of content based spam complement rather than replace content-based detection, semantic based detection method is used filtering techniques. where instead of content semantic of the web site is Different ways of combining filters have been analyzed. explored in the literature. Segal et al.[9] proposed to form a pipeline of anti-spam filter components. An e- V. POSSIBLE CHALLENGES, SPAM mail passes through each component in the pipeline MITIGATION STRATEGIES AND PROPOSED one by one. Each component assigns a score to the SPAM DETECTION SCHEME email. An e-mail can be directly classified by an intermediate classifier and skip all subsequent 5.1. POSSIBLE CHALLENGES IN SPAM components if the classifier determined the DETECTION TECHNIQUES classification of the email with high confidence.

Spam Detection Approaches and Strategies: A Phenomenon

43 International Journal of Advances in Electronics and Computer Science, ISSN: 2393-2835 Volume-4, Issue-11, Nov.-2017 http://iraj.in Lynam and Cormack [14] explored different ways of content-based filter. Senders with a score lower than combining anti-spam filters. Specifically, the voting Ts will be considered spammers. Their emails can be of binary classifications from spam classifiers, the rejected at this stage or flagged as spam directly. log-dds averaging of spam cores, the use of SVM on Email from senders with a score in between the two spam scores from different spam filters and the use of thresholds, i.e., Ts< FiTs on this score will be defined. while others are slowed down. One of the way to use Emails from senders with the legitimate score above an exponential decay function on the legitimate score Ti will be accepted directly to the inbox, skipping the to determine the extent of damping. The delay

Spam Detection Approaches and Strategies: A Phenomenon

44 International Journal of Advances in Electronics and Computer Science, ISSN: 2393-2835 Volume-4, Issue-11, Nov.-2017 http://iraj.in imposed by the server grows exponentially with the issues which has to be addressed and left as open decrease of the legitimate score. challenge for research. For future research it could be combine multi media content analysis with 5.3. PROPOSED SPAM DETECTION SCHEME conventional tag processing and user profile analysis. figure below is an overview of the proposed solution for detecting spam senders. Email and Social REFERENCES networks are first constructed from email transaction logs. A social network can be represented by a direct [1] [1] A. Beygelzimer, S.Kakade, and J. Langford; " Cover graph where senders are represented as nodes and trees for nearest neighbor" In ICML '16. Proceeding of the 23rd International Conference on Machine Learning. ACM email transactions are represented as edges. After the press.2016 feature extraction and preprocessing stages, a [2] [2] Paul Heyman; Georgia Koutrika., " Fighting Spam on machine learning method, such as k-Nearest Social Websites". 2007 IEEE, INTERNET Neighbor (k-NN) classifier, can be used for the COMPUTING,ISSN: 1089-7801/07 [3] [3] P.O.Boykin and V.P. Roychowdhury;" Leveraging Social classification task. Some postprocessing on the Networks to fight spam" Computer,38:61-68, April,2005. classifier output may yield results that are more [4] [4] Nikita Spirin; Jiawei Han;. " Survey on Web Spam versatile. The proposed spam detection scheme will Detection: Principles and Algorithms". Proceeding of ACM give a permanent solution to the possible challenges SIGKDD Exploration newsletter, Vol.13 Issue 2. December 2011. pose by the spam on email and social network. [5] [5] Chen, Y.S., Hung. Y.P., Yen, T.F., and Fuh.S.H., “ Fast and versatile algorithm for nearest neighbor search based on a lower bound tree.” Pattern Recogn., 40 (2): 360-375, 2007.

[6] [6] Krishnan, Vijay., Rashmi, Raj., “Web spam Detection with Anti-Trust Rank”. Airweb.cse.lehigh.edu/2006/proceeding.pdf.

[7] [7] Goel, W.B., & Davison, B.D.,” Topical Trust Rank Using topically to combat web spam.” WWW.2006. [8] [8] P. A. Chirita, J. Diederich, and W. Nejdl. Mailrank: usingranking for spam detection. In CIKM ’05: Proceedings of the14th ACM international conference on Information andknowledge management, pages 373–380, New York, NY,USA, 2005. ACM Press. [9] [9] R. Segal, J. Crawford, J. Kephart, and B. Leiba. Spamguru:An enterprise anti-spam filtering system. In First Conferenceon Email and Anti-Spam CEAS 2004, Jul. 2004 [10] [10] J. Leyden. The economics of spam, Nov. 2003. Retrieved:Jun. 2006 [11] [11] Federal Trade Commission. Email address harvesting: How spammers reap what you sow, Nov. 2002. Retrieved: Mar.2006http://www.ftc.gov/bcp/conline/pubs/alerts/spamalrt .pdf. [12] [12] B. Taylor. Sender reputation in a large webmail service. CONCLUSION InThird Conference on Email and Anti-Spam CEAS 2006, Jul.2006. http://www.ceas.cc/2006/19.pdf. In this paper, we have explored the new direction for [13] [13] E. Harris. The next step in the spam control war: a proposed scheme of spam detection, we have been Greylisting, Aug. 2003. Retrieved: Aug. 2006 http://projects.puremagic.com/greylisting/whitepaper.html. able to explore the possible challenges on spam [14] [14] T. R. Lynam, G. V. Cormack, and D. R. Cheriton. On- detection, spam detection approaches and spam line spam filter fusion. In SIGIR ’06: Proceedings of the detection techniques. Fron the above study we have 29thannual international ACM SIGIR conference on studied various spam detection approaches and Research and development in information retrieval, pages techniques and explored the open challenges and 123–130, New York, NY, USA, 2006. ACM Press.



Spam Detection Approaches and Strategies: A Phenomenon

45