An Improved Framework for Content-Based Spamdexing Detection
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 1, 2020

Asim Shahzad1, Hairulnizam Mahdin*2, Nazri Mohd Nawi3
Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia
*Corresponding Author

Abstract—To modern Search Engines (SEs), one of the biggest threats is spamdexing. Spammers now use a wide range of content-generation techniques, filling Search Engine Result Pages (SERPs) with low-quality web pages through content spam. Spam web pages generally give users insufficient, irrelevant, or improper results. Many researchers from academia and industry are working on spamdexing to identify spam web pages; however, no single universally efficient method has yet been developed that identifies all of them. We believe that tackling content spam requires improved methods. This article is an attempt in that direction: a framework is proposed for identifying spam web pages. The framework uses stop words, keyword density, a spam keywords database, part-of-speech (POS) ratio, and copied-content algorithms. The WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets were used to conduct the experiments and obtain threshold values. A promising F-measure of 77.38% illustrates the effectiveness and applicability of the proposed method.

Keywords—Information retrieval; web spam detection; content spam; POS ratio; search spam; keyword stuffing; machine-generated content detection

I. INTRODUCTION

Spamdexing, or web spam, is described as "an intentional act that is intended to trigger illegally favorable importance or relevance for some page, considering the web page's true significance" [1]. Studies by different researchers in the area show that at least twenty percent of hosts on the Web are spam [2]. Spamdexing is widely recognized as one of the most significant challenges to web SEs [3]: it heavily reduces the quality of a search engine's results, and many users are annoyed when they search for certain content and end up with irrelevant pages because of web spam. Due to the unprecedented growth of information on the World Wide Web (WWW), the amount of available textual data has become enormous for any end user. According to the most recent survey by WorldWideWebSize (https://www.worldwidewebsize.com/), the web consists of 5.39 billion pages. Thousands of pages are added to the web corpus every day, and many of them are spam or duplicate pages [3]. Web spammers take advantage of internet users by dragging them to their web pages with a variety of clever and creative spamming methods. The purpose of building a spam web page is to deceive the SE into delivering search results that are irrelevant and not beneficial to the web user; the ultimate aim of web spammers is to increase their pages' ranks in SERPs. Beyond that, spamdexing also has an economic impact, because higher-ranked pages attract large volumes of free advertising and web traffic. Over the past few years, researchers have worked hard to develop new, advanced techniques for identifying fraudulent web pages, but spam techniques are evolving as well, and web spammers come up with new spamming methods every day [4]. Research in web spam detection and prevention has become an arms race against an opponent who consistently practices more advanced techniques [4]. If one could recognize and eliminate all spam web pages, it would be possible to build an efficient and robust Information Retrieval System (IRS). Efficient SEs are needed that produce promising, high-quality results for the user's search query; the retrieved pages must then be ranked by the content or semantic similarity between each page and the query before being presented to the user. Spamdexing has many adverse effects on both the search engine and the end user [5]. Spam web pages waste not only time but also storage space: because the SE must store and index a huge number of web pages, more storage is required, and when the SE answers a user's query it must search this enlarged corpus, so more time is required as well. This diminishes the effectiveness of the SE and reduces the end user's trust in it [6].

To withstand web spam attacks, improvement in anti-web-spam methods is essential. Any technique used to obtain an undeservedly high rank is a form of spamdexing or web spam. Generally, there are three types of spamdexing: content spam, link spam, and cloaking. Cloaking is a spamdexing method in which the content offered to the end user differs from that presented to the SE spider [7]. The most common types, however, are content spam and link spam; content spam is the one studied in this research work. Davison [8] defined link spam as "the connections among various web pages that are present for a purpose other than merit." In link spam, web spammers build link structures to exploit link-based ranking algorithms such as PageRank, which assigns a higher rank to a web page when other highly ranked pages point to it with backlinks.
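To make the link-spam incentive concrete, the following is a minimal Python sketch of the PageRank idea: rank flows from a page to the pages it links to, so a farm of manufactured backlinks inflates a target page's score. The graph, damping factor, and function names here are illustrative assumptions for this sketch, not details taken from the paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a baseline share, then receives rank from in-links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# A tiny hypothetical "link farm": a, b, and c exist only to boost "target".
farm = {"a": ["target"], "b": ["target"], "c": ["target"], "target": []}
print(pagerank(farm))  # "target" ends up with by far the highest score
```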
Content spam comprises all the methods by which web spammers change the logical view that an SE has of a web page's contents [1]; it is the most common type of spamdexing [9]. The technique is popular among web spammers because several SEs rank pages using information retrieval models (IRMs) applied to page content, for instance the statistical language model [10], the vector space model [11], BM25 [12], and the probabilistic model. Spammers try to exploit the vulnerabilities of these models to manipulate the content of a target page [13]. Typical methods for earning a higher position in the SERPs include repeating important keywords on the target page to inflate their frequencies, copying content from various good web pages, placing machine-generated content on the target page, and filling the page with dictionary words whose text color matches the background, so the words are invisible to users but visible to SE spiders [14]. Based on the structure of a page, content spam can generally be divided into five subtypes: body spam, anchor-text spam, meta-tag spam, URL spam, and title spam. A number of spamming methods target the different algorithms used in SEs [15].

The focus of this article is content spam detection: a framework is proposed to detect spam web pages using content-based techniques. In the proposed approach, stop-word, keyword-density, spam-keywords-database, part-of-speech (POS) ratio, and copied-content algorithms are used to detect spam web pages, as sketched below. For the experiments, the WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets are used. The experimental results, with an encouraging F-measure, demonstrate the applicability and effectiveness of the proposed improved framework compared with existing techniques.
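As a rough illustration of the content-based signals the framework names, the sketch below computes a stop-word ratio, a keyword-density value, and a noun-based POS ratio with NLTK. The threshold values and function names are hypothetical placeholders: the paper derives its actual thresholds empirically from the WEBSPAM-UK2006/UK2007 datasets, and its spam-keywords-database and copied-content checks are separate components not shown here.

```python
# NLTK resource names may differ across versions (e.g. "punkt_tab" and
# "averaged_perceptron_tagger_eng" in newer releases).
import nltk
from collections import Counter

for resource in ("punkt", "averaged_perceptron_tagger", "stopwords"):
    nltk.download(resource, quiet=True)

STOP_WORDS = set(nltk.corpus.stopwords.words("english"))

def content_signals(text):
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
    if not tokens:
        return None
    counts = Counter(tokens)
    # Keyword density: share of the page taken by its most repeated term;
    # keyword stuffing drives this value up.
    keyword_density = counts.most_common(1)[0][1] / len(tokens)
    # Stop-word ratio: stitched or machine-generated text tends to deviate
    # from the ratio found in natural prose.
    stopword_ratio = sum(c for w, c in counts.items() if w in STOP_WORDS) / len(tokens)
    # POS ratio: proportion of nouns among all tokens; stuffed pages are
    # often abnormally noun-heavy.
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    noun_ratio = sum(tag.startswith("NN") for tag in tags) / len(tags)
    return keyword_density, stopword_ratio, noun_ratio

def looks_spammy(text, density_max=0.15, stop_min=0.20, noun_max=0.60):
    """Flag a page when any signal crosses its (illustrative) threshold."""
    signals = content_signals(text)
    if signals is None:
        return False
    density, stops, nouns = signals
    return density > density_max or stops < stop_min or nouns > noun_max

print(looks_spammy("cheap pills buy cheap pills now best cheap pills online"))
```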
II. LITERATURE REVIEW

In one study, spam web pages were classified with semi-supervised learning over labeled and unlabeled samples; to create new features and to reduce the TF-IDF content-based features, the authors used a combinatorial feature-fusion method. Empirical results prove the effectiveness of their method.

The most fundamental work on content-based spamdexing detection algorithms was done by Fetterly et al. [3], [21]–[23]. In [23] they suggest that spam web pages can be classified easily using statistical analysis. Because web spammers generally generate spam pages automatically, adopting phrase stitching and weaving methods [1], and because these pages are designed for SE spiders rather than real human visitors, the pages exhibit abnormal behavior. The researchers identified that the URLs of spam web pages contain a strikingly large number of dashes, dots, and digits, and that their length is also exceptional. During their research they found that of the 100 longest observed host names, 80 pointed to adult content and 11 referred to financial-credit-related web pages. They also observed that spam web pages duplicate themselves: pages hosted on the same host contain almost the same content, with very little variance in word count. Another fascinating finding was that spammers change the content of these spam pages very quickly. Tracking this content-change feature for specific hosts on a weekly basis, they found that 97.2% of the most active spam hosts could be identified from this one feature alone. They proposed several other features in that work; all of them can be found in the research article [23].

In another study [21], [22], they worked on content duplication and identified that larger clusters with identical content are, in fact, spam.
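The URL statistics reported above translate directly into simple features. Below is a sketch of how such features might be computed; the cut-off values are hypothetical illustrations, not thresholds from [23].

```python
from urllib.parse import urlparse

def url_features(url):
    """Counts of dashes, dots, and digits in the host name, plus its length."""
    host = urlparse(url).hostname or ""
    return {
        "host_length": len(host),
        "dashes": host.count("-"),
        "dots": host.count("."),
        "digits": sum(ch.isdigit() for ch in host),
    }

def suspicious_url(url, max_len=45, max_dashes=3, max_digits=5):
    f = url_features(url)
    return (f["host_length"] > max_len
            or f["dashes"] > max_dashes
            or f["digits"] > max_digits)

print(suspicious_url("http://cheap-pills-4u-buy-now-online-24-7.example.com/"))
```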
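A common way to operationalize this duplication finding is word-level shingling with Jaccard similarity, flagging large clusters of pages whose shingle sets nearly coincide. The generic sketch below is a stand-in for, not a reproduction of, the clustering procedure used in [21], [22].

```python
def shingles(text, k=4):
    """Set of overlapping k-word shingles from the page text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Two pages differing only in the final word share most of their shingles.
print(jaccard("buy cheap pills online with free shipping today",
              "buy cheap pills online with free shipping now"))  # ~0.67
```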