NLSDF for BOOSTING the RECITAL of WEB SPAMDEXING CLASSIFICATION DOI: 10.21917/Ijsc.2016.0183

NLSDF for BOOSTING the RECITAL of WEB SPAMDEXING CLASSIFICATION DOI: 10.21917/Ijsc.2016.0183

S K JAYANTHI AND S SASIKALA: NLSDF FOR BOOSTING THE RECITAL OF WEB SPAMDEXING CLASSIFICATION DOI: 10.21917/ijsc.2016.0183 NLSDF FOR BOOSTING THE RECITAL OF WEB SPAMDEXING CLASSIFICATION S.K. Jayanthi1 and S. Sasikala2 1Department of Computer Science, Vellalar College for Women, India E-mail: [email protected] 2Department of Computer Science, Hindusthan College of Arts and Science, India E-mail: [email protected] Abstract in trending, what type of users seeks which type of information Spamdexing is the art of black hat SEO. Features which are more and class of targeted audience for specific search engines. influential for high rank and visibility are manipulated for the SEO SEO is a task of manipulating content and hyperlinks of a task. The motivation behind the work is utilizing the state of the art Website. SEO techniques can be classified into two broad Website optimization features to enhance the performance of categories: white hat SEO, and black hat SEO. If a Website is spamdexing detection. Features which play a focal role in current SEO strategies show a significant deviation for spam and non-spam samples. subject to white hat SEO it results in genuine rank and visibility This paper proposes 44 features named as NLSDF (New Link in SERP. In contrast, if it is subject to black hat SEO it gives Spamdexing Detection Features). Social media creates an impact in higher than deserved rank and visibility in SERP. The search search engine results ranking. Features pertaining to the social media engines attempt to minimize the effect of the latter called were incorporated with the NLSDF features to boost the recital of the spamdexing. Gyongyi and Garcia-Molina (2004) [2] discusses spamdexing classification. The NLSDF features with 44 attributes about the SEO in the web spam arena. The activity of some SEOs along with 5 social media features boost the classification performance benefits the whole web community, as they help authors create of the WEBSPAM-UK 2007 dataset. The one tailed paired t-test with well-structured, high-quality pages. However, most SEOs engage 95% confidence, performed on the AUC values of the learning models in practices that are called spamming. shows significance of the NLSDF. For instance, there are optimizers who define spamming exclusively as increasing relevance for queries not related to the Keywords: topics of the page. These SEOs endorse and practice techniques Web Spam, Search Engine, SVM, Decision Table, HMM that have an impact on importance scores, to achieve what they call ethical web page positioning or optimization. According to 1. INTRODUCTION them, all types of actions intended to boost ranking (either relevance, or importance, or both), without improving the true The inevitable role of the search engines are proved day by value of a page, are considered spamming. day whenever every user seeks for a piece of the information from WEBSPAM-UK 2007 dataset is the largest compiled the Web. Digital content flourishes at a lightning speed every collection with class labels. It is the most prominent dataset second. The challenge of every content provider in Web is to get available in public to study the characteristics of the spamdexing, higher visibility in SERP for their domain. A survey report [1] when this research has been started on 2010. Since the strategies indicates the importance of the results in first page. 91.5% of the in spamdexing is getting changed day by day, the pace of traffic in Google is flooded to the first SERP. When moving from spamming changes more frequently. It is decided to collect page one to two, the traffic dropped by 95%, and by 78% and 58% features pertaining to SEO for selected samples in the dataset. for the subsequent pages. Samples could be used to elucidate the structural traits and They also reveal a study for traffic share among the Websites temporal amends of the spamdexing phenomenally. based on their results position using a dataset drawn from May st th Selecting the influential SEO features is the next task which 21 to May 27 2013. A website with the first position in the involves an analysis in four renowned SEO sources [3]: SEOmoz, search results contributed to 33% of the traffic, compared to 18% Searchengineland, Search engine journal and SEObook. Page- for the second position. The data also shows that the first position specific and domain-specific link based metrics play a vital role on any page of results contributed to more traffic than the second in optimizing a Website. position in each respective page (i.e. traffic from users dropped by 27%, 11.3%, and 5.4% from the first position to second The page specific metrics considered as important in SEO are: position in page two, three, and four). The traffic dropped by keyword-focused anchor text from external links, external link 140% going from 10th to 11thposition and 86% going from 20th to popularity, diversity of link sources (links from many unique root 21st position. The drop in cumulative traffic moving from one domains), page-specific TrustRank, link popularity, focus of page to another was even more significant. external link sources, keyword-focused anchor text from internal links, location in information architecture of the site, internal link Getting higher visibility would lead to many potential popularity. benefits. SEO task gains attraction at this point. SEO is the process of improving the visibility of a website or a web page in The site specific metrics which are of high importance in SEO a search engine result page (SERP). SEO analyses the working are: trustworthiness of the domain based on link distance from method of ranking algorithm in search engines, factors trusted domains (TrustRank, Domain mozTrust), global link concentrated while ranking and influential features of Website popularity of the domain (PageRank on the domain graph, which improves rank for specific ranking algorithm. Along with Domain mozRank), link diversity of the domain (based on that they also focus on the user profiling; which content user seeks number of unique root domains linking to pages on this domain), 1324 ISSN: 2229-6959 (ONLINE) ICTACT JOURNAL ON SOFT COMPUTING, OCTOBER 2016, VOLUME: 07, ISSUE: 01 links from hubs/authorities in a given topic-specific dataset. Newly included features improve the performance of the neighborhood, temporal growth/shrinkage of links to the domain dataset by 3% accuracy. (the quantity/quality of links earned over time and the temporal Chung et al. [8] proposed new set of features including white distribution), links from domains with restricted access extensions score, spam score, relative trust; outgoing and incoming link (e.g. .edu, .gov, .mil, .ac.uk, etc.). related features, PageRank and hijacked score. It is clear that Domain Trust/Authority acts as the dominant Kariampor et al. [9] performs classification of web spam using factor in the success of rankings at Google. A link strategy that imperialist competitive algorithm and genetic algorithm. positions a site as an authority, or a hub, in a web community is a Imperialist competitive algorithm is a novel optimization powerful way to get attention of search engines [4]. According to algorithm that is inspired by socio-political process of a correlation study done by SEOmoz, a famous SEO concern, link imperialism in the real world. Experiments are carried out on metrics which has significance in visibility and visitability to a WEBSPAM-UK2007 data set, which show feature selection Website are: Domain authority, Page Authority, mozRank, improves classification accuracy, and imperialist competitive mozTrust and link counts (external and internal) [3]. algorithm outperforms GA. MozRank and PageRank scores are being used to show value Geng et al. [10] used re-extracted features based on the host on sites for sale in recent times. Page/Domain Authority level link graph and the predicted spamicity, clustering, represent SEOmoz’s best ranking predictions and propagation and neighbor details and used WEBSPAM-UK 2006 uses MozRank and MozTrust in their calculations. MozRank is a dataset as a base. They use bagging, a famous meta-learning part of Page/Domain Authority mix which gives high granularity algorithm with c4.5. in spamdexing detection. Shen et al. [11] propose method for web spam detection, using Cervino and Malae [5] conducted cyber metric analysis to genetic programming, from existing link-based features and use analyze the correlation between applied metric indicators and them as the inputs to support vector machine and genetic accessibility problems in academic websites. They used the programming classifiers. According to the authors, the classifiers mozRank and domain authority as their focal indicators for that use the new features achieve better results compared with the determining the visibility for a Website. After analyzing the SEO features provided in the original database. sources available in Web, a set of features were demarcated. For Jayanthi and Sasikala used genetic algorithm for Web spam naming convention, the additional features are collected, classification [12]. Later utilized the Reptree based classifier for computed and compiled in the name NLSDF (New Link Spam the same [13]. Naive bayes classifier is also proposed by the Detection Features). authors for the problem [14]. They also applied the Artificial This paper is organized as follows: Section 2 discusses the Immune Recognition System based classifiers for the web spam related work in this problem. Section 3 describes 27 new features problem. They proved that AIRS1 and AIRS2 Parallel are two used for this work along with 17 computed features. Section 4 methods which give best results when compared with pioneered gives a brief about the suite of the classifiers used in this paper. literature [15]. Parameter settings of the classifiers are also briefed. Section 5 briefs the evaluation metrics and presents the results.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    8 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us