Web Spam, Propaganda and Trust
Panagiotis T. Metaxas
Wellesley College
Wellesley, MA 02481, USA
[email protected]

Joseph DeStefano
College of the Holy Cross
Worcester, MA 01610, USA
[email protected]

ABSTRACT

Web spamming, the practice of introducing artificial text and links into web pages to affect the results of searches, has been recognized as a major problem for search engines. It is also a serious problem for users, because they are not aware of it and tend to confuse trusting the search engine with trusting the results of a search [16]. The parallels between web spamming on the internet and propaganda in the real world suggest that we can use anti-propaganda techniques to educate users and to develop tools that help them evaluate the reliability of the information they find online.

In this paper, we first analyze the effects that web spam has on the evolution of search engines and its relationship to propagandistic techniques in society. Then, we examine the neighborhoods of untrustworthy sites, finding that a dense biconnected component (BCC) containing a site provides a reasonable trust neighborhood, one that has parallels in social network theory. The fact that spammers employ propagandistic techniques enables us to design a heuristic that follows anti-propagandistic practices in order to recognize a spamming network. In society, recognition of an untrustworthy message (in the opinion of a particular person or other social entity) is a reason for questioning the entities that recommend the message; entities that are found to strongly support untrustworthy messages become untrustworthy themselves. So, social distrust is propagated backwards for a number of steps. Our heuristic simulates this behavior on the trust neighborhood of a spammer.

In our experiments, we examined trust neighborhoods of web sites, both trustworthy and not. Our findings suggest that spamming networks can be reliably recognized from their relationship to a single untrustworthy starting point by backward propagation of distrust. Further, the nodes involved in a spamming network can be divided into two groups: those with content similar to the starting site (aka "link farms"), and those with dissimilar content (aka "mutual admiration societies"). Our tool explores thousands of nodes within minutes and could be deployed at the browser level, making it possible to resolve the moral question of who should be making the decision of weeding out spammers in favor of the end user.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.3.m [Information Storage and Retrieval]: Miscellaneous

General Terms: Algorithms, Experimentation, Social Networks, Propaganda, Trust

Keywords: search, Web graph, link structure, PageRank, HITS, Web spam

Copyright is held by the author/owner(s). WWW2005, May 10-14, 2005, Chiba, Japan.

1. INTRODUCTION

The web has changed the way we inform and get informed. Every organization has a web site, and people are increasingly comfortable accessing it for information on any question they may have. The exploding size of the web necessitated the development of search engines and web directories. Most people with online access use a search engine to get informed and to make decisions that may have medical, financial, cultural, political, security or other important implications [10, 37, 23, 29]. Moreover, 85% of the time people do not look past the first ten results returned by the search engine [35]. Given this, it is not surprising that anyone with a web presence struggles for a place in the top ten positions of relevant web search results. The importance of top-10 placement has given birth to a new industry that claims to sell know-how for prominent placement in search results and includes companies, publications, and even conferences. Some of them are willing to bend the truth in order to fool the search engines and their customers, by creating web pages containing web spam.

Web spamming is the practice of manipulating web pages in order to cause search engines to rank some pages higher than they would without any manipulation.* The motive is usually commercial, but can also be political or religious. The creators of web spam are often specialized companies selling their expertise as a service, but can also be the webmasters of the companies and organizations that would be their customers. Spammers attack search engines through text and link manipulations [22, 18]:

- Text spam: excessively repeating text and/or adding irrelevant text on a page, which causes incorrect calculation of page relevance, and adding misleading meta-keywords or irrelevant "anchor text", which causes incorrect application of rank heuristics.

- Link spam: changing the perceived structure of the webgraph in order to cause incorrect calculation of page reputation. Examples are the so-called "link farms", "mutual admiration societies", page "awards", domain flooding (a plethora of domains that redirect to a target site), etc.

* We should mention here that there is no complete agreement on the definition of web spam among authors, which leads to some confusion. Moreover, to people unfamiliar with web spam, the term is easily mistaken for email spam. A more descriptive name would be "search engine ranking manipulation" or "adversarial information retrieval".
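As a toy illustration of why text spam pays off against naive ranking, consider a relevance score that simply counts query-term occurrences on a page. The scorer and both example pages below are invented for this sketch and do not come from the paper's experiments:

```python
# A naive term-frequency relevance score, the kind of heuristic that
# text spam exploits. Both example pages are made up for this sketch.

def naive_relevance(page_text, query):
    """Count raw occurrences of each query term in the page text."""
    words = page_text.lower().split()
    return sum(words.count(term) for term in query.lower().split())

honest = "we sell affordable used cars in worcester"
spammed = "cheap cars cars cars cars best cars buy cars cars online"

print(naive_relevance(honest, "cars"))    # 1
print(naive_relevance(spammed, "cars"))   # 7
```

Repeating the keyword inflates the naive score sevenfold without adding any information, which is why modern engines dampen raw term frequency and treat excessive repetition as a spam signal.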
Both kinds of spam aim to boost the ranking of spammed web pages. Sometimes cloaking is included as a third spamming technique [22, 19]. Cloaking serves different pages to search engine robots than to web browsers (users); these pages can be created statically or dynamically. Static pages, for example, may employ hidden links and/or hidden text, using colors or small font sizes noticeable by a crawler but not by a human. Dynamic pages might change content on the fly depending on the visitor, fake the clickstream or query stream, submit millions of pages to "add-URL" forms of search engines, etc. We consider the false links and text themselves to be the spam; strictly speaking, cloaking is a tool that helps spammers hide their attacks.

Since anyone can be an author on the web, these practices have naturally created a question of information reliability. An audience used to trusting the written word of newspapers and books is unable, unprepared or unwilling to think critically about the information obtained from the web. In a recent study [16] we found that while college students regard the web as a primary source of information, many do not check more than a single source, and they have trouble recognizing trustworthy sources online. In particular, two out of three students are consistently unable to differentiate between facts and advertising claims, even "infomercials." At the same time, they have considerable confidence in their ability to distinguish trustworthy sites from non-trustworthy ones, especially when they feel technically competent. We have no reason to believe that the general public will perform any better than well-educated students. In fact, a recent analysis of internet-related fraud by a major Wall Street law firm [10] put the blame squarely on the investors for the success of stock fraud cases.

One of the reasons behind the users' difficulty to distinguish trustworthy from untrustworthy information comes from the sheer extent of the practice: estimates indicate that at least 8% of all indexed pages are spam [12], while experts consider web spamming the single most difficult challenge web searching is facing today [22].

In this paper we first examine the reasons web spamming has been so successful and its relationship to social propaganda. Then, we develop heuristics that are able to recognize web neighborhoods, especially untrustworthy ones. We present experimental results that show considerable success in recognizing spamming neighborhoods. Finally, we discuss what we believe should be a framework for the long-term approach to web spam.

2. THE WEBGRAPH AS A SOCIAL NETWORK

The web is typically represented by a directed graph [8]. The nodes in the webgraph are the pages (or sites) that reside on servers on the internet; arcs correspond to the hyperlinks that appear on those pages (or sites). The theory of social networks [38] also uses directed graphs to represent relationships between social entities: the nodes (called "actors") correspond to social entities (e.g., people, institutions, ideas), and the arcs (called "ties") correspond to social relations between the entities they connect (e.g., has influence on, knows, trusts).

This connection is more than a similarity in descriptions. The web itself is a social creation, and both PageRank and HITS are socially inspired algorithms [6, 26]. Socially inspired systems, however, are subject to socially inspired attacks. Not surprisingly, then, the theory of propaganda [28] can provide intuition into the dynamics of the web.

For example, PageRank is based on the assumption that the reputation of an entity (a web site, in this case) can be measured as a function of both the number and the reputation of the other entities recommending it. A link to a web page is counted as a "vote of confidence" for that site, and in turn, the reputation of a page is divided among those it recommends [6]. Since HTML does not provide for "positive" and "negative" links, all links are taken as positive. This is not always true, but it is considered a reasonable assumption.
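The "vote of confidence" intuition can be sketched as a few power-iteration steps of PageRank on a toy graph. The four-page graph, the damping factor of 0.85, and the iteration count below are illustrative assumptions for this sketch, not settings from the paper:

```python
# Toy PageRank sketch: each page's score is split evenly among the
# pages it links to ("votes of confidence"), plus a damping term.
# Graph, damping factor, and iteration count are illustrative choices.

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outlinks in links.items():
            if not outlinks:           # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:                      # reputation divided among recommendations
                for q in outlinks:
                    new[q] += damping * rank[p] / len(outlinks)
        rank = new
    return rank

toy_links = {
    "a": ["b", "c"],   # a recommends b and c
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],        # d votes for c but receives no votes itself
}
scores = pagerank(toy_links)
print(max(scores, key=scores.get))   # "c" collects the most votes
```

Each page splits its current reputation evenly among the pages it recommends, which is precisely the behavior that colluding link structures exploit.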
More importantly, there is also the implicit assumption that hyperlink "voting" takes place independently, without prior agreement or central control. Spammers, like social propagandists, create groups of sites that are able to gather a large number of such "votes of confidence" by design, thus breaking the assumption of independence in hyperlink voting.
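The backward propagation of distrust described in the abstract can be read as a reverse breadth-first walk from a known untrustworthy seed: sites that support the spam network inherit distrust, for a bounded number of steps. The sketch below is a minimal, hypothetical rendering of that idea; the toy graph, the two-step limit, and the support threshold are assumptions, not the authors' actual heuristic parameters:

```python
from collections import deque

def propagate_distrust(links, seed, max_steps=2, min_support=1):
    """Walk hyperlinks BACKWARDS from an untrustworthy seed; a site that
    endorses at least min_support distrusted sites inherits distrust."""
    # Build the reverse view: who links TO each page.
    backlinks = {p: set() for p in links}
    for p, outs in links.items():
        for q in outs:
            backlinks.setdefault(q, set()).add(p)

    distrusted = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        page, depth = frontier.popleft()
        if depth == max_steps:
            continue
        for supporter in backlinks.get(page, ()):
            # count how many distrusted pages this supporter endorses
            support = sum(1 for q in links.get(supporter, ()) if q in distrusted)
            if supporter not in distrusted and support >= min_support:
                distrusted.add(supporter)
                frontier.append((supporter, depth + 1))
    return distrusted

toy_sites = {
    "spam":  ["farm1"],
    "farm1": ["spam", "farm2"],
    "farm2": ["spam", "farm1"],
    "news":  [],           # unrelated sites stay trusted:
    "blog":  ["news"],     # they never vote for the spam network
}
print(sorted(propagate_distrust(toy_sites, "spam")))
```

On this toy graph the walk marks only the mutually linking farm sites, leaving the unrelated sites untouched; raising min_support would make the heuristic more conservative about sites with a single stray link into the network.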