Applied Text Analytics for Blogs

UvA-DARE (Digital Academic Repository)

Applied text analytics for blogs
Mishne, G.A.

Publication date: 2007
Document version: Final published version

Citation for published version (APA): Mishne, G. A. (2007). Applied text analytics for blogs.

General rights: It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons).

Disclaimer/Complaints regulations: If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library: https://uba.uva.nl/en/contact, or a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible.

UvA-DARE is a service provided by the library of the University of Amsterdam (https://dare.uva.nl). Download date: 06 Oct 2021

Applied Text Analytics for Blogs
Gilad Mishne

Academic dissertation (Academisch Proefschrift) for the degree of Doctor at the Universiteit van Amsterdam, by authority of the Rector Magnificus, prof.dr. J.W. Zwemmer, to be defended in public before a committee appointed by the Doctorate Board (college voor promoties), in the Aula of the University, on Friday 27 April 2007 at 10:00, by Gilad Avraham Mishne, born in Haifa, Israel.

Promotor: Prof.dr. Maarten de Rijke
Committee: Prof.dr. Ricardo Baeza-Yates, Dr. Natalie Glance, Prof.dr. Simon Jones, Dr. Maarten Marx

Faculteit der Natuurwetenschappen, Wiskunde en Informatica, Universiteit van Amsterdam

SIKS Dissertation Series No. 2007-06

The research reported in this thesis has been carried out under the auspices of SIKS, the Dutch Research School for Information and Knowledge Systems. The investigations were supported by the Netherlands Organization for Scientific Research (NWO) under project number 220-80-001.

Copyright © 2007 by Gilad Mishne
Cover image copyright © 2006 bLaugh.com, courtesy of Brad Fitzpatrick and Chris Pirillo
Printed and bound by PrintPartners Ipskamp, Enschede
ISBN-13: 978–90–5776–163–8

Contents

Acknowledgments ix
1 Introduction 1
  1.1 Research Questions 3
  1.2 Organization of the Thesis 5
  1.3 Foundations 6
2 Background: Blogs and the Blogspace 7
  2.1 Blogs 7
    2.1.1 What is a Blog? 7
    2.1.2 Terminology 10
    2.1.3 A History of Blogging 16
    2.1.4 Blog Genres 19
  2.2 The Blogspace 21
    2.2.1 Size and Growth 22
    2.2.2 Structure of the Blogspace 24
    2.2.3 The Language of Blogs 33
    2.2.4 Demographics 38
  2.3 Computational Access to Blogs 40
    2.3.1 Search and Exploration 40
    2.3.2 Sentiment Analysis in Blogs 44
    2.3.3 Blog Spam 45
    2.3.4 Blogs as Consumer Generated Media 46
  2.4 Scope of this Work 48
I Analytics for Single Blogs 49
3 Language-based Blog Profiles 53
  3.1 Statistical Language Modeling 53
  3.2 Blogger Profiles 56
    3.2.1 Language Model Comparison and Keyword Extraction 56
    3.2.2 Profiling Blogs 57
  3.3 Matching Bloggers and Products 58
    3.3.1 Evaluation 62
    3.3.2 Conclusions 63
  3.4 Extended Blogger Models 64
    3.4.1 Language Models of Blog Posts 64
    3.4.2 Model Mixtures 66
    3.4.3 Example 67
  3.5 Profile-based Contextual Advertising 68
    3.5.1 Contextual Advertising 69
    3.5.2 Matching Advertisements to Blog Posts 70
    3.5.3 Evaluation 71
    3.5.4 Experiments 73
    3.5.5 Results 75
    3.5.6 Deployment 76
    3.5.7 Conclusions 77
4 Text Classification in Blogs 79
  4.1 Mood Classification in Blog Posts 79
    4.1.1 Related Work 80
    4.1.2 A Blog Corpus 81
    4.1.3 Feature Set 83
    4.1.4 Experimental Setting 87
    4.1.5 Results 88
    4.1.6 Discussion 91
    4.1.7 Conclusions 92
  4.2 Automated Tagging 92
    4.2.1 Tagging and Folksonomies in Blogs 93
    4.2.2 Tag Assignment 93
    4.2.3 Experimental Setting 97
    4.2.4 Conclusions 101
5 Comment Spam in Blogs 103
  5.1 Comment and Link Spam 103
  5.2 Related Work 104
    5.2.1 Combating Comment Spam 104
    5.2.2 Content Filtering and Spam 105
    5.2.3 Identifying Spam Sites 105
  5.3 Comment Spam and Language Models 106
    5.3.1 Language Models for Text Comparison 106
    5.3.2 Spam Classification 107
    5.3.3 Model Expansion 109
    5.3.4 Limitations and Solutions 109
  5.4 Evaluation 110
    5.4.1 Results 110
    5.4.2 Discussion 113
    5.4.3 Model Expansions 113
  5.5 Conclusions 114
II Analytics for Collections of Blogs 117
6 Aggregate Sentiment in the Blogspace 121
  6.1 Blog Sentiment for Business Intelligence 122
    6.1.1 Related Work 122
    6.1.2 Data and Experiments 123
    6.1.3 Conclusions 128
  6.2 Tracking Moods through Blogs 130
    6.2.1 Mood Cycles 130
    6.2.2 Events and Moods 132
  6.3 Predicting Mood Changes 133
    6.3.1 Related Work 134
    6.3.2 Capturing Global Moods from Text 134
    6.3.3 Evaluation 136
    6.3.4 Case Studies 141
    6.3.5 Conclusions 145
  6.4 Explaining Irregular Mood Patterns 145
    6.4.1 Detecting Irregularities 147
    6.4.2 Explaining Irregularities 148
    6.4.3 Case Studies 149
    6.4.4 Conclusions 150
7 Blog Comments 153
  7.1 Related Work 155
  7.2 Dataset 156
    7.2.1 Comment Extraction 156
    7.2.2 A Comment Corpus 159
    7.2.3 Links in Comments 163
  7.3 Comments as Missing Content 163
    7.3.1 Coverage 165
    7.3.2 Precision 166
  7.4 Comments and Popularity 167
    7.4.1 Outliers 169
  7.5 Discussions in Comments 169
    7.5.1 Detecting Disputes in Comments 171
    7.5.2 Evaluation 173
  7.6 Conclusions 174
III Searching Blogs 179
8 Search Behavior in the Blogspace 183
  8.1 Data 184
  8.2 Types of Information Needs 185
  8.3 Popular Queries and Query Categories 188
  8.4 Query Categories 190
  8.5 Session Analysis 192
    8.5.1 Recovering Sessions and Subscription Sets 194
    8.5.2 Properties of Sessions 194
  8.6 Conclusions 196
9 Opinion Retrieval in Blogs 197
  9.1 Opinion Retrieval at TREC 198
    9.1.1 Collection 198
    9.1.2 Task Description 198
    9.1.3 Assessment 199
  9.2 A Multiple-component Strategy 200
    9.2.1 Topical Relevance 202
    9.2.2 Opinion Expression 205
    9.2.3 Post Quality 210
    9.2.4 Model Combination 212
  9.3 Evaluation 213
    9.3.1 Base Ranking Model 214
    9.3.2 Full-content vs. Feed-only 215
    9.3.3 Query Source 219
    9.3.4 Query Expansion 220
    9.3.5 Term Dependence 221
    9.3.6 Recency Scoring 224
    9.3.7 Content-based Opinion Scores 224
    9.3.8 Structure-based Opinion Scores 226
    9.3.9 Spam Filtering 226
    9.3.10 Link-based Authority 228
    9.3.11 Component Combination 230
  9.4 Related Work 230
  9.5 Conclusions 234
10 Conclusions 239
  10.1 Answers to Research Questions 239
  10.2 Main Contributions 242
  10.3 Future Directions 243
A Crawling Blogs 249
B MoodViews: Tools for Blog Mood Analysis 253
Samenvatting 257
Bibliography 259

Acknowledgments

I am indebted to my advisor, Professor Maarten de Rijke, who opened the door of scientific research for me and guided my walk through a path that was at times unclear. Despite an incredibly busy schedule, Maarten always found time to discuss any issue, or to suggest final improvements to a paper; he helped me in many ways, professional and other, and gave me a much-appreciated free hand to explore new areas my way. Many thanks also
Recommended publications
  • Copyright 2009 Cengage Learning. All Rights Reserved. May Not Be Copied, Scanned, Or Duplicated, in Whole Or in Part
    Index Note: Page numbers referencing figures are italicized and followed by an “f”. Page numbers referencing tables are italicized and followed by a “t”. A Ajax, 353 bankruptcy, 4, 9f About.com, 350 Alexa.com, 42, 78 banner advertising, 7f, 316, 368 AboutUs.org, 186, 190–192 Alta Vista, 7 Barack Obama’s online store, 328f Access application, 349 Amazon.com, 7f, 14f, 48, 247, BaseballNooz.com, 98–100 account managers, 37–38 248f–249f, 319–320, 322 BBC News, 3 ActionScript, 353–356 anonymity, 16 Bebo, 89t Adobe Flash AOL, 8f, 14f, 77, 79f, 416 behavioral changes, 16 application, 340–341 Apple iTunes, 13f–14f benign disinhibition, 16 file format. See .flv file format Apple site, 11f, 284f Best Dates Now blog, 123–125 player, 150, 153, 156 applets, Java, 352, 356 billboard advertising, 369 Adobe GoLive, 343 applications, see names of specific bitmaps, 290, 292, 340, 357 Adobe Photoshop, 339–340 applications BJ’s site, 318 Advanced Research Projects ARPA (Advanced Research Black Friday, 48 Agency (ARPA), 2 Projects Agency), 2 blog communities, 8f advertising artistic fonts, 237 blog editors, 120, 142 dating sites, 106 ASCO Power University, 168–170 blog search engines, 126 defined, 397 .asf file format, 154t–155t Blogger, 344–347 e-commerce, 316 AskPatty.com, 206–209 blogging, 7f, 77–78, 86, 122–129, Facebook, 94–96 AuctionWeb, 7f 133–141, 190, 415 family and lifestyle sites, 109 audience, capturing and retaining, blogosphere, 122, 142 media, 373–376 61–62, 166, 263, 405–407, blogrolls, 121, 142 message, 371–372 410–422, 432 Blue Nile site,
  • The Web 2.0 Way of Learning with Technologies Herwig Rollett
    Int. J. Learning Technology, Vol. X, No. Y, xxxx 1 The Web 2.0 way of learning with technologies Herwig Rollett* Know-Center, Inffeldgasse 21a A-8010 Graz, Austria E-mail: [email protected] *Corresponding author Mathias Lux Department for Information Technology University of Klagenfurt Universitätsstraße 65–67 A-9020 Klagenfurt, Austria E-mail: [email protected] Markus Strohmaier Department of Computer Science University of Toronto 40 St. George Street Toronto, Ontario M5S 2E4, Canada Know-Center, Inffeldgasse 21a A-8010 Graz, Austria E-mail: [email protected] Gisela Dösinger Know-Center, Inffeldgasse 21a A-8010 Graz, Austria E-mail: [email protected] Klaus Tochtermann Know-Center Institute of Knowledge Management Graz University of Technology Inffeldgasse 21a, A-8010 Graz, Austria E-mail: [email protected] Copyright © 200x Inderscience Enterprises Ltd. 2 H. Rollett et al. Abstract: While there is a lot of hype around various concepts associated with the term Web 2.0 in industry, little academic research has so far been conducted on the implications of this new approach for the domain of education. Much of what goes by the name of Web 2.0 can, in fact, be regarded as a new kind of learning technologies, and can be utilised as such. This paper explains the background of Web 2.0, investigates the implications for knowledge transfer in general, and then discusses its particular use in eLearning contexts with the help of short scenarios. The main challenge in the future will be to maintain essential Web 2.0 attributes, such as trust, openness, voluntariness and self-organisation, when applying Web 2.0 tools in institutional contexts.
  • Liferay Portal 6 Enterprise Intranets
    Liferay Portal 6 Enterprise Intranets Build and maintain impressive corporate intranets with Liferay Jonas X. Yuan BIRMINGHAM - MUMBAI Liferay Portal 6 Enterprise Intranets Copyright © 2010 Packt Publishing All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information. First published: April 2010 Production Reference: 1230410 Published by Packt Publishing Ltd. 32 Lincoln Road Olton Birmingham, B27 6PA, UK. ISBN 978-1-849510-38-7 www.packtpub.com Cover Image by Karl Swedberg ([email protected]) Credits Author Editorial Team Leader Jonas X. Yuan Aanchal Kumar Reviewer Project Team Leader Amine Bousta Lata Basantani Acquisition Editor Project Coordinator Dilip Venkatesh Shubhanjan Chatterjee Development Editor Proofreaders Mehul Shetty Aaron Nash Lesley Harrison Technical Editors Aditya Belpathak Graphics Alfred John Geetanjali Sawant Charumathi Sankaran Nilesh Mohite Copy Editors Production Coordinators Leonard D'Silva Avinish Kumar Sanchari Mukherjee Aparna Bhagat Nilesh Mohite Indexers Hemangini Bari Cover Work Rekha Nair Aparna Bhagat About the Author Dr.
  • Address Munging: the Practice of Disguising, Or Munging, an E-Mail Address to Prevent It Being Automatically Collected and Used
    Address Munging: the practice of disguising, or munging, an e-mail address to prevent it being automatically collected and used as a target for people and organizations that send unsolicited bulk e-mail address. Adware: or advertising-supported software is any software package which automatically plays, displays, or downloads advertising material to a computer after the software is installed on it or while the application is being used. Some types of adware are also spyware and can be classified as privacy-invasive software. Adware is software designed to force pre-chosen ads to display on your system. Some adware is designed to be malicious and will pop up ads with such speed and frequency that they seem to be taking over everything, slowing down your system and tying up all of your system resources. When adware is coupled with spyware, it can be a frustrating ride, to say the least. Backdoor: in a computer system (or cryptosystem or algorithm) is a method of bypassing normal authentication, securing remote access to a computer, obtaining access to plaintext, and so on, while attempting to remain undetected. The backdoor may take the form of an installed program (e.g., Back Orifice), or could be a modification to an existing program or hardware device. A back door is a point of entry that circumvents normal security and can be used by a cracker to access a network or computer system. Usually back doors are created by system developers as shortcuts to speed access through security during the development stage and then are overlooked and never properly removed during final implementation.
  • Adversarial Web Search by Carlos Castillo and Brian D
    Foundations and Trends® in Information Retrieval Vol. 4, No. 5 (2010) 377–486 © 2011 C. Castillo and B. D. Davison DOI: 10.1561/1500000021 Adversarial Web Search By Carlos Castillo and Brian D. Davison Contents 1 Introduction 379 1.1 Search Engine Spam 380 1.2 Activists, Marketers, Optimizers, and Spammers 381 1.3 The Battleground for Search Engine Rankings 383 1.4 Previous Surveys and Taxonomies 384 1.5 This Survey 385 2 Overview of Search Engine Spam Detection 387 2.1 Editorial Assessment of Spam 387 2.2 Feature Extraction 390 2.3 Learning Schemes 394 2.4 Evaluation 397 2.5 Conclusions 400 3 Dealing with Content Spam and Plagiarized Content 401 3.1 Background 402 3.2 Types of Content Spamming 405 3.3 Content Spam Detection Methods 405 3.4 Malicious Mirroring and Near-Duplicates 408 3.5 Cloaking and Redirection 409 3.6 E-mail Spam Detection 413 3.7 Conclusions 413 4 Curbing Nepotistic Linking 415 4.1 Link-Based Ranking 416 4.2 Link Bombs 418 4.3 Link Farms 419 4.4 Link Farm Detection 421 4.5 Beyond Detection 424 4.6 Combining Links and Text 426 4.7 Conclusions 429 5 Propagating Trust and Distrust 430 5.1 Trust as a Directed Graph 430 5.2 Positive and Negative Trust 432 5.3 Propagating Trust: TrustRank and Variants 433 5.4 Propagating Distrust: BadRank and Variants 434 5.5 Considering In-Links as well as Out-Links 436 5.6 Considering Authorship as well as Contents 436 5.7 Propagating Trust in Other Settings 437 5.8 Utilizing Trust 438 5.9 Conclusions 438 6 Detecting Spam in Usage Data 439 6.1 Usage Analysis for Ranking 440 6.2 Spamming Usage Signals 441 6.3 Usage Analysis to Detect Spam 444 6.4 Conclusions 446 7 Fighting Spam in User-Generated Content 447 7.1 User-Generated Content Platforms 448 7.2 Splogs 449 7.3 Publicly-Writable Pages 451 7.4 Social Networks and Social Media Sites 455 7.5 Conclusions 459 8 Discussion 460 8.1 The (Ongoing) Struggle Between Search Engines and Spammers 460 8.2 Outlook 463 8.3 Research Resources 464 8.4 Conclusions 467 Acknowledgments 468 References 469
  • Uncovering Social Network Sybils in the Wild
    Uncovering Social Network Sybils in the Wild ZHI YANG, Peking University CHRISTO WILSON, University of California, Santa Barbara XIAO WANG, Peking University TINGTING GAO, Renren Inc. BEN Y. ZHAO, University of California, Santa Barbara YAFEI DAI, Peking University Sybil accounts are fake identities created to unfairly increase the power or resources of a single malicious user. Researchers have long known about the existence of Sybil accounts in online communities such as file-sharing systems, but they have not been able to perform large-scale measurements to detect them or measure their activities. In this article, we describe our efforts to detect, characterize, and understand Sybil account activity in the Renren Online Social Network (OSN). We use ground truth provided by Renren Inc. to build measurement-based Sybil detectors and deploy them on Renren to detect more than 100,000 Sybil accounts. Using our full dataset of 650,000 Sybils, we examine several aspects of Sybil behavior. First, we study their link creation behavior and find that contrary to prior conjecture, Sybils in OSNs do not form tight-knit communities. Next, we examine the fine-grained behaviors of Sybils on Renren using clickstream data. Third, we investigate behind-the-scenes collusion between large groups of Sybils. Our results reveal that Sybils with no explicit social ties still act in concert to launch attacks. Finally, we investigate enhanced techniques to identify stealthy Sybils. In summary, our study advances the understanding of Sybil behavior on OSNs and shows that Sybils can effectively avoid existing community-based Sybil detectors. We hope that our results will foster new research on Sybil detection that is based on novel types of Sybil features.
  • Svms for the Blogosphere: Blog Identification and Splog Detection
    SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari, Tim Finin and Anupam Joshi University of Maryland, Baltimore County Baltimore MD {kolari1, finin, joshi}@umbc.edu Abstract Weblogs, or blogs have become an important new way to publish information, engage in discussions and form communities. The increasing popularity of blogs has given rise to search and analysis engines focusing on the “blogosphere”. A key requirement of such systems is to identify blogs as they crawl the Web. While this ensures that only blogs are indexed, blog search engines are also often overwhelmed by spam blogs (splogs). Splogs not only incur computational overheads but also reduce user satisfaction. In this paper we first describe experimental results of blog identification using Support Vector Machines (SVM). We compare results of using different feature sets and introduce new features for blog iden- Most blog search engines identify blogs and index content based on update pings received from ping servers or directly from blogs, or through crawling blog directories and blog hosting services. To increase their coverage, blog search engines continue to crawl the Web to discover, identify and index blogs. This enables staying ahead of competition in a domain where “size does matter”. Even if a web crawl is inessential for blog search engines, it is still possible that processed update pings are from non-blogs. This requires that the source of the pings need to be verified as a blog prior to indexing content. In the first part of this paper we address blog identification by experimenting with different feature sets.
  • By Nilesh Bansal a Thesis Submitted in Conformity with the Requirements
    ONLINE ANALYSIS OF HIGH-VOLUME SOCIAL TEXT STEAMS by Nilesh Bansal A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Computer Science University of Toronto © Copyright 2013 by Nilesh Bansal Abstract Online Analysis of High-Volume Social Text Steams Nilesh Bansal Doctor of Philosophy Graduate Department of Computer Science University of Toronto 2013 Social media is one of the most disruptive developments of the past decade. The impact of this information revolution has been fundamental on our society. Information dissemination has never been cheaper and users are increasingly connected with each other. The line between content producers and consumers is blurred, leaving us with abundance of data produced in real-time by users around the world on multitude of topics. In this thesis we study techniques to aid an analyst in uncovering insights from this new media form which is modeled as a high volume social text stream. The aim is to develop practical algorithms with focus on the ability to scale, amenability to reliable operation, usability, and ease of implementation. Our work lies at the intersection of building large scale real world systems and developing theoretical foundation to support the same. We identify three key predicates to enable online methods for analysis of social data, namely: • Persistent Chatter Discovery to explore topics discussed over a period of time, • Cross-referencing Media Sources to initiate analysis using a document as the query, and • Contributor Understanding to create aggregate expertise and topic summaries of authors contributing online. The thesis defines each of the predicates in detail and covers proposed techniques, their practical applicability, and detailed experimental results to establish accuracy and scalability for each of the three predicates.
  • Blogosphere: Research Issues, Tools, and Applications
    Blogosphere: Research Issues, Tools, and Applications Nitin Agarwal Huan Liu Computer Science and Engineering Department Arizona State University Tempe, AZ 85287 fNitin.Agarwal.2, [email protected] ABSTRACT Weblogs, or Blogs, have facilitated people to express their thoughts, voice their opinions, and share their experiences and ideas. Individuals experience a sense of community, a feeling of belonging, a bonding that members matter to one another and their niche needs will be met through online interactions. Its open standards and low barrier to publication have transformed information consumers to producers. This has created a plethora of open-source intelligence, or “collective wisdom” that acts as the storehouse of overwhelming amounts of knowledge about the members, their environment and the symbiosis between them. Nonetheless, vast amounts of this knowledge still remain to be discovered and exploited in its suitable way. ging. Acknowledging this fact, Times has named “You” as the person of the year 2006. This has created a considerable shift in the way information is assimilated by the individuals. This paradigm shift can be attributed to the low barrier to publication and open standards of content generation services like blogs, wikis, collaborative annotation, etc. These services have allowed the mass to contribute and edit articles publicly. Giving access to the mass to contribute or edit has also increased collaboration among the people unlike previously where there was no collaboration as the access to the content was limited to a chosen few. Increased collaboration has developed collective wisdom on the Internet. “We the media” [21], is a phenomenon named by Dan Gillmor: a world in which “the former audience”, not a few
  • Web Spam Taxonomy
    Web Spam Taxonomy Zoltán Gyöngyi Hector Garcia-Molina Computer Science Department Computer Science Department Stanford University Stanford University [email protected] [email protected] Abstract Web spamming refers to actions intended to mislead search engines into ranking some pages higher than they deserve. Recently, the amount of web spam has increased dramatically, leading to a degradation of search results. This paper presents a comprehensive taxonomy of current spamming techniques, which we believe can help in developing appropriate countermeasures. 1 Introduction As more and more people rely on the wealth of information available online, increased exposure on the World Wide Web may yield significant financial gains for individuals or organizations. Most frequently, search engines are the entryways to the Web; that is why some [...] techniques, but as far as we know, they still lack a fully effective set of tools for combating it. We believe that the first step in combating spam is understanding it, that is, analyzing the techniques the spammers use to mislead search engines. A proper understanding of spamming can then guide the development of appropriate countermeasures. To that end, in this paper we organize web spamming techniques into a taxonomy that can provide a framework for combating spam. We also provide an overview of published statistics about web spam to underline the magnitude of the problem. There have been brief discussions of spam in the scientific literature [3, 6, 12]. One can also find details for several specific techniques on the Web itself (e.g., [11]). Nevertheless, we believe that this paper offers the first comprehensive taxonomy of all important spamming techniques known to date.
  • Search Engine Optimization with PHP
    Professional Search Engine Optimization with PHP A Developer’s Guide to SEO Jaimie Sirovich Cristian Darie Professional Search Engine Optimization with PHP: A Developer’s Guide to SEO Published by Wiley Publishing, Inc. 10475 Crosspoint Boulevard Indianapolis, IN 46256 www.wiley.com Copyright © 2007 by Wiley Publishing, Inc., Indianapolis, Indiana Published simultaneously in Canada ISBN: 978-0-470-10092-9 Manufactured in the United States of America 10 9 8 7 6 5 4 3 2 1 Library of Congress Cataloging-in-Publication Data: Sirovich, Jaimie, 1981- Professional search engine optimization with PHP : a developer's guide to SEO / Jaimie Sirovich, Cristian Darie. p. cm. Includes index. ISBN 978-0-470-10092-9 (pbk.) 1. PHP (Computer program language) 2. Web sites--Design. 3. Search engines. I. Darie, Cristian. II. Title. QA76.73.P224S525 2007 005.13'3--dc22 2007003317 No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN 46256, (317) 572-3447, fax (317) 572-4355, or online at http://www.wiley.com/go/permissions.
  • Spam in Blogs and Social Media
    Pranam Kolari, Tim Finin, Akshay Java, Anupam Joshi March 25, 2007 • Spam on the Internet – Variants – Social Media Spam • Reason behind Spam in Blogs • Detecting Spam Blogs • Trends and Issues • How can you help? • Conclusions Pranam Kolari is a UMBC PhD student. His dissertation is on spam blog detection, with tools developed in use both by academia and industry. He has active research interest in internal corporate blogs, the Semantic Web and blog analytics. Tim Finin is a UMBC Professor with over 30 years of experience in applying AI to information systems, intelligent interfaces and robotics. Current interests include social media, the Semantic Web and multi-agent systems. Akshay Java is a UMBC PhD student. His dissertation is on identifying influence and opinions in social media. His research interests include blog analytics, information retrieval, natural language processing and the Semantic Web. Anupam Joshi is a UMBC Professor with research interests in the broad area of networked computing and intelligent systems. He currently serves on the editorial board of the International Journal of the Semantic Web and Information. • Early form seen around 1992 with MAKE MONEY FAST • 80-85% of all e-mail traffic is spam • In numbers 2005 - (June) 30 billion per day 2006 - (June) 55 billion per day 2006 - (December) 85 billion per day 2007 - (February) 90 billion per day Sources: IronPort, Wikipedia http://www.ironport.com/company/ironport_pr_2006-06-28.html • “Unsolicited usually commercial e-mail sent to a large