Applied Text Analytics for Blogs

UvA-DARE (Digital Academic Repository)

Mishne, G.A.
Publication date: 2007
Document version: Final published version
Citation for published version (APA): Mishne, G. A. (2007). Applied text analytics for blogs.

UvA-DARE is a service provided by the library of the University of Amsterdam (https://dare.uva.nl).

Gilad Mishne

Academic dissertation for the degree of Doctor at the Universiteit van Amsterdam, by authority of the Rector Magnificus, prof.dr. J.W. Zwemmer, to be defended in public before a committee appointed by the Doctorate Board, in the Aula of the University, on Friday 27 April 2007 at 10:00, by Gilad Avraham Mishne, born in Haifa, Israel.

Promotor: Prof.dr. Maarten de Rijke
Committee: Prof.dr. Ricardo Baeza-Yates, Dr. Natalie Glance, Prof.dr. Simon Jones, Dr. Maarten Marx

Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Universiteit van Amsterdam

SIKS Dissertation Series No. 2007-06

The research reported in this thesis has been carried out under the auspices of SIKS, the Dutch Research School for Information and Knowledge Systems. The investigations were supported by the Netherlands Organization for Scientific Research (NWO) under project number 220-80-001.

Copyright © 2007 by Gilad Mishne
Cover image copyright © 2006 bLaugh.com, courtesy of Brad Fitzpatrick and Chris Pirillo
Printed and bound by PrintPartners Ipskamp, Enschede
ISBN-13: 978-90-5776-163-8

Contents

Acknowledgments

1 Introduction
  1.1 Research Questions
  1.2 Organization of the Thesis
  1.3 Foundations

2 Background: Blogs and the Blogspace
  2.1 Blogs
    2.1.1 What is a Blog?
    2.1.2 Terminology
    2.1.3 A History of Blogging
    2.1.4 Blog Genres
  2.2 The Blogspace
    2.2.1 Size and Growth
    2.2.2 Structure of the Blogspace
    2.2.3 The Language of Blogs
    2.2.4 Demographics
  2.3 Computational Access to Blogs
    2.3.1 Search and Exploration
    2.3.2 Sentiment Analysis in Blogs
    2.3.3 Blog Spam
    2.3.4 Blogs as Consumer Generated Media
  2.4 Scope of this Work

I Analytics for Single Blogs

3 Language-based Blog Profiles
  3.1 Statistical Language Modeling
  3.2 Blogger Profiles
    3.2.1 Language Model Comparison and Keyword Extraction
    3.2.2 Profiling Blogs
  3.3 Matching Bloggers and Products
    3.3.1 Evaluation
    3.3.2 Conclusions
  3.4 Extended Blogger Models
    3.4.1 Language Models of Blog Posts
    3.4.2 Model Mixtures
    3.4.3 Example
  3.5 Profile-based Contextual Advertising
    3.5.1 Contextual Advertising
    3.5.2 Matching Advertisements to Blog Posts
    3.5.3 Evaluation
    3.5.4 Experiments
    3.5.5 Results
    3.5.6 Deployment
    3.5.7 Conclusions

4 Text Classification in Blogs
  4.1 Mood Classification in Blog Posts
    4.1.1 Related Work
    4.1.2 A Blog Corpus
    4.1.3 Feature Set
    4.1.4 Experimental Setting
    4.1.5 Results
    4.1.6 Discussion
    4.1.7 Conclusions
  4.2 Automated Tagging
    4.2.1 Tagging and Folksonomies in Blogs
    4.2.2 Tag Assignment
    4.2.3 Experimental Setting
    4.2.4 Conclusions

5 Comment Spam in Blogs
  5.1 Comment and Link Spam
  5.2 Related Work
    5.2.1 Combating Comment Spam
    5.2.2 Content Filtering and Spam
    5.2.3 Identifying Spam Sites
  5.3 Comment Spam and Language Models
    5.3.1 Language Models for Text Comparison
    5.3.2 Spam Classification
    5.3.3 Model Expansion
    5.3.4 Limitations and Solutions
  5.4 Evaluation
    5.4.1 Results
    5.4.2 Discussion
    5.4.3 Model Expansions
  5.5 Conclusions

II Analytics for Collections of Blogs

6 Aggregate Sentiment in the Blogspace
  6.1 Blog Sentiment for Business Intelligence
    6.1.1 Related Work
    6.1.2 Data and Experiments
    6.1.3 Conclusions
  6.2 Tracking Moods through Blogs
    6.2.1 Mood Cycles
    6.2.2 Events and Moods
  6.3 Predicting Mood Changes
    6.3.1 Related Work
    6.3.2 Capturing Global Moods from Text
    6.3.3 Evaluation
    6.3.4 Case Studies
    6.3.5 Conclusions
  6.4 Explaining Irregular Mood Patterns
    6.4.1 Detecting Irregularities
    6.4.2 Explaining Irregularities
    6.4.3 Case Studies
    6.4.4 Conclusions

7 Blog Comments
  7.1 Related Work
  7.2 Dataset
    7.2.1 Comment Extraction
    7.2.2 A Comment Corpus
    7.2.3 Links in Comments
  7.3 Comments as Missing Content
    7.3.1 Coverage
    7.3.2 Precision
  7.4 Comments and Popularity
    7.4.1 Outliers
  7.5 Discussions in Comments
    7.5.1 Detecting Disputes in Comments
    7.5.2 Evaluation
  7.6 Conclusions

III Searching Blogs

8 Search Behavior in the Blogspace
  8.1 Data
  8.2 Types of Information Needs
  8.3 Popular Queries and Query Categories
  8.4 Query Categories
  8.5 Session Analysis
    8.5.1 Recovering Sessions and Subscription Sets
    8.5.2 Properties of Sessions
  8.6 Conclusions

9 Opinion Retrieval in Blogs
  9.1 Opinion Retrieval at TREC
    9.1.1 Collection
    9.1.2 Task Description
    9.1.3 Assessment
  9.2 A Multiple-component Strategy
    9.2.1 Topical Relevance
    9.2.2 Opinion Expression
    9.2.3 Post Quality
    9.2.4 Model Combination
  9.3 Evaluation
    9.3.1 Base Ranking Model
    9.3.2 Full-content vs. Feed-only
    9.3.3 Query Source
    9.3.4 Query Expansion
    9.3.5 Term Dependence
    9.3.6 Recency Scoring
    9.3.7 Content-based Opinion Scores
    9.3.8 Structure-based Opinion Scores
    9.3.9 Spam Filtering
    9.3.10 Link-based Authority
    9.3.11 Component Combination
  9.4 Related Work
  9.5 Conclusions

10 Conclusions
  10.1 Answers to Research Questions
  10.2 Main Contributions
  10.3 Future Directions

A Crawling Blogs
B MoodViews: Tools for Blog Mood Analysis

Samenvatting (summary in Dutch)
Bibliography

Acknowledgments

I am indebted to my advisor, Professor Maarten de Rijke, who opened the door of scientific research for me and guided my walk through a path that was at times unclear. Despite an incredibly busy schedule, Maarten always found time to discuss any issue, or to suggest final improvements to a paper; he helped me in many ways, professional and other, and gave me a much-appreciated free hand to explore new areas my way. Many thanks also

