Tracking Content and Predicting Behavior
Total Page:16
File Type:pdf, Size:1020Kb
UvA-DARE (Digital Academic Repository) Mining social media: tracking content and predicting behavior Tsagkias, M. Publication date 2012 Document Version Final published version Link to publication Citation for published version (APA): Tsagkias, M. (2012). Mining social media: tracking content and predicting behavior. General rights It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons). Disclaimer/Complaints regulations If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library: https://uba.uva.nl/en/contact, or a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible. UvA-DARE is a service provided by the library of the University of Amsterdam (https://dare.uva.nl) Download date:02 Oct 2021 Mining Social Media: Tracking Content and Predicting Behavior Manos Tsagkias Mining Social Media: Tracking Content and Predicting Behavior ACADEMISCH PROEFSCHRIFT ter verkrijging van de graad van doctor aan de Universiteit van Amsterdam op gezag van de Rector Magnificus prof.dr. D.C. van den Boom ten overstaan van een door het college voor promoties ingestelde commissie, in het openbaar te verdedigen in de Agnietenkapel op woensdag 5 december 2012, te 14:00 uur door Manos Tsagkias geboren te Athene, Griekenland Promotiecommissie Promotor: Prof. dr. M. de Rijke Overige leden: Dr. J. Gonzalo Prof.dr. C.T.A.M. de Laat Dr. C. Monz Prof.dr. A.P. de Vries Faculteit der Natuurwetenschappen, Wiskunde en Informatica SIKS Dissertation Series No. 2012-47 The research reported in this thesis has been carried out under the auspices of SIKS, the Dutch Research School for Information and Knowledge Systems. This research was supported by the Netherlands Organisation for Scientific Re- search (NWO) under project number 612.061.815. Copyright © 2012 Manos Tsagkias, Amsterdam, The Netherlands Cover by Mark Assies Printed by Offpage, Amsterdam ISBN: 978-94-6182-198-0 To the unseen heroes, including you. You hold the result of five years of work. Writing this book has been a non-trivial, yet exciting, process. Although I claim sole authorship, I cannot forget the many other people who have contributed to this book in different ways, levels and stages. Without them the journey would not have been as easy. I want to say Thank you! Αντὼνη, kai KaterÈna gia thn pÈsth sac se εμὲνα. Δημὴτρη, Tz., kai Θανὰση gia to mÈto thc Αριὰδνης. Maarten for being my Teacher. Wouter, Simon, and Edgar for keeping the mojo flowing. Katja, Jiyin, and Marc for belaying me while I was climbing my way to a Ph.D. My co-authors for the stories and dialogues. All former and current ILPSers for transforming daily routine into an inspiring process. Contents 1 Introduction1 1.1 Research outline and questions....................3 1.2 Main contributions...........................6 1.3 Thesis overview.............................7 1.4 Origins..................................8 2 Background9 2.1 Social media analysis..........................9 2.2 Content representation......................... 12 2.3 Ranking................................. 16 2.4 Link generation............................. 19 2.5 Prediction................................ 21 I. Tracking Content 25 3 Linking Online News and Social Media 29 3.1 Introduction............................... 29 3.2 Approach................................ 33 3.3 Methods................................. 34 3.3.1 Retrieval model......................... 34 3.3.2 Query modeling........................ 36 3.3.3 Late fusion........................... 40 3.4 Experimental setup........................... 41 3.4.1 Experiments........................... 41 3.4.2 Data set and data gathering.................. 42 3.4.3 Evaluation............................ 43 3.4.4 Weight optimization for late fusion.............. 44 3.5 Results and analysis.......................... 45 3.5.1 Query modeling........................ 45 3.5.2 Late fusion........................... 48 3.6 Conclusions and outlook........................ 51 4 Hypergeometric Language Models 55 4.1 Introduction............................... 56 4.2 Hypergeometric distributions..................... 59 v CONTENTS vi 4.2.1 The central hypergeometric distribution........... 59 4.2.2 The non-central hypergeometric distribution......... 60 4.2.3 An example........................... 60 4.3 Retrieval models............................ 61 4.3.1 A log-odds retrieval model................... 62 4.3.2 A linear interpolation retrieval model............. 64 4.3.3 A Bayesian retrieval model.................. 66 4.4 Experimental setup........................... 68 4.4.1 Experiments........................... 68 4.4.2 Dataset............................. 70 4.4.3 Ground truth and metrics................... 70 4.5 Results and analysis.......................... 71 4.6 Discussion................................ 75 4.6.1 Hypergeometric vs. multinomial............... 75 4.6.2 Mixture ratios, term weighting, and smoothing parameters........................... 76 4.6.3 Ad hoc retrieval......................... 78 4.7 Conclusions and outlook........................ 79 II. Predicting Behavior 83 5 Podcast Preference 87 5.1 Introduction............................... 88 5.2 The PodCred framework........................ 90 5.2.1 Motivation for the framework................. 91 5.2.2 Presentation of the framework................ 93 5.2.3 Derivation of the framework.................. 96 5.3 Validating the PodCred framework.................. 103 5.3.1 Feature engineering for predicting podcast preference... 103 5.3.2 Experimental setup....................... 108 5.3.3 Results on predicting podcast preference........... 110 5.4 Real-world application of the PodCred framework.......... 120 5.5 Conclusions and outlook........................ 122 6 Commenting Behavior on Online News 125 6.1 Introduction............................... 126 6.2 Exploring news comments....................... 128 6.2.1 News comments patterns................... 128 6.2.2 Temporal variation....................... 130 6.3 Modeling news comments....................... 133 6.4 Predicting comment volume prior to publication........... 136 6.4.1 Feature engineering...................... 136 vii CONTENTS 6.4.2 Results and discussion..................... 138 6.5 Predicting comment volume after publication............ 146 6.5.1 Correlation of comment volume at early and late time... 147 6.5.2 Prediction model........................ 147 6.5.3 Results and discussion..................... 149 6.6 Conclusions............................... 151 7 Context Discovery in Online News 153 7.1 Introduction............................... 153 7.2 Problem definition........................... 157 7.3 Approach................................ 158 7.4 Modeling................................ 160 7.4.1 Article models......................... 160 7.4.2 Article intent models...................... 162 7.4.3 Query models.......................... 163 7.4.4 Weighting schemes....................... 165 7.5 Retrieval................................. 166 7.6 Experimental setup........................... 168 7.6.1 Dataset............................. 168 7.6.2 Evaluation............................ 169 7.7 Results and analysis.......................... 170 7.8 Discussion................................ 172 7.9 Conclusions and outlook........................ 174 8 Conclusions 179 8.1 Main findings.............................. 179 8.2 Future directions............................ 184 Bibliography 187 Samenvatting 205 1 Introduction Since the advent of blogs in the early 2000s, social media has been gaining tremendous momentum. Blogs have rapidly evolved into a plethora of social media platforms that enabled people to connect, share and discuss anything and everything, from personal experiences, to ideas, facts, events, music, videos, movies, and the list goes on forevermore. The technological advances in portable devices facilitated this process even further. People no longer need to sit behind a computer to access the online world, but they can be connected from their mobile phones from anywhere, anytime. This new phenomenon has started transforming the way we communicate with our peers. We have started digitizing our real- world lives by recording our daily activities in social media. From a research perspective, social media can be seen as a microscope to study and understand human online behavior. In this thesis, we will use social media as our microscope for understanding the content that is published therein, and for understanding and predicting human behavior. Social media has facilitated easy communication and broadcasting of infor- mation, most typically, via commenting, sharing and publishing (Pew Research Project for Excellence in Journalism, 2012). Social media has rapidly become a strong channel for news propagation (Kwak et al., 2010). Effortless publishing in social media has pushed people beyond “just” sharing and commenting the news they read to instantly report what is happening around them; think of local events that break out,