Study of Comment Dynamics at News Outlet
STUDY OF COMMENT DYNAMICS AT NEWS OUTLET

A Dissertation Submitted to the Temple University Graduate Board in Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY

by Lihong He
May, 2021

Examining Committee Members:
Eduard Dragut, Advisory Chair, Dept. of Computer and Information Sciences
Zoran Obradovic, Dept. of Computer and Information Sciences
Slobodan Vucetic, Dept. of Computer and Information Sciences
Weiyi Meng, External Member, Dept. of Computer Science, Binghamton University

© Copyright 2021 by Lihong He
All Rights Reserved

ABSTRACT

Many news outlets allow users to contribute comments on topics about daily world events. News articles are the seeds that spark users' interest in contributing content, i.e., comments. An article may attract apathetic user engagement (several tens of comments) or spontaneous, fervent user engagement (thousands of comments). This environment creates a social dynamic that is little studied. The social dynamics around articles have the potential to reveal interesting facets of the user population at a news outlet. We report some salient findings about these social media based on data collected from 17 news outlets. Analysis of the data reveals interesting insights, for example, that the relationship between news outlets and their user populations is uneven across outlets. To our knowledge, these and other observations have not been reported before. We also study the problem of predicting the total number of user comments a news article will receive. Our main insight is that the early dynamics of user comments contribute the most to an accurate prediction, while article-specific factors have surprisingly little influence. We show that the early arrival rate of comments is the best indicator of the eventual number of comments. We conduct an in-depth analysis of this feature across several dimensions, such as news outlets and news article categories.
Online comments submitted by readers of news articles can provide valuable feedback and critique, personal views and perspectives, and opportunities for discussion. Previously, we manually collected user comments from 17 news outlets for data analysis. However, this kind of manual solution is very limited and does not work for a variety of websites. Therefore, we need an automatic solution that works for thousands of news outlets. We find that most news websites employ third-party commenting systems to create and manage their comment sections. One can crawl the comments by sending URL requests to the servers of those commenting systems. We propose an approach for crawling user comments by instantiating URL templates supported by these web servers. We propose a hybrid framework that combines the advantages of labeling functions, in the form of regular expressions, and deep learning. We conduct extensive experiments with thousands of web pages to show that we can crawl comments from many websites with our approach.

ACKNOWLEDGMENTS

I would like to thank the many people who supported and helped me in my research, study, and life during these years of Ph.D. study. I give my deepest thanks to my advisor, Dr. Eduard Dragut. Dr. Dragut has deep insight into many fields. He guided me in how to do research and how to discover new ideas. He always encouraged me to think more about possible improvements and extensions of existing research, which brought me a new way of thinking. I would like to thank my committee members, Dr. Zoran Obradovic, Dr. Slobodan Vucetic, and Dr. Weiyi Meng, for their valuable time and advice on my dissertation. Dr. Slobodan Vucetic always encouraged me and gave me advice whenever I encountered a bottleneck in my research. I believe I have the best committee members in the university and would like to thank them so much. I would like to thank Dr. Yuhong Guo, Dr. Bo Ji, Dr. Kai Zhang, and Dr. Jim Korsh for their great courses and suggestions on research.
I would like to thank my lab-mates and many other students for their advice and help during these years. Finally, I thank my husband, my son, and my parents very much for their love and support all the time.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGMENTS
LIST OF FIGURES
LIST OF TABLES

CHAPTER

1. INTRODUCTION
  1.1 Comparison between Online News Outlets and Large Social Media
  1.2 Commenting Activity at Online News Outlets
  1.3 Prediction of Comment Volume
  1.4 Comments Collection at Scale

2. ANALYSIS OF USER ENGAGEMENT AT ONLINE NEWS OUTLETS
  2.1 Introduction
  2.2 Related Work
  2.3 Data
    2.3.1 Data Collection
    2.3.2 Ecosystem
    2.3.3 Data Summary
    2.3.4 A Bird's-Eye View of Users Activity
  2.4 Dynamics of the Ecosystem
    2.4.1 A Case Study
    2.4.2 Interplay between Outlets and their Users
  2.5 Dynamics of Ecosystem via User Commenting Activity
    2.5.1 User Reaction
    2.5.2 Duration of User Comments
    2.5.3 Posthumous News Story Engagement
    2.5.4 User Activity by Time
    2.5.5 User Stickiness
    2.5.6 News Stories Dual Popularity
  2.6 Dynamics of Ecosystem via User Interest
    2.6.1 Breadth of User Interest
    2.6.2 Diversity of User Interest at News Outlets
  2.7 Conclusions

3. PREDICTION OF COMMENT VOLUME FOR A NEWS ARTICLE
  3.1 Introduction
  3.2 Related Work
  3.3 Methodology
  3.4 Data
  3.5 Predicting Comment Volume
    3.5.1 Features
    3.5.2 Experimental Setup
    3.5.3 Experimental Results
    3.5.4 Dominant Feature Discovery
  3.6 Rate Analysis
    3.6.1 Study of Rate across Outlets
    3.6.2 Study of Rate across Categories
    3.6.3 Interplay between Outlets and Categories
  3.7 Conclusion

4. RETRIEVING USER COMMENTS FROM THE WEB
  4.1 Introduction
  4.2 Challenges in Comment Retrieval
  4.3 Related Work
  4.4 Proposed Framework
    4.4.1 Solution Overview
    4.4.2 Deep Model in Pre-training
    4.4.3 Fine-tuning
  4.5 Data Preparation for Training
    4.5.1 Data Preprocessing
    4.5.2 Weakly Labeled Data Generation
  4.6 Experimental Design
    4.6.1 Datasets
    4.6.2 Evaluation Metrics
    4.6.3 Models in Comparison
    4.6.4 Hyperparameter Settings for Deep Model
  4.7 Experimental Results
    4.7.1 Extraction of Parameter Value
    4.7.2 Study of Generated URLs
  4.8 Comment Retrieval at Scale
    4.8.1 Data
    4.8.2 Experiment Results of Comment Retrieval
    4.8.3 Going Back in Time in Comment Retrieval
  4.9 Conclusions

BIBLIOGRAPHY

LIST OF FIGURES

Figure
1.1 Discourse in comment section.
1.2 URL templates and examples. The parameter values in URL are from HTML documents.
2.1 The hierarchical diagram of the social media ecosystem at a news outlet.
2.2 Comment volume from user population. Users are ranked based on the number of comments they post.
2.3 Proportion of each Outlet for Top 10 Distinct Stories Based on Total Article Volume
2.4 The modeling of interplay between outlets and users.
2.5 Interplay between news outlets and users.
2.6 Distribution of user reaction at Fox News. User reaction is defined as the time difference between the first comment and the article publication time.
2.7 Distribution of the duration of user commenting activity at each outlet. The duration is defined as the time difference between the last and first comment in an article.
2.8 Users Engagement
2.9 The volume of user comments by time at each outlet. WSP: Washington Post, DM: Daily Mail, WSJ: Wall Street Journal, Gd: The Guardian, NYT: New York Times, MW: Market Watch
2.10 Comment volume per hour at Daily Mail, focusing on the users from United Kingdom. The time of comment is displayed in the user's time zone.
2.11 User Stickiness
2.12 Cumulative Comment Volume Over Time
2.13 User Engagement vs Story Popularity
2.14 Cumulative Comment Growth Over Time For Story
2.15 Cumulative plot of averaged breadths of user interest (# distinct stories). Users are ranked by their breadths of interest from large to small.
2.16 Quantifying breadth and heterogeneity of user populations at news outlets in news stories.
3.1 Methodology illustration.
3.2 The distribution of comment volume and logarithmic volume per news outlet. In the first column of graphs, the articles with more than 1,000/2,000 comments are discarded to make the graphs visible. For each outlet, article frequency on the y-axis of the first two graphs is the number of articles; the third graph provides the Q-Q plot of the logarithmic volume.
3.3 Comparison of regression lines.
3.4 The plot of the prediction model. If we plot the points based on the value of rate and logarithmic volume, points are too dense around the origin. Therefore, we draw the graph in the log scale for rate and volume.
4.1 Example of comment section which pops up after clicking the comment button on the web page of article.
4.2 URL templates and examples. The parameter values in URL are from HTML documents.