Sampling, Engagement, and Network Effects
Total Page:16
File Type:pdf, Size:1020Kb
Measuring Collective Attention in Online Content: Sampling, Engagement, and Network Effects Siqi Wu A thesis submitted for the degree of Doctor of Philosophy at The Australian National University March 2021 c 2021 by Siqi Wu All Rights Reserved Except where otherwise indicated, this thesis is my own original work. Siqi Wu 9 March 2021 Acknowledgments I would like to express my sincere gratitude to the people who have supported me in this Ph.D grind: • Prof. Lexing Xie. I am extremely honored to have Lexing to be my advisor. Her extensive knowledge in various fields and strong dedication to research motivate me to become a good researcher. • Dr. Marian-Andrei Rizoiu. Andrei is one of the smartest people that I know. He provided many insightful ideas when I was stuck. His research also sparked my interests in modeling online popularity. • Dr. Cheng Soon Ong. Cheng is my “think tank”. He never turned me down when I felt discouraged or desperately looked for advices, both in research and in life. • Members of the ANU Computational Media Lab – Swapnil Mishra, Quyu Kong, Alexander Mathews, Dawei Chen, Minjeong Shin, Dongwoo Kim, Jooyoung Lee, Rui Zhang, Alasdair Tran, Umanga Bista, Yuli Liu, Qiongkai Xu, and many others. I am grateful that I can work with a group of supportive talents. Much of my work is benefited from the discussions with them. • External collaborators – Yu-Ru Lin, Ali Mert Ertugrul, and Xian Teng from Univer- sity of Pittsburgh, Paul Resnick and James Park from Univeristy of Michigan, and Lu Cheng from Arizona State University. It has been a great pleasure to work with them. Our collaborations also lead to fruitful research outcomes. • Data61, CSIRO. I would like to thank Data61 for providing my scholarship. It has also provided me a strong network in both academia and industry. • ANU Research School of Computer Science. I would like to thank the admin and HDR team in CECS for creating a friendly working environment for us. • National eResearch Collaboration Tools and Resources (Nectar). I would like to thank Nectar for providing computational resources that facilitate my research. This research is supported in part by Asian Office of Aerospace Research and De- velopment Grant 19IOA078, Air Force Research Laboratory Grant FA2386-15-1-4018, Australian Research Council Project DP180101985, and a Google PhD Fellowship. 2 Finally, and most importantly, I would like to express deep thanks to my wife Rongqin Tan, my parents Suying Deng and Sanji Wu, and my extended family, who have all given their unreserved supports to me in so many years. May my grandpar- ents rest in peace. I love you and always will. Abstract The production and consumption of online content have been increasing rapidly, whereas human attention is a scarce resource. Understanding how the content cap- tures collective attention has become a challenge of growing importance. In this thesis, we tackle this challenge from three fronts – quantifying sampling effects of social media data; measuring engagement behaviors towards online content; and estimating network effects induced by the recommender systems. Data sampling is a fundamental problem. To obtain a list of items, one common method is sampling based on the item prevalence in social media streams. However, social data is often noisy and incomplete, which may affect the subsequent observa- tions. For each item, user behaviors can be conceptualized as two steps – the first step is relevant to the content appeal, measured by the number of clicks; the second step is relevant to the content quality, measured by the post-clicking metrics, e.g., dwell time, likes, or comments. We categorize online attention (behaviors) into two classes: popularity (clicking) and engagement (watching, liking, or commenting). Moreover, modern platforms use recommender systems to present the users with a tailoring content display for maximizing satisfaction. The recommendation alters the appeal of an item by changing its ranking, and consequently impacts its popularity. Our research is enabled by the data available from the largest video hosting site YouTube. We use YouTube URLs shared on Twitter as a sampling protocol to obtain a collection of videos, and we track their prevalence from 2015 to 2019. This method creates a longitudinal dataset consisting of more than 5 billion tweets. Albeit the volume is substantial, we find Twitter still subsamples the data. Our dataset covers about 80% of all tweets with YouTube URLs. We present a comprehensive measure- ment study of the Twitter sampling effects across different timescales and different subjects. We find that the volume of missing tweets can be estimated by Twitter rate limit messages, true entity ranking can be inferred based on sampled observations, and sampling compromises the quality of network and diffusion models. Next, we present the first large-scale measurement study of how users collec- tively engage with YouTube videos. We study the time and percentage of each video being watched. We propose a duration-calibrated metric, called relative engagement, which is correlated with recognized notion of content quality, stable over time, and predictable even before a video’s upload. 4 5 Lastly, we examine the network effects induced by the YouTube recommender system. We construct the recommendation network for 60,740 music videos from 4,435 professional artists. An edge indicates that the target video is recommended on the webpage of source video. We discover the popularity bias – videos are dis- proportionately recommended towards more popular videos. We use the bow-tie structure to characterize the network and find that the largest strongly connected component consists of 23.1% of videos while occupying 82.6% of attention. We also build models to estimate the latent influence between videos and artists. By taking into account the network structure, we can predict video popularity 9.7% better than other baselines. Altogether, we explore the collective consuming patterns of human attention to- wards online content. Methods and findings from this thesis can be used by content producers, hosting sites, and online users alike to improve content production, adver- tising strategies, and recommender systems. We expect our new metrics, methods, and observations can generalize to other multimedia platforms such as the music streaming service Spotify. Publications, Software, and Data The majority of the thesis has been published in peer-reviewed conference proceed- ings. Software and data developed as a part of this thesis are provided for reproduc- ing experiment results and facilitating future work. Publications • Minjeong Shin*, Alasdair Tran*, Siqi Wu*, Alexander Mathews, Rong Wang, Geor- giana Lyall, and Lexing Xie. “AttentionFlow: Visualising Dynamic Influence in Ego Networks.” ACM International Conference on Web Search and Data Mining (WSDM), 2021. (Demo j Chapter 5) • Lu Cheng, Kai Shu, Siqi Wu, Yasin N. Silva, Deborah L. Hall, and Huan Liu. “Un- supervised Cyberbullying Detection via Time-Informed Gaussian Mixture Model.” ACM International Conference on Information and Knowledge Management (CIKM), 2020. (Full paper, acceptance rate: 21%) • Siqi Wu, Marian-Andrei Rizoiu, and Lexing Xie. “Variation across Scales: Mea- surement Fidelity under Twitter Data Sampling.” AAAI International Conference on Weblogs and Social Media (ICWSM), 2020. (Full paper, acceptance rate: 17% j Chapter 3) • Siqi Wu, Marian-Andrei Rizoiu, and Lexing Xie. “Estimating Attention Flow in Online Video Networks.” ACM International Conference on Computer-Supported Co- operative Work and Social Computing (CSCW), 2019. (Best paper honorable mention award, acceptance rate: 31% j Chapter 5) • Siqi Wu. “How is Attention Allocated? Data-driven Studies of Popularity and Engagement in Online Videos.” ACM International Conference on Web Search and Data Mining (WSDM), 2019. (Doctoral consortium) • Siqi Wu, Marian-Andrei Rizoiu, and Lexing Xie. “Beyond Views: Measuring and Predicting Engagement in Online Videos.” AAAI International Conference on Weblogs and Social Media (ICWSM), 2018. (Full paper, acceptance rate: 16% j Chapter 4) 6 7 • Quyu Kong, Marian-Andrei Rizoiu, Siqi Wu, and Lexing Xie. “Will This Video Go Viral? Explaining and Predicting the Popularity of YouTube Videos.” International Conference on World Wide Web Companion (WWW), 2018. (Demo j Chapter 4) Preprints • Ali Mert Ertugrul*, Siqi Wu*, Jooyoung Lee*, Lexing Xie, and Yu-Ru Lin. “Linking Collective Attention Across Platforms: Do More Tweets Beget More Video Views?” Under revision. • Siqi Wu and Paul Resnick. “Cross-Partisan Discussions on YouTube: Conserva- tives Talk to Liberals but Liberals Don’t Talk to Conservatives.” Under review. Software • Twitter-intact-stream (Chapter 3): a Python package to reconstruct the complete Twitter filtered stream. https://github.com/avalanchesiqi/twitter-intact-stream • YouTube-insight (Chapter 4): a Python package to collect metadata and historical data for YouTube videos. https://github.com/avalanchesiqi/youtube-insight • HIPie (Chapter 4): a web interface to explain and predict the popularity of YouTube videos. http://www.hipie.ml/ • AttentionFlow (Chapter 5): a web interface to visualize a collection of time series and the dynamic network influence among them. http://www.attentionflow.ml/ Data • Complete/Sampled Retweet Cascades Datasets (Chapter 3) include 2 sets of complete/sampled retweet cascades on the topics of cyberbullying (sampling rate: 52.72%, 3M complete cascades, 1.17M sampled cascades) and YouTube