Characterising the Social Media Temporal Response to External Events
Total Page:16
File Type:pdf, Size:1020Kb
Characterising the Social Media Temporal Response to External Events Peter Mathews A thesis submitted for the degree of Doctor of Philosophy School of Mathematical Sciences The University of Adelaide April 2019 ii Abstract In recent years social media has become a crucial component of online information propagation. It is one of the fastest responding mediums to offline events, signifi- cantly faster than traditional news services. Popular social media posts can spread rapidly through the internet, potentially spreading misinformation and affecting hu- man beliefs and behaviour. The nature of how social media responds allows inference about events themselves and provides insight into human behavioural characteristics. However, despite its importance, researchers don’t have a strong understanding of the temporal dynamics of this information flow. This thesis aims to improve understanding of the temporal relationship between events, news and associated social media activity. We do this by examining the tem- poral Twitter response to stimuli for various case studies, primarily based around politics and sporting events. The first part of the thesis focuses on the relationships between Twitter and news media. Using Granger causality, we provide evidence that the social media reaction to events is faster than the traditional news reaction. We also consider how accurately tweet and news volumes can be predicted, given other variables. The second part of the thesis examines information cascades. We show that the decay of retweet rates is well-modelled as a power law with exponential cutoff, providing a better model than the widely used power law. This finding, explained using human prioritisation of tasks, then allows the development of a method to es- timate the size of a retweet cascade. The third major part of the thesis concerns tweet clustering methods in response to events. We examine how the likelihood that two tweets are related varies, given the time difference between them, and use this finding to create a clustering method using both textual and temporal information. We also develop a method to estimate the time of the event that caused the corresponding social media reaction. iii iv Declaration I certify that this work contains no material that has been accepted for the award of any other degree or diploma in my name, in any university or other tertiary insti- tution and, to the best of my knowledge and belief, contains no material previously published or written by another person, except where due reference has been made in the text. In addition, I certify that no part of this work will, in the future, be used in a submission in my name, for any other degree or diploma in any university or other tertiary institution without the prior approval of the University of Adelaide and where applicable, any partner institution responsible for the joint-award of this degree. I give consent to this copy of my thesis, when deposited in the University Library, being made available for loan and photocopying, subject to the provisions of the Copyright Act 1968. I also give permission for the digital version of my thesis to be made available on the web, via the University’s digital research repository, the Library Search and also through web search engines, unless permission has been granted by the University to restrict access for a period of time. I acknowledge the support I have received for my research through the provision of an Australian Government Research Training Program Scholarship. v vi Acknowledgements Thanks to everyone who participated in my PhD journey. First and foremost, thanks to my supervisors Professor Nigel Bean, Dr Lewis Mitchell and Dr Giang Nguyen for their guidance throughout the PhD. Thanks also to my colleagues for the productive academic discussions and enjoyable times throughout my PhD, particularly Yao Li, Dong Gong, Bohan Zhuang, Mingkui Tan, Qinfeng Shi, Jing Liu, Peng Wang, Lingqiao Liu, Xiusen Wei, Shuang Li, Adrian Johnston, Lachlan Birdsey, Dustin Craggs, Luke Keating-Hughes, Brett Chenoweth, Max Glonek, James Walker, Maha Mansor, Caitlin Gray, Angus Lewis and Dennis Liu. Thanks to the Data to Decisions Co-operative Research Centre (D2DCRC) and the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS) for providing financial support during my PhD. Thanks to all my maths and computer science teachers along the path to my PhD, particularly my high school maths teacher Anthony Harradine. Finally, thanks to my family for their support. vii viii Publications The following peer-reviewed conference publications contain preliminary reports of the findings in this thesis: Peter Mathews, Lewis Mitchell, Giang Nguyen, and Nigel Bean. The nature and origin of heavy tails in retweet activity. In The 26th International Conference on World Wide Web Companion, pages 1493-1498, 2017. Peter Mathews, Caitlin Gray, Lewis Mitchell, Giang Nguyen, and Nigel Bean. SMERC: Social media event response clustering using textual and temporal information. In The 2018 IEEE International Conference on Big Data, pages 3695-3700, 2018. ix x Contents 1 Introduction 1 1.1 Research goals, scope and limitations . 2 1.2 Twitter as a data source . 2 1.3 Literature gap . 3 1.4 Nature of social media analysis . 4 1.4.1 Specific challenges . 4 1.4.2 Limitations to conclusions from social media research . 5 1.5 Ethical considerations . 6 1.6 Research overview and thesis structure . 6 1.7 Key contributions to new knowledge . 9 2 Literature Review and Background 11 2.1 The relationship between tweets and news . 11 2.2 The distribution of retweet times . 13 2.2.1 Information diffusion on social media . 13 2.2.2 Relaxation response of a social system . 14 2.2.3 Statistical test for power law . 16 2.3 Simulating retweet activity and cascades . 17 2.3.1 Causes of power laws in complex systems . 17 2.3.2 Decay of user interest in topics . 19 2.3.3 User influence . 20 2.3.4 Retweet cascade size estimation . 21 2.4 Automated microblog summarisation and event detection . 23 2.4.1 Social sensing and microblog summarisation . 23 2.4.2 Event detection . 24 2.4.3 Similarity and distance measures . 26 2.4.4 Clustering methods . 27 2.4.5 Natural language processing . 28 2.4.6 Event time estimation . 30 xi xii Contents 2.4.6.1 Causes of log-normal distribution . 30 2.4.6.2 Causes of Weibull distribution . 31 2.5 Diurnal cycles and adjustment . 31 2.6 Definitions, tools and techniques . 32 2.6.1 Decay functions . 32 2.6.2 Methods to measure error . 33 2.6.3 One-hot encoded variables . 34 2.6.4 Granger causality . 35 2.6.5 Linear Regression . 36 2.6.6 Neural networks . 36 2.6.7 k-fold cross-validation . 38 2.6.8 Savitzky-Golay filter . 39 2.6.9 Testing goodness of fit . 39 2.6.10 Model selection criteria . 40 2.6.11 Bootstrapping . 40 3 The Temporal Relationship Between Tweets and News 43 3.1 Introduction . 43 3.2 Data collection methodology . 45 3.2.1 US Republican nomination data collection . 46 3.2.2 Australian election data collection . 46 3.3 Daily tweet activity analysis . 47 3.3.1 Diurnal cycle . 49 3.3.2 Correlation between tweets and news . 56 3.3.3 Automated event detection through diurnal adjustment . 57 3.4 Granger causality . 58 3.5 Modelling and prediction of tweet and news activity . 59 3.6 Discussion and conclusions . 62 4 The Temporal Distribution of Retweets 65 4.1 Introduction . 65 4.2 Analysis of example seed tweets . 67 4.2.1 Data collection and processing methodology . 67 4.2.2 First three hours after initial tweet . 68 4.2.3 First 24 hours . 72 4.2.4 Longer time durations . 74 Contents xiii 4.2.5 Diurnal effects and adjustment . 78 4.2.6 Additional retweet datasets . 78 4.3 Large scale data analysis . 81 4.3.1 Large scale data collection . 81 4.3.2 Fitting parameters to a power law by maximum likelihood es- timation . 82 4.3.3 Clauset’s test for power law distribution . 84 4.3.4 Improvement of fit for power law with exponential cutoff . 86 4.3.5 Power law parameter by topic . 88 4.4 Discussion and Conclusions . 90 5 Simulating Retweet Activity and Cascade Size Estimation 91 5.1 Introduction . 91 5.2 A model for the distribution of retweet times . 93 5.3 Simulation of retweet activity . 94 5.3.1 Priority-based tasking . 94 5.3.2 Generative model for retweet times . 95 5.4 Estimation of retweet cascade size . 98 5.4.1 Nature of retweet activity for news stories . 98 5.4.2 Boundedness . 99 5.4.3 Cascade size estimation method . 100 5.4.4 Simulating retweet rates from a single example tweet . 101 5.4.5 Experimental results on larger dataset . 102 5.4.6 Testing on public cascade datasets . 105 5.5 Discussion and conclusions . 107 6 Event Detection and Time Estimation from Twitter 109 6.1 Introduction . 109 6.2 Data collection methodology . 111 6.3 Clustering using textual and temporal information . 112 6.3.1 Relative probability of tweets being related over time interval . 112 6.3.2 Social Media Event Response Clustering (SMERC) . 116 6.3.3 Experiments and results . 120 6.3.3.1 Performance metrics . 121 6.3.3.2 Example clusters . 121 6.4 Event time estimation . 123 xiv Contents 6.4.1 Data collection and manual labelling . 123 6.4.2 Example tweets . 124 6.4.3 Event time response distribution . 126 6.4.4 Estimating event times on a larger dataset . 129 6.4.5 Bootstrapping to estimate error in measurement .