Inferring Interestingness in Online Social Networks

Inferring Interestingness in Online Social Networks A thesis submitted in partial fulfilment of the requirement for the degree of Doctor of Philosophy William M. Webberley 2014 Cardiff University School of Computer Science & Informatics i Declaration This work has not previously been accepted in substance for any degree and is not concurrently submitted in candidature for any degree. Signed . (candidate) Date . Statement 1 This thesis is being submitted in partial fulfilment of the requirements for the degree of PhD. Signed . (candidate) Date . Statement 2 This thesis is the result of my own independent work/investigation, except where otherwise stated. Other sources are acknowledged by explicit references. Signed . (candidate) Date . Statement 3 I hereby give consent for my thesis, if accepted, to be available for photocopying and for inter-library loan, and for the title and summary to be made available to outside organisations. Signed . (candidate) Date . Dedication ii To my family and friends; To George, Pete, Jack, Max, Ella, Tom, and Charlie. This is for their support. iii Acknowledgements This thesis and the research it contains could not exist if not for the unfailing support of my supervisors, Roger Whitaker and, particularly, Stuart Allen. Their guidance and input have provided drive, shaped my research and have given me the confidence required for producing and defending ideas both within and outside of research. My utmost gratitude is extended to them for this and for their continual encouragement at all stages of my PhD, always with fresh perspective and clarification for research pathways and for keeping me on track. I feel confident in saying that I was very lucky in being in their supervision and I would not be the same person, professionally and otherwise, today if it wasn’t for their combined inputs over the past few years. I’d also like to thank Martin Chorley for his continuously helpful and enthusiastic crowdsourcing expertise, without which the research in this thesis could not have been validated in the way it has been. Matt Williams has provided insight throughout my time as a PhD student, from long sessions discussing the tiniest irrelevant details to recommendations on research direction and scope. Discussions with him have helped form many of my research ideas. I am also deeply grateful to Chris Gwilliams for his helpful input over course of my PhD, including information on where one might find the most cost-effective M.O.T. deals through to how best to structure complex and fiddly data queries. Thanks also to Matt John, who, ever critical, helped proof-read this thesis and provided invaluable constructive feedback on notation and nomenclature. My family and close friends have always pushed me and provided encouragement to Acknowledgements iv make me who I am today. If not for them, I would not have had the confidence to carry out and defend my research. They deserve my utmost thanks. Finally, my time as a research student would not have been the same without the con- stant presence and support from the School of Computer Science & Informatics, the other members of the MobiSoc group, the ‘superteam’, and non-academic friends. In addition to those named previously, Gualtiero Colombo, Ian Cooper, Jon Quinn, Diego Pizzocaro, Rich Coombs, Liam Turner, Nick Sharp, and Ross Taylor have made the time spent researching my PhD both interesting and enjoyable and I am extremely grateful to them. v Abstract Information sharing and user-generated content on the Internet has given rise to the increased presence of uninteresting and ‘noisy’ information in media streams on many online social networks. Although there is a lot of ‘interesting’ information also shared amongst users, the noise increases the cognitive burden in terms of the users’ abilit- ies to identify what is interesting and may increase the chance of missing content that is useful or important. Additionally, users on such platforms are generally limited to receiving information only from those that they are directly linked to on the social graph, meaning that users exist within distinct content ‘bubbles’, further limiting the chance of receiving interesting and relevant information from outside of the immediate social circle. In this thesis, Twitter is used as a platform for researching methods for deriving “interestingness” through popularity as given by the mechanism of retweeting, which allows information to be propagated further between users on Twitter’s social graph. Retweet behaviours are studied, and features; such as those surrounding Tweet audience, information redundancy, and propagation depth through path-length, are uncovered to help relate retweet action to the underlying social graph and the com- munities it represents. This culminates in research into a methodology for assigning scores to Tweets based on their ‘quality’, which is validated and shown to perform well in various situations. vi Contents Acknowledgements iii Abstract v Contents vi List of Publications xi List of Figures xii List of Tables xvi List of Acronyms xviii Glossary xix 1 Introduction 1 1.1 Twitter as a Social Network . .2 1.2 The Social Graph and Information Flow . .4 1.3 The Problem . .5 Contents vii 1.4 Thesis Structure . .7 2 Background and Research Domain 9 2.1 Key Concepts . .9 2.2 Domain Introduction and Literature Survey . 12 2.2.1 Information Propagation through Retweeting . 12 2.2.2 Retweets and the Social Graph . 16 2.2.3 User Influence . 19 2.2.4 Twitter as an Information-Retrieval System . 21 2.2.5 Interesting and ‘Interestingness’ . 23 2.2.6 Twitter is a ‘Memepool’ . 32 2.2.7 Precision and Recall . 34 2.3 Collecting Twitter Data . 38 2.4 Motivation . 39 3 Understanding The Behaviour of Retweeting in Twitter 41 3.1 Tweet and Retweet Properties . 43 3.1.1 Retweet Groups . 43 3.1.2 Retweet Trees . 46 3.1.3 Path-Length . 47 3.2 Twitter Propagation Analysis . 49 3.3 Retweet and Retweet Group Analysis . 49 3.3.1 Data Collection Methodology . 50 Contents viii 3.3.2 Exploring Retweet Group Path-Lengths . 52 3.3.3 Size of Retweet Groups . 54 3.3.4 A Tweet’s Audience - How Many Users Can be Reached? . 55 3.3.5 Retweet Groups on the Social Graph . 60 3.3.6 The Temporal Properties of Retweets . 66 3.4 Summary . 68 3.5 Taking the Investigative Research Further . 69 4 Analysis of Twitter’s Social Structure 71 4.1 Propagation Patterns Exhibited by Different Graph Structures . 73 4.1.1 Overview of the Simulation Algorithm . 74 4.1.2 Generating a User’s Retweet Probability . 77 4.1.3 Summary of Training Features . 79 4.1.4 Training the Model . 80 4.1.5 Running the Simulations . 81 4.1.6 Network Analyses . 82 4.1.7 General Comparison of Propagation Characteristics across Dif- ferent Graph Structures . 88 4.2 Using the Social Graph as a Method for Inferring Interestingness . 90 4.2.1 Data Collection . 93 4.2.2 Validating the Accuracy of Inference Results . 95 4.2.3 Improving The Interestingness Inference Performance . 99 4.3 Chapter Summary . 100 Contents ix 4.3.1 Network Structure Analysis . 101 4.3.2 Interestingness Inference Methodology . 101 5 Inferring Interestingness 103 5.1 Interestingness through Tweet Scoring . 105 5.2 Further Adaptations of the Inference Methodology . 107 5.3 Collecting the Training and Testing Data . 110 5.4 Retweet Counts as Nominal Attributes . 112 5.5 Predicting Estimated Retweet Counts . 119 5.5.1 The Classifier . 119 5.5.2 Classification Performance . 121 5.5.3 Effects of varying the Cardinality of Nominal Retweet Counts 124 5.6 Training and Testing Against the Classifier . 125 5.6.1 Data Corpora . 126 5.6.2 Features . 126 5.7 Initial Validations of the Scoring Methodologies . 129 5.7.1 Planning the Validations . 129 5.7.2 Carrying Out the Validations . 130 5.7.3 Outcomes From the Validations . 132 5.7.4 Methodology and Validation Remarks . 138 5.8 Addressing Individual Information Relevance . 139 5.8.1 Methodology . 139 Contents x 5.8.2 Assigning Scores to the Assessed Tweets . 141 5.8.3 Results from the Further Validations . 143 5.9 Chapter Summary . 149 5.9.1 Interestingness Scores . 150 5.9.2 Methodology Validations . 150 5.9.3 Improvements and Qualities . 151 6 Assessment and Conclusions 152 6.1 Analysis of Research and Results . 152 6.1.1 Retweeting & the Twitter Structure . 152 6.1.2 Interestingness Scores . 154 6.1.3 Validation . 155 6.1.4 Methodology Evaluation . 156 6.1.5 Contributions . 159 6.2 Limitations . 161 6.3 Further and Future Work . 162 6.3.1 Building on the Social Structure . 162 6.3.2 Taking the Scoring Methodology Further . 164 6.4 Final Remarks . 166 Bibliography 168 xi List of Publications Some of the work produced towards this thesis has also been published separately as follows. • [58] - W. Webberley, S. M. Allen, R. M. Whitaker. Inferring the Interesting Tweets in Your Network, in Workshop on Analyzing Social Media for the Benefit of Society (SOCIETY 2.0), 3rd International Conference on Social Computing and its Applications (SCA), Karlsruhe, Germany. IEEE 2013 • [57] - W. Webberley, S. Allen, R. Whitaker. Retweeting: A Study of Message- Forwarding in Twitter, in Workshop on Mobile and Online Social Networks (MOSN’11), 5th International Conference on Network and System Security (NSS), Milan, Italy. IEEE 2011 xii List of Figures 1.1 Twitter’s 2006 homepage (from http://web.archive.org) compared to its 2014 homepage . .3 2.1 Examples of user and home timelines. 10 2.2 A notifications page. 11 2.3 Examples of friends and followers lists. 12 2.4 A retweeted Tweet.

Load more