Detecting Personal Life Events from Social Media
Total Page:16
File Type:pdf, Size:1020Kb
Open Research Online The Open University’s repository of research publications and other research outputs Detecting Personal Life Events from Social Media Thesis How to cite: Dickinson, Thomas Kier (2019). Detecting Personal Life Events from Social Media. PhD thesis The Open University. For guidance on citations see FAQs. c 2018 The Author https://creativecommons.org/licenses/by-nc-nd/4.0/ Version: Version of Record Link(s) to article on publisher’s website: http://dx.doi.org/doi:10.21954/ou.ro.00010aa9 Copyright and Moral Rights for the articles on this site are retained by the individual authors and/or other copyright owners. For more information on Open Research Online’s data policy on reuse of materials please consult the policies page. oro.open.ac.uk Detecting Personal Life Events from Social Media a thesis presented by Thomas K. Dickinson to The Department of Science, Technology, Engineering and Mathematics in partial fulfilment of the requirements for the degree of Doctor of Philosophy in the subject of Computer Science The Open University Milton Keynes, England May 2019 Thesis advisor: Professor Harith Alani & Dr Paul Mulholland Thomas K. Dickinson Detecting Personal Life Events from Social Media Abstract Social media has become a dominating force over the past 15 years, with the rise of sites such as Facebook, Instagram, and Twitter. Some of us have been with these sites since the start, posting all about our personal lives and building up a digital identify of ourselves. But within this myriad of posts, what actually matters to us, and what do our digital identities tell people about ourselves? One way that we can start to filter through this data, is to build classifiers that can identify posts about our personal life events, allowing us to start to self reflect on what we share online. The advantages of this type of technology also have direct merits within marketing, al- lowing companies to target customers with better products. We also suggest that the tech- niques and methodologies built throughout this thesis also have opportunities to support research within other areas such as cyber bullying, and radicalisation detection. The aim of this thesis is to build upon the under researched area of life event detection, specifically targeting Twitter, and Instagram. Our goal is to develop classifiers that identify a list of life events inspired by cognitive psychology, where we target a total of seven within this thesis. To achieve this we look to answer three research questions covered in each of our empiri- cal chapters. In our first empirical chapter, we ask; What features would improve the classifi- cation of important life events. To answer this, we look at first extracting a new dataset from Twitter targeting the following events: Getting Married, Having Children, Starting School, Falling in Love, and Death of a Parent. We look at three new feature sets: interactions, con- tent, and semantic features, and compare against a current state of the art technique. In our second empirical chapter, we draw inspiration from cheminformatics, and fre- quent sub-graph mining to ask; Could the inclusion of semantic and syntactic patterns im- prove performance in our life event classifier. Here we look at expanding our tweets into semantic networks, as well as consider two forms of syntactic relationships between tokens. We then mine for frequent sub-graphs amongst our tweet graphs, and use these as features in our classifier. Our results produce F1 scores of between 0.65 and 0.77, providing an im- provement between 0.01 and 0.04 against the current state of the art. In our final empirical chapter, we look to answer our third research question; How can we detect important life events from other social media sites, such as Instagram?. We ask iii Thesis advisor: Professor Harith Alani & Dr Paul Mulholland Thomas K. Dickinson this question, as we believe Instagram to be a preferred environment to share personal life events. In this chapter, we extract a new dataset, targeting the following events: Getting Married, Having Children, Starting School, Graduation, and Buying a House. Our results find that our methodology provides F1 scores between 0.78, and 0.82, an improvement in F1 score between 0.01 and 0.04 against the current state of the art. iv Contents 1 Introduction 1 1.1 Motivation ................................. 2 1.2 Challenges and Gaps ............................ 4 1.3 Research Questions & Hypothesises .................... 6 1.4 Outline ................................... 8 1.5 Publications ................................. 10 2 Background & Related Work 12 2.1 Social Media Event Detection ........................ 13 2.2 Life Event Detection on Social Media .................... 18 2.3 Frequent Graph Mining ........................... 25 2.4 Classifier Algorithms ............................ 27 2.5 Conclusion ................................. 30 3 Methodology 31 3.1 Life Events .................................. 32 3.2 Machine Learning Framework ........................ 33 3.3 Evaluation .................................. 33 3.4 Hypothesis Testing & Baselines ....................... 34 3.5 Graph API ................................. 35 4 Detecting Personal Life Events on Twitter 42 4.1 Introduction ................................ 43 4.2 Personal Life Events ............................. 44 4.3 Dataset ................................... 45 4.4 Feature Sets ................................. 49 4.5 Feature Creation ............................... 55 4.6 Evaluation Setup .............................. 57 4.7 Results ................................... 60 4.8 Discussion .................................. 67 4.9 Limitations and Recommendations ..................... 72 4.10 Summary .................................. 73 v 5 Frequent Sub-Graph Mining of Syntactic and Semantic Graphs 74 5.1 Introduction ................................ 75 5.2 Graph Mining for Event Detection ..................... 77 5.3 Creating Feature Graphs ........................... 87 5.4 Frequent Graph Mining ........................... 89 5.5 Evaluation Setup .............................. 99 5.6 Results ................................... 103 5.7 Discussion .................................. 117 5.8 Limitations and Recommendations . 119 5.9 Summary .................................. 121 6 Detecting Personal Life Events on Instagram 122 6.1 Introduction ................................ 123 6.2 Instagram Dataset .............................. 125 6.3 Pipeline Enhancements . 136 6.4 Graph Feature Enhancements . 139 6.5 Interaction Features ............................. 148 6.6 Evaluation Setup .............................. 150 6.7 Results ................................... 154 6.8 Discussion .................................. 168 6.9 Limitations and Recommendations . 171 6.10 Summary .................................. 173 7 Discussion & Future Work 174 7.1 Discussion .................................. 175 7.2 Future Work ................................. 185 8 Conclusion 189 8.1 Interaction, Content, and Semantic Features . 190 8.2 Semantic and Syntactic Graphs . 191 8.3 Identifying Events on Instagram . 192 References 209 vi Listing of figures 3.1 Methodology Outline ............................ 32 3.2 Graph Generation Architecture ....................... 36 3.3 Web Client ................................. 40 4.1 Quiz Funnel Pass Rate ........................... 47 4.2 Dataset Distribution ............................ 50 4.3 Text Processing Pipeline ........................... 56 5.1 Token Graph: Leslie just wants to be married to a congress man . 79 5.2 Dependency Graph: My Father passed away . 81 5.3 Token Dependency Subgraph Example ................... 81 5.4 POS Dependency Sub-graph Example .................... 81 5.5 Token & POS Subgraph Example ...................... 82 5.6 WordNet Graph: Leslie just wants to be married to a congress man . 84 5.7 ConceptNet Graph: Leslie just wants to be married to a congress man . 86 5.8 Frequent Graph Mining Pipeline ...................... 87 5.9 WordNet Depth Effect on F1 ........................ 92 5.10 WordNet Minimum Support ........................ 94 5.11 ConceptNet Minimum Support ...................... 94 5.12 Syntactic Minimum Support ........................ 95 6.1 Sample Hashtag Distribution . 129 6.2 Example CrowdFlower Annotation . 131 6.3 Instagram Event Annotation Distribution for Personal Accounts . 131 6.4 Instagram Event Annotation Distribution for Adverts . 132 6.5 Random Annotation Distribution . 134 6.6 Instagram Graph Design . 136 6.7 WordNet Depth for Getting Married Against F1 Score . 140 6.8 WordNet Depth for Giving Birth Against F1 Score . 141 6.9 WordNet Depth for Graduation Against F1 Score . 141 6.10 WordNet Depth for Buying a House Against F1 Score . 142 6.11 WordNet Depth for Starting School Against F1 Score . 142 6.12 WordNet Depth Median Across Events Against F1 Score . 143 vii 6.13 Syntactic Graph Delta Thresholds . 146 6.14 ConceptNet Graph Delta Thresholds . 146 6.15 WordNet Graph Delta Thresholds . 147 viii For somehow putting up with me throwing away a lucrative career in London to pursue a PhD, this thesis is dedicated to Joanna, as well as our cats, Hazel and Willow. ix Acknowledgments Firstly I would like to thank my main supervisor, Professor Harith Alani. His advice has been invaluable throughout the course of my PhD, and has helped hone both my written, and critical thinking skills. I feel incredibly lucky to have had such a great supervisor. Secondly, I would