Detecting Events and Patterns in Large-Scale User Generated Textual Streams with Statistical Learning Methods

Vasileios Lampos
Department of Computer Science
University of Bristol

A thesis submitted to the University of Bristol in accordance with the requirements for the degree of Doctor of Philosophy

arXiv:1208.2873v1 [cs.LG] 13 Aug 2012

June 2018

“The lunatic is in the hall, the lunatics are in my hall, the paper holds their folded faces to the floor, and every day the paper boy brings more. And if the dam breaks open many years too soon, and if there is no room upon the hill, and if your head explodes with dark forebodings too, I’ll see you on the dark side of the moon.”
Pink Floyd ∼ Brain Damage

Declaration

I declare that the work in this thesis was carried out in accordance with the requirements of the University’s Regulations and Code of Practice for Research Degree Programmes and that it has not been submitted for any other academic award. Except where indicated by specific reference in the text, the work is the candidate’s own work. Work done in collaboration with, or with the assistance of, others, is indicated as such. Any views expressed in the thesis are those of the author.

Vasileios Lampos
Signature: ..............................
Date: ..............................

Abstract

A vast amount of textual web streams is influenced by events or phenomena emerging in the real world. The social web forms an excellent modern paradigm, where unstructured user generated content is published on a regular basis and on most occasions is freely distributed. The present Ph.D. Thesis deals with the problem of inferring information – or patterns in general – about events emerging in real life based on the contents of this textual stream.
We show that it is possible to extract valuable information about social phenomena, such as an epidemic or even rainfall rates, by automatic analysis of the content published in Social Media, and in particular Twitter, using Statistical Machine Learning methods. An important intermediate task regards the formation and identification of features which characterise a target event; we select and use those textual features in several linear, non-linear and hybrid inference approaches, achieving a significantly good performance in terms of the applied loss function. By further examining this rich data set, we also propose methods for extracting various types of mood signals revealing how affective norms – at least within the social web’s population – evolve during the day and how significant events emerging in the real world influence them. Lastly, we present some preliminary findings showing several spatiotemporal characteristics of this textual information as well as the potential of using it to tackle tasks such as the prediction of voting intentions.

Acknowledgements

I leave this Ph.D. ‘significantly’ older. This is a matter of fact. All other matters – abstract or not – surrounding this script as well as the time behind it are going to be products of subjective analysis.

I am grateful to my parents, Ioannis Lampos and Sofia Kousidou, and my brother Spyros Lampos. Every person in this world is bound to their ‘family’ in a healthy and limitless way. Well, ‘limitless’ might sometimes be – by definition – unhealthy, but diving properly into this primitive relationship may need another Ph.D. Thesis and surely no ‘funding’ exists for such an adventure.

Next, and, of course, with a significant importance measured, however, on a relatively different scale, I would like to thank Prof. Nello Cristianini, my Ph.D. advisor. Keeping all the good moments in mind, it has been fun working with him; it was as if we were not really working.
I would like to express my gratitude to Prof. Peter Flach – my internal Ph.D. progress reviewer – for his quite useful advice and remarks in every single progress review. I am also indebted to Dr. Steve Gregory for his help on several occasions during and right after the completion of my M.Sc. studies. I am grateful to Dr. Tijl De Bie and Prof. Ricardo Araya for their contribution to my work. I would also like to thank Thomas Lansdall-Welfare and Elena Hensinger for our short collaboration.

During my time in the university, I came across many interesting people such as the porters of the Merchant Venturers Building. I thank them for being polite and positive every time I was in and out of the building. This might sound strange, but positive energy – for some people who are by default negative – can be extremely important. I would like to thank Omar Ali, Syed Rahman and Stefanos Vatsikas for being honest and helpful on many occasions. I am also grateful to Christopher Mims, the journalist who wrote the first article on our work for MIT’s Technology Review. A special ‘thank you’ to Dr. Philip J. Naylor for maintaining our servers and databases; he is the person who once literally saved this project.

This Ph.D. project would not have been possible if EPSRC had not granted me a DTA scholarship (DTA/SB1826) to cover my tuition fees, and without NOKIA Research’s financial support. On the same ground, I would also like to thank the Computer Science Department of the University of Bristol, and its Head at the time Prof. Nishan Canagarajah, for covering parts of my funding in the third year of my Ph.D.; I have to thank my family – again – for covering the rest.

My friends – amongst other more important things – were also part of my Ph.D. studies and I ‘have’ to thank them. Therefore, Charis Fanourakis, Chloé Vastesaeger, Costas Mimis, Leonidas Kapsokalivas, Nick Mavrogiannakis, Takis Tsiatsis and Thanos Ntzoumanis, there you go, you got your names in my ‘book’.
Finally, I would like to thank all those Twitter people I have been interacting with for their indirect help in making me understand several aspects of this platform better and, of course, for their invaluable company during the endless nights I spent in writing up. I wish them ‘success’ in their life quests, if and only if.

Contents

Abstract
Acknowledgements
List of Figures
List of Tables
Abbreviations
1 Introduction
  1.1 Summary of questions, aims and goals
  1.2 Work dissemination
  1.3 Peer-reviewed publications
  1.4 Summary of the Thesis
2 Theoretical Background
  2.1 Defining the general field of research
  2.2 Regression
    2.2.1 Linear Least Squares
    2.2.2 Regularised Regression
    2.2.3 Least Absolute Shrinkage and Selection Operator (LASSO)
  2.3 Classification
    2.3.1 Decision Trees
    2.3.2 Classification and Regression Tree
  2.4 Bootstrap
    2.4.1 Bagging
  2.5 Feature Extraction and Selection
  2.6 Foundations of Information Retrieval
    2.6.1 Vector Space Model
    2.6.2 Term Frequency – Inverse Document Frequency
    2.6.3 Text Preprocessing – Stemming and Stop Words
  2.7 Summary of the chapter
3 The Data – Characterisation, Collection and Processing
  3.1 Twitter: A new pool of information
    3.1.1 Characterisation of Twitter – Why Twitter content is important
  3.2 Collecting, storing and processing Twitter data
    3.2.1 RSS and Atom feeds
    3.2.2 Collecting and storing tweets
    3.2.3 Software libraries for processing textual and numerical data
  3.3 Ground truth
  3.4 Summary of the chapter
4 First Steps on Event Detection in Large-Scale Textual Streams
  4.1 Introduction
  4.2 An important observation
    4.2.1 Computing a term-based score from Twitter corpus
    4.2.2 Correlations between Twitter flu-scores and HPA flu rates
  4.3 Learning HPA’s flu rates from Twitter flu-scores
    4.3.1 What is missing?
  4.4 Automatic extraction of ILI textual markers
    4.4.1 What is missing from this approach?
  4.5 The Harry Potter effect
  4.6 How error bounds for L1-regularised regression affect our approach
  4.7 Summary of the chapter
5 Nowcasting Events from the Social Web with Statistical Learning
  5.1 Recap on previous method’s limitations
  5.2 Nowcasting events
  5.3 Related work
  5.4 Methodology
    5.4.1 Forming vector space representations of text streams
    5.4.2 Feature selection
    5.4.3 Inference and consensus threshold
    5.4.4 Soft-Bolasso with consensus threshold validation
    5.4.5 Extracting candidate features and forming feature classes
  5.5 Comparing with other methodologies
  5.6 Case study I: nowcasting rainfall rates from Twitter
    5.6.1 Experimental settings
    5.6.2 Results
  5.7 Case study II: nowcasting flu rates from Twitter
    5.7.1 Experimental Settings
    5.7.2 Results
  5.8 Discussion
  5.9 Applying different learning methods
    5.9.1 Nowcasting rainfall rates with CART
    5.9.2 Nowcasting flu rates with CART
    5.9.3 Discussion
  5.10 Properties of the target events
  5.11 Summary of the chapter
6 Detecting Temporal Mood Patterns by Analysis of Social Web Content
  6.1 Detecting circadian mood patterns from Twitter content
    6.1.1 Data and methods
      6.1.1.1 Mean Frequency Mood Scoring
      6.1.1.2 Mean Standardised Frequency Mood Scoring
    6.1.2 Experimental results
      6.1.2.1 Testing the stability of the extracted patterns
