Extracting actionable information from microtexts

Ali Hürriyetoğlu

The research was supported by the Dutch national research program COMMIT/.

SIKS Dissertation Series No. 2019-17. The research reported in this thesis has been carried out under the auspices of SIKS, the Dutch Research School for Information and Knowledge Systems.

An electronic version of this dissertation is available at http://repository.ru.nl

ISBN: 978-94-028-1540-5

Copyright © 2019 by Ali Hürriyetoğlu, Nijmegen, the Netherlands

Cover design by Burcu Hürriyetoğlu

Published under the terms of the Creative Commons Attribution License, CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, provided the original author and source are credited.

Acknowledgements

The work presented in this book would have been much harder to complete without support from my family, friends, and colleagues.

First and foremost, I thank my supervisors, prof. dr. Antal van den Bosch and dr. Nelleke Oostdijk, for their invaluable support, guidance, enthusiasm, and optimism.

Being part of the LAnguage MAchines (LAMA) group at Radboud University was a great experience. I met a lot of wonderful and interesting people. The round table meeting, hutje-op-de-hei, ATILA, the IRC channel, and the lunches were fruitful sources of motivation and inspiration. I am grateful for this productive and relaxed work environment. Erkan, Florian, Iris, Kelly, Maarten, Martin, Wessel, Alessandro, Maria, and Ko were the key players in this pleasant atmosphere.

I thank all my co-authors and collaborators, who were, in addition to my advisors and the LAMA people, Piet Daas and Marco Puts from Statistics Netherlands (CBS), Jurjen Wagemaker and Ron Boortman from Floodtags, and Christian Gudehus from Ruhr-University Bochum, for their invaluable contributions to the work in this thesis and to my development as a scientist.

Florian and Erkan, we have done a lot together. Completing this journey with you supporting me once more, now as my paranymphs, feels awesome. I appreciate it.

Serwan, Erkan, Ghiath, and Ali are the friends who were always there for me. I am very happy to have had your invaluable company on this long and hard journey in life.

Remy and Selman were the great people who made my stay in Mountain View, CA, USA memorable during my internship at Netbase Solutions Inc. It would not have been as productive and cheerful without them. Their presence made me feel safe and at home.

I am grateful to prof. dr. Maarten de Rijke for welcoming me and Florian to his group on Fridays during the first year of my PhD studies. This experience provided me with the context to understand information retrieval.

I would also like to thank the members of the doctoral committee, prof. dr. Martha Larson, prof. dr. ir. Wessel Kraaij, prof. dr. Lidwien van de Wijngaert, dr. ir. Alessandro Bozzon, and dr. Ashwin Ittoo. Moreover, I appreciate the support of the anonymous reviewers who provided feedback on my submissions to scientific venues. Both the former and the latter significantly improved the quality of my research and my understanding of the scientific method.

No dissertation can be completed without institutional support. I appreciate the support of the Graduate School for the Humanities (GSH), the Faculty of Arts, and the Center for Language and Speech Technology (CLST) at Radboud University, the COMMIT/ project, and the Netherlands Research School for Information and Knowledge Systems (SIKS). They provided the environment needed to complete this dissertation.

I am always grateful to my family for their effort in standing by my side.

Last but not least, Sevara and Madina, you are the meaning of happiness for me.

Ali Hürriyetoğlu
Sarıyer, May 2019

Contents

1 Introduction
  1.1 Research Questions
  1.2 Data and Privacy
  1.3 Contributions and Outline

2 Time-to-Event Detection
  2.1 Introduction
  2.2 Related Research
  2.3 Data Collection
  2.4 Estimating the Time between Twitter Messages and Future Events
    2.4.1 Introduction
    2.4.2 Methods
      2.4.2.1 Linear and Local Regression
      2.4.2.2 Time Series Analysis
    2.4.3 Experimental Set-up
      2.4.3.1 Training and Test Data Generation
      2.4.3.2 Baseline
      2.4.3.3 Evaluation
    2.4.4 Results
    2.4.5 Conclusion
  2.5 Estimating Time to Event from Tweets Using Temporal Expressions
    2.5.1 Introduction
    2.5.2 Experimental Set-Up
      2.5.2.1 Data Sets
      2.5.2.2 Temporal Expressions
      2.5.2.3 Evaluation and Baselines
    2.5.3 Results
    2.5.4 Analysis
    2.5.5 Conclusion
  2.6 Estimating Time to Event based on Linguistic Cues on Twitter
    2.6.1 Introduction
    2.6.2 Time-to-Event Estimation Method
      2.6.2.1 Features
      2.6.2.2 Feature Selection
      2.6.2.3 Feature Value Assignment
      2.6.2.4 Time-to-Event Estimation for a Tweet
    2.6.3 Experimental Set-up
      2.6.3.1 Training and Test Regimes
      2.6.3.2 Evaluation and Baselines
      2.6.3.3 Hyperparameter Optimization
    2.6.4 Test Results
    2.6.5 Discussion
    2.6.6 Conclusion
  2.7 Conclusion

3 Relevant Document Detection
  3.1 Introduction
  3.2 Related Research
  3.3 Relevancer
    3.3.1 Data Preparation
    3.3.2 Feature Extraction
    3.3.3 Near-duplicate Detection
    3.3.4 Information Thread Detection
    3.3.5 Cluster Annotation
    3.3.6 Creating a Classifier
    3.3.7 Scalability
  3.4 Finding and Labeling Relevant Information in Tweet Collections
  3.5 Analysing the Role of Key Term Inflections in Knowledge Discovery on Twitter
  3.6 Using Relevancer to Detect Relevant Tweets: The Nepal Earthquake Case
  3.7 Identifying Flu Related Tweets in Dutch
  3.8 Conclusion

4 Mixing Paradigms for Relevant Microtext Classification
  4.1 Introduction
  4.2 Classifying Humanitarian Information in Tweets
    4.2.1 Introduction
    4.2.2 Approach 1: Identifying Topics Using Relevancer
    4.2.3 Approach 2: Topic Assignment by Rule-based Search Query Generation
    4.2.4 Combined Approach
    4.2.5 Results
    4.2.6 Discussion
    4.2.7 Conclusion
  4.3 Comparing and Integrating Machine Learning and Rule-Based Microtext Classification
    4.3.1 Related Studies
    4.3.2 Data Sets
    4.3.3 Baseline
    4.3.4 Building a Machine Learning Based Classifier
    4.3.5 Rule-Based System
    4.3.6 Comparing Machine Learning and Rule-Based System Results
    4.3.7 Integrating ML and RB Approaches at the Result Level
    4.3.8 Discussion
    4.3.9 Conclusion
  4.4 Conclusion

5 Conclusions
  5.1 Answers to Research Questions
  5.2 Answer to Problem Statement
  5.3 Thesis Contributions
  5.4 Outlook

References

Samenvatting

Summary

Curriculum Vitae

SIKS Dissertation Series

Chapter 1

Introduction

Microblogs such as Twitter represent a powerful source of information. Part of this information can be aggregated beyond the level of individual posts. Some of this aggregated information refers to events that could or should be acted upon in the interest of e-governance, public safety, or other public interests. Moreover, a significant amount of this information, if aggregated, could complement existing information networks in a non-trivial way. Here, we propose a semi-automatic method for extracting actionable information that serves this purpose.

The term event denotes what happens to entities in a defined space and time (Casati & Varzi, 2015). Events that affect the behaviors and possibly the health, well-being, and other aspects of life of multiple people, varying from hundreds to millions, are the focus of our work. We are interested in both planned events (e.g. football matches and concerts) and unplanned events (e.g. natural disasters such as floods and earthquakes).

Aggregated information that can be acted upon is specified as actionable.1 Actionable information that can help to understand and handle or manage events may be detected at various levels: an estimated time to event, a graded relevance estimate, an event’s precise time and place, or an extraction of entities involved (Vieweg, Hughes, Starbird, & Palen, 2010; Yin, Lampert, Cameron, Robinson, & Power, 2012; Agichtein, Castillo, Donato, Gionis, & Mishne, 2008).

Microtexts, which are posted on microblogs, are specified as short, user-dependent, minimally-edited texts in comparison to traditional writing products such as books, essays, and news articles (Ellen, 2011; Khoury, Khoury, & Hamou-Lhadj, 2014).

1http://www.oed.com/view/Entry/1941, accessed June 10, 2018


Users of a microblog platform may be professional authors, but very often they are non-professional authors whose writing skills vary widely and who are usually less concerned with readers' expectations as regards the well-formedness of a text. Generally, microtexts can be generated and published in short time spans with easy-to-use interfaces, without being bound to a fixed place. Microbloggers may generate microtexts about anything they think, observe, or want to share through the microblogs.

In recent years, vast quantities of microtexts have been generated on microblogs that are known as social networking services, e.g., Twitter2 and Instagram3 (Vallor, 2016). The network structure enables users of these platforms to connect with, influence, and interact with each other. This additional social dimension affects the quality and quantity of the microtexts created on these platforms.

The form and function of the content on microblogs are determined by the microbloggers and can be anything within the scope of the microblog's terms of service, which mostly only exclude excessive use of the platform and illegal behavior (Krumm, Davies, & Narayanaswami, 2008). Typically, microbloggers can follow each other and create lists of microbloggers to keep track of the content flow on the microblog.

The form of microtexts is restricted only in terms of length; their richness is supported by enabling the use of symbols beyond standard characters and punctuation for expressing additional meaning and structure. The typical form of microblog content deviates from that found on the web and in traditional genres of written text in that it contains new types of entities such as user names and emoticons, highly flexible use of and deviation from syntax, and spelling variation (Baldwin, Cook, Lui, MacKinlay, & Wang, 2013).

The intended function of the information conveyed in microtexts is more liberal and more diverse than that of the information published through the standard media (Java, Song, Finin, & Tseng, 2007; D. Zhao & Rosson, 2009; W. X. Zhao et al., 2011; Kavanaugh, Tedesco, & Madondo, 2014; Kwak, Lee, Park, & Moon, 2010). The content is in principle as diverse as the different microbloggers on the platform (De Choudhury, Diakopoulos, & Naaman, 2012). Distinct functions of the content on microblogs include, but are not limited to, (dis)approving of microtexts by re-posting the same or a commented version of a microtext, informing about observations, expressing opinions, reacting to discussions, sharing lyrics, quoting well-known sayings, or sharing artistic work. Moreover, microtexts may be intended to mislead the public by not reliably reflecting real-world events (Gupta, Lamba, Kumaraguru, & Joshi, 2013).

2https://twitter.com, accessed June 10, 2018
3https://www.instagram.com, accessed June 10, 2018

Often, microblogs allow the use of keywords or tags (referred to as hashtags on Twitter) to convey meta-information about the context of a microtext. Tags serve the intended purpose to some extent, but they do not guarantee high precision or high recall when they are used for identifying relevant tweets about an event (Potts, Seitzinger, Jones, & Harrison, 2011). Nevertheless, the use of tags for filtering microtexts remains a simple and straightforward strategy to select tweets.

Given the aforementioned characteristics, the timely extraction of relevant and actionable subsets from massive quantities of microtexts in an arbitrary language has been observed to be a challenge (Imran, Castillo, Diaz, & Vieweg, 2015; Sutton, Palen, & Shklovski, 2008; Allaire, 2016). In order to handle events as efficiently as possible, it is important to understand at the earliest possible stage what information is available, specify what is relevant, apply the knowledge of what is relevant to new microtexts, and update the models we build as the event progresses. Meeting these requirements in a single coherent approach and at a high level of performance has not been tackled before.

We make use of the Twitter platform, which is a typical microblog, to develop and test our methodology. Twitter was established in 2006 and had around 313 million active users at the time we performed our studies.4 It allows its users to create posts that must stay under a certain maximum length, so-called tweets.5 Our research mainly utilizes the textual part, i.e. the microtext, of the tweets. On Twitter, a microtext consists of a sequence of printable characters, e.g. letters, digits, punctuation, and emoticons, and, in addition to normal text, may contain references to user profiles on Twitter and links to external web content. The posting time of the tweets is used where the task has a temporal dimension, such as time-to-event prediction.

1.1 Research Questions

This thesis is organized around a problem statement and three research questions. The overarching problem statement (PS) is:

PS: How can we develop an efficient automatic system that, with a high degree of precision and completeness, can identify actionable information about major events in a timely manner from microblogs, while taking into account microtext and social media characteristics?

4https://about.twitter.com/company, accessed December 9, 2017
5Until November 2017 the length of a tweet was restricted to a maximum of 140 characters. Since then the limit was increased to 280 characters.

We focus on three types of actionable information, each of which is the topic of a research question. The prediction of the time to event, the detection of relevant information, and the extraction of target event information are addressed in research questions (RQ) 1, 2, and 3 respectively.

RQ 1: To what extent can we detect patterns of content evolution in microtexts in order to generate time-to-event estimates from them?

The studies under research question 1 explore linear and local regression and time series techniques to detect language use patterns that offer direct and indirect hints as to the start time of a social event, e.g. a football match or a music concert.

RQ 2: How can we integrate domain expert knowledge and machine learning to identify relevant microtexts in a particular microtext collection?

The task of discriminating between microtexts relevant to a certain event or a type of event and microtexts that are irrelevant despite containing certain clues, such as event-related keywords, is tackled in the scope of research question 2. The general applicability, speed, precision, and recall of the detection method are the primary concerns here.

RQ 3: To what extent are rule-based and ML-based approaches complementary for classifying microtexts based on small datasets?

Research question 3 focuses on classifying microtexts into certain topics. We combine and evaluate the relevant information detection method with a linguistically oriented rule-based approach for extracting relevant information about various topics.

The research questions are related to each other in a complementary and incremental manner. The results of each study in the scope of each research question feed into the following studies in the scope of the same or following research question.

1.2 Data and Privacy

We used the TwiNL framework6 (Tjong Kim Sang & van den Bosch, 2013) and the Twitter API7 to collect tweets. For some use cases, we collected data using tweet IDs that were released by others in the scope of shared tasks. We provide details of the tweet collection(s) we used for each of the studies in the respective chapters.

6http://www.ru.nl/lst/projects/twinl/, accessed June 10, 2018
7https://dev.twitter.com/rest/public, accessed June 10, 2018

Twitter data, like other social media data, has the potential to yield a biased or otherwise invalid sample (Tufekci, 2014; Olteanu, Castillo, Diaz, & Kiciman, 2016). We take these restrictions into account while developing our methods and interpreting our results. When appropriate we discuss these potential biases and restrictions in more detail.

Microbloggers rightfully are concerned about their privacy when they post microtexts (van den Hoven, Blaauw, Pieters, & Warnier, 2016). In order to address this concern, we used only public tweets and never identified users in person in our research. For instance, we never use or report the unique user ID of a user. In all studies we normalize the screen names or user names to a single dummy value before we process the data. However, named entities remain as they occur in the text. Finally, the datasets are obtained and shared using only tweet IDs. The use of tweet IDs enables the users who posted these tweets to remove them from our datasets. Consequently, we consider that our data use in the research reported in this dissertation complies with the EU General Data Protection Regulation (GDPR).

1.3 Contributions and Outline

The results of our research suggest that if our goals are to be met or approximated, a balance needs to be, and in fact can be, struck between automation, available time, and human involvement when extracting actionable information from microblogs.

Aside from this global insight, we have three main contributions. First, we show that predicting the time to event is possible for both in-domain and cross-domain scenarios. Second, we have developed a method which facilitates the definition of relevance for an analyst's context and the use of this definition to analyze new data. Finally, we integrate the machine learning based relevant information classification method with a rule-based information classification technique to classify microtexts.

The outline of this thesis is as follows:

Chapter 2 is about time-to-event prediction. We report our experiments with linear and local regression and time series analysis. Our features vary from simple unigrams to detailed detection and handling of temporal expressions. We apply a form of distant supervision to collect our data by using hashtags. The use cases are mainly about football matches. We evaluated the models we created both on football matches and music concerts. This chapter is based on Hürriyetoğlu et al. (2013), Hürriyetoğlu et al. (2014), and Hürriyetoğlu et al. (2018).

In Chapter 3 we introduce an interactive method that allows an expert to filter the relevant subset of a tweet collection. This approach provides experts with complete and precise information while relieving them of the task of crafting precise queries and risking a loss of precision or recall in the information they collect. Chapter 3 is based on Hürriyetoğlu et al. (2016a), Hürriyetoğlu et al. (2016), Hürriyetoğlu et al. (2016b), and Hürriyetoğlu et al. (2017).

In Chapter 4 we compare and integrate the relevant information detection approach introduced in Chapter 3 with a rule-based information classification approach, and compare the coverage and precision of the extracted information with manually annotated microtexts for topic-specific classification. This chapter is partly based on Hürriyetoğlu et al. (2017) and furthermore describes original work.

In Chapter 5 we summarize our contributions, formulate answers to our research questions, and provide an outlook towards future research.

Chapter 2

Time-to-Event Detection

2.1 Introduction

Social media produce data streams that are rich in content. Within the mass of microtexts posted on Twitter, for example, many sub-streams of messages (tweets) can be identified that refer to the same event in the real world. Some of these sub-streams refer to events that are yet to happen, reflecting the joint verbalized anticipation of a set of Twitter users towards these events. In addition to overt markers such as event-specific hashtags in messages on Twitter, much of the information on future events is in the surface text stream. These cues come in explicit as well as implicit forms: compare for example 'the match starts in two hours' with 'the players are on the field; can't wait for kickoff'. Identifying both types of linguistic cues may help disambiguate and pinpoint the starting time of an event, and therefore the remaining time to event (TTE).

Estimating the remaining time to event requires the identification of the future start time of an event in real time. This estimate is a core component of an alerting system that detects significant events (e.g. on the basis of mention frequency, a subtask not in focus in the present study), places the event on the agenda, and alerts users of the system. This alerting functionality is not only relevant for people interested in attending an event; it may also be relevant in situations requiring decision support to activate others to handle upcoming events, possibly with a commercial, safety, or security goal. A historical example of the latter category was Project X Haren,1 a violent riot on September 21, 2012, in Haren, the Netherlands, organized through social media. This event was abundantly announced on social media, with specific mentions of the date and place. Consequently, a national advisory committee, installed after the event,

1http://en.wikipedia.org/wiki/Project_X_Haren, accessed June 10, 2018

was asked to make recommendations to handle similar future events. The committee stressed that decision-support alerting systems on social media need to be developed, "where the focus should be on the detection of collective patterns that are remarkable and may require action" (Cohen, Brink, Adang, Dijk, & Boeschoten, 2013, p. 31; our translation).

Our research focuses on textual data published by humans via social media about particular events. If the starting point of the event in time is taken as the anchor point t = 0, texts can be viewed in relation to this point, and generalizations may be learned over texts at different distances in time to t = 0. The goal of this study is to present new methods that are able to automatically estimate the time to event from a stream of microtext messages. These methods could serve as modules in news media mining systems2 to fill upcoming event calendars. The methods should be able to work robustly in a stream of messages, and the dual goal would be to make (i) reliable predictions of the time to event (ii) as early as possible. Moreover, the system should be able, in the long run, to freely detect relevant future events that are not yet on any schedule we know of, in any language represented on social media. Predicting that an event is starting imminently is arguably less useful than being able to predict its start a number of days in advance. This implies that if a method requires a sample of tweets (e.g. with the same hashtag) to be gathered during some time frame, the frame should not be too long, otherwise predictions could come in too late to be relevant.

2For instance, https://www.anp.nl/product/anp-agenda, accessed October 15, 2018

The idea of publishing future calendars with potentially interesting events gathered (semi-)automatically for subscribers, possibly with personalization features and the option to harvest both social media and the general news, has been implemented already and is available through services such as Daybees3 and Songkick4. To our knowledge, based on the public interfaces of these platforms, these services perform directed crawls of (structured) information sources, and identify exact date and time references in posts on these sources. Restricting the curation to explicit date mentions decreases the number of events that can be detected. These platforms also curate event information manually, or collect it through crowd-sourcing. However, non-automatic compilation of event data is costly, time-consuming, and error-prone, while it is also hard to keep the information up to date and ensure its correctness and completeness.

3http://daybees.com/, accessed August 3, 2014
4https://www.songkick.com/, accessed June 10, 2018

In this research we focus on developing a method for estimating the starting time of scheduled events, and use known events for a controlled experiment involving Dutch Twitter messages. We study time-to-event prediction in a series of three connected experiments that are based on work published in scientific venues and build on each other's results. After we introduce related research and the data collections we used (in Sections 2.2 and 2.3 respectively), we report on these studies. We start with a preliminary bag-of-words (BoW) based study to explore the potential of linear and local regression and time series models to identify the start time of football matches (Hürriyetoğlu et al., 2013) in Section 2.4. Next, in Section 2.5, we focus on time series analysis of time expressions for the same task (Hürriyetoğlu et al., 2014). Finally, we add BoW and rule-based features to the time expression features and include music concerts as an event type for the time-to-event (TTE) prediction task (Hürriyetoğlu et al., 2018) in Section 2.6. At the end of the chapter, we provide conclusions derived from these studies and an overall evaluation of the developed TTE estimation method.

2.2 Related Research

The growing availability of digital texts with time stamps, such as e-mails, weblogs, and online news, has spawned various types of studies on the analysis of patterns in texts over time. An early publication on the general applicability of time series analysis to time-stamped text is Kleinberg (2006). A more recent overview of future predictions using social media is Yu and Kak (2012). A popular goal of time series analysis of texts is event prediction, where a correlation is sought between a point in the future and preliminary texts.

In recent years, there have been numerous studies in the fields of text mining and information retrieval directed at the development of approaches and systems that would make it possible to forecast events. The range of (types of) events targeted is quite broad and varies from predicting manifestations of societal unrest such as nation-wide strikes or uprisings (Ramakrishnan et al., 2014; Muthiah, 2014; Kallus, 2014), to forecasting the events that may follow a natural disaster (Radinsky, Davidovich, & Markovitch, 2012). Studies that focus specifically on identifying future events are, for example, Baeza-Yates (2005); Dias, Campos, and Jorge (2011); Jatowt and Au Yeung (2011); Briscoe, Appling, and Schlosser (2015). A review of the literature shows that while approaches are similar to the extent that they all attempt to learn automatically from available data, they are quite different as regards the information they employ. For example, Radinsky et al. (2012) attempt to learn from causality pairs (e.g. a flood causes people to flee) in long-ranging news articles to predict the event that is likely to follow the current event. Lee, Surdeanu, Maccartney, and Jurafsky (2014) exploit the up/down/stay labels in financial reports when trying to predict the movement of the stock market the next day. Redd et al. (2013) attempt to calculate the risk of veterans becoming homeless, by analyzing the medical records supplied by the U.S. Department of Veterans Affairs.

Predicting the type of event is one aspect of event forecasting; giving an estimate as to the time when the event is (likely) to take place is another. Many studies, such as the ones referred to above, focus on the event type rather than the event time; that is, they are more generally concerned with future events, but not particularly with predicting the specific date or hour of an event. The same goes for Noro, Inui, Takamura, and Okumura (2006), who describe a system for the identification of the period in which an event will occur, such as in the morning or at night. And again, studies such as those by Becker, Iter, Naaman, and Gravano (2012) and Kawai, Jatowt, Tanaka, Kunieda, and Yamada (2010) focus more specifically on the type of information that is relevant for predicting events, while they do not aim to give an exact time. Furthermore, Nakajima, Ptaszynski, Honma, and Masui (2014) extract (candidate) semantic and syntactic patterns with future reference from news articles in an attempt to improve the retrieval of future events. Finally, Noce, Zamberletti, Gallo, Piccoli, and Rodriguez (2014) present a method that automatically extracts forward-looking statements from earnings call transcripts in order to support business analysts in predicting those future events that have economic relevance.

There are also studies that are relevant in this context due to their focus on social media, and Twitter in particular. Research by Ritter, Mausam, Etzioni, and Clark (2012) is directed at creating a calendar of automatically detected events. They use explicit date mentions and words typical of a given event. They train on annotated open-domain event mentions and use the TempEx tagger (Mani & Wilson, 2000) for the detection of temporal expressions. Temporal expressions that point to certain periods such as 'tonight' and 'this morning' are used by Weerkamp and De Rijke (2012) to detect personal activities at such times. In the same line, Kunneman and van den Bosch (2012) show that machine learning methods can differentiate between tweets posted before, during, and after a football match.

Hürriyetoğlu et al. (2013) also use tweet streams that are related to football matches and attempt to estimate the time remaining to an event, using local regression over word time series. In a related study, Tops, van den Bosch, and Kunneman (2013) use support vector machines to classify the TTE in automatically discretized categories. The results obtained in the latter two studies are at best about a day off in their predictions. Both studies also investigate the use of temporal expressions but fail to leverage the utility of this information source, presumably because they use limited sets of regular expressions: in each case fewer than 20 expressions were used.

The obvious baseline that we aim to surpass with our method is the detection of explicit temporal expressions from which the TTE could be inferred directly. Finding explicit temporal expressions can be achieved with rule-based temporal taggers such as the HeidelTime tagger (Strötgen & Gertz, 2013), which generally search for a small, fixed set of temporal expressions (Kanhabua, Romano, & Stewart, 2012; Ritter et al., 2012). As is apparent from studies such as Strötgen and Gertz (2013), Mani and Wilson (2000), and Chang and Manning (2012), temporal taggers are successful in identifying temporal expressions in written texts as encountered in more traditional genres, such as news articles or official reports. In principle, they can also be adapted to cope with various languages and genres (cf. Strötgen and Gertz, 2013). However, the focus is typically on temporal expressions that have a standard form and a straightforward interpretation.

Various studies have shown that, while temporal expressions provide a reliable basis for the identification of future events, resolving the reference of a given temporal expression remains a challenge (cf. Kanhabua et al., 2012; Jatowt, Au Yeung, and Tanaka, 2013; Strötgen and Gertz, 2013; Morency, 2006). In certain cases temporal expressions may even be obfuscated intentionally (Nguyen-Son et al., 2014) by deleting temporal information that could be misused to commit a crime against or invade the privacy of a user. Also, temporal taggers for languages other than English are not as successful and widely available as they are for English. Thus, basing TTE estimation only on temporal taggers is not optimal.

Detecting temporal expressions in social media text requires a larger degree of flexibility in recognizing the form of a temporal expression and identifying its value than it does in news text. Part of this flexibility may be gained by learning temporal distances from data rather than fixing them at knowledge-based values. Blamey, Crick, and Oatley (2013) suggest estimating the values of temporal expressions on the basis of their distributions, in the context of estimating the creation time of photos on online social networks. Hürriyetoğlu et al. (2014) develop a method that relaxes and extends both the temporal pattern recognition and the value identification for temporal expressions.

2.3 Data Collection

For our research we collected tweets referring to scheduled Dutch premier league football matches (FM) and music concerts (MC) in the Netherlands. These events trigger many anticipatory references on social media before they happen, containing numerous temporal expressions and other non-temporal, implicit lexical clues on when they will happen.

We harvested all tweets from Twiqs.nl, an online database of Dutch tweets collected from December 2010 onwards (Tjong Kim Sang & van den Bosch, 2013). Both for football matches and music concerts we used event-specific hashtags to identify the event, i.e. we used the hashtag that, to the best of our knowledge, was the most distinctive for the event. The hashtags used for FM follow a convention where the first two or three letters of the names of the two teams playing against each other are concatenated, starting with the host team. An example is #ajatwe for a football match in which Ajax is the host and Twente is the away team. The MC hashtags are mostly concatenations of the first and last name of the artist, or concatenations of the words forming the band name. Although in the latter case shorter variants exist for many of the hashtags, we did not delve into the task of identifying such variants (Ozdikis, Senkul, & Oguztuzun, 2012; X. Wang, Tokarchuk, Cuadrado, & Poslad, 2013) and used the full variants.
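To make the convention concrete, here is a minimal sketch; the team codes are those listed in footnote 5 below, and the helper function is ours:

```python
# Sketch of the FM hashtag convention: concatenate the short codes of
# the two teams, host team first (Ajax hosting Twente gives "#ajatwe").
TEAM_CODES = {
    "Ajax Amsterdam": "aja",
    "Feyenoord Rotterdam": "fey",
    "PSV Eindhoven": "psv",
    "FC Twente": "twe",
    "AZ Alkmaar": "az",
    "FC Utrecht": "utr",
}

def match_hashtag(host: str, away: str) -> str:
    return "#" + TEAM_CODES[host] + TEAM_CODES[away]

assert match_hashtag("Ajax Amsterdam", "FC Twente") == "#ajatwe"
```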

The FM dataset was collected by selecting the six best performing teams of the Dutch premier league in 2011 and 2012. We queried all matches in which these teams played against each other in the calendar years 2011 and 2012.5 The MC dataset contains tweets from concerts that took place in the Netherlands between January 2011 and September 2014. We restricted the data to tweets sent within eight days before the event.6 We decided to refrain from extending the time frame to the point in time when the first tweet that mentions the hashtag was sent, because hashtags may denote a periodic event or a different event that takes place at another time, which may lead to inconsistencies that we did not aim to solve in the research reported here. Most issues having to do with periodicity, ambiguity and inconsistency are absent within the 8-day window, i.e. tweets with a particular event hashtag largely refer to the event that is upcoming within the next eight days.

As noted above, the use of hashtags neither provides complete sets of tweets about the events nor does it ensure that only tweets pertaining to the main event are included (Tufekci, 2014). We observed that some event hashtags from both data sets were used to denote other similar events that were to take place several days before the event we were targeting, such as a cup match instead of a league match between the same teams, or another concert by the same artist. For example, the teams Ajax and Twente played a league and a national cup match within a period of eight days (two consecutive Sundays). In case there was such a conflict, we aimed to estimate the TTE for the relatively bigger event in terms of the available Dutch tweet count about it. For #ajatwe, this was the league match. In so far as we were aware of related events taking place within the same 8-day window with comparable tweet counts, we did not include these events in our datasets.

5Ajax Amsterdam (aja), Feyenoord Rotterdam (fey), PSV Eindhoven (psv), FC Twente (twe), AZ Alkmaar (az), and FC Utrecht (utr).
6An analysis of the tweet distribution shows that the 8-day window captures about 98% of all tweets by means of the hashtags that we used.

Social media users tend to include various additional, confusing hashtags besides the target hashtag in a tweet. We consider a hashtag to be confusing when it denotes an event different from the event designated by the hashtag creation rule. For football matches this is the case, for example, when a user uses #tweaja instead of #ajatwe when referring to a home game for Ajax; for music concerts, we may encounter tweets that use a specific hashtag, e.g. #beyonce, for a topic other than the targeted Beyoncé concert. In these cases, the unrelated tweets are not removed, and are used as if they were referring to the main event. We aim for our approach to be resistant to such noise, which, after all, we find is present in most social media data.

In Table 2.1 we present an overview of the datasets that were used in the research reported on in the remainder of this chapter. We created a version without retweets for each dataset in order to be able to measure the effect of the retweets in our experiments. We used the simple pattern "rt @" to identify retweets and create the 'FM without retweets' and 'MC without retweets' datasets. Events with fewer than 15 tweets were eliminated. Consequently, the number of events in 'MC without retweets' dropped to 32, since the number of tweets about three events fell below 15 after removal of retweets. A different subset of this data was used for each experiment in the following subsections.
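A sketch of these two preparation steps, under the assumption that tweets are grouped per event hashtag; the mapping and function names are ours:

```python
def remove_retweets(texts):
    # The simple pattern "rt @" is what identifies retweets here.
    return [t for t in texts if "rt @" not in t.lower()]

def prepare(tweets_by_event, min_tweets=15):
    # Build the "without retweets" variant and drop events that end up
    # with fewer than 15 tweets.
    filtered = {e: remove_retweets(ts) for e, ts in tweets_by_event.items()}
    return {e: ts for e, ts in filtered.items() if len(ts) >= min_tweets}
```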

Table 2.1: Number of events and tweets for the FM and MC data sets (FM = football matches; MC = music concerts). In the 'All' sets all tweets are included, that is, original posts and retweets.

                       # of events          # of tweets
                                     Min.   Median   Max.     Total
FM All                      60       305    2,632    34,868   262,542
FM without retweets         60       191    1,345    23,976   139,537
MC All                      35        15       54     1,074     4,363
MC without retweets         32        15       55       674     3,479

Each tweet in our data set has a time stamp of the moment (hour-minute-seconds) it was posted. Moreover, for each football match and each music concert we know exactly when it took place: the event start times were gathered from the websites Eredivisie.nl for football matches and lastfm.com for music concerts. This information is used to calculate for each tweet the actual time that remains to the start of the event, as well as to compute the absolute error in estimating the remaining time to event.

We would like to emphasize that the final data set contains all kinds of discussions that do not contribute directly to predicting the time of an event. The challenge we undertake is to make sense of this mixed and unstructured content to identify the temporal proximity of an event.

The football matches data sets (FM All and FM without retweets) were used to develop the time-to-event estimation method we suggest in this chapter. The music concerts data set was used in Section 2.6 to test the cross-domain applicability of the proposed method.

2.4 Estimating the Time between Twitter Messages and Future Events

Based on: Hürriyetoğlu, A., Kunneman, F., & van den Bosch, A. (2013). Estimating the Time Between Twitter Messages and Future Events. In Proceedings of the 13th Dutch-Belgian Workshop on Information Retrieval (pp. 20–23). Available from http://ceur-ws.org/Vol-986/paper_23.pdf

In this section, we describe and test three methods to estimate the remaining time between a series of microtexts (tweets) and the future event they refer to via a hashtag. Our system generates hourly forecasts. For comparison, two straightforward approaches, linear regression and local regression, are applied to map hourly clusters of tweets directly onto the time to event. To take changes over time into account, we develop a novel time series analysis approach that first derives word frequency time series from sets of tweets and then performs local regression to predict the time to event from nearest-neighbor time series. We train and test on a single type of event, Dutch premier league football matches. Our results indicate that about four days or more before the event, the time series analysis produces relatively accurate time-to-event predictions that are about one day off; closer to the event, local regression offers the most accurate predictions. Local regression also outperforms both the mean- and median-based baselines, but none of the tested systems performs consistently strongly through time.

2.4.1 Introduction

We test the predictive capabilities of three different approaches. The first system is based on linear regression and maps sets of tweets with the same hashtag posted during a particular hour to a time-to-event estimate. The second system attempts to do the same based on local regression. The third system uses time series analysis. It takes into account more than a single set of tweets: during a certain time period it samples several sets of tweets in fixed time frames, and derives time series information from individual word frequencies in these samples. It compares these word frequency time series profiles against a labeled training set of profiles in order to find similar patterns of change in word frequencies. The method then adopts local regression: upon finding a nearest-neighbor word frequency time series, the time to event stored with that neighbor is copied to the tested time series. With this third system, and with the comparison against the second system, we can test the hypothesis that it is useful to gather time series information (more specifically, patterns in word frequency changes) over a period of time.

The three systems are described in Section 2.4.2. Section 2.4.3 describes the overall experimental set-up, including the baseline and the evaluation method used. The results are presented and analyzed in Section 2.4.4. We conclude with a discussion of the results and the following steps in Section 2.4.5.

2.4.2 Methods

The methods adopted in our study operate on streams of tweets, and generate hourly forecasts for the events that tweets with the same hashtag refer to. The single tweet is the smallest unit available for this task; we can also consider more than one tweet and aggregate tweets over a certain time frame. If these single tweets or sets of tweets are represented as a bag-of-words vector, the task can be cast as a regression problem: mapping a feature vector onto a continuous numeric output representing the time to event. In this study the smallest time unit is one hour, and all three methods work with this time frame.

2.4.2.1 Linear and Local Regression

In linear regression, each feature in the bag-of-words feature vector representing the presence or frequency of occurrence of a specific word can be regarded as a predictive variable to which a weight can be assigned that, in a simple linear function, multiplies the value of the predictive variable to generate a value for the response variable, the time to event. A multiple linear regression function can be approximated by finding the weights for a set of features that generates the response variable with the smallest error.

Local regression, or local learning (Atkeson, Moore, & Schaal, 1997), is the numeric variant of the k-nearest neighbor classifier. Given a test instance, it finds the closest k training instances based on a similarity metric, and bases a local estimate of the numeric output on some average of the outcomes of those closest k training instances.
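Both variants are standard regressors; below is a minimal sketch with scikit-learn and toy data (the experiments reported later in this section actually used R and TiMBL, see Section 2.4.3):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_train = rng.integers(0, 3, size=(200, 50)).astype(float)  # hourly BoW vectors
y_train = -rng.uniform(0, 192, size=200)                    # TTE in hours
X_test = X_train[:3]

# Multiple linear regression: one global weight per word feature.
linear = LinearRegression().fit(X_train, y_train)

# Local regression: average the TTE of the k nearest training instances.
local = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

print(linear.predict(X_test))
print(local.predict(X_test))
```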

Linear regression and local regression can be considered baseline approaches, but are complementary. While in linear regression an overall pattern is generated to fit the whole training set, local regression only looks at local information for classification (the characteristics of single instances). Linear regression has a strong bias that is not suited to map Gaussian or other non-linear distributions. In contrast, local regression is unbiased and will adapt to any local distribution.

2.4.2.2 Time Series Analysis

Time series are data structures that contain multiple measurements of data features over time. If values of a feature change meaningfully over time, then time series analysis can be used to capture this pattern of change. Comparing new time series with memorized time series can reveal similarities that may lead to a prediction of a subsequent value or, in our case, the time to event. Our time series approach extends the local regression approach by not only considering single sets of aggregated tweets in a fixed time frame (e.g. one hour in our study), but creating sequences of these sets representing several consecutive hours of gathered tweets. Using the same bag-of-words representation as the local regression approach, we find nearest neighbors of sequences of bag-of-word vectors rather than single hour frames. The similarity between a test time series and a training time series of the same length is calculated by computing their Euclidean distance. In this study we did not further optimize any hyperparameters; we set k = 1.

The time series approach generates predictions by following the same strategy as the simple local regression approach: upon finding the nearest-neighbor training time se- ries, the time to event of this training time series is taken as the time-to-event estimate of the test time series. In case of equidistant nearest neighbors, the average of their associated time to event is given as the prediction.
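Under the choices described above (six consecutive hourly vectors per profile, Euclidean distance, k = 1, averaging over ties), the estimator fits in a few lines; a sketch with names of our own choosing:

```python
import numpy as np

def estimate_tte(test_profile, train_profiles, train_ttes):
    """test_profile: array of shape (6, vocab_size); train_profiles:
    array of shape (n, 6, vocab_size); train_ttes: n TTE values in
    hours. Copies the TTE of the nearest training profile; equidistant
    nearest neighbors are averaged."""
    flat = train_profiles.reshape(len(train_profiles), -1)
    dists = np.linalg.norm(flat - test_profile.ravel(), axis=1)
    nearest = dists == dists.min()
    return float(np.mean(np.asarray(train_ttes)[nearest]))
```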

2.4.3 Experimental Set-up

2.4.3.1 Training and Test Data Generation

To generate training and test events we cut the set of the football match events, FM All, in two, resulting in a calendar year (instead of a season) of matches for both training and testing. The events that happened in 2011 and 2012 were used as training and test data respectively. As the aim of our experiments was to estimate the time to event in terms of hours, we selected matches played on the same weekday and with the same starting time: Sundays at 2:30 PM (the most frequent starting time). This resulted in 12 matches as training data (totaling 54,081 tweets) and 14 matches as test events (40,204 tweets).

The goal of the experiments was to compare systems that generate hourly forecasts of the event start time for each test event. This was done based on the information in aggregated sets of tweets within the time span of an hour. The linear and local regression methods only operate on vectors representing one-hour blocks. The time series analysis approach makes use of longer sequences of six one-hour blocks; this number was set empirically in preliminary experiments.

The aggregated tweets were used as training instances for the linear and local regression methods. To maximize the number of training instances, we generated a sequence of overlapping instances using the minute as a finer-grained shift unit. At every minute, all tweets posted within the hour before that minute were added to the instance. This resulted in 41,987 training instances generated from 54,081 tweets.
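The minute-shifted windowing can be sketched as follows, assuming each tweet is a (posting_time, text) pair sorted by time; all names are ours:

```python
from datetime import timedelta

def sliding_hour_instances(tweets, shift=timedelta(minutes=1),
                           window=timedelta(hours=1)):
    """Yield one instance per minute: all tweet texts posted within
    the hour ending at that minute."""
    current = tweets[0][0] + window
    last = tweets[-1][0]
    while current <= last + shift:
        texts = [text for when, text in tweets
                 if current - window <= when < current]
        if texts:
            yield current, texts
        current += shift
```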

In order to reduce the feature space for the linear and local regression instances, we pruned every bag-of-words feature that occurred fewer than 500 times in the training set. Linear regression was applied by means of R.7 Absolute occurrence counts of features were taken into account. For local regression we made use of the k-NN implementation that is part of TiMBL8, setting k = 5, using information gain feature weighting, and an overlap-based similarity metric that does not count matches on zero values (features marking words that are absent in both test and training vectors). For k-NN, binary feature values were used.

The time series analysis vectors are not filled with absolute occurrence counts, but with relative and smoothed frequencies. After having counted all words in each time frame, two frequencies are computed for each word. One is the overall frequency of a word, which is calculated as the sum of its counts in all time frames, divided by the total number of tweets in all time frames in our 8-day window. This frequency ranges between 0 (the word does not occur) and 1 (the word occurs in every tweet). The other frequency is computed per time frame for each word, where the word count in that frame is divided by the number of tweets in the frame. The latter frequency is the basic element in our time series calculations.

As many time frames contain only a small number of tweets, especially the frames more than a few days before the event, word counts are sparse as well. Besides taking longer time frames of more than a single sample size, frequencies can also be smoothed through typical time series analysis smoothing techniques such as moving average smoothing. We apply a pseudo-exponential moving average filter by replacing each word count by a weighted average of the word count at time frames $t$, $t-1$, and $t-2$, where $w_t = 4$ (the weight at $t$ is set to 4), $w_{t-1} = 2$, and $w_{t-2} = 1$.

7http://www.r-project.org/, accessed June 10, 2018
8http://ilk.uvt.nl/timbl, accessed June 10, 2018
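A sketch of the per-frame frequency and the smoothing filter just described; normalizing the weighted average by the weight sum is our assumption, as are the function names:

```python
def frame_frequencies(word_counts, tweet_counts):
    # Per-frame relative frequency: the word count in a frame divided
    # by the number of tweets in that frame (0.0 for empty frames).
    return [c / n if n else 0.0 for c, n in zip(word_counts, tweet_counts)]

def smooth(series, weights=(4, 2, 1)):
    # Pseudo-exponential moving average: the value at frame t becomes a
    # weighted average over frames t, t-1, t-2 (weights 4, 2, 1);
    # frames before the start of the series are skipped.
    out = []
    for t in range(len(series)):
        num = den = 0.0
        for lag, w in enumerate(weights):
            if t - lag >= 0:
                num += w * series[t - lag]
                den += w
        out.append(num / den)
    return out
```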

2.4.3.2 Baseline

To ground our results better, we computed two baselines derived from the training set: the median and the mean time to event over all training tweets. For the median baseline, all tweets in the training set were ordered in time and the median time was identified. As we use one-hour time frames throughout our study, we round the median to the one-hour time frame it falls in, which turns out to be −3 hours. The mean was computed by averaging the time to event of all tweets, and again rounded at the hour. The mean is −26 hours. Since relatively many tweets are posted just before a social event, these baselines form a competitive challenge.
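Both baselines reduce to two constants computed once on the training data; a minimal sketch (the toy TTE list is hypothetical; on the actual training set these come out at −3 and −26 hours):

```python
import statistics

# Hypothetical TTE values (hours before the event, hence negative)
# of all training tweets.
train_ttes = [-0.5, -1.2, -2.7, -3.4, -26.0, -80.3, -150.9]

median_baseline = round(statistics.median(train_ttes))  # -3 on this toy list
mean_baseline = round(statistics.mean(train_ttes))      # -38 on this toy list
```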

2.4.3.3 Evaluation

A common metric for evaluating numeric predictions is the Root Mean Squared Error (RMSE), cf. Equation 2.1. For all hourly forecasts made in $N$ hour frames, a sum is made of the squared differences between the actual value $v_i$ and the estimated value $e_i$; the (square) root is then taken to produce the RMSE of the prediction series. As errors go, a lower RMSE indicates a better approximation of the actual values.

$$ \mathrm{RMSE} = \left( \frac{1}{N} \sum_{i=1}^{N} (v_i - e_i)^2 \right)^{1/2} \qquad (2.1) $$
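Equation 2.1 translates directly into code (the function name is ours):

```python
import math

def rmse(actual, estimated):
    """Root Mean Squared Error over a series of hourly forecasts
    (Equation 2.1); inputs are equal-length sequences of TTE values."""
    n = len(actual)
    return math.sqrt(sum((v - e) ** 2 for v, e in zip(actual, estimated)) / n)
```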

2.4.4 Results

Table 2.2 displays the averaged RMSE results on the 14 test events. The performance of the linear regression method is worse than both baselines, while the time series analysis outperforms the median baseline but lags behind the mean baseline. The best-performing method, local regression, is still an unsatisfactory 43 hours off (almost two days) on average, indicating that there is still a lot of room for improvement.

Table 2.2: Overall Root Mean Squared Error scores for each method: difference in hours between the estimated time to event and the actual time-to-event, per test event, with the average (standard deviation) over all 14 test events in the bottom row.

Event                Baseline  Baseline  Linear      Local       Time
                     Median    Mean      regression  regression  Series
Spring 2012
  azaja              63        51        48          48          52
  feyaz              49        40        50          44          42
  feyutr             54        44        42          35          59
  psvfey             62        51        43          41          54
  tweaja             38        31        410         45          43
  twefey             64        52        41          43          41
  tweutr             96        77        63          31          41
  utraz              71        58        70          20          33
Fall 2012
  azfey              62        50        111         48          57
  psvaz              67        55        58          40          31
  twefey             62        51        110         46          52
  utraz              66        53        71          48          54
  utrpsv             61        49        59          34          37
  utrtwe             62        51        63          52          68
Av (sd)              63 (12)   51 (10)   82 (94)     43 (9)      54 (10)

The performance of the different methods in terms of their RMSE on the hourly forecasts is plotted in Figure 2.1. In the left half of the graph the three systems outperform the baselines, except for an error peak of the linear regression method at around t = −150. Before t = −100 the time series prediction performs rather well, with RMSE values averaging 23 hours, while the linear regression and local regression methods start at larger errors which decrease as time progresses. In the second half of the graph, however, only the local regression method retains fairly low RMSE values, at an average of 21 hours, while the linear regression method becomes increasingly erratic in its predictions. The time series analysis method also produces considerably higher RMSE values in the last days before the events.

Figure 2.1: RMSE curves for the two baselines and the three methods for the last 192 hours before t = 0.

2.4.5 Conclusion

In this study we explored and compared three approaches to time-to-event prediction on the basis of streams of tweets. We tested on the prediction of the time to event of fourteen football matches by generating hourly forecasts. Compared to two simplistic baselines based on the mean and median of the time to event of tweets sent before an event, only one of the three approaches, local regression, displays better overall RMSE values on the tested prediction range of 192 to 0 hours before the event. Linear regression generates erratic predictions and scores below both baselines. A novel time series approach that implements local regression based on sequences of samples of tweets performs better than the median baseline, but worse than the mean baseline.

Yet, the time series method generates quite accurate forecasts during the first half of the test period. Before t = −100 hours, i.e. earlier than four days before the event, predictions by the time series method are only about a day off (23 hours on average in this time range). From t = −100 onwards, the local regression approach based on sets of tweets in hourly time frames is the better predictor, with quite low RMSE values close to t = 0 (21 hours on average in this time range).

On the one hand, our results are not very strong: predictions that are on average more than two days off the actual time to event, and at the same time only mildly better than the baselines, cannot be considered precise. On the other hand, we observe that the local regression and time series analysis methods have complementary strengths: the former is precise on average, while the latter is precise in the early phase, well before the event. Since our ultimate aim is to generate precise estimates as early as possible, we will continue our exploration for an optimal solution in time series analysis in the following experiments. The first step in this endeavor will be an analysis of temporal expression usage in the time series approach.

Finally, we observed that RMSE is not suitable for evaluating the performance of methods that analyze microtexts. The measure heavily penalizes outliers, which are abundant in microtexts. Consequently, we will use the mean absolute error (MAE) to measure the average performance of the approaches we suggest in the following studies.
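A small worked comparison with toy numbers shows the difference: a single outlier dominates RMSE far more than MAE.

```python
import math

errors = [2, 3, 2, 40]  # hypothetical absolute errors in hours; one outlier

mae = sum(errors) / len(errors)                             # 11.75
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # about 20.1
```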

2.5 Estimating Time to Event from Tweets Using Temporal Expressions

Based on: Hürriyetoğlu, A., Oostdijk, N., & van den Bosch, A. (2014, April). Estimating time to event from tweets using temporal expressions. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM) (pp. 8–16). Gothenburg, Sweden: Association for Computational Linguistics. Available from http://www.aclweb.org/anthology/W14-1302

Given a stream of Twitter messages about an event, we investigate the predictive power of temporal expressions in the messages to estimate the time to event (TTE). From labeled training data we learn average TTE estimates of temporal expressions and combinations thereof, and define basic estimation rules to compute the time to event from temporal expressions, so that when they occur in a tweet that mentions an event we can generate a prediction. We show in a case study on football matches that our estimations are off by about eight hours on average in terms of mean absolute error (MAE).

2.5.1 Introduction

In this study we do not use a rule-based temporal tagger such as the HeidelTime tagger (Strötgen & Gertz, 2013), which searches for only a limited set of temporal expressions. Instead, we adopt an approach that uses a large set of temporal expressions, created by using lexical items and generative rules, and a training method that automatically determines the TTE estimate to be associated with each temporal expression sequence in a data-driven way.9

Typically, rule-based systems are able to cover information provided by adverbs (‘more’ in ‘three more days’) and relations between non-adjacent elements and encode temporal logic, while machine-learning-based systems do not make use of the temporal logic inherent to temporal expressions; they may identify ‘three more days’ as a temporal expression but they lack the logical apparatus to compute that this implies a TTE of about 3 × 24 hours.10 To make use of the best of both worlds we propose a hybrid system which uses information about the distribution of temporal expressions as they are used in forward-looking social media messages in a training set of known events, and combines this estimation method with an extensive set of linguistically-motivated patterns that capture a large space of possible Dutch temporal expressions.

For our experiment we used the ‘FM without retweets’ set that was described in Section 2.3. This type of event generally triggers many anticipatory discussions on social media containing many temporal expressions. Given a held-out football match not used during training, our system predicts the time to the event based on individual tweets captured in a range from eight days before the event to the event time itself. Each estimation is based on the temporal expression(s) in a particular Twitter message. The mean absolute error of the predictions for each of the 60 football matches in our data set is about eight hours. The results are generated in a leave-one-out cross-validation setup.11

9We distinguish between generative and estimation rules. The former are used to generate temporal expressions for recognition purposes. The latter enable encoding temporal logic for estimating time to event.
10Although machine learning techniques have the potential to automatically learn approximating temporal logic, obtaining training data that can facilitate this learning process at an accuracy that can be encoded with estimation rules is challenging.
11Tweet IDs, per-tweet estimations, observed temporal expressions, and estimation rules are available from http://www.ru.nl/lst/resources/.

This study starts with a description of the overall experimental set-up in Section 2.5.2, including the temporal expressions that were used, our two baselines, and the evaluation method used. Next, in Section 2.5.3 the results are presented. The results are analyzed and discussed in Section 2.5.4. We conclude with a summary of our main findings in Section 2.5.5, where we also point to the follow-up steps that are implemented based on the results of this study.

2.5.2 Experimental Set-Up

We carried out a controlled case study in which we focused on Dutch premier league football matches as a type of scheduled event. As observed earlier, in Section 2.3, these types of matches have the advantage that they occur frequently, have a distinctive hashtag by convention, and often generate thousands to several tens of thousands of tweets per match.

Below we first describe the collection and composition of our data sets (Subsection 2.5.2.1) and the temporal expressions which were used to base our predictions upon (Subsection 2.5.2.2). Then, in Subsection 2.5.2.3, we describe our baselines and evaluation method.

2.5.2.1 Data Sets

We use the ‘FM without retweets’ data set in this study, making the assumption that the presence of a hashtag can be used as a proxy for the topic addressed in a tweet. Observing that hashtags occur either as an element of the content inside a tweet text or as a label at its end, we developed the hypothesis that the position of the hashtag may have an effect on the topicality of the tweet. Hashtags that occur in final position (i.e. they are tweet-final or are only followed by one or more other hashtags) are typically metatags and therefore possibly more reliable as topic identifiers than non-final hashtags, which behave more like common content words in context. In order to be able to investigate the possible effect of the position of the hashtag, we split our data into the following two subsets (a sketch of the hashtag-position test follows the list):

FIN – comprising tweets in which the hashtag occurs in final position (as defined above); 84,533 tweets.

NFI – comprising tweets in which the hashtag occurs in non-final position; 53,608 tweets.
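A minimal sketch of how this split can be computed, assuming whitespace-tokenized tweets and a known event hashtag (the function name is ours, not part of the original pipeline):

```python
def is_hashtag_final(tweet: str, tag: str) -> bool:
    """FIN condition: the event hashtag is tweet-final or is followed
    only by other hashtags; any other position counts as NFI."""
    tokens = tweet.lower().split()
    if tag not in tokens:
        return False
    last = len(tokens) - 1 - tokens[::-1].index(tag)   # last occurrence of the tag
    return all(t.startswith("#") for t in tokens[last + 1:])

# FIN: the hashtag acts as a final metatag.
print(is_hashtag_final("wij zitten er klaar voor #ajafey #loods", "#ajafey"))  # True
# NFI: the hashtag behaves like a content word inside the sentence.
print(is_hashtag_final("zin in #ajafey vanavond", "#ajafey"))                  # False
```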

Each tweet in our data set has a time stamp of the moment (hour-minute-second) it was posted. Moreover, for each football match we know exactly when it took place. This information is used to calculate for each tweet the actual time that remains to the start of the event and the absolute error in estimating the time to event.

2.5.2.2 Temporal Expressions

In the context of this study temporal expressions are considered to be words or phrases which point to the point in time, the duration, or the frequency of an event. These may be exact, approximate, or even outright vague. Although in our current experiment we restrict ourselves to an eight-day period prior to an event, we chose to create a gross list of all possible temporal expressions we could think of, so that we would not run the risk of overlooking any items and the list can be used on future occasions even when the experimental setting is different. Thus the list also includes temporal expressions that refer to points in time outside the time span under investigation here, such as gisteren ‘yesterday’ or over een maand ‘in a month from now’, and items indicating duration or frequency such as steeds ‘continuously’/‘time and again’. No attempt has been made to distinguish between items as regards time reference (future time, past time), as many items can be used in both fashions (compare for example vanmiddag in vanmiddag ga ik naar de wedstrijd ‘this afternoon I’m going to the match’ vs ik ben vanmiddag naar de wedstrijd geweest ‘I went to the match this afternoon’).

The list is quite comprehensive. Among the items included are single words, e.g. adverbs such as nu ‘now’, zometeen ‘immediately’, straks ‘later on’, vanavond ‘this evening’, nouns such as zondagmiddag ‘Sunday afternoon’, and conjunctions such as voordat ‘before’, but also word combinations and phrases such as komende woensdag ‘next Wednesday’. Temporal expressions of the latter type were obtained by means of a set of 615 lexical items and 70 rules, which generated a total of around 53,000 temporal expressions. Notwithstanding the impressive number of items included, the list is bound to be incomplete.12 In addition, there are patterns that match a couple of hundred thousand temporal expressions relating to the number of minutes, hours, or days, or the time of day;13 they include items containing up to 9 words in a single temporal expression.

We included prepositional phrases rather than single prepositions so as to avoid generating too much noise. Many prepositions have several uses: they can be used to express time, but also for example location. Compare voor in voor drie uur ‘before three o’clock’ and voor het stadion ‘in front of the stadium’. Moreover, prepositions are easily confused with parts of separable verbs, which are abundant in Dutch.

12Not all temporal expressions generated by the rules will prove to be correct. Since incorrect items are unlikely to occur and therefore are considered to be harmless, we refrained from manually checking the resulting set.
13For examples see Table 2.3 and Section 2.5.2.3.

Various items on the list are inherently ambiguous and only in one of their senses can be considered temporal expressions. Examples are week ‘week’ but also ‘weak’ and dag ‘day’ but also ‘goodbye’. For items like these, we found that the different senses could fairly easily be distinguished whenever the item was immediately preceded by an adjective such as komende and volgende (both meaning ‘next’). For a few highly frequent items this proved impossible. These are words like zo, which can be either a temporal adverb (‘in a minute’; cf. zometeen) or an intensifying adverb (‘so’), dan ‘then’ or ‘than’, and nog ‘yet’ or ‘another’. As we presently have no way of distinguishing between the different senses, and these items have at best an extremely vague temporal sense so that they cannot be expected to contribute to estimating the time to event, we decided to discard them.14

In order to capture event-targeted expressions, we treated domain terms such as wedstrijd ‘football match’ as parts of temporal expressions in case they co-occur with a temporal expression, e.g. morgen wedstrijd ‘tomorrow football match’.

For the items on the list no provisions were made for handling any kind of spelling variation, with the single exception of abbreviations and a small group of words (including ’s morgens ‘in the morning’, ’s middags ‘in the afternoon’ and ’s avonds ‘in the evening’) which in their standard spelling use the archaic ’s. As many authors of tweets tend to spell these words as smorgens, smiddags and savonds, we decided to include these forms as well.

The items on the list that were obtained through rule-based generation include temporal expressions such as over 3 dagen ‘in 3 days’ and nog 5 minuten ‘another 5 minutes’, but also fixed temporal expressions such as clock times.15 The patterns handle frequently observed variations in their notation; for example drie uur ‘three o’clock’ may be written in full or as 3:00, 3:00 uur, 3 u, 15.00, etc.
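As an illustration of how such notation variants can be matched, here is one hypothetical regular expression covering the clock-time variants mentioned above; it does not reproduce the actual 70-rule grammar used in this study:

```python
import re

# Illustrative pattern only: matches clock times such as "3:00", "3:00 uur",
# "3 u" and "15.00".
CLOCK = re.compile(r"\b([01]?\d|2[0-3])(?:[:.][0-5]\d)?\s?(?:uur|u)?\b")

for text in ["om 3:00 uur", "om 15.00", "om 3 u", "rond 20:45"]:
    match = CLOCK.search(text)
    print(text, "->", match.group(0) if match else None)
```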

Table 2.3 shows example temporal expression estimates and applicable estimation rules. The median estimations are mostly lower than the mean estimations. The distribution of the time to event (TTE) for a single temporal expression often appears to be skewed towards lower values. The final column of the table displays the applicable estimation rules. The first six estimation rules subtract the time the tweet was posted (TT) from an average marker point such as ‘today 20.00’ (i.e. 8 pm) for vanavond ‘tonight’. The second and third estimation rules from the bottom state a TTE directly: over 2 uur ‘in 2 hours’ is directly translated to a TTE of 2.

14Note that nog does occur on the list as part of various multiword temporal expressions. Examples are nog twee dagen ‘another two days’ and nog 10 min ‘10 more minutes’.
15Dates are presently not covered by our patterns but will be added in the following experiments.

2.5.2.3 Evaluation and Baselines

Our approach to TTE estimation makes use of all temporal expressions in our temporal expression list that are found to occur in the tweets. A match may be for a single item in the list (e.g. zondag ‘Sunday’) or any combination of items (e.g. zondagmiddag, om 14.30 uur, ‘Sunday afternoon’, ‘at 2.30 pm’). There can be other words in between these expressions. We consider the longest match, from left to right, in case we encounter any overlap.

The experiment adopts a leave-one-out cross-validation setup. Each iteration uses all tweets from 59 events as training data. All tweets from the single held-out event are used as test set.

In the FIN data set there are 42,396 tweets with at least one temporal expression, in the NFI data set this is the case for 27,610 tweets. The number of tweets per event ranges from 66 to 7,152 (median: 402.5; mean 706.6) for the FIN data set and from 41 to 3,936 (median 258; mean 460.1) for the NFI data set.

We calculate the TTE estimations for every tweet in the test set that contains at least one of the temporal expressions or a combination of these. The estimations for the test set are obtained as follows:

1. For each match (a single temporal expression or a combination of temporal expressions) the mean or median TTE value learned from the training set is used;

2. Temporal expressions that denote an exact amount of time are interpreted by means of estimation rules that we henceforth refer to as Exact rules. This applies for example to temporal expressions answering to patterns such as over N {minuut | minuten | kwartier | uur | uren | dag | dagen | week} ‘in N {minute | minutes | quarter of an hour | hour | hours | day | days | week}’. Here the TTE is assumed to be equal to the N minutes, days, etc. that are mentioned. The Exact rules take precedence over the mean or median estimates learned from the training set;

3. A second set of estimation rules, referred to as the Dynamic rules, is used to calculate the TTE dynamically, using the temporal expression and the tweet’s time stamp. These estimation rules apply to instances such as zondagmiddag om 3 uur ‘Sunday afternoon at 3 p.m.’. Here we assume that this is a future time reference on the basis of the fact that the tweets were posted prior to the event. With temporal expressions that are underspecified in that they do not provide a specific point in time (hour), we postulate a particular time of day. For example, vandaag ‘today’ is understood as ‘today at 3 p.m.’, vanavond ‘this evening’ as ‘this evening at 8 p.m.’, and morgenochtend ‘tomorrow morning’ as ‘tomorrow morning at 10 a.m.’. Again, as was the case with the Exact rules, these Dynamic rules take precedence over the mean or median estimates learned from the training data (a sketch of this precedence follows Table 2.3 below).

Temporal Expression    Gloss                 Rule
zondag                 Sunday                Sunday 15:00 − TT
vandaag 12.30          today 12.30           today 12:30 − TT
in het weekend         during the weekend    no rule
vandaag                today                 today 15:00 − TT
vanavond               tonight               today 20:00 − TT
morgen om 16.30        tomorrow at 16.30     tomorrow 16:30 − TT
over 2 uur             in 2 hours            2 h
nog minder dan 1 u     within 1 h            1 h

Table 2.3: Examples of temporal expressions and their mean and median TTE estimation from training data. The final column lists the applicable estimation rule, if any. Estimation rules make use of the time of posting (Tweet Time, TT).
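The following minimal sketch (ours, with hypothetical rule tables) illustrates the precedence just described: Exact rules first, then Dynamic rules computed from the posting time, then the data-driven estimate:

```python
from datetime import datetime

def estimate_tte(expression, tweet_time, trained_tte, exact_rules, dynamic_rules):
    """Exact rules take precedence, then Dynamic rules, then the
    mean/median TTE learned from the training data (if any)."""
    if expression in exact_rules:                  # states the TTE directly
        return exact_rules[expression]
    if expression in dynamic_rules:                # computed from the posting time
        return dynamic_rules[expression](tweet_time)
    return trained_tte.get(expression)             # data-driven estimate

# Toy rule tables with hypothetical values (see Table 2.3 for the format).
exact_rules = {"over 2 uur": 2.0}                  # 'in 2 hours' -> TTE of 2
dynamic_rules = {
    # 'tonight' is anchored at today 20:00; subtract the tweet time (TT).
    "vanavond": lambda tt: (tt.replace(hour=20, minute=0) - tt).total_seconds() / 3600,
}
trained_tte = {"zondag": 20.0}                     # learned median TTE for 'Sunday'

tt = datetime(2012, 9, 14, 17, 30)                 # tweet posted at 17:30
print(estimate_tte("vanavond", tt, trained_tte, exact_rules, dynamic_rules))  # 2.5
```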

The results for the estimated TTE are evaluated in terms of the absolute error, i.e. the absolute difference in hours between the estimated TTE and the actual remaining time to the event.

We established two naive baselines: the mean and median TTE measured over all tweets of the FIN and NFI data sets. These baselines reflect a best guess when no information is available other than the tweet count and TTE of each tweet. Computed over FIN and NFI together, the mean and median TTE are 22.82 and 3.63 hours before an event, respectively. The low values of the baselines, especially the low median, reveal the skewness of the data: most tweets referring to a football event are posted in the hours before the event.

2.5.3 Results

Table 2.4 lists the overall mean absolute error (in number of hours) for the different variants. The results are reported separately for each of the two data sets (FIN and NFI) and for both sets aggregated (FIN+NFI).16 For each of these three variants, the table lists the mean absolute error when only the basic data-driven TTE estimations are used (‘Basic’), when the Exact rules are added (‘+Ex.’), when the Dynamic rules are added (‘+Dyn’), and when both types of rules are added. The coverage of the combination (i.e. the number of tweets that match the expressions and the estimation rules) is listed in the bottom row of the table.

A number of observations can be made. First, all training methods perform substantially better than the two baselines in all conditions. Second, the TTE training method using the median as estimation produces estimations that are about 1 hour more accurate than the mean-based estimations. Third, adding Dynamic rules has a larger positive effect on prediction error than adding Exact rules.

16Tweets that contain at least one temporal expression in FIN+NFI were used. The actual number of tweets that fall within this scope is provided in the coverage row. The coverage drops relative to the actual number of tweets that contain at least one temporal expression, since we did not assign a time-to-event value to basic temporal expressions that occur only once.

FIN
System            Basic    +Ex.     +Dyn.    +Both
Baseline Median   21.09    21.07    21.16    21.14
Baseline Mean     27.29    27.29    27.31    27.31
Training Median   10.38    10.28     7.68     7.62
Training Mean     11.62    11.12     8.73     8.29
Coverage          31,221   31,723   32,240   32,740

NFI
System            Basic    +Ex.     +Dyn.    +Both
Baseline Median   18.67    18.72    18.79    18.84
Baseline Mean     25.49    25.50    25.53    25.55
Training Median   11.09    11.04     8.65     8.50
Training Mean     12.43    11.99     9.53     9.16
Coverage          18,848   19,176   19,734   20,061

FIN+NFI
System            Basic    +Ex.     +Dyn.    +Both
Baseline Median   20.20    20.20    20.27    20.27
Baseline Mean     26.61    26.60    26.63    26.62
Training Median   10.61    10.54     8.03     7.99
Training Mean     11.95    11.50     9.16     8.76
Coverage          52,186   52,919   53,887   54,617

Table 2.4: Overall Mean Absolute Error for each method: difference in hours between the estimated time to event and the actual time to event, computed separately for the FIN and NFI subsets, and for the combination. For all variants a count of the number of matches is listed in the bottom row.

The bottom row in the table indicates that the estimation rules do not increase the coverage of the method substantially. When taken together and added to the basic TTE estimation, the Dynamic and Exact rules improve over the Basic estimation by two to three hours.

Finally, although the differences are small, Table 2.4 reveals that training on hashtag-final tweets (FIN) produces slightly better overall results (7.62 hours off at best) than training on hashtag-non-final tweets (8.50 hours off) or the combination (7.99 hours off), despite the fact that the FIN training set is smaller than that of the combination.

In the remainder of this section we report on systems that use all expressions and Exact and Dynamic rules.

Figure 2.2: Curves showing the absolute error (in hours) in estimating the time to event over an 8-day period (−192 to 0 hours) prior to the event. The two baselines are compared to the TTE estimation methods using the mean and median variant.

Whereas Table 2.4 displays the overall mean absolute errors of the different variants, Figure 2.2 displays the results in terms of mean absolute error at different points in time before the event, averaged over periods of one hour, for the two baselines and the FIN+NFI variant with the two training methods (i.e. taking the mean versus the median of the observed TTEs for a particular temporal expression). In contrast to Table 2.4, in which only a mild difference could be observed between the median and mean variants of training, the figure shows a substantial difference. The estimations of the median training variant are considerably more accurate than those of the mean variant up to 24 hours before the event, after which the mean variant scores better. By virtue of the fact that the data is skewed (most tweets are posted within a few hours before the event), the two methods attain a similar overall mean absolute error, but it is clear that the median variant produces considerably more accurate predictions when the event is still more than a day away.

While Figure 2.2 provides insight into the effect of median versus mean-based training with the combined FIN+NFI data set, we do not know whether training on either of the two subsets is advantageous at different points in time. Table 2.5 shows the mean absolute error of systems trained with the median variant on the two subsets of tweets, FIN and NFI, as well as the combination FIN+NFI, split into nine time ranges. Interestingly, the combination does not produce the lowest errors close to the event. However, when the event is 24 hours away or more, both the FIN and NFI systems generate increasingly larger errors, while the FIN+NFI system continues to make quite accurate predictions, remaining under 10 hours off even for the longest TTEs, confirming what we already observed in Figure 2.2.

TTE range (h)    FIN      NFI      FIN+NFI
0                2.58     3.07     8.51
1–4              2.38     2.64     8.71
5–8              3.02     3.08     8.94
9–12             5.20     5.47     6.57
13–24            5.63     5.54     6.09
25–48            13.14    15.59    5.81
49–96            17.20    20.72    6.93
97–144           30.38    41.18    6.97
> 144            55.45    70.08    9.41

Table 2.5: Mean Absolute Error for the FIN, NFI, and FIN+NFI systems in different TTE ranges.

2.5.4 Analysis

One of the results observed in Table 2.4 was the relatively limited role of Exact rules, which were intended to deal with exact temporal expressions such as nog 5 minuten ‘5 more minutes’ and over een uur ‘in one hour’. This can be explained by the fact that as long as the temporal expression is related to the event we are targeting, the point in time is denoted exactly by the temporal expression and the estimation obtained from the training data (the ‘Basic’ performance) will already be accurate, leaving no room for the estimation rules to improve on this. The estimation rules that deal with dynamic temporal expressions, on the other hand, have quite some impact.

As explained in Section 2.5.2.2, our list of temporal expressions was a gross list, including items that were unlikely to occur in our present data. In all we observed 770 of the 53,000 items listed, 955 clock time pattern matches, and 764 matches of patterns which contain a number of days, hours, minutes, etc. The temporal expressions observed most frequently in our data are:17 vandaag ‘today’ (10,037), zondag ‘Sunday’ (6,840), vanavond ‘tonight’ (5,167), straks ‘later on’ (5,108), vanmiddag ‘this afternoon’ (4,331), matchday ‘match day’ (2,803), volgende week ‘next week’ (1,480) and zometeen ‘in a minute’ (1,405).

Given the skewed distribution of tweets over the eight days prior to the event, it is not surprising to find that nearly all of the most frequent items refer to points in time within close range of the event. Apart from nu ‘now’, all of these are somewhat vague about the exact point in time. There are, however, numerous items such as om 12:30 uur ‘at half past twelve’ and over ongeveer 45 minuten ‘in about 45 minutes’, which are very specific and therefore tend to appear with middle to low frequencies.18 And while it is possible to state an exact point in time even when the event is in the more distant future, we find that there is a clear tendency to use underspecified temporal expressions while the event is still some time away. Thus, rather than volgende week zondag om 14.30 uur ‘next week Sunday at 2.30 p.m.’, just volgende week is used, which makes it harder to estimate the time to event.

Closer inspection of some of the temporal expressions which yielded large absolute errors suggests that these may be items that refer to subevents rather than the main event (i.e. the match) we are targeting. Examples are eerst ‘first’, daarna ‘then’, vervolgens ‘next’, and voordat ‘before’.

2.5.5 Conclusion

We have presented a method for the estimation of the TTE from single tweets referring to a future event. In a case study with Dutch football matches, we showed that estimations are about eight hours off on average, over a time window of eight days. There is some variance among the 60 events on which we tested in a leave-one-out validation setup: errors ranged between 4 and 13 hours, plus one exceptionally badly predicted event with a 34-hour error.19

The best system is able to stay within 10 hours of prediction error in the full eight-day window. This best system uses a large hand-designed set of temporal expressions that in a training phase have each been linked to a median TTE with which they occur in a training set.20 Together with these data-driven TTE estimates, the system uses a set of estimation rules that match on exact and indirect time references. In a comparative experiment we showed that this combination worked better than only having the data-driven estimations.

17The observed frequencies are given between brackets.
18While an expression such as om 12:30 uur has a frequency of 116, nog maar 8 uur en 35 minuten ‘only 8 hours and 35 minutes from now’ has a frequency of 1.
19This is the case where two matches between the same two teams were played a week apart, i.e. a premier league and a cup match.

We then tested whether it was more profitable to train on tweets that have the event hashtag at the end, as such a hashtag is presumed to be more likely a metatag, and thus a more reliable clue that the tweet is about the event than a hashtag in non-final position. Indeed we find that the overall predictions are more accurate, but only in the final hours before the event (when most tweets are posted). From 24 hours before the event and earlier, it turns out to be better to train on both hashtag-final and hashtag-non-final tweets.

Finally, we observed that the two variants of our method of estimating TTEs for single temporal expressions, taking the mean or the median, lead to dramatically different results, especially when the event is still a few days away, which is precisely when an accurate time to event is most desirable. The median-based estimations, which are generally smaller than the mean-based estimations, lead to a system that largely stays under 10 hours of error.

Our study has a number of logical extensions that are implemented in the following section. First, our method is not bound to a single type of event, although we tested it in a controlled setting. With experiments on tweet streams related to different types of events the general applicability of the method could be tested: can we use the trained TTE estimations from our current study, or would we need to retrain per event type?

Moreover, our method is limited to temporal expressions. For estimating the time to event on the basis of tweets that do not contain temporal expressions, we could benefit from term-based approaches that consider any word or word n-gram as potentially predictive (Hürriyetoğlu et al., 2013).

Finally, each tweet is evaluated individually in generating an estimate. However, estimates can be combined iteratively in order to arrive at a final estimate that integrates the estimations from previously posted tweets.

The following section reports the details of a study that extends the study reported in this section by including cross-domain evaluation, adding word-based features, and integrating historical information (a window of previously posted tweets) into the estimate generation process.

20The large hand-designed set of temporal expressions was required to ensure that the approach we have developed is not significantly affected by any missing temporal expressions. Having shown that our approach is effective on this task, future implementations of the method can be based on the most frequent temporal expressions or on an available temporal tagger (Chang & Manning, 2012).

2.6 Estimating Time to Event based on Linguistic Cues on Twitter

Based on: Hürriyetoğlu, A., Oostdijk, N., & van den Bosch, A. (2018). Estimating Time to Event based on Linguistic Cues on Twitter. In K. Shaalan, A. E. Hassanien, & F. Tolba (Eds.), Intelligent Natural Language Processing: Trends and Applications (Vol. 740). Springer International Publishing. Available from http://www.springer.com/cn/book/9783319670553

Given a stream of Twitter messages about an event, we investigate the predictive power of features generated from words and temporal expressions in the messages to estimate the time to event (TTE). From labeled training data average TTE values of the predictive features are learned, so that when they occur in an event-related tweet the TTE estimate can be provided for that tweet. We utilize temporal logic rules for estimation and a historical context integration function to improve the TTE estimation precision. In experiments on football matches and music concerts we show that the estimates of the method are off by four and ten hours on average, respectively, in terms of mean absolute error. We find that the type and size of the event affect the estimation quality. An out-of-domain test on music concerts shows that models and hyperparameters trained and optimized on football matches can be used to estimate the remaining time to concerts. Moreover, mixing in concert events in training improves the precision of the average football event estimate.

2.6.1 Introduction

We extend the time-to-event estimation method that was reported in the previous section and implement it as an expert system that can process a stream of tweets in order to provide an estimate of the starting time of an event. Our ultimate goal is to provide an estimate for any type of event: football matches, music concerts, labour strikes, floods, etcetera. The reader should consider our study as a first implementation of a general time-to-event estimation framework. We focus on just two types of events, football matches and music concerts, as it is relatively straightforward to collect gold-standard event dates and times for these types of events. We would like to stress, however, that the method is not restricted in any way to these two types; ultimately it should be applicable to any type of event, also event types for which no generic database of event dates and times is available.

In this study we explore a hybrid rule-based and data-driven method that exploits the explicit mentioning of temporal expressions but also other lexical phrases that implicitly encode time-to-event information, to arrive at accurate and early TTE estimates based on the Twitter stream. At this stage, our focus is only on estimating the time remaining to a given event. Deferring the fully automatic detection of events to future work, in our current study we identify events in Twitter microposts by event-specific hashtags.

We aim to develop a method that can estimate the time to any event and that will assist users in discovering upcoming events in a period they are interested in. We automate the TTE estimation part of this task in a flexible manner so that we can generate continuous estimates based on realistic quantities of streaming data over time. Our estimation method offers a solution for the representation of vague terms, by inducing a continuous value for each possible predictive term from the training data.

The service offered by our method will be useful only if it generates accurate estimates of the time to event. Preferably, these accurate predictions should come as early as possible. We use a combination of rule-based, temporal, and lexical features to create a flexible setting that can be applied to events that may not contain all these types of information.

In the present study we use the same estimation method as Hürriyetoğlu et al. (2014). We scale up the number of temporal expressions drastically compared to what the HeidelTime tagger offers and what other time-to-event estimation methods have used (Ritter et al., 2012; Hürriyetoğlu et al., 2013). We also compare this approach to using any other lexical words or word skipgrams in order to give the best possible estimate of the remaining time to an event. Moreover, we implement a flexible feature generation and selection method, and a history function which uses the previous estimates as a context. In our evaluation we also look into the effects of cross-domain parameter and model transfer.

The remainder of this section is structured as follows. We start by introducing our time-to-event estimation method in Section 2.6.2. Next, in Section 2.6.3, we explain the feature sets we designed, the training and test principles, the hyperparameter optimization, and the evaluation method. We then, in Section 2.6.4, describe the results for football matches and music concerts by measuring the effect of cross-domain parameter and model transfer on the estimation quality. In Section 2.6.5 we analyze the results and summarize the insights obtained as regards various aspects of the TTE estimation method. Finally, Section 2.6.6 concludes this study with a summary of the main findings for each data set.

2.6.2 Time-to-Event Estimation Method

Our time-to-event (TTE) estimation method consists of training and estimation steps. First, training data is used to identify the predictive features and their values. Then, each tweet is assigned a TTE value based on the predictive features it contains and on the estimates from recent tweets for the same event stored in a fixed time buffer.

The training phase starts with feature extraction, generation, and selection. During feature extraction we first extract tokens as unigrams and bigrams and identify which of these are temporal expressions or occur in our lexicons. Then we select the most informative features for the final feature set, based on the frequency and the standard deviation of their temporal distribution.

The estimation phase consists of two steps. First, an estimation function assigns a TTE estimate to a tweet based on the predictive features that occur in this tweet. Afterwards the estimate is adjusted by a historical function based on a buffer of previous estimations. Historical adjustment restricts the extent to which consecutive estimates can deviate from each other.
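A minimal sketch of these two steps, assuming a median estimation function and a simple blend with the buffered history (the exact adjustment used in the thesis is configured by hyperparameters discussed later):

```python
from collections import deque
from statistics import median

class TTEEstimator:
    """Two-step estimation: a per-tweet estimate from its features,
    then adjustment against a fixed-size buffer of recent estimates."""

    def __init__(self, feature_values, window=100):
        self.feature_values = feature_values    # feature -> learned TTE (hours)
        self.history = deque(maxlen=window)     # buffer of previous estimates

    def estimate(self, features):
        values = [self.feature_values[f] for f in features if f in self.feature_values]
        if not values:
            return None
        raw = median(values)                    # estimation function (median here)
        # Hypothetical adjustment: blend with the median of the buffer so
        # consecutive estimates cannot deviate arbitrarily from each other.
        adjusted = (raw + median(self.history)) / 2 if self.history else raw
        self.history.append(adjusted)
        return adjusted

est = TTEEstimator({"vanavond": 3.0, "wedstrijd": 5.0})
print(est.estimate(["vanavond", "wedstrijd"]))   # 4.0 (no history yet)
```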

As we aim to develop a method that can be applied to any type of event and any language, our approach is guided by the following considerations:

1. We refrain from using any domain-specific prior knowledge such as ‘football matches occur in the afternoon or in the evening’ which would hinder the prediction of matches held at unusual times, or more generally the application of trained models to different domains;

2. We use language analysis tools sparingly and provide the option to use only words so that the method can be easily adapted to languages for which detailed language analysis tools are not available, less well developed, or less suited for social media text;

3. We use basic statistics (e.g. computing the median) and straightforward time series analysis methods to keep the method efficient and scalable;

4. We use the full resolution of time to provide precise estimates. Our approach treats TTE as a continuous value: calculations are made and the results are reported using decimal fractions of an hour.

2.6.2.1 Features

It is obvious to the human beholder that tweets referring to an event exhibit different patterns as the event draws closer. Not only do the temporal expressions that are used change (e.g. from ‘next Sunday’ to ‘tomorrow afternoon’), the level of excitement rises as well, as the following examples show:

122 hours before: björn kuipers is door de knvb aangesteld als scheidsrechter voor de wedstrijd fc utrecht psv van komende zondag 14.30 uur (En: Björn Kuipers has been assigned to be the referee for Sunday’s fixture between FC Utrecht and PSV, played at 2.30 PM)

69 hours before: zondag thuiswedstrijd nummer drie fc utrecht psv kijk ernaar uit voorbereidingen in volle gang (En: Sunday the 3rd match of the season in our stadium FC Utrecht vs. PSV, excited, preparations in full swing)

27 hours before: dick advocaat kan morgenmiddag tegen fc utrecht beschikken over een volledig fitte selectie (En: Dick Advocaat has a fully healthy selection at his disposal tomorrow afternoon against FC Utrecht)

8 hours before: werken hopen dat ik om 2 uur klaar ben want t is weer matchday (En: working, hope I’m done by 2 PM, because it’s matchday again)

3 hours before: onderweg naar de galgenwaard voor de wedstrijd fc utrecht feyenoord #utrfey (En: on my way to Galgenwaard stadium for the football match fc utrecht feyenoord #utrfey)

1 hour before: wij zitten er klaar voor #ajafey op de beamer at #loods (En: we are ready for #ajafey by the data projector at #loods)

0 hours before: rt @username zenuwen beginnen toch enorm toe te nemen (En: RT @username starting to get really nervous)

The temporal order of different types of preparations for the event can be seen clearly in the tweets above. The stream starts with a referee assignment and continues with people expressing their excitement and planning to go to the event, until finally the event starts. Our goal is to learn to estimate the TTE from texts like these, that is, from tweets, along with their time stamps, by using linguistic clues that, explicitly or implicitly, refer to the time when an event is to take place.

Below we introduce the three types of features that we use to estimate the TTE: temporal expressions, estimation rules, and word skipgrams. We extended the temporal expressions and estimation rules created in Section 2.5 for this study. We provide their characteristics and formal description in detail in the following subsections.

Temporal Expressions

A sample of patterns that are used to generate the temporal expressions is listed below. The first and second examples provide temporal expressions without numerals and the third one illustrates how the numbers are included. The square and round brackets denote obligatory and optional items, respectively. The vertical bar is used to separate alternative items. Examples of derived temporal expressions are presented between curly brackets.

1. [ in | m.i.v. | miv | met ingang van | na | t/m | tegen | tot en met | van | vanaf | voor] + (de maand) + [ jan | feb | mrt | apr | mei | jun | jul | aug | sep | okt | nov | dec ] → {in de maand Jan ‘in the month of January’, met ingang van jul ‘as of July’ }

2. een + [ minuut | week | maand | jaar ] + (of wat) + [eerder | later] → {een minuut eerder ‘one minute earlier’, een week later ‘one week later’ }

3. N>1 + [ minuten | seconden | uren | weken | maanden | jaren | eeuwen ] + [ eerder | eraan voorafgaand | ervoor | erna | geleden | voorafgaand | later ] → {7 minuten erna ‘seven minutes later’, 2 weken later ‘two weeks later’ }

The third example introduces N, a numeral that in this case varies between 2 and the maximal value of numerals in all expressions, 120.21 The reason that N > 1 in this example is that it generates expressions with plural forms such as ‘minutes’, ‘seconds’, etc.

Including numerals in the generation of patterns yields a total of 460,248 expressions expressing a specific number of minutes, hours, days, or time of day.
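As an illustration, a fragment of pattern 3 above can be expanded mechanically; the lists below are truncated, so the counts are much smaller than the 460,248 expressions of the full grammar:

```python
from itertools import product

# Expanding a truncated fragment of pattern 3 (N ranges over 2..120).
units = ["minuten", "seconden", "uren", "weken"]
relations = ["eerder", "ervoor", "erna", "geleden", "later"]

expressions = [f"{n} {unit} {relation}"
               for n, unit, relation in product(range(2, 121), units, relations)]

print(len(expressions))   # 119 * 4 * 5 = 2,380 expressions from this fragment alone
print(expressions[:2])    # ['2 minuten eerder', '2 minuten ervoor']
```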

Finally, combinations of a time, weekday or day of the month, e.g., ‘Monday 21:00’, ‘Tuesday 18 September’, and ‘1 apr 12:00’ are included as well. Notwithstanding the substantial number of items included, the list is bound to be incomplete.22

Despite the large number of pre-generated temporal expressions, the number of expressions actually encountered in the FM data set is only 2,476; in the smaller MC set we find even fewer temporal expressions, viz. 398.

21The value of N was selected to cover the temporal expressions that fall into the time-to-event period we focus on.
22Dates, which denote the complete year, month and day of the month, are presently not covered by our patterns but will be added in future.

This set also contains skipgrams, i.e. feature sequences that are created via concatenation of neighboring temporal expressions, while ignoring in-between tokens. Consider, for instance, the tweet Volgende wedstrijd: Tegenstander: Palermo Datum: zondag 2 november Tijdstip: 20:45 Stadion: San Siro ‘Next match: Opponent: Palermo Date: Sunday 2 November Time: 20:45 Stadium: San Siro’. The basic temporal features in this tweet are, in their original order, <zondag 2 november> and <20:45>. The skipgram generation step will ignore the in-between tokens and concatenate these features into a single skipgram feature.

We compare the performance of our temporal expression detection based on the TFeats list with that of the Heideltime tagger23, which is the only available temporal tagger for Dutch, on a small set of tweets, i.e. 18,607 tweets from the FM data set.24 There are 10,183 tweets that contain at least one temporal expression detected by the Heideltime tagger or matched by our list. 5,008 temporal expressions are identified by both. In addition, the Heideltime tagger detects 429 expressions that our list does not; vice versa, our list detects 2,131 expressions that Heideltime does not detect. In the latter category are straks ‘soon’, vanmiddag ‘this afternoon’, dalijk ‘immediately’ (colloquial form of dadelijk), nog even ‘a bit’, over een uurtje ‘in 1 hour’, over een paar uurtjes ‘in a few hours’, and nog maar 1 nachtje ‘only one more night’. On the other hand, the Heideltime tagger detects expressions such as de afgelopen 22 jaar ‘the last 22 years’, 2 keer per jaar ‘2 times a year’, and het jaar 2012 ‘the year 2012’. This can easily be explained, as our list currently by design focuses on temporal expressions that refer to the short-term future, and not to the past or the long term. Also, the Heideltime tagger recognizes some expressions that we rule out intentionally due to their ambiguous interpretation. For instance, this is the case for jan ‘Jan’ (name of a person or the month of January) and volgend ‘next’. In sum, as we are focusing on upcoming events and our list has a higher coverage than Heideltime, we continue working with our TFeats list.

Estimation Rules

When we only want to use absolute forward-pointing temporal expressions as features, we do not need temporal logic to understand their time-to-event value. These expressions provide the time to event directly: e.g. ‘in 30 minutes’ indicates that the event will take place in 0.5 hours. We therefore introduce estimation rules that make use of temporal logic to define the temporal meaning of TFeats features that have a context-dependent time-to-event value. We refer to these features as non-absolute temporal expressions.

23We used the Heideltime tagger (version 1.7), enabling the interval tagger and configuring the NEWS genre.
24This subset is used to optimize the hyperparameters as well.

Non-absolute, dynamic TTE temporal expressions such as days of the week, and date-bearing temporal expressions such as 18 September, can on the one hand be detected relatively easily, but on the other hand require further computation based on temporal logic using the time the tweet was posted. For example, the TTE value of a future weekday should be calculated according to the referred day and the time of the occurrence. Therefore we use the estimation rules list from Hürriyetoğlu et al. (2014) and extend it. We define estimation rules against the background of:

1. Adjacency: We specify only contiguous relations between words, i.e. without allowing any other word to occur in between;

2. Limited scope: An estimation rule can indicate a period up to 8 days before an event; thus we do not cover temporal expressions such as nog een maandje ‘another month’ and over 2 jaar ‘in 2 years’;

3. Precision: We refrain from including estimation rules for highly ambiguous and frequent terms such as nu ‘now’ and morgen ‘tomorrow’;

4. Formality: We restrict the estimation rules to canonical (normal) forms. Thus we do not include estimation rules for expressions like over 10 min ‘in 10 min’ and zondag 18 9 ‘Sunday 18 9’;

5. Approximation: We round the estimations to fractions of an hour with maximally two decimals; the estimation rule states that minder dan een halfuur ‘less than half an hour’ corresponds to 0.5 hour;

6. Partial rules: We do not aim to parse all possible temporal expressions. Although using complex estimation rules and language normalization could increase the coverage and performance, this approach has its limits and would decrease the practicality of the method. Therefore, we define estimation rules up to a certain length, which may cause a long temporal expression to be detected partially. A complex estimation rule would in principle recognize the temporal expression “next week Sunday 20:00” as one unit, but we only implement basic estimation rules that will recognize “next week” and “Sunday 20:00” as two different temporal expressions. Our method will combine their standalone values and yield one value for the string as a whole.

As a result we have two sets of estimation rules, which we henceforth refer to as RFeats. RFeats consists of the Exact and Dynamic rules that were defined in Section 2.5.2.3 and extended in the scope of this study.

Word Skipgrams

In contrast to temporal expressions we also generated an open-ended feature set that draws on any word skipgram occurring in our training set. This feature set is crucial in discovering predictive features not covered by the time-related TFeats or RFeats. A generic method may have the potential of discovering expressions already present in TFeats, but with this feature type we expressly aim to capture any lexical expressions that do not contain any explicit start time, yet are predictive of start times. Since this feature type requires only a word list, it can be smoothly adapted to any language.

We first compiled a list of regular, correctly spelled Dutch words by combining the OpenTaal flexievormen and basis-gekeurd word lists.25 From this initial list we then removed all stop words, foreign words, and entity names. The stop word list contains 839 entries. These are numerals, prepositions, articles, discourse connectives, interjections, exclamations, single letters, auxiliary verbs, and any abbreviations of these. Foreign words were removed in order to avoid spam and unrelated tweets. Thus, we removed English words which are in the OpenTaal flexievormen or OpenTaal basis-gekeurd word lists as we came across them.26 We also used two lists to identify named entities: Geonames27 for place names in the Netherlands and OpenTaal basis-ongekeurd for other named entities. The final set of words comprises 317,831 entries. The ‘FM without retweets’ and ‘MC without retweets’ data sets contain 17,646 and 2,617 of these entries, respectively.

These lexical resources were used to control the number and complexity of the features. In principle, the word lists could also be extracted from the set of the tweets that are used as training set.

Next, the words were combined to generate longer skipgram features based on the words that were found to occur in a tweet. For example, given the tweet goed weekendje voor psv hopen dat het volgend weekend hét weekend wordt #ajapsv bye tukkers ‘a good weekend for psv hoping that next weekend will be the weekend #ajapsv bye tukkers’ we obtained the remaining content words in their original order, e.g. <goed>, <weekendje>, <volgend>, and <weekend>. From this list of words, we then generated skipgrams up to n = 7.28 Retaining the order, we generated all possible combinations of the selected words. The feature set arrived at by this feature generation approach is henceforth referred to as WFeats.

25We used the OpenTaal flexievormen, basis-gekeurd, and basis-ongekeurd word lists from the URL: https://www.opentaal.org/bestanden/file/2-woordenlijst-v-2-10g-bronbestanden, accessed June 10, 2018.
26These are: different, indeed, am, ever, field, indeed, more, none, or, wants.
27http://www.geonames.org/, accessed June 10, 2018.
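Order-preserving skipgram generation of this kind can be sketched with itertools.combinations, which enumerates subsequences of the word list (a sketch under our own naming, not the original implementation):

```python
from itertools import combinations

def skipgram_features(words, max_n=7):
    """All order-preserving combinations of the selected words up to
    length max_n, i.e. skipgrams that may skip in-between tokens."""
    feats = []
    for n in range(1, min(max_n, len(words)) + 1):
        feats.extend(combinations(words, n))
    return feats

words = ["goed", "weekendje", "volgend", "weekend"]
feats = skipgram_features(words)
print(len(feats))   # 15: all non-empty ordered subsets of four words
print(feats[4])     # ('goed', 'weekendje') -- the first length-2 skipgram
```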

2.6.2.2 Feature Selection

Each tweet in our data set has a time stamp for the exact time it was posted. Moreover, for each event we know precisely when it took place. This information is used to calculate, for each feature, the series of all occurrence times relative to the event start time, hereafter referred to as the time series of a feature. The training starts with the selection of features that carry some information regarding the remaining time to an event, based on their frequency and the standard deviation of their occurrence times relative to the event start time. A feature time series should contain more than one occurrence to be taken into account, and should have a standard deviation below a certain threshold for the feature to be considered in the feature value assignment phase. The standard deviation threshold is based on a fixed number of highest quantile regions, a number which is optimized on a development set. Features that are in the highest standard deviation quantile regions are eliminated.
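A compact sketch of this selection step, assuming per-feature lists of observed TTE values and a single quantile cut-off (a simplification of the quantile regions described above):

```python
import numpy as np

def select_features(time_series, quantile=0.20):
    """Keep features that occur more than once and whose standard
    deviation of TTE values lies below the upper-quantile cut-off."""
    candidates = {f: np.std(ts) for f, ts in time_series.items() if len(ts) > 1}
    threshold = np.quantile(list(candidates.values()), 1.0 - quantile)
    return {f for f, sd in candidates.items() if sd < threshold}

# Toy input: feature -> TTE values (hours before the event) at which it occurred.
series = {"vanavond": [2.1, 3.4, 2.8, 4.0],      # concentrated -> kept
          "zondag": [90.0, 40.0, 10.0, 140.0],   # spread out -> eliminated
          "stadion": [1.0]}                      # hapax -> discarded up front
print(select_features(series))                   # {'vanavond'}
```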

2.6.2.3 Feature Value Assignment

The value of a feature time series is estimated by a training function. In the current study, the training function is either the mean or the median of the actual TTE values of the selected features encountered in the training data. The proper training function is selected on the basis of its performance on a development set. This method does not need any kind of frequency normalization. We consider this to be an advantage, as now there is no need to take into account daily periodicity or tweet distribution.

Figures 2.3 and 2.4 visualize the distribution of the TTE values of selected features from both feature sets. The distributions are fitted through kernel density estimation using Gaussian kernels.29 Kernel density curves visualize the suitability of a feature: the sharper the curve and the higher its peak (i.e. the higher its kurtosis), the more accurate the feature is for TTE estimation. Figure 2.3 illustrates how peaks in WFeats features may inform about different phases of an event. The features aangesteld, scheidsrechter ‘appointed, referee’, aanwezig, wedstrijd ‘present, game’, and onderweg, wedstrijd ‘on the road, match’ relate to different preparation phases of a football match. The feature allerlaatst, kaarten ‘latest, tickets’ refers to the act of buying tickets; overmorgen ‘the day after tomorrow’, a temporal expression (which is also included in the WFeats set), indicates the temporal distance to the event in terms of days. The curves are either sharply concentrated on particular points in time (e.g. ‘on the road, match’ refers to travel within hours before the match), or fit a broad array of low-density data points.

28The range 7 was selected in order to benefit from any long-distance relations between words. The limited word count in a tweet hardly allows higher ranges.
29We used the gaussian_kde method from SciPy v0.14.0, URL: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html, accessed June 10, 2018.

In contrast, Figure 2.4 displays kernel density curves of selected features from the TFeats set. The features are over een week ‘over a week’, nog 5 dagen ‘another 5 days’, nog 4 dagen ‘another 4 days’, nog twee nachtjes ‘another two nights’, morgenavond ‘tomorrow evening’, and 8 uur ‘8 hours’, and show relatively similar curves. This suggests that these temporal expressions tend to have a similar standard deviation of about a full day. Indeed, the expression morgenavond ‘tomorrow evening’ may be used in the early morning of the day before up to the night of the day before.

2.6.2.4 Time-to-Event Estimation for a Tweet

The TTE estimate for a tweet will be calculated by the estimation function using the TTE estimates of all features observed in a particular tweet.

Since TFeats and RFeats can be sparse as compared to WFeats, and since we want to keep the method modular, we provide estimates for each feature set separately. The TFeats and WFeats that occur in a tweet are evaluated by the estimation function to provide the estimation for each feature set. We use the mean or the median as an estimation function.

To improve the estimate of a current tweet, we use a combination method for the available features in a tweet and a historical context integration function in which we take into account all estimates generated for the same event so far, taking the median of these earlier estimates as a fourth estimate besides those generated on the basis of TFeats, RFeats and WFeats. The combination was generated by using the estimates for a tweet in the following order: (i) if the RFeats generate an estimate, which is the mean of all estimation rule values in a tweet, the TTE estimate for the tweet is that estimate; (ii) else, use the TFeats-based model to generate an estimate; (iii) finally, if there is not yet an estimate for the tweet, use the WFeats-based model to generate an estimate. The priority order that places rule estimates before TFeats estimates, and TFeats estimates before WFeats estimates, is used in combination with the history function. Our method integrates the history as follows: (i) rule estimates only use the history of rule estimates; (ii) TFeats estimates use the history of rule estimates and TFeats estimates; and (iii) WFeats estimates do not make any distinction between the source of previous estimates that enter the history window calculation. In this manner the precise estimates generated by RFeats are not overwritten by the history of lower-precision estimates, which does happen in the performance-based combination.
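The priority order and the source-aware history can be sketched as follows (blending an estimate with its history median is our simplification of the combination described above):

```python
from statistics import median

def combined_estimate(rule_est, tfeat_est, wfeat_est, history):
    """Rule estimates take priority over TFeats, and TFeats over WFeats;
    each tier only mixes in history from at-least-as-precise sources."""
    if rule_est is not None:
        source, est = "rule", rule_est
        context = [e for s, e in history if s == "rule"]
    elif tfeat_est is not None:
        source, est = "tfeat", tfeat_est
        context = [e for s, e in history if s in ("rule", "tfeat")]
    elif wfeat_est is not None:
        source, est = "wfeat", wfeat_est
        context = [e for s, e in history]        # WFeats uses all previous estimates
    else:
        return None
    if context:                                  # blend with the history median
        est = (est + median(context)) / 2        # (simplified integration)
    history.append((source, est))
    return est

history = []
print(combined_estimate(None, 5.0, 9.0, history))  # 5.0 (TFeats tier, no history yet)
print(combined_estimate(2.0, 5.0, 9.0, history))   # 2.0 (rule tier ignores TFeats history)
```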

2.6.3 Experimental Set-up

In this section we describe the experiments carried out using our TTE estimation method, the data sets gathered for the experiments, and the training and test regimes to which we subject our method. The section concludes with our evaluation method and the description of two baseline systems.

Retweets repeat the information of the source tweet and possibly occur much later in time than the original post. While including retweets could improve the performance under certain conditions (Batista, Prati, & Monard, 2004), results from a preliminary experiment we carried out using development data show that eliminating retweets yields better or comparable results for the TTE estimation task most of the time. Therefore, in the experiments reported here we used the data sets (FM and MC) without retweets.

2.6.3.1 Training and Test Regimes

After extracting the relevant features, our TTE estimation method proceeds with the feature selection, feature value assignment, and tweet TTE estimation steps. Features are selected according to their frequency and distribution over time. We then calculate the TTE value of selected features by means of a training function. Finally, the values of the features that co-occur in a single tweet are given to the estimation function, which on the basis of these single estimates generates an aggregated TTE estimation for the tweet.

The training and estimation functions are applied to TFeats and WFeats. The values of RFeats are not trained, but are already set to certain values in the estimation rules themselves, as stated earlier.

2.6.3.2 Evaluation and Baselines

The method is evaluated on 51 of the 60 football matches and all 32 music concert events. Nine football matches (15%), selected randomly, are held out as a development set to optimize the hyperparameters of the method.

Experiments on the test data are run in a leave-one-event-out cross-validation setup. In each fold the model is trained on all events except one, and tested on the held-out event; this is repeated for all events.

We report two performance metrics in order to gain different quantified perspectives on the errors of the estimates generated by our systems. The error is represented as the difference between the actual TTE, v_i, and the estimated TTE, e_i. Mean Absolute Error (MAE), given in Equation 2.2, represents the average of the absolute value of the estimation errors over test examples i = 1 ... N. Root Mean Squared Error (RMSE), given in Equation 2.1, sums the squared errors, divides the sum by the number of predictions N, and then takes the square root of the result. RMSE penalizes outlier errors more than MAE does.

\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| v_i - e_i \right| \qquad (2.2)

We computed two straightforward baselines derived from the test set: the mean and median TTE of all tweets. They are computed by taking the mean or the median of the TTE of all training tweets. The mean baseline estimate is approximately 21 hours, while the median baseline is approximately 4 hours. Baseline estimations are calculated by assigning every tweet the corresponding baseline value as an estimate. For instance, all estimates for the median baseline will be the median TTE of the tweets, which is 4 hours.

Although the baselines are simplistic, the distribution of the data makes them quite strong. For example, 66% of the tweets in the FM data set occur within 8 hours before the start of the football match; the median baseline therefore generates an error of under 4 hours for 66% of the tweets.
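A small sketch, with toy TTE values, of how the two baselines behave on skewed data:

```python
from statistics import mean, median

# Toy TTE values (hours before the event) for training tweets; like the
# FM data, most mass sits close to the event start.
ttes = [0.5, 1.0, 2.0, 3.0, 4.0, 6.0, 30.0, 100.0]

mean_baseline = mean(ttes)      # ~18.3 h: every tweet gets this estimate
median_baseline = median(ttes)  # 3.5 h: every tweet gets this estimate

mae_median = mean(abs(t - median_baseline) for t in ttes)
print(f"mean baseline {mean_baseline:.1f} h, median baseline {median_baseline:.1f} h")
print(f"MAE of the median baseline: {mae_median:.1f} h")  # small for most tweets
```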

In addition to error measures, we take coverage into account as well. Coverage reflects the percentage of the tweets of an event for which an estimate is generated. Evaluating coverage is important, as it relates to the recall of the method. Coverage is not recall (a match on a tweet does not mean that the estimate is correct), but it sets an upper bound on the percentage of relevant tweets a particular method is able to use. A high coverage is crucial in order to be able to handle events that have few tweets, to start generating estimations as early as possible, to apply a trained model to a different domain, and to increase the benefit of using the history of the tweet stream as context. Thus, we seek a balance between a small error rate and a high coverage. Twitter data typically contains many out-of-vocabulary words (Baldwin et al., 2013), so attaining a high coverage remains a challenge.

2.6.3.3 Hyperparameter Optimization

The performance of our method depends on three hyperparameters and two types of functions, for which we tried to find an optimal combination of settings by testing on the development set of nine football matches:

Standard Deviation Threshold – the quantile cut-off point for highly deviating terms, ranging from not eliminating any feature (0.0) to eliminating the 0.40 quantile of features with the highest standard deviation;

Feature length – maximum number of words or temporal expressions captured in a feature, from 1 to 7;

Window Size – the number of previous estimations used as a context to adjust the es- timate of the current tweet, from 0 to the complete history of estimations for an event.

The frequency threshold of features is not a hyperparameter in our experiments. Instead, we eliminate hapax features among the WFeats and TFeats that are assigned a time-to-event value from the training data.

We want to identify which functions should be used to learn the feature values and perform estimations. Therefore we test the following functions:

Training Function – calculates the mean or median TTE of a feature over all training tweets;

Estimation Function – calculates the mean or median value on the basis of all features occurring in a tweet (both functions are sketched below).
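A compact sketch of these two functions under the settings above; the data layout (a tweet as a pair of its feature set and its actual TTE) and the names are illustrative:

    import statistics
    from collections import defaultdict

    def train_feature_values(training_tweets, agg=statistics.median):
        # Training function: assign each feature the mean or median TTE
        # of all training tweets in which it occurs.
        observed = defaultdict(list)
        for features, tte in training_tweets:      # one tweet = (features, actual TTE)
            for f in features:
                observed[f].append(tte)
        return {f: agg(ttes) for f, ttes in observed.items()}

    def estimate_tte(features, feature_values, agg=statistics.median):
        # Estimation function: aggregate the trained values of the
        # selected features that co-occur in a single tweet.
        values = [feature_values[f] for f in features if f in feature_values]
        return agg(values) if values else None     # None: the tweet is not covered

    train = [({"nog uurtje", "uurtje"}, 1.2), ({"uurtje", "radio"}, 0.9)]
    values = train_feature_values(train)
    print(estimate_tte({"uurtje", "radio"}, values))   # median of 1.05 and 0.9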

Figures 2.5 and 2.6 show how the feature length and the quantile cut-off point affect the MAE and the coverage on the development data set. The larger the feature length and the higher the quantile cut-off point, the lower both the coverage and the MAE. We aim for a low MAE and a high coverage.

Long features, which consist of word skipgrams with n ≥ 2, are not used on their own but are added to the available feature set, which consists of contiguous n-grams. It is perhaps surprising to observe that adding longer features causes coverage to drop. The reason is that longer features are less frequent and have smaller standard deviations than shorter features. Shorter features are spread over a long period, which makes their standard deviation higher: this causes these short features to be eliminated by the quantile-threshold-based feature selection in favor of the longer features that occur less frequently. Additionally, selecting higher quantile cut-off points eliminates many features, which causes the coverage to decrease.

Since our hyperparameter optimization shows that the MAE does not improve for features longer than 2, a feature length of n = 2 is used for both WFeats and TFeats. The quantile cut-off point is 0.20 for WFeats, as higher cut-off points decrease the coverage without improving the MAE. For TFeats, the quantile cut-off point is set at 0.25. These parameter values are used both for optimizing the history window length on the development set and for running the actual experiment. Using these parameters yields a reasonable coverage of 80% on the development set.

Using the median function for both training and estimate generation provides the best result of 13.13 hours MAE with WFeats. In this case the median of the median TTE values of the selected features in a tweet is taken as the TTE estimation of the tweet. Median training with mean estimation, mean training with median estimation, and mean training with mean estimation yielded 13.15, 14.11, and 14.04 hours respectively.

The window size of the historical context integration function is considered a hyperparameter as well, and is therefore also optimized. The historical context is included as follows: given a window size W, for every new estimate the context function is applied to the W − 1 preceding estimates and the current tweet's estimate to generate the final estimate for the current tweet. In case fewer than W − 1 estimates are available, the historical context function uses only the available ones. The history window function estimate is re-computed with each new tweet.
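A sketch of this windowed adjustment, assuming the median as the aggregation function and assuming that the adjusted estimates (rather than the raw ones) populate the history; both choices are our own:

    import statistics

    def apply_history(history, current_estimate, window_size, agg=statistics.median):
        # Combine the current tweet's estimate with up to window_size - 1
        # preceding estimates for the same event.
        context = history[-(window_size - 1):] if window_size > 1 else []
        return agg(context + [current_estimate])

    # Toy stream of per-tweet estimates for one event.
    history = []
    for raw in [10.0, 9.4, 30.0, 8.8]:     # 30.0 is an outlier estimate
        final = apply_history(history, raw, window_size=15)
        history.append(final)
    print(history)                          # the outlier is damped by the context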

Figures 2.7 and 2.8 demonstrate how the window size affects the overall performance, for each module in Figure 2.7 and for the performance-based combination in Figure 2.8. The former figure shows that the history function reinforces the estimation quality; in other words, using the historical context improves overall performance. The latter figure illustrates that the priority-based historical context integration improves the estimation performance further.

We aim to pick a window size that improves the estimation accuracy as much as possible while not increasing the MAE for any feature, combination, or integration type. Window size 15 meets this specification; thus, we will be using 15 as the window size. Furthermore, we will use the priority-based historical context integration, since it provides more precise estimates than the feature-type-independent historical context integration.

2.6.4 Test Results

We test the method by running leave-one-event-out experiments in various domain, feature set, and hyperparameter settings. First we perform in-domain experiments, where the system is trained and tested within a single domain, on the 51 FM events that were not used during the hyperparameter optimization phase and on all events in the ‘MC without retweets’ data. The in-domain experiment on the MC data set is performed using the hyperparameters that were optimized on the FM development data. We then measure the estimation performance of a model trained on the FM data set when applied to the MC data set. Finally we combine the data sets in order to test the contribution of each data set in an open-domain training and testing setting.

Table 2.6 lists the performance of the method in terms of MAE, RMSE, and coverage for each experiment. The ‘Football Matches’ and ‘Music Concerts’ parts contain the results of the in-domain experiments; the ‘Domain Union’ part provides the results for the experiment that uses both the FM and MC data sets to train and test the method. The ‘Model Transfer’ part shows results for the experiment in which we use the FM data set for training and the MC data set for testing. Columns denote results obtained with the different feature sets ‘RFeats’, ‘TFeats’, and ‘WFeats’, the priority-based historical context integration in ‘All’, and the two baselines, ‘Mean Baseline’ and ‘Median Baseline’.

For the football matches, RFeats provide the best estimations with 3.42 hours MAE and 14.78 hours RMSE. RMSE values are higher than MAE values in proportion to the relatively large mistakes the method makes. Although RFeats need to be written manually and their coverage is limited to 27%, their performance shows that they offer rather precise estimates. Without the RFeats, TFeats achieve 7.53 hours MAE and 24.39 hours RMSE. This indicates that having a temporal expression list provides reasonable performance as well. WFeats, which do not need any resource other than a lexicon, yield TTE estimates with a MAE of 13.98 and a RMSE of 35.55 hours.

The integration approach, of which the results are listed in the column ‘All’, succeeds in improving the overall estimation quality while increasing the coverage up to 85% of the test tweets. Comparing these results to the mean and median baseline columns shows that our method outperforms both baselines by a clear margin.

The errors and coverages obtained on music concerts listed in Table 2.6 show that all features and their combinations lead to higher errors as compared to the FM data set, roughly doubling their MAE. Notably, WFeats yield a MAE of more than one day. The different distribution of the tweets, also reflected in the higher baseline errors, appears to be causing a higher error range; still, our method performs well below the baselines.

                            RFeats  TFeats  WFeats    All  Mean Baseline  Median Baseline
Football Matches  RMSE       14.78   24.39   35.55  21.62          38.34            41.67
                  MAE         3.42    7.53   13.98   5.95          24.44            18.28
                  Coverage    0.27    0.32    0.82   0.85           1.00             1.00

Music Concerts    RMSE       25.49   31.37   50.68  38.44          54.07            61.75
                  MAE         9.59   13.83   26.13  15.80          38.93            34.98
                  Coverage    0.26    0.27    0.76   0.79           1.00             1.00

Domain Union      RMSE       15.19   24.16   35.76  22.38          38.97            42.43
                  MAE         3.59    7.63   14.25   6.24          24.95            18.79
                  Coverage    0.27    0.31    0.82   0.84           1.00             1.00

Model Transfer    RMSE       25.43   36.20   57.65  44.40          57.28            64.81
                  MAE         9.66   19.62   31.28  19.04          35.51            37.06
                  Coverage    0.26    0.22    0.74   0.77           1.00             1.00

Table 2.6: In-domain experiments for FM and MC, domain union, and model transfer. The column ‘All’ shows the results of the integration approach, which clearly attains lower MAE and RMSE scores than the baselines.

The ‘Domain Union’ part of Table 2.6 shows the results of an experiment in which we train and test in an open-domain setting. Both the 51 test events from the FM data set and the 32 events from the MC data set are used in a leave-one-event-out experiment. These results are comparable to the FM in-domain experiment results.

The results of the cross-domain experiment, in which the model is trained on the FM data set and tested on the MC data set, are presented in the ‘Model Transfer’ part of Table 2.6. The performance is very close to that obtained in the in-domain experiment on the MC data set.

2.6.5 Discussion

In this section we take a closer look at some of the results in order to find explanations for the differences we observe.

For the in-domain football matches experiment with our combination approach, the estimation quality of the integration-based method in terms of MAE and RMSE relative to event start time is shown in Figure 2.9. The MAE of each feature type under the priority-based historical context integration is displayed in Figure 2.10.

Figure 2.9 shows that as the event draws closer, the accuracy of the estimations increases. Within approximately 20 hours before the event, which is the period in which most of the tweets occur, our method estimates the TTE nearly flawlessly.30 Figure 2.10 illustrates that all feature sets are relatively more successful in providing TTE estimates as the event start time approaches. RFeats and TFeats provide accurate estimates starting as early as around 150 hours before the event. In contrast, WFeats produce relatively high errors of 20 hours or more until about a day before the event.

Error distributions over tweets and events for the in-domain football matches experiment are displayed in Figure 2.11. This figure, which has a logarithmic y-axis, illustrates that the majority of the estimations have an error around zero, and that few estimations point to a time after the event as its starting time. Aggregated at the level of events, the per-event estimation performance can be observed in Figure 2.12. The mean absolute error of the estimations for an event is mostly under 5 hours. There is just one outlier event for which it was not possible to estimate the remaining time reliably. We observed that the football match #ajatwe, which had an overlapping cup final match within 5 days of the targeted league match, affected the evaluation measures significantly: it has the highest number of tweets (23,976), and an MAE and RMSE of 11.68 and 37.36 hours respectively.

Table 2.7 shows extracted features of each feature type, and the estimates based on these, for six tweets. The second and fourth examples draw from WFeats only: onderweg ‘on the road’, onderweg, wedstrijd ‘on the road, match’, zitten klaar ‘seated ready’, klaar ‘ready’, and zitten ‘seated’ indicate that there is not much time left until the event starts; participants are on their way to the event or ready for the game. These features provide estimates of 0.51 and 2.07 hours respectively, which are 0.56 and 0.81 hours off. The history function subsequently adjusts them to be just 0.17 and 0.49 hours, i.e. 10 and 29 minutes, off.

The third example illustrates how RFeats perform. Although the estimate based on aanstaande zondag ‘next Sunday’ is already fairly precise, the history function improves it by 0.24 hours. The remaining examples (1, 5, and 6) use TFeats. The first example's estimate is off by 3.09 hours, which the historical context reduces to just 0.28 hours. The fifth and sixth examples represent the same tweet for two different events, #ajaaz and #psvutr; since every event provides a different history of tweets, the two estimates are adjusted differently.

30 97% of the tweets occur within 150 hours before the event start time.

Table 2.7: Sample tweets and the features extracted from them, where applicable, for RFeats, TFeats, and WFeats. REst, TEst, and WEst are the estimates based on each feature set; IEst is the estimate after priority-based historical context integration; TTE is the actual time to event. All values are in hours.

1. nog 4 uurtje tot de klasieker #feyaja volgen via radio want ik ben @ work
   TFeats: nog uurtje; uurtje | WFeats: radio, volgen; radio, uurtje; radio; uurtje
   TTE 4.05 | TEst 0.86 | WEst 1.09 | IEst 4.33
2. we leven er naar toe de 31ste titel wij zitten er klaar voor #tweaja
   WFeats: klaar, zitten; klaar; zitten
   TTE 1.07 | WEst 0.51 | IEst 0.90
3. aanstaande zondag naar fc utrecht - afc ajax #utraja #afcajax
   RFeats: aanstaande zondag
   TTE 112.87 | REst 115.37 | IEst 115.13
4. onderweg naar de galgenwaard voor de wedstrijd fc utrecht - feyenoord #utrfey
   WFeats: onderweg, wedstrijd; onderweg
   TTE 2.88 | WEst 2.07 | IEst 3.37
5. nu #ajaaz kijken dadelijk #psvutr kijken (for event #ajaaz)
   TFeats: nu dadelijk; dadelijk | WFeats: kijken, dadelijk; dadelijk, kijken; kijken, kijken; dadelijk; kijken
   TTE 1.85 | TEst 1.07 | WEst 0.91 | IEst 1.88
6. nu #ajaaz kijken dadelijk #psvutr kijken (for event #psvutr)
   TFeats: nu dadelijk; dadelijk | WFeats: kijken, dadelijk; dadelijk, kijken; kijken, kijken; dadelijk; kijken
   TTE 1.85 | TEst 1.07 | WEst 0.91 | IEst 1.95

Repeating the analysis for the second domain, music concerts, Figures 2.13 and 2.14 illustrate the estimation quality relative to event start time for the in-domain experiment. Figure 2.13 shows that the mean error of the estimates remains below the baseline error most of the time, though the estimates are not as accurate as for the football matches. Errors do decrease as the event draws closer.

As demonstrated in Figure 2.16, the accuracy of the TTE estimation varies from one event to another. An analysis of the relation between the size of an event and the performance of the method (see also Figure 2.17) shows that there appears to be a correlation between the number of tweets referring to an event and the accuracy with which our method can estimate its TTE. For events for which we have less data, the MAE is generally higher.

The ‘Domain Union’ experiment results show that mixing the two event types leads to errors that are comparable to the in-domain experiment for football matches. We take the size of each domain in terms of tweets and the individual domain performances into account to explain this. The ratio of MC tweets to FM tweets is 0.025. This small amount of additional data is not expected to have a strong impact; results are expected to be, and are, close to the in-domain results. On the other hand, while the results of the domain union are slightly worse than those of the in-domain experiment on the FM data set, the per-domain results, which were filtered from all results of this experiment, are better than the in-domain experiments for the FM events. The WFeats features for FM events yield 13.85 hours MAE and 35.11 hours RMSE when music concerts are mixed in, compared to a slightly higher 13.98 hours MAE and 35.55 hours RMSE for the in-domain experiment on FM. Although the same improvement does not hold for the MC events, these results suggest that mixing different domains may contribute to a better overall performance.

Finally, we looked into features that enable the FM-based model to yield precise estimates on the MC data set as well. Successful features relate to sub-events of large public events that require people to travel, park, and queue: geparkeerd ‘parked’, inpakken, richting ‘pack, direction’, afhalen ‘pick up’, trein, vol ‘train, full’, afwachting ‘anticipation’, langzaam, vol ‘slowly, full’, wachtend ‘waiting’, and rij, wachten ‘queue, waiting’. Moreover, features such as half, uurtje ‘half, hour’ show that WFeats can learn temporal expressions as well. Some example TFeats learned from FM that were useful on the MC data set are over 35 minuten ‘in 35 minutes’, rond 5 uur ‘around 5 o'clock’, nog een paar minuutjes ‘another few minutes’, nog een nachtje, dan ‘one more night, then’, and nog een half uur, dan ‘another half hour, then’. These results suggest that the models can be transferred across domains successfully to a fair extent.

2.6.6 Conclusion

We have presented a time-to-event estimation method that is able to infer the starting time of an event from a stream of tweets automatically by using linguistic cues. It is designed to be general in terms of domain and language, and to generate precise estimates even in cases in which there is not much information about an event.

We tested the method by estimating the TTE from single tweets referring to football matches and music concerts in Dutch. We showed that, averaged over a time window of eight days, estimates can be accurate to within 4 hours for football matches and within 10 hours for music concerts.

The method provided the best results with an ordered procedure that prefers the predictions of the temporal logic rules (the estimation rules), then backs off to estimates based on temporal expressions, followed by word skipgram features, of which the individual TTEs are estimated through median training. Comparing the precision of the three types of features, we found that temporal logic rules and an extensive temporal expression list outperform word skipgrams in generating accurate estimates. On the other hand, word skipgrams demonstrated the potential to cover temporal expressions. We furthermore showed that integrating historical context based on previous estimates improves the overall estimation quality. Closer analysis of our results also reveals that estimation quality improves as more data become available and the event draws closer. Finally, we presented results that hint at the possibility that optimized parameters and trained models can be transferred across domains.

2.7 Conclusion

In this chapter we reported on our research that aims at characterizing the information about the start time of an event on social media and using it to develop a method that can reliably predict the time to event. Various feature extraction and machine learning algorithms were explored in order to find the best combination for this task.

As a result, we developed a method that produces accurate estimates using skipgrams and, where available, is able to benefit from temporal information in the text of the posts. Time series analysis techniques were used to combine these features into an estimate and to integrate that estimate with the previous estimates for the event.

The studies in this chapter were performed on tweets collected using hashtags. However, hashtags enable the collection of only a proportion of the tweets about an event and can be misleading. Therefore, in the following chapter we study finding relevant posts in a collection of tweets gathered using key terms. Detecting a larger proportion of the tweets about an event, in addition to enhancing our understanding of the event, will increase the chances of producing accurate time-to-event estimates for it.

Figure 2.3: Kernel density estimation using Gaussian kernels of some selected word skipgrams that show a predictive pattern with respect to the time of an event (among them onderweg, wedstrijd; aangesteld, scheidsrechter; aanwezig, wedstrijd; overmorgen; allerlaatst, kaarten; x-axis: time to event, y-axis: density). They mostly indicate a phase of the event or preparations related to it.

Figure 2.4: Kernel density estimation using Gaussian kernels of some selected temporal features, illustrating the information these features carry (among them nog 5 dagen, nog twee nachtjes, over een week, nog 4 dagen, morgenavond, 8 uur; x-axis: time to event, y-axis: density). We can observe that the meaning of the temporal expressions complies with their temporal distribution before an event.

Figure 2.5: MAE and coverage curves for different quantile cut-offs and feature lengths for word skipgrams (WFeats), for mean and median training. Longer features and higher quantile cut-offs are mostly correlated with a smaller MAE.

Figure 2.6: MAE and coverage curves for different quantile cut-offs and feature lengths for temporal expressions (TFeats), for mean and median training. Longer features and higher quantile cut-offs are mostly correlated with a smaller MAE.

Figure 2.7: MAE of the estimates as a function of the window size of the historical context function, per feature set (WFeats, TFeats, Rules) and their combination. The Rules and the Combination perform significantly better as they utilize the previous estimates.

Figure 2.8: MAE of the feature-set-based estimates (WFeats and Others, TFeats and Rules, Rules) under the priority-based historical context integration, as a function of window size. Applying priority to the estimates has a clear benefit.

Figure 2.9: Mean estimation error, averaged in 4-hour frames, relative to event start time for the in-domain football matches experiment, for the integration (MAE and RMSE) and the baselines. Both the MAE and RMSE are smaller than the baseline errors almost all the time.

Figure 2.10: Per feature set (RFeats, TFeats, WFeats), mean estimation error averaged in 4-hour frames, relative to event start time for the in-domain football matches experiment. The error decreases as the event time approaches. The RFeats- and TFeats-based results are mostly quite precise.

Figure 2.11: Error distribution per estimate for the in-domain football matches experiment (x-axis: error range in hours, y-axis: number of estimations, logarithmic scale). Most of the estimates have no (0) error and only one estimate is drastically off on the positive side, by 50 hours.

Figure 2.12: MAE distribution per event for the in-domain football matches experiment (x-axis: mean absolute error in hours, y-axis: percentage of the events). Most of the events have an estimation error of less than 4 hours on average.

Figure 2.13: MAE and RMSE for the integration and the baselines, relative to event start time, for the music concerts. The average error of our method is slightly lower than that of the baselines, and the method occasionally produces good estimates.

Figure 2.14: MAE for each feature set (RFeats, TFeats, WFeats), relative to event start time, for the music concerts. Our method starts to generate reliable and consistent results from 50 hours before an event.

Figure 2.15: Error distribution per estimate in the music concerts data (x-axis: error range in hours, y-axis: number of estimations, logarithmic scale). Most of the estimates are around zero.

Figure 2.16: MAE per event in the music concerts data. Except for outlier events, the method yields results within a reasonable error margin.

Figure 2.17: MAE in relation to the size of the event (MAE expressed in hours, event size in number of tweets). Bigger events, which have more tweets, tend to have smaller errors.

Chapter 3

Relevant Document Detection

This chapter is based on the following studies:

Hürriyetoğlu, A., Gudehus, C., Oostdijk, N., & van den Bosch, A. (2016). Relevancer: Finding and Labeling Relevant Information in Tweet Collections. In E. Spiro & Y.-Y. Ahn (Eds.), Social Informatics: 8th International Conference, SocInfo 2016, Bellevue, WA, USA, November 11-14, 2016, Proceedings, Part II (pp. 210–224). Cham: Springer International Publishing. Available from http://dx.doi.org/10.1007/978-3-319-47874-6_15

Hürriyetoğlu, A., van den Bosch, A., & Oostdijk, N. (2016a, September). Analysing the Role of Key Term Inflections in Knowledge Discovery on Twitter. In Proceedings of the 1st International Workshop on Knowledge Discovery on the Web. Cagliari, Italy. Available from http://www.iascgroup.it/kdweb2016-program/accepted-papers.html

Hürriyetoğlu, A., van den Bosch, A., & Oostdijk, N. (2016b, December). Using Relevancer to Detect Relevant Tweets: The Nepal Earthquake Case. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation. Kolkata, India. Available from http://ceur-ws.org/Vol-1737/T2-6.pdf

Hürriyetoğlu, A., Oostdijk, N., Erkan Başar, M., & van den Bosch, A. (2017). Supporting Experts to Handle Tweet Collections About Significant Events. In F. Frasincar, A. Ittoo, L. M. Nguyen, & E. Métais (Eds.), Natural Language Processing and Information Systems: 22nd International Conference on Applications of Natural Language to Information Systems, NLDB 2017, Liège, Belgium, June 21-23, 2017, Proceedings (pp. 138–141). Cham: Springer International Publishing. Available from https://doi.org/10.1007/978-3-319-59569-6_14


3.1 Introduction

Microtexts on Twitter comprise a relatively new document type. Tweets are typically characterized by their shortness, impromptu nature, and the apparent lack of traditional written text structure. Unlike some news or blogging platforms, Twitter lacks facilities such as a widely known labeling system or a taxonomy for categorizing or tagging the posts. Consequently, collecting and analyzing tweets holds various challenges (Imran et al., 2015). When collecting tweets, the use of one or more key terms in combination with a restriction to a particular geographical area or time period is prone to cause the final collection to be incomplete or unbalanced (Olteanu, Castillo, Diaz, & Vieweg, 2014), which may hamper our ability to leverage the available information, as we are unable to know what we have missed.1

Users who want to gather information from a keyword-based collection of tweets will often find that not all tweets are relevant for the task at hand. Tweets can be irrelevant for a particular task for a range of different reasons, for instance because they are posted by non-human accounts (bots), contain spam, refer to irrelevant events, or refer to an irrelevant sense of a keyword used in collecting the data. This variety is likely to be dynamic over time, and can be present in a static or continuously updated collection as well.

In order to support the user in managing tweet collections, we developed a tool, Relevancer. Relevancer organizes content in terms of information threads. An information thread characterizes a specific, informationally related set of tweets. Relatedness is determined by the expert who uses the method. For example, the word ‘flood’ has multiple senses, including ‘to cover or fill with water’ but also ‘to come in great quantities’.2 A water manager will probably want to focus on only the water-related sense. At the same time, she will want to discriminate between different contextualizations of this sense: past, current, and future events, effects, measures taken, etc. By incrementally clustering and labeling the tweets, the collection is analyzed into different information threads, enabling the content presented in tweets to be organized efficiently for this kind of task.

Relevancer enables an expert to explore and label a tweet collection, for instance any set of tweets that has been collected by using keywords. The tool requires expert3 feedback in terms of annotations of individual tweets, or sets (or clusters) of tweets, in order to identify and label the information threads. Relevancer follows a strategy in which tweets are clustered by similarity, so that annotations can be applied efficiently to coherent clusters instead of to tweets individually. Our method advances the state of the art in terms of the efficient and relatively complete understanding and management of a non-standard, rich, and dynamic data source.

1 Hashtag use guarantees tweet set management to some extent, and even then only for a limited set of users who are aware of which hashtag is associated with a particular topic or event. Related posts that do not carry that hashtag will be missed.
2 http://www.oed.com/view/Entry/71808, accessed June 10, 2018
3 An expert can be anybody who is able to make informed decisions about how to annotate tweet clusters in order to understand a tweet collection in a certain context in which she is knowledgeable.

The strength of our approach is its ability to scale to a large collection while maintaining reasonable levels of precision and recall by exploiting the intrinsic characteristics of the key terms used on social media. Moreover, Relevancer shares the responsibility to strive for completeness and precision with the user.

This chapter reports four use cases that illustrate how Relevancer can assist task holders in exploring a microtext collection, defining which microtexts are relevant to their needs, and using this definition to classify new microtexts in an automated setting. In each use case Relevancer is applied with a specific configuration and with incremental improvements to the method.

Our first use case describes the analysis of a tweet collection we gathered using the key terms ‘Rohingya’ and ‘genocide’ (Hürriyetoğlu et al., 2016). Next we investigate the effect of keyword inflections in a study involving a collection of tweets sampled using the English word ‘flood’ (Hürriyetoğlu et al., 2016a). The third use case reports on our participation in a shared task about analyzing earthquake-related tweets (Hürriyetoğlu et al., 2016b). The fourth and final case study is about analyzing Dutch tweets gathered using the word ‘griep’ (‘flu’). This use case demonstrates the application of our approach to the analysis of tweets in a language other than English, and the performance of the generated classifier.

The chapter continues with related research in Section 3.2. Then, in Section 3.3, Relevancer is described in more detail. The use cases are presented in Sections 3.4, 3.5, 3.6, and 3.7 respectively. Finally, we draw overall conclusions in Section 3.8.

3.2 Related Research

Identifying different uses of a word in different contexts has been the subject of lexicographical studies (Eikmeyer & Rieser, 1981). Starting with WordNet (Fellbaum, 1998), this field has benefited from computational tools to represent lexical information, which enabled the computational study of word meaning. On the basis of these resources and of corpora annotated with word sense information, word sense induction and disambiguation tasks were identified and grew into an important subfield of computational linguistics (Navigli, 2009; Pedersen, 2006; Mccarthy, Apidianaki, & Erk, 2016). This task is especially challenging for tweets, as they offer limited context (Gella, Cook, & Baldwin, 2014). Moreover, the diversity of the content found on Twitter (De Choudhury, Counts, & Czerwinski, 2011) and the specific information needs of an expert require a more flexible definition than a sense or topic. This necessity led us to introduce the term ‘information thread’, which can be seen as the contextualization of a sense.

Popular approaches to word sense induction on social media data are Latent Dirichlet Allocation (LDA) (Gella, Cook, & Han, 2013; Lau et al., 2012) and user graph analysis (Yang & Eisenstein, 2015). The success of the former method depends on the given number of topics, which is challenging to determine, and the latter assumes the availability of user communities. Both methods provide solutions that are not flexible enough to allow users to customize the output, i.e. the granularity of an information thread, based on a particular collection and the specific needs of an expert or a researcher. Therefore, Relevancer was designed in such a fashion that it is possible to discover and explore information threads without any a priori restrictions.

Since a microtext collection contains an initially unknown number of information threads, we propose an incremental-iterative search for information threads in which we include the human in the loop to determine the appropriate set of information threads. By means of this approach, we can manage the ambiguity of the key terms as well as the uncertainty about the information threads that are present in a tweet collection.4 This solution enables experts to spot both the information they are looking for and the information they are not aware exists in that collection.

Social science researchers have been seeking ways of utilizing social media data (Felt, 2016) and have developed various tools (Borra & Rieder, 2014) to this end.5 Although these tools have many functions, they do not focus on identifying the uses of key terms: a researcher has to keep navigating the collection by means of specific key term combinations. Our study aims to enable researchers to discover expected as well as unforeseen uses of the initial key terms. This enhanced understanding enables the analysis to be relatively precise, complete, and timely.

Enabling human intervention is crucial to ensuring high-level performance in text analytics approaches (Chau, 2012; Tanguy, Tulechki, Urieli, Hermann, & Raynal, 2016). Our approach responds to this challenge by allowing the human expert to provide feedback at multiple phases of the analysis.

4 The article at the following URL provides an excellent example of the ambiguity caused by lexical meaning and syntax: http://speld.nl/2016/05/22/man-rijdt-met-180-kmu-over-a2-van-harkemase-boys/, accessed June 10, 2018.
5 Additional tools are https://wiki.ushahidi.com/display/WIKI/SwiftRiver, https://github.com/qcri-social/AIDR/wiki/AIDR-Overview, and https://github.com/JakobRogstadius/CrisisTracker, all accessed June 10, 2018.

3.3 Relevancer

The Relevancer tool addresses a specific scenario: collecting and analyzing Twitter data using key terms. Since the use and interpretation of key terms depend partly on the social context of a Twitter user and on the point in time the term is used, the senses, nuances, and aspects that a word may have on social media often cannot all be found in a standard dictionary. Therefore, Relevancer focuses on the automatic identification of sets of tweets that contain the same contextualization of a sense, namely tweets that convey the same meaning and nuance (or aspect) of a key term. We refer to such sets of tweets as information threads.

The information thread concept allows fine-grained management of all uses of a keyword. In the case of this study, this approach enables the user of the tool to focus on uses of a key term at any level of granularity. For instance, tweets about a certain event, which takes place at a certain time and place, and tweets about a type of event, without a particular place or time, can be processed either at the same level of abstraction or be treated as separate threads, depending on the needs of the user.

While highly ambiguous words, such as ‘flood’, require a proper analysis to identify and discriminate among their multiple uses in a collection, relatively less ambiguous words often need this analysis as well. For instance, a social scientist who collects data using the key term ‘genocide’ may want to focus exclusively on the ‘extermination’ sense of this word6 or on specific information threads that relate to past cases, current or possible future events, effects, measures taken, etc.

Relevancer can help experts come to grips with event-related microtext collections. The development of Relevancer has been driven by and has benefited from the following insights: (i) almost every event-related tweet collection also contains tweets about similar but irrelevant events (Hürriyetoğlu et al., 2016); (ii) by taking into account the temporal distribution of the tweets about an event, it is possible to increase the quality of the information thread detection and to decrease the computation time (Hürriyetoğlu et al., 2016b); and (iii) the use of inflection-aware key terms can decrease the degree of ambiguity (Hürriyetoğlu et al., 2016a).

This section describes our methodology and the way in which Relevancer selects, classifies, and processes microtexts. The main steps are pre-processing, feature extraction, near-duplicate detection, information thread detection, cluster annotation, and classifier creation. Each of these steps is described in one of the following subsections, 3.3.1 to 3.3.6. Finally, we discuss the scalability of the approach in 3.3.7.

6 http://www.oed.com/view/Entry/77616, accessed June 10, 2018

3.3.1 Data Preparation

A raw tweet collection that has been gathered using one or more key terms consists of original posts and (near-)duplicates of these posts (retweets and posts with nearly the same content). The tweet content may contain URLs and user names. First we exclude the retweets, then we normalize the URLs and user names, and finally we remove the exact duplicates, for the following reasons.

Any tweet that shows an indication of being a retweet is excluded, because retweets do not contribute new information beyond user information and how often a tweet is shared; in all studies described here we ignore user identity and the frequency of sharing. We detect retweets in two ways: on the basis of the retweet identifier in the tweet's JSON representation, or by the presence of "RT @" at the beginning of the tweet text.

We proceed with normalizing the user names and URLs to "usrusrusr" and "urlurlurl" respectively. This normalization eliminates the token noise introduced by the huge number of different user names and URLs, while preserving the abstract information that a user name or a URL is present in a tweet.
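A minimal sketch of these first two preparation steps; retweeted_status is the retweet identifier field in the Twitter JSON of that era, and the example tweet is fabricated:

    import re

    def is_retweet(tweet):
        # A retweet either carries the retweeted_status field in its JSON
        # or its text starts with "RT @".
        return "retweeted_status" in tweet or tweet.get("text", "").startswith("RT @")

    def normalize(text):
        # Replace user names and URLs by the placeholder tokens, keeping
        # only the abstract information that they were present.
        text = re.sub(r"@\w+", "usrusrusr", text)
        return re.sub(r"https?://\S+", "urlurlurl", text)

    tweet = {"text": "New must read by @someuser https://example.org/article"}
    if not is_retweet(tweet):
        print(normalize(tweet["text"]))   # -> New must read by usrusrusr urlurlurl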

Finally, after normalizing the user names and URLs, we detect and exclude exact duplicate tweets, keeping only one sample of each. The importance of this step can be appreciated when we identify tweets such as examples 3.3.1 and 3.3.2 below, which were posted 5,111 and 2,819 times respectively.

Example 3.3.1. usrusrusr The 2nd GENOCIDE against #Biafrans as promised by #Buhari has begun,3days of unreported aerial Bombardment in #Biafraland

Example 3.3.2. .usrusrusr New must read by usrusrusr usrusrusr A genocide & Human trafficking at a library in Sweden urlurlurl

We have to note that by excluding tweets that are duplicates in terms of their text, we lose information about different events that are phrased in the same way. We consider this limitation as not affecting the goal of our study, since we focus on large-scale events that are phrased in multiple ways by multiple people.

A final data preparation phase, which is near-duplicate elimination, is applied after the feature extraction step.

3.3.2 Feature Extraction

We represent tweets by features that each encode the occurrence of a string in the tweet. Features represent string types, or types for short, which may occur multiple times as tokens. Types are any space-delimited single alphanumeric characters or sequences of them, and combinations of special and alphanumeric characters, such as words, numbers, user names (all normalized to a single type), hashtags, emoticons, emojis, and sequences of punctuation marks.7 Since none of the steps contains any language-specific optimizations, and the language- and task-related information is provided by a human user, we consider our method to be language-independent.8

On the basis of the tokenized text, we generate unigrams and bigrams as features used in the following steps of our method. We apply a dynamically calculated occurrence threshold to these features: the natural logarithm of the number of tweets in the collection. If the occurrence frequency of a feature is below the threshold, it is excluded from the feature set. For instance, if the collection contains 100,000 tweets, the frequency cut-off will be 11, which is rounded down from 11.51.
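A sketch of this feature extraction with the dynamic cut-off, following the worked example (natural logarithm of the collection size, rounded down); the helper names are ours:

    import math
    from collections import Counter

    def extract_features(tokens):
        # Unigram and bigram features over the tokenized tweet text.
        return list(tokens) + [" ".join(p) for p in zip(tokens, tokens[1:])]

    def frequency_threshold(n_tweets):
        # Dynamic cut-off from the worked example: ln(100,000) = 11.51 -> 11.
        return int(math.log(n_tweets))

    def select_features(tokenized_tweets):
        counts = Counter(f for toks in tokenized_tweets for f in extract_features(toks))
        cutoff = frequency_threshold(len(tokenized_tweets))
        # Features whose occurrence frequency is below the cut-off are excluded.
        return {f for f, c in counts.items() if c >= cutoff}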

3.3.3 Near-duplicate Detection

Most near-duplicate tweets occur because the same quotes are used, the same message is tweeted, or the same news article is shared with other people. Since duplication does not add information and may harm the efficiency and performance of the basic algorithms, we remove near-duplicate tweets, keeping only one of them in the collection.

The near-duplicate tweet removal algorithm is based on the features described in the previous subsection and on the cosine similarity of the tweets calculated over these features. If the pairwise cosine similarity of two tweets is larger than 0.8, we assume that these tweets are near-duplicates of each other. In case the available memory does not allow all tweets to be processed at once, we apply a recursive bucketing and removal procedure. The recursive removal starts with splitting the collection into buckets of tweets of a size that can be handled efficiently. After removing near-duplicates from each bucket, we merge the buckets and shuffle the tweets that remained unique in their respective buckets after the latest iteration. We repeat the removal step until we can no longer find duplicates in any bucket.

7 Punctuation mark sequences are treated differently. Sequences of punctuation marks comprising two, three, or four items are considered one feature. If a punctuation mark sequence is longer than four, we split it from left to right into tokens of length 4, ignoring the last part if it is a single punctuation mark. The limit of length 4 is used because longer combinations can be rare, which may cause them not to be used at all.
8 For languages with scripts that do not mark word segmentation (e.g. by a space), the tokenizer needs to be adapted.

For instance, the near-duplicate detection method recognizes the tweets in examples 3.3.3 and 3.3.4 below as near-duplicates and leaves just one of them in the collection.

Example 3.3.3. Actor Matt Dillon puts rare celebrity spotlight on Rohingya urlurlurl #news

Example 3.3.4. urlurlurl Matt Dillon puts rare celebrity spotlight on Rohingya… urlurlurl
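A simplified sketch of the similarity-based removal, without the recursive bucketing; scikit-learn is the library the thesis reports using, and the pairwise matrix below assumes the collection fits in memory:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def remove_near_duplicates(texts, threshold=0.8):
        # Keep one tweet of every group whose pairwise cosine similarity
        # over unigram/bigram counts exceeds the threshold.
        vectors = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)
        sims = cosine_similarity(vectors)
        kept, removed = [], set()
        for i in range(len(texts)):
            if i in removed:
                continue
            kept.append(texts[i])
            # Mark all later tweets that are near-duplicates of tweet i.
            removed.update(j for j in range(i + 1, len(texts)) if sims[i, j] > threshold)
        return kept

    tweets = ["Actor Matt Dillon puts rare celebrity spotlight on Rohingya urlurlurl #news",
              "urlurlurl Matt Dillon puts rare celebrity spotlight on Rohingya urlurlurl"]
    print(remove_near_duplicates(tweets))   # only the first tweet survives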

3.3.4 Information Thread Detection

The information thread detection step aims at finding sets of related tweets. In case a tweet collection was compiled using a particular keyword, different information threads may be detected by grouping tweets in sets where different contextualizations of the keyword are at play. In case the tweet collection came about without the use of a keyword, the thread detection step will still find related groups of tweets that are about certain uses of the words and word combinations in this collection. These groups will represent information threads.

The clustering process aims to identify coherent clusters of tweets. Each cluster ideally constitutes an information thread. We use a classic, basic clustering algorithm: K-Means for collections smaller than 500 thousand tweets, and MiniBatch K-Means for larger ones.9 We repeat the clustering and cluster selection steps in iterations until we reach the requested number of coherent clusters, which is provided by the expert.10 Clusters that fit our coherence criteria are picked for annotation, and the tweets they contain are excluded from the next iteration of clustering, which focuses on all tweets that have not yet been clustered and annotated. This procedure iterates until one of the stop criteria described in the following subsection is satisfied.

We observed that tweet collections contain tweets that do not bear any clear relation to other tweets, independent of whether they are relevant to the use case. These tweets either form incoherent clusters, which are erroneously identified as coherent by the clustering algorithm, or are not placed in any cluster. The incoherent clusters are identified by the expert at the annotation step and excluded from the cluster set. The tweets that are not placed in any cluster remain untouched. We expect the clusters that are annotated by the experts to serve as training data and facilitate the classification of the tweets in incoherent clusters and of outlier tweets as relevant or not.

9 We used scikit-learn v0.17.1 for all machine learning tasks in this study (http://scikit-learn.org, accessed June 10, 2018).
10 We start the clustering with a relatively small k and increase k in the following iterations. We consider that this approach relaxes the requirement of determining the best k before running the algorithm. The search for information threads may stop earlier in case a pre-determined number of good-quality clusters is reached or the cluster coherence criteria reach unacceptable values.

The clustering step of Relevancer involves three levels of parameters. First, the clustering parameters depend on the collection size in number of tweets (n) and the time spent searching for clusters. Second, the cluster coherence parameters control the selection of clusters for annotation. Third, the parameter specifying the requested number of clusters sets the limit at which the algorithm stops. The clustering and coherence parameters are updated at each iteration (i) based on the requested number of clusters (r) and the number of clusters already detected in previous iterations. The function of these parameters is as follows.

Clustering parameters There are two clustering parameters: k is the number of expected clusters and t is the number of initializations of the clustering algorithm before it delivers its result. These parameters are set at the beginning and updated automatically at each iteration. The value of the parameter k of the K-Means algorithm is determined by Equation 3.1 at each iteration: k is equal to half of the square root of the tweet collection size at that iteration, plus the number of previous iterations times the difference between the requested number of clusters and the number of detected clusters (a). This adaptive behavior ensures that if we do not have sufficient clusters after several iterations, we will be searching for smaller clusters at each iteration.

Coherence parameters The three coherence parameters set thresholds on the distribution of the instances in a cluster relative to the cluster center: the allowed Euclidean distances of the closest and the farthest instance to the cluster center, and the difference between these two. Together they enable selecting coherent clusters algorithmically. Although these parameters have default values that are strict, they can also be set by the expert. The adaptation steps, which relax the cluster coherence criteria, are small if we are close to our target number of clusters.

Requested number of clusters The third layer of parameters contains the requested number of clusters that should be generated for the expert (r). This parameter serves as a stopping criterion for the exploration and as an indicator of the adaptation step for the values of the other parameters in each cycle. Since the available number of clusters and the value of this parameter are compared at the end of an iteration, the clustering step may yield more clusters than the requested number. In order to limit an excessive number of clusters, the values of the coherence parameters are increased less as the available number of clusters gets closer to the requested number.

The result of the clustering, which is evaluated by the coherence criteria, is the best score in terms of inertia after initializing the clustering process t times in an iteration. As provided in Equation 3.2, t is the base-10 logarithm of the size of the tweet collection (in number of tweets) at the current iteration, plus the number of iterations performed until that point. This formula ensures that the longer it takes to find coherent clusters, the more often the clustering is initialized before it delivers a result.

$$k = \frac{\sqrt{n_i}}{2} + \big(i \times (r - a_i)\big) \qquad (3.1)$$

$$t = \log_{10} n_i + i \qquad (3.2)$$

Some collections may not contain the requested number of coherent clusters as defined by the coherence parameters. In that case, the adaptive relaxation (increase) of the coherence parameters stops at a level where the values of these parameters are too large to enable a sensible coherence-based cluster selection.11 This becomes the last iteration of the search for coherent clusters, as we think it is unrealistic to expect relatively good clusters in such a situation. The available clusters are then returned, even if they do not reach the number requested by the expert.
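A simplified sketch of the iterative search, wiring Equations 3.1 and 3.2 into the clustering loop; the coherence thresholds and the iteration cap are illustrative, and the adaptive relaxation of the coherence parameters is omitted:

    import math
    import numpy as np
    from sklearn.cluster import KMeans, MiniBatchKMeans

    def is_coherent(members, center, min_dist=0.30, max_dist=0.60):
        # Coherence check: the closest and the farthest member must stay
        # within the distance thresholds (values here are illustrative).
        dists = np.linalg.norm(members - center, axis=1)
        return dists.min() <= min_dist and dists.max() <= max_dist

    def find_information_threads(X, requested, max_iterations=10):
        # X: dense feature matrix of the not-yet-clustered tweets.
        coherent_clusters = []
        for i in range(1, max_iterations + 1):
            if len(coherent_clusters) >= requested or len(X) < 2:
                break
            a = len(coherent_clusters)
            k = int(math.sqrt(len(X)) / 2 + i * (requested - a))   # Equation 3.1
            k = max(2, min(k, len(X) - 1))
            t = int(math.log10(len(X)) + i)                        # Equation 3.2
            algo = KMeans if len(X) < 500_000 else MiniBatchKMeans
            model = algo(n_clusters=k, n_init=t).fit(X)
            keep = np.ones(len(X), dtype=bool)
            for c in range(k):
                members = X[model.labels_ == c]
                if len(members) and is_coherent(members, model.cluster_centers_[c]):
                    coherent_clusters.append(members)
                    keep &= model.labels_ != c     # exclude the selected cluster
            X = X[keep]                            # shrink the collection
        return coherent_clusters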

3.3.5 Cluster Annotation

Automatically selected clusters are presented to an expert in order to identify the clusters that represent certain information threads.12 Tweets in these clusters are presented based on their distance to the cluster center; the closer they are, the higher their rank. After the annotation, each thread may consist of one or more clusters. In other words, similar clusters should be annotated with the same label. Clusters that are not clear or that fall outside the focus of a Relevancer session should be labeled as incoherent or irrelevant respectively. This decision is taken by a human expert. The tweets in a cluster labeled as incoherent are treated in a similar fashion as tweets that were not placed in any coherent cluster.

11 This behavior is controlled by the coherence parameters min_dist_thres and max_dist_thres. The distances of the closest and farthest instances to the cluster center must not exceed min_dist_thres and max_dist_thres respectively. If the values of min_dist_thres and max_dist_thres exceed 0.85 and 0.99 at the same time due to the adaptive increase in an iteration, we assume that the search for coherent clusters must stop.
12 In our setting, the annotation is designed to be done or coordinated by a single person.

The example presented in Table 3.1 shows tweets that are the closest to and the farthest from the cluster center. The tweets closest to the cluster center form a coherent set; those farthest removed clearly do not. Tweets in a coherent cluster (CH) have a clear relation that allows us to treat this group of tweets as pertaining to the same information thread. Incoherent clusters are summarized under IN1 and IN2. In the former group, the tweets are unrelated.13 The latter group contains only a meta-relation that can be considered an indication of being about liking some video; the rest of these posts are not related. The granularity level of the information thread definition determines the final decision for this particular cluster: in case the expert decides to define a label as being about a video, the cluster will be annotated as coherent; otherwise, this cluster should be annotated as incoherent.

Table 3.1: Tweets that are the closest and the farthest to the cluster center for coherent (CH), incoherent type 1 (IN1), and incoherent type 2 (IN2) clusters

CH  − myanmar rejects ’unbalanced’ rohingya remarks in oslo (from usrusrusr urlurlurl
    − shining a spotlight on #myanmar’s #rohingya crisis: usrusrusr remarks at oslo conf on persecution of rohingyas urlurlurl
IN1 − un statement on #burma shockingly tepid and misleading, and falls short in helping #rohingya says usrusrusr usrusrusr urlurlurl
    − usrusrusr will they release statement on bengali genocide 10 months preceding ’71 ?
IN2 − i liked a usrusrusr video urlurlurl rwanda genocide 1994
    − i liked a usrusrusr video urlurlurl fukushima news genocide; all genocide paths lead to vienna and out of vienna

For instance, clusters that contain tweets like “plain and simple: genocide urlurlurl” and “it's genocide out there right now” can be gathered under the same label, e.g., actual ongoing events. If a cluster of tweets is about a particular event, a label can be created for that particular event as well. For instance, the tweet “the plight of burma's rohingya people urlurlurl” is about a particular event related to the Rohingya people. If we want to focus on this event, we should provide a specific label for this cluster and use it to specify this particular context. We can use ‘plight’ as a label as well; in that case, the context should specify cases relevant to this term.

In case the expert is not interested in, or does not want to spend time on, defining a separate label for a coherent cluster, the expert can attach the label irrelevant, which behaves as an umbrella label for all tweet groups (information threads) that are not the present focus of the expert.

13 The expert may prefer to tolerate a few different tweets in the group in case the majority of the tweets are coherent, and treat the cluster as coherent.

We developed a web-based user interface for presenting the clusters and assigning a label to the presented cluster.14 We present all tweets of a cluster if their number is smaller than 20; otherwise, we present the first and the last 10 tweets of the cluster. As explained before, the order of the tweets in a cluster is based on their relative distance to the cluster center: the first tweets are the closest to the center, the last ones the farthest removed from it. This setting provides an overview and enables spotting incoherent clusters effectively. A cluster can be skipped without providing a label for it; in that case, the cluster is not included in the training set. If an expert finds that a cluster represents an information thread she is not currently interested in, the irrelevant label should be attached to that cluster explicitly.

At the end of this process, an expert is expected to have defined the information threads for this collection and to have identified the clusters that are related to these threads. Tweets that are part of a certain information thread can be used to understand the related thread or to create an automatic classifier that can classify new tweets, e.g., ones that were not included in any selected cluster at the clustering step, into classes based on the detected and labeled information threads.

3.3.6 Creating a Classifier

Creating classifiers for tweet collections is a challenge that is mainly affected by the label selection of the expert annotator, the ambiguity of the keyword if the tweet collection was keyword-based, and the time period in which the tweets in the collection were posted. This time period may contain a different pattern of occurrences than seen in other periods. Consequently, the classes may be imbalanced or unrepresentative of the expected class occurrence patterns when applied to new data.

The labeled tweet groups are used as training data for building automatic classifiers that can be applied ‘in the wild’ to any earlier or later tweets, particularly if they are gathered while using the same query.

Relevancer supports the Naive Bayes and Support Vector Machine (SVM) algorithms for creating a classifier. These are classic, baseline supervised machine learning techniques. Naive Bayes and SVM have been noted to provide a performance comparable to sophisticated machine learning algorithms for short text classification (S. Wang & Manning, 2012). We use Naive Bayes in case we need a short training time; time efficiency matters when an expert prefers to update the current classifier frequently with new data or to create another classifier after observing the results of a particular classifier. The other option, SVM, is used when either the training data is relatively small, mostly fewer than 100,000 microtexts, or the classifier does not need to be updated frequently. The expert determines which option suits their case.

14http://relevancer.science.ru.nl, accessed June 10, 2018

The parameters of the machine learning algorithms are optimized by using a grid search on the training data. The performance of the classifier is evaluated on 15% of the training data, which is held out and not used at the optimization step. After optimization, the held-out data can be used as part of the training data.
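
As an illustration of this scheme, the following sketch uses scikit-learn (on which Relevancer builds); the feature settings and the parameter grid are illustrative rather than the exact Relevancer configuration:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    from sklearn.metrics import classification_report

    def build_classifier(texts, labels):
        # Hold out 15% of the labeled data for the final evaluation.
        x_tr, x_te, y_tr, y_te = train_test_split(
            texts, labels, test_size=0.15, random_state=42)
        pipe = Pipeline([('vec', TfidfVectorizer(ngram_range=(1, 2))),
                         ('clf', MultinomialNB())])
        # Grid-search the smoothing parameter on the training part only.
        grid = GridSearchCV(pipe, {'clf__alpha': [0.105 * i for i in range(1, 20)]}, cv=5)
        grid.fit(x_tr, y_tr)
        # Evaluate on the held-out 15%.
        print(classification_report(y_te, grid.predict(x_te)))
        return grid.best_estimator_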

3.3.7 Scalability

Relevancer applies various methods in order to remain scalable independently of the number of tweets and features used. The potentially large number of data points in tweet collections is the reason why this tool has scalability at the center of its design.

Processing is done by means of basic and fast algorithms. As observed above, depending on the size of the collection, K-Means or MiniBatch K-Means algorithms are employed in order to rapidly identify candidate clusters. The main parameter of these algorithms, the number of clusters k, is recalculated at each iteration in order to find more and smaller clusters than in the previous iteration.15 Targeting more and smaller clusters increases the chance of identifying coherent clusters.

Tweets in coherent clusters are excluded from the part of the collection that enters the subsequent clustering iteration. This approach shrinks the collection size at each iteration. Moreover, the criteria for coherent cluster detection are relaxed at each step until certain thresholds are reached in order to prevent the clustering from being repeated too many times.
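
Schematically, the loop looks as follows. This is a sketch: is_coherent stands for a distance-based test such as the one shown in Section 3.4, and the growth of k is a placeholder for Equation 3.1, which is not reproduced here:

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    def iterative_clustering(x, is_coherent, rounds=5, k=100):
        # x: feature matrix; is_coherent: a distance-based coherence test.
        # Coherent clusters leave the pool, shrinking the collection each round.
        remaining = np.arange(x.shape[0])
        coherent_clusters = []
        for _ in range(rounds):
            if len(remaining) <= k:
                break
            km = MiniBatchKMeans(n_clusters=k, n_init=3, random_state=42).fit(x[remaining])
            keep = np.ones(len(remaining), dtype=bool)
            for c in range(k):
                members = np.where(km.labels_ == c)[0]
                if len(members) and is_coherent(x[remaining[members]],
                                                km.cluster_centers_[c]):
                    coherent_clusters.append(remaining[members])
                    keep[members] = False
            remaining = remaining[keep]
            k = int(k * 1.5)  # assumption: the exact growth of k follows Eq. 3.1
        return coherent_clusters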

Finally, the machine learning algorithm selection step allows the appropriate option to be chosen from the scalability point of view. For instance, the Naive Bayes classifier was chosen in order to create and evaluate a classifier within a reasonable period of time. The speed of this step enables users to decide immediately whether they will use a particular classifier or need to generate and annotate additional coherent clusters. This optimized cycle enables experts to provide feedback frequently without having to wait too long. As a result, the quality of the results increases with minimal input and time needed for a particular task. SVM should be used when fast learning is not required and sufficient amounts of training data are available.

15Equation 3.1 enables this behavior.

The use of this approach is illustrated through Sections 3.4 to 3.7. Each section focuses on a particular use case and the incremental improvements that were made to Relevancer.

3.4 Finding and Labeling Relevant Information in Tweet Collections

The first use case we present is a study in which the Relevancer tool is evaluated on a tweet collection based on the key terms ‘genocide’ and ‘Rohingya’. We retrieved a tweet collection from the public Twitter API16 with these key terms between May 25 and July 7, 2015. The collection consists of 363,959 tweets. The numbers of tweets that contain only ‘genocide’ or only ‘Rohingya’ are 109,629 and 241,441 respectively; 12,889 tweets contain both terms.

Figure 3.1 shows the distribution of the tweets for each subset; it contains a gap of a few days around July 26, 2015 due to a problem encountered during the data collection. We can observe that there are many peaks and a relatively stable tweet ratio for each key term. Our analysis of the collection aims at understanding how the tweets in the peaks differ from the tweets that occur as part of a more constant flow of tweets.

As a first step we clean the tweet collection. In all, 198,049 retweets, 61,923 exact duplicates, and 26,082 near-duplicates are excluded from the original collection of 363,959 tweets, leaving 77,905 tweets in the collection. The evolution of the size of the tweet collection is summarized in Figure 3.2. At each step, a large portion of the data is detected as repetitive and excluded. This cleaning phase shows how strongly the size of the collection depends on the preprocessing.

We provide the total tweet count distributions before and after the cleaning steps, the lines ‘All’ and ‘Filtered’ respectively, in Figure 3.3. The figure illustrates that peaks and trends in the tweet count are generally preserved. The reduced size of the data enables us to apply sophisticated analysis techniques to social media data without losing the main characteristics of the collection.17

Next, as described in the previous sections in more detail, features are extracted. The data are then clustered, yielding clusters of tweets to which an expert can attach labels. The labeled tweets are used to train an automatic classifier, which can be used to analyze the remaining tweets or a new collection. The analysis steps of the tool after duplicate and near-duplicate elimination are presented in Figure 3.4. The analysis steps and the results are explained below.

16https://dev.twitter.com/rest/public, accessed June 10, 2018
17We note that the repetition pattern analysis is valuable in its own right. However, this information is not within the scope of the present study.

Figure 3.5 illustrates the distribution of the tweets that remain in each subset after preprocessing. We observe that the temporal distribution changed after we eliminated the repetitive information. Large peaks in the ‘genocide’ data were drastically reduced and some of the small peaks disappeared entirely. Thus, only peaks that consist of unique tweets in the ‘genocide’ data remain, and the peaks in tweets containing the key term ‘Rohingya’ become apparent.

Clustering

After removing the duplicates and near-duplicates, we clustered the remaining 77,905 tweets. Although the expert requested 100 clusters, the number of generated clusters was 145, which together contain 13,236 tweets. The clustering parameters were set to begin with the following values: (i) the Euclidean distances of the closest and farthest tweets to the cluster center have to be less than 0.725 and 0.875 respectively18; and (ii) the difference in Euclidean distance to the cluster center between the closest and the farthest tweets in a cluster should not exceed 0.4. These starting values were deliberately strict, such that at the first iteration almost no cluster satisfies the conditions. The automatic adaptation of these parameters’ values after completion of each iteration sets them to values suitable for the respective collection.
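
Expressed as code, these starting conditions amount to a simple test. A minimal sketch using scikit-learn distance utilities (the iterative relaxation of the thresholds happens outside this function):

    import numpy as np
    from sklearn.metrics import pairwise_distances

    def is_coherent(cluster_vectors, center,
                    close_t=0.725, far_t=0.875, spread_t=0.4):
        # Euclidean distance of every cluster member to the cluster center.
        d = pairwise_distances(cluster_vectors, center.reshape(1, -1)).ravel()
        # (i) the closest and farthest tweets must be near enough to the
        # center; (ii) their distance difference must stay below spread_t.
        return d.min() < close_t and d.max() < far_t and (d.max() - d.min()) < spread_t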

Annotation and Results

The annotation of the 145 clusters by a domain expert produced the results in Table 3.2. The process yielded eight labels: Actual Cases (AC), Cultural Genocide (CG), Historical Cases (HC), Incoherent (IN), Indigenous People Genocide (IPG), Irrelevant (IR), Jokes (JO), and Mass Migration (MM).

This step enabled the domain expert to understand the data by annotating only 17% of the preprocessed tweets, about 3.6% of the complete collection, without having to go over the whole set. Furthermore, annotating tweets in groups as suggested by the clustering algorithm improved the time efficiency of this process.

18These values are strict values for our setting. The algorithm relaxes them based on the possibility of detecting clusters that can be annotated.

Table 3.2: Number of labeled clusters and total number of tweets for each of the labels

                                    # of clusters   # of tweets
Actual Cases (AC)                              48         4,937
Cultural Genocide (CG)                          7           375
Historical Cases (HC)                          22         1,530
Incoherent (IN)                                32         2,694
Indigenous People Genocide (IPG)                1           109
Irrelevant (IR)                                30         3,365
Jokes (JO)                                      1            30
Mass Migration (MM)                             4           226
Total                                         145        13,266

Next, we used Relevancer to create an automatic classifier from the annotated tweets. We merged the tweets under the JO (Jokes) label with the tweets under the Irrelevant label, since their number is too small, relative to the other tweet groups, to generate a classifier for a separate class. Moreover, the incoherent clusters were not included in the training set. This leaves 10,572 tweets in the training set.

We used the same types of features explained in the feature extraction step to create the features used by the classifier. Since it yielded slightly better results than using only unigrams and bigrams, we extended the feature set with token trigrams in this study. The parameter optimization and the training were done on 85% of the labeled data and the test was performed on the remaining 15%. The only parameter of the Naive Bayes classifier, α, was optimized on the training set to be 0.105, by testing values between 0 and 2 (inclusive) with step size 0.105.
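
In scikit-learn terms, this feature configuration corresponds to the following sketch (the tokenization details of Relevancer are simplified here; train_texts and train_labels are illustrative names for the 85% training split):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Token unigrams, bigrams, and trigrams as features; alpha as optimized above.
    vectorizer = CountVectorizer(ngram_range=(1, 3))
    classifier = MultinomialNB(alpha=0.105)
    # classifier.fit(vectorizer.fit_transform(train_texts), train_labels)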

The performance of the classifier is summarized in Tables 3.3 and 3.4 as a confusion matrix and an evaluation summary. We observe that classes that have a clear topic, e.g., HC and CG, perform reasonably well. However, classes that are potentially a combination of many sub-topics, such as AC and IR (which also contains the JO-labeled tweets, each joke being a different sub-topic), perform relatively worse. Detailed analysis showed that the HC thread contains only a handful of past events that were referred to directly. On the other hand, there are many discussions in addition to clear event references in the AC thread. As a result, clusters that contain more or less similar language use work better than clusters that contain widely diverse language use.

The result of this step is an automatic classifier that can be used to classify any tweet into the aforementioned six classes. Although the performance is relatively low for a class that is potentially a mix of various subtopics, the average scores (0.83, 0.82, and 0.82 for precision, recall, and F1 respectively) were sufficient for using this classifier in understanding and organizing this collection.19

Table 3.3: The rows and the columns represent the actual and the predicted labels of the test tweets. The diagonal provides the correct predictions.

                                     AC   CG   HC  IPG   IR   MM
Actual Cases (AC)                   586    3    5    1  158    1
Cultural Genocide (CG)                1   42    0    0    3    0
Historical Cases (HC)                26    0  198    0    7    1
Indigenous People Genocide (IPG)      2    1    0    9    3    0
Irrelevant (IR)                      62    1    3    0  441    1
Mass Migration (MM)                   8    0    0    0    0   19

Table 3.4: Precision, recall, and F1-score of the classifier on the test col- lection. The recall is based only on the test set, which is 15% of the whole labelled dataset.

                                    precision  recall    F1  support
Actual Cases (AC)                         .86     .78   .81      754
Cultural Genocide (CG)                    .89     .91   .90       46
Historical Cases (HC)                     .96     .85   .90      232
Indigenous People Genocide (IPG)          .90     .60   .72       15
Irrelevant (IR)                           .72     .87   .79      508
Mass Migration (MM)                       .86     .70   .78       27
Avg/Total                                 .83     .82   .82    1,582

Conclusion

In this case study, we demonstrated Relevancer’s performance on a collection collected with the key terms ‘genocide’ and ‘Rohingya’. Each step of the analysis process was explained in some detail in order to shed light both on the tool and the characteristics of this collection.

The results show that the requested number of clusters should not be too small. Otherwise, missing classes may cause the classifier not to perform well on tweets related to information threads not represented in the final set of annotated clusters. Second, the reported performance was measured on the held-out subset of the training set. This means that the test data has the same distribution as the subset of the training data that is used for the actual training. Therefore, the generalization of these results to the actual social media context should be addressed in a further study.

19Thus, no additional work was performed on developing an SVM classifier for this task.

At the end of the analysis steps, the expert had successfully identified repetitive (79% of the whole collection) and relevant information, analyzed coherent clusters, defined the main information threads, annotated the clusters, and created an automatic classifier that has an F1 score of .82.

The next case study is presented in the following section. It investigates the effect of key term inflections on the collection of microtexts from Twitter.

3.5 Analysing the Role of Key Term Inflections in Knowledge Discovery on Twitter

Our methodology for collecting tweets and identifying event-related actionable information is based on the use of key terms and their inflections. Here, we investigate how the use of key term inflections has an impact on knowledge discovery in a tweet collection.

The data consist of tweets we collected from the public Twitter API with the key term ‘flood’ and its inflections ‘floods’, ‘flooded’, and ‘flooding’ between December 17 and 31, 2015. In this experiment, the data were divided over different subsets (for flood, floods, flooded, and flooding respectively). We excluded tweets that contain the same pentagram (here: a sequence of 5 consecutive words, each longer than 2 letters) before applying the near-duplicate detection algorithm; a sketch of this step is given below. After that the data were analyzed using Relevancer. The classifier generation step was not part of the analysis process.
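
A minimal sketch of the pentagram-based exclusion (the whitespace tokenization and the choice to keep the first occurrence of a pentagram are assumptions):

    def pentagrams(text):
        # Words longer than 2 letters, in order of appearance.
        words = [w for w in text.lower().split() if len(w) > 2]
        return {tuple(words[i:i + 5]) for i in range(len(words) - 4)}

    def exclude_pentagram_duplicates(tweets):
        seen, unique = set(), []
        for t in tweets:
            grams = pentagrams(t)
            # Drop a tweet if it shares any 5-word sequence with an earlier one.
            if grams & seen:
                continue
            seen |= grams
            unique.append(t)
        return unique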

Detailed statistics of the collected tweets are presented in Table 3.5. The number of collected tweets is provided in the column #All. The columns #unique and unique% contain the counts after eliminating the retweets and (near-)duplicate tweets, and the percentage of unique tweets in each subset of the collection. We observe that the volume of the tweets correlates with the volume of duplication: the higher the volume, the more duplicate information is posted. In descending order of total volume, the key terms are ‘flood’, ‘flooding’, ‘floods’, ‘flooded’; in descending order of duplication, the order is ‘flood’, ‘floods’, ‘flooding’, ‘flooded’. The unique tweet ratio is highest for the term ‘flooded’ and lowest for ‘flood’.

Table 3.5: Tweet statistics for the key term ‘flood’ and its inflections

             #All   #unique  unique%  clustered  relevant     irrelevant   incoherent
flood     136,295   101,620       75      4,290  3,682 (86%)    483 (11%)    125  (3%)
floods     55,384    47,312       86      2,429    818 (34%)  1,291 (53%)    320 (13%)
flooded    41,545    38,740       94      2,339  1,418 (61%)    638 (27%)    283 (12%)
flooding   77,280    66,920       87      3,003  1,420 (47%)  1,573 (52%)     10  (1%)

After the subsets had been cleaned up, each subset of unique tweets was clustered in order to identify information threads. The clusters were annotated with the labels ‘relevant’, ‘irrelevant’, and ‘incoherent’, which are the most general information threads that can be handled with our approach. We generated 50 clusters for each subset. The number of tweets that were placed in a cluster is presented in the column clustered. The clustering of the subset collected with the key term ‘floods’ yielded both the most irrelevant (53%) and incoherent (13%) clusters. The key terms ‘flood’ and ‘flooded’ contained the most relevant information, at 86% and 61% of the related clusters respectively.20 The subset size and duplicate ratio difference between ‘flood’ and ‘floods’ was not affected by the bias of the clustering algorithm towards having more coherent clusters as the redundancy of the information increases.

A cluster is relevant if it is about a relevant information thread, e.g., an actionable insight, an observation, a witness reaction, or event information available from citizens or authorities that can help people avoid harm in the current use case. Otherwise, the label is irrelevant. Clusters that are not clearly about any information thread are labeled as incoherent. The respective columns in Table 3.5 provide information about the number of tweets in each thread. Having relatively many incoherent clusters for a subset points towards either the ambiguity of a term or a large number of distinct aspects, senses, or threads covered by the respective subset.

Each inflection of the term ‘flood’ has a different set of uses with a number of overlapping uses. A detailed cluster analysis reveals the characteristics of the information threads for the key term and its inflections. The key term ‘flood’ (stem form) is mostly used by the authorities and automatic bots that provide updates about disasters in the form of a ‘flood alert’ or ‘flood warning’. Moreover, mentioning multi-word terminological concepts, e.g., ‘flood advisory’, enables tweets to fall in the same cluster. The irrelevant tweets using this term are mostly about ads or product names. The form ‘flooded’ is mostly used for expressing observations and opinions toward a disaster event and in news article related tweets. General comments and expressions of empathy toward the victims of disasters are found in tweets that contain the form ‘floods’. Finally, the form ‘flooding’ mostly occurs in tweets that are about the consequences of a disaster.

Another aspect that emerged from the cluster analysis is that common and specific multi-word expressions containing the key term or one of its inflections, e.g., ‘flooding back’ and ‘flooding timeline’, form at least one cluster around them. Tweets that contain such expressions can be transferred from the remaining tweets, which are not put in any cluster, to a related cluster. For example, we identified 806 and 592 tweets that contain ‘flooding back’ and ‘flooding timeline’ respectively.

20The unique% column presents the percentage in relation to the whole set before excluding the (near-)duplicates; the % values under relevant, irrelevant, and incoherent present the percentages in relation to the subset after excluding the (near-)duplicates.

Finally, relevant named entities, such as the name of a storm, river, bridge, road, web platform, person, institution, or place, and emoticons enable the clustering algorithm to detect coherent clusters of tweets containing such named entities.

The results of this study show that determining and handling separate uses of a key term and its inflections reveals different angles of the knowledge we can discover in tweet collections.

3.6 Using Relevancer to Detect Relevant Tweets: The Nepal Earthquake Case

As a third use case, we describe our submission to the FIRE 2016 Microblog track Information Extraction from Microblogs Posted during Disasters (Ghosh & Ghosh, 2016). The task in this track was to extract all relevant tweets pertaining to seven given topics from a set of tweets. These topics were about (i) available resources; (ii) required resources; (iii) available medical resources; (iv) required medical resources; (v) locations where resources are available or needed; (vi) NGO or government activities; and (vii) infrastructure damage or restoration. The tweet set was collected using key terms related to the Nepal 2015 earthquake21 by the task organizing team and distributed to the participating teams.

We used Relevancer and processed the data taking the following steps already detailed in the first sections of this chapter: (1) preprocessing the tweets, (2) clustering them, (3) manually labeling the coherent clusters, and (4) creating a classifier that can be used for classifying tweets that are not placed in any coherent cluster, and for classifying new (i.e. previously unseen) tweets using the labels defined in step (3). The data and the application of Relevancer are described below.

Data

At the time of download (August 3, 2016), 49,660 tweet IDs were available out of the 50,068 tweet IDs provided for this task. The missing tweets had been deleted by the people who originally posted them. We used only the English tweets, 48,679 tweets in all, based on the language tag provided by the Twitter API. Tweets in this data set were already deduplicated by the task organisation team as much as possible.

The final tweet collection contains tweets that were posted between April 25, 2015 and May 5, 2015. The daily distribution of the tweets is visualized in Figure 3.6.

21https://en.wikipedia.org/wiki/April_2015_Nepal_earthquake, accessed June 10, 2018

System Overview

The typical analysis steps of Relevancer were applied to the data provided for this task. This study adds steps such as normalization, extending clusters, and bucketing tweets based on a time period, in order to improve the precision and recall of the tool. The bucketing, which splits the data into buckets of equally long periods based on the posting time of the tweets, is introduced in this study. In this case, the focus of the Relevancer tool is the text and the date and time of posting of a tweet. The steps of Relevancer as applied in this use case are explained below.

Normalization

After inspection of the data, we decided to normalize a number of phenomena beyond the standard preprocessing steps described before. First, we removed certain automatically generated parts at the beginning and at the end of a tweet text. We determined these manually, e.g., ‘live updates:’, ‘I posted 10 photos on Facebook in the album’, and ‘via usrusrusr’. After that, words that end in ‘...’ were removed as well. These words are mostly incomplete due to the length restriction of a tweet text, and usually occur at the end of tweets generated from within another application. Also, we eliminated any consecutive duplication of a token. Duplication of tokens mostly occurs with the dummy forms for user names and URLs, and with event-related key words and entities. For instance, two of the three consecutive tokens at the beginning of the tweet #nepal: nepal: nepal earthquake: main language groups (10 may 2015) urlurlurl #crisismanagement were removed in this last step of normalization. This last step enables machine learning algorithms to focus on the actual content of the tweets.
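
The normalization can be sketched as follows (the boilerplate patterns shown are the examples from the text; the full list was compiled manually, and the token handling is simplified):

    import re

    # Illustrative patterns; the actual prefix/suffix list was compiled manually.
    BOILERPLATE = [r'^live updates:\s*',
                   r'^i posted \d+ photos on facebook in the album\s*',
                   r'\s*via usrusrusr$']

    def normalize(text):
        t = text.lower().strip()
        for pattern in BOILERPLATE:
            t = re.sub(pattern, '', t)
        # Drop truncated words ending in '...' (cut off by the length limit).
        tokens = [w for w in t.split() if not w.endswith('...')]
        # Collapse consecutive duplicates of a token, as with the dummy forms
        # usrusrusr and urlurlurl or repeated event key words.
        out = []
        for w in tokens:
            if not out or out[-1] != w:
                out.append(w)
        return ' '.join(out)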

Clustering and Labeling

The clustering step is aimed at finding topically coherent groups (information threads). These groups are labeled as relevant, irrelevant, or incoherent. Coherent clusters were selected from the output of the clustering algorithm K-Means, with k = 200, i.e., a preset number of 200 clusters. Coherency of a cluster is calculated based on the distance between the tweets in a particular cluster and the cluster center. Tweets that are in incoherent clusters (as determined by the algorithm) were clustered again by relaxing the coherence restrictions, until the algorithm reaches the requested number of coherent clusters. The second stop criterion for the algorithm is the limit of the coherence parameter relaxation.

The coherent clusters were extended, specifically for this use case, with the tweets that are not in any coherent cluster. This step was performed by iterating over all coherent clusters in descending order of the total length of the tweets in a cluster, and adding to each cluster the tweets that have a cosine similarity higher than 0.85 with respect to its center. The total number of tweets that were transferred to the clusters in this way was 847.
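
A sketch of this extension step (scikit-learn cosine similarity; the ordering of clusters by total tweet length is assumed to be precomputed, and we additionally assume each tweet can be claimed by only one cluster):

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    def extend_clusters(clusters, centers, x_rest, threshold=0.85):
        # clusters: lists of member indices, pre-sorted in descending order of
        # total tweet length; centers: the matching center vectors;
        # x_rest: vectors of tweets outside any coherent cluster.
        claimed = np.zeros(x_rest.shape[0], dtype=bool)
        for ci, center in enumerate(centers):
            sims = cosine_similarity(x_rest, center.reshape(1, -1)).ravel()
            for ti in np.where((sims > threshold) & ~claimed)[0]:
                clusters[ci].append(ti)   # transfer the tweet to this cluster
                claimed[ti] = True        # each tweet is claimed only once
        return clusters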

As bucketing is applied in this use case, the tool first searches for coherent clusters of tweets in each day separately. Then, in a second step, it clusters all tweets from all days that previously were not placed in any coherent cluster. Applying the two steps sequentially enables Relevancer to detect local and global information threads as coherent clusters respectively.

For each cluster, a list of tweets is presented to an expert who then determines which are the relevant and irrelevant clusters.22 Clusters that contain both relevant and irrelevant tweets are labeled as incoherent by the expert.23 Relevant clusters are those which an expert considers to be relevant for the aim she wants to achieve. In the present context more specifically, clusters that are about a topic specified as relevant by the task organisation team should be labeled as relevant. Any other coherent cluster should be labeled as irrelevant.

Creating the classifier

The classifier was trained with the tweets labeled as relevant or irrelevant in the previous step. Tweets in the incoherent clusters were not included in the training set. The Naive Bayes method was used to train the classifier.

We used a small set of stop words, which were not included in the feature representation. These are a small set of key words (nouns), viz. nepal, earthquake, quake, kathmandu and their hashtag versions24; the determiners the, a, an; the conjunctions and, or; the prepositions to, of, from, with, in, on, for, at, by, about, under, above, after, before; and the news-related words breaking and news and their hashtag versions. The normalized forms of the user names and URLs, usrusrusr and urlurlurl, are included in the stop word list as well.

22The author had the role of being the expert for this task. A real scenario would require a domain expert.
23Although the algorithmic approach determines the clusters that were returned as coherent, the expert may not agree with it.
24This set was based on our observation, as we did not have access to the key words that were used to collect this data set.

We optimized the smoothing prior parameter α to be 0.31 by cross-validation, comparing the classifier performance over 20 equally spaced values of α between 0 and 2, including 0 and 2. Word unigrams and bigrams were used as features. The performance of the classifier on a set of 15% held-out data is provided below in Tables 3.6 and 3.7. The rows and the columns of the confusion matrix represent the actual and the predicted labels of the test tweets respectively. The diagonal provides the correct number of predictions.25

Table 3.6: Confusion matrix of the Naive Bayes classifier on test data.

             Irrelevant  Relevant
Irrelevant          720        34
Relevant             33       257

Table 3.7: Precision, recall, and F1-score of the classifier on the test collection.

            precision  recall    F1  support
Irrelevant        .96     .95   .96      754
Relevant          .88     .89   .88      290
Avg/Total         .94     .94   .94    1,044

The whole collection (48,679 tweets) was classified with the trained Naive Bayes classifier; 11,300 tweets were predicted as relevant. We continued the analysis with these relevant tweets.

Clustering and Labeling Relevant Tweets

Relevant tweets, as predicted by the automatic classifier in the previous step, were clustered again, this time without filtering them based on the coherency criteria. In contrast to the first clustering step, the output of K-Means was used as is, again with k = 200. We annotated these clusters using the seven topics as predetermined by the task organizers. To the extent possible, incoherent clusters were labeled using the closest provided topic. Otherwise, the cluster was discarded.

The clusters that have a topic label contain 8,654 tweets. The remaining clusters, containing 2,646 tweets, were discarded and not included in the submitted set.

25Since we optimize the classifier for this collection, the performance of the classifier on unseen data is not analyzed here.

Results

The result of our submission was recorded under the ID relevancer_ru_nl. The performance of our submission was evaluated by the organisation committee in terms of precision at rank 20, recall and Mean Average Precision (MAP) at rank 1,000, and MAP over all retrieved tweets. Our results, as announced by the organisation committee, are as follows: 0.3143 precision at rank 20, 0.1329 recall at rank 1,000, 0.0319 MAP at rank 1,000, and 0.0406 MAP considering all tweets in our submitted results.

We generated an additional evaluation of our results based on the annotated tweets provided by the task organizers. The overall precision and recall are 0.17 and 0.34 respectively. The performance for the topics FMT1 (available resources), FMT2 (required resources), FMT3 (available medical resources), FMT4 (required medical resources), FMT5 (resource availability at certain locations), FMT6 (NGO and governmental organization activities), and FMT7 (infrastructure damage and restoration reports) is provided in Table 3.8.

Table 3.8: Precision, recall, and F1-score of our submission and the percentage of the tweets in the annotated tweets per topic.

       precision  recall    F1  percentage
FMT1        0.17    0.50  0.26        0.27
FMT2        0.35    0.09  0.15        0.14
FMT3        0.19    0.28  0.23        0.16
FMT4        0.06    0.06  0.06        0.05
FMT5        0.05    0.06  0.06        0.09
FMT6        0.05    0.74  0.09        0.18
FMT7        0.25    0.08  0.12        0.12

On the basis of these results, we conclude that the success of our method differs widely across topics. In Table 3.8, we observe that there is a clear relation between the F1-score and the percentage of tweets per topic in the manually annotated data (correlation coefficient 0.80). Consequently, we conclude that our method performs better in case the topic is well represented in the collection.

Conclusion

In this study we applied the methodology supported by the Relevancer system in order to identify relevant information by enabling human input in terms of cluster labels. This method yielded an average performance in comparison to other participating systems (Ghosh & Ghosh, 2016), ranking seventh among fourteen submissions.

We observed that clustering tweets for each day separately enabled the unsupervised clustering algorithm to identify specific coherent clusters in a shorter time than the time spent on clustering the whole set. Moreover, this setting provided a realistic day-by-day overview for each day following the day of the earthquake.

Our approach incorporates human input. In principle, an expert should be able to refine a tweet collection until she reaches a point where the time spent on a task is optimal and the performance is sufficient. However, with this particular task, an annotation manual was not available and the expert had to stop after one iteration without being sure to what extent certain information threads were actually relevant to the task at hand; for example, are (clusters of) tweets pertaining to providing or collecting funds for the disaster victims to be considered relevant or not?

It is important to note that the Relevancer system yields its results in random order, as it has no mechanism for ranking posts by relative importance. We speculate that rank-based performance metrics such as those used by the organizers of the challenge are not optimally suited for evaluating it.

3.7 Identifying Flu Related Tweets in Dutch

As a final use case we present a study in which we analyzed a collection of 229,494 Dutch tweets posted between December 16, 2010 and June 30, 2013 and containing the key term ‘griep’ (En: flu). During preprocessing, retweets (24,019), tweets that were posted from outside the Netherlands (1,736) or by irrelevant users (158), and exact duplicates (8,156) were eliminated.

Our use case aims at finding tweets reporting on personal flu experiences. Tweets in which users are reporting symptoms or declaring that they actually have or suspect having the flu are considered as relevant. Irrelevant tweets are mostly tweets containing general information and news about the flu, tweets about some celebrity suffering from the flu, or tweets in which users empathize or joke about the flu.

As a result of the preprocessing, all retweets (24,019 tweets) have been removed, the tweet text has been transformed into a more standard form, and all exact duplicates (8,156 tweets) have been eliminated. For this study we decided to keep the near-duplicate tweets in our collection, since we observed that Twitter users appear to use very similar phrases to describe their experiences and personal situation in the scope of the flu. The reason for this exception is that flu-related tweets are short and phrased similarly to each other; eliminating near-duplicates for this topic would cause a remarkable loss of information.

The remaining 195,425 tweets were clustered in two steps. We started from the assumption that tweets that are posted around the same time have a higher probability of being related to each other, i.e., they are more likely to be related than tweets that are posted some time apart. Therefore, we split the tweet set in buckets of ten days each. We observed that this bucketing method increases the performance of, and decreases the time spent by, the clustering algorithm by reducing the search space for determining similar tweets and the ambiguity across tweets (Hürriyetoğlu et al., 2016b). Moreover, bucketing allows the clustering to be performed in parallel for each bucket separately in this setting. Finally, bucketing enables small information threads to be detected, as they become more visible to the clustering algorithm. This was especially observed when there were news items or discussions about celebrities or significant events that lasted only a few days. After extracting these temporary tweet bursts as clusters in one clustering iteration, we could extract persistent topics in a successive clustering iteration.26
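
A minimal sketch of the bucketing itself (pure Python; the ten-day period follows the text, while the data layout is an assumption):

    from datetime import timedelta

    def bucket_by_period(tweets, days=10):
        # tweets: (posting_time, text) pairs, sorted by posting_time.
        edge = tweets[0][0] + timedelta(days=days)
        buckets, current = [], []
        for ts, text in tweets:
            while ts >= edge:         # advance to the bucket this tweet falls in
                buckets.append(current)
                current = []
                edge += timedelta(days=days)
            current.append((ts, text))
        buckets.append(current)
        return buckets                # each bucket can be clustered in parallel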

We set the clustering algorithm to search for ten coherent clusters in each bucket. Then, we extract the tweets that are in a coherent cluster and search for ten other clusters in the remaining, unclustered, tweets. In total, 10,473 tweets were put in 1,001 clusters. Since the temporal distribution of the tweets in the clusters has a correlation of 0.56 with that of the whole data set, aggregated at the day level, we consider the clustered part to be weakly representative of the whole set. This characteristic enhances the understanding of the set and enables the classifier to have the right balance of the classes.

The time available allowed one annotator to label 306 of the clusters: 238 were found to be relevant (2,306 tweets), 196 irrelevant (2,189 tweets), and 101 incoherent (985 tweets). The tweets that are in the relevant or irrelevant clusters were used to train the SVM classifier. The performance of the classifier on the held-out 10% (validation set) and on the unclustered part (a random selection from the tweets that were not placed in any cluster during the clustering phase, annotated by the same annotator) is listed in Table 3.9. The accuracy of the classifier is 0.95 in this setting.

Table 3.9: Classifier performance in terms of precision (P), recall (R), and F1 on the validation set (S1) and on 255 unclustered tweets (S2)

            validation set (S1)        unclustered (S2)
              P    R    F1  support      P    R    F1  support
relevant    .95  .96   .96      237    .66  .90   .76      144
irrelevant  .96  .94   .95      213    .75  .39   .51      111
Avg/Total   .95  .95   .95      450    .70  .68   .65      255

26Using a clustering algorithm that is more sophisticated than K-Means may remove the need of dividing the clustering task in this manner.

The baseline for the classifier is the prediction of the majority class in the test set, under which the precision and recall of the minority class are undefined. Therefore, we compare the generated classifier and the baseline in terms of accuracy, which is 0.67 and 0.56 respectively.

3.8 Conclusion

In this chapter we presented our research that aims at identifying relevant content in social media at a granularity level determined by an expert. The evolution of the tool, illustrated in each subsequent use case, reflects how each step of the analysis was developed in response to needs that arise in the process of analyzing a tweet collection.

The novelty of our methodology lies in dealing with all steps of the analysis, from data collection to a classifier that can be used to detect relevant information in microtext collections. It enables experts to discover what is in a collection and to make decisions about the granularity of their analysis. The result is a scalable tool that covers this entire analysis pipeline at a reasonable performance.

The following chapter reports on our experiments to explore methods that can shed light on and improve the performance of Relevancer by integrating it with a linguistically motivated rule-based method. The motivation behind this exploration is to remedy the weakness of Relevancer in handling unbalanced class distributions and the limitations imposed by having a small training set.

Figure 3.1: Tweet distribution per key term

Figure 3.2: Tweet volume change at each step of the preprocessing in the collection


Figure 3.3: Total tweet counts before (‘All’) and after (‘Filtered’) the cleaning steps

[Flow: Tweet collection (77,905) → unsupervised clustering → in a cluster (13,236) vs. not in a cluster (64,669); expert annotation of clusters → coherent clusters (10,542), used to train an automatic classifier, and incoherent clusters (2,694)]

Figure 3.4: Phases in the analysis process with the number of tweets at each step, starting after duplicate and near-duplicate elimination

Figure 3.5: Tweet distribution per key term after removing retweets, duplicates, and near-duplicates

Figure 3.6: Temporal distribution of the tweets

Chapter 4

Mixing Paradigms for Relevant Microtext Classification

4.1 Introduction

Microtext classification can be performed using various methodologies. We test, compare, and integrate machine learning and rule-based approaches in this chapter. Each approach was first applied in the context of a shared task that did not provide any annotated data for training or evaluation before the submission. Next, they were applied in a case where annotated data is available from two different shared tasks.1

These experiments shed light on the performance of machine learning and rule-based methods on microtext classification under various conditions in the context of an earthquake disaster. Different starting conditions and results were analyzed in order to identify and benefit from the strengths of each approach as regards different performance requirements, e.g., precision vs. recall. Our preliminary results show that integration of machine learning and rule-based methodologies can alleviate annotated data scarcity and class imbalance issues.

The results of our experiments are reported in Sections 4.2 and 4.3 for the cases where annotated training and evaluation data is unavailable and available, respectively. Section 4.4 concludes this chapter with the overall insights based on the experience gained through participating in relevant shared tasks and the experiments performed in the scope of mixing rule-based and machine learning paradigms.

1It was possible to obtain the annotated data after completion of the task.


4.2 Classifying Humanitarian Information in Tweets

Based on: Hürriyetoğlu, A., & Oostdijk, N. (2017, April). Extracting Humanitarian Information from Tweets. In Proceedings of the First International Workshop on Exploitation of Social Media for Emergency Relief and Preparedness. Aberdeen, United Kingdom. Available from http://ceur-ws.org/Vol-1832/SMERP-2017-DC-RU-Retrieval.pdf

In this section we describe the application of our methods to humanitarian information classification for microtexts and their performance in the scope of the SMERP 2017 Data Challenge task. Detecting and extracting the (scarce) relevant information from tweet collections as precisely, completely, and rapidly as possible is of the utmost importance during natural disasters and other emergency events. Following the experiments in the previous chapter, we remain focused on microtext classification.

Rather than using only a machine learning approach, we combined Relevancer with an expert-designed rule-based approach. Both are designed to satisfy the information needs of an expert by allowing experts to define and find the target information. The results of the current data challenge task demonstrate that it is realistic to expect a balanced performance across multiple metrics even under poor conditions, as the combination of the two paradigms can be leveraged; both approaches have weaknesses that the other approach can compensate for.

4.2.1 Introduction

This study describes our approach used in the text retrieval sub-track that was organized as part of the Exploitation of Social Media for Emergency Relief and Preparedness (SMERP 2017) Data Challenge Track, Task 1. In this task, participants were required to develop methodologies for extracting from a collection of microblogs (tweets) those tweets that are relevant to one or more of a given set of topics with high precision as well as high recall.2 The extracted tweets should be ranked based on their relevance. The topics were the following: resources available (T1), resources needed (T2), damage, restoration, and casualties (T3), and rescue activities of various NGOs and government organizations (T4). With each of the topics there was a short (one sentence) description and a more elaborate description in the form of a one-paragraph narrative.

The challenge was organized in two rounds.3 The task in both rounds was essentially the same, but participants could benefit from the feedback they received after submitting their results for the first round. The data were provided by the organizers of the challenge and consisted of tweets about the earthquake that occurred in Central Italy in August 2016.4 The data for the first round of the challenge were tweets posted during the first day (24 hours) after the earthquake happened, while for the second round the data set was a collection of tweets posted during the next two days (day two and three) after the earthquake occurred. The data for the second round were released after round one had been completed. All data were made available in the form of tweet IDs (52,469 and 19,751 for rounds 1 and 2 respectively), along with a Python script for downloading them by means of the Twitter API. In our case the downloaded data sets comprised 52,422 (round 1) and 19,443 tweets (round 2) respectively.5

2See also http://computing.dcu.ie/~dganguly/smerp2017/index.html, accessed June 10, 2018
3The organizers consistently referred to these as ‘levels’.

For each round, we discarded tweets that (i) were not marked as English by the Twitter API; (ii) did not contain the country name Italy or the name of any region, city, municipality, or earthquake-related place in Italy; (iii) had been posted by users that have a profile location other than Italy; this was determined by manually checking the most frequently occurring locations and listing the ones that are outside Italy; (iv) originated from an excluded time zone; the time zones were identified manually and covered the ones that appeared to be the most common in the data sets; (v) had a country meta-field other than Italy; and (vi) had fewer than 4 tokens after normalization and basic cleaning. The filtering was applied in the order given above. After filtering, the data sets consisted of 40,780 (round 1) and 17,019 (round 2) tweets.
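
The filtering cascade can be summarized as follows. This is a sketch, not the exact implementation: the field names follow the Twitter API loosely, the place-name set is a tiny illustrative subset, and the token count is taken on the raw text rather than the normalized one.

    ITALY_TERMS = {'italy', 'amatrice', 'accumoli', 'rieti'}  # illustrative subset

    def mentions_italy(text):
        return any(term in text.lower() for term in ITALY_TERMS)

    def keep_tweet(tweet, excluded_locations=(), excluded_time_zones=()):
        # Apply the six filters in order; return True if the tweet is kept.
        if tweet.get('lang') != 'en':                              # (i)
            return False
        if not mentions_italy(tweet.get('text', '')):              # (ii)
            return False
        user = tweet.get('user', {})
        if user.get('location') in excluded_locations:             # (iii)
            return False
        if user.get('time_zone') in excluded_time_zones:           # (iv)
            return False
        place = tweet.get('place') or {}
        if place.get('country') not in (None, 'Italy'):            # (v)
            return False
        return len(tweet.get('text', '').split()) >= 4             # (vi)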

We participated in this challenge with two completely different approaches which we developed and applied independently of each other. The first approach is a machine learning approach implemented in the Relevancer tool, introduced in Chapter 3 (Hürriyetoğlu et al., 2016), which offers a complete pipeline for analyzing tweet collections. The second approach is a linguistically motivated approach in which a lexicon and a set of hand-crafted rules are used in order to generate search queries. As the two approaches each have their particular strengths and weaknesses and we wanted to find out whether the combination would outperform each of the underlying approaches, we also submitted a run in which we combined them.

The structure of the remainder of this study is as follows. We first give a more elaborate description of our two separate approaches, starting with the machine learning approach in Section 4.2.2 and the rule-based approach in Section 4.2.3. Then in Section 4.2.4 we describe the combination of the results of these two approaches. Next, in Section 4.2.5, the results are presented, while in Section 4.2.6 the most salient findings are discussed. Section 4.2.7 concludes this subsection with a summary of the main findings.

4https://en.wikipedia.org/wiki/August_2016_Central_Italy_earthquake, accessed June 10, 2018
5At the time of download some tweets had been removed.

4.2.2 Approach 1: Identifying Topics Using Relevancer

The main analysis steps supported by Relevancer are: preprocessing, clustering, manual labeling of coherent clusters, and creating a classifier for labeling previously unseen data. Below we provide details of the configuration that we used for the present task.

Pre-processing The retweet elimination, duplicate elimination, and near-duplicate elimination steps that are part of the standard Relevancer approach, in which retweets and exact and near-duplicates are detected and eliminated, were not applied in the scope of this task. The data set that was released was considered to have been preprocessed in this respect.

Clustering First we split the tweet set into buckets determined by periods, which are extended at each iteration of clustering. The bucket length starts at 1 hour and stops after the iteration in which the length of the period is equal to the whole period covered by the tweet set. Moreover, the feature extraction step was based on character n-grams (tri-, four-, and five-grams) for this clustering.

Annotation Coherent clusters that were identified automatically by the algorithm are presented to an expert who is asked to judge whether indeed a cluster is coherent and, if so, to provide the appropriate label.6 For the present task, the topic labels are those determined by the task organization team (T1-T4). We introduced two additional labels: the irrelevant label for coherent clusters that are about any irrelevant topic, and the incoherent label for clusters that contain tweets about multiple relevant topics or combinations of relevant and irrelevant topics.7 The expert does not have to label all available clusters. For this task we annotated only one quarter of the clusters from each hour.

Classifier Generation The labeled clusters are used to create an automatic classifier. For the current task we trained a state-of-the-art Support Vector Machine (SVM) classifier by using standard default parameters. We trained the classifier with 90% of the labeled tweets by cross-validation. The classifier was tested on the remaining labeled tweets, which were in the labeled clusters.8

Ranking For our submission in round 1 we used the labels as they were obtained by the classifier. No ranking was involved. However, as the evaluation metrics used in this shared task expected the results to be ranked, for our submission in round 2 we classified the relevant tweets by means of a classifier based on these tweets and used the classifier confidence score for each tweet for ranking them.

6The experts in this setting were the task participants. The organizers of the task contributed to the expert knowledge in terms of topic definition and providing feedback on the topic assignment of the round 1 data.
7The label definition affects the coherence judgment. Specificity of the labels determines the required level of the tweet similarity in a cluster.
8The performance scores are not considered to be representative due to the high degree of similarity of the training and test data.

We applied clustering with the value for the requested-clusters parameter set to 50 (round 1) and 6 (round 2) per bucket, which yielded a total of 1,341 and 315 coherent clusters respectively. In the annotation step, 611 clusters from round 1 and 315 clusters from round 2 were labeled with one of the topic labels T1-T4, irrelevant, or incoherent.9 Since most of the annotated clusters were irrelevant or T3, we enriched this training set with the annotated tweets featured in the FIRE (Forum for Information Retrieval Evaluation) 2016 task Information Extraction from Microblogs Posted during Disasters (Ghosh & Ghosh, 2016).

In preparing our submission for round 2 of the current task, we also included in the training set the positive feedback from the round 1 submissions for both of our approaches.

We used the following procedure for preparing the submission for round 2. First, we put the tweets from the clusters that were annotated with one of the task topics (T1-T4) directly in the submission with rank 1.10 The number of these tweets per topic was as follows: 331 for T1, 33 for T2, 647 for T3, and 134 for T4. Then, the tweets from the incoherent clusters and the tweets that were not included in any cluster were classified by the SVM classifier created for this round. The tweets that were classified as one of the task topics were included in the submission. This second part is ranked lower than the annotated part, based on the confidence of a classifier trained using these predicted tweets.

Table 4.1 gives an overview of the submissions in rounds 1 and 2 that are based on the approach using the Relevancer tool. The submissions are identified as Relevancer and Relevancer with ranking (see also Section 4.2.5).

4.2.3 Approach 2: Topic Assignment by Rule-based Search Query Generation

In the second approach, a lexicon and a set of hand-crafted rules are used to generate search queries. As such it continues the line of research described in Oostdijk & van Halteren (Oostdijk & van Halteren, 2013a, 2013b) and Oostdijk et al. (Oostdijk, Hürriyetoğlu, Puts, Daas, & van den Bosch, 2016), in which word n-grams are used for search.

9The annotation was performed by one person who has an expert role, as suggested by the Relevancer methodology.
10The submission format allowed us to determine the rank of a document.

Table 4.1: Topic assignment using Relevancer

                Round 1               Round 2
Topic(s)    # tweets  % tweets    # tweets  % tweets
T1                52      0.12         855      5.02
T2                22      0.05         173      1.02
T3             5,622     13.79       3,422     20.11
T4                50      0.12         507      2.98
T0            35,034     85.91      12,062     70.87
Total         40,780    100.00      17,019    100.00



For the present task we compiled a dedicated (task-specific) lexicon and rule set from scratch. In the lexicon, with each lexical item information is provided about the part of speech (e.g., noun, verb), semantic class (e.g., casualties, building structures, roads, equipment), and topic class (e.g., T1, T2). A typical example of a lexicon entry thus looks as follows:

deaths N2B T3

where deaths is listed as a relevant term, with N2B encoding the information that it is a plural noun of the semantic class [casualties], which is associated with topic 3. In addition to the four topic classes defined for the task, in the lexicon we introduced a fifth topic class, viz. T0, for items that rendered a tweet irrelevant. Thus T0 was used to mark a small set of words each of which referred to a geographical location (country, city) outside Italy, for example Nepal, Myanmar, and Thailand.11

The rule set consists of finite state rules that describe how lexical items can (combine to) form search queries made up of (multi-)word n-grams. Moreover, the rules also specify which of the constituent words determines the topic class for the search query. An example of a rule is

NB1B *V10D

Here NB1B refers to items such as houses and flats, while V10D refers to past participle verb forms expressing [damage] (damaged, destroyed, etc.). The asterisk indicates that in cases covered by this rule it is the verb that is deemed to determine the topic class for the multi-word n-gram that the rule describes. This means that if the lexicon lists destroyed as V10D and T3, upon parsing the bigram houses destroyed the rule will yield T3 as the result.

11More generally, T0 was assigned to all irrelevant tweets. See below.

The ability to recognize multi-word n-grams is essential in the context of this challenge, as most single key words on their own are not specific enough to identify relevant instances: with each topic the task is to identify tweets with specific mentions of resources, damage, etc. Thus the task/topic description for topic 3 explicitly states that tweets should be identified ‘which contain information related to infrastructure damage, restoration and casualties’, where ‘a relevant message must mention the damage or restoration of some specific infrastructure resources such as structures (e.g., dams, houses, mobile towers), communication facilities, ...’ and that ‘generalized statements without reference to infrastructure resources would not be relevant’. Accordingly, it is only when the words bridge and collapsed co-occur that a relevant instance is identified.

As for each tweet we seek to match all possible search queries specified by the rules and the lexicon, it is possible that more than one match is found for a given tweet. If this is the case we apply the following heuristics: (a) multiple instances of the same topic class label are reduced to one (e.g. T3-T3-T3 becomes T3); (b) where more than one topic class label is assigned but one of these happens to be T0, then all labels except T0 are discarded (thus T0-T3 becomes T0); (c) where more than one topic label is assigned and these labels are different, we maintain the labels (e.g. T1-T3-T4 is a possible result). Tweets for which no matches were found were assigned the T0 label.
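
These label-combination heuristics translate directly into code; a minimal sketch (the function name is illustrative):

    def combine_topic_labels(matches):
        # matches: topic labels produced by all rule matches for one tweet,
        # e.g. ['T3', 'T3', 'T0']. Returns the final label set as a list.
        labels = set(matches)          # (a) duplicates reduced to one
        if not labels:
            return ['T0']              # no match found: irrelevant
        if 'T0' in labels:
            return ['T0']              # (b) T0 overrides co-occurring labels
        return sorted(labels)          # (c) distinct labels are maintained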

The lexicon used for round 1 comprised around 950 items, while the rule set consisted of some 550 rules. For round 2 we extended both the lexicon and the rule set (to around 1,400 items and 1,750 rules respectively) with the aim of increasing the coverage, especially with respect to topics 1, 2, and 4. Here we should note that, although upon declaration of the results for round 1 each participant received some feedback, we found that it contributed very little to improving our understanding of what exactly we were targeting with each of the topics. We only got confirmation, and then only for a subset of tweets, that tweets had been assigned the right topic. Thus we were left in the dark about whether tweets we deemed irrelevant were indeed irrelevant, while also for relevant tweets that might have been assigned the right topic but were not included in the evaluation set we were none the wiser.12

The topic assignments we obtained for the two data sets are presented in Table 4.2.

In both data sets T3 (Damage, restoration and casualties reported) is by far the most frequent of the relevant topics. The number of cases where multiple topics were assigned to a tweet is relatively small (151/40,780 and 194/17,019 tweets respectively). Also, in both data sets a large proportion of tweets was labeled as irrelevant (T0, 81.82% and 75.03% respectively). We note that in the majority of cases it is the lack of positive evidence for one of the relevant topics that leads to the assignment of the irrelevant label.13 Thus for the data in round 1 only 2,514/33,366 tweets were assigned the T0 label on the basis of the lexicon (words referring to geographical locations outside Italy, see above). For the data in round 2 the same was true for 1,774/12,769 tweets.14

12The results from round 1 are discussed in more detail in Section 4.2.6.

Table 4.2: Topic assignment rule-based approach

            Round 1                Round 2
Topic(s)    # tweets   % tweets    # tweets   % tweets
T1          91         0.22        206        1.21
T2          55         0.13        115        0.68
T3          7,002      17.17       3,558      20.91
T4          115        0.28        177        1.04
Mult.       151        0.37        194        1.14
T0          33,366     81.82       12,769     75.03
Total       40,780     100.00      17,019     100.00

For round 1 we submitted the output of this approach without any ranking (Rule-based in Table 4.4). For round 2 (cf. Table 4.5) there were two submissions based on this approach: one (Rule-based without ranking) similar to the round 1 submission, and another for which the results were ranked (Rule-based with ranking). In the latter case, ranking was done by means of an SVM classifier trained on the results; the confidence score of the classifier was used as the rank.

4.2.4 Combined Approach

While analyzing the feedback on our submissions in round 1, we noted that, although the two approaches were partly in agreement as to what topic should be assigned to a given tweet, there was a tendency for the two approaches to obtain complementary sets of results, especially with the topic classes that had remained underrepresented in both submissions.15 We speculated that this was due to the fact that each approach has its strengths and weaknesses. This then invited the question as to how we might benefit from combining the two approaches.

Below we first provide a brief overview of how the approaches differ with regard to a number of aspects, before describing our first attempt at combining them.

13In other words, it might be the case that these are not just truly irrelevant tweets, but also tweets that are falsely rejected because the lexicon and/or the rules are incomplete.
14Actually, in 209/2,514 tweets (round 1) and 281/1,774 tweets (round 2) one or more of the relevant topics were identified; yet these tweets were discarded on the basis that they presumably were not about Italy.
15Thus for T2 there was no overlap at all in the confirmed results for the two submissions.

Role of the expert Each approach requires and utilizes expert knowledge and effort at different stages. In the machine learning approach using Relevancer, the expert is expected to verify the clusters (manually) and label them. In the rule-based approach, the expert is needed to provide the lexicon and/or the rules.

Granularity The granularity of the information to be used as input and targeted as output is not the same across the approaches. The Relevancer approach can only exert control at the cluster level. This can be inefficient in case the clusters contain information about multiple topics. By contrast, the linguistic approach has full control over the granularity of the details.

Exploration Unsupervised clustering helps the expert to understand what is in the data. The linguistic approach, on the other hand, relies on the interpretation of the expert. To the extent that development data are available, they can be explored by the expert and contribute to insights into what linguistic rules are needed.

Cost of start The linguistic, rule-based approach does not require any training data. It can immediately start the analysis and yield results. The machine learning approach requires a substantial quantity of annotated data to be able to make reasonable predictions. These may be data that have already been annotated, or, when no such data are available as yet, they may be obtained by annotating the clusters produced in the clustering step of Relevancer. The filtering and preprocessing of the data play an important role in machine learning.

Control over the output In case of the rule-based approach it is always clear why a given tweet was assigned a particular topic: the output can straightforwardly be traced back to the rules and the lexicon. With the machine learning approach it is sometimes hard to understand why a particular tweet is picked as relevant or not.

Reusability Both approaches can re-use the knowledge they receive from experts in terms of annotations or linguistic definitions. The fine-grained definitions are more transferable than the basic topic label-based annotations.

One can imagine various ways in which to combine the two approaches. However, it is less obvious how to obtain the optimal combination. As a first attempt, in round 2 we created a submission based on the intersection of the results of the two approaches (Rule-based without ranking and Relevancer with ranking). The intersection contains only those tweets that were identified as relevant by both approaches and for which both approaches agreed on the topic class. For the combined submission we respected the ranking created by Relevancer with ranking. The results obtained by the combined approach are given in Table 4.3.

Table 4.3: Topic assignment combined approach

            Round 2
Topic(s)    # tweets   % tweets
T1          305        2
T2          120        1
T3          2,844      17
T4          149        1
T0          13,601     79
Total       17,019     100

Table 4.4: Results obtained in Round 1 as evaluated by the organizers

Run ID       bpref    precision@20   recall@1000   MAP
Relevancer   0.1973   0.2625         0.0855        0.0375
Rule-based   0.3153   0.2125         0.1913        0.0678

Table 4.5: Results obtained in Round 2 as evaluated by the organizers

Run ID                       bpref    precision@20   recall@1000   MAP
Relevancer with ranking      0.4724   0.4125         0.3367        0.1295
Rule-based without ranking   0.3846   0.4125         0.2210        0.0853
Rule-based with ranking      0.3846   0.4625         0.2771        0.1323
Combined                     0.3097   0.4125         0.2143        0.1093

4.2.5 Results

The submissions were evaluated by the organizers.16 Apart from the mean average precision (MAP) and recall that had originally been announced as evaluation metrics, two further metrics were used, viz. bpref and precision@20, while recall was evaluated as recall@1000. As the organizers arranged for ‘evaluation for some of the top-ranked results of each submission’ but eventually did not communicate which data of the submissions were evaluated (especially in the case of the non-ranked submissions), it remains unclear how the performance scores were arrived at. Tables 4.4 and 4.5 summarize the results for our submissions.

4.2.6 Discussion

The task in this challenge proved quite hard. This was due to a number of factors. One of these was the selection and definition of the topics: topics T1 and T2 in particular were quite close, as both were concerned with resources; T1 was to be assigned to tweets in which the availability of some resource was mentioned, while in the case of T2 tweets should mention the need of some resource. The definitions of the different topics left some room for interpretation, and the absence of annotation guidelines was experienced as a problem.

16For more detailed information on the task and organization of the challenge, its participants, and the results achieved see Ghosh et al. (2017).

Another factor was the training data set, which in both rounds we perceived to be highly imbalanced with regard to the distribution of the targeted topics.17 Although we appreciate that this realistically reflects the development of an event – you would indeed expect the tweets posted within the first 24 hours after the earthquake occurred to be about casualties and damage, and only later tweets to ask for or report the availability of resources – the underrepresentation in the data of all topics except T3 made it quite difficult to achieve a decent performance.

As already mentioned in Section 4.2.3, the feedback on the submissions for round 1 concerned only the positively evaluated entries of our own submissions. There was no information about the negatively evaluated submission entries. Moreover, not having any insight into the total annotated subset of the tweets made it impossible to infer anything about the subset that was marked as positive. This put the teams in unpredictably different conditions for round 2. Since the amount of feedback was in proportion to the submissions, having only two submissions was to our disadvantage.

As can be seen from the results in Tables 4.4 and 4.5, the performance achieved in round 2 shows an increase on all metrics when compared to that achieved in round 1. Since our approaches are inherently designed to benefit from experts over multiple interactions with the data, we consider this increase in performance a decidedly positive result.

The overall results also show that, of all our submissions, the one in round 2 using the Relevancer approach achieves the highest scores in terms of the bpref and recall@1000 metrics, while the ranked results from the rule-based approach have the highest scores for precision@20 and MAP. The Relevancer approach clearly benefited from the increase in the training data (feedback for the round 1 results for both our approaches and additional data from the FIRE 2016 task). For the rule-based approach, the extensions to the lexicon and the rules presumably largely explain the increased performance, while the different scores for the two submissions in round 2 (one in which the results were ranked, the other without ranking) show how ranking boosts the scores.

4.2.7 Conclusion

In this section we have described the approaches we used to prepare our submissions for the SMERP Data Challenge Task. Over the two rounds of the challenge we succeeded in improving our results, based on the experience we gained in round 1. We were ranked fourth and seventh out of eight semi-automatic submissions in the first round, and first, second, third, and fifth out of twelve in the second round.

17T3 cases constituted 75% of the whole data set. The class imbalance issue received no special treatment in the ML method.

This task, along with the issues that we came across, provided us with a realistic setting in which we could measure the performance of our approaches. In a real use case, we would not have had any control over the information need of an expert, her annotation quality, her feedback on the output, or her performance evaluation. Therefore, from this point of view, we consider our participation and the results we achieved a success.

As observed before, we expect that eventually the best result can be obtained by combining the two approaches. The combination of the outputs we attempted in the context of the current challenge is but one option, which as it turns out may be too simplistic. Therefore, we designed the study reported in the following section to explore the possibilities of having the two approaches interact and produce truly joint output.

4.3 Comparing and Integrating Machine Learning and Rule-Based Microtext Classification

This section compares and explores ways to integrate machine learning and rule-based microtext classification techniques. The performance of each method is analyzed and reported, both when used in a standalone fashion and when used in combination with the other method. The goal of this effort is to assess the effectiveness of combining these two approaches.

Our machine learning approach (ML) is a supervised machine learning model, and the rule-based approach (RB) is an information extraction based classification approach, which was described in Section 4.2.3 and is loosely related to the approach of Wilson et al. (Wilson, Wiebe, & Hoffmann, 2005). We aim to find what is similar and what is distinct in the prediction performance of these two paradigms, in order to derive a hybrid methodology that yields higher performance, preferably both in terms of precision and recall. The level of performance of the standalone implementations in terms of absolute numbers is not of primary importance.18

We use gold standard data (tweets) released under the First Workshop on the Exploitation of Social Media for Emergency Relief and Preparedness (SMERP 2017) and the Microblog track: Information Extraction from Microblogs of the Forum for Information Retrieval Evaluation (FIRE 2016) for training and testing our approaches respectively. These data sets have a naturally unbalanced class distribution, and tweets may be annotated with multiple labels.

18Annotating a high number of tweets for evaluation of these systems is a challenge that includes inconsistencies in labeling (Ghosh & Ghosh, 2016).

We compare ML and RB in terms of their standalone performance on the training and test data, the number of predictions on the test data for which the other approach does not yield any prediction, and their performance on the part of the data where only one of the approaches yields a prediction.

We also explored integrating the two approaches at the result level, by intersecting and unifying their predictions, and at the training data level, by using complete and filtered versions of the RB output as training data.

We provide details of these experiments below as follows. First, we discuss related research in Section 4.3.1. Then, the data that we used are described in detail in Section 4.3.2. In Section 4.3.3, we define the baseline we used to evaluate our results. The creation and results of the standalone systems are presented in Sections 4.3.4 and 4.3.5 for ML and RB respectively. Sections 4.3.6 and 4.3.7 report on the details of comparing the performance of the two approaches with each other and integrating them. In the two final sections (Sections 4.3.8 and 4.3.9) we discuss the results and present our conclusions.

4.3.1 Related Studies

Different information classification approaches, including data-driven and knowledge-driven approaches, have been compared mainly in terms of the effort required to develop them, their complexity, the task they attempt to solve, their performance, and the interpretability of their predictions (Pang, Lee, & Vaithyanathan, 2002). The strengths and weaknesses of each approach dictate a particular setting for almost every use case, given conditions such as the availability of annotated data or linguistic expertise. Therefore, the integration of data-driven and knowledge-based approaches has been the subject of many studies. The aim of the integration is to benefit from the strengths and overcome the weaknesses of each of the approaches, and thus to contribute to the development of robust information classification and extraction systems (Borthwick, Sterling, Agichtein, & Grishman, 1998; Srihari, Niu, & Li, 2000).

As a sample task, named entity recognition studies have proposed developing hybrid approaches by solving distinct parts of the task using different approaches (Saha, Chatterji, Dandapat, Sarkar, & Mitra, 2008), applying correction rules to machine learning results (Gali, Surana, Vaidya, Shishtla, & Sharma, 2008; Praveen & Ravi Kiran, 2008), and solving the task incrementally by applying dictionary-, rule-, and n-gram-based approaches in succession (Chaudhuri & Bhattacharya, 2008). The tasks of information classification and extraction in the medical domain (Xu, Hong, Tsujii, & Chang, 2012) and sentiment analysis (Melville, Gryc, & Lawrence, 2009) benefit from hybrid approaches, in terms of solving the task incrementally and using a manually compiled lexicon respectively.

Another recent study focuses on using rule-based approaches to assist in creating training data (Ratner et al., 2018). This approach uses manually compiled rules to generate training data for machine learning.

In FIRE 2016, a semi-supervised model in line with our efforts was developed by Lkhagvasuren, Gonçalves, and Saias (2016). They used key terms for each topic to create a training set and then trained a classifier on these data. This approach is similar to our RB approach with respect to having a topic-specific lexicon, and similar to our ML approach in terms of using a classifier. The results of this submission were not comparable to the other submissions, since their results covered only three classes: available, required, and other.

We consider our experiment a continuation of these efforts at comparing and integrating ML and RB classification. The domain and the microtext characteristics of the text we are handling are the most distinguishing properties of our current experiment.

4.3.2 Data Sets

We use annotated tweets that were released under the scope of the First Workshop on the Exploitation of Social Media for Emergency Relief and Preparedness (SMERP) 2017 shared task as training data (Ghosh et al., 2017).19 The annotation labels represent the four target topics of the task. These topics are (i) T1 available resources; (ii) T2 required resources; (iii) T3 damage or casualties; and (iv) T4 rescue activities. This data set was collected during the earthquake that happened in Italy in August 2016.20

The systems developed are tested on a similarly annotated data set that was released under the shared task Microblog track: Information Extraction from Microblogs Posted during Disasters, organized in the scope of the Forum for Information Retrieval Evaluation 2016 (Ghosh & Ghosh, 2016) (FIRE).21 This data set is about the Nepal earthquake that happened in April 2015.22 Since FIRE contains seven topics, we use only the tweets about the topics that match between FIRE and SMERP, which are T1, T2, T3, and T4. The FIRE topics FMT1, FMT2, FMT3, and FMT4 have been renamed T1, T2, T3, and T4 respectively in this study; FMT5, FMT6, and FMT7 were excluded from the experiment reported in this chapter.

19http://www.computing.dcu.ie/~dganguly/smerp2017/, accessed June 10, 2018
20https://en.wikipedia.org/wiki/August_2016_Central_Italy_earthquake, accessed June 10, 2018
21https://sites.google.com/site/fire2016microblogtrack/information-extraction-from-microblogs-posted-during-disasters, accessed June 10, 2018
22https://en.wikipedia.org/wiki/April_2015_Nepal_earthquake, accessed June 10, 2018

The RB system was mostly developed on the basis of the data available in round 1 and the feedback we got for our submission to round 1. The extra time between round 1 and round 2 enabled us to add more lexical patterns and rules. Moreover, we used the output of the RB system on this larger set as training data for the ML system in our integration experiments. Our test on data from Nepal is prone to missing data due to the exclusion of tweets that contain place references outside Italy. This phenomenon may cause the performance on the Nepal data to be lower than it could be if the RB system were adjusted to treat tweets that refer to Nepal as in scope.

The SMERP and FIRE data sets contain 1,224 and 1,290 annotated tweets, with 1,318 and 1,481 labels attached to them respectively. These tweets are labeled with one, two, or three of the aforementioned topic labels. Table 4.6 presents the label co-occurrence patterns.23 The diagonal shows the number of tweets that have only one label. In both data sets, T1 and T4 are the two labels that co-occur most frequently. Only three tweets are annotated with three labels in these data sets: one tweet labeled T1, T2, T4 in each data set, and one tweet labeled T1, T3, and T4 in SMERP.

Table 4.6: Number of co-occurring labels across the training (SMERP) and test (FIRE) sets

            SMERP                      FIRE
      T1    T2    T3    T4       T1    T2    T3    T4
T1    119   2     1     51       409   3     8     151
T2    -     73    2     30       -     261   23    3
T3    -     -     851   4        -     -     215   1
T4    -     -     -     89       -     -     -     215

We observe class imbalance in both the SMERP and FIRE datasets, i.e. the majority of the labels are T3 and T1 in SMERP and FIRE respectively.

We apply multi-label machine learning and rule-based information classification to take into account the multiple topic labels for a tweet.

23We left some cells empty so as to represent each co-occurrence count only once; e.g. the count for T1 and T2 is not repeated in the T2 row, T1 column.

4.3.3 Baseline

We calculated a single-majority-label baseline, a double-majority-label baseline, and an all-labels-prediction baseline in order to assess the performance of the methods we are using. The majority labels are identified in the training data (SMERP) and applied to the test data (FIRE) to measure the baseline performance. The single-majority-label baseline predicts T3 for all tweets, and the double-majority-label baseline predicts T1 and T3 for all tweets. The all-labels-prediction baseline assigns all possible labels to all tweets. The scores for each baseline are provided in Table 4.7.

Table 4.7: Baseline scores for single and double majority and all-labels-prediction

                        precision   recall   F1
single-majority-label   .03         .17      .05
double-majority-label   .20         .55      .29
all-labels-prediction   .32         1.0      .48

The different class distributions of the training and test sets cause the single- and double-majority baselines to perform poorly on the test set. The third baseline, all-labels-prediction, yields relatively good results. Therefore, we will be comparing our results to the third baseline.
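
For concreteness, the following sketch shows how such fixed-prediction baselines can be scored (assuming scikit-learn; `test_label_sets` is an illustrative name for the gold FIRE label sets, not from the thesis). Weighted per-label averaging appears consistent with the scores in Table 4.7, though the thesis does not state the averaging mode.

```python
# A hedged sketch of the three baselines: each predicts a fixed label set for
# every test tweet, scored against the gold multi-label annotations.
from sklearn.metrics import precision_recall_fscore_support
from sklearn.preprocessing import MultiLabelBinarizer

TOPICS = ["T1", "T2", "T3", "T4"]
mlb = MultiLabelBinarizer(classes=TOPICS)
Y_true = mlb.fit_transform(test_label_sets)  # e.g. [["T1", "T4"], ["T3"], ...]

baselines = {
    "single-majority-label": ["T3"],        # most frequent label in SMERP
    "double-majority-label": ["T1", "T3"],  # two most frequent SMERP labels
    "all-labels-prediction": TOPICS,        # every label for every tweet
}
for name, labels in baselines.items():
    Y_pred = mlb.transform([labels] * len(Y_true))
    p, r, f, _ = precision_recall_fscore_support(Y_true, Y_pred,
                                                 average="weighted")
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```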

4.3.4 Building a Machine Learning Based Classifier

We trained an SVM classifier using TF-IDF weighted bag-of-words (BoW) features.24 We optimized the feature extraction and the classifier parameters jointly in a cross-validation setting on 80% of the training data. The grid search for hyper-parameter optimization yielded the values ‘char’ for ngram type, between 3 and 4 for ngram range, and 0.9 for max df for the feature extraction, and the value C=2 for the SVM classifier.25 We tested the classifier created using 80% of the training data on the remaining 20% held-out training data. The precision, recall, and F1 scores of this classifier are presented in Table 4.8 under the SMERP-20% column. These results show that the classifier has the capability to predict the topic of out-of-sample tweets that are about the same disaster.
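
A minimal sketch of this classifier (assuming scikit-learn, which the thesis uses; `train_tweets`, `train_label_sets`, and `test_tweets` are illustrative names). The grid-searched values are filled in directly, and the one-vs-rest wrapper for the multi-label setting (Section 4.3.2) is an assumption about the exact multi-label strategy:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    # Character n-grams of length 3-4; terms in more than 90% of tweets dropped.
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(3, 4), max_df=0.9)),
    # One-vs-rest wrapping makes the SVM usable for multi-label prediction.
    ("svm", OneVsRestClassifier(LinearSVC(C=2))),
])

mlb = MultiLabelBinarizer(classes=["T1", "T2", "T3", "T4"])
Y_train = mlb.fit_transform(train_label_sets)   # e.g. [["T3"], ["T1", "T4"], ...]
pipeline.fit(train_tweets, Y_train)             # lists of raw tweet strings
predicted = mlb.inverse_transform(pipeline.predict(test_tweets))
```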

The right side of Table 4.8 provides the performance of the classifier that was trained using the optimized parameters and all of the training data, both the 80% that was used for hyper-parameter optimization and the 20% that was held out from hyper-parameter optimization.

24We used Scikit-learn version 0.19.1 for creating ML models (Pedregosa et al., 2011).
25The tested values of the hyper-parameters are char, char_wb, and word for ngram type, all valid combinations between 1 and 6 for ngram range, 0.8 and 0.9 for max df, and 0.01, 0.67, 1.33, and 2 for C.

Table 4.8: Precision, recall, and F1-score of the ML classifier on the held-out and test collection.

              SMERP - 20%                         FIRE
           precision  recall  F1    support    precision  recall  F1    support
T1         .83        .76     .79   33         .65        .49     .56   572
T2         .88        .75     .81   20         .63        .41     .49   291
T3         .99        .99     .99   173        .81        .81     .81   247
T4         .91        .78     .84   27         .62        .24     .35   371
Avg/Total  .96        .92     .94   253        .67        .47     .54   1,481

The results suggest that this approach yields high performance on tweets about the same disaster. However, the classifier performance decreases for all topics on tweets about a different disaster, represented by the FIRE data. The performance on topic T3 of the test data remains remarkably high.

4.3.5 Rule-Based System

We used the rule-based system we developed for the SMERP shared task (Hürriyetoğlu & Oostdijk, 2017) in this study. This system was developed using the raw, unannotated data released by the task organizers at the beginning of the task. The topic descriptions were used to analyze the data and determine what should be covered for each topic. The lexicon and rule set were then created manually based on that insight. Under these conditions, the precision, recall, and F1 scores on all of the training data, i.e. the annotated SMERP data, and on FIRE are reported in Table 4.9.

Table 4.9: Precision, recall, and F1-score of the RB system on the training (SMERP) and test (FIRE) collections.

              SMERP - 100%                        FIRE
           precision  recall  F1    support    precision  recall  F1    support
T1         .32        .13     .19   175        .68        .30     .41   572
T2         .38        .20     .27   108        .65        .24     .35   291
T3         .95        .78     .86   859        .72        .24     .36   247
T4         .87        .48     .62   176        .62        .08     .15   371
Avg/Total  .81        .61     .69   1,318      .67        .22     .33   1,481

On the one hand, we observe that the rule-based system is able to predict the majority topic (T3) and one of the minority topics (T4) relatively better than T1 and T2 on the SMERP data. On the other hand, it yields consistent precision across the different topics on the test data. The low recall scores cause the F1 score to be lower than the baseline on FIRE. The performance of the RB approach remains consistent and does not drop drastically on the test set.

The following subsections provide the analysis of the performance difference between the described ML and RB approaches and the results of the experiments in integrating them.

4.3.6 Comparing Machine Learning and Rule-Based System Results

In this subsection, we analyze the predictions of each system and compare them with each other in detail on the test data. The analysis is based on the number of predictions and their correctness. We inspect the difference between what each approach captures in comparison to the other and, in case both approaches yield a prediction, the exact and partial correctness of these predictions.

Tables 4.8 and 4.9 illustrate that the precision of the ML approach decreases on the FIRE data set for all topics, while the precision of the RB approach decreases for T3 and T4 and increases for T1 and T2. The performance drop for the ML method is expected, as the method is limited to what the training data offers. But the precision increase for the RB method demonstrates the power of the domain and linguistic expertise that can be provided in terms of a dedicated lexicon and a set of rules. The RB approach yields precision scores comparable to or better than those of the ML approach on all topics except the majority topic in the training set. ML ensures higher recall than RB. The consistently high precision and recall of the ML approach on the majority topic (T3) from SMERP, which is the training data for this model, illustrates the efficiency of the ML approach in handling majority topics. On the other hand, although it is well known that the precision of RB methods is relatively high in comparison to ML approaches, observing this phenomenon on the minority topics, with a significant increase of the score on the new data set (FIRE) but not on the majority topic, is a remarkable result in the setting we report in this experiment.

The tweet and label prediction counts on FIRE are presented in Table 4.10 for each approach. The ‘Only ML’ and ‘Only RB’ columns show the number of tweets that received a prediction only from ML or only from RB respectively. The ‘ML and RB’ and ‘No Prediction’ columns show the counts of tweets that received a prediction from both approaches and from neither approach respectively.26

26‘No Prediction’ means that a system does not yield any of the topic labels as output.

Table 4.10: Prediction count analysis on FIRE

         Only ML   Only RB   ML and RB   No Prediction   ML or RB
T1       237       59        129         147             425
T2       123       27        67          74              217
T3       149       4         66          28              219
T4       159       40        105         67              304
Total    668       114       309         281             1,009

An analysis of the columns in Table 4.10 reveals, in terms of numbers of predictions, the strengths of each approach on a data set other than the training set that was accessible to these approaches during the system development phase. In the remainder of this section we focus on the prediction quality for each column in this table.

The Only ML column contains tweets that receive a prediction only from the ML system. Table 4.11 shows that the performance is comparable to the performance on all of the FIRE data set (Table 4.8), except for the higher precision for T3 and T4.27

Table 4.11: Performance of the ML system where RB fails to predict any label for FIRE dataset.

            precision   recall   F1    support
T1          .62         .78      .69   237
T2          .62         .59      .60   123
T3          .83         .93      .87   149
T4          .64         .33      .43   159
Avg/Total   .67         .67      .65   668

The Only RB column contains tweets that receive a prediction only from the RB system. The performance of the RB approach on these tweets is summarized in Table 4.12. We observe that the precision is remarkably high for T1 and T4.28 The precision for T3 is significantly lower than the overall performance on FIRE (Table 4.9), dropping from 0.72 to 0.30. Finally, the T2 precision slightly drops, from 0.65 to 0.56.

We calculated the cosine similarity of the tweets that have a prediction only from the ML or only from the RB approach in an attempt to explain the difference in prediction performance.29 The cosine similarities of the Only ML, Only RB, and No Prediction subsets to the SMERP data are 0.102, 0.038, and 0.076 respectively. The cosine similarity of the whole FIRE set to the SMERP set is 0.116. We interpret these scores as indicating that ML is successful where the training data and test data are similar, while RB is able to capture the parts of the test data that are least similar to the training data.

27Since the recall was calculated on the restricted data set, we do not consider it as informative as the recall score calculated on the whole test set.
28We refer to the minority classes in the training set.
29For each of the subsets, we created a document that contains all tweets from the respective subset.

Table 4.12: Performance of the RB system where ML fails to predict any label for FIRE dataset.

            precision   recall   F1    support
T1          .73         .92      .81   59
T2          .56         .70      .62   27
T3          .30         .75      .43   4
T4          .69         .28      .39   40
Avg/Total   .67         .67      .63   130
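
The cosine-similarity comparison described above can be sketched as follows; the vectorization scheme (plain word-level TF-IDF here) is an assumption, as the thesis does not specify it, and the variable names are illustrative.

```python
# Each subset of tweets is concatenated into a single document and compared
# to the (likewise concatenated) SMERP data with cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def subset_similarity(smerp_tweets, subsets):
    """Cosine similarity of each tweet subset to the SMERP data."""
    docs = [" ".join(smerp_tweets)] + [" ".join(s) for s in subsets.values()]
    X = TfidfVectorizer().fit_transform(docs)
    # Row 0 is the SMERP document; compare every subset document to it.
    return {name: cosine_similarity(X[0], X[i + 1])[0, 0]
            for i, name in enumerate(subsets)}

# Illustrative call: `only_ml`, `only_rb`, and `no_pred` are lists of tweets.
# subset_similarity(smerp_tweets, {"Only ML": only_ml, "Only RB": only_rb,
#                                  "No Prediction": no_pred})
```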

No Prediction The number of tweets without any prediction is remarkably high. The RB approach succeeds in increasing the recall for T1 and T2, but the recall on T4 is low when applied to the FIRE data. The recall of the ML approach decreases for all topics. The recall of an approach may decrease because information in the new data was not available, or was expressed differently, in the data used to build the classification systems. An analysis of the information related to blood donations in the training and test data showed that the SMERP data contain the phrases ‘donate blood’ and ‘blood donation’, whereas the FIRE data contain ‘blood donors’ and ‘blood requirements’.

The columns ML and RB (intersection) and ML or RB (union) will be analyzed in the following section, where we provide the integration performance of the ML and RB approaches.

4.3.7 Integrating ML and RB Approaches at the Result Level

The performance difference directed us to integrate the ML and RB approaches. We first integrate them at the result level. Next, we use the raw predictions of the RB system from Section 4.3.5, and a filtered version of them, as training data for the ML approach. We use the results on the FIRE data set for this comparison and experiment.

The result-level integration was performed by calculating the intersection and the union of the labels predicted for each tweet by the two approaches. Table 4.13 presents the results of this integration. These results show that, overall, intersecting ML and RB yields better precision, while taking the union of ML and RB yields better recall.

Table 4.13: Precision, recall, and F1-score of the classifier result integrations on the FIRE dataset.

              Intersection               Union
           precision  recall  F1     precision  recall  F1     support
T1         .77        .16     .26    .64        .63     .63    572
T2         .93        .14     .24    .59        .51     .54    291
T3         .86        .23     .36    .77        .83     .80    247
T4         .55        .03     .06    .63        .29     .40    371
Avg/Total  .76        .14     .23    .65        .56     .59    1,481

The precision of the topics T1, T2, and T3 increases in the intersection, but the precision of T4 decreases. Moreover, in the union only T4’s precision increases. T4’s high precision originates from the high precision of the Only RB predictions.
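
A minimal sketch of this result-level integration; `ml_pred` and `rb_pred` are illustrative names for lists of per-tweet label sets, not from the thesis:

```python
# Per-tweet label sets from the two systems are combined by set intersection
# (higher precision) or set union (higher recall).
def integrate(ml_pred, rb_pred, mode="intersection"):
    op = set.intersection if mode == "intersection" else set.union
    return [op(set(ml), set(rb)) for ml, rb in zip(ml_pred, rb_pred)]

ml_pred = [{"T1", "T3"}, {"T3"}, set()]
rb_pred = [{"T3"}, {"T2", "T3"}, {"T4"}]
print(integrate(ml_pred, rb_pred))           # [{'T3'}, {'T3'}, set()]
print(integrate(ml_pred, rb_pred, "union"))  # [{'T1','T3'}, {'T2','T3'}, {'T4'}]
```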

As another way of integrating the two approaches, we use the output of the RB system for all of the data released in the scope of the SMERP shared task as training data for the ML approach.30 The distinguishing aspect of this integration approach is that the training data creation step does not involve using gold standard annotated data. The output of the RB system consists of 4,250 tweets, which carry 309, 211, 3,698, and 255 labels for T1, T2, T3, and T4 respectively.31

Table 4.14: Precision, recall, and F1-score of the ML approach trained on the RB output, evaluated on the SMERP and FIRE data sets.

              SMERP                               FIRE
           precision  recall  F1    support    precision  recall  F1    support
T1         .45        .27     .34   175        .61        .45     .52   572
T2         .40        .26     .31   108        .53        .24     .33   291
T3         .91        .99     .95   859        .48        .89     .63   247
T4         .85        .50     .63   176        .41        .02     .05   371
Avg/Total  .80        .77     .77   1,318      .52        .38     .38   1,481

Table 4.14 presents the results of the experiment that uses the RB output as training data for the ML approach. Compared to Table 4.9, this table shows that the RB output can be improved using the ML approach. Using the RB output as training data for the ML approach yielded an improvement over the RB approach of 8 F-score points on the SMERP data, from 69 to 77, and of 5 F-score points on the FIRE data, from 33 to 38. Most of this gain was obtained from an increase in recall. Another benefit is the precision increase for T1 and T2 on the SMERP data, from .32 to .45 and from .38 to .40 respectively.

30Although we used only the annotated part of the SMERP data in our ML experiments, RB enables access to the part of the released data that was not annotated as well.
31There are more labels in total than the number of tweets due to the assignment of multiple topics to some tweets by RB.

As a final attempt at combining the two approaches, we aimed to improve the output of the RB approach by excluding the tweets for which the ML classifier created using the gold standard annotated data did not predict at least one of the same labels. As a result of this filtering, the final RB output data contained 113, 103, 3,519, and 144 labels for the topics T1, T2, T3, and T4 respectively. The performance of this hybrid method on FIRE is reported in Table 4.15.

Table 4.15: Performance of the hybrid system where RB output is filtered using ML predictions before being used as training data for ML.

            precision   recall   F1    support
T1          .49         .16      .24   572
T2          .50         .21      .29   291
T3          .27         .98      .43   247
T4          .53         .03      .05   371
Avg/Total   .46         .27      .23   1,481

The integration using filtered RB output as training data did not yield any significant improvement in comparison to using the raw RB output. Therefore, we designed and performed the following experiment.

We observed that half of the instances of the minority topics were eliminated by the filtering applied in the aforementioned experiment. Given the observation that the classifier yields poor performance on the topics that are minorities in the training data (T1, T2, and T4), we filtered out from the RB result only the T3-predicted tweets that are not predicted as T3 by the ML approach. Then we used this selectively filtered RB output as training data for the ML approach. The final training set consists of 309, 211, 3,555, and 255 labels for T1, T2, T3, and T4 respectively. The results of this attempt are presented in Table 4.16.

Table 4.16: Performance of the ML system trained on the selectively filtered RB output, evaluated on the FIRE data set.

            precision   recall   F1    support
T1          .61         .44      .51   572
T2          .52         .24      .33   291
T3          .54         .90      .68   247
T4          .43         .03      .05   371
Avg/Total   .54         .37      .39   1,481

This last attempt yielded significantly higher scores on the FIRE data than eliminating non-matching predictions for all topics.
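
One reading of this selective filtering step, consistent with the reported label counts (only the T3 count changes), is sketched below; `rb_pred` and `ml_pred` are illustrative names for per-tweet label sets, not from the thesis.

```python
# Drop the majority-topic label (T3) from a tweet's RB prediction only when
# the gold-standard-trained ML classifier does not also predict T3 for that
# tweet; minority-topic labels are left untouched.
def filter_majority(rb_pred, ml_pred, majority="T3"):
    filtered = []
    for rb, ml in zip(rb_pred, ml_pred):
        if majority in rb and majority not in ml:
            rb = rb - {majority}  # remove only the majority-topic label
        filtered.append(rb)
    return filtered
```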

4.3.8 Discussion

The ML approach proved to be relatively successful in predicting the training data in general and the majority topic in the test data. The RB approach was able to predict the minority topics better than ML. Since the RB approach had access only to the task instructions, without being able to observe the annotated data, we consider it remarkably successful. Each of the approaches was more successful than the other on a distinct part of the test data.

Our integration efforts yield critical clues about how to integrate these two distinct methods. In case annotated data is available, the intersection of the RB and ML predictions will yield higher precision and lower recall, while the union will yield lower precision and higher recall.

RB output can be used to train an ML classifier in case annotated data is not available. On the training data, such a classifier provides relatively better results than the raw RB output on the minority classes and slightly lower performance on the majority class. On the test data, this classifier yields lower precision and higher recall than the raw RB output.

The availability of annotated data makes it possible to use an ML classifier trained on it to filter the RB output before creating a classifier from that output. This approach yields a slightly better performance if the filtering is applied only to the majority topic in the annotated data.

4.3.9 Conclusion

We developed and tested machine learning and rule-based approaches for classifying tweets into four topics that are present at various ratios in the training and test sets. Which topics were in the majority and which in the minority also differed drastically between the training and test sets. The systems developed started from different conditions, approached the problem from different angles, and yielded different results on different parts of the test set. These observations directed us to design hybrid systems that integrate these approaches.

The standalone system results showed that each approach performs well on a distinct part of the test set. This observation was confirmed by the decrease in prediction precision for some topics when we used the intersection of the predictions from the two approaches as the final prediction. The union of the predictions yielded a balanced precision and recall, and a higher F1 score. The integration experiments yielded improved performance in case the output of RB is refined using the ML approach.

In case annotated data is available, the best scenario is building an ML system and observing the data to develop an RB system. Combining the predictions from these two systems will yield the highest precision in case their predictions are intersected, and the highest recall in case their predictions are unified. The output of ML is the best for the majority topics in the training set, whereas the RB system performs better on the minority topics for both the training and test sets.

In case annotated data is not available, using the RB output as training data for building an ML-based classifier will yield the best results on the training data in terms of F1 score. However, compared to the raw RB output, the performance of this classifier on the test data will be better only in terms of recall.

4.4 Conclusion

Our work in the scope of this chapter showed that microtext classification can facilitate the detection of actionable insights from tweet collections in disaster scenarios, but both machine-learning and rule-based approaches show weaknesses. We determined how the performance of these approaches depends on the starting conditions in terms of available resources. As a result, our conclusion is that the availability of annotated data and of domain and linguistic expertise together determine the path that should be followed and the performance that can be obtained in a use case.

Our participation in a shared task, in which our system was ranked best in the second round, shed light on the performance of our approaches in relation to the other participating teams. Moreover, it provided us with some insight into what can potentially be feasible and valuable in tackling the given task. The main conclusion is that extracting information about minority topics is the key challenge in this kind of task, and that it should be tackled through a combination of machine learning and rule-based approaches.

Chapter 5

Conclusions

In this thesis we reported on a number of studies that we conducted with the aim of defining, detecting, and extracting actionable information from social media, more specifically from microtext collections. Each study provided insights into and proposed novel solutions for challenges in this field. As a result, we developed a comprehensive set of methodologies and software tools to tackle these challenges.

In this last chapter, we iterate over the research questions and the problem statement that formed a basis for our research. The final subsections summarize the contributions of our studies and, extrapolating from our current findings, outline the direction that future research may take.

5.1 Answers to Research Questions

The first set of studies, which are reported in Chapter 2, centered around research question R1, which we repeat below:

RQ 1: To what extent can we detect patterns of content evolution in microtexts in order to generate time-to-event estimates from them?

The event time affects the nature and the quantity of the information posted about that event on social media. We have observed a particular occurrence pattern of microtext content that informs about the starting time of social events, i.e. events that are anticipated and attended by groups of (possibly many) people and are hence referred to a lot on social media. Details related to the time of the event, preparations for it, and excitement toward it occur extensively in microtext content as the event time approaches. We reported the characteristics of this content evolution and the method we developed to utilize these characteristics for estimating the time of an upcoming football match or concert.

Our experiments were designed to measure the effectiveness and usefulness of various microtext aggregation, feature extraction and selection, and estimate generation techniques. We first treated all microtexts that were posted within a period of one hour as one document and applied linear and local regression using bag-of-words (BoW) features. Then, to measure the TTE estimation performance per tweet, we focused on temporal expressions, word skipgrams, and manually defined features pertaining to the interpretation of temporal expressions for feature extraction, and on the mean and median as feature value assignment and estimate generation functions. Based on the results of these preliminary studies, we proposed a method that generates a time-to-event estimate for each tweet and combines these estimates with previous estimates, using rule-based and skipgram-based features in a certain order.

The final method is able to estimate the time to event quite successfully in in-domain and cross-domain training and test settings, being under 4 and 10 hours off respectively.

The studies related to the second research question (R2),

RQ 2: How can we integrate domain expert knowledge and machine learning to identify relevant microtexts in a particular microtext collection?

had the overarching aim of extracting relevant information from microtext sets. Since relevance is a function of the data, an expert, and a target task, we focused on a semi-automatic solution that combines expert knowledge and automatic procedures.

Our research confirmed the linguistic redundancy reported by Zanzotto et al. (2011), who ascertain that 30% of tweets entail already posted information. Indeed, we observed that in many tweet collections, e.g. those collected on the basis of keywords or hashtags, a substantial amount of tweets in the collection contain repetitive information. Thus, we developed a method to collect data effectively by taking inflections into account, to eliminate exact and near-duplicates, to extract a representative sample of the microtext collection in terms of clusters, to label these clusters efficiently, and to use this annotation for training a classifier that can label new microtexts automatically. We tested and improved this method on use cases in the genocide, earthquake, and flu domains. Consequently, the classifiers that can be created using this methodology are able to distinguish between relevant and irrelevant microtexts, and then to classify relevant microtexts into further fine-grained topics relevant to the target of the study. We reported estimated accuracies of between 0.67 and 0.95 F-score on new data.

Selected microtexts contain a lot of information that can be extracted for a detailed overview of an event. Therefore, we investigated the extent to which we can extract this detailed information in the scope of the third research question (R3) repeated below:

RQ 3: To what extent are rule-based and ML-based approaches complementary for classifying microtexts based on small datasets?

We compared a rule-based and a machine learning method for classifying microtexts into four topic classes. We reported on the performance of our approaches applied to detailed microtext classification, a relatively underexplored area of research. The results suggested that rule-based and machine-learning-based approaches perform well on different parts of microtext collections. Therefore, combining their predictions tends to yield higher performance scores than applying them stand-alone. For instance, applying ML to the output of the rule-based system improves the recall for all topics and the precision for the topic of interest, which was a minority topic in our case study. Moreover, filtering the rule-based system output for the majority topics using the stand-alone ML system enhances the performance that can be obtained from the rule-based system. The robustness of our results was tested by designing our experiments using noisy, imbalanced, and scarce data.

Our studies related to each of the three research questions yielded approaches that estimate time to event, detect relevant documents, and classify relevant documents into topical classes. The performance of these approaches was evaluated in various case studies and improved in multiple iterations based on the observations and results of previous iterations. The major improvements derived in this manner were priority-based historical context integration for time-to-event estimation, the use of the temporal distribution in clustering the microtext collections for relevant document detection, and the integration of rule-based and machine learning based approaches for relevant information classification. For instance, from our experiments we conclude that rule-based and machine-learning-based approaches are indeed complementary, and their complementarity can be exploited to improve precision, recall, or both. Which combination method works best (e.g. taking the intersection or union, or including the predictions of one route in the other) remains an empirical question. The topics/domains of the use cases were football matches and music concerts for time-to-event estimation; flood, genocide, flu, and earthquake for relevant microtext detection; and earthquake for the microtext classification methods.

5.2 Answer to Problem Statement

PS: How can we develop an efficient automatic system that, with a high degree of precision and completeness, can identify actionable information about major events in a timely manner from microblogs while taking into account microtext and social media characteristics?

Every experiment performed in the scope of this dissertation provided insights into what the next step should be along the line specified in our problem statement. Identifying the characteristics of microtext collections about events and utilizing this information to tackle the challenge was at the core of our efforts. Full automation of the process has been the ultimate aim of our work. Therefore, the first parts of Chapters 2 and 3 focused only on automatic approaches. The results from Chapter 3 showed us that microtext collections tend to contain substantial amounts of irrelevant posts, and that we require the input of experts to distinguish relevant from irrelevant data. Consequently, we directed our efforts toward studying the incorporation of users’ insight in describing what is relevant to their use case and transferring this description to a machine learning model. This attempt yielded a machine learning based methodology, Relevancer, that performs well at detecting irrelevant information. However, detecting information about relevant topics that are relatively less frequent than the majority topics in a collection was only possible to a limited degree with this approach. Fine-grained topics may not be represented as clusters well enough to be labeled and used by supervised machine learning techniques. Consequently, we integrated this approach with a linguistically motivated rule-based approach and obtained a robust system that alleviated this minority-class problem.

We tested parts of the system on collections from various domains, such as flood, genocide, flu, and earthquake, and we applied the hybrid rule-based and ML-based system to earthquake disaster data. The cross-domain and cross-event success of the time-to-event estimation method, which was trained on football events and tested on music events, and of the information classification, which was trained on earthquake data from Italy and tested on earthquake data from Nepal, shows that our system is robust enough to handle real-world variation.

5.3 Thesis Contributions

In sum, we have made the following contributions to the growing field of event information extraction from social media:

1. We developed a feature extraction and time-to-event estimation method for predicting the start time of social events. This methodology is tested on microtexts in in-domain and cross-domain scenarios, for which the estimations are under 4 and 10 hours off on average respectively.

2. The results of our experiments show that counteracting the negative effects of outlier microtexts when estimating time to event requires using the median as training function, integrating a history of estimates, and using the mean absolute error as a performance measure.

3. Our studies provided insight into the degree of redundancy of the content on social media. Machine learning techniques were used to support experts in filtering irrelevant data and zooming in on what is relevant for the task at hand.

4. We showed that exploiting the temporal sub-structure of microtext distributions over time decreases the time spent on clustering and improves the chance of obtaining a representative cluster set.

5. We suggested a method to integrate rule-based and machine learning oriented approaches to alleviate annotated data scarcity and class imbalance issues.

5.4 Outlook

Although the approaches we developed deliver reasonable to sufficient performance, there still are areas for continued development. We have identified lines of research that have the potential to improve the approach we developed in the scope of this dissertation. We iterate over these ideas in this subsection.

Our time-to-event estimation method has a number of logical extensions in terms of the data used, the features, the applied operations, and the applicability to different domains.

TTE estimation should remain reliable on event data that is noisier than the data collected by using a single hashtag – this could be tested using the output of our relevant microtext detection or of automatic event detection methods. Moreover, the method should also be applied to data from different social media sources, e.g. Facebook or internet forum posts, to investigate its robustness toward source type. The TTE method was evaluated against baselines we determined ourselves. However, this method should be compared to other proposed time series methods as well (Hmamouche, Przymus, Alouaoui, Casali, & Lakhal, 2019).

Further analysis and improvement of the word-skipgram-based features have the potential to make the time-to-event estimation method more flexible by decreasing the need for the temporal expression extraction and rule generation steps. In case temporal expressions are used, determining the relevance of each temporal expression when several such expressions occur in a single message would be of considerable importance as well.

Social media data contain relatively many outlier instances that affect the performance of our methods drastically. Our results show that using the median as a value assignment and estimation function, and adjusting estimations with a window of preceding estimations, remedy this issue to some extent. Further research toward understanding and handling these outliers has the potential to improve the robustness of our approach. The number of posted tweets is a significant indicator of event time. We observed this phenomenon in our baselines. Based on this insight, we anticipate that analyzing changes in tweet frequencies relative to an event may support the feature selection and TTE estimation phases.

Finally, the time-to-event estimation method could be extended by moving from football to other scheduled events, and from scheduled events to unscheduled events, the ultimate goal of a forecasting system like this. This extension can be achieved by applying event detection and classification systems as proposed by Kunneman and van den Bosch (2014) and Van Noord et al. (2017) respectively.

Improving the feature extraction by including skip-grams, word embeddings (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013), and other types of information present on the social media platform (e.g. features that characterize the user, such as the number of followers and personal descriptions), and introducing more sophisticated clustering (Fahad et al., 2014) and cluster evaluation metrics (Lee, Lee, & Lee, 2012), have the potential to improve the clustering step of the Relevancer approach. Moreover, incorporating feedback from the experts about which users’ posts or hashtags should best be ignored or included will improve the labeling process. This information can be used to update and continuously evaluate the clusters and the classifiers. The posting user and the included hashtags have the potential to provide information about certain information threads. This information can enable the annotation step to be expanded to tweets that are not in any cluster but contain thread-specific hashtags or users.

We do not carry out any post-processing on the clusters in the Relevancer approach. However, identifying outlier samples in a cluster in terms of their distance to the cluster center, in order to refine the clusters, could enhance the coherence of the resulting clusters. This could enable Relevancer to detect cleaner microtext clusters in fewer iterations. Application of the method proposed by Henelius et al. (2016) could improve the quality of the clusters as well.

In general, machine learning approaches miss relatively ‘small’ topics. The clustering and classification steps should be improved to yield coherent clusters for small topics and to utilize the information about small topics in the automatic classification respectively. The results of the rule-based approach demonstrated its potential in remedying this issue. Further exploration of how best to make use of the rule-based approach should benefit the use of machine learning both for clustering and for classification.

Microtexts may be intended to mislead the public by spreading false information or fake news. Our methodology attempts to limit the effect of such microtexts by excluding duplicate microtexts and by presenting information in an aggregated manner. The effectiveness of our approach in this respect should be analyzed in detail. Moreover, credibility analysis of the microtexts and of the users who post them should become a standard step in microtext collection analysis studies.
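
A minimal sketch of the duplicate-exclusion step; the normalization rules (lowercasing, stripping URLs, mentions, and hashtags) are illustrative assumptions rather than the exact procedure used in this thesis.

    import re

    def normalize(text):
        # Reduce a tweet to a canonical form so that retweets and
        # near-identical copies collide on the same key.
        text = text.lower()
        text = re.sub(r"https?://\S+", "", text)   # strip URLs
        text = re.sub(r"[@#]\w+", "", text)        # strip mentions/hashtags
        return re.sub(r"\s+", " ", text).strip()

    def deduplicate(tweets):
        seen, unique = set(), []
        for t in tweets:
            key = normalize(t)
            if key not in seen:
                seen.add(key)
                unique.append(t)
        return unique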

Fully automating microtext analysis has been our goal since the first day of this research project. Our efforts in this direction informed us about the extent to which this automation can be realized. In most cases we first developed an automated approach, and then extended and improved it by integrating human intervention at various steps. Our experience confirms previous work stating that well-designed human intervention or contribution in the design, realization, or evaluation of an information system either improves its performance or enables its realization. As our studies and results directed us toward its necessity and value, we were inspired by previous studies in designing human involvement and customized our approaches to benefit from human input. Consequently, our contribution to the existing body of research in this line has become the confirmation of the value of human intervention in extracting actionable information from microtexts.

References

Agichtein, E., Castillo, C., Donato, D., Gionis, A., & Mishne, G. (2008). Finding high-quality content in social media. In Proceedings of the 2008 International Conference on Web Search and Data Mining (pp. 183–194). New York, NY, USA: ACM. Retrieved from http://doi.acm.org/10.1145/1341531.1341557 doi: 10.1145/1341531.1341557
Allaire, M. C. (2016). Disaster loss and social media: Can online information increase flood resilience? Water Resources Research, 52(9), 7408–7423. Retrieved from http://dx.doi.org/10.1002/2016WR019243 doi: 10.1002/2016WR019243
Atkeson, C., Moore, A., & Schaal, S. (1997). Locally weighted learning. Artificial Intelligence Review, 11(1–5), 11–73.
Baeza-Yates, R. (2005). Searching the future. In ACM SIGIR Workshop on Mathematical/Formal Methods for Information Retrieval (MF/IR 2005).
Baldwin, T., Cook, P., Lui, M., MacKinlay, A., & Wang, L. (2013). How noisy social media text, how diffrnt social media sources. In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP 2013) (pp. 356–364).
Batista, G. E. A. P. A., Prati, R. C., & Monard, M. C. (2004, June). A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations Newsletter, 6(1), 20–29. Retrieved from http://doi.acm.org/10.1145/1007730.1007735 doi: 10.1145/1007730.1007735
Becker, H., Iter, D., Naaman, M., & Gravano, L. (2012). Identifying content for planned events across social media sites. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (pp. 533–542). New York, NY, USA: ACM. Retrieved from http://doi.acm.org/10.1145/2124295.2124360 doi: 10.1145/2124295.2124360
Blamey, B., Crick, T., & Oatley, G. (2013). 'The First Day of Summer': Parsing temporal expressions with distributed semantics. In M. Bramer & M. Petridis (Eds.), Research and Development in Intelligent Systems XXX (pp. 389–402). Springer International Publishing. Retrieved from http://dx.doi.org/10.1007/978-3-319-02621-3_29 doi: 10.1007/978-3-319-02621-3_29
Borra, E., & Rieder, B. (2014). Programmed method: Developing a toolset for capturing and analyzing tweets. Aslib Journal of Information Management, 66(3), 262–278. Retrieved from http://dx.doi.org/10.1108/AJIM-09-2013-0094 doi: 10.1108/AJIM-09-2013-0094
Borthwick, A., Sterling, J., Agichtein, E., & Grishman, R. (1998). Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In Proceedings of the Sixth Workshop on Very Large Corpora (pp. 152–160). Retrieved from http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.8357
Briscoe, E., Appling, S., & Schlosser, J. (2015). Passive crowd sourcing for technology prediction. In N. Agarwal, K. Xu, & N. Osgood (Eds.), Social Computing, Behavioral-Cultural Modeling, and Prediction (Vol. 9021, pp. 264–269). Springer International Publishing. Retrieved from http://dx.doi.org/10.1007/978-3-319-16268-3_28 doi: 10.1007/978-3-319-16268-3_28
Casati, R., & Varzi, A. (2015). Events. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Winter 2015 ed.). Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/win2015/entries/events/
Chang, A. X., & Manning, C. (2012, May). SUTime: A library for recognizing and normalizing time expressions. In N. Calzolari (Conference Chair) et al. (Eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). Istanbul, Turkey: European Language Resources Association (ELRA). Retrieved from http://www.lrec-conf.org/proceedings/lrec2012/summaries/284.html
Chau, D. H. (2012). Data mining meets HCI: Making sense of large graphs (Tech. Rep.). DTIC Document. Retrieved from http://repository.cmu.edu/dissertations/94/
Chaudhuri, B. B., & Bhattacharya, S. (2008). An experiment on automatic detection of named entities in Bangla. In Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages, Hyderabad, India (pp. 75–81). Asian Federation of Natural Language Processing.
Cohen, M. J., Brink, G. J. M., Adang, O. M. J., Dijk, J. A. G. M., & Boeschoten, T. (2013). Twee werelden: You only live once (Tech. Rep.). The Hague, The Netherlands: Ministerie van Veiligheid en Justitie.
De Choudhury, M., Counts, S., & Czerwinski, M. (2011, July). Find me the right content! Diversity-based sampling of social media content for topic-centric search. International AAAI Conference on Weblogs and Social Media. Retrieved from https://www.microsoft.com/en-us/research/publication/find-me-the-right-content-diversity-based-sampling-of-social-media-content-for-topic-centric-search/
De Choudhury, M., Diakopoulos, N., & Naaman, M. (2012). Unfolding the event landscape on Twitter: Classification and exploration of user categories. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work (pp. 241–244). New York, NY, USA: ACM. Retrieved from http://doi.acm.org/10.1145/2145204.2145242 doi: 10.1145/2145204.2145242
Dias, G., Campos, R., & Jorge, A. (2011). Future retrieval: What does the future talk about? In Proceedings of the SIGIR 2011 Workshop on Enriching Information Retrieval (ENIR 2011).
Eikmeyer, H. J., & Rieser, H. (1981). Words, worlds, and contexts: New approaches in word semantics (Vol. 6). Walter de Gruyter GmbH & Co KG.
Ellen, J. (2011). All about microtext: A working definition and a survey of current microtext research within artificial intelligence and natural language processing. In Proceedings of the 3rd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART (pp. 329–336). doi: 10.5220/0003179903290336
Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Khalil, I., Zomaya, A. Y., ... Bouras, A. (2014, September). A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Transactions on Emerging Topics in Computing, 2(3), 267–279. doi: 10.1109/TETC.2014.2330519
Fellbaum, C. (1998). WordNet. Wiley Online Library.
Felt, M. (2016). Social media and the social sciences: How researchers employ Big Data analytics. Big Data & Society, 3(1), 1–15. Retrieved from http://bds.sagepub.com/lookup/doi/10.1177/2053951716645828 doi: 10.1177/2053951716645828
Gali, K., Surana, H., Vaidya, A., Shishtla, P., & Sharma, D. M. (2008). Aggregating machine learning and rule based heuristics for named entity recognition. In Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages, Hyderabad, India (pp. 25–32). Asian Federation of Natural Language Processing.
Gella, S., Cook, P., & Baldwin, T. (2014). One sense per tweeter... and other lexical semantic tales of Twitter. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (pp. 215–220). Retrieved from http://www.aclweb.org/anthology/E14-4042
Gella, S., Cook, P., & Han, B. (2013). Unsupervised word usage similarity in social media texts. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity (pp. 248–253). Retrieved from http://www.aclweb.org/anthology/S13-1036
Ghosh, S., & Ghosh, K. (2016). Overview of the FIRE 2016 Microblog track: Information extraction from microblogs posted during disasters. In Working Notes of FIRE 2016 (pp. 7–10). Retrieved from http://ceur-ws.org/Vol-1737/T2-1.pdf
Ghosh, S., Ghosh, K., Ganguly, D., Chakraborty, T., Jones, G. J., & Moens, M.-F. (2017). ECIR 2017 Workshop on Exploitation of Social Media for Emergency Relief and Preparedness (SMERP 2017). In ACM SIGIR Forum (Vol. 51, pp. 36–41). Retrieved from http://ceur-ws.org/Vol-1832/
Gupta, A., Lamba, H., Kumaraguru, P., & Joshi, A. (2013). Faking Sandy: Characterizing and identifying fake images on Twitter during Hurricane Sandy. In Proceedings of the 22nd International Conference on World Wide Web (pp. 729–736). New York, NY, USA: ACM. Retrieved from http://doi.acm.org/10.1145/2487788.2488033 doi: 10.1145/2487788.2488033
Henelius, A., Puolamäki, K., Boström, H., & Papapetrou, P. (2016). Clustering with confidence: Finding clusters with statistical guarantees. arXiv preprint arXiv:1612.08714. Retrieved from https://arxiv.org/abs/1612.08714
Hmamouche, Y., Przymus, P. M., Alouaoui, H., Casali, A., & Lakhal, L. (2019). Large multivariate time series forecasting: Survey on methods and scalability. In Utilizing Big Data Paradigms for Business Intelligence (pp. 170–197). IGI Global.
Hürriyetoğlu, A., Gudehus, C., Oostdijk, N., & van den Bosch, A. (2016). Relevancer: Finding and labeling relevant information in tweet collections. In E. Spiro & Y.-Y. Ahn (Eds.), Social Informatics: 8th International Conference, SocInfo 2016, Bellevue, WA, USA, November 11-14, 2016, Proceedings, Part II (pp. 210–224). Cham: Springer International Publishing. Retrieved from http://dx.doi.org/10.1007/978-3-319-47874-6_15 doi: 10.1007/978-3-319-47874-6_15
Hürriyetoğlu, A., Oostdijk, N., Erkan Başar, M., & van den Bosch, A. (2017). Supporting experts to handle tweet collections about significant events. In F. Frasincar, A. Ittoo, L. M. Nguyen, & E. Métais (Eds.), Natural Language Processing and Information Systems: 22nd International Conference on Applications of Natural Language to Information Systems, NLDB 2017, Liège, Belgium, June 21-23, 2017, Proceedings (pp. 138–141). Cham: Springer International Publishing. Retrieved from https://doi.org/10.1007/978-3-319-59569-6_14 doi: 10.1007/978-3-319-59569-6_14
Hürriyetoğlu, A., Oostdijk, N., & van den Bosch, A. (2018). Estimating time to event based on linguistic cues on Twitter. In K. Shaalan, A. E. Hassanien, & F. Tolba (Eds.), Intelligent Natural Language Processing: Trends and Applications (Vol. 740). Springer International Publishing. Retrieved from http://www.springer.com/cn/book/9783319670553
Hürriyetoğlu, A., Kunneman, F., & van den Bosch, A. (2013). Estimating the time between Twitter messages and future events. In Proceedings of the 13th Dutch-Belgian Workshop on Information Retrieval (pp. 20–23). Retrieved from http://ceur-ws.org/Vol-986/paper_23.pdf
Hürriyetoğlu, A., & Oostdijk, N. (2017, April). Extracting humanitarian information from tweets. In Proceedings of the First International Workshop on Exploitation of Social Media for Emergency Relief and Preparedness. Aberdeen, United Kingdom. Retrieved from http://ceur-ws.org/Vol-1832/SMERP-2017-DC-RU-Retrieval.pdf
Hürriyetoğlu, A., Oostdijk, N., & van den Bosch, A. (2014, April). Estimating time to event from tweets using temporal expressions. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM) (pp. 8–16). Gothenburg, Sweden: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/W14-1302
Hürriyetoğlu, A., van den Bosch, A., & Oostdijk, N. (2016b, December). Using Relevancer to detect relevant tweets: The Nepal earthquake case. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation. Kolkata, India. Retrieved from http://ceur-ws.org/Vol-1737/T2-6.pdf
Hürriyetoğlu, A., van den Bosch, J. W. A., & Oostdijk, N. (2016a, September). Analysing the role of key term inflections in knowledge discovery on Twitter. In Proceedings of the 1st International Workshop on Knowledge Discovery on the WEB. Cagliari, Italy. Retrieved from http://www.iascgroup.it/kdweb2016-program/accepted-papers.html
Imran, M., Castillo, C., Diaz, F., & Vieweg, S. (2015, June). Processing social media messages in mass emergency: A survey. ACM Computing Surveys (CSUR), 47(4), 67:1–67:38. Retrieved from http://doi.acm.org/10.1145/2771588 doi: 10.1145/2771588
Jatowt, A., & Au Yeung, C.-M. (2011). Extracting collective expectations about the future from large text collections. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (pp. 1259–1264). New York, NY, USA: ACM. Retrieved from http://doi.acm.org/10.1145/2063576.2063759 doi: 10.1145/2063576.2063759
Jatowt, A., Au Yeung, C.-M., & Tanaka, K. (2013). Estimating document focus time. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (pp. 2273–2278). New York, NY, USA: ACM. Retrieved from http://doi.acm.org/10.1145/2505515.2505655 doi: 10.1145/2505515.2505655
Java, A., Song, X., Finin, T., & Tseng, B. (2007). Why we twitter: Understanding microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis (pp. 56–65). New York, NY, USA: ACM. Retrieved from http://doi.acm.org/10.1145/1348549.1348556 doi: 10.1145/1348549.1348556
Kallus, N. (2014). Predicting crowd behavior with big public data. In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web Companion (pp. 625–630). Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee. Retrieved from http://dx.doi.org/10.1145/2567948.2579233 doi: 10.1145/2567948.2579233
Kanhabua, N., Romano, S., & Stewart, A. (2012). Identifying relevant temporal expressions for real-world events. In Proceedings of the SIGIR 2012 Workshop on Time-aware Information Access. Portland, OR.
Kavanaugh, A., Tedesco, J. C., & Madondo, K. (2014). Social media vs. traditional internet use for community involvement: Toward broadening participation. In E. Tambouris, A. Macintosh, & F. Bannister (Eds.), Electronic Participation: 6th IFIP WG 8.5 International Conference, ePart 2014, Dublin, Ireland, September 2-3, 2014. Proceedings (pp. 1–12). Berlin, Heidelberg: Springer Berlin Heidelberg. Retrieved from http://dx.doi.org/10.1007/978-3-662-44914-1_1 doi: 10.1007/978-3-662-44914-1_1
Kawai, H., Jatowt, A., Tanaka, K., Kunieda, K., & Yamada, K. (2010). ChronoSeeker: Search engine for future and past events. In Proceedings of the 4th International Conference on Ubiquitous Information Management and Communication (pp. 25:1–25:10). New York, NY, USA: ACM. Retrieved from http://doi.acm.org/10.1145/2108616.2108647 doi: 10.1145/2108616.2108647
Khoury, R., Khoury, R., & Hamou-Lhadj, A. (2014). Microtext processing. In R. Alhajj & J. Rokne (Eds.), Encyclopedia of Social Network Analysis and Mining (pp. 894–904). New York, NY: Springer New York. Retrieved from http://dx.doi.org/10.1007/978-1-4614-6170-8_353 doi: 10.1007/978-1-4614-6170-8_353
Kleinberg, J. (2006). Temporal dynamics of on-line information streams. In Data Stream Management: Processing High-speed Data. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.62.600
Krumm, J., Davies, N., & Narayanaswami, C. (2008, October). User-generated content. IEEE Pervasive Computing, 7(4), 10–11. doi: 10.1109/MPRV.2008.85
Kunneman, F., & van den Bosch, A. (2012). Leveraging unscheduled event prediction through mining scheduled event tweets. In N. Roos, M. Winands, & J. Uiterwijk (Eds.), Proceedings of the 24th Benelux Conference on Artificial Intelligence (pp. 147–154). Maastricht, The Netherlands.
Kunneman, F., & van den Bosch, A. (2014). Event detection in Twitter: A machine-learning approach based on term pivoting. In F. Grootjen, M. Otworowska, & J. Kwisthout (Eds.), Proceedings of the 26th Benelux Conference on Artificial Intelligence (pp. 65–72).
Kwak, H., Lee, C., Park, H., & Moon, S. (2010). What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web (pp. 591–600). New York, NY, USA: ACM. Retrieved from http://doi.acm.org/10.1145/1772690.1772751 doi: 10.1145/1772690.1772751
Lau, J. H., Cook, P., McCarthy, D., Newman, D., & Baldwin, T. (2012). Word sense induction for novel sense detection. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012) (pp. 591–601).
Lee, H., Surdeanu, M., MacCartney, B., & Jurafsky, D. (2014, May). On the importance of text analysis for stock price prediction. In N. Calzolari (Conference Chair) et al. (Eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Reykjavik, Iceland: European Language Resources Association (ELRA). Retrieved from http://www.lrec-conf.org/proceedings/lrec2014/summaries/1065.html
Lee, K. M., Lee, K. M., & Lee, C. H. (2012, November). Statistical cluster validity indexes to consider cohesion and separation. In 2012 International Conference on Fuzzy Theory and Its Applications (iFUZZY2012) (pp. 228–232). doi: 10.1109/iFUZZY.2012.6409706
Lkhagvasuren, G., Gonçalves, T., & Saias, J. (2016). Semi-automatic keyword based approach for FIRE 2016 Microblog Track. In FIRE (Working Notes) (pp. 91–93). Retrieved from http://ceur-ws.org/Vol-1737/T2-11.pdf
Mani, I., & Wilson, G. (2000). Robust temporal processing of news. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics (pp. 69–76). Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from http://dx.doi.org/10.3115/1075218.1075228 doi: 10.3115/1075218.1075228
McCarthy, D., Apidianaki, M., & Erk, K. (2016). Word sense clustering and clusterability. Computational Linguistics, 42(2), 245–275.
Melville, P., Gryc, W., & Lawrence, R. D. (2009). Sentiment analysis of blogs by combining lexical knowledge with text classification. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1275–1284). New York, NY, USA: ACM. Retrieved from http://doi.acm.org/10.1145/1557019.1557156 doi: 10.1145/1557019.1557156
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (pp. 3111–3119). USA: Curran Associates Inc. Retrieved from http://dl.acm.org/citation.cfm?id=2999792.2999959
Morency, P. (2006). When temporal expressions don't tell time: A pragmatic approach to temporality, argumentation and subjectivity. Retrieved from https://www2.unine.ch/files/content/sites/cognition/files/shared/documents/patrickmorency-thesisproject.pdf
Muthiah, S. (2014). Forecasting protests by detecting future time mentions in news and social media (Master's thesis, Virginia Polytechnic Institute and State University). Retrieved from http://vtechworks.lib.vt.edu/handle/10919/25430
Nakajima, Y., Ptaszynski, M., Honma, H., & Masui, F. (2014). Investigation of future reference expressions in trend information. In 2014 AAAI Spring Symposium Series (pp. 32–38). Retrieved from http://www.aaai.org/ocs/index.php/SSS/SSS14/paper/view/7691
Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys (CSUR), 41(2), 10.
Nguyen-Son, H.-Q., Hoang, A.-T., Tran, M.-T., Yoshiura, H., Sonehara, N., & Echizen, I. (2014). Anonymizing temporal phrases in natural language text to be posted on social networking services. In Y. Q. Shi, H.-J. Kim, & F. Pérez-González (Eds.), Digital-Forensics and Watermarking (pp. 437–451). Springer Berlin Heidelberg. Retrieved from http://dx.doi.org/10.1007/978-3-662-43886-2_31 doi: 10.1007/978-3-662-43886-2_31
Noce, L., Zamberletti, A., Gallo, I., Piccoli, G., & Rodriguez, J. (2014). Automatic prediction of future business conditions. In A. Przepiórkowski & M. Ogrodniczuk (Eds.), Advances in Natural Language Processing (Vol. 8686, pp. 371–383). Springer International Publishing. Retrieved from http://dx.doi.org/10.1007/978-3-319-10888-9_37 doi: 10.1007/978-3-319-10888-9_37
Noro, T., Inui, T., Takamura, H., & Okumura, M. (2006). Time period identification of events in text. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (pp. 1153–1160). Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from http://dx.doi.org/10.3115/1220175.1220320 doi: 10.3115/1220175.1220320
Olteanu, A., Castillo, C., Diaz, F., & Kiciman, E. (2016). Social data: Biases, methodological pitfalls, and ethical boundaries. SSRN. Retrieved from https://ssrn.com/abstract=2886526
Olteanu, A., Castillo, C., Diaz, F., & Vieweg, S. (2014). CrisisLex: A lexicon for collecting and filtering microblogged communications in crises. In Proceedings of the 8th International Conference on Weblogs and Social Media, ICWSM 2014 (pp. 376–385). The AAAI Press.
Oostdijk, N., Hürriyetoğlu, A., Puts, M., Daas, P., & van den Bosch, A. (2016). Information extraction from social media: A linguistically motivated approach. PARIS Inalco du 4 au 8 juillet 2016, 10, 21–33.
Oostdijk, N., & van Halteren, H. (2013a). N-gram-based recognition of threatening tweets. In A. Gelbukh (Ed.), CICLing 2013, Part II, LNCS 7817 (pp. 183–196). Berlin, Heidelberg: Springer Verlag.
Oostdijk, N., & van Halteren, H. (2013b). Shallow parsing for recognizing threats in Dutch tweets. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2013, Niagara Falls, Canada, August 25-28, 2013) (pp. 1034–1041).
Ozdikis, O., Senkul, P., & Oguztuzun, H. (2012). Semantic expansion of hashtags for enhanced event detection in Twitter. In Proceedings of the 1st International Workshop on Online Social Systems.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10 (pp. 79–86). Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from https://doi.org/10.3115/1118693.1118704 doi: 10.3115/1118693.1118704
Pedersen, T. (2006). Unsupervised corpus-based methods for WSD. In Word Sense Disambiguation (pp. 133–166).
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Potts, L., Seitzinger, J., Jones, D., & Harrison, A. (2011). Tweeting disaster: Hashtag constructions and collisions. In Proceedings of the 29th ACM International Conference on Design of Communication (pp. 235–240). New York, NY, USA: ACM. Retrieved from http://doi.acm.org/10.1145/2038476.2038522 doi: 10.1145/2038476.2038522
Praveen, K. P., & Ravi Kiran, V. (2008). A hybrid named entity recognition system for South Asian languages. In Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages, Hyderabad, India (pp. 83–88). Asian Federation of Natural Language Processing. Retrieved from https://www.aclweb.org/anthology/I08-5012
Radinsky, K., Davidovich, S., & Markovitch, S. (2012). Learning causality for news events prediction. In Proceedings of the 21st International Conference on World Wide Web (pp. 909–918). New York, NY, USA: ACM. Retrieved from http://dx.doi.org/10.1145/2187836.2187958 doi: 10.1145/2187836.2187958
Ramakrishnan, N., Butler, P., Muthiah, S., Self, N., Khandpur, R., Saraf, P., ... Mares, D. (2014). 'Beating the news' with EMBERS: Forecasting civil unrest using open source indicators. CoRR, abs/1402.7035.
Ratner, A., Bach, S. H., Ehrenberg, H., Fries, J., Wu, S., & Ré, C. (2018). Snorkel: Rapid training data creation with weak supervision. Proceedings of the VLDB Endowment. Retrieved from http://www.vldb.org/pvldb/vol11/p269-ratner.pdf
Redd, A., Carter, M., Divita, G., Shen, S., Palmer, M., Samore, M., & Gundlapalli, A. V. (2013). Detecting earlier indicators of homelessness in the free text of medical records. Studies in Health Technology and Informatics, 202, 153–156.
Ritter, A., Mausam, Etzioni, O., & Clark, S. (2012). Open domain event extraction from Twitter. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1104–1112). New York, NY, USA: ACM. Retrieved from http://dx.doi.org/10.1145/2339530.2339704 doi: 10.1145/2339530.2339704
Saha, S. K., Chatterji, S., Dandapat, S., Sarkar, S., & Mitra, P. (2008). A hybrid approach for named entity recognition in Indian languages. In Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages, Hyderabad, India (pp. 17–24). Asian Federation of Natural Language Processing.
Srihari, R., Niu, C., & Li, W. (2000). A hybrid approach for named entity and sub-type tagging. In Proceedings of the Sixth Conference on Applied Natural Language Processing (pp. 247–254). Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from https://doi.org/10.3115/974147.974181 doi: 10.3115/974147.974181
Strötgen, J., & Gertz, M. (2013, June). Multilingual and cross-domain temporal tagging. Language Resources and Evaluation, 47(2), 269–298. Retrieved from http://dx.doi.org/10.1007/s10579-012-9179-y doi: 10.1007/s10579-012-9179-y
Sutton, J., Palen, L., & Shklovski, I. (2008). Backchannels on the front lines: Emergent uses of social media in the 2007 Southern California wildfires. In Proceedings of the 5th International ISCRAM Conference (pp. 624–632).
Tanguy, L., Tulechki, N., Urieli, A., Hermann, E., & Raynal, C. (2016). Natural language processing for aviation safety reports: From classification to interactive analysis. Computers in Industry, 78, 80–95. Retrieved from http://www.sciencedirect.com/science/article/pii/S0166361515300464 doi: 10.1016/j.compind.2015.09.005
Tjong Kim Sang, E., & van den Bosch, A. (2013, December). Dealing with big data: The case of Twitter. Computational Linguistics in the Netherlands Journal, 3, 121–134. Retrieved from http://www.clinjournal.org/sites/clinjournal.org/files/08-TjongKimSang-vandenBosch-CLIN2013.pdf
Tops, H., van den Bosch, A., & Kunneman, F. (2013). Predicting time-to-event from Twitter messages. In BNAIC 2013: The 24th Benelux Conference on Artificial Intelligence (pp. 207–214).
Tufekci, Z. (2014). Big questions for social media big data: Representativeness, validity and other methodological pitfalls. In E. Adar, P. Resnick, M. D. Choudhury, B. Hogan, & A. Oh (Eds.), Proceedings of the Eighth International Conference on Weblogs and Social Media, ICWSM 2014, Ann Arbor, Michigan, USA, June 1-4, 2014. The AAAI Press. Retrieved from http://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8062
Vallor, S. (2016). Social networking and ethics. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Winter 2016 ed.). Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/win2016/entries/ethics-social-networking/
van den Hoven, J., Blaauw, M., Pieters, W., & Warnier, M. (2016). Privacy and information technology. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Spring 2016 ed.). Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/spr2016/entries/it-privacy/
van Noord, R., Kunneman, F. A., & van den Bosch, A. (2017). Predicting civil unrest by categorizing Dutch Twitter events. In T. Bosse & B. Bredeweg (Eds.), BNAIC 2016: Artificial Intelligence (pp. 3–16). Cham: Springer International Publishing.
Vieweg, S., Hughes, A. L., Starbird, K., & Palen, L. (2010). Microblogging during two natural hazards events: What Twitter may contribute to situational awareness. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 1079–1088). New York, NY, USA: ACM. Retrieved from http://doi.acm.org/10.1145/1753326.1753486 doi: 10.1145/1753326.1753486
Wang, S., & Manning, C. D. (2012). Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2 (pp. 90–94). Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from http://dl.acm.org/citation.cfm?id=2390665.2390688
Wang, X., Tokarchuk, L., Cuadrado, F., & Poslad, S. (2013). Exploiting hashtags for adaptive microblog crawling. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (pp. 311–315). New York, NY, USA: ACM. Retrieved from http://doi.acm.org/10.1145/2492517.2492624 doi: 10.1145/2492517.2492624
Weerkamp, W., & de Rijke, M. (2012, August). Activity prediction: A Twitter-based exploration. In Proceedings of the SIGIR 2012 Workshop on Time-aware Information Access, TAIA-2012.
Wilson, T., Wiebe, J., & Hoffmann, P. (2005). Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (pp. 347–354). Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from https://doi.org/10.3115/1220575.1220619 doi: 10.3115/1220575.1220619
Xu, Y., Hong, K., Tsujii, J., & Chang, E. I.-C. (2012). Feature engineering combined with machine learning and rule-based methods for structured information extraction from narrative clinical discharge summaries. Journal of the American Medical Informatics Association, 19(5), 824–832. Retrieved from http://dx.doi.org/10.1136/amiajnl-2011-000776 doi: 10.1136/amiajnl-2011-000776
Yang, Y., & Eisenstein, J. (2015). Putting things in context: Community-specific embedding projections for sentiment analysis. CoRR, abs/1511.06052. Retrieved from http://arxiv.org/abs/1511.06052
Yin, J., Lampert, A., Cameron, M., Robinson, B., & Power, R. (2012, November). Using social media to enhance emergency situation awareness. IEEE Intelligent Systems, 27(6), 52–59. Retrieved from http://dx.doi.org/10.1109/MIS.2012.6 doi: 10.1109/MIS.2012.6
Yu, S., & Kak, S. (2012, March 7). A survey of prediction using social media. CoRR, abs/1203.1647. Retrieved from http://arxiv.org/abs/1203.1647
Zanzotto, F. M., Pennacchiotti, M., & Tsioutsiouliklis, K. (2011). Linguistic redundancy in Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 659–669). Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from http://dl.acm.org/citation.cfm?id=2145432.2145509
Zhao, D., & Rosson, M. B. (2009). How and why people Twitter: The role that micro-blogging plays in informal communication at work. In Proceedings of the ACM 2009 International Conference on Supporting Group Work (pp. 243–252). New York, NY, USA: ACM. Retrieved from http://doi.acm.org/10.1145/1531674.1531710 doi: 10.1145/1531674.1531710
Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., & Li, X. (2011). Comparing Twitter and traditional media using topic models. In P. Clough et al. (Eds.), Advances in Information Retrieval: 33rd European Conference on IR Research, ECIR 2011, Dublin, Ireland, April 18-21, 2011. Proceedings (pp. 338–349). Berlin, Heidelberg: Springer Berlin Heidelberg. Retrieved from http://dx.doi.org/10.1007/978-3-642-20161-5_34 doi: 10.1007/978-3-642-20161-5_34

Samenvatting

Microblogs zoals Twitter vormen een krachtige bron van informatie. Een deel van deze informatie kan worden geaggregeerd buiten het niveau van individuele berichten. Een deel van deze geaggregeerde informatie verwijst naar gebeurtenissen die kunnen of moeten worden aangepakt in het belang van bijvoorbeeld e-governance, openbare veiligheid of andere niveaus van openbaar belang. Bovendien kan een aanzienlijk deel van deze informatie, indien samengevoegd, bestaande informatienetwerken op niet-triviale wijze aanvullen. In dit proefschrift wordt een semi-automatische methode voorgesteld voor het extraheren van bruikbare informatie die dit doel dient.

We rapporteren drie belangrijke bijdragen en een eindconclusie die in een apart hoofdstuk van dit proefschrift worden gepresenteerd. Ten eerste laten we zien dat het voorspellen van de tijd totdat een gebeurtenis plaatsvindt mogelijk is voor zowel binnen-domein als cross-domein scenario's. Ten tweede stellen we een methode voor die de definitie van relevantie in de context van een analist vergemakkelijkt en de analist in staat stelt om die definitie te gebruiken om nieuwe gegevens te analyseren. Ten slotte stellen we een methode voor om de op machinaal leren gebaseerde classificatie van relevante informatie te integreren met een regelgebaseerde informatieclassificatietechniek om microteksten (tweets) te classificeren.

In Hoofdstuk 2 doen we verslag van ons onderzoek dat erop gericht is om de informatie over de begintijd van een gebeurtenis op sociale media te karakteriseren en deze automatisch te gebruiken om een methode te ontwikkelen die de tijd tot de gebeurtenis betrouwbaar kan voorspellen. Verschillende algoritmes voor feature-extractie en machinaal leren werden vergeleken om de beste implementatie voor deze taak te vinden. Op die manier hebben we een methode ontwikkeld die nauwkeurige schattingen maakt met behulp van skipgrammen en, indien beschikbaar, kan profiteren van de tijdsinformatie in de tekst van de berichten. Er is gebruik gemaakt van tijdreeksanalysetechnieken om deze kenmerken te combineren en om een schatting te doen, welke vervolgens geïntegreerd werd met de eerdere schattingen voor die gebeurtenis.


In Hoofdstuk 3 beschrijven we een studie die tot doel heeft relevante inhoud op sociale media te identificeren op een door een expert bepaald granulariteitsniveau. De evolutie van de tool, die bij elke volgende use case werd geïllustreerd, geeft weer hoe elke stap van de analyse werd ontwikkeld als antwoord op de behoeften die zich voordoen in het proces van het analyseren van een tweetcollectie. De nieuwigheid van onze methodologie is het behandelen van alle stappen van de analyse, van het verzamelen van gegevens tot het hebben van een machine-learningclassifier die gebruikt kan worden om relevante informatie uit microtekstverzamelingen te detecteren. Het stelt de experts in staat om te ontdekken wat er in een verzameling zit en beslissingen te nemen over de granulariteit van hun analyse. Het resultaat is een schaalbare tool die alle analysestappen behandelt, van het verzamelen van gegevens tot het hebben van een classificator die kan worden gebruikt om relevante informatie te detecteren met redelijke prestaties.

Ons werk in het kader van Hoofdstuk 4 heeft aangetoond dat microtekstclassificatie van waarde is voor het detecteren van bruikbare inzichten uit tweetcollecties in rampscenario's, maar zowel machine learning als regelgebaseerde benaderingen vertonen zwakke punten. We hebben uitgezocht hoe de prestaties van deze benaderingen samenhangen met de beschikbare middelen. Onze conclusie is dat de beschikbaarheid van geannoteerde data, het domein en linguïstische expertise samen bepalend zijn voor de beste aanpak en de prestaties die kunnen worden behaald in een gegeven casus.

Het volledig automatiseren van microtekstanalyse is ons doel sinds de eerste dag van dit onderzoeksproject. Onze inspanningen in deze richting hebben ons inzicht gegeven in de mate waarin deze automatisering kan worden gerealiseerd. We ontwikkelden meestal eerst een geautomatiseerde aanpak, waarna we deze uitbreidden en verbeterden door menselijke interventie te integreren in verschillende stappen van de geautomatiseerde aanpak. Onze ervaring bevestigt eerder werk dat stelt dat een goed ontworpen menselijke interventie of bijdrage in het ontwerp, de realisatie of de evaluatie van een informatiesysteem de prestaties ervan verbetert of de realisatie ervan mogelijk maakt. Nadat onze studies en resultaten ons op de noodzaak en de waarde ervan hadden gewezen, werden we geïnspireerd door eerdere studies bij het ontwerpen van menselijke betrokkenheid en pasten we onze aanpak aan om te profiteren van de menselijke inbreng. Bijgevolg is onze bijdrage aan bestaand onderzoek in deze lijn de bevestiging geworden van de waarde van menselijk ingrijpen bij het extraheren van bruikbare informatie uit microteksten.

Summary

Microblogs such as Twitter represent a powerful source of information. Part of this information can be aggregated beyond the level of individual posts. Some of this aggregated information refers to events that could or should be acted upon in the interest of e-governance, public safety, or other levels of public interest. Moreover, a significant amount of this information, if aggregated, could complement existing information networks in a non-trivial way. This dissertation proposes a semi-automatic method for extracting actionable information that serves this purpose.

We report three main contributions and a final conclusion, each presented in a separate chapter of this dissertation. First, we show that predicting time to event is possible for both in-domain and cross-domain scenarios. Second, we suggest a method which facilitates the definition of relevance in an analyst's context and the use of this definition to analyze new data. Finally, we propose a method to integrate the machine-learning-based relevant information classification method with a rule-based information classification technique to classify microtexts.

In Chapter 2 we reported on our research that aims at characterizing the information about the start time of an event on social media and automatically using it to develop a method that can reliably predict the time to the event. Various feature extraction and machine learning algorithms were explored in order to find the best combination for this task. As a result, we developed a method that produces accurate estimates using skipgrams and, when available, is able to benefit from temporal information in the text of the posts. Time series analysis techniques were used to combine these features for generating an estimate and to integrate that estimate with the previous estimates for that event.

In Chapter 3 we presented our research that aims at identifying relevant content on social media at a granularity level an expert determines. The evolution of the tool, illustrated by each subsequent use case, reflects how each step of the analysis was developed in response to the needs that arise in the process of analyzing a tweet collection. The novelty of our methodology is in dealing with all steps of the analysis, from data collection to having a classifier that can be used to detect relevant information from microtext collections. It enables the experts to discover what is in a collection and to make decisions about the granularity of their analysis. The result is a scalable tool that deals with the whole analysis, from data collection to having a classifier that can be used to detect relevant information at a reasonable performance.

Our work in the scope of Chapter 4 showed that microtext classification can facilitate detecting actionable insights from tweet collections in disaster scenarios, but both machine-learning and rule-based approaches show weaknesses. We determined how the performance of these approaches depends on the starting conditions in terms of available resources. As a result, our conclusion is that the availability of annotated data, domain expertise, and linguistic expertise together determine the path that should be followed and the performance that can be obtained in a use case.

Fully automating microtext analysis has been our goal since the first day of this research project. Our efforts in this direction informed us about the extent to which this automation can be realized. In most cases we first developed an automated approach, and then extended and improved it by integrating human intervention at various steps. Our experience confirms previous work stating that well-designed human intervention or contribution in the design, realization, or evaluation of an information system either improves its performance or enables its realization. As our studies and results directed us toward its necessity and value, we were inspired by previous studies in designing human involvement and customized our approaches to benefit from human input. Consequently, our contribution to the existing body of research in this line has become the confirmation of the value of human intervention in extracting actionable information from microtexts.

Curriculum Vitae

Ali Hürriyetoğlu has a bachelor's degree in computer engineering from Ege University, İzmir, Turkey, and a Master of Science degree in Cognitive Science from Middle East Technical University (METU), Ankara, Turkey. Moreover, he spent one year as an exchange student at TH Mittelhessen University of Applied Sciences, Giessen, Germany, during his computer engineering education. During his master's education, he worked at METU as a research assistant for 22 months, which enabled him to understand the context of working at a university. Since the beginning of his MSc education he has worked on research projects in computational linguistics in various settings, starting at the European Union Joint Research Centre between October 2010 and October 2011. His responsibility there was to support a team of computational linguists in developing and integrating Turkish NLP tools into the Europe Media Monitor, a multilingual (70 languages) news monitoring and analysis system. He then performed his Ph.D. research at Radboud University, Nijmegen, the Netherlands. His Ph.D. project was embedded in a national ICT project in the Netherlands, so he had the opportunity to collaborate with other researchers and industrial partners in the Netherlands, establish research partnerships with domain experts, acquire demo development grants for his work with the project partners, and be engaged in a World Bank project as the academic partner. Ali spent 5 months working on sentiment analysis at Netbase Solutions, Inc. in Mountain View, CA, USA, during his Ph.D. studies. Finally, after completing his full-time Ph.D. research, he continued working on his dissertation as a guest researcher at Radboud University and worked at the Center for Big Data Statistics at Statistics Netherlands, the national official statistics institute of the Netherlands. At Statistics Netherlands, he gained experience in applying text mining techniques to official reports and web data in a big data setting for creating national statistics in a governmental organization, and he provided training for colleagues from other national statistics offices around Europe in the line of his work. Ali has been leading a work package on collecting protest information from local news published in India, China, Brazil, Mexico, and South Africa in the scope of a European Research Council (ERC) project since December 1, 2017.

149 SIKS Dissertation Series

1998 3. Carolien M.T. Metselaar (UVA) Sociaal- organisatorische gevolgen van kennistechnologie; een 1. Johan van den Akker (CWI) DEGAS - An Active, procesbenadering en actorperspectief of Autonomous Objects 4. Geert de Haan (VU) ETAG, A Formal Model of Com- 2. Floris Wiesman (UM) Information Retrieval by petence Knowledge for User Interface Design Graphically Browsing Meta-Information 5. Ruud van der Pol (UM) Knowledge-based Query For- 3. Ans Steuten (TUD) A Contribution to the Linguis- mulation in Information Retrieval tic Analysis of Busine within the Language/Action Per- 6. Rogier van Eijk (UU) Programming Languages for spective Agent Communication 4. Dennis Breuker (UM) Memory versus Search in 7. Niels Peek (UU) Decision-theoretic Planning of Clin- Games ical Patient Management 5. E.W. Oskamp (RUL) Computerondersteuning bij 8. Veerle Coup (EUR) Sensitivity Analyis of Decision- Straftoemeting Theoretic Networks 9. Florian Waas (CWI) Principles of Probabilistic Query Optimization 1999 10. Niels Nes (CWI) Image Database Management System Design Considerations, Algorithms and Architecture 1. Mark Sloof (VU) Physiology of Quality Change Mod- 11. Jonas Karlsson (CWI) Scalable Distributed Data elling; Automated modelling of Quality Change of Agri- Structures for Database Management cultural Products 2. Rob Potharst (EUR) Classification using decision trees and neural nets 3. Don Beal (UM) The Nature of Minimax Search 2001 4. Jacques Penders (UM) The practical Art of Moving Physical Objects 1. Silja Renooij (UU) Qualitative Approaches to Quanti- 5. Aldo de Moor (KUB) Empowering Communities: A fying Probabilistic Networks Method for the Legitimate User-Driven Specification of 2. Koen Hindriks (UU) Agent Programming Languages: Network Information Systems Programming with Mental Models 6. Niek J.E. Wijngaards (VU) Re-design of composi- 3. Maarten van Someren (UvA) Learning as problem tional systems solving 7. David Spelt (UT) Verification support for object 4. Evgueni Smirnov (UM) Conjunctive and Disjunctive database design Version with Instance-Based Boundary Sets 8. Jacques H.J. Lenting (UM) Informed Gambling: Con- 5. Jacco van Ossenbruggen (VU) Processing Structured ception and Analysis of a Multi-Agent Mechanism for Hypermedia: A Matter of Style Discrete Reallocation 6. Martijn van Welie (VU) Task-based User Interface De- sign 7. Bastiaan Schonhage (VU) Diva: Architectural Per- 2000 spectives on Information Visualization 8. Pascal van Eck (VU) A Compositional Semantic 1. Frank Niessink (VU) Perspectives on Improving Soft- Structure for Multi-Agent Systems Dynamics ware Maintenance 9. Pieter Jan ’t Hoen (RUL) Towards Distributed De- 2. Koen Holtman (TUE) Prototyping of CMS Storage velopment of Large Object-Oriented Models, Views of Management Packages as Classes

150 Curriculum Vitae 151

10. Maarten Sierhuis (UvA) Modeling and Simulating 2. Jan Broersen (VU) Modal Action Logics for Reasoning Work Practice BRAHMS: a multiagent modeling and About Reactive Systems simulation language for work practice analysis and de- 3. Martijn Schuemie (TUD) Human-Computer Interac- sign tion and Presence in Virtual Reality Exposure Therapy 11. Tom M. van Engers (VUA) Knowledge Management: 4. Milan Petkovic (UT) Content-Based Video Retrieval The Role of Mental Models in Business Systems Design Supported by Database Technology 5. Jos Lehmann (UVA) Causation in Artificial Intelli- gence and Law - A modelling approach 2002 6. Boris van Schooten (UT) Development and specifica- tion of virtual environments 1. Nico Lassing (VU) Architecture-Level Modifiability 7. Machiel Jansen (UvA) Formal Explorations of Knowl- Analysis edge Intensive Tasks 2. Roelof van Zwol (UT) Modelling and searching web- 8. Yongping Ran (UM) Repair Based Scheduling based document collections 9. Rens Kortmann (UM) The resolution of visually 3. Henk Ernst Blok (UT) Database Optimization As- guided behaviour pects for Information Retrieval 10. Andreas Lincke (UvT) Electronic Business Negotia- 4. Juan Roberto Castelo Valdueza (UU) The Discrete tion: Some experimental studies on t between medium, Acyclic Digraph Markov Model in Data Mining innovation context and culture 5. Radu Serban (VU) The Private Cyberspace Model- 11. Simon Keizer (UT) Reasoning under Uncertainty in ing Electronic Environments inhabited by Privacy- Natural Language Dialogue using Bayesian Networks concerned Agents 12. Roeland Ordelman (UT) Dutch speech recognition in 6. Laurens Mommers (UL) Applied legal epistemology; multimedia information retrieval Building a knowledge-based ontology of the legal do- 13. Jeroen Donkers (UM) Nosce Hostem - Searching with main Opponent Models 7. Peter Boncz (CWI) Monet: A Next-Generation DBMS 14. Stijn Hoppenbrouwers (KUN) Freezing Language: Kernel For Query-Intensive Applications Conceptualisation Processes across ICT-Supported Or- 8. Jaap Gordijn (VU) Value Based Requirements Engi- ganisations neering: Exploring Innovative E-Commerce Ideas 15. Mathijs de Weerdt (TUD) Plan Merging in Multi- 9. Willem-Jan van den Heuvel(KUB) Integrating Mod- Agent Systems ern Business Applications with Objectified Legacy Sys- 16. Menzo Windhouwer (CWI) Feature Grammar Sys- tems tems - Incremental Maintenance of Indexes to Digital 10. Brian Sheppard (UM) Towards Perfect Play of Scrab- Media Warehouses ble 17. David Jansen (UT) Extensions of Statecharts with 11. Wouter C.A. Wijngaards (VU) Agent Based Mod- Probability, Time, and Stochastic Timing elling of Dynamics: Biological and Organisational Ap- 18. Levente Kocsis (UM) Learning Search Decisions plications 12. Albrecht Schmidt (Uva) Processing XML in Database Systems 13. Hongjing Wu (TUE) A Reference Architecture for 2004 Adaptive Hypermedia Applications 14. Wieke de Vries (UU) Agent Interaction: Abstract 1. Virginia Dignum (UU) A Model for Organizational Approaches to Modelling, Programming and Verifying Interaction: Based on Agents, Founded in Logic Multi-Agent Systems 2. Lai Xu (UvT) Monitoring Multi-party Contracts for 15. Rik Eshuis (UT) Semantics and Verification of UML E-business Activity Diagrams for Workflow Modelling 3. Perry Groot (VU) A Theoretical and Empirical Anal- 16. Pieter van Langen (VU) The Anatomy of Design: ysis of Approximation in Symbolic Problem Solving Foundations, Models and Applications 4. Chris van Aart (UVA) Organizational Principles for 17. 
Stefan Manegold (UVA) Understanding, Modeling, Multi-Agent Architectures and Improving Main-Memory Database Performance 5. Viara Popova (EUR) Knowledge discovery and mono- tonicity 6. Bart-Jan Hommes (TUD) The Evaluation of Business 2003 Process Modeling Techniques 7. Elise Boltjes (UM) Voorbeeldig onderwijs; voor- 1. Heiner Stuckenschmidt (VU) Ontology-Based Infor- beeldgestuurd onderwijs, een opstap naar abstract mation Sharing in Weakly Structured Environments denken, vooral voor meisjes Curriculum Vitae 152

8. Joop Verbeek(UM) Politie en de Nieuwe Inter- 13. Fred Hamburg (UL) Een Computermodel voor het nationale Informatiemarkt, Grensregionale politiele¨ Ondersteunen van Euthanasiebeslissingen gegevensuitwisseling en digitale expertise 14. Borys Omelayenko (VU) Web-Service configuration 9. Martin Caminada (VU) For the Sake of the Argument; on the Semantic Web; Exploring how semantics meets explorations into argument-based reasoning pragmatics 10. Suzanne Kabel (UVA) Knowledge-rich indexing of 15. Tibor Bosse (VU) Analysis of the Dynamics of Cogni- learning-objects tive Processes 11. Michel Klein (VU) Change Management for Dis- 16. Joris Graaumans (UU) Usability of XML Query Lan- tributed Ontologies guages 12. The Duy Bui (UT) Creating emotions and facial ex- 17. Boris Shishkov (TUD) Software Specification Based pressions for embodied agents on Re-usable Business Components 13. Wojciech Jamroga (UT) Using Multiple Models of Re- 18. Danielle Sent (UU) Test-selection strategies for proba- ality: On Agents who Know how to Play bilistic networks 14. Paul Harrenstein (UU) Logic in Conflict. Logical Ex- 19. Michel van Dartel (UM) Situated Representation plorations in Strategic Equilibrium 20. Cristina Coteanu (UL) Cyber Consumer Law, State of 15. Arno Knobbe (UU) Multi-Relational Data Mining the Art and Perspectives 16. Federico Divina (VU) Hybrid Genetic Relational 21. Wijnand Derks (UT) Improving Concurrency and Re- Search for Inductive Learning covery in Database Systems by Exploiting Application 17. Mark Winands (UM) Informed Search in Complex Semantics Games 18. Vania Bessa Machado (UvA) Supporting the Con- struction of Qualitative Knowledge Models 19. Thijs Westerveld (UT) Using generative probabilistic 2006 models for multimedia retrieval 20. Madelon Evers (Nyenrode) Learning from Design: 1. Samuil Angelov (TUE) Foundations of B2B Elec- facilitating multidisciplinary design teams tronic Contracting 2. Cristina Chisalita (VU) Contextual issues in the de- sign and use of information technology in organizations 2005 3. Noor Christoph (UVA) The role of metacognitive skills in learning to solve problems 1. Floor Verdenius (UVA) Methodological Aspects of 4. Marta Sabou (VU) Building Web Service Ontologies Designing Induction-Based Applications 5. Cees Pierik (UU) Validation Techniques for Object- 2. Erik van der Werf (UM)) AI techniques for the game Oriented Proof Outlines of Go 6. Ziv Baida (VU) Software-aided Service Bundling - In- 3. Franc Grootjen (RUN) A Pragmatic Approach to the telligent Methods & Tools for Graphical Service Mod- Conceptualisation of Language eling 4. Nirvana Meratnia (UT) Towards Database Support 7. Marko Smiljanic (UT) XML schema matching – bal- for Moving Object data ancing efficiency and effectiveness by means of cluster- 5. Gabriel Infante-Lopez (UVA) Two-Level Probabilis- ing tic Grammars for Natural Language Parsing 8. Eelco Herder (UT) Forward, Back and Home Again - 6. Pieter Spronck (UM) Adaptive Game AI Analyzing User Behavior on the Web 7. Flavius Frasincar (TUE) Hypermedia Presentation 9. Mohamed Wahdan (UM) Automatic Formulation of Generation for Semantic Web Information Systems the Auditor’s Opinion 8. Richard Vdovjak (TUE) A Model-driven Approach 10. Ronny Siebes (VU) Semantic Routing in Peer-to-Peer for Building Distributed Ontology-based Web Applica- Systems tions 11. Joeri van Ruth (UT) Flattening Queries over Nested 9. Jeen Broekstra (VU) Storage, Querying and Inferenc- Data Types ing for Semantic Web Languages 12. 
Bert Bongers (VU) Interactivation - Towards an e- 10. Anders Bouwer (UVA) Explaining Behaviour: Using cology of people, our technological environment, and the Qualitative Simulation in Interactive Learning Envi- arts ronments 13. Henk-Jan Lebbink (UU) Dialogue and Decision 11. Elth Ogston (VU) Agent Based Matchmaking and Games for Information Exchanging Agents Clustering - A Decentralized Approach to Search 14. Johan Hoorn (VU) Software Requirements: Update, 12. Csaba Boer (EUR) Distributed Simulation in Indus- Upgrade, Redesign - towards a Theory of Requirements try Change Curriculum Vitae 153

15. Rainer Malik (UU) CONAN: Text Mining in the Biomedical Domain
16. Carsten Riggelsen (UU) Approximation Methods for Efficient Learning of Bayesian Networks
17. Stacey Nagata (UU) User Assistance for Multitasking with Interruptions on a Mobile Device
18. Valentin Zhizhkun (UVA) Graph transformation for Natural Language Processing
19. Birna van Riemsdijk (UU) Cognitive Agent Programming: A Semantic Approach
20. Marina Velikova (UvT) Monotone models for prediction in data mining
21. Bas van Gils (RUN) Aptness on the Web
22. Paul de Vrieze (RUN) Fundaments of Adaptive Personalisation
23. Ion Juvina (UU) Development of Cognitive Model for Navigating on the Web
24. Laura Hollink (VU) Semantic Annotation for Retrieval of Visual Resources
25. Madalina Drugan (UU) Conditional log-likelihood MDL and Evolutionary MCMC
26. Vojkan Mihajlović (UT) Score Region Algebra: A Flexible Framework for Structured Information Retrieval
27. Stefano Bocconi (CWI) Vox Populi: generating video documentaries from semantically annotated media repositories
28. Borkur Sigurbjornsson (UVA) Focused Information Access using XML Element Retrieval

2007

1. Kees Leune (UvT) Access Control and Service-Oriented Architectures
2. Wouter Teepe (RUG) Reconciling Information Exchange and Confidentiality: A Formal Approach
3. Peter Mika (VU) Social Networks and the Semantic Web
4. Jurriaan van Diggelen (UU) Achieving Semantic Interoperability in Multi-agent Systems: a dialogue-based approach
5. Bart Schermer (UL) Software Agents, Surveillance, and the Right to Privacy: a Legislative Framework for Agent-enabled Surveillance
6. Gilad Mishne (UVA) Applied Text Analytics for Blogs
7. Natasa Jovanović (UT) To Whom It May Concern - Addressee Identification in Face-to-Face Meetings
8. Mark Hoogendoorn (VU) Modeling of Change in Multi-Agent Organizations
9. David Mobach (VU) Agent-Based Mediated Service Negotiation
10. Huib Aldewereld (UU) Autonomy vs. Conformity: an Institutional Perspective on Norms and Protocols
11. Natalia Stash (TUE) Incorporating Cognitive/Learning Styles in a General-Purpose Adaptive Hypermedia System
12. Marcel van Gerven (RUN) Bayesian Networks for Clinical Decision Support: A Rational Approach to Dynamic Decision-Making under Uncertainty
13. Rutger Rienks (UT) Meetings in Smart Environments; Implications of Progressing Technology
14. Niek Bergboer (UM) Context-Based Image Analysis
15. Joyca Lacroix (UM) NIM: a Situated Computational Memory Model
16. Davide Grossi (UU) Designing Invisible Handcuffs. Formal investigations in Institutions and Organizations for Multi-agent Systems
17. Theodore Charitos (UU) Reasoning with Dynamic Networks in Practice
18. Bart Orriens (UvT) On the development and management of adaptive business collaborations
19. David Levy (UM) Intimate relationships with artificial partners
20. Slinger Jansen (UU) Customer Configuration Updating in a Software Supply Network
21. Karianne Vermaas (UU) Fast diffusion and broadening use: A research on residential adoption and usage of broadband internet in the Netherlands between 2001 and 2005
22. Zlatko Zlatev (UT) Goal-oriented design of value and process models from patterns
23. Peter Barna (TUE) Specification of Application Logic in Web Information Systems
24. Georgina Ramírez Camps (CWI) Structural Features in XML Retrieval
25. Joost Schalken (VU) Empirical Investigations in Software Process Improvement

2008

1. Katalin Boer-Sorbán (EUR) Agent-Based Simulation of Financial Markets: A modular, continuous-time approach
2. Alexei Sharpanskykh (VU) On Computer-Aided Methods for Modeling and Analysis of Organizations
3. Vera Hollink (UVA) Optimizing hierarchical menus: a usage-based approach
4. Ander de Keijzer (UT) Management of Uncertain Data - towards unattended integration
5. Bela Mutschler (UT) Modeling and simulating causal dependencies on process-aware information systems from a cost perspective
6. Arjen Hommersom (RUN) On the Application of Formal Methods to Clinical Guidelines, an Artificial Intelligence Perspective
7. Peter van Rosmalen (OU) Supporting the tutor in the design and support of adaptive e-learning
8. Janneke Bolt (UU) Bayesian Networks: Aspects of Approximate Inference
9. Christof van Nimwegen (UU) The paradox of the guided user: assistance can be counter-effective
10. Wauter Bosma (UT) Discourse oriented summarization
11. Vera Kartseva (VU) Designing Controls for Network Organizations: A Value-Based Approach
12. Jozsef Farkas (RUN) A Semiotically Oriented Cognitive Model of Knowledge Representation
13. Caterina Carraciolo (UVA) Topic Driven Access to Scientific Handbooks
14. Arthur van Bunningen (UT) Context-Aware Querying; Better Answers with Less Effort
15. Martijn van Otterlo (UT) The Logic of Adaptive Behavior: Knowledge Representation and Algorithms for the Markov Decision Process Framework in First-Order Domains
16. Henriette van Vugt (VU) Embodied agents from a user's perspective
17. Martin Op 't Land (TUD) Applying Architecture and Ontology to the Splitting and Allying of Enterprises
18. Guido de Croon (UM) Adaptive Active Vision
19. Henning Rode (UT) From Document to Entity Retrieval: Improving Precision and Performance of Focused Text Search
20. Rex Arendsen (UVA) Geen bericht, goed bericht. Een onderzoek naar de effecten van de introductie van elektronisch berichtenverkeer met de overheid op de administratieve lasten van bedrijven
21. Krisztian Balog (UVA) People Search in the Enterprise
22. Henk Koning (UU) Communication of IT-Architecture
23. Stefan Visscher (UU) Bayesian network models for the management of ventilator-associated pneumonia
24. Zharko Aleksovski (VU) Using background knowledge in ontology matching
25. Geert Jonker (UU) Efficient and Equitable Exchange in Air Traffic Management Plan Repair using Spender-signed Currency
26. Marijn Huijbregts (UT) Segmentation, Diarization and Speech Transcription: Surprise Data Unraveled
27. Hubert Vogten (OU) Design and Implementation Strategies for IMS Learning Design
28. Ildiko Flesch (RUN) On the Use of Independence Relations in Bayesian Networks
29. Dennis Reidsma (UT) Annotations and Subjective Machines - Of Annotators, Embodied Agents, Users, and Other Humans
30. Wouter van Atteveldt (VU) Semantic Network Analysis: Techniques for Extracting, Representing and Querying Media Content
31. Loes Braun (UM) Pro-Active Medical Information Retrieval
32. Trung H. Bui (UT) Toward Affective Dialogue Management using Partially Observable Markov Decision Processes
33. Frank Terpstra (UVA) Scientific Workflow Design; theoretical and practical issues
34. Jeroen de Knijf (UU) Studies in Frequent Tree Mining
35. Ben Torben Nielsen (UvT) Dendritic morphologies: function shapes structure

2009

1. Rasa Jurgelenaite (RUN) Symmetric Causal Independence Models
2. Willem Robert van Hage (VU) Evaluating Ontology-Alignment Techniques
3. Hans Stol (UvT) A Framework for Evidence-based Policy Making Using IT
4. Josephine Nabukenya (RUN) Improving the Quality of Organisational Policy Making using Collaboration Engineering
5. Sietse Overbeek (RUN) Bridging Supply and Demand for Knowledge Intensive Tasks - Based on Knowledge, Cognition, and Quality
6. Muhammad Subianto (UU) Understanding Classification
7. Ronald Poppe (UT) Discriminative Vision-Based Recovery and Recognition of Human Motion
8. Volker Nannen (VU) Evolutionary Agent-Based Policy Analysis in Dynamic Environments
9. Benjamin Kanagwa (RUN) Design, Discovery and Construction of Service-oriented Systems
10. Jan Wielemaker (UVA) Logic programming for knowledge-intensive interactive applications
11. Alexander Boer (UVA) Legal Theory, Sources of Law & the Semantic Web
12. Peter Massuthe (TUE, Humboldt-Universitaet zu Berlin) Operating Guidelines for Services
13. Steven de Jong (UM) Fairness in Multi-Agent Systems
14. Maksym Korotkiy (VU) From ontology-enabled services to service-enabled ontologies (making ontologies work in e-science with ONTO-SOA)
15. Rinke Hoekstra (UVA) Ontology Representation - Design Patterns and Ontologies that Make Sense
16. Fritz Reul (UvT) New Architectures in Computer Chess
17. Laurens van der Maaten (UvT) Feature Extraction from Visual Data
18. Fabian Groffen (CWI) Armada, An Evolving Database System
19. Valentin Robu (CWI) Modeling Preferences, Strategic Reasoning and Collaboration in Agent-Mediated Electronic Markets
20. Bob van der Vecht (UU) Adjustable Autonomy: Controlling Influences on Decision Making
21. Stijn Vanderlooy (UM) Ranking and Reliable Classification
22. Pavel Serdyukov (UT) Search For Expertise: Going beyond direct evidence
23. Peter Hofgesang (VU) Modelling Web Usage in a Changing Environment
24. Annerieke Heuvelink (VUA) Cognitive Models for Training Simulations
25. Alex van Ballegooij (CWI) "RAM: Array Database Management through Relational Mapping"
26. Fernando Koch (UU) An Agent-Based Model for the Development of Intelligent Mobile Services
27. Christian Glahn (OU) Contextual Support of social Engagement and Reflection on the Web
28. Sander Evers (UT) Sensor Data Management with Probabilistic Models
29. Stanislav Pokraev (UT) Model-Driven Semantic Integration of Service-Oriented Applications
30. Marcin Zukowski (CWI) Balancing vectorized query execution with bandwidth-optimized storage
31. Sofiya Katrenko (UVA) A Closer Look at Learning Relations from Text
32. Rik Farenhorst (VU) and Remco de Boer (VU) Architectural Knowledge Management: Supporting Architects and Auditors
33. Khiet Truong (UT) How Does Real Affect Affect Affect Recognition In Speech?
34. Inge van de Weerd (UU) Advancing in Software Product Management: An Incremental Method Engineering Approach
35. Wouter Koelewijn (UL) Privacy en Politiegegevens; Over geautomatiseerde normatieve informatie-uitwisseling
36. Marco Kalz (OUN) Placement Support for Learners in Learning Networks
37. Hendrik Drachsler (OUN) Navigation Support for Learners in Informal Learning Networks
38. Riina Vuorikari (OU) Tags and self-organisation: a metadata ecology for learning resources in a multilingual context
39. Christian Stahl (TUE, Humboldt-Universitaet zu Berlin) Service Substitution – A Behavioral Approach Based on Petri Nets
40. Stephan Raaijmakers (UvT) Multinomial Language Learning: Investigations into the Geometry of Language
41. Igor Berezhnyy (UvT) Digital Analysis of Paintings
42. Toine Bogers (UvT) Recommender Systems for Social Bookmarking
43. Virginia Nunes Leal Franqueira (UT) Finding Multi-step Attacks in Computer Networks using Heuristic Search and Mobile Ambients
44. Roberto Santana Tapia (UT) Assessing Business-IT Alignment in Networked Organizations
45. Jilles Vreeken (UU) Making Pattern Mining Useful
46. Loredana Afanasiev (UvA) Querying XML: Benchmarks and Recursion

2010

1. Matthijs van Leeuwen (UU) Patterns that Matter
2. Ingo Wassink (UT) Work flows in Life Science
3. Joost Geurts (CWI) A Document Engineering Model and Processing Framework for Multimedia documents
4. Olga Kulyk (UT) Do You Know What I Know? Situational Awareness of Co-located Teams in Multidisplay Environments
5. Claudia Hauff (UT) Predicting the Effectiveness of Queries and Retrieval Systems
6. Sander Bakkes (UvT) Rapid Adaptation of Video Game AI
7. Wim Fikkert (UT) Gesture interaction at a Distance
8. Krzysztof Siewicz (UL) Towards an Improved Regulatory Framework of Free Software. Protecting user freedoms in a world of software communities and eGovernments
9. Hugo Kielman (UL) Politiële gegevensverwerking en Privacy, Naar een effectieve waarborging
10. Rebecca Ong (UL) Mobile Communication and Protection of Children
11. Adriaan Ter Mors (TUD) The world according to MARP: Multi-Agent Route Planning
12. Susan van den Braak (UU) Sensemaking software for crime analysis
13. Gianluigi Folino (RUN) High Performance Data Mining using Bio-inspired techniques
14. Sander van Splunter (VU) Automated Web Service Reconfiguration
15. Lianne Bodenstaff (UT) Managing Dependency Relations in Inter-Organizational Models
16. Sicco Verwer (TUD) Efficient Identification of Timed Automata, theory and practice
17. Spyros Kotoulas (VU) Scalable Discovery of Networked Resources: Algorithms, Infrastructure, Applications
18. Charlotte Gerritsen (VU) Caught in the Act: Investigating Crime by Agent-Based Simulation
19. Henriette Cramer (UvA) People's Responses to Autonomous and Adaptive Systems
20. Ivo Swartjes (UT) Whose Story Is It Anyway? How Improv Informs Agency and Authorship of Emergent Narrative
21. Harold van Heerde (UT) Privacy-aware data management by means of data degradation
22. Michiel Hildebrand (CWI) End-user Support for Access to Heterogeneous Linked Data
23. Bas Steunebrink (UU) The Logical Structure of Emotions
24. Dmytro Tykhonov (TUD) Designing Generic and Efficient Negotiation Strategies
25. Zulfiqar Ali Memon (VU) Modelling Human-Awareness for Ambient Agents: A Human Mindreading Perspective
26. Ying Zhang (CWI) XRPC: Efficient Distributed Query Processing on Heterogeneous XQuery Engines
27. Marten Voulon (UL) Automatisch contracteren
28. Arne Koopman (UU) Characteristic Relational Patterns
29. Stratos Idreos (CWI) Database Cracking: Towards Auto-tuning Database Kernels
30. Marieke van Erp (UvT) Accessing Natural History - Discoveries in data cleaning, structuring, and retrieval
31. Victor de Boer (UVA) Ontology Enrichment from Heterogeneous Sources on the Web
32. Marcel Hiel (UvT) An Adaptive Service Oriented Architecture: Automatically solving Interoperability Problems
33. Robin Aly (UT) Modeling Representation Uncertainty in Concept-Based Multimedia Retrieval
34. Teduh Dirgahayu (UT) Interaction Design in Service Compositions
35. Dolf Trieschnigg (UT) Proof of Concept: Concept-based Biomedical Information Retrieval
36. Jose Janssen (OU) Paving the Way for Lifelong Learning; Facilitating competence development through a learning path specification
37. Niels Lohmann (TUE) Correctness of services and their composition
38. Dirk Fahland (TUE) From Scenarios to components
39. Ghazanfar Farooq Siddiqui (VU) Integrative modeling of emotions in virtual agents
40. Mark van Assem (VU) Converting and Integrating Vocabularies for the Semantic Web
41. Guillaume Chaslot (UM) Monte-Carlo Tree Search
42. Sybren de Kinderen (VU) Needs-driven service bundling in a multi-supplier setting - the computational e3-service approach
43. Peter van Kranenburg (UU) A Computational Approach to Content-Based Retrieval of Folk Song Melodies
44. Pieter Bellekens (TUE) An Approach towards Context-sensitive and User-adapted Access to Heterogeneous Data Sources, Illustrated in the Television Domain
45. Vasilios Andrikopoulos (UvT) A theory and model for the evolution of software services
46. Vincent Pijpers (VU) e3alignment: Exploring Inter-Organizational Business-ICT Alignment
47. Chen Li (UT) Mining Process Model Variants: Challenges, Techniques, Examples
48. Milan Lovric (EUR) Behavioral Finance and Agent-Based Artificial Markets
49. Jahn-Takeshi Saito (UM) Solving difficult game positions
50. Bouke Huurnink (UVA) Search in Audiovisual Broadcast Archives
51. Alia Khairia Amin (CWI) Understanding and supporting information seeking tasks in multiple sources
52. Peter-Paul van Maanen (VU) Adaptive Support for Human-Computer Teams: Exploring the Use of Cognitive Models of Trust and Attention
53. Edgar Meij (UVA) Combining Concepts and Language Models for Information Access

2011

1. Botond Cseke (RUN) Variational Algorithms for Bayesian Inference in Latent Gaussian Models
2. Nick Tinnemeier (UU) Organizing Agent Organizations. Syntax and Operational Semantics of an Organization-Oriented Programming Language
3. Jan Martijn van der Werf (TUE) Compositional Design and Verification of Component-Based Information Systems
4. Hado van Hasselt (UU) Insights in Reinforcement Learning; Formal analysis and empirical evaluation of temporal-difference learning algorithms
5. Bas van der Raadt (VU) Enterprise Architecture Coming of Age - Increasing the Performance of an Emerging Discipline
6. Yiwen Wang (TUE) Semantically-Enhanced Recommendations in Cultural Heritage
7. Yujia Cao (UT) Multimodal Information Presentation for High Load Human Computer Interaction
8. Nieske Vergunst (UU) BDI-based Generation of Robust Task-Oriented Dialogues
9. Tim de Jong (OU) Contextualised Mobile Media for Learning
10. Bart Bogaert (UvT) Cloud Content Contention
11. Dhaval Vyas (UT) Designing for Awareness: An Experience-focused HCI Perspective
12. Carmen Bratosin (TUE) Grid Architecture for Distributed Process Mining
13. Xiaoyu Mao (UvT) Airport under Control. Multiagent Scheduling for Airport Ground Handling
14. Milan Lovric (EUR) Behavioral Finance and Agent-Based Artificial Markets
15. Marijn Koolen (UvA) The Meaning of Structure: the Value of Link Evidence for Information Retrieval
16. Maarten Schadd (UM) Selective Search in Games of Different Complexity
17. Jiyin He (UVA) Exploring Topic Structure: Coherence, Diversity and Relatedness
18. Mark Ponsen (UM) Strategic Decision-Making in complex games
19. Ellen Rusman (OU) The Mind's Eye on Personal Profiles
20. Qing Gu (VU) Guiding service-oriented software engineering - A view-based approach
21. Linda Terlouw (TUD) Modularization and Specification of Service-Oriented Systems
22. Junte Zhang (UVA) System Evaluation of Archival Description and Access
23. Wouter Weerkamp (UVA) Finding People and their Utterances in Social Media
24. Herwin van Welbergen (UT) Behavior Generation for Interpersonal Coordination with Virtual Humans On Specifying, Scheduling and Realizing Multimodal Virtual Human Behavior
25. Syed Waqar ul Qounain Jaffry (VU) Analysis and Validation of Models for Trust Dynamics
26. Matthijs Aart Pontier (VU) Virtual Agents for Human Communication - Emotion Regulation and Involvement-Distance Trade-Offs in Embodied Conversational Agents and Robots
27. Aniel Bhulai (VU) Dynamic website optimization through autonomous management of design patterns
28. Rianne Kaptein (UVA) Effective Focused Retrieval by Exploiting Query Context and Document Structure
29. Faisal Kamiran (TUE) Discrimination-aware Classification
30. Egon van den Broek (UT) Affective Signal Processing (ASP): Unraveling the mystery of emotions
31. Ludo Waltman (EUR) Computational and Game-Theoretic Approaches for Modeling Bounded Rationality
32. Nees-Jan van Eck (EUR) Methodological Advances in Bibliometric Mapping of Science
33. Tom van der Weide (UU) Arguing to Motivate Decisions
34. Paolo Turrini (UU) Strategic Reasoning in Interdependence: Logical and Game-theoretical Investigations
35. Maaike Harbers (UU) Explaining Agent Behavior in Virtual Training
36. Erik van der Spek (UU) Experiments in serious game design: a cognitive approach
37. Adriana Burlutiu (RUN) Machine Learning for Pairwise Data, Applications for Preference Learning and Supervised Network Inference
38. Nyree Lemmens (UM) Bee-inspired Distributed Optimization
39. Joost Westra (UU) Organizing Adaptation using Agents in Serious Games
40. Viktor Clerc (VU) Architectural Knowledge Management in Global Software Development
41. Luan Ibraimi (UT) Cryptographically Enforced Distributed Data Access Control
42. Michal Sindlar (UU) Explaining Behavior through Mental State Attribution
43. Henk van der Schuur (UU) Process Improvement through Software Operation Knowledge
44. Boris Reuderink (UT) Robust Brain-Computer Interfaces
45. Herman Stehouwer (UvT) Statistical Language Models for Alternative Sequence Selection
46. Beibei Hu (TUD) Towards Contextualized Information Delivery: A Rule-based Architecture for the Domain of Mobile Police Work
47. Azizi Bin Ab Aziz (VU) Exploring Computational Models for Intelligent Support of Persons with Depression
48. Mark Ter Maat (UT) Response Selection and Turn-taking for a Sensitive Artificial Listening Agent
49. Andreea Niculescu (UT) Conversational interfaces for task-oriented spoken dialogues: design aspects influencing interaction quality

2012

1. Terry Kakeeto (UvT) Relationship Marketing for SMEs in Uganda
2. Muhammad Umair (VU) Adaptivity, emotion, and Rationality in Human and Ambient Agent Models
3. Adam Vanya (VU) Supporting Architecture Evolution by Mining Software Repositories
4. Jurriaan Souer (UU) Development of Content Management System-based Web Applications
5. Marijn Plomp (UU) Maturing Interorganisational Information Systems
6. Wolfgang Reinhardt (OU) Awareness Support for Knowledge Workers in Research Networks
7. Rianne van Lambalgen (VU) When the Going Gets Tough: Exploring Agent-based Models of Human Performance under Demanding Conditions
8. Gerben de Vries (UVA) Kernel Methods for Vessel Trajectories
9. Ricardo Neisse (UT) Trust and Privacy Management Support for Context-Aware Service Platforms
10. David Smits (TUE) Towards a Generic Distributed Adaptive Hypermedia Environment
11. J.C.B. Rantham Prabhakara (TUE) Process Mining in the Large: Preprocessing, Discovery, and Diagnostics
12. Kees van der Sluijs (TUE) Model Driven Design and Data Integration in Semantic Web Information Systems
13. Suleman Shahid (UvT) Fun and Face: Exploring non-verbal expressions of emotion during playful interactions
14. Evgeny Knutov (TUE) Generic Adaptation Framework for Unifying Adaptive Web-based Systems
15. Natalie van der Wal (VU) Social Agents. Agent-Based Modelling of Integrated Internal and Social Dynamics of Cognitive and Affective Processes
16. Fiemke Both (VU) Helping people by understanding them - Ambient Agents supporting task execution and depression treatment
17. Amal Elgammal (UvT) Towards a Comprehensive Framework for Business Process Compliance
18. Eltjo Poort (VU) Improving Solution Architecting Practices
19. Helen Schonenberg (TUE) What's Next? Operational Support for Business Process Execution
20. Ali Bahramisharif (RUN) Covert Visual Spatial Attention, a Robust Paradigm for Brain-Computer Interfacing
21. Roberto Cornacchia (TUD) Querying Sparse Matrices for Information Retrieval
22. Thijs Vis (UvT) Intelligence, politie en veiligheidsdienst: verenigbare grootheden?
23. Christian Muehl (UT) Toward Affective Brain-Computer Interfaces: Exploring the Neurophysiology of Affect during Human Media Interaction
24. Laurens van der Werff (UT) Evaluation of Noisy Transcripts for Spoken Document Retrieval
25. Silja Eckartz (UT) Managing the Business Case Development in Inter-Organizational IT Projects: A Methodology and its Application
26. Emile de Maat (UVA) Making Sense of Legal Text
27. Hayrettin Gurkok (UT) Mind the Sheep! User Experience Evaluation & Brain-Computer Interface Games
28. Nancy Pascall (UvT) Engendering Technology Empowering Women
29. Almer Tigelaar (UT) Peer-to-Peer Information Retrieval
30. Alina Pommeranz (TUD) Designing Human-Centered Systems for Reflective Decision Making
31. Emily Bagarukayo (RUN) A Learning by Construction Approach for Higher Order Cognitive Skills Improvement, Building Capacity and Infrastructure
32. Wietske Visser (TUD) Qualitative multi-criteria preference representation and reasoning
33. Rory Sie (OUN) Coalitions in Cooperation Networks (COCOON)
34. Pavol Jancura (RUN) Evolutionary analysis in PPI networks and applications
35. Evert Haasdijk (VU) Never Too Old To Learn – On-line Evolution of Controllers in Swarm- and Modular Robotics
36. Denis Ssebugwawo (RUN) Analysis and Evaluation of Collaborative Modeling Processes
37. Agnes Nakakawa (RUN) A Collaboration Process for Enterprise Architecture Creation
38. Selmar Smit (VU) Parameter Tuning and Scientific Testing in Evolutionary Algorithms
39. Hassan Fatemi (UT) Risk-aware design of value and coordination networks
40. Agus Gunawan (UvT) Information Access for SMEs in Indonesia
41. Sebastian Kelle (OU) Game Design Patterns for Learning
42. Dominique Verpoorten (OU) Reflection Amplifiers in self-regulated Learning
43. Anna Tordai (VU) On Combining Alignment Techniques
44. Benedikt Kratz (UvT) A Model and Language for Business-aware Transactions
45. Simon Carter (UVA) Exploration and Exploitation of Multilingual Data for Statistical Machine Translation
46. Manos Tsagkias (UVA) Mining Social Media: Tracking Content and Predicting Behavior
47. Jorn Bakker (TUE) Handling Abrupt Changes in Evolving Time-series Data
48. Michael Kaisers (UM) Learning against Learning - Evolutionary dynamics of reinforcement learning algorithms in strategic interactions
49. Steven van Kervel (TUD) Ontology driven Enterprise Information Systems Engineering
50. Jeroen de Jong (TUD) Heuristics in Dynamic Scheduling; a practical framework with a case study in elevator dispatching

2013

1. Viorel Milea (EUR) News Analytics for Financial Decision Support
2. Erietta Liarou (CWI) MonetDB/DataCell: Leveraging the Column-store Database Technology for Efficient and Scalable Stream Processing
3. Szymon Klarman (VU) Reasoning with Contexts in Description Logics
4. Chetan Yadati (TUD) Coordinating autonomous planning and scheduling
5. Dulce Pumareja (UT) Groupware Requirements Evolutions Patterns
6. Romulo Goncalves (CWI) The Data Cyclotron: Juggling Data and Queries for a Data Warehouse Audience
7. Giel van Lankveld (UvT) Quantifying Individual Player Differences
8. Robbert-Jan Merk (VU) Making enemies: cognitive modeling for opponent agents in fighter pilot simulators
9. Fabio Gori (RUN) Metagenomic Data Analysis: Computational Methods and Applications
10. Jeewanie Jayasinghe Arachchige (UvT) A Unified Modeling Framework for Service Design
11. Evangelos Pournaras (TUD) Multi-level Reconfigurable Self-organization in Overlay Services
12. Marian Razavian (VU) Knowledge-driven Migration to Services
13. Mohammad Safiri (UT) Service Tailoring: User-centric creation of integrated IT-based homecare services to support independent living of elderly
14. Jafar Tanha (UVA) Ensemble Approaches to Semi-Supervised Learning
15. Daniel Hennes (UM) Multiagent Learning - Dynamic Games and Applications
16. Eric Kok (UU) Exploring the practical benefits of argumentation in multi-agent deliberation
17. Koen Kok (VU) The PowerMatcher: Smart Coordination for the Smart Electricity Grid
18. Jeroen Janssens (UvT) Outlier Selection and One-Class Classification
19. Renze Steenhuizen (TUD) Coordinated Multi-Agent Planning and Scheduling
20. Katja Hofmann (UvA) Fast and Reliable Online Learning to Rank for Information Retrieval
21. Sander Wubben (UvT) Text-to-text generation by monolingual machine translation
22. Tom Claassen (RUN) Causal Discovery and Logic
23. Patricio de Alencar Silva (UvT) Value Activity Monitoring
24. Haitham Bou Ammar (UM) Automated Transfer in Reinforcement Learning
25. Agnieszka Anna Latoszek-Berendsen (UM) Intention-based Decision Support. A new way of representing and implementing clinical guidelines in a Decision Support System
26. Alireza Zarghami (UT) Architectural Support for Dynamic Homecare Service Provisioning
27. Mohammad Huq (UT) Inference-based Framework Managing Data Provenance
28. Frans van der Sluis (UT) When Complexity becomes Interesting: An Inquiry into the Information eXperience
29. Iwan de Kok (UT) Listening Heads
30. Joyce Nakatumba (TUE) Resource-Aware Business Process Management: Analysis and Support
31. Dinh Khoa Nguyen (UvT) Blueprint Model and Language for Engineering Cloud Applications
32. Kamakshi Rajagopal (OUN) Networking For Learning; The role of Networking in a Lifelong Learner's Professional Development
33. Qi Gao (TUD) User Modeling and Personalization in the Microblogging Sphere
34. Kien Tjin-Kam-Jet (UT) Distributed Deep Web Search
35. Abdallah El Ali (UvA) Minimal Mobile Human Computer Interaction
36. Than Lam Hoang (TUe) Pattern Mining in Data Streams
37. Dirk Börner (OUN) Ambient Learning Displays
38. Eelco den Heijer (VU) Autonomous Evolutionary Art
39. Joop de Jong (TUD) A Method for Enterprise Ontology based Design of Enterprise Information Systems
40. Pim Nijssen (UM) Monte-Carlo Tree Search for Multi-Player Games
41. Jochem Liem (UVA) Supporting the Conceptual Modelling of Dynamic Systems: A Knowledge Engineering Perspective on Qualitative Reasoning
42. Léon Planken (TUD) Algorithms for Simple Temporal Reasoning
43. Marc Bron (UVA) Exploration and Contextualization through Interaction and Concepts

2014

1. Nicola Barile (UU) Studies in Learning Monotone Models from Data
2. Fiona Tuliyano (RUN) Combining System Dynamics with a Domain Modeling Method
3. Sergio Raul Duarte Torres (UT) Information Retrieval for Children: Search Behavior and Solutions
4. Hanna Jochmann-Mannak (UT) Websites for children: search strategies and interface design - Three studies on children's search performance and evaluation
5. Jurriaan van Reijsen (UU) Knowledge Perspectives on Advancing Dynamic Capability
6. Damian Tamburri (VU) Supporting Networked Software Development
7. Arya Adriansyah (TUE) Aligning Observed and Modeled Behavior
8. Samur Araujo (TUD) Data Integration over Distributed and Heterogeneous Data Endpoints
9. Philip Jackson (UvT) Toward Human-Level Artificial Intelligence: Representation and Computation of Meaning in Natural Language
10. Ivan Salvador Razo Zapata (VU) Service Value Networks
11. Janneke van der Zwaan (TUD) An Empathic Virtual Buddy for Social Support
12. Willem van Willigen (VU) Look Ma, No Hands: Aspects of Autonomous Vehicle Control
13. Arlette van Wissen (VU) Agent-Based Support for Behavior Change: Models and Applications in Health and Safety Domains
14. Yangyang Shi (TUD) Language Models With Meta-information
15. Natalya Mogles (VU) Agent-Based Analysis and Support of Human Functioning in Complex Socio-Technical Systems: Applications in Safety and Healthcare
16. Krystyna Milian (VU) Supporting trial recruitment and design by automatically interpreting eligibility criteria
17. Kathrin Dentler (VU) Computing healthcare quality indicators automatically: Secondary Use of Patient Data and Semantic Interoperability
18. Mattijs Ghijsen (UVA) Methods and Models for the Design and Study of Dynamic Agent Organizations
19. Vinicius Ramos (TUE) Adaptive Hypermedia Courses: Qualitative and Quantitative Evaluation and Tool Support
20. Mena Habib (UT) Named Entity Extraction and Disambiguation for Informal Text: The Missing Link
21. Kassidy Clark (TUD) Negotiation and Monitoring in Open Environments
22. Marieke Peeters (UU) Personalized Educational Games - Developing agent-supported scenario-based training
23. Eleftherios Sidirourgos (UvA/CWI) Space Efficient Indexes for the Big Data Era
24. Davide Ceolin (VU) Trusting Semi-structured Web Data
25. Martijn Lappenschaar (RUN) New network models for the analysis of disease interaction
26. Tim Baarslag (TUD) What to Bid and When to Stop
27. Rui Jorge Almeida (EUR) Conditional Density Models Integrating Fuzzy and Probabilistic Representations of Uncertainty
28. Anna Chmielowiec (VU) Decentralized k-Clique Matching
29. Jaap Kabbedijk (UU) Variability in Multi-Tenant Enterprise Software
30. Peter de Cock (UvT) Anticipating Criminal Behaviour
31. Leo van Moergestel (UU) Agent Technology in Agile Multiparallel Manufacturing and Product Support
32. Naser Ayat (UvA) On Entity Resolution in Probabilistic Data
33. Tesfa Tegegne (RUN) Service Discovery in eHealth
34. Christina Manteli (VU) The Effect of Governance in Global Software Development: Analyzing Transactive Memory Systems
35. Joost van Ooijen (UU) Cognitive Agents in Virtual Worlds: A Middleware Design Approach
36. Joos Buijs (TUE) Flexible Evolutionary Algorithms for Mining Structured Process Models
37. Maral Dadvar (UT) Experts and Machines United Against Cyberbullying
38. Danny Plass-Oude Bos (UT) Making brain-computer interfaces better: improving usability through post-processing
39. Jasmina Maric (UvT) Web Communities, Immigration, and Social Capital
40. Walter Omona (RUN) A Framework for Knowledge Management Using ICT in Higher Education
41. Frederic Hogenboom (EUR) Automated Detection of Financial Events in News Text
42. Carsten Eijckhof (CWI/TUD) Contextual Multidimensional Relevance Models
43. Kevin Vlaanderen (UU) Supporting Process Improvement using Method Increments
44. Paulien Meesters (UvT) Intelligent Blauw. Met als ondertitel: Intelligence-gestuurde politiezorg in gebiedsgebonden eenheden
45. Birgit Schmitz (OUN) Mobile Games for Learning: A Pattern-Based Approach
46. Ke Tao (TUD) Social Web Data Analytics: Relevance, Redundancy, Diversity
47. Shangsong Liang (UVA) Fusion and Diversification in Information Retrieval

2015

1. Niels Netten (UvA) Machine Learning for Relevance of Information in Crisis Response
2. Faiza Bukhsh (UvT) Smart auditing: Innovative Compliance Checking in Customs Controls
3. Twan van Laarhoven (RUN) Machine learning for network data
4. Howard Spoelstra (OUN) Collaborations in Open Learning Environments
5. Christoph Bösch (UT) Cryptographically Enforced Search Pattern Hiding
6. Farideh Heidari (TUD) Business Process Quality Computation - Computing Non-Functional Requirements to Improve Business Processes
7. Maria-Hendrike Peetz (UvA) Time-Aware Online Reputation Analysis
8. Jie Jiang (TUD) Organizational Compliance: An agent-based model for designing and evaluating organizational interactions
9. Randy Klaassen (UT) HCI Perspectives on Behavior Change Support Systems
10. Henry Hermans (OUN) OpenU: design of an integrated system to support lifelong learning
11. Yongming Luo (TUE) Designing algorithms for big graph datasets: A study of computing bisimulation and joins
12. Julie M. Birkholz (VU) Modi Operandi of Social Network Dynamics: The Effect of Context on Scientific Collaboration Networks
13. Giuseppe Procaccianti (VU) Energy-Efficient Software
14. Bart van Straalen (UT) A cognitive approach to modeling bad news conversations
15. Klaas Andries de Graaf (VU) Ontology-based Software Architecture Documentation
16. Changyun Wei (UT) Cognitive Coordination for Cooperative Multi-Robot Teamwork
17. André van Cleeff (UT) Physical and Digital Security Mechanisms: Properties, Combinations and Trade-offs
18. Holger Pirk (CWI) Waste Not, Want Not! - Managing Relational Data in Asymmetric Memories
19. Bernardo Tabuenca (OUN) Ubiquitous Technology for Lifelong Learners
20. Loïs Vanhée (UU) Using Culture and Values to Support Flexible Coordination
21. Sibren Fetter (OUN) Using Peer-Support to Expand and Stabilize Online Learning
22. Zhemin Zhu (UT) Co-occurrence Rate Networks
23. Luit Gazendam (VU) Cataloguer Support in Cultural Heritage
24. Richard Berendsen (UVA) Finding People, Papers, and Posts: Vertical Search Algorithms and Evaluation
25. Steven Woudenberg (UU) Bayesian Tools for Early Disease Detection
26. Alexander Hogenboom (EUR) Sentiment Analysis of Text Guided by Semantics and Structure
27. Sándor Héman (CWI) Updating compressed column stores
28. Janet Bagorogoza (TiU) Knowledge Management and High Performance; The Uganda Financial Institutions Model for HPO
29. Hendrik Baier (UM) Monte-Carlo Tree Search Enhancements for One-Player and Two-Player Domains
30. Kiavash Bahreini (OU) Real-time Multimodal Emotion Recognition in E-Learning
31. Yakup Koç (TUD) On the robustness of Power Grids
32. Jerome Gard (UL) Corporate Venture Management in SMEs
33. Frederik Schadd (TUD) Ontology Mapping with Auxiliary Resources
34. Victor de Graaf (UT) Gesocial Recommender Systems
35. Jungxao Xu (TUD) Affective Body Language of Humanoid Robots: Perception and Effects in Human Robot Interaction

2016

1. Syed Saiden Abbas (RUN) Recognition of Shapes by Humans and Machines
2. Michiel Christiaan Meulendijk (UU) Optimizing medication reviews through decision support: prescribing a better pill to swallow
3. Maya Sappelli (RUN) Knowledge Work in Context: User Centered Knowledge Worker Support
4. Laurens Rietveld (VU) Publishing and Consuming Linked Data
5. Evgeny Sherkhonov (UVA) Expanded Acyclic Queries: Containment and an Application in Explaining Missing Answers
6. Michel Wilson (TUD) Robust scheduling in an uncertain environment
7. Jeroen de Man (VU) Measuring and modeling negative emotions for virtual training
8. Matje van de Camp (TiU) A Link to the Past: Constructing Historical Social Networks from Unstructured Data
9. Archana Nottamkandath (VU) Trusting Crowdsourced Information on Cultural Artefacts
10. George Karafotias (VUA) Parameter Control for Evolutionary Algorithms
11. Anne Schuth (UVA) Search Engines that Learn from Their Users
12. Max Knobbout (UU) Logics for Modelling and Verifying Normative Multi-Agent Systems
13. Nana Baah Gyan (VU) The Web, Speech Technologies and Rural Development in West Africa - An ICT4D Approach
14. Ravi Khadka (UU) Revisiting Legacy Software System Modernization
15. Steffen Michels (RUN) Hybrid Probabilistic Logics - Theoretical Aspects, Algorithms and Experiments
16. Guangliang Li (UVA) Socially Intelligent Autonomous Agents that Learn from Human Reward
17. Berend Weel (VU) Towards Embodied Evolution of Robot Organisms
18. Albert Meroño Peñuela (VU) Refining Statistical Data on the Web
19. Julia Efremova (Tu/e) Mining Social Structures from Genealogical Data
20. Daan Odijk (UVA) Context & Semantics in News & Web Search
21. Alejandro Moreno Célleri (UT) From Traditional to Interactive Playspaces: Automatic Analysis of Player Behavior in the Interactive Tag Playground
22. Grace Lewis (VU) Software Architecture Strategies for Cyber-Foraging Systems
23. Fei Cai (UVA) Query Auto Completion in Information Retrieval
24. Brend Wanders (UT) Repurposing and Probabilistic Integration of Data; An Iterative and data model independent approach
25. Julia Kiseleva (TU/e) Using Contextual Information to Understand Searching and Browsing Behavior
26. Dilhan Thilakarathne (VU) In or Out of Control: Exploring Computational Models to Study the Role of Human Awareness and Control in Behavioural Choices, with Applications in Aviation and Energy Management Domains
27. Wen Li (TUD) Understanding Geo-spatial Information on Social Media
28. Mingxin Zhang (TUD) Large-scale Agent-based Social Simulation - A study on epidemic prediction and control
29. Nicolas Höning (TUD) Peak reduction in decentralised electricity systems - Markets and prices for flexible planning
30. Ruud Mattheij (UvT) The Eyes Have It
31. Mohammad Khelghati (UT) Deep web content monitoring
32. Eelco Vriezekolk (UT) Assessing Telecommunication Service Availability Risks for Crisis Organisations
33. Peter Bloem (UVA) Single Sample Statistics, exercises in learning from just one example
34. Dennis Schunselaar (TUE) Configurable Process Trees: Elicitation, Analysis, and Enactment
35. Zhaochun Ren (UVA) Monitoring Social Media: Summarization, Classification and Recommendation
36. Daphne Karreman (UT) Beyond R2D2: The design of nonverbal interaction behavior optimized for robot-specific morphologies
37. Giovanni Sileno (UvA) Aligning Law and Action - a conceptual and computational inquiry
38. Andrea Minuto (UT) Materials that Matter - Smart Materials meet Art & Interaction Design
39. Merijn Bruijnes (UT) Believable Suspect Agents; Response and Interpersonal Style Selection for an Artificial Suspect
40. Christian Detweiler (TUD) Accounting for Values in Design
41. Thomas King (TUD) Governing Governance: A Formal Framework for Analysing Institutional Design and Enactment Governance
42. Spyros Martzoukos (UVA) Combinatorial and Compositional Aspects of Bilingual Aligned Corpora
43. Saskia Koldijk (RUN) Context-Aware Support for Stress Self-Management: From Theory to Practice
44. Thibault Sellam (UVA) Automatic Assistants for Database Exploration
45. Bram van de Laar (UT) Experiencing Brain-Computer Interface Control
46. Jorge Gallego Perez (UT) Robots to Make you Happy
47. Christina Weber (UL) Real-time foresight - Preparedness for dynamic innovation networks
48. Tanja Buttler (TUD) Collecting Lessons Learned
49. Gleb Polevoy (TUD) Participation and Interaction in Projects. A Game-Theoretic Analysis
50. Yan Wang (UVT) The Bridge of Dreams: Towards a Method for Operational Performance Alignment in IT-enabled Service Supply Chains

2017

1. Jan-Jaap Oerlemans (UL) Investigating Cybercrime
2. Sjoerd Timmer (UU) Designing and Understanding Forensic Bayesian Networks using Argumentation
3. Daniël Harold Telgen (UU) Grid Manufacturing; A Cyber-Physical Approach with Autonomous Products and Reconfigurable Manufacturing Machines
4. Mrunal Gawade (CWI) Multi-core Parallelism in a Column-store
5. Mahdieh Shadi (UVA) Collaboration Behavior
6. Damir Vandic (EUR) Intelligent Information Systems for Web Product Search
7. Roel Bertens (UU) Insight in Information: from Abstract to Anomaly
8. Rob Konijn (VU) Detecting Interesting Differences: Data Mining in Health Insurance Data using Outlier Detection and Subgroup Discovery
9. Dong Nguyen (UT) Text as Social and Cultural Data: A Computational Perspective on Variation in Text
10. Robby van Delden (UT) (Steering) Interactive Play Behavior
11. Florian Kunneman (RUN) Modelling patterns of time and emotion in Twitter #anticipointment
12. Sander Leemans (TUE) Robust Process Mining with Guarantees
13. Gijs Huisman (UT) Social Touch Technology - Extending the reach of social touch through haptic technology
14. Shoshannah Tekofsky (UvT) You Are Who You Play You Are: Modelling Player Traits from Video Game Behavior
15. Peter Berck (RUN) Memory-Based Text Correction
16. Aleksandr Chuklin (UVA) Understanding and Modeling Users of Modern Search Engines
17. Daniel Dimov (UL) Crowdsourced Online Dispute Resolution
18. Ridho Reinanda (UVA) Entity Associations for Search
19. Jeroen Vuurens (UT) Proximity of Terms, Texts and Semantic Vectors in Information Retrieval
20. Mohammadbashir Sedighi (TUD) Fostering Engagement in Knowledge Sharing: The Role of Perceived Benefits, Costs and Visibility
21. Jeroen Linssen (UT) Meta Matters in Interactive Storytelling and Serious Gaming (A Play on Worlds)
22. Sara Magliacane (VU) Logics for causal inference under uncertainty
23. David Graus (UVA) Entities of Interest - Discovery in Digital Traces
24. Chang Wang (TUD) Use of Affordances for Efficient Robot Learning
25. Veruska Zamborlini (VU) Knowledge Representation for Clinical Guidelines, with applications to Multimorbidity Analysis and Literature Search
26. Merel Jung (UT) Socially intelligent robots that understand and respond to human touch
27. Michiel Joosse (UT) Investigating Positioning and Gaze Behaviors of Social Robots: People's Preferences, Perceptions and Behaviors
28. John Klein (VU) Architecture Practices for Complex Contexts
29. Adel Alhuraibi (UvT) From IT-Business Strategic Alignment to Performance: A Moderated Mediation Model of Social Innovation, and Enterprise Governance of IT
30. Wilma Latuny (UvT) The Power of Facial Expressions
31. Ben Ruijl (UL) Advances in computational methods for QFT calculations
32. Thaer Samar (RUN) Access to and Retrievability of Content in Web Archives
33. Brigit van Loggem (OU) Towards a Design Rationale for Software Documentation: A Model of Computer-Mediated Activity
34. Maren Scheffel (OU) The Evaluation Framework for Learning Analytics
35. Martine de Vos (VU) Interpreting natural science spreadsheets
36. Yuanhao Guo (UL) Shape Analysis for Phenotype Characterisation from High-throughput Imaging
37. Alejandro Montes Garcia (TUE) WiBAF: A Within Browser Adaptation Framework that Enables Control over Privacy
38. Alex Kayal (TUD) Normative Social Applications
39. Sara Ahmadi (RUN) Exploiting properties of the human auditory system and compressive sensing methods to increase noise robustness in ASR
40. Altaf Hussain Abro (VUA) Steer your Mind: Computational Exploration of Human Control in Relation to Emotions, Desires and Social Support For applications in human-aware support systems
41. Adnan Manzoor (VUA) Minding a Healthy Lifestyle: An Exploration of Mental Processes and a Smart Environment to Provide Support for a Healthy Lifestyle
42. Elena Sokolova (RUN) Causal discovery from mixed and missing data with applications on ADHD datasets
43. Maaike de Boer (RUN) Semantic Mapping in Video Retrieval
44. Garm Lucassen (UU) Understanding User Stories - Computational Linguistics in Agile Requirements Engineering
45. Bas Testerink (UU) Decentralized Runtime Norm Enforcement
46. Jan Schneider (OU) Sensor-based Learning Support
47. Jie Yang (TUD) Crowd Knowledge Creation Acceleration
48. Angel Suarez (OU) Collaborative inquiry-based learning

2018

1. Han van der Aa (VUA) Comparing and Aligning Process Representations
2. Felix Mannhardt (TUE) Multi-perspective Process Mining
3. Steven Bosems (UT) Causal Models For Well-Being: Knowledge Modeling, Model-Driven Development of Context-Aware Applications, and Behavior Prediction
4. Jordan Janeiro (TUD) Flexible Coordination Support for Diagnosis Teams in Data-Centric Engineering Tasks
5. Hugo Huurdeman (UVA) Supporting the Complex Dynamics of the Information Seeking Process
6. Dan Ionita (UT) Model-Driven Information Security Risk Assessment of Socio-Technical Systems
7. Jieting Luo (UU) A formal account of opportunism in multi-agent systems
8. Rick Smetsers (RUN) Advances in Model Learning for Software Systems
9. Xu Xie (TUD) Data Assimilation in Discrete Event Simulations
10. Julienka Mollee (VUA) Moving forward: supporting physical activity behavior change through intelligent technology
11. Mahdi Sargolzaei (UVA) Enabling Framework for Service-oriented Collaborative Networks
12. Xixi Lu (TUE) Using behavioral context in process mining
13. Seyed Amin Tabatabaei (VUA) Computing a Sustainable Future
14. Bart Joosten (UVT) Detecting Social Signals with Spatiotemporal Gabor Filters
15. Naser Davarzani (UM) Biomarker discovery in heart failure
16. Jaebok Kim (UT) Automatic recognition of engagement and emotion in a group of children
17. Jianpeng Zhang (TUE) On Graph Sample Clustering
18. Henriette Nakad (UL) De Notaris en Private Rechtspraak
19. Minh Duc Pham (VUA) Emergent relational schemas for RDF
20. Manxia Liu (RUN) Time and Bayesian Networks
21. Aad Slootmaker (OUN) EMERGO: a generic platform for authoring and playing scenario-based serious games
22. Eric Fernandes de Mello Araujo (VUA) Contagious: Modeling the Spread of Behaviours, Perceptions and Emotions in Social Networks
23. Kim Schouten (EUR) Semantics-driven Aspect-Based Sentiment Analysis
24. Jered Vroon (UT) Responsive Social Positioning Behaviour for Semi-Autonomous Telepresence Robots
25. Riste Gligorov (VUA) Serious Games in Audio-Visual Collections
26. Roelof Anne Jelle de Vries (UT) Theory-Based and Tailor-Made: Motivational Messages for Behavior Change Technology
27. Maikel Leemans (TUE) Hierarchical Process Mining for Scalable Software Analysis
28. Christian Willemse (UT) Social Touch Technologies: How they feel and how they make you feel
29. Yu Gu (UVT) Emotion Recognition from Mandarin Speech
30. Wouter Beek (VU) The "K" in "semantic web" stands for "knowledge": scaling semantics to the web
2019

1. Rob van Eijk (UL) Web privacy measurement in real-time bidding systems. A graph-based approach to RTB system classification
2. Emmanuelle Beauxis Aussalet (CWI, UU) Statistics and Visualizations for Assessing Class Size Uncertainty
3. Eduardo Gonzalez Lopez de Murillas (TUE) Process Mining on Databases: Extracting Event Data from Real Life Data Sources
4. Ridho Rahmadi (RUN) Finding stable causal structures from clinical data
5. Sebastiaan van Zelst (TUE) Process Mining with Streaming Data
6. Chris Dijkshoorn (VU) Nichesourcing for Improving Access to Linked Cultural Heritage Datasets
7. Soude Fazeli (TUD) Recommender Systems in Social Learning Platforms
8. Frits de Nijs (TUD) Resource-constrained Multi-agent Markov Decision Processes
9. Fahimeh Alizadeh Moghaddam (UVA) Self-adaptation for energy efficiency in software systems
10. Qing Chuan Ye (EUR) Multi-objective Optimization Methods for Allocation and Prediction
11. Yue Zhao (TUD) Learning Analytics Technology to Understand Learner Behavioral Engagement in MOOCs
12. Jacqueline Heinerman (VU) Better Together
13. Guanliang Chen (TUD) MOOC Analytics: Learner Modeling and Content Generation
14. Daniel Davis (TUD) Large-Scale Learning Analytics: Modeling Learner Behavior & Improving Learning Outcomes in Massive Open Online Courses
15. Erwin Walraven (TUD) Planning under Uncertainty in Constrained and Partially Observable Environments
16. Guangming Li (TUE) Process Mining based on Object-Centric Behavioral Constraint (OCBC) Models
17. Ali Hürriyetoğlu (RUN) Extracting Actionable Information from Microtexts