A University of Sussex PhD thesis, available online via Sussex Research Online: http://sro.sussex.ac.uk/

This thesis is protected by copyright which belongs to the author. This thesis cannot be reproduced or quoted extensively from without first obtaining permission in writing from the author. The content must not be changed in any way or sold commercially in any format or medium without the formal permission of the author. When referring to this work, full bibliographic details including the author, title, awarding institution and date of the thesis must be given. Please visit Sussex Research Online for more information and further details.

CHARACTERISING SEMANTICALLY COHERENT CLASSES OF TEXT THROUGH FEATURE DISCOVERY

Andrew David Robertson

Submitted for the degree of Doctor of Philosophy
Department of Informatics
University of Sussex
May 2018

Supervisor: David Weir

DECLARATION

I hereby declare that this thesis has not been and will not be submitted in whole or in part to another University for the award of any other degree.

Signature: Andrew David Robertson

ABSTRACT

There is a growing need to provide support for social scientists and humanities scholars to gather and “engage” with very large datasets of free text, to perform very bespoke analyses. Method52 is a text analysis platform built for this purpose (Wibberley et al., 2014), and forms a foundation that this thesis builds upon. A central part of Method52 and its methodologies is a classifier training component based on DUALIST (Settles, 2011), and the general process of data engagement with Method52 is determined to constitute a continuous cycle of characterising semantically coherent sub-collections, or classes, of the text.

Two broad methodologies exist for supporting this type of engagement process: (1) a top-down approach wherein concepts and their relationships are explicitly modelled for reasoning, and (2) a more surface-level, bottom-up approach, which entails the use of key terms (surface features) to characterise data. Following the second of these approaches, this thesis examines ways of better supporting this type of data engagement, to more effectively meet the needs of social scientists and humanities scholars engaging with text data.

The classifier component provides an active learning training environment emphasising the labelling of individual features. However, it can be difficult to interpret and incorporate prior knowledge of features, the process of feature discovery based on the current classifier model does not always produce useful results, and understanding the data well enough to produce successful classifiers is time-consuming. A new method for discovering features in a corpus is introduced, and feature discovery methods are explored to resolve these issues.

When collecting social media data, documents are often obtained by querying an API with a set of key phrases, so the set of possible classes characterising the data is defined by these basic surface features. It is difficult to know exactly which terms must be searched for, and the usefulness of terms can change over time as new discussions and vocabulary emerge. Building on the feature discovery techniques, a framework is presented in this thesis for streaming data with an automatically adapting query to deal with these issues.
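As a purely illustrative aside on the last point of the abstract, the minimal sketch below shows one way a keyword query could be grown as new vocabulary emerges in the collected data. It is not Method52 code: the function adapt_query, the batch structure, and the crude rate-ratio score are hypothetical stand-ins for a real streaming client and for the surprisingly frequent phrase detection method developed later in the thesis.

from collections import Counter

def tokens(text):
    return text.lower().split()

def adapt_query(query_terms, stream_batches, background_counts, top_n=5):
    """Illustrative adaptive query loop: after each batch of collected posts,
    promote terms that are markedly over-represented relative to a background
    word distribution, and add them to the query."""
    query = set(query_terms)
    bg_total = sum(background_counts.values()) or 1
    for batch in stream_batches:
        # Keep only posts that the current query would have matched.
        matched = [p for p in batch if any(t in tokens(p) for t in query)]
        counts = Counter(w for p in matched for w in tokens(p))
        total = sum(counts.values()) or 1
        # Crude "surprisingness": in-collection rate vs. background rate.
        score = {w: (c / total) / (background_counts.get(w, 1) / bg_total)
                 for w, c in counts.items() if w not in query}
        query.update(sorted(score, key=score.get, reverse=True)[:top_n])
    return query

# Toy usage: two batches of posts and a tiny background distribution.
batches = [
    ["brexit vote tonight", "brexit deal talks stall"],
    ["article 50 extension debated", "brexit article 50 delay"],
]
background = Counter("the of and to a in vote deal talks".split() * 100)
print(adapt_query({"brexit"}, batches, background))

The real framework ranks candidate terms with the SFPD measure of Chapter 3 and manages term relevance and expiration as described in Chapter 5; the sketch only conveys the overall shape of an adapting query.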
ACKNOWLEDGEMENTS

I am incredibly grateful for the support and guidance of my supervisor David Weir and of Jeremy Reffin, both of whom I admire and who have taught me a great deal; their advice was invaluable throughout my research.

This work was funded by the Text Analytics Group (TAG) at Sussex University. I thank TAG for providing me with this opportunity and an excellent place to work. I am lucky to have worked with such friendly, helpful, and clever people as the members of TAG. Thanks especially to Simon Wibberley, Jack Pay, Thomas Kober, Julie Weeds, Miroslav Batchkarov, and Matti Lyra for years of collaboration and discussion.

I would like to thank the users of the Method52 platform, whose experiences were vital to this thesis. I particularly appreciate the Demos side of the Centre for the Analysis of Social Media (CASM) for being not only users but also teachers of new users, and in particular Josh Smith, who diligently tested new features and collected user feedback for this research.

I appreciate immensely my parents Kim & David Robertson, sister Mandy, and fiancée Mary Browning for their years of support and confidence in everything I do. Writing this thesis was made significantly easier and more pleasant through Mary’s encouragement and support each day.

CONTENTS

1 Introduction 1
1.1 Important Features of a Corpus 5
1.2 Method52 7
1.2.1 Component Contributions 12
1.3 Method52 & Similar Tools 13
1.4 Bespoke Classification & Active Learning 14
1.5 Feature Discovery with Active Learning 18
1.6 Adaptive Streaming 20
2 Data Engagement as Class Discovery 22
2.1 Engagement in Two Mindsets 24
2.2 Engaging with Textual Datasets: An Example 26
2.3 The Search & Explore Cycle 30
2.4 Conclusions 33
3 Surprisingly Frequent Features 36
3.1 Related Work 38
3.2 Evaluation Strategy 44
3.3 Measure of Surprisingness 49
3.4 Extending to Phrases 52
3.4.1 Counting Phrases 54
3.4.2 Pruning Phrases 57
3.4.3 Ranking Phrases 58
3.5 Tweet Analysis 61
3.6 Single Large Document Analysis 63
3.7 Establishing the Expectation 66
3.8 Using Existing Method52 Pipeline Architecture 69
3.9 SFPD Comparison with Chi-square and Log-likelihood Ratio 74
3.10 Corpus Homogeneity 81
3.11 Chapter Summary 82
4 Feature Discovery with Active Learning 84
4.1 Classification in Method52 86
4.1.1 Analysis by Classifier 86
4.1.2 Classifier Training Procedure 91
4.1.3 Classifier Model 95
4.1.4 Active Learning Queries 97
4.1.5 Methodologies for Applying Classifiers to Twitter Corpora 98
4.2 Dataset: Brussels Bombings 99
4.3 Dataset: Father’s Day 110
4.4 Exploration Using Surprisingly Frequent Features 112
4.4.1 SFPD for Corpus Insights 114
4.4.2 SFPD for Classification Insights 119
4.4.3 Domain Adaptation 124
4.4.4 Irrelevance Filtering 127
4.4.5 Vocabulary Overlap 128
4.4.6 Using Discovered Phrases in the Feature Model 133
4.5 Original Feature Contexts 133
4.6 Encoding Context with Syntactic Ngrams 136
4.6.1 Related Work 138
4.6.2 Approach 143
4.6.3 Adapting Dependency Parsing to Twitter 145
4.6.4 Impact on Classifier Training Process 150
4.7 Weighting Prior Knowledge 157
4.7.1 Frequency-based Weighting 159
4.7.2 Indicativeness Weighting 163
4.8 Chapter Summary 164
5 Adaptive Twitter Streaming 166
5.1 Related Work 172
5.2 Dataset: Disrupting Daesh 182
5.3 Adaptive Streaming with SFPD 183
5.4 Topic Structure 190
5.5 Near-duplicate Tweets 194
5.6 Maintaining Precision Through Relevance Assessment 196
5.6.1 Blacklisting Irrelevant Terms 198
5.6.2 Qualifying Relevant Terms & Data Collection 199
5.6.3 Classifier Scope Problem 200
5.6.4 Classifier Drift Problem 201
5.7 Selective Target & Relevance 202
5.8 Query Term Expiration 204
5.9 Domain Adaptation 206
5.9.1 Continuous Sample Integration 208
5.9.2 Noise Reduction with a priori Topic Knowledge 209
5.9.3 Topic Specificity 210
5.10 Evaluation on Brexit 210
5.11 Chapter Summary 222
6 Conclusions 225
6.1 Future Work 228
6.2 Limitations 233
Acronyms 234
Bibliography 235

1 INTRODUCTION

Social scientists and humanities scholars increasingly wish to gather and “engage” with large text datasets, to perform highly bespoke analyses. The size of these datasets makes them challenging to study; analysts encounter difficulty finding and isolating the data they are interested in. This necessitates the support of automated tools.

The general objective when analysing these large datasets of unstructured text is to infer some structure that permits downstream (later in the analysis) quantitative analysis of the previously unstructured data. In order to attain this goal, the analyst must first gather the appropriate data, then “code up” the documents according to some structure. Because the insights sought are very specific to the individual researcher and dataset, the analysis must be bespoke, and once complete it generally cannot be directly re-applied to another scenario. This runs counter to what is often the expectation in NLP research, wherein a particular task (such as sentiment analysis) is defined, and the techniques for approaching the task are iterated on over time but remain applicable to the same task, over and over again. This thesis is slightly unusual, then, in that it is not a typical piece of this kind of NLP research, but rather a study of how to better support the above type of bespoke analysis. Nor is it a social science thesis, since it focuses on the technology for supporting scholars.

In these bespoke analyses, the nature of the data gathering stage depends on the type of study being undertaken. The researcher may wish to analyse free text responses from questionnaires, or records or other data from a public body such as a police department or health organisation. In these cases, the gathering of data is usually taken care of by the organisation in question. The analyst may alternatively be interested in some section of a historical dataset, such as the transcripts of Old Bailey trials. Here, in order to gather data, the analyst will likely need to determine which documents are actually relevant. This may first be a filter on metadata, such as the date range in which a trial occurs, or the role of the speaker. But the documents may be further filtered by whether they contain certain key phrases, or by whether they are classified as relevant by a classifier trained to model the notion of relevance required for the study.
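To make that gathering step concrete, the short sketch below (not taken from the thesis; the record fields, date range, and the helpers in_scope and mentions_any are illustrative assumptions) filters a small collection of trial records first on metadata and then on key phrases. A trained relevance classifier, as described above, would typically replace or follow the key-phrase step.

from datetime import date

# Hypothetical document records: metadata plus free text.
documents = [
    {"date": date(1788, 5, 2), "role": "witness",
     "text": "I saw the prisoner take the silver watch from the counter."},
    {"date": date(1845, 1, 17), "role": "judge",
     "text": "The court finds insufficient evidence of forgery."},
]

def in_scope(doc, start, end, roles):
    """Metadata filter: keep documents within a date range and set of speaker roles."""
    return start <= doc["date"] <= end and doc["role"] in roles

def mentions_any(doc, phrases):
    """Surface-feature filter: keep documents containing any key phrase."""
    text = doc["text"].lower()
    return any(p in text for p in phrases)

relevant = [d for d in documents
            if in_scope(d, date(1780, 1, 1), date(1800, 12, 31), {"witness"})
            and mentions_any(d, ["silver watch", "stolen"])]

print(len(relevant))  # 1: only the 1788 witness statement passes both filters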