Where Are You Talking About? Advances and Challenges of Geographic Analysis of Text with Application to Disease Monitoring
Milan Gritta
Supervisor: Dr Nigel Collier
Department of Theoretical and Applied Linguistics
University of Cambridge

This dissertation is submitted for the degree of Doctor of Philosophy

Fitzwilliam College
February 2019

Declaration

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except as declared in the Preface and specified in the text. It is not substantially the same as any that I have submitted, or, is being concurrently submitted for a degree or diploma or other qualification at the University of Cambridge or any other University or similar institution except as declared in the Preface and specified in the text. I further state that no substantial part of my dissertation has already been submitted, or, is being concurrently submitted for any such degree, diploma or other qualification at the University of Cambridge or any other University or similar institution except as declared in the Preface and specified in the text. It does not exceed the prescribed word limit for the relevant Degree Committee.

Milan Gritta
February 2019

Abstract

The Natural Language Processing task we focus on in this thesis is Geoparsing. Geoparsing is the process of extraction and grounding of toponyms (place names). Consider this sentence: "The victims of the Spanish earthquake off the coast of Malaga were of American and Mexican origin." Four toponyms will be extracted (called Geotagging) and grounded to their geographic coordinates (called Toponym Resolution).
However, our research goes further than any previous work by showing how to distinguish the literal place(s) of the event (Spain, Malaga) from other linguistic types/uses such as nationalities (Mexican, American), improving downstream task accuracy. We consolidate and extend the Standard Evaluation Framework, discuss key research problems, then present concrete solutions in order to advance each stage of geoparsing. For geotagging, as well as training a SOTA neural Location-NER tagger, we simplify Metonymy Resolution with a novel minimalist feature extraction combined with an LSTM-based classifier, matching SOTA results. For toponym resolution, we deploy the latest deep learning methods to achieve SOTA performance by augmenting neural models with hitherto unused geographic features called Map Vectors. With each research project, we provide high-quality datasets and system prototypes, further building resources in this field. We then show how these geoparsing advances, coupled with our proposed Intra-Document Analysis, can be used to associate news articles with locations in order to monitor the spread of public health threats. To this end, we evaluate our research contributions with production data from a real-time downstream application to improve geolocation of news events for disease monitoring. The data was made available to us by the Joint Research Centre (JRC), which operates one such system called MediSys that processes incoming news articles in order to monitor threats to public health and make these available to a variety of governmental, business and non-profit organisations. We also discuss steps towards an end-to-end, automated news monitoring system and make actionable recommendations for future work.

In summary, the thesis aims are twofold:
(1) Generate original geoparsing research aimed at advancing each stage of the pipeline by addressing pertinent challenges with concrete solutions and actionable proposals.
(2) Demonstrate how this research can be applied to news event monitoring to increase the efficacy of existing biosurveillance systems, e.g. the European Commission’s MediSys.

I would like to dedicate this thesis to my loving parents who worked extraordinarily hard to give me this opportunity. Thank You.

Acknowledgements

I would like to acknowledge the generous funding support of the Natural Environment Research Council (NERC) via the interdisciplinary Centre of Doctoral Training (CDT) called Data, Risk, Environmental and Analytical Methods (DREAM [1]) under the leadership of Dr Stephen Hallett and Co. of Cranfield University and, more locally, Dr Mike Bithell of Cambridge University. The CDT is a great model for PhD students, broadening our horizons, offering excellent support and developing non-academic skills through diverse experiences. Naturally, I want to recognise Dr Nigel Collier for his dedicated supervision throughout the 3-4 years, for co-authoring my papers and for having confidence in me as his next PhD student back in 2015. I also want to thank Dr Anna Korhonen for her dedicated (second) supervision and support throughout the PhD process and for expertly examining the thesis (internal). I further want to thank Dr Mark Stevenson [2] from Sheffield University for agreeing to examine my thesis (external) and for providing excellent feedback. I want to acknowledge my fellow authors Dr Nut Limsopatham (2015 - 2017) and Dr Mohammad Taher Pilehvar (2015 - 2019) for their continued advisory support and feedback during manuscript and thesis publication.
I extend my gratitude to our industrial partner Dr Jens Linge from the Joint Research Centre of the European Commission for his feedback, collaboration and for hosting me at the JRC in Ispra, Italy in February 2018; the Fitzwilliam College Graduate Officer Sue Free for her support, particularly during my 2018 intermission due to family issues; Savina Velotta and Angela Colclough from DREAM CDT for ensuring the events felt homely and welcoming, enabling us to focus on learning. Last but not least, I thank Cambridge University students Mina Frost and Qianchu (Flora) Liu for providing linguistic expertise; the NVIDIA GPU Grant Program for the free graphics card; my fellow LTL students for their company, support and advice. Finally, Thank You to all who helped out in any way, big or small, whom I somehow forgot to acknowledge during my thesis submission and new job busyness: you know who you are, and you deserve it. Thank You Very Much!

[1] http://www.dream-cdt.ac.uk/
[2] https://www.sheffield.ac.uk/dcs/people/academic/mstevenson

Table of contents

List of figures
List of tables
1 Introduction
  1.1 Preamble
    1.1.1 Thesis Format
    1.1.2 Summary of Contributions
    1.1.3 Published Papers
  1.2 Research Partnerships
    1.2.1 DREAM CDT
    1.2.2 Joint Research Centre
    1.2.3 EMM and MediSys
  1.3 Related Work in Biosurveillance
    1.3.1 Summary
2 Geoparsing Landscape
  2.1 Introduction
  2.2 What’s Missing in Geoparsing?
    2.2.1 Introduction to the Survey
    2.2.2 Related Work
    2.2.3 Featured Systems
    2.2.4 Evaluation Corpora
    2.2.5 Evaluation Metrics
    2.2.6 Results and Analysis
    2.2.7 Discussion and Future Work
    2.2.8 Closing Remarks
  2.3 Summary
3 Handling Metonymy in Geoparsing
  3.1 Introduction
  3.2 Minimalist Metonymy Resolution
    3.2.1 Related Work
    3.2.2 The ReLocaR Dataset and its Improved Annotation
    3.2.3 Predicate Window (PreWin)
    3.2.4 Neural Network Architecture
    3.2.5 Results and Error Analysis
    3.2.6 Discussion
    3.2.7 Conclusions
  3.3 Summary
4 Toponym Resolution
  4.1 Introduction
  4.2 Geocoding with Map Vector
    4.2.1 Related Work
    4.2.2 Neural Network Geocoder
    4.2.3 The Map Vector (MapVec)
    4.2.4 Evaluation Metrics
    4.2.5 The GeoVirus Dataset
    4.2.6 New SOTA Results
    4.2.7 Discussion and Error Analysis
    4.2.8 Conclusions
  4.3 Summary
5 What’s Next for Geoparsing?
  5.1 Introduction
  5.2 A Pragmatic Guide to Geoparsing Evaluation
    5.2.1 Current Challenges
    5.2.2 Summary of the most salient findings
    5.2.3 Related Work
    5.2.4 Geographic datasets and the pragmatics of toponyms
  5.3 A Pragmatic Taxonomy of Toponyms
    5.3.1 Non-Toponyms
    5.3.2 Literal Toponyms
    5.3.3 Associative Toponyms
  5.4 Standard Evaluation Metrics
    5.4.1 Geotagging Metrics
    5.4.2 Toponym Resolution Metrics
    5.4.3 Recommended Metrics for Toponym Resolution
    5.4.4 Important Considerations for Evaluation
    5.4.5 Unsuitable Datasets
    5.4.6 Recommended Datasets
  5.5 GeoWebNews
    5.5.1 Annotation Procedure and Inter-Annotator Agreement (IAA)
    5.5.2 Annotation of Coordinates
    5.5.3 Evaluation
    5.5.4 Conclusion and Future Work
    5.5.5 Closing Thoughts
6 Geoparsing for Disease Monitoring
  6.1 Introduction
  6.2 Geoparsing Evaluation
    6.2.1 MediSys News Stream
    6.2.2 MediSys Geotagging
    6.2.3 MediSys Toponym Resolution
  6.3 FlexiGraph
  6.4 Geolocation System Proposals
    6.4.1 Literal and Associative Types
  6.5 Summary
7 Conclusion
  7.1 Synopsis
  7.2 Discussion: Towards Robust Biosurveillance
    7.2.1 Disease Monitoring, NLP and System Risks
  7.3 Future Work
  7.4 Summary
References
Appendix A Supplementary Materials
  A.1 Quoted News Text

List of figures

2.1 The geoparsing pipeline comprises two main stages: geotagging and geocoding. Geotagging retrieves only literal toponyms (ideally filtering metonymic occurrences) from text and generates candidate coordinates for each. Geocoding then leverages the surrounding context to choose the correct coordinate and link the mention to an entry in a geographical knowledge base.