Leveraging Semantic Annotations for Event-Focused Search & Summarization
Dissertation submitted towards the degree of Doctor of Engineering (Dr.-Ing.) of the Faculty of Mathematics and Computer Science of Saarland University

by Arunav Mishra
Saarbrücken, September 2017

Day of Colloquium: 12/03/2018
Dean of the Faculty: Univ.-Prof. Dr. Frank-Olaf Schreyer

Examination Board
Chair of the Committee: Univ.-Prof. Dr. Dietrich Klakow
First reviewer: Prof. Dr. Klaus Berberich
Second reviewer: Prof. Dr. Gerhard Weikum
Third reviewer: Prof. Dr. Claudia Hauff
Academic Assistant: Dr. Rishiraj Saha Roy

"Intelligence is not the ability to store information, but to know where to find it." - Albert Einstein

Dedicated to my wonderful teachers and loving family.

Acknowledgements

I would like to express my deepest gratitude to Klaus Berberich for giving me an opportunity to work under his guidance. This work was made possible by his unconditional support, expert scientific advice, and forward-looking vision. The most encouraging aspect of working under him was the exceptional freedom he granted me to pursue challenging problems from various fields of information science (retrieval, summarization, and spatiotemporal text mining). In addition, our shared interest in music often triggered very interesting conversations and made work even more enjoyable.

I am extremely thankful to Gerhard Weikum for supporting me throughout my Master's and Ph.D. studies. His high standards of conducting research constantly inspired and trained me to become a better researcher. I also thank the additional reviewers and examiners, Dietrich Klakow and Claudia Hauff, for providing valuable feedback for further improvement of this work.

I acknowledge that this work would not have been possible without the influence, teachings, and guidance of several people. Firstly, I am extremely grateful to Martin Theobald for first introducing me to the department group and presenting me an opportunity to work on a Master's thesis under his guidance. His teachings motivated and prepared me for my Ph.D. studies. Secondly, I offer special thanks to Avishek Anand and Sarath Kumar Kondreddi for their constant support and guidance.

The institute offered not only a first-class academic environment but also a very lively and social one, which made this journey very enjoyable. This was due to the wonderful friends and colleagues I made and met here, especially Shubhabrata, Mittul, Kai, Amy, Yusra, Kaustubh, Saurabh, Kashyap, Johannes, Luciano, Dragan, Kiril, and the list goes on. I also thank Petra Schaaf and all other staff members of the Max Planck Institute for being very supportive towards the students and shielding them from fine-grained administrative issues.

Last but not least, I am extremely grateful to my parents, Ramakanta Mishra and Saswati Mishra, and my brother Anubhav Mishra for always supporting me in all my endeavors. Finally, I believe the biggest contributors to the success of this work are my wife, Bhawna Kalra, and my son, Neer Mishra, who have always been by my side.

Abstract

In today's Big Data era, overwhelming amounts of textual information, spread across different sources with a high degree of redundancy, make it hard for a consumer to retrospect on past events. A plausible solution is to link semantically similar information contained across the different sources to enforce a structure, thereby providing multiple access paths to relevant information.
Keeping this larger goal in view, this work uses Wikipedia and online news articles as two prominent yet disparate information sources to address the following three problems:

• We address a linking problem to connect Wikipedia excerpts to news articles by casting it into an IR task. Our novel approach integrates time, geolocations, and entities with text to identify relevant documents that can be linked to a given excerpt (see the illustrative sketch following this abstract).

• We address an unsupervised extractive multi-document summarization task to generate a fixed-length event digest that facilitates efficient consumption of information contained within a large set of documents. Our novel approach proposes an integer linear program (ILP) for global inference across text, time, geolocations, and entities associated with the event.

• To estimate the temporal focus of short event descriptions, we present a semi-supervised approach that leverages redundancy within a longitudinal news collection to estimate accurate probabilistic time models.

Extensive experimental evaluations demonstrate the effectiveness and viability of our proposed approaches towards achieving the larger goal.
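To make the first contribution concrete, the following is a minimal sketch of the general idea of combining per-dimension query likelihoods by linear interpolation. It assumes a simplified representation; the dimension names, interpolation weights, and probability floor here are hypothetical illustrations, not the exact models estimated in Chapter 3.

```python
import math

# Hypothetical representation: doc[dim] maps an annotation (a term, a time
# unit, a geographic cell, or an entity id) to a smoothed P(annotation | doc).
def score(query, doc, weights):
    """Rank doc for query by a weighted sum of per-dimension log-likelihoods."""
    total = 0.0
    for dim, w in weights.items():
        for annotation in query[dim]:
            p = doc[dim].get(annotation, 1e-9)  # floor stands in for smoothing
            total += w * math.log(p)
    return total

# Illustrative usage with made-up probabilities and interpolation weights.
doc = {"text": {"earthquake": 0.02, "tsunami": 0.01},
       "time": {"2011-03": 0.30},
       "geo": {"cell_japan": 0.25},
       "entity": {"Tohoku": 0.10}}
query = {"text": ["earthquake"], "time": ["2011-03"],
         "geo": ["cell_japan"], "entity": ["Tohoku"]}
weights = {"text": 0.4, "time": 0.2, "geo": 0.2, "entity": 0.2}
print(score(query, doc, weights))
```

Documents whose annotations along all four dimensions match the query score highest; the weights control how much each dimension contributes to the ranking.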
Kurzfassung

In today's Big Data era, overwhelming amounts of textual information exist, distributed across multiple sources and exhibiting a high degree of redundancy. Under these circumstances, it is hard for consumers to retrospect on past events. A plausible solution is to link semantically similar information that is spread across multiple sources, thereby enforcing a structure that offers multiple access paths to relevant information. Against this background, this dissertation uses Wikipedia and online news as two prominent yet fundamentally different information sources to address the following three problems:

• We address a linking problem to connect Wikipedia excerpts with news articles by casting it into an information retrieval task. Our novel approach integrates temporal and geographic references as well as entities with text in order to identify relevant documents that can be linked to a given excerpt.

• We address an unsupervised extractive method for the automatic summarization of texts from multiple documents in order to generate fixed-length event summaries, which enables efficient consumption of information from large document collections. Our novel approach proposes an integer linear programming solution that draws global inferences across text, time, geolocations, and entities associated with the event.

• To estimate the temporal focus of short event descriptions, we present a semi-supervised approach that exploits the redundancy within a longitudinal document collection to estimate accurate probabilistic time models.

Extensive experimental evaluations demonstrate the effectiveness and viability of our proposed approaches towards achieving the larger goal.

Contents

1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Publications
1.4 Outline

2 Foundations & Background
2.1 Information Retrieval
2.1.1 Statistical Language Models
2.1.2 Query Likelihood Retrieval Model
2.1.3 Kullback-Leibler Divergence Retrieval Model
2.1.4 Estimation of Query Models
2.1.5 Document Models and General Smoothing Methods
2.2 Specialized Information Retrieval
2.2.1 Temporal Information Retrieval
2.2.2 Geographic Information Retrieval
2.2.3 Entity-based Information Retrieval
2.3 Text Summarization
2.4 Extractive Multi-Document Summarization
2.4.1 Global Inference Problem
2.4.2 Maximal Marginal Relevance
2.4.3 Greedy Approach
2.4.4 Dynamic Programming Approach
2.4.5 Integer Linear Programming-based Approaches
2.5 Summary

3 Connecting Wikipedia Events to News Articles
3.1 Motivation & Problem Statement
3.2 Related Work
3.3 Leveraging Time + Text
3.3.1 Notation & Representations
3.3.2 Time-Aware Language Models
3.3.3 Experimental Evaluation
3.4 EXPOSÉ: Exploring Past News for Seminal Events
3.4.1 Algorithmic Building Blocks
3.4.2 Demonstration Setup
3.4.3 Exploratory Interface
3.4.4 Demonstration Scenario
3.5 Leveraging Text + Time + Geolocation + Entity
3.5.1 Notation & Representations
3.5.2 IR Framework
3.5.3 Query Models
3.5.4 Document Models
3.5.5 Experimental Evaluation
3.6 Summary & Future Directions

4 Summarizing Wikipedia Events with News Excerpts
4.1 Motivation & Problem Statement
4.2 Related Work
4.3 Event Digest Generation
4.3.1 Notation & Representations
4.3.2 Event-Digest Generation Framework
4.3.3 Query and Excerpt Models
4.3.4 ILP Formulation
4.3.5 Experimental Evaluation
4.3.6 Results & Analysis
4.4 Design of Test Collection for Coherence Evaluation
4.4.1 Crowdflower Task Design
4.4.2 Experimental Setup
4.4.3 Experiment 1: Impact of Sentence Order
4.4.4 Experiment 2: Impact of Sentence Proximity
4.4.5 Experiment 3: Feasibility of using Crowdflower
4.5 Summary & Future Directions

5 Estimating Time Models for Short News Excerpts
5.1 Motivation & Problem Statement
5.2 Related Work
5.3 Notation & Representations
5.4 Temporal Scoping Framework
5.4.1 Excerpt Models
5.4.2 Inter-Excerpt Relations
5.4.3 Distribution Propagation
5.5 Experimental Evaluation
5.5.1 Setup
5.5.2 Goals, Measures, and Methodology
5.5.3 Results & Analysis
5.6 Estimating Event Focus Time
5.7 Summary & Future Directions

6 Conclusion
6.1 Outlook

List of Figures
List of Tables
Bibliography

Appendix A Query Workloads
A.1 Wikipedia Year Page Events
A.2 Wikipedia Excerpts to New York Times