Thesis Has Been Carried out Under the Auspices of SIKS, the Dutch Research School for Information and Knowledge Systems
Total Page:16
File Type:pdf, Size:1020Kb
UvA-DARE (Digital Academic Repository) Entities of interest Discovery in digital traces Graus, D.P. Publication date 2017 Document Version Final published version License Other Link to publication Citation for published version (APA): Graus, D. P. (2017). Entities of interest: Discovery in digital traces. General rights It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons). Disclaimer/Complaints regulations If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library: https://uba.uva.nl/en/contact, or a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible. UvA-DARE is a service provided by the library of the University of Amsterdam (https://dare.uva.nl) Download date:04 Oct 2021 TRACES DAVID GRAUS DIGITAL IN DISCOVERY ENTITIES OF INTEREST DISCOVERY IN DIGITAL TRACES DAVID GRAUS Entities of Interest Discovery in Digital Traces David Graus Entities of Interest Discovery in Digital Traces ACADEMISCH PROEFSCHRIFT ter verkrijging van de graad van doctor aan de Universiteit van Amsterdam op gezag van de Rector Magnificus prof. dr. ir. K.I.J. Maex ten overstaan van een door het College voor Promoties ingestelde commissie, in het openbaar te verdedigen in de Agnietenkapel op vrijdag 16 juni 2017, te 10:00 uur door David Paul Graus geboren te Beuningen Promotiecommissie Promotor: prof. dr. M. de Rijke Universiteit van Amsterdam Co-promotor: dr. E. Tsagkias 904Labs Overige leden: prof. dr. A.P.J. van den Bosch Radboud Universiteit Nijmegen prof. dr. ing. Z.J.M.H. Geradts Universiteit van Amsterdam dr. ir. J. Kamps Universiteit van Amsterdam dr. E. Kanoulas Universiteit van Amsterdam prof. dr. D.W. Oard University of Maryland Faculteit der Natuurwetenschappen, Wiskunde en Informatica The research was supported by the Netherlands Organiza- tion for Scientific Research (NWO) under project number 727.011.005. SIKS Dissertation Series No. 2017-23 The research reported in this thesis has been carried out under the auspices of SIKS, the Dutch Research School for Information and Knowledge Systems. The printing of this thesis was financially supported by the Co van Ledden Hulsebosch Center, Amsterdam Center for Forensic Science and Medicine. Copyright © 2017 David Graus, Amsterdam, The Netherlands Cover by Rutger de Vries/perongeluk.com Printed by Off Page, Amsterdam ISBN: 978-94-6182-800-2 Acknowledgements I started the process that culminated in the little booklet you are now reading in 2012. At that time, I had never touched a unix, I didn’t know what an ssh was, and I hardly knew how to write a line of code. I hate to be overly dramatic,1 but I have grown, not only as a researcher, also as a person, and this growth is not exclusively to be attributed to the passing of time. Over the course of my PhD I have become more confident, about myself and my skills, and I have learned I can pick up anything I put my mind (and time) to. The latter is by far the greatest life’s lesson my PhD has brought me. For all of this, I have to thank some people. First, P&C, for setting up the prior by raising me in a home where I naturally got exposed to computers, technology, and (overly) critical thinking. P, for putting me and Mark in front of a Sinclair ZX81 at young age, traumatizing us with the scariest ASCII-rendered games imaginable. It shaped me into the tough trooper I am today. C, for putting up with a grumpy household when our attempts at assembling computers from parts scrambled from all over the place invariably failed. Next, Maarten, for taking a “chance” with me. I was quite wet behind the ears, as one would say, before I started my PhD. I am glad and thankful that this did not hold Maarten back for giving me the opportunity to see how I would pan out as a computer scientist. I was a coin-flip. Thanks for flipping! Next next, ILPS. First, the senior population at ILPS. In particular, Edgar, who at the time understood better than me what I was doing, and who saw that I could do and might enjoy this PhD. Manos and Wouter, who guided, supervised, co-authored, and co-lived this period with me. My peers at ILPS, of which I will name a few because these are my acknowledgements and I get to decide whether I single people out or not: Daan, Anne, Tom, and Marlies. I enjoyed interfering with your day-to-day activities, sharing misery, disagreeing (hi Anne), coffee breaks, bootcamps and runs, and everything else. Then, my co-authors, for partaking in that thing we do in science, which conveniently provided the basis of this dissertation. Paul, Ryen, and Eric for mentoring me during my internship at Microsoft Research in the summer of 2015. Best summer ever. Finally, my brothers from other mothers, Rutger and Marijn, for putting up with me being right all the time, slightly cocky, and (at times) tending towards the far-right end of the spectrum of self-confidence. I’m sorry. Moving on to the nonliving things: Polder, Oerknal, Maslow, and Joost (in order of appearance) for providing shelter for unwinding, escaping, brainstorming, philosophizing and reflecting on life, academia, and beyond. Oh, and also for drinking beers. Twitter, for giving me a place where I could write down stuff when Christophe would stop listening to the monologues I typically directed at my computer screen (but were intended for all to hear). Spotify and Bose, for sheltering me from the real world during extended periods of (coding) cocooning. Okay, this is not an exhaustive list, but I have to both start and stop somewhere. Before we part ways and move onto the researchy stuff, I will leave you with one big fat cliche—which´ I know loses some of its expressive power by virtue of being a cliche´ (but which I hope to counteract a bit by being explicit about my awareness of it being a cliche):´ It was a great journey! David 1I really don’t. Contents 1 Introduction 1 1.1 Research Outline and Questions . 3 1.2 Main Contributions . 9 1.3 Thesis Overview . 10 1.4 Origins . 11 2 Background 13 2.1 Information Retrieval . 13 2.1.1 Retrieval Models . 14 2.1.2 Evaluation . 16 2.2 Semantic Search . 17 2.2.1 Knowledge Bases and Entities . 17 2.3 Emerging Entities . 20 2.3.1 Analyzing Emerging Entities . 21 2.3.2 Predicting Emerging Entities . 21 2.3.3 Constructing Entity Representations for Effective Retrieval . 22 2.4 Analyzing and Predicting Activity . 24 2.4.1 Mining Email . 24 2.4.2 User Interaction Logs . 25 2.5 What’s next . 27 I Analyzing, Predicting, and Retrieving Emerging Entities 29 3 Analyzing Emerging Entities 31 3.1 Introduction . 31 3.2 Research Questions . 33 3.3 Data . 34 3.3.1 Emerging Entities in News and Social Media . 35 3.3.2 Time Series of Emerging Entity Mentions . 36 3.4 Methods . 37 3.4.1 Time Series Clustering . 38 3.4.2 Time Series Analysis . 40 3.5 Results . 41 3.5.1 Emergence Patterns . 41 3.5.2 Entity Emergence Patterns in Social Media and News . 49 3.5.3 Emergence Patterns of Different Entity Types . 53 3.6 Conclusion . 57 3.6.1 Findings . 58 3.6.2 Implications . 59 3.6.3 Limitations . 60 3.6.4 What’s Next? . 61 v CONTENTS 4 Predicting Emerging Entities 63 4.1 Introduction . 63 4.2 Approach . 65 4.3 Unsupervised Generation of Pseudo-ground Truth . 65 4.4 Sampling Pseudo-ground Truth . 66 4.4.1 Sampling Based on Entity Linker’s Confidence Score . 66 4.4.2 Sampling Based on Textual Quality . 67 4.5 Experimental Setup . 68 4.5.1 Dataset . 69 4.5.2 Experiment I: Sampling Pseudo-ground Truth . 69 4.5.3 Experiment II: Prediction Experiments . 69 4.5.4 Evaluation . 69 4.6 Results . 70 4.6.1 Experiment I: Sampling Pseudo-ground Truth . 70 4.6.2 Experiment II: Impact of Prior Knowledge . 72 4.7 Conclusion . 74 5 Retrieving Emerging Entities 77 5.1 Introduction . 77 5.2 Dynamic Collective Entity Representations . 80 5.2.1 Problem Statement . 80 5.2.2 Approach . 80 5.2.3 Description Sources . 81 5.2.4 Adaptive Entity Ranking . 84 5.3 Model . 84 5.3.1 Entity Representation . 84 5.3.2 Features . 85 5.3.3 Machine Learning . 86 5.4 Experimental setup . 87 5.4.1 Experiments . 87 5.4.2 Baselines . 89 5.4.3 Data Alignment . 89 5.4.4 Machine Learning . 90 5.4.5 Evaluation . 90 5.5 Results and Analysis . 91 5.5.1 Dynamic Collective Entity Representations . 91 5.5.2 Modeling Field Importance . 94 5.5.3 Ranker Adaptivity . 95 5.6 Conclusion . 97 II Analyzing and Predicting Activity from Digital Traces 101 6 Analyzing and Predicting Email Communication 103 6.1 Introduction . 103 6.2 Communication Graph . 104 vi CONTENTS 6.3 Modeling . 105 6.3.1 Email Likelihood . 105 6.3.2 Sender Likelihood .