Multi-Criteria Decision Support Queries in Exploratory & Open World Settings

Michael Lind Mortensen

PhD Dissertation

Department of Computer Science
Aarhus University
Denmark

A Dissertation Presented to the Faculty of Science and Technology of Aarhus University in Partial Fulfillment of the Requirements for the PhD Degree

by Michael Lind Mortensen
April 4, 2016

Abstract

Throughout the past decade, data sources have grown significantly in size, availability, richness, complexity and dynamics. This data surplus is not only enabling new businesses, scientific achievements and economic growth; it can also enable ordinary people to make better real-world decisions if provided with the right tools. The class of multi-criteria decision support queries is said to be one such set of tools, with skyline and top-k queries being the main representatives. Over the past decades, skyline and top-k queries have been studied extensively, but due to a number of usability and trust issues, they have yet to enjoy wide adoption in either practical scientific or industrial applications. Simply put, the theoretical gain and intent of these tools do not match the reality of how users make decisions.

In this thesis, we take a step forward in bridging the gap between the theory and intent of multi-criteria decision support queries and how users actually analyze their options and make decisions in real life. The thesis is separated into two parts.

In the first part, we investigate the use of skyline queries for exploratory search, in which users pose a string of related queries, exploring the options available to them. While this is a common usage pattern in real applications, utilizing skyline queries in such an interactive scenario is non-trivial.
Specifically, we study the effects of exploratory search on the usability of skyline queries, and introduce caching-based methods for their efficient computation in those settings. We also present a method for the targeted sampling of k-representative skyline points, enabling a fixed-size, diverse and relevant overview of all options.

In the second part, we investigate the expansion of multi-criteria decision support queries into an open-world paradigm, where both unknown data and latent attributes are expected to exist. Specifically, we investigate the impact of open-world data on the conventional skyline paradigm and suggest a new probabilistic open-world paradigm for future research. We also study general open-world adaptation by introducing a multi-criteria filtering method capable of automatically filtering documents with latent attributes. Finally, we present an interdisciplinary work in Computer Science and Medicine, evaluating our filtering method for real systematic reviews in Evidence-Based Medicine.

Resumé

Over the past decade, data sources have become considerably larger, more available, richer, more complex and more dynamic. This data surplus not only supports new businesses, research results and economic growth; it also enables ordinary people to make better decisions in real life, given the right tools. The class of multi-criteria decision support queries is cited as exactly such a set of tools, with skyline and top-k queries as its main representatives. Over recent decades, skyline and top-k queries have been researched extensively, but owing to a number of problems with usability and trustworthiness, they have not yet gained a foothold in either practical research or industry. In short, the purpose and theoretical gain of these tools do not match the way users make decisions in reality.

In this thesis, we take a step closer to bridging the theory and purpose behind multi-criteria decision support queries and the way users actually evaluate their options and make decisions. The thesis is in two parts.

The first part examines the use of skyline queries for exploratory search, where users pose a series of related queries while exploring their options. Although this is a common usage pattern in real systems, using skyline queries in such an interactive setting is not trivial. Specifically, we examine the effect of exploratory search on the usability of skylines and introduce a caching-based method for the efficient computation of queries. We also present a method for the targeted selection of k representative skyline points, thereby supporting a diverse and relevant overview of all options.

In the second part, we examine the extension of multi-criteria decision support queries to an open-world paradigm, where both unknown data and latent attributes are expected to exist. Specifically, we examine the impact of an open-world paradigm on the conventional skyline and propose a new probabilistic open-world paradigm for future research. We also examine general adaptation to the open world by introducing a multi-criteria filtering tool capable of automatically filtering documents with latent attributes. Finally, we present an interdisciplinary project in Computer Science and Medicine, in which our filtering tool is evaluated on real systematic reviews in evidence-based medicine.

Acknowledgments

First and foremost, I would like to thank my patient and supportive advisor Ira Assent, who provided valuable advice, guidance, comments, challenges and opportunities to learn and improve throughout my PhD studies.
I would also like to thank my brilliant co-authors, who were both great to work with and helped improve my skills immensely. A very special thanks here goes to Sean Chester, who not only functioned as a highly competent co-author and collaborator, but also provided valuable guidance on my PhD and remains a close personal friend. I appreciate the support throughout and look forward to more fun times in the future.

Thanks to all my colleagues at the Data-Intensive Systems Group, both current and former. You made the years awesome. I wish you all the best. A special shout-out here goes to Anders Skovsgaard, my fellow Dane, who is not only a great friend, but also provided much-needed external perspective on my work and was a great sparring partner. Another special shout-out to my office mate, Dr/Superwoman Barbora Micenkova, whose open and friendly nature made sharing an office with her both pleasant and fun throughout all the years. We will always have Dr. Spikey.

A special thanks to Tim Kraska, who supported and guided me during my stay at Brown University, and who enabled me to work with great people like Thomas Trikalinos, Gaelen Adam, Yeounoh Chung, Carsten Binnig and especially Byron Wallace. You made my stint at Brown a very memorable and cherished one, with both my wife Sara and me feeling very welcome.

Also thank you to Anders Møller and Marianne Graves Petersen for providing an external perspective during our PhD support group meetings. Your guidance was much appreciated.

Lastly, but likely mostly, I would like to thank my beautiful, wonderful and fantastic wife Sara, who not only lived with years of me being too busy for my own good, but also quit her job and moved to the US with me. She is my rock, my inspiration and my everything. Thank you for your positive attitude and all your support.

Michael Lind Mortensen,
Aarhus, April 4, 2016.

Contents

Abstract
Resumé
Acknowledgments
Contents

I Overview

1 Introduction
    1.1 Notation & definitions
    1.2 Thesis Outline
    1.3 Other Contributions

2 Exploratory Search with Multi-Criteria Decision Support Queries
    2.1 On the Suitability of Skyline Queries for Data Exploration
    2.2 Efficient Caching for Constrained Skyline Queries
    2.3 Taking the Big Picture: Representative Skylines based on Significance and Diversity

3 Multi-Criteria Decision Support Queries in the Open World
    3.1 Open-world Paradigm for Multi-Criteria Decision Support Queries
    3.2 CrowdFilter: Semi-Automated Multi-Criteria Filtering in the Open World
    3.3 Crowdsourcing Citation Screening for Systematic Reviews

II Publications

4 On the Suitability of Skyline Queries for Data Exploration
    4.1 Introduction
    4.2 Background
    4.3 Theoretical effects
    4.4 Empirical Investigation
    4.5 Conclusion
    4.6 Acknowledgements

5 Efficient Caching for Constrained Skyline Queries
    5.1 Introduction
    5.2 Related work
    5.3 Preliminaries
    5.4 Exploiting related queries
    5.5 Arbitrary constraint changes
    5.6 Cache-Based Constrained Skyline
    5.7 Experimental evaluation
    5.8 Conclusion
    5.9 Appendix

6 Taking the Big Picture: Representative Skylines based on Significance and Diversity
    6.1 Introduction
    6.2 The Big Picture
    6.3 Complexity and Algorithms
    6.4 Experimental evaluation
    6.5 Related work
    6.6 Concluding remarks
    6.7 Acknowledgements
    6.8 Proofs

7 CrowdFilter: Semi-Automated Multi-Criteria Filtering in the Open World
    7.1 Introduction
    7.2 Multi-criteria filtering with the crowd
    7.3 The CrowdFilter System
    7.4 Learning with CrowdFilter
    7.5 Strategies & Optimizations
    7.6 Preliminary Results & Future Work
    7.7 Related work
    7.8 Acknowledgments

8 Crowdsourcing Citation Screening for Systematic Reviews
    8.1 Introduction
    8.2 Methods
    8.3 Results
    8.4 Discussion
    8.5 Appendix A - Tables and Figures
    8.6 Appendix B - Citation screening crowd questions
    8.7 Appendix D - Anonymized worker satisfaction responses, reviews etc.
    8.8 Appendix C - Experiences from early experimental iterations
    8.9 Appendix E - Honeypot details

Bibliography

Part I
Overview

Chapter 1
Introduction

Throughout the past decade, data sources have grown significantly in size, availability, richness, complexity and dynamics.