Towards Collaborative Session-Based Semantic Search

Faculty of Computer Science Interactive Media Lab Dresden Master’s Thesis Towards Collaborative Session-based Semantic Search Sebastian Straub A thesis submitted in fulfillment of the requirements for the degree of Master of Science (M.Sc.) Supervisor: Prof. Dr.-Ing. Raimund Dachselt Advisors: Dr.-Ing. Annett Mitschick Dr.-Ing. Anke Lehmann Date: October 5, 2017 iii Declaration of Authorship Hiermit versichere ich, dass ich die vorliegende Arbeit mit dem Titel Towards Col- laborative Session-based Semantic Search selbstständig und ohne Benutzung anderer als der angegebenen Hilfsmittel angefertigt habe. Alle Stellen, die anderen Quellen im Wortlaut oder dem Sinn nach entnommen wurden, sind als solche mit Angaben der Herkunft kenntlich gemacht. Dies gilt auch für alle bildlichen Darstellungen. Die Arbeit wurde bisher in gleicher oder ähnlicher Form noch nicht als Prüfungsleistung eingereicht. Dresden, 05.10.2017 v Abstract Towards Collaborative Session-based Semantic Search by Sebastian Straub In recent years, the most popular web search engines have excelled in their ability to answer short queries that require clear, localized and personalized answers. When it comes to complex exploratory search tasks however, the main challenge for the searcher remains the same as back in the 1990s: Trying to formulate a single query that contains all the right keywords to produce at least some relevant results. In this work we want to investigate new ways to facilitate exploratory search by making use of context information from the user’s entire search process. Therefore we present the concept of session-based semantic search, with an optional extension to collaborative search scenarios. To improve the relevance of search results we expand queries with terms from the user’s recent query history in the same search context (session-based search). We introduce a novel method for query classification based on statistical topic models which allows us to track the most important topics in a search session so that we can suggest relevant documents that could not be found through keyword matching. To demonstrate the potential of these concepts, we have built the prototype of a session-based semantic search engine which we release as free and open source software. In a qualitative user study that we have conducted, this prototype has shown promising results and was well-received by the participants. vii Zusammenfassung Towards Collaborative Session-based Semantic Search von Sebastian Straub Die führenden Web-Suchmaschinen haben sich in den letzten Jahren gegenseitig darin übertroffen, möglichst leicht verständliche, lokalisierte und personalisierte Antworten auf kurze Suchanfragen anzubieten. Bei komplexen explorativen Rechercheaufgaben hingegen ist die größte Herausforderung für den Nutzer immer noch die gleiche wie in den 1990er Jahren: Eine einzige Suchanfrage so zu formulieren, dass alle notwen- digen Schlüsselwörter enthalten sind, um zumindest ein paar relevante Ergebnisse zu erhalten. In der vorliegenden Arbeit sollen neue Methoden entwickelt werden, um die explorati- ve Suche zu erleichtern, indem Kontextinformationen aus dem gesamten Suchprozess des Nutzers einbezogen werden. Daher stellen wir das Konzept der sitzungsbasierten semantischen Suche vor, mit einer optionalen Erweiterung auf kollaborative Suchsze- narien. Um die Relevanz von Suchergebnissen zu steigern, werden Suchanfragen mit Begriffen aus den letzten Anfragen des Nutzers angereichert, die im selben Suchkon- text gestellt wurden (sitzungsbasierte Suche). Außerdem wird ein neuartiger Ansatz zur Klassifizierung von Suchanfragen eingeführt, der auf statistischen Themenmo- dellen basiert und es uns ermöglicht, die wichtigsten Themen in einer Suchsitzung zu erkennen, um damit weitere relevante Dokumente vorzuschlagen, die nicht durch Keyword-Matching gefunden werden konnten. Um das Potential dieser Konzepte zu demonstrieren, wurde im Rahmen dieser Arbeit der Prototyp einer sitzungsbasierten semantischen Suchmaschine entwickelt, den wir als freie Software veröffentlichen. In einer qualitativen Nutzerstudie hat dieser Pro- totyp vielversprechende Ergebnisse hervorgebracht und wurde von den Teilnehmern positiv aufgenommen. ix Acknowledgements First of all I would like to express my very great appreciation to my advisors Annett Mitschick and Anke Lehmann for their patient guidance and valuable feedback. Your continued support of my work was of vital importance to me. I am grateful to Prof. Raimund Dachselt for the opportunity to write this thesis about a topic of my own choosing. I know that this is not a matter of course and I hope I didn’t blow it for future students who come forward with ideas for their own theses. My special gratitude goes out to my colleagues at Avantgarde Labs, whom I had the pleasure to work with during my entire MA studies. You provided me with a playground where I could try out anything that I have learned during my studies and so it is not an understatement when I say that this work wouldn’t have been possible without that great experience. And finally, I am particularly grateful to my parents, my brother, my sister and my dear friends here in Dresden and back home for their unfailing support and the highly appreciated distractions that kept me going whenever I was ready to scrap the whole thing. You guys are awesome! i Contents Abstract. v Zusammenfassung. vii Acknowledgements. ix 1. Introduction 1 2. Related Work 5 2.1. Topic Models . 5 2.1.1. Common Traits . 5 2.1.2. Topic Modeling Techniques . 6 2.1.3. Topic Labeling . 8 2.1.4. Topic Graph Visualization . 9 2.2. Session-based Search . 10 2.3. Query Classification . 12 2.4. Collaborative Search . 13 2.4.1. Aspects of Collaborative Search Systems . 14 2.4.2. Collaborative Information Retrieval Systems . 14 3. Core Concepts 19 3.1. Session-based Search . 19 3.1.1. Session Data . 20 3.1.2. Query Aggregation . 21 3.2. Topic Centroid . 22 3.2.1. Topic Identification . 23 3.2.2. Topic Shift . 27 3.2.3. Relevance Feedback . 29 3.2.4. Topic Graph Visualization . 30 3.3. Search Strategy . 32 3.3.1. Prerequisites . 33 3.3.2. Search Algorithms . 33 3.3.3. Query Pipeline . 35 3.4. Collaborative Search . 38 3.4.1. Shared Topic Centroid . 38 3.4.2. Group Management . 39 3.4.3. Collaboration . 40 3.5. Discussion . 42 4. Prototype 45 4.1. Document Collection . 46 4.1.1. Selection Criteria . 46 ii 4.1.2. Data Preparation . 46 4.1.3. Search Index . 50 4.2. Search Engine . 52 4.2.1. Search Algorithms . 53 4.2.2. Query Pipeline . 55 4.2.3. Session Persistence . 58 4.3. User Interface . 59 4.4. Performance Review . 64 4.5. Discussion . 65 5. User Study 67 5.1. Methods . 67 5.1.1. Procedure . 68 5.1.2. Implementation . 68 5.1.3. Tasks . 70 5.1.4. Questionnaires . 70 5.2. Results . 71 5.2.1. Participants . 71 5.2.2. Task Review . 72 5.2.3. Literature Research Results . 75 5.3. Discussion . 76 6. Conclusion 79 Bibliography 83 Weblinks . 90 A. Appendix 91 A.1. Prototype: Source Code . 91 A.2. Survey . 92 A.2.1. Tasks . 92 A.2.2. Document Filter for Google Scholar . 95 A.2.3. Questionnaires . 96 A.2.4. Participant’s Answers . 100 A.2.5. Participant’s Search Results . 102 1 Chapter 1 Introduction “And how will you inquire into a thing when you are wholly ignorant of what it is? Even if you happen to bump right into it, how will you know it is the thing you didn’t know?” — Plát¯on, Mén¯on This quote from a fictional dialogue between ancient Greek philosopher Socrates and the politician Meno which was written by Plato has become known as Meno’s Paradox or The Paradox of Inquiry: At its core, it states that there is no need to search for something we already know and there is no way to recognize anything that we don’t know, therefore acquiring knowledge should be impossible. For a theist this problem is solved by referring to a higher power which has knowledge of the absolute truth and by following the rules of the higher power, even mere mortals can recognize this truth and therefore gain knowledge. Atheists however can only deduce knowledge from what they themselves define as axioms, the suitability of which they can assess by observation of the universe, but in the end there is no way to prove even the most basic axioms due to the lack of an absolute reference frame. From a practical perspective, the implications of Meno’s paradox are of little relevance to us: Humans have very different and often contradictory assumptions about the world, but in most cases these assumptions still allow them to get through their daily lives without getting into too much trouble with their environment. However, the issue of acquiring knowledge about things we don’t know anything about is a very practical one which we constantly encounter. Searching for information has become much more efficient and convenient with the rise of web search engines that allow access to knowledge sources that go far beyond what libraries traditionally used to offer. The popular search engines go through great lengths to offer search results that are localized and even personalized, like a search for “movies” might return a list of movies which are currently running in nearby theaters and which fit your cinematic profile. Yet while these retrieval methods do a great job when we have a clear goal, looking for information about topics we are already well aware of, most search engines fail to support the users in acquiring new knowledge in fields that they are unfamiliar with. This leads to the question how we can support people in exploratory search tasks when the goal is blurry at best and it is unclear how relevant sources that contain the desired knowledge can be discovered.

Load more