Thomas Niebler

Extracting and Learning Semantics from Social Web Data
Dissertation submitted for the degree of Doctor of Natural Sciences (Dr. rer. nat.) at the Faculty of Mathematics and Computer Science of the Universität Würzburg. Würzburg, October 2018
For my family: my safe harbour and the source of my strength.
Acknowledgements A long journey now comes to an end. I would never have arrived here if not for the many, many wonderful people who supported me in writing this dissertation. Unfortunately, it is almost impossible to thank everyone in person; still, I want to explicitly mention some of them. First of all, I want to thank my supervisor Andreas Hotho for the extremely interesting topic of my thesis, the encouraging research environment, and his guidance and belief in me throughout the years. Many thanks also go out to Robert Jäschke, who agreed to review this thesis. I also want to thank my former students Daniel Schlör, Luzian Hahn, and Tobias Koopmann, who contributed some parts to this thesis. At this point, I also want to thank my colleagues from far away, with whom I could collaborate on many parts of this thesis: Markus Strohmaier, Philipp Singer, and Stephan Doerfel. I especially want to express my sincerest gratitude to the great band of people that I was allowed to call my colleagues: Albin Zehe, Alexander Dallmann, Christian Pölitz, Daniel Schlör, Daniel Zoller, Lena Hettinger, Markus Ring, and Martin Becker from the DMIR Group, as well as the guys from the AI Chair, especially Georg Dietrich, Georg Fette, Jonathan Krebs, Marian Ifland, Markus Krug, and Maximilian Ertl. Your openness to interesting and fruitful discussions, but also to a quick chat and, last but not least, some casual table soccer sessions, makes me very happy to have you as my friends. Special thanks go out to my decade-old accomplice in many things, Martin Becker. You played an immeasurable role in my journey by always encouraging and supporting me, both in research and real life. Speaking of decades, I also want to mention the friends whom I was very lucky to meet and to keep by my side for over 12 years (and so many years to come!): Alexander Kleinschrodt, Andreas Freimann, Isabel Runge, Ralf Zeuka, Dogan Cinbir, Andreas Bauer, Thien Anh Le, Jonas Meier, and Sven Gehwald.
Thank you for the many days (and nights) of big and small adventures! Then there are those who, many, many years ago, laid the groundwork upon which I was able to build my life as it is now. For as long as I can remember, I have known that I could rely on my family in each and every aspect, most prominently Christine, Günther, Stefan, Rudolf, and Hannelore. You provided me with a loving, safe, and stable environment in which I could explore my interests and to which I could retreat in bad times. Words cannot express how grateful I am to know you are there. The last person I want to thank, but can never thank enough, is my wife Nicola. Without your unconditional love, your unfaltering support, your constant encouragement, and finally your unshakeable belief in me, I would never have been able to complete this work. Thank you for being the most important part of my life. I love you.

"There is nothing more beautiful than to be loved, loved for one's own sake, or rather: in spite of oneself." (Victor Hugo)
Abstract
Making machines understand natural language is a long-standing dream of mankind. Early attempts at programming machines to converse with humans in a supposedly intelligent way relied on phrase lists and simple keyword matching. However, such approaches cannot provide semantically adequate answers, as they do not consider the specific meaning of the conversation. Thus, if we want to enable machines to actually understand language, we need to be able to access semantically relevant background knowledge. For this, it is possible to query so-called ontologies, which are large networks containing knowledge about real-world entities and their semantic relations. However, creating such ontologies is a tedious task that often requires extensive expert knowledge. Thus, we need to find ways to automatically construct and update ontologies that fit human intuition of semantics and semantic relations. More specifically, we need to determine semantic entities and find relations between them. While this is usually done on large corpora of unstructured text, previous work has shown that we can at least alleviate the first issue, extracting entities, by considering special data such as tagging data or human navigational paths. Here, we do not need to detect the actual semantic entities, as they are already given by the way those data are collected. Thus, we can focus mainly on the problem of assessing the degree of semantic relatedness between tags or web pages. However, several issues need to be overcome if we want to approximate human intuition of semantic relatedness. For this, it is necessary to represent words and concepts in a way that allows easy and highly precise semantic characterization. This also largely depends on the quality of the data from which these representations are constructed.
In this thesis, we extract semantic information both from tagging data created by users of social tagging systems and from human navigation data in different semantically driven social web systems. Our main goal is to construct high-quality and robust vector representations of words, which can then be used to measure the relatedness of semantic concepts. First, we show that navigation in the social media systems Wikipedia and BibSonomy is driven by a semantic component. After this, we discuss and extend methods to model the semantic information in tagging data as low-dimensional vectors. Furthermore, we show that tagging pragmatics influences different facets of tagging semantics. We then investigate the usefulness of human navigational paths for measuring semantic relatedness in several different settings on Wikipedia and BibSonomy. Finally, we propose a metric-learning-based algorithm to adapt pre-trained word embeddings to datasets
containing human judgments of semantic relatedness. This work contributes to the study of semantic relatedness between words by proposing methods to extract semantic relatedness from web navigation, to learn high-quality, low-dimensional word representations from tagging data, and to learn semantic relatedness from any kind of vector representation by exploiting human feedback. Applications first and foremost lie in ontology learning for the Semantic Web, but also in semantic search and query expansion.
Zusammenfassung

One of mankind's great dreams is to make machines understand natural language. Early attempts at programming computers to hold supposedly intelligent conversations with humans relied mainly on phrase collections and simple keyword matching. Such approaches, however, cannot provide semantically adequate answers, since the actual content of the conversation cannot be captured. Consequently, machines need access to semantically relevant background knowledge in order to understand this content. Such knowledge is available, for example, in ontologies. Ontologies are large databases of interconnected knowledge about objects and entities of the real world as well as their semantic relations. Building such ontologies is a very costly and laborious task, since it often requires deep expert knowledge. We therefore need to find ways to construct and update ontologies automatically, and to do so in a way that also matches human intuition of semantics and semantic relatedness. More precisely, it is necessary to determine semantic entities and their relations. While such knowledge is usually extracted from text corpora, it is possible to sidestep at least the first problem, determining semantic entities, by using special kinds of datasets such as tagging or navigation data. In these kinds of data, entities need not be extracted, as they are already present due to inherent properties of the data acquisition process. We can therefore focus mainly on determining semantic relations and their strength. Nevertheless, several obstacles still have to be overcome.

For example, it is necessary to find representations of semantic entities that make it possible to characterize them easily and with high semantic precision. This, however, also depends considerably on the quality of the data from which these representations are constructed. In this thesis, we extract semantic information both from tagging data, produced by the users of social tagging systems, and from navigation data of users of semantically driven social media systems. The main goal of this thesis is to construct high-quality and robust vector representations of words, which can then be used to measure the semantic relatedness of concepts. First, we show that navigation in social media systems is driven, among other factors, by a semantic component. After this, we discuss and extend methods to represent the semantic information in tagging data as low-dimensional so-called "embeddings". Furthermore, we demonstrate that tagging pragmatics influences different facets of tagging semantics. We then investigate to what extent human navigational paths can be used to determine semantic relatedness; for this, we consider several datasets containing navigation data collected under different conditions. Finally, we present a novel algorithm to adapt already trained word embeddings to human intuition of semantics after the fact.

This work contributes valuable results to the field of determining semantic relatedness: it presents methods to extract high-quality semantic information from web navigation and tagging data, to model this information with low-dimensional vector representations, and finally to better adapt these representations to human intuition of semantic relatedness by learning from exactly that intuition. Applications lie primarily in learning ontologies for the Semantic Web, but also in all areas that use vector representations of semantic entities.
Contents
1 Introduction 1
1.1 Motivation 1
1.2 Research Topics and Contributions 3
1.2.1 Detecting Semantic Influence on Social Media Navigation 4
1.2.2 Capturing Semantics in Social Tagging Data 5
1.2.3 Extracting Semantic Relatedness from Social Media Navigation 6
1.2.4 Relative Relatedness Learning: Learning Semantic Relatedness from Human Feedback 7
1.3 Thesis Outline 8
2 Related Work 11
2.1 Extracting Semantics from Structured and Semi-Structured Data 12
2.1.1 Leveraging WordNet's Taxonomy to Compute Semantic Relatedness 12
2.1.2 Capturing Semantic Information from Folksonomies 12
2.1.3 Capturing Semantic Information from Wikipedia 14
2.1.4 Evaluating Extracted Semantic Information 16
2.2 Learning Semantics with and without Background Information 17
2.2.1 Learning Word Embeddings 17
2.2.2 Learning to Adapt Word Embeddings to Background Knowledge 20
2.3 Mutual Influence of Semantics and User Behavior 21
2.3.1 Modelling (Semantic) Influence on Human Navigation Behavior 22
2.3.2 Folksonomy Usage Patterns and their Influence on Semantics 23
2.3.3 Influence of Navigation Patterns in Wikipedia on Semantic Information 24
2.4 Outlook: Applications of Semantic Information Models 26
2.5 Summary 29
3 Foundations 31
3.1 Implicit and Explicit Forms of Semantic Knowledge in the Web 31
3.1.1 Social Tagging Systems and Folksonomies 32
3.1.1.1 Folksonomies 33
3.1.1.2 Social Tagging System User Interfaces 35
3.1.2 Wikipedia 37
3.1.3 The WordNet Taxonomy 39
3.1.4 Ontologies and Knowledge Graphs 40
3.2 Computational Methods to Model Distributional Semantic Knowledge 41
3.2.1 The Vector Space Model: Expressing Words as Vectors 42
3.2.1.1 Formal Definition of the Vector Space Model: Document-Level Semantics 43
3.2.1.2 Variants of the Vector Space Model for Word-Level Semantics 44
3.2.2 Choosing the Right Context 48
3.2.2.1 Context in Folksonomies 49
3.2.2.2 Context in Graphs 50
3.2.3 Measuring Semantic Relatedness and Similarity 50
3.2.3.1 Distinguishing Semantic Relatedness and Similarity 51
3.2.3.2 The Cosine Similarity Measure for VSMs 52
3.2.3.3 WordNet-Based Semantic Similarity Measures 53
3.2.4 Evaluating Semantic Relatedness Measures 54
3.2.4.1 Grounding on Semantic Relatedness Datasets 54
3.2.4.2 Grounding on Word Thesauri 57
3.2.5 Post-Processing and Applying Word Vector Representations for Semantic Information Gain 58
3.2.5.1 Improving Vector Representations using Lexicons: Retrofitting 58
3.2.5.2 Discovering Word Senses Using Vector Representations 59
3.3 Characterizing User Behavior in the Social Web 62
3.3.1 User Behavior Types in Folksonomies 62
3.3.1.1 Categorizers and Describers 62
3.3.1.2 Generalists and Specialists 63
3.3.2 Navigation Behavior in Social Media Systems 65
3.3.2.1 Navigation in Wikipedia 65
3.3.2.2 Navigation Behavior in Social Tagging Systems 67
3.3.3 Analyzing User Navigation using Navigational Hypotheses 67
3.3.3.1 Formulating and Comparing Navigation Hypotheses with HypTrails 67
3.3.3.2 Standard Hypotheses 68
3.4 Summary and Relations to this Work 70
4 Datasets 71
4.1 Text-based Datasets 71
4.1.1 Delicious tagging data 72
4.1.2 BibSonomy tagging data 73
4.1.3 CiteULike tagging data 74
4.1.4 Wikipedia article texts 74
4.1.5 Wikipedia disambiguation pages 75
4.2 Link-based Datasets 76
4.2.1 Navigation Games on Wikipedia: WikiGame and Wikispeedia 76
4.2.1.1 WikiGame 77
4.2.1.2 Wikispeedia 78
4.2.2 Unconstrained Navigation on Wikipedia: ClickStream and WikiClickIU 79
4.2.2.1 ClickStream 79
4.2.2.2 WikiClickIU 79
4.2.3 Wikipedia's Static Link Network: WikiLink 80
4.2.4 BibSonomy Website Usage 81
4.2.4.1 BibSonomy Request Logs 82
4.2.4.2 Human Navigational Paths on BibSonomy 83
4.2.5 The BibSonomy Link Graph 83
4.3 Pretrained Word Embeddings 83
4.3.1 WikiGloVe 84
4.3.2 WikiNav 85
4.3.3 ConceptNet Numberbatch 85
4.4 Semantic Relatedness Datasets 85
4.4.1 RG65 86
4.4.2 MC30 87
4.4.3 WS-353 87
4.4.4 MTurk 88
4.4.5 MEN 88
4.4.6 SimLex-999 89
5 Detecting Semantic Influence on Social Media Navigation 91
5.1 Introduction 91
5.2 Wikipedia 92
5.2.1 Hypotheses 92
5.2.1.1 Standard Hypotheses 92
5.2.1.2 High and Low Degree Hypotheses 93
5.2.2 Results and Discussion 94
5.2.2.1 Game Navigation: WikiGame and Wikispeedia 94
5.2.2.2 Unconstrained Navigation 98
5.2.3 Conclusion 101
5.3 BibSonomy 101
5.3.1 Hypotheses 102
5.3.1.1 Standard Hypotheses 102
5.3.1.2 Social Tagging System Specific Hypotheses 103
5.3.1.3 Combined Hypotheses 104
5.3.2 Results and Discussion 106
5.3.2.1 Overall Request Log Dataset 106
5.3.2.2 Outside Navigation 108
5.3.2.3 User Gender 109
5.3.2.4 Usage Continuity 110
5.3.2.5 Tagger Classes 110
5.3.3 Conclusion 111
5.4 Summary 112
6 Capturing Semantics in Social Tagging Data 115
6.1 Introduction 115
6.2 A Critical Examination of WordNet-based Evaluation of Semantic Similarity 116
6.2.1 Experimental Setup 117
6.2.2 Results 118
6.2.2.1 Re-Evaluating the Jiang-Conrath Measure on Human Intuition 118
6.2.2.2 Vocabulary Coverage and Extensibility of Semantic Relatedness Datasets 119
6.2.3 Discussion 122
6.2.3.1 Comparison of Evaluation Approaches 122
6.2.3.2 Shortcomings of Evaluation on Human Intuition Datasets 123
6.2.4 Conclusion 124
6.3 Re-Evaluating Measures of Semantic Tag Similarity on Human Intuition 124
6.3.1 Preliminaries 125
6.3.2 Distributional Measures of Tag Similarity 125
6.3.3 Impact of User Pragmatics on Tagging Semantics 128
6.3.4 Summary 134
6.4 Influence of Tagging Pragmatics on Tag Sense Discovery in Folksonomies 134
6.4.1 Experimental Setup 136
6.4.2 Results and Discussion 138
6.4.3 Conclusion 141
6.5 Learning Tag Embeddings 142
6.5.1 Applicability of Embedding Algorithms on Tagging Data 143
6.5.1.1 Word2Vec 143
6.5.1.2 GloVe 143
6.5.1.3 LINE 144
6.5.1.4 Common Parameters 144
6.5.2 Experimental Setup 145
6.5.3 Results 146
6.5.4 Discussion 149
6.5.5 Conclusion 151
6.6 Summary 151
7 Extracting Semantic Relatedness from Social Media Navigation 153
7.1 Introduction 153
7.2 Wikipedia 155
7.2.1 Methods 156
7.2.1.1 Count-based Navigational Semantic Relatedness 156
7.2.1.2 Binary Navigational Semantic Relatedness 157
7.2.1.3 Evaluation on Human Intuition 158
7.2.2 Game Navigation 159
7.2.2.1 Contribution of Game Navigation to Semantic Relatedness 159
7.2.2.2 Benefit of Human Game Navigation over the Link Network 162
7.2.2.3 Path Selection to Improve Fit to Human Intuition 165
7.2.2.4 Conclusion 173
7.2.3 Unconstrained Navigation 175
7.2.3.1 Contribution of Unconstrained Navigation to Semantic Relatedness 176
7.2.3.2 Benefit of Human Unconstrained Navigation over the Link Network 179
7.2.3.3 Conclusion 181
7.2.4 Random Walks on the Wikipedia Graph 181
7.2.4.1 Contribution of Random Walks to Semantic Relatedness 183
7.2.4.2 Benefit of Biased Random Walks over Uniform Random Walks 185
7.2.4.3 Conclusion 192
7.2.5 Summary and Discussion 192
7.2.5.1 Summary 193
7.2.5.2 Discussion of the Results 193
7.3 BibSonomy 194
7.3.1 Preliminaries 195
7.3.1.1 Content-based Navigational Semantic Relatedness 195
7.3.1.2 Datasets and Evaluation 197
7.3.2 Unconstrained Navigation 197
7.3.2.1 Page Co-Occurrence Counting Semantics of BibSonomy Navigation 197
7.3.2.2 Leveraging the Tag Cloud of a BibSonomy Page 198
7.3.2.3 Removing Potentially Noisy Transitions 201
7.3.2.4 Discussion 204
7.3.2.5 Conclusion 205
7.3.3 Random Walks on the BibSonomy Graph 206
7.3.3.1 Experimental Setup 206
7.3.3.2 Results and Discussion 208
7.3.3.3 Conclusion 209
7.4 Summary 209
8 Relative Relatedness Learning: Learning Semantic Relatedness from Human Feedback 213
8.1 Introduction 213
8.2 The Relative Relatedness Learning (RRL) Algorithm 214
8.2.1 Motivation 215
8.2.2 Optimization Objective 216
8.2.3 Optimization 217
8.3 Experiments and Results 217
8.3.1 Preliminaries 217
8.3.2 Word Relatedness 218
8.3.2.1 Integrating Different Levels of Background Knowledge 219
8.3.2.2 Comparison and Combination with Retrofitting 220
8.3.2.3 Robustness of Pretrained Embeddings 222
8.3.2.4 Transporting User Intentions 223
8.3.3 Question Answering 224
8.4 Discussion 225
8.5 Conclusion 227
9 Conclusion and Future Perspectives 229
9.1 Summary 229
9.2 Open Problems and Future Perspectives 231
List of Notations 233
1 General Notation 233
2 Folksonomy Notation 233
3 Semantics Notation 233
4 Graph and Navigation Notations 234
Bibliography 235
List of Figures
1.1 General Thesis outline ...... 9
3.1 Screenshot of BibSonomy ...... 33
3.2 Tri-partite folksonomy hypergraph ...... 34
3.3 Screenshot of the Wikipedia page about "Coffee" ...... 38
3.4 An example for word-word co-occurrence and term-document matrices, constructed from a document corpus D = {d1, d2} ...... 44
3.5 Illustration of Word2Vec's skipgram neural network architecture ...... 46
3.6 Illustration of node neighborhoods in a graph ...... 48
3.7 Illustration of the cosine similarity measure for vectors ...... 52
3.8 Anscombe quartet ...... 56
3.9 Distribution plot of human judgment in WS-353 and Bib100 ...... 57
3.10 Illustration of the tag sense discovery process ...... 60
4.1 Distribution of Wikipedia senses per tag ...... 76
4.2 Available links between resource pages in the BibSonomy link graph ...... 84
5.1 Illustrations of navigation hypotheses on Wikipedia ...... 95
5.2 Marginal likelihoods of navigation hypotheses on WikiGame navigation ...... 96
5.3 Marginal likelihoods of navigation hypotheses on Wikispeedia navigation ...... 97
5.4 Hypothesis-based analysis of unconstrained navigation on the ClickStream data ...... 99
5.5 Hypothesis-based analysis of unconstrained navigation on the WikiClickIU data ...... 100
5.6 Illustrations of the navigation hypotheses defined in Section 5.3.1 ...... 105
5.7 Evidence chart for the navigational hypotheses on the complete request log dataset of BibSonomy ...... 107
5.8 Evidence curves for BibSonomy navigation outside of the user's resources ...... 109
5.9 Evidence charts of BibSonomy navigation hypotheses split by gender ...... 113
5.10 Evidence charts for short-term BibSonomy users ...... 114
5.11 Evidence chart for different user types according to tagging behavior ...... 114
6.1 Impact of tagging pragmatics on tagging semantics in Delicious ...... 130
6.2 Impact of tagging pragmatics on tagging semantics in BibSonomy ...... 131
6.3 Impact of tagging pragmatics on tagging semantics, evaluated on the semantic similarity evaluation dataset SimLex-999 ...... 133
6.4 Illustration of different word senses for apple ...... 135
6.5 Evaluation of discovered tag senses on Wikipedia ...... 136
6.6 Tag Sense Discovery results for the Delicious dataset ...... 140
6.7 Tag Sense Discovery results for the BibSonomy dataset ...... 141
6.8 Evaluation results for embeddings generated by Word2Vec on MEN ...... 147
6.9 Evaluation results for embeddings generated by GloVe on the MEN dataset ...... 148
6.10 Evaluation results for embeddings generated by LINE on the MEN dataset ...... 149
6.11 Evaluation results for embeddings generated by the best parameter settings across the different tagging datasets ...... 150
7.1 A sketch of the sliding window approach and the resulting co-occurrence matrix ...... 157
7.2 A sketch of the binarization approach and the resulting co-occurrence matrix ...... 158
7.3 Illustration of the distribution of path lengths in the WikiGame and Wikispeedia ...... 167
7.4 Semantic relatedness calculated on different path selections on the WikiGame ...... 170
7.5 Semantic "fingerprints" for different concepts on WikiGame data ...... 171
7.6 Effect of successful / unsuccessful paths in the WikiGame ...... 174
7.7 Transition count distributions for the ClickStream, WikiClickIU, and WikiGame datasets ...... 178
7.8 Spearman correlations on WS-353 for networks pruned to the top k-percent nodes according to PageRank ...... 186
7.9 Evaluation results for random walks on the WikiGame graph ...... 189
7.10 Evaluation results for random walks on the ClickStream graph ...... 191
7.11 Illustration of the Multi-Tag-Co-Occurrence method for a window size of 2 ...... 196
7.12 Results for the page co-occurrence counting method on BibSonomy navigation ...... 199
7.13 Results for the tagset weighting experiment on BibSonomy ...... 202
7.14 Results for the minimum transition count experiment ...... 204
7.15 Semantic relatedness results from paths when excluding all /user/USER pages ...... 205
8.1 A rough sketch of how RRL works ...... 214
8.2 Results on different amounts of MEN training data ...... 220
8.3 Results on different amounts of WS-353 training data using the pretrained embeddings ...... 221
8.4 Results for training a measure on only 353 equidistant pairs from MEN ...... 226
List of Tables
3.1 The considered page types in the BibSonomy system ...... 36
3.2 Comparison of visited page types in game and unconstrained navigation on Wikipedia ...... 66
4.1 List of all datasets used in this thesis ...... 72
4.2 Statistics about the Delicious dataset ...... 73
4.3 Statistics about the BibSonomy dataset ...... 74
4.4 Statistics about the CiteULike dataset ...... 75
4.5 Properties of the Wikipedia text datasets ...... 75
4.6 Characteristics of the WikiGame navigation dataset ...... 77
4.7 Characteristics of the Wikispeedia navigation dataset ...... 78
4.8 Characteristics of the ClickStream navigation dataset ...... 79
4.9 Characteristics of the WikiClickIU navigation dataset ...... 80
4.10 Size comparison of all Wikipedia link datasets ...... 80
4.11 Characteristics of the BibSonomy navigation dataset ...... 81
4.12 BibSonomy request log and link network statistics ...... 82
4.13 Base statistics for the human intuition similarity datasets ...... 86
5.1 Transition statistics of different BibSonomy request log subsets ...... 106
5.2 Selected request log statistics ...... 108
6.1 Evaluation of WordNet semantic similarity measures on different datasets with human intuition of semantic relatedness ...... 119
6.2 Overlaps of the top tags from Delicious and BibSonomy with WordNet and WS-353 ...... 120
6.3 Spearman correlation scores on WS-353 in other literature ...... 120
6.4 Comparison of different BibSonomy vocabulary subsets with the WS-353 and the Bib100 vocabularies ...... 121
6.5 Evaluation of the Jiang-Conrath and taxonomic path distance measures on Bib100 ...... 122
6.6 Results for the WordNet-based evaluation method used by Cattuto et al. on the Delicious and BibSonomy datasets ...... 126
6.7 Results for evaluation on the human similarity intuition datasets for the Delicious and BibSonomy folksonomy datasets ...... 127
6.8 Examples of detected senses produced by hierarchical clustering with the optimal distance criterion threshold for the tag swing, based on different partitions of included users of the Delicious dataset ...... 139
6.9 Spearman correlation values for the co-occurrence baseline ...... 145
6.10 Initial parameter values for each algorithm ...... 146
6.11 Best parameter values for each algorithm for the MEN dataset on Delicious data ...... 146
7.1 Semantic relatedness calculated on human navigational paths in Wikipedia games ...... 160
7.2 Semantic relatedness accuracy calculated in a similar fashion as for Table 7.1 ...... 161
7.3 Comparison of semantic relatedness calculations using a window size of w = 3, evaluated against WS-353 and MEN on all WikiGame paths, with several baseline corpora ...... 164
7.4 Semantics in successful and unsuccessful game paths ...... 165
7.5 Performance comparison of our Wikipedia navigational datasets on common evaluation pairs ...... 180
7.6 Sampling results on baseline datasets based on WikiLink and ClickStream ...... 180
7.7 Spearman correlations achieved with co-occurrence counting and binarization by varying the number of walks started on each node in the network ...... 184
7.8 Overview of all link networks that we use to generate random walks ...... 185
7.9 Basic statistics for all random walk datasets generated from Wikipedia link networks ...... 185
7.10 Overlap of the Wikipedia page subsets according to their importance with respect to PageRank with the WikiGame and ClickStream datasets ...... 187
7.11 Evaluation results when characterizing a BibSonomy page by its single most frequently occurring tag ...... 200
7.12 Size comparison of the Wikipedia navigation datasets used in Section 7.2.3, together with their evaluation scores on WS-353 ...... 203
7.13 Statistics about the random paths on BibSonomy generated from the navigation hypotheses presented in Section 5.3 ...... 208
7.14 Evaluation of all random walks on BibSonomy for different window sizes ...... 211
8.1 Examples of word pairs with human-assigned relatedness scores on a scale ranging from 0 to 1 ...... 215
8.2 Baselines for word embedding fine-tuning experiments ...... 222
8.3 Results for user intention transport experiments ...... 224
8.4 Evaluation results for the question answering task using DrQA ...... 225
Chapter 1 Introduction
User: "Siri, call me an ambulance!" Siri: "From now on, I'll call you 'An Ambulance'. OK?"
1.1 Motivation
In 1966, Joseph Weizenbaum developed the computer program ELIZA [244], the first chatbot ever. Given any utterance from its users, it made use of a thesaurus to find direct and suitable answers from a large collection of possible phrases. It was specifically designed to represent a mock Rogerian psychotherapist [197], a role that allowed it to engage with a user while revealing no knowledge of the real world, simply by paraphrasing the user's input. However, this experiment only showed that, by choosing very general responses, the illusion that a person is responding can easily be achieved. In fact, ELIZA simply looks up hypernyms or generalizations of words encountered in the user's utterances. ELIZA does not understand what a user is actually saying and thus cannot be seen as an intelligent system. Nowadays, speech understanding systems, such as Amazon's Alexa, Apple's Siri or Google Now, are present on almost every smartphone and on their way into many households. With greatly increased processing power and sophisticated algorithms, computers are able to respond much better and much more directly to users' needs and intentions1. For example, modern systems like Alexa can inform a user about the current weather and calendar events, and even perform simple actions like opening up apps with given parameters. This impression of "intelligent" machines is enabled by great strides in research since ELIZA on how to process and interpret natural language.
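The phrase-list-and-keyword-matching principle described above can be sketched in a few lines of Python. This is a minimal illustration of the general technique, not Weizenbaum's original script; all patterns and response templates below are invented for this example:

```python
import re

# ELIZA-style rules: a keyword pattern and a canned response template.
# The responder never models meaning; it only reflects the user's words back.
RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "How long have you felt {0}?"),
    (re.compile(r"\bmy (\w+)", re.IGNORECASE), "Tell me more about your {0}."),
]
FALLBACK = "Please go on."  # generic response when no keyword matches

def respond(utterance: str) -> str:
    """Return the template of the first matching rule, filled with the match."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(*match.groups())
    return FALLBACK

print(respond("I am sad about my thesis"))  # Why do you say you are sad about my thesis?
print(respond("hello there"))               # Please go on.
```

The fallback response illustrates exactly the effect noted above: a sufficiently general reply sustains the illusion of a conversation even when nothing was understood.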
The Semantic Web, Ontologies and the Knowledge Acquisition Bottleneck One key aspect responsible for this development is the evolution of the World Wide Web from static web pages to semantically supported dynamic user interfaces. In 2001, Berners-Lee, Hendler, and Lassila stated their idea of the Semantic Web [31].
1 Especially to prevent situations like the one in the quote above.
The ultimate goal of the Semantic Web is to make information machine-readable and machine-interpretable, enabling computers to understand humans. Attributing semantic meaning to words and entities, while simultaneously connecting this information, is inherent to human thinking and verbal interaction: Mitchell et al. could even predict the activity in specific regions of the human brain that are associated with different meanings of nouns [162]. In the Semantic Web, this knowledge about the semantics of words is made explicit by extending the plain hyperlink structure between web pages prevalent in the World Wide Web to represent semantic relations between semantic entities. The key structures encoding such knowledge are called ontologies. Ontologies are large databases of entities and contain explicitly named semantic relations between those entities [88, 224]. Concretely, ontologies contain common, real-world knowledge about one or many domains. The contents of ontologies are normally hand-crafted, and extensive expert knowledge is needed to curate them and keep them up to date. For example, WordNet, a large linguistic database of the English language, is curated by expert linguists and provides highly precise information about different meanings of words and their linguistic connections [72]. However, due to the manual curation process, WordNet adapts very slowly to recent changes in language. For example, the current version 3.1 of WordNet does not contain information about the meaning of Python as a widely-used programming language.2 Especially in expert domains, such as medicine or science, the acquisition of exact and precise knowledge and its subsequent transformation into machine-readable form is a tedious and expensive task: Constructing and editing such specialized ontologies or knowledge graphs largely remains a task for domain experts, and is still unattractive for less knowledgeable and less engaged users.
This problem is also often called the knowledge acquisition bottleneck.
Overcoming the Knowledge Acquisition Bottleneck using Computational Methods Today, there exist several big ontologies that contain vast amounts of knowledge in the general domain, each of which overcame the knowledge acquisition bottleneck in a different way: While Wikidata3 employs a knowledge acquisition strategy similar to Wikipedia's, i.e., letting anyone contribute to its knowledge, the YAGO4 and DBpedia5 ontologies rely solely on computational methods to extend, enhance or construct their knowledge automatically [12, 225, 235]. The automated process of constructing, enhancing and curating ontologies can be described as "Ontology Learning" [145] and has been in the spotlight of the Semantic Web research community for a long time [54, 99, 145]. While all three ontologies mentioned above are largely constructed automatically, they cover only general domain knowledge, as that is available in abundance. However, in order to construct expert domain ontologies in an automated fashion, one often has to resort to unstructured sources. The extraction of structured knowledge from unstructured
2http://wordnetweb.princeton.edu/perl/webwn?s=Python 3https://www.wikidata.org/ 4https://github.com/yago-naga/yago3 5https://wiki.dbpedia.org/
sources is a non-trivial problem, since it requires addressing several inherent problems of human language that are not present in semi-structured sources. For example, the intended meaning of a word largely depends on its context, i.e., its surrounding words. In order to extract such knowledge from unstructured text, it is possible to leverage statistical natural language processing (NLP) methods, such as constructing representations of the semantic meaning of words and entities in natural language text [156, 184, 201, 205].
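To make the distributional idea concrete, the following minimal sketch (with a made-up toy corpus) builds sparse co-occurrence vectors over a small context window and compares words by cosine similarity:

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy corpus (invented sentences): the meaning of a word is approximated
# by the words it co-occurs with (distributional hypothesis).
corpus = [
    "the bank approved the loan and the money transfer",
    "the money was deposited at the bank by the customer",
    "the river bank was covered with reeds and water",
    "loan and money are handled by the customer at the bank",
]

# Count co-occurrences within a +/-2 word window.
window = 2
vectors = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                vectors[word][tokens[j]] += 1

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# "money" shares contexts with "loan" more than with "river".
print(cosine(vectors["money"], vectors["loan"]))
print(cosine(vectors["money"], vectors["river"]))
```

This is only a sketch of the simplest count-based model; the methods cited above replace raw counts with learned dense representations.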
Social Media Data as a Source for Semantic Information A rich source of data from which semantic information can be extracted on a large scale is user-contributed content in the Web 2.0. After a period of static web content, the Web 2.0 enabled users to not only consume provided content, but also to produce and distribute it, without the need for access to expensive hardware or specialized knowledge. Thus, the Web 2.0 is also called the Social Web, and its prominent platforms (like Facebook6, Twitter7 and Wikipedia8) are often described as Social Media. Especially the social encyclopedia Wikipedia symbolizes the collaborative spirit of the Social Web like no other social system. Because Wikipedia allows anyone to freely edit its content, Wikipedia grew into the largest encyclopedia available to date, with a content precision that rivals that of Encyclopaedia Britannica [84]. Because of its size, scientific precision, and clarity of writing, Wikipedia’s article collection quickly became one of the most studied textual corpora in today’s NLP community. Another key technology of the Social Web are social tagging systems, where users can publish any kind of content and additionally annotate this content with so-called tags, i.e., short free-form textual descriptions that allow users to categorize content for ease of later retrieval. The collection of users, their resources and annotations is called a folksonomy. Similar to Wikipedia, the collaborative nature of social tagging systems and the emergence of stable semantic structures in their content attracted the interest of many behavioral and NLP researchers, but also of the Semantic Web community. Mika even went as far as describing folksonomies as “light-weight ontologies”, because the emerging content structures show inherent semantic hierarchical relations [155]. Well-known examples of social tagging systems are Delicious9, CiteULike10, FlickR11, and BibSonomy12.
1.2 Research Topics and Contributions
In this thesis, we present work on several research topics concerning semantic information in social media data and the mutual influence of user behavior and semantic
6https://www.facebook.com 7https://twitter.com/ 8https://www.wikipedia.org/ 9https://del.icio.us/ 10http://www.citeulike.org/ 11https://www.flickr.com/ 12https://www.bibsonomy.org/
information. In the following, we will elaborate on each of these topics separately by first describing the general setting and open problems, then explicitly stating our approach to overcoming these problems, and finally summarizing our findings. Each subsection refers to the corresponding chapter with the same name later in this work.
1.2.1 Detecting Semantic Influence on Social Media Navigation
Problem Setting. There is plenty of evidence that the navigational behavior of users is driven by semantic factors. For example, Chi et al. hypothesized that users navigating social media follow a so-called “information scent”, i.e., they follow information that might bring them closer to their actual navigational goal [51]. West and Leskovec found indications for this behavior in navigation data obtained from the navigation game Wikispeedia [247]. It is, however, unclear if this also generalizes to other game navigation datasets, e.g., data from the WikiGame. Moreover, we do not know how users navigate Wikipedia in unconstrained settings, i.e., when lacking a predefined goal. While Doerfel et al. investigated the usage of the social tagging system BibSonomy in [64], they did not compare hypotheses about actual navigation behavior. Thus, we do not know which factors drive navigation in social tagging systems.
Approach. In order to determine the semantic influence on human navigation, we first have to understand human navigation processes and what drives them on a basic level. We propose several hypotheses about human navigation behavior in social media systems and compare them using Bayesian statistics. For this, we analyze navigation on a wide range of human click trail datasets gathered from the social media systems Wikipedia and BibSonomy. While on Wikipedia we distinguish game and unconstrained navigation, on BibSonomy we investigate several subsets of navigation data defined by user characteristics. Because both systems possess an inherently semantic nature, it is a logical assumption that navigation in those systems is influenced by their semantic content. Thus, we explicitly search for a signal of the influence of semantic information on navigation. Furthermore, previous work [122, 176] found that groups of users with a certain tagging behavior also influence the semantic information in tagging data. It is thus only natural to expect that the navigation behavior of those user groups differs as well.
Findings and Contributions. Generally, we find that navigation in social media systems is influenced by semantic information. However, the impact varies between datasets. While navigation on Wikispeedia and the WikiClickIU dataset can be explained well by semantic navigation, WikiGame and ClickStream are rather dominated by degree-based navigation. In BibSonomy, navigation behavior can mainly be explained by semantic navigation on a user’s own pages. Furthermore, we find evidence for different navigation behavior of user groups defined by their tagging pragmatics, e.g., categorizers navigate differently than describers. These results motivate the research later in Chapter 7.
1.2.2 Capturing Semantics in Social Tagging Data Problem Setting. It has been repeatedly shown that social tagging data exhibit emergent structures that make it possible to extract semantic information [28, 29, 48, 86]. In [48], Cattuto et al. showed how to model semantic information as sparse, high-dimensional co-occurrence vectors and how to semantically ground those models on semantic similarity derived from the WordNet taxonomy, concretely the Jiang-Conrath similarity [112]. There exist several issues with both the model and the evaluation approach. First of all, the vector representations of tags often suffer from their high dimensionality and sparsity, especially on small tagging datasets. Lately, several methods have been proposed to learn low-dimensional, dense word representations from unstructured text [156, 184, 227]. It is yet unclear if those methods can also be used to generate embeddings of tags and how this impacts the semantic information of tagging data. Furthermore, although Budanitsky and Hirst showed that the Jiang-Conrath measure exhibits a very high correlation with two small semantic relatedness datasets [44], it is unclear if the Jiang-Conrath similarity scores generalize well across other semantic relatedness datasets. If this were not the case, it would affect the validity of other works that base their results on semantic similarity scores obtained from WordNet.
Approach. We discuss the WordNet-based evaluation approach of [48] and compare it with a direct evaluation on human intuition. In addition to detailing the drawbacks of the WordNet-based evaluation, we propose to additionally evaluate tagging semantics directly on human intuition. We then re-examine two works that base their results on similarity scores from the WordNet-based Jiang-Conrath similarity measure. By re-evaluating their results on human intuition, we want to see if their observations still hold or if they have to be reconsidered. Concretely, we want to see if the qualitative ranking of context choices for tag vector representations as presented by Cattuto et al. [48] can also be found when evaluating on human intuition. Additionally, we repeat the experiments by Körner et al. [122], who measured the influence of tagging pragmatics on tagging semantics. We extend this experiment by also measuring the influence of pragmatics on a tag sense discovery process. Finally, we attempt to overcome the high dimensionality and sparsity issues of the model by Cattuto et al. For this, we explore the applicability of three well-known word embedding algorithms on tagging data, namely Word2Vec [156], LINE [227] and GloVe [184].
Findings and Contributions. First of all, we find that the Jiang-Conrath measure does not generalize well across other semantic relatedness datasets. The subsequent re-examination of the works by Cattuto et al. [48] and Körner et al. [122] shows that while some of their observations are still valid in the evaluation setting on human intuition, we can see differences in some details. For example, we can see that some types of user pragmatics are more useful to measure semantic similarity, while others generate semantic information that is beneficial for measuring semantic relatedness. We also find that tagging pragmatics influence the tag sense discovery process, where we can observe that a small portion of describers generates the most useful semantic information for this
application. Finally, we find that generally, embedding tags in low-dimensional vector spaces can improve the semantic quality of the tag representations, regardless of the chosen embedding algorithm. Especially the GloVe algorithm captures folksonomy-specific semantics very well and can always outperform the sparse baseline representation from [48].
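The evaluation on human intuition used throughout these experiments boils down to rank-correlating model scores with human judgments for the same word pairs. A minimal sketch of such a Spearman evaluation (all word pairs and scores below are invented for illustration, not taken from WS-353 or MEN):

```python
# Hypothetical model similarity scores and human gold-standard judgments
# for the same word pairs (all values are made up).
model_scores = {("car", "auto"): 0.92, ("cup", "coffee"): 0.45,
                ("king", "queen"): 0.81, ("noon", "string"): 0.05,
                ("bank", "money"): 0.60}
human_scores = {("car", "auto"): 9.5, ("cup", "coffee"): 6.1,
                ("king", "queen"): 8.6, ("noon", "string"): 0.5,
                ("bank", "money"): 8.0}

def ranks(values):
    """Average ranks (1-based); ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

pairs = sorted(model_scores)
rho = spearman([model_scores[p] for p in pairs], [human_scores[p] for p in pairs])
print(round(rho, 3))
```

In practice, a library implementation (e.g., SciPy's) would be used; the point here is only how correlation scores such as those reported below arise.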
1.2.3 Extracting Semantic Relatedness from Social Media Navigation Problem Setting. In Chapter 5, we will show that human navigation in information networks is partially driven by semantic incentives. Additionally, West, Pineau, and Precup already exploited human navigation on a small subset of Wikipedia to calculate semantic relatedness between Wikipedia concepts [249]. This indicates that human navigational paths in social media systems can be exploited as a valuable source of semantic information. Still, it is unclear if the semantic information is contained in the actual navigation behavior or in the underlying link network that only allows certain paths. Also, the results from [249] have been gathered in a game setting, which imposes an extrinsic bias on navigation behavior. We do not know if that bias influences the semantic information of those paths. Finally, in contrast to text and tagging data, information about human navigation is not readily available. However, navigation could be created artificially by performing random walks on the hyperlink graph. Again, it is unclear if the creation process of those artificial paths introduces unwanted bias that has a detrimental effect on the semantic information of those paths.
Approach. We want to determine the usefulness of human navigational paths in social media systems as a viable source of semantic information. For this, we propose several methods to model that information as co-occurrence based vectors. We evaluate our model on game and unconstrained navigation in Wikipedia, which we again investigate separately, as well as on unconstrained navigation in BibSonomy. To overcome limitations in design and size of our navigation datasets, we also investigate if parameterized random walks on the social media link graphs can approximate the semantic information contained in human navigation.
Findings and Contributions. Our results show that human navigation in Wikipedia can serve as a high quality source of semantic information. Using the proposed co-occurrence count model, we achieve a high correlation of 0.709 with the human intuition contained in the semantic relatedness dataset WS-353 (cf. Section 4.4.3). When restricting ourselves to only a small number of paths with a low average indegree, we can even achieve a correlation score of 0.760. Furthermore, we find that the extrinsic bias on navigation that is imposed by a game setting does not significantly influence the semantic information in navigation behavior, i.e., that unconstrained navigation on Wikipedia is an equally viable source for semantic information as game navigation. Here, we can achieve a correlation score of 0.640 on the semantic relatedness dataset MEN (cf. Section 4.4.5). As a final result for navigation on Wikipedia, we can even simulate navigation data with
near-human semantic information by performing biased random walks on the Wikipedia link graph. Using human navigation on BibSonomy, however, we can only extract a smaller amount of semantic information, with a low correlation of scarcely above 0.4 with the human intuition in WS-353 and Bib100. Although we also experiment with simulated navigation on BibSonomy and find that simulated semantic navigation generates more useful semantic information in the resulting paths than human navigation, we cannot achieve correlation scores above 0.5 on Bib100 (cf. Section 6.2).
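The simulated navigation mentioned above can be pictured as weighted random walks on a link graph. The following sketch generates paths biased towards low-indegree successors; the toy graph, article names, and the particular bias are illustrative assumptions, not the parameterization used later in this thesis:

```python
import random

# Toy directed link graph (hypothetical article names).
links = {
    "Python": ["Programming", "Snake", "Guido van Rossum"],
    "Programming": ["Python", "Compiler"],
    "Snake": ["Reptile", "Python"],
    "Guido van Rossum": ["Python", "Programming"],
    "Reptile": ["Snake"],
    "Compiler": ["Programming"],
}

# Indegree of each node, used here as an example bias for the walk.
indegree = {n: 0 for n in links}
for targets in links.values():
    for t in targets:
        indegree[t] += 1

def biased_walk(start, length, rng):
    """Random walk preferring successors with low indegree (weight 1/indegree)."""
    path = [start]
    for _ in range(length - 1):
        succs = links[path[-1]]
        weights = [1.0 / indegree[s] for s in succs]
        path.append(rng.choices(succs, weights=weights, k=1)[0])
    return path

rng = random.Random(42)
paths = [biased_walk("Python", 5, rng) for _ in range(3)]
for p in paths:
    print(" -> ".join(p))
```

Paths generated this way can then be fed into the same co-occurrence models as human click trails.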
1.2.4 Relative Relatedness Learning: Learning Semantic Relatedness from Human Feedback Problem Setting. As seen in the previous chapters, methods to extract semantic information from semi-structured and unstructured data produce reasonably good representations of human intuition of semantics. Still, certain drawbacks remain to be overcome. Although the captured information is relatively precise, fine-grained information in special cases or information exclusive to rare topical domains is rarely expressed in natural language text and thus difficult to include in vector representations. Here, it would help to adapt the learned word embeddings to human intuition by warping the surrounding vector space using background knowledge about semantic relatedness. While there also exist approaches to adapt word embeddings to semantic lexicons, i.e., information about synonymous and antonymous words, they do not capture the degree of semantic relatedness between words. However, this information is crucial for a deep understanding of human intuition of semantic relatedness. Intuitively, humans are more inclined to compare the strength of a relation to another reference value than to provide absolute judgments of that strength.
Approach. Starting with the intuition that the relative relatedness of word pairs is more important for determining the semantic quality of word embeddings, we gradually develop an algorithm to adapt a pre-trained set of word embeddings to such relative constraints. Because we learn the relative relatedness of word pairs, we call our algorithm Relative Relatedness Learning (RRL). RRL is based on a linear metric learning approach. In order to show the usefulness of RRL, we evaluate it on several pre-trained word embedding datasets and train it on several semantic relatedness datasets. Furthermore, we combine it with an algorithm that incorporates lexical knowledge into the embeddings to discover potential synergy effects. After purposefully training on wrong information to determine if we can also manipulate word embeddings in a negative direction, we attempt to transfer the knowledge contained in one semantic relatedness dataset to another to see if we can generalize the obtained knowledge. Finally, we evaluate RRL on a question answering task.
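To illustrate the core idea of learning from relative comparisons, here is a deliberately simplified sketch: instead of RRL's linear metric learning formulation, it directly nudges a toy embedding until a margin-based relative constraint holds. All vectors, words, and hyperparameters are made up for illustration:

```python
# Tiny "pre-trained" embeddings (2-D toy vectors, invented for illustration).
emb = {"money": [1.0, 0.2], "bank": [0.1, 1.0], "ear": [0.9, 0.4]}

def cos(a, b):
    """Cosine similarity between the embeddings of words a and b."""
    u, v = emb[a], emb[b]
    dot = sum(x * y for x, y in zip(u, v))
    nu = sum(x * x for x in u) ** 0.5
    nv = sum(y * y for y in v) ** 0.5
    return dot / (nu * nv)

# Relative constraint derived from a relatedness gold standard:
# sim(money, bank) should exceed sim(money, ear) by a margin.
constraints = [("money", "bank", "ear")]
margin, lr = 0.2, 0.05

for _ in range(100):
    for anchor, pos, neg in constraints:
        # Hinge-style condition: only update while the constraint is violated.
        if cos(anchor, pos) - cos(anchor, neg) < margin:
            # Pull the anchor towards the more related word and push it
            # away from the less related one.
            for d in range(len(emb[anchor])):
                emb[anchor][d] += lr * (emb[pos][d] - emb[neg][d])

print(cos("money", "bank") - cos("money", "ear"))
```

After training, "money" sits closer to "bank" than to "ear" by at least the margin; RRL achieves a comparable effect by learning a metric over the whole space rather than moving individual vectors.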
Findings and Contributions. The main contribution is an algorithm to adapt pre-trained word embeddings to human intuition of semantic relatedness, the Relative Relatedness Learning algorithm (RRL). We find that we can exploit relative comparisons of word relatedness scores (as provided by WS-353 or MEN, cf. Section 4.4) to significantly
improve the fit of word vectors to human intuition of semantics and even outperform the state-of-the-art correlation with the MEN dataset. We also show that a combination of RRL and another approach called Retrofitting is even more useful to improve the semantic information in word embeddings. Here, we can achieve an unprecedented correlation score of 0.904 on our MEN test data. Unfortunately, we are unable to transfer the knowledge in one relatedness dataset to others, which we attribute to the different kinds of represented semantic information. However, we can also show that when using specific word embedding datasets, we can even improve the quality of a question answering task.
1.3 Thesis Outline
We will now describe the structure of this thesis to give the reader an overview of each chapter. A sketch of the interdependence of all chapters is given in Figure 1.1. Chapter 1 motivates this work, describes the research topics and contributions for each topic, and finally gives an overview of this work. Chapter 2 provides an extensive overview of the related work on the central topics of this thesis, namely extracting and learning semantics from social media data under consideration of user influence. We will survey approaches which extract semantic relatedness from several sources, such as folksonomies or Wikipedia, subsequent extensions which fine-tune distributional models, and then present works shedding light on the mutual influence of user behavior and semantics. In Chapter 3, we will cover the necessary theoretical and technical background to understand the methods discussed and introduced in this work. Here, we first introduce the data structures used in this thesis, then the methods on which we build our research, and finally methods to measure and compare user behavior. Chapter 4 then introduces and describes in detail the used datasets. We distinguish between text-based and link-based datasets, pretrained word embedding datasets, and ground truth datasets. Chapters 5 to 8 constitute the main contribution of this work. Their contents have already been elaborated on in Section 1.2, so we will only briefly describe each chapter. Chapter 5 focuses on the existence and strength of a semantic component in human navigation, both on Wikipedia and BibSonomy. Chapter 6 then extends established methods to extract and evaluate semantic information from social tagging data (as proposed in [26] and [48]), before Chapter 7 broadly introduces human navigation in social media as a viable source of semantic information.
Finally, Chapter 8 addresses the problem of imprecise semantics by proposing an algorithm to learn semantic relatedness from human feedback. Chapter 9 concludes this work with a recapitulation of all chapters and a discussion about potential applications where our research could be put to use. We give pointers to future research directions about influencing factors on extractable semantics, potential new approaches to extract high-quality semantics, and challenges in supervised learning of semantic relations using background knowledge.
[Figure 1.1 panels (plots and sketches, not reproducible in text): Chapter 5: Influence of Semantics on Human Navigation Behavior, with marginal-likelihood plots for WikiGame, Wikipedia, and social tagging data (Sections 5.2, 5.3, 6.2); Chapter 7: Capturing Semantics from Human Navigation and Social Tagging (Chapter 6), with co-occurrence sketches (Sections 7.2.2, 7.2.3, 7.3, 6.5); Chapter 8: Learning Semantics from Human Feedback, with a Relative Relatedness Learning sketch.]
Figure 1.1: General thesis outline. In this thesis, we focus on methods to handle semantics in navigation and social tagging data. We contribute an analysis of navigation in social media, namely in Wikipedia and BibSonomy, with regard to a semantic component in navigation. After having shown the existence of this component, we propose and evaluate several methods to make the semantics in navigation visible. Additionally, we discuss how to evaluate semantics in social tagging data and propose a new way to represent them as vectors. Finally, we learn from human intuition to improve the semantic quality of word embeddings.
Chapter 2
Related Work
Although the basic idea of computing semantics and the application of computing technology to linguistics dates back much further, to the beginning of the Cold War period [109], it only gained traction with the advent of increasingly powerful computers with adequate processing capabilities to quickly iterate over large portions of text. Especially after Berners-Lee, Hendler, and Lassila presented their idea of a Semantic Web in 2001 [31], research has continuously produced a vast amount of convincing results on how to extract and formalize semantic information from a wide range of sources, such as natural language text as well as many other forms of user-contributed content.
Section 2.1 references different computational approaches to model the semantic information contained in various kinds of data. Because the main part of this thesis focuses on WordNet, Wikipedia and folksonomies as sources for semantic information, we only consider works that are based on data from these systems. After this, Section 2.2 presents the current state of the art of methods which learn and adapt so-called word embeddings, that is, dense vector representations of words and their contexts. These methods are relevant for Section 6.5 and Chapter 8, where we explore the semantics of tag embeddings and propose a way to improve word embeddings by incorporating semantic background information. Section 2.3 discusses methods to measure the degree and the effects of the mutual influence of user behavior and semantics in social media. Again, we put a special emphasis on folksonomies and Wikipedia. In the final Section 2.4, we give a brief outlook on potential applications of computational semantics in several scenarios. Although this thesis mainly contributes methods to model semantic information, we feel that it is important for the reader to see how these models can actually be applied to a wide range of more practical uses, such as word sense disambiguation, sentiment analysis or query expansion problems. Section 2.5 concludes this chapter by recapitulating the presented topics and highlighting the most important works for this thesis.
2.1 Extracting Semantics from Structured and Semi-Structured Data
This section will give an overview of related work that extracts semantic information from language and linguistic sources. Although there are many methods that deal with modelling semantic information in natural language, we will restrict ourselves to works utilizing data from WordNet, social tagging systems and Wikipedia, as these systems are heavily used in the main part of this thesis. We especially provide an extensive view on Wikipedia’s semantics, since a large portion of this work covers the extraction of semantic relatedness information from Wikipedia (e.g., Section 7.2). A detailed definition of the structure of all mentioned systems will be given later in Section 3.1.
2.1.1 Leveraging WordNet’s Taxonomy to Compute Semantic Relatedness WordNet is a large lexical database of verbs, nouns and adverbs [72, 159]. Due to the highly precise semantic and linguistic information provided by WordNet, several researchers proposed semantic similarity measures based on WordNet. Some well-known and frequently used semantic similarity measures are Hirst and St-Onge’s path-based measure [105], the information content based distance by Resnik [195] and lastly the Jiang-Conrath distance [112], which combines information content and path distance features. Budanitsky and Hirst analyzed and compared all of them by evaluating them on human intuition [44]. Their results show that the Jiang-Conrath measure outperforms its competitors in measuring the semantic similarity of WordNet concepts and additionally exhibits a very high correlation with human intuition. Consequently, we will utilize the Jiang-Conrath measure (or at least the similarity scores that it produces) in Section 6.2 as one alternative in a comparison of semantic similarity gold standards in order to evaluate the semantic similarity of tags. A similarity measure that exploits WordNet’s graph structure more intelligently than [105] was proposed by Hughes and Ramage [108]. They treat the WordNet graph as a Markov chain and use a generalized PageRank variant to calculate a stationary distribution per word. Then they interpret a modified version of the Kullback-Leibler divergence between the distributions of two concepts as their semantic similarity score. Ramage, Rafferty, and Manning extend this approach to short text similarity [191]. Obviously, all those methods rely on the vocabulary and structure of WordNet. They are thus unable to deal with out-of-vocabulary (OoV) words. In order to deal with such OoV words, Agirre et al.
combine a WordNet-based approach reminiscent of that of [108] with two distributional approaches, namely bag-of-words and window-based word co-occurrence counting on a corpus of 1.6 Terawords [2]. On a monolingual word similarity task, they achieve a relatively high correlation with human intuition of 0.66.
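For reference, the Jiang-Conrath distance can be sketched directly from its definition, dist_JC(c1, c2) = IC(c1) + IC(c2) − 2·IC(lcs(c1, c2)) with IC(c) = −log p(c), where p(c) is the corpus probability of a concept and its descendants. The taxonomy and frequencies below are toy values, not WordNet data:

```python
from math import log

# Toy IS-A taxonomy and corpus frequencies (all values invented).
parent = {"dog": "canine", "wolf": "canine", "canine": "animal",
          "cat": "feline", "feline": "animal", "animal": None}
freq = {"dog": 30, "wolf": 20, "cat": 25, "canine": 2, "feline": 1, "animal": 2}

def ancestors(c):
    """Path from a concept up to the root, including the concept itself."""
    path = []
    while c is not None:
        path.append(c)
        c = parent[c]
    return path

# p(c): probability mass of a concept = frequency of c and all its descendants.
total = sum(freq.values())
def p(c):
    return sum(f for w, f in freq.items() if c in ancestors(w)) / total

def ic(c):
    """Information content: rarer (more specific) concepts carry more IC."""
    return -log(p(c))

def jiang_conrath_distance(a, b):
    """dist_JC = IC(a) + IC(b) - 2 * IC(lowest common subsumer)."""
    lcs = next(c for c in ancestors(a) if c in ancestors(b))
    return ic(a) + ic(b) - 2 * ic(lcs)

print(jiang_conrath_distance("dog", "wolf"))
print(jiang_conrath_distance("dog", "cat"))
```

With these toy counts, "dog" and "wolf" (sharing the specific subsumer "canine") are closer than "dog" and "cat" (whose only shared subsumer is the root).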
2.1.2 Capturing Semantic Information from Folksonomies Folksonomies are structures emerging from social tagging systems and have played an important role in the development of the Web 2.0. However, they are also closely
connected with the Semantic Web. In this thesis, we work with both static folksonomy data as well as human navigation data on the link network of social annotation systems. This link network is also based on the folksonomy structures.
Connections between folksonomies and the Semantic Web. Wu, Zhang, and Yu aimed to enhance the Semantic Web vision of semantically annotated resources by utilizing the emergent semantics of social tagging systems [252]. Instead of first defining an ontology to then annotate web resources, they argue to exploit the emergent hierarchical and semantic structures of already existing annotations in social tagging systems to construct taxonomies. For this, they model the tagging process as a probabilistic generative model which is learned using an EM approach. Mika went even further and argued that folksonomies already exhibit hierarchical semantic structures on their own, thus being lightweight ontologies themselves [155]. Following this, Specia and Motta explored ways to directly integrate folksonomies and their inherent semantic information into ontologies [215]. Finally, Angeletou mapped tags to concepts in existing ontologies [5].
Capturing semantic information from tagging data. Instead of directly integrating existing ontologies and folksonomies, there also exist several methods to explicitly capture and model these emergent semantics. For example, Cattuto et al. proposed several ways to measure the semantic similarity of tags by modelling them as context vectors [48]. This paper is central to the way we model tag semantics in this thesis. We will explain the different context variants in Section 3.2.2.1, discuss their evaluation approach in Section 6.2, and finally extend their methodology in Section 6.5. These context vector representations played a central role in [148], where Markines et al. compared several vector similarity measures with regard to the measured semantic similarity. They found that although mutual information of tags provided the best results, the slightly worse cosine similarity of the corresponding tag vectors scales better and is thus recommended. Benz et al. also used the tag semantics model from [48] to evaluate a range of term abstractness measures and concluded that centrality as well as entropy based measures are good indicators for measuring the semantic generality level of tags [29]. They then used the best-performing generality measure as well as the tag-tag-context vector representations from [48] to induce a taxonomy from tags in [28]. The tag taxonomy construction algorithm proposed by Benz et al. in [28] is an extension of a simpler algorithm presented by Heymann and Garcia-Molina in [103]. The added steps include both synonym combination and tag sense disambiguation. While synonyms are detected using the tag similarity measures presented in [48] and [148], a clustering approach reminiscent of that in [267] is used to determine different tag senses. In [26], Benz performed some experiments with that algorithm and found that it produces reasonable results when compared to WordNet.
We use that algorithm later on to determine the influence of user behavior on tag sense discovery in Section 6.4. In fact, we build on most of these works mentioned above to measure the semantic relatedness between tags, both in the static folksonomy (cf. Chapter 6) as well as in navigation on the link graph
of the social tagging system (cf. Section 7.3).
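The tag-tag context model underlying these works can be sketched as follows. The folksonomy posts are invented, and real systems distinguish further context variants (tag-resource, tag-user) beyond the post-level co-occurrence shown here:

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy folksonomy: (user, resource, set of assigned tags) — all invented.
posts = [
    ("alice", "r1", {"python", "programming", "tutorial"}),
    ("bob",   "r2", {"python", "code", "tutorial"}),
    ("carol", "r3", {"snake", "reptile", "python"}),
    ("dave",  "r4", {"code", "programming", "software"}),
    ("erin",  "r5", {"software", "code", "tutorial"}),
]

# Tag-tag context: each tag is represented by the tags it co-occurs with
# in the same post.
ctx = defaultdict(Counter)
for _, _, tags in posts:
    for t in tags:
        for u in tags:
            if t != u:
                ctx[t][u] += 1

def cosine(u, v):
    """Cosine similarity of two sparse context vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    n = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / n if n else 0.0

print(cosine(ctx["programming"], ctx["code"]))
print(cosine(ctx["programming"], ctx["reptile"]))
```

Even in this tiny example, tags used in similar posts ("programming", "code") end up with more similar context vectors than unrelated ones.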
2.1.3 Capturing Semantic Information from Wikipedia
Wikipedia is a large, publicly editable online encyclopedia with wide content coverage in about five million articles. It often serves as a basis for a multitude of research articles, for example in computational semantics. As we will often use Wikipedia as a source for semantic information in this work, especially in Section 7.2, we will cover related work on approaches to extract semantic information from each of the following categories: text-based approaches, approaches based on Wikipedia’s category taxonomy, link network based approaches, usage based approaches, and hybrids thereof.
Text-based approaches. Wikipedia’s article content is one of the most used natural language text datasets in research [18, 81, 135, 154], not only because of its quantity of topics, but also because of the high article quality [84, 153] and semantic stability over time [219]. One of the most well-known works about semantic relatedness on Wikipedia was presented by Gabrilovich and Markovitch [81]. They propose the Explicit Semantic Analysis (ESA) measure, which calculates semantic relatedness between Wikipedia concepts. For this, they use the plain Wikipedia article content to construct TF-IDF vectors. Concretely, they construct a term-document matrix, where the TF-IDF weighted occurrences in documents are seen as features. Hassan and Mihalcea [97] propose a similar notion by exploiting the textual context and the label of article links. A word is thus not expressed by its direct context, but rather only by the links in its immediate context. These links are then weighted using Pointwise Mutual Information. Both methods are based on the idea of using Wikipedia articles as features, thus basically representing term-document approaches. Furthermore, the Wikipedia article corpus is an often used resource to learn word embeddings [18, 135, 184]. We will use a word embedding dataset trained on the Wikipedia corpus of 2014 in Chapter 8.
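The ESA idea of using articles as feature dimensions can be sketched in a few lines. The "articles" below are invented stand-ins for Wikipedia concepts, and the weighting is a bare-bones TF-IDF rather than ESA's full pipeline:

```python
from collections import Counter
from math import log, sqrt

# Toy "articles" standing in for Wikipedia concepts (titles and texts invented).
articles = {
    "Python (programming)": "python code interpreter programming language",
    "Snake": "snake reptile python venom animal",
    "Compiler": "compiler code programming language translation",
}

# Term frequencies per article and document frequencies per word.
docs = {title: Counter(text.split()) for title, text in articles.items()}
n_docs = len(docs)
df = Counter(w for tf in docs.values() for w in tf)

def concept_vector(word):
    """ESA-style representation: TF-IDF weights of the word over concepts."""
    idf = log(n_docs / df[word]) if df[word] else 0.0
    return {title: tf[word] * idf for title, tf in docs.items() if word in tf}

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    n = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / n if n else 0.0

print(cosine(concept_vector("code"), concept_vector("compiler")))
print(cosine(concept_vector("code"), concept_vector("reptile")))
```

Relatedness between two words is then the cosine of their concept vectors; words sharing no concept dimensions score zero.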
Category taxonomy based approaches. The potential of Wikipedia’s category taxonomy for calculating semantic relatedness was shown by Strube and Ponzetto [223]. They apply several measures originally proposed for the WordNet taxonomy to the Wikipedia category taxonomy and compare the results with several baseline approaches from WordNet. By using the category structure of Wikipedia, they outperform Google count based methods and a WordNet baseline. However, they obtain the best result using Wikipedia, Google, and WordNet in combination. In fact, this has been amongst the first works to characterize the semantic relatedness of Wikipedia articles without explicitly considering their textual context. Radhakrishnan and Varma exploit Wikipedia categories as semantic tags for articles, which they use to extract semantically related word pairs [188]. While we also relate Wikipedia articles with each other in Section 7.2, we use Wikipedia’s link network instead of the category taxonomy.
Exploiting the static link network. While the article texts contain an abundance of knowledge, another strength of Wikipedia lies in the connection of knowledge by hyperlinks. These hyperlinks connect articles which are in some way related to each other. Since most articles represent semantic concepts, this link network also represents semantic knowledge, which can be extracted. Witten and Milne make use of the Wikipedia hyperlink structure in order to calculate semantic relatedness [251]. They propose the Wikipedia Link-based Measure (WLM), which is similar to TF-IDF and the Google distance measure [53], and evaluate a combination of both on article-link sets obtained from Wikipedia. Their results show that WLM outperforms both WikiRelate and ESA. Similarly, Agirre, Barrena, and Soroa [3] study Wikipedia links to determine semantic relatedness and disambiguation through random walks on the full Wikipedia link dataset instead of the subset of directed, non-reciprocal links. In this work, we only use the Wikipedia link structure as a baseline to show that human navigation on these links yields improved semantic information. Other applications of Wikipedia’s link graph have been proposed, for example, by Guo et al., who exploit it for disambiguating and linking entities to knowledge base entries [92]. Zesch and Gurevych [261] investigate the influence of temporal change in Wikipedia’s link network on its semantic content. Similar to [238], they observe that, for several semantic similarity measures, the content yielded stable semantics.
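The distance idea underlying WLM can be illustrated with made-up inlink sets. The formula below is the Normalized Google Distance applied to sets of linking articles, which captures the general idea behind WLM rather than Milne and Witten's exact implementation; the total article count is an arbitrary toy value:

```python
import math

def wlm_distance(in_a, in_b, total_articles=1000):
    """NGD-style distance between two articles, computed from the sets of
    articles that link to them; smaller means more related."""
    common = in_a & in_b
    if not common:
        return float("inf")  # no shared inlinks: treated as unrelated
    big, small = max(len(in_a), len(in_b)), min(len(in_a), len(in_b))
    return ((math.log(big) - math.log(len(common)))
            / (math.log(total_articles) - math.log(small)))

# Invented inlink sets (article ids): "Car" and "Automobile" share most
# of their inlinks, "Car" and "Opera" share only one.
inlinks_car = {1, 2, 3, 4, 5}
inlinks_automobile = {2, 3, 4, 5, 6}
inlinks_opera = {2, 7, 8, 9}
```

Because the measure only needs set intersections of link lists, it is far cheaper to compute than text-based measures such as ESA.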
Hybrid approaches. Finally, there exist approaches which combine both the article contents and the link graph information. In [258], Yazdani and Popescu-Belis augment the Wikipedia link network with links based on semantic relatedness and then perform random walks on that network. Finally, they approximate the resulting visiting probability distribution for each node (and sets of nodes for text fragment similarity) in an embedding space. Their results indicate a significant improvement in correlation with human intuition compared to LSA when they weigh the constructed semantic links higher than Wikipedia’s hyperlink structure in their relatedness measure. Zheng et al. make use of content and structure of Wikipedia, considering inlinks, outlinks, categories, and Dice-, Google- and TF-IDF-based measures to compute semantic relatedness. They combine all measures using supervised machine learning to calculate individual weights for all approaches [266]. Using the optimal weights, they achieve a correlation of up to 0.72 with WS-353. Unfortunately, the authors do not provide information about where they obtained their training data. In [259], Yeh et al. apply the method from [108] on Wikipedia, i.e., calculating a Personalized PageRank on the Wikipedia graph. This resulted in small improvements upon other relatedness measures like ESA [81]. Lastly, Dallmann et al. [55] investigated random walks on the Wikipedia link graph in order to extract semantic relatedness information. They found that even a simple uniformly primed random walk already produced competitive evaluation scores. This work will be presented in Section 7.2.4. Additionally, we perform further experiments with non-uniformly primed random walks, amongst these a random walker that navigates towards articles whose content is similar to the current article. This way, we try to artificially reproduce the semantic information content of human navigation.
2.1.4 Evaluating Extracted Semantic Information
Naturally, there exists a plethora of approaches to evaluate whether semantic information models fit human intuition. In [44], Budanitsky and Hirst identified three types of evaluation scenarios: first, a theoretical evaluation of mathematical properties of a measure, as for example proposed in [137]; second, comparison with human judgment of word similarity; third, the use of an extrinsic task, i.e., an external framework that makes use of models of semantic information. In the following, we present works about the latter two approaches, and additionally describe two other approaches found in the literature, namely analogy testing, which has only recently been proposed, and multiple choice testing.
Word Similarity. Probably the most widespread evaluation approach is to compare similarity scores of word pairs, as produced by a semantic representation model, to human judgments of the semantic similarity of those pairs [18, 44, 81, 97, 210]. The first datasets containing such scores were proposed in 1965 [199] and 1991 [160]. It then took some years until the larger and today very famous WS-353 (short for WordSimilarity-353) dataset appeared [75], followed by several other datasets [43, 104, 143, 171]. The actual evaluation score is determined by computing the correlation between human and model scores. We will use this method heavily in this work and describe it in detail in Section 3.2.4; the human intuition datasets used are described in Section 4.4. Furthermore, we introduce a new semantic relatedness dataset with human judgments to explicitly measure the semantic information of domain-specific vocabulary from BibSonomy. There also exists another evaluation approach based on measuring word similarity, albeit indirectly: Cattuto et al. proposed to interpret the scores of the Jiang-Conrath measure on WordNet as a close approximation of human intuition [48]. However, they did not compute the correlation of their model scores with Jiang-Conrath scores, but rather the average Jiang-Conrath distance of tags to their closest concept as determined by the model. In Section 6.2, we compare this evaluation approach with evaluation on human intuition. In [148], Markines et al. then again computed the correlation between model scores and Jiang-Conrath scores, i.e., they combined both approaches.
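The evaluation protocol itself reduces to a rank correlation between human gold scores and model scores. A self-contained sketch with hypothetical scores follows; the tie-aware ranking mirrors what library routines such as `scipy.stats.spearmanr` compute:

```python
def rank(xs):
    """Assign average ranks (ties share their mean rank)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend over a run of tied values
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(human, model):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    rh, rm = rank(human), rank(model)
    n = len(human)
    mh, mm = sum(rh) / n, sum(rm) / n
    cov = sum((a - mh) * (b - mm) for a, b in zip(rh, rm))
    sd = lambda r, m: sum((a - m) ** 2 for a in r) ** 0.5
    return cov / (sd(rh, mh) * sd(rm, mm))

human = [9.8, 7.4, 3.1, 0.9]    # hypothetical gold judgments per word pair
model = [0.91, 0.55, 0.60, 0.05]  # hypothetical model similarities
rho = spearman(human, model)
```

A coefficient of 1 would mean the model orders the word pairs exactly as the human annotators did.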
Analogy Testing. One standard way of evaluating semantic relatedness measures or word embeddings is to measure how well analogies are found in vector space [135, 156]. For example, given a 4-tuple of words (a, b, c, d), the goal is to match d as closely as possible with the vector b − a + c. This task was inspired by Turney [229]. The most prominent example and motivation for many word embedding papers was given in [156] by the illustrative equation king − man + woman ≈ queen. Technically, this simply tests whether the embedding algorithm can also encode implicit relations into the word embeddings. Conceptually, this leads to embedding explicitly named relations, as done in knowledge graph embedding methods such as TransR, TransE, or TransH, to name just a few.
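A minimal sketch of such an analogy query over a tiny, hand-made embedding table; the vectors are invented for illustration, whereas real evaluations use trained embeddings and large candidate vocabularies:

```python
import math

# Hypothetical 3-d embeddings, constructed so the analogy works out.
vec = {
    "king":  [0.8, 0.9, 0.1],
    "man":   [0.7, 0.1, 0.1],
    "woman": [0.7, 0.1, 0.9],
    "queen": [0.8, 0.9, 0.9],
    "apple": [0.1, 0.8, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c):
    """Return the word d maximizing cos(d, b - a + c), excluding the query words."""
    target = [x - y + z for x, y, z in zip(vec[b], vec[a], vec[c])]
    candidates = [w for w in vec if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vec[w], target))
```

Excluding the three query words from the candidate set is the usual convention, since the query words themselves often lie closest to the offset vector.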
Multiple Choice Tests. Multiple choice tests are a standard question type, for example in IQ tests. A variant of this is to let the word embeddings take a TOEFL test [77], which is a scenario similar to the analogy tests, except that here the odd one out has to be found in a group of words instead of the correct missing word. Levy, Bullinaria, and McCormick recently proposed a new vocabulary test to evaluate semantic vectors on, as they identified some practical problems in the use of multiple choice tests as general evaluation measures [133].
Extrinsic Evaluations. Finally, it is also possible to indirectly evaluate semantic word embeddings by applying them in an extrinsic task. For example, Wieting et al. used their word embeddings in a sentiment analysis task by training a convolutional neural network with them [250]. In turn, this neural network had to classify sentences regarding their sentiment [118]. In a different setting, Both, Thoma, and Rettinger [39] applied word embeddings to a knowledge base completion task. Lastly, Chen et al. proposed a framework for question answering by reading Wikipedia articles [49]. Their framework can take a set of word embeddings as input, so it can also be used as an extrinsic evaluation task. We will present a broader overview of several applications of semantic information models in Section 2.4.
2.2 Learning Semantics with and without Background Information
As opposed to the simple extraction of semantic relatedness information from text or knowledge bases, as done by the approaches presented in the previous section, a lot of work has lately been done towards learning such representations, in both unsupervised and supervised settings. We present approaches to learn and refine word embeddings using external knowledge, thus covering both unsupervised and supervised learning approaches. Here, we also discuss approaches that include external knowledge directly in the embedding process. Because co-occurrence information between words can also be depicted as a graph, we additionally present graph embedding approaches. In Section 6.5, we first train embeddings from tagging data to investigate if these representations can capture the semantic information of tagging data better than high-dimensional co-occurrence count vectors. Finally, in Chapter 8, we propose an algorithm, based on a linear metric learning approach, to adapt word embeddings to external knowledge about semantic relatedness.
2.2.1 Learning Word Embeddings
While count-based methods as presented in Section 2.1 are able to capture semantic information from language corpora quite well, they remain more or less sophisticated heuristics. The following word embedding approaches, on the other hand, are optimized for a given task, e.g., to predict the context of a word or to maximize the information content of the vector model. All of the models in this section can be seen
as unsupervised approaches, i.e., they do not exploit external knowledge to learn word embeddings.
Encoding semantic information in word embeddings. Ever since Mikolov et al. published Word2Vec in 2013 [156], the success of so-called word embedding models has been unstoppable. In their seminal work, Mikolov et al. showed that low-dimensional word models can be constructed without costly factorization of the huge, sparse matrices that appear in co-occurrence models [46, 230], and also without training a complete language model [25] with a lot of overhead and a very long training time. For this, they used two shallow neural networks with only one hidden layer to predict words from their contexts or vice versa. Since then, Word2Vec has established itself as the de facto baseline for virtually every NLP paper that touches word embeddings. This is hardly surprising, as Baroni, Dinu, and Kruszewski showed in [18] that word prediction models generally outperform co-occurrence counting vector representations on a number of tasks, including determining word relatedness, synonym detection, concept categorization, and analogy detection. In their work, they argue that “this new way to train [distributional semantic models] is attractive because it replaces the essentially heuristic stacking of vector transforms in earlier models with a single, well-defined supervised learning step” [18].
Count-based vs. predictive embedding models. On the other hand, Pennington, Socher, and Manning proposed a simple model called GloVe [184], which makes use of global first-order co-occurrence statistics to construct word vectors. In their experiments, they in turn outperformed Word2Vec, thus directly contradicting Baroni’s results. In [135], Levy and Goldberg then performed an extensive comparison of count-based and prediction-based word representation models: they explored the parameter space for PPMI matrices, SVD, Word2Vec’s Skip-Gram method, and the GloVe algorithm, and compared the resulting vector representations with regard to their performance on word relatedness and analogy tasks. They find that, in contrast to the findings of [18] and [184], predictive embedding methods such as Word2Vec do not necessarily outperform count-based methods; rather, both are on par with each other if the training parameters are chosen correctly. With these results, they finally gave recommendations on how to get the most out of self-trained word embedding models [134].
Encoding semantic information using graph embeddings. Another potential way of embedding words is to construct a co-occurrence graph and then apply a graph embedding algorithm, such as DeepWalk [185], LINE [227], node2vec [87], or HARP [50]. The DeepWalk [185] algorithm by Perozzi, Al-Rfou’, and Skiena also aims at embedding nodes in low-dimensional vector spaces. However, instead of sophisticated learning algorithms, DeepWalk conducts uniform random walks on the graph and then trains a Word2Vec model on these walks. This way, they generate context for nodes and obtain embeddings whose similarities reflect node distances in the graph. Building on that, [87] by Grover and Leskovec modifies DeepWalk in such a way that the
random walks are not completely random anymore, but follow a depth- or breadth-first search strategy. Tang et al. proposed the LINE algorithm [227], which addresses the problem of embedding very large information networks with up to millions of nodes into low-dimensional vector spaces. LINE preserves both the local and global properties of the graph, i.e., first-order node proximities (edges) as well as second-order proximities, which are represented by shared neighborhoods. They perform several experiments on large real-world datasets and outperform DeepWalk as well as Mikolov’s Skip-Gram, while also using less training time. Finally, the HARP algorithm proposed in [50] by Chen et al. builds upon DeepWalk, LINE, and node2vec by embedding not only the neighborhood of a node, but also the hierarchical structure of the graph. Using this approach, they are able to achieve an improvement of up to 14% in Macro F1 score compared to the original implementations.
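The central trick of DeepWalk, treating truncated random walks as sentences for a Skip-Gram model, can be sketched as follows; the graph is a toy example and the actual Word2Vec training step is left out:

```python
import random

# Toy graph as adjacency lists (hypothetical; every node has an out-link).
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e"], "e": ["d"],
}

def deepwalk_corpus(graph, walks_per_node=10, walk_length=5, seed=42):
    """Generate uniform random walks in the style of DeepWalk.
    Each walk plays the role of a 'sentence' for Skip-Gram training."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        for start in graph:
            walk = [start]
            while len(walk) < walk_length:
                # uniform choice among the current node's neighbours
                walk.append(rng.choice(graph[walk[-1]]))
            corpus.append(walk)
    return corpus

corpus = deepwalk_corpus(graph)
```

Each walk in `corpus` would then be fed to a Word2Vec implementation, so that nodes co-occurring on walks receive similar embeddings.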
Combining embeddings with taxonomies and ontologies. Lastly, there have been attempts to not only induce semantic relations between words, but even combine these relations into complete taxonomies. While Fu et al. proposed a vector space transformation approach to determine hierarchical relations from word embeddings [78], Ristoski et al. built on that approach to induce large-scale taxonomies. Nickel and Kiela went a step further and developed an algorithm to directly embed such hierarchical relations into a Poincaré disk space [168]. However, this discards information about the semantic similarity relations between words. Lately, some researchers extended the idea of graph embeddings to knowledge graphs, i.e., ontologies with different and explicitly named relations. For this, several algorithms have been proposed to embed the relations between two entities in the entity space as well. Well-known examples are TransE [37], TransG [256], Flexible Translation [73], and TransH [241]. We will not explain them here, since this is out of scope for this work. While embedding ontologies relies on the existence of ontologies, a possible option to actually find potential new relations between entities in an embedding space, in order to enrich an ontology embedding, is proposed in [80]. In this work, Fulda et al. extract “feasible applications” for entities, i.e., how these entities are used in the real world, from plain Wikipedia article text using a reinforcement learning setting and pre-trained word embeddings.
Out-of-vocabulary problems in word embeddings. A major issue of word embedding algorithms is that they learn the embeddings in an offline manner, which limits the available vocabulary. Adding new words often makes retraining the model necessary. To solve this, Luo et al. proposed an approach to learn word embeddings in an online manner [142]. Bahdanau et al. [14] use the description text of potential dictionary definitions to handle OOV words. Another option would be to train sub-word embeddings, i.e., embeddings of word parts instead of whole words, as proposed by Pinter, Guthrie, and Eisenstein [186]. This way, at least new words which can be constructed from those word parts could be added without retraining the whole model.
2.2.2 Learning to Adapt Word Embeddings to Background Knowledge
While the task to correctly determine the semantic relatedness of words or texts has been around for a long time, there are still few approaches which actually learn semantic relatedness in supervised settings. Bridging the gap between unsupervised relatedness learning approaches and human intuition by injecting background knowledge can be accomplished with post-processing methods or by injecting this knowledge directly in the embedding process.
Incorporating synonymy and antonymy in word embeddings. Faruqui et al. and Halawi et al. presented methods to fit pre-trained embedding vectors to a neighborhood defined by synonymy relations [69, 93]. Additionally, Mrkšić et al. propose to also inject antonymy constraints [163]. The aim of all three works is to maximize the similarity of synonymous words and, in the case of [163], to simultaneously minimize the similarity of antonyms. Here, synonymy constraints act as attractors in the semantic vector space, while antonymy constraints act as repellents. These constraints are often taken from so-called “lexicons”, i.e., lists of synonymous words. For example, synonym information can be taken from WordNet or the Paraphrase Database [82]. However, there rarely exist curated lists of antonyms. Wieting et al. [250] presented paraphrase models using Recursive Neural Networks to fine-tune existing word embeddings with knowledge extracted from paraphrases, i.e., segments of text that are synonymous despite differences in structure and wording. This is a similar approach to using lexicons, except that the paraphrases that Wieting et al. use are actual texts, as opposed to single words as in the approaches of Halawi, Mrkšić, and Faruqui. Yu and Dredze directly include similarity constraints from the PPDB database in the word embedding process [260]. However, they also experiment with pre-trained word embeddings, on which they then additionally train their model. In their experiments, their proposed model initialized with pre-trained word embeddings outperforms all other models in a synonym prediction task.
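The attract-to-synonyms idea can be sketched with an iterative update loosely following the retrofitting scheme of Faruqui et al.: each vector is pulled towards its lexicon neighbours while staying anchored to its pre-trained position. The vectors, the synonym lexicon, and the weights `alpha` and `beta` are invented toy values:

```python
# Hypothetical 2-d pre-trained vectors and a made-up synonym lexicon.
pretrained = {"good": [1.0, 0.0], "great": [0.0, 1.0], "evil": [-1.0, 0.0]}
synonyms = {"good": ["great"], "great": ["good"], "evil": []}

def retrofit(vectors, lexicon, alpha=1.0, beta=1.0, iterations=10):
    """Iteratively average each vector with its synonyms (weight beta)
    and its original pre-trained position (weight alpha)."""
    new = {w: list(v) for w, v in vectors.items()}
    for _ in range(iterations):
        for w, neighbours in lexicon.items():
            if not neighbours:
                continue  # words without constraints stay untouched
            for d in range(len(new[w])):
                num = alpha * vectors[w][d] + beta * sum(new[n][d] for n in neighbours)
                new[w][d] = num / (alpha + beta * len(neighbours))
    return new

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sum(a * a for a in u) ** 0.5 * sum(b * b for b in v) ** 0.5)

fitted = retrofit(pretrained, synonyms)
```

After fitting, good and great have moved towards each other, while evil, having no synonyms in the lexicon, keeps its pre-trained vector; an antonymy-aware variant would add a repelling term with the opposite sign.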
Incorporating the strength of semantic relations into word embeddings. One fundamental issue of the above-mentioned approaches is that they cannot distinguish the actual degree of similarity or dissimilarity. For example, great and good are semantically similar, as are good and amazing. However, the intensity of similarity, i.e., the strength of the relation, is greater between great and good, as for example the sentiment distance between these two words is smaller than between good and amazing. The following approaches are able to differentiate better between similarity and dissimilarity, as they also include the “intensity” or degree of similarity. Kim, de Marneffe, and Fosler-Lussier propose to learn a vector space transformation where, on the one hand, similar words are closer in space, but which also includes “intensity information” in the learning process [117]: in a cluster of closely related words, they order the words by intensity and base their metric on that. The relative intensity of two words x and y is found by counting the number of n-grams “x but not y” in Google searches, e.g., good but not great, which indicates that y is more intense than x. Niebler et al.
proposed an approach to learn semantic relatedness from human judgment of word pair relatedness [170]. However, they do not use the actual judgment scores, but, in a similar fashion as the Spearman rank correlation coefficient, exploit the relative order of word pairs as determined by the judged scores. Their results show significant improvements in rank correlation upon those of the vectors they learned from. However, it is not clear if this improvement would also notably benefit a human-centered application, i.e., if humans would even notice the quality increase of the applied word embeddings. We will introduce this algorithm in detail in Chapter 8.
Incorporating external semantic knowledge in the embedding process. There also exist methods which incorporate background knowledge directly into the embedding process. In [177], Ono, Miwa, and Sasaki proposed a word-embedding based approach to automatically detect antonyms in a supervised setting. For this, they include antonym and synonym constraints in the embedding process and then use these embeddings to predict word similarity. Bian, Gao, and Liu leverage morphological knowledge, i.e., information about word prefixes or syllables, in the embedding process to overcome vocabulary incompleteness and word ambiguities [33]. While this approach improved results greatly on the word analogy task compared to Word2Vec, it performed similarly well on word relatedness and sentence completion tasks. In [167], Nguyen, Walde, and Vu integrate antonym and synonym relations directly into a Skip-Gram model (https://github.com/nguyenkh/AntSynDistinction). They do this by additionally strengthening the most salient features for determining word similarity.
Relation of word embedding post-processing methods to metric learning. Finally, it is interesting to point out that, while most authors do not mention it, these approaches are very similar to and sometimes even inspired by well-known metric learning algorithms. The counter-fitting approach by Mrkšić et al. in [163] uses a very similar objective to the Large-Margin Nearest Neighbor metric learning algorithm proposed by Weinberger and Saul [243], as in both works similar vectors are attracted to and dissimilar vectors repelled from each other. Also, in Chapter 8, we explicitly state that the Relative Relatedness Learning algorithm is largely inspired by the Least-Squares Metric Learning algorithm by Liu et al. [139].
2.3 Mutual Influence of Semantics and User Behavior
Language is an intrinsic part of how humans express their thoughts about the world. It is driven by expressing the conceptualizations that humans make when trying to understand their surroundings. While the semantics of human-generated web content are obviously influenced by human behavior, it is not as trivial to characterize semantics as an influence factor on behavior. Addressing this question, Pirolli and Card [187] proposed their Information Foraging Theory, which describes how humans search for information. It is based on the assumption that humans follow an information scent to find
the most lucrative gain in information and are able to adapt to external circumstances to maximize that gain [51]. Since we will investigate user behavior and its influence on semantic information in many parts of this thesis, we will give an overview of works regarding the mutual influence of semantics and user navigation.
2.3.1 Modelling (Semantic) Influence on Human Navigation Behavior
As already mentioned in the introduction, Chi et al. argued in [51] that users follow an information scent to reach their goals in navigational settings. Building on this, Berendt [30] used site semantics to support and improve navigation on business websites. Juvina and Oostendorp show that navigational models should also consider semantic and structural knowledge in order to find suitable models of human navigation on the web [114].
Modelling navigation behavior using Markov chains. A promising approach to model human navigation is provided by Markov models, which have previously been used for clickstream data by Bollen et al. [36]. Singer et al. noted that although Markov chains of higher order are too complex and inefficient to be actually useful, the memoryless Markov model of order 1 accounts well for navigation on the Web [208]. Singer et al. then presented the HypTrails method, which is used to compare different hypotheses about user navigation [207]. The HypTrails approach has also been successfully applied by Becker et al. to navigational paths on the MicroWorkers crowdsourcing platform [19] as well as to real-life movements derived from temporal and geospatial information from Flickr photos [22]. Furthermore, it has been implemented in a distributed fashion [21]. Finally, Becker et al. extended this model to mixed hypotheses of potentially longer orders [20]. Both the HypTrails and MixedTrails approaches make it possible to measure the semantic influence on navigation. In fact, in [20] and [207], the authors show that game navigation on Wikipedia, last.fm, and Yelp is, at least to some extent, driven by semantics. In Chapter 5, we utilize HypTrails to find and quantify a potential semantic component that drives human navigation behavior in information networks, namely Wikipedia and BibSonomy.
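A first-order Markov model of clickstreams amounts to a table of normalized transition counts; the sessions below are invented for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical clickstreams: each session is a sequence of visited pages.
sessions = [
    ["Home", "Cats", "Dogs"],
    ["Home", "Dogs"],
    ["Home", "Cats"],
]

def first_order_transitions(sessions):
    """Maximum-likelihood estimate of a first-order Markov chain:
    P(dst | src) = count(src -> dst) / count(src -> *)."""
    counts = defaultdict(Counter)
    for s in sessions:
        for src, dst in zip(s, s[1:]):
            counts[src][dst] += 1
    return {src: {dst: c / sum(cs.values()) for dst, c in cs.items()}
            for src, cs in counts.items()}

P = first_order_transitions(sessions)
```

Frameworks like HypTrails then compare such empirically estimated transition probabilities against transition matrices derived from hypotheses, e.g., "users move to semantically similar pages".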
Structural features of human navigation. Lamprecht et al. [128] identify features in the structure of Wikipedia articles which influence user navigation on Wikipedia. In general, they find that users tend to click links which are located near the top of an article page and which are at the same time more general than links in the lower parts of articles. Dimitrov et al. investigate unconstrained user navigation on Wikipedia [60]. They find that humans prefer navigating towards less connected links in the network, but still navigate semantically. Similar to [128], they also find that the position of links in an article influences navigation, indicating a “visual top-left preference” for links. Finally, in [61], Dimitrov et al. extend the analyses from [60] with a hypothesis-based investigation of general user navigation behavior on Wikipedia. They find that users navigate towards the periphery of the Wikipedia link network, i.e., towards less connected pages, as well as towards semantically similar articles, but mainly if these are visible to the user. In Section 5.2, we build upon some
of these works, where we analyze and compare the navigation behavior of users in Wikipedia in a game setting and an unconstrained setting. This in part overlaps with some of the cited works above; however, we put a special emphasis on the strength of semantic navigation in Wikipedia.
2.3.2 Folksonomy Usage Patterns and their Influence on Semantics
The semantic knowledge contained in the tag distributions of folksonomies is as diverse and subjective as the users that create this knowledge. This knowledge emerges and stabilizes through constant usage, after which we can extract semantic information, but also analyze user behavior and motivation and their influence on this semantic information.
Stabilization of semantic information through regular usage and contribution. In [149], Mathes hypothesized that tag distributions follow a power law. Subsequently, Golder and Huberman showed in [85] that after a certain time span, regularities in user activity, tag frequencies, and relative frequency proportions can be observed. Wagner et al. [238] stated that social interaction in tagging systems allows the inherent semantics to stabilize faster than in tagging systems without social interaction. This is largely attributed to the fact that social interaction options allow users to copy other users’ posts, thus easily contributing to the stabilization process. In turn, this stabilization motivated further investigations of tagging systems, especially about the effective extraction of semantically stable content [28, 29, 48] and the motivation of tag usage [4, 94, 122, 221].
Analyzing usage of social tagging systems. Heckner, Heilemann, and Wolff conducted a user survey on the motivation for using social tagging systems, that is, whether users store resources for their own retrieval or for social sharing purposes [100]. That survey showed that on Delicious, retrieval of one’s own resources was the more important purpose of using social tagging systems. Fu et al. [79] argued that in social tagging systems, the fact that users can see the posts of other users leads to imitation effects: in their experiments, they show that if users can see the tags of like-minded users, they tend to assign the same or at least semantically similar tags. While there exists a large amount of literature on tagging systems, to the best of our knowledge, only little work utilizes and analyzes log data from a tagging system. Millen and Feinberg [157, 158] investigated user logs of the social tagging system Dogear, which is used internally at IBM. They found strong evidence for social navigation, i.e., users looking at posts from other people instead of mainly their own. Damianos et al. also proposed a social tagging system for a corporate intranet and utilized log data to characterize users according to their behavior [56]. In [64], a thorough study of user behavior in BibSonomy and its potential influences was presented by Doerfel et al. The study is based on four assumptions about the usage of social tagging systems, e.g., the assumption that users are social and provide content to share it with other users instead of storing it for self-retrieval. Furthermore, Niebler et al. investigated several hypotheses about navigation in a social tagging system [173]. While (similar to the
results of [100]) users indeed mostly navigated on their own pages, they did so by using semantic aspects, thus uncovering a semantic component in human navigation on social tagging systems. We will extend the findings of [64] and [173] in Section 5.3, mainly focusing on the explanation of navigation behavior with hypotheses.
Identifying and classifying tagging motivations and their impact on tagging semantics. To characterize usage pragmatics in social tagging systems, i.e., the tagging motivations of users, Strohmaier, Körner, and Kern introduced two user prototypes [222]: first, the categorizers, who assign selected tags from a carefully designed and controlled vocabulary, meant to categorize the posted resources for better retrieval afterwards; second, the describers, with completely opposite behavioral traits, such as describing resources with a free and large vocabulary, but without any structure. In consequence, Körner et al. explored the impact of categorizers and describers on the quality of semantic representations extracted from tagging data [122]. Their results show that describers, due to their rich vocabulary and expressiveness, generate more useful data for semantic extraction. In [271], Zubiaga, Körner, and Strohmaier explored the influence of tagging pragmatics on emergent social classification, finding that categorizers produce more useful tags than describers for this task. While most of these works rely on the assumption that many users contribute to the contents of social tagging systems, Lorince et al. argue that most of the content is actually produced by “supertaggers”, i.e., extremely active users [141]. Their results show that, for example in Delicious, roughly 5% of all users account for more than 50% of social annotations. They could also show that supertaggers tend to be describers. Niebler et al. defined two additional classes of users, based on the semantic generality of their tags [176], namely generalists and specialists. We will cover this work in detail in Section 6.4, where we will show that tagging pragmatics exerts influence on the performance of tagging semantics, in this case on the discovery of different tag senses.
We found that small portions of the most extreme describers and generalists produce the best tagging data for word sense disambiguation. Moreover, we also use these two new tagger classes to show in Section 6.3.3 that they are in some cases even more useful for extracting semantic relatedness than describers and categorizers, thus extending the work of Körner et al. [122]. Finally, research has also investigated the behavior of anti-social taggers, i.e., spammers, a type of user that, if not contained, has a large impact on the quality of social tagging systems in several respects. Krause et al. proposed several algorithms to detect spammers in social bookmarking systems [125], considering several feature types, such as profile features, location-based features, activity-based features, and semantic features.
2.3.3 Influence of Navigation Patterns in Wikipedia on Semantic Information Millions of users use the collaborative encyclopædia Wikipedia on a daily basis to satisfy their information needs. Naturally, most users do not contribute information, but only consume the provided contents. Understanding the users’ needs to consume that information, especially their exploration behavior in Wikipedia’s hyperlink network to
access connected information, has been extensively researched throughout many works.
Analysis of human navigation behavior on Wikipedia. West, Pineau, and Precup created a game called Wikispeedia to observe user navigation behavior [249]. They used the data collected from this game to examine the paths users take when given a navigation task. It was shown that most navigation followed a “zoning-out-homing-in” pattern, i.e., users first navigate to general pages before narrowing down on the target page. They also proposed a new semantic relatedness measure for pairs encountered in paths, but it can only measure relatedness between concepts that appeared on a path together. In [247], the authors examine a larger dataset from Wikispeedia than in [249] and develop a method to predict the next target in a path. Their findings in human navigation analysis support their previous results, which they successfully apply to target prediction, which in turn could be used to improve website navigation design. In [203], Scaria et al. predicted user abandonment of navigation games in Wikispeedia after a few clicks. Taking the semantic distance between the current page and the target page into account, they showed that unfinished paths often do not come anywhere near the target page, which in turn leads users to give up navigation. West and Leskovec proceeded to design heuristic-driven random walkers [246]. Their goal was to optimize finding the shortest possible paths for Wikipedia navigation games. Although humans were already relatively good at this task, random walkers primed with sophisticated heuristics were often able to outperform them. Using unconstrained Wikipedia navigation data, West, Paranjape, and Leskovec recommended links that should be inserted in Wikipedia [248] to improve user navigation and information gathering.
Their approach exploits human wayfinding strategies to circumvent missing hyperlinks and recommends potential links to be inserted. However, by inserting hyperlinks, they change the data needed for link-based semantic relatedness approaches (see Section 2.1.3). Lastly, Singer et al. analyzed webserver logs of Wikipedia together with results from a user questionnaire to find out why people actually use Wikipedia [209]. They found that there exists a large spectrum of usage reasons, with the most frequently chosen questionnaire response being that people had heard about a topic in the media.
Extracting semantic information from human navigation on Wikipedia. Improving upon the semantic relatedness measure proposed in [249], Singer et al. built an approach that exploits the sequential nature of pages in navigational paths to calculate semantic relatedness [210]. Here, they considered a large dataset of human navigational paths from the WikiGame2. Additionally, they show that human navigation contains a significantly increased amount of semantic information compared to the plain link network or to artificially created paths, thus demonstrating the suitability of human navigation for semantic relatedness extraction. We will cover [210] in Section 7.2.2, where we will additionally compare the original results from the WikiGame to those
2 http://www.thewikigame.com
we obtain by applying our methodology to Wikispeedia data. We find that while game navigation data in general contains very useful semantic information, a small portion of low-indegree paths contains even more precise semantics. Niebler et al. [175] built on the results from [210] and explored unconstrained navigation using the ClickStream data from the Wikimedia Foundation [255]. We will discuss this approach and its results in more detail in Section 7.2.3. We first compare game and unconstrained navigation by evaluating the semantic quality of both navigation variants. After showing that unconstrained navigation also differs from artificial navigation, we conclude that unconstrained navigation is similarly well suited to provide semantic information. However, we cannot apply the exact same methodology to extract the semantic information as we did in [210].
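The general idea of deriving relatedness from navigation paths can be illustrated by counting how often two pages are visited within a small window of each other. This is a minimal sketch of that idea, not the exact methodology of [210]; the window size, function name, and toy paths are illustrative assumptions:

```python
from collections import Counter

def cooccurrence_relatedness(paths, window=3):
    """Count how often two pages occur within `window` steps of each other
    across navigation paths; frequent co-visits hint at semantic relatedness."""
    counts = Counter()
    for path in paths:
        for i in range(len(path)):
            for j in range(i + 1, min(i + window, len(path))):
                if path[i] != path[j]:
                    # sort the pair so (a, b) and (b, a) count as the same pair
                    counts[tuple(sorted((path[i], path[j])))] += 1
    return counts

# Toy navigation paths (page titles are made up).
paths = [["Cat", "Mammal", "Dog"], ["Dog", "Mammal", "Cat", "Pet"]]
rel = cooccurrence_relatedness(paths)
```

Normalizing these counts, e.g., by individual page frequencies, would then yield a relatedness score between page pairs.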
Simulating human navigation with random walks on the Wikipedia link graph. However, the investigated dataset of unconstrained Wikipedia navigation in [175] only contains transition information instead of full-fledged navigational paths or even navigation sessions of specific users, as in the WikiGame. A potential remedy is to conduct random walks on the Wikipedia link structure to artificially augment navigational data. In [246], West and Leskovec compared the ability of random walks to that of humans in finding shortest paths. While humans, drawing on huge amounts of background knowledge, are already quite good at this task, West and Leskovec could show that algorithms with sophisticated heuristics are able to surpass humans on this task. In [55], Dallmann et al. then exploit random walks to extract semantic relatedness. While random walks generated by uniformly random navigation yield sufficiently good semantics, they do not actually encode human navigation. However, when priming the random walker with weights obtained from human navigation, the resulting pseudo-human random walks achieved competitive results in a semantic relatedness task. A related approach to [55] was presented by Zhao, Liu, and Sun [265]. They apply DeepWalk [185] to the Chinese Wikipedia and evaluate their findings on a smaller Chinese variant of WS-353, a standard evaluation dataset for semantic relatedness of words. However, there are several key differences to the approach by Dallmann et al.: for example, Zhao, Liu, and Sun consider a network consisting not only of Wikipedia pages, but also of Wikipedia categories and even words from the articles [265]. In Section 7.2.4, we will present the experiments described in [55] and extend them by priming the random walker with the hypotheses used in Chapter 5 to simulate pseudo-human navigation, in order to investigate the influence on the extractable semantics.
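The notion of "priming" a random walker with empirical transition weights can be sketched as follows. The graph, the click counts, and the function name are illustrative assumptions, not the actual setup of [55]:

```python
import random

def weighted_random_walk(graph, weights, start, length, seed=0):
    """Walk `length` steps over `graph` (node -> list of successors), choosing
    each successor proportionally to `weights[(node, successor)]`; a missing
    weight defaults to 1.0, so an empty weights dict gives uniform navigation."""
    rng = random.Random(seed)
    path = [start]
    node = start
    for _ in range(length):
        successors = graph.get(node, [])
        if not successors:
            break  # dead end: stop the walk early
        w = [weights.get((node, s), 1.0) for s in successors]
        node = rng.choices(successors, weights=w, k=1)[0]
        path.append(node)
    return path

# Toy link graph; click counts stand in for empirical human transition data.
graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}
clicks = {("A", "B"): 9.0, ("A", "C"): 1.0}  # humans strongly prefer A -> B
path = weighted_random_walk(graph, clicks, "A", 10)
```

Generated walks can then be fed to a path-based relatedness method or an embedding model in place of real navigation paths.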
2.4 Outlook: Applications of Semantic Information Models
Semantic representations of words and sentences are widely needed and used in a variety of applications and studies, e.g., correcting word spelling errors [45], text segmentation using lexical cohesion [124, 146], image [211] or document [218] retrieval, cognitive science [226], fact checking [52], and many more. It is thus important to also emphasize potential applications for models of semantic information, since they actually are the
reason why we want to make computers understand language. Here, we largely focus on word sense disambiguation, since this is the main topic of Section 6.4, where we additionally measure the influence of tagging behavior on tag sense discovery, as mentioned above. Nonetheless, we also give pointers to other interesting fields of application, such as query expansion, tag recommendation, or sentiment analysis.
Word Sense Disambiguation and Discovery. One core problem of computational semantics is to resolve word ambiguity, that is, to distinguish different word senses. In statistical natural language processing, there exist supervised, dictionary-based, and unsupervised word sense disambiguation approaches [147]. In the following, we only focus on dictionary-based and unsupervised algorithms. Dictionary-based algorithms rely on sense definitions defined in dictionaries or thesauri. One of the first dictionary-based algorithms for word sense disambiguation is the Lesk algorithm [131], which was adapted by Banerjee and Pedersen so it can be applied to WordNet [16]. It is based on word overlaps between the sense glosses3 of the disambiguated word and the sense glosses in the dictionary. Using tagging data, Angeletou, Sabou, and Motta [6] first identify a set of candidate senses for a given tag within WordNet, interpret co-occurring tags as context, and then use a measure of semantic relatedness to choose the most appropriate sense. In a similar manner, Garcia-Silva et al. use the cosine similarity between tag co-occurrence vectors and a bag-of-words representation of Wikipedia pages to identify the most suitable sense definition within DBPedia4 [83]. Lee et al. also compute a relevance score between tags and Wikipedia articles for the same purpose [130]. Liu, Yu, and Meng [140] present a new approach to evaluate the senses of words in search queries by leveraging WordNet. Using a three-step method, the authors first use WordNet to gather information about the hyponyms, synonyms, etc. of the terms found in the system. If the correct senses of all terms cannot be determined, a frequency-based approach and a web search are conducted to find the remaining missing senses of a term. Unsupervised approaches attempt to partition the context of a given term into clusters corresponding to its different senses.
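To illustrate the gloss-overlap idea behind the Lesk algorithm mentioned above, the following is a minimal sketch of simplified Lesk; the sense inventory, glosses, and whitespace tokenization are toy stand-ins for a real dictionary such as WordNet:

```python
def simplified_lesk(context, glosses):
    """Pick the sense whose gloss shares the most words with the context.
    `glosses` maps a sense id to its gloss string."""
    context_words = set(context.lower().split())

    def overlap(sense):
        return len(set(glosses[sense].lower().split()) & context_words)

    # On ties, max() keeps the first sense encountered.
    return max(glosses, key=overlap)

# Toy sense inventory for the ambiguous word "bank".
glosses = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "the sloping land beside a body of water",
}
sense = simplified_lesk("he sat on the bank of the river watching the water",
                        glosses)
# sense -> "bank/river"
```

A real implementation would additionally remove stopwords and use stemmed or lemmatized tokens, since function words like "the" otherwise dominate the overlap.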
The authors of [11] analyzed several folksonomy-derived networks with regard to their suitability to discover word senses using graph clustering algorithms. These networks are constructed by omitting one of the three folksonomy building blocks of users, resources, or tags. Zhang, Wu, and Yu [262] proposed an entropy-based metric to capture the level of ambiguity of a given tag. Si and Sun take into account a web-based measure of semantic relatedness as well as textual article content to disambiguate tags in weblogs by spectral clustering [206]. Au Yeung, Gibbins, and Shadbolt [10] use tags from Delicious to disambiguate search queries from a search engine. For a given tag, they identify tags that are related in different contexts by applying a community detection algorithm. In [28], Benz et al. apply hierarchical agglomerative clustering of co-occurring tags to identify different senses and to ultimately construct a simple is-a taxonomy from tagging data. Finally,
3 A gloss is a short description of a word sense.
4 http://www.dbpedia.org
Ratinov et al. [193] proposed a WSD algorithm which leverages the Wikipedia link network, and Arora et al. used the linear algebraic structure of word embeddings to improve word sense disambiguation [8].
Tag Recommendation. Especially in tagging systems, recommending fitting tags for new posts is a very important task, crucial for user convenience and thus for keeping users in the system. Also, by recommending fitting and already used tags, recommender systems help to stabilize tagging semantics in social tagging systems. Here, Jäschke et al. described and compared several recommendation algorithms with regard to their applicability to tagging data [111]. Doerfel and Jäschke analyzed recommendation evaluation systems [62]. Additionally, Doerfel, Jäschke, and Stumme looked at the role of k-cores in recommender system benchmarks and extended the notion of cores to set-cores [63]. In a recent work, Zoller et al. exploited log data from the social tagging system BibSonomy to recommend recently used and semantically suitable tags [268].
Search and Query Expansion. Search plays an exceptional role in the web and in fact enabled much semantic research through fields like information retrieval [42] and query expansion [234]. One very important work on search in folksonomies is presented in [106], where Hotho et al. proposed the FolkRank algorithm, enabling efficient search and ranking in folksonomies. As long as search engines served as mere keyword-matching information retrieval engines, their limitations could be mitigated by query expansion techniques. There are works based on mining semantics from search logs for query expansion [240], using semantically close terms to enrich queries [233], using knowledge bases to find suitable terms [90, 165], generating complete query substitutions [91, 113], or including social aspects into search [34, 232]. Naturally, there are also hybrid efforts to improve search by combining keyword-based and semantic search [32, 35].
Multilinguality and Translation. Multilinguality and translation are still open problems, since there is no easy way to project one language onto another. Moreover, synonyms in one language are not necessarily synonyms in another. There has been a large body of research addressing such issues; for a recent survey of multilingual word embeddings, see [200]. Koehn, Och, and Marcu first proposed a phrase-based translation model [121]. Hassan and Mihalcea [98] exploit Wikipedia’s inter-language links to calculate semantic relatedness across languages. The authors of [269] introduce bilingual word embeddings that do not depend on exact word co-occurrences in parallel training text. Faruqui and Dyer used correlations between different languages to improve the performance of semantic relatedness measurements in the single languages and named their model BiCCA, for Bilingual Correlation Based Embeddings [70]. In [231], Upadhyay et al. empirically compare several bilingual word embedding frameworks, namely BiSkip [144], BiCVM [101], BiCCA [70], and BiVCD [236]. On four tasks, the BiSkip model was largely able to outperform its competitors. In a very recent work, Zhang et al. managed to construct bilingual lexica in an unsupervised way by training adversarial neural networks. Lately,
DeepL managed to perform machine translation with extremely high quality5. However, the details of their translation system are not published due to commercial reasons.
Sentiment Analysis. Sentiment is closely related to word semantics. In [213], Socher et al. proposed a deep neural network model of semantic compositionality over a sentiment treebank. They use pretrained word embeddings to determine the sentiment of a sentence. This approach was also chosen for evaluation purposes in [69]. Kim [118] proposed a CNN architecture for text classification which accepts pretrained word embeddings as input, thus providing a practical extension to any kind of word embeddings. In fact, it can then directly be used to classify the sentiment of texts, as e.g. done in [250].
2.5 Summary
In this chapter, we addressed many points related to computational semantics on social media data. We will now conclude this chapter by recapitulating the presented topics and highlighting the most important works for this thesis. In Section 2.1, we covered work on the extraction of semantic relatedness from WordNet, folksonomies, and Wikipedia. Since we use the WordNet-based Jiang-Conrath similarity measure in Section 6.2 as a baseline for semantic relatedness, we provided an overview of works that compute semantic similarity on WordNet. Especially for Wikipedia, we presented many variants of how to extract semantics, e.g., by exploiting the link structure, the Wikipedia categories, the article texts, or combinations of all sources. The results from these previous works provide a strong argument for the choice of Wikipedia as a research corpus. Additionally, they show that not only the article texts, but also the link structure connecting these articles shows great potential as a source for semantic information. We will use that result in Section 7.2, where we use the semantics contained in the Wikipedia link structure both as a baseline and as a basis to perform parameterized random walks. Furthermore, because we discuss the evaluation and representation of tagging semantics in Chapter 6, we gave pointers to the most relevant work in this field. Next, we focused on ways to learn semantic information in Section 2.2. Concretely, we described algorithms to embed semantic information in low-dimensional vector spaces. In Section 6.5, we make use of some of those methods to learn tag embeddings. Additionally, we perform several experiments using pre-trained word embedding datasets in Chapter 8. We also presented several works on including external knowledge into such pre-trained embedding datasets, as well as into the embedding process itself.
This is especially relevant for Chapter 8, where we propose an algorithm to incorporate the knowledge from semantic relatedness datasets (cf. Section 4.4) into pre-trained word embedding datasets. Section 2.3 gave an overview of works regarding the mutual influence of semantics and user behavior in the web. We argued that although it is obvious that user behavior
5 In fact, roughly 3 times better than Google’s translation engine: https://www.deepl.com/press.html
influences the semantics of web content, it is not as trivial to see if semantics influence user behavior. Again, we put an emphasis on folksonomies and Wikipedia. The mutual influence of semantics and behavior is a recurring theme in this work. First, we analyze and quantify a semantic influence on navigation on Wikipedia and folksonomies in Chapter 5. Then we exploit the existence of semantics in navigation to construct word representations from both BibSonomy and Wikipedia navigation in several settings in Chapter 7. Finally, in Section 6.4, we measure the influence of tagging pragmatics on tag sense discovery in social tagging systems. In the last part of this chapter, we gave an outlook on different applications of semantic information models. Section 2.4 started with a description of several word sense disambiguation algorithms, before we presented works on tag recommendation, multilinguality and translation, query expansion, and sentiment analysis. In this thesis, we make use of a tag sense discovery algorithm to show the influence of tagging pragmatics on different facets of tagging semantics. Additionally, we use a question answering framework based on word embeddings to evaluate, in a realistic setting, the quality of our Relative Relatedness Learning algorithm for adapting word embeddings to external knowledge.
Chapter 3 Foundations
In this thesis, we focus on extracting semantic information from social media data. Social media data are available in various forms, mostly semi-structured or completely lacking any explicit structure. For example, tagging data from folksonomies can be seen as semi-structured data, since the folksonomy determines the assignment of tags to resources, but does not impose any other restrictions on the user. Methods from computational semantics aim to model the semantic information contained in such data. In contrast to lexical or formal semantics, computational semantics aims to represent semantic meaning and structure as mathematical structures, e.g., vectors or matrices. We can then compute semantics of more complex structures, such as navigational paths or even whole link graphs. This enables us to automatically analyze user-contributed textual content to better understand and support users in their goals in the social web. This chapter explains and defines many basic concepts of computational semantics which we apply in the remainder of this thesis. First, we introduce several types of data structures from which we will extract semantic knowledge, such as folksonomies or the Wikipedia information network. After this, we go into detail about how we can model that semantic information and how we can compute semantic relatedness, i.e., the intensity of a latent semantic relation between words, given the introduced models of semantic knowledge. In the last part of this chapter, we introduce different models of user behavior in tagging data. Also, we describe the method that we used to analyze human navigation in social media. At the end of this chapter, we summarize the presented content.
3.1 Implicit and Explicit Forms of Semantic Knowledge in the Web
Ever since the advent of the World Wide Web in 1990, millions of users have added content for public consumption. With the evolution of interactive technologies, user engagement skyrocketed and gave rise to the Web 2.0 or Social Web, which is characterized by very easy content contribution and social collaboration possibilities, even for lay users. Representative systems and websites for the Social Web already mentioned in the
introduction are Twitter1, Facebook2, or YouTube3, to mention just a few. These systems can be described by the abstract concept of social media. By actively participating in social media systems, users leave imprints of their subconscious, semantic knowledge in the contributed content. Thus, the WWW can be seen as a huge repository of human knowledge. However, depending on the nature of the social media used, this knowledge can be semi-structured, like in Wikipedia or social tagging systems, or unstructured, i.e., data with no explicit formal structure, available mostly in the form of free text, for example as posts on Twitter or Facebook. In the course of this thesis, we will be working on making this implicit semantic knowledge explicit. While there also exist highly structured knowledge repositories of explicitly characterized human knowledge, such as DBPedia4, WordNet5 and ConceptNet6, these are hardly social, as they are either carefully hand-crafted by experts or created semi-automatically. Still, we can use these hand-crafted knowledge bases as a valid gold standard of semantic knowledge. In this section, we present different types of data structures which contain human knowledge, in descending order of structuredness. First, we introduce two semi-structured types of datasets, on which we will perform the majority of our experiments in this thesis, namely Folksonomies and Wikipedia. After this, we introduce WordNet, a large lexical database of English terms. At the end of this section, we introduce ontologies and knowledge graphs, which are, although hardly constructed by social means, the most structured knowledge representation entities currently in existence. Although we do not directly use them, they motivate the analyses and methodological contributions of this thesis.
3.1.1 Social Tagging Systems and Folksonomies Social Tagging Systems have played an important role in the development of both the Semantic Web and the Web 2.0. In these systems, users collect resources and annotate them with freely chosen keywords, called tags. Examples are BibSonomy7, for collecting web links and scholarly publications, Delicious8 for bookmarks, FlickR9 for images, and last.fm10 for music. Figure 3.1 shows a screenshot of BibSonomy to illustrate an exemplary user interface of a social tagging system. Although Twitter, Instagram11 and meanwhile even Facebook also support adding so-called “hashtags” to their posts, it is not compulsory to do so, as opposed to the systems mentioned above. However, Wagner
1 https://www.twitter.com
2 https://www.facebook.com
3 https://www.youtube.com
4 https://wiki.dbpedia.org/
5 https://wordnet.princeton.edu/
6 http://conceptnet.io/
7 http://www.bibsonomy.org
8 https://www.delicious.com, ceased service in 2017
9 http://www.flickr.com
10 http://last.fm
11 https://www.instagram.com
Figure 3.1: Illustrative screenshot of the social tagging system BibSonomy. The image shows the user page of the user thoni that contains all posts in the personomy of that user.
and Strohmaier argue that even those systems could be considered social tagging systems [239]. In this thesis, we analyse two aspects of social tagging systems: First, we work on methods to model and evaluate the semantic information emerging from actively contributed content. For this, we leverage the structure emerging from social tagging systems, the so-called Folksonomy, in Chapter 6. Second, we investigate to what extent passively contributed information, concretely human navigation data in social tagging systems, can be used to build models of semantic knowledge. We do so by analyzing log data about the browsing behavior of users on the user interface of social tagging systems in Section 5.3.
3.1.1.1 Folksonomies The data structures emerging from social tagging systems are called Folksonomies. The term “folksonomy” was first coined in 2004 by Thomas van der Wal in a mailing list post12. There he introduced the term as a portmanteau of “folk” and “taxonomies”, analogous to the description of folksonomies as a “user-created bottom-up categorical structure with an emergent thesaurus”. A thesaurus is a form of a controlled vocabulary that organizes words using semantic relations of synonymy. Together with the emerging hierarchical category structure, folksonomies can thus be interpreted as taxonomies generated by folks. A first systematic analysis of these emergent semantic structures has been performed by Golder and Huberman [86]. One core finding was that the openness of these systems
12 http://vanderwal.net/folksonomy.html
[Figure: (a) Folksonomy tripartite hypergraph; (b) Social Tagging System user interface links]
Figure 3.2: Representation of a folksonomy as a tri-partite hypergraph and its corresponding links in the user interface of a social tagging system, e.g., BibSonomy. The hyperedges are defined by the tag assignment set Y, which is a ternary relation between the entity sets of tags (T), users (U), and resources (R).
did not give rise to a “tag chaos”, but led to the emergence of stable semantic patterns. Cattuto reported similar results and denoted the emerging patterns as “semantic fingerprints” of resources [47]. Finally, Wagner et al. presented several measures for semantic stability in social tagging streams, e.g., from Twitter, but also from Delicious [238]. Using these measures, they showed that tagging activity quickly led to the stabilization of the respective semantics, thus supporting the choice of tagging data as a stable source for semantic relatedness extraction. Following up, Mika even described folksonomies as light-weight ontologies [155], thus underlining the quality of the emergent semantic structures. According to Hotho et al., social tagging systems thus are able to overcome the knowledge acquisition bottleneck [106]. Consequently, there exist several works that propose methods to extract that emergent semantic information from folksonomies (cf. Section 2.1.2). In this thesis, we follow the folksonomy definition by Hotho et al. given in [106]:
Definition 1 (Folksonomy). A folksonomy is a tuple F := (U, T, R, Y) of finite sets U, T, R, and Y. The elements of U, T, and R are called users, tags, and resources, respectively. Y is a ternary relation between them, i.e., Y ⊆ U × T × R, and denotes the set of tag assignments.
One corollary of this definition is that folksonomies possess a tri-partite structure, i.e., they can be modeled using a hypergraph with three partitions. The nodes of this graph are represented by the tags T, users U and resources R, while the edges are defined by Y. Figure 3.2 shows an illustration of the folksonomy hypergraph. We can now also define Posts and a Personomy, which we will especially need later for the definition of BibSonomy web page representations:
Definition 2 (Post). A post P_ur ⊆ Y in a folksonomy F is the collection of tag assignments (u,t,r) for a given user u and a given resource r, i.e., a non-empty subset
P_ur := {(u′, t, r′) ∈ Y | u′ = u, r′ = r}. (3.1)
The tags of a post are described by T_ur = {t ∈ T | (u,t,r) ∈ P_ur}.
Definition 3 (Personomy). A personomy P_u is the set of all tag assignments by a certain user u, defined as
P_u := {(u′, t, r) ∈ Y | u′ = u}. (3.2)
The tags and resources used by user u are given by T_u := {t ∈ T | (u,t,r) ∈ P_u} and R_u := {r ∈ R | (u,t,r) ∈ P_u}, respectively.
A trivial relation between the personomy P_u of a user and her posts is given by
P_u = ⋃_{r ∈ R_u} P_ur. (3.3)
Figure 3.1 shows the personomy of user thoni. As can be seen, the personomy consists of many singular posts. For example, the first publication post of user thoni, “The Wisdom in Tweetonomies”, has been annotated with the tags twitter and folksonomy. This can be formally expressed as
P_ur := {(“thoni”, “twitter”, “The Wisdom in Tweetonomies”),
(“thoni”, “folksonomy”, “The Wisdom in Tweetonomies”)}.
These definitions will play a role in defining our models of context for tags in order to determine the semantic relatedness between tags in Section 3.2.2.1. We will further use them in Section 6.2, Section 6.5, and Section 6.4.
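As a minimal illustrative sketch of Definitions 1–3 (toy data; the user names, tags, and shortened resource identifiers are made up), the tag assignment relation Y and the post and personomy subsets can be expressed directly as set comprehensions:

```python
# Tag assignments Y ⊆ U × T × R, represented as (user, tag, resource) triples.
Y = {
    ("thoni", "twitter", "wisdom-in-tweetonomies"),
    ("thoni", "folksonomy", "wisdom-in-tweetonomies"),
    ("thoni", "web", "some-bookmark"),
    ("other", "web", "some-bookmark"),
}

def post(Y, u, r):
    """P_ur: the tag assignments of user u for resource r (Definition 2)."""
    return {(u2, t, r2) for (u2, t, r2) in Y if u2 == u and r2 == r}

def personomy(Y, u):
    """P_u: all tag assignments of user u (Definition 3)."""
    return {(u2, t, r) for (u2, t, r) in Y if u2 == u}

tags_of_post = {t for (_, t, _) in post(Y, "thoni", "wisdom-in-tweetonomies")}
resources_of_user = {r for (_, _, r) in personomy(Y, "thoni")}

# Relation (3.3): the personomy is the union of the user's posts.
assert personomy(Y, "thoni") == set().union(
    *(post(Y, "thoni", r) for r in resources_of_user))
```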
3.1.1.2 Social Tagging System User Interfaces As social tagging systems and folksonomies share many similar traits, it is obvious that the user interface of social tagging systems also bears semantic features. Especially in the context of navigation, we will now define what we understand as a (content) page in a social tagging system and how we can characterize it. The following definitions are based on the example of the user interface of BibSonomy. However, with minor modifications, the definitions still hold for any other social tagging system user interface.
Definition 4 (Page in a Social Tagging System). A content page or simply page in the user interface of a social tagging system shows information about sets of posts. These sets of posts are determined by different filters, such as a user account, a set of tags, a resource key, or a combination of those. Formally, a page can be expressed as a set of posts that share at least one coordinate:
p_(u′,t′,r′) := {(u,t,r) ∈ Y | u = u′ ∨ t = t′ ∨ r = r′}. (3.4)
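A minimal sketch of the page filter from Definition 4, using toy data (the triples and page paths are illustrative assumptions). Note that the formal definition is disjunctive: a triple is shown if it matches any given filter coordinate:

```python
def page(Y, user=None, tag=None, resource=None):
    """Tag assignments visible on a content page: all triples matching at
    least one of the given filters (cf. Definition 4)."""
    return {(u, t, r) for (u, t, r) in Y
            if u == user or t == tag or r == resource}

Y = {("u1", "web", "r1"), ("u2", "web", "r2"), ("u2", "semantic", "r2")}
tag_page = page(Y, tag="web")    # would correspond to /tag/web
user_page = page(Y, user="u2")   # would correspond to /user/u2
```

Combined page types such as /user/USER/TAG restrict the result further to triples matching both coordinates, as listed in Table 3.1.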
Table 3.1: The considered page types in the BibSonomy system. All page types are retrieval types, that is, they return entries from the folksonomy tag assignments. The formal description of each page type denotes the corresponding folksonomy subset.

page type | category | description | folksonomy subset
/user/USER | user | contains all posts from USER | P_USER
/user/USER/TAG | tag | contains posts from USER which have been tagged with TAG | {P ∈ P_USER : (USER, TAG, r) ∈ P}
/tag/TAG | tag | contains posts which have been tagged with TAG | {P ∈ F : (u, TAG, r) ∈ P}
/url/RES | resource | description page of a bookmark RES, not specific to any user | ⋃_{u ∈ U} P_u,RES
/bibtex/RES | resource | publication page of RES, not specific to any user | ⋃_{u ∈ U} P_u,RES
/bibtex/RES/USER | resource | page of a publication RES specific to USER | P_USER,RES
Although most social tagging systems contain pages that do not carry any content, we focus only on pages that carry explicitly named folksonomic content in this thesis. Additionally, the landing page of BibSonomy and full-text search results are not considered content pages, although they show folksonomic content, because the content shown is not a folksonomy subset determined by a folksonomy entity. For BibSonomy, Table 3.1 lists the considered page types as well as their corresponding folksonomy subsets. In general, each page in a social tagging system can be semantically represented by its tag cloud:
Definition 5 (Tag Cloud of a Page in a Social Tagging System). A tag cloud of a page in a social tagging system is the multiset13 of all tags viewable on that page, resulting from a multiset union of the tags of the posts in the page's post set:

TagCloud(p_{(u_0,t_0,r_0)}) := multiset{ t | (u,t,r) ∈ p_{(u_0,t_0,r_0)} }    (3.5)

Furthermore, a page in a social tagging system can have an owner, that is, the user who posted the contents on that page. For example, the page /user/thoni from Figure 3.1 shows all posts of the page's owner, the user thoni. We denote the set of pages that belong to a user u as V_u. However, there also exist pages with posts from different users, which consequently do not have an owner. Such a page is /tag/web: it shows all posts of all users that have been annotated with the tag web. This distinction as well as the
13 A multiset is a set which can contain several instances of the same item, e.g., {web, web, semantic}.
definition of pages in social tagging systems are important in Section 5.3.1, where we analyze and discuss user navigation in BibSonomy. Additionally, we make use of the tag cloud in Section 7.3 to extract semantic information from user navigation in BibSonomy. We also can now define a Social Tagging System Graph as follows:
Definition 6 (Social Tagging System Graph). A social tagging system graph is a finite graph G_F = (P, E), where F is the underlying folksonomy as given in Definition 1, P denotes the set of folksonomy pages p_{(u,t,r)} from Definition 4, and E is the set of links between those pages. The existence of links between two pages strongly depends on the folksonomy structure of the social tagging system.
Usually, the page of a resource r will contain actual hyperlinks leading to the page of the resource's owner u as well as to the pages of the tags that u has assigned to the resource. However, although we state in Definition 6 that links in the user interface originate from the folksonomy data of the social tagging system, the actual form of these links depends on the implementation of the social tagging system. For example, BibSonomy tag pages also link to related tags, although the folksonomy structure does not directly connect tags with each other (see Figure 3.2). Analogous to pages that are owned by a user, there also exists a subgraph of a user's own pages:

G^P_u = (V_u, E_u),    (3.6)

where V_u contains all pages which have the same owner u and E_u is the subset of E that connects these pages. The superscript P denotes the personomy as described in Definition 3. We will use this subgraph as well as its complement to characterize human navigation behavior in social tagging systems in Section 5.3.
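As an illustration of Definition 5, the tag cloud multiset can be sketched with Python's `Counter`; the post set below is a made-up toy example:

```python
from collections import Counter

def tag_cloud(page_posts):
    """Multiset of all tags on a page, per Equation (3.5):
    each tag is counted once per tag assignment in the page's post set."""
    return Counter(t for (u, t, r) in page_posts)

# Hypothetical posts shown on the /tag/web page
posts = {("thoni", "web", "example.org"),
         ("alice", "web", "wikipedia.org"),
         ("alice", "semantic", "wikipedia.org")}
cloud = tag_cloud(posts)
```

A `Counter` is Python's standard multiset type, which matches the multiset semantics required by Definition 5.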
3.1.2 Wikipedia

The Wikipedia encyclopedia is one of the figurehead projects of the Social Web. By encouraging all its users to contribute and edit its content, it quickly grew into one of the most frequently visited webpages in the world, while maintaining an encyclopedic precision that rivals that of the Encyclopædia Britannica [84], as already mentioned in the introduction of this thesis. At the same time, it is also the biggest encyclopedia in the world, with up-to-date content in about 5 million articles in the English Wikipedia alone. Wikipedia is also the most trusted internet resource in Germany, according to a survey by the GPRA in October 2017 with roughly 1,000 German citizens14. Not only this, it is also one of the most visited resources for scientific researchers15: 84.7% of a group of 1,354 surveyed researchers used Wikipedia as a knowledge source in their daily work. What makes Wikipedia such an interesting resource for researchers is the fact that anyone can contribute to its content and link that content to other pages with related
14 http://www.horizont.net/agenturen/nachrichten/GPRA-Vertrauensindex-Facebook-und-Yahoo-fallen-durch-Wikipedia-raeumt-ab-163201
15 http://www.zbw.eu/de/ueber-uns/aktuelles/meldung/news/social-media-forschende-nutzen-am-haeufigsten-wikipedia/
Figure 3.3: Screenshot of the Wikipedia page about “Coffee”. We can see a textual description of the general concept, links to the sense disambiguation page, and links to other related pages.
content. Figure 3.3 shows a screenshot of the Wikipedia page about "Coffee". At the same time, the Wikimedia Foundation regularly provides dumps of the whole Wikipedia universe and a large amount of usage information16. This way, a huge and freely accessible knowledge pool is created, containing not only text data about a multitude of topics, but also explicit links between somehow semantically related concepts, as well as its creation and usage history, such as page views and navigation information. Contrary to taxonomies or ontologies, Wikipedia is rather an information network, since its main goal is to enable users to discover, contribute, and disseminate information instead of providing an exact semantic representation of concepts and their explicit relations. We first define a Wikipedia page as follows:
Definition 7 (Wikipedia page). A Wikipedia page or Wikipedia article
p_title = (title, content)
is a tuple of a string title, denoting the title (name) of the page, as well as another string content, which contains a definition as well as a description of the concept given by the title. Each article in Wikipedia represents a single semantic concept.
This concept is then connected to other concepts via hyperlinks, thus making the information in Wikipedia better accessible for humans. Consequently, we define the Wikipedia information network as a graph in a similar manner as for folksonomies (Definition 6):
16https://dumps.wikimedia.org/
Definition 8 (Wikipedia graph). We define a Wikipedia graph G_W as a graph G_W = (V_W, E_W) with vertices – i.e., pages or concepts – V_W = {p_1, ..., p_n} and directed edges – i.e., links between those pages – E_W = { (p_i, p_j) | p_i, p_j ∈ V_W }.

The content of a page also contains all the links which define the edges originating from this page. In fact, an edge (p_i, p_j) can only be contained in E_W if and only if the content of page p_i contains a hyperlink to page p_j. We can now define inlinks(p_i) and outlinks(p_i) for a given page p_i. The set of outlinks contains all links originating from p_i and is easily deduced as outlinks(p_i) = { (p_i, p_j) ∈ E_W | p_j ∈ V_W }. The set of inlinks contains all links pointing from other pages to page p_j and is defined analogously as inlinks(p_j) = { (p_i, p_j) ∈ E_W | p_i ∈ V_W }. The cardinalities of the inlinks and outlinks of a page p_i are called the in- and outdegree of p_i, respectively. The sum of in- and outdegree gives the degree of p_i.
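The in- and outdegree computation described above can be sketched in a few lines; the page names below are made up for illustration:

```python
from collections import defaultdict

def degrees(edges):
    """Compute in- and outdegree for every page in a directed,
    Wikipedia-style link graph given as a set of (p_i, p_j) edges."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        outdeg[src] += 1  # edge originates from src
        indeg[dst] += 1   # edge points to dst
    return indeg, outdeg

# Toy link graph between three hypothetical articles
E = {("Coffee", "Caffeine"), ("Coffee", "Espresso"),
     ("Espresso", "Coffee")}
indeg, outdeg = degrees(E)
# degree("Coffee") = indegree + outdegree = 1 + 2 = 3
```
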
3.1.3 The WordNet Taxonomy
WordNet is a well-known lexical database of English nouns, verbs, adverbs, and adjectives [72]. It organizes words with a shared meaning in so-called "synsets". These synsets thus represent semantic concepts. On the one hand, they can be used to resolve synonymy, i.e., that different words describe the same concept, but also polysemy, i.e., that a word can describe different concepts, depending on its context. Additionally, WordNet contains a brief description of each meaning of a synset, the so-called gloss.

WordNet is not only a thesaurus, i.e., a lexicon of synonyms, but also a semantic network. This means that in WordNet, semantic concepts are linked if they are semantically related. WordNet covers meronymy (part-of relations) and antonymy (opposite meanings) relations. Additionally, all synsets are hierarchically organized in a directed, acyclic graph structure, where most of the edges represent semantic generality relations between synsets. This hierarchical is-a relation is called hypernymy or hyponymy. For example, the synset dog has a hypernym canine and a hyponym pug. There have also been attempts to include relations about the semantic relatedness of concepts that are weighted according to the intensity of the relatedness [40]. Unfortunately, they finally proved to be unsuccessful due to the high number of possible relations and the great uncertainty of human perception.

The greatest strength of WordNet is that it contains highly accurate linguistic knowledge about the English language, because it was manually created and curated by linguists. Additionally, WordNet provides easy-to-use programmatic access, for example on its website17. For these reasons, it has been widely used in research for many purposes, such as a gold standard for the evaluation of knowledge base construction algorithms [28, 168], word sense disambiguation approaches [], and even as a source of semantic similarity measures, as we will see in Section 3.2.
The fact that WordNet is curated manually is also one reason for its potentially greatest shortcoming: the slow ingestion of new knowledge into WordNet. For example, the
17http://wordnetweb.princeton.edu/perl/webwn
synset for "Python" does not cover the well-established meaning of the Python programming language. It can thus only serve as a source of general linguistic knowledge. However, this knowledge is still present in a very precise form, which leaves WordNet as one of the best sources of semantic information in existence. We will use WordNet as the basis for measures of semantic similarity in Section 7.3.
3.1.4 Ontologies and Knowledge Graphs

For the sake of completeness, we will lastly introduce ontologies and knowledge graphs, although we do not directly use them in this thesis. As these structures provide the central motivation for the methods and analyses presented in this thesis, it is important to provide a description as well as a formal definition of each. The most structured form to encode semantic knowledge today is given by ontologies. Studer, Benjamins, Fensel, et al. define an ontology as follows [224]:

Definition 9 (Ontology). An ontology is a formal, explicit specification of a shared conceptualisation.

We will now briefly explain the atomic parts of that definition. A conceptualisation is an abstract description of an observed, real-world entity. By postulating that it is shared, we expect that conceptualisation not to be subjective, but objective, or at least in some way more broadly accepted. By describing that shared conceptualisation, we obtain an explicit specification of it. Finally, we want that specification to be as strict and exact as possible, so we formulate it in a formal way, i.e., mathematically. In summary, ontologies allow us to store and access human common-sense knowledge in machine-readable form. For a longer, in-depth discussion of the parts that constitute Definition 9, we refer the reader to [88].

While ontologies define the abstract structure of knowledge, knowledge graphs collect and organize actual instances of that knowledge. In [68], Ehrlinger and Wöß identify and compare several definitions and assumptions of what a knowledge graph is seen as in the literature. They conclude that a commonly accepted definition could be as follows:

Definition 10 (Knowledge Graph (Ehrlinger and Wöß)). A knowledge graph acquires and integrates information into an ontology and applies a reasoner to derive new knowledge.
This means that a knowledge graph makes use of an ontology as an abstract schema of information, while it contains that explicit information itself. Paulheim characterizes a knowledge graph in a slightly different way [180]:

Definition 11 (Knowledge Graph (Paulheim)). A knowledge graph

1. mainly describes real world entities and their interrelations, organized in a graph.
2. defines possible classes and relations of entities in a schema.
3. allows for potentially interrelating arbitrary entities with each other.
4. covers various topical domains.
In contrast to the definition by Ehrlinger and Wöß, Paulheim phrases his definition of a knowledge graph in such a way that an ontology is only a part of a knowledge graph instead of a separate entity. It should also be noted that Paulheim does not consider WordNet as a knowledge graph, since WordNet only describes semantic relations between words, not things. This is largely based on his decision that words are not real-world entities, although he admits that this is a rather "philosophical discussion" [180]. However, besides this point, WordNet largely fulfils all criteria of a knowledge graph as postulated in Definition 11. For this reason, we will still consider WordNet as a knowledge graph, where needed.

Besides the well-known knowledge graphs Wikidata, DBPedia, YAGO, and WordNet, there also exists ConceptNet, which originated from the crowdsourcing project Open Mind Common Sense [217]. It collects knowledge from various sources, such as Wiktionary, WordNet, and a subset of DBPedia. In this thesis, we will use the ConceptNet Numberbatch word embeddings, which are constructed from the ConceptNet knowledge base [216]. We will introduce them in Section 4.3.
3.2 Computational Methods to Model Distributional Semantic Knowledge
Computational Semantics is a subfield of Computational Linguistics and focuses on the study of automatically computing the meaning of words or sentences and the subsequent processing of that information in several applications in Natural Language Processing and Understanding. Such applications are, for example, determining the semantic relatedness of words and texts [129, 156, 230, 266], Text Classification and Clustering [118, 264], Sentiment Analysis [138, 178], Translation [15, 253], and many others. The term Computational Semantics also encompasses many other research fields, such as Lexical Semantics and Distributional Semantics.

Lexical semantics techniques are highly related to Semantic Web techniques, where entities are represented as nodes in ontologies and knowledge graphs (as presented in Section 3.1) and are connected by explicit semantic relations. For example, in the statement dbp:apple rdf:type yago:fruit113134947, dbp:apple denotes the DBPedia entity apple, and rdf:type makes clear that the type of the apple is a fruit, where fruits are denoted by the YAGO entity yago:fruit113134947. So, in human language, the statement above is the same as

An apple is a kind of fruit.
Such statements in ontologies and knowledge graphs are also called facts. Inserting new entities and facts is often done manually. As mentioned before, WordNet is completely hand-crafted by linguistic experts. As a consequence, constructing and updating WordNet (and other ontologies) requires considerable effort and expert knowledge and is thus very tedious and time-consuming. This especially becomes a problem if we want to create ontologies that cover specialized domain knowledge. The difficulties in acquiring
valid, timely, precise, and structured knowledge, for example in such expert domains, are generally known as the knowledge acquisition bottleneck. In order to overcome the knowledge acquisition bottleneck for domain knowledge, one option is to automatically gather information about unknown entities and to deduce new facts from known ones. For this, we explore distributional semantics approaches to extract such information from natural language. Research about Distributional Semantics concerns itself with determining words with similar or related meaning by studying their contexts and their distribution within those contexts. The following two quotes offer a very intuitive description of the central goals that we want to achieve in distributional semantics. In 1954, Harris stated this in his "Distributional Hypothesis" [96]:
Words that occur in the same contexts tend to have similar meanings.
In [76], Firth coined one of the most well-known quotes in this context:
“You shall know a word by the company it keeps”.
In this thesis, we focus on computing distributional semantics from social media data. More specifically, we develop distributional word representations from tagging data, navigation data on Wikipedia, and navigation data on BibSonomy. Our goal is to obtain highly precise semantic relatedness measures on top of these representations, which reflect human intuition as closely as possible.

In this section, we will first introduce the vector space model (VSM) to model distributional semantic information in linear vector spaces. This model provides the groundwork for all models of semantic information that we present in this thesis. As the quality of the vector space model depends on a sensible choice of context, we will present several options to design that context in folksonomies and in navigation. After this, we explain how we actually measure semantic relatedness using the just-defined word vector representations. For this, we also discuss the notions of semantic relatedness and semantic similarity. To assess the quality of the measured semantic relatedness scores, we describe two variants of how to compare these scores with human intuition and calculate a corresponding evaluation score, on which we will base our results. At the end of this section, we will introduce two applications of the vector space model. While the first method aims at enriching and refining the vector representations constructed by a VSM model, the other method exploits the semantic knowledge in the vectors to discover and disambiguate multiple meanings of words.
3.2.1 The Vector Space Model: Expressing Words as Vectors
In order to “compute” with words, they need to be somehow modeled in a computable form. In Distributional Semantics, the meaning of a word is mostly computed from the distribution of its semantic or syntactic context in large collections of texts. One way to model words by their context in a computable form is to construct vector representations. This approach is called the Vector Space Model (VSM) and was first introduced by Salton, Wong, and Yang [201]. It allows us to apply linear algebra methods in order to perform
mathematical operations with the semantic information of the entities, or documents in our case, which are represented by those vectors. The first applications of the vector space model lie in the SMART information retrieval system, which retrieves suitable candidate documents from a large document corpus [202]. Later adaptations of the VSM then allowed words to be represented by the documents they have been used in, or by their textual context. Turney and Pantel provided an extensive survey of VSM applications in the literature and identified three classes of VSMs: term-document, word-context, and pair-pattern matrices, resulting in three classes of applications [230]. In the course of this thesis, we will however only focus on the first two classes.

In the following, we will first formally introduce and explain the vector space model for documents. After this, we describe variants of the VSM to construct vector representations of words.
3.2.1.1 Formal Definition of the Vector Space Model: Document-Level Semantics
Salton, Wong, and Yang used vectors to represent text documents di by the terms tj that the document di contained. If a term tj occurs k times in a document di, the corresponding cell dij in the document vector equals k:
di := (...,dij ,...) = (...,k,...) (3.7)
The length of that vector depends on the size of the vocabulary V_D := {t_1, ..., t_n} of a given corpus D := {d_1, ..., d_m} of documents. The vocabulary V_D contains all terms that occur at least once in a document from D. The idea behind this document representation is that the words contained in a document determine its topic. This can also be seen as some kind of document-level semantics: two documents are semantically similar if they have similar topics, i.e., their content overlaps. It has to be noted that these vectors pay no regard to the actual sequential structure of words inside the documents. This way of expressing documents by words is also called a bag-of-words approach.

With these document vectors, we can construct a matrix TD_D ∈ N_0^{m×n} from a document corpus D, where m is the size of D and n is the size of the set of all occurring terms, i.e., the vocabulary of D. Thus, a whole document corpus D := { d_i | 1 ≤ i ≤ m } can be expressed as the term-document matrix

TD_D := (d_ij), with rows indexed by the documents d_1, ..., d_m and columns indexed by the terms t_1, ..., t_n.    (3.8)
The rows of this matrix are the document vectors for each document di, as given in Equation (3.7). This matrix is usually very sparse, as documents rarely share a high
Figure 3.4: An example for word-word co-occurrence and term-document matrices, constructed from a document corpus D = {d_1, d_2}.

(a) Example corpus D: d_1: "The quick brown fox jumps."; d_2: "A brown fox is quick."

(b) Term-document matrix W_D (rows: d_1, d_2; columns: quick, brown, fox, jumps):
    d_1: 1 1 1 1
    d_2: 1 1 1 0

(c) Word-word co-occurrence matrix X_D (rows and columns: quick, brown, fox, jumps):
    quick: 0 2 2 1
    brown: 2 0 2 1
    fox:   2 2 0 1
    jumps: 1 1 1 0
amount of vocabulary with each other. Figure 3.4b shows the resulting term-document matrix for the document collection D from Figure 3.4a.

A severe issue of the standard term-document vector space model is that common words that occur in many documents have a high impact on document retrieval. This however contradicts their overall importance to characterize a fraction of documents. To counter this effect, the TF-IDF model adds a weighting factor to each term-document entry that also takes the impact of a word on a limited number of documents into account [13]. TF-IDF stands for Term Frequency with Inverted Document Frequency and is formally defined in Equation (3.9):

tfidf(d_ij) := d_ij · log( |D| / |{ d_k ∈ D | d_kj > 0 }| )    (3.9)

This rules out "common" words, which do not contribute much to the characterization of a fraction of documents since they occur so often, and at the same time emphasizes the important terms, which are most descriptive for the document at hand. TF-IDF is by far the most used weighting scheme in information retrieval. We apply TF-IDF in Chapter 5 to semantically characterize web pages, as well as in Section 7.3 to extract relevant terms of BibSonomy pages.
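The term-document matrix of Equation (3.8) and the TF-IDF weighting of Equation (3.9) can be sketched together in a few lines of Python, here on the stopword-filtered toy corpus of Figure 3.4; the function names are illustrative:

```python
import math

def term_document_matrix(docs):
    """Build the bag-of-words term-document matrix of Eq. (3.8):
    one row per document, one column per vocabulary term."""
    vocab = sorted({t for doc in docs for t in doc})
    index = {t: j for j, t in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for term in doc:
            matrix[i][index[term]] += 1
    return vocab, matrix

def tfidf(matrix):
    """Apply the TF-IDF weighting of Eq. (3.9): multiply each raw
    count by log(|D| / number of documents containing the term)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
            for row in matrix]

docs = [["quick", "brown", "fox", "jumps"],   # d1, stopwords removed
        ["brown", "fox", "quick"]]            # d2
vocab, TD = term_document_matrix(docs)
W = tfidf(TD)
# "jumps" occurs only in d1 and keeps a positive weight log(2/1);
# terms occurring in both documents are zeroed out by log(2/2) = 0.
```
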
3.2.1.2 Variants of the Vector Space Model for Word-Level Semantics The term-document matrix from the original vector space model was mostly intended for document retrieval based on a coarse “semantic” document representation. We are however interested in determining the semantics of the words inside such documents. In the following, we will present several variants of the vector space model that enable us to determine such word-level semantics.
Word-Word Co-Occurrence Counting In this thesis, we mainly construct word-context or word-word co-occurrence count matrices. In general, these matrices are constructed in the same way as term-document matrices, only that we now count the co-occurrences
of words in a pre-defined context. This allows us to represent a word wi as a vector vi, where vi := (vi1,...,vij ,...,vin). (3.10)
Here, v_ij denotes the co-occurrence count of w_i with w_j in a predefined context. For a discussion of potential context choices, we refer to Section 3.2.2. The resulting matrix W is trivially symmetric, as can be gathered from Equation (3.11):

W := (v_ij), a symmetric matrix whose rows and columns are both indexed by the words w_1, ..., w_n.    (3.11)
Figure 3.4c shows how a word-word co-occurrence matrix W of a document corpus D would look after stopword removal. Because jumps only occurs in document d_1, it also co-occurs only once with all other words. Additionally, we use Word-Word Co-Occurrence Binarization. This is the simplest variant of word-word co-occurrence, as it only denotes whether two words co-occurred instead of counting these co-occurrences. This straightens out the effect of words that occur very often and thus extremely dominate the co-occurrence distribution. At the same time, rarely occurring words show a higher impact on the vector representation. The formal definition is the same as that of word-word co-occurrence, only that v_ij can exclusively assume the values 0 or 1. We will apply binarization in Chapter 7 to extract improved semantics from unconstrained navigation both on Wikipedia and BibSonomy.
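Counting and binarizing word-word co-occurrences as described above can be sketched as follows; the contexts are the toy documents of Figure 3.4, and the function name is an illustrative choice:

```python
from collections import Counter
from itertools import combinations

def cooccurrence(contexts, binary=False):
    """Count word-word co-occurrences within each context (Eq. 3.11).
    With binary=True, only record whether two words co-occurred at all,
    i.e. the binarization variant described in the text."""
    counts = Counter()
    for context in contexts:
        # sorted() gives a canonical key order for each unordered pair
        for wi, wj in combinations(sorted(set(context)), 2):
            counts[(wi, wj)] += 1
    if binary:
        return Counter({pair: 1 for pair in counts})
    return counts

contexts = [["quick", "brown", "fox", "jumps"],
            ["brown", "fox", "quick"]]
X = cooccurrence(contexts)               # raw counts
B = cooccurrence(contexts, binary=True)  # binarized variant
```

Only the upper triangle of the symmetric matrix is stored here; the full matrix of Equation (3.11) follows by mirroring each pair.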
Word Embeddings While the co-occurrence counting approaches presented in the preceding section are a very intuitive way to represent words or documents as vectors, they suffer from sparsity and very high dimensionality. This impacts machine learning algorithms, since a large number of weights has to be learned from only very few examples, because the vectors are extremely sparse. Also, sparsity itself poses a problem, since it originates from insufficient or missing information and could lead to inaccurate or even wrong information. Concretely, the curse of dimensionality describes the problem that with an increasing number of dimensions, the distances between any pair of vectors tend to approximate the same value, i.e., in very high dimensions, vectors are distributed equidistantly. In NLP, this can for example occur if a text contains many synonymous words that have each been used only rarely. Each of these words then defines a new dimension, although they would formally describe the same feature.

A possible way to alleviate the problems of high-dimensional, sparse vectors as described above is to embed vectors in a low-dimensional space. While the idea of dimension reduction for co-occurrence matrices by factorizing them has been around for some time (see for example Latent Semantic Analysis [58] or Principal Component Analysis [182]),
Figure 3.5: Illustration of Word2Vec's skip-gram neural network architecture. The words w_{t−2}, w_{t−1}, w_{t+1}, and w_{t+2} are used to predict the word w_t via an embedding layer. The word embedding for w_t is finally represented by the weights of the neural network's hidden layer.
it recently gained a lot of traction with the advent of neural embedding models. The idea here is not to costly factorize a huge word co-occurrence matrix, but instead to directly learn the low-dimensional word embeddings that approximate the columns of the factorized matrix. In [25], Bengio et al. proposed a language model based on neural networks. Later on, Mikolov et al. proposed Word2Vec [156], a fast and easy-to-use embedding method to predict a word from its context. This training objective is based on the intuition that words with similar contexts are in turn also similar. There have been other embedding learning algorithms mimicking Word2Vec's behaviour, such as GloVe [184] or LINE [227]. In this work, we make use of GloVe, LINE, and Word2Vec. In the following, we briefly describe the intuition behind each of these, their training objectives, and their relation to each other.

The most well-known word embedding algorithm today, Word2Vec, makes use of a shallow neural network, i.e., one with only a single hidden layer, to produce word embeddings. It comes in two variants: predicting the context of a given word (Skip-Gram) or predicting a word from a given context (CBOW) [156]. The word representations are extracted from the hidden layer of the neural network. For an illustrative example, see Figure 3.5. The Skip-Gram optimization objective is to maximize the prediction probability p(w_{t+j} | w_t) of a context word w_{t+j}, given a center word w_t:
(1/|V|) · Σ_{t=1}^{|V|} Σ_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)    (3.12)
Here, c is the context size and p(w_{t+j} | w_t) is defined via the softmax function

p(w_{t+j} | w_t) = exp(v′_{t+j}^T v_t) / Σ_{i=1}^{|V|} exp(v′_i^T v_t)    (3.13)
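As a small numeric illustration, the softmax of Equation (3.13) can be computed directly; the two-dimensional toy vectors and the tiny vocabulary below are made up and not learned embeddings:

```python
import math

def skipgram_prob(center, context, out_vecs, in_vecs):
    """Softmax probability p(w_context | w_center) of Eq. (3.13),
    with 'input' vectors v and 'output' vectors v' given as lists.
    The vocabulary here is only the keys of out_vecs."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    v_t = in_vecs[center]
    scores = {w: math.exp(dot(v, v_t)) for w, v in out_vecs.items()}
    return scores[context] / sum(scores.values())

# Made-up toy vectors: "quick" is most aligned with "fox"
in_vecs = {"fox": [1.0, 0.0]}
out_vecs = {"quick": [1.0, 0.0], "brown": [0.5, 0.5], "jumps": [0.0, 1.0]}
p = skipgram_prob("fox", "quick", out_vecs, in_vecs)
```

Note that the denominator iterates over the whole vocabulary, which is exactly the cost issue discussed in the text.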
The vectors vi are the actual learned embeddings of words wi, which we will use later in this work, concretely in Section 6.5. Because Equation (3.13) depends on the size of the vocabulary in the denominator, computing its value is expensive. In order to decrease
computation time, Mikolov et al. experimented with different ways to approximate the denominator term. Discussing those is however beyond the scope of this thesis, and we refer the reader to the original paper [156].

In contrast to Word2Vec, the GloVe algorithm by Pennington, Socher, and Manning exploits the global context of words instead of the local context, i.e., it makes use of the vast amount of word repetition in the global corpus context [184]. Thus, the resulting word embeddings are similar when the mutual information of two words is high. GloVe takes a word-word co-occurrence matrix W = (w_ij) as input, as described in Equation (3.11). In the optimization objective, i.e., to minimize
Σ_{i,j=1}^{V} f(w_ij) · ( v_i^T ṽ_j + b_i + b̃_j − log w_ij )²,    (3.14)
v_i and ṽ_j are the resulting embedding vectors for a word and its context, respectively, and b_i and b̃_j are bias terms, again for a word and its context. Finally, the weight function f down-weights rare words and thus reduces the impact of data noise on the resulting model. In the experiments conducted in [184], Pennington, Socher, and Manning argue that the GloVe model outperforms Mikolov's Word2Vec as well as a set of other embedding algorithms in analogy and word relatedness tasks. Still, they note that by choosing the right set of parameters, this advantage can be diminished, and that this needs a more thorough analysis. Later, Levy, Goldberg, and Dagan showed in [134] that the performance of both algorithms is indeed heavily influenced by the choice of hyperparameters.

Finally, LINE [227] is a graph embedding algorithm that obtains vector representations for nodes in a graph instead of words. However, the global context of a word can then be encoded in the neighborhood of a node in the co-occurrence graph. It is thus still possible to learn word embeddings with this method. LINE attempts to reflect the proximity of a node in the graph in the distance of the generated embeddings. The algorithm relies on two definitions of node proximity. The first-order proximity in a network is the local pairwise proximity of two nodes, i.e., the weight of an edge connecting these two nodes. For each pair of vertices linked by an edge (i, j), the weight ω_ij on that edge indicates the first-order proximity between i and j. If there is no edge between i and j, their first-order proximity is 0. Thus, the first-order neighborhood of a node i contains all nodes j which are directly connected to i by an edge. The strength of this correlation is given by the weight ω_ij of the edge. To capture first-order proximity, LINE minimizes the following objective function:

O_1 = − Σ_{(i,j) ∈ E} ω_ij · log ( 1 / (1 + exp(−⟨v_i, v_j⟩)) )    (3.15)
Here, vi and vj are the embeddings of nodes i and j in the graph. The second-order proximity of two nodes between a pair of vertices (i,j) in a network is the similarity between their first-order neighborhoods. Mathematically, let pi = (wi,1,...,wi, V ) denote | |
Figure 3.6: Illustration of node neighborhoods in a graph as defined in [227]. The light-blue filled nodes are considered the first-order neighborhood of node 7, while the thickness of the edges denotes the weight of the connection. For example, the edge between nodes 6 and 7 has weight ω_{6,7} = 4. The nodes filled with red make up most of the second-order neighborhood of nodes 5 and 6 (marked with a red border).
the first-order proximity of i with all the other vertices; then the second-order proximity between i and j is determined by the similarity between p_i and p_j. If no vertex is linked from/to both i and j, the second-order proximity between i and j is 0.
O_2 = − Σ_{(i,j) ∈ E} ω_ij · log ( exp(⟨v_i, v_j⟩) / Σ_{k=1}^{|V|} exp(⟨v_i, v_k⟩) )    (3.16)
Both types of node proximity are illustrated in Figure 3.6. LINE embeddings are typically constructed by separately learning embeddings for the first-order and second-order proximity objectives and by subsequently concatenating them. This way, both proximity variants can influence the similarity of nodes, or in our case words. We will discuss the applicability of each word embedding algorithm on tagging data in Section 6.5, before we evaluate the learned tag embeddings on human intuition. In Chapter 8, the low dimensionality of those embeddings enables us to design an algorithm that refines the word vectors to better reflect human intuition of semantic relatedness.
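The first-order objective O_1 of Equation (3.15) can be sketched in a few lines of Python; the graph, edge weights, and two-dimensional embeddings below are made-up toy values for illustration:

```python
import math

def line_first_order_loss(edges, emb):
    """First-order LINE objective O1 of Eq. (3.15): the negative,
    weighted log-sigmoid of the inner product of the two endpoint
    embeddings, summed over all edges."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    loss = 0.0
    for (i, j), w in edges.items():
        sigma = 1.0 / (1.0 + math.exp(-dot(emb[i], emb[j])))
        loss -= w * math.log(sigma)
    return loss

# Toy embeddings: nodes 1 and 2 are aligned, node 3 points away
emb = {1: [1.0, 0.0], 2: [1.0, 0.0], 3: [-1.0, 0.0]}
edges = {(1, 2): 3.0, (1, 3): 1.0}
loss = line_first_order_loss(edges, emb)
# The aligned pair (1,2) contributes little loss per unit weight,
# while the opposed pair (1,3) contributes considerably more.
```

Minimizing this loss by gradient descent pulls embeddings of strongly connected nodes together, which is exactly the first-order proximity idea described above.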
3.2.2 Choosing the Right Context
Common to all models in the preceding section is that they rely on word co-occurrences in a given context. Obviously, the quality of a model depends heavily on the amount of provided information, i.e., on a) choosing the right context and b) choosing the right size of the context. In this section, we will briefly present how the context is chosen in folksonomies and in arbitrary graphs.
48 3.2 Computational Methods to Model Distributional Semantic Knowledge
3.2.2.1 Context in Folksonomies
Because folksonomies exhibit a tripartite structure (see Section 3.1.1) of users, tags and resources, the choice of a suitable context for the extraction of semantic knowledge has been extensively discussed [48, 262]. In this thesis, we will stick to the context definitions by Cattuto et al. [48], which we will re-evaluate in Section 7.3.1.2. Given a folksonomy F = (U,T,R,Y), Cattuto et al. define three natural first-order co-occurrence context descriptions for tags based on the posts in a social tagging system, all of which can finally be used in the vector construction methods we introduced in Section 3.2.1.1. The context of a tag can be described by either its co-occurring tags in the same post, by the users who have used this tag, or by the resources that this tag has been assigned to. They employ a vector space model to represent words, i.e., tags, by their contexts, so the corresponding feature spaces are of the dimensions |T|, |U|, and |R|, respectively. Thus, each context description can also be represented as a matrix C ∈ R^{k×|T|}, where k denotes the dimension of the corresponding feature space. The column vectors v_j of these matrices represent the corresponding semantic contexts of each tag t_j, j = 1, …, |T|.
Tag Context The tag context matrix C^{tag} ∈ R^{|T|×|T|} describes the context of a tag based on the posts in a folksonomy, i.e., the distinct annotations of resources by users.
$$C^{tag}_{ij} := \left|\left\{(u,r) \in U \times R \mid (u,t_i,r), (u,t_j,r) \in P_{ur}\right\}\right| \tag{3.17}$$

The entry in a matrix cell C^{tag}_{ij} of the tag context matrix is hence the number of posts P_{ur} in which a user u assigned both tags t_i and t_j to the same resource r.
Resource Context If we consider the resource context of tags, we obtain a matrix C^{res} ∈ R^{|R|×|T|}, which is defined as follows:

$$C^{res}_{ij} := \left|\left\{u \in U \mid (u,t_j,r_i) \in Y\right\}\right| \tag{3.18}$$

This measure counts how often a tag occurred with the same resource.
User Context In the user context matrix C^{user} ∈ R^{|U|×|T|}, the entries describe how often a tag has been used by the same user.
$$C^{user}_{ij} := \left|\left\{r \in R \mid (u_i,t_j,r) \in Y\right\}\right| \tag{3.19}$$

Cattuto et al. found that the resource context produced the best results for the construction of meaningful co-occurrence vectors, closely followed by the tag context. Still, the tag context similarity yielded the most clear-cut results in many other experiments, so we will use this context type in the remainder of this work, if not specified otherwise. We use all of these context definitions in Section 7.3.1.2.
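A minimal sketch of how these three context matrices can be counted from a set of tag assignments Y; the matrices are stored sparsely as dictionaries, and all names are our own, not from [48]:

```python
from itertools import combinations

def context_matrices(Y):
    """Build sparse versions of the context matrices of Eqs. (3.17)-(3.19)
    from a set of tag assignments Y = {(user, tag, resource)}."""
    c_tag, c_res, c_user = {}, {}, {}
    posts = {}  # (u, r) -> set of tags, i.e., the post P_ur
    for u, t, r in Y:
        posts.setdefault((u, r), set()).add(t)
        c_res[(r, t)] = c_res.get((r, t), 0) + 1    # Eq. (3.18): users who gave tag t to resource r
        c_user[(u, t)] = c_user.get((u, t), 0) + 1  # Eq. (3.19): resources user u tagged with t
    for tags in posts.values():
        # Eq. (3.17): both tags assigned by the same user to the same resource
        for ti, tj in combinations(sorted(tags), 2):
            c_tag[(ti, tj)] = c_tag.get((ti, tj), 0) + 1
            c_tag[(tj, ti)] = c_tag.get((tj, ti), 0) + 1
    return c_tag, c_res, c_user
```

Since Y is a set of distinct triples, incrementing once per triple correctly counts distinct users (resp. resources) in Eqs. (3.18) and (3.19).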
3.2.2.2 Context in Graphs
The context of a node in a graph is not as easily defined as in text. Textual context, e.g., in a sentence, is one-dimensional: words are located either before or after the current position. In graphs, there is no such concept of “before” or “behind”.
Node Neighborhood Instead, one can define a neighborhood of a node. In [227], this neighborhood is characterized in two ways. The first-order neighborhood is defined as all nodes directly connected to the current node, while the second-order neighborhood includes all nodes that share a similar first-order neighborhood with the current node. Figure 3.6 shows an illustration of both neighborhood variants. Since we already introduced these concepts in Section 3.2.1.2, we will not explain them again here.
Paths in a Graph In contrast to the notion of node neighborhoods in a graph, another option is to generate context for nodes by walking the graph. This way, a one-dimensional context is artificially generated (similar to a sentence in natural language), on which we can subsequently apply any methodology from Section 3.2.
Definition 12 (Navigational Paths in a Graph). Given a graph G = (V,E) with vertices V and directed edges E = {(v,w) | v,w ∈ V}, we define a path p ∈ P from the set of all paths P on G as a sequence of vertices p = (v_1, …, v_n) with v_i ∈ V, 1 ≤ i ≤ n, and (v_i, v_{i+1}) ∈ E, 1 ≤ i ≤ n − 1. The length of a path p is defined as the length of the corresponding sequence of vertices.
Paths in a graph can originate from various sources, such as a collection of all potential node sequences of a given length [210], a set of randomly generated paths [55, 87, 185], or human navigation [175, 210, 249]. In Chapter 7, we will propose methods that exploit human navigation as well as artificially created random walks in social media systems to model semantic knowledge. Consequently, navigational paths play a central role in this thesis. Additionally, we show that human navigation is a valuable source of semantic information when compared to the static graph structure of social media pages. In this chapter, we will generally work on link graphs, i.e., graphs of web pages p_i that are connected through directed links (p_i, p_j).
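Generating such artificial path contexts by uniform random walks can be sketched as follows; the function name, the adjacency-list input format, and the choice to stop a walk early at dead ends are our own assumptions, not prescribed by the cited works:

```python
import random

def random_walks(adj, walk_length, walks_per_node, seed=0):
    """Generate artificial path contexts by uniform random walks over a
    directed link graph given as adjacency lists {node: [successors]}."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    paths = []
    for start in adj:
        for _ in range(walks_per_node):
            path = [start]
            while len(path) < walk_length:
                successors = adj.get(path[-1], [])
                if not successors:
                    break  # dead end: stop this walk early
                path.append(rng.choice(successors))
            paths.append(path)
    return paths
```

Each returned path is a one-dimensional sequence of nodes and can be fed to any of the context-based methods from Section 3.2, exactly like a sentence of words.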
3.2.3 Measuring Semantic Relatedness and Similarity
For many applications in NLP and Ontology Learning, it is necessary to determine if two words are related to a certain extent. While determining the nature of that relation is also important for many tasks, we exclusively focus on measuring the strength of such a relation in this thesis. Before we show how we can use the semantic models defined in the previous sections, we introduce the notion of a similarity measure. We define a similarity measure between two entities i and j, each represented by a vector vi and vj , as follows:
Definition 13 (Similarity Measure). Given a coherent interval S ⊆ R, a function

$$\operatorname{sim}(v_i, v_j): \mathbb{R}^n \times \mathbb{R}^n \to S \subseteq \mathbb{R} \tag{3.20}$$

is called a similarity measure between two feature vectors v_i, v_j ∈ R^n, if it satisfies the following conditions:

1. sim(v_i, v_i) = max{sim(v_i, v_j) | v_j ∈ R^n}.
2. sim(v_i, v_j) = 0 iff v_i and v_j are completely unrelated.

3. sim(v_i, v_j) > sim(v_i, v_k) ⇔ v_j is more similar to v_i than v_k.

While S can span the whole of R, it can easily be normalized to the interval [0;1]. In a semantic interpretation, a similarity value of 0 between two word vectors v_i and v_j means complete unrelatedness, while a value of 1 means semantic synonymy. However, scores near 1 already indicate that both words might be semantically connected by some kind of semantic relation. We also interpret the similarity score as the strength or intensity of the relation.
3.2.3.1 Distinguishing Semantic Relatedness and Similarity
For the remainder of this work, we feel it is important to point out the difference between semantic relatedness and similarity. Although both concepts are based on the notion that entities are to some extent semantically related to each other, the notion of actual similarity is a much stricter one. Generally speaking, semantic relatedness between concepts represents the degree of “closeness” between those concepts [205]. While closeness is to a great extent subjective, there exists at least a consensus margin (see e.g., the singular voting results for the WS-353 dataset, Figure 3.9). An example of two semantically related words is the pair coffee and cup, because one pours coffee into a cup, so both words are in some way related. There also exists the notion of semantic similarity. In [137], Lin et al. gave an information-theoretic definition for general similarity, based on rather abstract propositions of “commonality” and “descriptions” of two objects. The object similarity is then defined as the ratio of the commonality information and the description information. Similarity is much narrower than relatedness, in the sense that while coffee and cup are related, they are in no way similar, because one cannot usually roast or drink a cup. However, tea and coffee are similar in that both can be used to make a very pleasant drink. But even here, the degree of similarity depends on the judging person, as a tea drinker with a distaste for coffee might judge them to be more different than someone who likes to drink both. However, even the person disliking coffee cannot rationally judge it totally dissimilar to tea, so with similarity, there also exists a subjectivity margin. Both semantic relatedness and semantic similarity have different applications. For example, semantic similarity is useful to resolve synonymy or to construct synsets in
[Figure 3.7 depicts three vectors v_1 = (1,2), v_2 = (1.5,1.5), and v_3 = (5,0.5), with cossim(v_1,v_2) = cos(α) ≈ 0.95 and cossim(v_2,v_3) = cos(β) ≈ 0.77.]
Figure 3.7: Illustration of the cosine similarity measure for vectors. The similarity between two vectors only depends on the angle between both vectors. Thus, similar vectors that point in the same direction have a high cosine similarity, regardless of their Euclidean distance.
ontologies, while semantic relatedness is important to construct semantic relations between concepts, including semantic similarity. In this thesis, we will mostly focus on determining the semantic relatedness between terms. Wherever necessary, we will explicitly state if we aim to model semantic similarity.
3.2.3.2 The Cosine Similarity Measure for VSMs

The cosine measure is one of the most widely used relatedness measures in the literature [13]. Intuitively, it measures the “closeness” of two points on a unit sphere in a Euclidean space: the smaller the angle between two vectors is, the higher their cosine similarity. This is illustrated in Figure 3.7. Given two feature vectors v_i, v_j, the cosine measure is defined as follows:
$$\operatorname{cosine}(v_i, v_j) = \frac{\langle v_i, v_j \rangle}{\|v_i\| \cdot \|v_j\|} = \left\langle \frac{v_i}{\|v_i\|}, \frac{v_j}{\|v_j\|} \right\rangle. \tag{3.21}$$

The cosine measure is directly related to the L2 norm and, written as a distance d(x,y) := 1 − cosine(x,y), even equivalent to the Euclidean distance on normalized vectors. Since the calculation of the cosine measure is linear, it is very efficient to process and thus widely used. In this thesis, we use the cosine measure as the main vector similarity measure and as a measure of semantic relatedness. It is possible to parameterize the cosine measure (in fact, we propose to do this in Chapter 8) by incorporating a positive semidefinite (PSD) matrix M as follows:18

$$\operatorname{cosine}_M(v_i, v_j) := \frac{\langle v_i, v_j \rangle_M}{\|v_i\|_M \, \|v_j\|_M} = \frac{v_i^T M v_j}{\sqrt{v_i^T M v_i \; v_j^T M v_j}} \tag{3.22}$$
The PSD condition for M is necessary to enable cosineM to satisfy the conditions of a scalar product. Since M is PSD, it is also symmetric and thus can be expressed as
18A quadratic matrix is positive semidefinite iff all its eigenvalues are greater than or equal to 0.
a product of a matrix D with its own transpose: M = D^T D. If we insert that into Equation (3.22), we get

$$\operatorname{cosine}_M(v_i, v_j) = \frac{v_i^T D^T D v_j}{\sqrt{v_i^T D^T D v_i \; v_j^T D^T D v_j}} = \frac{(Dv_i)^T Dv_j}{\|Dv_i\| \, \|Dv_j\|} = \frac{\langle Dv_i, Dv_j \rangle}{\|Dv_i\| \, \|Dv_j\|}, \tag{3.23}$$
i.e., we recover the original cosine measure, but applied to the transformed vectors Dv_i. Technically speaking, this means that it is not the cosine measure itself that is adapted, but the word vectors themselves.
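The identity between Equations (3.22) and (3.23) can be checked numerically. This NumPy sketch (function names and the example matrix D are ours) verifies that cosine_m with M = DᵀD equals the plain cosine of the transformed vectors:

```python
import numpy as np

def cosine(u, v):
    """Plain cosine measure, Eq. (3.21)."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def cosine_m(u, v, m):
    """Parameterized cosine measure with a PSD matrix M, Eq. (3.22)."""
    return (u @ m @ v) / np.sqrt((u @ m @ u) * (v @ m @ v))

# With M = D^T D (hence PSD), Eq. (3.23) says cosine_m equals the plain
# cosine of the transformed vectors Dv.
d = np.array([[2.0, 0.0], [1.0, 1.0]])  # arbitrary example transform
m = d.T @ d
u, v = np.array([1.0, 2.0]), np.array([3.0, 1.0])
assert np.isclose(cosine_m(u, v, m), cosine(d @ u, d @ v))
```

With M set to the identity matrix, cosine_m reduces to the ordinary cosine measure, which is the starting point of the learning approach in Chapter 8.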
3.2.3.3 WordNet-Based Semantic Similarity Measures

As already described in Section 3.1.3, WordNet encodes several explicit semantic relations between entities. Usually, those relations are not weighted. For applications such as semantically supported search engines, a weighted notion of similarity or relatedness is however important to find more relevant results. Intuitively, WordNet is especially well suited to calculate semantic relatedness between its entities, with hypernym and hyponym links between synonym sets and instances. Because of this intuition, many semantic relatedness measures based on WordNet have been proposed. For an extensive overview and comparison of WordNet-based similarity measures, the reader is referred to [44]. In the following, we will introduce two well-known measures, which we will discuss in more detail in Section 7.3.1.2. The first and simplest WordNet-based semantic similarity measure is the taxonomic shortest-path similarity measure by Hirst, St-Onge, et al. [105]. Jiang and Conrath proposed the Jiang-Conrath distance [112], which combines Resnik’s information content similarity measure [195] and the taxonomic shortest-path similarity measure. We now specifically define those two prominent semantic similarity measures on WordNet.
Taxonomic Shortest-Path Similarity The taxonomic shortest-path similarity by Hirst, St-Onge, et al. depends on the length of the shortest path between two concepts x and y in the WordNet taxonomy [105]:

$$\operatorname{sim}_{path}(x,y) := \max_{p \in P(x,y)} \frac{1}{\operatorname{len}(p)} \tag{3.24}$$

Here, P(x,y) denotes the set of all paths in the WordNet taxonomy between x and y. This measure is very intuitive, since the farther apart two entities are in the taxonomy tree, the less related they are.
Jiang-Conrath Distance and Similarity In [112], Jiang and Conrath proposed a distance measure on the WordNet hierarchy, which depends both on the aforementioned path-based similarity as well as on an information content criterion inspired by Resnik’s measure [195]. It is defined as

$$d_{jcn}(x,y) := IC(x) + IC(y) - 2 \cdot IC(\operatorname{LSuper}(x,y)), \tag{3.25}$$
where IC(x) is the information content of x in WordNet and LSuper(x,y) is the least common subsumer, i.e., the lowest node in the WordNet hierarchy which subsumes both x and y. In [44], Budanitsky and Hirst showed that the Jiang-Conrath distance yields the highest correlation with human intuition among all of the compared similarity measures. It is important to note that Equation (3.25) defines a distance measure. A corresponding similarity measure can be obtained by, e.g., calculating
$$\operatorname{sim}_{jcn}(x,y) := \frac{1 + \alpha}{d_{jcn}(x,y) + \alpha} \tag{3.26}$$

with α > 0 to avoid division by zero.
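Both measures can be sketched on a toy taxonomy. The information content values and the least-common-subsumer lookup below are assumed toy inputs, not WordNet’s actual values, and the similarity transform follows Eq. (3.26):

```python
from collections import deque

def path_length(taxonomy, x, y):
    """Number of vertices on a shortest path from x to y (cf. Definition 12),
    found by BFS on an undirected adjacency-list taxonomy."""
    queue, seen = deque([(x, 1)]), {x}
    while queue:
        node, length = queue.popleft()
        if node == y:
            return length
        for neighbour in taxonomy.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, length + 1))
    return float("inf")  # no connecting path

def sim_path(taxonomy, x, y):
    """Taxonomic shortest-path similarity, Eq. (3.24)."""
    return 1.0 / path_length(taxonomy, x, y)

def sim_jcn(ic, lsuper, x, y, alpha=1.0):
    """Jiang-Conrath distance (Eq. (3.25)) turned into a similarity (Eq. (3.26)).
    ic maps concept -> information content; lsuper returns the least common subsumer."""
    d_jcn = ic[x] + ic[y] - 2 * ic[lsuper(x, y)]
    return (1 + alpha) / (d_jcn + alpha)
```

For real WordNet data, libraries such as NLTK provide ready-made implementations of both measure families.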
3.2.4 Evaluating Semantic Relatedness Measures
The goal of creating a semantic relatedness measure is always to mimic human intuition of semantic relatedness. To this end, it is necessary to ground the measure on a gold standard of relatedness information. Additionally, we perform a statistical significance check to see if the produced scores are indeed valid. Over time, the literature has provided many options to do so. Schnabel et al. [204] listed a set of evaluation approaches and discussed each of them briefly. We will focus on two approaches, which “ground” a semantic relatedness measure on human intuition.
3.2.4.1 Grounding on Semantic Relatedness Datasets
In this work, we mainly focus on evaluation on semantic relatedness datasets, as done throughout many works in literature [43, 44, 75]. While this way of evaluating semantic relatedness has some flaws [71], it is one of the most direct options to evaluate the modeled semantic information on human intuition. In fact, we will use this method throughout the whole thesis as the main performance indicator of our models. In order to correctly assess how well the semantic relatedness scores produced by a relatedness measure fit to human intuition of semantic relatedness, we can compare them to human judgment. As described in Section 4.4, there exist semantic relatedness datasets which contain lists of word pairs together with human-assigned relatedness scores, i.e., human judgment of semantic relatedness. The procedure of evaluating semantic relatedness measures on those scores is then as follows:
1. For each pair of words in the human intuition dataset where both words also possess a vector representation, calculate the cosine similarity of those vector representations.
2. Calculate the correlation for the word pair ranking induced by the human-assigned scores and the word pair ranking induced by the cosine scores.
3. The nearer the correlation score of both rankings is to 1, the better the vector representations can be used to approximate human intuition of semantic relatedness.
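The three steps above can be sketched in plain Python. The tie-free Spearman implementation and all names are our own; production code would typically use a library routine with proper tie handling:

```python
def ranks(values):
    """1-based rank of each value in its sorted order (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for position, i in enumerate(order):
        r[i] = position + 1.0
    return r

def pearson(xs, ys):
    """Pearson correlation coefficient, Eq. (3.27)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman rank correlation, Eq. (3.28): Pearson on the rankings."""
    return pearson(ranks(xs), ranks(ys))

def evaluate(vectors, dataset):
    """Steps 1-3: Spearman correlation of cosine scores vs. human judgments,
    restricted to covered word pairs; returns (rho, number of pairs)."""
    model, human = [], []
    for w1, w2, score in dataset:
        if w1 in vectors and w2 in vectors:  # step 1: only covered pairs
            u, v = vectors[w1], vectors[w2]
            dot = sum(a * b for a, b in zip(u, v))
            norm = (sum(a * a for a in u) * sum(b * b for b in v)) ** 0.5
            model.append(dot / norm)
            human.append(score)
    return spearman(model, human), len(model)  # steps 2 and 3
```

Reporting the pair count alongside the correlation matters, since scores computed on different word pair subsets are hardly comparable.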
It is furthermore important to note the number of pairs on which the correlation score is calculated. A higher coverage of word pairs in an evaluation dataset also means a stronger claim of validity for the calculated scores (see below). Additionally, correlation scores from different word pair subsets are hardly comparable, since the actual gold standard that we evaluate against changes. The reason why one cannot always evaluate against every single pair of concepts is that a concept might not even be present in the underlying corpus. We now briefly discuss the choice of method to calculate correlation scores. In the literature, there are largely two popular choices of how to determine the correlation score. The first is the Pearson correlation coefficient, which determines a potential linear correlation between two random variables X and Y [183]:
$$r := r_{X,Y} := \frac{\operatorname{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}} \tag{3.27}$$
Cov denotes the covariance of X and Y, while σ_X is the standard deviation of X, which equals the square root of its variance Var(X). The Pearson coefficient takes into account the actual values of the random variables and compares them with each other. In contrast, the Spearman rank correlation coefficient relies on the ranking of the values, i.e., the fact that one value is higher than another is more important than how much higher it is. The value ranking of a random variable X is denoted by rg_X and describes the list position of each value in a sorted list of values. The coefficient thus measures the monotonic correlation between two random variables and is defined as follows [214]:
$$\rho := \rho_{X,Y} := \frac{\operatorname{Cov}(rg_X, rg_Y)}{\sigma_{rg_X} \sigma_{rg_Y}} \tag{3.28}$$
Concretely, this also constitutes a Pearson correlation, but between two rankings of scores instead of the scores themselves. The resulting difference and the need for a variation of the Pearson correlation coefficient can be seen in the Anscombe quartet, which consists of 4 different distributions of data points with the same Pearson correlation, but different Spearman correlations [7]. It is depicted in Figure 3.8. Applied to the evaluation of semantic relatedness scores, the Spearman correlation intuitively captures what measuring semantic relatedness is about: We would rather be able to validate an interpretable monotonic claim such as
money and bank are more closely related than drink and ear
than an uninterpretable judgment about an arbitrary absolute score, such as
alcohol and chemistry are 54.8% correlated.
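A small numerical illustration of this difference (pure Python, names ours): a monotonic but non-linear relationship keeps the rank correlation at 1, while the Pearson correlation drops below 1 because the absolute magnitudes diverge:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient, Eq. (3.27)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

scores_a = [1.0, 2.0, 3.0, 4.0, 5.0]
scores_b = [x ** 3 for x in scores_a]  # same ordering, very different magnitudes

r = pearson(scores_a, scores_b)                      # < 1: absolute values matter
# The rankings of both lists coincide, so the Spearman correlation
# (Pearson on the ranks, Eq. (3.28)) is exactly 1:
rho = pearson([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
```

This is exactly the behavior we want when only the relative ordering of relatedness judgments is meaningful.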
We support this intuition by looking at the score standard deviation of the human judgment scores in both WS-353 and Bib100, which will be introduced in detail in Section 4.4 and Section 6.2, respectively. Figure 3.9 shows a joint scatter plot of mean human judgment of relatedness together with its standard deviation. Especially for
Figure 3.8: Anscombe quartet. There are 4 distributions depicted, each with the same Pearson correlation score, but different Spearman correlations. While the Pearson correlation measures a linear correlation, Spearman denotes the monotonic correlation between two random variables X and Y .
the less related pairs, human judgment sometimes shows differences of more than 4 score points; the deviation is only relatively low near completely unrelated and highly related word pairs. We also plotted the density of judgments to underline that observation. Thus, if humans are unsure how a medium-related pair of words should be judged via an absolute score and thus give scores spanning a great range, it is not necessary for a computer to return absolute values of relatedness, either. Lastly, to determine the validity of a correlation score or an improvement upon another score, it is necessary to get a notion of their statistical significance. For this, we can either determine if the score is significantly different from zero, i.e., if the observed correlation score is actually present, or if two scores are significantly different. To calculate a p-value, we use Fisher’s z-transformation for Spearman correlation scores [74]:
$$p(\rho_1, \rho_2, n) := 1 - \operatorname{error}\left(0.5 \cdot (z(\rho_1) - z(\rho_2)) \cdot \sqrt{\frac{n-3}{1.06}}\right) \tag{3.29}$$
Figure 3.9: Joint plot of mean human judgment with standard deviation for both WS-353 and Bib100 together with judgment density. For both datasets, the standard deviation is lower towards the extremes of the judgment range, as can be seen from the underlying density plot. The datasets will be introduced in Section 4.4.
Here, z denotes the z-transformation $z(\rho) := \log\frac{1+\rho}{1-\rho}$ and error is the Gaussian error function $\operatorname{error}(x) := \frac{2}{\sqrt{\pi}} \int_0^x e^{-\tau^2} \, d\tau$. Finally, n denotes the number of samples on which the two correlation coefficients ρ_1 and ρ_2 have been calculated. The constant 1.06 is determined empirically [74]. Naturally, the higher n, the more valid the claim. For example, the Spearman correlation score of a ranking X = [1,2,3] with itself is ρ(X,X) = 1. A simple swap of two items, e.g., Y = [1,3,2], results in a Spearman correlation score of ρ(X,Y) = 0.5, i.e., a difference of 0.5. However, if we conduct the same experiment with a list of 100 items and again swap only the last two items, the score difference shrinks to about 0.00001. Consequently, the more items are compared using the Spearman correlation coefficient, the more difficult it is to create big score differences. Thus, the significance of a high score difference is higher if the compared rankings consist of a large number of items.
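Eq. (3.29) translates almost directly into code, using the z-transformation and error function as defined above; the function names and the use of the absolute z-difference (for a symmetric, order-independent test) are our own choices:

```python
import math

def fisher_z(rho):
    """z-transformation z(rho) = log((1 + rho) / (1 - rho))."""
    return math.log((1 + rho) / (1 - rho))

def p_value(rho1, rho2, n):
    """Eq. (3.29): p-value for the difference of two Spearman scores,
    both computed on the same n samples."""
    stat = 0.5 * abs(fisher_z(rho1) - fisher_z(rho2)) * math.sqrt((n - 3) / 1.06)
    return 1 - math.erf(stat)  # math.erf is the Gaussian error function
```

As expected, equal scores yield p = 1, and the same score difference becomes more significant (smaller p) as the number of compared items n grows.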
3.2.4.2 Grounding on Word Thesauri
The semantic similarity measures proposed in Section 3.2.3.3, especially the Jiang-Conrath measure, exhibit a very high correlation with human judgment [44]. In [48] and [148], the Jiang-Conrath measure has been used as a proxy of an evaluation ground truth for semantic similarity measures. We describe that evaluation approach here, because we will compare it to the evaluation on human judgment in Section 6.2. Given a set of tags T and a context-based semantic similarity measure sim_ctxt (cf. Section 3.1.1) on these tags, we first calculate for each tag t ∈ T the most closely related tag t′ ∈ T \ {t} according to sim_ctxt. For each of these pairs (t, t′), we then calculate
the mean of the WordNet-based semantic relatedness scores sim_wn(t, t′). Under the assumption that sim(x,y) is bounded, this can be formally expressed as:

$$\operatorname{eval}(\operatorname{sim}_{ctxt}, \operatorname{sim}_{wn}) := \frac{1}{|T|} \sum_{t \in T} \operatorname{sim}_{wn}\left(t, \operatorname*{argmax}_{t' \neq t} \operatorname{sim}_{ctxt}(t, t')\right) \tag{3.30}$$

Here, sim_wn is any similarity measure defined on WordNet. In this work, we use both the taxonomic shortest-path similarity (Equation (3.24)) as well as the Jiang-Conrath distance (Equation (3.25)), or more concretely, its similarity version, as defined in Equation (3.26). We discuss this evaluation approach in Section 6.2 and compare it with the evaluation on human judgment described in Section 3.2.4.1.
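Eq. (3.30) can be sketched directly; the function name is ours, and sim_ctxt and sim_wn are passed in as callables (e.g., a co-occurrence-based similarity and a WordNet-based one):

```python
def ground_on_thesaurus(tags, sim_ctxt, sim_wn):
    """Eq. (3.30): average WordNet similarity of each tag to its most
    context-similar other tag."""
    total = 0.0
    for t in tags:
        # argmax over all other tags according to the context similarity
        best = max((t2 for t2 in tags if t2 != t), key=lambda t2: sim_ctxt(t, t2))
        total += sim_wn(t, best)
    return total / len(tags)
```

A high score indicates that the context-based measure tends to pick partners that the WordNet-based measure also considers similar.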
3.2.5 Post-Processing and Applying Word Vector Representations for Semantic Information Gain

Often enough, word representations capture human intuition rather well, although their ability to reflect that intuition is always limited. While this on the one hand depends on the algorithm used to construct the word embeddings, the representation is also heavily influenced by the underlying corpus. Even corpora with standard vocabulary and general topics that cover most of the vocabulary contained in human evaluation datasets cannot exactly capture the actual relations and score rankings defined in the evaluation datasets. For example, a corpus containing scientific papers will hardly be able to correctly assess the semantic similarity of dog and cat. On the other hand, the exact meaning of the word paper depends on the context it is used in. In order to fix such problems, there have been some efforts to adapt or fine-tune existing word representations in a post-processing step, as well as methods that apply the vector representations to discover additional knowledge. In the following, we will first introduce the Retrofitting algorithm by Faruqui et al., which is used to adapt vector representations to external semantic resources, in this case to semantic lexicons [69]. After this, we describe an approach to discover and disambiguate the different meanings of social tags that is based on clustering the contexts of those tags. This tag disambiguation algorithm has been proposed by Benz [26].
3.2.5.1 Improving Vector Representations using Lexicons: Retrofitting

Faruqui et al. introduced a method called Retrofitting, where a set of existing word vectors Q̂ is adapted to a semantic lexicon, i.e., a dictionary of word-word relations. For example, the synset relations in WordNet can be interpreted as such a lexicon. In [69], they aim to infer new vectors Q from the old ones Q̂ by ensuring that the new vectors q_i stay close to their old representations q̂_i, but also close to the vectors q_j of neighboring words according to the semantic lexicon. The minimization function is thus

$$\Psi(Q) = \sum_{i=1}^{n} \left[ \alpha_i \|q_i - \hat{q}_i\|^2 + \sum_{(i,j) \in E} \beta_{ij} \|q_i - q_j\|^2 \right] \tag{3.31}$$
By varying α_i and β_ij, it is possible to put different weight on each criterion. The set of edges E, which defines the neighborhood of a word, can for example be taken from an ontology such as WordNet. The resulting word vectors improved the representation of semantics, and Faruqui et al. could show that their results even hold in other languages. Also, retrofitting only affects vectors of words that occur in the lexicon, while the remaining vector space remains untouched. We will use Retrofitting as a baseline in Chapter 8. Usually, such methods use external knowledge to either move some word vectors nearer towards others or even change the complete vector space. Technically, this process is called Metric Learning and has been an active field of research since 2003 [257]. The goal is to learn a distance metric according to a given set of constraints. Bellet, Habrard, and Sebban provided an exhaustive survey of metric learning algorithms in [23].
3.2.5.2 Discovering Word Senses Using Vector Representations

An important problem in modelling semantic information as vectors is polysemy, i.e., when a single word may have multiple meanings. Polysemy clearly affects functions of social media such as information retrieval or browsing: Because the different senses of a word may be semantically unrelated (e.g., apple is both a fruit and a computer manufacturing company), the user is presented with irrelevant content. While this problem is present in basically all systems dealing with natural language, we consider it especially in the context of social annotation systems, as the open vocabulary as well as the lack of structure (compared to, e.g., the syntax of a written text) makes this issue in social annotation systems unique and interesting. NLP approaches in the field of word sense discovery like [66, 179] typically apply clustering approaches to divide a suitable context of a given term into partitions which correspond to its senses. When transferring this idea to social annotation systems, the problem can be divided into the following subproblems: (i) context identification, i.e., how to construct a “suitable” context, and (ii) context disambiguation, i.e., how to subdivide this context into senses. In the following, we describe a two-step approach for context identification and disambiguation for tagging data, as described by Benz in his thesis [26]. We will use the same approach later in Section 6.4, however this time in order to measure the impact of user pragmatics (see Section 3.3.1) on the tag disambiguation process.
Sense Context Identification. In prior work, [9] performed extensive studies on the characteristics of different context definitions for the task of tag sense discovery. The authors examined tag- and user-based document networks, as well as tag co-occurrence and similarity networks. They found that tag similarity networks provided “the most clear-cut results among all the network types”. The next question is which tags to include in the context of a given tag t. The goal here is to choose a sample of context tags which are representative of t’s main senses. Here, Benz follows the heuristic described by [192], who found that the “20 strongest first-order associations [...] are [...] a good mix of the two main senses for each word”. In the case of tagging systems, first-order associations correspond to tag-tag co-occurrences.