Probabilistic Models for Context in Social Media Novel Approaches and Inference Schemes
Total Page:16
File Type:pdf, Size:1020Kb
Probabilistic Models for Context in Social Media Novel Approaches and Inference Schemes Christoph Carl Kling Institute for Web Science and Technologies University of Koblenz{Landau [email protected] November 2016 Vom Promotionsausschuss des Fachbereichs 4: Informatik der Universit¨atKoblenz{Landau zur Verleihung des akademischen Grades Doktor der Naturwissenschaften (Dr. rer. nat.) genehmigte Dissertation. PhD thesis at the University of Koblenz-Landau Datum der wissenschaftlichen Aussprache: 16.11.2016 Vorsitz des Promotionsausschusses Prof. Dr. Ralf L¨ammel Berichterstatter: Prof. Dr. Steffen Staab Berichterstatter: Prof. Dr. Markus Strohmaier Berichterstatter: Prof. Dr. Lars Schmidt-Thieme 2 Abstract This thesis presents novel approaches for integrating context information into probabilistic models. Data from social media is typically associated with metadata, which in- cludes context information such as timestamps, geographical coordinates or links to user profiles. Previous studies showed the benefits of using such context information in probabilistic models, e.g. improved predictive perfor- mance. In practice, probabilistic models which account for context informa- tion still play a minor role in data analysis. There are multiple reasons for this. Existing probabilistic models often are complex, the implementation is difficult, implementations are not publicly available, or the parameter estimation is computationally too expensive for large datasets. Addition- ally, existing models are typically created for a specific type of content and context and lack the flexibility to be applied to other data. This thesis addresses these problems by introducing a general approach for modelling multiple, arbitrary context variables in probabilistic models and by providing efficient inference schemes and implementations. In the first half of this thesis, the importance of context and the poten- tial of context information for probabilistic modelling is shown theoretically and in practical examples. In the second half, the example of topic models is employed for introducing a novel approach to context modelling based on document clusters and adjacency relations in the context space. These mod- els allow for the first time the efficient, explicit modelling of arbitrary context variables including cyclic and spherical context (such as temporal cycles or geographical coordinates). Using the novel three-level hierarchical multi- Dirichlet process presented in this thesis, the adjacency of ontext clusters can be exploited and multiple contexts can be modelled and weighted at the same time. Efficient inference schemes are derived which yield interpretable model parameters that allow analyse the relation between observations and context. 3 Zusammenfassung Diese Arbeit stellt neue Verfahren zur Integration von Kontext-Information in probabilistische Modelle vor. Daten aus sozialen Medien sind oft mit Metadaten assoziiert, wie bei- spielsweise Uhrzeit, geographische Koordinaten oder Nutzerdaten. Fruhere¨ Studien haben die Bedeutung von Kontextinformationen fur¨ die Vorher- sagekraft von Wahrscheinlichkeitsmodellen gezeigt. In der Praxis spielen Wahrscheinlichkeitsmodelle die Kontextinformationen integrieren dennoch eine geringe Rolle. Die Grunde¨ hierfur¨ sind vielf¨altig: Vorhandene Wahr- scheinlichkeitsmodelle sind oft komplex, die Implementierung ist schwierig, Implementierungen sind nicht frei verfugbar¨ oder die Parametersch¨atzung ist fur¨ gr¨oßere Datens¨atze zu aufwendig. Dazu kommt, dass vorhandene Mo- delle typischerweise fur¨ eine spezielle Art von Kontextinformation angepasst sind und nicht ohne Weiteres auf andere Daten angewendet werden k¨onnen. Diese Arbeit stellt einen neuen Ansatz zur Modellierung von mehre- ren, beliebigen Kontext-Variablen in Wahrscheinlichkeitsmodellen vor und pr¨asentiert effiziente Inferenz-Strategien. In der ersten H¨alfte der Arbeit wird die Bedeutung von Kontextinforma- tionen anhand theoretischer und praktischer Beispiele gezeigt. In der zweiten H¨alfte werden neue Topic-Modelle vorgestellt, die Kontextinformationen an- hand von Dokument-Clustern und Nachbarschaftsbeziehungen im Kontext- Raum modellieren. Diese Topic-Modelle erlauben erstmals die effiziente, ex- plizite Modellierung von zyklischen Kontextvariablen und Kontextvariablen mit Verteilung auf der n-Sph¨are (beispielsweise von Zeit-Zyklen und geogra- phisch verteilten Variablen). Mit Hilfe des neuartigen Hiearchichal Multi- Dirichlet Process (HMDP), der in dieser Arbeit vorgestellt wird, k¨onnen Nachbarschaftsbeziehungen zwischen Kontext-Clustern ausgenutzt werden und es k¨onnen mehrere Kontexte gleichzeitig modelliert und gewichtet wer- den. Es werden effiziente Inferenz-Strategien hergeleitet, die interpretierbare Modell-Parameter liefern, welche die Analyse der Beziehung zwischen Beob- achtungen und Kontext-Informationen erlauben. 4 Acknowledgments This dissertation would not have been possible without the help and support of many friends and colleagues. I thank my friend and colleague J´er^omeKunegis who supported me throughout my PhD studies and helped shaping my ideas. I thank my su- pervisor Steffen Staab for giving me the freedom and time to get familiar with probabilistic topic modelling. Markus Strohmaier supported me in (finally) writing the paper on online democracies and gave valuable feed- back. Arnim Bleier helped me a lot by pointing me to related work and by introducing me to online inference techniques. Heinrich Hartmann guided my research on online democracies in the right direction. Sergej Sizov sup- ported me in the early stages of my research and motivated me to do topic modelling. I thank Lisa Posch, J´er^omeKunegis and Arnim Bleier for proofreading this thesis and for many helpful comments. Dirk Homscheid provided the dataset of the Linux kernel mailing list, and Damien Fay provided the user profiles of the online network for sexual fetishists. I thank my colleagues at the Institute WeST of the University of Koblenz{ Landau and at the Department of Computational Social Science at GESIS in Cologne for many valuable discussions, especially on the study of online democracies. Finally, I want to thank my family for their support and encouragement. Contents Introduction3 The Importance of Context . .3 Probabilistic Models and Prior Distributions . .4 Graphical Models and Topic Models . .5 Applications . .6 Contributions . .7 1 Foundations and Related Work 11 1.1 The Concept of Probability . 11 1.2 Basic Concepts and Distributions . 12 1.2.1 Bernoulli distribution . 13 1.2.2 Likelihood . 13 1.2.3 Binomial distribution . 14 1.2.4 Marginal probability . 14 1.2.5 Maximum likelihood estimation . 14 1.3 Bayesian Networks . 15 1.3.1 Dependencies in Bayesian networks . 17 1.3.2 Plate notation . 18 1.4 Priors and Bayesian Inference . 18 1.4.1 Beta distribution . 19 1.4.2 Maximum a posteriori estimation . 20 1.4.3 Bayesian inference . 22 1.4.4 Multinomial distribution . 23 1.4.5 Dirichlet distribution . 24 1.4.6 Maximum a posteriori estimation and Bayesian infer- ence for the multinomial distribution . 25 1.4.7 Fisher distribution and approximate inference . 25 1.4.8 Latent variables and expectation-maximisation . 28 1.5 Topic Models . 30 1.5.1 Latent semantic analysis . 31 1.5.2 Probabilistic latent semantic analysis . 32 1.5.3 Latent Dirichlet allocation . 33 1.6 Inference for Complex Graphical Models . 34 5 6 CONTENTS 1.6.1 Expecation-maximisation and maximum a posteriori inference . 34 1.6.2 Gibbs sampling . 35 1.6.3 Variational inference . 37 1.7 Evaluation of Topic Models . 40 1.7.1 Human evaluation . 40 1.7.2 Perplexity . 40 1.8 Non-Parametric Topic Models . 41 1.8.1 The Dirichlet process . 42 1.8.2 Inference for the HDP . 46 1.9 Summary . 51 2 Single-Context Voting Models 53 2.1 Problem Setting and Approach . 53 2.1.1 Problem setting . 54 2.1.2 Approach . 55 2.2 Delegative Democracies . 55 2.2.1 Democracy platforms . 56 2.2.2 Pirate parties . 56 2.3 Description of the Dataset . 56 2.3.1 LiquidFeedback platform . 56 2.3.2 Dataset . 57 2.4 Voting Behaviour and Delegations . 57 2.4.1 Existence and role of super-voters . 57 2.4.2 User approval rates . 60 2.4.3 Impact of delegations . 62 2.4.4 Temporal analysis of the delegation network . 64 2.5 Power in Online Democracies . 64 2.5.1 Power indices . 65 2.5.2 Probabilistic interpretation of power indices . 66 2.5.3 Theoretical (uniform) power indices . 67 2.5.4 Empirical power . 68 2.5.5 Non-uniform power indices . 70 2.5.6 Evaluation . 74 2.5.7 Distribution of Power . 75 2.6 Discussion . 76 2.7 Summary . 76 3 Single-Context Topic Models 79 3.1 A Classification of Context Variables . 80 3.1.1 Discrete context variables . 80 3.1.2 Linear and continuous context variables . 80 3.1.3 Cyclic and spherical context variables . 80 3.2 Single-Context Topic Models . 81 CONTENTS 7 3.2.1 Topic models for discrete context variables . 82 3.2.2 Topic models for linear context variables . 84 3.2.3 Topic models for cyclic and spherical context variables 84 3.2.4 Categorisation of approaches . 87 3.3 Drawbacks of Existing Models . 87 3.4 Multi-Dirichlet Process Topic Models . 90 3.5 The Basic Model . 91 3.5.1 Geographical clustering . 91 3.5.2 Topic detection . 92 3.6 The Neighbour-Aware Model . 94 3.6.1 Definition of a geographical network . 95 3.6.2 Advantages of the neighbour-aware model . 95 3.7 The MDP Based Geographical Topic Model . 97 3.7.1 The multi-Dirichlet process . 98 3.7.2 Inference . 99 3.7.3 Estimation of scaling parameters for the MDP . 100 3.7.4 MDP-based topic model . 101 3.7.5 Generative process . 101 3.7.6 Generalised estimator of the concentration parameter 103 3.8 Evaluation . 106 3.8.1 Datasets . 106 3.8.2 Experimental setting . 107 3.8.3 Comparison with LGTA . 108 3.8.4 Effect of the region parameter . 110 3.8.5 User study . 110 3.8.6 Runtime comparison . 114 3.9 Summary . 115 4 Multi-Context Topic Models 119 4.1 Generalisations and Special Cases of MGTM . 119 4.1.1 Relation to the hierarchical Dirichlet process . 120 4.1.2 Relation to the author topic model .