A New Way of Visualizing Semantic Similarity Over Time
Running head: VISUALISING SEMANTIC SIMILARITY OVER TIME

Anis Mohamed Boudih
ANR: 773591

Thesis submitted in partial fulfilment of the requirements for the degree of Master of Science in Communication and Information Sciences, Master Track Data Science: Business and Governance, at the School of Humanities of Tilburg University

Supervisor: dr. E.A. Keuleers
Second reader: dr. A. Alishahi

Tilburg University
School of Humanities
Department of Communication and Information Sciences
Tilburg center for Cognition and Communication (TiCC)
Tilburg, The Netherlands
December 21, 2018

Preface

First of all, I want to thank dr. Emmanuel Keuleers for his support and guidance during this project. Thank you for being so patient and for teaching me so many new things in a relatively short amount of time. I would never have come to this result without your input. Furthermore, I want to thank dr. A. Alishahi for supervising my thesis as the second reader, and my fellow students for the discussions and feedback during the weekly sessions. In addition, I would like to thank Ákos Kadar for answering my questions and providing guidance while dr. Emmanuel Keuleers was away at conferences. Lastly, I would like to thank my family and friends for all their support.

Anis Boudih
Tilburg, December 2018

Abstract

Time-stamped corpora preserve a historical record, which makes them an effective source for analysing cultural and historical phenomena. Michel et al. (2011) demonstrated how cultural phenomena can be analysed quantitatively in a time-stamped corpus. Visualising the frequency of specific words allowed them to reason quantitatively about the cultural and abstract changes occurring in society.
However, due to the computational advances of the last few years, it is now possible to go beyond the exploration of frequency counts. We build on the approach of Michel et al. by visualising semantic similarity over time instead of frequency counts, and by analysing recent cultural and historical phenomena instead of phenomena occurring between 1800 and 2000. The News on the Web Corpus (Davies, 2013), a large time-stamped corpus that spans from 2010 to the present, enables us to analyse current cultural and historical phenomena. Our research demonstrates how visualising the semantic similarity over time between a reference vector and target words can be used to investigate bitcoin, which serves as the use case for evaluating and exploring our proposed approach. The visualisations introduced here should be considered evidence that such visualisations and their accompanying methods can be used for abstract reasoning about phenomena that occur in society. We expect that this new approach can significantly improve the analysis of cultural and historical phenomena.

Keywords: data visualisation, corpus linguistics, distributional semantics, culturomics, word2vec

Table of Contents

Introduction
  1.1 Context
  1.2 Research questions
  1.3 Thesis outline
Related academic work
  2.1 Data visualisation
  2.2 Word embeddings
  2.3 Visualising word embeddings
  2.4 An exploration use case: The bitcoin phenomenon
Experimental setup
  3.1 Use Case
  3.2 Data set
    3.2.1 Cleaning the data
  3.3 Method
    3.3.1 GENSIM
    3.3.2 Hyperparameters Word2Vec
    3.3.3 Evaluation of word embeddings
  3.4 Visualising semantic similarity over time
    3.4.1 Semantic similarity
    3.4.2 Recap: previous work on visualising word embeddings
    3.4.3 User-defined method
    3.4.4 Anchoring method
    3.4.5 Evaluating the approach
Results
  4.1 Baseline performance
  4.2 Visualising semantic similarity between two words: ‘user defined method’
  4.3 Visualising semantic similarity using the ‘anchoring method’
  4.4 Visualising additional non-semantic dimensions of information
Discussion
  5.1 Results linked to the literature
  5.2 Limitations of study and data
  5.3 Contribution of study and future research
Conclusion
  RQ 1
  RQ 2
  RQ 3
  RQ 4
References

Introduction

In this chapter, I will introduce the goal of my thesis. Section 1.1 will provide an overview of the context of this study. Section 1.2 will present the main objectives of this study. Section 1.3 will provide an overview of the structure of this thesis.

1.1 Context

The availability of large-scale collections of historic texts and online databases, such as news articles and Google Books, has simplified and stimulated research on historic events (Frermann & Lapata, 2016). For instance, Michel et al. (2011) have compiled about 4% of all the books that were printed between 1800 and 2000 into a collection of historic texts. Michel et al. have shown how cultural phenomena can be investigated quantitatively in a corpus. ‘Culturomics’, as the authors call the quantitative analysis of human culture, investigates fields such as the adoption of technology, the evolution of grammar, lexicography, and other fields closely related to human culture (Michel et al., 2011). The authors have demonstrated that timestamped corpora can provide insight into historic events and have shown that these phenomena can be analysed quantitatively by drawing conclusions from the frequency of words in relation to time. However, thanks to computational progress, we are now able to look beyond the frequency counts of words in a timestamped corpus, owing also to the advances in creating semantic