
Running head: VISUALISING SEMANTIC SIMILARITY OVER TIME 1

A New Way of Visualizing Semantic Similarity over Time

Anis Mohamed Boudih

ANR: 773591

Thesis submitted in partial fulfilment

of the requirements for the degree of

Master of Science in Communication and Information Sciences,

Master Track Data Science: Business and Governance,

at the School of Humanities

of Tilburg University

Supervisor: dr. E.A. Keuleers

Second reader: dr. A. Alishahi

Tilburg University School of Humanities Department of Communication and Information Sciences

Tilburg center for Cognition and Communication (TiCC)

Tilburg, The Netherlands, December 21, 2018

Preface

First of all, I want to thank dr. Emmanuel Keuleers for his support and guidance during this project. Thank you for being so patient and teaching me so many new things in a relatively short amount of time. I would never have come to this result without your input.

Furthermore, I want to thank dr. A. Alishahi for supervising my thesis as the second reader, and my fellow students for the discussions and feedback during the organised weekly sessions.

In addition, I would also like to thank Ákos Kadar for answering my questions and providing guidance while dr. Emmanuel Keuleers was away visiting conferences.

Lastly, I would like to thank my family and friends for all the support.

Anis Boudih

Tilburg, December 2018


Abstract

Time-stamped corpora are characterised by their historical record. Therefore, they are an effective source for analysing cultural and historical phenomena. Michel et al. (2011) have demonstrated how quantitative analysis of cultural phenomena in a time-stamped corpus can be performed. Visualising the frequency of specific words allowed them to reason quantitatively about the cultural and abstract changes occurring in society. However, due to the computational advances of the last few years, it is now possible to go beyond the exploration of frequency counts. We build on the approach of Michel et al. by visualising semantic similarity through time instead of frequency counts, and by analysing recent cultural and historical phenomena instead of phenomena occurring between 1800 and 2000.

The News on the Web Corpus (Davies, 2013), a large time-stamped corpus that spans from 2010 to the present, enables us to analyse current cultural and historical phenomena.

Our research demonstrates how the visualisation of semantic similarity over time between a reference vector and target words can be utilised to investigate bitcoin, which serves as the use case for evaluating and exploring our proposed approach. The visualisations that are introduced should be considered evidence that these visualisations and the accompanying methods can be used for abstract reasoning about the phenomena that occur in society. We expect that this new approach can significantly improve the analysis of cultural and historical phenomena.

Keywords: data visualisation, word embeddings, distributional semantics, culturonomics


Table of Contents

Introduction
1.1 Context
1.2 Research questions
1.3 Thesis outline
Related academic work
2.1 Data visualisation
2.2 Word embeddings
2.3 Visualising word embeddings
2.4 An exploration use case: The bitcoin phenomenon
Experimental setup
3.1 Use Case
3.2 Data set
3.2.1 Cleaning the data
3.3 Method
3.3.1 GENSIM
3.3.2 Hyperparameters Word2Vec
3.3.3 Evaluation of word embeddings
3.4 Visualising semantic similarity over time
3.4.1 Semantic similarity
3.4.2 Recap: previous work on visualising word embeddings
3.4.3 User-defined method
3.4.4 Anchoring method
3.4.5 Evaluating the approach
Results
4.1 Baseline performance
4.2 Visualising semantic similarity between two words: ‘user defined method’
4.3 Visualising semantic similarity using the ‘anchoring method’
4.4 Visualising additional non-semantic dimensions of information
Discussion
5.1 Results linked to the literature
5.2 Limitations of study and data
5.3 Contribution of study and future research
Conclusion
RQ 1
RQ 2
RQ 3
RQ 4
References


Introduction

In this chapter, I will introduce the goal of my thesis. Section 1.1 will provide an overview of the context of this study. Section 1.2 will present the main objectives of this study. Section 1.3 will provide an overview of the structure of this thesis.

1.1 Context

The availability of large-scale collections of historic texts and online databases such as news articles and Google Books has simplified and stimulated research of historic events (Frermann & Lapata, 2016). For instance, Michel et al. (2011) have compiled about 4% of all the books printed between 1800 and 2000 into a collection of historic texts.

Michel et al. have proposed how cultural phenomena can be investigated quantitatively in a corpus. ‘Culturonomics’, as the authors name the quantitative analysis of human culture, investigates fields such as the adoption of technology, the evolution of grammar, lexicography, and other fields closely related to human culture (Michel et al., 2011). The authors have demonstrated that time-stamped corpora can provide insight into historic events and proved that these phenomena can be analysed quantitatively by inferring conclusions from the frequency of words in relation to time.

However, we are currently able to look beyond the frequency counts of words in a time-stamped corpus due to computational progress and advances in creating semantic vector spaces. The embodiment of this technological progress is Word2Vec: a shallow three-layered neural network with two different architectures that can learn representations of words from large text corpora (Mikolov, Chen, Corrado, & Dean, 2013a). This model was introduced in 2013 by Mikolov et al. and was much more appropriate than earlier models for creating vector spaces out of a large collection of text. This is achieved in particular through negative sampling, an improved computational method for learning word representations more efficiently than earlier models (Dyer, 2014). Mikolov et al. (2013b) have demonstrated that the speed of learning representations was significantly improved over previous models (Baroni, Dinu, & Kruszewski, 2014). In addition, a benchmark comparison between Word2Vec and other models also demonstrated that Word2Vec vectors achieve state-of-the-art performance on different test sets (Mikolov et al., 2013a). These advances enable us to explore culturonomics in a way that was previously not possible.

Exploring how the linguistic contexts of words have changed over time in time-stamped corpora will provide interesting information about the development of phenomena. For instance, Kulkarni, Al-Rfou, Perozzi & Skiena (2015) have investigated how the associations of the word ‘gay’ evolved over time. Its nearest neighbours gradually changed from cheerful to homosexual associations, such as ‘lesbian’, ‘homosexual’, and ‘transgender’. Thus, examining the associations of a linguistic or cultural phenomenon through time will allow for the analysis of culturonomics in a time-stamped corpus.

This thesis will build upon the approach of Michel et al., who have demonstrated how culturonomics can be analysed in a simple but clear manner by utilising the frequency counts of ngrams in a time-tagged corpus. We will expand upon the approach of Michel et al. by analysing the temporal change in the semantic similarity between specific words to gain insight into the temporal development of cultural and linguistic phenomena. The semantic similarity between two words can be measured by calculating the distance between their word embeddings. A word embedding is a representation of a word’s meaning as a numerical vector. Word embeddings that are made with distributional models are based on the notion of semantic similarity and all the other generalisations that are built upon that notion.

This is known as the distributional hypothesis, which states that words that are used and occur in the same contexts tend to have similar meanings (Harris, 1954).
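As a minimal illustration of this idea, the sketch below computes the cosine similarity between a reference word and a target word in two yearly vector spaces. The three-dimensional vectors, the word pair, and the years are purely hypothetical toy values chosen for the example, not embeddings trained on the thesis data.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional embeddings (hypothetical values) for a reference
# word and a target word in two yearly instantiations of a vector space.
spaces = {
    2013: {"bitcoin": [0.9, 0.1, 0.2], "currency": [0.3, 0.8, 0.1]},
    2017: {"bitcoin": [0.8, 0.5, 0.1], "currency": [0.7, 0.6, 0.2]},
}

# Semantic similarity of the pair in each time slice; plotting these
# values against the years yields a similarity-over-time trajectory.
trajectory = {
    year: cosine_similarity(vecs["bitcoin"], vecs["currency"])
    for year, vecs in spaces.items()
}
```

In a real setting each year's vectors would come from a model trained on that year's slice of the corpus; the dictionary above merely stands in for those per-year vector spaces.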

Michel et al. have encouraged the exploration of other resources for the analysis of linguistic and cultural phenomena, such as newspapers, manuscripts, and other human creations, because they typically contain human ‘voices’ that are worth considering. Therefore, this study uses the News on the Web Corpus (Davies, 2013). This is a time-tagged corpus that contains web-based newspapers and magazines from 2010 up to the present time. Analysing linguistic and cultural phenomena with a news corpus can be advantageous over the use of books as data because the former enables the study of language evolution through the lens of events (Yao et al., 2018). Moreover, news corpora are also characterised by consistency in narrative style and grammar, as opposed to the Google Ngram Books Corpus that was used by Michel et al. Their corpus lacked consistency in style and grammar because the books were published over a long period (between 1800 and 2000) and because both fictional and non-fictional books were present in the corpus.

Michel et al. have demonstrated that simple two-dimensional graphs can be used to analyse cultural or linguistic phenomena. The visualisations presented in their paper clarified their analysis and displayed the data in a simple but powerful way. In this study, we will explore different ways of visualising semantic similarity through time in order to analyse culturonomics.

To demonstrate our proposed method and the accompanying visualisations, a highly relevant contemporary phenomenon, bitcoin, will be explored through the lens of our method. Bitcoin is an interesting case to explore because it is based on blockchain, which is one of the most significant technological advancements since the internet (Metry, 2017). In addition, the popularity of bitcoin and its high market value have led to significant investment and speculation about the future of digital money (Cheah & Fry, 2015).

The price of bitcoin has kept increasing over the last few years, although its fundamental value is still unknown (Cheah & Fry, 2015). In fact, the price of bitcoin dropped dramatically earlier this year, by more than 50% of its peak value (Young, 2018). This situation has many remarkable parallels with other historic and speculative bubbles. A similar pattern has occurred a few times in history, ultimately leading to a crash in the value of a product on the stock market. Some examples are the tulip mania in the 17th century, the internet bubble at the end of the 20th century, and the housing bubble in 2008 (Ferguson, 2008).

The evaluation of such a phenomenon is linked to the scientific relevance of this study. Exploring the News on the Web Corpus at every timestamp with word embeddings will produce a time-reflective text representation that will pave the way for a new approach to analysing linguistic and cultural phenomena. This study will illustrate the potential and drawbacks of analysing linguistic and cultural phenomena through time with semantic similarity. Hence, the method of Michel et al. will be broadened by our study because we are exploring a new method that can answer other questions. In addition, this study will also contribute to scientific knowledge about visualising word embeddings, because there is little scientific literature available that focuses on this topic (Smilkov et al., 2015; Hilpert & Perek, 2015). The reason for this might be that high-dimensional data cannot be visualised directly due to the number of dimensions. There are two state-of-the-art dimensionality reduction techniques (principal component analysis and t-distributed stochastic neighbour embedding) that can reduce the number of dimensions of a semantic vector space (Van Der Maaten & Hinton, 2008). As a result, a word embedding can be displayed as a point on a two-dimensional graph (Hilpert & Perek, 2015). Besides the scientific relevance, the study of this specific method also has practical relevance because it can eventually be used to analyse cultural and linguistic phenomena such as bitcoin.

Consequently, this approach contributes to a deeper understanding of society and the temporal development of socially relevant phenomena such as bitcoin, social media, and self-driving cars. Moreover, our approach helps interested parties to investigate linguistic phenomena in the vast amount of digitised textual data by providing a visualisation that allows anyone to look up associations or search for lexical changes over time.

1.2 Research questions

The main objectives of this thesis are: (1) to explore simple but powerful methods for visualising the semantic associations of words over a given time period, which (2) can be applied to the analysis of recent events using a current time-tagged corpus.

Hence, the following problem statement is formulated:

Can we develop novel ways of visualising semantic similarity through time to analyse recent cultural and historical phenomena?

The research questions are as follows:

1. Can we develop a method for the comparison of a user-provided reference word and other user-provided target words?

2. Can we develop a method for automatically choosing relevant target words given a reference word and a user-specified time period?

3. Can we develop a method for visualising additional non-semantic dimensions of information of the target words?

4. Which of the developed methods are appropriate for analysing current events using recent time-tagged corpora?

1.3 Thesis outline

The following outline will be followed in the upcoming chapters. Chapter 2 provides an introduction to the methodology and theory that form the basis for this thesis. In this chapter, we will first discuss literature related to data visualisation. Next, the concept of word embeddings will be discussed by reviewing prominent literature that is closely related to distributional semantics and distributional models. The next subsection will classify different data visualisations related to word embeddings and discuss the relevant aspects of each example. Finally, our use case, the bitcoin phenomenon, will be reviewed from a historical perspective.

The third chapter discusses the experimental setup that was used in this study, starting with an introduction to the dataset. After a brief introduction to the News on the Web Corpus (Davies, 2013), the pre-processing of the data and the parameters of the Word2Vec model will be discussed. Subsequently, we will evaluate the created semantic vector spaces with different test sets to validate our word embeddings. After the evaluation of the word embeddings, the visualisation of word embeddings will be discussed in more depth.

The fourth chapter starts by presenting the baseline performance of this study. We will then present our proposed method with three different visualisations. The focus of this chapter is to demonstrate our developed methods by answering the use case questions formulated in Section 3.4.5. We will then apply qualitative reasoning to bitcoin.

The fifth chapter puts the results in perspective by linking them to the literature. This chapter will also discuss the contribution of this thesis within the existing framework. Areas for further research will be identified based on the contribution of this study.

I will conclude this thesis by answering the research questions that were stated in Chapter 1.


Related academic work

In this chapter, I will delve into academic work that is related to data visualisation and word embeddings. After some general discussion about data visualisation, a section will be devoted to the concept of word embeddings and how they can be created. After this, an overview of scientific work related to visualising word embeddings will be provided. Finally, some historic context about the use case, bitcoin, will be presented.

2.1 Data visualisation

More attention has been given to the presentation of graphics since the work of Edward Tufte (Chen, Härdle & Unwin, 2008). According to Chen et al. (2008), knowledge about this topic is usually expressed as a set of principles to be followed rather than as formal theories, and while there is considerable applied literature, theoretical literature is rarely found. The development of new data visualisations was led in the past by technological advances, data collection, and statistical theory (Young, Valero-Mora, & Friendly, 2006).

However, according to Chen et al., the development of new data visualisation methods is currently led by increased computer processing and capacity, access to massive data sets, and continuous streaming of data.

Those who wish to work with data visualisation should be aware of the fundamental difference between graphics for presentation and graphics for exploration (Tufte, 1983; Young et al., 2006; Theus & Urbanek, 2007; Chen et al., 2008). Presentation graphics generally take the form of a single graph that portrays a key message without providing further context (Theus & Urbanek, 2007). Moreover, graphics for the purpose of presentation should be of high quality and revealing to many observers. Exploratory graphics, in contrast, should be fast and informative rather than slow and precise (Chen et al., 2008). To illustrate the difference between presentation and exploratory graphics, Chen et al. have used the example of air pollution in the United States. While exploring air pollution in the USA can be achieved with a simple barplot or scatterplot, presentation to a wide audience is best achieved in the form of a geographical map of the USA that includes extra dimensions of information. Because a presentation graph is intended for presentation, it needs to contain detailed legends and captions, while exploratory graphics are drawn simply to support the data investigations of one data analyst (Chen et al., 2008). In general, a visualisation should avoid distorting the data’s message, because the data must be presented properly (Tufte, 2001). Moreover, Chen et al. have argued that due to the scarcity of relevant theory, examples of data visualisation are becoming more important for understanding practice and providing a basis for future progress.

2.2 Word embeddings

Distributional approaches to meaning acquisition use distributional properties of linguistic entities as the formal structure of semantics (Sahlgren, 2006). In doing so, these approaches rely fundamentally on several assumptions about the nature of language and meaning. This set of assumptions is commonly summarised as the idea that co-occurrence statistics in text corpora can be used for the study of word meaning (Mitchell & Lapata, 2010). This basis has its roots in the corpus-linguistic tradition exemplified by the famous quotation ‘You shall know a word by the company it keeps’ (Firth, 1957, p. 11) and the distributional hypothesis: ‘The degree of semantic similarity between two linguistic expressions A and B is a function of the similarity of the linguistic contexts in which A and B can appear’ (Harris, 1954). Variations of this assumption appear throughout the scientific literature, each formulating largely the same idea with its own nuance. More recent papers tend to describe it as ‘words with similar meanings will occur with similar neighbours if enough text material is available’

(Schütze & Pedersen, 1995, p. 161); ‘a representation that captures much of how words are used in natural context will capture much of what we mean by meaning’ (Landauer & Dumais, 1997, p. 218); and ‘words that occur in the same contexts tend to have similar meanings’ (Pantel, 2005, p. 2). The idea behind this is that there is a correlation between distributional similarity and meaning similarity (Sahlgren, 2006). Therefore, it is possible to discover the semantics of a word by inspecting a significant number of linguistic contexts that represent the distributional behaviour of that word (Lenci, 2008). This approach to representing the distributional meaning of a word is known as distributional semantics. Thus, distributional semantic models rely on the idea that the distributional meaning of a word depends on the contexts in which the word is used and that the content of a word can be captured by its contextual representation (Miller & Charles, 1991; Lenci, 2008).

The primary source of information for distributional semantics is large collections of texts, because they contain linguistic usages that can be used to identify the distributional properties of a word. The distributional properties of a word are captured in a vector: ‘a sequence of numbers which encode the statistical association strength between a word and a certain context or distributional feature’ (Lenci, 2008, p. 11). Words that are represented as numerical vectors are also known as word embeddings (Bengio, Ducharme, Vincent, & Jauvin, 2003). Word embeddings can be conceived as coordinates of a word in a high-dimensional geometric space. Words with similar coordinates will be located near each other, while dissimilar words will be located far from each other. In other words, measures of vector similarity can be used to express the contextual similarity between two words. However, there is criticism of equating this similarity to semantic similarity. Padó and Lapata (2003) have stressed that the proximity between two words in a vector space cannot indicate the nature of the semantic relation, because distributionally similar words can be synonyms, antonyms, or even semantically unrelated. On the other hand, there are studies that have demonstrated the ‘psychological reality’ of semantic similarity. For instance, Miller and Charles (1991) have demonstrated that people have an intuitive understanding of semantic similarity and have little trouble making accurate judgements. Several other studies have reported significant inter-subject agreement between human ratings and ratings derived from distributional models (Miller & Charles, 1991).

Distributional semantic models (DSMs) rely on the assumption that the contextual information of a word can approximate a word’s meaning (Miller & Charles, 1991).

Constructing word embeddings can be performed in two major ways. In general, DSMs use numerical vectors to keep track of the co-occurrence of words (e.g., contexts) in a corpus to approximate the meaning of all the words that occur in a corpus (Turney & Pantel, 2010).

Constructing word vectors out of raw co-occurrence counts with DSMs was considered to be state-of-the-art for many decades (Schütze, 1992; Baroni et al., 2014). Latent Semantic Analysis (LSA) is the most famous DSM that uses raw co-occurrence counts for the construction of a semantic vector space (Landauer & Dumais, 1997). Co-occurrence counts of words are collected with a context window. This can be a window spanning n words or characters that moves through a corpus, or, in the case of LSA, the entire text of a document (Schütze, 1992; Landauer & Dumais, 1997). Every time word i appears with word j within a window, the count in cell fij of a large count matrix is increased by one. As one might imagine, this usually leads to a very sparse matrix (Sahlgren, 2006). As a result, LSA is often combined with a linear algebra technique called singular value decomposition, which can extract the most informative dimensions from a sparse matrix (Landauer & Dumais, 1997). These dimensions are then used to create an n-dimensional semantic vector space with n-dimensional context vectors.
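The count-based construction described above can be sketched in a few lines. The three toy ‘documents’ and the choice of keeping k = 2 dimensions are hypothetical values for illustration only, and the full SVD of NumPy stands in for the truncated SVD that would be used on a realistically sized sparse matrix.

```python
import numpy as np

# Tiny toy corpus (hypothetical); in LSA the context window is a whole document.
docs = [
    "bitcoin price market".split(),
    "bitcoin blockchain technology".split(),
    "market price crash".split(),
]
vocab = sorted({w for d in docs for w in d})
index = {w: i for i, w in enumerate(vocab)}

# Word-by-document count matrix: cell f_ij counts word i in document j.
counts = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for w in doc:
        counts[index[w], j] += 1

# Singular value decomposition; keeping the k largest singular values
# retains the most informative dimensions of the sparse count matrix.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
embeddings = U[:, :k] * s[:k]   # one k-dimensional vector per vocabulary word
```

Each row of `embeddings` is then a low-dimensional context vector for one word, which is the essence of the LSA construction.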

Over the last few years, there has been significant interest in a new type of DSM for constructing word embeddings. Miikkulainen and Dyer (1991) were the first to propose a neural network for language modelling. However, it was Bengio et al. (2003) who pushed this idea to a larger scale and focused on learning a statistical model of the distribution of word sequences. The model learns distributed feature vectors to represent the similarity between words. However, training time and capacity were still issues for this neural network language model (NNLM). The main cause was the softmax layer, which produced a probability distribution over all words and was too computationally complex at that time (Chen, Grangier & Auli, 2016). Several feed-forward neural network architectures for computing vector representations followed the NNLM. However, limited improvements were made in terms of reducing the computational cost (Collobert & Weston, 2008; Chen et al., 2016).

It was Mikolov et al. (2013a) who reduced the computational complexity of this new type of DSM. Mikolov et al. proposed two simple neural network architectures for computing continuous vector representations of words: the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-Gram). The CBOW architecture tries to predict the middle word of a window using the surrounding context words in that window (Mikolov et al., 2013a). In contrast, the Skip-Gram architecture tries to predict the context words of a given word. Figure (1) is a graphical illustration of the CBOW and Skip-Gram architectures.
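A minimal sketch of how the Skip-Gram architecture derives its training examples from running text; the token sequence and window size below are hypothetical, and CBOW would use the same windows but predict the centre word from the context instead.

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) training pairs for the Skip-Gram architecture:
    each word is paired with every word within `window` positions of it."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                pairs.append((target, tokens[j]))
    return pairs
```

For the toy sentence `["the", "price", "of", "bitcoin"]` with a window of 1, each word is paired with its immediate neighbours, yielding six training pairs in total.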


Figure 1. Graphical illustration of the CBOW and Skip-Gram architectures. Adapted from ‘Efficient Estimation of Word Representations in Vector Space’, by T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado and J. Dean, 2013b, Nips, p. 6. Copyright 2013 by Nips.

Both models were relatively inexpensive computationally, due to the replacement of the non-linear hidden layer with a simpler structure (Mikolov et al., 2013a). Negative sampling also contributed to the success of the new models. This newly proposed loss function contrasted with the hierarchical softmax by being less computationally expensive (Mikolov et al., 2013b). Negative sampling draws k randomly sampled ‘negative’ context words for every training sample, which the model should learn to assign a low probability given the current target word. Comparing the current word with k negative context words instead of with every single word in the vocabulary has a significant impact on reducing the computational cost (Mikolov et al., 2013b).
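Under the standard formulation of the skip-gram negative-sampling objective (not spelled out in this thesis), the loss for a single training pair can be sketched as follows; the vectors passed in are hypothetical stand-ins for learned embeddings.

```python
import numpy as np

def sgns_loss(v_target, v_context, v_negatives):
    """Negative-sampling loss for one (target, context) pair:
    -log sigma(t.c) - sum_k log sigma(-t.n_k).
    Summing over k sampled negatives replaces the full softmax
    over the entire vocabulary, which is what makes training cheap."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    loss = -np.log(sigmoid(v_target @ v_context))       # true context: push score up
    for v_neg in v_negatives:
        loss -= np.log(sigmoid(-(v_target @ v_neg)))    # negatives: push score down
    return loss
```

The loss is small when the target and true context vectors point in the same direction and the negatives do not, which is exactly the behaviour the training procedure rewards.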

A comparison between vectors computed with Word2Vec and vectors computed with other architectures such as LSA and the NNLM has revealed the quality of the vectors built with Word2Vec. For instance, vectors computed with Word2Vec achieved significantly higher scores than vectors computed with the NNLM on several test sets, such as the semantic and syntactic test sets. Moreover, Mikolov et al. have discovered that simple questions can be answered by performing simple algebraic operations on the vectors. For example, the algebraic operation vec(‘biggest’) – vec(‘big’) + vec(‘small’) means the following in plain human language: find a word that is similar to ‘small’ in the same way that ‘biggest’ is to ‘big’. Mikolov et al. demonstrated that the resulting vector X was closest to the word ‘smallest’ in the vector space, which was exactly the word they were seeking.

Such a linear relationship was detected among many word pairs, such as a capital and its country, a noun and its plural, or a word and its antonym (Mikolov et al., 2013b).
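This algebraic operation can be illustrated with a toy vector space in which the superlative relation is encoded as a constant offset. The two-dimensional vectors below are contrived for the example and are not real Word2Vec embeddings.

```python
import numpy as np

# Toy embeddings (hypothetical values): the comparative-to-superlative
# relation is deliberately encoded as the same offset vector (0, 1).
vecs = {
    "big":      np.array([1.0, 0.0]),
    "biggest":  np.array([1.0, 1.0]),
    "small":    np.array([-1.0, 0.0]),
    "smallest": np.array([-1.0, 1.0]),
    "bitcoin":  np.array([0.0, -1.0]),
}

def analogy(a, b, c):
    """Word whose vector is closest (by cosine) to vec(a) - vec(b) + vec(c)."""
    target = vecs[a] - vecs[b] + vecs[c]
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    # The three input words are excluded from the candidates, as in
    # the analogy evaluation of Mikolov et al.
    candidates = {w: v for w, v in vecs.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(candidates[w], target))
```

With these contrived vectors, vec(‘biggest’) − vec(‘big’) + vec(‘small’) lands exactly on vec(‘smallest’), mirroring the result reported by Mikolov et al.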

2.3 Visualising word embeddings

While there has been research on reducing dimensionality to enable the visualisation of word embeddings, few studies have tried to visualise word embeddings differently.

For the purposes of this thesis, we distinguish between two types of visualisations of word embeddings. The first type is the visualisation of the pairwise similarities between many vectors simultaneously. Word vectors are very high-dimensional data structures, and since understanding data with three or more dimensions is difficult, the number of dimensions must be reduced before the data can be interpreted (Kromesch & Juhász, 2005). Projecting high-dimensional data to a low-dimensional space provides a way to visualise it (Yang, 2011). This implies that for a visualisation of word representations, the high-dimensional data is mapped into a low-dimensional space while preserving the original distances as far as possible. Dimensionality reduction techniques such as principal component analysis (PCA) or t-distributed stochastic neighbour embedding (t-SNE) are mainly used for this type of visualisation (Hilpert & Perek, 2015; Smilkov et al., 2015).
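A minimal sketch of such a projection: PCA implemented directly via SVD on centred data. The 50 random 300-dimensional vectors are hypothetical stand-ins for real word embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 300))   # 50 hypothetical 300-dimensional word vectors

# PCA via SVD: centre the data, then project onto the top two
# principal axes, which preserve the most variance (and hence
# as much of the original distance structure as two dimensions allow).
X_centred = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centred, full_matrices=False)
coords_2d = X_centred @ Vt[:2].T   # each word becomes a point on a 2-D scatterplot
```

Each row of `coords_2d` can then be plotted as one point and labelled with its word, which is exactly the kind of graph produced by PCA-based embedding visualisations; t-SNE would optimise the layout for local neighbourhood preservation instead.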

An example of this type of visualisation of word representations can be found in a research paper by Hilpert and Perek (2015). This paper explored how animated scatterplots (i.e., motion charts) can be used to demonstrate the temporal development of selected words. Hilpert and Perek used this type of graph to visualise semantic change with respect to semantic category and normalised frequency. Their approach is called ‘many a NOUN’ and is portrayed in Figure (2). Every spot in the figure stands for a noun, and the spots are positioned according to their semantic distance. In addition, each colour stands for a different semantic category, and the size of a spot indicates the normalised frequency of a noun. In Figure (2), the construction is instructed to track words that are closely related to the semantic category ‘time’. As can be seen from the figure, the visualisation successfully classifies a range of words that are closely related to this semantic category.

Figure 2. Snapshot of the ‘many a NOUN’ construction. The selected words are related to the semantic category of time. Adapted from ‘Meaning change in a petri dish: constructions, semantic vector spaces, and motion charts’ by M. Hilpert and F. Perek, 2015, Linguistics Vanguard, 1(1), p. 8. Copyright 2015 by De Gruyter Mouton.

Another practical example of reducing dimensionality to visualise word embeddings has been proposed by Smilkov et al. (2015). Embedding Projector, as they call their visualisation, allows researchers and developers to analyse, explore, and understand embeddings in a low-dimensional visualisation. The authors’ visualisation aims to support three tasks: exploring local neighbourhoods, observing the geometry of the word embeddings, and finding semantically significant directions (Smilkov et al., 2015). In the visualisation, each word is mapped to a point. As a result, it should be possible to conduct those tasks. However, there is one potential limitation that might affect them. Due to memory constraints, it is not possible to project more than 10,000 words into a two- or three-dimensional space. Therefore, it is not possible for a user to look up every word of interest. This drawback is caused by the dimensionality reduction techniques that are used for visualising word embeddings. Reducing word embeddings to two or three dimensions can be done with techniques such as t-SNE and PCA. However, these algorithms tend to be slow and computationally complex (Van Der Maaten & Hinton, 2008). Consequently, users of Embedding Projector are limited in the amount of data they can use.

Figure 3. Snapshot of the Embedding Projector, with pre-uploaded word embeddings as data.

The second type of visualisation is based on the similarity between a reference vector and other vectors over different instantiations of a vector space. This does not require a transformation of the data into a low-dimensional space, because the similarity values between a reference vector and its target words are scalars rather than high-dimensional data. An example of this type of visualisation can be found in the study of Frermann and Lapata (2016). As depicted in Figure (4), their Bayesian model automatically extracts temporal representations of a word as a set of senses. In addition, the proportion of each sense is displayed for each 20-year time interval. Each sense stands for the 10 most probable words, and highlighted words are indicated as highly representative. Every bar is tagged with the initial year of the time interval, and the colours indicate the proportion of each sense during each time interval. Consequently, this visualisation makes it possible to explore the diachronic meaning change of a specific word. The visualisation of Frermann and Lapata makes it clear that a visualisation of word embeddings is not necessarily restricted to reducing the dimensionality of word embeddings.

Figure 4. Tracking semantic change for transport and bank over 20-year time intervals between 1700 and 2010. Adapted from ‘A Bayesian Model of Diachronic Meaning Change’ by L. Frermann, M. Lapata, 2016, Transactions of the Association for Computational Linguistics, 4, p. 37. Copyright 2016 by the Association for Computational Linguistics.

2.4 An exploration use case: The bitcoin phenomenon

Michel et al. (2011) have demonstrated how historic changes in society can be analysed through the temporal frequency of specific words. For instance, the popularity of Jewish artists in English and German books started to decrease significantly during the Second World War, which can be explained by Nazi censorship of Jewish artists (Michel et al., 2011). The practical and scientific value of such examples has inspired us to explore an enrichment of the approach of Michel et al. To do so, we will develop visualisations of the semantic similarity between specific words to gain insight on the development of phenomena. The use case for these visualisations will be bitcoin, a contemporary phenomenon which is remarkably similar to historic speculative bubbles. The following paragraph will provide some background on the use case for readers who are not acquainted with it.

Bitcoin is a virtual currency that is broadcast on a peer-to-peer network and allows owner and receiver to perform direct transactions (Nakamoto, 2008). This electronic payment system is based on cryptographic proof, which enables two parties to directly transfer electronic cash without the interference of a trusted third party (Nakamoto, 2008). However, despite the enormous public interest in bitcoin, its digital value has raised alarming economic and societal issues (Cheah & Fry, 2015). There is currently significant debate over the actual value of bitcoin. Recent fluctuations in the exchange price of bitcoin do not indicate that it holds a constant value. As a matter of fact, the fluctuations suggest ‘a substantial speculative component’ in bitcoin (Dowd, 2014). The demand for further investigation and cautiousness is high, because the speculative component of bitcoin could point to a speculative bubble (Dale et al., 2005). A speculative bubble is trade in an asset at a price that strongly exceeds the asset’s intrinsic value (Dowd, 2014).

Bitcoin’s current situation is remarkably similar to historical examples of speculative bubbles. The first major speculative bubble occurred after companies introduced the concept of a share: an indivisible unit of capital that expresses the unit of ownership of an interest in a company in return for a proportion of the future profit of the company in the form of a dividend (Ferguson, 2008). These shares carried a substantial speculative component because they enabled access to distributions of any future profit. At the beginning of the 18th century, the South Sea Company was about to achieve a monopoly on all trade with South American countries. Many investors expected a repeat of the East India Company, which had flourished through the spice trade from India. The massive demand for shares in the South Sea Company led to a boom in its share price, since speculation about the treasures of South America had spread (Ferguson, 2008). However, after a few months of flourishing share prices, they started to collapse, leading to an intense economic crisis.

Another well-known example is the ‘tulip mania’ which took place in the 1630s in the Netherlands. Tulips take 7 to 12 years to flower on average. Hence, tulip bulbs were sold on the market as a durable good (Ferguson, 2008). The demand for tulips started to grow rapidly in 1634. As a result, the professional growers started to charge more for a tulip bulb. In addition to their popularity in the Dutch market, there was also a high demand for tulip bulbs in France, which was noticed by speculators who had started to enter the tulip market. This immense demand eventually resulted in tulip bulbs being sold at the price of a luxury home. This situation persisted until May 1637, when buyers refused to show up at a routine tulip bulb auction. As a result, the price plunged by 99% (Garber, 1989).

More recent examples of speculative bubbles are the Great Crash of 1929 and the internet bubble at the end of the 20th century. In both cases, a combination of rapidly increasing stock prices and overconfidence in future profits led to a situation that was completely out of control (Galbraith, 1955; Ferguson, 2008). For instance, many people decided to quit their jobs and become full-time traders during the rapid adoption and growth of the internet (DeLong & Magin, 2006). At the turn of the millennium, the NASDAQ stock market reached a peak, with many investors investing in internet stocks. The price of many internet stocks grew by 1,000% in a few years, fuelling an even higher demand for internet stocks. However, a few scandals involving internet-related companies eroded investor confidence (Ferguson, 2008). As a result, many investors desperately tried to sell their stocks, and the value of the stocks dropped dramatically within a few days.

The previously discussed historical examples testify to the usefulness of analysing the bitcoin phenomenon. The current situation of bitcoin is analogous to previous speculative bubbles. For instance, bitcoin is being sold at a price that strongly exceeds its intrinsic value, which is comparable to the ‘tulip mania’, the stock market crash of 1720, and the internet bubble of 2000 (Cheah & Fry, 2015). In addition, just like in the previous examples, bitcoin is attracting many speculative investors due to its significant growth in value in the last few years. This situation has the potential to harm many people if bitcoin turns out to be the next speculative bubble. Therefore, it will be instructive to explore bitcoin with visualisations of word embeddings over time to gain more profound access to the temporal development of bitcoin. For instance, it will be interesting to explore the evolution of the bitcoin ecosystem, which words became more related to bitcoin, and how bitcoin relates to words such as ‘profit’, ‘risk’, and ‘bubble’. Answering these questions will provide us with information that allows us to reason about bitcoin from a contemporary perspective and to demonstrate the added value of our approach.


Experimental setup

In this chapter, I will discuss the experimental setup and method. I will first discuss the use case, the data set, and the way the data is pre-processed. Then, I will discuss the method for the construction of the vector spaces and the corresponding hyperparameters.

Subsequently, I will evaluate the constructed word embeddings in terms of quality. Section 3.4 will focus on visualising semantic similarity and the methods that have been developed to analyse current cultural phenomena.

3.1 Use Case

Before we delve deeper into the data set and the experimental procedure for this study, we will first discuss our use case. Section 2.4 provided context about the bitcoin phenomenon.

Moreover, it was stated that the current situation of bitcoin is analogous to previous situations that eventually led to a speculative bubble. As a result, we think that it will be instructive and justified to use our proposed approach to gain insight into this cultural phenomenon.

The bitcoin phenomenon will be explored with different sub-questions (see below), which will reveal the strengths and weaknesses of our approach. These sub-questions derive from one main question that is the entryway to our analysis of bitcoin: how did the bitcoin ecosystem change between January 2010 and October 2016? For instance, someone whose interest lies in investigating the bitcoin ecosystem may want to examine how bitcoin is related to specific words (e.g., in terms of semantic similarity) over time. Examining the semantic similarity over time between a reference word and other words may provide a way of reasoning about the development of a cultural phenomenon and its ecosystem. Therefore, we will investigate how strongly bitcoin is related (in terms of semantic similarity) to the following words: ‘peer-to-peer’, ‘bubble’, ‘hack’, ‘risk’, and ‘profit’. These words were carefully chosen because a strong similarity between our phenomenon and these words may help us to reason about the development of bitcoin. Hence, we should develop a method for the comparison of a user-provided reference word and other user-provided target words, as indicated in our first research question.

Anyone interested in investigating bitcoin’s ecosystem may also wish to determine how the associations of bitcoin changed over a time period. As such, questions such as ‘Is a specific trend or development recognizable in the associations of Bitcoin?’ or ‘Do the associations of Bitcoin remain stable over time?’ can be answered with a method that automatically chooses relevant target words, given a reference word and a user-specified time period. This method, which we call the anchoring method, has been developed for the purpose of this study. We will extend the use case for this anchoring method by investigating another concept, namely ‘peer-to-peer’ (or ‘p2p’). Indeed, bitcoin is a p2p electronic cash system that allows online payments to be sent directly from one owner to another without the interference of a trusted third party (Nakamoto, 2008). However, before bitcoin was introduced, p2p was merely associated with a p2p network, which allows computers (the peers) to directly connect and share files with each other without a central server (Christensson, 2006). The anchoring method will allow us to visualise the changes in the relationship between certain words over time. Comparing the semantic context of bitcoin to that of p2p might help to grasp the social significance of bitcoin.

In the following section, I will discuss these techniques and demonstrate how visualising the changes in semantic relationships between words can provide insight on a phenomenon. It is important to note that bitcoin is merely a use case to evaluate the potential of these techniques.

3.2 Data set

The dataset that was used for creating the semantic vector space was the News on the Web Corpus (Davies, 2013). This corpus is also known as the NOW Corpus and contains web-based newspapers and magazines from 2010 to the present day. It is updated daily with approximately five million words. The version that was used in this thesis ranges from January 2010 till November 2016 and contains about four billion words. The corpus is compiled using an automatic script that downloads about 10,000 webpages from links provided by the Google news aggregator every night at 10:00PM. Subsequently, the downloaded webpages are cleaned by removing boilerplate material (Davies, 2013).

Studying cultural and historical phenomena with a current news corpus has some advantages over sources such as the Google Books corpus. Although the corpus that was used by Michel et al. (2011) is considerably larger and contains nearly 6% of all books ever published, a news corpus has a higher consistency in narrative style and grammar (Yao et al., 2018). Because the NOW Corpus covers the current era, it is evidently more suitable for investigating the latest phenomena occurring in society. Moreover, news corpora report systematically on current events and social trends (Yao et al., 2018).

3.2.1 Cleaning the data

Pre-processing the data is a vital part of natural language processing because it deals with a massive amount of unstructured text data. However, text cleaning can be considered task specific: what the text data will be used for heavily influences how it is cleaned. In this case, our word embeddings are built for analysing the temporal change in the semantic similarity between specific words to gain insight on the temporal development of cultural and historical phenomena. Considering this, it is important that the word embeddings that are used for this task provide accurate semantic associations.

Luckily, the NOW corpus was already pre-processed.1 Hence, the data only needed to be lightly pre-processed for the Word2Vec model. One requirement for using Word2Vec with the GENSIM library was that the input data needed to be specified as a list of sentences. All words in each sentence were lowercased, because this allows instances of a word at the beginning of a sentence to be matched to instances of the same word appearing in other parts of a sentence (Schütze, 2008). The drawback of this way of pre-processing is that some semantic distinctions between words are lost, such as the difference between ‘Apple’ the company name and ‘apple’ the fruit, or US President ‘Bush’ and the ‘bush’ plant. The following section will discuss the parameters that were used for building the semantic vector spaces with Word2Vec.

1 The data provided in the NOW corpus was already tokenized. Words were also lemmatized and lowercased.

3.3 Method

This section will provide a broad overview of the methods that were used to build visualisations of word embeddings obtained from the NOW corpus. I will first discuss the methods and libraries that were used for the creation of word embeddings. Then, I will explain the methods for visualising semantic similarity over time.

3.3.1 GENSIM

For current purposes, we divided the NOW Corpus into partitions of 6 months,2 except for the last chunk, which has 4 months of data (i.e., 07-2016 through 10-2016). For every partition of the NOW Corpus, word embeddings were built with the GENSIM library (Rehurek & Sojka, 2010). The GENSIM library gives users the ability to specify several hyperparameters for their method. For instance, the user can choose between the Skip-Gram and CBOW training algorithms, the number of dimensions for each vector, the size of the context window, and whether negative sampling will be used.

3.3.2 Hyperparameters Word2Vec

Mikolov et al. (2013a) have proposed two new architectures for building word embeddings: Skip-Gram and CBOW. Mikolov et al. have demonstrated that Skip-Gram with negative sampling and 300 dimensions outperforms other architectures on several tasks. The advantage of negative sampling is that it reduces the complexity of updating the weight matrices after every training sample. This reduction in complexity is achieved by sampling a few negative words for every training sample instead of updating the whole vocabulary (Mikolov et al., 2013b). Furthermore, negative sampling approximates the maximisation of the softmax log probability in a simplified way, without losing the quality of the vector representations. The number of negative samples that should be drawn for every training sample must be specified as a hyperparameter of the Word2Vec model. According to the results in Mikolov et al., this number varies between 5 and 20, and sampling five negative words should be sufficient for large datasets.

2 We also divided the corpus into yearly partitions; however, for consistency we will continue with the six-month partitions.

The size of the context window can also be chosen. This determines how many words before and after the current word will be used as context words. Since displaying the semantic associations of a word is one of the main goals, a relatively large context size of 10 was chosen (Levy & Goldberg, 2014). Because of this setting, large parts of a sentence will be used as context, and the Word2Vec model will be able to capture better semantic associations of a word (Mikolov et al., 2013b). Levy and Goldberg have also demonstrated that larger windows tend to capture more of the domain information of a word in a vector. However, one must be aware that a larger window size results in an increase in the training time for the corpus. Finally, we decided to choose the default value of five for the minimum number of times a word must occur in the corpus to qualify for the vocabulary. As a result, words that appear fewer than five times in the corpus are absent from the resulting vector spaces. In summary, these are the hyperparameters that were used in our Word2Vec model:

• Both CBOW and Skip-Gram embeddings were built.

• Number of dimensions: 300

• Minimum word count: 5

• Window size: 10

• Negative sampling: yes

• Number of negative words sampled: 5

3.3.3 Evaluation of word embeddings

Although the ultimate test for evaluating the quality of a generated semantic model should be its performance in a downstream application, there are several ways for evaluating the general quality of the word embeddings.

For a long time, WordSim-353 was considered the ‘gold standard’ for evaluating the ability of distributional semantic models to predict semantic relations between word pairs (Finkelstein et al., 2001). WordSim-353 relies on a data set that consists of 353 word pairs, each manually rated by humans with a similarity score between zero and ten (rounded to one decimal place). As a result, the similarity score in the vector space can be compared to the similarity in the data set, resulting in a correlation score between the model and the WordSim-353 data set.
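The logic of this comparison can be sketched in a few lines: rank the human ratings and the model's similarity scores, then compute a rank correlation between them. The scores below are invented for illustration, and the simplified Spearman formula assumes no tied ranks; the thesis does not specify which correlation coefficient was used.

```python
# Toy sketch of the WordSim-style evaluation: compare human similarity
# ratings for word pairs with the model's similarity scores using a
# (tie-free) Spearman rank correlation.

def spearman(a, b):
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0.0] * len(values)
        for rank, index in enumerate(order):
            result[index] = float(rank)
        return result

    rank_a, rank_b = ranks(a), ranks(b)
    n = len(a)
    d_squared = sum((x - y) ** 2 for x, y in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))

human_ratings = [9.8, 8.5, 3.1, 1.2]     # invented ratings for four word pairs
model_scores = [0.81, 0.64, 0.22, 0.05]  # invented cosine similarities
print(spearman(human_ratings, model_scores))  # → 1.0 (identical rankings)
```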

Mikolov et al. (2013a) have proposed a significantly more challenging similarity task for evaluating the quality of word vectors. They have discovered that simple algebraic operations on vector representations of words can be used to explore many different similarities between words. For example, the question ‘Which word is to “small” as “biggest” is to “big”?’ can be answered by simply computing vector X = vector(‘biggest’) – vector(‘big’) + vector(‘small’). This new vector X can be used for searching the closest word in the vector space, which appears to be the word ‘smallest’ when the word vectors are well trained (Mikolov et al., 2013b). This finding motivated Mikolov et al. to develop a more complex similarity task with five types of semantically related questions and nine types of syntactically related questions. Figure 5 depicts the five types of semantic relationships and nine types of syntactic relationships that can be probed by performing simple algebraic operations on vector representations of words. In total, there are 8,869 semantic questions and 10,679 syntactic questions, with each question being phrased as ‘which word is similar to x1 as word y1 is similar to y2’. There is only one correct answer for every question: the closest word to the computed vector must be the same as the correct answer in the test set. The accuracy on this test set provides a strong indication of the quality of the learned word embeddings for follow-up applications (Mikolov et al., 2013b).

Figure 5. Examples of the five types of semantic relationships and nine types of syntactic relationships in the semantic-syntactic word relationship task. Adapted from ‘Efficient Estimation of Word Representations in Vector Space’ by T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, 2013b, NIPS, p. 6. Copyright 2013 by NIPS.
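The vector arithmetic behind this task can be illustrated with a toy example. The four-dimensional vectors below are invented purely for illustration (real Word2Vec vectors have hundreds of dimensions); with GENSIM, the equivalent query would be `model.wv.most_similar(positive=['biggest', 'small'], negative=['big'])`.

```python
import math

# Invented toy vectors; in a trained model these would come from Word2Vec.
vectors = {
    "big":      [1.0, 0.2, 0.0, 0.1],
    "biggest":  [1.0, 0.9, 0.0, 0.1],
    "small":    [-1.0, 0.2, 0.1, 0.0],
    "smallest": [-1.0, 0.9, 0.1, 0.0],
    "banana":   [0.0, 0.0, 1.0, 0.7],
}

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

# X = vector('biggest') - vector('big') + vector('small')
x = [biggest - big + small for biggest, big, small
     in zip(vectors["biggest"], vectors["big"], vectors["small"])]

# The closest word to X (excluding the query words) answers the question.
best = max((w for w in vectors if w not in {"big", "biggest", "small"}),
           key=lambda w: cosine(vectors[w], x))
print(best)  # → smallest
```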

Hill et al. (2014) have proposed SimLex-999, a new ‘gold standard’ for evaluating semantic similarity, as an alternative to WordSim-353. In contrast to WordSim-353, SimLex-999 explicitly quantifies similarity rather than association. As a result, word pairs that are related to each other but are not necessarily similar receive a low similarity rating. For example, the words ‘cup’ and ‘coffee’ are rated as more similar than the words ‘car’ and ‘train’ in the WordSim-353 data set, although the latter pair shares many more common properties (Hill et al., 2014). ‘Cup’ and ‘coffee’ should therefore count as an association, because they are clearly not similar, whereas ‘cup’ and ‘mug’ have the strongest similarity relationship given that they are synonyms. This distinction was implemented with the help of 500 participants, who were asked to distinguish association from similarity while rating the similarity of pairs. Therefore, SimLex-999 contains a significant number of pairs, such as ‘movie’ and ‘theatre’, which are strongly associated but still have a low similarity rating.

During the analysis of the quality and validity of the generated vector spaces, the Text8 corpus, trained with the parameter settings mentioned in the previous section, was taken as a reference for baseline performance. With a vocabulary size of 253,855 words, the Text8 corpus can be considered a strong baseline, considering the vocabulary sizes of the partitions of the NOW corpus. Figure 6 presents the results of each vector space on the semantic (left) and syntactic (right) word relationship test.

Figure 6. Semantic and syntactic scores on the word relationship test set for CBOW and Skip-Gram word embeddings created from different partitions of the NOW corpus.

Figure 6 indicates that the test set scores of the NOW vector spaces were generally higher than the scores achieved by the Text8 vector spaces (Skip-Gram: 47.00% and 42.00%; CBOW: 50.00% and 45.00%). More importantly, our vector spaces achieved scores that are comparable to those reported in Mikolov et al. (2013b). Mikolov et al. trained various Skip-Gram models with 300 dimensions and reported an average semantic accuracy of 50.00% and an average syntactic accuracy of 60.00%. Figure 6 illustrates that the semantic accuracy scores obtained by our Skip-Gram vector spaces vary between 57.74% and 73.30%, while the syntactic scores of our Skip-Gram vector spaces vary between 45.28% and 55.32%.

Moreover, our CBOW vector spaces achieved even better scores than the Skip-Gram vector spaces. The reason for this may be the contents of our corpus, because CBOW with negative sampling works better than Skip-Gram with negative sampling when both models are dealing with data that contains many frequent words (Naili et al., 2017). While this topic is certainly worth further exploration, it is outside the scope of the current study. The scores of our vector spaces are primarily an indication of validity: whether our models accurately measure what they are supposed to measure.

Our constructed vector spaces were also tested on the WordSim-353 and SimLex-999 test data. According to Hill et al., modelling similarity between word pairs should be more challenging than modelling association. Therefore, we should expect a lower correlation score for the SimLex-999 data set. Figure 7 presents a summary of the conducted tests and their scores. The results strongly indicate that the NOW vector spaces perform better than the Text8 vector spaces on both test sets. Moreover, these results confirm that modelling similarity is more challenging than modelling association. According to Hill et al., the average correlation score for WordSim-353 with Word2Vec is 61.10%, while the average score for SimLex-999 with Word2Vec is 41.10%. We report an average correlation score of 65.00% on WordSim-353, whereas our NOW corpus vector spaces achieved an average score of 37.00% on SimLex-999. Based on the scores achieved on all the test sets, we can conclude that the general quality of the word embeddings is sufficient for the intended task of this thesis.

Figure 7. Correlation scores on the SimLex-999 and WordSim-353 test sets for chunks of the NOW corpus and the Text8 corpus.

3.4 Visualising semantic similarity over time

In the following subsections we will introduce three ways of visualising word embeddings, each of which can be used to examine the temporal change in the semantic similarity between specific words and can provide insight into the temporal development of cultural and historical phenomena. However, each visualisation has a different method of choosing the vectors whose similarity will be visualised. Moreover, because every visualisation provides different information about the vector space, a wider range of questions can be explored. After a short recap of previous work on visualising word embeddings, we will introduce two methods for choosing vectors in the visualisations. Finally, we will discuss the evaluation of the visualisations, which will take place in the next chapter.

3.4.1 Semantic similarity

A comparison between the reference vector and the other specified vectors will be based on the degree to which two words can occur in the same linguistic contexts (Harris, 1954; Miller & Charles, 1991). This degree (i.e., the semantic similarity) can be measured by calculating the similarity or distance between the elements of both vectors. Different metrics can be used to measure the angle or distance between vectors, such as the cosine similarity or the Euclidean distance. To remain consistent in this thesis, it was decided to use the cosine similarity to measure the degree of similarity between two vectors.

cos(x, y) = (x · y) / (|x| · |y|)

Given this equation, the cosine similarity of two vectors can be obtained by taking the cosine of the angle between the two vectors (Huang, 2008). For two vectors x and y, this is done by taking their dot product and dividing it by the product of their magnitudes. The cosine similarity varies between -1 and 1. As a result, word vectors of two words that are semantically related to each other, such as ‘dog’ and ‘cat’, will have a cosine similarity closer to 1, while the cosine similarity of two vectors that are very dissimilar will be closer to -1. This also implies that semantic similarity can be used to look up the nearest neighbours of specific words in a semantic vector space.
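A minimal implementation of this equation in plain Python (a sketch; in practice GENSIM's `similarity` method computes the same quantity):

```python
import math

def cosine_similarity(x, y):
    # Dot product of x and y divided by the product of their magnitudes.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Vectors pointing in the same direction give 1.0; opposite directions give -1.0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # ≈ 1.0
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # ≈ -1.0
```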

3.4.2 Recap: previous work on visualising word embeddings

There are two types of visualisations of word embeddings that can be distinguished in the existing literature. The first type is the visualisation of the pairwise similarities between many vectors simultaneously. This visualisation type is typically based on dimensionality reduction techniques such as t-SNE and PCA (Van Der Maaten & Hinton, 2008).

Dimensionality reduction techniques enable the visualisation of the word embeddings as coordinate points in a low-dimensional space (Hilpert & Perek, 2015; Smilkov et al., 2015).

The easy investigation of local neighbourhoods and clusters of semantically similar words is this first type’s key strength. For instance, Hilpert and Perek have visualised 230 nouns in a two-dimensional motion chart. This moving scatterplot revealed how the associated semantic category and normalised frequency of each noun changed over time. A similar visualisation was developed by Smilkov et al. (2015). Word embeddings were visualised as points in a low-dimensional space, allowing users to inspect the local neighbourhoods of points and analyse the geometry of the vector space.

The second type is the visualisation of the similarity between a reference vector and other vectors over different instantiations of a vector space. This does not require a transformation of the data into a low-dimensional space. This type of visualisation charts the similarity between a reference vector and other vectors in the vector space on one coordinate axis, while the other coordinate axis is used for another purpose, for instance to represent time when using a time-tagged corpus. This type of visualisation has been applied by Frermann and Lapata (2016). The authors presented a Bayesian model that automatically infers temporal word representations as a set of senses, where each sense stands for the 10 most probable words. As a result, it is possible to track the meaning change of a reference vector in a bar chart by investigating how the proportions of the senses (colour-coded) change at each time interval.

3.4.3 User-defined method

The first method that we propose for the visualisation of semantic similarity is the ‘user-defined’ method. This ‘baseline’ method aims to visualise a comparison of a user-provided reference word and other user-provided target words. This means that a user must specify which reference vector will be compared with other user-specified vectors along the temporal dimension. The purpose of the user-defined method is to let a user look up and investigate how specific words of interest relate to each other over time. It offers significant freedom in terms of exploration but has the disadvantage of depending on the user’s prior knowledge of the subject. I will use this method to explore the relationship between bitcoin and some words that are strongly related to bitcoin but were used in a different context in the past. The expectation is that the visualisation will show that this evolution has occurred.
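The user-defined method can be sketched as follows. The function and variable names are illustrative assumptions, not the thesis's actual code; each vector space is assumed to offer a `similarity` method and `in` membership tests, as GENSIM's KeyedVectors objects do.

```python
def user_defined_series(reference, targets, spaces):
    """Collect the similarity of each user-chosen target to the reference
    word in every time-ordered vector space, for plotting against time.

    `spaces` is a list of (label, vector_space) pairs, one per partition.
    """
    series = {target: [] for target in targets}
    for label, space in spaces:
        for target in targets:
            # None marks partitions where a word is missing from the vocabulary.
            if reference in space and target in space:
                value = space.similarity(reference, target)
            else:
                value = None
            series[target].append((label, value))
    return series
```

Plotting each target's series against the partition labels yields one similarity curve per user-provided word.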

3.4.4 Anchoring method

The second method that we propose is a novel way of automatically choosing relevant target words, given a reference word and a user-specified time period. For a reference vector x and a time period t1-t2, our method finds the n most relevant target words to x in the first and last vector spaces of that period. Hence, the n most relevant target words are the nearest semantic neighbours at the beginning and at the end of the time period that we are interested in. In other words, the first and last vector spaces provide vectors that function as anchors for gaining insight into the development of the reference word. For instance, if we consider ‘snap’ to be our reference vector, where n is 3 and t1-t2 is January 2010 – July 2016, then the anchoring method finds the three closest neighbours of ‘snap’ in the NOW vector space of January 2010 and the NOW vector space of July 2016. The anchors of the word ‘snap’ are ‘break’, ‘catch’, and ‘grab’ in the 01-2010 space, while the anchors in the 07-2016 space are ‘selfie’, ‘snapchat’, and ‘photo’. Visualising this information and the corresponding similarity scores over time can provide insight into the development of a reference word. Furthermore, such a visualisation also facilitates a temporal representation of a reference word.
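A minimal sketch of the anchoring method, again under the assumption of one `{word: vector}` dictionary per period (the names are illustrative, not the code actually used in this thesis):

```python
import numpy as np

def nearest_neighbours(space, word, n):
    """The n words most similar to `word` within one vector space."""
    ref = space[word]
    scored = [
        (other, float(np.dot(ref, vec) / (np.linalg.norm(ref) * np.linalg.norm(vec))))
        for other, vec in space.items() if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:n]]

def anchoring_method(spaces, word, t1, t2, n=3):
    """Anchors = the n nearest neighbours of `word` in the first (t1)
    and last (t2) vector spaces of the chosen time period."""
    first = nearest_neighbours(spaces[t1], word, n)
    last = nearest_neighbours(spaces[t2], word, n)
    # keep order of appearance, drop anchors shared by both periods
    return list(dict.fromkeys(first + last))
```

The anchors returned here are the words whose similarity scores are then tracked across all intermediate periods.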

I will use this method to find the anchors of ‘bitcoin’ from January 2010 to October 2016. In addition, I will use it to examine whether the changes in the semantic associations of ‘peer-to-peer’, a word that is strongly related to bitcoin, can be linked to the development of bitcoin as a cultural phenomenon.

3.4.5 Evaluating the approach

Michel et al. (2011) have demonstrated a quantitative method that uses frequency counts of specific words to study cultural and historical phenomena in large textual data. A sudden increase in the frequency of the word ‘bitcoin’ could mean many things, such as an increase in its popularity or a crisis. The only certainty is that a sudden increase in the frequency of a word means that the word is being communicated more often in news articles. This uncertainty in reasoning about phenomena might disappear when we expand the approach of Michel et al. with novel ways of visualising semantic similarity over time. For instance, the temporal representations of a phenomenon may provide a clue as to why it is being communicated more frequently.

To demonstrate the added value of our approach, three questions related to our use case will be explored in Chapter 4. These questions will be used to evaluate each proposed visualisation, but they will also provide insight into bitcoin from different angles. The first visualisation that will be introduced is a two-dimensional graph that shows how a reference vector relates to other vectors over time. This visualisation will allow us to look up the word ‘bitcoin’ and investigate how it relates to specific words of interest.

The second visualisation that will be presented is a visualisation of our proposed ‘anchoring method’. The anchoring method will be visualised in a heat map that illustrates the anchors of a reference vector and how they change over time. This will allow us to gain insight into the nearest neighbours of bitcoin and their development over time.

The last visualisation that will be proposed is a dynamic motion chart: an animated scatterplot that presents multiple dimensions of information about a reference vector while the animation moves through time. The information presented in our dynamic motion chart comprises the semantic similarity over time between a reference word and its anchors, the three closest neighbours of each anchor, the dimension of time, and the normalised frequency counts. This visualisation will be used to investigate the influence of a phenomenon on the associations of an anchor. Examining this relationship will help us to grasp the social significance of bitcoin.

The starting point for our use case is as follows: How did the bitcoin ecosystem change between January 2010 and October 2016? This main question will be divided into three sub-questions that will be used to gain insight into different angles of our use case. Eventually, these sub-questions will serve both the exploration of our use case and the evaluation of our visualisations. The sub-questions on the analysis of bitcoin are as follows:

• For the user-defined method: How does bitcoin relate to the words ‘peer-to-peer’, ‘bubble’, ‘hack’, ‘risk’, and ‘profit’?

• For the anchoring method: How did the nearest neighbours of bitcoin change between January 2010 and July 2016?

• For the dynamic motion chart: How did the words associated with peer-to-peer change after the introduction and upsurge of bitcoin?

The answers to the questions formulated above will allow us to reason about the development of the bitcoin ecosystem between January 2010 and October 2016. Furthermore, this analysis of bitcoin will shed light on the added value and the constraints of our proposed approach.

Before we explore these questions, we will start the results section with an analysis of baseline performance to make the added value of our approach more explicit. The baseline will be the approach described in the study of Michel et al., because that study is the starting point of ours. The fact that computational advances now make it possible to go beyond the exploration of frequency counts has driven us to explore novel ways of visualising semantic similarity. These visualisations should be viewed as extensions of the study of Michel et al. rather than as a new approach. Therefore, it is essential that we report the baseline performance obtained with the approach of Michel et al.


Results

In this section, we will present three novel ways of visualising semantic similarity over time based on the research questions that were formulated in Chapter 1. However, we will start with a baseline visualisation, which is based on the approach of Michel et al. (2011).

4.1 Baseline performance

The baseline visualisation for our study is based on using frequency counts to explore cultural and historical changes (Michel et al., 2011). Figure 8 depicts the result of this approach.

Figure 8. Visualisation of five popular technology-related words and their normalised frequency counts over time, as observed in the NOW corpus.

In Figure 8, the frequency counts of four widely adopted technological brands and services are depicted next to the frequency counts of bitcoin. The following observations are based on the information in Figure 8. First, a sudden increase in the frequency of a word could mean many things. For instance, we can conclude from the bitcoin line that not much attention was paid to bitcoin in 2010. However, the rise in frequency after July 2013 has many potential explanations: an increase in frequency could be due to a rise in popularity, a crisis, or anything else that is closely related to the communication of bitcoin in a news source. Furthermore, it is also possible to make an observation based on the comparison between the frequency of the word ‘bitcoin’ and the other words. We think that such a comparison can provide insight into the cultural adoption of technology, if we assume that frequency reflects the relative attention (e.g., adoption) paid to each term. Michel et al. have demonstrated with their visualisations that the cultural adoption of technology has become more rapid since the beginning of the 19th century. For instance, technology invented from 1840-1880 was widely adopted within 50 years, while technology from 1880-1920 was widely adopted within 27 years. This trend has continued over the last few decades: bitcoin attracted relatively significant attention within 4 years of its introduction.3 The rest of this chapter will use our proposed methods and questions to obtain a more profound insight into this contemporary phenomenon.

4.2 Visualising semantic similarity between two words: the ‘user-defined method’

The first novel visualisation of semantic similarity over time that will be introduced is also the simplest. The semantic similarity between two words will be visualised in a two-dimensional graph based on the choices of a user. The variable ‘year’ will be placed on the x-axis, while the cosine similarity will be placed on the y-axis. This ‘user-defined’ method provides the user with the ability to compare a reference vector with other vectors over different instantiations of a vector space. In Chapter 3, the following sub-question was formulated for the user-defined method: How does bitcoin relate to the words ‘peer-to-peer’, ‘bubble’, ‘hack’, ‘risk’, and ‘profit’? The visualisation related to this question is depicted in Figure 9.
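A graph of this kind can be sketched with matplotlib, assuming a mapping from each target word to a list of (year, similarity) pairs; the function name and data layout here are illustrative, not the thesis’s actual plotting code:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

def plot_user_defined(series, reference):
    """Plot one line per target word: year on the x-axis, cosine
    similarity to `reference` on the y-axis, as in Figure 9."""
    fig, ax = plt.subplots()
    for target, points in series.items():
        years = [p for p, _ in points]
        sims = [s for _, s in points]
        ax.plot(years, sims, marker="o", label=target)
    ax.set_xlabel("year")
    ax.set_ylabel(f"cosine similarity to '{reference}'")
    ax.legend()
    return fig, ax
```

Each call produces one line per user-chosen target word, so a high line indicates a strong semantic association in that period.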

3 We assume that the other technology-related terms already occur with relatively high frequency in the NOW corpus and are therefore already widely adopted in society.

Figure 9. Visualisation of semantic similarity between a reference vector and other vectors over time. Here, the reference vector ‘bitcoin’ is compared with other vectors in both Skip-Gram (left) and CBOW (right) word embeddings.

Each line represents a comparison over time between two words, with a high cosine similarity indicating a strong semantic association in a specific time period. As is evident from Figure 9, both the Skip-Gram and CBOW word embeddings capture a relatively strong semantic association between bitcoin and peer-to-peer. This might indicate that bitcoin co-occurred for a significant amount of time in the same context as peer-to-peer, making peer-to-peer more semantically similar to bitcoin than the other target vectors. This outcome was expected because peer-to-peer is a substantial part of bitcoin technology (Nakamoto, 2008). Therefore, it was expected that peer-to-peer would be captured as the strongest association to bitcoin out of the chosen words and that this strong association would remain constant over time.

Furthermore, it seems that the Skip-Gram and CBOW word embeddings captured approximately the same relatedness between the reference vector of bitcoin and the other target vectors. However, this is not always the case; it is therefore always desirable to check for differences between the Skip-Gram and CBOW word embeddings.

After peer-to-peer, the word ‘hack’ has the strongest semantic association to bitcoin. This does not mean that bitcoin can be considered a hack. However, it is remarkable that the word ‘hack’ is more related to bitcoin than the words ‘risk’ and ‘profit’. This probably indicates that ‘hack’ is often used in the context of bitcoin. However, there is still uncertainty over the context in which an associated word is used. For instance, the word ‘hack’ can have either a positive association to bitcoin (e.g., ‘bitcoin is not a hack’) or a negative association (e.g., ‘another bitcoin hack’). Thus, a deeper analysis of bitcoin coverage in the news sources covered in the NOW corpus is necessary to prevent biased conclusions.

‘Bubble’ is the word least associated with bitcoin, according to our word embeddings. It is difficult to draw specific conclusions from this information because it merely indicates that ‘bitcoin’ and ‘bubble’ hardly co-occurred in the NOW corpus. However, we considered it worthwhile to present this semantic relation given the analogy to previous speculative bubbles.

The visualisation in Figure 9 demonstrated how a phenomenon’s relation to specific words can be investigated over time. This visualisation accurately presented the relatedness between a reference vector and other target words. We tried to determine how bitcoin was related to specific words of interest to gain insight into the development of the bitcoin ecosystem. We were able to make some observations about the relatedness of bitcoin to those words. However, it also became clear that it is hard to make thorough observations about the relatedness between these words of interest. For instance, it is hard to explain the semantic similarity between a phenomenon and a specific word based on the given information. Future studies should focus on addressing this constraint, because the information in this visualisation has the potential to be instructive for culturomics in general or to be useful to organisations. For instance, an organisation might use this visualisation to track how its products or services are related to words of interest.

4.3 Visualising semantic similarity using the ‘anchoring method’

This subsection presents a visualisation of the anchoring method described in subsection 3.4.4. The objective of the anchoring method is to capture the gradual change of a word’s semantic associations within a user-specified time span. Visualising the most similar words in the first and last time periods, together with their similarity scores across the time span, allows users to easily investigate how the strongest associations move over time. Accordingly, the following question was formulated for this visualisation: How did the nearest neighbours of bitcoin change between January 2010 and July 2016? Figure 10 depicts the visualisation with the Skip-Gram word embeddings related to this question.

Figure 10. Visualising the semantic similarity over time using the anchoring method and a heatmap with Skip-Gram word embeddings. The words placed on the x-axis are the anchors of the reference vector ‘bitcoin’.

Every box in this visualisation represents a word-year similarity score, while the colour bar next to the heatmap indicates the corresponding semantic score.4 Researchers who are focused on investigating a change in the anchors of a reference word should focus on the transition from blue to red or vice versa, because a transition from blue to red might indicate a strong semantic association that did not exist before. However, the user should take critical note of the scale of the colour bar, because it will differ for every word input. If the associations of a word remain constant over time, then the heatmap will have a smaller range in the colour bar.

4 The exact similarity score can be displayed by moving the cursor over the individual box of interest.
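The data behind such a heatmap can be sketched as a period-by-anchor matrix of cosine similarities. This is an illustrative reconstruction, not the thesis’s actual plotting code, and it again assumes one `{word: vector}` dictionary per period:

```python
import numpy as np

def similarity_matrix(spaces, word, anchors, periods):
    """Build the matrix behind the heatmap: rows are time periods,
    columns are anchor words, and each cell holds the cosine
    similarity of `word` to that anchor in that period (NaN when
    either word is missing from the period's vocabulary)."""
    matrix = np.full((len(periods), len(anchors)), np.nan)
    for i, period in enumerate(periods):
        space = spaces[period]
        if word not in space:
            continue
        ref = space[word]
        for j, anchor in enumerate(anchors):
            if anchor in space:
                vec = space[anchor]
                matrix[i, j] = np.dot(ref, vec) / (
                    np.linalg.norm(ref) * np.linalg.norm(vec))
    return matrix
```

Such a matrix can then be passed to, for example, matplotlib’s `imshow` or seaborn’s `heatmap` to obtain the word-by-year colour grid described above.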

The visualisation in Figure 10 starts in January 2011 because the word ‘bitcoin’ appeared only twice in the NOW corpus in 2010. Therefore, it is unlikely that our Word2Vec model would produce any reliable neighbours for this time period. Hence, we chose to begin our analysis of the nearest neighbours of ‘bitcoin’ in January 2011. In this time period, the nearest neighbours to bitcoin were ‘currency’, ‘paypal’, ‘peer-to-peer’, and ‘nakamoto’. Most of these associations with bitcoin seem valid. For instance, Nakamoto is the creator of bitcoin, and peer-to-peer is the technology that underpins it (Nakamoto, 2008). However, the word ‘paypal’ is harder to place in context, because it can be interpreted in several ways. For instance, bitcoin can be interpreted as a competitor of PayPal, a collaborator with PayPal, or a follow-up payment system.

The right side of the visualisation indicates that the nearest neighbours to bitcoin in July 2016 are ‘bitfinex’, ‘blockchain’, ‘cryptocurrency’, ‘ethereum’, ‘leocoin’, and ‘zcash’. A comparison with the nearest neighbours in January 2011 leads to the conclusion that the nearest neighbours of bitcoin changed between the two time periods. Most of the most strongly associated neighbours in July 2016 were not strongly associated with bitcoin beforehand. Another interesting observation in this chart is that the nearest neighbours in July 2016 tend to have higher similarity scores. This may be caused by the substantial increase in the coverage of bitcoin in the NOW corpus, which was demonstrated in Figure 8.

Additionally, the neighbours of bitcoin in July 2016 highlight different relevant aspects of bitcoin compared with the earlier-discussed neighbours. For instance, the neighbours ‘ethereum’, ‘leocoin’, and ‘zcash’ are competitors of bitcoin. The neighbour ‘bitfinex’ is a cryptocurrency trading platform, and the neighbour ‘blockchain’ is the technology that is used by many cryptocurrencies to record transactions (Young, 2018). The question ‘How did the nearest neighbours of bitcoin change between January 2010 and July 2016?’ is visualised with CBOW word embeddings in Figure 11 in order to compare and verify our observations.

Figure 11. Visualising the semantic similarity over time using the anchoring method and a heatmap with CBOW word embeddings. The words placed on the x-axis are the anchors of the reference vector ‘bitcoin’.

Figure 11 demonstrates that the nearest neighbours to bitcoin in January 2011 are similar to those in the Skip-Gram word embeddings. The only difference is the neighbour ‘andresen’, which refers to a lead developer of the bitcoin client software. However, this difference disappeared after a while, because ‘andresen’ slowly became one of the weakest neighbours of bitcoin by July 2016. This is probably because Andresen stopped working on bitcoin in 2014. Apart from this, no significant differences were found between the Skip-Gram and CBOW word embeddings for this question.

In this subsection, a change in the nearest neighbours of the reference word ‘bitcoin’ was investigated with a heatmap and our newly developed distributional method. We gained insight into the anchors of the bitcoin phenomenon and how they changed over time. However, Figures 10 and 11 indicate that there is some ‘noise’ in the nearest neighbours of bitcoin. For instance, the figures show that the nearest neighbours to bitcoin include ‘crypto-currencies’, ‘crypto-currency’, ‘cryptocurrency’, and ‘cryptocurrencies’. These near-duplicate words with different spellings and inflections may have displaced other neighbours, which could influence the analysis of a phenomenon. Addressing this constraint in future studies would likely improve the added value of this method. Although we succeeded in presenting multiple dimensions of information for a user-specified word (e.g., time, similarity scores, and the anchors of the initial and final periods), we still believe that there is room for improving the visibility of the data. Therefore, another visualisation of semantic similarity will be presented in the next subsection.

4.4 Visualising additional non-semantic dimensions of information

This subsection proposes a method for visualising additional non-semantic dimensions of information about the anchors. A dynamic motion chart is one possible solution: it is an animated scatterplot in which the data can ‘move’ over time. We will use a dynamic motion chart to visualise the semantic similarity over time using the anchoring method, where each bubble represents an anchor of a user-chosen reference word. As we move across time, the relative size of a bubble may shrink or grow, based on the normalised5 frequency count of the word. Each bubble keeps a fixed horizontal position and moves vertically through time according to its similarity score. In addition, this visualisation introduces the possibility of looking up the three closest neighbours of each anchor, which should enable deeper insight into the local neighbourhood of the reference vector. This visualisation will be examined by the following question: How did the words associated with peer-to-peer change after the introduction and upsurge of bitcoin? Figure 12 depicts the visualisation of the reference vector ‘peer-to-peer’6 with Skip-Gram word embeddings in a dynamic motion chart.

5 Because we are dealing with unequal sub-corpus sizes, presenting the ‘raw’ frequencies would be misleading. Therefore, the raw counts are normalised by first taking the log10 and then dividing this value by the sub-corpus size.
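Before turning to the figures, the data behind one frame of such a chart can be sketched: the normalisation from footnote 5, plus one record per anchor holding its similarity (vertical position), normalised frequency (bubble size), and closest neighbours (tooltip). All names are hypothetical, and the choice to exclude the reference word from an anchor’s neighbour list is an assumption, not something the thesis specifies:

```python
import math
import numpy as np

def normalised_frequency(raw_count, subcorpus_size):
    """Normalisation from footnote 5: log10 of the raw count,
    divided by the sub-corpus size, so that unequal sub-corpus
    sizes remain comparable."""
    return math.log10(raw_count) / subcorpus_size

def bubble_frame(space, counts, subcorpus_size, reference, anchors, k=3):
    """Assemble one animation frame: for each anchor, its similarity
    to the reference, its normalised frequency, and its k closest
    neighbours within this period's vector space."""
    ref = space[reference]
    frame = []
    for anchor in anchors:
        vec = space[anchor]
        sim = float(np.dot(ref, vec) / (np.linalg.norm(ref) * np.linalg.norm(vec)))
        neighbours = sorted(
            (w for w in space if w not in (anchor, reference)),
            key=lambda w: -float(np.dot(vec, space[w])
                                 / (np.linalg.norm(vec) * np.linalg.norm(space[w]))),
        )[:k]
        frame.append({
            "anchor": anchor,
            "similarity": sim,
            "size": normalised_frequency(counts[anchor], subcorpus_size),
            "neighbours": neighbours,
        })
    return frame
```

Computing one such frame per monthly vector space yields the sequence of scatterplot states that the animation steps through.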

Figure 12. A snapshot of the reference vector ‘peer-to-peer’ at the starting point of the dynamic visualisation.7

Figure 12 demonstrates that there are roughly two groups of words that can be associated with peer-to-peer. The clear separation between strongly and weakly related groups is caused by the anchoring method, which makes it possible to explore the ‘anchor neighbours’ of a reference vector. The most important observation from Figure 12 is that the least related associations to peer-to-peer in January 2010 are ‘cryptocurrency’ (0.37735), ‘fintech’ (0.26605), and ‘blockchain’ (0.35388). By contrast, the most related associations to peer-to-peer in January 2010 are ‘file-sharing’ (0.83435), ‘bittorrent’ (0.80762), and ‘downloading’ (0.70111). Hence, the word ‘peer-to-peer’ was more closely related to the sharing of files and documents. Moreover, a dissociation was found in the January 2010 vector space between peer-to-peer and bitcoin-related associations. This is probably because bitcoin was still relatively unknown at that time (see Figure 8).

6 P2P stands for peer-to-peer and is nowadays heavily linked to bitcoin. Bitcoins are broadcast from user to user via the peer-to-peer network without the intervention of a broker (Nakamoto, 2008).
7 https://github.com/klimpie94/visualisations/blob/master/Dynamic%20visualization.ipynb

Figure 13. A snapshot of the dynamic visualisation of reference vector ‘peer-to-peer’ at 01-2014.

In Figure 13, the motion chart has moved to January 2014. The figure depicts an example of the amount of information that can be requested for each bubble: the three closest neighbours, the normalised frequency count, and the similarity score can be looked up at every time period. Furthermore, Figure 13 also demonstrates that words that were not closely related to ‘peer-to-peer’ in 2010, such as ‘blockchain’ (0.55913), ‘cryptocurrency’ (0.66044), and ‘non-bank’ (0.60189), started to increase in similarity. The reason for this might be that bitcoin rose to prominence in 2014, when it was the most famous cryptocurrency on the exchange market (investopedia.com, accessed on 13 August 2018). A significant increase in the coverage of this phenomenon was indicated by the normalised frequency count of the word ‘bitcoin’, which started to overtake the frequency counts of famous technology brands such as Facebook and Apple. The social significance of bitcoin has likely caused changes in the temporal associations of the word ‘peer-to-peer’.

Figure 14. A snapshot of the dynamic visualisation of reference vector ‘peer-to-peer’ at 07-2016.

Figure 14 illustrates how the dynamic visualisation of the word ‘peer-to-peer’ came to its end. As can be observed, there is substantial growth in the size of several bubbles. For example, ‘fintech’8 increased significantly in both frequency and similarity score.

8 Fintech refers to the emerging trend in the financial sector of using technology to deliver innovation. For instance, blockchain, cryptocurrencies, algorithms, and smart contracts are nowadays part of fintech.

Furthermore, a significant change in the most associated words can be observed in comparison with Figure 12. Words that are closely related to bitcoin, such as ‘cryptocurrency’, ‘fintech’, and ‘e-payments’, are now strongly related to peer-to-peer. On the other hand, words such as ‘downloading’, ‘bittorrent’, and ‘file-sharing’ became less associated with peer-to-peer. This change was probably due to the major emergence of bitcoin and blockchain-related techniques. A similar pattern was also found in the CBOW word embeddings, which are depicted in Figure 15.

Figure 15. A snapshot of the dynamic visualisation of reference vector ‘peer-to-peer’ with CBOW word embeddings at 07-2016.

The dynamic visualisation of the reference word ‘peer-to-peer’ with CBOW word embeddings indicated comparable neighbours at each time step. The pattern that was observed in the Skip-Gram visualisation was also observable in our CBOW visualisation: the associations of peer-to-peer changed from file sharing and BitTorrent services to associations that are closely related to bitcoin and blockchain. Hence, the hypothesis formulated in Section 3.1 can be confirmed. The influence of the bitcoin phenomenon on the associations of peer-to-peer is the most logical explanation for the changes over time. This explanation is supported by evidence such as Figure 9, which demonstrates the strong relatedness of the words ‘bitcoin’ and ‘peer-to-peer’.


Discussion

In this chapter, the results presented in Chapter 4 will be linked to the literature discussed in Chapter 2. In addition, Section 5.2 will discuss the limitations of this study and its data. Section 5.3 will detail the contributions of this study and possible directions for future research.

5.1 Results linked to the literature

In the previous chapter, three novel visualisations of semantic similarity through time were presented. The goal of these visualisations was to broaden the approach to studying cultural and historical phenomena that was introduced by Michel et al. (2011). The authors illustrated how the rise and fall of frequency counts over time enables reasoning about cultural and historical phenomena. However, the introduction of Word2Vec and negative sampling made it possible to look beyond the frequency counts of words and analyse large amounts of textual data with word embeddings (Mikolov et al., 2013a). As a result, we tried to expand the approach of Michel et al. (2011) by analysing the temporal change in the semantic similarity between specific words to gain insight into the temporal development of cultural and linguistic phenomena. Instead of examining the popularity of selected words by observing their difference in word frequency over time, we proposed to examine the similarity of selected words over time to gain insight into recent phenomena. Analysing the anchors of a word, or its semantic similarity to other target words, enabled us to gain insight into how specific words are communicated in online news sources.

The first visualisation presented the semantic similarity over time between a reference vector and other words in a two-dimensional graph. Although this visualisation offered significant freedom in terms of exploration, it had the disadvantage of depending on the user’s previous knowledge of the subject. We explored the following question with this visualisation: ‘How does bitcoin relate to the words “peer-to-peer”, “bubble”, “hack”, “risk”, and “profit”?’ We were able to make some abstract observations about the bitcoin phenomenon and the other words based on the information that was gleaned. However, as expected, the visualisation indicated that knowledge about the subject or the news coverage was required to avoid biased conclusions. This constraint also applies to the approach of Michel et al. (2011), where a thorough understanding of the development of frequency counts depends on a user’s previous knowledge of a subject.

The second visualisation aimed to depict the anchor neighbours of a reference vector. A distributional method was developed for observing those neighbours and their similarity over time for a user-specified reference vector. In this method, the first and last vector spaces provide vectors that function as anchors for gaining insight into the development of the reference word. Visualising a reference word with our distributional method allows a user to easily investigate the temporal change in semantic similarity. We visualised the distributional method with a heatmap, which requires users to make sense of the temporal change between target words on their own. Since we were dealing with a data set that spanned such a small time span, this was not a problem. However, if we were dealing with significantly more time points, then it would probably be more desirable to develop an accompanying table that emphasises significant trends.

This second visualisation enabled us to gain insight into the changes in the nearest neighbours of bitcoin. Combining the insights gleaned from it with the approach of Michel et al. may be a solution to the ‘previous knowledge’ constraint: because the distributional method provides a smart way to look up knowledge about the communication of a phenomenon, it combines well with a visualisation of frequency counts.

The last visualisation proposed was a dynamic visualisation of semantic similarity through time. We proposed a moving scatterplot that was inspired by Hilpert and Perek (2015). The first difference between our visualisation and that of Hilpert and Perek is that we were not restricted to the visualisation of nouns; instead, our visualisation could depict any target words of a user-specified reference vector. Second, our bubbles moved vertically according to their similarity score over time, while the bubbles in the visualisation of Hilpert and Perek had a fixed position in accordance with their semantic category. The last difference is that we tried to demonstrate how a motion chart could be utilised for displaying temporal similarities between specific words, while the motion chart of Hilpert and Perek was developed for the analysis of linguistic change. The only similarity between the two motion charts is that the normalised frequency of a word is indicated by the size of a bubble.

5.2 Limitations of study and data

In addition to our visualisations, the data used in our study was also significantly different from the data used by Michel et al. (2011). We used the NOW corpus, which contains web-based newspapers and magazines from 2010 to the present. By contrast, the study of Michel et al. (2011) was based on the books that had been printed between 1800 and 2000. The NOW corpus allowed us to explore a recent phenomenon through the lens of current events, despite a much smaller time span than was used by Michel et al. (2011). However, this did not obstruct the analysis of bitcoin, our use case. One possible reason for this may be the amount of data available in each partition of the NOW corpus. As a result, we successfully managed to gain insight into the bitcoin phenomenon in annual and six-month periods, which were the most appropriate lengths for our analysis.

One limitation of the data was the unequal corpus size of each time period. For instance, the vector space for January 2010 had a vocabulary of 113,107 unique words, while the vector space for January 2016 had 345,224. These differences have undoubtedly influenced the quality and results of the word embeddings. In addition, such differences in vocabulary size are likely to cause problems when words and their associations are compared over time, because some words are not represented in some years. As a result, research outcomes may be invalid if a comparison between vector spaces is based on incomplete information. A solution to this limitation might be a sliding time window that moves through the NOW corpus, meaning that an annual vector space is built with every month as a starting point. This implementation might improve the comparison over time.

Another limitation of our study was the way we pre-processed the data. Because each word was lowercased, subtle distinctions between words were lost, such as that between the brand name ‘Apple’ and ‘apple’ the fruit. Therefore, our study could not always present fully accurate information based on the choices of a user. This drawback could have been prevented by pre-processing the data in a different way.

Finally, it is important to emphasise the potential influence of the way the word embeddings were constructed. For example, a vector space built by counting word co-occurrences and one built with Word2Vec's Skip-Gram model may accentuate different aspects of semantic similarity: the neighbours of a reference vector, or the distances between word embeddings, may differ significantly, which could influence the resulting insights. An empirical investigation of the consequences of specific decisions in the construction of word embeddings could be a possible direction for future research. A thorough understanding of the implications of the chosen parameters would be a substantial element of such a study.

5.3 Contribution of study and future research

This study presented three novel methods of visualising semantic similarity through time to analyse recent cultural and historical phenomena. Each method's strengths and drawbacks were explored in a case study of bitcoin-related questions. We demonstrated how cultural and historical phenomena can be investigated with the visualisation of semantic similarity, and we believe this constitutes a meaningful contribution to the way such phenomena are analysed. This was achieved in part through a newly developed method, which we called the anchoring method. This method helps a user to find the relevant target words of a reference vector and subsequently demonstrates how those target words gradually change over time.

In this study, we used the first and last period of a diachronic vector space to retrieve the anchor neighbours of a word. In future research, however, it may be better to obtain the anchor neighbours from a specific time interval, because the timespan of a corpus is usually several decades. For instance, the anchoring method used in our study would not be suitable for the corpus used in the study of Michel et al. (2011): because their corpus covers a 200-year timespan, obtaining the anchors at 10-year or 20-year intervals would make more sense.
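The anchoring idea can be sketched as follows. The toy two-dimensional vectors and the function names are illustrative assumptions rather than the actual implementation, which operated on the NOW-based vector spaces.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors given as plain lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def nearest_neighbours(space, word, n):
    """The n words most cosine-similar to `word` in one vector space
    (a dict mapping words to vectors)."""
    sims = {other: cosine(space[word], vec)
            for other, vec in space.items() if other != word}
    return sorted(sims, key=sims.get, reverse=True)[:n]

def anchors(spaces, word, n=3):
    """Anchoring sketch: the union of the word's nearest neighbours in
    the first and the last vector space of the diachronic series."""
    return (set(nearest_neighbours(spaces[0], word, n))
            | set(nearest_neighbours(spaces[-1], word, n)))

# Toy example: 'bitcoin' is close to 'currency' early on and to
# 'bubble' in the last period, so both become anchors.
first = {"bitcoin": [1.0, 0.0], "currency": [0.9, 0.1], "gold": [0.0, 1.0]}
last = {"bitcoin": [0.0, 1.0], "bubble": [0.1, 0.9], "currency": [1.0, 0.0]}
print(sorted(anchors([first, last], "bitcoin", n=1)))
# ['bubble', 'currency']
```

Using a 10-year or 20-year interval, as suggested above, would only change which elements of `spaces` are passed to `nearest_neighbours`.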

We also demonstrated the power of data visualisation by visualising and exploring high-dimensional data in a simple but clear way. These visualisations allowed us to analyse recent cultural and historical phenomena in a new way, thereby contributing to the scientific literature on culturomics.

However, much remains to be explored in future studies. One such study might focus on how the way a vector space is constructed influences results.

We demonstrated that semantic similarity over time can provide insight into the development of phenomena. However, little is known about the consequences of the way a vector space is constructed. An empirical investigation of this question could contribute significantly to the study of cultural and historical phenomena.

Another potential area of investigation for a future study is a dynamic time span. This may be the solution for unequal data partitions, because a vector space built with significant overlap in the contents of the data partitions might improve the temporal representation of a word. However, one potential risk of this method is that it may lead to homogeneous vector spaces, which would make it harder to investigate a development over time.

Furthermore, it might also be instructive to make the proposed visualisations more flexible: for instance, by letting the user choose which anchors of a reference vector are displayed, or by letting the user choose the time points from which the anchors of a reference vector are obtained.

To conclude, it would be very interesting to interactively visualise the data of the recently proposed dynamic Word2Vec models (Szymanski et al., 2015; Yao et al., 2018).

These models have demonstrated their capacity for inferring a word's temporal meaning, and they dramatically increase the opportunities for research on cultural and historical phenomena. For example, temporal analogies have been shown to automatically find equivalent phenomena from the past that are related to current phenomena: the temporal analogies of the word ‘mp3’ in 2000 are ‘stereo’ and ‘disks’ in 1994, and the ‘Obama’ query in 2016 is equivalent to ‘Bush’ in 2006 and ‘Clinton’ in 1994 (Yao et al., 2018). Automatically visualising this kind of information may improve the analysis of cultural and historical phenomena.


Conclusion

This thesis addressed the following problem statement: Can we develop novel ways of visualising semantic similarity through time to analyse recent cultural and historical phenomena?

In order to find an answer to this problem statement, the following research questions were formulated:

RQ 1: Can we develop a method for the comparison of a user-provided reference word and other user-provided target words?

RQ 2: Can we develop a method for automatically choosing relevant target words given a reference word and a user-specified time period?

RQ 3: Can we develop a method for visualising the additional non-semantic dimensions of information of the target words?

RQ 4: Which of the developed methods are appropriate for analysing current events using recent time-tagged corpora?

Based on the results presented in Chapter 4, this thesis has found the following answers to these research questions:

RQ 1

We successfully developed a method that can compare a user-defined reference word to other user-defined target words. The semantic similarity over time allowed us to investigate how specific words of interest are related to each other. This baseline method was visualised by a line graph, which was the most accurate and meaningful way to display the data.
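The baseline method amounts to computing, for every time period, the cosine similarity between the reference vector and each target vector; the resulting series are the lines in the graph. A minimal Python sketch, with toy two-dimensional vectors standing in for the real embeddings and hypothetical function names:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors given as plain lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def similarity_trajectories(spaces, reference, targets):
    """For each target word, collect its cosine similarity to the
    reference word in every time-stamped vector space; None marks
    periods in which either word is missing from the vocabulary."""
    trajectories = {t: [] for t in targets}
    for space in spaces:
        for t in targets:
            if reference in space and t in space:
                trajectories[t].append(cosine(space[reference], space[t]))
            else:
                trajectories[t].append(None)
    return trajectories

spaces = [
    {"bitcoin": [1.0, 0.0], "currency": [1.0, 0.0]},  # period 1
    {"bitcoin": [1.0, 0.0], "currency": [0.0, 1.0]},  # period 2
]
print(similarity_trajectories(spaces, "bitcoin", ["currency", "bubble"]))
# {'currency': [1.0, 0.0], 'bubble': [None, None]}
```

Plotting each trajectory against the time axis yields the line graph used for this method.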

RQ 2

We successfully developed a novel way of automatically choosing relevant target words, given a reference word and a user-specified time period. Our method chose the n most relevant target words at the beginning and at the end of the relevant time period. The reason for this is that the nearest semantic neighbours in the first and last vector spaces provide vectors that function as anchors for gaining insight into the development of a reference word.

RQ 3

We successfully developed a method that visualises additional non-semantic dimensions of information about the target words. This resulted in a dynamic motion chart that visualises both the normalised frequency count of a target word (through the size of its bubble) and the approach described under RQ 2. To make this visualisation more informative, an extra layer of information about the target words was implemented: the three nearest neighbours at each timestamp, which can be revealed by hovering over the target word of interest. This extra layer of information makes it more convenient to investigate local neighbours and clusters of semantically similar words.
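The data behind such a motion chart can be assembled as one record per target word per timestamp. The following sketch uses hypothetical names and toy vectors; the actual chart was rendered interactively.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors given as plain lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def top_neighbours(space, word, n=3):
    """The n words most cosine-similar to `word` in one vector space."""
    sims = {other: cosine(space[word], vec)
            for other, vec in space.items() if other != word}
    return sorted(sims, key=sims.get, reverse=True)[:n]

def motion_chart_records(spaces, counts, totals, targets):
    """One record per (time point, target word): the normalised frequency
    drives the bubble size, and the three nearest neighbours at that
    timestamp supply the hover tooltip."""
    records = []
    for t, (space, count, total) in enumerate(zip(spaces, counts, totals)):
        for word in targets:
            if word not in space:
                continue  # word absent from this period's vocabulary
            records.append({
                "time": t,
                "word": word,
                "size": count.get(word, 0) / total,  # normalised frequency
                "tooltip": top_neighbours(space, word),
            })
    return records

space = {"bitcoin": [1.0, 0.0], "currency": [0.9, 0.1],
         "mining": [0.5, 0.5], "gold": [0.0, 1.0]}
records = motion_chart_records([space], [{"bitcoin": 10}], [1000], ["bitcoin"])
print(records[0]["size"], records[0]["tooltip"])
# 0.01 ['currency', 'mining', 'gold']
```

Each record then maps directly onto one bubble at one frame of the animation.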

RQ 4

The results indicated that all three methods are potentially appropriate for analysing current events using recent time-tagged corpora. Different questions related to bitcoin, our use case, were formulated to assess the appropriateness of each proposed method. The first method demonstrated its ability to investigate how a current phenomenon relates to specific words over time. However, the examination of this method also showed that it is not feasible to make thorough statements without substantial knowledge of the subject. Therefore, we consider the second method more appropriate for the analysis of current events.

Our second method was a novel distributional method that automatically chose relevant target words given a reference word (e.g., the phenomenon) and a user-specified time period. The method aimed to demonstrate how the anchors of a reference vector develop over time, which is more suitable for analysing current phenomena given the underlying information represented by those anchors.

The third method was a combination of the frequency approach described by Michel et al. (2011) and our anchoring method. However, this third method visualised the normalised frequency counts in a way that was inconvenient for analysis. Moreover, the frequency counts were visualised for automatically chosen words rather than for specific words of interest. As a result, we believe that our second method is the most appropriate for analysing current events using recent time-tagged corpora. Recent computational advances made it possible to go beyond the exploration of frequency counts and develop these methods. However, our proposed methods should not be viewed as entirely new approaches; rather, they are extensions of the study of Michel et al. and should be used in a manner consistent with those authors' approach.


References

Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 238–247. https://doi.org/10.3115/v1/P14-1023

Bengio, Y. (2013). Deep Learning of Representations: Looking Forward. ArXiv:1305.0445 [Cs]. Retrieved from http://arxiv.org/abs/1305.0445

Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A Neural Probabilistic Language Model. The Journal of Machine Learning Research, 3, 1137–1155. Retrieved from http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

Cheah, E., & Fry, J. (2015). Speculative bubbles in Bitcoin markets? An empirical investigation into the fundamental value of Bitcoin. Economics Letters, 130, 32–36. http://dx.doi.org/10.1016/j.econlet.2015.02.029

Chen, C., Härdle, W., & Unwin, A. (2008). Handbook of Data Visualization. Berlin-Heidelberg: Springer-Verlag.

Chen, W., Grangier, D., & Auli, M. (2016). Strategies for Training Large Vocabulary Neural Language Models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1975–1985. Retrieved from https://arxiv.org/pdf/1512.04906.pdf

Christensson, P. (2006). P2P Definition. Retrieved September 16, 2018, from https://techterms.com

Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning (ICML-08), 160–167. http://dx.doi.org/10.1145/1390156.1390177

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, 2493–2537. Retrieved from http://www.jmlr.org/papers/volume12/collobert11a/collobert11a.pdf

Dale, R., Johnson, J., & Tang, L. (2005). Financial markets can go mad: evidence of irrational behaviour during the South Sea bubble. Economic History Review, 58, 233–271. http://dx.doi.org/10.1111/j.1468-0289.2005.00304.x

Davies, M. (2013). Corpus of News on the Web (NOW): 3+ billion words from 20 countries, updated every day. Retrieved from https://corpus.byu.edu/now/

DeLong, J., & Magin, K. (2006). A Short Note on the Size of the Dot-Com Bubble. National Bureau of Economic Research Working Papers, Working Paper No. 12011.

Dowd, K. (2014). New Private Monies: A Bit-Part Player? London: Institute for Economic Affairs (IEA).

Dyer, C. (2014). Notes on noise contrastive estimation and negative sampling. ArXiv:1410.8251 [Cs]. Retrieved from https://arxiv.org/pdf/1410.8251.pdf

Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2001). Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, 406–414. Retrieved from http://gabrilovich.com/papers/context_search.pdf

Firth, J. R. (1957). Papers in Linguistics. London: Oxford University Press.

Frermann, L., & Lapata, M. (2016). A Bayesian Model of Diachronic Meaning Change. Transactions of the Association for Computational Linguistics, 4, 31–45. Retrieved from https://transacl.org/ojs/index.php/tacl/article/view/796/169

Galbraith, J. K. (1955). The Great Crash. Cambridge: Riverside Press.

Garber, P. M. (1989). Tulipmania. The Journal of Political Economy, 97(3), 535–560.

Goldberg, Y., & Levy, O. (2014). word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method. ArXiv:1402.3722 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1402.3722

Harris, Z. S. (1954). Distributional Structure. Word, 10(2–3), 146–162. https://doi.org/10.1080/00437956.1954.11659520

Hill, F., Reichart, R., & Korhonen, A. (2014). SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. ArXiv:1408.3456 [Cs]. Retrieved from http://arxiv.org/abs/1408.3456

Hilpert, M., & Perek, F. (2015). Meaning change in a petri dish: constructions, semantic vector spaces, and motion charts. Linguistics Vanguard, 1(1), 1–12. https://doi.org/10.1515/lingvan-2015-0013

Huang, A. (2008). Similarity Measures for Text Document Clustering. In Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), 49–56. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.4480&rep=rep1&type=pdf

Kromesch, S., & Juhasz, S. (2005). High Dimensional Data Visualization. In Proceedings of the International Symposium of Hungarian Researchers on Computational Intelligence, 230–237.

Kulkarni, V., Al-Rfou, R., Perozzi, B., & Skiena, S. (2015). Statistically Significant Detection of Linguistic Change. In Proceedings of the 24th International Conference on World Wide Web, 625–635. https://doi.org/10.1145/2736277.2741627

Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211–240. https://doi.org/10.1037/0033-295X.104.2.211

Lenci, A. (2008). Distributional Semantics in Linguistic and Cognitive Research. Italian Journal of Linguistics, 20, 1–31.

Levy, O., & Goldberg, Y. (2014). Dependency-Based Word Embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 302–308. http://dx.doi.org/10.3115/v1/P14-2050

Metry, M. (2017). Blockchain Technology is the Most Significant Invention since the Internet and Electricity. Retrieved July 28, 2018, from https://medium.com/@markymetry/blockchain-technology-is-the-most-significant-invention-since-the-internet-and-electricity-f2d44a631ef6

Michel, J. B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., The Google Books Team, … Aiden, E. L. (2011). Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331(6014), 176–182. https://doi.org/10.1126/science.1199644

Miikkulainen, R., & Dyer, M. G. (1991). Natural Language Processing With Modular PDP Networks and Distributed Lexicon. Cognitive Science, 15(3), 343–399.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013a). Efficient Estimation of Word Representations in Vector Space. In Proceedings of ICLR Workshops Track, 1–12. Retrieved from https://arxiv.org/pdf/1301.3781.pdf

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013b). Distributed Representations of Words and Phrases and their Compositionality. NIPS, 1–9. Retrieved from https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

Miller, G. A., & Charles, W. G. (1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1), 1–28.

Mitchell, J., & Lapata, M. (2010). Composition in distributional models of semantics. Cognitive Science, 34(8), 1388–1429. http://dx.doi.org/10.1111/j.1551-6709.2010.01106.x

Naili, M., Chaibi, A. H., & Ben Ghezala, H. H. (2017). Comparative Study of Word Embedding Methods in Topic Segmentation. Procedia Computer Science, 112, 340–349. http://dx.doi.org/10.1016/j.procs.2017.08.009

Nakamoto, S. (2008). Bitcoin: A Peer-to-Peer Electronic Cash System. Retrieved July 30, 2018, from https://bitcoin.org/bitcoin.pdf

Pado, S., & Lapata, M. (2003). Constructing semantic space models from parsed corpora. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, 128–135.

Rehurek, R., & Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45–50.

Sahlgren, M. (2006). The word-space model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Stockholm: Department of Linguistics, Stockholm University. Retrieved from http://eprints.sics.se/437/1/TheWordSpaceModel.pdf

Schütze, H. (1992). Dimensions of meaning. In Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, 787–796. http://dx.doi.org/10.1109/SUPERC.1992.236684

Schütze, H. (2008). Introduction to information retrieval. In Proceedings of the international communication of association for computing machinery conference, 787–796.

Schütze, H., & Pedersen, J. (1995). Information retrieval based on word senses. In Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval, 161–175.

Smilkov, D., Thorat, N., Nicholson, C., Reif, E., Viégas, F. B., & Wattenberg, M. (2016). Embedding Projector: Interactive Visualization and Interpretation of Embeddings. ArXiv:1611.05469 [Cs, Stat], 1–4. Retrieved from https://arxiv.org/pdf/1611.05469.pdf

Theus, M., & Urbanek, S. (2008). Interactive Graphics for Data Analysis: Principles and Examples. London: Chapman & Hall/CRC.

Tufte, E. R. (2001). The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press.

Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605. Retrieved from http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Yang, L. (2011). Distance-preserving dimensionality reduction. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(5), 369–380. https://doi.org/10.1002/widm.39

Yao, Z., Sun, Y., Ding, W., Rao, N., & Xiong, H. (2018). Dynamic word embeddings for evolving semantic discovery. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 673–681. http://doi.acm.org/10.1145/3159652.3159703

Young, F. W., Valero-Mora, P. M., & Friendly, M. (2006). Visual Statistics: Seeing Data With Dynamic Interactive Graphics. Hoboken, NJ: John Wiley & Sons.

Young, J. (2018). Bitcoin Price Drop From $20,000 Likely Due to Market Manipulation: Traders. Retrieved July 28, 2018, from https://www.ccn.com/bitcoin-price-drop-from-20000-likely-due-to-market-manipulation-traders/