DOES LANGUAGE INFLUENCE THOUGHT? INVESTIGATING LINGUISTIC RELATIVITY WITH TOPIC MODELS

BAS RADSTAKE

Registration Number: 984292 Student Number: u1251371

Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Communication and Information Sciences, Master Track Data Science: Business and Governance, at the School of Humanities of Tilburg University

Thesis Supervisor: Prof. Dr. Max Louwerse

Tilburg University School of Humanities Tilburg, The Netherlands July 2017

Contents

Abstract
1 Linguistic Relativity
2 Topic Models
2.1 Multinomial Distribution
2.2 Dirichlet Distribution
2.3 Latent Dirichlet Allocation
3 United Nations Parallel Corpus
3.1 Languages
3.1.1 English, French, and Spanish
3.1.2 Russian
3.1.3 Arabic
3.1.4 Mandarin Chinese
4 Topic Comparison
5 Hypothesis
6 Study 1
6.1 Data Description
6.2 Pre-processing
6.3 Latent Dirichlet Allocation
6.4 Topic Comparison
6.5 Results
7 Study 2
7.1 Results
8 Study 3
8.1 Results
9 Study 4
9.1 Results
10 Study Comparison
11 Discussion
References
Appendix A

Abstract

Linguistic relativity is the name of the theory that suggests that language influences thought. Attempts at proving linguistic relativity have been made in the realms of linguistics and psychology, but a computationally motivated effort has not been made yet.

In this study, an attempt to identify linguistic relativity has been made by comparing the meaning of parallel texts. Linguistic relativity dictates that texts in different languages can never yield exactly the same meaning. The United Nations Parallel Corpus was used, as it contains high quality parallel translations. This corpus is available in the English, French, Spanish, Mandarin Chinese, Russian, and Arabic languages. Meaning was estimated using the Latent Dirichlet Allocation topic model. This model extracts the latent semantic structure of a text.

When comparing the latent semantic structures of parallel texts, it was found that the Spanish, French, and English languages showed considerable similarities. This suggests that similar languages yield similar meanings, which is remarkable given that the texts were all professional translations of each other. The results of the current study point towards the existence of linguistic relativity; however, future research should provide more insight into how exactly language and thought are connected.

1 Linguistic Relativity

The view that language influences a person’s world view has long generated a great deal of interest. This idea that language influences thought is called linguistic relativity (Lucy, 1997), also called the Sapir-Whorf hypothesis, after two researchers whose work forms the basis for the theory (Whorf, 1956). Both Sapir and Whorf studied Native American languages, and Whorf’s most convincing argument for linguistic relativity came from studying the Native American Hopi language. As the Hopi language lacked an understanding of time as an object or substance that can be divided, he argued that the world view of Hopi speakers was thus different from the world view of speakers of other languages. The amount of interest the theory generated is hardly surprising; if the hypothesis is true, it would have consequences for a large number of fields, such as linguistics, psychology, anthropology, and even public policy. It would mean that languages are not comparable to each other on their own, as it would be necessary to consider the speaker’s cognition as well. The validity of research based on language comparison would be challenged, existing models would have to be reevaluated, and policies would have to be revised. However, empirical research has thus far failed to give conclusive evidence for the existence of linguistic relativity (Wolff & Holmes, 2011).

Linguistic relativity is not just one hypothesis, and it can be construed in many ways. For example, an extreme form of linguistic relativity is linguistic determinism, which states that language determines thought. It implies that people who speak different languages have different thought processes. This hypothesis has been generally rejected by the scientific community (Wolff & Holmes, 2011). However, several other, less extreme hypotheses of possible effects of language on thought have been identified. For example, it was found that language influences our way of representing exact numbers. The people of the Pirahã tribe have no method of expressing exact numbers in their language, which influenced the way they memorized quantities (Frank, Everett, Fedorenko, & Gibson, 2008). Because there were no words for exact quantities, Pirahã speakers had trouble performing complex memory tasks involving exact numbers, whereas English speakers did not. This does not necessarily mean that language determines thought, but it suggests that language can have an effect on cognition. Even though the literature is not consistent in this regard, it seems there is at least some connection between thought and language, although efforts to empirically prove this have thus far been unsuccessful (Wolff & Holmes, 2011).

Research in linguistic relativity has been a multidisciplinary effort in linguistics, psychology, and anthropology. However, so far this research has been on a relatively small scale, comparing speakers of different languages on an individual basis (Frank et al., 2008). In order to increase the scale at which linguistic relativity is investigated, it is necessary to look beyond traditional field boundaries. One possible way of analysing languages at a large scale is by using computational techniques. With the relatively recent rise in computer performance and the reduction of data storage costs, fields such as computational linguistics have gained in popularity (Manning, 2016). With increasing computational power, research in computational linguistics has become more accessible and more powerful, as ever larger amounts of data can be processed. The rise of computational linguistics has given rise to a more data-driven approach to linguistics (Manning, 2016).

One of the ways in which cross-linguistic hypotheses can be tested with compu- tational linguistics is by analysing a parallel corpus (Egbert, Staples, & Biber, 2015). A parallel corpus is a collection of documents and their translations into a different language. Parallel corpora are interesting because of the alignment of original and translated text. This allows them to be used for cross-linguistic research and the training of machine learning algorithms that aid in machine translation. The construction of parallel corpora is a laborious and expensive process, which means studies are relatively scarce (Egbert et al., 2015).

It is assumed that texts in parallel corpora convey the exact same meaning. This seems intuitive, because these texts are direct translations of each other. However, at the core of linguistic relativity lies the assumption that texts in different languages can never mean exactly the same, as their meanings are influenced by the language the texts are in. By computationally estimating the meaning of parallel texts, one can investigate whether this is truly the case. The meaning of a text can be computationally estimated using topic models. Topic modeling is a way of computationally analysing a large amount of unclassified text (Alghamdi & Alfalqi, 2015), such as a corpus. Its main use is to discover underlying patterns of words within documents, also called topics. These topics represent hidden semantic structures of a text, which can be considered to be its meaning.

The aim of the current research is to gain insight into how the hidden semantic structures of parallel translated texts relate to each other. The same texts in different languages should yield the exact same topics. If this is not the case, the current study can provide a first step towards identifying linguistic relativity in a computational way.

2 Topic Models

The field of topic modeling originated with analyses such as Latent Semantic Analysis (Deerwester, 1988) and later Probabilistic Latent Semantic Analysis (Hofmann, 1999). Latent Dirichlet Allocation (Blei, Ng, & Jordan, 2003) is the algorithm that is the basis for a large portion of current day topic models. In the current research, Latent Dirichlet Allocation (LDA) will be used to generate topics from a parallel text.

Latent Dirichlet Allocation is an unsupervised technique that is used to discover underlying topics in documents (Blei et al., 2003). LDA represents each document as a mixture of topics that contain words with certain probabilities. While the topics themselves are shared across the corpus, the mixture of topics is different for each document. Say, for example, LDA was used to generate topics over all documents in a corpus about psychology. Some documents may draw on topics from health psychology and psychopathology, while other documents might draw on topics from work and organizational psychology. The challenge of learning topics from collections of documents is that the topics are not known in advance.

After running the LDA model, topics can be displayed by returning the words that have the highest probability to belong to that topic. If a hypothetical LDA were to be performed on the previously mentioned corpus with topics about psy- chology, topics could look like Table 1, displaying the 3 words with the highest probabilities for 3 topics.

topic 1      topic 2    topic 3
research     health     stress
paper        symptom    satisfaction
university   therapy    salary

Table 1: Results of a hypothetical LDA.

As can be seen in Table 1, the LDA algorithm extracts meaning by assigning words to a topic. If the analysis is successful, words that share a similar meaning are assigned to the same topic. In this case, words that seem to relate to research are grouped together, as well as words that relate to perhaps health, and work. Note that the algorithm does not assign meaning to these topics; it merely groups the words together. The model that is trained on the data might then be used to classify documents by their content, for example classifying whether a document is about health psychology or organizational psychology.

At the very core of LDA lies the Dirichlet distribution, which is fundamental to understanding how LDA works. The Dirichlet distribution has been widely used for modeling the distribution of words in text documents, and it has been shown to be more appropriate than a multinomial representation (Madsen, Kauchak, & Elkan, 2005). This is because a Dirichlet distribution handles the phenomenon of 'burstiness' better: the fact that when a word shows up in a document, it is likely to show up again (Doyle & Elkan, 2009). Both the multinomial and the Dirichlet distributions will now be described, as they both play a role in how LDA works.

2.1 Multinomial Distribution

The multinomial distribution is a multidimensional generalization of the binomial distribution (Forbes, Evans, Hastings, & Peacock, 2011). It considers a number of independent trials, each with k possible outcomes. For the binomial distribution, k is equal to 2, whereas for the multinomial distribution, k is larger than 2. In each trial, outcome $A_i$ occurs with probability $p_i$. The multinomial distribution refers specifically to a set of trials that are independent of each other.

The probability function for the multinomial distribution is the probability that each event $A_i$ occurs $x_i$ times, with $i = 1, \ldots, k$, in $n$ trials, and is given by:

\[
f(x_1, \ldots, x_k) = n! \prod_{i=1}^{k} \frac{p_i^{x_i}}{x_i!} \tag{1}
\]

In a more intuitive way, the multinomial distribution can be visualised as follows: consider an experiment in which a die is rolled 10 times. A binomial distribution would be used to find the probability of rolling a 2, as in this case a roll of 2 is called a success and a roll of 1, 3, 4, 5, or 6 would be considered a failure. A multinomial distribution would be used to find the probability of rolling exactly four 1s and two 2s, as here the number of possible outcomes increases: there are 10 independent trials with 6 possible outcomes.
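The die example can be sketched in a few lines of Python following Equation 1; the `multinomial_pmf` helper is purely illustrative and not part of the study's pipeline.

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """Probability of observing exactly these outcome counts in sum(counts) trials."""
    n = sum(counts)
    prob = factorial(n)
    for x, p in zip(counts, probs):
        prob *= p ** x / factorial(x)
    return prob

# Ten rolls of a fair die: four 1s, two 2s, and one each of 3, 4, 5, and 6.
counts = [4, 2, 1, 1, 1, 1]
probs = [1 / 6] * 6
p = multinomial_pmf(counts, probs)
```

With a single pair of outcomes the same function reduces to the binomial pmf, illustrating that the multinomial is its multidimensional generalization.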

When the multinomial distribution is used to represent text, the multinomial represents the probability of observing a certain vector of word counts. Because of the way the multinomial distribution models words, it captures the occurrence of common words correctly. However, it fails to correctly model the less common words in a document, which is a problem since the rarer words are more likely to be information-carrying words (Madsen et al., 2005). This is where the Dirichlet distribution comes in.

2.2 Dirichlet Distribution

The Dirichlet distribution is a probability density function over distributions, or 'a distribution of multinomials' (Forbes et al., 2011). The Dirichlet distribution is a continuous distribution, whereas the multinomial distribution is discrete. This means that where the multinomial distribution can only output vectors of integers, the Dirichlet distribution can output vectors of arbitrary real numbers.

The probability function for the Dirichlet distribution can be described as follows. A probability mass function (pmf) is a function that gives the probability that a random variable is exactly equal to some value. The Dirichlet distribution can be thought of as a distribution over pmfs of length k. Madsen, Kauchak and Elkan (2005) define the probability function of the Dirichlet distribution as:

\[
p(\theta; \alpha) = \frac{\Gamma\left(\sum_{w=1}^{W} \alpha_w\right)}{\prod_{w=1}^{W} \Gamma(\alpha_w)} \prod_{w=1}^{W} \theta_w^{\alpha_w - 1} \tag{2}
\]

With θ being a vector in the W-dimensional probability simplex, Γ representing the Gamma function, and α being the vector that defines the parameters of the Dirichlet distribution.

The Dirichlet distribution is a distribution over probability vectors instead of count vectors like the multinomial distribution. Burstiness dictates that when a word shows up, it is likely to show up again (Doyle & Elkan, 2009). This is better modeled by a probability than by a count. The Dirichlet distribution thus models rare words better (Madsen et al., 2005), making it ideal for modeling text documents.
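A minimal sketch of drawing probability vectors from a Dirichlet distribution, using NumPy; the vocabulary size and concentration parameter are toy assumptions, chosen only to show the sparse, 'bursty' vectors described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw word-probability vectors over a 5-word vocabulary from a symmetric Dirichlet.
# A small concentration parameter (alpha < 1) tends to produce sparse vectors in
# which a few words take most of the probability mass.
alpha = np.full(5, 0.1)
theta = rng.dirichlet(alpha, size=3)
```

Each row of `theta` is a valid pmf: non-negative entries that sum to one, which is exactly what Equation 2 defines a density over.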

2.3 Latent Dirichlet Allocation

A limitation of using the original LDA algorithm as described by Blei et al. (2003) is that the entire dataset has to be held in memory. This is not desirable, as the memory required is often more than is available. A solution is to use a slightly altered version of the algorithm as described by Hoffman et al. (2010). Hoffman et al. use an online-learning approach, which means that a model can be updated when more data is added. This allows for the chunking of a dataset into sets of documents, which can be processed one by one. The current section paraphrases Hoffman et al. (2010).

To perform LDA, first the number of topics k has to be specified. Each topic is defined by a multinomial distribution over the vocabulary and is assumed to be drawn from a Dirichlet distribution:

\[
\beta_k \sim \text{Dirichlet}(\eta) \tag{3}
\]

With $\beta_k$ being the conditional probability table of the words to topics. Given these topics, for each document d a vector of topic proportions $\theta_d$ is drawn:

\[
\theta_d \sim \text{Dirichlet}(\alpha) \tag{4}
\]

With α being a vector of symmetric Dirichlet priors of size k. Then, for each word w in the document d a topic index z is drawn from the topic weights.

\[
z_{dw} \sim \text{Multinomial}(\theta_d) \tag{5}
\]

Now, the selected word position has been assigned to a topic according to the topic index, and the word itself is drawn from that topic's distribution over the vocabulary. After repeating this process for an entire corpus of documents C, the posterior distributions of each topic β, topic proportion θ, and topic assignment z reveal the latent structure of the corpus. The process is visualised in Figure 1.
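The generative process of Equations 3-5 can be sketched in NumPy for a single document; the vocabulary, number of topics, document length, and prior values are toy assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

vocab = ["health", "stress", "therapy", "research", "salary"]  # toy vocabulary
k, doc_len = 2, 8        # number of topics, words per document (toy settings)
eta, alpha = 0.1, 0.5    # Dirichlet priors for topics and topic proportions

# Eq. 3: each topic beta_k is a distribution over the vocabulary.
beta = rng.dirichlet(np.full(len(vocab), eta), size=k)

# Eq. 4: the document draws its own vector of topic proportions theta_d.
theta_d = rng.dirichlet(np.full(k, alpha))

# Eq. 5: for each word position, draw a topic index z from the topic
# proportions, then draw the word itself from that topic's distribution.
document = []
for _ in range(doc_len):
    z = rng.choice(k, p=theta_d)
    w = rng.choice(len(vocab), p=beta[z])
    document.append(vocab[w])
```

Inference in LDA works in the opposite direction: given only documents like `document`, it recovers plausible values of β, θ, and z.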

As mentioned earlier, performing an LDA on parallel texts should yield the same topics. However, this can only be assumed if the corpus it is performed on contains high quality translations. It is therefore necessary to use a high quality parallel corpus for the current research.

Figure 1: Visualisation of the LDA process.

3 United Nations Parallel Corpus

In 2016, the United Nations published a parallel corpus containing all of its proceedings translated into the six official working languages: Arabic, English, French, Mandarin Chinese, Russian, and Spanish (Ziemski, Junczys-Dowmunt, & Pouliquen, 2016). The United Nations is one of the world’s largest employers of language professionals (United Nations Language Careers, 2017). It is also considered one of the most prestigious employers for translation professionals, and the credentials required to gain employment at the United Nations are very high. Translators have to pass a competitive exam, after which they are extensively trained before starting on an assignment (Competitive Examinations for Language Professionals, 2017). Therefore, there can be little doubt that the translations they produce are of high quality. This high quality of translation implies that the meanings of translation and original are as equivalent as possible, which makes the United Nations Parallel Corpus ideal for the current research.

The United Nations Parallel Corpus contains the languages Arabic, English, French, Mandarin Chinese, Russian, and Spanish, which limits the current research to these 6 languages.

3.1 Languages

Linguistic relativity dictates that it is not possible for different languages to yield the same meaning. However, it is reasonable to assume that the amount of similarity between meanings can differ depending on which languages are examined. If linguistic relativity is true and topics between languages are examined, it can be expected that a pattern is found, for example that culturally and linguistically

similar languages yield similar topics. It is therefore important to understand which languages are similar to each other. Historical linguistics is the branch of linguistics that researches how languages change over time (Baldi & Dussias, 2012). It describes how modern languages have descended from their ancestral languages. Languages that come from the same ancestor are more similar to each other than unrelated languages (Rowe & Levine, 2015). Additionally, languages that have been in contact with each other for a sustained period of time can develop similarities (Blake, 1996). These findings suggest that when languages share a common ancestor or have had a significant amount of contact, the similarities between them can be larger than between unrelated languages.

In the current research, the languages Arabic, English, French, Mandarin Chinese, Russian, and Spanish are examined. Of these languages, English, French, Russian, and Spanish have descended from the same ancestor, or proto-language, Proto-Indo-European (Beekes, 2011). In Figure 2, the relation between these languages is visualised. Mandarin Chinese and Arabic belong to two other language groups, Sino-Tibetan and Semitic respectively, which means that they developed separately (Rowe & Levine, 2015). It is therefore assumed that the Indo-European languages are more similar to each other than to languages that they are not related to. Within the Indo-European branch, another distinction is made: the Slavic, Germanic, and Romance families (Beekes, 2011). Languages that belong to the same family diverged from Proto-Indo-European together before splitting off and are thus closer to each other than to other languages of the same language group. As for similarities through sustained periods of contact, a large part of the vocabulary in English has its roots in Romance languages, due to the influence of Norman French on the English language during the Norman invasion of England and, later on, the influence of Latin (Blake, 1996).

Between the 6 examined languages, 3 levels of similarity can thus be identified. The first one consists of French and Spanish, as they are both part of the Romance language family. The second level would be between French, Spanish, and English, as English was influenced by Norman French and shares a lot of vocabulary with the Romance languages. The last level would be between French, Spanish, English, and Russian, as these are all Indo-European languages that share common roots and structure.

Figure 2: Language tree for the Indo-European languages.

Depending on the representation of text in a language, using text mining techniques can pose challenges. These will now be described.

3.1.1 English, French, and Spanish

English, French, and Spanish are all written in the Latin script. This has both an advantage and a disadvantage: cleaning texts in these languages is fairly easy, as all possible characters are contained in the Unicode subsets Basic Latin and Latin-1 Supplement. However, it is impossible to differentiate between these languages purely by script, which might pose difficulties when a text has content in multiple languages.

3.1.2 Russian

Russian is written in the Cyrillic script. This script is fully covered by the Unicode subset Cyrillic, and differentiating between Russian texts and texts in other languages does not pose a challenge. However, Russian is well known for its grammatical inflections (Wade, 2010). Nouns are subject to 6 different cases, which change the ending of a word depending on its grammatical function (Wade, 2010). Nouns also agree in gender (masculine, feminine, and neuter) and number (singular and plural). Although some of these features are present in other languages, Russian is by far the most inflected of the 6 available languages. This means that when converting a Russian text to a word vector, the representation is going to be more sparse, as more forms exist for each word.

3.1.3 Arabic

Arabic is a Semitic language that consists of many dialects and forms (Versteegh, 2014). In the United Nations Parallel Corpus, Modern Standard Arabic is written (Ziemski et al., 2016).

Arabic is written from right to left in the Arabic abjad, a writing system in which each symbol stands for a consonant. Vowels are rarely supplied, and the reader is expected to fill them in from context. This can be problematic when using text mining techniques on Arabic texts, as the absence of vowels means more words share the same characters (Al-Harbi, Almuhareb, Al-Thubaity, Khorsheed, & Al-Rajeh, 2008). Resulting topics in Arabic might thus show more ambiguity than those in other languages.

3.1.4 Mandarin Chinese

Mandarin Chinese, often referred to as Chinese, is a Sinitic language from the Sino-Tibetan family (Chen, 1999). Chinese is written using Chinese characters. These characters are logograms that represent meaning. Modern dictionaries list as many as 70,000 characters (Zhen, 2000). However, to attain the highest certificate of proficiency in Mandarin Chinese, only an estimated 5,000 characters need to be learned (HSK Chinese Proficiency Test, 2017). As the CJK Unified Ideographs Unicode subset encompasses the 6,763 most frequent characters, it can be assumed that only the rarest characters would be left out when cleaning Chinese texts using this Unicode set. This is not a problem if one is only looking at the most frequent topics, but it might influence results when looking at models with a large number of topics.

What can be a problem, however, is the fact that Mandarin Chinese does not use spaces to denote word boundaries (Chen, 1999). This makes it impossible to tokenize words by splitting on whitespace, as words can consist of either one or multiple characters. Using the Jieba package, the characters can be segmented into words (Sun, 2012). This does, however, increase the time needed to process documents.

4 Topic Comparison

Once the topics are generated by the model, a comparison procedure is needed to measure the similarity between these topics. This is not standard for Latent Dirichlet Allocation, as typically the model is used in a practical setting after training, for example to classify new documents on their contents.

Resulting topics first have to be translated to a common language in order to compare them, for example using machine translation. This should not pose a

threat to the validity of the research, as it only alters the representation of the output, not its content. Then, a similarity measure has to be defined to quantify the connection between the topic vectors. For the current research, cosine similarity will be used, as it is a commonly used similarity measure for clustering documents (Muflikhah & Baharudin, 2009).

The cosine similarity between two documents is equal to the cosine of the angle between their two vectors (Huang, 2008). If documents are represented as term vectors, the cosine similarity is equal to the correlation between the vectors. The cosine similarity can give an indication of similarity between any two vectors. Cosine similarity between two vectors $\vec{A}$ and $\vec{B}$ can be defined as:

\[
\text{similarity}(\vec{A}, \vec{B}) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} \tag{6}
\]
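Equation 6 can be implemented directly; the two count vectors below are hypothetical stand-ins for translated topic-word vectors over a shared dictionary.

```python
import math

def cosine_similarity(a, b):
    """Dot product of the vectors divided by the product of their norms (Eq. 6)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical word-count vectors for two translated topics.
topic_en = [2, 1, 0, 1]
topic_fr = [1, 1, 0, 2]
sim = cosine_similarity(topic_en, topic_fr)
```

Identical vectors yield a similarity of 1, orthogonal vectors a similarity of 0, so values near 1 indicate strongly overlapping topic vocabularies.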

Cosine similarity on its own can quantify the similarity between each pair of languages. To visualise the connection between groups of languages, a clustering approach can be taken. In the current research, agglomerative hierarchical clustering is used to provide more insight into the similarity between different groups of languages.

Hierarchical clustering is a way of grouping similar vectors based on a distance measure (Steinbach, Karypis, Kumar, et al., 2000). The type of hierarchical clustering used in the current research is called agglomerative clustering. Agglomerative clustering starts with each vector as its own cluster. It then merges the clusters that are most similar to each other based on the distance measure. In the current research, cosine distance will be used, which is a distance measure based on the cosine similarity. Cosine distance is equal to:

\[
\text{distance} = 1 - \text{similarity} \tag{7}
\]

Once two clusters are merged, the similarity matrix is updated to reflect the similarity between the new cluster and the original clusters. The process is then executed again to compute distances between the new cluster and the remaining vectors. This way of clustering is done in a 'greedy' manner. This means that the algorithm does not look for an optimal number of clusters, but rather keeps making merge decisions based on the distance until a single cluster is formed. Interpreting the results is therefore done by analysing the order in which the vectors get clustered together before a single cluster is formed. By grouping the most similar languages and comparing them to the others, a hierarchy of similarity can be identified.
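The procedure described above can be sketched with SciPy's hierarchical clustering routines; the six count vectors are toy stand-ins for the languages' translated topic words, not the study's actual data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Toy count vectors standing in for the six languages' translated topic words.
labels = ["AR", "EN", "ES", "FR", "RU", "ZH"]
vectors = np.array([
    [3, 0, 1, 0],
    [1, 2, 0, 1],
    [1, 2, 0, 2],
    [0, 2, 1, 2],
    [2, 1, 1, 0],
    [3, 1, 0, 1],
], dtype=float)

# Pairwise cosine distances (1 - cosine similarity), then greedy agglomerative
# merging; each row of Z records one merge (cluster ids, distance, size).
distances = pdist(vectors, metric="cosine")
Z = linkage(distances, method="complete")
```

The linkage matrix `Z` is exactly what a dendrogram plot visualises: reading its rows top to bottom gives the order in which the language vectors merge.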

5 Hypothesis

Linguistic relativity has not yet been investigated using computational linguistic techniques. The aim of the current research is to try to identify linguistic relativity by comparing the meaning of parallel texts. Latent Dirichlet Allocation is a topic modeling technique that extracts latent semantic structures from texts. One would assume that parallel texts yield the same topics, but according to linguistic relativity this cannot be true. If the topics, and thus the meanings, of parallel texts are different, it would provide a first step in identifying linguistic relativity computationally.

6 Study 1

Study 1 describes the process of using Latent Dirichlet Allocation to generate topics for the United Nations Parallel Corpus, and comparing them afterwards.

6.1 Data Description

The United Nations Parallel Corpus (Ziemski et al., 2016) was retrieved from https://conferences.unite.un.org/UNCorpus/en/DownloadOverview in the form of compressed collections of XML files. The 11 compressed files were each 1 GB or less in size. Decompressing the files resulted in about 41 GB of data stored in 799.276 XML files.

6.2 Pre-processing

First, documents that were not translated into all six languages were filtered out. This resulted in a selection of 86.307 XML files for each language, or 517.842 files in total. Next, each XML file was parsed by selecting only the content of the body tags. For each language, a subset of Unicode can be defined that encompasses all the characters present in the language. These subsets are:

English, French, and Spanish: Basic Latin, Latin-1 Supplement

Arabic: Arabic

Mandarin Chinese: CJK Unified Ideographs

Russian: Cyrillic

For each language, all files in that language were filtered on these subsets. For the languages English, French, Spanish, and Russian, all letters were changed to lower case. Next, all remaining punctuation, numbers, newlines, and trailing spaces were removed.

For all languages except Mandarin Chinese, tokenization was done by splitting the text on each space. Tokenization for Mandarin Chinese was done using the Jieba package in Python (Sun, 2012), as word boundaries are not indicated by whitespace.

A frequency dictionary was constructed for each language, and for each language the 50 most frequent words were removed. Removing the 50 most frequent words removes most of the stopwords from the corpora. This approach was taken to keep the process as equal as possible across languages. To reduce the dimensionality of the data, words that did not appear in at least 5 documents were excluded. Using the dictionary, each document was transformed into a list of tuples of word IDs and word frequencies.
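The final bag-of-words representation can be sketched in plain Python; the toy documents below are hypothetical and stand in for the pre-processed, tokenized corpus (stopword and rare-word filtering are omitted for brevity).

```python
from collections import Counter

# Toy pre-processed documents (already tokenized, stopwords removed).
docs = [
    ["rights", "convention", "women", "rights"],
    ["convention", "children", "peace"],
]

# Build a word -> ID dictionary over the whole corpus...
vocab = {word: i for i, word in enumerate(sorted({w for d in docs for w in d}))}

# ...and represent each document as a list of (word ID, frequency) tuples.
bow = [sorted((vocab[w], c) for w, c in Counter(d).items()) for d in docs]
```

This sparse tuple format is the same shape of input that Gensim's LDA implementation consumes.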

6.3 Latent Dirichlet Allocation

Latent Dirichlet Allocation was performed once for each language, on all documents. LDA analyses were executed using the Gensim package for Python (Řehůřek & Sojka, 2010). The Gensim package follows the approach taken in the paper by Hoffman et al. (2010). The number of topics was set to 10, and the number of passes was set to 20. For each language, the top 10 words for each topic were displayed, resulting in 100 words in total.

6.4 Topic Comparison

Topics for each non-English language were translated to English using Google Translate (Google Translate, 2017). Topics were supplied in a comma-separated format with newlines in between, so the Google Translate algorithm would not mistake word combinations for a phrase. As Google Translate adds capitalization and articles back in, the translated text was made lower case and articles and non-alphabetical characters were removed.

A dictionary was constructed containing all the words in all topics. This dictionary was used to convert topic vectors into vectors that reflect word counts.

Similarity between vectors was calculated using cosine similarity. A hierarchical clustering algorithm was performed using the cosine distance and complete linkage. The resulting dendrogram visualizes the clusters that exist within the languages.

6.5 Results

The cosine similarity matrix is displayed in Table 2. Cosine values had a mean of 0.192 and a standard deviation of 0.058. The dendrogram displaying the clusters extracted from the data is displayed in Figure 3. Table 3 displays a selection of the most probable words for 3 example topics from the LDA analysis.

     AR      EN      ES      FR      RU      ZH
AR   -       0.182   0.113   0.095   0.195   0.246
EN   0.182   -       0.262   0.206   0.184   0.238
ES   0.113   0.262   -       0.308   0.175   0.168
FR   0.095   0.206   0.308   -       0.140   0.141
RU   0.195   0.184   0.175   0.140   -       0.223
ZH   0.246   0.238   0.168   0.141   0.223   -

Table 2: Cosine similarity measure between topics for study 1. AR = Arabic, EN = English, ES = Spanish, FR = French, RU = Russian, ZH = Mandarin Chinese.

The similarity matrix in Table 2 shows a relatively high amount of similarity between Spanish and French (sim = 0.308). Other notable similarities exist between English and Spanish (0.262) and between Mandarin Chinese and Arabic (0.246). The dendrogram in Figure 3 shows that the first merge between clusters occurred between French and Spanish. Later on, English was added to this cluster. Chinese, Arabic, and Russian ended up in the other cluster, with Chinese and Arabic being clustered together first. The clustering together of, and the similarities between, French, Spanish, and English indicate that similar languages have similar meanings. However, the comparably large similarity between Arabic and Chinese does not follow this rationale.

Figure 3: Dendrogram for study 1.

AR  rights, subject, law, woman, development
EN  women, law, convention, conference, children
ES  women, article, convention, children, peace
FR  women, man, convention, people, children
RU  person, women, rights, children, persons
ZH  convention, children, rights, women, law

AR  -
EN  unodc, unops, ipsas, mdgs, umoja
ES  court, ipsas, accused, unmit, minustah
FR  unamid, unmik, usd, idb, amisom
RU  -
ZH  -

AR  israel, palestinian, territories, gaza, syrian
EN  palestinian, israeli, protocol, israel, arab
ES  -
FR  -
RU  israel, lebanon, pact, weapons, israeli
ZH  israel, lebanon, bromine, airspace, methyl

Table 3: Example topics for study 1.

In Table 3, some example topics were selected to compare the content of the topics. As can be seen, each language has a topic whose words seem to involve human rights. In all languages it contains the words 'women' and 'children', paired with 'rights' and 'convention'. This was the only topical structure present in all 6 languages, which could explain the overall low similarity: the human-rights topic is the main source of shared words. Also, all languages written in the Latin script had a topic consisting of abbreviations such as 'unodc' and 'ipsas'; this topic was not found in the other languages. These abbreviations presumably stand for the United Nations Office on Drugs and Crime and the International Public Sector Accounting Standards, indicating that this topic contains just abbreviations. It is notable that these abbreviations only appear in the languages written in the Latin script; abbreviations might have been removed from the other languages during pre-processing. Some languages had a topic containing words that seem to relate to Israel and other countries in the Middle East. This is remarkable, as it seems to be a cohesive topic, yet it is not present in all 6 languages.

Study 1 shows surprisingly low similarity between the languages. Still, some patterns can be identified: languages with a similar history and culture show similarities in meaning. Before drawing conclusions based on these results, it is important to consider model fit. If the number of topics specified is too low, topics with different meanings might be merged into a single topic that is not cohesive. As the topics resulting from Study 1 contained barely any cohesive topics, the number of topics may have been too low for the amount of data. It is therefore necessary to change the parameters and the data of the model to try to find a better fit.

7 Study 2

Study 1 showed seemingly low similarity between the languages. It also produced non-coherent topics, which suggests that the amount of data used to generate only 10 topics was too large. By using less data, it might be possible to improve upon Study 1 while still maintaining only 10 topics. To investigate whether the results of Study 1 were caused by a bad model fit, Study 2 aims to replicate Study 1 with a smaller amount of data.

Because reducing the amount of data might give a better fit for a model with 10 topics, Study 2 replicated Study 1 with a smaller subset of the data. Latent Dirichlet Allocation was performed once for each language, on a subset containing all documents from the year 2014, as this is the most recent year in the United Nations Parallel Corpus. For each language, the top 10 words of each topic were displayed, resulting in 100 words in total.
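The "top 10 words per topic" step can be sketched as follows. The topic-word probabilities below are invented for illustration (two tiny topics of two words each); the real runs used 10 topics per language, giving 10 × 10 = 100 words.

```python
# Toy topic-word distributions over a small vocabulary.
topics = [
    {"peace": 0.30, "mission": 0.25, "budget": 0.10, "treaty": 0.05},
    {"nuclear": 0.40, "weapons": 0.35, "treaty": 0.15, "peace": 0.02},
]

def top_words(topic, n):
    """Return the n most probable words of a single topic."""
    return [w for w, _ in sorted(topic.items(), key=lambda kv: -kv[1])[:n]]

# Concatenate the top-n words of every topic into one word list per language.
all_words = [w for t in topics for w in top_words(t, 2)]
print(all_words)  # → ['peace', 'mission', 'nuclear', 'weapons']
```

These per-language word lists are what the cosine comparison operates on.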

7.1 Results

The cosine similarity matrix is displayed in Table 4. Cosine values had a mean of 0.370 and a standard deviation of 0.081. The dendrogram displaying the clusters extracted from the data is displayed in Figure 4. Table 5 displays a selection of the most probable words for 2 example topics from the LDA analysis.

      AR     EN     ES     FR     RU     ZH
AR      -  0.326  0.308  0.296  0.378  0.268
EN  0.326      -  0.412  0.452  0.330  0.400
ES  0.308  0.412      -  0.592  0.375  0.302
FR  0.296  0.452  0.592      -  0.426  0.367
RU  0.378  0.330  0.375  0.426      -  0.325
ZH  0.268  0.400  0.302  0.367  0.325      -

Table 4: Cosine similarity measure between topics for study 2. AR = Arabic, EN = English, ES = Spanish, FR = French, RU = Russian, ZH = Mandarin Chinese.

Figure 4: Dendrogram for study 2.

For the second study (mean = 0.370), the cosine similarities displayed in Table 4 are higher than in Study 1 (0.192). This suggests that the topic model has a better fit in general than in Study 1. Notable results are the similarities between French and Spanish (0.592), English and French (0.452), Russian and French (0.426), and English and Spanish (0.412). These are also the four languages of the Indo-European language family, again showing the same pattern of similar meanings between similar languages.

The dendrogram in Figure 4 shows that French, Spanish, and English have been clustered together, which was to be expected given the similarity measures. However, what was not expected is that Russian was clustered together with Arabic, and not with the other languages of the Indo-European family.

AR  nuclear, weapons, weapon, treaty, korea
EN  weapons, disarmament, nuclear, draft, treaty
ES  nuclear, treaty, disarmament, nuclear, weapons
FR  nuclear, weapons, disarmament, treaty, nuclear
RU  weapons, nuclear, contract, territory, peace
ZH  weapons, peace, people, treaty, sudan

AR  peace, support, resources, administration, budget
EN  mission, peacekeeping, missions, operations, peace
ES  peace, missions, operations, peace, budget
FR  peace, conflict, mission, humanitarian, republic
RU  support, mission, positions, peace, participant
ZH  peace, troops, missions, stable, council, personnel

Table 5: Example topics for study 2.

As can be seen in Table 5, compared to Study 1 there are now more topics that the models have in common. At least two topics share a large number of words: the first contains words relating to nuclear weapons, such as 'nuclear', 'weapons', and 'treaty'; the second relates to peace missions, with words such as 'peace', 'troops', and 'support'. The higher cosine values, combined with the observed higher topic coherence, suggest that Study 2 provides a better model fit than Study 1. As a good model fit has now been established for a small subset of the data, this implies that modelling the complete dataset requires a higher number of topics. Studies 3 and 4 both describe models over the complete dataset, but with a higher number of topics.

8 Study 3

Study 2 showed that a smaller dataset gives a better fit for a model with only 10 topics. For a model that encompasses all data, more topics would have to be specified in order to maintain good model fit.

Study 3 replicated Study 1, keeping all parameters the same except for the number of topics, which was set to 50. The whole dataset was used. For each language, the top 10 words of each topic were displayed, resulting in 500 words in total.

8.1 Results

The cosine similarity matrix is displayed in Table 6. Cosine values had a mean of 0.135 and a standard deviation of 0.046. The dendrogram displaying the clusters extracted from the data is displayed in Figure 5.

      AR     EN     ES     FR     RU     ZH
AR      -  0.106  0.136  0.073  0.169  0.153
EN  0.106      -  0.242  0.143  0.168  0.101
ES  0.136  0.242      -  0.173  0.136  0.159
FR  0.073  0.143  0.173      -  0.078  0.075
RU  0.169  0.168  0.136  0.078      -  0.117
ZH  0.153  0.101  0.159  0.075  0.117      -

Table 6: Cosine similarity measure between topics for study 3. AR = Arabic, EN = English, ES = Spanish, FR = French, RU = Russian, ZH = Mandarin Chinese.

Similarities between languages for Study 3 (mean = 0.135) are lower than in Study 1 (0.192) and Study 2 (0.370). Notable results include the similarities between Spanish and English (0.242), French and Spanish (0.173), and Russian and Arabic (0.169). The clustering output in Figure 5 again groups English, Spanish, and French together, consistent with Studies 1 and 2. What was not expected is the clustering of Arabic and Russian.

Figure 5: Dendrogram for study 3.

9 Study 4

As Study 2 showed that more topics would have to be specified for the model to fit well, Study 3 increased the number of topics to 50. However, the resulting cosine similarities were lower than in both Study 1 and Study 2. Study 4 was performed to test whether increasing the number of topics even further would improve model fit. The final study therefore increased the number of topics to 100.

Study 4 replicated Study 1, keeping all parameters the same except for the number of topics, which was set to 100. The whole dataset was used. For each language, the top 10 words of each topic were displayed, resulting in 1000 words in total.

9.1 Results

The cosine similarity matrix is displayed in Table 7. Cosine values had a mean of 0.104 and a standard deviation of 0.080. The dendrogram displaying the clusters extracted from the data is displayed in Figure 6.

      AR     EN     ES     FR     RU     ZH
AR      -  0.038  0.042  0.028  0.076  0.058
EN  0.038      -  0.280  0.178  0.048  0.114
ES  0.042  0.280      -  0.197  0.041  0.218
FR  0.028  0.178  0.197      -  0.061  0.142
RU  0.076  0.048  0.041  0.061      -  0.045
ZH  0.058  0.114  0.281  0.142  0.045      -

Table 7: Cosine similarity measure between topics for study 4. AR = Arabic, EN = English, ES = Spanish, FR = French, RU = Russian, ZH = Mandarin Chinese.

Similarities for Study 4 (mean = 0.104) were again lower than those in Study 3 (0.135). This suggests that model fit did not improve for the fourth and final study. Still, a familiar pattern emerges: the similarities between English and Spanish (0.280), French and Spanish (0.197), and English and French (0.178) follow the same pattern as Studies 1, 2, and 3. In the dendrogram in Figure 6, English, Spanish, and French were the first to be clustered. This means that even though the model did not fit properly, a pattern can still be found, which suggests that similar languages do carry similar meanings.

10 Study Comparison

An overview of the means and standard deviations for all 4 studies is given in Table 8. As Study 1 and Study 2 displayed the highest similarities, the results of those studies are regarded as the most reliable. In all four studies, a high similarity is found between French, Spanish, and English.

Study   Mean    St. Dev.
1       0.192   0.058
2       0.370   0.081
3       0.135   0.046
4       0.104   0.080

Table 8: Overview of Means and Standard Deviations for Studies 1, 2, 3, and 4.

Figure 6: Dendrogram for study 4.

11 Discussion

The aim of the current research was to identify the existence of linguistic relativity by examining the hidden semantic structures of parallel texts. If linguistic relativity exists, it is not possible for texts in different languages to carry identical meaning. In all four studies, the similarity values found were low. Moreover, in all four studies the Spanish, French, and English languages showed considerable similarities, and they were clustered together every time. These findings suggest that parallel texts do not in fact yield the same meaning, and that similar languages yield similar meanings. From these findings it is possible to assume that there is a connection between language and meaning, pointing towards the existence of linguistic relativity.

Of course, the current research cannot provide insight into the way in which linguistic relativity operates. However, it shows that the outright dismissal of linguistic relativity is not correct. Research that compares multiple languages should recognize the existence of linguistic relativity. People who interact with speakers of other languages, such as diplomats, should be aware that the information they provide could be interpreted differently even if the translation is excellent. Public policy should be shaped in such a way that, in multilingual contexts, the language people speak is acknowledged. Although the current research strongly suggests that linguistic relativity exists, several potential limitations can be identified.

Firstly, all of the languages that consistently appear together are written in the Latin script. Because the first step of cleaning the texts was selecting only the relevant Unicode characters, there could have been some interference between these languages. For example, French quotations could have been present in the English corpus, eventually leading to a higher similarity. Such quotations would have been filtered out of the non-Latin languages, as Latin-script characters were removed there. However, these quotations make up only a very small part of the documents. They may have influenced how similar these languages appear, but not in such a way that the current research is invalid.

The way the text was cleaned could pose another limitation. With the intent of keeping as many parameters as possible the same for all languages, 50 stopwords were removed from each language. This approach succeeded in keeping the process as uniform as possible across the languages. However, it may also have introduced some variance. For example, Russian has much more inflection than the other languages, which increases the sparsity of Russian word vectors. This could mean that, through the way stopwords were removed, some languages lost information while others did not have enough stopwords removed. These factors are, however, not problematic for the current research. The alternative would have been to use a very different approach for each language, thus increasing the amount of bias. Additionally, the removal of stopwords is not crucial for training Latent Dirichlet Allocation models, as LDA is not sensitive to stopwords (Alghamdi & Alfalqi, 2015).
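As the appendix code suggests, the stopword removal was frequency-based: the n most frequent words of each language were dropped. A toy sketch (invented corpus, n = 2 instead of the 50 used in the studies):

```python
from collections import Counter

def remove_top_n(documents, n):
    """Drop the n most frequent words across the corpus, a
    language-agnostic stand-in for hand-built stopword lists."""
    counts = Counter(w for doc in documents for w in doc)
    stopwords = {w for w, _ in counts.most_common(n)}
    return [[w for w in doc if w not in stopwords] for doc in documents]

docs = [["the", "rights", "of", "the", "child"],
        ["the", "law", "of", "nations"]]
print(remove_top_n(docs, 2))  # → [['rights', 'child'], ['law', 'nations']]
```

Because the cutoff is purely frequency-based, a highly inflected language spreads its function words over more surface forms, which is exactly the variance concern raised above.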

Another limitation was the use of machine translation in the topic comparison procedure. Topics were translated to English using Google Translate (Google Translate, 2017). Using machine translation means that the quality of the translation depends entirely on the quality of the algorithm, which might vary across language pairs, so some languages may have received better translations than others. This limitation might have influenced the results slightly, but it is not decisive, because the only words translated were those output by the LDA model. The representation of the output might have changed, but the content remains the same.

The convergence of the LDA models could also play a role in the quality of the models. The number of passes over the corpus was set to 20 for all 4 studies. This might have been enough for Studies 1 and 2, but the quality of the models with 50 and 100 topics might improve with more passes. This might also be the reason why Studies 3 and 4 show low similarity between the languages.

Finally, a last limitation is how meaning is measured in the current research. It is assumed that the latent semantic structure of a text equals its meaning. This might not be the case: it is possible that the way a language encodes information by itself results in a different latent semantic structure. Future research would have to determine whether measuring meaning in this way is valid.

Several limitations of the current research have been identified above, and these should be addressed in future research. Primarily, future research should focus on finding concrete evidence that latent semantic structure does in fact measure meaning; if this turns out not to be true, the validity of the current research might be questioned. The current research might also be improved by developing a more thorough topic comparison procedure. At present, machine translation is used to convert all topics to one language, which might introduce a slight bias for some languages. Future research could avoid machine translation by comparing databases of semantic similarities between different languages. Another improvement would be to use language-specific stemmers to strip words of their grammatical markers and return the root, keeping in mind that each language would have to be treated equally so that no additional bias is added. Lastly, it is advised to replicate the current research using topic models with large numbers of topics. The results of Studies 3 and 4 were not convincing, and one possible reason is that those models did not converge. It is thus advised to increase the number of passes the model makes over the data.
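The root-extraction idea can be illustrated with a toy suffix stripper. The suffix rules below are invented for the example; a real study would need a proper, linguistically informed stemmer or lemmatizer per language.

```python
def crude_stem(word):
    """Strip a few common English suffixes; purely illustrative,
    not a linguistically sound stemmer."""
    for suffix in ("ations", "ation", "ings", "ing", "es", "s"):
        # Only strip when enough of a stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print([crude_stem(w) for w in ["nations", "meetings", "rights"]])
# → ['nation', 'meet', 'right']
```

Collapsing inflected forms like this would reduce the sparsity differences between highly and weakly inflected languages noted above.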

The current research showed that parallel texts do not contain the same meaning. In fact, languages that are linguistically and culturally more similar carry more similar meanings. This is a surprising finding, because one would expect professionally translated texts to yield the same meaning. Future research is absolutely necessary to find out the exact link between thought and language, but considering the results it is critical not to dismiss linguistic relativity. The current research provided only a first step in trying to identify linguistic relativity computationally, and it provides a foundation upon which further research can be done.

References

Alghamdi, R., & Alfalqi, K. (2015). A survey of topic modeling in text mining. I. J. ACSA, 6(1), 147–153.
Al-Harbi, S., Almuhareb, A., Al-Thubaity, A., Khorsheed, M., & Al-Rajeh, A. (2008). Automatic Arabic text classification. In Proceedings of the 9th International Conference on the Statistical Analysis of Textual Data (pp. 77–83).
Baldi, P., & Dussias, P. E. (2012). Historical linguistics and cognitive science. International Journal of Linguistics, Philology, and Literature, 3(1), 5–27.
Beekes, R. S. (2011). Comparative Indo-European linguistics: An introduction. Amsterdam: John Benjamins Publishing.
Blake, N. F. (1996). A history of the English language. London: Macmillan.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(1), 993–1022.
Chen, P. (1999). Modern Chinese: History and sociolinguistics. Cambridge: Cambridge University Press.
Competitive examinations for language professionals. (2017, May 15). Retrieved from https://careers.un.org/lbw/home.aspx?viewtype=LE
Deerwester, S. (1988). Improving information retrieval with latent semantic indexing. In Proceedings of the 51st Annual Meeting of the American Society for Information Science (pp. 36–40).
Doyle, G., & Elkan, C. (2009). Accounting for burstiness in topic models. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 281–288).
Egbert, J., Staples, S., & Biber, D. (2015). Corpus research. In The Cambridge Guide to Research in Language Teaching and Learning (pp. 119–120). Cambridge: Cambridge University Press.
Forbes, C., Evans, M., Hastings, N., & Peacock, B. (2011). Statistical distributions. New York: John Wiley & Sons.
Frank, M. C., Everett, D. L., Fedorenko, E., & Gibson, E. (2008). Number as a cognitive technology: Evidence from Pirahã language and cognition. Cognition, 108(3), 819–824.
Google Translate. (2017, May 20). Retrieved from https://translate.google.com/
Hoffman, M., Bach, F. R., & Blei, D. M. (2010). Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 23 (pp. 856–864).
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 50–57).
HSK Chinese proficiency test. (2017, June 20). Retrieved from http://english.hanban.org/node_8002.htm
Huang, A. (2008). Similarity measures for text document clustering. In Proceedings of the 6th New Zealand Computer Science Research Student Conference (pp. 49–56).
Lucy, J. A. (1997). Linguistic relativity. Annual Review of Anthropology, 26(1), 291–312.
Madsen, R. E., Kauchak, D., & Elkan, C. (2005). Modeling word burstiness using the Dirichlet distribution. In Proceedings of the 22nd International Conference on Machine Learning (pp. 545–552).
Manning, C. D. (2016). Computational linguistics and deep learning. Computational Linguistics, 41(4), 701–707.
Muflikhah, L., & Baharudin, B. (2009). Document clustering using concept space and cosine similarity measurement. In International Conference on Computer Technology and Development 2009 (pp. 58–62).
Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (pp. 45–50).
Rowe, B. M., & Levine, D. P. (2015). A concise introduction to linguistics. London: Routledge.
Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clustering techniques. In KDD Workshop on Text Mining (Vol. 400, pp. 525–526).
Sun, J. (2012). Jieba, Chinese word segmentation tool.
United Nations language careers. (2017, May 15). Retrieved from https://languagecareers.un.org/dgacm/Langs.nsf/home.xsp
Versteegh, K. (2014). The Arabic language. Edinburgh: Edinburgh University Press.
Wade, T. (2010). A comprehensive Russian grammar. New York: John Wiley & Sons.
Whorf, B. L. (1956). Language, thought and reality. Cambridge, MA: MIT Press.
Wolff, P., & Holmes, K. J. (2011). Linguistic relativity. Wiley Interdisciplinary Reviews: Cognitive Science, 2(3), 253–265.
Zhen, D. (2000). Xiandai hanyu cidian [Modern Chinese dictionary]. Beijing: Commercial Press.
Ziemski, M., Junczys-Dowmunt, M., & Pouliquen, B. (2016). The United Nations Parallel Corpus v1.0. In Proceedings of the 10th International Conference on Language Resources and Evaluation (pp. 23–28).

Appendix A

Python code used for performing the described analyses.

# Libraries --------------------------------------------------------------------
from gensim import models, corpora
import jieba
import matplotlib.pyplot as plt
import numpy as np
import os
import re
import scipy as sp
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from xml.etree import ElementTree

# Parameters --------------------------------------------------------------------
# Languages and years to process
languages = ["english", "french", "spanish", "chinese", "arabic", "russian"]
years = [str(year) for year in range(1990, 2015)]

# Set the minimum number of documents a word has to be in, and the number of
# top n most frequent words to remove
no_below = 5
remove_n = 50

# Set the number of topics and passes
num_topics = [100]
passes = 20

# Functions ---------------------------------------------------------------------
def add_newlines(file):
    '''
    Adds a newline after every comma for Google Translate.
    '''
    f = open(file, "r", encoding="utf-8")
    doc = f.read()
    doc = re.sub(",", ",\n", doc)
    return doc


def clean_text(doc, language):
    '''
    Takes the unprocessed text of a document and strips it of its
    punctuation. Returns the document cleaned and split on each word.
    '''
    if language in ["english", "french", "spanish"]:
        doc = doc.lower()
        # Basic Latin + Latin-1 Supplement without punctuation and numerals
        doc = re.sub(r"[^\u0041-\u005A\u0061-\u007A\u00C0-\u00FF]", " ", doc)
        doc = re.sub(r"(\\n)", " ", doc)
        doc = re.sub(r"\s+", " ", doc)
        doc = doc.split()
        doc = [word for word in doc if len(word) > 1]
    elif language == "arabic":
        # Arabic subset without numerals and punctuation marks
        doc = re.sub(r"[^\u0621-\u065F]", " ", doc)
        doc = re.sub(r"(\\n)", " ", doc)
        doc = re.sub(r"\s+", " ", doc)
        doc = doc.split()
        doc = [word for word in doc if len(word) > 1]
    elif language == "chinese":
        # CJK Unified Ideographs subset
        doc = re.sub(r"[^\u4E00-\u9FFF]", " ", doc)
        doc = re.sub(r"\s+", " ", doc)
        doc = jieba.lcut(doc)
    elif language == "russian":
        # Cyrillic subset without punctuation
        doc = re.sub(r"[^\u0400-\u047F]", " ", doc)
        doc = doc.lower()
        doc = re.sub(r"(\\n)", " ", doc)
        doc = re.sub(r"\s+", " ", doc)
        doc = doc.split()
        doc = [word for word in doc if len(word) > 1]
    return doc


def cluster_output(folder):
    '''
    Performs the hierarchical cluster analysis.
    '''
    languages = ["english", "french", "spanish", "chinese", "arabic",
                 "russian"]
    cv = CountVectorizer()
    newarray = []
    for i, language in enumerate(languages):
        x = load_output("{}/{}.txt".format(folder, language))
        x_new = ""
        for word in x:
            x_new += (word + " ")
        newarray.insert(i, x_new)
    cv.fit(newarray)
    newarray = cv.transform(newarray).toarray()
    model = sp.cluster.hierarchy.linkage
    newarray = model(newarray, method="complete", metric="cosine")
    return newarray


def compare_output(folder):
    '''
    Generates a table of cosine similarities between every file that the
    provided folder contains.
    '''
    languages = ["english", "french", "spanish", "chinese", "arabic",
                 "russian"]
    cv = CountVectorizer()
    table = {}
    for i in languages:
        table[i] = {}
        for j in languages:
            x = load_output("{}/{}.txt".format(folder, i))
            y = load_output("{}/{}.txt".format(folder, j))
            x_new = ""
            y_new = ""
            for word in x:
                x_new += word
                x_new += " "
            for word in y:
                y_new += word
                y_new += " "
            x = cv.fit_transform((x_new, y_new)).toarray()
            vec1 = x[0].reshape(1, -1)
            vec2 = x[1].reshape(1, -1)
            table[i][j] = cosine_similarity(vec1, vec2)
    return table


def find_matching_files():
    '''
    Returns a dictionary containing the filepaths for all files that are
    available in all six languages.
    '''
    print("Finding matching files ...")
    languages = ["English", "Arabic", "Chinese", "Russian", "Spanish",
                 "French"]
    years = [str(year) for year in range(1990, 2015)]
    filepaths = {}
    for language in languages:
        filepaths[language] = {}
    for language in languages:
        for year in years:
            filepaths[language][year] = get_filepaths(
                "UN/{}/{}".format(language, year))
            filepaths[language][year] = [file[file.index('\\') + 1:]
                                         for file in filepaths[language][year]]
    matching_files = {}
    for year in years:
        set1 = set(filepaths["English"][year])
        set2 = set(filepaths["Arabic"][year])
        set3 = set(filepaths["Chinese"][year])
        set4 = set(filepaths["Russian"][year])
        set5 = set(filepaths["Spanish"][year])
        set6 = set(filepaths["French"][year])
        set1.intersection_update(set2, set3, set4, set5, set6)
        matching_files[year] = list(set1)
    return matching_files


def get_filepaths(directory):
    '''
    Searches a directory and its subfolders for all .xml files they contain.
    '''
    filepaths = []
    for root, directory, files in os.walk(directory):
        for file in files:
            path = os.path.join(root, file)
            if path.endswith(".xml"):
                filepaths.append(path)
    return filepaths


def load_files(year, language, matching_files):
    '''
    Takes a list of .xml files and returns a corpus that is both parsed and
    cleaned.
    '''
    files_list = matching_files[year]
    files_list = ["UN/{}/{}/{}".format(language, year, file)
                  for file in files_list]
    files = []
    for file in files_list:
        f = open(file, "r", encoding="utf8")
        xml = f.read()
        files.append(clean_text(parse_xml(xml), language))
        f.close()
    files = np.array(files)
    return files


def load_output(file):
    '''
    Reads the encoded output file and returns it as a split list.
    '''
    f = open(file, "r", encoding="utf-8")
    output = f.read()
    output = re.sub(r"\$", "dollar", output)  # replace literal dollar signs
    output = re.sub("[^\u0041-\u005A\u0061-\u007A\u00C0-\u00FF ,\n]", "",
                    output)
    output = re.sub("\n", ",", output)
    output = output.lower()
    output = re.sub("the ", "", output)
    output = re.sub("a ", "", output)
    output = re.sub(r"\s+", " ", output)
    output = output.split(",")
    output = [word for word in output if word not in ["", " "]]
    alist = []
    for word in output:
        word = word.split()
        word = max(word)
        alist.append(word)
    f.close()
    return alist


def lda(languages, years, num_topics, passes):
    '''
    Performs the Latent Dirichlet Allocation algorithm and saves both the
    output and the model to a file.
    '''
    for language in languages:
        for n in num_topics:
            dictionary = corpora.Dictionary()
            dictionary = dictionary.load(
                "temp/dictionary/{}".format(language))
            model = models.LdaModel(num_topics=n, update_every=1,
                                    id2word=dictionary)
            for i in range(1, passes + 1):
                for year in years:
                    corpus = np.load(
                        "temp/step2/{}_{}.npy".format(language, year))
                    if len(corpus) == 1:
                        corpus = reshape_corpus(corpus)
                    model.update(corpus)
                    print("Pass {} completed for {}, {}, {} topics".format(
                        i, language, year, n))
            model.save("temp/models/{}_{}".format(language, n))
            write_output("output/{}_topics/{}.txt".format(n, language),
                         model, language)


def parse_xml(file):
    '''
    Searches an .xml file for its body content and returns it as plain text.
    '''
    body = ""
    try:
        cue = ElementTree.fromstring(file).find("text/body")
        if cue:
            for line in cue.itertext():
                body += repr(line)
    except ElementTree.ParseError:
        print("ParseError")
    return body


def preprocess(languages, years, no_below, remove_n):
    '''
    Parses each XML file, cleans the text, constructs a dictionary, converts
    the text into a term frequency array and saves it as a numpy array.
    '''
    matching_files = find_matching_files()
    for language in languages:
        dictionary = corpora.Dictionary()
        for year in years:
            corpus = load_files(year, language, matching_files)
            dictionary.add_documents(corpus)
            corpus = np.array(corpus)
            np.save("temp/step1/{}_{}.npy".format(language, year), corpus)
            print("Step 1 completed for {}, {}".format(language, year))
        dictionary.filter_extremes(no_below=no_below)
        dictionary.filter_n_most_frequent(remove_n=remove_n)
        for year in years:
            corpus = np.load("temp/step1/{}_{}.npy".format(language, year))
            corpus = [dictionary.doc2bow(doc) for doc in corpus]
            np.save("temp/step2/{}_{}.npy".format(language, year), corpus)
            print("Step 2 completed for {}, {}".format(language, year))
        dictionary.save("temp/dictionary/{}".format(language))


def reshape_corpus(corpus):
    '''
    Shapes the corpus the correct way, even if only one file is present.
    '''
    new_corpus = []
    corpus = np.squeeze(corpus)
    for j in corpus:
        new_corpus.append(tuple(j))
    new_corpus = [new_corpus, ]
    return new_corpus


def write_output(name, model, language):
    '''
    Writes the topics of the trained model to a comma-separated file.
    '''
    topics = model.print_topics(-1)
    topics = [x[1] for x in topics]
    topics = [x.split(" + ") if "+" in x else x for x in topics]
    if language in ["english", "french", "spanish"]:
        topics = [[re.sub("[^\u0041-\u005A\u0061-\u007A\u00C0-\u00FF]", "", x)
                   for x in y] for y in topics]
    elif language == "arabic":
        topics = [[re.sub("[^\u0621-\u064A]", "", x) for x in y]
                  for y in topics]
    elif language == "russian":
        topics = [[re.sub("[^\u0400-\u045F]", "", x) for x in y]
                  for y in topics]
    elif language == "chinese":
        topics = [[re.sub("[^\u4E00-\u9FFF]", "", x) for x in y]
                  for y in topics]
    output = ""
    for topic in topics:
        for word in topic:
            output += word
            output += ","
    f = open(name, "wb")
    f.write(output.encode("utf-8"))
    f.close()

# Main ---------------------------------------------------------------------------
# Perform the pre-processing
preprocess(languages, years, no_below, remove_n)

# Studies 1-4 (the number of topics is passed as a list, and the Study 2
# subset is the single year 2014)
lda(languages, years, [10], 20)
lda(languages, ["2014"], [10], 20)
lda(languages, years, [50], 20)
lda(languages, years, [100], 20)

# Generate the cosine tables
cosine_table_1 = compare_output("output/Analysis 1/translated")
cosine_table_2 = compare_output("output/Analysis 2/translated")
cosine_table_3 = compare_output("output/Analysis 3/translated")
cosine_table_4 = compare_output("output/Analysis 4/translated")

# Generate the cluster analysis output
cluster_table_1 = cluster_output("output/Analysis 1/translated")
cluster_table_2 = cluster_output("output/Analysis 2/translated")
cluster_table_3 = cluster_output("output/Analysis 3/translated")
cluster_table_4 = cluster_output("output/Analysis 4/translated")