DOES LANGUAGE INFLUENCE THOUGHT? INVESTIGATING LINGUISTIC RELATIVITY WITH TOPIC MODELS

BAS RADSTAKE

Registration Number: 984292 Student Number: u1251371

Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Communication and Information Sciences, Master Track Data Science: Business and Governance, at the School of Humanities of Tilburg University

Thesis Supervisor: Prof. Dr. Max Louwerse

Tilburg University School of Humanities Tilburg, The Netherlands July 2017

Contents

Abstract
1 Linguistic Relativity
2 Topic Models
2.1 Multinomial Distribution
2.2 Dirichlet Distribution
2.3 Latent Dirichlet Allocation
3 United Nations Parallel Corpus
3.1 Languages
3.1.1 English, French, and Spanish
3.1.2 Russian
3.1.3 Arabic
3.1.4 Mandarin Chinese
4 Topic Comparison
5 Hypothesis
6 Study 1
6.1 Data Description
6.2 Pre-processing
6.3 Latent Dirichlet Allocation
6.4 Topic Comparison
6.5 Results
7 Study 2
7.1 Results
8 Study 3
8.1 Results
9 Study 4
9.1 Results
10 Study Comparison
11 Discussion
References
Appendix A

Abstract

Linguistic relativity is the name of the theory that suggests that language influences thought. Attempts at proving linguistic relativity have been made in the realms of linguistics and psychology, but a computationally motivated effort has not been made yet.

In this study, an attempt to identify linguistic relativity has been made by comparing the meaning of parallel texts. Linguistic relativity dictates that texts in different languages can never yield exactly the same meaning. The United Nations Parallel Corpus was used, as it contains high quality parallel translations. This corpus is available in the English, French, Spanish, Mandarin Chinese, Russian, and Arabic languages. Meaning was estimated using the Latent Dirichlet Allocation topic model. This model extracts the latent semantic structure of a text.

When comparing the latent semantic structures of parallel texts, it was found that the Spanish, French, and English languages showed considerable similarities. This suggests that similar languages yield similar meanings, which is remarkable given that the texts were all professional translations of each other. The results of the current study point towards the existence of linguistic relativity; however, future research should provide more insight into how exactly language and thought are connected.

1 Linguistic Relativity

The view that language influences a person’s world view has long generated a great deal of interest. This idea that language influences thought is called linguistic relativity (Lucy, 1997), also called the Sapir-Whorf hypothesis, after two researchers whose work forms the basis for the theory (Whorf, 1956). Both Sapir and Whorf studied Native American languages, and Whorf’s most convincing argument for linguistic relativity came from studying the Native American Hopi language. As the Hopi language lacked an understanding of time as an object or substance that can be divided, he argued that the world view of Hopi speakers was thus different from the world view of speakers of other languages. The amount of interest the theory generated is hardly surprising; if the hypothesis is true, it would have consequences for a large number of fields, such as linguistics, psychology, anthropology, and even public policy. It would mean that languages are not comparable to each other on their own, as it would be necessary to consider the speaker’s cognition as well. The validity of research based on language comparison would be challenged, existing models would have to be reevaluated, and policies would have to be revised. However, empirical research has thus far failed to give conclusive evidence for the existence of linguistic relativity (Wolff & Holmes, 2011).

Linguistic relativity is not just one hypothesis, and it can be construed in many ways. For example, an extreme form of linguistic relativity is linguistic determinism, which states that language determines thought. It implies that people who speak different languages have different thought processes. This hypothesis has been generally rejected by the scientific community (Wolff & Holmes, 2011). However, several other, less extreme hypotheses of possible effects of language on thought have been identified. For example, it was found that language influences our way of representing exact numbers. The people of the Pirahã tribe have no method of expressing exact numbers in their language, which influenced the way they memorized quantities (Frank, Everett, Fedorenko, & Gibson, 2008). Because there were no words for exact quantities, Pirahã speakers had trouble performing complex memory tasks involving exact numbers, whereas English speakers did not. This does not necessarily mean that language determines thought, but it suggests that language can have an effect on cognition. Even though the literature is not consistent in this regard, it seems there is at least some connection between thought and language, although efforts to empirically prove this have thus far been unsuccessful (Wolff & Holmes, 2011).

Research in linguistic relativity has been a multidisciplinary effort in linguistics, psychology, and anthropology. However, so far this research has been on a relatively small scale, comparing speakers of different languages on an individual basis (Frank et al., 2008). In order to increase the scale at which linguistic relativity is investigated, it is necessary to look beyond traditional field boundaries. One possible way of analysing languages at a large scale is by using computational techniques. With the relatively recent rise in computer performance and the reduction of data storage costs, fields such as computational linguistics have gained in popularity (Manning, 2016). With increasing computational power, research in computational linguistics has become more accessible and more powerful, as ever larger amounts of data can be processed. The rise of computational linguistics has given rise to a more data-driven approach to linguistics (Manning, 2016).

One of the ways in which cross-linguistic hypotheses can be tested with compu- tational linguistics is by analysing a parallel corpus (Egbert, Staples, & Biber, 2015). A parallel corpus is a collection of documents and their translations into a different language. Parallel corpora are interesting because of the alignment of original and translated text. This allows them to be used for cross-linguistic research and the training of machine learning algorithms that aid in machine translation. The construction of parallel corpora is a laborious and expensive process, which means studies are relatively scarce (Egbert et al., 2015).

It is assumed that texts in parallel corpora convey the exact same meaning. This seems intuitive, because these texts are direct translations of each other. However, at the core of linguistic relativity lies the assumption that texts in different languages can never mean exactly the same, as their meanings are influenced by the language the texts are in. By computationally estimating the meaning of parallel texts, one can investigate whether this is truly the case. The meaning of a text can be computationally estimated using topic models. Topic modeling is a way of computationally analysing a large amount of unclassified text (Alghamdi & Alfalqi, 2015), such as a corpus. Its main use is to discover underlying patterns of words within documents, also called topics. These topics represent hidden semantic structures of a text, which can be considered to be its meaning.

The aim of the current research is to gain insight into how the hidden semantic structures of parallel translated texts relate to each other. The same texts in different languages should yield the exact same topics. If this is not the case, the current study can provide a first step towards identifying linguistic relativity in a computational way.

2 Topic Models

The field of topic modeling originated with analyses such as Latent Semantic Analysis (Deerwester, 1988) and later Probabilistic Latent Semantic Analysis (Hofmann, 1999). Latent Dirichlet Allocation (Blei, Ng, & Jordan, 2003) is the algorithm that is the basis for a large portion of current day topic models. In the current research, Latent Dirichlet Allocation (LDA) will be used to generate topics from a parallel text.

Latent Dirichlet Allocation is an unsupervised technique that is used to discover underlying topics in documents (Blei et al., 2003). LDA represents each document as a mixture of topics that contain words with certain probabilities. While the topics themselves are shared across the corpus, the mixture of topics is different for each document. Say, for example, LDA was used to generate topics over all documents in a corpus about psychology. Some documents may draw on topics from health psychology and psychopathology, while other documents might draw on topics from work and organizational psychology. The challenge of learning topics from collections of documents is that the topics are not known in advance.

After running the LDA model, topics can be displayed by returning the words that have the highest probability to belong to that topic. If a hypothetical LDA were to be performed on the previously mentioned corpus with topics about psy- chology, topics could look like Table 1, displaying the 3 words with the highest probabilities for 3 topics.

topic 1      topic 2    topic 3
research     health     stress
paper        symptom    satisfaction
university   therapy    salary

Table 1: Results of a hypothetical LDA.

As can be seen in Table 1, the LDA algorithm extracts meaning by assigning words to a topic. If the analysis is successful, words that share a similar meaning are assigned to the same topic. In this case, words that seem to relate to research are grouped together, as well as words that relate to perhaps health, and work. Note that the algorithm does not assign meaning to these topics; it merely groups the words together. The model that is trained on the data might then be used to classify documents by their content, for example classifying whether a document is about health psychology or organizational psychology.

At the very core of LDA lies the Dirichlet distribution, which is fundamental to understanding how LDA works. The Dirichlet distribution has been widely used for modeling the distribution of words in text documents, and it has been shown to be more appropriate than a multinomial representation (Madsen, Kauchak, & Elkan, 2005). This is because a Dirichlet distribution handles the phenomenon of 'burstiness' better: the fact that when a word shows up in a document, it is likely to show up again (Doyle & Elkan, 2009). Both the multinomial and the Dirichlet distributions will now be described, as they both play a role in how LDA works.

2.1 Multinomial Distribution

The multinomial distribution is a multidimensional generalization of the binomial distribution (Forbes, Evans, Hastings, & Peacock, 2011). It considers a number of independent trials, each with k possible outcomes. For the binomial distribution, k is equal to 2, whereas for the multinomial distribution, k is larger than 2. In each trial, outcome $A_i$ occurs with probability $p_i$. The multinomial distribution refers specifically to a set of trials that are independent of each other.

The probability function for the multinomial distribution is the probability that each event $A_i$ occurs $x_i$ times, with $i = 1, \ldots, k$, in $n$ trials, and is given by:

\[
f(x_1, \ldots, x_k) = n! \prod_{i=1}^{k} \frac{p_i^{x_i}}{x_i!} \tag{1}
\]

In a more intuitive way, the multinomial distribution can be visualised as follows: consider an experiment in which a die is rolled 10 times. A binomial distribution would be used to find the probability of rolling a 2, as in this case a roll of 2 is called a success and a roll of 1, 3, 4, 5, or 6 would be considered a failure. A multinomial distribution would be used to find the probability of rolling exactly four 1s and two 2s, as here the number of possible outcomes increases: there are 10 independent trials with 6 possible outcomes.
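The die example can be sketched in a few lines of Python following Equation 1; the `multinomial_pmf` helper is purely illustrative and not part of the study's pipeline.

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """Probability of observing exactly these outcome counts in sum(counts) trials."""
    n = sum(counts)
    prob = factorial(n)
    for x, p in zip(counts, probs):
        prob *= p ** x / factorial(x)
    return prob

# Ten rolls of a fair die: four 1s, two 2s, and one each of 3, 4, 5, and 6.
counts = [4, 2, 1, 1, 1, 1]
probs = [1 / 6] * 6
p = multinomial_pmf(counts, probs)
```

With a single pair of outcomes the same function reduces to the binomial pmf, illustrating that the multinomial is its multidimensional generalization.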

When the multinomial distribution is used to represent text, the multinomial represents the probability of observing a certain vector of word counts. Because of the way the multinomial distribution models words, it captures the occurrence of common words correctly. However, it fails to correctly model the less common words in a document, which is a problem since the rarer words are more likely to be information-carrying words (Madsen et al., 2005). This is where the Dirichlet distribution comes in.

2.2 Dirichlet Distribution

The Dirichlet distribution is a probability density function over distributions, or 'a distribution of multinomials' (Forbes et al., 2011). The Dirichlet distribution is a continuous distribution, whereas the multinomial distribution is discrete. This means that where the multinomial distribution can only output vectors of integers, the Dirichlet distribution can output vectors of arbitrary real numbers.

The probability function for the Dirichlet distribution can be described as follows. A probability mass function (pmf) is a function that gives the probability that a random variable is exactly equal to some value. The Dirichlet distribution can be thought of as a distribution over pmfs of length k. Madsen, Kauchak and Elkan (2005) define the probability function of the Dirichlet distribution as:

\[
p(\theta; \alpha) = \frac{\Gamma\left(\sum_{w=1}^{W} \alpha_w\right)}{\prod_{w=1}^{W} \Gamma(\alpha_w)} \prod_{w=1}^{W} \theta_w^{\alpha_w - 1} \tag{2}
\]

With θ being a vector in the W-dimensional probability simplex, Γ representing the Gamma function, and α being the vector that defines the parameters of the Dirichlet distribution.

The Dirichlet distribution is a distribution over probability vectors instead of count vectors like the multinomial distribution. Burstiness dictates that when a word shows up, it is likely to show up again (Doyle & Elkan, 2009). This is better modeled by a probability than by a count. The Dirichlet distribution thus models rare words better (Madsen et al., 2005), making it ideal for modeling text documents.
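A minimal sketch of drawing probability vectors from a Dirichlet distribution, using NumPy; the vocabulary size and concentration parameter are toy assumptions, chosen only to show the sparse, 'bursty' vectors described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw word-probability vectors over a 5-word vocabulary from a symmetric Dirichlet.
# A small concentration parameter (alpha < 1) tends to produce sparse vectors in
# which a few words take most of the probability mass.
alpha = np.full(5, 0.1)
theta = rng.dirichlet(alpha, size=3)
```

Each row of `theta` is a valid pmf: non-negative entries that sum to one, which is exactly what Equation 2 defines a density over.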

2.3 Latent Dirichlet Allocation

A limitation of using the original LDA algorithm as described by Blei et al. (2003) is that the entire dataset has to be held in memory. This is not desirable, as the memory required is often more than is available. A solution is to use a slightly altered version of the algorithm as described by Hoffman et al. (2010). Hoffman et al. use an online-learning approach, which means that a model can be updated when more data is added. This allows for the chunking of a dataset into sets of documents, which can be processed one by one. The current section paraphrases Hoffman et al. (2010).

To perform LDA, first the number of topics k has to be specified. Each topic is defined by a multinomial distribution over the vocabulary and is assumed to be drawn from a Dirichlet distribution:

\[
\beta_k \sim \text{Dirichlet}(\eta) \tag{3}
\]

With $\beta_k$ being the conditional probability table of the words to topics. Given these topics, for each document d a vector of topic proportions $\theta_d$ is drawn:

\[
\theta_d \sim \text{Dirichlet}(\alpha) \tag{4}
\]

With α being a vector of symmetric Dirichlet priors of size k. Then, for each word w in the document d a topic index z is drawn from the topic weights.

\[
z_{dw} \sim \text{Multinomial}(\theta_d) \tag{5}
\]

Now, the selected word position has been assigned to a topic according to the topic index, and the word itself is drawn from that topic's distribution over the vocabulary. After repeating this process for an entire corpus of documents C, the posterior distributions of each topic β, topic proportion θ, and topic assignment z reveal the latent structure of the corpus. The process is visualised in Figure 1.
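The generative process of Equations 3-5 can be sketched in NumPy for a single document; the vocabulary, number of topics, document length, and prior values are toy assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

vocab = ["health", "stress", "therapy", "research", "salary"]  # toy vocabulary
k, doc_len = 2, 8        # number of topics, words per document (toy settings)
eta, alpha = 0.1, 0.5    # Dirichlet priors for topics and topic proportions

# Eq. 3: each topic beta_k is a distribution over the vocabulary.
beta = rng.dirichlet(np.full(len(vocab), eta), size=k)

# Eq. 4: the document draws its own vector of topic proportions theta_d.
theta_d = rng.dirichlet(np.full(k, alpha))

# Eq. 5: for each word position, draw a topic index z from the topic
# proportions, then draw the word itself from that topic's distribution.
document = []
for _ in range(doc_len):
    z = rng.choice(k, p=theta_d)
    w = rng.choice(len(vocab), p=beta[z])
    document.append(vocab[w])
```

Inference in LDA works in the opposite direction: given only documents like `document`, it recovers plausible values of β, θ, and z.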

As mentioned earlier, performing an LDA on parallel texts should yield the same topics. However, this can only be assumed if the corpus it is performed on contains high quality translations. It is therefore necessary to use a high quality parallel corpus for the current research.

Figure 1: Visualisation of the LDA process.

3 United Nations Parallel Corpus

In 2016, the United Nations published a parallel corpus containing all of its proceedings translated into the six official working languages: Arabic, English, French, Mandarin Chinese, Russian, and Spanish (Ziemski, Junczys-Dowmunt, & Pouliquen, 2016). The United Nations is one of the world’s largest employers of language professionals (United Nations Language Careers, 2017). It is also considered one of the most prestigious employers for translation professionals, and the credentials required to gain employment at the United Nations are very high. Translators have to pass a competitive exam, after which they are extensively trained before starting on an assignment (Competitive Examinations for Language Professionals, 2017). Therefore, there can be little doubt that the translations they produce are of high quality. This high quality of translation implies that the meanings of translation and original are as equivalent as possible, which makes the United Nations Parallel Corpus ideal for the current research.

The United Nations Parallel Corpus contains the languages Arabic, English, French, Mandarin Chinese, Russian, and Spanish, which limits the current research to these 6 languages.

3.1 Languages

Linguistic relativity dictates that it is not possible for different languages to yield the same meaning. However, it is reasonable to assume that the amount of similarity between meanings can differ depending on which languages are examined. If linguistic relativity is true and topics between languages are examined, it can be expected that a pattern is found, for example that culturally and linguistically

similar languages yield similar topics. It is therefore important to understand which languages are similar to each other. Historical linguistics is the branch of linguistics that researches how languages change over time (Baldi & Dussias, 2012). It describes how modern languages have descended from their ancestral languages. Languages that come from the same ancestor are more similar to each other than unrelated languages (Rowe & Levine, 2015). Additionally, languages that have been in contact with each other for a sustained period of time can develop similarities (Blake, 1996). These findings suggest that when languages share a common ancestor or have had a significant amount of contact, the similarities between them can be larger than between unrelated languages.

In the current research, the languages Arabic, English, French, Mandarin Chinese, Russian, and Spanish are examined. Of these languages, English, French, Russian, and Spanish have descended from the same ancestor, or proto-language, Proto-Indo-European (Beekes, 2011). In Figure 2, the relation between these languages is visualised. Mandarin Chinese and Arabic belong to two other language groups, Sino-Tibetan and Semitic respectively, which means that they developed separately (Rowe & Levine, 2015). It is therefore assumed that the Indo-European languages are more similar to each other than to languages that they are not related to. Within the Indo-European branch, another distinction is made: the Slavic, Germanic, and Romance families (Beekes, 2011). Languages that belong to the same family diverged from Proto-Indo-European together before splitting off and are thus closer to each other than to other languages of the same language group. As for similarities through sustained periods of contact, a large part of the vocabulary in English has its roots in Romance languages, due to the influence of Norman French on the English language during the Norman invasion of England and, later on, the influence of Latin (Blake, 1996).

Between the 6 examined languages, 3 levels of similarity can thus be identified. The first one consists of French and Spanish, as they are both part of the Romance language family. The second level would be between French, Spanish, and English, as English was influenced by Norman French and shares a lot of vocabulary with the Romance languages. The last level would be between French, Spanish, English, and Russian, as these are all Indo-European languages that share common roots and structure.

Figure 2: Language tree for the Indo-European languages.

Depending on the representation of text in a language, using text mining techniques can pose challenges. These will now be described.

3.1.1 English, French, and Spanish

English, French, and Spanish are all written in the Latin script. This has both an advantage and a disadvantage: cleaning texts in these languages is fairly easy, as all possible characters are contained in the Unicode subsets Basic Latin and Latin-1 Supplement. However, it is impossible to differentiate between these languages purely by script, which might pose difficulties when a text has content in multiple languages.

3.1.2 Russian

Russian is written in the Cyrillic script. This script is fully covered by the Unicode subset Cyrillic, and differentiating between Russian texts and texts in other languages does not pose a challenge. However, Russian is well known for its grammatical inflections (Wade, 2010). Nouns are subject to 6 different cases, which change the ending of a word depending on its grammatical function (Wade, 2010). Nouns also agree in gender (masculine, feminine, and neuter) and number (singular and plural). Although some of these features are present in other languages, Russian is by far the most inflected of the 6 available languages. This means that when converting a Russian text to a word vector, the representation is going to be more sparse, as more forms exist for each word.

3.1.3 Arabic

Arabic is a Semitic language that consists of many dialects and forms (Versteegh, 2014). In the United Nations Parallel Corpus, Modern Standard Arabic is written (Ziemski et al., 2016).

Arabic is written from right to left in the Arabic abjad, a writing system in which each symbol stands for a consonant. Vowels are rarely supplied, and the reader is expected to fill them in from context. This can be problematic when using text mining techniques on Arabic texts, as the absence of vowels means more words share the same characters (Al-Harbi, Almuhareb, Al-Thubaity, Khorsheed, & Al-Rajeh, 2008). Resulting topics in Arabic might thus show more ambiguity than those in other languages.

3.1.4 Mandarin Chinese

Mandarin Chinese, often referred to as Chinese, is a Sinitic language from the Sino-Tibetan family (Chen, 1999). Chinese is written using Chinese characters. These characters are logograms that represent meaning. Modern dictionaries list as many as 70,000 characters (Zhen, 2000). However, to attain the highest certificate of proficiency in Mandarin Chinese, only an estimated 5,000 characters need to be learned (HSK Chinese Proficiency Test, 2017). As the CJK Unified Ideographs Unicode subset encompasses the 6,763 most frequent characters, it can be assumed that only the rarest characters would be left out when cleaning Chinese texts using this Unicode set. This is not a problem if one is only looking at the most frequent topics, but it might influence results when looking at models with a large number of topics.

What can be a problem, however, is the fact that Mandarin Chinese does not use spaces to denote word boundaries (Chen, 1999). This makes it impossible to tokenize words by splitting on whitespace, as words can consist of either one or multiple characters. Using the Jieba package, the characters can be segmented into words (Sun, 2012). This does, however, increase the time needed to process documents.

4 Topic Comparison

Once the topics are generated by the model, a comparison procedure is needed to measure the similarity between these topics. This is not standard for Latent Dirichlet Allocation, as typically the model is used in a practical setting after training, for example to classify new documents on their contents.

Resulting topics first have to be translated to a common language in order to compare them, for example using machine translation. This should not pose a

threat to the validity of the research, as it only alters the representation of the output, not its content. Then, a similarity measure has to be defined to quantify the connection between the topic vectors. For the current research, cosine similarity will be used, as it is a commonly used similarity measure for clustering documents (Muflikhah & Baharudin, 2009).

The cosine similarity between two documents is equal to the cosine of the angle between their two vectors (Huang, 2008). If documents are represented as term vectors, the cosine similarity is equal to the correlation between the vectors. The cosine similarity can give an indication of similarity between any two vectors. Cosine similarity between two vectors $\vec{A}$ and $\vec{B}$ can be defined as:

\[
\text{similarity}(\vec{A}, \vec{B}) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} \tag{6}
\]
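Equation 6 can be implemented directly; the two count vectors below are hypothetical stand-ins for translated topic-word vectors over a shared dictionary.

```python
import math

def cosine_similarity(a, b):
    """Dot product of the vectors divided by the product of their norms (Eq. 6)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical word-count vectors for two translated topics.
topic_en = [2, 1, 0, 1]
topic_fr = [1, 1, 0, 2]
sim = cosine_similarity(topic_en, topic_fr)
```

Identical vectors yield a similarity of 1, orthogonal vectors a similarity of 0, so values near 1 indicate strongly overlapping topic vocabularies.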

Cosine similarity on its own can quantify the similarity between each pair of languages. To visualise the connection between groups of languages, a clustering approach can be taken. In the current research, agglomerative hierarchical clustering is used to provide more insight into the similarity between different groups of languages.

Hierarchical clustering is a way of grouping similar vectors based on a distance measure (Steinbach, Karypis, Kumar, et al., 2000). The type of hierarchical clustering used in the current research is called agglomerative clustering. Agglomerative clustering starts with each vector as its own cluster. It then merges the clusters that are most similar to each other based on the distance measure. In the current research, cosine distance will be used, which is a distance measure based on the cosine similarity. Cosine distance is equal to:

\[
\text{distance} = 1 - \text{similarity} \tag{7}
\]

Once two clusters are merged, the similarity matrix is updated to reflect the similarity between the new cluster and the original clusters. The process is then executed again to compute distances between the new cluster and the remaining vectors. This way of clustering is done in a 'greedy' manner. This means that the algorithm does not look for an optimal number of clusters, but rather keeps making merge decisions based on the distance until a single cluster is formed. Interpreting the results is therefore done by analysing the order in which the vectors get clustered together before a single cluster is formed. By grouping the most similar languages and comparing them to the others, a hierarchy of similarity can be identified.
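The procedure described above can be sketched with SciPy's hierarchical clustering routines; the six count vectors are toy stand-ins for the languages' translated topic words, not the study's actual data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Toy count vectors standing in for the six languages' translated topic words.
labels = ["AR", "EN", "ES", "FR", "RU", "ZH"]
vectors = np.array([
    [3, 0, 1, 0],
    [1, 2, 0, 1],
    [1, 2, 0, 2],
    [0, 2, 1, 2],
    [2, 1, 1, 0],
    [3, 1, 0, 1],
], dtype=float)

# Pairwise cosine distances (1 - cosine similarity), then greedy agglomerative
# merging; each row of Z records one merge (cluster ids, distance, size).
distances = pdist(vectors, metric="cosine")
Z = linkage(distances, method="complete")
```

The linkage matrix `Z` is exactly what a dendrogram plot visualises: reading its rows top to bottom gives the order in which the language vectors merge.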

5 Hypothesis

Linguistic relativity has not yet been investigated using computational linguistic techniques. The aim of the current research is to try to identify linguistic relativity by comparing the meaning of parallel texts. Latent Dirichlet Allocation is a topic modeling technique that extracts latent semantic structures from texts. One would assume that parallel texts yield the same topics, but according to linguistic relativity this cannot be true. If the topics, and thus the meanings, of parallel texts are different, it would provide a first step in identifying linguistic relativity computationally.

6 Study 1

Study 1 describes the process of using Latent Dirichlet Allocation to generate topics for the United Nations Parallel Corpus, and comparing them afterwards.

6.1 Data Description

The United Nations Parallel Corpus (Ziemski et al., 2016) was retrieved from https://conferences.unite.un.org/UNCorpus/en/DownloadOverview in the form of compressed collections of XML files. The 11 compressed files were each 1 GB or less in size. Decompressing the files resulted in about 41 GB of data stored in 799.276 XML files.

6.2 Pre-processing

First, documents that were not translated into all six languages were filtered out. This resulted in a selection of 86.307 XML files for each language, or 517.842 files in total. Next, each XML file was parsed by selecting only the content of the body tags. For each language, a subset of Unicode can be defined that encompasses all the characters present in the language. These subsets are:

English, French, and Spanish: Basic Latin, Latin-1 Supplement

Arabic: Arabic

Mandarin Chinese: CJK Unified Ideographs

Russian: Cyrillic

For each language, all files in that language were filtered on these subsets. For the languages English, French, Spanish, and Russian, all letters were changed to lower case. Next, all remaining punctuation, numbers, newlines, and trailing spaces were removed.

For all languages except Mandarin Chinese, tokenization was done by splitting the text on each space. Tokenization for Mandarin Chinese was done using the Jieba package in Python (Sun, 2012), as word boundaries are not indicated by whitespace.

A frequency dictionary was constructed for each language, and for each language the 50 most frequent words were removed. Removing the 50 most frequent words removes most of the stopwords from the corpora. This approach was taken to keep the process as equal as possible across languages. To reduce the dimensionality of the data, words that did not appear in at least 5 documents were excluded. Using the dictionary, each document was transformed into a list of tuples of word IDs and word frequencies.
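The final bag-of-words representation can be sketched in plain Python; the toy documents below are hypothetical and stand in for the pre-processed, tokenized corpus (stopword and rare-word filtering are omitted for brevity).

```python
from collections import Counter

# Toy pre-processed documents (already tokenized, stopwords removed).
docs = [
    ["rights", "convention", "women", "rights"],
    ["convention", "children", "peace"],
]

# Build a word -> ID dictionary over the whole corpus...
vocab = {word: i for i, word in enumerate(sorted({w for d in docs for w in d}))}

# ...and represent each document as a list of (word ID, frequency) tuples.
bow = [sorted((vocab[w], c) for w, c in Counter(d).items()) for d in docs]
```

This sparse tuple format is the same shape of input that Gensim's LDA implementation consumes.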

6.3 Latent Dirichlet Allocation

Latent Dirichlet Allocation was performed once for each language, on all documents. LDA analyses were executed using the Gensim package for Python (Řehůřek & Sojka, 2010). The Gensim package follows the approach taken in the paper by Hoffman et al. (2010). The number of topics was set to 10, and the number of passes was set to 20. For each language, the top 10 words for each topic were displayed, resulting in 100 words in total.

6.4 Topic Comparison

Topics for each non-English language were translated to English using Google Translate (Google Translate, 2017). Topics were supplied in a comma-separated format with newlines in between, so the Google Translate algorithm would not mistake word combinations for a phrase. As Google Translate adds capitalization and articles back in, the translated text was made lower case and articles and non-alphabetical characters were removed.

A dictionary was constructed containing all the words in all topics. This dictionary was used to convert topic vectors into vectors that reflect word counts.

Similarity between vectors was calculated using cosine similarity. A hierarchical clustering algorithm was performed using the cosine distance and complete linkage. The resulting dendrogram visualizes the clusters that exist within the languages.

6.5 Results

The cosine similarity matrix is displayed in Table 2. Cosine values had a mean of 0.192 and a standard deviation of 0.058. The dendrogram displaying the clusters extracted from the data is displayed in Figure 3. Table 3 displays a selection of the most probable words for 3 example topics from the LDA analysis.

     AR      EN      ES      FR      RU      ZH
AR   -       0.182   0.113   0.095   0.195   0.246
EN   0.182   -       0.262   0.206   0.184   0.238
ES   0.113   0.262   -       0.308   0.175   0.168
FR   0.095   0.206   0.308   -       0.140   0.141
RU   0.195   0.184   0.175   0.140   -       0.223
ZH   0.246   0.238   0.168   0.141   0.223   -

Table 2: Cosine similarity measure between topics for study 1. AR = Arabic, EN = English, ES = Spanish, FR = French, RU = Russian, ZH = Mandarin Chinese.

The similarity matrix in Table 2 shows a relatively high amount of similarity between Spanish and French (sim = 0.308). Other notable similarities exist between English and Spanish (0.262) and between Mandarin Chinese and Arabic (0.246). The dendrogram in Figure 3 shows that the first merge between clusters occurred between French and Spanish. Later on, English was added to this cluster. Chinese, Arabic, and Russian ended up in the other cluster, with Chinese and Arabic being clustered together first. The clustering together of, and the similarities between, French, Spanish, and English indicate that similar languages have similar meanings. However, the comparably large similarity between Arabic and Chinese does not follow this rationale.

Figure 3: Dendrogram for study 1.

AR  rights, subject, law, woman, development
EN  women, law, convention, conference, children
ES  women, article, convention, children, peace
FR  women, man, convention, people, children
RU  person, women, rights, children, persons
ZH  convention, children, rights, women, law

AR  -
EN  unodc, unops, ipsas, mdgs, umoja
ES  court, ipsas, accused, unmit, minustah
FR  unamid, unmik, usd, idb, amisom
RU  -
ZH  -

AR  israel, palestinian, territories, gaza, syrian
EN  palestinian, israeli, protocol, israel, arab
ES  -
FR  -
RU  israel, lebanon, pact, weapons, israeli
ZH  israel, lebanon, bromine, airspace, methyl

Table 3: Example topics for study 1.

In Table 3, some example topics were selected to compare the content of the topics. As can be seen, each language has a topic whose words seem to involve human rights. In all languages it contains the words 'women' and 'children', paired with 'rights' and 'convention'. This was the only topical structure present in all 6 languages, which could explain the overall low similarity: the human-rights topic is the main source of shared words. Also, all languages written in the Latin script had a topic consisting of abbreviations such as 'unodc' and 'ipsas'; this topic was not found in the other languages. These abbreviations presumably stand for the United Nations Office on Drugs and Crime and the International Public Sector Accounting Standards, indicating that this topic contains just abbreviations. It is notable that these abbreviations only appear in the languages written in the Latin script; abbreviations might have been removed from the other languages during pre-processing. Some languages had a topic containing words that seem to relate to Israel and other countries in the Middle East. This is remarkable, as it seems to be a cohesive topic, yet it is not present in all 6 languages.

Study 1 shows surprisingly low similarity between the languages. Still, some patterns can be identified: languages with a similar history and culture show similarities in meaning. Before drawing conclusions based on these results, it is important to consider model fit. If the number of topics specified is too low, topics with different meanings might be merged into a single topic that is not cohesive. As the topics resulting from Study 1 contained barely any cohesive topics, the number of topics may have been too low for the amount of data. It is therefore necessary to change the parameters and the data of the model to try to find a better fit.

7 Study 2

Study 1 showed seemingly low similarity between the languages. It also produced non-coherent topics, which suggests that the amount of data used to generate only 10 topics was too large. By using less data, it might be possible to improve upon Study 1 while still maintaining only 10 topics. To investigate whether the results of Study 1 were caused by a bad model fit, Study 2 aims to replicate Study 1 with a smaller amount of data.

Because reducing the amount of data might give a better fit for a model with 10 topics, Study 2 replicated Study 1 with a smaller subset of the data. Latent Dirichlet Allocation was performed once for each language, on a subset containing all documents from the year 2014, as this is the most recent year in the United Nations Parallel Corpus. For each language, the top 10 words of each topic were displayed, resulting in 100 words in total.
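The "top 10 words per topic" step can be sketched as follows. The topic-word probabilities below are invented for illustration (two tiny topics of two words each); the real runs used 10 topics per language, giving 10 × 10 = 100 words.

```python
# Toy topic-word distributions over a small vocabulary.
topics = [
    {"peace": 0.30, "mission": 0.25, "budget": 0.10, "treaty": 0.05},
    {"nuclear": 0.40, "weapons": 0.35, "treaty": 0.15, "peace": 0.02},
]

def top_words(topic, n):
    """Return the n most probable words of a single topic."""
    return [w for w, _ in sorted(topic.items(), key=lambda kv: -kv[1])[:n]]

# Concatenate the top-n words of every topic into one word list per language.
all_words = [w for t in topics for w in top_words(t, 2)]
print(all_words)  # → ['peace', 'mission', 'nuclear', 'weapons']
```

These per-language word lists are what the cosine comparison operates on.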

7.1 Results

The cosine similarity matrix is displayed in Table 4. Cosine values had a mean of 0.370 and a standard deviation of 0.081. The dendrogram displaying the clusters extracted from the data is displayed in Figure 4. Table 5 displays a selection of the most probable words for 2 example topics from the LDA analysis.

      AR     EN     ES     FR     RU     ZH
AR      -  0.326  0.308  0.296  0.378  0.268
EN  0.326      -  0.412  0.452  0.330  0.400
ES  0.308  0.412      -  0.592  0.375  0.302
FR  0.296  0.452  0.592      -  0.426  0.367
RU  0.378  0.330  0.375  0.426      -  0.325
ZH  0.268  0.400  0.302  0.367  0.325      -

Table 4: Cosine similarity measure between topics for study 2. AR = Arabic, EN = English, ES = Spanish, FR = French, RU = Russian, ZH = Mandarin Chinese.

Figure 4: Dendrogram for study 2.

For the second study (mean = 0.370), the cosine similarities displayed in Table 4 are higher than in Study 1 (0.192). This suggests that the topic model has a better fit in general than in Study 1. Notable results are the similarities between French and Spanish (0.592), English and French (0.452), Russian and French (0.426), and English and Spanish (0.412). These are also the four languages of the Indo-European language family, again showing the same pattern of similar meanings between similar languages.

The dendrogram in Figure 4 shows that French, Spanish, and English have been clustered together, which was to be expected given the similarity measures. However, what was not expected is that Russian was clustered together with Arabic, and not with the other languages of the Indo-European family.

AR  nuclear, weapons, weapon, treaty, korea
EN  weapons, disarmament, nuclear, draft, treaty
ES  nuclear, treaty, disarmament, nuclear, weapons
FR  nuclear, weapons, disarmament, treaty, nuclear
RU  weapons, nuclear, contract, territory, peace
ZH  weapons, peace, people, treaty, sudan

AR  peace, support, resources, administration, budget
EN  mission, peacekeeping, missions, operations, peace
ES  peace, missions, operations, peace, budget
FR  peace, conflict, mission, humanitarian, republic
RU  support, mission, positions, peace, participant
ZH  peace, troops, missions, stable, council, personnel

Table 5: Example topics for study 2.

As can be seen in Table 5, compared to Study 1 there are now more topics that the models have in common. At least two topics share a large number of words: the first contains words relating to nuclear weapons, such as 'nuclear', 'weapons', and 'treaty'; the second relates to peace missions, with words such as 'peace', 'troops', and 'support'. The higher cosine values, combined with the observed higher topic coherence, suggest that Study 2 provides a better model fit than Study 1. As a good model fit has now been established for a small subset of the data, this implies that modelling the complete dataset requires a higher number of topics. Studies 3 and 4 both describe models over the complete dataset, but with a higher number of topics.

8 Study 3

Study 2 showed that a smaller dataset gives a better fit for a model with only 10 topics. For a model that encompasses all data, more topics would have to be specified in order to maintain good model fit.

Study 3 replicated Study 1, keeping all parameters the same except for the number of topics, which was set to 50. The whole dataset was used. For each language, the top 10 words of each topic were displayed, resulting in 500 words in total.

8.1 Results

The cosine similarity matrix is displayed in Table 6. Cosine values had a mean of 0.135 and a standard deviation of 0.046. The dendrogram displaying the clusters extracted from the data is displayed in Figure 5.

      AR     EN     ES     FR     RU     ZH
AR      -  0.106  0.136  0.073  0.169  0.153
EN  0.106      -  0.242  0.143  0.168  0.101
ES  0.136  0.242      -  0.173  0.136  0.159
FR  0.073  0.143  0.173      -  0.078  0.075
RU  0.169  0.168  0.136  0.078      -  0.117
ZH  0.153  0.101  0.159  0.075  0.117      -

Table 6: Cosine similarity measure between topics for study 3. AR = Arabic, EN = English, ES = Spanish, FR = French, RU = Russian, ZH = Mandarin Chinese.

Similarities between languages for Study 3 (mean = 0.135) are lower than in Study 1 (0.192) and Study 2 (0.370). Notable results include the similarities between Spanish and English (0.242), French and Spanish (0.173), and Russian and Arabic (0.169). The clustering output in Figure 5 again groups English, Spanish, and French together, consistent with Studies 1 and 2. What was not expected is the clustering of Arabic and Russian.

Figure 5: Dendrogram for study 3.

9 Study 4

As Study 2 showed that more topics would have to be specified for the model to fit well, Study 3 increased the number of topics to 50. However, the resulting cosine similarities were lower than in both Study 1 and Study 2. Study 4 was performed to test whether increasing the number of topics even further would improve model fit. The final study therefore increased the number of topics to 100.

Study 4 replicated Study 1, keeping all parameters the same except for the number of topics, which was set to 100. The whole dataset was used. For each language, the top 10 words of each topic were displayed, resulting in 1000 words in total.

9.1 Results

The cosine similarity matrix is displayed in Table 7. Cosine values had a mean of 0.104 and a standard deviation of 0.080. The dendrogram displaying the clusters extracted from the data is displayed in Figure 6.

      AR     EN     ES     FR     RU     ZH
AR      -  0.038  0.042  0.028  0.076  0.058
EN  0.038      -  0.280  0.178  0.048  0.114
ES  0.042  0.280      -  0.197  0.041  0.218
FR  0.028  0.178  0.197      -  0.061  0.142
RU  0.076  0.048  0.041  0.061      -  0.045
ZH  0.058  0.114  0.281  0.142  0.045      -

Table 7: Cosine similarity measure between topics for study 4. AR = Arabic, EN = English, ES = Spanish, FR = French, RU = Russian, ZH = Mandarin Chinese.

Similarities for Study 4 (mean = 0.104) were again lower than those in Study 3 (0.135). This suggests that model fit did not improve for the fourth and final study. Still, a familiar pattern emerges: the similarities between English and Spanish (0.280), French and Spanish (0.197), and English and French (0.178) follow the same pattern as Studies 1, 2, and 3. In the dendrogram in Figure 6, English, Spanish, and French were the first to be clustered. This means that even though the model did not fit properly, a pattern can still be found, which suggests that similar languages do carry similar meanings.

10 Study Comparison

An overview of the means and standard deviations for all 4 studies is given in Table 8. As Study 1 and Study 2 displayed the highest similarities, the results of those studies are regarded as the most reliable. In all four studies, a high similarity is found between French, Spanish, and English.

Study   Mean    St. Dev.
1       0.192   0.058
2       0.370   0.081
3       0.135   0.046
4       0.104   0.080

Table 8: Overview of Means and Standard Deviations for Studies 1, 2, 3, and 4.

Figure 6: Dendrogram for study 4.

11 Discussion

The aim of the current research was to identify the existence of linguistic relativity by examining the hidden semantic structures of parallel texts. If linguistic relativity exists, it is not possible for texts in different languages to carry identical meaning. In all four studies, the similarity values found were low. Moreover, in all four studies the Spanish, French, and English languages showed considerable similarities, and they were clustered together every time. These findings suggest that parallel texts do not in fact yield the same meaning, and that similar languages yield similar meanings. From these findings it is possible to assume that there is a connection between language and meaning, pointing towards the existence of linguistic relativity.

Of course, the current research cannot provide insight into the way in which linguistic relativity operates. However, it shows that the outright dismissal of linguistic relativity is not correct. Research that compares multiple languages should recognize the existence of linguistic relativity. People who interact with speakers of other languages, such as diplomats, should be aware that the information they provide could be interpreted differently even if the translation is excellent. Public policy should be shaped in such a way that, in multilingual contexts, the language people speak is acknowledged. Although the current research strongly suggests that linguistic relativity exists, several potential limitations can be identified.

Firstly, all of the languages that consistently appear together are written in the Latin script. Because the first step of cleaning the texts was selecting only the relevant Unicode characters, there could have been some interference between these languages. For example, French quotations could have been present in the English corpus, eventually leading to a higher similarity. Such quotations would have been filtered out of the non-Latin languages, as Latin-script characters were removed there. However, these quotations make up only a very small part of the documents. They may have influenced how similar these languages appear, but not in such a way that the current research is invalid.

The way the text was cleaned could pose another limitation. With the intent of keeping as many parameters as possible the same for all languages, 50 stopwords were removed from each language. This approach succeeded in keeping the process as uniform as possible across the languages. However, it may also have introduced some variance. For example, Russian has much more inflection than the other languages, which increases the sparsity of Russian word vectors. This could mean that, through the way stopwords were removed, some languages lost information while others did not have enough stopwords removed. These factors are, however, not problematic for the current research. The alternative would have been to use a very different approach for each language, thus increasing the amount of bias. Additionally, the removal of stopwords is not crucial for training Latent Dirichlet Allocation models, as LDA is not sensitive to stopwords (Alghamdi & Alfalqi, 2015).
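As the appendix code suggests, the stopword removal was frequency-based: the n most frequent words of each language were dropped. A toy sketch (invented corpus, n = 2 instead of the 50 used in the studies):

```python
from collections import Counter

def remove_top_n(documents, n):
    """Drop the n most frequent words across the corpus, a
    language-agnostic stand-in for hand-built stopword lists."""
    counts = Counter(w for doc in documents for w in doc)
    stopwords = {w for w, _ in counts.most_common(n)}
    return [[w for w in doc if w not in stopwords] for doc in documents]

docs = [["the", "rights", "of", "the", "child"],
        ["the", "law", "of", "nations"]]
print(remove_top_n(docs, 2))  # → [['rights', 'child'], ['law', 'nations']]
```

Because the cutoff is purely frequency-based, a highly inflected language spreads its function words over more surface forms, which is exactly the variance concern raised above.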

Another limitation was the use of machine translation in the topic comparison procedure. Topics were translated to English using Google Translate (Google Translate, 2017). Using machine translation means that the quality of the translation depends entirely on the quality of the algorithm, which might vary across language pairs, so some languages may have received better translations than others. This limitation might have influenced the results slightly, but it is not decisive, because the only words translated were those output by the LDA model. The representation of the output might have changed, but the content remains the same.

The convergence of the LDA models could also play a role in the quality of the models. The number of passes over the corpus was set to 20 for all 4 studies. This might have been enough for Studies 1 and 2, but the quality of the models with 50 and 100 topics might improve with more passes. This might also be the reason why Studies 3 and 4 show low similarity between the languages.

Finally, a last limitation is how meaning is measured in the current research. It is assumed that the latent semantic structure of a text equals its meaning. This might not be the case: it is possible that the way a language encodes information by itself results in a different latent semantic structure. Future research would have to determine whether measuring meaning in this way is valid.

Several limitations of the current research have been identified above, and these should be addressed in future research. Primarily, future research should focus on finding concrete evidence that latent semantic structure does in fact measure meaning; if this turns out not to be true, the validity of the current research might be questioned. The current research might also be improved by developing a more thorough topic comparison procedure. At present, machine translation is used to convert all topics to one language, which might introduce a slight bias for some languages. Future research could avoid machine translation by comparing databases of semantic similarities between different languages. Another improvement would be to use language-specific stemmers to strip words of their grammatical markers and return the root, keeping in mind that each language would have to be treated equally so that no additional bias is added. Lastly, it is advised to replicate the current research using topic models with large numbers of topics. The results of Studies 3 and 4 were not convincing, and one possible reason is that those models did not converge. It is thus advised to increase the number of passes the model makes over the data.
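The root-extraction idea can be illustrated with a toy suffix stripper. The suffix rules below are invented for the example; a real study would need a proper, linguistically informed stemmer or lemmatizer per language.

```python
def crude_stem(word):
    """Strip a few common English suffixes; purely illustrative,
    not a linguistically sound stemmer."""
    for suffix in ("ations", "ation", "ings", "ing", "es", "s"):
        # Only strip when enough of a stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print([crude_stem(w) for w in ["nations", "meetings", "rights"]])
# → ['nation', 'meet', 'right']
```

Collapsing inflected forms like this would reduce the sparsity differences between highly and weakly inflected languages noted above.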

The current research showed that parallel texts do not contain the same meaning. In fact, languages that are linguistically and culturally more similar carry more similar meanings. This is a surprising finding, because one would expect professionally translated texts to yield the same meaning. Future research is absolutely necessary to find out the exact link between thought and language, but considering the results it is critical not to dismiss linguistic relativity. The current research provided only a first step in trying to identify linguistic relativity computationally, and it provides a foundation upon which further research can be done.

References

Alghamdi, R., & Alfalqi, K. (2015). A survey of topic modeling in text mining. I. J. ACSA, 6(1), 147–153.
Al-Harbi, S., Almuhareb, A., Al-Thubaity, A., Khorsheed, M., & Al-Rajeh, A. (2008). Automatic Arabic text classification. In Proceedings of the 9th International Conference on the Statistical Analysis of Textual Data (pp. 77–83).
Baldi, P., & Dussias, P. E. (2012). Historical linguistics and cognitive science. International Journal of Linguistics, Philology, and Literature, 3(1), 5–27.
Beekes, R. S. (2011). Comparative Indo-European linguistics: An introduction. Amsterdam: John Benjamins Publishing.
Blake, N. F. (1996). A history of the English language. London: Macmillan.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(1), 993–1022.
Chen, P. (1999). Modern Chinese: History and sociolinguistics. Cambridge: Cambridge University Press.
Competitive examinations for language professionals. (2017, May 15). Retrieved from https://careers.un.org/lbw/home.aspx?viewtype=LE
Deerwester, S. (1988). Improving information retrieval with latent semantic indexing. In Proceedings of the 51st Annual Meeting of the American Society for Information Science (pp. 36–40).
Doyle, G., & Elkan, C. (2009). Accounting for burstiness in topic models. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 281–288).
Egbert, J., Staples, S., & Biber, D. (2015). Corpus research. In The Cambridge Guide to Research in Language Teaching and Learning (pp. 119–120). Cambridge: Cambridge University Press.
Forbes, C., Evans, M., Hastings, N., & Peacock, B. (2011). Statistical distributions. New York: John Wiley & Sons.
Frank, M. C., Everett, D. L., Fedorenko, E., & Gibson, E. (2008). Number as a cognitive technology: Evidence from Pirahã language and cognition. Cognition, 108(3), 819–824.
Google Translate. (2017, May 20). Retrieved from https://translate.google.com/
Hoffman, M., Bach, F. R., & Blei, D. M. (2010). Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 23 (pp. 856–864).
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 50–57).
HSK Chinese proficiency test. (2017, June 20). Retrieved from http://english.hanban.org/node_8002.htm
Huang, A. (2008). Similarity measures for text document clustering. In Proceedings of the 6th New Zealand Computer Science Research Student Conference (pp. 49–56).
Lucy, J. A. (1997). Linguistic relativity. Annual Review of Anthropology, 26(1), 291–312.
Madsen, R. E., Kauchak, D., & Elkan, C. (2005). Modeling word burstiness using the Dirichlet distribution. In Proceedings of the 22nd International Conference on Machine Learning (pp. 545–552).
Manning, C. D. (2016). Computational linguistics and deep learning. Computational Linguistics, 41(4), 701–707.
Muflikhah, L., & Baharudin, B. (2009). Document clustering using concept space and cosine similarity measurement. In International Conference on Computer Technology and Development 2009 (pp. 58–62).
Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (pp. 45–50).
Rowe, B. M., & Levine, D. P. (2015). A concise introduction to linguistics. London: Routledge.
Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clustering techniques. In KDD Workshop on Text Mining (Vol. 400, pp. 525–526).
Sun, J. (2012). Jieba, Chinese word segmentation tool.
United Nations language careers. (2017, May 15). Retrieved from https://languagecareers.un.org/dgacm/Langs.nsf/home.xsp
Versteegh, K. (2014). The Arabic language. Edinburgh: Edinburgh University Press.
Wade, T. (2010). A comprehensive Russian grammar. New York: John Wiley & Sons.
Whorf, B. L. (1956). Language, thought and reality. Cambridge, MA: MIT Press.
Wolff, P., & Holmes, K. J. (2011). Linguistic relativity. Wiley Interdisciplinary Reviews: Cognitive Science, 2(3), 253–265.
Zhen, D. (2000). Xiandai hanyu cidian [Modern Chinese dictionary]. Beijing: Commercial Press.
Ziemski, M., Junczys-Dowmunt, M., & Pouliquen, B. (2016). The United Nations Parallel Corpus v1.0. In Proceedings of the 10th International Conference on Language Resources and Evaluation (pp. 23–28).

Appendix A

Python code used for performing the described analyses.

# Libraries --------------------------------------------------------------------
from gensim import models, corpora
import jieba
import matplotlib.pyplot as plt
import numpy as np
import os
import re
import scipy as sp
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from xml.etree import ElementTree

# Parameters --------------------------------------------------------------------
# Languages and years to process
languages = ["english", "french", "spanish", "chinese", "arabic", "russian"]
years = [str(year) for year in range(1990, 2015)]

# Set the minimum number of documents a word has to be in, and the number of
# top n most frequent words to remove
no_below = 5
remove_n = 50

# Set the number of topics and passes
num_topics = [100]
passes = 20

# Functions ---------------------------------------------------------------------
def add_newlines(file):
    '''
    Adds a newline after every comma for Google Translate.
    '''
    f = open(file, "r", encoding="utf-8")
    doc = f.read()
    doc = re.sub(",", ",\n", doc)
    return doc


def clean_text(doc, language):
    '''
    Takes the unprocessed text of a document and strips it of its
    punctuation. Returns the document cleaned and split on each word.
    '''
    if language in ["english", "french", "spanish"]:
        doc = doc.lower()
        # Basic Latin + Latin-1 Supplement without punctuation and numerals
        doc = re.sub(r"[^\u0041-\u005A\u0061-\u007A\u00C0-\u00FF]", " ", doc)
        doc = re.sub(r"(\\n)", " ", doc)
        doc = re.sub(r"\s+", " ", doc)
        doc = doc.split()
        doc = [word for word in doc if len(word) > 1]
    elif language == "arabic":
        # Arabic subset without numerals and punctuation marks
        doc = re.sub(r"[^\u0621-\u065F]", " ", doc)
        doc = re.sub(r"(\\n)", " ", doc)
        doc = re.sub(r"\s+", " ", doc)
        doc = doc.split()
        doc = [word for word in doc if len(word) > 1]
    elif language == "chinese":
        # CJK Unified Ideographs subset
        doc = re.sub(r"[^\u4E00-\u9FFF]", " ", doc)
        doc = re.sub(r"\s+", " ", doc)
        doc = jieba.lcut(doc)
    elif language == "russian":
        # Cyrillic subset without punctuation
        doc = re.sub(r"[^\u0400-\u047F]", " ", doc)
        doc = doc.lower()
        doc = re.sub(r"(\\n)", " ", doc)
        doc = re.sub(r"\s+", " ", doc)
        doc = doc.split()
        doc = [word for word in doc if len(word) > 1]
    return doc


def cluster_output(folder):
    '''
    Performs the hierarchical cluster analysis.
    '''
    languages = ["english", "french", "spanish", "chinese", "arabic",
                 "russian"]
    cv = CountVectorizer()
    newarray = []
    for i, language in enumerate(languages):
        x = load_output("{}/{}.txt".format(folder, language))
        x_new = ""
        for word in x:
            x_new += (word + " ")
        newarray.insert(i, x_new)
    cv.fit(newarray)
    newarray = cv.transform(newarray).toarray()
    model = sp.cluster.hierarchy.linkage
    newarray = model(newarray, method="complete", metric="cosine")
    return newarray


def compare_output(folder):
    '''
    Generates a table of cosine similarities between every file that the
    provided folder contains.
    '''
    languages = ["english", "french", "spanish", "chinese", "arabic",
                 "russian"]
    cv = CountVectorizer()
    table = {}
    for i in languages:
        table[i] = {}
        for j in languages:
            x = load_output("{}/{}.txt".format(folder, i))
            y = load_output("{}/{}.txt".format(folder, j))
            x_new = ""
            y_new = ""
            for word in x:
                x_new += word
                x_new += " "
            for word in y:
                y_new += word
                y_new += " "
            x = cv.fit_transform((x_new, y_new)).toarray()
            vec1 = x[0].reshape(1, -1)
            vec2 = x[1].reshape(1, -1)
            table[i][j] = cosine_similarity(vec1, vec2)
    return table


def find_matching_files():
    '''
    Returns a dictionary containing the filepaths for all files that are
    available in all six languages.
    '''
    print("Finding matching files ...")
    languages = ["English", "Arabic", "Chinese", "Russian", "Spanish",
                 "French"]
    years = [str(year) for year in range(1990, 2015)]
    filepaths = {}
    for language in languages:
        filepaths[language] = {}
    for language in languages:
        for year in years:
            filepaths[language][year] = get_filepaths(
                "UN/{}/{}".format(language, year))
            filepaths[language][year] = [file[file.index('\\') + 1:]
                                         for file in filepaths[language][year]]
    matching_files = {}
    for year in years:
        set1 = set(filepaths["English"][year])
        set2 = set(filepaths["Arabic"][year])
        set3 = set(filepaths["Chinese"][year])
        set4 = set(filepaths["Russian"][year])
        set5 = set(filepaths["Spanish"][year])
        set6 = set(filepaths["French"][year])
        set1.intersection_update(set2, set3, set4, set5, set6)
        matching_files[year] = list(set1)
    return matching_files


def get_filepaths(directory):
    '''
    Searches a directory and its subfolders for all .xml files they contain.
    '''
    filepaths = []
    for root, directory, files in os.walk(directory):
        for file in files:
            path = os.path.join(root, file)
            if path.endswith(".xml"):
                filepaths.append(path)
    return filepaths


def load_files(year, language, matching_files):
    '''
    Takes a list of .xml files and returns a corpus that is both parsed and
    cleaned.
    '''
    files_list = matching_files[year]
    files_list = ["UN/{}/{}/{}".format(language, year, file)
                  for file in files_list]
    files = []
    for file in files_list:
        f = open(file, "r", encoding="utf8")
        xml = f.read()
        files.append(clean_text(parse_xml(xml), language))
        f.close()
    files = np.array(files)
    return files


def load_output(file):
    '''
    Reads the encoded output file and returns it as a split list.
    '''
    f = open(file, "r", encoding="utf-8")
    output = f.read()
    output = re.sub(r"\$", "dollar", output)  # replace literal dollar signs
    output = re.sub("[^\u0041-\u005A\u0061-\u007A\u00C0-\u00FF ,\n]", "",
                    output)
    output = re.sub("\n", ",", output)
    output = output.lower()
    output = re.sub("the ", "", output)
    output = re.sub("a ", "", output)
    output = re.sub(r"\s+", " ", output)
    output = output.split(",")
    output = [word for word in output if word not in ["", " "]]
    alist = []
    for word in output:
        word = word.split()
        word = max(word)
        alist.append(word)
    f.close()
    return alist


def lda(languages, years, num_topics, passes):
    '''
    Performs the Latent Dirichlet Allocation algorithm and saves both the
    output and the model to a file.
    '''
    for language in languages:
        for n in num_topics:
            dictionary = corpora.Dictionary()
            dictionary = dictionary.load(
                "temp/dictionary/{}".format(language))
            model = models.LdaModel(num_topics=n, update_every=1,
                                    id2word=dictionary)
            for i in range(1, passes + 1):
                for year in years:
                    corpus = np.load(
                        "temp/step2/{}_{}.npy".format(language, year))
                    if len(corpus) == 1:
                        corpus = reshape_corpus(corpus)
                    model.update(corpus)
                    print("Pass {} completed for {}, {}, {} topics".format(
                        i, language, year, n))
            model.save("temp/models/{}_{}".format(language, n))
            write_output("output/{}_topics/{}.txt".format(n, language),
                         model, language)


def parse_xml(file):
    '''
    Searches an .xml file for its body content and returns it as plain text.
    '''
    body = ""
    try:
        cue = ElementTree.fromstring(file).find("text/body")
        if cue:
            for line in cue.itertext():
                body += repr(line)
    except ElementTree.ParseError:
        print("ParseError")
    return body


def preprocess(languages, years, no_below, remove_n):
    '''
    Parses each XML file, cleans the text, constructs a dictionary, converts
    the text into a term frequency array and saves it as a numpy array.
    '''
    matching_files = find_matching_files()
    for language in languages:
        dictionary = corpora.Dictionary()
        for year in years:
            corpus = load_files(year, language, matching_files)
            dictionary.add_documents(corpus)
            corpus = np.array(corpus)
            np.save("temp/step1/{}_{}.npy".format(language, year), corpus)
            print("Step 1 completed for {}, {}".format(language, year))
        dictionary.filter_extremes(no_below=no_below)
        dictionary.filter_n_most_frequent(remove_n=remove_n)
        for year in years:
            corpus = np.load("temp/step1/{}_{}.npy".format(language, year))
            corpus = [dictionary.doc2bow(doc) for doc in corpus]
            np.save("temp/step2/{}_{}.npy".format(language, year), corpus)
            print("Step 2 completed for {}, {}".format(language, year))
        dictionary.save("temp/dictionary/{}".format(language))


def reshape_corpus(corpus):
    '''
    Shapes the corpus the correct way, even if only one file is present.
    '''
    new_corpus = []
    corpus = np.squeeze(corpus)
    for j in corpus:
        new_corpus.append(tuple(j))
    new_corpus = [new_corpus, ]
    return new_corpus


def write_output(name, model, language):
    '''
    Writes the topics of the trained model to a comma-separated file.
    '''
    topics = model.print_topics(-1)
    topics = [x[1] for x in topics]
    topics = [x.split(" + ") if "+" in x else x for x in topics]
    if language in ["english", "french", "spanish"]:
        topics = [[re.sub("[^\u0041-\u005A\u0061-\u007A\u00C0-\u00FF]", "", x)
                   for x in y] for y in topics]
    elif language == "arabic":
        topics = [[re.sub("[^\u0621-\u064A]", "", x) for x in y]
                  for y in topics]
    elif language == "russian":
        topics = [[re.sub("[^\u0400-\u045F]", "", x) for x in y]
                  for y in topics]
    elif language == "chinese":
        topics = [[re.sub("[^\u4E00-\u9FFF]", "", x) for x in y]
                  for y in topics]
    output = ""
    for topic in topics:
        for word in topic:
            output += word
            output += ","
    f = open(name, "wb")
    f.write(output.encode("utf-8"))
    f.close()

# Main ---------------------------------------------------------------------------
# Perform the pre-processing
preprocess(languages, years, no_below, remove_n)

# Studies 1-4 (the number of topics is passed as a list, and the Study 2
# subset is the single year 2014)
lda(languages, years, [10], 20)
lda(languages, ["2014"], [10], 20)
lda(languages, years, [50], 20)
lda(languages, years, [100], 20)

# Generate the cosine tables
cosine_table_1 = compare_output("output/Analysis 1/translated")
cosine_table_2 = compare_output("output/Analysis 2/translated")
cosine_table_3 = compare_output("output/Analysis 3/translated")
cosine_table_4 = compare_output("output/Analysis 4/translated")

# Generate the cluster analysis output
cluster_table_1 = cluster_output("output/Analysis 1/translated")
cluster_table_2 = cluster_output("output/Analysis 2/translated")
cluster_table_3 = cluster_output("output/Analysis 3/translated")
cluster_table_4 = cluster_output("output/Analysis 4/translated")