
Topic modeling on a classical Swedish text corpus of prose fiction

Hyperparameters' effect on theme composition and identification of writing style

By Catharina Apelthun

Department of Statistics

Supervisor: Mattias Nordin

2021

Abstract

A topic modeling method, smoothed Latent Dirichlet Allocation (LDA), is applied to a text corpus of classical Swedish prose fiction. The thesis consists of two parts. In the first part, a smoothed LDA model is applied to the corpus, investigating how changes in hyperparameter values affect the topics in terms of the distribution of words within topics and of topics within documents. In the second part, two smoothed LDA models are applied to a reduced corpus consisting only of adjectives. The generated topics are examined to see if they are more likely to occur in texts by a particular author and if the model could be used for identification of writing style. With this new approach, the ability of the smoothed LDA model as a writing style identifier is explored. While the texts analyzed in this thesis are unusually long, as they are unsegmented prose fiction, the effect of the hyperparameters on model performance was found to be similar to that found in previous research. For the adjectives corpus, the models did succeed in generating topics with a higher probability of occurring in novels by the same author. The smoothed LDA was shown to be a good model for identification of writing style.

Keywords: Topic modeling, Smoothed Latent Dirichlet Allocation, Gibbs sampling, MCMC, Bayesian statistics, Swedish prose fiction.

Contents

1 Introduction
2 Data
   2.1 Pre-processing
3 Methodology
   3.1 Hyperparameter Values
   3.2 Data Corpus of Adjectives
4 Results
   4.1 Model's Sensitivity for α
   4.2 Model's Sensitivity for β
   4.3 Identifying Writing Style
5 Discussion

1 Introduction

Researchers and professionals who need to process large quantities of text often use data-driven methods such as machine learning and natural language processing. These methods are often iterative processes, where the model is created by repeatedly going over the data looking for certain patterns. They can be used to uncover latent semantic structures in large text corpora. For such text mining, a frequently applied method is topic modeling, which is used for identification of abstract topics in a collection of documents. Traditionally in the literary research field, themes in prose fiction have been defined by qualitative methods based on close reading, with the researcher personally reading the text. The analysis has therefore been limited to a small quantity of texts, given the amount of time available in a research project. Statistical methods offer an opportunity to examine general themes in big text data, in amounts of text which are too large for a single researcher to cover. The literary field is dominated by qualitative methods, valuing the careful craft of close reading, where the text is often analysed through the lens of different humanities theories. Handing the research question to a computer to answer is unthinkable for a traditional literary researcher, but it is shown in [3] that data-driven methods and close reading of the texts can be combined. Data-driven methods are struggling to find their place in humanities research, but seem likely to become important in the future [6].

In this thesis I apply a topic modeling method called smoothed Latent Dirichlet Allocation (LDA) to discover themes in literature. It is a Bayesian text mining method used to reveal topics, which may be interpreted as themes, in a collection of documents, a so-called text corpus. It offers a less time-consuming process for theme identification in a great number of texts simultaneously. Topics generated by the model are briefly accounted for, but the focus of this thesis is to investigate how the topics, i.e. the themes, are affected when the values of the hyperparameters in the model are changed. The behaviour of the topic clusters is mapped by regulating the hyperparameters on this, for topic modeling purposes, rather uncommon text corpus.

Topic modeling is a widely used method in text analysis, but primarily not on literary texts. The topic modeling method was not developed for longer and more abstract texts, but mainly for tweets, abstracts and articles, which can be expected to have a clear thematic content. When used on literary text corpora it has often been used as a distant reading tool, analysing the texts with quantitative, computational, large-scale data methods, in contrast to the traditional qualitative close reading methods of literary analysis. It has also been used for macroanalysis, large-scale computational study, of authorial writing styles, differences in nationality based on language patterns, and theme identification [13, 14]. Another area of usage has been as an enhanced search tool [21]. In this field, the topic modeling approach has mainly covered English literature, such as a 19th century prose corpus [21, 13], poetry [18] and contemporary American bestsellers [1]. So far, the topic modeling method has only been used twice on Swedish texts: a master's thesis on the popularity of audiobooks [2] and the topic modeling study previously mentioned, comparing gendered theme analysis on two corpora, a contemporary bestseller corpus and a classics corpus [7]. The latter has provided the data for this paper.

In the second part of this thesis, a smoothed LDA model is used to investigate whether it is a topic modeling method useful for recognition of different writing styles, by examining to what extent it separates authors' texts from each other. Limiting the analysis to adjectives, I test whether the model can create clusters where authors are grouped into specific clusters. If such a pattern can be identified, the LDA model has been shown to be successful in separating texts by different authors based on their adjective usage. The analysis will be performed on a classical Swedish literary corpus consisting of novels and short story collections with modernized spelling. In [7] an LDA model is applied to two different text corpora, one of which is almost identical to the one used in this thesis. The methodological approach is different: in [7] the classical text corpus is compared with a modern bestseller corpus in terms of thematic content and its relation to the gender of characters and authors. Even though partly the same data are used, the two works differ in purpose. In my thesis, the focus is (1) statistical, through the investigation of the hyperparameters' impact on the model output, and (2) a possible new application area for the method, by applying the model to a corpus of adjectives to find style markers for authors. In [7] the purpose is literary, analysing theme and gender.

The method I use in this paper includes hyperparameters which have to be determined beforehand. Previous results regarding how the properties of the topics are affected by the values set for these parameters have mainly been investigated for other types of text corpora, usually with documents containing fewer words, e.g. tweets, abstracts or articles. Therefore the patterns identified in previous research are not necessarily valid for this text corpus.

The model will be generated on the full novels or short story collections. Topic modeling methods, being more frequently applied to corpora containing shorter text documents, are not necessarily well suited for a corpus with longer text documents such as novels and short story collections. Often, when applied to such corpora, the texts are chunked into shorter segments, for instance in [7]. I have chosen to generate the model on an unsegmented text corpus of novels and short story collections, at the risk of less precise themes [14]. The advantage of this approach is that the model can be more easily applied to a text corpus without the time-consuming pre-processing required for segmentation of the texts in the data. Theoretically, the themes are expected to suffer from being less precise due to the non-segmentation, which will be investigated.

When a topic modeling method is applied to a literary corpus, the word classes included in the analysis are typically limited to nouns, or nouns and verbs [14, 7]. These word classes' semantic properties, being carriers of the story or plot, make them primarily relevant for identification of themes. Adjectives are sometimes included together with nouns, or nouns and verbs, in text corpora used for topic modeling with the purpose of theme identification [13]. In this paper, a topic modeling method will be applied to the text corpus of verbs and nouns, but also to a text corpus consisting solely of adjectives. The aim of the adjective-corpus topic model is to investigate its ability to create clusters of adjectives which can be mapped to certain authors in order to identify writing styles. This is, to my knowledge, the first time an analysis as described above has been performed only on adjectives.

To summarize, I make two contributions to the literature. First, I apply a smoothed LDA model to a classical Swedish text corpus of prose fiction and investigate the result for different values of the hyperparameters. Second, I examine the possibility of using a smoothed LDA topic model for identification of writing style based on a corpus of adjectives.

2 Data

The data are a text corpus consisting of 118 Swedish novels or collections of short stories published between 1821 and 1941. The corpus has been composed from the Swedish online collection Litteraturbanken¹ (LB), which contains Swedish literature from the 19th century and the first decades of the 20th century considered historically and aesthetically important, among them several "classics" of Swedish literature. Duplicates have been removed, as well as works without modernized (post-1906) spelling.

Before using the data in the analysis, the texts are processed with the Stagger part-of-speech (PoS) tagger for Swedish [16]. The PoS tagger identifies the grammatical form of each word and thereby distinguishes, for example, between words with the same spelling but different meaning, like train (verb) and train (noun). The words are then lemmatized, meaning that they are reduced to their base form by investigating the word together with its PoS tag and context. For example, a word in the third person will be changed into the first person and a verb in the past tense into the present. So-called lexical terms are created by combining the PoS tag and the lemma, grouping all inflected forms as one term. The classical text corpus has been obtained from [7]. It is almost identical to the classical corpus used in their analysis, reduced by three documents due to copyright reasons. The corpus was received in PoS-tagged and lemmatized format.
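To make the lexical-term construction concrete, the following toy sketch (not from the thesis; the object, column names and example words are invented) shows how a lemma and its PoS tag can be combined so that all inflected forms of a word count as one term:

```r
# Toy illustration of lexical terms: lemma and PoS tag are combined into one term,
# so that inflected forms such as "talade"/"talar" map to the same unit.
tokens <- data.frame(word  = c("talade", "talar", "tåget"),
                     lemma = c("tala",   "tala",  "tåg"),
                     pos   = c("VB",     "VB",    "NN"))
tokens$term <- paste(tokens$lemma, tokens$pos, sep = "_")   # "tala_VB", "tala_VB", "tåg_NN"
table(tokens$term)
```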

2.1 Pre-processing

Before generating the model, the text corpus will be tidied. The number of unique words in the corpus is reduced by only including the base form of each word. Counting only the base form captures the actual presence, in terms of frequency, of a certain term, which could otherwise be lost or misjudged due to variation in grammatical form. Only a subset of the full corpus is used for the analysis. To start with, the 100 most common words are excluded. These are typically known as "stopwords" and include words such as jag (I), i (in), på (on), att (to), tala (speak) and man (one). The stopwords are expected to be too influential in the analysis and are not likely to contribute to the thematic understanding of the corpus if they occur recurrently in different themes. Rare terms are also excluded: only words that are present in the corpus at least ten times, and in at least ten different novels or short story collections, are included. This data curation approach is common and is described in [15]. For the analysis of the behaviour of the hyperparameters, the final subset includes only verbs and nouns, the bearers of semantic meaning in a text. Another subset containing only adjectives is also created, to examine whether the topics of the model can be connected to an author.

¹ www.litteraturbanken.se
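A minimal sketch of the term filtering described above, assuming the corpus is available as a long-format table of token occurrences (the object and column names below are hypothetical):

```r
# 'tokens' is assumed to be a data frame with one row per token occurrence:
#   doc_id = novel/short story collection, term = lexical term (lemma + PoS tag).
term_totals <- table(tokens$term)                                   # corpus frequency per term
stopwords   <- names(sort(term_totals, decreasing = TRUE))[1:100]   # the 100 most common terms

# Document frequency: in how many documents does each term occur?
doc_freq <- tapply(tokens$doc_id, tokens$term, function(d) length(unique(d)))

keep <- setdiff(
  names(term_totals)[term_totals >= 10 & doc_freq[names(term_totals)] >= 10],
  stopwords
)
tokens_filtered <- tokens[tokens$term %in% keep, ]
```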

3 Methodology

The method used in this paper is smoothed Latent Dirichlet Allocation (LDA), a topic modeling method for unsupervised classification of text data. The corpus data, a collection of novels and short story collections, can be divided into documents (each novel or short story collection) and terms (the words). In topic modeling, the terms, structured into documents, are the only observed variables, and the aim is to find natural groups or clusters for these terms. The model produces clusters, so-called (latent) topics, based on the terms appearing within them. The topics might then be interpreted as themes in the text corpus. LDA is the most commonly used topic modeling method, introduced in [5]. The method belongs to a family of mixed membership models where data are decomposed into multiple latent components [4]. The latent components of the model are the distribution of words within a topic, φ, the distribution of topics within a document, θ, and the words' topic assignments, z. LDA is a bag-of-words model, meaning that each document is treated as a bag of words, counting the frequencies. The words are exchangeable within the documents and the order or sequence of words does not matter. The words are arranged into a document-term matrix (DTM), a common format for representing a text corpus in text mining for bag-of-words models. In a DTM the rows represent the documents, the columns the words, and the cells the number of times a certain word occurs in a specific document (see Figure 2).

The idea behind LDA is the assumption that a document d is constructed by generating a number of topics from a predefined set of T topics, and that for each of these topics a number of words are assigned, with different probabilities, from a given vocabulary of size V. The notation of the LDA model is specified in Table 1. Each topic has a distribution over words, φ_t, and even though each word w_id is assigned to a specific topic, the word type v ∈ V can be shared between topics. The topics can belong to multiple documents simultaneously and each document has a distribution over topics, θ_d.

Table 1: Notation for the Latent Dirichlet Allocation model

Symbol      Description
α           Dirichlet prior hyperparameter for Θ
β           Dirichlet prior hyperparameter for Φ
D           Number of documents
T           Number of topics
N           Number of words in the corpus
N_d         Number of words in document d
V           Vocabulary size
w           Words in the corpus, N × 1
w_d         Words in document d, N_d × 1
w_id        Word i in document d
z           Topic assignments, N × 1
z_id        Topic assignment for word w_id
n_t^(v)     Number of topic indicators for topic t for word w_i, V × 1
n_t^(d)     Number of topic indicators for document d for topic t, D × 1
Φ           The word distributions for each topic, V × T
φ_t         Probabilities of words given topic t, V × 1
Θ           The topic distribution of a document, T × D
θ_d         Probabilities of topics given document d, T × 1

A multinomial classification consists of n repeated trials. For each trial there is a discrete number, two or more, of possible events. The probability of a specific event in a given trial is constant and the trials are independent of each other; one event outcome does not affect another. The multinomial distribution is the probability distribution for the outcomes of such an experiment. Applied to this model, the multinomial classification corresponds to the word and topic assignments, drawn from the multinomial distribution. This is explained in Step 3 of the generative process below. In Bayesian statistics, if the prior and the posterior distribution belong to the same distribution family, they are called conjugate distributions. The Dirichlet distribution is the conjugate prior of the multinomial, which is why the Dirichlet is an adequate distribution for the priors on φ and θ. The (prior) probability density function of the Dirichlet distribution has the form:

\[
p(\theta_1, \theta_2, \cdots, \theta_T \mid \alpha_1, \alpha_2, \cdots, \alpha_T) = \frac{1}{B(\alpha)} \prod_{t=1}^{T} \theta_t^{\alpha_t - 1} \tag{1}
\]

where

\[
B(\alpha) = \frac{\prod_{t=1}^{T} \Gamma(\alpha_t)}{\Gamma\!\left(\sum_{t=1}^{T} \alpha_t\right)} \tag{2}
\]

where θ_t ∈ (0, 1), ∑_{t=1}^{T} θ_t = 1 and Γ(x) is the gamma function.

The proportions of the document-topic distribution, θ_d, and the topic-term distribution, φ_t, are governed by Dirichlet distributions with predefined hyperparameters α and β respectively. θ_d is a Dirichlet random variable of T dimensions and lies in a (T − 1)-simplex, given that θ_t ∈ (0, 1) and ∑_{t=1}^{T} θ_t = 1. This is one of the advantages of a Dirichlet distribution: it is a probability distribution, meaning that the probabilities are identical to the proportions of topics within a document or of words within a topic. The shape of the Dirichlet distribution over the simplex changes with the value of the hyperparameter: the probabilities are more evenly distributed for hyperparameter values above 1, and the distribution goes towards a unimodal shape with decreasing hyperparameter value. In this paper, the hyperparameters α and β (see Table 1) are single scalar values, meaning that α = (α_1 = α_2 = ⋯ = α_T) and β = (β_1 = β_2 = ⋯ = β_V). This is a special case of the Dirichlet distribution where the priors are said to be symmetric. It can be used when there is no prior knowledge favoring certain components. When (α_1, α_2, ⋯, α_T) = (1, 1, ⋯, 1) the symmetric Dirichlet distribution turns into a uniform distribution. Using symmetric Dirichlet distributions in the models, the prior can be simplified:

\[
p(\theta_1, \theta_2, \cdots, \theta_T \mid \alpha) = \frac{\Gamma(\alpha T)}{\Gamma(\alpha)^{T}} \prod_{t=1}^{T} \theta_t^{\alpha - 1} \tag{3}
\]
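The effect of the symmetric concentration parameter can be illustrated with a small simulation (not part of the thesis); Dirichlet draws are generated from independent Gamma variables and normalized:

```r
# Draw n samples from a symmetric Dirichlet distribution with K components.
rdirichlet_sym <- function(n, K, alpha) {
  x <- matrix(rgamma(n * K, shape = alpha), nrow = n)
  x / rowSums(x)
}

set.seed(1)
round(rdirichlet_sym(1, K = 5, alpha = 50), 2)    # alpha > 1: proportions close to uniform
round(rdirichlet_sym(1, K = 5, alpha = 0.1), 2)   # alpha << 1: mass concentrates on few components
```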

The probability density function of the multinomial distribution is of the following form:

\[
p(n_1^{(d)}, n_2^{(d)}, \cdots, n_T^{(d)} \mid \theta_1, \theta_2, \cdots, \theta_T) = \frac{n^{(d)}!}{n_1^{(d)}! \cdots n_T^{(d)}!} \prod_{t=1}^{T} \theta_t^{n_t^{(d)}} \tag{4}
\]

where n^(d) = ∑_{t=1}^{T} n_t^(d) and ∑_{t=1}^{T} θ_t = 1. Together with the conjugate Dirichlet prior, the multinomial forms the posterior distribution of the following form:

\[
p(\theta_1, \theta_2, \cdots, \theta_T \mid n) = \frac{1}{B(\alpha + n)} \prod_{t=1}^{T} \theta_t^{n_t^{(d)} + \alpha_t - 1} \tag{5}
\]

where n = (n_1^(d), n_2^(d), ⋯, n_T^(d)). The impact of the Dirichlet priors in the smoothed LDA model has been investigated and documented in several papers. The α value determines the amount of smoothing of the topic-document distribution, θ [20]. Higher values of α have been shown to smooth the topic-document distributions and make the topics within a document more evenly spread, while lower values of α favour few topics. β can be interpreted as the observation count for the number of times words are generated from a topic before any words from the corpus are observed [20]. A low β forces each topic to favour few words, while higher β values allow the opposite, resulting in less distinct topics, according to previous studies [14, 9]. The smoothed LDA model is a generative model. It is estimated over a certain number of iterations, where in each iteration the model is updated with the information from the previous one. Before generating the model, the total number of topics, T, needs to be fixed. Given the distributions described above, the LDA model process consists of the following steps:

The generative process of LDA:

Step 1: The term distribution, φ_t, is determined for each topic t = 1, …, T:

    φ_t ∼ Dirichlet(β)

Step 2: The proportions, θ_d, of the topic distribution for document d are determined by:

    θ_d ∼ Dirichlet(α)

Step 3: For each token i:

    Choose a topic assignment z_id ∼ Multinomial(θ_d)
    Choose a word w_id ∼ Multinomial(φ_{z_id})
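The steps above can be made concrete with a toy simulation (purely illustrative; the sizes and seed are arbitrary assumptions, not values from the thesis):

```r
set.seed(42)
T_topics <- 3; V <- 8; D <- 2; N_d <- 20       # topics, vocabulary size, documents, tokens per document
alpha <- 2.5; beta <- 0.01
rdirichlet <- function(k, conc) { x <- rgamma(k, shape = conc); x / sum(x) }

phi   <- replicate(T_topics, rdirichlet(V, beta))    # Step 1: V x T word distributions, one column per topic
theta <- replicate(D, rdirichlet(T_topics, alpha))   # Step 2: T x D topic proportions, one column per document

docs <- lapply(1:D, function(d) {
  z <- sample(1:T_topics, N_d, replace = TRUE, prob = theta[, d])  # Step 3: topic assignment per token
  w <- sapply(z, function(t) sample(1:V, 1, prob = phi[, t]))      #         word drawn from the assigned topic
  data.frame(doc = d, topic = z, word = w)
})
head(docs[[1]])
```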

To generate the model, the posterior distribution of the latent variables, given the observed data, needs to be found. The posterior distribution of the latent variables can be computed from the following equation:

\[
p(z, \Phi, \Theta \mid w, \alpha, \beta) = \frac{p(z, w, \Phi, \Theta \mid \alpha, \beta)}{p(w \mid \alpha, \beta)} \tag{6}
\]

The distribution is intractable and therefore an approximation is needed for estimation.

The estimation method used in this thesis is Gibbs sampling, a Markov chain Monte Carlo (MCMC) method that has been shown to be relatively efficient [20]. The idea behind Gibbs sampling is to reduce the high-dimensional distribution by sampling from a conditional distribution with all other variables in the data fixed at their current values. The Gibbs sampling algorithm starts by randomly assigning, for each document separately, the words to one of the T topics. Then a vector of topic proportions for a given document, θ_d, is drawn from Dirichlet(α). The next step is to assume that the word w_i in document d is assigned to the incorrect topic, but that all other words in document d are assigned to the correct topic, and to reassign the word w_i to a topic based on the probability that topic t generated word w_i.

The algorithm can be further simplified to a collapsed Gibbs sampling process by integrating out the Dirichlet priors and simply sampling z_i [8]. The probability that topic t is assigned to word w_i, given all other topic assignments to all other words, is the posterior that needs to be computed for this estimation method:

\[
p(z_i = t \mid z_{-i}, w, \alpha, \beta) \tag{7}
\]

where z_i = t is the assignment of word w_i to topic t, z_{-i} refers to the topic assignments of the remaining words, and w denotes all the words in the corpus. By applying the rule of conditional probability:

\[
p(z_i = t \mid z_{-i}, w, \alpha, \beta) = \frac{p(z_i = t, z_{-i}, w \mid \alpha, \beta)}{p(z_{-i}, w \mid \alpha, \beta)} \propto p(z_i = t, z_{-i}, w \mid \alpha, \beta) = p(z, w \mid \alpha, \beta) \tag{8}
\]

Then we have:

\[
p(z, w \mid \alpha, \beta) = \iint p(z, w, \theta, \phi \mid \alpha, \beta) \, d\theta \, d\phi \tag{9}
\]

The equation can be grouped as two terms:

\[
p(z, w \mid \alpha, \beta) = \int p(z \mid \theta)\, p(\theta \mid \alpha)\, d\theta \int p(w \mid \phi)\, p(\phi \mid \beta)\, d\phi \tag{10}
\]

These two terms are of the same construction: a multinomial distribution with a Dirichlet prior. The expanded joint distribution, p(z, w | α, β), is, as shown in Equation 8, proportional to p(z_i = t | z_{-i}, w). The computational formulation of the distribution is derived in detail in [10, 8]. The probability of topic t being assigned to word w_i, given all other topic assignments to all other words, is estimated with the following probability distribution:

\[
p(z_i = t \mid z_{-i}, w) \propto \frac{n_{i,t}^{(v)} + \beta}{n_{-i,t}^{(\cdot)} + V\beta} \cdot \frac{n_{i,t}^{(d_i)} + \alpha}{n_{-i,\cdot}^{(d_i)} + T\alpha} \tag{11}
\]

where n_{i,t}^(v) is the count of words assigned to topic t for the word w_i, the index v indicating that w_i is equal to the v-th term in the vocabulary, and n_{-i,t}^(·) is the count of words assigned to topic t excluding the chosen word w_i, where the dot (·) denotes summation over this index. n_{i,t}^(d_i) is the number of times topic t is assigned to a word w_i in document d (d_i denotes the document to which word w_i belongs), and n_{-i,·}^(d_i) is the number of times topic t is assigned to a word in document d, not including word w_i. Equation 11 consists of two ratios: the first is the probability of word w_i given topic t and the second the probability of topic t given document d [10]. The sampling process is done for all words in one document at a time and repeated a predetermined number of times. Griffiths and Steyvers (2004) show that the log-likelihood estimates of the latent variables stabilize after a few hundred iterations [10]. The number of iterations is set to 1000 for all models in this paper.

In Figure 1, the graphical formulation of the model is presented, as first introduced in [5]. The shaded node, w, represents the only observed variable in the model, the words. φ, θ and z are each sets of latent variables, and α and β are the hyperparameters of the Dirichlet priors for θ and φ respectively. Conditional dependencies are represented by arrows in the figure. The boxes are the sampling steps, where the number of samples is given by the index in the lower right corner, as defined in Table 1.

Figure 1: Graphical LDA model
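Returning to Equation 11, the collapsed Gibbs update can be sketched in a few lines of R. This is a minimal illustration of the update rule, not the implementation used in the thesis (which relies on the topicmodels package); 'w' and 'd' are assumed integer vectors giving each token's word type and document:

```r
# Minimal collapsed Gibbs sampler for LDA (illustrative sketch).
gibbs_lda <- function(w, d, K, V, D, alpha, beta, iter = 1000) {
  z   <- sample(1:K, length(w), replace = TRUE)          # random initial topic assignments
  nwt <- matrix(0, V, K)                                 # word-topic counts
  ndt <- matrix(0, D, K)                                 # document-topic counts
  for (i in seq_along(w)) {
    nwt[w[i], z[i]] <- nwt[w[i], z[i]] + 1
    ndt[d[i], z[i]] <- ndt[d[i], z[i]] + 1
  }
  for (it in 1:iter) {
    for (i in seq_along(w)) {
      topic <- z[i]                                      # remove token i from the counts
      nwt[w[i], topic] <- nwt[w[i], topic] - 1
      ndt[d[i], topic] <- ndt[d[i], topic] - 1
      p <- (nwt[w[i], ] + beta) / (colSums(nwt) + V * beta) *
           (ndt[d[i], ] + alpha)                         # Equation 11; the document denominator is constant in t
      topic <- sample(1:K, 1, prob = p)                  # reassign token i
      z[i] <- topic
      nwt[w[i], topic] <- nwt[w[i], topic] + 1
      ndt[d[i], topic] <- ndt[d[i], topic] + 1
    }
  }
  phi   <- t(t(nwt + beta) / (colSums(nwt) + V * beta))  # V x K word-topic probabilities
  theta <- (ndt + alpha) / (rowSums(ndt) + K * alpha)    # D x K document-topic probabilities
  list(z = z, phi = phi, theta = theta)
}
```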

A topic model can also be interpreted as a matrix factorization [12]. For visualization purposes, the LDA model in matrix factorization form is presented in Figure 2, originating from [20]. The DTM can be decomposed into the topic-word distribution, Φ, and the document-topic distribution, Θ.

Figure 2: A matrix factorization of the LDA model

The analysis is performed using the LDA function in the topicmodels package [11] in R [17].²
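As a sketch of how such a model is estimated with topicmodels (the function, method and control names below exist in the package; the document-term matrix object 'dtm' and the seed are assumptions for illustration, and note that the package calls the β hyperparameter 'delta'):

```r
library(topicmodels)

lda_fit <- LDA(dtm, k = 20, method = "Gibbs",
               control = list(alpha = 50/20, delta = 0.01, iter = 1000, seed = 2021))

terms(lda_fit, 10)                    # ten most likely terms per topic (cf. Table 3)
theta <- posterior(lda_fit)$topics    # document-topic probabilities, D x T
phi   <- posterior(lda_fit)$terms     # topic-word probabilities, T x V
```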

3.1 Hyperparameter Values

To begin with, the model will be generated with the default options for the hyperparameter values, α = 50/T and β = 0.01, where T is the number of topics. T is changed in increments of five until a majority of the topics can be defined, based on a qualitative evaluation, as themes. In previous research these hyperparameter values have been found to work well for several different text collections [20], although they have not been used with large text chunks. In this paper, eleven different values of α = (40, 35, 30, 25, 20, 15, 10, 5, 2.5, 1, 0.1) are tested for the LDA model with β = 0.01 and T = 20. Since α is the hyperparameter of the Dirichlet prior of the topic distribution for a given document, the evaluation and investigation of the results will mainly cover the probabilities of topics given documents. The results will be compared to the expected behaviour, known from previous studies of the parameter's effect.

² R version 3.5.2, GUI 1.70 El Capitan build

For the hyperparameter β of the Dirichlet prior of the word-topic distribution, φ_t, the following values are tested: β = (10, 5, 1, 0.1, 0.01, 0.001, 0.0001). The values are tested for an LDA model with α = 2.5 and T = 20. Given the formulation of the model, the β value affects the distribution of words within a topic; therefore the evaluation will primarily focus on changes in φ and compare them with the expected results from previous research and the model formulation.

3.2 Data Corpus of Adjectives

Adjectives are a type of words that typically do not contribute to moving the plot forward, even though a considerable usage of them could affect the pace and rhythm of a novel. Instead, they determine the environment, experiences and atmosphere in a text. Therefore they are in this thesis regarded as literary style markers or genre markers and used for identification of certain writing tones. The adjective corpus is investigated to determine whether the smoothed LDA topic model could be used for identification and separation of authors' personal writing styles. This is done by examining the probability of the authors' texts appearing in the same clusters.

Two smoothed LDA models will be generated, with 10 and 20 topics respectively. The model with 20 topics has been shown to be successful for theme identification, and the one with 10 topics was chosen to detect broader stylistic features in case the 20-topic model fails to identify existing patterns. The hyperparameter values are decided partly based on the results of the hyperparameter evaluation, but also by the motivation of the default options in the R package used, topicmodels. The default options are recommended by Steyvers and Griffiths [20] based on previous research on different types of text corpora. For the model with 10 topics, α = 50/T = 50/10 = 5 and β = 0.01. The model with 20 topics has α = 50/20 = 2.5 and β = 0.01.

To evaluate the result, the topic with the highest probability of belonging to a given document, θ_d(1), is listed for all documents and the variation within authors is investigated. In this part of the analysis we concentrate on the authors who have contributed at least five documents to the corpus. This is done to guarantee sufficient cluster sizes. Of the 35 authors in the corpus, six will be considered more closely (Table 2), although all authors and their documents will be included in the generating of the models.

Table 2: Authors with five or more novels/short story collections in the corpus

Author              Number of documents
August Strindberg   23
Selma Lagerlöf      16
Hjalmar Bergman     14
Karin Boye          8
Dan Andersson       5
Maria Sandel        5
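A sketch of this evaluation, assuming 'theta' is the D × T document-topic matrix from the fitted model and 'authors' is a vector with the author of each document in the same row order (both hypothetical object names):

```r
primary_topic <- apply(theta, 1, which.max)            # most probable topic per document

by_author  <- table(authors, primary_topic)            # documents per (author, primary topic)
share_same <- apply(by_author, 1, function(x) max(x) / sum(x))  # share with the same top topic (cf. Table 9)
round(sort(share_same, decreasing = TRUE), 3)
```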

4 Results

The number of topics has been tested in increments of five until the topics were possible to identify as themes. With the default hyperparameter values of the model, α = 50/T (T being the number of topics) and β = 0.01, an LDA model with 20 topics produced identifiable themes among the topics. The topics in this model are presented in Table 3. The listed terms are the ones with the highest probability of belonging to each topic. The results in Table 3 are discussed in more detail in sections 4.1 and 4.2.

Table 3: Ten most likely words in each topic for model with T = 20

Topic 1 φw,z Topic 2 φw,z Topic 3 φw,z konung (king) NN 0.0177 herr (mister) NN 0.0203 förstå (understand) VB 0.0136 folk (people) NN 0.0081 kapten (captain) NN 0.0198 tycka (think) VB 0.0133 kung (monarch) NN 0.0075 karl (man) NN 0.0088 svara (answer/respond) VB 0.0131 son (son) NN 0.0065 doktor (doctor) NN 0.0064 måste (must) VB 0.0141 svara (answer/respond) VB 0.0063 folk (people) NN 0.0055 öga (eye) NN 0.0120 tid (time) NN 0.0058 mumla (mumble) VB 0.0054 hålla (hold) VB 0.0113 land (land/country) NN 0.0058 svara (answer/respond) VB 0.0054 börja (begin) VB 0.0107 herre (lord) NN 0.0057 huvud (head) NN 0.0053 tid (time) NN 0.0106 häst (horse) NN 0.0053 fortfara (continue) VB 0.0052 sätt (manner) NN 0.0100 ligga (lie) VB 0.0049 yttra (speak/utter) VB 0.0051 fråga (ask) VB 0.0097

Topic 4 φw,z Topic 5 φw,z Topic 6 φw,z älska (love) VB 0.0188 flicka (girl) NN 0.0147 magister (male teacher) NN 0.0121 låta (let/allow) VB 0.0178 fru (Mrs./wife) NN 0.0133 lägga (put) VB 0.0092 kärlek (love) NN 0.0177 far (father) NN 0.0083 ligga (lie) VB 0.0084 måste (must) VB 0.0146 människa (human) NN 0.0072 flicka (girl) NN 0.0083 hjärta (heart) NN 0.0122 pojke (boy) NN 0.0067 båt (boat) NN 0.0073 kvinna (woman) NN 0.0122 finna (find) VB 0.0061 sätta (put) VB 0.0071 människa (human) NN 0.0119 dam (lady) NN 0.0060 äta (eat) VB 0.0066 liv (life) NN 0.0115 utbrista (exclaim) VB 0.0057 synas (appear/be seen) VB 0.0064 själ (soul) NN 0.0112 son (son) NN 0.0056 gumma (old woman) NN 0.0056 brev (letter) NN 0.0101 mamma (mom) NN 0.0056 ge (give) VB 0.0053

Topic 7 φw,z Topic 8 φw,z Topic 9 φw,z doktor (doctor) NN 0.0140 liv (life) NN 0.0146 herr (mister) NN 0.0258 kommissarie (commissar) NN 0.0089 människa (human) NN 0.0092 öga (eye) NN 0.0112 natt (night) NN 0.0084 ligga (lie) VB 0.0080 fröken (Miss) NN 0.0106 klocka (clock) NN 0.0078 blick (gaze) NN 0.0072 herre (lord) NN 0.0092 brev (letter) NN 0.0073 tanke (thought) NN 0.0068 svara (answer/respond) VB 0.0077 dörr (door) NN 0.0070 ansikte (face) NN 0.0064 par (couple) NN 0.0073 svara (answer/respond) VB 0.0066 år (year) NN 0.0062 dam (lady) NN 0.0072 bil (car) NN 0.0066 själ (soul) NN 0.0060 vän (friend) NN 0.0062 papper (paper) NN 0.0064 känsla (feeling) NN 0.0059 fiol (violin) NN 0.0056 finna (find) VB 0.0064 väg (way) NN 0.0060 låta (let/allow) VB 0.0055

Topic 10 φw,z Topic 11 φw,z Topic 12 φw,z tycka (be of opinion) VB 0.0295 människa (human) NN 0.0167 ligga (lie) VB 0.0103 pappa (dad) NN 0.0202 finnas (be/exist) VB 0.0133 natt (night) NN 0.0089 mamma (mum) NN 0.0198 måste (must) VB 0.0113 tid (time) NN 0.0074 moster (aunt) NN 0.0136 liv (life) NN 0.0108 ansikte (face) NN 0.0073 löjtnant (lieutenant) NN 0.0125 karl (man) NN 0.0094 äta (eat) VB 0.0072 morbror (uncle) NN 0.0120 värld (world) NN 0.0084 väg (way) NN 0.0072 läsa (read) VB 0.0101 ord (word) NN 0.0078 år (year) NN 0.0071 fru (Mrs./wife) NN 0.0101 arbetare (worker) NN 0.0076 sätta (put) VB 0.0069 barn (child) NN 0.0094 arbete (work) NN 0.0068 börja (begin) VB 0.0069 bruka (farm) VB 0.0090 leva (live) VB 0.0064 slå (beat) VB 0.0068

Topic 13 φw,z Topic 14 φw,z Topic 15 φw,z präst (priest) NN 0.0213 år (year) NN 0.0121 mor (mother) NN 0.0213 kyrka (church) NN 0.0197 skriva (write) VB 0.0098 barn (child) NN 0.0142 år (year) NN 0.0124 tid (time) NN 0.0093 flicka (girl) NN 0.0093 tid (time) NN 0.0095 läsa (read) VB 0.0091 mamma (mum) NN 0.0083 folk (people) NN 0.0085 liv (life) NN 0.0074 fru (Mrs./wife) NN 0.0081 fara (go) VB 0.0077 finnas (be/exist) VB 0.0066 öga (eye) NN 0.0067 församling (congregation) NN 0.0066 bok (book) NN 0.0063 morsa (mum) NN 0.0065 väg (way) NN 0.0064 samhälle (society) NN 0.0061 blick (gaze) NN 0.0065 läsa (read) VB 0.0064 skola (school) NN 0.0056 arbete (work) NN 0.0064 stad (town) NN 0.0060 anse (think/consider) VB 0.0055 par (couple) NN 0.0062 *) NN= noun, VB=verb

Table 3 (continued): Ten most likely words in each topic for model with T = 20

Topic 16 φw,z Topic 17 φw,z Topic 18 φw,z öga (eye) NN 0.0138 pojke (boy) NN 0.0290 barn (child) NN 0.016 ligga (lie) VB 0.0109 ligga (lie) VB 0.0151 kvinna (woman) NN 0.0104 skog (forest) NN 0.0100 finnas (be/exist) VB 0.0110 hustru (wife) NN 0.0095 huvud (head) NN 0.0090 land (land) NN 0.0097 börja (begin) VB 0.0084 börja (begin) VB 0.0071 människa (human) NN 0.0094 vän (friend) NN 0.0080 vatten (water) NN 0.0069 börja (begin) VB 0.0090 svara (answer/respond) VB 0.0069 sol (sun) NN 0.0066 skog (forest) NN 0.0089 liv (life) NN 0.0069 väg (way) NN 0.0062 tycka (be of opinion) VB 0.0082 människa (human) NN 0.0062 berg (mountain) NN 0.0056 fara (go) VB 0.0079 äta (eat) VB 0.0062 barn (child) NN 0.0056 gård (courtyard) NN 0.0068 rum (room) NN 0.0062

Topic 19 φw,z Topic 20 φw,z mor (mother) NN 0.0151 herr (mister) NN 0.0332 präst (priest) NN 0.0109 fru (Mrs./wife) NN 0.0204 far (father) NN 0.0106 mor (mother) NN 0.0166 hustru (wife) NN 0.0100 far (father) NN 0.0123 väg (way) NN 0.0090 farmor (grandmother) NN 0.0111 fara (go) VB 0.0083 barn (child) NN 0.0096 ligga (lie) VB 0.0082 hus (house) NN 0.0094 ge (give) VB 0.0077 herre (lord) NN 0.0092 år (year) NN 0.0075 gård (courtyard) NN 0.0090 finnas (be/exist) VB 0.0075 år (year) NN 0.0089 *) NN= noun, VB=verb

4.1 Model’s Sensitivity for α

For the LDA model with 20 topics and a fixed β value of 0.01, eleven different values of α = (40, 35, 30, 25, 20, 15, 10, 5, 2.5, 1, 0.1) were tested and evaluated. The values of α have been arbitrarily chosen, and the default option in the topicmodels package is 50/T (50/20 = 2.5). Large values of α are expected to produce models in which each document involves several topics, while when α → 0 each document will involve a single topic [9]. Since the probabilities θ_d sum to one, this is investigated by comparing, across the different values of α, the number of topics per document with a probability of belonging to that document greater than an arbitrary cut-off, indicating an evident influence of the topic. In this paper, I define a topic as influential for a given document if its probability θ_d is at least 0.1 or at least 0.2. The number of topics per document fulfilling these cut-off values, summed over all documents, is presented in Table 4. The number of influential topics increases with decreasing α. For higher values of α the topics seem to be more evenly spread over a document.

17 Table 4: No. of topics with a higher probability of belonging to a certain document

θ_d        α = 40   35    30    25    20    15    10    5     2.5   1     0.1
≥ 0.1      179      199   192   201   221   234   252   298   308   341   340
≥ 0.2      93       95    98    98    116   118   119   141   160   173   188
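The counts in Table 4 can be reproduced with a short sketch, assuming 'theta_list' is a (hypothetical) named list holding the D × T document-topic matrix for each tested α value:

```r
count_influential <- function(theta, cutoff) sum(theta >= cutoff)   # topics over the cut-off, summed over documents

sapply(theta_list, count_influential, cutoff = 0.1)   # first row of Table 4
sapply(theta_list, count_influential, cutoff = 0.2)   # second row of Table 4
```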

To evaluate the difference in the contribution a single topic has to a document for different values of α, the single topic with the highest probability of belonging to each document is plotted for all α values (Figure 3). The probability of the primary topic for each document generally increases with decreasing α.

Figure 3: Primary topic for each document, all different α values

To further investigate the relation between topics and documents for the different values of α, the mean of the primary topic probabilities, ∑_{d=1}^{D} θ_d(1)/D, is computed for all values of α (Table 5). For lower α values, a single topic's probability of belonging to a document tends to increase [11]. This pattern is reflected in this literary corpus, where the mean of the documents' highest topic probabilities increases as the value of α decreases: ∑_{d=1}^{D} θ_d(1)/D = 0.314 for α = 40 and ∑_{d=1}^{D} θ_d(1)/D = 0.390 for α = 0.1.

18 Table 5: Frequency of influential terms for different values of α

α value                        40     35     30     25     20     15     10     5      2.5    1      0.1
No. of unique terms*           99     102    97     100    101    97     98     102    102    102    100
Tot. share of unique terms*    0.495  0.510  0.485  0.500  0.505  0.485  0.490  0.510  0.510  0.510  0.500
∑ θ_d(1)/D                     0.314  0.308  0.334  0.341  0.350  0.360  0.376  0.366  0.381  0.375  0.390

*) Among the ten terms with highest probability of belonging to each topic

For the eleven different values of α tested, the frequency of unique terms among the ten with the highest probability of belonging to a certain topic is roughly constant (Table 5). When investigating the topics for each of the eleven versions of the model, the topics that appear are very similar. The α value does not primarily affect the topics' content, which it should not do given the formulation of the model. This result might indicate that the themes within the different novels do not vary much, since when the probability of one topic belonging to a certain document increases, the topic does not change substantially. In Table 3 the topics for the model with 20 topics, β = 0.01 and α = 2.5 are presented. Among the topics most clearly identifiable as themes are domination and status markers (Topic 1), love and relations (Topic 4), church and religion (Topic 13), education (Topic 14) and family and home (Topic 20). These themes recur in the models for all α values. There are also several other topics recurring in all model variations, but they are vaguer in terms of theme identification.

For this prose fiction text corpus, the α value affects the distribution of topics within a document as expected from previous studies performed on other types of text corpora. Lower values of α result in a larger number of topics with high probabilities of belonging to a given document, θ_d. The topics are more evenly distributed for higher values of α, and the topics become more influential for a given document with decreasing α. The decision on what is a reasonable value for α depends on the purpose of the topic modeling. Higher values of α will generate a more evenly spread topic-document distribution, and with lower α certain topics are more likely to occur in certain documents. Given the context of this model, an α value between 2.5 and 10 is preferable. One should expect the documents, the novels and short story collections, to differ in thematic content, which is why too high an α value cannot be motivated. By choosing too low an α value, the natural overlap of themes might be overlooked.

19 4.2 Model’s Sensitivity for β

Seven different values of β = (10, 5, 1, 0.1, 0.01, 0.001, 0.0001) were tested and evaluated for the LDA model with 20 topics and a fixed α value of 2.5. The values of β are arbitrarily chosen, and the default option in the topicmodels package is 0.01. The model is evaluated for changes in β by comparing the distribution of, and changes in, the ten terms with the highest probability of belonging to each topic.

To investigate the distribution of the word probabilities (φ_w,t) for the different models, some thresholds were set (see Table 6). The number of words with a probability over a given threshold increases with decreasing β. For higher β values the words are more evenly spread within the topics, resulting in less distinct topics; the topics are generally not identifiable as themes when considering the ten terms with the highest probability of belonging to each topic. With decreasing β values, the number of words with a high probability of belonging to a topic increases, and a larger number of influential terms generates more specific topics. This pattern is expected from the properties of the Dirichlet distribution: when β decreases, the distribution of word-topic probabilities changes towards a unimodal shape.

Table 6: Word count for probability thresholds of φt for different values of β

φ_t        β = 10   5     1     0.1   0.01   0.001   0.0001
≥ 0.005    4        16    117   347   584    767     858
≥ 0.010    0        2     23    65    133    208     270
≥ 0.015    0        0     7     21    39     80      109
≥ 0.020    0        0     2     9     17     43      57

Using the top ten terms for topic identification, these are further investigated (see Table 7). For larger values of β, the total number of unique words occurring among the ten most likely terms of each topic increases, and the overlap of words between topics decreases (Table 7). The same pattern holds for very low values of β = (0.001, 0.0001). When β is small, the topics tend not to have evenly spread probability vectors over the terms, but rather a smaller number of terms that are more influential [9]. This pattern seems to be present in this corpus, up to a certain point where the number of unique top ten terms increases again for very small β values (see Table 7). The change in the word probability distribution within the topics regulates the overlap of words between topics; therefore the same words may appear among the top ten terms of different topics.

Table 7: Frequency of influential terms for different values of β

β value                        10     5      1      0.1    0.01   0.001   0.0001
No. of unique terms*           199    195    146    102    105    129     138
Total share of unique terms*   0.995  0.975  0.730  0.510  0.525  0.645   0.690

*) The ten terms with highest probability of belonging to each topic

For the β values resulting in a lower number of unique terms among the ten terms with the highest probability of belonging to a topic, β = 0.1 and β = 0.01 (Table 7), the theme identification of the topics is more distinct. For larger values of β the topics are generally not identifiable as themes when considering the ten terms with the highest probability of belonging to each topic. In Table 3, the ten terms with the highest probability of belonging to each topic for the model with β = 0.01 are presented. In the table there are several topics that can be identified as themes, some of which are accounted for in the previous section concerning the hyperparameter α. Other themes in the model, not mentioned above, are for instance Topic 16, which seems to be mainly about nature, Topic 11 about existential matters, life and word, and Topic 7 about the handling of a crime or accident during the night by contacting the police or a doctor. The top ten terms for all 20 topics of the model with β = 10 are presented in Table 11 in the Appendix. Thematically, only a few of those topics are possible to define by the ten terms with the highest probability of belonging to each topic.

The value of β, being the hyperparameter of the Dirichlet prior for the word-topic distribution, Φ, has been shown to affect the number of unique terms among the ten most likely terms for each topic (Table 7). For this text corpus, a minimum in the number of unique top ten terms is identified for β values around 0.1 and 0.01. Furthermore, the word-topic distribution changes as expected when using the Dirichlet distribution with different values of the hyperparameter β. Aiming for topics which can be defined as themes, a lower β value of 0.1 or 0.01 is preferable. When β takes values beyond a certain level, the overlap of the top ten words between topics increases and the themes become marginally more vague. Hence, there is no obvious gain in using β values below 0.01 in this context. These results are in line with previous research on corpora with shorter documents, or on segmented data.

4.3 Identifying Writing Style

Topic modeling is usually used for identifying themes in a text corpus, and for that purpose nouns, or nouns and verbs, are often included. In this section the text corpus consists solely of adjectives, to investigate whether the topics the LDA model creates for such a corpus can be used for identification of writing style. The performance of the model is evaluated by investigating to what extent the topics separate authors' documents from each other. If a topic has a large share in the documents written by a certain author, a writing style might be considered identified.

Two topic models, Model 1 and Model 2, with 10 and 20 topics respectively, are run on the adjective corpus. Lower α values favour topics belonging to certain documents (see Section 4.1) and low β values generate more specific topics (see Section 4.2). To generate specific topics, limiting the overlap of words between topics, a β value of 0.01 is chosen for both models. The α value is set to α = 5 for the model with 10 topics and α = 2.5 for the model with 20 topics, which are the default values in the topicmodels package. These values are considered rather low and are expected to generate higher probabilities for some topics to belong to certain documents.

There are no obvious patterns within the topics produced by the models. In Table 8 the ten terms with the highest probability of belonging to each topic are presented. The topics overlap to some extent in terms of words appearing among the ten most likely words in each topic. Even though the topics are not easily defined by a name based on the ten most likely terms, they serve their purpose quite well in collecting authors' documents under the same topics (Table 9) and thereby identify some kind of writing style.

22 Table 8: Topics for adjective corpus model with 10 topics

Topic 1 φw,t Topic 2 φw,t Topic 3 φw,t Topic 4 φw,t ny (new) 0.0433 egen (own) 0.0330 kär (dear) 0.0316 sådan (such) 0.0400 svensk (swedish) 0.0402 enda (only) 0.0251 ung (young) 0.0297 lång (long) 0.0336 mången (many) 0.0296 ny (new) 0.0228 båda (both) 0.0242 ung (young) 0.0253 sist (last) 0.0252 lång(long) 0.0148 viss (certain) 0.0207 hög (high) 0.0244 egen (own) 0.0243 ensam (lonely) 0.0139 stackars (poor/pitiful) 0.0189 mången (many) 0.0233 helig (holy) 0.0230 djup (deep) 0.0135 vacker (beautiful) 0.0168 död (dead) 0.0209 mycken (much) 0.0186 stark (strong) 0.0133 dum (stupid) 0.0124 fattig (impoverished) 0.0181 ung (young) 0.0173 främmande (foreign) 0.0133 rätt (proper/correct) 0.0120 glad (happy/glad) 0.0179 sådan (such) 0.0152 inre (inner) 0.0128 hög (high) 0.0099 egen (own) 0.0164 ena (one) 0.0141 ung (young) 0.0127 sann (true) 0.0090 stark (strong) 0.0161

Topic 5 φw,t Topic 6 φw,t Topic 7 φw,t Topic 8 φw,t sådan (such) 0.0420 vit (white) 0.0480 vacker (beautiful) 0.0453 ny (new) 0.0332 egen (own) 0.0321 svart (black) 0.0381 mycken (much) 0.0371 ond (evil) 0.0204 mången (many) 0.0254 röd (red) 0.0272 sådan (such) 0.0360 ung (young) 0.0193 ny (new) 0.0250 mörk (dark) 0.0251 glad (happy/glad) 0.0277 hög (high) 0.0176 ond (evil) 0.0247 lång (long) 0.0248 rädd (afraid) 0.0256 ensam (lonely) 0.0168 mycken (much) 0.0184 blå (blue) 0.0237 ond (evil) 0.0248 sådan (such) 0.0166 fast (solid) 0.0173 ung (young) 0.0218 mången (many) 0.0210 dålig (bad) 0.0149 rätt (propoer/correct) 0.0161 hög (high) 0.0196 rolig (funny/amusing) 0.0192 stark (strong) 0.0148 lång (long) 0.0145 tyst (quiet) 0.0177 svår (difficult) 0.0181 skön (comfortable/fair) 0.0110 fattig (impoverished) 0.0129 grön (green) 0.01739 lång (long) 0.0179 vit (white) 0.0102

Topic 9 φw,t Topic 10 φw,t kort (short) 0.0233 sådan (such) 0.0259 ung (young) 0.0176 viss (certain) 0.0218 fin (nice) 0.0135 mycken (much) 0.0212 egen (own) 0.0135 ny (new) 0.0193 nära (near) 0.0126 sist (last) 0.0188 djup (deep) 0.0106 lång (long) 0.0135 lycklig (happy) 0.0105 riktig (proper/correct) 0.0134 lätt (light/easy) 0.0104 säker (sure) 0.0127 kär (dear) 0.0102 lätt (light/easy) 0.0122 svår (difficult) 0.0096 övrig (other) 0.0121

In Table 9 the number of documents sharing the same primary topic, i.e. the topic with the highest probability of occurring in a document, is presented for the six authors with five or more documents in the corpus. For the model with 10 topics, 10 out of a total of 16 documents written by Selma Lagerlöf (62.5%) were assigned to the same primary topic, Topic 4. For 18 out of 23 (78%) documents written by August Strindberg, Topic 8 was the primary topic. The corresponding values for Karin Boye and Hjalmar Bergman are 87.5% and 85.7% respectively for Model 1. For Dan Andersson and Maria Sandel, all documents authored by them had the same primary topic: Topic 5 for Andersson and Topic 9 for Sandel. The model with 20 topics produced similar results (see Table 9 and Table 10).

23 Table 9: Frequency of topic with highest probability of belonging to a given document

Author              Primary topic, Model 1   Primary topic, Model 2   Total no. of documents   Share with same top topic (Model 1 / Model 2)
August Strindberg   Topic 8 (18)             Topic 2 (12)             23                       0.783 / 0.522
Selma Lagerlöf      Topic 4 (10)             Topic 13 (11)            16                       0.625 / 0.688
Hjalmar Bergman     Topic 3 (12)             Topic 4 (9)              14                       0.857 / 0.643
Karin Boye          Topic 2 (7)              Topic 14 (8)             8                        0.875 / 1.0
Dan Andersson       Topic 5 (5)              Topic 5 (5)              5                        1.0 / 1.0
Maria Sandel        Topic 9 (5)              Topic 7 (5)              5                        1.0 / 1.0

There is no overlap in the most common primary topic between the authors, in either Model 1 or Model 2. Some primary topics are, however, shared between different authors' documents in the model with 10 topics (Model 1). The model with 20 topics did not result in any overlap. In Table 10 the primary topics, together with their document frequencies, are presented for all six authors.

Table 10: Frequency of topic with highest probability of belonging to a given document

Author              Model with 10 topics        Model with 20 topics
August Strindberg   1(2), 6(2), 7(1), 8(18)     2(12), 8(4), 19(5), 20(2)
Selma Lagerlöf      4(10), 7(6)                 1(2), 13(11), 16(3)
Hjalmar Bergman     3(12), 5(1), 10(1)          3(1), 4(9), 17(4)
Karin Boye          2(7), 10(1)                 14(8)
Dan Andersson       5(5)                        5(5)
Maria Sandel        9(5)                        7(5)

The first number is the topic number; the count of documents having this topic as their primary topic is given in parentheses.

Although Model 1 separates the topics such that they primarily belong to documents by a certain author, the ten terms with the highest probability of belonging to a certain topic (Table 8) are sometimes shared between authors' topics. Topic 4 (primary topic of Selma Lagerlöf) shares five of its ten most likely terms with Topic 5 (the majority primary topic of Dan Andersson). These authors do not share any other primary topics. Karin Boye's first primary topic for Model 1, Topic 2 (Table 9), shares four of its ten most likely terms with Strindberg's Topic 8 and four with Lagerlöf's Topic 4. Boye does not share any other primary topics with either Strindberg or Lagerlöf, in either Model 1 or Model 2 (Table 9). This result implies that the ten terms most likely to belong to a topic are not necessarily the words that separate the topics from each other.

The smoothed LDA topic models are successful in defining language patterns, in terms of adjective usage, for authors. When estimating LDA models on a corpus of 118 novels or short story collections, the two models assign a majority of each author's works to the same primary topic. For the model with ten topics, over 60% of each investigated author's documents were assigned to the same primary topic. It can be concluded that the smoothed LDA model can be used for identification of certain writing styles, separating authors' texts from each other by topic assignment. The model is a bag-of-words model, where the topics are determined by word frequencies, and the result therefore also indicates that authors use different sets of adjectives.

5 Discussion

The effect of the hyperparameters on the results has shown the expected patterns. A decrease in α caused an increase in the influence, in terms of proportion within a document, of some topics. For β, the expected effect on the term-topic distribution was observed: the number of influential terms increases with decreasing β. When examining the number of unique terms among the ten with the highest probability of belonging to each topic, a similar pattern is observed; the number of unique terms decreases until a minimum is reached, and then increases again. LDA models are often applied to text corpora containing documents with considerably less text, for instance tweets, abstracts or articles, or longer text documents divided into chunks. The topic formulation of the model for very low values of β might have to do with the length of the documents. When the documents are segmented, the size of the chunks is predefined and usually the same. The full documents, however, differ in length, and this, as well as the decision to use an unsegmented text corpus in general, might have affected the topic formulation for very low β values.

The hyperparameters of the Dirichlet distributions used when identifying authors' writing styles were chosen partly based on the results in this paper. In that case, the α value is crucial. How well an author's documents can be assigned to certain topics for lower α values could be investigated further in the future.

Certain authors are highly overrepresented in the corpus, e.g. August Strindberg, Selma Lagerlöf, Hjalmar Bergman and Karin Boye (see Table 2). This could be reflected in the results in terms of topic formulation and, by extension, theme identification. On the other hand, these authors were highly relevant on the literary arena at the time and are still important and relevant today. Even if certain authors have a large influence on the corpus content, in terms of the share of documents written by them, it can be argued that they and their novels and short story collections are the ones that primarily represent the Swedish literature of that time period. For the purpose of evaluating the smoothed LDA model's ability to identify writing style, it is necessary to have authors contributing more than one novel or short story collection. With an extended corpus, where authors to a larger extent contributed multiple documents, the data could have been divided into a training and a test set. That would have made it possible to further evaluate the results and could be an interesting alternative for future research.

The conditions for thematic identification of the groups may have been negatively affected by the decision to use full novels and short story collections instead of segmented texts. The generated topics were typically general and vague, and sometimes ambiguous. This problem is likely related to the loss of information about words co-occurring within the documents. Even though it can be stated that the model is able to identify themes in an unsegmented text corpus, valuable thematic fragments might get lost in the large word frequencies, for instance themes covered by certain parts of a novel. For the adjective corpus the segmentation is probably not as relevant, since it is rather the full adjective vocabulary of an author that is used for identification of a certain writing style.

Acknowledgements

I wish to show my appreciation to Karl Berglund and Mats Dahllöf for sharing the pre-processed and lemmatized classical Swedish corpus of prose fiction with me to use in my thesis. Thank you for your generosity. I would also like to thank my supervisor, associate senior lecturer Mattias Nordin at the Department of Statistics at Uppsala University. Thank you for your optimism and realistic inputs. Lastly, I want to thank Professor Fan Yang Wallentin at the Department of Statistics at Uppsala University for an open office door and open ears.

References

[1] Archer, J. and Jockers, M. [2016], The Bestseller Code: Anatomy of the Blockbuster Novel, St. Martin’s Press.

[2] Barakat, A. [2018], What Makes an (Audio)Book Popular?, Master's thesis, Linköping University, Linköping.

[3] Berglund, K. [2017], 'Killer plotting. Typologisk intriganalys utifrån fjärrläsningar av 113 samtida svenska kriminalromaner', Tidskrift för litteraturvetenskap 47(3).

[4] Blei, D. M. and Lafferty, J. D. [2007], ‘A correlated topic model of science’, The Annals of Applied Statistics 1(1), 17–35.

[5] Blei, D. M., Ng, A. Y. and Jordan, M. I. [2003], ‘Latent Dirichlet Allocation’, Journal of Machine Learning Research 3, 993–1022.

[6] Allington, D., Brouillette, S. and Golumbia, D. [2016], 'Neoliberal tools (and archives): A political history of digital humanities', Los Angeles Review of Books.

[7] Dahllöf, M. and Berglund, K. [2019], Faces, fights, and families: Topic modeling and gendered themes in two corpora of Swedish prose fiction, in 'DHN 2019 Copenhagen, Proceedings of the 4th Conference of the Association Digital Humanities in the Nordic Countries (Copenhagen, March 6-8 2019)'.

[8] Darling, W. M. [2011], A theoretical and practical implementation tutorial on topic modeling and gibbs sampling.

[9] George, C. P. and Doss, H. [2018], ‘Principled Selection of Hyperparameters in the Latent Dirichlet Allocation Model’, Journal of Machine Learning Research 18, 1–38.

[10] Griffiths, T. L. and Steyvers, M. [2004], ‘Finding scientific topics’, Proceedings of the National Academy of Science 101(1), 5228–5235.

[11] Grün, B. and Hornik, K. [2011], 'topicmodels: An R Package for Fitting Topic Models', Journal of Statistical Software 40(13), 1–30.

[12] Hofmann, T. [1999], Probabilistic latent semantic analysis, in ‘Uncertainty in Artificial Intelligence’.

[13] Jockers, M. [2013], Macroanalysis: Digital Methods and Literary History, University of Illinois Press.

[14] Jockers, M. L. and Mimno, D. [2013], ‘Significant themes in 19th-century literature’, Poetics 41(6), 750–769.

[15] Magnusson, M. [2018], Scalable and Efficient Probabilistic Topic Model Inference for Textual Data, Linköping University Electronic Press.

[16] Östling, R. [2013], ‘Stagger: an open-source part of speech tagger for Swedish’, Northern European Journal of Language Technology 3(1), 1–18.

[17] R Core Team [2018], R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/

[18] Rhody, L. M. [2012], ‘Topic Modeling and Figurative Language’, Journal of Digital Humanities 2(1), 19–35.

[19] Rosen-Zvi, M., Griffiths, T., Steyvers, M. and Smyth, P. [2004], The author-topic model for authors and documents, in ‘UAI ’04, Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (Banff, Canada, July 7–11, 2004)’.

[20] Steyvers, M. and Griffiths, T. [2007], Probabilistic topic models, in T. Landauer, D. McNamara, S. Dennis and W. Kintsch, eds, ‘Latent Semantic Analysis: A Road to Meaning’, Lawrence Erlbaum.

[21] Tangherlini, T. R. and Leonard, P. [2013], ‘Trawling in the sea of the great unread: Sub-corpus topic modeling and humanities research’, Poetics 41(6), 725–749.

Appendix

Table 11: Ten most likely words in each topic for model with β = 10

Topic 1 (word / φw,z)
betala (pay) VB 0.00015
ljussken (gleam) NN 0.00015
fjärd (bay) NN 0.00015
Na (?) NN 0.00014
syster (sister) NN 0.00014
färg (color) NN 0.00014
tur (tour/turn) NN 0.00014
tillåta (allow) VB 0.00014
förundra (marvel) VB 0.00014
förlust (loss) NN 0.00014

Topic 2 (word / φw,z)
stirra (stare) VB 0.00015
hopp (hope/jump) NN 0.00015
da (?) NN 0.00015
uppkomst (origin) NN 0.00015
rusta (equip/arm) VB 0.00014
nöje (pleasure/entertainment) NN 0.00014
kreatur (cattle) NN 0.00014
historia (history/story) NN 0.00014
församling (congregation) NN 0.00014
flaxa (flap) VB 0.00014

Topic 3 (word / φw,z)
blandning (mixture) NN 0.00015
svar (answer/response) NN 0.00014
berg (mountain) NN 0.00014
men (harm) NN 0.00014
förvåning (surprise) NN 0.00014
kar (vat) NN 0.00014
badort (seaside resort) NN 0.00014
lärare (teacher) NN 0.00014
referat (report) NN 0.00014
proportion (proportion) NN 0.00014

Topic 4 (word / φw,z)
rubrik (header) NN 0.00016
vetande (knowledge) NN 0.00015
ålder (age) NN 0.00014
omfatta (include) VB 0.00014
brist (lack) NN 0.00014
resignation (resignation) NN 0.00014
hack (notch) NN 0.00014
klaff (flap) NN 0.00014
halsduk (scarf) NN 0.00014
lusta (lust) NN 0.00014

Topic 5 (word / φw,z)
vissna (wither) VB 0.00015
motstånd (resistance) NN 0.00014
förmana (admonish) VB 0.00014
sekund (second) NN 0.00014
rapp (blow/lash) NN 0.00014
duga (suffice) VB 0.00014
trappsteg (stair) NN 0.00014
tomhet (emptiness) NN 0.00014
stövel (boot) NN 0.00014
kåpa (cowl) NN 0.00014

Topic 6 (word / φw,z)
ämbar (pail) NN 0.00015
bestå (consist) VB 0.00015
våld (violence) NN 0.00014
teater (theater/play) NN 0.00014
tidning (newspaper) NN 0.00014
måltid (meal) NN 0.00014
rättighet (right/privilege) NN 0.00014
beskylla (accuse) VB 0.00014
flickunge (girl) NN 0.00014
prick (dot) NN 0.00014

Topic 7 (word / φw,z)
fenomen (phenomenon) NN 0.00016
anstrykning (tinge) NN 0.00016
bord (table) NN 0.00014
kudde (pillow) NN 0.00014
hylla (shelf) NN 0.00014
befalla (command) VB 0.00014
läder (leather) NN 0.00014
krans (wreath) NN 0.00014
uppstå (arise) VB 0.00014
poet (poet) NN 0.00014

Topic 8 (word / φw,z)
yta (surface) NN 0.00015
vidskepelse (superstition) NN 0.00015
rykte (reputation) NN 0.00014
tavla (painting) NN 0.00014
vanställa (blemish) VB 0.00014
anstå (befit) VB 0.00014
avsked (farewell) NN 0.0001
anställa (hire) VB 0.00014
strumpa (stocking) NN 0.00014
uppge (give/declare) VB 0.00014

Topic 9 (word / φw,z)
stycken (pieces) NN 0.00015
lä (leeward) NN 0.00015
slag (kind/hit) NN 0.00014
akta (beware) VB 0.00014
storma (storm) VB 0.00014
spörsmål (question) NN 0.00014
utförande (performance) NN 0.00014
tomrum (void) NN 0.00014
besvärja (adjure) VB 0.00014
dussin (dozen) NN 0.00014

Topic 10 (word / φw,z)
(mustache) NN 0.00015
respektera (respect) VB 0.00015
hinna (reach) VB 0.00014
befallning (command) NN 0.00014
vis (wise) NN 0.00014
sinnesstämning (mood) NN 0.00014
tillstånd (state) NN 0.00014
besked (information/message) NN 0.00014
tidevarv (era) NN 0.00014
porslin (porcelain) NN 0.00014

Topic 11 (word / φw,z)
näsduk (handkerchief) NN 0.00015
stiga (rise) VB 0.00014
närma (approach) VB 0.00014
hämnas (revenge) VB 0.00014
mormor (grandmother) NN 0.00014
omtala (mention) VB 0.00014
roa (amuse) VB 0.00014
skriva (write) VB 0.00014
klinga (ring) VB 0.00014
ideal (ideal) NN 0.00014

Topic 12 (word / φw,z)
tjänsteman (official) NN 0.00016
rova (turnip) NN 0.00015
allmoge (peasantry) NN 0.00015
personal (staff) NN 0.00014
lugna (calm) VB 0.00014
lakan (sheet) NN 0.00014
tilltal (utterance) NN 0.00014
inträffa (occur) VB 0.00014
gräsmatta (lawn) NN 0.00014
onåd (disgrace) NN 0.00014

Topic 13 (word / φw,z)
erövra (conquer) VB 0.00015
kärleksbrev (love letter) NN 0.00015
bemödande (endeavor) NN 0.00015
skärpa (sharpness) NN 0.00015
bord (table) NN 0.00014
ljud (sound) NN 0.00014
bokstav (letter/character) NN 0.00014
förekomma (anticipate) VB 0.00014
kärna (core) NN 0.00014
överdrift (exaggeration) NN 0.00014

Topic 14 (word / φw,z)
förevändning (pretext) NN 0.00015
stackare (wretch) NN 0.00014
tvinga (force) VB 0.00014
omdöme (opinion) NN 0.00014
lättja (laziness) NN 0.0001
estrad (platform) NN 0.00014
narr (fool) NN 0.00014
behärska (master) VB 0.00014
drogos (were drawn) NN 0.00014
presentera (present) VB 0.00014

Topic 15 (word / φw,z)
rike (kingdom) NN 0.00015
gnida (rub) VB 0.00014
augusti (august) NN 0.00014
värk (ache) NN 0.00014
krets (circuit) NN 0.00014
förråda (betray) VB 0.00014
ändra (change) VB 0.00014
flykt (escape) NN 0.00014
innehålla (contain) VB 0.00014
skilsmässa (divorce) NN 0.00014

*) NN = noun, VB = verb


Table 11 (continued): Ten most likely words in each topic for model with β = 10

Topic 16 (word / φw,z)
dikt (poem) NN 0.00015
vidta (bring) VB 0.00015
ras (race) NN 0.00014
ropa (call) VB 0.00014
ursäkt (excuse) NN 0.00014
upplysa (inform) VB 0.00014
greve (count) NN 0.00014
uppfånga (catch) VB 0.00014
belöning (reward) NN 0.00014
civilisation (civilization) NN 0.00014

Topic 17 (word / φw,z)
roman (novel) NN 0.00015
främmande (foreign) NN 0.00014
slippa (avoid) VB 0.00014
nå (reach) VB 0.00014
blod (blood) NN 0.00014
täcke (quilt) NN 0.00014
bilda (form/educate) VB 0.00014
sabel (saber) NN 0.00014
fodra (feed) VB 0.00014
bekostnad (expense) NN 0.00014

Topic 18 (word / φw,z)
beröring (touch) NN 0.00015
förete (exhibit) VB 0.00015
gaslåga (flame) NN 0.00015
äta (eat) VB 0.00014
bild (picture) NN 0.00014
stilla (still) VB 0.00014
bestämma (decide) VB 0.00014
mista (lose) VB 0.00014
tina (thaw) VB 0.00014
andedräkt (breath) NN 0.00014

Topic 19 (word / φw,z)
ligga (lie) VB 0.00545
måste (must) VB 0.00531
ge (give) VB 0.00514
öga (eye) NN 0.00509
börja (begin) VB 0.00476
finnas (be/exist) VB 0.00476
tid (time) NN 0.00470
hålla (hold) VB 0.00469
människa (human) NN 0.00461
låta (let/allow) VB 0.00457

Topic 20 (word / φw,z)
träsk (swamp) NN 0.00015
fel (wrong) NN 0.00014
låsa (lock) VB 0.00014
fördjupning (recess) NN 0.00014
förbud (prohibition) NN 0.00014
örfil (slap) NN 0.00014
fullmåne (full moon) NN 0.00014
timma (hour) NN 0.00014
skog (forest) NN 0.00013
sova (sleep) VB 0.00013

*) NN = noun, VB = verb
