Topic Modeling on a Classical Swedish Text Corpus of Prose Fiction
Topic modeling on a classical Swedish text corpus of prose fiction
Hyperparameters’ effect on theme composition and identification of writing style

By Catharina Apelthun
Department of Statistics
Uppsala University
Supervisor: Mattias Nordin
2021

Abstract

A topic modeling method, smoothed Latent Dirichlet Allocation (LDA), is applied to a text corpus of classical Swedish prose fiction. The thesis consists of two parts. In the first part, a smoothed LDA model is applied to the corpus, investigating how changes in hyperparameter values affect the topics in terms of the distribution of words within topics and of topics within novels. In the second part, two smoothed LDA models are applied to a reduced corpus consisting only of adjectives. The generated topics are examined to see whether they are more likely to occur in texts by a particular author and whether the model could be used for identification of writing style. With this new approach, the ability of the smoothed LDA model as a writing style identifier is explored. While the texts analyzed in this thesis are unusually long, as they are unsegmented prose fiction, the effects of the hyperparameters on model performance were found to be similar to those found in previous research. For the adjectives corpus, the models did succeed in generating topics with a higher probability of occurring in novels by the same author. The smoothed LDA was shown to be a good model for identification of writing style.

Keywords: Topic modeling, Smoothed Latent Dirichlet Allocation, Gibbs sampling, MCMC, Bayesian statistics, Swedish prose fiction.

Contents

1 Introduction
2 Data
  2.1 Pre-processing
3 Methodology
  3.1 Hyperparameter Values
  3.2 Data Corpus of Adjectives
4 Results
  4.1 Model’s Sensitivity for α
  4.2 Model’s Sensitivity for β
  4.3 Identifying Writing Style
5 Discussion

1 Introduction

Researchers and professionals who need to process large quantities of text often use data-driven methods such as machine learning and natural language processing. These methods are often iterative processes, where the model is created by repeatedly going over the data looking for certain patterns. They can be used to capture latent semantic structures in large text corpora. For such text mining, a frequently applied method is topic modeling, which is used for identification of abstract topics in a collection of documents.

Traditionally in the literary research field, themes in prose fiction have been defined by qualitative methods based on close reading, with the researcher personally reading the text. Such research has therefore been limited to a small quantity of texts, given the amount of time available in a research project. Statistical methods offer an opportunity to examine general themes in big text data, in amounts of text which are too large for a single researcher to cover.

The literary field is dominated by qualitative methods and values the fine craft of close reading, where the text is often analysed through the lens of different theories from the humanities. Handing the research question to a computer to answer is unthinkable for a traditional literary researcher, but it is shown in [3] that data-driven methods and close reading can be combined when both are applied to the texts. Data-driven methods are struggling to find their place in humanities research, but seem likely to become important in the future [6].

In this thesis I apply a topic modeling method called smoothed Latent Dirichlet Allocation (LDA) to discover themes in literature. It is a Bayesian text mining method used to reveal topics, which can be interpreted as themes, in a collection of documents, a so-called text corpus. It offers a less time-consuming process for theme identification in a great number of texts simultaneously.
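As a rough illustration of what a smoothed LDA model assumes about a corpus, the generative process can be simulated in a few lines. This is a minimal sketch and not code from the thesis. It assumes the common convention in which α is a symmetric Dirichlet parameter for the topic proportions of each document and β is a symmetric Dirichlet parameter for the word distribution of each topic. The function names, corpus sizes and parameter values are made up for illustration.

```python
import random


def draw_dirichlet(rng, dim, concentration):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    gammas = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(gammas)
    return [g / total for g in gammas]


def simulate_smoothed_lda(n_docs, doc_len, n_topics, vocab_size, alpha, beta, seed=0):
    """Simulate a toy corpus of word ids (hypothetical helper, illustration only)."""
    rng = random.Random(seed)
    # One word distribution per topic: phi_k ~ Dirichlet(beta).
    phi = [draw_dirichlet(rng, vocab_size, beta) for _ in range(n_topics)]
    # One topic distribution per document: theta_d ~ Dirichlet(alpha).
    theta = [draw_dirichlet(rng, n_topics, alpha) for _ in range(n_docs)]
    docs = []
    for d in range(n_docs):
        words = []
        for _ in range(doc_len):
            # Pick a topic for this token according to theta_d ...
            z = rng.choices(range(n_topics), weights=theta[d])[0]
            # ... then pick a word id according to that topic's phi.
            w = rng.choices(range(vocab_size), weights=phi[z])[0]
            words.append(w)
        docs.append(words)
    return theta, phi, docs


# Made-up sizes and hyperparameter values, chosen only to keep the example small.
theta, phi, docs = simulate_smoothed_lda(
    n_docs=5, doc_len=200, n_topics=3, vocab_size=50, alpha=0.5, beta=0.1
)
print([round(p, 2) for p in theta[0]])
```

Under these assumptions, smaller values of α tend to concentrate each document on fewer topics, and smaller values of β tend to concentrate each topic on fewer words. Fitting the model reverses this process: the topic and word distributions are estimated from the observed texts.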
Topics generated by the model are briefly accounted for, but the focus of this thesis is to investigate how the topics, i.e. the themes, are affected when the values of the hyperparameters in the model change. The behaviour of the topic clusters is mapped as the hyperparameters are varied on this text corpus, which is of a kind rather uncommon in topic modeling.

Topic modeling is a widely used method in text analysis, but not primarily on literary texts. The method was not developed for longer and abstract texts, but mainly for tweets, abstracts and articles, which can be expected to contain a clear thematic content. When used on literary text corpora it has often served as a distant reading tool, analysing the texts with quantitative, computational, large-scale data methods, in contrast to the qualitative close reading methods traditional in literary analysis. It has also been used for macroanalysis, i.e. large-scale computational analysis, of authorial writing styles, differences in nationality based on language patterns, and theme identification [13, 14]. Another area of use has been as an enhanced search tool [21].

The topic modeling approach has in this field mainly covered English literature, such as 19th century prose corpora [21, 13], poetry [18] and contemporary American bestsellers [1]. So far, topic modeling has only been used twice on Swedish literature: in a master thesis on the popularity of audiobooks [2] and in the topic modeling study mentioned previously, which compares the analysis of gender themes in two corpora, a contemporary bestseller corpus and a classics corpus [7]. The latter has provided data for this paper.

In the second part of this thesis, a smoothed LDA model is used to investigate whether this topic modeling method is useful for recognizing different writing styles, by examining to what extent it separates authors’ texts from each other.
Limiting the analysis to adjectives, I test whether the model can create topics such that each author's texts are grouped into specific clusters. If such a pattern can be identified, the LDA model has been shown to be successful in separating texts by different authors based on their adjective usage.

The analysis will be performed on a classical Swedish literary corpus consisting of novels and short story collections with modernized spelling. In [7] an LDA model is applied to two different text corpora, of which one is almost identical to the one used in this thesis. The methodological approach is different: in [7] the classical text corpus is compared with a modern bestseller corpus in terms of thematic content and its relation to the gender of characters and authors. Even though partly the same data are used, the essays differ in purpose. In my thesis, the focus is (1) statistical, through the investigation of the hyperparameters' impact on the model output, and (2) a possible new application area for the method, by applying the model to a corpus of adjectives to find style markers for authors. In [7] the purpose is literary, analysing theme and gender.

The method I use in this paper includes hyperparameters which are to be determined beforehand. How the properties of the topics are affected by the values set for these parameters has mainly been investigated for other types of text corpora, usually with documents containing fewer words, e.g. tweets, abstracts or articles. Therefore the patterns identified in previous research are not necessarily valid for this text corpus.

The model will be generated on the full novels or short story collections. Topic modeling methods, being more frequently applied to corpora containing shorter text documents, are not necessarily well suited for a corpus with longer text documents like novels and short story collections.
Often, when applied to such corpora, the texts are chunked into shorter segments, for instance in [7]. I have chosen to generate the model on an unsegmented text corpus of novels and short story collections, risking less precise themes [14]. The advantage of this approach is that the model can be more easily applied to a text corpus without the time-consuming pre-processing needed to segment the texts in the data. Theoretically, the themes are expected to be less precise due to the non-segmentation, which will be investigated.

When a topic modeling method is applied to a literary corpus, the word classes included in the analysis are typically limited to nouns, or nouns and verbs [14, 7]. The semantic properties of these word classes, being carriers of the story or plot, make them primarily relevant for identification of themes. Adjectives are sometimes included together with nouns, or nouns and verbs, in the text corpus used for topic modeling with the purpose of theme identification [13]. In this paper, a topic modeling method will be applied to the text corpus of verbs and nouns, but also to a text corpus consisting solely of adjectives. The aim of the topic model for the adjectives corpus is to investigate its ability to create clusters of adjectives which can be mapped to certain authors in order to identify writing styles. This is, to my knowledge, the first time an analysis as described above has been performed only on adjectives.

To summarize, I make two contributions to the literature. First, I apply a smoothed LDA model to a classical Swedish text corpus of prose fiction and investigate the results for different values of the hyperparameters. Second, I examine the possibility of using a smoothed LDA topic model for identification of writing style based on an adjectives text corpus.

2 Data

The data are a text corpus consisting of 118 Swedish novels or collections of short stories published between 1821 and 1941.
The corpus has been compiled from the Swedish online collection Litteraturbanken¹ (LB), which contains Swedish literature from the 19th century and the first decades of the 20th century considered historically and aesthetically important, among them several "classics" of Swedish literature.