Master's thesis for the attainment of the academic degree of Master of Arts at the Faculty of Arts of the University of Zurich

Topic Modeling and Visualisation of Diachronic Trends in Biomedical Academic Articles

Author: Parijat Ghoshal, Matriculation no.: 09-716-010

Examiner: Prof. Dr. Martin Volk

Advisor: Dr. Fabio Rinaldi

Institut für Computerlinguistik

Submission date: 24.06.2017

Abstract

In the biomedical domain, there is such an abundance of texts that gaining a thematic overview of them is a challenging endeavour. This is partly because many of these texts are unlabelled, and one cannot always assign them to a specific thematic domain. Some texts remain thematically ambiguous, and sorting them neatly into thematic domains is impossible. Thus, it can be helpful to use an unsupervised algorithm to sort a corpus of unlabelled data into topics. In this Master's thesis, latent Dirichlet allocation is used on the corpus to automatically generate topics. Throughout the course of this work, I create topic models based on articles from PubMed Central's Open Access Subset. I then observe diachronic trends in them on three different levels. On the first level, I observe diachronic changes in the popularity of the topics themselves. Then I check how the popularity of the topic words within a topic evolves throughout the corpus. On the third level, I observe the popularity of common words that belong to documents about a certain topic. Moreover, a companion website and a topic modeling pipeline are also created as outputs of this project.

Acknowledgement

I would like to thank Dr. Fabio Rinaldi for his guidance, patience and understanding, and my parents for their unrelenting support. Finally, I would also like to thank Jo, who put up with everything else.

Contents

Abstract
Acknowledgement
Contents
List of Figures
List of Tables
List of Acronyms

1 Introduction
   1.1 Motivation
   1.2 Research Questions
   1.3 Thesis Structure

2 Theoretical background
   2.1 Topic Models
      2.1.1 Precursors of Latent Dirichlet Allocation
      2.1.2 Latent Dirichlet Allocation
   2.2 Machine Learning Problem
   2.3 Issues with Topic Models
      2.3.1 Categories of Bad Topics
         2.3.1.1 General and Specific Words
         2.3.1.2 Mixed and Chained Topics
         2.3.1.3 Identical Topics
         2.3.1.4 Incomplete Stopword List
         2.3.1.5 Nonsensical Topics
      2.3.2 Topic Alignment
      2.3.3 Topic Quality Evaluation
         2.3.3.1 Human Evaluation
         2.3.3.2 Topic Size
         2.3.3.3 Topic Word Length
   2.4 Improving Topic Models
      2.4.1 Automatic Topic Model Labelling
         2.4.1.1 Information Retrieval
         2.4.1.2 Neural Embeddings
      2.4.2 Text Preprocessing to Acquire Meaningful Topics

3 Previous work
   3.1 Biomedical Topic Modeling
      3.1.1 Ontology Term Mapping
      3.1.2 Enriching LDA Output with External Data
      3.1.3 Discover Relationships between Diseases and Genes with Topic Modeling
      3.1.4 Comprehensive Biomedical LDA Topics Example Source
   3.2 Diachronic Topic Modeling
      3.2.1 Early Work in Modern Non-biomedical Domain
      3.2.2 Observe Diachronic Changes and Author Influence in Biomedical Domain
   3.3 Brief Overview of Available Tools
      3.3.1 MAchine Learning for LanguagE Toolkit (MALLET)
      3.3.2 Stanford Topic Modeling Toolbox
      3.3.3 Spark Machine Learning Library (MLlib)
      3.3.4 R Libraries
      3.3.5 Gensim

4 Methodology
   4.1 Source of Data
   4.2 Extracting Topic Models
      4.2.1 Experiment 1: Exploring the Topics in the Corpus
         4.2.1.1 Preprocessing
      4.2.2 LDA Model Creation Parameters
         4.2.2.1 Evaluation: 10 Topics Model
         4.2.2.2 Evaluation: 20 Topics Model
         4.2.2.3 Model Evaluation
         4.2.2.4 Evaluation: 50 Topics Model
         4.2.2.5 Evaluation: 100 Topics Model
         4.2.2.6 Evaluation Experiment 1: All Models
      4.2.3 Experiment 2: Edited Corpus and Modified Model Update Parameters
         4.2.3.1 Preprocessing
         4.2.3.2 Results
      4.2.4 Experiment 3: Online Learning with Different Batch Sizes
         4.2.4.1 Preprocessing
         4.2.4.2 Results
      4.2.5 Experiment 4: Reduced Vocabulary
         4.2.5.1 Preprocessing
         4.2.5.2 10 Topics Model
         4.2.5.3 20 Topics Model
         4.2.5.4 50 Topics Model
         4.2.5.5 100 Topics Model
      4.2.6 Experiment 5: Influence of POS Tags
         4.2.6.1 Noun-Verb Corpus
         4.2.6.2 Noun-Adjective Corpus
         4.2.6.3 Noun-Verb-Adjective Corpus
      4.2.7 Experiment 6: Extracting Models with Distinct Topics Using Topic Similarity
      4.2.8 Experiment 7: Extracting Stable Models
      4.2.9 Topic Labelling

5 Data and Topic Exploration
   5.1 Document Topics Distribution
   5.2 Data Exploration
      5.2.1 Topic Distribution
      5.2.2 Average Topic Probability
   5.3 Observing Diachronic Trends Using Topic Models
   5.4 Topic Exploration
      5.4.1 Frequency of Topic Words in the Corpus
      5.4.2 Diachronic Shifts within a Topic
   5.5 Frequency of Popular Words within a Topic
      5.5.1 Diachronic Popularity of Non-topic Word Related Terms

6 Results and Discussion
   6.1 Research question Nr. 1
   6.2 Research question Nr. 2
   6.3 Research question Nr. 3
   6.4 Summary

7 Website
   7.1 Generating the charts
   7.2 Website sections
      7.2.1 Observing diachronic trends in topics
      7.2.2 Generate frequency of topic words in the corpus
      7.2.3 Frequency of popular words within a topic

8 Diachronic topic modeling pipeline
   8.1 Data extraction
      8.1.1 Extract metadata
      8.1.2 Extract text
         8.1.2.1 Text preprocessing
            8.1.2.1.1 POS tagging of the corpus
         8.1.2.2 Token filtering
      8.1.3 Corpus creation
   8.2 LDA topic modeling
      8.2.1 Dictionary creation
      8.2.2 Editing the original dictionary
      8.2.3 LDA corpus creation
      8.2.4 LDA model creation
   8.3 Data mapping
      8.3.1 Mapping: creating yearly average topic probability
      8.3.2 Mapping: generating relative frequencies for topic words
      8.3.3 Mapping: generating relative frequencies for popular words in topic subcorpora
   8.4 Other functions
   8.5 Summary

9 Conclusion
   9.1 Future Work

References

A Tables

Curriculum Vitae

List of Figures

1.1 Plate notation representing the LDA model (from Blei et al. [2003])
4.1 Number of articles published per year from 1950-2016 in the corpus of 150 thousand articles
4.2 Percentage of articles from the original corpus (1.5 million articles) per year from 1950-2016 that are in the corpus of 150 thousand articles
4.3 Average inter-topic similarity
5.1 Topic words in article (green: topic 34)
5.2 Topic words in article (green: topic 19, yellow: topic 28)
5.3 Topic probability distribution of documents of topic 19
5.4 Topic probability distribution of documents of topic 19, where topic probability >0.1
5.5 Topic probability distribution of documents of topics 6, 10, 39, and 41, where topic probability >0.1
5.6 Topic probability distribution of documents of topic 25, where topic probability >0.1
5.7 Average topic probability of documents from 2000 to 2015 of topics 10, 23, and 33
5.8 Average topic probability of documents from 2000 to 2015 of topics 11, 12, 17, 21, 28, and 43
5.9 Average topic probability of documents from 2000 to 2015 of topics 2, 5, and 25
5.10 Average topic probability of documents from 1980 to 2005 of topic 50
5.11 Average topic probability of documents from 2000 to 2015 of topics 13, 22, 34, and 38
5.12 Relative frequency of topic words for topic 13-woman-heart-pregnancy
5.13 Relative frequency of pregnancy related words from topic 13
5.14 Relative frequency of heart disease related words from topic 13
5.15 Relative frequency of topic words for topic 22-infection-virus-vaccine
5.16 Relative frequency of immunology related words from topic 22 (group 2)
5.17 Relative frequency of immunology related words from topic 22 (group 3)
5.18 Relative frequency of words from topic 13 (top 1-5 words)
5.19 Relative frequency of words from topic 13 (top 6-10 words)
5.20 Relative frequency of words from topic 22 (top 1-5 words)
5.21 Relative frequency of words from topic 22 (top 6-9 words)
7.1 Website: Part 1 User options
7.2 Website: Part 1 Example output for topics 2, 3, 4, 5
7.3 Website: Part 2 Example output for topic 13 (topics shown partially)
7.4 Website: Part 3 Example output for topic 13 (top 2-5 words shown)
8.1 Diachronic topic modeling pipeline

List of Tables

4.1 10 topics generated from a corpus of 1.5 million articles
4.2 A selection from the 20 topics generated from a corpus of 1.5 million articles
4.3 A selection from the 50 topics generated from a corpus of 1.5 million articles
4.4 A selection from the 100 topics generated from a corpus of 1.5 million articles
4.5 Identical topics generated from multiple topic models with different topic sizes
4.6 Topics from 10 topic model from noun corpus
4.7 Topics from 20 topic model from noun corpus
4.8 A selection from the 50 topics generated from noun corpus
4.9 Focus on breast-cancer related topics from 100 topics models from noun corpus
4.10 15 topics from 50 topic model from Noun-Corpus
4.11 15 topics from 50 topic model from Noun-Verb corpus
4.12 15 topics from 50 topic model from Noun-Adjective corpus
4.13 15 topics from 50 topic model from Noun-Verb-Adjective corpus
4.14 Percentage of identical terms in between the models
4.15 Number of unique words found in all the topics
4.16 Topic similarity based on number of similar words over multiple passes
5.1 Topics 19, 28, 34
5.2 Fictitious topic probability distribution over multiple topics and documents
5.3 Four topics selected for data exploration
A.1 All 50 topics from final model

List of Acronyms

HTML   HyperText Markup Language
LDA    latent Dirichlet allocation
NLP    Natural Language Processing
NLTK   Natural Language Toolkit
OCR    Optical Character Recognition
POS    Part-Of-Speech
XML    eXtensible Markup Language

1 Introduction

In this chapter, I present the motivation for writing this Master's thesis and state the research questions that will be tackled in this work. Finally, I give an overview of this work, stating the themes of the upcoming chapters.

1.1 Motivation

In the field of biomedical literature, thousands of articles are published every day. This is by no means an exaggeration; between 2012 and 2015, approximately 800'000 articles were published annually1. Research and discoveries in the biomedical field are primarily found in scholarly publications; however, due to the aforementioned amount of literature being published, finding trends in the biomedical domain can be a challenge. Natural language processing (NLP), as a consequence, can be of great use, because these academic publications are often published in machine-readable text formats. Huang and Lu [2016] mention in their article about community challenges in the biomedical field that collaborations between biomedical and NLP researchers have become commonplace, forming the field of research known as biomedical natural language processing (BioNLP). They also mention that NLP methods can be used for a multitude of tasks, such as constructing ontologies and curating databases.

For those interested in machine-learning approaches, the field of biomedical research publishing is well suited to finding patterns in large amounts of data, due to the abundance of publications available.

PubMed offers full articles, as well as abstracts, of scientific articles written in the biomedical domain. The data provided on PubMed is quite useful for machine-learning approaches. Firstly, the data is machine-readable, and secondly, metadata including authors and date of publication accompany the scientific texts.

1 Based on MEDLINE citation counts by year of publication: https://www.nlm.nih.gov/bsd/medline_cit_counts_yr_pub.html


Exploring diachronic trends in large corpora has been a particular interest of mine for quite a while and I have worked on related projects within the constraints of academia as well as for professional projects. Furthermore, working with biomedical data is undeniably fascinating in its own regard. For these reasons, I decided to embark on this project with the aim of discovering diachronic trends in a large corpus of biomedical publications.

1.2 Research Questions

The research questions that shall be answered in this Master’s thesis are:

1. Is it possible to detect temporal trends in a corpus using topics generated from a topic modeling algorithm?

2. Using topic modeling, can one detect diachronic changes within the words of a given topic throughout the entire corpus?

3. Can one use topic modeling to detect diachronic changes in term/word usage within documents that fall into a specific topic?

1.3 Thesis Structure

This Master's thesis is structured as follows: at first, I provide a brief overview of the theoretical background of topic modeling in Chapter 2 and explain the topic modeling algorithm that I will use to create the models. Furthermore, I mention the issues of this algorithm and probable strategies for circumventing them. In Chapter 3, I cover the previous work that has been done in the domains of diachronic and biomedical topic modeling. At the same time, I give a brief overview of the off-the-shelf tools available for topic modeling. In Chapter 4, the entire methodology to create the topic model from the initial corpus is provided. Then in Chapter 5, which is about data and topic exploration, I look into the topic model and answer the research questions. In Chapter 6, I discuss the results of the research questions. In Chapter 7, I introduce the companion website and explain its functionalities. In Chapter 8, I introduce the diachronic topic modeling pipeline that has been used to create the topic model. Finally, in Chapter 9, I conclude the findings of my Master's thesis and mention possibilities for future work in this field.

2 Theoretical background

In this chapter, we look at the theoretical framework behind topic modeling. I give a brief overview of the precursors of latent Dirichlet allocation (LDA). Then I provide a brief theoretical overview of LDA itself, and of the issues that one can face when using topic models. The problems of topic models are explained in detail, as I will refer to them to justify my methodology.

2.1 Topic Models

One can define topic models as statistical models that are used to learn about the latent structures that exist within a corpus of documents. These models can have many uses; however, discovering patterns is one of the key reasons for building topic models (Boyd-Graber et al. [2014]).

2.1.1 Precursors of Latent Dirichlet Allocation

There are many different kinds of statistical models that are currently in use to discover topics within documents. In this section, I will briefly list a few of these methods and then explain in detail the one which I use. Latent Semantic Analysis (LSA) uses vector-based models for finding coherence between texts. The main methods used in this model are term frequency-inverse document frequency (tf-idf) and singular value decomposition (SVD)1 (Deerwester et al. [1990]). There are certain advantages to using LSA, namely that one can find the latent topics that exist within the corpus. However, due to SVD, the mathematical complexity of this model is extremely high.
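For reference, tf-idf weights a term t in a document d by its frequency in d, discounted by how many of the N documents in the collection contain the term at all; in one common formulation:

    \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}

A term that occurs often in a single document but rarely across the collection thus receives a high weight.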

Another type of topic modeling method is Probabilistic Latent Semantic Analysis (PLSA). It uses a simple two-level generative model, where it calculates a probability model for the documents, the topics, and the words (Hofmann [1999]). It has certain advantages, such as easily interpretable topics, and the model is based on a

1 https://en.wikipedia.org/wiki/Singular_value_decomposition

solid statistical foundation (Holzinger et al. [2014]). However, this model also has a disadvantage, because the expectation–maximization algorithm1 used by PLSA has a tendency to find a solution which is not always the global optimum (Leopold [2007]).

2.1.2 Latent Dirichlet Allocation

One of the key concepts for the models used for this Master’s thesis is latent Dirichlet allocation (LDA) by Blei et al. [2003]. They explain LDA as follows:

“Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.”

The underlying logic behind LDA is that similar groups of words will occur in documents with similar topics. Thus, latent topics are groups of words that frequently co-occur in documents. Documents, in this case, are simply probability distributions over latent topics, and the topics can be defined as probability distributions over words. A key point here is that in this model one is working with probability distributions and not raw word frequencies. Hence, the syntax of the text within the document does not matter; only the distribution of the words is of importance.

Figure 1.1: Plate notation representing the LDA model (from Blei et al. [2003])

The plate notation (see Figure 1.1) represents the overall architecture of the LDA model from Blei et al. [2003].

M: total number of documents in the corpus (1...m)

N: number of words in a document (1...n)

α: the Dirichlet prior parameter for the per-document topic distributions

β: the Dirichlet prior parameter for the per-topic word distribution

θm: distribution of topics in a document

1 https://en.wikipedia.org/wiki/Expectation-maximization_algorithm


zmn: the topic for the n-th word in document m

wmn: a given word, where m denotes a specific document and n a specific word

The generative process gives an insight into how the LDA model assumes a document is created. As the first step, the model draws the number of words for the document. Then it determines the document as a mixture of a given set of topics. For example, if the number of topics is set to four, then it could decide that document m consists of 10% topic 1, 20% topic 2, 40% topic 3, and 30% topic 4. The model then generates the words in the document. This is done by first choosing a topic for each word position from the multinomial distribution of topics assigned to the document (40% topic 3, 30% topic 4, etc.). In the next step, it chooses the word itself from the chosen topic's multinomial distribution over words.
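Written out with the notation above, plus Nm ~ Poisson(ξ) for the document length and φk for topic k's distribution over words (both taken from Blei et al. [2003], since they do not appear in the notation list above), the smoothed generative process is:

    N_m \sim \mathrm{Poisson}(\xi), \qquad \theta_m \sim \mathrm{Dirichlet}(\alpha), \qquad \varphi_k \sim \mathrm{Dirichlet}(\beta)

    z_{mn} \sim \mathrm{Multinomial}(\theta_m), \qquad w_{mn} \sim \mathrm{Multinomial}(\varphi_{z_{mn}})

Each word position n in document m thus first draws a topic zmn from the document's topic mixture and then draws the word wmn from that topic's word distribution.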

2.2 Machine Learning Problem

The techniques of machine learning can be broadly divided into three categories, namely: supervised, semi-supervised, and unsupervised learning. A machine learning problem falls into one of these categories based on the full, partial, or absent ground truth that can be applied to the model during the training procedure. One uses unsupervised machine learning when there is a complete absence of ground truth. The aim of unsupervised learning methods is to find structures and patterns in the input data, based on the type of machine learning algorithm that is being implemented (Bonissone [2015]). LDA falls under the category of unsupervised machine learning algorithms, as the input data is not labelled and the model tries to infer structures within the data based on predefined parameters. This can be problematic at a later stage, as it is challenging to judge the quality of the generated topic model when there is no reference with which one could compare it.

2.3 Issues with Topic Models

Boyd-Graber et al. [2014] give a comprehensive overview of the issues that can occur with topics generated by an LDA model. They mention five categories that can be used to judge whether a topic is of good quality. As I will be using these metrics to judge the quality of the topics generated by the model (see Chapter 4), I present the aspects of good and bad topics, as proposed by them, in the following sections.

2.3.1 Categories of Bad Topics

2.3.1.1 General and Specific Words

As most words in natural language convey some sort of meaning, it can happen that the model generates topics made of words that are not useful. These are topics containing words that are frequent in the corpus but not specific. Thus, these topics can be perceived as general and not belonging to a specific subdivision of the corpus. Such topics may include stop words that were not removed during the preprocessing step. However, it can also be the case that these are high-frequency words specific to the corpus, which should be removed in order to yield meaningful topics.

Boyd-Graber et al. [2014] also state that low-frequency words can cause problems. According to them, topics containing a multitude of very specific words can also be bad, because there is a chance that these topics are not representative of a specific subdivision within the corpus but were generated by mere chance, as the model generates topics based on word frequencies. The authors do not mention how to avoid the creation of such topics.

2.3.1.2 Mixed and Chained Topics

Boyd-Graber et al. [2014] define 'mixed' topics as topics made of a set of words that make no sense in combination, even though these topics contain subsets of words that do make sense. Example 2.1 is a case of a 'mixed' topic, as this topic consists of two subsets, namely names of flowers (in emphasis) and tools.

(2.1) rose, daffodil, daisy, hammer, screwdriver, pliers ...

'Chained' topics are related to mixed topics, as here too one has different subsets of words within the topic; the issue here is that at least one word from one of the subsets could also belong to the other subset. As shown in Example 2.2, there are two subsets within the topic, but one word from the first subset, namely 'apple', which is about names of fruits, could also belong to the other subset, which is about products made by the company Apple.

(2.2) apple, banana, grape, iphone, smartphone, ipad...


2.3.1.3 Identical Topics

Another issue that can occur while generating topics is that some topics are mostly or completely identical, possibly exhibiting a different word order (see Examples 2.3 and 2.4).

(2.3) apple, banana, grape, orange, pineapple

(2.4) grape, apple, pear, pineapple, banana

Boyd-Graber et al. [2014] mention some solutions to avoid such topics. They suggest that one should check whether there are empty documents in the dataset, and whether the number of topics is excessive for the given dataset.

2.3.1.4 Incomplete Stopword List

These are topics generated as a result of having an incomplete stop word list. This issue is somewhat similar to the one mentioned for topics containing general words. The difference here is that the topics are not vague but make sense; for example, they could be a list of first names, or Roman numerals. The authors suggest that this problem can be circumvented by updating the list of stop words and running the model again (Boyd-Graber et al. [2014]).

2.3.1.5 Nonsensical Topics

These are topics that do not make any sense. Boyd-Graber et al. [2014] mention that providing the model with an excessive number of topics to generate may cause nonsensical topics. This is due to the fact that the model tries to generate the given number of topics even if they do not exist in the corpus; it infers topics based on whatever patterns it finds. The authors mention, for example, that OCR errors can form an artificial topic that could be detected by a topic modeling algorithm.

2.3.2 Topic Alignment

Another issue with LDA models is that, even if one uses the same corpus and parameters, the model generates the topics in a different order on each run. Yang et al. [2016] mention that when using an LDA model, the words that are generated in the model are based on fixed conditional distributions. This has a side-effect, as it leads to the topics in the model being exchangeable. For this reason, in statistical topic modeling algorithms such as LDA, even with identical model generation parameters, the topic indices between two topic models may not match. Hence, the authors suggest some kind of alignment measure in order to calculate the similarities between multiple models. Yang et al. [2016] implement the Hungarian algorithm1 in their work. However, for my experiments I use a different measure to calculate the similarities between multiple models (see Chapter 4.2.7).

2.3.3 Topic Quality Evaluation

Boyd-Graber et al. [2014] mention that after retrieving the topics from the model, there are different ways of judging their quality. Furthermore, they indicate that a weakness of most topic modeling papers is that the researchers only do a qualitative assessment of the generated topics. They also state that in many cases the quality of the topics is judged based on some NLP-related task that is not directly related to the topics themselves, for example, using the inferred topics for an information retrieval task (e.g. Wei and Croft [2006]) or for a sentiment detection task (e.g. Titov and McDonald [2008]). The second approach is to have a set of held-out articles (i.e. a test set) and to evaluate the probability that the model, trained on the remaining articles, assigns to the articles in the test set. The paper by Wallach et al. [2009] provides a comprehensive overview of evaluating topics based on the probability of the observations on the test set.
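A standard way of summarising this held-out probability, used for instance by Blei et al. [2003], is the perplexity over a test set of M documents:

    \mathrm{perplexity}(D_{\mathrm{test}}) = \exp\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)

where wd are the words of held-out document d and Nd its length; lower perplexity indicates better generalisation to unseen documents.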

At this point I would like to mention that I will not be using a test set to evaluate my results, as I do not have a reference for comparison. Hence, I will be using a different metric to judge the quality of my model (see 4.2.7).

2.3.3.1 Human Evaluation

Chang et al. [2009] propose a method for human evaluation of topics, where the participants are given a set of words from a topic that also includes an intruder word (see Example 2.5)2. They mention that the participants are able to identify the intruder word if the other words in the set belong to the same semantic group. However, if the set contains words that do not belong together (see Example 2.6), then the task is much more difficult for the participants, who then seem to start choosing the intruder word at random. They mention that the quality of the topic can be evaluated based on the consistency of the human judgements.

1 https://en.wikipedia.org/wiki/Hungarian_algorithm
2 The intruder word is in emphasis.


(2.5) apple, orange, banana, pineapple, tiger

(2.6) book, computer, teacher, student, weekend

2.3.3.2 Topic Size

The topic size plays a key role in the quality of the model. As Mimno et al. [2011] state, there is a relationship between the number of topics and the quality of the topics themselves. They point out that, on the one hand, models with a large number of topics provide the user with a more detailed view of the themes that are present in the corpus. On the other hand, having a multitude of topics comes with disadvantages, because certain topic modeling algorithms tend to create topics even if there are none to be found (see 2.3.1.5).

2.3.3.3 Topic Word Length

Boyd-Graber et al. [2014] refer to the length of the words in a topic as an indicator of topic quality. Their intuition is as follows: if a word has a specific meaning, then it is quite likely to be longer, and vice versa. Thus, according to Boyd-Graber et al. [2014], topics with a short average word length can be an indication of anomalous topic clustering (e.g. acronyms). However, they mention that short topic words are not in themselves an indication of a nonsensical topic that cannot be interpreted by the user. Topics with short words in them probably indicate words that have a tendency to co-occur. The authors allude to a topic that contains the word 'legislator' and acronyms for the names of states in the US (e.g. 'ca', 'pa', 'nc', 'fl', etc.). In this case, the topic shows that names of states tend to co-occur with tokens such as 'legislator'. The authors do not provide a solution for avoiding such topics. Nonetheless, this criterion of word length can be applied to my future topic models to evaluate whether the output is of adequate quality.

2.4 Improving Topic Models

Boyd-Graber et al. [2014] also mention multiple ways of improving topic models. Their suggestions include merging topics that are similar (see 2.3.1.3) or separating topics that conflate multiple concepts (see 2.3.1.2). Among the approaches that can be implemented, they recommend measures that calculate the co-occurrence of words (e.g. point-wise mutual information (PMI)) and the use of expert knowledge. They also state that automatic topic labelling is a way to interpret topics without the help of domain experts. This method also provides a summary of what is being presented by the topics.
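For reference, the point-wise mutual information of two words measures how much more often they co-occur than they would if they occurred independently:

    \mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}

where the probabilities are estimated from occurrence and co-occurrence counts in a reference corpus; a high average PMI between a topic's top words is commonly taken as a sign of a coherent topic.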

2.4.1 Automatic Topic Model Labelling

2.4.1.1 Information Retrieval

Lau et al. [2011] attempt to resolve the issue of topic labelling by generating new labels for the topics. The methods implemented by them include using terms that are found in the topics. They search the English Wikipedia for the top-ranking topic terms and use the article titles returned by their query to generate more topic labels. Then they rank and process the Wikipedia article titles and extract label keywords from them. These keywords are further processed using a combination of association measures (PMI) and lexical features. Out of all the datasets evaluated in their work, the topic labels for the PubMed abstracts perform less well than the labels generated on the other datasets. This approach is certainly interesting; however, implementing their methodology requires using an API to get the relevant information from Wikipedia, which goes beyond the scope of this Master's thesis.

2.4.1.2 Neural Embeddings

An improvement on the model proposed by Lau et al. [2011] is the one by Bhatia et al. [2016]. Even though their methodology has some similarities to the former approach, Bhatia et al. [2016] forego the information retrieval aspect of Lau et al. [2011] and replace it with word2vec and doc2vec. The word2vec model generates more abstract labels, whereas doc2vec returns fine-grained labels for a given topic. The strength of the system lies in combining the outputs of both. They also make use of different learning-to-rank approaches to improve the quality of the top-ranking topic labels. Unlike the approach by Lau et al. [2011], the researchers claim that this method is much easier to implement, as one is not required to use search APIs. Moreover, Bhatia et al. [2016] report better results than Lau et al. [2011].

As their method requires doc2vec to generate multi-word topic labels, I chose not to implement it for my work. Doc2vec requires a significant amount of computational resources (notably RAM) to run, and it is not practical for the corpus that I intend to use.


2.4.2 Text Preprocessing to Acquire Meaningful Topics

A prevalent issue with LDA topic modeling is finding meaningful topics in a given corpus. Zhu et al. [2014] suggest that when using LDA, this problem can be significantly reduced by using multiple levels of text preprocessing, with methods such as POS tagging, base noun phrase chunking, and k-means clustering. Part of their preprocessing approach includes the reduction of plural tokens to their lemma form. They mention that with these preprocessing steps, their method outperforms a baseline, and that its output is ranked better by human annotators. This paper is of particular interest, as many of the approaches mentioned by the authors can be implemented using off-the-shelf NLP tools such as the Natural Language Toolkit (NLTK) for Python (Bird et al. [2009]).

3 Previous work

In this chapter, I give a brief overview of the work that has been done in the fields of biomedical and diachronic topic modeling. I focus only on the articles which are of interest to my work. Finally, I give a brief overview of the tools that can be used for topic modeling purposes.

3.1 Biomedical Topic Modeling

3.1.1 Ontology Term Mapping

Zheng et al. [2006] used topic modeling on titles and abstracts of protein-related MEDLINE articles. They used LDA and extracted 300 topics from their corpus. They found that the majority of the extracted topics were not only semantically coherent, but also featured biological terms. As an added feature, they mapped the topics to the Gene Ontology (GO) controlled vocabulary. They did this by associating the common terms between the topic words and the GO terms. Thus, this paper exhibits a practical usage of topic modeling in the domain of biomedical publications. Furthermore, it also contains multiple examples of biomedical topics created using LDA.

3.1.2 Enriching LDA Output with External Data

The output of the model can also be improved by enriching it with information from an external knowledge base. Wang et al. [2011] do this by applying multiple levels of complex preprocessing steps, which include NER, retrieving information about the tokens from an external database, and recategorising the extracted information into a relational database for further usage. Moreover, they apply other semantic association features to improve the topics. The researchers create Bio-LDA, an algorithm that performs LDA on a given corpus and enriches the results using datasets from the life science domain. It then identifies relationships between the topics using the aforementioned methods.

This article is of interest to me, as the authors incorporate external information to enhance their method. It has also served as an inspiration for the pipeline that I created to easily process PubMed articles (see Chapter 8).

3.1.3 Discover Relationships between Diseases and Genes with Topic Modeling

ElShal et al. [2016] aim to find relationships between diseases and genes.

They used LDA to extract the topics from their corpus, the abstracts of biomedical research articles. They converted the documents in their corpus into vectors using the bag-of-words approach. From the LDA model, they used the topics and the topic-word distribution, and from the corpus they used the mentions of genes and diseases. They combined this information to find links between genes and diseases. In most cases, they calculated similarities between the documents, topics, genes, and diseases using cosine similarity.
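For reference, the cosine similarity between two vectors a and b, for example bag-of-words vectors of two documents, is the cosine of the angle between them:

    \cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}

For non-negative count vectors it ranges from 0 (no shared terms) to 1 (identical direction).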

Using this approach, they found many correct links between genes and diseases; however, they also realised that the vocabulary plays an important role, and that the extracted topics rely heavily on it. This article is interesting as it focusses on the role of vocabulary when creating LDA models.

3.1.4 Comprehensive Biomedical LDA Topics Example Source

Examples of topics created from models with biomedical texts as input are a useful resource, because they serve as a reference for what such topics can look like. This is one of the reasons why the article by van Altena et al. [2016] is helpful. The article focuses on the usage of big data themes in scientific texts, with an emphasis on biomedical literature. As their corpus, they take 1308 documents from PubMed and PMC. These abstracts were selected based on the occurrence of big data related keywords in them. As for the topic modeling methodology, they implemented LDA using R. The preprocessing methods included stopword filtering and stemming tokens. An interesting feature implemented by them in the corpus creation was to transform common bigrams into a single token using an underscore (e.g. 'health care' becomes 'health_care'). To extract the best parameters for their model, they calculated the Akaike information criterion (AIC)1 using the following variables: the number of topics, the likelihood of the model, and the number of unique words in the corpus. Based on their corpus, the best number of topics is 25. Finally, they used manual annotators with domain knowledge to ascribe labels to the topics. The article further describes the usage of big data terminology in biomedical texts.

Despite the fact that their corpus is about big data, this article is a highly valuable resource, as it gives an insight into what such topics can look like when using a biomedical corpus. Moreover, their method for finding the best parameters for the LDA model can potentially be applied to finding the parameters for the models during my experiments.

3.2 Diachronic Topic Modeling

3.2.1 Early Work in Modern Non-biomedical Domain

The paper by Wang and McCallum [2006] performs diachronic topic modeling using LDA. The authors developed a tool called Topics over Time (TOT), which uses topic modeling and combines it with co-occurrence patterns of words. Although they do not use biomedical data in their training corpus, they were able to show trends in their data. This article is mentioned here as an interesting early example of diachronic topic modeling and trend analysis.

3.2.2 Observe Diachronic Changes and Author Influence in Biomedical Domain

The issue of diachronic changes in topics and the influence of authors is tackled by Song et al. [2014]. They analyse 20,869 articles from PubMed from 2000 to 2011. They extract bibliographical and other relevant information (e.g. abstract, full text, etc.) from the XML files. Using the extracted bibliographical information, they create a relational citation database. Furthermore, the texts are divided into three temporal categories, namely 2000-2003, 2004-2007, and 2008-2011, and they ran separate LDA models for each of these time periods. The authors state that these temporal categories are based on in-domain trends, the number of publications per year, and having enough data for diachronic topic modeling. They use topic modeling to find out which changes in topics have occurred in the aforementioned time periods. Moreover, they implement a mixture of information extraction, to find the most influential authors, and topic modeling, to find the topics associated with them. The researchers use Dirichlet multinomial regression (DMR) as their topic modeling algorithm and MALLET as their tool. Using these methods, they not only find the most productive authors, countries, and institutions, but also patterns of larger collaboration networks based on topics. This article provides an insight into diachronic changes in topics in the domain of biomedical literature, and into what such topics can look like. It also provides a method for finding influential authors and the topics used by them. Moreover, it provides information about finding collaboration networks and interrelated fields. Finally, similar to the article by van Altena et al. [2016], this paper is very useful to me, as it also provides a comprehensive list of topics that the researchers found during the topic modeling process.

1 https://en.wikipedia.org/wiki/Akaike_information_criterion

3.3 Brief Overview of Available Tools

There are numerous topic modeling tools available for free, and they can be chosen based on one's knowledge of programming languages, the amount of data being processed, and the level of customizability required for the task.

3.3.1 MAchine Learning for LanguagE Toolkit (MALLET)

MALLET is a cross-platform, Java-based NLP package (McCallum [2002]). It has tools for document classification, sequence tagging, and topic modeling. The topic modeling toolkit contains models such as LDA and hierarchical LDA. It is easy to use, does not require any prior knowledge of Java, and contains tutorials for users who do not have any prior programming knowledge.

3.3.2 Stanford Topic Modeling Toolbox

The Stanford Topic Modeling Toolbox is part of the Stanford NLP suite of tools (Ramage and Rosen [2009]). Written in Scala, it takes as input data in spreadsheet (Excel, CSV) format. The topic models available for this toolkit are LDA, labelled LDA, and PLDA. One advantage of this tool is that it generates the output in Excel format, which eases the data analysis process; moreover, there is a Java-based user interface. A disadvantage is that knowledge of Scala is required to fine-tune the model parameters.


3.3.3 Spark Machine Learning Library (MLlib)

The machine learning libraries in Spark offer a big data solution for LDA models1. Despite the fact that most of the core code and documentation for Spark MLlib is written in Java (and Scala), there is a Python wrapper (the PySpark API) which enables the user to write the required code in Python. The advantage of this library is that it offers a big data solution that can be useful for handling large amounts of data, which is usually the case when working with biomedical literature. A disadvantage is that it requires an architecture for big data analysis, and the user documentation is somewhat complicated and requires time to get accustomed to.

3.3.4 R Libraries

There are also topic modeling libraries for R; a notable one is called 'topicmodels' (Hornik and Grün [2011]). It requires data in document-term matrix format, so that it can be processed by the aforementioned R package. Moreover, it requires programming knowledge of R, and one should be skilled in the methods of data manipulation using other R libraries, so that the input as well as the output can be processed in R.

3.3.5 Gensim

Gensim is a Python library (compatible with Python 2 and 3) that contains a multitude of tools for extracting semantic relations from documents (Řehůřek and Sojka [2010]). As for topic modeling algorithms, it offers LSI and LDA. It takes as input either raw text or text in Python-readable list formats. Gensim has its own text processing tools, such as simple tokenizers. A key feature is that Gensim also allows parallel processing, which can reduce the time it takes to run the models. Moreover, this library has very detailed documentation, which is easy to follow, and it allows a high level of customizability.

In addition, as it works in Python, NLP tools such as NLTK can be used for preprocessing purposes before feeding the data into the topic modeling algorithm. I will be using Gensim for this project because of the level of customizability of this library and my programming knowledge of Python.

1 https://spark.apache.org/mllib/

4 Methodology

As mentioned before, LDA is an unsupervised machine learning algorithm. Thus, there is the issue of ground truth: there is no pre-existing data against which I can compare my results. Therefore, my aim in the following sections will be to remain aware of the issues of topic modeling and to circumvent them, based on what is possible and what is recommended in the literature about building topic models.

4.1 Source of Data

In this section, I document how the topics were extracted from a corpus of around 1.5 million open access PubMed articles. I downloaded the open access bulk article packages from PubMed Central. The corpus contains articles from the PMC Open Access Subset1. I used the articles from this subset because the data made available here falls under Creative Commons or similar licensing regulations, which enables me to avoid any issues regarding copyright. The corpus was downloaded at the beginning of March 2017.

4.2 Extracting Topic Models

In the following sections, I implement the LDA algorithm on the corpus to create topic models. I tried different parameters, such as chunk size, dictionary trimming, the number of extracted topics, and corpus filtering by POS tags, to obtain topics that appear coherent and meaningful to a human reader.

1https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/


4.2.1 Experiment 1: Exploring the Topics in the Corpus

4.2.1.1 Preprocessing

The articles downloaded from the PMC Open Access Subset are in XML format. Hence, with the help of Python libraries, I extracted the text from each article to create my corpus. As a next step, I tokenized the text and then reduced the corpus by removing English stopwords1 and punctuation marks that do not occur within a token. In order to further reduce the vocabulary, I lowercased the entire corpus. This approach has its advantages and disadvantages: lowercasing consolidates multiple tokens with different casing into a single lowercased token (e.g. 'CELL' and 'Cell' to 'cell'), which in turn reduces the number of vocabulary items. However, lowercasing a token can also create homonyms with different meanings (e.g. 'AIDS' vs. 'aids') which should not be consolidated into a single vocabulary item. Despite this issue, I decided to lowercase the corpus, as in this context the positive aspects of lowercasing outweigh the disadvantages. Then I lemmatized the text to further reduce the number of vocabulary items (e.g. 'cell' and 'cells' to 'cell').

I used the tokenizing and lemmatizing functions provided by NLTK in the text preprocessing stage (Bird et al. [2009]). The decision to use NLTK was mainly based on the fact that it is freely available and can be easily combined with Python. Secondly, the dataset I am using consists of text written in English, and the tools provided by NLTK are trained on English datasets. At this point I would like to state that biomedical texts pose a difficulty for most text processing tools. This is due to the specific biomedical terminology used by the authors of these texts. Considering the scope of this Master's thesis, I decided against using tools that have been trained on biomedical data. This decision was based on the examples of topic models that I saw in the work done by Song et al. [2014]2 and van Altena et al. [2016]3, where the topic models were generated from biomedical texts.

In both papers, examples of topic words tend to be common nouns such as 'cell', 'disease', 'cancer', 'virus', etc. (from Song et al. [2014]), or 'brain', 'disorder', 'biology', etc. (from van Altena et al. [2016]). In both works, the researchers did not implement any tools for recognising biomedical entities. Nonetheless, terms such as 'dna' were recognised as topic words by the models in both papers. Therefore, based on the previous work done in this domain, I decided against the implementation of a tool that was trained on biomedical data.

1 These were taken from the set of English stop words provided in the NLTK package (Bird et al. [2009]).
2 Song et al. [2014] have a table of topics generated by their model in their paper on pages 357-359.
3 van Altena et al. [2016] also show the topics generated by their model, in tables that can be found on pages 357-359 of their paper.

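To make this pipeline concrete, the following is a minimal sketch of the preprocessing described above. It is an illustrative reconstruction rather than the exact thesis code: it assumes the NLTK resources 'punkt', 'stopwords', and 'wordnet' have been downloaded, and the function name is hypothetical.

    import string

    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    STOPWORDS = set(stopwords.words('english'))
    LEMMATIZER = WordNetLemmatizer()

    def preprocess(text):
        # Lowercase to merge casing variants ('CELL', 'Cell' -> 'cell').
        tokens = word_tokenize(text.lower())
        # Drop stopwords and tokens made up entirely of punctuation;
        # punctuation occurring inside a token is left untouched.
        tokens = [t for t in tokens
                  if t not in STOPWORDS
                  and not all(c in string.punctuation for c in t)]
        # Lemmatize to merge inflected forms ('cells' -> 'cell').
        return [LEMMATIZER.lemmatize(t) for t in tokens]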

4.2.2 LDA Model Creation Parameters

Model Parameters

After the preprocessing step, I created an LDA model using Gensim. In my first tests, I ran the LDA model with different configurations for extracting the topics within the corpus. I set the number of topics parameter to 10, 20, 50, and 100, one value per iteration. Moreover, I set the chunk size to 5000 and the number of passes to 1.
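A minimal sketch of this step, assuming Gensim's LdaMulticore class and that dictionary and corpus (the bag-of-words representation of the preprocessed documents) have already been built; the file name is illustrative:

    from gensim.models import LdaMulticore

    for num_topics in (10, 20, 50, 100):
        # One pass over the corpus, updating the model in chunks
        # of 5000 documents.
        model = LdaMulticore(corpus, id2word=dictionary,
                             num_topics=num_topics,
                             chunksize=5000, passes=1)
        model.save('lda_%d_topics.model' % num_topics)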

Dictionary Trimming

I also edited the dictionary used by the model, in order to make the model run faster and to discard redundant vocabulary items that could behave like spam in the output. Hence, I removed not only words that occurred fewer than 10 times in the entire corpus, but also the 100 most common words. Other omissions from the dictionary included numbers and tokens that are three characters long or shorter.
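A sketch of this trimming step with Gensim's Dictionary class, under the assumption that tokenized_docs holds the preprocessed documents:

    from gensim.corpora import Dictionary

    dictionary = Dictionary(tokenized_docs)
    # Drop words occurring fewer than 10 times in the entire corpus.
    dictionary.filter_extremes(no_below=10, no_above=1.0, keep_n=None)
    # Drop the 100 most common words.
    dictionary.filter_n_most_frequent(100)
    # Drop numbers and tokens of three characters or fewer.
    bad_ids = [token_id for token, token_id in dictionary.token2id.items()
               if token.isnumeric() or len(token) <= 3]
    dictionary.filter_tokens(bad_ids=bad_ids)
    dictionary.compactify()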

Topic number  Topic words
1   response participant task trial stimulus subject experiment fig activity day
2   fig structure solution protein patient surface size compound acid reaction
3   patient health age risk participant care score woman clinical outcome
4   snp gene patient population dna expression allele fig association protein
5   protein fig gene activity dna mutant binding expression strain acid
6   expression mouse fig protein gene tissue antibody day tumor activity
7   fig specie water plant concentration area temperature surface site population
8   patient disease blood clinical day infection response tumor serum month
9   gene sequence document expression protein genome fig set region minimal
10  patient fig image parameter region activity network area structure response

Table 4.1: 10 topics generated from a corpus of 1.5 million articles

4.2.2.1 Evaluation: 10 Topics Model

The output of the model returned 10 topics based on the entire corpus. The result of this topic model can be found in Table 4.1, where the topic numbers and topic words are displayed. As can be seen from the results, the theme of a topic can be guessed from the top five most relevant topic words in the list. For example, topics 4, 5, and 9 are about gene-related themes, whereas topic 3 is about patient information.


Issues

Within the topics, there are certain terms that can be classified as noise, namely 'fig' (in emphasis), which probably refers to the shortened form of 'figure'. These terms should be removed in future experiments. Moreover, as seen in Table 4.1, at least three of the topics can be labelled as gene-related. It can be concluded that there is a lack of thematic variance in this model. As previously discussed, these are issues that are common to bad topic models. Three issues that can be mentioned here are: mixed and chained topics (2.3.1.2), identical topics (2.3.1.3), and an incomplete stopword list (2.3.1.4).

Model Evaluation

It can be seen that the 10 topics given as output are vague and somewhat generic. Moreover, they tend to have common themes. Nonetheless, by looking at them, it is possible to gain a general idea of what the corpus is about. However, the topics themselves are far too vague, and ultimately it is necessary to have more fine-grained topics. Finally, it can be said that this is a poor quality model and that 10 topics are not enough for a corpus of this size.

4.2.2.2 Evaluation: 20 Topics Model

Table 4.2 shows some of the results of the 20 topics model. This model was run with the same set of parameters, with the exception of the number of topics in the output. There are still some spam words in the output, such as 'fig'. A topic was found that contained LaTeX-related markup terms (topic 7). This is probably because the markup terms had not been filtered out of the corpus during the preprocessing steps. Unlike in the previous iteration, there are more distinct topics, as new themes emerge in the output. There are topics about insulin and mice (topic 15), mice and brain experiments (topic 20), and cancer (topic 10), which were not in the previous model.

4.2.2.3 Model Evaluation

We can see that more topics are extracted, as new themes emerged in the 20 topics model. Those themes are more detailed than those in the 10 topics model. However, these topics still contain words which are frequent in the corpus but not specific. The topics can be perceived as general and not belonging to a specific subdivision of the corpus (see 2.3.1.1). Moreover, the issues of mixed and chained topics (2.3.1.2), identical topics (2.3.1.3), and an incomplete stopword list (2.3.1.4) are still prevalent in this model. Finally, it can be said that 20 topics are not adequate for a model of this corpus.

Topic number  Topic words
7   document amsbsy minimal amsfonts mathrsfs wasysym upgreek amssymb amsmath 12pt
10  cancer tumor patient expression breast gene protein tissue mutation line
15  concentration glucose insulin mouse acid fig weight diet rat protein
20  rat neuron mouse animal brain response day patient experiment trial

Table 4.2: A selection from the 20 topics generated from a corpus of 1.5 million articles

4.2.2.4 Evaluation: 50 Topics Model

In this iteration, the same issues as in the 10 and 20 topics models persist, since this model was also created with the same parameters, corpus, and dictionary. However, compared to the previous ones, this model resulted in even more detailed topics. Some of the topics in this model were also found in the previous models; hence, I will only mention the new topics that emerged. These new topics are about mental health/depression (topic 44), antibiotic resistance (topic 6), vaccines (topic 12), diabetes (topic 17), eyes/surgery (topic 32), bone/tissue (topic 18), neurons (topic 48), and plants (topic 8).

It appears that in this model of 50 topics, the generated topics are less general and have a tendency to be more specific.

Model Evaluation

Although the 50 topics model yielded much more detailed results than the previous iterations, it contains some of the topics that were found before. Furthermore, some of the criteria that make a model bad were found in this iteration as well. Nonetheless, with 50 topics, some nuanced themes in the corpus become apparent, as the topics are less vague than before. If I manage to solve some of the issues that put the topics generated by this model in the category of bad topics, then for future iterations 50 topics is an adequate, yet manageable, number to generate from this corpus.

4.2.2.5 Evaluation: 100 Topics Model

Unlike the 50 topics model, this iteration contained very few new topics, namely about heart disease (topic 22), aneurysms (topic 63), malaria (topic 91), pregnancy (topic 74), and reproduction (topic 95).


Topic number  Topic words
6   antibiotic resistance patient isolates strain antimicrobial infection resistant culture day
8   plant leaf soil specie fig water seed site day root
12  virus vaccine influenza antibody vaccination response patient infection protein day
17  risk patient age diabetes blood subject bmi disease association weight
18  bone fracture patient fig tissue implant cartilage week day protein
32  patient eye nerve left surgery right clinical disease month image
44  patient pain score symptom depression sleep protein disorder fig anxiety
48  neuron fig response channel current receptor expression activity synaptic protein

Table 4.3: A selection from the 50 topics generated from a corpus of 1.5 million articles

The lack of topic diversity is due to the fact that this model contained similar topics to those found in the previous models, with slight variations in the topic words and their order. Thus, it can be said that this model has a high number of identical topics. Perhaps this is because 100 topics are too many for the given corpus (2.3.1.3). As this model was made with the same parameters as the aforementioned ones, the weaknesses of those models are also prevalent in this one.

Topic number   Topic words
22             patient heart cardiac pressure vitamin ventricular left blood artery volume
63             aneurysm patient fig artery blood protein activity min concentration rbc
74             birth pregnancy infant maternal patient woman asthma age child risk
91             malaria parasite infection hpv falciparum patient mosquito blood day expression
95             oocyte sperm expression embryo patient human chromosome mtdna ovarian stage

Table 4.4: A selection from the 100 topics generated from a corpus of 1.5 million articles

Model Evaluation

The 100 topics model exhibits a rather comprehensive array of topics; unfortunately, several of them are similar. One could go into further detail with a 200 topics model, but since the current model generated very few new topics, branching into one at this point is not advisable. As discussed before, LDA has a tendency to create topics even when there are none thematically in the corpus (see 2.3.1.5). Hence, at this point the task is no longer to branch into a more detailed topic model, but to fix the issues that are currently present.

4.2.2.6 Evaluation Experiment 1: All Models

The results of the different iterations show that extracting only a small number of topics, namely 10 to 20, from the corpus results in vague topics. This is due to the fact that the LDA model tries to cluster together high frequency words in the

corpus, resulting in generic topics. As the number of topics extracted from the corpus increases, the topics become more specific. It should be noted, however, that with an increasing number of topics, the model may not extract genuinely different topics, but rather different versions of topics belonging to the same theme. As discussed before, the generation of identical topics can sometimes be due to a number of topics that is excessive for the data set.

At the end of this experiment, the results give an overview of the types of topics that can be found in this corpus. As mentioned before, there are many topics and topic words in the output that should be removed, which will be done by updating the stopword list. Hence, the corpus needs to be cleaned again, and further experiments with different parameters should be run on the new corpus.

4.2.3 Experiment 2: Edited Corpus and Modified Model Update Parameters

In the previous experiment, I ran the model with the chunk size parameter set to 5000. This means that the model performs online training: with the influx of new information, the model is updated continuously. The update size of the model depends on the number of workers (parallel processes) and the chunk size (see 4.1).

update = number of workers × chunk size (4.1)

In this batch of experiments, I decided against online training, as I do not have any control over how the initial chunk is chosen and how representative it is of the entire corpus. Thus, I set the batch parameter to True, which makes the LDA model calculate the topics from the entire corpus at once. The goal of this approach is to find the topics that are representative of the whole corpus and not those shaped by the chunks.
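A minimal sketch of this configuration, assuming Gensim's LdaMulticore class; the toy documents and most parameter values below are placeholders rather than the actual experimental setup:

```python
# A minimal sketch, assuming Gensim's LdaMulticore; the toy documents and
# most parameter values are placeholders, not the thesis' actual corpus.
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

docs = [["cancer", "tumor", "patient"],
        ["gene", "expression", "protein"]]  # stand-in for the 1.5M articles

dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(d) for d in docs]

# batch=True switches off online updates: the model is trained on the
# whole corpus at once instead of chunk by chunk.
lda = LdaMulticore(corpus=bow_corpus, id2word=dictionary,
                   num_topics=2,      # 10/20/50/100 in the experiments
                   workers=3,
                   chunksize=5000,
                   batch=True)
```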

4.2.3.1 Preprocessing

The corpus from the previous model needed to be modified, as it contained many spam elements such as LaTeX markup terms. Hence, a new corpus was created with these elements removed. The other preprocessing steps remain the same as before (see 4.2.1.1).


Topic words
cell gene patient model mouse cancer activity rate expression population
patient protein cell model gene treatment rate activity expression site
patient cell treatment gene protein model expression risk activity rate
cell protein patient treatment expression activity mouse concentration line model
gene cell patient expression treatment model network activity disease function
cell protein gene mouse activity patient antibody sequence expression treatment

Table 4.5: Identical topics generated from multiple topic models with different topic sizes

4.2.3.2 Results

The results from this iteration are disappointing: for all the different types of models that I ran, all the topics were variations of the ones shown in Table 4.5. Hence, I can only assume that the model latched onto the most common elements in the corpus, which concern running experiments on mice and analysing proteins and cells. Unlike in the previous experiment, the topics have little variance. The number of topics generated by the model is irrelevant, as even in the 100 topics model all the topics contain the word ‘cell’. I can assume that doing online training is important, and in the following experiments I will experiment with different batch sizes to see how the topics differ. Nonetheless, the results from this experiment seem illogical, as I do not see any correlation between online learning and the types of topics generated. Perhaps the problem lies in the dictionary, because a new corpus was generated (as mentioned in the preprocessing step 4.2.3.1) which has different word frequencies.

4.2.4 Experiment 3: Online Learning with Different Batch Sizes

4.2.4.1 Preprocessing

Working with a corpus of 1.5 million documents is not recommended when it comes to using Gensim, because Gensim loads a significant amount of the data into RAM during the training process. Therefore, one cannot run multiple batches with different parameters simultaneously, as doing so consumed almost 250 GB of RAM and brought the server to a standstill. For this reason, in order to reduce the training time and run multiple models at the same time, I chose a smaller subset of the corpus. This subset contains 150 thousand randomly selected articles from the entire corpus, regardless of year of publication. In Figure 4.1 one


Figure 4.1: Number of articles published per year from 1950-2016 in the corpus of 150 thousand articles

can see the number of articles published per year in the smaller corpus. The graph shows the number of publications from 1950 to 2016. As visible in Figure 4.1, the number of publications increases exponentially. In order to see if the randomly chosen sample set is representative of the corpus, I calculated for each year the percentage of articles from the original corpus that are in the new corpus of 150 thousand articles. As one can see in Figure 4.2, for every year this share oscillates between 8 and 11 percent of the articles that were published in that year and are available in the open access subset. Drastically reducing the number of articles reduced the training time as well as the strain on the RAM created by running the topic models. The other preprocessing steps remain the same as before (see 4.2.1.1).
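The subsampling and the per-year representativeness check can be sketched as follows; the identifiers are hypothetical, with a toy mapping standing in for the real article metadata:

```python
# A sketch (with hypothetical identifiers) of drawing the 150k subset and
# checking how representative it is per publication year.
import random
from collections import Counter

# article_years: article id -> publication year; toy stand-in for 1.5M articles
article_years = {"PMC1": 1999, "PMC2": 2005, "PMC3": 2005, "PMC4": 2005}

random.seed(42)
sample_ids = random.sample(list(article_years), k=2)  # k=150_000 in practice

full = Counter(article_years.values())
sampled = Counter(article_years[i] for i in sample_ids)

# percentage of each year's articles that ended up in the sample (cf. Figure 4.2)
for year in sorted(full):
    print(year, f"{100 * sampled.get(year, 0) / full[year]:.1f}%")
```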

4.2.4.2 Results

The results of this batch were nearly identical to the output of the previous experiment, thus confirming my suspicion that the online training did not influence the topics generated by the model. Hence, I looked into the differences between the first experiment and the latter two and realised that the key issue is the dictionary. In experiments 2 and 3, after removing the LaTeX markup words from the corpus, the


Figure 4.2: Percentage of articles from the original corpus (1.5 million articles) per year from 1950-2016 that are in the corpus of 150 thousand articles

token ‘cell’ was not removed when I deleted the 100 most common tokens from the corpus. It can be seen that certain vocabulary items have a tremendous influence on the output of the results. In later experiments, the influence of these words should be noted and they should be treated as a form of stopword, as they are high frequency vocabulary items. This raises the issue of which new tokens should be considered disruptive when it comes to finding the underlying topics in the corpus. Should other tokens such as ‘protein’ or ‘mice’ be removed as well? Although the results from this experiment can be discarded, the underlying cause for not finding topics that vary has possibly been found.

4.2.5 Experiment 4: Reduced Vocabulary

4.2.5.1 Preprocessing

As seen before, the dictionary plays a critical role in the type of tokens chosen by the model (see 4.2.4.2). In the preprocessing step, I therefore decided to reduce the vocabulary drastically.

I created a new reduced corpus which consists only of lemmatized common and proper nouns1. This was done by running NLTK's English POS tagger on the corpus and selecting only the noun-related tags for further processing. The other

1 The POS tagger in NLTK uses the Penn TreeBank POS tags, which in this case are NN, NNS, NNP, and NNPS.

preprocessing methods remain the same as before (see 4.2.1.1).
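A minimal sketch of this noun filtering step, assuming NLTK's default Penn Treebank tagger and the WordNet lemmatizer (the required NLTK data packages must be downloaded beforehand):

```python
# A minimal sketch of the noun-only filtering, assuming NLTK's Penn Treebank
# tagger and the WordNet lemmatizer. Requires nltk.download('punkt'),
# nltk.download('averaged_perceptron_tagger') and nltk.download('wordnet').
import nltk
from nltk.stem import WordNetLemmatizer

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
lemmatizer = WordNetLemmatizer()

def nouns_only(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))  # Penn Treebank tags
    return [lemmatizer.lemmatize(tok.lower())
            for tok, tag in tagged if tag in NOUN_TAGS]

print(nouns_only("The mice developed tumors after repeated injections."))
# -> ['mouse', 'tumor', 'injection']
```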

4.2.5.2 10 Topics Model

As seen in Table 4.6, the topics generated by this model are quite generic and only partially nonsensical. However, none of them are identical. The token ‘mouse’ appears in several topics, namely topics 3, 7, 8 and 9. Perhaps ‘mouse’ is an important token in the subcorpus that I am currently using. Other common topic words are ‘blood’, ‘virus’, and ‘tumour’. The words in these topics are very generic, and the topics themselves also fall into the category of mixed topics.

Topic number   Topic words
1              month intervention therapy blood trial hospital outcome pressure infection event
2              sequence mutation specie mouse domain receptor antibody virus genome family
3              infection mouse strain blood antibody sequence virus culture serum isolates
4              child score trial month mortality death outcome infection antibody sequence
5              medium growth solution strain membrane culture surface antibody temperature reaction
6              participant woman intervention score child community service family practice problem
7              mouse plant growth infection antibody macrophage production tumor activation cancer
8              tumor water lesion stage image diagnosis mouse therapy brain field
9              cancer mouse tumor antibody blood brain muscle animal breast receptor
10             sequence parameter image network length target frequency position performance error

Table 4.6: Topics from 10 topic model from noun corpus

Model Evaluation

The topics generated in this iteration do fall under some of the criteria of poor quality topics (see 2.3.1). Despite being somewhat vague, the topics show the overarching themes in the subcorpus. Reducing the vocabulary items does not have an adverse influence on the generated topics; nonetheless, more fine-grained topics are required. Overall, one can say that 10 topics are not enough to demonstrate the thematic diversity within this corpus.

4.2.5.3 20 Topics Model

Unlike in the previous iteration, the topics are less nonsensical for this model. However, they are still very vague and exhibit signs of mixed topics. Here we also observe that the tokens ‘cancer’ and ‘mouse’ occur in many of the topics. Perhaps these tokens are important in the subcorpus I am using. Expanding to 20 topics brings forth more detailed topics that were not present in the smaller model. However, one can also observe that some of the topics are partially identical.


Topic number   Topic words
1              infection virus blood mouse woman mutation antibody pregnancy growth phase
2              sequence specie genome length variation selection distance frequency receptor family
3              cluster sequence strain plant family specie isolates genome annotation transcription
4              network image parameter channel exercise frequency performance solution measurement surface
5              brain image neuron surgery month lesion diagnosis muscle tumor nerve
6              muscle cancer mouse trial blood month specie medium score event
7              participant woman child service intervention question people community practice score
8              temperature mouse hospital medium culture frequency production image lesion cancer
9              cancer blood antibody association exposure tumor bladder smoker brain status
10             trial intervention score outcome month child therapy participant review criterion
11             feature brain frequency image sequence association event error performance position
12             water strain growth infection energy surface temperature culture density production
13             sequence mouse strain domain culture mutation mutant primer antibody promoter
14             mouse antibody infection brain sequence image marker mirnas cancer culture
15             vaccine sequence virus mouse specie infection trial death cancer mortality
16             cancer tumor mutation breast stage survival metastasis therapy methylation sequence
17             plant stress mouse sequence growth water specie reaction temperature promoter
18             compound solution reaction product medium membrane water blood mouse plant
19             blood woman association therapy month parameter pressure serum fracture criterion
20             antibody mouse tumor cancer activation receptor medium growth culture inhibitor

Table 4.7: Topics from 20 topic model from noun corpus

Model Evaluation

In this iteration, we also have the same issues as shown in the 10 topics model. However, the topics are more specific than before, as new themes emerge within them. Nonetheless, they are still too vague and this shows that 20 topics are not adequate for this smaller subcorpus.

4.2.5.4 50 Topics Model

There are some overlaps with the previous model; however, in this case we can see that ‘cancer’ is again a common topic word. Thus, I assume that there are many texts about cancer in this subcorpus. Some of the topics are about chromosomes (topic 32), surgery complications (topic 7), patient mental health (topic 23), viral and bacterial infection (topics 30 and 35), pregnancy (topic 40), microRNA (topic 26), and some social aspects of hospital/clinical procedures (topic 12) (see Table 4.8).

Topic number   Topic words
7              surgery injury complication image technique lesion month blood strain brain
12             network hospital family staff mouse program score participant member intervention
23             depression anxiety movement participant score disorder scale symptom stress month
26             mirnas sequence cancer growth target fraction mirna culture infection medium
30             virus insulin mouse blood trial vitamin score child infection cancer
32             chromosome embryo stage clone exposure receptor mutation locus phenotype marker
35             strain bacteria production culture medium activation growth infection inhibition product
40             pregnancy birth woman mother antibody child outcome parent muscle infection

Table 4.8: A selection from the 50 topics generated from noun corpus


Model Evaluation

This model has more detailed topics, and cancer appears to be a significant topic word. Overall, the model shows variety in the types of topics it contains, although some identical topics and instances of mixed topics are present as well.

4.2.5.5 100 Topics Model

The topics found by this model are variations of the topics found in the 10, 20, and 50 topics models. However, the advantage of this model is that one can see many topic words that are associated with the same topic. A good example of this are the three topics that fall under the theme of breast cancer, depicted in Table 4.9. The topic words in them have a different thematic focus: some focus on the growth of the carcinoma (topic 1), others on diagnosis and mortality (topic 23), whereas the last one focuses on tumours and mutations (topic 80).

Topic words                                                                               Labels
cancer tumor breast proliferation antibody growth carcinoma woman medium invasion         breast-cancer (1)
cancer association death mortality cohort breast diagnosis exposure survival incidence    breast-cancer (23)
cancer tumor breast sequence therapy mutation plant stage mirnas association              breast-cancer (80)

Table 4.9: Focus on breast-cancer related topics from 100 topics models from noun corpus

Model Evaluation

Overall, the 100 topics model has the issue of identical topics. As shown in the previous example (see Table 4.9), it generates fairly good topics, but the model lacks thematic diversity. Perhaps 100 topics are too many to be extracted from this subcorpus. Hence, if one were to work with this corpus for the scope of this Master's thesis, one would have to rely on a better version of the previously generated 50 topics model.

4.2.6 Experiment 5: Influence of POS Tags

In order to gauge the influence of different POS tags, I ran multiple models with different corpora. As the base corpus, I am using the results from 4.2.5.4, where I have a 50 topics model with noun-related words; let us denote it as the Noun-corpus. I created three new corpora, each with a different group of tokens in them. I used


Topic number   Topic words
1              plant strain sequence medium growth culture primer cancer water mutation
2              stimulus stimulation vaccine mouse neuron target layer latency trial sequence
3              growth insulin image neuron domain field score strain parameter reaction
4              participant trial event image performance block stage activation stimulus frequency
5              measurement temperature blood image water sensor phase device field surface
6              mouse antibody infection blood animal cytokine culture tumor macrophage brain
7              surgery injury complication image technique lesion month blood strain brain
8              domain sequence mutation amino molecule peptide position virus compound family
9              intervention trial score child outcome participant month disorder symptom criterion
10             infant child score infection trial month season image correlation participant
11             brain seizure mouse frequency sequence woman blood trial error stimulation
12             network hospital family staff mouse program score participant member intervention
13             woman participant infection blood child prevalence brain status adult volume
14             antibody culture mouse membrane neuron medium image growth mutation fluorescence
15             student service country community people practice program question school survey

Table 4.10: 15 topics from 50 topic model from Noun-Corpus

the following criteria to create the corpora: a corpus with noun and verb type tokens (Noun-Verb-corpus)1, a corpus with noun and adjective type tokens (Noun-Adjective-corpus)2, and a corpus with noun, verb, and adjective type tokens (Noun-Verb-Adjective-corpus)3. Also, as a vocabulary reduction measure that can be compared across all three corpora, I removed the top 100 most frequent words from the dictionary of vocabulary frequencies created by Gensim for each newly created corpus. I then analysed the topics returned by the models.
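The vocabulary reduction step can be sketched with Gensim's Dictionary.filter_n_most_frequent; the toy corpus below is a placeholder:

```python
# A sketch of the vocabulary reduction, using Gensim's built-in
# filter_n_most_frequent; the toy corpus is a placeholder.
from gensim.corpora import Dictionary

docs = [["cell", "gene", "mouse"], ["cell", "protein", "gene"]]
dictionary = Dictionary(docs)

# drop the N most frequent tokens corpus-wide (N = 100 in the experiments,
# 1 here so the toy example has a visible effect)
dictionary.filter_n_most_frequent(1)

bow_corpus = [dictionary.doc2bow(d) for d in docs]
print(dictionary.token2id)  # one of the most frequent tokens is gone
```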

4.2.6.1 Noun-Verb Corpus

For the Noun-Verb corpus, I will only look at the first 15 topics returned by the model (see Table 4.11). I chose 15 topics because the topic diversity in this model is extremely low, and it is possible to judge its quality based on the first 15 topics returned by it; randomly choosing e.g. 20 of its topics instead would not have made a difference for my evaluation. I will compare it with the top 15 topics from the Noun-corpus (see Table 4.10).

Comparing the topics returned by both corpora, one can see that the topics from the Noun-corpus demonstrate great diversity. In contrast, the Noun-Verb-corpus exhibits an issue with its dictionary: most of the observed topics contain the words ‘cell’, ‘gene’, ‘mouse’, and ‘data’. These topics are also very bad, as most of them are identical, generic, and nonsensical. Furthermore, most of the 15 topics are, to a certain extent, mixed topics.

1 Tokens with POS tags: NN, NNS, NNP, NNPS, VB, VBD, VBG, VBN, VBP, and VBZ.
2 Tokens with POS tags: NN, NNS, NNP, NNPS, JJ, JJR, and JJS.
3 Tokens with POS tags: NN, NNS, NNP, NNPS, VB, VBD, VBG, VBN, VBP, VBZ, JJ, JJR, and JJS.


Model Evaluation

This model has severe issues with its vocabulary and cannot be used for any practical purposes. I also observed that despite the model containing words that fall under the category of verbs, none of the topic words appear explicitly to be verbs themselves. It could be the case that some of the tokens in the topics are homonyms of lemmatized verbs; for example, the tokens ‘effect’, ‘test’, ‘study’ and ‘result’ could possibly be verbs. Unfortunately, in the topic model it is not possible to say whether the tokens are nouns or verbs. This is due to how LDA works: it ignores the syntax of the tokens and focuses on word frequencies in documents. Moreover, during the text preprocessing stage I may have conflated tokens which are homonyms. It is highly likely that the homonyms that appear in the topics refer both to their verb as well as their noun forms.

Topic number   Topic words
1              cell protein figure antibody activity expression analysis study level membrane
2              cell study group data patient analysis difference result time level
3              cell protein data treatment expression study analysis level gene receptor
4              cell study patient data analysis effect group gene level disease
5              cell population region site bone analysis gene study number data
6              sequence analysis study gene data population time group result rate
7              patient study risk data group time year case result rate
8              cell figure analysis condition activity time result data study effect
9              study patient analysis data group result model treatment effect test
10             gene study patient expression cell group sequence data protein disease
11             cell expression mouse control level protein group analysis figure gene
12             specie population model effect time study data number group size
13             data cell study trial analysis time gene result effect treatment
14             study student patient group time level effect rate treatment analysis
15             cell patient study infection control mouse gene result expression level

Table 4.11: 15 topics from 50 topic model from Noun-Verb corpus

4.2.6.2 Noun-Adjective Corpus

For this corpus, I will compare the topics returned by the model using the Noun-Adjective-corpus with the results from our base Noun-corpus. I observed here as well that the model has issues with the dictionary, as the tokens ‘cell’, ‘study’, ‘group’, and ‘data’ are highly prevalent in all of the 15 topics of the Noun-Adjective model. In this case too, the topics are very generic and, in some cases, nonsensical.

Model Evaluation

Despite the fact that the model has issues with its dictionary, which should be reduced in order to yield sensible results, another aspect I noticed is that the topics do not contain any adjective-related tokens. I assume that the adjective tokens have a low frequency compared to their noun counterparts and therefore do not appear in the topic model. In summary, this model cannot be used because of the aforementioned reasons.

Topic number   Topic words
1              cell study analysis patient expression level gene data group figure
2              plant cell study control analysis group leaf sample level gene
3              cell response expression study data effect macrophage level infection group
4              cell expression tumor control mouse figure protein antibody level treatment
5              gene data study analysis figure level effect group control genotype
6              study risk population year group prevalence case cancer analysis woman
7              study level cell data group analysis concentration year child time
8              group study bone case result time difference patient data analysis
9              strain sample sequence group number vaccine study data analysis isolates
10             study group analysis protein data region activity patient gene level
11             treatment data study group analysis response sample time cell model
12             cell study figure analysis data expression group gene number control
13             compound study activity concentration reaction effect result acid group data
14             mutation study analysis cell nerve data sample group patient case
15             group system process study data change time research community level

Table 4.12: 15 topics from 50 topic model from Noun-Adjective corpus

4.2.6.3 Noun-Verb-Adjective Corpus

In this case too, the model generated from this corpus will be compared to the base Noun-corpus. Here as well, the tokens ‘cell’, ‘study’, ‘group’, and ‘data’ are quite prevalent in the generated topics. For this iteration, the topics are nonsensical and partially mixed in nature.

Model Evaluation

This model should theoretically exhibit topic words that are verbs and adjectives. Unfortunately, the observed tokens are mostly nouns. It could be the case that some of these nouns are homonyms of verbs, as discussed before (see 4.2.6.1). As for the adjectives, they are lacking in this topic model as well. The dictionary of the Noun-Verb-Adjective-corpus is also the cause of the lack of topic diversity in this model.

Topic number   Topic words
1              patient study cell treatment group level trial result data therapy
2              model data number value effect study result time parameter analysis
3              cell patient data analysis study effect mouse model figure result
4              cell data group study patient time figure protein analysis result
5              cell response study result gene expression effect analysis number receptor
6              sequence gene specie population data number analysis region genome site
7              patient study disease year data risk rate treatment analysis factor
8              data time sample analysis temperature solution particle figure result surface
9              study treatment result gene time patient group cell analysis effect
10             cell protein figure antibody control result data expression membrane time
11             group study effect analysis time data response control difference patient
12             health study care woman research data intervention group time service
13             cell concentration study activity effect data control figure analysis treatment
14             gene study cell level analysis data activity expression sample treatment
15             cell protein study level activity gene expression effect control result

Table 4.13: 15 topics from 50 topic model from Noun-Verb-Adjective corpus


Term Similarity between the Models

                 Noun-Verb   Noun-Adjective   Noun-Verb-Adjective
Noun             45.77%      44.09%           37.93%
Noun-Verb        -           61.8%            54.59%
Noun-Adjective   -           -                60.4%

Table 4.14: Percentage of identical terms in between the models

As a next step, I calculated the percentage of identical words between the topic words of the corpora. As seen in Table 4.14, the Noun-corpus has the fewest common terms with the other three corpora. As a further measure, I also looked at the number of unique topic words for each case (see Table 4.15).

               Noun   Noun-Verb   Noun-Adjective   Noun-Verb-Adjective
Unique terms   201    178         165              159

Table 4.15: Number of unique words found in all the topics
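A sketch of the two comparisons behind Tables 4.14 and 4.15. The thesis does not spell out the exact denominator of the overlap percentage, so this sketch uses intersection over union as one plausible reading; the topic-word lists are placeholders:

```python
# Sketch of the term-overlap and unique-term counts; toy topic-word lists
# stand in for the real model output, and the overlap definition
# (intersection over union) is an assumption.
noun_model = [["cancer", "tumor", "breast"], ["plant", "leaf", "soil"]]
noun_verb_model = [["cell", "gene", "data"], ["cancer", "tumor", "cell"]]

def topic_vocab(model):
    # all distinct topic words appearing anywhere in the model
    return {w for topic in model for w in topic}

a, b = topic_vocab(noun_model), topic_vocab(noun_verb_model)
overlap = 100 * len(a & b) / len(a | b)  # percentage of identical terms
print(f"overlap: {overlap:.2f}%")
print("unique terms:", len(a), len(b))  # cf. Table 4.15
```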

It can be seen that the Noun-corpus has the highest number of unique words in the topics. Even though the other models presumably contain tokens from different groups of POS tags, the number of unique tokens decreases as the number of token types increases.

The explanation for this can be found by looking at the topic models and the dictionary for each corpus. Even though the number of tokens in the corpus increases with each new group of tokens included, the topic words are different types of nouns; this is true for all the corpora created thus far. Hence, the issue of dictionary trimming becomes important, because with increasing numbers of tokens, removing only the top 100 tokens is not the best method for getting good topic models from the three corpora. For the verb- and adjective-based corpora, it has been mentioned that some of the vocabulary items may have been conflated during the lemmatization process. The word frequencies within the dictionaries are no longer identical due to this conflation. Therefore, for the larger Noun-Verb-, Noun-Adjective-, and Noun-Verb-Adjective-corpora, it would be advisable to remove a larger number of the most frequent terms than in corpora with fewer types of lemmas. Unfortunately, there are no guidelines that state the amount of high-frequency dictionary items that should be removed; this can only be determined via trial and error over multiple experiments.

Hence, I decided to discard the verbs and the adjectives from my future experiments. This decision is easily comprehensible if one considers the fact that, with the current subcorpus, a somewhat ideal cut-off point for reducing the dictionary has been found. In the following experiments, I will only work with the corpus of noun-related lemmas.

4.2.7 Experiment 6: Extracting Models with Distinct Topics Using Topic Similarity

Similar topics are an issue for topic models. In this section, I look at a measure for detecting similar topics and check whether, with the current model parameters, my models contain them.

In order to find the common topics, I ran the model with the Noun-corpus and the previously mentioned parameters 25 times. I then calculated whether the topics within a model are similar. Let us consider a topic model A, which consists of topics At1 to At50, where each topic has 10 topic words. If I want to check whether At1 is similar to At2, At3, ..., At50, I can check whether the topic words in At1 match the topic words in the other topics.

Here, I can use a similarity measure, the inter-topic-similarity, which is the amount of similarity required between two topics. If the inter-topic-similarity measure is 100%, then all the topic words in At1 are identical to all the topic words in At2; if the similarity measure is 60%, then only 6 out of 10 words, in any order, need to be identical between the two topics.

To check the inter-topic similarity between all the topics of a given model, one should first calculate the number of pairwise combinations of topics (k = 2) in a model of 50 topics (n = 50):

\[
\binom{n}{k} = \frac{n!}{k!\,(n-k)!} = \frac{50!}{2!\,(50-2)!} = 1225 \qquad (4.2)
\]

Then the inter-topic similarity can be calculated as the number of cases where two topics are the same (for a given similarity measure) divided by the number of possible topic pairs. For example, if model A has 1000 cases where a pair of topics is the same (for an inter-topic-similarity measure of 20%), then the inter-topic similarity for all the topics in model A is (1000/1225) × 100 = 81.63%. If one has multiple topic models, one can calculate the average of the inter-topic similarity scores over the models.
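The inter-topic-similarity computation described above can be sketched as follows; the toy model uses 3 topics of 3 words instead of 50 topics of 10 words:

```python
# A sketch of the inter-topic-similarity check: count topic pairs whose
# word overlap reaches the threshold, divided by C(n, 2) pairs
# (C(50, 2) = 1225 in the thesis). Toy topics of 3 words shown.
from itertools import combinations
from math import comb

def inter_topic_similarity(topics, threshold):
    n_words = len(topics[0])
    hits = sum(1 for t1, t2 in combinations(topics, 2)
               if len(set(t1) & set(t2)) >= threshold * n_words)
    return 100 * hits / comb(len(topics), 2)

toy = [["a", "b", "c"], ["a", "b", "d"], ["x", "y", "z"]]
print(inter_topic_similarity(toy, threshold=0.5))  # 33.33...: 1 of 3 pairs
```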


Figure 4.3: Average inter-topic similarity

Figure 4.3 shows the average inter-topic similarity for the models. With an inter-topic-similarity measure of 40%, the average inter-topic similarity of the model drops to 2.65%; this means that on average, in only 32.56 out of 1225 possible topic pairs, one could expect topics with 4 out of 10 words in common. For 50%, the average inter-topic similarity of the model drops to 0.43%.

A low average inter-topic similarity score at a high inter-topic-similarity measure indicates that the model does not have many identical topics. This shows that with the current set of parameters, the models do not have many identical topics (see Figure 4.3).

4.2.8 Experiment 7: Extracting Stable Models

As discussed before, LDA models tend to be unstable, as each run of the model returns a different set of topics in a different order (see 2.3.2). Hence, the task is to find a model that is stable enough to be used for further testing and experiments. In the following section, I will try to find a way to judge the stability of my topic models.

As mentioned before, topic model A consists of topics At1, At2... At50 and similarly

another model B consists of topics Bt1 to Bt50. In order to check whether there are any similarities between the models, one could try to match the topics from model A with those of model B. This can be done by checking whether the topic words of At1 match those of any topic in model B. It is highly likely that the match is not perfect; hence, we can have results such as: At1 matches Bt3 (6 out of 10 words) and Bt6 (7 out of 10 words). I go through all the topics and choose the one with the highest score, so here At1 matches Bt6. I then remove At1 and Bt6 from my comparison setup and continue in the same way with At2 until At50. If the scores are identical, I choose the first one; for example, if At7 matches both Bt27 and Bt35 with 5 common words, I match At7 with Bt27. In order to calculate the similarity between two models, the inter-model-similarity, I sum the similarity counts of the matched topics, e.g. At1 and Bt6 contribute a similarity of 7. Two models can have a maximum inter-model-similarity score of the number of topic words per topic multiplied by the total number of topics; in our case this is 10 × 50 = 500. Thus, if two models are identical (modulo permutation of the topics), their inter-model-similarity score would be 500. I calculated the average similarity score for the aforementioned models: I chose one model as a constant and compared it with the other models (model A with model B, then model A with model C, etc.). This resulted in an average inter-model-similarity score of 171.25, or in other words, the models are 34.25% similar to each other.

However, I realized that this measure is not adequate to check the similarity between the models. This is due to the fact that by trying to match every topic in one model with every topic in another model, the matches are sometimes very much forced.

I tried a similar approach as explained above; however, this time I only calculated whether a topic was found or not, i.e. whether At1 can be matched with any of the topics in model B. Moreover, I also accounted for the number of common words between the topics, as explained in section 4.2.7.

This can be explained in the following manner. I take model A as my base model and try to match At1 with any of the topics in model B, given a specific inter-topic-similarity measure. For example, with an inter-topic-similarity measure of 0.7 (7 out of 10 words should be common), At1 matched at least one topic in model B, At2 matched none, and so on. At the end of this example, the result is that 32 topics from model A found a counterpart in model B at the given inter-topic-similarity measure. The inter-model similarity is then (32/50) × 100 = 64%.

The logic behind this measure is that if the inter-model similarity score is high for a high inter-topic similarity score, then the models are similar to each other, and have a variety of topics that are not identical.
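A sketch of this second, found-or-not measure; the toy models are placeholders:

```python
# A sketch of the found-or-not inter-model similarity: the share of topics
# in model A that match at least one topic in model B at a given
# inter-topic-similarity threshold. Toy models stand in for real output.
def inter_model_similarity(model_a, model_b, threshold):
    n_words = len(model_a[0])          # topic words per topic (10 in the thesis)
    found = 0
    for topic_a in model_a:
        words_a = set(topic_a)
        # does any topic in B share at least threshold * n_words words?
        if any(len(words_a & set(topic_b)) >= threshold * n_words
               for topic_b in model_b):
            found += 1
    return 100 * found / len(model_a)

a = [["a", "b", "c"], ["x", "y", "z"]]
b = [["a", "b", "d"], ["p", "q", "r"]]
print(inter_model_similarity(a, b, threshold=0.6))  # 50.0: 1 of 2 topics matched
```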

                 number of passes
similarity (%)   10     20     30     40     50     60     70     80     90     100    200    300    400    500
10               93.2   91.6   91.6   90.4   88.8   92.8   90.4   93.6   88.4   90.4   89     91     91     94
20               79.6   80.4   82     82     80.4   80.4   85.2   80     82     85.2   83     84     85     84
30               68.4   74     78.4   75.2   76     79.2   81.2   79.6   78.8   80.4   84     82     84     79
40               61.6   62     70.8   66.8   69.2   72     76.4   73.2   75.6   78.4   76     78     76     79
50               51.2   54     56     56     60.8   61.6   66.8   66     66.4   70.4   67     74     69     70
60               37.2   40.8   43.2   41.6   48.4   49.6   54     52.4   48.8   54     58     61     51     49
70               23.6   28     28.4   26.8   34.4   34.4   39.6   36.8   32     42.8   41     49     41     34
80               11.6   18.4   16     17.2   20     24.4   26.8   18.8   16     28.4   20     32     22     20
90               5.2    5.2    8.8    4.8    7.2    11.6   12     8      10.8   16     9      17     13     10
100              0.4    0.4    2      0.4    0.4    3.2    2      1.6    2.4    4.4    2      2      3      4

Table 4.16: Topic similarity based on the number of similar words over multiple passes

I calculated the inter-model similarity for different inter-topic-similarity measures. As my preliminary results were rather low, I changed another parameter of my LDA model, the number of passes, which is the number of times the model goes through the entire corpus, and calculated the scores for those models. For every change in the number of passes, I ran the model 5 times. The scores in Table 4.16 show the average inter-model-similarity scores.

One can see in Table 4.16 that the similarity between the models increases with the number of passes. Moreover, there is greater similarity between the models if one takes the number of common words between the topics into consideration: intuitively, the lower the minimum number of common words, the more similar the models are to each other. As one can see, with an inter-topic similarity of 50% and 60% at 100 passes, model similarities of 70.4% and 54% respectively are achieved, so the models appear to be stable. With 100 passes one achieves a fairly adequate level of similarity, and after 100 passes the level of similarity does not increase significantly any more.

At this point, the hyper-parameters for the topic models have been set, and a fairly stable set of topic models has been found. I will choose one of the models with 100 passes, namely the one used as the base model, as my topic model for the upcoming sections.

4.2.9 Topic Labelling

There are many ways of automatically labelling the topics within a topic model (see 2.4.1). However, despite their usability, they are either dependent on external APIs (Lau et al. [2011]) or require further models to be trained (Bhatia et al. [2016]). Instead of referring to a topic merely by a number, I propose a rudimentary

approach of referring to topics by their topic number and the top three topic words in a hyphenated construct. For example, topic 19 (see Table 5.1) can also be referred to as topic 19-surgery-injury-complication. This conveys more information about the content of the topic. A comprehensive list of all 50 topics can be found in the appendix (see Table A.1).
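A sketch of this labelling scheme; the helper name is of course arbitrary:

```python
# A sketch of the labelling scheme; label_topic is a hypothetical helper.
def label_topic(topic_number, topic_words):
    # topic number plus the top three topic words, hyphenated
    return f"topic {topic_number}-" + "-".join(topic_words[:3])

words = ["surgery", "injury", "complication", "pain", "technique"]
print(label_topic(19, words))  # -> topic 19-surgery-injury-complication
```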

5 Data and Topic Exploration

5.1 Document Topics Distribution

Checking the topic distribution per document can be a challenging task, because LDA represents a document as a probability distribution over multiple topics.

For example, the document in Figure 5.1 is represented by topic 34-cancer-tumor-breast (see Table 5.1) with a topic probability of 0.9034, whereas the document in Figure 5.2 is represented by topic 19-surgery-injury-complication with a probability of 0.4172 and topic 28-lung-liver-platelet with a probability of 0.4142. The topic words from both topics 19 and 28 are present in the article in Figure 5.2.

In some cases it is easy to guess which topic best represents the document (e.g. topic 34-cancer-tumor-breast for Figure 5.1). However, for other documents this distinction is not easy, as shown for Figure 5.2: the probability difference between the top 2 topics is 0.4172 − 0.4142 = 0.003. There are also cases in the corpus where the probability scores of the top topics are identical.

Topic number   Topic words
19             surgery injury complication pain technique nerve catheter pressure operation vein
28             lung liver platelet respiratory mortality fibrosis admission sepsis count fluid
34             cancer tumor breast survival metastasis growth carcinoma lung tumour chemotherapy

Table 5.1: Topics 19, 28, 34

5.2 Data Exploration

5.2.1 Topic Distribution

Trends in the data can be discovered by looking at the topic probability distribution within the corpus. The logic for this approach is as follows: for the 150,000 documents in the corpus, using Gensim, one can get the document topic probability for each document that was used to create the model. For example, for a fictitious document D4, the topic probabilities are 0.85 for topic 2 and 0.1 for topic 3 (see


Table 5.2). After gathering the topic probability scores for all the documents, one can calculate the topic probability distribution for a given topic. For topic 3 in the example shown in Table 5.2, the probability distribution can be visualised by creating a histogram over all 150,000 data points (documents in the corpus) for which the topic probability is greater than 0 (e.g. 0.45, ..., 0.1).
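A minimal sketch of this collection step, assuming a trained Gensim LDA model; get_document_topics with minimum_probability=0.0 returns the full per-document topic distribution (a toy corpus is shown):

```python
# A minimal sketch (toy corpus) of collecting per-document topic
# probabilities with Gensim; minimum_probability=0.0 keeps all topics.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["cancer", "tumor", "patient"], ["virus", "vaccine", "antibody"]]
dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(bow_corpus, id2word=dictionary, num_topics=2, passes=10)

for doc_id, bow in enumerate(bow_corpus):
    # list of (topic_id, probability) pairs for this document
    print(doc_id, lda.get_document_topics(bow, minimum_probability=0.0))
```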

Figure 5.1: Topic words in article (green: topic 34)
Figure 5.2: Topic words in article (green: topic 19, yellow: topic 28)

Based on the corpus of 150,000 PubMed documents, Figure 5.3 shows the topic probability distribution of topic 19 over the corpus. The x-axis represents the topic probability, and the y-axis shows the number of articles in each category. As can be seen, this graph does not convey much information, because in many cases the articles exhibit a low probability towards this topic. In order to visualise the data better, I only selected the cases where the topic probability is greater than 0.1 (Figure 5.4).

Document ID   Year   Topic 1   Topic 2   Topic 3   Topic 4
D1            2000   0.25      0         0         0
D2            2000   0         0         0.45      0
D4            2001   0         0.85      0.1       0
D5            2001   0         0         0         0.3

Table 5.2: Fictitious topic probability distribution over multiple topics and documents

For most of the 50 topics, the probability distribution looks similar to that of topic 19 (see Figure 5.4). As for topics 39-adult-host-male and 41-image-volume-measurement, one can observe sudden spikes in the topic distribution. For topic 41, there is a sudden increase in the number of articles at a probability of approximately 0.5. For


Figure 5.3: Topic probability distribution of documents of topic 19
Figure 5.4: Topic probability distribution of documents of topic 19, where topic probability > 0.1

topic 39, the spike in the number of articles occurs at topic probabilities of approximately 0.7 and 0.8. For topic 6-specie-specimen-margin, the probability distribution does not fall exponentially, but stagnates a little. For topic 25-domain-molecule-chain, the drop in the number of articles with increasing probability is observed until 0.6; then the number of articles with higher probability increases before falling again (see Figures 5.5 and 5.6). In general, the higher the topic probability, the fewer the articles that belong to that category. Beyond this, the information is not useful for detecting trends in the data.

5.2.2 Average Topic Probability

To observe the temporal changes in the data, the information shown before (see 5.2.1) must be transformed to show the diachronic aspects. For a given topic, an article i has the topic probability ki. During a given year y, there are ny articles published. Thus, one can calculate the yearly average topic probability Ay (see Equation 5.1). For example, in the fictitious example in Table 5.2, the average yearly topic probability for the year 2000 and topic 3 is (0 + 0.45)/2 = 0.225.

\[
A_y = \frac{\sum_{i=1}^{n_y} k_i}{n_y} \qquad (5.1)
\]
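Equation 5.1 applied to the toy data of Table 5.2 can be sketched as follows:

```python
# A sketch of Equation 5.1 on the toy data of Table 5.2: the yearly average
# topic probability for topic 3.
from collections import defaultdict

# (year, topic-3 probability) per document, as in Table 5.2
docs = [(2000, 0.0), (2000, 0.45), (2001, 0.1), (2001, 0.0)]

by_year = defaultdict(list)
for year, k in docs:
    by_year[year].append(k)

averages = {year: sum(ks) / len(ks) for year, ks in by_year.items()}
print(averages)  # {2000: 0.225, 2001: 0.05}
```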

This process reveals temporal trends in the data based on the topics. I looked at a time frame from 2000 to 2015. The following graphs show the results (Figures 5.7-5.10). The x-axis represents the year and the


Figure 5.5: Topic probability distribution of documents of topics 6, 10, 39, and 41, where topic probability > 0.1

y-axis represents the average topic probability. It should be mentioned that the data points are not continuous; the lines drawn between them are only a visual aid.

In most cases the average topic probability fluctuates in a given time range or remains more or less constant. However, there are cases where one can observe an increase in the topic probability. As one can see in Figure 5.7, the topic probabilities for the topics 23-trial-month-therapy and 33-lesion-diagnosis-biopsy fluctuate, with a gradual tendency to increase over time. For topic 10-surface-temperature-particle, after a dip in the early 2000s, the probability also rises sharply.

In other cases, the opposite is true: the average topic probability gradually drops over time. This is illustrated in Figure 5.8, where the topic probability has either gradually dropped, as is the case for topic 11-antibody-vector-construct, or has dropped and remained stable over the years, e.g. topics 12-activation-inhibitor-phosphorylation and 28-care-intervention-service.

There are also instances where a topic suddenly peaks and then its topic probability


Figure 5.6: Topic probability distribution of documents of topic 25, where topic probability >0.1

Figure 5.7: Average topic probability of documents from 2000 to 2015 of topics 10, 23, and 33

wanes gradually over time. This is shown in Figure 5.9, where topic 25-domain-molecule-chain suddenly exhibits a surge in topic probability in the late 2000s and then declines sharply over the following years. Peaks are also observed for the topics 2-sequence-genome-specie and 5-parameter-correlation-probability.

Finally, Figure 5.10 shows that a topic can vanish over time. This is the case for topic 50-exposure-skin-smoking, which suffers a sharp decline in 1995 and does not reappear in the corpus after 1999.


Figure 5.8: Average topic probability of documents from 2000 to 2015 of topics 11, 12, 17, 21, 28, and 43

Figure 5.9: Average topic probability of documents from 2000 to 2015 of topics 2, 5, and 25

Thus, using the average yearly topic probability, one can observe temporal trends in the data, namely changes in diachronic topic probability, where it increases (see Figure 5.7), decreases (see Figure 5.8), exhibits peaks (see Figure 5.9), and stops being popular (see Figure 5.10).


Figure 5.10: Average topic probability of documents from 1980 to 2005 of topic 50

5.3 Observing Diachronic Trends Using Topic Models

With the help of the average topic probability and a diachronic corpus, it has been shown that diachronic trends within a corpus can be detected using a topic model (see 5.2.2). Moreover, trends in the data vary based on the type of topic. Thus, using topic models one can detect surges as well as declines of certain topics within these corpora. These observations support my claim that it is indeed possible to detect temporal trends in a corpus by using topics generated by a topic modeling algorithm (see 1.2).

5.4 Topic Exploration

In the upcoming sections, I will analyse the following topics: 13-woman-heart-pregnancy, 22-infection-virus-vaccine, 34-cancer-tumor-breast, and 38-infection-resistance-bacteria (see Table 5.3). I chose these topics not only because I know the meaning of the topic words in them, but also because the topics themselves have a cohesive theme. As for their average topic probability, these topics exhibit a certain variability (see Figure 5.11): they have gained as well as lost popularity between 2000 and 2015. However, simply looking at the average topic probability does not convey much information about the content of these topics. Hence, in the following sections, I will look into the variability that exists within the topics themselves.


Topic number   Topic words
13             woman heart pregnancy pressure birth hypertension infant delivery mother week
22             infection virus vaccine antibody vaccination antigen replication titer influenza transmission
34             cancer tumor breast survival metastasis growth carcinoma lung tumour chemotherapy
38             infection resistance bacteria isolates strain culture pathogen phage tuberculosis coli

Table 5.3: Four topics selected for data exploration

Figure 5.11: Average topic probability of documents from 2000 to 2015 of topics 13, 22, 34, and 38

5.4.1 Frequency of Topic Words in the Corpus

To analyse the contents of the individual topics, I looked at the relative frequency of the topic words as they occur in the corpus. The relative frequency for each year was calculated as follows:

yearly relative frequency = (number of documents in which the topic word occurred in a year) / (number of documents published that year)
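A sketch of this per-year relative frequency computation on a toy corpus of (year, token set) pairs:

```python
# A sketch of the yearly relative frequency of a topic word; the corpus of
# (year, token set) pairs is a toy stand-in.
from collections import Counter

corpus = [(2000, {"pregnancy", "birth"}),
          (2000, {"heart", "pressure"}),
          (2001, {"pregnancy", "infant"})]

def yearly_relative_frequency(word):
    published = Counter(year for year, _ in corpus)          # docs per year
    containing = Counter(year for year, toks in corpus if word in toks)
    return {y: containing.get(y, 0) / published[y] for y in sorted(published)}

print(yearly_relative_frequency("pregnancy"))  # {2000: 0.5, 2001: 1.0}
```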

The aim of this approach is to demonstrate the usage of the topics and the diachronic changes that have occurred to them. As seen in Figure 5.12, it is difficult to see any trends, as the graph is highly cluttered by the 10 topic words. Hence, for better visibility of the trends, I divided the topic words into two groups: pregnancy related words (e.g. ‘birth’, ‘infant’, ‘mother’, and ‘pregnancy’) and words related to heart disease (e.g. ‘heart’, ‘hypertension’, ‘pressure’). I discarded the words ‘week’ and ‘woman’ from the analysis, as they are too general and can occur in many articles regardless of the topic. As for the pregnancy related words, it can be seen that there


Figure 5.12: Relative frequency of topic words for topic 13-woman-heart-pregnancy

has been a steep increase in the usage of these words since 2001, but the usage has somewhat stagnated since 2005 (see Figure 5.13). As for the words related to heart disease, their usage has steadily increased over time (see Figure 5.14). Another observation made during this process was that by analysing the trends among the topic words, one can detect groups within a given topic.

Unlike in the example mentioned above, one can clearly detect groups within the topic words of topic 22, as demonstrated in Figure 5.15. It is clearly visible that from 2005 onwards the topic words form three distinct groups. Based on the graph in Figure 5.15, one can divide the topic words into three groups, as shown in Examples 5.2-5.4.

(5.2) ‘antibody’, ‘infection’

(5.3) ‘replication’, ‘transmission’, ‘virus’, ‘antigen’

(5.4) ‘influenza’, ‘vaccination’, ‘vaccine’, ‘titer’

Among these immunology related terms from topic 22-infection-virus-vaccine, one is able to observe logical groupings, as is the case for Example 5.4, where the topic words ‘vaccination’ and ‘vaccine’ fall into the same semantic category. Nonetheless, as these are low frequency terms within the specific grouping

of topic words, it is somewhat difficult to see the changes in their usage within the topic as a whole. Hence, I plotted the topic words from Example 5.4 on a separate graph for better visibility of the temporal trends (see Figure 5.16). Unlike in Figure 5.15, where due to the scaling of the image the relative frequency of the terms from Example 5.4 appears to have remained somewhat stable since 2007, in Figure 5.16 it can be observed that the frequency of these topic words has fluctuated over time. I applied a similar approach to visualise the topics in Example 5.3 (see Figure 5.17); however, this visualisation was not helpful for ascertaining temporal trends amongst the topic words within this group. Thus, by means of visualisation it is also possible to detect groups within a topic and observe how these groups evolve temporally.

Figure 5.13: Relative frequency of pregnancy related words from topic 13
Figure 5.14: Relative frequency of heart disease related words from topic 13

It can be seen here that the internal trends within a topic can be examined by looking at the diachronic relative frequency of the topic words. This task can be made easier with the help of a dynamic interface (see Chapter 7).

5.4.2 Diachronic Shifts within a Topic

In the previous section it was demonstrated that one can use the relative frequency of the words within a model to show diachronic shifts and trends that exist within a topic. Thus, it has been shown that using topic modeling one can detect diachronic changes within the words of a given topic. Consequently, this provides an answer to my second research question.


Figure 5.15: Relative frequency of topic words for topic 22-infection-virus-vaccine

Figure 5.16: Relative frequency of immunology related words from topic 22 (group 2)
Figure 5.17: Relative frequency of immunology related words from topic 22 (group 3)

5.5 Frequency of Popular Words within a Topic

A topic does not only consist of its topic words; one also has to consider other underlying trends within the articles of a given topic. These trends can be analysed by visualising the relative frequency of the most popular words in them. For

my analysis, I selected articles that belong to a certain topic, based on their document topic probability: if a document has a topic probability greater than zero for a specific topic, then it belongs to that topic. As a consequence, a document can belong to multiple topics. From this subset of articles, I then calculated the most frequent words, i.e. the words with the highest absolute frequency in this subcorpus of articles. I then visualised the relative frequency of these popular words within the subcorpus in a diachronic manner.
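A sketch of this subcorpus selection and word counting; the document representation is a placeholder:

```python
# A sketch of the subcorpus selection: a document belongs to a topic if its
# probability for that topic is greater than zero; the most frequent words
# are then counted over the subset. Document data is a toy stand-in.
from collections import Counter

documents = [  # (tokens, {topic_id: probability}) per document
    (["woman", "heart", "brain"], {13: 0.6, 22: 0.0}),
    (["virus", "brain", "vaccine"], {13: 0.0, 22: 0.8}),
    (["heart", "pressure", "woman"], {13: 0.4, 22: 0.1}),
]

def popular_words(topic_id, top_n=10):
    counts = Counter()
    for tokens, probs in documents:
        if probs.get(topic_id, 0) > 0:   # membership criterion
            counts.update(tokens)
    return counts.most_common(top_n)

print(popular_words(13))
# [('woman', 2), ('heart', 2), ('brain', 1), ('pressure', 1)]
```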

Figure 5.18: Relative frequency of words from topic 13 (top 1-5 words)
Figure 5.19: Relative frequency of words from topic 13 (top 6-10 words)

In Figures 5.18 and 5.19 one can see the diachronic relative frequency of the 10 most popular words in topic 13-woman-heart-pregnancy. Similarly, the diachronic frequencies of the popular words in topic 22-infection-virus-vaccine are displayed in Figures 5.20 and 5.21.

Figure 5.20: Relative frequency of words from topic 22 (top 1-5 words)
Figure 5.21: Relative frequency of words from topic 22 (top 6-9 words)

Specifically for the word ‘brain’, the diachronic relative frequencies differ significantly between topics 13 and 22. Thus, one can assume that the articles that occur with these topics exhibit a different thematic focus, despite the usage of common words (see Figures 5.18¹ and 5.21²).

5.5.1 Diachronic Popularity of Non-topic Word Related Terms

In the previous section, I demonstrated that by using topic modeling one can partition a corpus into different topic groups via the document topic probability (see 5.5). In these partitions, one can show diachronic trends in the most popular words. I was able to detect diachronic changes in the usage of words within documents that fall under a specific topic, consequently answering my third and final research question.

1 The word ‘brain’ is displayed by the turquoise line.
2 The word ‘brain’ is displayed by the blue line.

6 Results and Discussion

6.1 Research Question Nr. 1

I was able to show with the help of the average topic probability that topics exhibit diachronic trends within a corpus, and I visualised them over a period of 15 years for certain topics. As mentioned before (see 2.3), LDA is an unsupervised machine learning technique, and there does not exist any ground truth against which I can compare my results. Furthermore, each iteration of the model generates a different output, which makes LDA a weak candidate when it comes to replicating the methods mentioned in this Master's thesis: even with the same data and model parameters, one could get different results based on the probabilities calculated by the model. Thus, the results here serve to demonstrate my hypothesis as stated in the research questions, and the only way to demonstrate my claim was with the help of visualisations, as the goal was to show diachronic trends in the data.

LDA aids in the automatic generation of subdivisions within the corpus. LDA does not create the diachronic trends, however, as they already exist in the data set. Still, the topics made by LDA do reflect thematic entities that are present in the corpus, and hence exhibit semantic coherence between the topic words. This is the case for topic 34-cancer-tumor-breast, where the topic is made up mostly of words that are about cancer or somehow belong to cancer-related discourse. A similar argument can also be made for topic 22-infection-virus-vaccine, where the topic words belong to the theme of immunology. In other cases, the topics are mixed (see 2.3.1.2), but the subtopics in them demonstrate semantic coherence; for example, topic 13-woman-heart-pregnancy is about pregnancy and heart disease. Therefore, following the trend of such topics might be to some extent helpful; however, one should also take the trends in the subtopics into consideration (see 6.2).

It should be emphasised that these topics are specific to the corpus and are not about the general historical trends in the field of biomedical publication. Hence, the results may vary for a different subset of documents. As mentioned before (see 4.2.4.1), the documents used for model creation only consist of 150 thousand articles, which correspond to about 8-11% of the articles published each year, selected randomly. These topics could be of interest to anyone wishing to explore a data set and check why a certain trend occurs in the corpus. It should also be mentioned that in order to acknowledge the validity of a given topic, one should consult a domain expert.

6.2 Research Question Nr. 2

I was able to demonstrate diachronic shifts within a topic. These temporal changes in the usage of topic words throughout the entire corpus show how certain terms within a given topic can exhibit different levels of usage over time. The results also showed that words belonging to a specific subtopic have relative frequencies that are closer to each other.

A side effect of this approach is that one can use it to check the quality of a given topic. As demonstrated by topic 13-woman-heart-pregnancy, vague and generic terms that occur in a topic tend to exhibit high relative frequencies within the corpus. This result is logical, as generic terms are expected to occur in many documents.

In some cases, the topic words with low relative frequencies show an interesting property: if these topic words are similar to each other, they tend to be grouped together, and this grouping can be detected in the visualisation. This was the case for topic 22-infection-virus-vaccine, where similar words had similar diachronic relative frequencies. Hence, one can use the diachronic visualisation to detect thematic subdivisions within a topic.

6.3 Research Question Nr. 3

Using the corpus subdivisions created by LDA, one can also check trends of words within a topic. This approach, too, cannot be quantified against a pre-existing ground truth, as the results are entirely corpus-dependent. However, looking at the topics and the trends of the popular words within the topic-related subcorpus, this approach proves useful for analysing trends within the data. For topic 13-woman-heart-pregnancy one observes that the usage of the word ‘brain’ is fairly high, whereas a comparison with topic 22-infection-virus-vaccine shows a different relative frequency of the word ‘brain’ and a very different diachronic trend as well. This shows the different thematic focus of the topics, and that the most popular words in each topic are different. Furthermore, the difference in diachronic trends of the relative frequencies of words shared between certain topics emphasises the temporal thematic focus of these topics. Hence, it is possible to detect diachronic changes in the usage of words in a specific topic.

6.4 Summary

In summary, it can be said that even though the three aforementioned approaches cannot be quantified, the results exhibit clear trends that appear logical for the topics I analysed in detail. Visualisations are key to judging the viability of these diachronic trends. As an exploratory approach to analysing the underlying trends in unlabelled data, visualisations are adequate for judging the validity of the claims made in the research questions. As the aim of all three research questions was to demonstrate visible diachronic trends in the data, a quantifiable measure was not yet required. For future work in this domain, however, it would be advisable to find a metric that quantifies the trends shown in the visualisations.

7 Website

As it is quite challenging to view the results of this Master’s thesis in a non-interactive interface, I have built a companion website for diachronic topic visualisation (DiaTopVis), where one can access and view the data interactively. This chapter introduces the website and explains its features and functionalities. The aim of the website is to provide the user with an easy-to-use interface for exploring the topics and the underlying themes that exist within them. The project was inspired by the work of Wang and McCallum [2006] and Song et al. [2014]; the interface was likewise inspired by a previous project of mine (Ghoshal et al. [2017]). The website consists of three primary sections, each providing an interactive interface to explore the answers to one of the three research questions tackled in this thesis. There are also sections where one can view the topic models and the topics in them. Finally, there is a help section which assists the user with navigation and orientation and provides explanations for each section.

7.1 Generating Charts

The graphs on the website are created using the C3 library¹. C3 is a JavaScript library (based on D3²) that can be used to generate charts in multiple formats. An advantage of C3 is that the graphs are interactive: the user can select or remove a line within the chart by clicking on the legend entry associated with it, and the y-axis is dynamic and immediately responds to the addition or removal of a line. Furthermore, by placing the cursor on any given point in the chart, the user is shown the values of all visible lines at that cross section of the chart.

¹ http://c3js.org/examples.html
² https://d3js.org/


7.2 Website Sections

7.2.1 Observing Diachronic Trends in Topics

This section of the website provides the user with a tool to interactively observe the average topic probabilities of chosen topics. On the website this section, labelled ‘Part 1: Generate diachronic topic distribution’, guides the user through a series of steps that result in a graph showing the average topic probability of documents for a chosen set of topics and a specified time range. In ‘Step 1’ the user can choose between multiple topics, presented as check boxes. In ‘Step 2’ the user specifies the time range to be looked at. Finally, ‘Step 3’ generates the topic distribution (see Figures 7.1 and 7.2).

Figure 7.1: Website: Part 1 User options

Figure 7.2: Website: Part 1 Example output for topics 2,3,4,5


7.2.2 Generate Frequency of Topic Words in the Corpus

In this section, the website provides a tool to look at the distribution of the topic words in the corpus. On the website, this section is called ‘Part 2: Generate topic words distribution’. In a first step the user chooses a topic from a drop-down menu. In a second step the user specifies the time range to be looked at. The final step, again, generates the distribution. The graphs here are similar to the ones generated in 5.4.1: the x-axis shows the year and the y-axis the relative frequency of the topic words in the entire corpus that was used to create the model. After the graph is generated, it shows the relative frequencies for all ten topic words. As this view can be a little cluttered, the user can remove some of the topic words by clicking on their names in the legend below. The advantage of this section is that one can set a time scale and toggle the number of visualised topic words with only a few clicks (see Figure 7.3).

Figure 7.3: Website: Part 2 Example output for topic 13 (topic words shown partially)

7.2.3 Frequency of Popular Words within a Topic

This part of the website can be used to visualise the popular words that occur in documents belonging to a specific topic. The section is called ‘Part 3: Generate distribution of top word(s) in a topic’ on the website. The graphs generated here are the ones that were used to demonstrate the popularity of non-topic related terms (see 5.5). The user first chooses a specific topic from the drop-down menu and then selects the years for which they would like to visualise the words. As a next step, the user can choose the range of popular words they wish to view; this range can be set using two HTML input buttons with a step attribute¹. That means, for example, that if the user sets the range of words to be shown to between 2 and 4, the generated graph will show the results for the second, third and fourth most popular words. Finally, the user can simply press a button to generate the relative frequencies of the words from the topic subcorpus. These are the three key sections of the website where the user has the opportunity to explore the topic models through an interactive interface (see Figure 7.4).

Figure 7.4: Website: Part 3 Example output for topic 13 (top 2-5 words shown)

¹ https://www.w3schools.com/tags/att_input_step.asp

8 Diachronic Topic Modeling Pipeline

In the methodology chapter, I described the steps taken to arrive at the final topic model used to answer the research questions. This chapter focuses on the pipeline, which takes PubMed XML documents as input and returns topic models as well as the CSV files that can be used to create the visualisations I used for answering my research questions. Figure 8.1 shows the structure of the pipeline, which is explained in the following sections.

8.1 Data Extraction

8.1.1 Extract Metadata

The first part of the pipeline consists of data extraction. Here a script reads a PubMed XML file and extracts the required metadata. For the purpose of this Master’s thesis, only the year of publication was extracted, but the script can be extended to allow the extraction of other metadata. The year of publication is saved to a separate file; this metadata will be used in the data mapping section (see 8.3).
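To illustrate this step, the following is a minimal sketch of such a metadata extraction, assuming the PMC NXML layout in which the publication year sits in a pub-date/year element; the element path and file names are illustrative, not the exact code used in this thesis.

```python
import json
import xml.etree.ElementTree as ET

def extract_year(xml_path):
    """Return the publication year of a PubMed article, or None.

    Assumes the PMC NXML layout, where the year sits in a
    pub-date/year element; other schemas need a different path.
    """
    root = ET.parse(xml_path).getroot()
    year = root.find(".//pub-date/year")
    return int(year.text) if year is not None else None

# Illustrative usage: save a document-id-to-year mapping for the
# data mapping stage (file names are placeholders).
metadata = {"PMC12345": extract_year("PMC12345.xml")}
with open("years.json", "w") as f:
    json.dump(metadata, f)
```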

8.1.2 Extract Text

In the text extraction part of the pipeline, the article text from the corpus undergoes different levels of text preprocessing and filtering. These processes are explained in detail in the following sections (see 8.1.2.1 and 8.1.2.2).

8.1.2.1 Text Pre-processing

Here the article text is extracted from the XML file. The text is then divided into sentences using a sentence splitter, and as a next step these sentences are tokenized into words. There is an optional step in which one can filter the results based on POS tagging (see 8.1.2.1.1). Then the tokens in the sentences are lemmatized.

Figure 8.1: Diachronic topic modeling pipeline (flowchart with three stages: Data Extraction — metadata extraction and text extraction with sentence segmentation, word tokenization, optional POS tagging, lemmatization, token filtering and corpus creation; LDA Topic Modeling — dictionary creation, optional dictionary editing, LDA corpus creation and LDA model creation; Data Mapping — producing tabular data as output)

8.1.2.1.1 POS Tagging of the Corpus

This is an optional part of the text extraction section where one can POS-tag the corpus and choose to keep only the tokens that fall under a set of POS tags defined by the user.
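As an illustration, the following minimal sketch shows how such a preprocessing chain could look with NLTK (Bird et al. [2009]); the chosen POS tags (nouns and adjectives) and the lowercasing are illustrative assumptions rather than the exact configuration used here.

```python
import nltk
from nltk.stem import WordNetLemmatizer

# Assumes the 'punkt', 'averaged_perceptron_tagger' and 'wordnet'
# resources were fetched beforehand via nltk.download().
lemmatizer = WordNetLemmatizer()
KEPT_TAGS = {"NN", "NNS", "JJ"}  # illustrative: keep nouns and adjectives

def preprocess(article_text):
    """Sentence-split, tokenize, POS-filter (optional) and lemmatize."""
    tokens = []
    for sentence in nltk.sent_tokenize(article_text):
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
            if tag in KEPT_TAGS:  # the optional POS filtering step
                tokens.append(lemmatizer.lemmatize(word.lower()))
    return tokens

print(preprocess("The virus caused severe infections in mice."))
```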

8.1.2.2 Token Filtering

In this step, the corpus is reduced by removing stopwords, punctuation and numbers. It is also possible to remove tokens below a certain length; for example, for my corpus I only kept tokens with at least three characters. The output of this step is sent to the corpus creation part of the pipeline (see 8.1.3).
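A minimal sketch of such a filtering step follows, assuming the NLTK English stopword list; the exact stopword source used in the pipeline may differ.

```python
import string
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))  # assumed stopword source
PUNCTUATION = set(string.punctuation)

def filter_tokens(tokens, min_length=3):
    """Drop stopwords, punctuation, numbers and tokens shorter
    than min_length characters."""
    return [t for t in tokens
            if t not in STOPWORDS
            and not all(c in PUNCTUATION for c in t)
            and not t.isdigit()
            and len(t) >= min_length]

print(filter_tokens(["the", "virus", ",", "12", "is", "dangerous"]))
# -> ['virus', 'dangerous']
```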

8.1.3 Corpus Creation

In this section of the pipeline, the data from the token filtering process is converted into a Python list. The list is then saved as a JSON file. The purpose of creating a corpus at this stage is twofold: firstly, the corpus will be used to create the LDA model (see 8.2.4), and secondly, at a later stage this corpus will also be used to calculate word frequencies.

8.2 LDA Topic Modeling

In this section, certain aspects such as corpus creation, dictionary creation and LDA model creation are explained. An advantage of this part of the pipeline is that the user can always reuse the data from the previous section to generate different models. This data recycling aspect can be useful as it saves time.

8.2.1 Dictionary Creation

The files from the previously created corpus are used to create a dictionary that will be used by the LDA model. This dictionary is created by Gensim and contains information about the word frequencies in the corpus. At a later stage, this dictionary serves as one of the input parameters for the corpus creation that Gensim requires to make an LDA model.
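In Gensim this step essentially amounts to constructing a gensim.corpora.Dictionary from the token lists; the following sketch uses a toy two-document corpus and an illustrative file name.

```python
from gensim.corpora import Dictionary

# Toy corpus: in the pipeline, the token lists come from the JSON
# files created in section 8.1.3.
documents = [["infection", "virus", "vaccine"],
             ["cancer", "tumor", "breast"]]

dictionary = Dictionary(documents)  # maps each word to an integer id
# dictionary.dfs holds per-word document frequencies used later on.
dictionary.save("corpus.dict")  # illustrative file name
```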

8.2.2 Editing the Original Dictionary

As mentioned multiple times in the methodology chapter (see Chapter 4), the dictionary plays a key role in the quality of the resulting topic model. High-frequency words in the dictionary may lead to the creation of identical topics and subsequently to a poor-quality topic model. However, creating a new dictionary from a corpus can be a time-consuming endeavour. For this reason the pipeline contains a dictionary reduction section where the user can specify which parts of the vocabulary should be reduced (e.g. high-frequency words, low-frequency words or words with certain user-defined features). The output is a reduced dictionary which will be used to create the corpus.
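Gensim’s Dictionary class offers filter_extremes and filter_n_most_frequent for this kind of reduction; the thresholds in the following sketch are illustrative, not the values used for the final model.

```python
from gensim.corpora import Dictionary

dictionary = Dictionary.load("corpus.dict")
# Illustrative thresholds: drop words occurring in fewer than 5 or in
# more than 50% of all documents, then keep the 100,000 most frequent.
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)
# Alternatively, remove the n most frequent words outright:
# dictionary.filter_n_most_frequent(50)
dictionary.save("corpus_reduced.dict")
```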


8.2.3 LDA Corpus Creation

In this section, Gensim creates a corpus based on the frequencies recorded in the aforementioned dictionary (original or reduced). The corpus itself is not human-readable but contains information about word locations and frequencies in the documents.
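A minimal sketch of this conversion, again with a toy corpus: doc2bow turns each token list into sparse (word id, frequency) pairs, and MmCorpus.serialize stores the result for reuse.

```python
from gensim.corpora import Dictionary, MmCorpus

dictionary = Dictionary.load("corpus_reduced.dict")
documents = [["infection", "virus", "vaccine"],
             ["cancer", "tumor", "breast"]]  # toy stand-in for the corpus

# doc2bow turns a token list into sparse (word id, frequency) pairs;
# words missing from the (reduced) dictionary are silently skipped.
bow_corpus = [dictionary.doc2bow(tokens) for tokens in documents]
MmCorpus.serialize("corpus.mm", bow_corpus)  # store for reuse
```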

8.2.4 LDA Model Creation

This part of the pipeline creates the LDA model. Gensim takes the dictionary and the LDA corpus as input and generates the model based on them. Here the user has the option to set certain parameters of the model, such as the number of topics, the chunk size or the number of passes. Based on this information, Gensim creates an LDA model, from whose output one can extract the topics. Moreover, the model contains additional information, such as the document topic probabilities for all documents that were used to create the corpus. The topics as well as the document topic probabilities will be used to generate the diachronic topic information for some parts of the desired output.
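A minimal sketch of the model creation step; apart from the 50 topics, which match the final model in Table A.1, the parameter values are illustrative.

```python
from gensim.corpora import Dictionary, MmCorpus
from gensim.models import LdaModel

dictionary = Dictionary.load("corpus_reduced.dict")
bow_corpus = MmCorpus("corpus.mm")

# 50 topics matches the final model (Table A.1); chunksize and
# passes are illustrative values.
lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
               num_topics=50, chunksize=2000, passes=5)

print(lda.show_topics(num_topics=5, num_words=10))
# Per-document topic probabilities, needed for the data mapping:
doc_topics = [lda.get_document_topics(bow) for bow in bow_corpus]
```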

8.3 Data Mapping

In order to generate output that answers my research questions, it is necessary to map the information generated in the data extraction section (see 8.1) to that of the LDA topic modeling section (see 8.2). The key issue here is mapping the documents in the corpus that was created during the data extraction process (see 8.1.3) to the document topic probabilities that were calculated during the LDA model creation process (see 8.2.4).

8.3.1 Mapping: Creating Yearly Average Topic Probability

For calculating the average topic probability for every year, I first map the document IDs used by Gensim to the metadata information that was extracted before (see 8.1.1). Then I calculate the topic probabilities for all documents in the corpus and save this data as a CSV file. The average topic probabilities for every year are calculated using the data from this file and are saved for future use such as visualisation.
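A minimal sketch of this mapping with toy inputs; the variable names doc_years and doc_topics stand for the metadata and model outputs described above and are illustrative.

```python
import csv
from collections import defaultdict

# Illustrative toy inputs; in the pipeline these come from the metadata
# file (see 8.1.1) and the LDA model (see 8.2.4).
doc_years = {"doc1": 2003, "doc2": 2003, "doc3": 2004}
doc_topics = {"doc1": [(0, 0.8), (1, 0.2)],
              "doc2": [(0, 0.4), (1, 0.6)],
              "doc3": [(0, 0.5), (1, 0.5)]}
NUM_TOPICS = 2

sums = defaultdict(lambda: [0.0] * NUM_TOPICS)  # year -> summed probabilities
counts = defaultdict(int)                       # year -> number of documents
for doc_id, topics in doc_topics.items():
    year = doc_years[doc_id]
    counts[year] += 1
    for topic_id, prob in topics:
        sums[year][topic_id] += prob

with open("avg_topic_probability.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["year"] + [f"topic_{i}" for i in range(NUM_TOPICS)])
    for year in sorted(sums):
        writer.writerow([year] + [s / counts[year] for s in sums[year]])
```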


8.3.2 Mapping: Generating Relative Frequencies for Topic Words

For calculating the diachronic trends of the topic words, I also map the document IDs to the metadata. Then, for each of the topic words, I extract their occurrences from the corpus files (see 8.1.3) and calculate their yearly relative frequencies using the method described in section 5.4.1. The output is saved as a CSV file for visualisation purposes.
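A minimal sketch of this computation, assuming (as in section 5.4.1) that the relative frequency of a word in a year is its occurrence count divided by the total number of tokens published that year; the toy corpus is illustrative.

```python
import csv
from collections import Counter, defaultdict

# Toy inputs: document id -> (publication year, token list).
corpus = {"doc1": (2003, ["virus", "vaccine", "virus"]),
          "doc2": (2004, ["virus", "infection"])}
topic_words = ["virus", "vaccine"]

word_counts = defaultdict(Counter)  # year -> per-word counts
token_totals = Counter()            # year -> total token count
for year, tokens in corpus.values():
    word_counts[year].update(tokens)
    token_totals[year] += len(tokens)

with open("topic_word_frequencies.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["year"] + topic_words)
    for year in sorted(token_totals):
        writer.writerow([year] + [word_counts[year][w] / token_totals[year]
                                  for w in topic_words])
```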

8.3.3 Mapping: Generating Relative Frequencies for Popular Words in Topic Subcorpora

Finally, for calculating the relative frequencies of words, I divide my corpus into multiple subcorpora, where each subcorpus represents a topic. For the documents that belong to a specific subcorpus, I count all words and their absolute frequencies. Then I select the top 250 words with the highest absolute frequency and calculate their relative frequencies using the same method as in the previous section (8.3.2). Again, the output is saved as a CSV file for visualisation purposes.
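A minimal sketch of the subcorpus counting with collections.Counter; the assignment of documents to subcorpora and the toy inputs are illustrative.

```python
from collections import Counter

# Toy subcorpora: topic id -> token lists of its documents; in the
# pipeline, documents are assigned via their topic probabilities.
subcorpora = {13: [["woman", "heart", "heart"], ["pregnancy", "heart"]],
              22: [["virus", "vaccine"], ["virus", "infection"]]}

top_words = {}
for topic_id, documents in subcorpora.items():
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    # Keep the 250 words with the highest absolute frequency.
    top_words[topic_id] = counts.most_common(250)
```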

8.4 Other Functions

This pipeline also contains other functions that can be useful for judging the quality of a topic model. One function calculates the topic similarity of multiple models based on the number of shared words between their topics. Another function calculates the inter-topic similarity within a single topic model, which can be used as a measure of the model’s quality.
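As one possible reading of such a word-overlap measure, the following sketch computes the Jaccard index over two topics’ word sets; this is an illustration, not necessarily the exact formula implemented in the pipeline.

```python
def topic_overlap(words_a, words_b):
    """Similarity of two topics as the share of common topic words
    (Jaccard index over the two word sets)."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

# Illustrative usage with two topic word lists:
print(topic_overlap(["virus", "vaccine", "infection"],
                    ["virus", "infection", "antibody"]))  # 0.5
```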

8.5 Summary

The pipeline described in this chapter helps the user create their own topic models and CSV tables for data visualisation. Unlike off-the-shelf topic modeling tools, it provides the user with predefined pre-processing functions. Due to its modular structure, it is possible to skip certain steps; this is true for the data extraction part, where one can skip the part-of-speech tagging. In the topic modeling section, the modular structure saves the user time, as the output generated in the corpus creation (see 8.2.3) and dictionary creation (see 8.2.1) sections can always be reused by the LDA model (given these were created from the same original corpus). Furthermore, the data mapping structure helps the user to identify the original files from which the LDA model was created. This is a clear advantage over the current version of Gensim, which does not keep track of the original input files used to create the corpus and the dictionary.

9 Conclusion

In this Master’s thesis, I implemented topic modeling on biomedical texts to detect temporal trends within the data. I took a systematic approach towards topic modeling of biomedical texts and, by conducting a series of experiments, arrived at the topic model used to answer my research questions.

For my first research question, I was able to demonstrate temporal trends within the topics that were generated using an LDA model. By means of the average topic probability, temporal fluctuations in the popularity of the topics were observed.

For my second research question, I then delved deeper into the topics themselves in order to investigate the impact of the topic words throughout the corpus. For this I looked at the relative frequency of the topic words and observed temporal trends within them. Moreover, I was able to observe diachronic groupings of topic words. These groupings indicated the quality of the topic words: semantically similar topic words tend to be grouped together, whereas generic words tend to exhibit a high frequency and are grouped further away from the semantically coherent topic words.

For the third research question, I observed temporal trends using topic modeling within documents that belong to a specific topic. I was able to demonstrate that the popular words within a subcorpus of documents belonging to a specific topic tend to have different diachronic relative frequencies. Moreover, the top 250 most popular words tend to differ as well, depending on the topic the subcorpus represents.

As a further feature, I created a website to demonstrate the results that have been found. The website has a general three-part structure, where each part concentrates on the results of one of the research questions.

Finally, I was also able to create a pipeline that enables the user to produce results similar to those shown in this thesis. The pipeline contains a text extraction component with a focus on text preprocessing and a topic modeling component.


The aim of the pipeline is to facilitate the topic modeling process for PubMed XML files and to map document metadata from the corpus and information from the topic model together into data tables that can be used by researchers.

9.1 Future Work

In the scope of this Master’s thesis I was only able to scratch the surface of diachronic topic modeling using a large corpus of biomedical texts. I would like to improve my current work, especially the website, so that it also shows the articles with the highest topic probabilities for a specific time frame. I also aim to focus on finding the most relevant document for a specific topic, which I was not able to do for this project, as I did not have access to domain experts who could have evaluated the output of my system.

In future, I plan on gathering articles from only a few sources on a specific topic and trying to match the temporal trends with actual historical developments in the field. For example, articles belonging to the general theme of immunology could be matched with historical events such as major outbreaks of diseases and discoveries of cures. I was not able to implement such an aspect in this project, as I first needed to test the viability of the framework.

References

S. Bhatia, J. H. Lau, and T. Baldwin. Automatic Labelling of Topics with Neural Embeddings. ArXiv e-prints, Dec. 2016.

S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python. O’Reilly Media, Inc., 1st edition, 2009. ISBN 0596516495, 9780596516499.

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

P. P. Bonissone. Machine Learning Applications, pages 783–821. Springer Berlin Heidelberg, Berlin, Heidelberg, 2015. ISBN 978-3-662-43505-2. doi: 10.1007/978-3-662-43505-2_41. URL http://dx.doi.org/10.1007/978-3-662-43505-2_41.

J. Boyd-Graber, D. Mimno, and D. Newman. Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements. CRC Handbooks of Modern Statistical Methods. CRC Press, Boca Raton, Florida, 2014. URL docs/2014_book_chapter_care_and_feeding.pdf.

J. Chang, S. Gerrish, C. Wang, J. L. Boyd-Graber, and D. M. Blei. Reading tea leaves: How humans interpret topic models. In Advances in neural information processing systems, pages 288–296, 2009.

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391, 1990.

S. ElShal, M. Mathad, J. Simm, J. Davis, and Y. Moreau. Topic modeling of biomedical text. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 712–716, Dec 2016. doi: 10.1109/BIBM.2016.7822606.

P. Ghoshal, J. Goldzycher, and S. Clematide. Bbdia: Diachronic visualisation of semantically related n-grams using word embeddings. Conference Poster, June 2017. Poster presentation at SwissText 2017: 2nd Swiss Text Analytics Conference.


T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50–57. ACM, 1999.

A. Holzinger, J. Schantl, M. Schroettner, C. Seifert, and K. Verspoor. Biomedical Text Mining: State-of-the-Art, Open Problems and Future Challenges, pages 271–300. Springer Berlin Heidelberg, Berlin, Heidelberg, 2014. ISBN 978-3-662-43968-5. doi: 10.1007/978-3-662-43968-5_16. URL http://dx.doi.org/10.1007/978-3-662-43968-5_16.

K. Hornik and B. Grün. topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13):1–30, 2011.

C.-C. Huang and Z. Lu. Community challenges in biomedical text mining over 10 years: success, failure and the future. Briefings in Bioinformatics, 17(1): 132–144, 2016. doi: 10.1093/bib/bbv024. URL http://bib.oxfordjournals.org/content/17/1/132.abstract.

J. H. Lau, K. Grieser, D. Newman, and T. Baldwin. Automatic labelling of topic models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11, pages 1536–1545, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics. ISBN 978-1-932432-87-9. URL http://dl.acm.org/citation.cfm?id=2002472.2002658.

E. Leopold. Models of Semantic Spaces, pages 117–137. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007. ISBN 978-3-540-37522-7. doi: 10.1007/978-3-540-37522-7_6. URL http://dx.doi.org/10.1007/978-3-540-37522-7_6.

A. K. McCallum. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu, 2002.

D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum. Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, pages 262–272, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics. ISBN 978-1-937284-11-4. URL http://dl.acm.org/citation.cfm?id=2145432.2145462.

D. Ramage and E. Rosen. Stanford TMT, 2009. URL http://nlp.stanford.edu/software/tmt/tmt-0.4/.


R. Řehůřek and P. Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. URL http://is.muni.cz/publication/884893/en.

M. Song, S. Kim, G. Zhang, Y. Ding, and T. Chambers. Productivity and influence in bioinformatics: A bibliometric analysis using PubMed Central. Journal of the Association for Information Science and Technology, 65(2): 352–371, 2014. ISSN 2330-1643. doi: 10.1002/asi.22970. URL http://dx.doi.org/10.1002/asi.22970.

I. Titov and R. T. McDonald. A Joint Model of Text and Aspect Ratings for Sentiment Summarization. In ACL, volume 8, pages 308–316. Citeseer, 2008.

A. J. van Altena, P. D. Moerland, A. H. Zwinderman, and S. D. Olabarriaga. Understanding big data themes from scientific biomedical literature through topic modeling. Journal of Big Data, 3(1):23, 2016. ISSN 2196-1115. doi: 10.1186/s40537-016-0057-0. URL http://dx.doi.org/10.1186/s40537-016-0057-0.

H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In Proceedings of the 26th annual international conference on machine learning, pages 1105–1112. ACM, 2009.

H. Wang, Y. Ding, J. Tang, X. Dong, B. He, J. Qiu, and D. J. Wild. Finding complex biological relationships in recent PubMed articles using Bio-LDA. PLOS ONE, 6(3):1–14, 03 2011. doi: 10.1371/journal.pone.0017243. URL https://doi.org/10.1371/journal.pone.0017243.

X. Wang and A. McCallum. Topics over Time: A non-Markov Continuous-time Model of Topical Trends. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, pages 424–433, New York, NY, USA, 2006. ACM. ISBN 1-59593-339-5. doi: 10.1145/1150402.1150450. URL http://doi.acm.org/10.1145/1150402.1150450.

X. Wei and W. B. Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 178–185. ACM, 2006.

Y. Yang, S. Pan, J. Lu, M. Topkara, and Y. Song. The Stability and Usability of Statistical Topic Models. ACM Trans. Interact. Intell. Syst., 6(2):14:1–14:23, July 2016. ISSN 2160-6455. doi: 10.1145/2954002. URL http://doi.acm.org/10.1145/2954002.


B. Zheng, D. C. McLean, and X. Lu. Identifying biological concepts from a protein-related corpus with a probabilistic topic model. BMC Bioinformatics, 7 (1):58, 2006. ISSN 1471-2105. doi: 10.1186/1471-2105-7-58. URL http://dx.doi.org/10.1186/1471-2105-7-58.

P. Zhu, J. Shen, D. Sun, and K. Xu. Mining meaningful topics from massive biomedical literature. In 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 438–443, Nov 2014. doi: 10.1109/BIBM.2014.6999197.

A Tables

Topic number  Topic words
1   strain growth medium culture production bacteria coli yeast mutant plate
2   sequence genome specie family position domain alignment amino tree database
3   genotype marker locus frequency polymorphism variant allele trait variation haplotype
4   mutation mutant embryo phenotype deletion stage loss mouse defect domain
5   parameter correlation probability error prediction estimate feature performance equation simulation
6   specie specimen margin length view dorsal head genus surface seta
7   specie water soil temperature community season abundance habitat climate diversity
8   frequency phase input stimulation stimulus neuron amplitude noise power electrode
9   transcript file probe mirnas microarray array mrna transcription pathway mirna
10  surface temperature particle energy solution layer water film property nanoparticles
11  antibody vector construct sequence plasmid transfection domain buffer blot lane
12  activation inhibitor phosphorylation pathway kinase apoptosis inhibition receptor death growth
13  woman heart pregnancy pressure birth hypertension infant delivery mother week
14  food insulin cholesterol obesity intake consumption diabetes weight alcohol mass
15  bone fracture cartilage spine knee teeth joint surface root pain
16  mouse antibody medium hour culture section animal week serum plate
17  migration adhesion microtubule actin formation image localization junction motility filament
18  brain mouse neuron cortex animal cord motor hippocampus nerve matter
19  surgery injury complication pain technique nerve catheter pressure operation vein
20  promoter methylation transcription histone chromatin chip element regulation modification motif
21  membrane fluorescence vesicle fusion microscopy mitochondrion transport image localization surface
22  infection virus vaccine antibody vaccination antigen replication titer influenza transmission
23  trial month therapy week baseline event outcome intervention medication cohort
24  network cluster module node database edge user tool feature file
25  domain molecule chain bond ligand conformation energy crystal position atom
26  child participant student school family parent people question experience behavior
27  care intervention service practice program management community provider staff hospital
28  lung liver platelet respiratory mortality fibrosis admission sepsis count fluid
29  score disorder item scale symptom depression pain questionnaire anxiety correlation
30  adult host male female larva stage mosquito fitness generation selection
31  chromosome cycle replication damage repair phase focus testis irradiation follicle
32  drug administration agent injection mg/kg combination toxicity dose efficacy therapy
33  lesion diagnosis biopsy examination tumor resection recurrence month mass feature
34  cancer tumor breast survival metastasis growth carcinoma lung tumour chemotherapy
35  vessel segment artery wall plaque toxin aggregation formation flow lesion
36  serum kidney plasma vitamin donor biomarkers correlation creatinine marker laboratory
37  plant leaf seed root rice arabidopsis growth fruit wheat stage
38  infection resistance bacteria isolates strain culture pathogen phage tuberculosis coli
39  review search quality paper publication criterion strategy literature science heterogeneity
40  task trial participant memory stimulus performance word session face block
41  image volume measurement sensor intensity position field detection device resolution
42  primer reaction sequence product amplification cycle clone polymerase extraction detection
43  mouse macrophage cytokine production antibody inflammation activation receptor immune antigen
44  differentiation stem culture collagen proliferation marker growth progenitor medium fibroblast
45  stress metabolism acid enzyme iron production synthesis pathway oxygen reaction
46  receptor channel release calcium activation neuron stimulation solution action voltage
47  muscle exercise movement force motor hand training strength limb fiber
48  compound solution acid reaction water mass mixture extract spectrum fraction
49  country prevalence cost survey mortality status death hospital proportion household
50  exposure skin smoking smoker tobacco cigarette worker hair product meta

Table A.1: All 50 topics from final model

Curriculum Vitae

Personal Details

Parijat Ghoshal
Lerchenberg 31
8046 Zürich
[email protected]

Education

2015: Bachelor of Arts in English Languages and Literatures at the University of Bern
since 2015: M.A. studies in Multilingual Text Analysis at the University of Zurich

Professional Activities

July 2016 – September 2016: Machine Learning Intern at Neue Zürcher Zeitung
since October 2016: Data Scientist at Neue Zürcher Zeitung

Selbstständigkeitserklärung
