
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS STOCKHOLM, SWEDEN 2019

LSTM vs Random Forest for Binary Classification of Insurance Related Text

HANNES KINDBOM

KTH SCHOOL OF ENGINEERING SCIENCES

LSTM vs Random Forest for Binary Classification of Insurance Related Text

HANNES KINDBOM

Degree Projects in Applied Mathematics and Industrial Economics (15 hp)
Degree Programme in Industrial Engineering and Management (300 hp)
KTH Royal Institute of Technology, year 2019
Supervisor at Hedvig: John Ardelius
Supervisors at KTH: Tatjana Pavlenko and Julia Liljegren
Examiner at KTH: Jörgen Säve-Söderbergh

TRITA-SCI-GRU 2019:151 MAT-K 2019:07

Royal Institute of Technology School of Engineering Sciences KTH SCI SE-100 44 Stockholm, Sweden URL: www.kth.se/sci

Abstract

The field of natural language processing has received increased attention lately, but less focus is put on comparing models that differ in complexity. This thesis compares Random Forest to LSTM for the task of classifying a message as question or non-question. The comparison was done by training and optimizing the models on historic chat data from the Swedish insurance company Hedvig. Different types of word embedding were also tested, such as Word2vec and Bag of Words. The results demonstrated that LSTM achieved slightly higher scores than Random Forest in terms of F1 and accuracy. The models’ performance was not significantly improved after optimization, and it also depended on which corpus the models were trained on.

An investigation of how a chatbot would affect Hedvig’s adoption rate was also conducted, mainly by reviewing previous studies about chatbots’ effects on user experience. The potential effects on the innovation’s five attributes, relative advantage, compatibility, complexity, trialability and observability were analyzed to answer the problem statement. The results showed that the adoption rate of Hedvig could be positively affected, by improving the first two attributes. The effects a chatbot would have on complexity, trialability and observability were however suggested to be negligible, if not negative.

Keywords

Random Forest, Classification, Natural Language Processing, Machine Learning, Neural Networks, Bag of Words, Bachelor Thesis, Diffusion of Innovation, Adoption Rate, User Experience


Summary

The scientific field of natural language processing has received increased attention lately, but less focus is directed at comparing models that differ in complexity. This bachelor thesis compares Random Forest with LSTM by investigating how well the models can be used to classify a message as question or non-question. The comparison was made by training and optimizing the models on historical chat data from the Swedish insurance company Hedvig. Different types of word embedding, such as Word2vec and Bag of Words, were also tested. The results showed that LSTM achieved slightly higher F1 and accuracy than Random Forest. The models’ performance was not significantly improved by optimization, and the result also depended on which corpus the models were trained on.

An investigation of how a chatbot would affect Hedvig’s adoption rate was also carried out, mainly by reviewing previous studies of chatbots’ effect on user experience. The potential effects on the five attributes of an innovation (relative advantage, compatibility, complexity, trialability and observability) were analyzed in order to answer the problem statement. The results showed that Hedvig’s adoption rate can be positively affected by improving the first two attributes. The effects a chatbot would have on complexity, trialability and observability were, however, judged to be negligible, if not negative.

Keywords

Random Forest, Classification, Natural Language Processing, Machine Learning, Neural Networks, Bag of Words, Bachelor Thesis, User Experience


Acknowledgements

Firstly, I would like to thank the Swedish insurance company Hedvig for providing me with adequate data and for the project idea. Special thanks to Hedvig’s CTO John Ardelius and his colleague Ali Mosavian for excellent guidance throughout the entire process.

I would also like to acknowledge my supervisor Tatjana Pavlenko at the Department of Mathematics, who always showed dedication and interest in my work. Tatjana allocated time for advice and discussion whenever I ran into trouble or had questions. I would also like to express my gratitude to my supervisor Julia Liljegren at the Department of Industrial Economics and Management. Julia primarily helped me with questions related to the second part of the bachelor thesis.

2019-05-22

Author

Hannes Kindbom, Industrial Engineering and Management KTH Royal Institute of Technology

Supervisors

Associate Professor Tatjana Pavlenko, Department of Mathematics, KTH Royal Institute of Technology
PhD Julia Liljegren, Department of Industrial Economics and Management, KTH Royal Institute of Technology
CTO John Ardelius, Hedvig AB

Contents

1 Introduction
    1.1 Background
    1.2 Problem statements
    1.3 Purpose
    1.4 Delimitations

Part I - Binary Classification with LSTM and Random Forest

2 Theoretical Background
    2.1 Introduction to Machine Learning
    2.2 Word embedding
        2.2.1 Bag of Words
        2.2.2 Word2vec
    2.3 Classification
        2.3.1 Random Forest
        2.3.2 LSTM
    2.4 Validation
        2.4.1 Training and Test Split
        2.4.2 K-fold Cross-validation
        2.4.3 Evaluation Metrics
        2.4.4 Optimization of Hyperparameters
    2.5 Related Work

3 Method
    3.1 Data
        3.1.1 Raw Data
        3.1.2 Formatting
        3.1.3 Labeling
        3.1.4 Additional Data
        3.1.5 Train and Test Data
    3.2 Implementing Random Forest
        3.2.1 Word embedding for Random Forest
        3.2.2 Random Forest
    3.3 Implementing LSTM
        3.3.1 Word2vec
        3.3.2 LSTM
    3.4 Hyperparameter Optimization and Valuation
    3.5 Human Baseline

4 Results
    4.1 Optimization of Hyperparameters
    4.2 Human Baseline
    4.3 Feature Importance
    4.4 The Final Models

Part II - Chatbot’s Effect on Adoption Rate

5 Theoretical Background
    5.1 Diffusion of Innovations
    5.2 SWOT

6 Method
    6.1 Literature Search
    6.2 Selection

7 Results
    7.1 Literature Study
    7.2 Five Attributes of Hedvig
        7.2.1 Relative Advantage
        7.2.2 Compatibility
        7.2.3 Complexity
        7.2.4 Trialability
        7.2.5 Observability
    7.3 SWOT

Part I and II - Discussion

8 Discussion
    8.1 Classification with LSTM and Random Forest
        8.1.1 Word Embedding
        8.1.2 Optimization of Hyperparameters
        8.1.3 Data and Human Baseline
    8.2 Chatbot’s Effect on Adoption Rate
    8.3 Answer to Problem Statements

References

9 Appendices

1 Introduction

1.1 Background

Machine learning has received increased attention lately, mainly due to improvements in computational power and access to vast amounts of data. The sub-field natural language processing (NLP), which deals with how to utilize computers to process and analyze language data, has been especially heavily researched. A common task within NLP is to create chatbots, which can be used to reduce response time and increase availability for users. Much of the focus is directed at improving the performance of these chatbots, in which text classification plays a main part. ”Text classification is an important task in NLP with many applications, such as web search, information retrieval, and document classification.” [1]

This thesis is focused on text classification, on behalf of the Swedish insurance company Hedvig. Different approaches for determining whether a given message is a question or not will be evaluated. The problem is modeled as an NLP problem and can be decomposed into two main parts. The first part is to represent the piece of text (also called the corpus) as something the computer can interpret. The current state of the art is to use artificial neural networks to create word vectors which make up a semantically meaningful vector space. One widely used model for this is Word2vec [2].

The second part of the problem is to use these word vectors in some classifier. ”Recently, models based on neural networks have become increasingly popular” [1], and recurrent neural networks like the LSTM (Long Short-Term Memory) are an example of such models.

Apart from this rather technical study, an analysis of how a chatbot would affect the adoption rate of Hedvig is made. It is interesting to examine whether the investments in and increased attention for chatbots can be justified from the perspective of adoption rate. The thesis is divided into three parts, where the first one deals with the classification problem and the second part contains a study of how a chatbot would affect the rate of adoption. Lastly, both problems are discussed and conclusions drawn in the third part.

1.2 Problem statements

This thesis seeks to provide answers to the following questions:

How does the relatively more complex machine learning model LSTM compare to the easier to interpret Random Forest when it comes to classifying insurance related messages as questions or non-questions?

The sub-question to be considered is: How would a chatbot affect the adoption rate of Hedvig?

1.3 Purpose

The purpose of the project is to investigate whether or not it is justified to use a complex model, such as the LSTM, instead of one that is more straightforward to interpret, like the Random Forest in this case. The results may be of interest to many, since one generally does not want to use unnecessarily complicated models: complicated models generally require more data, and a higher level of competence is necessary to maintain and develop them. The acquired insights and knowledge will also hopefully contribute to the scientific field of natural language processing and machine learning in a broader sense.

Furthermore, the implemented models are of special importance for Hedvig, since they strive to automate their services to a greater extent. To be able to determine whether a message is a question or not could hopefully be the beginning of a chatbot. They are also interested in finding out how implementing a chatbot would affect the rate of adoption. This is especially critical for Hedvig, since it is a startup, currently raising financial capital to expand the business dramatically.

1.4 Delimitations

The study is delimited to classifying messages written in Swedish and sent via either Hedvig’s app or website. The messages will be analyzed individually and without context. Also, only two separate models for the classification task are evaluated, namely LSTM and Random Forest. Another, rather unintentional, delimitation is that the data upon which the analysis is based was not labeled as questions or non-questions by humans beforehand. This could affect the performance of the models, since supervised learning, which depends on labeled data, is used. The data is explained in detail in section 3.1.

Additionally, this thesis is delimited to investigating the dimension of adoption rate, or more precisely how Hedvig’s adoption rate would be affected if a chatbot was implemented. No empirical studies are made, but rather a study of relevant literature, using the theory about adoption and diffusion of innovation explained in section 5.1. It should be emphasized that this study is delimited to the effects on Hedvig specifically. The results and conclusions may therefore only serve as inspiration to companies similar to Hedvig.

Part I

Binary Classification with LSTM and Random Forest

2 Theoretical Background

2.1 Introduction to Machine Learning

There are a few established definitions of machine learning, and one of the first, stated by the pioneer Arthur Samuel in 1959, is:

”Machine learning: Field of study that gives computers the ability to learn without being explicitly programmed.” [3]

Even though this definition can be perceived as vague, it gives an intuition of what the field is about. More recently, a more detailed definition has been given by Tom M. Mitchell:

”A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” [4]

One special type of machine learning is supervised learning, which this thesis is exclusively focused on. In supervised learning, there is a so-called response measurement $y_i$ associated with each predictor measurement $x_i$, $i = 1, ..., n$. The objective is then to fit a model which relates $y_i$ to $x_i$, such that the response for future observations can be predicted accurately or such that the relationship between response and predictors is better understood [5, p. 26].

Furthermore, a distinction between qualitative and quantitative variables can be made. Quantitative variables take on numerical values, and qualitative ones take on values in one of K different classes. Problems with qualitative responses are usually referred to as classification problems, while problems with quantitative responses are referred to as regression problems [5, p. 28]. This paper deals solely with a binary classification problem, where the two classes are question and non-question, denoted Q (encoded as 1) and A (encoded as 0) respectively.

2.2 Word embedding

In this section, various types of vector representations of the messages (from here on referred to as documents, $d_j$) are presented, starting with more straightforward ones such as Bag of Words and ending with a more complex word embedding based on neural networks, called Word2vec.

2.2.1 Bag of Words

To be able to classify the documents, predictors (from here on called features), denoted $x_1, ..., x_n$, have to be constructed. The method Bag of Words can be used to create these, by transforming each document $d_j$ into a vector $x_{d_j} = [x_1, ..., x_i, ..., x_n] \in \mathbb{R}^n$. Here, n is a prespecified integer, which equals the number of words in the considered vocabulary, and $x_i$ is equal to the number of occurrences of word i in the document. This means that Bag of Words is based on the idea that similar documents have similar words, and it does not take the ordering of words into consideration [6, pp. 108-109]. Moreover, the simple Bag of Words representation does not deal with misspellings, since two documents containing the same words spelled in two different ways are treated as two distinct documents.

Note that the vocabulary of size n should contain words that are believed to describe the different classes. Words like ”what” and ”insurance” might intuitively be more relevant than ”astronaut” and ”spacecraft”, when it comes to classifying insurance related messages as Q or A.

A simple example where n = 7 and the vocabulary is [how, insurance, what, does, cost, much, good] illustrates Bag of Words. Let d1 = ”How much does it cost” and d2 = ”The insurance is good”. Then xd1 = [1, 0, 0, 1, 1, 1, 0] and xd2 = [0, 1, 0, 0, 0, 0, 1].
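The example above can be reproduced with a few lines of Python. This is a minimal sketch, assuming that documents are lowercased and split on whitespace; the function name `bag_of_words` is chosen here for illustration and is not taken from the implementation used in the thesis.

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Return the raw-count vector of `document` over `vocabulary`."""
    # Assumption: lowercasing and whitespace tokenization are sufficient.
    counts = Counter(document.lower().split())
    # Words outside the vocabulary (e.g. "it") are simply ignored.
    return [counts[word] for word in vocabulary]

vocabulary = ["how", "insurance", "what", "does", "cost", "much", "good"]
x_d1 = bag_of_words("How much does it cost", vocabulary)
x_d2 = bag_of_words("The insurance is good", vocabulary)
print(x_d1)  # [1, 0, 0, 1, 1, 1, 0]
print(x_d2)  # [0, 1, 0, 0, 0, 0, 1]
```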

Even though this simple form of Bag of Words provides an easy to understand representation of text, the raw frequency of words is said to be a bad measure of association between words. Words like ”the” and ”it” tend to occur frequently together with any word, and will therefore not be informative. At the same time, words that appear nearby frequently are more informative than very infrequent ones. The tf-idf algorithm is a method for dealing with this paradox [7, p. 112]. There exist several variants of tf-idf, and this thesis demonstrates the one implemented in Scikit-learn (footnote 1). tf-idf means term frequency times inverse document frequency and is defined as in (1):

$$\text{tf-idf}(i, d) = \text{tf}(i, d) \cdot \text{idf}(i), \quad \text{where} \quad \text{idf}(i) = \log_{10} \frac{1 + n_d}{1 + \text{df}(d, i)} \tag{1}$$

where tf(i, d) denotes the raw frequency of word i in document d and $n_d$ equals the total number of documents in the corpus. df(d, i) holds the number of documents that contain word i. The values tf-idf(i, d) then form the components of vectors of length n, which are normalized with the Euclidean norm [8].
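Equation (1) can be sketched directly in Python. The snippet below follows the formula as it is written in this section, including the base-10 logarithm and the Euclidean normalization; the function name and the three toy documents are assumptions made for illustration. Note that Scikit-learn's documentation states its smoothed idf with a natural logarithm and an added constant 1, so values produced by the library may differ from this sketch.

```python
import math
from collections import Counter

def tfidf_vectors(documents, vocabulary):
    """Compute Euclidean-normalized tf-idf vectors following equation (1)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_d = len(tokenized)  # total number of documents in the corpus
    # df[w]: number of documents that contain word w
    df = {w: sum(1 for tokens in tokenized if w in tokens) for w in vocabulary}
    idf = {w: math.log10((1 + n_d) / (1 + df[w])) for w in vocabulary}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw frequency of each word in the document
        vec = [tf[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm for v in vec] if norm > 0 else vec)
    return vectors

# Hypothetical toy corpus, not taken from the Hedvig data.
docs = ["what does the insurance cost", "the insurance is good", "the cost is good"]
vocab = ["the", "insurance", "cost", "what"]
vectors = tfidf_vectors(docs, vocab)
```

In this toy corpus the word ”the” occurs in every document, so its idf is log10(4/4) = 0 and it contributes nothing to any vector, which is the intended effect of the weighting.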

To better incorporate the dimension of order into the document vectors $x_{d_j}$, N-grams can be used. An N-gram $g_{N,i}$ is a sequence of N words or characters, which can be used in Bag of Words by replacing $x_{d_j}$ with $g_{N,d_j} = [g_{N,1}, ..., g_{N,i}, ..., g_{N,n}]$, where n now denotes the number of different N-grams in the ”vocabulary”. The importance of choosing an appropriate value for N should be emphasized. Lower values of N reduce the dimensionality of $g_{N,d_j}$, which can affect the performance of the classification algorithms, and higher values generally increase computation time [9].
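As a small illustration of word N-grams, the sketch below extracts all sequences of N consecutive words from a document. The tokenization (lowercasing and splitting on whitespace) and the function name are assumptions made for this example.

```python
def word_ngrams(document, n):
    """Return all sequences of n consecutive words in `document`."""
    tokens = document.lower().split()
    # A document with fewer than n words yields no N-grams.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = word_ngrams("How much does it cost", 2)
print(bigrams)  # ['how much', 'much does', 'does it', 'it cost']
```

Each distinct N-gram then plays the role that a single word plays in the ordinary Bag of Words vocabulary.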

2.2.2 Word2vec

Bag of Words and many other NLP techniques for vector representations of text treat words as ”atomic units”, lacking a notion of similarity between words, since these are represented as indices in a vocabulary. Even though these methods are simple, recent advances in machine learning have enabled more complex models to be trained on much larger corpora, and those often outperform the simpler ones [2].

One such model is Word2vec, which is a well-known collection of algorithms, openly available as the Word2vec package. Word2vec is based on the idea that words or expressions that are used in similar contexts and ways probably have related meanings [10]. Unlike Bag of Words, Word2vec creates one high-dimensional vector, of rather abstract components, for each word in the vocabulary.

Footnote 1: Scikit-learn is a Python machine learning library.

The team at Google, led by Tomas Mikolov, found that these high-dimensional word vectors, after being trained on a large corpus, can be used to answer questions about very subtle semantic relationships between words, e.g. ”France is to Paris as Germany is to Berlin”. The learned relationships enable vector operations such as vector(”Paris”) - vector(”France”) + vector(”Italy”) ≈ vector(”Rome”). It is suggested that such semantic relationships could be used to improve many NLP tasks, for example question answering systems [2]. Pre-trained word vectors on different large corpora are made freely available by Google, which makes Word2vec suitable for research (footnote 2).
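The vector arithmetic can be illustrated with a toy example. The two-dimensional vectors below are hand-made and purely hypothetical (real Word2vec vectors have hundreds of learned dimensions); the sketch only shows how the nearest word to vector(”Paris”) - vector(”France”) + vector(”Italy”) could be looked up with cosine similarity.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(vectors, a, b, c):
    """Return the word closest to vector(b) - vector(a) + vector(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

# Hypothetical vectors: first axis separates countries, second marks "capital".
toy_vectors = {
    "France": [1.0, 0.0], "Paris": [1.0, 1.0],
    "Italy": [2.0, 0.0], "Rome": [2.0, 1.0],
    "Germany": [3.0, 0.0], "Berlin": [3.0, 1.0],
}
result = analogy(toy_vectors, "France", "Paris", "Italy")
print(result)  # Rome
```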

The technical details of Word2vec are considered to be beyond the scope of this thesis and are therefore left to the interested reader to find in the referenced literature. However, Word2vec essentially makes use of neural networks in either a Continuous Bag of Words (CBOW) model or a Continuous Skip-gram model, illustrated in figure 2.1 [2]. In CBOW, the model learns to predict a word based on the neighbouring words, and the order of the words does not influence the prediction. In the Continuous Skip-gram model, each word is used as input to predict nearby words within a certain range [2].

Footnote 2: https://code.google.com/archive/p/word2vec/ (visited on 03/24/2019)

Figure 2.1: ”The CBOW architecture predicts the current word based on the context, and the Skip-gram predicts surrounding words given the current word.”

2.3 Classification

Two different classification algorithms are presented in this section. The theory about Random Forest is based on the book An Introduction to Statistical Learning [5].

2.3.1 Random Forest

Random Forest is a method that has been said to work well for binary classification [11]. A Random Forest consists of several classification trees, so it is appropriate to describe those first. For simplicity, the theory is exemplified with feature vectors $x_{d_j}$ of the standard Bag of Words type, but the case for tf-idf and N-grams is almost identical.

A classification tree predicts that an observation belongs to the most commonly occurring class of the training observations in the region to which it belongs. A region $R_j$, $j = 1, ..., J$, is defined as $R_j = \{x_{d_j} \mid x_1 < t_1, ..., x_i < t_i\}$, where $t_i$ denotes the cutpoint and J equals the number of regions in the tree. These regions are also called leaves. For example, $R_1 = \{x_{d_j} \mid x_1 < 1, x_2 < 2\}$ is equal to the leftmost leaf in figure 2.2.

Figure 2.2: Example of a classification tree based on Bag of Words with raw frequency and n = 2

Figure 2.3: The three regions corresponding to the leaves in the classification tree in figure 2.2. Red dots are training observations of class A and green dots training observations of class Q

To grow a classification tree, a measure like the classification error rate or the Gini index G, which is a measure of total variance across all classes, is needed. The classification error rate E measures the fraction of training observations in a region which do not belong to that region’s most frequent class. E is defined as:

$$E = 1 - \max_k \hat{p}_{jk} \tag{2}$$

and the Gini index as:

$$G = \sum_{k=1}^{K} \hat{p}_{jk}(1 - \hat{p}_{jk}) \tag{3}$$

where $\hat{p}_{jk}$ is the proportion of training observations from the k:th class in the j:th region. In practice, the Gini index is often used, since the classification error is not sensitive enough.
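Equations (2) and (3) can be evaluated for a single region as in the following sketch. The region is represented simply as a list of class labels; the function names and the example region are assumptions made for illustration.

```python
from collections import Counter

def class_proportions(labels):
    """Proportion of training observations of each class in a region."""
    counts = Counter(labels)
    return [count / len(labels) for count in counts.values()]

def classification_error(labels):
    # Equation (2): one minus the proportion of the most frequent class.
    return 1 - max(class_proportions(labels))

def gini_index(labels):
    # Equation (3): sum over classes of p * (1 - p).
    return sum(p * (1 - p) for p in class_proportions(labels))

region = ["Q", "Q", "Q", "A"]  # hypothetical region with three Q and one A
print(classification_error(region))  # 0.25
print(gini_index(region))            # 0.375
```

A region that contains only one class gives E = 0 and G = 0, which is why both measures can be read as a region's impurity.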

E or preferably G is then used as a criterion for making splits when growing the tree. Due to computational complexity, it is infeasible to try all possible partitions of the feature space into J high-dimensional rectangles, and therefore an algorithm called recursive binary splitting is used. The objective is to find $R_1, ..., R_J$ such that G is minimized.

Recursive Binary Splitting Algorithm

1. Select the feature $x_i$ and the cutpoint t such that the maximum reduction in G is obtained by splitting the feature space into the regions $R_1 = \{x_{d_j} \mid x_i < t\}$ and $R_2 = \{x_{d_j} \mid x_i \geq t\}$. This means that all possible combinations of $x_i$ and t are evaluated.

2. Step one is repeated for one of the newly created regions, unless a stopping criterion is reached. A stopping criterion can for instance be to continue until no region contains more than a certain number of observations.
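Step one of the algorithm can be sketched as an exhaustive search over features and cutpoints. In this sketch the two resulting regions are scored by their Gini indices weighted by the share of observations in each region; this weighting is a common convention and an assumption here, since the text does not spell it out. The function names and the toy data are likewise illustrative.

```python
from collections import Counter

def gini_index(labels):
    proportions = [c / len(labels) for c in Counter(labels).values()]
    return sum(p * (1 - p) for p in proportions)

def best_split(X, y):
    """Return (feature index, cutpoint, weighted Gini) of the best single split."""
    best = None
    for i in range(len(X[0])):                    # every feature x_i
        for t in sorted({row[i] for row in X}):   # every observed cutpoint t
            left = [label for row, label in zip(X, y) if row[i] < t]
            right = [label for row, label in zip(X, y) if row[i] >= t]
            if not left or not right:
                continue  # a split must create two non-empty regions
            score = (len(left) * gini_index(left)
                     + len(right) * gini_index(right)) / len(y)
            if best is None or score < best[2]:
                best = (i, t, score)
    return best

# Hypothetical Bag of Words counts with n = 2 features.
X = [[0, 1], [0, 2], [1, 0], [2, 0]]
y = ["A", "A", "Q", "Q"]
print(best_split(X, y))  # (0, 1, 0.0)
```

Step two would apply `best_split` again to the observations in one of the two new regions until the stopping criterion is met.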

Properties of Classification Trees

Classification trees are intuitive and easy to interpret graphically. However, the accuracy of their predictions is not always satisfying and the variance of classification trees is high. That is, if two classification trees are fit to two different parts of the data, they often differ significantly. These problems can be mitigated by aggregating several classification trees using methods like Random Forest, explained next.

The Random Forest algorithm makes use of bootstrap. The general idea of bootstrap is to randomly select $n_d$ samples with replacement from the original training data, denoted $Z = (z_1, ..., z_{n_d})$, where in this report $z_j = (x_{d_j}, y_{d_j})$ and $y_{d_j}$ is the label of a document. This is repeated B times to form B bootstrap samples [12, p. 249]. Bootstrap is used in Random Forest to reduce variance, by fitting B trees on these bootstrap samples. A new test observation is classified by taking a majority vote of these B trees, i.e. the most commonly occurring predicted class among all trees in the forest. If one of the features is stronger than the others, almost all trees will use it in the top split, and hence the trees will look similar and their predictions will be highly correlated. To decorrelate the trees, only a randomly chosen subset of the n features is considered in each split when building the trees. A fresh sample of features is considered at each split, and usually this subset is of size $\sqrt{n}$.
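The three ingredients described above (bootstrap sampling, a random feature subset per split, and the majority vote) can each be sketched in a few lines. The helper names are hypothetical, and the tree fitting itself is left out; this is not the Random Forest implementation used in the thesis.

```python
import math
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) observations from `data` with replacement."""
    return [rng.choice(data) for _ in data]

def feature_subset(n_features, rng):
    """Pick about sqrt(n) distinct feature indices to consider at one split."""
    k = max(1, round(math.sqrt(n_features)))
    return rng.sample(range(n_features), k)

def majority_vote(predictions):
    """Return the most commonly predicted class among the trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)                 # fixed seed for reproducibility
data = list(range(10))                 # stand-in for the training pairs z_j
sample = bootstrap_sample(data, rng)   # same size as data, may contain repeats
subset = feature_subset(16, rng)       # 4 of the 16 feature indices
vote = majority_vote(["Q", "A", "Q"])  # "Q"
```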

Note that some interpretability is lost in the Random Forest, since multiple trees are not as easy to interpret as a single one. The importance of each feature $x_i$ can however be illustrated. The total decrease in the Gini index due to splits over a given feature, averaged over all trees, works as a measure of that feature’s importance.

2.3.2 LSTM

This section covers the basic theory of feedforward neural networks (FNN), followed by an explanation of a recurrent neural network and ending with a description of the properties of an LSTM.

An FNN is a neural network without cycles, where the outputs from units in each layer are forwarded as inputs to the next layer. Basic versions of an FNN have three types of units: input, output and hidden units. In the standard version of an FNN, each layer is fully connected. This means that there is a link connecting all pairs of units from two neighboring layers. The hidden layers, composed of hidden units (which are neural units), are the foundation of an FNN [7, p. 137]. An FNN with one hidden layer and a single output is illustrated in figure 2.4.

A neural unit takes a weighted sum z of inputs $x_1, ..., x_n$ (stored as a vector x) plus a constant scalar b called the bias. The operation it initially makes can be written as $z = w \cdot x + b$, where w is the weight vector. A non-linear activation function is thereafter applied to z. Two popular activation functions are the sigmoid (denoted σ) and the hyperbolic tangent (tanh). Sigmoid takes a real value z and maps it to (0, 1), and tanh maps z to (-1, 1) [7, p. 132]. Both are stated in equation 4.

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \tag{4}$$

The neural units described above are then used to build the FNN. The parameters of a hidden layer are represented by a weight matrix W and a bias vector b for the whole layer. These are constructed by combining the weight vector w and the bias b of each neural unit. The weight of the connection from the ith input $x_i$ to the jth hidden unit $h_j$ is the element $W_{ji}$ of W, so that $h_j = \sigma(\sum_i W_{ji} x_i + b_j)$. Then, the output vector h of the hidden layer can be computed as in equation 5.

$$h = \sigma(Wx + b) \tag{5}$$

Figure 2.4: An FNN with one hidden layer and a single output y
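Equations (4) and (5) can be written out for a small hidden layer as follows. The sketch stores W as a list of rows, where row j holds the weights into hidden unit j, which is the layout equation (5) requires; the numerical values are arbitrary and chosen only so that the result is easy to check by hand.

```python
import math

def sigmoid(z):
    # Equation (4): maps any real z to the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def hidden_layer(W, x, b):
    """Compute h = sigmoid(W x + b), equation (5), one hidden unit per row of W."""
    return [
        sigmoid(sum(w_ji * x_i for w_ji, x_i in zip(row, x)) + b_j)
        for row, b_j in zip(W, b)
    ]

W = [[0.0, 0.0], [1.0, -1.0]]  # two hidden units, two inputs (arbitrary values)
x = [2.0, 2.0]
b = [0.0, 0.0]
h = hidden_layer(W, x, b)
print(h)  # [0.5, 0.5]
```

Both weighted sums are zero in this example, and σ(0) = 0.5, which gives the printed output.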

Humans understand words in a sentence based on the information provided by the previous words. Traditional feedforward neural networks are not well suited for learning this phenomenon, which is why recurrent neural networks (RNN) were designed. An RNN is a network containing loops, where ”the value of a network is directly, or indirectly, dependent on its own output as an input.” They are designed to process sequences explicitly [7, p. 177]. RNNs can be interpreted as having multiple copies of the same network, where each copy takes an input $x_t$ and forwards its output $h_t$ to the next. A schematic representation of an RNN and this interpretation is given in figure 2.5 [13].

Figure 2.5: An overview of the idea behind an RNN. The output of one network is forwarded as an input to the next.
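The interpretation of an RNN as copies of the same network can be sketched with a single recurrent unit. The update rule $h_t = \tanh(w_x x_t + w_h h_{t-1} + b)$ used below is a simple, commonly used form and an assumption made for this illustration; the text above does not state a specific formula, and the weights are arbitrary.

```python
import math

def simple_rnn(inputs, w_x, w_h, b, h0=0.0):
    """Return the outputs h_1, ..., h_T of a one-unit recurrent network."""
    outputs = []
    h = h0
    for x_t in inputs:
        # The same weights are reused at every step; the previous output h
        # is fed back in together with the current input x_t.
        h = math.tanh(w_x * x_t + w_h * h + b)
        outputs.append(h)
    return outputs

first = simple_rnn([1.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
second = simple_rnn([0.0, 1.0], w_x=1.0, w_h=0.5, b=0.0)
```

The two input sequences contain the same values in different order and end in different outputs, which illustrates that the network is sensitive to the order of the sequence.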

LSTM is short for Long Short-Term Memory, and is a special kind of RNN first introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997 [14]. There are several variants of LSTMs and all are designed to learn long-term dependencies. In other words, they are capable of remembering information for long time periods. LSTMs can also be seen as a chain-like structure, displayed in figure 2.6 [13], where each unit has four neural network layers.

The properties of LSTM have been successfully used for classification of sentences [15], and can be used for the problem of binary text classification. In this case, there is one output unit taking values in (0, 1), where a value ≥ 0.5 ⇒ Q and a value < 0.5 ⇒ A. When LSTM is used with Word2vec as word embedding, the feature vectors for all words in a document are used as input.

As with RNNs, the theory behind an LSTM is considered to be beyond the scope of this report, wherefore it is described just briefly.

Figure 2.6: The four interacting neural network layers in LSTM are shown as yellow rectangles. σ denotes the sigmoid activation function and tanh denotes the hyperbolic tangent function used in those neural networks.

2.4 Validation

This section explains the details about how to develop and evaluate a language model. The section deals with how to divide the corpus into a training- and test corpus, what metrics to evaluate the model on and how to find a satisfying set of hyperparameters. The theory is explained and exemplified in the context of this thesis.

2.4.1 Training and Test Split

In order to evaluate the performance of a language model, the corpus needs to be divided into a training corpus and a test corpus. The model is trained on the training corpus and its performance is tested on the unseen data in the test corpus. It is important not to include any data from the test corpus in the training corpus, since that would introduce bias and result in inaccurate evaluation.

It is recommended to let the test corpus be as large as possible, to make it representative. At the same time, a large training corpus would yield a better model. In practice, a good trade-off is to divide the corpus into 80% training and 20% test [7, p. 44].
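An 80/20 split can be sketched as a shuffle followed by a cut. The function name, the fixed random seed and the placeholder corpus are assumptions made for illustration.

```python
import random

def train_test_split(corpus, test_fraction=0.2, seed=0):
    """Shuffle the corpus and return (training corpus, test corpus)."""
    shuffled = list(corpus)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    # The two parts are disjoint, so no test document leaks into training.
    return shuffled[n_test:], shuffled[:n_test]

corpus = [f"message {i}" for i in range(100)]  # placeholder documents
train, test = train_test_split(corpus)
print(len(train), len(test))  # 80 20
```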

2.4.2 K-fold Cross-validation

To avoid overfitting, the test corpus should not be touched while developing the models and tuning their hyperparameters. The performance of the models must however be tested during the development process. Therefore, K-fold cross-validation can be used on the training corpus [7, p. 76].

In K-fold cross-validation, the corpus is first randomly divided into K sets (folds) of approximately equal size. One fold forms a validation corpus, while the model is trained on the remaining K-1 folds. Then, some relevant metric is computed on the validation corpus. This is repeated for each of the K folds and the K computed values are then averaged to obtain the K-fold cross-validation estimate [5, p. 183].
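The procedure can be sketched as follows. The callback `train_and_score` is a hypothetical stand-in for fitting a model on the training indices and computing a metric on the validation indices; here it is replaced by a dummy function so that the example is self-contained.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly divide the indices 0..n-1 into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(n, k, train_and_score, seed=0):
    """Average the metric returned by train_and_score over the k folds."""
    folds = k_fold_indices(n, k, seed)
    scores = []
    for i, validation in enumerate(folds):
        # Train on the remaining k-1 folds, validate on fold i.
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        scores.append(train_and_score(training, validation))
    return sum(scores) / k

folds = k_fold_indices(10, 5)
# Dummy metric: the size of the training set seen in each round.
estimate = cross_validate(10, 5, lambda training, validation: len(training))
print(estimate)  # 8.0
```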

Figure 2.7: An illustration of 10-fold cross-validation, where ”Dev” means the validation corpus [7, p. 77]

Furthermore, there is a bias-variance trade-off in K-fold cross-validation. The closer K is to the number of observations in the corpus, the more unbiased the estimates will be. Leave-one-out cross-validation is therefore preferred over K-fold cross-validation from a bias reduction perspective. However, the variance increases with K, since the K different models are trained on almost the same corpus if K is large. This implies that the models will be highly correlated, and the mean of highly correlated quantities has an increased variance. K=5 or K=10 have been shown empirically to yield estimates of the test error rate with low bias and variance [5, pp. 185-186].

The cross-validation can be further improved by stratification. It has been suggested that a variant called Stratified K-fold cross-validation generally is superior to the regular version, in terms of bias and variance [16]. In Stratified K-fold cross-validation, the folds are chosen such that the proportion of each class in the corpus is preserved.
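One simple way to obtain stratified folds is to deal the documents of each class out to the folds in turn, as in the sketch below. For brevity the documents are not shuffled first, which a practical implementation would normally do; the function name and the toy labels are assumptions.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign document indices to k folds, preserving class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the documents of each class out to the folds in round-robin order.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

labels = ["A"] * 8 + ["Q"] * 4  # hypothetical corpus: two thirds A, one third Q
folds = stratified_folds(labels, 4)
fold_labels = [sorted(labels[i] for i in fold) for fold in folds]
print(fold_labels[0])  # ['A', 'A', 'Q']
```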

2.4.3 Evaluation Metrics

To evaluate a supervised machine learning model, the predicted labels should be compared to the ”actual” (human-defined) labels, which are referred to as gold standard labels. A contingency table, illustrated in figure 2.8, is useful when evaluating the model’s performance. Each cell shows a possible outcome when predicting; the green diagonal cells represent correctly classified documents and the off-diagonal cells represent misclassified ones [7, p. 73].

The accuracy is defined in figure 2.8 and shows what percentage of the test corpus the model classified correctly. Note that accuracy could be misleading when the classes are unbalanced in frequency. If, for instance, 1% of the test corpus is of class Q and the rest of class A, then the accuracy would be 99% if the model was ”hard coded” to classify all documents as A.

Therefore, precision and recall are often used instead of accuracy. Precision measures the percentage of the documents the model labeled as Q that were in fact of class Q. Recall measures the percentage of actual Q that were correctly identified by the model. Unlike accuracy, recall and precision highlight true questions [7, p. 74].

F-measure is an additional metric, which incorporates both precision and recall. F-measure is defined in (6).

$$F_\beta = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}, \quad \text{where } P = \text{Precision and } R = \text{Recall} \tag{6}$$

β works as a weight parameter between precision and recall. β > 1 favours recall and β < 1 favours precision, while β = 1 balances them equally. The value of β should be adapted to the application [7, p. 74].

Figure 2.8: A Contingency table for the task of classifying a document as Q or A. TQ = True Q, FQ = False Q, TA = True A, FA = False A
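The metrics can be computed from the four cells of the contingency table as sketched below, using the abbreviations TQ, FQ, TA and FA from figure 2.8 and the standard definitions of accuracy, precision and recall. Returning 0 when a denominator is zero is a convention assumed here, not something stated in the text.

```python
def accuracy(tq, fq, ta, fa):
    return (tq + ta) / (tq + fq + ta + fa)

def precision(tq, fq):
    # Share of documents labeled Q by the model that are actually Q.
    return tq / (tq + fq) if tq + fq else 0.0

def recall(tq, fa):
    # Share of actual Q documents that the model labeled Q.
    return tq / (tq + fa) if tq + fa else 0.0

def f_measure(p, r, beta=1.0):
    # Equation (6).
    denominator = beta ** 2 * p + r
    return (beta ** 2 + 1) * p * r / denominator if denominator else 0.0

# A model that labels every document as A, on a corpus with 1 Q and 99 A.
tq, fq, ta, fa = 0, 0, 99, 1
print(accuracy(tq, fq, ta, fa))  # 0.99
print(recall(tq, fa))            # 0.0
```

The example reproduces the situation described above: the accuracy is 99% even though not a single question is found, which the recall of 0 makes visible.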

2.4.4 Optimization of Hyperparameters

Both Random Forest and LSTM have a set of hyperparameters, whose values are determined before training the models. There are several ways of searching for the optimal set of such parameters, but two commonly used techniques are Grid search and Random search. Grid search is a method for exhaustively searching through a prespecified subset of the hyperparameter space, while a fixed number of parameter settings are randomly sampled from the specified subset in Random search. The difference between these two approaches is displayed in figure 2.9 [17].

Regarding performance, it has been shown both theoretically and empirically that a randomized search is more computationally efficient than a search over a grid, at least when it comes to neural networks [17]. The performance with a certain set of hyperparameters is evaluated on some specified metric (e.g. F-measure) using K-fold cross-validation.
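The difference between the two strategies can be illustrated with a short sketch. The search space below is hypothetical and chosen only for illustration; in practice each candidate setting would be scored with K-fold cross-validation on the chosen metric.

```python
import itertools
import random

# Hypothetical search space (values chosen only for illustration).
space = {
    "n_estimators": [10, 50, 100, 200],
    "max_depth": [5, 10, 20, None],
    "min_samples_leaf": [1, 2, 4],
}


def grid_candidates(space):
    """Grid search: every combination in the prespecified subset."""
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def random_candidates(space, n_iter, seed=0):
    """Random search: a fixed number of independently sampled settings."""
    rng = random.Random(seed)
    for _ in range(n_iter):
        yield {name: rng.choice(values) for name, values in space.items()}


grid = list(grid_candidates(space))
sampled = list(random_candidates(space, n_iter=10))
print(len(grid), len(sampled))  # 48 settings in the grid, 10 sampled
```

The cost of Grid search grows multiplicatively with every added hyperparameter, whereas the budget of Random search is fixed by the number of iterations.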

Figure 2.9: An illustration of the difference between Random search and Grid search, where a two variable function is to be optimized

2.5 Related Work

At the time of writing this thesis, the scientific field of natural language processing is growing and companies like Amazon, Google and Facebook are focusing much of their research on this topic.

Facebook is for example developing an open-source library called fastText³ for text classification and for quickly learning word representations on a large corpus. Pre-trained fastText models are freely available in multiple languages, and the fastText project is mainly based on three papers [1, 18, 19]. Even though fastText is related to this topic, it will not be evaluated in this report. Neither will the model ELMo be evaluated, due to its complexity and our limited experience with it. ELMo has shown promising results in this area, especially for problems like question answering. It is a model for learning contextual word vectors, which takes into account that words have different meanings in different contexts [10].

The task of binary text classification of Swedish documents is however not widely researched at this point, but Richard Socher et al. have conducted a related study on the Stanford Sentiment Treebank corpus, which is based on a data set of 11855 sentences from movie reviews. They propose a new type of RNN, called the Recursive Neural Tensor Network, which achieved an accuracy of 85.4% on the task of binary (positive/negative) classification [20].

Moreover, as stated in section 1.1, LSTM models have become popular in this context. Chunting Zhou et al. [21] have introduced a model called C-LSTM, which combines and creates synergies between ordinary LSTM and CNN models to represent and classify sentences as positive or negative. ”C-LSTM utilizes CNN to extract a sequence of higher-level phrase representations, and are fed into a long short-term memory recurrent neural network (LSTM) to obtain the sentence representation.” The model is evaluated on the same Stanford Sentiment Treebank and achieved an accuracy of 87.8% on the binary classification task, while LSTM achieved 86.6%. It is furthermore concluded that the neural network models outperform other common models such as SVM and Naive Bayes on this task. The C-LSTM was also evaluated on the task of classifying a question into one of six categories, where it achieved an accuracy of 94.6% and outperformed all baseline neural networks (such as LSTM with 93.2% and CNN). However, the SVM had the highest accuracy here, at the cost of requiring human feature engineering and a loss of generalization.

Regarding the Random Forest algorithm, Baoxun Xu et al. propose an improved version of it [22]. Utilizing their feature weighting method, the algorithm seems to outperform other popular classifiers such as the Support Vector Machine and Naive Bayes when used for multi-class text classification of high-dimensional data, evaluated on F-measure. On the ”Fbis” data set (from the Foreign Broadcast Information Service data of TREC-5), for instance, the improved algorithm achieved a best test accuracy of 84.6% and a macro-averaged F-measure of 79.6%. The standard Random Forest achieved 83.6% and 78.8% respectively (also better than the SVM and Naive Bayes) on the same data set.

³ https://fasttext.cc/ (visited on 04/08/2019)

These related papers are relevant, since they indicate that neural network models outperform linear classifiers such as SVM and probabilistic ones such as Naive Bayes when it comes to binary classification. The opposite seems to hold for multinomial classification of questions. Furthermore, Random Forest is shown to perform better than SVM and Naive Bayes in the multi-class setting.

What distinguishes this bachelor thesis from the related work is, firstly, the language and context of the corpus. A majority of the related papers are based on an English corpus, whereas this study is based on Swedish documents from an insurance company’s chat history. Furthermore, this study focuses on the comparison between two models and approaches that differ in complexity, rather than on trying to achieve a new state-of-the-art solution to the problem of classifying text.

3 Method

This section contains a detailed description of how the study was conducted. It starts with a description of the input data and how it was formatted. Thereafter, the details of how the LSTM and Random Forest were implemented, with their respective word embeddings, are presented. The section ends with a description of the model optimization and the human baseline.

3.1 Data

The training and evaluation of the models were based on historical chat data from Hedvig’s mobile app and website. Details of the format of this data and how it was processed and labeled are outlined in this section.

3.1.1 Raw Data

The raw data was provided by Hedvig, and consisted of the most recent conversations between their customers and employees. The original text file contained approximately 400 000 rows of text, where each row formed a message (document) written and sent by one of the two parties. Note that a dataset of text is usually called a corpus.

3.1.2 Formatting

The corpus had to be formatted in order to prepare it for later analysis and use in the models. Firstly, regular expressions were used to remove emojis and special characters such as ”,” and ”-” from the corpus. It is reasonable to assume that those characters are not informative for the classification. Also, all letters were set to lower case so that, for instance, ”Jag” and ”jag” would be interpreted identically by the models.
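A minimal sketch of such a cleaning step is given below. The exact regular expressions used in the study are not reproduced here; the pattern, the choice to replace unwanted characters with a space, and the decision to keep question marks at this stage (they are needed for the initial labeling in section 3.1.3) are assumptions made for illustration.

```python
import re

# Assumption: keep letters, digits, whitespace and question marks;
# everything else (emojis, commas, hyphens, underscores, ...) is dropped.
_UNWANTED = re.compile(r"[^\w\s?]|_")
_WHITESPACE = re.compile(r"\s+")


def clean_document(text):
    """Lower-case a message and strip emojis and special characters."""
    text = _UNWANTED.sub(" ", text.lower())
    # Collapse the gaps left behind by the removed characters.
    return _WHITESPACE.sub(" ", text).strip()


message = "Hej! Jag undrar - har ni reseskydd, i er försäkring? \U0001F60A"
print(clean_document(message))
```

In Python 3, `\w` matches Unicode letters, so Swedish characters such as å, ä and ö are preserved.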

All automatically generated messages were also removed, since training on these documents would lead to overfitted models. Thereafter, all documents written by customer service employees were removed, because these documents all tended to be very similar, as if they were automatically generated too, which would also cause the models to be overfitted.

Since Hedvig currently only operates in Sweden, a majority of the documents were written in Swedish, and it was decided to train and test the models only on Swedish text. Therefore, documents written in other languages (mostly English) were removed manually.

Infrequent words like names and non-linguistic utterances such as ”haha” and ”wow” do not generally say anything about the type of document. Those should therefore be removed to reduce the dimensionality of the feature vectors. However, the implementations of Bag of Words and LSTM have optional parameters for that purpose, so no such filtering was done during the data formatting.

3.1.3 Labeling

Since the original data set contained unlabeled documents, an initial labeling was performed by letting all words up until a question mark form documents labeled as questions (Q) and the rest as non-questions (A). To improve on this approach, we manually iterated through the corpus and corrected mislabeled documents. A critical choice that needs to be motivated is why the question marks were removed from the corpus during this labeling process. That decision was made to prevent the models from being overtrained on question marks. The language used in chats is generally more informal, and when inspecting the corpus manually it was observed that a significant proportion of the presumed questions lacked a question mark.
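One possible reading of the initial labeling rule is sketched below: a message is split at its question marks, each piece that ended with a question mark becomes a Q document, any remaining text becomes an A document, and the question marks themselves are dropped. The function name and the exact splitting behaviour are assumptions, not the code used in the study.

```python
import re


def initial_labels(message):
    """Split a message at question marks and label the pieces Q or A."""
    documents = []
    # With a capturing group, re.split alternates text and delimiters.
    pieces = re.split(r"(\?+)", message)
    for i in range(0, len(pieces), 2):
        text = pieces[i].strip()
        if not text:
            continue
        followed_by_mark = i + 1 < len(pieces)
        documents.append((text, "Q" if followed_by_mark else "A"))
    return documents


print(initial_labels("har ni reseskydd i er försäkring? jag behöver försäkring"))
```

Questions written without a question mark end up labeled as A under this rule, which is why the manual correction described above was needed.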

After this formatting and cleaning, the corpus consisted of 9470 documents labeled as A and 4288 documents labeled as Q. Some examples of documents from the corpus are given in table 3.1.

| Document | Label |
|---|---|
| låter det intressant | Q |
| har ni reseskydd i er försäkring | Q |
| jag behöver försäkring | A |
| nej inte vad jag vet | A |

Table 3.1: Some examples of documents from the formatted corpus together with their labels

3.1.4 Sampling Additional Data

To train with more data and generalize the models further, the data from the chat was supplemented with a corpus from Göteborgs-Posten (GP). That ”GP corpus” contained 41184 documents labeled as Q and 80000 documents labeled as A, after the same formatting procedure as described in section 3.1.2 (except for the manual labeling). The GP corpus was chosen since it was the most similar Swedish corpus that could be found. All tests were performed on the corpus containing documents from Hedvig only, as well as on the mixed corpus containing both Hedvig and GP documents.

3.1.5 Train and Test Data

Both of the classifiers Random Forest and LSTM need to be trained on a set of training data, as explained in section 2.1. After training the models, the classes of future observed documents can be predicted. Also Bag of Words requires training data, in order to construct the vocabulary described in section 2.2.1.

To simulate future unseen observations on which the performance of the final models could be evaluated, 20% of the corpus was put aside as test data. The remaining 80% was used as training data to train and optimize the models, in accordance with the theory in section 2.4.1.
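As a simple sketch of the hold-out split (whether the split in the study was shuffled or stratified is not stated, so random shuffling with a fixed seed is an assumption here), the document indices can be divided as follows:

```python
import random


def train_test_split_indices(n_documents, test_share=0.2, seed=0):
    """Shuffle document indices and hold out a share of them as test data."""
    indices = list(range(n_documents))
    random.Random(seed).shuffle(indices)
    n_test = round(n_documents * test_share)
    return indices[n_test:], indices[:n_test]


# Size of the Hedvig corpus reported in section 3.1.3: 9470 A + 4288 Q.
train_idx, test_idx = train_test_split_indices(9470 + 4288)
print(len(train_idx), len(test_idx))
```

The test indices are set aside once and are not touched again until the final evaluation.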

3.2 Implementing Random Forest

This section explains how different versions of Bag of Words were used together with Random Forest, and describes how they were tuned for this specific problem. The details about Random Forest are given in section 2.3.1.

3.2.1 Word embedding for Random Forest

Two variants of Bag of Words were used to map the documents to feature vectors, before training the Random Forest classifier on those. The purpose of this thesis is mainly to compare a complex model with a less complex one, as mentioned in section 1.3. This motivates the use of Bag of Words as a tool for word embedding for the Random Forest.

Bag of Words was first tested with the raw frequency of unigrams (1-grams), using CountVectorizer in scikit-learn. That is, the learned vocabulary consisted only of single words. Furthermore, words that occurred in fewer than ”min_df” documents of the training corpus were ignored when learning the vocabulary, in order to only include informative words. One weakness of unigrams in Bag of Words is that the structure of a document is lost, which is presumably influential when trying to separate questions from non-questions. Hence, the performance using higher order N-grams was also investigated in a trial and error procedure.

A tuple (min_N, max_N) can be set as the optional parameter ”ngram_range” in CountVectorizer. min_N and max_N set the boundaries of the range of N-values for the N-grams to be extracted. All values of N such that min_N ≤ N ≤ max_N will be used.

| Variants of Bag of Words |
|---|
| Raw frequency of words + N-grams |
| tf-idf + N-grams |

Table 3.2: The two types of Bag of Words that were used as word embedding before Random Forest could be trained

In addition to the raw frequency, tf-idf was also tested using ”TfidfVectorizer” in scikit-learn. tf-idf could reasonably solve the ”paradox” mentioned in section 2.2.1 and be a technique for learning descriptive features.

Note: The pros and cons and details of the two variants are further described in section 2.2.1.
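The N-gram extraction and the frequency filter described in this section can be sketched in pure Python as follows. This is an illustration of the principle, not the scikit-learn implementation: the helper names are hypothetical, tokens are obtained by splitting on whitespace, and min_df is interpreted as a document count, as in CountVectorizer.

```python
from collections import Counter


def extract_ngrams(tokens, ngram_range):
    """Return all N-grams with min_N <= N <= max_N as space-joined strings."""
    min_n, max_n = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def fit_vocabulary(documents, ngram_range=(1, 2), min_df=1):
    """Keep the N-grams that occur in at least min_df documents."""
    document_frequency = Counter()
    for doc in documents:
        # A set, so that each N-gram is counted once per document.
        document_frequency.update(set(extract_ngrams(doc.split(), ngram_range)))
    return sorted(t for t, df in document_frequency.items() if df >= min_df)


def transform(documents, vocabulary, ngram_range=(1, 2)):
    """Map each document to a vector of raw N-gram counts."""
    vectors = []
    for doc in documents:
        counts = Counter(extract_ngrams(doc.split(), ngram_range))
        vectors.append([counts[term] for term in vocabulary])
    return vectors


# Small made-up corpus in the style of table 3.1.
corpus = ["har ni reseskydd", "har ni hemförsäkring", "jag behöver försäkring"]
vocabulary = fit_vocabulary(corpus, ngram_range=(1, 2), min_df=2)
print(vocabulary)
print(transform(corpus, vocabulary))
```

With ngram_range = (1, 2), the bigram ”har ni” becomes a feature of its own, which retains some of the word order that is lost with unigrams only.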

3.2.2 Random Forest

The machine learning algorithm Random Forest was implemented using ”RandomForestClassifier” from the open source library scikit-learn⁴ in Python. scikit-learn and Python are especially appropriate to use in a scientific context, since their availability makes the experiments easy to reproduce.

Random Forest was chosen mainly since it is relatively straightforward to interpret, both graphically and mathematically. It has also worked well in previous studies of text classification (see section 2.5). It was however interesting to evaluate its performance on a Swedish corpus for binary classification of insurance related text, since that had not been done before. To gain insight into how the model performs and which words are critical, the feature importance of different words was printed using the attribute feature_importances_ of RandomForestClassifier.

The hyperparameters that were used during the optimization, described in section 3.4, are listed in table 3.3. Default values, as defined in scikit-learn, were used for all the other hyperparameters.

| Tuned Hyperparameters | Description |
|---|---|
| n_estimators | Number of trees in the forest |
| max_features | Number of features to consider when looking for the best split |
| max_depth | Maximum depth of the tree |
| min_samples_split | Minimum number of samples required to split an internal node |
| min_samples_leaf | Minimum number of samples required to be at a leaf node |

Table 3.3: The hyperparameters that were tuned during the Random Search for Random Forest

3.3 Implementing LSTM

This section gives a description of how LSTM was implemented as a classifier, together with Word2vec as word embedding. LSTM was used with Word2vec, since the aim of this report was to compare a complex model to a less complex one. LSTM and Word2vec are both considered to be more complex than Random Forest and Bag of Words respectively.

⁴ https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html (visited on 03/15/2019)

3.3.1 Word2vec

As explained in section 2.2.2, Word2vec is said to usually outperform simpler models such as Bag of Words and is suitable for question answering systems, since it incorporates the semantic meaning of a word. It was therefore interesting to see if that held for this specific task of binary classification. Moreover, there existed a Word2vec model that was pretrained on a Swedish corpus based on text from Wikipedia⁵, which makes the results easier to reproduce. Apart from the pretrained model, a Word2vec model was also trained on the ”Hedvig corpus” as well as on the ”Hedvig+GP corpus”, using the open source Python library Gensim⁶. All default hyperparameters were used, apart from min_count and size. All words with a total frequency in the corpus lower than min_count were ignored, and size sets the dimensionality of the word vectors. The performance of these word vectors was significantly worse than that of those trained on Wikipedia, so they were excluded from the official tests and results.

3.3.2 LSTM

An LSTM is constructed to better take the order of words into consideration and is said to have a long-term memory, as outlined in section 2.3.2. It is reasonable to expect that words are ordered differently in questions compared to non-questions. A simple example illustrates this: in the Hedvig corpus, ”är det” (”is it”) occurred in 6.7% of the documents labeled as Q, whereas the corresponding figure was 1.1% for the documents labeled as A. Also, versions of the LSTM model have been successfully used for text classification previously, as described in section 2.5. These are the arguments for why LSTM was tested to solve the problem of deciding whether a document is a question or not.
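A statistic of this kind can be computed with a few lines of Python. The sketch below assumes documents that have already been cleaned and whitespace-normalized; the four example documents are made up and are not taken from the Hedvig corpus.

```python
def share_containing(documents, labels, phrase, target_label):
    """Share of the documents with the given label that contain the phrase
    as a sequence of whole words."""
    selected = [d for d, l in zip(documents, labels) if l == target_label]
    if not selected:
        return 0.0
    # Padding with spaces prevents matches inside longer words.
    hits = sum(f" {phrase} " in f" {doc} " for doc in selected)
    return hits / len(selected)


# Tiny hand-made sample; not data from the Hedvig corpus.
documents = ["är det sant", "vad är det som gäller", "det är bra", "jag behöver försäkring"]
labels = ["Q", "Q", "A", "A"]
print(share_containing(documents, labels, "är det", "Q"))
print(share_containing(documents, labels, "är det", "A"))
```

Note that ”det är bra” contains the same two words in the opposite order and is not counted, which is exactly the kind of word-order information a unigram Bag of Words cannot capture.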

The LSTM model was implemented using the open source Python library Keras⁷, and the hyperparameters that were tuned during the optimization are listed in table 3.4. Default values were used for all the other hyperparameters, since tuning them was considered to have little effect on the performance for this task.

⁵ https://github.com/Kyubyong/wordvectors (visited on 03/15/2019)
⁶ https://radimrehurek.com/gensim/models/word2vec.html (visited on 03/26/2019)
⁷ https://keras.io/layers/recurrent/ (visited on 03/21/2019)

| Tuned Hyperparameters | Description |
|---|---|
| embedding_dropout_rate | Ignores some input words randomly (reduces overfitting) |
| lstm_units | Dimensionality of the LSTM’s hidden state (its memory between time steps) |
| output_layer_1_units | Units in first output layer |
| output_layer_1_dropout_rate | Ignores some output units randomly (while training) |
| epochs | Number of times the model trains on the entire corpus |
| batch_size | Number of documents in one batch |

Table 3.4: The hyperparameters that were tuned during the optimization of LSTM

Note that one epoch is completed when the entire training corpus has been passed through the model once. Furthermore, the corpus is divided into batches of a certain batch size.
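The relation between epochs and batches can be illustrated with a small sketch. The values below are illustrative only, and the per-epoch shuffling that training libraries typically perform is left out for brevity.

```python
import math


def iterate_batches(n_documents, batch_size, epochs):
    """Yield (epoch, start, stop) index ranges for mini-batch training."""
    for epoch in range(epochs):
        for start in range(0, n_documents, batch_size):
            # The last batch of an epoch may be smaller than batch_size.
            yield epoch, start, min(start + batch_size, n_documents)


n_documents, batch_size, epochs = 1000, 64, 3  # illustrative values only
batches = list(iterate_batches(n_documents, batch_size, epochs))
batches_per_epoch = math.ceil(n_documents / batch_size)
print(batches_per_epoch, len(batches))  # 16 batches per epoch, 48 in total
```

The number of weight updates therefore grows with the number of epochs and shrinks with the batch size, which is one reason both were included among the tuned hyperparameters.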

3.4 Hyperparameter Optimization and Evaluation

The models described above were first of all tested using the default hyperparameters. After that, some manual adjustments were made in order to see how well the theory about the models applied to this particular corpus. An increased number of trees in Random Forest was, for instance, expected to increase accuracy and reduce the variance of the classification trees (see section 2.3.1). After some experimentation, a relevant subspace of the chosen hyperparameters could be defined, and a Random search, using RandomizedSearchCV in scikit-learn⁸, was run on Google Cloud. The motivation for using Random search instead of Grid search can be found in section 2.4.4.

Furthermore, all experimentation and optimization were mainly evaluated on F1, while recall, accuracy, precision and time were registered as well. β was set to 1 in Fβ, since recall and precision were considered to be equally important. Stratified 5-fold cross-validation was applied in all tests, in order to get trustworthy results without ever training or testing on the test corpus. The motivation for using stratification and K = 5 can be found in section 2.4.2.

⁸ https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html (visited on 03/21/2019)

3.5 Human Baseline

While manually iterating through the corpus, it was discovered that several documents were difficult to label even for humans. Examples of such unclear documents are given in table 3.5. This aroused an interest in comparing the performance of the machine learning models to that of humans.

Therefore, 8 people were invited to manually read and classify a total of 100 documents each on every corpus, after which accuracy, recall, precision and F1 were computed. The participants were randomly chosen among friends and family and were not particularly familiar with insurance beforehand.

Unclear Documents in the Corpus men man får ingen studentrabatt då om jag betalar 1500 självrisk då är det lugnt med andra ord

Table 3.5: Some examples of documents, which can be perceived hard to classify even for humans. It can be argued that context is needed to be able to classify them.

4 Results

Some results from the experiments and model development are presented in this section. Firstly, the results from the random search are displayed, followed by a description of how humans performed on the classification task and lastly a summary of the performance of the best models. As mentioned in section 3.4, F1 was used as a basis for decision making. Recall, precision and accuracy are however presented together with the final models in section 4.4. Note that only the best performing variants of Random Forest and LSTM were selected to be displayed in some of the diagram types, in order to keep the results as concise and informative as possible.

4.1 Optimization of Hyperparameters

Figure 4.1 shows the correlation between mean training time and F1 score, resulting from the random search on LSTM. The mean training time is the average over all 5 trainings in 5-fold cross-validation on the mixed corpus. Larger rings are equivalent to more epochs during training. It can be observed that F1 score seems to correlate with more epochs as well as with training time, which is in line with the theory. Another observation that can be made, is that the measurements are clustered in the upper left corner.

Figure 4.1: Graph from random search with 120 iterations on the mixed corpus. Bigger rings mean more epochs, and training time is measured in seconds.

Figure 4.2 illustrates the same correlation as figure 4.1 but for Random Forest together with Bag of Words, also on the mixed corpus. In this figure, larger rings correspond to more classification trees. The major difference from figure 4.1 is that the correlation between F1 score and the number of classification trees seems to be closer to 0.

More detailed tables over the hyperparameters that yielded the highest F1 during the optimization, are presented in appendix 9.

Figure 4.2: Graph from random search with 146 iterations on the mixed corpus. Bigger rings mean more classification trees, and training time is measured in seconds.

Figure 4.3: Before and after random search on LSTM

Figure 4.4: Before and after random search on Random Forest

The performance of LSTM before and after the random search, is illustrated in figure 4.3. The models before random search were obtained by applying the relevant theory combined with some trial and error. The results from the same type of test but for Random Forest, are presented in figure 4.4.

Figure 4.3 shows that the performance of LSTM was not consistently improved by the random search. The intention of a random search is to improve the model, but F1 became slightly lower after optimization on the mixed corpus.

Figure 4.4 shows that Bag of Words on the mixed corpus performed marginally better overall, in terms of F1. As in the case of LSTM, the random search did not result in better performance in all cases.

4.2 Human Baseline

The mean scores for 8 randomly chosen humans, classifying 100 documents each from each of the two corpora, are presented in table 4.1. The 95% confidence interval for F1 is also displayed, as well as the mean time it took to classify a document. The confidence interval was computed under a normality assumption. Note that the humans performed better on the Hedvig corpus with respect to all metrics except precision.

| Corpus | F1 | CI_F1 | Precision | Recall | Accuracy | Time/doc |
|---|---|---|---|---|---|---|
| Mixed | 0,862 | ±0,022 | 0,974 | 0,776 | 0,877 | 4,23 s |
| Hedvig | 0,945 | ±0,025 | 0,963 | 0,930 | 0,946 | 3,44 s |

Table 4.1: Human baseline, where 8 people got to classify 100 documents by hand on each corpus. The 95% confidence interval for F1 is denoted CI_F1.
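A confidence interval of this kind can be sketched as follows. The per-participant scores are hypothetical (the individual scores are not reported here), and using the normal quantile together with the sample standard deviation is an assumption about how ”under normality assumption” was applied; with only eight participants, a Student-t quantile would give a somewhat wider interval.

```python
import math
import statistics


def normal_confidence_interval(scores, confidence=0.95):
    """Return (mean, half_width) of a confidence interval for the mean,
    using the normal quantile and the sample standard deviation."""
    n = len(scores)
    mean = statistics.mean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(n)
    # Two-sided quantile, approximately 1.96 for a 95% interval.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, z * standard_error


# Hypothetical F1 scores for eight participants (not the thesis data).
f1_scores = [0.84, 0.88, 0.85, 0.90, 0.83, 0.87, 0.86, 0.89]
mean, half_width = normal_confidence_interval(f1_scores)
print(f"{mean:.3f} ± {half_width:.3f}")
```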

4.3 Feature Importance

Feature importance was computed to see which words Random Forest detected as meaningful when classifying a document. The words are ranked from most to least important and only the top five are displayed in table 4.2. Note that ”Imp.” is a relative measure that adds up to 1 over all words in the vocabulary. Furthermore, it is reasonable that words like ”hur” (how) and ”vad” (what) differentiate questions from non-questions.

| Hedvig | Imp. | Mixed | Imp. |
|---|---|---|---|
| hur | 0,058 | hur | 0,047 |
| ni | 0,04 | du | 0,039 |
| vad | 0,026 | vad | 0,037 |
| kan | 0,019 | varför | 0,03 |
| har ni | 0,015 | ni | 0,017 |

Table 4.2: Top words with respect to feature importance for Random Forest with BoW and the same parameters that were used in the best model before random search

4.4 The Final Models

The best and final models, evaluated on the 20% test corpora, are presented in tables 4.3 and 4.4. The overall observation is that there are no major differences between the models in terms of F1. LSTM achieved an F1 of 0,874 on the mixed corpus and 0,877 on the Hedvig corpus, which is just slightly higher than all variants of Random Forest. Furthermore, the performance of the models is comparable to that of humans on the mixed corpus, while humans achieved higher scores on the Hedvig corpus. As expected, the models are however trained faster than humans. The humans had a mean training time of 32 years (same as their mean age).

| Model | F1-Score | Precision | Recall | Accuracy | Training Time |
|---|---|---|---|---|---|
| RF + tf-idf | 0,863 | 0,908 | 0,822 | 0,912 | 295,034 s |
| RF + BoW | 0,864 | 0,892 | 0,836 | 0,911 | 1299,443 s |
| LSTM | 0,874 | 0,878 | 0,869 | 0,915 | 370,787 s |
| Humans | 0,862 | 0,974 | 0,776 | 0,877 | 32 years |

Table 4.3: Final models, evaluated on the mixed corpus

| Model | F1-Score | Precision | Recall | Accuracy | Training Time |
|---|---|---|---|---|---|
| RF + tf-idf | 0,846 | 0,862 | 0,829 | 0,897 | 18,286 s |
| RF + BoW | 0,852 | 0,867 | 0,837 | 0,909 | 5,36 s |
| LSTM | 0,877 | 0,908 | 0,848 | 0,926 | 185,297 s |
| Humans | 0,945 | 0,963 | 0,930 | 0,946 | 32 years |

Table 4.4: Final models, evaluated on the Hedvig corpus

Part II

Chatbot’s Effect on Adoption Rate

5 Theoretical Background

The theoretical framework that is used to study how implementing a chatbot would affect the adoption rate of Hedvig is presented in this section. Firstly, the theory about adoption and diffusion is explained, followed by the SWOT method.

5.1 Diffusion of Innovations

Everett Rogers was an American communication theorist and sociologist, who is especially known for his diffusion of innovation theory. The theory presented in this section is based on the work in his book Diffusion of Innovations [23].

It is intuitive to begin with a definition of what diffusion is. In the context of innovation theory, Rogers defines diffusion accordingly:

”Diffusion is the process in which an innovation is communicated through certain channels over time among the members of a social system.”

Where he defines innovation as follows:

”An innovation is an idea, practice, or object that is perceived as new by an individual or other unit of adoption.”

Some examples of units of adoption are households, organizations, social groups and cities, which clarifies what that concept refers to. Furthermore, it should be highlighted that it does not matter whether or not the idea is objectively new, since the perceived novelty is what determines people’s reaction to it.

An essential part of Rogers’s diffusion theory is concerned with why some innovations are adopted more rapidly than others. The rate of adoption can be defined as the time required for a certain share of the members of a system to accept and adopt an innovation. The system may for instance be defined as an organization or people living in a certain city.

In an attempt to explain why certain innovations diffuse faster than others, Rogers points out five characteristics that are said to explain the rate of adoption. He states that research has shown that innovations that are perceived as having higher relative advantage, compatibility, trialability and observability, and lower complexity, will be adopted faster than others. Especially important are the first two characteristics: relative advantage and compatibility.

1. Relative Advantage Relative advantage refers to the degree to which an innovation is perceived as superior to its precursor. Relative advantage can be measured in social prestige, convenience, economic terms or satisfaction. The higher perceived relative advantage an innovation has, the faster it will be adopted.

2. Compatibility Compatibility is the degree to which an innovation is perceived as being consistent with past experiences, needs, current values and norms of potential adopters. An innovation that is compatible in this sense, will be adopted more quickly than one that is incompatible, since an incompatible innovation usually requires prior adoption of a new value system.

3. Complexity Complexity is the degree to which an innovation is perceived as difficult to use and understand. Innovations that are easy to understand are adopted faster than complicated ideas, which require the adopter to acquire new knowledge and competence.

4. Trialability Trialability is the degree to which an innovation may be experimented with on a limited basis. Innovations that can be tried will be adopted at a higher rate, since trialability decreases the adopter’s uncertainty about the new idea.

5. Observability Observability is the degree to which the results of an innovation are visible to others. Visibility spurs ”word of mouth” and peer discussion of the new idea. If potential adopters see the results of an innovation, they often ask for reviews of it. The more visible the innovation’s results are to others, the more likely they therefore are to adopt it. The clustering of, for example, solar panels on roofs is one piece of evidence for the importance of observability.

5.2 SWOT

The SWOT analysis is a common tool for making situation analyses in business cases. The analysis is conducted by assessing the company’s strengths, weaknesses, opportunities and threats. Strengths include internal abilities and resources that help the company achieve its goals, while weaknesses are just the opposite. Opportunities are external factors or environmental trends, which the company may exploit to its advantage. Threats are also external but may, on the contrary, jeopardize the company’s business [24, pp. 79-80]. The analysis is often illustrated as a two-by-two matrix, as can be seen in figure 5.1.

Figure 5.1: A matrix is often used to present the SWOT-analysis.

6 Method

This section presents the method that was used to investigate how implementing a chatbot would affect the adoption rate of Hedvig (regarded as an innovation)⁹. The theory about diffusion of innovations, outlined in section 5.1, was used as the main framework for the analysis. It was chosen since it is well established and provides an easy-to-understand partitioning of the innovation’s characteristics into five attributes that are crucial for its success.

The study began with collecting general information about Hedvig and how the innovation differentiates itself in the market today. In particular, information about the product was gathered from the perspective of the five attributes: relative advantage, compatibility, complexity, trialability and observability. An analysis of how a chatbot would affect these innovation characteristics was thereafter made, and is presented in section 7.2.

After studying these innovation characteristics and the company in general, it was however apparent that user experience was the most central part of Hedvig’s business model. User experience may be defined as ”a person’s perceptions and responses resulting from the use and/or anticipated use of a product, system or service” [25]. As can be seen from the definition, user experience is a broad concept that is part of all of the first three attributes (relative advantage, compatibility and complexity). Since it is Hedvig’s primary core competence, it can however mainly be categorized as a relative advantage.

A detailed study of all the innovation’s characteristics was considered to be too extensive for this thesis. A decision to put emphasis on how user experience would be affected by chat automation, was therefore made. The remaining attributes were analyzed briefly without any experiments or profound literature studies.

In order to investigate user experience further, a literature study was conducted. Since Hedvig does not have a functioning chatbot at the time of writing, no empirical studies could be pursued to test its potential effect. No other tests or interviews of our own were conducted either, since that was considered to be outside the scope of this report. Instead, previous studies were used as a foundation for the discussion in section 7.2 about how those studies apply in Hedvig’s case.

⁹ Apart from being a company, Hedvig may also be regarded as an innovation: the company’s name is the same as its service’s, which in turn satisfies Rogers’s definition of an innovation in section 5.1.

6.1 Literature Search

Science Direct and Google Scholar were used as the two main sources during the research. The keywords user experience, chatbot and customer satisfaction were used, and further scientific literature was also found by examining the reference lists of the papers. Firstly, the abstract and conclusion were read to determine whether the paper was relevant. If so, the result section was carefully studied and key insights were summarized in section 7.1.

6.2 Selection

A total of 23 abstracts of articles were reviewed, of which 5 were subject to further analysis. The selection of which articles to study more carefully was based on the four criteria explained below:

1. Recent work was prioritized over older work, since the effect of chatbots reasonably changes over time as they rapidly become more advanced and the general attitude towards them changes.

2. Empirical studies based on much data were preferred over literature studies to eliminate potential errors that can occur when using secondary sources.

3. Studies made with general and not domain specific chatbots (unless insurance or related field) were prioritized, since those would reasonably be more applicable to Hedvig.

4. Studies that were perceived as unethical, for instance because they were discriminating, unfair or contained conclusions that could be used for racism, were excluded.

7 Results

In this section, a selection of the findings from previous studies about user experience are presented as a summary with attached references. The section ends with an analysis of how a chatbot would affect Hedvig’s five attributes for diffusion. The results from the literature study are incorporated in that analysis.

7.1 Literature Study

Several studies reported differences in the user’s behaviour when conversing with chatbots. By comparing six conversation transcripts from Microsoft’s chatbot ”Little Ice” to six conversations with human friends, analysts observed that users demonstrated different personalities. The users tended to be more agreeable, extroverted, open and self-disclosing when conversing with humans. They also appeared to be more emotionally stable when talking to humans [26].

An additional comparison of 100 human conversations to 100 chatbot conversations also showed that humans tend to behave differently when talking to chatbots. In this study from 2015, it was observed that humans sent about twice as many messages to chatbots as to other humans, and each message generally contained fewer words [27].

General user experience and its component of trust have also been studied in the context of chatbots. Thirteen chatbot users were interviewed and asked questions concerning their experience with chatbots and factors affecting their trust in these. The interviewees reported several benefits as well as drawbacks with chatbots, listed in tables 7.1 and 7.2 [28]. Related to the reported problem with interpretation of complex questions, it is suggested that a major challenge for chatbots is to identify which questions can be answered by the bot and which must be forwarded to a human [29].

Reported Benefits with Chatbots
Fast responses and high availability of information
Works well for simple and general questions
Low threshold for asking questions
One does not feel like being judged
Do not feel time pressure

Table 7.1: Some of the benefits users experienced with chatbots

Furthermore, several factors were reported to affect the perceived trust in chatbots. Some of these factors are human-likeness, professional appearance, the ability to understand and the chatbot’s self-presentation. The researchers also found that the context in which the chatbot is used affects its trustworthiness. People are for example more likely to trust the chatbot if they trust the brand hosting it [28].

Reported Drawbacks with Chatbots
Interpretation problems, especially for complex questions
Concern for security and privacy
Fear for chatbots being a step towards reduced access to customer service personnel

Table 7.2: Some of the drawbacks users experienced with chatbots

7.2 Five Attributes of Hedvig

7.2.1 Relative Advantage

As mentioned in section 6, user experience is the most outstanding attribute of Hedvig. It is a crucial part of Hedvig’s business model, as reflected in their mission: ”Create the world’s most remarkable insurance experience”. Hedvig’s focus on user experience has made it their greatest relative advantage. This advantage is further magnified, since user experience is one of the most significant drawbacks of the competitors’ services today.

The app, in which the chatbot would be implemented, is where the users mostly interact with the company and where much of this user experience is created. The chat is a central part of the app, and the user experiences it throughout all three stages described in the Touchpoint wheel. The Touchpoint wheel is a model which partitions the user experience into three parts: before, during and after purchase [30, pp. 6-7]. This section contains an analysis (mainly based on the literature study) of how implementing a chatbot would affect the user experience in all three of these stages.

In addition to creating a ”remarkable insurance experience”, automation is one of Hedvig’s core competencies; they have clearly communicated their wish to automate the service and have positioned themselves as ”high-tech”. CTO John Ardelius expresses this: ”We try to automate the process as much as possible, not only to make it faster but to make it more fair.” [31] The results in previous work showed that faster responses were a major benefit of chatbots, which is consistent with Ardelius’s statement. The large investments in NLP worldwide may help Hedvig exploit this benefit and facilitate automation through available frameworks and competence. Large financial resources will however be required to maintain that core competence, which can potentially cause problems for a startup like Hedvig. Furthermore, the effects on user experience (e.g. trust) were also shown to be dependent on the provider’s branding. From one perspective, people could therefore be more likely to trust the chatbot, since Hedvig is perceived as high-tech and differentiates itself as being ”on the customers’ side”.

From another perspective, studies have shown that users tend to align their satisfaction with their expectations, as long as they are fulfilled. When the experience differs from what the user expected, the satisfaction generally correlates with their level of disconfirmation [32]. This implies that user experience will be worsened if the chatbot does not correspond to what is promised by the brand. In this sense, the ”high-tech” branding strategy could have negative effects. Furthermore, the chat is currently designed to give the impression that the user is writing to a bot, while the user is in fact conversing with a human. The service level of these humans is in other words setting the chatbot’s minimum performance level if Hedvig wants to improve user experience, at least regarding existing users.

Additional aspects of the user experience become increasingly important, considering the fact that Hedvig is an insurance service provider; users appeared to be more emotionally unstable when using chatbots, which can cause problems for Hedvig. The user might be angry or sad when chatting, especially during the claiming process (in the after purchase stage). The identified challenge of knowing when to forward the user to a human becomes crucial in these scenarios.

7.2.2 Compatibility

Compatibility was also found to be one of Hedvig’s strengths. Hedvig is operating in a time when people are used to getting instant feedback and service. Services like Swish and Kry have reasonably raised people’s expectations. Swish provides instant transactions of money between two parties and Kry enables access to a doctor through a video chat. Hedvig is compatible with this norm and people’s demand for fast service, by trying to deliver quick service through their app and digitized business model. As described in section 7.2.1, a chatbot could definitely enable faster responses, making Hedvig more compatible in this sense. The ”chat user interface” is also widely used in society, meaning that people are used to it.

Nonetheless, concern for security and privacy is a value in society that is critical for Hedvig, since they handle large amounts of personal data. Security and privacy were also found to be a perceived drawback of chatbots, implying that these concerns need to be dealt with in order to increase the compatibility.

The reported fear of reduced access to human service, due to implementing a chatbot, should not affect Hedvig, at least for existing users. As mentioned in section 7.2.1, the chat has always been designed to signal ”chatbot”. New users may however be discouraged from trying Hedvig, since the lack of human customer service might frighten them.

7.2.3 Complexity

Simplicity is also something that distinguishes Hedvig today. Many of the competing insurance providers are large and rigid companies that are struggling with bureaucracy. Hedvig advertises its simplicity on its website as follows: ”Insurance doesn’t need to be old and boring. Hedvig makes protecting you and your home simple, fast and fair.”

There are however indicators that implementing a chatbot could affect the complexity of the service negatively. Even though a chatbot will not require any major changes to the user interface, the literature study showed that complex questions to chatbots may cause interpretation problems. In accordance with the analysis in section 7.2.1, this increases the importance of a well developed chatbot.

7.2.4 Trialability

The trialability of Hedvig is rather low today, since users have to pay for the full service to try and experiment with it. Parts of the chat functionality may however be tested without paying. Furthermore, Hedvig is the only insurance provider in Sweden without a fixed contract length, which lowers the barriers for users who want to try it.

Even though trialability may be considered as one of Hedvig’s weaknesses today, a chatbot could have a positive impact on it. The capacity of handling customers’ claims and questions would be much higher if the chat was fully automated. Almost no resources would be required to let users try more of the service before deciding to adopt the innovation.

7.2.5 Observability

It can be argued that observability is currently the primary weakness of Hedvig. There are no obvious and natural ways of observing other people using the service, and a chatbot is unlikely to contribute to improved observability. The fact that the service is provided through an app is both an issue and an opportunity in this sense; it may be hard to observe others using the app directly, but access to the internet opens up opportunities to show it there.

7.3 SWOT

A SWOT-matrix is illustrated in figure 7.1, to compile the analysis into a concise format. It provides an overview of how appropriate the current situation is for implementing a chatbot to increase adoption rate. Note that opportunities are external factors out of Hedvig’s control that can be exploited, while threats can potentially lead to a decreased rate of adoption. The strengths and weaknesses are internal characteristics working in favor of and against Hedvig, respectively.

Figure 7.1: A SWOT-analysis for Hedvig in the context of how adoption rate would be affected if a chatbot was implemented.

Part I and II: Discussion

8 Discussion

8.1 Classification with LSTM and Random Forest

In this section, the results from section 4 are discussed and compared to the theory and previous literature. Potential sources of error and failures in the experiments are also analyzed, ending with answers to the problem statement.

8.1.1 Word Embedding

Emphasis was not put on trying to find the optimal word embedding, but two variants were compared for use in Random Forest. It is however difficult to point out a single optimal type of word embedding, since the performance seems to be highly dependent on the corpus. The ambiguous results are illustrated in figure 4.4, where tf-idf yields higher F1 than the standard Bag of Words on the mixed corpus, while the opposite holds for the Hedvig corpus. Regarding N-grams, different values of N were manually tried before concluding that ngram_range = (1, 3) yielded the highest F1. This supports the idea that the order of words in questions is important.
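As an illustration of why N-grams can capture word order, the following minimal Python sketch builds a bag of word N-grams for two documents that contain the same words in a different order. It is not the implementation used in the experiments: the helper names word_ngrams and bag_of_ngrams and the second example sentence are hypothetical, and the inclusive range convention is only assumed to mirror the ngram_range parameter mentioned above.

```python
from collections import Counter


def word_ngrams(text, ngram_range=(1, 3)):
    """Return all word n-grams of `text` for n in the inclusive range."""
    tokens = text.lower().split()
    lowest, highest = ngram_range
    grams = []
    for n in range(lowest, highest + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


def bag_of_ngrams(text, ngram_range=(1, 3)):
    """Count how often each n-gram occurs (a Bag of Words over n-grams)."""
    return Counter(word_ngrams(text, ngram_range))


# Hypothetical documents: same words, different order.
statement_text = "ni har inte studentförsäkring"
question_text = "har ni inte studentförsäkring"

# With unigrams only, the two documents are indistinguishable.
same_unigrams = (bag_of_ngrams(statement_text, (1, 1))
                 == bag_of_ngrams(question_text, (1, 1)))

# With n-grams up to length 3, the word order becomes visible.
same_up_to_trigrams = (bag_of_ngrams(statement_text, (1, 3))
                       == bag_of_ngrams(question_text, (1, 3)))
```

In this sketch the two documents share an identical unigram representation, while bigrams such as "har ni" only occur in the second document, which is one way word order can become available to a classifier.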

Word2vec was not compared to any other word embedding, which makes it more difficult to evaluate. Other state-of-the-art models like ELMo could have improved the performance of LSTM. Since Word2vec is widely used and said to work well for problems like question-answering, its performance should however not be much worse.

8.1.2 Optimization of Hyperparameters

Note that the F1 score increases only slightly or not at all after the random search (illustrated in figures 4.3 and 4.4). This result can be explained in several ways. Firstly, this optimization technique is by definition dependent on chance, even though the prespecified subspace of hyperparameters also affects the result. LSTM in particular has many hyperparameters, and several of those were left at their default values. Secondly, the outcome is dependent on how many iterations were done. Due to the high time complexity of the models, only 146 iterations could for example be done on Random Forest (with Bag of Words on the mixed corpus). Hence, it may be enough to apply the theory and knowledge about the models to find sufficiently good hyperparameters.
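The dependence on chance and on the number of iterations can be illustrated with a minimal random search sketch in Python. This is not the code used in the thesis: the function random_search, the search space and the toy objective are hypothetical stand-ins, and in a real experiment the objective would be a cross-validated F1 score of a trained model.

```python
import random


def random_search(objective, space, n_iter, seed=0):
    """Sample n_iter random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter from the prespecified subspace.
        params = {name: sample(rng) for name, sample in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical subspace of Random Forest hyperparameters.
space = {
    "n_estimators": lambda rng: rng.randrange(50, 501, 10),
    "min_samples_split": lambda rng: rng.choice([2, 5, 10]),
    "max_features": lambda rng: rng.choice(["sqrt", None]),
}


def toy_objective(params):
    """Stand-in for a cross-validated F1 score (assumption, not real data)."""
    return 0.85 - abs(params["n_estimators"] - 300) / 10000


few_params, few_score = random_search(toy_objective, space, n_iter=20, seed=1)
many_params, many_score = random_search(toy_objective, space, n_iter=146, seed=1)
```

With a fixed seed, the longer run contains the shorter run as a prefix, so its best score can never be lower. How much it improves depends entirely on which configurations happen to be drawn, which is consistent with the small gains observed after the random search.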

Furthermore, figures 4.1 and 4.2 form the basis for additional insights. Even though all parameters were varied simultaneously, F1 seems to be highly correlated with training time. Figure 4.1 also indicates that F1 tends to be higher with more epochs, which is reasonable. In contrast to the theory in section 2.3.1, no clear relationship can be seen between the number of classification trees and F1 in figure 4.2.

8.1.3 Data and Human Baseline

There is a noticeable difference in human performance between the Hedvig and mixed corpora, as can be seen in table 4.1. The humans performed better with respect to all metrics on the Hedvig corpus, while the machine learning models are as good as humans on the mixed corpus. The human F1 score was for example 0,862 on the mixed corpus and 0,945 on the Hedvig corpus. Two main reasons for this can be stated. Firstly, machine learning models generally benefit from more data, and the mixed corpus contained more data. Secondly, the Hedvig corpus was manually reviewed but the mixed corpus was not. The manual review eliminated some noise from the results and obviously favored the humans. As can be seen in table 4.2, the most important words in both corpora do however seem intuitively reasonable. Words like ”hur” (how) and ”vad” (what) affect the predictions of Random Forest significantly, which indicates that both the model and the data are somewhat adequate.
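For reference, the metrics used in this comparison can be computed from the counts of true and false positives and negatives. The following Python sketch uses the standard definitions of precision, recall, F1 and accuracy; the function name binary_metrics and the small label vectors are hypothetical and are not taken from the corpora in this thesis.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1 and accuracy for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical labels: 1 = question, 0 = non-question.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

The same function could in principle be applied to both model predictions and human annotations, so that the human baseline and the models are scored in the same way.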

Nonetheless, it must be mentioned that the removal of punctuation and special characters such as question marks from the corpora might have worsened the performance of the models. Documents like ”Ni har inte studentförsäkring!” (You do not have student insurance!) and ”Ni har inte studentförsäkring?” (You do not have student insurance?) should for instance be simple to classify, while the difficulty increases dramatically when the ending special characters (! and ?) are removed. Other examples of similar documents are shown in table 3.5. On the other hand, this decision might have prevented the models from being overtrained on, for example, question marks. The language used in the chat was observed to be informal, and a significant proportion of the presumed questions were found to lack a question mark, as described in section 3.1.3. Although achieving high performance is not the purpose of this thesis, trying both alternatives would have removed this uncertainty.
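The effect of removing punctuation on the two example documents can be made concrete with a short Python sketch. The preprocessing shown here is an assumption for illustration only; the helper strip_punctuation is hypothetical and does not necessarily match the cleaning steps applied to the corpora.

```python
import string


def strip_punctuation(text):
    """Remove ASCII punctuation characters, including '!' and '?'."""
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).strip()


def ends_with_question_mark(text):
    """A trivial rule that only works while punctuation is still present."""
    return text.rstrip().endswith("?")


statement = "Ni har inte studentförsäkring!"
question = "Ni har inte studentförsäkring?"

# Before cleaning, the trivial rule separates the two documents.
rule_on_raw = (ends_with_question_mark(statement),
               ends_with_question_mark(question))

# After cleaning, the two documents are identical strings.
cleaned_statement = strip_punctuation(statement)
cleaned_question = strip_punctuation(question)
```

After cleaning, no classifier can distinguish the two documents from the text alone, which is the uncertainty described above. Keeping the punctuation would make the trivial rule available, at the risk of the model relying on it too heavily.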

8.2 Chatbot’s Effect on Adoption Rate

The literature study, in combination with the analysis in section 7.2, showed that a chatbot could affect the attributes of Hedvig. Regarding the relative advantage of high user experience, it is clear that Hedvig’s branding strategy and mission can lead to an increased feeling of trust. At the same time, high expectations also require a well developed chatbot in order for it to improve the overall user experience. The key here is to acquire enough financial resources and competence to develop such a high performing bot. A suitable practice for when to forward users to humans is also required to improve the user experience.

Hedvig’s compatibility may furthermore be improved by a chatbot, at least if aspects like privacy are dealt with. Multiple values and habits among people in society work in favour of Hedvig’s fast and mobile friendly service, and a chatbot would make it even faster.

A chatbot’s effect on the remaining three attributes is however not as certain. Both the trialability and observability are rather poor today, and especially the observability would probably not be improved significantly by implementing a chatbot. The literature study even showed that the innovation’s complexity might increase, which would lead to a slower rate of adoption. However, the fact that the models in this thesis performed better than humans (at least on the mixed corpus) indicates that the reported interpretation problems with chatbots may be overcome in the future.

Finally, it is important to mention some sources of error in this study of adoption rate. Firstly, a decision was made to focus only on user experience during the literature study. That decision was necessary to narrow the scope, but also meant that less effort was put into evaluating the effects of a chatbot on trialability and observability. Some important points and insights might therefore have been missed. Secondly, no empirical studies or experiments were made, which could have yielded more accurate results that were applicable to Hedvig.

8.3 Answer to Problem Statements

How does the relatively more complex machine learning model LSTM compare to the easier to interpret Random Forest when it comes to classifying insurance related messages as questions or non-questions?

This study has shown that LSTM is only slightly better than Random Forest in terms of F1, precision, recall and accuracy. The highest achieved F1 for LSTM was 0,877, while Random Forest had a maximum of 0,864. The training time of the models was also observed to be similar, which favors Random Forest, since it is generally said to be easier to understand and does not require as much expertise as LSTM does. The possibility to display feature importances in Random Forest can for example be useful when optimizing or debugging the model. The performance of Random Forest is at human level, at least when training on a large corpus, and it scores higher than several of the neural network models in the related studies (see section 2.5). Even though this study suffers from several sources of error, the recommendation is therefore to start with Random Forest when facing similar problems.

The insights from this study may be used as a basis for further studies. Future work can for example investigate the difference between the models for other classification tasks and in other industries. It could also be interesting to include the context in which the messages are written, in the hope of getting higher performance.

How would a chatbot affect the adoption rate of Hedvig?

Since the problem formulation is rather hypothetical, no definite answers can be provided. This study did however show that a chatbot could have a positive impact on Hedvig, especially on its relative advantage and compatibility. These two attributes are Hedvig’s core strengths today, and improving them would reasonably increase the rate of adoption. Relative advantage and compatibility were also pointed out by Rogers to be the most important for achieving a rapid adoption rate. The effect a chatbot would have on complexity, trialability and observability is at the same time suggested to be small, if not negative. To obtain more reliable results, Hedvig is therefore recommended to conduct real experiments or interviews. This future work is especially important, since the conclusions may change over time, as both Hedvig and chatbots develop and the attitude towards them changes.

References

[1] Joulin, Armand et al. “Bag of Tricks for Efficient Text Classification”. In: CoRR abs/1607.01759 (2016). arXiv: 1607.01759. URL: http://arxiv.org/abs/1607.01759.

[2] Mikolov, Tomas et al. “Efficient Estimation of Word Representations in Vector Space”. In: CoRR abs/1301.3781 (2013). arXiv: 1301.3781. URL: https://arxiv.org/abs/1301.3781.

[3] Samuel, Arthur. “Some studies in machine learning using the game of checkers”. In: Journal of research and development 44 (1.2 Jan. 1959), pp. 210–229. DOI: 10.1147/rd.441.0206.

[4] Mitchell, Tom M. Machine Learning. 1st ed. New York City. McGraw-Hill Education, 1997. ISBN: 0070428077.

[5] James, Gareth et al. An Introduction to Statistical Learning. 1st ed. New York City. Springer-Verlag New York Inc, 2013.

[6] Alpaydin, Ethem. Introduction to Machine Learning. 3rd ed. Cambridge, Massachusetts. The MIT Press, 2014.

[7] Jurafsky, Dan and Martin, James H. Speech and Language Processing. 3rd ed. draft. Stanford, 2018.

[8] 4.2. Feature extraction. scikit-learn. 2018. URL: https://scikit-learn.org/stable/modules/feature_extraction.html (visited on 03/08/2019).

[9] Aiyar, Shreyas and Shetty, Nisha P. “N-Gram Assisted Youtube Spam Comment Detection”. In: Procedia Computer Science 132 (2018), pp. 174–182.

[10] Smith, Noah A. “Contextual Word Representations: A Contextual Introduction”. In: CoRR abs/1902.06006 (2019). arXiv: 1902.06006. URL: http://arxiv.org/abs/1902.06006.

[11] Wang, Rujuan, Chen, Gang, and Sui, Xin. “Multi label text classification method based on co-occurrence latent semantic vector space”. In: Procedia Computer Science 131 (2018), pp. 756–764.

[12] Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. The Elements of Statistical Learning. 2nd ed. New York City. Springer-Verlag New York Inc., 2009.

[13] Olah, Christopher. Understanding LSTM Networks. colah’s blog. Aug. 2018. URL: http://colah.github.io/posts/2015-08-Understanding-LSTMs/ (visited on 03/20/2019).

[14] Hochreiter, Sepp and Schmidhuber, Jürgen. “Long Short-Term Memory”. In: Neural Computation 9.8 (1997), pp. 1735–1780. DOI: 10.1162/neco.1997.9.8.1735. URL: https://doi.org/10.1162/neco.1997.9.8.1735.

[15] Wang, Xin et al. “Predicting Polarities of Tweets by Composing Word Embeddings with Long Short-Term Memory”. In: International Joint Conference on Natural Language Processing (2015), pp. 1343–1353.

[16] Kohavi, Ron. “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection”. In: International Joint Conference on Artificial Intelligence (1995).

[17] Bergstra, James and Bengio, Yoshua. “Random Search for Hyper-Parameter Optimization”. In: Journal of Machine Learning Research 13 (2012), pp. 281–305.

[18] Bojanowski, Piotr et al. “Enriching Word Vectors with Subword Information”. In: Transactions of the Association for Computational Linguistics 5 (2017), pp. 135–146. DOI: 10.1162/tacl_a_00051. URL: https://doi.org/10.1162/tacl_a_00051.

[19] Joulin, Armand et al. “FastText.zip: Compressing text classification models”. In: CoRR abs/1612.03651 (2016). URL: http://arxiv.org/abs/1612.03651.

[20] Socher, Richard et al. “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank”. In: Conference on Empirical Methods in Natural Language Processing 1631 (2013), pp. 1631–1642.

[21] Zhou, Chunting et al. “A C-LSTM Neural Network for Text Classification”. In: CoRR abs/1511.08630 (2015). arXiv: 1511.08630. URL: http://arxiv.org/abs/1511.08630.

[22] Xu, Baoxun et al. “An Improved Random Forest Classifier for Text Categorization”. In: Journal of Computers 7 (12 2012), pp. 2913–2920.

[23] Rogers, Everett M. Diffusion of innovations. eng. 5th. New York, NY [u.a.]: Free Press, Aug. 2003, p. 576. ISBN: 0-7432-2209-1, 978-0-7432-2209-9.

[24] Kotler, Philip, Armstrong, Gary, and Opresnik, Marc Oliver. Principles of Marketing. 17th. Harlow, United Kingdom. Pearson Education Limited, 2017. ISBN: 978-1-292-22017-8.

[25] ISO 9241-210:2010(en). Online Browsing Platform. 2010. URL: https://www.iso.org/obp/ui/#iso:std:iso:9241:-210:ed-1:v1:en (visited on 04/02/2019).

[26] Mou, Yi and Xu, Kun. “The media inequality: Comparing the initial human-human and human-AI social interactions”. In: Computers in Human Behavior 72 (2017), pp. 432–440.

[27] Hill, Jennifer, Ford, W. Randolph, and Farreras, Ingrid G. “Real conversations with artificial intelligence: A comparison between human–human online conversations and human–chatbot conversations”. In: Computers in Human Behavior 49 (2015), pp. 245–250.

[28] Følstad, Asbjørn, Nordheim, Cecilie Bertinussen, and Bjørkli, Cato Alexander. “What Makes Users Trust a Chatbot for Customer Service? An Exploratory Interview Study”. In: Internet Science, 5th International Conference (2018), pp. 194–208.

[29] Castro, Fernanda et al. “Developing a Corporate Chatbot for a Customer Engagement Program: A Roadmap”. In: Intelligent Computing Theories and Application, 14th International Conference (2018), pp. 400–412.

[30] Davis, Scott M., Dunn, Michael, and Aaker, David. Building the Brand- Driven Business: Operationalize Your Brand to Drive Profitable Growth. 1st ed. Jossey Bass, 2002. ISBN: 0787962554.

[31] Phillips, Maria. Backas av tungviktare – nu ska Hedvig utmana försäkringsbranschen. DI Digital. May 2018. URL: https://digital.di.se/artikel/backas-av-tungviktare-nu-ska-hedvig-utmana-forsakringsbranschen (visited on 04/08/2019).

[32] Michalco, Jaroslav, Simonsen, Jakob Grue, and Hornbæk, Kasper. “An Exploration of the Relation Between Expectations and User Experience”. In: International Journal of Human–Computer Interaction 31 (9 2015), pp. 603–617.

9 Appendices

Contents

A Random Forest
A.1 Manual tuning
A.2 Random Search

B LSTM
B.1 Manual tuning
B.2 Random Search

A Random Forest

A.1 Manual tuning

After manually tuning the hyperparameters of Random Forest and its different types of word embedding, the following parameters were found to yield high performance for all versions of Random Forest. ngram_range = (1, 3) and min_df = 5 were found to be optimal for the word embedding.

Max depth   Max features   Min samples leaf   Min samples split   Num estimators
None        sqrt           1                  10                  150

Table A.1: The hyperparameters that were found after manually tuning Random Forest and its word embedding
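The manually tuned values in table A.1 can be written down as a configuration, sketched here in Python. The mapping from the table’s column names to keyword names is an assumption based on scikit-learn style naming; only ngram_range and min_df are named explicitly in this appendix, and the helper describe is hypothetical.

```python
# Assumed mapping from the columns of table A.1 to scikit-learn style
# keyword arguments; the exact argument names are not stated in the thesis.
random_forest_params = {
    "max_depth": None,         # "Max depth"
    "max_features": "sqrt",    # "Max features"
    "min_samples_leaf": 1,     # "Min samples leaf"
    "min_samples_split": 10,   # "Min samples split"
    "n_estimators": 150,       # "Num estimators"
}

# Word embedding settings reported as optimal in the text above.
vectorizer_params = {"ngram_range": (1, 3), "min_df": 5}


def describe(params):
    """Format a configuration as a readable 'name=value' string."""
    return ", ".join(f"{name}={value!r}" for name, value in params.items())


summary = describe(vectorizer_params)
```

If scikit-learn is used, such dictionaries could be unpacked into the constructors, for example RandomForestClassifier(**random_forest_params) and CountVectorizer(**vectorizer_params), although which classes were actually used is not stated here.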

A.2 Random Search

Corpus   Mean F1   Max depth   Max features   Min samples leaf   Min samples split   Num estimators
Mixed    0,852     None        sqrt           1                  2                   320
Hedvig   0,854     None        sqrt           1                  2                   320

Table A.2: Random Forest (with BoW): best candidates from the random search with 146 iterations on mixed corpus and 200 iterations on Hedvig. Note that the F1 score is computed on the training corpus

Corpus   Mean F1   Max depth   Max features   Min samples leaf   Min samples split   Num estimators
Mixed    0,819     90          None           1                  5                   180
Hedvig   0,849     None        sqrt           1                  2                   320

Table A.3: Random Forest (with tf-idf): best candidates from the random search with 83 iterations on mixed and 200 iterations on Hedvig corpus. Note that the F1 score is computed on the training corpus

B LSTM

B.1 Manual tuning

After manually tuning the hyperparameters of LSTM, the following parameters were found to yield high performance.

Batch size   Epochs   Embedding dropout rate   LSTM units   Output layer dropout rate   Output layer units
45           3        0,2                      100          0,05                        50

Table B.1: The hyperparameters that were found after manually tuning LSTM

B.2 Random Search

Corpus   Mean F1   Batch size   Epochs   Embedding dropout rate   LSTM units   Output layer dropout rate   Output layer units
Mixed    0,887     1112         23       0,299                    18           0,400                       31
Hedvig   0,877     240          9        0,384                    155          0,109                       50

Table B.2: LSTM: best candidates from the random search with 50 iterations on Hedvig corpus and 110 iterations on mixed. Note that the F1 score is computed on the training corpus
