LSTM vs Random Forest for Binary Classification of Insurance Related Text

Degree Project in Technology, First Cycle, 15 credits
Stockholm, Sweden 2019

Hannes Kindbom

KTH School of Engineering Sciences

Degree Projects in Applied Mathematics and Industrial Economics (15 hp)
Degree Programme in Industrial Engineering and Management (300 hp)
KTH Royal Institute of Technology, 2019
Supervisor at Hedvig: John Ardelius
Supervisors at KTH: Tatjana Pavlenko and Julia Liljegren
Examiner at KTH: Jörgen Säve-Söderbergh

TRITA-SCI-GRU 2019:151
MAT-K 2019:07

Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci

Abstract

The field of natural language processing has received increased attention lately, but less focus is put on comparing models that differ in complexity. This thesis compares Random Forest to LSTM for the task of classifying a message as question or non-question. The comparison was done by training and optimizing the models on historic chat data from the Swedish insurance company Hedvig. Different types of word embedding were also tested, such as Word2vec and Bag of Words. The results demonstrated that LSTM achieved slightly higher scores than Random Forest in terms of F1 and accuracy. The models' performance was not significantly improved by optimization, and it also depended on which corpus the models were trained on. An investigation of how a chatbot would affect Hedvig's adoption rate was also conducted, mainly by reviewing previous studies about chatbots' effects on user experience. The potential effects on the innovation's five attributes (relative advantage, compatibility, complexity, trialability and observability) were analyzed to answer the problem statement. The results showed that the adoption rate of Hedvig could be positively affected by improving the first two attributes.
The effects a chatbot would have on complexity, trialability and observability were, however, suggested to be negligible, if not negative.

Keywords

Random Forest, Classification, Natural Language Processing, Machine Learning, Neural Networks, Bag of Words, Bachelor Thesis, Diffusion of Innovation, Adoption Rate, User Experience

Sammanfattning (Summary, translated from Swedish)

The scientific field of natural language processing has received increased attention recently, but less focus is directed at comparing models that differ in complexity. This bachelor thesis compares Random Forest with LSTM by examining how well the models can be used to classify a message as question or non-question. The comparison was made by training and optimizing the models on historical chat data from the Swedish insurance company Hedvig. Different types of word embedding, such as Word2vec and Bag of Words, were also tested. The results showed that LSTM achieved slightly higher F1 and accuracy than Random Forest. The models' performance was not significantly improved after optimization, and the result also depended on which corpus the models were trained on. An investigation of how a chatbot would affect Hedvig's adoption rate was also carried out, mainly by reviewing previous studies on chatbots' effect on user experience. The potential effects on an innovation's five attributes (relative advantage, compatibility, complexity, trialability and observability) were analyzed in order to answer the problem statement. The results showed that Hedvig's adoption rate can be positively affected by improving the first two attributes. The effects a chatbot would have on complexity, trialability and observability were, however, considered negligible, if not negative.
Nyckelord (Keywords, translated from Swedish)

Random Forest, Classification, Natural Language Processing, Machine Learning, Neural Networks, Bag of Words, Bachelor Thesis, User Experience

Acknowledgements

Firstly, I would like to thank the Swedish insurance company Hedvig for providing me with adequate data and for the project idea. Special thanks to Hedvig's CTO John Ardelius and his colleague Ali Mosavian for excellent guidance throughout the entire process.

I would also like to acknowledge my supervisor Tatjana Pavlenko at the Department of Mathematics, who always showed dedication and interest in my work. Tatjana allocated time for advice and discussion whenever I ran into trouble or had questions.

I would also like to express my gratitude to my supervisor Julia Liljegren at the Department of Industrial Economics and Management. Julia primarily helped me with questions related to the second part of the bachelor thesis.

2019-05-22

Author
Hannes Kindbom, Industrial Engineering and Management
KTH Royal Institute of Technology

Supervisors
Associate Professor Tatjana Pavlenko, Department of Mathematics, KTH Royal Institute of Technology
PhD Julia Liljegren, Department of Industrial Economics and Management, KTH Royal Institute of Technology
CTO John Ardelius, Hedvig AB

Contents

1 Introduction
  1.1 Background
  1.2 Problem statements
  1.3 Purpose
  1.4 Delimitations

Part I - Binary Classification with LSTM and Random Forest

2 Theoretical Background
  2.1 Introduction to Machine Learning
  2.2 Word embedding
    2.2.1 Bag of Words
    2.2.2 Word2vec
  2.3 Classification
    2.3.1 Random Forest
    2.3.2 LSTM
  2.4 Validation
    2.4.1 Training and Test Split
    2.4.2 K-fold Cross-validation
    2.4.3 Evaluation Metrics
    2.4.4 Optimization of Hyperparameters
  2.5 Related Work

3 Method
  3.1 Data
    3.1.1 Raw Data
    3.1.2 Formatting
    3.1.3 Labeling
    3.1.4 Sampling Additional Data
    3.1.5 Train and Test Data
  3.2 Implementing Random Forest
    3.2.1 Word embedding for Random Forest
    3.2.2 Random Forest
  3.3 Implementing LSTM
    3.3.1 Word2vec
    3.3.2 LSTM
  3.4 Hyperparameter Optimization and Valuation
  3.5 Human Baseline

4 Results
  4.1 Optimization of Hyperparameters
  4.2 Human Baseline
  4.3 Feature Importance
  4.4 The Final Models

Part II - Chatbot's Effect on Adoption Rate

5 Theoretical Background
  5.1 Diffusion of Innovations
  5.2 SWOT

6 Method
  6.1 Literature Search
  6.2 Selection

7 Results
  7.1 Literature Study
  7.2 Five Attributes of Hedvig
    7.2.1 Relative Advantage
    7.2.2 Compatibility
    7.2.3 Complexity
    7.2.4 Trialability
    7.2.5 Observability
  7.3 SWOT

Part I and II - Discussion

8 Discussion
  8.1 Classification with LSTM and Random Forest
    8.1.1 Word Embedding
    8.1.2 Optimization of Hyperparameters
    8.1.3 Data and Human Baseline
  8.2 Chatbot's Effect on Adoption Rate
  8.3 Answer to Problem Statements

References

9 Appendices

1 Introduction

1.1 Background

Machine learning has received increased attention lately, mainly due to improvements in computational power and access to vast amounts of data. In particular, the sub-field of natural language processing (NLP), which deals with how to use computers to process and analyze language data, has been heavily researched. A common task within NLP is to create chatbots, which can be used to reduce response time and increase availability for users. Much of the focus is directed at improving the performance of these chatbots, of which text classification is a main part. “Text classification is an important task in NLP with many applications, such as web search, information retrieval, ranking and document classification.” [1]

This thesis focuses on text classification, on behalf of the Swedish insurance company Hedvig. Different approaches for determining whether a given message is a question or not will be evaluated. The problem is modeled as an NLP problem and can be decomposed into two main parts. The first part is to represent the piece of text (also called the corpus) as something the computer can interpret. The current state of the art is to use artificial neural networks to create word vectors that make up a semantically meaningful vector space. One widely used model for this is Word2vec [2]. The second part of the problem is to use these word vectors in a classifier. “Recently, models based on neural networks have become increasingly popular” [1], where recurrent neural networks like the LSTM (Long Short-Term Memory) is an example

