
Hochschule Rhein-Waal
Rhine-Waal University of Applied Sciences
Faculty of Communication and Environment
Prof. Dr.-Ing. Rolf Becker
Dr. Lukas Gilz

Development of a Language Model for the Medical Domain

Master Thesis

A Thesis submitted in Partial Fulfillment of the Requirements of the Degree of
Master of Science in Engineering and Computer Science

by Manjil Shrestha
Dietrichstr. 77
53175 Bonn

Matriculation Number: 24925
Submission Date: 03.09.2020

Abstract

Language models are widely used as representations of written language in various machine learning tasks, the most commonly used model being Bidirectional Encoder Representations from Transformers (BERT). It has been shown that prediction quality benefits strongly from language model pretraining on domain-specific data. The publicly available models, however, are usually trained on Wikipedia, news, or legal data and therefore lack domain-specific knowledge about medical terms. In this thesis, we train a BERT language model on medical data and compare its performance with domain-unspecific language models. The dataset used for this purpose is the Non-technical Summaries - International Statistical Classification of Diseases (NTS-ICD) task of classifying animal experiment descriptions into International Statistical Classification of Diseases (ICD) categories.

Keywords: Multi-label Classification, ICD-10 Codes, Fine-tuning, BERT, Transfer Learning, NTS classification

Contents

List of Abbreviations
List of Figures
List of Tables
List of Equations

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Project Objective

2 Related Works
  2.1 Transfer Learning
    2.1.1 ULM-FiT
    2.1.2 ELMo
  2.2 Bidirectional Encoder Representations from Transformers
    2.2.1 Pre-training
    2.2.2 Fine-tuning
  2.3 NLP research in Non-technical Summaries (NTS)
    2.3.1 Classification of NTS Documents
  2.4 Domain Specific BERT Models
    2.4.1 BioBERT
    2.4.2 SciBERT
    2.4.3 ClinicalBERT

3 Dataset Description
  3.1 ICD-10
  3.2 Non-Technical Summaries description
  3.3 Code Statistics
  3.4 Dataset for Fine-tuning

4 Theoretical Foundations
  4.1 Machine Learning
    4.1.1 Supervised Learning
    4.1.2 Unsupervised Learning
    4.1.3 Self-supervised Learning
  4.2 Loss Functions
  4.3 Deep Learning
    4.3.1 Neural Network
    4.3.2 Error Backpropagation
    4.3.3 Convolutional Neural Network
    4.3.4 Recurrent Neural Network
  4.4 Natural Language Processing
    4.4.1 Word Embeddings
    4.4.2 Bag of Words (BoW)
    4.4.3 Term Frequency - Inverse Document Frequency (TF-IDF)
    4.4.4 Language Modeling
    4.4.5 NLP Tasks
  4.5 BERT
    4.5.1 Attention
    4.5.2 Transformers
  4.6 Domain Adaptation
  4.7 Feature Extraction
  4.8 Multi-label Classification
  4.9 Evaluation
    4.9.1 Confusion Matrix
    4.9.2 Precision
    4.9.3 Recall
    4.9.4 F1-score
    4.9.5 Example: Micro and Macro Performance

5 Experiments
  5.1 Experimental Setup
  5.2 Fine-tuning German BERT model with medical domain dataset
    5.2.1 Usecase 1: Sentence train - Sentence eval
    5.2.2 Usecase 2: Sentence train - Article eval
    5.2.3 Usecase 3: Article train - Article eval
    5.2.4 Usecase 4: Article train - Sentence eval
  5.3 BERT Predicting Masked Tokens Experiment
    5.3.1 Experiment 1: SubWord Masking
    5.3.2 Experiment 2: WholeWord Masking
  5.4 NTS-ICD-10 document classification
    5.4.1 Linear SVC
    5.4.2 Logistic Regression
    5.4.3 FastText
    5.4.4 Bidirectional Encoder Representations from Transformers
  5.5 Results
    5.5.1 Masking experiment results
    5.5.2 NTS-ICD-10 classification results

6 Conclusion and Future Work
  6.1 Review
  6.2 Discussion
  6.3 Future Work

Appendices

Tools and Environment
  1 Software
  2 Hardware

Learning Parameters
  1 Fine-tuning parameters
  2 Training parameters

Experiments
  1 BoW and TF-IDF dataset sample
  2 Masking experiment sample dataset
    2.1 NTS dataset
    2.2 Wikipedia dataset
  3 Sub-Word Token prediction for NTS articles
  4 Whole Word Token prediction for NTS articles
  5 Sanity check on NTS token predictions
    5.1 NTS-random words sentences
    5.2 NTS-original sentences
  6 NTS datasets to evaluate model
    6.1 Doc 6198
    6.2 Doc 6184
    6.3 Doc 6158
    6.4 Doc 6183

Declaration

List of Abbreviations

3R Replacement, Reduction, Refinement
AI Artificial Intelligence
BCE Binary Cross Entropy
BERT Bidirectional Encoder Representations from Transformers
BFR Federal Institute for Risk Assessment
Bi-LSTM Bidirectional Long Short-Term Memory
BoW Bag of Words
CBOW Continuous Bag of Words
CLS Text Classification
CLSTM Codes Attentive LSTM
CNN Convolutional Neural Network
CNS Central Nervous System
DEP Dependency Parsing
DF Document Frequency
DL Deep Learning
ELMo Embeddings from Language Models
F1 F1-score
FN False Negative
FP False Positive
GCTR German Clinical Trials Register
GM German Modification 2016 Version
ICD International Statistical Classification of Diseases
IDF Inverse Document Frequency
IR Information Retrieval
LM Language Model
LSI Latent Semantic Indexing
LSTM Long Short-Term Memory
ML Machine Learning
MLM Masked Language Model
NER Named Entity Recognition
NLI Natural Language Inference
NLP Natural Language Processing
NSP Next Sentence Prediction
NTS Non-technical Summaries
NTS-ICD Non-technical Summaries - International Statistical Classification of Diseases
OAR Open Agrar Repository
OOV Out of Vocabulary
OVA One-vs-All
P Precision
PICO PICO Extraction
QA Question Answering
R Recall
RE Relation Extraction
REL Relation Extraction
RNN Recurrent Neural Network
SVC Support Vector Classification
SVM Support Vector Machine
TF Term Frequency
TF-IDF Term Frequency - Inverse Document Frequency
TN True Negative
TP True Positive
ULM-FiT Universal Language Model Fine-Tuning

List of Figures

2.1 Learning process of transfer learning (Taken from Tan et al. 2018)
2.2 BERT input representations (Taken from Devlin et al. 2019)
2.3 Overview of the pre-training and fine-tuning of BioBERT (Taken from Lee et al. 2019)
3.1 Number of training, development and test documents in the evaluation dataset (NTS-ICD-10)
3.2 Chapter II (C00-D48) sub-groups present in the ICD-10 tree (Based on WHO 2019)
3.3 Statistical values of the six field contents in NTS training documents
3.4 Types of ICD-10 codes in NTS training documents
3.5 Conditional occurrence of ICD-10 codes in Chapter II and its groups in the NTS training dataset
3.6 Frequent ICD-10 codes in the NTS training dataset
3.7 Number of NTS training documents containing a certain range of ICD-10 codes
3.8 ICD-10 chapter occurrence count in the NTS training label set
3.9 ICD-10 group occurrence count in the NTS training label set
3.10 Articles scraped from medical websites to create the fine-tuning dataset
4.1 Supervised learning model
4.2 Neural network structure
4.3 Neural network output function
4.4 Activation functions (Taken from missinglink.ai n.d.)
4.5 Simple recurrent neural network (Taken from Jurafsky and Martin 2019)
4.6 Word embedding (Mikolov et al. 2013)
4.7 Parallelogram model of word analogy (Chen, Peterson, and Griffiths 2017)
4.8 BoW representation example (NTS dataset)
4.9 TF-IDF representation example (NTS dataset)
4.10 Transformer architecture (Taken from Vaswani et al. 2017)
4.11 Encoder model (Taken from Vaswani et al. 2017)
4.12 Decoder model (Taken from Vaswani et al. 2017)
4.13 (Left) Scaled Dot-Product Attention, (Right) Multi-Head Attention (Taken from Vaswani et al. 2017)
5.1 Sentence training using scraped medical articles - Sentence eval on NTS documents: fine-tuning loss plot
5.2 Sentence training using scraped medical articles - Article eval on NTS documents: fine-tuning loss plot
5.3 Article training using scraped medical articles - Article eval on NTS documents: fine-tuning loss plot
5.4 Article training using scraped medical articles - Sentence eval on NTS documents: fine-tuning loss plot
5.5 SubWord masking experiment and performance of the German BERT model
5.6 SubWord masking experiment and performance of the German BERT model fine-tuned with the medical domain (Section 5.2.3)
5.7 WholeWord masking experiment and performance of the German BERT model
5.8 WholeWord masking experiment and performance of the German BERT model fine-tuned with the medical domain (Section 5.2.3)
5.9 SubWord masking experiment and performance of the German BERT model when passing actual sentences (nts-original) versus sentences with words replaced to break the context (nts-random)
5.10 Multi-lingual BERT for NTS-ICD-10 document classification (Amin et al. 2019)
5.11 Multi-lingual BERT with Binary Cross Entropy (BCE) loss for NTS-ICD-10 document classification
5.12 German BERT with BCE loss for NTS-ICD-10 document classification
5.13 Fine-tuned medical German BERT with maximum sequence length 256 and minimum label occurrence 25 for NTS-ICD-10 document classification
5.14 Fine-tuned medical German BERT with maximum sequence length 512 and minimum label occurrence 25 for NTS-ICD-10 document classification

List of Tables

2.1 BERT mask prediction without context
2.2 BERT mask prediction with context
3.1 ICD-10 chapters present in the ICD-10 tree (Based on WHO 2019)
3.2 Chapter I (A00-B99) sub-groups in the ICD-10 tree (Based on WHO 2019)
3.3 Main diagnosis codes in the A00.- dagger section of the ICD-10 tree (Based on WHO 2019)
4.1 Word one-hot encoded table
4.2 Confusion matrix table in the case of multi-class or multi-label classification
4.3 Confusion matrix table when all labels from each document are linked correctly to the actual ones
4.4 Confusion matrix table when one label from a document is mislabelled with another label (label l3 from document d2 is mislabelled as l2)
5.1 Top-5 sub-word masked token prediction results for nts-sentences using the German BERT model
5.2 Top-5 sub-word masked token prediction results for nts-articles using the German BERT model
5.3 Top-5 sub-word masked token prediction results for nts-sentences (German BERT fine-tuned with the medical domain, Section 5.2.3)
5.4 Top-5 sub-word masked token prediction results for nts-articles (German BERT fine-tuned with the medical domain, Section 5.2.3)
5.5 Top-10 whole-word masked token prediction results for nts-sentences using the German BERT model
5.6 Top-10 whole-word masked token prediction results for nts-articles using the German BERT model
5.7 Top-10 whole-word masked token prediction results for nts-sentences (German BERT fine-tuned with the medical domain, Section 5.2.3)
5.8 Top-10 whole-word masked token prediction results for nts-articles (German BERT fine-tuned with the medical domain, Section 5.2.3)
5.9 Baseline scores for NTS-ICD-10 document classification
5.10 NTS-ICD-10 code predictions with the BERT multilingual model
5.11 NTS-ICD-10 code predictions with the German BERT model
5.12 Medical fine-tuned BERT scores for NTS-ICD-10 document classification
5.13 NTS-ICD-10 predictions with the medical domain fine-tuned BERT model (max-sequence-length: 256)
5.14 NTS-ICD-10 predictions with the medical domain fine-tuned BERT model (max-sequence-length: 512)

List of Equations

4.1 Binary Cross-Entropy Loss
4.2 Multi-class Cross-Entropy Loss
4.3 Mean Squared Error
4.4 Neural Network Output
4.5 TF-IDF equation
4.6 Scaled dot-product attention
4.7 True Positive equation
4.8 False Negative equation
4.9 False Positive equation
4.10 Micro average tp, fp, fn
4.11 Micro average precision equation
4.12 Macro average individual precision equation
4.13 Macro average precision equation
4.14 Micro average recall equation
4.15 Macro average individual recall equation
4.16 Macro average recall equation
4.17 Micro F1-score equation
4.18 Macro F1-score equation


Chapter 1

Introduction

Language is one of the most essential things that make us human: it lets us reason, plan, and communicate, and hence think. With the help of language, we can properly think, reason, and communicate with others to express ourselves clearly. Since there are many languages, fluent interaction requires knowing the natural language of the other party in order to express our intents and expectations effectively. Language also enables us to receive vast amounts of information we could never have acquired without it. Natural language is involved in all sectors, such as science, mathematics, sociology, law, and medicine, and in the activities we perform daily. Processing all of this information manually is time-consuming, and the information hidden in it does not remain persistent. In the process of achieving persistence, this information is digitized and stored in computers, which has driven the evolution of Natural Language Processing (NLP) as a means of extracting the hidden information.

With the rise of NLP, it is now possible for computers to interact with natural language by applying machine learning algorithms to understand text and speech. NLP sits at the intersection of artificial intelligence and linguistics. Its purpose is mostly to extract meaning from text, for example part-of-speech tagging, which distinguishes whether a word is a noun, a verb, or an adjective. Brants (2003) mentions various NLP techniques, including stemming, part-of-speech tagging, chunking, word sense disambiguation, and others, that have been used to perform Information Retrieval (IR) tasks such as document classification, text summarization, speech recognition, spam detection, machine translation, and named entity recognition. NLP is concerned with capturing the syntactic and semantic representation of text and has been a strong foundation for building these IR applications.

While NLP is more concerned with performing linguistic tasks and extracting contextual information using approaches such as Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (Bi-LSTM), Recurrent Neural Network (RNN), attention, and transformers, which we discuss in chapter 4, Machine Learning (ML) provides the statistical machinery to perform tasks such as classification and clustering. ML helps the machine learn from the representations produced by NLP techniques. As Roesch et al. (2018) note, ML powers diverse services in industry, including search, translation, recommendation systems, and security. ML allows computers to discover insights from data and perform the underlying tasks without being explicitly programmed to do so. All it needs is a dataset to learn a model from and a task to perform, such as document classification, machine translation, or recommendation. The performance of the resulting model may vary depending on the volume of data used during training. As mentioned by Fortuny, Martens, and Provost (2013), larger datasets often lead to richer models with better predictive ability, but also with a higher chance of over-fitting. With proper fine-graining and feature selection on larger datasets, however, predictive performance becomes better than with smaller datasets.

There are a huge number of datasets available for the tasks mentioned above, but their volume varies, and a small dataset can prevent a model from performing well. It is common to face a predictive task for which only a low volume of data is available, and, as argued by Fortuny, Martens, and Provost (2013), small datasets do not work well with certain ML algorithms. This is one of the reasons for the development of transfer learning. Transfer learning, as described by Rosenstein, Marx, and Kaelbling (2005), involves two interrelated learning problems, using knowledge about one set of tasks to improve performance on a related task. Do and Ng (2006) also discuss transfer learning while solving a text classification problem by meta-learning a new learning algorithm that automatically chooses an appropriate parameter function for a new classification problem. Transfer learning is an important approach when we have relatively little data to train on: it improves performance by reusing the word embeddings and pre-trained word vectors of a model pre-trained on a similar task. For a text classification problem, this pre-trained model would be a pre-trained language model whose word embeddings represent the text in a form the machine can work with.

To better classify and represent natural language, several language models have been built to support inference tasks such as named entity recognition and document classification. These language models, however, are mostly built for the English language. Proper language models have proven useful for capturing word meanings and context from a sentence, yet very few of the available models are domain-specific. With this in mind, this thesis aims at obtaining a proper language model by feeding it domain-specific texts. As the domain, we are concerned with the behaviour of a language model on medical texts in the German language. Medical texts may contain symptoms, diagnoses, patient history, medications, therapies, and procedures, or a combination of all of these. From such texts, the language model extracts features to build word context and to better represent a document with a diagnosis code. These language models can therefore also be used to build a classification model that represents medical texts well enough for proper inference.

This type of problem is treated as a multi-label classification problem, as a patient can be associated with a single disease or with multiple diseases. This thesis works on relating medical texts to the accurate diagnosis codes associated with them.

1.1 Motivation

Text classification is a widely used application in the domain of NLP. Its main purpose is to link a text document to its related group, whether in sentiment analysis, document classification, spam detection, or named entity recognition. It has also evolved so that each document can now be mapped to one or to multiple groups, and over the past few decades much research has been done on text classification problems.

Do and Ng (2006) use multiclass text classification to map a given training set of documents to one of several disjoint classes. They consider the task of automatically learning a parameter function for text classification by proposing a "meta-learning" approach that leverages data from a variety of related classification tasks to obtain a good classifier for new tasks. This, as they state, is an instance of transfer learning, and it outperformed approaches that used human-designed parameter functions based on Naive Bayes and Term Frequency - Inverse Document Frequency (TF-IDF).

Another text classification problem was handled by Dalal and Zaveri (2011), which involved automatically assigning a text document to a set of pre-defined classes using an ML technique, based on significant words or features extracted from the document. Their paper explains general pre-processing techniques for text, such as eliminating stop-words and stemming, and some major issues encountered in their work, such as dealing with unstructured text and handling a large number of attributes. To handle the large number of attributes, the authors experimented with TF-IDF, Latent Semantic Indexing (LSI), and multi-word features. Their paper also discusses the need for text classifiers for languages other than English and which supervised machine learning technique is better suited for classifying text documents. P. Zhou et al. (2016) propose Bi-LSTM networks with two-dimensional max pooling (BLSTM-2DPooling) to capture features on both the time-step dimension and the feature-vector dimension: the text is first transformed into vectors with a Bi-LSTM, and a 2D max pooling operation then produces a fixed-length vector. They additionally apply 2D convolution to capture more meaningful features of the input text. The Bi-LSTM is used to capture both the future and the past context. The authors report that BLSTM-2DCNN performed better than BLSTM-2DPooling.

Adhikari et al. (2019) use BERT for document classification and reduce the computational expense of BERTlarge by distilling its knowledge into a much smaller model with roughly 30 times fewer parameters. They fine-tune all parameters of BERT to achieve state-of-the-art results for document classification. They report that a learning rate of 2e-5 and a maximum sequence length of 512 tokens yield better performance on the validation sets of all datasets. After learning the representations with BERTlarge, they used a batch size of 128 for multi-label tasks and 64 for single-label tasks.

With all these approaches available for text and document classification in the English language and in the general domain, this thesis uses the German BERT1 model and fine-tunes it with medical texts to verify the effects of transfer learning and fine-tuning on domain-specific data.

1.2 Problem Statement

Language models for a specific domain are highly desirable, as they make the conclusions drawn from domain data more reliable. Medical texts contain words with specific medical terminology that is difficult for a general language model to capture, so the results of general language models are not reliable enough to base further decisions on. Ambiguity between words can also cause the language model to capture an incorrect context.

If the existing German BERT language model is fine-tuned on medical texts, the model can capture representations for medical terms and learn better from the medical context. Medical terms such as Krebs, neurogenerativen, Hypertonie, or Diabetes, which are specific terms related to diseases or health conditions, are difficult for the general German BERT to capture.

Therapie gegen Stammzellen beim Darmkrebs.
Darmkrebs ist eine der häufigsten Krebserkrankungen weltweit und endet häufig tödlich.
(Therapy targeting stem cells in colorectal cancer. Colorectal cancer is one of the most common cancers worldwide and is often fatal.)

Referring to the two sentences above, the model learns representations for the words Krebs and Darmkrebs and builds a contextual relation between them. A language model fine-tuned on such domain data will be better able to capture the meaning and context of the sentences and thus to suggest diagnosis codes related to the disease "Cancer" or to other diseases.

1 Deepset (German BERT): https://deepset.ai/german-bert

1.3 Project Objective

Although there are various language models available that are trained on general texts with no specific domain, we intend to fine-tune the existing German BERT language model with texts from the medical domain in the German language. Through this thesis, we want to verify that fine-tuning a language model with domain-dependent text can better represent the underlying terminology and support more efficient inference.

The domain under consideration is the medical field in the German language. As German BERT has been trained on Wikipedia, an open legal data dump, and news articles, the model has shortcomings in performance because it does not contain representations for medical terms. We therefore fine-tune this German BERT language model with articles from the medical domain to obtain more robust representations in the target domain and hence improve its performance on the downstream task.

Given a medical text, the underlying language model should be able to identify medical entities such as diseases, symptoms, or therapies. This is verified by observing the performance of the language model on the target tasks. It should then be able to perform the downstream task, which within the scope of this thesis is multi-label text classification. Such a language model is thus well suited for building medical text classification systems and for the benefits that follow, such as predicting the diagnosis described for a patient.

In this thesis, we try to answer and investigate the following research questions:

1. Does fine-tuning improve the quality of language models on the target domain?
2. Does the fine-tuned model improve model performance for downstream tasks?
3. Does improvement in the language model also imply improvement in downstream tasks?

Chapter 2

Related Works

In this chapter we discuss works related to transfer learning and the use of context vectors. At the end we look into the BERT language model, which has been a major development based on Deep Learning (DL) approaches. The release of BERT has had a huge impact on this line of research, and several models have since been developed to achieve results in different domains such as science (Beltagy, Lo, and Cohan 2019) and general language. We also review NLP research that was available before BERT was developed and how DL approaches evolved beyond classical ML approaches.

2.1 Transfer Learning

In this section, we discuss transfer learning, its uses, importance, and applications in DL. We also look at word embeddings and how they inspired the development of transfer learning approaches such as Embeddings from Language Models (ELMo) and Universal Language Model Fine-Tuning (ULM-FiT).

Whenever we train a neural network or DL model from scratch, we require a lot of data for the model to learn well. In most cases, however, we do not have access to that volume of data. This limitation can prevent the model from learning good representations, and the resulting predictions will not be promising. This is the situation where transfer learning becomes the savior. Transfer learning builds a new model by reusing the word embeddings or the learned weights of an existing model, known as a pre-trained model. Transfer learning is useful not only when data is scarce, but also when a model has already been trained on a similar task with a large volume of data.

Transfer learning relaxes the hypothesis that the training data must be independent and identically distributed with the test data, which motivates its use when training data is insufficient (Tan et al. 2018). It is a technique to transfer knowledge from a source domain to a target domain, which makes it an important tool in machine learning for addressing a lack of training data. Transfer learning is particularly helpful in special domains where training data is insufficient and collecting more data is inefficient and expensive.

Figure 2.1: Learning process of transfer learning. (Taken from Tan et al. 2018)

As seen in Figure 2.1, transfer learning transfers knowledge from the source domain to the target domain, where in most cases the source-domain dataset is much larger than the target-domain dataset. A model has already been trained on the large source-domain dataset, resulting in a pre-trained model. The idea of transfer learning in its feature-extraction form is to use only the pre-trained model's weights to extract features, without updating those weights during training on the target data for the target task.
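As a minimal sketch of this frozen-weights idea (assuming PyTorch and the Hugging Face transformers library; the checkpoint name bert-base-german-cased and the label count are illustrative choices, not taken from the text), a pre-trained encoder can be loaded, frozen, and used purely as a feature extractor for a small task-specific classifier:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load a pre-trained encoder (source-domain knowledge) and freeze its weights,
# so it is only used to extract features for the target task.
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
encoder = AutoModel.from_pretrained("bert-base-german-cased")
for param in encoder.parameters():
    param.requires_grad = False           # weights are not updated during target training

# A small trainable head for the target task (e.g. multi-label classification).
num_labels = 233                           # assumed label count (NTS-ICD-10 chapters/groups)
classifier = torch.nn.Linear(encoder.config.hidden_size, num_labels)

def extract_features(texts):
    """Encode a batch of texts with the frozen encoder and return the [CLS] vectors."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():                  # no gradients flow into the frozen encoder
        outputs = encoder(**batch)
    return outputs.last_hidden_state[:, 0, :]   # [CLS] token representation

logits = classifier(extract_features(["Darmkrebs ist eine häufige Krebserkrankung."]))
print(logits.shape)                        # torch.Size([1, 233])
```

Only the linear head is updated by an optimizer in this setup; the encoder supplies fixed features learned on the source domain.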

2.1.1 ULM-FiT

ULM-FiT is an effective transfer learning method that can be applied to any task in NLP, as presented by Howard and Ruder (2018), and it introduces techniques for fine-tuning a language model. The authors propose to pre-train a Language Model (LM) on a large general-domain corpus and fine-tune it on the target task using novel techniques.

In their experiments, they used AWD-LSTM, a regular LSTM with no attention, shortcut connections, or other sophisticated additions, and with various tuned dropout hyperparameters. They also point out different fine-tuning approaches, such as discriminative fine-tuning, which uses different learning rates for each layer, and slanted triangular learning rates, which first linearly increase the learning rate and then linearly decay it according to a certain update schedule.

Howard and Ruder (2018) propose to gradually unfreeze the model starting from the last layer, fine-tune all unfrozen layers, and continue until all layers are unfrozen and fine-tuned. The authors claim that using discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing is beneficial. They also point out that general-domain pretraining together with these novel fine-tuning techniques prevents overfitting even with only 100 labeled examples and achieves state-of-the-art results on small datasets.

ULM-FiT, a language model pre-trained on a large general-domain corpus and then fine-tuned on the target task using novel techniques such as discriminative fine-tuning and tuned learning-rate schedules, proved to be an effective way of fine-tuning a pre-trained model on a target task. It also handled transfer learning well by using pre-trained word embeddings, which helped capture long-term dependencies for the target task.

2.1.2 ELMo

ELMo is another transfer learning approach, proposed by Peters et al. (2018) and published almost simultaneously with ULM-FiT. It is a new type of deep contextualized word representation that models both the complex characteristics of word use and polysemy. The authors also claim that these representations can easily be added to existing models and significantly improve the state of the art across several NLP problems.

ELMo uses vectors derived from a bi-directional LSTM that is trained with a coupled language model objective on a large text corpus. A bi-directional language model is a combination of forward and backward LMs that are trained separately. ELMo word representations are functions of the entire input sequence; they are computed on top of two-layer bi-directional LMs with character convolutions, as a linear function of the internal network states.

In ELMo, syntactic information is better represented at the lowest layers, while semantic information is captured at the higher layers, where the bi-directional LM is able to disambiguate both the part of speech and the word sense in the source sentence. The word embeddings obtained from ELMo are compatible with any target task, and adding them to a model increases sample efficiency. ELMo achieved the state-of-the-art result at the time on six different NLP tasks.

Because ELMo captures the syntactic and semantic information of a word using a deep bi-directional contextualized mechanism, it can overcome a limitation of Mikolov et al. (2013) and Howard and Ruder (2018) by also modeling polysemy. Polysemy refers to a word having several meanings. The word bank in the two sentences I got money from a bank and I sit on a bank relates to two different meanings: the first is a place to keep your money and the second is a river bank. This difference is handled by ELMo, while word2vec would construct a relation between money and sit, seeing that both words occur close to bank.
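To make the polysemy point concrete, the sketch below uses an English BERT checkpoint from Hugging Face transformers as a stand-in for a contextual encoder such as ELMo (an assumption made purely for illustration): the word bank receives different vectors in the two sentences, whereas a static word2vec-style embedding would assign it a single vector.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_of(sentence, word):
    """Return the contextual embedding of `word` inside `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = vector_of("i got money from a bank", "bank")
v2 = vector_of("i sit on a bank", "bank")
cos = torch.nn.functional.cosine_similarity(v1, v2, dim=0)
print(f"cosine similarity of the two 'bank' vectors: {cos.item():.3f}")  # < 1.0: context-dependent
```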

2.2 Bidirectional Encoder Representations from Transformers

In this section, we discuss BERT and its approach to pre-training and fine-tuning. We also see how BERT constructs the input representation used in its training layers, walk through a sample tokenization with its WordPiece tokenizer, and look at its masked token prediction.

BERT, which stands for Bidirectional Encoder Representations from Transformers, is designed to train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. BERT uses a Masked Language Model (MLM) and Next Sentence Prediction (NSP) as pre-training objectives.

BERT uses two training phases: pre-training and fine-tuning. The MLM objective randomly masks some of the tokens of the input sequence, and the goal is to predict the original vocabulary id of the masked word based only on its context. For each token selected for masking (15% of the input tokens), BERT proceeds as follows (Devlin et al. 2019); a short sketch of this rule is given after the list:

• replace the i-th token with the [MASK] token 80% of the time.
• replace the i-th token with a random token 10% of the time.
• keep the i-th token unchanged 10% of the time.
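A minimal sketch of this masking rule in plain Python (the 15% selection rate follows Devlin et al. 2019; the toy token list and vocabulary are made up for illustration):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, seed=None):
    """Apply BERT's masking rule: select roughly 15% of positions, then replace the
    selected token with [MASK] 80% of the time, with a random token 10% of the time,
    and keep it unchanged 10% of the time. Returns the masked sequence and the
    positions (with original tokens) the model must predict."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue
        targets[i] = token                       # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN               # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)        # 10%: replace with a random token
        # else 10%: leave the token unchanged
    return masked, targets

tokens = ["my", "dog", "is", "cute", ".", "he", "likes", "play", "##ing", "."]
print(mask_tokens(tokens, vocab=tokens, seed=3))
```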

BERT uses the fusion of both left-to-right and right-to-left context to realize a deep bidirectional model instead of an individual left-to-right or right-to-left model. It also pre-trains on a binarized next sentence prediction task.

Figure 2.2: BERT input representations. (Taken from Devlin et al. 2019)

Figure 2.2 shows how BERT constructs its input representation. BERT's model architecture is a multi-layer bidirectional transformer encoder that uses bidirectional self-attention. BERT combines three different types of embeddings to construct its input representation: token embeddings, segment embeddings, and position embeddings. We will now see, step by step, how each of these embeddings is created to form an input representation.

From Figure 2.2, we have two sentences:
Sentence A: My dog is cute.
Sentence B: He likes playing.

As per the BERT input representation, the first step is to tokenize the input sentence and add the special tokens [CLS] and [SEP] before passing it to the token embeddings. The [CLS] token is the first token of every input sequence and, for classification tasks, marks the start of the sequence; the [SEP] token marks the end of the first sentence and the start of the second when two sentences are given.

Here, we see a sample tokenization by the BERT WordPiece tokenizer (Devlin et al. 2019):
Sentence A: [CLS], my, dog, is, cute, [SEP]
Sentence B: [CLS], he, likes, play, ##ing, [SEP]

After tokenization, we see that the word playing is separated into two pieces. This is done by the WordPiece tokenization method to strike a balance between vocabulary size and Out of Vocabulary (OOV) words, allowing BERT to store roughly 30,000 word pieces in its vocabulary. When a word is split into multiple pieces, every piece except the first is prefixed with ##. The tokenization above results in a token length of 6 for both Sentence A and Sentence B.

The token embeddings layer uses another parameter in addition to the token length, the hidden size, which is 768 for the BERTbase model and 1024 for the BERTlarge model. Our sentence of 6 input tokens is thus converted into a matrix of shape (6, 768) or (6, 1024) for BERTbase and BERTlarge, respectively.
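The sub-word split and the resulting matrix shape can be reproduced with the Hugging Face transformers tokenizer and an English BERTbase checkpoint (an illustrative choice, not the model used later in this thesis):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# WordPiece splits 'playing' into 'play' + '##ing'.
print(tokenizer.tokenize("He likes playing"))        # ['he', 'likes', 'play', '##ing']

# Token embeddings: one vector of size hidden_size (768 for BERT-base) per token,
# including the automatically added [CLS] and [SEP] tokens (6 tokens in total).
ids = tokenizer("He likes playing", return_tensors="pt")["input_ids"]
token_embeddings = model.embeddings.word_embeddings(ids)
print(token_embeddings.shape)                         # torch.Size([1, 6, 768])
```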

Next, we proceed to segment embeddings and see how they are created to properly handle sentence pairs, which is crucial for tasks like NSP and question answering. To train for the NSP task, BERT creates sentence pairs by collecting two neighbouring sentences, labelled IsNext, and by collecting two random sentences, labelled NotNext.

Here, we look at the segment embeddings for the input sentences above. First, we concatenate sentences A and B and tokenize them as a single input, which results in:
[CLS], my, dog, is, cute, [SEP], he, likes, play, ##ing, [SEP]

Although the two sentences are now represented together as a sequence pair by means of the [SEP] token, this is still not sufficient. We therefore label the tokens of the first sentence with 0 and those of the second sentence with 1 to uniquely identify the sentences, which results in:
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1

The segment embeddings created above for the sentence pair have a token length of 11. As with the token embeddings, they are converted into a matrix of shape (11, 768) or (11, 1024) for BERTbase and BERTlarge, respectively.

The third and last part of the input representation is the position embeddings, which indicate the position of each token in the sequence. As BERT can handle input sequences up to a length of 512, the position embeddings form a matrix of shape (512, 768) or (512, 1024).

Finally, all three embeddings are combined into a robust input representation that benefits from the WordPiece vocabulary, identifies the boundaries of a sequence, and encodes the position of each token. Capturing the order in which tokens appear helps the model represent the context of the sequence and predict missing tokens.
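A small sketch (again with the English BERT tokenizer from Hugging Face transformers, purely for illustration) showing how the tokenizer produces the token ids and the 0/1 segment ids for the sentence pair above; the position embeddings are added internally by the model based on token order:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encode the sentence pair; the tokenizer inserts [CLS]/[SEP] and builds segment ids.
enc = tokenizer("My dog is cute.", "He likes playing.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'my', 'dog', 'is', 'cute', '.', '[SEP]', 'he', 'likes', 'play', '##ing', '.', '[SEP]']

# Segment ids: 0 for sentence A (including [CLS] and its [SEP]), 1 for sentence B.
print(enc["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```

Note that, unlike the simplified example in the text, the tokenizer also keeps the sentence-final periods, so the sequence is slightly longer here.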

During pre-training, BERT randomly masks input tokens and is trained to predict the original token id of each masked token from its context using the bidirectional language model.

Here, we will see a simple example of token masking and prediction:
Sentence = "My dog is cute. He likes playing."

First, we mask the word cute using only the sentence My dog is cute. and let the model predict the masked token.

Masked Sentence = "My dog is [MASK]."

The result of the prediction here is:

Table 2.1: BERT mask prediction without context.

Token  Score
dead   17.6%
gone   6.9%
here   6.3%
fine   4.5%
mine   2.8%

Without any context, the top prediction in Table 2.1 is dead, which is sad, as the model here has only learnt grief. Now let us make this a better place by giving the model more context and combining both sentences:

Masked Sentence = "My dog is [MASK]. He likes playing."

The result of the prediction here is:

Table 2.2: BERT mask prediction with context.

Token  Score
good   9.6%
cute   6.3%
nice   6.3%
fine   4.6%
great  4.3%

The result in Table 2.2 looks happier and more cheerful. The original token now appears in second place, and the top token is also a fitting description of a dog who is playing, happy, and cheerful.
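The two prediction tables can be reproduced in spirit with the fill-mask pipeline of Hugging Face transformers (a sketch; the exact scores depend on the checkpoint, so the numbers will not match Tables 2.1 and 2.2 exactly):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# Without context: only the first sentence is given.
for pred in fill("My dog is [MASK].", top_k=5):
    print(f"{pred['token_str']:>8s}  {pred['score']:.1%}")

print("---")

# With context: the second sentence steers the prediction towards happier tokens.
for pred in fill("My dog is [MASK]. He likes playing.", top_k=5):
    print(f"{pred['token_str']:>8s}  {pred['score']:.1%}")
```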

BERT is also trained for next sentence prediction by feeding it sentence pairs from the same document and random sentence pairs from different documents. Fine-tuning BERT is fast compared to pre-training, and the MLM objective can likewise be used when adapting the model to a target domain. To fine-tune on any target task, we simply plug the task-specific inputs and outputs into BERT and fine-tune all parameters end-to-end. BERTbase is a model with 12 layers and BERTlarge a model with 24 layers. Pre-training was performed on the BooksCorpus (800M words) and English Wikipedia (2,500M words) and achieved state-of-the-art performance on eleven NLP tasks. For fine-tuning, Devlin et al. (2019) report that 2 to 4 epochs, a learning rate of 2e-5, 3e-5, or 5e-5, and a batch size of 16, 32, or 64 were optimal for achieving those scores.
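As an illustration of these recommended ranges, the sketch below configures the Hugging Face Trainer API with values inside them (the checkpoint name and the commented-out dataset objects are placeholders, not part of the original setup):

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-german-cased"                  # placeholder checkpoint
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Hyperparameters inside the ranges recommended by Devlin et al. (2019).
args = TrainingArguments(
    output_dir="bert-finetuned",
    num_train_epochs=3,                  # 2-4 epochs
    learning_rate=2e-5,                  # 2e-5, 3e-5, or 5e-5
    per_device_train_batch_size=32,      # 16 or 32
)

# train_dataset / eval_dataset are assumed tokenized datasets for the target task.
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()                        # fine-tunes all parameters end-to-end
```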

2.2.1 Pre-training

The pre-training of BERT was performed for 40 epochs over a 3.3-billion-word English corpus built from the English Wikipedia dump and the BooksCorpus (Devlin et al. 2019). BERT has two pre-training objectives, MLM and NSP, which we discuss below.

Masked Language Model

BERT masks a certain number of tokens from the input sequence during training to see how well the underlying language model can predict the missing tokens. The proportion of tokens to be masked is specified by Devlin et al. (2019) and was discussed in Section 2.2. This works as a self-supervised training setup, as discussed in Section 4.1.3, in which the language model has to predict only the masked tokens. The model is then evaluated on how well it learned to use the bidirectional representation to predict the masked tokens with minimal loss.

Next Sentence Prediction

NSP is another pre-training objective of BERT in which two sentences are fed in and the model tries to learn whether they are related. The motive of this training is to improve the understanding of the relationship between sentences. The sentence pairs are chosen in two different ways:

1. sentence B is the next sentence of A.
2. sentence B is randomly chosen and therefore not related to A.

This resembles supervised learning, where the two cases are given the respective labels IsNext and NotNext for training.
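A minimal sketch of how such IsNext/NotNext pairs can be built from a list of documents (plain Python; the tiny document list is invented for illustration):

```python
import random

def build_nsp_pairs(documents, seed=0):
    """Create (sentence_a, sentence_b, label) examples: consecutive sentences from
    the same document are labelled IsNext; otherwise sentence_b is sampled from a
    different document and labelled NotNext."""
    rng = random.Random(seed)
    pairs = []
    for doc in documents:
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                pairs.append((doc[i], doc[i + 1], "IsNext"))
            else:
                other_doc = rng.choice([d for d in documents if d is not doc])
                pairs.append((doc[i], rng.choice(other_doc), "NotNext"))
    return pairs

docs = [["My dog is cute.", "He likes playing."],
        ["Darmkrebs ist eine häufige Krebserkrankung.", "Er endet häufig tödlich."]]
for a, b, label in build_nsp_pairs(docs):
    print(label, "|", a, "->", b)
```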

2.2.2 Fine-tuning

Fine-tuning is an effective way to apply transfer learning, as discussed in Section 2.1, to neural networks by copying and training the first n layers. There are two distinct strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning, as mentioned in Devlin et al. (2019). In the feature-based approach, only the pre-trained embeddings are used while the remaining layers are trained, whereas in fine-tuning with frozen layers we can freeze all layers and train only the top layer, or more layers depending on the application. The reason for freezing the lower layers and training only the top layers is that the lower layers contain syntactic and semantic representations that remain useful for the target task, which saves time and resources.
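A sketch of this frozen-layer idea for BERT (PyTorch and Hugging Face transformers; the checkpoint, the label count, and the choice to keep only the top encoder layer trainable are illustrative assumptions):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-german-cased", num_labels=233)          # assumed label count

# Freeze the embeddings and all but the top encoder layer; only the last encoder
# layer and the classification head remain trainable.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:-1]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```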

2.3 NLP research in NTS

The way work is done in the medical domain has been changing due to developments in NLP and Artificial Intelligence (AI). NLP can provide better code suggestions by taking the patient history as text and learning representations from it. The methods keep improving, as a number of studies in the biomedical domain show. In this section, we discuss previous approaches to classifying NTS-ICD10 documents. NTS-ICD10 is a dataset containing descriptions of clinical experiments conducted on animals, which are linked to their respective ICD10 codes. We look at the NTS-ICD10 dataset and the ICD10 codes in more detail in chapter 3.

2.3.1 Classification of NTS Documents

Amin et al. (2019) and Sänger et al. (2019) published work on the classification of medical documents for the German NTS-ICD10 dataset using traditional ML techniques such as TF-IDF baselines with a linear Support Vector Machine (SVM), a Convolutional Neural Network (CNN), and a Codes Attentive LSTM (CLSTM), and compared them with BERT.

Amin et al. (2019) applied various architectures involving pre-processing, translation, and pre-trained word embeddings in a multi-label classification setting and demonstrated the effectiveness of transfer learning with the pre-trained language representation model BERT and its recent variant BioBERT. They also translated the task documents from German to English and compared the performance on the two datasets. The classification task was to predict the ICD-10 codes of the NTS documents. The authors report an F1-micro score of 73.02% for ICD-10 code classification using BioBERT.

Another comparative study of NTS-ICD10 code classifiers was carried out by Sänger et al. (2019). This study also approached the task as a multi-label classification problem and leverages the multi-lingual version of the BERT text encoder to represent the summaries. In addition to the NTS-ICD10 dataset, they made use of extra training data from the German Clinical Trials Register (GCTR) and ensembled several model instances to improve overall performance. They compare their model with five baseline systems, including a dictionary matching approach and single-label SVM and BERT classification models; their model achieved an F1 score of 80% in the final evaluation.

Amin et al. (2019) also report an F1-micro score on the development set of 76.68% with multi-lingual BERT on the German NTS-ICD10 dataset, and their best score of 82.90% with BioBERT on the English dataset they obtained by translating the German NTS-ICD10 data. Sänger et al. (2019) report a development-set F1 score of 74.7% with their BERT multi-label average model on the NTS dataset and 81% on the combined NTS and GCTR dataset.

Amin et al. (2019) report that the translated (English) texts improved the score by an average of 4.07%, which could be due to the fact that far more English text is available than text in other languages. BERT performed better than the other models in both German and English, with scores on average 6 percentage points higher, whereas Sänger et al. (2019) report that using a BERT model increased performance by 10.4%.

2.4 Domain Specific BERT Models

After the development of BERT, several approaches tested its domain adaptation capabilities by fine-tuning the language model with domain-specific datasets, which led to medical-domain pre-trained BERT language models such as BioBERT, SciBERT, and ClinicalBERT. In this section we look at the approaches used by these models and how they achieved state-of-the-art performance on various NLP tasks such as Named Entity Recognition (NER), Relation Extraction (RE), Question Answering (QA), Text Classification (CLS), and Natural Language Inference (NLI) using domain-specific datasets.

2.4.1 BioBERT

BioBERT (Lee et al. 2019) is the first domain-specific BERT-based model optimized for the biomedical field. BioBERT uses 4.5B words from PubMed abstracts and 13.5B words from PMC full-text articles from the biomedical domain, combined with 2.5B words from English Wikipedia and 0.8B words from the BooksCorpus from the general domain. Pre-training this model was reported to take 23 days on eight NVIDIA V100 GPUs.

The general-domain corpora are included because BioBERT is initialized with BERTbase, which means the language model is built on top of the weights learned from English Wikipedia and the BooksCorpus. The authors found that when using both the PubMed and PMC corpora, 200K and 270K pre-training steps were optimal for PubMed and PMC respectively, and this model was released as BioBERT v1.0. Later, they pre-trained using PubMed only for 1M steps and released BioBERT v1.1, improving on the state-of-the-art performance of BioBERT v1.0.

BioBERT was evaluated on 14 task-specific datasets from the biomedical domain, specifically for NER, RE, and QA. NER alone comprises nine datasets belonging to four groups, namely disease, drug/chem, gene/protein, and species, while RE contains three datasets from two relation groups, gene-disease and protein-chemical (Lee et al. 2019). The remaining task, QA, used three BioASQ factoid datasets provided by the BioASQ organizers. The model achieves state-of-the-art results on six of the nine NER datasets, and BioBERT v1.1 outperformed the previous state-of-the-art models by 0.62 in terms of micro-averaged F1 score. BioBERT also achieves a state-of-the-art result on one of the three RE datasets and new state-of-the-art performance on all three QA datasets. Figure 2.3 shows the pre-training datasets and architecture of BioBERT and its subsequent fine-tuning on task-specific datasets to achieve the state-of-the-art results mentioned above.

Figure 2.3: Overview of the pre-training and fine-tuning of BioBERT. (Taken from Lee et al. 2019)

Lee et al. (2019) relied on the WordPiece tokenizer of BERT to handle OOV words by splitting them into sub-words and did not create their own vocabulary, so that BioBERT and BERT remain compatible and existing models can be used interchangeably. For fine-tuning, the authors used a batch size of 10, 16, 32, or 64 and a learning rate of 5e-5, 3e-5, or 1e-5, which are also the values recommended for BERT.

2.4.2 SciBERT

SciBERT (Beltagy, Lo, and Cohan 2019) is a pre-trained language model based on BERT that addresses the lack of high-quality, large-scale labeled scientific data by leveraging unsupervised pre-training on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks.

SciBERT constructs its own vocabulary, SCIVOCAB, a new WordPiece vocabulary built on its scientific corpus using the SentencePiece1 library. SciBERT was pre-trained on 1.14M papers from Semantic Scholar, of which 18% are from the computer science domain and 82% from the broad biomedical domain. The corpus comprises 3.17B tokens, and the model was trained on a single TPU v3 with 8 cores. Training on this corpus took one week: for the first 5 days the model was trained with a maximum sequence length of 128 tokens, and for the last 2 days with a maximum sequence length of 512 tokens.

SciBERT was evaluated on 12 task-specific datasets covering NER, PICO Extraction (PICO), CLS, Relation Extraction (REL), and Dependency Parsing (DEP), belonging to the biomedical, computer science, and multi-domain settings. The model outperforms BERTbase on biomedical tasks. It achieves state-of-the-art results on 8 of the 12 tasks and outperforms BioBERT on two tasks, one NER and one REL. Beltagy, Lo, and Cohan (2019) also report that using SCIVOCAB instead of the BERTbase vocabulary improves the F1 score by 0.60 on average across all datasets, and that the optimal setting for fine-tuning was 2 or 4 epochs with a learning rate of 2e-5.

2.4.3 ClinicalBERT

ClinicalBERT (Alsentzer et al. 2019) is the first publicly released BERT language model pre-trained on the clinical domain. BioBERT covers medical text but focuses mostly on biomedical literature, so ClinicalBERT, the specialized clinical BERT model, addresses the previously unmet need of handling clinical narratives (e.g., physician notes), which have known differences in linguistic characteristics from both general text and non-clinical biomedical text.

Alsentzer et al. (2019) use data from the approximately 2 million notes in the MIMIC-III v1.4 database and train two varieties of BERT on MIMIC notes: Clinical BERT, which sees text from all note types, and Discharge Summary BERT, which uses only discharge summaries. In addition, they train two differently initialized BERT models on clinical text: Clinical BERT, initialized from BERTbase, and Clinical BioBERT, initialized from BioBERT.

Training BERT on MIMIC notes took roughly 17-18 days on a single GeForce GTX TITAN X 12 GB GPU, excluding the time required for pre-processing or downloading the MIMIC dataset. ClinicalBERT and Clinical BioBERT were evaluated on the MedNLI NLI task and four i2b2 NER tasks.

Out of the five tasks, Alsentzer et al. (2019) report that on three of them, MedNLI, i2b2 2010, and i2b2 2012, clinically fine-tuned BioBERT shows improvements over BioBERT and general BERT, and ClinicalBERT achieves a new state-of-the-art on MedNLI with a score of 82.7% compared to 73.5%. The authors also note that domain-specific contextual embeddings are useful for clinical NLP tasks.

1 SentencePiece: https://github.com/google/sentencepiece

Chapter 3

Dataset Description

For every ML or NLP project, the main requirement for implementing an algorithm is a dataset that allows the underlying model to function well. As mentioned in chapter 1, we are interested in fine-tuning an existing LM and evaluating its performance, which requires two datasets: one to fine-tune the LM and one to evaluate the model. In this chapter we discuss the datasets used in this thesis, including their origin and type. We also look at some insights into the evaluation dataset, such as the code frequency per document and the label distribution, in Section 3.3. To better understand the labels shown in Section 3.3, Section 3.1 gives a brief description of the ICD10 chapters and groups. In Section 3.2, we take a closer look at one document from the dataset, and finally, in Section 3.4, we give a brief overview of the dataset used for fine-tuning.

The evaluation dataset used in this thesis consists of NTS of animal experiments carried out in Germany, indexed with ICD-10 chapters and groups of the German Modification 2016 Version (GM). The dataset contains information about experiments performed on animals with certain diseases, some of them artificially induced, with the aim of finding better approaches to identifying treatments. An NTS is a clear and detailed document that describes an animal experiment and its findings in a way that is easy to read and understand, requiring very little pre-processing for the subsequent evaluation steps. This evaluation dataset is available in the Open Agrar Repository (OAR)1 and originates from the AnimalTestInfo database2 published by the Federal Institute for Risk Assessment (BFR)3.

Figure 3.1 shows the distribution of the evaluation dataset. It contains a total of 8,386 training and development documents and 407 test documents, all in German. Of the total, 7,544 documents are used as the training set and the remaining 842 documents as the development set. Only 6,508 of these documents are labelled with their respective ICD-10 chapters and groups, while the remaining 1,878 documents are unlabelled; of the labelled documents, 5,854 belong to the training set and 654 to the development set. The dataset contains 233 unique labels corresponding to ICD-10 chapters and groups.

1 Open Agrar Repository: https://www.openagrar.de/receive/openagrar_mods_00046540?lang=en
2 AnimalTestInfo database: https://www.animaltestinfo.de/
3 BFR: https://www.bfr.bund.de/en/home.html

Figure 3.1: Number of training, development and test documents in the evaluation dataset (NTS-ICD-10).

3.1 ICD-10

In this section, we discuss the meaning of ICD-10, the representation of Chapters and groups, and the sub-groups inside each main group, which lead down to the main diagnosis codes and thereby form a hierarchy. We also look at what each ICD-10 Chapter and group actually represents in the classification of diseases.

According to Geneva (2019), ICD-10 is a classification of diseases that can be defined as a system of categories to which morbid entities are assigned according to established criteria. The purpose of ICD-10 is to permit systematic recording and analysis and to translate diagnoses of diseases and other health problems from words into an alphanumeric code, which permits easy storage, retrieval and analysis of the data (Geneva 2019). It can be used to classify diseases and other health problems recorded on many types of health and vital records, which is its main purpose. To classify diseases with alphanumeric codes, ICD-10 divides the whole classification into 22 Chapters numbered with Roman numerals, each of which is further divided into groups denoted by ranges of three-character codes (e.g. A00-B99).

Table 3.1 shows the Chapters and their associated groups in ICD-10 as proposed by WHO (2019). Each Chapter and group is associated with a specific group of diseases, which is described in the title column of the table.

Table 3.1: ICD-10 Chapters present in ICD-10 tree (Based on WHO 2019)

Chapter   Groups    Title
I         A00-B99   Bestimmte infektiöse und parasitäre Krankheiten
II        C00-D48   Neubildungen
III       D50-D90   Krankheiten des Blutes und der blutbildenden Organe sowie bestimmte Störungen mit Beteiligung des Immunsystems
IV        E00-E90   Endokrine, Ernährungs- und Stoffwechselkrankheiten
V         F00-F99   Psychische und Verhaltensstörungen
VI        G00-G99   Krankheiten des Nervensystems
VII       H00-H59   Krankheiten des Auges und der Augenanhangsgebilde
VIII      H60-H95   Krankheiten des Ohres und des Warzenfortsatzes
IX        I00-I99   Krankheiten des Kreislaufsystems
X         J00-J99   Krankheiten des Atmungssystems
XI        K00-K93   Krankheiten des Verdauungssystems
XII       L00-L99   Krankheiten der Haut und der Unterhaut
XIII      M00-M99   Krankheiten des Muskel-Skelett-Systems und des Bindegewebes
XIV       N00-N99   Krankheiten des Urogenitalsystems
XV        O00-O99   Schwangerschaft, Geburt und Wochenbett
XVI       P00-P96   Bestimmte Zustände, die ihren Ursprung in der Perinatalperiode haben
XVII      Q00-Q99   Angeborene Fehlbildungen, Deformitäten und Chromosomenanomalien
XVIII     R00-R99   Symptome und abnorme klinische und Laborbefunde, die anderenorts nicht klassifiziert sind
XIX       S00-T98   Verletzungen, Vergiftungen und bestimmte andere Folgen äußerer Ursachen
XX        V01-Y84   Äußere Ursachen von Morbidität und Mortalität
XXI       Z00-Z99   Faktoren, die den Gesundheitszustand beeinflussen und zur Inanspruchnahme des Gesundheitswesens führen
XXII      U00-U99   Schlüsselnummern für besondere Zwecke

Chapter I, for example, covers the code range A00-B99 and represents diseases that are infectious and parasitic; Chapter VI covers the range G00-G99 and contains all diseases related to the nervous system, and so on. We will now take a deeper look at how the groups are organized until we reach the main diagnosis codes, which can also be called the leaf nodes of the ICD-10 tree.4

4 ICD-10 tree https://www.dimdi.de/static/de/klassifikationen/icd/icd-10-gm/kode-suche/htmlgm2019/index.htm

The groups seen in Table 3.1 (A00-B99, C00-D48, G00-G99, M00-M99, ...) are the top-level groups of the ICD-10 tree, which we can also refer to as parents or ancestors. Each parent has several children, which form several layers. In A00-B99, for instance, there is only a single level of sub-groups, i.e. children that are siblings of each other. A layered architecture can be seen in a label hierarchy such as C00-C97 → C00-C75 → (C00-C14, C15-C26, ...), which means that C00-C97 is the parent of C00-C75, which in turn is the parent of C00-C14 and its siblings; this makes a 3-layered hierarchy. The group M00-M99, by contrast, forms only a 2-layered hierarchy in the ICD-10 tree. These layers are not the final level of the hierarchy but only cover the group-level codes, which further lead to the three- and four-character main diagnosis codes.

Having introduced the hierarchy, it is useful to see what the groups in the hierarchy actually look like, which makes it easier to understand the representation of the main diagnosis codes.

Table 3.2: Chapter I (A00-B99) sub-groups in ICD-10 tree (Based on WHO 2019)

Groups    Title
A00-A09   Infektiöse Darmkrankheiten
A15-A19   Tuberkulose
A20-A28   Bestimmte bakterielle Zoonosen
A30-A49   Sonstige bakterielle Krankheiten
...       ...
B95-B98   Bakterien, Viren und sonstige Infektionserreger als Ursache von Krankheiten, die in anderen Kapiteln klassifiziert sind
B99-B99   Sonstige Infektionskrankheiten

Table 3.2 shows the sub-groups associated with Chapter I, which has only one level of hierarchy excluding the leaf nodes mentioned above. Every sub-group contains a set of three- and four-character codes, which are also called leaf nodes or main diagnosis codes. For example, the group A00-A09 contains the three-character code A00.-, which covers different types of diseases related to cholera. This is still not a main diagnosis code, which can be recognized by the dagger (-) symbol at the end of the code (WHO 2019). The code A00.- further contains the main diagnosis codes shown in Table 3.3:

Table 3.3: Main diagnosis codes in A00.- dagger section in ICD-10 tree (Based on WHO 2019)

Code    Title
A00.0   Cholera durch Vibrio cholerae O:1, Biovar cholerae
A00.1   Cholera durch Vibrio cholerae O:1, Biovar eltor
A00.9   Cholera, nicht näher bezeichnet

Each Chapter of the ICD-10 tree represents a group of diseases. Chapter I covers diseases that are generally recognized as contagious or communicable. These types of diseases are infectious and parasitic and include, for example, tuberculosis (A15-A19) and viral infections of the Central Nervous System (CNS) (A80-A89). Similarly, Chapter II covers neoplasms, which can be of four different types: malignant (C00-C97), a tumor that can become cancerous and threatening if untreated; benign (D10-D36), a tumor that is non-cancerous and less threatening than a malignant one; in-situ (D00-D09), a tumor at an early, localized stage that can develop into a malignant one; and neoplasms of unknown behavior (D37-D48), tumors whose behavior cannot (yet) be classified as either malignant or benign, as can be seen in Figure 3.2.

As we saw for Chapter I, its groups form a hierarchy of 3 levels. Next, we look at the groups of Chapter II, which form a hierarchy of 5 levels down to the main diagnosis codes. Figure 3.2 shows the grouping of the ICD-10 groups belonging to Chapter II. The process of diagnosing a patient with a Nasenhöhle (nasal cavity) disease can thus be traced as C00-C97 → C00-C75 → C30-C39 → C30.- → C30.0, where the first three are group levels that narrow down to the dagger-level code, which then leads to the final classification of the disease. Chapter II has the largest number of group levels, whereas the Chapters with up to four levels of hierarchy are Chapters XIII and XIX. So, if a patient is to be diagnosed with Flachrücken (flat back), the path would be M00-M99 → M40-M54 → M40-M43 → M40.3-, and one of the digits (0-9) is then used as the fifth character to create the final classification code, as mentioned in WHO (2019). All these group levels and codes belong to the ICD-10 tree, which has a maximum hierarchy depth of 5 because it contains the codes down to the main diagnosis code or final classification. In the evaluation NTS dataset, however, the codes do not go down to the last level but only to the group just above the dagger code, resulting in a maximum hierarchy depth of 3, e.g. C00-C97 → C00-C75 → C30-C39 for a patient with a Nasenhöhle disease.

Chapter II
  C00-C97 ... Bösartige Neubildungen
    C00-C75 ... Bösartige Neubildungen an genau bezeichneten Lokalisationen, als primär festgestellt oder vermutet, ausgenommen lymphatisches, blutbildendes und verwandtes Gewebe
      C00-C14 ... Bösartige Neubildungen der Lippe, der Mundhöhle und des Pharynx
      C15-C26 ... Bösartige Neubildungen der Verdauungsorgane
      C30-C39 ... Bösartige Neubildungen der Atmungsorgane und sonstiger intrathorakaler Organe
        C30.- ... Bösartige Neubildung der Nasenhöhle und des Mittelohres
          C30.0 ... Nasenhöhle
          C30.1 ... Mittelohr
        C31.- ... Bösartige Neubildung der Nasennebenhöhlen
        C39.- ... Bösartige Neubildung sonstiger und ungenau bezeichneter Lokalisationen des Atmungssystems und sonstiger intrathorakaler Organe
      ...
      C50-C50 ... Bösartige Neubildungen der Brustdrüse [Mamma]
      C51-C58 ... Bösartige Neubildungen der weiblichen Genitalorgane
      ...
      C73-C75 ... Bösartige Neubildungen der Schilddrüse und sonstiger endokriner Drüsen
    C76-C80 ... Bösartige Neubildungen ungenau bezeichneter, sekundärer und nicht näher bezeichneter Lokalisationen
    C81-C96 ... Bösartige Neubildungen des lymphatischen, blutbildenden und verwandten Gewebes, als primär festgestellt oder vermutet
    C97-C97 ... Bösartige Neubildungen als Primärtumoren an mehreren Lokalisationen
  D00-D09 ... In-situ-Neubildungen
  D10-D36 ... Gutartige Neubildungen
  D37-D48 ... Neubildungen unsicheren oder unbekannten Verhaltens

Figure 3.2: Chapter II (C00-D48) sub-groups present in ICD-10 tree (Based on WHO 2019)

3.2 Non-Technical Summaries description

An NTS is a document containing a brief description of the work or experiment being conducted along with the necessary findings, if any. It is a concise document that describes the underlying assessment process and its findings in a manner that is both appealing to read and easily understood by the general public (Murphy 2012). Since our evaluation dataset consists of NTS of animal experiments, each document provides clear and detailed information about the purpose and benefits of the experiment and the careful handling of the animals involved. Based on the outcome of the experiments, each NTS is assigned one or more ICD-10 codes, while 1,878 NTS documents have no ICD-10 codes assigned, as seen in Figure 3.1.

With the proper handling of the animals involved in the experiments in mind, Russell and Burch (1959) introduced the principle of the Three Rs (Replacement, Reduction, Refinement (3R)). The development of the 3R principle may be most effective in these experimental areas, as a large number of experimental animals would benefit from it (Bert et al. 2017). Each NTS document in our evaluation dataset therefore describes the purpose of the experiment, the benefits to be gained, the stress or injury that might occur, and the measures taken with respect to the 3R principle.

Specifically, each NTS document of our evaluation dataset contains the following 6 fields:

1. Title of the document
2. Uses (Goals) of the experiment
3. Possible harms caused to the animals
4. Comments about replacement (in the scope of the 3R principle)
5. Comments about reduction (in the scope of the 3R principle)
6. Comments about refinement (in the scope of the 3R principle)

Figure 3.3 summarizes the length statistics of the different fields in the NTS documents. Of the 6 fields, the goals section is the longest and on average contains the most words. This implies that this field also contains the main details about the experiment, which are highly valuable to the model. Below we look at the content of a sample NTS document, where we can see that the specific terms valuable for a classification model appear in the title and goals, whereas the remaining fields are more descriptive of the conduct of the experiment.

Figure 3.3: Statistical values (average length and average number of words) of the 6 field contents (Title, Goals, Harms, Reduction, Refinement, Replacement) in the NTS training documents.

We will get a better overview of these 6 fields by looking inside one of the NTS documents from our evaluation dataset.

Field 1: Title of the document
Therapieansätze für metastasierenden Krebs

Field 1 of the NTS document gives the title of the experiment, which here concerns therapeutic approaches for metastatic cancer. This already tells us that the experiment is related to a study about cancer and its prevention or treatment.

Field 2: Uses (Goals) of the experiment
Im Rahmen dieses Tierversuchsvorhabens soll mit Hilfe von Mäusen das Tumorwachstum und Metastasierungsverhalten unterschiedlicher Brust- und Eierstockkrebsarten untersucht werden. Darüber hinaus ist mit Hilfe neuer therapeutischer Testansätze, die in der Zellkultur für diese Krebsarten erfolgversprechende Resultate geliefert haben, ein in vivo Ansatz als weiterführender Forschungsschritt notwendig, um eine tumor-therapeutische Wirksamkeit auch im lebenden Organismus zu überprüfen. Die Therapeutika sollen den Tieren oral verabreicht werden, um die Belastung der Mäuse so gering wie möglich zu halten. Erkenntnisse aus diesem Projekt leisten einen großen Beitrag zum Verständnis von Tumorwachstum und der Tumorstreuung (Metastasierung) im Körper, sowie zur Reaktion der Tumorzellen auf die hier verwendeten Therapieansätze. Alle hier vorgesehenen Versuche werden auf der Grundlage guter wissenschaftlicher Praxis durchgeführt.

Field 2 of the NTS document describes the goals of the experiment. The text explains that the tumor growth and metastasis behavior of different breast and ovarian cancer types will be investigated with the help of mice. It further states that therapeutic approaches which showed promising results in cell culture for these cancers are now to be tested in vivo.

Field 3: Possible harms caused to the animals
Während dieses Tierversuchsvorhabens werden den Mäusen einmalig Tumorzellen oder Kontrollzellen unter die Fellhaut injiziert. Sobald bei den Tieren Schmerzen oder Leiden erkennbar sind, werden sie tierschutzgerecht getötet.

Field 3 of the NTS document describes the harms that could be caused to the animals during the experiment. As soon as pain or suffering is observed, the animals are killed in a way that is as painless as possible and appropriate to their welfare.

Field 4: Comments about replacement (in the scope of the 3R principle)
Die Limitation für weiterführende Erkenntnisse in diesem Forschungsvorhaben ist die Testung tumortherapeutischer Wirkungen in einem gesamtphysiologischen Zusammenhang, weshalb diese in vivo Versuche unerlässlich sind, da eine mögliche klinische Relevanz der Ergebnisse für betroffene Tumorpatienten nicht durch Zellkulturexperimente erreicht werden kann.

After the first 3 fields, each NTS document continues with the measures of the 3R principle. Field 4 of the NTS document discusses possible alternative approaches that do not require animals. Such a replacement could, for example, mean using only cell cultures or microorganisms instead of whole animals (Russell and Burch 1959).

Field 5: Comments about reduction (in the scope of the 3R principle)
Auf der Basis standardisierter Versuchsprotokolle und der Ergebnisse aus den vorangegangenen in vitro Testungen wurde dieses Tierversuchsvorhaben auf ein Minimum reduziert, indem zu testende Therapeutika, die schon in vitro keine übermässige therapeutische Wirksamkeit erzielt haben, bereits für einen Tierversuch ausgeschlossen wurden. Die dadurch erzielte minimal notwendige Anzahl von Tieren ist erforderlich, um individuelle Schwankungen innerhalb der Versuchsgruppen zu minimieren.

Field 5 explains how the correct number of animals is determined by choosing the right strategies in the planning and performance of the whole experiment (Russell and Burch 1959). This field stresses reducing the use of animals while keeping their number sufficient to obtain a statistically significant outcome of the experiment.

Field 6: Comments about refinement (in the scope of the 3R principle)
Aus vorangegangenen Studien ist der hier verwendete Mausstamm optimal geeignet für das Versuchsvorhaben. Die initiale Überwachung der Tiere während des Versuchsablaufs soll durch Erhöhung der Überwachungsfrequenz nach Tumorentstehung und entsprechendem Therapiebeginn unerwartete Tierbelastungen oder Unwohlsein frühzeitig erkennen lassen. Mit mehrjährig erfahrenem Personal und anhand der stringenten Abbruchkriterien und engmaschigen Kontrollen werden Tierbelastungen so gering wie möglich gehalten.

Field 6 covers the refinement procedure, which complements both the replacement and reduction measures. It is about giving proper care to the tested animals, for example by providing better living conditions or reducing pain during the experiments, since suffering might also cause instability in the experiments. Refinement can thereby improve the data quality and in turn support the reduction measure.

As mentioned earlier, each labelled NTS document in the evaluation dataset is assigned its respective ICD-10 codes based on the outcome of the experiment; the NTS document we just discussed is assigned the ICD-10 codes II, C00-C97, C00-C75, C51-C58 and C50-C50. Note the highlighted words above, such as Krebs, Brustkrebs, Eierstockkrebsarten, Tumorzellen and Krebsarten, which play an important role in attracting the attention of the underlying language model. Since there are no three- or four-character codes (main diagnosis codes) in the ICD-10 assignment for this document, let us take the lowest-level groups, C50-C50 and C51-C58. Referring to Figure 3.2, we find that C50-C50 represents breast cancer (Brustkrebs) and C51-C58 represents cancers of the female genital organs, which here covers the ovarian cancer types (Eierstockkrebsarten). The remaining codes II, C00-C97 and C00-C75 follow automatically from the other codes, as they are their ancestors in the ICD-10 tree for Chapter II, as represented in Figure 3.2.
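This automatic assignment of ancestor codes can be pictured as a simple lookup in the ICD-10 tree. The following Python sketch is purely illustrative: the hand-built parent map covers only a small part of Chapter II and is an assumption for demonstration, not part of the NTS-ICD-10 tooling.

# Illustrative sketch: expand ICD-10 group codes with all of their ancestors,
# using a hand-built (assumed) parent map for part of Chapter II.
PARENT = {
    "C50-C50": "C00-C75",
    "C51-C58": "C00-C75",
    "C00-C75": "C00-C97",
    "C00-C97": "II",
    "II": None,          # Chapter level, no parent
}

def expand_with_ancestors(codes):
    """Return the given codes together with every ancestor in the ICD-10 tree."""
    expanded = set()
    for code in codes:
        while code is not None:
            expanded.add(code)
            code = PARENT.get(code)   # climb one level up, stop at the Chapter
    return sorted(expanded)

print(expand_with_ancestors(["C50-C50", "C51-C58"]))
# -> ['C00-C75', 'C00-C97', 'C50-C50', 'C51-C58', 'II']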

The document described above is one example of an NTS document in our evaluation dataset, and every document contains all six fields. The goal of such experiments is to test different treatment approaches and to arrive at better and more effective solutions while using fewer animals and causing them as little pain as possible.

3.3 Code Statistics

In this section, we go deeper into our evaluation dataset of NTS documents and analyze the ICD-10 codes assigned to each document, for example the most frequent codes, which also tell us about the most frequently occurring diseases. The dataset shows a variety of code assignments: documents are assigned either only Chapter codes, only group codes, or both.

As mentioned in Section 3.2, each NTS document in the evaluation dataset is assigned one or more ICD-10-GM codes, which makes this a multi-label classification problem. The top level of the ICD tree is the Chapter, which is followed by its respective groups in a hierarchical structure as seen in Figure 3.2. The labelled NTS documents shown in Figure 3.1 are assigned either a single code containing only an ICD-10 Chapter or a set of codes containing both Chapter and group codes.

Figure 3.4: Types of ICD-10 code assignments in the NTS training documents (Chapter and Group: 99.21%, Only Chapter: 0.79%, Only Group: 0%).

In Figure 3.4, we can see that about 99% of the NTS documents are assigned both Chapter and group ICD-10 codes together, while no document is labelled with only a group ICD-10 code. Figure 3.5 shows the ICD-10 codes associated with Chapter II and supports the statement that most of the NTS documents have Chapter and group codes occurring together. We chose the Chapter II ICD-10 codes for this demonstration because the codes from Chapter II are the most frequent ones in our evaluation dataset, as seen in Figure 3.6.

Figure 3.5: Conditional occurrence of ICD-10 codes in Chapter II and its groups in the NTS training dataset.

Figure 3.5 shows the conditional probability P(y|x) of one ICD-10 group occurring given that another group is present, for the Chapter II codes: y is the ICD-10 group whose occurrence is considered, x is the ICD-10 group that is already present, and the combination (y, x) is used to browse the probability table. The x and y in the figure correspond to the x-axis and y-axis, respectively.

Keeping that in mind, if we look at the occurrence of the groups on the y-axis given Chapter II on the x-axis, we always see a value of 1. This association tells us that whenever a child group appears in an NTS document, its Chapter is present as well. Beyond Chapter and group, the figure also reflects the hierarchical structure of the ICD-10 tree in general. For example, if we consider the first row, C00-C14, in Figure 3.5 and its parents from Figure 3.2 (C00-C75, C00-C97 and II), we see that whenever a child group is present as an ICD-10 code in a document, it is accompanied by its respective parent ICD-10 codes. This is verified by the cell value of 1 for the combinations (C00-C14, C00-C75), (C00-C14, C00-C97) and (C00-C14, II) in Figure 3.5.
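A matrix like the one in Figure 3.5 can be computed with a few lines of pandas. The sketch below is a minimal illustration under the assumption that the training labels are available as one set of ICD-10 codes per document; the variable names and the tiny example label sets are invented for demonstration.

import pandas as pd

# Assumed input: one set of ICD-10 codes per labelled NTS training document.
doc_labels = [
    {"II", "C00-C97", "C00-C75", "C00-C14"},
    {"II", "C00-C97", "C76-C80"},
    {"IX", "I30-I52"},
]

codes = sorted(set().union(*doc_labels))
# Binary indicator matrix: rows = documents, columns = ICD-10 codes.
ind = pd.DataFrame([[c in labels for c in codes] for labels in doc_labels],
                   columns=codes, dtype=float)

# P(y | x): probability that code y occurs given that code x occurs.
cooc = ind.T.dot(ind)                            # raw co-occurrence counts
p_y_given_x = cooc.div(ind.sum(axis=0), axis=1)  # divide each column x by count(x)

print(p_y_given_x.loc["II", "C00-C14"])  # 1.0: Chapter II always accompanies C00-C14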

Figure 3.6: Top-10 most frequent ICD-10 codes (Chapters and groups) in the NTS training label set (in %).

Figure 3.6 shows the top-10 most frequent codes in the NTS-ICD training dataset. The most frequent code is the Chapter II code, followed by the group C00-C97, which tells us that neoplasms were the dominant diagnosis; this is also supported by Figure 3.8.

As mentioned earlier, the NTS documents are associated with their respective ICD-10 codes depending on the result of the experiment, so the number of codes per NTS document varies. Figure 3.7 shows the distribution of codes per document in the training set: 64.6% of the NTS documents have exactly 2 codes, and around 0.75% have only 1 code, which is then a Chapter code from the ICD-10 tree as seen in Figure 3.4. The proportions of documents with only Chapter codes differ between Figures 3.4 and 3.7 because the latter counts documents having exactly one Chapter code, while the former counts all documents whose codes consist of Chapter codes only, which tells us that some NTS documents carry more than one Chapter code. Figure 3.7 also shows that the number of codes per document ranges from 1 to 12; no document is labelled with exactly 11 codes, while the maximum of 12 codes is assigned to only one document (C00-C97, J09-J18, L20-L30, I, II, X, T89-T89, XII, C76-C80, XIX, A30-A49, B50-B64).

Figure 3.7: Percentage of NTS training documents having a certain number of ICD-10 codes (1 to 12) assigned.

Figure 3.8 shows the distribution of Chapter code occurrences in the NTS training documents. Chapter II has the highest number of occurrences, which is consistent with C00-C97 being the dominant group, as seen in Figure 3.9. This figure also tells us that the dataset mostly contains experiments related to neoplasms or cancer. Neoplasms (Chapter II), diseases of the circulatory system (Chapter IX), and diseases of the nervous system (Chapter VI) are, as seen in Figure 3.8, the most frequently occurring disease groups in the available experiments.

Figure 3.8: ICD-10 Chapter occurrence count in the NTS training label set (in %).

For example, malignant neoplasms at specified locations, identified or suspected as primary, excluding lymphoid, hematopoietic and related tissue (C00-C75), are most frequently assigned to NTS in the field of neoplasms, followed by malignant neoplasms of inaccurately identified, secondary and unspecified locations (C76-C80). Another form of cancer, malignant neoplasms of the digestive organs (C15-C26), also ranks high in the NTS.

As we can see in Figure 3.4, most of the documents in the NTS dataset are assigned a Chapter together with its respective group. The most frequent Chapter is II, which tells us that neoplasms, the disease group identified by Chapter II, dominate the dataset.

Likewise, Figure 3.9 shows the distribution of group code occurrences among the NTS training documents. As noted for Figure 3.8, Chapter II is the most frequent Chapter, and group C00-C97 is the most frequent group, followed by C00-C75, C76-C80 and C15-C26, which again indicates that the dataset mostly contains experiments about cancer or neoplasms. The second most frequent group is I30-I52, described as other forms of heart disease, which belongs to Chapter IX, diseases of the circulatory system. This tells us that neoplasms and circulatory system diseases are the most common, high-risk disease groups in the dataset.

Figure 3.9: ICD-10 group occurrence count in the NTS training label set (in %).

3.4 Dataset for Fine-tuning

As mentioned at the beginning of chapter 3, we use two datasets: the NTS-ICD-10 evaluation dataset to evaluate the model's performance after fine-tuning, and a dataset collected from different sources on the web for the fine-tuning itself. For this purpose we collected various medical articles related to diseases and symptoms that contain medical terms not present in, and not learned by, the German BERT model.

To enable the existing model to handle medical terminology, we collected medical data by crawling different medical forums and websites offering medical articles in German and used them for fine-tuning. We collected the medical articles from the websites listed in Figure 3.10.

Figure 3.10: Number and average length of articles scraped from medical websites (krank, onmeda, vitanet, netdoktor, doktorweigl, sprechzimmer, internisten-im-netz, apotheken-umschau) to create the fine-tuning dataset.

Some of the articles were much longer than others and contained more medical terms, as seen in Figure 3.10. Articles from netdoktor, doktorweigl, krank and apotheken-umschau contained more sections, such as Medikamente, Symptome and Diagnose, and therefore more words containing medical terms compared to other sources like onmeda, which is a forum for discussing medical problems.

The collected data needed some processing, and we arranged it into two separate datasets:

• A sentence dataset built from all the combined articles.
• A segment-wise dataset built by combining sentences without exceeding the maximum sequence length of BERT.

As a result, the sentence dataset contained 67.5 MB of medical sentences, while the segment dataset contained 127 MB. An example from the sentence dataset can be seen below:

Die Ursache für Maltes Hautausschlag könnten Antibiotika sein. Doch reagiert die Haut so auf die Wirkstoffe, die eigentlich helfen sollten? Um dies zu verstehen, schauen wir uns kurz den Aufbau und die Funktion der Haut an. Diese sind mit Fasern und weiteren Zellen untereinander verbunden. Ebenfalls in der Epidermis liegen Nerven, welche essentiell für die Wahrnehmung von Sinneseindrücken sind (sog.

And, similarly, an example from the segment-wise dataset can be seen below:

Die Ursache für Maltes Hautausschlag könnten Antibiotika sein. Doch reagiert die Haut so auf die Wirkstoffe, die eigentlich helfen sollten? Um dies zu verstehen, schauen wir uns kurz den Aufbau und die Funktion der Haut an. Diese sind mit Fasern und weiteren Zellen untereinander verbunden. Ebenfalls in der Epidermis liegen Nerven, welche essentiell für die Wahrnehmung von Sinneseindrücken sind (sog.

The purpose of separating the fine-tuning dataset in this way is to determine which type of sample allows the BERT language model to better learn representations for medical terms. The segment-wise dataset essentially concatenates sentences until the maximum sequence length supported by BERT is reached, as sketched below.
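A minimal sketch of how such a segment-wise file could be built is shown below. It assumes the sentences are already available one per line, uses a simple whitespace word count as a stand-in for the BERT WordPiece tokenizer (so the 510-token budget, 512 minus [CLS] and [SEP], is only approximated), and the file names are hypothetical.

# Sketch: greedily merge sentences into segments that stay below BERT's
# maximum sequence length (assumption: whitespace tokens approximate WordPieces).
MAX_TOKENS = 510  # 512 minus the [CLS] and [SEP] special tokens

def build_segments(sentences, max_tokens=MAX_TOKENS):
    segments, current, current_len = [], [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())
        if current and current_len + n_tokens > max_tokens:
            segments.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        segments.append(" ".join(current))
    return segments

with open("medical_sentences.txt", encoding="utf-8") as f:      # hypothetical file name
    sentences = [line.strip() for line in f if line.strip()]

with open("medical_segments.txt", "w", encoding="utf-8") as f:  # hypothetical file name
    f.write("\n".join(build_segments(sentences)))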

Chapter 4

Theoretical Foundations

This chapter provides the essential background knowledge for the remaining chapters of this thesis. We first introduce basic ML concepts and then focus on neural networks, the specific type of ML model used in this work. We then introduce the BERT model along with the transfer learning techniques needed to understand the underlying approach. Finally, we present the building blocks such as Transformers and attention mechanisms that are included within BERT and make it efficient at understanding natural language text.

4.1 Machine Learning

ML is a technique for automatically teaching machines to learn from sample data. It can be considered a subset of AI. It makes use of various statistical models and learning algorithms to build a trained model from sample data known as training data. Once correctly trained, the model can perform tasks like classification, prediction, regression or other inferential tasks, whose outputs can then be used for target applications such as spam detection, sentiment analysis, document classification and many more. Learning can further be divided into different approaches: supervised learning, unsupervised learning, self-supervised learning, semi-supervised learning and reinforcement learning. We only cover the first three approaches in the sections below, as they are of specific interest for the current research.

4.1.1 Supervised Learning

Supervised learning is the most common technique for classification problems; the model is trained with labeled data, i.e. a dataset consisting of both features and labels (Nasteski 2017). Each such machine learning workflow is divided into two steps, training and testing: the training process learns from the samples in the training data using a learning algorithm and builds a trained model, and in the testing process the trained model is used to make inferences for the test samples, where the model is only fed the features and its output is later compared with the actual labels. A common example of a supervised learning problem is sentiment analysis, where the model is trained with a dataset containing blocks of text and their respective sentiment (positive, negative, neutral) and then tested by feeding it a new block of text it has not seen before.

Figure 4.1: Supervised Learning Model.

Figure 4.1 shows the architecture of a basic supervised learning process. The main objective of a supervised ML model is to predict the label of a given feature object. If we refer to the feature vectors as X and the labels as y, then the ML model in Figure 4.1 takes the training features X and their corresponding labels y to begin the learning process and build a trained model. This trained model is then given a new set of samples X from the test set to produce labels suitable for the provided features.

As mentioned by Nasteski (2017), supervised learning tasks are divided into two categories: classification and regression. If the labels come from a fixed set of categorical outcomes, as in sentiment analysis, the task is called classification; if the labels are floating-point values, as in predicting the price of a house, the task is called regression. Since supervised learning requires labeled training data, Linear Regression, Logistic Regression and Naive Bayes are typical examples of ML models for supervised learning tasks; a small example is sketched below.
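The following scikit-learn sketch illustrates the train/test workflow described above with a Logistic Regression classifier; the data is synthetic and only stands in for a real labeled dataset.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a labeled dataset: feature vectors X and labels y.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)  # supervised learner
model.fit(X_train, y_train)                # training step: features + labels

y_pred = model.predict(X_test)             # testing step: features only
print("Accuracy:", accuracy_score(y_test, y_pred))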

4.1.2 Unsupervised Learning

In contrast to the supervised learning described in Section 4.1.1, unsupervised learning uses unlabelled data to build models that extract the underlying structure of the data. Because such a model does not rely on labeled data, it requires less preparation effort, as the input does not have to be annotated before the algorithm can learn. This type of learning is therefore relevant for different purposes than supervised learning; the learning algorithms under this approach are mostly aimed at clustering. In NLP, this type of learning is used to learn word representations from a given set of textual documents and thus produce word embeddings, which are later useful when performing classification on target tasks.

4.1.3 Self-supervised Learning

Self-supervised learning refers to models that derive their supervision signal from the unlabelled data itself, for example by predicting parts of the input from the rest. This is most useful when word representations need to be learned to create embeddings that are later used in target tasks such as classification. This approach is used to create language models, which are therefore known as pre-trained language models; models like ULM-FiT and BERT already make use of self-supervised learning. Pre-trained language models are trained on a huge amount of unlabelled text to learn contextual word representations, and the resulting model is later used for target tasks such as sentiment analysis or document classification. This method overcomes the need for a large amount of labeled data, and training from a pre-trained model exploits the contextual patterns it has already learned, making the training for the target task more efficient.

4.2 Loss Functions

A loss function evaluates the cost of mapping an event onto a real number. In the context of classification problems in ML, it measures how well the model links feature vectors to their corresponding labels: it quantifies the difference between the learning model's prediction and the actual labels, known as the loss. The lower the loss, the better the model performs, which helps in selecting the desirable model; a perfect model would have a loss of 0. Loss functions are also known as objective functions or cost functions, with the ultimate goal of minimizing the loss.

The choice of loss function depends on whether the task is classification or regression. For classification tasks, it is often suitable to use the cross-entropy loss, also known as log loss. Below we consider the cross-entropy loss for both binary and multi-class classification. If p is the predicted probability of a class label c, M is the total number of classes, o is the observation, and y is a binary indicator of whether class label c is the correct classification for observation o, then the loss equations for binary and multi-class cross-entropy are:1

1 Loss functions https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html

Binary Cross-Entropy Loss: In binary classification, the number of classes M is 2, so the cross-entropy loss is calculated as:

\mathrm{BinaryCrossEntropyLoss} = -\big( y \log(p) + (1 - y) \log(1 - p) \big)    (4.1)

Multi-class Cross-Entropy Loss: In multi-class classification, M > 2, so we calculate a separate loss for each class label per observation and sum the result as:

\mathrm{MultiClassCrossEntropyLoss} = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})    (4.2)

In classification problems the classes are categorical and are converted to binary values (0, 1), so cross-entropy is the suitable loss. In regression tasks, however, the targets are real (floating-point) values, so suitable loss functions are the Mean Absolute Error or the Mean Squared Error. These loss functions take each predicted value together with its target value and average the differences between the two, which for the squared case can be written as:

Mean Squared Error:

\mathrm{MeanSquaredError} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (4.3)

In Equation 4.3, n is the total number of samples, y_i is the target value, and \hat{y}_i is the predicted value. The loss functions discussed above are key to supervised classification problems; a NumPy sketch of these losses is given below.
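The three loss functions can be written directly in NumPy. This is a generic illustrative sketch (not code from the thesis implementation) that clips probabilities to avoid log(0).

import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    # Equation 4.1: y is the true label (0 or 1), p the predicted probability.
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def multiclass_cross_entropy(Y, P, eps=1e-12):
    # Equation 4.2: Y is one-hot (observations x classes), P the predicted probabilities.
    P = np.clip(P, eps, 1.0)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

def mean_squared_error(y, y_hat):
    # Equation 4.3: average of the squared differences.
    return np.mean((y - y_hat) ** 2)

print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2])))
print(mean_squared_error(np.array([2.0, 3.0]), np.array([2.5, 2.5])))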

4.3 Deep Learning

Deep learning is based on neural networks and is a subset of ML that in recent years has set an exciting new trend. It imitates the way the human brain processes data and creates useful patterns for decision making in applications like language understanding, speech recognition and image recognition. The main strength of deep learning is that it can learn representations from unstructured and unlabelled data, meaning it has a high ability to perform unsupervised learning. Moreover, deep learning can generalize learned patterns beyond data similar to the training data, which is advantageous when dealing with new target tasks (Lazreg, Goodwin, and Granmo 2016).

The word deep refers to the large number of layers, each of a fixed size, used to transform the data. The number of layers in a deep learning model is usually larger than in a normal neural network, hence the name deep learning. Each layer is responsible for constructing features of its own. While deep learning performs tasks efficiently, it also requires large amounts of resources and compute time. Given enough data, these multi-layer neural networks can automatically decompose a problem into several manageable abstractions or layers, each with its own feature extraction.

Compared to traditional ML methods, deep learning reduces the need for manual feature extraction and learns representations better from both labelled and unlabelled data. Although deep learning tends to generalize well, achieving high performance on high-level tasks such as speech recognition, image recognition or language translation still requires heavy training. In the coming sections, we cover the basics of neural networks (4.3.1) and some deep learning architectures like CNN (4.3.3), RNN (4.3.4) and LSTM (4.3.4).

4.3.1 Neural Network

The neural network is the main building block of deep learning: a network comprising a large number of hidden layers is known as a deep neural network, which distinguishes it from a shallow neural network with few layers. These networks can be trained with either supervised or unsupervised learning and hence form the basis of the capability of deep learning. Neural networks are inspired by the working of the human brain; they are designed to sense data and recognize patterns, such as labeling or clustering raw input. The patterns can be text, images or sound, but for a neural network to process them, they must be translated into numerical form. Inspired by the processing and behavior of neural networks, many models were derived by building more complex networks with more nodes, more layers, and different architectures and mechanisms to achieve better performance on target tasks.

Structure

Figure 4.2 shows the typical structure of a neural network, composed of an input layer, one or more hidden layers, and an output layer. The number of hidden layers and the number of neurons (nodes) per layer depend on the design and purpose of the network. The figure shows only one hidden layer with three neurons; the left part of the figure contains the input neurons of the network, and the right part the output neurons. As mentioned earlier, the number of output neurons depends on the application. Each neuron holds a numeric value, which is passed on to the neurons of the next layer for further calculations or to the output layer to produce the final result of the network.

Figure 4.2: Neural network structure with two input neurons (I1, I2), one hidden layer of three neurons (H1-H3), and one output neuron (O1).

Neuron Output

Each neuron seen in Figure 4.2 receives one or more numeric values as inputs, where each input has an associated weight assigned according to its importance relative to the other inputs. A neuron receives input from other neurons or from an external source and computes an output.

Figure 4.3: Neural network output function: inputs x1, x2 with weights w1, w2 and bias b are summed and passed through an activation function f to produce the output y.

In Figure 4.3, the output y is the result of an activation function: the output neuron applies the function shown in the figure to the weighted sum of all its inputs plus the bias, which is calculated as:

y = f\left( \sum_i w_i x_i + b \right)    (4.4)

The function f used in Equation 4.4 is non-linear and is called an activation function; its purpose is to introduce non-linearity into the output of a neuron.
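Equation 4.4 can be written out in a few lines of NumPy; the weights, bias, and the choice of a sigmoid activation here are arbitrary illustrative values, not taken from any model in this thesis.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0])   # inputs x1, x2 (illustrative values)
w = np.array([0.8, 0.2])    # associated weights w1, w2
b = 0.1                     # bias

y = sigmoid(np.dot(w, x) + b)   # Equation 4.4 with f = sigmoid
print(y)                        # approx. 0.574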

Activation Functions

Activation functions, as mentioned in Nwankpa et al. (2018), are functions used in neural networks to transform the weighted sum of inputs and biases. They determine the output of a model, its computational efficiency, and its ability to train and converge over multiple iterations of training. As seen in Figure 4.3, the output neuron combines the values from the input neurons and the bias and passes them through the activation function, and this continues until the output layer gives a prediction.

Activation functions can be either linear or non-linear depending on the function they represent, and the choice of activation function depends entirely on the nature of the considered problem. Typical examples of activation functions used in neural networks are sigmoid, tanh and softmax.

Figure 4.4: Activation functions (Taken from missinglink.ai n.d.).

Figure 4.4 shows typical examples of activation functions. The activation functions used in deep learning are non-linear functions that generally map values into a small range, such as between zero and one, which makes it easy to interpret the output as a probability. This non-linearity is at the core of neural networks and is used to model complex problems, since purely linear activation functions could cause outputs to grow towards infinity.

4.3.2 Error Backpropagation

The backpropagation algorithm has emerged as one of the most efficient learning procedures for multi-layer networks of neuron-like units (Cun 1988). After a neural network is defined with initial weights, a forward pass is performed to produce an initial prediction. This prediction is evaluated with an error function that measures how different it is from the true target. The effectiveness of this method comes from readjusting the weights of the system, which allows the neurons to correct their weights in the next run.

Backpropagation is thus the training algorithm that finds, for large neural networks, weights that produce the smallest error. The algorithm proceeds in four steps: forward pass, error function, backpropagation and weight update. During training, the network first passes the input data forward through the neuron layers, from the input layer through the hidden layer(s) to the output layer; this is the forward pass. At the output layer, the error function computes the difference between the predicted value and the true value, and this error is then propagated back to the neurons of the network, informing them about their distance to the ground-truth label and allowing them to correct their weights.

This iteration runs until the network converges. Various learning parameters, such as the learning rate and the optimization method, help to reduce the loss function. Once this point is reached, the model is ready to make predictions for unknown input data.
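The four steps (forward pass, error function, backpropagation, weight update) can be illustrated for a single sigmoid neuron with a squared-error loss. This is a didactic sketch with made-up numbers, not the training code used later in this thesis.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, t = np.array([0.5, -1.0]), 1.0        # one training example and its target
w, b, lr = np.array([0.8, 0.2]), 0.1, 0.5

for step in range(3):
    # 1) forward pass
    y = sigmoid(np.dot(w, x) + b)
    # 2) error function (squared error)
    loss = 0.5 * (y - t) ** 2
    # 3) backpropagation: chain rule gives dL/dw = (y - t) * y * (1 - y) * x
    delta = (y - t) * y * (1 - y)
    grad_w, grad_b = delta * x, delta
    # 4) weight update (gradient descent)
    w, b = w - lr * grad_w, b - lr * grad_b
    print(f"step {step}: loss = {loss:.4f}")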

4.3.3 Convolutional Neural Network

CNN is a class of deep neural networks most commonly used to analyze images. A CNN is designed to automatically and adaptively learn spatial hierarchies of features through backpropagation by using multiple building blocks, such as convolution layers, pooling layers, and fully connected layers (Yamashita et al. 2018). Since CNNs are mostly suited to analyzing images and vision, image classification, clustering, object detection, and optical character recognition are some of their applications.

Of the three types of layers a CNN is constructed from, the first two, the convolutional layer and the pooling layer, are responsible for feature extraction, whereas the third, a fully connected layer, maps the extracted features to the final output. The convolutional layer sequentially processes parts of the input matrix that represents an image, and the pooling layer compresses the information into a more computationally manageable form. Together, they make the network well suited for image-related tasks by encoding image-specific features into the architecture and by reducing the number of required parameters. A CNN therefore allows a simple network architecture to be set up in combination with proper choices of hyperparameters and optimization methods; a minimal sketch is shown below.
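The convolution → pooling → fully connected pipeline can be sketched in PyTorch (assumed to be available here; this toy architecture and the 28x28 input size are illustrative and not the model used elsewhere in this thesis).

import torch
from torch import nn

# Minimal CNN: one convolution layer for feature extraction, one pooling layer
# for downsampling, and one fully connected layer for classification.
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3),  # 28x28 -> 26x26
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                              # 26x26 -> 13x13
    nn.Flatten(),
    nn.Linear(8 * 13 * 13, 10),                               # 10 output classes
)

images = torch.randn(4, 1, 28, 28)   # batch of 4 single-channel 28x28 images
print(model(images).shape)           # torch.Size([4, 10])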

4.3.4 Recurrent Neural Network

While CNNs are effective for processing visual information, they are not equally suited to other, sequential forms of data; in particular, they are not the best approach for processing textual corpora. RNN is another class of deep neural networks that processes variable-length sequences of inputs, i.e. sequential data such as text, which makes it more relevant to NLP, since a textual corpus is made of sequences of words and therefore of sentences.

Jurafsky and Martin (2019) describe an RNN as any network that contains a cycle within its connections, where the value of a unit depends directly or indirectly on earlier outputs used as input. As sequential data is commonly divided by time, an RNN accepts inputs that correspond to the data at a given time step. It also contains a feedback loop in which every time step's output is fed back into the network; this gives access to a record of the previous state, which affects the output of future steps.

Figure 4.5: Simple recurrent neural network with input x_t, hidden state h_t and output y_t (Taken from Jurafsky and Martin 2019).

Figure 4.5 shows the structure of a simple RNN: an input vector representing the current input element x_t is multiplied by a weight matrix and passed through an activation function to compute the activation values of a layer of hidden units, which are later used to compute an output y_t. The recurrent link in the figure feeds the hidden state back into the computation at the hidden layer, and this is repeated for as many time steps as necessary to process the whole sequence. This resembles the standard feedforward calculation, with the difference that a recurrence connects the hidden layer of the previous time step to the current hidden layer, so that past context can be used when computing the output for the current input step.
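The recurrence just described amounts to h_t = g(U h_{t-1} + W x_t) and y_t = V h_t. The NumPy sketch below, with randomly initialized weight matrices and arbitrary dimensions, makes this loop explicit; it is an illustration of the mechanism, not a trainable model.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, T = 4, 3, 2, 5          # arbitrary dimensions and sequence length
W = rng.normal(size=(d_hidden, d_in))          # input-to-hidden weights
U = rng.normal(size=(d_hidden, d_hidden))      # hidden-to-hidden (recurrent) weights
V = rng.normal(size=(d_out, d_hidden))         # hidden-to-output weights

x = rng.normal(size=(T, d_in))                 # a sequence of T input vectors
h = np.zeros(d_hidden)                         # initial hidden state

for t in range(T):
    h = np.tanh(U @ h + W @ x[t])              # reuse of the previous hidden state
    y = V @ h                                  # output at time step t
    print(t, y)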

Long Short-Term Memory

LSTM is a type of RNN that divides the context management problem into two sub-problems: removing information that is no longer needed from the context, and adding information likely to be needed for later decision making, instead of carrying the whole sequence (Jurafsky and Martin 2019). LSTMs accomplish this by adding a context layer to the architecture and by introducing gates that control the flow of information. Each gate in the network consists of a feedforward layer and a sigmoid activation function, which makes the gate's output similar to a binary mask.

An LSTM makes use of three types of gates: the forget gate, the add gate, and the output gate (Jurafsky and Martin 2019). The forget gate, as the name implies, deletes information from the context that is no longer needed; it computes a weighted sum of the previous state's hidden layer and the current input and passes it through the activation function. The add gate generates a mask from the result of the activation function to select the information to add to the current context, and the output gate then decides which information is required for the hidden state of the current time step.
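For reference, the standard LSTM gate equations corresponding to this description are given below. This is the common textbook formulation; the symbols (and the naming of the add/input gate) may differ slightly from Jurafsky and Martin (2019).

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)            (forget gate)
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)            (add/input gate)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)            (output gate)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)     (candidate context)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t      (new context / cell state)
h_t = o_t \odot \tanh(c_t)                           (hidden state)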

4.4 Natural Language Processing

Text mining is the extraction of relevant knowledge and patterns as meaningful information from text data; it is concerned with the detection of patterns in natural language texts (Popowich 2005). When accessing textual information, applications can also benefit from a more detailed linguistic analysis of the text, as opposed to a shallower, word-based analysis. Since all texts are based on some language (English, German, Italian, ...), there is a need for NLP, which is now an important component of text mining.

NLP is a technique for extracting information from text data using various statistical techniques, and it involves the processing of words. In doing so, it makes use of words and their syntactic or semantic meanings, as well as the representation of entities, and it tries to establish the contextual representation of a word within a sentence or across sentences. NLP has been an active research topic producing new results, which has also led to the development of various language processing models.

Among all those models, the recently introduced pre-trained language model BERT has proven to perform best on different learning tasks such as question answering, natural language inference, and classification. The advantage of BERT is that it can easily capture the context of the content because of its bidirectional language model. A BERT model is also easy to fine-tune for a wide range of target tasks, which makes it a good fit for NLP applications.

In this thesis, we use procedures and algorithms such as Bag of Words (BoW) and TF-IDF, which predate BERT, for text mining, and we also introduce language modeling, which is the basis of our main implementation model.

4.4.1 Word Embeddings

Word embeddings are numerical representations of words, usually in the shape of a vector (Mandelbaum and Shalev 2016). Words in a sentence or document can occur multiple times. A simple way to capture this is to represent each word with a binary value (0 or 1), known as one-hot encoding, indicating the presence of the word in the sentence; as a whole, this only tells us whether the word appears in the document or not. Word embeddings, in contrast, are unsupervised learned word representation vectors whose relative similarities correlate with semantic similarity. Mikolov et al. (2013) mention that distributed representations of words in a vector space help learning algorithms achieve better performance in NLP tasks by grouping similar words, and they introduced the Skip-gram model for learning high-quality vector representations of words from large amounts of unstructured text data.

Let us take two example sentences and see how their words are represented as discussed above:
Sentence A: King is to queen as man is to woman.
Sentence B: He bowed to the queen.

A simple approach to represent these words in a quantifiable way is to encode the features with a vector in which every word's position indicates its presence in a binary way, known as a one-hot-encoded vector. The vocabulary resulting from the two sentences is the collection of all their unique words: [King, is, to, queen, as, man, woman, he, bowed, the]. We now apply one-hot encoding to see which words from the vocabulary appear in each sentence, which results in:

Table 4.1: Word one-hot encoded table.

            King  is  to  queen  as  man  woman  he  bowed  the
Sentence A    1    1   1    1     1    1     1    0     0     0
Sentence B    0    0   1    1     0    0     0    1     1     1

Looking at Table 4.1, we only see whether a word occurs in each sentence; we cannot derive relations such as the one between the words king, man, queen and woman. This type of representation is also known as the BoW representation, which is discussed in Section 4.4.2 and is used to create word-count vectors containing 0 or 1 depending on the presence of the word in the vocabulary. With a huge amount of training data this leads to a sparse matrix, as most of the elements in each vector are 0. Hence, this representation carries no meaningful relations and does not consider the order of words, meaning it does not capture semantic relationships.

To achieve that, we need word-vector representations, also known as word embeddings, which keep words with similar meanings close to each other. Once the words are encoded, a vector space model (VSM) is built on top of them. Such an implementation was introduced by Mikolov et al. (2013) as word2vec, which provides an efficient method for learning high-quality vector representations of words from large amounts of unstructured text data. It allows us to use vector arithmetic to work with word analogies like king - man + woman = queen, as word relationships can exist as linear substructures in the embedding space. Mikolov et al. (2013) also rely on the distributional hypothesis, which states that words are characterized by the words they co-occur with, i.e. by the context of the sentence. This implies, for example, that the word "king" is more likely to be seen around the word "queen", and the word "man" is more likely to be seen around the word "woman", as presented in Figure 4.6.

Figure 4.6: Word Embedding. (Mikolov et al. 2013)

This word analogy is further supported by Chen, Peterson, and Griffiths (2017), who introduce the parallelogram model of analogy, which completes the word analogy king : queen :: man : ? by adding the difference vector between king and queen to man, as represented in Figure 4.7.

Figure 4.7: Parallelogram model of word analogy. (Chen, Peterson, and Griffiths 2017)

Word2Vec is a distributed word vector model that encodes words into embeddings using one of two language modeling methods: the continuous bag-of-words (CBOW) model or the skip-gram model. CBOW works by predicting the probability of a target word within a window of given size, while skip-gram tries to predict the context words given the target word. While such word embeddings have several limitations, such as the required corpus size and the order and independence of words, the major limitation of this vector approach is its inability to represent polysemy, i.e. a word having several meanings. This is due to the association of a single representation per word: Word2Vec cannot accurately learn the semantic or syntactic difference between "a river bank" and "a bank holiday". The word bank is used differently in these two cases, but Word2Vec builds a single representation for it.
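Such embeddings can be trained with the gensim implementation of Word2Vec. The corpus below is a toy stand-in, and with so little data the analogy will not actually come out as "queen"; the sketch only illustrates the API and the sg switch between CBOW and skip-gram.

from gensim.models import Word2Vec

# Toy corpus: in practice Word2Vec needs millions of tokens to learn useful vectors.
sentences = [
    ["king", "is", "to", "queen", "as", "man", "is", "to", "woman"],
    ["he", "bowed", "to", "the", "queen"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # sg=1: skip-gram
vector = model.wv["queen"]                     # 50-dimensional embedding of "queen"
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))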

4.4.2 Bag of Words (BoW)

The bag-of-words model is one of the most popular representation methods for object categorization (Zhang, Jin, and Z.-H. Zhou 2010). BoW is directly applicable to textual data and is important for performing text classification. K and Joseph (2014) mention that the BoW model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text is represented as an unordered collection of its words, disregarding grammar and even word order.

The BoW model is only concerned with whether known words occur in the document, regardless of their position. It is the most common feature extraction process, where the model represents text documents as vectors of identifiers. The vector values are the frequencies of the words in the document. BoW thus ignores the context and meaning and discards the order of the words in the document.
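The following minimal sketch shows such a word-count representation using scikit-learn's CountVectorizer on the two toy sentences from Section 4.4.1; it is an illustration only, not the exact preprocessing used for the NTS dataset.

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "King is to queen as man is to woman",
    "He bowed to the queen",
]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)                 # sparse document-term matrix

print(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))   # learned vocabulary
print(bow_matrix.toarray())                                      # word counts per document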

Figure 4.8: BoW representation example (NTS dataset).

Figure 4.8 shows an example of the BoW representation of documents from the NTS-ICD-10 dataset. Here we chose only 5 documents from the NTS-ICD-10 dataset to illustrate the concept; the documents can be found in Appendix 1. Each column represents a word from the dataset, and each row represents a document from the NTS-ICD-10 dataset. As described earlier, each cell value is a word frequency, which makes each row a feature vector for its document.

4.4.3 Term Frequency - Inverse Document Frequency (TF-IDF)

Term Frequency (TF) is a measure that counts the occurrences of a term/word in a document, whereas Inverse Document Frequency (IDF) measures the level of information a word provides, i.e. whether its occurrence is common or rare. The IDF value is logarithmically scaled so that rarely occurring terms are prioritized and highly frequent words do not receive full priority.

TF-IDF calculates a value for each word in a document through an inverse proportion of the frequency of the word in that particular document to the percentage of documents the word appears in (Ramos 2003). Words with high TF-IDF values imply a strong relationship with the document they appear in, suggesting that if such a word were to appear in a query, the document could be of interest. TF-IDF is a numerical statistic that shows the relevance of keywords to specific documents; in other words, it provides the keywords by which specific documents can be identified or categorized (Ramos 2003).

Thus, TF-IDF increases the importance of rare words and decreases the importance of words that are frequent across the whole dataset. TF-IDF can be expressed using the following formula:

w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)    (4.5)

Equation 4.5 shows the calculation of TF-IDF using the TF and the Document Frequency (DF), where tf_{i,j} is the frequency of term i within document j and the IDF is the logarithm of the total number of documents N divided by df_i, the number of documents containing term i. From Equation 4.5, we can also see that TF-IDF is simply the product of TF and IDF. BoW plays a strong role by providing the baseline representation for TF-IDF: no words are thrown away, and every word of the document, which might be important for the representation or prediction, is considered.
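As a small illustration of Equation 4.5, the following sketch computes the raw TF-IDF weight of a term in a toy corpus; the corpus and tokenization are hypothetical, and practical implementations (e.g. scikit-learn's TfidfVectorizer) use slightly smoothed variants of the formula.

import math

# Toy corpus of tokenized documents (hypothetical example data).
corpus = [
    ["king", "is", "to", "queen", "as", "man", "is", "to", "woman"],
    ["he", "bowed", "to", "the", "queen"],
]

def tf_idf(term, document, corpus):
    tf = document.count(term)                               # tf_{i,j}: raw count in this document
    df = sum(1 for doc in corpus if term in doc)            # df_i: documents containing the term
    return tf * math.log(len(corpus) / df) if df else 0.0   # Equation 4.5

print(tf_idf("king", corpus[0], corpus))    # rare term -> non-zero weight
print(tf_idf("queen", corpus[0], corpus))   # occurs in every document -> weight 0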

Figure 4.9: TF-IDF representation example (NTS dataset).

From Figures 4.8 and 4.9, we can compare the BoW and TF-IDF representations. If the TF of a word is very high in a document and the word also appears frequently in other documents, its importance is scaled down by the IDF. Similarly, if the TF of a word is very high in a document but it does not appear in other documents, the term receives a greater IDF. Such a word becomes an important feature that helps to distinguish that particular document from the others.

More specifically, in Figure 4.8 we take the words konkret, which occurs once and only in one document, kontrolliert, which occurs twice and only in one document, and können, which occurs in 4 documents with counts 2, 1, 3, and 1. Looking at the TF-IDF values in Figure 4.9, the word konkret has a score of 0.048, which is high compared to the können scores (0.027 and 0.024) in documents where können also occurs only once. The word kontrolliert occurs twice in a single document and in no other, so its TF-IDF score of 0.097 is also high compared to the können score (0.082) in the document where können occurs three times. As the figure shows, the word können is given less priority than the others because it is present in almost all documents.

This strongly supports the statement that if the TF of a word (here können, Figure 4.8) is very high in a document and the word also appears frequently in other documents, its importance is scaled down by TF-IDF, as seen in Figure 4.9. In contrast, the word kälberdurchfall, which has a high frequency but appears in only one document, receives a greater TF-IDF value that increases its importance, as seen in Figure 4.9.

4.4.4 Language Modeling

The LM, being a core component of NLP, provides word representations and probability estimates for word sequences (Jing and Xu 2019). A LM thus captures the characteristics of a language: how words relate to each other, and how likely a word is to occur given a preceding sequence of words. Larger models also take whole sentences or paragraphs into account to build context and use it to predict further occurrences of words or sentences.

Here we will go through one of the basic types of LM known as N-gram models. There are also neural network language models, but we will only look at N-gram models, an approximation method that was the most widely used state-of-the-art approach before neural network language models (Jing and Xu 2019). The N in N-gram represents a number: if N = 1, it is known as a unigram model, if N = 2 a bigram model, and so on. For N = 1, each word is taken as a unit and its probability is calculated by counting its occurrences in the corpus and dividing by the total number of words. The same principle applies to all N-gram models, which condition each word on the previous N-1 words.

N-gram LMs have drawbacks: they assign a probability of 0 to n-grams that do not occur in the training corpus, and they cannot handle modeling on large textual corpora well (Jing and Xu 2019). This curse-of-dimensionality problem led to the development of neural network language models, where deep neural networks such as RNNs automatically learn features and representations.
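The following minimal sketch illustrates a bigram (N = 2) model estimated from raw counts on a tiny hypothetical corpus, including the zero-probability problem mentioned above.

from collections import Counter

# Tiny hypothetical corpus for a bigram (N = 2) language model.
tokens = "the patient has a fever the patient has a cough".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_probability(previous_word, word):
    # P(word | previous_word) = count(previous_word, word) / count(previous_word)
    return bigram_counts[(previous_word, word)] / unigram_counts[previous_word]

print(bigram_probability("patient", "has"))   # 1.0 in this tiny corpus
print(bigram_probability("the", "cough"))     # 0.0 -> unseen bigram gets zero probability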

4.4.5 NLP Tasks

NLP has been used effectively for different tasks such as part-of-speech (POS) tagging, text classification, extracting meaning, dialog systems, named entity recognition, and many more. Complex NLP tasks are commonly broken down into multiple sub-tasks to achieve the desired goal. These NLP tasks are nowadays also known as downstream tasks, a term that refers to the target tasks.

One popular group of tasks is text categorization, which covers sub-tasks such as sentiment analysis, document classification, spam detection, and other classification tasks. These make use of pre-trained models that already have a trained vocabulary, and only the target task is trained afterwards, which makes the process faster. Another popular task is named entity recognition, which falls under word sequence tasks and is closely related to language modeling, as derived tasks include predicting the previous and next words. We also have word embeddings, as mentioned in Section 4.4.1, which are useful for tasks requiring text meaning such as question answering, search, and others.

NLP tasks are not limited to those mentioned above; there is a wide range of tasks that process natural language and run inference on it to extract meaningful information. They are applicable in many domains, such as science, medicine, law, and politics.

4.5 BERT

BERT is a language model based on attention and transformers. There are two main BERT models: BERT_base and BERT_large. The BERT_base architecture has 12 bi-directional layers, a hidden dimension of 768, and 110M parameters, while BERT_large has 24 bi-directional layers, a hidden dimension of 1024, and 340M parameters; both rely heavily on the bi-directional self-attention mechanism, as mentioned in Devlin et al. (2019). Another important characteristic that distinguishes BERT from ELMo is its bi-directionality: ELMo is based on traditional left-to-right and right-to-left trained LMs that are concatenated afterwards.

In this section, we will discuss the concepts of attention and transformers that are essential for understanding BERT's architecture. The general overview of BERT with its input structure, details on pre-training methods, and the fine-tuning of the model are discussed in Section 2.2.

4.5.1 Attention

Neural machine translation is a recently proposed approach that uses an encoder-decoder model to perform machine translation. It aims at building a single neural network that can be jointly tuned to maximize translation performance and at solving sequence-to-sequence problems (Sutskever, Vinyals, and Le 2014). Despite their flexibility and power, Sutskever, Vinyals, and Le (2014) mention that such models can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality. The fixed size of this vector makes the method ineffective for longer sequences, since it cannot retain all the information and tends to forget the initial inputs. This makes it difficult for the neural network to deal with long sentences and hence deteriorates the performance of the encoder-decoder model. To overcome this problem, the attention mechanism was introduced, which adds a layer between the encoder and decoder to capture global information from the input sequence.

Bahdanau, Cho, and Bengio (2016) propose an extension of the encoder-decoder model that learns to align and translate jointly. It does not attempt to encode a whole input sentence into a single fixed-length vector; instead, it encodes the input sentence into a sequence of vectors and adaptively chooses a subset of these vectors while decoding the translation. For an English-to-French translation task, the proposed approach achieves, with a single model, a translation performance comparable or close to the conventional phrase-based system. For this task, the authors use a bidirectional RNN as an encoder, a decoder that emulates searching through a source sentence, and an alignment function that assigns a score to each pair of words; the function used in this case is a non-linear tanh activation.

The task above describes a common scenario of how attention can improve sequence-to-sequence tasks such as machine translation. However, the attention mechanism is not limited to the alignment function discussed above; there are various types of attention mechanisms, such as self-attention, scaled dot-product attention, and multi-head self-attention, which we will discuss in the next sections and which are relevant to the transformer architecture.

Self-Attention

Self-attention, also known as intra-attention, is an attention mechanism that relates different positions of a single sequence to compute a representation of that sequence (Vaswani et al. 2017). This mechanism has been used successfully in a variety of tasks such as reading comprehension, summarization, textual entailment, and more. To build representations, each word of the input sequence is paired with the previous words that have the highest attention score. This mechanism is useful for handling disambiguation and has a strong impact on NLP tasks such as machine translation and summarization.

4.5.2 Transformers

The transformer is a neural network architecture proposed by Vaswani et al. (2017) that implements the encoder-decoder model. In addition, it relies on multi-head self-attention mechanisms in each encoder and decoder block. As with the translation example in Section 4.5.1, this mechanism is highly performant for solving NLP problems such as language translation and syntactic parsing.

Figure 4.10 shows the transformer architecture proposed by Vaswani et al. (2017): the left side is the encoder model and the right side is the decoder model. The inputs are input embeddings for the encoder and output embeddings for the decoder. N represents the number of blocks in each model; in this configuration, both encoder and decoder are composed of 6 blocks. In this section, we will discuss the encoder model, the decoder model, positional encoding, and multi-head self-attention respectively.

Figure 4.10: Transformer Architecture (Taken from Vaswani et al. 2017).

Encoder

The encoder model, as mentioned in Section 4.5.2, consists of two main blocks (together with the sub-blocks inside the encoder), as seen in Figure 4.11.

Of the two main blocks of the encoder model, the first is the multi-head attention block and the second is the feed-forward network. Each of these sub-layers is wrapped in a residual connection followed by layer normalization, which adds the input of the sub-layer to its output and normalizes the result. The computations are efficient because the multiplications can be parallelized, which is one reason behind the rise of transformers.

Figure 4.11: Encoder Model (Taken from Vaswani et al. 2017).

Decoder

The decoder model in Figure 4.12 is very similar to the encoder, but has one additional sub-layer, the masked multi-head attention, which attends over the previous decoder states. The decoder masks its inputs from future time steps: to make the decoder consider only the words occurring before the target word during translation, future tokens are masked when decoding a certain word. This ensures that the prediction is based only on the words before the current position.

Figure 4.12: Decoder Model (Taken from Vaswani et al. 2017).

Positional Encoding

The translation task is efficient only if the tokens are placed in the correct position during or after prediction. Therefore, both the encoder and decoder model in Figure 4.10 have positional encodings added to the input embeddings at the bottom of the stack, because a multi-head attention network cannot naturally make use of the positions of words in the sequence. Positional encodings explicitly encode the position of the inputs as vectors and have the same dimension as the input embeddings, which makes it possible to add the two. Vaswani et al. (2017) mention two ways of calculating positional encodings, learned and fixed; the authors chose the fixed (sinusoidal) version, since the learned embeddings performed nearly identically.
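A minimal sketch of the fixed (sinusoidal) variant is shown below, assuming NumPy; the sequence length and model dimension are illustrative values.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # Fixed positional encodings as proposed by Vaswani et al. (2017):
    # sine for even dimensions, cosine for odd dimensions.
    positions = np.arange(max_len)[:, np.newaxis]             # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                  # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((max_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])
    encoding[:, 1::2] = np.cos(angles[:, 1::2])
    return encoding

# Same dimension as the input embeddings, so the two can simply be added.
print(sinusoidal_positional_encoding(max_len=512, d_model=768).shape)   # (512, 768)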

Multi-Head Self-Attention

As seen in Section 4.5.2, the encoder and decoder layers pass the input and output embeddings into the attention sub-layer known as multi-head self-attention. Attention in the transformer can thus be interpreted as a way of computing the relevance of a set of values based on some keys and queries, allowing the model to focus on relevant information.

Before going deeper into multi-head attention, we first look at scaled dot-product attention, which is its building block. Scaled dot-product attention computes the attention for a set of queries simultaneously, packed together into a matrix Q, while the keys and values are packed into matrices K and V (Vaswani et al. 2017). The attention function is written as follows:

Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right) V    (4.6)

As seen in Equation 4.6, Q, K, and V are the query, key, and value matrices respectively, and d_k is the dimension of the queries and keys. The softmax function is used as a normalization that transforms each row of the resulting score matrix into a probability distribution. The scaling factor \frac{1}{\sqrt{d_k}} is used because for large values of d_k the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients; scaling counteracts this effect (Vaswani et al. 2017).
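The following minimal sketch implements Equation 4.6 with NumPy for a single attention head; the input matrices are random illustrative data.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Equation 4.6: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                             # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                          # weighted sum of the values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)              # (3, 4)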

The multi-head attention mechanism applies scaled dot-product attention multiple times in parallel; the results are concatenated and linearly transformed into the expected dimension. This allows the model to jointly attend to information from different representation subspaces at different positions (Vaswani et al. 2017).

Figure 4.13 illustrates the processes described above, scaled dot-product attention (left) and multi-head attention (right). Going back to Figure 4.10, we can see that the output of the attention sub-layer is normalized and passed to the feed-forward network in both the encoder and decoder layers. The output of the encoder block is fed to the attention layer in the decoder, where this layer is preceded by a masked multi-head attention that processes the output embeddings to ensure that the predictions are based only on known outputs for previous positions.

Figure 4.13: (Left) Scaled Dot-Product Attention, (Right) Multi-Head Attention. (Taken from Vaswani et al. 2017)

4.6 Domain Adaptation

The definition of a domain differs in each field of ML; domain adaptation is a class of transfer learning where the scenarios could be:

1. Adaptation between different corpora.
2. Adaptation from a general dataset to a specific dataset.
3. Adaptation between subtopics in the same corpus.
4. Cross-lingual adaptation.

This thesis focuses on adapting a model trained on general language to medical language; the domain adaptation thus corresponds to the second category listed above, an adaptation from a general dataset to a specific dataset. Textual data from a specific domain is likely to display a particular vocabulary belonging to the given field, but it may also attribute different semantics to words that are common in general-domain language. Using the vocabulary already present in the model trained on the general dataset, meaningful word embeddings can still be produced for texts from the specific dataset. While this is one way of handling the vocabulary, another method is to build a separate vocabulary for the specific dataset, which would be more effective because there would be no out-of-vocabulary (OOV) words. Here, however, we adapt to the vocabulary of the general dataset and observe the performance.

4.7 Feature Extraction

BERT provides bi-directionally trained word embeddings, also called contextualized word embeddings, which can be extracted and applied to custom target-task NLP models in a similar way to other word embeddings. This is the feature-based approach discussed in Section 2.2.2 and is similar to that of ELMo. The lower layers of the model tend to capture the syntax and surface features of the words, while towards the top layers the model has learned richer representations such as attention patterns and context. The original BERT paper (Devlin et al. 2019) mentions that, for the feature-based approach, the best results were obtained by concatenating the top four hidden layers of the model.
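A minimal sketch of this feature-based extraction with the Hugging Face transformers library is shown below, assuming a recent library version and the public bert-base-german-cased checkpoint; the input sentence is only an example.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModel.from_pretrained("bert-base-german-cased", output_hidden_states=True)
model.eval()

inputs = tokenizer("ER-Analysen bei neurodegenerativen Erkrankungen", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states holds the embedding layer plus all 12 encoder layers, each of shape (1, seq_len, 768).
hidden_states = outputs.hidden_states
# Feature-based approach: concatenate the top four hidden layers per token.
token_features = torch.cat(hidden_states[-4:], dim=-1)    # (1, seq_len, 4 * 768)
print(token_features.shape)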

4.8 Multi-label Classification

Classification is the problem of assigning one or more labels to a document. There are three different types of classification problems. If a document is assigned exactly one of two labels, it is a binary classification problem. In multi-class classification, there are more than two classes and a document is assigned exactly one of them. In a multi-label classification problem, each document can be assigned one or more labels.

Multi-label classification is the task of assigning an object simultaneously to one or multiple classes (Ghamrawi and McCallum 2005). A common way to perform multi-label classification is to train one binary classifier per class. However, Read et al. (2009) point out that this binary relevance method does not capture label interdependencies, and they introduce a novel chaining method that can model label correlations while maintaining acceptable computational complexity. This chaining method passes label information between classifiers, allowing the classifier chain to take label correlations into account and thus overcome the label independence problem of the binary method.

The documents from the NTS-ICD-10 dataset used in this thesis have one or more labels assigned to them. Therefore, we treat the task as a multi-label text classification problem. For example, one document can be assigned multiple diagnoses such as (Tumorwachstum, Eierstockkrebs, Brustkrebs).

4.9 Evaluation

As discussed by Yang (1998), micro and macro performance are inspired by performance evaluation in multi-class text classification, where classes are often not distributed equally. In our case, the classes are the ground truth labels that are already available in the NTS-ICD-10 dataset. Using these evaluation measures allows us to compare performance for documents that are more or less strongly linked to the frequent classes.

The micro average, or per-document average, aggregates the contributions of all classes and then averages them to compute the metric. The macro average, in contrast, operates at the class level (here the per-label level), giving equal weight to each ground truth label regardless of its frequency; it computes the metric independently for each class and then takes the average. The aim of using these performance measures is to avoid misinterpreting results in the presence of class imbalance, where some dominant labels are always predicted correctly while labels with few examples are not.

According to Asch (2013), micro averaging gives equal weight to each per-document classification, whereas macro averaging gives equal weight to each class. With respect to precision and recall, micro-averaged results are more strongly influenced by the large classes, while macro-averaged results reflect performance on small classes, which we will see in detail in the example in Section 4.9.5. Both micro and macro average performance measures use the indicators Precision (P), Recall (R), and F1-score (F1). These indicators are computed from True Positive (TP), False Positive (FP), and False Negative (FN) assignments, which are best represented in a confusion matrix and later used to calculate the P, R, and F1 scores.

For the computation, as mentioned in Pilz (2015), we replace classes with ground truth labels and define micro and macro performance as follows. Let l*(d) be the ground truth label for a document d and l+(d) be the predicted label. If the prediction of a model is correct, i.e. l+(d) = l*(d), we have a TP:

tp(l^*(d)) = \begin{cases} 1, & \text{if } l^+(d) = l^*(d) \\ 0, & \text{else} \end{cases}    (4.7)

If the prediction is not correct, i.e. l^+(d) ≠ l^*(d), we have a FN for l^*(d):

fn(l^*(d)) = \begin{cases} 1, & \text{if } l^+(d) \neq l^*(d) \\ 0, & \text{else} \end{cases}    (4.8)

Similarly, we have a FP for l+(d):

fp(l^+(d)) = \begin{cases} 1, & \text{if } l^+(d) \neq l^*(d) \\ 0, & \text{else} \end{cases}    (4.9)

Now, to compute the micro and macro average performance measures, let us assume a collection of documents D = (d_i)_{i=1}^{|D|} with associated ground truth labels L_D = (l^*(d_i))_{i=1}^{|D|}; the further calculations are shown in Section 4.9.2 and the following sections2. Having defined TP, FP, and FN, we will discuss the confusion matrix in Section 4.9.1 and P, R, and F1 in Sections 4.9.2, 4.9.3, and 4.9.4 respectively. Afterwards, we will work through an example that discusses micro and macro averaging scores in Section 4.9.5.

4.9.1 Confusion Matrix

The confusion matrix is a useful way of showing the precision and recall metrics given the prediction labels of a model. It shows four different outcomes, TP, FP, True Negative (TN), and FN, arranged in a table as a matrix. The actual values are represented as columns and the predicted values as rows.

Table 4.2: Confusion matrix table in case of multi-class or multi-label classification

                 Actual
              A    B    C    D
Predicted A   TP   FP   FP   FP
          B   FP   TP   FP   FP
          C   FP   FP   TP   FP
          D   FP   FP   FP   TP

Table 4.2 shows a sample confusion matrix for the multi-class or multi-label case. The four main outcomes of the confusion matrix are:

• True Positive (TP): documents predicted as positive that are actually positive.

• False Positive (FP): documents predicted as positive that are actually negative.

2Symbols: https://www.researchgate.net/figure/Ground-truth-square-symbols-vs-predictions-plus-symbols-for-each-period-of_fig14_333675748

• True Negative (TN): documents predicted as negative that are actually negative.

• False Negative (FN): documents predicted as negative that are actually positive.

All the TP are the items on the diagonal of the matrix. The columns represent the actual labels while the rows represent the predicted labels. When calculating the precision and recall metrics, we are only concerned with TP, FP, and FN. For a given label, all row items except the diagonal item count as FP, and all column items except the diagonal item count as FN.

4.9.2 Precision

In information retrieval and classification problems, precision is the metric that tells us how many of the retrieved documents were actually correct. Precision is the number of true positives divided by the sum of true positives and false positives.

The micro-averaged metric aggregates the contributions of all classes to compute the average, while the macro-averaged metric computes the metric independently for each class and then takes the average.

The micro-averaged metric first computes the total number of TP, FP, and FN over all documents:

TP = \sum_{d_i \in D} tp(l^*(d_i)), \quad FP = \sum_{d_i \in D} fp(l^*(d_i)), \quad FN = \sum_{d_i \in D} fn(l^*(d_i))    (4.10)

Then, micro P is computed as:

P_{micro} = \frac{TP}{TP + FP}    (4.11)

In contrast to the micro-averaged metric, the macro-averaged metric first computes the precision P_macro(l) for each label l in the ground truth set L_D:

P_{macro}(l) = \frac{\sum_{d_i \in D} tp(l^*(d_i))}{\sum_{d_i \in D} \left( tp(l^*(d_i)) + fp(l^*(d_i)) \right)}    (4.12)

where l^*(d_i) = l.

Then, this value is averaged over all ground truth labels LD:

P_{macro} = \frac{1}{|L_D|} \sum_{l \in L_D} P_{macro}(l)    (4.13)

4.9.3 Recall

Recall is the metric that tells us how many of the documents in the labelled set were identified as a match in the retrieved set; it helps to find all relevant documents in the dataset. Recall is the number of true positives divided by the sum of true positives and false negatives. As for precision, we evaluate both the micro- and the macro-averaged metric for recall.

With support of Equation 4.10, let us calculate micro R as:

R_{micro} = \frac{TP}{TP + FN}    (4.14)

Similar to Equation 4.12, we calculate R_macro(l) for each label l in the ground truth set L_D as:

R_{macro}(l) = \frac{\sum_{d_i \in D} tp(l^*(d_i))}{\sum_{d_i \in D} \left( tp(l^*(d_i)) + fn(l^*(d_i)) \right)}    (4.15)

where l^*(d_i) = l.

Then, this value is averaged over all ground truth labels LD:

R_{macro} = \frac{1}{|L_D|} \sum_{l \in L_D} R_{macro}(l)    (4.16)

where |L_D| is the number of distinct labels in L_D.

4.9.4 F1-score

The F1-score is a metric that combines the precision and recall scores into a balanced score. It is the harmonic mean of P and R, taking both metrics into account. F1 reaches its best value at 1 and its worst at 0, letting us measure the test's accuracy.

With the support of Equations 4.11 and 4.14, we calculate the micro F1 (F1_micro) as:

F1_{micro} = \frac{2 \cdot P_{micro} \cdot R_{micro}}{P_{micro} + R_{micro}}    (4.17)

Similarly, using Equations 4.13 and 4.16, we calculate the macro F1 (F1_macro) as:

F1_{macro} = \frac{2 \cdot P_{macro} \cdot R_{macro}}{P_{macro} + R_{macro}}    (4.18)

Both F1_micro and F1_macro are computed over the ground truth labels L_D and not over all possible predictions, i.e. all labels existing in the NTS-ICD dataset. The recall for labels not present in the ground truth collection L_D cannot be computed in a meaningful way, as there are no documents containing those labels. Therefore, if a model prediction l^+(d) is not present in the ground truth L_D, it is counted as a false negative for the ground truth target l^*(d) but not as a false positive for the predicted entity l^+(d).

4.9.5 Example: Micro and Macro Performance

In this example, we will look at micro and macro averaging performance in the case of multi-label classification. We consider two scenarios: one where all labels are correctly predicted, and one where a single label is mislabelled.

Consider a collection of documents D = {d1, d2, d3, d4} with associated ground truth labels L_D = {l*(d1) = {l1, l2}, l*(d2) = {l1, l3}, l*(d3) = {l1, l4}, l*(d4) = {l2, l4}}.

From this assignment we have a set of four ground truth labels, L = {l1, l2, l3, l4}. If all documents are linked correctly to their respective labels, then, referring to the confusion matrix in Table 4.2, it would be constructed as:

Table 4.3: Confusion matrix table when all labels from each document are linked correctly to the actual ones

                 Actual
              l1   l2   l3   l4
Predicted l1   3    0    0    0
          l2   0    2    0    0
          l3   0    0    1    0
          l4   0    0    0    2

Now, referring to the confusion matrix in Table 4.3, we can read off the TP, FP, and FN as:

tp(l1) = 3, tp(l2) = 2, tp(l3) = 1, tp(l4) = 2

fp(l1) = fp(l2) = fp(l3) = fp(l4) = 0

fn(l1) = fn(l2) = fn(l3) = fn(l4) = 0

This results in micro and macro performance values of:

P_micro = R_micro = F1_micro = 1 and

P_macro = R_macro = F1_macro = 1

In this case, it would give a perfect score of 100% which shows the best-case scenario.

But in the case where one label of document d2 is mislabelled, i.e. a prediction error of l3 as l2, we would have the following confusion matrix:

Table 4.4: Confusion matrix table when one label from a document is mislabelled with another label (Label l3 from document d2 is mislabelled with l2)

                 Actual
              l1   l2   l3   l4
Predicted l1   3    0    0    0
          l2   0    2    1    0
          l3   0    0    0    0
          l4   0    0    0    2

Now, referring to the confusion matrix in Table 4.4, we first calculate P for all labels and later relate it to the averaged scores:

tp(l1) = 3, tp(l2) = 2, tp(l3) = 0, tp(l4) = 2

fp(l1) = fp(l3) = fp(l4) = 0, fp(l2) = 1

fn(l1) = fn(l2) = fn(l4) = 0, fn(l3) = 1

p(l1) = 1, p(l2) = 0.66, p(l3) = 0, p(l4) = 1

According to Equations 4.10, 4.11, 4.14, and 4.17, the micro performance results would be:

P_micro = R_micro = 0.875, F1_micro = 0.875

and, similarly, following Equations 4.12, 4.13, 4.15, 4.16, and 4.18, the macro performance results would be:

P_macro = 0.666, R_macro = 0.75, F1_macro = 0.705

As discussed earlier, micro performance gives equal weight to each document sample, so the high number of correctly predicted labels has more impact and we obtain a noticeably higher micro performance. The macro performance, in contrast, shows that not all ground truth labels were correctly retrieved: label l3 was misinterpreted and predicted as l2. The low macro performance could imply that some documents are more difficult to link to their actual labels, or that there is a strong imbalance in the ground truth labels which negatively impacts model performance; this can be verified by looking at the precision scores of l1 and l3. Label l1 has the highest occurrence in the samples and all its instances were correctly retrieved, resulting in p(l1) = 1 and thus a higher micro-averaged score, whereas label l3 occurred only once and was mislabelled, resulting in p(l3) = 0 and thus a lower macro-averaged score.

This example illustrates the difference between micro and macro averaging: micro averaging gives more weight to the dominant class, so the micro score (0.875) is close to the precision of the dominating class, which is 1. Macro averaging gives equal weight to all classes, which explains why its score (0.705) is pulled towards the precision of the under-represented class l3, which is 0.
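The same numbers can be reproduced with scikit-learn, as in the following sketch; note that scikit-learn computes the macro F1 as the average of the per-label F1 scores, which yields 0.7 rather than the 0.705 obtained above by combining the averaged precision and recall.

from sklearn.metrics import precision_recall_fscore_support

# Binary indicator rows for documents d1..d4 over labels l1..l4.
y_true = [[1, 1, 0, 0],   # d1: {l1, l2}
          [1, 0, 1, 0],   # d2: {l1, l3}
          [1, 0, 0, 1],   # d3: {l1, l4}
          [0, 1, 0, 1]]   # d4: {l2, l4}
y_pred = [[1, 1, 0, 0],
          [1, 1, 0, 0],   # l3 mislabelled as l2
          [1, 0, 0, 1],
          [0, 1, 0, 1]]

for average in ("micro", "macro"):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=average, zero_division=0)
    print(average, round(p, 3), round(r, 3), round(f1, 3))
# micro 0.875 0.875 0.875
# macro 0.667 0.75  0.7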

Chapter 5

Experiments

In this chapter, we dive into the experiments conducted to achieve the main goal of this thesis. Section 5.1 gives a brief overview of the various experiments and the modules used for them. In Section 5.2, we fine-tune German BERT using the domain-specific (medical) datasets described in Section 3.4 and observe the training and evaluation loss for various dataset combinations. This section helps us choose the best fine-tuned model for the later token masking experiment and classification task, and it leads us towards answering the research questions formulated in Section 1.3. Section 5.3 performs the token masking and prediction experiment with several BERT baseline models and the fine-tuned model, helping us answer the first research question. Section 5.4 then gives a brief overview of the implementation of classical models for the classification task using the evaluation dataset shown in Figure 3.1, and discusses in detail the several BERT models that we train for the same classification task; this section helps us answer the second research question. Lastly, Section 5.5 presents the performance metrics of all these models, discusses the masked token prediction and ICD-10 code classification results of the various BERT models, and evaluates them.

5.1 Experimental Setup

In this section, we discuss the setup of the experiments conducted in the following sections. We used the training, dev, and test datasets from the provided NTS-ICD-10 animal experiments corpus. The training dataset was used to train our multi-label models. We tried several baseline models such as LinearSVC, LogisticRegression, and FastText and evaluated their performance on the development and test sets. FastText can handle multi-label problems directly, provided its specific dataset format is used; LinearSVC and LogisticRegression, however, have to be wrapped with OneVsRestClassifier to handle the task as a multi-label classification problem. The LinearSVC and LogisticRegression baselines and their supporting modules were taken from the sklearn library.

We built TF-IDF vectors as input text representations, keeping only the top 20,000 tokens as vocabulary, and used a MultiLabelBinarizer to represent the multi-label targets, removing all classes with a frequency of less than 25. We did not add any placeholder label for documents without annotated ICD-10 codes. Since all ICD-10 codes are known beforehand, we initialized the MultiLabelBinarizer with all ICD-10 codes (labels) from the training dataset.
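A minimal sketch of this input and target representation with scikit-learn is shown below; the example documents and codes are hypothetical placeholders, not data from the corpus.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical training documents and their ICD-10 code annotations.
train_texts = ["Beschreibung eines Tierversuchs zu Brustkrebs ...", "Studie zu Herzinsuffizienz ..."]
train_codes = [["C50", "C56"], ["I50"]]
all_icd_codes = sorted({code for codes in train_codes for code in codes})

# Input representation: TF-IDF vectors over the 20,000 most frequent tokens.
vectorizer = TfidfVectorizer(max_features=20000)
X_train = vectorizer.fit_transform(train_texts)

# Target representation: binary indicator matrix initialised with all known ICD-10 codes.
binarizer = MultiLabelBinarizer(classes=all_icd_codes)
y_train = binarizer.fit_transform(train_codes)
print(X_train.shape, y_train.shape)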

Apart from these classical baseline models, we also used the multilingual BERT and German BERT language models as the main baselines, as their performance is more directly comparable to that of the model fine-tuned on the domain-specific dataset, which in this case is the medical domain. To utilize the BERT models, we used the transformers library from Hugging Face with its PyTorch implementation.

We did not perform any hyperparameter optimization, but chose the baseline parameters for the traditional methods as suggested by Amin et al. (2019) and followed the suggestions from Devlin et al. (2019) for training the BERT models. We used the evaluation functions provided by the sklearn module and report P, R, and micro F1-score as evaluation metrics.

5.2 Fine-tuning German BERT model with medical domain dataset

For the implementation of the BERT models, we took Multilingual BERT and German BERT as baselines for the classification of NTS-ICD documents, but the evaluation focuses on the German medical BERT model, which we obtained by fine-tuning German BERT with the medical datasets collected from the internet. To fine-tune for this task, we used the Hugging Face library to create an adaptive model with the classification prediction head. The hyperparameters are a maximum sequence length of 512 tokens (the maximum limit of BERT), 3 epochs, a batch size of 8 with gradient accumulation over 4 steps, and a learning rate of 5e-5. We used only 3 epochs, as this is the maximum number of epochs suggested for fine-tuning in Devlin et al. (2019). The batch size of 8 combined with 4 gradient accumulation steps results in an effective batch size of 32.
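Since Section 5.2 evaluates masked-LM loss on medical text, the sketch below shows one possible way to run such LM fine-tuning with the Hugging Face Trainer API and the stated hyperparameters; it is an assumption about the setup rather than the exact script used in the thesis, the file paths are hypothetical, and the LineByLineTextDataset helper is deprecated in newer library versions in favour of the datasets library.

from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-german-cased")

# Plain-text corpus files (paths are hypothetical), one sentence or article per line.
train_dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="medical_train.txt", block_size=512)
eval_dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="nts_eval.txt", block_size=512)

# Standard MLM objective: mask 15% of the input tokens.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Hyperparameters from the text: 3 epochs, batch size 8 with 4 accumulation steps
# (effective batch size 32), learning rate 5e-5.
args = TrainingArguments(
    output_dir="german-medical-bert",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
)

trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()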

After evaluating the performance of the German BERT model, we fine-tune it using the medical datasets to see whether its behavior improves. We prepared the fine-tuning datasets from Section 3.4 in two different ways:

1. Sentences (dataset with one sentence per line).
2. Articles (dataset with rows up to BERT's maximum token length). Here, after reaching the maximum token length while traversing several sentences from the medical datasets, we start a new row in the dataset.

Also, we fine-tuned and evaluated the model on 4 different use cases to see whether a model trained on sentences evaluates well on articles and vice versa. The use cases are listed below:

1. Sentence train - Sentence eval
2. Sentence train - Article eval
3. Article train - Article eval
4. Article train - Sentence eval

The purpose of running all these different use cases is to properly choose the fine-tuned model for the classification task on the NTS-ICD-10 dataset. We choose the best model based on the evaluation loss across these use cases, so it will either be the sentence-trained or the article-trained fine-tuned model.

Observing the nature of the NTS-ICD-10 classification documents, each document is multiple paragraphs long, which resembles the structure of an article. This might work in our favour, as BERT was also trained on Wikipedia articles; given this, the loss behavior for articles should be better than for sentences.

5.2.1 Usecase 1: Sentence train - Sentence eval

Figure 5.1: Sentence training using scraped medical articles - Sentence eval on NTS documents loss plot for fine-tuning. 69

Figure 5.1 shows the loss plot for the fine-tuning use case in which the model was trained on the additionally collected medical sentences and evaluated on NTS sentences from the NTS-ICD-10 dataset. This experiment used 636,507 lines of medical text as a training set, collected from the medical sources listed in Section 3.4.

We ran this experiment for 6 epochs with a batch size of 8. Instead of capturing the loss at every epoch, we captured it every 5,000 batch steps, so the total number of loss points is calculated as follows:

Train samples: 636,507
Epochs: 6
Batch size: 8
Batch step: 5,000
Training steps per epoch: train samples / batch size = 636,507 / 8 = 79,563.4, equivalent to 79,564 steps.
Total loss points: epochs * (training steps per epoch / batch step) = 6 * 15.91 (equivalent to 16) = 96 loss points.
Eval samples: 1,000

In this setting, each epoch has 16 loss points, which we also refer to as a loss point interval. The training loss in the plot is stable after the 40th loss point, and the minimum evaluation loss captured up to that position is 2.1. This tells us that the model is already well fine-tuned after 3 epochs; in terms of loss point intervals, the third epoch covers loss points 33 to 48. The minimum evaluation loss, which does not decrease beyond this value, is captured at loss point 33, with a value of 2.1.

This use case and its behavior also strongly support the BERT paper by Devlin et al. (2019), which suggests 2 to 3 epochs as an effective number of epochs for fine-tuning.

5.2.2 Usecase 2: Sentence train - Article eval

Figure 5.2: Sentence training using scraped medical articles - Article eval on NTS documents loss plot for fine-tuning.

Figure 5.2 shows the loss plot for the fine-tuning use case in which the model was trained on the additionally collected medical sentences and evaluated on documents from the NTS-ICD-10 dataset. This experiment uses the same configuration as use case 5.2.1, except that we train for 3 epochs and use only 100 NTS documents as the evaluation sample.

Considering the loss point intervals discussed in use case 5.2.1, the minimum evaluation loss of 2.25 is reached at the 14th loss point, although the training loss is still decreasing. Going further, we see no improvement in the evaluation loss, which is not a good sign for fine-tuning. The minimum evaluation loss was captured at the 14th loss point, which falls in the first epoch; compared to use case 5.2.1 and the recommendations in Devlin et al. (2019), the evaluation loss of this model did not benefit from using all 3 epochs.

5.2.3 Usecase 3: Article train - Article eval

Figure 5.3: Article training using scraped medical articles - Article eval on NTS documents loss plot for fine-tuning.

Figure 5.3 shows the loss plot for the fine-tuning use case in which the model was trained on the additionally collected medical articles and evaluated on documents from the NTS-ICD-10 dataset. This experiment used 34,283 lines of medical articles as a training set, collected from the medical sources listed in Section 3.4.

We ran this experiment for 3 epochs with a batch size of 8. Instead of capturing the loss at every epoch, we captured it every 1,500 batch steps, so the total number of loss points is calculated as follows:

Train samples: 34,283
Epochs: 3
Batch size: 8
Batch step: 1,500
Training steps per epoch: train samples / batch size = 34,283 / 8 = 4,285.4, equivalent to 4,286 steps.
Total loss points: epochs * (training steps per epoch / batch step) = 3 * 2.86 (equivalent to 3) = 9 loss points.
Eval samples: 100

In this setting, each epoch has 3 loss points, which we again refer to as a loss point interval. The training loss in the plot is stable after the 2nd loss point, and the minimum evaluation loss of 1.5 is captured at the 4th loss point. This tells us that the model is already well fine-tuned after 2 epochs.

This use case and its behavior also strongly support the BERT paper by Devlin et al. (2019), which suggests 2 to 3 epochs for fine-tuning. This might become our chosen fine-tuned model, as the NTS-ICD-10 classification documents are also composed of articles and the model learns well enough to reach a low loss on the evaluation dataset. However, we still have one more use case to compare before choosing the final model.

5.2.4 Usecase 4: Article train - Sentence eval

Figure 5.4: Article training using scraped medical articles - Sentence eval on NTS documents loss plot for fine-tuning.

Figure 5.4 shows the loss plot for the fine-tuning use case with a configuration similar to 5.2.3, but with 1,000 evaluation samples.

Compared to the model in 5.2.3 with an evaluation loss of 1.5, the evaluation loss of this model reaches its minimum of 1.9 at the 3rd loss point. Having been trained on medical articles, the model evaluates well on NTS articles, but its evaluation loss on NTS sentences is also not much worse than that of the sentence-trained model. Apart from the evaluation loss, this article-trained model also achieves a lower training loss than the model trained on medical sentences.

5.3 BERT Predicting Masked Tokens Experiment

We used the MLM training objective of BERT, whose goal is to predict masked tokens in a sequence of tokens. This is an effective way of checking the bi-directional conditioning of the BERT model, as the context of the masked word has been seen from both sides by the model's deep bi-directional architecture, as discussed in Section 2.2. For this purpose, BERT (Devlin et al. 2019) suggests masking a percentage of the input tokens at random and then predicting those masked tokens.

Since BERT was originally trained on Wikipedia articles and a books corpus, and our dataset contains medical terminology, we were eager to compare the performance of the German BERT model on this task for both the Wikipedia articles and the medical articles datasets, as we believed it would be difficult for this model to really capture the medical terms. MLM predictions can be done for two types of scenarios:

1. SubWord masking
2. WholeWord masking

When we tokenize a sequence or sentence with the BERT model, each word is replaced with the closest word-piece combination found in the model's vocabulary1. A sample tokenization for the sentence

ER-Analysen bei neurodegenerativen Erkrankungen would be

ER, -, Analysen, bei, neur, ##ode, ##gener, ##ativen, Erkrankungen

Here we can see that the word neurodegenerativen is replaced with the closest word-piece combination, as this word could not be found in the model's vocabulary; every word piece following the initial piece of the target word is prefixed with ##. This type of tokenization is known as WordPiece tokenization and is the original tokenization implementation of BERT.
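This tokenization can be reproduced with the Hugging Face tokenizer for the public German BERT checkpoint, as in the brief sketch below; the exact split depends on the vocabulary of the checkpoint used.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
# Out-of-vocabulary words are split into word pieces, continuations prefixed with "##".
print(tokenizer.tokenize("ER-Analysen bei neurodegenerativen Erkrankungen"))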

And, following this implementation, to perform predictions for SubWord masking, we would randomly mask the tokens from the tokenized sentence as

ER, -, Analysen, [MASK], neur, ##ode, [MASK], ##ativen, [MASK]

Similarly, since our sentence above has only one word that was separated into several pieces, WholeWord masking would look like

1German BERT vocab: https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt

ER, -, Analysen, bei, [MASK], [MASK], [MASK], [MASK], Erkrankungen

Here we replace all word pieces that the tokenizer produced for the word, and the goal is to predict all pieces belonging to that single word. While doing this, we must be careful not to mask too many word pieces, because the model would then lose most of the contextual information and perform poorly. Through this experiment, we try to answer the first research question formulated in Section 1.3.

5.3.1 Experiment 1: SubWord Masking

As discussed earlier, we created two separate datasets from Wikipedia and our NTS-ICD data, consisting of 10,000 sentences and 500 articles. In this experiment, we mask 15% of the tokens of each sentence, as proposed for BERT (Devlin et al. 2019).

Figure 5.5 shows the accuracy of SubWord masking for both the Wikipedia and the NTS-ICD datasets, arranged sentence-wise and article-wise. The reason for doing this was to see which arrangement of the dataset captures the context best and thus improves MLM performance. Although we stated at the beginning of this section that the German BERT model would not capture medical terminology well, having been trained on Wikipedia articles and a books corpus, Figure 5.5 tells us otherwise: for both nts-sentences and nts-articles, the masked token scores are higher than for Wikipedia.

Figure 5.5: SubWord masking experiment and performance of the German BERT model (bar chart; x-axis: types of datasets for masked token prediction (wikipedia-sentences, nts-sentences, wikipedia-articles, nts-articles); y-axis: presence of tokens at top-k position for k = 1, 10, 50, 100, 200).

Here we see a sample prediction for sub-word masking on nts-sentences using the German BERT model.

Sentence: Die mult ##ipl ##e Sk ##ler ##ose ( MS ) ist eine aut ##oi ##mm ##une Erkrankung des zentralen Nerven ##systems .
Masked Sentence: Die mult ##ipl ##e Sk ##ler [MASK] ( MS ) ist eine aut [MASK] ##mm ##une Erkrankung des [MASK] Nerven ##systems .
Masked tokens: ['##ose', '##oi', 'zentralen']
Predicted tokens: ['##ose', '##oi', 'zentralen']

The predicted tokens were collected from the top-10 tokens of the model's prediction. We used the topk method2 with positions 1, 10, 50, 100, and 200 to obtain the token predictions and to see at which position the correct token was predicted.
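The sketch below shows one way to run such a top-k check with the transformers library and torch.topk; the masked index is chosen arbitrarily for illustration, and the code is an assumption about the setup rather than the exact evaluation script of the thesis.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-german-cased")
model.eval()

text = "Die multiple Sklerose (MS) ist eine autoimmune Erkrankung des zentralen Nervensystems."
inputs = tokenizer(text, return_tensors="pt")

masked_index = 5                                            # arbitrary sub-word position
original_id = inputs["input_ids"][0, masked_index].item()
inputs["input_ids"][0, masked_index] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(**inputs).logits                         # (1, seq_len, vocab_size)

# At which top-k position is the original token recovered?
for k in (1, 10, 50, 100, 200):
    top_ids = torch.topk(logits[0, masked_index], k).indices.tolist()
    print(k, original_id in top_ids)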

2topk: https://pytorch.org/docs/master/generated/torch.topk.html

Comparing the scores for wikipedia-sentences and nts-sentences in Figure 5.5, we see that the predictions for nts-sentences are already stronger than for wikipedia-sentences at the lowest positions and remain consistent at the higher positions. Comparing wikipedia-articles and nts-articles, the scores at positions 1, 10, and 50 are comparably higher, but the advantage gradually decreases at the higher positions.

Hence, for sub-word masking, the performance is better on sentences than on articles, which could be because BERT's attention between words within the same sentence is much stronger than across multiple sentences, so it can more easily capture the context between words in the same sentence. Through this experiment, we are now convinced that this model can capture medical contexts even better than Wikipedia contexts, but the scores for nts-articles also show that the model requires some fine-tuning.

Figure 5.6: SubWord masking experiment and performance of the fine-tuned German BERT model with medical domain (Section 5.2.3) (bar chart; x-axis: types of datasets for masked token prediction; y-axis: presence of tokens at top-k position for k = 1, 10, 50, 100, 200).

Figure 5.6 shows the sub-word masking results using the model fine-tuned on the medical domain. Comparing Figure 5.6 with Figure 5.5, the score for nts-sentences at the top-10 position has increased by 43%, while for nts-articles it has decreased by 0.078%. Looking at these scores, we can say that fine-tuning did have a positive effect on the masked token prediction performance. We verify the sub-word masking results for both models in Section 5.5.1 and compare them there.

5.3.2 Experiment 2: WholeWord Masking

In this experiment, we use the same datasets as in Section 5.3.1. Since WholeWord masking masks all word-piece tokens of a word obtained after tokenization, we limit the number of whole words to be masked, to avoid masking most of the words of a sentence and losing context, which would result in poor performance. We therefore mask only two whole words per sentence.

Figure 5.7 shows the performance for WholeWord masking using the German BERT model. The results for nts are worse than for the Wikipedia datasets, which is not what we hoped to see, although the difference between the sentence datasets is minimal. For the articles, the nts score is higher than Wikipedia at position one, which is a good sign, but Wikipedia improves gradually at the remaining positions, which strongly suggests that this model needs fine-tuning with domain-related datasets. Comparing the scores in Figure 5.5 and Figure 5.7, the German BERT model captures the context of medical documents well for SubWord masking but fails to do the same for WholeWord masking. As a next step, we therefore fine-tuned the German BERT model with medical documents.

Figure 5.7: WholeWord masking experiment and performance of the German BERT model (bar chart; x-axis: types of datasets for masked token prediction; y-axis: presence of tokens at top-k position for k = 1, 10, 50, 100, 200).

Here, as in Section 5.3.1, we also look at a sample prediction for whole-word masking on the same nts-sentence.

Sentence: Die mult ##ipl ##e Sk ##ler ##ose ( MS ) ist eine aut ##oi ##mm ##une Erkrankung des zentralen Nerven ##systems .
Masked Sentence: Die mult ##ipl ##e [MASK] [MASK] [MASK] ( MS ) ist eine aut ##oi ##mm ##une Erkrankung des zentralen Nerven ##systems .
Masked tokens: ['Skl', '##er', '##ose']
Predicted tokens: ['', '##er', '##ose']

The predictions look correct, but we also want to see at which position in the prediction set each token was found, since with sub-word masking most of the tokens were already at the first position. This result is discussed in detail in Table 5.5 of Section 5.5.1.

Figure 5.8 shows the performance for WholeWord masking using the German BERT model fine-tuned on the medical domain. The results for both nts-sentences and nts-articles have improved compared to Figure 5.7, and comparing nts with Wikipedia, the model's attention on the nts sentences and articles has clearly improved.

We can now verify the effect of the fine-tuned model for whole-word masking, as we did for sub-word masking in Tables 5.3 and 5.4. The model's performance improved by 46.15% on the top-10 predicted tokens and by 28.57% on the top-100 predicted tokens for nts-sentences. This is a good sign that fine-tuning a model with domain-specific datasets improves the overall model performance.

Figure 5.8: WholeWord masking experiment and performance of the fine-tuned German BERT model with medical domain (Section 5.2.3) (bar chart; x-axis: types of datasets for masked token prediction; y-axis: presence of tokens at top-k position for k = 1, 10, 50, 100, 200).

After obtaining all these MLM scores, we also performed a sanity check of the MLM scores for sentences from NTS documents. We created a sample document of only ten sentences from NTS documents, and another document in which words from the NTS sentences were randomly replaced with words that do not fit the context, to verify whether the scores we obtained can really be trusted. The documents can be found in Appendix 5.

As Figure 5.9 shows, our scores appear reliable, as the MLM scores for the original NTS sentences are better than for the NTS sentences filled with random words. This also means that BERT is better at grasping contextual meaning, and thus at predicting missing tokens, when given properly formed sentences.

Figure 5.9: SubWord masking experiment and performance of the German BERT model when passing actual sentences (nts-original) versus sentences whose words were replaced to break the context (nts-random) (bar chart; x-axis: types of datasets for masked token prediction; y-axis: presence of tokens at top-k position for k = 1, 10, 50, 100, 200).

5.4 NTS-ICD-10 document classification

For the classification task required for NTS-ICD-10 code classification, we compare the BERT models with other well-researched methods as baselines. This allows us to compare the effectiveness of German BERT and our German medical BERT on the classification task. The baseline methods chosen for this purpose are LinearSVC, Logistic Regression, FastText, Multilingual BERT, and German BERT.

5.4.1 Linear SVC

Linear SVC is one of the most widely used machine learning algorithms for classification. In our case, we implemented it for the multi-class classification problem and, to make it usable for multi-label classification, wrapped it with OneVsRestClassifier. It is often used as a baseline model because of its simplicity. For the implementation, we used the Scikit-Learn (sklearn) library and the LinearSVC class together with OneVsRestClassifier. The relevant parameters were class weight set to balanced, to automatically adjust weights inversely proportional to class frequencies, and max iter set to 5000 to let the model converge.
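A self-contained sketch of this setup is shown below; the tiny toy documents and ICD-10 codes are hypothetical and stand in for the NTS training data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical stand-in for the NTS documents and their ICD-10 codes.
texts = ["Untersuchung von Brustkrebs bei Mäusen",
         "Studie zu Herzinsuffizienz",
         "Brustkrebs und Eierstockkrebs"]
codes = [["C50"], ["I50"], ["C50", "C56"]]

X = TfidfVectorizer().fit_transform(texts)
y = MultiLabelBinarizer().fit_transform(codes)

# One binary LinearSVC per ICD-10 code, with the parameters used in the thesis.
model = OneVsRestClassifier(LinearSVC(class_weight="balanced", max_iter=5000))
model.fit(X, y)
print(model.predict(X))   # binary indicator matrix, one column per code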

The OneVsRestClassifier3, or one-vs.-rest strategy4, trains a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. Since each class is represented by a single classifier, it is possible to gain insight into a class by inspecting its classifier. Following this technique, it can also be used for multi-label learning, where the classifiers together predict multiple labels for an instance.

Linear SVC is similar to Support Vector Classification (SVC) with a linear kernel. While SVC was developed for binary classification, multi-class and multi-label classification are possible with OneVsRestClassifier. For a linear kernel, the model looks for a hyperplane that separates the vectors into two non-overlapping classes. If this separation is perfect, the model classifies most cases correctly; if it is not, some cases are inevitably classified incorrectly.

5.4.2 Logistic Regression

Logistic regression is another widely used machine learning algorithm for predicting the probability that a class occurs. We implemented it in the same way as Linear SVC to handle the multi-class and multi-label classification problem, and it is another baseline model that is equally simple to implement. For the implementation, we used the Scikit-Learn (sklearn) library and the LogisticRegression estimator together with OneVsRestClassifier. The relevant parameters were class_weight set to balanced, the liblinear solver, and L2 regularization. Logistic regression is well suited when the target variable is categorical, i.e., when we have to predict classes or labels from representations learned from textual documents.
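As a sketch, and reusing the TF-IDF features and binarized labels from the Linear SVC example above, the base estimator simply changes to LogisticRegression with the parameters described here; the surrounding pipeline is assumed, not prescribed.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# One binary logistic regression per ICD-10 label (one-vs.-rest),
# with balanced class weights, the liblinear solver and L2 regularization
log_reg = OneVsRestClassifier(
    LogisticRegression(class_weight="balanced", solver="liblinear", penalty="l2")
)
# log_reg.fit(X_train, y_train) and log_reg.predict(X_test) are then used
# exactly as in the Linear SVC sketch above.
```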

3 OneVsRestClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html
4 One-vs.-rest strategy: https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest

5.4.3 FastText

While linear classifiers often build strong baselines, they tend to be slow at training; Joulin et al. (2016) present another linear model with a rank constraint and a fast loss approximation which can train on a billion words within ten minutes. Here, the word representations are averaged into a text representation, which is in turn fed to a linear classifier that uses the softmax function to compute the probability distribution over the predefined classes. For the FastText implementation, we used the default classifier that comes with the fastText library. The hyperparameters used were a learning rate of 0.3, 50 epochs, bigrams, and the ova loss. The One-vs-All (OVA) loss is helpful here because it works well with multi-label classification.
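A minimal sketch of this setup with the fastText Python library is given below; the training-file path is a placeholder, and the file is assumed to contain one document per line with labels prefixed by __label__ (e.g. "__label__IV __label__E10-E14 <text>").

```python
import fasttext

# Supervised fastText training with the hyperparameters described above
model = fasttext.train_supervised(
    input="nts_train.txt",   # hypothetical training file
    lr=0.3,                  # learning rate
    epoch=50,                # number of epochs
    wordNgrams=2,            # use bigrams
    loss="ova",              # one-vs-all loss for multi-label classification
)

# Predict all labels above a probability threshold for a new document
labels, probs = model.predict("Beschreibung eines Tierversuchs", k=-1, threshold=0.5)
print(labels, probs)
```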

Joulin et al. (2016) use a hierarchical softmax to reduce the computational complexity when the number of classes is large, which is also advantageous at test time when searching for the most likely class. In addition, the model makes use of n-gram features to capture partial information about the local word order.

fastText performs both supervised and unsupervised learning efficiently, and is also efficient for learning word representations. It supports training both Continuous Bag of Words (CBOW) and Skip-gram models using negative sampling, softmax, or hierarchical softmax loss functions. Its implementation is as easy as that of other linear models, but it is more scalable and faster in comparison.

5.4.4 Bidirectional Encoder Representations from Transformers

In this section, we discuss the NTS-ICD10 classification done with various BERT models: Multilingual BERT, German BERT, and the medical fine-tuned BERT. As we treat this task as a multi-label classification problem, we used the BCE loss, which handles it as a set of independent binary classification problems, one per label, so that every label receives its own weight in the loss. All the classification tasks in this section were run on a Google Colab notebook GPU and took around 5 to 6 hours depending on the availability of the resources. The documents that we took for this sample prediction evaluation can be found in Appendix 6.

5 FastText: https://fasttext.cc/docs/en/supervised-tutorial.html
6 Google Colab: https://colab.research.google.com

Classification 1: Multi-lingual BERT

Multilingual BERT (M-BERT), released by Devlin et al. (2019) is a single language model pre-trained from monolingual corpora in 104 languages with a 12 layer transformer (Pires, Schlinger, and Garrette 2019).

In this section, we also refer to the result of Amin et al. (2019), who used Multilingual BERT to classify the NTS-ICD10 problem and achieved an F1-micro score of 76% on the German dataset. Replicating the scores of Amin et al. (2019) yields the plot shown in Figure 5.10:

Figure 5.10: Multi-lingual BERT for NTS-ICD-10 document classification (Amin et al. 2019)

The classification was run for 25 epochs, as the F1-micro score kept improving and the train loss kept decreasing. However, the curve of the eval loss suggests a different interpretation. The eval loss in Figure 5.10 decreases until epoch 7 and then starts increasing drastically, with no sign of going down again. The F1-micro score on the dev set noted for this full run was 75.1%. This type of behavior indicates that the model is overfitting: it fits the training data too well but fails to generalize to other data, in this case the evaluation dataset.

This behavior of the eval loss curve could have several causes, such as an imbalanced dataset, many unnecessary features, or the loss function used. If either of the first two reasons applied, there would also be some effect on the train loss, so we tried changing the loss function used by Amin et al. (2019) and obtained the improved curves presented in Figure 5.11:

Figure 5.11: Multi-lingual BERT with BCE loss for NTS-ICD10 document classification

For this implementation, we used the BCE classification loss, which is suitable for multi-label problems. As we see in Figure 5.11, both the train loss and the eval loss decrease together in the same manner while the F1-micro score keeps increasing. The eval loss and the F1-micro score stop improving after the 24th epoch, and the F1-score on the dev set noted for this run is again 75.1%. This type of behaviour is a sign of a properly trained model that can also generalize well to unseen data. All further classifications are therefore done using the BCE loss discussed above.
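As an illustration of this loss setup, the following is a minimal sketch (sizes and label count are illustrative assumptions, not the thesis configuration) of how BCEWithLogitsLoss treats the per-label logits as independent binary classification problems:

```python
import torch
from torch import nn

# Illustrative sizes: one logit per ICD-10 label for each document in a batch
num_labels = 232            # hypothetical number of ICD-10 chapters/groups
batch_size = 8

logits = torch.randn(batch_size, num_labels, requires_grad=True)  # classifier output
targets = torch.randint(0, 2, (batch_size, num_labels)).float()   # multi-hot labels

loss_fn = nn.BCEWithLogitsLoss()    # sigmoid + binary cross entropy per label
loss = loss_fn(logits, targets)
loss.backward()                     # gradients for a single training step
```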

Classification 2: German BERT

German BERT (Chan et al. 2019) is another language model, trained on a German Wikipedia dump, an OpenLegalData dump, and news articles. Its architecture is based on the original BERTbase architecture. We use the German BERT model specifically for our classification because our NTS-ICD10 dataset is in German and this model is already pre-trained on a large German corpus.

This also makes it a reasonable reference point for the fine-tuned medical German BERT language model later, so that we can confirm that even fine-tuning with a small dataset can achieve noticeable improvements in performance.

7 BCEWithLogitsLoss: https://pytorch.org/docs/master/generated/torch.nn.BCEWithLogitsLoss.html

Figure 5.12: German BERT with BCE loss for NTS-ICD10 document classification

Figure 5.12 shows the classification plot for the NTS-ICD10 dataset using German BERT. In contrast to Figure 5.10, the plot shows no signs of overfitting. The best score for this model was reached at epoch 19, with an F1-micro score of 80.61% on the dev set. After the 19th epoch, the eval loss increases with no improvement in the F1-micro score. Another point to note from the Multilingual and German BERT plots above is that German BERT starts learning more quickly than the other model, and in a more stable manner.

This behaviour was a further pointer towards fine-tuning the German BERT model with additional medical datasets, so as to make the model better able to learn medical word embeddings and contexts.

Classification 3: Fine-tuned Medical German BERT

In this section, we make use of our fine-tuned medical German BERT to classify the NTS-ICD10 dataset. In all the BERT classifications above, we used only documents whose labels occurred at least 25 times, and a maximum sequence length of 256. To examine this classification in more detail, we tried the following variants of the dataset:

1. Variant 1: Documents having label occurrence of at least 25 and a maximum sequence length of 256 (original implementation)

2. Variant 2: Documents having label occurrence of at least 25 and a maximum sequence length of 512

Variant 1: Documents having label occurrence at least 25 and maximum sequence length of 256

Figure 5.13: Fine-tuned Medical German BERT with maximum sequence length 256 and minimum label occurrence 25 for NTS-ICD10 document classification

Figure 5.13 shows the plot of the NTS-ICD10 classification for documents with a label occurrence of at least 25 and a maximum sequence length of 256, although the maximum sequence length supported by BERT is 512. Comparing this plot with the German BERT plot in Figure 5.12, we can see that the model starts learning one epoch earlier.

This model's plot is quite similar to Figure 5.12, but the point to note here is that, although the minimum loss was again reached at the 19th epoch, the F1-micro score on the dev set is 82.42%, which is higher than the score obtained with German BERT.

This model thus generalizes somewhat better and predicts the ICD-10 code missed by the German BERT model, but it still fails to predict the code for document 6183, as seen in Table 5.13. The reason the model does not capture the context well could be that it was only trained with a maximum sequence length of 256. Still, this was a good sign that fine-tuning helps the model generalize to the domain-specific corpus, so we trained further with a maximum sequence length of 512, which is the maximum supported by BERT.

Variant 2: Documents having label occurrence of at least 25 and maximum sequence length of 512

Figure 5.14: Fine-tuned Medical German BERT with maximum sequence length 512 and minimum label occurrence 25 for NTS-ICD10 document classification

Figure 5.14 shows the plot of training with the same dataset as in Section 5.12, but with a maximum sequence length of 512, in order to see the effect on model performance when training with longer sequences. As BERT constructs contextual representations from the input sequences, we expected that providing it with a longer sequence would let it capture more context and learn better.
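To illustrate what the maximum sequence length changes in practice, the following is a minimal sketch, assuming a Hugging Face tokenizer (the model name and placeholder document are illustrative): a longer max_length simply means less of an NTS document is truncated away.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-german-cased")
document = "Text eines langen NTS-Dokuments ..."   # placeholder NTS document

# Tokenize twice with different maximum sequence lengths
enc_256 = tokenizer(document, max_length=256, truncation=True, padding="max_length")
enc_512 = tokenizer(document, max_length=512, truncation=True, padding="max_length")

# 256 vs. 512 input ids; with max_length=512 fewer tokens of a long
# document are cut off, so more context reaches the model
print(len(enc_256["input_ids"]), len(enc_512["input_ids"]))
```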

However, compared to Figure 5.13, the learning in Figure 5.14 starts one epoch later. The minimum eval loss was reached at the 17th epoch, with an F1-micro score of 82.73% on the dev set. Comparing the results of Section 5.12 and Section 5.13, we see that providing more context improves the performance of the classification model, despite only selecting the documents with a minimum label occurrence of 25.

This model could capture all the ICD-10 codes related to document 6184, although some of them are counted as FP. These codes are nevertheless meaningful and related to the contents of the document available in Appendix 6.2. For document 6158, the score of the code E10-E14 also increased by 2% and the model rules out the code E65-E68, which is not related to the document context, as seen in Table 5.14.

5.5 Results

In this section, we review the results obtained with the fine-tuned German medical BERT model and compare them with the other BERT models for the experiments discussed above in Section 5.3 and Section 5.4.

5.5.1 Masking experiment results

In this section, we go through the sub-word and whole-word masking results of the baseline BERT models and compare them with the fine-tuned German medical BERT model.

SubWord masking experiment results

Here we present the results of sub-word token masking for both nts-sentences and nts-articles and compare the masked-token predictions made by all the BERT models. Table 5.1 shows the sub-word token masking results for the example mentioned in Section 5.3.1.

Table 5.1: Top-5 sub-word masked tokens prediction result for nts-sentences using German BERT model

##ose       Prob     ##oi     Prob     Zentralen      Prob
##otische   34.19    ##oi     99.60    zentralen      99.81
##ose       32.91    ##oko     0.12    inneren         0.04
##os         7.28    ##ody     0.07    roten           0.03
##ot         6.14    ##ore     0.05    menschlichen    0.03
##osit       1.46    ##oka     0.04    beteiligten     0.01

Table 5.1 shows the top-5 predictions for the sub-word masked tokens of the nts-sentences, with the header row giving the original tokens followed by the predictions. Out of the three tokens, ##oi and zentralen are already well recognized and predicted as the missing token. The token ##ose, however, only came in second place and with a lower probability score compared to the other tokens. This example shows a clear limitation of German BERT: it is not able to effectively recognize the context of the word Multiple Sklerose.

In Appendix 3, a sample prediction for sub-word masking on nts-articles can be found. The predicted tokens were collected in the same way as for the nts-sentences. For the example mentioned in Appendix 3, the top-5 result of the German BERT model is as follows:

Table 5.2: Top-5 sub-word masked tokens prediction result for nts-articles using German BERT model

##in                     Prob     ##fektion   Prob    ##ische       Prob
[unused_punctuation0]    16.11    ##2         9.55    ##ische      97.91
##typ                     5.28    ##3         8.67    ##ischen      1.46
[unused_punctuation6]     4.91    ##1         7.24    ##isch        0.28
[unused_punctuation2]     3.46    1           3.65    ##ifizierte   0.07
[unused_punctuation1]     2.73    3           3.60    ##ologische   0.05

Table 5.2 shows the top-5 predictions for the sub-word masked tokens of the nts-articles, with the header row giving the original tokens followed by the predictions. Out of the three tokens, only ##ische is well recognized and predicted as the missing token, while the remaining tokens are missed completely.

We can repeat what we did to generate Tables 5.1 and 5.2, but with the medical domain fine-tuned BERT instead of German BERT.

Table 5.3: Top-5 sub-word masked tokens prediction result for nts-sentences (fine-tuned German BERT with medical domain, Section 5.2.3)

##ose    Prob     ##oi     Prob        Zentralen      Prob
##ose    87.27    ##oi     99.91       zentralen      99.83
##itis    7.88    ##ody     0.04       menschlichen    0.05
##osis    1.96    ##oko     0.01       roten           0.02
##a       1.25    ##oka     0.000079   äußeren         0.02
##osa     0.46    ##ogra    0.000074   inneren         0.02

Table 5.3 shows the improvement of the medical domain fine-tuned BERT over German BERT. For the token ##ose specifically, there has been a huge increase in the probability score between the two models, and there is also a small improvement in the scores of the other two tokens. As the score, or the attention, is now highest for the word piece ##ose, we can say that the model was fine-tuned successfully.

Table 5.4: Top-5 sub-word masked tokens prediction result for nts-articles (fine-tuned German BERT with medical domain, Section 5.2.3)

##in     Prob     ##fektion       Prob     ##ische     Prob
##in     64.93    ##fektion       60.22    ##ische     91.83
##ein     1.58    ##fekt           5.81    ##ischen     7.47
-         1.45    ##ikation        3.55    ##isches     0.19
##ab      1.15    ##räglichkeit    1.96    ##ischer     0.17
##ins     0.95    ##kungen         1.64    ##isch       0.16

Similarly, for the nts-articles in Table 5.4 we can see a drastic improvement of the token predictions compared to Table 5.2. As with the nts-sentences, the representations were properly learned, resulting in an effective attention model. The results with the fine-tuned model clearly show that fine-tuning the model with domain-specific data improves the performance of the language model.

WholeWord masking experiment results

In this section, we compare the results of whole-word token masking for both nts-sentences and nts-articles across all BERT models, together with the masked-token predictions made by those models. The example of this experiment for nts-sentences is mentioned in Section 5.3.2.

Table 5.5: Top-10 whole-word masked tokens prediction result for nts-sentences using German BERT model

Sk         Prob    ##ler     Prob    ##ose        Prob
Hyper      4.07    ##ipl     4.41    ##itis      27.82
Neur       3.41    ##ap      1.45    ##ose       11.80
Mult       2.98    ##pat     1.18    ##ie         3.01
Bron       2.17    ##ler     1.17    ##rose       2.93
Leber      2.07    ##arth    1.13    ##heimer     2.68
Muskel     1.94    ##kul     1.05    ##enz        2.66
An         1.92    ##rel     1       ##störung    2.23
Diabetes   1.88    ##enz     0.9     ##mie        1.75
Ast        1.82    ##ond     0.89    ##a          1.57
Pro        1.81    ##io      0.87    ##ation      1.51

Table 5.5 shows a sample prediction for the tokens under whole-word masking. It is relatively difficult for the model to predict a full word rather than only a word piece, because the attention for a word piece is higher when it is preceded by its initial piece, which is what makes the sub-word tokens easier to predict. From the result in Table 5.5, we can see that the model cannot recognize the starting piece following the token multiple: the token Sk only appears at the 11th position with a probability of 1.72, while the remaining tokens are already within the top-10 positions but with low probability scores.

However, since nts-articles showed a better score in Figure 5.7, let us look at a sample prediction for whole-word masking on nts-articles, which is available in Appendix 4.

Table 5.6: Top-10 whole-word masked tokens prediction result for nts-articles using German BERT model

Virus                    Prob     ##in                    Prob     ##fektion   Prob
[unused_punctuation2]    11       [unused_punctuation0]   27.10    ##2         9.59
[unused_punctuation0]     3.86    [unused_punctuation6]    4.46    ##1         7.01
[unused_punctuation5]     3.61    Typ                      1.75    ##3         6.76
und                       3.42    ##P                      1.59    ##5         3.41
Virus                     2.75    Virus                    1.54    2           3.12
[unused_punctuation6]     2.32    [unused_punctuation1]    1.25    1           2.95
[unused_punctuation3]     1.96    Infektion                1.02    3           2.93
[unused_punctuation1]     1.95    ##I                      0.81    ##4         2.89
[unused_punctuation4]     1.39    [unused_punctuation2]    0.76    ##6         2.47
Typ                       1.14    [UNK]                    0.71    4           2.42

The result in Table 5.6 shows that although the scores here seem better than for the nts-sentences, the articles also contain far more sentences and hence more tokens and context to predict from. Still, out of the three tokens, only one was found, and with a low probability, so it is quite likely that fine-tuning this model with medical documents can improve the scores somewhat.

We can now verify the effect of the fine-tuned model on whole-word masking, as we did for sub-word masking in Tables 5.3 and 5.4. The model's performance improved by 46.15% on the top-10 predicted tokens and by 28.57% on the top-100 predicted tokens for nts-sentences. This is a good sign that fine-tuning a model on domain-specific data has improved the overall model performance.

We can now revisit the score tables, Tables 5.5 and 5.6, and check whether the fine-tuned model can now produce some correct tokens. Table 5.7 shows the performance of the fine-tuned model for whole-word masked tokens, and we can see that the fine-tuned model does learn the representation for the token Sk, where the German BERT model failed. It also shows that the attention to medical terms has gradually increased, although not by much.

Table 5.7: Top-10 whole-word masked tokens prediction result for nts-sentences (fine-tuned German BERT with medical domain, Section 5.2.3)

Sk         Prob    ##ler      Prob    ##ose        Prob
Neur       7.14    ##erkran   4.14    ##ose       24.93
My         4.57    ##ler      4.14    ##itis      11.48
Ast        4.53    ##arth     4.13    ##rose       9.66
An         3.39    ##kul      1.88    ##ie         8.26
Hyper      2.99    ##mat      1.67    ##enz        4.86
Sk         2.80    ##gener    1.31    ##störung    4.01
Chron      2.14    ##zir      1.26    ##sucht      2.82
Fi         1.87    ##entz     1.26    ##ase        2.29
Col        1.54    ##ond      1.16    ##ung        2.13
Diabetes   1.32    ##ph       1.14    ##ation      1.45

The predictions in Table 5.8 are not as helpful as those of Table 5.6, but one slight improvement is that the fine-tuned model did not predict any [unused_punctuation] token. It produced some word-piece tokens, but none of them was a meaningful token match. There was a 9.09% improvement for the token Virus with the fine-tuned model compared to the other.

Table 5.8: Top-10 whole-word masked tokens prediction result for nts-articles (fine-tuned German BERT with medical domain, Section 5.2.3)

Virus   Prob    ##in      Prob    ##fektion   Prob
Die     5.64    Typ       2.08    ##2         1.71
Virus   3       ##P       1.94    ##1         1.70
-       2.35    ##M       1.43    ##e         1.60
.       2.2     .         1.40    ##3         1.16
Diese   2.14    Virus     1.25    .           1.15
und     2.01    ##E       1.17    ##ie        0.96
Der     1.93    ##N       1.06    Re          0.87
Eine    1.69    ##krebs   0.86    ##o         0.82
Typ     1.68    Vir       0.77    Virus       0.76
H       1.57    und       0.77    ##a         0.76

The fine-tuned model did improve the token attention and masked prediction for sub-word masking, but the attention mechanism is still not as effective for whole-word masking.

5.5.2 NTS-ICD-10 classification results

In this section, we review the classification results of the baseline BERT models and compare them with the fine-tuned BERT model. Figure 3.1 shows the distribution of the dataset that we used for training, validation, and prediction, and Table 5.9 lists all the scores that we achieved for those splits using the baseline models. All the P, R, and F1 metrics are calculated as micro averages.
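As a small illustration of how micro-averaged scores are computed over the multi-hot label matrices, the following sketch uses scikit-learn; the toy matrices are illustrative only:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy multi-hot matrices (rows: documents, columns: ICD-10 labels)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1]])

# Micro averaging pools TP/FP/FN over all labels before computing P, R, F1
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="micro")
print(p, r, f1)   # 0.666..., 0.666..., 0.666...
```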

Baseline models classification results

Table 5.9 shows the scores achieved for the classification of NTS-ICD-10 documents using the baseline models. The classical ML algorithms in the score table only serve as baseline scores, while our focus is on the BERT models, in this case the Multilingual and German BERT. Comparing the two, German BERT appears more stable in the P and R scores on both the development and the test dataset.

Table 5.9: Baseline scores for NTS-ICD-10 document classification.

                       Dev                      Test
                       P       R       F1       P       R       F1
Linear SVC             0.812   0.704   0.754    0.847   0.638   0.728
Logistic Regression    0.722   0.761   0.741    0.742   0.716   0.728
Fasttext               73.04   32.95   45.42    66.58   31.01   42.31
Multi-lingual BERT     86.21   66.44   75.1     66.4    84.5    74.4
German BERT            86.04   75.82   80.60    85.3    75.8    80.3

Looking at the classification by the multilingual model in Section 5.4.4, all the actual ICD-10 codes are predicted correctly, but along with a series of other ICD-10 codes counted as FP. We will not go through the FP codes here; we later compare them with the fine-tuned BERT predictions in Section 5.5.2 and assess whether they are meaningful. German BERT, on the other hand, does not have the same amount of FP but instead introduces FN, and even fails to predict any ICD-10 code for document 6198.

Table 5.10: NTS-ICD-10 code predictions with BERT multilingual model

Doc    Actual             Multi-lingual
6198   (XVII, Q80-Q89)    (XVII, Q80-Q89, Q90-Q99)
6184   (VI, G35-G37)      (VI, G35-G37, I, A30-A49, XIII)
6158   (E65-E68, IV)      (E65-E68, IV, E10-E14)
6183   (VII, H15-H22)     (VII, H15-H22, H25-H28, H30-H36, H40-H42, H46-H48, H49-H52, H53-H54, H55-H59)

Table 5.10 shows the predicted ICD-10 codes: although the actual labels are contained in the predictions, there is also a high number of FP. Referring also to Table 5.9, the multilingual model shows high numbers of false positives and false negatives, which negatively impacts the F1-micro score. The multilingual model is trained on texts from many languages, as discussed in Section 5.4.4, which could be the reason why it does not generalize well to the German medical texts. We will not go into more detail with this model, as our focus is on the results of the German BERT model and the fine-tuned model.

Table 5.11: NTS-ICD-10 code predictions with German BERT model

Doc    Actual             German
6198   (XVII, Q80-Q89)    no-predictions-here
6184   (VI, G35-G37)      (VI 0.95, G35-G37 0.92)
6158   (E65-E68, IV)      (E10-E14 0.89, E65-E68 0.91, IV 0.99)
6183   (VII, H15-H22)     (XIX 0.51)

Table 5.11 lists the predictions made by German BERT for the same documents as in Table 5.10, together with the probability of occurrence of each ICD-10 code. Since German BERT is trained on Wikipedia, legal, and news corpora, which mostly do not contain medical terms, the model cannot generalize well to the NTS data. As a result, it produces both FP and FN, and in some cases it is not able to conclude any ICD-10 code at all. Document 6158 has an FP ICD-10 code, E10-E14, which is somewhat meaningful when looking at the text of document 6158; the text segments responsible for this classification are highlighted in bold in Appendix 6.3. Document 6158 is about the influence of obesity on mitochondrial dysfunction and relates to the diagnosis of diabetes mellitus, which belongs to the ICD-10 code E11.- and thus falls under the group E10-E14. As the group E65-E68 only covers obesity and other overeating, E10-E14 is also considered a reasonable ICD-10 group for this document.

The ICD-10 code VII is counted as FN here, with a probability of 0.46, which is very close to the threshold of 0.5. This model has almost caught the ICD-10 codes related to the documents; let us now find out whether the fine-tuned model really improves on this.
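As an illustration of the decision rule implied here, the following minimal sketch (with made-up logits) converts per-label logits into ICD-10 code predictions with a 0.5 probability threshold, so that a code with probability 0.46, as for chapter VII above, is dropped:

```python
import torch

# Illustrative logits for three ICD-10 codes of one document
logits = torch.tensor([2.3, -0.16, -1.5])
probs = torch.sigmoid(logits)                       # ≈ [0.91, 0.46, 0.18]

threshold = 0.5
predicted_indices = (probs >= threshold).nonzero(as_tuple=True)[0]
print(probs, predicted_indices)                     # only the first code is kept
```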

Fine-tuned model classification results

Table 5.12 shows the scores achieved for the classification of NTS-ICD-10 documents using the medical fine-tuned BERT models. Comparing the F1 score of the medical fine-tuned BERT with German BERT, there is an increase of 2.25%. This model improves slightly in predicting the code that German BERT missed for document 6198, but it also has an increased number of FP and FN. While we saw a slight improvement in the predictions, the model is still not able to generalize well to the texts, which we attributed to the maximum sequence length used during training. This model was trained with a maximum sequence length of 256, as seen in

8 Diabetes with obesity: https://www.dimdi.de/static/de/klassifikationen/icd/icd-10-gm/kode-suche/htmlgm2019/block-e10-e14.htm

Section 5.12, while the maximum sequence length supported by BERT is 512. We therefore trained the model again with a maximum sequence length of 512, as seen in Section 5.13, which yields an F1 score gain of 2.64% over German BERT and of 0.38% over the fine-tuned model with a maximum sequence length of 256 (Section 5.12).

Table 5.12: Medical Fine-tuned BERT scores for NTS-ICD-10 document classification.

                                          Dev                       Test
                                          P       R       F1        P       R       F1
Fine-tuned medical BERT (Section 5.12)    87.41   77.97   82.42     87.51   77.90   82.43
Fine-tuned medical BERT (Section 5.13)    87.75   78.26   82.73     86.65   77.13   81.62

This model has a high number of TP with fewer FP and FN compared to the previous models in Sections 5.4.4 and 5.11. It also has increased probability scores for the predicted ICD-10 codes, while the FP codes in documents 6184 and 6158 are also relevant, as discussed in Sections 5.12 and 5.13. Going through the documents and evaluating the terms, this model seems to generalize well to the medical terms and learns the hidden representations strongly when given more context.

As seen in Section 5.12, we trained the fine-tuned models in two different ways: one using a maximum sequence length of 256 and the other of 512. Here, we look at the results of both settings and check whether the maximum sequence length affects the performance of the model when predicting ICD-10 codes.

Table 5.13: NTS-ICD-10 predictions with medical domain fine-tuned BERT model (max-sequence-length: 256)

Doc    Actual             Fine-tuned-256
6198   (XVII, Q80-Q89)    (XVII 0.56)
6184   (VI, G35-G37)      (A30-A49 0.87, I 0.94)
6158   (E65-E68, IV)      (E10-E14 0.94, E65-E68 0.90, IV 0.99)
6183   (VII, H15-H22)     (XIX 0.73)

Table 5.13 shows that fine-tuning has improved the classification of NTS documents. Concerning document 6198, this model could predict the code XVII with a score of 0.56, which is a good sign that the model can at least point out the chapter code, and it also shows an increased probability score for the group E10-E14 compared to that in Table 5.11. For document 6184, however, we see a large number of FP here, where the German BERT model in Table 5.11 gave exactly the TP. Document 6184, as described in Appendix 6.2, mentions both a Multiple Sklerose and a Sepsis model. German BERT fails to generalize to sepsis from group A30-A49, while this model failed to generalize to Multiple Sklerose from group G35-G37.

Table 5.14: NTS-ICD-10 predictions with medical domain fine-tuned BERT model (max-sequence-length: 512)

Doc    Actual             Fine-tuned-512
6198   (XVII, Q80-Q89)    (XVII 0.77)
6184   (VI, G35-G37)      (VI 0.93, I 0.91, A30-A49 0.75)
6158   (E65-E68, IV)      (IV 0.99, E10-E14 0.98)
6183   (VII, H15-H22)     (VII 0.95)

Table 5.14 shows a better classification result compared to Table 5.13. For document 6198, the score for the code XVII increased from 0.56 to 0.77, and the model could additionally retrieve the chapter code VII for document 6183 with a score of 95%. This also shows that, given the whole sequence as input, the model can really capture the hidden representations well, as demonstrated by the results for documents 6184 and 6158.

9 Sepsis: https://www.dimdi.de/static/de/klassifikationen/icd/icd-10-gm/kode-suche/htmlgm2019/block-a30-a49.htm
10 Multiple Sklerose: https://www.dimdi.de/static/de/klassifikationen/icd/icd-10-gm/kode-suche/htmlgm2019/block-g35-g37.htm

Chapter 6

Conclusion and Future Work

6.1 Review

In this thesis, we presented an example of transferring the recent BERT model to the medical domain. Based on the related literature, we proceeded with an approach that does not require pre-training the whole model from scratch but instead initializes the weights of the model from an existing one; for this purpose, we chose German BERT (Chan et al. 2019). This approach is convenient for several reasons:

1. It is suitable when the volume of available textual corpora is not large enough to meet the requirements for pre-training an entirely new model.

2. It is suitable regarding computational time, which is much lower than the time required to pre-train a whole new model.

For transfer learning to work, it is better that the source domain and the target domain share a common base, which in our case is the understanding of the German language. German BERT was therefore the more suitable model than the Multilingual BERT model regarding the pre-training performance. The creation of a custom vocabulary should help a German Medical BERT further by weighting the additional embeddings more accurately: SciBERT, the pre-trained BERT model for scientific texts (Beltagy, Lo, and Cohan 2019), reports an additional average improvement of about 0.60 F1 across its datasets when using a custom in-domain vocabulary.

In our case, keeping the original vocabulary already worked better than the plain German BERT model once we fine-tuned with the additional medical articles described in Section 3.4. The classical baseline models like Linear SVC and Logistic Regression perform reasonably well, but below the BERT models. It is also surprising that FastText could not reach the scores of the SVC or the regression. LinearSVC even outperforms Multilingual BERT by a small margin, and German BERT generalizes the language better and exceeds its performance by 7.27% on the development set and 7.93% on the test set. This was also the reason for choosing the German BERT model instead of the Multilingual BERT model for fine-tuning.

The fine-tuned medical BERT model has proven better for both tasks: masked-token prediction and classification. It could not generalize well for masked-token prediction on articles, but it understood the context in sentences well enough to predict the masks. We masked the tokens in two ways, sub-word and whole-word. Sub-word masking performed better than whole-word masking, since it is easier to predict the second piece of a token given the first piece. It remained difficult for the model to recover all the token pieces of the whole-word masks, but it still performed better than the German BERT model. In the classification task, the fine-tuned German BERT model performed 2.64% better than the German BERT model. This is an improvement and an indication that fine-tuning the model with domain-specific data helps to obtain better results.

After performing the masking experiments in Section 5.3 and the NTS-ICD-10 document classification in Section 5.4, we can now answer the research questions mentioned in Section 1.3. The masking experiment performed with the fine-tuned model corresponds to the target task, while the classification task performed afterwards to train a new model is the downstream task. The research questions can now be answered as follows:

1. Our target domain consists of medical texts in the German language, so we fine-tuned German BERT on the German medical articles that we collected from the web; the statistics of the fine-tuning dataset can be found in Section 3.4. Our target task is to predict masked tokens, as the underlying model is trained with the MLM objective. After performing the masked-token experiments in Section 5.3 and comparing the results of the German BERT model with the fine-tuned medical German BERT model in Section 5.5.1, we can conclude that fine-tuning on domain-specific data improves the quality of the language model on the target domain: the fine-tuned model pays more attention to the medical terms and predicts the masked tokens with better scores than the German BERT model.

2. Having established that fine-tuning improved the performance on the target domain, it remains to check whether the same holds for the classification task. To this end, we performed the NTS-ICD-10 document classification using various baseline BERT models and the fine-tuned BERT model in Section 5.4, and compared the ICD-10 code predictions in Section 5.5.2. Comparing the results in Table 5.11, Table 5.13, and Table 5.14, we can see that although German BERT predicted some ICD-10 codes, it lacked generalization and missed codes for document 6198. This problem is handled by the fine-tuned model, where the code for document 6198 is predicted and its score even increases in the model trained with a sequence length of 512. Another example is the code extraction for document 6183, where the German BERT model and the model with a sequence length of 256 both fail. For the remaining two documents, 6184 and 6158, we also see a better generalization of the ICD-10 code predictions by the model with a sequence length of 512, as it selectively captured the context and provided more accurate ICD-10 code representations.

3. At this point, we have answered two of our research questions. The third question is related to the second: the second asks whether fine-tuning improves the performance on downstream tasks, while the third asks whether an improvement of the language model also implies an improvement on downstream tasks. From the evaluation of the first question, we know that fine-tuning improved the language model, as the prediction of masked tokens improved after fine-tuning on the target domain. Similarly, from the evaluation of the second question, we can infer that this fine-tuned, improved language model also improved the performance on the downstream task, which here was document classification. However, we cannot claim that the result would be the same for all types of downstream tasks without experiments. It may carry over to tasks like NER, QA, and others, but we only evaluated document classification, as this was within the scope of our thesis.

6.2 Discussion

As the German BERT model was trained on a recent German Wikipedia dump and an OpenLegalData dump (Chan et al. 2019), we expected it not to generalize well to medical data. However, the results of the German BERT model were already better than those of the classical baseline algorithms. This does not mean it was already handling medical terms with high attention; its masking performance was only slightly better than on Wikipedia texts. We therefore expected that fine-tuning this model with domain-specific (medical) data would direct more attention to the medical terms and hence improve the token masking performance. Our main goal here is to classify the NTS-ICD-10 documents containing the medical experiments.

MLM is also used as the fine-tuning objective, just as during pre-training. We found that fine-tuning did increase the attention to medical terms, and hence the masking score was higher, as the masked word was already predicted at an earlier position. This increased attention to medical terms led to a better classification score with a significant increase in precision and recall. We also found that when the model was trained with a sequence length of 512, it captured the context better than when trained with a sequence length of 256. This points to a weakness of BERT: the limited maximum sequence length. For many tasks 512 tokens are sufficient, but since the longer sequence length gave better results here, we suspect that without this limit the results might be even more convincing.

The fine-tuning process saves a considerable amount of time and yields acceptable results with a smaller model size and less computation. From a research perspective, however, if models keep growing, the situation becomes less democratic, as only expensive hardware will be able to run the experiments. Fine-tuning would produce even better results when provided with more data than is currently available, without being very resource-consuming. At the same time, AI applied to the medical domain is growing rapidly, with many groups experimenting with various NLP and language modeling procedures to achieve a breakthrough. The main limitation for working freely with medical data is the low availability of open medical data in the German language.

6.3 Future Work

As discussed above, data scarcity is a major obstacle to language modeling on domain-specific data. With more available data and less expensive resources, we could pre-train from scratch to produce a German Medical BERT model and compare it with the current one. For now, a possible improvement using language model fine-tuning would be to train further with more medical data and evaluate again. Extending the contextualized word representations with specific world knowledge about medical language would also be interesting follow-up research.

Another main direction for future work would be to filter the labels in the NTS-ICD-10 documents. Each document is assigned a collection of ICD-10 codes that contains multiple levels, e.g. II, C00-C97, C00-C75, C15-C26, where the main or lowest-level group is C15-C26 and the remaining codes are its successive parents. Referring to Figure 3.5, we can say that when a child group is present it is usually accompanied by its parent groups; the notion of child and parent groups can be revisited in Figure 3.2. Instead of training with all those code combinations, training only with the lowest-level ICD-10 group code present in each NTS-ICD-10 document would leave the model with fewer class labels to deal with, which should help it generalize better (see the sketch below).
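The following is a purely hypothetical sketch of such a filtering step; the parent mapping only covers the example codes mentioned above and would have to be built from the full ICD-10 hierarchy.

```python
# Hypothetical sketch: keep only codes that are not the parent of any other
# assigned code (the parent mapping here is illustrative and incomplete).
parent_of = {"C15-C26": "C00-C75", "C00-C75": "C00-C97", "C00-C97": "II"}

def lowest_level_codes(assigned):
    parents = set()
    for code in assigned:
        p = parent_of.get(code)
        while p is not None:           # collect all ancestors of each code
            parents.add(p)
            p = parent_of.get(p)
    return [c for c in assigned if c not in parents]

print(lowest_level_codes(["II", "C00-C97", "C00-C75", "C15-C26"]))  # ['C15-C26']
```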

This would be an ideal experiment to conduct in the future, where the hierarchical structure of ICD-10 could be handled explicitly.

Reference List

Adhikari, Ashutosh et al. (2019). “DocBERT: BERT for Document Classification”. In: cs.CL 3, pp. 1–4. arXiv: 1904.08398.

Alsentzer, Emily et al. (2019). “Publicly Available Clinical BERT Embeddings”. In: cs.CL 3, pp. 1–7. arXiv: 1904.03323.

Amin, Saadullah et al. (2019). “MLT-DFKI at CLEF eHealth 2019: Multi-label Classification of ICD-10 Codes with BERT”. In: CLEF eHealth 2019, pp. 1–8. URL: https://www.dfki.de/fileadmin/user_upload/import/10515_paper_67.pdf.

Asch, Vincent Van (2013). “Macro- and micro-averaged evaluation measures”. In: Semantic Scholar. URL: https://pdfs.semanticscholar.org/1d10/6a2730801b6210a67f7622e4d192bb309303.pdf.

Bahdanau, Dzmitry, KyungHyun Cho, and Yoshua Bengio (2016). “Neural Machine Translation by Jointly Learning to Align and Translate”. In: cs.CL 7, pp. 1–8. arXiv: 1409.0473.

Beltagy, Iz, Kyle Lo, and Arman Cohan (2019). “SciBERT: A Pretrained Language Model for Scientific Text”. In: cs.CL 3, pp. 1–6. arXiv: 1903.10676.

Bert, Bettina et al. (2017). “Rethinking 3R strategies: Digging deeper into AnimalTestInfo promotes transparency in in vivo biomedical research”. In: PLoS Biology 15(12): e2003217, pp. 1–8. DOI: 10.1371/journal.pbio.2003217. URL: https://doi.org/10.1371/journal.pbio.2003217.

Brants, Thorsten (2003). “Natural Language Processing in Information Retrieval”. In: citeseerx.ist.psu.edu, pp. 1–10. URL: http://www.cnts.ua.ac.be/clin2003/proc/03Brants.pdf.

Chan, Branden et al. (2019). “German BERT”. URL: https://deepset.ai/german-bert.

Chen, Dawn, Joshua C. Peterson, and Thomas L. Griffiths (2017). “Evaluating vector- space models of analogy”. In: cs.CL 1, pp. 1–3. arXiv: 1705.04416.

Cun, Yann le (1988). “A Theoretical Framework for Back-Propagation”. In: pp. 1–3. URL: http://yann.lecun.com/exdb/publis/pdf/lecun-88.pdf.

Dalal, Mita K. and Mukesh A. Zaveri (2011). “Automatic Text Classification: A Technical Review”. In: International Journal of Computer Applications 28, pp. 1–2. URL: https://www.researchgate.net/profile/Mukesh_Zaveri/publication/266296879_Automatic_Text_Classification_A_Technical_Review/links/54e74a0a0cf2b199060ae1c5/Automatic-Text-Classification-A-Technical-Review.pdf.

Devlin, Jacob et al. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: cs.CL 2, pp. 1–5. arXiv: 1810.04805.

Do, Chuong B. and Andrew Y. Ng (2006). “Transfer learning for text classification”. In: papers.nips.cc, pp. 1–6. URL: https://papers.nips.cc/paper/2843-transfer-learning-for-text-classification.pdf.

Fortuny, Enric Junque de, David Martens, and Foster Provost (2013). “Predictive Modeling with Big Data”. In: liebertpub.com, pp. 1–9. URL: https://www.liebertpub.com/doi/pdf/10.1089/big.2013.0037.

Geneva, World Health Organization (2019). “International Statistical Classification of Diseases and Related Health Problems (Tenth Revision)”. In: 2, pp. 1–13. URL: https://ec.europa.eu/cefdigital/wiki/display/EHSEMANTIC/WHO+ICD-10+The+International+Statistical+Classification+of+Diseases+and+Related+Health+Problems+10th+Revision.

Ghamrawi, Nadia and Andrew McCallum (2005). “Collective Multi-Label Classification”. In: Proceedings of the 14th ACM international conference on Information and knowledge management, pp. 1–2. URL: https://doi.org/10.1145/1099554.1099591.

Howard, Jeremy and Sebastian Ruder (2018). “Universal Language Model Fine-tuning for Text Classification”. In: cs.CL 5, pp. 1–9. arXiv: 1801.06146.

Jing, Kun and Jungang Xu (2019). “A Survey on Neural Network Language Models”. In: cs.CL 2, pp. 1–4. arXiv: 1906.03591.

Joulin, Armand et al. (2016). “Bag of Tricks for Efficient Text Classification”. In: cs.CL, pp. 1–4. arXiv: 1607.01759.

Jurafsky, Daniel and James H. Martin (2019). “Sequence Processing with Recurrent Networks”. In: pp. 2–12. URL: https://web.stanford.edu/~jurafsky/slp3/9.pdf.

K, Soymya George and Shibily Joseph (2014). “Text Classification by Augmenting Bag of Words (BOW) Representation with Co-occurrence Feature”. In: IOSR Journal of Computer Engineering (IOSR-JCE) 16, pp. 1–2. URL: https://www.researchgate.net/profile/Shibily_Joseph/publication/271294957_Text_Classification_by_Augmenting_Bag_of_Words_BOW_Representation_with_Co-occurrence_Feature/links/5a38f1560f7e9b7c48701084/Text-Classification-by-Augmenting-Bag-of-Words-BOW-Representation-with-Co-occurrence-Feature.pdf.

Lazreg, Mehdi Ben, Morten Goodwin, and Ole-Christoffer Granmo (2016). “Deep Learning for Social Media Analysis in Crises Situations”. In: 4. URL: https://ep.liu.se/ecp/129/004/ecp16129004.pdf.

Lee, Jinhyuk et al. (2019). “BioBERT: a pre-trained biomedical language representation model for biomedical text mining”. In: cs.CL 4, pp. 1–7. arXiv: 1901.08746.

Mandelbaum, Amit and Adi Shalev (2016). “Word Embeddings and Their Use In Sentence Classification Tasks”. In: cs.LG, pp. 1–4. arXiv: 1610.08229.

Mikolov, Tomas et al. (2013). “Distributed Representations of Words and Phrases and their Compositionality”. In: cs.CL 1, pp. 1–7. arXiv: 1310.4546.

missinglink.ai (n.d.). “The Complete Guide to Artificial Neural Networks: Concepts and Models”. In: https://missinglink.ai/guides/neural-network-concepts/complete-guide-artificial-neural-networks/ [Accessed 18.07.2020].

Murphy, Jo (2012). “Effective Non-Technical Summaries for Environmental Impact Assessment”. In: iema.net, pp. 1–2. URL: https://www.iema.net/download-document/17875.

Nasteski, Vladimir (Dec. 2017). “An overview of the supervised machine learning methods”. In: HORIZONS.B 4, pp. 4–8. DOI: 10.20544/HORIZONS.B.04.1.17.P05.

Nwankpa, Chigozie Enyinna et al. (2018). “Activation Functions: Comparison of Trends in Practice and Research for Deep Learning”. In: cs.LG 1. arXiv: 1811.03378.

Peters, Matthew E. et al. (2018). “Deep contextualized word representations”. In: cs.CL 2, pp. 1–8. arXiv: 1802.05365.

Pilz, Anja (2015). “Entity Linking to Wikipedia (Grounding entity mentions in natural language text using thematic context distance and collective search)”. In: pp. 56–59. URL: http://hss.ulb.uni-bonn.de/2016/4240/4240.pdf.

Pires, Telmo, Eva Schlinger, and Dan Garrette (2019). “How multilingual is Multilingual BERT?” In: cs.CL 1, pp. 1–6. arXiv: 1906.01502.

Popowich, Fred (2005). “Using Text Mining and Natural Language Processing for Health Care Claims Processing”. In: SIGKDD Explorations 7, pp. 1–2. URL: http://www.kdd.org/exploration_files/9-Popowich.pdf.

Ramos, Juan (2003). “Using TF-IDF to Determine Word Relevance in Document Queries”. In: Semantic Scholar, pp. 1–2. URL: https://pdfs.semanticscholar.org/b3bf/6373ff41a115197cb5b30e57830c16130c2c.pdf.

Read, Jesse et al. (2009). “Classifier Chains for Multi-label Classification”. In: ECML PKDD 5782, pp. 1–5. URL: https://link.springer.com/content/pdf/10.1007%2F978-3-642-04174-7_17.pdf.

Roesch, Jared et al. (2018). “Relay: A New IR for Machine Learning Frameworks”. In: cs.PL 1, pp. 1–8. arXiv: 1810.00952.

Rosenstein, Michael T., Zvika Marx, and Leslie Pack Kaelbling (2005). “To Transfer or Not To Transfer”. In: web.engr.oregonstate.edu, pp. 1–4. URL: http://web.engr.oregonstate.edu/~tgd/publications/rosenstein-marx-kaelbling-dietterich-hnb-nips2005-transfer-workshop.pdf.

Russell, W.M.S. and R.L. Burch (1959). “The Principles of Humane Experimental Technique”. URL: https://caat.jhsph.edu/principles/chap4d.

Sänger, Mario et al. (2019). “Classifying German Animal Experiment Summaries with Multi-lingual BERT at CLEF eHealth 2019 Task 1”. In: ResearchGate, pp. 2–8. URL: https://www.researchgate.net/profile/Ulf_Leser/publication/334989501_Classifying_German_Animal_Experiment_Summaries_with_Multi-lingual_BERT_at_CLEF_eHealth_2019_Task_1/links/5d49679392851cd046a5a5fb/Classifying-German-Animal-Experiment-Summaries-with-Multi-lingual-BERT-at-CLEF-eHealth-2019-Task-1.pdf.

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le (2014). “Sequence to Sequence Learning with Neural Networks”. In: cs.CL 3, pp. 1–6. arXiv: 1409.3215.

Tan, Chuanqi et al. (2018). “A Survey on Deep Transfer Learning”. In: cs.LG 1, pp. 1–2. arXiv: 1808.01974.

Vaswani, Ashish et al. (2017). “Attention Is All You Need”. In: cs.CL 5, pp. 1–9. arXiv: 1706.03762.

WHO, DIMDI (2019). “International statistical classification of diseases and related health problems (10th revision) German modification”. URL: https://www.dimdi.de/static/de/klassifikationen/icd/icd-10-gm/kode-suche/htmlgm2019/index.htm.

Yamashita, Rikiya et al. (2018). “Convolutional neural networks: an overview and application in radiology”. In: Springer, pp. 1–3. URL: https://link.springer.com/content/pdf/10.1007/s13244-018-0639-9.pdf.

Yang, Yimin (1998). “An Evaluation of Statistical Approaches to Text Categorization”. In: INRT Journal, pp. 6–9. URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.2485&rep=rep1&type=pdf.

Zhang, Yin, Rong Jin, and Zhi-Hua Zhou (2010). “Understanding Bag-of-Words Model: A Statistical Framework”. In: International Journal of Machine Learning and Cybernetics, pp. 1–2. URL: https://www.researchgate.net/profile/Rong_Jin/publication/226525014_Understanding_bag-of-words_model_A_statistical_framework/links/554b968f0cf29752ee7cc15b/Understanding-bag-of-words-model-A-statistical-framework.pdf.

Zhou, Peng et al. (2016). “Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling”. In: cs.CL 1, pp. 1–8. arXiv: 1611.06639.

Appendices

Tools and Environment

1 Software

Being already experienced with the Python programming language, we also found that most deep learning research and production work is conducted with it. For the machine learning framework, we chose PyTorch because of its simpler approach and faster learning curve. At the time of writing, PyTorch was already suitable for fine-tuning and offered various implementations for creating a new language model. It is also worth mentioning that, on top of PyTorch, we used Hugging Face to perform our fine-tuning experiments.

2 Hardware

Deep learning computations are usually heavy and require a lot of memory as well as time. To perform some evaluation experiments we used Google Colab GPUs and Kaggle GPU notebooks, and for further evaluations such as fine-tuning we used AWS instances by starting batch jobs on them. The remote AWS machine ran Ubuntu 16.04, while the local machine ran Ubuntu 18.10.

1 Python: https://www.python.org/
2 PyTorch: https://pytorch.org/
3 Hugging Face: https://huggingface.co/
4 Google Colab: https://colab.research.google.com

Learning Parameters

Hyperparameters are the set of parameters that are fixed before the learning process of a machine learning model starts. Parameters like the learning rate, the number of epochs, the train batch size, the eval batch size, and others can be adjusted at the beginning of training.

1 Fine-tuning parameters

BERT comes with specific suggestions for the fine-tuning parameters of text classification: a learning rate of 2e-5, 3e-5, or 4e-5, 2 to 4 epochs, and a batch size of 16, 32, or 64. We did not perform an automatic search but tested manually; the results were already better after 3 epochs and started to overfit at 4. We only tried a learning rate of 5e-5, as it was already better than 3e-5 during the MLM experiment. For the batch size, we used 8 with 4 gradient accumulation steps to simulate a batch size of 32, because of resource limitations.
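A minimal sketch of this gradient accumulation trick is shown below; the tiny linear model, dummy data, and loss stand in for the actual BERT classifier and dataloader:

```python
import torch
from torch import nn

# Micro-batches of size 8 with 4 accumulation steps approximate batch size 32
model = nn.Linear(10, 2)                       # stand-in for the BERT classifier
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss_fn = nn.BCEWithLogitsLoss()
accumulation_steps = 4

optimizer.zero_grad()
for step in range(8):                          # 8 dummy micro-batches of size 8
    x = torch.randn(8, 10)                     # dummy features
    y = torch.randint(0, 2, (8, 2)).float()    # dummy multi-hot labels
    loss = loss_fn(model(x), y)
    (loss / accumulation_steps).backward()     # scale so accumulated grads average
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                       # update once per 4 micro-batches
        optimizer.zero_grad()
```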

2 Training parameters

For the training parameters, we used a batch size of 16 with 4 gradient accumulation steps, a learning rate of 5e-5 (which had worked better during the MLM experiment), and 25 epochs, although the model performance stopped improving after 18 or 19 epochs depending on the model.

Experiments

In this section, we present the sample datasets that we used during the experiments and the evaluation of the model.

1 BoW and TF-IDF dataset sample

Taubenflug und Wetter Mit dem geplanten Versuch möchten wir den Einfluss ver- schiedener Wetterbedingungen (Windrichtung und ?geschwindigkeit, Temperatur, Bewölkung/ Nebel) auf das Flugverhalten und Heimfindevermögen von Brieftauben untersuchen. Die Tiere werden hierzu mit hochauflösenden Biologgern ausgerüstet, die sowohl GPS-und Beschleunigungsloggern, als auch Magnetometer, Drucksensor, Temperatursensor, Gyroskop und Windgeschwindigkeitsmesser (Pitot Röhrchen) in Mikro-Ausführung beinhalten. Hiervon erwarten wir uns grundlegend neue Erkennt- nisse zum Orientierungsverhalten von Vögeln, insbesondere von der Rolle visueller Landmarken. Wir erwarten keine Schäden für die Vögel. Die Tauben werden genau wie ""übliche"" Brieftauben eingesetzt, wobei die Heimkehrflüge deutlich unterhalb der Leistungsansprüche bleiben, die heute in Brieftaubenwettbewerben den Vögeln abverlangt werden. Einziger Unterschied zu den normalen Brieftauben ist die kleine gewichtsmäßige Zusatzbelastung durch die Messgeräte. Das Mehrgewicht liegt unterhalb 5% des Körpergewichtes der Taube und beeinträchtigt nach bisherigen Erfahrungen das Flugvermögen der Vögel nicht. Da die tauben bereits ab Schlupf die Handhabung durch eine Bezugsperson gewohnt sind, erleiden sie durch diesen Umgang auch keinen Stress. Die Beobachtungen des Orientierungs- und Flugverhaltens für diese Studie kann nur an ganzen und intakten Organismen, die sich in ihrer natürlichen bewegen, erfolgen. Die Studie erfolgt bereits an der Tierart, die nach unserem Wissen durch Züchtung am besten an derartige Versuchsabläufe angepasst ist. Je 50 Tauben über einen Zeitraum von 5 Jahren stellen die nach unserer Erfahrung mit Brieftaubenauflässen geringst mögliche Zahl an Individuen dar, aus der sich aussagekräftige Ergebnisse erwarten lassen. Die Tauben erfahren ein optimales Training zur Vorbereitung der Heimkehrflüge und werden in bestmöglicher Kondition gehalten. Zum Einsatz kommende Messgergäte unterliegen weiterhin einer fortschreitenden Miniaturisierung, die wir genau verfolgen und kleinere und leichtere Loggertypen als diejenigen, die gegenwärtig technisch relaisierbar sind, werden umgehend zur Anwendung kommen. 109

Einfluss von Wachstumsfaktoren auf inhibitorischen Nervenzellen Neuronale Netzwerke bilden die funktionalen Einheiten unseres Gehirns. Komplexe Vorgänge wie Lernen und Erinnern beruhen auf dem Zusammenspiel zahlreicher, zum Teil unterschiedlicher Netzwerke in verschiedenen Regionen des Gehirns.Diese Netzwerke bestehen aus anregenden und inhibitorischen Nervenzellen. Inhibitorische Nervenzellen synthetisieren den hemmenden Neurotransmitter ?-Aminobuttersäure (GABA), daher werden diese Nervenzellen auch als GABAerge Nervenzellen bezeichnet. Die Entwicklung der einzelnen Klassen GABAerger Zellen wird durch genetischen Programme bestimmt, die von der Entstehungsregion der Zellen im frühembryonalen Gehirn abhängen. Aber weitgehend ungeklärt ist die Frage, welchen Einfluss sogenannte Wachstumsfaktoren auf die Entwicklung dieser Nervenzellen haben. Daher wollen wir nur in inhibitorischen Nervenzellen die Signalübertragung für neuronale Wachstumsfaktoren ausschalten und untersuchen, welche Folgen dies für die Entwicklung inhibitorischer Zellen und ihrer neuronaler Netzwerke haben. Solche komplexen Interaktionen können nicht in einem in-vitro Zellkultursystem untersucht werden. Das übergeordnete Ziel dieser Arbeit ist es, Einblicke in die Entwicklung inhibiorischer Nervenzellen sowie die Entwicklung der Kommunikation dieser Zellen mit anderen Nervenzellen zu bekommen. Diese Ergebnisse werden dazu beitragen, die neuronale Kommunikation zu entschlüsseln, die erst komplexe Lern- und Gedächtnissvorgänge ermöglicht. Nur im Gehirn von Mäusen wird die Signalübertragung für einen neuronalen Wachstumsfaktor nur in inhibitorischen Nervenzellen ausgeschaltet. Die dadurch versursachte Störung in der Inhibition des Gehirns verursacht Epilepsie in jungen Mäusen. Daher werden die Tiere schmerzfrei getötet und die Gehirne zur experimentellen Untersuchung entnommen, bevor die Lebensfunktionen der Mäuse durch die Epilepsie zu stark beeinträchtigt werden. Es wird ein mittlerer Schweregrad des Versuchsvorhabens erwartet. Der Aufbau des Gehirns sowie die Vernetzung der unterschiedlichen Nervenzellen lässt sich nur äußerst bedingt in in-vitro Zellkultursystemen nachvollziehen. Zudem entsprechen Zellkulturen oft einem nicht definierten Entwicklungsstand der betreffenden Zellen. Um Aussagen über die Entwicklung von unterschiedlichen Nervenzellen und Netzwerken im Gehirn treffen zu können, müssen wir auch im Gehirn unsere Untersuchungen durchführen. Die geschätzte Anzahl der Tiere basiert auf unseren derzeitigen Erfahrungen auf diesem Gebiet der Neurowissenschaften.Wir werden von einem Biostatistiker bei der Durchführung der Untersuchungen beraten, um sicherzustellen, dass die minimale Anzahl von Tieren verwendet wird, um das gewünschte Ergebnis zu erzielen. Die Mäuse mit gestörter Signalübertragung von Wachstumsfaktoren verbleiben mit ihren Geschwistertieren bei der Mutter. Die Mäuse sind in einer geschützten Umgebung untergebracht und mit Einstreu und Nistmaterial versorgt und um eine artgerechte Haltung zu gewährleisten. Außerdem wird das Befinden der Tiere regelmäßig überprüft. Falls ein Tier eine Infektion erleidet oder ernsthaft erkrankt, wird es vorzeitig schmerzfrei getötet. 110

STORM Präsynapse In diesem Versuchsvorhaben soll die molekulare Nanostruktur der präsynaptischen Kompartimente dreier verschiedener Synapsentypen des zentralen Nervensystems mittels STORM super-resolution Mikroskopie beschrieben werden. Die genaue Kenntnis der relativen Anordnung verschiedener Proteine, Proteinkomplexe und funktioneller Einheiten innerhalb der Präsynapse ist essentiell für das Verständnis der Strukturen, die den unterschiedlichen funktionellen Eigenschaften der zu untersuchenden Synapsen zugrunde liegen. Die beschriebenen Experimente werden an fixierten Hirnschnitten durchgeführt, um einen möglichst physiologischen Kontext zu gewährleisten und die Übertragung der gewonnenen Erkenntnisse auf andere Synapsen, nicht zuletzt menschliche, zu ermöglichen. Wir erwarten von den generierten Ergebnissen einen tieferen Einblick in die molekulare Struktur der Präsynapse und damit der Mechanismen der synaptischen Transmission, was zu neuen Ansätzen zum Verständnis von Erkrankungen des zentralen Nervensystems führen kann. Es wird eine Injektion ins zentrale Nervensystem vorgenommen. Dazu ist ein operativer Eingriff unter Vollnarkose notwendig. Postoperative Schmerzen werden behandelt. Die hier untersuchten Synapsen, die Schaffer-Kollateralen, corticothalamische Riesensynapsen und die Held’sche Calyx, stellen ein breites funktionelles und morphologisches Spektrum zentraler Synapsen dar, so dass aus ihrer molekularen Nanostruktur in möglichst physiologischer Umgebung weitreichende Rückschlüsse auf die Funktionen von verschiedenen Synapsentypen möglich sind. Diese speziellen Synapsen entwickeln sich allerdings nur im lebenden Tier und nicht in neuronalen Zellkulturen, da diese lediglich aus embryonalem Gewebe gewonnen werden können, wobei die Umgebung der Synapsen zerstört wird, was zu einem Verlust von Signalen führt, die für die Identitätsbildung der Synapse und deren korrekte Reifung wichtig sind. Die Untersuchung an Säugersynapsen ist notwendig, um Rückschlüsse auf das menschliche Gehirn zu erlauben. Die Anzahl beantragter Tiere basiert auf unseren langjährigen Erfahrungen im Bereich der synaptischen Transmissionsforschung und der Mikroskopie und ist auf das Minimum zur Erlangung belastbarer Daten reduziert. Alle Eingriffe werden unter Vollnarkose durchgeführt und die Schmerzen, die den Tieren durch die Operation entstehen, werden durch adäquate Behandlung auf ein Minimum reduziert.

Wt1 und Flossenregeneration beim Zebrafisch Adulten Säugetieren, inklusive dem Menschen, fehlt die Fähigkeit, Schäden an komplexen Organen wie dem Herz, den Nieren oder dem zentralen Nervensystem reparieren zu können. Schäden an diesen Organen stellen somit große medizinische Probleme dar. Im Gegensatz zu Säugern haben andere Wirbeltierklassen, wie z.B. Fische, robuste Regenerationsfähigkeiten und können die Funktion vieler komplexer Organe nach Schädigung wiederherstellen, wie z.B. Herz, Niere und Flosse. Die Flossenregeneration stellt dabei ein einfaches experimentelles System dar, um Prinzipien und grundlegende Mechanismen der Regeneration zu verstehen. Das Gen wt1 wird während der Regeneration exprimiert, seine Funktion während der Regeneration ist hingegen noch nicht verstanden. Ähnlichkeit und evolutionäre Konservierung von genetischen Programmen macht eine Übertragbarkeit der Ergebnisse bezüglich wt1 und Regeneration von einer Spezies auf die andere, hier von Fisch auf Mensch, sehr wahrscheinlich. Es ist somit zu erwarten, dass die Aufklärung der Mechanismen der Regenerationsprozesse im Fisch eine hohe Relevanz für die Untersuchung und Entwicklung therapeutischer Maßnahmen bei Organschäden beim Menschen haben wird. Die Amputation der Flossenspitze an Fischlarven wird unter Anästhesie durchgeführt und die Tiere danach regelmäßig kontrolliert. Am Ende des Versuches werden die Tiere schmerzfrei getötet. Die zu erwartenden Schmerzen und Belastungen für die Tiere sind insgesamt mäßig einzuschätzen. Organregeneration ist ein komplexer Vorgang und die Untersuchung der beteiligten Prozesse erfordert Studien im Gewebekontext. Im Zellkultursystem werden Zellen aus ihrem räumlichen und zeitlichen Kontext isoliert und können somit nur Teile der hier zu analysierenden Aspekte aufzeigen. Eine Vermeidung des beantragten Tierversuches ist somit nicht möglich. Die Versuchstierzahlen sind auf ein Minimum beschränkt. Trotzdem sind sie ausreichend geplant, um wissenschaftlich signifikante Daten zu erhalten. Pro Tier und Teilversuch können mehrere Analysen durchgeführt werden. Die Gesamttierzahl wird somit möglichst gering gehalten. Es werden Flossenregenerationsexperimente an Fischlarven durchgeführt. Alle Eingriffe finden unter Narkose statt und das verwendete Anästhetikum hat analgetisches Potential. Die Tiere werden nach dem Eingriff regelmäßig kontrolliert. Unerwartetes Leiden wird dabei nach konkret benannten Abbruchkriterien bewertet und die Tiere gegebenenfalls schmerzfrei getötet, um ein Leiden zu verhindern.

Wirksamkeit von Substanzen gegen Kälberdurchfall In der Studie erfolgen Untersuchungen zur Behandlung von parasitär bedingtem Kälberdurchfall mit verschiedenen Substanzen hinsichtlich der Wirksamkeit gegen den ursächlichen Parasiten. Viele Milchviehbetriebe sind mit diesem Erreger durchseucht, wobei es zu hohen ökonomischen Verlusten und zur massiven Beeinträchtigung des Tierwohls kommen kann. Mit dieser Studie können die zu testenden Substanzen hinsichtlich ihrer Eignung als Tierarzneimittel beurteilt werden, was durch bisher eingeschränkte, kausale Behandlungsmöglichkeiten zu einer erheblichen Verbesserung der Situation in den betroffenen Betrieben führen würde. Um eine Wirksamkeit zu testen, muss bei allen Tieren Durchfall ausgelöst werden, der je nach tatsächlicher Wirksamkeit der Substanzen zu einer geringen bis mäßigen Belastung führt. Die Tiere der Behandlungsgruppe müssen aus Gründen der Lebensmittelsicherheit (nicht zugelassener Wirkstoff) nach der Studie getötet werden. Während der Studie treten nur geringe Belastungen auf. Die Dauer beträgt 21 Tage. Eine Wirksamkeit der zu testenden Substanzen wurde in vitro und im Nagermodell bereits nachgewiesen. Für den Parasiten stellt die Maus jedoch einen Fehlwirt dar, in dem er sich nur sehr schwer vermehren kann. Daher muss hier mit immunsupprimierten Tieren gearbeitet werden, was einen Vergleich mit dem Kalb aufgrund der stark abweichenden Physiologie und Immunologie unmöglich macht. Um eine Wirksamkeit im natürlichen Wirt des Parasiten zu zeigen und mit dem Hintergrund, auf Grundlage der zu testenden Moleküle ein Tierarzneimittel zu entwickeln, ist die Testung in der Zieltierart in vivo laut Arzneimittelgesetz essentiell. Die Versuchstiere sind neugeborene Kälber. Sie erkranken häufig an Pneumonien oder Nabelentzündung (typische Kälberkrankheiten), müssen dann behandelt und in Folge dessen gegebenenfalls aus der Studie ausgeschlossen werden. Aus unseren Erfahrungen mit Tierversuchen, die auf einem ähnlichen Studiendesign basierten, planen wir von vornherein 9 Tiere pro Gruppe ein, um Fehlschläge zu vermeiden, die eine unnötige Belastung der Versuchstiere und ggf. durch Wiederholung höhere Tierzahlen bedeuten würden. Es wird nur ein leichtes bis moderates Durchfallgeschehen induziert. Treten Dehydration und Nebenerkrankungen auf, werden diese umgehend behandelt. Das Kalb ist die Zieltierart für den Kälberdurchfall verursachenden Erreger. Mögliche folgende klinische Studien zur Zulassung eines Tierarzneimittels würden auf den Daten basieren, die durch diese Pilotstudie erhoben werden. Klinische Studien müssen immer in der Zieltierart durchgeführt werden.

2 Masking experiment sample dataset

2.1 NTS dataset

Sentence dataset

ER-Analysen bei neurodegenerativen Erkrankungen Bereits von uns hergestellte Mausmodelle für neurodegenerative Erkrankungen sollen mit einer Reporter-Maus verpaart werden, in der das ER mit einem fluoreszierenden Protein markiert ist. Mit diesem Ansatz soll überprüft werden, ob den Erkrankungen Veränderungen der ER-Struktur zugrunde liegen. Alle Experimente werden standardisiert durchgeführt, so dass die Gruppengröße für Experimente relativ klein ist. Dieselben Tiere werden soweit möglich mehrfach verwendet. Sollten aufgrund unvorhergesehener Befunde Änderungen im Versuchsablauf notwendig sein, werden wir eine entsprechende Änderung beantragen. Inhibition von Meprin in assoziierten Krankheiten Es wurde kürzlich gezeigt, dass Meprine unter den Proteasen eine wichtige Rolle im Organismus spielen. Die Multiple Sklerose (MS) ist eine autoimmune Erkrankung des zentralen Nervensystems.

Article dataset

Signaltransduktion bei der akuten Hepatitis B VirusinfektionDie chronische Hepatitis B Virusinfektion (HBV) ist ein wesentlicher Risikofaktor für die Entstehung einer Leberzirrhose und ihrer Komplikationen und oft schwer zu therapieren. Der Transkriptionsfaktor AP-1 und purinerge Rezeptoren spielen eine wesentliche Rolle bei der Signaltransduktion im Rahmen akuter Leberentzündungen und könnten wesentliche Funktionen vom Übergang einer akuten zur chronischen Hepatitis spielen und so neue Therapieansätze für Virushepatitiden eröffnen. Vor diesem Hintergrund sollen die Funktionen dieser Signaltransduktionswege in einem Mausmodell einer akuten HBV-Infektion genauer charakterisiert werden.Im Rahmen dieses Projektes kommen etablierte Mausmodelle zum Einsatz, die in Fachkreisen routinemäßig für Studien der akuten HBV Infektion verwendet werden: Isolation primärer Hepatozyten: geringe Belastung. Modulation der Genexpression durch Injektion von RNA in MxCre-transgenenMäusgeringe Beeinträchtigung;Akute Hepatitis durch Infektion mit adenoviralen Vektoren: geringe bis mäßige Belastung.Alle Tiere werden am Versuchsende zur Organentnahme getötet.Bei akuten Virushepatitiden handelt es sich um komplexe systemische Erkrankungen, die sich in Zellkulturmodellen nicht hinreichend abbilden lassen. Zudem stehen keine spezifischen und effizienten Methoden zur Verfügung, um die Expression von AP-1 sowie P2 Rezeptoren zu hemmen. Daher stellt die Verwendung entsprechender Knockoutmausmodellen ein alternativloses in vivo System dar, um die Funktionen dieser Signaltransduktionswege bei der akuten HBV Infektion zu charakterisieren.Die Zahl der zu verwendenden Tiere wurde auf das unerläßliche Maß reduziert. Durch ein zweistufiges Verfahren sollen zunächst in Pilotexperimenten die Funktionen von 5 Genen bei der akuten HBV Infektion überprüft und im 2. Teil nur die beiden Gene mit den eindeutigsten Phänotypen genauer charakterisiert werden.Es werden durch die vewendeten Versuchsprotokolle keine höhergradigen Beeinträchtigungen der Tiere erwartet. Andernfalls erfolgt in Rücksprache mit dem Tierschutzbeauftragten eine analgetische Therapie bzw. der Versuchsabbruch. Es werden etablierte Mausmodelle der akuten HBV-Infektion verwendet. Die Infektion von Mäusen mit adenoviralen Vektoren ist erforderlich, da sich die akute Infektion mit humanem HBV sonst nur in Schimpansen oder Tupaia belangeri untersuchen lässt. mtDNA und Immunzellen in Entzündungen der HautBlasenbildende Autoimmunerkrankungen der Haut sind schwerwiegende Erkrankungen, die unbehandelt tödlich verlaufen können. Auch wenn viele Fortschritte bei der Untersuchung der Pathogenese dieser Erkrankungen gemacht wurden, sind viele Einflussgrößen, welche den Verlauf dieser Erkrankungen modulieren weitgehend noch nicht untersucht oder unbekannt. In diesem Versuchsvorhaben sollen zwei bislang kaum untersuchte Immunzellpopulationen auf deren Funktion in einer blasenbildende Autoimmunerkrankung untersucht werden. In diesem Zusammenhang soll weiterhin eine genetischen Variation näher untersucht werden, zu welcher bereits bekannt ist, dass diese das metabolische Gleichgewicht verändert und den Verlauf der Erkrankung deutlich abmildert.
Dieses Versuchsvorhaben soll dazu beitragen ein besseres Verständnis über die Interaktionen von Immunzellen und Entzündungsreaktionen in dieser Erkrankung zu erlangen und zu verstehen, auf welche Weise genetische Variationen mit Wirkung auf den Metabolismus ebenfalls auf das Immunsystem einwirken.Die Erkenntnisse die in diesem Versuchsvorhaben gewonnen werden sollen, dienen der Entwicklung neuer therapeutischer Behandlungsmethoden und therapiebegleitenden Maßnahmen für blasenbildende Autoimmunerkrankungen, sowie als Grundansatz für andere Autoimmunerkrankungen.wiederholten Eingriffe (Applikation von Substanzen, klinische Untersuchung inNarkose) führen bei den Tieren zu einer insgesamt mäßigen Belastung, die übereinen Zeitraum von 12 Tagen andauert. Am Versuchsende erleiden die Tiere diemaximale Belastung, da sie getötet werden, um die weiterführende Untersuchungen an Haut und anderen Organen zu ermöglichen.Die komplexen Interaktionen, die zu einer kutanen Entzündung führen könnenleider nicht in vitro nachgestellt werden.Mithilfe von statistischen Planungsverfahren wurde die minimale Tieranzahlberechnet, die für diesen Versuch notwendig ist.Die Modelle der hier zu untersuchenden Hautentzündung wurden bisher nur inMäusen etabliert. In anderen Tieren gibt es keine Modelle.

2.2 Wikipedia dataset

Sentence dataset

Der Ort entstand vermutlich im Verlauf der Reconquista . Vor allem wenn man sieht , dass die anderen hinzugefügten Links zwar nichts mit der visuellen Kontrolle aber doch auch wieder mit Russland zu tun haben . Die notwendige Infrastruktur wurde dort 1831 – 1835 fertiggestellt . Papst Johannes Paul II . sprach am 3. Oktober 1998 in Marija Bistrica Kardinal Stepinac selig . Das Männlein dort auf einem Bein Ende 1923 verließ er England in Richtung Amerika . Zu dieser Zeit bestand Žalkovice aus 110 Häusern und hatte etwa 700 Einwohner . Siehe Auswärtiges Amt , diverse Atlanten ( auch Schulatlanten ) und der Duden .

Article dataset

Die Besteuerung der Mineralienexporte stellte die hauptsächliche Finanzierung des Krieges dar , obwohl zahlreiche Versuche , den Handel mit Bergbauprodukten zu monopolisieren , immer wieder fehlschlugen . Nach deren Abschluss befand sich die " Falke " vom 2. bis zum 20. Oktober für Probefahrten im Dienst . Da die FIFA die Spiele als Freundschaftsspiele einstufte , durfte jede Mannschaft pro Spiel sechs Spielerinnen auswechseln , in maximal vier Spielunterbrechungen , davon maximal drei 115

Spielunterbrechungen nach Beginn der zweiten Halbzeit . Bei den Herren gewann die Mannschaft 1995 , 2007 und 2011 die Bronze- , sowie 1999 und 2003 die Silbermedaille . Er war bis 1976 in Leuna verantwortlich für die technologische Entwicklung und Weit- erentwicklung des Hochdruckpolyethylenverfahrens , die ab 1969 in Zusammenarbeit mit der Sowjetunion zum Aufbau der 50 kt / a Anlage " Polymir 50 " in Nowopolozk in Weißrussland und der 60 kt / a Anlage " Poymir 60 " in Leuna führte . 1976 – 1980 war er Abteilungsleiter der Hochdruckpolyethylensynthese , 1981 – 1984 Forschungs- direktor und 1985 – 1990 Produktionsdirektor in den Leunawerken . Die Käfer werden 12 bis 16 Millimeter lang . Eckhardt war Mitbegründer und kultureller Leiter der österreichisch-amerikanischen Gesellschaft . Dass das römische Gräberfeld , auf dem St. Ursula errichtet wurde , bereits im 12. Jahrhundert auf der Suche nach Reliquien stark durchwühlt wurde , erschwerte die archäologischen Auswertungen der 1940er Jahre – exakte Klarheit über die Veränderungen an dem Bau lässt sich wohl nicht mehr erzielen . Labine ist seit 2008 mit der Schauspielerin Carrie Ruscheinsky verheiratet . Republiken zu- und gleichzeitig deren eigene Bewegungsfreiheit abnimmt . " Von beiden Dreiecken braucht nur noch das gemeinsame Dreieck formula_5 abgezogen zu werden . Als erster Premierminister gilt Robert Walpole ( 1721 – 1742 ) , derzeitige Amtsinhaberin ist Theresa May . Am 30. Januar 1944 wurde sie ins KZ Auschwitz-Birkenau deportiert , das sie sieben Tage später , am 6. Februar 1944 , erreichte . Henning wurde 1873 infolge des Baus einer Eisenbahnstation gegründet und zwei Jahre später amtlich anerkannt . Die Liste der Bahnhöfe und Haltepunkte in Sachsen führt alle Eisenbahnbetriebsstellen des öffentlichen Verkehrs auf , die auf dem Territorium des 1990 neu begründeten Freistaats Sachsen bestehen . Die „ Roten Raketen “ aus Berlin ( später „ Sturmtrupp Alarm “ ) beispielsweise tourten als Werbetruppe des RFB mit einem alten Auto durch das Land . Er ist ( was in der englischen WP-Ausgabe leider öfter als in der deutschen vorkommt ) voller Redundanzen und Geschwurbel . Die befindet sich im Besitz der neuseeländischen , die im Januar 2015 durch Umbenennung aus der " hervorgegangen ist . Die Chromosomenzahl ist 2n = 16. " On the distribution of prime numbers .

Irgendwann übernimmt es jemand in einen Zeitungsartikel oder gar in ein Buch , und schon haben wir sogar ( nachträglich ) einen Beleg für den Wikipdia-Beitrag . Endgültig verraten glaubt sich Christian , als sein einziger Freund Robert mit seiner Schwester Helga anbandelt . Als ihr klar wird , dass sie wieder gegangen sind , will sie ihren gegenüber Matthew angekündigten Selbstmord für den Fall , dass ihre Eltern die Geschwisterliebe bemerken , verwirklichen . Die Larven sind ebenfalls gelb und haben am ganzen Körper schwarze Fortsätze , aus denen Büschel mit schwarzen Haaren wachsen . Fahnenträger bei der Eröffnungsfeier war Nader Masri . Die Art wurde als " Himantura chaophraya " 1990 beschrieben , jedoch stellte sich im Jahr 2008 bei Vergleichen durch Last und Manjaji- Matsumoto heraus , dass die Art mit dem Typusmaterial von " H. polylepis " überein- stimmt . Der König ließ daraufhin die Kirche mit seinem Grab von einer bewaffneten 116

Wache absperren . 1323 kam es zu Unruhen am Ort seiner Hinrichtung , wo sich Men- schenmassen zum Gebet versammelten . Dort kam u. a. der damalige Verband des Son- derkommandos Elbe zum Einsatz . Im März 2011 wurde Michajlow im Exekutivkomitee der UEFA gewählt . Die Umkehrzeit ist ebenfalls sehr wichtig : Die Bergsteiger soll- ten unter allen Umständen bis zum Einbruch der Dunkelheit am Gipfeltag das Hochlager wieder erreicht haben . Alle Gebäude sind scheinbar nach Plänen des " Seigneur Con- stance " , wie Phaulkon von den Franzosen genannt wurde , in einem europäischen Stil erbaut , die Wohnhäuser sind zweistöckig ausgeführt . Doppelt vorhanden waren auch die Baugruppen Telemetriekontrolle , Signalverarbeitung und Transponder . Schon 1984 gab Stein seine Professur in Wien wieder auf , um auf Bitte von Landesbischof Klaus Engel- hardt das Amt des geschäftsleitenden Oberkirchenrates im Evangelischen Oberkirchenrat der Evangelischen Landeskirche in Baden zu übernehmen . Richard Parkinson ( * 1844 in Augustenborg auf der Insel Alsen , Dänemark ; † 1909 ) war ein deutscher Südseeforscher und Kolonist . Das den Behälter umschließende Mauerwerk ragt deutlich vor und ist in Fachwerkbauweise mit Eisenfachung ausgeführt . Die erste Version erschien 1963 , 1973 erfolgten die Unterschriften und am 1. Juli 1975 trat es in Kraft . Ein- oder zweimal im Jahr bringt das Weibchen nach rund 60-tägiger Tragzeit ein bis drei Jungtiere zur Welt . Man tappt weiterhin im Dunkel , sowohl über die Ursache als auch über die Therapie . Beim Strömungstauchen im Wildwasser können Sicherungsleinen über oder im Wasser sinnvoll sein . Dabei werden Details verändert .

3 Sub-Word Token prediction for NTS articles

Sentence: [CLS] Signal ##tra ##ns ##duktion bei der ak ##uten He ##pat ##itis B Virus ##in ##fektion ##Die chron ##ische He ##pat ##itis B Virus ##in ##fektion ( H ##BV ) ist ein wesentlicher Risiko ##faktor für die Entstehung einer Leber ##zir ##r ##hose und ihrer Kompl ##ikationen und oft schwer zu ther ##ap ##ieren [SEP] Der Trans ##k ##ript ##ions ##faktor AP - 1 und pur ##iner ##ge Rezept ##oren spielen eine wesentliche Rolle bei der Signal ##tra ##ns ##duktion im Rahmen ak ##uter Leber ##entz ##ündung ##en und könnten wesentliche Funktionen vom Übergang einer ak ##uten zur chron ##ischen He ##pat ##itis spielen und so neue Therapie ##ans ##ätze für Virus ##he ##pat ##iti ##den eröffnen [SEP] Vor diesem Hinter- grund sollen die Funktionen dieser Signal ##tra ##ns ##duktion ##s ##wege in einem Maus ##modell einer ak ##uten H ##BV - Infektion genauer charakter ##isiert wer- den [SEP] Im Rahmen dieses Projekte ##s kommen etablierte Maus ##modelle zum Einsatz , die in Fach ##kreisen ro ##utin ##em ##äß ##ig für Studien der ak ##uten H ##BV Infektion verwendet werden : Isol ##ation primär ##er He ##pat ##ozy ##ten : geringe Belastung [SEP] Mod ##ulation der Gene ##x ##pression durch In ##jekt ##ion von R ##N ##A in M ##x ##C ##re - trans ##gen ##en ##M ##äus ##ger ##inge 117

Beeinträcht ##igung ; Ak ##ute He ##pat ##itis durch Infektion mit ad ##en ##ovi ##ralen Ve ##ktoren : geringe bis mäß ##ige Belastung [SEP] Alle Tiere werden am Versuchs ##ende zur Organe ##nt ##nahme getötet [SEP] Bei ak ##uten Virus ##he ##pat ##iti ##den handelt es sich um komplexe system ##ische Erkrankungen , die sich in Zell ##kultur ##modelle ##n nicht hinreichend ab ##bilden lassen [SEP] Zu- dem stehen keine spezifischen und effiz ##ienten Methoden zur Verfügung , um die Express ##ion von AP - 1 sowie P ##2 Rezept ##oren zu he ##mm ##en [SEP] Da- her stellt die Verwendung entsprechender Kn ##ock ##out ##maus ##modelle ##n ein altern ##ativ ##lose ##s in v ##iv ##o System dar , um die Funktionen dieser Signal ##tra ##ns ##duktion ##s ##wege bei der ak ##uten H ##BV Infektion zu charakter ##isieren [SEP] Die Zahl der zu verwenden ##den Tiere wurde auf das uner ##lä ##ß ##liche Maß reduziert [SEP] Durch ein zwei ##stuf ##iges Verfahren sollen zunächst in Pilot ##ex ##per ##imente ##n die Funktionen von 5 Gene ##n bei der ak ##uten H ##BV Infektion überprüft und im 2 [SEP] Teil nur die beiden Gene mit den eindeutig ##sten Ph ##än ##otyp ##en genauer charakter ##isiert wer- den [SEP] Es werden durch die ve ##wendet ##en Versuchs ##protokoll ##e keine höher ##grad ##igen Beeinträcht ##igungen der Tiere erwartet [SEP] Andern ##falls erfolgt in Rück ##sprache mit dem Tierschutz ##be ##auftragten eine anal ##get ##is- che Therapie bzw [SEP] der Versuchs ##ab ##bruch [SEP] Es werden etablierte Maus ##modelle der ak ##uten H ##BV - Infektion verwendet [SEP] Die Infektion von Mä ##usen mit ad ##en ##ovi ##ralen Ve ##ktoren ist erforderlich , da sich die ak ##ute Infektion mit human ##em H ##BV sonst nur in Sch ##imp Masked Sentence: [CLS] Signal ##tra ##ns ##duktion bei der ak ##uten He ##pat ##itis B Virus [MASK] [MASK] ##Die chron [MASK] He ##pat ##itis B Virus ##in ##fektion ( H ##BV ) ist ein wesentlicher Risiko ##faktor für die Entstehung einer Leber ##zir ##r ##hose und ihrer Kompl ##ikationen und oft schwer zu ther ##ap ##ieren [SEP] Der Trans ##k ##ript ##ions ##faktor AP - 1 und pur ##iner ##ge Rezept ##oren spielen eine wesentliche Rolle bei der Signal ##tra ##ns ##duktion im Rahmen ak ##uter Leber ##entz ##ündung ##en und könnten wesentliche Funk- tionen vom Übergang einer ak ##uten zur chron ##ischen He ##pat ##itis spielen und so neue Therapie ##ans ##ätze für Virus ##he ##pat ##iti ##den eröffnen [SEP] Vor diesem Hintergrund sollen die Funktionen dieser Signal ##tra ##ns ##duktion ##s ##wege in einem Maus ##modell einer ak ##uten H ##BV - Infektion genauer charakter ##isiert werden [SEP] Im Rahmen dieses Projekte ##s kommen etablierte Maus ##modelle zum Einsatz , die in Fach ##kreisen ro ##utin ##em ##äß ##ig für Studien der ak ##uten H ##BV Infektion verwendet werden : Isol ##ation primär ##er He ##pat ##ozy ##ten : geringe Belastung [SEP] Mod ##ulation der Gene ##x ##pression durch In ##jekt ##ion von R ##N ##A in M ##x ##C ##re - trans ##gen ##en ##M ##äus ##ger ##inge Beeinträcht ##igung ; Ak ##ute He ##pat ##itis durch Infektion mit ad ##en ##ovi ##ralen Ve ##ktoren : geringe bis mäß ##ige Belas- 118

tung [SEP] Alle Tiere werden am Versuchs ##ende zur Organe ##nt ##nahme getötet [SEP] Bei ak ##uten Virus ##he ##pat ##iti ##den handelt es sich um komplexe sys- tem ##ische Erkrankungen , die sich in Zell ##kultur ##modelle ##n nicht hinreichend ab ##bilden lassen [SEP] Zudem stehen keine spezifischen und effiz ##ienten Metho- den zur Verfügung , um die Express ##ion von AP - 1 sowie P ##2 Rezept ##oren zu he ##mm ##en [SEP] Daher stellt die Verwendung entsprechender Kn ##ock ##out ##maus ##modelle ##n ein altern ##ativ ##lose ##s in v ##iv ##o System dar , um die Funktionen dieser Signal ##tra ##ns ##duktion ##s ##wege bei der ak ##uten H ##BV Infektion zu charakter ##isieren [SEP] Die Zahl der zu verwenden ##den Tiere wurde auf das uner ##lä ##ß ##liche Maß reduziert [SEP] Durch ein zwei ##stuf ##iges Verfahren sollen zunächst in Pilot ##ex ##per ##imente ##n die Funktionen von 5 Gene ##n bei der ak ##uten H ##BV Infektion überprüft und im 2 [SEP] Teil nur die beiden Gene mit den eindeutig ##sten Ph ##än ##otyp ##en genauer charakter ##isiert werden [SEP] Es werden durch die ve ##wendet ##en Versuchs ##protokoll ##e keine höher ##grad ##igen Beeinträcht ##igungen der Tiere erwartet [SEP] An- dern ##falls erfolgt in Rück ##sprache mit dem Tierschutz ##be ##auftragten eine anal ##get ##ische Therapie bzw [SEP] der Versuchs ##ab ##bruch [SEP] Es werden etablierte Maus ##modelle der ak ##uten H ##BV - Infektion verwendet [SEP] Die Infektion von Mä ##usen mit ad ##en ##ovi ##ralen Ve ##ktoren ist erforderlich , da sich die ak ##ute Infektion mit human ##em H ##BV sonst nur in Sch ##imp Masked tokens: [’##in’, ’##fektion’, ’##ische’] Predicted tokens: [’[unused_punctuation0]’, ’##2’, ’##ische’]
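The sub-word masking shown above can be reproduced with the Hugging Face transformers library. The following is a minimal sketch rather than the exact evaluation script used in this thesis; the checkpoint name bert-base-german-cased, the example sentence, and the masked positions are assumptions chosen for illustration, whereas the experiments in this appendix use the fine-tuned German BERT model.

```python
# Minimal sketch of sub-word masking: individual WordPiece tokens are replaced by
# [MASK] and predicted back. Checkpoint name and masked positions are assumptions.
import torch
from transformers import BertTokenizer, BertForMaskedLM

model_name = "bert-base-german-cased"  # assumed; the thesis uses its fine-tuned model
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
model.eval()

text = ("Die chronische Hepatitis B Virusinfektion (HBV) ist ein wesentlicher "
        "Risikofaktor für die Entstehung einer Leberzirrhose.")

encoded = tokenizer(text, return_tensors="pt")
input_ids = encoded["input_ids"].clone()

# Mask two arbitrary sub-word positions (independently chosen, as in SubWord masking).
mask_positions = [5, 6]
masked_tokens = tokenizer.convert_ids_to_tokens(input_ids[0, mask_positions].tolist())
input_ids[0, mask_positions] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(input_ids=input_ids,
                   attention_mask=encoded["attention_mask"]).logits

predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print("Masked tokens:   ", masked_tokens)
print("Predicted tokens:", tokenizer.convert_ids_to_tokens(predicted_ids.tolist()))
```

Printing the masked and the predicted tokens side by side yields output in the same format as the "Masked tokens" / "Predicted tokens" lines above.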

4 Whole Word Token prediction for NTS articles

Sentence: [CLS] Signal ##tra ##ns ##duktion bei der ak ##uten He ##pat ##itis B Virus ##in ##fektion ##Die chron ##ische He ##pat ##itis B Virus ##in ##fektion ( H ##BV ) ist ein wesentlicher Risiko ##faktor für die Entstehung einer Leber ##zir ##r ##hose und ihrer Kompl ##ikationen und oft schwer zu ther ##ap ##ieren [SEP] Der Trans ##k ##ript ##ions ##faktor AP - 1 und pur ##iner ##ge Rezept ##oren spielen eine wesentliche Rolle bei der Signal ##tra ##ns ##duktion im Rahmen ak ##uter Leber ##entz ##ündung ##en und könnten wesentliche Funktionen vom Übergang einer ak ##uten zur chron ##ischen He ##pat ##itis spielen und so neue Therapie ##ans ##ätze für Virus ##he ##pat ##iti ##den eröffnen [SEP] Vor diesem Hinter- grund sollen die Funktionen dieser Signal ##tra ##ns ##duktion ##s ##wege in einem Maus ##modell einer ak ##uten H ##BV - Infektion genauer charakter ##isiert wer- den [SEP] Im Rahmen dieses Projekte ##s kommen etablierte Maus ##modelle zum Einsatz , die in Fach ##kreisen ro ##utin ##em ##äß ##ig für Studien der ak ##uten H ##BV Infektion verwendet werden : Isol ##ation primär ##er He ##pat ##ozy ##ten 119

: geringe Belastung [SEP] Mod ##ulation der Gene ##x ##pression durch In ##jekt ##ion von R ##N ##A in M ##x ##C ##re - trans ##gen ##en ##M ##äus ##ger ##inge Beeinträcht ##igung ; Ak ##ute He ##pat ##itis durch Infektion mit ad ##en ##ovi ##ralen Ve ##ktoren : geringe bis mäß ##ige Belastung [SEP] Alle Tiere werden am Versuchs ##ende zur Organe ##nt ##nahme getötet [SEP] Bei ak ##uten Virus ##he ##pat ##iti ##den handelt es sich um komplexe system ##ische Erkrankungen , die sich in Zell ##kultur ##modelle ##n nicht hinreichend ab ##bilden lassen [SEP] Zu- dem stehen keine spezifischen und effiz ##ienten Methoden zur Verfügung , um die Express ##ion von AP - 1 sowie P ##2 Rezept ##oren zu he ##mm ##en [SEP] Da- her stellt die Verwendung entsprechender Kn ##ock ##out ##maus ##modelle ##n ein altern ##ativ ##lose ##s in v ##iv ##o System dar , um die Funktionen dieser Signal ##tra ##ns ##duktion ##s ##wege bei der ak ##uten H ##BV Infektion zu charakter ##isieren [SEP] Die Zahl der zu verwenden ##den Tiere wurde auf das uner ##lä ##ß ##liche Maß reduziert [SEP] Durch ein zwei ##stuf ##iges Verfahren sollen zunächst in Pilot ##ex ##per ##imente ##n die Funktionen von 5 Gene ##n bei der ak ##uten H ##BV Infektion überprüft und im 2 [SEP] Teil nur die beiden Gene mit den eindeutig ##sten Ph ##än ##otyp ##en genauer charakter ##isiert wer- den [SEP] Es werden durch die ve ##wendet ##en Versuchs ##protokoll ##e keine höher ##grad ##igen Beeinträcht ##igungen der Tiere erwartet [SEP] Andern ##falls erfolgt in Rück ##sprache mit dem Tierschutz ##be ##auftragten eine anal ##get ##is- che Therapie bzw [SEP] der Versuchs ##ab ##bruch [SEP] Es werden etablierte Maus ##modelle der ak ##uten H ##BV - Infektion verwendet [SEP] Die Infektion von Mä ##usen mit ad ##en ##ovi ##ralen Ve ##ktoren ist erforderlich , da sich die ak ##ute Infektion mit human ##em H ##BV sonst nur in Sch ##imp Masked Sentence: [CLS] Signal ##tra ##ns ##duktion bei der ak ##uten He ##pat ##itis B [MASK] [MASK] [MASK] ##Die chron ##ische He ##pat ##itis B Virus ##in ##fektion ( H ##BV ) ist ein wesentlicher Risiko ##faktor für die Entstehung einer Leber ##zir ##r ##hose und ihrer Kompl ##ikationen und oft schwer zu ther ##ap ##ieren [SEP] Der Trans ##k ##ript ##ions ##faktor AP - 1 und pur ##iner ##ge Rezept ##oren spielen eine wesentliche Rolle bei der Signal ##tra ##ns ##duktion im Rahmen ak ##uter Leber ##entz ##ündung ##en und könnten wesentliche Funk- tionen vom Übergang einer ak ##uten zur chron ##ischen He ##pat ##itis spielen und so neue Therapie ##ans ##ätze für Virus ##he ##pat ##iti ##den eröffnen [SEP] Vor diesem Hintergrund sollen die Funktionen dieser Signal ##tra ##ns ##duktion ##s ##wege in einem Maus ##modell einer ak ##uten H ##BV - Infektion genauer charakter ##isiert werden [SEP] Im Rahmen dieses Projekte ##s kommen etablierte Maus ##modelle zum Einsatz , die in Fach ##kreisen ro ##utin ##em ##äß ##ig für Studien der ak ##uten H ##BV Infektion verwendet werden : Isol ##ation primär ##er He ##pat ##ozy ##ten : geringe Belastung [SEP] Mod ##ulation der Gene ##x ##pression durch In ##jekt ##ion von R ##N ##A in M ##x ##C ##re - trans ##gen 120

##en ##M ##äus ##ger ##inge Beeinträcht ##igung ; Ak ##ute He ##pat ##itis durch Infektion mit ad ##en ##ovi ##ralen Ve ##ktoren : geringe bis mäß ##ige Belas- tung [SEP] Alle Tiere werden am Versuchs ##ende zur Organe ##nt ##nahme getötet [SEP] Bei ak ##uten Virus ##he ##pat ##iti ##den handelt es sich um komplexe sys- tem ##ische Erkrankungen , die sich in Zell ##kultur ##modelle ##n nicht hinreichend ab ##bilden lassen [SEP] Zudem stehen keine spezifischen und effiz ##ienten Metho- den zur Verfügung , um die Express ##ion von AP - 1 sowie P ##2 Rezept ##oren zu he ##mm ##en [SEP] Daher stellt die Verwendung entsprechender Kn ##ock ##out ##maus ##modelle ##n ein altern ##ativ ##lose ##s in v ##iv ##o System dar , um die Funktionen dieser Signal ##tra ##ns ##duktion ##s ##wege bei der ak ##uten H ##BV Infektion zu charakter ##isieren [SEP] Die Zahl der zu verwenden ##den Tiere wurde auf das uner ##lä ##ß ##liche Maß reduziert [SEP] Durch ein zwei ##stuf ##iges Verfahren sollen zunächst in Pilot ##ex ##per ##imente ##n die Funktionen von 5 Gene ##n bei der ak ##uten H ##BV Infektion überprüft und im 2 [SEP] Teil nur die beiden Gene mit den eindeutig ##sten Ph ##än ##otyp ##en genauer charakter ##isiert werden [SEP] Es werden durch die ve ##wendet ##en Versuchs ##protokoll ##e keine höher ##grad ##igen Beeinträcht ##igungen der Tiere erwartet [SEP] An- dern ##falls erfolgt in Rück ##sprache mit dem Tierschutz ##be ##auftragten eine anal ##get ##ische Therapie bzw [SEP] der Versuchs ##ab ##bruch [SEP] Es werden etablierte Maus ##modelle der ak ##uten H ##BV - Infektion verwendet [SEP] Die Infektion von Mä ##usen mit ad ##en ##ovi ##ralen Ve ##ktoren ist erforderlich , da sich die ak ##ute Infektion mit human ##em H ##BV sonst nur in Sch ##imp Masked tokens: [’Virus’, ’##in’, ’##fektion’] Predicted tokens: [’[unused_punctuation0]’, ’##2’, ’##ische’]
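Whole-word masking differs from the sub-word variant only in how the mask positions are chosen: all WordPiece pieces of one word, i.e. the head token plus every following "##" continuation, are masked together, as with "Virus ##in ##fektion" above. A minimal sketch of this grouping step follows; the helper function and the example string are illustrative assumptions, not the thesis code.

```python
# Minimal sketch of whole-word mask selection: group WordPiece tokens into words
# (a word = one head token plus its "##" continuations) and mask one whole group.
import random
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-german-cased")  # assumed checkpoint

def mask_whole_word(tokens, rng):
    """Return (masked_tokens, original_pieces) with one whole word masked."""
    words, current = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and current:
            current.append(i)          # continuation piece of the current word
        else:
            if current:
                words.append(current)
            current = [i]              # start of a new word
    if current:
        words.append(current)

    chosen = rng.choice(words)         # indices of every piece of one word
    masked = list(tokens)
    for i in chosen:
        masked[i] = tokenizer.mask_token
    return masked, [tokens[i] for i in chosen]

tokens = tokenizer.tokenize("Hepatitis B Virusinfektion")
masked, originals = mask_whole_word(tokens, random.Random(0))
print(masked)     # the pieces of one word are jointly replaced by [MASK]
print(originals)  # e.g. ['Virus', '##in', '##fektion'] if that word is drawn
```

The masked sequence can then be fed to the same masked-language-model prediction step as in the sub-word sketch above.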

5 Sanity check on NTS token predictions

5.1 NTS-random words sentences

Signaltransduktion bei der akuten Hepatitis B VirusÄnderung Die chronische Infektion mit dem B-Virus (HBV) ist ein wichtiger Risikofaktor für die Verbesserung der Leberzirrhose und ihre Besserung und ist oft gut zu behandeln. Der Transkriptionsfaktor AP-1 und purinerge Faktoren spielen eine wesentliche Rolle bei der Signaltransduktion im Rahmen der akuten Leberkühlung und könnten wesentliche Funktionen beim Übergang vom akuten zum chronischen Fieber spielen und damit einen neuen therapeutischen Versuch für die Virushepatitis eröffnen. Im Gegensatz zu dieser Geschichte werden die Dinge dieser Signalflusswege in einem Mausmodell einer falschen HBV-Infektion anders aussehen. In diesem Projekt werden wütende Mausmodelle verwendet, die in Fachkreisen immer wieder für das Spiel der akuten HBV-Infektion eingesetzt werden: Gruppe von primären Hepatozyten: geringe Belastung.

Modulation des Genverhaltens durch Injektion von Saft in transgene MxCre-Mäuse Beeinträchtigung;Akute Hepatitis durch Überdosierung mit adenoviralen Vektoren: geringe bis mittlere Exposition. Alle Tiere werden am Ende des Spiels zur Organentnahme versprochen. Die akute Virushepatitis ist ein einfaches systemisches Spiel, das in Zellkulturmodellen nicht ausreichend p sein kann. Darüber hinaus sind keine spezifischen und wirksamen Medikamente zur Hemmung der Expression von AP-1- und P2-Rezeptoren auf Lager. Daher bietet die Verwendung von starken Knockout-Mausmodellen ein festes in vivo- System, um die Funktionen dieser Signaltransduktionswege bei einer akuten HBV- Infektion zu aktivieren.

5.2 NTS-original sentences

Signaltransduktion bei der akuten Hepatitis B Virusinfektion Die chronische Hepatitis B Virusinfektion (HBV) ist ein wesentlicher Risikofaktor für die Entstehung einer Leberzirrhose und ihrer Komplikationen und oft schwer zu therapieren. Der Transkriptionsfaktor AP-1 und purinerge Rezeptoren spielen eine wesentliche Rolle bei der Signaltransduktion im Rahmen akuter Leberentzündungen und könnten wesentliche Funktionen vom Übergang einer akuten zur chronischen Hepatitis spielen und so neue Therapieansätze für Virushepatitiden eröffnen. Vor diesem Hintergrund sollen die Funktionen dieser Signaltransduktionswege in einem Mausmodell einer akuten HBV-Infektion genauer charakterisiert werden. Im Rahmen dieses Projektes kommen etablierte Mausmodelle zum Einsatz, die in Fachkreisen routinemäßig für Studien der akuten HBV Infektion verwendet werden: Isolation primärer Hepatozyten: geringe Belastung. Modulation der Genexpression durch Injektion von RNA in MxCre-transgenenMäusgeringe Beeinträchtigung;Akute Hepatitis durch Infektion mit adenoviralen Vektoren: geringe bis mäßige Belastung. Alle Tiere werden am Versuchsende zur Organentnahme getötet. Bei akuten Virushepatitiden handelt es sich um komplexe systemische Erkrankungen, die sich in Zellkulturmodellen nicht hinreichend abbilden lassen. Zudem stehen keine spezifischen und effizienten Methoden zur Verfügung, um die Expression von AP-1 sowie P2 Rezeptoren zu hemmen. Daher stellt die Verwendung entsprechender Knockoutmausmodellen ein alternativloses in vivo System dar, um die Funktionen dieser Signaltransduktionswege bei der akuten HBV Infektion zu charakterisieren.
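The sanity check above contrasts the perturbed "random words" text with the original NTS wording. One way to make such a comparison concrete, sketched below under the same assumptions as before (a generic German BERT checkpoint rather than the thesis' fine-tuned model), is to mask the same word in both variants and compare the model's top predictions; the helper function, the chosen target word, and the value of k are illustrative assumptions, not the thesis code.

```python
# Minimal sketch: mask the same target word in the original NTS sentence and in the
# perturbed variant and compare the top-k predictions. Checkpoint and k are assumptions.
import torch
from transformers import BertTokenizer, BertForMaskedLM

name = "bert-base-german-cased"  # assumed checkpoint
tok = BertTokenizer.from_pretrained(name)
mlm = BertForMaskedLM.from_pretrained(name).eval()

def top_k_for_masked_word(sentence, target_word, k=5):
    """Mask the first occurrence of target_word and return top-k tokens per mask."""
    pieces = tok.tokenize(sentence)
    target = tok.tokenize(target_word)
    for start in range(len(pieces) - len(target) + 1):
        if pieces[start:start + len(target)] == target:
            masked = (pieces[:start] + [tok.mask_token] * len(target)
                      + pieces[start + len(target):])
            break
    else:
        raise ValueError(f"{target_word!r} not found in sentence")
    ids = tok.convert_tokens_to_ids([tok.cls_token] + masked + [tok.sep_token])
    with torch.no_grad():
        logits = mlm(input_ids=torch.tensor([ids])).logits
    mask_idx = [i for i, t in enumerate(ids) if t == tok.mask_token_id]
    return [tok.convert_ids_to_tokens(logits[0, i].topk(k).indices.tolist())
            for i in mask_idx]

original = "Alle Tiere werden am Versuchsende zur Organentnahme getötet."
perturbed = "Alle Tiere werden am Ende des Spiels zur Organentnahme versprochen."
print(top_k_for_masked_word(original, "Organentnahme"))
print(top_k_for_masked_word(perturbed, "Organentnahme"))
```

Both sentences are taken from the samples above; a domain-adapted model would be expected to rank the medically plausible pieces higher in the original context than in the perturbed one.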

6 NTS datasets to evaluate model

6.1 Doc 6198

Einfluss von Crlf1 auf die Synthese und Proteolyse von Proteinen Das hier vorgestellte Versuchsvorhaben dient der medizinischen Grundlagenforschung mit unmittelbarem klinischen Praxisbezug. Der Tierversuch zielt auf die Entwicklung einer Therapie zur Behandlung des Crisponi Syndroms ab. Das Crisponi Syndrom ist eine seltene autosomal-rezessive Erkrankung, die aufgrund von Hyperthermien häufig schon im Säuglingsalter zum Tod der Patienten führt. Einige wenige Patienten, die überleben, zeigen in der Adoleszenz ein paradoxes Schwitzen bei Temperaturen unter 20 °C. Therapeutische Interventionen sind bisher auf unterstützende Maßnahmen wie die Behandlung des paradoxen Schwitzens begrenzt. Trotz der guten phänotypischen Charakterisierung des Crisponi Syndroms steht leider bisher keine spezifische Behandlung zur Verfügung. Daher ist die Erforschung der Pathomechanismen des Crisponi Syndroms dringend erforderlich. Das Hauptziel dieses Forschungsprojekts ist, den Pathomechanismus dieser seltenen Erkrankung an neuronalen Zellen einer Crlf1-Knockoutmaus weiter aufzuklären. Diese Zellen werden auf die Expression noradrenerger Gene und eine veränderte Proteolyse im Vergleich zu Wildtypzellen hin überprüft. Wir erwarten durch dieses Projekt die Grundlage für die Entwicklung neuer Therapieansätze zu schaffen. Im Rahmen der vorgesehenen Behandlung sind erhebliche, länger anhaltende oder sich wiederholende Schmerzen nicht zu erwarten. Erwartete Belastung: gering. Wir wollen den Pathomechanismus des seltenen Crisponi Syndroms aufklären. Hierzu werden wir neuronale Zellkulturen von einer Crlf Knockoutmaus und von Wildtyp Tieren etablieren. Anschließend werden wir die Expression und Synthese noradrenerger Proteine sowie die Proteolyse in diesen Zellen überprüfen. Der Pathomechanismus des Crisponi Syndroms kann nur an Zellen mit demselben Gendefekt wie bei den Crisponi Patienten untersucht werden. Daher sind wir gezwungen mit neuronalen Zellen aus den genetisch veränderten Mäusen zu arbeiten. Biometrisch wurde die Mindestanzahl an Mäusen berechnet, die notwendig ist, um aussagekräftige Ergebnisse über die Wirksamkeit der Therapien zu erhalten. Es ist bekannt, dass die verwendeten Mäuse direkt nach der Geburt aufgrund von Schluckstörungen versterben. Unsere Tiere werden nicht geboren, da das Versuchsende auf den Embryonaltag E 17 festgelegt ist. Die Generierung der Tiere erfolgt nur zu Untersuchungszwecken über eine heterozygote Zucht.

6.2 Doc 6184

Cannabinoidrezeptor 2 - Expression bei inflammatorischen Prozessen der Maus In diesem Versuchsvorhaben möchten wir den CB2-Expressionsverlauf mittels unserer Reportermauslinie, CB2-GFPTg, während inflammatorischer Prozesse im Gehirn und peripher untersuchen. Dabei sind für uns die EAE als Multiple Sklerose-Modell und die LPS-Applikation als Sepsis-Modell besonders interessant, da hier der CB2-Rezeptor eine wichtige Rolle spielt. Das in unserem Labor generierte Mausmodell, CB2-GFPTg, stellt ein gutes Werkzeug zur Analyse der CB2-spezifischen Proteinexpression im generellen und entzündlichen Verlauf von Erkrankungen dar. Die Ergebnisse können ein völlig neues Licht auf die Funktion von CB2 während der Pathogenese der Sepsis und der Multiplen Sklerose werfen und zukünftig neue Behandlungsmethoden ermöglichen. Intraperitoneale Injektionen führen zu einer punktförmigen Verletzung, die nur wenig schmerzhaft ist. Die einmalige Applikation von LPS ist als mäßige bis erhebliche Belastung anzusehen. Die zu erwartenden Belastungen bei der intrakranialen Injektion sind von der Dauer und Intensität her als mittelgradig einzustufen. Die Immunisierung mit Freund's Adjuvans stellt eine mäßige Belastung dar, da sie subkutan verabreicht, zu leichten und lokal begrenzten Entzündungsreaktionen führen kann. Die Belastungen für die Tiere bei der EAE überschreiten die Dauer von 14 Tagen nicht, wodurch sie als mittelfristig einzustufen sind. Tötung erfolgt durch zervikale Dislokation und Perfusion. Der Einfluss von CB2 auf den Krankheitsverlauf der Multiplen Sklerose oder während der Sepsis kann nur mit einem der Krankheitssituation möglichst ähnlichem Modell erreicht werden. Andere Versuchsansätze, wie z.B. Zellkulturexperimente, sind nicht anwendbar, da diese die Komplexität und Interaktion unterschiedlicher entzündungsbedingter Vorgänge im ZNS und auch peripher nicht ausreichend widerspiegeln können. Die komplexen Interaktionen von diversen Zelltypen können in vitro nicht nachvollzogen werden. Die geschätzte Anzahl der Tiere basiert auf unseren derzeitigen Erfahrungen auf dem Gebiet der Verhaltensanalysen. Wir haben eine ausführliche statistische Analyse durchgeführt, um sicherzustellen, dass die minimale Anzahl von Tieren verwendet wird, um signifikante Ergebnisse zu erzielen. Nach Möglichkeit werden die in vitro Studien bevorzugt eingesetzt. Die in diesem Projekt verwendeten Mäuse werden in einer modernen Tierhaltung unter SPF-Bedingungen gehalten. Die Tiere werden in Gruppen gehalten und mit Einstreu versorgt. Die Behandlungen der Mäuse mit LPS sowie die EAE-Versuche führen zu einer mäßigen bzw. geringen Belastung der Tiere. Die erforderlichen Dosierungen und Versuchsabläufe werden so gewählt, dass diese Belastungen unter Beachtung des Untersuchungszieles so gering wie möglich sind. Zur Schmerzlinderung nach der intrakranialen Injektion wird den Mäusen alle 6 Stunden Buprenorphin subcutan (0,1 mg/kg) verabreicht. Sollten die Tiere eine Infektion erleiden oder ernsthaft erkranken, werden sie vorzeitig schmerzfrei getötet.

6.3 Doc 6158

Einfluss von Adipositas auf eine mitochondriale Dysfunktion Mitochondrien haben in der eukaryotischen Zelle zahlreiche essentielle Funktionen und generieren nahezu den gesamten Anteil an intrazellulärer Energie in Form von ATP. Sie bilden in der Zelle ein dynamisches tubuläres Netzwerk, welches durch ständige Fusions- und Teilungsprozesse aufrecht gehalten wird. Der Mechanismus der mitochondrialen Dynamik ist noch nicht vollständig aufgeklärt. Allerdings wurde bereits deutlich, dass mit einer morphologischen Veränderung einhergehende mitochondriale Funktionsstörungen beim Menschen mit einer Vielzahl von Krankheiten assoziiert sind. Aktuelle Studien geben Hinweise darauf, dass solche Funktionsstörungen, in den Beta-Zellen der Langerhansschen Inseln des Pankreas die Entstehung des Diabetes mellitus Typ 2 begünstigen. Neuste Studien zeigen, dass eine chronische Exposition mit gesättigten freien Fettsäuren, wie sie bei einer Adipositas vorliegen, zu einer Beta-Zell-Dysfunktion mit einhergehender mitochondrialen Dysfunktion führt. Eine vergleichbare Dysregulation der mitochondrialen Dynamik wird als Ursache der Entstehung einer Insulinresistenz diskutiert. Es ist das Ziel dieses Projektes, durch eine kalorienreduzierte Diät ehemals adipöser C57BL/6ob/ob Mäuse die mitochondriale Dysfunktion und einen möglichen Zusammenhang bei einer Beta-Zell Dysfunktion und Insulinresistenz zu untersuchen. Welchen Einfluss die Adipositas auf mitochondriale Funktionsstörungen hat und ob die morphologischen Veränderungen reversibel sind, ist bislang noch unbekannt. Dieser Aspekt soll im beantragten Versuchsvorhaben an C57BL/6ob/ob Mäusen untersucht werden. Die Belastung der Tiere während der retrobulbären Blutabnahme (in Narkose) wird als mäßig eingeschätzt. Die i.p.-Injektionen (ohne Narkose) werden als geringfügig eingestuft. Da die Mäuse im Rahmen der Blutzuckermessung nicht merklich auf die Penetration der Schwanzspitze reagieren, wird davon ausgegangen, dass dies ebenfalls nur eine geringfügige Belastung darstellt. 1. Im Rahmen des Projektes soll untersucht werden, ob eine Gewichtsabnahme beim Adipositas-Mausmodell eine mitochondriale Dysfunktion verbessern und somit einer Insulinresistenz entgegenwirken kann. Welchen Einfluss Adipositas auf mitochondriale Funktionsstörungen hat ist bislang noch unbekannt. Dieser Aspekt soll im beantragten Versuchsvorhaben an C57BL/6ob/ob Mäusen untersucht werden. 2. Die geplanten in und ex vivo Untersuchungen sind wegen der komplexen Interaktionen durch in vitro Methoden nicht ersetzbar. Die nach Versuchsende entnommenen Organe werden für verschiedene Fragestellungen verwendet. Dadurch wird die Zahl der verwendeten Tiere auf das unerlässliche Maß beschränkt. 1. Das Leid der Versuchstiere wird durch eine gute Tierhaltung, die sachgerechte Aus- und Weiterbildung der beteiligten Personen und gewissenhafte Versuchsplanung so gering wie möglich gehalten. 2. B6.V-lepob Mäuse leiden an einem absoluten Leptinmangel, welcher zum Fehlen des Sättigungssignals und unkontrollierter Nahrungsaufnahme führt. Die Mäuse entwickeln Adipositas, welches von Hyperglykämie, Hyperinsulinämie und Insulinresistenz begleitet wird.

6.4 Doc 6183

Test von biologischen Matrices zur Rekonstruktion der Augenoberfläche Vernarbende Erkrankungen der Augenoberfläche entstehen zum Beispiel nach Verletzungen, Tumorerkrankungen oder bei Autoimmunerkrankungen. Hierbei kommt es zu Fremdkörpergefühl bis hin zu starken Augenschmerzen und zur Sehminderung bis zur Erblindung. Bei schwerer Ausprägung ist zur Therapie die Wiederherstellung einer intakten Augenoberfläche, also der Hornhaut und der Bindehaut, erforderlich. Hierfür stehen derzeit keine Transplantat-Materialien zur Verfügung, welche die Voraussetzungen der mechanischen Stabilität, Biokompatibilität, sowie reproduzierbaren Herstellbarkeit erfüllen. In diesem Versuchsvorhaben sollen vier verschiedene Gewebe für die Rekonstruktion der Hornhaut und Bindehaut in vivo getestet werden. Bei erfolgreicher in vivo Testung können diese Gewebe für Pilotversuche im Menschen eingesetzt und potentiell als reproduzierbar herstellbare Ersatzgewebe zur Rekonstruktion der Augenoberfläche genutzt werden. Je Tier wird ein Auge kontrolliert chirurgisch verletzt. Es entsteht hierbei ein Schaden der Hornhaut oder der Bindehaut. Die Verletzungen werden direkt chirurgisch versorgt und die Augen mit schmerzlindernden und antibakteriellen Augentropfen behandelt. Dadurch werden Schmerzen reduziert und das Risiko für eventuelle Infektionen verringert. Die Orientierungsfähigkeit der Tiere wird nicht beeinträchtigt, da immer ein Auge unbehandelt bleibt. Da es sich um Finalversuche handelt (die Tiere werden euthanasiert), entsteht für die Tiere kein anhaltender Schaden. Während der Versuche ist eine mittelgradige Belastung zu erwarten (gemäß Anhang VIII, Richtlinie 2010/63/EU). Eine schwere Belastung stellt ein Abbruchkriterium dar. Den geplanten Untersuchungen sind erfolgreiche in vitro Arbeiten vorangegangen. Die Testung der Handhabbarkeit, der chirurgischen Anwendbarkeit, der Einheilung in das Gewebe und der Einfluss von externen Faktoren wie Lidschlag und Augenbewegungen sind nur im vitalen Organismus, also im Tiermodell, testbar. Daher kann nur im Tierversuch getestet werden, ob die Gewebe für eine Anwendung im Menschen geeignet sind. Es erfolgte eine biometrische Planung mit Hilfe des Programms GPower 3.1. Hierdurch wurde die Anzahl der Tiere in einer Gruppe auf das notwendige Maß reduziert, um eine biologisch relevante Differenz zu erreichen. Die Kontrolltiere wurden auf die Mindestzahl begrenzt. Das Kaninchen ist ein etabliertes Modell für die Untersuchung von Augenerkrankungen. Das Kaninchenauge hat außerdem eine angemessene Größe für chirurgische Eingriffe und Augenuntersuchungen sowie die Tropftherapie nach der Operation, so dass diese in optimaler Weise erfolgen können. Ein kleineres oder geringer entwickeltes Tier eignet sich nicht für die geplanten Versuche. Es wird immer nur jeweils ein Auge des Tieres operiert, so dass die Orientierungsfähigkeit erhalten bleibt. Der chirurgische Eingriff bleibt lokal auf die Hornhaut oder Bindehaut begrenzt und der Lidschluss wird nicht beeinträchtigt, dadurch werden Schmerzen reduziert. Durch eine angepasste lokale Augentropftherapie werden Infektionen, Entzündungen und Immunreaktionen verhindert.

DECLARATION

I, Manjil Shrestha, hereby declare that the work presented herein is my own work completed without the use of any aids other than those listed. Any material from other sources or works done by others has been given due acknowledgement and listed in the reference section. Sentences or parts of sentences quoted literally are marked as quotations; identification of other references with regard to the statement and scope of the work is quoted. The work presented herein has not been published or submitted elsewhere for assessment in the same or a similar form. I will retain a copy of this assignment until after the Board of Examiners has published the results, which I will make available on request.

03.09.2020

Manjil Shrestha Date