Improving Historical Language Modelling Using Transfer Learning
MSc Artificial Intelligence
Master Thesis

Improving historical language modelling using Transfer Learning

by Konstantin Todorov
12402559

August 10, 2020
48 EC
Nov 2019 - Aug 2020

Supervisor: Dr G Colavizza
Assessor: Dr E Shutova

Institute of Logic, Language and Computation
University of Amsterdam

Contents

Abstract
Acknowledgements
List of Figures
List of Tables
Abbreviations
1 Introduction
1.1 Motivation and Problem statement
1.2 Contributions
1.3 Structure
2 Background
2.1 Machine learning
2.2 Natural language processing
2.3 Transfer learning
2.3.1 Multi-task learning
2.3.1.1 Introduction
2.3.1.2 Benefits
2.3.2 Sequential transfer learning
2.3.2.1 Motivation
2.3.2.2 Stages
2.3.3 Other
2.4 Optical Character Recognition
2.5 Historical texts
2.6 Transfer learning for historical texts
3 Empirical setup
3.1 Model architecture
3.1.1 Input
3.1.2 Embedding layer
3.1.3 Task-specific model and evaluation
3.2 Problems
3.2.1 Named entity recognition
3.2.1.1 Motivation
3.2.1.2 Tasks
3.2.1.3 Data
3.2.1.4 Evaluation
3.2.1.5 Model
3.2.1.6 Training
3.2.2 Post-OCR correction
3.2.2.1 Motivation
3.2.2.2 Data
3.2.2.3 Evaluation
3.2.2.4 Model
3.2.2.5 Training
3.2.3 Semantic change
3.2.3.1 Motivation
3.2.3.2 Tasks
3.2.3.3 Data
3.2.3.4 Evaluation
3.2.3.5 Model
3.2.3.6 Training
4 Results
4.1 Named entity recognition
4.1.1 Main results
4.1.2 Convergence speed
4.1.3 Parameter importance
4.1.4 Hyper-parameter tuning
4.2 Post-OCR correction
4.2.1 Main results
4.2.2 Convergence speed
4.2.3 Hyper-parameter tuning
4.3 Semantic change
4.3.1 Main results
4.3.2 Hyper-parameter tuning
5 Discussion
5.1 Data importance
5.2 Transfer learning applicability
5.3 Limitations
5.4 Future work
6 Conclusion
Bibliography

"Those who do not remember the past are condemned to repeat it."
George Santayana

Abstract

Transfer learning has recently delivered substantial gains across a wide variety of tasks. In Natural Language Processing, mainly in the form of pre-trained language models, it has proven beneficial as well, helping the community push forward many low-resource languages and domains. Thus, naturally, scholars and practitioners working with OCR'd historical corpora are increasingly exploring the use of pre-trained language models. Nevertheless, the specific challenges posed by documents from the past, including OCR quality and language change, call for a critical assessment of the use of pre-trained language models in this setting. We consider three shared tasks, ICDAR2019 (post-OCR correction), CLEF-HIPE-2020 (Named Entity Recognition, NER) and SemEval 2020 (Lexical Semantic Change, LSC), and systematically assess the use of pre-trained language models with historical data in French, German and English for the first two and English, German, Latin and Swedish for the third. We find that pre-trained language models help with NER but not with post-OCR correction. Furthermore, we show that this improvement does not stem from the increase in network size but precisely from the transferred knowledge. We further show how multi-task learning can speed up training on historical data while achieving similar results for NER. In all challenges, we investigate the importance of data quality and size, which emerge as among the factors that most hinder progress in the historical domain. Moreover, for LSC we see that, due to the lack of standardised evaluation criteria and the bias introduced during annotation, important encoded knowledge can be left out. Finally, we share with the community our modular implementation, which can be used to further assess the current state of transfer learning applicability to historical documents. In conclusion, we emphasise that pre-trained language models should be used critically when working with OCR'd historical corpora.

Acknowledgements

Working on this thesis proved to be a challenge of great value for me and also one of great joy. I am truly pleased and immensely proud to be finalising this important step of my life. None of this would have been possible without, first and foremost, Giovanni Colavizza, whom I would like to thank for his constant advice, great support and outstanding supervision throughout the last nine months. His experience and thinking helped me gain invaluable knowledge and deliver more than I ever imagined. Furthermore, I want to thank my family for everything that they have ever done for me, all of which led to this achievement. From the bottom of my heart, I thank my family, my mother and father, Ренета and Живко Тодорови, and my sister, Деница, without whom I would never have achieved what I have achieved so far in my life. Thank you for everything you have ever done for me. I sincerely hope that this document makes you proud.
My sincere gratitude also goes to two of my colleagues at Talmundo, Reinder Meijer and Esther Abraas, who throughout the last months checked whether I was doing okay more times than I did myself. I would not have been able to finalise this work in its current state had it not been for their understanding and the freedom they gave me over my time. Finally, I also want to thank Veronika Hristova for supporting me throughout this difficult period, for motivating me to do more and for taking huge leaps of faith because of me. Thank you!

List of Figures

2.1 General OCR flow
3.1 General model architecture for assessing transfer learning on a variety of tasks
3.2 Number of tokens per decade and language
3.3 Number of tag mentions per decade and tag
3.4 Non-entity tokens per tag and language in the training datasets
3.5 NERC base model architecture
3.6 NERC multi-task model architecture
3.7 Post-OCR correction model, encoder sub-word and character embedding concatenation
3.8 Post-OCR correction model, decoder pass
4.1 Levenshtein edit distributions per language
4.2 Word neighbourhood change over time, 2-D projection using t-SNE

List of Tables

3.1 NERC sub-task comparison
3.2 NERC entity types comparison
3.3 NERC hyper-parameters
3.4 ICDAR 2019 data split
3.5 ICDAR 2019 data sample
3.6 Post-OCR correction hyper-parameters
3.7 SemEval 2020 corpora time periods per language
4.1 NERC results, French, multi-segment split
4.2 NERC results, French, document split
4.3 NERC results, German, multi-segment split
4.4 NERC results, German, document split
4.5 NERC results, English, segment split
4.6 NERC, convergence speed (averaged per configuration)
4.7 NERC, parameter importance, French
4.8 NERC, parameter importance, German
4.9 NERC, parameter importance, English
4.10 NERC hyper-parameter configurations
4.11 Post-OCR correction results, French