MSc Master Thesis

Improving historical language modelling using Transfer Learning

by Konstantin Todorov 12402559

August 10, 2020

48 EC Nov 2019 - Aug 2020

Supervisor: Dr G Colavizza

Assessor: Dr E Shutova

Institute of Logic, Language and Computation
University of Amsterdam

Contents

Abstract

Acknowledgements

List of Figures

List of Tables

Abbreviations

1 Introduction
   1.1 Motivation and Problem statement
   1.2 Contributions
   1.3 Structure

2 Background
   2.1 Machine learning
   2.2 Natural language processing
   2.3 Transfer learning
      2.3.1 Multi-task learning
         2.3.1.1 Introduction
         2.3.1.2 Benefits
      2.3.2 Sequential transfer learning
         2.3.2.1 Motivation
         2.3.2.2 Stages
      2.3.3 Other
   2.4 Optical Character Recognition
   2.5 Historical texts
   2.6 Transfer learning for historical texts

3 Empirical setup
   3.1 Model architecture
      3.1.1 Input
      3.1.2 Embedding
      3.1.3 Task-specific model and evaluation
   3.2 Problems
      3.2.1 Named entity recognition
         3.2.1.1 Motivation
         3.2.1.2 Tasks
         3.2.1.3 Data
         3.2.1.4 Evaluation
         3.2.1.5 Model
         3.2.1.6 Training

      3.2.2 Post-OCR correction
         3.2.2.1 Motivation
         3.2.2.2 Data
         3.2.2.3 Evaluation
         3.2.2.4 Model
         3.2.2.5 Training
      3.2.3 Semantic change
         3.2.3.1 Motivation
         3.2.3.2 Tasks
         3.2.3.3 Data
         3.2.3.4 Evaluation
         3.2.3.5 Model
         3.2.3.6 Training

4 Results
   4.1 Named entity recognition
      4.1.1 Main results
      4.1.2 Convergence speed
      4.1.3 Parameter importance
      4.1.4 Hyper-parameter tuning
   4.2 Post-OCR correction
      4.2.1 Main results
      4.2.2 Convergence speed
      4.2.3 Hyper-parameter tuning
   4.3 Semantic change
      4.3.1 Main results
      4.3.2 Hyper-parameter tuning

5 Discussion
   5.1 Data importance
   5.2 Transfer learning applicability
   5.3 Limitations
   5.4 Future work

6 Conclusion

Bibliography

“Those who do not remember the past are condemned to repeat it.”

– George Santayana

Abstract

Transfer learning has recently delivered substantial gains across a wide variety of tasks. In Natural Language Processing, mainly in the form of pre-trained language models, it has proven beneficial as well, helping the community push forward many low-resource languages and domains. Thus, naturally, scholars and practitioners working with OCR’d historical corpora are increasingly exploring the use of pre-trained language models. Nevertheless, the specific challenges posed by documents from the past, including OCR quality and language change, call for a critical assessment of the use of pre-trained language models in this setting. We consider three shared tasks, ICDAR2019 (post-OCR correction), CLEF-HIPE-2020 (Named Entity Recognition, NER) and SemEval 2020 (Lexical Semantic Change, LSC), and systematically assess the use of pre-trained language models with historical data in French, German and English for the first two and in English, German, Latin and Swedish for the third. We find that pre-trained language models help with NER but not with post-OCR correction. Furthermore, we show that this improvement does not come from the increase in network size but precisely from the transferred knowledge. We further show how multi-task learning can speed up training on historical data while achieving similar results for NER. In all challenges, we investigate the importance of data quality and size, which emerge as among the factors currently hindering progress in the historical domain the most. Moreover, for LSC we see that, due to the lack of standardised evaluation criteria and the bias introduced during annotation, important encoded knowledge can be left out. Finally, we share with the community our modular implementation, which can be used to further assess the current state of transfer learning applicability to historical documents. In conclusion, we emphasise that pre-trained language models should be used critically when working with OCR’d historical corpora.

Acknowledgements

Working on this thesis proved to be a challenge of great value for me and also one of great joy. I am truly pleased and immensely proud of finalising this important step of my life. None of this would have been possible without, first and foremost, Giovanni Colavizza, whom I would like to thank for his constant advice, great support and outstanding supervision throughout the last nine months. His experience and way of thinking helped me gain invaluable experience and deliver more than I ever imagined.

Furthermore, I want to thank my family for everything that they have ever done for me, all of which led to this achievement. I thank my family from the bottom of my heart – my mother and father, Reneta and Zhivko Todorovi, and my sister Denitsa – without whom I would never have achieved what I have achieved in my life so far. Thank you for everything you have ever done for me. I sincerely hope that this document makes you proud.

My sincere gratitude goes also towards two of my colleagues at Talmundo, namely Reinder Meijer and Esther Abraas, who throughout the last months reached out to me to verify that I was doing okay more times than I did myself. I would not have been able to finalise this work in such a state were it not for their understanding and the freedom they gave me over my time.

Finally, I also want to thank Veronika Hristova for supporting me throughout this difficult period, for motivating me to do more and for taking huge leaps of faith because of me. Thank you!

List of Figures

2.1 General OCR flow

3.1 General model architecture for assessing transfer learning on a variety of tasks
3.2 Amount of tokens per decade and language
3.3 Amount of tag mentions per decade and tag
3.4 Non-entity tokens per tag and language in the training datasets
3.5 NERC base model architecture
3.6 NERC multi-task model architecture
3.7 Post-OCR correction model, encoder sub-word and character embedding concatenation
3.8 Post-OCR correction model, decoder pass

4.1 Levenshtein edit distributions per language
4.2 Word neighbourhood change over time, 2-D projection using t-SNE

List of Tables

3.1 NERC sub-task comparison
3.2 NERC entity types comparison
3.3 NERC hyper-parameters
3.4 ICDAR 2019 Data split
3.5 ICDAR 2019 Data sample
3.6 Post-OCR correction hyper-parameters
3.7 SemEval 2020 corpora time periods per language

4.1 NERC results, French, multi-segment split
4.2 NERC results, French, document split
4.3 NERC results, German, multi-segment split
4.4 NERC results, German, document split
4.5 NERC results, English, segment split
4.6 NERC, convergence speed (averaged per configuration)
4.7 NERC, parameter importance, French
4.8 NERC, parameter importance, German
4.9 NERC, parameter importance, English
4.10 NERC hyper-parameter configurations
4.11 Post-OCR correction results, French
4.12 Post-OCR correction results, German
4.13 Post-OCR correction results, English
4.14 Post-OCR correction – convergence speed (averaged, in minutes)
4.15 Post-OCR correction hyper-parameter configuration
4.16 Semantic change – Spearman correlation coefficients

Abbreviations

General notation

et al. ...... et alia (en: and others)
e.g. ...... exempli gratia (en: for example)
etc. ...... et cetera (en: and other similar things)
i.e. ...... id est (en: that is)

Machine learning

CNN ...... Convolutional Neural Network

CRF ...... Conditional Random Field

DL ...... Deep Learning

GRU ...... Gated Recurrent Unit

IR ...... Information Retrieval

LSTM ...... Long Short-Term Memory

ML ...... Machine Learning

MTL ...... Multi-Task Learning

RNN ...... Recurrent Neural Network

STL ...... Sequential Transfer Learning

Natural language processing

BOW ...... Bag-of-Words

LSC ...... Lexical Semantic Change

NE ...... Named Entity

NEL ...... Named Entity Linking

NER ...... Named Entity Recognition

NERC ...... Named Entity Recognition and Classification

NLI ...... Natural Language Inference

NLP ...... Natural Language Processing

OCR ...... Optical Character Recognition

Chapter 1

Introduction

Language is one of the most fundamental parts of human interaction. It is what binds us and defines us as humans, enabling us to communicate and store information about countless events. It allows us to preserve current knowledge throughout history for future generations – sometimes thousands of years later. As a consequence, it is naturally one of the most fundamental areas of past and present research in many different scientific fields. Machine learning is not an exception, and Natural Language Processing (NLP) encompasses the foundations of many of the current advances in text processing. Furthermore, using language is crucial for advancing other fields as well – an understandable fact, considering that we use language, and in particular text, to explain almost everything around us.

Despite the countless advances in the direction of true text processing, little has been done about our cultural heritage and ancient texts in particular. Many old manuscripts and newspapers are collecting dust, long forgotten in libraries around the world, waiting to tell us the stories they contain. For without knowing the past, it is impossible to understand the true meaning of the present and the goals of the future.

1.1 Motivation and Problem statement

Developing systems that can understand human language, interact seamlessly and offer information retrieval portals has been a lifelong dream of scientists for more than one generation now. Early symbolic text representations were rule-based but proved to be inefficient, mostly because of their narrow focus on the particular domain they had been designed for (Winograd, 1972). They failed to generalise to unseen data, which eventually made them obsolete (Council et al., 1966).

Ultimately, statistical handling of data proved to be the most robust way to deal with different types of data, including text (Manning and Schutze, 1999). At the core of such applications stand mathematical models that automatically learn data features. The advancements in Deep Learning inevitably improved systems all around, and direct human participation decreased even more. Many problems, once considered unsolvable, are now solved. Many domains, once considered unexplorable and incompatible with artificial intelligence, are now at the centre of numerous research efforts.

Areas which deep learning in general, and NLP in particular, has influenced range from machine translation and spell checking to automatic summarisation and sentiment analysis. Even more so, the advances in NLP opened up new fields of study and introduced tools such as chatbots, as well as helping to advance previously ‘stuck’ ones such as optical character recognition (OCR).

Precisely because of the progress made in OCR, researchers were able to achieve partial success by applying it to some non-standard documents in other domains. One such domain, inspired by Digital Humanities, is the historical domain, consisting of many ancient books, newspapers and magazines. These documents are an important part of our cultural heritage, and many organisations are entrusted with their preservation – a task whose difficulty grows as time passes.

Thus, full digitisation of historical texts should become a critical goal to aim for. Unfortunately, there are several factors that make the general accessibility of collections of digitised historical records, and their use as data for research, challenging (Piotrowski, 2012; Ehrmann et al., 2016; Bollmann, 2019). First and foremost, Optical or Handwritten Character Recognition is still error-prone during the extraction of text from images. Challenges also include the drastic language variability over time and the lack of linguistic resources for automatic processing. This all imposes the need to integrate, but at the same time question, modern NLP techniques in order to improve the state of digitisation of historical records.

One of the most promising techniques in this regard is transfer learning. Transfer learning focuses on sharing knowledge, learned from one problem, with another, often related, problem. It is heavily used in modern-day linguistic problems, offering many benefits, among which (i) faster convergence, (ii) requiring less compute resources, (iii) overcoming the lack of linguistic resources, including annotated data, and (iv) achieving higher levels of generalisation compared to traditional learning.

Due to its nature, transfer learning promises to alleviate many issues which exist in historical corpora. Recently, it has started to be applied to historical collections in a variety of ways. Examples include measuring lexical semantic change of specific words (Shoemark et al., 2019; Giulianelli et al., 2020) and extracting named entity information (Labusch et al., 2019). Unfortunately, its applications are sporadic and unsystematic. One of the biggest problems that hinders progress is the lack of a standardised assessment of its characteristics when applied to ancient texts. The community is still uncertain when and how it can be successfully applied to this domain. As most of the currently available models are pre-trained on modern texts, their application must be questioned and analysed in detail. Thus we take a first step towards establishing how historical digitisation can be improved using transfer learning.

1.2 Contributions

This work provides several contributions to the community, which we consider as the beginning of a fuller hands-on research effort into the problem.

• We introduce a modular architecture which is meant to ease the disabling or addition of new modules in an existing neural network setup. We provide the code freely and open-source it1 (§3.1).

• We demonstrate how transfer learning can be used in several challenges that are important for Digital Humanities, using ancient documents as data. We focus on named entity recognition and classification (NERC) (§3.2.1), post-OCR correction (§3.2.2) and lexical semantic change (LSC) (§3.2.3). We further show the two sides of it – (i) it can improve setups significantly (§4.1.1) and encode information otherwise unavailable through alternative deep learning methods (§4.3.1), but it can also (ii) produce only marginal benefits, sometimes none at all (§4.2.1). We further give examples of the increase in training time that it requires (§4.2.2 and §4.1.2), thus concluding that researchers should not blindly follow a trend, thinking that transfer learning will always work better and immediately produce benefits.

• We further provide results for experiments commonly conducted in studies related to transfer learning, but applied to historical documents. We show that fine-tuning BERT is also questionable and often does not give any benefits over historical texts, contrary to common beliefs (§4.2.1 and §4.1.1). Additionally, we use multi-task learning for our NERC setup (§3.2.1) and compare it against a single-task learning configuration, showing that we can achieve similar, even better, results while cutting down training time.

• We uncover the severe data limitations currently hindering progress in the area, stemming mostly from the overall bad quality of the OCR’d documents, but also from the lack of large single-origin datasets. We obtain our best results for tasks that have both a high amount of data and data originating from the same historical background (§5.1).

1 https://github.com/ktodorov/eval-historical-texts/

For the purpose of this work, we participate actively in two open challenges – (i) CLEF-HIPE-2020, where we work on NERC and rank second overall for French and German under the team name Ehrmama (Ehrmann et al., 2020b; Todorov and Colavizza, 2020b), and (ii) SemEval 2020 (Schlechtweg et al., 2020), where we rank 22nd and 24th in the two sub-tasks. We analyse the latter and show that there is a serious bias introduced in current evaluation measurements of lexical semantic change (§5.1). Finally, our contributions and findings are also summarised in a paper that resulted from this thesis (Todorov and Colavizza, 2020a).

1.3 Structure

In Chapter 2, we briefly explain the main concepts of Machine Learning (ML) and Natural Language Processing (NLP) that are used throughout this work. Afterwards, we discuss the current state of transfer learning and its different applications and types. We explain in a similar manner Optical Character Recognition (OCR) and the problems that are hindering the digitisation of historical texts. Finally, we combine these problems and demonstrate how transfer learning can be beneficial for OCR over historical texts.

Further, in Chapter 3 we first show the generic modular setup that we use throughout this thesis. Then, we describe the specifics of the challenges that we participate in, the different modules and task-specific models that we use in each one, and the motivation behind these.

Chapter 4 presents our results for each challenge and the analysis we conduct. For post-OCR correction and named entity recognition, we additionally provide the tracked differences in convergence time between our different configurations. Finally, we also provide the hyper-parameter configurations that we used from the extensive lists that were available.

We discuss our main findings in Chapter 5 and show the limitations of our work, as well as of the current state of Digital Humanities and in particular the historical domain. We also provide guidelines for future work, which we believe can build on top of this study.

Lastly, in Chapter 6 we conclude this work and try to motivate the community, aiming to attract more interest in this area.

Chapter 2

Background

This chapter provides background and explanations on the methods that we use throughout the thesis, as well as information about the research that has been done on them over the years. We show how the field has changed and how it is currently a very dynamic area of study. We briefly discuss the origins of Machine Learning and why it is so important (§2.1). We then present Natural Language Processing (NLP), its linguistic sub-field (§2.2). Further, we show the importance of Transfer Learning and its branches (§2.3) and then do the same for Optical Character Recognition (OCR) (§2.4). Finally, we talk about the digitisation of historical texts and the problems they pose (§2.5) before combining everything together and discussing how those aforementioned problems can be addressed (§2.6).

2.1 Machine learning

Machine learning is the basis of our work and is a term used to identify computer algorithms that improve automatically through experience. More specifically, it builds mathematical models from data and uses probability and information theory to assign probabilities to possible outcomes of a specific event.

In machine learning, input data is usually represented as a vector x ∈ R^d of d features, where each feature contains a particular attribute of the data. For example, in the context of text, these features could simply be the characters of the sequence or, when working with images, the pixel values, and so on. Furthermore, the learning process of a model which uses these vectors is split into supervised learning – where for every input x_i there is an output value, typically a separate label y_i – and unsupervised learning, where no designated labels are available. Moreover, machine learning tasks are split into different categories, one of the most common being classification. In classification, the label y_i comes from a predefined set of classes. Depending on this set, we have binary classification, multi-class classification and multi-label classification. The first always deals with two classes, usually representing a Boolean value, whereas the second deals with more than two. In the third case, unlike the most common setting where every input x_i has exactly one correct corresponding label y_i, we instead have more than one, and the labels may overlap for different input entries. An ultimate goal of such models is to eventually achieve generalisation – that is, a state of the model which allows applying it to previously unseen data while expecting results as good as the ones observed previously. We will frequently revisit the topic of generalisation throughout this work as it is a fundamental part of our research too. For the purpose of generalisation, the available data is usually split into different sets which then play different roles. A bigger part, called the training set, is used for training the model. A smaller part is then reserved to evaluate the model after the training. It is called the test set and it is used to mimic unseen data, thus showing a glimpse of the generalisation ability of the system. From these two sets, two measures emerge when comparing the model's outputs to the original labels.
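To make these classification settings concrete, the toy sketch below (purely illustrative, not taken from the thesis code) shows how the same input vectors x_i can be paired with binary, multi-class and multi-label targets y_i:

```python
import numpy as np

# Three toy inputs, each represented as a feature vector x_i with d = 4 features.
X = np.array([
    [0.2, 1.0, 0.0, 3.5],
    [1.1, 0.0, 2.0, 0.5],
    [0.0, 0.7, 1.3, 2.2],
])

# Binary classification: one Boolean-like label per input.
y_binary = np.array([0, 1, 1])

# Multi-class classification: one label per input, drawn from more than two classes.
y_multiclass = np.array([2, 0, 3])          # classes {0, 1, 2, 3}

# Multi-label classification: several (possibly overlapping) labels per input,
# commonly encoded as one indicator column per class.
y_multilabel = np.array([
    [1, 0, 1, 0],   # input 0 belongs to classes 0 and 2
    [0, 1, 1, 0],   # input 1 belongs to classes 1 and 2
    [0, 0, 0, 1],   # input 2 belongs to class 3 only
])

print(X.shape, y_binary.shape, y_multiclass.shape, y_multilabel.shape)
```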

While training, we compute the training error over the training set. More important, however, is the test error, which shows the amount of generalisation that is achieved. Thus, first of all, it is important that the split into training and test sets is such that the latter is indeed unseen during training. Secondly, this is where the main difference between optimisation and machine learning algorithms shows: while the former seeks to minimise the training error, the latter aims to minimise the generalisation (test) error. Machine learning remains a fundamental field, combining many areas and techniques. Naturally, it is at the core of this study too, in particular through its linguistically focused sub-field.
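As a hedged illustration of this split (using scikit-learn and synthetic data, neither of which is necessarily part of the setup used later in this thesis), the snippet below holds out a test set and reports training versus test error:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                  # 500 examples, d = 10 features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # synthetic binary labels

# Reserve a smaller, unseen part of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

train_error = 1.0 - model.score(X_train, y_train)  # error on seen data
test_error = 1.0 - model.score(X_test, y_test)     # proxy for the generalisation error

print(f"training error: {train_error:.3f}, test (generalisation) error: {test_error:.3f}")
```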

2.2 Natural language processing

A crucial part of machine learning is natural language processing (NLP), a field which combines active areas of research on digital texts and their modelling. One of its main goals is to close the distance between computers and humans in the understanding of natural languages. Due to the contrast between different languages, generalisation is usually achieved by taking an abstract view and mapping languages to similar linguistic models and tasks (Smith, 2011).

Linguistic models that help with this abstraction in NLP are language models. As a key element of the field, they represent one of the earliest and biggest milestones in the progress made towards truly autonomous text processing and are nowadays one of the core parts of all related structures. They are mathematical machine learning models, using sequences of text (words, characters or other units) as the input vectors x_i. They aim to learn the mapping towards the output labels y and also provide the benefit of a low-level generalisation of text-based systems, which allows us to compare and formulate them.

Many studies have been performed on how exactly to formulate such models. The earliest attempts included human rule-based systems. These required people to manually create rules which would then be used to map the data. They were a good first step but had many flaws, the biggest of which was their limited ability to generalise well to unseen data (Committee, 1966). For the past 20 years, focus has shifted towards mathematical models which automatically learn representations from the input data (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006). This allowed human effort to be spent on creating features instead, which were supposed to teach the model what is important and what is not. Unfortunately, requiring human interaction once again proved to be an error-prone approach and one mostly limited by human understanding. Not only was it time-consuming, but it also reduced generalisation: even though humans most often knew a certain field very well, they failed to foresee unexpected circumstances outside of their specialisation. With the latest developments in recent years, deep neural networks have now become the most widely used models (Goodfellow et al., 2016). Their main difference is the automated learning of data features. As a result, they reduce and in some cases fully remove the need for human supervision. In cases of supervised learning, human participation is required only for labelling the data, whereas for unsupervised learning, the model is fully automated.

Being an important component of language models, the different ways of representing and labelling the input data have changed over the years too. In the beginning, as previously mentioned, task-specific representations created by human annotators were the most widely used. After these, a better, more autonomous approach of storing data was proposed, called bag-of-words (BOW). This made use of a vocabulary V which contained all words that occur in the data corpus and the number of occurrences of each. This frequency was sometimes additionally weighted with term frequency-inverse document frequency (tf-idf), which helped to weight non-important words (e.g. is, and, etc.) lower compared to other task-meaningful, but rare, words.
Original BOW models used unigrams – that is, sequences of one word – but some also made use of other n-grams, such as bigrams, which are sequences of two words. It is important to note that this approach ignores grammar and word order, therefore losing information about the context of the words. This has a major drawback where occurrences of words like bank are treated equally, irrespective of the fact that they may appear in the context of withdrawing money from my bank versus I sat on the river bank to admire the water. These problems paved the way for the introduction of word embeddings – a representation that boosted NLP progress tremendously and is arguably one of its most important novelties.
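As a minimal illustration of the bag-of-words idea and its tf-idf weighting (a sketch using scikit-learn, not the representation used later in this thesis), the snippet below builds unigram counts and their weighted counterparts:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "withdrawing money from my bank",
    "I sat on the river bank to admire the water",
]

# Plain bag-of-words: a vocabulary V plus per-document occurrence counts.
bow = CountVectorizer(ngram_range=(1, 1))      # unigrams; (1, 2) would add bigrams
counts = bow.fit_transform(corpus)
print(sorted(bow.vocabulary_))                 # the vocabulary V
print(counts.toarray())                        # raw frequencies per document

# tf-idf weighting: frequent but uninformative words receive lower weights.
tfidf = TfidfVectorizer(ngram_range=(1, 1))
weights = tfidf.fit_transform(corpus)
print(weights.toarray().round(2))

# Both documents contain "bank", yet its two senses are indistinguishable here -
# exactly the limitation that word embeddings and contextual models address.
```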

Word embeddings convert words to vectors which can differ in dimensionality and allow words to encode contextual information later on. This approach did not exclude the usage of BOW; what is more, some of the earliest works used both and only replaced the vocabulary words with their corresponding vectors, thus creating a word matrix (Levy and Goldberg, 2014; Goldberg and Levy, 2014; Joulin et al., 2016). One of the most widely recognisable approaches which is still used today is word2vec (Mikolov et al., 2013). It used skip-gram-based training – that is, using a context window of N words around a target word, which the model had to predict using the vocabulary. The downside of this approach was its inability to encode much context, since the window was usually small, up to 2 words, and increasing it made training more time-consuming and, more importantly, introduced a lot of unwanted noise. Still, many studies proved that word embeddings are an important piece of the NLP puzzle, showing how they can even be used to compare different words, precisely by using their vectors. Even more so, one can add the vectors of words such as Russia and river, resulting in a vector corresponding to the word Volga. With the addition of global matrix factorisation and local context window methods, GloVe (Pennington et al., 2014) was introduced as an improvement, where the vectors now made use of global matrix factorisation.

All those pre-trained word representations quickly became key components in many neural language architectures. However, they all lacked encoding of higher knowledge of their surroundings and were therefore still not perfect. Some improvements were made over the years, such as the inclusion of sub-word information (Wieting et al., 2016; Bojanowski et al., 2017) and the addition of separate vectors for each sense of a word (Neelakantan et al., 2015), but they were not as significant. To this end, and due to the advances in Deep Learning (DL), ELMo (Peters et al., 2017, 2018a) was introduced, which gave the first deep contextualised representations, building on top of these improvements. The model now contained much more complex characteristics, using a function of the entire input sentence as a representation instead, while also including sub-word information. This advanced the state of the art for most NLP challenges. However, it was shown that feature-based systems like ELMo can be improved even further if they are fine-tune-based instead (Fedus et al., 2018).

The recent introduction of the transformer architecture (Vaswani et al., 2017) allowed significantly faster training when compared to traditional recurrent- or convolutional-based systems, which at the time were state of the art in many tasks. Using this new architecture, BERT was unveiled (Devlin et al., 2019). Although much bigger and more power- and time-consuming than alternatives, it has proven itself many times over (Goldberg, 2019; Clark et al., 2019), having one key benefit, similar to ELMo – the lack of need to train the model from scratch. Since then, many alternatives have been proposed, both in different languages (Martin et al., 2020) and with slightly different architectures or training setups (Liu et al., 2019; Yang et al., 2019). They all share a common benefit, namely the ability to simply take the pre-trained representations from the model and use them to extract features. This all established the transfer of representations between tasks as a highly promising option.
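To make this feature-extraction use of pre-trained models concrete, here is a minimal sketch using the HuggingFace transformers library and an assumed BERT checkpoint (for illustration only; the models actually used in this work are described in Chapter 3):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load a pre-trained language model and its tokenizer (weights are downloaded once).
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

sentences = [
    "I withdrew money from my bank.",
    "I sat on the river bank to admire the water.",
]

with torch.no_grad():
    encoded = tokenizer(sentences, padding=True, return_tensors="pt")
    outputs = model(**encoded)

# One contextual vector per sub-word token: the two occurrences of "bank"
# receive different representations, unlike in static embeddings such as word2vec.
token_embeddings = outputs.last_hidden_state   # shape: (2, sequence_length, hidden_size)
print(token_embeddings.shape)
```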

2.3 Transfer learning

In standard learning methods, we train randomly initialised models. In cases where we work on multiple tasks, separate training runs are performed for each task, even if they use the exact same model architecture underneath. Many works have since proven the importance of transferring information instead. BERT solidified this knowledge, providing better generalisation and additional bidirectional information about the input data, as well as encouraging the usage of the same pre-trained embeddings on a wide range of different tasks (Sun et al., 2019). Results such as these gave more weight to one of the most important areas of study in the last decade – transfer learning (Pan and Yang, 2010). Similarly to humans, where parents continuously share existing knowledge with their children from a young age, networks also proved to benefit greatly from such a distribution. This approach provides many benefits, some of which are (i) significantly faster convergence, due to systems not starting the learning from scratch; (ii) less computation power required, due to the faster convergence and far fewer trainable parameters; and, arguably the most important, (iii) better generalisation, coming from the fact that models now contain information from multiple domains and are thus better prepared for unseen data. In addition, research showed that transfer learning helps when there is a lack of data available, which is often the case in many setups (Chronopoulou et al., 2019).

The mentioned gains helped the community in almost all areas of Machine Learning, including Natural Language Processing. When it comes to the source task or source domain – that is, the area from which we take existing knowledge – a decision needs to be made on what information exactly to use and apply to the target task or target domain, i.e. the area to which we apply the knowledge base. Most of the methods use a general-purpose source task. In addition, transfer learning is mostly used and found helpful when a linguistically under-resourced target task is at hand, i.e. when we have too small an amount of data to train independently, or even a lack of linguistic resources. Even if this is not the case, benefits such as saving compute resources are observed because of the reduced amount of training needed when applying a pre-trained model instead. Looking at what exactly to transfer – usually we would like to encode as much of the learned experience as possible. This can take on various forms depending on the area of application. In the case of NLP, and also throughout our work, this comes down to the text representations learned by neural network systems. Having shown how powerful transfer learning can be, we now turn towards how exactly to transfer and the multiple ways of using it across different tasks. We focus on the approaches used in this thesis and only briefly mention the remaining ones.

2.3.1 Multi-task learning

Transfer learning does not always have to depend on a model trained on a source task before carrying over the knowledge. Instead, we can transfer knowledge during training. This approach is called multi-task learning (MTL) and is what many systems use nowadays to encode knowledge from a broader context and multiple domains. It has been shown to be helpful in many applications of machine learning, including natural language processing (Collobert and Weston, 2008), speech recognition (Deng et al., 2013), computer vision (Girshick, 2015) and even drug discovery (Ramsundar et al., 2015).

2.3.1.1 Introduction

Typically, systems set up a model which is trained to perform a single task. This model may or may not be transferred to a different target task later on. However, focusing on multiple source tasks rather than a single one during the initial stages has several advantages: (i) we reduce the risk of forgetting information which is not relevant to the source task but is relevant to the target task and, more importantly, (ii) our model can generalise the encoded knowledge much better (Caruana, 1998). This type of learning is also motivated by a number of real-life examples. Looking at human biology, it is natural for babies to learn new tasks by applying knowledge they learn simultaneously elsewhere, for example when they first learn to recognise faces before applying this knowledge to recognise other, more complex scenarios (Wallis and Bülthoff, 1999). It can also be seen in teaching: when people are trying to learn new skills, they will always first learn something simple which provides them with the necessary abilities to master more advanced techniques (Brown and Lee, 2015). Going back to machine learning, multi-task learning has also been proven to coach models into preferring hypotheses that explain more than one task (Caruana, 1993).

2.3.1.2 Benefits

We will look at the most important gains when using multi-task learning, following Caruana (1998), who first explained these. For simplicity, we assume we have two tasks A and B in all examples.

Generalisation is arguably the most important perk. Learning in a multi-task approach literally increases the training data that we have at our disposal. Because all tasks contain some amount of noise, even if it is marginally small, increasing the data size helps when dealing with it. Learning by focusing on either task A or B alone introduces the risk of overfitting, while a joint approach lets us encode better, generalisation-focused representations.

Focus is a very important aspect in machine learning and as such can make a crucial difference, especially during the training process. If the data of a certain task contains a lot of noise, this can hinder the model’s ability to distinguish relevant from irrelevant features. Multi-task learning helps by introducing more data and directing the focus to the important cross-domain features.

Eavesdropping allows a model to learn some features required for task A by focusing on task B. Examples include features that are represented in a more complicated way in task A, or cases where other features, specific to task A, are in the way.

Representation generalisation allows the model to learn in such a way that the produced representations will generalise to cover multiple tasks in the future, which is something sought after in a multi-task approach. This also helps cover new tasks later on, assuming they come from the same environment (Baxter, 2000).

Regularisation occurs when using an MTL approach, which helps the model avoid overfitting as well as reduce its Rademacher complexity, i.e. its ability to fit random noise (Søgaard and Goldberg, 2016).

Multi-task learning is a natural fit for domains where we are originally interested in learning multiple tasks at once. Examples are (i) finance or economics forecasting, where we might need to predict multiple, likely related, features; (ii) marketing, where considerations for multiple consumers are taken at once (Allenby and Rossi, 1998); (iii) drug discovery, where many active compounds should be predicted together; and (iv) weather forecasting, where predictions often depend on multiple conditions and on the situation in neighbouring cities. However, multi-task learning can also be beneficial in cases where there is one main task. In such cases, auxiliary tasks are introduced to help better learn the desired representations. These can range from tasks predicting features similar to the desired ones to a task learning the inverse of the original one. In NLP, tasks that have been used as auxiliary tasks include speech recognition (Toshniwal et al., 2017), machine translation (Dong et al., 2015; Zoph and Knight, 2016; Johnson et al., 2017; Malaviya et al., 2017), multilingual tasks (Duong et al., 2015; Ammar et al., 2016; Gillick et al., 2016; Yang et al., 2016; Fang and Cohn, 2017; Pappas and Popescu-Belis, 2017; Mulcaire et al., 2018), semantic parsing (Guo et al., 2016; Fan et al., 2017; Peng et al., 2017; Zhao and Huang, 2017; Hershcovich et al., 2018), representation learning (Hashimoto et al., 2017; Jernite et al., 2017), question answering (Choi et al., 2017; Wang et al., 2017), information retrieval (Jiang, 2009; Liu et al., 2015; Katiyar and Cardie, 2017; Yang and Mitchell, 2017; Sanh et al., 2019) and chunking (Collobert and Weston, 2008; Søgaard and Goldberg, 2016; Ruder et al., 2019).
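To make the hard-parameter-sharing idea behind MTL concrete, the following PyTorch sketch (a generic illustration with assumed layer sizes, not the NERC multi-task model described in §3.2.1) shares one encoder between two task-specific heads and sums their losses:

```python
import torch
import torch.nn as nn

class SharedEncoderMTL(nn.Module):
    """A shared encoder with one output head per task (hard parameter sharing)."""

    def __init__(self, input_dim: int, hidden_dim: int, n_classes_a: int, n_classes_b: int):
        super().__init__()
        self.encoder = nn.Sequential(                 # parameters shared by both tasks
            nn.Linear(input_dim, hidden_dim), nn.ReLU()
        )
        self.head_a = nn.Linear(hidden_dim, n_classes_a)   # task A specific
        self.head_b = nn.Linear(hidden_dim, n_classes_b)   # task B specific

    def forward(self, x: torch.Tensor):
        shared = self.encoder(x)
        return self.head_a(shared), self.head_b(shared)

model = SharedEncoderMTL(input_dim=32, hidden_dim=64, n_classes_a=3, n_classes_b=5)
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 32)                     # a toy batch
y_a = torch.randint(0, 3, (8,))            # labels for task A
y_b = torch.randint(0, 5, (8,))            # labels for task B

logits_a, logits_b = model(x)
loss = criterion(logits_a, y_a) + criterion(logits_b, y_b)   # joint objective
loss.backward()
optimizer.step()
```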

2.3.2 Sequential transfer learning

We now turn towards the oldest, most widely known and arguably most frequently used transfer learning scenario, namely sequential transfer learning (STL). Here we train each task separately instead of jointly, as was the case in multi-task learning. The goal is then to transfer some amount of information from the training performed on the source task in order to improve performance on the target task. This has also led to the term model transfer (Wang and Zheng, 2015).

2.3.2.1 Motivation

Compared to the MTL approach, sequential transfer learning often requires initial model convergence before transferring information. There are also studies where the transfer happens before final convergence on the source task, resulting in the less common, so-called curriculum learning (Bengio et al., 2009). This type of sequential transfer learning imposes an order on the different tasks at hand and selects when each one needs to be used, most often based on a pre-defined difficulty scale. To further show how important this approach is, we look at some situations where using it seems like a natural choice, that is when (i) data is not available for all tasks at the same time, (ii) there is significant data imbalance between the different tasks or (iii) we want to adapt the trained model to multiple target tasks. Looking at compute requirements, sequential transfer learning is usually much more expensive during training on the source task. However, afterwards, during the application to the target task, it is much faster compared to MTL which, while faster to train on the source task, might be expensive when applying to the target task.

2.3.2.2 Stages

This type of transfer learning consists of two main stages, which correspond to the time periods during the learning process – pre-training and adaptation. During pre-training we perform the training over the source task data. This is the operation that is usually expensive in computational terms. After this, at the adaptation phase, the learned information is transferred to the target task. The latter stage – being time-efficient and relatively easy to perform – is what gives sequential transfer learning its compelling benefit over other types of learning, as systems can build on top of this pre-trained knowledge instead of starting their training from scratch. In order for this benefit to be maximal, the source task must be chosen carefully and in such a way that it combines as many general domain features as possible. Ideally, we would be able to produce universal representations which encode information about the whole area of interest. Even though the no free lunch theorem (Wolpert and Macready, 1997) states that fully universal representations will never be available, pre-trained ones are still significantly better when compared to a learn-from-scratch approach.

Pre-training stage can be further split into three sub-types, taking into account the type of supervision that is applied during training. We have unsupervised pre-training which, following unsupervised machine learning, does not require labelled data and relies only on raw forms. This is closer to the way humans learn (Carey and Bartlett, 1978; Pinker, 2013) and, in theory, more general than the other types. Because there is no labelling, we remove any unwanted human-introduced bias from the data and rely on the model to learn important features on its own, producing so-called self-taught learning (Raina et al., 2007) or simply unsupervised transfer learning (Dai et al., 2008). The second sub-type, called distantly supervised pre-training (Mintz et al., 2009), uses heuristic functions and domain knowledge to obtain labels from unlabelled data on its own. Tasks used in such an approach include sentiment analysis (Go et al.), word segmentation (Yang et al., 2017) and predicting unigrams or bigrams (Ziser and Reichart, 2017), conjunctions (Jernite et al., 2017) and discourse markers (Nie et al., 2019). Lastly, supervised pre-training encompasses traditional supervised methods which require manual labelling of the available data. The data used here usually already exists, and sometimes existing tasks related to the target task are chosen. Examples include training a machine translation model on a highly-resourced language and then transferring it to a low-resource language (Zoph et al., 2016), or training a part-of-speech (POS) tagging model and then applying it to a word segmentation task (Yang et al., 2017). Other options which emerged recently are pre-training on large datasets and tasks which help the model learn generic representations. Such examples are dictionaries (Hill et al., 2016), natural language inference (Conneau et al., 2018), translation (McCann et al., 2017) and image captioning (Kiela et al., 2018). Combining some or all of these three approaches can lead to so-called multi-task pre-training. This is done by elevating some of the heuristics mentioned in §2.3.1 and pre-training on multiple tasks at once. This can provide the resulting representations with further generalisation, as well as reduce the noise they contain.

Adaptation stage combines the different ways of adapting the already pre-trained knowledge. From the discussion so far, we see that the gains of transferring existing knowledge are indisputable. However, based on the task at hand, different methods of applying this knowledge might be more suitable than others. After initial convergence, a final decision remains as to how exactly to use the pre-trained information. One can decide to utilise the knowledge in a feature-based approach, where the transferred model weights are kept ‘frozen’ and the whole pre-trained setup is only used to extract features. These are then used as an input to another model or plugged into the current one using some heuristic merging (Koehn et al., 2003). In the case of NLP, an example of this are word representations (Turian et al., 2010). The second approach is referred to as fine-tuning, and it requires continuing the training of the existing model, this time on the target domain.

Both adaptations have their own benefits. Feature extraction allows us to use existing models easily and repeatedly. On the other hand, fine-tuning allows us to further specialise in the target domain, which has been proven to perform better in most cases (Kim, 2014). However, there are exceptions where fine-tuning is not as good and can even hinder performance, such as when the training set is very small, when the test set contains many OOV tokens (Dhingra et al., 2017) or when the training data is very noisy (Plank et al., 2018). Most works point towards fine-tuning being the more promising choice (Gururangan et al., 2020). However, while it has also been shown that fine-tuning works best if source and target tasks are similar, when this is not the case fine-tuning is easily overcome by simple feature extraction (Peters et al., 2019). This also corresponds to one of the negatives of fine-tuning – it can introduce a specific focus on certain tokens, which might lead to others becoming stale.
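In practice, the difference between the two adaptation strategies often comes down to whether the pre-trained parameters receive gradients. The PyTorch sketch below (assuming a HuggingFace BERT encoder and illustrative learning rates, not a prescription for the setups used later) contrasts the two options:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

encoder = AutoModel.from_pretrained("bert-base-cased")   # pre-trained source knowledge
task_head = nn.Linear(encoder.config.hidden_size, 9)     # e.g. 9 target-task classes

# Option 1: feature extraction - freeze the encoder and train only the task head.
for param in encoder.parameters():
    param.requires_grad = False
feature_extraction_optimizer = torch.optim.AdamW(task_head.parameters(), lr=1e-3)

# Option 2: fine-tuning - keep the encoder trainable and update it on the target task,
# usually with a smaller learning rate than the freshly initialised head.
for param in encoder.parameters():
    param.requires_grad = True
fine_tuning_optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 2e-5},
    {"params": task_head.parameters(), "lr": 1e-3},
])
```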

2.3.3 Other

One other, less known transfer learning type is lifelong learning (Thrun, 1996, 1998). It differs from sequential transfer learning, which originally uses two tasks – a source and a target one. In lifelong learning we can have many tasks which are learned sequentially, and there is no clear distinction over which tasks are source and which target. Another studied approach is called domain adaptation. This field seeks to learn representations and functions which will be beneficial for the target domain instead of trying to generalise. Finally, a separate type is reserved for cross-lingual learning, which combines methods for learning representations across languages. These come down to cross-lingual word representations and stand behind the hypothesis that many different models trained separately actually optimise for similar objectives and are in fact closely related, differing only in their optimisation and hyper-parameter strategies.

2.4 Optical Character Recognition

We now turn towards another major technology that has impacted our lives for the better, namely optical character recognition (OCR) (Mori et al., 1999). This represents a group of techniques whose goal is to convert, i.e. digitise, images of typed or printed text into machine-encoded text. It can also be applied to handwritten text, which results in so-called handwritten character recognition (HCR). The conversion is made through an optical mechanism. One can consider the way humans read a form of optical recognition as well, with the eyes being the optical mechanism. The technology is of great importance as most of the search and mining performed on digitised collections is done using OCR’d texts. The field has been an active area of development for more than a century (Mori et al., 1992; Rikowski, 2011) and may be traced all the way back to 1914, when Emanuel Goldberg developed a machine that reads characters and converts them into standard telegraph code. There are many applications for digitised text – the first ones were devices that could help the blind. Nowadays, the technology is advancing and recognised texts can be indexed and used in search engines, which can also help with one of the biggest problems in NLP and ML in general – the lack of data. Even simple outcomes, such as being able to access the text digitally, word by word, pave the way for more important scenarios like the ability to read books out loud, which can make the difference for visually impaired people. We identify several main steps during the process of recognising optical characters. These can all be seen in Figure 2.1. We describe them briefly.

Scanning involves converting the paper documents, which can be books, newspapers, magazines, etc. into digital images. This step can be skipped if we already have the images and not the original source.

Pre-processing applies some amount of processing over the scanned images. This can be ‘de-skewing’ the image, i.e. aligning it in case it was not originally aligned; ‘despeckling’, i.e. removing very bright or very dark spots and smoothing edges; binarisation of the image, which can make it black-and-white for example; cropping edges; removing lines; and more (Mori, 1984). This step aims to make it easier for the system to detect the text in the later steps.

Figure 2.1: General OCR flow. Blue steps use Computer Vision techniques, while red steps use NLP ones. During the recognition step both CV and NLP approaches can be used. Not all steps are necessary and some might be merged together or skipped depending on the requirements.

Segmentation step aims to find the important segments in the image which contain the text (Hirai and Sakai, 1980; Jeng and Lin, 1986). These are also called ‘regions of interest’ (ROI) of the image. They are usually represented as the coordinates of the starting pixel of a ROI, together with the width and height of the box that wraps the region.

Recognition can combine both NLP and Computer Vision (CV) techniques while trying to recognise the text, in terms of characters and words, out of the regions of interest resulting from the previous step. This usually produces text that can contain some grammatical or lexical errors, which is also why we often need to apply the next step in the flow.

Post-processing groups different strategies for fixing errors that resulted from the recognition step. This can involve character- or word-vocabulary checks, post-OCR correction, probability verification and more. It is an important step because many documents have an extremely low quality of preservation and are therefore very hard to digitise naturally (Nguyen et al., 2019).

Evaluation represents a final step which is used to train better recognition systems. Assuming we have a ground truth of the text we are trying to recognise, we can compare it to the output of the OCR system using a specific metric, usually accuracy. At this step we can also compare different OCR systems based on their speed and flexibility, i.e. how error-prone the system is to changes in the input.

While in the previous century the focus was mostly on improving the results of OCR setups, recently it has been shifting towards the usability of the output and its application in other domains. There is less room for improvement now, to the point of this challenge being considered a solved problem (Hartley and Crumpton, 1999). There are many OCR systems which give accuracy rates in the high nineties. However, with that being said, those results are not always consistent and are subject to a wide variety of pre-conditions, the biggest one being the quality of the original document (van Strien et al., 2020). This hinders advances in the field, and the impact of using OCR’d text remains relatively unknown (Milligan, 2013; Cordell, 2017, 2019; Smith and Cordell, 2018). As a result, there have been many works related to OCR quality and its development (Alex and Burns, 2014). With the advances of Deep Learning in recent years, naturally, some methods have been applied to OCR too, and research has analysed their effect on most of the aforementioned steps. Post-processing benefited greatly from applications of machine translation models during the post-OCR correction phase (Nastase and Hitschler, 2018), some of which were used to de-noise the OCR’d text (Hakala et al., 2019).

Due to the nature of digitised texts, it is important to be able to process data without having a ground truth – a requirement that is uncommon for most ML tasks. There have been significant advances in this regard, with many models currently being able to correct OCR outputs in an unsupervised manner (Dong and Smith, 2018; Hämäläinen and Hengchen, 2019; Soni et al., 2019). In general, due to the complex nature of OCR systems, which combine many different fields of study, a large percentage of the novelties being introduced to the machine learning field are directly applicable to one or more of the architecture steps. However, as we will see next, problems arise when the domain on which OCR is applied changes, as most of the known techniques have been applied solely to modern day-to-day text corpora.
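As a concrete example of the evaluation step, OCR output is commonly scored against the ground truth with an edit-distance-based measure such as the character error rate (CER). A minimal sketch in plain Python (no external OCR tooling assumed):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                  # deletion
                current[j - 1] + 1,               # insertion
                previous[j - 1] + (ca != cb),     # substitution
            ))
        previous = current
    return previous[-1]

ground_truth = "the quick brown fox"
ocr_output = "tbe qu1ck brown f0x"

cer = levenshtein(ocr_output, ground_truth) / len(ground_truth)
print(f"character error rate: {cer:.2%}")   # 3 character errors over 19 characters
```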

2.5 Historical texts

Historical documents have long been neglected when it comes to NLP and digitisation efforts. In recent years, research has increased its efforts, mainly due to the significant advances in OCR which allowed some of the thousands of historical texts to be successfully digitised (Piotrowski, 2012), covering multiple different languages (Smith and Cordell, 2018). This is an important first step towards cultural heritage preservation; however, it remains only a first step, and there are still many challenges blocking the progress towards successful full digitisation. And even then, digitisation is only a necessary beginning. To enable intelligent search and navigation, the texts must be processed and enriched, which is harder for historical ones, as the OCR output there often has extremely low quality, at times making it unreadable. We describe several main differences between historical corpora and modern-day texts that are hindering OCR performance.

Quality of historical documents is often much worse compared to their modern counterparts. Very often the paper of a document starts degrading when it is centuries old, which can result in parts of the pages falling apart or starting to fade. This all leads to additional noise in the OCR process, usually enough to deceive current modern workflows.

Linguistic change is another major complication, which stems from the fact that all languages change over the years. Nowadays, many systems are able to return exploitable results from documents in modern-day English. However, when applied to manuscripts and newspapers in English originating from more than one or two centuries ago, most fail to extract any meaningful knowledge from them. Changes also happen in the writing itself, with many old books printed in Gothic script – an unknown type for current systems. Looking at linguistic change, it is natural to consider the old version of the language as a separate language and attempt to apply current techniques such as machine translation to bridge the gap. Unfortunately, this leads us to the next key difference when comparing with languages from our time.

Lack of data is blocking many advances, as current deep learning systems rely on significant amounts of raw data. Digitised resources, and in particular quality, clean ground truth corpora, are still scarce and insufficient for the demands of current big NLP architectures, thus slowing down necessary advances. Fortunately, more and more projects are being launched to help narrow one or more of these gaps. Similar to modern-day text collections such as Project Gutenberg (Project Gutenberg, 2020), many initiatives have been launched with the sole goal of improving the state of historical corpora and their accessibility. They bring us an unprecedented amount of digitised historical books and newspapers. Examples of such are (i) the NewsEye research project (NewsEye, 2020), which not only stores digitised historical texts but also supports competitions aimed at advancing research in this area; (ii) the IMPRESSO project (Ehrmann et al., 2020a), which aims to enable critical text mining of newspaper archives; (iii) the HIMANIS project (Bluche et al., 2017), aiming at developing cost-effective solutions for querying large sets of handwritten document images; and (iv) the IMPACT dataset (Papadopoulos et al., 2013), which contains over 600 000 historical document images; and more. As of this date, the community still lacks the knowledge required for a general approach towards processing historical texts. There have already been many works which successfully deal with such texts; however, all of them are focused specifically on the problem at hand and are unable to generalise when taken out of their original context. Gezerlis and Theodoridis (2002) and Laskov (2006) develop systems for recognising characters used in the Christian Orthodox Church Music notation, whereas Ntzios et al. (2007) presents a segmentation-free approach for recognising old Greek handwritten documents, especially from the early ages of the Byzantine empire.

Many systems have also been proposed which extract information from digitised historical documents, empowering document experts to focus on improving the quality of the input (Droettboom et al., 2002, 2003; Choudhury et al., 2006). Research has also been performed on languages which are not actively used nowadays, such as Latin (Springmann et al., 2014). Because of the amount of errors in historical OCR, attention was naturally also targeted at analysing their impact and the different ways they can be overcome (Holley, 2009; van Strien et al., 2020). Hill and Hengchen (2019) aimed to quantify the impact OCR has on the quantitative analysis of historical documents, Jarlbrink and Snickars (2017) investigated the different ways newspapers are transformed during the digitisation process, whereas Traub et al. (2018) showed the effects of correcting OCR errors during the post-processing stage. Being a necessity for information retrieval, named entity recognition (NER) is one of the important tasks for digitised historical texts. Ehrmann et al. (2016) analysed the complications: this task, while achieving near-perfect results for modern texts, is far from reaching even modestly compelling results when applied to their historical counterparts (Vilain et al., 2007).

2.6 Transfer learning for historical texts

There is much to be desired when it comes to digitising historical texts and, due to the great number of obstacles, progress is slow. However, transfer learning (§2.3) holds much promise, mostly because the problems it was designed to mitigate also occur in historical OCR'd corpora. Research is already benefiting from the power that knowledge transfer provides, with works focusing on measuring and representing semantic change (Shoemark et al., 2019; Giulianelli et al., 2020) and extracting named entity information (Labusch et al., 2019). Other works focused on transfer learning using CNN feature-based networks: Tang et al. (2016) investigated the applicability of transfer learning to historical Chinese character recognition, while Granet et al. (2018) investigated handwriting recognition. Recently, Cai et al. (2019) showed that applying a pre-trained generative adversarial network (GAN) can be useful, resulting in the TH-GAN architecture which proved effective for historical Chinese character recognition. Nevertheless, while there have been minor advances in the field, some questions remain open, mostly due to the challenges which historical corpora pose. It is still unclear when (i.e., for which tasks, languages, etc.) and how (i.e., with which approach) transfer learning can be successfully applied to historical collections. Given that most pre-trained language models have been trained on modern-day, high-resource languages (e.g., Wikipedia in English), their applicability to historical collections is not straightforward. In this work, we start bridging this gap by systematically assessing transfer learning for historical textual collections. To this end, we consider three tasks: the SemEval 2020 challenge on lexical semantic change evaluation (Schlechtweg et al., 2020), the ICDAR2019 Competition on Post-OCR Text Correction (Rigaud et al., 2019) and the CLEF-HIPE-2020 challenge on Named Entity Recognition, Classification and Linking (Ehrmann et al., 2020b). These tasks are of importance to practitioners as they directly influence the usability and accessibility of digitised historical collections.

Chapter 3

Empirical setup

Our aim is to assess the importance of pre-trained language models and whether such representations can be of use when applied to historical text corpora. We analyse different scenarios and environments and look into some of the most pressing challenges that the community currently faces. To this end, we introduce a generic setup that can support future research on historical corpora. This includes a highly modular embedding layer which allows us to activate and deactivate different embeddings, working on different levels of the data, so that we can further analyse their influence. We use task-specific inner and output layers and keep the core parts as similar as possible, which enables better tracking of results and progress. The general architecture of our model is explained in detail in §3.1. In §3.2 we describe some of the current problems that historical text representations face. We use these throughout the thesis as they allow us to test our hypotheses in distinctive scenarios, requiring us to rely on various state-of-the-art Natural Language Processing features developed for modern text and to test them in this different domain.

3.1 Model architecture

Our modular architecture allows us to make full use of comparable metrics across experiments. We stick to one general architecture when designing our models, which can be seen in Figure 3.1. This gives us the freedom to test different linguistic representations while still keeping the core parts of the model uniform across different tasks and languages. We further split the model into four distinguishable parts.

3.1.1 Input

This layer simply represents the data being used. It depends on the challenge itself and can be a masked or unmasked sequence of text tokens or characters.

3.1.2 Embedding layer

We group the modules responsible for building representations of the input data into the embedding layer. These are further split into two sub-groups based on their origin: (i) pre-trained (transferred) representations and (ii) newly trained representations. The former groups modules which use pre-trained models to generate embeddings. These can either have frozen weight matrices and be used only for feature extraction, or have their weights trained (more specifically, fine-tuned) along with the rest of the model. The other sub-type, newly trained representations, covers embedding matrices which are usually initialised randomly and learned from scratch along with the other parts of the model at hand. These do not leverage existing knowledge and try to learn task-specific features from scratch.
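The following is a minimal sketch of such a modular embedding layer, not the thesis implementation: the class name, dimensions and the way the pre-trained part is passed in are illustrative assumptions. It shows the two sub-groups side by side, with the pre-trained part optionally frozen for feature extraction.

```python
import torch
import torch.nn as nn

class ModularEmbedding(nn.Module):
    """Combines a newly trained embedding with an optional pre-trained one."""

    def __init__(self, vocab_size, new_dim=128, pretrained: nn.Module = None,
                 pretrained_dim=0, freeze_pretrained=True):
        super().__init__()
        # newly trained representation, learned from scratch
        self.new_emb = nn.Embedding(vocab_size, new_dim)
        # optional pre-trained (transferred) representation: any module mapping ids to vectors
        self.pretrained = pretrained
        if pretrained is not None and freeze_pretrained:
            for p in pretrained.parameters():
                p.requires_grad = False  # feature extraction only
        self.output_dim = new_dim + pretrained_dim

    def forward(self, token_ids):
        parts = [self.new_emb(token_ids)]
        if self.pretrained is not None:
            parts.append(self.pretrained(token_ids))
        # concatenate all active representations along the feature dimension
        return torch.cat(parts, dim=-1)

emb = ModularEmbedding(vocab_size=30000, new_dim=128,
                       pretrained=nn.Embedding(30000, 300), pretrained_dim=300)
print(emb(torch.randint(0, 30000, (2, 10))).shape)  # torch.Size([2, 10, 428])
```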

Figure 3.1: General model architecture for assessing transfer learning on a variety of tasks

To give more flexibility to these two sub-groups, each of them can further represent different granularities of the input data, depending on the specific task and on whether a given granularity is helpful. Motivated by previous research, we use three granularity types: (i) word-, (ii) sub-word- and (iii) character-level embeddings (Mikolov et al., 2012; Kudo, 2018; Kudo and Richardson, 2018). They can be used on their own or in combination with each other. Whenever more than one granularity is used at the same time, we introduce different processes for merging the information. For simplicity, we group these processes based on the desired output type, i.e. based on the task itself: there are tasks where we want to work at character, sub-word or word level. Here we explain how the merging occurs for each of these cases.

Character-level output works in a similar manner to the sub-word approach. We concatenate sub-word and word tokens to each character representation, often repeating a single sub-word or word token multiple times, thus enriching the final representation with more information.

Sub-word-level output allows us to use the same approach as explained above for word embeddings, repeating one word representation over multiple sub-word ones. As for the characters, in cases where the character embeddings are produced by a neural network, we simply set the character output size to the sub-word embedding size. In all other cases, we take the average of the embeddings of all characters that occur in a specific sub-word token. In the end, we always concatenate the character embeddings to the sub-word embedding representation.

Word-level output requires us to represent one word by multiple sub-word and character embeddings. In such cases, we take the average of the character and the sub-word embeddings of the word respectively and concatenate them to the existing word representation.
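A hedged sketch of the word-level merging described above: character and sub-word embeddings belonging to one word are averaged and concatenated to the word embedding itself. The helper inputs char_spans and subword_spans (index lists per word) are hypothetical names introduced for illustration only.

```python
import torch

def merge_to_word_level(word_emb, char_emb, subword_emb, char_spans, subword_spans):
    merged = []
    for w, c_idx, s_idx in zip(word_emb, char_spans, subword_spans):
        c_avg = char_emb[c_idx].mean(dim=0)      # average of the word's character embeddings
        s_avg = subword_emb[s_idx].mean(dim=0)   # average of the word's sub-word embeddings
        merged.append(torch.cat([w, s_avg, c_avg], dim=-1))
    return torch.stack(merged)

# Example: 2 words covered by 5 characters and 3 sub-words
word_emb = torch.randn(2, 16)
char_emb = torch.randn(5, 8)
subword_emb = torch.randn(3, 12)
out = merge_to_word_level(word_emb, char_emb, subword_emb,
                          char_spans=[[0, 1, 2], [3, 4]], subword_spans=[[0, 1], [2]])
print(out.shape)  # torch.Size([2, 36])
```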

3.1.3 Task-specific model and evaluation

This part is, as the name suggests, highly specific to the task itself. For each task we incorporate a state-of-the-art approach, which allows us to quickly reach promising results and to focus on embeddings and model transfers. We explain which models are used at this stage for the different challenges, and the motivation behind our choices, in §3.2. This part also covers the evaluation of the outputs of our system, where we assess the different characteristics of the architecture and analyse how helpful its different parts are. Target values and predictions are compared, and performance changes are analysed when different bits of information are used, where those bits may come from a pre-trained model or from the addition of another part to the model.

3.2 Problems

We use challenges that the community currently faces when working with historical corpora. Most existing systems achieve very promising results when applied to modern-day texts, but when turned towards past documents that are prone to OCR errors and linguistic change, their performance drops significantly. We pick three of the most pressing research challenges focused on historical NLP – Named Entity Recognition and Classification (NERC) (§3.2.1), post-OCR correction (§3.2.2) and Lexical Semantic Change (LSC) (§3.2.3). All of these were organised as shared tasks, and we actively participate in two of them – NERC and LSC. This allows us to better compare our results and models, not just against baseline models and scoring systems, but also against other competitors.

3.2.1 Named entity recognition

Named entity (NE) processing has become an essential component of the Natural Language Processing field since being introduced as a problem some decades ago. Recently, the task has seen major improvements thanks to the inclusion of novel deep learning techniques and the use of learned representations (embeddings) (Akbik et al., 2018; Lample et al., 2016; Labusch et al., 2019).

3.2.1.1 Motivation

Named entity recognition and classification (NERC) is also of great importance in the digital humanities and cultural heritage. However, applying existing NERC techniques is made challenging by the complexities of historical corpora (Sporleder, 2010; van Hooland et al., 2015). Crucially, transferring NE models from one domain to another is not straightforward (Vilain et al., 2007) and in many cases performance is consequently greatly impacted (van Strien et al., 2020). In an attempt to address some of these problems, this year's CLEF-HIPE-2020 challenge (Ehrmann et al., 2020b), in which we actively participate, was introduced. It aims to tackle three of the most pressing problems around NE processing, particularly:

• Strengthening the robustness of existing approaches on non-standard input;
• Enabling performance comparison of NE processing on historical texts; and, in the long run,
• Fostering efficient semantic indexing of historical documents in order to support scholarship on digital cultural heritage collections.

3.2.1.2 Tasks

The organisers provide two tasks, one focused on Named Entity Recognition and Classification (NERC) and one on Named Entity Linking (NEL). For the purpose of our research, we decide to focus only on the first, as it allows us to analyse the effects of using information from pre-trained networks; the latter belongs to the Information Retrieval (IR) field, which remains beyond the scope of this work.

The task on which we evaluate our models is further split into two sub-tasks which differ in difficulty and in the class types that are used. Sub-task 1 concerns coarse-grained types, more specifically the recognition and classification of entity mentions according to coarse-grained types. Those are the most general representations of the entities, most often a grouping of multiple sub-types – for example, loc represents all location entities and sub-entities. This sub-task includes the literal sense and, where it applies, also the metonymic one. Metonymy is a figure of speech in which a specific thing is referred to by the name of something closely related to it. Sub-task 2 targets fine-grained entity types. Following the previous example, we now have detailed sub-entities such as loc.adm.town, which corresponds to an administrative town, or loc.adm.nat, which corresponds to an administrative nationality, etc. In the second sub-task, recognition and classification of fine-grained types must be performed, again including the literal and – where it applies – the metonymic sense. Additionally, detection and classification of nested entities of depth one and of entity mention components (title, function, etc.) is required. Table 3.1 shows a comparison of the two sub-tasks and the difference in expected predictions.

                                   Sub-task 1   Sub-task 2
NE mentions with coarse types      yes          yes
NE mentions with fine types        no           yes
Consideration of metonymic sense   yes          yes
NE components                      no           yes
Nested entities of depth one       no           yes

Table 3.1: NERC sub-task comparison

3.2.1.3 Data

The data consists of Swiss, Luxembourgian and American historical newspapers written in French, German and English respectively, organised in the context of the IMPRESSO project (Ehrmann et al., 2020a).

Figure 3.2: Amount of tokens per decade and language

For each newspaper, articles were randomly sampled among articles that (i) belong to the first years of a set of predefined decades covering the life-span of the newspaper, and (ii) have a title, have more than 50 characters, and belong to any page (no restriction to front pages only). For each decade, the set of selected articles was additionally triaged manually in order to keep journalistic content only. The data spans the decades from 1790 to 2010, and the OCR quality corresponds to a real-life setting, i.e. it varies according to digitisation time and archival material. The amount of tokens per decade and per language is shown in Figure 3.2. In total we have 569 articles and 1 894 741 characters, with a vocabulary of 151 unique characters. Additionally, the amount of mentions, i.e. entity occurrences per decade, broken down per coarse-grained type, is shown in Figure 3.3.

Unfortunately, no training dataset is provided for English, only an evaluation one. For this reason we use the Annotated Corpus for Named Entity Recognition built on top of the Groningen Meaning Bank (GMB)1. This dataset is annotated specifically for training NER classifiers and contains most of the coarse-grained tag types which occur in the English evaluation dataset provided by the organisers. We consolidate some tags with the same meaning but different labels ourselves. The dataset contains in total 1 354 149 tokens, of which 85% are originally labelled as O. We convert the tag types that are not part of this challenge to “Other” as well, resulting in a total of 94.92% of tokens having O literal tags.

Figure 3.3: Amount of tag mentions per decade and tag

Another important characteristic of the data is the abundance of non-entity tokens. More specifically, tokens labelled as “Other” or O make up a large share of each of the different tag types and of each of the available languages. We show the exact numbers for the training datasets in Figure 3.4. In total, 94.92%, 95.95% and 96.5% of all labels are non-entity labels for English, French and German respectively. This increases the difficulty of learning some of the representations, especially for scarce tag types such as the nested and the two metonymic ones, where more than 99.5% of tokens are non-entities.

Figure 3.4: Non-entity tokens per tag and language in the training datasets

The annotation was made by native speakers using the INCEpTION annotation platform (Klie et al., 2018). Before starting, collaborators were first trained on a ‘mini-reference’ corpus – consisting of 10 content items per language – in order to ensure their understanding of the guidelines. Additionally, some items of the test set were double-annotated and adjudicated, as well as randomly sampled items among the training and development sets.

1https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus [accessed 2020-07-16].

To showcase all entity types and compare their properties, we visualise them in Table 3.2. The data is released in the IOB format (Inside-Outside-Beginning), which we also use during training and evaluation. This format is derived in a similar fashion to arguably the most popular modern-day format used for NE processing, namely the CoNLL-U format1.

We list some additional design choices made by the organisers which also affect our development choices. No sentence splitting, nor sophisticated tokenization, is applied to the input data. Given the amount of existing noise, the organisers provide all necessary information to rebuild the OCR text. Tokenization is simple white-space splitting, leaving all punctuation signs (including apostrophes) as separate tokens. In addition, some of the tokens carry a specific flag called NoSpaceAfter which indicates which tokens are parts of full words.

We are also given a pre-trained in-domain FastText (Joulin et al., 2016) model by the organisers, trained on the level of sub-words by keeping character n-grams of length between 3 and 6. It is available for all three languages and trained on text material from the IMPRESSO corpus which covers the newspapers and time periods that are part of the training and development sets. We use it to extract features and include those alongside the newly trained and BERT representations. We further experiment with its effect by switching it on and off and comparing results. The organisers also provide Flair embeddings (Akbik et al., 2019) which we do not use here.
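As an illustration of the IOB scheme mentioned above, the following made-up example shows how each token carries a tag marking whether it Begins, is Inside, or is Outside an entity of a coarse type such as loc; the sentence and tags are invented for illustration and not taken from the challenge data.

```python
tokens = ["He", "arrived", "in", "New", "York", "yesterday"]
tags   = ["O",  "O",       "O",  "B-loc", "I-loc", "O"]

# print one token-tag pair per line, as in a column-based NER file
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```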

Coarse-grained tag set   Fine-grained tag set                             Metonymy applies   Entity nesting applies
pers                     pers.ind, pers.coll, pers.ind.articleauthor      yes                yes
org                      org.adm, org.ent, org.ent.pressagency            yes                yes
prod                     prod.media, prod.doctr                           yes                no
time                     time.date.abs                                    no                 no
loc                      loc.add.elec, loc.add.phys, loc.adm.town,        yes                yes
                         loc.adm.reg, loc.adm.nat, loc.adm.sup
                         loc.phys.geo, loc.phys.hydro, loc.phys.astro,    no                 no
                         loc.oro, loc.fac, loc.unk

Table 3.2: NERC entity types comparison

3.2.1.4 Evaluation

Named Entity Recognition and Classification is evaluated in terms of macro- and micro-precision, recall and F1-measure. Two evaluation scenarios are considered: (i) strict (exact boundary matching) and (ii) relaxed (fuzzy boundary matching). Each column is evaluated independently, according to the micro-averaged Precision, Recall and F1-score at entity level (not at token level), i.e. considering all true positives, false positives, true negatives and false negatives over all documents. This

1https://universaldependencies.org/format.html

is done both for strict and fuzzy matching, as well as separately per type and cumulatively over all types. It must be noted that evaluation is performed only on entity tokens (inside or at the beginning); tokens that are predicted as, or are originally, outside (O) are left out. Furthermore, we compare the differences between our own models by looking at these metrics, as well as at convergence speed and parameter counts. The organisers provide a simple baseline model based on the sklearn crfsuite library1, using manually hand-crafted features. We compare our results to it and show them in §4.1.

3.2.1.5 Model

Our initial model is an LSTM-CRF for Named Entity Recognition as proposed by Lample et al. (2016), with some additional simplifications such as dropping the tanh layer after the LSTM. The architecture can be seen in Figure 3.5. The embedding layer combines four different types of embeddings, which are all later tested for their importance.

Figure 3.5: NERC base model architecture

BERT embeddings work on sub-word level and use a pre-trained or fine-tuned version of BERT. We use bert-base-multilingual-cased, bert-base-german-cased (DeepSet, 2020) and bert-base-cased for French, German and English respectively. Whether the BERT model remains frozen and is only used for feature extraction, or is fine-tuned to produce in-domain embeddings, is a configuration option decided at the beginning of training. We show the differences between the two later on. The size of these embeddings is 768. We use the HuggingFace Transformers library (Wolf et al., 2020) for incorporating BERT into our system, as it is the most popular and obvious choice. This has a specific limitation of only working

1https://sklearn-crfsuite.readthedocs.io/en/latest [version used: 0.3.6, last accessed: 2020-07-24].

over sequences with a maximum length of 512. As our text sequences are usually longer, we implement a sliding-window splitting of the sequences before passing them through BERT. That is, we split each sequence longer than the maximum length into separate chunks, each of which respects the limit. Unlike regular splitting, we keep the first and last N positions of each chunk overlapping with the last and first positions of the previous and next chunks respectively. We pick a sufficient N = 5. After embedding each chunk, we concatenate the chunks together again, using a special operation for the representations of the overlapping positions. We try several different procedures, but end up taking the mean value of the overlapping embeddings.
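The following is a simplified sketch of this sliding-window procedure under the assumption that the input is a flat sequence of positions and the embedded chunks are PyTorch tensors; function names and the chunk-length choices are illustrative, not the thesis implementation.

```python
import torch

def split_with_overlap(seq, max_len=512, overlap=5):
    """Split a sequence into chunks of at most max_len, overlapping by `overlap` positions."""
    chunks, start = [], 0
    while start < len(seq):
        chunks.append(seq[start:start + max_len])
        if start + max_len >= len(seq):
            break
        start += max_len - overlap  # next chunk re-uses the last `overlap` positions
    return chunks

def merge_chunks(embedded_chunks, overlap=5):
    """Stitch embedded chunks back together, averaging the overlapping embeddings."""
    merged = embedded_chunks[0]
    for chunk in embedded_chunks[1:]:
        shared = (merged[-overlap:] + chunk[:overlap]) / 2
        merged = torch.cat([merged[:-overlap], shared, chunk[overlap:]], dim=0)
    return merged

chunks = split_with_overlap(list(range(1200)))
print([len(c) for c in chunks])  # [512, 512, 186]
```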

Newly trained embeddings also work on sub-word level, but their weight matrices are randomly initialised and learned during training instead. They use the same vocabulary as BERT. The size of these embeddings is a hyper-parameter and ranges between 64 and 512.

In-domain embeddings use the pre-trained in-domain model provided by the organisers, for feature extraction only. We choose not to fine-tune it as it is already trained on data from the same domain. These embeddings have size 300 and again work on sub-word level. The model is pre-trained with the FastText library (Joulin et al., 2016) and is used in a similar way in our architecture.

Character embeddings follow the original implementation of the model and consist of an embedding layer followed by an RNN; in our implementation, we use a bi-directional LSTM. The embedding layer size and the LSTM hidden size are both hyper-parameters, with values ranging from 16 to 128 and from 16 to 256 respectively. We tune these and report the final values in §4.1.4. We use custom character-level vocabularies for each language, built from the training and validation datasets.

The Bi-LSTM is implemented as in the original paper, with the only difference being the additional pre-trained embeddings which we concatenate to the newly trained and character ones. For the concatenation we again test different approaches and find that the simplest and fastest one, while keeping similar results, is simply stacking the embeddings on top of each other, resulting in sub-word embeddings of size at most 1836. Given that the Bi-LSTM works on sub-word level, we merge the sub-words into words before feeding the resulting representation through a fully connected layer which outputs tag probabilities for each token. We tested merging sub-words at concatenation time before the Bi-LSTM, in addition to merging them after the Bi-LSTM or not merging at all, and found that merging after the Bi-LSTM works best. Our results follow the ones from Ruder (2019), where similar findings were reported.

A Conditional Random Field (CRF) (Lafferty et al., 2001) is finally applied over the produced tag probabilities; it learns information about the transitions between different tags and decodes the final tag predictions. We test our model on the modern-text CoNLL-2003 dataset (Sang and De Meulder, 2003), arguably one of the most used datasets for evaluating and comparing NER models. We achieve results comparable to the current state of the art, which leads us to believe that the model can be successfully used on historical corpora.

Multi-task transfer learning is introduced to analyse its importance in this setup. We transform the original model by introducing additional output heads, one for each of the entity tag types that the task aims to predict. The final two layers of the original model, namely the fully connected layer and the conditional random field, are now instantiated per task: they are used separately for each tag type, whereas all the previous layers up to and including the Bi-LSTM are shared. The updated model can be seen in Figure 3.6. In the case of the CLEF-HIPE-2020 challenge, at most six entity tag types are predicted for each token, which is also the maximum number of tasks that we train simultaneously.
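A minimal sketch of this multi-task variant, assuming a shared embedding layer and Bi-LSTM with one output head per tag type; the class name, tagset sizes and the omission of the per-task CRF decoder (kept out here for brevity) are all illustrative choices rather than the exact thesis configuration.

```python
import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    def __init__(self, embedding: nn.Module, emb_dim, hidden, tagsets: dict):
        super().__init__()
        self.embedding = embedding                      # shared (possibly pre-trained) embeddings
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        # one fully connected head per tag type (coarse literal, coarse metonymic, ...)
        self.heads = nn.ModuleDict(
            {name: nn.Linear(2 * hidden, n_tags) for name, n_tags in tagsets.items()}
        )

    def forward(self, token_ids):
        shared, _ = self.lstm(self.embedding(token_ids))
        # each head produces its own tag scores; a per-task CRF would decode these
        return {name: head(shared) for name, head in self.heads.items()}

model = MultiTaskTagger(nn.Embedding(1000, 64), emb_dim=64, hidden=128,
                        tagsets={"coarse_lit": 21, "coarse_meto": 21, "nested": 21})
scores = model(torch.randint(0, 1000, (2, 30)))
print({name: s.shape for name, s in scores.items()})
```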

3.2.1.6 Training

We want to investigate, in as much detail as possible, how different configurations, embedding combinations and the transition from single- to multi-task learning affect the performance of our model and of its different components.

Figure 3.6: NERC multi-task model architecture

To this end, we introduce an extensive set of hyper-parameters which are all configurable before or during training and allow us to solidify our knowledge. The full list can be seen in Table 3.3. Below, we explain in detail some of the available options that we investigate.

As part of a regular text pre-processing stage, we consider the option of replacing numbers with a specific character or number. Following previous work where this has been experimented with, as well as the baseline model provided by the organisers, we pick zero as the replacement number. We see later that this provides benefits during training.

In the structure provided originally, each document is split into multiple segments, where one segment usually corresponds to one line in the original historical document. As straightforward options we consider segment-level and document-level splitting of the input sequences. Using the segment level leads to much faster convergence due to the much smaller lengths of the input sequences. Still, document-level splitting, which is also the approach used by the baseline model, yields significantly better results. We therefore further analyse the importance of splitting by introducing a multi-segment option which combines more than one consecutive segment but fewer than all segments available in one document. We set the maximum length of one multi-segment to the maximum length allowed by the HuggingFace Transformers library (Wolf et al., 2020) – currently the most used library for working with pre-trained BERT models. Note that this is not a real limitation, as we work around it at document level; here we choose it purely for simplicity.

We do not lowercase, nor do we remove any punctuation or other characters during the pre-processing stage.

For the amount of tag types simultaneously trained, we can pick any of the six tag types available in the task. We also hypothesise that there is a benefit in picking exactly the tags that a single sub-task uses. For example, in the case of the coarse-grained sub-task, those would be the literal and metonymic coarse tag types. As for the fine-grained sub-task, the system looks at the literal and metonymic fine tag types, as well as at the nested entity and component tag types. We include these two combinations in the set of available options, and finally also add the possibility of using the full set of tag types in the model. We end up comparing multi-task learning on all tag types against single-task learning per tag type.

Parameter name                                            Value options
Model hyper-parameters
  amount of tag types simultaneously trained              one of 6 tags; coarse (literal and metonymic);
                                                          fine (literal, metonymic, nested and component);
                                                          all 6 tags
  replace numbers during pre-processing                   yes/no
  sequences split type                                    segment/multi-segment/document
  use of character embeddings                             yes/no
    characters embedding layer size                       16/32/64/128
    characters RNN hidden size                            16/32/64/128/256
  use of newly trained sub-word embeddings                yes/no
    newly trained sub-word embedding layer size           16/32/64/128
    newly trained sub-word embedding layer dropout        0/0.2/0.5/0.8
  use of pre-trained (FastText) embeddings                yes/no
  use of pre-trained (BERT) embeddings                    yes/no
    weights usage type                                    fine tune/freeze
    fine tune type                                        from beginning/after initial convergence
    fine tune BERT learning rate                          same as global learning rate/1e−3/1e−4
  LSTM options
    hidden size                                           128/256/512
    dropout                                               0/0.2/0.5/0.8
    directionality                                        bi-directional/uni-directional
    number of layers                                      1/2
  use of manually crafted features                        yes/no
  use of weighted loss                                    yes/no
Training
  Optimiser                                               SGD/Adam/AdamW
  learning rate                                           1e−2/1e−3/1e−4

Table 3.3: NERC hyper-parameters

There are two possibilities when using pre-trained BERT embeddings – we either fine-tune or freeze the weights of the model from which we extract features. Fine-tuning introduces two additional configuration options. The first concerns when fine-tuning starts. Most often it is performed from the beginning, with the pre-trained model fine-tuned along with the full NER model. We additionally test fine-tuning after initial convergence of the full model: in this case, we keep the pre-trained weights frozen at the beginning of training and only unfreeze and fine-tune them after initial convergence. We investigate whether the observations reported in the literature change when switching to a historical dataset. The literature, as well as our experiments, shows that BERT is best fine-tuned with a very low learning rate; however, for our main model we find that a high global learning rate is most beneficial. We therefore add another fine-tuning decision: whether to use the same global learning rate for the pre-trained BERT model or a separate one. We show later how this choice affects results and whether it makes sense to focus resources on this hyper-parameter.

We also introduce manually crafted features to the model, based on the existing literature (Ghaddar and Langlais, 2018), and test their importance for our system. We use AllLower, AllUpper, IsTitle, FirstLetterUpper, FirstLetterNotUpper, IsNumeric and NoAlphaNumeric. We find that they do not give us any performance boost. We believe that this is due to the knowledge already being

encoded as information in the neural network. What is more, in some cases and configurations they actually hurt performance. We hypothesise that, because of the complexity of the model, they cannot provide anything important, while their simplicity sometimes means that their encoding confuses the inner layers of the network.

The nature of most NER tasks, as well as of the current dataset, means that most of the ground truth consists of outside tags, usually labelled simply as O. In our case, these make up approximately 94.92%, 95.95% and 96.5% of the total tokens for English, French and German respectively. To counteract this tag imbalance, we test a weighted loss, which we plug into the CRF layer, putting more weight on tokens which are predicted as outside but are in fact part of entities, and less weight on tokens which are predicted as inside an entity but are actually outside. Needless to say, we only do this during training to avoid adding unwanted bias. We find that this weighted loss does not help and in fact hinders performance by a significant margin; we therefore leave it switched off.
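The exact weighting scheme used in the thesis is not specified here, so the following is only a hedged sketch of one common way to derive per-tag weights for such an imbalanced setting: a simple inverse-frequency scheme that gives rare entity tags proportionally larger weights than the dominant O tag. How these weights are then injected into the CRF loss is left out.

```python
from collections import Counter

def inverse_frequency_weights(tag_sequences):
    """Return a weight per tag, inversely proportional to how often the tag occurs."""
    counts = Counter(tag for seq in tag_sequences for tag in seq)
    total = sum(counts.values())
    return {tag: total / (len(counts) * count) for tag, count in counts.items()}

weights = inverse_frequency_weights([["O", "O", "B-loc", "I-loc", "O"],
                                     ["O", "B-pers", "O"]])
print(weights)  # the dominant "O" tag receives the smallest weight
```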

Looking at the pure training hyper-parameters, we compare the SGD, Adam and AdamW optimizers. We notice that both Adam and AdamW lead the model to converge significantly faster than SGD. Final performance is similar, but we end up using AdamW for its regularisation improvements (Loshchilov and Hutter, 2018). For the learning rate we see that higher values benefit the model more, whereas when using SGD, lower values tend to produce better results. We use the default value of 0 for momentum – valid only when using SGD – and pick a default value of 1e−8 for weight decay for all optimizers.

We show in full in §4.1.4 which of the aforementioned training settings lead to the best results, together with comparisons of different setups.

3.2.2 Post-OCR correction

OCR tools have become a well-established group of systems over the past few decades and have already benefited many different areas of human research. While the most obvious benefits concern the digitisation of texts, OCR is also used in self-driving cars, self-service stores and even in assisting visually impaired people. Yet while OCR already works well in many cases, when even small amounts of noise are introduced to the input, something as simple as an obscured letter, many issues arise at the processing stage.

3.2.2.1 Motivation

Looking at historical texts and today's digital libraries, we find many transcriptions of below-average quality, and ancient manuscripts and books with varying levels of conservation and layouts unlike anything a modern OCR tool has ever seen. This leads to OCR quality for historical newspapers that differs vastly from modern OCR results. However, the quality of OCR'd texts is crucial for achieving stable NLP performance (Traub et al., 2015; Linhares Pontes et al., 2019; Chiron et al., 2017a). As a result, a trend can be observed where old documents are ignored and priority is given to newly arriving documents, which continue to grow in number every day. This leads to historical pieces being left out and sometimes even ruined by the time they are picked up for preservation.

Noisy OCR can be the result of the acquisition process, of the document's conservation state or even of some of its properties, such as the use of worn-out types (Smith and Cordell, 2018). This makes post-processing techniques, such as post-OCR correction, potentially important, as they could overcome some of the noise introduced during OCR. However, still relatively little work has been devoted to this area. It has been found that working at narrower linguistic levels, such as focusing on characters or sub-words instead of words, can lead to better results (Bojanowski et al., 2017).

Specifically, the ICDAR2019 Competition on Post-OCR Text Correction (Rigaud et al., 2019) was launched, aiming to advance the field and to help relieve the pressure on systems tasked with digitising ancient texts. Two main sub-tasks are proposed to further clarify the goals: (i) the detection of OCR errors and (ii) the correction of OCR errors. Given the nature of our research and the fact that our focus is on pure text processing and not information retrieval, we focus our analysis on the latter sub-task and leave out the former.

3.2.2.2 Data

The dataset supplied with the challenge consists of noisy OCR of printed text from different sources and languages (English, French, German, Finnish, Spanish, Dutch, Czech, Bulgarian, Slovak and Polish). We focus on a small subset of those languages, namely English, German and French, which together amount to 13 628 files and 17 884 116 characters. Full details about the data split can be seen in Table 3.4. Note that German and French are further split into different subsets depending on where the data is taken from. The corresponding ground truth for these languages comes from initiatives such as HIMANIS (Bluche et al., 2017), IMPACT (Papadopoulos et al., 2013), IMPRESSO (Ehrmann et al., 2020a) and RECEIPT (Artaud et al., 2018). We use the original split, where 80% of the data is used for training and 20% is reserved for evaluation, which we also use to test our performance and compare against the original participants.

                                        Characters count
Language   Subset   File count   Total        Trainset     Testset
German     DE1      102          575 416      460 333      115 083
           DE2      200          494 328      309 973      106 862
           DE3      7623         10 018 258   8 014 606    2 003 652
           DE4      321          509 757      407 806      101 951
           DE5      654          818 711      654 969      163 742
           DE6      773          935 014      748 011      187 003
           DE7      415          527 845      422 276      105 569
English    EN1      200          243 107      189 085      52 112
French     FR1      1172         2 792 067    2 233 654    558 413
           FR2      200          227 039      170 569      52 989
           FR3      1968         742 574      594 059      148 515

Table 3.4: ICDAR 2019 Data split

Each of the included files contains three versions of a text sequence: (i) OCR_toInput, which contains the raw OCR'd text; (ii) OCR_aligned, which contains the same raw version but aligned to the ground truth character by character; and (iii) GS_aligned, which contains the ground truth. Note that this version might have some or all of its characters removed. A sample of such a structure is displayed in Table 3.5. Due to the nature of our work, we focus on the aligned OCR version of the text and leave out the raw one. This allows our models not to spend additional resources on learning to align sequences.

Version        Text
OCR_toInput    sail sliort of that than they who
OCR_aligned    sail sliort of that than they who@@@@@@@
GS_aligned     fall s@hort of that than they who aspire

Table 3.5: ICDAR 2019 Data sample

As seen in the data split, the English and French subsets contain significantly less data than the German one. To counteract this, we make use of data from a previous edition of this challenge, namely the ICDAR2017 Competition on Post-OCR Text Correction (Chiron et al., 2017b). This adds 12 000 000 OCR'd characters along with the corresponding ground truth. English and French have an equal share of the characters, approximately 6 000 000 each. The documents come from several available digital collections, among them the National Library of France (BnF) and the British Library (BL). The ground truth tokens come from BnF's internal projects and external activities such as Project Gutenberg (Project Gutenberg, 2020), Europeana Newspapers, IMPACT (Papadopoulos et al., 2013) and Wikisource. There is an existing

split of 80% of the data originally reserved for training and validation purposes and 20% used by the organisers for evaluation purposes. In order to combine as much data as possible, we use the original evaluation split for training as well. We include one final dataset for English, making use of the OverProof service1, which combines publications of historical newspapers coming from the National Library of Australia's Trove newspaper archive2 and randomly selected articles from several issues and pages of the Library of Congress Chronicling America3. The service uses ABBYY's FineReader tool to digitise the texts, which range from 1842 to 1954. Full details about the data can be found on OverProof's website4.

3.2.2.3 Evaluation

Evaluation is performed using the weighted sum of the Levenshtein distances (Levenshtein, 1966) between the correction candidate tokens and the corresponding ground truth. Originally, the weights make it possible to include more than one prediction, e.g. the top 3 candidates. In our setup, as we always work with the top prediction only, we do not use weighting at all. The sum is finally used to calculate a percentage of improvement, which is what the organisers originally report and what we report in our results. It is measured by comparing the summed distance without any replacement against the one with the predicted replacements. It must be noted that during the original evaluation phase of the ICDAR 2019 challenge, a decision was made to leave out tokens which contain hyphens, due to the significantly higher complexity they introduce. Because we do not focus on finding erroneous tokens (i.e. the first sub-task), we work over the whole dataset, splitting it into equal sequences of 50 characters each. This brings the benefit of being able to use the additional datasets previously mentioned, namely ICDAR 2017 and OverProof. The OverProof dataset is left in its original shape without being split, as it is already aligned line by line, which is enough for our experiments. However, this also means we are no longer able to fully compare against the original results reported in the challenge.
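The sketch below illustrates the core of this measure under simplifying assumptions (single top-1 correction, no token weighting, no exclusion of hyphenated tokens): the improvement is the relative reduction of the summed Levenshtein distance to the ground truth when the predicted corrections replace the raw OCR.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def improvement(ocr_texts, corrected_texts, ground_truths) -> float:
    """Percentage of improvement over leaving the OCR output untouched."""
    before = sum(levenshtein(o, g) for o, g in zip(ocr_texts, ground_truths))
    after = sum(levenshtein(c, g) for c, g in zip(corrected_texts, ground_truths))
    return 100.0 * (before - after) / before

print(improvement(["sail sliort"], ["fall short"], ["fall short"]))  # 100.0
```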

3.2.2.4 Model

Our model is based on an encoder-decoder translation architecture with an attention mechanism, closely following the one introduced by Bahdanau et al. (2016). We consider the uncorrected post-OCR version of the text and the corrected one as two separate languages, thus allowing us to use a translation model. A key difference to traditional multi-lingual setups is that we rely on the same character and sub-word vocabularies for input and output instead of having a vocabulary per language. We also use our own modular embedding layer, which combines the different representations of the input sequences. Note that each of the representations can be turned on and off, as long as at least one remains active during training. We test different combinations and report whether some are more valuable than others. In addition, depending on the configuration, the embedding layer is either instantiated twice – once in the encoder and once in the decoder – or shared between the two.

Newly trained embeddings come from an embedding layer which is randomly initialised at the beginning. They use a character vocabulary built over the training data used for this task.

BERT pre-trained embeddings are used as a way of introducing previously learned representations. Their weights can be either frozen or fine-tuned, and we investigate both approaches. Due to the nature of BERT, they work on sub-word level, whereas our embedding layer outputs character-level embeddings. We therefore use a special form of concatenation where each sub-word embedding is repeated over all the characters it contains. An example of this approach can be seen in Figure 3.7. We use different pre-trained models depending on the language: as with NER, we use bert-base-cased for English, bert-base-german-cased for German and bert-base-multilingual-cased for French.
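A simplified sketch of the concatenation shown in Figure 3.7: each sub-word embedding is repeated over the characters it covers and concatenated to the corresponding character embeddings. The helper input chars_per_subword (the number of characters in each sub-word) is an illustrative assumption, not part of the thesis code.

```python
import torch

def subword_to_char(subword_emb, char_emb, chars_per_subword):
    # repeat each sub-word embedding once per character it contains
    repeated = torch.cat([sw.expand(n, -1) for sw, n in zip(subword_emb, chars_per_subword)],
                         dim=0)
    # concatenate character and repeated sub-word embeddings along the feature dimension
    return torch.cat([char_emb, repeated], dim=-1)

subword_emb = torch.randn(2, 768)   # e.g. BERT sub-word embeddings for two sub-words
char_emb = torch.randn(7, 32)       # character embeddings for the same 7 characters
out = subword_to_char(subword_emb, char_emb, chars_per_subword=[3, 4])
print(out.shape)  # torch.Size([7, 800])
```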

In-domain representations are initialised from the embeddings pre-trained on historical corpora that we originally use in §3.2.1. They are kept frozen during training and used purely as a feature

1https://overproof.projectcomputing.com/ 2https://trove.nla.gov.au/ 3https://chroniclingamerica.loc.gov/ 4https://overproof.projectcomputing.com/evaluation

extraction mechanism. We adopt this encoded knowledge and analyse its importance for this task as well, given the in-domain knowledge it already encodes. As these embeddings originally work on sub-word level, we use the same operation as with BERT to scale them down to character level.

Figure 3.7: Post-OCR correction model, encoder sub-word and character embedding concatenation

All character representations are finally concatenated. We use a simple vertical concatenation due to its significant speed benefits and almost no downsides compared to more complicated heuristics such as using another RNN or CNN to combine the embeddings. The concatenated version is then used in the encoder, which produces a sequence of hidden states $h_1, \ldots, h_M$, one for each embedding. The encoder in our case is a bi-directional recurrent neural network – in particular a bidirectional Gated Recurrent Unit (GRU) (Cho et al., 2014), as proposed in the original paper. The result of the Bi-GRU, being bidirectional, contains two representations – one left-to-right and one right-to-left:

$$\overrightarrow{h} = \left[\overrightarrow{h}_1, \ldots, \overrightarrow{h}_M\right] \quad \text{and} \quad \overleftarrow{h} = \left[\overleftarrow{h}_1, \ldots, \overleftarrow{h}_M\right]$$

where each hidden state is calculated as

$$\overrightarrow{h}_i = \mathrm{GRU}\left(x_i, \overrightarrow{h}_{i-1}\right), \qquad \overleftarrow{h}_i = \mathrm{GRU}\left(x_i, \overleftarrow{h}_{i+1}\right)$$

The final representation of the Bi-GRU is then taken as

$$h = \left[\overrightarrow{h}; \overleftarrow{h}\right]$$

where ; denotes the concatenation of the left-to-right and right-to-left passes. This final hidden state is then forwarded to the decoder, which has a similar structure to the encoder with two major differences; the whole pass can be seen in Figure 3.8. First, we use an embedding layer which this time does not make use of any pre-trained knowledge, regardless of the initial configuration and of whether such knowledge is used in the encoder. In cases where we use the same embedding layer in the encoder and the decoder, i.e. when we share it, we skip the generation of pre-trained model embeddings when passing through the decoder. We justify this with the fact that the decoder works character by character at each time step, which makes the pre-trained

models – which are trained on and work at sub-word level – obsolete in this case. Secondly, we use attention (Bahdanau et al., 2016) at each time step.

Figure 3.8: Post-OCR correction model, decoder pass

The final representation of each target character can be formulated as

$$s_i = \mathrm{GRU}\left(s_{i-1}, y_{i-1}, c_i\right)$$

We use a GRU with hidden state $s_i$. It follows a similar logic to the encoder, with a few differences. We decode each character separately, using the encoded hidden state $h$ in the attention mechanism, which then outputs a context vector $c_i$ (shown in yellow in Figure 3.8). This is done at each time step, where the attention dynamically selects the part of the encoded source sentence that is considered most relevant for the current target character. Additionally, we use the previously decoded character's hidden state as input knowledge too, labelled $y_{i-1}$.

After computing the decoder state $s_i$, we use a non-linear function $g$ – which applies a softmax in our case – and take the probability of the target character $y_i$ for this time step:

$$p\left(y_i \mid y_{<i}, X\right) = g\left(y_{i-1}, s_i, c_i\right)$$

Here, $X = (x_1, \ldots, x_M)$ is the input sequence. As $g$ uses a softmax, it provides a vector of the same size as the character vocabulary whose entries sum to 1.0. This is a distribution over all target characters, and during the evaluation phase we simply select the character with the highest probability as our correction. The setup ends with a cross-entropy loss that is used to maximise the probability of selecting the correct character at this time step.
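As an illustration of the attention step that produces the context vector $c_i$ above, the following is a generic Bahdanau-style additive attention module, not the thesis implementation; the class name, hidden sizes and the single-example (unbatched) interface are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim=128):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.w_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, decoder_state, encoder_states):
        # encoder_states: (M, enc_dim); decoder_state: (dec_dim,)
        scores = self.v(torch.tanh(self.w_enc(encoder_states)
                                   + self.w_dec(decoder_state))).squeeze(-1)
        alpha = torch.softmax(scores, dim=0)                     # weights over source positions
        context = (alpha.unsqueeze(-1) * encoder_states).sum(dim=0)  # weighted sum = context c_i
        return context, alpha

attn = AdditiveAttention(enc_dim=512, dec_dim=256)
context, alpha = attn(torch.randn(256), torch.randn(40, 512))
print(context.shape, alpha.shape)  # torch.Size([512]) torch.Size([40])
```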

3.2.2.5 Training

Similar to the setups of the other tasks, we want to analyse the effect of the different configuration options of this challenge when applied to historical corpora. To this end, we put together an extensive list of hyper-parameters, which we test before reporting our observations and findings. The full list can be seen in Table 3.6. We work with the raw post-OCR output: we do not lowercase or remove any punctuation, nor do we transform numbers in any way. This is required since we need information about all individual input tokens. For each word, we keep information about (i) its characters, using our character vocabulary built from the data we are working with, and (ii) its sub-words, which are split using WordPiece tokenization (Schuster and Nakajima, 2012; Kudo, 2018) as used in BERT. We have two configuration sets related to newly trained embeddings, due to the possibility of having two embedding layers – one for the encoder and one for the decoder. There is a special case in which we only use one and share it between the encoder and decoder. It should be remembered that, shared or not, when used in the decoder the embedding layer "disables" pre-trained information due to the single-character embedding type – that is, only the newly trained part is used, which is therefore also required at this stage.

Parameter name                                                 Value options
Model hyper-parameters
  use of newly trained encoder character embeddings            yes/no
    newly trained encoder character embedding layer size       32/64/128
    newly trained encoder character embedding layer dropout    0/0.2/0.5/0.8
  newly trained decoder character embedding layer size         32/64/128
  newly trained decoder character embedding layer dropout      0/0.2/0.5/0.8
  share encoder and decoder embedding layers                   yes/no
  use of pre-trained (FastText) embeddings                     yes/no
  use of pre-trained (BERT) embeddings                         yes/no
    weights usage type                                         fine tune/freeze
    fine tune type                                             from beginning/after initial convergence
    fine tune BERT learning rate                               same as global learning rate/1e−3/1e−4
  Encoder GRU options
    hidden size                                                128/256/512/1024
    dropout                                                    0/0.2/0.5/0.8
    directionality                                             bi-directional/uni-directional
    number of layers                                           1/2/3
  Decoder GRU options
    hidden size                                                128/256/512/1024
    dropout                                                    0/0.2/0.5/0.8
    number of layers                                           1/2/3
Training
  Optimiser                                                    SGD/Adam/AdamW
  learning rate                                                1e−2/1e−3/1e−4/1e−5

Table 3.6: Post-OCR correction hyper-parameters

If BERT embeddings are included in our configuration, we face similar decisions as before: we check whether freezing the weights hinders performance compared to fine-tuning them further. If we choose the latter, we additionally pick the fine-tune type – as before, we choose between fine-tuning from the beginning of the session or first training the main encoder-decoder model until convergence before restarting the session and unfreezing the BERT weights. Our global learning rate is generally similar to the one used for fine-tuning BERT, but we keep the option to configure the two separately and use a different rate for the pre-trained part. Since the model has encoder and decoder parts, we once again have two distinct configuration sets for their respective RNNs. We test the hidden size, dropout, directionality and number of layers for each GRU. For the decoder, we keep a uni-directional setup, as it works on a single character at a time and therefore cannot benefit from two directions. We use the AdamW optimizer but also try SGD and Adam, although they do not provide any significant benefits. We test different learning rate values, keep the momentum of the SGD optimiser at its default value of 0, and set a weight decay of 1e−8 for all optimizers. Finally, we use a negative log-likelihood loss to optimise our parameters.

3.2.3 Semantic change

A rise in academic studies focused on computational modelling of lexical semantic and vocabulary change has been observed in recent years. Some of the earlier works focused on long-term semantic shifts, sometimes even centuries long (Sagi et al., 2011), while others later focused on shorter time spans (Michel et al., 2011; Mihalcea and Nastase, 2012). Originally, this task was performed manually by linguists and other scholars in the humanities and social sciences. Transitioning from manual labour to automation, as expected, brings many problems. One example is that most of the proposed computational approaches use vastly different evaluation algorithms and concern different languages, corpora and periods, thus making generalisation and the evaluation of different systems difficult. A lack of generalisation then leads to more “reinventing of the wheel” in future research. Even more so, with the surge of digitisation of historical documents, the lack

of a standardised evaluation method leads to incomparable outputs. One can imagine how many different evaluations could be devised if asked to explain how the meaning of a word like plane differs before and after the invention of the airplane.

3.2.3.1 Motivation

Aiming to move the community further towards such a gold standard of evaluation, this year's SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection (Schlechtweg et al., 2020) was introduced. Its goal is to overcome the aforementioned differences and find common ground for better and easier future research. We participate in the challenge, analysing how powerful and useful pre-trained resources can be in such a context. Since most of these have been trained on modern-day texts, we hypothesise that they can be used to monitor and track how historical texts differ from their modern counterparts.

3.2.3.2 Tasks

Identifying words that change their meaning over time is still not a simple problem to solve. The goal is further split into two different sub-tasks, which we explain below.

Sub-task 1: Binary classification presents us with two corpora C1 and C2 for two time periods, t1 and t2 respectively. Given a set of target words, we must decide which words gained or lost senses between t1 and t2 and which ones did not. The original data is annotated by human judges. It is a binary classification problem, as we assign one of two possible classes to each word: either its meaning has changed or it has remained the same.

Sub-task 2: Ranking once again provides the two corpora C1 and C2 and the set of target words. Here the goal is to rank the words according to their degree of lexical semantic change between t1 and t2, where a higher rank corresponds to a stronger change. Here, lexical refers to the vocabulary and target words provided by the organisers, more specifically the dataset at hand. Due to the nature of our research, we focus only on the second sub-task, making use of our transfer learning setup.

3.2.3.3 Data

The provided data is available in four languages, namely English, German, Latin and Swedish. For each language, texts come from the two previously mentioned corpora, which differ in the time period from which they were taken. The exact years for each corpus and language can be seen in Table 3.7.

Language   t1            t2
English    1810 – 1860   1960 – 2010
German     1800 – 1899   1946 – 1990
Latin      -200 – 0      0 – 2000
Swedish    1790 – 1830   1895 – 1903

Table 3.7: SemEval 2020 corpora time periods per language

Some text processing is applied by the organisers: they remove punctuation, leave out one-word sentences and replace all tokens with their lemmas. Sentences are randomly shuffled within each corpus and no lower-casing is applied. Some of these steps do not apply fully to every language, so we explain the particularities below.

English data comes from CCOHA (Alatrash et al., 2020), a cleaned version of the COHA corpus (Davies, 2017). A transformation has been applied in which ten words out of every 200 are replaced with ”@”. These are used to split the original sentences and are later removed.

German combines the DTA corpus (Deutsches Textarchiv, 2017) and the BZ and ND corpora (Berliner Zeitung, 2018; Neues Deutschland, 2018). The latter two are used as they contain frequent OCR errors.

Latin uses the LatinISE corpus (McGillivray, 2012). The data was automatically lemmatised and part-of-speech tagged, after which it was corrected by hand. Homonyms are further followed by the ”#” symbol and the number of the homonym according to the Lewis-Short1 dictionary of Latin; this is only done when the number is greater than one.

Swedish makes use of the KubHist corpus, which is digitised by the National Library of Sweden and available through the Språkbanken corpus infrastructure Korp (Borin et al., 2012). More detailed information about this corpus can be found in the study of Adesam et al. (2019). Each word which the lemmatiser recognises is replaced with its lemma; otherwise it is left as is, i.e. unlemmatised and not lower-cased. This dataset contains very frequent OCR errors, especially for the first time period, which contains the older data.

3.2.3.4 Evaluation

The systems are evaluated against a ground truth annotated by human native speakers (except for Latin, which is annotated by scholars of Latin). The DURel framework is used to achieve language-independent annotation (Schlechtweg et al., 2018). The Ranking sub-task is evaluated using Spearman's rank-order correlation coefficient (SPR) against the true rank as annotated by humans. There are two baseline models against which we also compare our models for the Ranking challenge, namely (i) normalised frequency difference (FD) and (ii) counting vectors with column intersection and cosine distance (CNT + CI + CD). FD first calculates the frequency of each target word in each of the two corpora, normalises it by the total corpus frequency and then takes the absolute difference of these values as a measure of change. The second baseline model instead learns vector representations, aligns them by intersecting their columns, and measures change by the cosine distance between the two vectors of a target word. In our approach, we calculate the cosine and Euclidean distances between the word vectors coming from the two time corpora, use them to calculate the degree of change, and report both in §4.3.1.
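A small sketch of the ranking measure described above, assuming each target word already has one vector per time period: the cosine and Euclidean distances between the two vectors serve as its degree of change, and words with larger distances are ranked as having changed more between t1 and t2.

```python
import numpy as np

def change_scores(vec_t1: np.ndarray, vec_t2: np.ndarray):
    """Return (cosine distance, Euclidean distance) between two word vectors."""
    cosine = 1.0 - np.dot(vec_t1, vec_t2) / (np.linalg.norm(vec_t1) * np.linalg.norm(vec_t2))
    euclidean = np.linalg.norm(vec_t1 - vec_t2)
    return cosine, euclidean

rng = np.random.default_rng(0)
print(change_scores(rng.normal(size=768), rng.normal(size=768)))
```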

3.2.3.5 Model

We use this challenge to test the importance of pre-trained embeddings. Due to its nature, we leave newly trained embeddings out of our own setup entirely and only keep them in the baseline models provided by the organisers: because of their random initialisation, they simply require much more data than we have available to get near the performance of widely used systems. Our focus instead turns towards models like BERT, which, to our knowledge, had never been applied to this task before this challenge. We assume that the knowledge incorporated in pre-trained embeddings contains information that is helpful for detecting change over time. Another participating team (Kutuzov and Giulianelli, 2020) successfully applied very similar techniques at the same time as our research and encountered similar problems, which we discuss later on. We use a pre-trained BERT and participate in all four languages, using bert-base-cased for English, bert-base-german-cased for German and bert-base-multilingual-cased weights for Latin and Swedish. We simply fine-tune the model and do not add any other layers on top. Taking into consideration the specifics of the challenge, we fine-tune two separate models, both starting from the same initialisation, i.e. the original pre-trained BERT. After fine-tuning, we extract the embeddings of the target words from the two fine-tuned models, which now differ because of the different data they have seen, and evaluate their similarities and differences. For cases where a target word is split into more than one sub-word token, we simply merge the sub-word embeddings by taking their mean. To compare BERT's performance to simpler non-contextualised embeddings, we add our own baseline model. We experiment with fine-tuning a simple Word2vec-based model (Mikolov et al., 2013). This is another widely used model which is nowadays heavily outmatched by contextual Deep

1http://www.perseus.tufts.edu/hopper//morph

We use the original word vectors as initialisation for our embedding layer. Our architecture contains two extra linear layers that upscale the embedding size to the required vocabulary size. The goal of this model is to predict a target word based on the N previous and N following words as input, where N is a hyper-parameter. Because this setup is much simpler in terms of pre-trained information, we add two more initialisation styles to exclude the possibility that similar knowledge is merely represented differently. On top of having the models for both corpora start from the same point, we also test a setup where we first train one model on one of the corpora and then train the other one using the weights of the first model instead of the original word vectors; we test both corpora as the starting one. Unfortunately, this baseline proves unable to capture anything meaningful and we therefore leave it out of the reports.
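The central step of our BERT setup is pooling the sub-word vectors of a target word into a single vector per time period and comparing the two. The snippet below is a minimal sketch of that step, assuming the sub-word vectors have already been extracted from the two fine-tuned models; random tensors stand in for real BERT outputs.

```python
# Sketch (our illustration, not the exact thesis code): pool the sub-word
# vectors of a target word into one vector per time period and measure change.
import torch

def pool_subwords(subword_vectors: torch.Tensor) -> torch.Tensor:
    """Average the vectors of the sub-word pieces that form one target word."""
    return subword_vectors.mean(dim=0)

def change_scores(vec_period_1: torch.Tensor, vec_period_2: torch.Tensor):
    """Cosine and Euclidean distance between the two period representations."""
    cosine = 1.0 - torch.nn.functional.cosine_similarity(
        vec_period_1.unsqueeze(0), vec_period_2.unsqueeze(0)).item()
    euclidean = torch.dist(vec_period_1, vec_period_2, p=2).item()
    return cosine, euclidean

# Example: a target word split into three sub-word pieces in each fine-tuned model.
v1 = pool_subwords(torch.randn(3, 768))
v2 = pool_subwords(torch.randn(3, 768))
print(change_scores(v1, v2))
```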

3.2.3.6 Training We follow the work that originally built the datasets and therefore apply neither lower-casing nor any other pre-processing. As previously mentioned, our setup aims at fine-tuning the pre-trained models. To this end, we use a flow similar to the one used during the original BERT training and use the same vocabulary. We work at sub-word level on the sentences provided by the organisers. At each iteration and for each sequence, we randomly select 15% of the sub-word tokens. We then use the default values from the original training: we mask 80% of the selected tokens, replace another 10% with random sub-words from the vocabulary and leave the remaining 10% untouched. The goal of the model is to predict the masked words. We use the Adam optimiser with a learning rate of 1e−4, which evaluation showed to work best for this task. For the baseline model, we build our own vocabulary from the text corpus. We lowercase all words and leave out those which occur fewer than ten times in total. For each word that is available in the Word2Vec vocabulary, we use the pre-trained representation; for all the rest, we initialise the vectors randomly. We then use this as the starting point of our embedding layer. The model is trained to recognise words based on their neighbourhood window, that is, the previous two and next two words, disregarding their order. We use an embedding size of 300, as this is also the size used in the Word2Vec model, and default sizes for the two linear layers that follow: 128 for the first linear layer, which the second then maps to our vocabulary size.
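The masking scheme follows the standard BERT recipe described above. The sketch below is our own re-implementation for illustration, not the training code itself; -100 is used as the conventional ignore index for positions that do not contribute to the loss.

```python
# Sketch of the masking scheme: 15% of sub-word positions are selected; of
# those, 80% become the mask token, 10% a random vocabulary token and 10%
# stay unchanged. Only selected positions contribute to the prediction loss.
import random

def mask_tokens(token_ids, mask_id, vocab_size, select_prob=0.15):
    labels = [-100] * len(token_ids)          # -100: position excluded from the loss
    masked = list(token_ids)
    for i, token in enumerate(token_ids):
        if random.random() >= select_prob:
            continue
        labels[i] = token                      # the model must predict the original token
        roll = random.random()
        if roll < 0.8:
            masked[i] = mask_id                # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = random.randrange(vocab_size)  # 10%: random sub-word
        # remaining 10%: keep the original token
    return masked, labels
```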

Chapter 4

Results

We perform an extensive analysis of the impact of transfer learning on historical corpora. Transfer learning is nowadays used so widely that it is often chosen by default, with the focus on its benefits and little consideration of its drawbacks, mostly due to the widespread belief that it always improves existing systems. In this chapter, we show that this is not always a valid assumption: depending on project requirements, one might actually prefer to leave out pre-trained information and focus on other components. We present our results for each challenge – the named entity recognition challenge in §4.1, the post-OCR correction challenge in §4.2 and the semantic change experiments in §4.3. Each section is divided into sub-sections covering the main results and the extensive hyper-parameter tuning that we applied. We also perform error analysis based on the different challenges and the results that we obtain.

4.1 Named entity recognition

The problem of recognising named entities is not new and has been around almost as long as Natural language processing itself. However, while there have been significant advances on modern text corpora, applying existing models to digitised historical texts yields much weaker recognition. This is mostly because of the lack of research in this domain and, in particular, the absence of information about what to expect from otherwise well-known components – not knowing what helps and what does not slows down research and leads to duplicated effort. Aiming to give clear answers to these questions, we perform a detailed analysis using our modular architecture and provide results, including ablation studies revealing how the inclusion or exclusion of individual parts affects performance (§4.1.1). We provide our tracked convergence times in §4.1.2. We then experiment with the size of our network and show that the reported benefits do not come from increasing the number of available parameters (§4.1.3). We also provide the hyper-parameter configurations that we found to work best for this challenge (§4.1.4). We obtain all data and perform evaluation by actively participating in the CLEF-HIPE-2020 challenge, where we end up second on average for French and German (Todorov and Colavizza, 2020b).

4.1.1 Main results We report on all three languages that we have available, namely French, German and English, and we use the official test set v1.3 that the organisers also use to evaluate runs. In addition, we report results per split type applied at the pre-processing step: multi-segment and document split types for French and German, and segment split type for English.

Configuration | Literal coarse, Fuzzy (P R F) | Literal coarse, Strict (P R F) | Metonymic coarse, Fuzzy (P R F) | Metonymic coarse, Strict (P R F)
Baseline | .825 .721 .769 | .693 .606 .646 | .541 .179 .268 | .541 .179 .268
Base | .776 .69 .73 | .618 .55 .582 | .5 .424 .459 | .495 .42 .454
Base + CE | .806 .739 .771 | .649 .594 .62 | .552 .379 .45 | .545 .375 .444
Base + CE + FT | .789 .78 .784 | .65 .642 .646 | .481 .339 .398 | .468 .33 .387
Base + CE + BERT | .886 .801 .841 | .782 .707 .743 | .424 .397 .41 | .41 .384 .396
Base + CE + BERT - newly | .859 .818 .838 | .719 .685 .702 | .417 .384 .4 | .417 .384 .4
Base + CE + FT + BERT | .866 .836 .851 | .767 .739 .753 | .664 .362 .468 | .656 .357 .462
Base + CE + FT + BERT - newly | .864 .848 .856 | .765 .751 .758 | .766 .321 .453 | .766 .321 .453
Base + CE + FT + BERT (single) | .872 .835 .853 | .769 .737 .753 | .036 .069 .0 | .036 .069 .0
+ Fine-tuning (unfreezing) BERT
Base + CE + BERT | .876 .824 .849 | .775 .729 .751 | .442 .375 .406 | .432 .366 .396
Base + CE + BERT - newly | .877 .804 .839 | .775 .711 .742 | .754 .384 .509 | .754 .384 .509
Base + CE + FT + BERT | .857 .836 .846 | .759 .741 .75 | .551 .482 .514 | .541 .473 .505
Base + CE + FT + BERT - newly | .845 .838 .842 | .742 .737 .74 | .659 .5 .569 | .659 .5 .569

(a) coarse grained entity type

Literal fine Metonymic fine Component Nested Configuration Fuzzy Strict Fuzzy Strict Fuzzy Strict Fuzzy Strict PRFPRF PRFPRF PRFPRF PRFPRF Baseline .838 .693 .758 .644 .533 .583 .564 .196 .291 .538 .187 .278 .799 .531 .638 .733 .487 .585 .267 .049 .082 .267 .049 .082 Base .8 .67 .729 .548 .459 .499 .476 .451 .463 .472 .446 .459 .774 .531 .63 .692 .475 .563 .383 .14 .205 .333 .122 .179 Base + CE .825 .708 .762 .562 .482 .519 .594 .366 .453 .594 .366 .453 .779 .556 .649 .72 .514 .6 .5 .067 .118 .364 .049 .086 Base + CE + FT .801 .763 .781 .568 .541 .554 .567 .228 .325 .533 .214 .306 .762 .598 .67 .682 .535 .6 .425 .207 .279 .375 .183 .246 Base + CE + BERT .889 .781 .831 .658 .578 .616 .532 .366 .434 .519 .357 .423 .803 .579 .673 .715 .515 .599 .0 .0 .0 .0 .0 .0 Base + CE + BERT - newly .865 .748 .802 .613 .53 .568 .54 .241 .333 .54 .241 .333 .821 .504 .625 .732 .449 .557 .0 .0 .0 .0 .0 .0 Base + CE + FT + BERT .866 .818 .842 .672 .634 .653 .702 .263 .383 .643 .241 .351 .804 .563 .662 .712 .499 .587 .357 .03 .056 .143 .012 .022 Base + CE + FT + BERT - newly .873 .82 .846 .672 .631 .651 .771 .241 .367 .743 .232 .354 .842 .546 .663 .774 .503 .61 .393 .067 .115 .286 .049 .083 Base + CE + FT + BERT (single) .868 .818 .842 .676 .636 .655 .538 .442 .485 .533 .438 .48 .752 .677 .713 .659 .594 .625 .0 .0 .0 .0 .0 .0 + Fine-tuning (unfreezing) BERT Base + CE + BERT .877 .806 .84 .654 .6 .626 .434 .379 .405 .429 .375 .4 .77 .598 .673 .673 .523 .588 .267 .049 .082 .133 .024 .041 Base + CE + BERT - newly .885 .782 .83 .672 .593 .63 .739 .29 .417 .705 .277 .397 .818 .524 .639 .745 .477 .582 .107 .018 .031 .071 .012 .021 Base + CE + FT + BERT .871 .814 .842 .687 .642 .664 .568 .411 .477 .543 .393 .456 .741 .672 .705 .648 .587 .616 .232 .159 .188 .179 .122 .145 Base + CE + FT + BERT - newly .852 .837 .845 .663 .652 .658 .681 .42 .519 .609 .375 .464 .785 .626 .697 .701 .559 .622 .333 .183 .236 .244 .134 .173

(b) fine grained entity type

Table 4.1: NERC, French, multi-segment split. The best result per table and column is given in bold, the second best result is underlined

The difference in the latter is due to the usage of an external training dataset which lacks the document split required to work with longer sequence lengths. All results are reported with the two scoring types also used in the challenge – fuzzy and strict. As a reminder, fuzzy scoring is relaxed and allows fuzzy boundary matching of entities: if an entity is only partially recognised, e.g. 4 out of a total of 6 tokens are recognised correctly, this still counts as a successful recognition. On the contrary, strict matching requires exact boundary matching – in the previous example this would require all 6 tokens to be predicted correctly. For each scoring type, we provide three of the most common metrics, namely precision (P), recall (R) and F-score (F). All reported scores are taken from the best run out of three in total, where all three runs have converged successfully. We also report the baseline model provided by the organisers, run over the test set; it has only a single run and no alternative split types, working on document level only, but we still display it next to our multi-segment runs for a better comparison. We order the different configurations for all languages following our ablation studies: we start with the simplest model, which uses only newly trained embeddings and no pre-trained information of any type, and call this model Base. We then continue by adding RNN-based character embeddings (+ CE); due to the huge improvements observed by including those, we keep the character embeddings enabled in all subsequently reported setups. Up to this point, we still use no pre-trained information at all. We then report results achieved by adding first the FastText embeddings provided by the organisers (+ FT), then

Configuration | Literal coarse, Fuzzy (P R F) | Literal coarse, Strict (P R F) | Metonymic coarse, Fuzzy (P R F) | Metonymic coarse, Strict (P R F)
Baseline | .825 .721 .769 | .693 .606 .646 | .541 .179 .268 | .541 .179 .268
Base | .812 .686 .743 | .671 .566 .614 | .444 .536 .486 | .444 .536 .486
Base + CE | .802 .762 .782 | .658 .625 .641 | .575 .272 .37 | .566 .268 .364
Base + CE + FT | .815 .737 .774 | .673 .608 .639 | .51 .469 .488 | .505 .464 .484
Base + CE + BERT | .871 .831 .851 | .779 .743 .76 | .684 .232 .347 | .684 .232 .347
Base + CE + BERT - newly | .89 .828 .858 | .788 .733 .759 | .564 .277 .371 | .545 .268 .359
Base + CE + FT + BERT | .872 .828 .849 | .772 .733 .752 | .433 .696 .534 | .428 .688 .527
Base + CE + FT + BERT - newly | .869 .872 .871 | .78 .782 .781 | .755 .357 .485 | .755 .357 .485
Base + CE + FT + BERT (single) | .89 .856 .873 | .807 .776 .791 | .699 .424 .528 | .691 .42 .522

(a) coarse grained entity type

Literal fine Metonymic fine Component Nested Configuration Fuzzy Strict Fuzzy Strict Fuzzy Strict Fuzzy Strict PRFPRF PRFPRF PRFPRF PRFPRF Baseline .838 .693 .758 .644 .533 .583 .564 .196 .291 .538 .187 .278 .799 .531 .638 .733 .487 .585 .267 .049 .082 .267 .049 .082 Base .822 .672 .739 .594 .486 .534 .446 .513 .477 .419 .482 .448 .738 .6 .662 .657 .534 .589 .512 .25 .336 .35 .171 .23 Base + CE .809 .752 .78 .586 .546 .565 .521 .223 .313 .521 .223 .313 .743 .618 .675 .65 .541 .59 .35 .171 .23 .275 .134 .18 Base + CE + FT .811 .722 .764 .599 .534 .565 .54 .362 .433 .507 .339 .406 .759 .603 .672 .684 .544 .606 .453 .177 .254 .406 .159 .228 Base + CE + BERT .885 .799 .84 .696 .629 .661 .654 .304 .415 .654 .304 .415 .719 .686 .702 .625 .596 .61 .304 .104 .155 .25 .085 .127 Base + CE + BERT - newly .896 .79 .84 .675 .595 .633 .568 .223 .321 .568 .223 .321 .808 .603 .69 .696 .52 .595 .0 .0 .0 .0 .0 .0 Base + CE + FT + BERT .883 .8 .839 .717 .649 .682 .741 .371 .494 .679 .339 .452 .794 .631 .703 .715 .568 .633 .341 .183 .238 .318 .171 .222 Base + CE + FT + BERT - newly .881 .841 .861 .703 .671 .687 .705 .384 .497 .689 .375 .486 .792 .644 .71 .704 .572 .631 .233 .043 .072 .067 .012 .021 Base + CE + FT + BERT (single) .882 .853 .867 .729 .704 .716 .741 .357 .482 .741 .357 .482 .734 .726 .73 .65 .642 .646 .438 .299 .355 .393 .268 .319

(b) fine grained entity type

Table 4.2: NERC, French, document split. The best result per table and column is given in bold, the second best result is underlined

those where we have BERT embeddings enabled (+ BERT) and finally those runs where both pre-trained information sources are enabled simultaneously. For the cases where BERT is enabled, we also report runs where we disable the newly trained embeddings (- newly). Finally, we report setups where we unfreeze the weights of BERT and fine-tune them on the current task. The best results per column are marked in bold and the second best are underlined. Additionally, we provide the results for the multi-segment and document split types separately and also split them over the two tasks at hand – coarse grained and fine grained. It must be noted that, due to the extremely long sequence lengths when working on document level, which in turn require extreme amounts of compute power, we are unable to fine-tune BERT on document level; we therefore report the effect of fine-tuning only for the multi-segment split type. We experience this problem for both French and German. We now look into the results, where a similar trend is observed for French (Table 4.1 for multi-segment and Table 4.2 for document-level split type) and German (Table 4.3 for multi-segment and Table 4.4 for document-level split type). We see that adding character-level embeddings and BERT consistently improves results. The best overall results for both languages and both split types are obtained using all available embeddings, with configurations excluding the newly trained ones sometimes performing better. However, no single model consistently outperforms all others on all evaluation metrics. Additionally, we observe that most of our configurations for German and French struggle on tag types with sparser annotations such as Nested. This is especially visible in German, where we are not predictive at all on Nested tags, but also on the Metonymic ones. However, we see a specific benefit of using FastText embeddings, as all results which achieve above zero on the metonymic annotations use this type of pre-trained knowledge. Fine-tuning BERT does not seem to improve performance in either of the two languages overall, with only the German Literal fine tag type gaining some marginal benefits. Excluding the newly trained embeddings from our modular setup, interestingly only when FastText is also enabled, leads to improvements. For completeness, we report results for English in Table 4.5. These are limited to the Literal coarse task, as the organisers do not provide ground truth for the fine grained entity type and the training

Configuration | Literal coarse, Fuzzy (P R F) | Literal coarse, Strict (P R F) | Metonymic coarse, Fuzzy (P R F) | Metonymic coarse, Strict (P R F)
Baseline | .79 .464 .585 | .643 .378 .476 | .814 .297 .435 | .814 .297 .435
Base | .698 .526 .6 | .535 .404 .46 | .559 .602 .58 | .551 .593 .571
Base + CE | .685 .605 .642 | .535 .473 .502 | .588 .568 .578 | .588 .568 .578
Base + CE + FT | .691 .554 .615 | .528 .424 .47 | .534 .602 .566 | .534 .602 .566
Base + CE + BERT | .801 .675 .733 | .596 .502 .545 | .0 .0 .0 | .0 .0 .0
Base + CE + BERT - newly | .759 .706 .732 | .582 .541 .561 | .0 .0 .0 | .0 .0 .0
Base + CE + FT + BERT | .784 .724 .753 | .639 .589 .613 | .598 .542 .569 | .598 .542 .569
Base + CE + FT + BERT - newly | .84 .64 .726 | .696 .53 .602 | .696 .466 .558 | .696 .466 .558
Base + CE + FT + BERT (single) | .827 .731 .776 | .708 .625 .664 | .492 .53 .51 | .472 .508 .49
+ Fine-tuning (unfreezing) BERT
Base + CE + BERT | .756 .718 .737 | .546 .519 .532 | .0 .0 .0 | .0 .0 .0
Base + CE + BERT - newly | .752 .718 .734 | .56 .534 .547 | .0 .0 .0 | .0 .0 .0
Base + CE + FT + BERT | .738 .678 .707 | .575 .528 .551 | .562 .5 .529 | .543 .483 .511
Base + CE + FT + BERT - newly | .802 .689 .741 | .658 .565 .608 | .621 .521 .567 | .616 .517 .562

(a) coarse grained entity type

Literal fine Metonymic fine Component Nested Configuration Fuzzy Strict Fuzzy Strict Fuzzy Strict Fuzzy Strict PRFPRF PRFPRF PRFPRF PRFPRF Baseline .792 .419 .548 .641 .339 .444 .805 .28 .415 .805 .28 .415 .783 .34 .474 .727 .316 .44 .333 .014 .026 .333 .014 .026 Base .723 .517 .602 .479 .343 .4 .593 .593 .593 .585 .585 .585 .589 .298 .396 .486 .246 .327 .0 .0 .0 .0 .0 .0 Base + CE .704 .585 .639 .466 .388 .424 .667 .559 .608 .667 .559 .608 .589 .432 .498 .506 .371 .428 .25 .014 .026 .0 .0 .0 Base + CE + FT .706 .521 .6 .478 .353 .406 .538 .602 .568 .538 .602 .568 .654 .266 .378 .571 .232 .33 .0 .0 .0 .0 .0 .0 Base + CE + BERT .773 .693 .731 .348 .312 .329 .0 .0 .0 .0 .0 .0 .562 .222 .318 .382 .151 .216 .0 .0 .0 .0 .0 .0 Base + CE + BERT - newly .8 .647 .716 .358 .289 .32 .0 .0 .0 .0 .0 .0 .455 .48 .467 .31 .327 .318 .0 .0 .0 .0 .0 .0 Base + CE + FT + BERT .8 .626 .703 .515 .403 .452 .581 .547 .563 .568 .534 .55 .67 .471 .553 .525 .369 .433 .0 .0 .0 .0 .0 .0 Base + CE + FT + BERT - newly .816 .639 .717 .551 .432 .484 .627 .542 .582 .627 .542 .582 .533 .227 .319 .397 .169 .237 .0 .0 .0 .0 .0 .0 Base + CE + FT + BERT (single) .776 .569 .656 .477 .35 .403 .0 .0 .0 .0 .0 .0 .841 .423 .563 .751 .378 .503 .0 .0 .0 .0 .0 .0 + Fine-tuning (unfreezing) BERT Base + CE + BERT .759 .703 .73 .311 .288 .299 .0 .0 .0 .0 .0 .0 .418 .295 .346 .276 .195 .229 .0 .0 .0 .0 .0 .0 Base + CE + BERT - newly .758 .696 .726 .29 .267 .278 .0 .0 .0 .0 .0 .0 .399 .328 .36 .239 .197 .216 .0 .0 .0 .0 .0 .0 Base + CE + FT + BERT .736 .687 .711 .433 .405 .418 .524 .551 .537 .508 .534 .521 .474 .508 .49 .338 .362 .349 .0 .0 .0 .0 .0 .0 Base + CE + FT + BERT - newly .801 .685 .738 .548 .469 .506 .691 .475 .563 .691 .475 .563 .58 .51 .543 .472 .415 .442 .0 .0 .0 .0 .0 .0

(b) fine grained entity type

Table 4.3: NERC, German, multi-segment split. The best result per table and column is given in bold, the second best result is underlined

data we are using does not have a metonymic sense labelled for the coarse task. Additionally, we only have a segment split level for this language, as the external training data, which does not come from the organisers, lacks document-level separation and thus cannot be split into documents or multi-segments. For a better comparison, we provide results from two baseline models: (i) the baseline from the organisers and (ii) the baseline model trained on the English dataset that we use. Our models are mostly unable to perform beyond the provided baseline. This is likely due to the training data we use, as the baseline model run over our training data also performs worse than the original. However, focusing on comparing the different module setups – which is also our main goal – we see similar trends as with the other two languages. Fine-tuning BERT does not help to perform better and, again, the most stable results on average are the ones that enable all modules, i.e. the ones using newly trained embeddings, character embeddings, BERT and FastText pre-trained representations. Overall, for all languages, we can see common results. Fine-tuning can be beneficial, but not always and only slightly so; taking into account how much more computing power and cost it requires, the benefits are further reduced. Moreover, all modules seem to be important for the final setup. Character embeddings always seem to improve results, whereas newly trained ones seem to be mostly beneficial whenever BERT is not used. FastText embeddings complement the others nicely and often seem to help with scarce data, i.e. the nested and metonymic tag types, as well as acting as a better replacement for the newly trained ones in the Base + CE + FT + BERT - newly configuration. It comes as no surprise, given existing research, that BERT is arguably the most important

Configuration | Literal coarse, Fuzzy (P R F) | Literal coarse, Strict (P R F) | Metonymic coarse, Fuzzy (P R F) | Metonymic coarse, Strict (P R F)
Baseline | .79 .464 .585 | .643 .378 .476 | .814 .297 .435 | .814 .297 .435
Base | .678 .552 .609 | .519 .422 .465 | .571 .581 .576 | .567 .576 .571
Base + CE | .688 .573 .626 | .548 .456 .498 | .618 .576 .596 | .618 .576 .596
Base + CE + FT | .706 .548 .617 | .549 .426 .48 | .725 .492 .586 | .725 .492 .586
Base + CE + BERT | .763 .752 .758 | .642 .632 .637 | .714 .508 .594 | .714 .508 .594
Base + CE + BERT - newly | .805 .654 .722 | .641 .52 .574 | .433 .517 .471 | .426 .508 .463
Base + CE + FT + BERT | .767 .765 .766 | .647 .645 .646 | .622 .627 .624 | .622 .627 .624
Base + CE + FT + BERT - newly | .799 .726 .761 | .671 .609 .639 | .696 .542 .61 | .696 .542 .61
Base + CE + FT + BERT (single) | .86 .738 .795 | .753 .647 .696 | .709 .517 .598 | .709 .517 .598

(a) coarse grained entity type

Literal fine Metonymic fine Component Nested Configuration Fuzzy Strict Fuzzy Strict Fuzzy Strict Fuzzy Strict PRFPRF PRFPRF PRFPRF PRFPRF Baseline .792 .419 .548 .641 .339 .444 .805 .28 .415 .805 .28 .415 .783 .34 .474 .727 .316 .44 .333 .014 .026 .333 .014 .026 Base .69 .53 .599 .448 .344 .389 .586 .606 .596 .582 .602 .592 .592 .394 .474 .491 .327 .393 .312 .068 .112 .25 .055 .09 Base + CE .706 .555 .622 .483 .38 .426 .67 .534 .594 .67 .534 .594 .683 .447 .54 .589 .385 .466 .154 .027 .047 .077 .014 .023 Base + CE + FT .726 .53 .613 .527 .384 .445 .766 .5 .605 .766 .5 .605 .722 .332 .455 .636 .292 .401 .5 .082 .141 .5 .082 .141 Base + CE + BERT .782 .734 .757 .571 .536 .553 .75 .508 .606 .75 .508 .606 .7 .5 .583 .623 .445 .52 .333 .027 .051 .333 .027 .051 Base + CE + BERT - newly .806 .594 .684 .496 .365 .421 .5 .508 .504 .5 .508 .504 .565 .09 .156 .42 .067 .116 .0 .0 .0 .0 .0 .0 Base + CE + FT + BERT .791 .763 .777 .594 .574 .584 .649 .61 .629 .649 .61 .629 .703 .582 .637 .585 .485 .53 .25 .014 .026 .25 .014 .026 Base + CE + FT + BERT - newly .84 .679 .751 .615 .497 .55 .744 .517 .61 .744 .517 .61 .792 .397 .529 .699 .35 .467 .25 .007 .013 .0 .0 .0 Base + CE + FT + BERT (single) .839 .743 .788 .669 .593 .629 .667 .525 .588 .645 .508 .569 .718 .588 .647 .632 .517 .569 .0 .0 .0 .0 .0 .0

(b) fine grained entity type

Table 4.4: NERC, German, document split. The best result per table and column is given in bold, the second best result is underlined

module, as it brings great improvements compared to similar configurations that do not use BERT.

Configuration | Literal coarse, Fuzzy (P R F) | Literal coarse, Strict (P R F)
Baseline (organisers) | .736 .454 .562 | .531 .327 .405
Baseline (ours) | .377 .612 .466 | .19 .31 .236
Base | .307 .576 .401 | .139 .261 .181
Base + CE | .3 .64 .409 | .139 .296 .189
Base + CE + FT | .309 .627 .414 | .14 .285 .188
Base + CE + BERT | .457 .538 .494 | .261 .307 .282
Base + CE + BERT - newly | .475 .535 .503 | .265 .298 .281
Base + CE + FT + BERT | .408 .59 .482 | .229 .332 .271
Base + CE + FT + BERT - newly | .415 .528 .465 | .179 .227 .2
+ Fine-tuning (unfreezing) BERT
Base + CE + BERT | .421 .622 .502 | .203 .301 .243
Base + CE + BERT - newly | .493 .53 .511 | .292 .314 .303
Base + CE + FT + BERT | .404 .582 .477 | .205 .296 .242
Base + CE + FT + BERT - newly | .462 .508 .484 | .261 .287 .274

Table 4.5: NERC, English, segment split. The best result per table and column is given in bold, the second best result is underlined

4.1.2 Convergence speed For German, we see that single-task runs sometimes perform better than the multi-task approach. However, the difference is mostly marginal and not observed for all tag type predictions. What is more, we find that, on document level, running six single-task runs – one for each of the available tag types – requires 2.5 times more time to converge than one multi-task run; surprisingly, the two require about the same time on the multi-segment split level. Therefore, when convergence speed is an important factor, multi-task learning should, as expected, be the preferred choice. We compare the exact speed of the multi-task versus the single-task learning setup; exact numbers are reported in Table 4.6. As expected, a single multi-task run is slower than a single single-task run. On the multi-segment split type, running six single-task runs – one per tag type – takes about the same total time as one multi-task run, but can sometimes yield better results, as is the case for German.

Configuration | Minutes | Hours
Multi-task (document) | 347.10 | 5.78
Single-task (document) | 144.15 | 2.40
Multi-task (multi-segment) | 162.96 | 2.72
Single-task (multi-segment) | 26.86 | 0.45

Table 4.6: NERC, convergence speed (averaged per configuration).

Configuration | Literal coarse, Fuzzy (P R F) | Literal coarse, Strict (P R F) | Metonymic coarse, Fuzzy (P R F) | Metonymic coarse, Strict (P R F)
Base-64 + CE + FT + BERT | .866 .836 .851 | .767 .739 .753 | .664 .362 .468 | .656 .357 .462
Base-1132 + CE | .781 .744 .762 | .629 .599 .614 | .592 .259 .36 | .592 .259 .36
Base-364 + CE + BERT | .862 .849 .856 | .772 .761 .766 | .71 .317 .438 | .7 .312 .432
Base-832 + CE + FT | .774 .754 .764 | .635 .619 .627 | .429 .429 .429 | .42 .42 .42

(a) multi-segment split type, coarse grained entity type

Literal fine Metonymic fine Component Nested Configuration Fuzzy Strict Fuzzy Strict Fuzzy Strict Fuzzy Strict PRFPRF PRFPRF PRFPRF PRFPRF Base-64 + CE + FT + BERT .866 .818 .842 .672 .634 .653 .702 .263 .383 .643 .241 .351 .804 .563 .662 .712 .499 .587 .357 .03 .056 .143 .012 .022 Base-1132 + CE .79 .732 .76 .583 .539 .56 .568 .223 .321 .568 .223 .321 .699 .603 .648 .614 .53 .569 .471 .201 .282 .429 .183 .256 Base-364 + CE + BERT .869 .838 .853 .701 .676 .688 .716 .281 .404 .705 .277 .397 .75 .662 .703 .663 .586 .622 .293 .146 .195 .244 .122 .163 Base-832 + CE + FT .782 .742 .762 .576 .547 .561 .409 .442 .425 .388 .42 .403 .7 .601 .647 .611 .525 .565 .473 .213 .294 .405 .183 .252

(b) multi-segment split type, fine grained entity type

Configuration | Literal coarse, Fuzzy (P R F) | Literal coarse, Strict (P R F) | Metonymic coarse, Fuzzy (P R F) | Metonymic coarse, Strict (P R F)
Base-64 + CE + FT + BERT | .872 .828 .849 | .772 .733 .752 | .433 .696 .534 | .428 .688 .527
Base-1132 + CE | .782 .723 .752 | .634 .587 .61 | .54 .424 .475 | .534 .42 .47
Base-364 + CE + BERT | .873 .839 .856 | .784 .754 .769 | .741 .549 .631 | .735 .545 .626
Base-832 + CE + FT | .771 .763 .767 | .624 .618 .621 | .43 .491 .458 | .422 .482 .45

(c) document split type, coarse grained entity type

Literal fine Metonymic fine Component Nested Configuration Fuzzy Strict Fuzzy Strict Fuzzy Strict Fuzzy Strict PRFPRF PRFPRF PRFPRF PRFPRF Base-64 + CE + FT + BERT .883 .8 .839 .717 .649 .682 .741 .371 .494 .679 .339 .452 .794 .631 .703 .715 .568 .633 .341 .183 .238 .318 .171 .222 Base-1132 + CE .761 .709 .734 .559 .521 .54 .471 .429 .449 .471 .429 .449 .737 .573 .645 .659 .513 .577 .377 .244 .296 .321 .207 .252 Base-364 + CE + BERT .872 .824 .847 .729 .689 .709 .767 .5 .605 .767 .5 .605 .78 .677 .725 .697 .606 .648 .492 .372 .424 .435 .329 .375 Base-832 + CE + FT .764 .742 .753 .558 .541 .549 .49 .433 .46 .485 .429 .455 .709 .609 .655 .625 .537 .577 .5 .226 .311 .459 .207 .286

(d) document split type, fine grained entity type

Table 4.7: NERC, parameter importance, French. The best result per table and column is given in bold, the second best result is underlined

4.1.3 Parameter importance Seeing that performance is best when using all modules together, we test whether the improvement is due solely to the additional parameters in our embedding layer and, consequently, in the full model. To this end, we perform studies where we take our setup with all modules enabled and compare it with other configurations where we switch off one or more of the modules while, most importantly, keeping the same number of parameters in the representation produced by the embedding layer. We do this by increasing the size of the newly trained embeddings so that the output size is preserved.
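The bookkeeping behind this study is simple concatenation of the module outputs. The sketch below is a simplified stand-in for our embedding layer (character embeddings, identical in all configurations, are left out of the count); it only illustrates how the four configurations used below – Base-64, Base-364, Base-832 and Base-1132 – all reach the same output size of 1132.

```python
# Simplified stand-in for the modular embedding layer: module outputs are
# concatenated, so the total output size is the sum of the enabled module sizes.
import torch
import torch.nn as nn

class ConcatEmbedding(nn.Module):
    def __init__(self, vocab_size, new_dim, fasttext_dim=0, bert_dim=0):
        super().__init__()
        self.new = nn.Embedding(vocab_size, new_dim)     # newly trained embeddings
        self.fasttext_dim, self.bert_dim = fasttext_dim, bert_dim
        self.output_size = new_dim + fasttext_dim + bert_dim

    def forward(self, token_ids, fasttext_vecs=None, bert_vecs=None):
        parts = [self.new(token_ids)]
        if self.fasttext_dim:
            parts.append(fasttext_vecs)                  # pre-computed FastText vectors
        if self.bert_dim:
            parts.append(bert_vecs)                      # pre-computed BERT vectors
        return torch.cat(parts, dim=-1)

# All four configurations end up with an output size of exactly 1132:
print(ConcatEmbedding(1000, 64, 300, 768).output_size)   # Base-64 + FT + BERT
print(ConcatEmbedding(1000, 364, 0, 768).output_size)    # Base-364 + BERT
print(ConcatEmbedding(1000, 832, 300, 0).output_size)    # Base-832 + FT
print(ConcatEmbedding(1000, 1132, 0, 0).output_size)     # Base-1132
```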

Configuration | Literal coarse, Fuzzy (P R F) | Literal coarse, Strict (P R F) | Metonymic coarse, Fuzzy (P R F) | Metonymic coarse, Strict (P R F)
Base-64 + CE + FT + BERT | .784 .724 .753 | .639 .589 .613 | .598 .542 .569 | .598 .542 .569
Base-1132 + CE | .644 .537 .586 | .487 .406 .443 | .645 .547 .592 | .64 .542 .587
Base-364 + CE + BERT | .824 .663 .735 | .698 .561 .623 | .619 .593 .606 | .619 .593 .606
Base-832 + CE + FT | .639 .578 .607 | .488 .441 .463 | .57 .517 .542 | .561 .508 .533

(a) multi-segment split type, coarse grained entity type

Literal fine Metonymic fine Component Nested Configuration Fuzzy Strict Fuzzy Strict Fuzzy Strict Fuzzy Strict PRFPRF PRFPRF PRFPRF PRFPRF Base-64 + CE + FT + BERT .8 .626 .703 .515 .403 .452 .581 .547 .563 .568 .534 .55 .67 .471 .553 .525 .369 .433 .0 .0 .0 .0 .0 .0 Base-1132 + CE .658 .527 .585 .448 .358 .398 .665 .53 .59 .66 .525 .585 .544 .4 .461 .454 .334 .385 .25 .014 .026 .25 .014 .026 Base-364 + CE + BERT .838 .646 .73 .621 .479 .541 .626 .589 .607 .622 .585 .603 .757 .441 .557 .657 .383 .484 .0 .0 .0 .0 .0 .0 Base-832 + CE + FT .653 .56 .603 .439 .377 .405 .63 .492 .552 .62 .483 .543 .542 .391 .454 .45 .325 .377 .9 .062 .115 .8 .055 .103

(b) multi-segment split type, fine grained entity type

Configuration | Literal coarse, Fuzzy (P R F) | Literal coarse, Strict (P R F) | Metonymic coarse, Fuzzy (P R F) | Metonymic coarse, Strict (P R F)
Base-64 + CE + FT + BERT | .767 .765 .766 | .647 .645 .646 | .622 .627 .624 | .622 .627 .624
Base-1132 + CE | .635 .558 .594 | .483 .425 .452 | .613 .504 .553 | .598 .492 .54
Base-364 + CE + BERT | .79 .744 .766 | .672 .633 .652 | .787 .534 .636 | .787 .534 .636
Base-832 + CE + FT | .638 .561 .597 | .483 .425 .452 | .595 .53 .561 | .59 .525 .556

(c) document split type, coarse grained entity type

Literal fine Metonymic fine Component Nested Configuration Fuzzy Strict Fuzzy Strict Fuzzy Strict Fuzzy Strict PRFPRF PRFPRF PRFPRF PRFPRF Base-64 + CE + FT + BERT .791 .763 .777 .594 .574 .584 .649 .61 .629 .649 .61 .629 .703 .582 .637 .585 .485 .53 .25 .014 .026 .25 .014 .026 Base-1132 + CE .646 .542 .59 .445 .374 .407 .647 .504 .567 .63 .492 .552 .563 .426 .485 .485 .367 .417 .455 .068 .119 .455 .068 .119 Base-364 + CE + BERT .796 .732 .763 .616 .566 .59 .792 .517 .626 .792 .517 .626 .738 .594 .658 .663 .534 .591 .4 .027 .051 .4 .027 .051 Base-832 + CE + FT .657 .539 .592 .455 .373 .41 .611 .513 .558 .606 .508 .553 .665 .341 .451 .566 .29 .383 .45 .062 .108 .4 .055 .096

(d) document split type, fine grained entity type

Table 4.8: NERC, parameter importance, German. The best result per table and column is given in bold, the second best result is underlined

For simplicity, we rename the original Base model to Base-64 to show that this module uses an embedding size of 64. We then compare the original configuration Base-64 + CE + FT + BERT with three other configurations: (i) Base-364 + CE + BERT, where we switch off the FastText embeddings and increase the newly trained embeddings by the output size of FastText (300); (ii) Base-832 + CE + FT, where we switch off BERT instead and similarly increase the size of the newly trained embeddings by 768, the output size of the BERT model that we are using; and (iii) Base-1132 + CE, where we switch off both the BERT and FastText modules and increase the size of the newly trained representations to 1132, adding both the FastText and BERT output sizes. In the end, the embedding layer has an output size of exactly 1132 in all four configurations, meaning we can test whether, at the same size, the network still reproduces the results from Sub-section 4.1.1. Results for French are shown in Table 4.7 and for German in Table 4.8. For completeness, we again provide results for English, in Table 4.9. All tables confirm the findings we already observed in the original results. We see that BERT is necessary for good performance, as the configurations that do not use BERT fail to achieve promising results, no matter the increase in the embedding size of the Base module. Most importantly, we see that FastText does not prove to be beneficial most of the time,

Configuration | Literal coarse, Fuzzy (P R F) | Literal coarse, Strict (P R F)
Base-64 + CE + FT + BERT | .421 .622 .502 | .203 .301 .243
Base-1132 + CE | .266 .627 .374 | .123 .29 .173
Base-364 + CE + BERT | .385 .532 .447 | .205 .283 .237
Base-832 + CE + FT | .307 .592 .404 | .146 .283 .193

Table 4.9: NERC, parameter importance, English, segment split. The best result per table and column is given in bold, the second best result is underlined

as disabling it and instead increasing the embedding size (configuration Base-364 + CE + BERT) gives us very similar results to the original configuration. We observe these trends in all languages and can thus conclude that the rise in the number of parameters in the network is not the reason for the improvements we see when using transfer learning for the named entity recognition task.

4.1.4 Hyper-parameter tuning We now report which hyper-parameter combinations we use for the runs reported in §4.1.1 and §4.1.3. We use Table 3.3 as a starting point and perform a simple grid search over the available combinations. In the end, we use three final configurations: Configuration I for our Base model, Configuration II for the Base + CE + BERT and Base + CE + BERT - newly models, as well as their fine-tuned counterparts, and Configuration III for all remaining reported models.
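A grid search of this kind simply trains and scores every combination of candidate values. The sketch below illustrates the idea with an invented search space and a user-supplied train_and_score callback; neither is taken from our actual configuration files.

```python
# Sketch of a plain grid search: every combination of candidate values is
# trained and scored, and the best-scoring configuration is kept.
from itertools import product

search_space = {
    "rnn_hidden_size": [256, 512],
    "learning_rate": [1e-2, 1e-3],
    "dropout": [0.3, 0.5],
}

def grid_search(train_and_score):
    """train_and_score(config) -> validation score; returns the best config."""
    best_config, best_score = None, float("-inf")
    for values in product(*search_space.values()):
        config = dict(zip(search_space.keys(), values))
        score = train_and_score(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Dummy scorer, for illustration only.
print(grid_search(lambda cfg: -cfg["learning_rate"]))
```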

Hyper-parameter | Configuration I | Configuration II | Configuration III
RNN hidden size | 512 | 256 | 512
RNN directionality | bi-directional | bi-directional | bi-directional
RNN dropout | 0.5 | 0.5 | 0.5
Newly trained embeddings size | 64 | 64 | 64
Character embeddings size | - | 16 | 16
Character embeddings RNN hidden size | - | 32 | 32
Replace numbers during pre-processing | yes | yes | yes
Weighted loss usage | no | no | no
Optimiser | AdamW | AdamW | AdamW
Learning rate | 1e−2 | 1e−2 | 1e−2
Fine-tune learning rate | 1e−4 | 1e−4 | 1e−4

Table 4.10: NERC hyper-parameter configurations. Configuration I is used for Base. Configuration II is used for Base + CE + BERT and Base + CE + BERT - newly. Configuration III is used for all remaining setups.

These are all shown in detail in Table 4.10. Most importantly, we find that a bigger RNN hidden size only improves results when we are not using pre-trained information from BERT, i.e. when the BERT module is disabled. We use one layer for our RNN, as increasing this number always worsened performance. Dropout works best when set to 0.5; higher values decrease effectiveness. The character RNN module does not benefit from a large parameter count, so we use 16 for its embeddings and 32 for its RNN hidden size. The document split type works best, as reported previously, but we keep the multi-segment split for comparison. We use neither a weighted loss nor manually crafted features, as both seem to impact results negatively. For training we use the AdamW optimiser (Loshchilov and Hutter, 2018) with default values, except for the weight decay, which is set to 1e−8; we use a learning rate of 1e−2 for the whole system and a value of 1e−4 for BERT when fine-tuning is enabled. Finally, we only fine-tune from the start of the training process, as fine-tuning after convergence does not prove to be beneficial.
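One way to combine the two learning rates in a single optimiser is via parameter groups. The helper below is our own illustration under that assumption, not the thesis code; it assigns 1e−2 to the wrapper model and 1e−4 to the unfrozen BERT parameters, with the weight decay of 1e−8 mentioned above.

```python
# Sketch of an AdamW optimiser with two parameter groups: one learning rate for
# the wrapper model and a smaller one for the unfrozen BERT weights.
import torch

def build_optimizer(model, bert_module, lr=1e-2, bert_lr=1e-4, weight_decay=1e-8):
    bert_ids = {id(p) for p in bert_module.parameters()}
    other_params = [p for p in model.parameters() if id(p) not in bert_ids]
    return torch.optim.AdamW(
        [
            {"params": other_params, "lr": lr},
            {"params": list(bert_module.parameters()), "lr": bert_lr},
        ],
        weight_decay=weight_decay,
    )
```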

4.2 Post-OCR correction

Post-OCR correction is a major step towards a better digitisation of historical documents. It allows us to overcome obstacles such as low-quality OCR output and pre-processing inefficiencies: working at the post-processing step, we can in theory correct errors made in previous steps. Given the importance of this task, it is no surprise that the community has coordinated numerous challenges in recent years, such as ICDAR 2019 (Rigaud et al., 2019) and ICDAR 2017 (Chiron et al., 2017b). To help identify the beneficial components, we use our modular architecture and investigate whether pre-trained information helps when applied using transfer learning. We report final results as the best of three runs per setup and configuration. We intend this as a guide and a starting point for the community in future research.

4.2.1 Main results We report the average Levenshtein distance and the normalised Jaccard similarity, the former averaged across the sequences we are trying to correct. For reference, we always provide both measures calculated on the raw OCR'd text (No correction), together with the percentage of improvement. We first report a model without pre-trained embeddings, having only newly trained ones enabled (Base). We then report a model which enables FastText in-domain embeddings (+ FT), then BERT (+ BERT) and finally both. We further assess fine-tuning BERT by unfreezing the BERT embeddings either from the start or after convergence: the first type of fine-tuning unfreezes the BERT weights from the start of the training process, whereas the latter first requires initial convergence of the full model before unfreezing. We report all our results on the corresponding language test set provided by the ICDAR 2019 competition. Our results are not directly comparable to the ones reported in the competition, as we do not remove hyphen-separated tokens, nor do we use erroneous tokens to make predictions; instead, we evaluate on the full text corpus. Furthermore, we show in Figure 4.1 the normalised histogram of the Levenshtein distances for all documents and languages, comparing the raw OCR'd text, the Base model and the best model we found for each language.
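For reference, the sketch below shows how we read these measures: a standard Levenshtein edit distance, a plain token-set Jaccard similarity standing in for the normalised variant defined in §3.2.2.3, and the relative improvement with respect to the uncorrected text. The function names are ours.

```python
# Sketch of the two evaluation measures and of the percentage improvement.
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def jaccard_similarity(prediction: str, ground_truth: str) -> float:
    """Token-set Jaccard similarity, standing in for the normalised variant."""
    pred, gold = set(prediction.split()), set(ground_truth.split())
    return len(pred & gold) / len(pred | gold) if pred | gold else 1.0

def percentage_improvement(no_correction: float, corrected: float) -> float:
    """Relative reduction of the averaged distance, in percent (sign flips for similarities)."""
    return 100.0 * (no_correction - corrected) / no_correction

print(levenshtein("Eight trom the Earth", "Light from the Earth"))  # 2
```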

Configuration | Levenshtein distance: average | Levenshtein distance: % improvement | Normalised Jaccard similarity: average | Normalised Jaccard similarity: % improvement
No correction | 3.568 | - | 0.926 | -
Base | 3.369 | 5.579 | 0.934 | 0.824
Base + FT | 3.442 | 3.522 | 0.934 | 0.855
Base + BERT | 3.393 | 4.896 | 0.934 | 0.881
Base + FT + BERT | 3.389 | 5.020 | 0.934 | 0.850
+ Fine-tuning (unfreezing, from start) BERT
Base + BERT | 3.441 | 3.565 | 0.933 | 0.722
Base + FT + BERT | 3.397 | 4.784 | 0.936 | 1.008
+ Fine-tuning (unfreezing, after initial convergence) BERT
Base + BERT | 3.401 | 4.668 | 0.935 | 0.923
Base + FT + BERT | 3.448 | 3.347 | 0.935 | 0.900

Table 4.11: Post-OCR correction, French. The best result per table and column is given in bold, the second best result is underlined.

Starting with French (Table 4.11), we find that the raw OCR'd texts for this language are already of very high quality, and thus we are only able to obtain minor improvements. Moreover, using pre-trained embeddings does not seem to help in any significant way. Measured by Levenshtein distance, the best model is the one without any pre-trained information (Base), while fine-tuning BERT leads to slightly higher performance under the Jaccard similarity. Nevertheless, these gains are very marginal, and Figure 4.1b confirms that the improvement is not significant.

The German raw OCR'd texts are of lower quality than the French ones, while their ground truth is, in comparison, of very high quality. We are therefore able to obtain substantial improvements with correction (up to 64% in Levenshtein distance), as shown in Table 4.12. This is also confirmed by Figure 4.1c, which clearly shows the improvements of the two German models compared to the No correction results. Yet, while our post-OCR correction gains are substantial, the impact of pre-trained embeddings remains, as for French, negligible or even non-existent, as we achieve the best Levenshtein distance improvement with the Base model.

Configuration | Levenshtein distance: average | Levenshtein distance: % improvement | Normalised Jaccard similarity: average | Normalised Jaccard similarity: % improvement
No correction | 12.008 | - | 0.656 | -
Base | 4.302 | 64.172 | 0.900 | 37.290
Base + FT | 4.439 | 63.034 | 0.896 | 36.584
Base + BERT | 4.464 | 62.827 | 0.896 | 36.630
Base + FT + BERT | 5.393 | 55.088 | 0.872 | 32.938
+ Fine-tuning (unfreezing, from start) BERT
Base + BERT | 4.283 | 64.334 | 0.900 | 37.324
Base + FT + BERT | 4.340 | 63.863 | 0.899 | 37.131
+ Fine-tuning (unfreezing, after initial convergence) BERT
Base + BERT | 4.344 | 63.828 | 0.898 | 37.039
Base + FT + BERT | 4.411 | 63.271 | 0.898 | 36.908

Table 4.12: Post-OCR correction, German. The best result per table and column is given in bold, the second best result is underlined.

Lastly, with English we face another challenge, namely the bad quality of the ground truth. As a consequence, the results, shown in Table 4.13, are largely inconclusive. More specifically, in terms of Levenshtein distance improvement all results are worse than applying no correction whatsoever. However, for some configurations, using BERT seems to help, especially when fine-tuning, and in particular after convergence: we obtain our best results when using BERT without FastText and fine-tuning after convergence.

We see that in many instances the ground truth itself contains errors, hindering the training process. In some cases our models even correct successfully, but the corrections are then rejected when compared to the ground truth. Two such examples of incorrect ground truth, taken at evaluation time, are the following:

• input: “any glimpse, or sign *f Eight trom the Earth, it” ground truth: “any glimpse, or ffgn of Light from the Earth, it” prediction: “any glimpse, or sign of Eight from the Earth, it”

• input: “• Henry K Concert—AU flddledidee—Triumph* ot” ground truth: “.... Henry A Concert All ddledidee Triumphs of” prediction: “Henry a Concert All dreest the”

Finally, for English, the histogram of Levenshtein edit distances is shown in Figure 4.1a. It is clearly visible that the distances cover a much wider range than for the other two languages. Further, our models, while performing worse overall, produce more predictions with zero edit distance than the No correction baseline, meaning their precision is on par or sometimes better, whereas their recall is much worse.

Configuration | Levenshtein distance: average | Levenshtein distance: % improvement | Normalised Jaccard similarity: average | Normalised Jaccard similarity: % improvement
No correction | 9.397 | - | 0.825 | -
Base | 9.955 | -5.944 | 0.822 | -0.310
Base + FT | 9.864 | -4.971 | 0.825 | -0.005
Base + BERT | 10.228 | -8.840 | 0.822 | -0.364
Base + FT + BERT | 9.992 | -6.338 | 0.825 | -0.006
+ Fine-tuning (unfreezing, from start) BERT
Base + BERT | 9.835 | -4.665 | 0.825 | 0.062
Base + FT + BERT | 9.787 | -4.151 | 0.825 | 0.078
+ Fine-tuning (unfreezing, after initial convergence) BERT
Base + BERT | 9.724 | -3.483 | 0.829 | 0.510
Base + FT + BERT | 9.927 | -5.639 | 0.826 | 0.108

Table 4.13: Post-OCR correction, English. The best result per table and column is given in bold, the second best result is underlined.

4.2.2 Convergence speed We further log and report the time that our post-OCR correction configurations require to converge, in minutes averaged over runs and for all languages. Results are shown in Table 4.14. With every module added on top of Base, convergence slows down and the required time increases; the only exception is the FastText module, which for some of the German and English runs leads to faster convergence. Additionally, fine-tuning after convergence is faster than fine-tuning from the start which, along with the fact that it performs better overall, strengthens its advantage. Nevertheless, fine-tuning of either kind increases the required time significantly compared to runs without it.

Configuration | English | French | German
Base | 475.882 | 662.735 | 1560.595
Base + FT | 475.494 | 735.93 | 1481.484
Base + BERT | 710.549 | 1070.933 | 1593.157
Base + FT + BERT | 659.277 | 1141.16 | 1873.405
+ Fine-tuning (unfreezing, from start) BERT
Base + BERT | 1218.266 | 1292.461 | 3153.515
Base + FT + BERT | 1300.52 | 1529.498 | 3064.282
+ Fine-tuning (unfreezing, after initial convergence) BERT
Base + BERT | 990.149 | 1200.414 | 2639.135
Base + FT + BERT | 1087.442 | 1570.694 | 2393.629

Table 4.14: Post-OCR correction – convergence speed (averaged, in minutes).

In conclusion, we find that for post-OCR correction pre-trained embeddings do not provide any significant gains over a baseline with newly trained embeddings. Considering that the convergence time (and hence compute cost) is higher when using pre-trained embeddings, in particular when fine-tuning them, transfer learning does not appear to help with post-OCR correction. We underline that the data provided for the ICDAR 2019 challenge is far from uniform across languages, and this has a major impact on our results: the data for German combines bad raw OCR'd texts with good ground truth – an ideal setting for post-OCR correction – while the data for French contains high-quality raw OCR'd texts and the data for English contains a low-quality ground truth.

(a) English language

(b) French language

(c) German language

Figure 4.1: Levenshtein edit distance distributions comparing the raw OCR texts, the Base model and the best model for each language over the evaluation dataset. The bars are ordered, for each bin, from the smallest to the largest value, in order to show them all

4.2.3 Hyper-parameter tuning We finally report the hyper-parameters that we use during the reported training runs. We perform a grid search over all hyper-parameters and configuration options mentioned in Table 3.6. The final values are shown in Table 4.15. We end up with one set of hyper-parameters which

Hyper-parameter | Value
GRU encoder hidden size | 512
GRU directionality | bi-directional
GRU encoder dropout | 0.5
GRU encoder number of layers | 2
GRU decoder hidden size | 512
GRU decoder dropout | 0.5
GRU decoder number of layers | 2
Share embedding layer | yes
Newly trained embeddings size | 128
Newly trained dropout | 0.5
Optimiser | AdamW
Learning rate | 1e−4
Fine-tune learning rate | 1e−4

Table 4.15: Post-OCR correction hyper-parameter configuration

we use throughout all our configurations and available languages. Notable findings are the rather large dimensions required for the encoder GRU: it has a hidden size of 512, two layers and is bi-directional. The same values also apply to the decoder's GRU. Additionally, we share the embedding layer between the encoder and decoder and use a newly trained embedding size of 128 for it. Finally, during training we use the AdamW optimiser (Loshchilov and Hutter, 2018) with a learning rate of 1e−4 and other options set to default values, except for the weight decay, which is set to 1e−8. We keep the same learning rate for BERT when fine-tuning.
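Purely as an illustration, the skeleton below instantiates modules with the sizes from Table 4.15, assuming the bi-directionality applies to the encoder; the actual wiring, attention and decoding logic of our model are described in §3.2.2.4 and omitted here.

```python
# Skeleton of the post-OCR sequence-to-sequence modules, sized as in Table 4.15.
import torch.nn as nn

def build_post_ocr_modules(vocab_size: int):
    """Return the shared embedding, encoder GRU and decoder GRU (wiring omitted)."""
    embedding = nn.Embedding(vocab_size, 128)          # shared by encoder and decoder
    encoder = nn.GRU(128, 512, num_layers=2, dropout=0.5,
                     bidirectional=True, batch_first=True)
    decoder = nn.GRU(128, 512, num_layers=2, dropout=0.5, batch_first=True)
    return embedding, encoder, decoder
```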

4.3 Semantic change

Lexical semantic change is key to a better understanding of newly digitised historical documents and of our cultural heritage. It allows us to explain how languages changed over time and to shed light on important moments of our history. However, as important as it is, it is also extremely difficult to grasp and to capture in a generic representation of what semantic change exactly is. Thus, in recent years, machine learning and in particular deep learning have been used to help the community advance in this field, as has the introduction of challenges such as SemEval-2020 (Schlechtweg et al., 2020). We participate in the latter, which focuses on establishing an evaluation strategy for measuring semantic change. Using only transfer learning, in the form of BERT, we aim to determine the degree of change which a list of target words in four languages go through.

4.3.1 Main results We report our results, focusing on the Spearman correlation between our predicted rankings and the ground-truth rankings. We further split our predictions by the two metrics we use – cosine distance and Euclidean distance. Final results can be seen in Table 4.16. Both baseline models fail to come close to a good correlation coefficient and remain far from the truth. Even though BERT performs better on average, our results are clearly far from perfect, with no single language and metric achieving a good correlation coefficient either. This is further shown by our participation in the challenge itself, where we end up 24th out of 33 on the second sub-task. As a custom experiment on BERT's ability to encode meaningful temporal information, we analyse the word neighbourhoods that result from our BERT models. We use the English target words that the organisers provide and build a word neighbourhood consisting of the five words whose representation vectors have the closest cosine distance to the target one. We use t-SNE (Maaten and Hinton, 2008) to project the high-dimensional vectors onto 2-dimensional plots, shown in Figure 4.2. We use two words – plane (Figure 4.2a) and land (Figure 4.2b).

Model | Metric | English | German | Latin | Swedish
Baselines | Frequency (FD) | -0.217 | 0.008 | 0.024 | -0.167
Baselines | Count (CNT + CI + CD) | 0.017 | 0.215 | 0.356 | -0.037
BERT | Cosine distance | 0.149 | 0.237 | 0.268 | 0.150
BERT | Euclidean distance | 0.209 | 0.086 | 0.346 | 0.027

Table 4.16: Semantic change – Spearman correlation coefficients

Additionally, we provide the neighbourhood difference for the word gay, which is not among the target words but is a well-known example used to illustrate semantic change over time – this is visible in Figure 4.2c. Looking at the figures, it is clear that our models are able to encode distinct information for each time period. The word plane shows how in the past, before the invention of the aircraft, it appeared in explanations of mathematical topics, whereas currently the words that occur in a similar context are about flying. Similarly for land, where in the past it was much more common to talk about sailing and distant shores, whereas now the word is used to distinguish territories and continents. The word gay also clearly shows its lexical semantic change, with its past meaning neighbouring words such as chick and blush, whereas its current meaning points towards sexuality.
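The neighbourhood plots are produced by ranking the vocabulary by cosine distance to the target vector and projecting the selected vectors with t-SNE. The sketch below illustrates this with a random placeholder vocabulary and embeddings; it is not the plotting code used for Figure 4.2.

```python
# Sketch of the neighbourhood analysis: nearest neighbours by cosine distance,
# then a 2-D t-SNE projection of the selected vectors.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_distances

def nearest_neighbours(target_vec, vocab, vectors, k=5):
    distances = cosine_distances(target_vec.reshape(1, -1), vectors)[0]
    order = np.argsort(distances)
    return [vocab[i] for i in order[1:k + 1]]   # skip the target word itself

def project_2d(vectors):
    return TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vectors)

# Example with random placeholder embeddings for a 100-word vocabulary.
vocab = [f"word_{i}" for i in range(100)]
vectors = np.random.randn(100, 768)
print(nearest_neighbours(vectors[0], vocab, vectors, k=5))
print(project_2d(vectors[:10]).shape)  # (10, 2)
```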

(a) neighbourhoods of the word plane (b) neighbourhoods of the word land

(c) neighbourhoods of the word gay

Figure 4.2: Word neighbourhood change over time, 2-D projection using t-SNE

4.3.2 Hyper-parameter tuning There is no hyper-parameter tuning for this challenge as extensive as for the previous two. However, since we focus on fine-tuning BERT, the learning rate and the optimiser options are the hyper-parameters to be configured. For the former, we consider values ranging from 1e−5 to 1e−2 and find that our model is most stable and works best with 1e−4 – something we also observe for NER (§4.1.4) and for post-OCR correction (§4.2.3). Additionally, we use the Adam optimiser with default values, as it also provides the most stable results.

Chapter 5

Discussion

Throughout this thesis we have investigated transfer learning, applying it to three active and important challenges and showing its impact on digitised texts. We analyse the different ways in which it can help preserve our cultural heritage – an application which previously lacked detailed research. We discuss the major role that current datasets play in the stability and performance of existing setups aimed at historical data (§5.1). We then present the key takeaways related to transfer learning in §5.2 and mention the limitations that we faced during our research (§5.3). Finally, we share our advice for future research on this topic and mention possible experiments that would be useful for the community (§5.4).

5.1 Data importance

We see that the data used is arguably one of the most important components of a deep learning setup applied to the historical domain. Unfortunately, a big limitation of current digitised historical text corpora is precisely the quality of the data. During our post-OCR correction experiments, specifically for English (§4.2.1), it is often the case that the ground truth contains large amounts of noise. With the addition of pre-trained information from BERT, we furthermore observe a trend where our model successfully corrects words which are incorrect in the ground truth, confusing the network during training even more. The stability of existing systems is heavily dependent on the quality of the data. This is emphasised even more when we compare the three languages that we work with during post-OCR correction. For English, we are unable to obtain any benefits and even perform worse than the original uncorrected text; this language has extremely low quality not just in the OCR'd data but also in the ground truth. For French, which is on the opposite end of the spectrum – having high-quality input and ground truth – all of our models are able to improve, but only marginally. Finally, for German, we obtain a notable improvement of more than 60% over the original data, because it is the ideal setup for a deep learning problem: low-quality input and high-quality ground truth. The differences in our post-OCR correction results hint at another major problem of historical corpora that we encounter, namely the inconsistent quality of the various datasets at our disposal. This is observed during our Named Entity Recognition experiments (§4.1.1) as well, where English is again severely inferior. On the one hand, it does not have separate training data, which creates an unwanted contrast between training and evaluation performance. On the other hand, the evaluation data provided by the organisers lacks labels for five out of the six entity types, which removes the possible benefits of multi-task learning, which can learn to predict multiple labels together better than single ones. The remaining languages, having significantly higher quality, are successfully used as the ground on which we conduct and report our experiments. Finally, we observe that evaluation can sometimes fail to take all of the data specifics into account and to generalise, thus severely undermining some of the conducted research. During our analysis of the lexical semantic change of specific target words, we observe that achieving higher correlation for the different languages depends entirely on choosing a score metric similar to the one used during the human annotation of the gold scores in the test set, and on their distribution. Cosine distance results correlate better with the ground truth for some languages, whereas for others it is the Euclidean distance. This is also confirmed by Kutuzov and Giulianelli (2020), who encounter the same problems with a very similar setup and perform a more detailed analysis to demonstrate this. Furthermore, we see in §4.3.1, and in particular in Figure 4.2, that BERT is able to encode information about word meaning in different time periods, showing that contextualised embeddings outperform their static counterparts. However, choosing a specific ranking function can dismiss the benefits gained from using pre-trained models – a process that feels unsystematic and lacks a proper explanation. We believe the most likely reason is the current early stage of the field and the lack of comparable challenges.

5.2 Transfer learning applicability

The main goal of this thesis is to provide initial link between transfer learning and historical language modelling and a better understanding of the applicability of the two. Due to the diversity of the challenges we operate on, we are able to see benefits that transfer learning usually promises also appearing in the historical domain. A scenario where transfer learning usually aids us is when a lack of data is observed. Fortunately, this is also confirmed in our experiments, comparing our results from Named Entity Recognition (§4.1) and Post-OCR correction (§4.2). We have less labelled data at hand in the first compared to the latter. This translates to a much more noticeable improvement after enabling our pre-trained representation modules in NER. However, we see another characteristic from which our models often benefit, one that is not de- pendent on transfer learning, namely having more single-origin data, i.e. in cases of Post-OCR correction, where German has approximately 3.5 times more data than original French and 57 times more than original English, we achieve biggest improvement precisely in German. Adding data in French and English from past competitions and other historical sub-domains does not help because of the differences the new data has with the original one. Similarly, in NER we achieve best results overall in French, the language having most data once again. Further, we show that the observed benefits of transfer learning do not come from increasing the output size of the embedding layer and consequently the amount of parameters in the network (§4.1.3). This shows that, as expected, pre-trained representations contain encoded data which the model is unable to learn on its own. However, we can also see that BERT contains significantly more important information compared to another pre-trained model in the face of FastText. This is confirmed by the same set of experiments which show that disabling BERT can not be compensated by increasing newly trained embeddings size, whereas disabling FastText can. A key consideration to take into account is that benefits from transfer learning are not always compelling. Often times, as is the case during post-OCR correction, transferring knowledge adds extremely small gains, sometimes non at all (§4.2). We stress on the fact that researchers should not throw themselves into using transfer learning always just because in theory it should work better. It is often the case that not only it does not improve performance, but significantly slows down the whole process as well (§4.1.2 and §4.2.2). We see that the increase in time is significant, especially when fine-tuning, with some cases requiring twice as much time to converge. In such cases where available time is an issue, multi-task transfer learning proves to be a solution (Table 4.6). Not only do we see multiple occurrences where multi-task proves to be a better solution than a single-task learning, but we also show that it can be significantly faster. On average, multi- task requires twice the time to converge, but with the key difference that in the end of the training, it has encoded knowledge about all tag sub-tasks. In comparison, six single task runs, while faster individually, will take three times this sequentially. One can imagine tasks which require days to converge where this can be of a high importance. 
We see some minor benefits from unfreezing and fine-tuning BERT during training, something generally advised by the community when working with modern-day corpora, although the observed benefits are not as significant here. We further try unfreezing the weights only after the initial convergence of the main model, another advocated technique, but this also does not yield the promised improvements (see the sketch at the end of this subsection). Fine-tuning remains the way to go for tracking lexical semantic change (§3.2.3),

though in traditional scenarios where there is a wrapper model around BERT, the decision should take into account the additional slow-down that fine-tuning causes.

Overall, transfer learning keeps most of its benefits when applied to historical language tasks. We see a clear improvement when using BERT on its own, without the need to fine-tune, applied to NERC. We consider this work only a beginning, and no definite conclusion can be made towards a preference for transfer learning. If anything, this research demonstrates the necessity of a detailed investigation for each application and of a focus on the quality of the data at hand.
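To make the freeze-then-unfreeze schedule discussed above concrete, here is a minimal sketch assuming a HuggingFace BERT module wrapped inside a larger task model; the model name and the convergence check are placeholders rather than our exact configuration.

```python
from transformers import BertModel

# assumption: multilingual BERT wrapped inside a larger task-specific model
bert = BertModel.from_pretrained("bert-base-multilingual-cased")

def set_bert_trainable(module, trainable):
    for param in module.parameters():
        param.requires_grad = trainable

set_bert_trainable(bert, False)  # phase 1: train only the task-specific layers

# ... train until the wrapper model stops improving on the validation set ...

set_bert_trainable(bert, True)   # phase 2: fine-tune BERT jointly (much slower)
```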

5.3 Limitations

We focus on the importance of data quality and on transfer learning applicability throughout our discussion; nevertheless, we must mention some limitations that stop us from further exploring the current state of the problem.

During our Named Entity Recognition experiments, we are unable to fine-tune BERT when using the document-level split type. Even after implementing a sliding-window cutting approach, in which we forward sequences of at most 512 sub-word tokens through BERT, our computing resources never proved sufficient for a successful back-propagation iteration. We tried to overcome this obstacle by reducing the network size and the size of our batches to as low as two, but nothing worked. Seeing that fine-tuning can be useful at times with multi-segment splitting, we would like to be able to test this at document level too in the future.

Due to the nature of this research, we end up with a substantial number of hyper-parameters. Because of this, and due to our time constraints, we are simply unable to perform a fully extensive grid search over every hyper-parameter and every language in the different challenges. We often tune all hyper-parameters on a single language and only a subset of the best ones on the remaining languages, whereas ideally each language should be tuned independently.

Finally, we encounter a limitation related to the type of BERT pre-trained weights that we are using. As a reminder, we use the HuggingFace Transformers library (Wolf et al., 2020), arguably the most popular one for incorporating BERT into a setup. During the tokenization process, i.e. when we split words into sub-words and assign them to items from BERT's vocabulary, we make use of the so-called 'offset points', which represent the positions in the original word of the new sub-word characters. We need these to be able to concatenate the sub-word tokens back to word level later. Unfortunately, the French-specialised version of BERT, CamemBERT (Martin et al., 2020), does not support these offset points. We ended up implementing such a calculation ourselves, but it proved too error-prone and we therefore decided to stick to the multilingual version of BERT. As of the time of writing this thesis, the maintainers of the library promise to implement this calculation in a future release, so we would like this experiment to be finalised in future work.
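As an illustration of the offset points mentioned above, the following minimal sketch, assuming the HuggingFace fast tokenizers, retrieves the per-token offset mapping that lets sub-word pieces be grouped back into the original words; the model name and example sentence are arbitrary.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased",
                                          use_fast=True)
encoding = tokenizer("Une anecdote historique", return_offsets_mapping=True)

for token, (start, end) in zip(encoding.tokens(), encoding["offset_mapping"]):
    # special tokens such as [CLS] and [SEP] come back with the offset (0, 0);
    # for real sub-words, (start, end) are character positions in the input text
    print(token, (start, end))
```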

5.4 Future work

We once again emphasise that this work, although highly necessary, is only a starting point towards a broader conclusion on the applicability of transfer learning in the historical domain. With this in mind, it is only natural that there are many future directions in which we would like to see a continuation of our work.

First and foremost, we believe it is imperative for the community to apply transfer learning to more challenges, while keeping the focus on historical texts. We can gain better knowledge by trying to find discrepancies, but also similarities, between the historical and modern-day text domains. Example challenges that could be used to further analyse historical newspapers and texts are Natural Language Inference (NLI), event extraction and machine translation between two different historical languages. Arguably, the most important ones would be those beneficial to information retrieval, as a key goal of digitising documents is to make them available to the general public precisely through retrieval algorithms.

Having seen how important data quality is, something stressed by other related works too (Ehrmann et al., 2016; van Strien et al., 2020), we believe one of the main goals should be, firstly, working towards better quality of the data, as the current one falls behind the available

setups, and secondly, creating larger datasets which can help models learn the specifics of our cultural heritage in more detail.

Similarly to the research done in NLP in recent years, we believe it will be fruitful to investigate many different languages; it will be interesting, for example, to analyse other low-resource historical ones. A natural and simple first step would be an investigation of the seven remaining languages of the ICDAR 2019 challenge. We believe they could immediately benefit from the modular setup that we propose, which we have already used for French, German and English in the same challenge.

We consider that exploring the effect of single-task learning on the data provided by the CLEF-HIPE-2020 challenge (§3.2.1.3) could be beneficial for the community. We see that it can sometimes produce better results than a multi-task setup (§4.1.1); however, due to time constraints, we do not analyse in detail how much better exactly this type of learning is compared to a multi-task one, so a detailed comparison between the two approaches could prove profitable. Moreover, an analysis of which tasks in a multi-task setup are actually beneficial could prove useful too, as we hypothesise that training only on coarse-grained tasks or only on fine-grained tasks can lead to some improvements. However, due to a lack of time we could not test this and leave it to future work.

Finally, given the modular setup that we work with, we would naturally like more modules to be attached and tested, as sketched below. We expect an analysis of the information encoded in the Flair embeddings, provided by the CLEF-HIPE-2020 organisers, to be fruitful, similar to what we did with the FastText embeddings. Furthermore, we suspect there could be new pre-trained language models able to perform even better than BERT when applied to ancient documents. Some of these, already proven better in certain other domains and applications, are XLNet (Yang et al., 2019) and GPT-3 (Brown et al., 2020). We sincerely hope that as more state-of-the-art systems are introduced, they can be further tested for generating improvements over historical corpora.
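As a sketch of what attaching a further embedding module could look like in such a modular setup, the snippet below simply concatenates the output of every attached module along the feature dimension; the class name, interface and dimensions are assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn

class ConcatenatedEmbeddings(nn.Module):
    """Combine several embedding modules by concatenating their outputs."""

    def __init__(self, embedding_modules):
        super().__init__()
        # e.g. [bert_module, fasttext_module, flair_module], each mapping a
        # batch of sentences to a tensor of shape (batch, seq_len, dim_i)
        self.embedding_modules = nn.ModuleList(embedding_modules)

    def forward(self, batch):
        outputs = [module(batch) for module in self.embedding_modules]
        return torch.cat(outputs, dim=-1)  # (batch, seq_len, sum of dim_i)
```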

Chapter 6

Conclusion

Throughout this thesis, we have established the foundations and performed the first step of a detailed, long-overdue investigation into the applicability of transfer learning to historical texts. We analysed different areas of transfer learning and showed that many of its promises hold in this new domain as well. However, such benefits proved not to be consistent; instead, they are heavily dependent on the challenge and the data at hand. Furthermore, we also showed major drawbacks that are currently hindering significant progress in the area, the biggest one being the overall poor quality of the data. Additionally, we implemented and open-sourced a highly modular setup, designed specifically for testing and analysing the importance of different kinds of pre-trained knowledge.

We fully believe transfer learning should be considered a technique capable of bringing major improvements to work on historical corpora. Nevertheless, we emphasise that the community should not follow it blindly and should instead take into consideration the possibility of it not being beneficial when applied to specific tasks or data. Further research is thus required to build on top of this work and, one day, settle the question.

Full digitisation of ancient manuscripts should be sought after, ultimately helping Digital Humanities and leading towards the preservation of our cultural heritage for future generations. After all, the more we know about our past, the better prepared we are for our future.

Bibliography

Yvonne Adesam, Dana Dannélls, and Nina Tahmasebi. 2019. Exploring the quality of the digital historical newspaper archive kubhist. page 9.

Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. Flair: An easy-to-use framework for state-of-the-art nlp. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), page 54–59. Association for Computational Linguistics.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, page 1638–1649. Association for Computational Linguistics.

Reem Alatrash, Dominik Schlechtweg, Jonas Kuhn, and Sabine Schulte im Walde. 2020. Ccoha: Clean corpus of historical american english. In Proceedings of The 12th Language Resources and Evaluation Conference, page 6958–6966. European Language Resources Association.

Beatrice Alex and John Burns. 2014. Estimating and rating the quality of optically character recognised text. In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, DATeCH ’14, page 97–102. Association for Computing Machinery.

Greg M. Allenby and Peter E. Rossi. 1998. Marketing models of consumer heterogeneity. Journal of Econometrics, 89(1):57–78.

Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2016. Many languages, one parser. Transactions of the Association for Computational Linguistics, 4:431–444.

Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853.

Chloé Artaud, Nicolas Sidère, Antoine Doucet, Jean-Marc Ogier, and Vincent Poulain D’Andecy Yooz. 2018. Find it! fraud detection contest report. In 2018 24th International Conference on Pattern Recognition (ICPR), page 13–18.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2016. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473 [cs, stat]. ArXiv: 1409.0473.

J. Baxter. 2000. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, page 41–48. Association for Computing Machinery.

Zeitungsabteilung Berliner Zeitung. 2018. Diachronic newspaper corpus published by staatsbibliothek zu berlin.

John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, page 120–128. Association for Computational Linguistics.

Théodore Bluche, Sebastien Hamel, Christopher Kermorvant, Joan Puigcerver, Dominique Stutzmann, Alejandro H. Toselli, and Enrique Vidal. 2017. Preparatory kws experiments for large-scale indexing of a vast medieval manuscript collection in the himanis project. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 01, page 311–316.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Marcel Bollmann. 2019. A large-scale comparison of historical text normalization systems. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, page 3885–3898. ArXiv: 1904.02036.

Lars Borin, Markus Forsberg, and Johan Roxendal. 2012. Korp — the corpus infrastructure of språkbanken. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), page 474–478. European Language Resources Association (ELRA).

H. Douglas Brown and Heekyeong Lee. 2015. Teaching by Principles: An Interactive Approach to Language Pedagogy, 4th edition. Pearson Education ESL.

Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jenifer C. Lai, and Robert L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–480.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Junyang Cai, Liangrui Peng, Yejun Tang, Changsong Liu, and Pengchao Li. 2019. Th-gan: Generative adversarial network based transfer learning for historical chinese character recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR), page 178–183.

Susan Carey and Elsa Bartlett. 1978. Acquiring a single new word.

Rich Caruana. 1998. Multitask Learning, page 95–133. Springer US.

Richard Caruana. 1993. Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning, page 41–48. Morgan Kaufmann.

Guillaume Chiron, Antoine Doucet, Mickael Coustaty, Muriel Visani, and Jean-Philippe Moreux. 2017a. Impact of ocr errors on the use of digital libraries: Towards a better access to information. In 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), page 1–4.

Guillaume Chiron, Antoine Doucet, Mickaël Coustaty, and Jean-Philippe Moreux. 2017b. Icdar2017 competition on post-ocr text correction. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 01, page 1423–1428.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv:1406.1078 [cs, stat]. ArXiv: 1406.1078.

Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. 2017. Coarse-to-fine question answering for long documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), page 209–220. Association for Computational Linguistics.

G. Sayeed Choudhury, Tim DiLauro, Robert Ferguson, Michael Droettboom, and Ichiro Fujinaga. 2006. Document recognition for a million books. D-Lib Magazine, 12(3).

Alexandra Chronopoulou, Christos Baziotis, and Alexandros Potamianos. 2019. An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models. arXiv:1902.10547 [cs]. ArXiv: 1902.10547.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does bert look at? an analysis of bert’s attention. arXiv:1906.04341 [cs]. ArXiv: 1906.04341.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, ICML ’08, page 160–167. Association for Computing Machinery.

National Research Council (U.S.) Automatic Language Processing Advisory Committee. 1966. Language and Machines: Computers in Translation and Linguistics; a Report. National Academies. Google-Books-ID: Q0ErAAAAYAAJ.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2018. Supervised learning of universal sentence representations from natural language inference data. arXiv:1705.02364 [cs]. ArXiv: 1705.02364.

Ryan Cordell. 2017. “q i-jtb the raven”: Taking dirty ocr seriously. Book History, 20(1):188–225.

Ryan Cordell. 2019. Why you (a humanist) should care about optical character recognition · ryan cordell.

National Research Council, Automatic Language Processing Advisory Committee, et al. 1966. Lan- guage and machines: Computers in translation and linguistics; a report, volume 1416. National Academies.

Wenyuan Dai, Qiang Yang, Gui-Rong Xue, and Yong Yu. 2008. Self-taught clustering. In Proceedings of the 25th international conference on Machine learning, ICML ’08, page 200–207. Association for Computing Machinery.

Mark Davies. 2017. Corpus of historical american english (coha).

DeepSet. 2020. DeepSet. https://deepset.ai/german-bert. [Online; accessed 08-August-2020].

Li Deng, Geoffrey Hinton, and Brian Kingsbury. 2013. New types of deep neural network learning for speech recognition and related applications: an overview. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, page 8599–8603.

A Deutsches Textarchiv. 2017. German text archive (dta).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs]. ArXiv: 1810.04805.

Bhuwan Dhingra, Hanxiao Liu, Ruslan Salakhutdinov, and William W. Cohen. 2017. A comparative study of word embeddings for reading comprehension. arXiv:1703.00993 [cs]. ArXiv: 1703.00993.

Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), page 1723–1732. Association for Computational Linguistics.

Rui Dong and David Smith. 2018. Multi-input attention for unsupervised ocr correction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), page 2363–2372. Association for Computational Linguistics.

Michael Droettboom, Ichiro Fujinaga, Karl MacMillan, G. Sayeed Chouhury, Tim DiLauro, Mark Patton, and Teal Anderson. 2002. Using the gamera framework for the recognition of cultural heritage materials. In Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries, JCDL ’02, page 11–17. Association for Computing Machinery.

Michael Droettboom, Karl MacMillan, and Ichiro Fujinaga. 2003. The gamera framework for building custom recognition systems. In Symposium on Document Image Understanding Technologies, pages 275–286.

Long Duong, Trevor Cohn, Steven Bird, and Paul Cook. 2015. Low resource dependency parsing: Cross-lingual parameter sharing in a neural network parser. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), page 845–850. Association for Computational Linguistics.

Maud Ehrmann, Giovanni Colavizza, Yannick Rochat, and Frédéric Kaplan. 2016. Diachronic evaluation of ner systems on old newspapers.

Maud Ehrmann, Matteo Romanello, Simon Clematide, Phillip Benjamin Ströbel, and Raphaël Barman. 2020a. Language resources for historical newspapers: the impresso collection. In Proceedings of The 12th Language Resources and Evaluation Conference, page 958–968. European Language Resources Association.

Maud Ehrmann, Matteo Romanello, Alex Flückiger, and Simon Clematide. 2020b. Extended Overview of CLEF HIPE 2020: Named Entity Processing on Historical Newspapers. In CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS.

Xing Fan, Emilio Monti, Lambert Mathias, and Markus Dreyer. 2017. Transfer learning for neural semantic parsing. arXiv:1706.04326 [cs]. ArXiv: 1706.04326.

Meng Fang and Trevor Cohn. 2017. Model transfer for tagging low-resource languages using a bilingual dictionary. arXiv:1705.00424 [cs]. ArXiv: 1705.00424.

William Fedus, Ian Goodfellow, and Andrew M. Dai. 2018. Maskgan: Better text generation via filling in the ______. arXiv:1801.07736 [cs, stat]. ArXiv: 1801.07736.

Velissarios G. Gezerlis and Sergios Theodoridis. 2002. Optical character recognition of the orthodox hellenic byzantine music notation. Pattern Recognition, 35(4):895–914.

Abbas Ghaddar and Phillippe Langlais. 2018. Robust lexical features for improved neural network named-entity recognition. In Proceedings of the 27th International Conference on Computational Linguistics, page 1896–1907. Association for Computational Linguistics.

Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2016. Multilingual language processing from bytes. arXiv:1512.00103 [cs]. ArXiv: 1512.00103.

Ross Girshick. 2015. Fast r-cnn. page 1440–1448.

Mario Giulianelli, Marco Del Tredici, and Raquel Fernández. 2020. Analysing lexical semantic change with contextualised word representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3960–3973, Online. Association for Computational Linguistics.

Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. page 6.

Yoav Goldberg. 2019. Assessing BERT’s Syntactic Abilities. arXiv:1901.05287 [cs]. ArXiv: 1901.05287.

Yoav Goldberg and Omer Levy. 2014. word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv:1402.3722 [cs, stat]. ArXiv: 1402.3722.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. Google-Books-ID: omivDQAAQBAJ.

Adeline Granet, Emmanuel Morin, Harold Mouchère, Solen Quiniou, and Christian Viard-Gaudin. 2018. Transfer learning for handwriting recognition on historical documents. In Proceedings of the 7th International Conference on Pattern Recognition Applications and Methods, page 432–439. SCITEPRESS - Science and Technology Publications.

Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. 2016. Exploiting multi-typed treebanks for parsing with deep multi-task learning. arXiv:1606.01161 [cs]. ArXiv: 1606.01161.
Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv:2004.10964 [cs]. ArXiv: 2004.10964.

Kai Hakala, Aleksi Vesanto, Niko Miekka, Tapio Salakoski, and Filip Ginter. 2019. Leveraging text repetitions and denoising in ocr post-correction. arXiv:1906.10907 [cs]. ArXiv: 1906.10907.

Roger T. Hartley and Kathleen Crumpton. 1999. Quality of OCR for Degraded Text Images. arXiv:cs/9902009. ArXiv: cs/9902009.

Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2017. A joint many-task model: Growing a neural network for multiple nlp tasks. arXiv:1611.01587 [cs]. ArXiv: 1611.01587.

Daniel Hershcovich, Omri Abend, and Ari Rappoport. 2018. Multitask parsing across semantic representations. arXiv:1805.00287 [cs]. ArXiv: 1805.00287.

Felix Hill, Kyunghyun Cho, Anna Korhonen, and Yoshua Bengio. 2016. Learning to understand phrases by embedding the dictionary. Transactions of the Association for Computational Lin- guistics, 4:17–30.

Mark J Hill and Simon Hengchen. 2019. Quantifying the impact of dirty ocr on historical text analysis: Eighteenth century collections online as a case study. Digital Scholarship in the Humanities, 34(4):825–843.

S Hirai and K Sakai. 1980. Development of a high performance chinese character reader. In Proc. 5th Int. Conf. Pattern Recognition, pages 692–702.

Rose Holley. 2009. How good can it get? analysing and improving ocr accuracy in large scale historic newspaper digitisation programs.

Seth van Hooland, Max De Wilde, Ruben Verborgh, Thomas Steiner, and Rik Van de Walle. 2015. Exploring entity recognition and disambiguation for cultural heritage collections. Literary and Linguistic Computing, 30(2):262–279.

Mika Hämäläinen and Simon Hengchen. 2019. From the paft to the fiiture: a fully automatic nmt and word embeddings method for ocr post-correction. arXiv:1910.05535 [cs]. ArXiv: 1910.05535.

Johan Jarlbrink and Pelle Snickars. 2017. Cultural heritage as digital noise: nineteenth century newspapers in the digital archive. Journal of Documentation, 73(6):1228–1243.

Bor-Shenn Jeng and Chih-Heng Lin. 1986. Chinese character segmentation using the character-gap feature.

Yacine Jernite, Samuel R. Bowman, and David Sontag. 2017. Discourse-based objectives for fast unsupervised sentence representation learning. arXiv:1705.00557 [cs, stat]. ArXiv: 1705.00557.

Jing Jiang. 2009. Multi-task transfer learning for weakly-supervised relation extraction. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, page 1012–1020. Association for Computational Linguistics.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, and et al. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv:1607.01759 [cs]. ArXiv: 1607.01759.

Arzoo Katiyar and Claire Cardie. 2017. Going out on a limb: Joint extraction of entity mentions and relations without dependency trees. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), page 917–928. Association for Computational Linguistics.

Douwe Kiela, Alexis Conneau, Allan Jabri, and Maximilian Nickel. 2018. Learning visually grounded sentence representations. arXiv:1707.06320 [cs]. ArXiv: 1707.06320.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv:1408.5882 [cs]. ArXiv: 1408.5882.

Jan-Christoph Klie, Michael Bugert, Beto Boullosa, Richard Eckart de Castilho, and Iryna Gurevych. 2018. The INCEpTION Platform: Machine-Assisted and Knowledge-Oriented Interactive Annotation. In Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pages 5–9, Santa Fe, New Mexico. Association for Computational Linguistics.

Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv:1804.10959 [cs]. ArXiv: 1804.10959.

Taku Kudo and John Richardson. 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, page 66–71. Association for Computational Linguistics.

Andrey Kutuzov and Mario Giulianelli. 2020. Uio-uva at semeval-2020 task 1: Contextualised embeddings for lexical semantic change detection. arXiv:2005.00050 [cs]. ArXiv: 2005.00050.

Kai Labusch, Clemens Neudecker, and David Zellhofer. 2019. Bert for named entity recognition in contemporary and historical german. page 9.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, page 282–289. Morgan Kaufmann Publishers Inc.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, page 260–270. Association for Computational Linguistics.

Lasko Laskov. 2006. Classification and recognition of neume note notation in historical documents. page 4.

Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals.

Omer Levy and Yoav Goldberg. 2014. Neural Word Embedding as Implicit Matrix Factorization. page 9.

Elvys Linhares Pontes, Ahmed Hamdi, Nicolas Sidere, and Antoine Doucet. 2019. Impact of ocr quality on named entity linking. In Digital Libraries at the Crossroads of Digital Information for the Future, Lecture Notes in Computer Science, page 102–115. Springer International Publishing.

Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. 2015. Representation learning using multi-task deep neural networks for semantic classification and information retrieval.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv:1907.11692 [cs]. ArXiv: 1907.11692.

Ilya Loshchilov and Frank Hutter. 2018. Fixing weight decay regularization in adam.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605.

Chaitanya Malaviya, Graham Neubig, and Patrick Littell. 2017. Learning language representations for typology prediction. arXiv:1707.09569 [cs]. ArXiv: 1707.09569.

Christopher Manning and Hinrich Schutze. 1999. Foundations of statistical natural language processing. MIT press.

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. Camembert: a tasty french language model. arXiv:1911.03894 [cs]. ArXiv: 1911.03894.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in Translation: Contextualized Word Vectors, page 6294–6305. Curran Associates, Inc.

Barbara McGillivray. 2012. Latinise corpus. https://www.sketchengine.co.uk/latin-corpus/. Accepted: 2017-10-30T13:14:02Z.

Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K Gray, Joseph P Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, et al. 2011. Quantitative analysis of culture using millions of digitized books. science, 331(6014):176–182.

Rada Mihalcea and Vivi Nastase. 2012. Word epoch disambiguation: Finding how words change over time. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 259–263.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. page 9.

Tomas Mikolov, Ilya Sutskever, Anoop Deoras, Hai-Son Le, and Stefan Kombrink. 2012. Subword language modeling with neural networks. preprint (http://www.fit.vutbr.cz/imikolov/rnnlm/char.pdf), 8:67.

Ian Milligan. 2013. Illusionary order: Online databases, optical character recognition, and canadian history, 1997–2010. Canadian Historical Review, 94(4):540–569.

Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, page 1003–1011. Association for Computational Linguistics.

S. Mori. 1984. Line filtering and its application to stroke segmentation of handprinted chinese characters. Proc. International Conference on Pattern Recognition (CVPR’84).

S. Mori, C.Y. Suen, and K. Yamamoto. 1992. Historical review of ocr research and development. Proceedings of the IEEE, 80(7):1029–1058.

Shunji Mori, Hirobumi Nishida, and Hiromitsu Yamada. 1999. Optical Character Recognition, 1st edition. John Wiley & Sons, Inc., USA.

Phoebe Mulcaire, Swabha Swayamdipta, and Noah Smith. 2018. Polyglot semantic role labeling. arXiv:1805.11598 [cs]. ArXiv: 1805.11598.

Vivi Nastase and Julian Hitschler. 2018. Correction of ocr word segmentation errors in articles from the acl collection through neural machine translation methods. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA).

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2015. Efficient non-parametric estimation of multiple embeddings per word in vector space. arXiv:1504.06654 [cs, stat]. ArXiv: 1504.06654.

Zeitungsabteilung Neues Deutschland. 2018. Zeitungsinformationssystem zefys - staatsbibliothek zu berlin.

NewsEye. 2020. NewsEye. https://www.newseye.eu/. [Online; accessed 31-July-2020].

Thi-Tuyet-Hai Nguyen, Adam Jatowt, Mickael Coustaty, Nhu-Van Nguyen, and Antoine Doucet. 2019. Deep statistical analysis of ocr errors for effective post-ocr processing. In 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), page 29–38. IEEE.

Allen Nie, Erin D. Bennett, and Noah D. Goodman. 2019. Dissent: Sentence representation learning from explicit discourse relations. arXiv:1710.04334 [cs]. ArXiv: 1710.04334.

K. Ntzios, B. Gatos, I. Pratikakis, T. Konidaris, and S. J. Perantonis. 2007. An old greek handwritten ocr system based on an efficient segmentation-free approach. International Journal of Document Analysis and Recognition (IJDAR), 9(2–4):179–192.

Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

Christos Papadopoulos, Stefan Pletschacher, Christian Clausner, and Apostolos Antonacopoulos. 2013. The impact dataset of historical document images. In Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing, HIP ’13, page 123–130. Association for Computing Machinery.

Nikolaos Pappas and Andrei Popescu-Belis. 2017. Multilingual hierarchical attention networks for document classification. arXiv:1707.00896 [cs]. ArXiv: 1707.00896.

Hao Peng, Sam Thomson, and Noah A. Smith. 2017. Deep multitask learning for semantic dependency parsing. arXiv:1704.06855 [cs]. ArXiv: 1704.06855.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. arXiv:1705.00108 [cs]. ArXiv: 1705.00108.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. arXiv:1802.05365 [cs]. ArXiv: 1802.05365.

Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. 2019. To tune or not to tune? adapting pretrained representations to diverse tasks. arXiv:1903.05987 [cs]. ArXiv: 1903.05987.

Steven Pinker. 2013. Learnability and Cognition, new edition: The Acquisition of Argument Structure. MIT Press. Google-Books-ID: adivAAAAQBAJ.

Michael Piotrowski. 2012. Natural language processing for historical texts. Synthesis Lectures on Human Language Technologies, 5(2):1–157.

Barbara Plank, Sigrid Klerke, and Zeljko Agic. 2018. The best of both worlds: Lexical resources to improve low-resource part-of-speech tagging. arXiv:1811.08757 [cs]. ArXiv: 1811.08757.

Project Gutenberg. 2020. Project Gutenberg. http://www.gutenberg.org/. [Online; accessed 31-July-2020].

Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y. Ng. 2007. Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th international conference on Machine learning, ICML ’07, page 759–766. Association for Computing Machinery.

Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding, and Vijay Pande. 2015. Massively multitask networks for drug discovery. arXiv:1502.02072 [cs, stat]. ArXiv: 1502.02072.

Christophe Rigaud, Antoine Doucet, Mickael Coustaty, and Jean-Philippe Moreux. 2019. ICDAR 2019 Competition on Post-OCR Text Correction.

R. Rikowski. 2011. Digitisation Perspectives. Springer Science & Business Media. Google-Books-ID: IUNg7dj3Ue0C.

Sebastian Ruder. 2019. Neural Transfer Learning for Natural Language Processing. page 329.

Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. 2019. Latent multi-task architecture learning. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):4822–4829.

Eyal Sagi, Stefan Kaufmann, and Brady Clark. 2011. Tracing semantic change with latent semantic analysis. Current methods in historical semantics, 73:161–183.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. arXiv:cs/0306050. ArXiv: cs/0306050.

Victor Sanh, Thomas Wolf, and Sebastian Ruder. 2019. A hierarchical multi-task approach for learning embeddings from semantic tasks. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):6949–6956.

Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi. 2020. SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. In To appear in Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Association for Computational Linguistics.

Dominik Schlechtweg, Sabine Schulte im Walde, and Stefanie Eckmann. 2018. Diachronic usage relatedness (durel): A framework for the annotation of lexical semantic change. arXiv:1804.06517 [cs]. ArXiv: 1804.06517.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), page 5149–5152.

Philippa Shoemark, Farhana Ferdousi Liza, Dong Nguyen, Scott Hale, and Barbara McGillivray. 2019. Room to Glo: A systematic comparison of semantic change detection approaches with word embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 66–76, Hong Kong, China. Association for Computational Linguistics.

David A Smith and Ryan Cordell. 2018. A research agenda for historical and multilingual optical character recognition. page 36.

Noah A. Smith. 2011. Linguistic structure prediction. Synthesis Lectures on Human Language Technologies, 4(2):1–274.

Sandeep Soni, Lauren Klein, and Jacob Eisenstein. 2019. Correcting whitespace errors in digitized historical texts. In Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, page 98–103. Association for Computational Linguistics.

Caroline Sporleder. 2010. Natural language processing for cultural heritage domains. Language and Linguistics Compass, 4(9):750–768.

Uwe Springmann, Dietmar Najock, Hermann Morgenroth, Helmut Schmid, Annette Gotscharek, and Florian Fink. 2014. Ocr of historical printings of latin texts: problems, prospects, progress. In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage - DATeCH ’14, page 71–75. ACM Press.

Daniel van Strien, Kaspar Beelen, Mariona Ardanuy, Kasra Hosseini, Barbara McGillivray, and Giovanni Colavizza. 2020. Assessing the impact of ocr quality on downstream nlp tasks. In Proceedings of the 12th International Conference on Agents and Artificial Intelligence, page 484–496. SCITEPRESS - Science and Technology Publications.

Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune bert for text classification? In Chinese Computational Linguistics, Lecture Notes in Computer Science, page 194–206. Springer International Publishing.

Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), page 231–235. Association for Computational Linguistics.

Yejun Tang, Liangrui Peng, Qian Xu, Yanwei Wang, and Akio Furuhata. 2016. Cnn based transfer learning for historical chinese character recognition. In 2016 12th IAPR Workshop on Document Analysis Systems (DAS), page 25–29.

Sebastian Thrun. 1996. Is Learning The n-th Thing Any Easier Than Learning The First?, page 640–646. MIT Press.

Sebastian Thrun. 1998. Lifelong Learning Algorithms, page 181–209. Springer US.

Konstantin Todorov and Giovanni Colavizza. 2020a. Transfer learning for historical corpora: An assessment on post-OCR correction and named entity recognition. In CHR 2020: Workshop on Computational Humanities Research.

Konstantin Todorov and Giovanni Colavizza. 2020b. Transfer Learning for Named Entity Recognition in Historical Corpora. In CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS.

Shubham Toshniwal, Hao Tang, Liang Lu, and Karen Livescu. 2017. Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition. arXiv:1704.01631 [cs]. ArXiv: 1704.01631.

Myriam C. Traub, Jacco van Ossenbruggen, and Lynda Hardman. 2015. Impact analysis of ocr quality on research tasks in digital archives. In Research and Advanced Technology for Digital Libraries, Lecture Notes in Computer Science, page 252–263. Springer International Publishing.

Myriam C. Traub, Thaer Samar, Jacco van Ossenbruggen, and Lynda Hardman. 2018. Impact of crowdsourcing ocr improvements on retrievability bias. In Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, JCDL ’18, page 29–36. Association for Computing Machinery.

Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, page 384–394. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv:1706.03762 [cs]. ArXiv: 1706.03762.

Marc Vilain, Jennifer Su, and Suzi Lubar. 2007. Entity extraction is a boring solved problem—or is it? In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, page 181–184. Association for Computational Linguistics.

Guy Wallis and Heinrich Bülthoff. 1999. Learning to recognize objects. Trends in Cognitive Sciences, 3(1):22–31.

Dong Wang and Thomas Fang Zheng. 2015. Transfer learning for speech and language processing. In 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), page 1225–1237.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerald Tesauro, Bowen Zhou, and Jing Jiang. 2017. R3: Reinforced reader-ranker for open-domain question answering. arXiv:1709.00023 [cs]. ArXiv: 1709.00023.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Charagram: Embedding words and sentences via character n-grams. arXiv:1607.02789 [cs]. ArXiv: 1607.02789.

Terry Winograd. 1972. Understanding natural language. Cognitive psychology, 3(1):1–191.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and et al. 2020. Huggingface’s transformers: State-of-the-art natural language processing. arXiv:1910.03771 [cs]. ArXiv: 1910.03771.

D.H. Wolpert and W.G. Macready. 1997. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82.

Bishan Yang and Tom Mitchell. 2017. A joint sequential and relational model for frame-semantic parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, page 1247–1256. Association for Computational Linguistics.

Jie Yang, Yue Zhang, and Fei Dong. 2017. Neural word segmentation with rich pretraining. arXiv:1704.08960 [cs]. ArXiv: 1704.08960.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding, page 5753–5763. Curran Associates, Inc.

Zhilin Yang, Ruslan Salakhutdinov, and William Cohen. 2016. Multi-task cross-lingual sequence tagging from scratch. arXiv:1603.06270 [cs]. ArXiv: 1603.06270.

Kai Zhao and Liang Huang. 2017. Joint syntacto-discourse parsing and the syntacto-discourse treebank. arXiv:1708.08484 [cs]. ArXiv: 1708.08484.

Yftah Ziser and Roi Reichart. 2017. Neural structural correspondence learning for domain adaptation. arXiv:1610.01588 [cs]. ArXiv: 1610.01588.

Barret Zoph and Kevin Knight. 2016. Multi-source neural translation. arXiv:1601.00710 [cs]. ArXiv: 1601.00710.

Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. arXiv:1604.02201 [cs]. ArXiv: 1604.02201.
