Operationalizing a National Digital Library: The Case for a Norwegian Transformer Model

Per E Kummervold ([email protected])
Javier de la Rosa ([email protected])
Freddy Wetjen ([email protected])
Svein Arne Brygfjeld ([email protected])

The National Library of Norway, Mo i Rana, Norway

Abstract

In this work, we show the process of building a large-scale training set from digital and digitized collections at a national library. The resulting Bidirectional Encoder Representations from Transformers (BERT)-based language model for Norwegian outperforms multilingual BERT (mBERT) models in several token and sequence classification tasks for both Norwegian Bokmål and Norwegian Nynorsk. Our model also improves the mBERT performance for other languages present in the corpus, such as English, Swedish, and Danish. For languages not included in the corpus, the weights degrade moderately while keeping strong multilingual properties. Therefore, we show that building high-quality models within a memory institution using somewhat noisy optical character recognition (OCR) content is feasible, and we hope to pave the way for other memory institutions to follow.

1 Introduction

Modern natural language processing (NLP) models pose a challenge due to the massive size of the training data they require to perform well. For resource-rich languages such as Chinese, English, French, and Spanish, collections of texts from open sources such as Wikipedia (2021a), variations of Common Crawl data (2021), and other open-source corpora such as the BooksCorpus (Zhu et al., 2015) are generally used. When researchers at Google released their Bidirectional Encoder Representations from Transformers (BERT) model, they trained it on a huge corpus of 16GB of uncompressed text (3,300M words) (Devlin et al., 2019). Later research has shown that even this corpus size might have been too small, and when Facebook released its Robustly Optimized BERT (RoBERTa), it showed a considerable gain in performance by increasing the corpus to 160GB (Liu et al., 2019).

Norwegian is spoken by just 5 million people worldwide. The reference publication Ethnologue lists the 200 most commonly spoken native languages and places Norwegian as number 171. The Norwegian language has two different varieties, both equally recognized as written languages: Bokmål and Nynorsk. The number of Wikipedia pages written in a language is often used to measure its prevalence, and in this regard, Norwegian Bokmål ranks as number 23 and Nynorsk as number 55. However, there exist more than 100 times as many English Wikipedia pages as there are Norwegian Wikipedia pages (2021b). When it comes to building large text corpora, Norwegian is considered a minor language, with scarce textual resources. So far, it has been hard to train well-performing transformer-based models for such languages.

As a governmental entity, the National Library of Norway (NLN) established a mass digitization program for its collections in 2006. The Language Bank, an organizational unit within the NLN, provides text collections and curated corpora to the scholarly community (Språkbanken, 2021). Due to copyright restrictions, the publicly available Norwegian corpus consists mainly of Wikipedia pages and online newspapers, and it is around 5GB (818M words) in size (see Table 1). However, in this study, by adding multiple sources only accessible from the NLN, we were able to increase that size to 109GB (18,438M words) of raw, deduplicated text. While such initiatives may produce textual data that can be used for the large-scale pre-training of transformer-based models, relying on text derived from optical character recognition (OCR)-based pipelines introduces new challenges related to the format, scale, and quality of the necessary data. On these grounds, this work describes the effort to build a pre-training corpus and to use it to train a BERT-based language model for Norwegian.

1.1 Previous Work

Before the advent of transformer-based models, non-contextual word and document embeddings were the most prominent technology used to approach general NLP tasks. In the Nordic region, the Language Technology Group at the University of Oslo, as part of the joint Nordic Language Processing Laboratory, collected a series of monolingual resources for many languages, with a special emphasis on Norwegian (Kutuzov et al., 2017). Based on these resources, they trained and released collections of dense vectors using word2vec and fastText (both with continuous skip-gram and continuous bag-of-words architectures) (Mikolov et al., 2013; Bojanowski et al., 2017), and even an Embeddings from Language Models (ELMo)-based model with contextual capabilities (Peters et al., 2018). Shortly thereafter, Devlin et al. (2019) introduced the foundational work on the monolingual English BERT model, which would later be extended to support 104 different languages, including Norwegian Bokmål, Norwegian Nynorsk, Swedish, and Danish. The main data source used was Wikipedia (2021a). In terms of Norwegian, this amounted to around 0.9GB of uncompressed text (140M words) for Bokmål and 0.2GB (32M words) for Nynorsk (2021b). While it is generally agreed that language models acquire better language capabilities by pre-training with multiple languages (Pires et al., 2019; Wu and Dredze, 2020), there is a strong indication that this amount of data might have been insufficient for the multilingual BERT (mBERT) model to learn high-quality representations of Norwegian at a level comparable to, for instance, monolingual English models (Pires et al., 2019).

In the area of monolingual models, the Danish company BotXO trained BERT-based models for a few of the Nordic languages using corpora of various sizes. Their repository (BotXO Ltd., 2021) lists models trained mainly on Common Crawl data for Norwegian (5GB), Danish (9.5GB), and Swedish (24.7GB). Unfortunately, we were unable to make the Norwegian models work, as they seem to be no longer updated. Similarly, the KBLab at the National Library of Sweden trained and released a BERT-based model and an A Lite BERT (ALBERT) model, both trained on approximately 20GB of raw text from a variety of sources such as books, news articles, government publications, Swedish Wikipedia, and internet forums (Malmsten et al., 2020). They claimed significantly better performance than both mBERT and the Swedish model by BotXO for the tasks they evaluated.

At the same time as the release of our model, the Language Technology Group at the University of Oslo released a monolingual BERT-based model for Norwegian named NorBERT. It was trained on around 5GB of data from Wikipedia and the Norsk aviskorpus (2019). We were unable to get sensible results when fine-tuning version 1.0 of their model. However, they released a second version (1.1) shortly thereafter, fixing some errors (Language Technology Group at the University of Oslo, 2021a). We have therefore included the evaluation results of this second version of the model in our benchmarking. They have also evaluated both their model and ours (Kutuzov et al., 2021), with consistent results.

2 Building a Colossal Norwegian Corpus

As the main Norwegian memory institution, the NLN has the obligation to preserve and give access to all published information in Norway. A large amount of the traditional collection is now available in digital format. As part of the current legal deposit, many born-digital documents are also available as digital documents in the collection. The texts in the NLN collection span hundreds of years and exhibit varied uses of text in society. All kinds of historical written materials can be found in the collections, although we found that the most relevant resources for building an appropriate corpus for NLP were books, magazines, journals, and newspapers (see Table 1). As a consequence, the resulting corpus reflects the variation in the use of the Norwegian written language, both historically and socially.

Texts in the NLN have been subject to a large digitization operation in which digital copies were created for long-term preservation. The NLN employs METS/ALTO as the preferred format for storing these digital copies. As the digitized part of the collection conforms to standard preservation library practices, the format in which the texts are stored is not suitable for direct text processing; thus, the texts needed to be pre-processed and manipulated for use as plain text. One major challenge was the variation in OCR quality, both over time and between the types of materials digitized. This limited the number of usable resources and introduced some artifacts that affected the correctness of the textual data.

The basic inclusion criterion for our corpus was that as long as it was possible for a human to infer the meaning from the text, it should be included.

The language composition of the corpus was estimated from the tags in the collection and by counting the frequency of words of certain types (e.g., personal pronouns). Our estimate is that 83% of the text is in Norwegian Bokmål and 12% is in Nynorsk. Close to 4% of the texts are written in English, and the remaining 1% is a mixture of Sami, Danish, Swedish, and traces of other languages.

[Figure 1: The general corpus-building process. Library sources are unpacked from METS/ALTO files in the Digital SafeStore into text and metadata files; these, together with non-OCR and external sources, are cleaned, and the clean text is deduplicated into the training corpus and a sample set.]

The aforementioned process was carefully orchestrated, with data moving from preservation storage, through error correction and quality assessment, and ending up as text in the corpus. As shown in Figure 1, after filtering, OCR-scanned documents were added to the other digital sources. After this step, the data went through the cleaning and deduplication stages.
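As an illustration of the unpacking step, the following sketch shows how plain text could be recovered from ALTO files; it is not the NLN's actual pipeline. It assumes ALTO XML files in which "String" elements carry the recognized words in a "CONTENT" attribute, and it ignores details such as word hyphenation across lines and page metadata.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def alto_to_text(alto_path: Path) -> str:
    """Flatten one ALTO XML file into plain text, one output line per TextLine."""
    tree = ET.parse(alto_path)
    lines = []
    # Match elements regardless of the ALTO namespace version (v2/v3/v4).
    for element in tree.iter():
        if element.tag.endswith("TextLine"):
            words = [
                child.attrib["CONTENT"]
                for child in element
                if child.tag.endswith("String") and "CONTENT" in child.attrib
            ]
            if words:
                lines.append(" ".join(words))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical folder holding the unpacked ALTO pages of one digitized book.
    for path in sorted(Path("alto_pages").glob("*.xml")):
        print(alto_to_text(path))
```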
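The deduplication stage is not specified in detail here; purely as an illustration of what such a step could look like, the sketch below fingerprints normalized paragraphs and keeps only the first occurrence of each. The normalization choices and the hash function are assumptions, not the method actually used for the corpus.

```python
import hashlib
import re
from typing import Iterable, Iterator


def deduplicate(paragraphs: Iterable[str]) -> Iterator[str]:
    """Yield paragraphs whose normalized form has not been seen before."""
    seen = set()
    for paragraph in paragraphs:
        # Normalize whitespace and case so trivial variants collapse together.
        normalized = re.sub(r"\s+", " ", paragraph).strip().lower()
        if not normalized:
            continue
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield paragraph


if __name__ == "__main__":
    sample = ["Dette er en test.", "Dette  er en TEST.", "Ny setning."]
    print(list(deduplicate(sample)))  # -> ['Dette er en test.', 'Ny setning.']
```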
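The language shares reported above were estimated partly by counting the frequency of words of certain types, such as personal pronouns. The sketch below shows one possible implementation of that idea; the marker word lists (a few pronouns and function words that differ between Bokmål, Nynorsk, and English) are illustrative only and are not the word lists used by the authors.

```python
import re
from collections import Counter

# Illustrative marker words; Bokmål and Nynorsk differ in pronouns and
# common function words (e.g., "jeg"/"ikke" vs. "eg"/"ikkje").
MARKERS = {
    "nob": {"jeg", "ikke", "hun", "noen", "hvordan", "mye"},
    "nno": {"eg", "ikkje", "ho", "nokon", "korleis", "mykje"},
    "eng": {"the", "and", "of", "not", "she", "how"},
}


def guess_language(text: str) -> str:
    """Return the language code whose marker words occur most often."""
    tokens = re.findall(r"[a-zæøå]+", text.lower())
    counts = {
        lang: sum(1 for token in tokens if token in markers)
        for lang, markers in MARKERS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "unk"


def language_shares(documents: list[str]) -> dict[str, float]:
    """Estimate the share of documents per language in a sample."""
    tally = Counter(guess_language(doc) for doc in documents)
    total = sum(tally.values()) or 1
    return {lang: count / total for lang, count in tally.items()}


if __name__ == "__main__":
    docs = [
        "Jeg vet ikke hvordan hun gjorde det.",
        "Eg veit ikkje korleis ho gjorde det.",
        "She did not know how it happened.",
    ]
    print(language_shares(docs))  # roughly one third each of nob, nno, eng
```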