HANA: a Handwritten Name Database for Offline
Total Page:16
File Type:pdf, Size:1020Kb
HANA:A HAndwritten NAme Database for Offline Handwritten Text Recognition∗ Christian M. Dahl1, Torben Johansen1, Emil N. Sørensen2, Simon Wittrock1 1University of Southern Denmark 2University of Bristol Abstract large-scale, and highly unbalanced databases. The availability of large scale databases for training and test- Methods for linking individuals across historical data sets, ing HTR models is a core prerequisite for constructing high typically in combination with AI based transcription models, performance models. While several databases based on his- are developing rapidly. Probably the single most important torical documents are available, only few have been made identifier for linking is personal names. However, personal available for personal names. For linking, matching, or ge- names are prone to enumeration and transcription errors and nealogy, the personal names of individuals is one of the although modern linking methods are designed to handle such most important pieces of information, and being able to read challenges these sources of errors are critical and should be personal names across historical documents is of great im- minimized. For this purpose, improved transcription methods portance for linking individuals across e.g. censuses, see, and large-scale databases are crucial components. This pa- for example, Abramitzky, Boustan, and Eriksson (2012), per describes and provides documentation for HANA, a newly Abramitzky, Boustan, and Eriksson (2013), Abramitzky, constructed large-scale database which consists of more than Boustan, and Eriksson (2014), Abramitzky, Boustan, and 1.1 million images of handwritten word-groups. The database Eriksson (2016), Abramitzky, Boustan, Eriksson, Feigen- is a collection of personal names, containing more than 105 baum, and Pérez (2020), Abramitzky, Mill, and Pérez (2020), thousand unique names with a total of more than 3.3 million Feigenbaum (2018), Massey (2020), Bailey, Cole, Henderson, examples. In addition, we present benchmark results for deep and Massey (2020), and Price, Buckles, Van Leeuwen, and learning models that automatically can transcribe the per- Riley (2019). Importantly, Abramitzky et al. (2020) and Bai- sonal names from the scanned documents. Focusing mainly ley et al. (2020) both discuss the rather low matching rates on personal names, due to its vital role in linking, we hope when linking transcriptions of the same census together. This to foster more sophisticated, accurate, and robust models for is partly due to low transcription accuracy of names. Fur- handwritten text recognition through making more challeng- thermore, both papers are concerned about the lack of repre- ing large-scale databases publicly available. This paper de- sentativeness of the linked samples. These two observations scribes the data source, the collection process, and the image- strongly motivates the HANA database, i.e., collecting and processing procedures and methods that are involved in ex- sharing more data of higher quality, and to the work on im- tracting the handwritten personal names and handwritten text proving transcription methods in order to reduce the potential in general from the forms. biases in record linking. arXiv:2101.10862v1 [cs.CV] 22 Jan 2021 I Introduction In total, the HANA database consists of more than 1.1 million personal names written as word-groups on single-line images. As part of the global digitization of historical archives, the While most of the existing databases contain single isolated present and future challenges are to transcribe these efficiently characters or isolated words, such as names available on the and cost-effectively. We hope that the scale, label accuracy, repository from Appen, this database contributes to the lit- and structure of the HANA database can offer opportunities erature of more unconstrained HTR. All original images are for researchers to test the robustness of their handwritten text made electronically available by Copenhagen Archives and recognition (HTR) methods and models on more challenging, the processed database described here is made freely avail- ∗Acknowledgements: We gratefully acknowledge support from the Copenhagen Archives who have supplied large amounts of scanned source material. The authors also gratefully acknowledge valuable comments from Philipp Ager, Anthony Wray and Paul Sharp. 1 able with this paper.1 to 1923 are registered in these forms. Children between 10 and 13 were registered on their father’s register sheet. Once A complete description of the image-processing methods for they turned 14, they obtained their own sheet. Married women transcription is included in the appendix, allowing users to were recorded on their husband’s register sheet, while single experiment with the image-processing for further improving women were recorded on their own sheet. The documents the database. One of the important features of this database is cover the municipality of Copenhagen, which from around the resemblance with other challenging historical documents, 1901 through 1902 also included Brønshøj, Emdrup, Sundby- such as general image noise, different writing styles, and vary- erne, Utterslev, Valby and Vanløse. With the only exception ing traits across the images. being Nyboderne as individuals residing in these buildings are only registered in the censuses. Aside from the municipality As pointed out by Combes, Gobillon, and Zylberberg (2020), of Copenhagen, there are some addresses from the munici- a new, flourishing literature uses historical data to study the pality of Frederiksberg due to the frequent migration between development of regions, cities, and neighbourhoods. Many these two municipalities. of these new studies use the very rich data source based on historical maps. The machine learning models and image- Prior to 1890, the main registers used by the police were the processing methods discussed in this paper for the visual census lists which goes back to 1865 and lasted until 1923. recognition of handwritten text would also efficiently be able However, due to the census lists only being registered twice a to turn historical maps into data. The methods can easily be year, in May and November, some migration across addresses adapted and trained for general pattern recognition and for rec- was not recorded, and individuals residing only shortly in the ognizing the shapes of objects and their surroundings and are city would not have been recorded (Copenhagen Archives, powerful enough to be used to identifying buildings, roads, 2020a). agricultural parcels, and more. The remainder of the paper is organized as follows: In Section Table I: II, we describe the database and the data acquisition procedure Population of Denmark in detail. Section III presents the benchmark results on the Year 1890 1901 1911 1921 database using a ResNet-50 deep neural network in three dif- Denmark 2,172,380 2,449,540 2,757,076 3,267,831 ferent model settings. In Section III, we discuss the database Copenhagen 312,859 378,235 462,161 561,344 and the benchmark methods and results, in addition to some The table shows the population of Denmark and its capital, Copenhagen. This illustrates that the register sheets approximately cover 16% of the considerations for future research. Section V concludes. adult population of Denmark in the period from 1890 to 1923. This table is from Statistics Denmark (2020). II Constructing the HANA Database A wealth of information is recorded in the police register sheets, including birth date, occupation, address, name of This section describes the HANA database in detail and the spouse, and more, all of which is systematically structured image-processing procedures involved in extracting the hand- across the forms. While this paper focuses on extracting and written text from the forms. In 1890, Copenhagen introduced creating benchmark results for the personal names, the re- a precursor to the Danish National Register. This register was maining information can be constructed using similar proce- organized and structured by the police in Copenhagen and dures to those presented in this paper and may serve as addi- has been digitized and labelled by hundreds of volunteers at tional databases for HTR models. Copenhagen Archives. In Figure I, we present an example of Rare information is also included in the register sheets, such one of these register sheets. as whether the individuals were wanted, had committed prior The Register Sheets In total, we obtain 1,419,491 scanned criminal offences, or owed child benefits. This kind of in- police register sheets from Copenhagen Archives. All adults formation is written as notes and is therefore typically writ- above the age of 10 residing in Copenhagen in the period 1890 ten under special remarks in the documents. As opposed to 1The HANA database will soon be made available for download from an online repository but can already now be obtained by request from the au- thors. The Appen database on handwritten names (names with clearly separated characters) is available here: https://appen.com/datasets/handwritten-name- transcription-from-an-image/. 2 the censuses, which were sorted by streets and dates, the po- subsequently found using Harris corner detection (Harris and lice register sheets were sorted by personal names. This made Stephens, 1988). Once we have the point space defined, we it easier for the police to control the migration of citizens of use Coherent Point Drift (Myronenko and Song, 2010), which Copenhagen and track individuals over time. Once