A Generic Deep-Learning Approach for Document Segmentation


dhSegment: A generic deep-learning approach for document segmentation

Sofia Ares Oliveira*, Benoit Seguin*, Frederic Kaplan
Digital Humanities Laboratory, EPFL, Lausanne, VD, Switzerland
{sofia.oliveiraares, benoit.seguin, frederic.kaplan}@epfl.ch
*These authors contributed equally to this work

arXiv:1804.10371v2 [cs.CV] 14 Aug 2019

Abstract—In recent years there have been multiple successful attempts at tackling document processing problems separately by designing task-specific hand-tuned strategies. We argue that the diversity of historical document processing tasks prohibits solving them one at a time and shows a need for designing generic approaches that can handle the variability of historical series. In this paper, we address multiple tasks simultaneously, such as page extraction, baseline extraction, layout analysis, and the extraction of multiple typologies of illustrations and photographs. We propose an open-source implementation of a CNN-based pixel-wise predictor coupled with task-dependent post-processing blocks. We show that a single CNN architecture can be used across tasks with competitive results. Moreover, most of the task-specific post-processing steps can be decomposed into a small number of simple, standard, reusable operations, adding to the flexibility of our approach.

Index Terms—document segmentation, historical document processing, document layout analysis, neural network, deep learning

I. INTRODUCTION

When working with digitized historical documents, one is frequently faced with recurring needs and problems: how to cut out the page of the manuscript, how to extract the illustration from the text, how to find the pages that contain a certain type of symbol, how to locate text in a digitized image, etc. However, the domain of document analysis has long been dominated by collections of heterogeneous segmentation methods, tailored to specific classes of problems and particular typologies of documents. We argue that the variability and diversity of historical series prevent us from tackling each problem separately, and that such specificity has been a great barrier towards off-the-shelf document analysis solutions usable by non-specialists.

Lately, huge improvements have been made in the semantic segmentation of natural images (roads, scenes, ...), but historical document processing and analysis have, in our opinion, not yet fully benefited from them. We believe that a tipping point has been reached and that recent progress in deep learning architectures suggests that some generic approaches are now mature enough to start outperforming dedicated systems. Also, with the growing interest in digital humanities research, the need for simple-to-use, flexible and efficient tools to perform such analyses increases.

This work is a contribution towards this goal and introduces dhSegment, a general and flexible architecture for pixel-wise segmentation tasks on historical documents. We present the surprisingly good results of such a generic architecture across tasks common in historical document processing, and show that the proposed model is competitive with or outperforms state-of-the-art methods. These encouraging results may have important consequences for the future of document analysis pipelines based on optimized generic building blocks. The implementation is open-source and available on Github¹.

¹dhSegment implementation: https://github.com/dhlab-epfl/dhSegment

II. RELATED WORK

In recent years, Long et al. [1] popularized the use of end-to-end fully convolutional networks (FCN) for semantic segmentation. They used an ImageNet [2] pretrained network, deconvolutional layers for upsampling, and combined the final prediction layer with lower layers (skip connections) to improve the predictions. The U-Net architecture [3] extended the FCN by making the expansive path (decoder) symmetric to the contracting path (encoder), resulting in a u-shaped architecture with skip connections at each level.

Similarly, the architectures for Convolutional Neural Networks (CNN) have evolved drastically in the last years, with architectures such as AlexNet [4], VGG [5] and ResNet [6]. These architectures contributed greatly to the success and the massive usage of deep neural networks in many tasks and domains.

To some extent, historical document processing has also experienced the arrival of neural networks. As seen in the latest competitions in document processing tasks [7]–[9], several successful methods make use of neural network approaches [10]–[12], such as u-shaped and MDLSTM architectures for pixel-wise segmentation tasks.

III. APPROACH

A. Outline

The system is based on two successive steps, which can be seen in Figure 1:
• The first step is a Fully Convolutional Neural Network which takes as input the image of the document to be processed and outputs a map of probabilities of the attributes predicted for each pixel. Training labels are used to generate masks, and these mask images constitute the input data to train the network.
• The second step transforms the map of predictions into the desired output of the task. We only allow ourselves simple, standard image processing techniques, which are task-dependent because of the diversity of outputs required.

The implementation of the network uses TensorFlow [13].

Fig. 1. Overview of the system. From an input image, the generic neural network (dhSegment) outputs probability maps, which are then post-processed (page, line or box post-processing) to obtain the desired output for each task.

B. Network architecture

The architecture of the network is depicted in Figure 2. dhSegment is composed of a contracting path², which follows the deep residual network ResNet-50 [6] architecture (yellow blocks), and an expansive path that maps the low-resolution encoder feature maps to full input resolution feature maps. Each path has five steps corresponding to five feature map sizes S, each step i halving the previous step's feature map size.

The contracting path uses pretrained weights, as this adds robustness and helps generalization. It takes advantage of the high-level features learned on a general image classification task (ImageNet [2]). For simplicity, the so-called "bottleneck" blocks are shown as violet arrows and downsampling bottlenecks as red arrows in Figure 2. We refer the reader to [6] for a detailed presentation of the ResNet architecture.

The expanding path is composed of five blocks plus a final convolutional layer which assigns a class to each pixel. Each deconvolutional step is composed of an upscaling of the previous block's feature map, a concatenation of the upscaled feature map with a copy of the corresponding contracting feature map, and a 3x3 convolutional layer followed by a rectified linear unit (ReLU) [14]. The number of features [...] number of parameters and memory usage. The upsampling is performed using a bilinear interpolation. The architecture contains 32.8M parameters in total, but since most of them are part of the pre-trained encoder, only 9.36M have to be fully trained.³

³Actually one could argue that the 1.57M parameters coming from the dimensionality reduction blocks do not have to be fully [...]

C. Post-processing

Our general approach to demonstrate the effectiveness and genericity of our network is to limit the post-processing steps to simple and standard operations on the predictions.

Thresholding: Thresholding is used to obtain a binary map from the predictions output by the network. If several classes are to be found, the thresholding is done class-wise. The threshold is either a fixed constant (t ∈ [0, 1]) or found by Otsu's method [15].

Morphological operations: Morphological operations are non-linear operations that originate from mathematical morphology theory [16]. They are standard and widely used methods in image processing to analyse and process geometrical structures. The two fundamental basic operators, namely erosion and dilation, can be combined to obtain the opening and closing operators. We limit our post-processing to these two operators applied to binary images.

Connected components analysis: In our case, connected components analysis is used to filter out small connected components that may remain after thresholding or morphological operations.

Shape vectorization: A vectorization step is needed in order to transform the detected region into a set of coordinates. To do so, the blobs in the binary image are extracted as polygonal shapes. In fact, the polygons are usually bounding boxes represented by four corner points, which may be the minimum rectangle enclosing the object or quadrilaterals. The detected shape can also be a line, in which case the vectorization consists in a path reduction.

D. Training

The training is regularized using L2 regularization with weight decay (10⁻⁶). We use a learning rate with an exponential decay rate of 0.95 and an initial value in [10⁻⁵, 10⁻⁴]. Xavier initialization [17] and the Adam optimizer [18] are used. Batch renormalization [19] is used to ensure that the lack of diversity in a given batch is not an issue (with r_min = 0.1, r_max = 100, d_max = 1). The images are resized so that the total number of pixels lies between 6·10⁵ and 10⁶. Images are also cropped into patches of size 300 × 300 in order to fit in memory and allow batch training, and a margin is added to the crops to avoid border effects. The training takes advantage of on-the-fly data augmentation strategies, such as rotation (r ∈ [−0.2, 0.2] rad), scaling (s ∈ [0.8, 1.2]) and mirroring, [...]
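The bilinear upsampling and skip-connection concatenation used in the expansive path can be sketched in a few lines of numpy. This is an illustrative sketch, not the authors' TensorFlow code: the function names are ours, the align-corners interpolation convention is an assumption, and the 3x3 convolution + ReLU that follows each concatenation in the paper is omitted.

```python
import numpy as np

def bilinear_upsample(x, factor=2):
    """Bilinear upsampling of an (H, W, C) feature map (align-corners style; an
    assumption -- the exact resize convention of the original code is not given here)."""
    h, w, c = x.shape
    rows = np.linspace(0.0, h - 1, h * factor)
    cols = np.linspace(0.0, w - 1, w * factor)
    r0 = np.floor(rows).astype(int); r1 = np.minimum(r0 + 1, h - 1)
    c0 = np.floor(cols).astype(int); c1 = np.minimum(c0 + 1, w - 1)
    wr = (rows - r0)[:, None, None]   # vertical interpolation weights
    wc = (cols - c0)[None, :, None]   # horizontal interpolation weights
    top = x[r0][:, c0] * (1 - wc) + x[r0][:, c1] * wc
    bot = x[r1][:, c0] * (1 - wc) + x[r1][:, c1] * wc
    return top * (1 - wr) + bot * wr

def decoder_step(low_res, skip):
    """One expansive-path step: upscale the previous block's feature map and
    concatenate it with the corresponding contracting feature map (skip connection).
    The 3x3 convolution + ReLU that follows in the paper is omitted."""
    up = bilinear_upsample(low_res, factor=skip.shape[0] // low_res.shape[0])
    return np.concatenate([up, skip], axis=-1)
```

Channel-wise concatenation (rather than addition) is what makes the skip connections U-Net-like: the decoder sees both the upsampled coarse features and the full-resolution encoder features.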
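The class-wise thresholding step described above is easy to make concrete. Below is a self-contained numpy sketch of Otsu's method (which picks the threshold maximizing between-class variance of the probability histogram); it is not the authors' implementation, which may rely on a standard library routine.

```python
import numpy as np

def otsu_threshold(probs, bins=256):
    """Return the threshold in [0, 1] maximizing the between-class variance
    of the histogram of pixel-wise probabilities (Otsu's method)."""
    hist, edges = np.histogram(probs.ravel(), bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()                      # normalized histogram
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(p)                          # weight of the "background" class
    w1 = 1.0 - w0                              # weight of the "foreground" class
    mu = np.cumsum(p * centers)                # cumulative mean
    mu_total = mu[-1]
    valid = (w0 > 0) & (w1 > 0)
    sigma_b = np.zeros(bins)
    sigma_b[valid] = (mu_total * w0[valid] - mu[valid]) ** 2 / (w0[valid] * w1[valid])
    # upper edge of the chosen bin, so background pixels fall strictly below it
    return edges[np.argmax(sigma_b) + 1]

def binarize(probs, threshold=None):
    """Class-wise thresholding: fixed constant t in [0, 1], or Otsu if None."""
    t = otsu_threshold(probs) if threshold is None else threshold
    return probs > t
```

For a multi-class prediction map of shape (H, W, C), `binarize` would simply be applied to each channel independently, matching the class-wise thresholding described in the text.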
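The opening and closing operators mentioned in the post-processing section can be built from erosion and dilation exactly as the text describes. A minimal numpy sketch (assuming a 3x3 square structuring element and zero padding at the border, neither of which is specified in the paper):

```python
import numpy as np

def dilate(b):
    """Binary dilation with a 3x3 square structuring element."""
    p = np.pad(b, 1)
    shifts = [p[i:i + b.shape[0], j:j + b.shape[1]] for i in range(3) for j in range(3)]
    return np.logical_or.reduce(shifts)

def erode(b):
    """Binary erosion with a 3x3 square structuring element (zero-padded border,
    so pixels on the image edge are eroded)."""
    p = np.pad(b, 1)
    shifts = [p[i:i + b.shape[0], j:j + b.shape[1]] for i in range(3) for j in range(3)]
    return np.logical_and.reduce(shifts)

def opening(b):
    """Erosion followed by dilation: removes small speckles."""
    return dilate(erode(b))

def closing(b):
    """Dilation followed by erosion: fills small holes."""
    return erode(dilate(b))
```

In practice one would use an optimized library routine (e.g. OpenCV's morphology functions) rather than these shifted-array loops, but the composition of the two operators is the same.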
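The simplest variant of the shape-vectorization step, extracting the four corner points of a detected blob, can be sketched as follows. This covers only the axis-aligned minimum enclosing rectangle; the general quadrilaterals and line (path-reduction) cases mentioned in the text would need e.g. a minimum-area rotated rectangle or a polyline simplification algorithm.

```python
import numpy as np

def bounding_box(mask):
    """Vectorize a binary blob as the four corner points (x, y) of its
    axis-aligned minimum enclosing rectangle, in clockwise order.
    Returns None for an empty mask."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    y0, y1 = int(ys.min()), int(ys.max())
    x0, x1 = int(xs.min()), int(xs.max())
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
```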
