Deep Convolutional Neural Networks for Document Classification

POLITECNICO DI TORINO Master Degree Course in Computer Engineering Master Degree Thesis Deep Convolutional Neural Networks for Document Classification Supervisor Candidate prof. Bartolomeo Montrucchio Fabio Ellena student ID: 231614 Internship Tutor Docapost BPO Fabien Aïli April 2018 Abstract In the last decades the demand for always faster and more optimized business processing management solutions has continuously increased, boosting the growth of completely digital solutions. This, together with the standardization of documents and procedures, allowed to reach excellent throughputs. However, in the document processing system field new challenges are coming regarding the automation of whole procedures. Standardization allowed to split complex procedures into simple and repetitive tasks, while digitalization allowed to collect huge quantities of documents. Many companies are driving the effort to hit the autonomous document processing in upcoming years, that roughly means to build a system which is able to mimic the human operator in all the operations that involve the processing of a document. In this report, we present LEIA, a software solution that aims to reduce the document processing time. Our approach is to automate the simple and repetitive processes. LEIA uses the ultimate findings in the Artificial Intelligence field and applies them to the whole business process solution. We will describe automatic annotation and classification study on documents, the LEIA classifier design, and test on real applications, where we reach almost human performance in real time. ii Acknowledgements At the end of my internship experience, I really would like to thank my manager Fa- bien Aïli for giving me, in cooperation with EURECOM and Politecnico di Torino, this amazing opportunity. I would like to thank my tutor Marius Mézerette for all the patience, the time and the experience he put to guide me in my work. I would like to thank the supervisors from Politecnico di Torino, Bartolomeo Mon- trucchio for his supervision and very helpful tips. It was a pleasure to work with the Innovation team at Docapost: Emmanuel, Sébastien, Cyril, Amélie. Thanks for all your support and the work done together. iii Contents 1 Introduction 1 1.1 Docapost and Innovation Team . .1 1.2 Contents of the thesis . .2 2 Document processing systems 5 2.1 Overview . .5 2.1.1 LEIA . .6 2.2 State of the art . .7 2.2.1 Image classification . .7 2.2.2 Text classification . 10 2.3 Problem analysis and solution proposal . 10 3 Hardware environment 13 3.1 Hardware and software environment . 13 3.1.1 The machine . 13 3.1.2 Software environment . 14 3.1.3 Scientific stack . 14 3.1.4 Deep learning stack . 15 3.2 Experimental methodology . 15 3.2.1 Experiment reproducibility . 16 3.2.2 Resources management . 16 3.2.3 Hyperparameter optimization . 17 3.2.4 Measurement methodologies . 17 4 Datasets 19 4.1 Dataset biases . 19 4.1.1 Selection bias . 19 4.1.2 Temporal bias . 20 4.1.3 Capture bias . 20 4.1.4 Label bias . 21 4.1.5 Negative set bias . 22 4.2 Test datasets . 22 iv 4.2.1 CIFAR-10 . 22 4.2.2 Tobacco-3482 . 22 4.2.3 ADMINISTRATIVE . 24 4.2.4 ENTERPRISE . 25 4.3 Evaluation . 26 5 Dataset annotation 27 5.1 Problem definition . 27 5.1.1 Time estimation . 28 5.1.2 Solutions . 29 5.2 Annotation by clustering . 30 5.2.1 Feature extraction . 31 5.2.2 Dimensionality reduction . 33 5.2.3 Clustering . 35 5.3 Fine annotation . 37 6 Image preprocessing and data augmentation 43 6.1 Preprocessing . 43 6.1.1 Colors . 43 6.1.2 Resizing . 44 6.1.3 Value scaling . 47 6.2 Data augmentation . 48 6.2.1 Data augmentation and image preprocessing . 48 6.2.2 Online data augmentation . 49 6.2.3 Image transformations . 49 7 Neural Network Introduction 57 7.1 Feedforward networks . 57 7.1.1 Activations . 57 7.1.2 Loss function . 58 7.2 Training . 59 7.2.1 Back-propagation . 59 7.2.2 Transfer learning . 60 8 Image classification 61 8.1 Convolutional Neural Networks . 61 8.1.1 Convolutions . 61 8.1.2 Pooling . 62 8.2 MobileNet . 63 8.2.1 Architecture . 63 8.2.2 Training . 63 8.3 DenseNet . 64 8.3.1 Architecture . 64 v 8.3.2 Training . 67 8.4 Results . 67 8.4.1 Tobacco . 67 8.4.2 ADMINISTRATIVE . 68 8.4.3 ENTERPRISE . 70 9 Text classification 71 9.0.1 OCR . 71 9.1 FastText . 72 9.1.1 Architecture . 72 9.1.2 Embedding . 72 9.1.3 N-gram features . 72 9.2 Results . 73 9.2.1 ADMINISTRATIVE . 73 9.2.2 ENTERPRISE . 73 10 Ensemble model 75 10.1 Classic ensembles . 75 10.2 Stacked ensembles . 76 10.2.1 Weighted average . 77 10.2.2 Classic classifiers . 77 10.3 Results . 77 10.3.1 Cifar-10 . 78 10.3.2 ADMINISTRATIVE . 78 11 The black box issue 81 11.1 Opening the black box . 81 11.1.1 Interpretability as explaination . 81 11.1.2 Grad-CAM . 82 11.1.3 Grad-CAM experiments . 83 12 Deployment 87 12.1 Scenarios . 87 12.2 Mobile deployment . 88 12.2.1 Model conversion . 88 12.2.2 Mobile Application . 89 13 Conclusions and future work 93 13.1 Objectives and findings . 93 13.2 Future work . 94 Bibliography 97 vi Abbreviations API Application Programming Interface CNN Convolutional Neural Network CPU Central Processing Unit GPU Graphics Processing Unit HDD Hard Disk Drive LSA Latent Semantic Analysis OCR Optical Character Recognition OS Operating System PCA Principal Component Analysis RAM Random Access Memory ReLU Rectified Linear Unit SGD Stochastic Gradient Descent SSD Solid State Drive SVD Singular Value Ddecomposition SVM Support Vector Machine vii Chapter 1 Introduction This report was written during a six-month internship in the Innovation Team team of Docapost. All the presented material and results are the outcome of the work done by Fabio Ellena (the writer) during the internship, in collaboration with the other team members. All the work presented in this report was made by Fabio Ellena, except chapter 9. The theoretical chapter 7 uses as references [Goodfellow et al., 2016][Li et al., ][Lecun et al., 1998]. 1.1 Docapost and Innovation Team Docapost is the digital branch of the group “La Poste”, providing IT services from 2007. Docapost provides its main services in the following areas: • Business process management • Document management • Mobile services • Digital transformation At Docapost the Innovation Team is composed of about ten people. Its role is to enrich the existing offers by proposing technological innovations and uses. The diversity of the profiles and an agile mode of operation allow the group to be able to take charge of projects from their creations (Design and graphic charter, the establishment of the specifications), to the deployment in production. I joined the Innovation Team for a 6 month internship as a Machine Learning Engineer and I worked mainly on the development of the classifier module of LEIA. 1 1 – Introduction 1.2 Contents of the thesis In Chapter 2 we start presenting an overview of the problems related to the document processing domain. We continue with a detailled description of the current state of the art concerning the document classification. We describe the LEIA pro- gram of Docapost and more in detail the LEIA Classifier, the main topic of this report. Ultimately the goals of the project are listed. Chapter 3 describes the hardware and software supplies used for this work. We continue with the description of the experimental protocols used during this work and the methodologies to analyze and optimize the results. In chapter 4 we present in a detailed way all the datasets that we used in this work, with the objective to make clear to the reader the kind of data involved and their peculiarities. We describe each dataset briefly, and we show their unique peculiarities. Along with the dataset description, the chapter also treats the common problems that occur in the definition of a dataset and describes the possible effects on the classification. Chapter 5 describes the annotation process for the unlabeled datasets that we use in this work. We first analyze the main manual solution to the annotation problem. Then we present a solution that automatically annotates the dataset with a minor human effort. We show the results of the annotation both regarding accuracy and required time. In chapter 6 we firstly show the preprocessing operations that we apply to the documents before the classification. Then we treat the topic of data augmentation, with an in-depth.

Load more