Training a Neural Network Using Synthetically Generated Data
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020

Authors: Fredrik Diffner, Hovig Manjikian
Swedish title: Att träna ett neuronnät med syntetiskt genererad data
Degree Project in Computer Science
Date: June 2020
Supervisor: Christopher Peters
Examiner: Pawel Herman
KTH Royal Institute of Technology, School of Electrical Engineering and Computer Science

Abstract

A major challenge in training machine learning models is gathering and labeling a sufficiently large training data set. A common solution is to use a synthetically generated data set to expand or replace a real data set. This paper examines the performance of a machine learning model trained on a synthetic data set versus the same model trained on real data. This approach was applied to the problem of character recognition using a machine learning model that implements convolutional neural networks. A synthetic data set of 1,240,000 images and two real data sets, Char74k and ICDAR 2003, were used. The result was that the model trained on the synthetic data set achieved an accuracy that was about 50% better than the accuracy of the same model trained on the real data set.

Keywords

Synthetic data set, Generating synthetic data set, Machine learning, Deep Learning, Convolutional Neural Networks, Machine learning model, Character recognition in natural images, Char74k, ICDAR2003.

Summary

When developing machine learning models, the lack of a sufficiently large data set for training can be a problem. A common solution is to use synthetically generated data to either expand or completely replace a data set of real data. This thesis examines the performance of a machine learning model trained on synthetic data compared with the same model trained on real data. This was applied to the problem of using a convolutional neural network to recognize characters in images from "natural" environments. A synthetic data set consisting of 1,240,000 images and two data sets of characters from images, Char74K and ICDAR2003, were used. The results show that a model trained on the synthetic data set performed about 50% better than the same model trained on Char74K.

Keywords

Synthetic data set, Generating synthetic data, Machine learning, Machine learning model, Deep learning, Convolutional neural networks, Character recognition in images, Char74k, ICDAR2003.

Acknowledgements

We would like to express our gratitude to Christopher Peters for his invaluable constructive criticism and inspiring guidance throughout this work.

Contents

1 Introduction
  1.1 Problem
  1.2 Methodology
  1.3 Delimitations
2 Background
  2.1 Synthetic data sets
  2.2 Neural Networks
  2.3 Char74K and ICDAR2003 data sets
  2.4 Related Work
3 Methodology
  3.1 Generating the synthetic data set
  3.2 Pre-processing Char74K and ICDAR2003 data sets
  3.3 Neural Network
  3.4 Evaluation
4 Results
  4.1 Results from related studies
  4.2 Results from this study
5 Discussion and Conclusions
  5.1 Discussion on the results
  5.2 Discussion on the synthetic data set
  5.3 Discussion on the neural network
  5.4 Conclusions
  5.5 Future Work
References

1 Introduction

The rapid growth of high-resolution video material generated by modern devices (e.g.
mobile devices, security cameras) makes detecting and recognizing characters in this material an important problem, especially in areas related to data mining, categorization, etc. Text detection and character recognition is a classic pattern recognition problem. For the Latin script, this is largely a solved problem in certain cases, such as images of scanned documents [2]. However, text detection and character recognition in natural images (photographs) pose a much more difficult problem, because the characters can be much harder to recognize (e.g. characters in neon script outside a restaurant). In recent years, numerous approaches have been evaluated to solve the problem, either by dividing it into stages of text detection, extraction and recognition, or by end-to-end solutions [20]. This thesis focuses only on the character recognition problem in natural images.

For the problem of character recognition, algorithms implementing neural networks have proven to perform well [2, 5, 20]. Thus, this paper concentrates mainly on these algorithms. However, one major problem when training neural networks is the lack of a large and properly labeled data set for training, which also applies to character recognition in natural images [4, 5, 20]. A solution in this kind of situation is to generate a synthetic data set.

Generating synthetic data sets is a relatively new area in the machine-learning domain. The concept is to generate artificial data that can be used for training machine-learning models. The key advantages of training a machine-learning model on a synthetically generated data set are feasibility and flexibility. It is often the case that the real data set is too small or missing completely, and in many of these cases it is not feasible to gather more real data because of time or budget constraints [3].
Furthermore, there are cases where real data is available in abundance but still cannot be used because of privacy or confidentiality concerns [14]. In such cases, generating a synthetic data set is a viable option, provided that the results of training the machine learning model on such a data set turn out to be acceptable for the purpose.

One area where generating synthetic data is applicable is character recognition. Even though the use of synthetic data has been evaluated in end-to-end solutions and word recognition [6, 7], the area of character recognition lacks studies evaluating the use of synthetic data alone for training neural networks.

1.1 Problem

For the problem of character recognition in natural images, the lack of large data sets for training neural networks is a constraint. Synthetic data sets could be a substitute when the real data set is too small or missing completely. The research problem posed in this thesis is thus: can a synthetic data set alone be used to train a neural network on the problem of character recognition in natural images? Our hypothesis is that a convolutional neural network trained on a synthetically generated data set will have a similar performance to the same network trained on a real data set.

1.2 Methodology

Generally, creating a realistic synthetic data set for training can be complicated, since many aspects need to be considered. However, achieving realism artificially for characters in natural images is comparatively easy, because only a narrow range of aspects contributes to realism. A natural image often contains distortion, blurriness, noise, and gradient lighting. These artifacts can easily be created by computer software. TensorFlow is used for creating, training and evaluating the neural network. The performance of the network is then evaluated against the state-of-the-art results from other studies.
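As a rough illustration of how such artifacts can be produced in software, the following sketch applies gradient lighting, blur and noise to a small grayscale image. It is not the generator used in this thesis: the function names, the parameter values and the vertical bar standing in for a rendered character are all assumptions made for the example, it uses only the Python standard library rather than TensorFlow, and geometric distortion is left out for brevity.

```python
import random


def make_glyph(size=16):
    """Return a size x size grayscale image (values in [0, 1]) containing a
    bright vertical bar as a stand-in for a rendered character."""
    img = [[0.0] * size for _ in range(size)]
    for y in range(2, size - 2):
        for x in range(size // 2 - 1, size // 2 + 1):
            img[y][x] = 1.0
    return img


def add_gradient_lighting(img, strength=0.5):
    """Darken pixels linearly from the left edge (unchanged) to the right
    edge (scaled by 1 - strength)."""
    width = len(img[0])
    return [[v * (1.0 - strength * x / (width - 1)) for x, v in enumerate(row)]
            for row in img]


def box_blur(img):
    """Apply a 3x3 mean filter; border pixels average over the neighbours
    that exist."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) / len(vals)
    return out


def add_noise(img, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise and clip the result to [0, 1]."""
    if rng is None:
        rng = random.Random()
    return [[min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
            for row in img]


def degrade(img, rng):
    """Chain the three artifacts: lighting, then blur, then noise."""
    return add_noise(box_blur(add_gradient_lighting(img)), rng=rng)


rng = random.Random(0)   # fixed seed so the example is reproducible
clean = make_glyph()
sample = degrade(clean, rng)
```

Calling `degrade` repeatedly with different seeds and parameter values yields many distinct variants of the same clean image, which is the basic idea behind building a large labeled data set from a small set of rendered characters.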
1.3 Delimitations

Text recognition in natural images has traditionally been divided into several subproblems: detection, extraction and recognition. This thesis focuses only on generating a synthetic data set for the subproblem of character recognition, where characters have already been detected and extracted from the natural image. There are, however, recent studies that have achieved good results by developing an end-to-end solution [7, 20].

To evaluate a synthetic data set, a neural network with good performance is needed. When building a neural network, the performance of the network often depends on several parameters. Setting these parameters requires extensive testing, which is time-consuming, especially when heavy computing power is absent. Due to limitations in time and computing power (a desktop PC with a single mid-range GPU), only a limited amount of testing will be possible. Consequently, the parameters of the neural network will be calibrated only lightly, which could leave the network insufficiently optimized. The limited access to computing power will also be a constraint when creating the synthetic data set, which will affect the size of the data set.

2 Background

This section gives a background on the techniques used in this paper. First, an introduction to synthetic data sets is presented; then the area of neural networks is covered, with a focus on convolutional neural networks. This part is mainly based on the book Neural Networks and Deep Learning by Nielsen. Then, the Char74K and ICDAR2003 data sets are introduced, and finally, related work is presented.

2.1 Synthetic data sets

One common problem when training neural networks is the lack of a large and properly labeled data set for training. Gathering such a data set could be very time and resource consuming