DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2017

Character Recognition in Natural Images Utilising TensorFlow

ALEXANDER VIKLUND

EMMA NIMSTAD

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Degree project in Computer Science, DD143X
Date: June 12, 2017
Supervisor: Kevin Smith
Examiner: Örjan Ekeberg
Swedish title: Teckenigenkänning i naturliga bilder med TensorFlow

Abstract

Convolutional Neural Networks (CNNs) are commonly used for character recognition. They achieve the lowest error rates for popular datasets such as SVHN and MNIST. However, research on character classification in natural images covering the whole English alphabet rarely uses CNNs. This thesis conducts an experiment where TensorFlow is used to construct a CNN that is trained and tested on the Chars74K dataset, with 15 images per class for training and 15 images per class for testing. This is done with the aim of achieving a higher accuracy than the non-CNN approach by de Campos et al. [1], which achieved 55.26%.

The thesis explores data augmentation techniques for expanding the small training set and evaluates the result of applying rotation, stretching, translation and noise adding. All of these methods apart from adding noise have a positive effect on the accuracy of the network. Furthermore, the experiment shows that with a three-layer convolutional neural network it is possible to create a character classifier that is as good as de Campos et al.'s. It is believed that even better results could be achieved if more experiments were conducted on the parameters of the network and the augmentation.

Sammanfattning

It is common to use convolutional neural networks (CNNs) for image recognition, as they achieve the lowest error rates on well-known datasets such as SVHN and MNIST. However, there is a lack of research on using CNNs for classification of letters in natural images covering the whole English alphabet. This work describes an experiment where TensorFlow is used to build a CNN that is trained and tested with images from Chars74K. 15 images per class are used for training and 15 per class for testing. The goal is to achieve a higher accuracy than 55.26%, which is what de Campos et al. [1] achieved with a method that does not use artificial neural networks. The report explores different techniques for artificially expanding the small dataset, and the result of applying rotation, stretching, translation and added noise is evaluated. The result is that all of these methods except added noise have a positive effect on the accuracy of the network. Furthermore, the experiment shows that with a three-layer CNN it is possible to create a letter classifier that is as good as de Campos et al.'s. If more experiments were conducted on the parameters of the network and the augmentation, it is likely that even better results could be achieved.

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Scope and constraints
  1.3 Thesis outline

2 Background
  2.1 Text recognition
    2.1.1 Pre-processing
    2.1.2 Character recognition
  2.2 Neural networks
    2.2.1 Feedforward networks
    2.2.2 Convolutional neural networks
    2.2.3 Training
    2.2.4 Training neural networks with small datasets

3 Method
  3.1 Overview
  3.2 Chars74K dataset
  3.3 Image processing
    3.3.1 Pre-processing
    3.3.2 Dataset augmentation
  3.4 Neural network
    3.4.1 TensorFlow
    3.4.2 Input
    3.4.3 Convolutional layers
    3.4.4 Fully connected layers
    3.4.5 Cost function
  3.5 Training

4 Result
  4.1 Processed images
    4.1.1 Pre-processing
    4.1.2 Data augmentation
  4.2 Neural network accuracy

5 Discussion
  5.1 Discussion on the dataset
  5.2 Discussion on image processing methods
  5.3 Discussion on neural network approach

6 Conclusion

Bibliography

Chapter 1

Introduction

The problem of recognising text has arisen from the massive data collection performed with modern technology. One aspect of this is how text recognition is used in techniques for document digitisation, automated data entry from scanned form sheets and text separation from graphics, that is, tasks that can be performed with classic OCR (Optical Character Recognition). These tasks can be performed by OCR with certain limitations; it is required that the images are fronto-parallel and skew compensated [2].

Another aspect of text recognition is its usage for analysing images. The amount of data generated from an abundance of camera-equipped devices, such as smartphones and wearables, is increasing [3], and it is desired that computers become better at analysing and indexing those images. Possibly, there is text in the images that could give important contextual information, and thus, the indexing would be improved if the text could be found and read. There is also a need for accurate text recognition in natural scenes within the field of automation, such as automatic reading of signs in driverless cars.

This thesis treats the problem of identifying characters that have already been individually extracted from natural images. There are many different approaches to this problem, ranging from very traditional methods to modern and more advanced techniques. In general, the methods make some assumptions about the letters, such as them being monochrome with a significant contrast to their background. Yet in a picture of a natural setting, the light conditions and the texture of materials will affect the colours and contrasts. This means that even if the letters originally are monochrome, that is not what they will look like in a digital picture, and that is one of the challenges that this problem poses.

Character recognition is well suited for an artificial neural network approach, an approach that has had great success in recent research. Neural networks have for example achieved the lowest error rate on the SVHN (Street View House Number) dataset, 1.64% [4], and the lowest error rate on the MNIST dataset, 0.21% [5].

A tool that has simplified working with neural networks is TensorFlow, which was released by Google in 2015. The TensorFlow API includes functions for building and training neural networks, as well as for drawing diagrams and visualising the data.


1.1 Problem Statement

One of the biggest datasets with characters in natural images is the Chars74K dataset [6]. It was created by de Campos et al., who describe how they created it in the report Character Recognition in Natural Images [1].

The report also provides a couple of baseline experiments for character classification, none of which uses neural networks. The most accurate method is a Multiple Kernel Learning (MKL) algorithm that evaluates six histograms of different features. Their classification scheme is trained with 15 images per character class and then tested on 15 images per character class. The scheme results in an accuracy of 55.26%.

This thesis aims to evaluate how a neural network as a classification scheme compares to the MKL algorithm executed by de Campos et al. To make the comparison fair, it is made with the same training and test data, even though this is a small amount of data for training a neural network. The problem statement for this is:

Is an artificial neural network method for classification of characters in natural images more accurate than the method tested by de Campos et al. [1]?

1.2 Scope and constraints

The problem of identifying text in an image is a complex problem with many subproblems. A full-fledged text recognition software would consist of algorithms to detect and extract the characters, categorise the individual characters and then use lexical and grammatical analysis to correct characters that may have been incorrectly categorised.

This thesis is only concerned with the most central task: categorising characters already detected and extracted from images. No further lexical or grammatical analysis will be performed to improve the results.

1.3 Thesis outline

This thesis begins by presenting the background of character recognition and neural networks. Focus is put on how convolutions and the training of neural networks work. In the method chapter, algorithms for processing the images and training the neural network are described in detail, and the Chars74K dataset is presented. Then the achieved results are presented and compared to de Campos et al.'s results, to conclude how our algorithm performs on the given dataset. The thesis ends with an analysis of the tested algorithm, to indicate its strengths and how it could be improved.

Chapter 2

Background

In the background chapter, the techniques used in this thesis are described. First, the area of text recognition is outlined. Secondly, an introduction to neural networks is given, both their structure and how they are trained. Training optimisations and data augmentation methods are also described.

2.1 Text recognition

The text recognition problem can, according to Jung et al. [7], be divided into four subproblems: 1. detection, 2. localisation, 3. extraction and enhancement, 4. recognition. The first three steps may be grouped as the pre-processing steps, the steps that are performed before the main task of recognising the characters. Jung et al. [7] mention that the terms for these steps are used interchangeably in many reports. The definitions used in this report follow in this section.

2.1.1 Pre-processing

The detection step is to detect whether there is any text in the image, and the localisation step is to locate where that text is. When the text has been located, it is useful to extract it, to facilitate the recognition method, and to enhance it, as the text region may be of low resolution and suffer from noise. Enhancements may include deskewing, binarisation and smoothing. Extraction is simply to isolate each character in its own image. [8]

2.1.2 Character recognition

There are several different techniques for recognising a character once it is extracted and appropriately pre-processed. Singh [9] performs a survey of OCR techniques wherein he lists five different techniques for character classification, among these structural analysis and neural networks. Structural analysis is the technique used by de Campos et al. and is described by Singh as: “Structural Analysis identifies characters by examining their sub features- shape of the image, sub-vertical and horizontal histograms. Its character repair capability is great for low quality text and newsprints.”


Neural networks are described by Singh as the strategy that simulates the way the human neural system works. He also writes that it is a great technique for damaged text, as it recognises characters through abstraction to known character pixel patterns.

2.2 Neural networks

Neural networks are inspired by how the human brain works: they are built up of small processing units called neurons, and numerous directed connections between them. Each neuron in itself does little computation, but a network of many neurons is a powerful computational unit. The computation a neuron performs is dependent on its set of input connections $X$. Each input $x_j \in X$ to a neuron $i$ has its own weight $w_{ij}$. The neuron also has a bias $b_i$. The output $y_i$ of the neuron is then computed by the formula

$$y_i = f\left(b_i + \sum_{j=1}^{n} w_{ij} x_j\right)$$

where $f$ is the activation function.
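As a minimal illustration (ours, not from the thesis; the choice of tanh as activation is arbitrary), this computation in NumPy is:

    import numpy as np

    def neuron_output(x, w, b, f=np.tanh):
        """Output of a single neuron: the activation function applied to
        the bias plus the weighted sum of the inputs."""
        return f(b + np.dot(w, x))

    y = neuron_output(np.array([0.5, -1.0, 2.0]),
                      np.array([0.1, 0.4, -0.2]),
                      b=0.05)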

2.2.1 Feedforward networks

One way to model a neural network is as a feedforward network. Kriesel [10] says that in this model, the neurons are grouped together in several layers: an input layer, some hidden layers and an output layer. The neurons in one layer only have directed connections to the neurons in the next layer. A reason to use this model is the possibility to create an abstraction of the network by looking at the layers instead of at each individual neuron. By doing this abstraction, the properties of all the neurons in one layer can be specified collectively.

2.2.2 Convolutional neural networks

A convolutional neural network is a type of feedforward network that contains multiple hidden layers and levels of feature abstraction. Two different types of layers that a convolutional neural network may consist of are convolutional layers and fully connected layers.

Convolutional layers

Convolutional layers are used for the network to be able to learn patterns in the input. Every neuron in the layer only has connections from a square patch of neurons in the input layer, see Figure 2.1, and the weights for the input connections are the same for every neuron. The patches can hence be seen as filters, also called kernels, that are applied to the input layer to find patterns.

To be able to learn $o$ different features, $o$ filters need to be created, and the convolutional layer will have $o$ channels. Each filter then has its own weight matrix and bias. The filter is not necessarily applied at every possible patch, but can be applied at every other neuron, or every third, and so forth. The distance by which the filter moves is called the stride, $s$.


When convolving with a $k$ by $k$ filter over an $a$ by $a$ input layer, the resulting layer is of size $a - k + 1$ by $a - k + 1$. If a stride $s \neq 1$ is used, the layer will have the size $\lceil (a - k + 1)/s \rceil$ by $\lceil (a - k + 1)/s \rceil$. A convolutional layer may consist of several filters and feature maps. Then each filter has its own weights and its own bias.

Figure 2.1: A visualisation of a feature map (right) with a 5 by 5 filter reading its input. Source: Nielsen [11]

Fully connected layers

A fully connected layer is a layer in which every neuron is connected to all neurons in the previous layer. In a convolutional neural network, fully connected layers are used after the convolution to perform a kind of reasoning over the previous layer and bring the feature maps to predicted labels. As spatial information that has been preserved in previous layers is lost in a fully connected layer, there can be no more convolutional layers after a fully connected layer. [12]
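As a quick check of the size formula above, a small helper (ours, not from the thesis) computes the output side length of such a convolution; the example values happen to match the 48-pixel input and the first two layers used later in section 3.4.3:

    import math

    def conv_output_size(a, k, s=1):
        # Side length after convolving an a-by-a input with a k-by-k
        # filter at stride s, without padding.
        return math.ceil((a - k + 1) / s)

    conv_output_size(48, 12)      # 37
    conv_output_size(37, 8, s=2)  # 15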

2.2.3 Training

There are several ways that a neural network can change in order to learn from training data, including creating and deleting connections, creating and deleting neurons, and changing the neuron functions. However, the most common method is to modify the connection weights $w$. [10]

How the rules for modifying the connection weights are formulated depends on which learning paradigm is used. Two general paradigms are supervised and unsupervised learning. These differ in which premises the network is given for learning. [13]

A network that is trained with unsupervised learning is only given input patterns and tries to detect similarities between them to create classes of patterns. This paradigm is in many aspects close to how biological neural networks work; a baby living with cats will recognise that the neighbours' cat is the same animal, even if it hasn't yet learnt the word cat.

When training with supervised learning, in addition to the given input patterns, we also give the correct class or label for each pattern. This would be equivalent to teaching a child animals by naming each animal they see. Obviously, this is a more efficient method for learning. The drawback of supervised learning is that it requires a lot of labelled data.

Training a single neuron

Training of the network is done with the aim of minimising a cost function $C$. This is a function that mostly depends on the discrepancies between the predicted class and the actual class for each input datum, but it may specify any objective that needs to be optimised.


Through learning, the weight $w$ is updated with the formula

$$w \leftarrow w - \eta \frac{\partial C}{\partial w}$$

where $\eta$ is the learning rate. That is, the weights are updated by the derivative of the cost function with respect to the weights themselves. The biases are updated in a similar way, but with the derivative of the cost function with respect to the specific bias. [11]

Training a convolutional network

The general principle of training a multi-layered network is the same. The goal is to select the weights $W$ and biases $B$ that minimise the output of $C(W, B)$. A network with multiple layers is trained through backpropagation. The goal of backpropagation is to compute the derivatives $\partial C/\partial w$ and $\partial C/\partial b$ of the cost function with respect to any weight $w$ or bias $b$ in the network. [11]

The process of backpropagation is complicated, but Nielsen covers the subject extensively in his book on deep learning. [11]

Stochastic Gradient Descent

Computing the gradient, which is the vector of partial derivatives that updates the weights and biases of a network, requires many computations, and it must be computed in each iterative step of the training process. This poses a problem when training the network.

A solution to this is to estimate the cost function by only computing it for a small random subset of the training data. By doing this, the number of computations needed in each step is greatly reduced, with the drawback that the computed gradient is not optimal for training the network to fit the entirety of the input data. This is accounted for by performing more iterations when training the network. As the gradient is computed for a random subset in each iteration, it is likely that the misalignments in the gradients counteract each other over time. [14]

Adam optimiser

There are optimisations that can be made to improve the learning rate when training with a stochastic gradient descent pattern. The best such stochastic optimiser today is the Adam optimiser. For each iteration $t$, it stores the exponentially decaying mean of the previous gradients, $m_t$, and the exponentially decaying mean of past gradients squared, $v_t$. These are computed as

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

where $g_t$ is the computed gradient, and $\beta_1$ and $\beta_2$ are the chosen decay rates. Keeping these values reduces the impact a specific training batch has on the optimisation of the network and greatly increases the learning rate. [15]

As $m_0$ and $v_0$ are initialised to 0, Kingma and Ba noticed that $m$ and $v$ are biased towards 0. They counteract this bias by computing bias-corrected estimates:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$


$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

These bias-corrected values are then used to update the weights and biases, which are stored in $\theta$, with

$$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$

where $\epsilon$ is a small positive constant.
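To make the update rule concrete, here is a minimal NumPy sketch of one Adam step (our illustration; the thesis relies on TensorFlow's built-in optimiser, see section 3.5):

    import numpy as np

    def adam_step(theta, grad, m, v, t,
                  eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam update for parameters theta given gradient grad.
        m, v are the running moment estimates; t is the 1-based step count."""
        m = beta1 * m + (1 - beta1) * grad        # decaying mean of gradients
        v = beta2 * v + (1 - beta2) * grad ** 2   # decaying mean of squared gradients
        m_hat = m / (1 - beta1 ** t)              # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v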

Dropout

A training optimisation used in this report to improve the network's learning ability is dropout. Dropout is a technique for counteracting over-fitting and making the network more robust by making individual units less dependent on each other. Dropout is done by randomly excluding hidden neurons from the training process. The result is that the neural network does not train at full capacity and trains different neurons in each iteration. Dropout is not used when testing the neural network. [12]
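A minimal sketch of the idea (our illustration, not the thesis code; TensorFlow's dropout additionally rescales the kept units by 1/keep_prob, which the sketch mirrors):

    import numpy as np

    def dropout(h, keep_prob):
        """Inverted dropout on a layer's activations h during training.
        Each unit is kept with probability keep_prob; the rescaling means
        no adjustment is needed at test time, where dropout is skipped."""
        rng = np.random.default_rng()
        mask = rng.random(h.shape) < keep_prob
        return h * mask / keep_prob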

2.2.4 Training neural networks with small datasets

Large amounts of labelled data are crucial for training a neural network. This poses a problem when we are faced with a small dataset. There are different methods for solving this problem. One method is semi-supervised learning, in which the neural network is first trained with unlabelled data with the aim of finding structures in the data. The labelled data is then used to constrain and enable classification of the data. [13] A second method is augmenting the dataset to artificially create a bigger dataset.

Dataset augmentation

Dataset augmentation is the practice of applying data-specific transformations to labelled data to artificially expand a training dataset. This is a common tool in supervised learning. There is no universal way to augment all datasets; instead, all transformations must be carefully designed and tested for every domain and specific dataset. Designing dataset augmentations requires domain knowledge to ensure that each augmentation is a valid transformation, such that it could naturally occur in the domain. [16]

Chapter 3

Method

The method chapter describes the implementation of the conducted experiment. An overview of the data flow is presented in section 3.1 to give an abstract understanding of how the images are processed. Section 3.2 introduces the dataset used. The key parts of the experiment are the image processing, which is described in section 3.3, and the neural network, which is described in section 3.4. The choice of training method is presented in section 3.5.

3.1 Overview

Figure 3.1: Abstract overview of the data flow.

To begin with, the test and training sets are created identically to de Campos et al.'s sets, with images from the Chars74K dataset.

Before the images are used in the neural network, they are enhanced and augmented. The purpose of this image processing is to create images that are more suited as input to the neural network and to artificially expand the dataset.

The neural network built in the experiment is a three-layer convolutional network that is implemented in TensorFlow. The output of the neural network is a guessed label for each image. In the training phase, the network compares this to the actual label of the image to adjust its weights and biases, and in the testing phase, this is used to calculate the accuracy of the network.

The experiment is built entirely in Python 3. The external Python libraries that are used are SciPy, NumPy, Pillow and TensorFlow.


3.2 Chars74K dataset

The Chars74K dataset is a collection of 74,000 characters sampled from natural images, hand-drawn characters and computer fonts. The collection contains all lowercase English characters a to z, all uppercase English characters A to Z, and all digits 0 to 9, for a total of 62 different character classes.

All images are in PNG format and are fully opaque. The samples do not have a standardised resolution or size, but are the size they appeared in the original images. With each sample there is an associated bitmask, given as a 1-channel PNG image with pixel values of 0 and 255. The sample images should be interpreted to not contain the character where the corresponding pixels in the bitmask have a value of 0.

The test set and training set that de Campos et al. used both contain 15 character samples from each character class, so there are $15 \cdot 62 = 930$ images in total. The test and training sets are disjoint. All the samples used in their experiment are obtained from natural images.

3.3 Image processing

Each image is passed individually to a pre-processing algorithm to create images that are more suited as input to the neural network. A suitable image, in this case, is an image in greyscale with distinctive edges. Certain random augmentations are also applied to the images before using them as input to the neural network.

3.3.1 Pre-processing

The algorithm that is used for pre-processing, called PREPROCESS, is defined at a pixel level and implemented with Pillow. It takes as input one image of width $w$, height $h$ and $c$ colour channels, all of which can be arbitrarily large, and outputs an image with width $w - 1$, height $h - 1$ and one channel. The algorithm is defined below with the help of CONTRAST. In the pseudocode, $a[i, j]$ refers to the pixel at column $i$ and row $j$ of image $a$, and $p[c]$ refers to the value of colour channel $c$ of pixel $p$.


function CONTRAST(pixel p1, pixel p2)
    c ← number of channels in p1        ▷ equally many channels in p2
    contr ← [1..c]
    for i ← 1 to c do
        contr[i] ← |p1[i] − p2[i]|
    end for
    return avg(contr)
end function

function PREPROCESS(image in)
    out ← [1..w−1, 1..h−1]
    for i ← 1 to (w−1) do
        for j ← 1 to (h−1) do
            out[i, j] ← avg(CONTRAST(in[i, j], in[i, j+1]), CONTRAST(in[i, j], in[i+1, j]))
        end for
    end for
    return normalise(out)
end function

The normalisation that is done to the output out is linear and scales the smallest value to 0 and the biggest to 1. This is done since the range [0, 1] is more standardised to use as input to the neural network.

After the samples have been processed, their associated bitmask is applied to reduce them. This is achieved with the algorithm APPLY_MASK. Recall that the values of the mask are either 0 or 255.

function APPLY_MASK(image im, mask m)
    return im ∗ (m/255)        ▷ element-wise ∗ and /
end function

Throughout the algorithm, the images are formatted as Pillow Images. The images can be pre-processed, masked and then saved to file separately from the following computations in the experiment. When converting the images to NumPy arrays, numpy.array is used.
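The pseudocode translates directly into a vectorised NumPy implementation. The following sketch is our rendering, not the thesis code; the function names mirror the pseudocode, and the channel handling for greyscale inputs is an assumption:

    import numpy as np
    from PIL import Image

    def preprocess(path):
        """Edge-contrast pre-processing: average of the contrast towards
        the pixel to the right and the pixel below, normalised to [0, 1]."""
        img = np.asarray(Image.open(path), dtype=np.float64)
        if img.ndim == 2:                 # greyscale: add a channel axis
            img = img[..., np.newaxis]
        # CONTRAST with the pixel below, and with the pixel to the right,
        # averaged over the colour channels
        down = np.abs(img[1:, :-1] - img[:-1, :-1]).mean(axis=2)
        right = np.abs(img[:-1, 1:] - img[:-1, :-1]).mean(axis=2)
        out = (down + right) / 2
        return (out - out.min()) / (out.max() - out.min())  # linear normalisation

    def apply_mask(img, mask):
        # mask has values 0 or 255; scale to {0, 1} and multiply element-wise
        return img * (mask / 255)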

3.3.2 Dataset augmentation

The dataset is randomly augmented on demand in real time during the training of the neural network. The augmentation methods that are utilised are:

Rotation - Uniform random distribution between −15 and 15 degrees.

Stretching - Uniform random choice among the size-reduction factors 0.7, 0.8, 0.9 and 1, independently for width and height.

Translation - Uniform random distribution between −6 and 6 pixels horizontally and between −4 and 4 pixels vertically.

The choices of the augmentations are based on a study by Klep [12] showing that these particular augmentations were successful in classifying handwritten characters. All random values are generated with the Python library random.


Rotation

SciPy's ndimage.interpolation.rotate is used to rotate the images. The images are rotated a random angle between −15 and 15 degrees. Points previously outside the image are filled in with the value zero (black).

The choice of span for the random angles is supported by the study by Klep concerning data augmentation methods and by the fact that most characters in the training and test sets are positioned upright. As it seems unlikely that the images would be biased towards rotation in a specific direction, an equal number of degrees in either direction was decided upon.

Stretching

Stretching is where the height and/or width of the image are scaled by a factor. The images are effectively squeezed in the vertical and horizontal directions, as all the factors are less than or equal to one.

SciPy's ndimage.interpolation.zoom is used to stretch the images. This function uses spline interpolation to calculate the values of the augmented image. As this function shrinks the image, NumPy's pad function is used to pad the image with zeroes back to the original size.

The choice of possible factors is supported by testing different limitations and selecting the one that proved most successful for training the network.

Translation

SciPy's ndimage.interpolation.shift is used to translate the images. Values outside the original image are filled in with zeros. The possible spans for translation are selected through testing, based upon the study by Klep.
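A sketch combining the three augmentations with the SciPy functions named above (the helper augment and the use of scipy.ndimage's top-level aliases are ours; the thesis applies each transformation separately):

    import random
    import numpy as np
    from scipy import ndimage

    def augment(img):
        """Apply one random rotation, stretch and translation to a 2-D
        greyscale array; values outside the original are filled with 0."""
        # Rotation: uniform angle in [-15, 15] degrees, keeping the size
        angle = random.uniform(-15, 15)
        out = ndimage.rotate(img, angle, reshape=False, cval=0.0)

        # Stretching: independent factors from {0.7, 0.8, 0.9, 1.0},
        # then zero-pad back to the original size
        fy = random.choice([0.7, 0.8, 0.9, 1.0])
        fx = random.choice([0.7, 0.8, 0.9, 1.0])
        shrunk = ndimage.zoom(out, (fy, fx))
        pad_y = img.shape[0] - shrunk.shape[0]
        pad_x = img.shape[1] - shrunk.shape[1]
        out = np.pad(shrunk, ((0, pad_y), (0, pad_x)), mode='constant')

        # Translation: uniform shift, at most 4 px vertically, 6 px horizontally
        dy = random.randint(-4, 4)
        dx = random.randint(-6, 6)
        return ndimage.shift(out, (dy, dx), cval=0.0)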

3.4 Neural network

3.4.1 TensorFlow

TensorFlow is Google's open-source machine learning API that was first publicly released in November 2015. Version 1.0 was released in February 2017. Prior to open-sourcing TensorFlow, Google had been using it internally, for example in their products Search, Translate, Maps and Photos. Since the release, TensorFlow has grown and gained many new usage areas. [17]

The TensorFlow API is implemented in Python, C++, Java and Go. It consists mainly of machine learning and neural network methods, but also methods that help developers with research as well as production. [18]

3.4.2 Input

The input to the neural network is batches of images, represented as 4-dimensional NumPy ndarrays with the shape (n, 48, 48, 1). The shape follows the TensorFlow standard NHWC, where n equals the number of images in the batch: the first dimension corresponds to the indices of the n images, the second and third dimensions correspond to height and width respectively, and the fourth dimension corresponds to the colour channels of the image. Since the images are in greyscale, there is one colour channel. The height and width used are 48 pixels, so the images are scaled after the processing steps.
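As an illustration, a batch in this format could be assembled as follows (make_batch is a hypothetical helper, not from the thesis; the resampling filter used for the scaling is unspecified in the text):

    import numpy as np
    from PIL import Image

    def make_batch(images):
        """Stack Pillow images into an NHWC array of shape (n, 48, 48, 1)."""
        batch = np.stack([np.asarray(im.resize((48, 48)), dtype=np.float32)
                          for im in images])
        return batch[..., np.newaxis]   # add the single greyscale channel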

3.4.3 Convolutional layers

The neural network has three convolutional layers with different parameters. In the experiment, the parameters are varied to find which ones give the best result. The parameters of a convolutional layer are the size of the convolution kernel, the number of output channels and the stride of the kernel.

The first layer has a kernel size of 12 by 12 and 16 output channels. In the second layer, the kernel is 8 by 8 and there are 32 output channels. The third layer has a kernel size of 8 by 8 with 64 output channels. The strides of the kernel are 1, 2 and 2 for the first, second and third layers respectively.
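A sketch of this stack in the TensorFlow 1.x API of the time (our rendering; the thesis does not state its activation function, weight initialisation or padding, so the ReLU, truncated-normal initialisation and VALID padding here are assumptions consistent with the size formula in section 2.2.2):

    import tensorflow as tf

    def conv_layer(x, ksize, in_ch, out_ch, stride):
        """One convolutional layer: ksize x ksize kernel at the given stride."""
        W = tf.Variable(tf.truncated_normal([ksize, ksize, in_ch, out_ch], stddev=0.1))
        b = tf.Variable(tf.zeros([out_ch]))
        conv = tf.nn.conv2d(x, W, strides=[1, stride, stride, 1], padding='VALID')
        return tf.nn.relu(conv + b)

    x = tf.placeholder(tf.float32, [None, 48, 48, 1])  # NHWC input batch
    h1 = conv_layer(x, 12, 1, 16, 1)    # 12x12 kernel, 16 channels, stride 1
    h2 = conv_layer(h1, 8, 16, 32, 2)   # 8x8 kernel, 32 channels, stride 2
    h3 = conv_layer(h2, 8, 32, 64, 2)   # 8x8 kernel, 64 channels, stride 2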

3.4.4 Fully connected layers

The purpose of the fully connected layers is to sample down the output of the convolutional layers further, so that the network can produce a guessed label for each image in the input batch. The first part of the down-sampling is to flatten the array from the convolution, which creates a 2-dimensional array with shape $(n, x \cdot x \cdot o_2)$, where $x$ is the size of the last convolutional layer. This array is then input into a normal layer, where all the neurons have their own weights and biases, with $o_4$ output channels.

The next layer is a dropout layer, where some neurons are dropped before the final layer so that they aren't trained in every training iteration. The final layer is the readout layer, which is a normal layer with 62 output channels. That means that for each image $I_a$ in the input, there is a corresponding vector $V_a$ of length 62 in the output. The correct label for $I_a$ is $L_a$, and the guessed label is the index of the highest value of $V_a$.
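Continuing the sketch above, the fully connected part could look as follows (again our rendering; the hidden width, which the thesis calls $o_4$ without stating its value, is an assumption here):

    x_size = 4   # spatial side of h3 for a 48x48 input with the VALID convolutions above
    o4 = 256     # assumed hidden width; not stated in the thesis

    # Flatten the last convolutional layer into shape (n, x*x*channels)
    flat = tf.reshape(h3, [-1, x_size * x_size * 64])
    W_fc = tf.Variable(tf.truncated_normal([x_size * x_size * 64, o4], stddev=0.1))
    b_fc = tf.Variable(tf.zeros([o4]))
    fc = tf.nn.relu(tf.matmul(flat, W_fc) + b_fc)

    keep_prob = tf.placeholder(tf.float32)   # 0.7 during training, 1.0 at test
    drop = tf.nn.dropout(fc, keep_prob)

    # Readout layer: one score per character class
    W_out = tf.Variable(tf.truncated_normal([o4, 62], stddev=0.1))
    b_out = tf.Variable(tf.zeros([62]))
    logits = tf.matmul(drop, W_out) + b_out
    guess = tf.argmax(logits, 1)             # predicted label index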

3.4.5 Cost function

The cost function $C$ that is chosen, and that should be minimised by training, is the mean of the cross entropy between the guessed labels and the actual labels. In TensorFlow, the function tensorflow.nn.softmax_cross_entropy_with_logits computes the cross entropy on unscaled output vectors $V$. It first scales $V$ with the softmax function, such that all values add up to 1 and are between 0 and 1, which means that they can be seen as probabilities. After that, the cross entropy between each $V_a$ and the corresponding correct label $L_a$ is computed as

$$H(V_a, L_a) = -\log V_{a,L_a}$$

so that $H$ is a vector of length $n$. The value of our cost function in this step is calculated from $H$ as

$$C = \mathrm{mean}(H)$$
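In the sketch from the previous sections, this cost is expressed as (TensorFlow 1.x API; the one-hot label placeholder is our assumption about the label encoding):

    labels = tf.placeholder(tf.float32, [None, 62])   # one-hot correct classes
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=labels,
                                                            logits=logits)
    cost = tf.reduce_mean(cross_entropy)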


3.5 Training

The neural network is trained with the Adam algorithm described in section 2.2.3, which is implemented in TensorFlow as tensorflow.train.AdamOptimizer. The values for the decay rates, learning rate and $\epsilon$ were selected as Kingma and Ba recommend: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ and $\eta = 0.001$. [15]

The training is done in batches of a fifth of the training set, 186 images, at a time. Each image is randomly augmented as described in section 3.3.2 before being put through the network. After five iterations, when each image in the training set has been trained upon, the entire training set is shuffled with Python's random.shuffle function. This ensures that the training batches are random but each image is trained on with equal frequency.

A dropout probability of 0.3 is chosen for the dropout layer during training.
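Putting the pieces of the sketch together, the training loop described here could look as follows (our rendering; num_iterations, training_set and augment_batch are hypothetical placeholders for the iteration budget, the loaded data and the random augmentation of section 3.3.2):

    import random

    train_step = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for it in range(num_iterations):
            if it % 5 == 0:
                random.shuffle(training_set)   # reshuffle after each full pass
            batch = training_set[(it % 5) * 186:(it % 5 + 1) * 186]
            imgs, labs = augment_batch(batch)  # random augmentation per image
            sess.run(train_step,
                     feed_dict={x: imgs, labels: labs, keep_prob: 0.7})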

Chapter 4

Result

This chapter presents the results of the experiment described in the method chapter.

4.1 Processed images

In this section, samples of the result of the image processing methods are presented. The contrast has been adjusted to more clearly show the difference between dark and light on all processed images that are shown.

4.1.1 Pre-processing

When the character samples from the Chars74K dataset are processed with the PREPROCESS algorithm that was defined in 3.3.1, the output is generally an image with a dark background and light lines. The light lines are present where the input sample has adjacent areas with high contrast between them. The light lines will be thinner on images with higher quality and crisp contrasts. A good example of this is seen in Figure 4.1. An image with much lower quality is shown in Figure 4.2, and here the edges of the character are not as sharp, but rather seen as noisy light areas on a darker background.

The effect of the bitmask application with the APPLY_MASK algorithm, also defined in 3.3.1, can be seen in Figure 4.3. The precision of the bitmask varies between the samples. For many images the bitmask consists exclusively of pixels with value 255, and its application doesn't change the image. This is the case even for some samples where the character is surrounded by bad data that likely affects the results negatively, as in Figure 4.4.

Figure 4.1: A K in Chars74K before and after pre-processing.

Figure 4.2: An A in Chars74K before and after pre-processing.


Figure 4.3: A j from Chars74K through the pre-processing.

Figure 4.4: A sample from Chars74K where the bitmask doesn’t cut off bad parts of the image.

Figure 4.5: An example of the augmentation of the same j as in 4.3.


4.1.2 Data augmentation

Since the purpose of the data augmentation is to produce random translations, rotations and stretchings of the images every iteration, it is only possible to give an example of the result. Figure 4.5 shows how a j is augmented with a rotation of 6 degrees, a vertical stretch factor of 0.85 and a translation of (4, −6) pixels.

4.2 Neural network accuracy

Figure 4.6: Accuracy of the neural network on the training set and the test set during training. de Campos et al.’s accuracy of 55.26% is displayed. The vertical axis shows accuracy and the horizontal axis is the training iteration step (time).

The accuracy of the neural network as a classification scheme is shown in Figure 4.6. The test accuracy is measured every 50 iterations. The maximum accuracy achieved by the network during training is 55.91%, and the mean accuracy of the last 14 measurements, measured during the last 650 steps, is 54.78%.

Chapter 5

Discussion

The results show that the neural network with three convolutional layers and the parameters presented in the method chapter reaches a 55% accuracy when classifying the test images. This is approximately as good as the scheme used by de Campos et al. [1].

Normally, when trying a neural network approach to solve a problem, there is more data to train on. The fact that we only had 930 pictures to use was one main concern. It required research and experiments on dataset augmentation methods, and adapting those methods to fit the specific data in Chars74K.

5.1 Discussion on the dataset

The Chars74K dataset is well suited for the problem of identifying characters in natural images, as the images in the set are of varied size, resolution, colour and contrast. The dataset has a couple of limitations that probably have a negative impact on the result of the tested neural network approach.

The test and training sets aren't bigger than 15 samples per class because the smallest character class doesn't contain more than 40 images. As training data for a neural network, these are few samples, and the use of good dataset expansion techniques becomes essential.

In most of the samples, the characters are not rotated. However, some characters are rotated as much as 90 degrees. Some rotated samples were even chosen by de Campos et al. [1] to be in the test and training data, which is believed to be a cause of many wrong predictions.

5.2 Discussion on image processing methods

The methods used for the pre-processing steps have not been scientifically tested, although there are methods described in scientific studies that would work for this experiment. The reason that we chose the methods we did was their simplicity to implement and understand, and that the neural network performed well with the produced images. It is not known how much the pre-processing affects the final result. More testing and adaptation of different methods could answer this and possibly improve the results further.

Similarly, no methodical research has been put into choosing factors for the augmentations of the dataset. Some tests were performed to find the ones used, but the factors could be tested further, given more time and computing power than we had access to in this research.

5.3 Discussion on neural network approach

Building a neural network requires testing all the parameters to find which values are feasible, and further testing to find which values give the best accuracy. We believe that the parameters could be tested more, and tuned more accurately, to give an even better performing neural network.

Chapter 6

Conclusion

This thesis concludes that a convolutional neural network approach to the problem of classifying characters in natural images is at least as accurate as de Campos et al.'s multiple kernel learning method, which achieved an accuracy of 55.26%. Since the dataset is small compared to how much data a neural network normally receives for training, it is necessary to artificially expand the dataset to get a good accuracy. The augmentation methods that have been observed to improve the result are rotation, translation and stretching. It is believed that the accuracy can be improved further with more thorough testing of the parameters of the network and the augmentations.

Bibliography

[1] T. E. de Campos, B. R. Babu, and M. Varma. Character recognition in natural images. In Proceedings of the International Conference on Computer Vision Theory and Applications, Lisbon, Portugal, February 2009. URL http://personal.ee.surrey.ac.uk/Personal/T.Decampos/papers/decampos_etal_visapp2009.pdf.

[2] Paul Clark and Majid Mirmehdi. Recognising text in real scenes. International Journal on Document Analysis and Recognition, 4(4):243–257, 2002. ISSN 1433-2833. doi: 10.1007/s10032-001-0072-2. URL http://dx.doi.org/10.1007/s10032-001-0072-2.

[3] M. Ali and H. Foroosh. Character recognition in natural scene images using rank-1 tensor decomposition. In 2016 IEEE International Conference on Image Processing (ICIP), pages 2891–2895, Sept 2016. doi: 10.1109/ICIP.2016.7532888. URL http://ieeexplore.ieee.org/document/7532888/.

[4] Chen-Yu Lee, Patrick W Gallagher, and Zhuowen Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In International Conference on Artificial Intelligence and Statistics, 2016.

[5] Li Wan, Matthew Zeiler, Sixin Zhang, Yann L Cun, and Rob Fergus. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066, 2013.

[6] The Chars74K dataset, 2012. URL http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/.

[7] Keechul Jung, Kwang In Kim, and Anil K. Jain. Text information extraction in images and video: a survey. Pattern Recognition, 37(5):977–997, 2004. ISSN 0031-3203. doi: 10.1016/j.patcog.2003.10.012. URL http://www.sciencedirect.com/science/article/pii/S0031320303004175.

[8] Ray Smith. An overview of the Tesseract OCR engine. In Proc. Ninth Int. Conference on Document Analysis and Recognition (ICDAR), pages 629–633, 2007.

[9] Sukhpreet Singh. Optical character recognition techniques: a survey. Journal of Emerging Trends in Computing and Information Sciences, 4(6):545–550, 2013.

[10] David Kriesel. A Brief Introduction to Neural Networks. 2007. URL http://www.dkriesel.com/en/science/neural_networks.


[11] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. URL http://neuralnetworksanddeeplearning.com/.

[12] D. M. J. Klep. Data augmentation of a handwritten character dataset for a convolutional neural network and integration into a Bayesian linear framework. 2016.

[13] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-Supervised Learning. MIT Press, 2006. ISBN 9780262255899. URL http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6280908.

[14] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.

[15] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[16] Terrance DeVries and Graham W Taylor. Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538, 2017.

[17] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467, 2016. URL http://arxiv.org/abs/1603.04467.

[18] TensorFlow Developer Summit, 2017. URL https://events.withgoogle.com/tensorflow-dev-summit/.
