A Neural Network Approach to Arbitrary Symbol Recognition on Modern Smartphones

FINAL

SAMUEL WEJÉUS

Master’s Thesis at CSC, KTH
Supervisor: Jens Lagergren
Examiner: Anders Lansner
Project Sponsor: Bontouch AB

TRITA xxx yyyy-nn

Abstract

Making a computer understand handwritten text and symbols has numerous applications, ranging from reading bank checks and mail addresses to digitizing arbitrary note taking. Benefits include automation of processes, efficient electronic storage, and possible augmented usage of parsed content.

This report will provide an overview of how off-line handwriting recognition systems can be constructed. We will show how such systems can be split into isolated modules which can be constructed individually. Focus will be on handwritten single symbol recognition, and we will present how this can be accomplished using convolutional neural networks on a modern smartphone. A symbol recognition prototype application for the Apple iOS operating system will be constructed and evaluated as a proof-of-concept.

Results obtained during this project show that it is possible to train a classifier to understand arbitrary symbols without the need to manually craft class-separating features, relying instead on deep learning for automatic structure discovery.

Referat

Recognition of symbols using neural networks on modern mobile phones

Making a computer understand handwritten text and symbols has many applications, ranging from reading bank cheques and postal addresses to digitizing arbitrary notes. Benefits of such a process include automation of procedures, electronic storage, and possible extended use of the captured material.

This report aims to give an overview of how systems for off-line recognition of handwritten text can be built. We intend to show how such systems can be divided into smaller isolated parts that can be realized individually. The focus will be on recognition of individual handwritten symbols, and we will present how this can be done using convolutional neural networks on a modern mobile phone. An application that performs recognition of such symbols will be built for Apple's iOS operating system as a proof-of-concept.

The results obtained in this project show that it is possible to train a classifier to recognize arbitrary symbols without manually creating class-separating features, instead relying on deep learning for automatic discovery of structure.

Contents

List of Figures

List of Tables

Acronyms

1 Introduction
  1.1 History
  1.2 The Client
  1.3 Applications
  1.4 Problem Statement
  1.5 Challenges
  1.6 Limitations of Scope

2 Related Work
  2.1 The MNIST Database
  2.2 Current State of Field
  2.3 Handwriting Recognition
    2.3.1 Writing Styles and Related Issues
    2.3.2 Writer-dependence vs. Writer-independence
    2.3.3 On-line vs. Off-line handwriting
    2.3.4 Segmentation
    2.3.5 Features and Feature Selection
  2.4 Problem Reduction Techniques
  2.5 Description of a Complete HWR System

3 Theory
  3.1 Choice of Symbol Classifier
  3.2 Artificial Neural Networks
    3.2.1 Network Structure and Network Training
    3.2.2 Deep Learning of Representations
    3.2.3 Convolutional Neural Networks

4 Results
  4.1 Comparison of Neural Network Libraries
    4.1.1 Investigated Libraries
    4.1.2 Evaluation
  4.2 Lua
  4.3
  4.4 Prototype
    4.4.1 Network Training
    4.4.2 iPhone Application
    4.4.3 Testing

5 Discussion
  5.1 Conclusions

6 Future Work

References

List of Figures

2.1 Samples from the MNIST dataset
2.2 Hard samples from the MNIST set
2.3 Example of recognition using the Evernote OCR system
2.4 Variation in writing style
2.5 Example of a captured word image before explicit segmentation
2.6 Example of different results after performing explicit segmentation
2.7 Example of sliding-window technique
2.8 Hypothetical HWR System Pipeline

3.1 Visual model of the McCulloch and Pitts neuron
3.2 Overview of a simple neural network model
3.3 Plot of network performance over time
3.4 A typical two stage ConvNet
3.5 Edge detection using convolution

4.1 Training using different learning rates
4.2 Input views of application prototype
4.3 Example of user drawing custom shapes
4.4 Drawing classification pipeline
4.5 Classification using camera capture
4.6 Pre-processing stages of sample captured with camera
4.7 Number of correct classifications using prototype (individual numbers)
4.8 Number of correct classifications using prototype (total)

5.1 30 filters trained on the MNIST set

List of Tables

2.1 Best results reported on the MNIST set for various ML techniques

3.1 Mathematical notation used when describing neural network algorithms

4.1 Freehand drawing speed performance of prototype on various devices
4.2 Camera capture speed performance of prototype on various devices

Chapter 1

Introduction

The process of parsing samples of handwritten text into symbols is usually referred to as recognition or classification. The purpose of such a representation is that it can be interpreted by a machine. One usually makes a distinction between two types of recognition: if the characters are printed in typewriter fonts, their recognition is referred to as Optical Character Recognition (OCR), and if the characters are written by hand we call the process Handwriting Recognition (HWR). Furthermore, handwriting can be distinguished as being either on-line or off-line, depending on when the text is captured. If the text is captured while the author is writing, it is referred to as on-line mode; otherwise it is referred to as off-line.

A complete recognition system consists of three parts commonly corresponding to three separate problems: localization, segmentation, and recognition. The goals of these parts are: isolating and finding contours of individual words (localization), separating a word into individual characters (segmentation), and finally mapping segmented chunks to the correct interpretation (recognition).

Today the most popular approaches to character recognition involve some form of Machine Learning (ML) technique. A classifier is the set of techniques used and can be seen as a black box producing output in the form of a classification given some input. Classification is regarded as an instance of supervised learning, that is, learning performed using a set of correctly identified observations. Machine learning can be used for a complete recognition system, or for specific parts.

A recognition system involves complex tasks that need large computational resources. Building recognition systems for smartphones has not yet been widely investigated [24]; consequently it is of interest to determine a set of suitable technologies for these types of devices. This report will focus on the off-line case of handwriting recognition on a smartphone using neural networks. We will give an introduction to the various problems faced when building HWR systems. No assumptions or clear goal was given at the start of this project other than the questions of what the state of the field is today, what could be accomplished, and how it would be done. The outline of this report is therefore of an investigative nature. To validate the results found, a prototype for character recognition on a smartphone will be presented as a proof-of-concept for how neural networks can be used efficiently on smartphones.

1.1 History

Character Recognition (CR) was first studied in the beginning of the 1900s, taking a mechanical approach using photocells [1]. Common techniques investigated included simple template matching and structural analysis. Initial development came to a halt when researchers realized the huge diversity and variability of handwritten input [48]. The history of research efforts made since its infancy has not been a linear process. The problem of CR was at first a very popular research subject since it was regarded as an easy problem to solve. As in many other fields of science and technology the process is usually tangled, and progress is often made when research diverges and different results are then cross-bred [48]. Modern state of the art recognition systems use techniques from various fields of pattern recognition, machine learning and artificial intelligence. Today on-line HWR is considered a close to solved problem [54]. The problem of off-line handwriting is however much harder and is considered an open question in the research community [55].

1.2 The Client

Bontouch AB is an IT consulting company whose aim is to partner in long-term collaborations with its customers. The company focuses on mobile solutions for platforms like Android (Google)1 and iOS (Apple)2. Bontouch is located in Stockholm but has a global market. Projects developed for clients include, among others, Sweden's first banking app for Skandinaviska Enskilda Banken AB (SEB)3, which makes use of off-line OCR scanning of a predefined printed OCR font on invoices. For future projects Bontouch is very interested in how current recognition systems can be extended, or replaced, to recognize arbitrary input. The main interest for Bontouch AB is to get an overview of the state of the field: what can be accomplished today, and how could a recognition system be implemented on a mobile platform?

1http://developer.android.com/
2https://developer.apple.com/devcenter/ios/
3http://www.seb.se

1.3 Applications

HWR software simplifies the process of extracting data from handwritten documents and storing it in electronic formats. There are numerous applications for HWR systems. In fact, many such applications are commercially available today. These systems range from reading bank checks and signature verification to automatic creation of digital versions of books. General recognition of symbols would also prove valuable in subsystems, for instance for interpreting traffic signs in autonomously driving cars. An interesting application of CR, specifically using a smartphone, could for instance be a currency converter for tourists: by simply aiming the smartphone camera at price tags, prices could be converted in real time.

1.4 Problem Statement

This project seeks to determine how a system for recognition of handwritten symbols would be implemented on a modern smartphone. We will focus on the single symbol off-line case of handwriting, i.e. the input is a static image containing one symbol. This report will discuss the different stages of a complete recognition system to give an overview of the challenges faced, but mainly focus on the recognition phase. A prototype application will be developed and evaluated in order to test such a recognition system with real-world data. The prototype needs to be platform agnostic in the sense that it should work on both Android and iOS equipped devices, as requested by the customer.

1.5 Challenges

The real challenge in automatic handwriting recognition applications is how they conform to changes in their environment. In the ideal case an HWR system should be able to operate properly without any assumptions made about the data captured from the real world. Conditions that can affect the results include, for instance, varying colors in the scene, illumination conditions, and variation in writing style between independent users. Creating recognition systems for embedded devices such as a smartphone adds additional challenges such as severe limitations in memory and CPU performance. Even with the impressive performance of modern day smartphones such as the iPhone 5 or Google Nexus 4, the performance of such devices is only a fraction of that of a modern desktop PC [47]. Studies have shown that users tend to rate responsive applications higher, resulting in positive market success [69]. The procedure from capturing input data to final classification thus has to be fast. The following list points out the problems an embedded recognition system must solve in order to meet the stated requirements.

1. Localization. Given a document of several words and/or symbols, their positions have to be located.

2. Segmentation. Isolation of characters by splitting words into chunks of symbols. Determining where a symbol starts and ends.

3. Variance. Handwriting is subject to high variance in writing style between different authors. Factors include size, rotation, elongation and skew, the equivalent of different fonts. Even a single person writing the same symbols twice is subject to variation in size and position.

4. Real-time processing. Compared to a modern desktop PC, a smartphone puts strong constraints on memory usage, battery consumption and CPU usage.

5. Scene invariance. Capturing scenes from the real world requires heavy pre-processing to select the correct channels and to separate foreground symbols from the background. Lighting conditions also play an important role in scene separation.

6. Platform agnostic. The proposed solution should be able to run on the majority of modern smartphone devices. In practice this means Android and iOS devices.

1.6 Limitations of Scope

Creating a complete handwriting recognition system is a huge task and hence some limitations of scope have to be established for this report. The focus will be on the recognition phase of an HWR system. The prototype will be trained to classify single digits using the Modified NIST (National Institute of Standards and Technology) (MNIST) database (explained further in section 2.1). It is hypothesized that the same kind of classifier trained on digits can also be trained for arbitrary symbols; this will be explained further in sections 3.2.2 and 3.2.3.

Localization of words or lines in a document is normally carried out in an isolated procedure and will not be discussed in depth other than mentioning recommended procedures. As will be explained in the “Related Work” chapter, there are strong suggestions that techniques which combine several stages of the CR pipeline produce better results. Combining segmentation and recognition into one single classification step for word recognition uses different techniques than performing explicit segmentation followed by recognition. Since segmentation could be an integrated part of an HWR system, an overview of the subject will be provided in section 2.3.4. The focus of the suggested prototype will be on single symbol recognition, constrained by the assumptions listed below.

• Only one symbol at a time will be recognized.

• When using real-world data the input sample has to be well illuminated.

• The classifier will only deal with binary data and assumes the images have been pre-processed.

Chapter 2

Related Work

This chapter presents an overview of approaches to HWR, both historical and modern. We will elaborate on the problems inherent in CR and describe the current state of the field. A description of a complete HWR system will be given in section 2.5. In order to train and test an HWR system, we will use the MNIST dataset, a popular benchmark set consisting of handwritten digits. Section 2.4 deals with possible methods that can make the CR problem easier to solve.

2.1 The MNIST Database

In order to make objective experiments an adequate dataset is needed. One such dataset is the MNIST database [71, 35]. The MNIST database was created by Yann LeCun et al. to be used as a benchmark for testing various classifiers of handwritten input. It was created by letting 250 different authors write single digit numbers by hand. The set consists of a training set of 60,000 samples and a test set of 10,000 samples. The two sets are completely disjoint in the sense that the authors in the training set are not the same as the authors in the test set. The digits are centered and size-normalized in fixed-sized images. Each sample consists of a binary image of size 28x28 pixels of a single digit together with a label of its correct classification. The MNIST database is a popular benchmark in the pattern recognition community, thus making comparison with other classifiers easier. Consisting of 60,000 samples, the MNIST set is considered large enough for reliable training and testing of classifiers. Using the MNIST set will also make evaluation of our proposed solution easier, since the prototype that will be created can then be compared with solutions available in the research community. An example of typical samples in the MNIST set is given below in figure 2.1. The set also includes samples that are hard to classify even for a human, as shown in figure 2.2.

When using machine learning algorithms the most important aspect is not how well the classifier adapts to the training data, but how well it generalizes to never before encountered samples. Since the MNIST set is divided into training and test sets with different authors for each, it should give a good indication of how well the classifier will perform on real-world data.

Figure 2.1. Samples from the MNIST dataset.

Figure 2.2. Hard samples from the MNIST set.
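As a concrete illustration of the data format described above, the following is a minimal sketch (not code from the thesis) of how a single MNIST sample could be represented in Torch, the library later used for the prototype. The field names are illustrative assumptions only.

```lua
-- Minimal sketch of how one MNIST sample could be represented in Torch.
-- Field names are illustrative, not the thesis' actual data structures.
require 'torch'

local sample = {
  image = torch.ByteTensor(28, 28):zero(), -- 28x28 pixel grid of the digit
  label = 5                                -- ground-truth class, here the digit 5
}

-- A training set is then simply a list of such samples.
local trainingSet = { sample }
print(string.format('sample is %dx%d pixels, labelled %d',
  sample.image:size(1), sample.image:size(2), sample.label))
```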

2.2 Current State of Field

Today there exists, to the best knowledge of the author, no complete system for handwriting recognition with impeccable accuracy. Theories for such systems have been investigated in [58], which uses a combination of several techniques including neural networks and so-called Hidden Markov models.

Taking the step from single letter recognition to word recognition, researchers have found it is more effective to do both at once instead of separating the recognition stages into explicit segmentation and recognition phases. Another extension from single symbol to word recognition, similar to Hidden Markov Models (HMMs), is to use neural networks in combination with Graph Transformer Networks (GTNs). GTNs, as presented in [7], give an indication that using neural networks as a base for building word recognizers is a promising topic for future research.

By relaxing the demand for correct word recognition and accepting a few errors on a character level, commercial applications, such as the note-taking application Evernote1, have successfully implemented what we here refer to as searchable word recognition [29, 53, 19]. Such a system does not try to find one correct interpretation of a word but instead finds all likely possibilities, as can be seen in figure 2.3.

1http://www.evernote.com

Figure 2.3. Example of recognition using the Evernote OCR system (image from Evernote techblog).

Since the scope of this report is limited to single digit recognition, an overview of recent results for this type of problem will now be presented. To date, the best results reported for various types of classifiers trained using the MNIST dataset [71] are presented in table 2.1 below. The table is an aggregation of different attempts at building a classifier using the specific techniques listed; only the best result reported for each type is listed. Looking at the last line of this table we can see that recent results imply that a specific type of neural network, namely the convolutional network, has produced the lowest error and is considered to be the most efficient approach for recognition of handwritten symbols.

Classifier type    Best result for type (error %)    Reference
Linear             7.6                               LeCun et al. [35]
Non-Linear         3.3                               LeCun et al. [35]
K-NN               0.52                              Keysers et al. [30]
Boosted Stumps     0.87                              Kegl et al. [28]
SVM                0.56                              DeCoste and Scholkopf
MLP                0.35                              Ciresan et al. [11]
ConvNet            0.23                              Ciresan et al. [10]

Table 2.1. Best results reported on the MNIST set for various ML techniques.

2.3 Handwriting Recognition

In this section we will explain the problems inherent in creating a handwriting recognition system. Most pattern recognition problems include usage of extracted features as input for classifiers. In section 2.3.5 we will explain why this approach is considered unwieldy for CR systems.

2.3.1 Writing Styles and Related Issues

Different writing styles are the biggest source of difficulty when it comes to understanding handwritten text. A simple illustration of the differences between styles can be seen in figure 2.4. The first lines, starting from the top, are referred to as discrete handwriting, and the range moves towards more connected, or continuous, types of writing. The styles of the last couple of lines are commonly referred to as connected and pure cursive respectively. Most people use a mixed writing style, alternating between discrete and connected writing, as can be seen on the last line in figure 2.4.

Figure 2.4. Variation in writing style [64].

A fully discrete writing style is considered easier to parse due to the great advantage that the text is easy to segment. Segmentation is one of the hardest, if not the hardest, problems when building text recognition systems. As mentioned earlier, in section 1.5, writing is highly affected by variance. The difficulty lies in constructing accurate and robust models that are invariant to variability in space, i.e. relative scaling. This diversity is obviously due to the fact that no two authors' writing styles are the same, and even a single author finds it hard to maintain consistency.

2.3.2 Writer-dependence vs. Writer-independence

A writer-independent system is independent of the differences in writing style between users. Such a system should be able to recognize handwriting not previously seen during training. Writer-independent systems are generally regarded as much harder to construct [68]. This is not limited to how machines view handwriting; even humans are much more capable of recognizing their own handwriting compared to that of a stranger. The difficulty stems from the high degree of variance inherent in handwriting across authors, and a writer-independent system must learn to cope with this extra level of complexity by being better at generalizing. In the opposite case, for a writer-dependent system, we only have to learn a few different styles, making the problem easier.

2.3.3 On-line vs. Off-line handwriting

There are two primary modes of recognizing handwritten text: either the text is captured in on-line or in off-line mode. On-line recognition consists of recognizing text as it is being written, capturing data about speed, movement, and pressure from some input device. Several variants of capturing text in on-line mode exist, ranging from PDAs to pressure-sensitive input devices. In on-line mode a rich set of features can be captured from the writing process. The captured sensor data can include speed, geometrical position, pressure and temporal information, none of which is present in off-line mode. In off-line mode handwriting is recognized after the text has been written, usually from static images captured using, for instance, a scanner or a camera. This conveys significantly less information. It is thus commonly agreed that on-line handwriting recognition is the easier of the two primary problems: in the on-line case the usage of extra information makes the problem easier to solve. The off-line case, on the other hand, is significantly more complex and is still considered an open research problem [20].

2.3.4 Segmentation

Segmentation is a vital part of a CR system and is one major source of classification errors [56]. In the on-line case there is a great benefit from knowing the input device's up and down movements. The input is time-ordered, which helps in deciding on accurate segmentation points. This information is not available in the off-line case. Handwritten words often do not contain clear boundary points between characters, especially in the case of cursive handwriting, where characters mostly overlap, as illustrated in figure 2.4 in section 2.3.1. Many methods have been suggested for segmentation. According to [8] it is possible to identify three “pure” strategies:

• Classic approach: Segmentation points are determined based on “character-like” properties.

• Recognition-based: Segmentation components are chosen by matching classes of a predefined alphabet.

• Holistic methods: Instead of segmenting symbols the whole word is recognized, thus sidestepping individual character recognition.

For classical methods it is common to use heuristics to locate good segmentation points [6]. These can be more or less sophisticated; examples include using histograms or character properties such as average character width or contour extraction. A problem with classical methods such as explicit segmentation is that the decision made for a segmentation point is not local and can affect future decisions. In figures 2.5 and 2.6 we can see how an incorrect segmentation decision can affect the segmentation of subsequent characters.

Figure 2.5. Example of a captured word image before explicit segmentation. Image taken from [20].

Figure 2.6. Example of different results after performing explicit segmentation. Image taken from [20].

For recognition-based approaches a lot of research points in the direction that segmentation cannot be done in isolation and must instead be carried out in conjunction with recognition [6, 8, 3]. This leads to an interesting paradox, namely: it is necessary to segment in order to recognize, but it is also necessary to recognize in order to segment. One way of overcoming this is discussed in [33]. Unfortunately it is believed that to create better HWR systems, segmentation cannot be carried out in isolation but must instead be done as an integrated part of the recognition. Combining segmentation and recognition leads to a completely different type of system than the main topic of this report and uses a different set of tools compared to single symbol recognition. Popular techniques that have provided good results include the use of HMMs and GTNs.

An example of how simultaneous segmentation and recognition can be achieved is the sliding-window approach, illustrated in figure 2.7. After localization finds a possible word chunk, the word is split into several small strips which are then analyzed. Using statistical tools, the start and end of a single character can be identified: the system can conclude that some series of strips is very likely to correspond to a character. When a character is identified it also marks a segmentation point; thus segmentation and recognition are performed simultaneously. A popular probabilistic model that can be used for this type of operation is the HMM. The HMM technique is a huge topic in itself and out of scope for this report. The interested reader is referred to [20, 7].

Figure 2.7. Example of sliding-window technique. Image chunks extracted from a single word. Image from IAM-OnDB.

2.3.5 Features and Feature Selection

Features are used to make the classification of a sample easier. Understanding what features are, how they are constructed, and how the selection process works is of high importance. Which features to use when classifying data is important for a deeper theoretical understanding of pattern recognition, but we will also see how selecting good features for CR is cumbersome work and how we can circumvent this.

What is a Feature?

In the case of CR, as with many problems related to pattern recognition, a fundamental problem is to find a measurement or function that can describe the data. Given some input data, an extracted feature is a number or vector used as an indicator of a set of predefined characteristics that are present in the input. This number, or vector, is referred to as a feature. Examples of features that can be extracted from an image of a symbol are: histograms, color intensity, number of strokes, average distance from the image center, etc. Feature vectors can be treated as random variables. This is natural, as the measurements resulting from different patterns exhibit random variation. For many methods used in machine learning it is of high importance to extract good features that give high class separability. The features we are interested in are those that constitute the invariants which make class separation possible. Finding such invariant features in the case of symbol recognition is considered hard, since the structure of an arbitrary symbol is not known [1].
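To make the notion of a feature concrete, the sketch below computes one simple hand-crafted feature, a row-wise ink histogram of a binary symbol image. It is an illustration only, not a feature used by the prototype; the function name and the convention that 1 marks foreground are our own assumptions.

```lua
-- Illustrative hand-crafted feature: a row-wise "ink" histogram of a binary
-- symbol image (not a feature used by the thesis prototype).
require 'torch'

-- img is assumed to be a 2D tensor where 1 marks foreground (ink) and 0 background.
local function rowHistogram(img)
  local rows = img:size(1)
  local feature = torch.Tensor(rows)
  for r = 1, rows do
    feature[r] = img[r]:sum()   -- number of ink pixels in row r
  end
  return feature                -- a fixed-length feature vector
end

-- Example usage on a random binary 28x28 image.
local img = torch.rand(28, 28):gt(0.5):double()
print(rowHistogram(img))
```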

Selecting good Features

A feature can be almost any type of function calculating some desired property, and many features (dozens to hundreds) could be available for various objects. An important question is which ones to include in the classification process. Not all combinations are good, since some features have a high mutual correlation, and if too many features are used the classifier will generalize worse. The central assumption behind feature selection techniques is that the data contains many redundant or irrelevant features. Selecting which features to use can be carried out algorithmically: a feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different subsets. Selecting a subset of features provides three main benefits when constructing predictive models:

• Improved model interpretability

• Shorter training times

• Enhanced generalization by reducing over-fitting

It is widely reported in the literature that methods known as wrapper methods tend to be superior as a feature selection technique [63]. Other types can be characterized as either filter or embedded methods [32]. A wrapper method works (in the simplest case) by starting with an empty set of features, adding a candidate feature, testing how well this set performs on the dataset, keeping the feature if the set performs better than before, and testing again. This is repeated, and the search terminates when an acceptable error level or the desired number of features is reached. In practice, for a finite set of candidate features, an initial improvement in performance is obtained as more of them are used, but increasing the set further might result in an increased probability of error. This phenomenon is also known as the “peaking phenomenon” [66, p. 267].
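The sketch below illustrates this greedy forward (wrapper) selection under stated assumptions: candidates is a list of candidate features, and evaluate(subset) is a hypothetical helper that trains a classifier using only those features and returns its validation error (lower is better).

```lua
-- Sketch of greedy forward (wrapper) feature selection. `candidates` and
-- `evaluate` are assumed, hypothetical inputs; lower error is better.
local function forwardSelection(candidates, evaluate, maxFeatures)
  local selected, used = {}, {}
  local currentError = evaluate(selected)        -- error with no features at all
  while #selected < maxFeatures do
    local bestIndex, bestError = nil, currentError
    for i, feature in ipairs(candidates) do
      if not used[i] then
        local trial = {unpack(selected)}         -- copy of the current subset
        table.insert(trial, feature)
        local err = evaluate(trial)              -- train and test with this subset
        if err < bestError then bestIndex, bestError = i, err end
      end
    end
    if bestIndex == nil then break end           -- no remaining feature improves: stop
    used[bestIndex] = true
    table.insert(selected, candidates[bestIndex])
    currentError = bestError
  end
  return selected, currentError
end
```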

The process of creating features and going through a selection process that gives the highest class separability can quickly turn out to be complicated and error prone. There are ways to circumvent this by simply telling the computer “we don’t know what to look for, you figure it out!”, which will be explained in section 3.2.3.

2.4 Problem Reduction Techniques

Decreasing the size of the problem domain often helps in increasing the accuracy of HWR systems. Some techniques for problem reduction that can be applied are listed below. None of these actually solves the problem of handwriting recognition completely, but they can help decrease error rates. The techniques listed will not be investigated further but are left here as a reference.

• Specifying allowed symbol ranges.

• Utilizing some form of dictionary (n-grams are especially popular in the literature [9]).

• Input is captured from specialized forms.

• Train a new classifier for each new user (usually not relevant since training takes time and the amount of data needed is usually not available).

2.5 Description of a Complete HWR System

We now give an overview of how the different parts of an HWR system are connected. Figure 2.8 is a graphical description of the dataflow of an HWR system in the form of a block diagram. This type of diagram is similar to what is described in [55] and [58]. The different blocks in the diagram constitute a separation into subsystems, each of which could be developed individually using various approaches. The output from one submodule should be seen as input to the next.

The input to the system is assumed to be a captured image of a document. The document is further assumed to contain arbitrary information, such as a mixture of text and figures. Since we are interested in recognizing text we first need to locate the text paragraphs. Efficient methods for text localization are given in [70]. Once a paragraph has been located, lines of text need to be extracted. Popular methods for line extraction are presented in [39]. Following the discussion on segmentation given in section 2.3.4, we then need to choose whether to perform explicit or implicit segmentation. Independent of the choice the result is the same: one or more recognition hypotheses.

[Figure 2.8 is a block diagram of the pipeline: input image of document → localization → line extraction → pre-processing (baseline correction, slant and size normalization) → word detection → recognition (either simultaneous segmentation and recognition, or explicit segmentation followed by character recognition) → optional post-processing with language models (dictionary, n-grams) → recognition hypotheses.]

Figure 2.8. A Hypothetical HWR System Pipeline.

Chapter 3

Theory

In this chapter a theoretical introduction to neural networks will be presented, focusing on concepts related to a specific network type called convolutional neural networks. For single digit recognition implemented on a smartphone, convolutional neural networks were found to be the most efficient model with regard to low error, as explained previously in section 2.2. Why this is a suitable classifier will be further discussed and motivated in this chapter.

3.1 Choice of Symbol Classifier

When choosing a suitable classifier to be used on a smartphone several factors must be taken into consideration. The most important factors are classification speed, performance, and ease of implementation. High performance in this setting is defined as low classification error rates. Since machine learning techniques are normally divided into distinct training and classification phases, only classification speed was deemed important, because the training can be accomplished separately on any desktop PC. Deciding which classifier to use was highly dependent on finding libraries that could be used to implement it on the target platforms. The final choice was highly affected by the discovery of a library called Torch, which supports the creation of arbitrary network structures. The comparison of libraries can be found in section 4.1.

The best results reported for symbol recognition, as indicated in table 2.1, are obtained using neural networks. A Convolutional Neural Network (ConvNet) is a newer approach to network construction which builds upon a concept called deep learning. Results reported in [41, 42, 36] indicate that deep learning has achieved remarkable success. As discussed in section 2.3.5, creating good features to use is hard and error prone. The use of ConvNets avoids this restriction and still performs well: the use of deep learning is what makes ConvNets independent of feature crafting, and why will be explained throughout this chapter. For these reasons, a ConvNet was chosen as the classifier to be used in the prototype.

3.2 Artificial Neural Networks

In 1943, W. McCulloch and W. Pitts investigated how a computational model of the brain's nervous activity could be constructed. The result was a mathematical model of a neuron: a binary device with fixed threshold logic capable of imitating the functionality of simple electrical circuits [44]. An Artificial Neural Network (ANN) is a network of simple processing elements operating on local data in isolation while communicating with other elements [44]. Motivated by the structure of the human brain, ANNs draw inspiration from it but have moved far from their biological relative. The most basic building block of the brain is the nerve cell, which does all the processing. These basic constructs are called neurons, both in the brain and in our equivalent abstract simulation [43]. An artificial neuron, also referred to as a perceptron, is normally described as follows (using the notation described in table 3.1):

• There are n + 1 inputs with signals x0 through xn

• Each input has a weight w0 through wn.

• An activation function determines if the neuron should “fire” or not (produce some output) depending on the given input.

x_j^ℓ    Input to node j of layer ℓ
w_ij^ℓ   Weight from layer ℓ − 1 node i to layer ℓ node j
σ(x)     Activation function
θ_j^ℓ    Bias of node j of layer ℓ
O_j^ℓ    Output of node j in layer ℓ
t_j      Target value of node j at the current layer

Table 3.1. Mathematical notation used when describing neural network algorithms.


Figure 3.1. Visual model of the McCulloch and Pitts mathematical model of a neuron. The inputs xi are multiplied by their respective weights wi and then summed. If the activation function gives an output higher than some threshold the neuron fires, otherwise it does not.

Usually the x0 input is referred to as a bias unit and is assigned the value +1. The illustration given in figure 3.1 is a helpful description of equation 3.1, which is the mathematical definition of the neuron model.

y_k = \sigma\left( \sum_{j=0}^{n} w_{kj} x_j \right) \qquad (3.1)

By connecting a vast number of neurons into an interactive nervous system, or network, it is possible to realize very advanced types of functions. In principle neural networks have the ability to realize any type of function mapping from one domain to another [52, 40]. The most popular neural network model is the so-called Multi Layer Perceptron (MLP) model1. A network is ordered into several layers: the first is called the input layer, the last layer is called the output layer, and the layers in between are referred to as hidden layers.

1Also known as Multi-Layer Feed-forward Network (MLF)

• Input Layer: A vector of input values (x0...xn); this usually corresponds to the set of features used or to the raw input from a sample. In the case of images the input neurons can be mapped to pixel values. There is an additional constant neuron in each layer, called the bias, which is used for normalization.

• Hidden Layer(s) [one or more]: The values arriving at a neuron in a hidden layer from each input neuron are multiplied by a weight (wkj), summed up, and used as input to the activation function. The output value, yk, is used as input for consecutive layers.

• Output Layer: The output layer goes through the same process of weight multiplication and summation as the hidden layers, but the data output by the last layer is treated as the result of the classification. It is up to the designer of the network to make an interpretation of this data.

Each layer of neurons feeds only the very next layer and receives input only from the immediately preceding layer, as seen by the connections illustrated in figure 3.2.


Figure 3.2. Overview of a simple neural network model.

The data used as input for the first layer are typically the feature vectors obtained as described in section 2.3.5, and the output is a vector interpreted as a classification. The strength of neural networks rests upon three properties:

• Adaptiveness and self-organization

• Non-linear network processing

• Parallel processing

Regarding adaptiveness and self-organization, it can be shown that a network can change its behavior while learning, adapting to changes in the training data. Using more complex network structures it is possible to form arbitrary decision boundaries, and parallelization is inherent in networks since they are built from many independent parts [22].

Feed-forward Algorithm

Executing the feed-forward algorithm on a network is what constitutes making a classification: given a set of input features, the resulting output is the answer to which class the input sample belongs. The algorithm works as follows. Each connection receives its input from the previous layer or, in the case of the first layer, from the current sample xj. Every connection is associated with a weight wi which reflects the degree of importance of that connection. The output value of the ith perceptron is determined by multiplying each incoming value by its associated weight, summing up, and applying an activation function, as shown in equations 3.2 and 3.3. The threshold coefficient (a constant simply set to “1” in figure 3.1) is a so-called bias term and is used for normalization (in equation 3.3 the bias is denoted θ).

O_i = \sigma(\xi_i) \qquad (3.2)

\xi_i = \theta + \sum_{j} w_{ij} x_j \qquad (3.3)

Several of these basic units arranged in a layer, and several connected layers, form a network. Each of these units can be trained, and training a network of basic units will make the network mimic the behavior of various functions. That is, it calculates output = f(input) for some arbitrary function f by sending the input to the first layer, executing the feed-forward algorithm, and reading the output of the last layer.
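As a minimal sketch of the procedure just described, the plain Lua function below performs one feed-forward pass following equations 3.2 and 3.3. The data layout (a list of layers, each a list of neurons holding weights and a bias) and the function names are our own illustrative assumptions, not the thesis implementation.

```lua
-- Feed-forward pass for a fully connected network (equations 3.2 and 3.3).
-- layers[l][i] is assumed to hold { weights = {...}, bias = θ } for neuron i of layer l.
local function sigmoid(x) return 1 / (1 + math.exp(-x)) end

local function feedForward(layers, input)
  local activation = input
  for l = 1, #layers do
    local output = {}
    for i, neuron in ipairs(layers[l]) do
      local xi = neuron.bias                 -- ξ_i starts at the bias θ
      for j, w in ipairs(neuron.weights) do
        xi = xi + w * activation[j]          -- ξ_i = θ + Σ_j w_ij x_j
      end
      output[i] = sigmoid(xi)                -- O_i = σ(ξ_i)
    end
    activation = output                      -- output becomes input of the next layer
  end
  return activation                          -- output of the last layer = classification
end
```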

Backpropagation Algorithm

The best-known and simplest training algorithm for feed-forward networks is backpropagation, a classic steepest descent method [57]. This section will only provide a brief introduction to the topic; for a more extensive description see [57, 62]. Training of a network consists of making small adjustments to its weights. Training begins by initializing all weights to small random numbers. It then proceeds iteratively, making a gradient descent step by calculating partial derivatives of the error (the difference between the ideal and actual outputs) made in each layer and updating the weights accordingly. The errors are calculated by starting with the output layer (right-most) and iterating one layer at a time moving towards the left; computing the values for one layer depends on the values computed for the layer after it, and errors are computed one layer at a time.

The name backpropagation comes from the fact that it is only possible to calculate the error directly for the last layer; errors for the preceding layers are calculated by “pushing” the error backwards. Iterating through a complete training set is called one epoch, and in each epoch the weights are modified in the direction that reduces the error. There are many types of activation functions used in neural networks, but we assume a sigmoid function here since it has a nice derivative. The derivative of the sigmoid function is:

\sigma'(x) = \sigma(x)\,(1 - \sigma(x)) \qquad (3.4)

Given a set of training data points tj the process is started by doing a feed-forward pass for some tj. We can calculate the output error, Ek, the network makes as:

E_k = (O_k - t_k)\,\sigma'(O_k) \qquad (3.5)

After the error for each neuron in the output layer has been calculated, the procedure moves backwards, one hidden layer at a time. The error for a neuron in a hidden layer is calculated as the sum of the products of the errors of the neurons in the next layer and the weights connecting to them. Since the gradient points in the direction of the greatest rate of increase of a scalar field, we subtract the weight change in the update stage, because we are trying to find the minimum by gradient descent. The weights between layer ℓ and ℓ − 1 are then updated by subtracting the product of the calculated error in layer ℓ and the output of layer ℓ − 1, where the output and error values are taken from the neurons connected by weight i, j.

\Delta w = -\lambda\, E^{\ell} O^{\ell-1} \qquad (3.6)

w_{ij} = w_{ij} + \Delta w_{ij} \qquad (3.7)

The λ in equation 3.6 is called the learning rate; it is usually a small number between 0 and 1 and determines how fast the network converges. Larger values can cause the weights to change too quickly and can actually cause the weights to diverge rather than converge. Some literature suggests determining the learning rate as a function of the current epoch, decreasing the learning rate continuously as more epochs are completed.
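The sketch below shows one such update for a single output neuron, following equations 3.4 to 3.7. It is a simplified illustration under our own assumptions (plain Lua tables, one neuron at a time), not the training code used for the prototype.

```lua
-- One backpropagation update for a single output neuron (equations 3.4-3.7).
-- `output` is O_k from the forward pass, `target` is t_k, `inputs` holds the
-- outputs O^(l-1) of the previous layer and `weights` the weights into the neuron.
local function sigmoidDerivative(o) return o * (1 - o) end   -- eq. 3.4, with o = σ(x)

local function updateOutputNeuron(weights, inputs, output, target, learningRate)
  local err = (output - target) * sigmoidDerivative(output)  -- E_k, eq. 3.5
  for j = 1, #weights do
    local delta = -learningRate * err * inputs[j]             -- Δw, eq. 3.6
    weights[j] = weights[j] + delta                           -- eq. 3.7
  end
  return err  -- propagated backwards to compute the hidden-layer errors
end
```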

3.2.1 Network Structure and Network Training

Deciding on the structure of a network can have a crucial impact on its performance. The problem of network structuring (also called model selection) is deciding how many hidden layers, and how many units per layer, should be used. Many authors offer 'rules of thumb' for such decisions; the more popular include:

20 • The number of hidden neurons should be between the size of the input layer and the size of the output layer.

• The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.

• The number of hidden neurons should be less than twice the size of the input layer.

Theoretical limits on the number of neurons needed have been proved in [31], but in practice no general conclusions on how to choose an optimal network topology have been reached. The decision is highly dependent on the complexity of the input data or the function you want to imitate. Choosing the right network structure often starts out with best-practice rules which are then further refined. Common approaches are based on variable addition and pruning of nodes: normally an initial structure is chosen, and by repeating a procedure of training, testing, and modification on different network sizes, changes in performance can be measured. Choosing which structure to use is then a matter of desired performance. The first problem faced in model selection is how many hidden layers to use. When choosing the number of hidden layers the list below can act as a guide, reproduced here as described by Jeff Heaton [25]. The choice of network topology is also important for how well the network generalizes to previously unseen data: making the structure too complex might result in overfitting, while making it too simple might result in underfitting [62].

• Zero hidden layers: Only capable of representing linear separable functions or decisions.

• One hidden layer: Can approximate any function that contains a continuous mapping from one finite space to another.

• Two hidden layers: Can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any accuracy.

Training time also affects the performance of the classifier. One method of dynamically optimizing the training is early stopping. Train too little and the network will not learn the desired function or pattern; train too much and the result is bad generalization. The error over time for training and testing is illustrated in figure 3.3. The principle of early stopping is to divide the data into two sets: one training set and one validation set. While training, the error made by the classifier on the validation set is periodically calculated, and training is stopped when that error starts increasing. Unfortunately this is not necessarily a good estimate of generalization, since the validation set still influences the training. A better way is to further divide the dataset into a third set, which is totally separated, ideally has no relation to the training data, and is not used during training. The disadvantage of this approach is that it reduces the amount of data available both for training and validation [62].
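A sketch of the early stopping loop is given below. The helpers trainOneEpoch and validationError are assumed, hypothetical functions (train the model for one epoch, and compute the error on the held-out validation set), and the patience parameter, which tolerates a few worsening epochs, is our own addition.

```lua
-- Sketch of early stopping: train one epoch at a time and stop when the
-- validation error starts to rise. `trainOneEpoch` and `validationError`
-- are assumed helper functions, not part of any particular library.
local function trainWithEarlyStopping(model, trainSet, validationSet, maxEpochs, patience)
  local bestError, badEpochs = math.huge, 0
  for epoch = 1, maxEpochs do
    trainOneEpoch(model, trainSet)
    local err = validationError(model, validationSet)
    if err < bestError then
      bestError, badEpochs = err, 0            -- still improving: keep going
    else
      badEpochs = badEpochs + 1                -- validation error went up
      if badEpochs >= patience then break end  -- stop before over-fitting sets in
    end
  end
  return bestError
end
```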

Figure 3.3. Plot of network performance over time. Training too much results in bad generalization and the calculated error for the validation set increases. Image from Willamette University.

3.2.2 Deep Learning of Representations

Most machine learning and pattern recognition systems rely on the user/programmer to provide some knowledge about the problem instance. The machine can then use this knowledge to learn about patterns inherent in the training data. Knowledge in this case means the feature vectors calculated for an input sample. In other words, we rely on a priori explicit knowledge about a set of objects in order to gain knowledge about a set of similar objects. As mentioned previously, finding a general structure in handwritten letters or symbols is an open research problem, and with that, trying to find a correct set of features to use for training a recognizer is currently not considered possible. So if we cannot find or describe the correct set of features, maybe we can get that knowledge by observing the world around us? Machine learning is not about learning how to exactly classify already seen data (the training set) but about learning how to generalize from seen samples to samples not seen before. Learning is about guessing and then adapting in order to be able to make a better guess in the future. What would happen if we accepted the fact that we cannot make the best decision on what to look for and instead handed the entire problem over to the computer, letting our algorithms themselves discover the underlying causes or factors that explain the data? Put another way, what we are trying to do is to let the computer discover what we do not know and use these new discoveries to classify data.

Deep learning is a concept popularized in 2006 by Geoffrey E. Hinton [26]. Neural networks are modeled into layers corresponding to distinct levels of concepts, where higher-level concepts are defined from lower-level ones, and the same lower-level concepts can help to define many higher-level concepts. Techniques based on deep learning are in active use today at Microsoft and Google, who for instance created a deep learning neural network that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images taken from YouTube videos [41, 42, 36]. There is theoretical evidence for multiple levels of representation, and evidence from brain research shows that humans first learn simple concepts and then compose them into more complex ones. It is hypothesized that the visual cortex functions as a generic algorithm built in layers of how we perceive the world [4]: it extracts features about an object or representation at different, increasing, levels of abstraction [59]. Ranzato et al. (2007) suggest that the brain learns by first extracting edges, then patches, then surfaces, then objects and so on. Dividing recognition into different levels of understanding is what inspired the so-called convolutional neural networks.

3.2.3 Convolutional Neural Networks

Hubel et al. [27] showed that there exists a complex arrangement of cells within the visual cortex. These cells are sensitive to small sub-regions of the input space, called receptive fields, which together cover the entire visual field. Two types of basic cells have been identified: simple cells, which respond maximally to specific edge-like stimulus patterns, and complex cells, which are locally invariant to the exact position of the stimulus. The combination of these types of cells is what makes up human vision, and they are well suited to exploit the strong spatially local correlation present in images. ConvNets are a form of MLP feed-forward network with a special architecture inspired by biology, specifically constructed to learn about features at multiple levels. ConvNets can be viewed as a multi-modular architecture with trainable components arranged in multiple stages. For completeness it might be interesting to mention that other types of models can be found in the literature, such as the NeoCognitron and HMAX [23, 59]. As shown in [60], convolutional neural networks achieve the best performance in handwriting recognition tasks to date, and this serves as enough motivation for investigation.

Structure of a Convolutional Network

The input and output of each stage consist of so-called feature maps, a set of multi-dimensional arrays which form the model's equivalent of the visual cells explained in the previous section. Depending on the input source in the problem domain, such an array would be a 1D array for audio input, a 2D array for an image (corresponding to a color channel), or a 3D array if the input source is a video (where the additional dimension could correspond to time). A network is divided into several stages wherein each stage is composed of three layers: a filter bank layer, a non-linearity layer, and a feature pooling layer. A typical ConvNet is composed of one, two or three such 3-layer stages, followed by a classification module. Each layer type is now described for the case of image recognition. For details on the inner workings of ConvNets and a comparative overview we refer to the work by Dr. Yann LeCun and Dr. Patrice Simard [35].

Figure 3.4. A typical two stage ConvNet. Image from Stanford VISTA Lab.

Filter layer A convolutional layer consisting of several feature maps. Each feature map is the result of a convolution between the input image and a kernel. The kernel used for the convolution is initialized with random values when a new training session is started and its values are modified during the training phase to more efficiently extract interesting features.

Non-Linearity Layer The purpose of this layer is to limit the output to some more reasonable range. Typically some sort of squashing function is used, similar to the activation function in an ordinary MLP.

Pooling Layer The pooling layer down-samples its input by making it k times smaller, where k is some arbitrary constant. Common down-sampling methods include taking the average of some region; normally this is achieved by taking the average of a neighborhood of pixels, in effect producing a smaller image. The result of a pooling layer is that superior features are elicited, which leads to faster convergence and better generalization [49]. These layers exist to summarize the results of previous layers and extract the most contributing parts of the result of a convolution, making features more location invariant.

Classification Layer The last layer normally consists of a shallow MLP. This final layer is what makes the actual classification using the features extracted from previous layers.
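To make the stage structure concrete, the sketch below defines a small two-stage ConvNet for 28x28 MNIST-style images using Torch's nn package (the library chosen in chapter 4). The number of feature maps, kernel sizes and activation functions are illustrative choices, not the hyperparameters of the thesis prototype.

```lua
-- Sketch of a two-stage ConvNet for 28x28 images using Torch's `nn` package.
-- Layer sizes are illustrative, not those of the actual prototype network.
require 'torch'
require 'nn'

local net = nn.Sequential()
-- Stage 1: filter bank -> non-linearity -> pooling
net:add(nn.SpatialConvolution(1, 8, 5, 5))   -- 1 input plane, 8 feature maps, 5x5 kernels
net:add(nn.Tanh())                           -- squashing non-linearity
net:add(nn.SpatialMaxPooling(2, 2, 2, 2))    -- down-sample by a factor of 2
-- Stage 2
net:add(nn.SpatialConvolution(8, 16, 5, 5))
net:add(nn.Tanh())
net:add(nn.SpatialMaxPooling(2, 2, 2, 2))
-- Classification module: a shallow MLP over the flattened feature maps
net:add(nn.Reshape(16 * 4 * 4))
net:add(nn.Linear(16 * 4 * 4, 10))           -- 10 output classes (the digits 0-9)
net:add(nn.LogSoftMax())

local sample = torch.rand(1, 28, 28)         -- stand-in for one grey-scale input image
print(net:forward(sample):exp())             -- per-class probabilities
```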

Mathematical Convolution

To give some more insight into what kind of features the filter layers of ConvNets can learn to extract, and how this helps in doing classification, a short introduction to mathematical convolution will now be presented. Convolution is a mathematical operation described by the function:

(f * g)(t) \;\stackrel{\text{def}}{=}\; \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau \qquad (3.8)

Convolution of an image corresponds to the input image f and a kernel g [18] being combined according to equation 3.8. Convolution can be thought of as sliding the kernel (a small matrix) over the image, multiplying the pixels of the image that lie under the kernel entries, and then summing. The result of this operation for one pixel in the input image is the new value of this pixel in the output image. As an example, a simple method for edge detection in an image is to calculate the derivative of color intensity for each position by convolution. The result of convolving an image with an edge detection kernel known as the Laplace operator is shown in figure 3.5 below.

Figure 3.5. Edge detection using convolution.
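A minimal sketch of this operation is given below, convolving an image with a Laplacian kernel as in figure 3.5. It uses Torch's torch.conv2 for the 2D convolution; the random input image is just a stand-in for a real sample.

```lua
-- Edge detection by 2D convolution with a Laplacian kernel (cf. figure 3.5).
require 'torch'

local laplace = torch.Tensor({{ 0,  1, 0 },
                              { 1, -4, 1 },
                              { 0,  1, 0 }})

local img = torch.rand(28, 28)           -- stand-in for a real grey-scale image
local edges = torch.conv2(img, laplace)  -- "valid" convolution: output is 26x26
print(edges:size())
```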

Advantages of Convolutional Neural Networks

A neural network works as a black-box universal approximator and achieves its best performance if the data you are modeling has a high tolerance to errors. The scenarios for which NNs work best are:

• When you want to discover regularities or associations within a dataset.

• When volume, number of variables or diversity in the dataset is large.

• When the relationships between variables are hard to describe adequately.

The different types of layers in a convolutional network combine three architectural ideas to ensure some degree of shift, scale, and distortion invariance: local receptive fields, shared weights (or weight replication), and spatial or temporal sub-sampling. They have been designed especially to recognize patterns directly from digital images with a minimum of pre-processing operations. ConvNets are still quite new and not widely explored, but as pointed out, the research around deep learning is gaining more and more recognition [41].

As a concluding note, it is really interesting to emphasize that, compared to traditional methods for classification, with a ConvNet structure you never tell the network what structures to look for; no data on what constitutes a certain symbol is ever given before the network starts its analysis [61]. During training the network forces itself to discover recurring structures. All in all, the network basically invents its own concept of what a symbol is.

Chapter 4

Results

In this chapter the results of building a prototype application for symbol recognition running on a smartphone will be presented. It is important not to reinvent the wheel; hence a comparison of libraries that could be used for neural network development will be presented first. The result of the comparison is that a library called Torch meets the requirements stated in section 1.4. A convolutional neural network was trained using Torch and then evaluated for speed and classification performance using the constructed prototype.

4.1 Comparison of Neural Network Libraries

Numerous libraries for building neural networks exist. Many candidate libraries were found and some were rejected from further investigation for various reasons, mainly that they did not fulfill the initial requirements described in section 1.5. The need for cross-platform support (as a reminder, the target platforms were iOS and Android) was the top priority when searching for possible packages to use. This requirement ruled out many popular candidates due to being written either in Python (not supported on iOS at the time of writing) or Java (also not supported on iOS). Some libraries that passed the initial screening were later dismissed due to not being considered mature enough or not being actively maintained. The final list of libraries considered for evaluation that met the stated requirements includes: FANN, OpenCV, OpenNN, NNFpp, Libann, and Torch. Each of them will be presented below. The result of the comparison was that Torch was the most suitable library to use. For obvious reasons Torch will also be given a more thorough presentation.

4.1.1 Investigated Libraries

FANN (Fast Artificial Neural Network Library) is a free open source neural network library which implements multilayer artificial neural networks in C, with support for both fully connected and sparsely connected networks. This library was chosen as a candidate because it is written in pure C without external dependencies (thus cross-platform compilation is possible for iOS and Android). It is small in size and has a simple API. FANN has been reported to have been used successfully in gesture recognition together with the Microsoft Kinect system1. FANN supports the MLP network structure and networks can be trained using various algorithms.

OpenCV (Open Source Computer Vision Library) is a library of programming functions mainly aimed at real-time computer vision, originally developed by Intel. Besides being a library with a rich set of image manipulation functionality, OpenCV also contains various machine learning algorithms, neural networks being one of them. OpenCV was considered a candidate due to its extreme popularity, and cross-platform binaries exist for both iOS and Android. OpenCV is also used internally at Bontouch for various other projects, which would potentially make it easier for other developers familiar with OpenCV to use code developed in this project if that would ever be the case. The neural network support in OpenCV is very basic and only covers naive MLPs trained using simple backpropagation.

OpenNN (previously known as “Flood”) The neural network implemented in OpenNN is based on the multilayer perceptron. That classical model of neural network is also extended with scaling, unscaling, bounding, probabilistic and conditions layers [37].

NNFpp Similar in functionality to the previously mentioned libraries, the strength of NNFpp is its pure object oriented implementation as a small set of C++ classes. After passing the initial screening it was discovered that the latest update stems from February 2007 and the project is probably discontinued.

Libann is another library, written in C++ using the STL. Differentiating features include support for multi-layer perceptron networks, Kohonen networks, a Boltzmann machine, and a Hopfield network. Contrary to what is suggested on the homepage, Libann is not being actively maintained, with the latest release dating back to February 2004 (probably discontinued).

Torch Torch provides a Matlab2-like environment for machine learning algorithms. It is built using a combination of C and a scripting language called Lua. The advantage of Torch is that it is highly customizable and can be tuned at a fine-grained level [12]. Torch also includes support for CUDA3, which offers dramatic increases in computing performance when training.

1http://leenissen.dk/fann/wp/2011/05/kinect-neural-network-gesture-recognition/ 2http://www.mathworks.se/products/matlab/ 3https://developer.nvidia.com/what-cuda

4.1.2 Evaluation

Testing of the different libraries was carried out in an iterative fashion, partly learning more about neural network design and partly identifying features and limitations of the various libraries tested. As a way of learning how to use the different libraries, the first test was to create a network that resembles the XOR function (a small sketch of such a network, using Torch, is given at the end of this section). This was chosen as a test since it is a classic example for neural networks, being the first function proved impossible to implement with a single-layer perceptron, as shown by Marvin Minsky and Seymour Papert [45].

Starting the evaluation procedure it was discovered that all three of OpenNN, NNFpp, and Libann could be ruled out due to not being actively maintained or not being considered mature. This requirement stems from the fact that, in the ideal case, software developed in this project might end up being used in a commercial system. Using immature or not actively maintained code is considered bad practice.

The next step of the evaluation was to investigate the support for building convolutional neural networks. Both FANN and OpenCV are only able to implement MLPs. An attempt was made to extend either of these two to support the ConvNet structure, but implementing the convolutional step was deemed too time consuming, since it would involve rewriting layer construction and training procedures as well as implementing a general convolutional operator.

Torch is partly written in Lua, which can be integrated into normal C/C++ programs and has extremely good performance [12]. Torch also has support for ConvNets. The drawback of using Torch would be that it would introduce a new language not used by the client previously, something that is preferably avoided. On the plus side, Torch is considered very stable, with a carefully crafted API, and is under active development. Lua is a small language and generally considered very easy to learn [51]. The conclusion is that using Torch would be the most beneficial choice due to its high performance, the possibilities for rapid development that Lua offers, and the fact that it is actively developed and rich in features.
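Below is a minimal sketch, assuming Torch's nn package, of the kind of XOR test network mentioned above: two inputs, a small hidden layer, and one output trained with plain gradient descent. The hidden-layer size, learning rate, and number of epochs are arbitrary choices; depending on the random initialization the training may occasionally need more epochs to converge.

    -- XOR test network (sketch; assumed Torch7 nn API, not the exact evaluation code).
    require 'nn'

    local mlp = nn.Sequential()
    mlp:add(nn.Linear(2, 4))      -- 2 inputs, 4 hidden units
    mlp:add(nn.Tanh())
    mlp:add(nn.Linear(4, 1))      -- 1 output

    local criterion = nn.MSECriterion()
    local inputs  = torch.Tensor({{0, 0}, {0, 1}, {1, 0}, {1, 1}})
    local targets = torch.Tensor({{0}, {1}, {1}, {0}})

    for epoch = 1, 2000 do
      for i = 1, 4 do
        local x, y = inputs[i], targets[i]
        criterion:forward(mlp:forward(x), y)
        mlp:zeroGradParameters()
        mlp:backward(x, criterion:backward(mlp.output, y))
        mlp:updateParameters(0.2)                  -- plain gradient descent step
      end
    end

    for i = 1, 4 do
      print(inputs[i][1], inputs[i][2], mlp:forward(inputs[i])[1])
    end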

4.2 Lua

Lua is a multi-paradigm programming language designed to be extremely lightweight and intended as a dynamic scripting language. Lua is fully ANSI C compatible and implemented as a C library. This makes Lua highly portable (cross-platform) and very easy to use as an embedded language. Rather than providing a feature-rich environment, Lua is designed to be extended easily; the language core is so light that about 180 kB is enough to contain a full reference interpreter [14]. Lua is dynamically typed (essentially typeless), runs by interpreting byte-code, and uses garbage collection for automatic memory management. For these reasons Lua is ideal for configuration, dynamic data exchange, and rapid prototyping. Lua is also very popular among developers and the most used scripting language in games [17]. Using a sophisticated just-in-time compiler, Lua is faster than most popular scripting languages by an order of magnitude and is in many cases comparable to the speed of C [38].

Lua is open-source under the MIT license, making it free to use without limitations [65]. The cross-platform possibilities range from being embeddable in C/C++, Objective-C, Java, C#, Smalltalk, Fortran, Ada, Erlang, and even in other scripting languages, such as Perl and Ruby. Lua supports numerous platforms, including smartphone platforms such as Android, iOS4, and Windows Phone. For a developer new to Lua it is also very easy to learn the basics, and a developer can start developing Lua programs in just 15 minutes without any prior knowledge [51]. Lua can also be made very secure through sand-boxing techniques and has proven to be a good choice in many systems available today [2, 46].

4.3 Torch

The advantages of using Torch are threefold:

1. It is easy to develop numerical algorithms.

2. It is easy to extend (including the use of other libraries).

3. It is fast [12].

Using Torch it is possible to implement arbitrary types of network structures, and development is very rapid thanks to the dynamic scripting nature of Lua. Torch relies on an object oriented model for neural network creation, letting the user experiment with different network structures, activation functions, and training procedures - anything a user wants - through a simple interface, without the need to care about details if not desired. Also, as a benefit of being based on Lua, Torch is easy to extend, or embed, using either libraries written in Lua or in C (and its derivatives), thanks to Lua's transparent C interface. Torch is not only fast, it is the fastest machine learning library available to date when compared with alternative first class implementations including Matlab, EBLearn and Theano [12, 13, 5]. This high performance is obtained via efficient usage of OpenMP/SSE and CUDA for low-level numeric routines.

4.4 Prototype

All the technology used to create the symbol recognition system was chosen to be platform agnostic. In order to test this hypothesis a prototype for Apple iOS was implemented. iOS apps are built using Objective-C, which is a superset of C. Since Lua is C compliant, the target platform was chosen because

4Apple previously put hard restrictions on using scripting languages on iOS but these are now somewhat relaxed [21]

of this easy bridging of code between Lua and iOS. For Android based systems it is possible to include C code in Java using the Java Native Interface, JNI [15]. The speed performance of the various iPhone models released varies greatly [47]. Based on recent internal reports from Bontouch, more than 80% of its customers in Sweden use an iPhone 4 or a more recent model, and hence the iPhone 4 was chosen as the target device. In this section I will give a more exhaustive explanation of the implemented prototype.
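To give an idea of how this bridge can look, the following is a hedged sketch of the Lua side only: a script (the file name classify.lua and the serialized network file are hypothetical) that the Objective-C code could load and call through the standard Lua C API (for example luaL_dofile, lua_getglobal, and lua_call); the native side is not shown, and the 32 x 32 input size is an assumption matching the structure sketch in chapter 3.

    -- classify.lua: sketch of the Lua side of the bridge (hypothetical file names
    -- and tensor sizes; not the prototype's actual code).
    require 'torch'
    require 'nn'

    local net = torch.load('lenet-mnist.t7')    -- network trained offline on a desktop

    -- pixels: a flat Lua table of 32*32 gray-scale values in [0, 1], passed in
    -- from the Objective-C side.
    function classify(pixels)
      local input = torch.Tensor(pixels):resize(1, 32, 32)
      local logProbs = net:forward(input)       -- 10 log-probabilities from nn.LogSoftMax
      local _, class = logProbs:max(1)          -- index of the most probable class
      return class[1] - 1                       -- Torch classes are 1-based; digits are 0-9
    end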

4.4.1 Network Training

Creating a trained ConvNet consisted of designing the network using Lua and Torch and training the network on a desktop PC with the MNIST dataset. After around 50 epochs the error levels tended to stabilize around 1% and training was stopped. The trained network was then saved to disk using Torch's built-in functionality.

Many different ConvNet structures have been investigated in the literature. According to [34, 71] a network structure referred to as LeNet is recommended for character recognition tasks, having achieved low error rates (0.95%) on the MNIST set. Structures with better results exist, the best being [10] with an error rate of 0.23% achieved by a committee of 35 ConvNets using elastic distortions. The network structure chosen for the prototype was LeNet, because of the popularity of this type of network, with many references available both from the research community and from private resources on the Internet. Since the target platform was a smartphone, the size of the network was an issue: a larger network means more computations in the recognition stage, and because of that the committee of 35 ConvNets used by Ciresan et al. [10] was ruled out from investigation.

When training, different values for learning rate, data randomization, and positive/negative output were tested in order to find a set of training parameters that would result in a network with good generalization. Using a lower learning rate resulted in a smoother curve and can thus be argued to be more likely to find a stable point for which the network provides the best generalization. The learning rates tested are presented in figure 4.1. Early stopping was used to decide when to stop training. The best result achieved on the testing set was an error of 1%. As stated earlier, the best result for LeNet achieved by LeCun et al. was 0.95%, and it was concluded that the difference was within the error bound for training.
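As an illustration of what this training and saving step can look like, the following is a hedged sketch of a Torch training loop with negative log-likelihood and plain gradient descent; the stand-in network and the two random training samples exist only to make the sketch self-contained and are not the LeNet network or the MNIST data used in the project.

    -- Training-loop sketch (assumed Torch7 nn API, not the project's training script).
    require 'nn'

    -- Stand-in model; the real network was a LeNet-style ConvNet ending in LogSoftMax.
    local net = nn.Sequential()
      :add(nn.Reshape(32 * 32))
      :add(nn.Linear(32 * 32, 10))
      :add(nn.LogSoftMax())

    -- Stand-in training set; the real one held the MNIST samples. Targets are 1..10.
    local trainSet = {
      { torch.rand(1, 32, 32), 1 },
      { torch.rand(1, 32, 32), 2 },
    }

    local criterion = nn.ClassNLLCriterion()
    local learningRate = 0.01

    for epoch = 1, 50 do
      for i = 1, #trainSet do
        local input, target = trainSet[i][1], trainSet[i][2]
        criterion:forward(net:forward(input), target)
        net:zeroGradParameters()
        net:backward(input, criterion:backward(net.output, target))
        net:updateParameters(learningRate)       -- plain gradient descent step
      end
    end

    torch.save('lenet-mnist.t7', net)            -- reload later with torch.load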

Figure 4.1. Training using different learning rates. (Three panels: ConvNet trained with learning rates 0.01, 0.001, and 0.0001; training and testing error in % plotted against epoch.)

4.4.2 iPhone Application

To test the capabilities and performance of the classifier on real-world data using a smartphone, an iPhone application was developed. The prototype application consists of two views to capture two types of input: a user can either draw a single digit using finger movements on the touch screen or capture a digit using the built-in iPhone camera. In both cases the captured input is pre-processed and then classified in an off-line fashion by sending the pre-processed image to the previously trained network. The two different views of the application are shown in figure 4.2.

Figure 4.2. Input views of application prototype.

Draw View

The freehand drawing view captures input by letting a user create shapes directly on the iPhone screen. While drawing, the input is parsed as a Bézier curve. When the user hits the “Classify” button, the canvas containing the Bézier curve is transformed to a static image using iOS built-in functionality, pre-processed, and then sent to the classifier. Capturing input as a Bézier curve gives the advantage of being able to ignore any noise that might be inherent in an off-line capture of handwriting, allowing us to see whether the trained network has learned the structure of digits efficiently. The only pre-processing that takes place is centering and size adjustment, by finding the bounding box of the captured input, adding approximate padding, and scaling. Figure 4.4 illustrates the scenario from capturing input to correct classification.

Figure 4.3. Example of user drawing custom shapes.

Figure 4.4. Drawing classification pipeline consisting of drawing a symbol, pre-processing, and then presenting the classification result.

Camera View It is also possible to capture input using the built-in camera. The procedure is similar to that of the freehand drawing input. The main difference is that since the sample is taken from a photograph it is bound to include a lot of noise, and image extraction is more complicated. A different pre-processing stage, described below, is used for camera capture.

Figure 4.5. Classification using camera capture.

Pre-processing and Classification Details Image pre-processing was performed using OpenCV. In order to extract the best possible sample to send to the classifier it was important to remove noise and prepare the sample. Noise removal was done by a series of blurring and thresholding steps. After noise removal an erosion operation was performed to make the sample even more robust. After the image had been filtered, the sample was size-normalized and centered by finding the bounding box and padding it, so that the sample would be positioned in the middle of an image with a padding of 1/5 of the width of the bounding box. This was done to be more compliant with the MNIST training data. The procedure is summarized below and illustrated in figure 4.6; a small sketch of the centering computation follows after figure 4.6.

1. Blurring

2. Thresholding

3. Erosion

4. Find bounding box

5. Resize

Figure 4.6. Pre-processing stages of sample captured with camera. Borders for bounding boxes have been added for clarity.
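The following is a hedged sketch, in plain Lua rather than the OpenCV/Objective-C code used in the prototype, of the centering step referred to above: finding the bounding box of the ink pixels and computing a padding of 1/5 of the bounding-box width. The toy image and the 0.5 ink threshold are assumptions for illustration.

    -- Bounding-box and padding sketch (not the prototype's OpenCV code).
    -- img is a table of rows with gray-scale values in [0, 1]; values above the
    -- threshold are treated as ink.
    local img = {
      {0, 0, 0, 0, 0, 0},
      {1, 1, 1, 1, 1, 0},
      {1, 1, 1, 1, 1, 0},
      {0, 0, 0, 0, 0, 0},
      {0, 0, 0, 0, 0, 0},
      {0, 0, 0, 0, 0, 0},
    }

    local function boundingBox(image, threshold)
      local top, bottom, left, right = math.huge, -math.huge, math.huge, -math.huge
      for y = 1, #image do
        for x = 1, #image[y] do
          if image[y][x] > threshold then
            top, bottom = math.min(top, y), math.max(bottom, y)
            left, right = math.min(left, x), math.max(right, x)
          end
        end
      end
      return top, bottom, left, right
    end

    local top, bottom, left, right = boundingBox(img, 0.5)
    local width = right - left + 1
    local pad = math.floor(width / 5)        -- 1/5 of the bounding-box width
    print(top, bottom, left, right, pad)
    -- The crop [top - pad, bottom + pad] x [left - pad, right + pad] would then be
    -- scaled to the input size expected by the network.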

4.4.3 Testing

The prototype was tested by letting several users try both capture modes: freehand drawing and taking pictures of digits written in their own handwriting. The survey was performed by letting 10 persons use the two different capture modes for all possible inputs (digits 0-9) and mark whether the prototype made a correct classification or not. Users were told to mimic their normal writing style as much as possible and not try to “break” the classifier by willingly writing hard-to-interpret digits. See figures 4.7 and 4.8 for a full comparison. The survey was done in order to test the prototype in a real-world scenario. The result of the test was approximately 9/10 correct classifications.

The calculation time for the various steps involved in the recognition system was measured and can be seen in tables 4.1 and 4.2 for various devices. The calculation time is taken as the average of 10 consecutive executions, measured in nanoseconds and converted to seconds for readability. As can be seen, the recognition time is low, making ConvNets suitable as classifiers for smartphones.
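As a rough illustration of this kind of measurement, the following hedged sketch times the forward pass of a stand-in network as the average over 10 consecutive executions; the on-device numbers in tables 4.1 and 4.2 were of course measured in the Objective-C prototype, not with this script, so the stand-in model and the timer used here are assumptions.

    -- Timing sketch (not the prototype's measurement code).
    require 'nn'

    local net = nn.Sequential()
      :add(nn.Reshape(32 * 32))
      :add(nn.Linear(32 * 32, 10))
      :add(nn.LogSoftMax())
    local input = torch.rand(1, 32, 32)

    local runs = 10
    local start = os.clock()
    for _ = 1, runs do
      net:forward(input)                     -- classification step only
    end
    local avg = (os.clock() - start) / runs
    print(string.format("average classification time: %.6f s", avg))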

Figure 4.7. Number of correct classifications using prototype (individual numbers). (Two bar charts, draw and camera input, showing percent correct for each digit 0-9.)

Figure 4.8. Number of correct classifications using prototype (total). (Bar chart showing total percent correct for draw and camera input.)

Device       Pre-processing   Classification
iPhone 4     0.0076228        0.0320454
iPhone 4S    0.0074817        0.0176622
iPhone 5     0.0035123        0.0102338
iPhone 5S    0.0016783        0.0061551

Table 4.1. Freehand drawing speed performance of prototype on various devices (time in seconds).

Device       Pre-processing   Classification
iPhone 4     0.0370634        0.0288854
iPhone 4S    0.0298186        0.0173494
iPhone 5     0.0134305        0.0099449
iPhone 5S    0.0086724        0.0056117

Table 4.2. Camera capture speed performance of prototype on various devices (time in seconds).


Chapter 5

Discussion

Neural networks have been successfully used to solve many complex problems and diverse tasks, ranging from self-driving cars and autonomous flying vehicles to preventing credit card fraud [67, 50]. The power of a model lies in its ability to generalize and predict, and to make accurate quantitative anticipations of data in new experiments. Neural networks have these properties.

A convolutional neural network's (ConvNet) ability to generalize to unknown data is highly dependent on the dataset used for training. The goal of machine learning is not to build a theory of what constitutes a certain symbol but instead to use acquired knowledge and predictive power to understand what distinguishes one symbol from another. In order to visualize what the network “sees” we can plot the filters of a trained network, as in figure 5.1. As a reminder, a filter corresponds to a convolutional layer (defined by a weight matrix). The image in figure 5.1 shows 30 filters resulting from training a network on the MNIST set [16]. Note however that the image is not a completely accurate picture of the network's knowledge: biases have been neglected and weights have been scaled to fit gray-scale intensities.

Figure 5.1. 30 filters trained on the MNIST set.
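A hedged sketch of how such a filter image could be produced for the first convolutional layer of a Torch network is given below; the serialized network file name is hypothetical, the 'image' package is assumed to be installed, and this does not reproduce figure 5.1, which is taken from [16].

    -- Filter-visualization sketch (assumed Torch7 nn/image APIs).
    require 'nn'
    require 'image'

    local net = torch.load('lenet-mnist.t7')            -- hypothetical trained network
    local conv = net:get(1)                              -- first module: SpatialConvolution
    -- With one input plane, each filter is a kH x kW weight patch.
    local filters = conv.weight:view(conv.nOutputPlane, conv.kH, conv.kW)
    -- Arrange the filters in a grid, rescaling each to gray-scale intensities
    -- (biases are ignored, as noted above).
    local grid = image.toDisplayTensor{input = filters, padding = 1, scaleeach = true}
    image.save('filters.png', grid)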

The advantage of a ConvNet is its ability to acquire this knowledge without human intervention deciding on what features to use. This advantage leads to the difference between convolutional networks and other types of machine learning procedures in their approach to classification. While others, for example the Support Vector Machine (SVM), concentrate on finding a function (or kernel, in SVM language) that uses predefined features to separate classes, ConvNets instead focus on finding the most efficient features that can separate classes. Since finding structure in handwritten symbols is still an open research problem, the use of convolutional networks is ideal for classification problems for which the separating structure of individual samples is unknown. This is due to the automatic discovery of features by deep learning. Using ConvNets for recognition of handwritten symbols offers great potential, since the problem of handwriting is subject to high variance and we do not (yet, if ever) know whether any general structure can be found. Thus, instead of using handcrafted features, ConvNets have the possibility to learn about the input space by discovering features themselves as part of the learning process.

Data Distortion and Decreasing Error Rates One way of decreasing error rates is suggested by Simard et al. in [60]. They argue that generalization can be improved by including distortions of the sample images. Distortions are applied randomly to each sample just before training. Three different types of distortions that could be used are listed below, followed by a small sketch of how two of them could be applied:

• Scale Factor: the sample is either enlarged or shrunk in horizontal or vertical direction.

• Rotation: A slight rotation is applied to the sample.

• Elastic: Non-linear transformation visually equivalent to pulling or pushing fields of pixels in a wave like fashion.

The reason why these distortions decrease the error rates is believed to be twofold: first, they increase the size of the training set, and second, and more importantly, the application of these distortions forces the network to look for, and learn, more invariant features and aspects of the patterns, thus focusing on the separating features of the classes. Data distortion was not applied to the training samples for this project.
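The following is a hedged sketch, using the torch 'image' package, of how random scale and rotation distortions could be applied to a sample just before training; the ranges are arbitrary assumptions, the elastic distortion is omitted because it is more involved, and, as stated above, nothing of this kind was used in the project.

    -- Random distortion sketch (assumed torch 'image' API; not part of the project).
    require 'image'

    local function distort(sample)                     -- sample: 1 x 32 x 32 tensor
      local angle = (math.random() - 0.5) * 0.3        -- slight rotation, roughly +/- 8.5 degrees
      local scale = 0.9 + math.random() * 0.2          -- scale factor in [0.9, 1.1]
      local h, w = sample:size(2), sample:size(3)
      local out = image.rotate(sample, angle)
      out = image.scale(out, math.floor(w * scale), math.floor(h * scale))
      return image.scale(out, w, h)                    -- back to the network's input size
    end

    local distorted = distort(torch.rand(1, 32, 32))
    print(distorted:size())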

5.1 Conclusions

This report has presented an overview of how to design a modern handwriting recognition system for smartphones. A proof-of-concept application for character classification, using a neural network approach, has been successfully developed and evaluated. The evaluation shows that, using the guidelines suggested in section 3.2.1, neural networks can be trained efficiently for the purpose of recognizing digits in images.

It should be evident from the discussion in section 2.3.5 that crafting good features manually, with high class separability, is hard, if not impossible. As a result this report has presented an alternative approach to crafting features, namely letting the classifier itself decide what constitutes a good feature. A suitable technique that has the ability to learn visual patterns is convolutional neural networks, which are designed to mimic the human visual reception of images.

It is observed that training of networks takes a long time. Thus it is not considered reasonable to perform this type of training on a smartphone. There are several reasons for this. Firstly, a smartphone has severely constrained resources (limited memory and CPU performance) compared to a desktop machine. Secondly, the design of the operating system used in both Android and iOS prevents computationally intensive applications from running in the background for a long time. Applications are only allowed full access to system resources when running in the foreground. Once an application is put in the background, the operating system gives no guarantees that these resources will be available for the application, and it can be killed at any time for the benefit of other applications. A typical user only keeps one application active for a short period of time and then switches to another. The conclusion is thus that performing adaptive learning on the device is not considered possible when using neural networks.

Once offline training is completed on a desktop, using a trained convolutional neural network for classification of data on a smartphone is fast and well suited for recognition tasks in any type of application. From the survey of user tests the conclusion is that using convolutional neural networks on smartphones performs well in regard to correctness. Using the iPhone prototype, a recognition rate above 90% was achieved on real-world data. Finally, using techniques available today it is not considered possible to create completely accurate systems for handwriting recognition tasks that output a 100% correct interpretation of a document.

Chapter 6

Future Work

The symbol recognition techniques outlined in earlier chapters work quite well but can be improved in various ways. Expanding the training set using various distortions, as explained in chapter 5, could lead to better results. A simple extension of the prototype is to implement functionality that applies data distortions during training. As presented in [60], using elastic distortions could lead to a decrease in error rates on the MNIST set. It would be very interesting to see how the inclusion of these would change the performance of the prototype on real-world data.

Extensions To extend the proposed system to facilitate word recognition, a new architecture and recognition pipeline would be needed. Separate modules for localization, word segmentation, and word recognition would have to be constructed. All these modules can be built in isolation from each other but joined together to form a complete handwriting recognition system. For the recognition module it is recommended to investigate the use of probability-based techniques instead of neural networks. A suggestion for such an approach is to use hidden Markov models, which combine both segmentation and recognition. A suggested future use of this type of technology is to build a searchable recognition system (see section 2.2), which is not restricted by the limitation of not having a system that makes a 100% correct classification.


Acknowledgements

I would like to thank Bontouch AB for sponsoring this project. I would also like to thank my supervisor Jens Lagergren for interesting conversations. Thank you to everyone who has helped me by proof-reading the report: Johan Mezaros, Martin Forsling, Markus Hirvi, and especially a big thank you to Mats Malmberg for providing invaluable feedback. Many thanks to my family for being supportive.


References

[1] N. Arica and F.T. Yarman-Vural. “An overview of character recognition focused on off-line handwriting”. In: Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on 31.2 (2001), pp. 216–233. issn: 1094-6977. doi: 10.1109/5326.941845.
[2] Marc Balmer. “Lua as a Configuration And Data Exchange Language”. In: (2013). url: http://www.netbsd.org/~mbalmer/lua/lua_config.pdf.
[3] K. B M R Batuwita and G. E M D C Bandara. “An Improved Segmentation Algorithm for Individual Offline Handwritten Character Segmentation”. In: Computational Intelligence for Modelling, Control and Automation, 2005 and International Conference on Intelligent Agents, Web Technologies and Internet Commerce, International Conference on. Vol. 2. 2005, pp. 982–988. doi: 10.1109/CIMCA.2005.1631596.
[4] Deep Learning of Representations. Nov. 2012. url: https://www.youtube.com/watch?feature=player_embedded&v=4xsVFLnHC_0#! (visited on 06/15/2013).
[5] James Bergstra et al. “Theano: a CPU and GPU Math Expression Compiler”. In: Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation. Austin, TX, June 2010.
[6] M. Blumenstein and B. Verma. “A new segmentation algorithm for handwritten word recognition”. In: Neural Networks, 1999. IJCNN ’99. International Joint Conference on. Vol. 4. 1999, 2893–2898 vol. 4. doi: 10.1109/IJCNN.1999.833544.
[7] Leon Bottou and Yann LeCun. “Graph Transformer Networks for Image Recognition”. In: Proceedings of ISI. (invited paper). 2005.
[8] R.G. Casey and E. Lecolinet. “A survey of methods and strategies in character segmentation”. In: Pattern Analysis and Machine Intelligence, IEEE Transactions on 18.7 (1996), pp. 690–706. issn: 0162-8828. doi: 10.1109/34.506792.
[9] William B Cavnar, John M Trenkle, et al. “N-gram-based text categorization”. In: Ann Arbor MI 48113.2 (1994), pp. 161–175.

[10] Dan Ciresan, Ueli Meier, and Jürgen Schmidhuber. “Multi-column deep neural networks for image classification”. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE. 2012, pp. 3642–3649.
[11] Dan Claudiu Ciresan et al. “Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition”. In: arXiv preprint arXiv:1003.0358 (2010).
[12] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. “Torch7: A Matlab-like environment for machine learning”. In: BigLearn, NIPS Workshop. 2011.
[13] Pierre Sermanet et al. Computational and Biological Learning Laboratory of New York University. EBLearn - Open Source C++ Machine Learning Library. url: http://eblearn.cs.nyu.edu:21991/doku.php (visited on 08/16/2013).
[14] Lua Contributors. About Lua. July 2013. url: http://www.lua.org/about.html#name (visited on 07/31/2013).
[15] Oracle Corporation. Java Native Interface. url: http://docs.oracle.com/javase/6/docs/technotes/guides/jni/ (visited on 08/16/2013).
[16] Deeplearning.net. Tutorial: Denoising Autoencoders (dA). url: http://deeplearning.net/tutorial/dA.html (visited on 08/14/2013).
[17] Mark Deloura. The Engine Survey: General Results. 2009. url: http://www.satori.org/2009/03/the-engine-survey-general-results/ (visited on 07/31/2013).
[18] iOS Developer Library. Apple Inc. Performing Convolution Operations. 2011. url: http://developer.apple.com/library/IOs/documentation/Performance/Conceptual/vImage/ConvolutionOperations/ConvolutionOperations.html (visited on 08/14/2013).
[19] Dave Engberg. Even Grittier Details on Evernote’s Indexing System. Nov. 2011. url: http://blog.evernote.com/tech/2011/11/01/even-grittier-details-on-evernotes-indexing-system/ (visited on 08/14/2013).
[20] Gernot A. Fink and SpringerLink (Online service). Markov Models for Handwriting Recognition. SpringerBriefs in Computer Science, Springer London, 2011. url: http://dx.doi.org/10.1007/978-1-4471-2188-6.
[21] Chris Foresman. Apple relaxes restrictions on iOS app code, iAd analytics. 2010. url: http://arstechnica.com/apple/2010/09/apple-relaxes-restrictions-on-ios-app-code-iad-analytics/ (visited on 07/31/2013).
[22] BM Forrest et al. “Implementing neural network models on parallel computers”. In: The Computer Journal 30.5 (1987), pp. 413–419.
[23] Kunihiko Fukushima. “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position”. In: Biological cybernetics 36.4 (1980), pp. 193–202.

[24] A. Hartl, Clemens Arth, and D. Schmalstieg. “Instant segmentation and feature extraction for recognition of simple objects on mobile phones”. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on. 2010, pp. 17–24. doi: 10.1109/CVPRW.2010.5543245.
[25] Jeff Heaton. The Number of Hidden Layers. 2008. url: http://www.heatonresearch.com/node/707 (visited on 08/01/2013).
[26] Geoffrey E. Hinton. “Learning multiple layers of representation”. In: Trends in Cognitive Sciences 11.10 (2007), pp. 428–434. issn: 1364-6613. doi: http://dx.doi.org/10.1016/j.tics.2007.09.004. url: http://www.sciencedirect.com/science/article/pii/S1364661307002173.
[27] David H Hubel and Torsten N Wiesel. “Receptive fields and functional architecture of monkey striate cortex”. In: The Journal of physiology 195.1 (1968), pp. 215–243.
[28] Balázs Kégl and Róbert Busa-Fekete. “Boosting products of base classifiers”. In: Proceedings of the 26th Annual International Conference on Machine Learning. ACM. 2009, pp. 497–504.
[29] Brett Kelly. How Evernote’s Image Recognition Works. 2013. url: http://blog.evernote.com/tech/2013/07/18/how-evernotes-image-recognition-works/ (visited on 08/14/2013).
[30] Daniel Keysers et al. “Deformation models for image recognition”. In: Pattern Analysis and Machine Intelligence, IEEE Transactions on 29.8 (2007), pp. 1422–1435.
[31] Věra Kůrková. “Kolmogorov’s theorem and multilayer neural networks”. In: Neural Networks 5.3 (1992), pp. 501–506. issn: 0893-6080. doi: http://dx.doi.org/10.1016/0893-6080(92)90012-8. url: http://www.sciencedirect.com/science/article/pii/0893608092900128.
[32] L Ladha and T Deepa. “Feature selection methods and algorithms”. In: International Journal on Computer Science and Engineering 3.5 (2011), pp. 1787–1797.
[33] Eric Lecolinet and Olivier Baret. “Cursive word recognition: Methods and strategies”. In: Fundamentals in Handwriting Recognition. Springer, 1994, pp. 235–263.
[34] Y. LeCun et al. “Comparison of learning algorithms for handwritten digit recognition”. In: International Conference on Artificial Neural Networks. Ed. by F. Fogelman and P. Gallinari. Paris: EC2 & Cie, 1995, pp. 53–60.
[35] Y. LeCun et al. “Gradient-Based Learning Applied to Document Recognition”. In: Intelligent Signal Processing. IEEE Press, 2001, pp. 306–351.
[36] Quoc V Le et al. “Building high-level features using large scale unsupervised learning”. In: arXiv preprint arXiv:1112.6209 (2011).

[37] Roberto Lop. OpenNN Manual. url: http://flood.sourceforge.net/docs/manual.pdf (visited on 07/08/2013).
[38] Lua mailing list. lua-users.org. What makes Lua tick? url: http://lua-users.org/lists/lua-l/2012-04/msg00331.html (visited on 08/16/2013).
[39] S. Snoussi Maddouri et al. “Baseline Extraction: Comparison of Six Methods on IFN/ENIT Database”. In: International Conference on Frontiers in Handwriting Recognition (ICFHR). 2008.
[40] Gerald M Maggiora, David W Elrod, and Robert G Trenary. “Computational neural networks as model-free mapping devices”. In: Journal of chemical information and computer sciences 32.6 (1992), pp. 732–741.
[41] Gary Marcus. Is “deep learning” a Revolution in Artificial Intelligence? Nov. 2012. url: http://www.newyorker.com/online/blogs/newsdesk/2012/11/is-deep-learning-a-revolution-in-artificial-intelligence.html (visited on 07/24/2013).
[42] John Markoff. How Many Computers to Identify a Cat? 16,000. June 2012. url: http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html?_r=0 (visited on 07/24/2013).
[43] Stephen Marsland. Machine Learning: An Algorithmic Perspective. 1st. Chapman & Hall/CRC, 2009. isbn: 1420067184, 9781420067187.
[44] Warren S McCulloch and Walter Pitts. “A Logical Calculus of the Ideas Immanent in Nervous Activity”. In: The Bulletin of Mathematical Biophysics 5.4 (1943), pp. 115–133.
[45] Marvin Minsky and Seymour Papert. Perceptrons - an introduction to computational geometry. MIT Press, 1987, pp. I–XV, 1–292. isbn: 978-0-262-63111-2.
[46] Rob Miracle. Tutorial: the Ultimate “config.lua” File. Dec. 2012. url: http://www.coronalabs.com/blog/2012/12/04/the-ultimate-config-lua-file/ (visited on 07/31/2013).
[47] Robert Mohns. iPhone 5 Review. 2012. url: http://www.macintouch.com/reviews/iphone5/ (visited on 07/04/2013).
[48] Shunji Mori, Ching Y. Suen, and Kazuhiko Yamamoto. “Document image analysis”. In: ed. by Lawrence O’Gorman and Rangachar Kasturi. Los Alamitos, CA, USA: IEEE Computer Society Press, 1995. Chap. Historical review of OCR research and development, pp. 244–273. isbn: 0-8186-6547-5. url: http://dl.acm.org/citation.cfm?id=201573.201651.
[49] Jawad Nagi et al. “Max-pooling convolutional neural networks for vision-based hand gesture recognition”. In: Signal and Image Processing Applications (ICSIPA), 2011 IEEE International Conference on. IEEE. 2011, pp. 342–347.

[50] NASA. NASA Neural Network Project Passes Milestone. 2003. url: http://www.nasa.gov/centers/dryden/news/NewsReleases/2003/03-49.html (visited on 07/05/2013).
[51] Tyler Neylon. Learn Lua in 15 Minutes (more or less). 2013. url: http://tylerneylon.com/a/learn-lua/ (visited on 08/12/2013).
[52] Donald O. Hebb. The Organization of Behavior: A Neuropsychological Theory. Wiley and Sons, New York, 1949. isbn: 9780471367277.
[53] Alex Pashintsev. Evernote Indexing System. 2011. url: http://blog.evernote.com/tech/2011/09/30/evernote-indexing-system/ (visited on 08/14/2013).
[54] J.A. Pittman. “Handwriting Recognition: Tablet PC Text Input”. In: Computer 40.9 (2007), pp. 49–54. issn: 0018-9162. doi: 10.1109/MC.2007.314.
[55] Thomas Plötz and Gernot A. Fink. “Markov models for offline handwriting recognition: a survey”. English. In: International Journal on Document Analysis and Recognition (IJDAR) 12 (4 2009), pp. 269–298. issn: 1433-2833. doi: 10.1007/s10032-009-0098-4. url: http://dx.doi.org/10.1007/s10032-009-0098-4.
[56] Amjad Rehman and Tanzila Saba. “Off-line cursive script recognition: current advances, comparisons and remaining problems”. English. In: Artificial Intelligence Review 37.4 (2012), pp. 261–288. issn: 0269-2821. doi: 10.1007/s10462-011-9229-7. url: http://dx.doi.org/10.1007/s10462-011-9229-7.
[57] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. “Learning representations by back-propagating errors”. In: Cognitive modeling 1 (2002), p. 213.
[58] A.W. Senior and A.J. Robinson. “An off-line cursive handwriting recognition system”. In: Pattern Analysis and Machine Intelligence, IEEE Transactions on 20.3 (1998), pp. 309–321. issn: 0162-8828. doi: 10.1109/34.667887.
[59] Thomas Serre and Maximilian Riesenhuber. Realistic modeling of simple and complex cell tuning in the HMAX model, and implications for invariant object recognition in cortex. Tech. rep. DTIC Document, 2004.
[60] Patrice Y. Simard, Dave Steinkraus, and John C. Platt. “Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis”. In: Int’l Conference on Document Analysis and Recognition. 2003, pp. 958–963.
[61] Aaron Souppouris. Google X creates 16,000-core ’neural network’ for independent machine learning. June 2012. url: http://www.theverge.com/2012/6/26/3117956/google-x-object-recognition-research-youtube (visited on 07/31/2013).

[62] Daniel Svozil, Vladimir Kvasnicka, and Jiri Pospichal. “Introduction to multi-layer feed-forward neural networks”. In: Chemometrics and intelligent laboratory systems 39.1 (1997), pp. 43–62.
[63] Luis Talavera. “An evaluation of filter and wrapper methods for feature selection in categorical clustering”. In: Advances in Intelligent Data Analysis VI. Springer, 2005, pp. 440–451.
[64] CC Tappert. “Adaptive on-line handwriting recognition”. In: Proceedings of the 7th International Conference on Pattern Recognition. 1984, pp. 1004–1007.
[65] Massachusetts Institute of Technology. The MIT License (MIT). url: http://opensource.org/licenses/MIT (visited on 08/16/2013).
[66] Sergios Theodoridis and Konstantinos Koutroumbas. Pattern recognition. Démonstrations et figures sur www.elsevierdirect.com/9781597492720. Amsterdam, Boston, Paris: Academic Press, 2009. isbn: 978-1-59749-272-0. url: http://opac.inria.fr/record=b1128027.
[67] VISA. Counterfeit Fraud - Neural Networks (24/7 Monitoring). url: http://www.visa.ca/en/personal/pdfs/counterfeit_fraud.pdf (visited on 07/05/2013).
[68] Markus Wienecke, Gernot A. Fink, and Gerhard Sagerer. “Experiments in unconstrained offline handwritten text recognition”. In: Proc. 8th Int. Workshop on Frontiers in Handwriting Recognition. IEEE, 2002.
[69] Shengqian Yang, Dacong Yan, and Atanas Rountev. “Testing for Poor Responsiveness in Android Applications”. In: reason 10 (), p. 20.
[70] Cong Yao et al. “Detecting texts of arbitrary orientations in natural images”. In: CVPR. IEEE, 2012, pp. 1083–1090. isbn: 978-1-4673-1226-4. url: http://dblp.uni-trier.de/db/conf/cvpr/cvpr2012.html#YaoBLMT12.
[71] Y. LeCun, C. Cortes, and C. J.C. Burges. The MNIST Database. url: http://yann.lecun.com/exdb/mnist/ (visited on 07/15/2013).
