Computer Science

Stefan Larsson, Filip Mellqvist

Automatic Number Plate Recognition for Android

Bachelor’s Project

Automatic Number Plate Recognition for Android

Stefan Larsson, Filip Mellqvist

2019 The author(s) and Karlstad University

This report is submitted in partial fulfillment of the requirements for the Bachelor’s degree in Computer Science. All material in this report which is not our own work has been identified and no material is included for which a degree has previously been conferred.

Stefan Larsson

Filip Mellqvist

Approved, 04-06-2019

Advisor: Tobias Pulls

Examiner: Stefan Alfredsson

Abstract

This thesis describes how we utilize machine learning and image preprocessing to create a system that can extract a license plate number by taking a picture of a car with an Android smartphone. This project was provided by ÅF on behalf of one of their customers, who wanted to make the workflow of their employees more efficient. The two main techniques of this project are object detection to detect license plates and optical character recognition to then read them. In between are several different image preprocessing techniques to make the images as readable as possible. These techniques mainly include skewing and color distorting the image. The object detection consists of a convolutional neural network using the You Only Look Once technique, trained by us using Darkflow. When using our final product to read license plates of expected quality in our evaluation phase, we found that 94.8% of them were read correctly. Without our image preprocessing, this was reduced to only 7.95%.

Contents

1 Introduction  1
  1.1 Purpose of our project  1
  1.2 Dissertation layout  2
  1.3 Project outcome  2

2 Background  3
  2.1 ÅF  3
  2.2 Machine Learning  3
    2.2.1 Neural Networks  4
    2.2.2 Convolutional Neural Networks  4
    2.2.3 K-means clustering  5
  2.3 Computer Vision  6
    2.3.1 What is Computer Vision?  6
    2.3.2 Low- and High-level Vision  6
    2.3.3 Binary Image and Adaptive Threshold  8
    2.3.4 The HSV model  9
    2.3.5 Image Classification and Object Detection  9
    2.3.6 Optical Character Recognition  10
  2.4 Tools  11
    2.4.1 OpenCV  11
    2.4.2 Tesseract-OCR  11
    2.4.3 Pytesseract  11
    2.4.4 You Only Look Once  12
    2.4.5 Anaconda  12
  2.5 OpenALPR  13
  2.6 Summary  13

3 Project Design  14
  3.1 Android application  15
  3.2 OCR Service  15
  3.3 Database  16
  3.4 Summary  16

4 Project Implementation  17
  4.1 Building an object detector  17
    4.1.1 Setting up an environment  17
    4.1.2 Training  19
    4.1.3 Porting to an Android device  21
  4.2 OCR service  23
    4.2.1 The two colors of the plates  24
    4.2.2 Prepare the image  25
    4.2.3 Identify the contour  28
    4.2.4 Identify the corners  30
    4.2.5 Skew the plate  32
    4.2.6 Refining the image  34
    4.2.7 Read the image  36
  4.3 Summary  37

5 Evaluation  38
  5.1 Android performance  38
  5.2 Evaluation of object detection  39
  5.3 OCR service  40
    5.3.1 OCR performance  40
    5.3.2 Precision vs. time  43
    5.3.3 Evaluating the impact of preprocessing  48
  5.4 Summary  50

6 Conclusion  51
  6.1 Project summary  51
  6.2 Future work  52
  6.3 Concluding remarks  53

References  54

List of Figures

1.1 A simple model showing our whole system.  2
2.1 An image with red and green channel (image: CC BY-SA 3.0 [9]). The colors are represented on a plot grouped into segments using k-means (image: public domain [11]).  5
2.2 Performing canny edge detection on a pair of headphones.  7
2.3 Local and global threshold applied on an image with both bright and dark areas. Local adaptive thresholding on the image in the middle and global fixed thresholding on the image to the right.  8
2.4 The HSV cylinder showing the connection of the values: Hue, Saturation and Value (image: CC BY-SA 3.0 [10]).  9
2.5 How an Image Classifier and an Object Detector would see a cup.  10
3.1 The planned flow of our system.  14
4.1 Annotating a picture of a license plate.  19
4.2 A screenshot of the final application where the object detector is more than 90% confident that a license plate is found.  23
4.3 The result of k-means color quantization where k=3.  25
4.4 The desired HSV mask applied. The quadrilateral is highlighted and the characters are easy to identify.  27
4.5 HSV masks with six different ranges (values in upper left corner). Left three lower values and right three higher values.  28
4.6 The final accepted mask, acquired when the saturation value ranges from 0-60 (second value in each array) and the value (third value) is ranging from 195-255.  28
4.7 Desired contour drawn on the source image.  30
4.8 The corners of the contour drawn on the source image.  32
4.9 A quadrilateral with the edges A, B, C and D and the corners E, F, G and H.  32
4.10 The result of grabbing the corners of the source image and skewing them to the desired corners.  34
5.1 A flowchart of the image being put into the OCR service pipeline, first being preprocessed and then read by the OCR software, later to be matched or not.  42
5.2 A simplified flowchart of the binarization process where the image is put into the adaptive threshold, and if not matched by the OCR software, will go into the manual threshold together with a calculated threshold value.  45

List of Tables

5.1 Comparing the running time of our app to the specifications of our devices.  39
5.2 Comparison of size and time between two generic images of minimum and maximum potential dimensions with no need for iteration.  42
5.3 A table showing the outcomes of the varying multipliers with the four most essential numbers for our evaluation.  46
5.4 A table showing x, which is the number the attempt will get raised to every iteration, together with the four most important numbers for the evaluation.  47
5.5 A table comparing the two methods for tweaking the threshold in the binarization of the image.  48
5.6 A table comparing the accuracy and speed of the OCR software with and without preprocessing.  49
5.7 A table comparing the accuracy and speed of the OCR software with and without preprocessing on a subset of optimized images.  50

Listings

4.1 Create and prepare an Anaconda virtual environment called tgpu2.  18
4.2 Clone Darkflow repository.  18
4.3 Download Darkflow dependencies with Anaconda.  18
4.4 Installing Darkflow using pip.  18
4.5 Initiating a Darkflow training session.  20
4.6 OpenCV k-means on image.  24
4.7 Reading the image with OpenCV.  25
4.8 Creating HSV ranges. That is the upper and lower limit.  26
4.9 Creating the mask with the input of the image, the lower values in HSV, as well as the upper ones.  27
4.10 Creating dynamic HSV ranges. The upper and lower limit respectively will part progressively.  27
4.11 Getting coordinates of all points that will make up the contours of the mask.  29
4.12 Confirms quadrilateral with arcLength() and approxPolyDP().  29
4.13 Locating the corners, iterating through every point in the contour.  31
4.14 The function that utilizes OpenCV's functions getPerspectiveTransform() and warpPerspective() to skew the image.  33
4.15 The function where the binarized image is returned together with the calculated optimal threshold.  36
4.16 The implementation of Pytesseract, configured to read the license plate.  37

List of Abbreviations

ML - Machine Learning
OCR - Optical Character Recognition
CNN - Convolutional Neural Network
OpenCV - Open Source Computer Vision
PyTesseract - Python-Tesseract
YOLO - You Only Look Once
HSV - Hue, Saturation, Value
RGB - Red, Green, Blue

1 Introduction

Machine learning, neural networks and artificial intelligence are all concepts which have exploded in popularity in the past few years. These techniques allow computers to automatically analyze immense amounts of data and make decisions based on the patterns found in it. This can be invaluable because the amounts of data used are often much too large for any human to analyze, comprehend and draw a conclusion from. It can be used almost anywhere, from self-driving cars to brewing the perfect pint of beer1. Computer vision is a field in computer science that has had great success due to the increasing popularity of machine learning. Instead of having a human look at images and decide what they depict, we are able to teach computers to recognize patterns from previous images and see the resemblance in new images. Computer vision can also be used to read alphanumeric characters in images and turn them into text.

1.1 Purpose of our project

The purpose of this project is to develop a system for our employer ÅF that will change the workflow for one of their customers. This customer has employees who often file damage reports on cars they own using an application created by ÅF. In its current state, the workflow consists of taking several pictures of the car in question and then opening a text editor to manually add the number of the license plate for the application to download information about it. The idea is to change the workflow so that information about the car is automatically gathered as part of the process of taking pictures. This would be done by creating a system where a computer is able to read the license plate directly off an image taken by the employees, and that is what our project consists of.

1https://news.microsoft.com/europe/2017/12/05/beer-fingerprinting-project-artificial-intelligence-create-next-pint/, [2019-05-09].

1.2 Dissertation layout

Chapter 2 provides background on machine learning and computer vision, explaining the different concepts and tools used in this project, and also gives some background on our employer. In Chapter 3 we explain the overall design of our solution, consisting of a mobile app and a backend service. Chapter 4 is a detailed explanation of how we implemented the project and what tools were used to accomplish this. In Chapter 5 we evaluate the accuracy of our object detection, the performance of the application and both the accuracy and performance of the OCR service. In Chapter 6 we discuss the project as a whole, problems that arose and how the system could be improved further in the future.

1.3 Project outcome

The project outcome came very close to what we expected while planning it. We have an Android application capable of detecting and cropping license plates as well as an OCR service that is able to read a surprisingly high number of license plates. Our application's performance did not reach the levels we thought were necessary before creating it, but as we tested it with a real phone we realized that our estimated requirements were too high. A high-level image of our system is provided in Figure 1.1.

Figure 1.1: A simple model showing our whole system.

2 Background

This chapter begins by giving some information about our employer and then focuses on machine learning and computer vision. It covers theory about the subjects as well as tools and techniques usable in practical implementations.

2.1 ÅF

ÅF AB, formerly known as Ångpanneföreningen (The Steam Boiler Association), is a Swedish company founded in 1895. They are an engineering and consulting company whose main areas are industry, infrastructure, energy and digital solutions. [25] During this project we worked under their Digital Solutions division.

2.2 Machine Learning

Machine Learning (ML) is an automated process of data analysis. [29] By providing a machine learning algorithm with appropriate data, the algorithm will use it to detect patterns and be able to make decisions by itself. Giving software the ability of decision making allows for automation of various tasks where a human was previously required.

Machine learning algorithms are categorized into three different groups depending on what style of learning they use. [7] Supervised learning, which is the most common practice, is done by giving the machine labeled data which it learns from. When trained, the algorithm is expected to put those labels on unlabeled data. Unsupervised learning is when you use unlabeled data to train. This is done to allow the computer to learn the underlying structure of data by itself. The last category is called semi-supervised machine learning and is a mix between the two, used primarily because labeling data is time consuming.

2.2.1 Neural Networks

A Neural Network (NN) is a learning system commonly used when creating a supervised ML model and is inspired by the neuron connections in human brains. [15] It consists of three types of layers: an input layer, one or more hidden layers and an output layer. The order of the layers is important, as the input layer executes first, then the hidden layer(s) and finally the output layer. Each of those layers is built up of one or more nodes (commonly referred to as neurons or units). These nodes have different purposes depending on which layer they are in. Nodes in the input layer take some input from an external source and forward it to the nodes in the hidden layer. The hidden nodes then apply a function to the input, generate an output and send it to either the next hidden layer, if one exists, or the output layer. The nodes in the output layer also apply a function to their input, but their output is the final result of the network and is returned as the output of the NN. Each node gets an initial random weight associated with it which is used in its respective function. When training the network this weight is adjusted to fit the data. The rate at which the weights are adjusted can be improved by using an optimizer when training. Optimizers are algorithms which optimize the training of neural networks. These algorithms can improve both the speed of training as well as the accuracy of the final NN. There are many different optimizers—Adam, RMSProp and AdaGrad to name a few—and they all apply different functions to improve training. [6]
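To make the layer structure concrete, the following minimal NumPy sketch (our own illustration, not code from the project) shows a forward pass through a network with one hidden layer, where randomly initialized weights are applied at each layer:

import numpy as np

def sigmoid(x):
    # Activation function applied by each node to its weighted input.
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Randomly initialized weights: 3 inputs -> 4 hidden nodes -> 1 output node.
w_hidden = rng.normal(size=(3, 4))
w_output = rng.normal(size=(4, 1))

x = np.array([0.2, 0.5, 0.1])        # input layer: values from an external source
hidden = sigmoid(x @ w_hidden)       # hidden layer: apply a function to the weighted input
output = sigmoid(hidden @ w_output)  # output layer: the final result of the network
print(output)

During training, an optimizer such as Adam would adjust w_hidden and w_output to fit the data.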

2.2.2 Convolutional Neural Networks

A Convolutional Neural Network (CNN) is a subclass of neural networks designed for applying machine learning to images. [28] An image contains a very large amount of data. A typical photo taken with a modern cellphone may have a resolution of 2000x1000 pixels. Every pixel also has to store three values representing the red, green and blue color channels. This gives us an array of the size 2000x1000x3 for a single image. While this array could be put through a regular NN to detect patterns, it would be inefficient and the end result would likely have insufficient precision. CNNs reduce the computational complexity by scaling down the size of an image while extracting and keeping important features. This does not only increase speed but also increases precision by reducing the amount of noise in an image. While still using nodes as in the original definition of a NN, the layers of a CNN are different and specifically made for imagery.

2.2.3 K-means clustering

K-means clustering is a way of grouping data into segments, where k is a variable which tells the algorithm how many groups or clusters the final result should consist of. [12] The algorithm takes k starting positions and divides the data into k partitions to calculate the mean of each partition. This is done n times, and for every iteration the new partitions are calculated with regard to the current means. When using k-means to quantize colors, the same algorithm is applied on a set of colors. The colors are divided into k segments and each color is converted into the average color of the group it has fallen into. In Figure 2.1, we can see k-means being applied on an image with values ranging from 0-255 in two directions, i.e., a color combined from red and green. The k value is set to 16; hence 16 dots. The colors (dots) are each individually set to the average color of their associated group.
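As an illustration of the iteration described above (our own sketch with synthetic data points, not the project's code), a bare-bones k-means in NumPy could look as follows:

import numpy as np

def kmeans(data, k, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points as the initial means.
    means = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iterations):
        # Assign every point to the nearest current mean.
        labels = np.argmin(np.linalg.norm(data[:, None] - means[None, :], axis=2), axis=1)
        # Recalculate each mean from the points assigned to it (keep the old mean if a cluster is empty).
        means = np.array([data[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
                          for j in range(k)])
    return means, labels

points = np.random.default_rng(1).integers(0, 256, size=(500, 2)).astype(float)
centers, assignment = kmeans(points, k=16)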

Figure 2.1: An image with red and green channel (image: CC BY-SA 3.0 [9]). The colors are represented on a plot grouped into segments using k-means (image: public domain [11]).

2.3 Computer Vision

Computer vision is an interdisciplinary field where engineers strive to make machines mimic the human visual system, enabling them to perceive the world in the same way humans do.

2.3.1 What is Computer Vision?

We humans take in information through our eyes in real-time and process it in our brains. [31] We can relatively easily estimate the distance to different objects with good precision, and we identify people and animals walking down the street and separate them from every other object. We can identify faces and their expressions and even describe how people are walking. Without having seen the exact scene or picture before, we can still guess with near perfect accuracy the different objects in that picture; we can describe the scene in detail, what is going on, when it is going on and how it is going on. For a computer, this is difficult. This is where machine learning and neural networks come in handy. These networks are built to mimic our brains and, with this, learn about their surroundings with the help of cameras, much like we do.

2.3.2 Low- and High-level Vision

There are mainly two levels of vision: low-level and high-level. Because human interpretation of the world—in terms of vision—is often divided into these two groups, the same division applies to, and is discussed in, the vision of computers as well. While high-level and low-level comprehension are two distinct ways to interpret an image, they are used together and build on one another, more so high-level on low-level than the other way around. [16] Before high-level vision takes place, low-level image processes are often applied on images, including edge detection, corner detection and blob detection, amongst others. Edge detection involves identifying sharp deviations, often in the form of light pixels next to dark pixels or vice versa, and is used to highlight changes and important features in an image. A widely used method is Canny edge detection. It works by first identifying the aforementioned sharp deviations to create thin lines. It takes neighboring pixels into consideration to look for potentially even brighter areas and then moves the line to that place. The result is a black and white image consisting of only sharp lines. [8]

Figure 2.2: Performing canny edge detection on a pair of headphones.
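For reference, Canny edge detection is available in OpenCV; a minimal sketch (assuming an input file named headphones.jpg and example threshold values) could look like this:

import cv2 as cv

# Read the image in grayscale and extract thin edge lines with the Canny detector.
# The two values are the lower and upper gradient thresholds.
image = cv.imread("headphones.jpg", cv.IMREAD_GRAYSCALE)
edges = cv.Canny(image, 100, 200)
cv.imwrite("headphones_edges.png", edges)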

Edge detection is a typical low-level image process where the computer applies mathematical formulas to extract potentially useful information. The processing could involve extracting colors, brightness, sharpness and more. This is considered preparatory work for the coming image analysis, see Figure 2.2. High-level vision—in this case computer vision—is usually partly dependent upon a low-level preprocessed image. This is because relevant information has been extracted in the preprocessed image, and therefore the next algorithm optimally only has to process this data, speeding up the sequence. [16] Since neural networks have gained such momentum lately we can, in a very small part, mimic parts of our brain; in this case how we interpret an image. As mentioned earlier, there are layers in a neural network, where in this case the different layers could represent different types of information in an image. The first layer could learn to find edges, the second layer could learn what corners look like, and the third looks for the object as a whole, formed by the former layers. These layers, combining their information, gradually build the trained model with which objects in an image, or even the image at large, can be described. This example shows how high-level vision uses low-level information to gain greater understanding.

2.3.3 Binary Image and Adaptive Threshold

When binarizing an image you effectively transform the image into values of only ones and zeroes. This means that the pixels no longer contain any layers of colors; a binary image is essentially an image where every pixel has a value of either 0 or 1. A common way of binarizing an image is called global fixed threshold. With a global fixed threshold, a threshold is given and the program then goes through every pixel in the image, looking at the value of each one: if the value of the pixel is higher than the threshold, the pixel becomes white, or 1; if the pixel's value is lower than the threshold, the pixel becomes black, or 0. [13] This is done equally all over the image, hence global fixed threshold, as we can see in the rightmost image in Figure 2.3.

A more advanced method for binarizing an image utilizes something called adaptive thresholding. [5] It does what it sounds like: it adapts to its surroundings. This is what is called a local method of binarization. A block of N × N pixels walks over the image, calculating a local threshold for each block. If a certain block is unusually bright, the program adapts to this block and raises the threshold, making it harder for pixels to become white, or 1. The same goes for darker areas, where the program lowers the threshold. This method is excellent when dealing with images that have both dark and bright areas but where details are still wanted from all parts of the image, see the middle image in Figure 2.3.

Figure 2.3: Local and global threshold applied on an image with both bright and dark areas. Local adaptive thresholding on the image in the middle and global fixed thresholding on the image to the right.
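In OpenCV, the two approaches described above correspond roughly to the following calls (a sketch with example parameter values, not the exact settings used later in this thesis):

import cv2 as cv

gray = cv.imread("scene.jpg", cv.IMREAD_GRAYSCALE)

# Global fixed threshold: one limit (here 127) applied to every pixel.
_, global_binary = cv.threshold(gray, 127, 255, cv.THRESH_BINARY)

# Local adaptive threshold: the limit is computed per 11x11 neighborhood,
# so bright and dark regions are binarized with different thresholds.
adaptive_binary = cv.adaptiveThreshold(gray, 255, cv.ADAPTIVE_THRESH_MEAN_C,
                                       cv.THRESH_BINARY, 11, 2)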

2.3.4 The HSV model

The HSV model is an alternative to the RGB color model. Just like in the RGB model, there are three values: hue, saturation and value. The hue determines what is often referred to as the color, for example red, green or blue. The hue ranges from red at 0° all the way around and back to red at 360°. The S stands for saturation and tells us the intensity of the color, ranging from 0-100: it starts at black, white or any gray in between and stops at maximum intensity. The last value in the HSV model is called value. At minimum value, the color will always be black. Following the value from 0 all the way to 100, the brightness increases and the black disappears. The three values in the HSV model are depicted as a cylinder in Figure 2.4.
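Note that OpenCV, used later in this thesis, maps these ranges to 0-179 for hue and 0-255 for saturation and value. A small sketch converting a single color illustrates this:

import numpy as np
import cv2 as cv

# One pixel of pure red in BGR order (OpenCV's default channel order).
pixel = np.uint8([[[0, 0, 255]]])
hsv = cv.cvtColor(pixel, cv.COLOR_BGR2HSV)
print(hsv)  # [[[  0 255 255]]] -> hue 0 (red), full saturation, full value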

Figure 2.4: The HSV cylinder showing the connection of the values: Hue, Saturation and Value (image: CC BY-SA 3.0 [10]).

2.3.5 Image Classification and Object Detection

Image classification and object detection are two closely related subjects in the field of computer vision, but it is also important to distinguish between them. An image classifier is a program, often created with machine learning and a neural network, which can take an image as input, recognize it and define the class of said image. This class is what the picture depicts, e.g., a bus or a cat. [4] An object detector, on the other hand, will not classify what the image depicts but rather recognize the objects that are in the image and label them. [14] A comparison between the two can be seen in Figure 2.5.

Figure 2.5: How an Image Classifier and an Object Detector would see a cup.

2.3.6 Optical Character Recognition

Optical character recognition (OCR) is used to convert text on physical paper or in printed form into editable text on cell phones and computers. OCR belongs to the field of computer vision and comes down to a machine transforming visually received text into a digital format—character codes—such as ASCII. [19] These digital characters can later be used in data processing, such as algorithms applied on big (or small) data. OCR software often requires the image being scanned to be fairly clean and to have highlighted text, and there are several methods to achieve this. [30] The initial technique is called preprocessing. [17] This is where the image gets processed and manipulated by various algorithms, including deskewing (if the image is oblique), binarization, line removal, normalization and more.

2.4 Tools

There are a lot of possible approaches when deciding how to start a ML project. Below are explanations of different tools and techniques used in this thesis.

2.4.1 OpenCV

OpenCV (Open Source Computer Vision Library) is a software library with focus on computer vision. The library uses machine learning to provide vision for commercial products, such as home computers and cellphones. OpenCV provides over 2500 algorithms, ranging from classical computer vision algorithms all the way to machine learning based ones.

2.4.2 Tesseract-OCR

Tesseract-OCR is a tool for converting printed text into editable text. [23] It is an OCR engine where the user can input flags, values and configurations, together with pre-trained models for different languages, to make the OCR software read text in different ways. There are many ways in which an OCR program can analyze and scan an image containing text. While one configuration can be advantageous on an image with text forming words and sentences, another could be better suited for reading large characters in a random fashion. This configurability is what differentiates Tesseract-OCR from ordinary OCR software.

2.4.3 Pytesseract

Pytesseract is a wrapper for Tesseract-OCR in Python. This wrapper enables invocation of Tesseract-OCR from Python and, in turn, because Python can read a wider variety of image formats, enables Tesseract-OCR to read the same formats. [24] Because the OCR engine is invoked from Python it is not limited to writing its guess to an output file, but can instead return the text to Python, and thus be useful in a script.
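A minimal sketch of such an invocation (the file name and the configuration string are only examples, not the configuration used in Listing 4.16):

import cv2 as cv
import pytesseract

# Read an image with OpenCV and let Tesseract return the recognized text as a string.
image = cv.imread("plate.png")
text = pytesseract.image_to_string(image, config="--psm 7")  # --psm 7: treat the image as a single text line
print(text)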

2.4.4 You Only Look Once

You Only Look Once (YOLO) is a technique for object detection developed at the University of Washington. [27] Its objective is to provide fast and reliable real-time object detection. At the time of their publication, YOLO provided accuracy and speed almost twice as high as the next best real-time object detectors available. The reason why YOLO is much faster than other forms of object detection is its fundamentally different way of processing images to look for objects. Regular detection systems use a sliding window approach together with an image classifier. When an image is scanned for objects, a box smaller than the image slides across the picture and an image classifier is run on each frame of the window to see if the smaller image resembles an object. This information is used to draw boxes on objects after the sliding window has gone through the entire image. This has a great computational cost since every frame tested has to go through the image classifier. [14] Instead of using the sliding window approach, YOLO only processes an image once (hence the name). It is implemented as a CNN consisting of a varying number of layers depending on the version, where versions using fewer layers are faster but less accurate. [27] The original implementation was trained using the Darknet framework. Darknet is a neural network framework developed in C by Joseph Redmon, one of the people who developed YOLO. [26] The implementation has since been translated to TensorFlow—another machine learning framework—using Python, with the new implementation being called Darkflow. [32]

2.4.5 Anaconda

Anaconda is a program made for Python development. [1] It is specialized to make data handling and ML with Python easier, but it is also a powerful tool for managing virtual environments. These environments are sandboxed Python workspaces where you can install packages, different versions of dependencies and Python, as well as different tools, without them interfering with what you have installed in other environments. It also has an associated terminal which works as an extension of the regular Windows command prompt. This allows the user to run regular Windows commands as well as Anaconda-specific commands in the same terminal.

2.5 OpenALPR

OpenALPR is an open source project made to detect and read license plates in real-time. [2] It is very similar to what our project description specifies so we decided to test an Android application that OpenALPR had provided to see if we could use it in our project. We tested it on real cars (as opposed to images on a computer) and even when the pictures had a very clear view of the license plate, the results were underwhelming. After the tests we decided that the accuracy of the application was insufficient for our goal as it would fail to read the license plate too often.

2.6 Summary

This chapter briefly explains the field of machine learning and neural networks as well as the subcategory convolutional neural networks. It also presents computer vision as a subject and some of its different variations. Different tools commonly used when combining the two areas are also explained, such as TensorFlow, YOLO and Pytesseract.

3 Project Design

This chapter describes how our system should work. It is split into three parts: an application for a smartphone, an OCR service and a database. The application should be an Android application that is able to detect license plates, capture an image of said plates and send them to the second part, the OCR service. Our OCR service should be a standalone service that is able to read the text of the license plates it receives and send the text to the last part of our system, the database. The database should then use the text received to fetch information about the car and send it back to the phone which requested the information in the first place. A model of our planned system can be seen in Figure 3.1, and in the following sections we explain each step in more detail.

Figure 3.1: The planned flow of our system.

3.1 Android application

The pipeline of our system will begin with an application on a smartphone. This application should open a camera with an embedded object detector that can detect license plates in real-time. This object detector is going to have two purposes. The main purpose is to localize the license plate for further usage in our pipeline. Its other purpose is to give the user an idea of how good the picture is going to be. This means that the bounding boxes should be displayed in the camera view and not just be used in the backend. The reason for this is that a good picture will make it easier for the OCR service to read it. The object detector used in our application should be a CNN trained by us, using data provided by our employer and the YOLO technique. Since it is going to be used on a phone and not a computer, our computational resources will be far more limited, and as such YOLO is required to give us the best possible performance. A frame rate of at least 5-10 frames per second is what we believe is necessary for the application to feel usable. When a user is certain the picture will be good, the user should take a photo. This starts the cropping process. The location of the license plate and the image should be sent to a backend service in the same application that crops the picture and in turn sends it to a server hosting the next section of our pipeline, the OCR service.

3.2 OCR Service

When the cropped image is received at the service, the service expects an image where the larger portion of it is covered by the actual license plate. The first step should prepare the image to later be read, and includes refining the image such that it presents the most central information, in this case the characters. To obtain this, the program will use filters to make the plate visible, so that it can later locate the edges of the actual plate. The program should then filter out unnecessary parts of the image by blurring it to get rid of noise and scratches. Moreover, to get the image cleaner, it is converted into grayscale and binarized, as well as potentially skewed. The service will skew the license plate so that it fits the image, thus removing disturbing content outside the plate. When the image has been preprocessed, it should be sent further down the pipeline. The OCR service should perform such that the text read from the cropped image is returned in about one second or less on average. The worst case should not exceed ten seconds. In the next step, the program reads the image with the OCR software. The image should be well prepared in the previous step to facilitate the work of the OCR software, helping it return the correct string of characters.

3.3 Database

When the database receives the text it should fetch information about the car the license plate belongs to and send it back to the phone. This information should then be used to automate the process of uploading new information about the car, such as images and damage reports. While the database is outside the scope of our project, it is included to give a complete picture of the intended flow through our system.

3.4 Summary

In this chapter we explain how we intend to build our project as well as how the final flow is supposed to look. We describe how the system is split into three parts: an object detector, an OCR service and a database, as well as how they are supposed to work together.

4 Project Implementation

In this chapter we will explain in detail how we implemented our system. It is split into two major parts. The first part will go into detail about how we created our object detector, set up the required tools and implemented it on an Android device. During the second part we will explain how we created our OCR service and all the image preprocessing required to make it as precise as possible.

4.1 Building an object detector

When creating our object detector, our two requirements were that it should work on an Android device and that it should be as fast as possible. We decided early that the YOLO technique was to be used because of its speed, but the original implementation in Darknet had no smartphone support. TensorFlow, on the other hand, had an open source project that implemented TensorFlow on Android. Luckily, this project had support for using YOLO. Considering this, Darkflow was the perfect choice as the framework provided a YOLO model that was also usable in TensorFlow.

4.1.1 Setting up an environment

To set up our development environment we used Anaconda, and all commands displayed in this section were written in the Anaconda terminal. Anaconda was used due to how easy it is to manage virtual environments using the platform. A total of 13 virtual environments were used during a trial and error phase of setting up the most basic object detector before we found the perfect setup. They were used to test different techniques and packages as well as to change things between iterations without ruining our older setups. Every new environment created with Anaconda needs its own Python installation, and our final environment, called tgpu2, was created as shown in Listing 4.1.

conda create -n tgpu2
conda activate tgpu2
conda install python=3.6

Listing 4.1: Create and prepare an Anaconda virtual environment called tgpu2.

Before we could start working we had to install Darkflow. First, the repository was cloned from GitHub, see Listing 4.2.

git clone https://github.com/thtrieu/darkflow

Listing 4.2: Clone Darkflow repository.

Simply cloning the repository is not enough to use it; we also had to download all the required dependencies, as in Listing 4.3.

conda install tensorflow-gpu==1.12
conda install
conda install numpy
conda install cython

Listing 4.3: Download Darkflow dependencies with Anaconda.

Due to errors arising with the newest version of TensorFlow, we had to use the older version 1.12. With the dependencies installed, we could install Darkflow using the Python package manager pip, see Listing 4.4.

cd
pip install .
>>Successfully installed darkflow-1.0.0

Listing 4.4: Installing Darkflow using pip.

With that done, everything required to use Darkflow was installed and we could move on to the next step.

4.1.2 Training

Before training our CNN we had to prepare data to train with. As planned, we received large amounts of data from our employer. The final dataset consisted of about 3500 pictures of vehicles where the license plate was visible to varying degrees. Since Darkflow uses supervised learning for its CNN, all pictures had to be labeled. In object detection this means that for every picture in the dataset, a box has to be manually drawn over the object you wish to detect and then labeled with what it represents, as shown in Figure 4.1.

Figure 4.1: Annotating a picture of a license plate.

The final task before starting a training session is to download a configuration file and pre-trained weights. The configuration file is a text file specifying exactly how the CNN will be built. It determines how many convolutional layers will exist as well as the tuning of each layer. The configuration file also has to be modified to fit the number of object classes the object detector should be able to detect, in our case one. Pre-trained weights are an already trained CNN which we build upon. This is common practice when training a CNN because pre-trained weights are usually trained on huge datasets where the CNN is allowed to learn low level features like edges and corners. [3] The model is then further trained to recognize specific objects, in our case license plates. During our training we always used the same pre-trained weights, the Tiny YOLO weights found at the official page for Darknet.2 To begin training we opened a terminal in our Darkflow folder and wrote the command seen in Listing 4.5.

python flow --model cfg/tiny-yolo-voc-graph.cfg \
    --load bin/tiny-yolo-voc.weights \
    --train --trainer adam \
    --dataset datasets/images \
    --annotations datasets/annotations \
    --gpu 0.75

Listing 4.5: Initiating a Darkflow training session.

The model option specifies which configuration file to use; tiny-yolo-voc-graph.cfg was our modified and final configuration file. Load determines which, if any, already trained CNN to use. Since this was our first run for a new CNN we used the pre-trained weights. In subsequent runs for the same CNN, the load option was used to load the CNN we had trained. Train tells Darkflow to start training, while trainer adam specifies that the Adam optimizer should be used during training. Dataset and annotations are the file paths to our datasets, allowing Darkflow to find the images and annotations we created before. GPU 0.75 makes Darkflow use the GPU instead of the CPU and allows it to use 75% of the available VRAM. We wrote earlier that we had to find the perfect environment, and the same was true for our model. A total of 14 CNNs were trained on data of varying size and quality, making our work very iterative. During our training phase we usually trained two models every week: one over the course of the week and one over the weekend. The CNNs were then tested on a separate dataset of images they had never seen before and the results were in turn compared to each other. Since this project had a time limit, the speed of training was a deciding factor for how good the final object detector would be. This made it very important to train on a GPU instead of the CPU, as a GPU is many times faster at training a NN. We did a training session using the CPU to compare the two, and the results showed that training on the GPU was more than eight times as fast, coming in at 2,100 training steps per hour compared to 250 steps per hour on the CPU. The GPU used for this task was an NVIDIA Quadro M2000M, which is on the slower side of available GPUs.3 To further speed up the training we wrote a Python script that resized all our images. The original images we received usually had both a width and a height in the range of 2000-3000 pixels, which was a lot bigger than necessary. Our script reduced the larger of the width and height to 500 pixels while keeping the same aspect ratio. This usually reduced image size by 10-20 times and, while no formal comparison was made, it sped up training.

2https://pjreddie.com/darknet/yolo/, [2019-04-15].
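A sketch of such a resize script (our own reconstruction under the stated assumptions; the directory names and interpolation choice are illustrative):

import os
import cv2 as cv

def resize_images(src_dir, dst_dir, max_side=500):
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        image = cv.imread(os.path.join(src_dir, name))
        if image is None:
            continue  # skip files that are not readable images
        h, w = image.shape[:2]
        scale = max_side / max(h, w)  # shrink the larger side to 500 pixels
        if scale < 1.0:
            image = cv.resize(image, (int(w * scale), int(h * scale)),
                              interpolation=cv.INTER_AREA)
        cv.imwrite(os.path.join(dst_dir, name), image)

resize_images("datasets/images_original", "datasets/images")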

4.1.3 Porting to an Android device

As planned, we used TensorFlow's open source project for Android4 as a means of porting our CNN. The solution worked well, but it was not a project made specifically for YOLO or object detection, but rather a multipurpose project for several different areas of neural networks. Our goal was to reduce it to the smallest possible module that did nothing else than show a camera preview and detect objects using our model. This meant that we had to spend a lot of time reading the code and learning how certain pieces were used, to be able to decide if it was safe to remove them. The object detection showed a camera preview and added labels whenever an object was found. These two pieces of functionality worked on different threads, meaning that the camera preview had a frame rate similar to that of any other camera preview. The object detector thread took a snapshot of the camera preview when it first started, ran the YOLO algorithm on it and, if an object was found, drew a box and a label on the camera preview. As soon as the algorithm was finished it took a new snapshot and repeated itself.

3https://www.videocardbenchmark.net/gpu.php?gpu=Quadro+M2000M&id=3373, [2019-04-15].
4https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/android, [2019-04-15].

When we had reduced the Android project to be as bare-boned as possible, we started expanding it for our own purposes. Our goal was to create a module that would be easy to integrate into our employer's own existing application. As such, we created a new Android activity called MainActivity with the purpose of simulating the process of adding the module to another application. There were no problems just starting the TensorFlow activity from our own, but support for ending the activity and moving data had to be added. Since the original TensorFlow activity only had a camera preview and the labeling of detected objects, the first thing we did after adding it as a module was to add a button which would end the activity and go back to MainActivity. The next step was to retrieve the detected image from the TensorFlow activity. This was done first by changing the starting method from startActivity() to startActivityForResult(), which makes it possible to add a return value to one activity and return it to the activity that started it. In the TensorFlow activity, the method called when pressing the button had to be edited to return the relevant data. To store the image of our license plate we added a variable called returnBitmap. At the end of every YOLO iteration there is a check testing whether the YOLO algorithm found a license plate and whether the algorithm is more than 50% confident that it actually is a license plate. Any time this is true, returnBitmap is set to a cropped-out version of that license plate. When the user presses the camera button, returnBitmap is converted into an array of bytes and returned to MainActivity where it is restored as a bitmap. Code was also added which changed the color of the drawn box on the camera preview depending on how confident the algorithm was that the detected object was a license plate. Whenever the confidence was above 90% the box turned green, otherwise it was red. This was to give the end user an idea of when a picture is good enough. A demonstration of the app using an Android emulator can be seen in Figure 4.2.

Figure 4.2: A screenshot of the final application where the object detector is more than 90% confident that a license plate is found.

4.2 OCR service

The second main part of the implementation starts here and describes the preprocessing the cropped image goes through before the OCR software reads it. The implementation of the OCR service is divided into smaller parts consisting of one or more functions. Because the OCR service has to manipulate the given image, both in terms of skewing and color manipulation, a set of libraries is used. The program is written in Python 3.6, which had support for every library included, whereas Python versions 2.7 and 3.7 lacked support for the necessary ones.

4.2.1 The two colors of the plates

While white is the most common inner color of the plates, there are exceptions: the inner color of the plate can also be green. The program therefore has to check whether the color of the plate is green or not. Knowing the color of the plate is crucial, which will be explained later in the process. In the function which identifies the color of the plate, we first apply a k-means clustering algorithm. The inputs to this algorithm are: our image; k=3; a stopping criterion that halts the algorithm either when the number of iterations reaches 200 or the accuracy reaches 0.1; and ten attempts, meaning the algorithm is executed ten times using different initial labellings with random initial centers in each attempt, see Listing 4.6. [21] In Figure 4.3, we can see that the algorithm has quantized the colors in the image to three, thus identifying the three dominant colors, which is the desired result. The colors resulting from k-means are converted to HSV to see if one of the three colors meets the criteria for being classified as green. If one does, the function in which the k-means algorithm operates returns True, otherwise False.

colors = 3
criteria = \
    (cv.TERM_CRITERIA_EPS + cv.TERM_CRITERIA_MAX_ITER, 200, .1)
flags = cv.KMEANS_RANDOM_CENTERS
_, labels, palette = \
    cv.kmeans(pixels, colors, None, criteria, 10, flags)

Listing 4.6: OpenCV k-means on image.
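The green check itself could then be sketched as follows (the exact hue and saturation limits for "green" are placeholders of our own; the thesis does not specify them here):

def plate_is_green(palette):
    # palette holds the k=3 dominant colors (BGR) returned by cv.kmeans above.
    for color in palette:
        pixel = np.uint8([[color]])
        h, s, v = cv.cvtColor(pixel, cv.COLOR_BGR2HSV)[0][0]
        if 40 <= h <= 90 and s > 80:  # placeholder limits for classifying a color as green
            return True
    return False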

Figure 4.3: The result of k-means color quantization where k=3.

4.2.2 Prepare the image

To highlight the plate, we use the hue, saturation and value (HSV) aspects of the image. Because the typical license plate is a white quadrilateral (polygon with four sides) with black characters on it, there is not much deviation from plate to plate. Setting up a condition where the actual white quadrilateral is in focus is therefore advantageous. From here on, in the code segments, OpenCV is imported as cv. With cv.imread(), a three-channeled color image is returned. This image is represented as an array containing red, green and blue (RGB) values for each individual pixel in that image. This image is blurred slightly to reduce noise and then converted from RGB to HSV, see rows two and three in Listing 4.7.

1 image = cv.imread(image, 1)
2 blurred = cv.GaussianBlur(image, (5,5), 0)
3 hsv = cv.cvtColor(blurred, cv.COLOR_BGR2HSV)

Listing 4.7: Reading the image with OpenCV.

The three values of a color are represented in the arrays on rows one and two in Listing 4.8: the first value is hue, the second is saturation and the third is value. Their ranges are 0-179, 0-255 and 0-255 respectively. Now, as mentioned earlier, the plate is quite uniform in terms of the colors seen in the actual quadrilateral containing the characters. We can utilize this by setting up the values in the arrays that define which pixels the program will let through.

1 lower_limit = np.array([0,0,200])
2 upper_limit = np.array([179,255,255])

Listing 4.8: Creating HSV ranges. That is the upper and lower limit.

Here, we have set the hue value to range all the way from 0-179, which means that the program will allow any hue. This is necessary because the image could have been captured with a tilt towards warmer or colder tones, or other colors for that matter. The saturation is set next. For now, we will accept values between 0, that is completely gray, and 40, a small amount of saturation. This is because the quadrilateral is black and white and does not contain colors, but could still reflect them, or have them introduced by the camera and the processing of the image. The third range to be set is the value. Here we have to take into consideration the overall brightness and contrast of the image. The image could in worse cases be very bright and washed out overall, or, equally bad, very dark. We will assume that the actual plate is brighter than its surroundings and therefore start off with a high value rather than a low one and let the program work its way down. This is so that we find the plate in the image before anything else. After the ranges are set, a mask with the lower and upper limits is created, see Listing 4.9. This mask contains a binarized image where the pixels with values in range are given the value 1, and therefore are white, whereas the pixels with values outside the range are represented as black. Now, because the images will not look the same—not in any of the three HSV values the program has to work with—the same ranges will not work for each and every image. This is where we implement dynamic values, so that the program will be able to catch values from different images. In Figure 4.4, we can see a mask with ranges set only to obtain an acceptable outcome on that specific image and therefore not reproducible in all cases.

mask = cv.inRange(hsv, lower_limit, upper_limit)

Listing 4.9: Creating the mask with the input of the image, the lower values in HSV, as well as the upper ones.

Figure 4.4: The desired HSV mask applied. The quadrilateral is highlighted and the characters are easy to identify.

In Listing 4.10, we see that we use the variable gn to regulate the ranges in every iteration. The first range to be regulated is the value. The value range starts at 254-255, the minimal span. This is not very likely to let anything through, much less the plate. As mentioned earlier, we assume that the plate is brighter than its surroundings, so the program works its way down the brightness scale, as shown in Figure 4.5.

lower_limit = np.array([0,0,254-gn])
upper_limit = np.array([179,1+gn,255])
mask = cv.inRange(hsv, lower_limit, upper_limit)

Listing 4.10: Creating dynamic HSV ranges. The upper and lower limits will progressively move apart.

Figure 4.5: HSV masks with six different ranges (values in upper left corner). Left three lower values and right three higher values.

The program successfully makes the plate progressively more visible as the ranges get bigger, thus accepting darker and darker pixels (rightmost value in each array) as well as more and more saturated ones (middle value), see Figure 4.6.

Figure 4.6: The final accepted mask, acquired when the saturation value ranges from 0-60 (second value in each array) and the value (third value) is ranging from 195-255.
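The widening itself is driven by a loop similar to the following sketch (it assumes the hsv image and imports from Listing 4.7; the step size and the mask_contains_plate() helper are our own placeholders, with the real acceptance test being the contour check described in the next section):

# Sketch: progressively widen the saturation and value ranges until a usable mask is found.
gn = 0
while gn <= 254:
    lower_limit = np.array([0, 0, 254 - gn])
    upper_limit = np.array([179, 1 + gn, 255])
    mask = cv.inRange(hsv, lower_limit, upper_limit)
    if mask_contains_plate(mask):  # hypothetical check, e.g. the quadrilateral test in Section 4.2.3
        break
    gn += 5  # widen both ranges a little for the next attempt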

4.2.3 Identify the contour

To find the points of each contour in the masked image we use OpenCV's findContours(), see Listing 4.11. Because the shapes of the contours are not always quadrilaterals (the best candidate for a plate)—but rather all contours found in the image—conditions for how the desired shape should look are implemented. Because the source image is already cropped, so that the outermost parts of the actual plate should not be very distant from the edge of the image, we can expect the plate to cover at least one fifth of the image, thus eliminating the contours with areas smaller than that size.

contours, _ = cv.findContours(mask, cv.RETR_TREE, cv.CHAIN_APPROX_NONE)

Listing 4.11: Getting coordinates of all points that will make up the contours of the mask.

The next condition looks at the contour to determine if its shape is quadrilateral-like. This is done by the rows in Listing 4.12. We use arcLength() to get the closed contour perimeter, that is, the length of the contour. [18] This is passed to approxPolyDP(), which in turn returns an approximated curve with a given precision. [22] We expect approxPolyDP() to return four points; this is when the contour's shape is probably close to the shape of a quadrilateral.

There are essentially three conditions the shape has to fulfill to get accepted by the program. The first one was brought up just recently, that is the shape approximation. The second one was mentioned briefly earlier and requires the contour to cover a certain area of the image as a whole, that is one fifth. The third criterion the contour has to meet is not being a too disproportionate quadrilateral: the distances between the corners cannot deviate too much. Too much—in terms of the desired quadrilateral—means that the distance between two corners (a side) cannot be three times longer or shorter than the side between the two corners on the opposite side. This means that the northern edge of the shape cannot be three times longer or shorter than the southern edge, and the same applies to the left and the right edge. How the corners are located is described later on. If these three conditions are passed, the shape, now most certainly a quadrilateral, is approved. The final contour is drawn on the source image in Figure 4.7.

epsilon = cv.arcLength(cnt, True)
approx = cv.approxPolyDP(cnt, 0.02 * epsilon, True)

Listing 4.12: Confirms quadrilateral with arcLength() and approxPolyDP().
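Put together, the selection could look roughly like the following sketch (the area fraction and the 3:1 side-ratio limit are taken from the description above; the helper names are our own, with the corner logic coming from the next section):

def find_plate_contour(mask):
    """Return the first contour that passes the three conditions, or None."""
    contours, _ = cv.findContours(mask, cv.RETR_TREE, cv.CHAIN_APPROX_NONE)
    image_area = mask.shape[0] * mask.shape[1]
    for cnt in contours:
        # Condition 2: the contour must cover at least one fifth of the image.
        if cv.contourArea(cnt) < image_area / 5:
            continue
        # Condition 1: the approximated curve must have exactly four corners.
        epsilon = cv.arcLength(cnt, True)
        approx = cv.approxPolyDP(cnt, 0.02 * epsilon, True)
        if len(approx) != 4:
            continue
        # Condition 3: opposite sides must not differ by more than a factor of three.
        if not sides_are_proportionate(getCorners(cnt)):  # hypothetical helper built on Listing 4.13
            continue
        return cnt
    return None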

Figure 4.7: Desired contour drawn on the source image.

4.2.4 Identify the corners

The next step in the OCR service is locating the corners of the shape formed by the contour. The function getCorners() receives an array of coordinates for the points that together make up the contour. An array containing four initial values is created to have something to compare the first coordinates against, see Listing 4.13. Following that, an array with four sub-arrays containing two zeroes each is created. Thereafter, the number of indices is counted, giving the number of points in the contour. The function iterates over all of the indices, looking for the corners seen in Figure 4.8. First we check if the sum of x and y is smaller than the temporary value in tempext[0]. The initial value 9999 acts as a maximum and will always be larger than the first sum it is compared to; the same applies to the initial values of 0, but vice versa. Therefore, the coordinates of every corner get updated at the first comparison. Every time the function finds a new x and y whose sum is smaller than the previous smallest sum, it replaces the old value with the new one. This is applied for every corner, but with a small change in their respective conditions. In every iteration of the function, every corner condition is checked, narrowing the corners down to their extremes. Finally the function returns the diagonal extremes, that is, the corners, see Figure 4.8.

def getCorners(cnt):
    tempext = [9999, 0, 9999, 0]
    extreme = [[0, 0], [0, 0], [0, 0], [0, 0]]
    indecies = len(cnt)

    for i in range(indecies):
        x, y = cnt[i][0]  # unpack the current contour point (this assignment is implicit in the original listing)

        if x + y < tempext[0]:    # north-west corner: smallest x + y
            tempext[0] = x + y
            extreme[0] = x, y

        if x - y > tempext[1]:    # north-east corner: largest x - y
            tempext[1] = x - y
            extreme[1] = x, y

        if x - y < tempext[2]:    # south-west corner: smallest x - y
            tempext[2] = x - y
            extreme[2] = x, y

        if x + y > tempext[3]:    # south-east corner: largest x + y
            tempext[3] = x + y
            extreme[3] = x, y

    return extreme

Listing 4.13: Locating the corners, iterating through every point in the contour.

Figure 4.8: The corners of the contour drawn on the source image.

4.2.5 Skew the plate

The image has been analyzed, the plate has been identified and the corners are set, but nothing has actually happened to the image yet. With the corners identified, the program now knows which four coordinates to grab when skewing the plate, so that is what we do next. The first parameter to the function skew() is the source image, which we will skew; the second parameter is the coordinates of the corners. The four variables—nw, ne, sw, se—receive the x and y coordinates of the four corners respectively. The variable w is given the value returned by the function averageXDistance(), see Listing 4.14. This value is the average of the distance between the two upper corners and the distance between the two lower corners, that is, the lengths of the northern and southern edges, see edges A and C in Figure 4.9.


Figure 4.9: A quadrilateral with the edges A, B, C and D and the corners E, F, G and H.

The standard license plates from Sweden and Norway have the same size. The license plates from Finland and Denmark have virtually the same ratio between height and length as the ones from Sweden and Norway, making the same skewing method applicable to them too. The dimensions are 520 × 110 mm. This gives us a ratio of 4.73, so the height is therefore w/4.73. The coordinates of the corners of our shape are put into pts1, see Listing 4.14. These are the points we want to grab when skewing the image. The variable pts2 will later contain the coordinates of where we want each corner to be stretched to. The function gets the perspective transformation from getPerspectiveTransform(). [20] The image is skewed and the actual white region of the plate is placed on an image where its corners align perfectly with the target positions of the corners of the white part of the plate, see Figure 4.10.

import cv2 as cv
import numpy as np

def skew(src, extremes):
    nw, ne, sw, se = extremes
    w = averageXDistance(extremes)
    h = w / 4.73                        # plate ratio 520/110

    w, h = int(w), int(h)               # warpPerspective expects an integer output size
    pts1 = np.float32([nw, ne, sw, se])
    pts2 = np.float32([[0, 0], [w, 0], [0, h], [w, h]])
    matrix = cv.getPerspectiveTransform(pts1, pts2)
    result = cv.warpPerspective(src, matrix, (w, h))

    return result

Listing 4.14: The function that utilizes OpenCV’s functions getPerspectiveTransform() and warpPerspective() to skew the image.
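The helper averageXDistance() called in Listing 4.14 is not shown here. The following is a minimal sketch of what such a helper might look like, assuming it simply averages the horizontal distances between the two upper corners and between the two lower corners (edges A and B in Figure 4.9); the exact implementation may differ.

def averageXDistance(extremes):
    # Illustrative sketch: average of the horizontal extents of the top and
    # bottom edges of the quadrilateral (edges A and B in Figure 4.9).
    nw, ne, sw, se = extremes
    top = ne[0] - nw[0]       # length of the northern edge along x
    bottom = se[0] - sw[0]    # length of the southern edge along x
    return (top + bottom) / 2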

Figure 4.10: The result of grabbing the corners of the source image and skewing them to the desired corners.

4.2.6 Refining the image

The next step, before reading the image, is highlighting the characters on the skewed plate. This means removing noise and unnecessary content from the image. To make reading easier for the OCR software, the program binarizes the image. Because there are no colors left to work with (except on green plates), the binarization is done without involving the HSV colors. We want to produce the best possible image for the OCR software to read. Because the lighting in the images can differ a lot, we have to be careful when making the image binary. We do not want to lose necessary information, that is, the characters, by making the image too white, and we do not want to introduce superfluous information such as noise and redundant pixels, which would not bode well for the OCR software. Therefore we use local adaptive thresholding, which copes even when the image contains very dark or very bright areas. The binarization using local adaptive thresholding is surprisingly good at producing a binary image that the OCR software can read (guess correctly) on the first try. Unfortunately, this is not always the case. If the OCR software does not read and guess correctly the first time, the function takes a threshold value as a parameter in the next iteration, the variable srcret, see Listing 4.15. This threshold value was calculated in the previous iteration as the optimal one for this image. Together with this threshold value, the function will

also take the N:th iteration as the parameter attempt. The attempt variable contains the number of times the program has gone through this function so far. This value is raised to the power ret_mult, which is 2.5. The reason is that if the OCR software does not match the image the first time, the threshold is usually not far off; in rare cases, however, the value is very far off, and that is when raising the attempt to a power is really handy, since it makes the threshold reach a far-off value quite fast. These two values, attempt and srcret, are our inputs to the manual threshold in all but the first iteration. These values are nothing we change ourselves; the program works them out for us. There is, however, one value that we do change: ret_mult. The ret_mult variable contains the value with which we combine the two earlier mentioned variables, see the computation of new_ret in Listing 4.15. How this variable is chosen completely changes how the program performs, which is discussed in Section 5.3.2. Every other iteration the offset is made negative, because the threshold the function is given is not guaranteed to lie above or below the threshold first thought to be optimal for the image. The iterations stop either when the OCR software guesses correctly or when the binary image becomes all white (all pixels 1). The image becomes all white when the threshold reaches zero. So if the function was given a threshold of 150, and we were multiplying the attempt by ten, we would have to reiterate about 30 times before the break point kicks in, because every other iteration reuses the same offset as the iteration before but with a positive sign.

import cv2 as cv

def preProcess(img, attempt, srcret):
    ret_mult = 2.5
    if attempt == 0:
        # First attempt: local adaptive threshold, plus Otsu's global
        # threshold returned as the base value for later attempts.
        th = cv.adaptiveThreshold(img, 255, cv.ADAPTIVE_THRESH_MEAN_C,
                                  cv.THRESH_BINARY, 59, 12)
        ret, _ = cv.threshold(img, 0, 255, cv.THRESH_BINARY + cv.THRESH_OTSU)
        return th, ret

    # Later attempts: offset the base threshold, alternating sign and
    # growing with the attempt number raised to ret_mult.
    attempt = attempt + 1
    if attempt % 2 == 0:
        new_ret = int(attempt / 2) ** ret_mult
    else:
        new_ret = -(int(attempt / 2) ** ret_mult)
    ret, th = cv.threshold(img, srcret + new_ret, 255, cv.THRESH_BINARY)
    return th, ret

Listing 4.15: The function where the binarized image is returned together with the calculated optimal threshold.
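The loop that surrounds preProcess() and the OCR software is not shown in the listings. A minimal sketch of how such a loop could be arranged is given below, assuming a hypothetical helper isMatch() that checks the guess against the expected license number format and the database; the base threshold from the first, Otsu-based call is kept and passed back in on every later attempt.

import numpy as np
import pytesseract as pt

def readPlate(gray, max_attempts=30):
    # Sketch of the iteration described above: adaptive threshold first,
    # then manual thresholds offset from the Otsu base value until the
    # OCR software matches or the binary image turns all white.
    binary, base_ret = preProcess(gray, 0, 0)
    for attempt in range(max_attempts):
        if attempt > 0:
            binary, _ = preProcess(gray, attempt, base_ret)
        guess = pt.image_to_string(binary, config='--psm 7 -c load_system_dawg=false')
        if isMatch(guess):              # hypothetical pattern/database check
            return guess
        if np.all(binary == 255):       # break point: image has gone all white
            break
    return None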

4.2.7 Read the image

When the cropped image has been preprocessed it is ready to be read by Pytesseract. Two settings in Pytesseract were configured. The first tells Pytesseract how it should treat the text (e.g., read one character or a full paragraph). The second tells it that the letters it will be reading do not form any kind of word, thus excluding help from dictionaries. The two settings were set by sending a configuration string to Pytesseract containing the flags psm and load_system_dawg. Setting psm to 7 made Pytesseract treat the text as a single line of text, and setting load_system_dawg to false made Pytesseract exclude any dictionary. This was done because we do not want it to try to predict any form of words, but just read an arbitrary set of characters. Finally, we read the image by passing it together with the configuration parameter into Pytesseract; the text that Pytesseract guesses is returned and put in guess. The implementation can be seen in Listing 4.16.

import pytesseract as pt

# --psm 7 treats the input as a single line of text;
# -c load_system_dawg=false disables Tesseract's built-in word dictionary.
guess = pt.image_to_string(img, config='--psm 7 -c load_system_dawg=false')

Listing 4.16: The implementation of Pytesseract, configured to read the license plate.

4.3 Summary

In this chapter we explained how we developed our system. We went into detail on how we installed and used different tools to create a suitable object detector and how that model was later used in an Android application. In the sections that followed we explained how the OCR service was set up and how it had to preprocess images to become as accurate as possible.

5 Evaluation

In Chapter 4 we developed a system consisting of two major parts: an object detector and an OCR service. The object detector was ported to an Android application created by TensorFlow, which was in turn modified to suit our needs. Our OCR service was developed as a standalone service that receives images taken with our application and reads them. This chapter evaluates what we have created based on our application's performance on different Android devices, the accuracy of our object detector, and both the accuracy and performance of our OCR service. We also measure the impact of our image preprocessing.

5.1 Android performance

Performance was a big concern for us from the beginning and one of the key points during the development of our object detector. Using real-time object detection on a phone is not a new concept, but it is demanding even on the hardware of modern smartphones. On a Samsung S10 the phone was able to complete the process of taking a snapshot, performing the YOLO algorithm and drawing a box if any license plate was found in about 0.5 seconds. This results in an update frequency of two times per second. While discussing the project before implementation we agreed that at least 5-10 updates per second would be necessary for the application to feel smooth enough. This was proven wrong when we got a chance to test it live. Two updates per second is not a lot, but the camera preview running at full speed made the application feel surprisingly smooth. Part of the reason is that the application is supposed to be used on cars standing still in a parking lot. Holding the camera still means that the box drawn on the screen does not lag behind as it would when rotating, since the box is drawn at specific coordinates on the screen. When testing it we pointed the camera at a plate, waited a moment for the object detection to calibrate and pressed the camera button. Getting a green box (90% confidence) was very

easy and usually happened as soon as the camera was pointing directly at the rear or front of a car. The performance cost of then taking a photo and sending it was almost negligible. When a box is shown on the camera preview, a cropped version of the license plate is already stored. This means that pressing the camera button immediately ends the activity while the image is sent to our OCR service on another thread without further user interaction. Since the Samsung S10 is a high-end smartphone we deemed it necessary to test the application on weaker devices as well. We were able to get two more smartphones to test on, a Samsung S8 and a Samsung S5 Mini. On the Samsung S8 performance was only slightly worse than on the Samsung S10, with about 0.6 seconds spent on each iteration of the YOLO algorithm, which still felt acceptable. The Samsung S5 Mini, on the other hand, needed five seconds for each iteration. While functional, we consider the application too demanding for such a low-end device. For a more detailed comparison of the devices, see Table 5.1. The table shows that running TensorFlow on an Android device is all about the CPU; extra memory, at least above 4 GB, does not seem to make any difference.

Table 5.1: Comparing the running time of our app to the specifications of our devices.

Model            CPU                                  Memory size   Running time
Samsung S10      Octa-core Exynos 9820, 1.9-2.7 GHz   8 GB          0.5 seconds
Samsung S8       Octa-core Exynos 8895, 2.3 GHz       4 GB          0.6 seconds
Samsung S5 Mini  Quad-core 1.4 GHz Cortex-A7          1.5 GB        5 seconds

5.2 Evaluation of object detection

To evaluate our final object detector we took the testing images discussed in Section 4.1.1 and used the object detector on them. These pictures were not taken with OCR in mind and it is important to note that they are only used to evaluate the quality of the object detector. We wrote a Python script that would use the YOLO algorithm with our model

on each image in our test set and count every time it found a license plate with at least 50% confidence, which we consider the minimum for a successful detection. Out of a total of 147 images of varying quality and angle where a license plate was visible, 138 plates were found. This gives an accuracy of about 94% on the type of images the detector was trained on, which is a very acceptable number. To gather images for later testing of our OCR service we used the same script, but on our original dataset of 3500 images and with the confidence criterion changed to 90%. The script was also edited so that any detected license plate fulfilling the criterion was cropped and stored. This was done to gather images similar to those taken with the application while the box is green. 1026 images were found and are used in the next section to evaluate our OCR service. Do note, however, that these images do not prove the object detector's quality, as they are the same images it was trained on, resulting in a bias towards them.
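The evaluation script itself is not listed in this report. A minimal sketch of how such a script could look using Darkflow's return_predict() API is shown below; the config and checkpoint options, the confidence threshold and the image directory are illustrative assumptions, not the actual values used.

import os
import cv2
from darkflow.net.build import TFNet

# Illustrative options; the real cfg path and checkpoint are project-specific.
options = {"model": "cfg/yolo-plate.cfg", "load": -1, "threshold": 0.5}
tfnet = TFNet(options)

found, total = 0, 0
for name in os.listdir("test_images"):
    img = cv2.imread(os.path.join("test_images", name))
    if img is None:
        continue
    total += 1
    # return_predict() gives a list of detections with label, confidence
    # and bounding box corners.
    detections = tfnet.return_predict(img)
    if any(d["confidence"] >= 0.5 for d in detections):
        found += 1

print("Detected plates in {}/{} images ({:.1%})".format(found, total, found / total))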

5.3 OCR service

We will evaluate the results of the OCR service by measuring its reliability, weighing speed and accuracy against one another. We will discuss how to balance the program to accurately read images, and compare the result of having the image preprocessed before being read by the OCR software against having no preprocessing at all and letting the OCR software operate on its own.

5.3.1 OCR performance

When the images going into the OCR service are static and cannot immediately be retaken, reliability is crucial. Making the OCR service reliable comes with a cost: performance. We could work on an image all day and eventually be able to read it, but that is not something we want to do, and it would certainly not be convenient for a service responsible for returning information to an app that should answer within seconds. The preprocessing of the image therefore has to be done in a balanced manner. Luckily, most of the cropped license plate images retain a width of at least 150 pixels and are often read fairly quickly and easily. Reading the image is what takes the heaviest load when a larger image passes through the pipeline; the actual OCR software analyzing the image slows down at a faster rate than any other component of the OCR service as the resolution grows. We will therefore measure the performance of the OCR service with the number of reading attempts by the OCR software in mind, and in doing so separate the two main components of the OCR service: the preprocessing of the image and the reader, the OCR software.

The smallest image that is still readable by the OCR service is around 150 pixels wide, making it between 5-10 kilobytes in size. Compare this to the largest potential image, which is about 1000 pixels wide and roughly 20 times larger on disk, coming in at about 100-200 kilobytes. Because the speed varies depending on the hardware the service runs on, the absolute time it takes to pipe an image through the service is relative; the focus should therefore be on comparing images of different sizes to one another rather than on the individual times.

To check whether the OCR service reads a license number correctly, we use a database containing, in editable text, all the plate numbers we will analyze and many more. The guess made by the OCR service is compared against the numbers in the database.

We start by measuring the speed of two generic images with sizes in the two ranges mentioned earlier (5-10 and 100-200 KiB respectively). This test is set up so that the OCR service can read the preprocessed image without the need for iteration, that is, the OCR service reads the image only once. Iteration occurs when the guess of the OCR service does not match the pattern criteria of a license number and the image has to go through some or all of the preprocessing steps again, see Figure 5.1.
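The exact pattern criteria and database lookup are not shown in this report. A minimal sketch of what such a check might look like, assuming common Nordic plate formats such as three letters followed by three digits (Sweden, Finland) or two letters followed by five digits (Norway, Denmark), and a set of known plate numbers loaded from the database:

import re

# Illustrative patterns only; the real criteria used by the service may differ.
PLATE_PATTERN = re.compile(r'^(?:[A-Z]{3}\d{3}|[A-Z]{2}\d{5})$')
KNOWN_PLATES = set()   # filled from the license plate database

def isMatch(guess):
    # Strip everything that cannot be part of a plate number, then require
    # both a plausible format and an exact hit in the database.
    cleaned = re.sub(r'[^A-Z0-9]', '', guess.upper())
    return bool(PLATE_PATTERN.match(cleaned)) and cleaned in KNOWN_PLATES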

Comparing the two images tells us that preprocessing and reading takes about four times as long for an image with an area about 40 times larger and a file size about 18 times larger than the smaller one. Fortunately this means that the reading time grows far more slowly than the image size. However, this only holds when comparing images that need no re-preprocessing, see Table 5.2.

Figure 5.1: A flowchart of the image being put into the OCR service pipeline, first being preprocessed and then read by the OCR software, later to be matched or not.

Table 5.2: Comparison of size and time between two generic images of minimum and maximum potential dimensions with no need for iteration.

Dimensions (pixels)   192x43   999x343
Size (KiB)            7.2      128.3
Time (ms)             192      864

There are two types of iteration the OCR service can perform if the guess is not correct the first time: repeating the skewing phase and repeating the refining phase. There is no guarantee that the program will succeed in finding an arbitrary quadrilateral in the first iteration, or in any iteration at all. If the program does not find one in the first iteration, the next iteration will loosen the conditions for how strict the criteria are for finding the quadrilateral. This makes the program repeat the skewing phase with a different configuration, see Figure 5.1. This happens only if the program was unable to detect an arbitrary quadrilateral in the previous iteration, and it is repeated at most two times. The repetitions stop at two because the configuration is rather sensitive and does not need much tweaking to become forgiving enough to find a new quadrilateral.

The next type of iteration involves refining the image; this refining phase always occurs after the skewing phase and assumes that the image is correctly skewed. In every refining iteration, the program changes the input to the binarization by a small value, making the binarization threshold go up in one iteration and down in the next, with the value increasing for every iteration. The OCR service then has to read (analyze and guess) the image again for a new match. The probability that iteration is needed is, however, not very high. Most commonly the program either gets it right the first time, having skewed and refined the image once, or does not guess right at all. Because the component of the OCR service that takes the longest time is the OCR software reading the image, this is preferably not something we want to do more than once. This required us to ask ourselves an important question: would the person using the app rather wait longer for an answer, that is, often get the right information back but occasionally have to wait for, say, 20 seconds only to get back a message saying that we could not find the car? Or would they accept slightly worse precision (accurately guessing the right car), where the program, when it cannot return a match, says so in about 3 seconds instead of 20?

5.3.2 Precision vs. time

After the OCR service was more or less complete, having used our time as best we could, there was one remaining component that on its own made quite a large impact on how the program performed: the binarization. Recall that refining the image includes binarizing it, getting rid of noise and highlighting the characters on the plate. This is the main step in the refining phase, and as we now know, every time the image comes out of the refining phase the OCR software reads it. Recall also that the OCR software has the largest impact in terms of time among all the components. The binarization is mainly responsible for finding the sweet spot where the OCR software can read the image. It assumes that the image is correctly skewed and only changes the lighting in the image, and every time this is done, the OCR software tries to read the image. This is where the important tweak comes in. Luckily for us, we have utilized adaptive thresholding: dark areas become brighter and bright areas become darker, hopefully making the characters in the image as distinct as possible. But even though this type of binarization is rather impressive, we cannot trust it completely on its own. If we are lucky, the binarization finds the perfect image for the OCR software on the first try. However, this is not always the case.

Our program stops the binarization either when it registers that the image has become all white or when the OCR software reads and matches the number before that happens. To make the OCR service as fast as possible, we want to reach one of these points as fast as possible. We of course want the OCR software to read the image correctly, but if that is not possible, we pass the useful range of thresholds and move towards the all-white image. There is no way for the binarization on its own to know whether the image has passed the threshold of "optimal reading" for the OCR software; for the program to know this, the OCR software has to read the binarized image and guess. This is where the dilemma lies. We could hope that the first binarization gets it right and simply cut the process off after that, not requiring the OCR software to read more than once, which would speed the program up by a lot. Or we could reiterate the refining phase whenever the OCR service is wrong, accepting the extra time the OCR software needs for the next attempt, the attempt after that, and so on.

We ran tests on a set of 342 images of varying quality. Here we evaluate how big an impact the tweaking of the binarization threshold actually has and how to regulate it to work as well as possible. The optimal behavior is to reach either the best binarized image for the OCR software to read as fast as possible or, if that is not possible, the all-white image as fast as possible, because these two cases are our break points. Perhaps we would even be better off without re-iterating the binarization at all. The number we tweak is the value that the binarization method uses as its threshold, through a variable that changes according to the attempt (iteration) we are in. Remember that the following only happens if the first adaptive binarization does not succeed and we have to start changing the threshold manually, see Figure 5.2. Example: we get back a calculated threshold value of 150 from the first iteration (see the threshold value in Figure 5.2), so the threshold is 150 in the next iteration. The equation is 150 + N × x, where N is the current attempt and x is the multiplier. If x = 2, every N:th iteration gives 150 + N × 2. Remember that for every positive offset there is a negative one that is also applied, so the offsets become -2, 2, -4, 4, -6, 6, ..., giving the thresholds 148, 152, 146, 154, 144, 156, ... until the image is all white or the OCR service guesses right.
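The alternating offsets in the example above can be generated with a few lines of Python. The sketch below reproduces the sequence 148, 152, 146, 154, ... for the multiplying case and shows how the squaring case grows instead; the sign order follows the example in the text, and the function names are illustrative only.

def thresholds(base, x, attempts, squaring=False):
    # Generate the alternating threshold sequence from the example above:
    # base -/+ N*x for N = 1, 2, 3, ... (or -/+ N**x in the squaring case).
    out = []
    for attempt in range(1, attempts + 1):
        n = (attempt + 1) // 2
        offset = n ** x if squaring else n * x
        out.append(base - offset if attempt % 2 == 1 else base + offset)
    return out

print(thresholds(150, 2, 6))           # [148, 152, 146, 154, 144, 156]
print(thresholds(150, 2.5, 6, True))   # offsets grow roughly as 1, 1, 5.7, 5.7, 15.6, 15.6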

Figure 5.2: A simplified flowchart of the binarization process where the image is put into the adaptive threshold, and if not matched by the OCR software, will go into the manual threshold together with a calculated threshold value.

We set the value of x to 2, 10, 40 and 100 in four tests. We are looking for a case where we get high accuracy but still do not have to wait too long for the slower images. The slowest image is the image in the set of 342 that takes the longest to go through the OCR service, see Table 5.3. This image is almost certainly one that the OCR software was not able to read; it had to go through all iterations without any luck, hence the long time. This is the image the person using the app would have to wait for in the worst case. We felt it was essential to keep this in mind, because we would not want to wait 18 seconds (see Table 5.3) for the OCR service to tell us that it could not find a car with the plate we sent. On the other hand we got quite good accuracy with x = 2, see Table 5.3. We deemed 18 seconds not worth waiting for, even though such cases are rare, so a multiplier of 2 was not something to stick with. Table 5.3 also shows that with x set to 100 we manage to keep the number of iterations, and therefore the average time per image, down, but this sacrifices a lot of the precision of the OCR service. The numbers get a little better when lowering x to 40, where we still keep the speed up while accuracy rises. With x set to ten, the numbers start to show good progress: the accuracy is very close to that of the multiplier set to two, and the average attempt count is quite low as well.

Table 5.3: A table showing the outcomes of the varying multipliers with the four most essential numbers for our evaluation.

Multiplier   Accuracy   Time/Image   Slowest Image   Average Attempt
x = 2        90.9%      1.90 sec     18 sec          3.035
x = 10       90.3%      1.00 sec     6.89 sec        1.462
x = 40       84.8%      0.85 sec     6.86 sec        1.091
x = 100      80.1%      0.88 sec     6.76 sec        1.059

Looking at the average number of attempts it takes to match an image in Table 5.3, we can see that with the multiplier set to ten the average attempt is rather close to one. This means that many images need only one, or close to one, iteration to get through, and very few go through almost the maximum number of iterations before the OCR service can read them. We know this because the probability that the OCR service reads the image correctly on the first attempt is high, and when it does not, the binary image is often far off. Because of this, we changed the operation in the formula for our threshold: instead of multiplying the attempt by a given number, we raised it to a power, which we refer to as the squaring method. This means that the threshold moves slowly at first, and the further we get from the source threshold, the faster it gains momentum. In effect, the OCR service is still able to match the first image just as before, but it now also matches the binary images where the threshold is not far off, and the images that require a large threshold deviation can potentially be matched faster as well. In Table 5.4 we can see x = 4 as our best outcome in terms of time-per-image-to-accuracy ratio. Still, we chose to go with x = 2.5, because we value accuracy slightly higher: for reliability, accuracy is the main attribute, and when the difference in time per image is this small, accuracy is more important.

Table 5.4: A table showing x, which is the number the attempt will get raised to every iteration, together with the four most important numbers for the evaluation.

Power     Accuracy   Time/Image   Slowest Image   Average Attempt
x = 2     89.7%      1.02 sec     6.76 sec        1.477
x = 2.5   89.2%      0.93 sec     7.21 sec        1.263
x = 3     88.0%      0.93 sec     7.41 sec        1.135
x = 4     87.7%      0.85 sec     6.80 sec        1.091

The time it takes to deal with the worst image is not something to fixate on either, since we could simply have bad luck with it, hitting a case where the image is missed just because the threshold needed for that specific image was skipped. The more important figures are the time per image together with the accuracy; good values for these give a reliable OCR service. Comparing our best candidates from the two tests, we end up with x = 10 from the multiplication case and x = 2.5 from the squaring case. Because it is a close call, we ran a new test with each candidate on a set of 1026 images instead of the previous 342. The results are shown in Table 5.5. Even though the two candidates are close, we think the squaring method gets the edge here: while the multiplying method has marginally better accuracy, the speed of the squaring method wins out.

Table 5.5: A table comparing the two methods for tweaking the threshold in the binarization of the image.

Method                 Accuracy   Time/Image   Average Attempt
Multiplying (x = 10)   88.0%      1.08 sec     1.389
Squaring (x = 2.5)     87.9%      0.96 sec     1.247

5.3.3 Evaluating the impact of preprocessing

In the beginning of our project we tried the OCR software on its own on images, quickly realizing that this was not what we were looking for, because the OCR software did not perform well, especially not on license plates. We started adding some preprocessing to see if we could help the OCR software, which we could. More and more preprocessing work was added, and suddenly we had improved the accuracy a whole lot. Here we evaluate how big an impact the preprocessing has on making the OCR software work as the OCR service we intended. To get a better understanding of how valuable the preprocessing is for the OCR software, we let the program go through 1026 images of varying sizes and qualities. Keep in mind that these images, even having passed the filter of the object detection, will in practice plausibly be captured at better angles and with more care, favoring both the time and the precision of the program. Of these 1026 images we got 899 matches and 127 misses, meaning the service got about 88 images out of 100 correct. As discussed earlier, we could improve the accuracy even further in exchange for some speed. Taking into consideration that many of the photos of the plates vary a lot in quality, the OCR service performed rather well. The time it took for the service to go through all 1026 images was 988 seconds, or 16.5 minutes, putting the program at about 0.96 seconds per image. The next test compares the result we just got, analyzing the large set of images, with the OCR software (the reader) operating on its own, reading the same set of images. This works as an evaluation of how much the preprocessing actually helps the OCR software read the images. Out of the 1026 images read, the OCR software matched 64 of them, giving the program a precision rate of 6.23%. The time it took the OCR software, operating on its own, to go through the set of images was 212 seconds, or 3.5 minutes, giving a processing time of 0.21 seconds per image, see Table 5.6. This shows that the OCR software on its own is really fast, but lacks a lot of accuracy.

Table 5.6: A table comparing the accuracy and speed of the OCR software with and without preprocessing.

Preprocessing   Accuracy   Time/Image   Average Attempt
Yes             87.9%      0.96 sec     1.247
No              6.23%      0.21 sec     1
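The batch evaluation behind these numbers is straightforward. A minimal sketch is given below, assuming a readPlate()-style entry point and a dictionary mapping each file name to its correct plate number; the folder and function names are illustrative, not the actual evaluation code.

import os
import time
import cv2 as cv

def evaluate(folder, ground_truth, read_fn):
    # ground_truth maps file name -> correct plate number.
    hits = 0
    start = time.time()
    for name, plate in ground_truth.items():
        img = cv.imread(os.path.join(folder, name), cv.IMREAD_GRAYSCALE)
        if read_fn(img) == plate:
            hits += 1
    seconds_per_image = (time.time() - start) / len(ground_truth)
    return hits / len(ground_truth), seconds_per_image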

Because the set of 1026 images we have put through the service so far varies a lot in quality, we created a subset in which we filtered out the images we would not have captured ourselves and kept those we considered legitimately taken, that is, the ones we thought the OCR service should be able to handle. With this subset we ran the same tests as with the large set. The subset contained 176 images in total. While the OCR software on its own is still very fast, it is still not remotely accurate at 7.95%, see Table 5.7, and this even though we filtered out the worse images. Comparing this to the OCR service reading preprocessed images, we clearly see a huge difference in accuracy. When the images in the subset were preprocessed before being read, the OCR service matched 167 of the 176 images, giving the OCR service a precision rate of 94.8%. The time per image landed at 0.94 seconds, which is quite good considering that this is a backend cloud service, where the OCR service delay is only one of many factors in the total delay, such as transfer over the internet. This shows the actual gap between reading an image that has been preprocessed and reading the original image. Without preprocessing we deem the OCR software useless, at least for our purpose.

Table 5.7: A table comparing the accuracy and speed of the OCR software with and without preprocessing on a subset of optimized images.

Preprocessing   Accuracy   Time/Image   Average Attempt
Yes             94.8%      0.94 sec     1.176
No              7.95%      0.22 sec     1

5.4 Summary

In this chapter we evaluated our results. We discussed our application, which, while it did not run as fast as we had hoped, was still a success on the right devices. The accuracy of our object detection was calculated by using it on images similar to those it was trained on and comparing the number of detections to the number of images. The accuracy of our OCR service was evaluated by feeding it a thousand images resembling those that would be sent from a phone and comparing the result with the same OCR software without any preprocessing.

6 Conclusion

In this chapter we discuss the project as a whole. We also consider what could be done to further improve what we have created and what we would have done if we had more time.

6.1 Project summary

Over the course of 16 weeks we have gone from neither of us ever having touched the field of machine learning to developing an application that can detect license plates, together with an associated OCR service that can read them. Since neither of us had any prior experience with neural networks when we began, our first couple of weeks were spent researching the subject. We read theory about neural networks as well as how they are used in practice. Several different tools and techniques were tested during our research phase, and we walked down the wrong path many times before stumbling upon our final frameworks. The project has had a lot of minor problems as well as a few major ones. An example of a major problem was when we spent several days trying to port our trained CNN to Android using TensorFlow Lite, a lightweight implementation of TensorFlow made specifically for mobile and embedded systems. This was unsuccessful because TensorFlow Lite requires a different file type than the one our model was created as, and in the end we decided to use regular TensorFlow. Minor problems mostly consisted of being unable to install packages or programs and things not working when they were supposed to. Problems like these were usually solved by checking the associated GitHub repository for similar errors or looking them up on Google. As we went into this project with little to no knowledge of the software or the theory behind ML, there was little thinking ahead. We had some knowledge of the Python language, but none of how to apply it, together with computer vision, to images. This led to searching for information on the internet; information was found, but not necessarily the right information, let alone the best. The first

major problem was not knowing about the configurability of the OCR software. This led to many hours of work trying to find an optimal state for the image with regard to resolution and the size of the actual characters on the plate. The problem was later solved by discovering how the OCR software could be tweaked to read an image in different ways, for example reading the image as if it were a sentence, treating it as a single word or, as in our case, a single line of text. After having spent many hours on optimizing the size of the image and the characters with little success, this discovery helped us a lot; our way of attacking the problem was now completely different from before. The next major problem was found while trying to efficiently identify the plate in the image. As mentioned earlier, the probability that the program has to iterate is not high, and this is due to the discovery of the HSV model. Before, we had one way of identifying the plate, and that was through adaptive binarization. With no HSV color ranges to work with, we could not tell the program which ranges of light and color to accept; full trust was given to the adaptive binarization. With this trust, it was not uncommon that the program let through the country identifier (the leftmost part of the plate), making the OCR software try to read that part as a character. Upon finding the HSV colors, we could much more accurately highlight the actual plate. Moreover, before we could utilize HSV, green plates, even if there are not many of them, were nearly impossible to read.

6.2 Future work

Creating and training neural networks is by nature an iterative process. Many things can ruin the learning process: mislabelled data, lack of data or a bad optimization algorithm, to name a few. This creates a situation where you are almost always able to improve your model given enough time and data. This is true in our case as well; both our object detector and OCR service could be improved if we had more time.

This would however bring diminishing returns and is not something we see as necessary. There are a few different possibilities for where to take this project in the future. Even if the system was created specifically for license plates, we have laid a good foundation for creating a similar service using the same tools. Changing the object detection to detect other objects is as easy as gathering images of the desired object, training a CNN using the command in Listing 4.5 and replacing the model file in the application. Adjusting the OCR service for a new object would be harder, though, as the image preprocessing does not make generic objects more readable but is tailored to the shape and colors of license plates. With that said, there are things that could be done to improve it in its existing state. Due to lack of time, the application is not integrated into our employer's application and the OCR service is not set up on a server. Both would be necessary for the project to be used in production.

6.3 Concluding remarks

Artificial Intelligence is without question the big thing right now, reaching from computer science out to all kinds of fields all over the world. No doubt, this is the future. To have been part of this future and to have gotten a taste of AI and ML has been amazing, to say the least.

References

[1] Anaconda distribution. https://www.anaconda.com/distribution/. [2019-06-11].

[2] OpenALPR documentation. http://doc.openalpr.com/. [2019-06-11].

[3] Transfer learning. http://cs231n.github.io/transfer-learning/. [2019-04-08].

[4] Abdellatif Abdelfattah. Image classification using deep neural networks - a beginner friendly approach using TensorFlow. https://medium.com/@tifa2up/image-classification-using-deep-neural-networks-a-beginner-friendly-approach-using-tensorflow-94b0a090ccd4. [2019-02-26].

[5] Derek Bradley and Gerhard Roth. Adaptive thresholding using the integral image. Journal of Graphics Tools, 12(2):13–21, 2007.

[6] Jason Brownlee. Gentle introduction to the Adam optimization algorithm for deep learning. https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/. [2019-06-11].

[7] Jason Brownlee. Supervised and unsupervised machine learning algorithms. https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/. [2019-05-13].

[8] John Canny. A computational approach to edge detection. In Readings in Computer Vision, pages 184–203. Elsevier, 1987.

[9] Wikimedia Commons. File:Rosa Gold Glow 2 small noblue.png — Wikimedia Commons, the free media repository. https://commons.wikimedia.org/w/index.php?title=File:Rosa_Gold_Glow_2_small_noblue.png&oldid=145364404, 2015. [Online; accessed 6-May-2019].

[10] Wikimedia Commons. File:HSV color solid cylinder saturation gray.png — Wikimedia Commons, the free media repository. https://commons.wikimedia.org/w/index.php?title=File:HSV_color_solid_cylinder_saturation_gray.png&oldid=329592315, 2018. [Online; accessed 9-May-2019].

[11] Wikimedia Commons. File:Rosa Gold Glow 2 small noblue color space.png — Wikimedia Commons, the free media repository. https://commons.wikimedia.org/w/index.php?title=File:Rosa_Gold_Glow_2_small_noblue_color_space.png&oldid=299155562, 2018. [Online; accessed 6-May-2019].

[12] Wikipedia contributors. K-means clustering — Wikipedia, the free encyclopedia. https://web.archive.org/web/20190502090549/https://en.wikipedia.org/wiki/K-means_clustering. [Online; accessed 2019-05-02].

[13] Maya R. Gupta, Nathaniel P. Jacobson, and Eric K. Garcia. OCR binarization and image pre-processing for searching historical documents. Pattern Recognition, 40(2):389–397, 2007.

[14] Lars Hulstaert. https://www.datacamp.com/community/tutorials/object-detection-guide, 4 2018. [2019-02-26].

[15] Ujjwal Karn. A quick introduction to neural networks. https://ujjwalkarn.me/2016/08/09/quick-intro-neural-networks/, 8 2016. [2019-02-26].

[16] Ding Liu. Connecting low-level image processing and high-level vision via deep learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pages 5775–5776, 2018.

[17] Nicomsoft. Optical character recognition (OCR) - how it works. https://web.archive.org/web/20190220082918/https://www.nicomsoft.com/optical-character-recognition-ocr-how-it-works/. [2019-02-20].

[18] OpenCV. OpenCV contour features. https://web.archive.org/web/20190412073446/https://docs.opencv.org/3.1.0/dd/d49/tutorial_py_contour_features.html. [2019-04-10].

[19] OpenCV. OpenCV documentation. https://web.archive.org/web/20190312165124/https://opencv.org/about.html. [2019-03-12].

[20] OpenCV. OpenCV geometric image transformations. https://web.archive.org/web/20190415132820/https://docs.opencv.org/2.4/modules/imgproc/doc/geometric_transformations.html. [2019-04-15].

[21] OpenCV. OpenCV k-means clustering. https://web.archive.org/web/20190502123454/https://docs.opencv.org/3.0-beta/doc/py_tutorials/py_ml/py_kmeans/py_kmeans_opencv/py_kmeans_opencv.html. [Online; accessed 2019-05-02].

[22] OpenCV. OpenCV structural analysis and shape descriptors. https://web.archive.org/web/20190413212704/https://docs.opencv.org/2.4/modules/imgproc/doc/structural_analysis_and_shape_descriptors.html. [2019-04-13].

[23] Chirag Patel, Atul Patel, and Dharmendra Patel. Optical character recognition by open source OCR tool Tesseract: A case study. International Journal of Computer Applications, 55(10):50–56, 2012.

[24] Matthias Lee. Project description, PyPI. https://web.archive.org/web/20190313150806/https://pypi.org/project/pytesseract/. [2019-03-13].

[25] ÅF. At a glance. http://www.afconsult.com/en/about-af/at-a-glance/, 2. [2019-02-12].

[26] Joseph Redmon. Darknet: Open source neural networks in C. http://pjreddie.com/darknet/, 2013–2016. [2019-03-13].

[27] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. http://arxiv.org/abs/1506.02640, 2015. [2019-04-13].

[28] Sumit Saha. A comprehensive guide to convolutional neural networks - the ELI5 way. https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53. [2019-04-10].

[29] SAS. Machine learning. https://www.sas.com/en_us/insights/analytics/machine-learning.html. [2019-02-19].

[30] Archana A Shinde and DG Chougule. Text pre-processing and text segmentation for OCR. International Journal of Computer Science Engineering and Technology, 2(1):810–812, 2012.

[31] Richard Szeliski. Computer Vision - Algorithms and Applications. Texts in Computer Science. Springer, 2011.

[32] Trieu. darkflow. https://github.com/thtrieu/darkflow. [2019-03-13].
