
Go-Green or Go-Home: Using computer vision for real-time traffic monitoring

Guðjón Björnsson Þórður Friðriksson

Thesis of 12 ECTS credits submitted to the School of Computer Science at Reykjavík University in partial fulfillment of the requirements for the degree of Bachelor of Science in Software Engineering

May 15, 2020

Supervisor: Gylfi Þór Guðmundsson

Examiner: Guðmundur Már Einarsson

Acknowledgements

We want to thank Reykjavík University for providing us with facilities while working on this project, even though we did not get the chance to utilize them the whole time due to the Covid-19 pandemic. Thank you, Gylfi Þór Guðmundsson, for the around-the-clock guidance and help, and special thanks go to Hallgrímur Arnaldsson for proposing the project. Lastly, thank you, Guðmundur Már Einarsson, for your valuable feedback.


Contents

1 Introduction

2 Background
  2.1 Machine learning
  2.2 Computer vision
  2.3 Object detection
  2.4 Neural networks
  2.5 The Icelandic license plate
  2.6 Optical character recognition (OCR)

3 Methods
  3.1 System design overview
  3.2 Data collection
  3.3 Choosing and training the object detector(s)
  3.4 Vehicle tracking
  3.5 Plate reading
  3.6 Passenger counting

4 Experiments and evaluation
  4.1 Data used in evaluation
  4.2 Evaluating passenger counting
  4.3 Evaluating OCR
  4.4 Evaluating license plate reading
  4.5 Performance evaluation of the object detectors

5 Future work
  5.1 Hardware changes
  5.2 Data and training
  5.3 Different types of DNNs
  5.4 Cloud solutions
  5.5 OCR
  5.6 Training Tesseract
  5.7 Improving OCR with explicit rules
  5.8 Combining YOLO and OCR
  5.9 Optimization of the object detector
  5.10 Improve passenger counting
  5.11 Statistical filtering

6 Conclusion

7 Appendix
  7.1 Character prediction table

Abstract

With a growing population, traffic and air pollution have become a real problem and the need for eco-friendly solutions dire. In our busy world, people tend to make the easy choice instead of the right one. There is therefore an urgency for incentives and penalties to help guide people towards the right and environmentally friendly choices. In this thesis we dive into the implementation of a system for real-time road-side monitoring of traffic using deep neural networks and computer vision, with the intent of counting a vehicle's occupants and capturing its license plate registration. We could then use the license plate registration to look up a vehicle's model, weight and greenhouse gas emissions using the Icelandic department of motor vehicles' online look-up service. With this information we can determine whether the vehicle meets Iceland's standard for a green vehicle. This information can be used to aid in the enforcement of reward systems such as car-pool lane monitoring, automated toll booth charging or priority parking access. We evaluate two state-of-the-art deep learning networks and show that we can successfully predict the correct number of front-seat passengers with an accuracy of 89%, and obtain the correct license plate reading with an accuracy of 81%. Furthermore, we propose solutions for improving the system and achieving better accuracy and performance.

1 Introduction

The world is changing for the worse, both on land and in the sea, and the changes are happening faster than ever before [1]. Over the last 50 years, nature's capacity to support us has plummeted: air and water quality are declining, soils are depleting, crops are short of pollinators, coasts are less protected from storms [2], and we as humans are to blame. One of our bigger concerns regarding this problem is our contribution to air pollution via vehicle emissions. With the ever constant growth of the human population, combined with the fact that the number of registered vehicles in Iceland is increasing at an even greater rate [3], the prospect of a future with clean air seems bleak. So how can we help relieve the environmental pressure being put on by cars and other motorized vehicles?

Some might suggest abandoning cars altogether, returning to alternative and more eco-friendly means of transportation such as walking or cycling. This is a fine and noble idea but unfortunately not a feasible one. As difficult as it is to admit, for society to continue to function with the same productivity, we need fast and reliable means of transportation. The problem statement is therefore simple: we need to make automotive transportation pollute less while keeping society's throughput at current levels. The solution, as with any other problem, might not be as straightforward as we would like to think.

There are three main ideas on how to solve this problem. The first is to use public transit like buses, trams or trains, which will reduce both air pollution and road congestion. The downside is that such modes of transportation will not get everyone exactly where they need to go, nor do they transport you as quickly as your own car. The second idea is to drive fuel-efficient vehicles with low or even no greenhouse gas emissions. These cars can help the environment while potentially saving you money on fuel costs at the pump; they can for example be electric, hybrid, hydrogen fuel cell vehicles or even simply cleaner-burning gasoline vehicles. The third idea is to carpool with other people who are heading in the same direction and thus share the ride, the cost and the emission of greenhouse gases.

In this thesis we will look into ways to better motivate people to utilize the two latter ideas, that is, to motivate people to drive more fuel-efficient cars and to adopt ride sharing into their daily routines. To achieve this we dive into the implementation of a traffic monitoring system aimed at distinguishing those who choose environmentally friendly vehicles and ride sharing. If such a system were to exist in a sufficiently general and small enough form factor, it could be deployed in various locations, rewarding individuals who choose a more sustainable way of living. To construct such a system, we research the world of object detection and computer vision to see if it is possible to read and look up registration plates to recognize fuel-efficient vehicles, as well as to identify ride sharing by detecting how many passengers are in the front seats. We also research the possibility of doing all this under the constraints of a real-time application, which means that our detection needs to be both accurate and fast.

The rest of the thesis goes into detail on how this research was conducted, and can be broken up into five chapters:

• The background chapter, which gives a summary of all the concepts and theory that the research builds upon.

• The methods chapter, which discusses the process and approach taken while conducting the research.

• The results chapter, which displays the unbiased raw data from the results of our research.

• The discussion and future work chapter, which gives our take on the results and discusses what work is yet to be done.

• Finally, the conclusion chapter, which gives a brief summary and the final words of our research and report.

Overall, our experiments give us the accuracy and speed at which our implemented system performs, as well as an idea of how well it could potentially perform if deployed as a real and usable application in the real world.

2 Background

In this chapter we will cover the background material that we build our work upon, starting with a discussion on machine learning.

2.1 Machine learning

In most common programming practice the algorithm is written by a programmer, telling the computer exactly how to perform a given task. For advanced tasks, such programming can be difficult and very time consuming. Machine learning involves computers discovering how to perform tasks without being explicitly programmed to do so by a programmer. Machine learning algorithms construct a mathematical model from given data, known as training data. The mathematical model can then be used to make predictions, or take decisions, based on new and previously unseen data. Many machine learning approaches exist but they most commonly fall into one of the following three categories [4]:

• Supervised learning: The computer is presented with example inputs and their desired outputs, given by a “supervisor”, and the goal is to learn a general rule that maps inputs to outputs.

• Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).

• Reinforcement learning: A computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle or playing a game against an opponent). As it navigates its problem space, the program is provided feedback analogous to rewards, which it tries to maximise.

2.2 Computer vision

Computer vision, often abbreviated CV, is a scientific field that deals with the study of how a computer sees and understands the content of images and videos. A computer sees an image as a collection of 1's and 0's; these make up pixels, which are the smallest elements of an image. An image is represented as an array of pixels where each pixel can have a number of channels depending on the format; the most common is three channels representing the color intensity of red, green and blue, each given as a value between 0 and 255. Modern formats include more complex representations and compression techniques that are out of scope for this thesis. Computer vision involves taking the array of pixels and manipulating it to bring out features and characteristics of the image; machine learning can then be used to match patterns in the image to known patterns.

Computer vision emerged in the 1950s with algorithms detecting edges of simple shapes and sorting them into categories like "square" and "circle". In 1977 the first commercial application of computer vision was made, allowing a computer to read typed, printed and handwritten text using optical character recognition (OCR), aimed at enabling blind people to read [5]. With the rise of the internet in the 90's a large set of images became available to the public for analysis; this data set, along with cheaper and more abundant computing power, would keep growing exponentially and further accelerate the growth of computer vision. Today computer vision is used in many fields to do specific tasks faster and with more accuracy than humanly possible. With new applications appearing rapidly, some of the many uses for computer vision include automatic inspection in manufacturing lines, identifying cancer cells in medical scans, self-driving vehicles, fingerprint and facial recognition for unlocking devices and many more.
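To make the pixel-array representation concrete, the short sketch below (not part of our system; the file name frame.jpg is only a placeholder) loads an image with OpenCV and inspects a single pixel:

```python
import cv2

# Illustrative only: OpenCV loads an image as a height x width x 3 array of
# 8-bit values, with one channel each for blue, green and red (note the BGR order).
image = cv2.imread("frame.jpg")
height, width, channels = image.shape
blue, green, red = image[0, 0]   # channel intensities (0-255) of the top-left pixel
print(height, width, channels, (blue, green, red))
```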

2.3 Object detection

Object detection is a subfield of computer vision and is the combination of image classification and object localisation, returning both the predicted locations of objects in the image and the class or label of each object. Object detection has been around for quite some time but the first framework proposed to work in real time was introduced in 2001 [6].

2.3.1 The machine learning approach

The first methods for object detection involved extracting features from the image and passing them on to a classifier, such as a support vector machine. Although these algorithms were not very precise, due to the fact that the feature extractors did not capture much detail about the objects, they were fast and lightweight enough to be deployed on digital cameras.

2.3.2 The deep learning approach

Deep learning is the state-of-the-art approach to object detection and will be further discussed in section 2.4; for a detailed description of object detection neural networks see section 2.4.2.

2.3.3 Object tracking

When processing a video file for object detection, the video is represented as a sequence of frames, how many depending on the format of the video. Each frame differs slightly from the next and, when displayed in a series, they trick the mind into perceiving continuous motion. The problem that arises when performing object detection on video is knowing whether an object, which may have moved slightly between frames, is the same as one detected in the previous frame or a new object. To tackle this problem the location of each object is stored and then, when the next frame is processed, each previously seen object's location is updated to the location of the detection currently closest to its previous location. New objects, if any, are added to the tracking list.

2.4 Neural networks

Artificial neural networks (ANN) are computational models based on a system of connected nodes, loosely inspired by the neural network of the brain. A neural network can be trained to do specific tasks based on a set of rules without being programmed to do so. In this research we look into the ability to train a neural network to do object detection. In object detection the ANN takes an image as its input and outputs predictions of objects it believes it has found in the given image. The output consists of: a class or a name for the object, for example "person"; a location where the network believes it has located an object, often given by two x,y coordinates representing the top-left and bottom-right corners of the bounding rectangle containing the object; and a confidence score as a value between 0 and 1 indicating how confident the network is that it has located an object of that specific class in that location. An output for one image is not limited to a single prediction, and in most cases the network gives multiple predictions of varying confidence, sometimes with more than one overlapping prediction in the same general location.

An artificial neural network consists of nodes, or artificial neurons, that combine input from the data into a numeric value, and edges between them; the edges between nodes have certain weights used to either dampen or amplify the output of a neuron in relation to the node on the other end of the edge. A layer is a collection of nodes working together at a specific depth in a network. Although single-layer networks exist and are the foundation for the multilayered networks used in more complex calculations, we will be focusing on the structure of the latter. In a multilayered network the layers can be categorized into three distinctive layer types: the input layer, the hidden layers and the output layer, where each layer's output is simultaneously the subsequent layer's input.

• Input layer: The input layer takes in the raw data; often no computation is done in this layer. For example, each node could represent the color value of a single pixel.

• Hidden layer: The hidden layers of a network are not directly observable from the system's inputs and outputs. They take in the output of the layer before them and try to detect more sophisticated patterns in the data; this is where the "magic" of neural networks happens, and most networks have more than one hidden layer. The hidden layer is most intuitively described in the context of object detection: depending on the depth of the layer, it could be combining outputs of the nodes before it to detect edges in an image, and those edges can then be combined in the following hidden layer to detect squares or other forms made up of more than one edge.

• Output layer: The output layer is the last layer in the network. Even though it acts much like a hidden layer, combining the weighted results of the layer before it, it is considered its own layer since it gives out the final results of the network.

2.4.1 Supervised learning

When training a network to recognise specific objects in images, supervised learning is best suited. Initially the weights of all edges are randomized to small numbers so as to break symmetry, and a collection of data must be labeled with the desired objects and split into three parts: one for training, one for validation and one for testing. Supervised learning can be split into the following three phases:

1. Forward propagation: The network is exposed to the training data and, as it passes through the network, each node applies its transformation function to the data it received and passes it on to the neurons of the next layer. When the data has passed through all the layers, the network outputs the predictions for the given input.

2. Loss calculation: The output from the first phase can now be compared to the real values, which, since this is supervised learning, are known beforehand. The loss can be calculated as the difference between the real values and the calculated values.

3. Back propagation: By tracing the loss back through the network, each node's relative contribution to the loss function is calculated. A technique called gradient descent is used to change the weights in small increments, using the derivative of the loss function to arrive at a loss as close to zero as possible for the next iteration of the training.

By tracing the loss back through the network and altering the weights of neurons to lower the loss, significance can be given to certain features of the data. Training a neural network requires many iterations of these phases; each iteration is called an epoch. By increasing the number of epochs until the loss score converges with the loss on the validation data, we conclude that our network is trained well enough. If the loss of our network continues to decrease but the loss on the validation data starts to increase, we may have run into the problem of overfitting. Overfitting happens when the network has learned the characteristics of the training data too well and may therefore fail to predict on unseen data. Underfitting is also possible, when the neural network is not capable of capturing the characteristics of the data, either because of undertraining or because it lacks the number of neurons and layers needed to represent the data's characteristics [7].
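As an illustration of the three phases, the minimal sketch below trains a small stand-in classifier with PyTorch; it is not the training setup used for our detectors (those are trained through the Faster R-CNN and YOLO tooling), just the generic forward/loss/backward loop:

```python
import torch
import torch.nn as nn

# Stand-in model and data, purely for illustrating one training epoch per loop iteration.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(64, 10)           # stand-in training batch
targets = torch.randint(0, 2, (64,))   # stand-in labels

for epoch in range(10):
    predictions = model(inputs)            # 1. forward propagation
    loss = loss_fn(predictions, targets)   # 2. loss calculation
    optimizer.zero_grad()
    loss.backward()                        # 3. back propagation
    optimizer.step()                       # gradient descent weight update
    print(epoch, loss.item())
```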

2.4.2 Different object detection CNNs

Object detection has always been an interesting problem in the field of deep learning. It is primarily performed by employing convolutional neural networks (CNNs) [8], with the first published breakthrough being Overfeat in 2013 [9]. Since then deep learning has come a long way and object detection is now possible in real time, although speed is a desired aspect that does come at the cost of accuracy. In the following sections we will discuss some of the most popular types of neural networks used for object detection today and how they stack up against each other.

• R-CNN, or Region-based CNN: First, R-CNN uses what is called "selective search" to generate about 2000 region proposals, or bounding boxes of potential objects in the image. Each region proposal is resized to match the input of a CNN, from which a 4096-dimensional vector of features is extracted. According to [10], the CNN acts as a feature extractor, the output dense layer consists of the features extracted from the image, and the extracted features are fed into a machine learning classifier known as a Support Vector Machine (SVM) to classify the presence of an object within the candidate region proposal. The main drawbacks of this approach are: it takes a long time to train the network since 2000 region proposals must be classified per image; it cannot be run in real time as it takes around 47 seconds per test image on a reasonably fast computer; and the selective search algorithm is a fixed algorithm, so no learning happens at that stage, which can lead to the generation of bad candidate region proposals.

• Fast R-CNN: The approach of Fast R-CNN is similar to the R-CNN algorithm but, instead of feeding the region proposals to the CNN, the entire image is fed as input and a convolutional feature map is generated; from it, so-called Regions of Interest, or RoIs, are detected with the selective search method. The feature map's size is reduced using a RoI pooling layer to get valid Regions of Interest with fixed height and width as hyperparameters, so that it can be fed into a fully connected layer. From the RoI feature vector, a layer using a Softmax squashing function is used to predict the class of the proposed region, and another layer with a linear output predicts the bounding box [11]. This architecture is faster than standard R-CNN because there is no need to feed 2000 region proposals to the CNN every time; instead the convolution is done only once per image and a feature map is generated from it [10].

• Faster R-CNN: Both R-CNN and Fast R-CNN use selective search to find the region proposals, and since that is a slow algorithm a new method was proposed that eliminates selective search and lets the network learn the region proposals. Similar to Fast R-CNN, the image is provided as input to a convolutional network which produces a convolutional feature map. A separate region proposal network is then used to predict the region proposals. The predicted region proposals are reshaped using a RoI pooling layer, which is then used to classify the image within the proposed region and predict the offset values for the bounding boxes. This approach outperforms R-CNN and Fast R-CNN in both speed and precision [10].

• YOLO, or You Only Look Once, is an object detection algorithm much different from the region-based algorithms mentioned before. In YOLO a single convolutional network predicts the bounding boxes and the class probabilities for these boxes. YOLO works by taking an image and splitting it into an SxS grid; within each grid cell m bounding boxes are predicted. For each bounding box, the network outputs a class probability and offset values. The bounding boxes that have a class probability above a threshold value are selected and used to locate the object within the image. The advantage of YOLO is that it is much faster than the above-mentioned algorithms and can therefore run in real time; however, this speed comes at a cost: the limitation of YOLO is that it struggles with small objects within the image, which is due to the spatial constraints of the algorithm [12].

• SSD, or Single Shot MultiBox Detector, performs object localization and classification in a single forward pass of the network, hence the "single shot". The model takes an image as input, which passes through multiple convolutional layers with different sized filters (10x10, 5x5 and 3x3). Feature maps from convolutional layers at different positions of the network are used to predict the bounding boxes. They are processed by specific convolutional layers with 3x3 filters, called extra feature layers, to produce a set of bounding boxes. The Non-Maximum Suppression (NMS) method is used at the end of the SSD model to keep only the most relevant bounding boxes (a minimal sketch of NMS is shown after this list). Hard Negative Mining (HNM) is then used because a lot of negative boxes are still predicted; it consists of selecting only a subset of these boxes during training. Advantages of SSD described in [11] include a good balance between speed and accuracy, and the fact that every convolutional layer functions at a different scale, so the model is able to detect objects at a mixture of scales.
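Non-Maximum Suppression is used in one form or another by all of the detectors above to discard overlapping duplicate predictions. The following is a minimal NumPy sketch of the idea (the IoU threshold of 0.5 is an arbitrary illustrative value):

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping boxes that overlap them too much.

    boxes:  (N, 4) array of [x1, y1, x2, y2] corners.
    scores: (N,) confidence values.
    Returns the indices of the boxes that survive suppression.
    """
    order = scores.argsort()[::-1]   # indices sorted from best to worst score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection of the chosen box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        # Keep only the remaining boxes that do not overlap the chosen one too much
        order = rest[iou <= iou_threshold]
    return keep
```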

Figure 1: Comparison between CNNs.

2.4.3 Labeling

For supervised learning to take place, data must first be labeled with the desired output of the model. The labeled data needs to be split into three categories: training data, validation data and test data.

1. Training Dataset: The sample of data used to fit the model.

2. Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.

3. Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.

After every epoch the training model is evaluated with the validation data, without actually learning from the validation data; the outcome can then be used to fine-tune higher level hyperparameters. In a way the validation data affects the model in training, but it does so indirectly, since the model fits the validation data better with every update of the hyperparameters. Once the model has been completely trained it is evaluated with the testing data, and evaluation with the testing data is what deems the model fit. The test data should be well curated and represent the various classes and scenarios that the model could face in a real-world setting. In some cases the test and validation set are one and the same, although this is considered bad practice if the validation set is used to fine-tune or affect the model in any way. The ratio between training, test and validation data is highly dependent on the application; sometimes no or very few hyperparameters are available for fine-tuning and in that case the validation data can be a much smaller portion of the whole set [12]. Cross-validation is also a common practice to avoid overfitting. In cross-validation the data is split into train and test; the test data is then kept aside like before but the training portion is randomly split into training and validation each epoch [13].
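As a small illustration of such a three-way split (the ratios below are purely illustrative and are not the ones used in this project), scikit-learn's train_test_split can simply be applied twice:

```python
from sklearn.model_selection import train_test_split

# Hypothetical list of labeled frames; only the splitting logic matters here.
samples = [f"frame_{i:05d}.jpg" for i in range(6633)]

# Hold out a test set first, then carve a validation set out of the remainder.
train_val, test = train_test_split(samples, test_size=0.15, random_state=42)
train, val = train_test_split(train_val, test_size=0.15 / 0.85, random_state=42)
print(len(train), len(val), len(test))
```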

2.5 The Icelandic license plate

The Icelandic license plate comes in four standard sizes, all having blue letters printed on a white background with a blue borderline, using the font Helvetica Neue. The dimensions of the four standard sizes are [14]:

A Dimension 110 x 520 mm, dimensions of letters 70 x 11 mm.

B Dimension 200 x 280 mm, dimensions of letters 70 x 11 mm.

C Dimension 130 x 240 mm, dimensions of letters 49 x 7 mm.

D Dimension 155 x 305 mm, dimensions of letters 61 x 9 mm.

Figure 2: Standard Icelandic license plates. Note that the word Stærð means size.

The plates are made up of 5 alphanumeric characters following the regular expressions:

A = [A-Z][A-Z][A-Z0-9][0-9][0-9]

B, C, D = [A-Z][A-Z]-[A-Z0-9][0-9][0-9]

Plate "A" is written as a single line of text broken up by an inspection label between the second and third character, while "B", "C" and "D" are broken into two lines with a hyphen following the first line and inspection labels on the side.
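For illustration, the two standard formats can be checked against an OCR reading with a pair of regular expressions (a sketch only; it ignores the special cases listed in the next section):

```python
import re

SINGLE_LINE = re.compile(r"^[A-Z]{2}[A-Z0-9][0-9]{2}$")   # size A, e.g. "AB123" or "ABC12"
TWO_LINE = re.compile(r"^[A-Z]{2}-[A-Z0-9][0-9]{2}$")     # sizes B, C and D, e.g. "AB-123"

def is_standard_plate(text: str) -> bool:
    """Return True if an OCR reading matches one of the standard formats."""
    text = text.strip().upper()
    return bool(SINGLE_LINE.match(text) or TWO_LINE.match(text))

print(is_standard_plate("AB123"), is_standard_plate("AB-123"), is_standard_plate("A1B23"))
# -> True True False
```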

2.5.1 Special cases

A number of exceptions to the standard format exist, some deviating more from the standard format than others [15]. Plates using the standard dimensions include:

• Commercial plates - Red printing.

Figure 3: Commercial license plates.

• Diplomatic plates - First two letters are "CD" with white printing on a green background.

Figure 4: Diplomatic license plates.

• Oil - Black printing on yellow background.

Figure 5: Oil license plates.

• Temporary - Black printing on red background.

Figure 6: Temporary license plates.

• Vanity - 6 alphanumeric characters, including Icelandic letters, in any order except for the standard one shown above; one white space is allowed (and counts as one character). In the case of two-line plates the split is made after the third character. If narrow characters are chosen, for example "1" and "I", more than 6 characters are allowed as long as they fit the plate [16].

Figure 7: Vanity license plates.

Two plate types exist that do not follow the standard dimensions: government plates, which carry only a single numeric value ("1" being the president), and plates issued before January 1st 1989. The older plates have dimensions based on the number of characters on them; they are printed in white or silver on a black background and are made up of one letter followed by 1-5 digits.

Figure 8: Government license plates (left) and older format license plates (right).

2.6 Optical character recognition (OCR)

Optical Character Recognition, or OCR for short, is, as the name suggests, a way to recognize text within a digital image so that the text in the image can be converted into a machine-readable format which can then be manipulated. Some use cases of this technology are, for example, transforming a printed book into an electronic one, indexing documents for use by search engines, reading vehicle registration plates and so on. The task of OCR can be broken down into four main subtasks: pre-processing, character detection, character recognition and post-processing, with a possibility of a second pre-processing step between the character detection step and the character recognition step.

2.6.1 Pre-processing

In order to recognize text effectively, the software must first pre-process the image using techniques such as those in [17, 18] (a small OpenCV sketch follows this list):

– De-skewing the image in order to make the lines of text perfectly horizontal or vertical.

– Despeckling the image to remove spots and smoothing the edges of the characters.

– Character isolation to split touching characters that may have bled into each other after the character location step.

– Line removal to clean up non-character boxes and lines.

– Layout analysis to identify columns, paragraphs, captions etc. as distinct blocks.

– Script recognition: in multilingual documents the script may change at the level of individual words, and hence identification of the script is necessary before the right OCR model can be invoked to handle the specific script.

– Normalize aspect ratio and scale to simply make it easier to recognize/detect the characters in the image.
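A few of these steps can be expressed in a handful of OpenCV calls. The sketch below is illustrative only (the parameter values are arbitrary, not tuned); it despeckles, rescales and binarizes a crop before it is handed to the recognizer:

```python
import cv2

def preprocess_for_ocr(image_bgr, target_height=64):
    """Minimal pre-processing sketch: grayscale, despeckle, rescale, binarize."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 3)                 # remove salt-and-pepper speckles
    scale = target_height / gray.shape[0]          # normalize the scale for the recognizer
    gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```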

2.6.2 Character detection

Before the actual character recognition can take place, character detection must first be performed to get the location of each character and to extract them from the image. Since this is done before the actual character recognition it is sometimes regarded as a pre-processing step, but since it is a vital part of the system it deserves its own section. There are two main ways to deal with this task, the first using classic computer vision and the second using deep neural networks. By utilizing classic CV techniques like contour and edge detection, one can locate the characters one by one with a relatively low computational cost. However, contour detection is quite challenging to generalize; it requires a lot of manual fine-tuning to fit a specific task and is therefore, in most cases, an infeasible approach. A solution to this is to utilize deep learning, which can be quite general given a versatile enough training set; a trade-off with this, however, is that deep learning requires a lot of training data while classic CV techniques do not [19].

2.6.3 Character recognition

Suppose for a minute that there was only one letter in the alphabet, A. Even then the task of OCR is quite a tricky one, since there are multiple different ways of writing the letter A: each person has a different handwriting and there exist hundreds of different fonts, each with their own unique way of representing the letter A, albeit sharing some similarities. To solve this problem, and the problem of OCR in general, three main approaches are utilized: pattern recognition, feature detection and deep neural networks [18]:

• Pattern recognition: If everyone wrote the letter A exactly the same way, getting a computer to recognize it would be easy. You would just compare your scanned image with a stored version of the letter A on a pixel-by-pixel basis, and if the two matched you would know that you have the letter A; that is exactly what the method of pattern recognition does. This method was first used in the 1960s, when a special font called OCR-A was developed that could be used on things like bank checks. Every letter was exactly the same width and the strokes were carefully designed so each letter could easily be distinguished from all the others, and so by standardizing on one simple font, OCR became a relatively easy problem to solve. However, this does not really solve the problem at hand, which is that the letter A is not standardized to one way of writing, and so for a general use case this method will not perform well, which leads us to the next methods [17].

• Feature extraction: Feature extraction, also known as Intelligent Character Recognition (ICR), is a far more sophisticated way of recognizing characters. Feature extraction decomposes characters into "features" like lines, closed loops, line direction and line intersections. Extracting features reduces the dimensionality of the representation and makes the recognition process more computationally efficient. These features are compared with an abstract vector-like representation of a character, which might reduce to one or more character prototypes. If we take the example of the character A again, a rule like the following could be applied to recognize it: if you see two angled lines that meet in a point at the top, in the center, and there is a horizontal line between them halfway down, that is the letter A. By applying this rule you will recognize most capital letter As, no matter what font they are written in. Instead of recognizing the complete pattern of an A, you are detecting the individual component features (angled lines, crossed lines, or whatever) from which the character is made. To match a feature to stored character features, some sort of machine learning model is used; the models that seem to do particularly well are nearest-neighbour classifiers such as the k-nearest neighbors algorithm [17].

• Deep neural networks: The third method for character recognition, which some might argue is a feature extraction method, is utilizing the magic and power that is deep neural networks. This approach is by far the most popular today and for good reason: it provides a much more general solution to the OCR problem without sacrificing accuracy in the process, which arguably is the most desired property of a good OCR system [20].

2.6.4 Post-processing

To make OCR even more accurate some post-processing is often required, using techniques such as:

– A constrained character set, to only allow the prediction of desired characters.

– A constrained word list, to only allow the prediction of desired words.

– A pattern list, to only allow the prediction of characters that follow a specific pattern.

– Near-neighbor analysis, making use of co-occurrence frequencies to correct errors by noticing that certain words are often seen together.

– Utilizing the grammar of the language being recognized to help determine if a word is a noun, verb etc., thereby making it easier to predict.

Some of these post-processing methods may also be introduced during the recognition stage itself to produce more accurate and more efficient results [17].

2.6.5 Tesseract

Tesseract is a highly popular open-source OCR engine currently maintained by Google that uses many of the previously discussed approaches to OCR. Its most recent version implements an engine that utilizes neural networks to perform the recognition and provides support for many different languages and scripts, as well as a flexible interface that can take in parameters to fit your OCR needs; such parameters can for example be whitelisting characters, blacklisting characters and specifying the type of text, like single character, text line, text block etc. Tesseract also provides good support for custom training/fine-tuning of the network on your own text images to yield even better results. A downside to Tesseract, on the other hand, is that it is mostly trained on text documents, and so for any other use the accuracy and results might be subpar.
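As an example of that interface, the snippet below reads a cropped plate image through the pytesseract wrapper. The file name is a placeholder and the exact configuration used in our system may differ, but the page segmentation mode and character whitelist shown here are typical options:

```python
import cv2
import pytesseract

# "--psm 7" treats the input as a single line of text; the whitelist constrains
# the predictions to upper-case letters and digits (the plate character set).
config = "--psm 7 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
plate_img = cv2.imread("cropped_plate.png", cv2.IMREAD_GRAYSCALE)
text = pytesseract.image_to_string(plate_img, config=config)
print(text.strip())
```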

3 Methods

In this chapter we go over the design and implementation of the system.

Figure 9: System overview.

3.1 System design overview

The system can be split into three main components: vehicle tracking, license plate reading and passenger counting. An overview of the system can be seen in figure 9. First, our system takes a frame of video as input and runs it through a deep learning object detector(s) to detect all vehicles, licence plates and passengers at the same time. Each detected vehicle is then tracked and, if detected, the plate and the passenger count of that vehicle are assigned to it. This is necessary because we expect to see the same vehicle over multiple frames and thus we need to avoid duplicate values for the same car, as discussed in section 2.3.3. When a vehicle has passed by, i.e. has not been detected in several consecutive frames, the most frequent plate reading and the average passenger count are kept. The plate can then be cross-referenced with Iceland's vehicle database to get detailed information such as its engine type or carbon emission estimates. With that data we can then estimate the car's eco-friendliness.

3.2 Data collection

Neural networks need a lot of labeled data to train on, otherwise they can end up overfitted as we discussed in section 2.4.1, and such data is not readily available for Icelandic conditions. To train our object detection network(s) we knew that we would have to collect and label the data ourselves. Our original plan was to collect the data on campus grounds, where we had more control over the environment, could choose a proper camera to work with, etc. However, due to the Covid-19 pandemic that broke out worldwide at the time, we could not stick to our plan and collect data on campus since the university was closed. We therefore had to improvise, and we did so by attaching our GoPro to the hood of our car and going out for a drive. We tried to gather data in multiple conditions, like rain, snow, sun, day, night, etc., but most of the data was collected during the day in sunny weather, due to the fact that it was the most common weather at the time. In total we collected 3.5 hours of video. More details on the data collection can be found in section 4.1.

3.3 Choosing and training the object detector(s)

The object detection model that we started with was the Faster R-CNN object detector discussed in section 2.4.2. Above all else we needed to be able to accurately detect both passengers and the licence plates. Our initial question was therefore whether detecting people through the windows of a car was even feasible. We decided to start with the Faster R-CNN model since it was the most accurate, albeit slow. Our reasoning was that the most accurate system would be the one most likely to give us the best idea of how difficult this problem was. To train the model we labeled 4,502 images that were gathered from the streets using the approach described in section 3.2. For each image we labeled the license plates and the passengers; we also labeled the windshields since we hoped that extracting the windshield might help us count the front seat passengers. After training the detector on the labeled data we used it to extract license plates from multiple videos. We then labeled 2,131 of the extracted plate images, where the labels were the characters on the plate and what we call "character blocks", which we also detect since we wanted to see which of the two our OCR performs best on. We then trained a separate Faster R-CNN object detector on these images.

3.3.1 One detector to find them all

Having trained both detectors we soon realized that running both object detectors on a single, relatively underpowered, GPU was not going to work very well. The switching between the two models was causing a kind of thrashing effect, and the GPU was not able to maintain both models in memory at the same time. On top of that we had planned to use a third, out-of-the-box, object detector to detect the vehicles themselves for object tracking, but that was now out of the question. To remedy this we first took our out-of-the-box vehicle detector and used it to detect all vehicles in our previously labeled traffic images and label the vehicles in them for us. We then trained a new object detector, again using Faster R-CNN, to detect license plates, passengers, windshields, vehicles, characters and character blocks. This avoided both of the aforementioned problems, since we now had a single object detector and therefore no thrashing. The main problem with this solution was the variable size of the images the DNN had to process: although the frames were all 1920 by 1080 pixels, the license plates varied in size depending on their location relative to the camera. When the DNN has to take in a differently sized image, the convolutional layers and pooling layers themselves are independent of the input dimensions; however, the output of the convolutional layers will have different spatial sizes for differently sized images, and this causes an issue if a fully connected layer follows. Even though the inference takes twice as long every time a new image size is processed, the predictions were accurate enough to perform OCR on them with good results.

After this the next step was to try different models for object detection. Since we had already tried a model that was slow but accurate, we thought that next we would try a model on the other end of the spectrum, that is, fast but less accurate. For this we trained none other than the king of speed, YOLOv3, which we trained on the same data and with the same labels as our previous detector. YOLO also solves the previous problem of differently sized images since it is fully convolutional, and therefore the inference time with YOLO is relatively consistent even when switching between differently sized images.

3.4 Vehicle tracking

For every frame the system processes, all detected vehicles are tracked using the object tracker described in section 3.4.1. The vehicle tracker stores each vehicle with a unique ID and keeps track of the following entities for each one:

– pred - The last prediction for this vehicle

– centroid - The centroid of the vehicle's bounding box

– most_common_plate - A tuple made up of a list of strings representing the most common plate readings and an integer representing the number of occurrences

– plate - A map that maps all previous plate readings to the number of their occurrences

– passenger_avg - An integer representing the estimated number of passengers calculated using the method described in section 3.6

– passenger_updates_count - The number of passenger count updates

The license plate reading acquired with the method described in section 3.5 and the number of passengers detected within the bounding box of each vehicle are then passed to the vehicle tracker and the vehicle's entities are updated.

3.4.1 Object tracking algorithm

The object tracking algorithm takes in a list of object predictions given by the neural network as described in section 2.4. From the bounding boxes of the predictions, centroids are calculated; each new centroid's euclidean distance to the previously stored centroids is computed and the ones closest to each other are considered the same object. If a centroid is not paired with any of the stored centroids it is considered a new object and is added to the list of objects with a new unique ID. When an object from the stored objects is not paired with any of the new objects it is marked as "disappeared". For each frame an object does not show up in, the disappeared counter is incremented, and when it reaches a given threshold the object is removed from the list of tracked objects. By keeping track of the disappeared objects we can buffer errors in the object detection itself, for example if an object is not detected in one or more of the frames it appears in.
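A minimal sketch of this centroid-matching idea is shown below. It is a simplified stand-in for the tracker used in the system (the greedy pairing and the max_disappeared value of 10 are illustrative choices):

```python
import numpy as np
from scipy.spatial import distance

class CentroidTracker:
    """Minimal sketch of centroid-based object tracking."""

    def __init__(self, max_disappeared=10):
        self.next_id = 0
        self.objects = {}        # id -> centroid (x, y)
        self.disappeared = {}    # id -> consecutive frames without a detection
        self.max_disappeared = max_disappeared

    def register(self, centroid):
        self.objects[self.next_id] = centroid
        self.disappeared[self.next_id] = 0
        self.next_id += 1

    def update(self, boxes):
        """boxes: list of (x1, y1, x2, y2) detections for the current frame."""
        centroids = [((x1 + x2) / 2.0, (y1 + y2) / 2.0) for x1, y1, x2, y2 in boxes]
        matched_ids, used_cols = set(), set()
        ids = list(self.objects.keys())
        if ids and centroids:
            # Distance from every stored centroid to every new centroid
            dists = distance.cdist(np.array([self.objects[i] for i in ids]),
                                   np.array(centroids))
            # Greedily pair the closest object/detection pairs first
            for row, col in sorted(np.ndindex(dists.shape), key=lambda rc: dists[rc]):
                if ids[row] in matched_ids or col in used_cols:
                    continue
                self.objects[ids[row]] = centroids[col]
                self.disappeared[ids[row]] = 0
                matched_ids.add(ids[row])
                used_cols.add(col)
        # Detections that matched nothing become new tracked objects
        for col, c in enumerate(centroids):
            if col not in used_cols:
                self.register(c)
        # Objects that matched nothing are aged and eventually dropped
        for i in ids:
            if i not in matched_ids:
                self.disappeared[i] += 1
                if self.disappeared[i] > self.max_disappeared:
                    del self.objects[i], self.disappeared[i]
        return self.objects
```

In the full system, each tracked ID additionally carries the plate-reading histogram and the passenger statistics listed in the previous section.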

3.5 Plate reading

To read a plate in an image, the plate must first be extracted, then the characters of the plate must be isolated, and finally the characters must be read and understood in the correct order.

3.5.1 Plate extraction

To extract the plates from an image we run the image through the object detector to detect the plates within the image, and each detected plate is cropped out.

Figure 10: Boxed plates.

Figure 11: Cropped plate 1.

Figure 12: Cropped plate 2.

After the plates have been cropped out they undergo a process called a four point transformation, which warps the cropped plate into a top-down "bird's eye view" of the plate.

Figure 13: Before warp. Figure 14: After warp.

We will now explain how this is done. To locate the plate itself within the cropped image, the image is first converted to gray scale and then two copies are made: one solely for locating the four corner points of the quadrilateral surrounding the plate and the other for the actual four point transformation. Both images are converted to binary, meaning they consist of only two color values, black (0) and white (255). The filter applied to the first image uses a universal threshold: if a pixel value is smaller than the threshold it is set to 0, otherwise it is set to the maximum value. This reduces noise while sharpening edges but at the same time distorts the characters, making them harder to read for the OCR (figure 15). The second image goes through a less aggressive filter using an adaptive threshold, where the threshold value is calculated for smaller regions and therefore different threshold values are used for different regions of the image (figure 16).

Figure 15: Binary image using a universal threshold

Figure 16: Binary image using an adaptive threshold
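The two binarization passes can be reproduced with OpenCV as follows (a sketch; the threshold of 127, the 31-pixel neighbourhood and the constant 10 are illustrative values, not the tuned ones):

```python
import cv2

gray = cv2.imread("cropped_plate.png", cv2.IMREAD_GRAYSCALE)  # placeholder file name

# Copy 1: single universal threshold over the whole image (for locating the plate outline)
_, universal_bin = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

# Copy 2: threshold computed per local neighbourhood (gentler on the characters)
adaptive_bin = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                     cv2.THRESH_BINARY, 31, 10)
```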

Next, all contours are located in the first image and converted to convex hulls. At this point the contours are represented as a series of points on the image and, since we are only interested in quadrilaterals with exactly four corners, polygons are approximated from the contours and only the ones with four points are kept. The four points are then used to crop out the license plate and fix the aspect ratio of the license plate in the second image, returning a rectangle containing just the license plate. All four steps are depicted in figure 17.

(a) Initially all the contours are detected. (b) Contours are converted into convex hulls. (c) Polygons are then approximated. (d) Only polygons with 4 corners are kept.

Figure 17: The four stages of converting the contours detected into a bounding box capturing the face of a licence plate.
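A compact sketch of this contour-to-warp pipeline is given below (function names, the 0.02 approximation factor and the fixed output size are illustrative choices):

```python
import cv2
import numpy as np

def find_plate_quadrilateral(binary):
    """Return the 4 corner points of the largest 4-sided contour, or None."""
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    best = None
    for cnt in contours:
        hull = cv2.convexHull(cnt)                         # smooth the contour
        peri = cv2.arcLength(hull, True)
        poly = cv2.approxPolyDP(hull, 0.02 * peri, True)   # approximate a polygon
        if len(poly) == 4 and (best is None or cv2.contourArea(poly) > cv2.contourArea(best)):
            best = poly
    return None if best is None else best.reshape(4, 2).astype("float32")

def warp_plate(image, corners, width=520, height=110):
    """Four point transform to a top-down view (size A plate proportions)."""
    # Order the corners: top-left, top-right, bottom-right, bottom-left
    s = corners.sum(axis=1)
    d = np.diff(corners, axis=1).ravel()
    src = np.float32([corners[s.argmin()], corners[d.argmin()],
                      corners[s.argmax()], corners[d.argmax()]])
    dst = np.float32([[0, 0], [width - 1, 0], [width - 1, height - 1], [0, height - 1]])
    M = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image, M, (width, height))
```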

The warping is not always perfect, as can be seen in figure 18, and therefore some filtering is needed. To do this we use a statistical filtering method, which entails plotting the column sum of the binary image as a function of the image's x-axis.

Figure 18: Statistical filtering.

Since the column-sum graph of the binary image of a real plate has a very distinctive shape, we can count the number of prominent peaks and valleys in the graph and, given some threshold, determine with good accuracy whether a warped plate is a true plate.
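A sketch of this filter is shown below; the peak-count bounds and the prominence value are illustrative and would in practice be tuned on labeled plates:

```python
import numpy as np
from scipy.signal import find_peaks

def looks_like_plate(binary, min_peaks=5, max_peaks=9):
    """Count prominent peaks in the column sums of a warped binary plate image."""
    profile = binary.sum(axis=0).astype(float)          # column sum along the x-axis
    profile = (profile - profile.min()) / (profile.max() - profile.min() + 1e-6)
    peaks, _ = find_peaks(profile, prominence=0.2)      # bright columns between characters
    return min_peaks <= len(peaks) <= max_peaks
```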

3.5.2 Character segmentation

Once the plate has been extracted we need to perform character localization in order to extract the characters from the plate. To do this we again use our object detector to detect the individual characters on the plate as well as the character blocks.

Figure 19: Character localisation

Figure 20: Block localisation

Before we can send the localized characters and blocks on their way to the OCR system, we first need to order the characters so the plate is read correctly. To achieve this we note that we can classify the plates into two groups, slim plates and fat plates. From figure 2 we see that size A is a slim plate, where all the characters form a single line, while sizes B, C and D are fat plates, where the characters are broken into two lines with two characters on the upper line and three on the lower line. To classify the plates we combine the two methods below (a small code sketch of both follows figure 26):

1. The first method uses the least squares method to find the line of best fit through the top-left point of each character; it then uses the normalized root mean square error (NRMSE) to determine if a plate is slim or fat. In a slim plate the top-left corners of the characters should line up almost perfectly and the NRMSE should therefore be very low; in a fat plate, on the other hand, the NRMSE will be much higher since the characters do not all line up in a single line.

Figure 21: Straight slim plate with NRMSE of 0.028

Figure 22: Slanted slim plate with NRMSE of 0.008

Figure 23: Fat plate with NRMSE of 1.45

2. The second method creates a line between the top-left corners of the character blocks and, from the line's slope, we determine if the plate is slim or fat. Again, since the blocks are lined up horizontally in slim plates the slope should be close to zero, but in fat plates the blocks are lined up vertically and thus the slope should be much higher, approaching infinity.

Figure 24: Straight slim plate with a slope of 0.0

Figure 25: Slanted slim plate with a slope of 0.1875

Figure 26: Fat plate with blocks perfectly lined up so the slope is approaching infinity
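The sketch below illustrates both tests side by side (the thresholds, the normalization used for the NRMSE and the combination rule are illustrative assumptions, not the exact values used in the system):

```python
import numpy as np

def plate_is_fat(char_points, block_points, nrmse_threshold=0.5, slope_threshold=1.0):
    """Classify a plate as fat (two-line) or slim (single-line).

    char_points:  (N, 2) array of top-left corners of the detected characters
    block_points: (2, 2) array of top-left corners of the two character blocks
    """
    # Method 1: fit y = a*x + b through the character corners and measure the NRMSE
    x, y = char_points[:, 0], char_points[:, 1]
    a, b = np.polyfit(x, y, 1)
    rmse = np.sqrt(np.mean((y - (a * x + b)) ** 2))
    nrmse = rmse / (y.mean() + 1e-6)        # normalization by the mean is one common choice
    # Method 2: slope of the line between the two block corners
    (x1, y1), (x2, y2) = block_points
    slope = abs((y2 - y1) / (x2 - x1)) if x2 != x1 else float("inf")
    return nrmse > nrmse_threshold or slope > slope_threshold
```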

3.5.3 Character recognition

The last piece of the plate reading is to actually recognize which character(s) an image contains. To do this we used a third-party library from Google called Tesseract, which specializes in reading images of characters. To fully utilize the power of Tesseract's neural network we generated an image file consisting of many pages of text in the same font as the license plates and then trained Tesseract's neural network on it to fine-tune the OCR. On top of that we also generated 10,500 images of actual extracted characters using the methods mentioned above and had Tesseract label those images for us, which we then inspected to fix any mislabeled characters; Tesseract was then trained on those character images as well.

3.6 Passenger counting

This part was relatively simple; given more time, it could have been made more sophisticated to increase the accuracy. Here we simply use our object detector to detect passengers within a vehicle, much like with the plate reading, and then we store the average passenger count per vehicle, rounded to the closest integer.

Figure 27: Passenger detection.

4 Experiments and evaluation

In this section we will explain the experiments and show the results. We start however by explaining the data we will be working with.

4.1 Data used in evaluation

We collected the data for our experiments ourselves, using a GoPro Hero4 Silver mounted on a car. The data collected consists of 23 videos in 1080p resolution at 24 frames per second. The 23 videos span a total of 3.5 hours of live traffic and capture most vehicle types along with many of the special-case license plates discussed in section 2.5. We had planned to collect the data on the grounds of the university campus, but due to the outbreak of the Covid-19 pandemic the campus grounds were closed down. As we recorded in public, and to respect privacy, we neither use nor report the date and time of the recordings. However, since identifying the licence plates of the cars is the goal of our task, we do not blur the licence plates in the images. We would like to note, however, that we never identified the owners of the cars nor worked with any personal information other than identifying the licence plate from the images.

4.1.1 The training data

From the videos mentioned in section 4.1, 6,633 frames were hand-picked and labeled with a total of 47,301 tags for the training of the neural networks. Of the 6,633 aforementioned images, 4,502 were whole frames labeled with the tags Vehicle, Person, Plate and Windshield. The other 2,131 images were individual plates cropped from the frames with the labels character and character block; these images are of varying sizes depending on the plate's location relative to the camera. As discussed in section 3.5.3, we are using Tesseract (see section 2.6.5 for more detail) and, because it offers custom training from images, another two data sets were generated. The first set was generated using the font "Helvetica Neue", which is the font used on the Icelandic license plates, and consists of 10,000 pages of text where each character was treated as an image. The second training set was made up of 10,500 images of license plate characters cropped out of the actual license plates in the data using an object detector.

4.1.2 The testing data

To measure the system's accuracy in both passenger identification and license plate reading, we created a test set. The test set consisted of clips from three videos containing 37 vehicles in total. The 37 vehicles in the test set are fairly representative of the rest of the data, and from observation we felt that this would suffice as an adequate test set. The ground truth values of the passenger count and license plate reading for each vehicle were annotated for later comparison. To test the accuracy of the different versions of Tesseract, an additional 465 images of license plate characters were generated from the data and labeled with their true values.

4.2 Evaluating passenger counting

In section 3.3 we discussed the three different versions of object detectors that were implemented; however, due to the first version's lack of performance, all further testing of that object detector was discontinued (for a more detailed explanation see section 4.5). All following tests involving the object detectors therefore refer to the object detectors described in section 3.3.1. Because detecting passengers in the backseat of a moving car is nearly impossible, all ground truth values for passenger counts were either one or two, one indicating only a driver present and two indicating a driver and a front-seat passenger. Following are the results for the two different object detectors.

4.2.1 Faster R-CNN

The Faster R-CNN object detector was able to predict the correct number of passengers in 33 out of 37 cases, resulting in 89% accuracy.¹ In an attempt to improve upon this, a different version of the object detector was tested, where the windshield of each vehicle was cropped out and sent to a separate neural network for passenger detection. This version had equal accuracy but took a big hit in performance because of the additional overhead of the added network, similar to what we discussed in section 3.3. Therefore further work and testing on this version was discontinued.

¹ In all cases of wrong passenger predictions, the object detector failed to detect a second passenger.

(a) Both passengers detected. (b) Phantom passenger detected.

Figure 28: Examples of using Faster R-CNN for passenger detection.

In figure 28b we see one of the few cases where a false positive passenger detection occurred.

4.2.2 YOLO

The YOLO object detector predicted the correct number of passengers in 30 cases out of 37, resulting in 81% accuracy. However, the object detector failed in every case where more than one passenger was present. In most of the failure cases the object detector did, in some frames, detect both passengers, but not often enough to raise the average passenger count above 1.5.

(a) Both passengers detected. (b) Second passenger not detected.

Figure 29: Examples of using YOLO for passenger detection.

In figure 29b we can see that the second passenger is not detected and therefore the average passenger count decreases.

4.3 Evaluating OCR

When testing the accuracy of the Tesseract OCR, five versions were tested. Tesseract comes pre-trained with more than 100 languages trained on over 4500 different fonts; included in those languages are Icelandic and English, which make up two of the five versions tested. The other three versions were custom-trained versions using the two data sets mentioned in section 4.1.1: the first of those three versions was trained on images generated from a font file with the same font as the license plates, the second version was trained on actual extracted images of characters from license plates, and the third was trained on a combination of those two data sets. Each version was tested on the 465 unseen actual character images mentioned in section 4.1.2 and its accuracy, correct/total, was recorded.

• Pre-trained Icelandic had an accuracy of 307/465 = 0.66

• Pre-trained English had an accuracy of 323/465 = 0.694

• Custom training from font had an accuracy of 352/465 = 0.757

• Custom training from images had an accuracy of 399/465 = 0.858

• Custom training from font + images had an accuracy of 419/465 = 0.901

Figure 30: Visualization of quality (y-axis) for the various Tesseract configurations (x-axis).

From figure 30 we can clearly see that training Tesseract paid off, with accuracy rising from about 70% to 90%. The accuracy for each individual character, as well as which characters the recognizer mistook for one another, is recorded in the table in appendix 7.1.

4.4 Evaluating license plate reading

When testing our system's ability to read license plates, we use the test set mentioned in section 4.1.2 and focus on the most common license plate reading given over the consecutive frames for each tracked vehicle. We then compare that reading to the ground truth value to determine our system's accuracy. Because a big part of the license plate reading is the OCR, the system analyzed the videos fitted with the best version of the OCR, identified by the experiment shown in section 4.3. The object detectors were trained to predict both the location of individual characters and segments containing blocks of characters, so the experiment was run twice for each of the object detectors, having the OCR read either character by character or block by block.

4.4.1 Faster R-CNN
1. Plate reading per character had an accuracy of 30/37 = 0.811
2. Plate reading per block had an accuracy of 29/37 = 0.784

(a) Incorrect license plate reading. (b) License plate reading success.

Figure 31: Examples of using Faster R-CNN for license plate reading.

In all cases the plates were detected successfully. In almost all failure cases the plate reading was off by one character, swapping visually similar characters, e.g. "1 ↔ I", "8 ↔ B" and "E ↔ F"; an example of this can be seen in figure 31a where "VI875" is read as "V1875". A few of the failure cases involved the OCR returning more than one reading per character or block, e.g. "ALP49 → ALLP49" and "R560 → R560560".

4.4.2 YOLO
1. Plate reading per character had an accuracy of 30/37 = 0.811
2. Plate reading per block had an accuracy of 25/37 = 0.676

The YOLO object detector was not able to detect the license plates or the vehicles in all frames, resulting in fewer readings per vehicle and therefore lower accuracy.

(a) Vehicle not detected. (b) Vehicle detected.

Figure 32: Examples of using YOLO for license plate reading.

In figure 32a the vehicle is not detected and therefore neither the passenger count nor the plate reading is registered.

4.5 Performance evaluation of the object detectors The main requirement for real-time traffic monitoring is that the system needs to be able to process video fast enough to be considered real time. In this section we look into the processing speed of the three object detectors mentioned in section 3.3 and explain why only two of them were considered for further testing.

4.5.1 Faster R-CNN x3 The first implementation ran two Faster R-CNN neural networks trained on the two data sets described in section 4.1.1, and a third pre-trained one mainly for detecting vehicles for object tracking. As mentioned in section 3.3, this version experienced a lot of thrashing and was only able to process 1 frame every 4 seconds on average. Because of this poor performance it was obvious that a different approach was needed, one that utilized only a single neural network.

4.5.2 Faster R-CNN The second implementation used a single Faster R-CNN object detector made by merging the three detectors described in section 4.5.1; for a detailed description of how the data was merged see section 3.3.1. This version managed to process 14 frames per second, with slowdowns only occurring because Faster R-CNN handles images of varying sizes poorly.

4.5.3 YOLO The third and final implementation of the object detector was a YOLOv3 neural network, trained on the same data as the second implementation described in section 4.5.2. This implementation achieved an impressive 28 frames per second, processing our video faster than its normal playback. YOLO offers significant improvements in inference speed at the cost of accuracy, as can be seen in sections 4.2 and 4.4 and visualized in figure 1.
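As a rough way of verifying such throughput numbers, the processing speed can be measured by timing the full per-frame pipeline over a test video. The following is a minimal sketch, assuming OpenCV for decoding and a placeholder process_frame function standing in for detection, tracking and OCR.

    import time
    import cv2

    def measure_fps(video_path, process_frame, max_frames=500):
        """Report average frames per second of the full per-frame pipeline."""
        capture = cv2.VideoCapture(video_path)
        frames = 0
        start = time.perf_counter()
        while frames < max_frames:
            ok, frame = capture.read()
            if not ok:
                break
            process_frame(frame)  # detection + tracking + OCR would go here
            frames += 1
        capture.release()
        elapsed = time.perf_counter() - start
        return frames / elapsed if elapsed > 0 else 0.0

    # fps = measure_fps("test_drive.mp4", run_detector)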

5 Future work

From the results seen in section 4 we can see that real-time traffic monitoring is possible. However, further work is needed to increase both the accuracy and the inference speed of our system. The final version is not lightweight enough to be deployed on a small form-factor computer like the Jetson Nano, but with some of the proposed improvements discussed in this chapter we hope this goal is achievable. In the following subsections we will discuss possible changes and further development. We start by looking at the hardware, followed by data and labeling, and finally a few sections on various software changes.

5.1 Hardware changes A single Jetson Nano is not capable of running our current system. While it could be possible to divide the work between several such machines, we think it would be better to get a more capable device, for example the Nvidia Jetson Xavier NX. If the multi-device option were taken, each computer could run a small, specialized ANN. For example, the first one could crop the image, while the others could each run a specialized ANN for a specific task, and those nodes could be run in parallel. In this way each network can be kept smaller and with less overhead, but it will require a more complex design and synchronization. Another option would be to do pre-filtering on a single Jetson Nano, or any other single-board computer, and then pass the remaining workload over to a desktop machine or a cloud service for the heavier processing.

Figure 33: On-campus location initially intended for data gathering.

5.2 Data and training As mentioned before, because of the Covid-19 pandemic only a small portion of the data gathered for training and testing was acquired at the ideal location where we first intended to set up the camera. The two locations on the map are both on school property and neither of them captures the public highway, making it legal to film with the school's permission. After the school closed down, the traffic in these areas dropped to almost zero. As a workaround, the rest of our data was captured with a car-mounted GoPro camera. Some of the footage was captured while the vehicle was stationary, but a reasonable portion was captured while driving, which introduced some shake and distortion. Even though quite a lot of footage was gathered, it was of variable quality and was missing some weather conditions like heavy rain and snow. Using the camera intended for the final product, in a location and environment similar to the final one, should give better results for future data gathering and training.

5.3 Different types of DNNs In the most accurate version of our solution we were running a single Faster R-CNN network to detect all objects (vehicle, person, license plate, windshield, character and character block). This also includes the segmentation of the license plate, plus a custom trained version of Tesseract OCR for character recognition. However, as mentioned in section 3.3.1, this solution did not handle variable-sized images very well. To address this problem there are three options:

1. Transformation: The first option would call for optimizing the network with TensorRT and allowing for two or more cached engines, depending on the number of unique image sizes to be processed. If the network were trained on two or more different image sizes, where the plates had been scaled to a fixed width and then padded to a variable height, then the network could hopefully learn to segment characters and blocks of characters in those images. The number of fixed image widths could be more than one, e.g. for the different types of plates. We started working on this, but early attempts did not work as expected, probably because they require more training data.

2. Network Property: Fully convolutional networks, or FCNs, do not have a fully connected layer and therefore do not care about varying input size. Networks like R-FCN, SSD and some variants of YOLO are FCNs and would be a suitable choice [21]. FCNs do trade some accuracy for a big improvement in inference speed. This can be seen in section 4, where we see a significant decrease in correct predictions when using YOLOv3 as our object detector but a large improvement in inference speed.

3. Segmentation without DNNs: It is possible to segment the characters and character blocks without the use of DNNs, given that the four-point transformation is good. The four-point transformation can be tweaked to work better at a given location with enough data from that same location [22] (a sketch of such a transformation is shown below).
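To illustrate the four-point transformation mentioned in option 3, the following is a minimal OpenCV sketch that rectifies a license plate given its four corner points. The corner ordering and the fixed output size are assumptions made for the example, not values taken from our implementation.

    import cv2
    import numpy as np

    def rectify_plate(image, corners, out_width=260, out_height=52):
        """Warp a license plate to a fronto-parallel view.

        corners: 4x2 array of plate corners, assumed to be ordered
        top-left, top-right, bottom-right, bottom-left.
        """
        src = np.asarray(corners, dtype=np.float32)
        dst = np.array([[0, 0],
                        [out_width - 1, 0],
                        [out_width - 1, out_height - 1],
                        [0, out_height - 1]], dtype=np.float32)
        matrix = cv2.getPerspectiveTransform(src, dst)
        return cv2.warpPerspective(image, matrix, (out_width, out_height))

    # plate = rectify_plate(frame, detected_corners)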

5.4 Cloud solutions Similar to the idea of distributing the load between single-board computers, if the on-site computer ran only a small network to crop the initial frame, then the smaller images could be sent to a cloud inference service. TensorFlow Serving, Amazon Elastic Inference and Google Cloud AI all offer cloud inference; the price of these services depends on the machine chosen to do the inferencing.

5.5 OCR The OCR used in our system was Tesseract, chosen for its accuracy and compatibility. Tesseract is one of many free OCR engines available; other options include Calamari, Kraken and OCRopus. If we expand into cloud solutions for inferencing, we could also utilize cloud OCR services such as ABBYY Cloud, Google Cloud Vision and Microsoft Azure Computer Vision, all of which incur some cost.

5.6 Training Tesseract The out-of-the-box accuracy of Tesseract is acceptable, but much better results can be obtained by training it. Tesseract can be trained both from a specific font, as done for our system, and from images. Training from images requires extracting and labeling images of every unique character on the license plates; although this is much more time consuming, it has the potential to give much better results. Tesseract can be used to give initial guesses on what characters are in the images, so that a supervisor only needs to correct its mistakes. In section 4.3 we show that our best results were obtained using a custom trained version of Tesseract, trained on both the expected font and images from the data. With a larger data set and by fine-tuning the training process this accuracy could be improved upon further.
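As an illustration of the font-based route, character images can be generated directly from a font file before being fed to Tesseract's training tools. The following is a minimal sketch assuming Pillow and a hypothetical licence-plate font file path; the image size and margins are arbitrary choices for the example.

    import os
    import string

    from PIL import Image, ImageDraw, ImageFont

    def generate_char_images(font_path, out_dir, size=64):
        """Render each plate character from a font file as a training image."""
        os.makedirs(out_dir, exist_ok=True)
        font = ImageFont.truetype(font_path, size)
        for char in string.ascii_uppercase + string.digits:
            image = Image.new("L", (size + 16, size + 16), color=255)
            draw = ImageDraw.Draw(image)
            draw.text((8, 8), char, font=font, fill=0)  # black glyph on white
            image.save(os.path.join(out_dir, f"{char}.png"))

    # generate_char_images("plate_font.ttf", "font_chars")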

5.7 Improving OCR with explicit rules We could improve the OCR's performance by adopting rules about how the Icelandic plate is constructed. As described in section 2.5, the traditional plate consists of two letters, then a letter or a digit, and then two digits. Knowing this, we could improve Tesseract's output: if Tesseract, for example, predicts a B in a position where a digit is expected, then that digit is most likely an 8. A problem with this approach, however, is that special number plates do not follow these rules, so we would first need to be able to distinguish a special number plate from a traditional one.
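A minimal sketch of such rule-based correction for a traditional plate is shown below. The confusion pairs are partly taken from the errors observed in section 4.4 and partly added for illustration; the positional pattern and the function name are illustrative and assume the plate has already been identified as a traditional one.

    # Plausible letter/digit confusions (some observed in section 4.4).
    TO_DIGIT = {"B": "8", "I": "1", "O": "0", "S": "5", "Z": "2"}
    TO_LETTER = {v: k for k, v in TO_DIGIT.items()}

    def correct_traditional_plate(reading):
        """Apply positional rules for a traditional Icelandic plate.

        Assumed pattern: letter, letter, letter-or-digit, digit, digit.
        Positions that already match the pattern are left untouched.
        """
        if len(reading) != 5:
            return reading  # cannot apply the rules, leave as-is
        chars = list(reading)
        for i in (0, 1):                       # must be letters
            if chars[i].isdigit():
                chars[i] = TO_LETTER.get(chars[i], chars[i])
        for i in (3, 4):                       # must be digits
            if chars[i].isalpha():
                chars[i] = TO_DIGIT.get(chars[i], chars[i])
        return "".join(chars)

    print(correct_traditional_plate("AB8I2"))  # -> "AB812"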

5.8 Combining YOLO and OCR One idea to improve inference time is, instead of detecting generic characters and passing them to a separate OCR, to train YOLO itself to detect and distinguish between all possible characters, thereby cutting out a separate OCR step and reducing both the inference time and the complexity of the overall system. We do fear that this could result in lower character-recognition accuracy, but it would be an interesting idea to test nonetheless.

5.9 Optimization of the object detector There are ways to optimize a trained object detection model, for example with NVIDIA's TensorRT, an SDK that can optimize neural network models trained in all major frameworks. For YOLO this could increase the inference speed by up to 70% [23].
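As an illustration of what such an optimization step can look like, the following is a minimal sketch that builds a TensorRT engine from an ONNX export of a detector, assuming the TensorRT 7 Python API with FP16 enabled. The model file name is an assumption, and a real deployment would also need code to run the engine and, for INT8, a calibration step.

    import tensorrt as trt

    def build_engine(onnx_path, fp16=True):
        """Build an optimized TensorRT engine from an ONNX model."""
        logger = trt.Logger(trt.Logger.WARNING)
        builder = trt.Builder(logger)
        flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        network = builder.create_network(flags)
        parser = trt.OnnxParser(network, logger)
        with open(onnx_path, "rb") as model:
            if not parser.parse(model.read()):
                raise RuntimeError("Failed to parse the ONNX model")
        config = builder.create_builder_config()
        config.max_workspace_size = 1 << 30  # 1 GiB of build workspace
        if fp16 and builder.platform_has_fast_fp16:
            config.set_flag(trt.BuilderFlag.FP16)
        return builder.build_engine(network, config)

    # engine = build_engine("yolov3_plates.onnx")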

5.10 Improve passenger counting As of now we are only using our object detector to detect passengers in vehicles and count them. We could extend this process by, for example, extracting the front windshield of each car, much like we do with the license plates, performing some pre-processing on it and then running that crop through the object detector again.
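A minimal sketch of that idea is shown below: the windshield bounding box is cropped from the frame and contrast enhancement is applied before a second detection pass. The use of CLAHE is an illustrative choice for the pre-processing step, not something we evaluated.

    import cv2

    def preprocess_windshield(frame, box):
        """Crop the windshield and boost contrast before re-detection.

        box: (x, y, w, h) bounding box of the windshield in the frame.
        """
        x, y, w, h = box
        crop = frame[y:y + h, x:x + w]
        gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
        # CLAHE evens out reflections and shadows on the glass.
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        return clahe.apply(gray)

    # enhanced = preprocess_windshield(frame, windshield_box)
    # passengers = detector.detect(enhanced)  # hypothetical second pass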

5.11 Statistical filtering The statistical filtering function could be greatly enhanced by introducing a better tie breaker for images from the same frame. In the latest version of the system, if two images both contain enough prominent peaks to be considered a plate, the image with the lower average color value is considered cleaner. This presumption is based on the assumption that the cleanest possible image should contain only the characters on a white background.
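A minimal sketch of that tie breaker, assuming two grayscale candidate crops from the same frame; the comparison follows the description above (lower mean value wins) and the function name is illustrative.

    import numpy as np

    def pick_cleaner_candidate(crop_a, crop_b):
        """Tie breaker between two plate candidates from the same frame.

        Both crops are grayscale numpy arrays. The crop with the lower
        average pixel value is treated as the cleaner reading, following
        the heuristic used in the current system.
        """
        return crop_a if np.mean(crop_a) <= np.mean(crop_b) else crop_b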

6 Conclusion

In this thesis our goal was to evaluate the feasibility of a real-time traffic monitoring system using state-of-the-art computer vision technology. Our aim was both to count the number of passengers and to read the license plate in order to obtain information about the vehicle. With this information, along with access to the Icelandic department of motor vehicles' online look up service, we can build interesting services that can guide and/or coerce people towards more eco-friendly options. As the results in chapter 4 indicate, it is indeed possible to read the license plate and detect the passengers of a traveling vehicle. Using Faster R-CNN we showed that our system is able to predict the correct number of passengers with an accuracy of 89% and is capable of recognizing and reading license plates in their entirety with an accuracy of 81%. The angle of the camera to the road matters. In our experience, traffic that faced the camera head-on was detected at a much higher rate. This holds for detecting passengers but especially for reading the license plate, where the accuracy was close to 100%. The only caveat is that back seat passengers become near impossible to see. We demonstrate in section 4.3 that an OCR can be trained to recognize Icelandic license plate characters with an accuracy above 90%, and with more diverse data we hope to achieve even higher accuracy. Both of the methods evaluated, character and block reads, have the same primary problem, namely that they confuse particular letters and numbers (see section 4.4). However, taking the structure of the Icelandic license plates into account, we think the accuracy could be improved upon. The other deep learning candidate, YOLO, showed promising results, with faster inference and the ability to run in real time. It was able to match the Faster R-CNN detector's accuracy in plate reading but came up short when trying to detect passengers. Thus, it will require further training and evaluation before clear superiority is established. The largest outstanding issue is our lack of diversity in the training and evaluation data. This shortcoming was largely due to the Covid-19 outbreak and our need to change our plans for data gathering. Difficult conditions, such as rain, snow or twilight, have not been evaluated and thus we feel that it is too early to declare any one algorithm superior to another.

References

[1] Martin, "UN Report: Nature's Dangerous Decline 'Unprecedented'; Species Extinction Rates 'Accelerating'," library Catalog: www.un.org. [Online]. Available: https://www.un.org/sustainabledevelopment/blog/2019/05/nature-decline-unprecedented-report

[2] "Humans are causing life on Earth to vanish," library Catalog: www.nhm.ac.uk. [Online]. Available: https://www.nhm.ac.uk/discover/news/2019/december/humans-are-causing-life-on-earth-to-vanish.html

[3] "Hagstofan: Fjöldi ökutækja og eldsneytisnotkun 1995-2018," library Catalog: hagstofa.is. [Online]. Available: https://hagstofa.is/utgafur/frettasafn/samgongur/fjoldi-okutaekja-og-eldsneytisnotkun/

[4] “Machine learning,” May 2020, page Version ID: 954641710. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Machine_learning&oldid=954641710

[5] A. Kleiner and R. C. Kurzweil, “A Description of the Kurzweil Reading Machine and a Status Report on Its Testing and Dissemination,” Bulletin of Prosthetics Research, p. 10, 1977.

[6] "Viola–Jones object detection framework," Feb. 2020, page Version ID: 942582711. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Viola%E2%80%93Jones_object_detection_framework&oldid=942582711

[7] “Overfitting,” Apr. 2020, page Version ID: 953810115. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Overfitting&oldid=953810115

[8] "Introduction to basic object detection algorithms - Heartbeat." [Online]. Available: https://heartbeat.fritz.ai/introduction-to-basic-object-detection-algorithms-b77295a95a63

[9] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks,” arXiv:1312.6229 [cs], Feb. 2014, arXiv: 1312.6229. [Online]. Available: http://arxiv.org/abs/1312.6229

[10] "R-CNN, Fast R-CNN, Faster R-CNN, YOLO — Object Detection Algorithms." [Online]. Available: https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e

[11] A. Ouaknine, "Review of Deep Learning Algorithms for Object Detection," Feb. 2018, library Catalog: medium.com. [Online]. Available: https://medium.com/zylapp/review-of-deep-learning-algorithms-for-object-detection-c1f3d437b852

[12] T. Shah, "About Train, Validation and Test Sets in Machine Learning," Dec. 2017, library Catalog: towardsdatascience.com. [Online]. Available: https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7

[13] "Cross-validation (statistics)," Apr. 2020, page Version ID: 952753784. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Cross-validation_(statistics)&oldid=952753784

[14] "Umferðarstofa | 3.7 Stærð og gerð." [Online]. Available: http://wayback.vefsafn.is/wayback/20060304210936/http://www.us.is/id/3613

[15] "Vehicle registration plates of Iceland," Jul. 2019, page Version ID: 907526187. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Vehicle_registration_plates_of_Iceland&oldid=907526187

[16] "Umferðarstofa | 3.10.3 Einkamerki." [Online]. Available: http://wayback.vefsafn.is/wayback/20060304210956/www.us.is/id/3572

[17] "Optical character recognition," Apr. 2020, page Version ID: 951620985. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Optical_character_recognition&oldid=951620985

[18] Victoria, “How does OCR work? | Tips for Users,” May 2016, library Catalog: www.scan2cad.com Section: Tips and Advice for Users. [Online]. Available: https://www.scan2cad.com/tips/how-does-ocr-work/

[19] G. Shperber, "A gentle introduction to OCR," Apr. 2020, library Catalog: towardsdatascience.com. [Online]. Available: https://towardsdatascience.com/a-gentle-introduction-to-ocr-ee1469a201aa

[20] "Overview of the new neural network system in Tesseract 4.00," library Catalog: tesseract-ocr.github.io. [Online]. Available: https://tesseract-ocr.github.io/tessdoc/NeuralNetsInTesseract4.00.html

[21] Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: Fully Convolutional One-Stage Object Detection,” arXiv:1904.01355 [cs], Aug. 2019, arXiv: 1904.01355. [Online]. Available: http://arxiv.org/abs/1904.01355

[22] Y. Zhang and C. Zhang, "A new algorithm for character segmentation of license plate," in IEEE IV2003 Intelligent Vehicles Symposium. Proceedings (Cat. No.03TH8683), Jul. 2003, pp. 106–109.

[23] Alexey, “AlexeyAB/darknet,” May 2020, original-date: 2016-12-02T11:14:00Z. [Online]. Available: https://github.com/AlexeyAB/darknet

7 Appendix

7.1 Character prediction table This table displays the character predictions when using the Tesseract version trained on both real plate character images and character images generated from a font file. It shows the accuracy for each character, as well as which characters the recognizer mistakes for one another.

Character | Predictions | Total True | Total False | Accuracy
A | A: 20, "": 1 | 20 | 1 | 0.95
B | B: 5 | 5 | 0 | 1.0
C | C: 1 | 1 | 0 | 1.0
D | O: 2, D: 10, 0: 1 | 10 | 3 | 0.77
E | E: 3 | 3 | 0 | 1.0
F | F: 21, 3: 1, "": 1 | 21 | 2 | 0.91
G | G: 4 | 4 | 0 | 1.0
H | H: 4, 2: 1, "": 1 | 4 | 2 | 0.66
I | 1: 8, "": 1 | 0 | 8 | 0.0
J | J: 10, Y: 1 | 10 | 1 | 0.91
K | K: 22 | 22 | 0 | 1.0
L | L: 7 | 7 | 0 | 1.0
M | M: 7, "": 4 | 7 | 4 | 0.64
N | H: 1, N: 11 | 11 | 1 | 0.92
O | O: 2, Z: 1 | 2 | 1 | 0.67
P | P: 9 | 9 | 0 | 1.0
R | R: 9, "": 2, 1: 1 | 9 | 3 | 0.75
S | S: 16 | 16 | 0 | 1.0
T | T: 7, 1: 2, 7: 1 | 7 | 3 | 0.7
V | V: 10, Y: 1 | 10 | 1 | 0.91
U | U: 20, J: 1 | 20 | 1 | 0.95
X | X: 13, 7: 1, R: 1 | 13 | 2 | 0.87
Y | Y: 18, "": 1 | 18 | 1 | 0.95
Z | Z: 9 | 9 | 0 | 1.0
0 | 0: 23, C: 1, D: 1 | 23 | 2 | 0.92
1 | 1: 23, "": 1 | 23 | 1 | 0.96
2 | Z: 3, 2: 8 | 8 | 3 | 0.73
3 | 3: 24 | 24 | 0 | 1.0
4 | 4: 16 | 16 | 0 | 1.0
5 | "": 1, 5: 14 | 14 | 1 | 0.93
6 | 6: 13, "": 1, 8: 1 | 13 | 2 | 0.87
7 | 7: 17 | 17 | 0 | 1.0
8 | 8: 21, B: 2, 3: 1 | 21 | 3 | 0.88
9 | 9: 19, 3: 1, S: 1 | 19 | 2 | 0.90
