on Raspberry Pi3 for Face Recognition

by

Kollu Nimshi

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Engineering in Microelectronics and Embedded Systems

Examination Committee: Dr. Mongkol Ekpanyapong (Chairperson)
Prof. Matthew N. Dailey
Dr. A.M. Harsha S. Abeykoon

Nationality: Indian
Previous Degree: Bachelor of Technology in Electronics and Communication Engineering, Jawaharlal Nehru Technological University, Hyderabad, Telangana, India

Scholarship Donor: AIT Fellowship

Asian Institute of Technology
School of Engineering and Technology
Thailand
December 2019

Acknowledgements

My heartfelt thanks to my dear advisor Dr. Mongkol Ekpanyapong. I could not have done this without his direct guidance and technical support. He steered me in the right direction whenever I needed it.

I would also like to thank my committee members Prof. Matthew N. Dailey and Dr. A.M. Harsha S. Abeykoon for their encouragement and insightful comments. My sincere thanks also go to Mr. Chatchai Pruetong, Mr. Amit Prasad Nayak, Mr. Clifford, and Mr. Sahuri Bond for their participation in my project and all other technical help. I express my profound gratitude to my family for their unfailing support and constant encouragement, and to my friends at AIT for being very kind to me during my stay here.

Kollu Nimshi December 2019

Abstract

In the present context, there is a major need for intelligent security systems based on face recognition. A valid question is: why implement face recognition as the intelligent security system? This work is an effort to implement such a system on low-power edge devices like the Raspberry Pi3 and to improve the accuracy of the face recognition software. Even small changes in lighting or orientation can reduce overall recognition performance and lead to more false positives. Although the system can be implemented on powerful CPU or GPU machines, that is not the best solution: such machines are large, consume more power, and increase the cost and complexity of maintenance. Bringing this application to embedded single-board computers is therefore very important. Edge computing, enabled by reducing deep learning model size, is the coming future of the embedded systems field, and we investigate how to build an intelligent system on low-power devices.

To increase recognition accuracy, Deep Neural Networks (DNNs) play a vital role in implementing deep learning based tasks. Earlier systems in this area have relied on two factors: (i) end-to-end learning for the task using a Convolutional Neural Network (CNN), and (ii) the availability of large-scale training datasets. After training the CNN on a desktop PC, we employ a Raspberry Pi 3 Model B for image classification. However, running a CNN model with millions of free parameters on a low-power embedded device is a much more complex and challenging objective. This constitutes a challenge for embedded vision systems performing edge inference as opposed to cloud processing.

This led to the idea of using a Neural Compute Stick for edge inference to accelerate performance on the Raspberry Pi3. The Intel Neural Compute Stick (NCS) provides a possible route for running large-scale neural networks on a low-cost, low-power, portable unit. Computer vision has made it possible to acquire, process, analyze, and extract high-level understanding from digital images and videos. Researchers are also looking at ways to apply the latest advances in facial recognition technologies to uncontrolled environments, where the success rate is at most only about 50%.

In this study, the Facenet model with a one-shot learning algorithm is implemented for face recognition and verification on the Raspberry Pi3. The system avoids running the complete trained Facenet model on the Pi3 by converting this large model into the Intel NCS graph format and the OpenVINO model format using the Intel NCSDK tools and the OpenVINO Model Optimizer. With the NCS API and the Inference Engine API we are able to perform inference on the Pi3, thereby improving the speed of recognition of objects and faces. The goal of this experiment is to describe a simple and easy hardware implementation of a face recognition system on the Raspberry Pi3 that runs a model trained on custom datasets. The system is programmed in Python and is operated and controlled by the Raspberry Pi3 with a USB camera.

Keywords: Intelligent Security System, Face Recognition, Facenet and Dlib, Deep Neural Networks, Convolutional Neural Networks, Embedded Vision Systems, Raspberry Pi3, Edge Inference, Intel Neural Compute Stick (NCS), NCSDK and Inference Engine API, OpenVINO Model Optimizer, NCS Graph, OpenVINO Models, Python


Table of Contents

Chapter Title Page

Title Page i
Acknowledgements ii
Abstract iii
Table of Contents iv
List of Figures vi
List of Tables ix
List of Abbreviations x

1 Introduction 1
  1.1 Overview 1
  1.2 Problem Statement 3
  1.3 Objectives 4
  1.4 Scope and Limitations 4
  1.5 Thesis Outline 5

2 Literature Review 6
  2.1 Background 6
  2.2 Challenges of Face Recognition Algorithms and How Deep Learning Algorithms Can Solve Them; Outline of Deep Face Architecture 11

3 Methodology 12
  3.1 Overview 12
  3.2 Data Collection 14
  3.3 Data Pre-Processing 16
  3.4 Main Drawback of Implementing the Facenet Model on Embedded Devices 25
  3.5 Neural Compute Stick Platform 27
  3.6 Model Optimizer 30
  3.7 Procedure to Convert Facenet Model to NCSDK Graph Format 31
  3.8 Face Recognition on Raspberry Pi3 Using OpenVINO Toolkit 38
  3.9 Software 44

4 Experimental Results 46
  4.1 Overview 46
  4.2 Face Recognition Results on Raspberry Pi3 without Using Intel NCS 46
  4.3 Face Recognition Results on Raspberry Pi3 Using Intel NCS and NCSDK 50

5 Conclusion, Recommendations and Future Works 80
  5.1 Conclusion 80
  5.2 Recommendations and Future Works 80

References 81

List of Figures

Figure Title Page

1.1 A System Set Up on Raspberry Pi3 Using Intel NCS 2
2.1 OpenFace vs. Earlier Non-Exclusive Face Recognition Implementations 8
3.1 Workflow Representation of Methodology 13
3.2 Training Image Folder 14
3.3 Data Collection Corresponding to Nimshi Label 15
3.4 Hyperparameter Values 16
3.5 Screenshot Taken during Pre-Processing the Facenet Model 16
3.6 Face Detection Outputs after Applying MTCNN Algorithm 17
3.7 Bounding Boxes for Face Detection of 4 Persons 18
3.8 .npy Files 18
3.9 Pre-Trained CNN Model 19
3.10 Screenshot Taken during Training the Facenet Model 20
3.11 Face Embedding Matrix Values of 4 Persons 21
3.12 Facenet Classifier Model File 22
3.13 Face Recognition Outputs of 4 Persons 23
3.14 Flowchart Representation of Training and Testing Facenet Model with Custom Data 24
3.15 Size of My Trained Model (.pb) and Classifier Model (.pkl) 25
3.16 Time Taken to Load Trained Model (.pb) on Raspberry Pi3 26
3.17 Time Taken to Load Trained Model (.pb) on Raspberry Pi3 27
3.18 Implementation of the Myriad2 VPU Used within the Neural Compute Stick (NCS) Platform 28
3.19 Illustration of Using Intel NCS to Develop a DNN-Based Embedded System 29
3.20 Live Object Detection on Raspberry Pi3 Using Intel NCS 30
3.21 Intel NCSDK and OpenVINO Architecture 31
3.22 Facenet Checkpoint Files 32
3.23 Facenet Graph File after Compiling and Its Size 34
3.24 Simple Inference Code Flow 35
3.25 Facenet Model View 36
3.26 OpenVINO Model Size Properties 37
3.27 Neural Compute Stick and Neural Compute Stick 2 38
3.28 Myriad X Architecture 39
3.29 Command to Convert to OpenVINO FP16 Format 40
3.30 Successful Conversion to FP16 OpenVINO IR Models 40
3.31 Visualization of Network Topology of .xml File 41
3.32 … to Pooling of Different Data … 42
3.33 Reshape to Normalization Layer Showing Different Data Size 43
3.34 OpenVINO .xml Model Structure 44
3.35 Transferred OpenVINO Models to Raspberry Pi3 44
3.36 Flowchart for OpenVINO Face Recognition Algorithm 45
4.1 Face Recognition Results on Raspberry Pi3 without Using Intel NCS 47
4.2 Counting Number of Times a Face Is Recognized to Generate Confusion Matrix 48
4.3 Implementation on Raspberry Pi3 Using Intel NCS 50
4.4 Face Recognition Results of 5 Persons under Lighting 51
4.5 Face Recognition Results of 5 Persons under Low Lighting 52
4.6 Face Recognition Results under Different Emotions 53
4.7 Python Code for Calculating Difference between 2 Images 54
4.8 Distance Calculation Based on Threshold Value as Shown in Raspberry Pi3 Shell 54
4.9 Implementation on Raspberry Pi3 Using Intel OpenVINO Inference Engine 55
4.10 Face Recognition Results on Raspberry Pi3 Using Intel OpenVINO Method 56
4.11 Face Recognition Results on Raspberry Pi3 under Hat Conditions Using Intel OpenVINO Method 57
4.12 Multiple Face Recognition Results on Raspberry Pi3 Using Intel OpenVINO Method 58
4.13 Inference Time Calculation Using OpenVINO Deployed on Intel NCS 59
4.14 Inference Time Calculation for Multiple Face Recognition Using OpenVINO Deployed on Intel NCS 59
4.15 Inference Time Calculation Using OpenVINO Deployed on Intel NCS2 60
4.16 Inference Time Calculation for Multiple Face Recognition Using OpenVINO Deployed on Intel NCS 60
4.17 Flowchart of How Images Are Passed through Intel NCS 61
4.18 Time for Loading SVM Models 63
4.19 OpenVINO Model Size Properties 63
4.20 Time for Reading IR Models 63
4.21 Time Taken to Generate Input and Output Blobs 64
4.22 Time Taken to Create Executable Network 64
4.23 Time Taken for Pre-Processing on Raspberry Pi3 64
4.24 Time Taken for Performing Inference on My Trained Model 65
4.25 Benchmark Tool Results on My Trained .xml File 66
4.26 Benchmark Tool Results of My Trained .xml File on Raspberry Pi3 ARM 67
4.27 Facenet Prediction Graph 69
4.28 Facenet Frame Rate Graphs 70
4.29 Printing the Maximum Probability Prediction Confidence Value Corresponding to Clifford Label 77
4.30 Accuracy for 4 Different Cases 78

List of Tables

Table Title Page

2.1 Recognition Accuracy Rate Comparison 6
2.2 Literature Summary on Different Face Recognition Methods 9
3.1 Intel NCS vs NCS2 39
4.1 Confusion Matrix Table 49
4.2 Accuracy and Total Time Taken to Perform Predictions on Raspberry Pi3 49
4.3 Hyperparameter Values 62
4.4 Timing Analysis of Each Step Taken during Prediction on NCS 65
4.5 Performance Analysis on 4 Different Hardware Platforms 68
4.6 Accuracy Calculations: Confusion Matrix in Non-Lighting Conditions 71
4.7 Accuracy Calculations: Confusion Matrix in Lighting Conditions 73
4.8 Accuracy Calculations: Confusion Matrix for OpenVINO-Based Implementation on Raspberry Pi3 75
4.9 Overall Summary for Measuring Accuracy Performance Using OpenVINO IR Models and Intel NCS 76
4.10 Overall Probability Confidence Values Based on Maximum Frequency 77

List of Abbreviations

AI Artificial Intelligence
ANN Artificial Neural Network
CNN Convolutional Neural Network
CPU Central Processing Unit
GPU Graphics Processing Unit
IOT Internet of Things
ReLU Rectified Linear Unit
YOLO You Only Look Once
RPi Raspberry Pi
NCSDK Neural Compute Software Development Kit
NCS Neural Compute Stick
VPU Vision Processing Unit
FPGA Field Programmable Gate Array
ZISC Zero Instruction Set Computer
DSP Digital Signal Processing
SOM Self-Organizing Map
SVM Support Vector Machine
PCA Principal Component Analysis
LBPH Local Binary Pattern Histogram
LDA Linear Discriminant Analysis
LFW Labelled Faces in the Wild
SGD Stochastic Gradient Descent
ELL Embedded Learning Library
ACL ARM Compute Library
ms Milliseconds
FPS Frames Per Second
OS Operating System
MTCNN Multi-Task Cascaded Convolutional Networks
MLP Multi-Layer Perceptron
RBF Radial Basis Function Network
HOG Histogram of Oriented Gradients
MO Model Optimizer
IR Intermediate Representation

Chapter 1 Introduction

1.1 Overview

Recognizing a person's name from a face is a main focus of computer vision technology and cannot be neglected, because manual identification wastes a lot of time. The methods previously used for face recognition are traditional methods that are not very accurate. For these kinds of problems, researchers have developed solutions such as face recognition techniques based on Convolutional Neural Networks (CNNs). By combining deep learning techniques with face recognition technology, these systems are rapidly spreading in various sectors such as malls, universities, and ministries, where they can communicate with other systems in an integrated manner; as a result, convenience, safety, and energy efficiency can be achieved by implementing them on low-power devices.

The face recognition process is executed in two steps:

 First, we detect human faces using a device such as a video camera. We draw bounding boxes around each detected face and normalize it.

 Next, we pass the detected, bounded face to a trained classifier that performs the final prediction of the person's name; this is known as automatic facial recognition.

A process that contains both of the above steps is called a fully automatic algorithm, whereas one containing only the last step is called a partially automatic algorithm.

Now the main issue is to run this face recognition on embedded devices. For an embedded processor to run such an application, there are 3 important elements:

1. Size of the deep learning model

2. The physical dimensions of the device

3. The power supply required to run the device

Adding deep learning capability to the Raspberry Pi3 is tough due to its low processing speed and power. We may be tempted to turn to the Internet of Things (IoT): integrating artificial intelligence into IoT devices provides massive benefits, such as sharing more in-depth information with a high level of security. But at present these devices are unable to run deep learning frameworks like TensorFlow, Caffe, and MXNet.

IoT has started to shift to edge computing for the following reasons:

(i) To avoid storing unimportant data in the cloud.

(ii) Uninterrupted transmission of data wastes energy at the IoT gateway, which shortens its battery life.

(iii) The cloud is a single point of failure in terms of communication and processing devices.

Edge computing means pushing image recognition and general capabilities to end/edge devices (also known as embedded systems), which are inherently resource-constrained systems.

Running computationally intensive tasks on embedded devices is a challenge in itself. This push to add visual intelligence to end devices gave rise to the field of embedded vision. Embedded vision applications are growing steadily, with great promise for the future, but there are still problems: algorithms are often too computationally intensive, or their implementation is too difficult, making them unfeasible.

Therefore, to solve the above problems, this project uses the Facenet model to perform face recognition and verification. We prefer the Facenet architecture since it is fast and solves the face verification problem. It learns one deep CNN, which transforms a face image into a vector embedding. This embedding can be used to compare faces and see how similar they are, and can be used in the following three ways:

 Face verification
 Face recognition
 Face clustering

After we obtain the trained Facenet model, we convert it to the Intel NCS format by generating its graph file. By performing inference with this compiled Facenet graph, we reach our goal: high frame rates and good accuracy on the Raspberry Pi3.

Fig. 1.1 below shows the hardware setup of my project.

Figure 1.1: A System Set up on Raspberry Pi3 using Intel NCS

1.2 Problem Statement

In general, the simplest face recognition systems work by comparing an unknown face against every known face. This may look simple and easy, but it is not a suitable method when implementing on embedded computers with low computational speed. For example, implementing this approach for around 50 persons will reduce the Raspberry Pi3 to a very low frame rate. Time and accuracy play a major role in any automatic face recognition system, so we need to develop computationally efficient algorithms for this problem.

The second problem is that the Raspberry Pi3 has very little memory. Training of deep learning models is done on a CPU or GPU, and a large amount of training data is needed to make accurate predictions with these models. The resulting models take up a lot of space on disk; the original Facenet model is over 200 MB, so storage is a major problem on an embedded device.

There are three further drawbacks to running deep learning on the Raspberry Pi: first, the Raspberry Pi is not an accurate representation of an embedded system; second, TensorFlow would have to be ported to the target platform, as its libraries are not available for all platforms, making the solution hardware dependent; and third, the speed of recognition on the Raspberry Pi3. As discussed in the second problem, model size increases with the amount of training data provided to the pre-trained models. The important characteristics of an embedded system are high speed, small size, high accuracy, and high reliability. In terms of embedded system performance, speed is inversely proportional to model size, but to improve accuracy we need to increase the training data, so there is a trade-off between accuracy and speed on a low-power embedded system. For instance, a deep learning based face recognition implementation on the Pi3 takes around 10-15 seconds per person, which is unacceptable for real-time biometric systems.

The final issue is the accuracy of the models. The dynamic expressions of faces cause a serious problem for recognition systems. The factors are arranged into two groups: intrinsic factors and extrinsic factors. Intrinsic factors include facial hair, facial expression, and so on. Extrinsic factors concern the interaction of light with the face, and include illumination, pose, scale, and so on.

Illumination is simply defined as light variation. The main problem is that the same person, with the same facial expression, appears different to the camera when the lighting conditions change. One experiment observed that the difference between two pictures of the same individual taken under varying illumination is greater than the difference between pictures of two different people taken under the same light.

Therefore, the purpose of this study is to replace conventional face recognition methods using the concepts of computer vision and deep learning. Taking a pre-trained model such as Facenet or AlexNet, we perform inference on the Raspberry Pi3 using the Intel NCS and examine the Raspberry Pi3's performance in terms of speed, accuracy, RAM, and CPU usage.

In simple words, the problem to be solved is the practical implementation of deep learning on the Raspberry Pi3.

1.3 Objectives

 Implement a real-time embedded vision system on the Raspberry Pi3 processor that can recognize the faces of 4 persons by comparing the unique features of a face against all known people to determine the person's name, using the deep learning Facenet model.

 To optimize the Facenet model using the OpenVINO Model Optimizer and later integrate the resulting OpenVINO (FP16) model with the trained SVM classifier.

 To improve the live face recognition speed on the Raspberry Pi3 by using the Intel Movidius Neural Compute Stick and perform parallel programming via the OpenVINO Inference Engine's asynchronous execution.

 To improve the accuracy of the face recognition system on the Raspberry Pi3 with the optimized OpenVINO model under hat conditions, high-quality videos, facial expressions, and human movement.

1.4 Scope and Limitations

1.4.1 Scope

• We compare face recognition performance for 5 persons across 3 scenarios: (1) Ubuntu operating system, (2) Raspberry Pi3 only, and (3) Raspberry Pi3 + Intel NCS. The trained model is also tested in both lighting and non-lighting conditions and with contrasting facial expressions and poses, and we measure the accuracy, F1 score, precision, and recall using the scikit-learn machine learning library.

• Compared with classical systems, embedded vision provides enormous cost benefits and security.

• Using the Intel NCS on embedded mini computers, we can implement smart IoT-based solutions that use face recognition to authorize access to restricted areas. Integrating deep learning algorithms into embedded vision systems for image classification lets programmers spend less time and energy developing intelligent algorithms for face detection and recognition.

1.4.2 Limitations

• As the number of validation images increases, speed on the Pi3 slowly decreases, but accuracy increases.

• The head should face the camera directly; otherwise the model may not recognize the face correctly, leading to poor accuracy.

• Under bright lighting conditions the accuracy is slightly reduced, because the model takes the complete face along with the background.

1.5 Thesis Outline

I organize the rest of the dissertation as follows.

In Chapter 2, I review the literature.

In Chapter 3, I propose my methodology.

In Chapter 4, I present the experimental results.

In Chapter 5, I conclude the thesis.

Chapter 2 Literature Review

2.1 Background

In this literature review, the different approaches used earlier for Face Recognition are presented.

1. Principal Component Analysis (PCA) :

It is also known as the eigenface, eigenpicture, or principal component method. L. Sirovich and M. Kirby [1] proposed the use of principal component analysis to efficiently represent images of faces. In this method, each face picture is reconstructed from a small collection of weights and a standard face picture (the eigenpicture). The weights describing each face are obtained by projecting the face picture onto the eigenpictures.
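To make the eigenpicture idea concrete, the following minimal Python sketch (not the authors' original code) uses scikit-learn's PCA on flattened grayscale face images; the input file name and component count are illustrative assumptions.

    import numpy as np
    from sklearn.decomposition import PCA

    # X: assumed (n_samples, height*width) array of flattened grayscale face images
    X = np.load("faces.npy")                 # hypothetical file of training faces

    pca = PCA(n_components=50, whiten=True)
    weights = pca.fit_transform(X)           # small collection of weights per face
    eigenpictures = pca.components_          # the "eigenpictures" (principal components)

    # A face is approximately reconstructed from its weights and the eigenpictures
    reconstruction = pca.inverse_transform(weights[:1])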

2. Local Fisher Discriminant Embedding (LFDE) :

Cheng-Yuan Zhang and Qiu-Qi Ruan [2] presented a face recognition approach known as L-Fisherfaces. The main difference between Linear Discriminant Analysis (LDA) and Local Fisher Discriminant Embedding (LFDE) is that in LFDE the face pictures are mapped into a face subspace for analysis, whereas LDA only considers the Euclidean structure of the face space. They compare the proposed L-Fisherfaces approach with PCA, LDA, LPP, and UDP on three different face databases. The experimental results suggest that L-Fisherfaces provides a better representation and achieves higher face recognition accuracy.

3. Local Binary Pattern Histogram (LBPH):

Aftab Ahmed and Fayaz Ali [3] proposed a system operating at 35 px resolution to classify faces at changed angles and side poses, and to track faces during human movement. They created the LR500 dataset for training and classification, and used the Local Binary Pattern Histogram (LBPH) architecture for facial recognition at low resolution.

They tested a total of 2500 images and correctly recognized up to 2470 of them. Table 2.1 below shows the accuracy results at different resolutions.

Table 2.1: Recognition Accuracy Rate Comparison

Algorithm  45 px  35 px
LBPH       94%    90%

4. Support Vector Machine (SVM):

The review in [4] describes a methodology for face recognition using an SVM, a learning system regarded as a suitable technique for general-purpose pattern recognition because of its high generalization performance without the need to incorporate other knowledge. The Support Vector Machine (SVM) identifies the hyperplane that separates the largest possible fraction of points of the same class onto the same side, while maximizing the distance from either class to the hyperplane.

5. OpenFace :

Brandon Amos, Bartosz Ludwiczuk, and Mahadev Satyanarayanan (2016) [5] presented experiments showing how OpenFace provides high accuracy compared with other open-source methods, and presented new classification results for mobile applications. They used the Labeled Faces in the Wild (LFW) dataset for this experiment. Image 2.1 below shows the experimental results for classifying 10 to 100 people. The first plot shows that including more people decreases accuracy, and that OpenFace consistently has the highest accuracy by a large margin. The second plot shows that including more people increases training time; OpenFace's SVM is the fastest to train. The third plot shows prediction time: the prediction times of eigenfaces, Fisherfaces, and LBPH vary, while OpenFace's remains almost constant.


Figure 2.1: OpenFace vs earlier non-exclusive face recognition implementations

6. Neural Networks

Finally, the last approach to face recognition is Artificial Neural Networks (ANNs). The interesting part of an ANN is the non-linearity of the network, which can make feature extraction more efficient than in eigenface-based methods. [6] suggested a hybrid neural system that combines local image sampling, a SOM, and a CNN. The convolutional network extracts successively larger features in a hierarchical set of layers and provides partial invariance to translation, rotation, scale, and deformation. The authors reported 96.2% correct recognition on the ORL database of 400 pictures of 40 people. The classification time is under 0.5 seconds, but the training time is up to 4 hours. In general, neural network approaches encounter difficulties when the number of classes (i.e., people) increases.

The comparison of the above literature on different face recognition techniques is shown in Table 2.2 below.

Table 2.2: Literature summary on different face recognition methods

[1] "Application of the Karhunen-Loève procedure for the characterisation of human faces": Demonstrates solid face recognition performance under different illumination conditions through a simple correlation between images with lighting changes. However, the correlation between pictures of whole faces is not sufficient for satisfactory recognition performance, and this method has the lowest accuracy compared with the other methods.

[2] "Face Recognition Using L-Fisherfaces": The authors present a face recognition method based on the Local Fisher Discriminant Embedding (LFDE) strategy. The research was performed on three face databases, PIE, FERET, and ORL, with accuracies of 88.3%, 94.33%, and 96.3% respectively. In this experiment, LFDE performs well compared with PCA/eigenfaces.

[3] "LBPH based improved face recognition at low resolution": This paper uses the LBPH algorithm for face recognition at low resolution. The authors used the LR500 dataset for training and classification, and successfully identified human faces at different angles and in different postures, tracking them during human motion. The main limitation is that the cameras could not always identify individuals perfectly; dark and bright lighting conditions are still issues for face recognition.

[4] "Support Vector Machines applied to face recognition": Performance is evaluated for face recognition and verification. The recognition accuracy of SVM reaches about 77-78%, versus only 54% for PCA; for verification, the error rate is 7% for SVM and 13% for PCA. The main drawbacks are: (1) the FERET database is not well suited, since it does not contain different poses of the subjects; and (2) little information is provided about the lighting used when taking the images.

[5] "OpenFace: A general-purpose face recognition library with mobile applications": The main aim of OpenFace is to support face recognition in mobile environments, where a user can choose design parameters so that the system provides high accuracy with low training and prediction time. This paper shows good accuracy compared with other methods, using OpenFace as a face recognition library.

[6] "Face recognition: A convolutional neural-network approach": This paper integrates local image sampling, a self-organizing map (SOM), and a convolutional network. The results illustrate that SOM+CNN shows better accuracy than PCA+CNN.

Of all the above papers, only the last two use neural networks, which are a much more effective method for face recognition than the other approaches.

The authors of [5] did not study the impact of executing these recognition techniques on different architectures such as embedded GPUs or embedded boards; they only presented performance experiments illustrating that OpenFace's recognition time is suitable for embedded applications compared with other methods. Hence, my thesis work uses the FaceNet library and bridges this gap by studying recognition time on the Raspberry Pi3. We study the impact of deep learning models on low-power, constrained embedded platforms and see how to optimize deep learning model size.

[6] presents SOM + CNN tested on 40 persons, but there is still room for improvement in that approach. That paper was implemented on a GPU, whereas my aim is to run the CNN algorithm on the Raspberry Pi3 and accelerate its performance on this embedded device.


2.2 Challenges of Face Recognition Algorithms and How Deep Learning Algorithms Can Solve Them

Variational factors: When designing an algorithm, the main aim is to eliminate the effect of variations. These are factors that are not directly observed. For instance, when analysing an image showing the face of a person, the factors of variation are the distance of the face from the camera, the emotions, the lighting conditions, etc. It is difficult to separate such high-level features from the input image.

Other factors causing variations in accuracy include:

 Variation in light intensity

 Variation in pose

 Variation in camera distance

 Variation in dataset size

We compare the recognition rates of the face recognition methods discussed above, PCA vs. LBPH, in percentage terms, as shown in Table 2.3 below [7].

Table 2.3: Overall Performance Scale in Percentages

Feature                 PCA/Eigenfaces   LBPH
Light variation         85-90%           70-75%
Pose variation          88-93%           68-73%
Distance variation      88-93%           70-75%
Dataset size variation  85-90%           80-85%

Therefore, deep learning provides a solution to this problem by introducing the concept called representation learning. We can understand this concept more clearly from the image below and see how deep learning solves these kinds of variation problems.

Chapter 3 Methodology

3.1 Overview

We make use of the Facenet one-shot learning algorithm to recognize faces, and later convert the trained model to a graph file format.

Figure 3.1 shows the flowchart of the methodology.

I organize the rest of the methodology as follows.

I describe how I detect the face.

I describe how I can perform training on my custom datasets of detected faces.

I describe how to train an SVM (scikit-learn) classifier and evaluate the face recognition results.

I describe how to convert my trained model to OpenVINO format.

I describe how the Intel NCS and OpenVINO toolkit can be used for face detection and recognition on the Raspberry Pi3.

First, we look at the flowchart (see Figure 3.1) of my complete thesis implementation.

Then we go through each of the sections step by step, as listed above.

Training image folder consisting of custom datasets of 5 persons

Images are provided to MTCNN for face detection

The detected faces are passed to the CNN model for training; the model extracts each person's features and converts them into 128-dimensional embedding values for each person.

Output files after training the model are:

.npy file for fetching the face

.pb file which is a pre-trained facenet model for feature extraction

.pickle file where our custom data is stored

What does the Intel SDK or Model Optimizer do?

It converts the trained Facenet model (.pb file) into an Intel NCS-friendly format: either a Facenet graph file or an OpenVINO IR model, a lighter version of the model that is transferred to the Raspberry Pi3 for performing predictions with the graph/IR file and inference on the Intel NCS. Note that faster recognition is achieved on the Raspberry Pi3 by using the Intel NCS API or the OpenVINO Inference Engine API.

Figure 3.1: Workflow Representation of Methodology

3.2 Data Collection

In this thesis, training and recognition of persons are done using convolutional neural networks. In the proposed method, we collect the images of 5 persons and create a dataset with all the images so that the neural network can perform face recognition.

We collected the images of the five persons under two different conditions: one under normal lighting conditions and the other in non-lighting conditions. Each person gave different poses and emotions to make the model more accurate. The images were collected using the X-Cam; this camera can be accessed using the VLC media player, through which the videos were recorded. The videos were recorded under different lighting conditions. We collected 300 images for each person and trained the model. For test images, we randomly collected images from the recorded videos and tested them for detection; for testing on video, we recorded video on a different day for a few minutes and tested it.

The camera used for collecting the data is the X CAM CAR DV, which has high picture clarity, a built-in G-sensor, and motion detection support. The image quality is Full HD 1080p at 25/30 fps, and the video resolution is 1920×1080p. The camera takes the quality of the data to the next level and also has a wide dynamic range. Its performance is also excellent in low-light conditions thanks to 3D noise reduction.
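Since the training images were sampled from recorded videos, a minimal OpenCV sketch of this kind of frame extraction is given below; the file names, output folder, and sampling interval are assumptions, not the exact script used in this thesis.

    import os
    import cv2

    video_path = "nimshi_lighting.mp4"    # hypothetical recorded video
    out_dir = "train_img/Nimshi"          # hypothetical label folder
    os.makedirs(out_dir, exist_ok=True)

    cap = cv2.VideoCapture(video_path)
    frame_id = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_id % 10 == 0:            # keep every 10th frame (assumed interval)
            cv2.imwrite(os.path.join(out_dir, "image_%d.jpg" % saved), frame)
            saved += 1
        frame_id += 1
    cap.release()
    print("Saved %d frames" % saved)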

Image 3.2 below shows the folder where the images of the 5 persons are stored, labelled with their respective names.

Figure 3.2: Training image folder

The image data directory should look like the following: person-1

├── image_1.jpg

├── image_2.jpg

...

└── image_p.jpg

Figure 3.3: Data collection corresponding to Nimshi label

3.3 Data Pre-Processing

One major limitation of the Dlib face detector is that it misses some hard examples (partial occlusion, silhouettes, etc.). To solve this, various face landmark detectors have been tested, and it has been observed that Multi-Task Cascaded Convolutional Networks (MTCNN) can boost performance compared with other face detectors.

The hyperparameters used in this model are:

Min_face_size: This limits the minimum image size for face detection; faces smaller than this size cannot be detected.

Scale factor: This controls the image pyramid; the image is iteratively scaled down by this factor until it becomes smaller than the minimum detectable face size.

Threshold: A detected box is retained only if the probability that it contains a face is greater than the threshold of the corresponding stage (e.g., pnet_threshold); the retained boxes are then filtered further in the next stage.

Image size and margin: The input to the Inception-ResNet-v1 model is 160×160, with some margin added so that a random crop can be used.

Image 3.4 below shows the hyperparameter values used in my program, and image 3.5 shows a screenshot taken after finishing pre-processing of the training label images.

Figure 3.4: Hyper parameters Values

Figure 3.5: Screenshot taken during pre-processing the Facenet Model

Stage 1: Face Detection

Now I apply the MTCNN algorithm for face detection. It works in three steps and uses one neural network for each.

The first part is a proposal network: it predicts potential face positions and their bounding boxes, like the attention network in Faster R-CNN. The result of this step is a large number of face detections, including many false detections.

The second part uses the image and the outputs of the first network. It refines the result to eliminate a large portion of the false detections and aggregates the bounding boxes.

The last part refines the predictions further and adds facial landmark predictions (in the original MTCNN implementation). This strategy identifies, detects, and aligns the faces by making the eyes and bottom lip appear in the same location in each image. The detect_face function returns two variables: the bounding boxes and the landmark points for them.
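A hedged sketch of how this three-stage detector is typically invoked via the align.detect_face module of the Facenet codebase is shown below; the hyperparameter values mirror Section 3.3, but the exact values and module layout are assumptions about this thesis's code.

    import cv2
    import tensorflow as tf
    import align.detect_face as detect_face   # MTCNN module from the Facenet codebase

    minsize = 20                    # min_face_size hyperparameter (assumed value)
    threshold = [0.6, 0.7, 0.7]     # P-Net, R-Net, O-Net thresholds (assumed values)
    factor = 0.709                  # scale factor for the image pyramid

    with tf.Graph().as_default():
        sess = tf.Session()
        # Build the three networks: proposal (P-Net), refine (R-Net), output (O-Net)
        pnet, rnet, onet = detect_face.create_mtcnn(sess, None)

    img = cv2.cvtColor(cv2.imread("test.jpg"), cv2.COLOR_BGR2RGB)
    # Returns bounding boxes (x1, y1, x2, y2, score) and 5 landmark points per face
    boxes, points = detect_face.detect_face(img, minsize, pnet, rnet, onet,
                                            threshold, factor)
    for box in boxes:
        x1, y1, x2, y2 = box[:4].astype(int)
        cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)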

Image 3.6 shows the detected-face outputs for my training images stored in the training image folder. As we can observe, only the face region is successfully detected from the complete image under different image conditions.

Figure 3.6: Face Detection Outputs after applying MTCNN Algorithm

To limit the memory usage of each TensorFlow session, I set the parameter gpu_memory_fraction = 0.25.
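For reference, this is how the parameter is set in a TensorFlow 1.x session configuration (a minimal sketch):

    import tensorflow as tf

    # Limit each TensorFlow session to 25% of the GPU memory
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.25)
    sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))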

After face detection finishes, we get bounding boxes and .npy files for each person, as shown in image 3.7; these detected faces are later passed to the CNN classifier for face recognition. Image 3.8 shows the .npy files.

Figure 3.7: Bounding Boxes for Face Detection of 4 Persons

Figure 3.8: .npy files

Stage 2: Training of Model on Custom Datasets

In this step, what we need is a way to extract a few basic measurements from each face. Then we can measure an unknown face in the same way and find the known face with the closest measurements. For example, we might measure the size of each ear, the spacing between the eyes, the length of the nose, etc.

Now, after finishing the pre-processing of the data, we train the model with a pre-defined model as shown in image 3.9 below (the .pb file is placed inside the folder named model).

This generates a vector face embedding for each person. These embedding values are then used for classification by training a classifier. In the next stage we will see in detail how to train the classifier using this pre-trained model (the Inception-ResNet-v1 model).

Figure 3.9: Pre-Trained CNN Model

Stage 3: Training the Classifier Using the Softmax Function and Performing Live Face Recognition

Deep learning requires hardware with high specifications for training. The hardware was therefore set up with the following configuration.

CPU: Intel Core i7-7700K, 4.3 GHz. Motherboard: Asus Prime Z270-A. GPU: Asus ROG Strix GTX 1080 Ti. RAM: 16 GB each for the CPU and GPU.

Using this pre-trained model, I trained on my custom datasets of 5 persons. Training took 30 minutes. The number of epochs is in the range 60-80, the learning rate is set to 0.005, and the batch size to 1000.

Image 3.10 below is a screenshot taken while training my Facenet model. We can clearly observe the model filename (.pb) in the image. While the model is being trained, all the face features are extracted, as discussed in Stage 2 above.

Figure 3.10: Screenshot taken during training the Facenet model

The batch parameter indicates the batch size used during training. Our training set contains a few thousand images, but it is not uncommon to train on millions of images. During the training process, the weights of the neural network are iteratively updated based on the mistakes it makes on the training dataset. It is not practical to use all the images in the training set at once when updating the weights; instead, a small subset of images is used in each iteration, known as a batch. When the batch size is set to a value, exactly that number of images is used in one iteration to update the parameters of the neural network.

The learning rate parameter plays a huge role during training. Since the neural network is updated based on a small batch of images, the weight updates fluctuate quite a bit; a momentum parameter is therefore used to penalize large weight changes between iterations. A typical neural network has millions of weights and can therefore easily overfit any training data. Overfitting means that the network performs very well on the training data but quite poorly on test data; it is almost as if the neural network has not learned the underlying concept but only memorized the answers to all the images in the training set. To mitigate this problem, large weight values should be penalized; the decay parameter controls this penalty term. The default value works fine, but it can be tweaked if overfitting is noticed.

The learning rate controls how aggressively the neural network learns from the current batch of data; in general, this number lies between 0.01 and 0.0001. At the beginning of the training process we start with zero information, so the learning rate needs to be high. Later, once the neural network has processed a lot of data, the weights should change less aggressively; in other words, the learning rate needs to decrease over time. In the configuration file, this decrease in learning rate is obtained by first specifying whether our learning rate policy is constant or step-based.

Therefore, to improve the performance of the final model, the learning rate is decreased by a factor of 10 when the training starts to converge; this is done through a learning rate schedule. The images are then fed in batches into the model. The model returns a 128-dimensional embedding for each image, i.e., a (batch size × 128) matrix for each batch. After these embeddings are created, they are used as feature inputs to a scikit-learn SVM classifier trained on each identity.
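A minimal sketch of this classification step is shown below, assuming the 128-dimensional embeddings and integer labels have already been computed by the Facenet model; the file names and class names are illustrative placeholders.

    import pickle
    import numpy as np
    from sklearn.svm import SVC

    emb_array = np.load("embeddings.npy")    # hypothetical (n_images, 128) embeddings
    labels = np.load("labels.npy")           # hypothetical integer label per image
    class_names = ["Amit", "Clifford", "Nimshi"]    # example label names

    model = SVC(kernel="linear", probability=True)  # probability=True enables predict_proba
    model.fit(emb_array, labels)

    # Persist the trained classifier as a .pkl file, as in Figure 3.12
    with open("classifier.pkl", "wb") as f:
        pickle.dump((model, class_names), f)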

Figure 3.11: Probability Prediction values of 4 Persons

Image 3.11 shows the matrix of probability values for each person.

Image 3.12 shows the classifier output file of our trained model, which is stored in .pkl format as shown in the image below.

Figure 3.12: Facenet Classifier Model File

Stage 4: Live Face Recognition using SVM

Once the classifier is trained, we activate the web camera and capture an image. This image is stored on the system, and a human face is searched for in the captured image using OpenCV and Python. The detected human face is then compared with the faces stored in the database using the deep convolutional neural network algorithm. For this we use a basic machine learning classifier called the Support Vector Machine (SVM). This algorithm works by looking at all the face measurements from our trained neural network model, as discussed in Stage 3. The person whose stored measurements are closest to the unknown face's measurements is displayed by name as the result of the classifier. The main advantage of this classifier is that it runs in milliseconds.
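A hedged sketch of this live prediction loop is given below; get_embedding is a hypothetical stand-in for the detection, alignment, and Facenet embedding steps described earlier, and the 0.5 probability cutoff is an assumption.

    import pickle
    import numpy as np
    import cv2

    def get_embedding(frame):
        # Hypothetical helper: detect -> align -> Facenet embed; returns (1, 128) or None
        return None

    with open("classifier.pkl", "rb") as f:
        model, class_names = pickle.load(f)

    cap = cv2.VideoCapture(0)                 # USB web camera
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        emb = get_embedding(frame)
        if emb is not None:
            probs = model.predict_proba(emb)[0]
            best = int(np.argmax(probs))
            name = class_names[best] if probs[best] > 0.5 else "Unknown"
            cv2.putText(frame, "%s %.2f" % (name, probs[best]), (10, 30),
                        cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        cv2.imshow("Face Recognition", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()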

I then tested my trained model file on a live webcam and started performing predictions using the trained classifier. Image 3.13 shows the face recognition outputs of my trained model for 4 persons on Ubuntu 16.04.

Figure 3.13: Face Recognition Outputs of 4 Persons

Flowchart for training and testing Facenet Model

The flowchart for training and testing the Facenet model is shown in Fig. 3.14 below.

Start
↓
Data Collection
↓
Data pre-processing: apply the MTCNN algorithm to detect faces
↓
Train the .pb model with the detected faces (training data)
↓
The classifier model is saved as .pkl; using the SVM classifier we can start performing predictions
↓
Test on images and videos (test data) by loading the trained classifier model
↓
Stop

Figure 3.14: Flowchart representation of training and testing Facenet model with Custom data

3.4 Main Drawback of Implementing the Facenet Model on Embedded Devices

Now that we have successfully trained and tested the model, we need to perform recognition on the Raspberry Pi3 by transferring the trained model to it. The main drawback when implementing this model on the Raspberry Pi3 is the speed of recognition, caused by the size of the Facenet model. Image 3.15 below shows that the trained model (.pb) alone is 93.3 MB, along with the size of my trained classifier model (.pkl); the total size of the Facenet model is 200 MB.

The first drawback is that as the size of the model increases, speed decreases.

Figure 3.15: Size of my trained model (.pb) and classifier model (.pkl)


Figure 3.16: Time taken to load Trained Model (.pb) on Raspberry PI3

As shown in the above image, the time taken to load the trained model is 57.65 seconds.

The second drawback is that as the size of the model increases, so does the time to load it on low-power devices like the Raspberry Pi3.

We already discussed in the literature review that speed is a big issue on the Raspberry Pi3. To overcome this challenge, the solution I found is to use the Vision Processing Unit developed by the Intel team. In the remaining sections of this chapter we discuss the Intel NCS and OpenVINO in depth.

3.5 Neural Compute Stick Platform

The Intel Neural Compute Stick (NCS) platform [35] is a System-on-Chip (SoC) implementation of the Myriad2 VPU. The NCS contains an on-chip hardware block explicitly intended to run deep neural networks at high speed and low power without compromising accuracy, enabling devices to see, understand, and respond to their environments in real time. Vision Processing Units (VPUs) are different from Graphics Processing Units (GPUs): GPU hardware is focused on multimedia tasks, whereas VPUs deliver visual intelligence at high compute per watt.

A high-level overview of the device is illustrated in Figure 3.17 [36]. The diagram depicts the approximate implementation used in the NCS platform (variant MA2450).

Figure 3.17: Implementation of the Myriad2 VPU used within the Neural Compute Stick (NCS) platform

3.5.1 Importance of NCSAPI in Deep Learning Inference Applications

In diagram 3.17, two RISC processors handle the communication with the host and the execution on the VPU (i.e., the runtime scheduler). Applications communicate with the VPU over a USB 3.0 interface using the so-called Neural Compute API (NCAPI). The principal purpose of this API is to enable the deployment of convolutional networks for inference on the NCS. When the NCAPI initializes and opens a device, firmware is loaded onto the NCS. The device is then ready to accept network graph files and execute commands to conduct inference on the VPU. The mvncLoadTensor() command moves a specific input image to the NCS device for execution against the pre-compiled, loaded graph. It automatically handles the data transfer through one of the RISC processors into the NCS and immediately queues the execution of the graph on the SHAVE processors through the runtime scheduler. The call returns as soon as the data is transferred and the execution is scheduled, without blocking the host process. The application is therefore able to overlap additional computation (e.g., decoding the next frame) while the inference is offloaded to the NCS.
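A minimal sketch of this NCAPI flow using the NCSDK v1 Python bindings (mvncapi) follows; the graph path and the placeholder input are assumptions.

    import numpy as np
    from mvnc import mvncapi as mvnc

    # Find and open the first attached NCS device
    devices = mvnc.EnumerateDevices()
    device = mvnc.Device(devices[0])
    device.OpenDevice()

    # Load the pre-compiled graph file onto the stick
    with open("facenet_celeb_ncs.graph", "rb") as f:
        graph = device.AllocateGraph(f.read())

    image = np.zeros((160, 160, 3), dtype=np.float16)   # placeholder pre-processed input

    # Non-blocking: transfers the tensor and queues execution on the SHAVE processors
    graph.LoadTensor(image, "user object")
    # ... the host is free to do other work here (e.g., decode the next frame) ...
    output, userobj = graph.GetResult()                 # blocks until inference completes

    graph.DeallocateGraph()
    device.CloseDevice()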


3.5.2 Working of Vision Processing Unit

In this section we look at the working architecture of the Vision Processing Unit. Image 3.18 shows the development process of an NCS-based embedded system [37].

Figure 3.18: Illustration of using Intel NCS to develop a DNN-based Embedded System

The training process does not need the NCS stick or SDK; only the trained model is used. Using the software SDK of the NCS on a PC running the 64-bit Ubuntu 16.04 OS, the user performs profiling, tuning, and compiling of a DNN model for the NCS. The provided SDK can check the validity of the designed DNN and offers an API for the Python and C languages. From that point forward, any developer system (e.g., a Raspberry Pi) running a compatible OS with the Neural Compute API can accelerate neural network inference.

To summarize, below are the steps for using the Intel NCS with any deep learning model:

 We use a pre-trained TensorFlow/Caffe model or train a network with TensorFlow/Caffe on Ubuntu or Debian.

 Use the NCSDK toolchain / Intel OpenVINO to generate a graph file or IR file by compiling the model on a development machine running Ubuntu 16.04 LTS. (We will discuss both results: one using a graph file for object detection and the other using OpenVINO for face detection on the Raspberry Pi3.)

 Deploy the graph file and NCS to your single-board computer running a Debian flavour of Linux. I used a Raspberry Pi 3 B running Raspbian (Debian-based) as the prototyping platform.

 With Python, use the NCS API to send the graph file to the NCS and request predictions on images/video.

One of the main advantages of generating a graph file is that it can be deployed to any low-power device, and we can write our own logic around the generated graph file to perform the required tasks, such as object detection or emotion recognition.

Using the above procedure, I tested a YOLOv3 object detection model (from the NCSDK GitHub repository) on my Raspberry Pi3 by converting the trained model to a graph file for detecting persons, bottles, etc., and performed live object detection. It classified me as a person and detected a water bottle in 80 milliseconds, as shown in output image 3.19. This was an initial test performed on the Pi3 using the Intel NCS to understand its working functionality.

In the next section we will see clearly how to convert TensorFlow/Caffe models to graph files.

Figure 3.19: Live-Object Detection on Raspberry Pi 3 using Intel NCS

We will then look at the OpenVINO architecture and the face detection implementation results on the Raspberry Pi3.

3.6 Model Optimizer

The Model Optimizer is a cross-platform command-line tool that eases the transition between the training and deployment environments and automatically adjusts deep learning models for the best performance on the targeted devices.

Image 3.20 below shows the workflow for both the NCSDK and the OpenVINO toolkit [39].

Fig 3.20: Intel NCSDK and OpenVINO Architecture

1. Choose one of the frameworks supported by the Model Optimizer and train a model with your datasets.

2. Run the Model Optimizer to get the Intermediate Representation (IR) of the network, which is used as the input to the Inference Engine for all targeted devices: CPU, GPU, FPGA, and Myriad.

3. The IR is a pair of files that describes the whole model:
.xml: the topology file, which describes the network topology
.bin: the trained data file, which contains the binary data of the weights and biases

The main advantage of choosing the Intel NCS for this thesis project is that it includes software tools, an API, and examples, so developers can create software that takes advantage of the accelerated neural network capability provided by the hardware.

The toolkit comes with Python and C language APIs that enable applications to utilize hardware-accelerated deep neural networks by means of neural compute devices such as the Intel® Movidius™ Neural Compute Stick (NCS).

3.7 Procedure to Convert Facenet Model to NCSDK Graph Format

 First, we train a TensorFlow or Caffe model on a development machine.

 Create the same neural network for inference only, not for training: remove all the parts that relate to training, such as dropout layers, losses, and optimizers, and make sure the input and the output layers are named. I did this and stored the code in the infer_model_tf.py file; it can be run with python infer_model_tf.py.

 After running the inference program, we find some new files in the folder named FACENET CHECKPOINT: facenet_celeb.index, facenet_celeb.meta, and facenet_celeb.data-00000-of-00001.

 Compile the checkpoint with mvNCCompile, giving the input and output layer names: mvNCCompile facenet_celeb_ncs/facenet_celeb.meta -in input -on softmax_tensor -o NCS\ graph/facenet_celeb_ncs.graph

 After we get the graph file on the Ubuntu OS, we transfer it to the Pi3 and start performing inference on live webcam frames with the graph file. As discussed above, we need the checkpoint files to generate the graph file. Image 3.21 below shows these files, which consist of my trained model's .index, .meta, and .data files.

Figure 3.21: Facenet Checkpoint Files

Now that we have the above files, we are ready to use the Intel NCSDK toolkit to generate the graph file with the mvNCCompile command discussed in the step above. mvNCCompile is a command-line tool that compiles the network and weight files of Caffe or TensorFlow models into the Intel Movidius graph file format, which is compatible with the Intel Movidius Neural Compute SDK (NCSDK) and Neural Compute API (NCAPI).

Image 3.22 below shows the Facenet graph file and its size. We can observe that the size of my complete model came down to 45.6 MB from the 200 MB discussed previously.

Therefore, we are now ready to transfer this graph file, instead of the complete Facenet files, onto the Raspberry Pi3 and achieve faster results with good accuracy. In the next section we will see results on the Raspberry Pi3 comparing speed and accuracy in 3 cases.

Figure 3.22: Facenet Graph file after Compiling and its Size

3.7.1 Working of Single-Shot Learning Facenet Algorithm on Intel NCS

• The trained Facenet graph finds and quantifies landmarks on faces in general. The Facenet classifier uses Siamese neural networks and the triplet loss to classify known and unknown faces; essentially, it calculates the distance between the image presented to the live webcam and a folder of validated images.

• To determine a match, a face match threshold value is used; I used 1.2 as the threshold. Whenever a particular face is matched, we overlay the corresponding image name (e.g., KOLLU NIMSHI) from the validated image folder on the present frame and the matched output. In my program, green indicates that my face is matched.

• So, if we want to recognize a particular image, we run inference on that image and save the result as a control output. Once we have the control output, we compare it with the inference output of any other image to determine whether the faces match.

• In summary, we use a single-shot learning algorithm to determine whether two faces match (a minimal sketch follows).
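A minimal sketch of this matching rule, assuming two 128-dimensional outputs have already been produced by the compiled Facenet graph (the placeholder vectors below stand in for real inference outputs):

    import numpy as np

    FACE_MATCH_THRESHOLD = 1.2    # threshold value used in this thesis

    def is_match(control_output, test_output, threshold=FACE_MATCH_THRESHOLD):
        # Euclidean distance between the validated (control) and live-frame embeddings
        distance = np.linalg.norm(control_output - test_output)
        return distance < threshold, distance

    control = np.zeros(128, dtype=np.float32)   # placeholder validated-image output
    live = np.zeros(128, dtype=np.float32)      # placeholder webcam-frame output
    matched, dist = is_match(control, live)
    print("distance=%.3f matched=%s" % (dist, matched))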

3.7.2 Summary of steps on how Facenet Graph is used for Face Recognition

• First, we get the list of all the sticks that are plugged in and pick the first stick to run the network, creating an object called device.

• Second, we load the Facenet graph file onto the Intel NCS device. The graph file was created by the NCSDK compiler; we read it into a memory buffer and create the NCAPI graph instance from this buffer.

• Third, we pre-process the source image to match the Intel NCS input requirements. From the documentation, the network width and height are 160 × 160.

• The key step: we take the Facenet graph file, run the pre-processed test image through it, and return the predictions.

• Image 3.23 below shows the flow diagram of how inference is performed for any deep learning model.

Device Initialization → Load Neural Network → Obtain Input Tensor → Start Inference → Get Inference Result → Cleanup

Figure 3.23: Simple Inference Code Flow

We perform inference on the validated image with the graph file, and the result is taken as the valid output. Finally, we overlay this image on the frames captured from the live webcam.

Limitations of this implementation:

 The NCAPI supports only Myriad devices; unlike OpenVINO models, the graph file cannot be run directly on a CPU.
 The comparison includes the background, which affects accuracy.
 The prediction time has improved, but it can be improved further using the OpenVINO Inference Engine API.

Therefore, to solve the above limitations and further improve performance, we implement face recognition using the Intel OpenVINO Inference Engine on the Myriad VPU hardware. The sections below discuss how the OpenVINO models are generated and how predictions are performed with these models.


3.7.3 Converting Facenet Model to OpenVINO IR Format

The Facenet model contains both a training part and an inference part; switching between these two states is done based on a placeholder value. The IR files we generate are used only for inference, so the training part can be removed.

The diagram below shows the Facenet model view. We can observe that the network has two extra inputs: a Boolean phase_train, which manages the state of the graph (train/infer), and batch_size, which is part of the batch-join pattern.

Figure 3.24: Facenet Model View

3.7.4 Procedure to Convert Facenet Model to OpenVINO IR Format and Deployment

Having understood the Facenet model view above, we will now look at the commands used to convert our trained model to Intermediate Representation files.

1. First, we train the Facenet model on our custom dataset (discussed in Stage 2) and freeze the model to get a .pb file.

2. The frozen Protobuf file is then given as input, along with its path, to the Intel OpenVINO Model Optimizer from the deployment toolkit, which runs the mo_tf.py script to produce the OpenVINO format.

3. Finally, we test the model in IR format using the Inference Engine in my face recognition application and deploy it to the Myriad target environment.
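A representative Model Optimizer invocation for this conversion (file names and paths here are placeholders, not the exact ones used in this work) might look as follows; the phase_train placeholder is frozen to False because the IR files are used only for inference, as discussed in Section 3.7.3:

    python3 mo_tf.py --input_model facenet_frozen.pb \
        --freeze_placeholder_with_value "phase_train->False" \
        --output_dir ir_fp32/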

The diagram below shows the successful conversion of my trained model to OpenVINO format, in FP32 type.

Figure 3.25: OpenVINO Model Files

3.7.5 Impact of OpenVINO Models on Raspberry Pi3 Execution Time

The size of a deep learning model plays an important role in the execution time of any application. Processing speed, memory, energy use, physical size, cost, and time to deployment are all important elements of an embedded processing system design. The resources required to deploy the classifier and execute operations, often within critical real-time latency constraints, can make or break the design.

The image below shows the Facenet IR files and their sizes. We can observe that the size of the complete model came down to 45.6 MB from the 200 MB discussed previously (Fig 3.15).

Figure 3.26: OpenVINO Model Size Properties

The main challenge here is to integrate an optimized model on the Intel NCS that automatically reduces the load on the Raspberry Pi3 without requiring specific knowledge of the underlying library and hardware on the Raspberry Pi3.

In order to obtain both accuracy and speed, we propose a combination of an asynchronous Inference Engine with a trained Support Vector Machine model. The deployment of a trained neural network on a device that executes the algorithm is known as inference. The basic idea of this design is to implement parallel programming based on request IDs to improve the performance of the Pi3. In order to figure out the difference in processing time for each frame on the Raspberry Pi3, we consider two conditions:

 Time taken to initialize the model and pre-process frames using the MTCNN algorithm before sending them to the Inference Engine.

 Inference time for each pre-processed frame after passing it to the asynchronous Inference Engine.

We will discuss the time calculation for each event more clearly in Chapter 4.

The execution time of a program on the Raspberry Pi3 is contributed by three parts: CPU operations, memory operations, and I/O operations. From a deep learning point of view, these parts correspond to FLOPs, memory size, and parameter size. The execution time of any deep learning application can therefore be modelled by the equation

y = w⊺x + b    (3.1)

where w is a weight vector, b is a bias, and x = [FLOPs, memory size, parameter size].
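As an illustration of Equation 3.1, the short Python sketch below evaluates the linear model; the weight, bias, and feature values are made-up placeholders, not fitted values from this work:

    import numpy as np

    w = np.array([2.0e-10, 5.0e-9, 3.0e-9])  # hypothetical weights per FLOP, byte, parameter
    b = 0.01                                 # hypothetical bias term (seconds)
    x = np.array([1.5e9, 45.6e6, 22.8e6])    # x = [FLOPs, memory size, parameter size]

    y = w @ x + b                            # predicted execution time in seconds
    print(y)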

3.8 Face Recognition on Raspberry Pi3 using OpenVINO Toolkit

3.8.1 Hardware

In this project, we test the OpenVINO models on two types of hardware: the Intel Neural Compute Stick and the Intel Neural Compute Stick 2.

The most interesting property of these devices is that they come in USB form, so they can easily be plugged into edge devices like the Raspberry Pi3 that lack the processing power to run deep learning applications.

The NCS contains a Myriad 2 Vision Processing Unit, which provides 4 Gbit of LPDDR3 DRAM, imaging and vision accelerators, and 12 SHAVE processors. These processors accelerate neural networks by executing them in parallel. It provides a USB 3.0 Type-A interface and supports only the Caffe and TensorFlow deep learning frameworks. The VPU additionally has a SPARC core that runs custom firmware.

The two images below show the hardware accelerators: the Intel NCS and the NCS2.

Figure 3.27: Neural Compute Stick and Neural Compute Stick 2

The NCS2 is powered by the Myriad X VPU, which comes with 16 SHAVE cores and provides 8 times the performance of the Myriad 2 VPU when running convolutional neural networks, while maintaining the same low power consumption and model accuracy. The Myriad X is available in two IC packages: MA2485 and MA2085. The NCS2 uses the MA2485 chip, which embeds 4 Gbit of LPDDR4 memory into the package. The memory interface is 32 bits wide and can operate at 1600 MHz for data rates up to 12.8 GBytes per second.

The interesting part of the Myriad X VPU architecture compared with the previous generation is its Neural Compute Engine, which is capable of a total performance of over one trillion operations per second.

It also has over 20 hardware accelerators for tasks such as optical flow and stereo depth perception. The image below shows the Myriad X architecture.

Figure 3.28: Myriad X Architecture

The table below gives an overall summary of these two devices.

Table 3.1: Intel NCS VS NCS2

Property                      Intel NCS        Intel NCS2
SoC                           Myriad 2         Myriad X
On-chip memory                4 Gbit LPDDR3    4 Gbit LPDDR4 (provided by the MA2485 IC package)
Number of SHAVE processors    12               16
Native precision support      FP16             FP16

From the table above, we can observe that both sticks natively support the FP16 data type, but our previously generated OpenVINO models are FP32. So I have to convert the OpenVINO models to FP16 format to perform inference on the Intel NCS or NCS2 attached to the Raspberry Pi3.

3.8.2 Conversion of OpenVINO Models to FP16 Format

The image below shows the command used to convert the models to the FP16 data type. We can observe that we pass the trained Facenet model as input and specify the FP16 data type, telling the Model Optimizer to convert the model into this format.

Figure 3.29: Command to Convert OpenVINO FP16 Format
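A representative form of this command (paths are placeholders) is the same mo_tf.py invocation as before, with the data type set to FP16:

    python3 mo_tf.py --input_model facenet_frozen.pb \
        --freeze_placeholder_with_value "phase_train->False" \
        --data_type FP16 \
        --output_dir ir_fp16/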

After executing this command, the model is successfully converted to FP16 format. The images below show my converted OpenVINO FP16 models.

Figure 3.30: Successful Conversion to FP16 OpenVINO IR Models

The image below shows the .xml Inception-ResNet-v1 model architecture layers produced while converting the Facenet model.


Figure 3.31: Visualization Of Network Topology Of .xml File

From the image above, we can observe that the batch size is set to 1. From line 4 we can see that our input layer has FP16 precision, which shows that the model has been converted to the specified data format by the Model Optimizer's data conversion operation.

From lines 7-10 we see the dimensions 1, 3, 160, 160, which indicate that inference is performed on one image of 160 x 160 pixels with 3 RGB channels. If we want to use a batch size of 3 or 4, or reshape the input to our own dimensions, we do not need to train the model again: using the Model Optimizer, we can change the dimensions and batch size of the converted model. Note that as the batch size increases, the response time increases, but the frames per second might be higher.
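As a minimal sketch of this, assuming the 2019-era OpenVINO Python API and placeholder file names, the batch size of converted IR files can be changed at load time without retraining:

    from openvino.inference_engine import IENetwork

    net = IENetwork(model="facenet.xml", weights="facenet.bin")  # placeholder names
    print(net.batch_size)  # 1, as produced by the Model Optimizer
    net.batch_size = 4     # input reshaped from (1, 3, 160, 160) to (4, 3, 160, 160)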

Now the data is processed by a convolution layer named Conv2d_1a_3x3, then by a ReLU layer, and then by a max pooling layer. The image below shows how the dimensions change down the pipeline.

Figure 3.32: Convolution to Pooling Layer of Different Data size

Finally, we have a total of 314 layers, where the last three layers are a reshape layer, a fully connected layer, and a normalization layer. The fully connected layer has an output shape of 128; these values are extracted as features and stored in output blobs. The image below shows the last three layers of our optimized model.

Figure 3.33: Reshape to Normalization Layer Showing Different Data Size

The modified Facenet CNN architecture of the OpenVINO .xml model is as follows:

Input → 3x3 Conv2d_2a → ReLU → 3x3 Conv2d_2b → 3x3 MaxPooling → 1x1 Conv2d → ReLU → InceptionResnetV1/Repeat/Conv2d 3x3 → InceptionResnetV1/Repeat/Conv2d 3x3/ReLU → Concat → Eltwise → InceptionResnetV1/Logits/Flatten/Reshape → 8x8 Logits/AvgPool → Fully Connected (128) → Normalize

Figure 3.34: OpenVINO .XML Model Structure

3.9 Software

OpenVINO provides a Python Inference Engine API to deploy these optimized TensorFlow IR models on CPU, MYRIAD VPU, and FPGA targets. The API handles the models and performs inference in synchronous or asynchronous mode based on infer requests. For each neural network, the OpenVINO benchmark tool profiles the inference performance on the Myriad device without internal delay. The performance of our model is measured in terms of latency and throughput; mean values are taken as the profiled inference time.

We install the OpenVINO toolkit on Raspbian OS, which includes the Inference Engine and the MYRIAD plugin. The following modules are installed automatically on the Raspberry Pi3:

1. Inference Engine
2. OpenCV
3. Sample applications

The Raspberry Pi3 installation of OpenVINO does not include the Model Optimizer package, so the models must be converted on a host machine and then transferred to the Pi3. The image below shows my transferred OpenVINO models on the Raspberry Pi3.

Figure 3.35: Transferred OpenVINO Models to Raspberry PI3

The OpenVINO face recognition algorithm is done in three stages:

1. Integration of the Inference Engine with the MTCNN algorithm.

2. Training the SVM classifier using the extracted embeddings for each label (a sketch of this step is shown after Figure 3.36 below).

3. Performing live face recognition using the above trained models, loading the IR files, and accelerating performance on the Raspberry Pi3 using the Myriad device through the OpenVINO Inference Engine API. The image below is the flowchart for performing face recognition on the Raspberry Pi3 using the Intel Distribution of OpenVINO Toolkit deployed on an Intel NCS or Intel NCS2.

Figure 3.36: Flowchart for OpenVINO Face Recognition Algorithm
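As a minimal sketch of stage 2 above, assuming the 128-d Facenet embeddings and their labels from stage 1 have been saved to placeholder .npy files, the SVM classifier can be trained and serialized as follows:

    import pickle

    import numpy as np
    from sklearn.svm import SVC

    embeddings = np.load("embeddings.npy")  # (N, 128) Facenet embeddings, placeholder file
    labels = np.load("labels.npy")          # person name for each embedding

    clf = SVC(kernel="linear", probability=True)  # probability=True yields confidence values
    clf.fit(embeddings, labels)

    with open("svm_model.pkl", "wb") as f:  # later loaded on the Raspberry Pi3
        pickle.dump(clf, f)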

Chapter 4 Results and Discussions

The implementation of deep learning on the Raspberry Pi3 is a challenging and time-consuming process, and initially targeting real-time conditions directly can result in more effort and less outcome. Many efforts were made to improve the accuracy and speed on the Raspberry Pi3; the face recognition results obtained using the Intel NCS on the Raspberry Pi3 show a face recognition time of 0.1-0.3 seconds.

4.1 Overview

I organize the rest of these experimental results as follows.

In Section 4.2, I present the results of face recognition on the Raspberry Pi3 without the Intel NCS, by transferring the trained CNN model onto the Pi3 directly.

In Section 4.3, I present the results of face recognition under lighting, non-lighting, and different emotion conditions on the Raspberry Pi3 using the Intel NCS. In Section 4.4, I present the results of face recognition on the Raspberry Pi3 using Intel OpenVINO deployed on the Intel NCS and NCS2.

In Section 4.5, I present the performance metrics and the Facenet analysis based on the Intel NCS graph and the OpenVINO models.

4.2 Face Recognition Results on Raspberry Pi3 without using Intel NCS

The test dataset consists of 250 images covering all 5 persons, combining images taken under all possible conditions, i.e. normal light, low light, different poses, etc.

First, I performed recognition on the Pi3 without using the Intel NCS or OpenVINO: I transferred my trained CNN model onto the Pi3 and performed predictions with the SVM classifier. With this experiment, I measured the speed and accuracy of recognition before converting my trained model to OpenVINO format for accelerating the Pi3.

The images in Figure 4.1 below are my face recognition results for 4 persons on the Raspberry Pi3.

Figure 4.1: Face Recognition results on Raspberry pi3 without using Intel NCS


Figure 4.2: Accuracy and Time taken to Recognize

The images above show that the average time taken to recognize a face is 3.656 seconds. This time has to be reduced, as will be discussed in Sections 4.3 and 4.4.

I tested 50 frame counts per person and checked the performance by measuring the accuracy. For this, I took the most frequently occurring probability confidence value as the final accuracy.

Table 4.1 below shows the confusion matrix for the 4 persons. Using the scikit-learn machine learning library, we import the confusion matrix utilities for calculating accuracy, F1 score, precision, and recall; a sketch of this calculation follows the table.

Table 4.1: Confusion Matrix Table

In the confusion matrix above, the correct predictions for the 4 persons out of 50 frame counts each are 48, 49, 50, and 40; these are also known as true positives. Accuracy = true positives / total number of counts. The average accuracy is found to be 98% based on this equation.
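A minimal sketch of this metric calculation with scikit-learn, using short made-up label lists in place of the recorded 50 frame counts per person:

    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 precision_score, recall_score)

    # Placeholder labels; in the experiment these are the true and predicted
    # identities recorded for each frame count.
    y_true = ["Nimshi", "Nimshi", "Chatchai", "Clifford", "Amit Prasad"]
    y_pred = ["Nimshi", "Chatchai", "Chatchai", "Clifford", "Amit Prasad"]

    print(confusion_matrix(y_true, y_pred))
    print(accuracy_score(y_true, y_pred))
    print(precision_score(y_true, y_pred, average="macro"))
    print(recall_score(y_true, y_pred, average="macro"))
    print(f1_score(y_true, y_pred, average="macro"))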

From Table 4.2 we can observe that one of the main limitations on the Raspberry Pi3 is the speed of face recognition, as opposed to its accuracy. So, to improve the speed, we first performed face recognition on the Raspberry Pi3 using the Intel NCS, as discussed in Section 4.3.

Table 4.2: Accuracy and Total Time taken to perform Predictions on Raspberry Pi3

Method                                                  Accuracy    Time per frame for recognition
Facenet (feature extraction) + SVM classifier           98%         3.656 seconds

4.3 Face Recognition Results on Raspberry Pi3 using Intel NCS and NCSDK

In this section we look at the face recognition results using the Intel Movidius Neural Compute Stick on the Raspberry Pi3. First, we transfer our trained custom model graph file to the Raspberry Pi3. We also need a validation images folder where we store 2 images per person; I chose 2 to get high accuracy, as we implement a one-shot learning algorithm.

Image 4.3 below shows the graph file, validation images, and main code on the Raspberry Pi3.

Figure 4.3: Implementation on Raspberry Pi3 using Intel NCS

Images 4.4-4.6 below show face recognition on the Pi3 under lighting conditions, low lighting conditions, and different emotions for 5 persons. We can observe that our deep learning model performs well.

Figure 4.4: Face Recognition Results of 5 Persons under Lighting Conditions


Figure 4.5: Face Recognition Results of 5 Persons under Low Lighting Condition


Figure 4.6: Face Recognition Results under different Emotions

The image below shows the Euclidean distance calculation between 2 images, already discussed in more detail in the literature review of the Facenet one-shot learning algorithm. This distance gives a similarity value between 2 images. Image 4.7 below shows the Python code that uses NumPy to calculate the difference between two face images; these values are represented in matrix form, and a reconstruction of the calculation is sketched below.
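The calculation in Figure 4.7 can be reconstructed roughly as the following sketch, assuming 128-d NumPy embeddings; the variable names are placeholders:

    import numpy as np

    FACE_MATCH_THRESHOLD = 1.2  # threshold value used in this implementation

    def face_match(valid_output, test_output):
        # Sum of squared differences between the two embeddings: small for
        # the same identity, larger for different identities.
        total_difference = np.sum(np.square(valid_output - test_output))
        return total_difference, total_difference < FACE_MATCH_THRESHOLD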

Figure 4.7: Python Code for Calculating the Difference between 2 Images

If two images' embedding values are the same, the total difference value is small; otherwise it is larger. We can also observe in the Raspberry Pi3 shell (see image 4.8) that the program counts frames and reports a match with the particular person. Using these frame counts, we can create a confusion matrix for calculating accuracy.

Figure 4.8: Distance Calculation based on Threshold Value as shown in RaspberryPi3 Shell

4.4 Face Recognition Results on Raspberry Pi3 using Intel OpenVINO

Now we run the main code on the Raspberry Pi3, deploying it on the Myriad device and taking input from a live web camera, as shown in the image below.

Figure 4.9: Implementation on Raspberry Pi3 using Intel OpenVINO Inference Engine

Images 4.10-4.12 below show face recognition on the Pi3 under lighting conditions, hat-wearing conditions, and multiple-face recognition for 4 persons. We can observe that our optimized deep learning model performs well.

Figure 4.10: Face Recognition Results on Raspberry Pi3 Using Intel OpenVINO Method


Figure 4.11: Face Recognition Results on Raspberry Pi3 under HAT Conditions Using Intel OpenVINO Method


Figure 4.12: Multiple Face Recognition Results on Raspberry Pi3 Using Intel OpenVINO Method

Now we look at the inference time results using the Intel NCS on the Raspberry Pi3. The image below shows the inference time calculation using the Intel NCS on the Raspberry Pi3.

Figure 4.13: Inference Time Calculation Using OpenVINO deployed on Intel NCS

From the image above, we can observe that we have successfully accelerated performance on the Raspberry Pi3 and now recognize a face in 0.136 seconds.

Figure 4.14: Inference Time Calculation for Multiple Face Recognition using OpenVINO Deployed on Intel NCS

From the image above, we can observe that we have successfully accelerated multiple face recognition, now recognizing in 0.272 seconds.

The above results can be further improved by using the Intel NCS2, which can deliver 8 times the performance, as discussed in Chapter 3.

Now we look at the inference time results using the Intel NCS2 on the Raspberry Pi3. The image below shows the inference time calculation using the Intel NCS2 on the Raspberry Pi3.

Figure 4.15: Inference Time Calculation Using OpenVINO deployed on Intel NCS2

From the image above, we can observe that we have successfully accelerated performance on the Raspberry Pi3 using the NCS2, now recognizing in 0.05 seconds compared with 0.136 seconds.

Figure 4.16: Inference Time Calculation for Multiple Face Recognition using OpenVINO Deployed on Intel NCS2

From the image above, we can observe that we have similarly accelerated multiple face recognition using the Intel NCS2, now recognizing in 0.103 seconds compared with 0.272 seconds.

4.5 Comparison Study

The project was done in the above 3 stages. First, an analysis for model selection was done on the Ubuntu operating system; after choosing FaceNet as the best model (given its prediction time and accuracy), porting the system to the Raspberry Pi was the next stage.

The performance of real-time face recognition inference on the Pi3 has been measured through the following metrics:

 Average time for recognizing person

 Frames Per Second

 Latency

 Accuracy

The flowchart below (Figure 4.17) shows the timing analysis on the Intel NCS for measuring the prediction time of the Facenet model.

4.5.1 Timing Analysis of Face Recognition using Intel NCS

Loading image from validated image folder → Pre-processing image → Sending pre-processed image to Movidius → Recognition by Movidius → Receiving and final processing

Figure 4.17: Flowchart of how images are passed through Intel NCS

Pre-processing of the image includes resizing it to a fixed size of 160 x 160, matching the network input, using OpenCV. There are 6 prediction labels in this Facenet model: Nimshi, Amit Prasad, Sahuri Bond, Chatchai, Clifford, and Unknown Person.

Final processing includes overlaying the bounding boxes and labels on the frame, showing the prediction match for each detected person. Green indicates found_match = True, and red indicates found_match = False, i.e. the face is unmatched.

Table 4.3 shows the average time taken for each of the above steps, averaged over a total of 300 images of all 5 persons:

Table 4.3: Timing analysis of each of the step taken while prediction in Movidius NCS

Event                                        Time taken (ms)
Loading image from validated image folder    50
Pre-processing of image                      37
Sending image to Movidius                    37
Recognition by Movidius                      1
Receiving and final processing               5

The total prediction time is the sum of all five events: loading (50 ms) + pre-processing (37 ms) + sending to Movidius (37 ms) + recognition (1 ms) + final processing (5 ms). Therefore, the average prediction time comes to 130 milliseconds.

4.5.2 Timing Analysis of Face Recognition using Intel OpenVINO Inference Engine

The flowchart below (Figure 4.18) shows the timing analysis for measuring predictions with the OpenVINO models.

Loading trained SVM models onto Raspberry Pi3 → Reading IR models on Raspberry Pi3 → Allocating input and output blobs → Creating the executable network on Raspberry Pi3 → Pre-processing each live webcam frame with the MTCNN algorithm on Raspberry Pi3 → Inference time calculation using the asynchronous OpenVINO Inference Engine

Figure 4.18: Flowchart of how frames are passed through OpenVINO Inference Engine

First, we load the trained SVM model onto the Raspberry Pi3 after importing the OpenVINO Inference Engine API. Note that this step runs completely on the Raspberry Pi3.

The time taken to load the SVM model on the Pi3 is 1.794 seconds, as shown in the image below.

Figure 4.19: Time for loading SVM Models

Now we read the IR models and pass them as arguments to an Inference Engine network. Later, we use the Inference Engine plugin to load the MYRIAD plugin. Note that IENetwork and IEPlugin are imported from the OpenVINO Inference Engine API.

The image below shows that the time taken to read the IR models is 0.252 seconds.

Figure 4.20: Time for reading IR Models

Now we declare the input and output configuration. The image below shows the time taken to create the input and output blobs.

Figure 4.21: Time taken to generate Input and Output Blobs

We load this model onto the Myriad device by creating an executable network along with a number of infer requests. We can create several executable networks like this and use them to perform inference (a sketch is shown after Figure 4.22).

Figure 4.22: Time taken to Create Executable Network

The image above shows the time taken to create the executable network on the Raspberry Pi3.
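A minimal sketch of these steps with the 2019-era Inference Engine Python API (file names and the request count are placeholders):

    from openvino.inference_engine import IENetwork, IEPlugin

    # Read the IR models and wrap them in an Inference Engine network
    net = IENetwork(model="facenet.xml", weights="facenet.bin")
    input_blob = next(iter(net.inputs))   # input and output blob names
    out_blob = next(iter(net.outputs))

    # Load the MYRIAD plugin and create the executable network with infer requests
    plugin = IEPlugin(device="MYRIAD")
    exec_net = plugin.load(network=net, num_requests=2)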

Now everything is in place and we are ready to start inference. First, I want to calculate the time taken to pre-process each live frame on the Raspberry Pi3 with MTCNN, producing a cropped image for each detected face.

The image below shows that the average time taken is 25.56 seconds.

Figure 4.23: Time taken for Pre-Processing on Raspberry Pi3

Now, the interesting part is that we need to reduce this processing time, and this is where the OpenVINO Inference Engine comes in. When we pass the cropped image to the asynchronous Inference Engine and process the request to get the output, the time is reduced by a factor of 10: the CPU load is successfully reduced and performance is accelerated on the low-power Raspberry Pi3 using the MYRIAD VPU, as demonstrated in the image below.

Figure 4.24: Time taken for performing inference on my trained model
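Continuing the sketch above, the per-frame inference time can be measured around the asynchronous request; face_img stands for an MTCNN-cropped, pre-processed 160 x 160 face:

    import time

    import numpy as np

    face_img = np.zeros((1, 3, 160, 160), dtype=np.float32)  # placeholder face crop

    start = time.time()
    exec_net.start_async(request_id=0, inputs={input_blob: face_img})
    if exec_net.requests[0].wait(-1) == 0:  # status 0 means the request succeeded
        embedding = exec_net.requests[0].outputs[out_blob]
    print("inference time: %.3f s" % (time.time() - start))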

Table 4.4 below presents the overall summary of the time taken for all events.

Table 4.4: Timing analysis of each of the step taken while Performing Predictions using OpenVINO Inference Engine

Event                                       Time taken (seconds)
Loading SVM models                          1.795
Reading IR models                           0.252
Generating input and output blobs           0.000123
Creating the executable network             10.58
Pre-processing of frames (MTCNN)            25.56
Face recognition using Inference Engine     0.05

The total time taken on the Raspberry Pi3 before sharing the Pi3 CPU load with the VPU is the sum of the first five events, about 38.19 seconds. Once we share the Pi3 CPU load with the Myriad VPU using the Inference Engine, the recognition time drops to 0.05 seconds, or 50 milliseconds.

4.5.3 Performance Metrics and Facenet Analysis

We have trained our model and successfully tested it under the different conditions discussed above. We now check the performance of the OpenVINO models using the OpenVINO benchmark tool provided by Intel.

The 3 important factors that are used to measure our model's efficiency are:

1. Latency or Response Time

2. Throughput (Frames /sec or Inferences/sec)

3. Data Format (FP32, FP16, INT8)

I tested my .xml file with the benchmark tool using the Myriad device attached to an Intel Core CPU. Since we use the asynchronous API in this project, the primary metric to measure is throughput (inferences per second) in asynchronous mode. As discussed previously, asynchronous applications use a number of infer requests and execute the StartAsync method. The total number of executed iterations was Count = 618, and the time taken to finish the execution was Duration = 60345.3 ms, as shown in the image below.

Latency and throughput are measured for batch size = 1. We use batch size = 1 because, as the batch size increases, the latency (response time) increases, which affects the responsiveness of our OpenVINO model. As we can observe from the image below, for batch size = 1 the latency is 195.245 ms at 10.241 frames per second.
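For reference, a representative invocation of the benchmark tool (the model path is a placeholder) might look like:

    python3 benchmark_app.py -m facenet.xml -d MYRIAD -api async -b 1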

Figure 4.25: Benchmark Tool Results of my trained .XML File on Intel Core i3-8100 CPU Processor

The results above indicate that this model (FP16 format) gives good performance and can be integrated into our face recognition application on the Raspberry Pi3 accelerated by the VPU.


Figure 4.26: Benchmark Tool Results Of My Trained .XML File on Raspberry PI3 ARM Processor

As we can observe from the image above, my trained .xml file achieved a throughput of 9.08 frames per second on the Raspberry Pi3, which is almost similar to the performance on the Intel Core i3-8100 CPU processor at 10.2411 frames per second. However, we cannot trust these benchmark results completely: in practice, we observed that the Intel NCS2 performs inference classification at 5.08 frames per second.

Now we perform the analysis. The following three scenarios are considered:

● Intel Core i3-8100 CPU Processor

● Raspberry PI3 only

● Raspberry PI3 + Movidius NCS (NCSDK VS OpenVINO)

This is because the development of the project was done in the above 3 stages. First, an analysis for model selection was done on a PC; after choosing FaceNet as the best model (given its prediction time and accuracy), porting the system to the Raspberry Pi was the next stage. The Pi-only setup yielded a time of 5-10 seconds per image, which was reduced to 0.13 seconds in the next stage by integrating the Movidius NCS with the Raspberry Pi.

This can be improved further by using the Intel OpenVINO toolkit and deploying on the Intel NCS2 to further accelerate performance on the Pi3.

Table 4.5 below shows the comparative study of the 4 scenarios just discussed for the different measured parameters.

Table 4.5: Performance Analysis on 4 Different Hardware Platforms

Measuring parameter                    Core i3-8100     Raspberry    PI3 + Intel    PI3 + Intel NCS
                                       CPU processor    PI3          NCS (NCSDK)    (OpenVINO)
Average time to recognize a person     0.92 sec         11.56 sec    0.13 sec       0.05 sec
Frames per second                      6.132 fps        0.56 fps     3.05 fps       5.08 fps

Figure 4.27 below plots the average prediction time (in seconds) for five cases: the Intel Core i3-8100 CPU processor, the Raspberry Pi3 (direct implementation without hardware accelerators), Pi3 + Intel NCS (NCSDK), Pi3 + Intel NCS (OpenVINO), and Pi3 + Intel NCS2 (OpenVINO).

Figure 4.27: Facenet Prediction Graph

Figure 4.28 below plots the frame rate (in frames per second) for four cases: the Intel Core i3-8100 CPU processor, the Raspberry Pi3 (direct implementation without hardware accelerators), Pi3 + Intel NCS (NCSDK), and Pi3 + Intel NCS (OpenVINO).

Figure 4.28: Facenet Frame Rate Graphs

We can observe from Figure 4.28 that by using the Intel NCS we boosted performance on the Pi3 considerably by offloading inference to the VPU.

4.5.4 Accuracy Calculation of Intel NCS SDK in Non-Lighting Conditions

I tested 50 frame counts per person and checked the performance by comparing the predicted labels with the true labels. Using the scikit-learn machine learning library, we import the confusion matrix utilities for calculating accuracy, F1 score, precision, and recall.

Table 4.6 below shows the confusion matrix results of face recognition for 5 persons in non-lighting conditions.

Table 4.6: Confusion Matrix in Non-Lighting Conditions

 For the above confusion matrix, the correct predictions for the 5 persons out of 50 frame counts each are 47, 46, 50, 48, and 49; these are also known as true positives.

 Accuracy = True Positives/Total Number of counts.

 From the above formula we calculated accuracy as 96%

 F1 score is a good metric when data is imbalanced. F1 Score is the harmonic mean of the recall and precision.

 F1 Score = 2 x Precision x Recall / (Precision + Recall)

 Precision = True Positives / (True Positives + False Positives)

 Precision(Nimshi) = 0.96

 Precision(Amit Prasad) = 0.959

 Precision(Chatchai) = 0.961

 Precision(Sahuri Bond ) = 0.96

 Precision(Clifford) = 0.960

 Average Precision = (Precision(Nimshi) + Precision(Amit Prasad) + Precision(Chatchai) + Precision(Clifford) + Precision(Sahuri Bond)) / 5 = 0.96

 Recall = True Positives / (True Positives + False Negatives)

 False Negative(Nimshi) = 3

 False Negative(Amit Prasad) = 4, False Negative(Chatchai) = 0 ,

 False Negative(Sahuri Bond ) = 2, False Negative(Clifford) = 1

 Recall(Nimshi) = 0.94, Recall(Amit Prasad) = 0.92, Recall(Chatchai) = 1.0,

 Recall(Sahuri Bond) = 0.96, Recall(Clifford) = 0.98

 Average Recall = (Recall(Nimshi) + Recall(Amit Prasad) + Recall(Chatchai) + Recall(Sahuri Bond) + Recall(Clifford)) / 5 = 0.96

 The F1 score is obtained by substituting the above average recall and average precision values into the equation above.

 F1 Score = 0.96 or 96%

4.5.5 Accuracy Calculation of Intel NCS SDK in Lighting Conditions

Similarly, I tested 50 frame counts per person and checked the performance by comparing the predicted labels with the true labels. Using the scikit-learn library, we import the confusion matrix utilities for calculating accuracy, F1 score, precision, and recall.

Table 4.7 below shows the confusion matrix results of face recognition for 5 persons in normal lighting conditions.

Table 4.7: Confusion Matrix in Lighting Conditions

 For the above confusion matrix, the correct predictions for the 5 persons out of 50 frame counts each are 42, 44, 45, 48, and 47; these are also known as true positives.

 Accuracy = True Positives/Total Number of counts.

 From the above formula we calculated accuracy as 90.4%

 F1 score is a good metric when data is imbalanced. F1 Score is the harmonic mean of the recall and precision.

 F1 Score = 2 x Precision x Recall / (Precision + Recall)

 Precision = True Positives / (True Positives + False Positives)

 Precision(Nimshi) = 0.893

 Precision(Amit Prasad) = 0.88

 Precision(Chatchai) = 0.918

 Precision(Sahuri Bond ) = 0.905

 Precision(Clifford) = 0.921

 Average Precision = (Precision(Nimshi) + Precision(Amit Prasad) + Precision(Chatchai) + Precision(Clifford) + Precision(Sahuri Bond)) / 5 = 0.9034

 Recall = True Positives / (True Positives + False Negatives)

 False Negative(Nimshi) = 8

 False Negative(Amit Prasad) = 6, False Negative(Chatchai) = 5 ,

 False Negative(Sahuri Bond ) = 2, False Negative(Clifford) = 3

 Recall(Nimshi) = 0.84, Recall(Amit Prasad) = 0.88, Recall(Chatchai) = 0.9,

 Recall(Sahuri Bond) = 0.96, Recall(Clifford) = 0.94

 Average Recall = (Recall(Nimshi) + Recall(Amit Prasad) + Recall(Chatchai) + Recall(Sahuri Bond) + Recall(Clifford)) / 5

 Average Recall = 0.904

 The F1 score is obtained by substituting the above average recall and average precision values into the equation above.

 F1 Score = 0.9037 or 90.37%.

4.5.6 Accuracy Calculation for Intel NCS2 OpenVINO Implementation on Pi3

We used the scikit-learn SVC to classify 4 persons and obtained the prediction confidence for each recognized face per frame. So, in order to measure the accuracy of the model, we considered a total of 50 frame counts per person, noted all the correct predictions by comparing with the true labels, and also noted the 50 confidence values per person. Using the scikit-learn library, we import the confusion matrix utilities for calculating accuracy, F1 score, precision, and recall.

Table 4.8 below shows the confusion matrix results of face recognition for the 4 persons.

Table 4.8: Confusion Matrix for OpenVINO Based Implementation on Raspberry Pi3

For the above confusion matrix, the correct predictions for the 4 persons out of 50 frame counts each are 48, 46, 45, and 49; these are also known as true positives.

Accuracy = True Positives/Total Number of counts.

From the above formula we calculated accuracy as 94%.

F1 score is a good metric when data is imbalanced. F1 Score is the harmonic mean of the recall and precision.

F1 Score = 2 x Precision x Recall / (Precision + Recall)

Precision = True Positives / (True Positives + False Positives)

 Precision(Nimshi) = 0.96

 Precision(Amit Prasad) = 0.902

 Precision(Chatchai) = 0.957

 Precision(Clifford) = 0.942

 Average Precision = (Precision(Nimshi) + Precision(Amit Prasad) + Precision(Chatchai) + Precision(Clifford)) / 4 = 0.94


Recall = True Positive / (True Positive + False Negative)

 Recall(Nimshi) = 0.96, Recall(Amit Prasad) = 0.92, Recall(Chatchai) = 0.9, Recall(Clifford) = 0.98

 Average Recall = (Recall(Nimshi) + Recall(Amit Prasad) + Recall(Chatchai) + Recall(Clifford)) / 4 = 0.94


The F1 score is obtained by substituting the above average recall and average precision values into the equation above.

 F1 Score = 0.94 or 94 %.

Table 4.9: Overall Summary for Measuring Accuracy Performance Using OpenVINO IR Models and Intel NCS

Method                                                       Accuracy    Average precision    Average recall    F1 score
OpenVINO IR models (feature extraction) + SVM classifier     94%         94%                  94%               94%

From these 50 confidence values per person, we take the most frequently occurring confidence value, as shown in Figure 4.29 below.

Figure 4.29: Printing the Maximum Probability Prediction Confidence Value Corresponding to Clifford Label

Table 4.10 shows the probability confidence values of the SVM classifier, with a threshold value of 0.85, for all the predicted labels.

Table 4.10: Overall Probability Confidence Values based on Maximum Frequency

Label name     Confidence (%)
Amit Prasad    92.31
Nimshi         95.77
Chatchai       91.56
Clifford       96.5

The graphs below (Figures 4.30 and 4.31) show the accuracy on the Raspberry Pi3 under two conditions:

Without NCS

With NCS

Figure 4.30 below plots the average accuracy of the 4 persons (y-axis from 86% to 100%) for four cases: Raspberry Pi3 (without Intel NCS), Pi3 + Intel NCS (non-lighting conditions), Pi3 + Intel NCS (lighting conditions), and Pi3 + Intel NCS (OpenVINO).

Figure 4.30: Accuracy for 4 different cases

Figure 4.31 below plots the precision, recall, and F1 score (y-axis from 86% to 100%) for four cases: Raspberry Pi3 (without Intel NCS), Pi3 + Intel NCS (non-lighting conditions), Pi3 + Intel NCS (lighting conditions), and Raspberry Pi3 + Intel NCS2 (OpenVINO).

Figure 4.31: Precision, Recall and F1 Score for 4 different cases

Chapter 5 Conclusion, Recommendations and Future Works

5.1 Conclusion

With the implementation of Facenet face recognition, 4 class labels, i.e. Nimshi, Amit Prasad, Chatchai, and Clifford, were successfully recognized under various lighting conditions and different environments. A USB camera was used to obtain a continuous live video stream of the room. As a result, this system can successfully recognize a face and also verify whether it matches the validated image dataset or not.

The Intel NCSDK was used to compile the trained model, reducing its size to a graph file of about 48.6 MB. This NCSDK output graph file is taken as input, and inference is performed on the test image. The original model did not give satisfactory performance on the Raspberry Pi3, but performance improved after reducing the size of the Facenet model. We observed some limitations of this implementation, such as background matching and lower accuracy in lighting conditions compared with non-lighting conditions.

To overcome the above limitations and improve performance in both time and accuracy, we used the OpenVINO toolkit with the Intel NCS2 and achieved an inference prediction time of 50 milliseconds with an accuracy of 94%.

By optimizing the architecture of the Facenet model with the NCSDK or OpenVINO, the recognition speed increased by a factor of about 10, though the accuracy was slightly reduced compared with the original Facenet model. This study compares the performance of the Facenet model using the Intel NCS on the Raspberry Pi3 and shows how such sticks accelerate performance on low-power edge devices, measured in frames per second, inference time, and accuracy.

5.2 Recommendations and Future Works

• The Intel NCS used in this study is one of the powerful vision processing units, but the more powerful next-generation Intel NCS2 has since been released, and the Raspberry Pi4 has recently come to market. Running the same face recognition application on such hardware would further improve the system's speed.

• If the system were trained with a much greater number of images collected at different distances, angles, and pose variations, it would achieve higher accuracy and precision. In the current system, the face must be facing the camera for correct recognition.

• Similarly, the system can be expanded with multiple NCS devices on the Raspberry Pi for faster inference compared with one stick. Also, in the present one-shot learning implementation, the complete background is taken into the comparison; in future work, a YOLO-based face detector graph could be combined with the Facenet graph to obtain much better accuracy under lighting conditions.

• In the coming future, Edge AI will be the hottest research topic, and low bit-width models will be needed. Many embedded hardware companies will choose different hardware, such as Jetson boards, Google Coral, and the Intel NCS2, based on the application, and will develop algorithms for object detection, image classification, human pose estimation, etc.
