FAULT TOLERANCE AND RE-TRAINING ANALYSIS ON NEURAL NETWORKS

by ABHINAV KURIAN GEORGE B.Tech Electronics and Communication Engineering Amrita Vishwa Vidhyapeetham, Kerala, 2012

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science, Computer Engineering, College of Engineering and Applied Science, University of Cincinnati, Ohio 2019

Thesis Committee:

Chair: Wen-Ben Jone, Ph.D.

Member: Carla Purdy, Ph.D.

Member: Ranganadha Vemuri, Ph.D.

ABSTRACT

In the current age of big data, artificial intelligence and related technologies have gained much popularity. Due to the increasing demand for such applications, neural networks are being targeted toward hardware solutions. Owing to the shrinking feature size, the number of physical defects is on the rise. This growing number of defects prevents designers from realizing the full potential of the on-chip design. The challenge now is not only to find solutions that balance high performance and energy efficiency but also to achieve fault tolerance of a computational model. Neural computing, due to its inherent fault tolerant capabilities, can provide promising solutions to this issue. The primary focus of this thesis is to gain a deeper understanding of fault tolerance in neural network hardware.

As a part of this work, we present a comprehensive analysis of fault tolerance by exploring the effects of faults on popular neural models: the multi-layer perceptron model and the convolution neural network. We built the models based on the conventional 64-bit floating point representation. In addition to this, we also explore the recent 8-bit integer quantized representation. A fault injector model is designed to inject stuck-at faults at random locations in the network. The networks are trained with the basic backpropagation algorithm and tested against the standard MNIST benchmark. For training pure quantized networks, we propose a novel backpropagation strategy. Depending on the performance degradation, the faulty networks are re-trained to recover their accuracy.

Results suggest that: (1) neural networks cannot be considered completely fault tolerant; (2) quantized neural networks are more susceptible to faults; (3) using a novel training algorithm for quantized networks, comparable accuracy is achieved; (4) re-training is an effective strategy to improve fault tolerance. In this work, a 30% improvement in the quantized network is achieved, compared to a 6% improvement in floating point networks, using the basic backpropagation algorithm. We believe that using more advanced re-training strategies can enhance fault tolerance to a greater extent.

Copyright 2019, Abhinav Kurian George

This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the express permission of the author.

To my loving family and friends

Acknowledgments

I would like to extend my sincere thanks to my advisor Dr. Wen-Ben Jone. He has always been a source of inspiration and motivated me to keep moving forward. I am extremely grateful to him for working along with me even though I had to move away from the university. Dr. Jone was always available for me even through his rough times and I will always cherish working under him. His guidance and advice were very instrumental for shaping the course of this thesis.

I would like to thank my thesis committee members Dr. Ranganadha Vemuri and Dr. Carla Purdy for reviewing and providing feedback on this work. I really appreciate the kind gesture and the efforts they have taken to go through this material.

Finally, I would like to thank my family back in India, who have always supported me. They instilled confidence and faith in me when I started doubting my abilities. I would also like to give special thanks to my friends, especially Sangeetha, for sitting for hours together and reviewing my work. Without all your prayers and blessings this work would not have been possible.

Thank you all.

Contents

Acknowledgments
Contents
List of Figures
List of Tables
List of Abbreviations
1 Introduction
2 Background
2.1 Neural Network
2.1.1 Multi-Layer Perceptrons (MLP)
2.1.2 Convolution Neural Network
2.1.3 Recurrent Neural Networks
2.2 Backpropagation Algorithm
2.3 Applications of ANNs
2.3.1 Pattern Classification
2.3.2 Clustering or unsupervised pattern classification
2.3.3 Prediction/Forecasting
2.3.4 Content Addressable Memory
2.3.5 Control
2.4 Fault Tolerance
2.4.1 Basic terms related to fault tolerance
2.4.2 Fault tolerance in neural networks
2.4.3 TPU and quantization
3 System Architecture
3.1 Fault Injection
3.2 Experimental Setup for Feed Forward type neural network
3.2.1 MNIST Benchmark
3.2.2 The Feed Forward Neural Network
3.2.3 Overall System for Feed Forward Re-training Experimentation
3.3 System Design For Convolution Neural Network
3.3.0.1 Fault Injection System For CNN
3.3.0.2 Re-training of CNN
3.4 Quantized Neural Network
3.4.1 Quantization And De-quantization Scheme
3.4.2 Quantized Feed Forward Network (Q-FNN)
3.4.3 Fault Injection in Q-FNN
3.4.4 Re-training Experiment for Quantized Feed Forward Network
4 Experimentation Results
4.1 Metrics For Qualification
4.2 Results for Floating Point Data-path
4.2.1 Results for Feed Forward Neural Network
4.2.2 Results for Convolution Neural Network
4.3 Results for Quantized Data Path
4.3.1 Results for Quantized Feed Forward Neural Network
4.3.1.1 Results for Two Layer Quantized Feed Forward Neural Network
4.3.1.2 Results for Five Layer Quantized Feed Forward Neural Network
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work

List of Figures

2-1 Simple neuron in ANN [1]
2-2 Multi-layer perceptron (MLP) [2]
2-3 Convolution Layer Output with receptive field 3x3 [3]
2-4 Hyperparameters of convolution layer
2-5 Max pooling layer [3]
2-6 Recurrent Neural Network [4]
2-7 Feed forward network - backpropagation
2-8 Architecture of LeNet-5 [5]
2-9 Cause effect relationship between fault, error and failure [6]
2-10 Classification of faults [6]
2-11 Block diagram of TPU [7, 8]
2-12 of MXU [7, 8]
2-13 Performance-per-watt comparison [8]
2-14 Quantizing in Tensorflow [8]
3-1 A simple neuron with two inputs
3-2 Two input neuron with possible fault sites
3-3 Perceptron Neural Network to identify greater of two inputs
3-4 Fault injected in the Network
3-5 Results of simulating the faulty system in Simulink
3-6 Custom neural network architecture
3-7 Feed forward neural network used for hand written digit recognition [9]
3-8 Feed forward neural network with re-Training
3-9 Flow chart describing overall system
3-10 Convolution neural network
3-11 Convolution example - CNN
3-12 Max pooling example - CNN
3-13 Layer-1 convolution and maxpooling - CNN
3-14 Layer-2 convolution - CNN
3-15 Layer-2 flatten - CNN
3-16 Layer-3 dense - CNN
3-17 CNN fault injector output - single stuck-at fault
3-18 Quantization function [10]
3-19 De-quantization function [10]
3-20 Quantized two layered feed forward network architecture
3-21 Components of quantized two layered feed forward network
3-22 Quantized five layered feed forward network architecture
3-23 Components of quantized five layered feed forward network
3-24 Fault injector output for two layer QFNN
3-25 Re-training architecture for two layered QFNN
3-26 Re-training architecture for five Layered QFNN
4-1 Maximum accuracy of floating FFN
4-2 Minimum accuracy of floating FFN
4-3 Average accuracy of floating FFN
4-4 Accuracy plot with one stuck-at fault
4-5 Average confidence plot for floating FFN
4-6 Minimum confidence plot for floating FFN
4-7 Accuracy improvement after re-training
4-8 Confidence improvement after re-training
4-9 Number of times re-trained for each fault
4-10 QoR plot - re-training with critical faults
4-11 QoC plot - re-training with critical faults
4-12 Number of times re-trained - QoR metrics
4-13 Maximum accuracy floating CNN
4-14 Minimum accuracy floating CNN
4-15 Average accuracy floating CNN
4-16 QoR improvement by re-training CNN
4-17 Worst case QoR after re-training CNN
4-18 Number of times re-trained CNN
4-19 Maximum QoR for two layer QFFN
4-20 Minimum QoR for two layer QFFN
4-21 Average QoR for two layer QFFN
4-22 QoR for single stuck at fault
4-23 QoR improvement after re-training
4-24 Minimum QoR per fault after re-training
4-25 Number of times 2 layer QFNN network re-trained
4-26 Maximum QoR for a particular fault - five layer QFFN
4-27 Minimum QoR for a particular fault - five layer QFFN
4-28 Average QoR five layer vs two layer QFFN
4-29 QoR before and after re-training

List of Tables

4.1 Maximum number of images recognized - Faulty layer network
4.2 Average Number of Images Identified - Faulty Layer network
4.3 Minimum Number of Images Recognized - Faulty layer network

List of Abbreviations

ANN: Artificial neural network
NN: Neural network
FFN: Feed forward network
MLP: Multi-layer Perceptron
CNN: Convolution neural network
RNN: Recurrent neural network
QNN: Quantized neural network
QoR: Quality of recognition
QoC: Quality of confidence
QFFN: Quantized feed-forward network
TPU: Tensor Processing Unit
MNIST: Modified National Institute of Standards and Technology database

Chapter 1

Introduction

Recent years have witnessed explosive growth in the amount of data and computer usage [11]. The need to process such large sets of data has led to a renaissance in machine learning. Artificial intelligence in general is applicable to a wide range of problems that involve pattern classification, clustering, trend prediction, optimization and control [12]. In particular, breakthroughs in deep neural networks have led to efficient solutions to complex problems such as image recognition and autonomous driving. Thus this remains an attractive field of research. Due to the increasing demand for neural computing in recent years, research is directed toward custom solutions in hardware.

Artificial Neural Networks (ANN) are simple mathematical blocks inspired by the human brain [13]. The artificial neuron is a nonlinear function of a weighted sum of inputs. Layers of artificial neurons in sequence form a neural network. A deep neural network comprising many such layers allows accurate processing of large data sets. Neural networks are usually associated with two phases, namely training (development) and inference (prediction). Conventionally, training is carried out in floating point and requires a huge amount of parallelism. This is why GPUs are a popular platform. Quantization is a step used in Tensor Processing Units (TPU), introduced in [7], for efficient machine learning computations. By means of quantization, a 32-bit floating point number is converted into an 8-bit integer, which is good enough for inference. De-quantization converts an 8-bit quantized integer value back into a floating point number. Re-quantization is another procedure in TensorFlow that maps a 32-bit signed integer value to an 8-bit unsigned integer value [10]. In a well trained network the precision thus achieved is as good as that of its floating point version, with the added advantage of lower hardware and memory requirements [7].
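The quantization and de-quantization steps described above can be illustrated with a minimal sketch. It assumes a simple affine (min/max) mapping between a floating point range and unsigned 8-bit codes; the function names `quantize` and `dequantize` and the range arguments are hypothetical and are not the TensorFlow API. The scheme actually used in this work is described in Section 3.4.1.

```python
def quantize(x, x_min, x_max, bits=8):
    """Map a float in [x_min, x_max] to an unsigned integer code (assumed affine scheme)."""
    levels = (1 << bits) - 1            # 255 distinct steps for 8 bits
    scale = (x_max - x_min) / levels    # float width of one integer step
    x = min(max(x, x_min), x_max)       # clamp values outside the range
    return int(round((x - x_min) / scale))


def dequantize(q, x_min, x_max, bits=8):
    """Map an integer code back to an approximate float value."""
    levels = (1 << bits) - 1
    scale = (x_max - x_min) / levels
    return x_min + q * scale


# Round trip for one value in the illustrative range [-1, 1].
q = quantize(0.5, -1.0, 1.0)
x = dequantize(q, -1.0, 1.0)
```

The round trip loses at most half a quantization step, which is the precision loss referred to when comparing quantized and floating point networks.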

Backpropagation is the most widely used training algorithm. As the name suggests, the error is propagated backwards through the network and is used to update the network parameters. The error is computed by comparing the response of the network with the expected value. It is an iterative algorithm that uses the chain rule to compute the gradients of the network weights and biases that are to be updated during learning. An important parameter in the backpropagation algorithm is the learning rate, denoted by α. The value of α is typically very small and controls how much the weights and biases are adjusted in each update. A detailed description with an example is provided in Section 2.2.

Due to the shrinking feature size, reliability is becoming a major concern in semiconductor technologies, as the number of physical defects increases. This growing number of defects prevents designers from realizing the full potential of the on-chip design. The challenge now is to balance not only high performance and energy efficiency, but also the fault tolerance of a computational model. Neural computing is one such promising methodology in the fault-tolerant computing paradigm. This is because neural networks mimic the parallel and distributed structure of the human brain [6]. However, in practice the inherent fault tolerance capability is limited, and neural networks cannot be considered intrinsically fault tolerant without a proper design. Obtaining a truly fault tolerant network is still an important issue. Error confinement and replication techniques are conventionally used to address this. They cannot be directly applied to network hardware, because the computation and information are naturally distributed [6].

In an attempt to study fault tolerance, we explored it on various neural network architectures, namely feed forward networks (FFN) (i.e., multi-layer perceptron (MLP)) and the convolution neural network (CNN). An MLP is a network in which each layer is a nonlinear function of the weighted sum of outputs from the previous layer [7]. In a CNN, each layer is a nonlinear function of the weighted sum of spatially nearby subsets of outputs from the previous layer [7]. The neural networks used in this work are designed to perform image recognition, in particular hand-written digit recognition. We used the standard MNIST benchmark, which consists of 60,000 training images and 10,000 test images [14]. The images are hand-written digits from 0 to 9. We used the basic backpropagation algorithm to train the networks.

We designed a fault injector model to study the behavior of the networks in a faulty environment. Stuck-at faults, in particular stuck-at-0 and stuck-at-1, are the fault models used in this work. The faults are injected only at the outputs of the basic computational blocks. In the MLP network, faults are present at the outputs of the multiplier, adder, and the activation function. In the CNN, faults are placed at the outputs of the convolution, adder, activation, pooling and flatten blocks. The fault injector model injects stuck-at faults at random locations. To present a comprehensive analysis, each network is tested against a total fault count varying from 1 to 25. Further, each fault count is iterated 25 times. At each iteration, the fault is injected at a new random location.
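As a rough illustration of the fault injection campaign described above, the sketch below forces one bit of a block output to 0 or 1 and draws random fault placements for fault counts 1 to 25 with 25 placements each. It assumes bit-level faults on an 8-bit unsigned output, distinct fault sites within one placement, and a hypothetical total of 100 fault sites; the helper names are illustrative and do not reproduce the fault injector of Chapter 3.

```python
import random


def stuck_at(value, bit, stuck_value, width=8):
    """Force one bit of an unsigned `width`-bit value to 0 or 1."""
    mask = 1 << bit
    value &= (1 << width) - 1          # keep only the modeled bits
    return value | mask if stuck_value else value & ~mask


def random_faults(num_sites, fault_count, width=8, rng=random):
    """Pick `fault_count` distinct sites, each with a random bit and polarity."""
    sites = rng.sample(range(num_sites), fault_count)
    return [(site, rng.randrange(width), rng.randrange(2)) for site in sites]


# Sweep mirroring the text: fault counts 1..25, 25 random placements each.
# NUM_SITES is a hypothetical number of block outputs in the network.
NUM_SITES = 100
rng = random.Random(0)
campaign = {
    count: [random_faults(NUM_SITES, count, rng=rng) for _ in range(25)]
    for count in range(1, 26)
}
```

Each entry of `campaign[count]` is one placement: a list of `(site, bit, polarity)` tuples that a simulator could apply with `stuck_at` to the corresponding block output.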

Initially, a fault-free network is trained using the basic backpropagation algorithm to obtain a reference set of results. Next, faults are injected into the same network to test the number of images it is now able to recognize. If the number of images recognized is below a threshold value, the network is re-trained using the same backpropagation algorithm. The threshold value depends on two metrics: Quality of Recognition (QoR) and Quality of Confidence (QoC). These metrics quantify the total number of images recognized and the confidence of recognition, respectively. This is the basic outline of our experimentation methodology.
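The threshold-driven flow outlined above can be sketched as a small loop. This is only an assumed structure: `evaluate` and `retrain` are hypothetical callables supplied by the caller, and the cap `max_rounds` is an illustrative safeguard that the text does not specify.

```python
def retrain_until_recovered(evaluate, retrain, threshold, max_rounds=10):
    """Re-train a faulty network until its score reaches `threshold`.

    `evaluate()` returns the current score, e.g. the number of images recognized.
    `retrain()` runs one more round of backpropagation on the faulty network.
    Returns (final_score, rounds_used).
    """
    score = evaluate()
    rounds = 0
    while score < threshold and rounds < max_rounds:
        retrain()
        rounds += 1
        score = evaluate()
    return score, rounds


# Toy stand-in for a faulty network: each re-training round recovers 1000 images.
state = {"recognized": 7000}


def evaluate():
    return state["recognized"]


def retrain():
    state["recognized"] = min(10000, state["recognized"] + 1000)


final, rounds = retrain_until_recovered(evaluate, retrain, threshold=9000)
```

In this toy run the score starts at 7000 of 10,000 images and reaches the threshold after two re-training rounds.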

Further, the set of experiments is divided into two sections: floating point (64-bit) and quantization (8-bit). We have used the TensorFlow framework developed by Google for quantization [10]. As quantization is an attractive form of data representation, it has been gaining importance in neural computing. Researchers have been exploring training algorithms for quantized neural networks [15, 16]. For the purpose of re-training in our experiment, we designed a novel backpropagation technique for training the quantized networks. This algorithm is unique in that it uses a pure integer inference model during the forward pass. In the backward pass the outputs are de-quantized to maintain the required precision.

We observe that the fault tolerance capability of quantized networks is much lower than that of floating point networks. This is partially because precision is lost during quantization and each bit represents more information in quantized networks. However, re-training the network can help improve the fault tolerance, thereby reducing the error rate. For the floating point network, we achieved a 6% improvement in QoR and 18% in QoC. In the case of the quantized network, we achieved a 30% improvement in QoR. As our goal was to perform a simple and comprehensive analysis, we used the basic backpropagation algorithm throughout the experiment. We believe that by using a more intelligent backpropagation regime the fault tolerance could be improved drastically.

This thesis is organized as follows:

Chapter 2: reviews the background information about ANN, algorithms used for training, fault tolerance methods and details of quantization.

Chapter 3: discusses the proposed methodologies in detail. We also explain the design of neural network architectures and fault injector model used in the experiment.

Chapter 4: presents the results of experimentations designed based on the methods proposed in Chapter 3.

Chapter 5: discusses the conclusion and future work.

Chapter 2

Background

This thesis uses many concepts from artificial intelligence and machine learning. In this chapter, we provide a detailed background on each of the required concepts to help the reader better understand the contents of this work.

2.1 Neural Network

ANNs are large collections of simple interconnected processors that form massively parallel systems inspired by the human brain [12]. The basic building block of a neural network is called a neuron. Each neuron has one or more inputs and a single output. The neuron possesses unique parameters called weights and biases. These parameters are altered during the learning phase to learn the intended functionality. Weights symbolize the connection strength between the neurons. These weights and biases help activate the neuron. Each neuron is activated using a non-linear mathematical function such as ReLU, sigmoid, tanh, etc. [1]. As shown in Figure 2-1 [1], the inputs of a neuron are multiplied by the corresponding weights. The sum of the weighted inputs and the bias is then given as input to the activation function. Neural networks are considered universal approximators [17]. There are mainly three kinds of ANNs: Multi-layer Perceptrons, Convolution Neural Networks and Recurrent Neural Networks.

Figure 2-1: Simple neuron in ANN [1]
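The neuron computation described above (weighted sum of inputs, plus bias, passed through an activation function) can be written as a short sketch. The function names and the example values are illustrative assumptions.

```python
import math


def relu(z):
    """Rectified linear unit: passes positive values, clips negatives to zero."""
    return max(0.0, z)


def sigmoid(z):
    """Logistic function, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus bias, followed by the activation function."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)


# Two-input neuron with illustrative weights and bias.
y_relu = neuron([1.0, 2.0], [0.5, -0.25], 0.1, relu)
y_sigmoid = neuron([1.0, 2.0], [0.5, -0.25], 0.1, sigmoid)
```

Swapping the `activation` argument changes only the final non-linearity; the multiplier and adder stages stay the same.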

2.1.1 Multi-Layer Perceptrons (MLP)

The original perceptron model was proposed by Rosenblatt in 1958 [18, 2]. The MLP is the extended variant of this model. MLPs belong to the class of feed forward neural networks. An MLP consists of an input layer, one or more hidden layers and an output layer. The connections between layers are directed, in the sense that lower layers are connected only to upper layers. However, the neurons within a layer are not connected to each other. The number of neurons in the input layer depends on the size of the input. In our case, since we are using 28x28 images, the number of input neurons is 784. Input neurons have a linear activation function. The number of output neurons depends on the number of output classes. In this work we have to recognize hand-written images of digits between 0 and 9; therefore we have 10 classes, or 10 output neurons. The number of neurons in the hidden layer can be varied to improve accuracy or avoid over-fitting during training [2]. Figure 2-2 [2] shows the general architecture of an MLP. MLPs are usually trained using the backpropagation algorithm. Ivakhnenko and Lapa developed the first trained deep MLP, consisting of 8 hidden layers [19]. A small example of an MLP network is discussed in the forward pass phase of Section 2.2.

Figure 2-2: Multi-layer perceptron (MLP) [2]

2.1.2 Convolution Neural Network

Convolution Neural Networks (CNNs) are inspired by the work of David H. Hubel and Torsten N. Wiesel (1962) on a cat's visual cortex [20]. They observed that cells in the cat's visual cortex act as filters that fire only for a particular input pattern. This formed the basis for the modern CNN as described in the "Neocognitron" proposed by Kunihiko Fukushima and Sei Miyake (1982) [21]. CNNs have fewer parameters than their MLP counterparts, and this property makes them ideal for deep neural networks. They are also capable of taking raw inputs and identifying features from them. Suppose we build an MLP for recognizing a full HD RGB image of 1920x1080 pixels. This would require an input layer of about 6 million neurons. Suppose it is connected to a hidden layer having 1000 neurons. This would result in about 6 billion parameters for the first layer. This requires a lot of hardware for storage. The training process would also be very slow, as the number of parameters is very high.
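The parameter count quoted above follows from simple arithmetic, reproduced in the short sketch below. It counts only the weights of a fully connected first layer and ignores biases; the helper name `dense_weights` is illustrative.

```python
def dense_weights(n_inputs, n_hidden):
    """Number of weights in a fully connected layer (biases not counted)."""
    return n_inputs * n_hidden


# Full HD RGB image: 1920 x 1080 pixels with 3 colour channels.
input_neurons = 1920 * 1080 * 3
first_layer_weights = dense_weights(input_neurons, 1000)
```

Here `input_neurons` evaluates to 6,220,800 (about 6 million) and `first_layer_weights` to 6,220,800,000 (about 6 billion).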

CNNs were designed to solve this computational problem by using convolution layers instead. Convolution layers differ from feed forward layers in that each neuron in the layer is connected to only a small subset of neurons in the previous layer. The subset is usually a square shaped region whose size is a hyperparameter named the receptive field. Figure 2-3 shows how a neuron in a convolution layer is connected to its input layer with a receptive field of 3x3. Thus each neuron in the layer would be connected to 27 neurons in the input layer (3 x 3 x 3, assuming an input depth of three channels). Each neuron recognizes a local pattern from the previous layer's outputs. The patterns are independent of their position in the image, so all the neurons can share the same parameters. This is referred to as parameter sharing. In order to recognize multiple features within a layer, CNNs have many filters. These filters are the weights of the features and are adjusted during training. The filters move in strides, which is another hyperparameter of the convolution layer. Strides determine the number of convolution operations and the distance between two scanned regions. Padding is another hyperparameter, which pads zeros around the input and ensures that a considerable number of convolution operations are performed near the edges of the input. Figure 2-4 shows the three hyperparameters of a convolution layer. The output size of a convolution layer can be calculated as follows:

$$\text{out} = \frac{\text{in} - \text{receptive field} + 2 \times \text{padding}}{\text{stride}} + 1$$
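The output size formula above can be checked with a few lines of code. The function name and the choice to raise an error when the stride does not divide evenly are assumptions made for this sketch; some frameworks round down instead.

```python
def conv_output_size(in_size, receptive_field, padding, stride):
    """Output width (or height) of a convolution layer along one dimension."""
    span = in_size - receptive_field + 2 * padding
    if span % stride != 0:
        # Design choice for this sketch: reject configurations that do not tile evenly.
        raise ValueError("filter does not tile the padded input evenly")
    return span // stride + 1


# A 28x28 input (the MNIST image size) with a 5x5 receptive field, no padding, stride 1.
out_a = conv_output_size(28, 5, 0, 1)
# The same input with a 3x3 receptive field and padding 1 keeps the spatial size.
out_b = conv_output_size(28, 3, 1, 1)
```

With these illustrative settings `out_a` is 24 and `out_b` is 28.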

Figure 2-3: Convolution Layer Output with receptive field 3x3 [3]

Figure 2-4: Hyperparameters of convolution layer

A convolution layer in the CNN is usually followed by a pooling layer. The pooling layer is also square shaped, with dimensions equal to the width and height of the previous layer. However, a pooling layer does not have any weights or biases. It performs a fixed function on its inputs. The max pooling layer is the most common pooling layer. It returns the maximum value of each input region. Figure 2-5 describes the operation of the max pooling layer. CNNs have been successful in domains such as speech recognition. A small example of convolution and pooling operations in a CNN is discussed in Section 3.3.

Figure 2-5: Max pooling layer [3]
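A minimal sketch of max pooling is given below, assuming a 2x2 window moved with stride 2 over a small two-dimensional feature map stored as nested lists. The window size, stride and input values are illustrative.

```python
def max_pool2d(image, size=2, stride=2):
    """Return the maximum of each size x size window of a 2-D list."""
    rows, cols = len(image), len(image[0])
    return [
        [
            max(image[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, cols - size + 1, stride)
        ]
        for r in range(0, rows - size + 1, stride)
    ]


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool2d(feature_map)
```

For this input, `pooled` is `[[6, 8], [3, 4]]`: each output value is the largest entry of one non-overlapping 2x2 region, and no weights or biases are involved.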

2.1.3 Recurrent Neural Networks

In traditional neural networks we assume that the inputs are independent of each other. Thus we discard the output after each input. However, this may not be true, and we may need information about the previous input to predict the next output. This is the basis for designing a recurrent neural network (RNN). RNNs have a feedback loop that ensures the previous information is retained. Thus an RNN is said to have memory. Figure 2-6 shows a simple RNN with a feedback loop that passes on previous results. An RNN can be thought of as unfolded copies of the network 'A'. The most popular RNN is the Long Short Term Memory (LSTM) network, which has been used for speech recognition, image processing and text processing [22]. An LSTM can decide which information needs to be retained and which can be discarded.

Figure 2-6: Recurrent Neural Network [4]
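The feedback loop can be illustrated with a minimal scalar sketch, assuming a single recurrent unit with a tanh activation. The weights `w_x`, `w_h` and the bias `b` are hypothetical values chosen for illustration; this is a plain recurrent unit, not an LSTM.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrent step: the new state depends on the input and the previous state."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0, h0=0.0):
    """Unfold the recurrent unit over a sequence and collect the hidden states."""
    h, states = h0, []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states


# A single non-zero input followed by zeros: later states are still non-zero,
# because the feedback loop carries the earlier input forward.
states = run_rnn([1.0, 0.0, 0.0])
```

The loop in `run_rnn` corresponds to the unfolded copies of the network: the same weights are reused at every time step.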

2.2 Backpropagation Algorithm

Neural networks in the early days did not have the ability to learn by themselves [23, 24]. By 1958, simple networks that were capable of supervised learning had been developed [25]. Supervised learning creates a mapping function that maps inputs to outputs based on the sample training data. The training data set consists of pairs of sample data and the desired output. Supervised learning using backpropagation is the simplest algorithm used for training neural networks. The algorithm and its optimization are discussed in various papers [26, 5, 27, 28]. Backpropagation uses the gradient descent method to compute weights and biases that minimize the error. The chain rule helps to calculate the gradients iteratively for each layer. It is important to ensure the differentiability of the error function. This can be done by using activation functions such as sigmoid, ReLU, etc. The working principle of backpropagation is briefly explained using a small example in the following paragraphs.

Figure 2-7: Feed forward network - backpropagation

Consider a three layer feed forward neural network as shown in Figure 2-7. The backpropagation algorithm is divided into two parts: (i) forward pass and (ii) backward pass. We first initialize each layer with random weights. Let us say the initial weights are as follows: w1 = 0.11, w2 = 0.21, w3 = 0.12, w4 = 0.08, w5 = 0.14 and w6 = 0.15. Let us say the inputs are i1 = 2 and i2 = 3, and the desired output is out = 1. We first do a forward pass with the inputs i1, i2. The outputs of the hidden and output layers are computed as follows:

$$h_1 = i_1 w_1 + i_2 w_2 = 2 \times 0.11 + 3 \times 0.21 = 0.85$$

$$h_2 = i_1 w_3 + i_2 w_4 = 2 \times 0.12 + 3 \times 0.08 = 0.48$$

$$out = h_1 w_5 + h_2 w_6 = 0.85 \times 0.14 + 0.48 \times 0.15 = 0.191$$

Assume that we have not used any bias in this example and that we have considered linear activation functions in the hidden and output units. Next we compute the error function of the network, for which we commonly use the mean squared error (MSE). This error is then propagated backwards through the network in the backward pass step. An example of the backward pass for the network in Figure 2-7 is described below.

The MSE for the network can be given as

$$Error = \frac{1}{2}\left(\text{predicted\_out} - \text{desired\_out}\right)^2$$

where predicted_out = 0.191 and desired_out = 1. Therefore the error of the network is Error = 0.327. The error function is indirectly a function of the weights and can be reduced by modifying the weights of the network. Gradient descent, mentioned above, is an iterative technique to modify/update the weights. The modified weights in general can be written as follows:

$$W_x^{*} = W_x - \alpha \frac{\partial Error}{\partial W_x}$$

Here, α is the learning rate of the network, assumed to be 0.05. Using the equation above and the chain rule to find $\partial Error / \partial W_x$, we calculate the modified weights w6, w5, w4, w3, w2, w1 as follows:

$$\Delta = \frac{\partial Error}{\partial out} = \text{predicted\_out} - \text{desired\_out} = 0.191 - 1 = -0.809$$

$$w_6^{*} = w_6 - \alpha \frac{\partial Error}{\partial out} \frac{\partial out}{\partial w_6} = w_6 - \alpha \, h_2 \, \Delta = 0.17$$

$$w_5^{*} = w_5 - \alpha \frac{\partial Error}{\partial out} \frac{\partial out}{\partial w_5} = w_5 - \alpha \, h_1 \, \Delta = 0.17$$

$$w_4^{*} = w_4 - \alpha \frac{\partial Error}{\partial out} \frac{\partial out}{\partial h_2} \frac{\partial h_2}{\partial w_4} = w_4 - \alpha \, \Delta \, w_6 \, i_2 = 0.1$$

$$w_3^{*} = w_3 - \alpha \frac{\partial Error}{\partial out} \frac{\partial out}{\partial h_2} \frac{\partial h_2}{\partial w_3} = w_3 - \alpha \, \Delta \, w_6 \, i_1 = 0.13$$

$$w_2^{*} = w_2 - \alpha \frac{\partial Error}{\partial out} \frac{\partial out}{\partial h_1} \frac{\partial h_1}{\partial w_2} = w_2 - \alpha \, \Delta \, w_5 \, i_2 = 0.23$$

$$w_1^{*} = w_1 - \alpha \frac{\partial Error}{\partial out} \frac{\partial out}{\partial h_1} \frac{\partial h_1}{\partial w_1} = w_1 - \alpha \, \Delta \, w_5 \, i_1 = 0.12$$

This completes the backward pass, and all the weights in the network have been updated. After many such iterations, the network converges and learns the intended function. Having seen the different types of networks and the learning algorithm that trains them, we turn in the next section to the wide range of applications of neural networks, which illustrates their importance.
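The worked example of this section can be reproduced numerically with the short sketch below. It assumes, as in the example, linear activations, no biases, the error function Error = 1/2 (predicted_out − desired_out)^2 and a learning rate of 0.05; the dictionary layout and function names are illustrative choices.

```python
def forward(i1, i2, w):
    """Forward pass of the 2-2-1 network with linear activations and no biases."""
    h1 = i1 * w["w1"] + i2 * w["w2"]
    h2 = i1 * w["w3"] + i2 * w["w4"]
    out = h1 * w["w5"] + h2 * w["w6"]
    return h1, h2, out


def backward(i1, i2, desired, w, alpha=0.05):
    """One gradient descent update of all six weights via the chain rule."""
    h1, h2, out = forward(i1, i2, w)
    delta = out - desired  # dError/dout for Error = 1/2 (out - desired)^2
    return {
        "w1": w["w1"] - alpha * delta * w["w5"] * i1,
        "w2": w["w2"] - alpha * delta * w["w5"] * i2,
        "w3": w["w3"] - alpha * delta * w["w6"] * i1,
        "w4": w["w4"] - alpha * delta * w["w6"] * i2,
        "w5": w["w5"] - alpha * delta * h1,
        "w6": w["w6"] - alpha * delta * h2,
    }


weights = {"w1": 0.11, "w2": 0.21, "w3": 0.12, "w4": 0.08, "w5": 0.14, "w6": 0.15}
h1, h2, out = forward(2, 3, weights)
error = 0.5 * (out - 1) ** 2
updated = backward(2, 3, 1, weights)
new_out = forward(2, 3, updated)[2]
```

Rounded to two decimals, `updated` matches the values listed in the example, and `new_out` is closer to the desired output of 1 than the initial prediction of 0.191, which is the expected effect of one gradient descent step.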

2.3 Applications of ANNs

NNs can approximate any functionality with considerable accuracy. Hence they are widely used in many applications such as speech recognition, pattern recognition, data processing, etc. [29]. Such applications have a wider range of error tolerance. For instance, artificially intelligent networks are used in applications that require trend prediction, which is very intuitive based on the training regime. Predicting bankruptcy [30, 31] is one such application. We classify the numerous applications into the following categories [12].

2.3.1 Pattern Classification

This is basically a classification problem in which the network has to map an input to one of the output classes. Applications such as speech recognition, character/image recognition and blood cell classification fall under this category. Image recognition is a complex computational problem that can be broken down into processes like image pre-processing, segmentation, feature extraction and matching identification [32]. Among the different network models, CNNs have demonstrated greater performance on this problem. LeNet-5, shown in Figure 2-8, is a multi-layer ANN developed by LeCun [5, 33] that classifies handwritten numbers. However, the performance of LeNet-5 is limited to this. Due to the lack of huge training datasets and computational capacity, the performance of LeNet-5 degrades significantly on complex problems such as large scale image and video classification. Since then, many advanced methods have been proposed that overcome the challenges encountered in training. AlexNet, proposed by Krizhevsky in 2012 [34], showed significant improvement over previous works. The success of AlexNet [34] has motivated several other works as improvements, such as VGGNet [35] and GoogLeNet [36].

Figure 2-8: Architecture of LeNet-5 [5]

2.3.2 Clustering or unsupervised pattern classification

In clustering or unsupervised pattern classification, the input dataset is analyzed and grouped based on pattern similarities. Some of the well-known applications in this category include data mining, data compression and exploratory data analysis. Yuhui Yao and Lihui Chen used associate clustering neural networks to analyze gene data and understand the similarities between two gene samples [37]. Soumadip Gosh and Amitava Nag suggest a way to use ANNs to study the patterns in large meteorological data [38].

2.3.3 Prediction/Forecasting

As the name suggests, the network is used for prediction based on the previous trend. Such prediction capability is in high demand today in business, science and engineering fields, e.g., stock market prediction [39] and internet traffic prediction [40].

2.3.4 Content Addressable Memory

Traditionally, most computers access the operands required for computation by their addresses in memory. The address of the operand is first computed, and if there is an error in this computation an illegal operand may be accessed. Thus it is more desirable to access memory based on its content. This is the basis of associative memory, or content addressable memory. The advantage of such an approach is that the content may be accessible even with partial inputs; such memories are widely used in multimedia databases. Laurentiu Mihai and Alin Gheorghi present a hardware design for content addressable memory (CAM) based on the Hopfield network model [41].

2.3.5 Control

Neural networks are used to model control inputs for various dynamic systems. The network must model the inputs such that the system follows the same trajectory as the reference model. Dong Wei and Xinghua Pan [42] present a dynamic variable air volume (VAV) system in which a multi-layer feed-forward neural network acts as an optimal feedback loop. A further area in which neural networks are used to model the control signals is reported in [43].

Beyond these classes of applications, ANNs are used in yet another critical area: monitoring, detecting and diagnosing faults in several important real-time systems. The main advantage of using ANNs for such tasks is that they are autonomous and do not require any human supervision. A considerable amount of work is being done to design fault detection and diagnosis (FDD) machines using artificial neural networks [44, 45, 46]. The importance of ANNs in day-to-day life can hardly be overstated, and ANNs will continue to be a vital part of human society. However, one should not forget that such applications are ultimately based on hardware, which makes them susceptible to physical faults of their own. Hence, it is important to address fault tolerance in neural network hardware. The following section explores this idea in more detail. We also present examples and discuss the basic idea of this work.

2.4 Fault Tolerance

As the feature size decreases, the reliability of the corresponding semiconductor technologies also degrades. The number of physical defects is increasing, making reliability a major design constraint. This growing number of defects prevents designers from realizing the full potential of the on-chip design. The challenge now is to balance not only high performance and energy efficiency but also the fault tolerance of a computational model. Neural computing is one promising methodology in the fault-tolerant computing paradigm.

According to the study in [47], the human brain has an inherent capability to tolerate a small number of synapse faults or external noise. The nervous system has efficient components that adapt to such faults or noise and is still able to perform at a very high level of accuracy. Neural networks, which are inspired by the brain, are commonly considered to possess the same characteristics due to their parallel and distributed structure. Studies have shown that neural networks are tolerant to noisy inputs or approximations. For example, tolerance to approximation can substantially improve the performance and energy gains of the hardware network design [48, 49]. However, in practice, the inherent fault tolerance capability is limited, and neural networks cannot be considered intrinsically fault tolerant without a proper design.

Obtaining a truly fault-tolerant network is still an important issue. Error confinement and replication are the conventional techniques used to improve fault tolerance. They cannot be directly applied to network hardware because the computation and information are naturally distributed [6].

To explore the various techniques for improving fault tolerance in ANNs, let us first take a look at the basic concepts of faults. According to Huitzil in [6], a fault is defined as "an anomalous physical condition in a system that gives rise to an error". An error is a manifestation of a fault that may result in a deviation from the expected output [6]. A failure refers to the system's inability to perform the intended functionality [6]. There exists a cause-effect relationship between fault, error and failure, which is conceptually described in Figure 2-9. The propagation of errors results in failure. However, the presence of a fault does not necessarily result in an error.

Figure 2-9: Cause effect relationship between fault, error and failure [6]

According to Huitzil in [6], faults can be further classified as permanent or transient based on their temporal characteristics. A fault is said to be permanent if it is stable and continuous with time. A transient fault, on the other hand, is present only temporarily and is commonly caused by external noise. Figure 2-10 depicts the classification of faults along with their typical causes.

Figure 2-10: Classification of faults [6]

In order to detect and diagnose faults, fault models are developed. Good fault models describe the physical manifestation and types of faults that can occur. They can also help detect the components that can become defective and misbehave under the influence of a fault in the system. While defining fault models for neural computing, two main requirements must be kept in mind: accuracy and tractability [6]. Accuracy ensures that realistic faults are modelled, while tractability ensures that complex systems can be studied at affordable computational cost [6]. Thus, when deriving realistic models at higher levels of abstraction, one should ensure that they accurately capture the faults at lower physical levels [6]. In this regard, stuck-at fault models are conventionally used. The stuck-at fault model, as the name suggests, indicates that a data or control line is stuck at logic level high (stuck-at-1) or logic level low (stuck-at-0). Owing to the simplicity yet effectiveness of the stuck-at model, we choose to explore it in detail in our experimentation (discussed in Chapter 3).
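As a minimal illustration of the stuck-at model on a single data word, the following Python sketch forces one bit of an 8-bit value to a fixed logic level. The function name `apply_stuck_at`, the 8-bit width and the convention that bit 0 is the most significant bit are assumptions made for this example only.

```python
def apply_stuck_at(word: int, bit: int, value: int, width: int = 8) -> int:
    """Force one bit of a `width`-bit word to a fixed logic level.

    Assumption for this sketch: bit 0 is the most significant bit.
    """
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    if value not in (0, 1):
        raise ValueError("stuck-at value must be 0 or 1")
    word &= (1 << width) - 1            # keep only `width` bits
    mask = 1 << (width - 1 - bit)       # position of the faulty line
    return word | mask if value == 1 else word & ~mask


fault_free = 0b01010010
print(bin(apply_stuck_at(fault_free, bit=0, value=1)))  # stuck-at-1 on the MSB
print(bin(apply_stuck_at(fault_free, bit=6, value=0)))  # stuck-at-0 on a set bit
print(bin(apply_stuck_at(fault_free, bit=3, value=1)))  # line already at 1
```

In the last call the faulty line already carries the stuck value, so the word is unchanged. This is the simplest case of a fault that is present but does not produce an error.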

2.4.1 Basic terms related to fault tolerance

The basic terminology related to the trustworthiness of a system is defined as follows [50, 51, 52].

Reliability: A system is said to be reliable if it performs correctly with a high probability in the presence of faults [6]. Reliability is a quality that is measured over time and is used to quantify uncertainty in a system. Estimating reliability requires gathering statistical data and using probability theory.

Robustness: It is the property that allows a system to continue operating correctly in the event of perturbation of its inputs or in the presence of external noise.

Fault tolerance: It is the property of a system that guarantees proper operation in the event of faults.

Error Resilience: It is defined as tolerance to approximation during computations.

2.4.2 Fault tolerance in neural networks

"A network $N$ that performs a computation $H_N$ is fault tolerant if the computation $H_{N_{fault}}$ produced by the faulty network is close to $H_N$" [6]. Formally, for $\epsilon > 0$, $N$ is called $\epsilon$-fault tolerant if it tolerates faulty components for any subset of size at most $n_{fails}$:

$$\| H_N(X) - H_{N_{fault}}(X) \| \le \epsilon \quad \forall X \in T$$

Here $X$ is any stimulus applied to the networks $N$ and $N_{fault}$, belonging to the training set $T$ or forming part of the input data to be processed by the network. Complete fault tolerance is achieved when the computation of the faulty network $H_{N_{fault}}$ is equal to $H_N$. However, a pre-defined threshold is generally set, below which the network is considered no longer able to function. Thus fault tolerance in neural computing depends on the threshold that defines the acceptable degree of performance. The threshold is specific to the application [6, 53].
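The condition above can be checked numerically for a given pair of networks. The sketch below is illustrative only: the function name `is_epsilon_fault_tolerant` and the toy networks are hypothetical, and the Euclidean norm is an assumption, since the norm is not specified in the definition.

```python
import math
from typing import Callable, Iterable, Sequence

Vector = Sequence[float]


def is_epsilon_fault_tolerant(
    h_n: Callable[[Vector], Vector],
    h_n_fault: Callable[[Vector], Vector],
    inputs: Iterable[Vector],
    epsilon: float,
) -> bool:
    """Return True if ||H_N(X) - H_N_fault(X)|| <= epsilon for every X in `inputs`.

    Assumption: the Euclidean norm is used to compare the two outputs.
    """
    for x in inputs:
        diff = [a - b for a, b in zip(h_n(x), h_n_fault(x))]
        if math.sqrt(sum(d * d for d in diff)) > epsilon:
            return False
    return True


# Toy stand-ins for a fault-free network and a faulty copy with a small offset.
fault_free = lambda x: [x[0] + x[1]]
faulty = lambda x: [x[0] + x[1] + 0.01]
samples = [(0.0, 1.0), (1.0, 1.0)]

print(is_epsilon_fault_tolerant(fault_free, faulty, samples, epsilon=0.05))
print(is_epsilon_fault_tolerant(fault_free, faulty, samples, epsilon=0.001))
```

The two calls show how the verdict depends on the application-specific threshold: the same faulty network passes with the looser bound and fails with the tighter one.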

Fault tolerance can be further distinguished as active or passive. A system with active fault tolerance dynamically recognizes and nullifies the effects of faults [6]. In such a system, the tasks performed by faulty elements are reallocated to their fault-free counterparts [54]. A passive fault-tolerant system, on the other hand, nullifies the effect of faults by exploiting its intrinsic ability to mask them [55]; because it is designed to mask faults, no reconfiguration is required. There is also a hybrid approach in which active and passive fault tolerance complement each other, as explored by Nelson in [56].

Passive fault tolerance strategies essentially improve fault tolerance without re-training. The three main categories of this approach enhance fault tolerance by: (i) design optimization/constraints on the neural network, (ii) redundancy, and (iii) modifying training algorithms.

Fault tolerance can be enhanced by explicitly introducing redundancy into a neural network. In such methods, a pre-trained network performs the desired computational tasks. Based on the fault model and the obtained results, critical neurons are replicated or synaptic weights are distributed evenly [6]. Emmerson and Damper [57] investigated fault tolerance in multi-layer perceptrons performing pattern recognition. Backpropagation was used as the training algorithm, and perceptrons with different numbers of hidden nodes were considered. The trained networks were subjected to random, physically plausible faults over a number of iterations. The results showed that fault tolerance is not a function of the number of hidden units; in other words, the fault tolerance capability is not enhanced as the number of hidden units increases. A mechanism called augmentation was proposed in this work to improve fault tolerance, wherein each hidden neuron was replicated along with its connections, taking care that the same input-output mapping is maintained in the network. Such augmented networks displayed better fault tolerance [57]. However, the improvement in this method is achieved at the cost of a significant computational resource overhead.

Phatak and Koren [54] explored fault tolerance in feed-forward neural networks with one hidden layer. The fault model used was permanent stuck-at faults; it allowed synaptic weights to be permanently stuck at ±W in addition to the conventional stuck-at-0 weights. They also proposed a solution wherein hidden units were replicated and the weights and biases of the output neurons were scaled accordingly. Experiments showed that the amount of redundancy required was application-dependent. Also, a significant amount of redundancy was required to make the network completely fault tolerant even for a single fault. The authors also suggest that modifying the training algorithm is a more feasible solution.

An artificial neural network model comprises two components: (i) the architecture, and (ii) the training algorithm. By modifying either of the two, new fault tolerance properties can be achieved. Approaches that enhance the fault tolerance of an existing architecture by modifying or enhancing the training algorithm are discussed in the following paragraphs.

Sequin and Clay [58] showed that training with only single faults also yields fault tolerance against multiple faults. They modified the training procedure of a feed-forward neural network so that faults could be injected into randomly selected hidden units during training. Up to three hidden neurons were randomly selected for fault injection.

Another approach to modifying or enhancing training algorithms is to add a penalty term to the training cost function to bias the solution towards a fault-tolerant result. In other words, an error function is introduced into conventional learning algorithms to promote uniform information distribution. Using a regularization parameter in the error function, the output of the network can be controlled based on the degree of faults. For instance, Cavalieri and Mirabella [59] proposed a modified backpropagation algorithm that updates the synaptic weights based on a threshold value, which is learned dynamically during the training cycle. A stuck-at fault model was used for analysis on a multi-layer perceptron. While such modified approaches seem attractive, they often increase the convergence time.

All of the above-mentioned methods use the passive fault tolerance approach, wherein the system does not detect faults online. In methods where active fault tolerance is implemented, the tasks of faulty elements are dynamically reallocated to fault-free ones. Detailed discussions and novel approaches to achieve this are given in [60, 61, 62]. A self-recovery mechanism called weight shifting, applied to feed-forward neural networks, was introduced by Khunasaraphan [60]. Weight shifting is invoked once a neuron becomes faulty: the weights on the faulty links of a neuron are shifted to the fault-free links of the same neuron. In case of a completely faulty neuron, the output links of the output neuron are rendered faulty [6, 60]. As interesting as this approach seems, it treats the fault detection circuitry as a black box, which makes it difficult to implement in hardware.

Some of the other notable works in this area are briefly discussed in this paragraph. Fault tolerance in recurrent neural networks used for solving optimization problems is investigated by Protzel in [53]. Up to thirteen simultaneous stuck-at faults were injected into the network at random locations. By viewing the faults as a constraint on the system, a conditional performance measure was defined [6, 53]. The experimental results suggest that optimization networks exhibit inherent partial fault tolerance. Nijhuis and Spaanenburg [63] characterized fault tolerance as the probability that the network will function correctly given x faulty neurons and y faulty connections [6, 63]. The degree of fault tolerance in such models depends on the type of physical faults. Also, a neural model with a small error rate is shown to be less vulnerable to broken connections [6, 63].

The studies in fault tolerance seen so far are limited to a few types of neural network architectures, and the solutions presented in them come with overheads in terms of computation or convergence time. Meanwhile, the demand for neural computing is increasing rapidly. In this era of big data, neural computing has become a necessity in hand-held or mobile devices such as cell phones. For example, sophisticated features such as face recognition, voice recognition and iris scanning are commonly available on cell phones today. The main challenge in including a neural engine in such power-constrained devices is achieving a small size together with high-speed performance. In an attempt to address these challenges, the research team of Jouppi and Young [3] explored the Tensor Processing Unit (TPU) architecture, the details of which are discussed in the next section.

2.4.3 TPU and quantization

Domain-specific hardware is now used to achieve major improvements in performance and energy consumption [7]. The Tensor Processing Unit (TPU) is a custom ASIC that was deployed in data centers in 2015. While modern CPUs are based on the Reduced Instruction Set Computer (RISC) design, the TPU is based on the Complex Instruction Set Computer (CISC) design [7, 8]. RISC focuses on defining and processing simple instructions (for example, load and store), whereas CISC focuses on implementing high-level complex instructions (for example, multiplying and adding many times).

Figure 2-11: Block diagram of TPU [7, 8]

As shown in Figure 2-11, the TPU comprises three main elements:

Matrix Multiplier Unit (MXU): 8-bit multiply-and-add units for matrix operations. The MXU of the first-generation TPU contains 65,536 such units [8].

Unified Buffer (UB): 24MB of SRAM that works as registers [8].

Activation Unit (AU): Activation functions that are hardwired [8].

The TPU operates on high-level instructions such as Matrix Multiply or Convolve, Read Weights and Activate, defined in [8]. The CISC instruction set focuses on the operations required for inference [8]. The TPU MXU is a matrix processor that processes hundreds of thousands of simple operations in a single clock cycle. The TPU uses a systolic array mechanism to carry out MXU operations, as shown in Figure 2-12. The 256x256 = 65,536 ALUs allow the TPU to process 65,536 multiply-and-add operations at a time. Running at 700MHz, the TPU can compute 92 teraops per second. From Figure 2-13, we can see that the TPU provides an 83x better performance-per-watt ratio compared to a conventional CPU and 29x better compared to a GPU [8].
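The peak throughput figure follows from the array size and the clock rate quoted above. The short calculation below is a sketch that assumes each multiply-and-add counts as two operations, which is the convention needed to arrive at roughly 92 teraops per second.

```python
# Figures quoted in the text for the first-generation TPU.
mxu_rows, mxu_cols = 256, 256
clock_hz = 700e6

macs_per_cycle = mxu_rows * mxu_cols     # multiply-and-add units in the MXU
ops_per_mac = 2                          # assumption: one multiply plus one add
peak_ops_per_second = macs_per_cycle * clock_hz * ops_per_mac

print(f"{macs_per_cycle} multiply-and-add units")
print(f"{peak_ops_per_second / 1e12:.1f} teraops per second")
```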

Figure 2-12: Systolic array of MXU [7, 8]

Figure 2-13: Performance-per-watt comparison [8]

Another interesting fact to note is that the TPU achieves such a high level of performance by means of a technique called quantization. As neural computing has advanced into Deep Neural Networks (DNNs), the number of layers has greatly increased, and operations are performed on large data sets. Conventionally this is carried out in floating point, which is why GPUs are a popular resource for training. However, neural network functions during inference do not require the precision of floating point [8]. Hence, by using the optimization technique of quantization, floating-point numbers are transformed into approximate 8-bit integers. Quantization approximates an arbitrary value between a preset minimum and maximum by an 8-bit integer, while taking accuracy into account, as shown in Figure 2-14 [8]. Quantization is a compression problem [64]. The weights and activation values in a trained neural network are distributed across comparatively small ranges [64]. Since neural nets are robust to small amounts of noise, the error introduced by quantization remains within an acceptable threshold, as explained in detail in [64]. The main advantage is that with such a data representation, fast calculations, especially large matrix multiplications, can be carried out. The corresponding reduction in memory due to data compression is also large. For example, when quantization is applied to Inception (an image recognition model), it is compressed from 91MB to 23MB, which is approximately 75% smaller than its original size [8]. Being able to use such integer operations drastically reduces the hardware and energy consumption of the TPU. The TPU contains up to 25x more multipliers than a GPU of the kind widely used in cloud environments, which consists of 32-bit floating-point multipliers [8]. This facilitates the higher throughput of the TPU.
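The idea of mapping a preset range onto 8-bit integers can be sketched as follows. This is a simple linear min/max scheme written for illustration; the function names `quantize` and `dequantize` and the example weights are hypothetical, and the sketch is not claimed to reproduce the exact scheme used in TensorFlow or in the TPU.

```python
def quantize(values, v_min, v_max, levels=256):
    """Map each value in [v_min, v_max] to an integer code in 0..levels-1."""
    scale = (v_max - v_min) / (levels - 1)
    codes = []
    for v in values:
        clipped = min(max(v, v_min), v_max)   # values outside the range saturate
        codes.append(round((clipped - v_min) / scale))
    return codes, scale


def dequantize(codes, v_min, scale):
    """Recover approximate real values from the integer codes."""
    return [v_min + c * scale for c in codes]


weights = [-0.8, -0.25, 0.0, 0.31, 0.75]          # hypothetical trained weights
codes, scale = quantize(weights, v_min=-1.0, v_max=1.0)
restored = dequantize(codes, v_min=-1.0, scale=scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"largest round-trip error: {max_error:.5f} (step size {scale:.5f})")
```

For values inside the preset range, the round-trip error is bounded by half the step size, which is the small perturbation that the network is expected to absorb.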

Figure 2-14: Quantizing in Tensorflow [8]

As quantization is an attractive form of data representation, much research has been carried out in this area lately. Although quantization appears very efficient in terms of hardware, memory consumption and computation cost, the hardware implementation is still vulnerable to physical faults. Reagen and Gupta [65] explored fault tolerance and presented a framework for fault injection in DNNs. They injected permanent as well as transient faults into the memory module of a DNN, as it is susceptible to memory faults [65]. Zhang and Gu [66] studied fault tolerance in the TPU MAC. They injected stuck-at faults into the MAC operation in the systolic array RTL model. With a hybrid algorithm, they were able to improve the fault tolerance of the quantized network [66].

So far all the works presented have targeted a specific type of network or a specific module in the network. We present an overall picture of fault tolerance in this work. Not only do we aim to present a comprehensive analysis, but also attempt to solve issues in quantized training efficiently. As one shall see in Chapters 3 and 4, we have conducted a set of experiments that explore the effect of faults on various types of neural networks. In addition to this, we have also explored improvement in fault tolerance by using re-training strategy across these networks.

Chapter 3

System Architecture

The goal of this chapter is to discuss the details of the experimentation and the analysis of results. We have experimented with custom neural networks in each of the corresponding phases. The floating-point data path has been the conventional data path for machine learning. Its high precision ensures minimal information loss as the values are passed through the layers. However, it is important to observe that a floating-point data path requires large storage and considerable computational time. In an attempt to address this, we also study the quantized data path [7, 15, 16], details of which are discussed in section 3.3. Experimentation in this work can be divided into two phases according to the type of data path explored: Phase-I - floating-point data path and Phase-II - quantized data path.

In order to study the effect of faults in hardware, we simulate and inject faults into the network. We then re-train the faulty network to observe the effectiveness of re-training. The following Sections 3.1 to 3.3 explain the fault injection procedure/methodology for MLP and CNN. Section 3.4 explains how the experiment is repeated for the quantized neural network.

3.1 Fault Injection

In this section we discuss the details of the proposed fault injector model for a simple perceptron neural network. It is important to note that all faults modelled in this work are stuck-at faults only. In addition, we have assumed that the faults are present only at the outputs of the basic blocks of a neuron. As illustrated in Figure 3-1, consider a neuron with two inputs $i_1$, $i_2$ having corresponding weights $w_1$, $w_2$ and bias $b_1$. The output $o_1$ of this neuron can be represented mathematically as $o_1 = f_n(i_1 w_1 + i_2 w_2 + b_1)$. Translating this neuron into hardware, we can represent the basic blocks as shown in Figure 3-2. According to our assumption, a fault can be at the output of a multiplier, the summing element or the activation function (the output of the neuron). The possible fault sites are represented by the 'x' marks in Figure 3-2. Observe that a maximum of 4 faults can be injected in this neuron: one at each of the two multipliers, one at the summing element and one at the activation function.

Figure 3-1: A simple neuron with two inputs

Figure 3-2: Two input neuron with possible fault sites

Figure 3-3: Perceptron Neural Network to identify greater of two inputs

We now explain how faults are injected into a perceptron neural network by considering the simple perceptron neural network illustrated in Figure 3-3 [67]. This neural network is functionally similar to a comparator: it identifies the greater of the two inputs. Observe that the neural network has one hidden layer and one output layer with two neurons in each layer. This constitutes the basic architecture of this network. These architectural details, namely the number of layers and the number of neurons in each layer, along with the type of neural network, are given as inputs to the fault injector. From these inputs, the fault injector generates a fault list by random selection. For example, let us say that it has randomly selected the hidden layer and neuron H1. Subsequently, in H1 it has randomly selected the faulty component to be the multiplier of w1, and so on. The elements in the fault list generated by the fault injector can be represented in the following format: [[layer][component][component number][fault bit][fault value]]. The fault bit ranges from 0 to 63, assuming a 64-bit floating-point representation. The fault value is either 0 or 1, implying that the bit is stuck-at-0 or stuck-at-1 respectively. The fault injector injects the fault at the corresponding site in the network, as specified in the fault list. The resulting faulty network is illustrated in Figure 3-4.
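The random selection described above can be sketched in Python as follows. This is an illustrative sketch rather than the fault injector used in this work: the function name `generate_fault_list`, the component labels and the way component numbers are assigned within a layer are all assumptions made for the example.

```python
import random

# Hypothetical labels for the basic blocks whose outputs can be faulty.
COMPONENTS = ("multiplier", "adder", "activation")


def generate_fault_list(layer_sizes, inputs_per_neuron, fault_count,
                        word_bits=64, seed=None):
    """Return `fault_count` entries of the form
    [layer, component, component number, fault bit, fault value]."""
    rng = random.Random(seed)
    faults = []
    for _ in range(fault_count):
        layer = rng.randrange(len(layer_sizes))
        component = rng.choice(COMPONENTS)
        neuron = rng.randrange(layer_sizes[layer])
        if component == "multiplier":
            # Assumption: multipliers are numbered consecutively within a layer.
            fan_in = inputs_per_neuron[layer]
            number = neuron * fan_in + rng.randrange(fan_in)
        else:
            number = neuron                      # one adder/activation per neuron
        fault_bit = rng.randrange(word_bits)     # 0..63 for 64-bit floating point
        fault_value = rng.randrange(2)           # 0: stuck-at-0, 1: stuck-at-1
        faults.append([layer, component, number, fault_bit, fault_value])
    return faults


# Network of Figure 3-3: a hidden and an output layer, two 2-input neurons each.
for fault in generate_fault_list([2, 2], [2, 2], fault_count=3, seed=1):
    print(fault)
```

Passing a fault count of 0 returns an empty list, which corresponds to the fault-free network.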

Figure 3-4: Fault injected in the Network

Observe that the fault is injected at w1 in the network (the fault is represented by the 'x' mark in Figure 3-4). Let us now examine the effect of the injected fault on the trained neural network considered in this example. Consider a stuck-at-1 fault injected at bit 0, i.e., the MSB (Most Significant Bit), of the multiplier output. The weight matrix of the hidden layer is:

[w1,w2,w3,w4]= [0.1497,0.1996,0.2497,0.2995]

The inputs are: [I1,I2] = [0.05,0.1]

The multiplier output for fault-free system is: 0.007485

The system outputs are:[O1,O2]= [0.7417,0.7749]

The multiplier output for the faulty system is : -0.007485

The system outputs are: [O1,O2]= [0.7412,0.7744]

Observe that the system outputs for the same inputs are different from the expected values of the fault-free system. This is due to the fault injected into the system. Figure 3-5 shows a Simulink model designed to inject faults and study their effects on the perceptron model shown in Figure 3-3. The b2f block in Figure 3-5 converts 16-bit binary numbers to floating-point numbers. The multiply and neuron blocks in Figure 3-5 perform the multiplication and activation functions and present the floating-point results in 16-bit binary representation.
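The change of sign in the multiplier output of the example above can be reproduced in software. The sketch below assumes that the 64-bit floating-point format is IEEE 754 double precision, in which bit 0 (the MSB) is the sign bit; the function name `stuck_at_float64` is hypothetical.

```python
import struct


def stuck_at_float64(x: float, bit: int, value: int) -> float:
    """Force one bit of a 64-bit float to 0 or 1 (bit 0 = MSB, bit 63 = LSB)."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))   # float -> 64-bit integer
    mask = 1 << (63 - bit)
    raw = raw | mask if value == 1 else raw & ~mask
    (faulty,) = struct.unpack(">d", struct.pack(">Q", raw))
    return faulty


w1, i1 = 0.1497, 0.05                  # values from the example above
product = w1 * i1                      # fault-free multiplier output
faulty_product = stuck_at_float64(product, bit=0, value=1)

print(f"fault-free: {product:.6f}")
print(f"faulty:     {faulty_product:.6f}")
```

A stuck-at-0 fault at the same bit would leave this particular product unchanged, because the sign bit of a positive number is already 0. Faults in the exponent bits, by contrast, can change the magnitude by many orders.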

Figure 3-5: Results of simulating the faulty system in Simulink

3.2 Experimental Setup for Feed Forward type neural network

In this section, we shall discuss the effects of fault injection in a feed forward type neural network. We also present the details of analysis that is done in this work.

3.2.1 MNIST Benchmark

First, we discuss the details of the neural network used in the experimentation in this work. The custom neural network created performs handwritten digit recognition. The network is trained and tested using the MNIST (Modified National Institute of Standards and Technology database) benchmark [14]. MNIST is a large database of images of handwritten digits that is used as a standard benchmark for training and testing networks in the field of machine learning. It consists of 60,000 training images and 10,000 testing images. We have used the same numbers of training and testing images to train and test our custom neural network in this work [14].

3.2.2 The Feed Forward Neural Network

We call the software model of the neural network designed for this experiment a custom neural network, because faults can be injected dynamically into the network, which is not possible in a regular feed-forward neural network. The custom neural network has two components: the fault injector and the neural network with re-training architecture. This is illustrated in Figure 3-6. As described in Section 3.1, the fault injector returns a fault list based on the network architecture and the type of neural network. The following paragraphs describe the structure of the neural network in detail.

Figure 3-6: Custom neural network architecture

Figure 3-7: Feed forward neural network used for hand written digit recognition [9]

Figure 3-8: Feed forward neural network with re-Training

Figure 3-7 illustrates the architecture of a feed-forward neural network similar to the one we have used in this work. The network has three layers: an input layer, a hidden layer and an output layer. The hidden layer consists of 30 neurons and the output layer has 10 neurons. The input layer has 784 neurons, each containing an encoded value of an input pixel. The training data is a collection of 28x28 pixel images of handwritten digits, and hence we need 784 input neurons to hold the encoded pixel values. The encoding is grey scale, with '0' representing white and '1' representing black. The 10 neurons in the output layer correspond to the digits 0-9. So if the digit is recognized as a 0, the corresponding output would be [1 0 0 0 0 0 0 0 0 0]: the first neuron is 1 and the remaining neurons are 0. Apart from this, the feed-forward neural network also has provision for incorporating the faults given by the fault injector and for the re-training architecture. A block-level diagram representing the final feed-forward neural network is shown in Figure 3-8. The impact of the faults is analyzed and the improvement in recognition due to re-training is studied using the overall system discussed in Section 3.2.3.

3.2.3 Overall System for Feed Forward Re-training Experimentation

Figure 3-9: Flow chart describing overall system

Figure 3-9 shows the flow model of the overall system design used in this experimentation. A detailed discussion of the model is given in the following paragraphs. We initially designed a fault-free custom neural network, which is obtained by passing a fault count of 0 to the fault injector. This fault-free network is trained using the well-known backpropagation algorithm [26, 5, 27, 28]. Upon training the fault-free network, we determine the weights and biases required to identify the test images correctly. We use the testing results returned by the fault-free custom neural network as the reference metrics for evaluation.

As mentioned earlier, upon specifying the size of the fault list, the fault injector returns the corresponding fault list. The fault list contains the location of each fault to be incorporated in the network. The custom neural network fault analysis tool then reads the fault list and inserts the faults at the specified positions dynamically during inference (the forward pass). To begin testing the effect of stuck-at faults in the network, we set the fault count to '1' in the fault injector. This system is tested against the MNIST test suite, and the new results are compared against the reference values. We will define and describe the metrics used for qualification in the upcoming sections. Depending on the quality of recognition (QoR), there are two cases to consider:

Case I : No degradation of QoR - If the quality of recognition of the faulty network has not degraded, the network is not re-trained. This implies that the fault did not have a significant effect on the network and is acceptable. Thus we can save computational time and power. In such a case, the network is able to mask the fault and still recognize a significant number of images correctly.

Case II : QoR has significantly degraded - If the single stuck-at fault introduced into the custom neural network causes the results to fall below the assumed standard, the previously determined weights and biases are not sufficient to mask the fault, and the neural network needs to be re-trained. In this work, we have used the backpropagation algorithm to determine new weights and biases that can mask the effects of the fault injected into the network. After the re-training process, the resultant network is tested against the same MNIST test suite and the new results are analyzed.

The experiment is carried out over a number of iterations in which we test the effect of faults at different locations. In each iteration, the previous fault is removed and a new fault is injected at a location chosen randomly by the fault injector. After 25 iterations of testing a single stuck-at fault, we analyze the impact of multiple stuck-at faults on the system, following the same procedure. In cases where the faults drastically deteriorate the ability of the network to recognize the test images, the network is re-trained and new weights and biases are computed. It is important to note that in this work, we analyze multiple stuck-at faults for fault counts from 1 to 25, in increments of 1.
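The flow described in this section can be summarized as a loop over fault counts. The following sketch is only an outline under stated assumptions: `inject`, `evaluate` and `retrain` are hypothetical callables standing in for the fault injector, the MNIST test run and backpropagation re-training, and the `tolerance` that decides when QoR counts as degraded is a placeholder value. The toy stand-ins exist only to make the sketch runnable.

```python
def run_fault_sweep(inject, evaluate, retrain, max_faults=25, tolerance=0.025):
    """Test faulty networks for fault counts 1..max_faults, re-training on degradation."""
    reference = evaluate(inject(0))             # fault-free reference accuracy
    results = []
    for fault_count in range(1, max_faults + 1):
        network = inject(fault_count)           # new random faults each iteration
        accuracy = evaluate(network)
        retrained = False
        if reference - accuracy > tolerance:    # Case II: QoR has degraded
            network = retrain(network)
            accuracy = evaluate(network)
            retrained = True
        results.append((fault_count, accuracy, retrained))
    return reference, results


# Toy stand-ins: accuracy drops with every fault and partly recovers after re-training.
def toy_inject(fault_count):
    return {"faults": fault_count, "retrained": False}


def toy_evaluate(network):
    loss_per_fault = 0.002 if network["retrained"] else 0.01
    return 0.95 - loss_per_fault * network["faults"]


def toy_retrain(network):
    return {**network, "retrained": True}


reference, results = run_fault_sweep(toy_inject, toy_evaluate, toy_retrain)
for fault_count, accuracy, retrained in results[:4]:
    print(fault_count, f"{accuracy:.3f}", "re-trained" if retrained else "masked")
```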

3.3 System Design For Convolution Neural Network

In this thesis we have also analyzed the impact of faults on the Convolution Neural Network (CNN). Convolution neural networks, as mentioned in Chapter 2, are widely used in image recognition. These neural networks are efficient from a hardware perspective, as they have fewer weights and biases than their multi-layer perceptron counterparts. We believe this makes the CNN an interesting model for studying the impact of faults and the effectiveness of the re-training strategy.

Figure 3-10: Convolution neural network

Figure 3-10 shows the architecture of the convolution neural network we have designed for this experiment. The network mainly consists of three layers. Layers 1 and 2 are convolution networks and Layer 3 is a fully connected or dense layer. The details and individual components of each layer are described in the following paragraphs.

Figure 3-11: Convolution example - CNN

Layer-1 consists of a convolution layer, an activation layer and a pooling layer. The convolution layer performs convolution operations and bias addition. The input to Layer-1 is first passed to the convolution layer. The convolution layer in Layer-1 has 12 filters, each of size 5x5. As mentioned in Section 2.1.2, these filters are the weights of the convolution layer and are adjusted during the training phase. We perform convolution between the inputs and each filter to obtain the corresponding feature map. Each filter in Layer-1 moves in strides of 1. Figure 3-11 shows an example of the convolution operation between an input of size 5x5 and a filter of size 3x3. As shown, the 3x3 filter moves in strides of 1 over the image. We first compute the dot product of all the overlapping values, as shown by the red square. The dot product results in the value 6, highlighted by the red square of the feature map in Figure 3-11. The filter then moves by one stride to the right, as shown by the green box, and we compute the dot product over the new overlapped region. The resulting value 7 is stored in the next position in row 1 of the feature map, as indicated by the green box in Figure 3-11. After two strides we finish computing the first row of the feature map. The second row is computed in a similar manner: the filter is moved down by one stride, as shown by the blue square, and the same procedure is followed. Performing convolution between an input of size 5x5 and a 3x3 filter, we get a feature map of size 3x3.
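The sliding-window procedure described above can be sketched in plain Python. This is only an illustration: the function name `conv2d_valid` and the input and filter values are hypothetical and are not the values shown in Figure 3-11.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` without padding and return the feature map."""
    in_h, in_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (in_h - k_h) // stride + 1
    out_w = (in_w - k_w) // stride + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            top, left = r * stride, c * stride
            # Dot product of the filter with the region it currently overlaps.
            acc = 0
            for i in range(k_h):
                for j in range(k_w):
                    acc += image[top + i][left + j] * kernel[i][j]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Hypothetical 5x5 input and 3x3 filter (illustrative values only).
image = [
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
feature_map = conv2d_valid(image, kernel)  # 3 rows of 3 values
```

With a stride of 1, the 3x3 filter fits in three horizontal and three vertical positions of the 5x5 input, which is why the feature map is 3x3.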

As mentioned above, the size of the inputs to Layer-1 in our experiment is 28x28 and each filter has size 5x5. It is also important to note that in this experiment the image is zero padded with a padding size of 2 around the borders. The feature map size can be calculated using the formula described in Section 2.1.2. The size of the feature maps in Layer-1 is 28x28 and there are 12 such feature maps. The results of the convolution layer are passed to the activation layer.
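As a quick check of these sizes, the sketch below uses the standard output-size relation (input - filter + 2 * padding) / stride + 1, which we assume is the formula referred to in Section 2.1.2. The helper name `conv_output_size` is hypothetical, and the Layer-2 padding of 2 is an assumption inferred from the 14x14 output size reported for that layer.

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Spatial size of a convolution or pooling output along one dimension."""
    return (input_size - filter_size + 2 * padding) // stride + 1


# Layer-1 convolution: 28x28 input, 5x5 filter, zero padding of 2, stride 1.
layer1_conv = conv_output_size(28, 5, padding=2, stride=1)
# Layer-1 max pooling: 2x2 window, stride 2, no padding.
layer1_pool = conv_output_size(layer1_conv, 2, padding=0, stride=2)
# Layer-2 convolution: 5x5 filter, stride 1, padding of 2 (assumed).
layer2_conv = conv_output_size(layer1_pool, 5, padding=2, stride=1)

print(layer1_conv, layer1_pool, layer2_conv)
```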

Layer-1 uses the ReLU activation function. The Layer-1 activation has the same output size as the Layer-1 convolution. The results of the activation layer are passed to the pooling layer. We use max pooling as the pooling layer for Layer-1. Figure 3-12 shows an example output of a max pooling layer with input size 4x4 and pool size 2x2 moving in strides of 2. In this experiment we use a max pooling size of 2x2 with a stride of 2. Max pooling is performed on each feature map of Layer-1. The pooling layer reduces each feature map from its original size of 28x28 to 14x14. Thus the output of Layer-1 pooling is 12 feature maps, each of size 14x14, as shown in Figure 3-13. It is important to note that the activation layer is not shown in Figure 3-13, but it is present before the max pooling operation of each feature map.
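A minimal sketch of 2x2 max pooling with a stride of 2 is given below. The function name `max_pool2d` and the 4x4 input values are hypothetical and do not reproduce Figure 3-12.

```python
def max_pool2d(feature_map, pool=2, stride=2):
    """Keep the maximum of each pool x pool window of a 2D feature map."""
    out_h = (len(feature_map) - pool) // stride + 1
    out_w = (len(feature_map[0]) - pool) // stride + 1
    return [
        [
            max(
                feature_map[r * stride + i][c * stride + j]
                for i in range(pool)
                for j in range(pool)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical 4x4 input (illustrative values only).
example = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool2d(example)

# A 28x28 map, as produced by the Layer-1 activation, shrinks to 14x14.
large = [[0.0] * 28 for _ in range(28)]
pooled_large = max_pool2d(large)
```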

Figure 3-12: Max pooling example - CNN

Figure 3-13: Layer-1 convolution and maxpooling - CNN

Layer-2 consists of a convolution layer, an activation layer and a flatten layer. The inputs to Layer-2 are the results of the pooling layer of Layer-1. Thus the input has the shape (12,14,14), where the depth of 12 is the number of feature maps of Layer-1 and the height and width of each feature map are both 14. Layer-2 has 16 filters of size (12,5,5), where the depth is 12 and the height and width are both 5. The filters move in strides of 1. Layer-2 performs the convolution operation between its inputs and filters. The convolution operations are similar to those in Figure 3-11, the difference being that the input and filter in this case are 3-dimensional. It is also important to remember that the convolution layer also performs bias addition, and that the input to Layer-2 is also zero padded before the convolution operations. Thus the results of the convolution layer in Layer-2 are 16 feature maps, each of size 14x14, as shown in Figure 3-14. These results are passed to the ReLU activation layer of Layer-2. The results of the activation layer are fed into a flatten layer, which simply flattens each feature map from 2 dimensions to 1 dimension. Since each feature map of Layer-2 has a height and width of 14, the flatten layer converts it into 196 (14 ∗ 14) outputs. So the total number of outputs of the flatten layer of Layer-2 is 3136 (196 ∗ 16), as shown in Figure 3-15.

Figure 3-14: Layer-2 convolution - CNN

Figure 3-15: Layer-2 flatten - CNN

Layer-3, as mentioned, is a fully connected layer which takes the outputs of Layer-2 as its inputs. The fully connected layer has 10 outputs, each denoting a class from 0 to 9, and uses a softmax activation function. Each input is connected to all 10 output neurons, as shown in Figure 3-16. Hence Layer-3 has 31,360 (3136 ∗ 10) weights and 10 biases.

Figure 3-16: Layer-3 dense - CNN
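The parameter counts implied by the three layers can be tallied as follows. This is a back-of-the-envelope sketch: only the Layer-3 figures (31,360 weights and 10 biases) are stated in the text. The Layer-1 and Layer-2 counts assume a single-channel (grayscale) MNIST input and one bias per filter, neither of which is stated explicitly above.

```python
IN_DEPTH = 1  # assumption: MNIST images have a single channel

# Layer-1: 12 filters of size 5x5 over IN_DEPTH input channels.
layer1_weights = 12 * IN_DEPTH * 5 * 5
layer1_biases = 12  # assumption: one bias per filter

# Layer-2: 16 filters of size (12, 5, 5).
layer2_weights = 16 * 12 * 5 * 5
layer2_biases = 16  # assumption: one bias per filter

# Flatten: 16 feature maps of 14x14 values each.
flatten_outputs = 16 * 14 * 14

# Layer-3: every flattened output feeds each of the 10 output neurons.
layer3_weights = flatten_outputs * 10
layer3_biases = 10

total_parameters = (
    layer1_weights + layer1_biases
    + layer2_weights + layer2_biases
    + layer3_weights + layer3_biases
)
```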

3.3.0.1 Fault Injection System For CNN

We now describe the fault injection system designed for the convolution neural network described in the previous section. As mentioned in Section 3.1, in this thesis we inject faults in the outputs of the basic blocks of a network. Recall also that the fault injector takes the network architecture and the number of faults to be injected as inputs.

Layer-1 convolution and activation outputs have 12 feature maps, and each map has a height and width of 28, as mentioned above. Hence each map can be considered a square matrix having 28x28 cells. Each cell in this matrix consists of convolution, addition, and activation blocks. A fault can be injected into one of these 12 feature maps and, within the selected map, it can be placed at the outputs of its building blocks. The Layer-1 outputs form a 3D matrix of size (12,28,28), which is fed into the pooling layer of Layer-1. The pooling layer reduces the Layer-1 outputs to size (12,14,14). The fault injector injects faults in the output of the pooling layer as well. The fault injector first chooses the feature map, then one of the 14x14 cells of the selected map, and finally injects a random fault into one of its blocks.

Layer-2 convolution and activation outputs have 16 feature maps, each of size 14x14. Each feature map can be viewed as a matrix having 14x14 cells. Each cell has the same components as those mentioned for Layer-1. The fault injector can choose one of the sixteen feature maps, and then one of the 14x14 cells within the chosen map, to inject a fault. The results of the Layer-2 activation are sent to the flatten layer. The injector can place a fault in one of the 3136 outputs of the flatten layer.

Layer-3, as mentioned earlier, is a fully connected layer. The basic blocks of this layer are the same as in our feed forward network: the multiplier, the adder and the activation block. The fault injector can choose the outputs of one of these blocks to place a fault. The outputs of Layer-3 are the final outputs of the network. The outputs in all the above layers are stored in a 64-bit floating point representation. The fault injector randomly chooses a location from the above-mentioned options and then places a random stuck-at fault in one of the 64 bits. The fault injector finally outputs a list of locations where faults have to be placed, based on the number of faults it is given as input. Figure 3-17 shows an example of the list generated when the number of faults is 1. It is important to note that the feature maps in Figure 3-17 combine the convolution and activation layer outputs.

Figure 3-17: CNN fault injector output - single stuck-at fault
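To illustrate what a stuck-at fault in one of the 64 bits does to a stored value, the sketch below forces a single bit of a 64-bit float to 0 or 1 using Python's `struct` module. The helper name `stuck_at_float64` and the bit numbering (bit 0 is the least significant bit, bit 63 the sign bit) are assumptions made for this example and are not necessarily the conventions of the fault injector used in this work.

```python
import struct


def stuck_at_float64(value, bit, stuck_value):
    """Return `value` with bit `bit` of its 64-bit encoding forced to `stuck_value`."""
    # Reinterpret the float as an unsigned 64-bit integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    mask = 1 << bit
    bits = (bits | mask) if stuck_value else (bits & ~mask)
    # Reinterpret the modified bit pattern as a float again.
    (faulty,) = struct.unpack("<d", struct.pack("<Q", bits))
    return faulty


sign_fault = stuck_at_float64(0.5, 63, 1)      # sign bit stuck at 1
exponent_fault = stuck_at_float64(0.5, 62, 1)  # top exponent bit stuck at 1
lsb_fault = stuck_at_float64(0.5, 0, 1)        # lowest fraction bit stuck at 1
```

A fault in a high-order bit changes the sign or the magnitude by many orders, while a fault in the lowest bit changes the value by a negligible amount.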

3.3.0.2 Re-training of CNN

The model is similar to that designed for the feed forward re-training analysis described in Section 3.2.3. The network is initially trained without any faults. The number of faults is increased in each iteration. The faults are injected in the positions specified by the fault list, and inference is then run on the test images. The number of images recognized by the faulty network is evaluated and compared to the number of images identified by the reference, that is, the initial fault-free network.

The threshold value for the number of images in this experiment is set to 8000. If the number of images identified is greater than the threshold value, no re-training is required and the experiment moves forward to the next iteration. A new random single fault is generated for the iteration and the process continues.

In the case where the number of images is lower than the threshold value, the network needs to be re-trained. The re-training algorithm is triggered automatically when this condition is met. The weights are initialized again and the network learns new weights to mask the effect of the injected fault. Once the training is completed, we run our test data again on the network with the new weights and biases to measure the improvement obtained by re-training. The weights and biases are restored after this step, before the next iteration. The network is tested 25 times for each number of faults. The results of our experiments are presented in Chapter 4.
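The control flow of this experiment can be summarized by the following sketch. It is not the code used in this work: `make_fault_list`, `evaluate` and `retrain` are hypothetical callables standing in for the fault injector, the inference run on the test images, and the back propagation re-training respectively. The text does not say what happens when the count equals the threshold exactly; the sketch re-trains only when the count is strictly lower.

```python
import random

THRESHOLD = 8000           # images, as stated in the text
ITERATIONS_PER_COUNT = 25  # tests per number of faults


def run_campaign(baseline_weights, fault_counts, make_fault_list,
                 evaluate, retrain, seed=0):
    """Return (n_faults, iteration, recognized_before, recognized_after) tuples."""
    rng = random.Random(seed)
    records = []
    for n_faults in fault_counts:
        for iteration in range(ITERATIONS_PER_COUNT):
            # A fresh fault list replaces the faults of the previous iteration.
            faults = make_fault_list(n_faults, rng)
            before = evaluate(baseline_weights, faults)
            after = before
            if before < THRESHOLD:
                # Learn new weights with the faults in place, then test again.
                retrained = retrain(baseline_weights, faults)
                after = evaluate(retrained, faults)
            # baseline_weights is never overwritten, so the next iteration
            # starts again from the fault-free weights.
            records.append((n_faults, iteration, before, after))
    return records
```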

3.4 Quantized Neural Network

Up until now, we have described the system designed with a floating point data path. This section describes the system designed for a quantized data path.

In the current era of big data, it is predicted that the annual global IP traffic would reach 2.3 zettabytes, or 193 exabytes a month [68]. It is also predicted that traffic from smartphones and mobile devices would increase and account for 30 percent of the total IP traffic, with an explosive growth rate of 58 percent [68]. Machine-to-machine modules would also have a growth rate of 44 percent by 2020 [68]. Mobile devices have become an integral part of our daily lives and have redefined the way people interact with each other. Machine learning is commonly used in mobile applications such as image recognition, health monitoring and language translation [11]. In addition, smartphones and mobile devices collect huge amounts of data on user behaviors and preferences, which has increased the demand for mobile applications with machine learning [11]. With the advent of deep learning and its breakthroughs, it has become a trend to incorporate deep learning in mobile applications [11]. However, doing so poses many challenges with regard to the hardware resource requirements of deep learning networks. Nevertheless, machine learning is expected to become a core component of smartphones and mobile devices.

The two phases of neural network operation are training (learning) and inference (prediction). Training is the development phase, in which the weights are determined given the specifics of the architecture: the number of layers and the type of neural network. Inference, on the other hand, refers to production. Training is conventionally done in floating point, typically on a Graphics Processing Unit (GPU) [7]. Quantization, as defined by Jouppi and Young in [7], is a step that transforms floating point numbers into narrow integers (often 8 bits), which are usually good enough for inference. The main advantage of 8-bit integer values is that multiplication requires 6X less power and area than with IEEE 754 16-bit floating point values [7, 69]. For addition, the advantage is 13X in power and 38X in area [7, 69]. These features make the quantized data path an attractive model for machine learning. In the next subsection we explain the quantization scheme used in this work. We then describe the fault injector model and the re-training methodology used to nullify the effects of faults in a quantized network.

3.4.1 Quantization And De-quantization Scheme

In this work we use the quantization scheme developed by TensorFlow. TensorFlow is an open source software library used for high performance computation [10]. It was developed by Google and is widely used by researchers for building neural networks. TensorFlow supports post-training quantization, a scheme that reduces the model size and improves latency at the cost of a small degradation in accuracy. Post-training quantization converts the weights from a floating point representation to an 8-bit integer representation. TensorFlow also provides hybrid functions that convert activations to 8-bit integers. In this thesis we quantize all operations to 8-bit integers and use these for computation. The quantization function used in our experiments takes the float tensor to be quantized together with its minimum and maximum values. The float values are then converted to 8-bit unsigned integer values. Note that the float values are stored in 32-bit format. The quantization (mapping) function from 32-bit float to 8-bit unsigned integer is shown in Figure 3-18.

Figure 3-18: Quantization function [10]

The terms of Figure 3-18 are listed below:

in[i]: It is the ith term of the floating input tensor.

out[i]: It is the output 8 bit unsigned integer of the ith floating input.

range(T): It is the range represented by unsigned 8 bit integer (0-255). Therefore it is 255.0

min_range: It is the minimum value of the input tensor.

max_range: It is the maximum value of the input tensor.

The following example illustrates the outcome of quantization. Consider an input tensor of length three, [2.1, -1.5, 0.95]. The minimum and maximum values of this tensor are -1.5 and 2.1 respectively. Hence min_range is equal to -1.5 and max_range is equal to 2.1. The 8-bit unsigned integer output we obtain after passing the tensor through the quantization function is [255, 0, 174]. Note that the maximum value of the input tensor is mapped to the maximum value of the quantized range, i.e., 2.1 is mapped to 255. Similarly, the minimum value of the input tensor is mapped to the minimum value of the quantized range, i.e., -1.5 is mapped to 0. The function also returns the maximum and minimum float values that the quantized output represents. This information is used for de-quantization, which is the inverse of quantization. In de-quantization we take the 8-bit quantized input, which in our case is an 8-bit unsigned integer, and convert it back to float.
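The worked example above can be reproduced with the short sketch below. It assumes the mapping out[i] = (in[i] - min_range) * range(T) / (max_range - min_range), which is consistent with the terms listed for Figure 3-18, followed by rounding to the nearest integer. The rounding rule and the function name `quantize_uint8` are assumptions of this example and this is not TensorFlow's implementation.

```python
import math


def quantize_uint8(values, min_range=None, max_range=None):
    """Map floats linearly onto 0..255; return (quantized, min_range, max_range)."""
    if min_range is None:
        min_range = min(values)
    if max_range is None:
        max_range = max(values)
    span = max_range - min_range
    if span <= 0:
        raise ValueError("max_range must be greater than min_range")
    range_t = 255.0  # range(T) for an unsigned 8-bit integer
    quantized = []
    for v in values:
        scaled = (v - min_range) * range_t / span
        q = int(math.floor(scaled + 0.5))     # round to nearest (assumed)
        quantized.append(max(0, min(255, q)))  # clamp to the uint8 range
    return quantized, min_range, max_range


q, lo, hi = quantize_uint8([2.1, -1.5, 0.95])
print(q, lo, hi)
```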

The following example illustrates the de-quantization process. Let us take the output of the quantization example described above. The input to the de-quantization function is [255, 0, 174], together with the maximum and minimum float values represented by this quantized input. The function performs inverse quantization as shown in Figure 3-19 below. The output of the de-quantization function is the floating point tensor [2.1, -1.5, 0.96]. It is important to notice that this output is slightly different from the input to our quantization function: the last value has changed from 0.95 to 0.96. When we perform operations on these inexact values, the error propagates and accuracy is lost. That said, the accuracy loss is minimal, which makes the quantized representation an efficient replacement for floating point computation.

Figure 3-19: De-quantization function [10]

The terms of Figure 3-19 are listed below:

in[i]: It is the ith term of the quantized 8 bit unsigned integer input tensor.

out[i]: It is the ith term of float output.

range(T): It is the range represented by unsigned 8 bit integer (0-255). Therefore it is 255.0

min_range: It is the minimum float value represented by the input tensor.

max_range: It is the maximum float value represented by the input tensor.
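The inverse mapping can be sketched in the same way. The example assumes out[i] = in[i] * (max_range - min_range) / range(T) + min_range, which is consistent with the terms listed for Figure 3-19. The function name `dequantize_uint8` is hypothetical.

```python
def dequantize_uint8(quantized, min_range, max_range):
    """Map 8-bit unsigned integers back to floats in [min_range, max_range]."""
    range_t = 255.0  # range(T) for an unsigned 8-bit integer
    span = max_range - min_range
    return [q * span / range_t + min_range for q in quantized]


restored = dequantize_uint8([255, 0, 174], -1.5, 2.1)
# Rounded to two decimals for comparison with the example in the text.
rounded = [round(v, 2) for v in restored]
print(rounded)
```

The last element comes back as roughly 0.956 rather than 0.95, which is the quantization error discussed above.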

3.4.2 Quantized Feed Forward Network (Q-FNN)

In this section, we describe the architecture of the quantized neural network used in this work. As described in the previous section, we use the TensorFlow quantized equivalent of each layer. Previously, the effects of quantization were approximated, or rather simulated, during inference, so the operations were still performed in floating point. In our work we use a novel approach in which we use actual quantized layers and integer-only computations. We have designed two types of quantized feed forward neural networks, containing two and five layers respectively. We use the same MNIST benchmark that we used for our floating point data path experiments to train and test our quantized neural networks. As mentioned previously, the benchmark has 60,000 training images and 10,000 testing images. Each image is stored in a matrix of size 28x28, which is flattened before being fed to the quantized neural network.

Figure 3-20: Quantized two layered feed forward network architecture

We first describe our two layered feed forward neural network and then extend it to five layers. Figure 3-20 above shows the block diagram of the quantized two layer neural network. The hidden layer consists of 30 neurons and the output layer consists of 10 neurons. We store the trained weights and biases in float, since neither storage space nor simulation speed is a concern in this work. The stored floating point values are represented in 32 bits. The floating point weights and inputs are first quantized into 8-bit unsigned integers and passed to the inputs of Layer 1. Layer 1 consists of the following integer operation blocks: quantized matrix multiplication, quantized bias addition and quantized activation. The quantized matrix multiplication block (Q Matmul1) takes the quantized weights and quantized inputs and performs matrix multiplication. The quantized bias addition block (Q b1) adds the quantized bias to the results of Q Matmul1. These results are then fed into the quantized activation block (Q act1) to get the final results of Layer 1. We use the quantized ReLU activation function in this work.

Layer 2 consists of the following integer operation blocks: quantized matrix multiplication and quantized bias addition. These blocks are followed by the ReLU activation function. The quantized matrix multiplication block (Q Matmul2) of Layer 2 takes the output of Layer 1 and multiplies it with the quantized weights of Layer 2. The results are passed on to the quantized bias addition block (Q b2), where they are added to the quantized bias of Layer 2. The final operation block is the ReLU activation block, which takes the de-quantized values of Q b2 and gives the results of the network. We de-quantize the results just before the output for testing and training purposes. This is the path that the input data follows during inference.

Figure 3-21: Components of quantized two layered feed forward network

The components of each layer are shown in Figure 3-21. It is important to note that the final output of each layer is an 8-bit unsigned integer, except for the output layer, where it is floating point. Also note that the TensorFlow quantized matrix multiplication and bias addition operations take two unsigned 8-bit inputs and perform the corresponding operations. The results of these functions are 32-bit signed integers. Hence we use the re-quantize function provided by TensorFlow to convert these results back to the 8-bit unsigned integer representation. Thus we maintain an 8-bit unsigned integer flow throughout the model.
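The need for a wider accumulator and a subsequent re-quantization step can be illustrated with the simplified sketch below. It is not TensorFlow's re-quantize operation: it ignores the float ranges that accompany each quantized tensor and simply rescales the observed accumulator range onto 0..255. The function names `int_matvec` and `requantize_to_uint8` are hypothetical.

```python
def int_matvec(weights, inputs):
    """Integer matrix-vector product; sums of uint8 products exceed 8 bits."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def requantize_to_uint8(acc_values):
    """Linearly rescale wide integer accumulators onto 0..255 (simplified)."""
    lo, hi = min(acc_values), max(acc_values)
    span = hi - lo
    if span == 0:
        return [0 for _ in acc_values], lo, hi
    # Integer-only arithmetic with rounding to the nearest level.
    out = [((v - lo) * 255 + span // 2) // span for v in acc_values]
    return out, lo, hi


# Hypothetical uint8 weights and inputs.
acc = int_matvec([[255, 3], [1, 2], [0, 0]], [255, 10])
requantized, acc_min, acc_max = requantize_to_uint8(acc)
```

Here the largest accumulator value (65055) does not fit in 8 bits, and after rescaling every output is again an 8-bit unsigned integer.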

Figure 3-22: Quantized five layered feed forward network architecture

The five layered quantized neural network is an extension of the two layered network. It consists of four hidden layers and one output layer. The hidden layers have 30 neurons each and the output layer has 10 neurons. Figure 3-22 and Figure 3-23 show how the layers are connected and the components of each layer respectively. We use quantized operations to build the components of each layer. Thus each layer has quantized matrix multiplication, quantized bias addition and quantized activation blocks. The output is de-quantized only in the final layer. As in the two layer quantized network above, all other outputs are 8-bit unsigned integers. Figure 3-23 shows the inference path of the network. The inputs and weights are all quantized to 8-bit integers using the quantization scheme described in Section 3.4.1. This quantization occurs before the inputs are passed to the corresponding computation block within a layer.

Figure 3-23: Components of quantized five layered feed forward network

3.4.3 Fault Injection in Q-FNN

In this section, we describe the fault injector designed to inject faults in the networks described above. The fault injector in this work also takes the network architecture as input and generates a list of the locations of the faults to be placed in the network. The same fault list is used during the inference and training phases. The difference between the fault injector for the floating point data path and the one for the quantized data path is the number of bits in which faults can be injected. In the floating point data path we use a 64-bit representation of our floating point values, whereas in our quantized network we use 8-bit unsigned integers. Hence we can inject a fault in one of these 8 bits.

Figure 3-24: Fault injector output for two layer QFNN

Two Layer Fault Injector: In our two layered fault injector the architecture details are as follows: total number of layers: 2, number of neurons in hidden layer: 30, number of neurons in output layer: 10. The injector generates a list whose size is equal to the number of faults given as input. The injector can choose from the outputs of the multiplier, adder and activation blocks of the hidden layer and the multiplier and adder blocks of the output layer. The result of each block in the hidden layer is of size [1,30], and the injector chooses one of these 30 outputs for the fault. The result of each block in the output layer is of size [1,10], and the injector can inject a fault in any one of the 10 outputs. Once the layer, block and output are selected, the injector chooses one of the 8 bits in which to inject the fault. Figure 3-24 shows the fault injector and its output for a single fault in our two layer quantized neural network. From the fault list one can see that a stuck-at-1 fault is injected in the most significant bit of the 8-bit addition output of Layer 1.

Five Layer Fault Injector: The fault injector for the five layer quantized feed forward network is an extension of the two layer fault injector. As mentioned, the network consists of four hidden layers and one output layer. A fault can be injected in the outputs of one of the three blocks of each hidden layer. Each hidden layer block has 30 outputs. The final output layer has 10 outputs. The outputs are represented in 8 bits and the random stuck-at fault can be placed in one of these bits.
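A minimal sketch of such a fault list generator and of applying a stuck-at fault to an 8-bit value is given below. The dictionary layout, the function names and the uniform random choice at each step are assumptions made for illustration and do not reproduce the exact list format of Figure 3-24. The sketch also assumes that the output layer of the five layer network exposes the same multiplier and adder blocks as in the two layer case.

```python
import random

HIDDEN_BLOCKS = ("multiplier", "adder", "activation")
OUTPUT_BLOCKS = ("multiplier", "adder")


def generate_fault_list(layer_sizes, n_faults, rng):
    """layer_sizes lists the neurons per layer; the last entry is the output layer."""
    faults = []
    last = len(layer_sizes) - 1
    for _ in range(n_faults):
        layer = rng.randrange(len(layer_sizes))
        blocks = OUTPUT_BLOCKS if layer == last else HIDDEN_BLOCKS
        faults.append({
            "layer": layer + 1,                          # 1-based layer number
            "block": rng.choice(blocks),
            "output": rng.randrange(layer_sizes[layer]),  # which neuron output
            "bit": rng.randrange(8),                     # 0 = LSB, 7 = MSB
            "stuck_at": rng.randrange(2),                # stuck-at-0 or stuck-at-1
        })
    return faults


def apply_stuck_at_uint8(value, bit, stuck_at):
    """Force one bit of an 8-bit unsigned value to 0 or 1."""
    mask = 1 << bit
    return (value | mask) if stuck_at else (value & ~mask & 0xFF)


two_layer_faults = generate_fault_list([30, 10], 1, random.Random(0))
five_layer_faults = generate_fault_list([30, 30, 30, 30, 10], 5, random.Random(0))
# A stuck-at-1 fault in the most significant bit of the value 5.
msb_fault = apply_stuck_at_uint8(0x05, 7, 1)
```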

3.4.4 Re-training Experiment for Quantized Feed Forward Network

In this work, we have analyzed the impact of faults and the improvement obtained by re-training for both the two layer and the five layer quantized networks. The system designed to perform the experiment is similar to that used for the floating point data path. However, the re-training procedure is slightly different, because during training we need to preserve the precision of the delta values being back propagated. The following paragraphs discuss the re-training strategy used in this experiment.

As mentioned above, during training we need higher precision to store and back propagate the delta values. We use the back propagation algorithm to re-train our quantized network. The algorithm performs two main tasks during its training phase: it first forward propagates the inputs to the output (forward pass) and then back propagates the error in the final output (backward pass). During the forward pass the network is in inference mode and all values are integers. During the backward pass, however, the outputs of the multiplication, activation and bias addition blocks are de-quantized and the delta values are calculated from these de-quantized values. This ensures precision while calculating the error values. When the whole training data set, i.e., 60,000 images, has completed the forward pass and the backward pass, we say that one epoch has been completed. We train the network for 10 such epochs. It is important to note that the de-quantizer and the back propagation algorithm are only activated if the network needs re-training. Figure 3-25 and Figure 3-26 illustrate the procedure described above for the two layered and the five layered quantized neural networks respectively.

Figure 3-25: Re-training architecture for two layered QFNN

Figure 3-26: Re-training architecture for five Layered QFNN

As in the floating point data path experiments, we start by training the network with no faults, using the training strategy discussed above. We then increase the number of faults in each iteration. Each fault count is tested 25 times, and the fault injector returns a new location for each test. After injecting faults into our network, we run inference using the MNIST test data and check the quality of recognition (QoR) with faults. If the number of images recognized is higher than the threshold value (8000 images), we need not activate our re-training algorithm. However, if the number of images recognized is lower, we re-train our network using our re-training model. After re-training we run inference again on the quantized network with the updated weight and bias values, and check the number of images recognized after re-training.

Chapter 4

Experimentation Results

Following the methodology explained in the previous chapter, we present the results in the upcoming sections. This chapter is divided into three main sections: metrics for qualification of results, results for the floating point data-path model, and results for the quantized data-path model. We used Python to design the neural networks and test them against the MNIST benchmark. For the quantized data-path model, we used the TensorFlow framework [10].

4.1 Metrics For Qualification

It is important to remember that the custom neural networks designed in this experiment functionally perform image recognition. Hence we define two new metrics: Quality of Recognition (QoR) and Quality of Confidence (QoC).

Quality of Recognition (QoR) can be defined as the number of images recognized out of the total number of test images.

$$\text{Quality of Recognition (QoR)} = \frac{\text{Number of images recognized}}{\text{Total number of test images}}$$

Since we used the MNIST benchmark, the total number of test images is 10,000 [14]. In this work the QoR is measured against 10,000 throughout. So the equation for QoR can be re-written as:

$$\text{Quality of Recognition (QoR)} = \frac{\text{Number of images recognized}}{10000}$$

To define the next metric, we first define the level of confidence. The level of confidence is the probability with which the image is correctly recognized. For instance, say the image to be recognized is a ’0’. In an ideal scenario, the network should recognize the ’0’ correctly with a probability of 1. This means that the network has identified the image with 100 percent confidence. In this case the probability 1 indicates the level of confidence.

Let us take a look at another example to see how the system output would be. Consider a stream of 5 test images given to the system to be recognized. To make this simpler, consider all the images to be hand-written digit ’0’. Let us assume that the network identifies all images correctly. Below is the set of values for level of confidence with which the network identifies the given test images.

Image 1 = [0.99]

Image 2 = [0.85]

Image 3 = [0.60]

Image 4 = [0.72]

Image 5 = [0.55]

Assume that the reference level of confidence is 0.7. As we can see, images 1, 2 and 4 have a confidence level above 0.7. Hence we can say that 3 out of the 5 correctly recognized images are identified confidently. This indicates the Quality of Confidence (QoC). It is important to note that in our experiments we have assumed a reference level of 0.9.

From this, we can define the Quality of Confidence (QoC) as the number of images recognized confidently out of the total number of correctly recognized images.

$$\text{Quality of Confidence (QoC)} = \frac{\text{Number of images recognized confidently}}{\text{Total number of test images recognized}}$$

It is important to note that we use both QoC and QoR as criteria for re-training in Section 4.2.1, and QoR alone in the remaining experiments. The threshold for experiments with QoR as the re-training criterion is set to 0.8. This means that for the network to be termed fault tolerant it should correctly recognize 8000 of the 10,000 test images; otherwise it is re-trained. The threshold for experiments with QoC is also set to 8000 images. Thus if the number of images identified confidently is less than 8000, the network is re-trained.
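The two metrics and the re-training criterion can be expressed in a few lines. The function names below are hypothetical, and an image is counted as confidently recognized when its probability is not less than the reference level. The first call reproduces the five-image example with a reference level of 0.7.

```python
def quality_of_recognition(n_recognized, n_test_images=10_000):
    """QoR: fraction of the test images that are recognized correctly."""
    return n_recognized / n_test_images


def quality_of_confidence(confidences, reference_level=0.9):
    """QoC: fraction of correctly recognized images identified confidently.

    `confidences` holds the probability assigned to each correctly
    recognized image.
    """
    confident = sum(1 for p in confidences if p >= reference_level)
    return confident / len(confidences)


def needs_retraining(image_count, threshold=8000):
    """Re-train when fewer than `threshold` images meet the criterion."""
    return image_count < threshold


# The five-image example from the text, with a reference level of 0.7.
example = [0.99, 0.85, 0.60, 0.72, 0.55]
qoc_example = quality_of_confidence(example, reference_level=0.7)  # 3 of 5
qor_example = quality_of_recognition(8000)
```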

4.2 Results for Floating Point Data-path

This section is divided into two subsections. Section 4.2.1 describes the results for the two layered feed forward neural network described in Section 3.2.2. Section 4.2.2 describes the results obtained for the convolution neural network described in Section 3.3.

4.2.1 Results for Feed Forward Neural Network

We first trained the two layered feed forward neural network without any faults. After the initial training we achieved an accuracy of 95.36%. The network successfully recognized 9536 of the 10,000 test images, giving a QoR of 0.9536. Among the correctly recognized images, 8972 were identified with a probability not less than 0.9. Thus the QoC of the network is 0.94 (i.e., 8972/9536). We consider these values as the ideal case, and evaluate our system performance against them when we introduce faults. As mentioned in Section 3.2.3, we increase the number of faults injected after every 25 iterations. In this experiment, the network is tested with the number of injected faults varying from 1 to 50.

Figure 4-1: Maximum accuracy of floating FFN

Figure 4-1 shows the trend of the maximum number of images recognized against the number of faults injected. The number of faults injected into the system is on the X-axis and the maximum number of images identified with the fault(s) is on the Y-axis. The plot shows the maximum number of images identified among the twenty-five iterations for a particular number of injected faults. It shows that neural networks indeed have some inherent fault tolerance, although they are not completely fault resilient. Figure 4-2 shows the number of faults versus the minimum number of images identified. The X-axis represents the number of faults injected and the Y-axis shows the minimum number of images identified for the corresponding number of faults. We can observe that even for a small number of faults, the number of images identified is highly impacted. Figure 4-3 shows the average number of images recognized for a particular number of faults injected into the network. It is clear that the average number of images recognized decreases as we inject more faults into the network.

Figure 4-2: Minimum accuracy of floating FFN

Figure 4-3: Average accuracy of floating FFN

Figure 4-4: Accuracy plot with one stuck-at fault

Figure 4-4 shows the system performance when a single stuck-at fault is injected over 50 iterations, with the fault placed at a new random location in each iteration. From the scatter plot in Figure 4-4, we observe that even a single fault can impact the recognition capability of the network. It is also important to note that the location of the fault plays a crucial role in determining the QoR and QoC of the network. From the experimental results, we observe that single stuck-at faults injected in the most significant bits have a higher impact on QoR and QoC than single stuck-at faults injected in the least significant bits. In the 64-bit IEEE-754 floating point format, the most significant bits comprise the sign bit and the exponent, so a fault in these bits can change the sign or the order of magnitude of the stored value. We also observed that faults injected in the output neurons have a drastic impact on the recognition capability of the network. Thus faults injected in these locations critically impact the QoR and QoC of the network.

Figure 4-5 shows the average number of images recognized confidently by the two layered feed forward network for a given number of injected faults. The average number of images recognized confidently also decreases as more faults are injected into the network. Figure 4-6 shows the minimum number of images recognized confidently for a given number of faults. The X-axis is the number of faults injected, and the Y-axis is the minimum number of images recognized confidently for that number of injected faults. Comparing Figure 4-3 and Figure 4-5, we can see that faults impact the average value of QoC slightly more than the average value of QoR. The plots for the impact of faults on the minimum values of QoC and QoR are very similar; even so, we can observe that faults affect the minimum value of QoC slightly more than the minimum value of QoR.
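The sensitivity to bit position noted above for single stuck-at faults can be illustrated by forcing individual bits of a 64-bit IEEE-754 value. The sketch below is illustrative only: the helper `stuck_at` and the weight value 0.75 are assumptions and are not taken from the fault injector described in Chapter 3.

```python
import struct


def stuck_at(value, bit, stuck_value):
    """Force one bit of an IEEE-754 double to 0 or 1 (bit 63 = sign, 0 = LSB)."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    mask = 1 << bit
    as_int = (as_int | mask) if stuck_value else (as_int & ~mask)
    (faulty,) = struct.unpack("<d", struct.pack("<Q", as_int))
    return faulty


weight = 0.75  # hypothetical weight value
sign_fault = stuck_at(weight, 63, 1)      # sign bit stuck at 1
exponent_fault = stuck_at(weight, 62, 1)  # top exponent bit stuck at 1
lsb_fault = stuck_at(weight, 0, 1)        # lowest fraction bit stuck at 1
```

For this weight, the sign-bit fault flips the value to -0.75, the fault on the top exponent bit inflates it beyond 1e300, and the fault on the lowest fraction bit changes it by less than 1e-15.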

Figure 4-5: Average confidence plot for floating FFN

Figure 4-6: Minimum confidence plot for floating FFN

The above figures emphasize the need for a strategy to enhance the fault tolerance of the network. In this work, we use re-training by backpropagation as a strategy to minimize the impact of faults. In the following paragraphs we discuss the results of re-training. In this experiment we considered both QoR and QoC as criteria for re-training the network. We first consider the results when QoC is the re-training criterion. We set the QoC threshold to 0.84; that is, roughly 8000 of the 9536 correctly recognized images should be identified with a probability greater than 0.9.
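The re-training decision can be sketched as follows. The metrics are formally defined in Section 4.1; here we assume, consistent with the threshold above (8000/9536 is approximately 0.84), that QoR is the fraction of test images recognized correctly and QoC is the fraction of correctly recognized images whose winning probability exceeds 0.9. The function names and the four-image example are hypothetical.

```python
def qor_qoc(probabilities, labels, confidence=0.9):
    """Return (QoR, QoC) for a batch of class-probability vectors.

    QoR: fraction of images whose highest-probability class is the label.
    QoC (assumed definition): fraction of correctly recognized images whose
    winning probability exceeds `confidence`.
    """
    correct = confident = 0
    for probs, label in zip(probabilities, labels):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == label:
            correct += 1
            if probs[predicted] > confidence:
                confident += 1
    qor = correct / len(labels)
    qoc = confident / correct if correct else 0.0
    return qor, qoc


def needs_retraining(metric, threshold):
    """Re-train only when the chosen metric falls below its threshold."""
    return metric < threshold


# Hypothetical outputs for four images and three classes.
outputs = [[0.95, 0.03, 0.02], [0.5, 0.3, 0.2], [0.1, 0.85, 0.05], [0.2, 0.2, 0.6]]
labels = [0, 0, 1, 1]
qor, qoc = qor_qoc(outputs, labels)
retrain = needs_retraining(qoc, threshold=0.84)
```

In this toy batch three of four images are recognized, but only one of them confidently, so the QoC criterion triggers re-training even though a QoR criterion with a lower threshold might not.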

Figure 4-7: Accuracy improvement after re-training

Figure 4-7 describes the improvement in QoR achieved by re-training. As in the figures above, the X-axis is the number of faults and the Y-axis is the number of images identified correctly by the faulty network. Figure 4-7 has two line graphs: the first (blue) represents the number of images recognized by the faulty network before re-training, and the second (orange) represents the number of images recognized by the faulty network after re-training. We achieved a maximum improvement of 58% and an average improvement of 6% by re-training. We now turn to the improvement in QoC achieved by re-training. Figure 4-8 shows the confidence level before and after re-training. The X-axis denotes the number of faults and the Y-axis the number of images identified confidently. As shown in Figure 4-8, re-training the faulty network improves QoC. We achieved an average improvement of 18% in QoC by re-training. This is also evident from a comparison of Figure 4-7 and Figure 4-8.

Figure 4-8: Confidence improvement after re-training

Figure 4-9: Number of times re-trained for each fault

Figure 4-7 and Figure 4-8 show that we can improve the results by re-training our faulty network. Figure 4-9 shows the number of times the network was re-trained for a particular number of faults. The X-axis is the number of faults injected and the Y-axis is the number of times the re-training procedure was invoked. The best case is zero re-trainings and the worst case is 25 re-trainings, that is, one in every iteration. In an ideal scenario, as the number of faults increases, the fault tolerance decreases, so we expect the number of times the network is re-trained to increase linearly with the number of faults. We can observe that this trend is nearly linear in Figure 4-9. Note that, unlike in the ideal scenario, the faults in this experiment are injected at random locations.
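The re-training count and its near-linear trend can be checked with a short sketch. This is illustrative only: `retrain_counts`, `least_squares_slope` and the metric values are hypothetical, and four iterations per fault count are used instead of 25.

```python
def retrain_counts(metric_runs, threshold):
    """Count, per fault count, the iterations whose metric fell below threshold."""
    return {n: sum(m < threshold for m in runs) for n, runs in metric_runs.items()}


def least_squares_slope(points):
    """Slope of the least-squares line through (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den


# Hypothetical QoC values: fault count -> one value per iteration.
runs = {
    1: [0.95, 0.90, 0.70, 0.93],
    2: [0.91, 0.60, 0.75, 0.92],
    3: [0.80, 0.50, 0.65, 0.88],
}
counts = retrain_counts(runs, threshold=0.84)
slope = least_squares_slope(sorted(counts.items()))
```

A positive slope indicates that re-training is invoked more often as faults are added; comparing slopes between experiments gives a simple way to compare how quickly the need for re-training grows.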

That said, there are cases where re-training is not sufficient to minimize the effects of faults. Figure 4-10 and Figure 4-11 show cases where re-training fails to improve the QoR and QoC metrics. We investigated and found that these cases occur due to the presence of critical faults. As mentioned earlier, critical faults are faults that drastically impact the functionality of the network. They occur when faults are placed in the most significant bits of the 64-bit floating point representation or in the higher order bits of the output layer. Thus, in the presence of critical faults, we are not able to achieve significant improvement by re-training. This can also be attributed to the fact that we randomly initialize the weights when re-training. A good choice of initialization should therefore also be considered, which is outside the scope of this work.

Figure 4-10: QoR plot - re-training with critical faults

Figure 4-11: QoC plot - re-training with critical faults

The results with QoR as the re-training criterion are very similar to the results above. The only difference is in the number of times re-training is required for a given number of faults. As mentioned in Section 4.1, the faulty network is re-trained only if the QoR achieved is less than our arbitrarily chosen threshold of 0.8. Figure 4-12 shows the number of times the re-training procedure was invoked when re-training with the QoR criterion. The X-axis is the number of faults injected in the network and the Y-axis is the number of times the network was re-trained. We observe a similar near linear trend in the plot.

Figure 4-12: Number of times re-trained - QoR metrics

4.2.2 Results for Convolution Neural Network

In the previous section we saw the results of fault analysis on the floating-point MLP. In this section we analyze the results for the three layer floating-point CNN described in Section 3.3.

The experiments are carried out in two phases. The first phase analyzes the fault tolerance of the CNN. The second phase analyzes the impact of re-training the faulty CNN. Note that we used the QoR metric for the re-training experiments.

We first trained the fault-free CNN using the basic backpropagation algorithm. After training we achieved an accuracy of 91.68%, which means the network was able to recognize 9168 of the 10,000 MNIST test images. Thus, for the inference experiment, the ideal QoR is 0.9168. We inject faults with the fault count varying from 1 to 50 and compare the results to this ideal QoR. Figure 4-13 shows the best case, that is, the maximum number of images recognized for each fault count. The X-axis is the number of faults injected in the network and the Y-axis is the number of images recognized. The results show that the CNN also has some inherent fault tolerance. Even for higher fault counts, the network is able to tolerate the impact of the faults in the best case. This is similar to the results for the floating-point MLP.

Figure 4-13: Maximum accuracy floating CNN

Figure 4-14 and Figure 4-15 show that this is not always the case. Figure 4-14 plots the worst case, that is, the minimum number of images recognized for each fault count. Figure 4-15 plots the average number of images recognized for each fault count. The X-axis of these plots is the number of faults injected and the Y-axis is the number of images recognized. Figure 4-14 suggests that injecting even a few faults into the system impacts the recognition capability. Just as in the floating-point MLP, fault location plays a vital role in deciding the fault tolerance of the network. In the presence of critical faults, even a single fault can affect the recognition capability drastically. Critical faults are faults in the output neurons or faults in the higher order bits of a critical node. Figure 4-15 suggests a smooth degradation in the average recognition capability of the network as the number of faults injected increases. This is similar to the smooth degradation in quality of recognition observed for the floating-point MLP in Figure 4-3.

Figure 4-14: Minimum accuracy floating CNN

Figure 4-15: Average accuracy floating CNN

The results in Figure 4-14 and Figure 4-15 show the need for a re-training strategy to improve the fault tolerance of the network. Recall that we used the QoR threshold for re-training the faulty network. Figure 4-16 shows the best case scenario for the re-training experiment. The blue bar represents the number of images that the faulty network was able to recognize before re-training and the orange line represents the number of images that the faulty network is able to recognize after re-training. We achieve an average improvement of 18% by re-training with the basic backpropagation algorithm. Figure 4-17 shows the worst case results for the re-training experiment. The blue bar represents the number of images recognized before re-training and the orange line represents the number of images recognized after re-training. We observe that, in the presence of critical faults, re-training the faulty network does not improve the network's recognition capability. This can also be attributed to the random initialization of weights and biases during re-training, just as in the floating-point MLP network.

Figure 4-16: QoR improvement by re-training CNN

Figure 4-17: Worst case QoR after re-training CNN

Figure 4-18 plots the number of faults injected against the number of times the network was re-trained for each fault count. The X-axis denotes the number of faults injected and the Y-axis denotes the number of times the network was re-trained. The plot is similar to the re-training plot for the floating-point MLP shown in Figure 4-12. It is nearly linear, implying that the number of times the network was re-trained increases as the number of faults injected increases.

Figure 4-18: Number of times re-trained CNN

4.3 Results for Quantized Data Path

In Section 4.2 we saw the impact of faults on the floating point data path. In this section we investigate the impact of faults on the quantized data path. Recall that we used an 8-bit quantized data path, so the final outputs of the multiplier, adder and activation are all 8-bit unsigned integers. As previously mentioned, we used the TensorFlow framework to design the quantized network. In Section 4.3.1 we describe the results for the quantized feed forward neural network.
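To make the 8-bit unsigned representation concrete, the sketch below maps real values to 8-bit codes and back. It assumes a simple affine (scale and offset) scheme over a known range; the exact quantization scheme used in this work is the one described in Section 3.4, and the function names and value range here are hypothetical. Plain Python is used instead of TensorFlow to keep the example self-contained.

```python
def quantize(values, lo, hi):
    """Map reals in [lo, hi] to unsigned 8-bit codes (affine scheme, assumed)."""
    scale = (hi - lo) / 255.0
    codes = [min(255, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale


def dequantize(codes, lo, scale):
    """Recover approximate real values from 8-bit codes."""
    return [lo + c * scale for c in codes]


values = [-1.0, 0.0, 0.5, 1.0]  # hypothetical weights in [-1, 1]
codes, scale = quantize(values, lo=-1.0, hi=1.0)
restored = dequantize(codes, lo=-1.0, scale=scale)
```

Every code lies in the range 0 to 255, and each restored value differs from the original by at most half a quantization step. This rounding error is the loss of information that quantization introduces even before any fault is injected.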

4.3.1 Results for Quantized Feed Forward Neural Network

We designed two quantized feed forward neural networks for this experimentation. The details of these two networks are given in Section 3.4.2. Section 4.3.1.1 presents the analysis and results for the two layered quantized feed forward neural network. Section 4.3.1.2 presents the analysis and results for the five layered feed forward neural network. Note that in this experiment we considered only QoR, both as the metric to assess the performance of the system and as the criterion for re-training.

4.3.1.1 Results for Two Layer Quantized Feed Forward Neural Network

This experimentation had two phases. The first phase identifies the impact of faults on the two layered neural network. The second phase studies whether we can minimize the effects of faults by using the re-training strategy mentioned in Section 3.4.4. To understand the impact of faults on our two layered quantized feed forward neural network, we initially used pre-trained weights and biases in our quantized network. The pre-trained model had an accuracy of 96.62%, which implies that the network correctly recognizes 9662 of the 10,000 test images in the MNIST benchmark. Thus, in the ideal case, the QoR is 0.9662. We perform inference using this model while varying the number of faults injected into the network, with 25 iterations for each fault count. We compare the QoR achieved, or the number of images recognized correctly by the faulty network, to our ideal QoR of 0.9662, or the ideal number of images recognized, which is 9662.

Figure 4-19: Maximum QoR for two layer QFFN

In Figure 4-19 we analyze the inherent fault tolerance of the two layered quantized neural network. The plot shows the maximum number of images recognized for a given number of faults. The X-axis is the number of faults and the Y-axis is the number of images recognized. The plot shows that the quantized network does have some inherent resilience when the number of faults is low. However, as the number of faults increases, the network becomes more error-prone.

If we compare Figure 4-1 and Figure 4-19, we observe that in Figure 4-1 the maximum number of images recognized by the faulty floating point neural network stayed close to its ideal value even for a large number of faults. In Figure 4-19, however, we can see a large deviation from the ideal value as the number of faults increases. We now look at the minimum QoR, or the minimum number of images recognized, for a particular number of faults. Figure 4-20 plots the number of faults against the minimum QoR, or minimum number of images recognized, by the faulty two layered quantized neural network.

Figure 4-20: Minimum QoR for two layer QFFN

Figure 4-20 shows that the minimum QoR of the network is below the ideal value even for a small number of faults. If we compare Figure 4-20 and Figure 4-2, we can see that the floating point data path indeed has better fault tolerance than the quantized data path. When a small number of faults is injected into the network with the floating point data path, the minimum QoR, or minimum number of images recognized, remains closer to the ideal value. However, that is not the case for the two layered quantized neural network. Thus, from the above two plots, we can say that the quantized neural network is more susceptible to faults than its floating point counterpart.

Figure 4-21 plots the number of faults injected against the average number of images recognized correctly. The trend shows that as the number of faults increases, the average number of images identified decreases more steeply than in the average plot for floating point shown in Figure 4-3. A comparison of Figure 4-21 and Figure 4-3 also shows that the average number of images recognized correctly is higher for the floating point data path.

Figure 4-21: Average QoR for two layer QFFN

Figure 4-22: QoR for single stuck at fault

Figure 4-22 shows the scatter plot of the number of images recognized by our two layered quantized feed forward network in the presence of a single stuck-at fault. The plot shows that even a single fault in the quantized network can have a drastic impact on the recognition capability of the network. Comparing Figure 4-4 and Figure 4-22, we observe that a single stuck-at fault has more effect on the quantized neural network. The average drop in recognition for a single stuck-at fault is 0.01% in the floating point data path, whereas it is around 50% in the quantized network. The larger drop in recognition can be attributed to the process of quantization, where the information is more compact. Quantization itself loses data, and adding faults further impedes the network's capability. Each bit in the quantized 8-bit representation holds more information than a single bit in the 64-bit floating point representation. As seen in the floating point data path, faults that are in the MSB or closer to the output layer cause a far more significant impact than other faults.
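The weight carried by each bit of an 8-bit code can be made explicit with a small sketch. It is illustrative only: `stuck_at_uint8`, `worst_case_shift` and the code value 90 are hypothetical and are not taken from the fault injector used in this work.

```python
def stuck_at_uint8(code, bit, stuck_value):
    """Force one bit (0 = LSB, 7 = MSB) of an 8-bit unsigned code to 0 or 1."""
    mask = 1 << bit
    return (code | mask) if stuck_value else (code & ~mask & 0xFF)


def worst_case_shift(code):
    """Largest absolute change, per bit position, over stuck-at-0 and stuck-at-1."""
    return [
        max(abs(stuck_at_uint8(code, bit, s) - code) for s in (0, 1))
        for bit in range(8)
    ]


code = 0b01011010  # hypothetical activation code (90)
shifts = worst_case_shift(code)
fractions_of_range = [s / 255 for s in shifts]
```

A fault on the MSB can move the code by 128, which is about half of the full 0 to 255 range, whereas a fault on the LSB moves it by only 1. With only eight bits per value, a randomly placed fault is much more likely to hit a bit that carries a large share of the value than it is in a 64-bit word.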

Figures 4-19, 4-20 and 4-21 show the need for a strategy to minimize the effects of such faults. In this experiment, we used re-training as a strategy to mitigate the effects of faults. Recall that we use QoR as the metric, with a threshold arbitrarily set to 0.8, to determine whether the network requires re-training. If the QoR of the faulty network is greater than 0.8, re-training is not required; otherwise the network is re-trained. Figure 4-23 shows the improvement in QoR, or number of images recognized, achieved by re-training for fault counts varying from 1 to 10. The X-axis is the number of faults injected and the Y-axis is the number of images recognized. There are two line graphs in the plot: the first (orange) represents the best case improvement achieved in the number of images recognized, and the second (blue) represents the number of images recognized before re-training.

Figure 4-23: QoR improvement after re-training

Note that in the re-training experiment we trained our two layered quantized neural network using the strategy described in Section 3.4.4. We initially trained our fault free quantized network using this novel strategy and achieved 92% accuracy, which is comparable to the accuracy obtained by using pre-trained parameters. This also confirms that we are able to train our network using this technique. The network now has an ideal QoR of 0.92, and the performance of the faulty network is compared to this ideal value. Figure 4-23 shows that we can improve the performance of the network by re-training. We achieved a 30% improvement in QoR on average. As in the case of the floating point data path, the presence of critical faults limits the benefit of re-training. The network is unable to re-train and improve its recognition capability in the presence of critical faults, as shown in Figure 4-24.

Figure 4-24: Minimum QoR per fault after re-training

Figure 4-25: Number of times 2 layer QFFN network re-trained

Figure 4-25 shows the number of times the re-training procedure was invoked for each fault count. We observe a near linear trend, indicating that as the number of faults injected increases, the number of times re-training is necessary also increases. Comparing Figure 4-9 and Figure 4-25, we also see that the slope of the graph in Figure 4-25 is steeper. This supports the previous claim that quantized networks are more prone to errors than their floating point counterparts.

4.3.1.2 Results for Five Layer Quantized Feed Forward Neural Network

In this experiment we analyze the impact of faults on a deeper quantized network. We used the five layer deep feed forward quantized neural network described in Section 3.4.2 for this work. We first performed an inference analysis on the model to see the impact faults have on the network. During this analysis we fed the network with pre-trained weights and biases. These weights and biases are quantized during the inference process. We achieved 96.65% accuracy by using pre-trained weights and biases. Thus our ideal QoR for the inference phase experiment is 0.9665, which implies that the quantized five layer neural network can recognize 9665 out of 10,000 test images.

Figure 4-26: Maximum QoR for a particular fault - five layer QFFN

Figure 4-27: Minimum QoR for a particular fault - five layer QFFN

Figure 4-26 and Figure 4-27 above show the impact faults have on our deep quantized neural network. Figure 4-26 shows the maximum number of images recognized for a particular number of faults injected. Figure 4-27 shows the minimum number of images recognized for a particular number of faults injected. These figures are very similar to Figure 4-19 and Figure 4-20. Thus the impact faults have on a deep quantized neural network is similar to that on a two layered quantized network. Further, as the number of faults injected increases, the average number of images identified by the five layer network is slightly better than the average for the two layer network, as shown in Figure 4-28.

Figure 4-28: Average QoR five layer vs two layer QFFN

We also analyzed the impact that faults in a particular layer have on the recognition capability of the network. In this experiment we injected faults in one layer at a time. We start the experiment by injecting a single stuck-at fault in the five layered quantized feed forward network. We then analyze the QoR attained, or the total number of test images recognized by the faulty network. In the next iteration we inject another single stuck-at fault at a new location in the same layer and perform the same analysis. We increase the number of faults injected after twenty-five such iterations. We performed this experiment by injecting 1 to 25 faults in layer 1. Once we finished all the iterations for layer 1, we moved to the next layer, and so on. Tables 4.1-4.3 show the experimentation results.
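The layer-wise procedure described above can be sketched as a nested loop. This is a minimal illustration under stated assumptions and not the experiment code of this work: `layerwise_sweep` is a hypothetical name, `evaluate` stands in for inference on the faulty network, the layer sizes are arbitrary, and `toy_evaluate` is a placeholder that simply penalizes faults by bit significance.

```python
import random
from statistics import mean


def layerwise_sweep(evaluate, layer_sizes, max_faults=25, iterations=25, seed=0):
    """Inject 1..max_faults stuck-at faults into one layer at a time.

    `evaluate(layer, faults)` is a caller-supplied (hypothetical) function that
    runs inference with the given faults and returns the images recognized.
    Returns {layer: {n_faults: (max, mean, min)}}.
    """
    rng = random.Random(seed)
    table = {}
    for layer, size in enumerate(layer_sizes, start=1):
        table[layer] = {}
        for n_faults in range(1, max_faults + 1):
            counts = []
            for _ in range(iterations):
                # Each fault: (unit index in layer, bit position, stuck value).
                # Locations are drawn afresh in every iteration.
                faults = [
                    (rng.randrange(size), rng.randrange(8), rng.randrange(2))
                    for _ in range(n_faults)
                ]
                counts.append(evaluate(layer, faults))
            table[layer][n_faults] = (max(counts), mean(counts), min(counts))
    return table


def toy_evaluate(layer, faults):
    # Stand-in for real inference: higher bits cost more recognized images.
    return max(0, 9665 - sum(2 ** bit for _, bit, _ in faults))


table = layerwise_sweep(toy_evaluate, [16, 8], max_faults=3, iterations=5)
```

The three values stored for each layer and fault count correspond to one cell in each of the maximum, average and minimum tables.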

Faults injected  Layer 1  Layer 2  Layer 3  Layer 4  Layer 5
1                9634     9631     9634     9633     9625
2                9633     9629     9634     9629     9625
3                9628     9627     9629     9638     9626
4                9625     9622     9632     9631     9625
5                9624     9624     9608     9627     9625
6                9581     9601     9601     9645     9626
7                9603     9616     9621     9583     9627
8                9608     9627     9625     9622     9629
9                9632     9574     9613     9626     9625
10               9418     9606     9635     9592     9631
11               9552     9440     9605     9638     9625
12               9485     9548     9622     9617     9626
13               9353     8549     9218     8042     9625
14               9395     8452     9575     9032     9625
15               5939     6169     9482     2493     9625
16               9511     9033     9414     9580     9625
17               8146     9032     7632     8983     9623
18               9126     8982     9172     9147     9627
19               2720     3203     9286     7946     9584
20               9023     4368     7981     3737     9625
21               6323     2862     5597     5844     9626
22               4301     9310     4220     7108     9625
23               3507     9483     9452     2131     9539
24               8924     5483     2671     6507     7719
25               6188     7628     3328     1943     9612

Table 4.1: Maximum number of images recognized - Faulty layer network

Faults injected  Layer 1  Layer 2  Layer 3  Layer 4  Layer 5
1                8919     8017     9054     8032     8265
2                7437     6327     8409     7941     8170
3                8584     5515     5242     6149     7617
4                5486     4596     4385     5954     6148
5                6593     4898     4501     6623     5372
6                5348     5150     3424     3897     5647
7                5029     3432     4248     4012     4495
8                4624     4468     3260     4349     4525
9                3855     4052     3549     4858     4898
10               3108     2430     2923     2754     3443
11               4334     2608     2570     3526     2925
12               3002     2634     2644     2768     3243
13               2945     1801     2242     2112     3242
14               2662     1708     2104     2291     2028
15               1766     1707     1989     1226     2165
16               2317     2566     2109     2132     2162
17               2372     1579     1830     2179     2123
18               2890     1972     1735     1597     2477
19               1259     1229     1959     1482     1940
20               1552     1325     1659     1151     3309
21               1476     1031     1281     1247     2012
22               1413     1404     1341     1458     1766
23               1089     1780     1551     1063     2184
24               2082     1342     1197     1302     1676
25               1616     1302     1253     992      1985

Table 4.2: Average Number of Images Identified - Faulty Layer network

Faults injected  Layer 1  Layer 2  Layer 3  Layer 4  Layer 5
1                958      980      1677     1010     958
2                956      396      980      20       1032
3                974      892      974      506      974
4                433      292      86       958      892
5                389      974      958      958      910
6                867      290      144      958      892
7                958      164      127      638      958
8                311      554      731      700      129
9                265      739      892      350      299
10               304      78       227      331      558
11               471      803      974      84       70
12               312      282      374      104      138
13               343      585      871      958      892
14               177      24       106      196      8
15               143      726      749      28       927
16               515      250      37       974      892
17               784      883      28       81       802
18               313      415      739      105      974
19               651      825      87       241      420
20               442      750      55       770      25
21               278      300      149      52       353
22               469      495      70       594      11
23               60       363      21       228      815
24               360      531      26       22       270
25               519      93       102      55       34

Table 4.3: Minimum Number of Images Recognized - Faulty layer network

Table 4.1 shows the maximum number of images identified when the specified number of faults is injected in a particular layer. The table shows that each layer does have some inherent resilience to faults. Table 4.2 shows the average number of images recognized correctly when a specified number of faults is injected. It can be observed that as the number of faults injected in each layer increases, the average decreases. It can also be observed that if the faults are placed closer to the output layer, the average number of images recognized reduces. The irregularities can be attributed to the randomness during fault injection. Table 4.3 shows the minimum number of images recognized by the network with faults in a particular layer. We analyzed the results and observed that faults in the multipliers have more impact on the QoR than those in the adders and neurons.

In the second phase of the experiment, we analyzed the improvement achieved by re-training our deep network. For this we initially trained the five layered quantized network without injecting any faults, using the training strategy mentioned in Section 3.4.4. We were able to achieve a QoR of 0.96. We set this as the ideal value of our experimentation and compare the results of re-training against it. We were able to achieve significant improvement by re-training a faulty deep quantized neural network. Figure 4-29 shows the results of re-training the five layered neural network when a maximum of five faults is injected. We achieved an average of 30% improvement in the number of images recognized.

Figure 4-29: QoR before and after re-training

Chapter 5

Conclusion and Future Work

5.1 Conclusion

Artificial intelligence has become part and parcel of our daily lives. It is therefore necessary to gain a deeper understanding of neural computing and explore feasible methods of optimization. The tremendous growth in big data and cloud computing has only emphasized the importance of neural networks. Exploiting their inherent property of massive parallelism, researchers are able to achieve breakthroughs in complex computational tasks such as face recognition, data modeling and prediction. With the swift advances made in AI and machine learning, these capabilities can now be made available on hand held mobile devices. The increasing demand for machine learning and the necessity for hardware solutions have led to the invention of special processors such as the TPU. Various hardware optimization techniques, such as using approximate computing components, are also being researched. It is from this that we derive our motivation to study the fault tolerance of neural network hardware.

This work aims at providing a comprehensive analysis of the fault tolerance of neural networks by experimenting with different architecture models. Neural networks are said to have some inherent fault tolerance due to their architecture, which is based on the human brain. We analyzed the fault tolerance of MLP and CNN networks. We also analyzed the impact of faults on floating point and quantized data path neural hardware architectures. Before presenting a brief summary of our results, it is good to keep in mind that 64-bit IEEE-754 and 8-bit integer format representations were used for floating point and quantized data respectively. We used a two layer feed forward network and a three layer CNN for the floating point models. Further, to improve the fault tolerance capabilities, the effectiveness of re-training the faulty network was explored. We used the basic backpropagation algorithm to train and re-train our networks throughout this work.

Quantization is a fairly new strategy for efficient hardware implementation of neural networks. Quantized networks have far less energy and computational overhead than their conventional floating point counterparts. We designed a two layer and a five layer quantized feed forward neural network in this work using the TensorFlow framework. We also explored a novel technique to train these pure integer quantized networks. The re-training algorithm uses the basic backpropagation paradigm and communicates with the pure integer inference network during its backward pass.

For the purpose of fault analysis we designed a fault injector model in Python. Depending on the specifics of the network topology and fault count, the injector generates a list of fault locations and values. The locations chosen by the injector were random to ensure comprehensive experimentation. The specified faults were injected dynamically into the network during inference mode or during the forward pass of re-training mode.
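A fault list of the kind described above could be generated as in the following sketch. It is a hypothetical illustration and not the injector built for this work: the function name, the dictionary fields and the layer sizes are all assumptions.

```python
import random


def generate_faults(layer_shapes, fault_count, word_bits=8, seed=None):
    """Return a list of random stuck-at faults.

    Each fault names a layer index, a unit index within that layer, a bit
    position, and the stuck value (0 or 1). All field names are illustrative.
    """
    rng = random.Random(seed)
    faults = []
    for _ in range(fault_count):
        layer = rng.randrange(len(layer_shapes))
        faults.append({
            "layer": layer,
            "unit": rng.randrange(layer_shapes[layer]),
            "bit": rng.randrange(word_bits),
            "stuck_at": rng.randrange(2),
        })
    return faults


shapes = [784, 30, 10]  # hypothetical topology
faults = generate_faults(shapes, fault_count=5, word_bits=64, seed=1)
```

Seeding the generator makes a fault list reproducible, which is convenient when the same faults must be applied both during inference and during the forward pass of re-training.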

The networks were designed to recognize images of hand-written digits and were trained using the MNIST benchmark. The experiments carried out on floating point networks suggest that neural networks cannot be considered completely fault tolerant (Figure 4-2, Figure 4-6). Even a single fault could have a significant impact on the recognition capability (Figure 4-4). Depending on its location, a fault may or may not cause fatal damage to the system. Results suggested that the average performance of a floating point feed forward network degraded gradually as the fault count increased (Figure 4-3, Figure 4-5). We also observed that faults had a larger impact on the QoC metric than on the QoR metric (defined in Section 4.1).

The number of times re-training was required increased approximately linearly with the number of faults injected (Figure 4-9). We observed that fault tolerance could indeed be improved by re-training. We achieved a 6% improvement in QoR and an 18% improvement in QoC. However, in the presence of critical faults, it was observed that re-training by the basic backpropagation algorithm failed to improve fault tolerance.

The experiments carried out on two layer quantized neural networks suggested that quantized networks are more susceptible to faults than their floating point counterparts (Figure 4-21). This can be attributed to the fact that quantization is in itself an approximation technique, so further faults induced in it have a significant impact. We also observed that the impact of a single stuck-at fault was more significant (Figure 4-22). Thus fault locations play a vital part in determining the effect of faults in a quantized network. The network performance degrades rapidly even for a small number of faults. We also observed that the five layer quantized network is more fault tolerant than the two layered network. We also observed that faults injected in the output layer generally have more impact than ones injected in other layers, as seen in Table 4.2 and Table 4.3. The anomalies can be attributed to the fact that we have not injected faults in the activations of the output neuron. We achieved 92% and 96% accuracy for the two and five layer quantized networks respectively by using our novel training algorithm. This is comparable to the 96% accuracy we achieved using a pre-trained model. We also achieved a 30% improvement in performance by re-training the faulty quantized network.

5.2 Future Work

So far we have achieved a basic set of results using our simple experimentation methodology. In the future, we intend to extend this study to a quantized CNN model.

Based on the results, we can say that re-training a faulty network has a positive impact on error tolerance. This has been established so far using the basic backpropagation algorithm. We believe that better results could be achieved by using a smarter re-training algorithm. For instance, one limitation of the basic backpropagation algorithm used in this work is that it does not carry forward the outcome of the previous iteration to the next one. As a result, the network trained in the next iteration is unaware of the previous result and cannot compare its own result against it. Passing such parameters on to the next iteration would help ensure that the result of each iteration surpasses the preceding one, thus supporting convergence in successive rounds. Lastly, a hybrid approach based on redundancy and re-training could also be explored to improve fault tolerance across different models and architectures.
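One way to carry the outcome of one re-training round into the next is to keep the best parameters seen so far. The sketch below illustrates this idea only and is not an algorithm evaluated in this work: `retrain_keep_best`, `train_once` and `evaluate` are hypothetical, and the toy training and scoring functions stand in for backpropagation and QoR measurement.

```python
import random


def retrain_keep_best(train_once, evaluate, rounds=5, seed=0):
    """Run several re-training rounds and keep the best parameters seen.

    `train_once(rng, start)` returns new parameters, optionally starting from
    `start`; `evaluate(params)` returns a QoR-like score. Both are
    hypothetical callables supplied by the caller.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(rounds):
        params = train_once(rng, best_params)
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


scores = []


def toy_train(rng, start):
    # Stand-in for one re-training run: perturb the best parameters so far.
    base = 0.0 if start is None else start
    return base + rng.uniform(-0.2, 0.4)


def toy_eval(params):
    # Stand-in for QoR: higher is better, with the optimum at params == 1.0.
    score = -(params - 1.0) ** 2
    scores.append(score)
    return score


best_params, best_score = retrain_keep_best(toy_train, toy_eval)
```

Because each round starts from the best parameters found so far and a worse result is discarded, the returned score can never be lower than that of any earlier round.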

88 References

[1] Bin Ding, Huimin Qian, and Jun Zhou. Activation functions and their characteristics

in deep neural networks. In The 30th Chinese Control and Decision Conference (2018

CCDC), pages 1836–1841. IEEE, 2018.

[2] Hassan Ramchoun, Mohammed Amine Janati Idrissi, Youssef Ghanou, and Mohamed

Ettaouil. : Architecture optimization and training. In International

Journal of Interactive Multimedia and Artificial Intelligence, Vol. 4, N1, pages 26–30.

IJIMAI, 2016.

[3] Felix Altenberger and Claus Lenz. A non-technical survey on deep convolutional neural

network architectures. In arXiv:1803.02129v1, 2018.

[4] Suvro Banerjee. An introduction to recurrent neural networks. "https://medium.com/explore-artificial-intelligence/an-introduction-to-recurrent-neural-networks-72c97bf0912".

[5] Yann LeCun, Léon Bottou, , and Patrick Haffner. Gradient-based learning

applied to document recognition. In Proceedings of IEEE, pages 1–46. IEEE, 1998.

[6] Cesar Torres-Huitzil. Fault and error tolerance in neural networks: A review. In

IEEE-Access, Volume 5, 2017, pages 17322–17341. IEEE, 2017.

[7] Norman P. Jouppi, Cliff Young, and Nishant Patil. In-datacenter performance analysis

of a tensor processing unit. In ISCA 17, June 24-28, 2017, Toronto, ON, Canada, pages

1–12. ACM/IEEE, 2017.

[8] Kaz Sato, Cliff Young, and David Patterson. An in-depth look at first tensor processing unit (tpu). "https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu".

[9] Michael Nielsen. Using neural nets to recognize handwritten digits. "http://neuralnetworksanddeeplearning.com/chap1.html".

[10] Google. Tensorflow. "https://www.tensorflow.org/".

[11] Ji Wang, Bokai Cao, Philip S Yu, Lichao Sun, Weidong Bao, and Xiaomin Zhu. Deep

learning towards mobile applications. In 38th International Conference on Distributed

Computing Systems, 2018, page 1. IEEE, 2018.

[12] Anil Jain and Jianchang Mao. Artificial neural networks: A tutorial. In March, 1996,

pages 31–44. IEEE, 1996.

[13] Praveen Kumar and Pooja Sharma. Artificial neural networks-a study. In International

Journal of Emerging Engineering Research and Technology, Volume 2, Issue 2, May 2014,

pages 143–148. IJEERT, 2014.

[14] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. The of hand-

written digits. "http://yann.lecun.com/exdb/mnist/".

[15] Bert Moons, Koen Goetschalckx, Nick Van Berckelaer, and Marian Verhelst. Minimum

energy quantized neural networks. In 2017 51st Asilomar Conference on Signals, Systems,

and Computers. IEEE, 2017.

[16] Xi Chen, Xiaolin Hu, Hucheng Zhou, and Ningyi Xu. Fxpnet: Training a deep convolu-

tional neural network in fixed-point representation. In 2017 International Joint Confer-

ence on Neural Networks (IJCNN). IEEE, 2017.

[17] Richard L. Welch, Stephen M. Ruffing, , and Ganesh K. Venayagamoorthy. Comparison

of feedforward and feedback neural network architectures for short term wind speed pre-

diction. In Proceedings of International Joint Conference on Neural Networks, Atlanta,

Georgia, USA, June 14-19, 2009, pages 3335–3340. IEEE, 2009.

[18] Rosenblatt. The perceptron: A theory of statistical separability in cognitive systems. In Cornell Aeronautical Laboratory, Report No. VG1196-G-1, January, 1958. Cornell Aeronautical Laboratory, 1958.

[19] Alekseui Grigorevich Ivakhnenko and Valentin Grigorevich Lapa. Cybernetic predicting devices. rapport technique. 1966.

[20] David H Hubel and Torsten N Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. In The Journal of Physiology, 160 (1), pages 106–154, 1962.

[21] Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Biol. Cybernetics 36 (1980), pages 193–202. Springer, 1980.

[22] Lenka Skovajsová. Long short-term memory description and its application in text processing. In 2017 Communication and Information Technologies, pages 1–4. IEEE, 2017.

[23] Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in

nervous activity. In The Bulletin of Mathematical

Biophysics, 5 (4), pages 115–133, 1943.

[24] Farnoush Farhadi. Learning activation functions in deep neural networks.

"https://publications.polymtl.ca/2945/1/2017_FarnoushFarhadi.pdf".

[25] Frank Rosenblatt. The perceptron : A probabilistic model for information storage and

organization in the brain. In Psychological Review, 65 (6), page 386, 1958.

[26] Raúl Rojas. Neural Networks: A Systematic Introduction. Springer, 1996.

[27] Yann LeCun, Léon Bottou, Genevieve B, and Klaus-Robert Müller. Efficient backprop. In

Neural networks : Tricks of the Trade, pages 9–50. Springer, 1998.

[28] Takeshi Kamio, Shinichiro Tanaka, and Mititada Morisue. Backpropagation algorithm for logic oriented neural networks. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, pages 123–128. IEEE, 2000.

[29] Weibo Liu, Zidong Wang, Xiaohui Liu, Nianyin Zeng, Yurong Liu, and Fuad E Alsaadi.

A survey of deep neural network architectures and their applications. In Neurocomputing

Volume 234, pages 11–26. Elsevier, 2017.

[30] G.P. Zhang. Neural networks for classification: a survey. In IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) (Volume: 30, Issue: 4), pages 451–462. IEEE, 2000.

[31] R C Lacher, Pamela K Coats, Shanker C Sharma, and L Franklin Fant. A neural net-

work for classifying the financial health of a firm. In European Journal of Operational

Research, Volume 85, Issue 1, pages 53–65. Elsevier, 1995.

[32] Tianmei Guo, Jiwen Dong, Henjian Li, and Yunxing Gao. Simple convolutional neural

network on image classification. In 2017 IEEE 2nd International Conference on Big Data

Analysis (ICBDA). IEEE, 2017.

[33] Cun Y L, Boser B, and Denker J S. Handwritten digit recognition with a back-propagation

network. In Advances in Neural Information Processing Systems. Morgan Kaufmann

Publishers Inc, 1990.

[34] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with

deep convolutional neural networks. In Communications of the ACM, Volume 60 Issue 6,

June 2017, pages 84–90. ACM, 2017.

[35] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-

scale image recognition. pages 84–90. ICLR, 2015.

[36] Christian Szegedy, Wei Li, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir

Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper

with . In 2015 IEEE Conference on Computer Vision and Pattern Recogni-

tion (CVPR), pages 1–9. IEEE, 2015.

[37] Yao Yuhui, Chen Lihui, A. Goh, and Ankey Wong. Clustering gene data via associative

clustering neural network. In Proceedings of the 9th International Conference on Neural

Information Processing, 2002. ICONIP ’02. IEEE, 2002.

[38] Soumadip Ghosh, Amitava Nag, Debasish Biswas, Jyoti Prakash Singh, Sushanta Biswas,

Debasree Sarkar, and Partha Pratim Sarkar. Weather data mining using artificial neural

network. In 2011 IEEE Recent Advances in Intelligent Computational Systems. IEEE,

2011.

[39] Mohammad Talebi Motlagh and Hamid Khaloozadeh. A new architecture for modeling

and prediction of dynamic systems using neural networks: Application in tehran stock

exchange. In 2016 4th International Conference on Control, Instrumentation, and Au-

tomation (ICCIA). IEEE, 2016.

[40] A. Khotanzad and N. Sadek. Multi-scale high-speed network traffic prediction using

combination of neural networks. In Proceedings of the International Joint Conference on

Neural Networks, 2003. IEEE, 2003.

[41] Laurentiu Mihai Ionescu, Alin Gheorghita Mazare, and Gheorghe Serban. Vlsi implemen-

tation of an associative content addressable memory based on hopfield network model. In

CAS 2010 Proceedings (International Semiconductor Conference). IEEE, 2010.

[42] Dong Wei, Xinghua Pan, and Minglian Zhang. Neural-network-based multiple stage optimal control of variable-air-volume systems. In 2006 6th World Congress on Intelligent Control and Automation. IEEE, 2006.

[43] Hunmo Kim and J.K. Parker. Hidden control neural network identification-based tracking

control of a flexible joint robot. In Proceedings of 1993 International Conference on

Neural Networks (IJCNN-93-Nagoya, Japan). IEEE, 1993.

[44] X.Z. Gao, S.J. Ovaska, and Y. Dote. Motor fault detection using elman neural network

with genetic algorithm-aided training. In SMC 2000 Conference Proceedings. 2000 IEEE International Conference on Systems, Man and Cybernetics. IEEE, 2000.

[45] Long Wen, Xinyu Li, Liang Gao, and Yuyan Zhang. A new convolutional neural network-

based data-driven fault diagnosis method. In IEEE Transactions on Industrial Electronics

( Volume: 65 , Issue: 7 , July 2018 ), pages 5990–5998. IEEE, 2018.

[46] A.A. Al-Jumah and T. Arslan. Artificial neural network based multiple fault diagnosis in

digital circuits. In ISCAS ’98. Proceedings of the 1998 IEEE International Symposium on

Circuits and Systems. IEEE, 1998.

[47] Wolfgang Maass. Noise as a resource for computation and learning in networks of spiking

neurons. In Proceedings of the IEEE ( Volume: 102 , Issue: 5 , May 2014 ), pages 860–

880. IEEE, 2014.

[48] Zidong Du, Avinash Lingamneni, Yunji Chen, Krishna V. Palem, Olivier Temam, and

Chengyong Wu. Leveraging the error resilience of neural networks for designing highly

energy efficient accelerators. In IEEE Transactions on Computer-Aided Design of Inte-

grated Circuits and Systems ( Volume: 34 , Issue: 8 , Aug. 2015 ), pages 1223–1235.

IEEE, 2015.

[49] Georgios Volanis, Angelos Antonopoulos, Alkis A. Hatzopoulos, and Yiorgos Makris.

Toward silicon-based cognitive neuromorphic ICs: a survey. In IEEE Design & Test (Volume: 33, Issue: 3, June 2016), pages 91–102. IEEE, 2016.

[50] Algirdas Avižienis. Framework for a taxonomy of fault-tolerance attributes in computer

systems. In Proceeding ISCA ’83 Proceedings of the 10th annual international symposium

on Computer architecture, pages 16–21. ACM, 1983.

[51] M. Al-Kuwaiti, N. Kyriakopoulos, and S. Hussein. A comparative analysis of network

dependability, fault-tolerance, reliability, security, and survivability. In IEEE Commu-

nications Surveys & Tutorials ( Volume: 11 , Issue: 2 , Second Quarter 2009 ), pages

106–124. IEEE, 2009.

[52] Regina Frei, Richard McWilliam, Benjamin Derrick, Alan Purvis, Asutosh Tiwari,

and Giovanna Di Marzo Serugendo. Self-healing and self-repairing technologies. In Int

J Adv Manuf Technol. Springer, 2013.

[53] P.W. Protzel, D.L. Palumbo, and M.K. Arras. Performance and fault-tolerance of neural

networks for optimization. In IEEE Transactions on Neural Networks ( Volume: 4 , Issue:

4 , Jul 1993), pages 600–614. IEEE, 1993.

[54] D.S. Phatak and I. Koren. Complete and partial fault tolerance of feedforward neural

nets. In IEEE Transactions on Neural Networks ( Volume: 6 , Issue: 2 , Mar 1995 ),

pages 446–456. IEEE, 1995.

[55] M Stanisavljević, A Schmid, and Y Leblebici. Fault-tolerant architectures and approaches. In Reliability of Nanoscale Circuits and Systems. Springer, New York, NY, pages 446–456. Springer, 2011.

[56] V.P. Nelson. Fault-tolerant computing: fundamental concepts. In Computer ( Volume: 23

, Issue: 7 , July 1990 ), pages 19–25. IEEE, 1990.

[57] M.D. Emmerson and R.I. Damper. Determining and improving the fault tolerance of mul-

tilayer perceptrons in a pattern-recognition application. In IEEE Transactions on Neural

Networks ( Volume: 4 , Issue: 5 , Sep 1993), pages 788–793. IEEE, 1993.

[58] C.H. Sequin and R.D. Clay. Fault tolerance in artificial neural networks. In 1990 IJCNN

International Joint Conference on Neural Networks. IEEE, 1990.

[59] Salvatore Cavalieri and Orazio Mirabella. A novel learning algorithm which improves

the partial fault tolerance of multilayer neural networks. In Neural Networks Volume 12,

Issue 1, January 1999, pages 91–106. Elsevier, 1999.

[60] C. Khunasaraphan, K. Vanapipat, and C. Lursinsap. Weight shifting techniques for self-

recovery neural networks. In IEEE Transactions on Neural Networks ( Volume: 5 , Issue:

4 , Jul 1994 ), pages 651–658. IEEE, 1994.

[61] Jiachao Deng, Yuntan Fang, Zidong Du, Ying Wang, Huawei Li, Olivier Temam, Paolo

Ienne, David Novo, Xiaowei Li, Yunji Chen, and Chengyong Wu. Retraining-based tim-

ing error mitigation for hardware neural networks. In 2015 Design, Automation & Test in

Europe Conference & Exhibition (DATE). IEEE, 2015.

[62] Muhammad Naeem, Liam J. McDaid, Jim Harkin, John J. Wade, and John Marsland.

On the role of astroglial syncytia in self-repairing spiking neural networks. In IEEE Transactions on Neural Networks and Learning Systems ( Volume: 26 , Issue: 10 , Oct.

2015 ), pages 2370–2380. IEEE, 2015.

[63] J.A.G. Nijhuis and L. Spaanenburg. Fault tolerance of neural associative memories. In IEE Proceedings E-Computers and Digital Techniques ( Volume: 136 , Issue: 5 , Sep

1989 ), pages 389–394. IEEE, 1989.

[64] Google. Post-training quantization. "https://www.tensorflow.org/lite/performance/post_training_quantization".

[65] Brandon Reagen, Udit Gupta, Lillian Pentecost, Paul Whatmough, Sae Kyu Lee, Niamh

Mulholland, David Brooks, and Gu-Yeon Wei. Ares: A framework for quantifying the

resilience of deep neural networks. In 2018 55th ACM/ESDA/IEEE Design Automation

Conference (DAC), pages 600–614. IEEE, 2018.

[66] Jeff Zhang, Tianyu Gu, Kanad Basu, and Siddharth Garg. Analyzing and mitigating the impact of permanent faults on a systolic array based neural network accelerator. "https://arxiv.org/pdf/1802.04657.pdf".

[67] Matt Mazur. A step by step backpropagation example. "https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/".

[68] Cisco. Cisco visual networking index: Forecast and methodology, 2015-2020. "http://www.davidellis.ca/wp-content/uploads/2016/01/cisco-vni-june-2016-481360.pdf".

[69] William Dally. High-performance hardware for machine learning. "https://media.nips.cc/Conferences/2015/tutorialslides/Dally-NIPS-Tutorial-2015.pdf".
