Fault Tolerance and Re-Training Analysis on Neural Networks

by ABHINAV KURIAN GEORGE
B.Tech Electronics and Communication Engineering, Amrita Vishwa Vidhyapeetham, Kerala, 2012

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science, Computer Engineering, College of Engineering and Applied Science, University of Cincinnati, Ohio, 2019

Thesis Committee:
Chair: Wen-Ben Jone, Ph.D.
Member: Carla Purdy, Ph.D.
Member: Ranganadha Vemuri, Ph.D.

ABSTRACT

In the current age of big data, artificial intelligence and machine learning technologies have gained much popularity. Due to the increasing demand for such applications, neural networks are increasingly being targeted at hardware implementations. As feature sizes shrink, the number of physical defects is on the rise, and this growing number of defects prevents designers from realizing the full potential of on-chip designs. The challenge now is not only to find solutions that balance high performance and energy efficiency, but also to make the computational model fault tolerant. Neural computing, with its inherent fault-tolerant capabilities, can provide promising solutions to this issue.

The primary focus of this thesis is to gain a deeper understanding of fault tolerance in neural network hardware. As part of this work, we present a comprehensive analysis of fault tolerance by exploring the effects of faults on two popular neural models: the multi-layer perceptron and the convolution neural network. We built the models using the conventional 64-bit floating point representation, and we also explore the more recent 8-bit integer quantized representation. A fault injector model is designed to inject stuck-at faults at random locations in the network. The networks are trained with the basic backpropagation algorithm and tested against the standard MNIST benchmark. For training pure quantized networks, we propose a novel backpropagation strategy.
Depending on the performance degradation, the faulty networks are re-trained to recover their accuracy. Results suggest that: (1) neural networks cannot be considered completely fault tolerant; (2) quantized neural networks are more susceptible to faults; (3) using a novel training algorithm for quantized networks, comparable accuracy is achieved; (4) re-training is an effective strategy to improve fault tolerance. In this work, a 30% improvement is achieved in the quantized network, compared to a 6% improvement in floating point networks, using the basic backpropagation algorithm. We believe that more advanced re-training strategies can enhance fault tolerance to a greater extent.

Copyright 2019, Abhinav Kurian George
This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the express permission of the author.

To my loving family and friends

Acknowledgments

I would like to extend my sincere thanks to my advisor, Dr. Wen-Ben Jone. He has always been a source of inspiration and motivated me to keep moving forward. I am extremely grateful to him for continuing to work with me even though I had to move away from the university. Dr. Jone was always available for me, even through his own rough times, and I will always cherish working under him. His guidance and advice were instrumental in shaping the course of this thesis.

I would like to thank my thesis committee members, Dr. Ranganadha Vemuri and Dr. Carla Purdy, for reviewing and providing feedback on this work. I really appreciate their kindness and the effort they have taken to go through this material.

Finally, I would like to thank my family back in India, who have always supported me. They instilled confidence and faith in me when I started doubting my abilities. I would also like to give special thanks to my friends, especially Sangeetha, for spending hours reviewing my work.
Without all your prayers and blessings this work would not have been possible. Thank you all.

Contents

Acknowledgments
Contents
List of Figures
List of Tables
List of Abbreviations

1 Introduction
2 Background
  2.1 Neural Network
    2.1.1 Multi-Layer Perceptrons (MLP)
    2.1.2 Convolution Neural Network
    2.1.3 Recurrent Neural Networks
  2.2 Backpropagation Algorithm
  2.3 Applications of ANNs
    2.3.1 Pattern Classification
    2.3.2 Clustering or unsupervised pattern classification
    2.3.3 Prediction/Forecasting
    2.3.4 Content Addressable Memory
    2.3.5 Control
  2.4 Fault Tolerance
    2.4.1 Basic terms related to fault tolerance
    2.4.2 Fault tolerance in neural networks
    2.4.3 TPU and quantization
3 System Architecture
  3.1 Fault Injection
  3.2 Experimental Setup for Feed Forward type neural network
    3.2.1 MNIST Benchmark
    3.2.2 The Feed Forward Neural Network
    3.2.3 Overall System for Feed Forward Re-training Experimentation
  3.3 System Design For Convolution Neural Network
      3.3.0.1 Fault Injection System For CNN
      3.3.0.2 Re-training of CNN
  3.4 Quantized Neural Network
    3.4.1 Quantization And De-quantization Scheme
    3.4.2 Quantized Feed Forward Network (Q-FNN)
    3.4.3 Fault Injection in Q-FNN
    3.4.4 Re-training Experiment for Quantized Feed Forward Network
4 Experimentation Results
  4.1 Metrics For Qualification
  4.2 Results for Floating Point Data-path
    4.2.1 Results for Feed Forward Neural Network
    4.2.2 Results for Convolution Neural Network
  4.3 Results for Quantized Data Path
    4.3.1 Results for Quantized Feed Forward Neural Network
      4.3.1.1 Results for Two Layer Quantized Feed Forward Neural Network
      4.3.1.2 Results for Five Layer Quantized Feed Forward Neural Network
5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work

List of Figures

2-1 Simple neuron in ANN [1]
2-2 Multi-layer perceptron (MLP) [2]
2-3 Convolution Layer Output with receptive field 3x3 [3]
2-4 Hyperparameters of convolution layer
2-5 Max pooling layer [3]
2-6 Recurrent Neural Network [4]
2-7 Feed forward network - backpropagation
2-8 Architecture of LeNet-5 [5]
2-9 Cause effect relationship between fault, error and failure [6]
2-10 Classification of faults [6]
2-11 Block diagram of TPU [7, 8]
2-12 Systolic array of MXU [7, 8]
2-13 Performance-per-watt comparison [8]
2-14 Quantizing in Tensorflow [8]
3-1 A simple neuron with two inputs
3-2 Two input neuron with possible fault sites
3-3 Perceptron Neural Network to identify greater of two inputs
3-4 Fault injected in the Network
3-5 Results of simulating the faulty system in Simulink
3-6 Custom neural network architecture
3-7 Feed forward neural network used for hand written digit recognition [9]
3-8 Feed forward neural network with re-Training
3-9 Flow chart describing overall system
3-10 Convolution neural network
3-11 Convolution example - CNN
3-12 Max pooling example - CNN
3-13 Layer-1 convolution and maxpooling - CNN
3-14 Layer-2 convolution - CNN
3-15 Layer-2 flatten - CNN
3-16 Layer-3 dense - CNN
3-17 CNN fault injector output - single stuck-at fault
3-18 Quantization function [10]
3-19 De-quantization function [10]
3-20 Quantized two layered feed forward network architecture
3-21 Components of quantized two layered feed forward network
3-22 Quantized five layered feed forward network architecture
3-23 Components of quantized five layered feed forward network
3-24 Fault injector output for two layer QFNN
3-25 Re-training architecture for two layered QFNN
3-26 Re-training architecture for five Layered QFNN
4-1 Maximum accuracy of floating FFN
4-2 Minimum accuracy of floating FFN
4-3 Average accuracy of floating FFN
4-4 Accuracy plot with one stuck-at fault
4-5 Average confidence plot for floating FFN
4-6 Minimum confidence plot for floating FFN
4-7 Accuracy improvement after re-training
4-8 Confidence improvement after re-training
4-9 Number of times re-trained for each fault
4-10 QoR plot - re-training with critical faults
4-11 QoC plot - re-training with critical faults
4-12 Number of times re-trained - QoR metrics
4-13 Maximum accuracy floating CNN
4-14 Minimum accuracy floating CNN
4-15 Average accuracy floating CNN
4-16 QoR improvement by re-training CNN
4-17 Worst case QoR after re-training CNN
4-18 Number of times re-trained CNN
4-19 Maximum QoR for two layer QFFN
4-20 Minimum QoR for two layer QFFN
4-21 Average QoR for two layer QFFN ...
