
UNIVERSITY OF CALIFORNIA SAN DIEGO

Machine Learning in IoT Systems: From Deep Learning to Hyperdimensional Computing

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy

in

Computer Science (Computer Engineering)

by

Mohsen Imani

Committee in charge:

Professor Tajana Simunic Rosing, Chair
Professor Chung-Kuan Cheng
Professor Ryan Kastner
Professor Farinaz Koushanfar
Professor Steven Swanson

2020

Copyright
Mohsen Imani, 2020
All rights reserved.

The dissertation of Mohsen Imani is approved, and it is acceptable in quality and form for publication on microfilm and electronically:

Chair

University of California San Diego

2020

DEDICATION

To my wife, Haleh, and my parents, Fatemeh and Habibollah.

EPIGRAPH

Science without religion is lame; Religion without science is blind.

— Albert Einstein

TABLE OF CONTENTS

Signature Page ...... iii

Dedication ...... iv

Epigraph ...... v

Table of Contents ...... vi

List of Figures ...... ix

List of Tables ...... xi

Acknowledgements ...... xii

Vita ...... xiv

Abstract of the Dissertation ...... xxi

Chapter 1 Introduction ...... 1
1.1 Deep Learning Acceleration ...... 3
1.2 Brain-Inspired Hyperdimensional Computing ...... 4

Chapter 2 Deep Learning Acceleration with Processing In-Memory ...... 6
2.1 Introduction ...... 7
2.2 Related Work ...... 9
2.3 Background ...... 10
2.3.1 DNN Training ...... 11
2.3.2 Digital Processing In-Memory ...... 12
2.4 FloatPIM Overview ...... 14
2.5 CNN Computation in FloatPIM Block ...... 16
2.5.1 Building Blocks of CNN Training and Inference ...... 17
2.5.2 Feed-Forward Acceleration ...... 21
2.5.3 Back-Propagation Acceleration ...... 22
2.6 FloatPIM Architecture ...... 24
2.6.1 Block Size Scalability ...... 25
2.6.2 Inter-layer Communication ...... 26
2.6.3 FloatPIM Parallelism ...... 28
2.7 In-Memory Floating Point Computation ...... 29
2.7.1 FloatPIM Multiplication ...... 30
2.7.2 FloatPIM Addition ...... 30
2.8 Evaluation ...... 32
2.8.1 Experimental Setup ...... 32
2.8.2 Workload ...... 33
2.8.3 FloatPIM & Data Representation ...... 34
2.8.4 FloatPIM Training ...... 35
2.8.5 FloatPIM Testing ...... 37
2.8.6 Impacts of Parallelism ...... 38
2.8.7 Computation/Power Efficiency ...... 39
2.8.8 Endurance Management ...... 42
2.9 Conclusion ...... 42

Chapter 3 Hyperdimensional Computing for Efficient and Robust Learning ...... 44
3.1 Introduction ...... 45
3.2 Hyperdimensional Processing System ...... 46
3.3 Classification in Hyperdimensional Computing ...... 48
3.3.1 Encoding Module ...... 49
3.3.2 HD Model Training ...... 50
3.3.3 Associative Search ...... 51
3.4 Algorithm-Hardware Optimizations of HD Computing ...... 52
3.4.1 QuantHD: Model Quantization in HD Computing ...... 52
3.4.2 SearcHD: Fully Binary Stochastic Training ...... 58
3.5 Hardware Acceleration of HD Computing ...... 63
3.5.1 D-HAM: Digital-based Hyperdimensional Associative Memory ...... 64
3.5.2 R-HAM: Resistive Hyperdimensional Associative Memory ...... 66
3.5.3 A-HAM: Analog-based Hyperdimensional Associative Search ...... 71
3.5.4 Comparison of Different HAMs ...... 76
3.6 Conclusion ...... 79

Chapter 4 Collaborative Learning with Hyperdimensional Computing ...... 81
4.1 Introduction ...... 82
4.2 Motivational Scenario ...... 85
4.3 Secure Learning in HD Space ...... 86
4.3.1 Security Model ...... 86
4.3.2 Proposed Framework ...... 87
4.3.3 Secure Key Generation and Distribution ...... 88
4.4 SecureHD Encoding and Decoding ...... 90
4.4.1 Encoding in HD Space ...... 91
4.4.2 Decoding in HD Space ...... 93
4.5 Collaborative Learning in HD Space ...... 98
4.5.1 Hierarchical Learning Approach ...... 98
4.5.2 HD Model-Based Inference ...... 100
4.6 Evaluation ...... 101
4.6.1 Experimental Setup ...... 101
4.6.2 Encoding and Decoding Performance ...... 102
4.6.3 Evaluation of SecureHD Learning ...... 102
4.6.4 Data Recovery Trade-offs ...... 105
4.7 Conclusion ...... 107

Chapter 5 Summary and Future Work ...... 108
5.1 Thesis Summary ...... 109
5.2 Future Direction ...... 110

Bibliography ...... 111

LIST OF FIGURES

Figure 2.1: DNN computation during (a) feed-forward and (b) back-propagation. . . . 10 Figure 2.2: Digital PIM operations. (a) NOR operation. (b) 1-bit addition...... 13 Figure 2.3: Overview of FloatPIM...... 15 Figure 2.4: Overview of CNN Training...... 17 Figure 2.5: Vector-matrix multiplication...... 18 Figure 2.6: Convolution operation...... 19 Figure 2.7: Back-propagation of FloatPIM...... 23 Figure 2.8: FloatPIM memory architecture...... 25 Figure 2.9: FloatPIM training parallelism in a batch...... 29 Figure 2.10: In-memory implementation of floating point addition...... 31 Figure 2.11: FloatPIM energy saving and speedup using floating point and fixed point representations...... 36 Figure 2.12: FloatPIM efficiency during training...... 36 Figure 2.13: FloatPIM efficiency during the testing...... 38 Figure 2.14: The impact of parallelism on efficiency...... 40 Figure 2.15: (a) FloatPIM area breakdown, (b) efficiency comparisons...... 41

Figure 3.1: (a) Overview of the HD classification consist of encoding and associative memory modules. (b) The encoding module maps a feature vector to a high- dimensional space using pre-generated base hypervectors. (c) Generating the base hypervectors...... 51 Figure 3.2: (a) QuantHD framework overview. (b) Binarizing and ternarizing the trained HD model...... 53 Figure 3.3: Energy consumption and execution time of QuantHD, conventional HD and BNN during training ...... 57 Figure 3.4: Energy consumption and execution time of QuantHD during inference . . . 58 Figure 3.5: Overview of SearcHD encoding and stochastic training...... 59 Figure 3.6: Classification accuracy of SearcHD, kNN, and the baseline HD algorithms. 61 Figure 3.7: Language classification accuracy with wide range of errors in Hamming distance using D = 10,000...... 64 Figure 3.8: Overview of D-HAM...... 65 Figure 3.9: Overview of R-HAM: (a) Resistive CAM array with distance computation; (b) A 4 bits resistive block; (c) Sensing circuitry with non-binary code generation...... 67 Figure 3.10: Match line (ML) discharging time and its relation to detecting Hamming distance for various CAMs...... 69 Figure 3.11: Energy saving of R-HAM using structured sampling versus distributed volt- age overscaling...... 71 Figure 3.12: Overview of A-HAM: (a) Resistive CAM array with LTA comparators; (b) Circuit details of two rows...... 72 Figure 3.13: Minimum detectable distance in A-HAM...... 75

Figure 3.14: Multistage A-HAM architecture ...... 76 Figure 3.15: Energy-delay of the HAMs with accuracy ...... 77 Figure 3.16: Area comparison between the HAMs ...... 78 Figure 3.17: Impact of process and voltage variations for the minimum detectable Hamming distance in A-HAM ...... 79

Figure 4.1: Motivational scenario ...... 84 Figure 4.2: Execution time of homomorphic encryption and decryption over MNIST dataset ...... 85 Figure 4.3: Overview of SecureHD ...... 86 Figure 4.4: MPC-based key generation ...... 88 Figure 4.5: Illustration of SecureHD encoding and decoding procedures ...... 88 Figure 4.6: Value extraction example ...... 91 Figure 4.7: Iterative error correction procedure ...... 94 Figure 4.8: Relationship between the number of metavector injections and segment size 96 Figure 4.9: Illustration of the classification in SecureHD ...... 97 Figure 4.10: Comparison of SecureHD efficiency to homomorphic algorithm in encoding and decoding ...... 100 Figure 4.11: SecureHD classification accuracy ...... 103 Figure 4.12: Scalability of SecureHD classification ...... 105 Figure 4.13: Data recovery accuracy of SecureHD ...... 106 Figure 4.14: Example of image recovery ...... 106

LIST OF TABLES

Table 2.1: VTEAM Model Parameters for Memristor ...... 33 Table 2.2: FloatPIM Parameters ...... 34 Table 2.3: Workloads ...... 34 Table 2.4: Error rate comparison and PIM supports...... 35

Table 3.1: Comparison of QuantHD classification accuracy with the state-of-the-art HD computing...... 56 Table 3.2: Memory footprint of different algorithms (MB) ...... 63 Table 3.3: Average switch activity of D-HAM and R-HAM...... 68

Table 4.1: Datasets (n: feature size, K: number of classes) ...... 101

ACKNOWLEDGEMENTS

I would like to first thank my advisor, Prof. Tajana Rosing, for her encouragement, support, and guidance during my Ph.D. I am extremely grateful for her understanding of all aspects of my life. The support that I have received from her was definitely way beyond the responsibility of an academic advisor. My success would definitely not have been possible had it not been for such an exceptional advisor, who provided an incredible research-oriented environment and all the resources that students need to succeed. I would like to give special thanks to Prof. Farinaz Koushanfar for her guidance, support, and mentorship on the several projects on which I collaborated with her team. I would also like to thank my other committee members, Prof. Ryan Kastner, Prof. CK Cheng, and Prof. Steven Swanson, for their feedback and discussions related to this Ph.D. work. I also would like to thank all our collaborators who helped us during the last few years, especially Prof. Jan M. Rabaey at UC Berkeley, Prof. Sharon Hu at the University of Notre Dame, and Prof. Nikil Dutt at UC Irvine. I would like to thank all my lab colleagues in SEELab for their active collaboration, help, and all the good memories. I would also like to give special thanks to Yeseong Kim and Saransh Gupta. My research was made possible by funding from the National Science Foundation (NSF) Grants 1527034, 1619261, 1730158, 1826967, and 1911095, CRISP, one of six centers in JUMP, an SRC program sponsored by DARPA, and an SRC Global Research Collaboration grant. Most importantly, I owe so much to my family, Fatemeh, Habibollah, Farhad, and Mahdi. My wife, Haleh, has been extremely supportive of me throughout this entire process and has made countless sacrifices to help me get to this point. I could overcome all the difficulties thanks to my family’s understanding, patience, and love.

Chapter 2 contains material from “FloatPIM: In-Memory Acceleration of Deep Neural Network Training with High Precision”, by Mohsen Imani, Saransh Gupta, Yeseong Kim, and Tajana S. Rosing, which appears in IEEE International Symposium on Computer Architecture, July 2019 [1]. The dissertation author was the primary investigator and author of this paper.

Chapter 3 contains material from “QuantHD: A Quantization Framework for Hyperdimensional Computing”, by Mohsen Imani, Samuel Bosch, Sohum Datta, Sharadhi Ramakrishna, Sahand Salamat, Jan M. Rabaey, and Tajana Rosing, which appears in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, November 2019 [2]. The dissertation author was the primary investigator and author of this paper.

Chapter 3 contains material from “SearcHD: A Memory-Centric Hyperdimensional Computing with Stochastic Training”, by Mohsen Imani, Xunzhao Yin, John Messerly, Saransh Gupta, Michael Nemier, Xiaobo Sharon Hu, and Tajana Rosing, which appears in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, November 2019 [3]. The dissertation author was the primary investigator and author of this paper.

Chapter 3 contains material from “Exploring Hyperdimensional Associative Memory”, by Mohsen Imani, Abbas Rahimi, Deqian Kong, Tajana Rosing, and Jan M. Rabaey, which appears in IEEE International Symposium on High-Performance Computer Architecture, February 2017 [4]. The dissertation author was the primary investigator and author of this paper.

Chapter 4 contains material from “A Framework for Collaborative Learning in Secure High-Dimensional Space”, by Mohsen Imani, Yeseong Kim, Sadegh Riazi, John Merssely, Patrick Liu, Farinaz Koushanfar and Tajana S. Rosing, which appears in IEEE Cloud Computing, July 2019 [5]. The dissertation author was the primary investigator and author of this paper.

VITA

2011 B. S. in Electrical and Computer Engineering, University of Tehran, Tehran, Iran

2014 M. S. in Electrical and Computer Engineering, University of Tehran, Tehran, Iran

2020 Ph. D. in Computer Science (Computer Engineering), University of California San Diego, US

PUBLICATIONS

Mohsen Imani, Mohammad Samragh, Yeseong Kim, Saransh Gupta, Farinaz Koushanfar, Ta- jana Rosing, “Deep Learning Acceleration with Neuron-to-Memory Transformation”, IEEE International Symposium on High-Performance Computer Architecture (HPCA), Feb 2020

Hamid Nejatollahi, Saransh Gupta, Mohsen Imani, Tajana Rosing, Rosario Cammarota, Nikil Dutt, “CryptoPIM: In-Memory Acceleration for RLWE Lattice-based Cryptography”, IEEE/ACM Design Automation Conference (DAC), 2020. (Best Paper Candidate)

Behnam Khaleghi, Mohsen Imani, Tajana Rosing, “Prive-HD: Privacy-Preserved Hyperdimensional Computing”, IEEE/ACM Design Automation Conference (DAC), 2020.

Saransh Gupta, Mohsen Imani, Joonseop Sim, Andrew Huang, Fan Wu, M. Hassan Najafi, Tajana Rosing, “SCRIMP: A General Stochastic Computing Architecture using ReRAM in-Memory Processing”, IEEE/ACM Design Automation and Test in Europe Conference (DATE), 2020.

Yeseong Kim, Mohsen Imani, Niema Moshiri, and Tajana Rosing, “GenieHD: Efficient DNA Pat- tern Matching Accelerator Using Hyperdimensional Computing”, IEEE/ACM Design Automation and Test in Europe Conference (DATE), Mar 2020. (Best Paper Candidate)

Mohsen Imani, Saransh Gupta, Yeseong Kim, and Tajana S. Rosing, “FloatPIM: In-Memory Acceleration of Deep Neural Network Training with High Precision”, International Symposium on Computer Architecture (ISCA), Jun 2019

Mohsen Imani, Yeseong Kim, Sadegh Riazi, John Merssely, Patrick Liu, Farinaz Koushanfar and Tajana S. Rosing, “A Framework for Collaborative Learning in Secure High-Dimensional Space”, IEEE Cloud Computing (CLOUD), Jul 2019

Mohsen Imani, Samuel Bosch, Mojan Javaheripi, Bita Rouhani, Xinyu Wu, Farinaz Koushanfar, Tajana Rosing, “SemiHD: Semi-Supervised Learning Using Hyperdimensional Computing”, IEEE/ACM International Conference On Computer-Aided Design (ICCAD), 2019.

xiv Mohsen Imani, Sahand Salamat, Behnam Khaleghi, Mohammad Samragh, Farinaz Koushanfar, Tajana Rosing, “SparseHD: Algorithm-Hardware Co-Optimization for Efficient High-Dimensional Computing”, IEEE International Symposium on Field-Programmable Custom Computing Ma- chines (FCCM), 2019.

Mohsen Imani, Justin Morris, John Messerly, Helen Shu, Yaobang Deng, Tajana Rosing, “BRIC: Locality-based Encoding for Energy-Efficient Brain-Inspired Hyperdimensional Computing”, IEEE/ACM Design Automation Conference (DAC), 2019. (Best Paper Candidate)

Mohsen Imani, Alice Sokolova, Ricardo Garcia, Andrew Huang, Fan Wu, Baris Aksanli, Tajana Rosing, “ApproxLP: Approximate Multiplication with Linearization and Iterative Error Control”, IEEE/ACM Design Automation Conference (DAC), 2019.

Daniel Peroni, Mohsen Imani, Hamid Nejatollahi, Nikil Dutt, Tajana Rosing, “ARGA: Ap- proximate Reuse for GPGPU Acceleration”, IEEE/ACM Design Automation Conference (DAC), 2019.

Minxuan Zhou, Mohsen Imani, Saransh Gupta, Tajana Rosing, “Thermal-Aware Design and Man- agement for Search-based In-Memory Acceleration”, IEEE/ACM Design Automation Conference (DAC), 2019.

Sahand Salamat, Mohsen Imani, Behnam Khaleghi, Tajana Rosing, “F5-HD: Fast Flexible FPGA- based Framework for Refreshing Hyperdimensional Computing”, ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2019.

Mohsen Imani, John Messerly, Fan Wu, Wang Pi, Tajana Rosing, “A Binary Learning Frame- work for Hyperdimensional Computing”, IEEE/ACM Design Automation and Test in Europe Conference (DATE), 2019.

Mohsen Imani, Yeseong Kim, Thomas Worley, Saransh Gupta, and Tajana S. Rosing, “HDCluster: An Accurate Clustering Using Brain-Inspired High-Dimensional Computing”, IEEE/ACM Design Automation and Test in Europe Conference (DATE), Mar 2019

Mohsen Imani, Ricardo Garcia, Andrew Huang, Tajana Rosing, “CADE: Configurable Ap- proximate Divider for Energy Efficiency”, IEEE/ACM Design Automation and Test in Europe Conference (DATE), 2019.

Mohsen Imani, Justin Morris, Samuel Bosch, Helen Shu, Giovanni De Micheli, Tajana Rosing, “AdaptHD: Adaptive Efficient Training for Brain-Inspired Hyperdimensional Computing”, IEEE Biomedical Circuits and Systems Conference (BioCAS), 2019.

Saransh Gupta, Mohsen Imani, Behnam Khaleghi, Venketash Kumar, and Tajana Rosing, “RAPID: A ReRAM Processing in Memory Architecture for DNA Sequence Alignment”, IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), 2019.

xv Justin Morris, Mohsen Imani, Samuel Bosch, Anthony Thomas, Helen Shu, Tajana Rosing“CompHD: Efficient Hyperdimensional Computing Using Model Compression”, IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), 2019. Minxuan Zhou, Mohsen Imani, Saransh Gupta, Yeseong Kim, and Tajana S. Rosing, “GRAM: Graph Processing in a ReRAM-based Computational Memory”, IEEE Asia and South Pacific Design Automation Conference (ASP-DAC), Jan 2019 Mohsen Imani, Sahand Salamat, Jiani Huang, Saransh Gupta, Tajana Rosing, “FACH: FPGA- based Acceleration of Hyperdimensional Computing by Reducing Computational Complexity”, IEEE Asia and South Pacific Design Automation Conference (ASP-DAC), 2019. Daniel Peroni, Mohsen Imani, Tajana Rosing, “ALook: Adaptive Lookup for GPGPU Accelera- tion”, IEEE Asia and South Pacific Design Automation Conference (ASP-DAC), 2019. Mohsen Imani, Samuel Bosch, Sohum Datta, Sharadhi Ramakrishna, Sahand Salamat, Jan Rabaey, Tajana Rosing, “QuantHD: A Quantization Framework for Hyperdimensional Computing”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2019. Mohsen Imani, Xunzhao Yin, John Messerly, Saransh Gupta, Michael Nemier, Xiaobo Sharon Hu, Tajana Rosing“SearcHD: A Memory-Centric Hyperdimensional Computing with Stochastic Training”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2019. Daniel Peroni, Mohsen Imani, Hamid Nejatollahi, Nikil Dutt, Tajana Rosing, “Data Reuse for Accelerated Approximate Warps”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2019. Mohsen Imani, Justin Morris, Helen Shu, Shou Li, Tajana Rosing, “Efficient Associative Search in Brain-Inspired Hyperdimensional Computing”, IEEE Design & Test (D&T), 2019. Mohsen Imani, Ricardo Garcia, Saransh Gupta, Tajana Rosing, “Hardware-Software Co-design to Accelerate Neural Network Applications”, ACM Journal on Emerging Technologies in Computing (JETC), 2019. Saransh Gupta, Mohsen Imani, Harveen Kaur, Tajana Rosing, “NNPIM: A Processing In-Memory Architecture for Neural Network”, IEEE Transactions on Computers (TC), 2019. Yeseong Kim, Mohsen Imani, and Tajana S. Rosing, “Image Recognition Accelerator Design Using In-Memory Processing”, IEEE MICRO, IEEE Computer Society, Jan/Feb 2019. Mohsen Imani, Saransh Gupta, Yeseong Kim, Minxuan Zhou, and Tajana S. Rosing, “DigitalPIM: Digital-based Processing In-Memory for Big Data Acceleration”, ACM Great lakes symposium on VLSI (GLSVLSI), 2019 (Invited Talk). Mohsen Imani, Saransh Gupta, and Tajana S. Rosing, “Digital-based Processing In-Memory: A Highly-Parallel Accelerator for Data Intensive Applications”, ACM International Symposium on Memory Systems (MEMSYS), 2019.

xvi Joonseop Sim, Saransh Gupta, Mohsen Imani, Yeseong Kim, and Tajana S. Rosing, “UPIM : Unipolar Switching Logic for High Density Processing-in-Memory Applications”, ACM Great lakes symposium on VLSI (GLSVLSI), 2019. Saransh Gupta, Mohsen Imani, and Tajana S. Rosing, “Exploring Processing In-Memory for Different Technologies”, ACM Great lakes symposium on VLSI (GLSVLSI), 2019. Mohsen Imani, Deqian Kong, and Tajana S. Rosing, “Hierarchical Hyperdimensional Computing for Energy Efficient Classification”, IEEE IEEE/ACM Design Automation Conference (DAC), 2018. Mohsen Imani, Ricardo Garcia, Saransh Gupta, and Tajana S. Rosing, “RMAC: Runtime Con- figurable Floating Point Multiplier for Approximate Computing”, IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), 2018. Minxuan Zhou, Mohsen Imani, Saransh Gupta, and Tajana S. Rosing, “GAS: A Heterogeneous Memory Acceleration for Graph Processing”, IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), 2018. Mohsen Imani, Saransh Gupta, and Tajana S. Rosing, “GenPIM: Generalized Processing In- Memory to Accelerate Data Intensive Applications”, IEEE/ACM Design Automation and Test in Europe Conference (DATE), 2018. Saransh Gupta, Mohsen Imani, and Tajana S. Rosing, “FELIX: Fast and Energy-Efficient Logic in Memory”, IEEE/ACM International Conference On Computer Aided Design (ICCAD), 2018. Yeseong Kim, Mohsen Imani, and Tajana S. Rosing, “Efficient Human Activity Recognition Using Hyperdimensional Computing”, IEEE Conference on of Things (IoT), Oct 2018 Joonseop Sim, Mohsen Imani, Woojin Choi, Yeseong Kim, and Tajana S. Rosing, “LUPIS: Latch-up Based Ultra Efficient Processing in-Memory System”, 2018 International Symposium on Quality Electronic Design (ISQED), March 2018 (Best paper candidate). Mohsen Imani, Max Masich, Daniel Peroni, Pushen wang, and Tajana S. Rosing, “CANNA: Neural Network Acceleration using Configurable Approximation on GPGPU”, IEEE Asia and South Pacific Design Automation Conference (ASP-DAC), 2018. Mohsen Imani, Daniel Peroni, and Tajana S. Rosing, “Program Acceleration Using Nearest Distance Associative Search”, IEEE International Symposium on Quality Electronic Design (ISQED), 2018. Mohsen Imani, Daniel Peroni, and Tajana S. Rosing, “Program Acceleration Using Nearest Distance Associative Search”, IEEE International Symposium on Quality Electronic Design (ISQED), 2018. Sahand Salamat, Mohsen Imani, Saransh Gupta, and Tajana S. Rosing, “RNSnet: In-Memory Neural Network Acceleration Using Residue Number System”, IEEE International Conference on Rebooting Computing (ICRC), 2018.

Daniel Peroni, Mohsen Imani, and Tajana Rosing, “Runtime Efficiency-Accuracy Trade-off Using Configurable Floating Point Multiplier”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2018.

Mohsen Imani, Saransh Gupta, Sahil Sharma, and Tajana Rosing“NVQuery: Efficient Query Pro- cessing in Non-Volatile Memory”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2018.


Mohsen Imani, Abbas Rahimi, Deqian Kong, Tajana S. Rosing, and Jan M. Rabaey, “Exploring Hyperdimensional Associative Memory”, IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2017.

Mohsen Imani, Daniel Peroni, and Tajana S. Rosing, “CFPU: Configurable Floating Point Multiplier for Energy-Efficient Computing”, IEEE/ACM Design Automation Conference (DAC), 2017 (Best poster award at Research Expo).

Yeseong Kim, Mohsen Imani, and Tajana S. Rosing, “ORCHARD: Visual Object Recognition Accelerator Based on Approximate In-Memory Processing”, IEEE/ACM International Conference On Computer Aided Design (ICCAD), 2017.

Mohsen Imani, Daniel Peroni, Yeseong Kim, Abbas Rahimi, and Tajana S. Rosing, “Efficient Neural Network Acceleration on GPGPU using Content Addressable Memory”, IEEE/ACM Design Automation and Test in Europe Conference (DATE), 2017.

Mohammad Samragh, Mohsen Imani, Farinaz Koushanfar, and Tajana S. Rosing, “LookNN: Neural Network with No Multiplication”, IEEE/ACM Design Automation and Test in Europe Conference (DATE), 2017.

Mohsen Imani, Saransh Gupta, Atl Arredondo, and Tajana S. Rosing, “Efficient Query Processing in Crossbar Memory”, IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), 2017.

Mohsen Imani, Daniel Peroni, Abbas Rahimi, and Tajana Rosing“Resistive CAM Acceleration for Tunable Approximate Computing”, IEEE Transactions on Emerging Topics in Computing (TETC), 2017.

Mohsen Imani, Abbas Rahimi, John Hwang, Tajana Rosing, and Jan M. Rabaey, “Low-Power Sparse Hyperdimensional Encoder for Language Recognition”, IEEE Design & Test (D&T), 2017.

Mohsen Imani, Shruti Patil, and Tajana Rosing“Approximate Computing using Multiple-Access Single-Charge Associative Memory”, IEEE Transactions on Emerging Topics in Computing (TETC), 2017.

Mohsen Imani, Abbas Rahimi, Pietro Mercati, and Tajana Rosing, “Multi-stage Tunable Approximate Search in Resistive Associative Memory”, IEEE Transactions on Multi-Scale Computing Systems (TMSCS), 2017.

Mohsen Imani, Daniel Peroni, and Tajana Rosing“NVALT: Approximate Lookup Table for GPU Acceleration”, IEEE Embedded System Letter (ESL), 2017.

Mohsen Imani, Yeseong Kim, and Tajana S. Rosing, “MPIM: Multi-Purpose In-Memory Pro- cessing using Configurable Resistive Memory”, IEEE Asia and South Pacific Design Automation Conference (ASP-DAC), 2017.

Mohsen Imani, Yeseong Kim, and Tajana S. Rosing, “NNgine: Ultra-Efficient Nearest Neighbor Accelerator Based on In-Memory Computing”, IEEE International Conference on Rebooting Computing (ICRC), November 2017

Mohsen Imani, Deqian Kong, Abbas Rahimi, and Tajana S. Rosing, “VoiceHD: Hyperdimensional Computing for Efficient Speech Recognition”, IEEE International Conference on Rebooting Computing (ICRC), 2017.

Joonseop Sim, Mohsen Imani, Yeseong Kim, and Tajana S. Rosing, “Enabling Efficient System Design Using Vertical Nanowire Transistor Current Mode Logic”, 25th IEEE International Conference on Very Large Scale Integration (VLSI-SoC), October 2017

Mohsen Imani, Yeseong Kim, Abbas Rahimi, and Tajana S. Rosing, “ACAM: Approximate Com- puting Based on Adaptive Associative Memory with Online Learning”, IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), 2016.

Mohsen Imani, Pietro Mercati, and Tajana S. Rosing, “ReMAM: Low Energy Resistive Multi- Stage Associative Memory for Energy Efficient Computing”, IEEE International Symposium on Quality Electronic Design (ISQED), 2016.

Mohsen Imani, Daniel Peroni, Abbas Rahimi, and Tajana S. Rosing, “Resistive CAM Acceleration for Tunable Approximate Computing”, IEEE International Conference on Computer Design (ICCD), 2016.

Mohsen Imani, Abbas Rahimi, and Tajana S. Rosing, “Resistive Configurable Associative Mem- ory for Approximate Computing”, IEEE/ACM Design Automation and Test in Europe Conference (DATE), 2016.

Mohsen Imani, Shruti Patil, and Tajana S. Rosing, “MASC: Ultra-Low Energy Multiple-Access Single-Charge TCAM for Approximate Computing”, IEEE/ACM Design Automation and Test in Europe Conference (DATE), 2016.

Mohsen Imani, Abbas Rahimi, Yeseong Kim and Tajana S. Rosing, “A Low-Power Hybrid Magnetic Cache Architecture Exploiting Narrow-Width Values”, 5th Non-Volatile Memory Systems and Applications Symposium (NVMSA), August 2016

Pietro Mercati, Francesco Paterna, Andrea Bartolini, Mohsen Imani, Luca Benini, and Tajana S. Rosing, “VarDroid: Online Variability Emulation in Android/ Platforms”, ACM Great lakes symposium on VLSI (GLSVLSI), 2016.

Yeseong Kim, Mohsen Imani, Shruti Patil, and Tajana S. Rosing, “CAUSE: Critical Applica- tion Usage-Aware Memory System using Non-volatile Memory for Mobile Devices”, IEEE International Conference On Computer Aided Design (ICCAD), 2015.

ABSTRACT OF THE DISSERTATION

Machine Learning in IoT Systems: From Deep Learning to Hyperdimensional Computing

by

Mohsen Imani

Doctor of Philosophy in Computer Science (Computer Engineering)

University of California San Diego, 2020

Professor Tajana Simunic Rosing, Chair

With the emergence of the Internet of Things (IoT), devices are generating massive amounts of data. Running machine learning algorithms on IoT devices poses substantial technical challenges due to their limited resources. The focus of this dissertation is to dramatically increase the computing efficiency as well as the learning capability of today’s IoT systems by accelerating existing algorithms in hardware and designing new classes of light-weight machine learning algorithms. Our design modifies storage-class memory to support search-based and vector-based computation in memory. We show how this architecture can be used to accelerate deep neural networks in both the training and inference phases, resulting in 303× faster and 48× more energy-efficient training as compared to the state-of-the-art GPU.

Hardware acceleration alone does not provide all the efficiency and robustness that we need. Therefore, we present Hyperdimensional (HD) computing, an alternative method of learning that implements principles of the brain’s functionality: (i) fast learning, (ii) robustness to noise/error, and (iii) intertwined memory and logic. These features make HD computing a promising solution for today’s embedded devices with limited resources, as well as for future computing systems in deeply nanoscaled technology that suffer from high noise and variability. We exploit emerging technologies to enable processing in-memory, which is capable of highly parallel computation and reduces data movement. Our evaluations show that HD computing provides 39× speedup and 56× higher energy efficiency as compared to a state-of-the-art deep learning accelerator.

Chapter 1

Introduction

We live in a world where technological advances are continually creating more data than we can cope with. With the emergence of the Internet of Things (IoT), devices will generate massive data streams demanding services that pose huge technical challenges due to limited device resources [6, 7, 8, 9]. Sending all the data to the cloud for processing is not scalable, cannot guarantee real-time response, and is often not desirable due to privacy and security concerns. Much of IoT data processing will need to run at least partly on devices at the edge of the internet. Existing von Neumann architectures cannot sustain this computational load due to their lack of resources, architectural issues, and the high variability and noise of advanced technology nodes. We recognize the following technical challenges when running machine learning algorithms on IoT devices:

• Limited computing resources: The embedded devices in IoT systems often do not have sufficient resources for processing sophisticated learning and big data applications [10, 11]. Running existing machine learning algorithms on traditional cores results in high energy consumption and slow processing speed.

• Noise and technology related failures: Both CMOS and other emerging technologies

are scaling rapidly. The technological and fabrication issues in highly scaled technology nodes add a significant amount of noise to both memory and computing units [12, 3]. Most existing algorithms do not have the brain-like robustness to work with noisy devices while providing accurate results.

• Internal data movement: Processing machine learning and big data applications in von- Neumann architectures is inefficient due to separate memory and computing units. The on-chip caches do not have enough capacity to store big data. This consequently creates a large amount of data movement between the processing cores and memory units which significantly slows down the computation.

• Security and distributed learning: In large IoT systems, sending all data to the cloud poses significant scalability issues coupled with privacy and security concerns. It consequently leads to a significant communication cost with high latency to transfer all data points to a centralized cloud.

In this dissertation, we propose two classes of solutions for IoT learning. We show how to accelerate Deep Neural Networks (DNNs) in both the training and inference phases using a processing in-memory architecture. We propose a novel architecture that enables storage-class memory to support high precision vector-based in-memory operations. We introduce several key design features that map the essential operations of machine learning algorithms to a massively parallel in-memory architecture, resulting in 303× speedup and 48× more energy-efficient DNN training as compared to the state-of-the-art GPU. In order to achieve real-time learning in IoT systems, we need to rethink how we accelerate machine learning algorithms in hardware. In addition, we need to redesign the algorithms themselves using strategies that more closely model the ultimate efficient learning machine: the human brain. To address this issue, we propose Hyperdimensional (HD) computing [13]. HD computing is based on a short-term human memory model, sparse distributed memory, which emerged

from theoretical neuroscience. A key benefit of HD computing is its natural robustness to noise. This is of critical importance in IoT systems, where noise and high error rates are common during communication. In addition, we show how processing in-memory can leverage this robustness to accelerate HD computation. Our evaluations show that HD computing provides 39× speedup and 56× higher energy efficiency as compared to our state-of-the-art deep learning accelerator [1]. In the rest of this chapter, we discuss the contributions and related work of this thesis in more detail.

1.1 Deep Learning Acceleration

Running data/memory-intensive machine learning workloads on traditional cores results in high energy consumption and slow processing speed, primarily due to a large amount of data movement between memory and processing units. Processing In-Memory (PIM) is a promising solution to address the data movement issue by implementing logic within memory. Instead of sending a large amount of data to the processing cores for computation, PIM performs a part of the computation tasks, e.g., bit-wise computations, inside the memory; thus the application performance can be accelerated significantly by avoiding the memory access bottleneck. Several existing works have proposed analog-based PIM accelerators to perform vector-matrix multiplications in a fast, analog way in crossbar memory [14, 15]. However, these approaches have a few significant disadvantages: (i) They utilize Analog to Digital Converters (ADCs) and Digital to Analog Converters (DACs), which take the majority of the chip area and do not scale as fast as the memory device technology does. (ii) The existing PIM approaches use multi-level memristor devices that are not sufficiently reliable for commercialization, unlike commonly-used single-level NVMs, e.g., Intel 3D XPoint [16]. (iii) They only support matrix multiplication in analog memory, while other operations such as activation functions are implemented using CMOS-based digital logic. This makes the design less general and increases the expense of fabrication. In Chapter 2, we propose FloatPIM, a digital-based processing in-memory platform

capable of accelerating deep learning over commercially available memory devices, i.e., Intel 3D XPoint [1]. FloatPIM accelerates the entire DNN training and inference phases directly in storage-class memory without using extra processing cores. At the hardware layer, our platform enables storage-class memory to support essential vector-based operations locally in memory. At the software layer, we integrate a software infrastructure that seamlessly orchestrates the hardware structures. FloatPIM enables highly parallel and scalable computation on digital data stored in memory, addresses the internal data movement issue by enabling in-place computation, and natively supports floating-point precision. Although FloatPIM applies to a wide range of big data applications (e.g., bioinformatics [17] and security [18]), our focus in this dissertation is on deep learning applications. Our evaluation shows that FloatPIM results in 303× faster and 48× more energy-efficient DNN training as compared to the state-of-the-art GPU.

1.2 Brain-Inspired Hyperdimensional Computing

Emerging memory devices have various reliability issues such as endurance, durability, and variability [19, 20, 21]. This, coupled with the high computational complexity of learning algorithms, results in many writes to memory and thus in endurance problems in the accelerators [1]. Embedded devices are resource constrained. Instead of accelerating the existing algorithms on the embedded devices, we need to think about how to design algorithms that mimic the efficiency and robustness of the human brain. Our research has been instrumental in developing practical implementations of Hyperdimensional (HD) computing, a computational technique modeled after the brain [22]. The hyperdimensional computing system enables large-scale learning in real-time, including both training and inference. HD computing is motivated by the observation that the key aspects of human memory, perception, and cognition can be explained by the mathematical properties of high-dimensional spaces. It models data using points of a high-dimensional space, called

hypervectors. These points can be manipulated with formal algebra operations to represent semantic relationships between objects. HD computing mimics several desirable properties of the human brain, including robustness to noise and hardware failure, and single-pass learning, where training happens in one shot without storing the training data points or using complex gradient-based algorithms. These features make HD computing a promising solution for today’s embedded devices with limited storage, battery, and resources, and for future computing systems in deeply nanoscaled technology, which have high noise and variability. We exploited the mathematics and the key principles of brain functionality to create HD platforms. Our platform includes: (1) novel HD algorithms supporting classification, used regularly by professional data scientists [23, 24], (2) novel HD hardware accelerators capable of up to three orders of magnitude improvement in energy efficiency relative to GPU implementations [4, 2], and (3) a software infrastructure that makes it easy for users to integrate HD computing as part of a large IoT system and enables secure distributed learning on encrypted information [5, 25]. We have also leveraged the memory-centric nature of HD computing to develop an efficient hardware/software infrastructure for highly parallel PIM acceleration [4]. We exploited the robustness of HD computing to design an analog in-memory associative search [4], which checks the similarity of hypervectors in tens of nanoseconds while providing three orders of magnitude improvement in energy efficiency as compared to today’s exact processors [4].
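To make the hypervector manipulation described above concrete, the following minimal sketch (an illustration only, not the dissertation's implementation; the dimensionality, the random bipolar base hypervectors, and the toy dataset are all assumptions) encodes feature vectors into high-dimensional space, bundles them into class hypervectors in a single pass, and classifies by similarity search.

```python
import numpy as np

D = 10000  # hypervector dimensionality (assumed for illustration)
rng = np.random.default_rng(0)

def random_hv():
    """Random bipolar hypervector: the basic building block of HD encoding."""
    return rng.choice([-1, 1], size=D)

# Base (ID) hypervectors for a toy 3-feature problem (assumption).
feature_ids = [random_hv() for _ in range(3)]

def encode(features):
    """Bind each feature value to its ID hypervector and bundle (sum) the results."""
    return np.sum([v * hv for v, hv in zip(features, feature_ids)], axis=0)

def train(samples, labels, num_classes):
    """Single-pass training: accumulate encoded samples into per-class hypervectors."""
    model = np.zeros((num_classes, D))
    for x, y in zip(samples, labels):
        model[y] += encode(x)
    return model

def classify(model, features):
    """Associative search: return the class whose hypervector is most similar (cosine)."""
    q = encode(features)
    sims = model @ q / (np.linalg.norm(model, axis=1) * np.linalg.norm(q) + 1e-12)
    return int(np.argmax(sims))

# Toy usage: two classes separated by the sign of the first feature (assumption).
X = [[1.0, 0.2, -0.1], [0.9, -0.3, 0.2], [-1.0, 0.1, 0.3], [-0.8, 0.4, -0.2]]
y = [0, 0, 1, 1]
model = train(X, y, num_classes=2)
print(classify(model, [0.95, 0.0, 0.0]))   # expected: 0
print(classify(model, [-0.9, 0.0, 0.0]))   # expected: 1
```

In hardware, the similarity search in classify maps to the associative memory designs (D-HAM, R-HAM, A-HAM) discussed in Chapter 3.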

Chapter 2

Deep Learning Acceleration with Processing In-Memory

2.1 Introduction

Artificial neural networks, and in particular deep learning [26, 27], have a wide range of applications in diverse areas including object detection [28], self-driving cars, and translation [29]. Recently, on specific tasks such as AlphaGo [30] and ImageNet recognition [31], deep learning algorithms have demonstrated human-level performance. Convolutional neural networks (CNNs) are the most commonly used deep learning models [28, 32, 33]. Processing CNNs in conventional von Neumann architectures is inefficient, as these architectures have separate memory and computing units. The on-chip caches do not have enough capacity to store all the data for large CNNs with hundreds of layers and millions of weights. This consequently creates a large amount of data movement between the processing cores and memory units, which significantly slows down the computation. Processing in-memory (PIM) is a promising solution to address the data movement issue [34, 35, 36]. Prior works [37, 15] exploit analog characteristics of non-volatile memory to support matrix multiplication in memory. These architectures transfer the digital input data into the analog domain and pass the analog signal through a crossbar memory to compute a matrix multiply. However, these approaches have three significant downsides: (i) They utilize Analog to Digital Converters (ADCs) and Digital to Analog Converters (DACs), which take the majority of the chip area and power consumption, e.g., 98% of chip area in the Deep Neural Network (DNN) accelerator of [37]. In addition, the mixed-signal ADC/DAC blocks do not scale as fast as the memory device technology does. (ii) The existing PIM approaches use multi-level memristor devices that are not sufficiently reliable for commercialization, unlike single-level NVMs, e.g., Intel 3D XPoint [16]. (iii) Finally, they only support matrix multiplication in analog memory, while other operations, such as activation functions, are implemented using CMOS-based digital logic. This makes the design application specific and increases the fabrication expenses. In this chapter, we propose FloatPIM, a novel high precision PIM architecture, which

significantly accelerates CNNs in both training and testing with the floating-point representation. This chapter presents the following main contributions:

• FloatPIM directly supports floating-point representations, thus enabling high precision CNN training and testing. To the best of our knowledge, FloatPIM is the first PIM-based CNN training architecture that exploits analog properties of the memory without explicitly converting data into the analog domain. FloatPIM is flexible in that it works with floating-point as well as fixed-point precision.

• FloatPIM implements the PIM operations directly on the digital data stored in memory using a scalable architecture. All computations in FloatPIM are done with bitwise NOR operations on single-bit bipolar resistive devices. This eliminates the overhead of ADC and DAC blocks to transfer data between the analog and digital domains. It also completely eliminates the need for multi-bit memristors, thus simplifying manufacturing.

• We introduce several key design features that optimize the CNN computations in PIM designs. FloatPIM breaks the computation into computing and data transfer phases. In the computing mode, all blocks are working in parallel to compute the matrix multiplication and convolution tasks. During the data transfer mode, FloatPIM enables a pipelined, row-parallel data transfer between neighboring memory blocks. This significantly reduces the cost of internal data movement.

• We evaluate the efficiency of FloatPIM on popular large-scale networks with comparisons to the state-of-the-art solutions. We show how FloatPIM accelerates the computation of AlexNet, VGGNet, GoogleNet, and SqueezeNet on the ImageNet dataset [38]. In terms of accuracy, FloatPIM with floating point precision achieves up to 5.1% higher classification accuracy than the same design using a fixed point representation. In terms of efficiency, our evaluation shows that FloatPIM training is 303.2× faster and 48.6× more energy efficient than the state-of-the-art GPU.

2.2 Related Work

There are several recent studies adopting alternative low-precision arithmetic for DNN training [39]. Work in [40] proposed DNN training on hardware with hybrid dynamic fixed-point and floating point precision. However, for convolutional neural networks, the work in [41, 42] showed that fixed point is not the most suitable representation for CNN training; instead, training can be performed with low-bit-width floating point values. Modern neural network algorithms are executed on different types of platforms such as GPUs, FPGAs, and ASIC chips [43, 44, 45, 46, 47, 48]. Prior work attempted to fully utilize existing cores to accelerate neural networks. However, in these designs the main computation still relies on CMOS-based cores and thus has limited parallelism. To address the data movement issue, work in [49] proposed a neural cache architecture that re-purposes caches for parallel in-memory computing. Work in [50] modified the DRAM architecture to accelerate DNN inference by supporting matrix multiplication in memory. However, DRAM requires a refresh scheme and only supports destructive PIM operations [51]. In contrast, FloatPIM performs row-parallel and non-destructive bitwise operations inside a non-volatile memory block without using any sense amplifier. FloatPIM also accelerates DNNs in both training and testing modes. The capability of non-volatile memories (NVMs) to act as both storage and a processing unit has encouraged research in processing in-memory (PIM). Work in [35, 52] designed an NVM-based Boltzmann machine capable of solving a broad class of deep learning and optimization problems. Work in [15, 37] used ReRAM-based crossbar memory to perform matrix multiplication in memory and accordingly designed PIM-based accelerators for CNN inference. Work in [53, 54, 55] used the same crossbar memory to accelerate CNN training. Work in [56] exploited a conventional analog-based memristive accelerator to support floating point operations. They exploit the exponent locality of data and the limited precision of floating point operations to enable floating point operations on fixed-point hardware.

In contrast, we enable floating point operations inherently in memory and do not rely on data pre-processing or scheduling, making FloatPIM a general floating point accelerator that is independent of the data. In addition, the analog approaches require mixed-signal circuits, e.g., ADCs and DACs, which do not scale as fast as the CMOS technology. Work in [14] proposed PipeLayer, a PIM-based architecture based on [37] that accelerates CNN training by exploiting inter-layer and intra-layer parallelism. PipeLayer eliminates the ADC and DAC blocks by using a spike-based approach. However, similar to other PIM architectures, PipeLayer's precision is limited to fixed-point operations. In addition, it uses insufficiently reliable multi-bit memristors, which are hard to program, especially during training with a large number of writes. Prior works exploited digital PIM operations to accelerate different applications such as DNNs [57, 58, 59, 60], object recognition [61], graph processing [36, 34], security [18], and database applications [62, 63]. However, those designs do not support high precision computation and incur significant internal data movement.

Figure 2.1: DNN computation during (a) feed-forward and (b) back-propagation.

2.3 Background

2.3.1 DNN Training

Figure 2.1a shows an example of a neural network with a fully-connected layer, where each neuron is connected to all neurons in the previous layer through weights. Figure 2.1a shows the computation of a single neuron in the feed-forward pass. The outputs of the neurons in the previous layer are multiplied with the weight matrix, and the results are accumulated in each neuron ($a_j$). The result of the accumulation passes through an activation function ($g$). This function was traditionally a Sigmoid [64], but recently the Rectified Linear Unit (ReLU) has become the most commonly used [26]. The activation results are used as the input for the neurons in the next layer. The goal of the training is to find the network weights using the gradient descent method. It runs in two main steps: feed-forward and back-propagation. In the feed-forward step, it examines the quality of the current neural network model for classifying a pre-defined number of training data points, also known as the batch size. It then stores all intermediate neuron values ($Z_i$) and the derivatives of the activation function $g'(a_j)$ for all data points in a batch. The next step is to update the neural network weights, often referred to as the back-propagation step. Figure 2.1b illustrates the back-propagation, which performs two major tasks: error backward and weight update.

Error backward: Back-propagation first measures the loss function in the CNN output layer using:
$$J = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} t_j^{(i)} \log\big(y_j^{(i)}\big)$$
Based on the chain rule, it identifies the gradient of the loss function with respect to each weight in the previous layer using:
$$\frac{dJ}{da_j} = \sum_k \frac{dJ}{da_k}\,\frac{da_k}{da_j}$$

The error vector in layer $j$ ($\delta_j$) is computed backward depending on the error vector in layer $k$ ($\delta_k$) and the derivative of the activation in layer $j$ ($g'(a_j)$). Assuming $\delta_j = -dJ/da_j$,

the following equation defines the gradient of the entropy loss for each neuron in layer $j$:
$$\delta_j = \begin{cases} (t_j - y_j), & \text{if } j \text{ is an output unit} \\ g'(a_j)\sum_k \delta_k W_{jk}, & \text{if } j \text{ is a hidden unit} \end{cases}$$

In CNNs, the convolution layers are trained in a similar way to the fully-connected layers, but with higher computational complexity. This is because each output element in the convolution layers depends on the movement of the convolution kernel through a range of the input matrix. The following equation shows how convolution gets the gradient of the loss function with respect to each input:

$$\frac{dJ}{da_{r,s}^{j}} = \sum_{r,s\in Q} \frac{dJ}{da_{Q}^{k}} \cdot \frac{da_{Q}^{k}}{da_{r,s}^{j}} = \sum_{r,s\in Q} \delta_{Q}^{k}\,\frac{da_{Q}^{k}}{da_{r,s}^{j}}$$

Similar to the expansion of the equation in the fully-connected layers, we have:

$$\frac{dJ}{da_{r,s}^{j}} = \sum_{a}\sum_{b}\underbrace{\Big[g'(a_{r,s})\cdot\sum_{m=0}^{K_1-1}\sum_{n=0}^{K_2-1}\delta_{a-m,\,b-n}^{k}\,W_{m,n}\Big]}_{\delta_{r,s}^{k}} * Z_{i+m',\,j+n'}^{i}$$
where $*$ denotes the convolution.

Weight update: Finally, the weights are updated by subtracting the $\eta\, Z_i\, \delta_j$ matrix from the current weights:
$$W_{ij} \leftarrow W_{ij} - \eta\,\frac{dJ}{dW_{ij}} = W_{ij} - \eta\,\delta_j Z_i$$
where $\eta$ is the learning rate and $Z_i$ is the output of the neurons after the activation function in layer $i$. Note that both $g'(a_j)$ and $Z_i$ are calculated and stored during the feed-forward step.
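To tie the training equations above together, the sketch below restates the feed-forward pass, the error backward step, and the weight update for one small fully-connected network in NumPy. It is only a reference illustration: the two-layer topology, the sigmoid activation, and the random data are assumptions, and it is not the FloatPIM implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def g(a):          # sigmoid activation
    return 1.0 / (1.0 + np.exp(-a))

def g_prime(a):    # derivative of the activation, g'(a)
    s = g(a)
    return s * (1.0 - s)

# Assumed toy sizes: 4 inputs -> 5 hidden -> 3 outputs.
W1 = rng.normal(scale=0.1, size=(4, 5))
W2 = rng.normal(scale=0.1, size=(5, 3))
eta = 0.1                                  # learning rate

x = rng.normal(size=4)                     # one training sample (assumption)
t = np.array([1.0, 0.0, 0.0])              # one-hot target

# Feed-forward: a_j = sum_i z_i * w_ij, z_j = g(a_j); store Z and g'(a) for training.
a1 = x @ W1;  z1 = g(a1)
a2 = z1 @ W2; y  = g(a2)

# Error backward (delta_j = -dJ/da_j): for output units delta_j = (t_j - y_j);
# for hidden units delta_j = g'(a_j) * sum_k delta_k * w_jk.
delta_out = t - y
delta_hid = g_prime(a1) * (W2 @ delta_out)

# Weight update, W_ij <- W_ij - eta * dJ/dW_ij; since dJ/dW_ij = -delta_j * z_i,
# this is plain gradient descent written with the delta quantities above.
W2 += eta * np.outer(z1, delta_out)
W1 += eta * np.outer(x, delta_hid)
```

FloatPIM keeps exactly the quantities used here, the stored activations Z and the derivatives g'(a), in memory during the feed-forward step so that the back-propagation can reuse them in place.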

2.3.2 Digital Processing In-Memory

Digital processing in-memory involves input-based switching of memristors, unlike conventional memristor processing, which uses ADC/DAC blocks to convert data between the analog

and digital domains. Digital PIM performs the computation directly on the values stored in the memory without reading them out or using any sense amplifier. Digital PIM designs have been proposed in the literature [65, 66, 67, 68, 69] and fabricated [70] to implement logic using memristor switching. The output device switches between two resistive states, RON (low resistive state, ‘1’) and ROFF (high resistive state, ‘0’), whenever the voltage across the device, i.e., the p and n terminals shown in Figure 2.2a, exceeds a threshold [71]. This property can be exploited to implement a NOR gate in the digital memory by applying a fixed voltage, V0, across the memristor devices [65]. The output memristor is initialized to RON at the beginning. To execute NOR in a row, an execution voltage, V0, is applied at the p terminals of the inputs while the p terminal of the output memristor is grounded, as shown in Figure 2.2a. The aim is to switch the output memristor from RON to ROFF when one or more inputs store a ‘1’ value (low resistance). Since NOR is a universal logic gate, it can be used to implement other logic operations such as addition [72, 73] and multiplication [74].

Figure 2.2: Digital PIM operations. (a) NOR operation. (b) 1-bit addition.

For example, 1-bit addition (inputs being A, B, C) can be represented in the form of NOR as,

$$C_{out} = \big((A+B)' + (B+C)' + (C+A)'\big)'. \qquad (2.1a)$$

$$S = \Big(\big((A'+B'+C')' + ((A+B+C)' + C_{out})'\big)'\Big)'. \qquad (2.1b)$$

Here, $C_{out}$ and $S$ are the generated carry and sum bits of the addition. Also, $(A+B+C)'$, $(A+B)'$, and $A'$ represent NOR(A,B,C), NOR(A,B), and NOR(A,A), respectively. Figure 2.2b visualizes the implementation of 1-bit addition in a memristor-based crossbar memory. The processing cells, pc, store the intermediate results and are not used to store data. Digital processing in-memory achieves maximum performance when the operands are present in the same row because, in this configuration, all the bits of an operand are accessible by all the bits of the other operand. This increases the flexibility in implementing operations in memory. In-memory operations are in general slower than the corresponding CMOS-based implementations. This is because memristor devices are slow in switching. However, this PIM architecture can provide significant speedup with large parallelism. PIM can support additions and multiplications in parallel, irrespective of the number of rows. For example, to add values stored in different columns of memory, it takes the same amount of time for PIM to process the addition in a single row or in all memory rows. In contrast, the processing time in conventional cores highly depends on the data size.
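The NOR-only expressions in Equations 2.1a and 2.1b can be sanity-checked with a short simulation. The sketch below is a logic-level illustration (not a device or timing model): it evaluates both expressions using a single NOR primitive and verifies them against ordinary binary addition for all eight input patterns.

```python
from itertools import product

def nor(*bits):
    """The only primitive the digital PIM executes: a multi-input NOR."""
    return 0 if any(bits) else 1

def pim_full_adder(a, b, c):
    """Carry and sum built purely from NOR, following Eq. (2.1a) and (2.1b)."""
    # Cout = ((A+B)' + (B+C)' + (C+A)')'
    cout = nor(nor(a, b), nor(b, c), nor(c, a))
    # S = (((A'+B'+C')' + ((A+B+C)' + Cout)')')'
    s = nor(nor(nor(nor(a, a), nor(b, b), nor(c, c)),
                nor(nor(a, b, c), cout)))
    return cout, s

for a, b, c in product([0, 1], repeat=3):
    cout, s = pim_full_adder(a, b, c)
    assert (cout, s) == divmod(a + b + c, 2), (a, b, c)
print("NOR-based full adder matches binary addition for all 8 input patterns")
```

In the crossbar, each intermediate NOR result in this computation would occupy one of the processing cells (pc) shown in Figure 2.2b, and the same sequence of NOR steps can be applied to every memory row in parallel.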

2.4 FloatPIM Overview

In this section, we propose a digital and scalable processing in-memory architecture (FloatPIM), which accelerates CNNs in both the training and testing phases with precise floating-point computations. Figure 2.3a shows the overview of the FloatPIM architecture, consisting of multiple crossbar memory blocks. As an example, Figure 2.3b shows how three adjacent layers (recall the structure of layers and notations shown in Figure 2.1a) are mapped to the FloatPIM memory blocks to perform the feed-forward computation.

Figure 2.3: Overview of FloatPIM. (a) FloatPIM memory blocks. (b) Computation in memory. (c) Row-parallel operation in each block.

Each memory block represents a layer and stores the data used in either testing (i.e., weights) or training (i.e., weights, the output of each neuron before activation, and the derivative of the activation function, g’), as shown in Figure 2.3c. With the stored data, FloatPIM operates in two phases: (i) a computing phase and (ii) a data transfer phase. During the computing phase, all memory blocks work in parallel, where each block processes an individual layer using PIM operations. Then, in the data transfer phase, the memory blocks transfer their outputs to the blocks corresponding to the next layers, i.e., to proceed with either the feed-forward or the back-propagation. The switches shown in Figure 2.3b control the data transfer flows. In Section 2.5, we present how each FloatPIM memory block performs the CNN computations for a layer. The block supports in-memory operations for key CNN computations, including vector-matrix multiplication, convolution, and pooling (Section 2.5.1). We also support activation functions like ReLU and Sigmoid in memory. MIN/MAX pooling operations are

implemented using in-memory search operations. Our proposed design optimizes each of the basic operations to provide high performance. For example, for the convolution, which requires shifting convolution kernels across different parts of an input matrix, we design shifter circuits that allow accessing weight vectors across different rows of the input matrix. The feed-forward step is performed entirely inside memory by executing the basic PIM operations (Section 2.5.2). FloatPIM also performs all the computations of the back-propagation with the same key operations and hardware as the ones used in the feed-forward (Section 2.5.3). In Section 2.6, we describe how the memory blocks compose the entire FloatPIM architecture. FloatPIM further accelerates the feed-forward and back-propagation by fully utilizing the parallelism provided by the PIM architecture, e.g., row/block-parallel PIM operations. We show how these tasks can be parallelized for both the feed-forward and back-propagation across a batch, i.e., multiple inputs at a time, using multiple data copies pre-stored in different memory blocks. Section 2.7 presents in-depth circuit-level details of the PIM-based floating point addition and multiplication.
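As a purely conceptual illustration of the two-phase execution described above (not the FloatPIM hardware model; the layer sizes, the ReLU activation, and the single-input schedule are assumptions), the sketch below maps each layer to a memory block and alternates a computing phase with a data transfer phase for a feed-forward pass.

```python
import numpy as np

rng = np.random.default_rng(2)

class MemoryBlock:
    """One crossbar block: holds a layer's weights and the latest input rows."""
    def __init__(self, in_dim, out_dim):
        self.W = rng.normal(scale=0.1, size=(in_dim, out_dim))
        self.inp = None      # written during the data transfer phase
        self.out = None      # produced during the computing phase

    def compute(self):
        # Computing phase: in-block vector-matrix multiply plus ReLU activation.
        self.out = np.maximum(self.inp @ self.W, 0.0)

    def transfer_to(self, next_block):
        # Data transfer phase: row-parallel copy of the outputs to the next block.
        next_block.inp = self.out

# Assumed toy network: 8 -> 6 -> 4 -> 2, one block per layer.
dims = [8, 6, 4, 2]
blocks = [MemoryBlock(dims[i], dims[i + 1]) for i in range(len(dims) - 1)]

x = rng.normal(size=(1, dims[0]))   # a single input row (assumption)
blocks[0].inp = x
for i, blk in enumerate(blocks):
    blk.compute()                        # in FloatPIM, blocks of a phase run in parallel
    if i + 1 < len(blocks):
        blk.transfer_to(blocks[i + 1])   # switches route data to the next layer's block
print(blocks[-1].out)
```

In the actual architecture, different blocks hold different layers and different inputs of a batch at the same time, so the computing and data transfer phases overlap across the pipeline rather than running sequentially as in this sketch.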

2.5 CNN Computation in FloatPIM Block

In this section, we show how a FloatPIM memory block performs the training/testing task1 of a single CNN layer. Figure 2.4 shows a high-level illustration of the training procedure of a fully-connected layer in FloatPIM. As discussed in Section 2.3, CNN training has two steps: feed-forward and back-propagation. During the feed-forward step, FloatPIM processes the input data in a pipelined manner. For each data point, FloatPIM stores two intermediate neuron values: (i) the output of each neuron after the activation function ($Z_i$) and (ii) the gradient of the activation function for the accumulated results ($g'(a_j)$). In the back-propagation step, FloatPIM first measures the loss function in the last output layer and accordingly updates the weights of

¹Please note that FloatPIM supports the testing task, i.e., inference, with only the feed-forward step, where an input data point is processed through the different CNN layers.

each layer using the intermediate values stored during the feed-forward step. As Figures 2.4b and 2.4c show, the error sequentially propagates and updates the weights in the previous layer. In the next subsection, we first describe how the basic FloatPIM operations support testing/training of a single CNN layer in digital PIM.

Figure 2.4: Overview of CNN training: (a) feed-forward, (b) error backward, (c) weight update.

2.5.1 Building Blocks of CNN Training and Inference

CNNs use similar operations for both the fully-connected and convolution layers. In the feed-forward step, the CNN computations are vector-matrix multiplication for the fully-connected layers and the convolution operation for the convolution layers. In the back-propagation, FloatPIM uses the same vector-matrix multiplication to update the weights for the fully-connected layers, while the weights of the convolution layers are updated using the in-memory vector-matrix multiplication and convolution.

Figure 2.5: Vector-matrix multiplication: (a) vector-matrix multiplication, (b) PIM-compatible vector-matrix multiplication.

Vector-Matrix Multiplication: One of the key operations of CNN computation is vector-matrix multiplication. The vector-matrix multiplication is accomplished by multiplications of the stored inputs and weights, and additions to accumulate the results of the multiplications. Figure 2.5a shows an example of the vector-matrix multiplication. As discussed in Section 2.3.2, the in-memory operations on digital data can be performed in a row-parallel way, by applying the NOR-based operations to data located in different columns. Thus, the input-weight multiplication can be processed by the row-parallel PIM operation. In contrast, the subsequent addition cannot be done in a row-parallel way, as its operands are located in different rows. This hinders achieving the maximum parallelism that the digital PIM operations offer. Figure 2.5b shows how our design implements row-parallel operations by locating the data in a PIM-compatible manner. FloatPIM stores multiple copies of the input vector horizontally and

the transposed weight matrix (Wij^T) in memory. FloatPIM first performs the multiplication of the input columns with each corresponding column of the weight matrix. The multiplication result is written in another column of the same memory block. Finally, FloatPIM accumulates the stored multiplication results column-wise with multiple PIM addition operations into another column. FloatPIM enables the multiplication and accumulation to be performed independently of the number of rows. Let us assume that each multiplication and addition takes TMul and TAdd latency, respectively. Then, we require M × TMul and N × TAdd latencies to perform the multiplication and accumulation respectively, where the size of the weight matrix is M by N.
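To make the layout of Figure 2.5b concrete, the following Python sketch is a purely functional model of the PIM-compatible vector-matrix multiplication described above: it reproduces the data placement (copied input rows next to the transposed weights, row-parallel products, column-wise accumulation) in numpy, but does not model the NOR-based circuit operations or their latencies. Function and variable names are ours.

```python
import numpy as np

def pim_compatible_vmm(z, W):
    """Functional model of the data layout in Figure 2.5b.

    z: input vector of length M; W: weight matrix of shape (M, N).
    The input vector is replicated next to the transposed weight matrix,
    so every element-wise product lands in its own row and can be produced
    by a row-parallel operation; the products are then accumulated within
    each row into the result column.
    """
    M, N = W.shape
    copied_inputs = np.tile(z, (N, 1))       # shape (N, M): one input copy per output row
    transposed_w = W.T                       # shape (N, M): weights stored next to the inputs
    products = copied_inputs * transposed_w  # row-parallel multiplications
    a = products.sum(axis=1)                 # accumulate the stored products in each row
    return a                                 # shape (N,): the layer outputs

z = np.array([1.0, 2.0, 3.0])
W = np.random.rand(3, 4)
assert np.allclose(pim_compatible_vmm(z, W), z @ W)
```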

Figure 2.6: Convolution operation: (a) convolution, (b) PIM-compatible convolution, (c) barrel shifter.

Convolution: As shown in Figure 2.6a, the convolution layer consists of many multiplications, where a shared weight kernel shifts and multiplies with an input matrix. A naive way to implement the convolution is to write all the partial convolutions for each window movement by reading and writing the convolution weights repeatedly in memory. However, this method has a high performance overhead in PIM, since non-volatile memories (NVMs) have a slow write operation. FloatPIM addresses this issue by replacing the convolution with lightweight interconnect logic for the multiplication operation. Figure 2.6b illustrates the proposed method, which consists of two parts: (i) It writes all convolution weights in a single row and then copies them into other rows using the row-parallel write operation, which takes just two cycles. This enables the input values to be multiplied with any convolution weights stored in another column. (ii) It exploits a configurable interconnect to virtually model the shift procedure of the convolution kernel. This interconnect is a barrel shifter which connects two parts of the same memory. Figure 2.6c shows the structure of a barrel shifter that provides a 3-bit shift operation as an example. Depending on the BS control signals, the barrel shifter connects different {b1,...,b6} bits

to {b′1,...,b′4}. The number of required shift operations depends on the size of the convolution window. For the example shown in Figure 2.6b, for a 2 × 2 convolution window, the barrel shifter supports a single shift operation using the BS = 0 or BS = 1 control signal. Similarly, for an n × n convolution kernel, the number of shift operations is n − 1. Our FloatPIM supports up to a 7 × 7 convolution kernel, which covers all the tested popular CNN structures. Note that FloatPIM can also support n larger than 7 by rewriting shifted input matrices into other columns.

Row-Parallel Write: Both vector-matrix multiplication and convolution require copying the input or weight vectors into multiple rows. Since writing multiple rows sequentially would degrade performance, FloatPIM supports a row-parallel write operation that writes the same value to all rows in only two cycles. In the first cycle, the block activates all columns containing "1" by connecting the corresponding bitlines to the VSET voltage, while the row driver sets the wordlines of the destination rows to zero. This writes 1s to all the selected memory cells at the same time. In the second cycle, the column driver connects only the bitlines which carry a "0" bit to the zero voltage, while the row driver sets the wordlines to VRESET. This writes the input to all memory rows.

MAX/MIN Pooling: The goal of a MAX (MIN) pooling layer is to find the maximum (minimum) value among the neurons' outputs in the previous layer. To implement pooling in memory, we use a crossbar memory with the capability of searching for the nearest value. Work in [63] exploited different supply voltages to give weight to different bitlines and enable the nearest-search capability. Using this hardware, we implement MAX pooling by searching for the value which has the highest similarity to the largest possible value. Similarly, MIN pooling can be implemented by searching for the memory row which has the closest distance to the minimum possible value. Since the values are floating point, the search happens in two phases. First, we find the values with the highest exponent; then, among the values with the same maximum exponent, we search for the value with the largest mantissa.
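The two-phase floating point search can be mimicked in software by comparing the exponent fields first and breaking ties with the mantissa; the sketch below does this for positive IEEE 754 single-precision values. It is only a functional illustration of the search order — the hardware of [63] performs the nearest search directly on the stored bit patterns — and the helper names are ours.

```python
import struct

def float_fields(x):
    """Unpack a positive IEEE 754 single-precision value into (exponent, mantissa) fields."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    exponent = (bits >> 23) & 0xFF      # 8 exponent bits
    mantissa = bits & 0x7FFFFF          # 23 fraction bits
    return exponent, mantissa

def max_pool(values):
    """Two-phase MAX pooling over positive floats.

    Phase 1: keep the values with the largest exponent field.
    Phase 2: among those, pick the one with the largest mantissa.
    """
    exps = [float_fields(v)[0] for v in values]
    max_exp = max(exps)
    candidates = [v for v, e in zip(values, exps) if e == max_exp]
    return max(candidates, key=lambda v: float_fields(v)[1])

window = [1.25, 3.5, 3.75, 0.5]
assert max_pool(window) == max(window)  # 3.75: same exponent as 3.5, larger mantissa
```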

2.5.2 Feed-Forward Acceleration

There are three major types of CNN layers: fully-connected, convolution, and pooling layers. For each of the three layer types, we exploit a different data allocation mechanism to enable high parallelism and perform the computation tasks with minimal internal data movement. For the fully-connected layer, the main computation is vector-matrix multiplication. CNN weights

(Wij) are stored as a matrix in memory and multiplied with the input vector stored in a different column. This multiplication and addition can happen between the memory columns using the same approach introduced for the PIM-compatible vector-matrix multiplication. The convolution is another commonly used operation in deep neural networks, which is implemented using the PIM-compatible convolution hardware introduced in Section 2.5.1. After the fully-connected and convolution layers, there is an activation function. We perform activation functions with a sequence of in-memory NOR operations. For example, we perform the ReLU function by subtracting all neurons' outputs from the ReLU threshold value (THR). This subtraction can happen in a row-parallel way, where the same THR value is written in another column of the memory block (all rows). Finally, we write the threshold value in a row-parallel way into all memory rows where the subtracted results have positive sign bits. For neurons' outputs with negative sign bits, we can avoid the subtraction and instead write a 0 value to all such rows. We also support non-linear activation functions, e.g., Sigmoid, using PIM-based multiplication and addition based on a Taylor expansion. For example, for Sigmoid, we consider the first three terms of the Taylor expansion (1/2 + ai/4 − ai³/48). The Taylor expansion is implemented in memory as a series of control signals on the pre-activation vector stored in a column of a crossbar memory. First, we exploit in-memory multiplications to calculate the different powers of the pre-activation values, e.g., ai³, in a row-parallel way. Then, we multiply the values with the pre-stored Taylor expansion coefficients, e.g., 1/4 and 1/48, stored in reserved columns of the same memory. Finally, the result of the activation can be calculated using addition and subtraction. Note that our approach parallelizes the activation function for all neurons' outputs of a DNN layer, which are stored in a single column but different rows of a memory block. Moreover, FloatPIM does not use separate hardware modules for any layer but implements them using basic memory operations. Hence, with no changes to memory and minimal modifications to the architecture, FloatPIM can support the fusion of multiple layers.
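As a numerical sanity check of the activation scheme, the snippet below evaluates the three-term Taylor approximation of the Sigmoid, 1/2 + a/4 − a³/48, against the exact function, and models the ReLU simply as max(a − THR, 0) with THR = 0. The in-memory version realizes these same steps with NOR-based multiplications, additions, and row-parallel writes; this sketch ignores that mapping, and the function names are ours.

```python
import numpy as np

def sigmoid_taylor(a):
    # First three terms of the Taylor expansion around 0: 1/2 + a/4 - a^3/48.
    return 0.5 + a / 4.0 - a**3 / 48.0

def relu_thresholded(a, thr=0.0):
    # Simplified row-parallel view: subtract THR, look at the sign, and
    # write back either the shifted value or zero for the negative rows.
    diff = a - thr
    return np.where(diff >= 0, diff, 0.0)

a = np.linspace(-1.5, 1.5, 7)   # pre-activation values stored in one column
print(relu_thresholded(a))
print(np.abs(sigmoid_taylor(a) - 1.0 / (1.0 + np.exp(-a))).max())  # small error near 0
```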

2.5.3 Back-Propagation Acceleration

Figure 2.7 shows the CNN training phases for a fully-connected layer: (i) error backward, where the error propagates through the different CNN layers, and (ii) weight update, which calculates the new CNN weights depending on the propagated error. Fully-Connected Layer: Figure 2.7 shows the overview of the DNN operations to update the error vector (δ). Figure 2.7a shows the layout of the pre-stored values in each memory block in order to perform the back-propagation. Each memory block stores the weights, the outputs of the neurons (Z), and the derivatives of the activation (g′(a)) for each layer.

During the back-propagation, the δ vector is the only input to each memory block. The error vector propagates backward through the network. The error backward starts with multiplying

the weights of the j-th layer (Wjk) with the δk error vector. To enhance the performance of this multiplication, we copy the same δk vector into the j rows of the memory (as shown in Figure 2.7b).

The multiplication of the transposed weights and copied δk matrix is performed in a row parallel way.

Finally, FloatPIM accumulates all stored multiplication results (∑k δk Wjk). One way is to use k × bw-bit columns that store all the results of the multiplications, where bw is the value bit-width. Instead, we design an in-memory multiply-accumulate (MAC) operation which reuses the memory columns for the accumulation: FloatPIM consecutively performs multiplication and addition operations. This reduces the number of required columns to bw bits and results in significant improvements in area efficiency per computation. Assuming that multiplication and addition take TMul and TAdd respectively, the multiplication of the weight and δk matrix

Figure 2.7: Back-propagation of FloatPIM: (a) layout of values in FloatPIM, (b) error backward, (c) weight update.

is computed in (TMul + TAdd) × k. Since FloatPIM performs the computation in a row-parallel way, the performance of the computation is independent of j. The result of ∑k δk Wjk is a vector with j elements (Figure 2.7b). This vector is multiplied element-wise by the g′(aj) vector in a row-parallel way. Note that during the feed-forward, g′(aj) is written in a suitable memory location which enables column-wise multiplication with no internal data movement.

The result of the multiplication is a δ j error vector, and it is sent to the next memory block to update the weights (Figure 2.7c). The error vector is used for both updating the weights (Wi j)

and computing the backward error vector (δi) in layer i. Next, FloatPIM transfers the δj vector to the next memory block, which is responsible for updating the Wij weights. The δj vector is copied into i memory rows next to the Wij^T matrix using the copy operation.

For the weight update, the δj matrix is multiplied with the ηZi vector, where ηZi is calculated and stored during the feed-forward step. This takes j × TMul. As Figure 2.7c shows, the result of the multiplication is a matrix with j × i elements. Finally, FloatPIM updates the weights by

subtracting the ηδjZi^T matrix from Wij. This subtraction happens column by column, and the result is rewritten in the same column as the new weight matrix. This reduces the number of required memory columns from k × bw to bw columns. Convolution Layer: There are a few differences between the fully-connected and convolution layers in the back-propagation step. Unlike in the fully-connected layer, the error term is defined as a matrix, i.e., the error backward computes the error matrix in a layer j (δj) depending on the error matrix in a layer k (δk). The update of the error matrix happens by computing the convolution of the δj and weight matrix, where the size of the weights is usually much smaller than δ (m,n << r,s). This operation can be implemented in memory using the same hardware we used to accelerate the convolution in the feed-forward. Next, the generated matrix from the convolution is multiplied with the derivatives of the activation function (g′), which are already stored in memory during the feed-forward step. It is computed with the same PIM functionalities used for the fully-connected layers. Finally, the generated matrix is convolved with Zi, which is the matrix corresponding to the output of the previous CNN layer. When a pooling layer is used, Zi is the output of that layer.
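Returning to the fully-connected case, the math of the error backward and weight update above can be restated in a few lines of numpy. The sketch ignores the in-memory layout (row copies, transposes, switches), and the shape conventions are our assumption: Wjk has shape (j, k) and Wij has shape (i, j).

```python
import numpy as np

def backprop_fc(delta_k, W_jk, g_prime_aj, W_ij, z_i, eta):
    """Error backward and weight update for one fully-connected layer.

    delta_k:    error vector of layer k, shape (k,)
    W_jk:       weights between layers j and k, shape (j, k)
    g_prime_aj: derivative of the activation at layer j, shape (j,)
    W_ij:       weights between layers i and j, shape (i, j)
    z_i:        stored feed-forward output of layer i, shape (i,)
    """
    # Error backward: accumulate delta_k through W_jk, then scale by g'(a_j).
    delta_j = (W_jk @ delta_k) * g_prime_aj
    # Weight update: subtract the outer product of the stored eta*z_i with delta_j.
    W_ij_new = W_ij - eta * np.outer(z_i, delta_j)
    return delta_j, W_ij_new

i, j, k = 4, 3, 2
delta_j, W_new = backprop_fc(np.ones(k), np.random.rand(j, k),
                             np.random.rand(j), np.random.rand(i, j),
                             np.random.rand(i), eta=0.01)
print(delta_j.shape, W_new.shape)   # (3,) (4, 3)
```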

2.6 FloatPIM Architecture

Figure 2.8 shows the overview of the proposed FloatPIM architecture processing multiple CNN layers. FloatPIM consists of 32 tiles, where each tile has 256 crossbar memory blocks which

have row and column drivers in each memory block (•A). In both feed-forward and back-propagation, FloatPIM needs to send data to the next memory block in order to continue the computation. FloatPIM exploits switches which enable parallelized data transfer between the neighboring blocks (•B). The controller block calculates the loss function and controls the row driver, the column driver, and the switches used for fast data transfer (•C).

Figure 2.8: FloatPIM memory architecture.

2.6.1 Block Size Scalability

Due to the existing challenges in the crossbar memory [75, 76], each memory block is limited in size. However, to enable each block to process a single DNN layer, each block needs a much larger bitline to store the input/output and weight matrix. In addition, since our design extends the NOR operation to addition/multiplication, it requires reserved bits in each row for storing the intermediate results of N-bit addition and multiplication. To support the convolution kernel, we exploit the barrel shifter. Thanks to the scalability of FloatPIM, i.e., working on digital data, FloatPIM performs the computation on a few cascaded memory sub-blocks (•D). All blocks are controlled by the same column driver, but different row drivers that enable fine-tuned block activation. FloatPIM transfers data between two neighboring blocks in a row-parallel way by reading the output of a block and writing it into the next memory. Regardless of the number of values/rows, the execution time of this parallel data transfer only depends on the bit-width of the values (N + 1 cycles for N-bit data transfer). Cascading the blocks comes at the expense of increasing the cost of internal data movement. Our evaluation shows that cascading a block into 32 sub-blocks increases FloatPIM execution time only by 3.8% (less than 3.4% energy overhead) as compared to assuming an ideal 1k × 32K block size.

2.6.2 Inter-layer Communication

During the data transfer phase, the results computed in one memory block are written to another memory block as an input for the next computation phase. Let us assume two fully-connected CNN layers are mapped to two neighboring blocks. The results of the PIM operation of the first memory block need to be written as an input in the second block. Assuming a CNN layer with 1K neurons, we need to write 1K values to process the next computation. To speed up the write, we design a switch which enables fast data transfer between the neighboring memory blocks. The data transfer between the blocks happens with rotation and write operations. For example, in the feed-forward step, the generated vertical output vector needs to be rotated and copied into several rows of the next memory block (explained in Section 2.5.1).

Similarly, in the back-propagation, the generated δ vector of the backward error needs to be rotated and written into the next memory block to process the weight update (explained in Section 2.5.3). The circuit in Figure 2.8(•E) shows how FloatPIM supports the rotation and write operations between the blocks. FloatPIM places the memory blocks such that the tails of neighboring blocks face each other. Then, it exploits switches to connect the adjacent memory blocks. Each block is connected to its two adjacent neighbors. During the computation phase, these switches are in the off

mode. So, each memory block can individually perform its computation. Then, during the data transfer phase, FloatPIM connects the blocks together in order to move data in a row-parallel way.

For example, connecting the S1 switches writes each column of Block 1 into a row of Block 2.

This data transfer can happen in a bit-serial and row-parallel way. Similarly, activating the S2 control signal connects Block 2 to Block 3. Figure 2.8•F shows the functionality of FloatPIM memory blocks working in a pipeline structure. Each memory block models the computation of either a fully-connected or a convolution layer. In the first cycle (T0), the switches are disconnected and all the memory blocks are in the computing mode and work in parallel. Then, FloatPIM works in the data transfer mode for two cycles. In the first transferring cycle (T1), all odd blocks send their output values to their neighboring blocks with even indices (S1 = 1, S2 = 0). In the second transferring cycle

(T2), the even blocks send their generated output values to their neighboring odd blocks

(S1 = 0, S2 = 1). This makes it possible to complete all the required data transfers within only two consecutive steps. For example, the switches can transfer a vector of bw-bit values in 2 × bw cycles, regardless of the number of rows. Each FloatPIM tile can process data points in the pipeline structure with a cycle width of d = T0 + T1 + T2, where T0 and T1 + T2 are the computing and data transfer cycles, respectively. It should be noted that FloatPIM takes care of non-consecutive but periodic connections between neural network layers, such as in ResNet [77]. For example, in ResNet, the output of each layer can be used as an input in the next two consecutive layers. For these cases, FloatPIM requires a different pipeline stage, where in the data transfer mode the output of a particular layer is sequentially sent to the second and then the third block. For the cases with non-periodic connections, the controller reads the values of the block in a row-parallel way and writes them into another memory block. However, since these cases are not common, the proposed inter-block communication can still significantly improve efficiency.

2.6.3 FloatPIM Parallelism

The proposed design parallelizes computations across memory blocks and across data located in different memory rows. In this section, we describe the other parallelization strategies exploited in our implementation.

Parallelization of Feed-Forward: CNN training happens in batch-size windows (b). The batch size indicates the number of training data points which are processed in the feed-forward before the back-propagation happens. In the feed-forward step, there is no dependency between the computations of different inputs in a batch; thus the feed-forward computation can be parallelized over all data points in a batch as well. To enable feed-forward parallelism, FloatPIM replicates the CNN weights in different memory blocks, where each memory block can process the information of a single data point in a batch. The number of tiles determines the feed-forward parallelism. FloatPIM achieves the highest performance if it can parallelize the computation of all data points in a batch; otherwise, it reuses the memory blocks to perform the computation of multiple data points. In that case, each memory block needs to store the weights corresponding to multiple layers in order to avoid the costly write operation during feed-forward. In Section 2.8.6, we explore the impact of the number of FloatPIM tiles.

Parallelization of Back-Propagation: FloatPIM keeps all intermediate neuron values (Z and g′) in memory and updates the weights accordingly. Figure 2.9 shows the functionality of FloatPIM memory blocks updating the weights of a CNN layer when there are b data points in a batch. In the back-propagation, FloatPIM cannot parallelize the computation across different layers, but the computation of different data points in a batch can be parallelized in each layer. FloatPIM may store the intermediate values of all data points in a batch in a single memory block and process them sequentially (Figure 2.9a). This results in a lower power and memory requirement. The efficiency depends on how many data points in a batch are processed by a block. When P is less than b, we call this low-power configuration FloatPIM-LP. To further improve the performance, FloatPIM can parallelize the computation of different data points in a batch

by processing them in separate memory blocks (Figure 2.9b). Each memory block stores the feed-forward information of a specific data point in the batch, while all memory blocks need to store the same weight matrix. The error backward for all blocks is performed in parallel. To update the weights, FloatPIM collects the η·δ·Z vectors from all memory blocks that process a data point in a batch. The combined vectors are subtracted from the stored weight matrix, and the updated weight matrix is written back into all memory blocks in parallel. We call this fully parallelized strategy FloatPIM-HP.

Figure 2.9: FloatPIM training parallelism in a batch: (a) serialized (P=1), (b) fully parallelized (P=b).

2.7 In-Memory Floating Point Computation

This work represents the very first implementation of floating point addition and multiplication in the crossbar memory. A floating point number consists of a binary string with three different parts: a sign bit, an exponent part, and a fractional value. For example, the IEEE 754 32-bit floating point notation consists of a sign bit, eight exponent bits, and 23 fractional

bits. The first bit in the floating point notation (A32) represents the sign bit, where ‘0’ represents a positive number. The next eight bits represent the exponent of the binary numbers (A31,...,A24), ranging from -126 to 127. The following 23 bits (A23,...,A1) represent the fractional part, also known as mantissa, which has a value between 1 and 2.
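As a quick illustration of this field layout, the snippet below splits a single-precision value into the sign, exponent, and mantissa described above (the bias of 127 and the implied leading 1 follow the IEEE 754 standard; the helper name is ours).

```python
import struct

def decompose_float32(x):
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    sign = bits >> 31                       # A32: 0 for positive numbers
    exponent = ((bits >> 23) & 0xFF) - 127  # A31..A24, stored with a bias of 127
    fraction = bits & 0x7FFFFF              # A23..A1, the stored mantissa bits
    significand = 1.0 + fraction / 2**23    # implied value between 1 and 2
    return sign, exponent, significand

print(decompose_float32(10.0))   # (0, 3, 1.25): 10.0 = +1.25 * 2^3
```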

2.7.1 FloatPIM Multiplication

Floating point multiplication involves: (i) XORing the sign bits, (ii) addition of exponent bits, and (iii) fixed-point multiplication of mantissa bits. In-memory floating point multiplication requires storing the two operands and writing the result in the same row of another column. In FloatPIM, we XOR the sign bits and add the exponent bits using multiple NOR operations [72, 73].

While XOR takes 6 cycles, addition takes 13Ne cycles, where Ne is the number of exponent bits. The mantissa bits are multiplied in the way presented in [74]. While these operations are sequential, they can be parallelized over all rows in the memory. The latency and energy of FloatPIM multiplication can be formulated as:

TMul = (12Ne + 6.5Nm² − 7.5Nm − 2) TNOR

EMul = (12Ne + 6.5Nm² − 7.5Nm − 2) ENOR

2.7.2 FloatPIM Addition

Floating point addition involves: (i) left-shifting the decimal point (right-shifting the mantissa) to make the exponents the same, (ii) addition of the shifted mantissas, and (iii) normalizing the result. Assume the two floating point numbers to be added are A and B, where As (Bs), Ae (Be), and Am (Bm) represent the sign, exponent, and mantissa bits of A (B). We calculate the difference Ae − Be using

an in-memory fixed-point subtraction and store the result as exp′. This subtraction is implemented using multiple NOR operations as discussed earlier. Based on exp′, either Am is shifted by −exp′ (if exp′ < 0) or Bm is shifted by exp′ (if exp′ > 0). We may accomplish this by reading out exp′ and then shifting the mantissa bits accordingly. However, this does not parallelize the operations over multiple rows, resulting in high latencies. We propose a novel alternative approach that handles the exp′-based shift by using the exact search operation. Figure 2.10 shows an illustration of the proposed procedure.

Figure 2.10: In-memory implementation of floating point addition.

We create a new exponent, te, and two mantissas, tm1 and tm2, to be added together. Here, te is the greater of Ae and Be. tm1 is equal to the mantissa of the number with the greater exponent, while tm2 is equal to the shifted mantissa. To identify the greater exponent, we search for ‘0’ in the memory column containing the sign bit of exp′. For all the matched (unmatched) rows, te and tm1 are equal to Ae (Be) and Am (Bm) respectively. The copy operations for the old exponents and mantissas, i.e., te and tm1, are performed by column-wise NOT operations, eliminating any read/write operation. Next, tm2 in all rows is first initialized to ‘0’s. We then search for each number in the range ±Nm in the columns containing exp′. For each search query qexp where −Nm ≤ qexp < 0, tm2 is set to Am right-shifted by |qexp|. On the other hand, when 0 ≤ qexp ≤ Nm, tm2 is set to Bm right-shifted by |qexp|. The (Nm − |qexp|)-th bit of tm2 is set to ‘1’ to incorporate the hidden digit of the floating point representation. The shifting operation can in turn be carried out by simply copying the data with a NOT operation at the target location. Finally, tm1 and tm2 are added using fixed-point in-memory addition and stored as t′m. To normalize t′m, used as the final mantissa: for all exp′ ≠ 0, if the addition of tm1 and tm2 results in a carry, t′m is right-shifted by one bit, while te is incremented by 1. If exp′ = 0, t′m is right-shifted by one bit and te is incremented by 1; additionally, if a carry is generated in this case, the MSB of the shifted t′m is set to ‘1’. The resulting t′m and te represent the output mantissa and exponent bits, respectively. The latency and energy of FloatPIM addition can be formulated as:

TAdd = (3 + 16Ne + 19Nm + Nm²) TNOR + (2Nm + 1) Tsearch

EAdd = 2(Nm + 1) Esearch + 12(Ne + Nm) ENOR + Nm Ereset + [2(Ne + Nm) + Nm²/2 + Nm/2 + 1] (Eset + Ereset)

where the values of energy and execution time of the basic operations are:

Eset       Ereset     ENOR       Esearch    TNOR      Tsearch
23.8 fJ    0.32 fJ    0.29 fJ    5.34 pJ    1.1 ns    1.5 ns
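Plugging these device parameters into the latency models of Sections 2.7.1 and 2.7.2 gives a feel for the per-operation cost (amortized over all rows processed in parallel). The snippet below simply evaluates the analytical formulas for bfloat16 (Ne = 8, Nm = 7) and 32-bit floating point (Ne = 8, Nm = 23); it is an evaluation of the formulas, not a measured result.

```python
T_NOR, T_SEARCH = 1.1, 1.5   # ns, from the table above

def t_mul(ne, nm):
    # TMul = (12*Ne + 6.5*Nm^2 - 7.5*Nm - 2) * T_NOR
    return (12*ne + 6.5*nm**2 - 7.5*nm - 2) * T_NOR

def t_add(ne, nm):
    # TAdd = (3 + 16*Ne + 19*Nm + Nm^2) * T_NOR + (2*Nm + 1) * T_search
    return (3 + 16*ne + 19*nm + nm**2) * T_NOR + (2*nm + 1) * T_SEARCH

for name, ne, nm in [("bfloat16", 8, 7), ("float32", 8, 23)]:
    print(f"{name}: TMul = {t_mul(ne, nm):.1f} ns, TAdd = {t_add(ne, nm):.1f} ns")
```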

2.8 Evaluation

2.8.1 Experimental Setup

We have designed and used a cycle-accurate simulator based on Tensorflow [78, 79] which emulates the memory functionality during the DNN training and testing phases. For the accelerator design, we use HSPICE for circuit-level simulations to measure the energy consumption and performance of all the FloatPIM floating-point/fixed-point operations in 28nm technology. The energy consumption and performance are also cross-validated using NVSim [80].

Table 2.1: VTEAM Model Parameters for Memristor

kon = −216.2 m/sec     VT,ON = −1.5 V     xoff = 3 nm
koff = 0.091 m/sec     VT,OFF = 0.3 V     RON = 10 kΩ
αon, αoff = 4          xon = 0            ROFF = 10 MΩ

We used System Verilog and Synopsys Design Compiler [81] to implement and synthesize the FloatPIM controller. For parasitics, we used the same simulation setup considered by the work in [72]. The robustness of all proposed circuits, i.e., the interconnect, has been verified by considering 10% process variations on the size and threshold voltage of transistors using 5000 Monte Carlo simulations. FloatPIM works with any bipolar resistive technology, which is the most commonly used in existing NVMs. Here, we adopt a memristor device with the VTEAM model [71]. The model parameters of the memristor, as listed in Table 2.1, are chosen to produce a switching delay of 1ns with voltage pulses of 1V and 2V for the RESET and SET operations, in order to fit practical devices [65]. Table 2.2 summarizes the device characteristics for each FloatPIM component. FloatPIM consists of 32 tiles, where each has 256 memory blocks to cover all the tested CNN structures. Each tile takes 0.96mm² area and consumes 7.64mW power. In total, FloatPIM takes 30.64mm² area and consumes 62.60W power on average.

2.8.2 Workload

We perform our experiments on ImageNet [38], which is a large dataset with about 1.2M training samples and 50K validation samples. The objective is to classify each image into one of 1000 categories. We tested four popular large-scale networks, i.e., AlexNet [38], VGGNet [31], GoogleNet [82], and SqueezeNet [83], to classify the ImageNet dataset, as summarized in Table 2.3. We compare the proposed FloatPIM with GPU-based DNN implementations (Conv: convolution, FC: fully-connected). The experiments are performed using Tensorflow [79] running on an NVIDIA GTX 1080 GPU. The performance and energy of the GPU are measured by the nvidia-smi tool.

Table 2.2: FloatPIM Parameters

Component      Params                     Spec        Area          Power
Crossbar       Array size                 1Mb         3449.6 µm²    6.14 mW
Shifter        shift                      6 levels    19.26 µm²     0.69 mW
Switches       number                     1K-bits     32.69 µm²     0.42 mW
Max Pool       number                     1           80 µm²        0.38 mW
Controller     number                     1           401.4 µm²     0.65 mW
Memory Block   size                       1Mb         3,468.8 µm²   6.83 mW
Tile           number: 256 Blocks, size   256Mb       0.96 mm²      7.64 mW
Total          number: 32 Tiles, size     8Gb         30.64 mm²     62.60 W

Table 2.3: Workloads

Model            Size     Conv Layers   FC Layers   Classification Error
AlexNet [38]     224MB    5             3           27.4%
GoogleNet [82]   54MB     57            1           15.6%
VGGNet [31]      554MB    13            3           17.5%
SqueezeNet [83]  6MB      26            1           25.9%

2.8.3 FloatPIM & Data Representation

Table 2.4 reports the classification error rates of the different networks when they are trained with floating point and fixed point representations. For floating point precision, we used 32-bit floating point (Float-32) and bfloat16 (bFloat) [77], a commonly used representation in many CNN accelerators. For fixed-point precision, we used 32-bit fixed point (Fixed-32) and 16-bit fixed point (Fixed-16) representations for FloatPIM training. For all networks, we perform the testing using Fixed-32 precision. To achieve maximum classification accuracy, it is essential to train CNN models using a floating point representation. For example, using Fixed-16 and Fixed-32 for training, VGGNet provides 5.2% and 2.6% lower classification accuracy as compared to the same network trained based on bFloat. In addition, we observe that for all applications, bFloat can provide the same accuracy as Float-32, while being computationally much faster. This is because FloatPIM works based on the bitwise NOR operation, thus it can simply ignore processing

the least significant bits of the mantissas in the floating point representation in order to accelerate the computation.

Table 2.4: Error rate comparison and PIM support.

              Float-32   bFloat-16   Fixed-32   Fixed-16
AlexNet       27.4%      27.4%       29.6%      31.3%
GoogleNet     15.6%      15.6%       18.5%      21.4%
VGGNet        17.5%      17.7%       21.4%      23.1%
SqueezeNet    25.9%      26.1%       29.6%      32.1%

PIM design support: ISAAC [37] and PipeLayer [14] support only the fixed-point representations, while FloatPIM supports all four (Float-32, bFloat-16, Fixed-32, and Fixed-16).

Table 2.4 also lists the supported computation precision of two recent PIM-based CNN accelerators [37, 14]. All existing PIM architectures can support CNN acceleration only using fixed-point values, which results in up to 5.1% lower classification accuracy than the floating point precision supported by FloatPIM. Figure 2.11 shows the speedup and energy saving of FloatPIM, on average for the four CNN models, using the fixed point and floating point representations for CNN training and testing. All results are normalized to Float-32. Our evaluation shows that FloatPIM using bFloat can achieve 2.9× speedup and 2.5× energy savings as compared to FloatPIM using Float-32, while providing similar classification accuracy. In addition, FloatPIM using the bFloat model can provide higher efficiency than Fixed-32. For example, FloatPIM using bFloat can achieve 1.5× speedup and 1.42× energy efficiency as compared to Fixed-32.

2.8.4 FloatPIM Training

Figure 2.12 compares the performance and energy efficiency of FloatPIM with the GPU-based implementation and PipeLayer [14], which is a state-of-the-art accelerator for CNN training built on ISAAC [37] hardware. For PipeLayer, we used a read/write latency of 29.31ns/50.88ns and energy of 1.08pJ/3.91nJ per spike, as reported in the reference paper [14].

Figure 2.11: FloatPIM energy saving and speedup using floating point and fixed point representations.

Figure 2.12: FloatPIM efficiency during training.

In addition, we used λ = 4, which provides reasonable efficiency. During training, a CNN requires a significantly large memory size to store the feed-forward information of the different data points in a batch. For large networks, this information cannot fit in the GPU memory, which results in slow training. Our evaluation shows that FloatPIM can achieve on average 303.2× speedup and 48.6× energy efficiency in training as compared to the GPU-based approach. The higher efficiency of FloatPIM is more obvious on the CNNs with a larger number of convolution layers. Figure 2.12 also compares FloatPIM efficiency over PipeLayer when it

enables and disables the in-parallel data transfer between the memory blocks. Our evaluation shows that FloatPIM without parallelized data transfer provides 1.6× lower speedup, but 3.5× higher energy efficiency, as compared to PipeLayer. However, exploiting the switches significantly accelerates the FloatPIM computation by removing the internal data movement between the neighboring blocks. Our evaluation shows that FloatPIM with in-parallel data transfer enabled can achieve on average 4.3× speedup and 15.8× energy efficiency as compared to PipeLayer. The higher energy efficiency of FloatPIM comes from (i) its digital-based operation, which avoids paying the extra cost of transferring data between the digital and analog/spike domains, and (ii) the higher density of FloatPIM, which enables significantly better parallelism. The PipeLayer computing precision is limited to fixed point operations, while FloatPIM provides the floating point precision which is essential for highly accurate CNN training.

2.8.5 FloatPIM Testing

Figure 2.13 compares the performance and energy consumption of FloatPIM with the NVIDIA GPU and ISAAC [37], which is the state-of-the-art PIM-based DNN accelerator. ISAAC works at 1.2GHz and uses 8-bit ADCs, 1-bit DACs, and a 128×128 array size, where each memristor cell stores 2 bits. We used the same parameters reported in the paper for the implementation [37]. We used FloatPIM (32-tile configuration) with and without in-parallel data transfer between the memory blocks. All execution time and energy results are normalized to the GPU results. Our evaluation shows that both PIM-based architectures, i.e., ISAAC and FloatPIM, have significantly higher efficiency than the GPU, since they address the data movement issue which is the main computation bottleneck of conventional cores. The results show that FloatPIM using the bFloat implementation can achieve on average 6.3× and 21.6× (324.8× and 297.9×) speedup and energy efficiency improvement as compared to ISAAC (the GPU-based approach). Our evaluation shows that FloatPIM with no in-parallel data transfer (no switches) can still provide 1.7× speedup and 3.9× energy efficiency as compared to ISAAC.

Figure 2.13: FloatPIM efficiency during testing.

FloatPIM provides the following advantages over ISAAC. (i) It eliminates the cost of internal data movement between memory blocks, which is a major bottleneck of most PIM architectures. (ii) It removes the necessity of using costly ADC and DAC blocks, which take a major portion of the ISAAC area and power. In addition, these mixed-signal blocks do not scale as fast as the CMOS technology does. (iii) FloatPIM is a fully digital and scalable architecture which can work as accurately as the original floating point representation, while the precision of analog-based designs is limited to the fixed point representation.

2.8.6 Impacts of Parallelism

Feed-Forward: In the feed-forward step, we define the parallelism as the number of data points that can be processed in parallel. As discussed in Section 2.6.3, to improve the feed-forward performance, we can exploit different FloatPIM tiles to process different training data points in parallel. Figure 2.14a shows the impact of the number of tiles on FloatPIM performance speedup. We observed that increasing the number of tiles improves the performance of feed-forward. For example, FloatPIM using 32-tiles can achieve 1.83× higher performance as compared to FloatPIM with 16-tiles.

Back-Propagation: Unlike the feed-forward, the back-propagation has dependencies between the CNN layers. This prevents parallelizing the computation across different layers. However, in each layer, FloatPIM can parallelize the computation of different data points in a batch. In the low-power design (FloatPIM-LP), a single block of memory processes a small set of data points in a batch (P = b/8). In contrast, the high-performance mode (FloatPIM-HP) processes all data points in a batch in parallel (P = b). For the evaluation in this section, we consider a batch size of b = 128 for all the networks. Figures 2.14b and 2.14c show the speedup and normalized energy consumption of FloatPIM for different levels of parallelism. Our evaluation shows that increasing the parallelism from P = b/8 to P = b improves the FloatPIM performance on average by 78.3×. This parallelism comes at the cost of lower energy efficiency and a larger effective memory size. The lower energy efficiency is due to the cost of error vector aggregation from different memory blocks in order to update the weights. FloatPIM-HP provides on average 15.4% lower energy efficiency than FloatPIM-LP. In addition, FloatPIM-HP requires replicating the weights of a CNN layer in all blocks corresponding to different data points in a batch. Figure 2.14d shows the normalized energy-delay product (EDP) and memory size of FloatPIM using different back-propagation parallelism levels. The results are normalized to FloatPIM-LP with the serialized process. Our evaluation shows that FloatPIM-HP can provide 8.2× higher EDP improvement while requiring 3.9× larger memory as compared to FloatPIM-LP.

2.8.7 Computation/Power Efficiency

Unlike other PIM-based accelerators, FloatPIM makes very small changes to the existing crossbar memory. FloatPIM in the 32-tile configuration takes 30.64mm² area. Our evaluation shows that in FloatPIM 95.1% of the area is occupied by the crossbar memory. The extra interconnects and switches added to enable fast convolution and inter-block connection only take 0.15% and 0.24% of the total FloatPIM area (Figure 2.15a). In addition, FloatPIM does not require fine-tuned control of the row/column drivers. To perform a column-wise NOR operation, we only need to select 3 bitlines at a time. Similarly, the row driver can be activated on the entire set of memory rows (for computation) or a single row (for read/write operations). Our evaluation shows that multi-row activation results in less than 0.01% area overhead. Similarly, the controller takes about 3.0% of the total chip area.

Figure 2.14: The impact of parallelism on efficiency: (a) feed-forward parallelism, (b) batch parallelism speedup, (c) batch parallelism energy, (d) batch parallelism trade-off.

Figure 2.15b compares the computation efficiency (the number of 16-bit operations performed per second per mm²) and power efficiency (the number of 16-bit operations performed per watt) of FloatPIM with ISAAC [37] and PipeLayer [14]. Since FloatPIM supports floating point

40 0.8% 4.9% GOPS/s/mm2 100 3.0% 0 101 102 103 104 ISAAC PipeLayer 35.1% FloatPIM-HP 56.2%

FloatPIM-LP

% 1 . GOPS/s/W 95 0 200 400 600 800 1000

ISAAC Crossbar Array Crossbar Area Breakdown Area PipeLayer FloatPIM-HP FloatPIM-LP

(a) Area Breakdown (b) Computation and Power Efficiency Figure 2.15: (a) FloatPIM area breakdown, (b) efficiency comparisons. operations, we report the results as the number of floating point operations (FLOPS), while for other PIM designs we report it as the number of operations (OPS). Our result shows that FloatPIM can achieve 2,392.4 GFLOPS/s/mm2 and 302.3 GFLOPS/s/mm2 computation efficiency in high performance and low power modes respectively. The higher efficiency of FloatPIM-HP as compared to ISAAC (479.0 GOPS/s/mm2) and PipeLayer (1,485 GOPS/s/mm2) comes from its higher density which enables more computation to happen in the same memory area. For example, ISAAC uses ADC and DAC blocks which take a large portion of the area. In addition, PipeLayer still requires to generate spike which results in lower efficiency. In the low power mode, FloatPIM utilizes memory blocks with a large bitline size (in order to process all data points in a batch). This increases the area while the amount of computations stays the same, regardless of the bitline size. In terms of power, FloatPIM can provide much higher efficiency than both ISAAC and PipeLayer. FloatPIM removes the necessity of the costly internal data movement be- tween the FloatPIM blocks by using the same memory block for both storage and computing.

Our evaluation shows that FloatPIM in the high-performance and low-power modes can achieve 818.4 GFLOPS/s/W and 695.1 GFLOPS/s/W power efficiency, which are higher than both the ISAAC (380.7 GOPS/s/W) and PipeLayer (142.9 GOPS/s/W) designs.

2.8.8 Endurance Management

FloatPIM operations involve switching of memristor devices. This may affect the memory lifetime, given the endurance limits of commercially available ReRAM devices. We implement an endurance management technique to increase the lifetime of our design. As discussed before, FloatPIM reserves some memory columns to store the intermediate states while processing. These columns are the most active and experience the worst endurance degradation. To increase the lifetime of the memory, we change the columns allocated for processing over time. This distributes the degradation across the block instead of concentrating it on a few columns, effectively reducing the worst-case degradation per cell and increasing the lifetime of the device. For example, for memory blocks in FloatPIM with 1024 columns, 93 of which are reserved for processing (in the case of bfloat16), this management increases the lifetime of the device by ~11×. We also perform a sensitivity study of the lifetime of FloatPIM in terms of the number of classification tasks that can be performed. We observe that for a memory with an endurance of 10^9 (10^15) writes, FloatPIM can perform 3.1 × 10^8 (3.1 × 10^14) classification tasks.
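The ~11× figure follows directly from spreading the write traffic of the reserved columns over the whole block; using the block geometry quoted above:

```python
total_columns, reserved_columns = 1024, 93   # bfloat16 configuration quoted above
print(total_columns / reserved_columns)      # ~11.0: wear spread over all columns
```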

2.9 Conclusion

In this chapter, we proposed FloatPIM, the first PIM-based DNN training architecture that exploits the analog properties of the memory without explicitly converting data into the analog domain. FloatPIM is a flexible PIM-based accelerator that works with floating-point as well as fixed-point precision. FloatPIM addresses the internal data movement issue of the PIM architecture by enabling in-parallel data transfer between the neighboring blocks. We have

evaluated the efficiency of FloatPIM on a wide range of practical networks. Our evaluation shows that FloatPIM training is 303.2× faster and 48.6× more energy efficient as compared to the state-of-the-art GPU. In the next chapter, we explain how to design a new class of learning algorithms for robust and efficient learning on IoT systems. This chapter contains material from "FloatPIM: In-Memory Acceleration of Deep Neural Network Training with High Precision", by Mohsen Imani, Saransh Gupta, Yeseong Kim, and Tajana S. Rosing, which appears in IEEE International Symposium on Computer Architecture, July 2019 [1]. The dissertation author was the primary investigator and author of this paper.

Chapter 3

Hyperdimensional Computing for Efficient and Robust Learning

3.1 Introduction

In the previous chapter, we described the design of a DNN accelerator on an emerging hardware technology which utilizes non-volatile memory [37, 14]. However, the emerging memory devices have various reliability issues such as endurance, durability, and variability [19, 20, 21]. This, coupled with the high computational complexity of learning algorithms, results in many writes to memory, causing reliability issues in the accelerators [1]. Embedded devices are resource constrained. Instead of accelerating the existing algorithms on the embedded devices, we need to think about how to design algorithms that mimic the efficiency and robustness of the human brain. To achieve real-time performance with high energy efficiency, we need to rethink not only how we accelerate machine learning algorithms in hardware, but also how we design the algorithms themselves, using strategies that more closely model the ultimate efficient learning machine: the human brain. Hyperdimensional (HD) computing [22] is a strategy developed by computational neuroscientists as a model of human short-term memory [84]. HD computing is motivated by the understanding that the human brain operates on high-dimensional representations of data, originating from the large size of brain circuits [85]. It models the human memory using points of a high-dimensional space, called hypervectors. The hyperspace typically refers to tens of thousands of dimensions. HD computing mimics several important functionalities of the human memory model with vector operations which are computationally tractable and mathematically rigorous. HD computing is well suited to address learning tasks for IoT systems as: (i) HD models are computationally efficient (highly parallel at heart) to train and amenable to hardware-level optimization, (ii) HD models offer an intuitive and human-interpretable model [86], (iii) it offers a complete computational paradigm that can be applied to cognitive as well as learning problems [86, 87, 88, 89, 5, 90], (iv) it provides strong robustness to noise – a key strength for IoT systems, and (v) HD can naturally enable secure and lightweight learning. These features

make HD computing a promising solution for today's embedded devices with limited storage, battery, and resources, as well as for future computing systems in deep nano-scaled technology, in which devices will have high noise and variability [4, 91, 92]. In this chapter, we propose an algorithm-hardware solution for efficient classification in high-dimensional space. Our solution includes designing a new HD classification algorithm as well as a novel architecture to accelerate it. The main contributions of this chapter are as follows:

• We exploit HD mathematics to design an encoding module that maps different data types into high-dimensional space. We also propose a novel HD algorithm to perform light-weight classification over the encoded data.

• We propose two algorithm-hardware optimizations that revisit the learning and inference phases of HD computing for efficient hardware implementation. QuantHD is a framework for quantizing the HD model for efficient inference. SearcHD performs fully binary learning in HD computing supporting single-pass training.

• We design a novel architecture to accelerate HD classification. Our architecture exploits the robustness of HD computing to design an analog in-memory associative search [4] which checks the similarity of hypervectors in tens of nanoseconds. As compared to an optimized digital implementation, our analog architecture works with 1347× higher energy efficiency, while providing the same classification accuracy.

3.2 Hyperdimensional Processing System

The main difference between HD computing and conventional computing is the primary data type. Instead of using numbers, HD computing uses binary hypervectors whose patterns represent objects.

Hypervector generation: HD applications use hypervectors with a dimensionality, D, often of 10,000 or more. Let's assume that we generate two random

binary hypervectors, H1, H2 ∈ {0,1}^D. These hypervectors are nearly orthogonal, as their similarity in the vector space is almost zero. This indicates that two distinct items can be represented with two randomly generated hypervectors. A hypervector is a distributed, holographic representation of information in that no dimension is more important than others. The independence of the components enables robustness against failure. Similarity computation: Reasoning in HD computing is done by measuring the similarity of hypervectors. There are multiple ways to define the distance. For binary hypervectors, we use the Hamming distance as the distance metric. We denote the Hamming similarity with

δ(H1, H2), where H1 and H2 are two hypervectors. Permutation: The permutation operation, ρⁿ(H), shuffles the components of H with an n-bit rotation. The intriguing property of the permutation is that it creates a near-orthogonal and reversible hypervector to H, i.e., δ(ρⁿ(H), H) ≈ 0 when n ≠ 0 and ρ⁻ⁿ(ρⁿ(H)) = H. Thus, we can use it to represent sequences and orders. Bundling/Binding: HD computing combines and associates data using element-wise multiplication and addition. The element-wise addition produces a hypervector that preserves all similarities of the combined data. We can also associate data using multiplication, and as a result, the multiplied hypervector is mapped to another orthogonal position in the hyperspace. Let's consider a text classification problem. The first step in HD computing is to map text data into a hypervector. We generate random binary hypervectors to represent the basic elements, e.g., the 26 letters of the Latin alphabet plus the (ASCII) space for text inputs (e.g., A, B, C ∈ {0,1}^D). Here, we explain the defined HD arithmetic operations on these hypervectors. Binding of two hypervectors A and B is done by component-wise XOR and denoted as A ⊕ B. The result of the operation is a new hypervector that is dissimilar to its constituent vectors, i.e.,

δ(A ⊕ B, A) ≈ 5,000 for D = 10,000; hence XOR is well suited for associating two hypervectors.

Binding is used for variable-value association and, more generally, for mapping. The bundling operation is done via the component-wise majority function and denoted as [A + B + C]. The majority function is augmented with a method for breaking ties if the number of component hypervectors is even. The result of the majority function preserves similarity to its component hypervectors, i.e., δ([A + B + C], A) < 5,000. Hence, the majority function is well suited for representing sets. The permutation, ρ(A), rotates the hypervector coordinates. Practically, it can be implemented as a cyclic right-shift by one position. The permutation operation generates a hypervector which is unrelated to the given hypervector, δ(ρ(A), A) ≈ 5,000. This operation is commonly used for storing a sequence of tokens in a single hypervector. For example, the trigram (n = 3) sequence a-b-c is stored as the following hypervector: ρ(ρ(A) ⊕ B) ⊕ C = ρ(ρ(A)) ⊕ ρ(B) ⊕ C. This efficiently distinguishes the sequence a-b-c from a-c-b, since a rotated hypervector is uncorrelated to all the other hypervectors. The encoding module bundles all the trigram hypervectors across the input text to generate the text hypervector as the output of the encoding.
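The operations above translate directly into bit-vector code. The following numpy sketch models binding, bundling, permutation, Hamming distance, and the trigram example for illustration only; the hypervector names and dimension are ours, and the tie-breaking rule in the majority function is a simplification.

```python
import numpy as np

D = 10_000
rng = np.random.default_rng(0)

def random_hv():
    return rng.integers(0, 2, D, dtype=np.uint8)

def bind(a, b):              # component-wise XOR
    return np.bitwise_xor(a, b)

def bundle(hvs):             # component-wise majority (ties broken toward 1)
    return (np.sum(hvs, axis=0) * 2 >= len(hvs)).astype(np.uint8)

def permute(a, n=1):         # cyclic rotation, rho^n
    return np.roll(a, n)

def hamming(a, b):
    return int(np.count_nonzero(a != b))

A, B, C = random_hv(), random_hv(), random_hv()
print(hamming(A, B))                      # ~D/2: random hypervectors are near-orthogonal
print(hamming(bind(A, B), A))             # ~D/2: binding is dissimilar to its operands
print(hamming(bundle([A, B, C]), A))      # < D/2: bundling preserves similarity
trigram_abc = bind(permute(bind(permute(A), B)), C)   # rho(rho(A) xor B) xor C
trigram_acb = bind(permute(bind(permute(A), C)), B)
print(hamming(trigram_abc, trigram_acb))  # ~D/2: the order of tokens matters
```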

3.3 Classification in Hyperdimensional Computing

Figure 3.1a shows the overview of an HD classification system consisting of an encoding module and an associative memory. The encoding module maps input data to hypervectors. During the training phase, all data corresponding to a particular class are encoded and the pertinent hypervectors are combined together to generate a class hypervector. At the end of training, there is one hypervector representing each category. These class hypervectors are stored in associative memory. During the inference phase, an unknown input is mapped to a query hypervector using the same encoding module used for training. The query hypervector is then compared to all class hypervectors to determine the classification result. In the remainder of this section, we describe the functionality of each module in detail.

3.3.1 Encoding Module

HD uses an encoding that can map all data types to the high-dimensional space. In Section 3.2, we described the basic HD operations and showed an example of text encoding. Here, we cover the encoding method for feature vectors. Figure 3.1b shows how the encoding module maps a data point, v, to the high-dimensional space using precomputed hypervectors. Consider a feature vector v = ⟨v1, ..., vn⟩. The encoding module takes this n-dimensional vector and converts it into a D-dimensional hypervector (D >> n). The encoding is performed in three steps, which we describe below.

Step 1: Base hypervector generation: The HD encoding is based on a set of randomly generated base hypervectors, e.g., the definition of alphabets in text data. For feature vectors, the base hypervectors are defined to represent different feature values and positions. As Figure 3.1b shows, we use a set of pre-computed level (or base) hypervectors to account for the impact of each feature value [93]. To create such level hypervectors, we compute the minimum and maximum feature values among all data points, say vmin and vmax, and then quantize the range [vmin, vmax] into Q levels, L = {L1, ..., LQ}. Each of these quantized scalars corresponds to a D-dimensional hypervector. Each level hypervector, Li, is unique and has D binarized dimensions, i.e., Li ∈ {0,1}^D. Figure 3.1c shows how we generate the base hypervectors. We create the first level hypervector, L1, by randomly filling each element with either a 0 or a 1 value. The second level hypervector, L2, is created by flipping D/Q random dimensions of L1. This continues until the LQ hypervector is created by flipping D/Q random dimensions of LQ−1. Since we select and flip the dimensions randomly, with high probability L1 and LQ will differ in about D/2 dimensions. As a result, the level hypervectors have similar values if the corresponding original data are closer, while L1 and LQ will be nearly orthogonal. For data with quantized bases, e.g., text or DNA sequences, the level hypervectors do not need to be correlated, so they can be generated randomly [94, 95].

Step 2: Element-wise hypervector mapping: Once the base hypervectors are generated, each of the n elements of the vector v is independently quantized and mapped to one of the base hypervectors. The result of this step is n different binary hypervectors, each of which is D-dimensional.

Step 3: Aggregation: In the last step, the n binary hypervectors are combined into a single D-dimensional non-binary hypervector. The naive approach for aggregation would be to simply add all of the n hypervectors together. This approach, however, does not take the feature positions into account. To differentiate the impact of each feature index, we use permutation. We know from random binary values that the permutations of different feature indexes are nearly orthogonal: δ(L, ρ^(i) L) ≈ D/2 (0 < i ≤ n), where the similarity metric, δ, is the Hamming distance between the two hypervectors, and ρ^(i) L is the i-bit rotational shift of L. The orthogonality of a hypervector and its permutation (i.e., circular bitwise rotation) is ensured as long as the hypervector dimensionality is large enough compared to the number of features in the original data point (D >> n). The aggregation of the n binary hypervectors is computed as follows:

H = L1 + ρ^(1) L2 + ... + ρ^(n−1) Ln,

where H is the (non-binary) aggregation and Li is the (binary) hypervector corresponding to the i-th feature of vector v. The explained encoding also works for data points with a variable length, such as text-like data, where the encoding can be applied on fixed-size n-gram windows [22].
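A minimal sketch of this three-step encoding (level-hypervector generation, per-feature mapping, permuted aggregation) is shown below. The values of Q, D, the quantization formula, and all names are illustrative choices for this sketch, not the exact settings used in the dissertation.

```python
import numpy as np

D, Q = 10_000, 16          # hypervector dimensionality and number of levels
rng = np.random.default_rng(1)

def make_level_hvs():
    # Step 1: L1 is random; each next level flips D/Q random dimensions of the
    # previous one, so L1 and LQ end up nearly orthogonal (~D/2 bits apart).
    levels = [rng.integers(0, 2, D, dtype=np.uint8)]
    for _ in range(Q - 1):
        nxt = levels[-1].copy()
        flip = rng.choice(D, D // Q, replace=False)
        nxt[flip] ^= 1
        levels.append(nxt)
    return np.stack(levels)

def encode(v, levels, v_min, v_max):
    # Step 2: quantize each feature value and pick its level hypervector.
    idx = np.clip(((v - v_min) / (v_max - v_min) * (Q - 1)).astype(int), 0, Q - 1)
    # Step 3: aggregate with index-dependent permutations:
    # H = L_{v1} + rho^(1)(L_{v2}) + ... + rho^(n-1)(L_{vn})
    rolled = [np.roll(levels[q], i) for i, q in enumerate(idx)]
    return np.sum(rolled, axis=0)

levels = make_level_hvs()
v = rng.random(64)                  # a toy 64-feature data point
H = encode(v, levels, 0.0, 1.0)     # D-dimensional non-binary hypervector
```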

3.3.2 HD Model Training

After mapping the input data to the high-dimensional space, a trainer block, shown in Figure 3.1a, combines the encoded data points in order to create class hypervectors. The training module simply adds all encoded hypervectors which belong to the same class. In a face detection task, for instance, the trainer adds all hypervectors with the "face" tag and all hypervectors with the "non-face" tag into two different hypervectors.

Figure 3.1: (a) Overview of the HD classification consisting of encoding and associative memory modules. (b) The encoding module maps a feature vector to a high-dimensional space using pre-generated base hypervectors. (c) Generating the base hypervectors.

Element-wise addition of hypervectors in training results in non-binarized class hypervectors, i.e., C^i ∈ ℕ^D. For the encoded hypervectors H_j = ⟨h_D, ..., h_1⟩ belonging to the i-th class, the non-binarized model can be generated as follows:

C^i = Σ_j H_j^i = ⟨c^i_D, ..., c^i_1⟩

To perform the classification on binarized hypervectors, we binarize the model by applying a majority function on the non-binarized class hypervectors. For a given class hypervector, C = ⟨c_D, ..., c_1⟩, the majority function is defined as follows:

MAJ(C, τ) = ⟨c'_D, ..., c'_1⟩, where c'_j = 0 if c_j < τ, and c'_j = 1 otherwise.

Using the majority function, the final binarized hypervector is given by C' = MAJ(C, τ), where C' ∈ {0,1}^D and τ = n/2.
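In code, training and binarization reduce to per-class accumulation followed by a component-wise thresholding. The sketch below (illustrative names; it assumes encoded hypervectors such as those produced by the `encode` sketch above, and leaves the threshold τ as a parameter since the text sets it to n/2) is one possible reading of these equations.

```python
import numpy as np

def train(encoded, labels, num_classes):
    # Accumulate all encoded hypervectors that belong to the same class:
    # C^i = sum_j H^i_j  (non-binarized class hypervectors, C^i in N^D)
    D = encoded.shape[1]
    model = np.zeros((num_classes, D), dtype=np.int64)
    for H, y in zip(encoded, labels):
        model[y] += H
    return model

def majority(C, tau):
    # MAJ(C, tau): each component becomes 1 if it reaches the threshold tau
    # (the text uses tau = n/2), giving a binary hypervector in {0,1}^D.
    return (C >= tau).astype(np.uint8)
```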

3.3.3 Associative Search

After training, all class hypervectors are stored in the associative memory, as shown in Figure 3.1a. During inference, input data is encoded into a query hypervector using the same encoding module used for training. The associative memory compares the similarity of the query hypervector with all stored class hypervectors and selects the class with the highest similarity. HD can use different metrics to find the class hypervector that is most similar to the query hypervector. For class hypervectors with binarized values, the Hamming distance is an inexpensive and suitable similarity metric, while class hypervectors with non-binarized elements require the use of cosine similarity.
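The inference step then amounts to a nearest-neighbor search over the stored class hypervectors. A brief sketch of both metrics mentioned above (function names are illustrative):

```python
import numpy as np

def classify_binary(query, binary_model):
    # Hamming distance to each binary class hypervector; smallest distance wins.
    return int(np.argmin(np.count_nonzero(binary_model != query, axis=1)))

def classify_nonbinary(query, model):
    # Cosine similarity to each non-binarized class hypervector; largest wins.
    sims = model @ query / (np.linalg.norm(model, axis=1) * np.linalg.norm(query))
    return int(np.argmax(sims))
```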

3.4 Algorithm-Hardware Optimizations of HD Computing

Although the initial model training in HD computing provides high classification accuracy, binarizing the trained model may result in a significant drop in the inference accuracy. In this section, we propose two advanced training approaches for HD computing: QuantHD and SearcHD. QuantHD is an iterative training framework for HD model quantization. It provides high inference accuracy while quantizing the HD model to a binary or ternary representation, enabling HD with binary models to provide accuracy similar to floating point. As an iterative approach, QuantHD is suitable for devices with enough memory to store the original or encoded training data. Next, we present SearcHD, a fully binary HD training framework that performs the entire training and inference tasks without any arithmetic operations. In contrast to QuantHD, SearcHD is a single-pass training method that learns from a stream of data without storing the training data points, and thus supports online learning. This training approach is suitable for ultra low-power devices with no off-chip memory. Due to the lack of iterative training, SearcHD provides on average 8.5% lower accuracy than QuantHD.

3.4.1 QuantHD: Model Quantization in HD Computing

In many practical applications, HD computing algorithms need to be trained and tested using floating-point values. Naive model binarization after the training phase may result in low classification accuracy. On the other hand, working with floating-point values increases the HD computation cost and hinders the use of HD as a light-weight classifier. In this section, we propose a novel quantization framework that enables HD computing to be trained and tested on low-cost binary or ternary models with high classification accuracy.

Figure 3.2: (a) QuantHD framework overview. (b) Binarizing and ternarizing the trained HD model.

QuantHD consists of three main steps: (i) Initial training creates an initial HD model by accumulating all encoded hypervectors corresponding to each class. The initial training step is the same as in conventional HD computing algorithms (explained in Section 3.3). (ii) Quantization projects the HD model to a binary or ternary model. Since the HD model has been trained to work with floating-point values, the quantization can result in a significant quality loss. (iii) Retraining compensates for the quality loss due to the model quantization. QuantHD iteratively retrains the HD model so that it adapts to work with the quantized model.

Initial Training: QuantHD trains the class hypervectors by accumulating all encoded hypervectors which belong to the same class. As Figure 3.2a shows, each accumulated hypervector represents a class. For example, for an application with k classes, the initial HD model contains k non-quantized hypervectors {C1, ..., Ck}, where Ci ∈ ℕ^D (•1).

Model Projection: We develop a model projection method which maps this model to quantized hypervectors, {C1^q, ..., Ck^q}, with a binary or ternary representation (•2). The binary and ternary models represent the class hypervectors using {0,1} and {−1, 0, +1} elements, respectively.
Iterative Learning: Although the initial trained model provides high classification accuracy, quantization of the model significantly degrades the accuracy. This accuracy degradation comes from mapping to the binary domain, which does not preserve the distances between the vectors. To compensate for the possible quality loss, QuantHD supports a retraining procedure which iteratively modifies the HD model in order to adapt it to the quantization constraints. QuantHD keeps both the quantized and the non-quantized models. For each data point in the training dataset, say H, we first quantize the encoded hypervector to Hq and then check its similarity with the quantized model. The similarity metric is the Hamming distance for the binary model and the dot product for the ternary model (•3). If the quantized model correctly classifies Hq, we do not update the model. However, if Hq is incorrectly classified, we only update the non-quantized model, while the quantized model stays the same. This update happens on two class hypervectors: the class that the data is misclassified to (Cmiss) and the class that the data point belongs to (Cmatch). Since in HD the information is stored as a pattern of distribution in the high-dimensional space, the update of the non-quantized model can be performed by (•4):

Cmiss = Cmiss − αH and Cmatch = Cmatch + αH

where α is a learning rate (0 < α < 1). Note that although the similarity check is performed on the quantized model, we only update the non-quantized model. After each epoch (one iteration across all training examples), the new non-quantized model is written back to the quantized model (•5).

Model Validation: We examine the classification accuracy of the projected model on the validation data, which is 5% of the training data (•6). If the projected model accuracy changes by less than ε, we send the new model to inference to perform the rest of the computation; otherwise, we continue retraining by checking the similarity of all training data points and accordingly updating the non-quantized model (•7). Note that QuantHD stops after a pre-defined number of iterations if the convergence condition is not satisfied. For all experiments, we use ε = 0.01 and limit the maximum number of iterations to 30.
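A compact sketch of this retraining loop is given below (Python/NumPy). The binarization boundary (per-row mean), the quantization of the encoded hypervector, and the omission of the ε-based validation step are simplifying assumptions of the sketch, not details confirmed by the dissertation.

```python
import numpy as np

def quantize_binary(M):
    # Project to {0,1}^D; the binarization boundary here (per-row mean) is an
    # illustrative choice, not necessarily the one used by QuantHD.
    return (M >= M.mean(axis=-1, keepdims=True)).astype(np.uint8)

def quanthd_retrain(model, encoded, labels, alpha=0.05, epochs=30):
    # model: non-quantized class hypervectors (k x D); the epsilon-based
    # validation / early stopping described above is omitted for brevity.
    model = model.astype(np.float64)
    for _ in range(epochs):
        q_model = quantize_binary(model)      # write-back after each epoch
        for H, y in zip(encoded, labels):
            Hq = quantize_binary(H)
            # Similarity check against the quantized model (Hamming distance).
            pred = int(np.argmin(np.count_nonzero(q_model != Hq, axis=1)))
            if pred != y:
                # Update only the non-quantized model:
                # C_miss <- C_miss - alpha*H,  C_match <- C_match + alpha*H
                model[pred] -= alpha * H
                model[y] += alpha * H
    return quantize_binary(model)
```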

QuantHD Impact on HD Accuracy: Table 3.1 compares the classification accuracy of QuantHD using binary and ternary models with the state-of-the-art HD computing algorithm using binary and non-quantized models [95]. To have a fair comparison, we give the baseline HD the advantage of retraining its non-quantized model for the same number of iterations as the QuantHD model. Results are also reported for the baseline HD with a binary model, where the HD model is binarized once after training. For QuantHD, the models have been retrained for 40 iterations with a learning rate of α = 0.05. For the ternary models, we use the ternary boundary b = 0.42σ, which results in the maximum classification accuracy.

Our evaluation shows that the baseline HD provides high classification accuracy using the non-quantized model. However, in the baseline model, both training and retraining are significantly costly: retraining involves several iterations of the similarity check over the non-quantized model, and in the inference phase the associative search between a query and the trained model requires a costly cosine metric. Binarization of the baseline model has been proposed to reduce the inference cost by replacing cosine with Hamming distance similarity. However, this binarization used in the baseline HD computing [95, 93] has two main disadvantages: (i) it results in a significant drop in the classification accuracy, as the model was never trained to work with this constraint, and (ii) the retraining is as costly as for the non-quantized model, as the similarity check still needs to be performed using the cosine metric.

QuantHD addresses these existing issues in HD computing algorithms. QuantHD provides an iterative procedure which enables HD to learn to work with binary or ternary models. In addition, QuantHD defines a learning rate for the training procedure, which further improves the HD classification as compared to prior work with no learning rate (α = 1). Our evaluation in Table 3.1 shows that QuantHD using binary and ternary models can provide accuracy comparable to the non-quantized model.

Table 3.1: Comparison of QuantHD classification accuracy with the state-of-the-art HD computing.

                  Baseline HD                   QuantHD
           Non-Quantized   Binary   Non-Quantized   Binary   Ternary
ISOLET         91.1%       88.1%        95.8%       94.6%    95.3%
UCIHAR         93.8%       77.4%        98.1%       96.5%    97.2%
PAMAP2         88.9%       85.7%        92.7%       91.3%    92.7%
FACE           95.9%       68.4%        96.2%       94.6%    95.4%
CARDIO         93.7%       90.9%        97.4%       95.3%    97.7%
EXTRA          70.2%       66.7%        74.1%       72.6%    74.0%
Average        88.9%       79.5%        92.4%       90.8%    92.1%

The accuracy is higher for the ternary model as it gives more flexibility to the training module to select better class values. Our evaluation shows that QuantHD using binary and ternary models provides on average 16.2% and 17.4% higher accuracy, respectively, as compared to the baseline HD computing [95] using a binary model (see Table 3.1). In addition, we observe that QuantHD accuracy using binary and ternary models is 1.9% and 3.1% higher than the baseline HD using the non-quantized model.

QuantHD Training Efficiency: We compare QuantHD with the baseline HD computing in terms of training/retraining efficiency. All HD-based designs have the same performance/energy during the generation of the initial training model. However, during retraining, which carries the significant training cost, they have different computation efficiency. In the baseline HD, retraining is performed by checking the similarity of each training data point with the non-quantized model. This search significantly increases the cost of retraining, since the associative search in the non-quantized domain is much more costly than in the binary or ternary domains. In contrast, the retraining in QuantHD is performed by checking the similarity of training data points with the binary/ternary model. After each similarity check, QuantHD updates the non-quantized model by adding and subtracting a query hypervector from two class hypervectors. Figure 3.3 compares the energy consumption and execution time of the baseline HD and QuantHD. Our evaluation shows that QuantHD with the binary (ternary) model can achieve on average 36.4× and 4.5× (34.1× and 4.1×) energy efficiency improvement and speedup as compared to the baseline HD computing algorithm.

Figure 3.3: Energy consumption and execution time of QuantHD, conventional HD, and BNN during training.

QuantHD Inference Efficiency: Figure 3.4 compares the energy consumption and execution time of running a single query in the baseline HD computing with QuantHD using binary and ternary models. All reported results are the average energy and execution time of a single prediction, processed over the entire test data. In QuantHD, the encoding and associative search modules work in a pipeline; therefore, the execution time of the encoding module is hidden under the execution time of the associative search. However, the FPGA still needs to pay the energy cost of the encoding module (as shown in the top graph of Figure 3.4). Our evaluation shows that HD using the non-quantized model is the most inefficient design due to its significant cost during the associative search. Regardless of the training procedure, the binary models provide maximum efficiency during inference. The results show that QuantHD using the binary and ternary models can achieve 45.7× and 42.3× energy efficiency improvement and 5.2× and 4.7× speedup as compared to the baseline HD while providing comparable accuracy. However, QuantHD still relies on slow, iterative, high-precision training, even though its inference is optimized. Next, we propose a framework for fully binary training and inference in HD computing.

Figure 3.4: Energy consumption and execution time of QuantHD during inference.

3.4.2 SearcHD: Fully Binary Stochastic Training

To ensure real-time learning on today's embedded devices, we need to enable an ultra-efficient training process. In this section, we propose SearcHD, a fully binary HD computing algorithm with probability-based training. Unlike most recent learning algorithms, e.g., neural networks, SearcHD supports single-pass training, where it trains a model by passing through the training dataset only once. SearcHD is a framework for binarization of the HD computing algorithm during both training and inference. SearcHD removes the addition operation from training by exploiting bitwise substitution, which trains a model by stochastically sharing the query hypervector's elements with each class hypervector. Since HD computing with a binary class has lower classification accuracy, SearcHD exploits vector quantization to represent an HD model using multiple vectors per class. This enables SearcHD to store more information in each class while keeping the classes binary. The cost of SearcHD is a slightly lower accuracy compared to QuantHD, described in Section 3.4.1.

Figure 3.5: Overview of SearcHD (a) encoding and (b) stochastic training.

SearcHD Bitwise Substitution: SearcHD removes all arithmetic operations from training by replacing addition with bitwise substitution. Assume A and B are two randomly generated vectors. In order to bring vector A closer to vector B, a random (typically small) subset of vector B's indices is forced onto vector A by setting those indices in vector A to match the bits in vector B. Therefore, the Hamming distance between vectors A and B is made smaller through partial cloning. When vectors A and B are already similar, the selected indices probably contain the same bits, and thus the information in A does not change. This operation is blind, since we do not search for indices where A and B differ and then "fix" those indices. Indices are chosen randomly and independently of whatever is in vector A or vector B. In addition, the operation is one-directional. Only the bits in vector A are transformed to match those in vector B, while the bits in vector B stay the same. In this sense, A inherits an arbitrary section of vector B. We call vector A the binary accumulator and vector B the operand. We refer to this process as bitwise substitution.
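As a concrete illustration, bitwise substitution can be written as copying a randomly chosen subset of the operand's indices into the accumulator. In the sketch below, the fraction of copied indices is a free parameter of the sketch rather than a value taken from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(2)

def bitwise_substitution(A, B, frac=0.05):
    # Copy a random subset of B's bits into A (A is the binary accumulator,
    # B the operand); indices are chosen blindly, not where A and B differ.
    A = A.copy()
    idx = rng.choice(A.size, int(frac * A.size), replace=False)
    A[idx] = B[idx]
    return A
```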

SearcHD Vector Quantization: Here, we present our fully binary stochastic training approach, which enables the entire HD training process to be performed in the binary domain. Similar to traditional HD computing algorithms, SearcHD trains a model by combining the encoded training hypervectors. As we explained before, HD computing using a binary model results in very low classification accuracy. In addition, moving to the non-binary domain makes HD computing significantly more costly and inefficient. In this work, we propose vector quantization: we exploit multiple vectors to represent each class in the training of SearcHD. The training keeps the distinct information of each class in separate hypervectors, resulting in the learning of a more complex model when using multiple vectors per class. For each class, we generate N models (where N is generally between 4 and 64). Below we explain the details of the proposed algorithm:

• Initialize the N model vectors of a class by randomly sampling from the encoded training hypervectors of that class, as shown in Figure 3.5b. For an application with k classes, the approach needs to store N × k binary hypervectors as the HD model. For example, we can represent the i-th class using N initial binary hypervectors {C_1^i, C_2^i, ..., C_N^i}, where C_j^i ∈ {0,1}^D.

• The training in HD computing starts by checking the similarity of each encoded data point (from the training dataset) to the initial model. The similarity check only happens between the encoded data and the N class hypervectors corresponding to that label. For each piece of training data Q in a class, we find the model vector with the lowest Hamming distance and update it using bitwise substitution (explained in Section 3.4.2). For example, in the i-th class, if C_k^i is selected as the vector with the highest similarity, we can update the model using:

C_k^i = C_k^i (+) Q

In the above equation, (+) is the bitwise substitution operation, Q is the operand, and C_k^i is the binary accumulator. This algorithm also helps reduce the memory access overhead introduced by using bitwise substitution. This approach accumulates training data more intelligently: given the choice of adding an incoming piece of data to one of N model vectors, we select the model with the lowest Hamming distance to ensure that we do not needlessly encode information in our models.

Figure 3.6: Classification accuracy of SearcHD, kNN, and the baseline HD algorithms on (a) ISOLET, (b) FACE, (c) UCIHAR, and (d) IOT.

SearcHD Training Process: Bitwise substitution updates each dimension of the selected class vector stochastically with probability p = α × (1 − δ), where δ is the similarity between the query and the class hypervector and α is the learning rate. In other words, with flip probability p, each element of the selected class hypervector is replaced with the corresponding element of the query hypervector. The learning rate α (0 < α) determines how frequently the model is updated during training. Using a small learning rate is conservative, as the model will undergo only minor changes during training. A larger learning rate results in a major change to the model after each update, and thus a higher probability of divergence. After updating the model over the entire training dataset, SearcHD uses the trained model for classification during inference. The classification checks the similarity of each encoded test data vector to all class hypervectors; in other words, a query hypervector is compared with all N × k class hypervectors. Finally, the query is assigned to the class with the maximum Hamming-distance similarity to the query data. To further improve the training speed, the work in [96] proposed an adaptive approach that sets the learning rate during the training phase.
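Putting the pieces together, a sketch of SearcHD's single-pass training with N vectors per class and flip probability p = α(1 − δ) could look as follows. It assumes the encoded hypervectors are already binarized; the initialization, seed, and hyperparameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def searchd_train(encoded, labels, num_classes, N=32, alpha=1.0):
    labels = np.asarray(labels)
    D = encoded.shape[1]
    # N binary vectors per class, initialized from random samples of that class.
    model = np.empty((num_classes, N, D), dtype=np.uint8)
    for c in range(num_classes):
        samples = encoded[labels == c]
        model[c] = samples[rng.integers(0, len(samples), N)]
    # Single pass over the training stream.
    for Q, y in zip(encoded, labels):
        dists = np.count_nonzero(model[y] != Q, axis=1)
        k = int(np.argmin(dists))                # closest vector of that class
        delta = 1.0 - dists[k] / D               # similarity in [0, 1]
        p = alpha * (1.0 - delta)                # per-dimension flip probability
        mask = rng.random(D) < p
        model[y, k, mask] = Q[mask]              # stochastic bitwise substitution
    return model

def searchd_classify(Q, model):
    # Compare the query with all N x k class vectors; smallest Hamming wins.
    dists = np.count_nonzero(model != Q, axis=2)     # shape (k, N)
    return int(np.argmin(dists.min(axis=1)))
```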

SearcHD Accuracy-Efficiency: Figure 3.6 shows the impact of the number of hypervectors per class, N, on SearcHD classification accuracy in comparison with other approaches. State-of-the-art HD computing approaches use a single hypervector to represent each class. As the figure shows, for all applications, increasing the number of hypervectors per class improves the classification accuracy. For example, SearcHD using eight hypervectors per class (8/class) and 16 hypervectors per class (16/class) achieves on average 9.2% and 12.7% higher classification accuracy, respectively, as compared to using a single hypervector per class (1/class) on the four tested applications. However, SearcHD accuracy saturates when the number of hypervectors is larger than 32/class. In fact, 32/class is enough to capture the most common patterns in our datasets, so adding new vectors cannot capture patterns different from those already represented by the existing vectors in the class. The red line in each graph shows the classification accuracy that a k-Nearest Neighbor (kNN) algorithm can achieve. kNN does not have a training mode; during inference, kNN looks at the similarity of a data point with all other training data. However, kNN is computationally expensive and requires a large memory footprint. In contrast, SearcHD provides similar classification accuracy by performing classification on a trained model. Figure 3.6 also compares SearcHD classification accuracy with the best baseline HD computing algorithm using non-binary class hypervectors [97]. The baseline HD model is trained using non-binary encoded hypervectors, and after training it uses a cosine similarity check for classification. Our evaluation shows that SearcHD with 32/class and 64/class provides 5.7% and 7.2% higher classification accuracy, respectively, as compared to the baseline HD computing with the non-binary model.

Table 3.2 compares the memory footprint of SearcHD, kNN, and the baseline HD algorithm (non-binary model). As expected, kNN has the highest memory requirement, taking on average 11.4 MB for each application. After that, SearcHD 32/class and the baseline HD algorithm require similar memory footprints, which are on average about 28.2× lower than kNN. SearcHD can further reduce the memory footprint by reducing the number of hypervectors per class. For example, SearcHD with the 8/class configuration requires 117.1× and 4.1× less memory than kNN and the baseline algorithm, respectively, while providing similar accuracy. In the next section, we design digital and analog hardware to accelerate the binary implementation of HD classification.

Table 3.2: Memory footprint of different algorithms (MB).

            kNN    Baseline HD   SearcHD 64/class   32/class   16/class   8/class   4/class
ISOLET     14.67      0.99             1.98            0.99       0.49       0.24      0.12
CARDIO      0.15      0.11             0.22            0.11       0.05       0.02      0.01
UCIHAR     13.29      0.45             0.91            0.45       0.22       0.11      0.05
IOT        17.54      0.07             0.15            0.07       0.04       0.02      0.01

3.5 Hardware Acceleration of HD Computing

Hyperdimensional computing is remarkably tolerant of errors. The random hypervector seeds are independent and identically distributed, a property that is preserved by the encoding operations (binding, bundling, and permutation) performed on them. Hence, a failure in a component is not "contagious". Figure 3.7 shows the classification accuracy as a function of the number of bit errors in computing the Hamming distance. The results are reported for text classification of 21 European languages [98] during the inference phase. As shown, HD computing still exhibits its maximum classification accuracy of 97.8% with up to 1,000 bits of error in computing the distance (i.e., when up to 10% of the hypervector components are faulty). We exploit this robustness property of HD computing to design efficient associative memories that can tolerate errors in any part of a hypervector. Further increasing the error in the distance metric to 3,000 bits slightly decreases the classification accuracy to 93.8%. We call this range of classification accuracies the moderate accuracy, which is up to 4% lower than the maximum of 97.8%. Accepting moderate accuracy opens more opportunities for aggressive optimizations. However, increasing the error to 4,000 bits reduces the classification accuracy below 80%. Here we show how we can exploit this inherent robustness to design efficient associative memory modules that can be implemented in memory, thus merging memory and computation.

In this section, we propose three architectural designs for hyperdimensional associative memory (HAM). We exploit the holographic and distributed nature of hypervectors to design memory-centric architectures with no asymmetric error protection, which further allows us to effectively combine approximation techniques in three widely-used methodological design approaches: digital CMOS design as well as digital and analog in-memory design.
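The kind of experiment behind Figure 3.7 can be mimicked with a few lines of simulation: flip random bits of the query (equivalently, inject errors into the Hamming-distance computation) and check whether the predicted class changes. The sketch below is a toy illustration with randomly generated stand-in class hypervectors; it does not reproduce the dissertation's language-identification results.

```python
import numpy as np

rng = np.random.default_rng(4)
D, k = 10_000, 21
classes = rng.integers(0, 2, (k, D), dtype=np.uint8)   # stand-in class hypervectors

def predict(query, faulty_bits=0):
    # Flip `faulty_bits` random bits of the query before the associative
    # search, emulating errors in the Hamming-distance computation.
    q = query.copy()
    idx = rng.choice(D, faulty_bits, replace=False)
    q[idx] ^= 1
    return int(np.argmin(np.count_nonzero(classes != q, axis=1)))

# A noisy copy of class 0 is still classified correctly even with thousands
# of additional faulty bits, reflecting the robustness discussed above.
query = classes[0].copy()
noise = rng.choice(D, 2_000, replace=False)
query[noise] ^= 1
print(predict(query, faulty_bits=1_000))   # typically still 0
```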

Figure 3.7: Language classification accuracy with a wide range of errors in Hamming distance using D = 10,000.

3.5.1 D-HAM: Digital-based Hyperdimensional Associative Memory

Conventional content-addressable memories (CAMs) typically check whether an input query pattern exists among the stored (or learned) patterns; for reasoning among hypervectors, however, we need to find the closest pattern. We propose a digital CMOS-based hyperdimensional associative memory, called D-HAM. After the training phase, D-HAM stores a set of learned hypervectors in a CAM. For an input query hypervector, D-HAM finds the learned hypervector that has the nearest Hamming distance to the query. Figure 3.8 shows the structure of the proposed D-HAM, consisting of two main modules. (1) CAM: it forms an array of C × D storage elements, where C is the number of hypervectors (the distinct classes) and D is the dimension of a hypervector. During a classification event, each learned hypervector is compared with the input query hypervector using an array of XOR gates. An XOR gate detects the similarity between its inputs by producing a 0 output for a match and a 1 output for a mismatch. Therefore, in every row, the number of XOR gates with an output of 1 represents the Hamming distance of the input query hypervector to the hypervector stored in that row. (2) Distance computation: it is composed of parallel counters and comparators that compute the distances using the outputs of the XOR gates.

Figure 3.8: Overview of D-HAM.

D-HAM requires a set of C counters, each with log D bits. Each counter is assigned to a row and iterates through the D output bits of the XOR gates to count the number of 1s. The value of a counter determines the distance of the input query hypervector to the hypervector stored in that row. Finally, D-HAM needs to find the minimum distance among the C counter values. It structures a binary tree of comparators with a height of log C to find the hypervector with the minimum distance from the input query hypervector.

Accuracy-Energy Tradeoff in D-HAM: To address the energy cost of D-HAM, we exploit the distributed property of hypervectors, which enables us to accurately compute the Hamming distance from an arbitrary subset of the hypervector components (d < D). This allows D-HAM to apply sampling to hypervectors with i.i.d. components. As shown in Figure 3.7, the sampling ratio can impact the classification accuracy [99]. To meet the maximum classification accuracy, D-HAM ignores 1,000 bits and computes the Hamming distance over the remaining d = 9,000 bits. In this way, D-HAM ensures error-free computation of the Hamming distance on 90% of the hypervector bits while intentionally eliminating 10% of the bits. Further, ignoring up to 3,000 bits (i.e., d = 7,000) still guarantees the moderate classification accuracy (see Figure 3.7). Such sampling results in an energy saving in D-HAM which is linearly related to the size of the sampling: 7% (or 22%) energy saving is achieved with d = 9,000 (or d = 7,000) for the maximum (or moderate) classification accuracy compared to the baseline D-HAM with D = 10,000.
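In software terms, D-HAM's sampling optimization amounts to computing the Hamming distance over only d of the D dimensions. A brief sketch (d values as in the text; everything else, including the use of the first d dimensions, is an illustrative assumption since any fixed subset works for i.i.d. components):

```python
import numpy as np

def sampled_hamming_search(query, classes, d=9_000):
    # Distances over the first d dimensions only (d = 9,000 keeps the maximum
    # accuracy, d = 7,000 the moderate accuracy, per the discussion above).
    return int(np.argmin(np.count_nonzero(classes[:, :d] != query[:d], axis=1)))
```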

3.5.2 R-HAM: Resistive Hyperdimensional Associative Memory

We utilize non-volatile memory elements to design R-HAM, a fast and scalable associative memory module. Figure 3.9(a) shows the overview of the proposed R-HAM, consisting of two main modules. (1) Resistive CAM: it stores the learned hypervectors using a crossbar of resistive cells. The resistive CAM is partitioned into M stages, as shown in Figure 3.9(b), each containing D/M bits with C rows, where D and C are the hypervector dimension and the number of classes, respectively. Each CAM stage finds the distance between the query hypervector and the learned hypervectors in parallel. (2) Distance computation: R-HAM uses a set of parallel counters to accumulate the Hamming mismatches from all partial CAM stages corresponding to each row. This counter differs from the conventional binary counters that D-HAM uses, because the CAM stages in R-HAM generate a non-binary code (see Figure 3.9(c)). This new coding has a lower switching activity compared to dense binary coding. More details about the functionality and implementation of this module are given below. Finally, R-HAM uses comparators similar to D-HAM to find the row with the minimum Hamming distance.

A digital-based realization of HAM using CMOS (D-HAM) spends 81% of its total energy consumption and 58% of its total area on storing and comparing the hypervectors in the CAM array, which is replaced with a dense memristive crossbar in R-HAM. The memristive CAM improves both the energy and the area of R-HAM: (1) The energy is reduced due to the lower switching activity in the crossbar compared to D-HAM, which uses XOR gates to find mismatches. Table 3.3 shows the average switching activity of D-HAM and R-HAM for different block sizes. R-HAM shows lower switching activity for large block sizes, exhibiting about 50% lower switching activity than D-HAM with blocks of 4 bits. Further, the resistive CAM array can operate at a lower supply voltage, bringing additional energy savings.

Figure 3.9: Overview of R-HAM: (a) Resistive CAM array with distance computation; (b) A 4-bit resistive block; (c) Sensing circuitry with non-binary code generation.

(2) The area is reduced by tightly integrating the storage array and the mismatch-finding logic in the same crossbar. In contrast to D-HAM, which uses a large XOR array to determine the mismatches, R-HAM uses high-density memristive crossbars for both storage and partial determination of mismatches.

Nearest Distance in Resistive Crossbars: R-HAM finds the nearest distance by efficiently using the timing properties of memristive crossbars. R-HAM consists of a CAM array where all cells in a row share the same match line (ML) to represent a hypervector. The search operation has two main phases: precharging and evaluation. A set of C row drivers precharges all the MLs before the search operation. In the evaluation phase, a set of input buffers distributes the components of the input query among all the rows through D vertical bitlines (BLs).

The buffer's role is to strengthen the input signal such that all the rows receive the signal at the same time. During the evaluation, any cell with a value differing from the input query component discharges the ML. Therefore, all MLs will be discharged except the ML of the row that contains all matched cells. This matched row can be detected by the sense amplifier circuitry by sampling the ML voltage at a certain time. Figure 3.10(a) shows the normalized ML discharging voltage curves during the search operation for different distances (i.e., the number of mismatches on the ML). The ML discharging speed depends on the distance value. For example, a row that has a Hamming distance of 2 from the input query discharges the ML about 2× faster than a row with a Hamming distance of 1. This timing characteristic can be used to identify the distance of the query hypervector from the learned hypervectors. However, the dependency between the speed of ML discharging and the Hamming distance is not linear. As Figure 3.10(a) shows, the first mismatch usually has a higher impact on ML discharging than the last mismatch. Because the ML discharging current saturates after the first few mismatches, many later mismatches do not change the ML discharging speed. For example, there is a distinguishable time difference between Hamming distances of 1 and 2, while Hamming distances of 4 and 5 have similar ML discharging times. In this experiment, each row has only 10 bits, which clearly indicates the limitation of this method for higher dimensionality. Other works have observed the same restriction, which limits their approaches to resistive CAMs with a small dimensionality of 64 bits [100, 101]. To address this non-uniform ML discharging time, we split the R-HAM array into M shorter blocks. Among various configurations, we have observed that the maximum size of a block can be 4 bits for accurate determination of the different distances.

Table 3.3: Average switching activity of D-HAM and R-HAM.

Block size   R-HAM   D-HAM
1 bit        25%     25%
2 bits       21.4%   25%
3 bits       18.3%   25%
4 bits       13.6%   25%

Figure 3.10: Match line (ML) discharging time and its relation to detecting Hamming distance for various CAMs: (a) 10-bit CAM; (b) 4-bit CAM without voltage overscaling; (c) 4-bit CAM with voltage overscaling.

To further alleviate the current saturation issue, we use memristor devices with a reported large ON resistance [102] that provide a stable ML voltage for a better distinction between the distances, at the cost of a slower search operation. Figure 3.10(b) shows the ML discharging time of the 4-bit block for different distances. This figure shows that the timing difference between the distances becomes approximately uniform. We accordingly design a sense amplifier for R-HAM that can identify the difference between Hamming distances of 1, 2, 3, and 4 by measuring the ML discharging time. As shown in Figure 3.10(b), sampling at T0 corresponds to an ML without any mismatches (i.e., a Hamming distance of 0). Similarly, sampling at T1 detects a row with a Hamming distance of 1 or 0. Figure 3.9(c) shows the structure of the sense amplifier with the sampling times tuned to detect each distance. Our design uses four distinct sense amplifiers to detect Hamming distances of 0, 1, 2, and 3. Then, based on the table shown in Figure 3.9(c), it can identify the number of mismatches on each row by generating a non-binary code. We use a buffer to generate a small delay (≈0.1 ns) on the clock of the sense amplifiers. To adjust this delay, we change the size of the buffer. To guarantee correct functionality in the presence of variations, we design the CAM and sense circuitry considering 10% process variation on the transistors (size and threshold voltage) and the resistance values.

Figure 3.9(c) shows the inside of a 4-bit R-HAM block. To compute a distance from the outputs of these blocks, a counter accumulates the number of mismatches over all partial blocks. This counter is designed to work with the coding produced at the output of the sense amplifiers. The proposed encoding decreases the bit difference between the output signals. For example, patterns with distances of 3 bits and 4 bits differ in three bits in the binary representation (0011 vs. 0100), while the encoding reduces this switching to a single bit (1110 vs. 1111). Mathematically considering all possible matching combinations shows that the proposed encoding significantly reduces R-HAM switching activity compared to D-HAM, especially when using larger block sizes (see Table 3.3).

Accuracy-Energy Tradeoff in R-HAM: To exploit the robustness of HD computing for improving energy efficiency, R-HAM supports two techniques. First, it applies a sampling technique similar to D-HAM by ignoring up to 10% of the blocks, i.e., accepting up to 1,000 bits of error in the distance (see Figure 3.7). These 250 blocks (out of the total 2,500 blocks) can be directly excluded from the R-HAM design to save energy while meeting the maximum classification accuracy. Further increasing the number of ignored blocks to 750 decreases the classification accuracy to the moderate level. The second technique leverages the holographic property of the hypervectors by intentionally distributing the erroneous bits over a large number of blocks rather than jamming them into a few blocks. To do so, R-HAM overscales the supply voltage of every block to 780 mV such that a block is restricted to undergo at most a one-bit error in Hamming distance (see Figure 3.10(c)). With this technique, 40% (or 100%) of the total blocks can operate at the lower voltage while providing the maximum (or the moderate) classification accuracy. As a result, these blocks save energy quadratically while computing a distance metric in which the cumulative effect of their errors is acceptable for classification using HD computing. To implement the voltage overscaling, we use an energy-efficient and fast voltage supply boosting technique [103].
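Functionally, R-HAM's distance computation can be thought of as summing per-block mismatch counts over 4-bit blocks, with some blocks dropped entirely under the structured-sampling optimization. The rough software model below is purely illustrative and does not model the voltage-overscaling error behavior.

```python
import numpy as np

def rham_distance(query, stored, block=4, dropped_blocks=0):
    # The crossbar is split into 4-bit blocks (2,500 blocks for D = 10,000);
    # each block reports its own mismatch count (0-4) and digital counters sum
    # them up. Dropping blocks models the structured-sampling optimization.
    per_block = (query != stored).reshape(-1, block).sum(axis=1)
    return int(per_block[dropped_blocks:].sum())
```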

Figure 3.11: Energy saving of R-HAM using structured sampling versus distributed voltage overscaling.

Figure 3.11 compares the energy saving of R-HAM using the two techniques of sampling and voltage overscaling. Targeting the maximum accuracy, the sampling technique achieves a relative energy saving of 9% by turning 250 blocks off, while the voltage overscaling technique achieves almost 2× higher saving by reducing the voltage for 1,000 blocks. This trend of relative energy saving is consistent when targeting the moderate accuracy: 22% by turning off 750 blocks, and 50% by reducing the voltage for all 2,500 blocks. However, R-HAM cannot maintain such linear energy saving beyond 2,500 bits of error in the distance. This is because all the blocks are already under voltage overscaling, and accepting more than 2,500 bits of error requires some blocks to accept a Hamming distance error of 2. The energy gain that a block of R-HAM can achieve by accepting a distance error of 2 (by operating at 720 mV) is very similar to that of a distance error of 1 (i.e., 780 mV).

3.5.3 A-HAM: Analog-based Hyperdimensional Associative Search

We propose an analog HAM (A-HAM) that exploits the timing characteristics of the ML, observing the discharging current to compute the distance. Figure 3.12(a) shows the overall architecture of the proposed A-HAM. We use a memristor device with a high OFF/ON resistance ratio [104] to design a ternary CAM (TCAM) cell. A-HAM consists of an array of TCAM cells forming a D × C crossbar similar to R-HAM. The A-HAM design searches for the query hypervector among all TCAM rows in parallel and then compares their currents using a set of Loser-Takes-All (LTA) blocks [105]. The LTA blocks form a binary tree with a height of log C.

Figure 3.12: Overview of A-HAM: (a) Resistive CAM array with LTA comparators; (b) Circuit details of two rows.

The ML discharging current is related to the number of mismatched cells. A buffer block senses the ML current and sends it to the LTA block to be compared with the next row. A row with a large number of mismatches has a higher discharging current. The goal is to find the row that has the minimum number of mismatches and thus the minimum discharging current. Therefore, the binary tree of LTA blocks compares the output currents of every two neighboring rows to find the row that has the minimum Hamming distance from the query hypervector. However, such a current comparison cannot be directly scaled to large dimensions, because the discharging currents of the rows will be so close that the LTA does not have enough precision to find the row with the minimum distance. Moreover, both the ML discharging current and the LTA blocks are sensitive to process and voltage variations. To address these issues, we propose the ML stabilizer and multistage search operation techniques described in the following sections.

Match Line Stabilizer: In conventional TCAMs, ML current saturation, described in the following, is the main limitation in identifying the number of mismatches. For instance,

LTA M 22 M24 M12 M14 minimum distance. Moreover, both ML discharging current and the LTA blocks are sensitive to MO4 MR1 MR2 Nearest Hamming distance Reset IOut (A) (B) the process and the voltage variations. To address these issues, we propose ML stabilizer and multistage search operation techniques in the following sections. Mtach Line Stabilizer: In conventional TCAMs, an ML current saturation, described in the following, is the main limitation in identifying the number of mismatches. For instance,

72 when a single cell in a row does not match with an input data, the ML discharges with I1 current. However, having two mismatched cells does not result in the same I1 leakage current on every cell causing a total discharging current of less than 2*I1. This non-linearity of current-voltage dependency is more pronounced in large dimensions, where having D > 7 cells has a minor impact on the total ML discharging current. This is so-called current saturation and occurs due to the ML voltage-current dependency. In the current saturation, a large number of mismatched cells drop the ML voltage immediately and decrease the passing current through each cell. This makes detecting the exact number of mismatches challenging. To identify the number of mismatches on a row, we need to have a fixed supply voltage on the ML during the search operation. In this condition, the ML voltage depends on the number of mismatched cells. In contrast to the conventional TCAMs which work based on ML discharging voltage, our design is current-based. A-HAM stabilizes the ML in a fixed voltage during the search operation and identifies the number of mismatched cells by tracking the current passing through the sense circuitry. Figure 3.12(b) shows two rows of the proposed A-HAM. Before the search operation, the ML charges using a precharge transistor (Mp). During the precharge mode, all A-HAM cells deactivate by connecting select lines of all cells to zero voltage. After precharge mode, the search operation starts on the TCAM cells. The select lines activate TCAM cells by the input data. Any cell that has a different value with the select line value discharges the ML voltage. In this case, the

MB1 transistor is activated and tries to fix the ML voltage to the supply voltage. Thus the currents of all mismatched cells pass through the MB1 and MB2 transistors. The input buffer mirrors the

MB1 current to a branch containing the MB3 transistor. This stage sends the data to the LTA block to be compared with the discharging current of the next row. The value of IL1 current linearly depends on the number of mismatches in each row. The precision of the detection and the number of mismatches depend on the accuracy of the LTA block. The LTA block accepts two input currents and returns a line with the lower current (See

73 Figure 3.12(b)). The LTA block is based on the current mirror and a reset circuitry which compares the currents in a different resolution. The bit resolution of this comparison identifies the number of mismatches that our design can detect. Multistage A-HAM: The ML discharging current increases the energy consumption of A-HAM to higher than that of a conventional TCAM. To address this issue, we use resistive devices with high ON resistance [104] to reduce the discharging current of each missed cell. However, the large ON resistance imposed the following issues: (1) It degrades the sense margin by decreasing the ON to OFF current ratio. Therefore, we use the memristor devices with very large OFF/ON resistance [104] to provide enough sense margin and stability. (2) In addition, the large ON resistance slows down the search operation by increasing the response time (T = RC). This creates a tradeoff between energy consumption and search speed. Our evaluation shows that

A-HAM with RON ∼ 500K and ROFF ∼ 100G is able to identify a Hamming distance of up to 512 bits using the LTA blocks with 10 bits resolution. However, increasing the dimension limits the distance difference that can be identified across the rows. Figure 3.13 shows the minimum detectable Hamming distance with increasing the dimension. For D=256 and lower, A-HAM provides a resolution of one bit in comparing Hamming distances of various rows. Increasing D=10,000 increases the minimum detectable distance to 43 bits, i.e., A-HAM using a single stage can not differentiate between Hamming distances lower than 42. This precision loss in comparing the distances can be improved by the following multistage technique. We observed that the ML voltage cannot be fixed during the search operation for the large dimensions. This degrades the resolution of identifying the distances for the hypervectors with D=256 and higher. Even using the LTA with higher resolution (¿10 bits) cannot provide acceptable accuracy. To address this issue, we split the search operation to multiple shorter stages and calculate the mismatched current of each part separately. Then, an additional current mirror circuit adds these partial currents (I1 and I2 currents in the example of Figure 3.14) in node A. The LTA compares the IL currents for different rows. In A-HAM, the minimum detectable distance

Figure 3.13: Minimum detectable distance in A-HAM.

Figure 3.13 shows the minimum distance that can be identified by A-HAM using multiple stages; the top x-axis shows the number of stages and the LTA bit resolution used to achieve each distance resolution. This multistage search technique extends the dimensionality for which a minimum detectable distance of 1 bit is achievable to D=512. For D=10,000, the minimum detectable Hamming distance is improved to 14 bits using 14 stages. This precision in distinguishing the distances is sufficient to maintain the classification accuracy. We observe that the minimum Hamming distance between any learned language hypervector and the other 20 learned language hypervectors is 22, and the next minimum distance is 34. Intuitively, hypervectors within a language family should be closer to each other than hypervectors of unrelated languages. Therefore, LTA blocks with a minimum detectable Hamming distance lower than 22 bits do not impose any misclassification (the border is shown in Figure 3.13). However, increasing the dimensionality, or variations in the process and the voltage of the LTA blocks, can increase the minimum detectable distance and degrade the classification accuracy. Accuracy-Energy Tradeoff in A-HAM: A-HAM has three main sources of energy consumption: the resistive CAM, the sense circuitry, and the LTA blocks.

For a CAM with a large number of rows, the input buffers slow down the search operation and dominate the CAM energy. The sense circuitry fixes the ML voltage level and works as a sense amplifier; its energy consumption is related to the average time for which the search operation continues. Our results show that the LTA blocks are the main source of A-HAM energy consumption at large sizes. The LTA bit width can be reduced to lower the energy consumption at the cost of a loss in classification accuracy. For D=10,000, we optimize the bit width of the LTA blocks to 14 bits (and 11 bits) such that A-HAM with 14 stages can meet the maximum (and the moderate) classification accuracy while improving its relative energy-delay product by 1.3× (and 2.4×).

Figure 3.14: Multistage A-HAM architecture.
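To make the multistage search concrete, the following Python sketch models it at a purely functional level: each stage contributes a partial mismatch count (standing in for that stage's partial ML current), the partial counts are summed per row (the current-mirror addition at node A), and an LTA-style comparison returns the row with the smallest total. The stage count and the data here are illustrative stand-ins, not values extracted from the circuit.

```python
import numpy as np

def multistage_search(query, rows, n_stages=14):
    """Functional stand-in for the multistage A-HAM search.

    Each stage counts the mismatches on one slice of the dimensions, the
    partial counts are summed per row, and the row with the smallest total
    wins (the LTA comparison).  The real design additionally limits how
    finely the totals can be compared by the LTA bit resolution; that
    quantization is omitted here.
    """
    D = query.shape[0]
    bounds = np.linspace(0, D, n_stages + 1, dtype=int)
    totals = [sum(np.count_nonzero(query[a:b] != row[a:b])
                  for a, b in zip(bounds[:-1], bounds[1:]))
              for row in rows]
    return int(np.argmin(totals)), totals

rng = np.random.default_rng(0)
rows = rng.integers(0, 2, size=(4, 10_000))      # 4 stored binary hypervectors
query = rows[2] ^ (rng.random(10_000) < 0.003)   # ~30 flipped bits
print(multistage_search(query, rows))            # row 2 has the smallest distance
```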

3.5.4 Comparison of Different HAMs

Energy Comparison: Figure 3.15 shows the energy-delay product of R-HAM and A-HAM normalized to D-HAM for various errors in Hamming distance. In D-HAM, the energy-delay improves linearly with increasing errors, as D-HAM excludes more dimensions during the distance calculation. R-HAM shows a higher rate of energy-delay saving compared to D-HAM thanks to applying voltage overscaling to more blocks. The saving rate is even faster for A-HAM, achieved by reducing the resolution of the LTA blocks. Targeting the maximum (or the moderate) classification accuracy, R-HAM achieves 7.3× (9.6×) and A-HAM achieves 746× (1347×) lower energy-delay product compared to D-HAM.

Figure 3.15: Energy-delay of the HAMs with accuracy.

Overall, A-HAM is highly amenable to use when lower classification accuracy is acceptable: by switching from the maximum accuracy to the moderate accuracy, A-HAM achieves a 2.4× lower energy-delay product, while R-HAM achieves 1.4×. This improvement in A-HAM is due to the faster search delay of LTA blocks with low bit-width resolution. However, the search latency in R-HAM does not change with lower accuracy, since the voltage overscaling can only be applied to the CAM blocks. Area Comparison: The area comparison of D-HAM, R-HAM, and A-HAM using D=10,000 and C=100 is shown in Figure 3.16. D-HAM consumes most of its area in the CAM array for the bit-level comparison; it is also penalized by its interconnect complexity. The area of R-HAM is 1.4× lower than D-HAM because of the high-density memristive elements used in the CAM design. However, R-HAM cannot fully utilize such a dense technology, as it requires inserting digital counters and comparators for every 4-bit block. A-HAM resolves this issue by using current-based searching with analog circuitry, which allows every CAM stage to include ≈700 memristive bits. Overall, A-HAM achieves 3× lower area than D-HAM, and its LTA blocks occupy 69% of the total A-HAM area. Limitations: In a nutshell, among the HAM designs, A-HAM exhibits the best energy-delay scaling with both increasing dimensionality and lowering the classification accuracy

Figure 3.16: Area comparison between the HAMs.

(See Figure 3.15). A-HAM also surpasses the other designs in area (See Figure 3.16). However, R-HAM shows a slightly lower rate of energy-delay increase as the number of classes grows. Nevertheless, R-HAM cannot fully exploit the high density of the memristive elements, since its digital counters and comparators have to be interleaved among the 4-bit blocks of the crossbar; R-HAM is also very sensitive to any voltage variation, because the crossbar is already voltage-overscaled to 0.78 V to accept a 1-bit mismatch among the 4 bits (See Figure 3.10(c)). In the following, we assess the limitations of A-HAM. In A-HAM, the LTA capability to detect the minimum Hamming distance can be significantly affected by variations in process and voltage. To assess such susceptibility, we model the variations in the transistor length and the threshold voltage using a Gaussian distribution with

a 3σ of 0% to 35% of the absolute parameter values [106]; we also consider 5% and 10% variation on the supply voltage of the LTA blocks, which reduce the supply voltage from the nominal 1.8 V to a minimum of 1.71 V and 1.68 V, respectively. Figure 3.17 shows that increasing the process variation, especially under large voltage variations, significantly degrades the minimum Hamming distance that the LTA blocks can detect. At lower supply voltages, the process variation has a more destructive impact on the detectable Hamming distance than at the nominal voltage.

Figure 3.17: Impact of process and voltage variations for the minimum detectable Hamming distance in A-HAM.

As shown, A-HAM with the nominal supply voltage and more than 15% process variation could degrade the classification accuracy below the moderate level; this also holds for a 5% (or 10%) voltage variation combined with more than 10% (or 5%) process variation. Considering a 35% process variation, A-HAM with the nominal voltage, 5%, and 10% voltage variations achieves 94.3%, 92.1%, and 89.2% classification accuracy, respectively. This might limit the usage of A-HAM at smaller feature sizes or in situations with a low signal-to-noise ratio.

3.6 Conclusion

In summary, we introduced hyperdimensional computing as an alternative computing method to perform efficient and robust classification. We showed how to use HD mathematics to encode text-like data and feature vectors and perform classification in high-dimensional space. Then, we introduced several algorithm-hardware optimizations that revisit the learning and inference phases of HD computing for efficient hardware implementation. QuantHD is a framework for quantizing the HD model for efficient inference. SearcHD performs fully binary

learning in HD computing, supporting single-pass training. We designed novel architectures to accelerate HD classification on emerging in-memory platforms. As compared to an optimized digital implementation, HD classification can provide 1347× higher energy efficiency while providing the same accuracy. In the next chapter, we explain how to use HD computing in IoT systems for secure collaborative learning. This chapter contains material from “QuantHD: A Quantization Framework for Hyperdimensional Computing”, by Mohsen Imani, Samuel Bosch, Sohum Datta, Sharadhi Ramakrishna, Sahand Salamat, Jan M. Rabaey, and Tajana Rosing, which appears in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, November 2019 [2]. The dissertation author was the primary investigator and author of this paper. This chapter contains material from “SearcHD: A Memory-Centric Hyperdimensional Computing with Stochastic Training”, by Mohsen Imani, Xunzhao Yin, John Messerly, Saransh Gupta, Michael Nemier, Xiaobo Sharon Hu, and Tajana Rosing, which appears in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, November 2019 [3]. The dissertation author was the primary investigator and author of this paper. This chapter contains material from “Exploring Hyperdimensional Associative Memory”, by Mohsen Imani, Abbas Rahimi, Deqian Kong, Tajana Rosing, and Jan M. Rabaey, which appears in the IEEE International Symposium on High-Performance Computer Architecture, February 2017 [4]. The dissertation author was the primary investigator and author of this paper.

Chapter 4

Collaborative Learning with Hyperdimensional Computing

4.1 Introduction

In the previous chapter, we proposed an HD computing-based learning approach which encodes the data into hypervectors and performs the rest of the learning procedure on a single node. In practice, the learning in many IoT systems is done with data that is held by a large number of devices. To analyze the collected data using machine learning algorithms, IoT systems typically send the data to a centralized location, e.g., local servers, cloudlets, and data centers [107, 108, 109]. However, sending the data not only consumes a large amount of bandwidth and battery power, but is also undesirable due to privacy and security concerns [110, 111, 112, 25, 113, 114]. Many machine learning models require unencrypted data, e.g., original images, to train models and perform inference. When offloading the computation tasks, sensitive information is exposed to the untrustworthy cloud system, which is susceptible to internal and external attacks [115, 116]. The users may also be unwilling to share the original data with the cloud and other users [117, 118, 119, 120]. An existing strategy applicable to this scenario is to use Homomorphic Encryption (HE). HE enables encrypting the raw data and allows certain operations to be performed directly on the ciphertext without decryption [121]. However, this approach significantly increases the computation burden. For example, in our evaluation with Microsoft SEAL, a state-of-the-art homomorphic encryption library [122], it takes around 14 days to encrypt all of the 28x28 pixel images in the entire MNIST dataset, and the encryption increases the data size by 28 times. More recently, Google presented a protocol for secure aggregation of high-dimensional data that can be used in a federated learning model [123]. This approach trains Deep Neural Networks (DNNs) when data is distributed over different users. In this technique, the users’ devices run the DNN training task locally to update the global model. However, IoT edge devices often do not have enough computation resources to perform such complex DNN training. In this chapter, we design SecureHD, an efficient, scalable, and secure collaborative

learning framework for distributed computing in the IoT hierarchy. HD computing does not require the complete knowledge of the original data that conventional learning algorithms need – it runs with a mapping function that encodes data to a high-dimensional space. The original data cannot be reconstructed from the mapped data without knowing the mapping function, resulting in secure computation. We address several technical challenges to enable HD-based trustworthy, collaborative learning. To map the original data into hypervectors, it uses a set of randomly-generated base hypervectors, as described in Chapter 3. Since the base hypervectors can be used to estimate the original data, every user has to have different base hypervectors to ensure the confidentiality of the data. However, in this case, the HD computation cannot be performed with the data provided by different users. SecureHD fills the gap between the existing HD computing and trustworthy, collaborative learning by providing the following contributions: i) We design a novel secure collaborative learning protocol that securely generates and distributes public and secret keys. SecureHD utilizes Multi-Party Computation (MPC) techniques, which are proven to be secure when each party is untrusted [124]. With the generated keys, the user data are not revealed to the cloud server, while the server can still learn a model based on the data encoded by users. Since MPC is an expensive protocol, we carefully optimize it by replacing a part of its tasks with two-party computation. In addition, our design leverages MPC only for a one-time key generation operation. The rest of the operations, such as encoding, decoding, and learning, are performed without using MPC. ii) We propose a new encoding method that maps the original data with the secret key assigned to each user. Our encoding method significantly improves classification accuracy as compared to the state-of-the-art HD work [95, 4]. Unlike existing HD encoding functions, the proposed method encodes both the data and the metadata, e.g., data types and color depths, in a recovery-friendly manner. Since the secret key of each user is not disclosed to anyone, although

one may know the encoded data of other users, they cannot be decoded. iii) SecureHD provides a robust decoding method for the authorized user who has the secret key. We show that the cosine similarity metric widely used in HD computing is not suitable to recover the original data. We propose a new decoding method which recovers the encoded data in a lossless manner through an iterative procedure. iv) We present scalable HD-based classification methods for many practical learning problems which need the collaboration of many users, e.g., human activity and face image recognition. We propose two collaborative learning approaches: cloud-centric learning for the case that end-node devices do not have enough computing capability, and edge-based learning in which all the user devices participate in secure distributed learning. v) We also show a hardware accelerator design that significantly minimizes the costs paid for security. This enables secure HD computing on less-powerful edge devices, e.g., gateways, which are responsible for data encryption/decryption. We design and implement the proposed SecureHD framework on diverse computing devices in IoT systems, including a gateway-level device, a high-performance system, and our proposed hardware accelerator.

Figure 4.1: Motivational scenario.

Figure 4.2: Execution time of homomorphic encryption (13.8 days / 20.8 hours) and decryption (2.7 days / 1.6 hours) over the MNIST dataset.

4.2 Motivational Scenario

Figure 4.1 shows the scenario that we focus on in this chapter. The clients, e.g., user devices, send either their sensitive data or partially trained models in an encrypted form to the cloud. The cloud performs a learning task by collecting the encrypted information received from multiple clients. In our security model, we assume that a client cannot trust the cloud or the other clients. When requested by the user, the cloud sends back the encrypted data to the clients. The client then decrypts the data with its private key. As an existing solution, homomorphic encryption enables processing on the encrypted version of the data [121]. Figure 4.2 shows the execution time of a state-of-the-art homomorphic encryption library, Microsoft SEAL [122], for the MNIST training dataset, which includes 60,000 images of 28×28 pixels. We execute the library on two platforms that a client in IoT systems may use: a high-performance computer (Intel i7-8700K) and a Raspberry Pi 3 (ARM Cortex A53). The result shows that, even for this simple dataset of 47 MBytes, the execution time is significant, e.g., more than 13 days on ARM to encrypt. Another approach is to utilize secure Multi-Party Computation (MPC) techniques [125, 124]. In theory, any function which can be represented as a Boolean circuit with inputs from multiple parties can be evaluated securely without disclosing each party's input to anyone else. For

example, by describing the machine learning algorithm as a Boolean circuit with the learning data as inputs to the circuit, one can securely learn the model. However, such solutions are very costly in practice and are computation and communication intensive. In SecureHD, we only use MPC to securely generate and distribute users' private keys, which is orders of magnitude less costly than performing the complete learning task using MPC. The key generation step is a one-time operation, so the small cost associated with it is quickly amortized over time for future tasks.

Figure 4.3: Overview of SecureHD.

4.3 Secure Learning in HD Space

4.3.1 Security Model

In SecureHD, we consider the server and the other clients to be untrusted. More precisely, we consider the Honest-but-Curious (HbC) adversary model, where each party, the server or a client, is untrusted but follows the protocol. Neither the server nor the other clients are able to extract any information based on the data that they receive and send during the secure computation protocol. For the task of key generation and distribution, we utilize a secure MPC protocol which is proven to be secure in the HbC adversary model [124]. We also use the two-party Yao's Garbled Circuits (GC) protocol, which is also proven to be secure in the HbC adversary model [126]. The

86 intermediate results are stored as additive unique shares of PKey by each client and the server.

4.3.2 Proposed Framework

In this section, we describe the proposed SecureHD framework which enables trustworthy, collaborative HD computing. Figure 4.3 illustrates the overview of SecureHD. The first step is to create different keys for each user and the cloud based on an MPC protocol. To perform an HD learning task, the data are encoded with a set of base hypervectors. The MPC protocol creates the base hypervectors for the learning application, called global keys (GKeys). Instead of sharing the original GKeys with clients, the server distributes permutations of each GKey, i.e., a hypervector whose dimensions are randomly shuffled. Since each user has different permutations of the GKeys, called personal keys (PKeys), no one can decode the encoded data of others. The cloud has the dimension indexes used in the GKey shuffling, called shuffling keys (SKeys). Since the cloud does not have the GKeys, it cannot decrypt the encoded data of clients. This MPC-based key generation runs only once. After the key generation, each client can encode its data with its PKeys. SecureHD securely injects a small amount of information into the encoded data. We exploit this technique to store the metadata, e.g., data types, which are important to recover the entire original data. Once the encoded data is sent to the cloud, the cloud reshuffles the encoded data with the SKeys of the client. This allows the cloud to perform the learning task without accessing the GKeys or PKeys. With the SecureHD framework, the client can also decode the data from the encoded hypervectors. For example, once a client fetches the encoded data from the cloud storage service, it can exploit the framework to recover the original data using its own PKeys. Each client may also utilize specialized hardware to accelerate both the encoding and decoding procedures.
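To illustrate why the shuffled keys still let the cloud learn on a common basis, the following numpy sketch (hypothetical variable names; all keys are computed in one place purely to check the algebra, whereas in SecureHD the GKeys, PKeys, and SKeys are held by different parties and produced inside the MPC/Waksman machinery) encodes a record with a client's permuted PKeys and verifies that re-permuting the encoded hypervector with that client's SKey yields exactly the hypervector that would have been produced with the unshuffled GKeys.

```python
import numpy as np

rng = np.random.default_rng(1)
D, n = 10_000, 5                       # hypervector dimension, feature count

# Global base hypervectors (never revealed to clients or cloud in SecureHD).
gkeys = rng.choice([-1, 1], size=(n, D))

# Per-client shuffling key and the resulting personal keys.
skey = rng.permutation(D)              # held only by the cloud
pkeys = gkeys[:, skey]                 # held only by the client

features = rng.integers(0, 256, size=n)

# Client-side encoding with the PKeys.
h_client = (features[:, None] * pkeys).sum(axis=0)

# Cloud-side re-permutation with the SKey aligns the data to the GKey basis.
h_global = np.empty(D, dtype=h_client.dtype)
h_global[skey] = h_client              # undo the shuffle
assert np.array_equal(h_global, (features[:, None] * gkeys).sum(axis=0))
```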

Figure 4.4: MPC-based key generation.

Figure 4.5: Illustration of SecureHD encoding and decoding procedures

4.3.3 Secure Key Generation and Distribution

Figure 4.4 illustrates how our protocol securely creates the key hypervectors. The protocol runs in two phases: Phase 1, in which all clients and the cloud participate, and Phase 2, in which two parties, a single client and the cloud, participate. Recall that in order for the cloud server to be able to learn the model, all data have to be projected based on the same base hypervectors. Given the base hypervector and the encoded result, one can reconstruct the plaintext data. Therefore, all clients have to use the same key without anyone having access to the base hypervectors. We realize these two constraints at the same time with a novel hybrid secure computation solution.

88 In the first phase, we generate the base hypervectors, which we denote by GKey. The main idea is that the base hypervectors are generated collaboratively inside the secure Multi-Party Computation (MPC) protocol. At the beginning of the first phase, each party i inputs two sets

of random strings called Si and Si*. Each stream's length is D, where D is the dimension size of a hypervector. The MPC protocol computes the element-wise XOR (⊕) of all the provided bitstreams, and the resulting substream of D elements represents the global base hypervector, i.e., the GKey. Then, it

performs XOR of the GKey again with the Si* provided by each client. At the end of the first MPC protocol phase, the cloud receives Si* ⊕ GKey corresponding to each user i and stores these secret keys. Note that since Si and Si* are inputs from each user to the MPC protocol, they are not revealed to any other party during the joint computation. It can be seen that the server has a unique XOR-share of the global key GKey for each user. This, in turn, enables the server and each party to continue their computation in a point-to-point manner without involving other parties during the second phase. Our approach has the strong property that even if all other clients are dishonest and provide zero vectors as their share to generate the GKey, the security of our system is not hindered. The reason is that the GKey is generated as the XOR of the Si of all clients. That is, if one client generates its seed randomly, the global key will have a uniform random distribution. In addition, the server only receives an XOR-share of the global key. The XOR-sharing technique is equivalent to One-Time Pad encryption and is information-theoretically secure, which is superior to the security against computationally-bounded adversaries in standard encryption schemes such as the Advanced Encryption Standard (AES). We only use XOR gates in MPC, which are considerably less costly than non-XOR gates [127]. In the second phase, the protocol distributes the secret key for each user. Each party engages in a two-party secure computation using the GC protocol. The server's inputs are SKeyi and Si* ⊕ GKey, while the client's input is Si*. The global key GKey is securely reconstructed inside the GC protocol by XOR of the two shares: GKey = Si* ⊕ (Si* ⊕ GKey). The global key is then

shuffled based on the unique permutation bits held by the server (SKeyi). In order to avoid costly random accesses inside the GC protocol, we use the Waksman permutation network with SKeyi being the permutation bits [128]. The shuffled global key is sent back to the user, and we perform a single rotational shift of the GKey to generate the next base hypervector. We repeat this n times, where n is the required number of base hypervectors, e.g., the feature size. The permuted base hypervectors serve as the user's personal keys, called PKeys, for the projection. Once a user performs the projection with the PKeys, she can send the result to the server, and the server permutes it back based on SKeyi for the learning process.
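The essential property of this key generation — that the server only ever holds an XOR-share of the global key, yet the key can be rebuilt for the two-party phase — can be checked with a few lines of Python. This is only a plaintext illustration of the XOR algebra; in SecureHD the reconstruction and the Waksman shuffle happen inside the GC protocol, so no party sees the intermediate values.

```python
import numpy as np

rng = np.random.default_rng(2)
D, num_clients = 10_000, 3

# Phase 1: each client i contributes random bitstreams S_i and S_i*.
S      = rng.integers(0, 2, size=(num_clients, D), dtype=np.uint8)
S_star = rng.integers(0, 2, size=(num_clients, D), dtype=np.uint8)

gkey = np.bitwise_xor.reduce(S, axis=0)       # GKey = S_1 xor ... xor S_n

# The server stores only the share S_i* xor GKey for each client.
server_shares = S_star ^ gkey

# Phase 2 (per client): GKey is rebuilt as S_i* xor (S_i* xor GKey).
for i in range(num_clients):
    assert np.array_equal(S_star[i] ^ server_shares[i], gkey)
```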

4.4 SecureHD Encoding and Decoding

Figure 4.5 shows how the SecureHD framework performs the encoding and decoding of a client with the generated PKeys. The example is shown for an image input with n pixel values, {f1, ..., fn}. Our design encodes each input data point into a high-dimensional vector from the feature values (A). It exploits the PKeys, i.e., a set of base hypervectors for the client, where 0 and 1 in the PKeys correspond to -1 and 1 to form a bipolar hypervector ({−1,+1}^D).

We denote them by PKeys = {B1, ..., Bn}. To store the metadata with negligible impact on the encoded hypervector, we devise a method which injects several metadata items into small segments of an encoded hypervector. This method exploits another set of base vectors, {M1, ..., Mk} (B). We call them metavectors. The encoded data are sent to the cloud to perform HD learning. Once the encoded data is received from the cloud, SecureHD can also decode it back to the original domain. This is useful for other cloud services, e.g., cloud storage. This procedure starts with identifying the injected metadata (C). Based on the injected metadata, it figures out the base hypervectors that will be used in the decoding. Then, it reconstructs the original data from the decoded data (D). The key to the data recovery procedure is the value extraction algorithm, which retrieves both the metadata and the data.

Figure 4.6: Value extraction example.

4.4.1 Encoding in HD Space

Data Encoding The first step of SecureHD is to encode the input data into a hypervector, where an original data point has n features. We associate each feature with a hypervector. The features can have discrete values (e.g., alphabet letters in text), in which case we perform a direct mapping to hypervectors, or they can have a continuous range, in which case the values can be quantized and then mapped similarly to discrete features. Our goal is to encode each feature vector to a hypervector that has D dimensions, e.g., D = 10,000.

To differentiate each feature, we exploit a PKey for each feature value, i.e., {B1, B2, ..., Bn}, where n is the feature size of an original data point. Since the PKeys are generated from random bit streams, the different base hypervectors are nearly orthogonal [129]:

δ(Bi, Bj) ≈ 0   (0 < i, j ≤ n, i ≠ j).

The orthogonality of feature hypervectors is ensured as long as the hypervector dimension, D, is large enough compared to the number of features (D >> n) in the original data. Different features are combined by multiplying feature values with the corresponding

base hypervector, Bi ∈ {−1,+1}^D, and adding them for all the features. For example, where fi is

a feature value, the following equation represents the encoded hypervector, H:

H = f1 ∗ B1 + f2 ∗ B2 + ... + fn ∗ Bn.
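A minimal numpy sketch of this encoding step is shown below; the base hypervectors and feature values are randomly generated placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
D, n = 10_000, 784                           # e.g., one MNIST-sized image

# PKeys: n near-orthogonal bipolar base hypervectors {B_1, ..., B_n}.
B = rng.choice([-1, 1], size=(n, D))

f = rng.integers(0, 256, size=n)             # feature (pixel) values

# H = f_1*B_1 + f_2*B_2 + ... + f_n*B_n
H = (f[:, None] * B).sum(axis=0)

# Random base hypervectors are nearly orthogonal: cosine(B_i, B_j) ~ 0.
cos = (B[0] @ B[1]) / D
print(H.shape, round(float(cos), 3))
```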

If two original feature values are similar, their encoded hypervectors are also similar, thus providing the learning capability for the cloud without any knowledge of the PKeys. Please note that, with this encoding scheme, even if an attacker intercepts a sufficient number of hypervectors, the upper bound of the information leakage is the distribution of the data. This is because the hypervector does not preserve any information about the feature order, e.g., pixel positions in an image, and there are extremely many combinations of values in the hypervector elements, which grow exponentially as n increases. In the case that n is small, e.g., n < 20, we can simply add extra features drawn from a uniform random distribution, which does not affect the data recovery accuracy or the HD computation results. Metadata Injection A client may receive an encoded hypervector in a setting where SecureHD processes multiple data types. In this case, to identify the base hypervectors used in the prior encoding, it needs to embed additional information: the data identifier and metadata, such as the data type (e.g., image or text) and color depth. One naive way is to store this metadata as bits attached to the original hypervector. However, this does not keep the metadata secure. To embed the additional metadata into hypervectors, we exploit the fact that HD computing is robust to small modifications of hypervector elements. Let us consider a data hypervector as a concatenation of several partial vectors. For example, a single hypervector with D dimensions can be viewed as the concatenation of N different d-dimensional vectors, A1, ..., AN:

H = A1 ∥ A2 ∥ ··· ∥ AN

where D = N × d, and each Ai vector is called a segment. We inject the metadata into a minimal number of segments.

(The scalar multiplication, denoted by ∗, can make a hypervector that has integer elements, i.e., H ∈ N^D.)

Figure 4.5 shows the concatenation of a hypervector into N = 200 segments with d = 50 dimensions each. We first generate a random d-dimensional vector with bipolar values, Mi, called a metavector. A metavector corresponds to a metadata type. For example, M1 and M2 can correspond to the image and text types, while M3, M4, and M5 correspond to color depths, e.g., 2-bit, 8-bit, and 32-bit. Our design injects each Mi into one of the segments of the data hypervector. We add the metavector multiple times to better distinguish it from the values already stored in the segment. For example, if we inject the metavectors into the first segment, the following equation denotes the metadata injection procedure:

A1′ = A1 + C ∗ M1 + C ∗ M2 + ... + C ∗ Mk, where C is the number of injections for each metavector.
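The segment view and the injection step can be written out directly. The sketch below uses d = 50 and C = 128 as in the evaluation, with two hypothetical metavectors and a synthetic stand-in for the already-encoded data.

```python
import numpy as np

rng = np.random.default_rng(4)
D, d, C = 10_000, 50, 128
N = D // d                                    # number of segments (N = 200)

# Stand-in for an already-encoded data hypervector H = A_1 || A_2 || ... || A_N.
H = rng.integers(-200, 200, size=D)

# One random bipolar metavector per metadata type (hypothetical labels).
metavectors = {"type:image": rng.choice([-1, 1], size=d),
               "depth:8bit": rng.choice([-1, 1], size=d)}

# Inject each metavector C times into the first segment:
# A_1' = A_1 + C*M_1 + C*M_2 + ... + C*M_k
A1 = H[:d]                                    # view into the first segment
for M in metavectors.values():
    A1 += C * M

# Decoding later applies the same value-extraction metric to this segment to
# check, for every known metavector, whether roughly C copies are present.
```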

4.4.2 Decoding in HD Space

Value Extraction In many of today's applications, the cloud is used as storage, so the clients should be able to recover the original data from the encoded data. The key component of the decoding procedure is a new data recovery method that extracts the feature values stored in the encoded hypervectors. Let us consider an example of H = f1 ∗ B1 + f2 ∗ B2 + f3 ∗ B3, where

Bi is a base hypervector with D dimensions and fi is a feature value. The goal of the decoding procedure is to find fi for a given Bi and H. A possible way is to exploit the cosine similarity metric, δ. For example, if we measure the cosine similarity of the H and B1 hypervectors, δ(H, B1), a higher δ value represents a higher chance of the existence of B1 in H. Thus, one method may iteratively subtract one instance of B1 from H and check when the cosine similarity is zero, i.e., δ(H′, B1), where H′ = H − m ∗ B1.

Figure 4.6a shows an example of the cosine similarity for each Bi when f1 = 50, f2 = 26, and f3 = 77, and m changes from 1 to 120. The result shows that the similarity decreases as we subtract more instances of B1 from H. For example, the similarity is zero when m is close to


Figure 4.7: Iterative error correction procedure

fi as expected, and it takes negative values for further subtractions, since H′ then contains the term −B1. Regardless of the initial similarity of H with Bi, the cosine similarity is around zero when m is close to each feature value fi. However, there are two main issues in the cosine similarity-based value search. First, finding the feature values in this way needs an iterative procedure, slowing down the runtime of data recovery. In addition, it is more challenging when feature values are represented in floating point. Second, the cosine similarity metric may not give accurate results in the recovery. In our earlier example, the similarity for each fi is zero when m is 49, 29, and 78, respectively.

To efficiently estimate fi values, we exploit another approach that utilizes the random distribution of the hypervector elements. Let us consider the following equation:

H · Bi = fi ∗ (Bi · Bi) + Σ_{j ≠ i} fj ∗ (Bi · Bj).

Bi ·Bi is D since each element of the base hypervector is either 1 or -1, while Bi ·B j is almost zero due to their near-orthogonal relationship. Thus, we can estimate fi with the following equation, called value discovery metric:

fi ≈ H · Bi / D.
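A compact numpy sketch of the value-discovery metric, together with the iterative error-correction loop described in the next paragraphs, is shown below (synthetic features and base hypervectors).

```python
import numpy as np

rng = np.random.default_rng(5)
D, n = 10_000, 1_000                      # R = D / n = 10

B = rng.choice([-1, 1], size=(n, D))      # base hypervectors (the PKeys)
f = rng.integers(0, 256, size=n)          # original feature values
H = (f[:, None] * B).sum(axis=0)          # encoded hypervector

def extract(h):
    """Value-discovery metric: f_i ~ (h . B_i) / D for every base hypervector."""
    return (B @ h) / D

F = extract(H)                            # initial estimate F^1
for _ in range(20):
    dH = H - (F[:, None] * B).sum(axis=0) # re-encode the estimate, take the error
    if dH.var() < 1e-6:                   # termination: error variance converged
        break
    F = F + extract(dH)                   # F^{k+1} = F^k + E^k

recovered = np.rint(F).astype(int)
print(int((recovered == f).sum()), "of", n, "features recovered exactly")
```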

This metric yields an initial estimate of all feature values, say F^1 = {f1^1, ..., fn^1}. Starting with the initial estimate, SecureHD minimizes the error through an iterative procedure. Figure 4.7 shows the iterative error correction mechanism. We encode the estimated feature

vector, F^1, into the high-dimensional space, H^1 = {h1^1, ..., hD^1}. We then compute ∆H^1 = H − H^1 and apply the value extraction metric to ∆H^1. Since this yields the estimated error, E^1, in the original domain, we add it to the estimated feature vector for a better estimate of the actual features, i.e., F^2 = F^1 + E^1. We repeat this procedure until the estimated error converges. To determine the termination condition, we compute the variance of the error hypervector, ∆H^i, at the end of each iteration. Figure 4.6b shows how the variance changes when decoding four example hypervectors. For this experiment, we used two feature vectors whose size is either n = 1200 or n = 1000, where the feature values are uniform-randomly generated. We encoded each feature vector into two hypervectors with either D = 7,000 or D = 10,000. As shown in the results, the number of iterations required for accurate recovery depends on both the number of features in the original domain and the hypervector dimension. In the rest of the chapter, we use the ratio of the hypervector dimension to the number of features in the original domain, i.e., R = D/n, to evaluate the quality of the data recovery for different feature sizes. The larger the R ratio, the fewer retraining iterations are expected to be needed to sufficiently recover the data. Metadata Recovery We utilize the value extraction method to recover the metadata. We calculate how many times each metavector {M1, ..., Mk} is present in a segment. If the extracted number of instances of a metavector is similar to the actual C value that we injected, the metavector is considered to be in the segment. However, since the metavector has a small number of elements, i.e., d ≪ D dimensions, it might have a large error in finding the exact C value. Let us assume that, when injecting a metavector C times, the value extraction method identifies a value, Ĉ, in a range of [Cmin, Cmax]. The range also includes C. If the metavector does not exist, the value Ĉ will be approximately zero, i.e., in a range of [−ε, ε]. The amount of ε depends on the other information stored in the segment.

Figure 4.8: Relationship between the number of metavector injections and segment size.

Figure 4.8a shows the distribution of extracted values, Ĉ, when injecting 5 metavectors 10 times each (C = 10) into a single segment of a hypervector. These distributions are reported using a Monte Carlo simulation with 1500 randomly generated metavectors. The results show that the distributions of the existing and non-existing cases overlap, making the estimation difficult. However, as shown in Figure 4.8b, when using C = 128, there is a clear margin between these two distributions, which identifies the existence of a metavector. Figure 4.8c shows the distributions when we inject 8 metavectors into a single segment with C = 128. In that case, the two distributions overlap, i.e., there are a few cases where we cannot fully recover the metadata.
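A small Monte Carlo sketch of how the separation between these two distributions can be measured (and hence how the noise margin defined next can be estimated) is given below. The segment contents and the number of injected metavectors are arbitrary stand-ins, so the printed values only illustrate the procedure, not the results reported in Figure 4.8.

```python
import numpy as np

rng = np.random.default_rng(6)
d, C, k, trials = 50, 128, 5, 1500       # segment size, injections, metavectors

present, absent = [], []
for _ in range(trials):
    data = rng.integers(-200, 200, size=d)   # stand-in for data in the segment
    metas = rng.choice([-1, 1], size=(k, d))
    segment = data + C * metas.sum(axis=0)   # inject k metavectors, C times each
    probe = rng.choice([-1, 1], size=d)      # a metavector that was never injected
    present.append(segment @ metas[0] / d)   # extracted value for an injected one
    absent.append(segment @ probe / d)       # extracted value for an absent one

c_min, eps = float(np.min(present)), float(np.max(np.abs(absent)))
print(f"C_min={c_min:.1f}  eps={eps:.1f}  margin={c_min - eps:.1f}")
```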

We determine C so that the distance between Cmin and ε is larger than 0. We define this distance as the noise margin, NM = Cmin − ε. Figure 4.8d shows how many metavectors can be injected for different C values. The results show that the number of metavectors that we can inject saturates for larger C values. Since a large C and segment size, d, also have a higher chance of influencing the accuracy of the data recovery, we choose C = 128 and d = 50 for our evaluation.

Figure 4.9: Illustration of the classification in SecureHD: (a) centralized training and (b) federated training.

Data Recovery After recovering the metadata, SecureHD can recognize the data types and choose the base hypervectors for decoding. We subtract the metadata from the encoded hypervector and start decoding the main data. SecureHD utilizes the same value extraction method to identify the values for each base hypervector. The quality of data recovery depends on the dimension of the hypervectors in the encoded domain (D) and the number of features in the original space (n), i.e., R = D/n as defined in Section 4.4.2. Intuitively, with a larger R value, we can achieve higher accuracy during the data recovery at the expense of the size of the encoded data. For instance, when storing an image with n = 1000 pixels in a hypervector with D = 10,000 dimensions (R = 10), we expect to achieve high accuracy in the data recovery. In our evaluation, we observed that R = 7 is enough to ensure lossless data recovery in the worst case. In Section 4.6.4, we provide a more detailed discussion of how R impacts the accuracy of the recovery procedure.

4.5 Collaborative Learning in HD Space

4.5.1 Hierarchical Learning Approach

Figure 4.9 shows the HD-based collaborative learning in the high-dimensional space. In this chapter, we present two training approaches, centralized and federated training, which perform classification learning with a large amount of data provided by many clients. The cloud can perform the training procedures using the encoded hypervectors without explicit decoding. It only needs to permute the encoded data using the SKey of each client. Note that the permutation aligns the encoded data on the same GKey basis, even though the cloud does not have the GKeys. This reduces the cost of the learning procedure, and the data can be securely classified even on the untrustworthy cloud. The training procedure creates multiple hypervectors as the trained model, where each hypervector represents the pattern of data points in one class. We refer to them as class hypervectors. Approach 1: Centralized Training In this approach, the clients send the encoded hypervectors to the cloud. The cloud permutes them with the SKeys, and a trainer module combines the permuted hypervectors. The training is performed with the following sub-procedures. (i) Initial training: At the initial stage, it creates the class hypervectors for each class. As an example, for a face recognition problem, SecureHD creates two hypervectors representing “face” and “non-face”. These hypervectors are generated with element-wise addition of all encoded inputs which belong to the same class, i.e., one for “face” and the other for “non-face”. (ii) Multivector expansion: After training the initial HD model, we expand the initial model with cross-validation, so that each class has multiple hypervectors, up to a size of ρ. The key idea is that, when training with larger data, the model may need to capture more distinct patterns with different hypervectors. To this end, we first check the cosine similarity of each encoded hypervector against the trained model. If an encoded data point does not correctly match with its corresponding class, it means that the encoded hypervector has a distinct pattern as compared to the majority of all the

inputs in the class. For each class, we create a set that includes such mismatched hypervectors and the original model. We then choose the two hypervectors whose similarity is the highest among all pairs in the set, and update the set by merging the selected two into a new hypervector. This is repeated until the set includes only ρ hypervectors. (iii) Retraining: As the last step, we iteratively adjust the HD model over the same dataset to give higher weights to misclassified samples, which may often happen in a large dataset. We check the similarity of each encoded hypervector again with all existing classes. Let us assume that Ck^p is one of the class hypervectors belonging to the kth class, where p is the index of the multiple hypervectors

in the class. If an encoded hypervector Q belonging to the ith class is incorrectly classified to Cj^miss, we update the model by Cj^miss = Cj^miss − αQ and Ci^τ = Ci^τ + αQ

where τ = argmax_t δ(Ci^t, Q) and α is a learning rate in the range [0.0, 1.0]. In other words, in the case of misclassification, we subtract the encoded hypervector from the class hypervector to which it is incorrectly classified, while adding it to the class hypervector which has the highest similarity in the correct class. This procedure is repeated for a predefined number of iterations, and the final class hypervectors are used for future inference. Approach 2: Federated Training The clients may not have enough network bandwidth to send every encoded hypervector. To address this issue, we present the second approach, called federated training, as an edge-computing solution. In this approach, the clients individually train initial models, i.e., one hypervector for each class, using only their own encoded hypervectors. Once the cloud receives the initial models of all the clients, it permutes the models with the SKeys and

performs element-wise additions to create a global model, Ck, for each kth class. Since the cloud only knows the initial models of each client, the multivector expansion procedure is not performed in this approach, but we can still execute the retraining procedure explained in Section 4.5.1. To this end, the cloud re-permutes the global model and sends it back to each client. With the global model, each client performs the same retraining procedure. Let us assume that C̃k^i is the model retrained by the ith client. After the cloud aggregates all C̃k^i with the permutation, it updates the global model by Ck = Σi C̃k^i − (n − 1) ∗ Ck. This is repeated for a predefined number of iterations. This approach allows the clients to send only the trained class hypervectors for each retraining iteration, thus significantly reducing the network usage.

Figure 4.10: Comparison of SecureHD efficiency to the homomorphic algorithm in encoding and decoding.
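The update rules in this subsection translate directly into code. The sketch below is an illustrative, single-machine rendering (plain numpy, no permutation or encryption machinery, float class hypervectors assumed): retrain_step applies the misclassification-driven update of the centralized retraining, and federated_aggregate applies the global-model update used in federated training.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def retrain_step(classes, Q, true_label, alpha=0.05):
    """One retraining update; classes[k] is the list of hypervectors of class k."""
    sims = {(k, t): cosine(c, Q)
            for k, hvs in classes.items() for t, c in enumerate(hvs)}
    pred_k, pred_t = max(sims, key=sims.get)
    if pred_k != true_label:
        classes[pred_k][pred_t] -= alpha * Q        # C_j^miss -= alpha * Q
        tau = max(range(len(classes[true_label])),
                  key=lambda t: cosine(classes[true_label][t], Q))
        classes[true_label][tau] += alpha * Q       # C_i^tau  += alpha * Q

def federated_aggregate(global_model, local_models):
    """C_k = sum_i C~_k^i - (n - 1) * C_k for every class k (n = number of clients)."""
    n = len(local_models)
    return {k: sum(m[k] for m in local_models) - (n - 1) * global_model[k]
            for k in global_model}
```

Note that the model is only updated when a query is misclassified; correctly classified hypervectors leave the class hypervectors unchanged, matching the retraining rule above.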

4.5.2 HD Model-Based Inference

With the class hypervectors generated by either approach, we can perform the inference on any device, including the cloud and the clients. For example, the cloud can receive an encoded hypervector from a client and permute its dimensions with the SKey in the same way as in the training procedure. Then, it checks the cosine similarity of the permuted hypervector to all trained class hypervectors and labels it with the class of the most similar class hypervector. In the case of client-based inference, once the cloud sends the re-permuted class hypervectors to a client, the client can perform the inference for its encoded hypervector with the same similarity check.
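A minimal sketch of this inference step, assuming the trained model is available as one or more class hypervectors per class, is:

```python
import numpy as np

def classify(model, query):
    """Label `query` with the class of its most cosine-similar class hypervector.

    `model` maps each class label to a list of class hypervectors, as produced
    by either the centralized or the federated training.
    """
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(model, key=lambda k: max(cos(c, query) for c in model[k]))
```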

Table 4.1: Datasets (n: feature size, K: number of classes)

Data   | n   | K  | Size  | Train Size | Test Size | Description / State-of-the-art Model
MNIST  | 784 | 10 | 220MB | 60,000     | 10,000    | Handwritten recognition / DNN [133, 134]
ISOLET | 617 | 26 | 19MB  | 6,238      | 1,559     | Voice recognition / DNN [135, 136]
UCIHAR | 561 | 12 | 10MB  | 6,213      | 1,554     | Activity recognition (mobile) / DNN [137, 136]
PAMAP2 | 75  | 5  | 240MB | 611,142    | 101,582   | Activity recognition (IMU) / DNN [138]
EXTRA  | 225 | 4  | 140MB | 146,869    | 16,343    | Phone position recognition / AdaBoost [139]
FACE   | 608 | 2  | 1.3GB | 522,441    | 2,494     | Face recognition / AdaBoost [61]

4.6 Evaluation

4.6.1 Experimental Setup

We have implemented the SecureHD framework, including encoding, decoding, and learning in high-dimensional space, using C++. We evaluated the system on three different platforms: an Intel i7 7600 CPU with 16GB memory, a Raspberry Pi 3, and a Kintex-7 FPGA KC705. We also exploit a network simulator, NS-3 [130], for large-scale simulation. We verify the FPGA timing and the functionality of the encoding and decoding by synthesizing Verilog using the Xilinx Vivado Design Suite [131]. The synthesized code has been implemented on the Kintex-7 FPGA KC705 Evaluation Kit. We compare the efficiency of the proposed SecureHD with Microsoft SEAL, a state-of-the-art C++ implementation of a homomorphic encryption library [132]. For SEAL, we used the default parameters: polynomial modulus of n = 2048, coefficient modulus of q = 128 bits, plain modulus of t = 1 << 8, noise standard deviation of 3.9, and decomposition bit count of 16. We evaluate the proposed SecureHD framework with real-world datasets including human activity recognition, phone position identification, and image classification. Table 4.1 summarizes the evaluated datasets. The tested benchmarks range from relatively small datasets collected in a small IoT network, e.g., PAMAP2, to a large dataset which includes hundreds of thousands of images of facial and non-facial data. We also compare the classification accuracy of SecureHD on these datasets with the state-of-the-art learning models shown in the table.

4.6.2 Encoding and Decoding Performance

As explained in Section 4.3.3, SecureHD performs a one-time key generation to distribute the PKeys to each user using the MPC and GC protocols. This overhead comes mostly from the first phase of the protocol, since the second phase has been simplified with the two-party GC protocol. The cost of the protocol is dominated by network communication. In our simulation, conducted on our in-house network of 100 Mbps, it takes around 9 minutes to create D = 10,000 keys for 100 participants. Note that this runtime overhead is negligible, since the key generation happens only once before all future computation. We have also evaluated the encoding and decoding procedures running on each client. We compare the efficiency of SecureHD with Microsoft SEAL [132]. We run both the SecureHD framework and the homomorphic library on the ARM Cortex A53 and Intel i7 processors. Figure 4.10 shows the execution time of SecureHD and the homomorphic library to process a single data point. For SecureHD, we used R = 7 to ensure a 100% data recovery rate for all benchmark datasets. Our evaluation shows that SecureHD achieves on average 133× and 14.7× (145.6× and 6.8×) speedup for the encoding and decoding, respectively, as compared to the homomorphic technique running on the ARM architecture (Intel i7). The encoding of SecureHD running on the embedded device (ARM) is still 8.1× faster than the homomorphic encryption running on the high-performance client (Intel i7). We also evaluate the SecureHD efficiency of the FPGA implementation. We observe that the encoding and decoding of SecureHD on the FPGA achieve 626.2× and 389.4× (35.5× and 20.4×) faster execution as compared to the SecureHD execution on the ARM (Intel i7). For example, the proposed FPGA implementation is able to encode 2,600 data points and decode 1,335 data points per second for the MNIST images.

4.6.3 Evaluation of SecureHD Learning

Learning Accuracy Based on the proposed SecureHD, clients can share the information with the cloud in a secure way, such that the cloud cannot understand the original data while still performing the learning tasks. Along with the two proposed learning approaches, we also evaluate the state-of-the-art HD classification approach, called the one-shot HD model, which trains the model using a single hypervector per class with no retraining [4, 95]. For the centralized training, we trained two models, one that has 64 class hypervectors for each class and the other with 16 for each class. We call them Centralized-64 and Centralized-16. The retraining procedure was performed for 100 iterations with α = 0.05, since the classification accuracy converged with this configuration for all the benchmarks.

Figure 4.11: SecureHD classification accuracy.

Figure 4.11 shows the classification accuracy of SecureHD for the different benchmarks. The results show that the centralized training approach achieves high classification accuracy, comparable to state-of-the-art learning methods such as DNN models. We also observed that, by training more hypervectors per class, it can provide higher classification accuracy. For example, for the federated training approach, which does not use multivectors, the classification accuracy is 90% on average, which is 5% lower than Centralized-64. As compared to the state-of-the-art one-shot HD model, which does not retrain models, Centralized-64 achieves 15.4% higher classification accuracy on average. Scalability of SecureHD Learning As discussed in Section 4.5.1, the proposed learning method is designed to effectively handle a large amount of data. To understand the scalability of the proposed learning method, we evaluate how the accuracy changes when the training data

come from different numbers of clients, with simulation on NS-3 [130]. In this experiment, we use three datasets, PAMAP2, EXTRA, and FACE, which include information about where the data points originated. For example, PAMAP2 and EXTRA are gathered from 7 and 56 individual users, respectively. Similarly, the FACE dataset includes 100 clients that have different facial images from each other. Figures 4.12a and b show the accuracy changes for the centralized and federated training approaches. The results show that increasing the number of clients improves the classification accuracy by training with more data. Furthermore, as compared to the one-shot HD model, the two proposed approaches show better scalability in terms of accuracy. For example, the accuracy difference between the proposed approach and the one-shot model grows as more clients engage in the training. Considering the centralized training, the accuracy difference for the FACE dataset is 5% when trained with one client, while it is 14.7% for the 60-client case. This means that the multivector expansion and retraining techniques are effective for learning with a large amount of data. We also verify how the SecureHD learning methods work under the constrained network conditions that often occur in IoT systems. In our network simulation, we assume the worst-case network condition, i.e., all clients share the bandwidth of a standard WiFi 802.11 network. Note that this is a worst-case scenario; in practice, each embedded device may not share the same network. Figure 4.12c shows that the network bandwidth limits the number of hypervectors that can be sent each second as more clients are involved in the learning task. For example, a network with 100 clients can send 23.6× fewer hypervectors than in the single-client case. As discussed before, federated learning can be exploited to overcome the limited network bandwidth at the expense of accuracy loss. Another solution is to use a reduced dimension in the centralized learning. As shown in Figure 4.12c, when D = 1,000, clients can send the data to the cloud at 353 samples per second, which is 10 times higher than in the case of D = 10,000. Figure 4.12d shows how the learning accuracy changes for different dimension settings. The results show that reducing the hypervector dimension to 4000 and 1000 dimensions has less

than 1.4% and 5.3% impact on the classification accuracy, respectively. This strategy gives another choice in the trade-off between accuracy and network communication cost.

Figure 4.12: Scalability of SecureHD classification: (a) centralized, (b) federated, (c) sample rate simulation, and (d) dimension reduction.

4.6.4 Data Recovery Trade-offs

As discussed in Section 4.4.2, the proposed SecureHD framework provides a decoding method for the authorized user that has the original PKeys used in the encoding. Figure 4.13a shows the data recovery rate for images with different pixel precisions. To verify the proposed recovery method in the worst-case scenario, we created 1000 images whose pixel values are randomly chosen, and we report the average error when we map the 1000 images to D = 10,000 dimensions. The x-axis shows the ratio R = D/n of the hypervector dimension (D) to the number of pixels (n) in an image. The data recovery rate depends on the precision of the pixel values. For high-resolution images, SecureHD requires a larger R value to ensure 100% accuracy. For instance, for images with 32-bit pixel resolution, SecureHD can achieve 100% data recovery using R = 7, while lower-resolution images (e.g., 16-bit and 8-bit) require R = 6 to ensure 100% data recovery.

Figure 4.13: Data recovery accuracy of SecureHD for (a) images and (b) text.

Figure 4.14: Example of image recovery.

Our evaluation shows that our method can decode any input image with a 100% data recovery rate using R = 7. This means that we can securely encode data with a 4× smaller size compared to the homomorphic encryption library, which increases the data size by 28 times through the encryption. We also evaluate the SecureHD framework with a text dataset written in three different European languages [98]. Figure 4.13b shows the accuracy of data recovery for the three languages. The x-axis is the ratio of the length of the hypervectors to the number of characters in the text when D = 10,000. Our method assigns a single value to each alphabet letter and encodes the texts with the hypervectors. Since the number of characters in these languages is at most 49, we require at most 6 bits to represent each letter. In terms of the data recovery, this is equivalent to encoding an image of the same size with 6-bit pixel resolution. Our evaluation shows that SecureHD can provide a 100% data recovery rate with R = 6. Figure 4.14 shows the quality of the data recovery for two example images. The Lena and MNIST images have 100 × 100 pixels and 28 × 28 pixels, respectively. Our encoding maps the

input data to hypervectors with different dimensions. For example, the Lena image with R = 6 means that the image has been encoded with D = 60,000 dimensions. Our evaluation shows that SecureHD can achieve lossless data recovery on the Lena photo when R ≥ 6, while with R = 5 and R = 4 the data recovery rates are 93% and 68%, respectively. Similarly, R = 5 and R = 4 provide 96% and 56% data recovery for the MNIST images.

4.7 Conclusion

In this chapter, we presented a novel framework, called SecureHD, which provides secure data encoding and learning based on HD computing. With our framework, clients can securely send their data to an untrustworthy cloud, while the cloud can perform the learning tasks without knowledge of the original data. Our proof-of-concept implementation demonstrates that the proposed SecureHD framework successfully performs the encoding and decoding tasks with high efficiency, e.g., 145.6× and 6.8× faster than the state-of-the-art encryption/decryption library [122]. Our learning method achieves an accuracy of 95% on average for diverse practical learning tasks, which is comparable to state-of-the-art learning methods [133, 136, 138, 139, 61]. In addition, SecureHD provides lossless data recovery with a 4× reduction in the data size compared to the existing encryption solution. This chapter contains material from “A Framework for Collaborative Learning in Secure High-Dimensional Space”, by Mohsen Imani, Yeseong Kim, Sadegh Riazi, John Merssely, Patrick Liu, Farinaz Koushanfar, and Tajana S. Rosing, which appears in IEEE Cloud Computing, July 2019 [5]. The dissertation author was the primary investigator and author of this paper.

Chapter 5

Summary and Future Work

With the emergence of the Internet of Things (IoT), devices are generating massive data streams. Running big data processing algorithms, e.g., machine learning, on embedded devices poses substantial technical challenges due to limited device resources. The goal of our research is to dramatically increase the computing efficiency as well as the learning capability of today’s computers. We identify opportunities for designing future learning and computing techniques that are intelligent, fast, efficient, and reliable. The main approach is to design brain-inspired algorithms based on hardware and technology requirements. In this dissertation, we propose two classes of solutions for IoT learning. We show how to (i) accelerate deep neural networks using processing in-memory (Chapter 2), and (ii) design hyperdimensional computing for efficient and robust learning systems (Chapters 3 and 4). The following sections summarize the contributions of this dissertation and outline future directions.

5.1 Thesis Summary

Deep Learning Acceleration: Running data/memory-intensive workloads on traditional cores results in high energy consumption and slow processing, primarily due to the large amount of data movement between memory and processing units. In Chapter 2, we propose FloatPIM, a digital processing in-memory (PIM) platform capable of accelerating deep learning in both the training and inference phases [1, 59]. FloatPIM supports vector-based PIM computation over commercially available memory devices, i.e., Intel 3D XPoint, and provides the following advantages: (1) it enables highly parallel and scalable computation on digital data stored in memory, which eliminates ADC/DAC blocks; (2) it addresses the internal data movement issue by enabling in-place computation where the big data is stored; and (3) it natively supports floating-point precision, a necessity for many scientific applications including deep learning training. FloatPIM precisely computes both the training and inference phases on digital data stored in storage-class memory, which guarantees a highly accurate and reliable model. Our evaluation shows that FloatPIM with 16-bit floating-point precision achieves up to 5.1% higher classification accuracy as well as 303× faster and 48× more energy-efficient training compared to a state-of-the-art GPU architecture.

Brain-Inspired Hyperdimensional Computing: In Chapter 3, we develop a hyperdimensional (HD) computing system that not only accelerates machine learning in hardware but also redesigns the algorithms themselves using strategies that more closely model the human brain. HD computing is motivated by the observation that key aspects of human memory, perception, and cognition can be explained by the mathematical properties of high-dimensional spaces. HD computing mimics several desirable properties of the human brain, including robustness to noise and hardware failure, and single-pass learning, in which training happens in one shot without storing the training data points or using complex gradient-based algorithms. These features make HD computing a promising solution for (1) today’s embedded devices with limited storage, battery, and resources, and (2) future computing systems in deeply scaled technologies that will exhibit high noise and variability.

As we explained in Chapter 3, our platform includes: (1) novel HD algorithms supporting classification [23, 24], (2) two novel training approaches for binarizing the HD computing model, and (3) novel HD hardware accelerators using both CMOS and emerging NVM technology [4, 2]. In Chapter 4, we also introduced an infrastructure that makes it easy for users to integrate HD computing into a larger system and enables secure distributed learning on encrypted information [5]. Our evaluations show that HD computing is 39× faster and 56× more energy efficient than a state-of-the-art deep learning accelerator running on the same in-memory platform.
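To illustrate what single-pass learning means in practice, the sketch below trains an HD classifier by bundling each encoded training sample into its class hypervector and classifies queries with an associative (cosine-similarity) search. The random-projection encoder, the dimensionality, and the synthetic data are illustrative assumptions rather than the specific encoders, datasets, or accelerators of Chapter 3.

```python
import numpy as np

D = 10_000                      # hypervector dimensionality
rng = np.random.default_rng(1)

def encode(x, projection):
    # Illustrative encoder: a random projection followed by a sign
    # non-linearity yields a bipolar hypervector for the sample.
    return np.sign(projection @ x)

def train_single_pass(X, y, projection, num_classes):
    # Single-pass training: each encoded sample is simply added (bundled)
    # into its class hypervector -- no gradients, no stored training data.
    model = np.zeros((num_classes, D))
    for x, label in zip(X, y):
        model[label] += encode(x, projection)
    return model

def classify(x, model, projection):
    # Associative search: return the class whose hypervector has the
    # highest cosine similarity with the encoded query.
    h = encode(x, projection)
    sims = model @ h / (np.linalg.norm(model, axis=1) * np.linalg.norm(h) + 1e-12)
    return int(np.argmax(sims))

# Tiny synthetic task: two Gaussian clusters in a 64-dimensional feature space.
n_features, num_classes = 64, 2
projection = rng.standard_normal((D, n_features))
X = np.vstack([rng.normal(+1.0, 1.0, (200, n_features)),
               rng.normal(-1.0, 1.0, (200, n_features))])
y = np.array([0] * 200 + [1] * 200)

model = train_single_pass(X, y, projection, num_classes)
query = rng.normal(+1.0, 1.0, n_features)
print("predicted class:", classify(query, model, projection))
```

Because the model is built in one pass using only additions and similarity comparisons, it maps naturally onto the digital, resistive, and analog associative-memory accelerators described in Chapter 3.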

5.2 Future Directions

IoT systems and applications continue to evolve while introducing new problems such as automated online learning and edge-based computing. Our future plan is to design a PIM architecture along the following directions: (1) developing a cross-layer infrastructure that can automatically analyze an application’s code and decide which portions should run on PIM and which on a general-purpose processor; (2) exploiting the potential of other emerging technologies, e.g., carbon nanotube FETs, ferroelectric FETs, and spintronics, to support fast and efficient PIM operations; and (3) enabling PIM to support essential machine learning operations (e.g., ranking or sorting) with fewer sequential cycles and little or no impact on technology endurance and reliability. The proposed general-purpose framework will consist of three components. First, middleware that identifies program blocks that can be effectively accelerated using PIM operations. Second, a new instruction set architecture (ISA) that supports operations executed at both the inter-block and intra-block levels to enable high degrees of parallelism. Third, an underlying PIM hardware that executes the requested tasks across multiple memory blocks in a massively parallel way.

Bibliography

[1] M. Imani, S. Gupta, Y. Kim, and T. Rosing, “Floatpim: In-memory acceleration of deep neural network training with high precision,” in Proceedings of the ISCA. ACM, 2019.

[2] M. Imani, S. Bosch, S. Datta, S. Ramakrishna, S. Salamat, J. M. Rabaey, and T. Rosing, “Quanthd: A quantization framework for hyperdimensional computing,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019.

[3] M. Imani, X. Yin, J. Messerly, S. Gupta, M. Niemier, X. S. Hu, and T. Rosing, “Searchd: A memory-centric hyperdimensional computing with stochastic training,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019.

[4] M. Imani, A. Rahimi, D. Kong, T. Rosing, and J. M. Rabaey, “Exploring hyperdimensional associative memory,” in High Performance Computer Architecture (HPCA), 2017 IEEE International Symposium on. IEEE, 2017, pp. 445–456.

[5] M. Imani, Y. Kim, S. Riazi, J. Messerly, P. Liu, F. Koushanfar, and T. Rosing, “A framework for collaborative learning in secure high-dimensional space,” in 2019 IEEE 12th International Conference on Cloud Computing (CLOUD). IEEE, 2019, pp. 435–446.

[6] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. De Freitas, “Predicting parameters in deep learning,” in Advances in neural information processing systems, 2013, pp. 2148–2156.

[7] A. Zaslavsky, C. Perera, and D. Georgakopoulos, “Sensing as a service and big data,” arXiv preprint arXiv:1301.0159, 2013.

[8] Y. Sun, H. Song, A. J. Jara, and R. Bie, “Internet of things and big data analytics for smart and connected communities,” IEEE Access, vol. 4, pp. 766–773, 2016.

[9] R. Khan, S. U. Khan, R. Zaheer, and S. Khan, “Future internet: the internet of things architecture, possible applications and key challenges,” in Frontiers of Information Technology (FIT), 2012 10th International Conference on. IEEE, 2012, pp. 257–260.

[10] G. Tzimpragos, A. Madhavan, D. Vasudevan, D. Strukov, and T. Sherwood, “Boosted race trees for low energy classification,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019, pp. 215–228.

[11] G. Gobieski, B. Lucia, and N. Beckmann, “Intelligence beyond the edge: Inference on intermittent embedded systems,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019, pp. 199–213.

[12] A. Holmes, M. R. Jokar, G. Pasandi, Y. Ding, M. Pedram, and F. T. Chong, “Nisq+: Boosting quantum computing power by approximating quantum error correction,” in Proceedings of the 47th International Symposium on Computer Architecture, 2020.

[13] P. Kanerva, “Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors,” Cognitive Computation, vol. 1, no. 2, pp. 139–159, 2009.

[14] L. Song, X. Qian, H. Li, and Y. Chen, “Pipelayer: A pipelined reram-based accelerator for deep learning.” HPCA, 2017.

[15] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory,” in Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 2016, pp. 27–39.

[16] “Intel and micron produce breakthrough memory technology.” http://newsroom.intel.com/community/intel_newsroom/blog/2015/07/28/intel-and-micron-produce-breakthrough-memory-technology.

[17] S. Gupta, M. Imani, B. Khaleghi, V. Kumar, and T. Rosing, “Rapid: A reram processing in-memory architecture for dna sequence alignment,” in 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED). IEEE, 2019, pp. 1–6.

[18] H. Nejatollahi, S. Gupta, M. Imani, T. S. Rosing, R. Cammarota, and N. Dutt, “Cryptopim: In-memory acceleration for lattice-based cryptographic hardware.”

[19] J. B. Kotra, M. Arjomand, D. Guttman, M. T. Kandemir, and C. R. Das, “Re-nuca: A practical nuca architecture for reram based last-level caches,” in 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2016, pp. 576–585.

[20] H. Saadeldeen, D. Franklin, G. Long, C. Hill, A. Browne, D. Strukov, T. Sherwood, and F. T. Chong, “Memristors for neural branch prediction: a case study in strict latency and write endurance challenges,” in Proceedings of the ACM International Conference on Computing Frontiers, 2013, pp. 1–10.

[21] K. K. Likharev, “Hybrid cmos/nanoelectronic circuits: Opportunities and challenges,” Journal of Nanoelectronics and Optoelectronics, vol. 3, no. 3, pp. 203–230, 2008.

[22] P. Kanerva, “Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors,” Cognitive Computation, vol. 1, no. 2, pp. 139–159, 2009.

[23] M. Imani, J. Hwang, T. Rosing, A. Rahimi, and J. M. Rabaey, “Low-power sparse hyperdimensional encoder for language recognition,” IEEE Design & Test, vol. 34, no. 6, pp. 94–101, 2017.

[24] M. Imani, J. Messerly, F. Wu, W. Pi, and T. Rosing, “A binary learning framework for hyperdimensional computing,” in DATE. IEEE/ACM, 2019.

[25] B. Khaleghi, M. Imani, and T. Rosing, “Prive-hd: Privacy-preserved hyperdimensional computing.”

[26] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, p. 436, 2015.

[27] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural networks, vol. 61, pp. 85–117, 2015.

[28] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in European conference on computer vision. Springer, 2014, pp. 184–199.

[29] L. Deng and D. Yu, “Deep learning: methods and applications,” Foundations and Trends® in Signal Processing, vol. 7, no. 3–4, pp. 197–387, 2014.

[30] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot, “Mastering the game of go with deep neural networks and tree search,” nature, vol. 529, no. 7587, p. 484, 2016.

[31] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[32] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in European Conference on Computer Vision. Springer, 2016, pp. 525–542.

[33] K. Seto, H. Nejatollahi, J. An, S. Kang, and N. Dutt, “Small memory footprint neural network accelerators,” in 20th International Symposium on Quality Electronic Design (ISQED). IEEE, 2019, pp. 253–258.

[34] M. Zhou, M. Imani, S. Gupta, Y. Kim, and T. Rosing, “Gram: graph processing in a reram-based computational memory,” in Proceedings of the 24th Asia and South Pacific Design Automation Conference. ACM, 2019, pp. 591–596.

[35] M. N. Bojnordi and E. Ipek, “Memristive boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning,” in High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on. IEEE, 2016, pp. 1–13.

[36] M. Zhou, M. Imani, S. Gupta, and T. Rosing, “Gas: A heterogeneous memory architecture for graph processing,” in Proceedings of the International Symposium on Low Power Electronics and Design. ACM, 2018, p. 27.

[37] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” in Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 2016, pp. 14–26.

[38] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.

[39] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaev, G. Venkatesh et al., “Mixed precision training,” arXiv preprint arXiv:1710.03740, 2017.

[40] M. Drumond, T. Lin, M. Jaggi, and B. Falsafi, “End-to-end dnn training with block floating point arithmetic,” arXiv preprint arXiv:1804.01526, 2018.

[41] C. Louizos, M. Reisser, T. Blankevoort, E. Gavves, and M. Welling, “Relaxed quantization for discretized neural networks,” arXiv preprint arXiv:1810.01875, 2018.

[42] C. De Sa, M. Leszczynski, J. Zhang, A. Marzoev, C. R. Aberger, K. Olukotun, and C. Ré, “High-accuracy low-precision training,” arXiv preprint arXiv:1803.03383, 2018.

[43] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in ACM Sigplan Notices, vol. 49, no. 4. ACM, 2014, pp. 269–284.

[44] K. Hegde, J. Yu, R. Agrawal, M. Yan, M. Pellauer, and C. W. Fletcher, “Ucnn: Exploiting computational reuse in deep neural networks via weight repetition,” arXiv preprint arXiv:1804.06508, 2018.

[45] C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, and G. Yuan, “Circnn: accelerating and compressing deep neural networks using block-circulant weight matrices,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2017, pp. 395–408.

[46] E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. Ong Gee Hock, Y. T. Liew, K. Srivatsan, D. Moss, and S. Subhaschandra, “Can fpgas beat gpus in accelerating next-generation deep neural networks?” in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2017, pp. 5–14.

[47] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: efficient inference engine on compressed deep neural network,” in Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 2016, pp. 243–254.

[48] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, and N. Sun, “Dadiannao: A machine-learning supercomputer,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2014, pp. 609–622.

[49] C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, D. Blaauw, and R. Das, “Neural cache: Bit-serial in-cache acceleration of deep neural networks,” arXiv preprint arXiv:1805.03718, 2018.

[50] S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, “Drisa: A dram-based reconfigurable in-situ accelerator,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2017, pp. 288–301.

[51] R. Sharifi and Z. Navabi, “Online profiling for cluster-specific variable rate refreshing in high-density dram systems,” in 2017 22nd IEEE European Test Symposium (ETS). IEEE, 2017, pp. 1–6.

[52] M. N. Bojnordi and E. Ipek, “The memristive boltzmann machines,” IEEE Micro, vol. 37, no. 3, pp. 22–29, 2017.

[53] M. Cheng, L. Xia, Z. Zhu, Y. Cai, Y. Xie, Y. Wang, and H. Yang, “Time: A training-in-memory architecture for memristor-based deep neural networks,” in Proceedings of the 54th Annual Design Automation Conference 2017. ACM, 2017, p. 26.

[54] Y. Cai, T. Tang, L. Xia, M. Cheng, Z. Zhu, Y. Wang, and H. Yang, “Training low bitwidth convolutional neural network on rram,” in Proceedings of the 23rd Asia and South Pacific Design Automation Conference. IEEE Press, 2018, pp. 117–122.

[55] Y. Cai, Y. Lin, L. Xia, X. Chen, S. Han, Y. Wang, and H. Yang, “Long live time: improving lifetime for training-in-memory engines by structured gradient sparsification,” in Proceedings of the 55th Annual Design Automation Conference. ACM, 2018, p. 107.

[56] B. Feinberg, U. K. R. Vengalam, N. Whitehair, S. Wang, and E. Ipek, “Enabling scientific computing on memristive accelerators,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018, pp. 367–382.

[57] M. Imani, M. Samragh, Y. Kim, S. Gupta, F. Koushanfar, and T. Rosing, “Rapidnn: In- memory deep neural network acceleration framework,” arXiv preprint arXiv:1806.05794, 2018.

[58] S. Gupta, M. Imani, H. Kaur, and T. S. Rosing, “Nnpim: A processing in-memory architecture for neural network acceleration,” IEEE Transactions on Computers, 2019.

[59] M. Imani, M. S. Razlighi, Y. Kim, S. Gupta, F. Koushanfar, and T. Rosing, “Deep learning acceleration with neuron-to-memory transformation,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 1–14.

[60] M. Imani, S. Gupta, and T. Rosing, “Genpim: Generalized processing in-memory to accelerate data intensive applications,” in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2018, pp. 1155–1158.

[61] Y. Kim, M. Imani, and T. Rosing, “Orchard: Visual object recognition accelerator based on approximate in-memory processing,” in Computer-Aided Design (ICCAD), 2017 IEEE/ACM International Conference on. IEEE, 2017, pp. 25–32.

[62] M. Imani, S. Gupta, S. Sharma, and T. Rosing, “Nvquery: Efficient query processing in non-volatile memory,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.

[63] M. Imani, D. Peroni, Y. Kim, A. Rahimi, and T. Rosing, “Efficient neural network acceleration on gpgpu using content addressable memory,” in 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2017, pp. 1026–1031.

[64] L. K. Hansen and P. Salamon, “Neural network ensembles,” IEEE transactions on pattern analysis and machine intelligence, vol. 12, no. 10, pp. 993–1001, 1990.

[65] S. Kvatinsky, D. Belousov, S. Liman, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, and U. C. Weiser, “Magic—memristor-aided logic,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 61, no. 11, pp. 895–899, 2014.

[66] S. Gupta, M. Imani, and T. Rosing, “Felix: Fast and energy-efficient logic in memory,” in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2018, pp. 1–7.

[67] A. Siemon, S. Menzel, R. Waser, and E. Linn, “A complementary resistive switch-based crossbar array adder,” IEEE journal on emerging and selected topics in circuits and systems, vol. 5, no. 1, pp. 64–74, 2015.

[68] S. Kvatinsky, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, and U. C. Weiser, “Memristor- based material implication (IMPLY) logic: design principles and methodologies,” TVLSI, vol. 22, no. 10, pp. 2054–2066, 2014.

[69] J. Borghetti, G. S. Snider, P. J. Kuekes, J. J. Yang, D. R. Stewart, and R. S. Williams, “Memristive switches enable stateful logic operations via material implication,” Nature, vol. 464, no. 7290, pp. 873–876, 2010.

[70] B. C. Jang, Y. Nam, B. J. Koo, J. Choi, S. G. Im, S.-H. K. Park, and S.-Y. Choi, “Memristive logic-in-memory integrated circuits for energy-efficient flexible electronics,” Advanced Functional Materials, vol. 28, no. 2, 2018.

[71] S. Kvatinsky, M. Ramadan, E. G. Friedman, and A. Kolodny, “Vteam: A general model for voltage-controlled memristors,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 62, no. 8, pp. 786–790, 2015.

[72] N. Talati, S. Gupta, P. Mane, and S. Kvatinsky, “Logic design within memristive memories using memristor-aided logic (magic),” IEEE Transactions on Nanotechnology, vol. 15, no. 4, pp. 635–650, 2016.

[73] M. Imani, S. Gupta, and T. Rosing, “Ultra-efficient processing in-memory for data intensive applications,” in Proceedings of the 54th Annual Design Automation Conference 2017. ACM, 2017, p. 6.

[74] A. Haj-Ali, R. Ben-Hur, N. Wald, and S. Kvatinsky, “Efficient algorithms for in-memory fixed point multiplication using magic,” in 2018 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2018, pp. 1–5.

[75] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y. Xie, “Overcoming the challenges of crossbar resistive memory architectures,” in 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2015, pp. 476–488.

[76] A. Nag, R. Balasubramonian, V. Srikumar, R. Walker, A. Shafiee, J. P. Strachan, and N. Muralimanohar, “Newton: Gravitating towards the physical limits of crossbar acceleration,” IEEE Micro, vol. 38, no. 5, pp. 41–49, 2018.

[77] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[78] F. Chollet, “keras,” https://github.com/fchollet/keras, 2015.

[79] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” CoRR, vol. abs/1603.04467, 2016.

[80] X. Dong, C. Xu, N. Jouppi, and Y. Xie, “Nvsim: A circuit-level performance, energy, and area model for emerging non-volatile memory,” in Emerging Memory Technologies. Springer, 2014, pp. 15–50.

[81] Synopsys, Inc., “Design Compiler user guide,” http://www.synopsys.com, 2000.

[82] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.

[83] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size,” arXiv preprint arXiv:1602.07360, 2016.

[84] P. Kanerva, Sparse distributed memory. MIT press, 1988.

[85] B. Babadi and H. Sompolinsky, “Sparseness and expansion in sensory representations,” Neuron, vol. 83, no. 5, pp. 1213–1226, Sep. 2014. [Online]. Available: http://view.ncbi.nlm.nih.gov/pubmed/25155954

[86] A. Mitrokhin, P. Sutor, C. Fermüller, and Y. Aloimonos, “Learning sensorimotor control with neuromorphic sensors: Toward hyperdimensional active perception,” Science Robotics, vol. 4, no. 30, p. eaaw6736, 2019.

[87] O. Räsänen and S. Kakouros, “Modeling dependencies in multiple parallel data streams with hyperdimensional computing,” IEEE Signal Processing Letters, vol. 21, no. 7, pp. 899–903, 2014.

[88] O. Räsänen and J. Saarinen, “Sequence prediction with sparse distributed hyperdimensional coding applied to the analysis of mobile phone use patterns,” IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–12, 2015.

[89] A. Joshi, J. Halseth, and P. Kanerva, “Language geometry using random indexing,” Quantum Interaction 2016 Conference Proceedings, In press.

[90] S. Jockel, “Crossmodal learning and prediction of autobiographical episodic experiences using a sparse distributed memory,” 2010.

[91] T. Wu, P. Huang, A. Rahimi, H. Li, J. Rabaey, P. Wong, and S. Mitra, “Brain-inspired computing exploiting carbon nanotube fets and resistive ram: Hyperdimensional computing case study,” in IEEE Intl. Solid-State Circuits Conference (ISSCC). IEEE, 2018.

[92] H. Li, T. F. Wu, A. Rahimi, K. Li, M. Rusch, C. Lin, J. Hsu, M. M. Sabry, S. B. Eryilmaz, J. Sohn, W. Chiu, M. Chen, T. Wu, J. Shieh, W. Yeh, J. M. Rabaey, S. Mitra, and H. P. Wong, “Hyperdimensional computing with 3d vrram in-memory kernels: Device-architecture co-design for energy-efficient, error-resilient language recognition,” in Electron Devices Meeting (IEDM), 2016 IEEE International. IEEE, 2016, pp. 16–1.

[93] M. Imani, D. Kong, A. Rahimi, and T. Rosing, “Voicehd: Hyperdimensional computing for efficient speech recognition,” in International Conference on Rebooting Computing (ICRC). IEEE, 2017, pp. 1–6.

[94] M. Imani, S. Salamat, B. Khaleghi, M. Samragh, F. Koushanfar, and T. Rosing, “Sparsehd: Algorithm-hardware co-optimization for efficient high-dimensional computing,” in 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2019, pp. 190–198.

[95] A. Rahimi, P. Kanerva, and J. M. Rabaey, “A robust and energy-efficient classifier using brain-inspired hyperdimensional computing,” in Proceedings of the 2016 International Symposium on Low Power Electronics and Design, 2016, pp. 64–69.

[96] M. Imani, J. Morris, S. Bosch, H. Shu, G. De Micheli, and T. Rosing, “Adapthd: Adaptive efficient training for brain-inspired hyperdimensional computing,” in 2019 IEEE Biomedical Circuits and Systems Conference (BioCAS). IEEE, 2019, pp. 1–4.

[97] M. Imani, C. Huang, D. Kong, and T. Rosing, “Hierarchical hyperdimensional computing for energy efficient classification,” in Proceedings of the 55th Annual Design Automation Conference. ACM, 2018, p. 108.

[98] U. Quasthoff, M. Richter, and C. Biemann, “Corpus portal for search in monolingual corpora,” in Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), 2006, pp. 1799–1802.

[99] I. Goiri, R. Bianchini, S. Nagarakatte, and T. D. Nguyen, “Approxhadoop: Bringing approximations to mapreduce frameworks,” in Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’15. New York, NY, USA: ACM, 2015, pp. 383–397. [Online]. Available: http://doi.acm.org/10.1145/2694344.2694351

[100] M. Imani, A. Rahimi, and T. S. Rosing, “Resistive configurable associative memory for approximate computing,” in 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2016, pp. 1327–1332.

[101] A. Ghofrani, A. Rahimi, M. A. Lastras-Montaño, L. Benini, R. K. Gupta, and K.-T. Cheng, “Associative memristive memory for approximate computing in gpus,” 2016.

[102] K.-H. Kim, S. Gaba, D. Wheeler, J. M. Cruz-Albrecht, T. Hussain, N. Srinivasa, and W. Lu, “A functional hybrid memristor crossbar-array/cmos system for data storage and neuromorphic applications,” Nano letters, vol. 12, no. 1, pp. 389–395, 2011.

[103] N. Pinckney, M. Fojtik, B. Giridhar, D. Sylvester, and D. Blaauw, “Shortstop: An on-chip fast supply boosting technique,” in 2013 Symposium on VLSI Circuits. IEEE, 2013, pp. C290–C291.

[104] J. J. Yang, D. B. Strukov, and D. R. Stewart, “Memristive devices for computing,” Nature nanotechnology, vol. 8, no. 1, pp. 13–24, 2013.

[105] R. Dlugosz, A. Rydlewski, and T. Talaska, “Low power nonlinear min/max filters implemented in the cmos technology,” in Microelectronics Proceedings-MIEL 2014, 2014 29th International Conference on. IEEE, 2014, pp. 397–400.

[106] C. Zhuo, D. Sylvester, and D. Blaauw, “Process variation and temperature-aware reliability management,” in Proceedings of the Conference on Design, Automation and Test in Europe. European Design and Automation Association, 2010, pp. 580–585.

[107] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, “Scaling distributed machine learning with the parameter server.” in OSDI, vol. 14, 2014, pp. 583–598.

[108] J. Andriessen, M. Baker, and D. D. Suthers, Arguing to learn: Confronting cognitions in computer-supported collaborative learning environments. Springer Science & Business Media, 2013, vol. 1.

[109] A. Papadimitriou, R. Bhagwan, N. Chandran, R. Ramjee, A. Haeberlen, H. Singh, A. Modi, and S. Badrinarayanan, “Big data analytics over encrypted datasets with seabed.” in OSDI, 2016, pp. 587–602.

[110] S. Suthaharan, “Big data classification: Problems and challenges in network intrusion prediction with machine learning,” ACM SIGMETRICS Performance Evaluation Review, vol. 41, no. 4, pp. 70–73, 2014.

[111] A. L. Buczak and E. Guven, “A survey of data mining and machine learning methods for cyber security intrusion detection,” IEEE Communications Surveys & Tutorials, vol. 18, no. 2, pp. 1153–1176, 2016.

[112] R. Sharifi and A. Venkat, “Chex86: Context-sensitive enforcement of memory safety via microcode-enabled capabilities.”

[113] N. Papernot, P. McDaniel, A. Sinha, and M. Wellman, “Towards the science of security and privacy in machine learning,” arXiv preprint arXiv:1611.03814, 2016.

[114] M. I. Jordan and T. M. Mitchell, “Machine learning: Trends, perspectives, and prospects,” Science, vol. 349, no. 6245, pp. 255–260, 2015.

[115] T. Ristenpart, E. Tromer, H. Shacham, and S. Savage, “Hey, you, get off of my cloud: exploring information leakage in third-party compute clouds,” in Proceedings of the 16th ACM conference on Computer and communications security. ACM, 2009, pp. 199–212.

[116] M. Taram, A. Venkat, and D. Tullsen, “Context-sensitive fencing: Securing speculative execution via microcode customization,” in International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2019.

[117] A. Baumann, M. Peinado, and G. Hunt, “Shielding applications from an untrusted cloud with haven,” ACM Transactions on Computer Systems (TOCS), vol. 33, no. 3, p. 8, 2015.

[118] G. Zhao, C. Rong, J. Li, F. Zhang, and Y. Tang, “Trusted data sharing over untrusted cloud storage providers,” in Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on. IEEE, 2010, pp. 97–103.

[119] A. J. Feldman, W. P. Zeller, M. J. Freedman, and E. W. Felten, “Sporc: Group collaboration using untrusted cloud resources.” in OSDI, vol. 10, 2010, pp. 337–350.

[120] S. Choi, G. Ghinita, H.-S. Lim, and E. Bertino, “Secure knn query processing in untrusted cloud environments,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 11, pp. 2818–2831, 2014.

[121] M. Van Dijk, C. Gentry, S. Halevi, and V. Vaikuntanathan, “Fully homomorphic encryption over the integers,” in Annual International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 2010, pp. 24–43.

[122] H. Chen, K. Han, Z. Huang, A. Jalali, and K. Laine, “Simple encrypted arithmetic library v2.3.0,” in Microsoft, 2017.

[123] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth, “Practical secure aggregation for privacy-preserving machine learning,” in CCS. ACM, 2017.

[124] A. Ben-Efraim, Y. Lindell, and E. Omri, “Optimizing semi-honest secure multiparty computation for the internet,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS). ACM, 2016, pp. 578–590.

[125] W. Du and M. J. Atallah, “Secure multi-party computation problems and their applications: a review and open problems,” in Proceedings of the 2001 workshop on New security paradigms. ACM, 2001, pp. 13–22.

[126] A. C.-C. Yao, “How to generate and exchange secrets,” in Foundations of Computer Science, 1986., 27th Annual Symposium on. IEEE, 1986, pp. 162–167.

[127] V. Kolesnikov and T. Schneider, “Improved garbled circuit: Free XOR gates and applications,” in Automata, Languages and Programming. Springer, 2008.

[128] A. Waksman, “A permutation network,” Journal of the ACM (JACM), vol. 15, no. 1, pp. 159–163, 1968.

[129] P. Kanerva, J. Kristofersson, and A. Holst, “Random indexing of text samples for latent semantic analysis,” in Proceedings of the 22nd annual conference of the cognitive science society, vol. 1036. Citeseer, 2000.

[130] T. R. Henderson, M. Lacage, G. F. Riley, C. Dowell, and J. Kopena, “Network simulations with the ns-3 simulator,” SIGCOMM demonstration, vol. 14, no. 14, p. 527, 2008.

[131] T. Feist, “Vivado design suite,” White Paper, vol. 5, 2012.

[132] H. Chen, K. Laine, and R. Player, “Simple encrypted arithmetic library - SEAL v2.1,” in International Conference on Financial Cryptography and Data Security. Springer, 2017, pp. 3–18.

[133] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[134] D. Ciregan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3642–3649.

[135] “Uci machine learning repository,” http://archive.ics.uci.edu/ml/datasets/ISOLET, 1994.

[136] M. S. Razlighi, M. Imani, F. Koushanfar, and T. Rosing, “Looknn: Neural network with no multiplication,” in 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2017, pp. 1775–1780.

[137] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz, “Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine,” in International workshop on ambient assisted living. Springer, 2012, pp. 216–223.

[138] A. Reiss and D. Stricker, “Introducing a new benchmarked dataset for activity monitoring,” in Wearable Computers (ISWC), 2012 16th International Symposium on. IEEE, 2012, pp. 108–109.

[139] Y. Vaizman, K. Ellis, and G. Lanckriet, “Recognizing detailed human context in the wild from smartphones and smartwatches,” IEEE Pervasive Computing, vol. 16, no. 4, pp. 62–74, 2017.
