Device-Aware Inference Operations in SONOS Non-Volatile Memory Arrays
Christopher H. Bennett1*, T. Patrick Xiao1*, Ryan Dellana1, Ben Feinberg1, Sapan Agarwal1, and Matthew J. Marinella1
1. Sandia National Laboratories, Albuquerque, NM 87185-1084, USA
{cbennet,txiao,mmarine}@sandia.gov
*These authors contributed equally

Vineet Agrawal2, Venkatraman Prabhakar2, Krishnaswamy Ramkumar2, Long Hinh2, Swatilekha Saha2, Vijay Raghavan3, Ramesh Chettuvetty3
2. Cypress Semiconductor Corporation, 198 Champion Court, San Jose, CA, 95134, USA
3. Cypress Semiconductor Corporation, Ste #120, 536 Chapel Hills Drive, Colorado Springs, CO, 80920, USA
[email protected]

Abstract— Non-volatile memory arrays can deploy pre-trained neural network models for edge inference. However, these systems are affected by device-level noise and retention issues. Here, we examine the damage caused by these effects, introduce a mitigation strategy, and demonstrate its use in a fabricated array of SONOS (Silicon-Oxide-Nitride-Oxide-Silicon) devices. On MNIST, fashion-MNIST, and CIFAR-10 tasks, our approach increases resilience to synaptic noise and drift. We also show that strong performance can be realized with ADCs of 5-8 bits of precision.

Index Terms— SONOS, CTM, neural networks, edge inference

I. INTRODUCTION

Non-volatile memory (NVM) arrays deliver low-latency and high-performance in-memory computing at projected throughputs of multiple tera-operations (TOPs) per second. They can also achieve femto-joule energy budgets per multiply-and-accumulate (MAC) operation by directly implementing within the analog circuit the vector-matrix multiplications (VMMs) that are critical to neural network and scientific computing applications [1, 2]. While emerging analog NVM options such as filamentary/resistive RAM, phase-change memory (PCM), and conductive-bridge RAM (CB-RAM) are limited by relatively poor endurance, non-linearity, and cell-to-cell variation, foundry-accessible Silicon-Oxide-Nitride-Oxide-Silicon (SONOS) based charge-trap memory (CTM) cells offer an alternative pathway towards more standardized neural network realizations, with 120× greater energy efficiency than an equivalent SRAM counterpart [3, 4]. Recently, similar three-terminal NOR Flash arrays implementing computer vision tasks were demonstrated [5], yet such systems are not at industry-standard complexity and do not consider several imperfect device issues. In this work, we interface a neural training library with device-aware inference modeling and propose how CTM devices can efficiently implement deep networks.

Prior work has considered the injection of noise during training to increase the resilience of the hardware to read noise, variability, and circuit non-idealities [6], [7]. We additionally investigate the impact of retention loss as measured in real devices and show that the same device-aware training methods can be applied to mitigate the effects of weight drift. In the following sections, we demonstrate that the following two approaches – in terms of the software neural network and the circuit architecture, respectively – can mitigate device non-idealities:

• The use of noise regularization during training, which implicitly counteracts generic noise disturbance, device-to-device variability, and charge decay phenomena;
• The use of appropriate 1T,1CTM cells in a dual-array configuration to reflect analog weights, which counteracts device-to-device variability.

At measured levels of charge decay and at small levels of intrinsic system noise, we demonstrate that SONOS arrays which reflect neural network weight matrices as analog physical quantities can yield inference accuracies that are relevant to modern machine-learning vision tasks.
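As a point of reference for what the array accelerates, the NumPy sketch below illustrates the analog VMM/MAC principle: row voltages drive conductance-encoded weights, and the per-cell currents sum along each column by Kirchhoff's current law. This is our own illustration rather than the authors' implementation; the array sizes and the noise spread sigma_syn are made-up values.

```python
import numpy as np

# Illustrative sketch of an in-memory vector-matrix multiply (VMM):
# each cell contributes a current I = G * V, and column currents sum
# by Kirchhoff's current law, yielding one MAC per cell per read.
rng = np.random.default_rng(0)
n_rows, n_cols = 64, 32

G = rng.uniform(0.0, 1.0, size=(n_rows, n_cols))  # cell conductances (weights)
v_in = rng.uniform(0.0, 1.0, size=n_rows)         # pulsed row voltages (inputs)

i_out = v_in @ G   # column currents: one full VMM in a single analog read

# Cycle-to-cycle read noise perturbs every cell on every read; a simple
# Gaussian disturbance model (sigma_syn is a hypothetical spread):
sigma_syn = 0.02
i_noisy = v_in @ (G + rng.normal(0.0, sigma_syn, size=G.shape))
```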
Figure 1. (a) A three-terminal SONOS device in which a synaptic weight is stored as a specified quantity of charge residing in a nitride charge-trapping layer. (b) A computational array made up of SONOS devices for performing matrix-vector multiplication; inputs are applied as pulsed row voltages from the left and the outputs are read out as currents at the bottom; a constant gate voltage is applied to all the SONOS devices (red). The 2T array uses an access transistor in every cell (black), used during programming.

II. SONOS DEVICE AND ARRAY

SONOS devices are highly suitable for analog memory with multiple levels because both program and erase operations can be achieved using Fowler-Nordheim (FN) tunneling, making it possible to target states with high precision in the drain current or threshold voltage. The ONO stack is a critical part of the SONOS memory cell, shown in Fig. 1(a). Charge is stored as electrons or holes residing in traps embedded in the nitride layer of the ONO stack. Depending on the type and amount of charge trapped, the threshold voltage (V_t) and current (I_d) of the cell can be placed at desired levels. The programming voltage and/or pulse width can be varied to set the device to different current levels. While the drain current can in theory be set by precise programming bias voltages, variation in these states is expected in practice, with a spread that depends on many device-level factors such as interface states in the SONOS stack, random dopant fluctuations, and variations in charge trapping within the nitride. Moreover, the amount of variation can increase over time. So far, the realistic properties of the SONOS stack and their implications for a neural network inference accelerator have not been considered in detail.

We study three-terminal SONOS devices implemented in a dense array that holds a matrix of weight values, a small example of which is shown in Fig. 1(b). The charge-trapping nitride layer can be programmed to a specified current level by applying a large voltage between the poly-cap gate and the drain. During inference, a fixed voltage V_G,read is applied to the pass transistors of every cell in a bitwise fashion, and the current drawn by each device is scaled by the input voltage applied between the source and drain. The device currents along a column are summed by Kirchhoff's current law and ultimately yield an output via the analog-to-digital converter (ADC). During programming [3, 4], our studied array uses the Control Gate (CG), which has an ONO gate dielectric, and a Select Gate (SG), which has a SiO2 gate dielectric. We also use two SONOS arrays to represent a matrix of real-valued weights, where each weight is represented by the difference of the currents drawn by a pair of SONOS cells. The currents drawn by corresponding columns in the two arrays are subtracted, and the result is again digitized using an ADC.
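The sketch below illustrates the dual-array weight representation just described, under simplifying assumptions not taken from the paper: a direct linear map from normalized weights to cell currents, with i_max a hypothetical full-scale current level.

```python
import numpy as np

def encode_differential(W, i_max=1.0):
    """Split a signed weight matrix across two SONOS arrays that each
    hold only non-negative current levels."""
    i_pos = np.clip(W, 0.0, None) * i_max   # positive parts -> first array
    i_neg = np.clip(-W, 0.0, None) * i_max  # negative parts -> second array
    return i_pos, i_neg

def vmm_differential(v_in, i_pos, i_neg):
    # Corresponding column currents of the two arrays are subtracted
    # before digitization, recovering the signed VMM result.
    return v_in @ i_pos - v_in @ i_neg

rng = np.random.default_rng(1)
W = rng.normal(0.0, 0.5, size=(16, 8))   # signed weight matrix
v = rng.uniform(0.0, 1.0, size=16)       # input voltages

i_p, i_n = encode_differential(W)
assert np.allclose(vmm_differential(v, i_p, i_n), v @ W)
```

The differential pair cancels the common-mode component of the two cell currents, which is one reason the scheme counteracts device-to-device variability.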
III. CNN INFERENCE SIMULATIONS

Arrays of NVM devices can greatly accelerate matrix operations in neural networks, but with a potential loss in accuracy brought on by non-ideal effects at the device level. Inference operations may be perturbed by cycle-to-cycle read noise, imprecise programming, device-to-device variation, and loss of state retention. In SONOS devices, these issues arise from imperfect control over the device dimensions, the spatial and energy distribution of desired and undesired traps, and the quantity of stored charge. Partly as a result, the implementation in NVM arrays of large convolutional neural networks (CNNs) – which generalize to hard real-time recognition tasks – has so far been limited [8]. Cognizant of this, we connect Keras, a library for training deep neural network models, to a new inference engine within the CrossSim platform, open-source Sandia software that models crossbar arrays as highly parameterizable neural cores. In order to implement special bounding and clipping schemes during training, we have also used functions from the Whetstone library [9]. Our NVM-CNN uses an existing scheme to implement convolutional kernels on memory arrays, shown in Fig. 2 [10].

Figure 2. (a) Sub-sets of an input (blue) are applied as vector inputs to an array containing a set of convolutional filters in a layer; each set (red) is then passed to a following array. (b) Our neural network, trained on MNIST or f-MNIST, where example images are convolved over a set of 4 layers followed by 2 fully connected (FC) layers.

During inference, we import SONOS-specific parameters and inject cycle-to-cycle device noise into the synaptic weights (using CrossSim), while neuron noise is added during training (using Keras) onto the inputs to the rectified linear unit (ReLU) activation function. We bound the activations to a maximum value of 1.0, though we will also later consider a case with the bound removed. Implicitly, the use of training noise provides an effective form of neural network regularization and combats over-fitting [11]. The spread σ_neu of the Gaussian noise introduced during training can be linked to the spread σ_syn of cycle-to-cycle synapse noise during inference as:

σ_neu = √n · γ_act · σ_syn      (1)

where γ_act is the average value of an activation calculated over the n synapses feeding a neuron. When importing trained models, we select the device dynamic range used in each layer to contain the central 10–90% of the weight values in that layer; the extreme weight values are clipped to the maximum or minimum allowable device currents. By adaptively tuning the weight values that are represented by the device current levels, the effect of imprecision in the programmed currents (due to write errors, read noise, variability, or drift) can be minimized. In the results to follow, we find that clipping the extreme 20% of weight values during inference has a very small influence on the network's accuracy.

IV. ACCURACY DEGRADATION DUE TO INTERNAL NOISE

As a first lens of analysis, we examine the resilience of pre-trained neural networks – with and without noise regularization during training – to the disturbance of VMM operations. Despite the similar number of parameters (~500,000) between the models trained for the MNIST and fashion-MNIST (f-MNIST) image recognition tasks, failure is more noticeable on the more difficult f-MNIST task, as shown in Fig. 3(c) and (d). The accuracy for neural networks trained both with and without noise regularization now falls substantially below the baseline without noise (~90%) on the fashion-MNIST task, decreasing below 85% once σ_syn exceeds 0.1.
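To make the training-side scheme of Section III concrete, the sketch below wires together the three described ingredients: Gaussian noise injected onto the ReLU inputs, activations bounded at 1.0, and percentile clipping of imported weights. It is a minimal sketch under stated assumptions: Eq. (1) is applied as written above, and the layer sizes, sigma_syn, and the mean-activation estimate gamma_act are hypothetical values, not numbers from the paper.

```python
import numpy as np
import tensorflow as tf

def sigma_neu(sigma_syn, n, gamma_act):
    # Eq. (1): map the target inference-time synapse noise spread to an
    # equivalent neuron noise spread applied during training.
    return np.sqrt(n) * gamma_act * sigma_syn

n_in = 256                                           # synapses per neuron
noise = sigma_neu(sigma_syn=0.02, n=n_in, gamma_act=0.1)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_in,)),
    tf.keras.layers.Dense(128),                      # pre-activation VMM
    tf.keras.layers.GaussianNoise(noise),            # noise on ReLU inputs (train time only)
    tf.keras.layers.ReLU(max_value=1.0),             # activations bounded at 1.0
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

def clip_to_central_range(w, lo_pct=10.0, hi_pct=90.0):
    # When importing a trained model, clip extreme weights to the central
    # 10-90% range before mapping them onto device current levels.
    lo, hi = np.percentile(w, [lo_pct, hi_pct])
    return np.clip(w, lo, hi)
```

Note that Keras' GaussianNoise layer is active only during training, matching the paper's split: neuron noise regularizes the software model, while cycle-to-cycle device noise is injected at inference time by CrossSim.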