Application-level Studies of Cellular Neural Network-based Hardware Accelerators

Qiuwen Lou1, Indranil Palit1, Andras Horvath2, Li Tang3, Michael Niemier1, X. Sharon Hu1
1Department of Computer Science and Engineering, University of Notre Dame; 2Faculty of Information Technology, Pazmany Peter Catholic University; 3Los Alamos National Laboratory
{qlou, ipalit, mniemier, shu}@nd.edu, [email protected], [email protected]

arXiv:1903.06649v2 [cs.ET] 12 Jun 2019

Abstract—As cost and performance benefits associated with Moore's Law scaling slow, researchers are studying alternative architectures (e.g., based on analog and/or spiking circuits) and/or computational models (e.g., convolutional and recurrent neural networks) to perform application-level tasks faster, more energy efficiently, and/or more accurately. We investigate cellular neural network (CeNN)-based co-processors at the application level for these metrics. While it is well known that CeNNs can be well suited for spatio-temporal information processing, few (if any) studies have quantified the energy/delay/accuracy of a CeNN-friendly algorithm and compared the CeNN-based approach to the best von Neumann algorithm at the application level. We present an evaluation framework for such studies. As a case study, a CeNN-friendly target-tracking algorithm was developed and mapped to an array architecture developed in conjunction with the algorithm. We compare the energy, delay, and accuracy of our architecture/algorithm (assuming all overheads) to the most accurate von Neumann algorithm (Struck). Von Neumann CPU data is measured on an Intel i5 chip. The CeNN approach is capable of matching the accuracy of Struck, and can offer approximately 1000× improvements in energy-delay product.

I. INTRODUCTION

As Moore's Law-based device scaling and the accompanying performance scaling trends slow, there is continued interest in new technologies and computational models to perform a given task faster and more energy efficiently [1], [2]. Furthermore, there is growing evidence that, with respect to traditional Boolean circuits, it will be challenging for emerging devices to compete with CMOS technology [1]–[3].

Researchers are increasingly looking for new ways to exploit the unique characteristics of emerging devices (e.g., non-volatility) as well as architectural paradigms that transcend the energy efficiency wall. In this regard, cellular neural networks (CeNNs) are now under investigation via the Semiconductor Research Corporation's benchmarking activities [3], [4], as (i) they can solve a broad set of problems [5] (e.g., image processing, associative memories, etc.), and (ii) they can exploit the unique properties of both spin- and charge-based devices [2], [6]. However, we must consider what application spaces/computational models might benefit from CeNNs, and whether CeNNs outperform other alternative architectures/models, for the same problem, at the application level.

We have developed a framework for evaluating CeNN co-processors to quantitatively assess CeNNs at the application level. (i) We begin with algorithm development, where processing tasks are mapped to CeNNs or more conventional CPUs/GPUs (e.g., for image recognition, CeNNs can be highly efficient for feature extraction tasks given the architecture's parallel nature, while for more mathematical operations, CPUs may be more efficient). (ii) Next, given the analog nature of a CeNN and the inherent nature of inference applications, algorithmic accuracy must be evaluated at multiple levels (e.g., we must address overall algorithmic quality, and any impact on algorithmic quality due to lower precision hardware); algorithmic refinement may be needed. (iii) The algorithm must then be mapped to a suitable hardware architecture (e.g., parallel CeNNs vs. a CeNN that is used serially); again, algorithm refinement may be necessary. (iv) Finally, we must compare energy, delay, and accuracy projections to the best von Neumann algorithm for the same application-level problem.

As a case study, we have developed a CeNN-based target tracking algorithm to complete an application-level case study. Target tracking is a fundamental component of many applications [7]. We have also developed a CeNN array architecture for our algorithm with analog CeNN circuitry, analog-to-digital (ADC) and digital-to-analog (DAC) converters, and an 8-bit register. We compare our approach to the best reported von Neumann algorithm in terms of energy, delay, and accuracy. Per [7], we have identified the best von Neumann algorithm (Struck [8]) for image sequences in the VisualTracking.net benchmark suite [9]. Our CeNN-based approach (with ADC/DAC overheads, etc.) is competitive with Struck in terms of accuracy, and improvements in energy-delay products (EDP) can be >1000×. Finally, we show how our algorithm can be tuned such that a CeNN approach becomes competitive for sequences where Struck has superior accuracy.

II. BACKGROUND

We review CeNNs and prior work with tracking algorithms.

A. Cellular Neural Networks

A spatially invariant CeNN architecture [5] is an M × N array of identical cells (Fig. 1a). Each cell, Cij, (i, j) ∈ {1, ..., M} × {1, ..., N}, has identical connections with adjacent cells in a predefined neighborhood, Nr(i, j), of radius r. The size of the neighborhood is m = (2r + 1)², where r is a positive integer. A conventional analog CeNN cell consists of one resistor, one capacitor, 2m linear voltage controlled current sources (VCCSs), one fixed current source, and one specific type of non-linear voltage controlled voltage source (Fig. 1b).

Fig. 1. (a) CeNN architecture, (b) circuitry in a CeNN cell.
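As an illustration of the cell-array bookkeeping above, the neighborhood Nr(i, j) and its size m = (2r + 1)² can be enumerated with a minimal Python sketch (the array size, radius, and 0-based cell indices are illustrative assumptions, not values from the paper):

```python
# Sketch: enumerating the neighborhood N_r(i, j) of a CeNN cell.
# Array size (M, N), radius r, and the chosen cell are illustrative.

def neighborhood(i, j, r, M, N):
    """Cells C_kl within radius r of C_ij (clipped at the array edges)."""
    return [(k, l)
            for k in range(max(0, i - r), min(M, i + r + 1))
            for l in range(max(0, j - r), min(N, j + r + 1))]

r = 1
m = (2 * r + 1) ** 2              # interior neighborhood size: 9 for r = 1
cells = neighborhood(5, 5, r, M=16, N=16)
assert len(cells) == m            # each cell thus needs 2m = 18 VCCSs
```

Cells on the array boundary have fewer than m neighbors, which is why the sketch clips the index ranges at the edges.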
The input, state, and output of a given cell, Cij, correspond to the nodal voltages uij, xij, and yij, respectively. VCCSs controlled by the input and output voltages of each neighbor deliver feedback and feedforward currents to a given cell. The dynamics of a CeNN are captured by a system of M × N ordinary differential equations, each of which is simply Kirchhoff's Current Law (KCL) at the state node of the corresponding cell, per Eq. 1:

  C dxij(t)/dt = −xij(t)/R + Σ_{Ckl∈Nr(i,j)} aij,kl ykl(t) + Σ_{Ckl∈Nr(i,j)} bij,kl ukl + Z.   (1)

CeNN cells typically employ a non-linear, sigmoid-like transfer function [10] at the output to ensure fixed binary output levels. The parameters aij,kl and bij,kl serve as weights for the feedback and feedforward currents from cell Ckl to cell Cij. aij,kl and bij,kl are space invariant and are denoted by two (2r + 1) × (2r + 1) matrices. (If r = 1, they are captured by 3 × 3 matrices.) The matrices of a and b parameters are typically referred to as the feedback template (A) and the feedforward template (B), respectively. Design flexibility is further enhanced by the fixed bias current Z, which provides a means to adjust the total current flowing into a cell. A CeNN can solve a wide range of image processing problems by carefully selecting the values of the A and B templates (as well as Z).

Various circuits, including inverters, Gilbert multipliers, operational transconductance amplifiers (OTAs), etc. [11], [12], can be used to realize VCCSs. We use the OTA design from [13]. OTAs provide a large linear range for voltage-to-current conversion, and can implement a wide range of transconductances, allowing for different CeNN templates.

B. Target Tracking

Given the initial position/size of a target object in a frame, a tracking algorithm should determine the position/size of the object in subsequent frames. Many studies have been performed to track objects in different scenarios [14], [15] (e.g., occlusion, illumination variation, etc. [7]). Target tracking remains a challenging problem in computer vision, as varying scene dynamics need to be reconciled in a single algorithm. Work in [16], [17] initially considered CeNN-based trackers, where CeNNs performed pixel-level processing and a digital signal processor handled the remaining information. However, [16], [17] do not present a thorough energy/delay study. Moreover, [16], [17] use a predefined sequence of kernels, and object tracking is based primarily on motion and change detection. While this may be useful if an object is moving continuously before a fixed background, these methods are not as general if used as a learning system. As we will show in Sec. IV, our approach can select the optimal template sequences from a larger set at runtime, based on the features of the tracked object and the features of the background.

When considering algorithms for von Neumann machines, convolutional neural networks have been proposed to learn discriminative features or to help in the extraction of a discriminative saliency map [15], [18]. The outputs from a hidden layer can be used as feature descriptors. The Struck algorithm [8] that we compare to is an adaptive tracking-by-detection method for single-target tasks. This is fundamentally an object recognition problem that places significant constraints on training/learning an object model. A tracker is initialized with a single bounding box provided as ground truth for the starting frame of a video sequence. After initialization, the tracker maps each input frame to an output bounding box containing the predicted location of the object in that frame. Struck uses a kernelized, structured-output support vector machine to provide adaptive tracking. It is well suited for drastic appearance changes of the tracked object in a given image sequence, and has the highest reported accuracy for the dataset that we use to evaluate our CeNN-based approach [7].

To quantify accuracy, in this paper we use a success plot based on bounding box overlap [7]. Given a tracked bounding box rt and the ground truth bounding box ra, the overlap score is defined as S = |rt ∩ ra| / |rt ∪ ra|, where ∩ and ∪ represent the intersection and union of the two regions, respectively.

III. CELLULAR NEURAL NETWORK SYSTEM

Here, we discuss our CeNN-based co-processor and a framework for mapping a problem to such a co-processor.

A. A CeNN-based co-processor

The architecture and the processing unit schematics for our CeNN-based system are shown in Fig. 2. Analog input signals (e.g., from sensors) are provided to the CeNN-based co-processor. The CPU calls the co-processor to execute a given task/template by asserting a control signal. After processing, the CeNN-based hardware accelerator generates an analog output, which can either be converted to an N-bit digital signal and communicated to the CPU, or be reused as an analog input to itself for further processing. Since each cell might need to send/receive data to/from the CPU, some mechanism is required to schedule the data movement. Each processing unit consists of an N-bit analog-to-digital converter, an analog CeNN cell, an analog multiplexer, and an N-bit register. The analog multiplexer selects a signal either from the external environment or from the analog CeNN cell itself. The analog CeNN cell is as described in Sec. II-A.

B. A framework for evaluation

We now discuss an evaluation framework (see Fig. 3) to quantify the benefits of a CeNN co-processor. We must first develop an algorithm that (i) is suitable for CeNN hardware, and (ii) achieves the same functionality as a traditional von Neumann algorithm at the application level.
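As a concrete software illustration of the template model of Sec. II-A, Eq. 1 can be integrated with a simple forward-Euler sketch. The template values, R, C, and step size below are illustrative assumptions, not templates used in the paper:

```python
import numpy as np

# Sketch: forward-Euler integration of the CeNN state equation (Eq. 1)
# with a 3x3 (r = 1) feedback template A, feedforward template B, and
# bias Z. Template values, R, C, and dt are illustrative assumptions.

def sigmoid_like(x):
    # Piecewise-linear "sigmoid-like" CeNN output: y = 0.5 (|x+1| - |x-1|)
    return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))

def cenn_step(x, u, A, B, Z, R=1.0, C=1.0, dt=0.05):
    """One Euler step of C dx/dt = -x/R + A*y + B*u + Z (zero-padded edges)."""
    y = sigmoid_like(x)
    yp, up = np.pad(y, 1), np.pad(u, 1)
    fb = np.zeros_like(x)
    ff = np.zeros_like(x)
    M, N = x.shape
    for dk in range(3):
        for dl in range(3):
            fb += A[dk, dl] * yp[dk:dk + M, dl:dl + N]
            ff += B[dk, dl] * up[dk:dk + M, dl:dl + N]
    dx = (-x / R + fb + ff + Z) / C
    return x + dt * dx

# Example: a diffusion-like feedback template (illustrative values only).
A = np.array([[0.1, 0.15, 0.1], [0.15, 1.0, 0.15], [0.1, 0.15, 0.1]])
B = np.zeros((3, 3))
u = np.zeros((8, 8))
x = np.random.default_rng(0).uniform(-1, 1, (8, 8))
for _ in range(100):
    x = cenn_step(x, u, A, B, Z=0.0)
```

In hardware, of course, the cell capacitor integrates the KCL currents continuously; the loop above only mimics that settling behavior numerically.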

Fig. 2. A CeNN-based hardware accelerator. Each processing unit contains an ADC, an analog CeNN cell, an analog multiplexer (selecting either the external analog signal or the cell output), and an 8-bit register, and exchanges digital data and control signals with the CPU through a DAC.

Fig. 3. Framework for evaluating the CeNN hardware accelerator: develop algorithm → map functions to hardware → evaluate accuracy → evaluate energy and delay.

As CeNN arrays process information via template operations on local data, we first identify the global problem and describe it by local operations and descriptors. For target tracking, we must identify local features (e.g., orientation, movement direction, etc.), and the positions of detected features can be used to estimate the position of the target. For example, in the CeNN paradigm, the position of local features can be detected by shadow templates that can identify the vertical and horizontal bounding box sides of local features. (Per Sec. IV, the local kernels used for feature extraction are selected via a genetic algorithm. A varying number of kernels can be selected which give a sparse and strong response on many images.)

After the algorithm is defined, its functional steps must be mapped onto suitable portions of the system. The algorithm is designed to maximize the use of CeNN hardware. However, portions of the algorithm might still require numerical operations on a CPU. Computation steps that involve parallel, independent processing of signals are mapped onto CeNN co-processors, e.g., feature extraction, classification at the pixel level, etc. Steps requiring precise floating-point operations are mapped onto the CPU. The N-bit register within each processing unit of the CeNN architecture can facilitate any required data communication to and from the CPU.

The accuracy of the algorithm's output must also be evaluated, and the algorithm may need to be refined to achieve suitable accuracy. Inaccuracies can stem both from the quality of the algorithm and from the analog nature of the hardware platform, so accuracy should be evaluated at multiple levels. Finally, the algorithm and architecture must be evaluated to quantify the power, energy, and delays required for computation, and compared to the best von Neumann algorithm for the same problem. Analysis at this stage can lead to further modification of both the algorithm and architecture to maintain the desired power density of the co-processor, frame processing rates of the system (e.g., to enable real-time tracking), etc.

IV. CENN-BASED TARGET TRACKING ALGORITHM

We now provide a detailed description of our CeNN-based target tracking algorithm. The algorithm can be divided into two parts: training and tracking. (While we evaluate the tracking portion, i.e., to make an "apples-to-apples" comparison to Struck, we also discuss the training portion for completeness. Other approaches in the literature also consider image recognition with off-line training, and then report energy-delay-accuracy per classification, e.g., [19] for MNIST.)

A. Training

In the training process, we select a set of Difference of Gaussians (DoG) kernels from a feature pool, assign a weight to each kernel, and select a threshold to profile the interest area. The selected parameters can then be used on subsequent images to identify the target.

Training is currently done off-line with a CPU. A set of features is selected that best matches the target object and separates it from the features of the background and surrounding objects. The feature pool is generated by running a number of DoG kernels for feature extraction on an image and pooling the results. A DoG kernel performs two isotropic or directional Gaussian diffusions along with a subtraction operation. Each DoG operation is followed by a pooling step, which also leverages fast-running CeNN templates. The most descriptive threshold is selected from a series of thresholds at different levels, and the binary result is diffused with an isotropic template. The result of this sequence of operations corresponds to different feature responses. This approach is similar to the operations used in deep learning.

From this image set, a subset of descriptor images can be selected to create a globally sparse and high-response area corresponding to the estimated object position, given as a bounding box. A user-specified number of features is selected from the pool based on response strength within the bounding box. The features are then weighted by a genetic algorithm. After the genetic algorithm produces optimal linear weights, the linear sum of the descriptors is thresholded at a value to highlight the interest area.

B. Tracking

Our CeNN-based tracking algorithm/information processing flow is shown in Fig. 4. (The remaining part is Kalman filter-based processing, which is computed by the CPU.) The predefined DoG kernels, run by the CeNN, are used in tracking to profile the object of interest. Each DoG kernel is weighted by the genetic algorithm; the normalization and weighted sum of the images after DoG, performed by a CPU (accumulated as y = y + αi·Ixi over the kernels i), yields the featured image. The runtime for each DoG operation is less than the settling time of a CeNN, so extra circuitry would be needed to control the timing of the operation. Then, tracking is performed by taking the centroid of the local response area of the current frame and resizing the bounding box according to the dimensions of the response area relative to the response area of the first frame. After the image is constructed, several CeNN operations follow. A THRESHOLD operation can be performed to further highlight the object of interest. A following AND operation is performed with the ground truth image (Gt_image in Fig. 4).

Fig. 4. Information processing flow of the CeNN-based tracking algorithm. CeNN operations include LOGAND, DIFFU1–DIFFU3, RECALL, SHADOWD, and SHADOWL; subtraction against ones(size(image)) and normalization run on the CPU; the outputs include the object height h and width w, and a specific image area obtained from tracking.

A data acquisition system records the current data, sampled at a speed of 10,000 samples per second, while the CPU executes code for a respective algorithm. We start the timer after the CPU reads the image, and synchronize the execution time. Thus, we only compute the delay/energy of the CPU. Memory overhead is not considered, as most program data is stored in registers for both architectures.
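The DoG-plus-weighted-sum flow of Sec. IV can be sketched in software as follows. On the proposed hardware the Gaussian diffusions would run as CeNN templates; the kernel scales and weights below are illustrative assumptions:

```python
import numpy as np

# Sketch of the tracking-side feature extraction in Sec. IV-B: each DoG
# kernel = two Gaussian diffusions + a subtraction, and the featured image
# is the weighted sum y = y + alpha_i * I_xi. Sigmas/weights are illustrative.

def gaussian_blur(img, sigma):
    """Separable Gaussian blur built from two 1-D convolutions."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t**2 / (2 * sigma**2))
    k /= k.sum()
    out = np.apply_along_axis(lambda col: np.convolve(col, k, mode="same"), 0, img)
    return np.apply_along_axis(lambda row: np.convolve(row, k, mode="same"), 1, out)

def dog(img, sigma_fine, sigma_coarse):
    """Difference of Gaussians: diffusion at two scales, then subtraction."""
    return gaussian_blur(img, sigma_fine) - gaussian_blur(img, sigma_coarse)

def featured_image(img, kernels, weights):
    """Accumulate the weighted, normalized DoG responses into one image."""
    y = np.zeros_like(img, dtype=float)
    for (s1, s2), alpha in zip(kernels, weights):
        r = dog(img, s1, s2)
        r = (r - r.min()) / (r.max() - r.min() + 1e-12)  # per-kernel normalization
        y += alpha * r
    return y

img = np.random.default_rng(1).random((32, 32))
y = featured_image(img, kernels=[(1.0, 2.0), (1.5, 3.0)], weights=[0.6, 0.4])
```

A thresholding pass over `y` would then correspond to the THRESHOLD template in the CeNN flow.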
For the CeNN-based algorithm, we also set the timer at the beginning and end of the CPU operations, so as to isolate the delay/energy of the CPU portion of the computation.

TABLE: Per-operation delay/energy data for the CeNN-based algorithm.

  DIFFUS2              7.5    1.5    1.58    0.0081
  SUB                  7.5    3.0    3.08    0.0158
  DIFFUS3              25     1.5    1.58    0.027
  Init                 0.3    1.5    1.58    0.0324
  THRES                0.3    3.0    3.08    0.0781
  LOGAND               0.3    6.0    6.08    0.154
  RECALL               35.2   21.8   21.8    64.9
  SHADOWL              35.2   9.0    9.08    27
  Init                 0.3    1.50   1.58    0.00324
  SHADOWD              24     9.0    1.58    18.4
  DILATION             1.6    13.5   13.6    1.84
  Total/frame                 160    76.9    112
  Total (all frames)                 0.306 s    0.216 J

Fig. 6. Accuracy (success rate vs. overlap threshold) for the RedTeam and Vase benchmarks.
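The energy-delay product (EDP) used for the comparison is simply delay × energy. A small sketch of the bookkeeping: the CeNN totals (0.306 s, 0.216 J) come from the table above, while the Struck figures below are hypothetical placeholders, not measured values:

```python
# Sketch: energy-delay product (EDP) bookkeeping for the comparison.
# CeNN totals (0.306 s, 0.216 J) are from the table above; the Struck
# numbers below are hypothetical placeholders, not measured values.

def edp(delay_s, energy_j):
    """Energy-delay product in joule-seconds."""
    return delay_s * energy_j

cenn_edp = edp(0.306, 0.216)          # ~0.066 J*s over all frames
struck_edp = edp(300.0, 250.0)        # hypothetical CPU delay and energy
ratio = struck_edp / cenn_edp         # a ">1000x" claim means ratio > 1000
```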

Delay, energy, and EDP for the CeNN-based system are depicted by the stacked bars in Fig. 5(a)-(c), respectively. (The dotted bars represent the CeNN-based co-processor, while the bars with diagonal lines represent computations mapped to the CPU.) The rightmost solid bar represents the Struck algorithm. Here, computations mapped to the CPU in the CeNN-based algorithm generally have higher delay and energy when compared to the information processing operations performed on the CeNN.

To fully appreciate the benefit of CeNNs, we have run a version of the CeNN-based algorithm implemented in Matlab on a CPU. Run times are on the order of a few hours for a given image sequence. Even given that corresponding C code for the CeNN algorithm runs 100× faster, this Matlab-based approach is still orders of magnitude slower than the system containing a CeNN accelerator. Furthermore, our software implementation indicates that 99% of the computation time is attributed to CeNN operations. Hence, the CeNN co-processor has a substantial, positive impact on the CeNN-based system.

Finally, for the Struck algorithm, energies and delays for the various sequences are directly obtained from measured CPU data per Sec. V-B. For the different image sequences in Fig. 5, EDPs are largely 1000× to 10000× better for our CeNN-based approach versus Struck.

Fig. 5. (a) Run time and (b) energy for the CeNN-based algorithm and the Struck algorithm over the RedTeam, CarScale, Walking2, BlurBody, Human9, BlurCar3, Vase, and Sylvester sequences. The stacked bars are for the CeNN algorithm: dotted bars for the CeNN part, bars with diagonal lines for the CPU part. (c) Energy-delay product (EDP) of the Struck algorithm over the CeNN algorithm.

D. Accuracy analysis

Given the potential for energy/delay improvements, we now consider tracker accuracy. In Fig. 6, we show representative accuracy results for the RedTeam and Vase image sequences from Dataset 1. Per the inset, the solid box represents the ground truth, while the dashed box represents the position established by the tracker. The overlap threshold is the intersection of the two boxes over the union of the two boxes. After collecting this data for every image in the sequence, we can plot the success rate (i.e., the percentage of time a given overlap threshold was achieved). For a perfect tracker, this plot would simply be a horizontal line at '1' on the y-axis. To represent accuracy as a single number, we calculate the area under the curve (AUC) for a given image sequence and algorithm.

AUC scores were calculated for the 24 sequences for both Struck and our CeNN-based approach. For Struck, the algorithm can be tuned such that energy, accuracy, or run time may vary (e.g., leveraging more kernels, changing the search radius, etc.). We set these parameters to obtain the highest possible accuracy since, per Sec. V-C, our CeNN-based approach can be superior to a von Neumann approach in terms of energy and delay. As such, we consider these metrics as we approach iso-accuracy. To obtain AUC data for our CeNN-based approach, we used a Matlab-based CeNN simulator [26]. While both the Matlab simulator and the Struck algorithm run on a 64-bit machine, the image quality is 8 bits. Thus, 8-bit CeNN hardware should be sufficient to process these video sequences.

The accuracy results (AUC scores for all 24 benchmarks) are shown in Fig. 7. For Dataset 1, the average AUC scores for our algorithm and Struck are 0.438 and 0.402, respectively (see Fig. 7a).
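The overlap score, success plot, and AUC used above can be sketched as follows (boxes are (x, y, w, h) tuples; the example boxes are illustrative assumptions, and AUC is approximated by averaging the success rate over a uniform threshold grid):

```python
import numpy as np

# Sketch: overlap score S = |rt ∩ ra| / |rt ∪ ra|, the success plot,
# and its AUC, for axis-aligned boxes given as (x, y, w, h) tuples.
# The example boxes below are illustrative assumptions.

def overlap(rt, ra):
    """Intersection-over-union of two boxes."""
    x1, y1 = max(rt[0], ra[0]), max(rt[1], ra[1])
    x2 = min(rt[0] + rt[2], ra[0] + ra[2])
    y2 = min(rt[1] + rt[3], ra[1] + ra[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = rt[2] * rt[3] + ra[2] * ra[3] - inter
    return inter / union if union else 0.0

def success_auc(tracked, truth, thresholds=np.linspace(0, 1, 101)):
    """Success rate at each overlap threshold, reduced to one AUC score."""
    s = np.array([overlap(t, g) for t, g in zip(tracked, truth)])
    success = [(s >= th).mean() for th in thresholds]
    return float(np.mean(success))   # rectangle-rule AUC on a uniform grid

tracked = [(10, 10, 20, 20), (12, 11, 20, 20)]
truth   = [(10, 10, 20, 20), (20, 20, 20, 20)]
auc = success_auc(tracked, truth)    # a perfect tracker would score 1.0
```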
this work, in consider future sequences will highlighted similar we (see either is While Struck approach to CeNN superior the or Bolt), sequences and image Ironman challenging there most (i.e., said, the That Among benchmark. exceptions. AUC than the are better both for 2X algorithm for approximately 7b, CeNN is (Figs. scores the approach 3 AUC Struck-based and the 2 the object c) Datasets an track, for However, that to decrease. such algorithms change difficult dynamics more scene becomes As 7a). Fig. (see × ehv rsne rmwr o vlaigCeNN-based evaluating for framework a presented have We al- CeNN-based the for projections accuracy Additionally, i.7 U cr o a aae ,()Dtst2 n c aae 3. Dataset (c) and 2, Dataset (b) 1, Dataset (a) for score AUC 7. Fig. c. b. a. 5 AUC score AUC score AUC score U nrae o020 fw mlyCUbsdGabor CPU-based employ we If 0.200. to increases AUC , Dataset #3 Dataset #2 Dataset #1 I C VI. ONCLUSION 3 × 3 to Tn,Li, Tang, [22] Plt Indranil, Palit, [23] real-time Haar from S. tracking and [21] classification target “Moving Lipton, t. Modha, J. S. A. D. [20] and Arthur, V. J. Merolla, via A. P. tracking Appuswamy, visual R. “Robust Esser, Yang, S. M.-H. and [19] Wu, Y. Liu, Q. Zhang, K. [18] Csaba, Rekeczky, [17] tracking multitarget real-time “A Rekeczky, Csaba and Timar, Gergely [16] Seunghoon, Hong, [15] Seles rodWM, Arnold Smeulders, [14] [28]). networks enable diffusion may efficient/compact which (e.g., highly the evaluate technologies study emerging (ii) (iii) other of CeNNs, and impact refinements, with algorithmic will process with we training energy/delay work, the future accelerate In improvement competitive. (i) approach algorithmic is our to Struck makes paths When that outlined EDP. have in we offer improvements superior, can magnitude and as sequences, of accurate image orders many as for is algorithm tracker Struck the CeNN-based Our tracker. Neumann Que Lou, Qiuwen [13] Wang, Lei [12] Molinar-Solis, E. 
Jesus [11] LOCu n i ag Clua erlntok Theory,” network: neural “Cellular Yang, Lin and Chua O L [10] SmHare, Sam [8] tunnel Yi, Wu, on based [7] processor signal mixed cnn-inspired “A Sedighi, t. B. [6] Roska, T. and Available: Chua O. [Online]. L. 2016. [5] initiative,” https: STARnet Available: “SRC-DARPA [Online]. ——, 2016. [4] center,” benchmarking “SRC SRC, neural cellular [3] energy-efficient for proposal “A Naeemi, A. and Pan C. [2] exploratory beyond-cmos of “Benchmarking Young, I. and Nikonov D. [1] Y W. Y. [9] ehd n t nryipiain, in implications,” energy its and methods samhare/struck sn te lp eie, in Design devices,” Computer-Aided slope steep using in video,” (NIPS) Systems Processing Information computing,” neuromorphic energy-efficient for “ networks,” convolutional Applications Their and Networks Neural in Cellular machines,” universal cnn adaptive algorithms,” cnn-um I multichannel Systems and robust Circuits with system in network,” neural convolutional arXiv with map saliency 2014. 1442–1468, pp. 36.7, vol. vey,” 05 p 277–282. in pp. systems,” 2015, CNN for design Vision Computer on Conf. Int. 2011 Recognition Pattern and Vision Computer on 1150–1155. pp. 2015, in transistors,” 2002. Press, University Applications and Foundations ing: https://www.src.org/program/starnet/ //www.src.org/program/nri/benchmarking/ devices,” nology spintronic on based network Circuits and Devices Computational circuits,” integrated logic for devices N ihcl-tt outputs,” cell-state with CNN Technology Video and Image, Signal, for in Systems unit,” inverter floating-gate Systems and Circuits on Transactions tracker o.10.69,2015. 1502.06796, vol. , EETascin nPtenAayi n ahn Intelligence Machine and Analysis Pattern on Transactions IEEE o.1,n.5 p 2–2,Sp 2016. Sept 820–827, pp. 5, no. 15, vol. , t al et. t al et. benchmark/benchmark tal et tal et plctoso optrVision Computer of Applications t al et. 
tal et Vsa rce ecmr, http://cvlab.hanyang.ac.kr/ benchmark,” tracker “Visual , Src, 06 Oln] vial:https://github.com/ Available: [Online]. 2016. “Struck,” , Oln bettakn:Abnhak”in benchmark,” A tracking: object “Online , tal et Guaclrto fdt sebyi nt element finite in assembly data of acceleration “Gpu , ein uoain&Ts nEurope in Test & Automation Design, Tm utpee oo mg rcsigbsdo a on based processing image color multiplexed “Time , tal. et Src:Srcue upttakn ihkres”in kernels,” with tracking output Structured “Struck: , Te-ae prtoa rncnutneamplifier transconductance operational “Tfet-based , tal. et tal et Clua erlntok o mg analysis image for networks neural “Cellular , o.5.,p.15–31 2005. 1358–1371, pp. 52.7, vol. , 04 p 92–95. pp. 2014, , R tal et Mlitre rcigwt trdprogram stored with tracking “Multi-target , Oln rcigb erigdiscriminative learning by tracking “Online , ri rpitarXiv:1501.04505 preprint arXiv ellrNua ewrsadVsa Comput- Visual and Networks Neural Cellular tal et EFERENCES Pormal mscncl ae on based cell cnn cmos “Programmable , EETVLSI IEEE Vsa rcig neprmna sur- experimental An tracking: “Visual , C ra ae ypsu nVLSI on Symposium Lakes Great ACM EEAMItrainlCneec on Conference International IEEE/ACM h ora fVS inlProcessing Signal VLSI of Journal The v10.html. e ok Y S:Cambridge USA: NY, York, New . 01 p 263–270. pp. 2011, , o.1 p –1 e 2015. Dec 3–11, pp. 1, vol. , EEJ nEpoaoySolid-State Exploratory on J. IEEE o.3,p.15–22 1988. 1257–1272, pp. 35, vol. , EEItrainlWrso on Workshop International IEEE EETascin nNanotech- on Transactions IEEE p –,2015. 1–9, pp. , EEASAP IEEE o.62 p 1–2,1998. 314–322, pp. 6.2, vol. , 98 p 8–14. pp. 1998, , 03 p 2411–2418. pp. 2013, , 07 p 207–216. pp. 2007, , 02 p 299–306. pp. 2002, , 03 p 321–8. pp. 2013, , e.DT ’15, DATE ser. , ri preprint arXiv 2015. , EET on T. IEEE EEConf. IEEE Neural IEEE , , [24] J. Laboratories of the Pazm´ any´ P. Catholic University, “Software library for cellular wave computing engines,” 2016. [Online]. 
Available: http://cnn-technology.itk.ppke.hu/Template library v3.1.pdf [25] Y. Xu and T. Ytterdal, “A 7-bit 40 ms/s single-ended asynchronous sar adc in 65 nm cmos,” Analog Integrated Circuits and , vol. 80: 349, 2004. [26] A. H. et. al, “CNN Simulator,” 2016. [Online]. Available: cnn-technology.itk.ppke.hu [27] B. E. Shi, “Gabor-type filtering in space and time with cellular neural networks,” IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 45.2, 1998. [28] Sedighi, B., et al., “Nontraditional computation using beyond-cmos tunneling devices,” IEEE J. on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 4, pp. 438–449, 2014.