Survey and Benchmarking of Machine Learning Accelerators

Albert Reuther, Peter Michaleas, Michael Jones, Vijay Gadepally, Siddharth Samsi, and Jeremy Kepner
MIT Lincoln Laboratory Supercomputing Center
Lexington, MA, USA
{reuther,pmichaleas,michael.jones,vijayg,sid,kepner}@ll.mit.edu

This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract No. FA8721-05-C-0002 and/or FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Assistant Secretary of Defense for Research and Engineering.

Abstract—Advances in multicore processors and accelerators have opened the flood gates to greater exploration and application of machine learning techniques to a variety of applications. These advances, along with breakdowns of several trends including Moore's Law, have prompted an explosion of processors and accelerators that promise even greater computational and machine learning capabilities. These processors and accelerators are coming in many forms, from CPUs and GPUs to ASICs, FPGAs, and dataflow accelerators. This paper surveys the current state of these processors and accelerators that have been publicly announced with performance and power consumption numbers. The performance and power values are plotted on a scatter graph, and a number of dimensions and observations from the trends on this plot are discussed and analyzed. For instance, there are interesting trends in the plot regarding power consumption, numerical precision, and inference versus training. We then select and benchmark two commercially available low size, weight, and power (SWaP) accelerators, as these processors are the most interesting for embedded and mobile machine learning inference applications that are most applicable to the DoD and other SWaP-constrained users. We determine how they actually perform with real-world images and neural network models, compare those results to the reported performance and power consumption values, and evaluate them against an Intel CPU that is used in some embedded applications.

Index Terms—Machine learning, GPU, TPU, dataflow, accelerator, embedded inference

Fig. 1. Canonical AI architecture consists of sensors, data conditioning, algorithms, modern computing, robust AI, human-machine teaming, and users (missions). Each step is critical in developing end-to-end AI applications and systems.

I. INTRODUCTION

Artificial Intelligence (AI) and machine learning (ML) have the opportunity to revolutionize the way many industries, militaries, and other organizations address the challenges of evolving events, data deluge, and rapid courses of action. Innovations in computations, data sets, and algorithms have driven many advances for machine learning and its application to many different areas. AI solutions involve a number of different pieces that must work together in order to provide capabilities that can be used by decision makers, warfighters, and analysts; Figure 1 depicts these important pieces that are needed when developing an end-to-end AI solution. While certain components may not be as visible to end-users as others, our experience has shown that each of these interrelated components plays a major role in the success or failure of an AI system.

On the left side of Figure 1, structured and unstructured data sources provide different views of entities and/or phenomenology. These raw data products are fed into a data conditioning step in which they are fused, aggregated, structured, accumulated, and converted to information. The information generated by the data conditioning step feeds into a host of supervised and unsupervised algorithms such as neural networks, which extract patterns, predict new events, fill in missing data, or look for similarities across datasets, thereby converting the input information to actionable knowledge. This actionable knowledge is then passed to human beings for decision-making processes in the human-machine teaming phase. The phase of human-machine teaming provides the users with useful and relevant insight, turning knowledge into actionable intelligence or insight.
Underlying all of these phases is a bedrock of modern computing systems that is comprised of one or more heterogeneous computing elements. For example, sensor processing may occur on low power embedded processors, while algorithms may be computed in very large data centers. With regard to performance advances in these computing elements, Moore's law trends have ended [1], as have a number of related laws and trends including Dennard scaling (power density), clock frequency, core counts, instructions per clock cycle, and instructions per Joule (Koomey's law) [2]. Many of the technologies, tricks, and techniques of processor designers that extended these trends have been exhausted.

However, all is not lost, yet; advancements and innovations are still progressing. In fact, there has been a Cambrian explosion of computing technologies and architectures in recent years. Specialization of circuits for certain functionalities is being exploited whereby certain often-used operational kernels, methods, or functions are being accelerated with specialized circuit blocks and chips. These accelerators are designed with a different balance between performance and functional flexibility. One area in which we are seeing an explosion of accelerators is ML processors and accelerators [3]. Understanding the relative benefits of these technologies is of particular importance to applying AI to domains under significant constraints such as size, weight, and power, both in embedded applications and in data centers.

But before we get to the survey of ML processors and accelerators, we must cover several topics that are important for understanding several dimensions of evaluation in the survey. We must discuss the types of neural networks for which these ML accelerators are being designed; the distinction between neural network training and inference; and the numerical precision with which the neural networks are being used for training and inference:
• Types of Neural Networks – AI and machine learning encompass a wide set of statistics-based technologies, as one can see in the taxonomy detailed in the algorithm section (Section 3) of this MIT Lincoln Laboratory technical report [4]. Even among neural networks, there are a growing number of neural network patterns [5]. This paper will focus on processors that are geared toward deep neural networks (DNNs) and convolutional neural networks (CNNs). Overall, most of the emphasis on computational capability for machine learning is on DNNs and CNNs because they are quite computationally intensive [6], with the fully connected and convolutional layers being the most computationally intense. Conversely, pooling, dropout, softmax, and recurrent/skip connection layers are not computationally intensive since these types of layers require relatively little computation on the weight and data operands.

• Neural Network Training versus Inference – Neural network training uses libraries of input data to converge model weight parameters by applying the labeled input data (forward projection), measuring the output predictions, and then adjusting the model weight parameters to better predict output predictions (back projections). Neural network inference is using a trained model of weight parameters and applying it to input data to receive output predictions. Processors designed for training can also perform well at inference, but the converse is not always true.
• Numerical precision – The numerical precision with which the model weight parameters are stored and computed has an impact on the effectiveness and efficiency with which networks are trained and used for inference. Generally, higher numerical precision representations, particularly floating point representations, are used for training, while lower numerical precision representations, including integer representations, have been shown to be reasonably effective for inference [7], [8]. However, it has also generally been established that very limited numerical precisions like int4, int2, and int1 do not adequately represent model weight parameters and significantly affect model output predictions.

The survey in the next section of this paper focuses on the computational throughput of the processors and accelerators along with the power that is consumed to achieve that performance. Other factors include the memory bandwidth to load and update model parameters and data; memory capacity for model weight parameters and input data, both close to the arithmetic units and in the global memory of the processors and accelerators; and arithmetic intensity [9] of the neural network models being processed by the processor or accelerator. These factors are involved in managing model parameters and input data flows within the model; hence, they also influence the trade-offs between chip bandwidth capabilities, data flow flexibility, and configuration and amount of computational capability. These factors, however, are beyond the scope of this paper, and they will be addressed in future phases of this research.

II. SURVEY OF PROCESSORS

Many recent advances in AI can be at least partly credited to advances in computing hardware [10], [11]. In particular, modern computing advances have been able to realize many computationally heavy machine-learning algorithms such as neural networks. While machine-learning algorithms such as neural networks have had a rich theoretic history [12], recent advances in computing have made the application of such algorithms a reality by providing the computational power needed to train and process massive quantities of data. Although the computing landscape of the past decade has been rich with numerous innovations, more embedded and mobile applications that require low size, weight, and power (SWaP) systems will need capabilities that are beyond those delivered by the traditional architectures of central processing units (CPUs) and graphics processing units (GPUs). For example, in commercial applications, it is common to off-load data conditioning and algorithms to non-SWaP constrained platforms such as high-performance computing clusters or processing clouds. Defense applications, among others, on the other hand, may need AI applications to be performed inside low-SWaP platforms or local networks (edge computing) and without the use of the cloud due to insufficient security or communication infrastructure.

The survey in this section gathers performance and power information from publicly available materials including research papers, technical trade press, company benchmarks, etc. While there are ways to access information from companies and startups (including those in their silent period), this information is intentionally left out of this survey; such data will be included in this survey when it becomes publicly available. The key metrics of this public data are plotted in Figure 2, which graphs recent processor capabilities (as of May 2019) mapping peak performance vs. power consumption. The x-axis indicates peak power, and the y-axis indicates peak giga operations per second (GOps/s). Note the legend on the right, which indicates various parameters used to differentiate computing techniques and technologies. The computational precision of the processing capability is depicted by the geometric shape used; the computational precision spans from single-bit int1 to single-byte int8 and four-byte float32 to eight-byte float64. The form factor is depicted by the color; this is important for showing how much power is consumed, but also how much computation can be packed onto a single chip, a single PCI card, and a full system. Blue is only the performance and power consumption of a single chip. Orange shows the performance and power of a card (note that they all are in the 200-300 Watt zone). Green shows the performance and power of entire systems – in this case, single node desktop and server systems. This survey is limited to single motherboard, single memory-space systems. Finally, the hollow geometric objects are performance for inference only, while the solid geometric figures are performance for training (and inference) processing. Mostly, low power solutions are only capable of inference, though there are some high-power accelerators (WaveDPU, Goya, Arria, and Turing) that are targeting high performance for inference only.
From Figure 2, we can make a number of general observations. First, much of the recent efforts have focused on processors that are in the 10-300W range in terms of power utilization, since they are being designed and deployed as processing accelerators. (300W is the upper limit for a PCI-based accelerator card.) For this power envelope, the performance can vary depending on a variety of factors such as architecture, precision, and workload (training vs. inference). There are many solutions under the 1 TeraOps/W line; however, there are several inference solutions and a few training solutions that are reporting greater than 1 TeraOps/W.

With the current offerings, at least 100W must be employed to perform training; all of the points on the scatter plot below 100W are inference-only processors/accelerators. There are a number of possible explanations for this, but it is likely that there is currently little driving a requirement for low-power training, though there is much demand for low-power inference on devices ranging from smartphones to remotely piloted aircraft (RPA) and autonomous vehicles. From a technology standpoint, this may suggest that the trade-offs necessary to do neural network training under the 100W envelope affect the performance, numerical accuracy, and prediction accuracy too greatly.

Many hardware manufacturers, faced with limitations in fabrication processes, have been able to exploit the fact that machine-learning algorithms such as neural networks can perform well even when using limited or mixed precision [8], [13] representation of activation functions, weights, and biases. Such hardware platforms (often designed specifically for inference) may quantize weights and biases to half precision (16 bits) or even single bit representations in order to improve the number of operations/second without significant impact to model prediction, accuracy, or power utilization. To that point, in the inference engines, the entire neural network model is usually loaded onto the chip before any inference is performed. Loading the model turns the model's parameters into constants that are stored with the operator rather than operands that must be loaded from volatile (DRAM or SRAM) memory, thereby reducing the number of operand/parameter loads that must occur separate from the instruction load.
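To make the quantization idea concrete, the sketch below shows one common post-training scheme, symmetric linear quantization of a float32 weight tensor to int8. It is a generic, minimal illustration under that assumption, not the scheme used by any particular accelerator in this survey.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric linear quantization of a float32 weight tensor to int8.

    Returns the int8 weights and the scale needed to map them back to
    real values (dequantized value = int8_value * scale).
    """
    scale = np.max(np.abs(weights)) / 127.0  # map the largest magnitude to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    w = np.random.randn(64, 64).astype(np.float32)   # stand-in for a trained layer's weights
    q, scale = quantize_int8(w)
    err = np.abs(w - dequantize(q, scale)).max()
    print(f"scale={scale:.5f}, max reconstruction error={err:.5f}")
```

Schemes used in practice differ in details (per-channel scales, zero points, quantization of activations as well as weights), but the basic trade between representable range and resolution is the same.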

Fig. 2. Performance vs. power scatter plot of publicly announced AI accelerators and processors (peak GOps/s vs. peak power in W; marker shape encodes computational precision from int1 to float64, color encodes form factor as chip, card, or system, and hollow vs. solid markers distinguish inference-only from training-capable entries).

There are a number of dimensions with which we can present the processors and accelerators in this survey. We have chosen to roughly categorize the scatter plot into six regions that roughly correspond to performance and power consumption: Very Low Power and Research Chips, Cell/Smartphone GPUs, Mobile and Embedded Chips and Systems, FPGA Accelerators, Data Center Chips and Cards, and Data Center Systems.

In the following listings, the angle-bracketed string is the label of the item on the scatter plot, and the square bracket after the angle bracket is the literature reference from which the performance and power values came. Some of the performance values are reported in frames per second (fps) with a given machine learning model. For those values, Samuel Albanie has Matlab code and a web site that lists all of the major machine learning models with their operations per epoch/inference, parameter memory, feature memory, and input size [14]; the operations per epoch/inference are used to compute operations per second from frames per second. Finally, if a neural network model is not mentioned, the performance reported is peak performance.
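As a worked illustration of that conversion, the short sketch below multiplies a model's operations per inference by a reported frames-per-second figure to recover operations per second. The per-inference operation counts here are illustrative placeholders rather than values taken from [14], and the 500 fps input is a hypothetical example.

```python
# Minimal sketch: convert a reported frames-per-second (fps) figure into GOps/s.
# The per-inference operation counts below are illustrative placeholders; in
# practice they would be looked up from a source such as [14].

OPS_PER_INFERENCE = {
    "GoogLeNet": 3.2e9,   # placeholder operations per forward pass
    "ResNet-50": 7.7e9,   # placeholder
}

def fps_to_gops_per_s(model: str, fps: float) -> float:
    """Throughput in GOps/s implied by an fps measurement on a given model."""
    return OPS_PER_INFERENCE[model] * fps / 1e9

# Hypothetical example: an accelerator reported at 500 fps on GoogLeNet.
print(f"{fps_to_gops_per_s('GoogLeNet', 500.0):.0f} GOps/s")
```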

A. Very Low Power and Research Chips

Chips in the very low power regime have been mainly university and industry research chips. However, a few vendors have announced or are offering products in this space.

• The MIT Eyeriss chip ⟨Eyeriss⟩ [7], [15], [16] is a research chip from Vivienne Sze's group in MIT CSAIL. Their goal was to develop the most energy efficient inference chip possible. The result was acquired running AlexNet with no mention of batch size.

• The TrueNorth ⟨TrueNorth⟩ [17], [18] is a digital neuromorphic research chip from the IBM Almaden research lab. It was developed under DARPA funding in the Synapse program to demonstrate the efficacy of digital spiking neural network (neuromorphic) chips. Note that there are points on the graph for both the system, which draws the 44 W power, and the chip, which itself only draws up to 275 mW.

• The Intel MovidiusX processor ⟨MovidiusX⟩ [19] is an embedded video processor that includes a Neural Engine for video processing and object detection.

• In early 2019, Google released a TPU Edge processor ⟨TPUEdge⟩ [20] for embedded inference applications. The TPU Edge uses TensorFlow Lite, which encodes the neural network model with low precision parameters for inference.

• The DianNao series of dataflow research chips came from a university research team in China. They published four different designs aimed at different types of ML processing [21]. The DianNao ⟨DianNao⟩ [21] is a neural network inference accelerator, and the DaDianNao ⟨DaDianNao⟩ [22] is a many-tile version of the DianNao for larger NN model inference. The ShiDianNao ⟨ShiDianNao⟩ [23] is designed specifically for convolutional neural network inference. Finally, the PuDianNao ⟨PuDianNao⟩ [24] is designed for seven representative machine learning techniques: k-means, k-NN, naïve Bayes, support vector machines, linear regression, classification tree, and deep neural networks.

• San Jose startup AIStorm ⟨AIStorm⟩ [25] claims to do some of the math of inference up at the sensor in the analog domain. They originally came to the embedded space scene with biometric sensors and processing. They call their chip an AI-on-Sensor capability.

• The Rockchip RK3399Pro ⟨Rockchip⟩ [26] is an image and neural co-processor from Chinese company Rockchip. They published raw performance numbers for 8-bit inference. This appears to be a GPU-based co-processor, but details are few.

B. Cell / Smartphone GPU-based Neural Engines

A number of smartphone vendors are embedding GPU-based neural engines in their smartphones to enable object detection, face recognition, and other inference-based tasks. The performance metrics for five inference neural engines, which were benchmarked with AImark, are included in this survey. AImark runs VGG-16, ResNet34 and InceptionV3 on smartphones, and it is available in the Apple App Store and the Google Play Store. It is reasonably safe to assume that these GPU-based vector processors are executing with Int8 precision.

• The Apple A12 processor ⟨A12⟩ [27], [28] in the iPhone Xs tops out this set. This A12 neural engine bursts its power utilization to 5.5W for short time periods (above its usual 5W maximum for battery life) for fast inference runs, and this performance point is on the VGG-16 model.

• The Kirin 980 (with Arm Mali-76 GPU IP) ⟨Mali-76⟩ [29] and Kirin 970 (with Arm Mali-75 GPU IP) ⟨Mali-75⟩ [30] make their performance mark with the ResNet34 and VGG-16 models, respectively.

• Finally, the Qualcomm Snapdragon 835 ⟨S835⟩ and 845 ⟨S845⟩ [29] are also on the chart with performance numbers using the ResNet34 and InceptionV3 models, respectively.

C. Embedded Chips and Systems

The systems in this category are aimed at automotive AI/ML, autonomous vehicles, UAVs, robots, etc. They all have several ARM cores that are mated with CUDA GPU cores.

• The NVIDIA Jetson-TX1 ⟨JetsonTX1⟩ [31] incorporates 4 ARM cores and 256 CUDA Maxwell cores. It is aimed at low power applications for inference only. The performance was achieved with GoogLeNet with a batch size of 128.

• The Jetson-TX2 ⟨JetsonTX2⟩ [31] mates 6 ARM cores with 256 CUDA Pascal cores. It also is aimed at low power applications for inference only. The performance was achieved with GoogLeNet with a batch size of 128.

• The NVIDIA Xavier ⟨Xavier⟩ [32] deploys 8 ARM cores with 512 CUDA Volta cores and 64 Tensor cores. It is aimed also at low power applications for inference only.

D. FPGA Co-processors

In public literature, the use of FPGAs for neural networks has been primarily in the technical research domain.

Quite a number of research teams around the world have mapped one or more neural network models onto one or more FPGAs and collected a variety of performance and model prediction accuracy metrics. Several survey papers have been published, including [33] and [34], and the most comprehensive survey paper of mapping and running DNNs on FPGAs is here [35]. This last paper lists 25 top results from published research literature, of which we have chosen 12 that are the performance leaders for their numerical precision and/or FPGA model. They are labeled with an abbreviation of their chip type: ⟨Zynq-020⟩ int1 [36], int2 [37], int8 [38]; ⟨Zynq-060⟩ int16 accumulator/int12 result [39]; ⟨ZCU102⟩ int16 [40]; ⟨Stratix-V⟩ int32 [41]; ⟨ArriaGX1150⟩ int16 accumulator/int8 result [42], int16 [43], fp16 [44], fp32 [43]; and ⟨ArriaGX1155⟩ 1-bit [45] points with different numerical precisions. They are all used for inference. Finally, there is a 7-FPGA Xilinx Cluster ⟨XilinxCluster⟩ [46] in which the research team ganged together one control FPGA and six computational FPGAs to execute much larger neural network models. All of these results are from running one of the following models: AlexNet, VGG-16, VGG-19, DoReFa-Net, and an LSTM model. Details are in [35].

E. Data Center Chips and Cards

There are a variety of technologies in this category including several CPUs, a number of GPUs, a CPU-controlled FPGA solution, and dataflow accelerators. They are addressed in their own subsections to group similar processing technologies.

1) CPU-based Processors:

• The Intel SkyLake SP processors ⟨2xSkyLakeSP⟩ [47], [48] are conventional Xeon server processors. Intel has been marketing these chips to data analytics companies as very versatile inference engines with reasonable power budgets. The performance numbers were measured using ResNet-50 with a batch size of 64 on a 2-socket SkyLakeSP system.

• The Intel Xeon Phi processor chips have 64, 68, or 72 cores, with each core having four hardware hyper-threads and two AVX-512 (512-bit wide) vector units [49]. Having these 128 AVX-512 vector units on a 64-core chip is equal to 2048 double precision floating point vector ALUs or 4096 single precision floating point vector ALUs (this tally is written out in the short sketch after these listings). The Phi7210F ⟨Phi7210F⟩ [50] is the 64-core chip we have in the TX-Green Petaflop system, while the Phi7290F ⟨Phi7290F⟩ [50] is the top bin, 72-core Xeon Phi (KNL).

2) CPU-Controlled FPGA: The Intel Arria solution pairs an Intel Xeon CPU with an Altera Arria FPGA ⟨ArriaGX1150⟩ [51], [52] (next to the Baidu point). The CPU is used to rapidly download FPGA hardware configurations to the Arria, and then farms out the operations to the Arria for processing certain key kernels. Since inference models do not change, this technique is well geared toward this CPU-FPGA processing paradigm. However, it would be more challenging to farm ML model training out to the FPGAs. The performance benchmark was on an Arria 10 1150 FPGA using GoogLeNet, reporting 900 fps.

3) GPU-based Accelerators: There are four NVIDIA cards and two AMD/ATI cards on the chart (listed respectively): the Kepler architecture K80 ⟨K80⟩ [53], the Pascal architecture P100 ⟨P100⟩ [54], [55], the Volta architecture V100 ⟨V100⟩ [56], [57], the TU106 Turing ⟨Turing⟩ [58], the MI6 ⟨MI6⟩ [59], and the MI60 ⟨MI60⟩ [60]. The K80, P100, V100, MI6, and MI60 GPUs are pure computation cards intended for both inference and training, while the TU106 Turing GPU is geared to the gaming/graphics market, including inference processing within the graphics processing.

4) Data Center Chips and Cards: This subsection lists a series of chips and cards intended for data center deployment.

• Intel Corp. bought AI chip startup Nervana in August 2016 to enter the AI accelerator market. The first Nervana chip ⟨Nervana⟩ [61], called Lake Crest, is scheduled to ship in 2019. The follow-on is called Spring Crest ⟨Nervana2⟩ [61], and it is scheduled to ship in late 2019.

• Google has released three versions of their Tensor Processing Unit (TPU) [11]. The TPU1 ⟨TPU1⟩ [62] is only for inference, but Google soon made improvements that enabled both training and inference on the TPU2 ⟨TPU2⟩ [62] and TPU3 ⟨TPU3⟩ [62].

• GraphCore.ai has released their C2 card ⟨GraphCoreC2⟩ [63] in early 2019, which is being shipped in their GraphCore server node (see below). This company is a startup headquartered in Bristol, UK with an office in Palo Alto. They have strong venture backing from Dell, Samsung, and others. The performance values were achieved with ResNet-50 training for the single C2 card with a batch size for training of 8. The card power is an estimate based on a typical PCI card power draw.

• The Goya chip ⟨Goya⟩ [64], [65] is an inference chip being developed by startup Habana Labs, which is based in San Jose and Tel Aviv. The performance was achieved on ResNet50 inference. Habana Labs is also working on a training chip called the Gaudi, which is expected to be released in mid-2019.

• Wave Computing has released their Dataflow Processing Unit (DPU) ⟨WaveDPU⟩ [66]. Each card has four DPUs.

• The Cambricon dataflow chip ⟨Cambricon⟩ [67] was designed by a Chinese university team along with the Cambricon company, which came out of the university team. They published both int8 inference and float16 training numbers that are both significant, so both are on the chart. This is the same team that is behind the Arm Mali GPU-based Huawei Kirin chip series (see above) that are integrated into Huawei smartphones.

• Baidu has announced an AI accelerator chip called Kunlun ⟨Baidu⟩ [68], [69]. Presumably this chip is aimed at low power data center training and inference and is supposed to be deployed in early 2019. The two variants of the Kunlun are the 818-100 for inference and the 818-300 for training. The performance number in this chart is the Kunlun 818-300 for training.
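Returning to the Xeon Phi figures quoted in the CPU-based processors listing above, the vector-lane arithmetic can be written out explicitly. The sketch below is a back-of-the-envelope tally under the stated assumptions (two AVX-512 units per core, 64-bit doubles and 32-bit singles) and deliberately ignores clock rate and fused multiply-add.

```python
# Back-of-the-envelope tally of vector ALU lanes on a 64-core Xeon Phi,
# following the figures quoted above: two AVX-512 units per core.
cores = 64
avx512_units_per_core = 2
vector_width_bits = 512

dp_lanes = cores * avx512_units_per_core * (vector_width_bits // 64)  # 64-bit double precision lanes
sp_lanes = cores * avx512_units_per_core * (vector_width_bits // 32)  # 32-bit single precision lanes

print(dp_lanes, sp_lanes)  # 2048 4096, matching the counts quoted in the text
```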

F. Data Center Systems

• There are three NVIDIA server systems on the graph: the DGX-Station, the DGX-1, and the DGX-2. The DGX-Station ⟨DGX-Station⟩ [70] is a tower system for NN use as a desktop system that includes four V100 GPUs. The DGX-1 ⟨DGX-1⟩ [70], [71] is a server that includes eight V100 GPUs and occupies three rack units, while the DGX-2 ⟨DGX-2⟩ [71] is a server that includes sixteen V100 GPUs and occupies ten rack units. The DGX-2 networks those sixteen GPUs together using a proprietary NV-Link switch.

• GraphCore.ai has released a Dell/EMC based server ⟨GraphCoreNode⟩ [63] in early 2019, which contains eight C2 cards (see above). The performance values were achieved with ResNet-50 training on the full server with eight C2 cards. The training batch size for the full server was 64. The server power is an estimate based on the components of a typical Intel-based, dual-socket server with 8 PCI cards.

• Along with the aforementioned card, Wave Computing also released a server appliance ⟨WaveSystem⟩ [66], [72]. The Wave server appliance includes four cards for a total of sixteen DPUs in the server chassis.

G. Announced Chips

A number of other accelerator chips have been announced but have not published any performance and power numbers. These include: Intel Loihi [73], Facebook [74], Groq [75], Mythic [76], Amazon Web Services Inferentia [77], Stanford's Braindrop [78], Brainchip's Akida [79], [80], Tesla [81], Adapteva [82], Horizon Robotics [83], Bitmain [84], Simple Machines [85], Eta Compute [86], and Alibaba [87], among others. As performance and power numbers become available for these and other chips, they will be added in future iterations of this work.

III. BENCHMARKING

Most of the processors in the very low power space are either research chips that were developed as proof of concepts in university research labs, or they are FPGA-based solutions, also usually from university research labs. However, there are a few processors that have been commercially released. These commercial low-power accelerators are of interest for many embedded machine learning inference applications in the DoD and beyond. Amazon Web Services has disclosed that "... inference actually accounts for the majority of the cost and complexity for running machine learning in production (for every dollar spent on training, nine are spent on inference)." [88].

In this section, we will present the preliminary results of benchmarking Google TPU Edge [20] and Intel Movidius X-based [32] Neural Compute Stick 2 (NCS2) systems and comparing them to an Intel Core i9-9900k processor system. All of the benchmarks in this section were executed on an Intel-based tower desktop computer with an Intel Core i9-9900k, 32 GB (3200 MHz) RAM, and a Samsung 970 Pro NVMe storage disk. It was running Windows 10 Pro (10.0.17763 Build 17763) in a VirtualBox v6.0 virtual machine. The neural network model that the Google EdgeTPU ran was Mobilenet v1 [89] with single shot multibox detectors (SSD) [90] trained with the Microsoft COCO images library [91]. The model that the Intel Neural Compute Stick 2 (NCS2) and the Intel i9-9900k system ran was Mobilenet v2 [92], also with SSD and trained with COCO. The Edge TPU and NCS2 both had throttles imposed by the software that only allowed one image to be submitted for classification at a time (batch size = 1). Further, for both systems the entire neural network model had to be loaded onto the device for each image that is processed. This seems to be in place to emphasize that these are development products rather than production products, but in an actual embedded application this limitation would not be in place, since more performance would be gained by simultaneously submitting more than one image for classification (batch size > 1); that was not enabled or tested with this benchmarking effort. The NCS2 model was prepared for download to the device with the Intel Distribution of the OpenVINO (Open Visual Inference and Neural network Optimization) toolkit: 2018 R5.0.1 (30 Jan 2019).
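For reference, a single-image (batch size = 1) inference loop of the kind described above looks roughly like the sketch below. It assumes the TensorFlow Lite Python interpreter and a hypothetical model file name, and it omits the Edge TPU delegate setup, the OpenVINO path used for the NCS2, and the image preprocessing of the actual benchmarking harness.

```python
import time
import numpy as np
import tensorflow as tf  # assumes a TensorFlow build that provides tf.lite

# Hypothetical model file name; the benchmarks above used Mobilenet v1/v2 SSD models.
interpreter = tf.lite.Interpreter(model_path="mobilenet_ssd.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# One image at a time (batch size = 1), matching the software throttle described above.
image = np.zeros(inp["shape"], dtype=inp["dtype"])  # stand-in for a preprocessed COCO image

start = time.perf_counter()
interpreter.set_tensor(inp["index"], image)
interpreter.invoke()
detections = interpreter.get_tensor(out["index"])
elapsed_ms = (time.perf_counter() - start) * 1000.0
print(f"single image inference time: {elapsed_ms:.1f} ms")
```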
For both the TPU Edge and NCS2 devices, power draw was measured with a USB multimeter. Finally, on the Intel Core i9-9900k, TensorFlow was compiled to separately use the SSE4 and AVX2 vector engine instruction sets. The measurements for these two trials are depicted as i9-SSE4 and i9-AVX2, respectively. The Intel Core i9-9900k performs somewhat better and draws more power than typical VPX board based embedded single board computers from companies including Curtiss-Wright and Mercury Systems [93], [94], which generally are based on Intel Core i7 processors that draw a maximum of 70W for the entire system.

Table I summarizes the reported and measured giga operations per second (GOPS), power (W), and GOPS/W along with average model load time in seconds and average single image inference time in milliseconds.

TABLE I
EMBEDDED DEVICE DESCRIPTIONS

                                        EdgeTPU          NCS2      i9-SSE4     i9-AVX2
Environment                             TensorFlow Lite  OpenVINO  TensorFlow  TensorFlow
Mobilenet Model                         v1               v2        v2          v2
Reported GOPS                           58.5             160       --          --
Measured GOPS                           47.4             8.29      38.4        40.9
Reported Power (W)                      2.0              2.0       205         205
Measured Power (W)                      0.85             1.35      --          --
Reported GOPS/W                         29.3             80.0      --          --
Measured GOPS/W                         55.8             6.14      --          --
Avg. Model Load Time (s)                3.66             5.32      0.36        0.36
Avg. Single Image Inference Time (ms)   27.4             96.4      19.6        20.8

One can observe that the TPU Edge and NCS2 have much lower power consumption and much higher model load times than the Intel i9. However, single image inference times are generally the same, though the NCS2 is somewhat slower. Also, the Edge TPU GOPS/W numbers are reasonably similar, while the measured GOPS/W is much lower than the reported GOPS/W for the NCS2.
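The measured GOPS and GOPS/W entries in Table I follow directly from a per-inference operation count, the average single image inference time, and the measured power. The sketch below redoes the Edge TPU arithmetic; the per-inference operation count is an assumed input here (it depends on the exact Mobilenet v1 SSD graph), while the timing and power values are the Table I measurements.

```python
def measured_gops(ops_per_inference: float, inference_time_s: float) -> float:
    """Sustained throughput in GOPS implied by one timed inference."""
    return ops_per_inference / inference_time_s / 1e9

def gops_per_watt(gops: float, power_w: float) -> float:
    return gops / power_w

# Edge TPU row of Table I: 27.4 ms average single image inference, 0.85 W measured.
ops = 1.3e9                      # assumed operations per Mobilenet v1 SSD inference
gops = measured_gops(ops, 27.4e-3)
print(f"{gops:.1f} GOPS, {gops_per_watt(gops, 0.85):.1f} GOPS/W")
# With this assumed operation count the result is roughly 47 GOPS and 56 GOPS/W,
# in line with the measured Edge TPU values reported in Table I.
```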

Further, Figure 3 shows a box and whiskers plot of the average and standard deviation of single image inference times for each of the four technologies. From the box and whiskers plot, we see that the single image inference times are reasonably uniform across all four technologies.

Fig. 3. Box and whisker plot of single image inference times (in seconds) for the Edge TPU, Intel NCS2, i9-SSE4, and i9-AVX2.

As more low power commercial systems become available, we intend to purchase and benchmark them to add to this body of work. We expect to have performance and power numbers for the NVIDIA Jetson Xavier [32] and perhaps the NVIDIA Jetson NANO in time for the conference.

IV. SUMMARY

In this paper, we have presented a survey of processors and accelerators for machine learning, specifically deep neural networks, along with some benchmarking results that we conducted on commercial low power processing systems that are relevant to DoD and other embedded applications. We started by overviewing the trends in machine learning processor technologies – that many processor trends including transistor density, power density, clock frequency, and core counts are no longer increasing. This is prompting a drive to application specific accelerators that are designed specifically for deep neural networks. Several factors that determine accelerator designs were discussed, including the types of neural networks, training versus inference, and numerical precision for the computations. We then surveyed and analyzed machine learning processors categorized into six regions that roughly correspond to performance and power consumption. Finally, we presented benchmarking results for two low power machine learning accelerator systems, the Google Edge TPU and the Intel Movidius X Neural Compute Stick 2 (NCS2), and compared the results to an Intel i9-9900k processor system using the SSE4 and AVX2 vector engine instruction sets.

REFERENCES

[1] T. N. Theis and H.-S. P. Wong, "The End of Moore's Law: A New Beginning for Information Technology," Computing in Science & Engineering, vol. 19, no. 2, pp. 41–50, mar 2017.
[2] M. Horowitz, "Computing's Energy Problem (and What We Can Do About It)," in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). IEEE, feb 2014, pp. 10–14. [Online]. Available: http://ieeexplore.ieee.org/document/6757323/
[3] J. L. Hennessy and D. A. Patterson, "A New Golden Age for Computer Architecture," Communications of the ACM, vol. 62, no. 2, pp. 48–60, jan 2019. [Online]. Available: http://dl.acm.org/citation.cfm?doid=3310134.3282307
[4] V. Gadepally, J. Goodwin, J. Kepner, A. Reuther, H. Reynolds, S. Samsi, J. Su, and D. Martinez, "AI Enabling Technologies," MIT Lincoln Laboratory, Lexington, MA, Tech. Rep., 2019.
[5] F. A. I. Van Veen and S. A. I. Leijnen, "The Neural Network Zoo," 2019. [Online]. Available: http://www.asimovinstitute.org/neural-network-zoo/
[6] A. Canziani, A. Paszke, and E. Culurciello, "An Analysis of Deep Neural Network Models for Practical Applications," arXiv preprint arXiv:1605.07678, 2016.
[7] V. Sze, Y. Chen, T. Yang, and J. S. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, dec 2017.
[8] S. Narang, G. Diamos, E. Elsen, P. Micikevicius, J. Alben, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and others, "Mixed Precision Training," Proc. of ICLR, (Vancouver, Canada), 2018.
[9] S. Williams, A. Waterman, and D. Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Commun. ACM, vol. 52, no. 4, pp. 65–76, apr 2009. [Online]. Available: http://doi.acm.org/10.1145/1498765.1498785
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Neural Information Processing Systems, vol. 25, 2012.
[11] N. P. Jouppi, C. Young, N. Patil, and D. Patterson, "A Domain-Specific Architecture for Deep Neural Networks," Communications of the ACM, vol. 61, no. 9, pp. 50–59, aug 2018. [Online]. Available: http://doi.acm.org/10.1145/3154484
[12] M. L. Minsky, Computation: Finite and Infinite Machines. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1967.
[13] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep Learning with Limited Numerical Precision," in Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ser. ICML'15. JMLR.org, 2015, pp. 1737–1746. [Online]. Available: http://dl.acm.org/citation.cfm?id=3045118.3045303
[14] S. Albanie, "Convnet Burden," 2019. [Online]. Available: https://github.com/albanie/convnet-burden
[15] Y. Chen, J. Emer, and V. Sze, "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," IEEE Micro, p. 1, 2018.
[16] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, jan 2017.
[17] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G. Nam, B. Taba, M. Beakes, B. Brezzo, J. B. Kuang, R. Manohar, W. P. Risk, B. Jackson, and D. S. Modha, "TrueNorth: Design and Tool Flow of a 65 mW 1 Million Neuron Programmable Neurosynaptic Chip," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 10, pp. 1537–1557, oct 2015.
[18] M. Feldman, "IBM Finds Killer App for TrueNorth Neuromorphic Chip," sep 2016. [Online]. Available: https://www.top500.org/news/ibm-finds-killer-app-for-truenorth-neuromorphic-chip/
[19] J. Hruska, "New Movidius Myriad X VPU Packs a Custom Neural Compute Engine," aug 2017.
[20] "Edge TPU," 2019. [Online]. Available: https://cloud.google.com/edge-tpu/
[21] Y. Chen, T. Chen, Z. Xu, N. Sun, and O. Temam, "DianNao Family: Energy-Efficient Hardware Accelerators for Machine Learning," Communications of the ACM, vol. 59, no. 11, pp. 105–112, oct 2016. [Online]. Available: http://dl.acm.org/citation.cfm?doid=3013530.2996864
[22] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A Machine-Learning Supercomputer," in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, dec 2014, pp. 609–622. [Online]. Available: http://ieeexplore.ieee.org/document/7011421/
[23] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting vision processing closer to the sensor," in ACM SIGARCH Computer Architecture News, vol. 43, no. 3. ACM, 2015, pp. 92–104.
[24] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Temam, X. Feng, X. Zhou, and Y. Chen, "PuDianNao: A polyvalent machine learning accelerator," in ACM SIGARCH Computer Architecture News, vol. 43, no. 1. ACM, 2015, pp. 369–381.
[25] R. Merritt, "Startup Accelerates AI at the Sensor," feb 2019. [Online]. Available: https://www.eetimes.com/document.asp?doc_id=1334301
[26] "Rockchip Released Its First AI Processor RK3399Pro NPU Performance up to 2.4TOPs," jan 2018.
[27] A. Frumusanu, "The iPhone XS & XS Max Review: Unveiling the Silicon Secrets," oct 2018. [Online]. Available: https://www.anandtech.com/show/13392/the-iphone-xs-xs-max-review-unveiling-the-silicon-secrets
[28] T. Peng, "AI Chip Duel: Apple A12 Bionic vs Huawei Kirin 980," sep 2018. [Online]. Available: https://medium.com/syncedreview/ai-chip-duel-apple-a12-bionic-vs-huawei-kirin-980-ec29cfe68632
[29] A. Frumusanu, "The Samsung Galaxy S9 and S9+ Review: Exynos and Snapdragon at 960fps," mar 2018.
[30] ——, "HiSilicon Announces The Kirin 980: First A76, G76 on 7nm," aug 2018.
[31] D. Franklin, "NVIDIA Jetson TX2 Delivers Twice the Intelligence to the Edge," mar 2017.
[32] J. Hruska, "Nvidia's Jetson Xavier Stuffs Volta Performance Into Tiny Form Factor," jun 2018.
[33] Z. Li, Y. Wang, T. Zhi, and T. Chen, "A survey of neural network accelerators," Frontiers of Computer Science, vol. 11, no. 5, pp. 746–761, oct 2017. [Online]. Available: http://link.springer.com/10.1007/s11704-016-6159-1
[34] S. Mittal, "A survey of FPGA-based accelerators for convolutional neural networks," Neural Computing and Applications, pp. 1–31, oct 2018. [Online]. Available: http://link.springer.com/10.1007/s00521-018-3761-1
[35] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang, "A Survey of FPGA-Based Neural Network Accelerator," arXiv preprint arXiv:1712.08934, dec 2017. [Online]. Available: http://arxiv.org/abs/1712.08934
[36] H. Nakahara, T. Fujii, and S. Sato, "A Fully Connected Layer Elimination for a Binarized Convolutional Neural Network on an FPGA," in 2017 27th International Conference on Field Programmable Logic and Applications (FPL), 2017, pp. 1–4.
[37] L. Jiao, C. Luo, W. Cao, X. Zhou, and L. Wang, "Accelerating Low Bit-Width Convolutional Neural Networks with Embedded FPGA," in 2017 27th International Conference on Field Programmable Logic and Applications (FPL). IEEE, sep 2017, pp. 1–4. [Online]. Available: http://ieeexplore.ieee.org/document/8056820/
[38] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and H. Yang, "Angel-Eye: A Complete Design Flow for Mapping CNN Onto Embedded FPGA," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 1, pp. 35–47, jan 2018. [Online]. Available: http://ieeexplore.ieee.org/document/7930521/
[39] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, H. Yang, and W. B. J. Dally, "ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '17. New York, NY, USA: ACM, 2017, pp. 75–84. [Online]. Available: http://doi.acm.org/10.1145/3020078.3021745
[40] L. Lu, Y. Liang, Q. Xiao, and S. Yan, "Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs," in 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, apr 2017, pp. 101–108. [Online]. Available: http://ieeexplore.ieee.org/document/7966660/
[41] A. Podili, C. Zhang, and V. Prasanna, "Fast and efficient implementation of Convolutional Neural Networks on FPGA," in 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE, jul 2017, pp. 11–18. [Online]. Available: http://ieeexplore.ieee.org/document/7995253/
[42] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, "Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '17. New York, NY, USA: ACM, 2017, pp. 45–54. [Online]. Available: http://doi.acm.org/10.1145/3020078.3021736
[43] J. Zhang and J. Li, "Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '17. New York, NY, USA: ACM, 2017, pp. 25–34. [Online]. Available: http://doi.acm.org/10.1145/3020078.3021698
[44] U. Aydonat, S. O'Connell, D. Capalija, A. C. Ling, and G. R. Chiu, "An OpenCL™ Deep Learning Accelerator on Arria 10," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '17. New York, NY, USA: ACM, 2017, pp. 55–64. [Online]. Available: http://doi.acm.org/10.1145/3020078.3021738
[45] D. J. M. Moss, E. Nurvitadhi, J. Sim, A. Mishra, D. Marr, S. Subhaschandra, and P. H. W. Leong, "High performance binary neural networks on the Xeon+FPGA platform," in 2017 27th International Conference on Field Programmable Logic and Applications (FPL). IEEE, sep 2017, pp. 1–4. [Online]. Available: http://ieeexplore.ieee.org/document/8056823/
[46] C. Zhang, D. Wu, J. Sun, G. Sun, G. Luo, and J. Cong, "Energy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster," in Proceedings of the 2016 International Symposium on Low Power Electronics and Design, ser. ISLPED '16. New York, NY, USA: ACM, 2016, pp. 326–331. [Online]. Available: http://doi.acm.org/10.1145/2934583.2934644
[47] A. Rodriguez, "Intel Processors for Deep Learning Training," nov 2017. [Online]. Available: https://software.intel.com/en-us/articles/intel-processors-for-deep-learning-training
[48] "Intel Xeon Platinum 8180 Processor," 2019. [Online]. Available: https://ark.intel.com/content/www/us/en/ark/products/120496/intel-xeon-platinum-8180-processor-38-5m-cache-2-50-ghz.html
[49] J. Jeffers, J. Reinders, and A. Sodani, Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition, 2nd ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2016.
[50] "Xeon Phi," 2019. [Online]. Available: https://en.wikipedia.org/wiki/Xeon_Phi
[51] N. Hemsoth, "Intel FPGA Architecture Focuses on Deep Learning Inference," jul 2018. [Online]. Available: https://www.nextplatform.com/2018/07/31/intel-fpga-architecture-focuses-on-deep-learning-inference/
[52] M. S. Abdelfattah, D. Han, A. Bitar, R. DiCecco, S. O'Connell, N. Shanker, J. Chu, I. Prins, J. Fender, A. C. Ling, and G. R. Chiu, "DLA: Compiler and FPGA Overlay for Neural Network Inference Acceleration," in 2018 28th International Conference on Field Programmable Logic and Applications (FPL), aug 2018, pp. 411–417.
[53] R. Smith, "NVIDIA Launches Tesla K80, GK210 GPU," nov 2014. [Online]. Available: https://www.anandtech.com/show/8729/nvidia-launches-tesla-k80-gk210-gpu
[54] "NVIDIA Tesla P100." [Online]. Available: https://www.nvidia.com/en-us/data-center/tesla-p100/
[55] R. Smith, "NVIDIA Announces Tesla P100 Accelerator - Pascal GP100 Power for HPC," apr 2016. [Online]. Available: https://www.anandtech.com/show/10222/nvidia-announces-tesla-p100-accelerator-pascal-power-for-hpc
[56] "NVIDIA Tesla V100 Tensor Core GPU," 2019. [Online]. Available: https://www.nvidia.com/en-us/data-center/tesla-v100/
[57] R. Smith, "16GB NVIDIA Tesla V100 Gets Reprieve; Remains in Production," may 2018. [Online]. Available: https://www.anandtech.com/show/12809/16gb-nvidia-tesla-v100-gets-reprieve-remains-in-production
[58] E. Kilgariff, H. Moreton, N. Stam, and B. Bell, "NVIDIA Turing Architecture In-Depth," sep 2018.
[59] ExxactCorp, "Taking a Deeper Look at AMD Radeon Instinct GPUs for Deep Learning," dec 2017. [Online]. Available: https://blog.exxactcorp.com/taking-deeper-look-amd-radeon-instinct-gpus-deep-learning/
[60] R. Smith, "AMD Announces Radeon Instinct MI60 & MI50 Accelerators Powered By 7nm Vega," nov 2018. [Online]. Available: https://www.anandtech.com/show/13562/amd-announces-radeon-instinct-mi60-mi50-accelerators-powered-by-7nm-vega
[61] N. Rao, "Beyond the CPU or GPU: Why Enterprise-Scale Artificial Intelligence Requires a More Holistic Approach," may 2018.
[62] P. Teich, "Tearing Apart Google's TPU 3.0 AI Coprocessor," may 2018.
[63] D. Lacey, "Preliminary IPU Benchmarks," oct 2017. [Online]. Available: https://www.graphcore.ai/posts/preliminary-ipu-benchmarks-providing-previously-unseen-performance-for-a-range-of-machine-learning-applications
[64] L. Armasu, "Move Over GPUs: Startup's Chip Claims to Do Deep Learning Inference Better," sep 2018. [Online]. Available: https://www.tomshardware.com/news/habana-inference-goya-custom-chip,37821.html
[65] M. Feldman, "AI Chip Startup Puts Inference Cards on the Table," jan 2019. [Online]. Available: https://www.nextplatform.com/2019/01/28/ai-chip-startup-puts-inference-cards-on-the-table/
[66] N. Hemsoth, "First In-Depth View of Wave Computing's DPU Architecture, Systems," aug 2017. [Online]. Available: https://www.nextplatform.com/2017/08/23/first-depth-view-wave-computings-dpu-architecture-systems/
[67] I. Cutress, "Cambricon, Maker of Huawei's Kirin NPU IP, Build a Big AI Chip and PCIe Card," may 2018. [Online]. Available: https://www.anandtech.com/show/12815/cambricon-makers-of-huaweis-kirin-npu-ip-build-a-big-ai-chip-and-pcie-card
[68] R. Merritt, "Baidu Accelerator Rises in AI," jul 2018. [Online]. Available: https://www.eetimes.com/document.asp?doc_id=1333449
[69] C. Duckett, "Baidu Creates Kunlun Silicon for AI," jul 2018. [Online]. Available: https://www.zdnet.com/article/baidu-creates-kunlun-silicon-for-ai/
[70] P. Alcorn, "Nvidia Infuses DGX-1 with Volta, Eight V100s in a Single Chassis," may 2017. [Online]. Available: https://www.tomshardware.com/news/nvidia-volta-v100-dgx-1-hgx-1,34380.html
[71] I. Cutress, "NVIDIA's DGX-2: Sixteen Tesla V100s, 30TB of NVMe, Only $400K," mar 2018. [Online]. Available: https://www.anandtech.com/show/12587/nvidias-dgx2-sixteen-v100-gpus-30-tb-of-nvme-only-400k
[72] M. Feldman, "Wave Computing Launches Machine Learning Appliance," apr 2017. [Online]. Available: https://www.top500.org/news/wave-computing-launches-machine-learning-appliance/
[73] N. Hemsoth, "First Wave of Spiking Neural Network Hardware Hits," sep 2018. [Online]. Available: https://www.nextplatform.com/2018/09/11/first-wave-of-spiking-neural-network-hardware-hits/
[74] W. Knight, "Cheaper AI for Everyone is the Promise with Intel and Facebook's New Chip," MIT Technology Review, jan 2019. [Online]. Available: https://www.technologyreview.com/s/612722/cheaper-ai-for-everyone-is-the-promise-with-intel-and-facebooks-new-chip/
[75] J. Morra, "Groq Portrays Power of Its Artificial Intelligence Silicon," nov 2017. [Online]. Available: https://www.electronicdesign.com/industrial-automation/groq-outlines-potential-power-artificial-intelligence-chip
[76] N. Hemsoth, "A Mythic Approach to Deep Learning Inference," aug 2018. [Online]. Available: https://www.nextplatform.com/2018/08/23/a-mythic-approach-to-deep-learning-inference/
[77] "Announcing AWS Inferentia: Machine Learning Inference Chip," nov 2018.
[78] R. Merritt, "AI Vet Pushes for Neuromorphic Chips — EE Times," may 2019. [Online]. Available: https://www.eetimes.com/document.asp?doc_id=1334679
[79] W. Wong, "BrainChip Unveils Akida Architecture," sep 2018.
[80] R. Merritt, "BrainChip Discloses SNN Chip," sep 2018. [Online]. Available: https://www.eetimes.com/document.asp?doc_id=1333677
[81] K. Hao, "Tesla Says Its New Self-Driving Chip Will Help Make Its Cars Autonomous," MIT Technology Review, apr 2019.
[82] A. Olofsson, "Epiphany-V: A 1024-core 64-bit RISC processor — Parallella," oct 2016. [Online]. Available: https://www.parallella.org/2016/10/05/epiphany-v-a-1024-core-64-bit-risc-processor/
[83] J. Horwitz, "Chinese AI chip maker Horizon Robotics raises $600 million from SK Hynix, others - Reuters," feb 2019. [Online]. Available: https://www.reuters.com/article/us-china-tech-semiconductors/chinese-ai-chip-maker-horizon-robotics-raises-600-million-from-sk-hynix-others-idUSKCN1QG0HW
[84] M. Chafkin and D. Ramli, "China's Crypto-Chips King Sets His Sights on AI - Bloomberg," jun 2018. [Online]. Available: https://www.bloomberg.com/news/features/2018-05-17/china-s-crypto-chips-king-sets-his-sights-on-ai
[85] D. Tenenbaum, "As computing moves to cloud, UW-Madison spinoff offers faster, cleaner chip for data centers," may 2017. [Online]. Available: https://news.wisc.edu/as-computing-moves-to-cloud-uw-madison-spinoff-offers-faster-cleaner-chip-for-data-centers/
[86] J. Yoshida, "Startup Runs Spiking Neural Network on Arm — EE Times," mar 2018. [Online]. Available: https://www.eetimes.com/document.asp?doc_id=1333080
[87] E. Yu, "Alibaba to launch own AI chip next year — ZDNet," sep 2018. [Online]. Available: https://www.zdnet.com/article/alibaba-to-launch-own-ai-chip-next-year/
[88] "Amazon Web Services Announces 13 New Machine Learning Services and Capabilities, Including a Custom Chip for Machine Learning Inference, and a 1/18 Scale Autonomous Race Car for Developers," nov 2018. [Online]. Available: https://press.aboutamazon.com/news-releases/news-release-details/amazon-web-services-announces-13-new-machine-learning-services
[89] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," arXiv preprint arXiv:1704.04861, apr 2017. [Online]. Available: https://arxiv.org/abs/1704.04861
[90] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single Shot MultiBox Detector," in Computer Vision – ECCV 2016. Springer International Publishing, 2016, pp. 21–37. [Online]. Available: http://link.springer.com/10.1007/978-3-319-46448-0_2
[91] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common Objects in Context," in Computer Vision – ECCV 2014. Springer International Publishing, 2014, pp. 740–755. [Online]. Available: http://link.springer.com/10.1007/978-3-319-10602-1_48
[92] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted Residuals and Linear Bottlenecks," arXiv preprint arXiv:1801.04381, jan 2018. [Online]. Available: https://arxiv.org/abs/1801.04381
[93] "Curtiss-Wright 3U Intel Single Board Computers," 2019. [Online]. Available: https://www.curtisswrightds.com/products/cots-boards/processor-cards/3u-intel-sbc/
[94] "Mercury Systems BuiltSAFE CIOV-2231," 2019. [Online]. Available: https://www.mrcy.com/mission-computing-safety-dal/single-board-computers-products/avionics-ciov-2231/