
Review

Low-Power Ultra-Small Edge AI Accelerators for Image Recognition with Convolution Neural Networks: Analysis and Future Directions

Weison Lin *, Adewale Adetomi and Tughrul Arslan

Institute for Integrated Micro and Nano Systems, University of Edinburgh, Edinburgh EH9 3FF, UK; [email protected] (A.A.); [email protected] (T.A.) * Correspondence: [email protected]

Abstract: Edge AI accelerators have been emerging as a solution for near-customer applications in areas such as unmanned aerial vehicles (UAVs), image recognition sensors, wearable devices, robotics, and remote sensing satellites. These applications require meeting performance targets and resilience constraints due to the limited device area and the hostile environments in which they operate. Numerous research articles have proposed edge AI accelerators to satisfy these applications, but not all include full specifications. Most of them compare their architecture only with existing CPUs, GPUs, or other reference research, which implies that the performance expositions of these articles are not comprehensive. Thus, this work lists the essential specifications of prior art edge AI accelerators and CGRA accelerators from the past few years to define and evaluate low-power ultra-small edge AI accelerators. The actual performance, implementation, and productized examples of edge AI accelerators are released in this paper. We introduce evaluation results showing the edge AI accelerator design trend for key performance metrics to guide designers. Last but not least, we give the prospects of edge AI's existing and future directions and trends, which will involve other technologies for future challenging constraints.

Keywords: edge AI accelerator; CGRA; CNN

1. Introduction

Convolution neural network (CNN), widely applied to image recognition, is a deep learning algorithm. CNN is usually adopted by software programs supported by an artificial intelligence (AI) framework, such as TensorFlow and Caffe. These programs are usually run by central processing units (CPUs) or graphics processing units (GPUs) to form the AI systems that construct the image recognition models. Models trained by massive data, such as big data, and inferring results from the given data are commonly seen running on cloud-based systems.

Hardware platforms for running AI technology can be sorted into the following hierarchies: the data-center-bound system, the edge-cloud coordination system, and 'edge' AI devices. The three hierarchies of hardware platforms, from the data center to edge devices, require different hardware resources and are exploited by various applications according to their demands. The state-of-the-art applications for image recognition, such as unmanned aerial vehicles (UAVs), image recognition sensors, wearable devices, robotics, and remote sensing satellites, belong to the third hierarchy and are called edge devices. Edge devices refer to devices connected to the internet but near the consumers or at the edge of the whole Internet of Things (IoT) system. They are also called edge AI devices when they utilize AI algorithms. The targeted AI algorithm of the accelerators in this paper is CNN.
The mentioned applications have several features that demand a dedicated system. Edge AI holds several advantages and is qualified to deal with them. The specific capabilities of edge AI technology are:


• Edge AI can improve the user experience as AI technology and data processing move ever closer to customers.
• Edge AI can reduce data transmission latency, which implies real-time processing ability.
• Edge AI can run with no internet coverage and offers privacy through local processing.
• Edge AI pursues compact size and manages power consumption to meet mobility and limited power budgets.

Some edge AI systems are neither power-sensitive nor size-limited, such as surveillance systems for face recognition and unmanned shops. Being designed as immobile systems is a specific feature of these edge AI systems. Although these kinds of applications do not care about power consumption and size, they tend to be more aware of data privacy. As a result, they also avoid using the first and second hierarchy platforms. However, these power-non-sensitive edge AI systems are not a target of this paper.

This paper focuses on surveying AI accelerators designed for power-sensitive and size-limited edge AI devices, which run on batteries or limited power sources such as solar panels. Edge AI mentioned in the following refers to this kind of system. Typically, this kind of system is portable and is what comes to mind when people mention edge AI devices, because GPU- and CPU-based systems can easily substitute for the non-power-sensitive and non-size-limited edge AI systems, since those do not care about power consumption and size.

As an edge AI accelerator requires mobility, power consumption and area size are the features of greatest concern. Once these two features meet their requirements, the computation ability is expected to be as high as possible. High computation ability helps to deliver the critical feature of this kind of edge AI device: real-time computing for predicting or inferring the subsequent decision from pre-trained data.

CPUs and GPUs have been used extensively in the first two hierarchies of AI hardware platforms for practicing CNN algorithms. Due to the inflexibility of CPUs and the high power consumption of GPUs, they are not suitable for power-sensitive edge AI devices. As a result, power-sensitive edge AI devices require a new customized and flexible AI hardware platform to implement arbitrary CNN algorithms for real-time computing with low power consumption. Furthermore, as edge devices develop into various applications, such as monitoring natural hazards by UAV, detecting radiation leakage after a nuclear disaster by robotics, and remote sensing in space by satellites, these applied fields are more critical than usual. These critical environments, such as radiation fields, can cause a system failure. As a result, not only are power consumption and area size key, but so is the fault tolerance of edge AI devices, to satisfy their compact and mobile feature with reliability.

Various research articles have been proposed targeting fault tolerance. Reference [1] introduces a clipped activation technique to block potentially faulty activations and map them to zero on a CPU and two GPUs (sketched below). Reference [2] focuses on fault mitigation, utilizing a fault-aware pruning with/without retraining technique. The retraining takes at least 12 min to finish, and the worst case is 1 h for AlexNet, which is not suitable for edge AI. For permanent faults, Reference [3] proposes a fault-aware mapping technique to mitigate permanent faults in MAC units.
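The clipping idea in [1] can be illustrated with a short sketch. The threshold value and the exact function shape below are our illustrative assumptions, not the precise formulation of [1]:

```python
def clipped_relu(activation: float, threshold: float = 4.0) -> float:
    """Suppress potentially faulty activations (a sketch of the idea in [1]).

    A bit flip in a high-order bit usually yields an activation far outside
    the range profiled during training, so values above the threshold are
    assumed faulty and mapped to zero. The threshold here is a placeholder;
    in practice it would be profiled per layer on fault-free runs.
    """
    if activation <= 0.0:
        return 0.0          # standard ReLU behaviour for negative inputs
    if activation > threshold:
        return 0.0          # out-of-range activation: treat as faulty, block it
    return activation
```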
For power-efficient technology, Reference [4] proposes a computation-reuse-aware neural network technique that reuses weights by constructing a computational reuse table. Reference [5] uses an approximate computing technique to retrain the network to obtain the resilient neurons. It also shows that dynamic reconfiguration is the crucial feature for the flexibility of arranging the processing engines. These articles focus specifically on fault tolerance technology. Some of them address the relationship between accuracy and power efficiency together but lack computation ability information [4,5]. Besides these listed articles, there are still many published works targeting fault tolerance in recent years, which indicates that edge AI with fault tolerance is the trend.

In summary, edge AI devices' significant issues are power sensitivity, device size limitation, limited local-processing ability, and fault tolerance. The limitation of power

sensitivity, device size, and local-processing ability are bound by the native constraints of the edge AI devices. The constraints are generated by the limited power source, such as a battery, and the portable feature, which limits the size of an edge AI system. To address these issues, examining the three key features in the prior art of accelerators tailored for edge AI devices is necessary for providing future design directions.

From the point of view of a complete edge AI accelerator, the released specifications of the above fault tolerance articles are not comprehensive because they focus on the fault tolerance feature more than on a whole edge AI accelerator. Furthermore, most related edge AI accelerator survey works tend to focus on an individual structure's specific topics and features without comparing the three features. Although each article on an individual structure is interesting to read and learn about, without direct comparison in the same standard units, it is unclear how good each structure is and which generation it belongs to. This situation also makes it hard for designers to compare the structures of the edge AI accelerators and determine which design is more suitable to reference. As a result, this paper focuses on evaluating the prior art edge AI accelerators on the three key features.

The prior arts focused on in this work are not limited to the released prior art accelerators but also include edge accelerator architectures based on coarse-grained reconfigurable array (CGRA) technology, because one of the solutions for edge AI platforms to achieve hardware flexibility within a compact size is dynamic reconfiguration. The reconfigurable function realizes different types of CNN algorithms such that they can be loaded into an AI platform depending on the required edge computing. Moreover, the reconfigurable function also potentially provides fault tolerance to the system by reconfiguring the connections between processing elements (PEs). Overall, this survey will benefit those looking up the specifications of low-power ultra-small edge AI accelerators and setting up their designs. This paper helps designers choose or design a suitable architecture by indicating reasonable parameters for their low-power ultra-small edge AI accelerator.

The rest of this paper is organized as follows: Section 2 introduces the hardware types adopted by AI applications. Section 3 introduces the edge AI accelerators, including prior edge AI accelerators, CGRA accelerators, the units used in this paper for evaluating their three key features, and the suitable technologies for implementing the accelerators. Section 4 releases the analysis results and indicates the future direction. Conclusions and future works are summarized in Section 5.

2. System Platform for AI Algorithms

To achieve the performance of AI algorithms, several design trends for a complete AI system platform, such as cloud training and inference, edge-cloud coordination, near-memory computing, and in-memory computing, have been proposed [6]. Currently, AI algorithms rely on cloud or edge-cloud coordinating platforms, such as NVIDIA's GPU-based platforms, Xilinx's Versal platform, MediaTek's NeuroPilot platform, and Apple's A13 CPU [7]. The advantages and disadvantages of CPUs and GPUs for application on edge devices are shown in Table 1 [8,9]. As shown in Table 1, the CPU and GPU are more suitable for data-center-bound platforms due to the CPU's sequential processing feature and the GPU's high power consumption. They do not meet the demand of low-power edge devices, which are strictly power-limited and size-sensitive [10].

Edge-cloud coordination systems belong to the second hierarchy, which cannot run in areas with no network coverage. Data transfer through the network has significant latency, which is not acceptable for real-time AI applications such as security and emergency response [9]. Privacy is another concern when personal data is transferred through the internet. Low-power edge AI devices require hardware to support high-performance AI computation with minimal power consumption in real time. As a result, designing a reconfigurable AI hardware platform that allows the adoption of arbitrary CNN algorithms for low-power edge AI devices with no internet coverage is the trend.

Table 1. Pros and cons of CPU, GPU, and edge AI accelerator.

CPU
• Advantages: Easily implements any algorithm.
• Disadvantages: The sequential processing feature does not match the characteristic of CNN, which requires massively parallel data processing. Images in a streaming video and some tracking algorithms are inputted sequentially, not in parallel [15].
• Application platform: More suitable for a data center; cooperates with an AI accelerator.

GPU
• Advantages: Can process high-throughput video data. High memory bandwidth. High parallel processing ability [11–14].
• Disadvantages: Requires massive power support, which restricts its application in power-sensitive edge devices.
• Application platform: More suitable for a data center; cooperates with an AI accelerator.

Edge AI accelerator
• Advantages: Power- and computation-efficient. Compact size. Customizable design for the specific application.
• Disadvantages: Customized for the specific targeted application (inflexible for all types of computations). Computational power is limited compared to the data center CPU and GPU.
• Application platform: Customized for specific edge devices; can cooperate with a CPU or GPU.

3. Edge AI Accelerators

Several architectures and methods have been proposed to achieve the compact size, low power consumption, and computation ability required by edge devices. The following Sections 3.2 and 3.3 introduce the released prior art edge AI accelerators and state-of-the-art edge accelerators based on CGRA, which are potentially suitable for low-power edge AI devices.

Most of the proposed review articles introduce accelerators feature by feature, and some miss mentioning the three key features. These articles tend to report the existing works only but do not compare them. On the other hand, other edge AI accelerator articles contain all three key features, but the results they release are hard to interpret because they only release comparison results against reference accelerators. It turns out that the units used in these articles for comparing architecture area, power consumption, and computation ability are not the ones expected by the edge AI designers' community. Instead of using millimeters squared (mm2), watts (W), and operations per second (OPs), the results show how many 'times' better they are than their reference works. As a result, we decided to compare the edge AI accelerators and CGRA architectures using the units used by most AI accelerator designers.

3.1. Specification Normalization and Evaluation

The unit of computation ability presented in the following sections is OPs (operations per second); MOPs, GOPs, and TOPs represent mega, giga, and tera OPs, respectively. The arithmetic of each accelerator varies in data representation, e.g., floating-point (FP) and fixed-point (fixed). Comparing the accelerators under different arithmetic would lose impartiality; as a result, converting the units is the first task. The computation ability is represented as c. If the arithmetic of the accelerator is FP, its c is defined as cFP. On the other hand, cFixed means the computation ability under fixed arithmetic. In the computation rows of Tables 2–7, the initial c is the original data released by the reference works. It might vary in arithmetic type and precision. Based on [16,17], the computation ability of FP (cFP) can be converted to the computation ability cFixed by scaling three times. As a result, (1) is introduced.

Converted computation ability to a fixed point is defined as follows:

cFixed = cFP × 3 (1)

However, because not all accelerators have the same data precision, cFixed is not convincing when comparing the accelerators' abilities. Reference [18] indicates that if a structure is not optimized for particular precisions, as Nvidia does in their GPUs, the theoretical performance of half precision follows the natural 2×/4× speedups against single/double precision, respectively. As a result, the accelerators' computation ability needs to be normalized to 16 bits, as that is the precision used by the majority, without loss of generality. After normalization, the computation ability of each accelerator can be represented as cFixed16. To specify the accelerators' performance fairly, (2) is introduced. The computation abilities finally shown in the computation rows of Tables 2–7 are the computation abilities in 16-bit fixed-point format. Converted computation ability to a 16-bit fixed point is defined as follows:

cFixed16 = cFixed × (precision/16) (2)

To specify the accelerators' synergy performance, (3) is introduced to represent the accelerators' evaluation value. Since the edge devices require low power consumption and compact size, in (3) the denominator is the power consumption p (W) times the size s (mm2), and the numerator is the computation ability cFixed16 (GOPs). The equation for evaluating an accelerator's synergy performance is:

Evaluation value (E) = cFixed16/(p × s) (3)
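For clarity, (1)–(3) can be written out in code. The following is a minimal Python sketch; the function names are ours, and the example numbers are the Kneron entries from Table 2:

```python
def to_fixed(c_fp: float) -> float:
    """Eq. (1): convert FP computation ability to fixed-point (x3)."""
    return c_fp * 3.0

def to_fixed16(c_fixed: float, precision_bits: int) -> float:
    """Eq. (2): normalize fixed-point computation ability to 16-bit."""
    return c_fixed * (precision_bits / 16.0)

def evaluation_value(c_fixed16_gops: float, p_watt: float, s_mm2: float) -> float:
    """Eq. (3): synergy performance E = cFixed16 / (p * s)."""
    return c_fixed16_gops / (p_watt * s_mm2)

# Kneron (Table 2): 152 GOPs at 16-bit fixed (no conversion needed),
# 350 mW, 2 mm x 2.5 mm core -> E = 152 / (0.35 * 5.0) ~= 86.86
print(round(evaluation_value(152.0, 0.35, 2.0 * 2.5), 2))
```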

3.2. Prior Art Edge AI Accelerators

The following shows the edge AI accelerators [9,19–28], which focus on the demands of edge devices and are organized into Tables 2–4 according to their precision and power consumption. Table 2 shows the accelerators with 16-bit precision and less than a watt of power consumption. Table 3 shows the 16-bit precision accelerators with relatively high power consumption, higher than a watt. Table 4 shows the accelerators that do not use 16-bit precision.

Table 2. Prior Art Edge AI Accelerators.

Kneron 2018 [9]
• Computation ability: 152 GOPs
• Precision: 16-bit Fixed
• Power consumption: 350 mW
• Size: TSMC 65 nm RF 1P6M; core area 2 mm × 2.5 mm
• Evaluation value E: 86.86 (core)
• Implementation: Implemented in TSMC 65 nm technology
• Commercial product example: Packaged in a USB stick

Eyeriss (MIT) 2016 [19]
• Computation ability: 84 GOPs
• Precision: 16-bit Fixed
• Power consumption: 278 mW
• Size: TSMC 65 nm LP 1P9M; chip size 4.0 mm × 4.0 mm; core area 3.5 mm × 3.5 mm
• Evaluation value E: 18.88 (chip); 24.66 (core)
• Implementation: Implemented in TSMC 65 nm technology
• Commercial product example: –

1.42TOPS/W 2016 [21]
• Computation ability: 64 GOPs
• Precision: 16-bit Fixed [9]
• Power consumption: 45 mW
• Size: TSMC 65 nm LP 1P8M; chip size 4.0 mm × 4.0 mm
• Evaluation value E: 88.88
• Implementation: Implemented in TSMC 65 nm technology
• Commercial product example: Packaged on a PCIe interface card (no public sales channel found)

Table 3. Prior Art Edge AI Accelerators.

Myriad X (Intel) 2017 [22]
• Computation ability: 1 TFlops = 3 TOPs
• Precision: 16-bit FP
• Power consumption: <2 W
• Size: 8.1 mm × 8.8 mm (package)
• Evaluation value E: 21.55 (package s)
• Implementation: Implemented in TSMC 16 nm technology
• Commercialized product example: Packaged in a USB stick

NVIDIA Tegra X1 TM660M 2019 [23,24]
• Computation ability: 472 GFlops = 1.42 TOPs
• Precision: 16-bit FP
• Power consumption: 5–10 W (module-level)
• Size: 28 nm; 23 mm × 23 mm
• Evaluation value E: 5.34 × 10−4 (module-level p)
• Implementation: Implemented as a system on chip (probably TSMC 16 nm technology)
• Commercialized product example: Packaged on a developing board

Rockchip RK1808 2018 [25]
• Computation ability: 100 GFlops = 300 GOPs
• Precision: 16-bit FP; 300 GOPs @ INT16
• Power consumption: ~3.3 W (module-level)
• Size: 22 nm; ≈13.9 mm × 13.9 mm
• Evaluation value E: 3.81 (module-level p)
• Implementation: Implemented as a system on chip (22 nm technology)
• Commercialized product example: Packaged in a USB stick

Texas Instruments AM5729 2019 [28]
• Computation ability: 120 GOPs
• Precision: 16-bit Fixed
• Power consumption: ≈6.5 W (module-level)
• Size: 28 nm; 23 mm × 23 mm
• Evaluation value E: 3.5 × 10−5 (module-level p)
• Implementation: Implemented as a system on chip (28 nm technology)
• Commercialized product example: Packaged on a PCIe interface card

Table 4. Prior Art Edge AI Accelerators.

GTI Lightspeeur SPR2801S 2019 [16]
• Computation ability: 5.6 TFlops = 9.45 TOPs
• Precision: input activations: 9-bit FP; weights: 14-bit FP
• Power consumption: 600 mW (2.8 TOPs @ 300 mW)
• Size: 28 nm; 7.0 mm × 7.0 mm
• Evaluation value E: 329.14
• Implementation: Implemented in TSMC 28 nm technology
• Commercialized product example: Packaged in a USB stick

Optimizing FPGA-based 2015 [20]
• Computation ability: 61.62 GFlops = 61.62 GOPS [20]
• Precision: 32-bit FP
• Power consumption: 18.61 W
• Size: on Virtex7 VX485T
• Evaluation value E: –
• Implementation: Implemented on the VC707 board, which has a Xilinx FPGA chip (Virtex7 485t)
• Commercialized product example: –

Google Edge TPU 2018 [26,27]
• Computation ability: 4 TOPs = 2 TOPs (16-bit)
• Precision: INT8
• Power consumption: 2 W (0.5 W/TOPs)
• Size: 5.0 mm × 5.0 mm
• Evaluation value E: 40.96
• Implementation: Implemented in ASIC (undisclosed technology)
• Commercialized product example: Packaged in a USB stick

After calculating their evaluation value E, the accelerators [9,21] show similar abilities, with E in the 80s. On the other hand, References [19,22,26] share similar E values in the 20s. Although some of the accelerators have close evaluation values E, the cFixed16, p, and s values of these accelerators still need to be examined because the accelerators might be applied for different purposes and environments. For example, Reference [16] has the highest evaluation value E, but its size is 9.8 times that of [9], 3 times that of [21], and nearly 2 times that of [26]. Overall, the evaluation value E tells us the general efficiency of an AI accelerator, which is computation ability per unit area and watt.

In Tables 2–4, several accelerators [23,25,28] lack details in their specifications since they only release module-level data. As a result, their evaluation values E should be treated more conservatively. On the other hand, Reference [20] is a complete system on an FPGA board and does not release its size as a single chip, so its evaluation value E is hard to measure. Nevertheless, its data are good study material for designers who intend to build their future projects on an FPGA board for prototyping.

Some works, such as [29], use analog components and memristors to mimic neurons for CNN computing. However, none of the commercially proposed systems uses memristors. Several developers have researched memristor technology, including HP, Knowm, Inc. (Santa Fe, NM, USA), Crossbar, SK Hynix, HRL Laboratories, and Rambus. HP built the first workable memristor in 2008, yet even now it is still some distance from prototype to commercial application. Knowm, Inc. sold their fabricated memristors for experimentation purposes; again, the memristor is not intended for application in commercial products [30]. Besides, it is worth mentioning that many CPUs in smartphones contain built-in neural processing units (NPUs) or AI modules; for example, the MediaTek Helio P90, Apple A13 Bionic, Samsung Exynos 990, Huawei Kirin 990, etc. However, the detailed performance of individual NPUs or so-called AI modules in these commercial CPUs is not public. As a result, these AI modules are hard to compare with the pure AI accelerators, but it is worth keeping an eye on these commercial products to avoid missing the latest information.
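For reference, the E values quoted at the start of this subsection can be reproduced from the table entries with the helper functions sketched in Section 3.1 (illustrative only):

```python
# Eyeriss [19]: 84 GOPs, 278 mW; 4.0 mm x 4.0 mm chip and 3.5 mm x 3.5 mm core
print(evaluation_value(84.0, 0.278, 4.0 * 4.0))   # ~18.88 (chip, Table 2)
print(evaluation_value(84.0, 0.278, 3.5 * 3.5))   # ~24.66 (core, Table 2)
# 1.42TOPS/W [21]: 64 GOPs, 45 mW, 4.0 mm x 4.0 mm chip
print(evaluation_value(64.0, 0.045, 4.0 * 4.0))   # ~88.88 (Table 2)
```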

3.3. Coarse-Grained Cell Array Accelerators

Dynamically reconfigurable technology is the key feature of an edge AI hardware platform for flexibility and fault tolerance. The term 'dynamic' means that reconfiguring the platform is still possible during runtime. Generally, reconfigurable architectures can be grouped into two major types: fine-grained reconfigurable architecture (FGRA) and coarse-grained reconfigurable architecture (CGRA). FGRA contains a large amount of silicon allocated to interconnecting the logic, which implies that FGRA limits the rate of reconfiguring devices in real time due to the larger bitstreams of instructions needed. As a result, CGRA is a better solution for real-time computing.

Reference [31] presents many CGRAs and categorizes them into different categories, such as early pioneering, modern, and large CGRAs. The article includes plentiful information, and the authors also collect statistics to let readers know the developing trend of CGRA compared to GPU. However, Reference [31] does not focus on comparing the three key features of the CGRAs. To understand the performance differences between CGRAs and determine which architectures are potential candidates for edge AI accelerators, this paper presents architectures [32–42] published in the recent few years for comparison. To unite the units and reference standards used in each article, this paper consults the references and converts the various units into a standardized form according to the information revealed for each architecture in Tables 5–7. Table 5 shows the CGRAs that use 32-bit precision, while Table 6 shows the 16-bit precision CGRAs. Last but not least, Table 7 presents the CGRAs that use neither 32-bit nor 16-bit precision.

Table 5. Coarse-Grained Cell Array Accelerators.

ADRES 2017 [32]
• Computation ability: 4.4 GFlops = 26.4 GOPs (A9)
• Precision: 32-bit FP (A9)
• Power consumption: 115.6 mW
• Size: 0.64 mm2
• Evaluation value E: 356.84
• Implementation: Implemented in TSMC 28 nm technology

VERSAT 2016 [33]
• Computation ability: 1.17 GFlops = 7.02 GOPs (A9)
• Precision: 32-bit FP (A9)
• Power consumption: 44 mW
• Size: 0.4 mm2
• Evaluation value E: 398.86
• Implementation: Implemented in UMC 130 nm technology

FPCA 2014 [36]
• Computation ability: 9.1 GFlops = 54.6 GOPs (A9)
• Precision: 32-bit Fixed
• Power consumption: 12.6 mW
• Size: Xilinx Virtex6 XC6VLX240T
• Evaluation value E: –
• Implementation: Implemented in Xilinx Virtex-6 FPGA XC6VLX240T

SURE-based REDEFINE 2016 [37]
• Computation ability: 450 faces/s ≈ 201.6 GOPs (reference)
• Precision: 32-bit Fixed
• Power consumption: 1.22 W
• Size: 5.7 mm2
• Evaluation value E: 29.48
• Implementation: Implemented in 65 nm technology

Table 6. Coarse-Grained Cell Array Accelerators.

TRANSPIRE 2020 [34]
• Computation ability: 136 MOPs (binary8 benchmark)
• Precision: 8/16-bit Fixed
• Power consumption: 0.57 mW
• Size: N/A
• Evaluation value E: –
• Implementation: Implemented in STMicroelectronics 28 nm technology

DT-CGRA 2016 [38,39]
• Computation ability: 95 GOPs
• Precision: 16-bit Fixed
• Power consumption: 1.79 W
• Size: 3.79 mm2
• Evaluation value E: 14
• Implementation: Implemented in SMIC 55 nm process

Heterogeneous PULP 2018 [42]
• Computation ability: 170 MOPs
• Precision: 16-bit Fixed
• Power consumption: 0.44 mW
• Size: 0.872 mm2
• Evaluation value E: 443.08
• Implementation: Implemented in 28 nm technology

Table 7. Coarse-Grained Cell Array Accelerators.

SOFTBRAIN 2017 [35]
• Computation ability: 452 GOPs (test under 16-bit mode)
• Precision: 64-bit Fixed (DianNao)
• Power consumption: 954.4 mW
• Size: 3.76 mm2
• Evaluation value E: 125.96
• Implementation: Implemented in 55 nm technology

Lopes et al. 2017 [40]
• Computation ability: 1.03 GOPs = 1.545 GOPs
• Precision: 24-bit Fixed
• Power consumption: 1.996 mW
• Size: 0.45 mm2 (not including the buffer, memory, and control systems)
• Evaluation value E: 1720.1
• Implementation: Implemented in 90 nm technology

Eyeriss v2 2016 [41]
• Computation ability: 153.6 GOPs
• Precision: weights/iacts: 8-bit Fixed; psum: 20-bit Fixed
• Power consumption: 160 mW
• Size: 24.5 mm2 (2 times that of v1 [19])
• Evaluation value E: 39.18
• Implementation: Implemented in 65 nm technology

Some of the works do not show their computation ability in OPs, the standard reference unit for AI accelerator designers and clients. Instead, a few of them compare their computation ability with the ARM Cortex A9 [32,33,36]. For example, Reference [33] releases Versat's performance in operation cycles when running the benchmarks and shows the ARM Cortex A9 processor's ability on those benchmarks for comparison. However, the article does not show Versat's OPs. According to the results, Versat is about 2.4 times faster than the ARM Cortex A9 processor on average. Then, Reference [43] shows that the performance of the ARM Cortex A9 processor is 500 mega floating-point operations per second (MFlops).

After calculation, Versat's operation ability equals 1.17 GigaFlops (GFlops). However, GFlops is still not the preferred unit for edge AI devices' designers and clients. Based on (1), GFlops can be converted to GOPs by scaling three times. Finally, the performance of Versat is obtained, equal to 3.51 GOPs in 32-bit precision. We adopt (2) to get its cFixed16, 7.02 GOPs, as shown in Table 5, for easy comparison with other accelerators (see the sketch at the end of this subsection). Similar work is done for the rest of the architectures in Tables 5–7. As an exception, the area size of [34] cannot be found due to the lack of information. Reference [36] does its work on an FPGA, so its core size cannot be evaluated.

The computation unit used by [37] is faces per second because it targets face recognition, making the conversion of the computation unit even harder and introducing significant deviation when referencing the ability from other similar work. Table 5 shows that the computation ability of [37] is 450 faces/second, roughly equal to 201.6 GOPs [44]. In [37], the system recognizes 30 faces in a frame at a frame rate of 15 per second, which amounts to 450 recognitions performed per second. On the other hand, the reference work [44] recognizes up to 10 objects in a frame at a frame rate of 60 per second. As a result, the converted computation ability of [37] remains for reference with a certain deviation. Reference [40] has the highest evaluation value E of all the listed works. However, the revealed size of [40] covers only part of the architecture, so edge AI designers should be more conservative in assessing that architecture's specifications. Reference [42] does not release its computation ability. Since [42] shares the same architecture as [45], the operation/power ratio in [45] can serve as the reference. Furthermore, Reference [42] contains double cores and extra heterogeneous PEs compared to [45]. As a result, the overall operation/power ratio of [42] would be higher than evaluated.

Overall, the evaluation results show that [37,38,41] have tens-grade evaluation values E, between 10 and 40. References [32,33,35,42] have hundreds-grade evaluation values E, between 100 and 400. According to the CGRAs' evaluation values E, CGRAs show the potential ability to execute edge AI applications, like the outstanding prior art edge AI accelerators in Tables 2–4.
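As a worked example of the complete conversion chain, the Versat numbers above can be reproduced with the Section 3.1 helpers (a sketch; the figures come from [33,43] as quoted in the text):

```python
c_fp = 1.17                            # GFlops: ~2.34x the 500 MFlops Cortex-A9 baseline [43]
c_fixed32 = to_fixed(c_fp)             # Eq. (1): 1.17 * 3 = 3.51 GOPs at 32-bit
c_fixed16 = to_fixed16(c_fixed32, 32)  # Eq. (2): 3.51 * (32/16) = 7.02 GOPs
print(round(c_fixed16, 2))                            # 7.02, as listed in Table 5
print(round(evaluation_value(c_fixed16, 0.044, 0.4), 2))  # ~398.86, Versat's E in Table 5
```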

3.4. Implementation Technology

The implementation method of each edge AI accelerator has been shown in Tables 2–7. Most of the prior art edge AI accelerators in Tables 2–4 have been commercialized and taped out as ASIC chips, such as [9,16,22,23,25,26,28], which can easily be found online for sale in different system packages. Commonly, the system packages adopted by the systems are developing boards or USB sticks; examples are in Tables 2–4. On the academic side, References [19,21] are also built as ASICs, but [20] is implemented in a single FPGA chip. When it comes to the CGRA accelerators in Tables 5–7, although most of them are not taped out as physical chips, they have been synthesized against an ASIC standard library. The technologies they use are organized in Tables 5–7.

From the implementation information in Tables 2–7, it can be noticed that FPGA and application-specific integrated circuits (ASIC) are the most used approaches to implement edge AI accelerators, including CGRA-based ones, for their customizability and low power consumption. The non-recurring engineering (NRE) expense and flexibility of ASIC are high and low, respectively, compared to FPGA. As a result, building a system as an ASIC has a higher cost than an FPGA when the product volume is small; moreover, the development time of ASIC is longer than that of FPGA. At the beginning of the system development process, FPGA-based platforms are the better solution due to their high throughput, reasonable price, low power consumption, and reconfigurability [46]. Accordingly, at the prototyping stage, building the future AI platform design on a suitable FPGA platform at the system level [47] is suggested.
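The NRE trade-off can be made concrete with a simple break-even estimate. All cost figures below are invented placeholders purely for illustration; real NRE and unit costs vary enormously with technology node, tooling, and volume:

```python
# Hypothetical costs: an ASIC has a large one-off NRE but a low unit cost,
# while an FPGA-based build has (near-)zero NRE but a higher per-unit cost.
ASIC_NRE, ASIC_UNIT = 2_000_000.0, 10.0    # placeholder dollars
FPGA_NRE, FPGA_UNIT = 0.0, 120.0           # placeholder dollars

def total_cost(nre: float, unit: float, volume: int) -> float:
    """Total production cost for a given volume."""
    return nre + unit * volume

# Break-even volume: below this, the FPGA route is cheaper.
break_even = (ASIC_NRE - FPGA_NRE) / (FPGA_UNIT - ASIC_UNIT)
print(int(break_even))   # ~18,181 units under these placeholder numbers
```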

4. Architecture Analysis and Design Direction

Figure 1 organizes the accelerators whose evaluation value E is in the grade of tens or hundreds from Tables 2–7. References [20,34,36,40] are not included in Figure 1 because

they lack the area data at the chip level. The power consumption data of [23,25,28] are only released at module level, so, to be fair, they are not listed in Figure 1 either. The yellow area in Figure 1 represents the area size of each accelerator. Every bar in Figure 1 is composed of two parts, upper and lower, in blue and orange, respectively. The upper part in blue represents the GOPs/mm2; the lower orange part represents the mW/mm2. The two lines, the upper one and the lower one in Figure 1, represent the evaluation value E and the ratio of GOPs to mW, respectively.

To focus on accelerators targeting ultra-small areas, the accelerators whose area size is below 10 mm2 are selected, and their three key features are presented in Figure 2a. In Figure 2a, each accelerator contains two bars. The left orange one represents the power consumption in mW, while the right blue one represents the computation ability in GOPs. According to Figure 2a, the accelerators can be grouped into two categories by area size: units' grade and decimal grade. In the units' grade group are [9,35,37,38]. On the other hand, Refs. [32,33,42] belong to the decimal grade group.

In the units' grade group, References [9,35] share a similar GOPs/mW ratio, while [37] has a relatively lower ratio and [38] has the lowest. The following analyzes this result. Although [37] has computation ability close to about 1.3 times that of [9] and 0.45 times that of [35], it consumes too much power, nearly 3.5 times that of [9] and 1.3 times that of [35]. When it comes to [38], its computation ability almost reaches a hundred GOPs but with huge power consumption, even higher than [37]. As a result, References [9,35] have the better balance of computation ability and power consumption in the units' mm2 area grade. It is interesting to note that the area size and the GOPs/mW ratio correlate positively in the decimal grade group. Reference [42] has a relatively large area size compared to its computation ability. The next paragraph introduces the analysis in more detail.

FigureFigureFigure 1.1. PowerPower consumption consumption andand operationsoperations operations perper per areaarea area statisticsstatistics statistics...

Figure 2. (a) Three key features statistics (accelerators under 10 mm2) and (b) the statistics normalized to the same grade of GOPs.



Figure 2b shows the normalized three key features of the accelerators in Figure 2a. The three key features of the accelerators are normalized to the same grade of computation ability by scaling up [32,33,38,42] and scaling down [35,37] linearly [48] (see the sketch below). The result shows that except for [38,42], the remaining five accelerators have a similar trend in power consumption and area size. After normalization, the result emphasizes the unsatisfactory performance of [38,42] for low-power edge AI devices. Reference [38] consumes too much power, while [42] has too big an area compared to its computation ability. However, if the targeting application requires ultra-low power consumption and can accept hundred-grade MOPs, Reference [42] is a good choice. Overall, a trade-off between computation ability and power consumption can be made once the architecture size has been chosen. Designers can set an accelerator's specifications according to its targeted application. As a result, if

designers want to design an architecture for an edge AI accelerator in an ultra-small area (units' mm2 area size), the power consumption and operation ability should be in the order of hundreds of mWs and GOPs, respectively.

In summary, training on chips in edge AI accelerators will be a popular research topic. Distributed computing technology, 5G or future telecommunication protocols, data encryption for training on chips, cross-platform data exchange protocols, and harsh environment tolerance will be involved in the development path of edge AI accelerators in the future.

Figure 3 shows the scenario of the AI applications from the original to the future. It also illustrates the model of the AI application distribution in the yellow and purple ovals, from data-center-based cloud AI to edge AI, mentioned as the three hierarchies. Figure 4 shows the legends of Figure 3.
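Returning to Figure 2b, the normalization can be sketched as follows. We assume here that power and area are scaled by the same linear factor that brings the computation ability to a common grade; [48] discusses the linear scaling assumption:

```python
def normalize_to_grade(gops: float, power_mw: float, area_mm2: float,
                       target_gops: float = 100.0):
    """Linearly scale an accelerator's three key features to a common grade."""
    k = target_gops / gops
    return target_gops, power_mw * k, area_mm2 * k

# e.g., scaling DT-CGRA [38] (95 GOPs, 1790 mW, 3.79 mm^2) to the 100 GOPs grade
print(normalize_to_grade(95.0, 1790.0, 3.79))   # (100.0, ~1884 mW, ~3.99 mm^2)
```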


Figure 3. The scenario of AI applications on hardware platforms. (Figure legends can be found in Figure 4.)


[Figure 4 legend entries: original, current, and future trends; cloud AI / edge-cloud coordination AI; edge AI with local inference; edge AI with local inference and training; received data, inference result, and training data transmission paths (actual and possible); partial weights/bias sharing for updating unexpected cases; pre-training data path; partial training data transmission; devices circled may be designed to work in radiation fields.]

Figure 4. Figure legends of Figure 3.

The history of developing data-center-based AI in the first hierarchy can be traced back to the 1980s. At that time, AI systems were developed to solve several specific problems with the decision-making ability of a human expert. This was achieved by if-then rules instead of conventional procedural code, composed of an inference engine and a knowledge base unit. The idea is like the AI technology we are using nowadays; rather than a knowledge base, we use big data and deep learning. Several modern database AI achievements are well known; examples are Deep Blue (a chess computer, 1996) and AlphaGO (a board game Go program, 2014). The significant technology gap between Deep Blue and AlphaGO is the deep learning neural network, which can handle the larger branching factor of Go games.

Nowadays, much different information is collected by internet browsers, databases, and sensors through big data technology. Massively collected data is good material for data training. According to the different applications, the trained weights are different. Trained weights are stored in cloud systems and wait for unjudged data to come in. A specific application-oriented AI system uses a suitable deep learning neural network algorithm to infer the result, then transmits the result back to the requesting system. A cloud-based data center AI system is powerful due to its excellent hardware, such as supercomputers or high-end GPUs, but it is hard for ordinary consumers to reach directly. Thanks to the well-developed internet and personal consumer electronics, AI technology can benefit the general public. As a result, edge AI is a popular topic.

The current development status of edge AI can be seen in both the yellow and purple ovals in Figure 3. The yellow oval shows that the data center AI system is connected with several edge AI devices through internet infrastructures such as routers and cell towers. Some edge AI systems need to work with intermediary systems such as a server or a PC to process data before transmitting data to, or receiving data from, the cloud. This type of edge AI system, which requires a connection to the data center, is called an edge-cloud coordination system; such systems are currently widespread and well known, e.g., virtual assistants such as Siri.

In Figure 3, the face recognition door lock can send the unjudged data, i.e., a face picture, to the cloud for processing and get the inference result. However, in reality, this method is challenged by safety concerns. As a result, systems like door locks that require high-standard security usually adopt edge AI systems instead of edge-cloud coordination systems. Because of the local inference feature, edge AI systems can process the data without an internet connection. The local inference feature is a big move for a device to promise data privacy, security, real-time processing, and operation under no internet coverage. This paper has introduced several edge AI accelerators that target this field to satisfy the demand for edge AI devices.
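As a sketch of this distinction, the fragment below contrasts the two system types: an edge-cloud coordination system forwards unjudged data to the data center, while a pure edge AI system infers locally and keeps working offline. The function names are hypothetical placeholders, not an API from any of the surveyed accelerators.

```python
# Minimal sketch contrasting edge AI with edge-cloud coordination.
# run_local_cnn() and send_to_cloud() are hypothetical stand-ins for
# on-chip CNN inference and the round trip to the data-center AI.

def run_local_cnn(frame):
    return "local-result"   # placeholder for on-device inference

def send_to_cloud(frame):
    return "cloud-result"   # placeholder for the cloud round trip

def classify(frame, network_up, security_critical):
    # A door lock (security-critical) always infers locally, keeping the
    # face picture on the device; other devices may defer to the cloud.
    if security_critical or not network_up:
        return run_local_cnn(frame)  # edge AI: private, real-time, offline-capable
    return send_to_cloud(frame)      # edge-cloud coordination system

print(classify(b"face-picture", network_up=True, security_critical=True))   # local
print(classify(b"sensor-frame", network_up=True, security_critical=False))  # cloud
```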


The latest-generation telecommunication standard, e.g., 5G, can be a valuable technology in future edge AI development when it comes to training on chips. Training on chips might require a distributed learning scheme, which is achieved by the Internet of Things (IoT) through 5G or future communication protocols. The idea is shown as the orange oval illustrated in Figure 3. For example, autonomous cars contain weights pre-trained during development to allow their AI system to make decisions for self-driving. The car manufacturer Tesla has reached epochal milestones in this area. However, pre-trained data might have flaws when the environment changes to an unexpected scenario; several non-fatal crashes are reported in [49]. The idea of training on chips is that when a human driver handles a situation the system is not aware of but a human can manage, the system can share that information with other compatible systems for training on chips. Training on chips can adopt distributed computing technology to reduce the computation. The future trend of this research is how to distribute the data for training and how to share the weights between compatible systems, which might demand 5G or a future telecommunication protocol to implement massive data exchange.

Training on chips also encounters another challenge. As shown in Figure 3, edge AI systems across different unmanned shops may face several significant issues that people care about, namely personal data utilization and a standard platform/protocol for data exchange. Although personal data utilization may sound beyond the technical design level of edge AI accelerators, it is suggested that accelerator designers incorporate a data encryption algorithm for on-chip training data at the design level. Another significant topic is the protocol for exchanging training data: applying training on chips across universal edge AI devices and systems is needed to maximize its benefit. The answer might be a cross-platform protocol, which will require coordination among designers. Furthermore, there is active research on edge AI systems targeting the harsh environments mentioned in the introduction; the scenarios of a UAV working in a nuclear power station and of a remote sensing satellite in Figure 3 are examples. These scenarios should be kept in mind when designing future edge AI accelerators.

In summary, training on chips in edge AI accelerators will be a popular research topic. Distributed computing technology, 5G or future telecommunication protocols, data encryption for training on chips, cross-platform data exchange protocols, and harsh environment tolerance will all be involved in the development path of future edge AI accelerators.
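As a rough illustration of the weight-sharing idea, the sketch below merges partial weight updates contributed by compatible peer systems, in the spirit of federated-style averaging. The aggregation rule, the dictionary-based weight format, and all names here are our own assumptions for illustration, not a protocol defined in the surveyed articles; in practice the shared updates would also be encrypted in transit, as suggested above.

```python
# Minimal sketch of partial weight sharing between compatible systems,
# in the spirit of federated averaging. The update rule is an assumed
# illustration, not a protocol from the surveyed work.

def merge_partial_updates(base, peer_updates):
    """Average the weight deltas shared by peers that have locally
    trained on unexpected cases; weights absent from a peer's partial
    update are left untouched."""
    merged = dict(base)
    for key in base:
        deltas = [u[key] - base[key] for u in peer_updates if key in u]
        if deltas:
            merged[key] = base[key] + sum(deltas) / len(deltas)
    return merged

# Two cars report slightly different corrections for the same weight.
base_weights = {"conv1.w0": 0.50, "conv1.w1": -0.20}
peers = [{"conv1.w0": 0.54}, {"conv1.w0": 0.58}]
print(merge_partial_updates(base_weights, peers))
# conv1.w0 moves to ~0.56; conv1.w1 is unchanged
```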

5. Conclusions and Future Works

This paper has presented a survey of up-to-date edge AI accelerators and CGRA accelerators that can be applied to image recognition systems and has introduced the evaluation value E for both edge AI accelerators and CGRAs. CGRA architectures meet the evaluation value E of the existing prior art edge AI accelerators, which implies the potential suitability of CGRA architectures for running edge AI applications. The result reveals that the evaluation values E of prior art edge AI accelerators and CGRAs lie between tens and four hundred, indicating that future edge AI accelerator designs should meet this grade. Overall, the analysis shows that for future ultra-small-area (under 10 mm2) accelerator designs, power consumption and operation ability should be in the order of hundreds of mWs and GOPs, respectively.

As edge devices find their way into various applications, such as monitoring natural hazards with UAVs, detecting radiation leakage after nuclear disasters with robotics, and remote sensing in space with satellites, these applied fields are more critical than usual. Many research articles have targeted fault tolerance features for edge AI accelerators in recent years, which indicates that the trend for edge AI accelerators is resilience in terms of reliability and applicability in high-radiation fields. Finally, we illustrate the current status of edge AI applications with a future vision, which addresses the technologies involved in the future design of edge AI accelerators and their challenges.
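To make the stated design trend operational, the following helper screens a candidate design against the envelope suggested by this analysis. The numeric bounds are our own illustrative reading of "ultra-small area (under 10 mm2)" and "the order of hundreds of mWs and GOPs"; they are not thresholds defined in the survey.

```python
# Minimal sketch: check a candidate accelerator against the design-trend
# envelope suggested by the analysis. Bounds are illustrative assumptions.

def meets_design_trend(area_mm2, power_mw, gops):
    ultra_small = area_mm2 < 10.0          # "ultra-small area" reading
    power_ok = 100.0 <= power_mw < 1000.0  # order of hundreds of mW
    ops_ok = 100.0 <= gops < 1000.0        # order of hundreds of GOPs
    return ultra_small and power_ok and ops_ok

print(meets_design_trend(area_mm2=6.0, power_mw=450.0, gops=300.0))   # True
print(meets_design_trend(area_mm2=18.0, power_mw=450.0, gops=300.0))  # False
```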

Author Contributions: Conceptualization, W.L., A.A. and T.A.; methodology, W.L.; formal analysis, W.L.; investigation, W.L.; resources, W.L. and T.A.; data curation, W.L.; writing—original draft preparation, W.L.; writing—review and editing, W.L. and T.A.; visualization, W.L.; supervision, T.A. All authors have read and agreed to the published version of the manuscript.

Funding: This research received no external funding.

Data Availability Statement: The original contributions presented in the study are included in the article, and further inquiries can be directed to the corresponding author.

Conflicts of Interest: The authors declare no conflict of interest.

References
1. Hoang, L.-H.; Hanif, M.A.; Shafique, M. FT-ClipAct: Resilience Analysis of Deep Neural Networks and Improving their Fault Tolerance using Clipped Activation. In Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 9–13 March 2020.
2. Zhang, J.J.; Gu, T.; Basu, K.; Garg, S. Analyzing and mitigating the impact of permanent faults on a systolic array based neural network accelerator. In Proceedings of the 2018 IEEE 36th VLSI Test Symposium (VTS), San Francisco, CA, USA, 22–25 April 2018.
3. Hanif, M.A.; Shafique, M. Dependable Deep Learning: Towards Cost-Efficient Resilience of Deep Neural Network Accelerators against Soft Errors and Permanent Faults. In Proceedings of the 2020 IEEE 26th International Symposium on On-Line Testing and Robust System Design (IOLTS), Napoli, Italy, 13–15 July 2020.
4. Yasoubi, A.; Hojabr, R.; Modarressi, M. Power-Efficient Accelerator Design for Neural Networks Using Computation Reuse. IEEE Comput. Archit. Lett. 2017, 16, 72–75.
5. Venkataramani, S.; Ranjan, A.; Roy, K.; Raghunathan, A. AxNN: Energy-efficient neuromorphic systems using approximate computing. In Proceedings of the 2014 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), La Jolla, CA, USA, 11–13 August 2014.
6. You, Z.; Wei, S.; Wu, H.; Deng, N.; Chang, M.-F.; Chen, A.; Chen, Y.; Cheng, K.-T.T.; Hu, X.S.; Liu, Y.; et al. White Paper on AI Chip Technologies; Tsinghua University and Beijing Innovation Centre for Future Chips: Beijing, China, 2018.
7. Montaqim, A. Top 25 AI Chip Companies: A Macro Step Change Inferred from the Micro Scale. Robotics and Automation News, 2019. Available online: https://roboticsandautomationnews.com/2019/05/24/top-25-ai-chip-companies-a-macro-step-change-on-the-micro-scale/22704/ (accessed on 4 May 2021).
8. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2015, arXiv:1409.1556.
9. Du, L.; Du, Y.; Li, Y.; Su, J.; Kua, Y.-C.; Liu, C.-C.; Chang, M.C.F. A Reconfigurable Streaming Deep Convolutional Neural Network Accelerator for Internet of Things. IEEE Trans. Circuits Syst. 2018, 65, 198–208. [CrossRef]
10. Clark, C.; Logan, R. Power Budgets for Mission Success; Clyde Space Ltd.: Glasgow, UK, 2011. Available online: http://mstl.atl.calpoly.edu/~workshop/archive/2011/Spring/Day%203/1610%20-%20Clark%20-%20Power%20Budgets%20for%20CubeSat%20Mission%20Success.pdf (accessed on 4 May 2021).
11. Yazdanbakhsh, A.; Park, J.; Sharma, H.; Lotfi-Kamran, P.; Esmaeilzadeh, H. Neural acceleration for GPU throughput processors. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO), Waikiki, HI, USA, 5–9 December 2015.
12. Hinton, G.; Deng, L.; Yu, D.; Dahl, G.E.; Mohamed, A.-R.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.N.; et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag. 2012, 29, 82–97. [CrossRef]
13. Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, New York, NY, USA, 3–7 November 2014.
14. Vasudevan, A.; Anderson, A.; Gregg, D. Parallel multi channel convolution using general matrix multiplication. In Proceedings of the 2017 IEEE 28th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), Seattle, WA, USA, 10 July 2017.
15. Guo, K.; Sui, L.; Qiu, J.; Yu, J.; Wang, J.; Yao, S.; Han, S.; Wang, Y.; Yang, H. Angel-eye: A complete design flow for mapping CNN onto embedded FPGA. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2018, 37, 35–47. [CrossRef]
16. Gyrfalcon Technology Inc. (GTI). Lightspeeur® 2801S. Available online: https://www.gyrfalcontech.ai/solutions/2801s/ (accessed on 4 May 2021).
17. Farahini, N.; Li, S.; Tajammul, M.A.; Shami, M.A.; Chen, G.; Hemani, A.; Ye, W. 39.9 GOPs/watt multi-mode CGRA accelerator for a multi-standard basestation. In Proceedings of the 2013 IEEE International Symposium on Circuits and Systems (ISCAS), Beijing, China, 19–23 May 2013.
18. Abdelfattah, A.; Anzt, H.; Boman, E.G.; Carson, E.; Cojean, T.; Dongarra, J.; Gates, M.; Grützmacher, T.; Higham, N.J.; Li, S.; et al. A survey of numerical methods utilizing mixed precision arithmetic. arXiv 2020, arXiv:2007.06674.

19. Chen, Y.-H.; Krishna, T.; Emer, J.S.; Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid State Circuits 2017, 52, 262–263. [CrossRef]
20. Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015.
21. Sim, J.; Park, J.-S.; Kim, M.; Bae, D.; Choi, Y.; Kim, L.-S. A 1.42TOPS/W deep convolutional neural network recognition processor for intelligent IoE systems. In Proceedings of the 2016 IEEE International Solid-State Circuits Conference, San Francisco, CA, USA, 31 January–4 February 2016.
22. Oh, N. Intel Announces Movidius Myriad X VPU, Featuring 'Neural Compute Engine'. AnandTech, 2017. Available online: https://www.anandtech.com/show/11771/intel-announces-movidius-myriad-x-vpu (accessed on 4 May 2021).
23. NVIDIA. Jetson Nano. Available online: https://developer.nvidia.com/embedded/develop/hardware (accessed on 4 May 2021).
24. Wikipedia. Tegra. Available online: https://en.wikipedia.org/wiki/Tegra#cite_note-103 (accessed on 4 May 2021).
25. Toybrick. TB-RK1808M0. Available online: http://t.rock-chips.com/portal.php?mod=view&aid=33 (accessed on 5 May 2021).
26. Coral. USB Accelerator. Available online: https://coral.ai/products/accelerator/ (accessed on 5 May 2021).
27. DIY MAKER. Google Coral Edge TPU. Available online: https://s.fanpiece.com/SmVAxcY (accessed on 5 May 2021).
28. Texas Instruments. AM5729 Sitara Processor. Available online: https://www.ti.com/product/AM5729 (accessed on 5 May 2021).
29. Shafiee, A.; Nag, A.; Muralimanohar, N.; Balasubramonian, R.; Strachan, J.P.; Hu, M.; Williams, R.S.; Srikumar, V. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, Korea, 18–22 June 2016.
30. Arensman, R. Despite HP's Delays, Memristors Are Now Available. Electronics 360, 2016. Available online: https://electronics360.globalspec.com/article/6389/despite-hp-s-delays-memristors-are-now-available (accessed on 4 May 2021).
31. Podobas, A.; Sano, K.; Matsuoka, S. A Survey on Coarse-Grained Reconfigurable Architectures from a Performance Perspective. IEEE Access 2020, 8, 146719–146743. [CrossRef]
32. Karunaratne, M.; Mohite, A.K.; Mitra, T.; Peh, L.-S. HyCUBE: A CGRA with Reconfigurable Single-cycle Multi-hop Interconnect. In Proceedings of the 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), Austin, TX, USA, 18–22 June 2017.
33. Lopes, J.D.; de Sousa, J.T. Versat, a Minimal Coarse-Grain Reconfigurable Array. In Proceedings of the International Conference on Vector and Parallel Processing, Porto, Portugal, 28–30 June 2016.
34. Prasad, R.; Das, S.; Martin, K.; Tagliavini, G.; Coussy, P.; Benini, L.; Rossi, D. TRANSPIRE: An energy-efficient TRANSprecision floating-point Programmable archItectuRE. In Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 9–13 March 2020.
35. Nowatzki, T.; Gangadhar, V.; Ardalani, N.; Sankaralingam, K. Stream-Dataflow Acceleration. In Proceedings of the 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, ON, Canada, 24–28 June 2017.
36. Cong, J.; Huang, H.; Ma, C.; Xiao, B.; Zhou, P. A Fully Pipelined and Dynamically Composable Architecture of CGRA. In Proceedings of the 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines, Boston, MA, USA, 11–13 May 2014.
37. Mahale, G.; Mahale, H.; Nandy, S.K.; Narayan, R. REFRESH: REDEFINE for Face Recognition Using SURE Homogeneous Cores. IEEE Trans. Parallel Distrib. Syst. 2016, 27, 3602–3616. [CrossRef]
38. Fan, X.; Li, H.; Cao, W.; Wang, L. DT-CGRA: Dual-Track Coarse Grained Reconfigurable Architecture for Stream Applications. In Proceedings of the 2016 26th International Conference on Field Programmable Logic and Applications (FPL), Lausanne, Switzerland, 29 August–2 September 2016.
39. Fan, X.; Wu, D.; Cao, W.; Luk, W.; Wang, L. Stream Processing Dual-Track CGRA for Object Inference. IEEE Trans. Very Large Scale Integr. Syst. 2018, 26, 1098–1111. [CrossRef]
40. Lopes, J.; Sousa, D.; Ferreira, J.C. Evaluation of CGRA architecture for real-time processing of biological signals on wearable devices. In Proceedings of the 2017 International Conference on ReConFigurable Computing and FPGAs (ReConFig), Cancun, Mexico, 4–6 December 2017.
41. Chen, Y.-H.; Yang, T.-J.; Emer, J.; Sze, V. Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices. IEEE J. Emerg. Sel. Topics Circuits Syst. 2019, 9, 292–308. [CrossRef]
42. Das, S.; Martin, K.J.; Coussy, P.; Rossi, D. A Heterogeneous Cluster with Reconfigurable Accelerator for Energy Efficient Near-Sensor Data Analytics. In Proceedings of the 2018 IEEE International Symposium on Circuits and Systems (ISCAS), Florence, Italy, 27–30 May 2018.
43. Nikolskiy, V.; Stegailov, V. Floating-point performance of ARM cores and their efficiency in classical molecular dynamics. J. Phys. Conf. Ser. 2016, 681, 012049. [CrossRef]
44. Kim, J.Y.; Kim, M.; Lee, S.; Oh, J.; Kim, K.; Yoo, H. A 201.4 GOPS 496 mW Real-Time Multi-Object Recognition Processor with Bio-Inspired Neural Perception Engine. IEEE J. Solid-State Circuits 2010, 45, 32–45. [CrossRef]
45. Gautschi, M.; Schiavone, P.D.; Traber, A.; Loi, I.; Pullini, A.; Rossi, D.; Flamand, E.; Gürkaynak, F.K.; Benini, L. Near-Threshold RISC-V Core with DSP Extensions for Scalable IoT Endpoint Devices. IEEE Trans. Very Large Scale Integr. Syst. 2017, 25, 2700–2713. [CrossRef]

46. Shawahna, A.; Sait, S.M.; El-Maleh, A. FPGA-Based Accelerators of Deep Learning Networks for Learning and Classification: A Review. IEEE Access 2019, 7, 7823–7859. [CrossRef]
47. Lavagno, L.; Sangiovanni-Vincentelli, A. System-level design models and implementation techniques. In Proceedings of the 1998 International Conference on Application of Concurrency to System Design, Fukushima, Japan, 23–26 March 1998.
48. Takouna, I.; Dawoud, W.; Meinel, C. Accurate Multicore Processor Power Models for Power-Aware Resource Management. In Proceedings of the 2011 IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing, Sydney, NSW, Australia, 12–14 December 2011.
49. Wikipedia. Tesla Autopilot. Available online: https://en.wikipedia.org/wiki/Tesla_Autopilot#Hardware_3 (accessed on 6 August 2021).