

LETTER

A Proposal for Enhancing Training Speed in Models Based on Memory Activity Survey

Dang Tuan Kiet1a), Binh Kieu-Do-Nguyen1b), Trong-Thuc Hoang1c), Khai-Duy Nguyen1d), Xuan-Tu Tran2e), and Cong-Kha Pham1f)

Abstract  The Deep Learning (DL) training process involves intensive computations that require a large number of memory accesses. There are many surveys on memory behaviors during DL training; they use well-known profiling tools, or improve the existing tools, to monitor the training processes. This paper presents a new profiling approach using a cooperative software-hardware solution. The idea is to use Field-Programmable-Gate-Array (FPGA) memory as the main memory for the DL training processes on a computer. Then, the memory behaviors from both the software and hardware points of view can be monitored and evaluated. The most common DL models are selected for the tests, including ResNet, VGG, AlexNet, and GoogLeNet. The CIFAR-10 dataset is chosen as the training database. The experimental results show that the ratio between read and write transactions is roughly 3 to 1. The requested allocations vary from 2 B to 64 MB, with the most frequently requested sizes being approximately 16 KB to 64 KB. Based on the statistics, a suggestion is made to improve the training speed using an L4 cache for the Double-Data-Rate (DDR) memory. It is demonstrated that our recommended L4 cache configuration can improve the DDR performance by about 15% to 18%.

key words: Deep Learning, Memory, Survey, Training Speed.
Classification: Integrated circuits (logic)

1 University of Electro-Communications (UEC), 1-5-1 Chofugaoka, Chofu-shi, Tokyo 182-8585, Japan
2 The Information Technology Institute (VNU-ITI), 144 Xuan Thuy road, Cau Giay dist., Hanoi, Vietnam
a) [email protected]  b) [email protected]  c) [email protected]  d) [email protected]  e) [email protected]  f) [email protected]
DOI: 10.1587/elex.18.20210252
Received June 08, 2021; Accepted July 01, 2021; Publicized July 09, 2021
Copyright © 2021 The Institute of Electronics, Information and Communication Engineers

1. Introduction

Deep Learning (DL) is widely applied to reform multiple aspects of today's world (e.g., image recognition, cognitive assistance, speech recognition, gene analysis, street navigation, etc.). DL application development generally consists of two steps, i.e., training and inference. The central issue of inference is now less severe because recent developments in Field-Programmable-Gate-Arrays (FPGAs), Graphics-Processing-Units (GPUs), and Tensor-Processing-Units (TPUs) have made it more potent on small platforms such as mobile devices or embedded systems [1]. However, this is not the case for the training process. Training is still a data-intensive challenge for many computing systems; it is characterized by millions of parameters with billions of data transactions to compute. Therefore, besides improving the training algorithms or models, memory access enhancement is another approach to speed up such a process.

DL training is a data-intensive process that requires millions to billions of read/write transactions. For example, the configurable parameters during the training processes of VGG-19 and ResNet-152 are already 144 and 57 million [2], respectively. To interact with those multi-million parameters, the number of read/write transactions on the main memory can quickly reach the scale of multi-billion. Such an enormous number puts high pressure on the main memory, which tends to be the system's bottleneck. Many profiling tools have been proposed to evaluate the data usage of DL training processes, such as Nvprof [3], Tegrastats [4], and the TensorFlow profiler [5]. Although the statistics collected by these tools can help in many cases [2, 6, 7, 8], they hardly, or even cannot, provide a complete view at the physical level, which represents the actual read/write transactions of the Random-Access-Memory (RAM). Furthermore, modern Central-Processing-Units (CPUs) only communicate with the main memory via a complex bus system with many layers of caches. Thus, the memory's actual behaviors are often hidden from the software point of view, sometimes even intentionally for security reasons.

To overcome this issue, a cooperative software-hardware solution is presented in this paper. The idea is to run the training processes of several common DL models on a computer without using the computer's main RAM. Instead, an FPGA board having a Double-Data-Rate (DDR) memory and a Peripheral-Component-Interconnect-express (PCIe) connection is used to emulate the computer's main RAM. Furthermore, in the FPGA, a monitor module is designed to collect the valuable memory data at the physical level. To be specific, three steps are performed in this paper to provide an effective and appropriate suggestion on improving memory access. First, we determine the most critical aspects that need to be evaluated in memory activities: the peak allocated memory for a training loop, the smallest and the biggest requested sizes in a single allocation, the most and the second-most frequent requested sizes of the allocations, the number of read/write transactions, and the total length of read/write data. Secondly, a cooperative software-hardware system is designed to collect the necessary data at both the abstract (software) and physical (hardware) levels. On the software side, the DL applications are modified to control the hardware and collect a part of the statistical data. On the hardware side, a monitor module is developed to monitor and collect the DDR behaviors at the physical level. Furthermore, the monitor module is designed not to interfere with the DDR bandwidth; in other words, we make sure that the bottleneck of the whole system is not the monitor module itself. Finally, based on the collected data, we perform an analysis and suggest practical solutions to improve memory access performance.

The remainder of this paper is organized as follows. Section 2 briefly reviews the most common Convolutional Neural Networks (CNNs) and datasets used in DL applications. Section 3 explains the memory activity survey method. Section 4 shows the experimental results. Section 5 suggests some remedies to enhance the DDR usage performance. Finally, Section 6 concludes the work.

2. Common Deep Learning Models and Datasets

One of the most common tasks encompassed by object recognition is image classification, which involves predicting the type of an object in images. So far, the CNN, a class of DL neural networks, is the best learning algorithm for understanding image content. It has shown exemplary performance in various image processing applications such as segmentation, classification, detection, and retrieval-related tasks [9]. A typical CNN is constructed from multiple layers such as the Convolutional layer (Conv), Rectified Linear Unit (ReLU), pooling layer, Batch Normalization (BN), dropout function, and Fully-Connected layer (FC). Some of these primary layers can be grouped to form a block. Generally, a DL model consists of distinct layers and building blocks stacked in many orders to form various architectures for various purposes [10].

The DL models used for the survey in this paper are among the most common ones: ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152 [11], AlexNet [12], VGG-16, VGG-19 [13], and GoogLeNet [14]. For the image dataset, many datasets are freely available and frequently used in recent publications; several common ones are ImageNet [15], CIFAR-10, CIFAR-100 [16], and MNIST [17]. The CIFAR-10 dataset is chosen for the tests in this paper due to its relatively small size. As for the DL framework, the most popular ones currently are TensorFlow [18], PyTorch [19], Keras [20], and OpenNN [21]. Due to the later integration with the FPGA's PCIe driver, which requires a C/C++ environment, a DL framework with a C/C++ library and functions is preferred. Among those mentioned above, only TensorFlow, PyTorch, and OpenNN use a C/C++ core library. However, TensorFlow is heavily Python-oriented, and OpenNN uses third-party libraries for the main computations. As a result, PyTorch is the most suitable one for the implementation in this paper. Moreover, its C/C++ core library can achieve high-performance processing [22].

3. Proposed Memory Activity Survey Method

3.1 Hardware Implementation

Fig. 1 shows the overview of the designed monitoring system. In the proposed system, the hardware implementation on the FPGA is used to emulate the host computer's main memory. The chosen FPGA board needs to have high-speed DDR RAM and a PCIe connection. The FPGA board is connected to the host computer via PCIe, and the PCIe driver lets the software on the host computer recognize the FPGA as RAM.
Therefore, the DL programs on the host computer can run while using the FPGA's DDR memory. In the FPGA, a monitor module is developed to record the memory usage behaviors at the physical level. Furthermore, the software on the host computer is also modified to control the hardware and to read the statistical data back from the FPGA afterward.

Fig. 1. The system overview.

Fig. 2 gives the design idea of the monitor module. As shown in the figure, there are four main components: probe, controller, collectors, and storage. The probe, also shown in Fig. 1, records the signal activities of the RAM controller. The controller part in the monitor module then analyzes the received data and pushes it to the corresponding collector. The collectors keep track of the RAM's behaviors, such as the number of read/write cycles, the number of bytes (or requested sizes), and the total number of read/write transactions. Finally, all the recorded data are mapped to a storage that can be read back by the host computer.

Fig. 2. The monitor module's architecture.

The FPGA board used in this paper is the TR5 FPGA development kit with a Stratix V 5SGXEA7 chip. The board supports PCIe Gen3×4 with a theoretical bandwidth of 8×4 = 32 Gigabits per second (Gbps). The Dual In-line Memory Module (DIMM) RAM used is a one-Gigabyte (GB) DDR3-800 with a theoretical bandwidth of 6,400 Megabytes per second (MBps).
In addition, the FPGA vendor's DDR memory controller hard-IP was used in the implementation to guarantee memory soundness. The hard-IP handles the DDR I/Os at an 800-MHz clock frequency and then returns the transactions to the inner bus at a fixed clock of 250 MHz. Therefore, the monitor module was designed to run at a 250-MHz clock frequency; 250 MHz is also the operating speed of the PCIe hard-IP. The total cost of the hardware implementation is 14,482 Adaptive Logic Modules (ALMs), which is about 61.7% of the FPGA resources; the monitor module alone costs 2,066 ALMs, about 8.8% of the FPGA resources. Finally, a simple test was conducted to measure the system transfer rate; it achieved 31.44 Gbps, which is close to the theoretical bandwidth of 32 Gbps.
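Since the collectors' results are mapped to a storage block that the host can read back over PCIe, the readback itself reduces to a memory-mapped read. The sketch below illustrates this from user space; note that the device node /dev/fpga_monitor, the 4-KB window, and the register layout are hypothetical stand-ins, as the paper does not specify the driver's interface.

```cpp
// Minimal host-side sketch: map the monitor module's storage block over
// PCIe and print the collector counters. Device node and register
// offsets are hypothetical; 64-bit counters are assumed.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

int main() {
  int fd = open("/dev/fpga_monitor", O_RDONLY);  // hypothetical device node
  if (fd < 0) return 1;

  // Map the monitor's storage block (assumed 4 KB) into user space.
  volatile uint64_t* regs = (volatile uint64_t*)
      mmap(nullptr, 4096, PROT_READ, MAP_SHARED, fd, 0);
  if (regs == MAP_FAILED) { close(fd); return 1; }

  // Assumed layout: [0] read transactions, [1] write transactions,
  // [2] total read bytes, [3] total write bytes.
  printf("reads:  %llu transactions, %llu bytes\n",
         (unsigned long long)regs[0], (unsigned long long)regs[2]);
  printf("writes: %llu transactions, %llu bytes\n",
         (unsigned long long)regs[1], (unsigned long long)regs[3]);

  munmap((void*)regs, 4096);
  close(fd);
  return 0;
}
```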

3.2 Software Implementation

Two separate programs are developed on the host computer; one uses the host's memory, and the other uses the FPGA's memory. Both programs use the same DL training procedure, which utilizes PyTorch [19] for its core functions. Besides PyTorch, the OpenCV library [23] is also used to read the CIFAR-10 [16] image dataset. In addition, PyTorch provides a pure C++ interface that helps unify with the PCIe driver of the hardware. Furthermore, it supports converting models from Python to C++ by using TorchVision [24]. The DL training configurations used in the programs are a 32×32 input image size, a batch size of one, a 0.001 learning rate, the Stochastic Gradient Descent (SGD) optimizer, and the cross-entropy loss function.

DL models are matrix-based operations. In PyTorch, a tensor is a specialized data structure of a multi-dimensional matrix with a single data type (e.g., scalar, float, int). This structure encodes the model parameters and its inputs/outputs. A tensor holds not only the model's data but also metadata that describes its size, the element types it contains, and the device it lives on. The actual place where a tensor is stored is called storage, and it is managed by memory allocation/deallocation procedures. Moreover, PyTorch has a profiler tool that can track memory allocation/deallocation. We modify that profiler to interact with our implemented hardware to get more data at the physical level.

A PCIe driver is used to map a range of physical memory to userspace. Therefore, the software program can access the FPGA DDR and the monitor module. The allocation procedure of PyTorch is modified for integration with the PCIe driver. As a result, it can return both the virtual and physical addresses of the used memory. Furthermore, an allocation algorithm was also implemented to manage the memory resource. A minimal sketch of the training loop under this configuration is given below.
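The following LibTorch (PyTorch C++ API) sketch shows one training loop with the stated configuration: 32×32 inputs, batch size one, SGD with a 0.001 learning rate, and the cross-entropy loss. The small stand-in network and the random tensors are illustrative assumptions only; the actual programs train the models listed above on CIFAR-10 images read with OpenCV, through the modified allocator.

```cpp
#include <torch/torch.h>

int main() {
  // Stand-in CNN; the paper trains ResNet/VGG/AlexNet/GoogLeNet instead.
  torch::nn::Sequential model(
      torch::nn::Conv2d(torch::nn::Conv2dOptions(3, 16, 3).padding(1)),
      torch::nn::ReLU(),
      torch::nn::MaxPool2d(torch::nn::MaxPool2dOptions(2)),
      torch::nn::Flatten(),
      torch::nn::Linear(16 * 16 * 16, 10));

  // Configuration stated in Section 3.2: SGD optimizer, learning rate 0.001.
  torch::optim::SGD optimizer(model->parameters(), /*lr=*/0.001);

  for (int step = 0; step < 100; ++step) {
    // One 32x32 RGB image per batch (batch size 1); random data stands in
    // for a CIFAR-10 sample loaded with OpenCV in the actual program.
    auto image = torch::rand({1, 3, 32, 32});
    auto label = torch::randint(0, 10, {1});

    optimizer.zero_grad();
    auto output = model->forward(image);                          // forward
    auto loss = torch::nn::functional::cross_entropy(output, label);
    loss.backward();                                              // backward
    optimizer.step();                                             // optimize
  }
  return 0;
}
```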
3.3 Survey Method

Fig. 3 describes the survey process of a typical DL training program. As seen in the figure, the DL training procedure can be divided into five main steps: creating the model, loading a batch, the forward calculation, the backward calculation, and the optimization. The statistical data are collected at each step.

In the beginning, a user-mode function from the PCIe driver is called to detect our PCIe device (i.e., the FPGA board). Then, before registering the device, a range of physical memory is specified, and its addresses are mapped by the PCIe driver for later use. After that, the standard DL training procedure can be started, beginning with creating the chosen DL model. During the training process, the software program requests memory and interacts with the RAM. The registered memory range provided by the PCIe driver, which is mapped to the physical addresses of the FPGA DDR, can be passed to the software allocator. Hence, the DL program is now using the FPGA memory for its calculations. When the program finishes, we can read the collected data from the FPGA, as shown in Fig. 3.

Fig. 3. DL training program and the data collection process.

The developed software program can train ResNet-18/34/50/101/152, VGG-16/19, GoogLeNet, and AlexNet using the PyTorch library with the CIFAR-10 image dataset. Each model was run with two scenarios: PC memory and FPGA memory. When using PC memory, the FPGA board is detached from the PCIe cable. Thus, the PCIe driver cannot detect any available device. As a result, there is no registered memory range available beforehand, forcing the software allocator to use the PC memory instead.

4. Experimental Results

The survey was done for all the DL models with the two scenarios of PC memory and FPGA memory. Four observations are worth mentioning because they are consistent throughout the survey:

• The creating-model step only does alloc() but not free(), and it relies heavily on writes.

• The numbers of alloc() and free() in the batch-loading step are the same for all DL models.

• The optimization step only reads from and writes to the memory; it requests no new alloc() and performs no free().

• The heaviest step is the backward calculation, with the highest numbers of alloc(), free(), and read/write transactions.

Table I and Table II show the average values measured over one training loop for all models. Table I shows the memory allocation statistics in four aspects: the peak allocated memory, the smallest requested size, the biggest requested size, and the most and second-most frequent requested sizes. The table shows that the smallest requested size is just 2 Bytes (B), while the biggest one can go up to 64 Megabytes (MB). The most and the second-most frequent requested sizes are 16 and 64 Kilobytes (KB), respectively.

Table I. Statistical result of memory allocations.

Model     | Peak alloc. (MB) | Smallest req. size (B) | Biggest req. size (MB) | 1st/2nd most freq. req. (KB)
ResNet18  |  92.21           | 2                      |  9                     | 16/32
ResNet34  | 176.36           | 2                      |  9                     | 16/1
ResNet50  | 206.83           | 2                      |  9                     | 16/64
ResNet101 | 368.60           | 2                      |  9                     | 16/64
ResNet152 | 510.89           | 2                      |  9                     | 16/64
VGG16     | 272.40           | 2                      | 64                     | 2/32
VGG19     | 315.39           | 2                      | 64                     | 2/32
GoogLeNet | 253.26           | 2                      |  2.5                   | 0.5/512
AlexNet   | 131.26           | 2                      | 32                     | 8/8

Table II shows the record of memory behaviors at the physical level, which relates to the read/write transactions. It is clear that the training processes of all models need to read much more than they write. Almost all models have a read/write ratio of about 3:1, except for GoogLeNet with a 7:1 ratio.

Table II. Statistical result of read/write transactions.

Model     | Read trans./loop | Write trans./loop | Read MB/loop | Write MB/loop
ResNet18  |  50,769,202      | 17,379,253        |   458        |   349
ResNet34  |  95,270,348      | 33,267,925        |   885        |   659
ResNet50  | 112,582,315      | 38,065,918        | 1,013        |   769
ResNet101 | 187,968,858      | 66,544,364        | 1,816        | 1,388
ResNet152 | 259,653,921      | 91,808,246        | 2,536        | 1,920
VGG16     | 107,702,609      | 44,278,316        | 1,329        |   917
VGG19     | 130,566,001      | 52,581,295        | 1,558        | 1,086
GoogLeNet | 363,103,960      | 49,511,163        | 2,369        |   742
AlexNet   |  49,251,425      | 20,135,746        |   674        |   407

Fig. 4 and Fig. 5 are the visual representations of Table II. They show the numbers of read/write requests and the total sizes of the read/write data, respectively.

Fig. 4. Number of read/write transactions per training loop.
Fig. 5. Total read/write sizes per training loop.
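As a quick sanity check of the read/write ratios quoted above, the short program below recomputes them from the per-loop transaction counts in Table II; it reproduces the roughly 3:1 ratios and GoogLeNet's outlier of about 7.3:1.

```cpp
#include <cstdio>

int main() {
  // Read/write transaction counts per training loop, taken from Table II.
  struct Row { const char* name; double rd, wr; } rows[] = {
      {"ResNet18",   50769202, 17379253}, {"ResNet34",  95270348, 33267925},
      {"ResNet50",  112582315, 38065918}, {"ResNet101", 187968858, 66544364},
      {"ResNet152", 259653921, 91808246}, {"VGG16",     107702609, 44278316},
      {"VGG19",     130566001, 52581295}, {"GoogLeNet", 363103960, 49511163},
      {"AlexNet",    49251425, 20135746}};

  for (const Row& r : rows)
    printf("%-10s read/write = %.2fx\n", r.name, r.rd / r.wr);
  // Prints ratios from 2.92x (ResNet18) up to 7.33x (GoogLeNet).
  return 0;
}
```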
5. Enhancing Memory Speed Proposal

Based on the survey results in Section 4, we recommend enhancing the memory performance specifically for DL training programs with three main proposals:

1) The design of the DIMM-RAM can put speed at a higher priority than total capacity, but an individual DDR chip should have at least 64 MB of storage.
2) DIMM-RAMs could favor read transactions over write transactions.
3) A cache should be embedded on the DIMM-RAM with a custom setting specialized for DL training applications.

The first suggestion is based on the observation of the peak allocated memory in Table I. According to the table, the peak allocations range from about a hundred MB to around 500 MB. The capacity of DIMM-RAMs nowadays can easily go much higher than that. Hence, the total capacity of the DDRs is not much of an issue in DL training. Furthermore, almost all of the allocations are freed after each loop. In contrast, although the required peak allocation in each loop is relatively small, the total number of read/write transactions is in the hundreds of millions, as shown in Table II. As a result, the total DIMM-RAM capacity is much less important than its speed. However, an individual DDR chip on a DIMM-RAM should have at least 64 MB of storage, to match the biggest requested size of 64 MB shown in Table I.

The second suggestion comes from the fact that the number of read transactions is much higher than that of the write transactions. For example, as seen in Fig. 4, the proportion of read-over-write transactions ranges from a minimum of 2.92× to a maximum of 7.33×. As a result, the arbitration system in the DIMM-RAM can favor the read responses over the write responses.

The third and final suggestion is a custom cache on the DIMM-RAM to speed up the read response rate even further. This cache is not just a generic cache, but a cache with a custom configuration based on the survey results of DL training programs. In a modern computer, the typical cache system shown in Fig. 6 consists of L1 caches inside each core (i.e., instruction caches and data caches), L2 caches across multiple cores, and an L3 cache right next to the memory controller hub (sometimes called the northbridge chipset). Therefore, our proposed DIMM-RAM cache can be considered an L4 cache.

Fig. 6. The cache system in a typical modern computer with our L4 cache proposal.

About the configuration of this embedded DIMM-RAM cache, the following settings are recommended. Based on the survey, the cache size should be 64 MB, and a single cache set should be 16 KB. That makes the total number of cache sets equal to 64 MB / 16 KB = 4,096 sets. About the number of ways, L1, L2, and L3 caches typically have 2 or 4 ways, 4 or 8 ways, and 8 or 16 ways, respectively [25, 26]. Therefore, 32-way is an appropriate configuration for our L4 cache.
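The arithmetic behind this configuration can be checked with the short program below. Note that interpreting a 16-KB set with 32 ways as implying a 512-B cache line is our reading of the proposal, not something stated explicitly in the text.

```cpp
#include <cstdio>
#include <cstdint>

int main() {
  // Recommended L4 configuration from the survey (Section 5).
  const uint64_t cacheSize = 64ull << 20;  // 64 MB total capacity
  const uint64_t setSize   = 16ull << 10;  // 16 KB per set
  const unsigned ways      = 32;           // 32-way set associative

  const uint64_t numSets  = cacheSize / setSize;  // 4,096 sets
  const uint64_t lineSize = setSize / ways;       // 512 B per line (derived)

  // Address split for a byte-addressable cache: offset and index widths.
  unsigned offsetBits = 0, indexBits = 0;
  for (uint64_t v = lineSize; v > 1; v >>= 1) offsetBits++;
  for (uint64_t v = numSets;  v > 1; v >>= 1) indexBits++;

  printf("sets = %llu, line = %llu B\n",
         (unsigned long long)numSets, (unsigned long long)lineSize);
  printf("offset bits = %u, index bits = %u\n", offsetBits, indexBits);
  return 0;
}
```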

To calculate the performance of a cache system, the Cycle-Per-Instruction (CPI) value is often used [27]; the lower the CPI, the better the cache system performs. Eq. (1) shows the formula for a three-level cache system, where CPI_L1, CPI_L2, and CPI_L3 represent the CPIs of the L1, L2, and L3 caches, respectively; missRate_L1, missRate_L2, and missRate_L3 are the miss rates (in percent) of the L1, L2, and L3 caches, respectively; and missPenalty is the number of clock cycles that the system must pay when a miss occurs.

CPI = CPI_L1 + missRate_L1 × (CPI_L2 + missRate_L2 × (CPI_L3 + missRate_L3 × missPenalty))    (1)

In a typical computer system, L1 caches generally have a base CPI from 0.75 to 1.0 [28] with a miss rate of less than 10% [29]. L2 and L3 caches have CPIs of about 2.0 to 2.2 and miss rates of 10% to 20% [30]. For the estimation, it is safe to assume that the L1, L2, and L3 caches respectively have base CPIs of 1.0, 2.0, and 2.2, and miss rates of 10%, 15%, and 20%. Finally, for the miss penalty, 100 clock cycles can be assumed for a typical DDR response time [31]. Then, the typical CPI value of a computer system without our L4 cache can be calculated with Eq. (2).

CPI = 1 + 0.1 × (2 + 0.15 × (2.2 + 0.2 × 100)) = 1.533 (clock cycles)    (2)

For the performance estimation of the same system with an L4 cache, let us first assume that this cache is just a general-purpose cache like the L3, with the same performance of 2.2 CPI and a 20% miss rate. The system CPI then improves to 1.2996 clock cycles, as shown in Eq. (3); the speed improvement is 15.2%.

CPI = 1 + 0.1 × (2 + 0.15 × (2.2 + 0.2 × (2.2 + 0.2 × 100))) = 1.2996 (clock cycles)    (3)

However, because the proposed L4 cache is specialized for running DL programs, we can expect a better hit rate than a general-purpose cache. If, instead of 20%, the L4 cache achieves a miss rate of 10% or 5%, the CPI value becomes 1.2696 or 1.2546 clock cycles, respectively. That means 17.2% and 18.2% speed improvements for the 10% and 5% miss rates, respectively.
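Eqs. (1)-(3) are easy to reproduce programmatically. The sketch below evaluates the nested CPI model with the assumed parameters and prints the resulting CPIs and speed improvements, matching the 15.2%, 17.2%, and 18.2% figures above.

```cpp
#include <cstdio>
#include <initializer_list>

// Nested CPI model from Eq. (1): each level adds its base CPI plus its
// miss rate times the cost of the next level down.
double cpi3(double cpiL1, double mrL1, double cpiL2, double mrL2,
            double cpiL3, double mrL3, double missPenalty) {
  return cpiL1 + mrL1 * (cpiL2 + mrL2 * (cpiL3 + mrL3 * missPenalty));
}

// Same model extended with an L4 level between the L3 and the DDR.
double cpi4(double cpiL1, double mrL1, double cpiL2, double mrL2,
            double cpiL3, double mrL3, double cpiL4, double mrL4,
            double missPenalty) {
  return cpiL1 + mrL1 * (cpiL2 + mrL2 *
         (cpiL3 + mrL3 * (cpiL4 + mrL4 * missPenalty)));
}

int main() {
  const double base = cpi3(1.0, 0.10, 2.0, 0.15, 2.2, 0.20, 100.0);
  printf("no L4: CPI = %.4f\n", base);  // 1.5330, Eq. (2)

  for (double mrL4 : {0.20, 0.10, 0.05}) {
    double c = cpi4(1.0, 0.10, 2.0, 0.15, 2.2, 0.20, 2.2, mrL4, 100.0);
    printf("L4 miss %.0f%%: CPI = %.4f, improvement = %.1f%%\n",
           mrL4 * 100, c, (base - c) / base * 100);
  }
  // Prints 1.2996 / 15.2%, 1.2696 / 17.2%, and 1.2546 / 18.2%.
  return 0;
}
```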

6. Conclusion

The training process is the most expensive step in DL applications. In this paper, a complete memory activity survey has been done by utilizing a cooperative software-hardware solution. The idea is to use the FPGA memory to emulate the main PC memory while a DL training program is running. The FPGA board is connected to the host computer by PCIe, and in the FPGA, a monitor module was developed to collect the memory data at the physical level. On the software side, several of the most common DL models were included in the survey. All the DL models were run in two scenarios, using the PC's DDR and the FPGA's DDR. Then, the complete memory survey from both the software and hardware points of view was recorded and evaluated. Finally, based on the collected data, a further analysis was done, and three main proposals were discussed to improve the memory access performance in DL training processes.

Acknowledgments

The research has been partly executed in response to the support of KIOXIA Corporation.

References

[1] J. Liu, et al.: "Performance Analysis and Characterization of Training Deep Learning Models on Mobile Device," ICPADS (2019) 506 (DOI: 10.1109/ICPADS47876.2019.00077).
[2] K. Guo, et al.: "A Survey of FPGA-Based Neural Network Accelerator," ACM TRETS 12 (2019) 1 (DOI: 10.1145/3289185).
[3] NVIDIA Profiler User's Guide: DU-05982-001_v11.2 (2021) https://docs.nvidia.com/cuda/pdf/CUDA_Profiler_Users_Guide.pdf.
[4] NVIDIA Jetson Linux Driver Package Software Features (2019) https://docs.nvidia.com/jetson/archives/l4t-archived/l4t-3231/index.html#page/Tegra%2520Linux%2520Driver%2520Package%2520Development%2520Guide%2FAppendixTegraStats.html.
[5] TensorFlow Profiler: Profile Model Performance (2021) https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras.
[6] M. Hashemi, et al.: "Learning Memory Access Patterns," arXiv cs.LG (2018) https://arxiv.org/abs/1803.02329.
[7] Z. Lu, et al.: "Modeling the Resource Requirements of Convolutional Neural Networks on Mobile Devices," ACM MM'17 (2017) 1663 (DOI: 10.1145/3123266.3123389).
[8] J. Hanhirova, et al.: "Latency and Throughput Characterization of Convolutional Neural Networks for Mobile Computer Vision," ACM MMSys'18 (2018) 204 (DOI: 10.1145/3204949.3204975).
[9] A. Khan, et al.: "A Survey of the Recent Architectures of Deep Convolutional Neural Networks," Springer AI Review 53 (2020) 5455 (DOI: 10.1007/s10462-020-09825-6).
[10] J. Gu, et al.: "Recent Advances in Convolutional Neural Networks," Elsevier Patt. Recog. 77 (2018) 354 (DOI: 10.1016/j.patcog.2017.10.013).
[11] K. He, et al.: "Deep Residual Learning for Image Recognition," CVPR (2016) 770 (DOI: 10.1109/CVPR.2016.90).
[12] A. Krizhevsky, et al.: "ImageNet Classification with Deep Convolutional Neural Networks," Communications of the ACM (2017) 1 (DOI: 10.1145/3065386).
[13] K. Simonyan and A. Zisserman: "Very Deep Convolutional Networks for Large-scale Image Recognition," arXiv cs.CV (2015) http://arxiv.org/abs/1409.1556.
[14] C. Szegedy, et al.: "Going Deeper with Convolutions," CVPR (2015) 1 (DOI: 10.1109/CVPR.2015.7298594).
[15] ImageNet (2021) http://image-net.org/.
[16] CIFAR-10/CIFAR-100 (2021) https://www.cs.toronto.edu/~kriz/cifar.html.
[17] Y. LeCun, et al.: The MNIST Database of Handwritten Digits (2021) http://yann.lecun.com/exdb/mnist/.
[18] TensorFlow (2021) https://www.tensorflow.org/.
[19] PyTorch (2021) https://pytorch.org/.
[20] Keras (2021) https://keras.io/.
[21] OpenNN: Neural Networks (2021) https://www.opennn.net/.
[22] A. Paszke, et al.: "PyTorch: An Imperative Style, High-Performance Deep Learning Library," NeurIPS 32 (2019) 1 https://arxiv.org/abs/1912.01703.
[23] OpenCV (2021) https://opencv.org/.
[24] PyTorch: TorchVision (2021) https://github.com/pytorch/vision.
[25] J. Dorsey, et al.: "An Integrated Quad-Core Opteron Processor," ISSCC (2007) 102 (DOI: 10.1109/ISSCC.2007.373608).
[26] Y. Yarom, et al.: "Mapping the Intel Last-Level Cache," IACR (2015) 905 https://eprint.iacr.org/2015/905.
[27] Yan Solihin: Fundamentals of Parallel Multicore Architecture (Chapman & Hall/CRC, Boca Raton, Florida, 2015) 160.
[28] Intel Corp.: Intel VTune Profiler User Guide (2021) https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/clockticks-per-instructions-retired-cpi.html.
[29] Intel Corp.: Intel VTune Profiler Performance Analysis Cookbook (2020) https://software.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/tuning-recipes/instruction-cache-misses.html.
[30] Jason D. Bakos: Embedded Systems (Morgan Kaufmann, Boston, Massachusetts, 2016) 147.
[31] Intel Corp.: Memory Access Analysis for Cache Misses and High Bandwidth Issues (2020) https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performance/microarchitecture-analysis-group/memory-access-analysis.html.