Inference speed comparison using convolutions in neural networks on various SoC hardware platforms using MicroPython

Kristian Dokic, PhDa, Hrvoje Mikolcevicb and Bojan Radisica a Polytechnic in Pozega, Vukovarska 17, Pozega, Croatia b Tehnical school, Ratarnicka 1, Pozega, Croatia

Abstract In recent years we have witnessed the rapid development of machine learning algorithms, and the same can be said for IoT. Developments in both fields have also influenced the growth of machine learning algorithms in IoT devices. The authors of a series of papers cite several reasons to argue this trend. This paper explores the possibility of using the Python in different versions to create, train, and implement convolutional neural networks on two SoCs based on different architectures (ARM and RISC-V). The influence of the number of filters in the convolutional layer on the inference speed is also investigated. The number of filters has a different effect on inference speed depending on the existence of additional components that accelerate individual operations of convolutional neural networks (convolution, batch normalization, activation, and pooling operations).

Keywords 1 MicroPython, SoC, CNN, Convolution Neural Network, RISC-V

1. Introduction these are energy savings, network traffic reduction, privacy problems and delays in data transmission [3]. Sakr et al. also cites similar In the last few years, the terms Internet of benefits of moving computation toward the Things and Machine learning have been edge, such as lower energy consumption, lower increasingly mentioned in the field of response latency, higher security, lower information and communication technologies. bandwidth occupancy and expected privacy [4]. Definitions of both terms are given below. Since Python is a programming language Internet of things is “a system of interrelated most often used for machine learning and very computing devices, mechanical and digital rarely for the development of IoT devices, in machines, objects, animals or people that are this paper, the aim was to compare two different provided with unique identifiers (UIDs) and the SoCs' performance using Python and ability to transfer data over a network without MicroPython in the development of machine requiring human-to-human or human-to- learning models and IoT devices. The /C++ computer interaction” [1]. Machine learning “is programming language is most often used to the study of computer algorithms that improve develop IoT devices, but if a version of Python automatically through experience” [2]. were used for this task, all phases of a machine Powerful computers are most often required learning model development for IoT devices to implement machine learning, and IoT would be significantly accelerated and devices are often low in processing power. simplified. Several authors have stated that there are few Two development boards that hit the market advantages of transferring part of machine a year ago based on different architectures were learning data processing from large and cloud used in this paper, and their inference computers to IoT devices. Zhang et al. state that

Proccedings of RTA-CSIT 2021, May 2021, Tirana, Albania EMAIL: [email protected] (Kristian Dokic); [email protected] (Hrvoje Mikolcevic); [email protected] (Bojan Radisic)

© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) performance was tested with a different number low-cost ambient monitor device based on of filters in the first layer of the convolutional ESP8266 SoC. They used a DHT22 neural network used to detect handwritten temperature and humidity sensor and OLED digits. Easy use of convolutional neural 128x64 display [14]. networks on has been possible Khamphroo et al. introduced a mobile in the last year or two, so there is not much educational robot based on MicroPython. The information about microcontrollers' development's primary goal is to create a robot performance on these tasks. A convolution that will be easy to use for beginners. neural network is a deep neural network class STM32L432KC ARM Cortex-M4 168 MHz usually used for image processing, was used in the presented project [15] [16]. classification, and other similar tasks. Tariq et al. used MicroPython for noise filtering on an STM32F10RB platform to implement an 2. Literature review early warning seismic event detection algorithm [17]. Zhang et al. used MicroPython in their quadruped robot project that deals with The first part of this section provides an the twisting trunk's effect on the tumble overview of papers in which the authors used stability from an energy viewpoint. Their MicroPython to solve various problems. The development board was based on STM32F405 second part provides an analysis of the used [18]. Wie et al. used MicroPython in their hardware platforms. wearable bio-signal processing system. Devices were based on the STM32F722 2.1. MicroPython [19]. Ibba et al. developed an impedance analyzer for fruit quality monitoring based on MicroPython is a programming language the STM32L486 microcontroller. They also partially compatible with Python 3. It is used MicroPython for microcontroller optimized to run on microcontrollers, and it programming [20]. Crepaldi et al. developed includes various modules for low-level the body channel communication system used hardware access. It was released in 2014. and it for landmark identification. They used is ported on various platforms (ARM Cortex- MicroPython and STM32L486 microcontroller M, RISC-V, ESP32, ESP8266, PIC, STM32) [21]. [5] [6]. It is also available on BBC micro:bit Bahmanian et al. used MicroPython in their development board [7] [8]. Finally, a new wide-band frequency synthesizer that can lock Raspberry product Pico can be on any of the optical reference harmonics used with MicroPython [9]. It is evident that between 2 GHz and 20 GHz. They used the MicroPython has become quite popular in a FE310 microcontroller based on RISC-V to short time, and it has been ported to various control the DA converters [22]. platforms without the significant influence of At the end of the MicroPython part, we large companies. should mention the Arduino IDE, which is MicroPython is used in various IoT projects. undoubtedly more popular than MicroPython, Regnath et al. used MicroPython on ESP32 but it also has some drawbacks. Kodali et al. platform to test their approach to verify data compared these two platforms by a series of integrity and consensus on a blockchain used on features. They state that the difference is in the IoT platforms [10]. Cardenas et al. used language type since MicroPython is a scripting MicroPython in Low-Cost and Low-Power language, while Arduino C code needs to be Messaging System. Their solution was based on compiled. According to them, MicroPython is ESP32 and LoRa Wireless Technology [11]. simpler, and the development is 5-10 times Tzounis et al., in their review of the IoT faster because there is no compiling. The syntax platform in agriculture, also mention is cleaner, and the code is more readable in MicroPython as a programming language for MicroPython [14]. Tanganelli et al. have the LoPy development platform based on reviewed the available platforms for rapid ESP32 SoC [12]. LoPy development board and prototyping of IoT solutions from a developer MicroPython have been used by Sayed et al. in perspective. In addition to the Arduino IDE and their multi-sensing platform for the ORCA Hub MicroPython, they listed several other integrated with Robot Operating System [13]. platforms and development tools: FreeRTOS, Kodali et al. used MicroPython to develop a mbedOS, Zephyr, Contiki OS, RIOT OS and Zerynth. They also cite perhaps the biggest Microprocessor’s characteristics drawback of the MicroPython, which is the STM32H743I Kendryte significantly smaller number of development I K210 boards that support it. The Arduino can be used Producer STM Canaan Inc. on over 1000 development boards, while the Bit Width 32 bit 64 bit MicroPython is supported by twenty. Their CPU clock 480 MHz 400 MHz paper is from 2019, and the situation may have RAM 1+32 MB 6 MB changed somewhat [23]. FLASH 2+32 MB 16 MB

2.2. MicroPython hardware on edge

In the previously presented paper review, MicroPython-enabled platforms are grouped into three individual sections. The first section contains papers that use ESP32 and ESP8266 SoC. These are Espressif Inc. products based on Tensilica LX106 and LX6 processors. The second section lists papers that use different SoCs from STMicroelectronics, while the third section includes paper that uses SoCs based on the RISC-V architecture. The official GitHub page of the Figure 1: OpenMV Cam H7 Plus MicroPython project lists development boards that come with MicroPython installed. There are currently 35 development boards on the list, primarily based on the STM32F family of microcontrollers [24].

3. Method

The first part of this section gives the characteristics of the used development boards and microcontrollers with MicroPython installed. The second part of the section describes creating a model of a convolutional neural network and converting the model into Figure 2: Sipeed Maixduino formats that can be used on SoC with MicroPython. The Development board in Figure 2 is a Sipeed Maixduino Kit for RISC-V AI + IoT, and it is based on the Kendryte K210 based on 3.1. Used hardware supported by RISC-V instruction set architecture. The MicroPython development board's price was $ 23.90, but only one K210 costs $ 8.64. The paper analyzes two development boards based on different architectures. The development board in Figure 1 is an OpenMV 3.2. CNN model development and Cam H7 Plus and is based on the STM32H743II deployment ARM Cortex M7 processor. The development board's price was € 99.00, but only one STM32H743II processor costs € 11.61. Their The MNIST database of handwritten digits characteristics are listed in Table 1. has been used for training in this paper. This database consists of sixty thousand grayscale Table 1 images of handwritten digits between 0 and 9. The image's size is 28x28 pixels that are the OpenMV Cam H7 Plus development board reduced to 14x14-pixel size. One simple in the Windows environment is presented as a convolution neural network has been developed USB memory stick. to resolve this classification problem, and it can In contrast, for the Sipeed Maixduino be seen in Figure 3. development board, model preparation is more This neural network is quite simple and complicated. NNCase is used for the consists of a single convolutional layer with a conversion, which is also free, and after the 3x3 pixel filter. The number of filters is a conversion, a model in kfpkg format is variable changed during testing to notice the obtained. The file in the specified kfpkg format impact of that variable on the inference speed. is transferred to the development board using Adam is used as an optimizer and ReLU as an the KFlash tool. Figure 4 shows the process of activation function. The model code can be training, conversion and transfer of the model found on the GitHub address: to both development boards. https://github.com/kristian1971/MicroPythonR ACE. OpenMV Sipeed Cam H7 Plus Maixduino

GOOGLE COLABORATORY - MODEL GOOGLE COLABORATORY - MODEL CREATION AND TRAINING CREATION AND TRAINING

tflite model h5 model

OPENMV IDE- MODEL DEPLOYMENT tflite model

. kfpkg model

. MaiXPy IDE - MODEL DEPLOYMENT

Figure 4 - Process of training, conversion and transfer After the models have been transferred to development boards, it is necessary to write a program in each integrated development environment that will apply that model. In both Figure 3: Used CNN model cases, it is written in MicroPython, and the differences between them are insignificant. The The model was trained for only 12 epochs, programs are available on the GitHub server. and after that, it was converted to TensorFlow Lite (tflite) format. A total of 10 training processes were conducted, and the variable that was changed was the number of filters in the convolutional layer. After each training, the model in tflite format is saved. The same model was used on both development boards. The OpenMV Cam H7 Plus development board's integrated development environment can be downloaded for free from the manufacturer's website and is called the OpenMV IDE. A free Integrated development Figure 5: MaixPy IDE environment based on the OpenMV IDE, called Figure 5 shows the integrated development the MaixPy IDE, is available for the Sipeed environment MaixPy IDE, while the OpenMV Maixduino development board. Both are based IDE is not shown because the visual differences on MicroPython. However, the use of the are negligible. convolutional neural network model is The HANTEK DSO5102P oscilloscope was somewhat different on boards. The OpenMV used to measure the inference speed. Before the IDE and the associated development board command that triggers the data inference allow easy transfer of the model in tflite format through the neural network, one pin on the with a simple drag-and-drop method because development board is lifted from LOW to 10 0,9620 78 kB 2,6 ms HIGH. After the inference process, the state of that same pin is lowered from HIGH to LOW. The first column of both tables contains the The commands for the OPENMV IDE are as number of filters in the convolution layer. In the follows: second column, the accuracy is obtained in the pin1.value(1) model creation phase, and in the third column out = net.classify(graytmp) are the tflite file sizes. The first three columns pin1.value(0) are the same for both development boards. The commands for the MaixPy IDE are as Table 3 follows: OpenMV Cam H7 Plus data led_r.value(1) ACCURACY Tflite t fmap=kpu.forward(task,img2) led_r.value(0) 1 0.8969 8 kB 0,22 ms 2 0,9220 16 kB 0,33 ms The described method is used to measure the 3 0.9479 24 kB 0,42 ms propagation time of data through a 4 0.9330 31 kB 0,52 ms convolutional neural network. Figure 6 shows 5 0,9520 39 kB 0,63 ms the measurement procedure on the 6 0.9637 47 kB 0,74 ms oscilloscope. 7 0,9623 54 kB 0,85 ms 8 0.9624 62 kB 0,97 ms 9 0,9632 70 kB 1,08 ms 10 0,9620 78 kB 1,20 ms The fourth columns show the neural network inference speed, and these values are significantly less for the OpenMV Cam H7 Plus development board. Figure 7 shows a graph of speed for both development boards. 3

Figure 6: Measurement with an oscilloscope 2.5 2 4. Testing and measurement 1.5 1 0.5 In the measurement procedure, ten models (ms) propagation were used on both development boards that 0 differed in the number of filters in the 12345678910 convolutional layer. The number of filters was Number of filters changed from one to ten. Table 2 shows the data obtained for the Sipeex Maixduino RISC-V ARM development board, while Table 3 shows the data for the OpenMV Cam H7 Plus. Figure 7: Inference speed Table 2 If we look at the data more closely, we can see a linear increase in the inference speed in Sipeex Maixduino data the OpenMV Cam H7 Plus development board ACCURACY Tflite t depending on the number of filters, while for 1 0.8969 8 kB 1,2 ms Sipeed MaixPy, this dependence is not linear. 2 0,9220 16 kB 1,3 ms To determine this difference, Figure 8 shows 3 0.9479 24 kB 1,4 ms the inference speed with x filters (tx) relative to 4 0.9330 31 kB 1,5 ms the inference speed measured with only one 5 0,9520 39 kB 1,6 ms filter in the convolution layer (t1). It is evident 6 0.9637 47 kB 2,0 ms that by increasing the number of filters, the 7 0,9623 54 kB 2,3 ms inference speed increases faster on the 8 0.9624 62 kB 2,3 ms OpenMV Cam H7 Plus development board, 9 0,9632 70 kB 2,6 ms while this increase is slower on Sipeed MaixPy.

6 data propagation through a convolutional 5 neural network for 3x3 filters. 4 1

/t 3 x 6. References t RISC-V 2 1 ARM 0 [1] A. S. Gillis, »internet of things (IoT),« 12345678910 TechTarget, [Mrežno]. Available: Number of filters https://internetofthingsagenda.techtarge t.com/definition/Internet-of-Things-

Figure 8: Relative inference speed IoT. [Pokušaj pristupa 12 January 2021]. [2] T. M. Mitchell, Machine Learning, New 5. Discussion and conclusion York: McGraw-hill, 1997. [3] Y. Zhang, N. Suda, L. Lai i V. Chandra, The paper compares the convolutional Hello Edge: Keyword Spotting on neural network inference speed of two various Microcontrollers, 2018. developmental boards. One development board is based on the STM32H743II ARM Cortex M7 [4] F. Sakr, F. Bellotti, R. Berta i A. De processor, while the other is based on the Gloria, »Machine Learning on Kendryte K210 RISC-V processor. The same Mainstream Microcontrollers,« neural network model was used in both cases, Sensors, svez. 20, 2020. and MicroPython was used for the model [5] J. Beningo, »Prototype to production: implementation on development boards. MicroPython under the hood,« TensorFlow 2.x and Keras were used to develop Aspeccore Inc., 11 July 2016. [Mrežno]. and train the model. Available: It can be observed that the inference speed https://www.edn.com/prototype-to- under the given conditions is significantly production-micropython-under-the- lower on the STM32H743II ARM Cortex M7 hood/. [Pokušaj pristupa 13 December processor. However, we must emphasize that 2020]. the convolutional neural network used is simple [6] George Robotics Limited, and that the input image is 14x14 pixels in size. »MicroPython,« George Robotics A detail that should also be emphasized is Limited, [Mrežno]. Available: the influence of the number of used filters on https://micropython.org/. [Pokušaj the inference speed. With only one filter in the pristupa 12 December 2020]. convolutional layer, the development board [7] S. Sentance, J. Waite, L. Yeomans i E. based on the ARM processor was 6x faster, MacLeod, »Teaching with physical while when using ten filters, this difference computing devices: the BBC micro: bit dropped to only 2x. The increase in the number initiative,« u Proceedings of the 12th of filters reduces the difference in inference Workshop on Primary and Secondary speed between boards. It is possible that at Computing Education, 2017. some point, the development board based on the RISC-V processor would be faster. [8] M. Cápay i N. Klimová, »Engage Your The cause of this slight slope curve for the Students via Physical Computing!,« u RISC-V processor in Figure 8 is a Neural 2019 IEEE Global Engineering Network processor in K210 with built-in Education Conference (EDUCON), convolution, batch normalization, activation, 2019. and pooling operations. This is sparsely [9] J. Adams, »Meet Raspberry Silicon: described in the documentation [25]. Raspberry Pi Pico now on sale at $4,« In future research, the number of filters and RASPBERRY PI FOUNDATION, 21 other parameters should be further increased, January 2021. [Mrežno]. Available: and the influence of filter size should be https://www.raspberrypi.org/blog/raspb studied. In this paper, a 3x3 filter has been used, erry-pi-silicon-pico-now-on-sale/. and the documentation for the K210 states that [Pokušaj pristupa 22 Februray 2021]. the Neural Network processor speeds up the [10] E. Regnath i S. Steinhorst, »LeapChain: Design Engineering Technical efficient blockchain verification for Conferences and Computers and embedded IoT,« u Proceedings of the Information in Engineering Conference, International Conference on Computer- 2018. Aided Design, 2018. [19] Y. Wei, Q. Cao, L. Hargrove i J. Gu, »A [11] A. M. Cardenas, M. K. N. Pinto, E. Wearable Bio-signal Processing System Pietrosemoli, M. Zennaro, M. Rainone i with Ultra-low-power SoC and P. Manzoni, »A low-cost and low- Collaborative Neural Network power messaging system based on the Classifier for Low Dimensional Data LoRa wireless technology,« Mobile Communication,« u 2020 42nd Annual Networks and Applications, p. 1–8, International Conference of the IEEE 2019. Engineering in Medicine & Biology [12] A. Tzounis, N. Katsoulas, T. Bartzanas Society (EMBC), 2020. i C. Kittas, »Internet of Things in [20] P. Ibba, M. Crepaldi, G. Cantarella, G. agriculture, recent advances and future Zini, A. Barcellona, M. Petrelli, B. D. challenges,« Biosystems Engineering, Abera, B. Shkodra, L. Petti i P. Lugli, svez. 164, p. 31–48, 2017. »Fruitmeter: An ad5933-based portable [13] M. E. Sayed, M. P. Nemitz, S. Aracri, A. impedance analyzer for fruit quality C. McConnell, R. M. McKenzie i A. A. characterization,« u 2020 IEEE Stokes, »The limpet: A ROS-enabled International Symposium on Circuits multi-sensing platform for the ORCA and Systems (ISCAS), 2020. hub,« Sensors, svez. 18, p. 3487, 2018. [21] M. Crepaldi, A. Barcellona, G. Zini, A. [14] R. K. Kodali i K. S. Mahesh, »Low cost Ansaldo, P. M. Ros, A. Sanginario, C. ambient monitoring using ESP8266,« u Cuccu, D. De Marchi i L. Brayda, »Live 2016 2nd International Conference on wire-a low-complexity body channel Contemporary Computing and communication system for landmark Informatics (IC3I), 2016. identification,« IEEE Transactions on Emerging Topics in Computing, 2020. [15] M. Khamphroo, N. Kwankeo, K. Kaemarungsi i K. Fukawa, »Integrating [22] M. Bahmanian, S. Fard, B. Koppelmann MicroPython-based educational mobile i J. C. Scheytt, »Wide-Band Frequency robot with wireless network,« u 2017 Synthesizer with Ultra-Low Phase 9th International Conference on Noise Using an Optical Clock Source,« Information Technology and Electrical u 2020 IEEE/MTT-S International Engineering (ICITEE), 2017. Microwave Symposium (IMS), 2020. [16] M. Khamphroo, N. Kwankeo, K. [23] G. Tanganelli, C. Vallati i E. Mingozzi, Kaemarungsi i K. Fukawa, »Rapid Prototyping of IoT Solutions: A »MicroPython-based educational Developer's Perspective,« IEEE mobile robot for computer coding Internet Computing, svez. 23, p. 43–52, learning,« u 2017 8th International 2019. Conference of Information and [24] M. Sielenkemper , »Boards Summary,« Communication Technology for GitHub, 20 Octobar 2019. [Mrežno]. Embedded Systems (IC-ICTES), 2017. Available: [17] H. Tariq, F. Touati, M. A. E Al-Hitmi, https://github.com/micropython/microp D. Crescini i A. Ben Mnaouer, »A real- ython/wiki/Boards-Summary. [Pokušaj time early warning seismic event pristupa 27 Februray 2021]. detection algorithm using smart geo- [25] Sipeed, "K210," Sipeed, [Online]. spatial bi-axial inclinometer nodes for Available: Industry 4.0 applications,« Applied https://maixduino.sipeed.com/en/hardw Sciences, svez. 9, p. 3650, 2019. are/k210.html. [Accessed 13 January [18] C. Zhang, X. Chai i J. S. Dai, 2021]. »Preventing Tumbling With a Twisting Trunk for the Quadruped Robot: Origaker I,« u ASME 2018 International