LECTURE: MP-6171 SISTEMAS EMPOTRADOS DE ALTO DESEMPEÑO

Project 2: Benchmarking, Profiling and Optimizing an Application for an ARM-based Embedded Device

Lecturers: MSc. José Araya Martínez, MSc. Sergio Arriola-Valverde

First term, 2020

Contents

1 Introduction
  1.1 Administrative Matters
    1.1.1 Team Formation
    1.1.2 Forum and Communication
    1.1.3 Plagiarism
  1.2 Development Environment Overview
    1.2.1 Nomenclature

2 CoreMark and CoreMark-Pro: Benchmarking the Raspberry Pi 4
  2.1 The Simple, yet Sophisticated CoreMark
    2.1.1 Compiling and Running the CoreMark Benchmark
    2.1.2 Experiment 1: Multi-Threading and the CoreMark Benchmark
    2.1.3 Experiment 2: Compiler Optimization of "non-optimizable" Code
  2.2 The More Comprehensive CoreMark-Pro
    2.2.1 Understanding the Concept of the Improved Algorithm
    2.2.2 Running the CoreMark-Pro in the Raspberry Pi 4

3 Application Description
  3.1 Motivation
  3.2 Objective
  3.3 Image Data

4 Prototyping Color Transformation RGB/YUV with OpenCV
  4.1 Adding OpenCV to our File System

5 C Implementation of the Application
  5.1 RGB-YUV conversion using C/C++
  5.2 Benchmark and Analyze your Implementation

6 Profiling Analysis of the C Implementation with perf
  6.1 Adding perf to our File System
  6.2 Profiling with Perf
  6.3 Profile Your Application

7 C Implementation with NEON Intrinsics
  7.1 RGB-YUV conversion using C/C++ and NEON Intrinsics

8 Optimizing our C implementation with Multi-Threading
  8.1 RGB-YUV conversion using Pthread
  8.2 RGB-YUV conversion using OpenMP
  8.3 Yocto Testing

9 Optional: Optimization Competition

10 Deliverables and Grading
  10.1 Workflow requirements
  10.2 Folder Structure of your Deliverables
  10.3 Grading
  10.4 Deliverables Submission


1 Introduction

In this project, each team will carry out an analysis of an image in RGB24 format. A color space transformation is to be implemented in order to convert an RGB24 image to the YUV image format. Before starting the implementation, it is highly recommended to investigate the numerical matrix used for the color space transformation. In order to analyze, estimate and understand the color space transformation, a sample software prototype will be developed (in OpenCV) using a JPEG image attached to this project. Due to several hardware and software restrictions, it is necessary to port your application from a high-level approach in OpenCV to a low-level approach, for instance in C. This ported application must be integrated in a custom meta-layer and inherited in a Yocto image for a RPI4 in order to estimate its execution time in a first stage. In embedded systems, it is common practice to profile and benchmark an application in order to improve its performance and efficiency. In this project, benchmarking and profiling will be used to optimize your application based on the following approaches: ARM NEON Intrinsics, OpenMP and Pthread, all of them integrated in a RPI4 Yocto image.

1.1 Administrative Matters

Before we get down to business, some administrative affairs are to be discussed:

1.1.1 Team Formation

Use the initial work team organization defined previously. In this case, a maximum of two students per group is allowed.

1.1.2 Forum and Communication

This project will be evaluated remotely; for this reason, having a suitable online platform is very important to facilitate communication between students and lecturers. In order to do so, we will adopt a "community" approach by means of a forum on the TecDigital platform. In the forum, all students can create new topics as questions arise. All discussions are public to all the course members, so that any fellow student can answer and/or add more information to the proposed question. Please avoid sending project-related queries directly to the lecturers' Email, as this prevents other students from benefiting from the answer. In the forum we are all a team! The only restriction is to not share working source code in the forum; instead, we can create discussions and propose ideas and concepts that lead to the solution.


1.1.3 Plagiarism

Any evidence of plagiarism will be thoroughly investigated. If it is corroborated, the team will receive a zero grade in the project and the incident will be communicated to the corresponding institutional authorities for further processing.

1.2 Development Environment Overview

Embedded systems are ubiquitous; they are present everywhere in our daily-life activities. This project will introduce students to both the setup of the development environment and some best design and implementation practices concerning embedded systems. A Raspberry Pi 4 will be used as the target platform. As Figure 1 shows, our development environment consists of two main components:

• Host: Serves as the main development platform. As it typically has more computing capacity than the target, all the design and implementation will be done here.

• Target: This is the test and debug environment. Once a design phase has concluded, the target platform is used as a test system. Here we validate correct functioning, and profile and benchmark our application. It is even possible to measure the energy consumption and efficiency of our algorithms for energy-aware applications.

Figure 1: General view of the development environment

It is worth noting that there are two connections between the host and the target:

• UART: Thanks to its simplicity, this is usually the first communication established with the target, even at early stages of the board bring-up. It will serve to send commands to the bootloader and to get logging information of the boot process.

• Ethernet: Because UART does not allow a high transfer rate (normally up to a range of hundreds of kBps), we need a faster communication method to share large files in a reasonable amount of time. For this reason, the TFTP protocol will be used over Ethernet to share the device tree and kernel at boot time. In addition, the target will mount a file system present in the host over the NFS protocol.

1.2.1 Nomenclature

As we will work with different command-line consoles during the setup of the development environment, it is necessary to specify where the commands are to be executed. Table 1 summarizes the prompt symbols for the different command-line consoles.

Table 1: Prompt symbols for the multiple command-line consoles during the project.

Prompt symbol   Description
$               Host: Linux PC
=>              Target: U-Boot
@               Target: Linux

2 CoreMark and CoreMark-Pro: Benchmarking the Raspberry Pi 4

Creating a standard and universal benchmark is not a trivial task, as hardware may vary greatly from device to device. Over the years, many attempts have been made to standardize the performance evaluation of embedded devices, and many of them are now obsolete. A good starting point to measure the performance of our embedded system nowadays is the CoreMark benchmark developed by the non-profit EEMBC community. We are going to evaluate two variants of the algorithm:

• On one hand, the original CoreMark algorithm, which was developed as a general-purpose benchmark to measure the performance of microcontrollers (MCUs) and central processing units (CPUs) used in embedded systems.

• On the other hand, the more advanced CoreMark-Pro, which builds on the original CoreMark benchmark by adding context-level parallelism and 7 new workloads covering integer and floating-point performance.

Please follow the steps, and make sure you include your results and answer the questions in your written report.


2.1 The Simple, yet Sophisticated CoreMark

The first step will be to get to know the CoreMark benchmark. To do so, answer briefly the next questions in your report:

1. Briefly describe the theory of operation of the benchmark algorithm. Make sure you add a short description of its three main algorithms:
   (a) Linked List
   (b) Matrix Multiply
   (c) State Machine

2. How does the CoreMark benchmark try to deal with compiler optimization to come up with a standardized result? Make sure you include the next concepts in your description:
   (a) Compile time vs. run time
   (b) Volatile variables
   (c) Input-dependent results by using time-based, scanf and command-line parameters

3. What is the difference between the "core_portme" and the "core" files? Are we allowed to modify all of them?

2.1.1 Compiling and Running the CoreMark Benchmark

The last part of this section is to actually compile the benchmark, check how independent it is from the compiler optimization, and analyze our results. To do so, please follow the next steps:

1. Go to the CoreMark Github and clone it in your host.

2. Follow the Readme.md file of the repository and port the benchmark to be executed in our 64-bit Linux system.
   (a) NOTE: Here is a fall-back way for you to compile it if you cannot compile it with the provided make:
       i. Source the environment of your toolchain as we did in Section 3.9.1 of the first project.
       ii. Note that you will need to modify the compiler flags in the next sections. The following command just provides a starting point for compilation:


$ ${CC} -O2 -Ilinux64 -I. -DFLAGS_STR=\""-O2 -DPERFORMANCE_RUN=1 -lrt"\" \
    -DITERATIONS=0 -DPERFORMANCE_RUN=1 core_list_join.c core_main.c \
    core_matrix.c core_state.c core_util.c linux64/core_portme.c \
    -o ./coremark.exe -lrt

3. Source our Yocto-created toolchain to be able to cross-compile the code (source /opt/poky/3.0.2/environment-setup-aarch64-poky-linux).

4. Make sure you can compile the benchmark following the console command explained in the Readme.md file.
   (a) Note: As we are cross-compiling for ARM, we have to use the aarch64-poky-linux-gcc compiler instead of the native gcc the command in the Readme.md suggests. To do so, consider using $CC instead of gcc.

5. Run the generated binary in your RPI4 to make sure the compilation was successful.

6. Set the ITERATIONS variable so that CoreMark runs for at least 20 seconds in the RPI4. Use this value in all compilations of the next 2 sections.

In the following two sections we will explore two experiments related to the benchmark: the effect of compiling the code with multi-threading, and letting a modern compiler try to optimize code which was originally designed to be compiler-independent.

2.1.2 Experiment 1: Multi-Threading and the CoreMark Benchmark

As explained in the Readme.md, you can let the compiler know how many threads your platform supports. Follow the next simple steps to explore the impact of this compiler flag on the overall performance of the benchmark:

1. Check how many threads your RPI4 can handle (a small sketch after this list shows one way to query the core count programmatically).

2. Change the DMULTITHREAD flag in the XCFLAGS of the make command and set it from 1 thread to twice as many threads as your RPI4 can handle, in steps of 1. So, for instance, if you determine that your RPI4 can handle up to 2 threads, you will need to compile and execute the benchmark for DMULTITHREAD values of 1, 2, 3, and 4.

3. Plot your results in a "Benchmark Performance (y-axis) vs. Number of Threads (x-axis)" graph.

4. Analyze your results; include at least the following points:


(a) Does your curve follow a linear, quadratic, exponential, etc. behaviour, or a combination of them? Which step produces the largest improvement in comparison to its predecessor?

(b) Would indefinitely increasing the number of threads keep helping the benchmark performance?

(c) What role does the Linux scheduler play in assigning threads to physical cores? Mention how the Linux scheduling policy works in your system.
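As referenced in step 1, here is a minimal sketch of one way to query the available core count from C; running nproc or reading /proc/cpuinfo on the target should give the same information.

#include <stdio.h>
#include <unistd.h>

/* Ask the kernel how many processors are currently online;
 * on the RPI4 this is expected to report 4. */
int main(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("Online processors: %ld\n", n);
    return 0;
}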

2.1.3 Experiment 2: Compiler Optimization of "non-optimizable" Code

As mentioned before, the CoreMark benchmark was designed to be as compiler-independent as possible. Let's see how accurate this is! We will compile the benchmark using different optimization degrees and analyze our results:

1. Check the theory on compiler optimization and elaborate on what the best optimization level would be according to the theory.

2. Compile the benchmark with the following compiler optimizations and plot your results in a "Benchmark Performance (y-axis) vs. Compiler Optimization (x-axis)" graph:
   (a) O0
   (b) O1
   (c) Os
   (d) O2
   (e) O3
   (f) Ofast

3. Analyze your results; include at least the following points:
   (a) Does your curve follow a linear, quadratic, exponential, etc. behaviour, or a combination of them? Which step produces the largest improvement in comparison to its predecessor?
   (b) Is it possible to significantly improve the CoreMark benchmark with the compiler? Would you consider it a compiler-independent benchmark?

Now, to summarize our RPI4 benchmarking results so far, execute the following steps:

1. Use the results of the last 2 sections to generate the best possible CoreMark result you can get out of the RPI4 with our custom Linux kernel, file system and toolchain.

2. Go to the scores list, find the scores reported for the BCM2711 processor and compare yours.

3. Analyze your results. Explain why your results may be slower or faster than the reported ones.

2.2 The More Comprehensive CoreMark-Pro

As we just experienced, the CoreMark benchmark is not the most reliable performance evaluation tool for a modern multi-core embedded device. An attempt to make a better benchmark for multi-core Linux platforms is CoreMark-Pro, as you can check in this comparison between the two.

2.2.1 Understanding the Concept of the Improved Algorithm

Again, the first thing we will do is get to know the algorithm behind our benchmark. Briefly answer the next theoretical questions:

1. How does the algorithm differ from the original one? What has improved?

2. Overview its integer and floating-point workloads, without explaining the 24 FORTRAN kernels in detail.

3. Is the simple CoreMark included in the CoreMark-Pro?

4. How are the multiple workloads combined to summarize the results in one single score?

2.2.2 Running the CoreMark-Pro in the Raspberry Pi 4

Last but not least, the interesting part: actually testing the algorithm! CoreMark-Pro can be compiled using its provided Makefile:

1. Go to its Github repository.

2. Choose one of the next approaches to compile and run the algorithm:
   (a) Investigate how to cross-compile it using our toolchain and then run the generated executables on the target.
   (b) Copy the sources into the RPI4 file system and perform a native compilation/execution.
   (c) Document your decision and justify why you selected it.


3. Investigate whether you can vary any compilation parameter as we did in Sections 2.1.2 and 2.1.3.
   (a) If you can, create a plot for each parameter you consider important and analyze it.
   (b) If you can't, support your decision with an analysis of the Makefiles and their available compilation options and compiler flags.

4. Based on your results and analysis, answer the question: is CoreMark-Pro a better benchmarking tool for our 64-bit, multi-core processor?

3 Application Description

3.1 Motivation

When working on embedded systems you are exposed to a wide range of devices; one of the most common you will use in your designs is an image (camera) sensor like the one shown in Figure 2 [1].

Figure 2: Image sensor Sony IMX-219

Camera sensors are equipped with a large set of features, including:

• Hue, gamma and sharpness controls.
• Lens correction.
• Stabilization.
• Defective pixel correction.
• Noise cancelling.
• Auto-focus.
• Etc.


As part of these features, the output format of the video content (the image content itself) may vary depending on the manufacturer and the part number. Some sensors, such as the OV5640 from OmniVision, can provide YUV, RGB and Bayer formats, while others, such as the Sony IMX-219, only provide the Bayer format. In fact, most sensors provide at least the Bayer format, and it is for this reason that most camera interfaces on SoCs support Bayer as the preferred data format. In some cases the SoC supports Bayer capture; for example, if you take a look at the camera port capabilities of the i.MX6 you will find:

Camera Ports. The role of these ports is to receive input from video sources (e.g. image sensors) and to provide support for time-sensitive control signals to the camera. (Non-time-sensitive controls, e.g. configuration and reset, are performed by the MCU through an I2C I/F or GPIO signals). Each of the camera ports includes the following features:

• Direct connectivity to most relevant external devices.
• Parallel interface - up to 20-bit data bus.
• Frame size: up to 8192 × 4096 pixels (including blanking intervals).
• Data formats supported include Raw (Bayer), RGB, YUV 4:4:4, YUV 4:2:2 and grayscale, up to 16 bits per value (component).

Although Bayer is reported as a supported input format, all the internal image and video processing blocks within the i.MX6 require the YUV format to process the image. Hence, a conversion from Bayer to YUV is required in order to process the image in the system. The Bayer color conversion is usually achieved in two stages:

• Bayer to RGB conversion.
• RGB to YUV conversion (* the workflow in this project).

The process of converting the Bayer format to any other color format is commonly called debayering or Bayer interpolation, and is depicted in Figure 3. Some SoCs already provide functional blocks optimized for this kind of conversion; however, in some cases there is no such hardware unit and we need to do the conversion entirely in software, which is inefficient and provides low performance. So, finding a way to accelerate this process is critical to improve the application performance.


Figure 3: Bayer Interpolation Process

3.2 Objective

In this section the student will learn about the debayering process and, at the same time, will understand how to prototype the color transformation (RGB888/YUV444p) with OpenCV. The application must then be accelerated with SIMD algorithms using NEON™, OpenMP and Pthread, guided by profiling analysis. Finally, the color transformation applications must be inherited in a Yocto image for the RPI4 from a custom meta-layer called, in this project, meta-hpec.

3.3 Image Data

For this project it is recommended to use an RGB888 image (image.rgb must be used in this project to develop the applications in C/C++, NEON Intrinsics, OpenMP and Pthread; for prototyping in OpenCV, a JPG image called imagergb.jpg must be used). The RGB888 image file, which is a raw image with 3 bytes per pixel (Red, Green, and Blue), is attached in the Project 2 folder. To visualize image.rgb, enter Rawpixels and consider the following configuration parameters:

• Width of 640.
• Height of 480.
• Offset 0.
• flip h, flip v and invert unmarked.
• Zoom of 0.
• Predefined format RGB24.
• Pixel Format RGBA.
• The option called Ignore Alpha unmarked.
• Pixel Plane Packed.

Based on these configuration parameters, in Rawpixels the image should look like the one in Figure 4. Hint: Remember that the image file generated by your application is in raw data form; this means it does not contain any header information about the image, so you cannot open it with a normal image viewer. In order to verify your YUV image you must configure the format accordingly. In order to understand the configuration parameters, watch the video here.

Figure 4: Sample image for this project in RGB888 format

In order to upload/download image files from the RPI4, pay attention to the following instructions:

1. SCP approach: read the information here.

2. To create a new folder in the root file system, the folder must allow read/write access; grant it with sudo chmod 777 <folder path>. This way you can move files from the microSD to the PC and vice versa.

4 Prototyping Color Transformation RGB/YUV with OpenCV

As learned in the second topic of our theoretical lecture, it is always good practice to prototype (or model) any project we are about to start; indeed, it should be the first step of the design process. In order to follow the proposed design model, we are first going to program our application using high-level libraries, leaving the low-level implementation and optimization steps for the next sections. To do so, we are going to use the widely used computer vision library called the Open Source Computer Vision Library (OpenCV). As explained by the OpenCV Organization: "it is an open source computer vision and machine learning software library. OpenCV was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in the commercial products. Being a BSD-licensed product, OpenCV makes it easy for businesses to utilize and modify the code."

Please implement the algorithm described in Section 3.2 by using OpenCV and C/C++ (a minimal prototype sketch is shown below). In order to run the algorithm in the target, we need to add the OpenCV libraries into the file system of the Raspberry Pi. To do so, we need to compile our Yocto project again and extract the compressed .tar into the shared NFS location again.
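As a reference, here is a hedged, minimal sketch of what the prototype could look like; the file names are placeholders, and whether you dump the raw bytes this way or choose another output layout is up to your design. Note that cv::imread() loads images in BGR channel order, so the conversion constant must match.

#include <fstream>
#include <opencv2/opencv.hpp>

int main()
{
    // OpenCV loads JPEG images in BGR channel order.
    cv::Mat bgr = cv::imread("imagergb.jpg");
    if (bgr.empty()) return 1;

    // Let OpenCV apply the RGB/YUV conversion matrix for us.
    cv::Mat yuv;
    cv::cvtColor(bgr, yuv, cv::COLOR_BGR2YUV);

    // Dump the raw, header-less bytes so the result can be checked in Rawpixels.
    std::ofstream out("outputCV.yuv", std::ios::binary);
    out.write(reinterpret_cast<const char *>(yuv.data),
              yuv.total() * yuv.elemSize());
    return 0;
}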

4.1 Adding OpenCV to our File System

Luckily, there is already support for OpenCV 4.1.0 in Yocto. To know if there is a recipe for a given package, you can review the OpenEmbedded Layer Index: click on the Recipes section, select your branch (zeus), and type the package you are looking for. There you will find that OpenCV 4.1.0 is provided by the meta-oe layer, that the file with its recipe is recipes-support/opencv/opencv_4.1.0.bb, and what its associated dependencies are. Why do we need to know all this? Because:

1. We know that we do not have to implement the recipe ourselves.

2. We know which meta layer we need to include to have OpenCV 4.1.0 support.

3. If we want to select which components of the OpenCV package are compiled into our file system (e.g. for storage reasons), we can take a look at the recipes-support/opencv/opencv_4.1.0.bb recipe and tell which elements are optional.

So, now that we have talked a bit about the theory, please add the OpenCV package into the file system of your Raspberry Pi. The basic steps to follow are:

1. To avoid RAM problems while compiling, make sure you have an 8 GB swap. To do so, you can follow the instructions in this link.

2. Make sure you have cloned the zeus branch of the meta-openembedded layer into your Yocto folder.

3. Source the oe-init-build-env of your poky folder.

4. Make sure you have added /home/project2/Yocto/meta-openembedded/meta-oe to your conf/bblayers.conf file.

5. Investigate how to add the OpenCV package into your local.conf.

6. Compile your file system again, as we did in the first project.

7. Delete the old file system from the NFS shared folder (but not the rootfs folder itself):


$ rm -rf /var/nfs/rootfs/*

8. Unpack the freshly compiled file system into /var/nfs/rootfs, just as we did in the first project.

How can you remove a package (e.g. gstreamer) from the OpenCV compilation?

The result of your prototype of the algorithm must be validated with Rawpixels.

5 C Implementation of the Application

5.1 RGB-YUV conversion using C/C++

For this part of the project you will create a simple C/C++ program named rgb2yuv_c which shall meet the following requirements:

1. For the color transformation you must use the image file image.rgb to perform the RGB/YUV transformation. In order to understand the color transformation matrix processing, pay attention to this link.

2. You must create a new Yocto recipe folder called rgb2yuv-c in a meta-layer called meta-hpec.

3. The program should provide the following options using the getopt API, and the order of the parameters should not affect the correct operation of your application. You will find some useful information on getopt here.

Usage
./rgb2yuv_c [ -i RGBfile ] [ -o YUVfile ] [-h] [-a]
 -i RGBfile specifies the RGB file to be converted.
 -o YUVfile specifies the output file name.
 -a displays the information of the author of the program.
 -h displays the usage message to let the user know how to execute the application.

Yocto prompt:
rgb2yuv_c -i image.rgb -o outputC.yuv


4. You are free to choose any build system, such as GNU make, Automake or CMake. Just make sure it builds correctly and is fully integrated with the Yocto recipe. (It is highly recommended to use the Autotools process.)

5. The conversion process MUST be done in a C/C++ function with the following proposed prototype:

void rgb2yuv (char *input_image, char *output_image)

You can vary the arguments but NOT the function name.

6. Remember that your output image (outputC.yuv) must be validated using Rawpixels.

7. Don't forget: all binaries MUST be installed in the file system as part of the recipe compilation.

A minimal sketch of how the pieces could fit together is shown below.
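This hedged sketch covers requirements 1, 3 and 5, assuming a 640×480 RGB24 input, a packed YUV444 output, and a common integer approximation of the BT.601 matrix; the exact coefficients must come from your own investigation of the link above, and the author string is a placeholder.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>

/* Assumptions for illustration only: image size and the packed
 * YUV444 output layout are not project requirements. */
#define WIDTH  640
#define HEIGHT 480

void rgb2yuv(char *input_image, char *output_image)
{
    static uint8_t rgb[WIDTH * HEIGHT * 3], yuv[WIDTH * HEIGHT * 3];
    FILE *in  = fopen(input_image,  "rb");
    FILE *out = fopen(output_image, "wb");
    if (!in || !out) { perror("fopen"); exit(EXIT_FAILURE); }
    if (fread(rgb, 1, sizeof rgb, in) != sizeof rgb) {
        fprintf(stderr, "short read: is the input a %dx%d RGB24 file?\n", WIDTH, HEIGHT);
        exit(EXIT_FAILURE);
    }

    /* Common integer approximation of the BT.601 conversion matrix. */
    for (long i = 0; i < (long)WIDTH * HEIGHT * 3; i += 3) {
        int r = rgb[i], g = rgb[i + 1], b = rgb[i + 2];
        yuv[i]     = (uint8_t)((( 66 * r + 129 * g +  25 * b + 128) >> 8) +  16); /* Y */
        yuv[i + 1] = (uint8_t)(((-38 * r -  74 * g + 112 * b + 128) >> 8) + 128); /* U */
        yuv[i + 2] = (uint8_t)(((112 * r -  94 * g -  18 * b + 128) >> 8) + 128); /* V */
    }
    fwrite(yuv, 1, sizeof yuv, out);
    fclose(in);
    fclose(out);
}

int main(int argc, char *argv[])
{
    char *input = NULL, *output = NULL;
    int opt;

    /* getopt(3) accepts the options in any order, as required. */
    while ((opt = getopt(argc, argv, "i:o:ah")) != -1) {
        switch (opt) {
        case 'i': input  = optarg; break;
        case 'o': output = optarg; break;
        case 'a': puts("Author: <your team here>"); return 0; /* placeholder */
        case 'h':
        default:
            printf("Usage: %s [ -i RGBfile ] [ -o YUVfile ] [-h] [-a]\n", argv[0]);
            return 0;
        }
    }
    if (!input || !output) {
        fprintf(stderr, "Both -i and -o are required; try -h.\n");
        return 1;
    }
    rgb2yuv(input, output);
    return 0;
}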

5.2 Benchmark and Analyze your Implementation

In this section we will measure, from within your program, the time spent by the function rgb2yuv() and print it out on the console after the process has finished. Hint: in order to estimate the time spent, pay attention to this and this link (a minimal clock_gettime() sketch is also given at the end of this section). Now we will perform a series of "tricks" to try to reduce the application run-time without manually optimizing the code:

1. Scheduler Priority Test: Increase the scheduling priority of your application to try to speed it up, and measure the time it takes. Does it make a difference? To change the priority of your process, please use and compare these two commands:

(a) nice: Changes the niceness of a process to increase its scheduler priority. Please increase the niceness to the maximum for a user-space application and measure the time it takes with the "time" command.

(b) chrt: Modifies the "real-time" scheduling attributes of a process. Please raise the FIFO scheduling priority to the maximum and measure the time it takes for your application to be executed. Report whether you get better results by modifying any other parameter of the chrt command (try: chrt --help).

(c) Analyze your results and compare between:

i. Standard scheduling policy.


ii. Increased scheduling niceness.

iii. Increased real-time scheduling attributes.

2. Turn off Kernel Frequency Scaling: Normally the kernel scales down the frequency of the processor cores according to the current load of the system in order to save energy. We can of course turn this feature off and see whether we get better results by continuously using the highest frequency on a single core. To do so, do the following:

(a) You can see the current frequency of a core by typing:

@ watch -n 1 cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
# This will show the current frequency of Core 0 every second

(b) We can turn off frequency scaling by setting the minimum of the kernel frequency scaling to the processor's maximum frequency, as follows:

@ echo 1500000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq

Note: Check with the command from the last step that the current frequency has changed to 1.5 GHz.

(c) You can now execute your application on Core 0 by typing:

@ taskset --cpu-list 0 "my_command"

Please measure the execution time of your application without frequency scaling and report any difference. Note: You can also turn off the frequency scaling of all cores and avoid the taskset command; this is important if you are using multi-threading.

(d) When you are done, reactivate the frequency scaling by typing:

@ echo 600000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq


3. Perform compiler optimization of your code just as described in Section 2.1.3. Make a plot and analyze your results.

There are 3 major points to be evaluated in this section:

1. The substitution of the described high-level functions with your OWN "low-level" code, in this case the RGB/YUV C implementation.

2. The benchmark of your application as described in this section.

3. The validation of your output with Rawpixels.
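As referenced at the beginning of this section, here is a minimal sketch of one way to time rgb2yuv() from within the program using clock_gettime(); the wrapper function name is illustrative, and only the rgb2yuv() prototype is fixed by the project.

#include <stdio.h>
#include <time.h>

void rgb2yuv(char *input_image, char *output_image);  /* required prototype */

/* Wrap the conversion with CLOCK_MONOTONIC timestamps; this clock is
 * not affected by system time adjustments, so it is safe for benchmarking. */
void timed_rgb2yuv(char *in, char *out)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    rgb2yuv(in, out);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("rgb2yuv() took %.3f ms\n", ms);
}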

6 Profiling Analysis of the C Implementation with perf

Once we have prototyped and correctly implemented our application in a lower-level programming language, the next step will be to find its performance bottlenecks and try to optimize it using different approaches! A good starting point for detecting performance hot-spots is a profiler tool. It gives us an idea about the time allocation of the processor across all tasks happening in a certain time window. The perf profiling tool can give us this information not only for user-space but also for kernel tasks.

6.1 Adding perf to our File System

We haven't yet compiled any profiler into our file system. Luckily, there is good official documentation and support to accomplish this task:

1. First, let us take a look at the OpenEmbedded Layer Index to check whether we have to implement it ourselves or if it is already supported in a layer. As you can see in this link, our perf recipe is already implemented in the OpenEmbedded core layer, so we got lucky! Just make sure you have cloned this repository and added its layer, just as we did in Section 4.1.

2. Then follow Chapters 1 and 2 of this official guide to have profiling capabilities in our system.
   (a) Note: To avoid RAM problems while compiling, make sure you have an 8 GB swap. To do so, you can follow the instructions in this link.

3. Once you have successfully compiled your file system with perf, remove the old file system from /var/nfs/rootfs/* and unpack the new one there.

4. Restart your system and run:


@ perf top
# This command shows the overhead of the current tasks (kernel space and user space)

If this command runs, you have successfully compiled perf in your file system.

6.2 Profiling with Perf

Now that we have installed it, we will check how we actually profile with perf:

1. We will first clone the Flame Graphs repository to better visualize the profiling data. Clone this repository in your virtual machine and remember its local path.

2. One command that we can use to profile a program with perf is (remember: "@" commands are to be executed in the target's Linux; more info in Section 1.2.1):

@ perf record -F 99 -a -g -- "my program"
# Take a look at the perf help to know what the switches mean in the last command.

3. As an example, we will profile the sleep command line program:

@ perf record -F 99 -a -g -- sleep 60

# Then run:
@ ls -alh
-rw------- 1 root root 39K Jun 11 2020 perf.data
# As you see, we just generated a perf.data file

4. Then, to resolve symbols type:

@ perf script > perf.script
# This command will translate the perf.data into a perf.script file.

5. Now copy the resulting perf.script to your virtual machine and execute:

$ ./path/to/FlameGraph/stackcollapse-perf.pl perf.script > /tmp/perf.folded
# This will take the perf events and translate them so that Flame Graphs can work with them.

Note that, as we are using NFS, the perf.script is accessible from both the Raspberry Pi and Ubuntu, so copying it is actually not strictly necessary.

6. The next step is to create a scalable vector graphic out of the perf.folded file; to do so, type in your virtual machine:

$ ./path/to/FlameGraph/flamegraph.pl /tmp/perf.folded > perf.svg
# This will create a .svg to be analyzed

7. The last step is the visualization and analysis of our profiling data. Just open the perf.svg generated previously with a program capable of showing .svg files:

$ firefox perf.svg

You should be able to see an image like Figure 5 or the one in this link.

Figure 5: Scalable Vector Graphics image out of perf profiling data

6.3 Profile Your Application

Using the provided information about perf, profile your application and determine:


1. What are the performance bottlenecks of your application? Can you identify the critical functions in your code?

2. How does the profiling data of your own implementation differ from that of the prototype implementation? Are the bottlenecks somewhere else?

3. Design a draft strategy to optimize your application based on your profiling data. What would you do, and how would you speed up your code?

4. Generate .svg profiling data for your prototype and low-level implementations, together with a deep analysis of your profiling information and the draft of your optimization strategy.

7 C Implementation with NEON Intrinsics

7.1 RGB-YUV conversion using C/C++ and NEON Intrinsics

For this part of the project you will create a C/C++ program named rgb2yuv_intrinsics, based on your C/C++ implementation, which shall meet the following requirements:

1. For the color transformation you must use the image file image.rgb to perform the RGB/YUV transformation. In order to understand the color transformation matrix processing, pay attention to this link.

2. You must create a new Yocto recipe folder called rgb2yuv-intrinsics in a meta-layer called meta-hpec.

3. The program should provide the following options using the getopt API, and the order of the parameters should not affect the correct operation of your application. You will find some useful information on getopt here.

Usage
./rgb2yuv_intrinsics [ -i RGBfile ] [ -o YUVfile ] [-h] [-a]
 -i RGBfile specifies the RGB file to be converted.
 -o YUVfile specifies the output file name.
 -a displays the information of the author of the program.
 -h displays the usage message to let the user know how to execute the application.

Yocto prompt:
rgb2yuv_intrinsics -i image.rgb -o outputIN.yuv


4. You are free to choose any build system, such as GNU make, Automake or CMake. Just make sure it builds correctly and is fully integrated with the Yocto recipe. (It is highly recommended to use the Autotools process.)

5. The conversion process MUST be done in a C function with the following proposed prototype:

void rgb2yuv (char *input_image, char *output_image)

You can vary the arguments but NOT the function name.

6. The implementation of the conversion algorithm MUST use the NEON™ Intrinsics approach (a hedged sketch of the idea follows this list). In this case it is highly recommended to consider the following information:
   • ARM Infocenter here.
   • Basic instructions (*) here.
   • Introduction to Neon for ARMv8 here; another approach here.
   • Compiling process here.
   • GCC compiler optimization for ARM-based systems here; pay attention to the mtune features.
   • Maximizing SIMD performance here; pay attention to the optimization and vectorization compile directions.

7. Measure, from within your program, the time spent by the function rgb2yuv() and print it out on the console after the process has finished. Hint: in order to estimate the time spent, pay attention to this and this link (see also the timing sketch at the end of Section 5.2).

8. Remember that your output image (outputIN.yuv) must be validated using Rawpixels.

9. Don't forget: all binaries MUST be installed in the file system as part of the recipe compilation.
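To make the intrinsics approach concrete, here is a hedged sketch of a kernel converting 8 pixels per call, using the same integer BT.601-style coefficients as the scalar sketch in Section 5.1; the function name, the planar output and the coefficient choice are assumptions you should adapt to your own design.

#include <arm_neon.h>
#include <stdint.h>

/* Convert 8 RGB24 pixels into planar Y, U and V samples. */
void rgb2yuv_neon8(const uint8_t *rgb, uint8_t *y, uint8_t *u, uint8_t *v)
{
    /* vld3 de-interleaves: p.val[0]=R, p.val[1]=G, p.val[2]=B (8 pixels). */
    uint8x8x3_t p = vld3_u8(rgb);

    /* Widen to 16 bits so the multiply-accumulates cannot overflow. */
    uint16x8_t r = vmovl_u8(p.val[0]);
    uint16x8_t g = vmovl_u8(p.val[1]);
    uint16x8_t b = vmovl_u8(p.val[2]);

    /* Y = ((66R + 129G + 25B + 128) >> 8) + 16 */
    uint16x8_t yv = vmulq_n_u16(r, 66);
    yv = vmlaq_n_u16(yv, g, 129);
    yv = vmlaq_n_u16(yv, b, 25);
    yv = vshrq_n_u16(vaddq_u16(yv, vdupq_n_u16(128)), 8);
    vst1_u8(y, vqmovn_u16(vaddq_u16(yv, vdupq_n_u16(16))));

    /* U and V have negative coefficients, so switch to signed arithmetic. */
    int16x8_t rs = vreinterpretq_s16_u16(r);
    int16x8_t gs = vreinterpretq_s16_u16(g);
    int16x8_t bs = vreinterpretq_s16_u16(b);

    /* U = ((-38R - 74G + 112B + 128) >> 8) + 128 */
    int16x8_t uv = vmulq_n_s16(rs, -38);
    uv = vmlaq_n_s16(uv, gs, -74);
    uv = vmlaq_n_s16(uv, bs, 112);
    uv = vshrq_n_s16(vaddq_s16(uv, vdupq_n_s16(128)), 8);
    vst1_u8(u, vqmovun_s16(vaddq_s16(uv, vdupq_n_s16(128))));

    /* V = ((112R - 94G - 18B + 128) >> 8) + 128 */
    int16x8_t vv = vmulq_n_s16(rs, 112);
    vv = vmlaq_n_s16(vv, gs, -94);
    vv = vmlaq_n_s16(vv, bs, -18);
    vv = vshrq_n_s16(vaddq_s16(vv, vdupq_n_s16(128)), 8);
    vst1_u8(v, vqmovun_s16(vaddq_s16(vv, vdupq_n_s16(128))));
}

On AArch64, arm_neon.h is available without extra compiler flags; a loop over the image would call this kernel every 24 input bytes, handling any tail that is not a multiple of 8 pixels with the scalar code.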

8 Optimizing our C implementation with Multi-Threading

We will explore two parallelization strategies to optimize our application: the POSIX Threads (pthread) library and the Open Multi-Processing API (OpenMP). In order to understand the recipe procedure for using OpenMP and Pthread in Yocto, watch this video.


8.1 RGB-YUV conversion using Pthread

By using the POSIX Threads library you can take advantage of the 4 cores available in your RPI4 to speed up your application. Your task is to modify your code so that you run time-consuming tasks in parallel:

1. For the color transformation you must use the image file image.rgb to perform the RGB/YUV transformation. In order to understand the color transformation matrix processing, pay attention to this link.

2. You must create a new Yocto recipe folder called rgb2yuv-pthread in a meta-layer called meta-hpec.

Usage
./rgb2yuv_pthread [ -i RGBfile ] [ -o YUVfile ] [-h] [-a]
 -i RGBfile specifies the RGB file to be converted.
 -o YUVfile specifies the output file name.
 -a displays the information of the author of the program.
 -h displays the usage message to let the user know how to execute the application.

Yocto prompt:
rgb2yuv_pthread -i image.rgb -o outputPT.yuv

3. You are free to choose any build system, such as GNU make, Automake or CMake. Just make sure it builds correctly and is fully integrated with the Yocto recipe. (It is recommended to use the Autotools process.)

4. The conversion process MUST be done in a C/C++ function with the following proposed prototype:

void rgb2yuv (char *input_image, char *output_image)

You can vary the arguments but NOT the function name.

5. Using the profiling information and your knowledge of your application, determine the most time-consuming functions in your code.

6. Among those time-consuming tasks, determine which ones you can run independently from each other.

7. Use the pthread library to run those functions in parallel. The pthread library is well documented and there is a lot of helpful information on the web (a hedged row-partitioning sketch is shown after this list).


8. Prove that your results do not differ from those of your original implementation.

9. Measure the time your application takes with and without the pthread optimization. To do so, take a look at this and this.

10. Analyze your results.

11. Remember that your output image (outputPT.yuv) must be validated using Rawpixels.

12. Don't forget: all binaries MUST be installed into the file system as part of the recipe compilation.

NOTE: In your Eclipse project, you have to add pthread to the linker libraries so that you don't have compilation problems. Go to Project properties → C/C++ Build → Settings → GCC C++ Linker → Libraries and add "pthread".
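Since every output pixel depends only on its own input pixel, the image rows can be split into independent bands, one per thread. The sketch below assumes the in-memory buffers from the Section 5.1 sketch and hard-codes 4 threads to match the RPI4; it is an illustration of the partitioning idea, not the required structure.

#include <pthread.h>
#include <stdint.h>

#define NUM_THREADS 4   /* matches the RPI4's four cores */

/* Each thread converts an independent horizontal band of rows. */
typedef struct {
    const uint8_t *rgb;          /* interleaved RGB24 input */
    uint8_t *yuv;                /* packed YUV444 output    */
    int start_row, end_row, width;
} band_t;

static void *convert_band(void *arg)
{
    band_t *bd = (band_t *)arg;
    for (int row = bd->start_row; row < bd->end_row; row++) {
        for (int col = 0; col < bd->width; col++) {
            long i = ((long)row * bd->width + col) * 3;
            int r = bd->rgb[i], g = bd->rgb[i + 1], b = bd->rgb[i + 2];
            bd->yuv[i]     = (uint8_t)((( 66 * r + 129 * g +  25 * b + 128) >> 8) +  16);
            bd->yuv[i + 1] = (uint8_t)(((-38 * r -  74 * g + 112 * b + 128) >> 8) + 128);
            bd->yuv[i + 2] = (uint8_t)(((112 * r -  94 * g -  18 * b + 128) >> 8) + 128);
        }
    }
    return NULL;
}

/* Bands never overlap, so no locking is needed. */
void convert_parallel(const uint8_t *rgb, uint8_t *yuv, int width, int height)
{
    pthread_t tid[NUM_THREADS];
    band_t band[NUM_THREADS];
    int rows_per_band = height / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        band[t].rgb = rgb;
        band[t].yuv = yuv;
        band[t].width = width;
        band[t].start_row = t * rows_per_band;
        /* The last band absorbs any remainder rows. */
        band[t].end_row = (t == NUM_THREADS - 1) ? height : (t + 1) * rows_per_band;
        pthread_create(&tid[t], NULL, convert_band, &band[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
}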

8.2 RGB-YUV conversion using OpenMP

In order to use the OpenMP API in the Yocto project, it is highly recommended to follow the next considerations:

1. Look for your local.conf file in your folder /build/conf/local.conf and include the following dependencies. HINT: Pay attention here.

IMAGE_INSTALL_append = " libgomp libgomp-dev libgomp-staticdev glibc-staticdev"

2. Once the local.conf file has been modified, you have to build your Yocto image using the bitbake command, in this case bitbake core-image-base.

3. To compile OpenMP within the Autotools framework, it is necessary to add a CFLAGS directive with -fopenmp to Makefile.am. More information here.

4. In order to improve the execution time of your C application using OpenMP pragmas, base your solution on your profiling results. More information on using threads here, an example meta-layer here, and consider this statement here in order to use more threads (a hedged parallel-for sketch is shown at the end of this section).

5. You must create a new Yocto recipe folder called rgb2yuv-openmp.

6. The program should provide the following options using the getopt API, and the order of the parameters should not affect the correct operation of your application. You will find some useful information on getopt here.

Usage
./rgb2yuv_openmp [ -i RGBfile ] [ -o YUVfile ] [-h] [-a] [-t threads_number]
 -i RGBfile specifies the RGB file to be converted.
 -o YUVfile specifies the output file name.
 -t specifies the number of threads.
 -a displays the information of the author of the program.
 -h displays the usage message to let the user know how to execute the application.

Yocto prompt:
rgb2yuv_openmp -i image.rgb -o outputOM.yuv -t 2

7. You are free to choose any build system, such as GNU make, Automake or CMake. Just make sure it builds correctly and is fully integrated with the Yocto recipe. (It is recommended to use the Autotools process.)

8. The conversion process MUST be done in a C/C++ function with the following proposed prototype:

void rgb2yuv (char *input_image, char *output_image)

You can vary the arguments but NOT the function name.

9. Remember that your output image (outputOM.yuv) must be validated using Rawpixels.

10. Don't forget: all binaries MUST be installed in the file system as part of the recipe compilation.
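As referenced in consideration 4, here is a hedged sketch of how the conversion loop could be parallelized with an OpenMP pragma, honoring the -t option via omp_set_num_threads(); the buffer layout and coefficients follow the earlier sketches and are assumptions, not requirements.

#include <omp.h>
#include <stdint.h>

/* Parallel RGB24 -> packed YUV444 conversion; compile with -fopenmp. */
void convert_omp(const uint8_t *rgb, uint8_t *yuv,
                 int width, int height, int nthreads)
{
    omp_set_num_threads(nthreads);   /* honor the -t command-line option */

    /* Each pixel is independent, so the row loop parallelizes trivially. */
    #pragma omp parallel for schedule(static)
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            long i = ((long)row * width + col) * 3;
            int r = rgb[i], g = rgb[i + 1], b = rgb[i + 2];
            yuv[i]     = (uint8_t)((( 66 * r + 129 * g +  25 * b + 128) >> 8) +  16);
            yuv[i + 1] = (uint8_t)(((-38 * r -  74 * g + 112 * b + 128) >> 8) + 128);
            yuv[i + 2] = (uint8_t)(((112 * r -  94 * g -  18 * b + 128) >> 8) + 128);
        }
    }
}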

8.3 Yocto Testing

In order to report the functionality of your Yocto recipes, your video recording must show an execution like the one in Figure 6. NOTE: Remember not to copy your binary files into the Yocto image manually; the binaries MUST be built into the Yocto image through the bitbake command.


Figure 6: Binary execution in the Yocto image of the RGB/YUV applications for each approach.

9 Optional: Optimization Competition

This section is optional and grants:

1. 15 extra points to the first place
2. 10 extra points to the second place
3. 5 extra points to the third place

What do you have to do?

1. Optimize the run-time of your application using a custom approach based on the methods introduced in this project and additional methods that you investigate yourself. This includes:
   (a) Multi-threading (pthread, OpenMP, etc.)
   (b) NEON optimization
   (c) Compiler optimization
   (d) Modifying kernel frequency scaling and scheduling policies
   (e) Custom optimization of your C code
   (f) Use of the Floating Point Unit (FPU)

2. Demonstrate in your video and report:


(a) The time it takes for your application to process the provided image (the same image for all teams).

(b) The validation of your output image (outputOM.yuv) using Rawpixels.

10 Deliverables and Grading

This section explains the workflow requirements, deliverables and grading related to this project. Please read it and make sure to organize your project execution and deliverables over the whole working time, and not just before the deadline. The grading of this project will be based on:

1. Written report (PDF).
2. Video.
3. Git repository.

10.1 Workflow requirements

In relation to the Git repository, several aspects are to be considered. It is important to follow them in order to organize your code development and prepare your repository:

1. Open a Git account on a free Git hosting service (GitHub).

2. Create a Git repository named as follows: <group#>_HPEC_2020, where group# is the assigned group number. For example, for Group 1, the name of the Git repository shall be group1_HPEC_2020.

3. If the repository is private, then provide access to the users sercr0388 and jmarayam (GitHub) and make sure to grant write permissions.

4. The Git repository shall contain two main branches: master and develop.

5. Initially, the develop branch is created from master.

6. When working on the project, the student shall create a new working branch from develop, and when the feature is ready the branch must be merged back into develop. Any additional fix or modification after the merge requires the process to be repeated (i.e. create the branch from develop and merge the changes later). Once the code in develop is ready, it shall be merged into master and a tag must be created. The process is described in Figure 7. It is recommended to read more information here.


Figure 7: Repository structure for assignments or projects

10.2 Folder Structure of your Deliverables

In relation to the deliverables, documentation and folder structure, you MUST follow the folder organization in your user_id.tar.gz (e.g. group1_HPEC_2020) and Git repository as depicted below:

master ---- project_2
 |--- Report
 |     |--- Report.pdf
 |--- Video
 |     |--- Video_Link.pdf
 |--- Prototyping
 |     |--- prototype file (OpenCV)
 |--- Custom Meta-layer
 |     |--- meta-hpec
 |           |--- recipes rgb2yuv_c
 |           |--- recipes rgb2yuv_intrinsics
 |           |--- recipes rgb2yuv_pthread
 |           |--- recipes rgb2yuv_openmp
 |--- Image Outputs
 |     |--- YUV_c.yuv
 |     |--- YUV_intrinsics.yuv
 |     |--- YUV_pthread.yuv
 |     |--- YUV_openmp.yuv

10.3 Grading

The grading of this project will be based on Table 2.


Table 2: Project’s evaluation criteria

Item: Report, Video and Repository

  - CoreMark: experiments 1 and 2, and analysis; CoreMark-Pro and analysis — 10 %
  - OpenCV prototype of the algorithm: add OpenCV to the file system; OpenCV prototype implementation and validation — 10 %
  - Own C/C++ implementation of the algorithm: C/C++ implementation and validation (15 %); initial benchmark of your application (5 %) — 20 %
  - Perf profiling data and analysis based on the C implementation (include .svg profiling data) — 10 %
  - Application optimization: NEON Intrinsics (10 %), Pthread (10 %), OpenMP (10 %). NOTE: The resulting .yuv must be validated (Rawpixels) for ALL implementations to avoid losing points. — 30 %

Item: Custom Yocto Application

  - Yocto, workflow, Git and application: correct creation of the Yocto meta-layer; Autotools, GNU make or CMake correctly used to compile the program; Git version control with the suggested layout and the required workflow, delivered correctly; correct usage of getopt for the command-line options; all application versions included in the file system and fully functional, providing correct estimation results — 20 %

Total — 100 %

10.4 Deliverables Submission

To submit your deliverables you have to:

1. Before sending your documentation and deliverables, please follow the folder structure established in Section 10.2.

2. Compress your deliverables folder in zip format and look in TEC-Digital for the delivery link for Project 2.

3. The delivery date is Wednesday, July 29th 2020, with a deadline of 23:45; afterwards, the access link in TEC-Digital will be closed.

References

[1] M. Madrigal. Bump it up!, 2017. [Online; accessed 10-June-2020].
