LECTURE: MP-6171 SISTEMAS EMPOTRADOS DE ALTO DESEMPEÑO

Project 2: Benchmarking, Profiling and Optimizing an Application for an ARM-based Embedded Device

Lecturers: MSc. José Araya Martínez, MSc. Sergio Arriola-Valverde

First term, 2020

Contents

1 Introduction
  1.1 Administrative Matters
    1.1.1 Team Formation
    1.1.2 Forum and Communication
    1.1.3 Plagiarism
  1.2 Development Environment Overview
    1.2.1 Nomenclature

2 CoreMark and CoreMark-Pro: Benchmarking the Raspberry Pi 4
  2.1 The Simple, yet Sophisticated CoreMark
    2.1.1 Compiling and Running the CoreMark Benchmark
    2.1.2 Experiment 1: Multi-Threading and the CoreMark Benchmark
    2.1.3 Experiment 2: Compiler Optimization of "non-optimizable" Code
  2.2 The More Comprehensive CoreMark-Pro
    2.2.1 Understanding the Concept of the Improved Algorithm
    2.2.2 Running the CoreMark-Pro in the Raspberry Pi 4

3 Application Description
  3.1 Motivation
  3.2 Objective
  3.3 Image Data

4 Prototyping Color Transformation RGB/YUV with OpenCV
  4.1 Adding OpenCV to our File System

5 C Implementation of the Application
  5.1 RGB-YUV conversion using C/C++
  5.2 Benchmark and Analyze your Implementation

6 Profiling Analysis of the C Implementation with perf
  6.1 Adding perf to our File System
  6.2 Profiling with Perf
  6.3 Profile Your Application

7 C Implementation with NEON Intrinsics
  7.1 RGB-YUV conversion using C/C++ and NEON Intrinsics

8 Optimizing our C implementation with Multi-Threading
  8.1 RGB-YUV conversion using Pthread
  8.2 RGB-YUV conversion using OpenMP
  8.3 Yocto Testing

9 Optional: Optimization Competition

10 Deliverables and Grading
  10.1 Workflow requirements
  10.2 Folder Structure of your Deliverables
  10.3 Grading
  10.4 Deliverables Submission


1 Introduction

In this project, each team will carry out an analysis of an image in RGB24 format. A color space transformation is to be implemented in order to convert an RGB24 image to the YUV image format. Before starting the implementation, it is highly recommended to investigate the numerical matrix used for the color space transformation. In order to analyze, estimate and understand the color space transformation, a sample software prototype will be developed (in OpenCV) using a JPEG image attached to this project. Due to several hardware and software restrictions, it is necessary to port your application from a high-level approach in OpenCV to a low-level approach, for instance in C. This ported application must be integrated in a custom meta-layer and inherited in a Yocto image for a RPI4 in order to estimate its execution time in a first stage. In embedded systems, it is common practice to profile and benchmark an application in order to improve its performance and efficiency. In this project, benchmarking and profiling will be used to optimize your application based on the following approaches: ARM NEON Intrinsics, OpenMP and Pthread, all of them integrated in a RPI4 Yocto image.

1.1 Administrative Matters

Before we get down to business, some administrative affairs are to be discussed:

1.1.1 Team Formation

Use the initial work team organization defined previously. In this case, a maximum of two students per group is allowed.

1.1.2 Forum and Communication

This project will be evaluated remotely; for this reason, having a suitable online platform is very important to facilitate communication between students and lecturers. In order to do so, we will adopt a "community" approach by means of a forum on the TecDigital platform. In the forum, all students can create new topics as questions arise. All discussions are public to all the course members, so that any fellow student can answer and/or add more information to the proposed question. Please avoid sending project-related queries directly to the lecturers' Email, as this prevents other students from benefiting from the answer. In the forum we are all a team! The only restriction is to not share working source code in the forum; instead, we can create discussions and propose ideas and concepts that lead to the solution.


1.1.3 Plagiarism

Any evidence of plagiarism will be thoroughly investigated. If it is corroborated, the team will receive a zero grade in the project and the incident will be communicated to the corresponding institutional authorities for further processing.

1.2 Development Environment Overview

Embedded systems are ubiquitous; they are present everywhere in our daily-life activities. This project will introduce students to both the setup of the development environment and some best design and implementation practices concerning embedded systems. A Raspberry Pi 4 will be used as the target platform. As Figure 1 shows, our development environment consists of two main components:

• Host: Serves as the main development platform. As it typically has more computing capacity than the target, all the design and implementation will be done here.

• Target: This is the test and debug environment. Once a design phase has concluded, the target platform is used as a test system. Here we validate correct functioning, and profile and benchmark our application. It is even possible to measure the energy consumption and efficiency of our algorithms for energy-aware applications.

Figure 1: General view of the development environment

It is worth noting that there are two connections between the host and the target:

• UART: Thanks to its simplicity, this is usually the first communication established with the target, even at early stages of the board bring-up. It will serve to send commands to the bootloader and to get logging information of the boot process.

• Ethernet: Because UART does not allow a high transfer rate (normally up to a range of hundreds of kBps), we need a faster communication method to share large files in a reasonable amount of time. For this reason, the TFTP protocol will be used over Ethernet to share the device tree and kernel at boot time. In addition, the target will mount a file system present in the host over the NFS protocol.

1.2.1 Nomenclature

As we will work with different command-line consoles during the setup of the development environment, it is necessary to specify where the commands are to be executed. Table 1 summarizes the prompt symbols for the different command-line consoles.

Table 1: Prompt symbols for the multiple command-line consoles during the project.

Prompt symbol   Description
$               Host: Linux PC
=>              Target: U-Boot
@               Target: Linux

2 CoreMark and CoreMark-Pro: Benchmarking the Raspberry Pi 4

Creating a standard and universal benchmark is not a trivial task, as hardware may vary greatly from device to device. Over the years, many attempts have been made to standardize the performance evaluation of embedded devices, and many of them are now obsolete. A good starting point to measure the performance of our embedded system nowadays is the CoreMark benchmark developed by the non-profit EEMBC community. We are going to evaluate two variants of the algorithm:

• On one hand, the original CoreMark algorithm, which was developed as a general-purpose benchmark to measure the performance of microcontrollers (MCUs) and central processing units (CPUs) used in embedded systems.

• On the other hand, the more advanced CoreMark-Pro, which builds on the original CoreMark benchmark by adding context-level parallelism and 7 new workloads covering integer and floating-point performance.

Please follow the steps, and make sure you include your results and answer the questions in your written report.


2.1 The Simple, yet Sophisticated CoreMark

The first step will be to get to know the CoreMark benchmark. To do so, answer briefly the next questions in your report:

1. Briefly describe the theory of operation of the benchmark algorithm. Make sure you add a short description of its three main algorithms:
   (a) Linked List
   (b) Matrix Multiply
   (c) State Machine

2. How does the CoreMark benchmark try to deal with compiler optimization to come up with a standardized result? Make sure you include the next concepts in your description:
   (a) Compile time vs. run time
   (b) Volatile variables
   (c) Input-dependent results by using time-based, scanf and command-line parameters

3. What is the difference between the "core_portme" and the "core" files? Are we allowed to modify all of them?

2.1.1 Compiling and Running the CoreMark Benchmark

The last part of this section is to actually compile the benchmark, check how independent it is from the compiler optimization, and analyze our results. To do so, please follow the next steps:

1. Go to the CoreMark Github and clone it in your host.

2. Follow the Readme.md file of the repository and port the benchmark to be executed in our 64-bit Linux system.
   (a) NOTE: Here is a fall-back way for you to compile it if you cannot compile it with the provided make:
       i. Source the environment of your toolchain as we did in Section 3.9.1 of the first project.
       ii. Note that you will need to modify the compiler flags in the next sections. The following command just provides a starting point for compilation:


$ ${CC} -O2 -Ilinux64 -I. -DFLAGS_STR=\""-O2 -DPERFORMANCE_RUN=1 -lrt"\" \
    -DITERATIONS=0 -DPERFORMANCE_RUN=1 core_list_join.c core_main.c \
    core_matrix.c core_state.c core_util.c linux64/core_portme.c \
    -o ./coremark.exe -lrt

3. Source our Yocto-created toolchain to be able to cross-compile the code (source /opt/poky/3.0.2/environment-setup-aarch64-poky-linux).

4. Make sure you can compile the benchmark following the console command explained in the Readme.md file.
   (a) Note: As we are cross-compiling for ARM, we have to use the aarch64-poky-linux-gcc compiler instead of the native gcc the command in the Readme.md suggests. To do so, consider using $CC instead of gcc.

5. Run the generated binary in your RPI4 to make sure the compilation was successful.

6. Set the ITERATIONS variable so that CoreMark runs for at least 20 seconds in the RPI4. Use this value in all compilations of the next 2 sections.

In the following two sections we will explore two experiments related to the benchmark: the effect of compiling the code with multi-threading, and letting a modern compiler try to optimize code which was originally designed to be compiler-independent.

2.1.2 Experiment 1: Multi-Threading and the CoreMark Benchmark

As explained in the Readme.md, you can let the compiler know how many threads your platform supports. Follow the next simple steps to explore the impact of this compiler flag on the overall performance of the benchmark:

1. Check how many threads your RPI4 can handle (a small sketch after this list shows one way to query the core count programmatically).

2. Change the DMULTITHREAD flag in the XCFLAGS of the make command and set it from 1 thread to twice as many threads as your RPI4 can handle, in steps of 1. So, for instance, if you determine that your RPI4 can handle up to 2 threads, you will need to compile and execute the benchmark for DMULTITHREAD values of 1, 2, 3, and 4.

3. Plot your results in a "Benchmark Performance (y-axis) vs. Number of Threads (x-axis)" graph.

4. Analyze your results; include at least the following points:


(a) Does your curve follow a linear, quadratic, exponential, etc. behaviour, or a combination of them? Which step produces the largest improvement in comparison to its predecessor?

(b) Would indefinitely increasing the number of threads keep helping the benchmark performance?

(c) What role does the Linux scheduler play in assigning threads to physical cores? Mention how the Linux scheduling policy works in your system.
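As referenced in step 1, here is a minimal sketch of one way to query the available core count from C; running nproc or reading /proc/cpuinfo on the target should give the same information.

#include <stdio.h>
#include <unistd.h>

/* Ask the kernel how many processors are currently online;
 * on the RPI4 this is expected to report 4. */
int main(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("Online processors: %ld\n", n);
    return 0;
}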

2.1.3 Experiment 2: Compiler Optimization of "non-optimizable" Code

As mentioned before, the CoreMark benchmark was designed to be as compiler-independent as possible. Let's see how accurate this is! We will compile the benchmark using different optimization degrees and analyze our results:

1. Check the theory on compiler optimization and elaborate on what the best optimization level would be according to the theory.

2. Compile the benchmark with the following compiler optimizations and plot your results in a "Benchmark Performance (y-axis) vs. Compiler Optimization (x-axis)" graph:
   (a) O0
   (b) O1
   (c) Os
   (d) O2
   (e) O3
   (f) Ofast

3. Analyze your results; include at least the following points:
   (a) Does your curve follow a linear, quadratic, exponential, etc. behaviour, or a combination of them? Which step produces the largest improvement in comparison to its predecessor?
   (b) Is it possible to significantly improve the CoreMark benchmark with the compiler? Would you consider it a compiler-independent benchmark?

Now, to summarize our RPI4 benchmarking results so far, execute the following steps:

1. Use the results of the last 2 sections to generate the best possible CoreMark result you can get out of the RPI4 with our custom Linux kernel, file system and toolchain.

2. Go to the scores list, find the scores reported for the BCM2711 processor and compare yours.

3. Analyze your results. Explain why your results may be slower or faster than the reported ones.

2.2 The More Comprehensive CoreMark-Pro

As we just experienced, the CoreMark benchmark is not the most reliable performance evaluation tool for a modern multi-core embedded device. An attempt to make a better benchmark for multi-core Linux platforms is CoreMark-Pro, as you can check in this comparison between the two.

2.2.1 Understanding the Concept of the Improved Algorithm

Again, the first thing we will do is get to know the algorithm behind our benchmark. Briefly answer the next theoretical questions:

1. How does the algorithm differ from the original one? What has improved?

2. Overview its integer and floating-point workloads, without explaining the 24 FORTRAN kernels in detail.

3. Is the simple CoreMark included in the CoreMark-Pro?

4. How are the multiple workloads combined to summarize the results in one single score?

2.2.2 Running the CoreMark-Pro in the Raspberry Pi 4

Last but not least, the interesting part: actually testing the algorithm! CoreMark-Pro can be compiled using its provided Makefile:

1. Go to its Github repository.

2. Choose one of the next approaches to compile and run the algorithm:
   (a) Investigate how to cross-compile it using our toolchain and then run the generated executables on the target.
   (b) Copy the sources into the RPI4 file system and perform a native compilation/execution.
   (c) Document your decision and justify why you selected it.


3. Investigate whether you can vary any compilation parameter as we did in Sections 2.1.2 and 2.1.3.
   (a) If you can, create a plot for each parameter you consider important and analyze it.
   (b) If you can't, support your decision with an analysis of the Makefiles and their available compilation options and compiler flags.

4. Based on your results and analysis, answer the question: is CoreMark-Pro a better benchmarking tool for our 64-bit, multi-core processor?

3 Application Description

3.1 Motivation

When working on embedded systems you are exposed to a wide range of devices; one of the most common you will use in your designs is an image (camera) sensor like the one shown in Figure 2 [1].

Figure 2: Image sensor Sony IMX-219

Camera sensors are equipped with a large set of features, including:

• Hue, gamma and sharpness controls.
• Lens correction.
• Stabilization.
• Defective pixel correction.
• Noise cancelling.
• Auto-focus.
• Etc.


As part of these features, the output format of the video content (the image content itself) may vary depending on the manufacturer and the part number. Some sensors, such as the OV5640 from OmniVision, can provide YUV, RGB and Bayer formats, while others, such as the Sony IMX-219, only provide the Bayer format. In fact, most sensors provide at least the Bayer format, and it is for this reason that most camera interfaces on SoCs support Bayer as the preferred data format. In some cases the SoC supports Bayer capture; for example, if you take a look at the camera port capabilities of the i.MX6 you will find:

Camera Ports. The role of these ports is to receive input from video sources (e.g. image sensors) and to provide support for time-sensitive control signals to the camera. (Non-time-sensitive controls, e.g. configuration and reset, are performed by the MCU through an I2C I/F or GPIO signals). Each of the camera ports includes the following features:

• Direct connectivity to most relevant external devices.
• Parallel interface - up to 20-bit data bus.
• Frame size: up to 8192 × 4096 pixels (including blanking intervals).
• Data formats supported include Raw (Bayer), RGB, YUV 4:4:4, YUV 4:2:2 and grayscale, up to 16 bits per value (component).

Although Bayer is reported as a supported input format, all the internal image and video processing blocks within the i.MX6 require the YUV format to process the image. Hence, a conversion from Bayer to YUV is required in order to process the image in the system. The Bayer color conversion is usually achieved in two stages:

• Bayer to RGB conversion.
• RGB to YUV conversion (* the workflow in this project).

The process of converting the Bayer format to any other color format is commonly called debayering or Bayer interpolation, and is depicted in Figure 3. Some SoCs already provide functional blocks optimized for this kind of conversion; however, in some cases there is no such hardware unit and we need to do the conversion entirely in software, which is inefficient and provides low performance. So, finding a way to accelerate this process is critical to improve the application performance.


Figure 3: Bayer Interpolation Process

3.2 Objective

In this section the student will learn about the debayering process and, at the same time, will understand how to prototype the color transformation (RGB888/YUV444p) with OpenCV. The application must then be accelerated with SIMD algorithms using NEON™, OpenMP and Pthread, guided by profiling analysis. Finally, the color transformation applications must be inherited in a Yocto image for the RPI4 from a custom meta-layer called, in this project, meta-hpec.

3.3 Image Data

For this project it is recommended to use an RGB888 image (image.rgb must be used in this project to develop the applications in C/C++, NEON Intrinsics, OpenMP and Pthread; for prototyping in OpenCV, a JPG image called imagergb.jpg must be used). The RGB888 image file, which is a raw image with 3 bytes per pixel (Red, Green, and Blue), is attached in the Project 2 folder. To visualize image.rgb, enter Rawpixels and consider the following configuration parameters:

• Width of 640.
• Height of 480.
• Offset 0.
• flip h, flip v and invert unmarked.
• Zoom of 0.
• Predefined format RGB24.
• Pixel Format RGBA.
• The option called Ignore Alpha unmarked.
• Pixel Plane Packed.

Based on these configuration parameters, in Rawpixels the image should look like the one in Figure 4. Hint: Remember that the image file generated by your application is in raw data form; this means it does not contain any header information about the image, so you cannot open it with a normal image viewer. In order to verify your YUV image you must configure the format accordingly. In order to understand the configuration parameters, watch the video here.

Figure 4: Sample image for this project in RGB888 format

In order to upload/download image files from the RPI4, pay attention to the following instructions:

1. SCP approach: read the information here.

2. To create a new folder in the root file system, the folder must allow read/write access; grant it with sudo chmod 777 <folder path>. This way you can move files from the microSD to the PC and vice versa.

4 Prototyping Color Transformation RGB/YUV with OpenCV

As learned in the second topic of our theoretical lecture, it is always good practice to prototype (or model) any project we are about to start; indeed, it should be the first step of the design process. In order to follow the proposed design model, we are first going to program our application using high-level libraries, leaving the low-level implementation and optimization steps for the next sections. To do so, we are going to use the widely used computer vision library called the Open Source Computer Vision Library (OpenCV). As explained by the OpenCV Organization: "it is an open source computer vision and machine learning software library. OpenCV was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in the commercial products. Being a BSD-licensed product, OpenCV makes it easy for businesses to utilize and modify the code."

Please implement the algorithm described in Section 3.2 by using OpenCV and C/C++ (a minimal prototype sketch is shown below). In order to run the algorithm in the target, we need to add the OpenCV libraries into the file system of the Raspberry Pi. To do so, we need to compile our Yocto project again and extract the compressed .tar into the shared NFS location again.
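As a reference, here is a hedged, minimal sketch of what the prototype could look like; the file names are placeholders, and whether you dump the raw bytes this way or choose another output layout is up to your design. Note that cv::imread() loads images in BGR channel order, so the conversion constant must match.

#include <fstream>
#include <opencv2/opencv.hpp>

int main()
{
    // OpenCV loads JPEG images in BGR channel order.
    cv::Mat bgr = cv::imread("imagergb.jpg");
    if (bgr.empty()) return 1;

    // Let OpenCV apply the RGB/YUV conversion matrix for us.
    cv::Mat yuv;
    cv::cvtColor(bgr, yuv, cv::COLOR_BGR2YUV);

    // Dump the raw, header-less bytes so the result can be checked in Rawpixels.
    std::ofstream out("outputCV.yuv", std::ios::binary);
    out.write(reinterpret_cast<const char *>(yuv.data),
              yuv.total() * yuv.elemSize());
    return 0;
}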

4.1 Adding OpenCV to our File System

Luckily, there is already support for OpenCV 4.1.0 in Yocto. To know if there is a recipe for a given package, you can review the OpenEmbedded Layer Index: click on the Recipes section, select your branch (zeus), and type the package you are looking for. There you will find that OpenCV 4.1.0 is provided by the meta-oe layer, that the file with its recipe is recipes-support/opencv/opencv_4.1.0.bb, and what its associated dependencies are. Why do we need to know all this? Because:

1. We know that we do not have to implement the recipe ourselves.

2. We know which meta layer we need to include to have OpenCV 4.1.0 support.

3. If we want to select which components of the OpenCV package are compiled into our file system (e.g. for storage reasons), we can take a look at the recipes-support/opencv/opencv_4.1.0.bb recipe and tell which elements are optional.

So, now that we have talked a bit about the theory, please add the OpenCV package into the file system of your Raspberry Pi. The basic steps to follow are:

1. To avoid RAM problems while compiling, make sure you have an 8 GB swap. To do so, you can follow the instructions in this link.

2. Make sure you have cloned the zeus branch of the meta-openembedded layer into your Yocto folder.

3. Source the oe-init-build-env of your poky folder.

4. Make sure you have added /home/project2/Yocto/meta-openembedded/meta-oe to your conf/bblayers.conf file.

5. Investigate how to add the OpenCV package into your local.conf.

6. Compile your file system again, as we did in the first project.

7. Delete the old file system from the NFS shared folder (but not the rootfs folder itself):


$ rm -rf /var/nfs/rootfs/*

8. Unpack the freshly compiled file system into /var/nfs/rootfs, just as we did in the first project.

How can you remove a package (e.g. gstreamer) from the OpenCV compilation?

The result of your prototype of the algorithm must be validated with Rawpixels.

5 C Implementation of the Application

5.1 RGB-YUV conversion using C/C++

For this part of the project you will create a simple C/C++ program named rgb2yuv_c which shall meet the following requirements:

1. For the color transformation you must use the image file image.rgb to perform the RGB/YUV transformation. In order to understand the color transformation matrix processing, pay attention to this link.

2. You must create a new Yocto recipe folder called rgb2yuv-c in a meta-layer called meta-hpec.

3. The program should provide the following options using the getopt API, and the order of the parameters should not affect the correct operation of your application. You will find some useful information on getopt here.

Usage
./rgb2yuv_c [ -i RGBfile ] [ -o YUVfile ] [-h] [-a]
 -i RGBfile specifies the RGB file to be converted.
 -o YUVfile specifies the output file name.
 -a displays the information of the author of the program.
 -h displays the usage message to let the user know how to execute the application.

Yocto prompt:
rgb2yuv_c -i image.rgb -o outputC.yuv


4. You are free to choose any build system, such as GNU make, Automake or CMake. Just make sure it builds correctly and is fully integrated with the Yocto recipe. (It is highly recommended to use the Autotools process.)

5. The conversion process MUST be done in a C/C++ function with the following proposed prototype:

void rgb2yuv (char *input_image, char *output_image)

You can vary the arguments but NOT the function name.

6. Remember that your output image (outputC.yuv) must be validated using Rawpixels.

7. Don't forget: all binaries MUST be installed in the file system as part of the recipe compilation.

A minimal sketch of how the pieces could fit together is shown below.
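This hedged sketch covers requirements 1, 3 and 5, assuming a 640×480 RGB24 input, a packed YUV444 output, and a common integer approximation of the BT.601 matrix; the exact coefficients must come from your own investigation of the link above, and the author string is a placeholder.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>

/* Assumptions for illustration only: image size and the packed
 * YUV444 output layout are not project requirements. */
#define WIDTH  640
#define HEIGHT 480

void rgb2yuv(char *input_image, char *output_image)
{
    static uint8_t rgb[WIDTH * HEIGHT * 3], yuv[WIDTH * HEIGHT * 3];
    FILE *in  = fopen(input_image,  "rb");
    FILE *out = fopen(output_image, "wb");
    if (!in || !out) { perror("fopen"); exit(EXIT_FAILURE); }
    if (fread(rgb, 1, sizeof rgb, in) != sizeof rgb) {
        fprintf(stderr, "short read: is the input a %dx%d RGB24 file?\n", WIDTH, HEIGHT);
        exit(EXIT_FAILURE);
    }

    /* Common integer approximation of the BT.601 conversion matrix. */
    for (long i = 0; i < (long)WIDTH * HEIGHT * 3; i += 3) {
        int r = rgb[i], g = rgb[i + 1], b = rgb[i + 2];
        yuv[i]     = (uint8_t)((( 66 * r + 129 * g +  25 * b + 128) >> 8) +  16); /* Y */
        yuv[i + 1] = (uint8_t)(((-38 * r -  74 * g + 112 * b + 128) >> 8) + 128); /* U */
        yuv[i + 2] = (uint8_t)(((112 * r -  94 * g -  18 * b + 128) >> 8) + 128); /* V */
    }
    fwrite(yuv, 1, sizeof yuv, out);
    fclose(in);
    fclose(out);
}

int main(int argc, char *argv[])
{
    char *input = NULL, *output = NULL;
    int opt;

    /* getopt(3) accepts the options in any order, as required. */
    while ((opt = getopt(argc, argv, "i:o:ah")) != -1) {
        switch (opt) {
        case 'i': input  = optarg; break;
        case 'o': output = optarg; break;
        case 'a': puts("Author: <your team here>"); return 0; /* placeholder */
        case 'h':
        default:
            printf("Usage: %s [ -i RGBfile ] [ -o YUVfile ] [-h] [-a]\n", argv[0]);
            return 0;
        }
    }
    if (!input || !output) {
        fprintf(stderr, "Both -i and -o are required; try -h.\n");
        return 1;
    }
    rgb2yuv(input, output);
    return 0;
}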

5.2 Benchmark and Analyze your Implementation

In this section we will measure, from within your program, the time spent by the function rgb2yuv() and print it out on the console after the process has finished. Hint: in order to estimate the time spent, pay attention to this and this link (a minimal clock_gettime() sketch is also given at the end of this section). Now we will perform a series of "tricks" to try to reduce the application run-time without manually optimizing the code:

1. Scheduler Priority Test: Increase the scheduling priority of your application to try to speed it up, and measure the time it takes. Does it make a difference? To change the priority of your process, please use and compare these two commands:

(a) nice: Changes the niceness of a process to increase its scheduler priority. Please increase the niceness to the maximum for a user-space application and measure the time it takes with the "time" command.

(b) chrt: Modifies the "real-time" scheduling attributes of a process. Please raise the FIFO scheduling priority to the maximum and measure the time it takes for your application to be executed. Report whether you get better results by modifying any other parameter of the chrt command (try: chrt --help).

(c) Analyze your results and compare between:

i. Standard scheduling policy.


ii. Increased scheduling niceness.

iii. Increased real-time scheduling attributes.

2. Turn off Kernel Frequency Scaling: Normally the kernel scales down the frequency of the processor cores according to the current load of the system in order to save energy. We can of course turn this feature off and see whether we get better results by continuously using the highest frequency on a single core. To do so, do the following:

(a) You can see the current frequency of a core by typing:

@ watch -n 1 cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
# This will show the current frequency of Core 0 every second

(b) We can turn off frequency scaling by setting the minimum of the kernel frequency scaling to the processor's maximum frequency, as follows:

@ echo 1500000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq

Note: Check with the command from the last step that the current frequency has changed to 1.5 GHz.

(c) You can now execute your application on Core 0 by typing:

@ taskset --cpu-list 0 "my_command"

Please measure the execution time of your application without frequency scaling and report any difference. Note: You can also turn off the frequency scaling of all cores and avoid the taskset command; this is important if you are using multi-threading.

(d) When you are done, reactivate the frequency scaling by typing:

@ echo 600000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq


3. Perform compiler optimization of your code just as described in Section 2.1.3. Make a plot and analyze your results.

There are 3 major points to be evaluated in this section:

1. The substitution of the described high-level functions with your OWN "low-level" code, in this case the RGB/YUV C implementation.

2. The benchmark of your application as described in this section.

3. The validation of your output with Rawpixels.
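As referenced at the beginning of this section, here is a minimal sketch of one way to time rgb2yuv() from within the program using clock_gettime(); the wrapper function name is illustrative, and only the rgb2yuv() prototype is fixed by the project.

#include <stdio.h>
#include <time.h>

void rgb2yuv(char *input_image, char *output_image);  /* required prototype */

/* Wrap the conversion with CLOCK_MONOTONIC timestamps; this clock is
 * not affected by system time adjustments, so it is safe for benchmarking. */
void timed_rgb2yuv(char *in, char *out)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    rgb2yuv(in, out);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("rgb2yuv() took %.3f ms\n", ms);
}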

6 Profiling Analysis of the C Implementation with perf

Once we have prototyped and correctly implemented our application in a lower-level programming language, the next step will be to find its performance bottlenecks and try to optimize it using different approaches! A good starting point for detecting performance hot-spots is a profiler tool. It gives us an idea about the time allocation of the processor across all tasks happening in a certain time window. The perf profiling tool can give us this information not only for user-space but also for kernel tasks.

6.1 Adding perf to our File System

We haven't yet compiled any profiler into our file system. Luckily, there is good official documentation and support to accomplish this task:

1. First, let us take a look at the OpenEmbedded Layer Index to check whether we have to implement it ourselves or if it is already supported in a layer. As you can see in this link, our perf recipe is already implemented in the OpenEmbedded core layer, so we got lucky! Just make sure you have cloned this repository and added its layer, just as we did in Section 4.1.

2. Then follow Chapters 1 and 2 of this official guide to have profiling capabilities in our system.
   (a) Note: To avoid RAM problems while compiling, make sure you have an 8 GB swap. To do so, you can follow the instructions in this link.

3. Once you have successfully compiled your file system with perf, remove the old file system from /var/nfs/rootfs/* and unpack the new one there.

4. Restart your system and run:


@ perf top
# This command shows the overhead of the current tasks (kernel space and user space)

If this command runs, you have successfully compiled perf in your file system.

6.2 Profiling with Perf

Now that we have installed it, we will check how we actually profile with perf:

1. We will first clone the Flame Graphs repository to better visualize the profiling data. Clone this repository in your virtual machine and remember its local path.

2. One command that we can use to profile a program with perf is (remember: "@" commands are to be executed in the target's Linux; more info in Section 1.2.1):

@ perf record -F 99 -a -g -- "my program"
# Take a look at the perf help to know what the switches mean in the last command.

3. As an example, we will profile the sleep command line program:

@ perf record -F 99 -a -g -- sleep 60

# Then run:
@ ls -alh
-rw------- 1 root root 39K Jun 11 2020 perf.data
# As you see, we just generated a perf.data file

4. Then, to resolve symbols type:

@ perf script > perf.script
# This command will translate the perf.data into a perf.script file.

5. Now copy the resulting perf.script to your virtual machine and execute:

$ ./path/to/FlameGraph/stackcollapse-perf.pl perf.script > /tmp/perf.folded
# This will take the perf events and translate them so that Flame Graphs can work with them.

Note that, as we are using NFS, the perf.script is accessible from both the Raspberry Pi and Ubuntu, so copying it is actually not strictly necessary.

6. The next step is to create a scalable vector graphic out of the perf.folded file; to do so, type in your virtual machine:

$ ./path/to/FlameGraph/flamegraph.pl /tmp/perf.folded > perf.svg
# This will create a .svg to be analyzed

7. The last step is the visualization and analysis of our profiling data. Just open the perf.svg generated previously with a program capable of showing .svg files:

$ firefox perf.svg

You should be able to see an image like Figure 5 or the one in this link.

Figure 5: Scalable Vector Graphics image out of perf profiling data

6.3 Profile Your Application

Using the provided information about perf, profile your application and determine:


1. What are the performance bottlenecks of your application? Can you identify the critical functions in your code?

2. How does the profiling data of your own implementation differ from that of the prototype implementation? Are the bottlenecks somewhere else?

3. Design a draft strategy to optimize your application based on your profiling data. What would you do, and how would you speed up your code?

4. Generate .svg profiling data for your prototype and low-level implementations, together with a deep analysis of your profiling information and the draft of your optimization strategy.

7 C Implementation with NEON Intrinsics

7.1 RGB-YUV conversion using C/C++ and NEON Intrinsics

For this part of the project you will create a C/C++ program named rgb2yuv_intrinsics, based on your C/C++ implementation, which shall meet the following requirements:

1. For the color transformation you must use the image file image.rgb to perform the RGB/YUV transformation. In order to understand the color transformation matrix processing, pay attention to this link.

2. You must create a new Yocto recipe folder called rgb2yuv-intrinsics in a meta-layer called meta-hpec.

3. The program should provide the following options using the getopt API, and the order of the parameters should not affect the correct operation of your application. You will find some useful information on getopt here.

Usage
./rgb2yuv_intrinsics [ -i RGBfile ] [ -o YUVfile ] [-h] [-a]
 -i RGBfile specifies the RGB file to be converted.
 -o YUVfile specifies the output file name.
 -a displays the information of the author of the program.
 -h displays the usage message to let the user know how to execute the application.

Yocto prompt:
rgb2yuv_intrinsics -i image.rgb -o outputIN.yuv


4. You are free to choose any build system, such as GNU make, Automake or CMake. Just make sure it builds correctly and is fully integrated with the Yocto recipe. (It is highly recommended to use the Autotools process.)

5. The conversion process MUST be done in a C function with the following proposed prototype:

void rgb2yuv (char *input_image, char *output_image)

You can vary the arguments but NOT the function name.

6. The implementation of the conversion algorithm MUST use the NEON™ Intrinsics approach (a hedged sketch of the idea follows this list). In this case it is highly recommended to consider the following information:
   • ARM Infocenter here.
   • Basic instructions (*) here.
   • Introduction to Neon for ARMv8 here; another approach here.
   • Compiling process here.
   • GCC compiler optimization for ARM-based systems here; pay attention to the mtune features.
   • Maximizing SIMD performance here; pay attention to the optimization and vectorization compile directions.

7. Measure, from within your program, the time spent by the function rgb2yuv() and print it out on the console after the process has finished. Hint: in order to estimate the time spent, pay attention to this and this link (see also the timing sketch at the end of Section 5.2).

8. Remember that your output image (outputIN.yuv) must be validated using Rawpixels.

9. Don't forget: all binaries MUST be installed in the file system as part of the recipe compilation.
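To make the intrinsics approach concrete, here is a hedged sketch of a kernel converting 8 pixels per call, using the same integer BT.601-style coefficients as the scalar sketch in Section 5.1; the function name, the planar output and the coefficient choice are assumptions you should adapt to your own design.

#include <arm_neon.h>
#include <stdint.h>

/* Convert 8 RGB24 pixels into planar Y, U and V samples. */
void rgb2yuv_neon8(const uint8_t *rgb, uint8_t *y, uint8_t *u, uint8_t *v)
{
    /* vld3 de-interleaves: p.val[0]=R, p.val[1]=G, p.val[2]=B (8 pixels). */
    uint8x8x3_t p = vld3_u8(rgb);

    /* Widen to 16 bits so the multiply-accumulates cannot overflow. */
    uint16x8_t r = vmovl_u8(p.val[0]);
    uint16x8_t g = vmovl_u8(p.val[1]);
    uint16x8_t b = vmovl_u8(p.val[2]);

    /* Y = ((66R + 129G + 25B + 128) >> 8) + 16 */
    uint16x8_t yv = vmulq_n_u16(r, 66);
    yv = vmlaq_n_u16(yv, g, 129);
    yv = vmlaq_n_u16(yv, b, 25);
    yv = vshrq_n_u16(vaddq_u16(yv, vdupq_n_u16(128)), 8);
    vst1_u8(y, vqmovn_u16(vaddq_u16(yv, vdupq_n_u16(16))));

    /* U and V have negative coefficients, so switch to signed arithmetic. */
    int16x8_t rs = vreinterpretq_s16_u16(r);
    int16x8_t gs = vreinterpretq_s16_u16(g);
    int16x8_t bs = vreinterpretq_s16_u16(b);

    /* U = ((-38R - 74G + 112B + 128) >> 8) + 128 */
    int16x8_t uv = vmulq_n_s16(rs, -38);
    uv = vmlaq_n_s16(uv, gs, -74);
    uv = vmlaq_n_s16(uv, bs, 112);
    uv = vshrq_n_s16(vaddq_s16(uv, vdupq_n_s16(128)), 8);
    vst1_u8(u, vqmovun_s16(vaddq_s16(uv, vdupq_n_s16(128))));

    /* V = ((112R - 94G - 18B + 128) >> 8) + 128 */
    int16x8_t vv = vmulq_n_s16(rs, 112);
    vv = vmlaq_n_s16(vv, gs, -94);
    vv = vmlaq_n_s16(vv, bs, -18);
    vv = vshrq_n_s16(vaddq_s16(vv, vdupq_n_s16(128)), 8);
    vst1_u8(v, vqmovun_s16(vaddq_s16(vv, vdupq_n_s16(128))));
}

On AArch64, arm_neon.h is available without extra compiler flags; a loop over the image would call this kernel every 24 input bytes, handling any tail that is not a multiple of 8 pixels with the scalar code.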

8 Optimizing our C implementation with Multi-Threading

We will explore two parallelization strategies to optimize our application: the POSIX Threads (pthread) library and the Open Multi-Processing API (OpenMP). In order to understand the recipe procedure for using OpenMP and Pthread in Yocto, watch this video.


8.1 RGB-YUV conversion using Pthread

By using the POSIX Threads library you can take advantage of the 4 cores available in your RPI4 to speed up your application. Your task is to modify your code so that you run time-consuming tasks in parallel:

1. For the color transformation you must use the image file image.rgb to perform the RGB/YUV transformation. In order to understand the color transformation matrix processing, pay attention to this link.

2. You must create a new Yocto recipe folder called rgb2yuv-pthread in a meta-layer called meta-hpec.

Usage
./rgb2yuv_pthread [ -i RGBfile ] [ -o YUVfile ] [-h] [-a]
 -i RGBfile specifies the RGB file to be converted.
 -o YUVfile specifies the output file name.
 -a displays the information of the author of the program.
 -h displays the usage message to let the user know how to execute the application.

Yocto prompt:
rgb2yuv_pthread -i image.rgb -o outputPT.yuv

3. You are free to choose any build system, such as GNU make, Automake or CMake. Just make sure it builds correctly and is fully integrated with the Yocto recipe. (It is recommended to use the Autotools process.)

4. The conversion process MUST be done in a C/C++ function with the following proposed prototype:

void rgb2yuv (char *input_image, char *output_image)

You can vary the arguments but NOT the function name.

5. Using the profiling information and your knowledge of your application, determine the most time-consuming functions in your code.

6. Among those time-consuming tasks, determine which ones you can run independently from each other.

7. Use the pthread library to run those functions in parallel. The pthread library is well documented and there is a lot of helpful information on the web (a hedged row-partitioning sketch is shown after this list).


8. Prove that your results do not differ from those of your original implementation.

9. Measure the time your application takes with and without the pthread optimization. To do so, take a look at this and this.

10. Analyze your results.

11. Remember that your output image (outputPT.yuv) must be validated using Rawpixels.

12. Don't forget: all binaries MUST be installed into the file system as part of the recipe compilation.

NOTE: In your Eclipse project, you have to add pthread to the linker libraries so that you don't have compilation problems. Go to Project properties → C/C++ Build → Settings → GCC C++ Linker → Libraries and add "pthread".
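Since every output pixel depends only on its own input pixel, the image rows can be split into independent bands, one per thread. The sketch below assumes the in-memory buffers from the Section 5.1 sketch and hard-codes 4 threads to match the RPI4; it is an illustration of the partitioning idea, not the required structure.

#include <pthread.h>
#include <stdint.h>

#define NUM_THREADS 4   /* matches the RPI4's four cores */

/* Each thread converts an independent horizontal band of rows. */
typedef struct {
    const uint8_t *rgb;          /* interleaved RGB24 input */
    uint8_t *yuv;                /* packed YUV444 output    */
    int start_row, end_row, width;
} band_t;

static void *convert_band(void *arg)
{
    band_t *bd = (band_t *)arg;
    for (int row = bd->start_row; row < bd->end_row; row++) {
        for (int col = 0; col < bd->width; col++) {
            long i = ((long)row * bd->width + col) * 3;
            int r = bd->rgb[i], g = bd->rgb[i + 1], b = bd->rgb[i + 2];
            bd->yuv[i]     = (uint8_t)((( 66 * r + 129 * g +  25 * b + 128) >> 8) +  16);
            bd->yuv[i + 1] = (uint8_t)(((-38 * r -  74 * g + 112 * b + 128) >> 8) + 128);
            bd->yuv[i + 2] = (uint8_t)(((112 * r -  94 * g -  18 * b + 128) >> 8) + 128);
        }
    }
    return NULL;
}

/* Bands never overlap, so no locking is needed. */
void convert_parallel(const uint8_t *rgb, uint8_t *yuv, int width, int height)
{
    pthread_t tid[NUM_THREADS];
    band_t band[NUM_THREADS];
    int rows_per_band = height / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        band[t].rgb = rgb;
        band[t].yuv = yuv;
        band[t].width = width;
        band[t].start_row = t * rows_per_band;
        /* The last band absorbs any remainder rows. */
        band[t].end_row = (t == NUM_THREADS - 1) ? height : (t + 1) * rows_per_band;
        pthread_create(&tid[t], NULL, convert_band, &band[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
}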

8.2 RGB-YUV conversion using OpenMP

In order to use the OpenMP API in the Yocto project, it is highly recommended to follow the next considerations:

1. Look for your local.conf file in your folder /build/conf/local.conf and include the following dependencies. HINT: Pay attention here.

IMAGE_INSTALL_append = " libgomp libgomp-dev libgomp-staticdev glibc-staticdev"

2. Once the local.conf file has been modified, you have to build your Yocto image using the bitbake command, in this case bitbake core-image-base.

3. To compile OpenMP within the Autotools framework, it is necessary to add a CFLAGS directive with -fopenmp to Makefile.am. More information here.

4. In order to improve the execution time of your C application using OpenMP pragmas, base your solution on your profiling results. More information on using threads here, an example meta-layer here, and consider this statement here in order to use more threads (a hedged parallel-for sketch is shown at the end of this section).

5. You must create a new Yocto recipe folder called rgb2yuv-openmp.

6. The program should provide the following options using the getopt API, and the order of the parameters should not affect the correct operation of your application. You will find some useful information on getopt here.

Usage
./rgb2yuv_openmp [ -i RGBfile ] [ -o YUVfile ] [-h] [-a] [-t threads_number]
 -i RGBfile specifies the RGB file to be converted.
 -o YUVfile specifies the output file name.
 -t specifies the number of threads.
 -a displays the information of the author of the program.
 -h displays the usage message to let the user know how to execute the application.

Yocto prompt:
rgb2yuv_openmp -i image.rgb -o outputOM.yuv -t 2

7. You are free to choose any build system, such as GNU make, Automake or CMake. Just make sure it builds correctly and is fully integrated with the Yocto recipe. (It is recommended to use the Autotools process.)

8. The conversion process MUST be done in a C/C++ function with the following proposed prototype:

void rgb2yuv (char *input_image, char *output_image)

You can vary the arguments but NOT the function name.

9. Remember that your output image (outputOM.yuv) must be validated using Rawpixels.

10. Don't forget: all binaries MUST be installed in the file system as part of the recipe compilation.
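As referenced in consideration 4, here is a hedged sketch of how the conversion loop could be parallelized with an OpenMP pragma, honoring the -t option via omp_set_num_threads(); the buffer layout and coefficients follow the earlier sketches and are assumptions, not requirements.

#include <omp.h>
#include <stdint.h>

/* Parallel RGB24 -> packed YUV444 conversion; compile with -fopenmp. */
void convert_omp(const uint8_t *rgb, uint8_t *yuv,
                 int width, int height, int nthreads)
{
    omp_set_num_threads(nthreads);   /* honor the -t command-line option */

    /* Each pixel is independent, so the row loop parallelizes trivially. */
    #pragma omp parallel for schedule(static)
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            long i = ((long)row * width + col) * 3;
            int r = rgb[i], g = rgb[i + 1], b = rgb[i + 2];
            yuv[i]     = (uint8_t)((( 66 * r + 129 * g +  25 * b + 128) >> 8) +  16);
            yuv[i + 1] = (uint8_t)(((-38 * r -  74 * g + 112 * b + 128) >> 8) + 128);
            yuv[i + 2] = (uint8_t)(((112 * r -  94 * g -  18 * b + 128) >> 8) + 128);
        }
    }
}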

8.3 Yocto Testing

In order to report the functionality of your Yocto recipes, your video recording must show an execution like the one in Figure 6. NOTE: Remember not to copy your binary files into the Yocto image manually; the binaries MUST be built into the Yocto image through the bitbake command.


Figure 6: Binary execution in the Yocto image of the RGB/YUV applications for each approach.

9 Optional: Optimization Competition

This section is optional and grants:

1. 15 extra points to the first place
2. 10 extra points to the second place
3. 5 extra points to the third place

What do you have to do?

1. Optimize the run-time of your application using a custom approach based on the methods introduced in this project and additional methods that you investigate yourself. This includes:
   (a) Multi-threading (pthread, OpenMP, etc.)
   (b) NEON optimization
   (c) Compiler optimization
   (d) Modifying kernel frequency scaling and scheduling policies
   (e) Custom optimization of your C code
   (f) Use of the Floating Point Unit (FPU)

2. Demonstrate in your video and report:


(a) The time it takes for your application to process the provided image (the same image for all teams).

(b) The validation of your output image (outputOM.yuv) using Rawpixels.

10 Deliverables and Grading

This section explains the workflow requirements, deliverables and grading related to this project. Please read it and make sure to organize your project execution and deliverables over the whole working time, and not just before the deadline. The grading of this project will be based on:

1. Written report (PDF).
2. Video.
3. Git repository.

10.1 Workflow requirements

In relation to the Git repository, several aspects are to be considered. It is important to follow them in order to organize your code development and prepare your repository:

1. Open a Git account on a free Git hosting service (GitHub).

2. Create a Git repository named as follows: <group#>_HPEC_2020, where group# is the assigned group number. For example, for Group 1, the name of the Git repository shall be group1_HPEC_2020.

3. If the repository is private, then provide access to the users sercr0388 and jmarayam (GitHub) and make sure to grant write permissions.

4. The Git repository shall contain two main branches: master and develop.

5. Initially, the develop branch is created from master.

6. When working on the project, the student shall create a new working branch from develop, and when the feature is ready the branch must be merged back into develop. Any additional fix or modification after the merge requires the process to be repeated (i.e. create the branch from develop and merge the changes later). Once the code in develop is ready, it shall be merged into master and a tag must be created. The process is described in Figure 7. It is recommended to read more information here.


Figure 7: Repository structure for assignments or projects

10.2 Folder Structure of your Deliverables

In relation to the deliverables, documentation and folder structure, you MUST follow the folder organization in your user_id.tar.gz (e.g. group1_HPEC_2020) and Git repository as depicted below:

master ---- project_2
 |--- Report
 |     |--- Report.pdf
 |--- Video
 |     |--- Video_Link.pdf
 |--- Prototyping
 |     |--- prototype file (OpenCV)
 |--- Custom Meta-layer
 |     |--- meta-hpec
 |           |--- recipes rgb2yuv_c
 |           |--- recipes rgb2yuv_intrinsics
 |           |--- recipes rgb2yuv_pthread
 |           |--- recipes rgb2yuv_openmp
 |--- Image Outputs
 |     |--- YUV_c.yuv
 |     |--- YUV_intrinsics.yuv
 |     |--- YUV_pthread.yuv
 |     |--- YUV_openmp.yuv

10.3 Grading

The grading of this project will be based on Table 2.


Table 2: Project’s evaluation criteria

Item: Report, Video and Repository

  - CoreMark: experiments 1 and 2, and analysis; CoreMark-Pro and analysis — 10 %
  - OpenCV prototype of the algorithm: add OpenCV to the file system; OpenCV prototype implementation and validation — 10 %
  - Own C/C++ implementation of the algorithm: C/C++ implementation and validation (15 %); initial benchmark of your application (5 %) — 20 %
  - Perf profiling data and analysis based on the C implementation (include .svg profiling data) — 10 %
  - Application optimization: NEON Intrinsics (10 %), Pthread (10 %), OpenMP (10 %). NOTE: The resulting .yuv must be validated (Rawpixels) for ALL implementations to avoid losing points. — 30 %

Item: Custom Yocto Application

  - Yocto, workflow, Git and application: correct creation of the Yocto meta-layer; Autotools, GNU make or CMake correctly used to compile the program; Git version control with the suggested layout and the required workflow, delivered correctly; correct usage of getopt for the command-line options; all application versions included in the file system and fully functional, providing correct estimation results — 20 %

Total — 100 %

10.4 Deliverables Submission

To submit your deliverables you have to:

1. Before sending your documentation and deliverables, please follow the folder structure established in Section 10.2.

2. Compress your deliverables folder in zip format and look in TEC-Digital for the delivery link for Project 2.

3. The delivery date is Wednesday, July 29th 2020, with a deadline of 23:45; afterwards, the access link in TEC-Digital will be closed.

References

[1] M. Madrigal. Bump it up!, 2017. [Online; accessed 10-June-2020].
