Progress Report

Image Processing using NVidia Jetson Tegra K1 Development Board

August 2015 – November 2015

Prepared by: Dr. Tuba Kurban, [email protected]
Visiting Post-doctoral Researcher
http://imaging.utk.edu/research/tkurban

Dr. Rifat Kurban, [email protected]
Visiting Post-doctoral Researcher
http://imaging.utk.edu/research/rifkur

Supervisor: Dr. Mongi A. Abidi
Imaging, Robotics, and Intelligent Systems Laboratory, University of Tennessee, Knoxville


Contents

1. Introduction

2. GPU Computing Platforms & Libraries

3. Dynamic Voltage and Frequency Scaling

4. Linux Desktop Environments

5. Previous Studies Using TK1 in the IRIS Lab

6. Image Processing Using TK1

7. Video File Processing

8. Streaming Video Processing with Ximea Camera

9. Streaming Video Processing with VREO Board & Sony Camera

10. Conclusions

References


1. Introduction

This progress report gives a brief description of the technologies used and presents the results of some basic image processing tasks on the NVidia Jetson Tegra K1 embedded computing platform.

When the Tegra K1 was first announced by NVidia in Q2 2014, it attracted much attention not only from industry but also from researchers working in high-performance computing. The Tegra K1 mobile processor includes a 4+1 quad-core ARM Cortex-A15 CPU clocked at 2.3 GHz and a Kepler-architecture GPU with 192 CUDA cores clocked at 852 MHz. The Tegra K1 (TK1) supports up to 8 GB of DDR3L memory and 4K display output over HDMI. The chip is manufactured with a 28 nm process [1].

The Jetson TK1 development kit was announced by NVidia in April 2014 and sells for $192 in the US. It includes 2 GB of memory, 16 GB of eMMC storage, and USB 3.0, HDMI, GigE LAN, SATA, audio, PCIe, RS232, CSI, and GPIO ports [2], as shown in Figure 1. The average power consumption of the board is reported as typically 5 W, and the maximum reaches 15 W when the CPU, GPU, and other peripherals are used together [3].

Figure 1. NVidia Jetson Tegra K1 development board.

As of November 2015, two journal papers indexed in Scopus utilize the TK1: Cocchioni et al. realized a landing application for an unmanned quadrotor [4], and Zhao used the TK1 for fast filter bank convolution in the three-dimensional wavelet transform [5]. With a GPU delivering a peak of 326 GFLOPS, the TK1 promises a lot for the area of embedded supercomputing [6].

2. GPU Computing Platforms & Libraries

For general-purpose GPU programming, the NVidia CUDA and Khronos OpenCL platforms are widely used. While CUDA only works with NVidia GPUs, OpenCL is supported by both AMD and NVidia GPUs. Moreover, OpenCL supports parallel programming on common Intel, AMD, and ARM CPUs, some mobile GPUs, and FPGAs. According to NVidia, the Tegra K1 supports OpenCL; however, compatible drivers and an SDK have not been released yet (as of November 2015) [7]. Both CUDA and OpenCL are cross-platform APIs that can run natively on Windows, Linux, and OS X. The Jetson TK1 runs the Linux4Tegra operating system (basically Ubuntu 14.04 with pre-configured drivers), and CUDA is the most common choice for GPU programming on the TK1.

ArrayFire is an open-source C/C++ library that aims to make GPU programming simple and fast. The ArrayFire API supports both CUDA- and OpenCL-capable devices, including NVidia and AMD GPUs, AMD and Intel CPUs, and some ARM mobile devices [8]. ArrayFire has hundreds of functions across various domains, such as:

- Vector Algorithms
- Image Processing
- Computer Vision
- Signal Processing
- Linear Algebra
- Statistics

3. Dynamic Voltage and Frequency Scaling

Dynamic voltage and frequency scaling (DVFS) is a commonly used technique for making computing systems power-efficient by decreasing the clock frequency of a processor to allow a reduction in the supply voltage [9]. The Linux kernel comes with five predefined CPU DVFS algorithms (governors) [10]:

- Userspace: the frequency can be set manually by the user
- Powersave: the frequency is set to the lowest possible value
- Performance: the frequency is set to the highest possible value
- Ondemand: the frequency is raised to the maximum in response to large increases in workload
- Conservative: the frequency is adjusted similarly to ondemand but reacts more slowly to changes

The TK1's GPU is also under DVFS control; however, it utilizes a single governor that can be enabled or disabled to allow userspace control [10].

If execution time is the priority in an application, the performance governor can be used. However, when power consumption matters more than execution time, one of the remaining governors can be selected.
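As an illustration, a governor can be selected at runtime through the standard Linux cpufreq sysfs interface. The following minimal C++ sketch assumes root privileges and the usual sysfs path; on a multi-core system such as the TK1, each core has its own entry:

#include <fstream>
#include <string>

// Minimal sketch: select a CPU DVFS governor by writing to the
// standard Linux cpufreq sysfs entry of core 0 (requires root).
bool setCpuGovernor(const std::string& governor) {
    std::ofstream f("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor");
    if (!f) return false;   // entry missing or insufficient permissions
    f << governor;          // e.g. "performance" or "ondemand"
    return static_cast<bool>(f);
}

For example, setCpuGovernor("performance") locks core 0 to its maximum frequency, which corresponds to the superclock configuration used in the experiments below.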

4. Linux Desktop Environments

There are three layers in a Linux desktop system:

- X Windows is the foundation, the primitive framework that allows graphic elements to be drawn on the screen.
- The window manager controls the placement and appearance of windows. It requires X Windows but not a desktop environment. Some well-known examples are Fluxbox, IceWM, Window Maker, and Openbox.
- The desktop environment is a fully integrated system that includes a window manager and builds upon it. GNOME, KDE, Xfce, and LXDE are popular desktop environments for Linux [11].

The TK1 comes with the Unity desktop environment, which is based on GNOME 3 and is the default desktop on Ubuntu distributions.

5. Previous Studies Using TK1 in the IRIS Lab

In this section, benchmark results obtained by Mr. Ben Olson using the TK1 are briefly introduced. Detailed information can be found in [12].

To determine the NVidia Jetson TK1's performance in imaging applications, a gamma correction application was written using C and CUDA. In the experiments, two different test images were used, with resolutions of 400x300 and 1920x1080. The test images, shown in Figure 2, were chosen to demonstrate the Jetson's capabilities in real-world applications. Gamma correction was performed in single-threaded CPU, multi-threaded CPU, CUDA GPU, and hybrid (multi-threaded CPU + CUDA GPU) modes.

Figure 2. Test images used in the experiments: (a) 400x300 test image; (b) 1920x1080 test image [12].

Results for the 400x300 test image are given in Table 1. Olson obtained a maximum of 140 FPS without displaying the results. The study does not mention which Linux DVFS governor was used; however, the results in the next section suggest that the default ondemand governor was probably active. Olson used a split parameter in the range [0, 1] to determine how much of the work is processed by the CPU versus the GPU. His results indicate that a split parameter of 0.9 gives the best performance, meaning that 90% of the work is processed on the GPU and the remainder on the CPU.

Table 1. Gamma correction frame rates for the 400x300 test image [12].

Method                                   With display (fps)   Without display (fps)
Single-threaded CPU                      10                   16
Multi-threaded CPU                       30                   60
CUDA GPU                                 60                   130
Hybrid (multi-threaded CPU + CUDA GPU)   80                   140

Figure 3 shows the results of gamma correction on the 1920x1080 Full HD image. For this study, only the results without display are given. According to the results, the maximum of 27 FPS is obtained for the Full HD test image with a split parameter of 1.0 (bimage_cuda). The single-threaded CPU (bimage_nothread) and multi-threaded CPU (bimage_thread) modes resulted in 1 and 4 FPS, respectively.

Figure 3. Gamma correction frame rates of 1920x1080 test image [12].


6. Image Processing Using TK1

In this application, a simple image processing task, gamma correction, is realized using the ArrayFire library and compared with the results of Olson's previous study. Gamma correction is a non-linear operation used to encode or decode luminance in still images or video:

$V_{out} = A\,V_{in}^{\gamma}$  (1)

where $A$ is a constant that commonly equals 1 and input values are in the range $[0, 1]$. A value $\gamma < 1$ is called encoding gamma or gamma compression, and $\gamma > 1$ is called decoding gamma or gamma expansion. Figure 4 demonstrates the application of different gamma values to an image [13].
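For example, with $A = 1$ and an encoding gamma of $\gamma = 1/2.2 \approx 0.45$, an input value of $V_{in} = 0.5$ maps to $V_{out} = 0.5^{1/2.2} \approx 0.73$; applying the decoding gamma $\gamma = 2.2$ to that result recovers the original value of 0.5.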

Figure 4. Gamma correction example.

ArrayFire handles all variables in a universal data-type class called array. Computational algorithms are more readable with its math-resembling, array-based notation. All array objects are stored in the GPU's memory, and all operations on array objects are executed on the GPU in parallel. In Figure 5, a single-line code snippet of a C++ function that realizes image gamma correction on the GPU is given.

array applyGammaEachPixel(array img, float gammadef) {
    return pow((img / 255.f), gammadef) * 255;
}

Figure 5. ArrayFire implementation of GPU gamma correction in C++.
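A minimal, self-contained usage sketch of this function is given below; the image file name is hypothetical, and any 8-bit image readable by ArrayFire will do. Because ArrayFire evaluates element-wise array expressions lazily, the body of the function in Figure 5 is fused into a single GPU kernel at runtime.

#include <arrayfire.h>
using namespace af;

array applyGammaEachPixel(array img, float gammadef) {
    return pow((img / 255.f), gammadef) * 255;
}

int main() {
    array img = loadImage("test.jpg", false);    // hypothetical file; false = grayscale
    array out = applyGammaEachPixel(img, 2.2f);  // decoding gamma
    Window wnd("Gamma correction");
    while (!wnd.close()) wnd.image(out / 255.f); // af::Window expects floats in [0, 1]
    return 0;
}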

In the experiments, the two test images from Olson's previous study, with resolutions of 400x300 and 1920x1080, are used again. The effect of four different Linux window managers and desktop environments (Ubuntu Unity, Xfce, LXDE, and Openbox) is also tested. Experiments are run both with the ondemand governor (referred to as default clock) and the performance governor (referred to as superclock). Each configuration is executed 1000 times, and the averaged frame rates are given in the results.


Figure 6. Image gamma correction with ArrayFire and TK1 using a 400x300 image.

Figure 6 shows the results of the experiments on the low-resolution test image. As expected, the frame rates obtained with display are lower than those obtained without display. Setting the clock speeds to the maximum allowed frequency gained a ~4x speed-up without display and a ~1.25x speed-up with display. The maximum frame rate, 1116 FPS, is obtained with Openbox, the most lightweight desktop option. We also obtained a ~3x speed-up over Olson's CUDA-based implementation with display, in both the default clock and superclock configurations.

Figure 7. Image gamma correction with ArrayFire and TK1 using a 1920x1080 image.


Figure 7 shows the results of the experiments on the high-resolution test image. Setting the clock speeds to the maximum allowed frequency gained a ~1.5x speed-up without display and no gain with display. Using different Linux desktop environments made no noticeable difference in the superclock case. For the Full HD image, a maximum of 92 FPS was obtained without display and 10 FPS with display.

The C++ source code of this study can be downloaded from [14].

7. Video File Processing

In this section, a video file processing application is realized. The well-known Big Buck Bunny short film from the Blender Institute is used in the experiments. OpenCV is used to read the frames of the Microsoft MP4-encoded 480p, 720p, and 1080p versions of the Big Buck Bunny video file from the internal memory of the TK1. The flowchart of the application is given in Figure 8.

Figure 8. Flowchart of video file processing on the GPU: capture/read frames with OpenCV; convert to the ArrayFire data type (mat2array); process the frame on the GPU; display the results.

ArrayFire's I/O functions are capable of reading only single images from disk or memory. Therefore, OpenCV is used to extract the frames of a video file. However, the OpenCV data type (cv::Mat) is not compatible with ArrayFire's data type (af::array), so Mcclanahoochie's conversion code is used for this purpose [15].
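The essence of such a conversion is sketched below for a single 8-bit frame; this is a simplified illustration rather than Mcclanahoochie's exact code. OpenCV stores images row-major on the host while af::array is column-major on the device, so the dimensions are swapped on upload and the result is transposed.

#include <arrayfire.h>
#include <opencv2/opencv.hpp>

// Simplified cv::Mat -> af::array conversion for one grayscale frame.
af::array mat2array(const cv::Mat& frame) {
    cv::Mat gray, gray32f;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    gray.convertTo(gray32f, CV_32F);   // ArrayFire arithmetic prefers floats
    // the host buffer is row-major, so pass swapped dimensions and transpose
    af::array a(gray32f.cols, gray32f.rows, gray32f.ptr<float>());
    return a.T();
}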

Table 2 shows the per-frame times in milliseconds and the display frame rates when the video files are displayed using only OpenCV. As can be seen from the table, displaying Full HD frames at 19 FPS may not be suitable for real-time applications. However, the TK1 can reach much higher rates with its internal hardware multimedia encoders and software APIs.


Table 2. Video file processing application using only OpenCV.

                  480p    720p    1080p
capture_cv (ms)   3.6     8.4     19.3
display_cv (ms)   7.3     15      33
total (ms)        10.9    23.4    52.3
total (fps)       91.74   42.7    19.1

Table 3 shows the results of using OpenCV and ArrayFire together. As can be seen from the table, the data-type conversion from OpenCV to ArrayFire consumes too much time.

Table 3. Video file processing application using OpenCV and ArrayFire together.

                  480p    720p    1080p
capture_cv (ms)   3.5     8.2     19.6
mat2array (ms)    15.6    34      83.5
display_af (ms)   17.4    40.8    82
total (ms)        36.5    83      185.1
total (fps)       27.4    12      5.4

The C++ source code of this study can be downloaded from [16].

The TK1's OpenMAX IL API includes H.264, VC-1, VP8, MPEG-4 basic, MPEG-2, and JPEG codecs supported by Tegra's high-definition video hardware, and the driver is accessible through GStreamer [17][18]. In future work, higher frame rates can be obtained by using GStreamer and appsrc.

8. Streaming Video Processing with Ximea Camera

In this section, a GPU-based video processing application is realized using the Ximea MQ013MG-E2 USB 3.0 monochrome camera, which is capable of capturing 1.3 MP (1280x1024) frames at 60 FPS. It is a very compact, low-power camera that can be used in many industrial and scientific projects. Ximea offers many APIs and SDKs for different operating systems and CPU hardware; however, its Linux drivers and API for ARM CPUs such as the TK1 are still experimental (beta) [19].

The flowchart of the streaming video processing application is given in Figure 9. First, a frame is captured from the Ximea camera and copied to the TK1's CPU memory. Then, the data is transferred to GPU memory and converted to the ArrayFire image format. A simple image processing task, gamma correction, is applied, and the modified image is transferred back to CPU memory and displayed using ArrayFire's graphics API, Forge.


Figure 9. Flowchart of streaming video processing with the Ximea camera: grab a frame to CPU memory; transfer to the GPU and convert to the ArrayFire data type; process the frame; transfer back to the CPU and display the result.
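A condensed sketch of this loop using xiAPI and ArrayFire is shown below. Error checks are omitted, the header path may differ per installation, and the gamma value is arbitrary.

#include <m3api/xiApi.h>   // header location may differ per installation
#include <arrayfire.h>

int main() {
    HANDLE cam = nullptr;
    xiOpenDevice(0, &cam);                 // open the first enumerated Ximea camera
    xiStartAcquisition(cam);

    XI_IMG frame = {};
    frame.size = sizeof(XI_IMG);

    af::Window wnd("Ximea stream");
    while (!wnd.close()) {
        xiGetImage(cam, 5000, &frame);     // blocking grab, 5 s timeout
        // 8-bit mono host buffer -> device array (swapped dims, then transpose)
        af::array img(frame.width, frame.height,
                      static_cast<unsigned char*>(frame.bp));
        af::array out = af::pow(img.as(f32).T() / 255.f, 0.5f);  // gamma correction
        wnd.image(out);                    // display via Forge
    }
    xiStopAcquisition(cam);
    xiCloseDevice(cam);
    return 0;
}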

Experimental results are given in Table 4. As shown in the table, xiAPI works successfully on the TK1 and captures raw data from the camera at the maximum speed. Since the monochrome camera delivers its data as an integer vector of grayscale pixel values, it is transferred to GPU memory and converted to an MxN image matrix very quickly. The whole operation, including display, runs at 59 FPS. However, with the color version of the same camera, RGB frames could not be captured: the beta driver permits capturing only the raw, non-de-bayered data.

Table 4. Streaming video processing results using xiAPI and ArrayFire.

             Only capturing   Capture, apply gamma, and display
Monochrome   60 FPS           59 FPS
Color        -                -

The C++ source code of this study can be downloaded from [20].

9. Streaming Video Processing with VREO Board & Sony Camera

In this section, a video processing application using the VREO Unity USB 3.0 interface board and the Sony FCB-MA130 camera is realized. This system is capable of capturing 1080p video at 30 FPS and still images up to 13 MP. The system works very well on Windows; however, there are some limitations on Linux. A dedicated Linux driver for the VREO board does not exist. The vendor offers a viewer application called OneView, which uses the generic Video4Linux (V4L) drivers. This generic UVC driver is capable of capturing video at 30 FPS; however, still-image capture is not supported [21]. We were not able to build and run the OneView application due to library conflicts. Therefore, we used Ubuntu's Qt-based V4L2 Test Utility application [22]. Figure 10 shows a screenshot of the application. As can be seen from the figure, 1080p video is captured and displayed at ~15 FPS when the CPU and GPU frequencies are set to the maximum allowed speeds (performance governor).


Figure 10. Capturing and displaying video from the VREO & Sony system using the V4L2 Test Utility application.

The frame rate is below the real-time limit because the data obtained from the VREO board over V4L2 is YUYV-encoded: to display the camera input on the screen, the video frames have to be converted to RGB. We therefore developed our own simple, low-level video capture and display application to increase the frame rate at 1080p resolution.

Figure 11. Example of the UV color plane at Y = 0.5, represented in RGB.


YUV is a color space that encodes a color image or video in a way suited to human perception. Historically, the term YUV is used for analog encoding and YCbCr for digital encoding; today, the term YUV is commonly used in the computer industry to describe file formats that are actually encoded using YCbCr. In the YUV color space, the Y channel encodes the luma (gray level) and the U and V channels encode the chroma (color). An example of the UV color plane is given in Figure 11. The conversion from YUV to RGB can be computed as follows:

$R = Y + 1.402\,(V - 128)$
$G = Y - 0.344\,(U - 128) - 0.714\,(V - 128)$  (2)
$B = Y + 1.772\,(U - 128)$

The R, G, and B values must be clamped to the range [0, 255]. On some architectures, floating-point arithmetic may consume much time; an alternative fixed-point conversion can be realized as:

$R = (298\,(Y - 16) + 409\,(V - 128) + 128) \gg 8$
$G = (298\,(Y - 16) - 100\,(U - 128) - 208\,(V - 128) + 128) \gg 8$  (3)
$B = (298\,(Y - 16) + 516\,(U - 128) + 128) \gg 8$

YUV images can be encoded with 12, 16, or 24 bits per pixel. Common formats are YUV444, YUV411, YUV422, and YUV420. The relation between data rate and sampling is defined by the ratio of the Y channel to the UV channels. Figure 12 shows the encoding of YUYV, or YUY2, a variant of YUV422. In this format, the U0 and V0 values are shared by both Y0 and Y1.

Figure 12. YUYV (YUY2) pixel placement [23].
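Combining Eq. (3) with the YUYV layout of Figure 12, a CPU conversion routine can be sketched as follows; this is a simplified illustration rather than the exact code of [24]:

#include <algorithm>
#include <cstdint>

// Fixed-point YUYV -> RGB conversion from Eq. (3); one 4-byte YUYV
// macropixel (Y0 U0 Y1 V0) yields two RGB pixels sharing U and V.
static inline uint8_t clamp8(int v) {
    return static_cast<uint8_t>(std::min(255, std::max(0, v)));
}

void yuyvToRgb(const uint8_t* yuyv, uint8_t* rgb, int width, int height) {
    for (int i = 0; i < width * height / 2; ++i) {
        int u = yuyv[4 * i + 1] - 128;
        int v = yuyv[4 * i + 3] - 128;
        for (int k = 0; k < 2; ++k) {
            int c = 298 * (yuyv[4 * i + 2 * k] - 16);  // Y0 when k=0, Y1 when k=1
            rgb[6 * i + 3 * k + 0] = clamp8((c + 409 * v + 128) >> 8);
            rgb[6 * i + 3 * k + 1] = clamp8((c - 100 * u - 208 * v + 128) >> 8);
            rgb[6 * i + 3 * k + 2] = clamp8((c + 516 * u + 128) >> 8);
        }
    }
}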

Since the VREO board does not have Linux drivers, we utilized the Video4Linux (V4L) drivers in a C++ program to capture frames from the Sony camera. V4L is a video capture API and driver framework that is closely integrated with the Linux kernel and supports most USB cameras as well.
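The typical V4L2 capture sequence used in such a program looks roughly as follows. This is a condensed sketch with a single memory-mapped buffer and all error checks omitted; the device node is assumed to be /dev/video0.

#include <fcntl.h>
#include <linux/videodev2.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    int fd = open("/dev/video0", O_RDWR);

    // negotiate 1920x1080 YUYV frames
    v4l2_format fmt{};
    fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    fmt.fmt.pix.width = 1920;
    fmt.fmt.pix.height = 1080;
    fmt.fmt.pix.pixelformat = V4L2_PIX_FMT_YUYV;
    fmt.fmt.pix.field = V4L2_FIELD_NONE;
    ioctl(fd, VIDIOC_S_FMT, &fmt);

    // request and memory-map one kernel capture buffer
    v4l2_requestbuffers req{};
    req.count = 1;
    req.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    req.memory = V4L2_MEMORY_MMAP;
    ioctl(fd, VIDIOC_REQBUFS, &req);

    v4l2_buffer buf{};
    buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    buf.memory = V4L2_MEMORY_MMAP;
    buf.index = 0;
    ioctl(fd, VIDIOC_QUERYBUF, &buf);
    void* mem = mmap(nullptr, buf.length, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, buf.m.offset);

    ioctl(fd, VIDIOC_QBUF, &buf);
    v4l2_buf_type type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    ioctl(fd, VIDIOC_STREAMON, &type);

    for (int i = 0; i < 100; ++i) {
        ioctl(fd, VIDIOC_DQBUF, &buf);  // wait for a filled YUYV frame in mem
        // ... convert mem to RGB (e.g. with yuyvToRgb) and display it ...
        ioctl(fd, VIDIOC_QBUF, &buf);   // hand the buffer back to the driver
    }

    ioctl(fd, VIDIOC_STREAMOFF, &type);
    munmap(mem, buf.length);
    close(fd);
    return 0;
}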


There are many different display libraries for Linux; in this application, however, we used the Linux framebuffer (fbdev), a hardware-independent abstraction layer for showing graphics and images on a screen, especially on the Linux console. The framebuffer allows direct access to the video memory without going through a higher-level library. It is very popular in embedded systems as a way to avoid the heavy overhead of the X Window system.
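Mapping the framebuffer follows the usual fbdev pattern, sketched below. The sketch assumes a /dev/fb0 device and, for simplicity, ignores the per-row padding that FBIOGET_FSCREENINFO would report as the real line length.

#include <fcntl.h>
#include <linux/fb.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

int main() {
    int fb = open("/dev/fb0", O_RDWR);
    if (fb < 0) return 1;

    fb_var_screeninfo vinfo;
    ioctl(fb, FBIOGET_VSCREENINFO, &vinfo);   // query resolution and color depth

    std::size_t len = vinfo.yres * vinfo.xres * vinfo.bits_per_pixel / 8;
    uint8_t* fbmem = static_cast<uint8_t*>(
        mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fb, 0));

    // ... write converted RGB rows directly into fbmem here ...

    munmap(fbmem, len);
    close(fb);
    return 0;
}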

In this section, a V4L- and framebuffer-based capture and display application is described. First, frames are captured from the VREO board in YUYV format. Then, the raw data is converted to RGB on the CPU, using either floating-point or fixed-point arithmetic, and displayed on the console using the framebuffer. The actual frame rates achieved with the VREO interface board and Sony camera on the TK1 when capturing 1080p video are shown in Table 5.

Table 5. Streaming video processing results using the VREO and Sony system.

Conversion and display method            Frame rate
Floating-point numbers and framebuffer   6 FPS
Fixed-point numbers and framebuffer      19 FPS
ArrayFire                                11 FPS

As can be seen from Table 5, integer arithmetic is much faster than floating-point arithmetic. However, even 19 FPS is not sufficient for real-time systems. In future work, higher frame rates can possibly be obtained by:

- using low-level ARM NEON instructions for the YUV-to-RGB conversion,
- using the NVidia Performance Primitives (NPP) library, a GPU-accelerated image, video, and signal processing library, or
- using the TK1's hardware H.264 encoder and the GStreamer interface.

The C++ source code of this study can be downloaded from [24].

10. Conclusions

In this progress report, several benchmark image processing applications were realized on the NVidia Jetson TK1 development board. Single-image processing, video file processing, and streaming video processing with different high-speed, high-resolution cameras were evaluated. Considerable optimization is still needed to raise the frame rates of these applications to real-time levels.


References

[1] NVidia, "Tegra K1," [Online]. Available: http://www.nvidia.com/object/tegra-k1-processor.html.

[2] NVidia, "Jetson TK1," 2015. [Online]. Available: http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html.

[3] Elinux, "TK1 Power Consumption," 2015. [Online]. Available: http://elinux.org/Jetson/Jetson_TK1_Power#Typical_power_draw_of_Jetson_TK1.

[4] F. Cocchioni et al., "Visual Based Landing for an Unmanned Quadrotor," Journal of Intelligent and Robotic Systems: Theory and Applications, article in press, 2015.

[5] D. Zhao, "Fast filter bank convolution for three-dimensional wavelet transform by shared memory on mobile GPU computing," Journal of Supercomputing, vol. 71, no. 9, pp. 3440-3455, 2015.

[6] Elinux, "Jetson TK1," 2015. [Online]. Available: http://elinux.org/Jetson_TK1.

[7] NVidia, "Tegra K1 Whitepaper," 2015. [Online]. Available: http://www.nvidia.com/content/PDF/tegra_white_papers/Tegra-K1-whitepaper-v1.0.pdf.

[8] P. Yalamanchili, U. Arshad, Z. Mohammed, P. Garigipati and P. Entschev, "ArrayFire - A high performance software library for parallel computing with an easy-to-use API," AccelerEyes, Atlanta, 2015.

[9] E. L. Sueur and G. Heiser, "Dynamic voltage and frequency scaling: the laws of diminishing returns," in Proceedings of the 2010 International Conference on Power Aware Computing and Systems (HotPower'10), Berkeley, CA, USA, 2010.

[10] K. Stokke, H. Stensland, C. Griwodz and P. Halvorsen, "Energy efficient continuous multimedia processing using the tegra K1 mobile SoC," in Proceedings of the 7th ACM Workshop on Mobile Video, MoVid 2015, 2015.

[11] Ubuntu, "What is the difference between a desktop environment and a window manager?," 2015. [Online]. Available: http://askubuntu.com/questions/18078/what-is-the-difference-between-a-desktop-environment-and-a-window-manager.


[12] B. Olson, "Imaging Robotics and Intelligent Systems Lab. UTK," 2015. [Online]. Available: http://imaging.utk.edu/classes/spring2015/ece491/molson5/index.html.

[13] Wikipedia, "Gamma Correction," 2015. [Online]. Available: https://en.wikipedia.org/wiki/Gamma_correction#/media/File:GammaCorrection_demo.jpg.

[14] T. Kurban, "Image Gamma Correction ArrayFire Source Codes, IRIS Lab.," 2015. [Online]. Available: http://imaging.utk.edu/research/tkurban/software/AF_Image_Gamma.zip.

[15] Mcclanahoochie, "Image Processing with ArrayFire and OpenCV on the GPU," 2015. [Online]. Available: http://blog.accelereyes.com/blog/2012/09/19/image-processing-with-arrayfire-and-opencv/.

[16] R. Kurban, "Video File Processing using OpenCV and ArrayFire," 2015. [Online]. Available: http://imaging.utk.edu/research/rifkur/software/AF_OpenCV_VideoFromFile.zip.

[17] Elinux, "Jetson H264 Codec," 2015. [Online]. Available: http://elinux.org/Jetson/H264_Codec.

[18] NVidia, "Jetson TK1 Multimedia User Guide," 2015. [Online]. Available: http://developer.download.nvidia.com/embedded/L4T/r21_Release_v3.0/L4T_Jetson_TK1_Multimedia_User_Guide_V2.1.pdf.

[19] Ximea, "xiApi Linux ARM Support," 2015. [Online]. Available: http://www.ximea.com/support/wiki/apis/Linux_ARM_Support.

[20] T. Kurban, "Streaming Video Processing using Ximea Camera," 2015. [Online]. Available: http://imaging.utk.edu/research/tkurban/software/AF_Xi_VideoStream_Gamma.zip.

[21] VREO, "Downloads," 2015. [Online]. Available: http://vreo.biz/downloads.

[22] Ubuntu, "Qt V4L2 Test Utility," 2015. [Online]. Available: https://apps.ubuntu.com/cat/applications/oneiric/qv4l2/.

[23] MSDN, "Recommended 8-Bit YUV Formats for Video Rendering," 2015. [Online]. Available: https://msdn.microsoft.com/en-us/library/windows/desktop/dd206750%28v=vs.85%29.aspx.

[24] R. Kurban, "Streaming video processing using VREO board and Sony camera," 2015. [Online]. Available: http://imaging.utk.edu/research/rifkur/software/v4l_fb.cpp.
