Multiprocessing GPU Acceleration of H.264/AVC Motion Estimation under CUDA Architecture

Eduarda R. Monteiro, Bruno B. Vizzotto, Cláudio M. Diniz, Bruno Zatt, Sergio Bampi
Informatics Institute, Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Brazil
{ermonteiro, bbvizzotto, cmdiniz, bzatt, bampi}@inf.ufrgs.br

Abstract— This work presents a parallel GPU-based solution for Motion Estimation (ME) in a video encoding system. We propose a way to partition the steps of the Full Search block matching algorithm on the CUDA architecture, and we compare its performance with a theoretical model and two reference implementations (sequential, and parallel using the OpenMP library). We obtained an O(n²/log²n) speed-up, which fits the theoretical model for different search areas. It represents up to a 600x gain compared to the serial implementation and 66x compared to the parallel OpenMP implementation.

I. INTRODUCTION

In the past decade, the demand for high-quality digital video applications has drawn the attention of industry and academia to the development of advanced video coding techniques and standards, e.g. H.264/AVC [1], the current state of the art, which provides higher coding efficiency than previous standards (MPEG-2, MPEG-4, H.263).

In this scenario, Motion Estimation (ME) [2] is a key issue in order to obtain high compression gains. ME requires intensive computation and memory communication for the block matching task. By exploiting the inherent parallelism of ME, our work presents a parallel GPU-based solution for the Full Search (FS) block matching algorithm [2] implemented on the CUDA architecture [3].

II. PROPOSED IMPLEMENTATION

Our hardware system is formed by a CPU and a GPU. The CPU manipulates the video to separate the original and reference frames and feeds the GPU, as shown in Figure 1. The FS algorithm is divided into two steps (see Figure 1): i) computing the Sum of Absolute Differences (SAD) of all matches inside a search area (a region in the reference frame); ii) comparing the results to find the best match (lowest SAD).

We considered the processing hierarchy of the CUDA architecture for the implementation. A set of threads (the basic processing unit) is organized in blocks, which are elements of a grid. In our work, the ME algorithm is mapped to the GPU as a kernel (a procedure executed on the GPU) whose grid and block sizes are defined by the video resolution and the search area, respectively; each thread block processes one 4x4 video block.

III. RESULTS AND CONCLUSIONS

Tests were performed for three video resolutions (CIF, HD720p and HD1080p), comparing the CUDA implementation running on an NVIDIA GTX 480 @ 1.40GHz (480 functional units) to both the serial and the parallel OpenMP implementations running on a Core 2 Quad Q9550 @ 2.82GHz.

Results for CIF resolution show that the serial implementation is in accordance with the theoretical model. The experimental data fit is approximately 40n² for n ranging from 12 to 128, where n*n is the number of pixels inside a search area, as shown in Figure 2a.

Speed-up results were also consistent with the theoretical model O(n²/log²n) for different search areas, as shown in Figure 2b. The obtained speed-up represents up to a 600x gain compared to the serial software implementation and 66x compared to the parallel OpenMP software implementation.

Figure 2. (a) Theoretical model (in blue) versus obtained results for CIF videos. (b) Theoretical speed-up (in red) versus obtained speed-up with CUDA (in blue) for different search areas.
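The two-step FS algorithm described in Section II (SAD computation over all candidates, then best-match selection) can be sketched as a minimal sequential C reference. This is an illustrative sketch, not the authors' code: `sad_4x4`, `full_search`, the 8-bit samples and the row `stride` parameter are our assumptions.

```c
#include <stdlib.h>
#include <limits.h>

/* Step i: Sum of Absolute Differences between a 4x4 current block and
 * one candidate block in the reference frame. */
static int sad_4x4(const unsigned char *cur, const unsigned char *ref,
                   int stride)
{
    int sad = 0;
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            sad += abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}

/* Step ii: Full Search over an n x n search area; evaluate every
 * candidate position and keep the one with the lowest SAD. Returns the
 * best SAD and writes the motion vector to (*mvx, *mvy). */
static int full_search(const unsigned char *cur, const unsigned char *ref,
                       int stride, int n, int *mvx, int *mvy)
{
    int best = INT_MAX;
    for (int dy = 0; dy <= n - 4; dy++) {
        for (int dx = 0; dx <= n - 4; dx++) {
            int s = sad_4x4(cur, ref + dy * stride + dx, stride);
            if (s < best) { best = s; *mvx = dx; *mvy = dy; }
        }
    }
    return best;
}
```

The doubly nested candidate loop is what yields the O(n²) sequential complexity per 4x4 block; it is exactly this loop that the CUDA kernel spreads across threads.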

Figure 1. Proposed Algorithm Flow

REFERENCES
[1] ITU-T Recommendation H.264 (03/10): Advanced video coding for generic audiovisual services, 2010.
[2] HUANG, Y.-W. et al. Survey on Block Matching Motion Estimation Algorithms and Architectures with New Results. J. VLSI Signal Process. Syst. 42, 3 (March 2006), pp. 297-320.
[3] NVIDIA Corp. Available at: http://www.nvidia.com

MULTIPROCESSING GPU ACCELERATION OF H.264/AVC MOTION ESTIMATION UNDER CUDA ARCHITECTURE

Eduarda Monteiro, Bruno Vizzotto, Cláudio Diniz, Bruno Zatt, Sergio Bampi [ermonteiro,bbvizzotto,cmdiniz,bzatt,bampi]@inf.ufrgs.br

Introduction and Overview of H.264/AVC and Motion Estimation

• The H.264/AVC standard introduces a significant gain in compression rates when compared to existing standards (MPEG-2, MPEG-4, H.263);
• H.264/AVC requires high computational complexity;
• Motion Estimation (ME) is a key operation in order to obtain high compression gains;
• ME requires intensive computation and memory communication to compute the best block matching;
• By exploiting the inherent parallelism of ME, our work presents a parallel GPU-based solution for the Full Search block matching algorithm implemented on the CUDA architecture.

Figure: The Motion Estimation (ME) process (block matching of the current block of the original frame against candidates inside a search area of the reference frame, producing a motion vector).

Parallel GPU-based Full Search ME
• Video: CIF, QCIF, HD720p, HD1080p;
• Frame separation into reference and original frames is performed on the CPU, which feeds the GPU;
• Complexity analysis, theoretical sequential and parallel (PRAM) models, where n x n is the number of pixels in the search area:
  - Sequential: O(n²);
  - Parallel: O(log²n);
  - Theoretical speed-up (computation + communication): 2n²/log²n.

Figure: Proposed ME algorithm flow (the host (CPU) transfers the video to device (GPU) memory; kernel step 1 performs the SAD values calculation; kernel step 2 performs the SAD comparison).

Figure: CUDA programming model and algorithm allocation (a grid of thread blocks covers the 4x4 blocks of the video frame; each block holds threads matching the search area size, n x n).

Results and Comparisons
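The algorithm allocation pictured above can be sketched as launch-geometry arithmetic. The reading that one thread block serves one 4x4 video block, with one thread per search-area candidate, is our interpretation of the figure, and `me_launch_dims` is a hypothetical helper (plain ints stand in for CUDA's dim3 launch parameters):

```c
/* Launch geometry for the assumed mapping: one thread block per 4x4
 * video block, one thread per candidate in the n x n search area. */
struct launch_dims {
    int grid_x, grid_y;   /* blocks: frame width/4 x height/4 */
    int block_x, block_y; /* threads: n x n search-area candidates */
};

static struct launch_dims me_launch_dims(int width, int height, int n)
{
    struct launch_dims d;
    d.grid_x = width / 4;
    d.grid_y = height / 4;
    d.block_x = n;
    d.block_y = n;
    return d;
}
```

For CIF (352x288) with n = 16 this yields an 88x72 grid of 16x16-thread blocks. Note that a real kernel must respect the device's threads-per-block limit (1024 on Fermi-class GPUs such as the GTX 480), so the largest search areas would require each thread to cover several candidates.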

Execution time (ms) per search area size n (the poster plots these on a log10 time axis for HD720p and HD1080p):

CIF:
n:             12     16     20     24     32     36     48      64      80      128
O(40*n²):      5760   10240  16000  23040  40960  51840  92160   163840  256000  655360
ME_FS_Serial:  6140   10796  16656  23828  42000  53093  94046   168453  262265  669984
ME_FS_OpenMP:  880    1300   2290   2970   5020   6180   10190   17580   28350   72500
ME_FS_CUDA:    697    698    702    710    726    735    748     798     859     1114

HD720p:
n:             12     16     20      24      32      36      48      64
ME_FS_Serial:  56390  99218  152609  218453  382968  485750  858765  1519687
ME_FS_OpenMP:  6910   11820  17210   25060   42790   52310   91730   161300
ME_FS_CUDA:    5755   5770   5841    5927    6105    6154    6314    6738

HD1080p:
n:             12      16      20      24      32      36       48
ME_FS_Serial:  125500  220484  343062  488421  861031  1089593  19311671
ME_FS_OpenMP:  16890   27770   43900   58700   97410   137650   207550
ME_FS_CUDA:    12888   12996   13041   13169   13545   13719    14005
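The headline gains follow directly from the measured CIF times at n = 128 (669984 ms serial, 72500 ms OpenMP, 1114 ms CUDA); `speedup` is a trivial helper of ours, shown only to make the arithmetic explicit:

```c
/* Speed-up of an accelerated run over a baseline, both in ms. */
static double speedup(double baseline_ms, double accelerated_ms)
{
    return baseline_ms / accelerated_ms;
}
```

speedup(669984.0, 1114.0) is about 601, the reported ~600x over the serial implementation, and speedup(72500.0, 1114.0) is about 65, the reported ~66x over OpenMP.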

Speed-up

Figure: Measured CUDA speed-up versus search area (n = 12 to 128), plotted against the theoretical 2n²/log²n curve; the speed-up reaches roughly 600x at n = 128.

Test conditions: Core 2 Quad Q9550 (2.82 GHz) and NVIDIA GTX 480 (1.40 GHz).

Conclusions

Tests with CIF, HD720p and HD1080p video resolutions running on an NVIDIA GTX 480 with 480 functional units showed results consistent with the theoretical model, obtaining an O(n²/log²n) speed-up for different search areas. The experimental data fit gave approximately 40n² for n ∈ [12, 128]. This represents up to a 600x gain compared to the serial software implementation and 66x compared to the parallel OpenMP software implementation.

Support: Federal University of Rio Grande do Sul (UFRGS), Informatics Institute
Mail address: Av. Bento Gonçalves, 9500, CEP 91501-970, Porto Alegre, RS, Brazil