Performance Analysis of Parallel Execution of H.264 Encoder on the Cell Processor

Jonghan Park Soonhoi Ha

School of EECS, Seoul National University, Seoul, Korea

{ forhim, sha }@iris.snu.ac.kr

Abstract: Performance improvement by parallel execution depends on two factors: the potential parallelism of the application itself, and the optimal mapping of the application to the target architecture, which is usually very target specific. As a case study, we analyze the expected performance of parallel execution of an H.264 encoding algorithm, known as X264, on the Cell processor. Considering the communication architecture of the Cell processor, we parallelize the algorithm at the macro-block level. From the performance analysis, we discover the overhead factors of parallel execution and estimate the expected performance. Comparison with simulation results proves the accuracy and the usefulness of the proposed analysis method.

I. INTRODUCTION

The insatiable demand for system performance makes it inevitable to integrate more and more processing elements in a single chip, called MPSoC (Multi-Processor System on a Chip), to meet the performance requirement [8]. Recently, Sony, Toshiba, and IBM (known as STI) have jointly developed the Cell processor, which integrates 9 processor cores in a single chip. To get the most benefit from such a powerful chip, it is essential to exploit the parallelism of the application optimally.

H.264 is gaining wider acceptance as a video codec algorithm for mobile multimedia devices due to its higher compression ratio than the previous algorithms. Mobile devices usually support the baseline profile of the H.264 decoding algorithm. Since the decoding complexity of the baseline profile of the H.264 algorithm is not much higher than that of the previous algorithms, there is no need for parallel execution. On the other hand, there are applications, such as HDTV, that require the main profile of the H.264 algorithm with higher quality and larger frame size. The main profile uses B-frame, weighted prediction, and CABAC algorithms that increase the computational complexity beyond the processing capability of a single processor.

In this paper, we are concerned with the main profile H.264 encoding algorithm. In particular, our primary interest is to know how much speed gain we can achieve by parallel execution of the algorithm. Performance improvement by parallel execution depends on two factors: the potential parallelism of the application itself, and the optimal mapping of the application to the target architecture, which is usually very target specific. As a case study, we analyze the expected performance of parallel execution of an H.264 encoding algorithm, known as X264, on the Cell processor. To the best of our knowledge, no performance report has been published on the H.264 encoding algorithm on the Cell processor. Considering the communication architecture of the Cell processor, we parallelize the algorithm at the macro-block level. Some previous works sacrifice the compression ratio by using a simpler encoding algorithm for MPEG encoding [5]. In this paper, however, we keep the original X264 algorithm in order to maintain the encoding quality.

The remainder of this paper is organized as follows. Section II reviews the background information on the Cell processor and the H.264 encoding algorithm. Section III presents the proposed parallel execution of the X264 algorithm and its performance analysis. In Section IV, we show the simulation results and compare them with the analysis results. Section V concludes the paper.

II. BACKGROUND

II.1 Cell Processor

The Cell processor is a heterogeneous MPSoC that consists of a Power Processor Element (PPE) and 8 Synergistic Processor Elements (SPEs). The PPE is a 64-bit Power-Architecture compliant core optimized for power efficiency. It supports two simultaneous threads of execution and can be viewed as a 2-way multiprocessor. The SPE is a special purpose processor which consists of the Synergistic Processor Unit (SPU) with a 256KB local store (LS) and a Memory Flow Controller (MFC).

To execute a program on the SPE, we need to load the code and data into the LS. If the sum of the code and data memory requirements is larger than 256KB, a code overlay technique must be used, which makes the parallelization difficult to achieve. Since there is no globally shared memory, the SPE cannot read the main memory directly. Therefore, data must be transferred between the LS of an SPE and the main memory using DMA commands to the MFC. Since the SPE supports a variety of SIMD (Single Instruction Multiple Data) instructions, it can execute computation intensive kernels effectively. Generally, the PPE invokes threads which are run on the SPEs, and manages the flow of the application.

II.2 H.264 Encoding Algorithm

The flowchart of the H.264 algorithm is sketched in Figure 1. Most of its execution time is spent on the analysis and encoding of macroblocks (MBs). For the encoding of a QVGA format video, the execution times are distributed as shown in Figure 2. As the frame size becomes larger, the portion of analysis and encoding of MBs becomes higher. So, it is essential to speed up these two modules by parallel execution.

[Figure 1. Flowchart of H.264 Encoder (Read File, Frame Init, then N times: MB Analysis (Intra / ME & MC) and MB Encode (DCT, Scaling, Quant) & Deblocking Filter, until End of Frame, then CABAC and Write File)]

X264 [1] is an open source H.264 encoding algorithm that is much faster than the JM Reference code [2]. It uses a good early termination algorithm and a hexagon search algorithm instead of full search.

[Figure 2. X264 encoding profile]

III. PARALLELIZATION AT THE MACROBLOCK LEVEL

There are several approaches to parallelizing the H.264 encoding algorithm: the most popular choices work at the frame (or slice) level [4]-[5] or at the motion estimation (ME) level [7]. Parallelization at the frame level, however, does not fit the Cell processor due to the space limitation of the LS of an SPE. The LS is not able to store the entire frame data together with the code. In addition, dynamic allocation of the data buffer in the SPE is needed if the frame size changes, which is not a good programming model for the SPE.

Parallel execution of the motion estimation algorithm is the most popular technique for hardware implementation. If we used the same parallel execution on the Cell processor, we would have to pay a huge overhead of data transfer between the SPE and the main memory.

Therefore, we parallelize the algorithm at the MB level. Since the MB size is constant (16x16 for luminance and 8x8 for chrominance), we can allocate the data buffer statically, and the LS of an SPE can accommodate all the required data and the code.

III.1. Pipelining the algorithm

The H.264 encoding algorithm is first partitioned into three sections that are executed in a pipelined fashion: two for frame data processing and one for macro-block processing. For each frame, the macro-block processing section is invoked N times if a frame consists of N macro-blocks. For the QVGA frame size, N is 300 (=15x20). We can further pipeline the MB processing section itself between the 'MB Analysis' and the 'MB Encode' modules in Figure 1.

Figure 3 shows the execution profile after pipelining the algorithm as explained above if we use only the PPE of the Cell processor. It is the reference execution that will be used for performance comparison with parallel execution. As shown in the figure, by using separate threads for MB Analysis and MB Encode, we overlap most of their executions to hide the MB encoding time. On the other hand, the PPE has to pay a synchronization overhead between the two threads; this overhead is negligible within a PPE.

[Figure 3. Pipelining in PPE]

If we move the 'MB Analysis' module to a single SPE, the execution profile becomes as shown in Figure 4. Two adverse effects then surface. First, the data transfer overhead between the PPE and the SPE has to be paid as the synchronization overhead between threads. Second, the raw performance of the SPE is lower than that of the PPE, so the execution time of the 'MB Analysis' module itself increases. As a result, using a single SPE for the macro-block processing lengthens the total execution time.

[Figure 4. MB Analysis on SPE]

We expect that parallel execution of macro-block processing will outweigh this performance overhead. For quantitative analysis, we define the following notations:

$t_{se}$ : sum of the execution times of the frame processing modules that are executed sequentially in the PPE. It sums up to 10% in Figure 2.

$t_{enc}$ : execution time of the 'MB Encoder' module, which is 6% in Figure 2.

$r_{spe}$ : performance ratio between an SPE and the PPE. With no SPE-specific optimization, this ratio is 0.43, meaning that an SPE is slower than the PPE by a factor of 1/0.43.

$P$ : total number of SPE processors used.

$d(P)$ : data transfer overhead between the PPE and the SPEs that is not hidden. In our experiments, this overhead is measured to be less than 1% of the total execution time.
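As a rough illustration of the single-SPE case described above, the notation can be plugged into a short calculation. The sketch below is not part of the original analysis. It assumes that the MB analysis share of the normalized execution time is 1 - t_se - t_enc, that this share is simply scaled by 1/r_spe when it is moved to one SPE, and that the pipelined PPE-only reference hides the MB encoding time completely. The function names are hypothetical, and d is set to the 1% upper bound quoted above.

```python
def reference_time(t_enc):
    """Normalized PPE-only pipelined time, assuming MB encoding is fully hidden."""
    return 1.0 - t_enc


def single_spe_time(t_se, t_enc, r_spe, d):
    """Normalized time when 'MB Analysis' runs on one SPE (assumed simple scaling)."""
    analysis_share = 1.0 - t_se - t_enc  # assumed share of MB analysis
    return t_se + analysis_share / r_spe + d


# Values quoted in the text; D is the stated upper bound, not a measurement.
T_SE, T_ENC, R_SPE, D = 0.10, 0.06, 0.43, 0.01

ref = reference_time(T_ENC)
one_spe = single_spe_time(T_SE, T_ENC, R_SPE, D)
print(f"reference: {ref:.2f}, one SPE: {one_spe:.2f}, ratio: {one_spe / ref:.2f}")
```

Under these assumptions a single unoptimized SPE comes out more than twice as slow as the PPE-only reference, which agrees with the observation that moving 'MB Analysis' to one SPE lengthens the total execution time.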

Then a simple estimate of the expected performance using P SPEs becomes

$$T(P) = t_{se} + \frac{1 - t_{se} - t_{enc}}{P} \cdot \frac{1}{r_{spe}} + d(P) \qquad (1)$$

Equation (1) says that the maximum performance gain with 8 SPEs can be as large as 2.7, compared with the reference execution time, which is $1 - t_{enc}$. This gain is possible only if the MB processing is fully parallelizable. This is not true in reality, so we may not achieve that much performance gain; the reason is discussed in the following sub-sections.

III.2 Parallel Execution of 'MB Analysis'

Figure 5 illustrates a simple frame structure that consists of 4x7 MBs, where MBs are indexed by (i,j). In the H.264 encoding algorithm, there is a dependency between MBs: to analyze MB(i,j), we need to refer to the analysis results of MB(i-1,j), MB(i-1,j+1), and MB(i,j-1). Therefore, the profile of macro-block processing becomes as shown in Figure 6. Initially two macro-blocks, MB(1,1) and MB(1,2), have to be analyzed sequentially. Let T be the execution time of one macro-block analysis. After 2T, two macro-blocks, MB(1,3) and MB(2,1), can be analyzed concurrently. After 4T, three macro-blocks can be analyzed concurrently, and so on. Thus the macro-blocks are not fully parallelizable.

(1,1) (1,2) (1,3) (1,4) (1,5) (1,6) (1,7)
(2,1) (2,2) (2,3) (2,4) (2,5) (2,6) (2,7)
(3,1) (3,2) (3,3) (3,4) (3,5) (3,6) (3,7)
(4,1) (4,2) (4,3) (4,4) (4,5) (4,6) (4,7)

Figure 5. A simple frame that is composed of 28 MBs

[Figure 6. MB Profiling]

Let us estimate the performance gain from parallel execution of macro-block processing. First we compute the maximum number of concurrently analyzable MBs, which can be formulated as equation (2).

$$n = \min(\min(cols / 2, \; rows), \; P) \qquad (2)$$

The first term indicates the potential parallelism of the algorithm. If it is larger than the number of SPEs, the maximum number of concurrently analyzable MBs is limited to P. Figure 6 shows that this number is 4 for the frame of Figure 5. After the maximum parallelism is achieved, the parallelism decreases again at the tail of the profile, which makes the profile bisymmetric.

So, the total execution time of parallel execution of 'MB Analysis' becomes

$$T \times \left( 4(n-1) + \frac{N - 4\sum_{k=1}^{n-1} k}{n} \right) \qquad (3)$$

The first term in the parentheses is the sum of the time durations before and after maximum parallelism. The maximum parallelism is achieved after we spend $2(n-1)$ time units at the beginning of the profile. The same amount of time is spent at the tail of the profile. During that duration, as many as $4\sum_{k=1}^{n-1} k$ MBs are analyzed. The remaining MBs can be processed at full parallelism, and the duration of that period is given by the second term.

Since T is $\frac{1 - t_{se} - t_{enc}}{N} \cdot \frac{1}{r_{spe}}$, the total execution time becomes

$$T(P) = t_{se} + \frac{1 - t_{se} - t_{enc}}{N \, r_{spe}} \left( 4(n-1) + \frac{N - 4\sum_{i=1}^{n-1} i}{n} \right) + d(P) = t_{se} + \frac{1 - t_{se} - t_{enc}}{N \, r_{spe}} \left( \frac{N}{n} + 2(n-1) \right) + d(P) \qquad (4)$$

Equation (3) indicates that the second term of equation (1) is slowed down by the following factor:

$$\frac{P}{N} \left( \frac{N}{n} + 2(n-1) \right)$$

For the QVGA frame size, n = 8, P = 8, N = 300, and the slow-down factor becomes 1.37.

III.3. Non-uniform Execution Time

There are 3 types of frame in H.264. An I-frame appears every dozens or hundreds of frames, and its time variance of MB analysis is not large. But the time variance of MB analysis for P and B frames is significant enough to be analyzed in detail.

[Figure 7. Steps of MB analysis (boxes: 16x16, Early Termination (SKIP), 8x8, 16x8/8x16, Quarter-pel Refinement, Intra Prediction, To encode stage)]

MB analysis of a P or B frame consists of 5 steps as shown in Figure 7. An MB of size 16x16 is analyzed first. If the difference between the motion-estimated MB and the current MB is smaller than a threshold, the MB analysis is terminated; this is called "early termination" and saves a significant portion of the execution time. Otherwise the MB is divided into four 8x8 blocks, which are analyzed separately. The next [16x8/8x16] step is executed only when the cost of the [8x8] analysis is smaller than that of the [16x16] analysis. The following two steps are executed once the [8x8] analysis is performed.

[Figure 8. Profiling of each step in MB analysis (16x16: 22%, 8x8: 24%, 16x8/8x16: 37%, Refine and Intra: the remaining 8% and 9%)]

The relative execution times of these 5 steps are displayed in Figure 8, where the sum is normalized to 1. Since some steps need not be executed, we cannot predict how long it will take to analyze an MB. So the execution time of 'MB Analysis' varies at run time, and its average value is used in equation (3).

There is another slow-down factor for parallel execution of MB Analysis. Some SPEs have to wait until other SPEs finish their current execution. Suppose MB(2,1) finishes earlier than MB(1,3), due to early termination, during [2T,3T] in Figure 6. Since MB(2,2) and MB(1,4) depend on MB(1,3), the SPE that executes MB(2,1) has to wait until MB(1,3) is completed. Such idle duration should be taken into account as another slow-down factor. Note that if we use one PPE for the whole MB processing, no waiting time is needed. Since the slow-down factor depends on the input scene, it is not possible to compute it at compile time. So we propose an estimation formula for the slow-down factor.

First, the slow-down factor depends on the utilization factor of the SPE. We define the normalized value of T, the macro-block analysis time, as the utilization factor s. To compute T, we obtain the execution profiles of the five steps as shown in Figure 9: (A)% of MBs are terminated early after [16x16] processing, while (C)% of MBs need all five steps. The remaining (B)% of MBs need all steps except the [16x8/8x16] step. Then the utilization factor s becomes

$$s = A \times 0.22 + B \times (0.22 + 0.24 + 0.08 + 0.09) + C \times 1 \qquad (5)$$

[Figure 9. Ratio of MB analysis step ([16x16]: (A)%; [8x8], [Qpel Refine], [Intra]: (B)%; [16x8/8x16]: (C)%)]

Idle time occurs only when there are no executable MBs after the current MB analysis completes. The idle time then slows down the MB analysis by the utilization factor s. If there are enough concurrently executable MBs, no idle time will be experienced. So the slow-down factor also depends on the number of concurrently executable MBs. Thus the slow-down factor depends heavily on the input video characteristics. Nonetheless, considering these two factors and the execution profile of Figure 6, we compute the estimated slow-down factor as follows:

1. Identify the time period during which the degree of parallelism is greater than or equal to the number of SPEs times a weight factor k. During this period, the number of concurrently executable MBs is sufficiently larger than the number of SPEs that no SPE will experience any idle time. The MBs associated with this time period are expected to be analyzed without any slow-down. In our experiment, we set the weight factor k to 1.5.

2. We assume that the remaining MBs belong to the time period during which SPEs are likely to experience idle time. We let the fraction of these MBs be x.

3. Then the time period without any slow-down becomes $\frac{1-x}{1+x}$ of the total MB Analysis time, and the time period with slow-down by s becomes $\frac{2x}{1+x}$ of the total processing time. Note that during the time period with slow-down, only half of the SPEs are processing MBs on average, as illustrated in Figure 6. So the ratio between the two time periods becomes $2x : 1-x$ instead of $x : 1-x$.

4. As a result, the expected slow-down factor, K(P), for macro-block analysis becomes

$$K(P) = \frac{1-x}{1+x} + \frac{1}{s} \cdot \frac{2x}{1+x} \qquad (6)$$

III.4. Summary

Considering all slow-down factors of parallel execution, the total execution time of the X264 algorithm is summarized by the following equation:

$$T(P) = t_{se} + \frac{1 - t_{se} - t_{enc}}{N \, r_{spe}} \left( \frac{N}{n} + 2(n-1) \right) K(P) + d(P) \qquad (7)$$

where $K(P) = \frac{1-x}{1+x} + \frac{1}{s} \cdot \frac{2x}{1+x}$.

IV. EXPERIMENTS

In this section, we examine the accuracy of the expected performance in equation (7) by comparing it with the experimental results. We used Sony's PlayStation3 (PS3) and IBM's Cell Simulator (known as MAMBO) as experiment environments. For the PS3 environment, we loaded Terasoft's Yellow Dog Linux on the PS3 but noticed that the system does not use all 8 SPEs; instead it uses at most 6 SPEs. So we report only the simulation results in this paper.

As a test input video, we used a music video clip that contains lots of busy motion throughout. The frame is of QVGA (320x240) format and contains 15x20 MBs, so the maximum number of concurrently executable MBs is 10. As shown in Figure 2, $t_{se}$ = 0.10 and $t_{enc}$ = 0.06. We found that the data transfer overhead, $d(P)$, between the PPE and the SPEs takes only 0.5% of the total execution time.

IV.1. Calculating the Slow-down Factor K(P)

We obtained the profile information from 20 frames and got A=35%, B=25%, C=40% in Figure 9. So the utilization factor of an SPE is set to 0.63: s = 0.63. The slow-down factor K(P) is a function of P as well as of the frame size. When P is 2, most of the MBs can be processed concurrently, so that K(2) is close to 1. For the case of P=8, x becomes 1 because the maximum number of concurrently executable MBs is less than 12 (=1.5x8). So the slow-down factor becomes 1/s, which is 1.59. Table 1 shows the slow-down factors for other values of P.

Table 1. Delay Constant in QVGA

P       2      4      6      8
K(P)    1      1.26   1.44   1/s

IV.2. Performance Estimation

Figure 10 displays the performance results from simulation and the proposed analysis, varying the number of SPEs.

[Figure 10. Comparison between estimation and experiment (x axis: num of SPE; y axis: Frame/Sec; series: Estimation, Experiment)]

When we run the pipelined version of X264 on the PPE without using any SPE, we obtain about 19.67 frames/sec. Since the performance ratio between an SPE and the PPE, $r_{spe}$, is 0.43 without any SPE-specific optimization, we could not obtain any speed-up until we used more than 2 SPEs. It is noteworthy that the analysis results are very close to the simulation results for every number of SPEs, which demonstrates the accuracy of the analysis.

IV.3. SPE Optimization

So far we have not applied any SPE-specific optimization. The expected performance gain is then bounded by equation (1): with 8 SPEs, we may not obtain a performance gain of more than 2.7. To obtain more gain, we have to apply SPE-specific optimization. There are two methods of optimization. One is to use Vector/SIMD instructions as effectively as possible. A simple modification using a single SIMD instruction has improved the SPE performance so that $r_{spe}$ becomes 0.53. This simple change increases the performance up to 37.17 for QVGA and 15.98 for VGA size of frame.

Another method is to optimize the memory access to the previous frame during the motion estimation. X264 accesses the memory region of the reference frame by pointer to compute the absolute difference between the current MB and that of the reference frame. On the Cell processor, however, because the SPE has no way of accessing the main memory directly, we have to copy the data to the LS. Our measurement reveals that data copy takes more than 15% of the MB analysis time. Since X264 uses hexagon search for integer-pel and sub-pel, and square search for quarter-pel, we can reduce the copy overhead by removing any redundant data copy.

Even though it is not yet known how much gain we can obtain from such optimization, Figure 11 shows the expected performance improvement as the value of $r_{spe}$ varies. If we increase $r_{spe}$ to 2, the figure shows that we can increase the performance by 455% by utilizing 8 SPEs.

[Figure 11. Expected performance improvement by increasing r_spe (x axis: r_spe; y axis: Frame/Sec)]

V. CONCLUSION

We analyzed the expected performance of parallel execution of an H.264 encoding algorithm, known as X264, on the Cell processor. Considering the communication architecture of the Cell processor, we parallelize the algorithm at the macro-block level. By examining the execution profile of the algorithm, we have discovered several factors that hinder the speed-up of parallel execution. Through detailed analysis, we formulate the expected performance in closed form: equation (7). The equation shows what the overhead factors are and how much each factor affects the performance. It reveals that we need to apply SPE-specific optimization to obtain a meaningful speed-up on the Cell processor, which remains as future work. Comparison with simulation results proves the accuracy and the usefulness of the proposed analysis. The optimization will be to use Vector/SIMD instructions, and to reduce the data transfer time between the PPE and SPE by carefully reusing the fetched data.
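The macro-block dependency of Section III.2 can be explored with a small scheduling sketch. It is an illustration only: it assumes every MB analysis takes exactly one time unit T, that the number of SPEs is unlimited, and that an MB starts as soon as its left, upper, and upper-right neighbours are done. The function names are hypothetical, and indices are 0-based rather than the 1-based (i,j) of Figure 5.

```python
from collections import Counter


def start_times(rows, cols):
    """Earliest start slot (in units of T) of every MB under the MB dependency."""
    start = {}
    for i in range(rows):
        for j in range(cols):
            deps = []
            if j > 0:
                deps.append((i, j - 1))          # left neighbour, MB(i,j-1)
            if i > 0:
                deps.append((i - 1, j))          # upper neighbour, MB(i-1,j)
                if j + 1 < cols:
                    deps.append((i - 1, j + 1))  # upper-right neighbour, MB(i-1,j+1)
            # An MB can start one slot after its latest dependency has started.
            start[(i, j)] = max((start[d] + 1 for d in deps), default=0)
    return start


def parallelism_profile(rows, cols):
    """Number of MBs that can be analyzed in each time slot."""
    counts = Counter(start_times(rows, cols).values())
    return [counts[t] for t in range(max(counts) + 1)]


profile = parallelism_profile(4, 7)  # the 4x7 frame of Figure 5
print(profile, "max parallelism:", max(profile), "slots:", len(profile))
```

For the 4x7 frame the profile rises in steps of two slots to a maximum of 4 and falls symmetrically, in line with the description of Figure 6; for a 15x20 frame the same sketch reaches a maximum of 10.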

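The closed form of equation (7) lends itself to a small calculator. The following sketch is an illustration rather than the authors' tool. The function names are hypothetical, cols/2 in equation (2) is rounded up (an assumption that reproduces n = 4 for the frame of Figure 5), and the fraction x of MBs exposed to idle time is passed in rather than derived, since it depends on the input video.

```python
import math


def max_parallel_mbs(cols, rows, num_spes):
    """Equation (2); rounding cols/2 up is an assumption of this sketch."""
    return min(math.ceil(cols / 2), rows, num_spes)


def utilization(a, b, c, steps=(0.22, 0.24, 0.08, 0.09)):
    """Equation (5): a, b, c are the MB fractions A, B, C (summing to 1).

    steps holds the relative step costs that appear in equation (5).
    """
    return a * steps[0] + b * sum(steps) + c * 1.0


def slowdown(x, s):
    """Equation (6): x is the fraction of MBs exposed to idle time."""
    return (1 - x) / (1 + x) + (2 * x / (1 + x)) / s


def total_time(num_spes, cols, rows, t_se, t_enc, r_spe, k, d):
    """Equation (7): normalized execution time with num_spes SPEs."""
    n = max_parallel_mbs(cols, rows, num_spes)
    num_mbs = cols * rows
    per_mb = (1 - t_se - t_enc) / (num_mbs * r_spe)
    return t_se + per_mb * (num_mbs / n + 2 * (n - 1)) * k + d


# QVGA settings quoted in Section IV: 20x15 MBs, A/B/C = 35/25/40 %, d = 0.5 %.
s = utilization(0.35, 0.25, 0.40)
k8 = slowdown(1.0, s)  # x = 1 for P = 8, as stated in Section IV.1
t8 = total_time(8, 20, 15, 0.10, 0.06, 0.43, k8, 0.005)
print(f"s = {s:.2f}, K(8) = {k8:.2f}, T(8) = {t8:.3f}")
```

Dividing the reference time 1 - t_enc by the returned value gives the estimated speed-up over the PPE-only pipeline for a given number of SPEs.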
VI. ACKNOWLEDGEMENT

This work was supported by BK21 project, SystemIC 2010 project funded by Korean MOCIE, and Creative Research Initiative sponsored by KOSEF research program (R17-2007-086-01001-0). This work was also partly sponsored by ETRI SoC Industry Promotion Center, Human Resource Development Project for IT-SoC Architect. The ICT and ISRC at Seoul National University and IDEC provide research facilities for this study.

REFERENCES

[1] X264, http://www.videolan.org/developers/x264.html
[2] JM Reference, http://iphome.hhi.de/suehring/tml/index.htm
[3] Iain E. G. Richardson, H.264 and MPEG-4 Video Compression (Video Coding for Next-generation Multimedia)
[4] A. Rodriguez, A. Gonzalez, M.P. Malumbres, Hierarchical Parallelization of an H.264/AVC Video Encoder, In Proceedings of the PARELEC'06, pp. 363-368
[5] Yen-Kuang Chen, Xinmin Tian, Steven Ge, Milind Girkar, Towards Efficient Multi-Level Threading of H.264 Encoder on Intel Hyper-Threading Architectures, Proceedings of the 18th International Parallel and Distributed Processing Symposium, pp. 63-70
[6] Aleksandar Beric, Ramanathan Sethuraman, Carlos Alba Pinto, Harm Peters, Gerard Veldman, Peter van de Haar, Marc Duranton, Heterogeneous multiprocessor for high definition video, In Proceedings of ICCE'06, pp. 401-402
[7] Chuan-Yu Cho, Shiang-Yang, Jia-Shung Wang, An Embedded Merging Scheme for H.264/AVC Motion Estimation, ICIP 2003, pp. I-909-12
[8] Julien Bernard, Jean-Louis Roch, Serge De Paoli, Miguel Santana, Adaptive Encoding of Multimedia Streams on MPSoC, Proceedings of ICCS 2006, Part IV, LNCS 3994, pp. 999-1006
[9] IBM, Cell Broadband Engine Programming Tutorial
[10] IBM, Full-System Simulator User's Guide
[11] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, D. Shippy, Introduction to the Cell Processor
[12] IBM, Cell Broadband Engine : Software Development Kit 2.0 Programmer's Guide
[13] Jonathan Rentzsh, Data alignment, Straighten up and fly right
[14] Jonathan Bartlett, Programming high-performance applications on the Cell/B.E. processor
[15] Sidney Manning, Michael Kistler, Debugging Cell Broadband Engine systems
[16] Seiji Maeda, Shigehiro Asano, Tomofumi Shimada, Koichi Awazu, Haruyuki Tago, A Real-Time Software Platform for the CELL Processor
[17] Ryuji Sakai, Seiji Maeda, Christopher Crookes, Mitsuru Shimbayashi, Katsuhisa Yano, Tadashi Nakatani, Hirokuni Yano, Shigehiro Asano, Masaya Kato, Hiroshi Nozue, Tatsunori Kanai, Tomofumi Shimada, Koichi Awazu, Programming and Performance Evaluation of the Cell Processor
[18] Nobuhiro Kato, Kazuaki Takeuchi, Seiji Maeda, Mitsuru Shimbayashi, Ryuji Sakai, Hiroshi Nozue, Jiro Amemiya, Digital Media Applications on a CELL Software Platform, Proceedings of ICCE'06, pp. 347-348
[19] O. Ozturk, M. Kandemir, I. Demirkiran, G. Chen, M. J. Irwin, Data compression for improving SPM behavior, Proceedings of the 41st annual conference on Design automation, June 07-11, 2004, San Diego, CA, USA
[20] RFC1014, XDR : External Data Representation Standard
[21] RFC1832, XDR : External Data Representation Standard