Multi-Standard Reconfigurable Motion Estimation Processor for Hybrid Video Codecs


Jose L. Nunez-Yanez, Trevor Spiteri, George Vafiadis

Abstract — This work presents a programmable and configurable motion estimation processor capable of performing motion estimation across several state-of-the-art video codecs that include multiple tools to improve the accuracy of the calculated motion vectors. The core can be programmed using a C-style syntax optimized to implement arbitrary block matching algorithms and configured with different execution units depending on the selected codec, the available inter-coding options and the required performance. This flexibility means that the core can support the latest video codecs such as H.264, VC-1 and AVS at high-definition resolutions and frame rates. The configuration and programming phases are supported by an integrated development environment that includes a compiler and profiling tools enabling a designer without specific hardware knowledge to optimize the microarchitecture for the selected codec standard and motion search technique, leading to a highly efficient implementation.

Index Terms — Video coding, motion estimation, reconfigurable processor, H.264, VC-1, AVS, FPGA.

I. INTRODUCTION

The emergence of new advanced coding standards such as H.264 [1], VC-1 [2] and AVS [3], with their multiple coding tools, has introduced new complexity challenges in the motion estimation (ME) process used in inter-frame prediction. While previous standards such as MPEG-2 could only vary a few options, H.264, VC-1 and AVS add the freedom of using quarter-pixel resolutions, multiple reference frames, multiple partition sizes and rate-distortion optimization as tools to optimize the inter-prediction process. The potential complexity introduced by these tools operating on large reference areas makes the full-search approach, which exhaustively tries each possible combination, less attractive from a performance and power point of view. A flexible, reconfigurable and programmable motion estimation processor such as the one proposed in this work is well poised to address these challenges by fitting the core microarchitecture to the inter-frame prediction tools and algorithm of the selected encoding configuration. The concept was briefly introduced in [4] and is further developed in this work. The paper is organized as follows. Section II reviews significant work in the field of hardware architectures for motion estimation. Section III establishes the need for architectures able to support fast motion estimation algorithms in order to deliver high quality results in a power/area/time-constrained video processing platform. Section IV describes the processor microarchitecture details, while Section V presents the tools developed to program the core and explore the software/hardware design space for advanced motion estimation. Section VI presents the multi-standard hardware extensions targeting the VC-1 and AVS codecs. Finally, Section VII analyses and compares the complexity/performance of the proposed solution and Section VIII concludes the paper.

II. MOTION ESTIMATION HARDWARE REVIEW

A. Full Search ME Hardware

There are numerous examples of full-search motion hardware in the literature, and in this section we will review a few relevant examples. Full-search algorithms have been the preferred option for hardware implementations due to their regular dataflow, which makes them well suited to architectures using one-dimensional (1-D) or two-dimensional (2-D) systolic array principles with simple control and high hardware utilization. This approach generally avoids global routing, resulting in high clock frequencies. Practical implementations, however, need to consider the interfacing with the memories in which the frame data resides. This can result in large memory data widths and port counts or a large number of registers needed to buffer the pixel data. Data broadcasting techniques can be used to reduce the need for large memory bit widths, although this can reduce the achievable clock frequency. These ideas are developed in the work presented in [5] which uses a broadcasting technique to propagate partial sum of absolute differences (SAD) values and merges the partial results to obtain different block sizes in parallel.

An improvement on this concept is shown in [6], which develops a 2-D SAD tree architecture that operates on one reference location at a time using a four-row array of registers for reference data, thereby removing the need for broadcasting and also allowing a snake-scan search order which further increases data reuse. A different approach to using an adder tree is proposed in [7], which adds variable block size support to a 1-D systolic array by using each processing element (PE) to accumulate the SAD for each of the 41 motion vectors required, with a shuffling technique. The authors report a latency of 4496 clock cycles to complete the full search on a 16×16 search area with 16 PEs working in parallel. The throughput in this case is 13 frames per second (fps) at QCIF video resolution. A similar approach based on a 2-D architecture is presented in [8]. The architecture includes a total of 256 PEs grouped into 16 4×4 arrays that can complete the matching of a candidate macroblock in every clock cycle. The implementation reports results based on a search area of 32×32 pixels. The whole computation takes around 1100 clock cycles to complete with a complexity of 23K logic elements (LUTs) implemented in an Altera Excalibur EPXA10. An FPGA working frequency of 12.5 MHz is reported, although the design reaches 285 MHz when targeting a TSMC 0.13 μm standard cell library.

Importantly, with these SAD reuse strategies, full-search implementations have a clear advantage in that extending them to support variable block sizes requires only a small increase in gate count over their conventional fixed-block counterparts, and has little or no bearing on their throughput, critical path or memory bandwidth. On the other hand, full search invariably implies a large number of SAD operations, and even for reduced search areas, a number of optimizations have been developed to make it more computationally tractable. For example, the work presented in [9] uses a most-significant-bit-first bit-serial architecture instead of a systolic implementation. This enables early termination when the SAD of a particular motion vector candidate becomes larger than the current winner during the SAD computation. One of the challenges facing the full-search approach in hardware is that throughput is determined by the search range, which is generally kept small at 16×16 pixels to avoid large increases in hardware complexity. Intuitively it is reasonable to expect that search ranges should be larger for high-definition video formats (the same object moves across more pixels in a high-resolution frame than in a lower-resolution one), and the increase in search range has a large impact on complexity or throughput since all the pixels must be processed, limiting the scalability of full-search hardware. For example, the work presented in [10] considers a relatively large search range of 63×48 pixels in an integer full-search architecture that can vary the number of pixel processing units. A configuration using 16 pixel processing units can support 62 fps at 1920×1080 resolution clocking at 200 MHz, but it needs around 154K LUTs implemented in a Virtex-5 XCV5LX330 device.
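
To make the SAD reuse idea concrete, the following C sketch (an illustrative software model, not the datapath of [5] or [6]) shows how the sixteen 4×4 SADs of a macroblock at one search position can be merged by an adder tree into the 8×8 and 16×16 SADs without reading the pixels again.

    /* Illustrative sketch of variable-block-size SAD reuse: the sixteen 4x4
     * SADs of a macroblock are merged to obtain the 8x8 and 16x16 SADs at no
     * extra pixel cost. */
    #include <stdio.h>
    #include <stdlib.h>

    /* SAD of one 4x4 block at offset (bx,by) inside a 16x16 macroblock. */
    static unsigned sad4x4(unsigned char cur[16][16], unsigned char ref[16][16],
                           int bx, int by)
    {
        unsigned s = 0;
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 4; x++)
                s += abs((int)cur[by + y][bx + x] - (int)ref[by + y][bx + x]);
        return s;
    }

    int main(void)
    {
        unsigned char cur[16][16], ref[16][16];
        /* Fill with arbitrary test data. */
        for (int y = 0; y < 16; y++)
            for (int x = 0; x < 16; x++) {
                cur[y][x] = (unsigned char)(x * 3 + y);
                ref[y][x] = (unsigned char)(x * 2 + y * 5);
            }

        unsigned sad4[4][4];          /* 16 base 4x4 SADs                */
        unsigned sad8[2][2] = {{0}};  /* merged into four 8x8 SADs       */
        unsigned sad16 = 0;           /* merged again into the 16x16 SAD */

        for (int j = 0; j < 4; j++)
            for (int i = 0; i < 4; i++) {
                sad4[j][i] = sad4x4(cur, ref, 4 * i, 4 * j);
                sad8[j / 2][i / 2] += sad4[j][i];   /* first merge level  */
                sad16 += sad4[j][i];                /* second merge level */
            }

        printf("SAD 16x16 = %u, SAD 8x8[0][0] = %u, SAD 4x4[0][0] = %u\n",
               sad16, sad8[0][0], sad4[0][0]);
        return 0;
    }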

From this short review it can be concluded that full-search hardware architectures focus on integer-pel search, while fractional-pel search is not investigated since it is considered to take only a fraction of the time of the integer search. Also, rate-distortion optimization (RDO) using Lagrangian multipliers, which adds the cost of the motion vector to the SAD cost, is generally ignored although it can have a large impact on coding efficiency of around 10%, as shown in later sections. One of the difficulties of adding RDO to the previous architectures is that all the motion vector costs need to be calculated in parallel to avoid becoming a bottleneck, and the additional hardware needed to support these parallel computations, with multipliers involved, increases the complexity considerably.
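
As an illustration of what the RDO cost involves, the sketch below evaluates a Lagrangian cost J = SAD + λ·R(mv) for two candidates; the exp-Golomb-style rate model and the λ value are rough assumptions chosen for demonstration, not the exact models used by the standards or by the processor described later.

    /* Minimal sketch of a Lagrangian-optimized matching cost,
     * J = SAD + lambda * R(mv).  The rate model and lambda are illustrative
     * assumptions only. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Rough rate estimate: bits of an exp-Golomb-like code for one motion
     * vector component difference. */
    static unsigned mv_bits(int d)
    {
        unsigned v = (unsigned)(abs(d) * 2) + 1, bits = 0;
        while (v) { bits++; v >>= 1; }
        return 2 * bits - 1;
    }

    static unsigned rd_cost(unsigned sad, int mvx, int mvy,
                            int predx, int predy, unsigned lambda)
    {
        unsigned rate = mv_bits(mvx - predx) + mv_bits(mvy - predy);
        return sad + lambda * rate;
    }

    int main(void)
    {
        /* Candidate A: slightly larger SAD but close to the predicted vector.
         * Candidate B: smaller SAD but far from the predictor, so costly to
         * code.  RDO picks A, which usually follows the real motion better. */
        unsigned costA = rd_cost(1200,  2,   1, 2, 0, 20);
        unsigned costB = rd_cost(1150, 40, -35, 2, 0, 20);
        printf("cost A = %u, cost B = %u -> pick %s\n",
               costA, costB, costA <= costB ? "A" : "B");
        return 0;
    }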

B. Fast Search ME Hardware

Architectures for fast ME algorithms have also been proposed [11]. The challenges the designer faces in this case include unpredictable data flow, irregular memory access, low hardware utilization and sequential processing. Fast ME approaches use a number of techniques to reduce the number of search positions, and this inevitably affects the regularity of the data flow, eliminating one of the key advantages that systolic arrays have: their inherent ability to exploit data locality for reuse.

This is evident in the work done in [12] that compares a number of fast-motion algorithms mapped onto a systolic array and discovers that the memory bandwidth required does not scale at anywhere near the same rate as the gate count.

A number of architectures have been proposed which follow the programmable approach, offering the flexibility of not having to define the algorithm at design time. The application-specific instruction-set processor (ASIP) presented in [13] uses a specialized data path and a minimal instruction set similar to our own work. The instruction set consists of only eight instructions operating on a RISC-like, register-register architecture designed for low-power devices. The processor is flexible enough to execute arbitrary block-matching algorithms; its basic SAD16 instruction computes the difference between two sets of 16 pixels and, in the proposed microarchitecture, takes 16 clock cycles to complete using a single eight-bit SAD unit.

The implementation using a standard-cell 0.13 μm ASIC technology shows that this processor enables real-time motion estimation for QCIF, operating at just 12.5 MHz to achieve low power consumption. An FPGA implementation using a Virtex-II Pro device is also presented, with a complexity of 2052 slices and a clock of 67 MHz. In this work, scaling can be achieved by varying the width of the SADU (the ALU equivalent for calculating SADs), but due to its design, the maximum achievable parallelism corresponds to calculating the SAD for an entire row in a single clock cycle, in a 128-bit SIMD manner.

The programmable concept is taken a step further in [14]. This ME core is also oriented to fast motion estimation and supports sub-pixel interpolation and variable block sizes. The interpolation is done on demand using a simplified 4-tap non-standard filter for the half-pel interpolation, which could cause a mismatch between the coder output and a standard-compliant decoder. The core uses a technique to terminate the calculation of the macroblock SAD when this value is larger than some previously calculated SAD, but it does not include a Lagrangian-based RDO technique to add the cost of coding the motion vector to the selection process. Scalability is limited since only one functional unit is available, although a number of configuration options are available to match the architecture to the motion algorithm, such as algorithm-specific instructions. The SAD instruction, comparable to our own pattern instruction, operates on a 16-pixel pair simultaneously and 16 instructions are needed to complete a macroblock search point, taking up to 20 clock cycles. The processor uses 2127 slices in an implementation targeting a Virtex-II device with a maximum clock rate of 50 MHz. This implementation can sustain processing of 1024×750 frames at 30 fps.

There are also examples of ME processors in industry, as reviewed in [11]. Xilinx has recently developed a processor capable of supporting high-definition 720p at 50 fps, operating at 225 MHz [15] in a Virtex-4 device with a throughput of 200,000 macroblocks per second. This Virtex-4 implementation uses a total of around 3000 LUTs, 30 DSP48 embedded blocks and 19 block RAMs. The algorithm is fixed and based on a full search of a 4×3 region around 10 user-supplied initial predictors for a total of 120 candidate positions, chosen from a search area of 112×128 pixels. The core contains a total of 32 SAD engines which continuously compute the cost of the 12 search positions that surround a given motion vector candidate.

III. THE CASE FOR FAST MOTION ESTIMATION HARDWARE ENGINES IN HIGH-DEFINITION VIDEO CODING

It has been shown in the literature [16] that the motion estimation process in advanced video coding standards can represent up to 90% of the total complexity, especially when considering features such as multiple reference frames, multiple partition sizes, large search ranges, multiple vector candidates and fractional-pel resolutions. A thorough evaluation of how these options affect video quality is available in [16], although limited to QCIF (176×144) and CIF (352×288) formats. These low-resolution formats are of limited application in current communications and multimedia products, which aim at delivering high quality video; even mobile phones already target resolutions of 800×480 pixels. Some of the conclusions reached in [16] indicate that the impact of large search sizes on coding efficiency is limited, that most of the coding efficiency is obtained from the first four block sizes (16×16, 16×8, 8×16, 8×8) and that the gains obtained thanks to RD-Lagrangian optimization are substantial. Our research has confirmed that RD-Lagrangian optimization is of vital importance, typically reducing bit rates by around 10% for the same video quality. Disabling this optimization reduces coding performance, especially for the exhaustive search. The reason is that selecting the winner based only on the SAD can introduce motion vectors that do not follow the real motion, with excessive 'noise' that hurts the bit rate. The conclusion is that RD-Lagrangian optimization should be present in any high-quality motion estimation hardware processor or algorithm. In subsections A and B we re-examine the effects of search range and sub-partitions using three high-definition 1920×1080 video sequences: tractor (complex chaotic motion), pedestrian area (fast simple motion) and sunflower (simple slow motion), obtained from [17], using an open-source implementation of H.264 [18]. In all these experiments we have kept both the use of predicted motion vector candidates to initialize the search and RD-Lagrangian optimization enabled, since they consistently improve the quality of the motion vectors.

A. Search range evaluation

For the search range evaluation we consider the well-known hexagonal search algorithm and the full-search algorithm, both as implemented in [18]. We consider the following search ranges: 8×8, 16×16, 32×32, 64×64, 128×128 and 256×256. The results for the hexagonal case are shown in Figs. 1, 2 and 3. It is clear that the optimal search range varies depending on the sequence, increasing from 32×32 for the sunflower sequence to 128×128 for the pedestrian sequence. The analysis with the full-search algorithm yields similar results, so it is not included. Overall, a 16×16 search range as used in many full-search hardware implementations is too restrictive. In our hardware we have increased the range to 96×112 (a search window of 112×128) as a trade-off between hardware efficiency and coding quality, as explained in the following sections.

B. Sub-partition analysis

Fig. 4 shows the results of enabling sub-partitions for the pedestrian area sequence with the hexagonal search algorithm. The results show that the sub-partition optimization fails to improve performance for the hexagonal search algorithm and can actually degrade it. Figs. 5 and 6 correspond to the other two sequences and show equivalent results. Similar experiments conducted with other search algorithms, including full search, also confirm this behavior. It is apparent that the cost of coding the additional motion vectors required by the sub-partitions can make the gains obtained during residual encoding negligible.

Additionally, if Lagrangian optimization is disabled then the effect of sub-partition coding is much more negative, since the costs of these additional motion vectors are being ignored. Sub-partitions are more effective at smaller resolutions (the same object moves fewer pixels in CIF compared with HD, for example), and it is possible that Lagrangian techniques different from the one used in x264 could yield better performance, but careful analysis is needed when these techniques are used in high-definition video coding. Otherwise they could result in a significant computational overhead with no apparent benefit. This means that one of the main advantages of full-search hardware, namely its ability to calculate all the sub-partition costs in parallel by combining the SAD results of the smaller sub-blocks, could be of limited benefit for high-resolution video formats. On the other hand, the large search areas required by high definition, as seen in subsection A, could be problematic to implement in real time with full-search hardware, resulting in large hardware complexity and energy consumption.

IV. PROCESSOR MICROARCHITECTURE

Section III has shown that large search ranges are required for high-definition sequences, so complex exhaustive search algorithms will need very high throughput to maintain real-time performance. On the other hand, fast motion estimation algorithms can offer a high quality of results by being able to track real motion better. A configurable and programmable processor such as the one presented in this work can use different configurations depending on the video resolution being processed. Other configuration options that the hardware should support include the number of reference frames, fractional-pel support and sub-partition sizes.

A programmable and configurable hardware solution can be designed to enable the trade-off between complexity and throughput depending on the number and type of functional or execution units; for example, if a particular application does not benefit from rate-distortion optimization, this feature can be disabled, reducing the complexity and power consumption of the core.

According to this criterion, a configurable and programmable motion estimation processor has been designed which can support H.264 motion estimation over a large range of search options and video resolutions, including high definition, using different programs and hardware configurations. The microarchitecture of a sample hardware configuration using four integer-pel execution units, one fractional-pel execution unit and one interpolation execution unit is illustrated in Fig. 7. The main integer-pel pipeline must always be present to generate a valid processor configuration, but the other units are optional, and are configured at compile time.

A. Integer-pel execution units (IPEU).

The main IPEU is shown in the middle of Fig. 7 and has four principal components: (1) the physical address calculator, which transforms the motion vector offsets and vector candidates into addresses to the reference memory; (2) the vector alignment unit, which aligns the two 64-bit words obtained from the reference memory into a single 64-bit word of valid pixels ready to operate with the data from the current macroblock memory; (3) the SAD logic, formed by the SAD operator and the adder tree that obtain and accumulate the SAD values; and (4) the motion vector decision unit, which decides which SAD and motion vector candidate are kept as the current winners for the next iteration. Each functional unit uses a 64-bit wide word and a deep pipeline to achieve a high throughput. All the accesses to reference and macroblock memory are done through 64-bit wide data buses, and the SAD engine also operates on 64-bit data in parallel. The memory is organized in 64-bit words and typically all accesses are unaligned, since they refer to macroblocks that can start at any position inside a word. By performing 64-bit read accesses in parallel to two memory blocks, the desired 64 bits inside the two words can be selected inside the vector alignment unit. The number of integer-pel execution units is configurable from a minimum of one to a maximum limited only by the available resources in the technology selected for implementation. Each execution unit has its own copy of the point memory and processes 64 bits of data in parallel with the rest of the execution units. The point memories are 256×16 bits in size and contain the x and y offsets of the search patterns. For example, a typical diamond search pattern with a radius of 1 will use four positions in the point memory with values [−1, 0], [0, −1], [1, 0], [0, 1]. Any pattern can be specified in this way, and multiple instructions specifying the same pattern can point to the same position in the point memory, saving memory resources. Each integer-pel execution unit receives an incremented address for the point memory, so each of them can compute the SAD for a different search point of the same pattern. This means that the optimal number of integer-pel execution units is four for a diamond search pattern and six for the hexagon pattern. Further optimization to avoid searching duplicated points can halve the number of search points for many regular patterns. In algorithms which combine different search patterns, such as UMH, a compromise can be found to optimize the hardware and software components. This illustrates the idea that the hardware configuration and the software motion estimation algorithm can be optimized together to generate different processors depending on the software algorithm to be deployed.
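
The following sketch models, in software, how the points of one pattern held in the point memory could be distributed across the configured IPEUs, with each unit receiving an incremented point-memory address; the data structures and the round-based loop are assumptions made for illustration only.

    /* Sketch of how a pattern's search points might be distributed across the
     * integer-pel execution units (IPEUs); names and the round structure are
     * assumptions based on the description in the text. */
    #include <stdio.h>

    typedef struct { int dx, dy; } point_t;

    int main(void)
    {
        /* Point memory contents for a radius-1 diamond pattern. */
        const point_t point_mem[] = { {-1, 0}, {0, -1}, {1, 0}, {0, 1} };
        const int num_points = 4;
        const int num_eu = 4;                   /* configured IPEUs            */
        const int center_x = 5, center_y = -3;  /* current best integer vector */

        /* Each round, IPEU k evaluates point (round*num_eu + k) of the pattern. */
        for (int base = 0; base < num_points; base += num_eu) {
            printf("round %d:\n", base / num_eu);
            for (int k = 0; k < num_eu && base + k < num_points; k++) {
                point_t p = point_mem[base + k];
                printf("  IPEU %d evaluates candidate (%d, %d)\n",
                       k, center_x + p.dx, center_y + p.dy);
            }
        }
        return 0;
    }

With four IPEUs the diamond pattern completes in a single round, which is why four units is the optimal configuration for this pattern.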

B. Fractional-pel Execution Unit (FPEU) and Interpolation Execution Unit (IEU).

The engine supports half- and quarter-pel motion estimation thanks to a fractional-pel execution unit (FPEU) and an interpolation execution unit (IEU) that calculate the quarter- and half-pel values using the interpolation filters defined in the standard. The FPEU is conceptually similar to the IPEU, although in this case two vector alignment units are required since the quarter-pel interpolation is done on demand by averaging the corresponding half-pel reference data obtained from the original, diagonal, horizontal or vertical memories. A memory block storing the original pixels without interpolation is included since it is needed for quarter-pel operations; in this way no accesses to the main reference memory are required, so the integer-pel execution unit can operate in parallel. The number of IEUs is limited to one, but the number of FPEUs can be configured at compile time. The IEU interpolates the 20×20 pixel area that contains the 16×16 macroblock corresponding to the winning integer motion vector. The interpolation hardware is cycled three times to calculate first the horizontal pixels, then the vertical pixels, and finally the diagonal pixels. The IEU calculates the half-pels through a six-tap Wiener filter as defined in the standard. The IEU is formed by a total of eight systolic 1-D interpolation processors with six processing elements each. The objective is to balance the internal memory bandwidth with the processing power so that in each cycle a total of eight valid pixels are presented to one interpolator. The interpolator starts processing these eight pixels, producing one new half-pel sample after each clock cycle. In parallel with the completion of the 1-D interpolation of the first eight-pixel vector, the memory has already been read another seven times, and its output assigned to the other seven interpolators. The data read during the ninth memory cycle can then be assigned back to the first interpolator, obtaining high hardware utilization. The horizontally-interpolated area contains enough pixels for the diagonal interpolation to also complete successfully. A total of 24 rows with 24 bytes each are read. Each interpolator is enabled nine times so that a total of 72 eight-byte vectors are processed. Due to the effects of filling and emptying the systolic pipeline before the half-pel samples are available, a total of 141 clock cycles are needed to complete half-pel horizontal interpolation. During this time the integer pipeline is stalled, since the memory ports of the reference memory are in use. Once horizontal interpolation completes, and in parallel with the calculation of the vertical and diagonal half-pel pixels and the fractional-pel motion estimation, the processing of the next macroblock or partition can start in the integer-pel execution units. Completion of the vertical and diagonal pixel interpolation takes a further 170 clock cycles, after which the motion estimation using the fractional pels can start. As mentioned previously, quarter-pel interpolation is done on the fly as required, simply by reading the data from two of the four memories containing the half- and full-pel positions and averaging according to the H.264 standard. The quarter-pel processing unit uses two vector alignment units to keep up with the stream of pixel data being read from the half-pel memories. Two aligned 64-bit vectors are generated in parallel and the averaging takes place on this data in a single clock cycle. This maintains the same level of performance as for the integer pipeline. The processor is designed to execute fractional- and integer-pel motion estimation in parallel over different macroblocks, so that while fractional-pel refinement is being performed on macroblock n, the integer-pel parts are already processing macroblock n+1. For most algorithms, the number of points searched during the fractional-pel refinement is lower than during the integer-pel part, so it completes faster even when the additional overhead of calculating the half-pel pixels is taken into account. In any case, the cycles needed by each part depend on the type of search strategy deployed at the integer-pel and fractional-pel levels and will be algorithm dependent. An optimal solution will balance the two parts to avoid having one part idle while the other one is busy.
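
For reference, the sketch below is a plain scalar model of the interpolation that the IEU and FPEU perform in H.264 mode: the six-tap (1, −5, 20, 20, −5, 1)/32 filter for half-pel samples and rounded averaging for quarter-pel samples. It models the arithmetic only, not the systolic datapath or its timing.

    /* Scalar reference model of H.264-mode interpolation: six-tap filter for
     * half-pel samples and rounded averaging for quarter-pel samples. */
    #include <stdio.h>

    static int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

    /* Half-pel sample between p[2] and p[3] of six consecutive integer pixels. */
    static int halfpel(const int p[6])
    {
        int acc = p[0] - 5 * p[1] + 20 * p[2] + 20 * p[3] - 5 * p[4] + p[5];
        return clip255((acc + 16) >> 5);
    }

    /* Quarter-pel sample: rounded average of two neighbouring samples
     * (integer and/or half-pel). */
    static int quarterpel(int a, int b) { return (a + b + 1) >> 1; }

    int main(void)
    {
        int row[6] = { 10, 20, 200, 60, 30, 25 };
        int h = halfpel(row);
        printf("half-pel = %d, quarter-pel = %d\n", h, quarterpel(row[2], h));
        return 0;
    }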

V. DEVELOPMENT AND ANALYSIS TOOLS

To facilitate access to the hardware without requiring specific knowledge of the microarchitecture, a high-level C-like language called EstimoC and a compiler have been developed. The EstimoC language is aimed at designing a broad range of block-matching algorithms. The code can be developed and compiled in an integrated development environment (IDE) for motion estimation. The language contains a preprocessor with macro facilities that include conditional (if) and loop (for, while, do) statements. The language also has facilities directly related to the motion estimation processor instruction set (ISA), such as checking the SAD of a pattern consisting of a set of points, and conditional branching depending on which point from the last pattern check command had the best SAD. The target ISA for the compiler is simple and is formed by a total of 8 instructions, as illustrated in Fig. 8. There are two arithmetic instructions (opcodes 0, 1), three jump instructions (opcodes 2, 3, 4) and three compare instructions (opcodes 5, 6, 7).

The compiler converts the program to assembly and then to a program memory file containing instructions and a point memory file containing patterns, which can then be written directly to the firmware memories of the hardware core. Fig. 9 shows an example block-matching algorithm written in EstimoC and excerpts from the target files. The algorithm is a diamond search pattern executed up to five times for radii of 8, 4, 2 and 1 pixels, followed by a small full search at fractional-pixel level. The first set of check() and update() commands creates the first search pattern, which consists of five points. Each check() command adds a point to the search pattern being constructed and the update() command completes the pattern. This pattern is compiled into the instruction at program address 00, which uses the five points available in the point memory at addresses 00-04. The preprocessor goes through the do while loop three times, with S taking the values 4, 2 and 1. Each time, a four-point pattern is checked up to five times. The #if (WINID == 0) #break; syntax ensures that if a pattern search does not improve the motion vector estimate, it is not repeated. The final lines create a 25-point fractional pattern search. Fig. 10 provides two further examples of how the EstimoC language can be used to implement arbitrary block-matching motion estimation algorithms. The full-search example in Fig. 10 corresponds to the classical exhaustive search algorithm and is included as a reference; it is not recommended due to its high computational requirements. Notice that the algorithm has an initial check() command followed by the full-search loop itself. This is done to be able to test all the motion vector candidates first and then conduct the full search only around the winner. The hardware will execute the first instruction once per available motion vector candidate. The compiled machine code shown below the source code contains only two instructions: one for the first check() and one for the loop, with a total of 225 search points (15×15). Notice that the parallelism is expressed without the need for any specific constructs such as the PAR statements normally used in hardware extensions of general-purpose programming languages. The compiler recognizes all the search points that can be computed in parallel and combines them into a single instruction with the help of the update command, which tells the compiler to stop looking for further checks to combine into the instruction. The hardware will execute each instruction in a different number of iterations depending on the available number of execution units, since each execution unit can process a check point in parallel.
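
A minimal sketch of this folding step is shown below: consecutive check() calls accumulate points into the point memory and update emits a single pattern instruction holding the start address and the number of points. The data structures and field names are assumptions for illustration and do not reproduce the real compiler.

    /* Sketch of how an Estimo-style compiler could fold consecutive check()
     * points into one pattern instruction terminated by update. */
    #include <stdio.h>

    typedef struct { int dx, dy; } point_t;
    typedef struct { int opcode, start_addr, num_points; } pattern_insn_t;

    static point_t        point_mem[256];
    static pattern_insn_t prog_mem[64];
    static int point_count = 0, insn_count = 0, pattern_start = 0;

    static void check(int dx, int dy)            /* queue one search point       */
    {
        point_mem[point_count].dx = dx;
        point_mem[point_count].dy = dy;
        point_count++;
    }

    static void update(void)                     /* close the pattern: emit insn */
    {
        prog_mem[insn_count].opcode = 0;         /* integer pattern instruction  */
        prog_mem[insn_count].start_addr = pattern_start;
        prog_mem[insn_count].num_points = point_count - pattern_start;
        insn_count++;
        pattern_start = point_count;
    }

    int main(void)
    {
        check(0, 0); check(0, 8); check(0, -8); check(8, 0); check(-8, 0);
        update();                                /* -> chk NumPoints:5 start:0   */
        check(0, 4); check(0, -4); check(4, 0); check(-4, 0);
        update();                                /* -> chk NumPoints:4 start:5   */
        for (int i = 0; i < insn_count; i++)
            printf("%02d chk NumPoints:%d startAddr:%d\n",
                   i, prog_mem[i].num_points, prog_mem[i].start_addr);
        return 0;
    }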

The UMH example in Fig. 10 contains sections of the source code corresponding to the sophisticated UMH (uneven multi-hexagon) search used in the H.264 reference software. The source code starts with the definition of the search patterns so that they can be invoked directly using the check command, simplifying the program. The UMH algorithm performs different pattern checks and uses a cost function (SAD or Lagrangian-optimized SAD) to decide the type of pattern to follow. For example, after the first check either a large or a small cross can be used depending on the cost function. The COST keyword is used by the compiler to generate the appropriate compare instructions followed by conditional jump instructions.

Finally, after the integer-pel search completes, a fractional refinement is conducted around the winner using two half-pel diamond iterations and a final quarter-pel square pattern. The binary for UMH contains a total of 27 instructions and has not been included for reasons of brevity. These examples illustrate the multiple programming options available for block-matching motion estimation algorithms and the difficulty of programming these algorithms without appropriate compiler and tool support. On the other hand, the simple programming model and ISA of the proposed processor mean that no overheads are introduced by using the compiler instead of writing assembly or machine code directly. Once a program has been compiled into machine code, it can be complicated and time consuming to use the actual motion estimation hardware to perform the analysis and configuration. The tasks required include synthesizing and implementing a processor with a specific configuration in the FPGA, which takes around one hour per configuration due to place-and-route running times. A cycle-accurate simulator can speed up the development cycle significantly by reducing the number of tasks required to analyze a particular configuration. Additionally, the designer does not need access to the actual hardware when using the simulator. A cycle-accurate model of the processor was therefore developed as part of the toolset. x264, a free software library for encoding H.264, was modified to use the cycle-accurate model when searching for the motion vectors during motion estimation. The cycle-accurate simulator can be used directly from the developed IDE.

The designer can develop a motion estimation algorithm and test it using different processor configuration parameters. The simulator takes several parameters as inputs. The inputs which determine the processor configuration are: the program and point memories generated by the Estimo compiler, the number of integer and fractional execution units, the minimum size for block partitioning, whether to use motion vector cost optimization, and whether to use multiple motion vector candidates.

The simulator takes other options which do not affect the processor configuration itself, which are: the video file to process and its resolution, the maximum number of frames to process, the quantization parameter (QP), and the number of reference frames to use. The simulator will then process the video file using the selected search algorithm and processor configuration.
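
As an illustration only, the parameters listed above could be grouped as follows; every name in this sketch is hypothetical and does not reflect the actual interface of the simulator.

    /* Hypothetical grouping of the simulator inputs listed above; all field
     * and type names are invented for illustration. */
    #include <stdio.h>

    typedef struct {
        /* parameters that define the processor configuration */
        const char *program_mem_file;   /* output of the Estimo compiler         */
        const char *point_mem_file;     /* search pattern offsets                */
        int   num_integer_eu;           /* number of integer-pel execution units */
        int   num_fractional_eu;        /* number of fractional-pel units        */
        int   min_partition;            /* smallest block partition, e.g. 8 or 16 */
        int   use_mv_cost;              /* Lagrangian motion vector cost on/off  */
        int   use_mv_candidates;        /* multiple motion vector candidates     */

        /* parameters that only affect the simulation run */
        const char *video_file;
        int   width, height;            /* video resolution in pixels            */
        int   max_frames;
        int   qp;                       /* quantization parameter                */
        int   num_reference_frames;
    } me_sim_config_t;

    int main(void)
    {
        me_sim_config_t cfg = {
            "diamond.prg", "diamond.pts", 4, 1, 16, 1, 1,
            "pedestrian_1080p.yuv", 1920, 1080, 100, 26, 1
        };
        printf("%dx%d, %d IPEUs, QP %d\n", cfg.width, cfg.height,
               cfg.num_integer_eu, cfg.qp);
        return 0;
    }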

The output of the simulator includes: the bit rate of the compressed video, the peak signal-to-noise ratio (PSNR), the number of frames that can be processed per second, the number of clock cycles required per macroblock, and the energy requirements per macroblock. The designer can also generate plots of the results until he or she is satisfied with a particular configuration and algorithm. At this point, the tools can be used to generate a VHDL configuration file which, together with the rest of the hardware library, can be used to implement the motion estimation processor using standard FPGA synthesis and place-and-route tools. Fig. 11 shows the overall design flow from algorithm conception to processor implementation.

VI. MULTI-STANDARD HARDWARE EXTENSIONS

The previous sections have concentrated on the hardware and software support for motion estimation in H.264, arguably the most popular advanced video coding standard. The two additional codecs considered in this work are VC-1, the standard based on WMV-9 developed by Microsoft, and AVS, the Chinese video coding standard. Together with H.264, these two codecs are state-of-the-art standards being deployed across a wide range of applications such as high-definition broadcast television, optical disc storage (Blu-ray) and broadband video streaming over the internet. They are all transform- and block-based codecs and share many similarities. They process pixel groups in blocks of varying sizes in two main modes: intra mode, in which a frame is compressed independently of other frames, and inter mode, in which temporal redundancies among frames are exploited using motion estimation and compensation techniques. The transform used is DCT-based and is applied either to the pixels themselves during intra coding or to the residuals resulting from the motion estimation phase during inter coding. The final stage is entropy coding, which is generally based on variable-length codes (VLC), although H.264 can use arithmetic coding in the form of CABAC. Motion estimation is critical from a performance and complexity point of view in all of them. Since these standards do not specify the motion estimation technique deployed, the video engineer is free to develop new algorithms trading off speed and quality as long as the generated bitstream can be handled by standard decoders. These features are well suited to a configurable and programmable processor such as the one presented in the previous sections, which can exploit the similarities between the different standards to support all of them with few hardware changes.

A. Motion estimation differences in the three standards.

All three codecs can operate at a maximum resolution of quarter-pel and support block sizes from 16×16 down to 4×4 for the AVS (4×4 available in AVS-M for mobile multimedia applications) and H.264 standards and down to 8×8 for the VC-1 standard. From the motion estimation point of view, the most significant differences lie in how the half-pel and quarter-pel pixels are calculated. H.264 uses a six-tap filter for the half-pel pixels and simple averaging for the quarter-pel pixels, as seen in the previous sections. On the other hand, VC-1 and AVS use four-tap filters for half- and quarter-pel interpolation with different coefficients. This is summarized in Table 1, which also shows the dividing factors used to scale the result of the filter operation. The table also shows that VC-1 handles ¾ interpolation as a special case of quarter-pel interpolation, corresponding to pixels located as shown in Fig. 12; in this case, although the coefficient values are equivalent, they are applied to the opposite input pixels. Also, from the table it can be seen that all the coefficient multiplications can be obtained with simple shifts and adds, except for the multiplication by 53, which will need a multiplier structure. The complexity cost of this multiplier will be lower than that of a full multiplier since the coefficient is fixed, so it can be implemented with logic. VC-1 also includes an optional lower-complexity mode that uses a simple two-tap bilinear filter for the ¼ positions, which has not been included in the table.
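
The sketch below illustrates why a fixed-coefficient multiplier is much cheaper than a general one: because 53 = 32 + 16 + 4 + 1, the product reduces to shifts and three additions, and the same decomposition applies to the other coefficients. This is a software illustration of the principle, not the hardware structure used in the core.

    /* Fixed-coefficient multiplication by shifts and adds, as used to keep
     * the interpolation filters cheap in hardware. */
    #include <assert.h>
    #include <stdio.h>

    static int mul53(int x) { return (x << 5) + (x << 4) + (x << 2) + x; } /* 53 = 32+16+4+1 */
    static int mul20(int x) { return (x << 4) + (x << 2); }                /* 20 = 16+4      */
    static int mul5(int x)  { return (x << 2) + x; }                       /*  5 =  4+1      */

    int main(void)
    {
        for (int x = -255; x <= 255; x++) {
            assert(mul53(x) == 53 * x);
            assert(mul20(x) == 20 * x);
            assert(mul5(x)  ==  5 * x);
        }
        printf("shift-and-add forms match the constant multiplications\n");
        return 0;
    }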

To add support for these two new standards in our processor, the half-pel and quarter-pel processing functions need to be modified to accommodate the new coefficients. Half-pel interpolation is done exhaustively over the 20×20 pixel area surrounding the winning integer-pel motion vector. In H.264 this is done for performance reasons, so that the quarter-pel pixels that need diagonal half-pel pixels can be processed efficiently; notice that to obtain the diagonal pixels, the horizontal half-pel pixels must be obtained first. If interpolation were instead done purely on demand, performance would be negatively affected, since the processor would first need to obtain the half-pel horizontal pixels, then the half-pel diagonal pixels and finally the quarter-pel pixels, interrupting the pipeline while this extra processing takes place. This would be even more challenging for VC-1 and AVS, which need not just two but four half-pel pixels to obtain the correct value of a quarter-pel pixel. In the H.264 solution, all half-pel pixels have already been calculated and can be read in parallel to obtain the quarter-pel pixels, keeping the pipeline fully utilized. The following subsections describe how the half-pel and quarter-pel processing functions have been extended to support VC-1 and AVS.

B. Half-pel pixel processing extensions.

Half-pels are calculated in eight systolic processors that contain six PEs each. Supporting VC-1 and AVS is straightforward since the number of coefficients involved is only four, so two PEs are set to zero to disable them when they are not needed. Data still flows through the disabled PEs, so the control logic that reads integer pixels from memory and feeds them into the processing elements does not need to be modified. Fig. 13 shows a simplified diagram of the systolic processor. We have opted for a simple systolic structure with a global adder since it achieves the required throughput and does not become a bottleneck in the processor. An output register temporarily buffers eight output pixels before they are written to the half-pel memories. The performance of the half-pel processing unit is equivalent in all three possible modes. The complexity of the systolic processor in Fig. 13 rises from 383 logic cells to 450 logic cells when it is configured in multi-standard mode, which is a modest increase.
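
A behavioral sketch of this reuse is given below: the same six-coefficient row computes the H.264 half-pel filter, while a four-tap standard is handled by loading zeros into the two outer coefficient registers. The four-tap coefficients and scaling shown are placeholders, not the exact VC-1 or AVS values.

    /* Behavioral sketch of reusing a six-PE filter row for a four-tap mode by
     * zeroing two coefficients.  Four-tap values below are placeholders. */
    #include <stdio.h>

    static int fir6(const int pix[6], const int coef[6], int rshift, int offset)
    {
        int acc = 0;
        for (int t = 0; t < 6; t++)
            acc += coef[t] * pix[t];        /* one multiply-accumulate per PE */
        return (acc + offset) >> rshift;    /* scaling after the adder tree   */
    }

    int main(void)
    {
        int pix[6] = { 10, 20, 200, 60, 30, 25 };

        /* H.264 mode: all six PEs active, result divided by 32. */
        int h264_coef[6]    = { 1, -5, 20, 20, -5, 1 };
        /* Four-tap mode: outer PEs disabled by zero coefficients; data still
         * flows through them, so the control logic is unchanged. */
        int fourtap_coef[6] = { 0, -1,  9,  9, -1, 0 };  /* placeholder taps */

        printf("six-tap result : %d\n", fir6(pix, h264_coef, 5, 16));
        printf("four-tap result: %d\n", fir6(pix, fourtap_coef, 4, 8));
        return 0;
    }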

C. Quarter-pel pixel processing extensions.

Compared with the half-pel extensions, the quarter-pel processing unit needs more attention. The first challenge is that both VC-1 and AVS use a similar four-tap filter instead of the two-tap averaging filter used in H.264. This means that four half-pel or integer pixels are needed to obtain each quarter-pel pixel. In H.264 mode the data is obtained by accessing two of the half-pel memories, aligning as required by the motion vector, and averaging. To guarantee that eight valid pixels are available after the alignment, two 64-bit memory locations are read from each memory using dual-port memories (two reads and one write). The alignment units then select the right eight bytes out of the 16 bytes read. In VC-1 and AVS, the number of ports needed to obtain the four eight-byte vectors is doubled. This means that the number of BRAMs used to store the half-pel pixels would have to be doubled and the number of vector alignment units increased from two to four, doubling the logic needed to support fractional-pel searches in VC-1 and AVS. Instead of doubling the logic, we opt for halving the performance of the fractional-pel pipelines when operating in VC-1 or AVS modes. The memories keep the same number of ports as in H.264 but are read twice to obtain two eight-byte aligned vectors each time. To perform the filtering itself we have designed a quarter-pel interpolator as shown in Fig. 14. Similarities exist among the shift and add operations needed to obtain the interpolated pixels for the three standards, and these similarities can be exploited to reduce the complexity of this unit, as shown in Fig. 14. The multiplexers are controlled by signals defined by the active standard. Pipelines a and b in Fig. 14 process two eight-pixel vectors. Operator sharing among the standards means that the additional logic needed to support VC-1 and AVS is reduced. The limitation in fractional-pel performance in VC-1 and AVS modes is only a problem if the fractional-pel searches take longer than the integer-pel searches, which is not generally the case; if it is, the core can be configured with more fractional-pel execution units. The complexity of the quarter-pel pixel processing block in H.264 mode is approximately 82 logic cells, since simple pixel averaging is involved. This value increases to 508 logic cells in multi-standard mode, since more complex operations and a more complex state machine are needed to cycle the logic twice and interpolate four half- and full-pel pixels.
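
The mode-dependent behavior can be summarized with the following sketch: in H.264 mode a single pass of two aligned vectors is averaged, while in VC-1/AVS mode two read passes gather four vectors that feed a four-tap filter, which is what halves the fractional-pel throughput. The four-tap coefficients and scaling used here are placeholders rather than the exact standard values.

    /* Behavioral sketch of the mode-dependent quarter-pel step.  H.264 needs
     * two vectors (one read pass); VC-1/AVS need four vectors (two passes). */
    #include <stdio.h>

    #define VEC 8   /* one aligned vector = eight pixels */

    enum { MODE_H264, MODE_VC1_AVS };

    static void qpel_vector(int mode, int in[4][VEC], int out[VEC])
    {
        for (int i = 0; i < VEC; i++) {
            if (mode == MODE_H264) {
                /* single pass: rounded average of two samples */
                out[i] = (in[0][i] + in[1][i] + 1) >> 1;
            } else {
                /* two passes gather four samples, then a four-tap filter
                 * (placeholder coefficients and scaling) */
                int acc = -in[0][i] + 9 * in[1][i] + 9 * in[2][i] - in[3][i];
                out[i] = (acc + 8) >> 4;
            }
        }
    }

    int main(void)
    {
        int in[4][VEC], out[VEC];
        for (int p = 0; p < 4; p++)
            for (int i = 0; i < VEC; i++)
                in[p][i] = 100 + 10 * p + i;

        qpel_vector(MODE_H264, in, out);
        printf("H.264 quarter-pel sample 0   = %d\n", out[0]);
        qpel_vector(MODE_VC1_AVS, in, out);
        printf("VC-1/AVS quarter-pel sample 0 = %d\n", out[0]);
        return 0;
    }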

VII. HARDWARE PERFORMANCE EVALUATION AND IMPLEMENTATION

For the implementation we have selected the Virtex-4 SX35 device included in the ML402 development platform. This device offers a medium level of density inside the Virtex-4 family and can be considered mainstream, being fabricated in 90 nm CMOS technology. The results of implementing the processor with different numbers and types of execution units are illustrated in Table 2. The basic configuration is relatively small, using 7.4% of the available logic resources and 21 memory blocks. The size of the block RAMs in Virtex-4 devices enables two reference search areas of size 112×128 to fit in this memory, as previously described. Each new execution unit adds around 1600 logic cells and 17 embedded memory blocks to the complexity. The fractional and integer execution units have been carefully pipelined and all the configurations can achieve a clock rate of 200 MHz. Obtaining a performance value in terms of macroblocks per second is not as straightforward as in the full-search case, which always computes the same number of SADs for each macroblock. In this case, the amount of motion in the video sequence, the type of algorithm and the hardware implementation affect the number of macroblocks per second that the engine can process. Equation (1) can be used to estimate the cycle count needed to process a single 16×16 partition and a single reference frame. The variables in the equation are ppp (search points per pattern), eu (execution units available) and ppm (average search points per macroblock):

MCC_16x16 = ( ⌈ppp / eu⌉ × 33 + 11 ) × ( ppm / ppp )    (1)

The equation takes into account that the SAD calculation for each search point takes 32 cycles (32 SADs of eight bytes each) plus one cycle of memory read/write overhead and motion vector decision. There is also an 11-cycle overhead representing the time needed to empty the integer pipeline before the best motion vector can be found in each pattern iteration and the next pattern started from the current winning position. The current microarchitecture always uses 33 cycles per search point and does not try to terminate the SAD calculation early if the current cost becomes larger than the cost obtained during a previous calculation. There are two reasons why this optimization based on monitoring the SAD during the search point calculation has not been used. Firstly, since the core uses multiple execution units it is very important that all the execution units are kept in synchrony, so that a single control unit can issue the same control signals to all of them; execution units terminating at different clock cycles would invalidate this requirement. Secondly, the detection of the cost has to be done at the bottom of the pipeline, after the addition of the motion vector cost to the accumulated SAD value. This means that all the SAD calculations already in flight in the pipeline must be invalidated, introducing a bubble in the pipeline with a cost of 11 clock cycles. Unless the early detection saves more than 11 clock cycles (which is not the case in our experiments, since the costs of nearby points tend to be close in value) there are no performance gains to be obtained from this technique, only a considerable overhead in terms of logic. In order to use equation (1) we need to estimate the average number of points searched per macroblock. These values have been measured using the x264 software implementation configured with the diamond, hexagon and UMH algorithms, using the previous high-definition sequences with varying degrees of motion complexity. The values measured indicate that 15 points are tested for the diamond search algorithm, 22 for the hexagon search algorithm, and 70 for the UMH search algorithm for the 16×16 partition case, and these values increase by a factor of 1.5 if the 8×8 partitions are considered as well. The hardware can overlap the fractional-pel search with the integer-pel search by working on different macroblocks in parallel, so as long as the fractional-pel part completes faster than the integer part, it does not affect the performance of the core. In order to validate this statement we have further analyzed the number of search points evaluated by the x264 algorithms. Of the three integer-pel algorithms available in x264, the simple diamond search is the one most likely to complete before the fractional search part. We must also note that x264 only offers a diamond search for the fractional refinement, consisting of up to two half-pel followed by up to two quarter-pel iterations. Table 3 shows the effects on search complexity and bit rate of adding the fractional-pel refinement to the integer diamond search for the three high-definition videos tested. Table 3 shows that the number of search points tested when both half-pel and quarter-pel refinements are enabled is roughly equivalent to that of the integer-pel part for the Sunflower and Tractor video sequences and smaller for the Pedestrian area sequence. It also shows that the reduction in bit rate thanks to the fractional-pel search is significant.
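
As a rough illustration, the sketch below plugs the measured average point counts into equation (1), as reconstructed above, and converts the resulting cycle counts into macroblocks per second at a 200 MHz clock. The pattern sizes assumed for each algorithm, and the printed numbers, are illustrative estimates rather than the figures reported in Table 4.

    /* Back-of-envelope use of equation (1): cycles per 16x16 partition and the
     * resulting macroblock throughput at 200 MHz.  The printed values are
     * illustrative estimates, not the figures in Table 4. */
    #include <math.h>
    #include <stdio.h>

    static double mcc_16x16(int ppp, int eu, double ppm)
    {
        double per_iteration = ceil((double)ppp / eu) * 33.0 + 11.0;
        return per_iteration * (ppm / ppp);     /* equation (1) */
    }

    int main(void)
    {
        const double clock_hz = 200e6;
        struct { const char *name; int ppp; int eu; double ppm; } cfg[] = {
            { "diamond", 4, 4, 15.0 },
            { "hexagon", 6, 6, 22.0 },
            { "UMH",     8, 8, 70.0 },   /* assumed dominant pattern size */
        };
        for (int i = 0; i < 3; i++) {
            double cycles = mcc_16x16(cfg[i].ppp, cfg[i].eu, cfg[i].ppm);
            printf("%-8s ~%5.0f cycles/MB -> ~%8.0f MB/s\n",
                   cfg[i].name, cycles, clock_hz / cycles);
        }
        return 0;
    }
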
With the data in Table 3 we can reasonably assume that the fractional-pel search will complete in approximately the same amount of time as the integer-pel search and that it can be performed in parallel without increasing the total number of clock cycles. Table 4 shows the performance figures obtained for the three motion estimation algorithms analyzed as a function of the number of integer-pel execution units. The table shows the macroblocks per second for different configurations and also the largest video format that each configuration can support. Table 4 only shows the optimal hardware configuration for each algorithm, taking into account that, for example, a diamond search pattern does not benefit from more than four IPEUs and that a configuration with three IPEUs needs the same number of cycles as the two-IPEU case. The reason for this is that, while the first iteration will use all three IPEUs, a second iteration will still be required to complete the pattern instruction, when only one IPEU will be used. Finally, Table 5 compares the performance and complexity figures of the base configuration of our processor against the ASIP cores proposed in [13] and [14]. The figures measured on a general-purpose Pentium 4 processor with all assembly optimizations enabled are also presented as a reference, although the power consumption and cost of this general-purpose processor make it unsuitable for the embedded applications this work targets.

These types of comparisons are difficult since the features of each implementation vary. For example, our base configuration does not support fractional-pel searches, and the addition of the interpolator and the fractional-pel execution unit in parallel with the integer-pel execution unit increases complexity by a factor of three. The core presented in [14] does support fractional-pel searches, although with a non-standard interpolator, and both searches must run sequentially. Overall, Table 5 shows that our core offers a level of integer performance in terms of clock cycles for the diamond search algorithm similar to the ASIP developed in [14] with one execution unit, and performance almost doubles if the configuration instantiates two execution units, as shown in the last row. For these experiments, our core was retargeted to a Virtex-II device to obtain a fair comparison, since this is the technology used in [13] and [14]. The pipeline of the proposed solution can be clocked at double the frequency, as shown in the table, and this helps to explain why our solution with a single execution unit can support 1080p HD formats while the solution presented in [13] is limited to 720p HD formats. The measurements of cycles per macroblock were obtained by processing the same CIF sequences as used in [14].

The diamond search used in this experiment corresponds to the implementation available in x264, which includes up to 8 diamond iterations followed by a square refinement, using a single reference frame and a single macroblock size (16×16).

It is also noticeable that 287 cycles per macroblock will generate a throughput of almost 2000 CIF frames per second, enabling the addition of extra reference frames and sub-partitions while still operating in real-time.

VIII. CONCLUSIONS

The proposed processor exploits both reconfigurability and programmability in an innovative way to support motion estimation in three state-of-the-art video coding standards, seamlessly trading complexity, throughput and quality of results, and matching the architecture to the workload requirements. The processor has been named LiquidMotion to reflect its adaptability, and the toolset, which enables the designer to create new algorithms and hardware processors without specific knowledge of the microarchitecture, is available for download at http://sharpeye.borelspace.com/. Support for older standards such as MPEG-2 can easily be added, since in that case sub-pixel resolution is limited to half-pel using an averaging filter. The advanced features available in these codecs are supported, including a parallel Lagrangian optimization hardware unit which yields 10% lower bit rates for the same quality without affecting throughput. The research also shows the importance of using relatively large search windows of up to 128×128 pixels for high-definition video, although further increases of the search window do not bring noticeable benefits. The multi-standard hardware extensions targeting VC-1 and AVS increase the flexibility of the processor at a relatively small hardware cost. Future work involves adding the core as part of a dynamically reconfigurable multi-FPGA array targeting multimedia applications.

REFERENCES

[1] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockhammer and T. Wedi, "Video coding with H.264/AVC: tools, performance and complexity," IEEE Circuits and Systems Magazine, vol. 4, pp. 7-28, 2004.

[2] S. Srinivasan et al., "Windows Media Video 9: overview and applications," EURASIP Signal Processing: Image Communication, vol. 19, pp. 851-875, 2004.
[3] L. Yu et al., "An overview of AVS-Video: tools, performance and complexity," Visual Communications and Image Processing 2005, Proc. of SPIE, vol. 5960, p. 596021, July 2006.
[4] J. L. Nunez-Yanez, E. Hung and V. Chouliaras, "A configurable and programmable motion estimation processor for the H.264 video codec," International Conference on Field Programmable Logic and Applications (FPL 2008), pp. 149-154, Sept. 2008.
[5] Y.-W. Huang, T.-C. Wang, B.-Y. Hsieh and L.-G. Chen, "Hardware architecture design for variable block size motion estimation in MPEG-4 AVC/JVT/ITU-T H.264," ISCAS, May 2003.
[6] C.-Y. Chen, S.-Y. Chien, Y.-W. Huang, T.-C. Chen, T.-C. Wang and L.-G. Chen, "Analysis and architecture design of variable block-size motion estimation for H.264/AVC," IEEE TCSVT, vol. 53, no. 3, pp. 578-593, March 2006.
[7] S. Y. Yap and J. V. McCanny, "A VLSI architecture for advanced video coding motion estimation," ASAP, pp. 293-301, June 2003.
[8] C.-Y. Kao and Y.-L. Lin, "An AMBA-compliant motion estimator for H.264 advanced video coding," IEEE International SOC Conference (ISOCC), Seoul, Korea, October 2004.
[9] B. M. Li and P. H. Leong, "Serial and parallel FPGA-based variable block size motion estimation processors," Journal of Signal Processing Systems, vol. 51, no. 1, pp. 77-98, April 2008.
[10] T. Moorthy and A. Ye, "A scalable computing and memory architecture for variable block size motion estimation on field-programmable gate arrays," FPL 2008, pp. 83-88, 2008.
[11] Y.-W. Huang, C.-Y. Chen, C.-H. Tsai, C.-F. Shen and L.-G. Chen, "Survey on block matching motion estimation algorithms and architectures with new results," The Journal of VLSI Signal Processing, vol. 42, no. 3, pp. 297-320, March 2006.
[12] S.-C. Cheng and H.-M. Hang, "A comparison of block-matching algorithms mapped to systolic-array implementation," IEEE TCSVT, vol. 7, no. 5, pp. 741-757, Oct. 1997.
[13] T. Dias, S. Momcilovic, N. Roma and L. Sousa, "Adaptive motion estimation processor for autonomous video devices," EURASIP Journal on Embedded Systems, vol. 2007, no. 1, January 2007.
[14] K. Babionitakis et al., "A real-time motion estimation FPGA architecture," Journal of Real-Time Image Processing, vol. 3, no. 1-2, pp. 3-20, March 2008.
[15] Information available at http://www.xilinx.com/products/ipcenter/DO-DI-H264-ME.htm
[16] S. Saponara, K. Denolf, G. Lafruit, C. Blanch and J. Bormans, "Performance and complexity co-evaluation of the advanced video coding standard for cost-effective multimedia communications," EURASIP J. Appl. Signal Process., no. 2, pp. 220-235, Feb. 2004.
[17] 1080p HD sequences obtained from http://nsl.cs.sfu.ca/wiki/index.php/Video_Library_and_Tools#HD_Sequences_from_CBC

[18] Information available at http://www.videolan.org/developers/x264.html

Fig. 1. Search range analysis for sunflower sequence
Fig. 2. Search range analysis for pedestrian area sequence
Fig. 3. Search range analysis for tractor sequence
Fig. 4. Sub-partitions analysis for pedestrian area sequence
Fig. 5. Sub-partitions analysis for sunflower sequence
Fig. 6. Sub-partitions analysis for crowdrun sequence

Fig. 7. Processor microarchitecture with a total of six execution units 24 Op code Field A Field B

Op code   Field A            Field B            Operation
0000      Pattern address    Number of points   Integer pattern instruction
0001      Pattern address    Number of points   Fractional pattern instruction
0010      Winner field       immediate8         Conditional jump to label: if the winner field equals the winner id, jump to immediate8
                                                (winner id = 0 means no winner in the pattern; otherwise it identifies the winning execution unit)
0011      unused             immediate8         Unconditional jump to label
0100      unused             immediate8         Conditional jump to label (jump if the condition bit is set)
0101      reg                immediate14        Compare (if less than, set the condition bit)
0110      reg                immediate14        Compare (if greater than, set the condition bit)
0111      reg                immediate14        Compare (if equal, set the condition bit)

Fig. 8. Instruction set architecture (ISA)
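
As a concrete illustration of this encoding, the short C sketch below packs an opcode and its two fields into one instruction word. The 24-bit width, the exact bit positions and the mnemonic names are assumptions made for the example rather than the documented binary format; the two sample words loosely mirror the first chk and chkjmp entries of the listing in Fig. 9.

    /* Hedged sketch of instruction packing: a 4-bit opcode plus two fields in a
     * 24-bit word. The field positions (opcode in bits 23-20, field A in 15-8,
     * field B in 7-0) are assumptions for illustration only. */
    #include <stdint.h>
    #include <stdio.h>

    enum opcode {
        OP_CHK    = 0x0,  /* integer check pattern         */
        OP_CHKFR  = 0x1,  /* fractional check pattern      */
        OP_CHKJMP = 0x2,  /* conditional jump on winner id */
        OP_JMP    = 0x3,  /* unconditional jump            */
        OP_CJMP   = 0x4,  /* jump if condition bit set     */
        OP_CMPLT  = 0x5,  /* compare less than             */
        OP_CMPGT  = 0x6,  /* compare greater than          */
        OP_CMPEQ  = 0x7   /* compare equal                 */
    };

    static uint32_t pack24(enum opcode op, uint8_t field_a, uint8_t field_b)
    {
        return ((uint32_t)op << 20) | ((uint32_t)field_a << 8) | field_b;
    }

    int main(void)
    {
        /* A 5-point integer check pattern starting at point address 0,
         * and a jump to instruction 11 when no unit reports a winner. */
        printf("%06X\n", (unsigned)pack24(OP_CHK, 5, 0));
        printf("%06X\n", (unsigned)pack24(OP_CHKJMP, 0, 11));
        return 0;
    }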

S = 8;                              // Initial step size
check(0, 0);
check(0, S); check(0, -S);
check(S, 0); check(-S, 0);
update;
do
{
  S = S / 2;
  for(i = 0 to 4 step 1)
  {
    check(0, S);
    check(0, -S);
    check(S, 0);
    check(-S, 0);
    update;
    #if( WINID == 0 )
    #break;
  }
} while( S > 1);

for(i = -0.5 to 0.5 step 0.25)
  for(j = -0.5 to 0.5 step 0.25)
    check(i, j);
update;

Compiled listing (chk: integer check pattern instruction, chkfr: fractional check pattern instruction, chkjmp: conditional jump instruction):

0    0 05 00   chk      NumPoints: 5    startAddr: 0
1    0 04 05   chk      NumPoints: 4    startAddr: 5
2    2 00 0B   chkjmp   WIN: 0  goto: 11
3    0 04 05   chk      NumPoints: 4    startAddr: 5
..................
11   0 04 0A   chk      NumPoints: 4    startAddr: 9
12   2 00 15   chkjmp   WIN: 0  goto: 21
..................
21   0 04 0D   chk      NumPoints: 4    startAddr: 13
22   2 00 1F   chkjmp   WIN: 0  goto: 31
..................
31   1 19 11   chkfr    NumPoints: 25   startAddr: 17

Fig. 9. Programming example

//full search example
check(0,0);
update;
//full search
for(i = -7 to 7 step 1)
  for(j = -7 to 7 step 1)
    check(i, j);
update;
//End

0    0 01 00   chk   NumPoints: 1     startAddr: 0
1    0 E1 01   chk   NumPoints: 225   startAddr: 1

/* large hexagon, small full search, hexagon and
   final square refinement as in UMH */
// UMH algorithm
//first point
check(0,0);
update;
#if (COST > 2000)
{
  //large cross
  // horizontal cross
  for(i = -17 to 17 step 2)
    check(i,0);
  // vertical cross
  for(i = -7 to 7 step 2)
    check(0,i);
  update;
}
#else
{
  //small cross
  // horizontal cross
  for(i = -5 to 5 step 2)
    check(i,0);
  // vertical cross
  for(i = -3 to 3 step 2)
    check(0,i);
  update;
}
check(diamond);
//hp refinement
for(loop = 1 to 2 step 1)
{
  check(diamondhp);
  #if( WINID == 0 )
  #break;
}
//qp refinement
check(squareqp);

Pattern(diamondhp)
{
  check(0,0.5)
  check(0,-0.5)
  check(0.5,0)
  check(-0.5,0)
}
Pattern(diamondqp)
{
  check(0,0.25)
  check(0,-0.25)
  check(0.25,0)
  check(-0.25,0)
}
/*other pattern definitions*/

Fig. 10. Programming example for the full-search and UMH algorithms.

Fig. 11. Processor programming and configuration workflow. (A new ME algorithm is written as high-level SharpEye code and translated by the compiler into SharpEye assembly, which the assembler/linker turns into the program and point binaries. In parallel, a standard or new hardware configuration — the number and type of functional units: integer and fractional pel, Lagrangian, motion vector candidates, etc. — is explored with the cycle-accurate simulator/configurator against energy, throughput, quality and area constraints; once the constraints are met, the RTL configuration is generated from the component library and passed through synthesis and place & route with standard FPGA tools to produce the processor bitstream.)

Video standard   Precision   Coefficients 1-6          Divisor   Bits shifted
H.264            ½            1  -5  20  20  -5   1    32        5
VC-1             ½           -1   9   9  -1   0   0    16        4
VC-1             ¼           -4  53  18  -3   0   0    64        6
VC-1             ¾           -3  18  53  -4   0   0    64        6
AVS              ½           -1   5   5  -1   0   0     8        3
AVS              ¼            1   3   3   1   0   0     8        3

Table 1. Fractional-pel interpolation filter coefficients
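
Each row of Table 1 is applied in the same way: the neighbouring pixels are weighted by the row's coefficients, a rounding offset is added and the sum is shifted right by the "bits shifted" value, which is equivalent to dividing by the divisor. The small C sketch below illustrates this with the AVS half-pel row; the rounding offset and the clip to 8 bits are illustrative assumptions, since each standard defines its own exact rules.

    /* Minimal sketch of applying one row of Table 1 to a run of pixels. */
    #include <stdint.h>
    #include <stdio.h>

    static uint8_t clip8(int v) { return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v)); }

    /* Weight `taps` pixels by the row coefficients, round and shift. */
    static uint8_t interp(const uint8_t *p, const int *coeff, int taps, int shift)
    {
        int acc = 0;
        for (int i = 0; i < taps; i++)
            acc += coeff[i] * p[i];
        return clip8((acc + (1 << (shift - 1))) >> shift);
    }

    int main(void)
    {
        /* AVS half-pel row of Table 1: coefficients -1, 5, 5, -1, divisor 8 (>> 3). */
        static const int avs_hp[4] = { -1, 5, 5, -1 };
        const uint8_t samples[4] = { 10, 20, 30, 40 };
        printf("AVS half-pel sample: %d\n", interp(samples, avs_hp, 4, 3));
        return 0;
    }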

Fig. 12. Fractional pixel locations

Fig. 13. Half-pel systolic processor unit. (A mode signal and the 8-bit Pixel IN feed a chain of six processing elements PE1-PE6; their six 13-bit outputs are combined by an add-and-shift stage to produce the 8-bit Interpolated Pixel OUT.)
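
A behavioural sketch of this datapath is given below: the input sample is shifted through the six processing elements, each PE multiplies the sample it holds by its mode-selected coefficient, and the add-and-shift stage sums the six products and divides by the filter divisor. The H.264 half-pel taps of Table 1 are used as the example mode; the rounding offset is an illustrative assumption rather than a documented detail of the unit.

    /* Behavioural C model of a six-PE half-pel chain in the spirit of Fig. 13. */
    #include <stdint.h>
    #include <stdio.h>

    #define NPE 6

    int main(void)
    {
        static const int coeff[NPE] = { 1, -5, 20, 20, -5, 1 };  /* mode = H.264 1/2 pel */
        const uint8_t pixel_in[] = { 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 };
        uint8_t pe_reg[NPE] = { 0 };   /* pixel registers inside PE1..PE6 */

        for (size_t cycle = 0; cycle < sizeof pixel_in; cycle++) {
            /* Shift the new pixel into the PE chain. */
            for (int i = NPE - 1; i > 0; i--)
                pe_reg[i] = pe_reg[i - 1];
            pe_reg[0] = pixel_in[cycle];

            /* Add-and-shift stage: sum the six PE products, round, divide by 32. */
            int acc = 0;
            for (int i = 0; i < NPE; i++)
                acc += coeff[i] * pe_reg[i];
            if (cycle >= NPE - 1)      /* chain full: one output per cycle */
                printf("cycle %zu: interpolated pixel %d\n", cycle, (acc + 16) >> 5);
        }
        return 0;
    }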

Virtex-4 SX35

Configuration     LUTs used / available    Memory blocks used / available / minimum memory bits   Critical path (ns) / logic levels
1 IPEU / 0 FPEU   2259/30720 (7.4%)        21/192 (10%) / 95 Kbits                                 4.976 / 8
2 IPEU / 0 FPEU   3805/30270 (12.6%)       38/192 (19%) / 179 Kbits                                5.040 / 8
3 IPEU / 0 FPEU   5571/30270 (18.4%)       55/192 (28%) / 263 Kbits                                5.032 / 7
1 IPEU / 1 FPEU   9143/30270 (30.2%)       31/192 (16%) / 95+42 Kbits                              4.986 / 6
2 IPEU / 1 FPEU   10985/30270 (36.2%)      48/192 (39%) / 179+84 Kbits                             4.996 / 9

Table 2. Processor complexity in H.264 mode

Fig. 14. Quarter-pel processor unit. (The half-pel pixel operands a and b, carried on 64-bit buses, are scaled by the standard- and position-dependent coefficient pairs of Table 1 — 1/1 for H.264, -4/53 and 18/-3 (and their mirrored order) for VC-1, and 1/3 or 3/1 for AVS — using a shift-and-add network (<<4, <<1 and negation); the widened partial terms are then added and right-shifted to produce the quarter-pel (QP) pixels.)
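
Because the unit avoids general multipliers, each constant coefficient can be produced with a few shifts and adds. The sketch below shows one possible multiplierless decomposition for the VC-1 quarter-pel taps of Table 1; the particular decompositions and the rounding offset are assumptions chosen for the example, not the exact network of the figure.

    /* Constant multiplications by the VC-1 quarter-pel taps using only
     * shifts, adds and negation. */
    #include <stdio.h>

    static int mul53(int x) { return (x << 5) + (x << 4) + (x << 2) + x; } /* 32+16+4+1 */
    static int mul18(int x) { return (x << 4) + (x << 1); }                /* 16+2      */
    static int mul3(int x)  { return (x << 1) + x; }
    static int mul4(int x)  { return x << 2; }

    /* VC-1 quarter-pel sample from four neighbouring samples:
     * (-4*p0 + 53*p1 + 18*p2 - 3*p3 + 32) >> 6 (rounding offset assumed). */
    static int vc1_quarter(int p0, int p1, int p2, int p3)
    {
        int acc = -mul4(p0) + mul53(p1) + mul18(p2) - mul3(p3);
        return (acc + 32) >> 6;
    }

    int main(void)
    {
        printf("%d\n", vc1_quarter(10, 20, 30, 40));
        return 0;
    }

With the sample inputs 10, 20, 30 and 40 the function returns 23, matching a direct evaluation of (-4*10 + 53*20 + 18*30 - 3*40 + 32) >> 6.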

Type of fractional-pel search done (average search points per macroblock, 1080p / bit rate reduction %)

Video sequence    None                                Half-pel only                       Half-pel and quarter-pel
                  FP search points / bit rate red.    HP search points / bit rate red.    HP&QP search points / bit rate red.
Pedestrian area   15.7 / 0%                           5.4 / 6%                            9.3 / 12%
Sunflower         11.3 / 0%                           5.0 / 5%                            9.1 / 11.5%
Tractor           11.1 / 0%                           6.2 / 16%                           10.2 / 25%

Table 3. Evaluation of fractional-pel search

Throughput in macroblocks per second at 200 MHz (supported resolution and frame rate in parentheses)

Integer execution   16x16, diamond,        16x16, 8x8, diamond,   16x16, hexagon,        16x16, 8x8, hexagon,   16x16, UMH,            16x16, 8x8, UMH,
units               200 MHz, 4 ppp         200 MHz, 4 ppp         200 MHz, 6 ppp         200 MHz, 6 ppp         200 MHz, 16 ppp        200 MHz, 16 ppp
1                   372,960 (1080p@30)     233,918 (720p@50)      260,983 (1080p@30)     173,988 (720p@30)      84,813                 56,542
2                   692,640 (1080p@50)     461,760 (1080p@50)     495,867 (1080p@50)     330,578 (1080p@30)     166,233 (720p@30)      110,822
3                   -                      -                      708,382 (1080p@50)     472,255 (1080p@50)     -                      -
4                   1,212,121 (1080p@50)   808,080 (1080p@50)     -                      -                      319,680 (1080p@30)     213,120 (720p@50)
6                   -                      -                      1,239,669 (1080p@50)   826,446 (1080p@50)     -                      -
8                   -                      -                      -                      -                      593,692 (1080p@50)     395,794 (720p@30)
16                  -                      -                      -                      -                      1,038,961 (1080p@50)   692,640 (1080p@50)

Table 4. Performance analysis of the configurable processor
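
The resolution labels in Table 4 follow from the macroblock rate each format requires. A minimal check, assuming 16x16 macroblocks (the 1080 lines round up to 68 macroblock rows):

    /* Macroblock rates behind the resolution labels of Table 4. */
    #include <stdio.h>

    static long mb_per_second(int width, int height, int fps)
    {
        return (long)((width + 15) / 16) * ((height + 15) / 16) * fps;
    }

    int main(void)
    {
        printf("720p@30:  %ld MB/s\n", mb_per_second(1280, 720, 30));  /* 108,000 */
        printf("720p@50:  %ld MB/s\n", mb_per_second(1280, 720, 50));  /* 180,000 */
        printf("1080p@30: %ld MB/s\n", mb_per_second(1920, 1080, 30)); /* 244,800 */
        printf("1080p@50: %ld MB/s\n", mb_per_second(1920, 1080, 50)); /* 408,000 */
        return 0;
    }

For example, the 372,960 macroblocks per second of the single-unit diamond configuration exceeds the 244,800 needed for 1080p at 30 frames per second but not the 408,000 needed for 1080p at 50 frames per second.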

Processor implementation                            Cycles per MB       FPGA complexity   FPGA clock          Memory (BRAMs)
                                                    (diamond search)    (slices)          (MHz, Virtex-II)
Intel P4 assembly                                   ~3,000              N/A               N/A                 N/A
Dias et al. [13]                                    4,532               2,052             67                  4 (external reference area)
Babionitakis et al. [14]                            660                 2,127             50                  11 (1 reference area of 48x48 pixels)
Proposed with one integer-pel execution unit        510                 1,231             125                 21 (2 reference areas of 112x128 pixels)
Proposed with two integer-pel execution units       287                 2,051             125                 38 (2 reference areas of 112x128 pixels)

Table 5. Performance/complexity comparison
