VLIW DSP VS. SUPERSCALAR IMPLEMENTATION OF A BASELINE H.263 VIDEO ENCODER

Serene Banerjee, Hamid R. Sheikh, Lizy K. John, Brian L. Evans, and Alan C. Bovik

Dept. of Electrical and Computer Engineering

The University of Texas at Austin, Austin, TX 78712-1084 USA

{serene, sheikh, ljohn, bevans, bovik}@ece.utexas.edu

ABSTRACT

A Very Long Instruction Word (VLIW) processor and a superscalar processor can execute multiple instructions simultaneously. A VLIW processor depends on the compiler and programmer to find the parallelism in the instructions, whereas a superscalar processor determines the parallelism at run time. This paper compares TI TMS320C6700 VLIW digital signal processor (DSP) and SimpleScalar superscalar implementations of a baseline H.263 video encoder in C. With level two C compiler optimization, a one-way issue superscalar processor is 7.5 times faster than the VLIW DSP for the same processor clock speed. The superscalar speedup from one-way to four-way issue is 2.88:1, and from four-way to 256-way issue is 2.43:1. To reduce the execution time on the C6700, we write assembly routines for sum-of-absolute-difference, interpolation, and reconstruction, and place frequently used code and data into on-chip memory. We use TI's discrete cosine transform assembly routines. The hand optimized VLIW DSP implementation is 61x faster than the C version compiled with level two optimization. Most of the improvement was due to the efficient placement of data and programs in memory. The hand optimized VLIW implementation is 14% faster than a 256-way superscalar implementation without hand optimizations.

1. INTRODUCTION

Two factors limit the use of real-time video communications: network bandwidth and processing resources. The ITU-T H.263 standard [1, 2, 3] for video communication over wireless and wireline networks has high computational complexity. An H.263 encoder includes the H.263 decoder minus the variable length decoding in a feedback path. The H.263 encoder is roughly three times more complex than an H.263 decoder.

In the H.263 encoder, the most computationally complex operation is motion estimation, even when using efficient search algorithms. The bottleneck is in the sum-of-absolute differences (SAD) calculations. SAD provides a measure of the closeness between a 16 × 16 macroblock in the current frame and a 16 × 16 macroblock in the previous frame. Other computationally complex operations are image interpolation, image reconstruction, and the forward discrete cosine transform (DCT).

In this paper, we evaluate the performance of the C source code for a research H.263 codec developed at the University of British Columbia (UBC) [4] on two processor architectures. The first architecture is a very long instruction word (VLIW) digital signal processor (DSP) represented by the TI TMS320C6701 [5, 6]. The second is an out-of-order superscalar architecture represented by the SimpleScalar simulator [7].

We evaluate the performance of the superscalar processor vs. the maximum number of simultaneous instructions executed. We compare the performance of the C source code on both processor architectures. For the VLIW DSP implementation, we demonstrate that manual placement of frequently used data and code into on-chip memory provides a 29x speedup, whereas hand coding five of the most computationally complex routines gives a 4x speedup. Our hand-coded SAD, clipping, interpolation, and fill data C6700 assembly routines and our SimpleScalar settings are available from

http://www.ece.utexas.edu/~sheikh/h263/

2. BACKGROUND

2.1. TMS320C6700 VLIW DSP Family

The TMS320C6700 is a floating-point subfamily within the C6000 VLIW DSP family. The C6000 family has 32-bit data words and 256-bit instructions. The 256-bit instruction consists of eight 32-bit instructions that may be executed at the same time. The C6000 depends on the compiler to find the parallelism in the code.

The C6000 family has two parallel 32-bit data paths. Each data path has 16 32-bit registers, and can only communicate one 32-bit word to the other data path per cycle. Each data path has four 32-bit RISC units: one arithmetic logic unit (ALU), one 16 × 16 multiplier, one shifter, and one load/store unit. Each 32-bit RISC unit has a throughput of one cycle, but its result is delayed by one cycle for 16 × 16 multiplication, zero cycles for logical and arithmetic operations, four cycles for load/store instructions, and five cycles for branch instructions. Every instruction may be conditionally executed. Conditional execution avoids the delay of a branch instruction and avoids interrupts from being disabled because a branch is in the pipeline.

The C6000 family has one bank of program memory and multiple banks of data memory. Each data memory bank is single-ported and two bytes wide, and has one-cycle data access throughput. Accessing one data memory bank twice in the same cycle results in a pipeline stall and increased cycle counts. The C6701 has 64 kbytes of program memory and 64 kbytes of data memory. The program memory bank has 2k of 256-bit instruction words, and may be operated as an instruction cache. Each of the 16 data memory banks has 2k of 16-bit halfwords.

On the C6700, the pipeline is 11–17 stages, depending on the instruction. The clock speed varies from 100 to 200 MHz. Power dissipation is typically 1.5–2.0 W, and the transistor count is less than 0.5 million. All simulation results for the C6700 were obtained by running the code on a 100-MHz C6701 Evaluation Module board. The C6700 simulator does not report pipeline stalls or memory bank conflicts, so running the code on the board gives the true performance.

2.2. SimpleScalar Superscalar Processor

The SimpleScalar has a superscalar architecture derived from the MIPS-IV. The architecture prefetches at least four times the number of instructions to be executed and finds the parallelism between instructions at run time. All instructions have a fixed length of 64 bits and fixed format, which promotes fast instruction decoding. It supports out-of-order issue and execution, based on the Register Update Unit (RUU) [8]. The RUU uses a reorder buffer for register renaming and holds the results of pending instructions. The memory system employs a load/store queue to support speculative execution. The results of a speculative store are placed in the store queue. If the addresses of all previous stores are known, then loads are dispatched to the memory. The six pipeline stages are issue, dispatch, fetch, commit, writeback, and load/store queue refresh. For the SimpleScalar simulator, the configuration of the processor core and branch predictor can be specified by using command-line arguments. The clock speed of typical superscalar processors is as high as 1 GHz. Power dissipation is a few tens of Watts, and the transistor count may vary from a few tens of millions to hundreds of millions.

2.3. UBC's H.263 Video Encoder

The H.263 Version 2 (H.263+) video encoder [1] contains 23,000 lines (720 kbytes) of C code. It was written for desktop PC applications and increases memory usage for faster execution. It incorporates the baseline H.263 encoder with many optional H.263+ modes. Our goal was to optimize the baseline H.263 encoder only for embedded applications, where the amount of available memory is much lower. This encoder irregularly uses floating point variables and arithmetic. So, we choose the floating-point TMS320C6701 instead of the fixed-point TMS320C6201 as the VLIW DSP representative. In the performance evaluation, we encode sub-QCIF (128 × 96) frames. A related paper optimizes the MPEG-2 decoder on the TMS320C6000 [9].

3. VLIW DSP IMPLEMENTATION

The most demanding operations in a typical video encoder are motion estimation, motion compensation, discrete cosine transform, half-pixel interpolation, and reconstruction, as shown in Table 1. We hand optimize these routines to speed up the encoder. The entire H.263+ encoder, which needs 340 kB of program memory and 2.2 MB of data memory, does not fit in internal RAM on the C6701. We used memory placement and code optimizations to reduce cycle count.

3.1. Memory Placement

Placing the code and data in slow external RAM wastes many cycles, especially for frequently accessed data and code. We therefore performed manual placement of code and data into fast on-chip program and data memory, respectively. Computationally intensive routines for motion estimation, interpolation, DCT, SAD, and reconstruction were placed in the on-chip program memory. Commonly used runtime functions from TI's libraries such as memcpy, memcmp and memset were linked into internal program memory. A total of 43 kB out of the 340 kB of needed program memory was placed in the internal program memory.

Of the 2.2 MB of data memory, the lookup table for quantization consumed 1.6 MB. The following were placed in internal data memory: (1) 16 × 16 macroblocks and their corresponding search area for the

Time %   Routine               -o2            -o2 and Memory  -o2 and Code   All
                               Optimization   Optimization    Optimization   Optimizations
 91.9    Motion Estimation        1356000          33400          325000         13000
  1.8    Motion Compensation        26200           4200            6400          3400
  1.6    2-D DCT                    17200            670            6000           130
  0.52   Quantization                7700            660            3300           660
  1.1    Interpolation              16900           4200            2200           750
  2.1    Reconstruction             31000           4000            9100          2260
  3.5    Rest                       52000           8300           31200          6270
100.0    Entire encoder           1476000          51400          374000         24200

Table 1: Cycle counts (x1000) of H.263 routines per frame on the TMS320C6700 VLIW DSP.

Program      Usage                    -o2 and Memory  -o2 and Code   All
                                      Optimization    Optimization   Optimizations
SAD          Motion Estimation        54x             4x             264x
Clip_MB      Reconstruction           7.5x            13x            228x
DCT          DCT of Frames            26x             3x             138x
Interpolate  Bilinear Interpolation   4x              8x             22x
FillMBData   Reconstruction           7x              8x             22x
Encoder      --                       29x             4x             61x

Table 2: Speedup of hand optimized H.263 routines on the TMS320C6700 VLIW DSP per call.
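The Interpolate routine listed in Table 2 performs the half-pixel bilinear interpolation used by H.263 motion compensation. As a point of reference, the diagonal half-pixel case amounts to the following scalar C sketch; the function name and interface are our own illustration, not the encoder's source, and rounding follows the H.263 divide-by-four-with-rounding convention:

```c
/* Half-pixel bilinear interpolation of a (w x h) block: each output
 * sample is the rounded average of the four neighboring full-pixel
 * samples.  The source must provide one extra row and column, i.e.
 * (w+1) x (h+1) valid samples.  Interface is illustrative only. */
void interp_half_xy(const unsigned char *src, int src_stride,
                    unsigned char *dst, int dst_stride, int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int sum = src[x] + src[x + 1]
                    + src[x + src_stride] + src[x + src_stride + 1];
            dst[x] = (unsigned char)((sum + 2) >> 2); /* /4 with rounding */
        }
        src += src_stride;  /* next row of source samples */
        dst += dst_stride;  /* next row of output samples */
    }
}
```

The four loads per output pixel explain why this routine benefits from the packed data access described in Sec. 3.2.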

Program      1-way     4-way    8-way    16-way   32-way   256-way   VLIW
SAD            12723     7281     6775     6017     6031     6422      290
DCT            56409    22770    16381    14069    13193    12885      384
Clip_MB        24545     9604     7023     6250     6438     6438     1173
FillMBData     12692     4693     3082     3425     3438     3438     8740
Interpolate  2011397   640799   425703   456605   456480   456480   750000
Encoder        196 M     68 M     59 M     30 M     30 M     28 M     24 M

Table 3: SimpleScalar with increasing issue widths and hand optimized VLIW DSP cycle counts for H.263 routines.

motion estimation routines; (2) 16 × 16 macroblocks for the DCT, quantization, coding and reconstruction routines; (3) local data of computationally intensive routines; and (4) the stack. These choices alone give an overall speedup of 29x over the unoptimized version.

3.2. Hand Coding Assembly Language Routines

Compiler intrinsics give little performance improvement for critical routines. To reduce cycle counts further, we write parallel assembly routines for SAD and Clip_MB and linear assembly routines for InterpolateImage and FillMBData. We use TI's DCT assembly routines.

SAD consists of two read, one subtract, one absolute value, and one accumulate operation per pixel of both blocks. We wrote a parallelized pipelined assembly kernel to compute the SAD between two 16 × 16 macroblocks that avoided memory bank conflicts. Of the 16 instructions in the fully unrolled inner loop, 11 instructions use five of the six arithmetic units and five instructions use all six arithmetic units. The SAD is computed in 290 cycles per macroblock in the worst case. Since the SAD routine aborts computation if the accumulated difference exceeds the current minimum for the search window, the hand-optimized SAD routine takes on average 110 cycles per macroblock for full-search motion estimation. The speedup over the level two C compiler optimization is 264 times.

The Clip_MB routine clips overflowing values to 255 and underflowing values to zero for every pixel in a macroblock. We unroll the loops and pipeline the comparisons. The cycle counts reduce to 1173 per macroblock, which is an improvement of 228 times over the compiler's best optimization.

The TI DCT routine runs 138 times faster than the C version with level two optimization only. However, the encoder uses 32-bit integers for coefficients, whereas TI's DCT routine uses 16-bit short integers. The conversion is an additional overhead.

We write the FillMBData and Interpolate routines in linear assembly. Both these routines use packed data access to external memory. The speedup is about 22 times each over the corresponding C compiled version with level two optimization.

Table 1 summarizes the improving cycle counts and Table 2 summarizes the speedups of the optimized routines. Overall speedup for the encoder is 61 times over the version with C compiler optimizations enabled. The encoder requires 24 M cycles per frame, or equivalently, 4 sub-QCIF (128 × 96) frames/s on a 100-MHz C6701 EVM board. Additional improvement may be obtained by optimizing the variable length encoding. Our hand optimized VLIW routines can also be used in the MPEG standards.

4. SUPERSCALAR IMPLEMENTATION

For a fair comparison of the individual H.263 routines, code and data were placed in the internal RAM for both the VLIW and the superscalar processor. We vary issue widths of the superscalar processor from 1-way to 256-way. The other parameters, i.e. the fetch and decode widths and the load/store queue and RUU sizes, were chosen to minimize the cycle counts. Typically, the fetch queue and the decode queue were twice the issue bandwidth, and the load/store queue and the RUU size were four times the issue bandwidth. The 4 kbyte G-share branch predictor [10] was used, which gives good performance over a variety of applications. For memory latencies, the default values were chosen, which are equal to typical values in modern general-purpose processors.

For the SAD, DCT and Clip_MB routines on a 256-way issue superscalar processor, the best cycle counts were 6017, 12885 and 6250, respectively, which exceed the cycle counts for the corresponding VLIW DSP parallel assembly routines, as shown in Table 4. In Table 4, the superscalar cycle counts for the FillMBData and Interpolate routines are lower than the cycle counts for the corresponding VLIW DSP linear assembly routines.

For the entire encoder, the cycle count was 28 M cycles, or equivalently, 3.5 sub-QCIF (128 × 96) frames/s on a 100 MHz processor. The cycle counts for the subroutines and the entire encoder for a 1-, 4-, 8-, 16-, 32-, and 256-way issue processor are summarized in Table 3. With increasing issue widths, the speedup shows diminishing returns. Increased issue widths may sometimes lead to more missed speculations. This results in increased cycle counts, as observed in Table 3.

             VLIW DSP   SimpleScalar
Routine      cycles     cycles         Ratio
SAD             290        6017        1:20.7
DCT             384       12885        1:33.5
Clip_MB        1173        6250        1:5.3
FillMBData     8740        3082        2.8:1
Interpolate  750000      425703        1.8:1
Encoder        24 M        28 M        1.2:1

Table 4: Comparison of 256-way issue superscalar and hand optimized VLIW DSP implementations.
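For reference, the per-macroblock SAD computation described in Sec. 3.2, including the early abort once the accumulated difference exceeds the best SAD found so far, corresponds to this scalar C sketch. The function name, the `stride` parameter, and the interface are our own illustration, not the UBC encoder's source:

```c
#include <stdlib.h>

/* Sum of absolute differences between a 16x16 macroblock in the
 * current frame and a candidate 16x16 macroblock in the previous
 * frame.  `stride` is the distance in bytes between vertically
 * adjacent pixels.  Accumulation aborts early (here, per row) once
 * the running total exceeds `min_so_far`, the best SAD found so far
 * in the search window. */
unsigned sad_16x16(const unsigned char *cur, const unsigned char *prev,
                   int stride, unsigned min_so_far)
{
    unsigned sad = 0;
    for (int row = 0; row < 16; row++) {
        for (int col = 0; col < 16; col++)
            sad += abs(cur[col] - prev[col]); /* read, read, subtract,
                                                 absolute value, accumulate */
        if (sad > min_so_far)  /* this candidate cannot win: abort */
            return sad;
        cur += stride;   /* advance both blocks to the next row */
        prev += stride;
    }
    return sad;
}
```

The five operations per pixel in the inner loop are what the hand-written C6700 kernel spreads across the six arithmetic units of the two data paths.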

Measure            -o2            -o2 and Memory  -o2 and Code   All            Superscalar
                   Optimizations  Optimizations   Optimizations  Optimizations  Performance
Cycles per frame   1476 M         51 M            374 M          24 M           28 M
Frames per second  0.14           2               0.54           4              3.5
Speedup            --             29              4              61             53

Table 5: Comparison of a VLIW DSP and 256-way issue superscalar implementations at a clock speed of 100 MHz.
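The Clip_MB routine timed in Tables 2-4 saturates reconstructed pixel values into the 8-bit range. Its scalar behavior can be sketched in C as follows; the buffer layout and function interface are assumptions for illustration, not the encoder's own:

```c
/* Clip 32-bit intermediate pixel values into [0, 255]: overflowing
 * values saturate to 255 and underflowing values to zero, one output
 * byte per pixel.  A 16x16 macroblock corresponds to n = 256. */
void clip_mb(const int *src, unsigned char *dst, int n)
{
    for (int i = 0; i < n; i++) {
        int v = src[i];
        if (v < 0)
            v = 0;          /* underflow clips to zero */
        else if (v > 255)
            v = 255;        /* overflow clips to 255   */
        dst[i] = (unsigned char)v;
    }
}
```

A branch-free formulation of the two comparisons maps naturally onto the C6700's conditional execution, which is one reason the unrolled, pipelined assembly version reaches 1173 cycles per macroblock.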

5. CONCLUSION

This paper evaluates the performance of a baseline H.263 encoder in C on a VLIW DSP (TMS320C6700) processor and on an out-of-order SimpleScalar superscalar processor to encode 128 × 96 frames. The H.263 encoder was originally written for PC applications and trades off memory for speed. We validate that the VLIW DSP and superscalar implementations generated H.263 bitstreams identical to those of the PC version of the encoder. When using level two C compiler optimization only, we find that for the same processor clock speeds:

1. the one-way issue superscalar implementation is 7.5x faster than the VLIW DSP implementation,

2. the four-way to one-way issue speedup is 2.88:1, and

3. the 256-way to four-way issue speedup is 2.4:1.

In analyzing the performance of the VLIW DSP implementation, external memory access was the key bottleneck. The amount of on-chip memory was only a fraction of the total memory needed by the application. A less significant but still important problem was that the compiler cannot always find the full parallelism in C code to take advantage of the parallel functional units in a VLIW processor. Once we placed the data and code that were most often used into on-chip memory, we hand coded SAD, image interpolation, and image reconstruction routines on the C6700.

After all of the manual optimizations, the VLIW DSP implementation was 14% faster than the 256-way superscalar implementation with level two compiler optimization and with all data and code on-chip. This reflects a key difference in the architectures. On an out-of-order superscalar processor, the parallelism is detected and exploited at run time. On a VLIW DSP, the parallelism must be detected and exploited at compile time. Optimizing a VLIW DSP implementation requires careful manual placement of data and code into on-chip memory, which is not required on a superscalar processor. Optimizing both superscalar and VLIW DSP implementations generally requires hand coding the most frequently called routines.

6. REFERENCES

[1] G. Cote, B. Erol, M. Gallant, and F. Kossentini, "H.263+: Video coding at low bit rates," IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, pp. 849–866, Nov. 1998.

[2] T. Gardos, "H.263+: The new ITU-T recommendation for video coding at low bit rates," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 6, pp. 3793–3796, May 1998.

[3] ITU Telecom Standardization Sector, "Video coding for low bit rate communication," ITU-T Recommendation H.263 Version 2, Jan. 1998.

[4] B. Erol, F. Kossentini, and H. Alnuweiri, "Implementation of a fast H.263+ encoder/decoder," in Proc. IEEE Asilomar Conf. on Signals, Systems and Comp., vol. 1, pp. 462–466, Nov. 1998.

[5] TMS320C6000 Programmer's Guide. No. SPRU 198D, Texas Instruments, Inc., Mar. 2000.

[6] TMS320C6000 CPU and Instruction Set. No. SPRU 189E, Texas Instruments, Inc., Jan. 2000.

[7] T. Austin and D. Burger, "The SimpleScalar tool set, Version 2.0," Tech. Rep. TR-1342, Univ. of Wisconsin Comp. Sciences Dept., June 1997.

[8] G. S. Sohi, "Instruction issue logic for high-performance, interruptible, multiple functional unit, pipelined computers," IEEE Trans. on Computers, vol. 39, pp. 349–359, Mar. 1990.

[9] S. Sriram and C.-Y. Hung, "MPEG-2 video decoding on the TMS320C6x DSP architecture," in Proc. IEEE Asilomar Conf. on Signals, Systems and Comp., vol. 2, pp. 1735–1739, Nov. 1998.

[10] S. Onder, X. Jun, and R. Gupta, "Caching and predicting branch sequences for improved fetch effectiveness," in Proc. Int. Conf. on Parallel Architectures and Compilation Tech., vol. 1, pp. 294–302, Nov. 1999.