1 Video Signal Processors

Video & Graphics Pro cessors: 1997 John A. Watlington MIT Media Lab oratory May 9, 1997 This rep ort is an attempt to assess and contrast the current state of the art in digital video and graphics pro cessing system architectures. Although this is not meant to b e a comprehen- sive survey, I attempted to at least quickly review most of the commercially available ma jor architectures. At present, there is a huge amountofinterest in this area, as three new markets are p erceived: systems for p ersonal (set-top) terminals, PC graphics for game and network ap- plications, and consumer video pro ducts based around digital video disks (DVD). Due b oth to their proliferation, and their very limited programmability, I did not include dedicated MPEG2 video deco ders and graphics accelerators limited to 2D op erations (GUI accelerators) in this survey. The surveyed systems aggregated into three distinct classes, namely: 1. Video Signal Pro cessors (VSPs) - A programmable CPU, usually with sp ecialized pro- cessing elements as well. 2. Graphics Architectures - Sp ecialized pro cessing elements, usually with a xed interconnect (a linear array.) 3. Structured Video Architectures - Containing b oth a video signal pro cessor and p ossibly a dedicated pip eline. Currently represented byTalisman. A section follows for each class, describing representative systems of that class. The rep ort closes with a sub jective Commentary section, containing b oth comments ab out architectural features and a lo ok at future scalability. 1 Video Signal Pro cessors Many of the surveyed systems are characterized by a single programmable core, with an instruction set architecture optimized for the op erations typically encountered in either graphics or video pro cessing algorithms. Sp ecialized pro cessing units may be incorp orated either into the programmable core, or as co-pro cessors. Increasingly, general purp ose micropro cessors are integrating instruction set extensions al- lowing the pro cessing of multiple small datum packed into a larger word simultaneously [1 ] [2 ] [3 ] in order to gain a slight sp eed improvement (x2 or x3) in graphics tasks. The minimal costs of these group instruction extensions (increased instruction set complexity, and slightly increased carry propagation delay in the ALUs) compared with the sp eed improvement when pro cessing typical images justify the extensions. While supp ort for group instructions alone was generally not sucient for inclusion in this survey, they are a common architectural feature among the video signal pro cessors discussed. 1 TI Philips MicroUnity Chromatics Samsung System 'C6201 TM1000 MediaPro c MPact2 MSP Units Inst issue 8 5 1 2 1 inst Inst Width 256 220 32 72 32 total bits Inst Cache 512K 256K 256K 18K 16K bits Data Cache - 128K 256K - 40K bits Data Ram 512K - - - - bits Registers 32x32 126x32 320x64 1Kx72 31x32 bits Data Width 32 32 64 72 32 bits Func. Units 8 22 4 4 1 Floating Pt No Yes No Yes Yes Sp ec. Units - 2 1 2 2 Table 1: Video Signal Pro cessors Architecture Comparison 1.1 Texas Instruments Velo ciTI TMS320C6201 The TMS320C6201 [4 ] is the rst implementation of a new VLIW architecture, Velo ciTI [5 ] [6 ], just available from TI's DSP group. While it contains no functional units dedicated to video or graphics, it's raw p erformance makes it interesting. The pro cessing core (currently running at 200 MHz) consists of two each of four typ es of pro cessing elements: logical (and arithmetic), arithmetic, data addressing, and multiply. All pro cessing elements op erate on 32b data except for multiply, which only op erates on 16b input values. Up to eight instructions (one p er pro cessing element) may b e issued in a single cycle. The pro cessing units are group ed into two sets of four, each of which is coupled to a 16x32b register bank. Ten read p orts and ve write p orts are provided per bank. A cross-linked register read p ort provides direct communication between the two pro cessing clusters, which share a common instruction issue unit. No data cache is provided (although the architecture allows for one). Instead, directly addressable data RAM is provided on-chip, organized as four interleaved banks of 128 Kbits (8Kx16b) each. Each pro cessing cluster has a separate 32b data path to the data RAM, and two simultaneous 32b accesses may o ccur if di erent banks are accessed. A 32b memory interface which can address 64 MBytes of the architecture's 4 GBytes is provided, along with a 16b host interface which is limited to addressing the internal register, data and 512 Kbits of instruction RAM. The memory interface is capable of running at clo ck rates equal to the pro cessor core, and is designed for synchronous DRAM, synchronous burst SRAM, or async. SRAM. Two channels of DMA are provided on-chip for eciently moving data around the system. The instruction set architecture is spartan (esp. in the context of this survey). In the TMS320 tradition, it contains concessions to typical signal pro cessing needs: saturation instead of numeric over ow, a normalize instruction (norm), supp ort for circular addressing, and extended precision (using two 32b registers for a 40b value). There is no supp ort for group instructions, or oating p oint. The execution of every instruction is conditional. Three bits of each 32b xed-size op co de indicate a guard register, and a fourth bit indicates whether the execution is conditional up on the guard b eing equal or not-equal to zero. The instructions are group ed by the compiler or optimizing assembler into execution groups to be issued in the same cycle. The LSB of each op co de is dedicated to indicating the last instruction in an execution group. One or more of these groups are packed into an eight 2 Unit Number Unit Number Int. ALU 5 Load/Store 2 DSP ALU 2 DSP Mult. 2 Shifter 2 Branch 3 Int/FP Mult 2 FP ALU 2 FP Compare 1 FP Sq.Ro ot 1 Constant 5 Table 2: TM1000 Functional Units instruction (256b) fetch group. An execution group may not overlap a fetch group b oundary. The instruction RAM (whichmay b e con gured as a cache) uses a 256b bus to provide a fetch group to the CPU every cycle. 1.2 Philips TriMedia TM1000 Unlike the TMS320C6201, the Philips TriMedia TM1000 [7 ] [8 ] [9] integrates Media sp eci c co- pro cessors and a VLIW core with extensions for media pro cessing. It was designed to b e a PCI based media co-pro cessing system in a PC. Based on the LIFE-1 VLIW architecture develop ed at Philips Research Labs, Palo Alto, in 1987, it started development as a pro duct in 1994, and the TM1000 (previously named the TM-1) shipp ed in early 1996. A second generation is now in development. 1 The VLIW core contains a large number of functional units, connected to a 128x32b register bank through 15 read p orts and 5 write p orts. A listing of the functional units is provided in Table 2. Up to ve instructions may be issued in a single 10 nS cycle. The instruction set is large (197), having b een extended to supp ort the sp ecialized functional units and oating p oint. Group instructions are supp orted, op erating on four 8b or two 16b values in a 32b word, as well as packing and unpacking instructions. Sp ecialized instructions are provided for p erforming convolution and vector distance (i.e. motion estimation) calculation. As in the TMS320C6201, the execution of almost every instruction is conditional up on a guard register. The VLIW core is connected to a 128 Kbit data cache (8-way asso ciative) using two 32- bit buses. A separate 256 Kbit instruction cache (also 8-way asso ciative) uses a 220b bus to provide ve instructions per cycle to the CPU. The instruction stream is stored and cached in a compressed format, and decompressed to provide the 220b instructions only up on b eing fetched. The data and instruction caches share a single 32b main data bus (the Data Highway) with all the co-pro cessors and p eripherals on the chip. The Data Highway connects to b oth a 32b PCI bus interface (master/slave), and a memory interface (32b) to o -chip synchronous DRAM. The architecture address space of 4 GBytes is fully supp orted throughout the system. The Image Co-pro cessor is a pip eline of sp ecialized pro cessors designed to p erform typical image manipulations (at 50 Mpixels/sec. p eak) indep endently of the VLIW core. It reads its parameters and image data from SDRAM memory using the Data Highway, and writes its output either back to SDRAM or to a destination on the PCI bus. A set of FIFOs (6 x 512b) are provided at the input to the co-pro cessor, feeding a 5-tap p olyphase 1D FIR ltering unit. The ltering unit pro cesses a single 8b channel at a time, using multiple passes to p erform op erations on multiple color channels. A YUV/RGB converter is next, followed by an alpha- 1 Philips' count of 27 functional units includes 5 constant units which basically serve as p orts for accessing immediate values in the instruction stream. See Table 2. 3 TI Philips MicroUnity Chromatics Samsung System 'C6201 TM1000 MediaPro c MPact2 MSP Units Clo ck 200 100 300 120 50 MHz Chip/ICache 6.4 3.2 38.4 8.6? 6.4 Gbits/sec ICache/Pro c 50 22.4 38.4 8.6 1.8 Gbits/sec Pro c/Reg 196 64 154 104 154? Gbits/sec Reg/DCache 12.8 6.4 77 - 51 Gbits/sec DCache/Chip 6.4 3.2 38.4 104 6.4 Gbits/sec Chip/System 3.2 4.2 35 10.4 4.2 Gbits/sec Reg Size 1 4 10 73 1+ Kbits DCache Size 512 128 256 - 40 Kbits Table 3: Video Signal Pro cessors Bandwidth Hierarchies blending unit (if used, a separate background image is also input), and an output formatting stage.

Load more