A Platform for Building Programmable Multicore Video Processors

mediaDSP™: A Platform for Building Programmable Multicore Video Processors Richard Selvaggi and Larry Pearlstein AMD Hot Chips 20, August 2008 Recent Trends Big-screen TV sets have been selling in ever-increasing numbers in the past 5 to 10 years. LCDs have recently overtaken CRTs in unit shipments. LCD display technology has advanced in many ways over these years. – Wider viewing angle – Higher contrast ratio – Wider color gamut – Higher resolution But...motion blur is one of the last remaining challenges for LCD display technology. – This is the problem our chip is addressing. 2 August 2008 Hot Chips 20: mediaDSP Perception of Motion Blur on LCD Panels 60 frames/sec 120 frames/sec 16ms 16ms 8ms Reduced blur Reduced Interpolated frames Eye moves continuously, smearing Eye moves a shorter arc in 8ms, thus the still image across the retina reducing amount of blur across retina LCD is a hold-type display (not impulsive, like CRT). Projected object is smeared on the retina by the amount that the eye moves in one frame-time (16 vs. 8ms). 3 August 2008 Hot Chips 20: mediaDSP TV System Block Diagram TV System (FullHD, 1080p) TV Main Board Panel Board 120Hz 60Hz AMD DTV SoC Xilleon™ 420 120Hz LCD Panel DRAM DRAM AMD Xilleon 420 (X420) performs motion compensated frame rate conversion (MC-FRC) and doubles the frame rate. Samsung Timing Controller (TCON) for LCD panel is integrated inside X420. 4 August 2008 Hot Chips 20: mediaDSP AMD Xilleon 420 Block Diagram AMD Xilleon™ 420 Timing mini-LVDS LVDS LVDS mini-LVDS Controller Pixel Pixel Input Receiver Transmitter (TCON) Output I2C Slave Host Controller Interface Unit Stream Manager I2C Master Interface mediaDSP mediaDSP Peripheral Core #1 Core #2 GPIO Controller Unit EEPROM Interface Memory Controller SDRAM Interface 5 August 2008 Hot Chips 20: mediaDSP mediaDSP Introduction mediaDSP is a flexible media processing platform – Well suited to address a range of media applications, with emphasis on video processing – Enables guaranteed real-time performance Has been used in two product families – MPEG A/V encoding – Motion compensated frame rate conversion (MC-FRC) Characteristics of MC-FRC: – Is an ill-posed problem – no single, optimal solution; customer preferences vary – Extreme computational load Programmable multicore approach is attractive 6 August 2008 Hot Chips 20: mediaDSP Classes of Video Processing Operations Highly parallelizable operations (linear and nonlinear) Data can be pixels, motion vectors, picture statistics, etc. – Programmable SIMD (single instruction, multiple data), or specialized engine Ad-hoc computation and decision-making – General-purpose processor Data movement/formatting on multiple planes – Multi-dimensional DMA engine Bit-serial processing (entropy encode/decode) – Programmable bit-stream processor Leads to a multicore approach with application-specific processing elements. 7 August 2008 Hot Chips 20: mediaDSP Capture of Video Algorithms Task A Task D Task E Task F Task G Task H Task I Task J Task B Task C Task L Task N Task O Task K Task M Task P Task Q Task R •Video and image processing algorithm developers are accustomed to working with task flow graphs. •mediaDSP architecture is driven by this type of programming model. •mediaDSP accelerates a flexible pipeline of video tasks. 8 August 2008 Hot Chips 20: mediaDSP Example Task Flow Graph: MPEG-2 Video Macroblock Encoder (Unmapped) Read from FB Coarse motion Read half-pel Half-pel search to refresh Address reference data motion search Address Motion +/-128H x +/-64V Cylindrical 24 x 10 x 4 lum compensated effective range calc -4/+3.5H x +/-1V calc buffer prediction (2 macroblocks) 32 x 6 x 4 chr search range (4 macroblocks) Read Average downsampled predictions current MB (4 macroblocks) Only for bidirectional macroblocks Not done in B frames Subtract Dequantize Write result to FB Read prediction IDCT add 16 x 16 lum current MB DCT prediction 16 x 4 lum quantize downsample 16 x 8 chr Not done in I Previous macroblocks macroblock rate control Write info CBP macroblock skip DC prediction header detect DC differential Rate control to stream coefficient scan end of VL encode 1-6 macroblock blocks of coeffs 9 August 2008 Hot Chips 20: mediaDSP Example Task Flow Graph: MPEG-2 Video Macroblock Encoder (Mapped) EMDMA MEE EMDMA CRUNCH RISC RISC CRUNCH Read from FB Coarse motion Read half-pel Half-pel search to refresh Address reference data motion search Address Motion Cylindrical +/-128H x +/-64V compensated effective range calc 24 x 10 x 4 lum -4/+3.5H x +/-1V calc buffer prediction (2 macroblocks) 32 x 6 x 4 chr search range (4 macroblocks) CRUNCH EMDMA Read Average downsampled predictions current MB (4 macroblocks) Only for bidirectional macroblocks CRUNCH EMDMA CRUNCH EMDMA Subtract Dequantize Write result to FB Read prediction IDCT add 16 x 16 lum current MB DCT prediction 16 x 4 lum quantize downsample 16 x 8 chr RISC Not done in I Not done in B frames Previous macroblocks macroblock RISC rate control RISC VLCE RISC info Write CBP macroblock DC prediction skip header DC differential Rate control detect to stream coefficient scan end of VL encode 1-6 macroblock blocks of coeffs 10 August 2008 Hot Chips 20: mediaDSP A Task-Oriented Approach Task: A well defined piece of work that must be done. A task has a definite initiation time, and continues uninterrupted until it halts. Task Oriented Engine (TOE): A processing engine, either programmable or fixed function, that executes a task, and then halts. Task Control Unit (TCU): A specialized control unit that resides “in front of” a TOE, closely coupled with the TOE. – Queues tasks which are to be issued to the TOE – Synchronizes with other TCUs and Control Engines in the system Control Engine: General-purpose RISC processor. – Orchestrates the tasks for maximum parallelism – Functions as a TOE for ad hoc tasks Shared Memory: A wide on-chip memory that is accessible to all of the TOEs and other elements in the mediaDSP. – Holds inter-task data buffers Communication Fabric: The interconnect network among the TOEs and other elements of the mediaDSP. 11 August 2008 Hot Chips 20: mediaDSP Generic mediaDSP Topology mediaDSP Task-Oriented Task-Oriented Task-Oriented … Engine #1 Engine #2 Engine #N Task Control Unit Task Control Unit Task Control Unit Task Control Unit Communication Fabric Task Control Unit Shared Memory External Memory Control Engine DMA Engine Register Slave Interface Unit Chip-Level Chip-Level Memory Backbone Register Backbone 12 August 2008 Hot Chips 20: mediaDSP X420 mediaDSP Topology mediaDSP FRC- FRC- FRC- FRC- FRC- FRC- Crunch Specific Specific Specific Specific Specific Specific Engines Engine #1 Engine #2 Engine #3 Engine #4 Engine #5 Engine #6 #(1-10) TCU TCU TCU TCU TCU TCU TCU Communication Fabric TCU TCU Register Control Control Control Control Shared External Slave Stream Engine Engine Engine Engine Memory Memory Interface Capture #1 #2 #3 #4 DMA Unit Engine Engine Chip-Level Stream Manager Chip-Level Register Backbone Chip-Level Memory Backbone 13 August 2008 Hot Chips 20: mediaDSP Re-Use, Extending, Scaling SD SD HD FHD Module/TOE Enc1 Enc2 FRC FRC Infra Control Engine 2 2 4 8 Task Control Unit 5Re-Use 6 16 36 Advanced DMA Engine 3 3 30 64 Crunch Engine (SIMD DSP) 1 1 9 20 Variable Length CODEC Engine 1 1 Scaling Motion Estimation Engine 1 1 Extending Sub-pixel Motion Estimation Engine 1 TOE’s FRC-Specific Engine #1 1 2 FRC-Specific Engine #2 1 2 FRC-Specific Engine #3 1 2 FRC-Specific Engine #4 1 2 FRC-Specific Engine #5 1 2 FRC-Specific Engine #6 1 2 Stream Capture Engine 2 14 August 2008 Hot Chips 20: mediaDSP Template for TOE TOE heavily leverages mediaDSP Processing Engine infrastructure technology from Task Oriented Engine the Task-Orientedplatform: Task-Oriented Task-Oriented … Engine #1 Engine #2 Engine #N – Re-use of DMA Engine, TCU and Power Manager Processor Task Control Unit Task Control Unit Task ControlCore Unit Task Control Unit – Auto-generated routers/arbiters Local – Allows heterogeneous TOEs to Backbone Communication Fabric Manager have similar „look-and-feel‟ DMA (API) to the programmer. Engine(s) Task Control Unit Building a new TOE amounts to Arbiter justShared designing Memory the Externalprocessor Memory Control Engine core. DMA Engine Register Slave32 Interface Unit Power 128 TCU Manager E.g., 3D Median Filter Engine 32 32 took less than 2 months from Chip-Level = re-used building block module concept to RTL complete Chip-Level Memory Backbone = auto-generatedRegister module Backbone 15 August 2008 Hot Chips 20: mediaDSP Task Control Unit Processing Engine Request to Command FIFO decouples Control TOE Engine writesTask from Oriented the activityEngine of Task Control Unit the TOE. FIFO holds: Task Processor Command – Register writesCore to TOE Processor (task-setup commands) Local – Commands to the TCP Backbone Manager TCU Registers Command (semaphores) FIFO DMA Task CommandEngine(s) Processor off- loads low-level task management from the Control Engine: Arbiter – Pass-through of reg writes to TOE 32 – LocalPower status monitoring128 of TOETCU Request from Request to Manager Backbone Backbone – Wait for a local semaphore 32 32 – Set a remote= re-used semaphore building block module = auto-generated module 16 August 2008 Hot Chips 20: mediaDSP TCUs In Action Crunch Crunch Task Flow Graph: Engine 1 Engine 2 DMA In Semaphores Semaphores Number Crunch DMA Out TCU TCU Communication Fabric TCU Wait2Poll1 ControlSend1Task3 Shared Task2Task1Wait1 EnginePoll2 Memory Send2 Semaphores EMDMA Core 17 August 2008 Hot Chips 20: mediaDSP Memory Latency Management mediaDSP Task-Oriented Engine 1 Task-Oriented Engine 2 Task-Oriented Engine 3 Control Engine Processor Core Processor Core Processor Core RISC Processor Level 0 Memory Level 1 Memory Level 1 Memory Level 1 Memory (Registers) (I/D Store, Registers) (I/D Store, Registers) (I/D Store, Registers) DMA DMA DMA Level 1 Memory Engine Engine Engine (I/D Caches) Level 2 Memory (SRAM “Shared Memory”) …other clients… DMA Engine Level 3 Memory (SDRAM “Frame Buffer”) • Multi-level memory hierarchy with DMA‟s among the levels.

Load more