Development of Full-HD Multi-Standard Video CODEC IP Based on Heterogeneous Multiprocessor Architecture
Total Page:16
File Type:pdf, Size:1020Kb
Development of Full-HD Multi-standard Video CODEC IP Based on Heterogeneous Multiprocessor Architecture H.Nakata1, K.Hosogi1, M.Ehama1, T.Yuasa1, T.Fujihira1 K.Iwata2, M.Kimura2, F.Izuhara2, S.Mochizuki2, M.Nobori2 1Embedded System Platform Laboratory Central Research Laboratory Hitachi, Ltd. 2System Design Div. System Solution Business Group Renesas Technology Corp. Agenda 1.Introduction 2.Multiprocessor Architecture for Video CODEC 3.Development Methodology 4.Implementation Results 5.Summary and Conclusions 2 1.Introduction 2.Multiprocessor Architecture for Video CODEC 3.Development Methodology 4.Implementation Results 5.Summary and Conclusions 3 Video codec trends Video codec standards are increasing… MPEG-1, MPEG-2, MPEG-4 H.263, H.264 (MPEG-4/AVC), VC-1, etc. Video resolution becomes high… Many consumer devices are supporting full-HD. Digital Video Blu-ray Digital Still Digital TV Mobile Phone Camera Recorder Camera 4 Our target for video CODEC Power Better Good solution efficiency Dedicated Our circuits for all of target multi-codec, low power, and high performance. Good solution DSP Good solution for low power for multi-codec. and high performance. But disadvantage General But inefficient in power. for multi-codec. processor Flexibility We tried to apply a heterogeneous multiprocessor architecture to a video CODEC for our target. 5 CODEC IP applicable to many purpose CODEC IP written in HDL Applicable to various LSI designs for LSI for for for DTV DVC Recorder Mobile Applications Digital Video Blu-ray Digital Still Digital TV Mobile Phone Camera Recorder Camera HDL: Hardware Description Language 6 1.Introduction 2.Multiprocessor Architecture for Video CODEC 3.Development Methodology 4.Implementation Results 5.Summary and Conclusions 7 Top level architecture • All modules are connected to SBUS • SBUS is structured with 2 unidirectional shift-register-based 64bit buses • The directions of the 2 buses are opposite to each other • Some of modules use original programmable processors CODEC IP Data can be transfer at same time VLCS CE0 VLCF TRF FME DEB CME PMD MEC CTRL System Bus STX SBUS (two shift-register-based ring buses) Global CBE CE1 VLCF TRF FME DEB CME PMD LMC DMAC Stream Pixel Domain Domain Processor-type circuits Dedicated circuits 8 Separate stream domain and pixel domain • Separate both domains by intermediate stream buffers Optimize performance for each domain Stream domain Pixel domain Optimized for stream processing Optimized for Macroblock (MB) processing CE 0 VLCS CE 1 CODEC Intermediate Image Video stream stream buffer 0 Buffer buffer Intermediate stream buffer 1 External Memory Note) This figure shows decode process. Data transfer directions are opposite for encode process. 9 Distribute to plural intermediate streams Pixel domain has 2 CEs which work in parallel VLCS has to distribute an intermediate stream to both CEs for decode process Intermediate stream buffer 0 1 a picture 3 1 2 3 m VLCS 4 Intermediate stream buffer 1 •Decode to syntax element 2 m level 4 n •Change intermediate stream on every end of MB line Macroblock n Note) This figure shows decode process. The data flow is opposite for encode process. 10 Stream domain operation cycle budgeting • Performance target: 40Mbps for full-HD @ 162MHz operation Corresponded to 40Mbps @ Full-HD 662 Corresponded to 162Mcycle @ Full-HD 600 595 10% margin 400 Assigned to coefficients [cycle/MB] Proportional 200 cycle budget Assigned to MB parameters (MB type, MV, etc.) Operation cycle budget Fixed cycle budget Assigned to MB initialization 0 50 100 150 Bit stream length [bits/MB] Reserve 100 fixed operation cycles per MB and assign 3 cycles/bit for bits in streams (This meets 40Mbps performance included 10% margin) 11 Intermediate stream compaction • Intermediate stream is compacted by simple coding method Example of EGFLC • Coded by EGFLC number 1. fixed length code (FLC) prefix suffix 2. FLC – exp. golomb combined 0 1 code (EGFLC) 1 0 10 • EGFLC is used for coefficients 2 0 1 1 Similar to and MVs. 3 00 1 00 exp-golomb 4 00 1 01 code 5 00 1 10 • Intermediate stream can be 6 00 1 11 encoded and decoded fast 7 00 0 000111 by simple logic 8 00 0 001000 FLC is used • Reduce size of intermediate … 00 0 xxxxxx as suffix buffer and bandwidth for intermediate data transfer • EGFLC is about 20% smaller than normal exp. golomb code in our case. 12 VLCS structure • Stream syntax is analyzed by our VLCS original 2way LIW processor, STX, Syntax analysis processor (STX) except some syntax elements CABAC accelerator (CBE) • Some dedicated circuits are available for performance (40Mbps@162MHz) CAVLC coefficient accelerator (COEF) • VSVLC decodes/encodes various VC-1 MV calculate accelerator (VCA) variable length code for stream I/O. VLCS variable length codec engine (VSVLC) Data Control Local DMAC (LDMAC) SBUS 13 Syntax analysis processor (STX) • Two 32bit instruction slots available STX instruction slot assignments 2 instruction slots used rate 32bit 32bit Stream Type Rate H.264 CAVLC 32% Inst. slot A Inst. slot B H.264 CABAC 38% • register data transfer • register data transfer • load/store • arithmetic operation MPEG-2 48% • stream I/O •branch MPEG-4 45% • accelerator control VC-1 46% • Use only internal instruction and data memories Logical • Data memory has logical address exchangeable area address exchanged Write Data mem Write Data mem next parameter work next parameter work area area parameter parameter STX STX area area 14 Pixel-domain operation cycle budgeting Required operation amount for MB is not so different Assign operation cycle budget for a macroblock Full-HD (1920×1080 30fps) video MB rate : 244,800 MB/s Target operation frequency : 162MHz Only 661 cycle is available for a MB processing pipeline stage Too strict for processor based operation (A MB has 384 pixels for luma & chroma) Assign 661×2 = 1,332 cycles by 2 parallel processing (1,200 cycle for actual operation, 132 cycle for margin) 15 Hierarchical parallel processing • Pixel domain uses hierarchical parallel processing technique 1. 2 MBs processed 2 codec elements (CEs) in parallel 2. Each MB is processed by “pipeline” technique: each module is assigned as an pipeline stage. 3. Parallel processing is executed in each module: processor type modules have some tiny processor elements. Pipeline Stage S0 S1 S2 S3 S4 S5 S6 CE0 VLCF MEC FME TRF DEB Decode LMC Parallel Process CE1 VLCF MEC FME TRF DEB processing LMC MEC CME FME TRF DEB CE0 PMD VLCF Encode LMC Parallel Process MEC CME FME TRF DEB processing Processor-type circuits CE1 PMD VLCF Dedicated circuits LMC 16 Pixel domain processor (Programmable Image Processing Element: PIPE) Instruction Memory (Shared by 3 CPUs) LD-CPU Media-CPU ST-CPU Instruction Program Instruction Program Instruction Program Decoder Counter Decoder Counter Decoder Counter ALU ALU Register Register Media Register File File ALU File LD/ST LD/ST unit unit Data Memory Local DMAC SBUS 17 PIPE based on MIAD architecture (MIAD: Multiple Instruction Arrayed Data) • LD-CPU, Media-CPU, and ST-CPU have own program counter Those CPUs synchronize each other by sync flags in operation code LD-CPU Media-CPU Time wait sync send sync wait sync wait sync stall state send sync wait sync send sync active state • Those CPUs take 2 dimensional arrayed data operands 64 bit operation sync src 1/2(/3) operation dest width height pitch code processing operation height width 18 PIPE extension PIPE instruction set is extended for each module •Major extensions are added to Media-CPU •Some data setup operation extensions are added to LD/ST-CPU Module name Major extensions (Main function) Media-CPU FME •2way LIW mode Instruction Program (Fine Motion •Fine motion Decoder Counter Estimation/ estimation/compensation Compensation) specific instructions Media TRF •2way LIW mode Register ALU (Transform •Transform and quantization File Media and specific instructions Quantization) ALU DEB •De-blocking specific (De-blocking instructions filter) Media-CPU with 2way LIW extension 19 Hybrid architecture CE works by combination of PIPEs and dedicated circuits • PIPE architecture is optimized for 2D arrayed pixel processing • Dedicated circuits used for the functions PIPE is inefficient for Modules implemented by dedicated circuits in pixel-domain Module Reasons to use Main functions name dedicated circuits VLCF • decode/encode intermediate stream • PIPE is inefficient • MV calculation PMD • intra prediction mode selection • logic size (used by H.264 encode process only) LMC • internal line buffer control • PIPE is inefficient CME • coarse motion estimation and •performance compensation MEC • frame buffer access control for CME • PIPE is inefficient operations 20 1.Introduction 2.Multiprocessor Architecture for Video CODEC 3.Development Methodology 4.Implementation Results 5.Summary and Conclusions 21 Design flow Coding Verification Basic architecture • Decide modules in top level decision (functions & interfaces) • Design C-language-based model C model corresponded to the modules design • Develop firmware for processors C model C model • Compare with reference code results debugging verification • Check performance roughly RTL • Design RTL corresponded to design the modules (refer C model for detail) • Check function using C model RTL • Check performance/coverage verification (EWS) RTL /assertions debugging RTL • Detail verification using many long verification (FPGA) streams 22 C-language-based model design All modules are connected to SBUS The SBUS traffic of C model is designed to be the same as RTL TRF, FME,