Spursenginetm a High-Performance Stream Processor Derived from Cell/B.E.TM for Media Processing Acceleration
Total Page:16
File Type:pdf, Size:1020Kb
SpursEngineTM A High-performance Stream Processor Derived from Cell/B.E.TM for Media Processing Acceleration Hiroo Hayashi Toshiba Corporation Semiconductor Company Advanced SoC Development Center SpursEngine and the logo are trademarks of Toshiba Corporation in Japan, the United States and other countries. Cell Broadband Engine and Cell/B.E. are trademarks of Sony Computer Entertainment Incorporated. Copyright 2008, Toshiba Corporation. Outline • Background • SpursEngineTM Architecture Overview • Implementation • Programming Model • Application and Performance • Conclusion Copyright 2008, Toshiba Corporation 2 Outline • Background • SpursEngineTM Architecture Overview • Implementation • Programming Model • Application and Performance • Conclusion Copyright 2008, Toshiba Corporation 3 What is the major target for PC? Answer is HD. • Emergence of HD Contents – High Definition (HD) era has been quickly emerging. For instance, video content from digital terrestrial broadcasting, digital video cameras, and optical disks are all HD ready. • HD contents require much more processing power than SD contents. – HD needs 6 times higher bandwidth than Standard Definition (SD). Conventional PC architecture of CPU and GPU can only decode HD video in real-time. – CPU, even though it keeps getting faster, will not be capable of real-time encoding HD video in the near future. Digital/Analog Terrestrial Broadcasting HD HD Importing Burning Conventional PC Sharing Playback (Decode) Streaming Break Through Scaling Users demand an innovative HD solution! Authoring Transcoding Editing Copyright 2008, Toshiba Corporation 4 Future Demand: Video Indexing and Searching Next generation HD video solutions require intelligent software capabilities. -Advanced and Flexible algorithms for RMS (Recognition, Mining, Synthesis) are hard to be provided by H/W solutions. Live Broadcasting PC Digital/Analog Terrestrial Broadcasting Short movies using DSC will be much Home Videos more popular and casual in the near future. Most videos are un-tagged raw data. Video Archive Difficult to Internet find the video they want. http://www.youtube.com/ DEMAND: indexing and searching features Copyright 2008, Toshiba Corporation 5 Characteristics of Video Data • Video data – is large. – can be efficiently compressed. – is stored and distributed in encoded format. – must be decoded before being used (processed). – requires huge processing power. • Searching and matching is not easy. Combination of encode/decode and huge image processing power is crucial. Copyright 2008, Toshiba Corporation 6 Outline • Background • SpursEngineTM Architecture Overview • Implementation • Programming Model • Application and Performance • Conclusion Copyright 2008, Toshiba Corporation 7 SpursEngine™ Architecture Overview • Media processing accelerator derived from Cell Broadband EngineTM (Cell/B.E.TM) – 4 SPEs in SpursEngineTM are fully compatible with those of Cell/B.ETM. • Supports Full-HD video contents efficiently – Hardware CODECs of standard formats are embedded in the same chip. Media Processors SPE SPE SPE SPE Hardware CODECs MPEG-2 MPEG-2 H.264 H.264 DEC ENC DEC ENC Interfaces XIO PCI Express SPE : Synergistic Processing Element XIO is trademark of Rambus Inc. in the United States and other countries. Copyright 2008, Toshiba Corporation 8 Synergistic Processor Element (SPE) SPE • A simple SIMD processor core derived from Cell/B.E.TM. • Architecture focus on data processing RF SPU – RISC Type ISA (Instruction Set Architecture) • High Level Programming Language LS – SIMD Architecture 128bit x 128 Large Register File – MFC-DMAC – 256KB Local Storage TM • predictable latency Cell/B.E. PPE SPE SPE SPE SPE *) – no cache miss MICXDR BIC SPU SPU SPU SPU PPU DRAM – no address translation miss LS LS LS LS L2 MFC MFC MFC MFC • wide data path Cache – MFC : high performance DMA engine Element Interconnect Bus (EIB) MFC MFC MFC MFC Cell/GPU – High compute density etc. LS LS LS LS PPE:Power Processor Element • Power-efficient SPE: Synergistic Processor Element SPU SPU SPU SPU Super MFC: Memory Flow Controller Companion Chip • Area-efficient LS : Local Storage SPE SPE SPE SPE XDR™ DRAM is trademark of Rambus Inc. in the United States and other countries. I/O Copyright 2008, Toshiba Corporation 9 Tightly Coupled for High Bandwidth • Video applications require much higher bandwidth than conventional applications. • Processor elements, encoders/decoders, and high bandwidth local memories should be tightly coupled together to guarantee high data transfer bandwidth for HD video processing architecture. Tightly coupled in a single chip Decoding Encoding SPE SPE MPEG-2 MPEG-2 PCIe bus PCIe bus DEC ENC 200+ Software 10Mbps processing 10Mbps H.264 MB/s H.264 MPEG2, H.264 for raw data MPEG2, H.264 DEC ENC 10-20Mbps 10-20Mbps SPE SPE Up to 12.8GB/s HD Raw Data (YUV/RGB) XDR™ SPE: Synergistic Processor Element 200+MB/s DRAM XDR™ DRAM is trademark of Rambus Inc. in the United States and other countries. Copyright 2008, Toshiba Corporation 10 Backend Processor for No Interference in PC bus • The stream processor should be located in backend to prevent its interfering in PC system bus with CPU and GPU. Main video playback CPU data stream GPU Frontend System MCH Memory No interference Digital/Analog Tuner In Terrestrial Broadcasting ICH Stream PCIe Interne Processor t video transcoding Backend data stream Local Memory Copyright 2008, Toshiba Corporation 11 Outline • Background • SpursEngineTM Architecture Overview • Implementation • Programming Model • Application and Performance • Conclusion Copyright 2008, Toshiba Corporation 12 SpursEngineTM Block Diagram SPE SPE SPE SPE Advanced Applications 1.5Ghz 1.5GHz 1.5GHz 1.5GHz by Quad GHz Core EIB (Element Interconnect Bus) Full-HD DMAC Host SCP MPEG-2 H.264 MPEG-2 H.264 Video CPU PCI-ex (cont Mem Encoder Encoder Decoder Decoder Codec I/F roller) Cont. HW-IP HW-IP HW-IP HW-IP by HW-IP PCI-Express Video CODECs Max 4 lanes XDR DRAM MPEG-2: MP@HL support In : 1GB/s XDR DRAM H.264: High Profile@Level 4.1 support Out : 1GB/s 6.4 GB/s (1 pcs XDR) max. 1920x1080 60i/24p, 1280x720 60p ~ 12.8 GB/s (2 pcs XDR) XDR™ DRAM is trademark of Rambus Inc. in the United States and other countries. Copyright 2008, Toshiba Corporation 13 Specification Overview • SPE : Synergistic Processor Element • SCP : Control Processor – clock frequency: 1.5GHz – Toshiba original core – number of SPEs: 4 – 4K I$, 16K D$, 24KB Instruction Memory – clock frequency: 300MHz • H.264 Encoder – 4ch DMA – High Profile@Level 4.1 support – max. 1920x1080 60i/24p, 1280x720 60p • PCI Express Controller – clock frequency: 150MHz – PCE Express Base Specification 1.1 • H.264 Decoder – endpoint – High Profile@Level 4.1 support – x4, x2, x1 link support – max. 1920x1080 60i/24p, 1280x720 60p – clock frequency: 300MHz • Memory Controller – 32bit or 16bit x 1ch • MPEG2 Encoder – memory capacity: 64MB – 1GB – MP@HL support – bandwidth 12.8GB/s @32bit data width – max. 1920x1080 60i/24p, 1280x720 60p – clock frequency: 150MHz • Power Management Function • MPEG2 Decoder – power state (D0, D1, D3hot, D3cold) – MP@HL support – clock gating – max. 1920x1080 60i/24p, 1280x720 60p – clock frequency control – clock frequency: 300MHz Copyright 2008, Toshiba Corporation 14 SpursEngineTM Physical Implementation • Process: – 65nm bulk CMOS with 7 levels of copper layers Die Size: PLL PCIe • PHY – 9.98mm x 10.31mm, 102.89mm2 LS LS H.264 PCIe MPEG2 Decode • Fmax: 1.5GHz Decode H.264 Decode • Transistor Counts: SPE SPE SCP MPEG2 – 239.1M Encode • Logic: 134.3M EIB • SRAM: 104.8M Shared Memory • TDP SPE SPE – < 20W (depending on application sets) Mem. • Package: Cont. H.264 LS LS Encode – FC BGA 624 eFuse XIO* * XIO™ is trademarks of Rambus Inc. in the United States and other countries. Copyright 2008, Toshiba Corporation 15 Backend Design Feature: Synthesizable 1.5GHz SPE • Synthesizable design – shorter design TAT and less manpower (estimation: 1/4 compared with custom migration) • Key Feature – floor plan optimization – hybrid standard cell height – RTL abstraction – register retiming – semi-custom clock tree – wire width control Copyright 2008, Toshiba Corporation 16 Floor Plan Optimization Original Floor Plan (Cell/B.E.TM) New Floor Plan Cell/B.E.TM (65nm) 2.09mm x 5.30mm SpursEngineTM (65nm) 2.07mm x 3.93mm L/S Square & Compact Data Path L/S Cont. logic • more cell placement flexibility • area: – 27% smaller • total wire length: MFC – 28% shorter L/S : Local Storage MFC : Memory Flow Controller Copyright 2008, Toshiba Corporation 17 Hybrid Standard Cell Height Implementation • 12 grid height and 9 grid height standard cell libraries are used to minimize delay, area, and power. – 12 grid height is used for 1.5GHz design block (SPEs and EIB) – 9 grid height is used for other clock domains Frequency & Area Estimation in Monolithic SC Height Implementations frequency (GHz) area (mm2) 9 grid & 12 grid hybrid 1.5 102.89 9 grid only (estimated) 1.25 99 +7mm2 12 grid only (estimated) 1.5 110 1.25 1.05 1.05 1.2 1 1.15 1 1.1 0.95 0.95 1.05 1 0.9 [arb.] power 0.9 0.95 delay x area [arb.] area x delay worst path delay [arb.] delay path worst 0.9 0.85 0.85 8 10 12 14 16 8 10 12 14 16 8 10 12 14 16 cell height [grid] cell height [grid] cell height [grid] Delay Delay x Area Power *) *) under the condition where path delay is the same with power supply voltage control Copyright 2008, Toshiba Corporation 18 Outline • Background • SpursEngineTM Architecture Overview • Implementation • Programming Model • Application and Performance • Conclusion Copyright 2008, Toshiba Corporation 19 SpursEngineTM Software Programming Environment • User program development Legend – Two level programming interface on basic software and MW API User – User original SPE code can be developed. Program • Standard OS API can be encapsulated. Standard – ex. standard CODEC API Software User Application SPE MW API Transfer User Library Std. Library & Execute SPE Code SPE Code SPE Code SPE Code System Software API System Software API System Software System Software Host System Copyright 2008, Toshiba Corporation 20 Common SPU Runtime • Cell/B.E.TM Common SPU Runtime – offers common API to various Cell/B.E.TM systems.