
MultiCore DSP vs GPUs Nissim Saban Multi Core Regional Application Manager GPU DSP Texas Instruments [email protected] MultiCore DSP vs GPUs • Today’s system developer’s needs Multi Core DSP • Multicore DSP Architechture • Shared GPU and DSP features • Differences highlighted • Summery GPU Today’s system developer’s needs • Processing power (floating point) – Systems that use GPU/DSP need high processing power , usually floating point operations. – Today’s DSP/GPU can provide many GFLOPS • Connectivity – Fast data movements (bandwidth) • GPU uses PCIe • DSP has PCIe, SRIO (20Gb/s),GMII(1Gb/s), HYPER LINK(50Gb/S) – Common Standards compliance (PCIe,SRIO,LAN,USB) • High level programming tools – Ease of use and reuse (OPENCL,OPENMP,CUDA……) – Easy Platform migration • Size • Price $ The next slides discuss other parameters as well Specialization (app oriented acceleration) • When a specific well defined task is needed usually the silicon contains HW accelerators • 2 examples follow in the next slides • The HW accelerator has API that the controlling DSP can use to offload it’s computations • The accelerators are equivalent to many additional GFLOPS • Typical accelerators: Video Compress/Decompress , Communication protocols , FFT , Encryption……….. App oriented acceleration - HD Video Transcode, Encode & Decode 1GHz - TMS320DM6467T Processor Features TMS320DM6467TZUT1 Digital Video Interfaces Core Capture • ARM926EJ-S™ (MPU) at 500 MHz HD VICP 0 2x BT.656 ARM DSP 1x BT.1120 • TMS320C64x+™ DSP Core at 1 GHz TCM RAM Subsystem Subsystem Display ME IPDE 2x BT.656 Memory ARM C64x+TM MC LF 1x BT.1120 • ARM: 16K I-Cache, 8K D-Cache, 32K TCM RAM, CALC ECD 8K Boot ROM 926EJ-S DSP Stream I/O • DSP: 32KB L1P Cache/SRAM, 32KB L1D Cache/SRAM, CPU Core HD VICP 1 128KB L2 Cache/SRAM, 64K Boot ROM 500 MHz 1 GHz Video Data Conversion TCM RAM Engine Video Encoder/Decoder Capabilities DownScaler Chroma MC LF HW Menu • Real-Time HD-HD Transcoding Up to 1080p Sampler Overlay • Multi-format (mf) HD to mf HD or mf SD CALC ECD • Up to 2 real time for HD-to-SD Transcoder • Real-time HD-HD transcoding for PVR Switched Central Resource (SCR) • Video Encode and Decode • HD 1080p30 H.264 BP encode • HD 1080p60 H.264 BP decode Peripherals System Connectivity • Dual HD 1080i60/p30 MPEG-2 MP@HL decoding • Simultaneous 720p30 H.264 BP encode USB PCI G- Timer EMAC WD and 1080p30 BP/HP decode PWM EDMA 2.0 VLYNQ • Simultaneous 720p60 H.264 BP encode and BP decode 2 Timer 2 With PHY* HPI MDIO Benefits Serial Interfaces Program/Data Storage • Scalable video engine building on high-performance C64x+ DDR2 Async media DSP, low-cost local controllers, and rich suite of multi- McASP McASP 2 UART format video accelerators I C SPI Controller EMIF/ ATA 4 ch 1 ch 3 (16b/32b) NAND Applications • Transcoding (HD-HD, HD-SD) HD-Video Conferencing, HD- IP Set-Top Boxes, Digital Media Adapters, Video Surveillance, Medical Imaging bak Nyquist (C6670) with accelerators • SoC for Baseband C6670 (Nyquist) Hyper – 4x Performance of Faraday Link Layer 1 Acceleration – RTOS, Linux and GCC • Fixed/Floating C66x™ CorePac Turbo Encoder Turbo Decoder – Eight C66x DSP cores @ > 1.0 GHz New C6x Viterbi Decoder FFT / DFT – 1 MB Local L2 core – >128 GMAC, >64 GFLOP L2 Memory / Cache, CQI, Reed TFCI – 2.0 MB shared memory 1MB Muller • Navigator Uplink CR Downlink CR – Make Multi-core easy – Packet Infrastructure Memory System Interference Cancellation b Multicore Memory • Wireless Acceleration for Layer 1 & Layer 2 44 66 hh h – Enable 2 cells on Single Chip Controller cc c 33 tt t - ii – Chip architecture tailor made for 3G/4G packet i R Shared Memory ww w data D Layer 2 Acceleration D SS 2 MB S • Network Processing Engine tt t ii – IP Fast Path in hardware i RoHC Air Ciphering bb b aa a rr • Low Power Consumption System Elements r QoS RLC / MAC ee – Adaptive Voltage Scaling e TT Power Mgmt SysMon T Various and many Power saving techniques • Hyperlink Debug EDMA Scheduler Fast Path – Expansion port – Transparent to Software Peripherals and I/O • Terabit Switch Network Processing Engine sRIO 4x AIF2 6x – Reduces bottlenecks and unleashes full IPsec / GTP-U Fast Path performance of the chip SGMII 2x PCIe 2x • Multicore Debugging GPIO, I2C UART, SPI Cryptographic IEEE 1588 Multi-core Infrastructure Navigator connectivity Power (cooling , reliability….) • The DSP is general purpose and intended to be used in embedded system also • Power is a major concern today to system designers and has many impacts on price , reliability • DSPs are available in industrial temperature range • Next slide shows a Power/Performance GPP/GPU/DSP comparison example GeForce GTX 580 Source : http://www.nvidia.com/object/product-geforce-gtx-580-us.html GeForce GTX 580 FFT benchmark comparison with GPP/GPU Platform Effective Time to Power Energy complete 1024 point (Watts) per FFT complex to complex (µJ) FFT (single precision), µs GPU: nVidia Tesla C2070 0.16 225 36 GPU: nVidia Tesla C1060 0.3 188 56.4 GPP: Intel Xeon Core Duo 1.8 95 171 @ 3 GHz1 GPP: Intel Nehalam Quad 1.2 130 156 Core @ 3.2 GHz1 DSP: TI C6678 @ 1.2 GHz1 0.86 10 8.6 1For GPP and DSP, we show the effective time of completion assuming that multiple FFTs are run in parallel to the multiple cores available • When performance is distributed over power – DSP >4x better than GPU – DSP >18x better than GPP GPP/GPU information sources: • http://www.microway.com/pdfs/TeslaC2050-Fermi-Performance.pdf • http://www.fftw.org/speed/CoreDuo-3.0GHz-icc/ Real time systems (HW connectivity ,controlled response timing…..) • The typical system we are discussing has very fast inputs and outputs • The system has to meet real time constraint as frame rate , communications , latency • DSP operating system and HW architecture enables real time reliable response • Following slides show typical connections of GPU and DSP in real world • The DSP is designed to connect easily to FPGA and other HW inputs/outputs GPU environment DSP environment Please note connectivity to FPGA and to external buses C6678 DSP Architecture • The next slide shows the world of embedded processing today 16,32 bits, Accelerators, Multicores • A slide with the internal architechture of the Multicore DSP follows • The DSP has 8 functional units that can run 8 instructions per cycle • In term of “CUDA cores” each DSP has 16 “cores” • In C6678 the HYPER LINK enables to connect DSPs in pairs so you can get 32 cores @1.2GHZ Embedded processing portfolio TI Embedded Processors Microcontrollers (MCUs) ARM®-Based Processors Digital Signal Processors (DSPs) 16-bit ultra- 32-bit 32-bit ARM Ultra ARM DSP Multi-core low power real-time Cortex™-M3 Low power Cortex-A8 & DSP+ARM DSP MCUs MCUs MCUs ARM9™ MPUs DSP ™ ™ C2000 ® Sitara™ C6000 Stellaris ™ ™ ™ ™ ™ ARM Cortex-A8 DaVinci C6000 C5000 MSP430 Delfino ARM Cortex-M3 ™ & ARM9 video processors Piccolo OMAP™ Up to 40MHz to Up to 375MHz to 300MHz to >1Ghz 320 GMACS Up to 300 MHz 25 MHz 300 MHz 100 MHz >1GHz +Accelerator +Accelerator Flash Flash, RAM Flash Cache, Cache Cache Up to 320KB RAM 1 KB to 256 KB 16 KB to 512 KB 8 KB to 256 KB RAM, ROM RAM, ROM RAM, ROM Up to 128KB ROM Analog I/O, ADC PWM, ADC, USB, ENET USB, CAN, SATA, USB, ENET, SRIO, EMAC USB, ADC LCD, USB, RF 2 MAC+PHY CAN, SPI, PCIe, EMAC PCIe, SATA, SPI DMA, PCIe McBSP, SPI, I2C CAN, SPI, I C ADC, PWM, SPI Measurement, Motor Control, Connectivity,Security, Industrial automation, Floating/Fixed Point Telecom test & meas, Audio, Voice Sensing, General Digital Power, Motion Control, HMI, POS & portable Video, Audio, Voice, media gateways, Purpose Lighting, Ren. Energy Industrial Automation data terminals Security, Conferencing base stations Medical, Biometrics $0.25 to $9.00 $1.50 to $20.00 $1.00 to $8.00 $5.00 to $25.00 $5.00 to $200.00 $40.00 to $200.00 $3.00 to $10.00 Software & Dev. Tools MPUs – Microprocessors TMX320C6678/4/2 processors are ideal for Applications such as Key Advantages • Mission Critical • Military • Support for fixed & floating point • Avionics processing • Public Safety • Focused security & network • Medical Imaging coprocessors • Ultrasound • Multiple high speed peripherals • Endoscopy • Enhanced memory architecture • MRI / CT Scan with large on chip memory • Emerging modalities • New multicore navigator for fast • Test & Automation data transfers • Wafer/LCD inspection • Niche printing/scanning • Switch fabric with 2 terabits of • Emerging Video bandwidth • New multicore focused debug and performance profiling capabilities back Multicore High Performance DSP C66x Core Multicore Navigator Next generation Fixed / Floating-Point Data transfer engine that is architected to move DSP core with clock speeds ranging data between various system elements without using any CPU overhead so maximum system from 1GHz – 1.25GHz and Up to 8 efficiency is achieved core options a b Network Co- Processor and Accelerators A cost effective implementation to Memory Architecture b off-load the wireless and secure Up to 8MB of combined memory /%t '' / /%t '' networking functions from the DSP with an enhanced memory architecture that allows fast on- and [ / w!a Y. tttt off-chip memory access including a eeee a { NNNN aaaa rrrr !!!!!! a a " # !!!!!! a a " # { 9 DDR3-1600MHz (64-bit) interface eeee /$ /$ TTTT %%% t %%% t {#' a $ www {#' a $ (8GB of addressable space). www {+ a " # 555 {+ a " # a " 555 a " TeraNet 555 555 a . 5! 95a ! Switch fabric that has 2 Terabits D 9 { of bandwidth which allows t Lh maximum data transfer between {" w0 system components to realize full
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages22 Page
-
File Size-