Multi Core Dsps Vs Gpus TI for Distribution

MultiCore DSP vs GPUs Nissim Saban Multi Core Regional Application Manager GPU DSP Texas Instruments [email protected] MultiCore DSP vs GPUs • Today’s system developer’s needs Multi Core DSP • Multicore DSP Architechture • Shared GPU and DSP features • Differences highlighted • Summery GPU Today’s system developer’s needs • Processing power (floating point) – Systems that use GPU/DSP need high processing power , usually floating point operations. – Today’s DSP/GPU can provide many GFLOPS • Connectivity – Fast data movements (bandwidth) • GPU uses PCIe • DSP has PCIe, SRIO (20Gb/s),GMII(1Gb/s), HYPER LINK(50Gb/S) – Common Standards compliance (PCIe,SRIO,LAN,USB) • High level programming tools – Ease of use and reuse (OPENCL,OPENMP,CUDA……) – Easy Platform migration • Size • Price $ The next slides discuss other parameters as well Specialization (app oriented acceleration) • When a specific well defined task is needed usually the silicon contains HW accelerators • 2 examples follow in the next slides • The HW accelerator has API that the controlling DSP can use to offload it’s computations • The accelerators are equivalent to many additional GFLOPS • Typical accelerators: Video Compress/Decompress , Communication protocols , FFT , Encryption……….. App oriented acceleration - HD Video Transcode, Encode & Decode 1GHz - TMS320DM6467T Processor Features TMS320DM6467TZUT1 Digital Video Interfaces Core Capture • ARM926EJ-S™ (MPU) at 500 MHz HD VICP 0 2x BT.656 ARM DSP 1x BT.1120 • TMS320C64x+™ DSP Core at 1 GHz TCM RAM Subsystem Subsystem Display ME IPDE 2x BT.656 Memory ARM C64x+TM MC LF 1x BT.1120 • ARM: 16K I-Cache, 8K D-Cache, 32K TCM RAM, CALC ECD 8K Boot ROM 926EJ-S DSP Stream I/O • DSP: 32KB L1P Cache/SRAM, 32KB L1D Cache/SRAM, CPU Core HD VICP 1 128KB L2 Cache/SRAM, 64K Boot ROM 500 MHz 1 GHz Video Data Conversion TCM RAM Engine Video Encoder/Decoder Capabilities DownScaler Chroma MC LF HW Menu • Real-Time HD-HD Transcoding Up to 1080p Sampler Overlay • Multi-format (mf) HD to mf HD or mf SD CALC ECD • Up to 2 real time for HD-to-SD Transcoder • Real-time HD-HD transcoding for PVR Switched Central Resource (SCR) • Video Encode and Decode • HD 1080p30 H.264 BP encode • HD 1080p60 H.264 BP decode Peripherals System Connectivity • Dual HD 1080i60/p30 MPEG-2 MP@HL decoding • Simultaneous 720p30 H.264 BP encode USB PCI G- Timer EMAC WD and 1080p30 BP/HP decode PWM EDMA 2.0 VLYNQ • Simultaneous 720p60 H.264 BP encode and BP decode 2 Timer 2 With PHY* HPI MDIO Benefits Serial Interfaces Program/Data Storage • Scalable video engine building on high-performance C64x+ DDR2 Async media DSP, low-cost local controllers, and rich suite of multi- McASP McASP 2 UART format video accelerators I C SPI Controller EMIF/ ATA 4 ch 1 ch 3 (16b/32b) NAND Applications • Transcoding (HD-HD, HD-SD) HD-Video Conferencing, HD- IP Set-Top Boxes, Digital Media Adapters, Video Surveillance, Medical Imaging bak Nyquist (C6670) with accelerators • SoC for Baseband C6670 (Nyquist) Hyper – 4x Performance of Faraday Link Layer 1 Acceleration – RTOS, Linux and GCC • Fixed/Floating C66x™ CorePac Turbo Encoder Turbo Decoder – Eight C66x DSP cores @ > 1.0 GHz New C6x Viterbi Decoder FFT / DFT – 1 MB Local L2 core – >128 GMAC, >64 GFLOP L2 Memory / Cache, CQI, Reed TFCI – 2.0 MB shared memory 1MB Muller • Navigator Uplink CR Downlink CR – Make Multi-core easy – Packet Infrastructure Memory System Interference Cancellation b Multicore Memory • Wireless Acceleration for Layer 1 & Layer 2 44 66 hh h – Enable 2 cells on Single Chip Controller cc c 33 tt t - ii – Chip architecture tailor made for 3G/4G packet i R Shared Memory ww w data D Layer 2 Acceleration D SS 2 MB S • Network Processing Engine tt t ii – IP Fast Path in hardware i RoHC Air Ciphering bb b aa a rr • Low Power Consumption System Elements r QoS RLC / MAC ee – Adaptive Voltage Scaling e TT Power Mgmt SysMon T Various and many Power saving techniques • Hyperlink Debug EDMA Scheduler Fast Path – Expansion port – Transparent to Software Peripherals and I/O • Terabit Switch Network Processing Engine sRIO 4x AIF2 6x – Reduces bottlenecks and unleashes full IPsec / GTP-U Fast Path performance of the chip SGMII 2x PCIe 2x • Multicore Debugging GPIO, I2C UART, SPI Cryptographic IEEE 1588 Multi-core Infrastructure Navigator connectivity Power (cooling , reliability….) • The DSP is general purpose and intended to be used in embedded system also • Power is a major concern today to system designers and has many impacts on price , reliability • DSPs are available in industrial temperature range • Next slide shows a Power/Performance GPP/GPU/DSP comparison example GeForce GTX 580 Source : http://www.nvidia.com/object/product-geforce-gtx-580-us.html GeForce GTX 580 FFT benchmark comparison with GPP/GPU Platform Effective Time to Power Energy complete 1024 point (Watts) per FFT complex to complex (µJ) FFT (single precision), µs GPU: nVidia Tesla C2070 0.16 225 36 GPU: nVidia Tesla C1060 0.3 188 56.4 GPP: Intel Xeon Core Duo 1.8 95 171 @ 3 GHz1 GPP: Intel Nehalam Quad 1.2 130 156 Core @ 3.2 GHz1 DSP: TI C6678 @ 1.2 GHz1 0.86 10 8.6 1For GPP and DSP, we show the effective time of completion assuming that multiple FFTs are run in parallel to the multiple cores available • When performance is distributed over power – DSP >4x better than GPU – DSP >18x better than GPP GPP/GPU information sources: • http://www.microway.com/pdfs/TeslaC2050-Fermi-Performance.pdf • http://www.fftw.org/speed/CoreDuo-3.0GHz-icc/ Real time systems (HW connectivity ,controlled response timing…..) • The typical system we are discussing has very fast inputs and outputs • The system has to meet real time constraint as frame rate , communications , latency • DSP operating system and HW architecture enables real time reliable response • Following slides show typical connections of GPU and DSP in real world • The DSP is designed to connect easily to FPGA and other HW inputs/outputs GPU environment DSP environment Please note connectivity to FPGA and to external buses C6678 DSP Architecture • The next slide shows the world of embedded processing today 16,32 bits, Accelerators, Multicores • A slide with the internal architechture of the Multicore DSP follows • The DSP has 8 functional units that can run 8 instructions per cycle • In term of “CUDA cores” each DSP has 16 “cores” • In C6678 the HYPER LINK enables to connect DSPs in pairs so you can get 32 cores @1.2GHZ Embedded processing portfolio TI Embedded Processors Microcontrollers (MCUs) ARM®-Based Processors Digital Signal Processors (DSPs) 16-bit ultra- 32-bit 32-bit ARM Ultra ARM DSP Multi-core low power real-time Cortex™-M3 Low power Cortex-A8 & DSP+ARM DSP MCUs MCUs MCUs ARM9™ MPUs DSP ™ ™ C2000 ® Sitara™ C6000 Stellaris ™ ™ ™ ™ ™ ARM Cortex-A8 DaVinci C6000 C5000 MSP430 Delfino ARM Cortex-M3 ™ & ARM9 video processors Piccolo OMAP™ Up to 40MHz to Up to 375MHz to 300MHz to >1Ghz 320 GMACS Up to 300 MHz 25 MHz 300 MHz 100 MHz >1GHz +Accelerator +Accelerator Flash Flash, RAM Flash Cache, Cache Cache Up to 320KB RAM 1 KB to 256 KB 16 KB to 512 KB 8 KB to 256 KB RAM, ROM RAM, ROM RAM, ROM Up to 128KB ROM Analog I/O, ADC PWM, ADC, USB, ENET USB, CAN, SATA, USB, ENET, SRIO, EMAC USB, ADC LCD, USB, RF 2 MAC+PHY CAN, SPI, PCIe, EMAC PCIe, SATA, SPI DMA, PCIe McBSP, SPI, I2C CAN, SPI, I C ADC, PWM, SPI Measurement, Motor Control, Connectivity,Security, Industrial automation, Floating/Fixed Point Telecom test & meas, Audio, Voice Sensing, General Digital Power, Motion Control, HMI, POS & portable Video, Audio, Voice, media gateways, Purpose Lighting, Ren. Energy Industrial Automation data terminals Security, Conferencing base stations Medical, Biometrics $0.25 to $9.00 $1.50 to $20.00 $1.00 to $8.00 $5.00 to $25.00 $5.00 to $200.00 $40.00 to $200.00 $3.00 to $10.00 Software & Dev. Tools MPUs – Microprocessors TMX320C6678/4/2 processors are ideal for Applications such as Key Advantages • Mission Critical • Military • Support for fixed & floating point • Avionics processing • Public Safety • Focused security & network • Medical Imaging coprocessors • Ultrasound • Multiple high speed peripherals • Endoscopy • Enhanced memory architecture • MRI / CT Scan with large on chip memory • Emerging modalities • New multicore navigator for fast • Test & Automation data transfers • Wafer/LCD inspection • Niche printing/scanning • Switch fabric with 2 terabits of • Emerging Video bandwidth • New multicore focused debug and performance profiling capabilities back Multicore High Performance DSP C66x Core Multicore Navigator Next generation Fixed / Floating-Point Data transfer engine that is architected to move DSP core with clock speeds ranging data between various system elements without using any CPU overhead so maximum system from 1GHz – 1.25GHz and Up to 8 efficiency is achieved core options a b Network Co- Processor and Accelerators A cost effective implementation to Memory Architecture b off-load the wireless and secure Up to 8MB of combined memory /%t '' / /%t '' networking functions from the DSP with an enhanced memory architecture that allows fast on- and [ / w!a Y. tttt off-chip memory access including a eeee a { NNNN aaaa rrrr !!!!!! a a " # !!!!!! a a " # { 9 DDR3-1600MHz (64-bit) interface eeee /$ /$ TTTT %%% t %%% t {#' a $ www {#' a $ (8GB of addressable space). www {+ a " # 555 {+ a " # a " 555 a " TeraNet 555 555 a . 5! 95a ! Switch fabric that has 2 Terabits D 9 { of bandwidth which allows t Lh maximum data transfer between {" w0 system components to realize full

Multi Core Dsps Vs Gpus TI for Distribution

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support