A Manycore Coprocessor Architecture for Heterogeneous Computing

Andreas Olofsson
LACSS 2009

Houston, we have a problem...
[Figure: supercomputer performance growth, axis from 100 GFLOP to 1 PFLOP, annotated "Growth=6.x in 3 years". Source: James Anderson, HPEC 2008]

Current Path To Exaflop Supercomputing
2023 Projection:
− 10 EXAFLOP
− 100 GFLOP/Watt
− 10 GWatt System
2023 Alternative:
− Change programming model
− 10 EXAFLOP
− 1-10 TFLOP/Watt
− 0.1-1 GWatt

Ease of use → Area → Power
Chip             Performance    Die area
AMD-OPTERON      36 GFLOPS      283 mm^2
IBM CELL         230 GFLOP/s    220 mm^2
NVIDIA-GTX280    933 GFLOPS     576 mm^2
VIRTEX5-LX50     24 GMACS       146 mm^2
The quest for the holy grail: an architecture that is flexible, easy to use, and power efficient.

Stratix IV FPGA Example (40nm)
− 820K Logic Elements
− 48 high speed transceivers at 8.5 GB/s
− 23 MB SRAM
− 1288 hard macro multipliers (18 bit)
− Up to 264 LVDS pairs
But even FPGAs have warts (blue and purple squares do useful work!)
[Figure. Source: Rose, et al., FPGA 2003]

Intel Teraflops Effort (65nm)
Core Details:
− VLIW Machine
− Dual FMADD
− 2 Kbytes DMEM
− 4 Kbytes IMEM
Interconnect Details:
− 2D Mesh NOC
− 20 Gbyte/link BW
Performance:
− 5-25 GFLOPS/WATT

Usage Programming Model
Coprocessor:
− Great at math!
− Limited flexibility
− Simple programming
− High Efficiency
FPGA:
− Soft
− Interfacing
− Data Flow
− Bit Level processing
Microprocessor:
− Program Control
− Human interfaces, Knowledge extraction
[Diagram labels: SENSOR, FPGA, Coprocessor]
Peaceful coexistence between architectures...

Proposed Coprocessor Solution
− Mesh connected array of ANSI C-programmable processor cores
− 16 to 1024 independent dual issue microprocessors
− IEEE 754 floating point support
− 50 GFLOP/Watt Performance @ 65nm
− Distributed Shared Memory Architecture
− Distributed DMAs
[Diagram labels: uP, MEMORY, ROUTER, DMA]

The Processor
Included:
− ANSI-C programmability
− Features that increased FLOPS/Watt
− Floating point support
Excluded:
− Cache!!!
− Compiler driven optimization
− Instruction set optimization across a broad range of applications
− SIMD

The Network
Highlights:
− On-chip wires are free!
− Address used as 2D routing address
− Bidirectional mesh network operating at same frequency as core
− Optimized for deterministic data traffic and low latency communication
− 64GB/sec BW at each node
[Diagram: 3x3 grid of nodes, each with DSP, IF, and CROSSBAR blocks; rows labeled (1,1), (2,1), (3,1)]

Interfaces
− SOCs tend to have too many interfaces but never the right ones
− FPGAs can have the right interface and have hundreds of GPIO signals
− Use custom low power coprocessor-FPGA interface
[Diagram label: Custom LVDS FPGA Interface]

Multicore FFT Example
Approach:
− 1024 point FFT is spread over 16 processors
− s1, s2, s3, s4 refer to the four FFT stages for combining data with 64 point complex data movements
− Lower # procs transfer W0 to higher # procs
− Lower # proc calculates Wj0+Wj1 x Cj, higher # proc calculates Wj0-Wj1 x Cj
Results:
− NOC enables efficient multicore programming
− < 3us execution time!
− High efficiency
− Work in progress, still room for improvement
[Diagram: 4x4 grid of processors P0 to P15, with arrows labeled s1 to s4 between neighbors]

Summary
− Heterogeneous computing is the only way to continue scaling performance
− ... but it will require some changes to the programming model
− 50 GFLOPS/Watt possible today in 65nm
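The "Ease of use → Area → Power" comparison lists peak performance and die area per chip. As a rough illustration of the area argument, the sketch below divides one by the other. The pairing of each number with its chip is read from the slide layout and should be treated as an assumption. The VIRTEX5-LX50 entry is left out because it is quoted in GMACS, not GFLOPS.

```python
# Figures as quoted on the slide; the chip/number pairing is an assumption
# recovered from the slide layout.
CHIPS = {
    "AMD-OPTERON": {"gflops": 36.0, "area_mm2": 283.0},
    "IBM CELL": {"gflops": 230.0, "area_mm2": 220.0},
    "NVIDIA-GTX280": {"gflops": 933.0, "area_mm2": 576.0},
}


def gflops_per_mm2(chips):
    """Return peak GFLOPS per square millimeter of die area for each chip."""
    return {name: c["gflops"] / c["area_mm2"] for name, c in chips.items()}


if __name__ == "__main__":
    for name, density in sorted(gflops_per_mm2(CHIPS).items(), key=lambda kv: kv[1]):
        print(f"{name:15s} {density:.3f} GFLOPS/mm^2")
```

This ignores process node and precision differences between the chips, so it is only a first-order comparison.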
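The Network slide says the address itself is used as the 2D routing address. A minimal sketch of that idea follows, under stated assumptions: the field widths (5 row bits, 5 column bits, 22 offset bits in a 32-bit address, enough for a 32x32 array of 1024 cores) and the column-first dimension-order routing are illustrative choices, not details given in the slides. The helper names `make_addr`, `decode`, and `route` are hypothetical.

```python
ROW_BITS = 5      # assumption: up to 32 rows
COL_BITS = 5      # assumption: up to 32 columns (32 x 32 = 1024 cores)
OFFSET_BITS = 22  # assumption: remaining bits of a 32-bit address


def make_addr(row, col, offset):
    """Pack a mesh coordinate and a local memory offset into one address."""
    if not (0 <= row < (1 << ROW_BITS) and 0 <= col < (1 << COL_BITS)):
        raise ValueError("coordinate out of range")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset out of range")
    return (row << (COL_BITS + OFFSET_BITS)) | (col << OFFSET_BITS) | offset


def decode(addr):
    """Split an address back into (row, col, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    col = (addr >> OFFSET_BITS) & ((1 << COL_BITS) - 1)
    row = (addr >> (OFFSET_BITS + COL_BITS)) & ((1 << ROW_BITS) - 1)
    return row, col, offset


def route(src, addr):
    """List the nodes visited from src to the node named by addr.

    Moves along the column dimension first, then the row dimension
    (dimension-order routing, assumed here for illustration).
    """
    row, col = src
    dst_row, dst_col, _ = decode(addr)
    path = [(row, col)]
    while col != dst_col:
        col += 1 if dst_col > col else -1
        path.append((row, col))
    while row != dst_row:
        row += 1 if dst_row > row else -1
        path.append((row, col))
    return path
```

With this layout, a store to `make_addr(3, 1, 0)` issued at node (1,1) passes through (2,1) and arrives at (3,1), and each router needs only the upper address bits to choose an output port.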
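The Multicore FFT Example has the lower-numbered processor of a pair compute Wj0+Wj1 x Cj and the higher-numbered one compute Wj0-Wj1 x Cj. The sketch below shows that combine step on a small input. It assumes a radix-2 decimation-in-time split, where W0 and W1 are the transforms of the even and odd samples and Cj is the twiddle factor exp(-2*pi*i*j/N). The slide does not state this mapping, and the function names are hypothetical. A direct DFT stands in for the per-processor work.

```python
import cmath


def dft(x):
    """Direct O(n^2) discrete Fourier transform, used as a reference."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def butterfly_pair(w0, w1, c):
    """Combine step shared by two processors.

    The lower-numbered processor produces w0[j] + w1[j] * c[j];
    the higher-numbered processor produces w0[j] - w1[j] * c[j].
    """
    lower = [a + b * t for a, b, t in zip(w0, w1, c)]
    higher = [a - b * t for a, b, t in zip(w0, w1, c)]
    return lower, higher


def split_fft(x):
    """One decimation-in-time stage: two half-size transforms, then a butterfly."""
    n = len(x)
    w0 = dft(x[0::2])  # transform of even-indexed samples
    w1 = dft(x[1::2])  # transform of odd-indexed samples
    c = [cmath.exp(-2j * cmath.pi * j / n) for j in range(n // 2)]
    lower, higher = butterfly_pair(w0, w1, c)
    return lower + higher  # lower half of the spectrum, then upper half
```

Each processor needs only its partner's half of the data to finish its outputs, which is why the exchange maps naturally onto neighbor-to-neighbor transfers in a mesh.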
