A Manycore Coprocessor Architecture for Heterogeneous Computing
Andreas Olofsson andreas@adapteva.com Houston, we have a problem..
1PFLOP
Supercomputer Growth=6.x in 3 years 100TFLOP
10TFLOP
1TFLOP
100GFLOP
James Anderson, HPEC 2008
LACSS 2009 Current Path To Exaflop Supercomputing
2023 Projection:
− 10 EXAFLOP − 100 GFLOP/Watt − 10 GWatt System
2023 Alternative:
− Change programming model − 10 EXAFLOP − 110 TFLOP/Watt − 0.1 – 1 GWatt
LACSS 2009 Ease of use → Area → Power
AMDOPTERON IBM CELL 36 GFLOPS 230 GFLOP/s 283 mm^2 220 mm^2
The quest for the holy grail : An architecture that is flexible, easy to use, and power efficient.
NVIDIAGTX280 VIRTEX5LX50 933 GFLOPS 24 GMACS 576mm^2 146mm^2
LACSS 2009 Stratix IV FPGA Example(40nm)
820K Logic Elements 48 high speed transceivers at 8.5 GB/s 23 MB SRAM 1288 hard macro multipliers(18bit) Up to 264 LVDS pairs
But, even FPGAs have warts (blue and purple squares do useful work!)
Source: Rose, et al FPGA 2003
5 LACSS 2009 Intel Teraflops Effort (65nm)
Core Details: VLIW Machine Dual FMADD 2 Kbytes DMEM 4 Kbytes IMEM Interconnect Details: 2D Mesh NOC 20 Gbyte/link BW Performance: 525 GFLOPS/WATT
6 LACSS 2009 Usage Programming Model
Coprocessor: Great at math! FPGA: Limited flexibility Interfacing Soft Simple programming Data Flow Coprocessor High Efficiency Bit Level processing Microprocessor: Program Control Human interfaces, Knowledge extraction SENSOR
FPGA
Peaceful coexistence between architectures... 7 7 LACSS 2009 Proposed Coprocessor Solution
uP MEMORY
ROUTER DMA
Mesh connected array of ANSI C Distributed Shared Memory programmable processor cores Architecture 16 to1024 independent dual issue Distributed DMAs microprocessors IEEE 754 floating point support 50 GFLOP/Watt Performance @ 65nm
8 LACSS 2009 The Processor
Included: − ANSIC programmability − Features that increased FLOPS/Watt − Floating point support
Excluded: − Cache!!! − Compiler driven optimization − Instruction set optimization across a broad range of applications − SIMD
9 LACSS 2009 The Network
Highlights:
− DSP DSP DSP Onchip wires are free! IF IF IF − Address used as 2D routing CROSS CROSS CROSS (1,1) address BAR BAR BAR
− Bidirectional mesh network DSP DSP DSP operating at same frequency IF IF IF as core CROSS CROSS CROSS (2,1) BAR BAR BAR − Optimized for deterministic data traffic and low latency DSP DSP DSP IF IF IF communication CROSS CROSS CROSS − 64GB/sec BW at each node (3,1) BAR BAR BAR
10 LACSS 2009 Interfaces
SOCs tend to have too many interfaces but never the right ones
FPGAs can have the right interface and have hundreds of GPIO signals
Use custom low power coprocessor FPGA interface
Custom LVDS FPGA Interface
11 LACSS 2009 Multicore FFT Example
Approach: s2 s1 1024 point FFT is spread over 16 P0 P1 P2 P3 processors s1 s2 s1,s2,s3,s4 refer to the four FFT stages s4 s3 s4 s3 s4 s3 s4 s3 for combining data with 64 point complex s2 s1 data movements P4 P5 P6 P7 Lower # procs transfer W0 to higher # s1 s2 procs. Lower # proc calculates Wj0+Wj1 x Cj, s2 s1 higher # proc calculates Wj0Wj1 x Cj P8 P9 P10 P11 Results: s1 s2 NOC enables efficient multicore programming s3 s4 s3 s4 s3 s4 s3 s4 < 3us execution time! s2 s1 High efficiency P12 P13 P14 P15 Work in progress, still room for improvement s1 s2
12 LACSS 2009 Summary
Heterogeneous computing is the only way to continue scaling performance
... but it will require some changes to the programming model
50 GFLOPS/Watt possible today in 65nm
13 LACSS 2009