
A Manycore Architecture for

Andreas Olofsson
andreas@.com

Houston, we have a problem...

[Chart: supercomputer performance growth, axis ticks from 100 GFLOP to 1 PFLOP. Annotation: supercomputer growth = 6.x in 3 years.]

James Anderson, HPEC 2008

LACSS 2009

Current Path To Exaflop Supercomputing

• 2023 Projection:
  - 10 EXAFLOP
  - 100 GFLOP/Watt
  - 10 GWatt System

• 2023 Alternative:
  - Change programming model
  - 10 EXAFLOP
  - 1-10 TFLOP/Watt
  - 0.1-1 GWatt

Ease of use → Area → Power

Chip             Throughput    Area
AMD-OPTERON      36 GFLOPS     283 mm^2
IBM              230 GFLOPS    220 mm^2
NVIDIA-GTX280    933 GFLOPS    576 mm^2
VIRTEX5-LX50     24 GMACS      146 mm^2

The quest for the holy grail: an architecture that is flexible, easy to use, and power efficient.
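The area comparison above can be reduced to one number per chip: throughput per square millimeter. The following is a minimal sketch using only the figures quoted on the slide; the names `CHIPS` and `throughput_per_mm2` are illustrative and not from the talk. The chips are built on different process nodes and the FPGA figure is in GMACS rather than GFLOPS, so the ratios are a rough guide and not a like-for-like benchmark.

```python
# Figures as quoted on the slide: (chip, throughput, die area in mm^2).
# The Virtex5 entry is in GMACS, the others in GFLOPS.
CHIPS = [
    ("AMD-OPTERON", 36.0, 283.0),
    ("IBM", 230.0, 220.0),
    ("NVIDIA-GTX280", 933.0, 576.0),
    ("VIRTEX5-LX50", 24.0, 146.0),
]


def throughput_per_mm2(chips):
    """Return (chip, throughput / area) pairs in the order given."""
    return [(name, throughput / area) for name, throughput, area in chips]


if __name__ == "__main__":
    for name, density in throughput_per_mm2(CHIPS):
        print(f"{name:14s} {density:.2f} per mm^2")
```

With these inputs the ratios come out to roughly 0.13, 1.05, 1.62, and 0.16 per mm^2, in the order listed.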

Stratix IV FPGA Example (40nm)

• 820K Logic Elements
• 48 high speed transceivers at 8.5 GB/s
• 23 MB SRAM
• 1288 hard macro multipliers (18bit)
• Up to 264 LVDS pairs

But, even FPGAs have warts (blue and purple squares do useful work!)

Source: Rose, et al FPGA 2003

Teraflops Effort (65nm)

• Core Details:
  - VLIW Machine
  - Dual FMADD
  - 2 Kbytes DMEM
  - 4 Kbytes IMEM
• Interconnect Details:
  - 2D Mesh NOC
  - 20 Gbyte/link BW
• Performance:
  - 5-25 GFLOPS/WATT

Usage Programming Model

• Coprocessor:
  - Great at math!
  - Limited flexibility
  - Simple programming
  - High Efficiency
• FPGA:
  - Interfacing
  - Data Flow
  - Bit Level processing
• [...]:
  - Program Control
  - Human interfaces
  - Knowledge extraction

[Diagram labels: SENSOR, FPGA, Soft, Coprocessor]

Peaceful coexistence between architectures...

Proposed Coprocessor Solution

[Diagram labels: uP, MEMORY, ROUTER, DMA]

• Mesh connected array of ANSI C-programmable cores
• 16 to 1024 independent dual issue cores
• IEEE 754 floating point support
• Distributed Shared Memory Architecture
• Distributed DMAs

50 GFLOP/Watt Performance @ 65nm

The Processor

• Included:
  - ANSI-C programmability
  - Features that increased FLOPS/Watt
  - Floating point support

• Excluded:
  - !!!
  - Compiler driven optimization
  - Instruction set optimization across a broad range of applications
  - SIMD

The Network

• Highlights:
  - On-chip wires are free!
  - Address used as 2D routing address
  - Bidirectional mesh network operating at same frequency as core
  - Optimized for deterministic data traffic and low latency communication
  - 64GB/sec BW at each node

[Diagram: mesh nodes (1,1), (2,1), (3,1), each with DSP, IF, and CROSSBAR blocks]
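To make "address used as 2D routing address" concrete, here is a minimal sketch of how a mesh router could pick the next hop from a destination address. Everything specific in it is an assumption and not from the slides: the 6/6/20-bit address split, the function names, and the choice of dimension-order (column first, then row) routing. The slides only state that the address doubles as a 2D routing address on a bidirectional mesh.

```python
# Hypothetical 32-bit address layout (an assumption, not from the slides):
# [ row : 6 bits ][ col : 6 bits ][ local offset : 20 bits ]
ROW_BITS, COL_BITS, OFFSET_BITS = 6, 6, 20

# Row/column change for each output direction of a mesh node.
STEP = {"EAST": (0, 1), "WEST": (0, -1), "SOUTH": (1, 0), "NORTH": (-1, 0)}


def encode_address(row, col, offset):
    """Pack (row, col, offset) into one address under the assumed layout."""
    return (row << (COL_BITS + OFFSET_BITS)) | (col << OFFSET_BITS) | offset


def decode_address(addr):
    """Split an address into (row, col, offset) under the assumed layout."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    col = (addr >> OFFSET_BITS) & ((1 << COL_BITS) - 1)
    row = (addr >> (COL_BITS + OFFSET_BITS)) & ((1 << ROW_BITS) - 1)
    return row, col, offset


def next_hop(current, dest):
    """Dimension-order routing: fix the column first, then the row."""
    (row, col), (dest_row, dest_col) = current, dest
    if col < dest_col:
        return "EAST"
    if col > dest_col:
        return "WEST"
    if row < dest_row:
        return "SOUTH"
    if row > dest_row:
        return "NORTH"
    return "LOCAL"  # packet has arrived at its destination node


def route(src, dest):
    """Return the list of (row, col) nodes a packet visits from src to dest."""
    path, node = [src], src
    while (direction := next_hop(node, dest)) != "LOCAL":
        d_row, d_col = STEP[direction]
        node = (node[0] + d_row, node[1] + d_col)
        path.append(node)
    return path


if __name__ == "__main__":
    row, col, _ = decode_address(encode_address(3, 1, 0x100))
    print(route((1, 1), (row, col)))
```

Because each hop depends only on the current node and the destination coordinates, the path between two nodes is always the same, which fits the slide's emphasis on deterministic traffic.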

Interfaces

• SOCs tend to have too many interfaces but never the right ones

• FPGAs can have the right interface and have hundreds of GPIO signals

• Use a custom low power coprocessor-FPGA interface

Custom LVDS FPGA Interface

Multicore FFT Example

• Approach:
  - 1024 point FFT is spread over 16 processors
  - s1, s2, s3, s4 refer to the four FFT stages for combining data with 64 point complex data movements
  - Lower # procs transfer W0 to higher # procs.
  - Lower # proc calculates Wj0 + Wj1 x Cj, higher # proc calculates Wj0 - Wj1 x Cj
• Results:
  - NOC enables efficient multicore programming
  - < 3us execution time!
  - High efficiency
  - Work in progress, still room for improvement

[Diagram: processors P0 to P15 arranged in rows of four, with data movements between processors labeled s1 to s4]
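The approach above can be simulated on one machine to check the arithmetic. The sketch below is not the author's implementation: it assumes a radix-2 decimation-in-time split, a bit-reversed assignment of input samples to processors, and a plain recursive FFT for each processor's local 64 points. It reads Wj0 and Wj1 as the lower and higher processor's data and Cj as the twiddle factor. All function names are illustrative.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT, standing in for each processor's local FFT."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def multicore_fft(x, procs=16):
    """Simulate an FFT spread over `procs` processors (a power of two)."""
    m = len(x) // procs              # points per processor (64 for 1024/16)
    stages = procs.bit_length() - 1  # combining stages (4 for 16 processors)
    # Assumed data layout: processor q holds every procs-th sample,
    # starting at the bit-reversed processor index.
    local = [fft(x[bit_reverse(q, stages)::procs]) for q in range(procs)]
    for s in range(1, stages + 1):
        partner_bit = 1 << (s - 1)
        block_len = m << s           # length of the sub-FFT being formed
        combined = [None] * procs
        for lower in range(procs):
            if lower & partner_bit:
                continue             # handled as the higher half of a pair
            higher = lower | partner_bit
            base = (lower % partner_bit) * m
            lo_out, hi_out = [], []
            for j in range(m):
                c = cmath.exp(-2j * cmath.pi * (base + j) / block_len)
                t = local[higher][j] * c                # Wj1 x Cj
                lo_out.append(local[lower][j] + t)      # lower:  Wj0 + Wj1 x Cj
                hi_out.append(local[lower][j] - t)      # higher: Wj0 - Wj1 x Cj
            combined[lower], combined[higher] = lo_out, hi_out
        local = combined
    # Processor q ends up holding output bins q*m .. q*m + m - 1.
    return [value for block in local for value in block]


if __name__ == "__main__":
    samples = [cmath.exp(0.37j * i * i) for i in range(1024)]
    error = max(abs(a - b) for a, b in zip(fft(samples), multicore_fft(samples)))
    print(f"max difference from single-core FFT: {error:.2e}")
```

In this layout each combining stage pairs a processor with the one whose index differs in a single bit, and each pair exchanges one 64-point complex block, which matches the 64 point data movements described on the slide.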

Summary

• Heterogeneous computing is the only way to continue scaling performance

• ... but it will require some changes to the programming model

• 50 GFLOPS/Watt possible today in 65nm
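As a back-of-the-envelope check on what an efficiency figure means at system scale, power is simply performance divided by efficiency. The sketch below applies that to the 50 GFLOPS/Watt figure; the function name is illustrative, and the calculation ignores memory, interconnect, and cooling overheads, which the slides do not quantify.

```python
def system_power_watts(flops, gflops_per_watt):
    """Power needed to sustain `flops` at a given compute efficiency."""
    return flops / (gflops_per_watt * 1e9)


if __name__ == "__main__":
    ten_exaflop = 10e18
    watts = system_power_watts(ten_exaflop, 50)
    print(f"10 EXAFLOP at 50 GFLOPS/Watt: {watts / 1e9:.1f} GWatt")
```

At 50 GFLOPS/Watt, a 10 EXAFLOP system would draw about 0.2 GWatt for compute alone.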
