
Custom Computing

Lecture 1: Custom Computing Technologies

Wayne Luk Department of Computing Imperial College London

https://www.doc.ic.ac.uk/~wl/teachlocal/cuscomp/ [email protected]

wl 2021 1.1

General-purpose computing: efficient?

AES 128-bit key,       Throughput     Power          Efficiency
128-bit data                          consumption    (Gb/s/W)

ASIC [0]               3.84 Gbit/s    350 mW         11         (1/1)
FPGA [1]               1.32 Gbit/s    490 mW         2.7        (1/4)
Asm StrongARM [2]      31 Mbit/s      240 mW         0.13       (1/85)
Asm Pentium III [3]    648 Mbit/s     41.4 W         0.015      (1/800)
C Emb. Sparc [4]       133 Kbit/s     120 mW         0.0011     (1/10,000)
Java Emb. Sparc [5]    450 bit/s      120 mW         0.0000037  (1/3,000,000)
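The efficiency column is throughput divided by power. A minimal sketch that recomputes it from the figures in the table, assuming the units shown there; the names PLATFORMS, efficiency_gbps_per_watt and efficiency_table are illustrative and not part of the lecture material:

```python
# Throughput (bit/s) and power (W), copied from the table above.
PLATFORMS = {
    "ASIC": (3.84e9, 0.350),
    "FPGA": (1.32e9, 0.490),
    "Asm StrongARM": (31e6, 0.240),
    "Asm Pentium III": (648e6, 41.4),
    "C Emb. Sparc": (133e3, 0.120),
    "Java Emb. Sparc": (450.0, 0.120),
}


def efficiency_gbps_per_watt(throughput_bps, power_w):
    """Efficiency in Gb/s/W: throughput divided by power."""
    return throughput_bps / 1e9 / power_w


def efficiency_table(platforms=PLATFORMS, baseline="ASIC"):
    """Map each platform to (efficiency, factor by which the baseline is better)."""
    base = efficiency_gbps_per_watt(*platforms[baseline])
    table = {}
    for name, (throughput_bps, power_w) in platforms.items():
        eff = efficiency_gbps_per_watt(throughput_bps, power_w)
        table[name] = (eff, base / eff)
    return table


if __name__ == "__main__":
    for name, (eff, factor) in efficiency_table().items():
        print(f"{name:16s} {eff:12.3g} Gb/s/W   1/{factor:,.0f}")
```

The ratios in the table are rounded, so the recomputed factors can differ somewhat from the printed ones.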

[0] 180 nm CMOS
[1] Amphion CS5230 on Virtex2 + Xilinx Virtex2 Power Estimator
[2] Dag Arne Osvik: 544 cycles AES – ECB on StrongArm SA-1110
[3] Helger Lipmaa PIII assembly handcoded + Intel Pentium III (1.13 GHz) Datasheet
[4] gcc, 1 mW/MHz @ 120 MHz Sparc – assumes 250 nm CMOS
[5] on KVM (Sun J2ME, non-JIT) on 1 mW/MHz @ 120 MHz Sparc – assumes 250 nm CMOS

Application-Specific: ASIC [0]: non-programmable hardware
Field Programmable: FPGA [1]: programmable hardware
[2]-[5]: general-purpose processors

Source: P. Schaumont and I. Verbauwhede
Adapted from: J. Cong

TPU: Systolic Array

Source: N.P. Jouppi et al.

We will find out the secret of systolic array design in Lecture 7!

Learning outcomes: ability to

• develop parametric descriptions of custom computers
• develop alternative designs for custom computers that meet specified requirements
• analyse the performance of a custom computer in terms of time and space
• evaluate space/time trade-offs between competing custom computing designs in order to determine optimal solutions
• use simulation to compare the intended and actual behaviour of custom computers

Module plan

Week        Monday           Thursday         Remarks
(starting)

2 (18/1)    Lecture 1        Lecture 3        Technologies and systems;
            Lecture 2        Ex 1             parametric block description
3 (25/1)    Lecture 4        Lecture 5        Patterns of computation; repeated composition;
            Ex 2             Ex 3             types, laws
4 (01/2)    Lecture 6        Lecture 8        Reasoning and specialisation; sequential designs
            Lecture 7        Ex 4             and pipelining; systolic design
5 (08/2)    Lecture 9        Lecture 10       Industrial case studies; state machines;
            Ex 5             Ex 6             summary
6 (15/2)    Lecture 11       Lecture 12       Streaming design; iterations; stream offsets;
            Ex 7             Ex 8             hardware mapping
7 (22/2)    Lecture 13       Lecture 14       Scheduling; design compilation; performance
            Ex 9             Ex 10            modelling
8 (01/3)    Lecture 15       Lecture 16 +     Loops and cyclic graphs; industrial case studies;
            Ex 11            Ex 12            summary
9 (08/3)    Revision class   Revision class   Revision
10 (15/3)   -                -                Timed assessment week
11 (22/3)   -                -                Timed assessment week

Custom computing: key principles

• generalisation and specialisation
• often start with design: f0
• generalise f0 to become f(x)
  – f(x0) = f0 where x is a parameter, x0 is a specific value
• specialise f with values for x
  – to produce f1, f2, f3 … with tradeoffs in speed, size…

[Figure: design f0 is generalised to f(x) in the design space; f(x) is then specialised with x=x0, x=x1, x=x2, x=x3 to give designs f0, f1, f2, f3]
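The generalise/specialise principle can be illustrated with a small software model. This is a sketch under stated assumptions, not a design from the lecture: the parameter x is taken to be the number of multipliers p in a dot-product unit, the cycle count is a crude stand-in for time, and make_dot_product is a hypothetical name.

```python
def make_dot_product(p):
    """Generalised design f(p): a dot product using up to p multipliers per cycle.

    Returns a specialised design: a function mapping two equal-length
    vectors to (result, cycles). The parameter p plays the role of x.
    """
    if p < 1:
        raise ValueError("need at least one multiplier")

    def design(xs, ys):
        if len(xs) != len(ys):
            raise ValueError("vectors must have equal length")
        total, cycles = 0, 0
        # Each modelled cycle consumes up to p element pairs in parallel.
        for i in range(0, len(xs), p):
            total += sum(a * b for a, b in zip(xs[i:i + p], ys[i:i + p]))
            cycles += 1
        return total, cycles

    design.multipliers = p  # crude stand-in for size
    return design


f0 = make_dot_product(1)  # starting design: one multiplier, sequential
f1 = make_dot_product(2)  # specialisations trading size for speed
f2 = make_dot_product(4)

if __name__ == "__main__":
    xs, ys = list(range(8)), [1] * 8
    for f in (f0, f1, f2):
        print(f.multipliers, f(xs, ys))
```

All three specialisations compute the same result; they differ only in the modelled number of cycles and multipliers, which is the kind of speed/size trade-off the slide refers to.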

Benefits of customisation

• improvements in
  – accuracy: as needed, not necessarily 8, 32, 64, 128 bits
  – throughput: rate of producing results
  – latency: time between first input and first output
  – reconfiguration time: speed of adapting to changes
  – size: area, volume, weight
  – energy and power consumption: mobile and remote applications
  – development time: design and validation
  – cost: minimise fabrication, post-delivery fixes, enhancements
• need to prioritise design objectives
  – e.g. smallest design at a given speed consuming given energy
• opportunities for customisation
  – application-oriented, e.g. run-time conditions
  – implementation-oriented, e.g. technology used

FPGA: Field Programmable Gate Array

[Figure: Xilinx Virtex-6 FPGA, showing arithmetic blocks, I/O blocks, memory blocks (20 TB/s) and logic (10^5 elements). Source: Maxeler]

FPGA: getting more heterogeneous

[Figure: "Heterogeneous Acceleration from Data Center to the Edge", delivering deterministic performance and low latency. Labels:
  – Scalar: sequential and complex compute; Arm dual-core Cortex-A72, Arm dual-core Cortex-R5
  – Adaptable: flexible parallel compute, data manipulation
  – Intelligent: vector, compute intensive; machine learning and signal processing; AI Engines with 160 GB/s of memory bandwidth per core
  – network-on-chip, custom memory hierarchy, any-to-any connectivity, TB/s of PL-to-AI Engine bandwidth
  – example workloads: Video + AI, Genomics + AI, Risk Modeling + AI, Database + AI, Network IPS + AI, Storage + AI]

Source: Xilinx

Accelerate clouds: Microsoft + Amazon

www..org/news/microsoft-goes-all-in-for-fpgas-to-build-out-cloud-based-ai/

aws.amazon.com/ec2/instance-types/f1/

Implementation technologies

• application-specific integrated circuit (ASIC)
  – high performance, low part cost: cheap if producing large volume
  – high risk, high development cost, slow time-to-market
  – costly (Moore’s Second Law) to develop, build and test, inflexible
• Field-Programmable Gate Array (FPGA)
  – low risk, fast time-to-market, low development cost, high part cost
  – post-delivery improvement: fix bugs, update functions
  – customisable at run time: adapt to environment changes
  – prototype for ASIC
  – enable internet routing
• custom computing systems
  – stand-alone
  – PCIe / Infiniband
  – system-on-chip: instruction processor + FPGA

When to specialise?

ASIC: Application-Specific Integrated Circuit
• fabrication time: pre-fab optimisation
  – specialise physical fabric, ↓ post-fab options
  – ↓ flexibility, ↑ efficiency for compilation and execution

FPGA: Field Programmable Gate Array (as above, plus)
• compile time: pre-execution optimisation
  – specialise initial mapping to fabric, ↓ execution options
  – ↓ efficiency for compilation, ↑ efficiency for execution

Instruction processor, FPGA overlay or reconfiguration (as above, plus)
• run time
  – specialise mapping to fabric during execution
  – ↑ flexibility, ↓ efficiency for execution
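The difference between compile-time and run-time specialisation can be illustrated with a software analogy. This is only a sketch: it models a constant-coefficient multiplier built from shifts and adds, and the names shift_add_plan, specialise_multiplier and generic_multiply are hypothetical, not taken from the lecture.

```python
def shift_add_plan(constant):
    """Decode a non-negative constant into the shift amounts of its set bits."""
    if constant < 0:
        raise ValueError("constant must be non-negative")
    return [bit for bit in range(constant.bit_length()) if (constant >> bit) & 1]


def specialise_multiplier(constant):
    """'Compile time': fix the constant once, before execution.

    The returned function only performs the shifts and adds that this
    particular constant needs; it cannot multiply by anything else.
    """
    shifts = shift_add_plan(constant)  # work done once, up front

    def multiply(x):
        return sum(x << s for s in shifts)

    return multiply


def generic_multiply(x, constant):
    """'Run time': the constant may change on every call, so it is decoded each time."""
    return sum(x << s for s in shift_add_plan(constant))


if __name__ == "__main__":
    times10 = specialise_multiplier(10)  # 10 = 2 + 8, so shifts are [1, 3]
    print(times10(7), generic_multiply(7, 10))
```

The specialised version pays its decoding cost before execution and gives up flexibility; the generic version stays flexible but repeats the decoding during execution.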

Technology comparison

[Figure: flexibility plotted against efficiency and performance for FPGAs, general-purpose instruction processors, digital signal processors, special-purpose instruction processors and ASICs. Annotations:
  – temporal + spatial specialisation at compile time and run time
  – spatial specialisation at fab time and compile time, temporal specialisation at run time]

Adapted from K. Fan, HPCA’09

Makimoto’s Wave: cyclical innovation

[Figure: wave alternating between two phases:
  – generalisation at fab time, specialisation at compile/run time
  – specialisation at fab time]

Adapted from Makimoto, IEEE Computer’13

Design metrics

• NRE (non-recurring engineering) cost
  – one-time cost of designing a system
• total cost: total cost = NRE cost + unit cost * number of units
• size, performance, power
• flexibility
  – make changes to the hardware with low NRE cost
• time-to-prototype, time-to-market
• maintainability
• correctness, safety, robustness
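The total cost formula above can be sketched directly, together with the production volume at which two technologies cost the same. All figures below are made-up illustrative values, not data from the lecture, and the function names are hypothetical.

```python
def total_cost(nre_cost, unit_cost, units):
    """total cost = NRE cost + unit cost * number of units"""
    return nre_cost + unit_cost * units


def crossover_volume(nre_a, unit_a, nre_b, unit_b):
    """Number of units at which technologies A and B have equal total cost.

    Solves nre_a + unit_a * n = nre_b + unit_b * n for n.
    Returns None when the unit costs are equal (no single crossover).
    """
    if unit_a == unit_b:
        return None
    return (nre_b - nre_a) / (unit_a - unit_b)


# Hypothetical figures: low NRE and high unit cost (FPGA-like)
# against high NRE and low unit cost (ASIC-like).
FPGA_NRE, FPGA_UNIT = 100_000, 50
ASIC_NRE, ASIC_UNIT = 1_100_000, 10

if __name__ == "__main__":
    n = crossover_volume(FPGA_NRE, FPGA_UNIT, ASIC_NRE, ASIC_UNIT)
    print(f"crossover at {n:,.0f} units")
    for units in (10_000, 100_000):
        print(units,
              total_cost(FPGA_NRE, FPGA_UNIT, units),
              total_cost(ASIC_NRE, ASIC_UNIT, units))
```

With these assumed numbers, the low-NRE option is cheaper below the crossover volume and the low-unit-cost option is cheaper above it.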

Source: J. Wong

FPGA/ASIC crossover points

[Figure: cost plotted against production volume for FPGA and ASIC, with regions labelled "FPGA Cost Advantage" and "ASIC Cost Advantage" on either side of each crossover point]

Source: S.S.S.P. Rao

Current and future: System-on-Chip

[Figure: system-on-chip floorplan with an I/O ring and interface circuitry surrounding an embedded processor, fixed IP blocks, on-chip memory and reconfigurable logic]

• Processor, e.g. ARM
  – functionality specified using software
• Fixed Intellectual Property (IP) block
  – functionality fixed at design time
  – little post-fab flexibility
• Programmable logic
  – circuit can be specified / modified after fabrication, possibly at run time
  – maybe slower than fixed IP block

Source: S. Wilton

Summary

• custom computing: theory and practice of customisation
  – from data centres/cloud computing to mobile appliances
• customisable off-the-shelf implementation technology
  – e.g. FPGAs, coarse-grained/hybrid processors, custom instructions
• factors favouring field-programmability
  – rise in FPGA capability: many exciting applications
  – rise in integrated circuit fabrication cost: zero for FPGA users!
  – customisation: facilitate product evolution and prototyping
• custom computing tools + applications at Imperial College
  – financial analysis/trading, multimedia processing, medical imaging
  – network firewall, data compression/encryption, mobile robots
  – bio-informatics, machine learning, bio-inspired/self-aware systems

see: http://cc.doc.ic.ac.uk
