Custom Computing
Lecture 1: Custom Computing Technologies
Wayne Luk Department of Computing Imperial College London
https://www.doc.ic.ac.uk/~wl/teachlocal/cuscomp/ [email protected]
wl 2021 1.1

General-purpose computing: efficient?
AES 128bit key, 128bit data   Throughput     Power consumption   Efficiency (Gb/s/W)
ASIC [0]                      3.84 Gbit/s    350 mW              11 (1/1)
FPGA [1]                      1.32 Gbit/s    490 mW              2.7 (1/4)
Asm StrongARM [2]             31 Mbit/s      240 mW              0.13 (1/85)
Asm Pentium III [3]           648 Mbit/s     41.4 W              0.015 (1/800)
C Emb. Sparc [4]              133 Kbit/s     120 mW              0.0011 (1/10,000)
Java Emb. Sparc [5]           450 bit/s      120 mW              0.0000037 (1/3,000,000)
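The efficiency column can be checked by dividing throughput by power. The following is a minimal sketch of that arithmetic; the throughput and power figures are copied from the table, while the function and variable names are illustrative assumptions.

```python
# Throughput (bit/s) and power (W), copied from the table above.
PLATFORMS = {
    "ASIC": (3.84e9, 0.350),
    "FPGA": (1.32e9, 0.490),
    "Asm StrongARM": (31e6, 0.240),
    "Asm Pentium III": (648e6, 41.4),
    "C Emb. Sparc": (133e3, 0.120),
    "Java Emb. Sparc": (450.0, 0.120),
}


def efficiency_gbps_per_watt(throughput_bps, power_w):
    """Efficiency in Gb/s/W: throughput in Gbit/s divided by power in W."""
    return throughput_bps / 1e9 / power_w


def efficiency_table(platforms):
    """Return {name: (efficiency, best efficiency / this efficiency)}."""
    eff = {
        name: efficiency_gbps_per_watt(t, p)
        for name, (t, p) in platforms.items()
    }
    best = max(eff.values())
    return {name: (e, best / e) for name, e in eff.items()}


if __name__ == "__main__":
    for name, (e, ratio) in efficiency_table(PLATFORMS).items():
        print(f"{name:16s} {e:.3g} Gb/s/W  (1/{ratio:,.0f})")
```

The ratios quoted in the table are rounded: for example, the Pentium III row computes to roughly 1/700 from the figures given, against the 1/800 shown.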
[0] Application-Specific Integrated Circuit: 180 nm CMOS
[1] Field Programmable Gate Array: Amphion CS5230 on Virtex2 + Xilinx Virtex2 Power Estimator
[2] Dag Arne Osvik: 544 cycles AES – ECB on StrongARM SA-1110
[3] Helger Lipmaa PIII assembly handcoded + Intel Pentium III (1.13 GHz) Datasheet
[4] gcc, 1 mW/MHz @ 120 MHz Sparc – assumes 250 nm CMOS
[5] Java on KVM (Sun J2ME, non-JIT) on 1 mW/MHz @ 120 MHz Sparc – assumes 250 nm CMOS

ASIC [0]: non-programmable hardware
FPGA [1]: programmable hardware
[2]-[5]: general-purpose processors

Source: P. Schaumont and I. Verbauwhede
Adapted from: J. Cong

TPU: Tensor Processing Unit
Systolic Array
Source: N.P. Jouppi et al.
We will find out the secret of systolic array design in Lecture 7!

Learning outcomes: ability to
• develop parametric descriptions of custom computers
• develop alternative designs for custom computers that meet specified requirements
• analyse the performance of a custom computer in terms of time and space
• evaluate space/time trade-offs between competing custom computing designs in order to determine optimal solutions
• use simulation to compare the intended and actual behaviour of custom computers

Module plan
Week (starting)  Monday                Thursday              Remarks
2 (18/1)         Lecture 1, Lecture 2  Lecture 3, Ex 1       Technologies and systems; parametric block description
3 (25/1)         Lecture 4, Ex 2       Lecture 5, Ex 3       Patterns of computation; repeated composition; types, laws
4 (01/2)         Lecture 6, Lecture 7  Lecture 8, Ex 4       Reasoning and specialisation; sequential designs and pipelining; systolic design
5 (08/2)         Lecture 9, Ex 5       Lecture 10, Ex 6      Industrial case studies; state machines; summary
6 (15/2)         Lecture 11, Ex 7      Lecture 12, Ex 8      Streaming design; iterations; stream offsets; hardware mapping
7 (22/2)         Lecture 13, Ex 9      Lecture 14, Ex 10     Scheduling; design compilation; performance modelling
8 (01/3)         Lecture 15, Ex 11     Lecture 16 + Ex 12    Loops and cyclic graphs; industrial case studies; summary
9 (08/3)         Revision class        Revision class        Revision
10 (15/3)        -                     -                     Timed assessment week
11 (22/3)        -                     -                     Timed assessment week

Custom computing: key principles
• generalisation and specialisation
• often start with design: f0
• generalise f0 to become f(x)
  – f(x0) = f0 where x is a parameter, x0 is a specific value
• specialise f with values for x
  – to produce f1, f2, f3 … with tradeoffs in speed, size…

[Figure: design space f(x). Generalise: design f0 becomes f(x). Specialise: x=x0, x=x1, x=x2, x=x3 give designs f0, f1, f2, f3.]
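
As a minimal sketch of generalisation and specialisation, assume a hypothetical design in which x processing units share a fixed number of independent operations. The cost model (one operation per unit per cycle, size counted in units) and all names are illustrative assumptions, not part of the lecture.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Design:
    units: int   # x: number of parallel processing units (size)
    cycles: int  # cycles needed to process all items (speed)


def f(x, n_items=64):
    """Generalised design f(x): x units share n_items independent operations."""
    if x < 1:
        raise ValueError("need at least one processing unit")
    return Design(units=x, cycles=ceil(n_items / x))


# Starting design f0: a single unit, so x0 = 1 and f(x0) = f0.
f0 = Design(units=1, cycles=64)
assert f(1) == f0

# Specialise f with other values of x to obtain f1, f2, f3.
f1, f2, f3 = f(4), f(16), f(64)


def smallest_meeting(max_cycles, candidates):
    """Smallest design (fewest units) that finishes within max_cycles."""
    feasible = [d for d in candidates if d.cycles <= max_cycles]
    return min(feasible, key=lambda d: d.units) if feasible else None
```

Under this assumed model each specialisation trades size for speed: more units give fewer cycles. `smallest_meeting(10, [f0, f1, f2, f3])` selects f2, the smallest of the candidates that completes within 10 cycles.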

Benefits of customisation
• improvements in
  – accuracy: as needed, not necessarily 8, 32, 64, 128 bits
  – throughput: rate of producing results
  – latency: time between first input and first output
  – reconfiguration time: speed of adapting to changes
  – size: area, volume, weight
  – energy and power consumption: mobile and remote applications
  – development time: design and validation
  – cost: minimise fabrication, post-delivery fixes, enhancements
• need to prioritise design objectives
  – e.g. smallest design at a given speed consuming given energy
• opportunities for customisation
  – application-oriented, e.g. run-time conditions
  – implementation-oriented, e.g. technology used

Implementation technologies
• application-specific integrated circuit (ASIC)
  – high performance, low part cost: cheap if producing large volume
  – high risk, high development cost, slow time-to-market
  – costly (Moore’s Second Law) to develop, build and test, inflexible

FPGA: Field Programmable Gate Array

[Figure: Xilinx Virtex-6 FPGA, showing Logic Cells (10^5 elements), Arithmetic Blocks, Memory Blocks (20TB/s) and I/O Blocks. Source: Maxeler]

FPGA: getting more heterogeneous

[Figure: heterogeneous acceleration from data center to the edge. Scalar engines (Arm dual-core Cortex-A72 and Cortex-R5): scalar, sequential and complex compute. Adaptable engines (custom memory hierarchy): flexible parallel compute, data manipulation. Intelligent engines (AI Engines, 160 GB/s of memory bandwidth per core): vector, compute-intensive machine learning and signal processing. Network-on-chip with any-to-any connectivity; TB/s of bandwidth PL-to-AI Engine; delivering deterministic performance and low latency. Example workloads: video, genomics, risk modeling, database, network IPS and storage, each combined with AI. Source: Xilinx]

Accelerate clouds: Microsoft + Amazon
www.top500.org/news/microsoft-goes-all-in-for-fpgas-to-build-out-cloud-based-ai/
aws.amazon.com/ec2/instance-types/f1/

Implementation technologies
• application-specific integrated circuit (ASIC)
  – high performance, low part cost: cheap if producing large volume
  – high risk, high development cost, slow time-to-market
  – costly (Moore’s Second Law) to develop, build and test, inflexible
• field-programmable gate array (FPGA)
  – low risk, fast time-to-market, low development cost, high part cost
  – post-delivery improvement: fix bugs, update functions
  – customisable at run time: adapt to environment changes
  – prototype for ASIC
  – enable internet routing
• custom computing systems
  – stand-alone
  – PCIe / Infiniband
  – system-on-chip: instruction processor + FPGA

When to specialise?
• fabrication time: pre-fab optimisation
  – specialise physical fabric, ↓ post-fab options
  – ↓ flexibility, ↑ efficiency for compilation and execution
• compile time: pre-execution optimisation
  – specialise initial mapping to fabric, ↓ execution options
  – ↓ efficiency for compilation, ↑ efficiency for execution
• run time
  – specialise mapping to fabric during execution
  – ↑ flexibility, ↓ efficiency for execution

Specialisation times by technology:
• ASIC (application-specific integrated circuit): fabrication time
• FPGA (field-programmable gate array): fabrication time and compile time
• instruction processor, FPGA overlay or reconfiguration: fabrication time, compile time and run time

Technology comparison

[Figure: technologies placed by flexibility (one axis) against efficiency and performance (the other axis). Labels, in the order they appear: FPGAs, General-Purpose Instruction Processors, Digital Signal Processors, Special-Purpose Instruction Processors, ASICs. Annotations: "temporal + spatial specialisation at compile time and run time" (next to FPGAs); "spatial specialisation at fab time and compile time, temporal specialisation at run time". Adapted from K. Fan, HPCA’09]

Makimoto’s Wave: cyclical innovation

[Figure: Makimoto’s Wave, alternating between generalisation at fab time (with specialisation at compile/run time) and specialisation at fab time. Adapted from T. Makimoto, IEEE Computer’13]

Design metrics
• NRE (non-recurring engineering) cost
  – one-time cost of designing a system
• total cost: total cost = NRE cost + unit cost * number of units
• size, performance, power
• flexibility
  – make changes to the hardware with low NRE cost
• time-to-prototype, time-to-market
• maintainability
• correctness, safety, robustness
Source: J. Wong
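
The total cost formula can be turned into a short calculation. The sketch below also finds the production volume at which a high-NRE, low-unit-cost option starts to cost no more than a low-NRE, high-unit-cost one. All cost figures and names are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from math import ceil


def total_cost(nre_cost, unit_cost, units):
    """total cost = NRE cost + unit cost * number of units"""
    return nre_cost + unit_cost * units


def crossover_volume(nre_a, unit_a, nre_b, unit_b):
    """Smallest volume at which option A costs no more than option B.

    Assumes A has the higher NRE cost and the lower unit cost
    (e.g. an ASIC compared with an FPGA); returns None otherwise.
    """
    if nre_a <= nre_b or unit_a >= unit_b:
        return None
    return ceil((nre_a - nre_b) / (unit_b - unit_a))


# Hypothetical figures, not taken from the lecture.
ASIC_NRE, ASIC_UNIT = 1_000_000, 5
FPGA_NRE, FPGA_UNIT = 50_000, 100

if __name__ == "__main__":
    v = crossover_volume(ASIC_NRE, ASIC_UNIT, FPGA_NRE, FPGA_UNIT)
    print("crossover volume:", v)
    print("FPGA total at 1,000 units:", total_cost(FPGA_NRE, FPGA_UNIT, 1000))
    print("ASIC total at 1,000 units:", total_cost(ASIC_NRE, ASIC_UNIT, 1000))
```

With these assumed figures the two options cost the same at 10,000 units; below that volume the low-NRE option is cheaper, above it the low-unit-cost option wins.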

FPGA/ASIC crossover points

[Figure: cost against production volume for FPGA and ASIC. FPGA has the cost advantage at low production volume, ASIC at high production volume.]
Source: S.S.S.P. Rao

Current and future: System-on-Chip

[Figure: system-on-chip enclosed by an I/O ring and interface circuitry, containing an embedded processor, fixed IP blocks, on-chip memory and reconfigurable logic.]
• Processor, e.g. ARM
  – functionality specified using software
• Fixed Intellectual Property Block
  – functionality fixed at design time
  – little post-fab flexibility
• Programmable Logic
  – circuit can be specified / modified after fabrication, possibly at run time
  – maybe slower than fixed IP block
Source: S. Wilton

Summary
• custom computing: theory and practice of customisation
  – from data centres/cloud computing to mobile appliances
• customisable off-the-shelf implementation technology
  – e.g. FPGAs, coarse-grained/hybrid processors, custom instructions
• factors favouring field-programmability
  – rise in FPGA capability: many exciting applications
  – rise in integrated circuit fabrication cost: zero for FPGA users!
  – customisation: facilitate product evolution and prototyping
• custom computing tools + applications at Imperial College
  – financial analysis/trading, multimedia processing, medical imaging
  – network firewall, data compression/encryption, mobile robots
  – bio-informatics, machine learning, bio-inspired/self-aware systems
see: http://cc.doc.ic.ac.uk