Vector IRAM: An Architecture for Media Processing

Christoforos E. Kozyrakis [email protected]

CS252 Graduate Computer Architecture February 10, 2000

Outline

• Motivation for IRAM
  – technology trends
  – design trends
  – application trends
• Vector IRAM
  – instruction set
  – prototype architecture
  – performance

2/10/2000 C.E. Kozyrakis, U.C. Berkeley Page 2

Processor-DRAM Gap (latency)

[Figure: processor vs. DRAM performance, 1980-2000, log scale. CPU performance ("Moore's Law") improves 60%/year, DRAM performance 7%/year; the processor-memory performance gap grows about 50% per year.]


Processor-DRAM Tax

                   logic    memory   (million transistors)
Intel PIII Xeon      8        15
MIPS R12000          3         4.2
HP PA-8500           4       126
Sun Ultra-2          2         1.8
PowerPC G4           4.5       6
IBM Power3           7         8
AMD Athlon          11        11
Alpha 21264          6         9.2


Power Consumption

[Figure: scatter plot of performance (SPEC95FP, 0-60) vs. power consumption (0-80 W) for Alpha 21264, AMD Athlon, IBM Power3, PowerPC G4, Sun Ultra-2, HP PA-8500, MIPS R12000, and Intel PIII Xeon.]


Other Design Challenges

• Interconnect scaling problems
  – multiple cycles to go across the chip
  – difficult to achieve single-cycle result forwarding
  – need to add extra pipeline stages at the cost of power, complexity, branch and load-use latency

• Design complexity of high-end CPUs
  – 4 to 5 years from scratch to chips for new superscalar architectures
  – >100 engineers
  – >50% of resources to design verification


Complexity vs. Performance Gains

                        R5000      R10000        R10K/R5K
• Clock Rate            200 MHz    195 MHz       1.0x
• On-Chip Caches        32K/32K    32K/32K       1.0x
• Instructions/Cycle    1 (+ FP)   4             4.0x
• Pipe stages           5          5-7           1.2x
• Model                 In-order   Out-of-order  ---
• Die Size (mm2)        84         298           3.5x
  – w/o cache, TLB      32         205           6.3x
• Development           60         300           5.0x
  (man-years)
• SPECint_base95        5.7        8.8           1.6x


Future microprocessor applications

• Multimedia applications
  – image/video processing, voice/pattern recognition, 3D graphics, animation, digital music, encryption, etc.
  – narrow data types, streaming data, real-time requirements
• Mobile and embedded environments
  – notebooks, PDAs, digital cameras, cellular phones, pagers, game consoles, cars, etc.
  – small devices, limited chip count, limited power/energy budget
• Significantly different environment from the desktop/workstation model


Requirements on microprocessors (1)

• High performance for multimedia:
  – real-time performance guarantees
  – support for continuous-media data types
  – exploit fine-grain parallelism
  – exploit coarse-grain parallelism
  – exploit high instruction reference locality
  – code density
  – high memory bandwidth


Average vs. real time performance ...

[Figure: histogram of performance across inputs for three designs A, B, and C. Which one is the best? Statistical metric ⇒ average performance ⇒ C wins; real-time metric ⇒ worst-case performance ⇒ A wins.]


Requirements on microprocessors (2)

• Low power and energy consumption
  – energy efficiency for long battery life
  – power efficiency for system cost reduction (cooling system, packaging, etc.)
• Design scalability
  – performance scalability
  – physical design scalability
    • design complexity, verification complexity
  – immunity to interconnect scaling problems
    • locality of interconnect, tolerance to latency
• System-on-a-chip (SoC)
  – highly integrated system
  – low system chip count


The IRAM vision statement

[Figure: conventional system (processor with L1/L2 caches, off-chip buses to DRAM and I/O) vs. IRAM (processor, DRAM, and serial I/O integrated on a single chip).]

Microprocessor & DRAM on a single chip:
  – on-chip memory latency 5-10X, bandwidth 50-100X
  – improve energy efficiency 2X-4X (no off-chip bus)
  – serial I/O 5-10X vs. buses
  – smaller board area/volume
  – adjustable memory size/width


Vector IRAM

• Vector processing
  – high performance for media processing
  – low power/energy for processor control
  – modularity, low complexity
  – scalability
  – well-understood software development
• Embedded DRAM
  – high bandwidth for vector processing
  – low power/energy for memory accesses
  – modularity, scalability
  – small system size


IRAM ISA summary

• Full vector instruction set with
  – 32 vector registers, 32 vector flag registers
  – support for multiple data types (64b, 32b, 16b, 8b)
  – support for strided and indexed memory accesses
  – support for auto-increment addressing
  – support for DSP operations (multiply-add, saturation, etc.)
  – support for conditional execution
  – support for software speculation
  – support for fast reductions and butterfly permutations
  – support for virtual memory
  – restartable arithmetic (FP & integer) exceptions
• Implemented as a coprocessor extension to the MIPS64 ISA (coprocessor 2)


Vector architectural state

[Figure: vector architectural state. $vlr virtual processors VP0 .. VP($vlr-1), each $vpw bits wide; 32 general-purpose vector registers vr0-vr31 (64b elements); 32 flag registers vf0-vf31 (1b elements); 32 scalar registers vs0-vs31 (64b); control registers vcr0-vcr31.]


Fixed-point Multiply-add

[Figure: fixed-point multiply-add datapath. Inputs x and y (n bits) are multiplied, the product is shifted right by n/2 and rounded to produce z, and z is added to the accumulator a with saturation to produce the n-bit result w.]

• Multiply-halves & shift instruction provides support for any fixed-point format
• Precision is equal to the datatype width; the multiplier's inputs have half the width
• Uniform, simple support for all datatypes


VIRAM-1 prototype


Design Overview

• 64b MIPS scalar core
  – coprocessor interface
  – 16KB I/D caches
• Vector unit
  – 8KByte vector register file
  – support for 64b, 32b, and 16b data-types
  – 2 arithmetic (1 FP), 2 flag processing, 1 load-store units
  – 4 64-bit datapaths per unit
  – DRAM latency included in vector pipeline
  – 4 addresses/cycle for strided/indexed accesses
  – 2-level TLB
• Memory system
  – 8 2MByte eDRAM banks
  – single sub-bank per bank
  – 256-bit synchronous interface, separate I/O signals
  – 20ns cycle time, 6.6ns column access
  – crossbar interconnect for 12.8 GB/sec per direction
  – no caches
• Network interface
  – user-level message passing
  – dedicated DMA engines
  – 4 100MByte/s links


Vector Unit Pipeline Structure

• Single-issue, in-order pipeline
  – each instruction can specify up to 128 operations and occupy a functional unit for 8 cycles
• DRAM latency is included in the execution pipeline (delayed pipeline)
  – deep pipeline design, but no caches needed to avoid stalls
  – worst-case DRAM latency does not cause pipeline stalls
• Address decoupling buffer
  – buffers memory addresses in the presence of conflicts (indexed/strided accesses)
  – memory conflicts do not stall the pipeline


Non-Delayed Pipeline

[Figure: non-delayed pipeline. Scalar stages F D X M W. Vector stages: VLOAD = A T mem VW, VALU = VR X1 X2 ... XN VW, VSTORE = A T VR mem. DRAM latency: >=20ns. A vadd with a RAW hazard on a preceding vld stalls until the load completes.]

Load -> ALU exposes the full DRAM latency (long)


Tolerating Memory Latency: Delayed Pipeline

[Figure: delayed pipeline. Scalar stages F D X M W. Vector stages: VLOAD = A T ... VW, VALU = DELAY VR X1 ... XN VW, VSTORE = A T VR. DRAM latency: >20ns. The delay stages align a dependent vadd with the completion of the preceding vld.]

Load -> ALU sees only the functional-unit latency (short)


Clustered VLSI Design

[Figure: four identical 64b vector lanes forming a 256b vector unit. Each lane contains crossbar interfaces, integer datapaths 0 and 1, vector registers, flag registers & datapath, and FP datapaths; control is shared across the lanes.]


VIRAM-1 Floorplan

[Figure: chip floorplan. Even-numbered DRAM banks (0, 2, 4, 6) along the top edge and odd-numbered banks (1, 3, 5, 7) along the bottom; vector lanes 0-3 across the middle; MIPS core and I/O on the left.]


Prototype Summary

• Technology:
  – 0.18um eDRAM CMOS process (IBM)
  – 6 layers of copper interconnect
  – 1.2V and 1.8V power supplies
• Memory: 16 MBytes
• Clock frequency: 200MHz
• Power: 2 W for vector unit and memory
• Transistor count: ~140 million
• Peak performance:
  – GOPS w/ multiply-add: 3.2 (64b), 6.4 (32b), 12.8 (16b)
  – GOPS w/o multiply-add: 1.6 (64b), 3.2 (32b), 6.4 (16b)
  – GFLOPS: 1.6 (32b)


Kernels Performance

                      Peak Perf.   Sustained Perf.   % of Peak
Image Composition     6.4 GOPS     6.40 GOPS         100.0%
iDCT                  6.4 GOPS     1.97 GOPS          30.7%
Color Conversion      3.2 GOPS     3.07 GOPS          96.0%
Image Convolution     3.2 GOPS     3.16 GOPS          98.7%
Integer MV Multiply   3.2 GOPS     2.77 GOPS          86.5%
Integer VM Multiply   3.2 GOPS     3.00 GOPS          93.7%
FP MV Multiply        1.6 GFLOPS   1.40 GFLOPS        87.5%
FP VM Multiply        1.6 GFLOPS   1.59 GFLOPS        99.6%
AVERAGE                                               86.6%

• Note: simulations did not include memory optimizations (address decoupling, small-stride optimizations, address hashing) or fixed-point multiply-add integer datapaths


Comparisons

                    VIRAM   MMX           VIS           TMS320C82

Image Composition   0.13    -             2.22 (17.0x)  -
iDCT                1.18    3.75 (3.2x)   -             -
Color Conversion    0.78    8.00 (10.2x)  -             5.70 (7.6x)
Image Convolution   5.49    5.49 (4.5x)   6.19 (5.1x)   6.50 (5.3x)

• All numbers in cycles/pixel
• MMX, VIS, and TMS results assume all data in L1 cache


FFT Performance

[Figure: FFT execution time (microseconds, 0-200) vs. size (128-1024 points) for fixed-point (16-bit) and floating-point (32-bit) FFTs. Reference times: Pentium/200 151 us, TMS320C67x 124 us, PPC604e 87 us, TigerSHARC 41 us, VIRAM 37 us, CRI Pulsar 27.9 us, Wildstar 25 us, CRI Pathfinder-1 22.3 us.]

• Note: simulations performed with unscheduled fixed-point code


Motion Estimation Performance

Size             VIRAM-1 (cycles)   MMX (cycles)

QCIF (176x144)   7.1x10^6 (4.6x)    3.3x10^7
CIF (352x288)    2.8x10^7 (5.0x)    1.4x10^8

• Note: MMX results assume all data in L1 cache


Overall Performance of H.263

Sequence   Akiyo          Mom            Hall           Foreman
Bit rate   12.95 kbit/s   16.25 kbit/s   20.47 kbit/s   65.52 kbit/s
Speed      23.5 fps       22.7 fps       22.7 fps       20.9 fps

• Average encoding speed for H.263 on VIRAM for standard MPEG test sequences, using exhaustive search for motion estimation and LLM for the DCT
• Note: simulations did not include memory optimizations (address decoupling, small-stride optimizations, address hashing) or fixed-point multiply-add integer datapaths


Class Project Suggestions

• Architecture comparisons & applications
  – information retrieval
  – signal processing apps
  – neural net training
• Multimedia application analysis
  – operand reuse patterns
  – branch behavior
  – data/value locality and memory access patterns
• Low power/energy architectures
  – energy-exposed ISA design
  – compilation for low energy
  – use of speculation for power reduction

