Vector IRAM: An Architecture for Media Processing

Christoforos E. Kozyrakis [email protected]

CS252 Graduate Computer Architecture February 10, 2000

Outline

• Motivation for IRAM
  – technology trends
  – design trends
  – application trends
• Vector IRAM
  – instruction set
  – prototype architecture
  – performance

2/10/2000 C.E. Kozyrakis, U.C. Berkeley Page 2

Processor-DRAM Gap (latency)

[Figure: processor vs. DRAM performance, 1980-2000, log scale. CPU performance ("Moore's Law") improves 60%/year, DRAM performance 7%/year; the processor-memory performance gap grows about 50% per year.]


Processor-DRAM Tax

                   logic    memory   (million transistors)
Intel PIII Xeon      8        15
MIPS R12000          3         4.2
HP PA-8500           4       126
Sun Ultra-2          2         1.8
PowerPC G4           4.5       6
IBM Power3           7         8
AMD Athlon          11        11
Alpha 21264          6         9.2


Power Consumption

[Figure: scatter plot of performance (SPEC95FP, 0-60) vs. power consumption (0-80 W) for Alpha 21264, AMD Athlon, IBM Power3, PowerPC G4, Sun Ultra-2, HP PA-8500, MIPS R12000, and Intel PIII Xeon.]


Other Design Challenges

• Interconnect scaling problems
  – multiple cycles to go across the chip
  – difficult to achieve single-cycle result forwarding
  – need to add extra pipeline stages at the cost of power, complexity, branch and load-use latency

• Design complexity of high-end CPUs
  – 4 to 5 years from scratch to chips for new superscalar architectures
  – >100 engineers
  – >50% of resources to design verification


Complexity vs. Performance Gains

                        R5000      R10000        R10K/R5K
• Clock Rate            200 MHz    195 MHz       1.0x
• On-Chip Caches        32K/32K    32K/32K       1.0x
• Instructions/Cycle    1 (+ FP)   4             4.0x
• Pipe stages           5          5-7           1.2x
• Model                 In-order   Out-of-order  ---
• Die Size (mm2)        84         298           3.5x
  – w/o cache, TLB      32         205           6.3x
• Development           60         300           5.0x
  (man-years)
• SPECint_base95        5.7        8.8           1.6x


Future microprocessor applications

• Multimedia applications
  – image/video processing, voice/pattern recognition, 3D graphics, animation, digital music, encryption, etc.
  – narrow data types, streaming data, real-time requirements
• Mobile and embedded environments
  – notebooks, PDAs, digital cameras, cellular phones, pagers, game consoles, cars, etc.
  – small devices, limited chip count, limited power/energy budget
• Significantly different environment from the desktop/workstation model


Requirements on microprocessors (1)

• High performance for multimedia:
  – real-time performance guarantees
  – support for continuous-media data types
  – exploit fine-grain parallelism
  – exploit coarse-grain parallelism
  – exploit high instruction reference locality
  – code density
  – high memory bandwidth


Average vs. real time performance ...

[Figure: histogram of performance across inputs for three designs A, B, and C. Which one is the best? Statistical metric ⇒ average performance ⇒ C wins; real-time metric ⇒ worst-case performance ⇒ A wins.]


Requirements on microprocessors (2)

• Low power and energy consumption
  – energy efficiency for long battery life
  – power efficiency for system cost reduction (cooling system, packaging, etc.)
• Design scalability
  – performance scalability
  – physical design scalability
    • design complexity, verification complexity
  – immunity to interconnect scaling problems
    • locality of interconnect, tolerance to latency
• System-on-a-chip (SoC)
  – highly integrated system
  – low system chip count


The IRAM vision statement

[Figure: conventional system (processor with L1/L2 caches, off-chip buses to DRAM and I/O) vs. IRAM (processor, DRAM, and serial I/O integrated on a single chip).]

Microprocessor & DRAM on a single chip:
  – on-chip memory latency 5-10X, bandwidth 50-100X
  – improve energy efficiency 2X-4X (no off-chip bus)
  – serial I/O 5-10X vs. buses
  – smaller board area/volume
  – adjustable memory size/width


Vector IRAM

• Vector processing
  – high performance for media processing
  – low power/energy for processor control
  – modularity, low complexity
  – scalability
  – well-understood software development
• Embedded DRAM
  – high bandwidth for vector processing
  – low power/energy for memory accesses
  – modularity, scalability
  – small system size


IRAM ISA summary

• Full vector instruction set with
  – 32 vector registers, 32 vector flag registers
  – support for multiple data types (64b, 32b, 16b, 8b)
  – support for strided and indexed memory accesses
  – support for auto-increment addressing
  – support for DSP operations (multiply-add, saturation, etc.)
  – support for conditional execution
  – support for software speculation
  – support for fast reductions and butterfly permutations
  – support for virtual memory
  – restartable arithmetic (FP & integer) exceptions
• Implemented as a coprocessor extension to the MIPS64 ISA (coprocessor 2)


Vector architectural state

[Figure: vector architectural state. $vlr virtual processors VP0 .. VP($vlr-1), each $vpw bits wide; 32 general-purpose vector registers vr0-vr31 (64b elements); 32 flag registers vf0-vf31 (1b elements); 32 scalar registers vs0-vs31 (64b); control registers vcr0-vcr31.]


Fixed-point Multiply-add

[Figure: fixed-point multiply-add datapath. Inputs x and y (n bits) are multiplied, the product is shifted right by n/2 and rounded to produce z, and z is added to the accumulator a with saturation to produce the n-bit result w.]

• Multiply-halves & shift instruction provides support for any fixed-point format
• Precision is equal to the datatype width; the multiplier's inputs have half the width
• Uniform, simple support for all datatypes


VIRAM-1 prototype


Design Overview

• 64b MIPS scalar core
  – coprocessor interface
  – 16KB I/D caches
• Vector unit
  – 8KByte vector register file
  – support for 64b, 32b, and 16b data-types
  – 2 arithmetic (1 FP), 2 flag processing, 1 load-store units
  – 4 64-bit datapaths per unit
  – DRAM latency included in vector pipeline
  – 4 addresses/cycle for strided/indexed accesses
  – 2-level TLB
• Memory system
  – 8 2MByte eDRAM banks
  – single sub-bank per bank
  – 256-bit synchronous interface, separate I/O signals
  – 20ns cycle time, 6.6ns column access
  – crossbar interconnect for 12.8 GB/sec per direction
  – no caches
• Network interface
  – user-level message passing
  – dedicated DMA engines
  – 4 100MByte/s links


Vector Unit Pipeline Structure

• Single-issue, in-order pipeline
  – each instruction can specify up to 128 operations and occupy a functional unit for 8 cycles
• DRAM latency is included in the execution pipeline (delayed pipeline)
  – deep pipeline design, but no caches needed to avoid stalls
  – worst-case DRAM latency does not cause pipeline stalls
• Address decoupling buffer
  – buffers memory addresses in the presence of conflicts (indexed/strided accesses)
  – memory conflicts do not stall the pipeline


Non-Delayed Pipeline

[Figure: non-delayed pipeline. Scalar stages F D X M W. Vector stages: VLOAD = A T mem VW, VALU = VR X1 X2 ... XN VW, VSTORE = A T VR mem. DRAM latency: >=20ns. A vadd with a RAW hazard on a preceding vld stalls until the load completes.]

Load -> ALU exposes the full DRAM latency (long)


Tolerating Memory Latency: Delayed Pipeline

[Figure: delayed pipeline. Scalar stages F D X M W. Vector stages: VLOAD = A T ... VW, VALU = DELAY VR X1 ... XN VW, VSTORE = A T VR. DRAM latency: >20ns. The delay stages align a dependent vadd with the completion of the preceding vld.]

Load -> ALU sees only the functional-unit latency (short)


Clustered VLSI Design

[Figure: four identical 64b vector lanes forming a 256b vector unit. Each lane contains crossbar interfaces, integer datapaths 0 and 1, vector registers, flag registers & datapath, and FP datapaths; control is shared across the lanes.]


VIRAM-1 Floorplan

[Figure: chip floorplan. Even-numbered DRAM banks (0, 2, 4, 6) along the top edge and odd-numbered banks (1, 3, 5, 7) along the bottom; vector lanes 0-3 across the middle; MIPS core and I/O on the left.]


Prototype Summary

• Technology:
  – 0.18um eDRAM CMOS process (IBM)
  – 6 layers of copper interconnect
  – 1.2V and 1.8V power supplies
• Memory: 16 MBytes
• Clock frequency: 200MHz
• Power: 2 W for vector unit and memory
• Transistor count: ~140 million
• Peak performance:
  – GOPS w/ multiply-add: 3.2 (64b), 6.4 (32b), 12.8 (16b)
  – GOPS w/o multiply-add: 1.6 (64b), 3.2 (32b), 6.4 (16b)
  – GFLOPS: 1.6 (32b)


Kernels Performance

                      Peak Perf.   Sustained Perf.   % of Peak
Image Composition     6.4 GOPS     6.40 GOPS         100.0%
iDCT                  6.4 GOPS     1.97 GOPS          30.7%
Color Conversion      3.2 GOPS     3.07 GOPS          96.0%
Image Convolution     3.2 GOPS     3.16 GOPS          98.7%
Integer MV Multiply   3.2 GOPS     2.77 GOPS          86.5%
Integer VM Multiply   3.2 GOPS     3.00 GOPS          93.7%
FP MV Multiply        1.6 GFLOPS   1.40 GFLOPS        87.5%
FP VM Multiply        1.6 GFLOPS   1.59 GFLOPS        99.6%
AVERAGE                                               86.6%

• Note: simulations did not include memory optimizations (address decoupling, small-stride optimizations, address hashing) or fixed-point multiply-add integer datapaths


Comparisons

                    VIRAM   MMX           VIS           TMS320C82

Image Composition   0.13    -             2.22 (17.0x)  -
iDCT                1.18    3.75 (3.2x)   -             -
Color Conversion    0.78    8.00 (10.2x)  -             5.70 (7.6x)
Image Convolution   5.49    5.49 (4.5x)   6.19 (5.1x)   6.50 (5.3x)

• All numbers in cycles/pixel
• MMX, VIS, and TMS results assume all data in L1 cache


FFT Performance

[Figure: FFT execution time (microseconds, 0-200) vs. size (128-1024 points) for fixed-point (16-bit) and floating-point (32-bit) FFTs. Reference times: Pentium/200 151 us, TMS320C67x 124 us, PPC604e 87 us, TigerSHARC 41 us, VIRAM 37 us, CRI Pulsar 27.9 us, Wildstar 25 us, CRI Pathfinder-1 22.3 us.]

• Note: simulations performed with unscheduled fixed-point code


Motion Estimation Performance

Size             VIRAM-1 (cycles)   MMX (cycles)

QCIF (176x144)   7.1x10^6 (4.6x)    3.3x10^7
CIF (352x288)    2.8x10^7 (5.0x)    1.4x10^8

• Note: MMX results assume all data in L1 cache


Overall Performance of H.263

Sequence   Akiyo          Mom            Hall           Foreman
Bit rate   12.95 kbit/s   16.25 kbit/s   20.47 kbit/s   65.52 kbit/s
Speed      23.5 fps       22.7 fps       22.7 fps       20.9 fps

• Average encoding speed for H.263 on VIRAM for standard MPEG test sequences, using exhaustive search for motion estimation and LLM for the DCT
• Note: simulations did not include memory optimizations (address decoupling, small-stride optimizations, address hashing) or fixed-point multiply-add integer datapaths


Class Project Suggestions

• Architecture comparisons & applications
  – information retrieval
  – signal processing apps
  – neural net training
• Multimedia application analysis
  – operand reuse patterns
  – branch behavior
  – data/value locality and memory access patterns
• Low power/energy architectures
  – energy-exposed ISA design
  – compilation for low energy
  – use of speculation for power reduction

