
Vector IRAM: A Microprocessor Architecture for Media Processing
Christoforos E. Kozyrakis ([email protected])
CS252 Graduate Computer Architecture, February 10, 2000

Outline
• Motivation for IRAM
  – technology trends
  – design trends
  – application trends
• Vector IRAM
  – instruction set
  – prototype architecture
  – performance

Processor-DRAM Gap (latency)
[Figure: processor vs. DRAM performance, 1980-2000. CPU performance ("Moore's Law") improves ~60%/year while DRAM performance improves ~7%/year, so the processor-memory performance gap grows by ~50%/year.]

Processor-DRAM Tax
Transistor budget devoted to logic vs. on-chip memory (million transistors):
  Processor          Logic   Memory
  Intel PIII Xeon      8       15
  MIPS R12000          3       4.2
  HP PA-8500           4       126
  Sun Ultra-2          2       1.8
  PowerPC G4           4.5     6
  IBM Power3           7       8
  AMD Athlon           11      11
  Alpha 21264          6       9.2

Power Consumption
[Figure: scatter plot of performance (SPECfp95) vs. power (W) for the Alpha 21264, AMD Athlon, IBM Power3, PowerPC G4, Sun Ultra-2, HP PA-8500, MIPS R12000, and Intel PIII Xeon, covering roughly 0-60 SPECfp95 and 0-80 W.]

Other Design Challenges
• Interconnect scaling problems
  – multiple cycles to go across the chip
  – difficult to achieve single-cycle result forwarding
  – need to add extra pipeline stages at the cost of power, complexity, branch and load-use latency
• Design complexity of high-end CPUs
  – 4 to 5 years from scratch to chips for new superscalar architectures
  – >100 engineers
  – >50% of resources spent on design verification

Complexity vs. Performance Gains
                            R5000       R10000         R10K/R5K
  Clock rate                200 MHz     195 MHz        1.0x
  On-chip caches            32K/32K     32K/32K        1.0x
  Instructions/cycle        1 (+ FP)    4              4.0x
  Pipe stages               5           5-7            1.2x
  Model                     in-order    out-of-order   ---
  Die size (mm^2)           84          298            3.5x
    w/o cache, TLB          32          205            6.3x
  Development (man-years)   60          300            5.0x
  SPECint_base95            5.7         8.8            1.6x

Future microprocessor applications
• Multimedia applications
  – image/video processing, voice/pattern recognition, 3D graphics, animation, digital music, encryption, etc.
  – narrow data types, streaming data, real-time requirements
• Mobile and embedded environments
  – notebooks, PDAs, digital cameras, cellular phones, pagers, game consoles, cars, etc.
  – small devices, limited chip count, limited power/energy budget
• Significantly different environment from the desktop/workstation model

Requirements on microprocessors (1)
• High performance for multimedia:
  – real-time performance guarantees
  – support for continuous media data types
  – exploit fine-grain parallelism
  – exploit coarse-grain parallelism
  – exploit high instruction reference locality
  – code density
  – high memory bandwidth

Average vs. real-time performance
[Figure: histogram of the fraction of inputs achieving each performance level, from worst case to best case, for three designs A, B, and C. Which one is the best? Under a statistical (average) metric, C wins; under a real-time (worst-case) metric, A wins.]
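To make the average-vs-worst-case distinction concrete, the C sketch below scores three hypothetical designs over the same set of per-input execution times. The designs, the timing numbers, and the 33.3 ms frame deadline are illustrative assumptions, not measurements from the slides; the point is only that the design with the best average can still miss a real-time deadline that a slower-on-average design always meets.

#include <stdio.h>

#define NUM_INPUTS 5

/* Hypothetical per-input execution times (ms) for three designs.
 * These numbers are made up purely for illustration. */
static const double time_ms[3][NUM_INPUTS] = {
    { 30.0, 31.0, 32.0, 31.0, 30.0 },   /* A: slower on average, predictable   */
    { 25.0, 27.0, 34.0, 26.0, 28.0 },   /* B: in between                       */
    { 18.0, 19.0, 45.0, 18.0, 20.0 },   /* C: fastest on average, bad worst case */
};

int main(void) {
    const double deadline_ms = 33.3;    /* e.g., one frame of 30 fps video */
    for (int d = 0; d < 3; d++) {
        double sum = 0.0, worst = 0.0;
        for (int i = 0; i < NUM_INPUTS; i++) {
            sum += time_ms[d][i];
            if (time_ms[d][i] > worst) worst = time_ms[d][i];
        }
        printf("Design %c: average %.1f ms, worst case %.1f ms, %s\n",
               'A' + d, sum / NUM_INPUTS, worst,
               worst <= deadline_ms ? "meets deadline" : "misses deadline");
    }
    return 0;
}

With these made-up numbers, C has the best average time but only A meets the deadline on every input, which is the slide's point: real-time media workloads are judged by worst-case performance, not by the statistical average.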
Requirements on microprocessors (2)
• Low power and energy consumption
  – energy efficiency for long battery life
  – power efficiency for system cost reduction (cooling system, packaging, etc.)
• Design scalability
  – performance scalability
  – physical design scalability
    • design complexity, verification complexity
  – immunity to interconnect scaling problems
    • locality of interconnect, tolerance to latency
• System-on-a-chip (SoC)
  – highly integrated system
  – low system chip count

The IRAM vision statement
[Figure: a conventional system (processor with L1/L2 caches, off-chip buses, and separate DRAM and I/O chips) next to an IRAM chip that integrates the processor, DRAM, and I/O.]
Microprocessor & DRAM on a single chip:
  – on-chip memory latency 5-10X, bandwidth 50-100X
  – improve energy efficiency 2X-4X (no off-chip bus)
  – serial I/O 5-10X vs. buses
  – smaller board area/volume
  – adjustable memory size/width

Vector IRAM
• Vector processing
  – high performance for media processing
  – low power/energy for processor control
  – modularity, low complexity
  – scalability
  – well-understood software development
• Embedded DRAM
  – high bandwidth for vector processing
  – low power/energy for memory accesses
  – modularity, scalability
  – small system size

IRAM ISA summary
• Full vector instruction set with
  – 32 vector registers, 32 vector flag registers
  – support for multiple data types (64b, 32b, 16b, 8b)
  – support for strided and indexed memory accesses
  – support for auto-increment addressing
  – support for DSP operations (multiply-add, saturation, etc.)
  – support for conditional execution
  – support for software speculation
  – support for fast reductions and butterfly permutations
  – support for virtual memory
  – restartable arithmetic (FP & integer) exceptions
• Implemented as a coprocessor extension to the MIPS64 ISA (coprocessor 2)

Vector architectural state
[Figure: the architectural state viewed as $vlr virtual processors of width $vpw. It comprises 32 general-purpose vector registers (vr0-vr31, 64b elements), 32 vector flag registers (vf0-vf31, 1b elements), 32 scalar registers (vs0-vs31, 64b), and 32 control registers (vcr0-vcr31, including $vlr and $vpw).]

Fixed-point Multiply-add
[Figure: fixed-point multiply-add datapath. The half-width (n/2-bit) inputs are multiplied, the product is shifted right by a programmable amount and rounded, and the result is added to a third n-bit operand with saturation.]
• The multiply-halves-and-shift instruction provides support for any fixed-point format (a C sketch of this operation follows the Design Overview below)
• Precision is equal to the data-type width; the multiplier's inputs have half the width
• Uniform, simple support for all data types

VIRAM-1 prototype

Design Overview
• 64b MIPS scalar core
  – coprocessor interface
  – 16KB I/D caches
• Vector unit
  – 8KByte vector register file
  – support for 64b, 32b, and 16b data types
  – 2 arithmetic (1 FP), 2 flag processing, 1 load-store units
  – 4 64-bit datapaths per unit
  – DRAM latency included in vector pipeline
  – 4 addresses/cycle for strided/indexed accesses
  – 2-level TLB
• Memory system
  – 8 2MByte eDRAM banks
  – single sub-bank per bank
  – 256-bit synchronous interface, separate I/O signals
  – 20ns cycle time, 6.6ns column access
  – crossbar interconnect for 12.8 GB/sec per direction
  – no caches
• Network interface
  – user-level message passing
  – dedicated DMA engines
  – 4 100MByte/s links
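The C function below is a minimal behavioral sketch of the fixed-point multiply-add described on the earlier slide, for the 32-bit case: the half-width (16-bit) inputs are multiplied, the product is shifted right by a programmable amount with rounding, and the result is added to a third operand with saturation. The function name and the exact rounding and saturation conventions are illustrative assumptions, not details taken from the VIRAM-1 specification.

#include <stdint.h>
#include <stdio.h>

/* Behavioral sketch of a 32-bit fixed-point multiply-add:
 *   w = saturate(z + round((x * y) >> shift))
 * where x and y are the half-width (16-bit) multiplier inputs.
 * VIRAM-1's actual rounding/saturation behavior may differ; this only
 * mirrors the multiply -> shift & round -> add & saturate structure. */
static int32_t fixed_madd32(int16_t x, int16_t y, int32_t z, unsigned shift) {
    int64_t prod = (int64_t)x * y;                     /* full-precision product  */
    if (shift > 0)
        prod += (int64_t)1 << (shift - 1);             /* round to nearest        */
    int64_t sum = (prod >> shift) + z;                 /* pick format, accumulate */
    if (sum > INT32_MAX) sum = INT32_MAX;              /* saturate, do not wrap   */
    if (sum < INT32_MIN) sum = INT32_MIN;
    return (int32_t)sum;
}

int main(void) {
    /* Q15 example: 0.5 * 0.5, shifted back to Q15, plus 0.25 -> 0.5 */
    printf("%d\n", fixed_madd32(16384, 16384, 8192, 15));   /* prints 16384 */
    return 0;
}

Because the shift amount is a parameter, the same operation supports any fixed-point format, which is the "uniform, simple support for all data types" point made on the slide.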
Vector Unit Pipeline Structure
• Single-issue, in-order pipeline
  – each instruction can specify up to 128 operations and occupy a functional unit for 8 cycles
• DRAM latency is included in the execution pipeline (delayed pipeline)
  – deep pipeline design, but no caches needed to avoid stalls
  – worst-case DRAM latency does not cause pipeline stalls
• Address decoupling buffer
  – buffers memory addresses in the presence of conflicts (indexed/strided accesses)
  – memory conflicts do not stall the pipeline

Non-Delayed Pipeline
[Figure: pipeline diagram with scalar stages F D X M W and vector pipelines VLOAD (A, T, memory access, VW), VALU (VR, X1 ... XN, VW), and VSTORE (A, T, VR), for the sequence vld; vadd; vst. With a DRAM latency of 20 ns or more, the load-to-ALU RAW hazard exposes the full (long) DRAM latency as stall cycles.]

Tolerating Memory Latency: Delayed Pipeline
[Figure: the same vld; vadd; vst sequence with delay stages inserted ahead of the VALU (and VSTORE) register reads so they line up with the vector load writeback. The load-to-ALU dependence now sees only the (short) functional-unit latency, and the worst-case DRAM latency is hidden inside the pipeline.]

Clustered VLSI Design
[Figure: the vector unit is organized as four identical clusters (lanes). Each lane contains 64b crossbar interfaces at its top and bottom, two integer datapaths, FP datapaths, a slice of the vector register file, and flag registers and datapaths; a single control block is shared, and the four lanes together connect to the 256b crossbar.]

VIRAM-1 Floorplan
[Figure: die floorplan with DRAM banks 0, 2, 4, and 6 across the top, DRAM banks 1, 3, 5, and 7 across the bottom, and the four vector lanes in the middle, flanked by the MIPS scalar core, control logic, network interface, and I/O.]

Prototype Summary
• Technology:
  – 0.18um eDRAM CMOS process (IBM)
  – 6 layers of copper interconnect
  – 1.2V and 1.8V power supplies
• Memory: 16 MBytes
• Clock frequency: 200MHz
• Power: 2 W for vector unit and memory
• Transistor count: ~140 million
• Peak performance:
  – GOPS w/ multiply-add: 3.2 (64b), 6.4 (32b), 12.8 (16b)
  – GOPS w/o multiply-add: 1.6 (64b), 3.2 (32b), 6.4 (16b)
  – GFLOPS: 1.6 (32b)

Kernels Performance
  Kernel                 Peak Perf.    Sustained Perf.   % of Peak
  Image Composition      6.4 GOPS      6.40 GOPS         100.0%
  iDCT                   6.4 GOPS      1.97 GOPS          30.7%
  Color Conversion       3.2 GOPS      3.07 GOPS          96.0%
  Image Convolution      3.2 GOPS      3.16 GOPS          98.7%
  Integer MV Multiply    3.2 GOPS      2.77 GOPS          86.5%
  Integer VM Multiply    3.2 GOPS      3.00 GOPS          93.7%
  FP MV Multiply         1.6 GFLOPS    1.40 GFLOPS        87.5%
  FP VM Multiply         1.6 GFLOPS    1.59 GFLOPS        99.6%
  AVERAGE                                                 86.6%
• Note: simulations did not include memory optimizations (address decoupling, small-stride optimizations, address hashing) or the fixed-point multiply-add integer datapaths.
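As a sanity check on the peak figures in the Prototype Summary above, the short C program below reproduces the integer GOPS numbers from the stated datapath configuration: 2 arithmetic units, 4 64-bit datapaths (lanes) per unit, a 200 MHz clock, each 64-bit datapath subdivided into two 32-bit or four 16-bit operations. Treating both arithmetic units as integer-capable at every width and counting a multiply-add as two operations is my reading of the slides, not something they state explicitly, and the 1.6 GFLOPS figure is not derived here.

#include <stdio.h>

int main(void) {
    const double clock_hz    = 200e6;  /* 200 MHz clock (Prototype Summary)        */
    const int    lanes       = 4;      /* 4 64-bit datapaths per functional unit   */
    const int    arith_units = 2;      /* 2 arithmetic units (Design Overview)     */

    const int widths[]   = { 64, 32, 16 };   /* element width in bits              */
    const int subwords[] = {  1,  2,  4 };   /* operations per 64-bit datapath     */

    for (int i = 0; i < 3; i++) {
        double ops_per_cycle = (double)arith_units * lanes * subwords[i];
        double gops          = ops_per_cycle * clock_hz / 1e9;
        printf("%2db: %4.1f GOPS without multiply-add, %4.1f GOPS with "
               "multiply-add counted as two operations\n",
               widths[i], gops, 2.0 * gops);
    }
    return 0;
}

Under these assumptions the program prints 1.6/3.2/6.4 GOPS without multiply-add and 3.2/6.4/12.8 GOPS with it, matching the Prototype Summary, and the Kernels Performance table reports sustained rates as a fraction of these peaks.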