The and 21464 : Continuing the Performance Lead Beyond Y2K

Shubu Mukherjee, Ph.D. Principal Hardware Engineer VSSAD Labs, Alpha Development Group Computer Corporation Shrewsbury, Massachusetts

Slides: 1998 Forum (Peter Bannon) and 1999 Microprocessor Forum ()

Better answers Alpha Microprocessor Roadmap

Higher Performance

0.35mm 0.18mm 0.125mm 21264 21364 21464 EV6 EV7 EV8

0.28mm 0.125mm 21264 21364 EV67 EV78

0.18mm 21264 EV68

1998 1999 2000 2001 2002 2003

First System Ship Better answers Microprocessor

 Architectural Features

 First “Out-of-Order” Alpha

 Four-wide superscalar

 …

 Performance

 World’s Fastest Microprocessor (www.spec.org, 11/17/99)

 39 SPECINT95, 68 SPECFP95 @ 700 Mhz – Intel Pentium III @ 733 Mhz delivers 36 SPECINT95, 30 SPECFP95

Better answers Alpha Microprocessor Roadmap

Higher Performance

0.35mm 0.18mm 0.125mm 21264 21364 21464 EV6 EV7 EV8

0.28mm 0.125mm 21264 21364 EV67 EV78

0.18mm 21264 EV68

1998 1999 2000 2001 2002 2003

First System Ship Better answers Alpha 21364 Goals

 Leadership single stream performance

 Higher operating frequency

 Integrated memory interface

 Leadership multiprocessor performance

 Integrated system / multiprocessor interface

Better answers Alpha 21364 Features

 System-on-a-Chip

 Alpha 21264 core with enhancements

 Integrated L2 Cache

 Integrated

 Integrated network interface

 Fault-Tolerance

 Support for lock-step operation to enable high- availability systems.

Better answers 21364 Chip Block Diagram

Address In 16 L1 R Miss Buffers Address Out A M B 64K Icache U L2 Memory S 21264 Cache Controller Core N Network S E Interface W 64K Dcache I/O 16 L1 Victim Buf 16 L2 Victim Buf

Better answers 21364 Core

FETCH MAP QUEUE REG EXEC DCACHE Stage: 0 1 2 3 4 5 6 Int Branch Int Reg Exec Predictors Reg Issue File Queue Addr Map (80) Exec (20) L1 L2 Reg Data cache1 80 in-flight instructions Exec File Cache .5MB plus 32 loads and 32 stores Addr 64KB (80) Exec 6-Set Next-Line 2-Set Address L1 Ins. 4 Instructions / cycle Cache FP ADD FP Reg 64KB FP Div/Sqrt Issue File Victim 2-Set Reg Queue (72) FP MUL Buffer Map (15) Miss Address

Better answers Integrated L2 Cache

 1.5 MB

 6-way set associative

 16 GB/s total read/write bandwidth

 16 Victim buffers for L1 -> L2

 16 Victim buffers for L2 -> Memory

 ECC SECDED code

 12ns load to use latency

Better answers Integrated Memory Controller

 Direct RAMbus

 High data capacity per pin

 800 MHz operation

 30ns CAS latency pin to pin

 6 GB/sec read or write bandwidth

 100s of open pages

 Directory based cache coherence

 ECC SECDED

Better answers Integrated Network Interface

 Direct processor-to-processor interconnect

 10 GB/second per processor

 15ns processor-to-processor latency

 Out-of-order network with adaptive routing

 Asynchronous clocking between processors

 3 GB/second I/O interface per processor

Better answers 21364 System Block Diagram

M M M M 364 364 364 364 IO IO IO IO M M M M 364 364 364 364 IO IO IO IO M M M M 364 364 364 364 IO IO IO IO

Better answers Alpha 21364 Technology

 0.18 mm CMOS

 1000+ MHz

 100 Watts @ 1.5 volts 2  3.5 cm

 6 Layer Metal

 100 million transistors

 8 million logic

 92 million RAM

Better answers Alpha 21364 Status

 70 SPECint95 (estimated)

 120 SPECfp95 (estimated)

 RTL model running

 Tapeout: Summer 2000

Better answers 21364 Summary: System on a Chip

 Integrated L2 cache and memory controller

 outstanding single processor performance

 Integrated network interface

 high performance multi-processor systems

 scales to large number of processors

Better answers Alpha Microprocessor Overview

Higher Performance

0.35mm 0.18mm 0.125mm 21264 21364 21464 EV6 EV7 EV8

0.28mm 0.125mm 21264 21364 EV67 EV78

0.18mm 21264 EV68

1998 1999 2000 2001 2002 2003

First System Ship Better answers Goals

 Leadership single stream performance

 Higher operating frequency / better technology

 New microarchitecture

 Integrated memory interface (like 21364)

 Leadership multiprocessor performance

 Simultaneous Multithreading (with minimal change/cost)

 Integrated system / multiprocessor interface (like 21364)

Better answers Alpha 21464 Technology Overview

 Leading edge process technology – 1.2-2.0GHz

 0.125µm CMOS

 SOI-compatible

 Cu interconnect

 low-k dielectrics

 Chip characteristics

 ~1.2V Vdd

 ~250 Million transistors

Better answers Alpha 21464 Architecture Overview

 Enhanced out-of-order execution

 8-wide superscalar

 Large on-chip L2 cache

 Direct RAMBUS interface

 On-chip router for system interconnect

 Glueless, directory-based, ccNUMA

 for up to 512-way multiprocessing

 4-way simultaneous multithreading (SMT)

Better answers Instruction Issue

Time

Reduced function unit utilization due to dependencies

Better answers Superscalar Issue

Time

Superscalar leads to more performance, but lower utilization

Better answers Predicated Issue

Time

Adds to function unit utilization, but results are thrown away

Better answers Chip Multiprocessor

Time

Limited utilization when only running one thread

Better answers Fine Grained Multithreading

Time

Intra-thread dependencies still limit performance

Better answers Simultaneous Multithreading

Time

Maximum utilization of function units by independent operations

Better answers Basic Out-of-order Pipeline

Fetch Decode/ Queue Reg Execute Dcache/ Reg Retire Map Read Store Write Buffer

PC

Register Map

Regs Regs Dcache Icache

Thread-blind

Better answers SMT Pipeline

Fetch Decode/ Queue Reg Execute Dcache/ Reg Retire Map Read Store Write Buffer

PC

Register Map Regs Dcache Regs Icache

Better answers Changes for SMT

 Basic pipeline – unchanged

 Replicated resources

 Program counters

 Register maps

 Shared resources

 Register file (size increased)

 Instruction queue

 First and second level caches

 Translation buffers

Better answers Multiprogrammed workload

250%

200% 1T 150% 2T 3T 100% 4T 50%

0% SpecInt SpecFP Mixed Int/FP

Better answers Decomposed SPEC95 Applications

250%

200% 1T 150% 2T 3T 100% 4T 50%

0% Turb3d Swm256 Tomcatv

Better answers Multithreaded Applications

300%

250%

200% 1T 150% 2T 4T 100%

50%

0% Barnes Chess Sort TP

Better answers Architectural Abstraction

 1 Processor with 4 Thread Processing Units (TPUs)

 Shared hardware resources

TPU 0 TPU1 TPU2 TPU3

Icache TLB Dcache

Scache

Better answers 21464 System Block Diagram

0 1 2 3 M M M EV8 EV8 EV8 IO IO IO M M M EV8 EV8 EV8 IO IO IO M M M EV8 EV8 EV8 IO IO IO

Better answers Alpha 21464 Summary

 Leadership single stream performance

 Higher operating frequency / better technology

 New microarchitecture

 Integrated memory interface (like 21364)

 Leadership multiprocessor performance

 Simultaneous Multithreading (with minimal changes/cost)

 Integrated system / multiprocessor interface (like 21364)

Better answers Maintain Performance Lead Beyond Y2K

 Alpha 21364

 Reuses 21264 microprocessor core

 System on a chip

 Alpha 21464

 New microarchitecture

 System on a chip

 Simultaneous Multithreading

Better answers My Current Research: Beyond 21464?

 The Truth Project (w/ Joel Emer)

 Examines different microarchitectural issues

 The Multinet Project (w/ Rick Kessler)

 Tightly-coupled multiprocessor networks

 The Reliant Project (w/ Steve Reinhardt)

 Self-Checking Microprocessors using SMT, ISCA submission

 Asim (w/ VSSAD Labs)

 Performance Model for Alphas beyond 21464

Better answers