The Alpha 21364 and 21464 Microprocessors: Continuing the Performance Lead Beyond Y2K
Shubu Mukherjee, Ph.D. Principal Hardware Engineer VSSAD Labs, Alpha Development Group Compaq Computer Corporation Shrewsbury, Massachusetts
Slides: 1998 Microprocessor Forum (Peter Bannon) and 1999 Microprocessor Forum (Joel Emer)
Better answers Alpha Microprocessor Roadmap
Higher Performance
0.35mm 0.18mm 0.125mm 21264 21364 21464 EV6 EV7 EV8
0.28mm 0.125mm 21264 21364 EV67 EV78
0.18mm 21264 EV68
1998 1999 2000 2001 2002 2003
First System Ship Better answers Alpha 21264 Microprocessor
Architectural Features
First “Out-of-Order” Alpha
Four-wide superscalar
…
Performance
World’s Fastest Microprocessor (www.spec.org, 11/17/99)
39 SPECINT95, 68 SPECFP95 @ 700 Mhz – Intel Pentium III @ 733 Mhz delivers 36 SPECINT95, 30 SPECFP95
Better answers Alpha Microprocessor Roadmap
Higher Performance
0.35mm 0.18mm 0.125mm 21264 21364 21464 EV6 EV7 EV8
0.28mm 0.125mm 21264 21364 EV67 EV78
0.18mm 21264 EV68
1998 1999 2000 2001 2002 2003
First System Ship Better answers Alpha 21364 Goals
Leadership single stream performance
Higher operating frequency
Integrated memory interface
Leadership multiprocessor performance
Integrated system / multiprocessor interface
Better answers Alpha 21364 Features
System-on-a-Chip
Alpha 21264 core with enhancements
Integrated L2 Cache
Integrated memory controller
Integrated network interface
Fault-Tolerance
Support for lock-step operation to enable high- availability systems.
Better answers 21364 Chip Block Diagram
Address In 16 L1 R Miss Buffers Address Out A M B 64K Icache U L2 Memory S 21264 Cache Controller Core N Network S E Interface W 64K Dcache I/O 16 L1 Victim Buf 16 L2 Victim Buf
Better answers 21364 Core
FETCH MAP QUEUE REG EXEC DCACHE Stage: 0 1 2 3 4 5 6 Int Branch Int Reg Exec Predictors Reg Issue File Queue Addr Map (80) Exec (20) L1 L2 Reg Data cache1 80 in-flight instructions Exec File Cache .5MB plus 32 loads and 32 stores Addr 64KB (80) Exec 6-Set Next-Line 2-Set Address L1 Ins. 4 Instructions / cycle Cache FP ADD FP Reg 64KB FP Div/Sqrt Issue File Victim 2-Set Reg Queue (72) FP MUL Buffer Map (15) Miss Address
Better answers Integrated L2 Cache
1.5 MB
6-way set associative
16 GB/s total read/write bandwidth
16 Victim buffers for L1 -> L2
16 Victim buffers for L2 -> Memory
ECC SECDED code
12ns load to use latency
Better answers Integrated Memory Controller
Direct RAMbus
High data capacity per pin
800 MHz operation
30ns CAS latency pin to pin
6 GB/sec read or write bandwidth
100s of open pages
Directory based cache coherence
ECC SECDED
Better answers Integrated Network Interface
Direct processor-to-processor interconnect
10 GB/second per processor
15ns processor-to-processor latency
Out-of-order network with adaptive routing
Asynchronous clocking between processors
3 GB/second I/O interface per processor
Better answers 21364 System Block Diagram
M M M M 364 364 364 364 IO IO IO IO M M M M 364 364 364 364 IO IO IO IO M M M M 364 364 364 364 IO IO IO IO
Better answers Alpha 21364 Technology
0.18 mm CMOS
1000+ MHz
100 Watts @ 1.5 volts 2 3.5 cm
6 Layer Metal
100 million transistors
8 million logic
92 million RAM
Better answers Alpha 21364 Status
70 SPECint95 (estimated)
120 SPECfp95 (estimated)
RTL model running
Tapeout: Summer 2000
Better answers 21364 Summary: System on a Chip
Integrated L2 cache and memory controller
outstanding single processor performance
Integrated network interface
high performance multi-processor systems
scales to large number of processors
Better answers Alpha Microprocessor Overview
Higher Performance
0.35mm 0.18mm 0.125mm 21264 21364 21464 EV6 EV7 EV8
0.28mm 0.125mm 21264 21364 EV67 EV78
0.18mm 21264 EV68
1998 1999 2000 2001 2002 2003
First System Ship Better answers Alpha 21464 Goals
Leadership single stream performance
Higher operating frequency / better technology
New microarchitecture
Integrated memory interface (like 21364)
Leadership multiprocessor performance
Simultaneous Multithreading (with minimal change/cost)
Integrated system / multiprocessor interface (like 21364)
Better answers Alpha 21464 Technology Overview
Leading edge process technology – 1.2-2.0GHz
0.125µm CMOS
SOI-compatible
Cu interconnect
low-k dielectrics
Chip characteristics
~1.2V Vdd
~250 Million transistors
Better answers Alpha 21464 Architecture Overview
Enhanced out-of-order execution
8-wide superscalar
Large on-chip L2 cache
Direct RAMBUS interface
On-chip router for system interconnect
Glueless, directory-based, ccNUMA
for up to 512-way multiprocessing
4-way simultaneous multithreading (SMT)
Better answers Instruction Issue
Time
Reduced function unit utilization due to dependencies
Better answers Superscalar Issue
Time
Superscalar leads to more performance, but lower utilization
Better answers Predicated Issue
Time
Adds to function unit utilization, but results are thrown away
Better answers Chip Multiprocessor
Time
Limited utilization when only running one thread
Better answers Fine Grained Multithreading
Time
Intra-thread dependencies still limit performance
Better answers Simultaneous Multithreading
Time
Maximum utilization of function units by independent operations
Better answers Basic Out-of-order Pipeline
Fetch Decode/ Queue Reg Execute Dcache/ Reg Retire Map Read Store Write Buffer
PC
Register Map
Regs Regs Dcache Icache
Thread-blind
Better answers SMT Pipeline
Fetch Decode/ Queue Reg Execute Dcache/ Reg Retire Map Read Store Write Buffer
PC
Register Map Regs Dcache Regs Icache
Better answers Changes for SMT
Basic pipeline – unchanged
Replicated resources
Program counters
Register maps
Shared resources
Register file (size increased)
Instruction queue
First and second level caches
Translation buffers
Better answers Multiprogrammed workload
250%
200% 1T 150% 2T 3T 100% 4T 50%
0% SpecInt SpecFP Mixed Int/FP
Better answers Decomposed SPEC95 Applications
250%
200% 1T 150% 2T 3T 100% 4T 50%
0% Turb3d Swm256 Tomcatv
Better answers Multithreaded Applications
300%
250%
200% 1T 150% 2T 4T 100%
50%
0% Barnes Chess Sort TP
Better answers Architectural Abstraction
1 Processor with 4 Thread Processing Units (TPUs)
Shared hardware resources
TPU 0 TPU1 TPU2 TPU3
Icache TLB Dcache
Scache
Better answers 21464 System Block Diagram
0 1 2 3 M M M EV8 EV8 EV8 IO IO IO M M M EV8 EV8 EV8 IO IO IO M M M EV8 EV8 EV8 IO IO IO
Better answers Alpha 21464 Summary
Leadership single stream performance
Higher operating frequency / better technology
New microarchitecture
Integrated memory interface (like 21364)
Leadership multiprocessor performance
Simultaneous Multithreading (with minimal changes/cost)
Integrated system / multiprocessor interface (like 21364)
Better answers Maintain Performance Lead Beyond Y2K
Alpha 21364
Reuses 21264 microprocessor core
System on a chip
Alpha 21464
New microarchitecture
System on a chip
Simultaneous Multithreading
Better answers My Current Research: Beyond 21464?
The Truth Project (w/ Joel Emer)
Examines different microarchitectural issues
The Multinet Project (w/ Rick Kessler)
Tightly-coupled multiprocessor networks
The Reliant Project (w/ Steve Reinhardt)
Self-Checking Microprocessors using SMT, ISCA submission
Asim (w/ VSSAD Labs)
Performance Model for Alphas beyond 21464
Better answers