AMD Opteron™ Shared Memory MP Systems Ardsher Ahmed Pat Conway Bill Hughes Fred Weber Agenda

• Glueless MP systems • MP system configurations • Cache coherence protocol • 2-, 4-, and 8-way MP system topologies • Beyond 8-way MP systems

September 22, 2002 Hot Chips 14 2 AMD Opteron™ Processor Architecture

DRAM 5.3 GB/s 128-bit MCT CPU SRQ

XBAR HT HT HT 3.2 GB/s per direction 3.2 GB/s per direction @ 0+]'DWD5DWH @ 0+]'DWD5DWH 3.2 GB/s per direction @ 0+]'DWD5DWH HT = HyperTransport™ technology

September 22, 2002 Hot Chips 14 3 Glueless MP System

DRAM DRAM

MCT CPU MCT CPU SRQ SRQ non-Coherent HyperTransport™ Link XBAR XBAR

HT I/O

I/O I/O HT cHT cHT cHT cHT

Coherent HyperTransport ™ cHT cHT I/O I/O HT HT cHT cHT XBAR XBAR

CPU MCT CPU MCT SRQ SRQ

HT = HyperTransport™ technology DRAM DRAM

September 22, 2002 Hot Chips 14 4 MP Architecture

• Programming model of memory is effectively SMP – Physical address space is flat and fully coherent – Far to near memory latency ratio in a 4P system is designed to be < 1.4 – Latency difference between remote and local memory is comparable to the difference between a DRAM page hit and a DRAM page conflict – DRAM locations can be contiguous or interleaved – No processor affinity or NUMA tuning required

• MP support designed in from the beginning – Lower overall chip count results in outstanding system reliability – Memory Controller and XBAR operate at the processor frequency – Memory subsystem scale with frequency improvements

September 22, 2002 Hot Chips 14 5 MP Architecture (contd.)

• Integrated Memory Controller – 333 MHz 128-bit DRAM with up to 8 registered DIMMs – High-bandwidth (5.3 GB/s peak) and low-latency memory access – Snoop throughput scales with Processor frequency – Broadcast cache coherence protocol • Avoids serialization delay of directory based systems • Snooping the processors caches is overlapped with DRAM access

September 22, 2002 Hot Chips 14 6 HyperTransport™ Technology

• Screaming I/O for chip-to-chip communication – High bandwidth – Point-to-point links – Split transaction and full duplex – – Tunneling capability • Enables scalable 2-8 processor Cache-Coherent MP systems – Glueless MP • HyperTransport™ Links – Up to three 16-bit links (3.2 GB/s per direction) – Reduced pin count compared to the typical based systems – Compatible with high-volume PC board infrastructure – Each can be: • cHT: coherent (Processor-to-Processor) link or, • HT: non-coherent (Processor-to-I/O) link – For more info see: http://www.HyperTransport.org/

September 22, 2002 Hot Chips 14 7 High–Performance Implementation

32bits @ $0'Π$0'Π533Mhz +\SHU7UDQVSRUWΠ+\SHU7UDQVSRUWΠ$*3 $*3*UDSKLFV$*3*UDSKLFV 7XQQHO7XQQHO *%V

0+] +\SHU7UDQVSRUW ''5 5HJ %LW

*%V FRKHUHQW Œ +\SHU7UDQVSRUW $0'2SWHURQ Œ

$0'2SWHURQ $0'2SWHURQ$0'2SWHURQ %LW 5HJ ''5 0+] *%V +\SHU7UDQVSRUW

ELWV # ELWV # 0K] $0'Œ 0K] 1000 BaseT $0'Œ Gbit Ethernet 3&,; 3&,; +\SHU7UDQVSRUWŒ 3&,; +\SHU7UDQVSRUWŒ U320 SCSI +RW 3OXJ 3&,;7XQQHO3&,;7XQQHO +RW 3OXJ SCSI AC’97 0%V /HJDF\ 3&, USB 2.0 +\SHU7UDQVSRUW 70 $0' 70 $0' 70 ELWV # 0K] +\SHU7UDQVSRUW 70 Ethernet +\SHU7UDQVSRUW ,2+XE,2+XE FLASH EIDE LPC SIO SM Bus

September 22, 2002 Hot Chips 14 8 4-Way Server Implementation

0+] L H ''5 5HJ %LW *%V FRKHUHQW +\SHU7UDQVSRUWŒ

$0'2SWHURQΠ$0'2SWHURQ %LW 5HJ ''5 0+] *%V FRKHUHQW *%V FRKHUHQW +\SHU7UDQVSRUW +\SHU7UDQVSRUW

0+] L H ''5 5HJ %LW

$0'2SWHURQ $0'2SWHURQ *%V FRKHUHQW +\SHU7UDQVSRUW

*%V %LW 5HJ ''5 0+] +\SHU7UDQVSRUW Ethernet ELWV # ELWV # 0K] 1000 BaseT 0K] $0'Œ$0'Œ Gbit Ethernet +\SHU7UDQVSRUWŒ 3&,; 3&,; 3&,; +\SHU7UDQVSRUWŒ U320 SCSI +RW 3OXJ SCSI +RW 3OXJ 3&,;7XQQHO3&,;7XQQHO AC’97 /HJDF\ 3&, 0%V USB 2.0 70 +\SHU7UDQVSRUW $0' 70 $0' 70 ELWV # 0K] +\SHU7UDQVSRUW 70 Ethernet +\SHU7UDQVSRUW ,2+XE ,2+XE FLASH EIDE LPC SIO SM Bus September 22, 2002 Hot Chips 14 9 4P System — Board Layout

September 22, 2002 Hot Chips 14 10 8-Way Implementation

*%V *%V +\SHU7UDQVSRUW +\SHU7UDQVSRUW

0+] L H ''5 5HJ %LW

*%V FRKHUHQW

$0' 2SWHURQŒ+\SHU7UDQVSRUWŒ $0' 2SWHURQ

%LW 5HJ ''5 0+] *%V FRKHUHQW *%V FRKHUHQW

0+] +\SHU7UDQVSRUW +\SHU7UDQVSRUW ''5 5HJ %LW

$0' 2SWHURQ $0' 2SWHURQ %LW 5HJ ''5 0+] *%V FRKHUHQW *%V FRKHUHQW +\SHU7UDQVSRUW +\SHU7UDQVSRUW

0+] L H ''5 5HJ %LW

$0' 2SWHURQ $0' 2SWHURQ %LW 5HJ ''5 0+] *%V FRKHUHQW *%V FRKHUHQW +\SHU7UDQVSRUW +\SHU7UDQVSRUW

0+] L H ''5 5HJ %LW

$0' 2SWHURQ $0' 2SWHURQ *%V FRKHUHQW

+\SHU7UDQVSRUW %LW 5HJ ''5 0+] *%V *%V +\SHU7UDQVSRUW +\SHU7UDQVSRUW

September 22, 2002 Hot Chips 14 11 Local vs. Remote memory access

0-hop 1-hop 2-hops

P0 P1 P0 P1 P0 P1

P2 P3 P2 P3 P2 P3

• Local Memory Access (0-hop) • Remote1 Memory Access (1-hop) • Remote2 Memory Access (2-hops)

September 22, 2002 Hot Chips 14 12 Cache Coherence Protocol Read Transaction Example

Step 1 Memory 0 Memory 1

P0 P1

Read Cache Line

P2 P3

Memory 2 Memory 3

September 22, 2002 Hot Chips 14 13 Cache Coherence Protocol Read Transaction Example

Step 2 Memory 0 Memory 1

P0 P1 Read Cache Line

P2 P3

Memory 2 Memory 3

September 22, 2002 Hot Chips 14 14 Cache Coherence Protocol Read Transaction Example

Step 3 Memory 0 Memory 1

Read Cache Line Snoop Request P1 P0 P1 Snoop Request P0

Snoop Request P2

P2 P3

Memory 2 Memory 3

September 22, 2002 Hot Chips 14 15 Cache Coherence Protocol Read Transaction Example

Step 4 Memory 0 Memory 1

Snoop Response P0 P0 P1

P2 P3

Snoop Request P3

Memory 2 Memory 3

September 22, 2002 Hot Chips 14 16 Cache Coherence Protocol Read Transaction Example

Step 5 Memory 0 Memory 1

Read Response M0 P0 P1

Snoop Response P0

P2 P3

Snoop Response P2

Memory 2 Memory 3

September 22, 2002 Hot Chips 14 17 Cache Coherence Protocol Read Transaction Example

Step 6 Memory 0 Memory 1

Read Response M0 P0 P1

Snoop Response 1

P2 P3

Memory 2 Memory 3

September 22, 2002 Hot Chips 14 18 Cache Coherence Protocol Read Transaction Example

Step 7 Memory 0 Memory 1

P0 P1

Read Response M0

P2 P3

Memory 2 Memory 3

September 22, 2002 Hot Chips 14 19 Cache Coherence Protocol Read Transaction Example

Step 8 Memory 0 Memory 1

P0 P1

P2 P3

Memory 2 Memory 3

September 22, 2002 Hot Chips 14 20 Cache Coherence Protocol Read Transaction Example

Step 9 Memory 0 Memory 1

P0 P1

P2 P3

Source Done to M0

Memory 2 Memory 3

September 22, 2002 Hot Chips 14 21 Cache Coherence Protocol Read Transaction Example

Step 10 Memory 0 Memory 1

P0 P1

Source Done to M0

P2 P3

Memory 2 Memory 3

September 22, 2002 Hot Chips 14 22 2-way System Topology

I/O I/O

M0 P0 P1 M1

• System parameters – 16 DIMMs (up to 32 GB using 256Mb DRAM) – 2 HyperTransport links available for I/O – Bisection-bandwidth = 6.4 GB/s – Diameter = 1 hop

September 22, 2002 Hot Chips 14 23 4-way System Topology

I/O I/O

M0 P0 P1 M1

M2 P2 P3 M3

I/O I/O • System parameters – 32 DIMMs (up to 64 GB using 256Mb DRAM) – 4 HyperTransport links available for I/O – Bisection-bandwidth = 12.8 GB/s – Average-diameter = 1.33 Hops

September 22, 2002 Hot Chips 14 24 4-way System Topology (contd.)

I/O

M0 P0 P1 M1

M2 P2 P3 M3

I/O • System parameters – 32 DIMMs (up to 64 GB using 256Mb DRAM) – 2 HyperTransport links available for I/O – Bisection-bandwidth = 19.2 GB/s – Average-diameter = 1.17 Hops

September 22, 2002 Hot Chips 14 25 8-way System Topology

I/O I/O

P0 P1 P2

P3 P4

P5 P6 P7

I/O I/O • System parameters – 64 DIMMs (up to 128GB using 256Mb DRAM) – 4 HyperTransport links available for I/O – Bisection-bandwidth = 25.6 GB/s – Average-diameter = 1.71 hops September 22, 2002 Hot Chips 14 26 8-way System Topology (contd.)

I/O

P0 P1 P2

P3 P4

P5 P6 P7

• System parameters I/O – 64 DIMMs (up to 128GB using 256Mb DRAM) – 2 HyperTransport links available for I/O – Bisection-bandwidth = 32 GB/s – Average-diameter = 1.64 hops

September 22, 2002 Hot Chips 14 27 Scalability Beyond 8P

Interconnect Fabric

SW0 SW1 SW2 SW3 SW4 SW5 SW14 SW15

P0 P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P28 P29 P30 P31

• Scaling beyond 8P is enabled – External HyperTransport switch • Coherent Interconnect – Snoop filter – Data caching September 22, 2002 Hot Chips 14 28 The Rewards of Good Plumbing

• High Bandwidth – 2P system is designed to achieve 7 GB/s aggregate memory Read bandwidth – 4P system is designed to achieve 10 GB/s aggregate memory Read bandwidth • With data spread uniformly across the nodes

• Low Latency – Average 2P unloaded latency (page hit) is designed to be < 120 ns – Average 4P unloaded latency (page hit) is designed to be < 140 ns – Latency under load increases slowly due to excess Interconnect Bandwidth – Latency shrinks quickly with increasing CPU clock speed and HyperTransport link speed

September 22, 2002 Hot Chips 14 29 Trademark Attribution

AMD, the AMD Arrow Logo, AMD Opteron and combinations thereof are trademarks of Advanced Micro Devices, Inc. HyperTransport is a licensed trademark of the HyperTransport Consortium. Other product names used in this presentation are for identification purposes only and may be trademarks of their respective companies.

September 22, 2002 Hot Chips 14 30