AMD Opteron™ Shared Memory MP Systems Ardsher Ahmed Pat Conway Bill Hughes Fred Weber Agenda
Total Page:16
File Type:pdf, Size:1020Kb
AMD Opteron™ Shared Memory MP Systems Ardsher Ahmed Pat Conway Bill Hughes Fred Weber Agenda • Glueless MP systems • MP system configurations • Cache coherence protocol • 2-, 4-, and 8-way MP system topologies • Beyond 8-way MP systems September 22, 2002 Hot Chips 14 2 AMD Opteron™ Processor Architecture DRAM 5.3 GB/s 128-bit MCT CPU SRQ XBAR HT HT HT 3.2 GB/s per direction 3.2 GB/s per direction @ 0+]'DWD5DWH @ 0+]'DWD5DWH 3.2 GB/s per direction @ 0+]'DWD5DWH HT = HyperTransport™ technology September 22, 2002 Hot Chips 14 3 Glueless MP System DRAM DRAM MCT CPU MCT CPU SRQ SRQ non-Coherent HyperTransport™ Link XBAR XBAR HT I/O I/O I/O HT cHT cHT cHT cHT Coherent HyperTransport ™ cHT cHT I/O I/O HT HT cHT cHT XBAR XBAR CPU MCT CPU MCT SRQ SRQ HT = HyperTransport™ technology DRAM DRAM September 22, 2002 Hot Chips 14 4 MP Architecture • Programming model of memory is effectively SMP – Physical address space is flat and fully coherent – Far to near memory latency ratio in a 4P system is designed to be < 1.4 – Latency difference between remote and local memory is comparable to the difference between a DRAM page hit and a DRAM page conflict – DRAM locations can be contiguous or interleaved – No processor affinity or NUMA tuning required • MP support designed in from the beginning – Lower overall chip count results in outstanding system reliability – Memory Controller and XBAR operate at the processor frequency – Memory subsystem scale with frequency improvements September 22, 2002 Hot Chips 14 5 MP Architecture (contd.) • Integrated Memory Controller – 333 MHz 128-bit DRAM interface with up to 8 registered DIMMs – High-bandwidth (5.3 GB/s peak) and low-latency memory access – Snoop throughput scales with Processor frequency – Broadcast cache coherence protocol • Avoids serialization delay of directory based systems • Snooping the processors caches is overlapped with DRAM access September 22, 2002 Hot Chips 14 6 HyperTransport™ Technology • Screaming I/O for chip-to-chip communication – High bandwidth – Point-to-point links – Split transaction and full duplex – Differential Signaling – Tunneling capability • Enables scalable 2-8 processor Cache-Coherent MP systems – Glueless MP • HyperTransport™ Links – Up to three 16-bit links (3.2 GB/s per direction) – Reduced pin count compared to the typical Bus based systems – Compatible with high-volume PC board infrastructure – Each can be: • cHT: coherent (Processor-to-Processor) link or, • HT: non-coherent (Processor-to-I/O) link – For more info see: http://www.HyperTransport.org/ September 22, 2002 Hot Chips 14 7 High–Performance Workstation Implementation 32bits @ $0' $0' 533Mhz +\SHU7UDQVSRUW +\SHU7UDQVSRUW $*3 $*3*UDSKLFV$*3*UDSKLFV 7XQQHO7XQQHO *%V 0+] +\SHU7UDQVSRUW %LW 5HJ ''5 *%V FRKHUHQW +\SHU7UDQVSRUW $0'2SWHURQ $0'2SWHURQ $0'2SWHURQ$0'2SWHURQ %LW 5HJ ''5 0+] *%V +\SHU7UDQVSRUW ELWV # ELWV # Ethernet 0K] $0' 0K] 1000 BaseT $0' Gbit Ethernet 3&,; 3&,; +\SHU7UDQVSRUW 3&,; +\SHU7UDQVSRUW U320 SCSI +RW 3OXJ 3&,;7XQQHO3&,;7XQQHO +RW 3OXJ SCSI AC’97 0%V /HJDF\ 3&, USB 2.0 +\SHU7UDQVSRUW 70 $0' 70 $0' 70 ELWV # 0K] +\SHU7UDQVSRUW 70 Ethernet +\SHU7UDQVSRUW ,2+XE,2+XE FLASH EIDE LPC SIO SM Bus September 22, 2002 Hot Chips 14 8 4-Way Server Implementation 0+] %LW 5HJ ''5 *%V FRKHUHQW +\SHU7UDQVSRUW $0'2SWHURQ $0'2SWHURQ %LW 5HJ ''5 0+] *%V FRKHUHQW *%V FRKHUHQW +\SHU7UDQVSRUW +\SHU7UDQVSRUW 0+] %LW 5HJ ''5 $0'2SWHURQ $0'2SWHURQ *%V FRKHUHQW +\SHU7UDQVSRUW *%V %LW 5HJ ''5 0+] +\SHU7UDQVSRUW Ethernet ELWV # ELWV # 0K] 1000 BaseT 0K] $0'$0' Gbit Ethernet +\SHU7UDQVSRUW 3&,; 3&,; 3&,; +\SHU7UDQVSRUW U320 SCSI +RW 3OXJ SCSI +RW 3OXJ 3&,;7XQQHO3&,;7XQQHO AC’97 /HJDF\ 3&, 0%V USB 2.0 70 +\SHU7UDQVSRUW $0' 70 $0' 70 ELWV # 0K] +\SHU7UDQVSRUW 70 Ethernet +\SHU7UDQVSRUW ,2+XE ,2+XE FLASH EIDE LPC SIO SM Bus September 22, 2002 Hot Chips 14 9 4P System — Board Layout September 22, 2002 Hot Chips 14 10 8-Way Implementation *%V *%V +\SHU7UDQVSRUW +\SHU7UDQVSRUW 0+] %LW 5HJ ''5 *%V FRKHUHQW $0' 2SWHURQ+\SHU7UDQVSRUW $0' 2SWHURQ %LW 5HJ ''5 0+] *%V FRKHUHQW *%V FRKHUHQW 0+] +\SHU7UDQVSRUW +\SHU7UDQVSRUW %LW 5HJ ''5 $0' 2SWHURQ $0' 2SWHURQ %LW 5HJ ''5 0+] *%V FRKHUHQW *%V FRKHUHQW +\SHU7UDQVSRUW +\SHU7UDQVSRUW 0+] %LW 5HJ ''5 $0' 2SWHURQ $0' 2SWHURQ %LW 5HJ ''5 0+] *%V FRKHUHQW *%V FRKHUHQW +\SHU7UDQVSRUW +\SHU7UDQVSRUW 0+] %LW 5HJ ''5 $0' 2SWHURQ $0' 2SWHURQ *%V FRKHUHQW +\SHU7UDQVSRUW %LW 5HJ ''5 0+] *%V *%V +\SHU7UDQVSRUW +\SHU7UDQVSRUW September 22, 2002 Hot Chips 14 11 Local vs. Remote memory access 0-hop 1-hop 2-hops P0 P1 P0 P1 P0 P1 P2 P3 P2 P3 P2 P3 • Local Memory Access (0-hop) • Remote1 Memory Access (1-hop) • Remote2 Memory Access (2-hops) September 22, 2002 Hot Chips 14 12 Cache Coherence Protocol Read Transaction Example Step 1 Memory 0 Memory 1 P0 P1 Read Cache Line P2 P3 Memory 2 Memory 3 September 22, 2002 Hot Chips 14 13 Cache Coherence Protocol Read Transaction Example Step 2 Memory 0 Memory 1 P0 P1 Read Cache Line P2 P3 Memory 2 Memory 3 September 22, 2002 Hot Chips 14 14 Cache Coherence Protocol Read Transaction Example Step 3 Memory 0 Memory 1 Read Cache Line Snoop Request P1 P0 P1 Snoop Request P0 Snoop Request P2 P2 P3 Memory 2 Memory 3 September 22, 2002 Hot Chips 14 15 Cache Coherence Protocol Read Transaction Example Step 4 Memory 0 Memory 1 Snoop Response P0 P0 P1 P2 P3 Snoop Request P3 Memory 2 Memory 3 September 22, 2002 Hot Chips 14 16 Cache Coherence Protocol Read Transaction Example Step 5 Memory 0 Memory 1 Read Response M0 P0 P1 Snoop Response P0 P2 P3 Snoop Response P2 Memory 2 Memory 3 September 22, 2002 Hot Chips 14 17 Cache Coherence Protocol Read Transaction Example Step 6 Memory 0 Memory 1 Read Response M0 P0 P1 Snoop Response 1 P2 P3 Memory 2 Memory 3 September 22, 2002 Hot Chips 14 18 Cache Coherence Protocol Read Transaction Example Step 7 Memory 0 Memory 1 P0 P1 Read Response M0 P2 P3 Memory 2 Memory 3 September 22, 2002 Hot Chips 14 19 Cache Coherence Protocol Read Transaction Example Step 8 Memory 0 Memory 1 P0 P1 P2 P3 Memory 2 Memory 3 September 22, 2002 Hot Chips 14 20 Cache Coherence Protocol Read Transaction Example Step 9 Memory 0 Memory 1 P0 P1 P2 P3 Source Done to M0 Memory 2 Memory 3 September 22, 2002 Hot Chips 14 21 Cache Coherence Protocol Read Transaction Example Step 10 Memory 0 Memory 1 P0 P1 Source Done to M0 P2 P3 Memory 2 Memory 3 September 22, 2002 Hot Chips 14 22 2-way System Topology I/O I/O M0 P0 P1 M1 • System parameters – 16 DIMMs (up to 32 GB using 256Mb DRAM) – 2 HyperTransport links available for I/O – Bisection-bandwidth = 6.4 GB/s – Diameter = 1 hop September 22, 2002 Hot Chips 14 23 4-way System Topology I/O I/O M0 P0 P1 M1 M2 P2 P3 M3 I/O I/O • System parameters – 32 DIMMs (up to 64 GB using 256Mb DRAM) – 4 HyperTransport links available for I/O – Bisection-bandwidth = 12.8 GB/s – Average-diameter = 1.33 Hops September 22, 2002 Hot Chips 14 24 4-way System Topology (contd.) I/O M0 P0 P1 M1 M2 P2 P3 M3 I/O • System parameters – 32 DIMMs (up to 64 GB using 256Mb DRAM) – 2 HyperTransport links available for I/O – Bisection-bandwidth = 19.2 GB/s – Average-diameter = 1.17 Hops September 22, 2002 Hot Chips 14 25 8-way System Topology I/O I/O P0 P1 P2 P3 P4 P5 P6 P7 I/O I/O • System parameters – 64 DIMMs (up to 128GB using 256Mb DRAM) – 4 HyperTransport links available for I/O – Bisection-bandwidth = 25.6 GB/s – Average-diameter = 1.71 hops September 22, 2002 Hot Chips 14 26 8-way System Topology (contd.) I/O P0 P1 P2 P3 P4 P5 P6 P7 • System parameters I/O – 64 DIMMs (up to 128GB using 256Mb DRAM) – 2 HyperTransport links available for I/O – Bisection-bandwidth = 32 GB/s – Average-diameter = 1.64 hops September 22, 2002 Hot Chips 14 27 Scalability Beyond 8P Interconnect Fabric SW0 SW1 SW2 SW3 SW4 SW5 SW14 SW15 P0 P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P28 P29 P30 P31 • Scaling beyond 8P is enabled – External HyperTransport switch • Coherent Interconnect – Snoop filter – Data caching September 22, 2002 Hot Chips 14 28 The Rewards of Good Plumbing • High Bandwidth – 2P system is designed to achieve 7 GB/s aggregate memory Read bandwidth – 4P system is designed to achieve 10 GB/s aggregate memory Read bandwidth • With data spread uniformly across the nodes • Low Latency – Average 2P unloaded latency (page hit) is designed to be < 120 ns – Average 4P unloaded latency (page hit) is designed to be < 140 ns – Latency under load increases slowly due to excess Interconnect Bandwidth – Latency shrinks quickly with increasing CPU clock speed and HyperTransport link speed September 22, 2002 Hot Chips 14 29 Trademark Attribution AMD, the AMD Arrow Logo, AMD Opteron and combinations thereof are trademarks of Advanced Micro Devices, Inc. HyperTransport is a licensed trademark of the HyperTransport Consortium. Other product names used in this presentation are for identification purposes only and may be trademarks of their respective companies. September 22, 2002 Hot Chips 14 30.