Architecture of Parallel Computers CSC / ECE 506

Summer 2006

Scalable Multiprocessors Lecture 10

6/19/2006

Dr. Steve Hunter

What is a Multiprocessor?

• A collection of communicating processors
  – Goals: balance load, reduce inherent communication and extra work

• A multi-cache, multi-memory system
  – Role of these components essential regardless of programming model
  – Programming model and communication abstraction affect specific performance tradeoffs

[Figure: processors (Proc) with caches and node controllers connected by an interconnect]

Scalable Multiprocessors

• Study of machines that scale from hundreds to thousands of processors.

• Scalability has implications at all levels of system design and all aspects must scale

• Areas emphasized in text:
  – Memory bandwidth must scale with number of processors
  – Communication network must provide scalable bandwidth at reasonable latency
  – Protocols used for transferring data and synchronization techniques must scale

• A scalable system attempts to avoid inherent design limits on the extent to which resources can be added to the system. For example:
  – How does the bandwidth/throughput of the system scale when adding processors?
  – How does the latency or time per operation increase?
  – How does the cost of the system increase?
  – How are the systems packaged?

Scalable Multiprocessors

• Basic metrics affecting the scalability of a computer system from an application perspective are (Hwang 93):
  – Machine size: the number of processors
  – Clock rate: determines the basic machine cycle
  – Problem size: the amount of computational workload, or the number of data points
  – CPU time: the actual CPU time in seconds
  – I/O demand: the input/output demand in moving the program, data, and results
  – Memory capacity: the amount of main memory used in a program execution
  – Communication overhead: the amount of time spent on interprocessor communication, synchronization, remote access, etc.
  – Computer cost: the total cost of hardware and software resources required to execute a program
  – Programming overhead: the development overhead associated with an application program

• Power (watts) and cooling are also becoming inhibitors to scalability

Scalable Multiprocessors

• Some other recent trends:
  – Multi-core processors on a single socket
  – Reduced focus on increasing the processor clock rate
  – System-on-Chip (SoC) designs combining processor cores, integrated interconnect, cache, high-performance I/O, etc.
  – Geographically distributed applications utilizing Grid and HPC technologies
  – Standardization on high-performance interconnects (e.g., Infiniband, Ethernet) and a focus by the Ethernet community on reducing latency
  – For example, Force10's recently announced 10Gb Ethernet switch:

» The S2410 data center switch has set industry benchmarks for 10 Gigabit price and latency
» Designed for high-performance clusters, 10 Gigabit Ethernet connectivity to the server, and Ethernet-based storage solutions, the S2410 supports 24 line-rate 10 Gigabit Ethernet ports with an ultra-low switching latency of 300 nanoseconds at an industry-leading price point
» The S2410 eliminates the need to integrate Infiniband or proprietary technologies into the data center and opens the high-performance storage market to 10 Gigabit Ethernet technology. Standardizing on 10 Gigabit Ethernet in the data center core, edge, and storage radically simplifies management and reduces total network cost

Bandwidth Scalability

[Figure: typical switch organizations – a bus, a crossbar, and multiplexers connecting processors (P) and memories (M)]

• What fundamentally limits bandwidth?
  – Number of wires and clock rate
• Must have many independent wires or a high clock rate
• Connectivity through a bus or switches
• A rough comparison of aggregate bandwidth for a bus versus a crossbar is sketched below
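To make the wires-and-clock limit concrete, here is a minimal, illustrative model (all parameter values are invented, not taken from the lecture) comparing the aggregate bandwidth of a shared bus with that of a crossbar as the processor count grows.

```c
/* Rough, illustrative model of aggregate bandwidth limits: a shared bus
 * offers roughly (width * clock) bytes/s total, independent of processor
 * count, while a p-port crossbar can move (width * clock) bytes/s per
 * port, so its aggregate grows with p. All parameters are made up. */
#include <stdio.h>

int main(void)
{
    double width_bytes = 8.0;      /* assumed data-path width: 64 wires   */
    double clock_hz    = 400e6;    /* assumed 400 MHz interconnect clock  */
    double per_link    = width_bytes * clock_hz;   /* bytes/s on one link */

    for (int p = 4; p <= 64; p *= 2) {
        double bus_total      = per_link;       /* one shared medium */
        double crossbar_total = per_link * p;   /* one link per port */
        printf("p=%2d  bus: %5.1f GB/s total (%.2f GB/s per proc)"
               "  crossbar: %6.1f GB/s total\n",
               p, bus_total / 1e9, bus_total / p / 1e9,
               crossbar_total / 1e9);
    }
    return 0;
}
```

The point is only the trend: the bus total stays flat (so per-processor bandwidth shrinks), while the crossbar total grows with the number of ports.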

Some Memory Models

[Figures: three memory organizations – a shared (interleaved) first-level cache reached through a switch; centralized shared memory ("dance hall", UMA), where processors with private caches reach interleaved main memory across an interconnection network; and distributed memory (NUMA), where each processor has its own cache and local memory and nodes are joined by an interconnection network]

Generic Distributed Memory Organization

[Figure: a scalable network of switches; each node contains a processor (P), cache ($), memory (M), and communication assist (CA)]

• Network bandwidth requirements?
  – For independent processes?
  – For communicating processes?
• Latency?
• A simple linear cost model for message latency is sketched below
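One common way to reason about these questions is a linear communication-cost model, time = overhead + message size / bandwidth + hops × per-hop delay. The sketch below is illustrative only; the overhead, bandwidth, and per-hop values are made up.

```c
/* Illustrative linear cost model for one n-byte message:
 *   T(n) = overhead + n / bandwidth + hops * per_hop_delay
 * The parameter values are invented for illustration, not measured. */
#include <stdio.h>

double msg_time_us(double n_bytes, double overhead_us,
                   double bw_bytes_per_us, int hops, double hop_delay_us)
{
    return overhead_us + n_bytes / bw_bytes_per_us + hops * hop_delay_us;
}

int main(void)
{
    /* assumed: 2 us software overhead, 500 MB/s link, 50 ns per hop */
    for (int hops = 1; hops <= 8; hops *= 2)
        printf("8 KB message over %d hops: %.2f us\n",
               hops, msg_time_us(8192, 2.0, 500.0, hops, 0.05));
    return 0;
}
```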

Some Examples

AMD Processor Technology

AMD Opteron Architecture

• AMD Opteron™ Processor Key Architectural Features
  – Single-Core and Dual-Core AMD Opteron processors
  – Direct Connect Architecture
  – Integrated DDR DRAM memory controller
  – HyperTransport™ Technology
  – Low power

AMD Opteron Architecture

• Direct Connect Architecture
  – Addresses and helps reduce the real challenges and bottlenecks of system architectures
  – Memory is directly connected to the CPU, optimizing memory performance
  – I/O is directly connected to the CPU for more balanced throughput and I/O
  – CPUs are connected directly to CPUs, allowing for more linear symmetric multiprocessing
• Integrated DDR DRAM Memory Controller
  – Changes the way the processor accesses main memory, resulting in increased bandwidth, reduced memory latencies, and increased processor performance
  – Available memory bandwidth scales with the number of processors
  – 128-bit wide integrated DDR DRAM memory controller capable of supporting up to eight (8) registered DDR DIMMs per processor
  – Available memory bandwidth up to 6.4 GB/s (with PC3200) per processor
• HyperTransport™ Technology
  – Provides a scalable bandwidth interconnect between processors, I/O subsystems, and other chipsets
  – Support for up to three (3) coherent HyperTransport links, providing up to 24.0 GB/s peak bandwidth per processor
  – Up to 8.0 GB/s bandwidth per link, providing sufficient bandwidth for supporting new interconnects including PCI-X, DDR, InfiniBand, and 10G Ethernet
  – Offers low power consumption (1.2 volts) to help reduce a system's thermal budget
• A back-of-the-envelope check of these bandwidth figures is sketched below
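The bandwidth figures above can be sanity-checked with simple arithmetic. The sketch below assumes PC3200 means 400 MT/s across the 128-bit memory controller, and that a HyperTransport link is 16 bits wide per direction at 2000 MT/s; these widths and rates are assumptions chosen to be consistent with the slide, not quotes from an AMD datasheet.

```c
/* Back-of-the-envelope check of the Opteron bandwidth figures above.
 * Assumed parameters (not taken from an AMD datasheet): PC3200 memory
 * runs at 400 MT/s, and a HyperTransport link is 16 bits wide per
 * direction at 2000 MT/s. */
#include <stdio.h>

int main(void)
{
    double mem_bw   = 400e6 * (128 / 8);  /* 400 MT/s * 16 B = 6.4 GB/s */
    double ht_dir   = 2000e6 * (16 / 8);  /* 2000 MT/s * 2 B = 4.0 GB/s */
    double ht_link  = 2 * ht_dir;         /* both directions = 8.0 GB/s */
    double ht_total = 3 * ht_link;        /* three links    = 24.0 GB/s */

    printf("memory: %.1f GB/s, HT link: %.1f GB/s, 3 links: %.1f GB/s\n",
           mem_bw / 1e9, ht_link / 1e9, ht_total / 1e9);
    return 0;
}
```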

AMD Processor Architecture

• Low-Power Processors
  – The AMD Opteron processor HE offers industry-leading performance per watt, making it an ideal solution for rack-dense 1U servers or blades in datacenter environments, as well as cooler, quieter workstation designs
  – The AMD Opteron processor EE provides the maximum I/O bandwidth currently available in a single-CPU controller, making it a good fit for embedded controllers in markets such as NAS and SAN

• Other features of the AMD Opteron processor include:
  – 64-bit wide key data and address paths, with a 48-bit virtual address space and a 40-bit physical address space (the implied address-space sizes are computed below)
  – ECC (Error Correcting Code) protection for L1 cache data, L2 cache data and tags, and DRAM, with hardware scrubbing of all ECC-protected arrays
  – 90nm SOI (silicon-on-insulator) process technology for lower thermal output levels and improved frequency scaling
  – Support for all instructions necessary to be fully compatible with SSE2 technology
  – Two (2) additional pipeline stages (compared to AMD's seventh-generation architecture) for increased performance and frequency scalability
  – Higher IPC (Instructions Per Clock) achieved through additional key features, such as larger TLBs (Translation Look-aside Buffers), flush filters, and an enhanced branch prediction algorithm
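For reference, the address-space widths above translate into sizes as follows (straightforward powers of two):

```c
/* Sizes of the Opteron's address spaces implied by the widths above. */
#include <stdio.h>

int main(void)
{
    unsigned long long virt = 1ULL << 48;   /* 48-bit virtual  = 256 TiB */
    unsigned long long phys = 1ULL << 40;   /* 40-bit physical = 1 TiB   */
    printf("virtual:  %llu bytes (256 TiB)\n", virt);
    printf("physical: %llu bytes (1 TiB)\n", phys);
    return 0;
}
```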

AMD vs Intel

• Performance
  – SPECint® rate2000: the Dual-Core AMD Opteron processor Model 280 outperforms the dual-core Xeon 2.8 GHz processor by 28 percent
  – SPECfp® rate2000: the Dual-Core AMD Opteron processor Model 280 outperforms the dual-core Xeon 2.8 GHz processor by 76 percent
  – SPECjbb®2005: the Dual-Core AMD Opteron processor Model 280 outperforms the dual-core Xeon 2.8 GHz by 13 percent

• Processor Power (Watts)
  – Dual-Core AMD Opteron™ processors, at 95 watts, consume far less than the competition's dual-core x86 server processors, which, according to their published data, have a thermal design power of 135 watts and a maximum power draw of 150 watts
  – Can result in 200 percent better performance-per-watt than the competition
  – Even greater performance-per-watt can be achieved with the lower-power (55-watt) processors

IBM POWER Processor Technology

IBM POWER4+ Processor Architecture


• Two processor cores on one chip, as shown
• The clock frequency of the POWER4+ is 1.5–1.9 GHz
• The L2 cache modules are connected to the processors by the Core Interface Unit (CIU) switch, a 2×3 crossbar with a bandwidth of 40 B/cycle per port
  – This enables the CIU to ship 32 B to either the L1 instruction cache or the data cache of each processor and to store 8 B values at the same time
• For each processor there is also a Non-cacheable Unit that interfaces with the Fabric Controller and takes care of non-cacheable operations
• The Fabric Controller is responsible for communication with the three other chips embedded in the same Multi-Chip Module (MCM), with the L3 cache, and with other MCMs
  – The bandwidths at 1.7 GHz are 13.6, 9.0, and 6.8 GB/s, respectively (see the check below)
• The chip also contains the L3 cache directory and the L3 and memory controllers, which should bring down the off-chip latency considerably
• The GX Controller is responsible for traffic on the GX bus, which transports data to/from the system and in practice is used for I/O
• The maximum size of the L3 cache is 32 MB
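A quick check of the quoted bandwidths: the per-cycle bus widths below are inferred by dividing the quoted GB/s figures by the 1.7 GHz clock, so they are working assumptions rather than IBM specifications.

```c
/* Sanity check of the POWER4+ bandwidth figures quoted above. The
 * per-cycle widths are inferred from the quoted bandwidths, not from
 * IBM documentation. The 9.0 GB/s L3 figure does not correspond to a
 * whole number of bytes per cycle, so it is omitted here. */
#include <stdio.h>

int main(void)
{
    double clk = 1.7e9;   /* processor clock (Hz) */
    printf("CIU port:      %.1f GB/s (40 B/cycle: 32 B load + 8 B store)\n",
           40 * clk / 1e9);
    printf("intra-MCM bus: %.1f GB/s (8 B/cycle)\n", 8 * clk / 1e9);
    printf("inter-MCM bus: %.1f GB/s (4 B/cycle)\n", 4 * clk / 1e9);
    return 0;
}
```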

IBM POWER5 Processor Architecture


• Like the POWER4(+), the POWER5 has two processor cores on a chip
• The clock frequency of the POWER5 is 1.9 GHz
• Because of the higher density on the chip (the POWER5 is built in 130 nm technology instead of the 180 nm used for the POWER4+), more devices could be placed on the chip, and they could also be enlarged
• The L2 caches of two neighboring chips are connected, and the L3 caches are directly connected to the L2 caches
• Both are larger than their respective counterparts on the POWER4: 1.875 MB versus 1.5 MB for the L2 cache and 36 MB versus 32 MB for the L3 cache
• In addition, the L3 cache latency has dropped from about 120 cycles to 80 cycles. The associativity of the caches has also improved: from 2-way to 4-way for the L1 cache, from 8-way to 10-way for the L2 cache, and from 8-way to 12-way for the L3 cache
• A big difference is also the improved bandwidth from memory to the chip: it has increased from 4 GB/s for the POWER4+ to approximately 16 GB/s for the POWER5

Intel (Future) Processor Technology

DP Server Architecture

[Figure: Bensley DP server platform with the Blackford chipset – "it's all about bandwidth and latency." Front-side bus scaling from 800 MHz to 1067 MHz and 1333 MHz; large shared caches; point-to-point interconnect; central coherency resolution; 64 GB memory capacity; easy expansion; 17 GB/s sustained and balanced throughput; consistent local and remote memory latencies. Tagline: constantly analyzing the requirements, the technologies, and the tradeoffs (performance vs. energy).]

Energy Efficient Performance – High End

• Datacenter "energy label" – computational efficiency:
  – NASA Columbia (Itanium 2, 10,240 CPUs, 2 MWatt, 60 TFlops goal, $50M; source: NASA): 30,720 Flops/Watt, 1,288 Flops/Dollar
  – ASC Purple (POWER5, 12K+ CPUs, 6 MWatt, 100 TFlops goal, $230M; source: LLNL): 17,066 Flops/Watt, 467 Flops/Dollar
  – These efficiency figures are recomputed in the sketch below
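The computational-efficiency figures follow from dividing the peak-performance goal by the power and cost figures; the recomputation below suggests the slide's units are kFlops/Watt and kFlops/Dollar (the small differences come from using the round 60 and 100 TFlops goals rather than exact peak numbers).

```c
/* Recompute the "computational efficiency" figures from the slide's
 * peak-performance goals, power, and cost. The division suggests the
 * slide's numbers are in kFlops/Watt and kFlops/Dollar. */
#include <stdio.h>

static void efficiency(const char *name, double tflops, double mwatt,
                       double cost_musd)
{
    double flops = tflops * 1e12;
    printf("%s: %.0f kFlops/Watt, %.0f kFlops/Dollar\n", name,
           flops / (mwatt * 1e6) / 1e3, flops / (cost_musd * 1e6) / 1e3);
}

int main(void)
{
    efficiency("NASA Columbia", 60.0, 2.0, 50.0);    /* ~30,000 and ~1,200 */
    efficiency("ASC Purple",   100.0, 6.0, 230.0);   /* ~16,700 and ~435   */
    return 0;
}
```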

Core™ Microarchitecture Advances With Quad Core

[Figure: DP performance-per-watt comparison using SPECint_rate at the platform level (source: Intel) – energy-efficient performance roadmap. Server: Irwindale (1X, H1 '05), Paxville DP (H2 '05), Dempsey MV (2X, H1 '06), Woodcrest (3X, H2 '06), quad-core Clovertown (4X, H1 '07). Desktop: Kentsfield.]

Woodcrest for Servers

• Performance: 80% higher
• Power: 35% lower
• …relative to the Intel® Xeon® 2.8 GHz 2x2MB (source: Intel, based on estimated SPECint*_rate_base2000 and thermal design power)

Multi-Core Energy-Efficient Performance

[Figure: relative performance and power versus single-core frequency and Vcc –
  – Over-clocked (+20%): 1.13x performance, 1.73x power
  – Max frequency: 1.00x performance, 1.00x power
  – Dual-core (-20%): 1.73x performance, 1.02x power]

Intel Multi-Core Trajectory

[Figure: Intel multi-core trajectory – dual-core in 2006, quad-core in 2007.]

Blade Architectures – General

[Figure: multiple blade servers connected by a common interconnect]

• Blades interconnected by common fabrics
  – Infiniband, Ethernet, and Fibre Channel are most common
  – Redundant interconnect available for failover
  – Links from the interconnect provide external connectivity
• Each blade contains multiple processors, memory, and network interfaces
  – Some options may exist, such as for memory, network connectivity, etc.
• Power, cooling, and management overhead optimized within the chassis
  – Multiple chassis connected together for greater numbers of nodes

IBM BladeCenter H Architecture

[Figure: BladeCenter H chassis – Blades 1 through 14 connect internally to four high-speed switch modules (HS Switch 1–4), standard switch modules / I/O bridges (Switch Module 1–2, I/O Bridge/SM 3–4), and two management modules.]

• High-speed switches: Ethernet or Infiniband
  – 4x (16-wire) blade links
  – 4x (16-wire) bridge links
  – 1x (4-wire) management links
  – Uplinks: up to 12x links for IB and at least four 10Gb links for Ethernet
• Switch modules / I/O bridges: e.g., Ethernet, Fibre Channel, pass-through
  – Dual 4x (16-wire) wiring internally to each high-speed switch module (HSSM)

IBM BladeCenter H Architecture

[Figure: several chassis (Blades 1–14 each) connected by an external interconnect]

• External high-performance interconnect(s) for multiple chassis
• Independent scaling of blades and I/O
• Scales for large clusters
• Architecture used for the Barcelona Supercomputing Center (MareNostrum, #8)

Cray (Octigabay) Blade Architecture

[Figure: Cray (OctigaBay) blade – two Opteron processors joined by a 6.4 GB/s HyperTransport (HT) link, each with 5.4 GB/s of DDR 333 memory and a Rapid Array Communications Processor (RAP) providing 8 GB/s per RAP link; the RAP includes MPI hardware offload capabilities, and an FPGA is available for application offload.]

• MPI offloaded in hardware: 2900 MB/s throughput and 1.6 µs latency (a generic way to measure such figures is sketched below)
• The processor-to-communication interface is HyperTransport
• Dedicated link and communication chip per processor
• FPGA accelerator available for additional offload
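Throughput and latency figures like these are usually obtained with an MPI ping-pong microbenchmark. The sketch below is a generic illustration of such a measurement, not Cray's benchmark code: small messages approximate latency, large messages approximate sustained throughput.

```c
/* Generic MPI ping-pong sketch (illustrative, not Cray's benchmark).
 * Rank 0 sends a buffer to rank 1 and waits for it to come back;
 * half the round-trip time approximates the one-way time. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* One-way time for `bytes`-sized messages, averaged over `reps` round trips. */
static double one_way_time(char *buf, int bytes, int reps, int rank)
{
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    return (MPI_Wtime() - t0) / (2.0 * reps);
}

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(1 << 20);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double lat  = one_way_time(buf, 8, 1000, rank);       /* small messages */
    double bw_t = one_way_time(buf, 1 << 20, 100, rank);  /* 1 MB messages  */

    if (rank == 0)
        printf("latency: %.2f us, throughput: %.0f MB/s\n",
               lat * 1e6, (1 << 20) / bw_t / 1e6);

    MPI_Finalize();
    free(buf);
    return 0;
}
```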

Cray Blade Architecture

Blade Characteristics
• Two 2.2 GHz Opteron processors
  – Dedicated memory per processor
• Two Rapid Array Communication Processors
  – One dedicated link each
  – One redundant link each
• Application accelerator FPGA
• Local hard drive

Chassis Board Options / Shelf Characteristics

[Figure: chassis board – monitoring & control ASIC, system power, four PCI-Express slots, SATA via an 8111 HT/SATA bridge, PCI via two 8131 HT/PCI bridges (each attaches to one blade), and a switch plus an optional second switch card.]

• One or two IB 4x switches
• Twelve or twenty-four external links
• Additional I/O:
  – Three high-speed I/O links
  – Four PCI-X bus slots
  – 100 Mb Ethernet for management
• Active management system

Cray Blade Architecture

[Figure: Cray blade and shelf interconnect – each blade (two Opterons with 5.4 GB/s DDR 333 memory each and a 6.4 GB/s HyperTransport link between them, two RAPs with MPI offload at 8 GB/s per RAP link, and an FPGA accelerator) connects to the Rapid Array Interconnect (24 x 24 IB 4x switch); the shelf also provides 100 Mb Ethernet, high-speed I/O, PCI-X, and an active management system.]

• Six blades per 3U shelf
• Twelve 4x IB external links for the primary switch
• An additional twelve links are available with the optional redundant switch

Cray Blade Architecture

[Figure: multiple shelves connected by an interconnect]

• With up to 24 external links per OctigaBay 12K shelf, a variety of configurations can be achieved depending on the applications
• OctigaBay suggests interconnecting shelves by meshes, tori, fat trees, and fully connected topologies for systems that fit in one rack. Fat-tree configurations require extra switches, which OctigaBay terms "spine switches."
• Mellanox Infiniband technology is used for the interconnect
• Up to 25 shelves can be directly connected, yielding a 300-Opteron system

IBM BlueGene/L Architecture – Compute Card

• The BlueGene/L is the first in a new generation of systems made by IBM for massively parallel computing
• The individual speed of the processor has been traded in favor of very dense packaging and low power consumption per processor. The basic processor in the system is a modified PowerPC 440 at 700 MHz
• Two of these processors reside on a chip, together with 4 MB of shared L3 cache and a 2 KB L2 cache for each of the processors. The processors have two load ports and one store port from/to the L2 caches at 8 bytes/cycle, which is half the bandwidth required by the two floating-point units (FPUs) and as such quite high
• Each CPU has 32 KB of instruction cache and 32 KB of data cache on board. In favorable circumstances a CPU can deliver a peak speed of 2.8 Gflop/s because the two FPUs can perform fused multiply-add operations (see the check below). Note that the L2 cache is smaller than the L1 cache, which is quite unusual but allows it to be fast
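The 2.8 Gflop/s peak follows directly from the clock rate and the FPU configuration, assuming each of the two FPUs completes one fused multiply-add (two flops) per cycle:

```c
/* Peak per-CPU flop rate implied by the figures above: two FPUs, each
 * completing one fused multiply-add (2 flops) per 700 MHz cycle. */
#include <stdio.h>

int main(void)
{
    double clock_hz = 700e6;
    double peak = clock_hz * 2 /* FPUs */ * 2 /* flops per FMA */;
    printf("peak per CPU: %.1f Gflop/s\n", peak / 1e9);   /* 2.8 */
    return 0;
}
```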

IBM BlueGene/L Architecture

IBM BlueGene/L Overview

• BlueGene/L boasts a peak speed of over 360 teraOPS, a total memory of 32 tebibytes, total power of 1.5 megawatts, and a machine floor space of 2,500 square feet. The full system has 65,536 dual-processor compute nodes (a quick consistency check of these numbers appears below). Multiple communication networks enable extreme application scaling:
  – Nodes are configured as a 32 x 32 x 64 3D torus; each node is connected in six different directions for nearest-neighbor communication
  – A global reduction tree supports fast global operations such as a global max/sum in a few microseconds over 65,536 nodes
  – Multiple global barrier and interrupt networks allow fast synchronization of tasks across the entire machine within a few microseconds
  – 1,024 gigabit-per-second links to a global parallel file system support fast input/output to disk
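These headline numbers are consistent with the per-node figures from the previous slide; the quick check below multiplies out the torus dimensions and the per-CPU peak.

```c
/* Quick consistency check of the system-level numbers quoted above. */
#include <stdio.h>

int main(void)
{
    int nodes = 32 * 32 * 64;            /* 3D torus dimensions          */
    double peak = nodes * 2 * 2.8e9;     /* 2 CPUs/node at 2.8 Gflop/s   */
    printf("nodes: %d, peak: %.0f teraOPS\n", nodes, peak / 1e12);
    return 0;
}
```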

• The BlueGene/L possesses no fewer than five networks, two of which are of interest for inter-processor communication: a 3-D torus network and a tree network
  – The torus network is used for most general communication patterns
  – The tree network is used for frequently occurring collective communication patterns such as broadcasts, reduction operations, etc. (an example of such a collective appears below)
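The kind of operation the tree network accelerates is a global reduction. The MPI fragment below is a generic illustration (not BlueGene-specific code) in which every rank contributes a value and all ranks receive the global sum and maximum.

```c
/* Generic MPI global reduction of the kind the tree network accelerates:
 * every rank contributes a value, and all ranks receive the sum and max. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank, sum, max;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(&local, &max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %.0f, global max = %.0f\n", sum, max);

    MPI_Finalize();
    return 0;
}
```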

IBM's X3 Architecture

IBM System x X3 Chipset – Scalable Intel MP Server

[Figure: X3 chipset node – four EM64T Xeon MP processors on front-side buses to the memory/scalability controller, which provides scalability ports to other processors, eight memory interfaces to RAM DIMMs, and PCI-X 2.0 (266 MHz) I/O bridges through two scalability I/O bridge controllers.]

IBM System x X3 Chipset – Low Latency

[Figure: the same X3 node diagram, annotated with a 108 ns memory access latency.]

IBM System x X3 Chipset – Low Latency

[Figure: the same X3 node diagram, annotated with a 222 ns memory access latency.]

IBM System x X3 Chipset – High Bandwidth

[Figure: the same X3 node diagram, annotated with bandwidths – 10.6 GB/s at the front-side buses, 6.4 GB/s at the scalability ports, 15 GB/s at the PCI-X 2.0 I/O bridges, and 21.3 GB/s at the memory interfaces.]

IBM System x X3 Chipset – Snoop Filter

[Figure: snoop-filter comparison – a conventional four-way Xeon node ("others") next to an X3 node whose node controller keeps an internal snoop-filter cache, so a miss generates no traffic on the front-side bus.]

• Others:
  – The cache of EACH processor must be snooped
  – Creates traffic along the FSB
• X3:
  – The cache of EACH processor is mirrored on the Hurricane chipset
  – Relieves traffic on the FSB
  – Faster access to main memory


IBM System x Multi-node Scalability – Putting It Together

[Figure: four Hurricane-based nodes – a requester broadcasts a cache miss; the node that owns the requested cache line returns the data, the node whose main memory the cached address maps to is identified as the home node, and the remaining nodes reply null.]

• The snoop filter and remote directory work together in multi-node configurations
• A local processor cache miss is broadcast to all memory controllers
• Only the node owning the latest copy of the data responds
• Maximizes system bus bandwidth
• A simplified sketch of this decision is shown below
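The decision each memory controller makes can be summarized as a directory lookup. The sketch below is a deliberate simplification with invented names; it is not IBM's actual snoop-filter or remote-directory logic.

```c
/* Illustrative simplification of the broadcast-miss handling described
 * above. All names are invented for this sketch. */
#include <stdbool.h>
#include <stdio.h>

enum reply { REPLY_NULL, REPLY_DATA };

/* Hypothetical remote-directory query: does any cache on node `node_id`
 * currently hold the latest copy of the line at `addr`? A stub stands in
 * for the real directory lookup. */
static bool remote_directory_owns(int node_id, unsigned long addr)
{
    return node_id == 2 && addr == 0x1000;   /* pretend node 2 owns line */
}

/* Every memory controller sees the broadcast miss and answers on its own:
 * only the node owning the latest copy returns data, the rest answer null. */
static enum reply handle_broadcast_miss(int node_id, unsigned long addr)
{
    return remote_directory_owns(node_id, addr) ? REPLY_DATA : REPLY_NULL;
}

int main(void)
{
    for (int node = 0; node < 4; node++)
        printf("node %d: %s\n", node,
               handle_broadcast_miss(node, 0x1000) == REPLY_DATA
                   ? "data" : "null");
    return 0;
}
```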

IBM System x X3 Chipset – Scalability Ports

[Figure: four 4-way nodes joined by cabled scalability ports into a 16-way MP system running a single OS image.]

• X3 scales to 32-way; dual-core capable – 64 cores

The End
