
Architecture of Parallel Computers
CSC / ECE 506, Summer 2006
Scalable Multiprocessors
Lecture 10, 6/19/2006
Dr. Steve Hunter

What is a Multiprocessor?
• A collection of communicating processors
  – Goals: balance load, reduce inherent communication and extra work
• A multi-cache, multi-memory system
  – The role of these components is essential regardless of the programming model
  – The programming model and communication abstraction affect specific performance tradeoffs

[Diagram: processors, each with a cache and node controller, connected through an interconnect]

Arch of Parallel Computers CSC / ECE 506 2

Scalable Multiprocessors
• Study of machines which scale from 100s to 1000s of processors
• Scalability has implications at all levels of system design, and all aspects must scale
• Areas emphasized in the text:
  – Memory bandwidth must scale with the number of processors
  – The communication network must provide scalable bandwidth at reasonable latency
  – Protocols used for transferring data and synchronization techniques must scale
• A scalable system attempts to avoid inherent design limits on the extent to which resources can be added to the system. For example:
  – How does the bandwidth/throughput of the system change when adding processors?
  – How does the latency or time per operation increase?
  – How does the cost of the system increase?
  – How are the systems packaged?
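The four scaling questions above can be made concrete with a toy analytical model. The sketch below is illustrative only (the per-link bandwidth, per-hop latency, and cost constants are assumptions, not from the lecture): it treats aggregate bandwidth as growing linearly with processor count, latency as growing with the diameter of a 2-D mesh, and cost as growing with nodes plus switches.

```python
# Toy scaling model (illustrative assumptions only, not from the slides):
# aggregate bandwidth grows linearly with processor count p, worst-case
# latency grows with the diameter of a 2-D mesh (~ 2*sqrt(p) hops), and
# cost grows with nodes plus per-node interconnect switches.
import math

def aggregate_bandwidth_gbs(p, per_link_gbs=1.0):
    """Aggregate bandwidth if each added processor brings one independent link."""
    return p * per_link_gbs

def mesh_latency_ns(p, per_hop_ns=50.0):
    """Worst-case traversal latency across a 2-D mesh of p nodes."""
    return 2 * math.sqrt(p) * per_hop_ns

def system_cost(p, node_cost=1000.0, switch_cost=200.0):
    """Cost scales with the number of nodes plus interconnect switches."""
    return p * (node_cost + switch_cost)

for p in (16, 64, 256):
    print(p, aggregate_bandwidth_gbs(p), round(mesh_latency_ns(p), 1), system_cost(p))
```

Running the loop shows the tension the slide raises: bandwidth and cost grow linearly with p, while latency grows only with sqrt(p) for this topology, so the answers depend directly on the interconnect chosen.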
Scalable Multiprocessors
• Basic metrics affecting the scalability of a computer system from an application perspective are (Hwang 93):
  – Machine size: the number of processors
  – Clock rate: determines the basic machine cycle
  – Problem size: the amount of computational workload, or the number of data points
  – CPU time: the actual CPU time in seconds
  – I/O demand: the input/output demand in moving the program, data, and results
  – Memory capacity: the amount of main memory used in a program execution
  – Communication overhead: the amount of time spent on interprocessor communication, synchronization, remote access, etc.
  – Computer cost: the total cost of the hardware and software resources required to execute a program
  – Programming overhead: the development overhead associated with an application program
• Power (watts) and cooling are also becoming inhibitors to scalability

Scalable Multiprocessors
• Some other recent trends:
  – Multi-core processors on a single socket
  – Reduced focus on increasing the processor clock rate
  – System-on-Chip (SoC) designs combining processor cores, integrated interconnect, cache, high-performance I/O, etc.
  – Geographically distributed applications utilizing Grid and HPC technologies
  – Standardization of high-performance interconnects (e.g., InfiniBand, Ethernet) and a focus by the Ethernet community on reducing latency
  – For example, Force10's recently announced 10Gb Ethernet switch:
    » The S2410 data center switch has set industry benchmarks for 10 Gigabit price and latency
    » Designed for high-performance clusters, 10 Gigabit Ethernet connectivity to the server, and Ethernet-based storage solutions, the S2410 supports 24 line-rate 10 Gigabit Ethernet ports with an ultra-low switching latency of 300 nanoseconds at an industry-leading price point.
    » The S2410 eliminates the need to integrate InfiniBand or proprietary technologies into the data center and opens the high-performance storage market to 10 Gigabit Ethernet technology. Standardizing on 10 Gigabit Ethernet in the data core, edge, and storage radically simplifies management and reduces total network cost.

Bandwidth Scalability
[Diagram: processor–memory pairs connected by a shared bus versus by crossbar switches and multiplexers]
• What fundamentally limits bandwidth?
  – The number of wires and the clock rate
• Must have many independent wires or a high clock rate
• Connectivity is provided through a bus or through switches

Some Memory Models
[Diagram: three organizations — a shared first-level cache with interleaved main memory; a "dance hall" UMA design with processors and caches on one side of the interconnection network and interleaved memory modules on the other; and a distributed-memory (NUMA) design with a memory module local to each processor]

Generic Distributed Memory Organization
[Diagram: nodes, each containing a processor, cache, memory, and communication assist (CA), connected by switches into a scalable network]
• Network bandwidth requirements?
  – For independent processes?
  – For communicating processes?
• Latency?
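The bandwidth question above can be sketched numerically. The helper below is a hedged illustration (the flop rates and communication-to-computation ratios are assumed values, and the function name is mine, not from the text): it estimates the per-node network bandwidth an application demands, which is near zero for independent processes and grows with the bytes communicated per operation.

```python
# Rough per-node bandwidth demand (illustrative assumptions): an application
# that moves `bytes_per_flop` bytes across the network for every sustained
# floating-point operation needs network bandwidth proportional to its
# compute rate. Independent processes have bytes_per_flop near zero.
def required_node_bandwidth_gbs(flops_per_sec, bytes_per_flop):
    """Per-node network bandwidth (GB/s) demanded by the application."""
    return flops_per_sec * bytes_per_flop / 1e9

# Independent processes communicate almost nothing:
print(required_node_bandwidth_gbs(2e9, 0.0))
# A communicating process moving 0.5 bytes per flop at 2 GFLOP/s:
print(required_node_bandwidth_gbs(2e9, 0.5))
```

The point of the sketch is that the network must be provisioned for the communicating case: per-node demand scales with the application's communication intensity, not just with processor count.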
Some Examples

AMD Opteron Processor Technology

AMD Opteron Architecture
• AMD Opteron™ Processor key architectural features:
  – Single-core and dual-core AMD Opteron processors
  – Direct Connect Architecture
  – Integrated DDR DRAM memory controller
  – HyperTransport™ Technology
  – Low power

AMD Opteron Architecture
• Direct Connect Architecture
  – Addresses and helps reduce the real challenges and bottlenecks of system architectures
  – Memory is directly connected to the CPU, optimizing memory performance
  – I/O is directly connected to the CPU for more balanced throughput and I/O
  – CPUs are connected directly to CPUs, allowing for more linear symmetric multiprocessing
• Integrated DDR DRAM memory controller
  – Changes the way the processor accesses main memory, resulting in increased bandwidth, reduced memory latencies, and increased processor performance
  – Available memory bandwidth scales with the number of processors
  – 128-bit wide integrated DDR DRAM memory controller capable of supporting up to eight (8) registered DDR DIMMs per processor
  – Available memory bandwidth of up to 6.4 GB/s (with PC3200) per processor
• HyperTransport™ Technology
  – Provides a scalable-bandwidth interconnect between processors, I/O subsystems, and other chipsets
  – Supports up to three (3) coherent HyperTransport links, providing up to 24.0 GB/s peak bandwidth per processor
  – Up to 8.0 GB/s bandwidth per link, sufficient for supporting new interconnects including PCI-X, DDR, InfiniBand, and 10G Ethernet
  – Offers low power consumption (1.2 volts) to help reduce a system's thermal budget

AMD Processor Architecture
• Low-power processors
  – The AMD Opteron processor HE offers industry-leading performance per watt, making it an ideal solution for rack-dense 1U servers
or blades in datacenter environments, as well as cooler, quieter workstation designs.
  – The AMD Opteron processor EE provides the maximum I/O bandwidth currently available in a single-CPU controller, making it a good fit for embedded controllers in markets such as NAS and SAN.
• Other features of the AMD Opteron processor include:
  – 64-bit wide key data and address paths incorporating a 48-bit virtual address space and a 40-bit physical address space
  – ECC (Error Correcting Code) protection for L1 cache data, L2 cache data and tags, and DRAM, with hardware scrubbing of all ECC-protected arrays
  – 90nm SOI (Silicon on Insulator) process technology for lower thermal output levels and improved frequency scaling
  – Support for all instructions necessary to be fully compatible with SSE2 technology
  – Two (2) additional pipeline stages (compared to AMD's seventh-generation architecture) for increased performance and frequency scalability
  – Higher IPC (Instructions Per Clock) achieved through additional key features, such as larger TLBs (Translation Look-aside Buffers), flush filters, and an enhanced branch prediction algorithm

AMD vs Intel
• Performance
  – SPECint® rate2000: the Dual-Core AMD Opteron processor Model 280 outperforms the dual-core Xeon 2.8 GHz processor by 28 percent
  – SPECfp® rate2000: the Dual-Core AMD Opteron processor Model 280 outperforms the dual-core Xeon 2.8 GHz processor by 76 percent
  – SPECjbb®2005: the Dual-Core AMD Opteron processor Model 280 outperforms the dual-core Xeon 2.8 GHz by 13 percent
• Processor power (watts)
  – Dual-Core AMD Opteron™ processors, at 95 watts, consume far less than the competition's dual-core x86 server processors, which, according to their published data, have a thermal design power of 135 watts and a maximum power draw of 150 watts.
  – This can result in 200 percent better performance per watt than the competition.
  – Even greater performance per watt can be achieved with lower-power (55 watt) processors.

IBM POWER Processor Technology

IBM POWER4+ Processor Architecture

IBM POWER4+ Processor Architecture
• Two processor cores on one chip, as shown
• The clock frequency of the POWER4+ is 1.5–1.9 GHz
• The L2 cache modules are connected to the processors by the Core Interface Unit (CIU) switch, a 2×3 crossbar with a bandwidth of 40 B/cycle per port.
• This enables shipping 32 B to either the L1 instruction cache or the data cache of each of the processors while storing 8 B values at the same time.
• Also, for each processor there is a Non-cacheable Unit that interfaces with the Fabric Controller and takes care of non-cacheable operations.
• The Fabric Controller is responsible for communication with the three other chips embedded in the same Multi-Chip Module (MCM), with the L3 cache, and with other MCMs.
• The bandwidths at 1.7 GHz are 13.6, 9.0, and 6.8 GB/s, respectively.
• The chip also contains a variety of other devices: the L3 cache directory and the L3 and memory controllers, which should bring down the off-chip latency considerably.
• The GX Controller is responsible
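The POWER4+ bandwidth figures follow from a simple rule: bytes transferred per cycle times clock rate. The check below is a sketch; the 40 B/cycle CIU width and 1.7 GHz clock are from the slide, while the 8 B/cycle intra-MCM path width is inferred from the quoted 13.6 GB/s figure and the helper name is mine.

```python
# Bandwidth = bytes transferred per cycle * clock rate.
# With the clock in GHz (1e9 cycles/s), the result is directly in GB/s.
def bandwidth_gbs(bytes_per_cycle, clock_ghz):
    return bytes_per_cycle * clock_ghz

# CIU crossbar port: 40 B/cycle at 1.7 GHz = 68 GB/s per port
print(bandwidth_gbs(40, 1.7))
# Inferred 8 B/cycle chip-to-chip path at 1.7 GHz = 13.6 GB/s,
# matching the first fabric bandwidth quoted on the slide
print(bandwidth_gbs(8, 1.7))
```

The same arithmetic applied to the 6.8 GB/s figure suggests a 4 B/cycle path, illustrating how the three fabric bandwidths correspond to progressively narrower links.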