SMP Node Architecture

High Performance Computing: Concepts, Methods, & Means
Prof. Thomas Sterling, Department of Computer Science, Louisiana State University
February 1, 2007

Topics
• Introduction
• SMP Context
• Performance: Amdahl's Law
• SMP System structure
• Processor core
• Memory System
• Chip set
• South Bridge – I/O
• Performance Issues
• Summary – Material for the Test

Opening Remarks
• This week is about supercomputer architecture
  – Last time: major factors, classes, and system-level organization
  – Today: the modern microprocessor and the multicore SMP node
• As we have seen, there is a diversity of HPC system types
• The most common systems are either SMPs or ensembles of SMP nodes
• "SMP" stands for Symmetric Multi-Processor
• System performance is strongly influenced by SMP node performance
• Understanding the structure, functionality, and operation of SMP nodes enables effective programming
• Next time: making SMPs work for you!

The take-away message
• The primary structure and elements that make up an SMP node
• The primary structure and elements that make up the modern multicore microprocessor component
• The factors that determine delivered microprocessor performance
• The factors that determine overall sustained SMP performance
• Amdahl's law and how to use it
• Calculating CPI (cycles per instruction)
• Reference: J. Hennessy & D. Patterson, "Computer Architecture: A Quantitative Approach", 3rd Edition, Morgan Kaufmann, 2003

SMP Context
• A standalone system
  – Incorporates everything needed for operation:
    • Processors
    • Memory
    • External I/O channels
    • Local disk storage
    • User interface
  – Serves the enterprise server and institutional computing market
• Exploits economy of scale to enhance performance to cost
• Substantial performance
  – Target for ISVs (Independent Software Vendors)
• Shared-memory, multiple-thread programming platform (a brief code sketch follows this list)
  – Easier to program than distributed-memory machines
  – Enough parallelism to …
• Building block for ensemble supercomputers
  – Commodity clusters
  – MPPs
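An aside on the shared-memory programming model mentioned above: on an SMP node every thread sees the same physical memory, so a loop can simply be split across the cores. The fragment below is an illustrative sketch only; OpenMP is one common way to express this (the programming details come in a later lecture), and the array size here is an arbitrary choice.

```c
/* Illustrative sketch only: shared-memory parallelism on an SMP node.
 * All threads access the same array in the node's shared memory;
 * OpenMP (one common shared-memory API) splits the loop across cores.
 * Build with, e.g.:  gcc -fopenmp -O2 sum.c -o sum
 */
#include <stdio.h>
#include <omp.h>

#define N 1000000               /* arbitrary problem size for the example */

static double a[N];

int main(void)
{
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = 1.0;             /* initialize the shared array */

    /* Each core sums a slice of the same shared array; the reduction
       clause combines the per-thread partial sums safely. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("max threads available: %d, sum = %.0f\n",
           omp_get_max_threads(), sum);
    return 0;
}
```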
Performance: Amdahl's Law — Baton Rouge to Houston
• from my house on East Lakeshore Dr.
• to the downtown Hyatt Regency
• distance of 271 miles
• in-air flight time: 1 hour
• door-to-door time to drive: 4.5 hours
• cruise speed of a Boeing 737: 600 mph
• cruise speed of a BMW 528: 60 mph

Amdahl's Law: drive or fly?
• Peak performance gain: 10X
  – BMW cruise: approx. 60 mph
  – Boeing 737 cruise: approx. 600 mph
• Time door to door
  – BMW
    • Google estimates 4 hours 30 minutes
  – Boeing 737
    • Time to drive to BTR from my house = 15 minutes
    • Wait time at BTR = 1 hour
    • Taxi time at BTR = 5 minutes
    • Continental estimates BTR to IAH = 1 hour
    • Taxi time at IAH = 15 minutes (assuming a gate is available)
    • Time to get bags at IAH = 25 minutes
    • Time to get rental car = 15 minutes
    • Time to drive to the Hyatt Regency from IAH = 45 minutes
    • Total time = 4.0 hours
• Sustained performance gain: 4.5 / 4.0 = 1.125X

Amdahl's Law
• Definitions:
  – \( T_O \) ≡ time for the non-accelerated computation
  – \( T_A \) ≡ time for the accelerated computation
  – \( T_F \) ≡ time of the portion of the computation that can be accelerated
  – \( g \) ≡ peak performance gain for the accelerated portion of the computation
  – \( f \) ≡ fraction of the non-accelerated computation to be accelerated
  – \( S \) ≡ speedup of the computation with acceleration applied
• Derivation:
  – \( S = \frac{T_O}{T_A}, \qquad f = \frac{T_F}{T_O} \)
  – \( T_A = (1 - f)\,T_O + \frac{f}{g}\,T_O \)
  – \( S = \frac{T_O}{(1 - f)\,T_O + \frac{f}{g}\,T_O} = \frac{1}{(1 - f) + \frac{f}{g}} \)

Amdahl's Law with Overhead
• The accelerated work is split into n segments; segment i takes time \( t_{F_i} \) and incurs overhead \( v \):
  – \( T_F = \sum_{i=1}^{n} t_{F_i} \)
  – \( v \) ≡ overhead of each accelerated work segment
  – \( V \) ≡ total overhead for the accelerated work, \( V = \sum_{i=1}^{n} v_i = n\,v \)
• Speedup with overhead:
  – \( T_A = (1 - f)\,T_O + \frac{f}{g}\,T_O + n\,v \)
  – \( S = \frac{T_O}{T_A} = \frac{T_O}{(1 - f)\,T_O + \frac{f}{g}\,T_O + n\,v} \)
  – \( S = \frac{1}{(1 - f) + \frac{f}{g} + \frac{n\,v}{T_O}} \)

Amdahl's Law and Parallel Computers
• Amdahl's Law (FracX: fraction of the original work to be sped up):
  – \( \text{Speedup} = \frac{1}{\frac{\text{FracX}}{\text{SpeedupX}} + (1 - \text{FracX})} \)
• A sequential portion limits parallel speedup:
  – \( \text{Speedup} \le \frac{1}{1 - \text{FracX}} \)
• Example: what fraction may remain sequential to get 80X speedup from 100 processors? Assume each portion runs on either 1 processor or all 100, fully used.
  – \( 80 = \frac{1}{\frac{\text{FracX}}{100} + (1 - \text{FracX})} \)
  – \( 0.8\,\text{FracX} + 80\,(1 - \text{FracX}) = 1 \;\Rightarrow\; 80 - 79.2\,\text{FracX} = 1 \)
  – \( \text{FracX} = \frac{80 - 1}{79.2} = 0.9975 \)
• Only 0.25% of the work may be sequential! (A numerical sketch of these formulas follows below.)
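To make the algebra above concrete, here is a minimal sketch (the function names are illustrative, not from the lecture) that evaluates the closed forms S = 1/((1-f) + f/g) and S = 1/((1-f) + f/g + n·v/T_O) and reproduces the 100-processor example: with f = 0.9975 and g = 100 it prints a speedup of roughly 80.

```c
/* Sketch: evaluating Amdahl's Law with and without per-segment overhead.
 * Function names are illustrative, not part of the lecture material.
 */
#include <stdio.h>

/* Ideal speedup: a fraction f of the work is accelerated by peak gain g. */
static double amdahl(double f, double g)
{
    return 1.0 / ((1.0 - f) + f / g);
}

/* Speedup when each of n accelerated segments also pays an overhead v,
 * expressed relative to the original (non-accelerated) runtime T_O.    */
static double amdahl_with_overhead(double f, double g, int n, double v,
                                   double T_O)
{
    return 1.0 / ((1.0 - f) + f / g + (n * v) / T_O);
}

int main(void)
{
    /* 100-processor example from the slide: 99.75% of the work parallel. */
    printf("S(f=0.9975, g=100)                = %.1f\n",
           amdahl(0.9975, 100.0));

    /* Same fraction, but 1000 segments each costing 10 microseconds of
       overhead on a 1-second original run (illustrative numbers only). */
    printf("S with overhead (n=1000, v=10 us) = %.1f\n",
           amdahl_with_overhead(0.9975, 100.0, 1000, 10e-6, 1.0));
    return 0;
}
```

With no overhead the sketch prints roughly 80; adding the overhead term drops the sustained speedup to roughly 44, the same effect the drive-or-fly example showed: fixed costs outside the accelerated portion erode the peak gain.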
SMP Node Diagram
[Figure: block diagram of an SMP node with four microprocessors (MP), each with private L1 and L2 caches, shared L3 caches, memory banks M1 … Mn-1, storage, a PCI-e controller, JTAG, Ethernet, USB, peripherals, and NICs. Legend — MP: MicroProcessor; L1, L2, L3: caches; M1..: memory banks; S: storage; NIC: Network Interface Card.]

SMP System Examples

| Vendor & name | Processor | Number of cores | Cores per proc. | Memory | Chipset | PCI slots |
|---|---|---|---|---|---|---|
| IBM eServer p5 595 | IBM Power5, 1.9 GHz | 64 | 2 | 2 TB | Proprietary GX+, RIO-2 | ≤240 PCI-X (20 standard) |
| Microway QuadPuter-8 | AMD Opteron, 2.6 GHz | 16 | 2 | 128 GB | Nvidia nForce Pro 2200+2050 | 6 PCIe |
| Ion M40 | Intel Itanium 2, 1.6 GHz | 8 | 2 | 128 GB | Hitachi CF-3e | 4 PCIe, 2 PCI-X |
| Intel Server System SR870BN4 | Intel Itanium 2, 1.6 GHz | 8 | 2 | 64 GB | Intel E8870 | 8 PCI-X |
| HP Proliant ML570 G3 | Intel Xeon 7040, 3 GHz | 8 | 2 | 64 GB | Intel 8500 | 4 PCIe, 6 PCI-X |
| Dell PowerEdge 2950 | Intel Xeon 5300, 2.66 GHz | 8 | 4 | 32 GB | Intel 5000X | 3 PCIe |

Sample SMP Systems
[Photos: Dell PowerEdge, HP Proliant, Intel Server System, Microway Quadputer, IBM p5 595.]

HyperTransport-based SMP System
[Figure; source: http://www.devx.com/amd/Article/17437]

Comparison of Opteron and Xeon SMP Systems
[Figure; source: http://www.devx.com/amd/Article/17437]

Multi-Chip Module (MCM) Component of IBM Power5 Node
[Figure]

Major Elements of an SMP Node
• Processor chip
• DRAM main memory cards
• Motherboard chip set
• On-board memory network
  – North bridge
• On-board I/O network
  – South bridge
• PCI industry-standard interfaces
  – PCI, PCI-X, PCI-Express
• System Area Network controllers
  – e.g. Ethernet, Myrinet, Infiniband, Quadrics, Federation Switch
• System management network
  – Usually Ethernet
  – JTAG for low-level maintenance
• Internal disk and disk controller
• Peripheral interfaces

Itanium™ Processor Silicon
[Die photo (copyright: Intel at Hot Chips ’00); labeled regions: IA-32, FPU, Control, IA-64 Control, TLB, Integer Units, Cache, Instr. Fetch & Decode, Bus, on the core processor die, plus 4 x 1 MB L3 cache.]

Multicore Microprocessor Component Elements
• Multiple processor cores
  – One or more processors
• L1 caches
  – Instruction cache
  – Data cache
• L2 cache
  – Joint instruction/data cache
  – Dedicated to an individual processor core
• L3 cache
  – Shared among multiple cores
  – Often off die but in the same package
• Memory interface
  – Address translation and management (sometimes)
  – North bridge
• I/O interface
  – South bridge

Comparison of Current Microprocessors

| Processor | Clock rate | Caches (per core) | ILP (each core) | Cores per chip | Process & die size | Power | Linpack TPP (one core) |
|---|---|---|---|---|---|---|---|
| AMD Opteron | 2.6 GHz | L1I: 64KB, L1D: 64KB, L2: 1MB | 2 FPops/cycle, 3 Iops/cycle, 2* LS/cycle | 2 | 90nm, 220mm² | 95W | 3.89 Gflops |
| IBM Power5+ | 2.2 GHz | L1I: 64KB, L1D: 32KB, L2: 1.875MB, L3: 18MB | 4 FPops/cycle, 2 Iops/cycle, 2 LS/cycle | 2 | 90nm, 243mm² | 180W (est.) | 8.33 Gflops |
| Intel Itanium 2 (9000 series) | 1.6 GHz | L1I: 16KB, L1D: 16KB, L2I: 1MB, L2D: 256KB, L3: 3MB or more | 4 FPops/cycle, 4 Iops/cycle, 2 LS/cycle | 2 | 90nm, 596mm² | 104W | 5.95 Gflops |
| Intel Xeon Woodcrest | 3 GHz | L1I: 32KB, L1D: 32KB, L2: 2MB | 4 FPops/cycle, 3 Iops/cycle, 1L+1S/cycle | 2 | 65nm, 144mm² | 80W | 6.54 Gflops |

Processor Core Micro-Architecture
• Execution pipeline
  – Stages of functionality to process issued instructions
  – Hazards are conflicts with continued execution
  – Forwarding supports closely associated operations exhibiting precedence constraints
• Out-of-order execution
  – Uses reservation stations
  – Hides some core latencies and provides fine-grain asynchronous operation supporting concurrency
• Branch prediction
  – Permits computation to proceed past a conditional branch point prior to resolving the predicate value
  – Overlaps follow-on computation with predicate resolution
  – Requires roll-back or an equivalent mechanism to correct false guesses
  – Sometimes follows both paths, several branches deep

Recap: Who Cares About the Memory Hierarchy?
[Plot, copyright 2001 UCB, David Patterson: processor-DRAM memory gap (latency), 1980-2000. Microprocessor performance improves about 60%/yr ("Moore's Law", 2X per 1.5 yr); DRAM performance improves about 9%/yr (2X per 10 yrs); the processor-memory performance gap grows about 50% per year.]

What is a cache?
• Small, fast storage used to improve the average access time to slow memory
• Exploits spatial and temporal locality (see the sketch below)
• In computer architecture, almost everything is a cache!
  – Registers are a cache on variables
  – The first-level cache is a cache on the second-level cache
  – The second-level cache is a cache on memory
  – Memory is a cache on disk (virtual memory)
  – The TLB is a cache on the page table
  – Branch prediction is a cache on prediction information?
[Hierarchy diagram: Proc/Regs → L1 cache → L2 cache → Memory → Disk, tape, etc.; capacity grows going down, speed grows going up.]
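Because the hierarchy rewards locality, the order in which a program touches memory can dominate performance. The sketch below (my illustration of the locality point, not from the slides) sums the same matrix twice: row order walks consecutive addresses and reuses each cache line, while column order strides by a full row and defeats spatial locality, so on most SMP nodes the second sweep is several times slower.

```c
/* Sketch: spatial locality in the memory hierarchy.
 * C arrays are stored row-major, so the row-order loop touches
 * consecutive addresses (cache-line friendly) while the column-order
 * loop strides by a whole row and misses far more often.
 * Build with, e.g.:  gcc -O1 locality.c -o locality
 */
#include <stdio.h>
#include <time.h>

#define N 4096                       /* 4096 x 4096 doubles = 128 MB */

static double a[N][N];

static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double sum = 0.0, t;

    /* Row order: unit stride, good spatial locality. */
    t = wall_seconds();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    printf("row-order sweep:    %.3f s\n", wall_seconds() - t);

    /* Column order: stride of N doubles, poor spatial locality. */
    t = wall_seconds();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    printf("column-order sweep: %.3f s\n", wall_seconds() - t);

    printf("checksum: %g\n", sum);   /* keeps the loops from being elided */
    return 0;
}
```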
