SMP Node Architecture
High Performance Computing: Concepts, Methods, & Means
SMP Node Architecture
Prof. Thomas Sterling, Department of Computer Science, Louisiana State University
February 1, 2007

Topics
• Introduction
• SMP Context
• Performance: Amdahl's Law
• SMP System Structure
• Processor Core
• Memory System
• Chip Set
• South Bridge – I/O
• Performance Issues
• Summary – Material for the Test

Opening Remarks
• This week is about supercomputer architecture
  – Last time: major factors, classes, and system-level organization
  – Today: the modern microprocessor and the multicore SMP node
• As we have seen, there is a diversity of HPC system types
• The most common systems are either SMPs or ensembles of SMP nodes
• "SMP" stands for Symmetric Multi-Processor
• System performance is strongly influenced by SMP node performance
• Understanding the structure, functionality, and operation of SMP nodes allows effective programming
• Next time: making SMPs work for you!

The Take-Away Message
• The primary structure and elements that make up an SMP node
• The primary structure and elements that make up the modern multicore microprocessor component
• The factors that determine delivered microprocessor performance
• The factors that determine overall sustained SMP performance
• Amdahl's Law and how to use it
• Calculating CPI (cycles per instruction)
• Reference: J. Hennessy & D. Patterson, Computer Architecture: A Quantitative Approach, 3rd Edition, Morgan Kaufmann, 2003

SMP Context
• A standalone system
  – Incorporates everything needed for operation:
    • Processors
    • Memory
    • External I/O channels
    • Local disk storage
    • User interface
  – Serves the enterprise server and institutional computing market
• Exploits economy of scale to improve the performance-to-cost ratio
• Substantial performance
  – Target for ISVs (Independent Software Vendors)
• Shared-memory, multiple-thread programming platform
  – Easier to program than distributed-memory machines
  – Enough parallelism to exploit multiple processors
• Building block for ensemble supercomputers
  – Commodity clusters
  – MPPs

Performance: Amdahl's Law – Baton Rouge to Houston
• From my house on East Lakeshore Dr.
• To the downtown Hyatt Regency
• Distance of 271 miles
• In-air flight time: 1 hour
• Door-to-door time to drive: 4.5 hours
• Cruise speed of a Boeing 737: 600 mph
• Cruise speed of a BMW 528: 60 mph

Amdahl's Law: Drive or Fly?
• Peak performance gain: 10X
  – BMW cruise: approx. 60 mph
  – Boeing 737 cruise: approx. 600 mph
• Time door to door
  – BMW
    • Google estimates 4 hours 30 minutes
  – Boeing 737
    • Time to drive to BTR from my house = 15 minutes
    • Wait time at BTR = 1 hour
    • Taxi time at BTR = 5 minutes
    • Continental estimates BTR to IAH = 1 hour
    • Taxi time at IAH = 15 minutes (assuming a gate is available)
    • Time to get bags at IAH = 25 minutes
    • Time to get a rental car = 15 minutes
    • Time to drive to the Hyatt Regency from IAH = 45 minutes
    • Total time = 4.0 hours
• Sustained performance gain: 1.125X

Amdahl's Law
[Timeline figure: the original computation of length T_O contains an accelerable portion T_F; in the accelerated version of length T_A that portion takes T_F/g.]
• Definitions:
  – T_O ≡ time for the non-accelerated computation
  – T_A ≡ time for the accelerated computation
  – T_F ≡ time of the portion of the computation that can be accelerated
  – g ≡ peak performance gain for the accelerated portion of the computation
  – f ≡ fraction of the non-accelerated computation to be accelerated, f = T_F / T_O
  – S ≡ speedup of the computation with acceleration applied, S = T_O / T_A
• Derivation:
  T_A = (1 − f)·T_O + (f/g)·T_O
  S = T_O / [(1 − f)·T_O + (f/g)·T_O] = 1 / [(1 − f) + f/g]

Amdahl's Law with Overhead
[Timeline figure: the accelerable work is split into n segments t_F1 … t_Fn, and each accelerated segment incurs an overhead v.]
• T_F = Σ t_Fi (i = 1..n)
• v ≡ overhead of each accelerated work segment
• V ≡ total overhead for the accelerated work, V = Σ v_i = n·v
• Then:
  T_A = (1 − f)·T_O + (f/g)·T_O + n·v
  S = T_O / T_A = T_O / [(1 − f)·T_O + (f/g)·T_O + n·v] = 1 / [(1 − f) + f/g + n·v/T_O]

Amdahl's Law and Parallel Computers
• Amdahl's Law (FracX: fraction of the original work to be sped up):
  Speedup = 1 / [FracX/SpeedupX + (1 − FracX)]
• A sequential portion limits parallel speedup:
  Speedup ≤ 1 / (1 − FracX)
• Example: what fraction may be sequential to get an 80X speedup from 100 processors? (Assume each portion runs on either 1 processor or all 100, fully used.)
  80 = 1 / [FracX/100 + (1 − FracX)]
  0.8·FracX + 80·(1 − FracX) = 80 − 79.2·FracX = 1
  FracX = (80 − 1) / 79.2 = 0.9975
• Only 0.25% of the work may be sequential!
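The formulas above are easy to evaluate numerically. The following is a minimal sketch in C (not part of the original lecture) that computes the basic and overhead forms of Amdahl's Law; the inputs reproduce the 100-processor example, while the overhead figures (n = 1000 segments, v = 0.0001·T_O) are invented purely for illustration.

/* Amdahl's Law speedup calculator: a sketch of the formulas above. */
#include <stdio.h>

/* S = 1 / ((1 - f) + f/g) */
static double amdahl(double f, double g)
{
    return 1.0 / ((1.0 - f) + f / g);
}

/* S = 1 / ((1 - f) + f/g + n*v/T_O), with per-segment overhead v */
static double amdahl_overhead(double f, double g, double n, double v, double t_o)
{
    return 1.0 / ((1.0 - f) + f / g + n * v / t_o);
}

int main(void)
{
    /* 100 processors, 99.75% of the work parallelizable (slide example) */
    printf("S(f=0.9975, g=100)  = %.1f\n", amdahl(0.9975, 100.0));

    /* Same fraction, but 1000 accelerated segments, each costing 0.01% of T_O
       (hypothetical numbers chosen only to show how overhead erodes speedup) */
    printf("S with overhead     = %.1f\n",
           amdahl_overhead(0.9975, 100.0, 1000.0, 0.0001, 1.0));

    /* Upper bound as g -> infinity: 1 / (1 - f) */
    printf("Upper bound 1/(1-f) = %.1f\n", 1.0 / (1.0 - 0.9975));
    return 0;
}

With these inputs the first line prints roughly 80, the overhead case drops to roughly 8.9, and the upper bound is 400, which is the point of the law: the serial fraction and the overhead, not the peak gain g, dominate sustained performance.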
SMP Node Diagram
[Block diagram: four microprocessors (MP), each with L1 and L2 caches, sharing L3 caches; memory banks M1 .. Mn-1 attach through the memory network, and the chip set connects storage (S), a PCI-e controller, JTAG, Ethernet, USB, peripherals, and network interface cards (NIC).]
Legend: MP = microprocessor; L1, L2, L3 = caches; M1.. = memory banks; S = storage; NIC = network interface card

SMP System Examples
Vendor & name | Processor | Number of cores | Cores per proc. | Memory | Chipset | PCI slots
IBM eServer p5 595 | IBM Power5, 1.9 GHz | 64 | 2 | 2 TB | Proprietary GX+, RIO-2 | ≤240 PCI-X (20 standard)
Microway QuadPuter-8 | AMD Opteron, 2.6 GHz | 16 | 2 | 128 GB | Nvidia nForce Pro 2200+2050 | 6 PCIe
Ion M40 | Intel Itanium 2, 1.6 GHz | 8 | 2 | 128 GB | Hitachi CF-3e | 4 PCIe, 2 PCI-X
Intel Server System SR870BN4 | Intel Itanium 2, 1.6 GHz | 8 | 2 | 64 GB | Intel E8870 | 8 PCI-X
HP ProLiant ML570 G3 | Intel Xeon 7040, 3 GHz | 8 | 2 | 64 GB | Intel 8500 | 4 PCIe, 6 PCI-X
Dell PowerEdge 2950 | Intel Xeon 5300, 2.66 GHz | 8 | 4 | 32 GB | Intel 5000X | 3 PCIe

Sample SMP Systems
[Photos: Dell PowerEdge, HP ProLiant, Intel Server System, Microway QuadPuter, IBM p5 595]

HyperTransport-based SMP System
[Diagram. Source: http://www.devx.com/amd/Article/17437]

Comparison of Opteron and Xeon SMP Systems
[Diagram. Source: http://www.devx.com/amd/Article/17437]

Multi-Chip Module (MCM) Component of IBM Power5 Node
[Photo of the Power5 multi-chip module]

Major Elements of an SMP Node
• Processor chip
• DRAM main memory cards
• Motherboard chip set
• On-board memory network
  – North bridge
• On-board I/O network
  – South bridge
• PCI industry-standard interfaces
  – PCI, PCI-X, PCI-Express
• System area network controllers
  – e.g. Ethernet, Myrinet, Infiniband, Quadrics, Federation switch
• System management network
  – Usually Ethernet
  – JTAG for low-level maintenance
• Internal disk and disk controller
• Peripheral interfaces
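Many of these node parameters can be inspected on a running system. Below is a minimal sketch, assuming a Linux node with glibc; the _SC_LEVEL*_CACHE_* sysconf names are glibc extensions and may return 0 or -1 where the platform does not expose the information.

/* Sketch: query basic SMP node parameters on a Linux/glibc system. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long cpus   = sysconf(_SC_NPROCESSORS_ONLN);   /* processor cores online   */
    long pagesz = sysconf(_SC_PAGESIZE);           /* virtual memory page size */
    long pages  = sysconf(_SC_PHYS_PAGES);         /* physical memory pages    */
    long l1d    = sysconf(_SC_LEVEL1_DCACHE_SIZE); /* per-core L1 data cache   */
    long line   = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    long l2     = sysconf(_SC_LEVEL2_CACHE_SIZE);
    long l3     = sysconf(_SC_LEVEL3_CACHE_SIZE);  /* often shared among cores */

    printf("online processors : %ld\n", cpus);
    printf("main memory       : ~%ld MB\n", (pages / 1024) * (pagesz / 1024));
    printf("L1 data cache     : %ld KB (line %ld B)\n", l1d / 1024, line);
    printf("L2 cache          : %ld KB\n", l2 / 1024);
    printf("L3 cache          : %ld KB\n", l3 / 1024);
    return 0;
}

Running such a probe on the systems in the table above would report the per-core cache sizes and the per-node core counts and memory capacities listed there.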
Itanium™ Processor Silicon (Copyright: Intel at Hotchips '00)
[Die photo with labeled regions: IA-32 control, FPU, IA-64 control, TLB, integer units, caches, instruction fetch & decode, and bus interface; the core processor die plus 4 × 1 MB L3 cache]

Multicore Microprocessor Component Elements
• Multiple processor cores
  – One or more processors per chip
• L1 caches
  – Instruction cache
  – Data cache
• L2 cache
  – Joint instruction/data cache
  – Dedicated to an individual processor core
• L3 cache
  – Shared among multiple cores
  – Often off-die but in the same package
• Memory interface
  – Address translation and management (sometimes)
  – North bridge
• I/O interface
  – South bridge

Comparison of Current Microprocessors
Processor | Clock rate | Caches (per core) | ILP (each core) | Cores per chip | Process & die size | Power | Linpack TPP (one core)
AMD Opteron | 2.6 GHz | L1I: 64 KB, L1D: 64 KB, L2: 1 MB | 2 FPops/cycle, 3 Iops/cycle, 2* LS/cycle | 2 | 90 nm, 220 mm² | 95 W | 3.89 Gflops
IBM Power5+ | 2.2 GHz | L1I: 64 KB, L1D: 32 KB, L2: 1.875 MB, L3: 18 MB | 4 FPops/cycle, 2 Iops/cycle, 2 LS/cycle | 2 | 90 nm, 243 mm² | 180 W (est.) | 8.33 Gflops
Intel Itanium 2 (9000 series) | 1.6 GHz | L1I: 16 KB, L1D: 16 KB, L2I: 1 MB, L2D: 256 KB, L3: 3 MB or more | 4 FPops/cycle, 4 Iops/cycle, 2 LS/cycle | 2 | 90 nm, 596 mm² | 104 W | 5.95 Gflops
Intel Xeon Woodcrest | 3 GHz | L1I: 32 KB, L1D: 32 KB, L2: 2 MB | 4 FPops/cycle, 3 Iops/cycle, 1 L + 1 S/cycle | 2 | 65 nm, 144 mm² | 80 W | 6.54 Gflops

Processor Core Micro-Architecture
• Execution pipeline
  – Stages of functionality to process issued instructions
  – Hazards are conflicts that stall continued execution
  – Forwarding supports closely associated operations exhibiting precedence constraints
• Out-of-order execution
  – Uses reservation stations
  – Hides some core latencies and provides fine-grain asynchronous operation supporting concurrency
• Branch prediction
  – Permits computation to proceed past a conditional branch point before the predicate value is resolved
  – Overlaps follow-on computation with predicate resolution
  – Requires roll-back or an equivalent mechanism to correct false guesses
  – Sometimes follows both paths, and several branches deep

Recap: Who Cares About the Memory Hierarchy?
Processor-DRAM memory gap (latency)
[Chart, 1980–2000: processor performance improves ~60% per year ("Moore's Law", 2X every 1.5 years) while DRAM performance improves ~9% per year (2X every 10 years); the processor-memory performance gap grows ~50% per year. Copyright 2001, UCB, David Patterson]

What Is a Cache?
• Small, fast storage used to improve the average access time to slow memory
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
  – Registers: a cache on variables
  – First-level cache: a cache on the second-level cache
  – Second-level cache: a cache on memory
  – Memory: a cache on disk (virtual memory)
  – TLB: a cache on the page table
  – Branch prediction: a cache on prediction information?
• The hierarchy: Proc/Regs, L1 cache, L2 cache, Memory, Disk/Tape, etc.; levels get bigger going down and faster going up
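To see the memory hierarchy at work, the following sketch (illustrative only, not from the lecture) sums the same array twice: once in row-major order, which walks memory with unit stride and reuses every cache line, and once in column-major order, which strides across lines and misses far more often. On typical SMP nodes the strided pass is several times slower even though both perform the same arithmetic.

/* Sketch: effect of cache locality on a simple reduction.
 * Row-major traversal has unit stride (good spatial locality);
 * column-major traversal strides by N doubles per access. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096   /* 4096 x 4096 doubles = 128 MB, larger than any cache above */

int main(void)
{
    double *a = malloc((size_t)N * N * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < (size_t)N * N; i++) a[i] = 1.0;

    clock_t t0 = clock();
    double row_sum = 0.0;
    for (size_t i = 0; i < N; i++)          /* row-major: unit stride */
        for (size_t j = 0; j < N; j++)
            row_sum += a[i * N + j];
    clock_t t1 = clock();

    double col_sum = 0.0;
    for (size_t j = 0; j < N; j++)          /* column-major: stride of N */
        for (size_t i = 0; i < N; i++)
            col_sum += a[i * N + j];
    clock_t t2 = clock();

    printf("row-major    : %.2f s (sum %.0f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, row_sum);
    printf("column-major : %.2f s (sum %.0f)\n",
           (double)(t2 - t1) / CLOCKS_PER_SEC, col_sum);
    free(a);
    return 0;
}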