High Performance Computing: Concepts, Methods, & Means
SMP Node Architecture

Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
February 1, 2007

Topics

• Introduction
• SMP Context
• Performance: Amdahl's Law
• SMP System Structure
• Processor Core
• Memory System
• Chipset
• SouthBridge – I/O
• Performance Issues
• Summary – Material for the Test

Opening Remarks

• This week is about supercomputer architecture
  – Last time: major factors, classes, and system level
  – Today: the modern microprocessor and multicore SMP node
• As we've seen, there is a diversity of HPC system types
• Most common systems are either SMPs or are ensembles of SMP nodes
• "SMP" stands for: "Symmetric Multi-Processor"
• System performance is strongly influenced by SMP node performance
• Understanding the structure, functionality, and operation of SMP nodes will allow effective programming
• Next time: making SMPs work for you!

The Take-Away Message

• Primary structure and elements that make up an SMP node
• Primary structure and elements that make up the modern multicore microprocessor component
• The factors that determine microprocessor delivered performance
• The factors that determine overall SMP sustained performance
• Amdahl's law and how to use it
• Calculating CPI
• Reference: J. Hennessy & D. Patterson, "Computer Architecture: A Quantitative Approach," 3rd Edition, Morgan Kaufmann, 2003

SMP Context

• A standalone system
  – Incorporates everything needed for:
    • Processors
    • Memory
    • External I/O channels
    • Local disk storage
    • User interface
  – Enterprise and institutional computing market
    • Exploits economy of scale to enhance performance to cost
    • Substantial performance
  – Target for ISVs (Independent Software Vendors)
• Shared memory multiple thread programming platform
  – Easier to program than distributed memory machines
  – Enough parallelism to …
• Building block for ensemble supercomputers
  – Commodity clusters
  – MPPs

Performance: Amdahl's Law

Baton Rouge to Houston
• from my house on East Lakeshore Dr.
• to downtown Hyatt Regency
• distance of 271 miles
• in-air flight time: 1 hour
• door-to-door time to drive: 4.5 hours
• cruise speed of Boeing 737: 600 mph
• cruise speed of BMW 528: 60 mph

Amdahl's Law: Drive or Fly?

• Peak performance gain: 10X
  – BMW cruise approx. 60 MPH
  – Boeing 737 cruise approx. 600 MPH
• Time door to door
  – BMW
    • Google estimates 4 hours 30 minutes
  – Boeing 737
    • Time to drive to BTR from my house = 15 minutes
    • Wait time at BTR = 1 hour
    • Taxi time at BTR = 5 minutes
    • Continental estimates BTR to IAH 1 hour
    • Taxi time at IAH = 15 minutes (assuming gate available)
    • Time to get bags at IAH = 25 minutes
    • Time to get rental car = 15 minutes
    • Time to drive to Hyatt Regency from IAH = 45 minutes
    • Total time = 4.0 hours
• Sustained performance gain: 1.125X

Amdahl's Law

(Timeline figure: the total non-accelerated time T_O runs from start to end; within it, a portion T_F can be accelerated, shrinking to T_F/g in the accelerated run of length T_A.)

Definitions:
• T_O ≡ time for non-accelerated computation
• T_A ≡ time for accelerated computation
• T_F ≡ time of portion of computation that can be accelerated
• g ≡ peak performance gain for accelerated portion of computation
• f ≡ fraction of non-accelerated computation to be accelerated
• S ≡ speedup of computation with acceleration applied

$$S = \frac{T_O}{T_A}, \qquad f = \frac{T_F}{T_O}$$

$$T_A = (1 - f)\,T_O + \frac{f}{g}\,T_O$$

$$S = \frac{T_O}{(1 - f)\,T_O + \frac{f}{g}\,T_O} = \frac{1}{(1 - f) + \frac{f}{g}}$$
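As a quick numeric check of the closed form, here is a minimal C sketch; the function name and sample values are mine, not from the lecture:

```c
#include <stdio.h>

/* S = 1 / ((1 - f) + f/g), per the derivation above */
double amdahl_speedup(double f, double g)
{
    return 1.0 / ((1.0 - f) + f / g);
}

int main(void)
{
    /* hypothetical workload: 95% of the time is accelerable by g = 10 */
    printf("S = %.3f\n", amdahl_speedup(0.95, 10.0)); /* prints S = 6.897 */
    return 0;
}
```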

Amdahl's Law with Overhead

(Timeline figure: the accelerated portion runs as n segments t_F1 … t_Fn, each incurring an overhead v, within the accelerated run of length T_A.)

$$T_F = \sum_{i=1}^{n} t_{F_i}$$

• v ≡ overhead of an accelerated work segment
• V ≡ total overhead for accelerated work, $V = \sum_{i=1}^{n} v_i = n\,v$ for uniform segments

$$T_A = (1 - f)\,T_O + \frac{f}{g}\,T_O + n \times v$$

$$S = \frac{T_O}{T_A} = \frac{T_O}{(1 - f)\,T_O + \frac{f}{g}\,T_O + n \times v} = \frac{1}{(1 - f) + \frac{f}{g} + \frac{n \times v}{T_O}}$$
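The overhead form adds one term to the denominator; extending the earlier sketch (again with names and parameters of my choosing):

```c
/* S = 1 / ((1 - f) + f/g + n*v/T_O), per the overhead form above;
 * t_o is the original (non-accelerated) execution time */
double amdahl_speedup_overhead(double f, double g, int n, double v, double t_o)
{
    return 1.0 / ((1.0 - f) + f / g + (double)n * v / t_o);
}
```

Note that the n·v/T_O term grows with the number of accelerated segments, so finer-grained acceleration can erode the gain even when g is large.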

• Amdahl’sLaw(FracX:original%tobespeedup) Speedup=1/[(FracX/SpeedupX +(1FracX)] • Aportionissequential=>limitsparallelspeedup – Speedup<=1/(1FracX) • Ex.Whatfractionsequentialtoget80Xspeedup from100processors?Assumeeither1processoror 100fullyused 80=1/[(FracX/100+(1FracX)] 0.8*FracX +80*(1FracX)=80 79.2*FracX =1 FracX =(801)/79.2=0.9975 • Only0.25%sequential!

SMP Node Diagram

(Block diagram of an SMP node: multiple microprocessors, each with L1 and L2 caches, sharing L3 caches and memory banks M1 … Mn-1, with storage, a PCIe controller, JTAG, Ethernet, USB, peripherals, and NICs attached.)

Legend: MP: MicroProcessor; L1, L2, L3: caches; M1…: memory banks; S: storage; NIC: Network Interface Card

SMP System Examples

Vendor & name | Processor | Number of cores | Cores per proc. | Memory | Chipset | PCI slots
IBM eServer p5 595 | IBM Power5, 1.9 GHz | 64 | 2 | 2 TB | Proprietary GX+, RIO-2 | ≤240 PCI-X (20 standard)
Microway QuadPuter-8 | AMD, 2.6 GHz | 16 | 2 | 128 GB | Nvidia nForce Pro 2200+2050 | 6 PCIe
Ion M40 | Itanium 2, 1.6 GHz | 8 | 2 | 128 GB | Hitachi CF-3e | 4 PCIe, 2 PCI-X
Intel Server System SR870BN4 | Intel Itanium 2, 1.6 GHz | 8 | 2 | 64 GB | Intel E8870 | 8 PCI-X
HP ProLiant ML570 G3 | Intel Xeon 7040, 3 GHz | 8 | 2 | 64 GB | Intel 8500 | 4 PCIe, 6 PCI-X
Dell PowerEdge 2950 | Intel Xeon 5300, 2.66 GHz | 8 | 4 | 32 GB | Intel 5000X | 3 PCIe

Sample SMP Systems

(Photos: Dell PowerEdge, HP ProLiant, Intel Server System, Microway Quadputer, IBM p5 595.)

HyperTransport-based SMP System

Source: http://www.devx.com/amd/Article/17437

Comparison of Opteron and Xeon SMP Systems

Source: http://www.devx.com/amd/Article/17437

Multi-Chip Module (MCM) Component of IBM Power5 Node

(Photo of the Power5 multi-chip module.)

Major Elements of an SMP Node

• Processor chip
• DRAM main memory cards
• Chipset
• On-board memory network
  – North bridge
• On-board I/O network
  – South bridge
• PCI industry standard interfaces
  – PCI, PCI-X, PCI Express
• System Area Network controllers
  – e.g. Ethernet, Myrinet, Infiniband, Quadrics, Federation Switch
• System Management network
  – Usually Ethernet
  – JTAG for low-level maintenance
• Internal disk and disk controller
• Peripheral interfaces

Itanium™ Processor Silicon (Copyright: Intel at Hotchips '00)

(Die photo with labeled blocks: IA-32 Control, FPU, IA-64 Control, TLB, Integer Units, Cache, Instr. Fetch & Decode, Bus; core processor die plus 4 × 1 MB L3 cache.)

Multicore Microprocessor Component Elements

• Multiple processor cores
  – One or more processors
• L1 caches
  – Instruction cache
  – Data cache
• L2 cache
  – Joint instruction/data cache
  – Dedicated to individual core processor
• L3 cache
  – Shared among multiple cores
  – Often off-die but in same package
• Memory interface
  – Address translation and management (sometimes)
  – North bridge
• I/O interface
  – South bridge

Comparison of Current Microprocessors

Processor | Clock rate | Caches (per core) | ILP (each core) | Cores per chip | Process & die size | Power | Linpack TPP (one core)
AMD Opteron | 2.6 GHz | L1I: 64 KB; L1D: 64 KB; L2: 1 MB | 2 FPops/cycle; 3 Iops/cycle; 2 LS/cycle | 2 | 90 nm, 220 mm² | 95 W | 3.89 Gflops
IBM Power5+ | 2.2 GHz | L1I: 64 KB; L1D: 32 KB; L2: 1.875 MB; L3: 18 MB | 4 FPops/cycle; 2 Iops/cycle; 2 LS/cycle | 2 | 90 nm, 243 mm² | 180 W (est.) | 8.33 Gflops
Intel Itanium 2 (9000 series) | 1.6 GHz | L1I: 16 KB; L1D: 16 KB; L2I: 1 MB; L2D: 256 KB; L3: 3 MB or more | 4 FPops/cycle; 4 Iops/cycle; 2 LS/cycle | 2 | 90 nm, 596 mm² | 104 W | 5.95 Gflops
Intel Xeon Woodcrest | 3 GHz | L1I: 32 KB; L1D: 32 KB; L2: 2 MB | 4 FPops/cycle; 3 Iops/cycle; 1 L + 1 S/cycle | 2 | 65 nm, 144 mm² | 80 W | 6.54 Gflops
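One way to read the Linpack column is against per-core peak (clock rate × FP ops/cycle). A small illustrative C check using the Opteron row; the efficiency figure is my arithmetic, not a quoted benchmark result:

```c
#include <stdio.h>

int main(void)
{
    /* AMD Opteron row: 2.6 GHz clock, 2 FPops/cycle, 3.89 Gflops Linpack */
    double peak_gflops    = 2.6 * 2.0; /* 5.2 Gflops peak per core */
    double linpack_gflops = 3.89;

    printf("Linpack efficiency = %.0f%% of peak\n",
           100.0 * linpack_gflops / peak_gflops); /* about 75% */
    return 0;
}
```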

Processor Core Microarchitecture

• Execution Pipeline
  – Stages of functionality to process issued instructions
  – Hazards are conflicts with continued execution
  – Forwarding supports closely associated operations exhibiting precedence constraints
• Out-of-Order Execution
  – Uses reservation stations
  – Hides some core latencies and provides fine-grain asynchronous operation supporting concurrency
• Branch Prediction
  – Permits computation to proceed at a conditional branch point prior to resolving predicate value
  – Overlaps follow-on computation with predicate resolution
  – Requires rollback or equivalent to correct false guesses
  – Sometimes follows both paths, and several deep

Recap: Who Cares About the Memory Hierarchy?

(Chart: the processor-DRAM memory gap (latency), 1980-2000. CPU performance grows 60%/yr ("Moore's Law," 2X/1.5 yr); DRAM latency improves 9%/yr (2X/10 yrs); the processor-memory performance gap grows 50%/year.)

Copyright 2001, UCB, David Patterson

What is a cache?

• Small, fast storage used to improve average access time to slow memory.
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
  – Registers: a cache on variables
  – First-level cache: a cache on second-level cache
  – Second-level cache: a cache on memory
  – Memory: a cache on disk (virtual memory)
  – TLB: a cache on page table
  – Branch prediction: a cache on prediction information?

(Pyramid figure: Proc/Regs at the top, then L1 cache, L2 cache, memory, and disk/tape at the bottom; levels get bigger going down and faster going up.)

Copyright 2001, UCB, David Patterson

Levels of the Memory Hierarchy

Level | Capacity | Access time | Cost | Staging transfer unit | Managed by
Registers | 100s bytes | < 1 ns | — | Instr. operands, 1-8 bytes | prog./compiler
Cache | 10s-100s KBytes | 1-10 ns | $10/MByte | Blocks, 8-128 bytes | cache controller
Main memory | MBytes | 100-300 ns | $1/MByte | Pages, 512-4K bytes | OS
Disk | 10s GBytes | 10 ms (10,000,000 ns) | $0.0031/MByte | Files, MBytes | user/operator
Tape | infinite | sec-min | $0.0014/MByte | — | —

(Upper levels are faster; lower levels are larger.)

Copyright 2001, UCB, David Patterson

Cache Performance

• Miss-oriented approach to memory access:

$$\text{CPUtime} = IC \times \left( CPI_{Execution} + \frac{\text{MemAccess}}{\text{Inst}} \times \text{MissRate} \times \text{MissPenalty} \right) \times \text{CycleTime}$$

$$\text{CPUtime} = IC \times \left( CPI_{Execution} + \frac{\text{MemMisses}}{\text{Inst}} \times \text{MissPenalty} \right) \times \text{CycleTime}$$

– CPI_Execution includes ALU and memory instructions
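To make the formula concrete, a minimal C sketch with made-up inputs (none of these values come from the slides):

```c
#include <stdio.h>

int main(void)
{
    double ic            = 1e9;   /* instruction count                 */
    double cpi_execution = 1.1;   /* base CPI, ALU + memory included   */
    double mem_per_inst  = 0.3;   /* memory accesses per instruction   */
    double miss_rate     = 0.02;  /* cache miss rate                   */
    double miss_penalty  = 100.0; /* cycles per miss                   */
    double cycle_time    = 1e-9;  /* seconds per cycle (1 GHz clock)   */

    /* CPUtime = IC * (CPI_Exec + MemAccess/Inst * MissRate * MissPenalty) * CycleTime */
    double cpu_time = ic * (cpi_execution +
                            mem_per_inst * miss_rate * miss_penalty) * cycle_time;
    printf("CPU time = %.2f s\n", cpu_time); /* (1.1 + 0.6) cycles/inst -> 1.70 s */
    return 0;
}
```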

• Separating out the memory component entirely
  – AMAT = Average Memory Access Time

– CPI_AluOps does not include memory instructions

$$\text{CPUtime} = IC \times \left( \frac{\text{AluOps}}{\text{Inst}} \times CPI_{AluOps} + \frac{\text{MemAccess}}{\text{Inst}} \times \text{AMAT} \right) \times \text{CycleTime}$$

$$\text{AMAT} = \text{HitTime} + \text{MissRate} \times \text{MissPenalty}$$
$$= \left( \text{HitTime}_{Inst} + \text{MissRate}_{Inst} \times \text{MissPenalty}_{Inst} \right) + \left( \text{HitTime}_{Data} + \text{MissRate}_{Data} \times \text{MissPenalty}_{Data} \right)$$

Copyright 2001, UCB, David Patterson

Memory Hierarchy: Terminology

• Hit: data appears in some block in the upper level (example: Block X)
  – Hit Rate: the fraction of memory accesses found in the upper level
  – Hit Time: time to access the upper level, consisting of RAM access time + time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level (Block Y)
  – Miss Rate = 1 − (Hit Rate)
  – Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on the 21264!)

(Figure: block transfer between upper-level and lower-level memory; Blk X sits in the upper level, Blk Y in the lower level, with data moving to and from the processor.)

Copyright 2001, UCB, David Patterson

Cache Measures

• Hit rate: fraction found in that level
  – So high that usually talk about miss rate
  – Miss rate fallacy: as MIPS is to CPU performance, miss rate is to average memory access time
• Average memory access time = Hit time + Miss rate × Miss penalty (ns or clocks)
• Miss penalty: time to replace a block from the lower level, including time to replace in the CPU
  – access time: time to lower level = f(latency to lower level)
  – transfer time: time to transfer block = f(BW between upper & lower levels)
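Plugging illustrative numbers into the AMAT formula (a sketch; the values are assumptions, not lecture data):

```c
#include <stdio.h>

int main(void)
{
    double hit_time     = 1.0;   /* cycles to hit in this level    */
    double miss_rate    = 0.05;  /* fraction of accesses that miss */
    double miss_penalty = 100.0; /* cycles to service a miss below */

    /* AMAT = Hit time + Miss rate x Miss penalty */
    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f cycles\n", amat); /* 6.0: the 5% of misses dominate */
    return 0;
}
```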

Copyright 2001, UCB, David Patterson

Motherboard Chipset

• Provides core functionality of the motherboard
• Embeds low-level protocols to facilitate efficient communication between local components of the computer system
• Controls the flow of data between the CPU, system memory, on-board peripheral devices, expansion interfaces and I/O subsystem
• Also responsible for power management features, retention of non-volatile configuration data and real-time measurement
• Typically consists of:
  – Northbridge (or Memory Controller Hub, MCH), managing traffic between the processor, RAM, GPU, and optionally PCI Express slots
  – Southbridge (or I/O Controller Hub, ICH), coordinating a slower set of devices, including the traditional PCI bus, ISA bus, SMBus, IDE (ATA), DMA and interrupt controllers, real-time clock, BIOS memory, ACPI power management, LPC bridge (providing fan control, floppy disk, keyboard, mouse, MIDI interfaces, etc.), and optionally Ethernet, USB, IEEE 1394, audio codecs and RAID interface

Major Chipset Vendors

• Intel – http://developer.intel.com/products/chipsets/index.htm
• Via – http://www.via.com.tw/en/products/chipsets
• SiS – http://www.sis.com/products/product_000001.htm
• AMD/ATI – http://ati.amd.com/products/integrated.html
• Nvidia – http://www.nvidia.com/page/mobo.html

Chipset Features Overview

(Figure: chipset features comparison chart.)

Other Common Chipsets

Intel
• Desktop chipsets
  – 975X Express Chipset
  – 955X Express Chipset
  – 845E Chipset
• Laptop chipsets
  – Intel® 855GME Chipset
  – Mobile Intel® 945GM Express Chipset
  – Mobile Intel® 945PM Express Chipset

AMD
• Desktop chipsets
  – AMD 580X CrossFire™ Chipset
  – ATI Radeon™ Xpress 200
  – ATI SB600 Series
• Laptop chipsets
  – ATI Radeon™ Xpress 1100 Series
  – ATI Radeon™ Xpress 200M

Motherboard

• Also referred to as mainboard or system board
• Provides mechanical and electrical support for pluggable components of a computer system
• Constitutes the central circuitry of a computer, distributing power and clock signals to target devices, and implementing a communication backplane for data exchanges between them
• Defines expansion possibilities of a computer system through slots accommodating special-purpose cards, memory modules, processor(s) and I/O ports
• Available in many form factors and with various capabilities to match particular system needs, housing capacity and cost

Motherboard Form Factors

• Refer to standardized motherboard sizes
• Most popular form factor used today is ATX, evolved from the now-obsolete AT (Advanced Technology) format
• Examples of other common form factors:
  – MicroATX, miniaturized version of ATX
  – WTX, large form factor designated for use in high-power workstations/servers featuring multiple processors
  – Mini-ITX, designed for use in thin clients
  – PC/104 and ETX, used in embedded systems and single-board computers
  – BTX (Balanced Technology Extended), introduced by Intel as a possible successor to ATX

Motherboard Manufacturers

• Abit
• Albatron
• Aopen
• DFI
• ECS
• Epox
• FIC
• Gigabyte
• IBM
• Intel
• Jetway
• MSI
• Shuttle
• Soyo
• SuperMicro
• VIA

Populated CPU Socket

Source: http://www.motherboards.org

DIMM Memory Sockets

Source: http://www.motherboards.org

Motherboard on Celeritas

SuperMike Motherboard: Tyan Thunder i7500 (S720)

Source: http://www.tyan.com

PCI-Enhanced Systems

Source: http://arstechnica.com/articles/paedia/hardware/pcie.ars/1

PCI Express

Lane width | Clock speed | Throughput (duplex, bits) | Throughput (duplex, bytes) | Initial expected uses
x1 | 2.5 GHz | 5 Gbps | 400 MBps | Slots, Gigabit Ethernet
x2 | 2.5 GHz | 10 Gbps | 800 MBps |
x4 | 2.5 GHz | 20 Gbps | 1.6 GBps | Slots, 10 Gigabit Ethernet, SCSI, SAS
x8 | 2.5 GHz | 40 Gbps | 3.2 GBps |
x16 | 2.5 GHz | 80 Gbps | 6.4 GBps | Graphics adapters

Source: http://www.redbooks.ibm.com/abstracts/tips0456.html
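The per-direction payload numbers in the bandwidth comparison table below follow from PCIe 1.x's 8b/10b line code: each 2.5 GHz lane carries 2.5 Gbit/s of symbols, of which 8/10 is data. A small C sketch of that arithmetic (assumes PCIe 1.x signaling):

```c
#include <stdio.h>

int main(void)
{
    const double raw_gbps  = 2.5;                   /* symbol rate per lane, per direction */
    const double data_gbps = raw_gbps * 8.0 / 10.0; /* 8b/10b leaves 2.0 Gbit/s of payload */

    for (int lanes = 1; lanes <= 16; lanes *= 2)
        printf("x%-2d link: %4.0f MB/s per direction\n",
               lanes, lanes * data_gbps * 1000.0 / 8.0); /* 250, 500, 1000, 2000, 4000 */
    return 0;
}
```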

PCI-X

Bus | Width | Clock speed | Features | Bandwidth
PCI-X 66 | 64 bits | 66 MHz | Hot plugging, 3.3 V | 533 MB/s
PCI-X 133 | 64 bits | 133 MHz | Hot plugging, 3.3 V | 1.06 GB/s
PCI-X 266 | 64 bits, optional 16 bits only | 133 MHz, Double Data Rate | Hot plugging, 3.3 & 1.5 V, ECC supported | 2.13 GB/s
PCI-X 533 | 64 bits, optional 16 bits only | 133 MHz, Quad Data Rate | Hot plugging, 3.3 & 1.5 V, ECC supported | 4.26 GB/s

Bandwidth Comparisons

Connection | Bits | Bytes
PCI 32-bit/33 MHz | 1.06666 Gbit/s | 133.33 MB/s
PCI 64-bit/33 MHz | 2.13333 Gbit/s | 266.66 MB/s
PCI 32-bit/66 MHz | 2.13333 Gbit/s | 266.66 MB/s
PCI 64-bit/66 MHz | 4.26666 Gbit/s | 533.33 MB/s
PCI 64-bit/100 MHz | 6.39999 Gbit/s | 799.99 MB/s
PCI Express (x1 link) | 2.5 Gbit/s | 250 MB/s
PCI Express (x4 link) | 10 Gbit/s | 1 GB/s
PCI Express (x8 link) | 20 Gbit/s | 2 GB/s
PCI Express (x16 link) | 40 Gbit/s | 4 GB/s
PCI Express 2.0 (x32 link) | 80 Gbit/s | 8 GB/s
PCI-X DDR 16-bit | 4.26666 Gbit/s | 533.33 MB/s
PCI-X 133 | 8.53333 Gbit/s | 1.06666 GB/s
PCI-X QDR 16-bit | 8.53333 Gbit/s | 1.06666 GB/s
PCI-X DDR | 17.066 Gbit/s | 2.133 GB/s
PCI-X QDR | 34.133 Gbit/s | 4.266 GB/s
AGP 8x | 17.066 Gbit/s | 2.133 GB/s

HyperTransport: Context

• The Northbridge-Southbridge device connection facilitates communication over the fast processor bus between system memory, graphics adaptor, and CPU
• The Southbridge operates several I/O interfaces, through the Northbridge, operating over another proprietary connection
• This approach is potentially limited by the emerging bandwidth demands over inadequate I/O buses
• HyperTransport is one of the many technologies aimed at improving I/O.
• High data rates are achieved by using enhanced, low-swing, 1.2 V Low Voltage Differential Signaling (LVDS) that employs fewer pins and wires, consequently reducing cost and power requirements.
• HyperTransport also helps in communication between multiple AMD Opteron CPUs

Source: http://www.amd.com/us-en/Processors/ComputingSolutions/0,,30_288_13265_13295%5E13340,00.html

HyperTransport (continued)

• Point-to-point parallel topology uses 2 unidirectional links (one each for upstream and downstream)
• HyperTransport technology chunks data into packets to reduce overhead and improve efficiency of transfers.
• Each HyperTransport technology link also contains an 8-bit data path that allows for insertion of a control packet in the middle of a long data packet, thus reducing latency.
• In summary: "HyperTransport™ technology delivers the raw throughput and low latency necessary for chip-to-chip communication. It increases I/O bandwidth, cuts down the number of different system buses, reduces power consumption, provides a flexible, modular bridge architecture, and ensures compatibility with PCI."
Source: http://www.amd.com/us-en/Processors/ComputingSolutions/0,,30_288_13265_13295%5E13340,00.html

Performance Issues

• Cache behavior (see the loop-order sketch after this list)
  – Hit/miss rate
  – Replacement strategies
• Prefetching
• Clock rate
• ILP
• Branch prediction
• Memory
  – Access time
  – Bandwidth
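Of these, cache behavior is the one most directly under the programmer's control. A minimal illustrative sketch (array name and size are arbitrary): both functions compute the same sum over a row-major C array, but the second strides by N doubles per access and will typically miss far more often:

```c
#define N 1024
static double a[N][N];

/* unit stride: consecutive j values fall in the same cache lines */
double sum_row_major(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* stride of N doubles: nearly every access touches a new cache line */
double sum_column_major(void)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```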

Summary – Material for the Test

• Please make sure that you have addressed all points outlined on slide 5
• Understand content on slide 7
• Understand concepts, equations, and problems on slides 11, 12, 13
• Understand content on slides 21, 24, 26, 29
• Understand concepts on slides 32, 33
• Understand content on slides 36, 55

• Required reading material: http://arstechnica.com/articles/paedia/hardware/pcie.ars/1
