Multicore: Why Is It Happening Now? Eller Hur Mår Moore’S Lag?

Multicore: Why is it happening now? eller Hur Mår Moore’s Lag? Erik Hagersten Uppsala Universitet Are we hitting the wall now? Pop: Can the transistors be made even smaller and faster? Performance Possible path, [log] but requires a paradigm shift 1000 ? Business as 100 ” w usual… La s re o Well uhm … 10 o M “ The transistors can p: Po be made smaller and PDC 1 Summer faster, but there are School other problems / 2010 Institutionen för informationsteknologi | www.it.uu.seNowMC 2 ©Erik Hagersten| user.it.uu.se/~ehYear Darling, I shrunk the computer Mainframes Super Minis: Microprocessor: Mem Multicore: Many CPUs on a Mem PDC Summer chip! School 2010 Institutionen för informationsteknologi | www.it.uu.se MC 3 ©Erik Hagersten| user.it.uu.se/~eh Multi-core CPUs: Ageia PhysX, a multi-core physics processing unit. Ambric Am2045, a 336-core Massively Parallel Processor Array (MPPA) AMD Athlon 64, Athlon 64 FX and Athlon 64 X2 family, dual-core desktop processors. Opteron, dual- and quad-core server/workstation processors. Phenom, triple- and quad-core desktop processors. Sempron X2, dual-core entry level processors. Turion 64 X2, dual-core laptop processors. Radeon and FireStream multi-core GPU/GPGPU (10 cores, 16 5-issue wide superscalar stream processors per core) ARM MPCore is a fully synthesizable multicore container for ARM9 and ARM11 processor cores, intended for high-performance embedded and entertainment applications. Azul Systems Vega 2, a 48-core processor. Broadcom SiByte SB1250, SB1255 and SB1455. Cradle Technologies CT3400 and CT3600, both multi-core DSPs. Cavium Networks Octeon, a 16-core MIPS MPU. HP PA-8800 and PA-8900, dual core PA-RISC processors. IBM POWER4, the world's first dual-core processor, released in 2001. POWER5, a dual-core processor, released in 2004. POWER6, a dual-core processor, released in 2007. PowerPC 970MP, a dual-core processor, used in the Apple Power Mac G5. Xenon, a triple-core, SMT-capable, PowerPC microprocessor used in the Microsoft Xbox 360 game console. IBM, Sony, and Toshiba Cell processor, a nine-core processor with one general purpose PowerPC core and eight specialized SPUs (Synergystic Processing Unit) optimized for vector operations used in the Sony PlayStation 3. Infineon Danube, a dual-core, MIPS-based, home gateway processor. Intel Celeron Dual Core, the first dual-core processor for the budget/entry-level market. Core Duo, a dual-core processor. Core 2 Duo, a dual-core processor. Core 2 Quad, a quad-core processor. Core i7, a quad-core processor, the successor of the Core 2 Duo and the Core 2 Quad. Itanium 2, a dual-core processor. Pentium D, a dual-core processor. Teraflops Research Chip (Polaris), an 3.16 GHz, 80-core processor prototype, which the company says will be released within the next five years[6]. Xeon dual-, quad- and hexa-core processors. IntellaSys seaForth24, a 24-core processor. Nvidia GeForce 9 multi-core GPU (8 cores, 16 scalar stream processors per core) GeForce 200 multi-core GPU (10 cores, 24 scalar stream processors per core) Tesla multi-core GPGPU (8 cores, 16 scalar stream processors per core) Parallax Propeller P8X32, an eight-core microcontroller. picoChip PC200 series 200-300 cores per device for DSP & wireless Rapport Kilocore KC256, a 257-core microcontroller with a PowerPC core and 256 8-bit "processing elements". Raza Microelectronics XLR, an eight-core MIPS MPU Sun Microsystems PDC UltraSPARC IV and UltraSPARC IV+, dual-core processors. UltraSPARC T1, an eight-core, 32-thread processor. Summer UltraSPARC T2, an eight-core, 64-concurrent-thread processor. Texas Instruments TMS320C80 MVP, a five-core multimedia[source: video processor. Wikipedia] School Tilera TILE64, a 64-core processor XMOS Software Defined Silicon quad-core XS1-G4 2010 Institutionen för informationsteknologi | www.it.uu.se MC 4 ©Erik Hagersten| user.it.uu.se/~eh High-performance single-core CPUs: PDC Summer School 2010 Institutionen för informationsteknologi | www.it.uu.se MC 5 ©Erik Hagersten| user.it.uu.se/~eh Many shapes and forms... Mem I/F Mem FSB Mem m m m m € I/F I/F € € C C C C $ $ $ $ C C C C $ ¢ ¢ ¢ ¢ ¢ ¢ ¢ ¢ m m m m ¢ C C C C C C C C C AMD Intel Core2 IBM Cell PDC Summer School 2010 Institutionen för informationsteknologi | www.it.uu.se MC 6 ©Erik Hagersten| user.it.uu.se/~eh Outline Why multicore now? Technology reasons Performance bottlenecks in MCs Commercial offerings Reflection for the future PDC Summer School 2010 Institutionen för informationsteknologi | www.it.uu.se MC 7 ©Erik Hagersten| user.it.uu.se/~eh Everybody is doing is! But, why now? Perf Multi core [log] Single core time 2007 1. Not enough ILP to get payoff from using more transistors 2. Signal propagation delay » transistor delay 3. 2 PDC Power consumption Pdyn ~ C • f • V Summer School 2010 Institutionen för informationsteknologi | www.it.uu.se MC 8 ©Erik Hagersten| user.it.uu.se/~eh Microprocessors today: Whatever it takes to run Exploring ILP (instruction-level parallelism): Faster clocks Superscalar Pipelines Branch Prediction one Prefetching Out-of-Order ExecutionÎ Trace Cache Deep pipelines program fast. Speculation Predicate Execution Advanced Load Address Table PDC Return Address Stack Summer School ….Bad News #1: 2010 We have already explored most ILP Institutionen för informationsteknologi(ins truction-level parallelism) | www.it.uu.se MC 9 ©Erik Hagersten | user.it.uu.se/~eh Bad News #2 Looong#2: Future wire delay Technology Îslow CPUs cycle time 830 ps (1.2GHz) 170ps Latency for a 5mm trace (RC model) 390ps PDC Summer School Quantitative data 2000and trends 2005 according 2010 2015 to V. Agarwal et 70al., ps ISCA(14 GHz) 2000 2010 Based on SIA Institutionen för informationsteknologi(Semiconductor Industry Assoc | www.it.uu.se MC 10 iation) prediction, 1999 År ©Erik Hagersten | user.it.uu.se/~eh #2:Bad How News much of#2 the chip area can youLooong reach wire in one delay cycle? Îslow CPUs Span -- Fraction of chip reachable in one cycle 1.0 100% 0.75 0.50 SIA 0.25 0.4 - 1.4 % 2000 2005 2010 2015 Year PDC .25 .18 .13 .10 .070 .050 .0035 Feature size(micron) Summer School 2010 Institutionen för informationsteknologi | www.it.uu.se MC 11 ©Erik Hagersten| user.it.uu.se/~eh Bad News #2:wire delay Î Multiple small CPUs with private L1$ CPUCPUCPUCPU L1L1L1 L1L1 cache cachecache Externa L1 cache L3 L3 L2 Cache Externa Ctrl L3 PDC Summer School 2010 Institutionen för informationsteknologi | www.it.uu.se MC 12 ©Erik Hagersten| user.it.uu.se/~eh Bad News #3: Power consumption Î Lower frequency Î lower voltage 2 2 Pdyn = C * f * V ≈ area * freq * voltage CPU CPU CPU freq=f vs. freq=f/2 freq=f/2 2 2 Pdyn(C, f, V) = CfV Pdyn(2C, f/2, <V) < CfV CPU CPU CPU vs. freq=f CPU CPU freq = f/2 2 2 Pdyn(C, f, V) = CfV Pdyn(C, f/2, <V) < ½ CfV PDC Summer School 2010 Institutionen för informationsteknologi | www.it.uu.se MC 13 ©Erik Hagersten| user.it.uu.se/~eh Example: Freq. Scaling speed 0.87 1.0 pow 0.51 freq speed pow freq speed freq speed 1.20 1.13 1.51 0.80 0.87 pow 0.80 0.87 pow 0.51 0.51 20% higher freq. 20% lower freq. 20% lower freq. Two cores PDC Summer School 2010 Institutionen för informationsteknologi | www.it.uu.se MC 14 ©Erik Hagersten| user.it.uu.se/~eh Darling, I shrunk the computer Sequential execution ( Mainframes ≈ Super Minis: one program) Microprocessor: Mem Paradigm Shift Parallel Chip Multiprocessor (CMP): Mem Apps (TLP) PDC Summer A multiprocessor on a chip! School 2010 Institutionen för informationsteknologi | www.it.uu.se MC 15 ©Erik Hagersten| user.it.uu.se/~eh How to create threads? e e c c n n ie ie c c Seq. Seq. S S t t Prog. eProg. e Not ideal for k k HARD!! c c o o multi cores R R Parallel Parallelizing Modeling Prog1 Prog2 OS Prog compiler Thread1 Thread2 Thread3 Thread1 Thread2 Thread3 Throughput Computing Capability Computing •“Multitasking” •Weather simulation PDC •Virtualization •Computer games Summer •Concurrency •… School •… 2010 Institutionen för informationsteknologi | www.it.uu.se MC 16 ©Erik Hagersten| user.it.uu.se/~eh Shared Bottlenecks CPU CPU CPU CPU L1 L1 L1 L1 L2 Cache L1 L1 L1 L1 Bandwidth CPU CPU CPU CPU • Cache sharing effects • Cache utilization • Loop order • Data allocation • Data usage… PDC • Data reuse Shared Summer • Tiling (aka Blocking) Resources School • Fusion… 2010 Institutionen för informationsteknologi | www.it.uu.se MC 17 ©Erik Hagersten| user.it.uu.se/~eh Example: Poor Throughput Scaling! Example: 470.LBM "Lattice Boltzmann Method" to simulate incompressible fluids in 3D 3,5 3 2,5 2 1,5 1.0 Throughput 1 0,5 0 1234 Number of Cores Used Throughput (as defined by SPEC): Amount of work performed per time unit when several instances of the application is executed simultaneously. Our TP study: compare TP improvement when you go from 1 core to 4 cores PDC Summer School 2010 Institutionen för informationsteknologi | www.it.uu.se MC 18 ©Erik Hagersten| user.it.uu.se/~eh Throughput Scaling, More Apps 4,50 4,00 3,50 3,00 SPEC CPU 20006 FP Throughput improvements on 4 cores 2.02,50 2,00 1,50 1,00 0,50 0,00 PDC Summer School 410,bwaves 2010 416,gamess Intel X5365 3GHz “Core2”, 1333 MHz FSB, 8MB L2. (Based on data from the SPEC web) 433,milc Institutionen för informationsteknologi 434,zeusmp 435,gromacs 436,cactusADM 437,leslie3d 444,namd | www.it.uu.se 447,dealII 450,soplex MC 19 453,povray 454,calculix 459,GemsFDTD 465,tonto 470,lbm ©Erik Hagersten 481,wrf 482,sphinx3 | user.it.uu.se/~eh Nerd Curve: 470.LBM Miss rate (excluding HW prefetch effects) Utilization, i.e., fraction cache data

Load more