The Future is not what it used to be...
Erik Hagersten

Then...
ENIAC, 1946 ("5 kHz")
- 18,000 vacuum tubes
- programmed by plugboard wiring

AVDARK 2012, Dept of Information Technology | www.it.uu.se | Erik Hagersten | user.it.uu.se/~eh

Then (in Sweden)
- BARK (~1950): 8,000 relays, 80 km of cables
- BESK (~1953): 2,400 vacuum tubes, "20 kHz" (world record)

"Recently"
APZ 212, 1983: Ericsson's supercomputer ("5 MHz")

APZ 212 marketing brochure quotes:
- "Very compact": 6 times the performance, 1/6th the size, 1/5th the power consumption
- "A breakthrough in computer science"
- "Why more CPU power?" "All the power needed for future development"
- "...800,000 BHCA, should that ever be needed"
- "SPC computer science at its most elegant"
- "Using 64 kbit memory chips"
- "1500 W power consumption"

65 years of "improvements":
speed, size, price, price/performance, reliability, predictability, energy, safety, usability...
"Moore's Law"
Popular version: performance doubles every 18-24 months.
[Log-scale plot: performance over the years; single-core growth up to ~2006, multicore thereafter.]

Ray Kurzweil pictures: www.KurzweilAI.net/pps/WorldHealthCongress/

Exponential development: doubling/halving times (according to Kurzweil)
- Dynamic RAM memory (bits per dollar): 1.5 years
- Average transistor price: 1.6 years
- Microprocessor cost per transistor cycle: 1.1 years
- Total bits shipped: 1.1 years
- Processor performance in MIPS: 1.8 years
- Transistors in Intel microprocessors: 2.0 years

Linear scale, 1940-2017 (2x performance every 18 months)
[Plot: performance doubling every 18 months since 1940 looks like a vertical wall on a linear scale.]

Exponential development
- Example: doubling every 2nd year. How long does it take to reach a 1000x improvement?
- Example: doubling every 18 months. How long does it take to reach a 1000x improvement?
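The two questions above have a one-line answer: a 1000x improvement needs log2(1000) ≈ 10 doublings, so multiply that by the doubling period. A minimal sketch:

```python
import math

def time_to_improve(factor, doubling_period_years):
    """Years needed for a `factor`x improvement when performance
    doubles every `doubling_period_years` years."""
    doublings = math.log2(factor)  # 1000x needs ~9.97 doublings
    return doublings * doubling_period_years

print(time_to_improve(1000, 2.0))  # doubling every 2nd year  -> ~20 years
print(time_to_improve(1000, 1.5))  # doubling every 18 months -> ~15 years
```

This is why the exponent matters so much: shaving the doubling period from 24 to 18 months buys five years on every 1000x milestone.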
Looking Forward
Three rules of common wisdom:
1. Do not bet against exponential trends
2. Do not bet against exponential trends
3. Do not bet against exponential trends
But is it possible to continue "Moore's Law"?
- Are there show-stoppers?
- Can we utilize an exponential growth in the number of cores?

Not everything scales as fast!
Example: 470.lbm, the "Lattice Boltzmann Method" for simulating incompressible fluids in 3D.
[Plot: throughput vs number of cores used (1-4); scaling flattens well below 4x.]
Throughput (as defined by SPEC): the amount of work performed per time unit when several instances of the application are executed simultaneously.
Our throughput study: compare the throughput improvement when going from 1 core to 4 cores.

Nerd Curve: 470.lbm
[Plot: miss rate (excluding HW prefetch effects) vs cache size, running one thread and running four threads; utilization, i.e., the fraction of fetched cache data actually used, on the right-hand scale; the miss rate could drop from ~5.0% toward ~3.5% if the utilization problem were fixed. Four threads do less work per memory byte moved.]

Remember: it is getting worse!
Computation vs bandwidth: the number of cores grows with the number of transistors, while the number of pins (and hence off-chip DRAM bandwidth) grows far more slowly.
[Plot: (#transistors x transistor frequency) / (#pins x pin frequency), 2007-2015. Source: International Technology Roadmap for Semiconductors (ITRS).]
From Karlsson and Hagersten, "Conserving Memory Bandwidth in Chip Multiprocessors with Runahead Execution," IPDPS, March 2007.
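The flattening throughput curve above falls out of a simple roofline-style model: aggregate throughput is the minimum of the compute limit (which grows with the core count) and the pin-bandwidth limit (which does not). A sketch with made-up, illustrative numbers:

```python
def throughput(n_cores, per_core_rate, mem_bandwidth, bytes_per_op):
    """Aggregate throughput (ops/s) of n identical cores sharing one
    memory interface: compute-bound until the pins saturate."""
    compute_limit = n_cores * per_core_rate
    bandwidth_limit = mem_bandwidth / bytes_per_op
    return min(compute_limit, bandwidth_limit)

# Hypothetical numbers: each core sustains 1e9 ops/s, the chip has
# 10 GB/s of DRAM bandwidth, and the code moves 4 bytes per op.
for n in range(1, 5):
    print(n, throughput(n, 1e9, 10e9, 4))
```

With these numbers the third and fourth cores buy almost nothing: throughput pins at 2.5e9 ops/s once the memory interface saturates, which is exactly the 470.lbm behavior.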
[Graph updated with more recent data.]
HPCwire, Feb 2011 [cites Linley Gwennap and Justin Rattner]:
- "Without Silicon Photonics, Moore's Law Won't Matter"
- "Growing Data Deluge Prompts Processor Redesign"

Case study: limited by bandwidth

Nerd Curve (again)
[Plot: miss rate (excluding HW prefetch effects) vs cache size for the original and the optimized application, running four threads; the optimization moves the miss rate from ~5.0% toward ~2.5% by doing twice the amount of work per memory byte moved.]

Better memory usage!
Example: 470.lbm modified to promote better cache utilization.
[Plot: throughput vs number of cores used (1-4), optimized vs original code; the optimized version keeps scaling where the original flattened.]

Example 2: a scalable parallel application
App: Cigar
[Plot: performance vs number of cores (1-4), nearly linear.]
Looks like a perfectly scalable application! Are we done?

Example 2: the same application, optimized
[Plot: performance vs number of cores (1-4), original vs optimized; the optimized version is 7.3x faster.]
It looked like a perfectly scalable application, but we were not done.
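"Twice the work per memory byte moved" typically comes from improving cache-line utilization: if a loop only touches a few bytes of every fetched 64-byte line, most of the scarce bandwidth is wasted. A toy model of that effect (the 64-byte line size and the array-of-structs layout are illustrative assumptions, not details of 470.lbm or Cigar):

```python
LINE = 64  # assumed cache line size in bytes

def utilization(useful_bytes, stride):
    """Fraction of each fetched cache line actually used when a loop
    reads `useful_bytes` out of every `stride` bytes of memory."""
    if stride >= LINE:
        return useful_bytes / LINE  # one whole line fetched per element
    return useful_bytes / stride    # contiguous elements share lines

# Array-of-structs: read one 8-byte field out of each 64-byte record.
print(utilization(8, 64))  # 0.125 -> 8x more traffic than needed
# Struct-of-arrays: the same fields packed contiguously.
print(utilization(8, 8))   # 1.0   -> every byte moved is used
```

Restructuring data so that the bytes a loop needs sit next to each other raises utilization, which lowers the effective miss rate and traffic without touching the algorithm, which is why the nerd curve shifts down.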
The fix: duplicate one data structure.

Implementation Trends
Predicting the future is hard.
Predicting "Chip Multiprocessors," aka multicores [from PARA, Bergen, 2000]:
Chip Multiprocessor (CMP): many simple, fast CPUs, each with a private L1 cache, sharing an L2 cache and external memory interfaces on one chip -- many open questions.

Multi-CMPs [from PARA, Bergen, 2000]
Explicit parallelism: #chips x #threads/chip
- Global shared memory
- Global/local communication cost ratio > 10
- Gotta explore small caches
- Gotta explore locality!
- OS scalability?
- Application scalability?

Why Multicores Now? -- How is "Moore's Law" doing?
[Plot: log-scale performance over time; single-core growth flattens and multicore takes over around 2007.]
1. Not enough ILP/MLP to get payoff from using more transistors
2. Signal propagation delay >> transistor delay
3. Power consumption: Pdyn ~ C * f * V^2

Darling, I shrunk the computer
- Mainframes ≈ super minis: sequential execution (one program)
- Microprocessor: sequential execution
- Chip Multiprocessor (CMP): a multiprocessor on a chip!
Paradigm shift: you need thread-level parallelism (TLP) to make one chip run fast.

HPC in the Rear Mirror... (* = promise, † = drawback)
- Vector machines: * nifty, † not general, no standards, expensive
- Naive parallel machines: * scalability, † hard to use, bad scaling
- Killer-micro SMPs: * scalability, † high cost
- Beowulf x86 Linux clusters: * COTS cost convergence, † COTS performance, management; converging with commercial UNIX computing
- MC clusters: * forced by technology, † ????
- MC + accelerators: * promise of performance, † ????
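Reason 3 above, Pdyn ~ C * f * V^2, is the core of the multicore argument: because frequency scales roughly with supply voltage, lowering f lets you lower V too, and power drops cubically while throughput only drops linearly. A sketch with illustrative numbers (the "halve f, halve V" scaling is an idealized assumption):

```python
def p_dyn(C, f, V):
    """Dynamic CMOS switching power: Pdyn ~ C * f * V^2."""
    return C * f * V**2

one_fast = p_dyn(C=1.0, f=2.0, V=1.0)           # one core at full speed
two_slow = 2 * p_dyn(C=1.0, f=1.0, V=0.5)       # two cores, f and V halved
print(one_fast, two_slow)  # same aggregate throughput, 1/4 the power
```

That 4x power win per unit of throughput is why, once the ILP well ran dry, spending transistors on more (slower) cores beat spending them on one faster core.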
[Timeline: 1980, 1990, 2000, 2010]

Parallelism can be used to hide memory latency
- Intel "Hyper-Threading"
- T1 Niagara, MIC, ... (4 threads per core)
- GPUs
- ...
Is this a good idea? It cannot hide the need for bandwidth!

Parallelism is a Hard Currency
Speedup is bought with parallelism. Remember Amdahl's Law?

Do you have 1000 threads to spare? SIMD rears its ugly head again
NVIDIA Fermi:
- 512 "cores" (C), 16 C per stream processor (SP)
- the SP is SIMD-ish (sort of), i.e., SIMD-ish aka vector
- full double-precision IEEE FP support
- 64 kB L1 cache per SP
- 768 kB global shared L2 cache (less than the sum of the L1s)
- atomic instructions
- ECC correction for DRAM
- debugging support
- efficient synchronization within an SP
- a giant chip with high power consumption
- ...

SIMD and CPUs?
NVIDIA Fermi:
- special language...
- topology matters...
- user-managed memory
Reminds me of the vector machines: * scalability, † hard to use, no standards.
Common research papers: "How to get 100X speedup".
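Amdahl's Law is the exchange rate behind "parallelism is a hard currency": the serial fraction of a program caps the speedup no matter how many threads you can spare. A minimal sketch:

```python
def amdahl_speedup(p, n):
    """Amdahl's Law: speedup on n cores when a fraction p of the
    work is parallelizable (and 1 - p stays serial)."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 1000 cores, a 5% serial fraction caps speedup near 20x.
print(amdahl_speedup(0.95, 1000))
print(amdahl_speedup(0.95, 4))
```

This is why "how to get 100X speedup" papers deserve scrutiny: a 100x speedup requires the serial fraction to be below 1%, before counting any bandwidth or synchronization costs.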