High-Performance Computing - and Why Learn About It?
Tarek El-Ghazawi
The George Washington University, Washington, D.C., USA

Outline
What is High-Performance Computing?
Why is High-Performance Computing Important?
Advances in Performance and Architectures
Heterogeneous Accelerated Computing
Advances in Parallel Programming
Making Progress: The HPCS Program, Near-Term
Making Progress: Exascale and DOE
Conclusions
What is Supercomputing and Parallel Architectures?
Also called High-Performance Computing and Parallel Computing.
Research and innovation in architecture, programming, and applications associated with computer systems that are orders of magnitude faster (10x-1000x or more) than modern desktop and laptop computers.
Supercomputers achieve speed through massive parallelism (parallel architectures!), e.g., many processors working together.
http://www.collegehumor.com/video:1828443
Outline
What is High-Performance Computing?
Why is High-Performance Computing Important?
Advances in Performance and Architectures
Hardware Accelerators and Accelerated Computing
Advances in Parallel Programming
What is Next: The HPCS Program, Near-Term
What is Next: Exascale and DARPA UHPC
Conclusions
Why is HPC Important?
Critical for economic competitiveness because of its wide applications (through simulations and intensive data analyses).
Drives computer hardware and software innovations for future conventional computing.
Is becoming ubiquitous, i.e., all computing/information technology is turning parallel!
Is that why it is turning into an international HPC muscle-flexing contest?
Why is HPC Important?
Traditional engineering workflow: Design -> Build -> Test
HPC-enabled workflow: Design -> Model -> Simulate -> Build
Why is HPC Important? National and Economic Competitiveness
Application examples:
Molecular Dynamics (HIV-1 protease inhibitor drug): an HPC simulation of 2 ns takes 2 weeks on a desktop vs. 6 hours on a supercomputer.
Gene Sequence Alignment (phylogenetic analysis): 32 days on a desktop vs. 1.5 hours on a supercomputer.
Car Crash Simulations: a 2-million-element simulation takes 4 days on a desktop vs. 25 minutes on a supercomputer.
Understanding the Fundamental Structure of Matter: requires a billion-billion calculations per second.

Why is HPC Important? National and Economic Competitiveness
Industrial competitiveness: computational models that can run on HPC are not only for the design of NASA space shuttles; they can also help with Business Intelligence (e.g., IBM and Watson) and with designing effective shapes and/or materials for potato chips, Clorox bottles, ...

HPC Technology of Today is Conventional Computing of Tomorrow: Multi/Many-cores in Desktops and Laptops
The ASCI Red Supercomputer: 9,000 chips for 3 TeraFLOPS in 1997
Intel 80-Core Chip: 1 chip and 1 TeraFLOPS in 2007
Intel 72-Core Chip (Xeon Phi KNL): 1 chip and 3 TeraFLOPS in 2016

Why is HPC Important? HPC is Ubiquitous
Sony PS3: uses the Cell processor
iPhone 7: 4 cores at 2.34 GHz
HPC is ubiquitous! The Roadrunner, the fastest supercomputer in 2008, also used Cell processors; the Xeon Phi KNL is a 72-CPU chip. All computing is becoming HPC; can we become bystanders?

Why is this happening? The End of Moore's Law in Clocking
The phenomenon of exponential improvements in processors was observed in 1965 by Intel co-founder Gordon Moore. Three common formulations:
The speed of a microprocessor doubles every 18-24 months, assuming the price of the processor stays the same. Wrong, not anymore!
The price of a microchip drops about 48% every 18-24 months, assuming the same processor speed and on-chip memory capacity. OK, for now.
The number of transistors on a microchip doubles every 18-24 months, assuming the price of the chip stays the same. OK, for now.

No Faster Clocking, but More Cores?
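The doubling statement above can be made concrete with a quick back-of-the-envelope projection (a sketch; the 18-24 month doubling period is taken from the slide, and the starting count and horizon are arbitrary illustrations):

```python
def projected_transistors(initial: float, years: float, doubling_months: float = 24.0) -> float:
    """Project growth assuming one doubling every `doubling_months` (slide's upper bound)."""
    doublings = years * 12.0 / doubling_months
    return initial * 2.0 ** doublings

# With a 24-month doubling period, a chip design grows 32x over a decade (5 doublings).
growth = projected_transistors(1.0, years=10)
```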
Source: Ed Davis, Intel

Cores and Power Efficiency
Source: Ed Davis, Intel
Comparative View of Processors and Accelerators
(Columns: fabrication process, clock frequency, cores, peak single-precision, peak double-precision, power, DP Flops/W, memory bandwidth, memory type.)

PowerXCell 8i: 65 nm; 3.2 GHz; 1+8 cores; SP 204 GFlops; DP 102.4 GFlops; 92 W; 1.11 GFlops/W; 25.6 GB/s; XDR
NVIDIA Fermi Tesla M2090: 40 nm; 1.3 GHz; 512 cores; SP 1330 GFlops; DP 665 GFlops; 225 W; 2.9 GFlops/W; 177 GB/s; GDDR5
NVIDIA Kepler K20X: 28 nm; 0.73 GHz; 2688 cores; SP 3950 GFlops; DP 1310 GFlops; 235 W; 5.6 GFlops/W; 250 GB/s; GDDR5
NVIDIA Kepler K80: 28 nm; 0.88 GHz; 2x2496 cores; SP 8749 GFlops; DP 2910 GFlops; 300 W; 9.7 GFlops/W; 480 GB/s; GDDR5
Intel Xeon Phi 5110P (KNC): 22 nm; 1.05 GHz; 60 cores (240 threads); DP 1011 GFlops; 225 W; 4.5 GFlops/W; 320 GB/s; GDDR5
Intel Xeon Phi 7290 (KNL): 14 nm; 1.7 GHz; 72 cores (288 threads); DP ~3500 GFlops; 245 W; 14.3 GFlops/W; 115.2 GB/s; DDR4
Intel Xeon E7-8870: 32 nm; 2.4 (2.8) GHz; 10 cores; SP 202.6 GFlops; DP 101.3 GFlops; 130 W; 0.78 GFlops/W; 42.7 GB/s; DDR3-1333
AMD Opteron 6176 SE: 45 nm; 2.5 GHz; 12 cores; SP 240 GFlops; DP 120 GFlops; 140 W; 0.86 GFlops/W; 42.7 GB/s; DDR3-1333
Xilinx Virtex-6 SX475T: 40 nm; DP 98.8 GFlops; 50 W; 3.3 GFlops/W
Altera Stratix V GSB8: 28 nm; DP 210 GFlops; 60 W; 3.5 GFlops/W

Most Power Efficient Architectures: Green 500
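The Flops/W figures in the comparative table are simply peak double-precision throughput divided by board power. A minimal sketch reproducing two of the table's rows (all numbers are copied from the table above):

```python
def dp_gflops_per_watt(peak_dp_gflops: float, power_watts: float) -> float:
    """Power efficiency: peak double-precision GFlops divided by board power in watts."""
    return peak_dp_gflops / power_watts

# NVIDIA Fermi Tesla M2090: 665 DP GFlops at 225 W -> ~2.9 GFlops/W (matches the table).
fermi = dp_gflops_per_watt(665.0, 225.0)

# Intel Xeon Phi 7290 (KNL): ~3500 DP GFlops at 245 W -> ~14.3 GFlops/W (matches the table).
knl = dp_gflops_per_watt(3500.0, 245.0)
```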
Source: https://www.top500.org/green500/lists/2016/11/15

Outline
What is High-Performance Computing?
Why is High-Performance Computing Important?
Advances in Performance and Architectures
Heterogeneous Accelerated Computing
Advances in Parallel Programming
What is Next: The HPCS Program, Near-Term
What is Next: Exascale and DOE
Conclusions
How is the Supercomputing Race Conducted? TOP500 Supercomputers and LINPACK
The Top500 list is published twice a year, in November and in June.
Rmax: maximal LINPACK performance achieved.
Rpeak: theoretical peak performance.
In the TOP500 list, computers are ordered first by their Rmax value.
For equal Rmax values, the order is by Rpeak.
For sites with the same performance, the order is by memory size and then alphabetically.
Check www.top500.org for more information.
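The tie-breaking rules above amount to a lexicographic sort key. A minimal sketch (the record fields and values are illustrative, not Top500's actual schema):

```python
# Hypothetical system records; only the ranking-relevant fields are modeled.
systems = [
    {"name": "B", "rmax": 33.9, "rpeak": 54.9, "memory": 1024},
    {"name": "A", "rmax": 93.0, "rpeak": 125.4, "memory": 1310},
    {"name": "C", "rmax": 33.9, "rpeak": 61.0, "memory": 2048},
]

# Order: Rmax descending, then Rpeak descending, then memory descending, then name.
ranked = sorted(systems, key=lambda s: (-s["rmax"], -s["rpeak"], -s["memory"], s["name"]))
```

Here "B" and "C" tie on Rmax, so "C" wins on its larger Rpeak.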
Top 10 Supercomputers: November 2016 (www.top500.org)
Rank / computer, site, country / architecture / # cores / Rmax:

1. Sunway TaihuLight, National Supercomputing Center in Wuxi, China: Sunway MPP, Sunway SW26010 260C 1.45 GHz, NRCPC; 10,649,600 cores; 93.0 PFlops
2. Tianhe-2 (MilkyWay-2), National University of Defense Technology, China: TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.2 GHz, TH Express-2, Intel Xeon Phi 31S1P; 3,120,000 cores; 33.9 PFlops
3. Titan, Oak Ridge National Laboratory, United States: Cray XK7, Opteron 16C 2.2 GHz, Gemini, NVIDIA K20X; 560,640 cores; 17.6 PFlops
4. Sequoia, Lawrence Livermore National Laboratory, United States: BlueGene/Q, Power BQC 16C, custom interconnect; 1,572,864 cores; 16.3 PFlops
5. Cori, DOE/SC/LBNL/NERSC, United States: Cray XC40, Intel Xeon Phi 7250 68C 1.4 GHz, Aries interconnect, Cray Inc.; 622,336 cores; 14.0 PFlops
6. Oakforest-PACS, Joint Center for Advanced High Performance Computing, Japan: PRIMERGY CX1640 M1, Intel Xeon Phi 7250 68C 1.4 GHz, Intel Omni-Path, Fujitsu; 556,104 cores; 13.6 PFlops
7. K Computer, RIKEN Advanced Institute for Computational Science, Japan: SPARC64 VIIIfx 2.0 GHz, Tofu interconnect; 795,024 cores; 10.5 PFlops
8. Piz Daint, Swiss National Supercomputing Centre (CSCS), Switzerland: Cray XC30, Xeon E5-2670 8C 2.6 GHz, Aries interconnect, NVIDIA K20X, Cray Inc.; 206,720 cores; 9.8 PFlops
9. Mira, Argonne National Laboratory, United States: BlueGene/Q, Power BQC 16C, custom interconnect; 786,432 cores; 8.16 PFlops
10. Trinity, DOE/NNSA/LANL/SNL, United States: Cray XC40, Xeon E5-2698v3 16C 2.3 GHz, Aries interconnect, Cray Inc.; 301,056 cores; 8.1 PFlops

History
Source: top500.org. Also see: http://spectrum.ieee.org/tech-talk/computing/hardware/china-builds-worlds-fastest-supercomputer

Supercomputers - History
Year / computer / processor / # processors / Rmax (TFlops):

2016  Sunway TaihuLight: Sunway MPP, Sunway SW26010 260C 1.45 GHz; 10,649,600; 93,014
2013  Tianhe-2 (MilkyWay-2): TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.2 GHz, TH Express-2, Intel Xeon Phi 31S1P; 3,120,000; 33,862
2012  Titan, Cray XK7: Opteron 16C 2.2 GHz, NVIDIA K20X; 560,640; 17,600
2011  K Computer, Japan: SPARC64 VIIIfx 2.0 GHz; 705,024; 10,510
2010  Tianhe-1A, China: Intel EM64T Xeon X56xx (Westmere-EP) 2930 MHz (11.72 GFlops) + NVIDIA GPU, FT-1000 8C; 186,368; 2,566
2009  Jaguar, Cray: Cray XT5-HE, Opteron Six-Core 2.6 GHz; 224,162; 1,759
2008  Roadrunner, IBM: PowerXCell 8i 3200 MHz (12.8 GFlops); 122,400; 1,026
2007  BlueGene/L (eServer Blue Gene Solution), IBM: PowerPC 440 700 MHz (2.8 GFlops); 212,992; 478
2005  BlueGene/L (eServer Blue Gene Solution), IBM: PowerPC 440 700 MHz (2.8 GFlops); 131,072; 280
2004  BlueGene/L beta-System, IBM: PowerPC 440 700 MHz (2.8 GFlops); 32,768; 70.7
2002  Earth-Simulator, NEC: NEC 1000 MHz (8 GFlops); 5,120; 35.8
2001  ASCI White SP, IBM: POWER3 375 MHz (1.5 GFlops); 8,192; 7.2
2000  ASCI White SP, IBM: POWER3 375 MHz (1.5 GFlops); 8,192; 4.9
1999  ASCI Red, Intel: Intel IA-32 Pentium Pro 333 MHz (0.333 GFlops); 9,632; 2.4
Historical Analysis
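The history table implies a remarkably steady growth rate. A quick sketch estimating the average annual growth factor between ASCI Red (1999, 2.4 TFlops) and Sunway TaihuLight (2016, 93,014 TFlops), using only numbers from the table:

```python
import math

rmax_1999 = 2.4        # TFlops, ASCI Red (from the table)
rmax_2016 = 93_014.0   # TFlops, Sunway TaihuLight (from the table)
years = 2016 - 1999

# Compound annual growth factor and the implied doubling time in months.
annual_factor = (rmax_2016 / rmax_1999) ** (1.0 / years)          # roughly 1.86x per year
doubling_months = 12.0 * math.log(2.0) / math.log(annual_factor)  # roughly 13-14 months
```

The #1 system's Rmax thus doubled faster than the 18-24 month transistor-doubling period, since parallelism grew on top of per-chip improvements.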
(Timeline figure: performance vs. time.) Vector processors gave way to massively parallel machines around 1993 (the HPCC era, reaching TeraFLOPS), then to MPPs with multicores and massively heterogeneous accelerators around 2008-2011 (reaching PetaFLOPS), first discrete and later integrated, and by 2016 to "tons of lightweight cores", driven by the end of Moore's Law in clocking!
DARPA High-Productivity Computing Systems
Launched in 2002, targeting next-generation supercomputers by 2010.
Not only performance, but productivity, where:
Productivity = f(execution time, development time)
Typically, Productivity = utility / cost
Addresses everything: hardware and software.
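The productivity definition above can be expressed as a toy model (a sketch with illustrative units; published HPCS productivity models vary, and the cost model here is an assumption for illustration):

```python
def productivity(utility: float, dev_time_hours: float, exec_time_hours: float,
                 cost_per_hour: float) -> float:
    """Utility per unit cost, where cost grows with both development and execution time."""
    cost = (dev_time_hours + exec_time_hours) * cost_per_hour
    return utility / cost

# A system that halves execution time but doubles development time may not win:
fast_hw = productivity(utility=100.0, dev_time_hours=80.0, exec_time_hours=10.0, cost_per_hour=1.0)
easy_sw = productivity(utility=100.0, dev_time_hours=40.0, exec_time_hours=20.0, cost_per_hour=1.0)
```

This is exactly the HPCS point: development time is a first-class term, not an afterthought.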
HPCS Structure
Each team is led by a company and includes university research groups. Three phases:
Phase I (Research Concepts): SGI, HP, Cray, IBM, and Sun
Phase II (R&D): Cray, IBM, Sun
Phase III (Deployment): Cray, IBM
GWU worked with SGI in Phase I and with IBM in Phase II.
IBM, Sun & Cray's Effort on HPCS
Vendor / project / hardware architecture / language:
IBM: PERCS; Power PC; X10
Sun: Hero; "Rock" multi-core SPARC; Fortress
Cray: Cascade; -; Chapel
HPCS on IBM, Sun & Cray
IBM: PERCS (Productive, Easy-to-use, Reliable Computing System), Power architecture
Sun: Hero, multi-core "Rock" SPARC
Cray: Cascade
What is New in HPCS
Architecture:
Lots of parallelism on the chip
Intelligent and transactional memory
Innovative co-processing: streaming, PIM, ...
Computations migrate to data, instead of data going to computations

Programming:
PGAS programming models
Parallel MATLAB and other simple interfaces
Multiple types of parallelism and locality
Transactions

Reliability:
Self-healing
More proprietary features

What is Next: Exascale and DOE
The DOE Exascale Computing Project goals:
Deliver 50x the performance of today's systems (20 PF)
Operate within a 20-30 MW power envelope
Be sufficiently resilient (MTTI < 1 week)
Provide a software stack supporting a wide range of applications
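The exascale goals above pin down the required power efficiency: 50x a 20 PF system is one exaflop, and delivering that within 20-30 MW implies roughly 33-50 GFlops per watt. A quick check using only the slide's numbers:

```python
# 50x today's 20 PF systems = 1,000 PF = 1 exaflop.
target_pflops = 50 * 20
target_gflops = target_pflops * 1e6  # 1 PFlops = 1e6 GFlops

# Required efficiency across the 20-30 MW power envelope.
gflops_per_watt_at_30mw = target_gflops / 30e6  # ~33 GFlops/W
gflops_per_watt_at_20mw = target_gflops / 20e6  # 50 GFlops/W
```

Compare this with the comparative table earlier: the best accelerator listed there (Xeon Phi 7290) reaches about 14.3 GFlops/W, several times short of the exascale target.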
Growth of supercomputing capability
Sources: https://energy.gov/sites/prod/files/2013/09/f2/20130913-SEAB-DOE-Exascale-Initiative.pdf; figure modified from singularity.com

Technical Challenges on The Road to Exascale
Bill Dally, “Technical Challenges on the Road to Exascale” http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/BillDally_NVIDIA_SC12.pdf
The High Cost of Data Movement: fetching operands costs more than computing on them.
(Bar chart: picojoules per 64-bit operation, comparing 2008 (45 nm) and 2018 (11 nm) technology; log-scale axis from 1 to 10,000 pJ. Categories range from a double-precision FLOP and register access, through on-chip movement over 1 mm, 5 mm, and 15 mm wires, to off-chip/DRAM access, local interconnect, and cross-system communication; energy per operation grows by orders of magnitude as data moves farther.)
Source: ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems. Courtesy: John Shalf, PGAS 2015.

Three pre-Exascale Supercomputers as part of the CORAL initiative from DOE
Summit (Oak Ridge): contract budget $325M combined with Sierra; delivery 2017-18; vendor IBM; node architecture: multiple IBM POWER9 CPUs + multiple NVIDIA Volta GPUs; node performance 40+ TFLOPS; interconnect: Mellanox dual-rail EDR InfiniBand; Rpeak 150 PFLOPS; ~3,400 nodes; ~10 MW.

Sierra (Livermore): contract budget $325M combined with Summit; delivery 2017-18; vendor IBM; node architecture: multiple IBM POWER9 CPUs + multiple NVIDIA Volta GPUs; interconnect: Mellanox dual-rail EDR InfiniBand; Rpeak 120-150 PFLOPS; ~10 MW.

Aurora (Argonne): contract budget $200M; delivery 2018-19; vendor Cray; node architecture: Intel Knights Hill many-core CPUs; node performance 3+ TFLOPS; interconnect: Intel Omni-Path; Rpeak 180 PFLOPS; 50,000+ nodes; ~13 MW.
Aurora Highlights
Available data:
Cray Shasta compute platform
Intel Knights Hill many-core CPUs (3rd-generation many-core, 10 nm)
3+ TFLOPS per node
50,000+ nodes
180 PFLOPS
13 MW
Intel Omni-Path (2nd generation) with silicon photonics
500+ TB/s bisection bandwidth
2.5+ PB/s aggregate node link bandwidth

Prediction for next generation:
1 processor per node: one 100-core CPU capable of 4.5 TFLOPS peak, or 3+ TFLOPS sustained
Dual Omni-Path, 400 Gb/s aggregate bandwidth per node
50,000 nodes; 4 nodes per blade = 12,500 blades; 16 blades per chassis = 782 chassis; 6 chassis per group = 130 groups

GWU HPCL Facility
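The predicted Aurora system composition above is simple arithmetic over the slide's own numbers (4 nodes per blade, 16 blades per chassis, 6 chassis per group); a sketch to verify it:

```python
import math

nodes = 50_000
blades = nodes // 4                # 4 nodes per blade -> 12,500 blades
chassis = math.ceil(blades / 16)   # 16 blades per chassis -> 782 chassis (rounded up)
groups = chassis // 6              # 6 chassis per group -> 130 full groups (2 chassis left over)
```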
Historical Highlights of the Facility
~50 tons of cooling, 2,000 sq. ft. of elevated floor, 0.25 MW of power
Small experimental parallel systems that represent a wide spectrum of architectural ideas:
Systems with GPU accelerators from Cray and ACI
A system with Intel Phi accelerators from ACI
Systems with FPGA accelerators from SRC, SGI, Cray, and Starbridge
Homegrown clusters with InfiniBand and Myrinet
Many experimental boards and workstations from Xilinx, Intel, ...

GW CRAY XE6m/XK7m
1,856 processor cores, based on 12-core 64-bit AMD Opteron 6100 Series processors and 16-core AMD Bulldozer processors
32 NVIDIA K20 GPUs
64 GB registered ECC DDR3 SDRAM per compute node
1 Gemini routing and communications ASIC per two compute nodes
Conclusions
HPC is critical for economic competitiveness at all levels, and it is turning into an international race!
Today's advances in HPC are tomorrow's advances in conventional computing.
HPC is ubiquitous, as all computing turns into HPC.
Multicore and heterogeneous accelerator architectures are receiving growing attention but still lack software infrastructure and hardware support; they will require new programming models and OS support, an opportunity for leadership in research.
Light Reading
http://spectrum.ieee.org/computing/hardware/ibm-reclaims-supercomputer-lead, 2005
http://spectrum.ieee.org/tech-talk/computing/hardware/china-builds-worlds-fastest-supercomputer, 2010
http://spectrum.ieee.org/computing/hardware/chinas-homegrown-supercomputers, 2012