
High-Performance Computing - and Why Learn About It?

Tarek El-Ghazawi

The George Washington University, Washington, D.C., USA

Outline

- What is High-Performance Computing?
- Why is High-Performance Computing Important?
- Advances in Performance and Architectures
- Heterogeneous Accelerated Computing
- Advances in Parallel Programming
- Making Progress: The HPCS Program, near-term
- Making Progress: Exascale and DOE
- Conclusions

What Are Supercomputing and Parallel Architectures?

- Also called High-Performance Computing and Parallel Computing
- Research and innovation in architecture, programming, and applications associated with systems that are orders of magnitude faster (10x-1000x or more) than modern desktops and laptops
- Such systems achieve speed through massive parallelism - Parallel Architectures! - e.g., many processors working together
http://www.collegehumor.com/video:1828443
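A minimal sketch (illustrative only, not from the talk) of "many processors working together": the snippet below splits one large summation across several worker processes so that the partial sums are computed in parallel.

```python
# Split a big sum over [0, n) across several worker processes ("many processors
# working together"). The problem size and worker count are arbitrary.
from multiprocessing import Pool

def chunk_sum(bounds):
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    n, workers = 8_000_000, 8
    step = n // workers
    chunks = [(w * step, (w + 1) * step) for w in range(workers)]
    with Pool(workers) as pool:
        total = sum(pool.map(chunk_sum, chunks))  # partial sums run in parallel
    print(total)
```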

Outline

- What is High-Performance Computing?
- Why is High-Performance Computing Important?
- Advances in Performance and Architectures
- Hardware Accelerators and Accelerated Computing
- Advances in Parallel Programming
- What is Next: The HPCS Program, near-term
- What is Next: Exascale and DARPA UHPC
- Conclusions

Why is HPC Important?

- Critical for economic competitiveness because of its wide range of applications (through simulations and intensive data analyses)
- Drives computer hardware and software innovations for future conventional computing
- Is becoming ubiquitous, i.e., all computing/information technology is turning parallel!

Is that why it is turning into an international HPC muscle-flexing contest?

Why is HPC Important?

The traditional engineering cycle: Design -> Build -> Test

With HPC simulation: Design -> Model -> Simulate -> Build

Why is HPC Important? National and Economic Competitiveness

Application examples:

Molecular Dynamics (HIV-1 protease inhibitor drug) - HPC simulation for 2 ns:
- 2 weeks on a desktop
- 6 hours on a supercomputer

Gene Sequence Alignment - phylogenetic analysis:
- 32 days on a desktop
- 1.5 hrs on a supercomputer

Car Crash Simulations - 2-million-element simulation:
- 4 days on a desktop
- 25 minutes on a supercomputer

Understanding the Fundamental Structure of Matter:
- Requires a billion-billion (10^18) calculations per second

Why is HPC Important? National and Economic Competitiveness

- Industrial competitiveness
- Computational models that run on HPC are not only for the design of NASA space shuttles; they can also help with:
  - Business intelligence (e.g., IBM's Watson)
  - Designing effective shapes and/or materials for potato chips, Clorox bottles, ...

HPC Technology of Today is Conventional Computing of Tomorrow: Multi/Many-cores in Desktops and Laptops

Intel 80-core chip: 1 chip and 1 TeraFLOPS in 2007

The ASCI Red supercomputer: 9,000 chips for 3 TeraFLOPS in 1997

Intel 72-core chip (KNL): 1 chip and 3 TeraFLOPS in 2016

Why is HPC Important? - HPC is Ubiquitous

HPC is ubiquitous! All computing is becoming HPC; can we be bystanders?
- The Sony PS3 uses Cell processors, as did the Roadrunner, the fastest supercomputer in '08
- The iPhone 7 has 4 cores running at 2.34 GHz
- The Xeon Phi KNL is a single chip with 72 CPUs

Why is this happening? - The End of Moore's Law in Clocking

The phenomenon of exponential improvement in processors was observed in 1965 by Intel co-founder Gordon Moore. Three common formulations:
- The speed of a processor doubles every 18-24 months, assuming the price of the processor stays the same. Wrong, not anymore!
- The price of a microchip drops about 48% every 18-24 months, assuming the same processor speed and on-chip memory capacity. OK, for now.
- The number of transistors on a microchip doubles every 18-24 months, assuming the price of the chip stays the same. OK, for now.
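To make the doubling cadence concrete, here is a toy compound-growth sketch; the starting transistor count and the strict two-year doubling period are illustrative assumptions, not figures from the talk.

```python
# Illustrative only: project transistor counts under a strict "double every
# two years" assumption, starting from a made-up baseline of 2,300 transistors.
count = 2_300
for year in range(1971, 2017, 2):
    print(year, f"{count:,}")
    count *= 2   # one doubling per two-year step
```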

No faster clocking, but more cores?

Source: Ed Davis, Intel

Cores and Power Efficiency

Source: Ed Davis, Intel

Comparative View of Processors and Accelerators

Device                     | Process (nm) | Freq (GHz) | # Cores          | Peak SP (GFlops) | Peak DP (GFlops) | Power (W) | GFlops/W | Mem BW (GB/s) | Memory
PowerXCell 8i              | 65           | 3.2        | 1 + 8            | 204              | 102.4            | 92        | 1.11     | 25.6          | XDR
NVIDIA Fermi Tesla M2090   | 40           | 1.3        | 512              | 1330             | 665              | 225       | 2.9      | 177           | GDDR5
NVIDIA Kepler K20X         | 28           | 0.73       | 2688             | 3950             | 1310             | 235       | 5.6      | 250           | GDDR5
NVIDIA Kepler K80          | 28           | 0.88       | 2x2496           | 8749             | 2910             | 300       | 9.7      | 480           | GDDR5
Intel Xeon Phi 5110P (KNC) | 22           | 1.05       | 60 (240 threads) | -                | 1011             | 225       | 4.5      | 320           | GDDR5
Intel Xeon Phi 7290 (KNL)  | 14           | 1.7        | 72 (288 threads) | -                | ~3500            | 245       | 14.3     | 115.2         | DDR4
Intel Xeon E7-8870         | 32           | 2.4 (2.8)  | 10               | 202.6            | 101.3            | 130       | 0.78     | 42.7          | DDR3-1333
AMD Opteron 6176 SE        | 45           | 2.5        | 12               | 240              | 120              | 140       | 0.86     | 42.7          | DDR3-1333
Xilinx V6 SX475T           | 40           | -          | -                | -                | 98.8             | 50        | 3.3      | -             | -
Altera Stratix V GSB8      | 28           | -          | -                | -                | 210              | 60        | 3.5      | -             | -
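The GFlops/W column above is simply the peak double-precision rate divided by board power; a quick sanity check against a few rows of the table (values copied from the slide):

```python
# The GFlops/W column is peak double-precision rate divided by board power.
# Values below are copied from the table; the computed ratios match its column.
accelerators = {
    "NVIDIA Kepler K80":   (2910.0, 300.0),   # (peak DP GFlops, power in W)
    "Intel Xeon Phi 7290": (3500.0, 245.0),
    "Intel Xeon E7-8870":  (101.3, 130.0),
}
for name, (dp_gflops, watts) in accelerators.items():
    print(f"{name}: {dp_gflops / watts:.2f} GFlops/W")
# Prints roughly 9.70, 14.29, and 0.78, i.e., the 9.7 / 14.3 / 0.78 in the table.
```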

Most Power Efficient Architectures: Green 500

https://www.top500.org/green500/lists/2016/11/15

Outline

- What is High-Performance Computing?
- Why is High-Performance Computing Important?
- Advances in Performance and Architectures
- Heterogeneous Accelerated Computing
- Advances in Parallel Programming
- What is Next: The HPCS Program, near-term
- What is Next: Exascale and DoE
- Conclusions

How is the Supercomputing Race Conducted? TOP500 Supercomputers and LINPACK

- Top500 lists are published in November and in June
- Rmax: maximal LINPACK performance achieved
- Rpeak: theoretical peak performance
- In the TOP500 list, computers are ordered first by their Rmax value; in the case of equal Rmax values, they are ordered by Rpeak; for sites with the same performance, the order is by memory size and then alphabetically (a sketch of this ordering follows the list)
- Check www.top500.org for more information
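A small sketch of that tie-breaking rule; the systems and numbers below are made up for illustration:

```python
# Rank systems by Rmax, breaking ties by Rpeak, then memory size, then name,
# as described on this slide. All entries here are invented examples.
systems = [
    {"name": "SystemB", "rmax": 33.9, "rpeak": 54.9, "memory_tb": 1400},
    {"name": "SystemA", "rmax": 93.0, "rpeak": 125.4, "memory_tb": 1300},
    {"name": "SystemC", "rmax": 33.9, "rpeak": 50.0, "memory_tb": 1000},
]
ranked = sorted(systems,
                key=lambda s: (-s["rmax"], -s["rpeak"], -s["memory_tb"], s["name"]))
for rank, s in enumerate(ranked, start=1):
    print(rank, s["name"], s["rmax"], "PFlops")
```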

Top 10 Supercomputers: November 2016 (www.top500.org)

Rank | Site, Country                                          | Computer                                                                                                        | # Cores    | Rmax (PFlops)
1    | National Supercomputing Center in Wuxi, China          | Sunway TaihuLight - Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway (NRCPC)                                     | 10,649,600 | 93.0
2    | National University of Defense Technology, China       | Tianhe-2 (MilkyWay-2) - TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P | 3,120,000  | 33.9
3    | Oak Ridge National Laboratory, United States           | Titan - Cray XK7, Opteron 16 Cores, 2.2GHz, Gemini, NVIDIA K20x                                                 | 560,640    | 17.6
4    | Lawrence Livermore National Laboratory, United States  | Sequoia - BlueGene/Q, Power BQC 16 Cores, custom interconnect                                                   | 1,572,864  | 16.3
5    | DOE/SC/LBNL/NERSC, United States                       | Cori - Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Aries interconnect (Cray Inc.)                                | 622,336    | 14.0

Top 10 Supercomputers: November 2016 (www.top500.org)

Rank | Site, Country                                                | Computer                                                                                     | # Cores | Rmax (PFlops)
6    | Joint Center for Advanced High Performance Computing, Japan | Oakforest-PACS - PRIMERGY CX1640 M1, Intel Xeon Phi 7250 68C 1.4GHz, Intel Omni-Path         | 556,104 | 13.6
7    | RIKEN Advanced Institute for Computational Science, Japan   | K computer - SPARC64 VIIIfx 2.0 GHz, Tofu interconnect                                       | 795,024 | 10.5
8    | Swiss National Supercomputing Centre (CSCS), Switzerland    | Piz Daint - Cray XC30, Xeon E5-2670 8C 2.600GHz, Aries interconnect, NVIDIA K20x (Cray Inc.) | 206,720 | 9.8
9    | Argonne National Laboratory, United States                  | Mira - BlueGene/Q, Power BQC 16 Cores, custom interconnect                                   | 786,432 | 8.16
10   | DOE/NNSA/LANL/SNL, United States                             | Trinity - Cray XC40, Xeon E5-2698v3 16C 2.3GHz, Aries interconnect (Cray Inc.)               | 301,056 | 8.1

History

Source: top500.org. Also see: http://spectrum.ieee.org/tech-talk/computing/hardware/china-builds-worlds-fastest-supercomputer

Supercomputers - History

Computer                                     | Processor                                                                               | # Proc.    | Year | Rmax (TFlops)
Sunway TaihuLight                            | Sunway MPP, Sunway SW26010 260C 1.45GHz                                                 | 10,649,600 | 2016 | 93,014
Tianhe-2 (MilkyWay-2)                        | TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P | 3,120,000  | 2013 | 33,862
Titan                                        | Cray XK7, Opteron 16 Cores, 2.2GHz, NVIDIA K20X                                         | 560,640    | 2012 | 17,600
K-Computer, Japan                            | SPARC64 VIIIfx 2.0GHz                                                                   | 705,024    | 2011 | 10,510
Tianhe-1A, China                             | Intel EM64T Xeon X56xx (Westmere-EP) 2930 MHz (11.72 GFlops) + NVIDIA GPU, FT-1000 8C   | 186,368    | 2010 | 2,566
Jaguar, Cray                                 | Cray XT5-HE Opteron Six Core 2.6 GHz                                                    | 224,162    | 2009 | 1,759
Roadrunner, IBM                              | PowerXCell 8i 3200 MHz (12.8 GFlops)                                                    | 122,400    | 2008 | 1,026
BlueGene/L - eServer Blue Gene Solution, IBM | PowerPC 440 700 MHz (2.8 GFlops)                                                        | 212,992    | 2007 | 478
BlueGene/L - eServer Blue Gene Solution, IBM | PowerPC 440 700 MHz (2.8 GFlops)                                                        | 131,072    | 2005 | 280
BlueGene/L beta-System, IBM                  | PowerPC 440 700 MHz (2.8 GFlops)                                                        | 32,768     | 2004 | 70.7
Earth-Simulator / NEC                        | NEC 1000 MHz (8 GFlops)                                                                 | 5,120      | 2002 | 35.8
IBM ASCI White, SP                           | POWER3 375 MHz (1.5 GFlops)                                                             | 8,192      | 2001 | 7.2
IBM ASCI White, SP                           | POWER3 375 MHz (1.5 GFlops)                                                             | 8,192      | 2000 | 4.9
Intel ASCI Red                               | Intel IA-32 Pentium Pro 333 MHz (0.333 GFlops)                                          | 9,632      | 1999 | 2.4

Historical Analysis

Chart: performance (TeraFLOPS to PetaFLOPS) versus time, showing the progression from vector processors, to massively parallel machines (TeraFLOPS around 1993, the HPCC era), to MPPs with multicores and heterogeneous accelerators (PetaFLOPS around 2008-2011, first discrete and then integrated), and on to "tons of lightweight cores" by 2016, driven by the end of Moore's Law in clocking.

DARPA High-Productivity Computing Systems (HPCS)

- Launched in 2002
- Next-generation supercomputers by 2010
- Not only performance, but productivity, where:

Productivity = f(execution time, development time)
Typically, Productivity = utility / cost
(a toy numeric reading of this definition follows this list)

- Addresses everything: hardware and software
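As noted above, here is a toy numeric reading of "utility / cost" in which both development time and execution time contribute to cost; the cost rates and example numbers are illustrative assumptions, not HPCS definitions:

```python
# Toy illustration only: utility delivered per unit cost, where cost combines
# the time to develop the code and the time to run it. Rates are made up.
def productivity(utility, dev_time_hours, exec_time_hours,
                 dev_cost_per_hour=100.0, machine_cost_per_hour=50.0):
    cost = dev_time_hours * dev_cost_per_hour + exec_time_hours * machine_cost_per_hour
    return utility / cost

# A system that runs faster but is much harder to program can still lose.
print(productivity(utility=1.0, dev_time_hours=40, exec_time_hours=100))
print(productivity(utility=1.0, dev_time_hours=200, exec_time_hours=2))
```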

HPCS Structure

- Each team is led by a company and includes university research groups
- Three phases:
  - Phase I: Research Concepts - SGI, HP, Cray, IBM, and Sun
  - Phase II: R&D - Cray, IBM, Sun
  - Phase III: Deployment - Cray, IBM
- GWU with SGI in Phase I and IBM in Phase II

IBM, Sun & Cray's Efforts on HPCS

Vendor | Project | Hardware Arch.           | Language
IBM    | PERCS   | PowerPC                  | X10
Sun    | Hero    | "Rock", multi-core SPARC | Fortress
Cray   | Cascade |                          | Chapel

HPCS on IBM, Sun & Cray

IBM PERCS(Productive, Easy-to-use, Reliable Computing System) Power Architecture Sum Hero Multi-core “Rock” Sparc Cray Cascade

What is New in HPCS

- Architecture
  - Lots of parallelism on the chip
  - Intelligent and transactional memory
  - Innovative co-processing: streaming, PIM, ...
  - Computations migrate to data, instead of data going to computations
- Programming
  - PGAS programming models
  - Parallel MATLAB and other simple interfaces
  - Multiple types of parallelism and locality
  - Transactions
- Reliability
  - Self-healing
- More proprietary stuff

What is Next: Exascale and DOE

The DoE Exascale Computing Project goals:
- Deliver 50x the performance of today's systems (20 PF), i.e., about 1 ExaFLOPS
- Operate with 20-30 MW of power
- Be sufficiently resilient (MTTI < 1 week)
- Provide a software stack supporting a wide range of applications

Growth of supercomputing capability

Source: https://energy.gov/sites/prod/files/2013/09/f2/20130913-SEAB-DOE-Exascale-Initiative.pdf
Source: Figure modified from singularity.com

Technical Challenges on the Road to Exascale

Bill Dally, “Technical Challenges on the Road to Exascale” http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/BillDally_NVIDIA_SC12.pdf

Technical Challenges on the Road to Exascale

The High Cost of Data Movement: fetching operands costs more than computing on them

10000"

1000"

100" 2008"(45nm)"

2018"(11nm)" 10"

Picojoules*Per*64bit*opera2on* 1" " " " " " " t" " P er ip ip ip M c m O t h h h A e e FL is 3c 3c 3c R n st " g n n n D n y P e o o o / o "s D R " " " ip rc s m m m h e s m m m c t ro 1 5 5 f3 in C 1 f l" O ca lo

Source: ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems. Courtesy: John Shalf, PGAS 2015.

Three Pre-Exascale Supercomputers as Part of the DOE CORAL Initiative

Feature           | Summit                                               | Sierra                                               | Aurora
Contract budget   | $325M (combined with Sierra)                         | (combined with Summit)                               | $200M
Location          | Oak Ridge                                            | Livermore                                            | Argonne
Delivery date     | 2017-18                                              | 2017-18                                              | 2018-19
Vendor            | IBM                                                  | IBM                                                  | Cray
Node architecture | Multiple IBM POWER9 CPUs + multiple NVIDIA Volta GPUs | Multiple IBM POWER9 CPUs + multiple NVIDIA Volta GPUs | Intel Knights Hill many-core CPUs
Node performance  | 40+ TFLOPS                                           | -                                                    | 3+ TFLOPS
Interconnect      | Mellanox dual-rail EDR InfiniBand                    | Mellanox dual-rail EDR InfiniBand                    | Intel Omni-Path
Rpeak             | 150 PFLOPS                                           | 120-150 PFLOPS                                       | 180 PFLOPS
Nodes             | ~3,400                                               | -                                                    | 50,000+
Power             | ~10 MW                                               | ~10 MW                                               | ~13 MW

Aurora Highlights

Available data:
- Cray Shasta compute platform
- Intel Knights Hill many-core CPUs (3rd-generation manycore, 10nm node)
- 3+ TFLOPS per node
- 50,000+ nodes
- 180 PFLOPS
- 13 MW
- Intel Omni-Path (2nd gen) with silicon photonics
- 500+ TB/s bisection bandwidth
- 2.5+ PB/s aggregate node link bandwidth

Prediction for next generation:
- 1 processor per node: one 100-core CPU capable of 4.5 TFLOPS peak, or 3+ TFLOPS sustained
- Dual Omni-Path links, 400 Gb/s aggregate BW per node
- 50,000 nodes; 4 nodes per blade -> 12,500 blades; 16 blades per chassis -> 782 chassis; 6 chassis per group -> 130 groups (the arithmetic is spelled out in the sketch below)

GWU HPCL Facility

Historical Highlights of the Facility

- ~50 tons of cooling, 2,000 sq. ft. of elevated floor, 0.25 MW of power
- Small experimental parallel systems that represent a wide spectrum of architectural ideas
- Systems with GPU accelerators from Cray and ACI
- A system with Intel Phi accelerators from ACI
- Systems with FPGA accelerators from SRC, SGI, Cray, and Starbridge
- Homegrown clusters with InfiniBand and Myrinet
- Many experimental boards and workstations from Xilinx, Intel, ...

GW CRAY XE6m/XK7m

- 1,856 processor cores
- Based on 12-core 64-bit AMD Opteron 6100 Series processors and 16-core AMD Bulldozer processors
- 32 NVIDIA K20 GPUs
- 64 GB registered ECC DDR3 SDRAM per compute node
- 1 Gemini routing and communications ASIC per two compute nodes

Conclusions

- HPC is critical for economic competitiveness at all levels, and it is turning into an international race!
- Advances in HPC today are the advances in conventional computing tomorrow
- HPC is ubiquitous, as all computing turns into HPC
- Multicore and heterogeneous accelerator architectures are receiving growing attention but still lack software infrastructure and hardware support; they will require new programming models and OS support, which is an opportunity for leadership in research

Light Reading

- http://spectrum.ieee.org/computing/hardware/ibm-reclaims-supercomputer-lead, 2005
- http://spectrum.ieee.org/tech-talk/computing/hardware/china-builds-worlds-fastest-supercomputer, 2010
- http://spectrum.ieee.org/computing/hardware/chinas-homegrown-supercomputers, 2012
