20190714 Hipeac ACACES Keynote
Total Page:16
File Type:pdf, Size:1020Kb
Images courtesy of The PRACE Scientific Steering Committee, “The Scientific Case for Computing in Europe 2018-2026” [1] ASCI Red (Intel Teraflops) 100.00 [2] Red(memory upgrade) [3] ASCI Blue Pacific (IBM RS/6000) [14] 10.00 [13] [4] ASCI White [11] [5] Cray XT3 [7] [10] [6] IBM Roadrunner 1.00 [12] [6] [9] [7] IBM Sequoia (BLUGENE/q) [5] [8] [8] Tianhe-1A 0.10 [9] Nebulae [10] PRIMEHPC FX10 [3] 0.01 [11] Eurora GFlops/Watts (log sclae) (log GFlops/Watts [4] [12] Fujitsu PRIMEHPC FX100 (SPARC64 Xifx) [1] 0.00 [2] [13] Sunway TaihuLight 1995 1997 1999 2001 2003 2005 2007 2009 2011 2013 2015 2017 [14] IBM Summit Mainframe Era (circa 1953 - circa 1972) First rise of HW acceleration: vector computers (1974-1993) Rise of massive homogeneous parallelism (1994-2007) The renaissance of acceleration units (2008 - …? ) Sense Analyze Transmit Short range, BW µController MEMS Microphone L2 Memory Zurich e.g.Accelerators CortexM ULP Imager Low rate (periodic) data IOs Benini,ETH SW update, commands EMG/ECG/EIT 11 ÷ ÷100025 MOPS MOPS 1 ÷ 10 mW Long range, low BW Luca of Courtesy • Idle: ~1µW 100 µW ÷ 2 mW Active: ~ 10mW Need a supercomputer chip on the sensor node 2016 2017 2018 2019 2020 2021 2022 2023 Budget Summit Aurora 21 200 PFlop/s El Capitan US 1 EFlop/s Power 9 + >1 EFlop/s 4.2 B $ Intel+Cray Volta (2016-2023) Sierra Frontier ANL system? 125 PFlop/s >1.5 EFlop >1 EFlop/s Power 9 + AMD+Cray 2013 Volta Shuguang/Sugon China Tianhe-2 Sunway 1 EFlop/s 33 PFlop/s 125 PFlop/s Sunway processors 700 M $ Intel Sunway (2016-2020) processors processors Tianhe-3 1 EFlop/s ARM Processors 2011 Japan K ABCI Post-K computer 0.55 EFlop/s computer 1Bn $ 10 PFlop/s (DL) 1 EFlop/s Fujitsu 37 Pflop Fujitsu+ARM processors Intel+NVIDIA Europe >200 PFlop/s >1 EFlop/s 1,4Bn $ >200 PFlop/s >1 EFlop/s Chip Design Manuf. IBM POWER9 NVIDIA Volta GV100 Sunway SW26010 Intel Xeon E5 • Courtesy of Denis Dutoit Images courtesy of Reuters.com and European Processor Initiative EPI SGA1 (ARM +RISC V) SGA2 (ARM +RISC V) EPI TO EXA (ARM +RISC V) Other RIA investment activities Extreme escale act CoE CoC GEANT SME Pre-Exa Procurement Pre-Exa Deployment Exa Procurement Exa Deployment 2019 2020 2021 2022 2023 2024+ MareNostrum4 Total peak performance: 12,7 Pflops General Purpose Cluster: 11.15 Pflops (1.07.2017) CTE1-P9+Volta: 1.57 Pflops (1.03.2018) • 3 “pilot” Pre-Exascale systems • 5 Peta-scale systems MareNostrum 1 MareNostrumMareNostrum2 MareNostrum5 3 MareNostrum 4 2004 – 42,3 Tflops 2006 – 94,2 Tflopscirca 220 Million Euros budget2012 – 1,1 Pflops 2017 – 11,1 Pflops 1st Europe / 4th World 1st Europe / 5th WorldPre-Exascale system12th Europe / 36th World 2nd Europe / 13th World New technologies New technologiesIncluding new technologies New technologies EUROPEAN PROCESSOR INITIATIVE * Pre-ExaScale level with general-purpose CPU core in the first EPI GPP chip * Develop acceleration technologies for better DP GFLOPS/Watt performance * Inclusion of MPPA for real-time application acceleration * Adopt Arm general-purpose CPU core with SVE / vector acceleration in the first EPI chip * Supply sufficient Memory Bandwidth (Byte/FLOP) to support the GPP application * in SGA1, focus on programming models to include accelerations. Codesign, Architecture, System software and key S1 - Common Stream technologies for the Common Platform Design and implement the processor chip(s) S2 - GPP Processor and PoC system Foster acceleration technologies and S3 - Acceleration create building blocks S4 - Automotive Address automotive market needs and create a pilot eHPC system S5 - Administration Manage and support activities EUROPEAN PROCESSOR INITIATIVE Rhea Chronos CPU GPP GPP GPP ACCEL. rev1 rev2 rev3 ACCEL TEST only ACCEL. CHIP 2018 2019 2020 2021 2022 2023 2024 2025 SGA 1&2 SGA 3 PCIe gen5 HSL links links D2D links to adjacent chiplets ARM MPPA eFPGA EPAC HBM memories DDR memories EPAC VPU Bridge to GPP STX VRP Bridge to GPP DDR DDR HBM HBM HBM HBM DDR DDR Scalar Vector STX core lanes units SerDes Scalar Vector STX High speed core lanes units GPP NoC Bridge GPP NoC Bridge L2$ L2$ Accelerator NOC DMA L1$ SPM VRF VRFVector LSU VLSU VectorVector Vectorlanelane Vectorlanelane Scalar core lanes NTX SNTXTX Specialized LANE Units INTERCONNECT Vector Processor ACCELERATOR TILE 푇퐸푋퐸퐶 = 푁 ∙ 퐶푃퐼 ∙ 푇퐶퐾 Applications 2 푃 = 퐶푒푓푓 ∙ 푉 ∙ 1Τ푇퐶퐾 Circuits Runtime Architecture Move the future Exascale Future ? Exascale Future ? Flat paradigm Hierarchy Homogeneity Heterogeneity Fixed pattern Decoupling Legacy Pipelining Locality (BW) Long vectors Established Abstraction vendor Established Malleability vendor Established Low voltage vendor Chip-let technology Can we get there at the right time? EU Actors Follower Lots of money Difficult to reach 26 EPI Instruction extension Long vectors Circuit aware architecture Runtime aware architecture Sagrada Familia 1975 Special thanks to • Mateo Valero • Jesus Labarta Thank you for attending • Adrian Cristal Kestelman • Osman Unsal • Peter Hsu • Luca Benini [email protected] • Ying Chih Yang [email protected] • Philippe Notton • Denis Dutoit • Eugene Griffiths • Fabrizio Gagliardi 33.