Images courtesy of The PRACE Scientific Steering Committee, “The Scientific Case for Computing in Europe 2018-2026”

[1] ASCI Red (Intel Teraflops) 100.00 [2] Red(memory upgrade) [3] ASCI Blue Pacific (IBM RS/6000) [14] 10.00 [13] [4] ASCI White [11] [5] Cray XT3 [7] [10] [6] IBM Roadrunner 1.00 [12] [6] [9] [7] IBM (BLUGENE/q) [5] [8] [8] Tianhe-1A 0.10 [9] Nebulae [10] PRIMEHPC FX10 [3]

0.01 [11] Eurora GFlops/ (log sclae) (log GFlops/Watts [4] [12] Fujitsu PRIMEHPC FX100 (SPARC64 Xifx) [1] 0.00 [2] [13] Sunway TaihuLight 1995 1997 1999 2001 2003 2005 2007 2009 2011 2013 2015 2017 [14] IBM  Mainframe Era (circa 1953 - circa 1972)    First rise of HW acceleration: vector computers (1974-1993)    Rise of massive homogeneous parallelism (1994-2007)     The renaissance of acceleration units (2008 - …? )    Sense Analyze Transmit Short range, BW µController

MEMS Microphone L2 Memory Zurich e.g.Accelerators CortexM ULP Imager Low rate (periodic) data IOs Benini,ETH  SW update, commands  EMG/ECG/EIT 11 ÷ ÷100025 MOPS MOPS  1 ÷ 10 mW

Long range, low BW Luca of Courtesy •

 Idle: ~1µW  100 µW ÷ 2 mW Active: ~ 10mW  Need a supercomputer chip on the sensor node 2016 2017 2018 2019 2020 2021 2022 2023 Budget Summit Aurora 21 200 PFlop/s El Capitan US 1 EFlop/s Power 9 + >1 EFlop/s 4.2 B $ Intel+Cray Volta (2016-2023) Frontier ANL system? 125 PFlop/s >1.5 EFlop >1 EFlop/s Power 9 + AMD+Cray 2013 Volta Shuguang/ Tianhe-2 Sunway 1 EFlop/s 33 PFlop/s 125 PFlop/s Sunway processors 700 M $ Intel Sunway (2016-2020) processors processors Tianhe-3 1 EFlop/s ARM Processors 2011

Japan K ABCI Post- 0.55 EFlop/s computer 1Bn $ 10 PFlop/s (DL) 1 EFlop/s Fujitsu 37 Pflop Fujitsu+ARM processors Intel+NVIDIA

Europe >200 PFlop/s >1 EFlop/s 1,4Bn $ >200 PFlop/s >1 EFlop/s    

Chip Design Manuf.

IBM POWER9

NVIDIA Volta GV100

Sunway SW26010

Intel E5

• Courtesy of Denis Dutoit 

  

 

Images courtesy of Reuters.com and European Processor Initiative EPI

SGA1 (ARM +RISC V)

SGA2 (ARM +RISC V)

EPI TO EXA (ARM +RISC V)

Other RIA investment activities

Extreme escale act CoE CoC

GEANT SME

Pre-Exa Procurement

Pre-Exa Deployment

Exa Procurement

Exa Deployment

2019 2020 2021 2022 2023 2024+ MareNostrum4 Total peak performance: 12,7 Pflops General Purpose Cluster: 11.15 Pflops (1.07.2017) CTE1-P9+Volta: 1.57 Pflops (1.03.2018)

• 3 “pilot” Pre-Exascale systems • 5 Peta-scale systems

MareNostrum 1 MareNostrumMareNostrum2 MareNostrum5 3 MareNostrum 4 2004 – 42,3 Tflops 2006 – 94,2 Tflopscirca 220 Million Euros budget2012 – 1,1 Pflops 2017 – 11,1 Pflops 1st Europe / 4th World 1st Europe / 5th WorldPre-Exascale system12th Europe / 36th World 2nd Europe / 13th World New technologies New technologiesIncluding new technologies New technologies EUROPEAN PROCESSOR INITIATIVE

               * Pre-ExaScale level with general-purpose CPU core in the first EPI GPP chip * Develop acceleration technologies for better DP GFLOPS/ performance * Inclusion of MPPA for real-time application acceleration

 * Adopt Arm general-purpose CPU core with SVE / vector acceleration in the first EPI chip * Supply sufficient Memory Bandwidth (Byte/FLOP) to support the GPP application * in SGA1, focus on programming models to include accelerations. Codesign, Architecture, System software and key S1 - Common Stream technologies for the Common Platform Design and implement the processor chip(s) S2 - GPP Processor and PoC system Foster acceleration technologies and S3 - Acceleration create building blocks

S4 - Automotive Address automotive market needs and create a pilot eHPC system

S5 - Administration Manage and support activities EUROPEAN PROCESSOR INITIATIVE Rhea Chronos CPU GPP GPP GPP ACCEL. rev1 rev2 rev3 ACCEL TEST only ACCEL. CHIP

2018 2019 2020 2021 2022 2023 2024 2025 SGA 1&2 SGA 3 PCIe gen5 HSL links links

D2D links to adjacent chiplets ARM MPPA

eFPGA EPAC HBM memories 

DDR  memories  EPAC

VPU 

Bridge to GPP 

STX  VRP

Bridge to GPP  DDR DDR

HBM HBM 

 HBM HBM  DDR DDR

Scalar Vector STX core lanes units 

 SerDes

Scalar Vector STX High speed High core lanes units    GPP NoC Bridge GPP NoC Bridge  L2$ L2$

Accelerator NOC   DMA  L1$ SPM  VRF VRFVector LSU VLSU VectorVector Vectorlanelane Vectorlanelane Scalar core lanes NTX SNTXTX Specialized  LANE Units INTERCONNECT  Vector Processor ACCELERATOR TILE  

푇퐸푋퐸퐶 = 푁 ∙ 퐶푃퐼 ∙ 푇퐶퐾 Applications

2

푃 = 퐶푒푓푓 ∙ 푉 ∙ 1Τ푇퐶퐾 Circuits  Runtime

 Architecture     Move the future Exascale Future ? Exascale Future ? Flat paradigm Hierarchy Homogeneity Heterogeneity Fixed pattern Decoupling Legacy Pipelining Locality (BW) Long vectors Established Abstraction vendor Established Malleability vendor Established Low voltage vendor Chip-let technology Can we get there at the right time?

EU Actors Follower Lots of money Difficult to reach 26 

    

 

    

    

  EPI Instruction extension

Long vectors

Circuit aware architecture

Runtime aware architecture

  

Sagrada Familia 1975 Special thanks to • Mateo Valero • Jesus Labarta Thank you for attending • Adrian Cristal Kestelman • Osman Unsal • Peter Hsu • Luca Benini [email protected] • Ying Chih Yang [email protected] • Philippe Notton • Denis Dutoit • Eugene Griffiths • Fabrizio Gagliardi

33