Update on International HPC Activities (mostly Asia)

Input from: Erich Strohmaier and Patrick Naullieu (LBNL), Satoshi Matsuoka (TiTech), Haohuan Fu (Wuxi), and many conversations in Singapore

John Shalf Lawrence Berkeley National Laboratory

ASCAC, April 18, 2017

Performance of Countries

[Chart: total Top500 performance (Tflop/s, log scale from 1 to 100,000) by country/region (US, EU, Japan, China), 2000 through 2016]

Share of Top500 Entries Per Country

Historical Share (averaged over lifetime of list) vs. Current Share (November 2016 list)

[Pie charts: the historical share is led by the United States at 52%, with Japan, Germany, the United Kingdom, France, Italy, China, and others taking the rest; the current (November 2016) share is led by China at 34% and the United States at 34%, with Japan, Germany, France, the United Kingdom, Italy, Poland, and others each at roughly 6% or less]

Producers of HPC Equipment

[Chart: number of Top500 systems by producing country (USA, Europe, Japan, China, Russia, Australia, Taiwan, India), 1993 through 2015, scale 0 to 500]

Vendors / Performance Share

[Pie charts: 2007 vs. now, sum of Pflop/s and % of whole list by vendor. Current list: Cray Inc. 143 Pflop/s (21%), others 136 (20%), HPE 66 (10%), Lenovo 64 (10%), IBM 63 (9%), SGI 40 (6%), NUDT 39 (6%), Fujitsu 38 (6%), Sugon 25 (4%), Bull/Atos 24 (4%), Dell 16 (2%), Inspur 9 (1%), Huawei 9 (1%)]

NSA-DOE Technical Meeting on High Performance Computing, December 1, 2016

Top Level Conclusions
1. National security requires the best computing available, and loss of leadership in HPC will severely compromise our national security.
2. HPC leadership has important economic benefits because of HPC's role as an enabling technology.
3. Leadership positions, once lost, are expensive to regain.

Meeting participants expressed significant concern that, absent aggressive action by the U.S., the U.S. will lose leadership and not control its own future in HPC
v It is critical to lead the exploration and development of innovative computing architectures that will unleash the creativity of the HPC community
v Workforce development is a major concern in HPC and a priority for supporting NSCI Objectives #4 and #5
v NSCI leadership should develop more efficient contracting regulations to improve the public-private partnership in HPC science and technology development

China Update

Aggressive Growth of China Chip Fabs
v Current 28nm domestic capability in Shenzhen, Nanjing, and other regions
v Broke ground on a 14nm fab near Shanghai for 2018
§ Annual spending on fab equipment in China above $10B by 2018
§ Feb 2017: China is expected to be the top-spending region for fab equipment by 2019, overtaking South Korea and Taiwan
v Foxconn offered a 3T yen (~$30B) bid for Toshiba fabs
§ Amazon & Google + SK Hynix & Western Digital consortium bidding
§ Apple bidding to own a 20% stake in the Fujitsu fab
§ TSMC withdrew its bid
§ Selection by June

8 Fab Construction in China (source: Semiconductor Equipment and Materials International, SEMI)

9 Fab Construction in China (source: SEMI)

10 Fab Construction in China (source: SEMI)

11 Fab Construction in China (source: SEMI)

12 China 2017 Prototype System Bake-off

v China plans to have three prototypes for candidate exascale systems delivered in 2017 [Xinhua: Jan 19, 2017]
v Scale-up winner(s) to exascale in 2020 (my guesses below)
v Other: Loongson (unlikely), Silicon Cube (no), THATIC/AMD (Tianjin/Sugon?)

Wuxi/Sunway: heterogeneous manycore/accelerator; 4x (8x8) CPEs (light) + 4 MPEs (heavy)
NSC/Phytium: homogeneous manycore; 64-core ARMv8, self-hosted
NUDT/Tianhe-2a(?): attached accelerator; PCIe-attached accelerator (announced at ISC16)

[Diagrams: the three node architectures: a heterogeneous manycore chip mixing light cores and accelerators on a network-on-chip, a homogeneous manycore chip, and a conventional CPU with an attached accelerator]

13 Sunway Node Architecture (refresher course)

[Excerpt (de-columned) from Fang Zheng et al., J. Comput. Sci. & Technol., Vol. 30, No. 1, Jan. 2015, p. 152:]

4 Implementation and Performance Evaluation

To validate DFMC, we implemented a full chip RTL design and built a prototype system with FPGA. The performance of cooperative computing techniques in the prototype system was evaluated. Furthermore, several typical applications were mapped to the DFMC architecture for performance analysis.

4.1 Full Chip RTL

The RTL of DFMC is designed in-house; thus we can easily optimize the design, extend the functionality, and balance the performance and the power usage. Clock gate and fault tolerance technology are also used in this design. For the future test chip, we finished the physical design intended for fabrication in 40 nm technology.

The parameters of DFMC are compared with those of an Intel Xeon CPU and an NVIDIA GPU as shown in Table 4. These processors are different in architecture, but under a similar CMOS technology process. Because of the balanced design of power and performance in the CPEs, DFMC achieves the best peak performance and the best ratio of computation to power consumption. However, the ratio of memory bandwidth to computation of DFMC is the worst. In this paper, DFMC combines a series of cooperative computing techniques to solve this problem.

4.2 Prototype System

The applications and tests run slowly in a software environment, thereby we implemented a full chip prototype system with FPGA for acceleration. The FPGA prototype system adopts a modular structure, which consists of MPE cards, CPE cards, a PCIe card, an MC card, an NoC card, and so on. The prototype includes 256 CPEs, four MPEs and four MCs, as shown in Fig.7. The FPGA prototype system uses a total of 352 Altera EP3C120, 21 Xilinx 5VLX330 and one Xilinx 5VLXT220. The frequency of the prototype system is 2.6 MHz. Table 5 lists the components and functions.

Although there are many cross-board signals, we balance all of the stages related to cross-board signaling and ensure the FPGA prototype system is equal to the RTL design at the cycle level. Then, the foremost reason that the simulation is inaccurate is the main memory frequency. Compared with the target RTL design, the ratio of CPE frequency to MC frequency in the prototype is quite different, which results in simulation deviation. To ensure accuracy, the prototype system uses performance calibration techniques. FPGA prototypes have many performance adjusters and counters, and we have an FPGA adjustment benchmark that includes more than one hundred short programs especially for memory systems. We define the deviation ratio as the ratio of a program's execution time on RTL to its execution time on FPGA. Then, we can adjust the latency, bandwidth, and scheduling in the FPGA prototype to find the minimum average deviation ratio for the benchmark. The performance counters can indicate which adjustment is more important. The test shows that the performance accuracy of the prototype system is up to 95% on the benchmark thanks to the calibration.

4.3 Software Layer

In this paper, the programs running on DFMC use the accelerated model. We designed a library-based programming approach to ease the task of utilizing DFMC. The library supports programming interfaces for thread management, data stream transfer, register level communication and synchronization. Programmers can use these interfaces to explicitly control the …

SW26010: Sunway 260-Core Processor

[Diagram: four core groups, each an 8x8 CPE mesh plus an MPE, PPU, and iMC to its own memory, joined by a network-on-chip; data moves at the memory level, the LDM level, and the register level (via the Transfer Agent, TA)]

Table 4. Parameters of DFMC/Xeon/GPGPU

                    DFMC (40 nm)                Intel Xeon 5680 (32 nm)   NVIDIA Fermi M2090 (40 nm)
Architecture        4 CPE clusters (256 CPEs),  6 cores                   512 CUDA cores
                    4 MPEs, 4 MCs
Topology            8x8 CPE mesh                Ring                      -
On-chip memory      32 KB in each CPE           12 MB cache               1,024 KB shared memory/L1 cache,
                    (256 x 32 KB = 8 MB)                                  768 KB L2 cache
Frequency           1 GHz                       3.33 GHz                  1.3 GHz
Computing ability   1000 GFLOPS DP              80 GFLOPS DP              665.6 GFLOPS DP
Memory bandwidth    102.4 GB/s DDR3             32 GB/s DDR3              177.6 GB/s GDDR5
Chip area           400 mm^2 @ 40 nm            240 mm^2 @ 32 nm          520 mm^2 @ 40 nm
Power               ~200 W                      130 W                     250 W
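A quick check of the "memory bandwidth to computation" claim in the excerpt, using Table 4 (my arithmetic, not from the slide): DFMC delivers 102.4 GB/s against 1000 GFLOPS DP, about 0.1 bytes/FLOP, versus roughly 0.4 bytes/FLOP for the Xeon (32/80) and 0.27 for the M2090 (177.6/665.6), hence the emphasis on cooperative computing techniques to compensate.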

Slide notes: that is 64 KB of LDM per CPE (not 64 KB for the entire CPE mesh); 212 instructions, Alpha-like ISA; 240 mm^2 chip area @ 28 nm (CACTI estimate). Source: Fang Zheng (Wuxi), J. Comput. Sci. & Technol., Jan. 2015.

14 Phytium Mars Architecture

[Die plot: panels of eight Xiaomi cores with L2 cache, Directory Control Unit (DCU), and routing cell; panel dimensions roughly 6000 um x 10600 um]

Panel architecture
v Eight Xiaomi cores per panel
v Compatible design with ARMv8 (architecture license)
§ Both AArch32 and AArch64 modes
§ EL0~EL3 supported
§ ASIMD-128 supported
v Advanced hybrid branch prediction
v 4-fetch/4-decode/4-dispatch out-of-order superscalar pipeline
v Cache hierarchy
§ Separate L1 ICache and L1 DCache
§ Shared L2 cache, 4 MB total
§ Directory-based cache-coherency maintenance via the DCU

(Phytium Technology Co., Ltd)

15 Phytium Mars Architecture: Cache & Memory

Chip
v L3 cache: 16 MB data array + 2 MB ECC, organized as four banks (L3 Bank0-Bank3) with two memory controllers (Mem Ctrl0/Ctrl1) and DDR interfaces
v DDR bandwidth: 2 x DDR3-800 = 25.6 GB/s

Mars Interface
v Proprietary parallel interface between Mars & the CMC
§ Needs more pins, but lower latency than serdes
§ Separate write/cmd and read data channels
§ Effective read channel bandwidth: 12.8 GB/s
§ Effective write/cmd channel bandwidth: 6.4 GB/s

(Phytium Technology Co., Ltd)

16 Comparing Phytium to Sunway

Category                     Units     Phytium/Mars   Sunway           Ratio (Phytium/Sunway)
ISA                          -         ARMv8          CPE (DSP-like)   -
Cores/numanode               cores     64             64               1.0
Core FLOP rate               GFLOPs    8              11.72            0.7
L1$/core                     KB        32             64               0.5
Clock rate                   GHz       2              -                -
Power/numanode               W         120            93               1.3
Performance/numanode         GFLOPs    512            750              0.7
Memory bandwidth/numanode    GB/s      204            34               6.0
Sockets for 125 PF system    -         234,375        40,960           5.7
Cores for 125 PF system      millions  15             20               0.8
Power for 125 PF system      MW        28             15               1.9

v Phytium advantages
§ 6x higher memory bandwidth per NUMA node
§ Conventional CPU programming model
v Sunway advantages
§ 2x energy efficiency of the Phytium system
§ 5x higher performance density (5x fewer sockets for a system)

17 Sugon Silicon Cube: Meteorological Supercomputer
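As a rough sanity check on the bottom three rows (my arithmetic, not from the slide): 125 PF / 512 GFLOPs per Phytium numanode is about 244,000 numanodes (the slide's 234,375 was presumably computed against a slightly different target), each with 64 cores (about 15M cores) at 120 W (about 28 MW). For Sunway, 125 PF / 750 GFLOPs per numanode is about 166,700 numanodes, i.e. roughly 41,700 four-numanode sockets (slide: 40,960), at 93 W per numanode, or about 15.5 MW.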

v Processor
§ Intel Xeon E5-2680, 12-core
§ DDR4-2133 MHz memory
v Interconnect
§ FDR InfiniBand (56 Gb)
§ 3D torus interconnect topology
v Overall system
§ Peak 1 PFLOPS
§ #95 on Top500 at 75% efficiency
§ Total memory capacity 80 TB
§ 208 square meters of 1,000 liquid-cooled servers
§ Total power 641.38 kW

[Excerpt (de-columned) from a paper on the Silicon Cube; figures: Fig. 3, multilevel design; Fig. 4, liquid cooling in Silicon Cube; Fig. 5, job scheduling in Silicon Cube through Gridview:]

Petascale and then exascale supercomputers require, and will require, hundreds of thousands of cores to work together efficiently. That is why the interconnect is one of the top challenges in supercomputer technology. As a switchless way to connect nodes in a supercomputer system, the 3D torus network topology is a good answer to both problems of speed and scalability: it can assure very low latency and linear system scalability. Silicon Cube adopts a 3D torus network as well; this is the first 3D torus network developed and deployed in China.

Meteorological research relates tightly to Big Data. According to news reports, China's meteorological data already exceeds ~5 PB, with an annual increment of 1 PB, and in most cases all of this data needs to be kept almost forever. In view of this situation, the ParaStor 200 parallel storage system was introduced; its performance and expandability suit the demands of meteorological research, and it can confidently absorb the future data flow caused by larger coverage areas and higher resolutions. The aggregate performance of the ParaStor 200 increases linearly with the number of oStor data controller nodes. Based on actual measurement results, each oStor data controller node, fitted with 2 double-port gigabit Ethernet cards providing 4 data transmission channels, delivers a write bandwidth up to 150 MB/s and a read bandwidth up to 360 MB/s. The ParaStor 200 deployed at the Shenzhen Cloud Computing Center has a total capacity of up to 16 PB and provides an aggregate bandwidth up to 100 GB/s.

Gridview HPC Suite, an integrated monitoring, management, and job-scheduling software platform for HPC, has been installed on Silicon Cube. Gridview is designed with pluggable function modules: it dynamically monitors the overall and detailed status of both the computing center and the cluster, and provides comprehensive cluster management, real-time and historical alerting, and powerful job scheduling, spanning Computing Center Visualization, Cluster Monitoring, Performance Analysis, Asset Management, Cluster Management, and Alert Management. Gridview offers powerful and flexible scheduling policies, fault tolerance, easy-to-use application web portals, and a clear accounting system, all of which greatly improve the management efficiency and utilization of high performance computers. According to the communication pattern of a parallel job, Gridview chooses 1D-, 2D-, or 3D-adjacent nodes, and in general obeys an adjacent-node allocation policy for meteorology applications. More recently, Gridview introduced a new partner, EasyOP, an online cloud-service platform that has been serving ~10,000 nodes distributed across 18 Chinese cities.

As shared by insidehpc.com, liquid is about 3,500 times better at storing and transferring heat than air. Direct contact liquid cooling (DCLC) uses the exceptional thermal conductivity of liquid to provide dense, concentrated cooling to targeted small surface areas. By using DCLC, the dependence on fans and expensive air conditioning and air-handling systems is drastically reduced. Reducing internal heat is essential to avoid temperature-related system damage and downtime. Silicon Cube uses cold-plate liquid cooling, the TC4600E-LP, to increase the cooling efficiency of the system; the power usage effectiveness (PUE) of the system is lower than 1.2. Compared to traditional air cooling, the TC4600E-LP successfully lowers CPU temperature by 20°C and thereby boosts performance by 5%, while the noise of the system stays below 45 dB.

18 What's in a Name? Sunway

Shen Wei: "God", "Powerful"

Taihu (a famous lake near Shanghai) + apostrophe ("Taihu's") + Guang ("Light")

19 System Comparisons: Sunway TaihuLight vs. Other Systems

System                        TaihuLight    Tianhe-2     Titan        Sequoia       Cori
Peak performance (PFlops)     125.4         54.9         27.1         20.1          27.9
Total memory (TB)             1310          1024         710          1572          879
Linpack performance (PFlops)  93.0 (74%)    33.9 (62%)   17.6 (65%)   17.2 (85.3%)  14.0 (50%)
Top500 rank                   1             2            3            4             5
Performance/power (Mflops/W)  6051.3        1901.5       2142.8       2176.6        3266.8
Green500 rank                 4             135          100          90            26
GTEPS                         23755.7       2061.48      -            23751         -
Graph500 rank                 2             8            -            3             -
HPCG (Pflops)                 0.3712        0.5801       0.3223       0.3304        0.3554
HPCG rank                     4             2            7            6             5

(- : not measured)


20 Continued Progress Since GB Runs (Wuxi team are committed to codesign)

[Chart (Haohuan Fu, Wuxi): strong-scaling results, parallel efficiency vs. total number of cores (1M to 11M): 67% for the 2016-2017 improved model vs. 45% for the initial port and 33% (GB'15)]

v Initial port of the CCSM3 Highly-Scalable Atmospheric Simulation Framework
v 2016-2017 model improvement through the cube-sphere grid (or other grids)
v The 3-km resolution run: 1.01 SYPD with 10.6M cores, dt = 240 s, I/O penalty < 5%, cloud resolving

21 The "Best" Computational Solution

[Codesign diagram: Application (climate modeling) + Algorithm (sync-free explicit, implicit, or semi-implicit methods; data-locality-preserving algorithms) + Architecture (Sunway, GPU, MIC, FPGA; C/C++, Fortran, MPI, CUDA, Java, ...). Team: Fu, Haohuan (Tsinghua University, geo-computing); Yang, Chao (Institute of Software, CAS, computational mathematics); Wang, Lanning (Beijing Normal University, climate modeling); Xue, Wei (Tsinghua University, computer science)]

Programming Model: MPI+X

[Background diagram: SW26010 Sunway 260-core processor: four core groups, each an 8x8 CPE mesh plus an MPE, PPU, and iMC to its own memory, joined by a network-on-chip; data moves at the memory, LDM, and register levels (the latter via the Transfer Agent, TA)]

v MPI3
§ Based on OSU MVAPICH
§ One MPI process per MPE core (4 per socket)
v OpenACC 2.0
§ OpenACC 2.0 cross-compiler based on the LLNL ROSE translator
§ Fortran, C/C++ support
§ Nearly identical accelerator-offload style as for GPU systems
§ Copy-in to "fast memory" goes to the CPE local stores instead of GPU GDDR5 memory (see the sketch after this list)
§ Extensions (swap, pack, tile, mask) for hardware collective memory ops
v Athreads: a low-level spatial threading interface
§ Low-level target of the ROSE OpenACC translator
§ Supports some hardware collective operations such as transposes and common domain-decomposition operations (beyond Pthreads)
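To make the "copy in to fast memory" bullet concrete, here is a conceptual C sketch of the pattern the translator generates. This is not the actual Sunway athread API (on real hardware the memcpy calls would be asynchronous DMA, e.g. athread_get/athread_put), and the sizes and names here are illustrative:

    #include <string.h>

    #define LDM_DOUBLES 4096             /* a 32 KB slice of the 64 KB CPE LDM */

    /* One CPE's share of work: stage a slice of the input from main memory
     * into the local store, compute entirely out of LDM, then copy back.
     * Assumes n <= LDM_DOUBLES.                                          */
    void cpe_slice(const double *A, double *B, int slice, int n)
    {
        double ldm[LDM_DOUBLES];         /* resides in the CPE local store */
        memcpy(ldm, &A[(long)slice * n], n * sizeof ldm[0]);  /* DMA-in  */
        for (int j = 0; j < n; j++)
            ldm[j] *= 2.0;               /* placeholder computation */
        memcpy(&B[(long)slice * n], ldm, n * sizeof ldm[0]);  /* DMA-out */
    }

The Sunway OpenACC comparison on the next slide shows the same staging expressed in directives.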

Comparison of OpenACC Offload Models (from Haohuan Fu, Wuxi)

OpenACC for a GPU offload model (e.g., Titan): a data copy handles data moving between host memory and device memory (e.g., global memory on the GPU), executed by the host thread outside the parallel loop:

!$acc data copyin(A) copyout(B)
!$acc parallel loop
do i=1,128
  m = func(i)
  do j=1,128
    B(j, i) = A(j, m)
  enddo
enddo
!$acc end parallel loop
!$acc end data

OpenACC for SW26010: data transfers directly between MEM and LDM; moving A(*, m) and B(*, i) between host memory and LDM happens in each i-loop iteration and is executed by each CPE thread:

!$acc parallel loop
do i=1,128
  m = func(i)
!$acc data copyin(A(*, m)) copyout(B(*, i))
  do j=1,128
    B(j, i) = A(j, m)
  enddo
!$acc end data
enddo
!$acc end parallel loop

The difference between the memory models: on the GPU, the memory that accelerator threads can access is device memory, and data movement is initiated by the host thread; on SW26010, the memory that accelerator threads can access is the LDM, and data movement is initiated by each CPE thread.

Software Porting Strategy

The Gap between Software and Hardware (from Haohuan Fu, Wuxi)

China's models (~100T):
• pure CPU code
• millions of lines of legacy code
• poor scalability: scaling to hundreds or thousands of cores
• written for multi-core, rather than many-core

China's supercomputers (~100P):
• heterogeneous systems with many-core chips
• millions of cores

Software Porting Strategy

Our Research Goals (from Haohuan Fu, Wuxi)
• a highly scalable framework that can efficiently utilize many-core processors
• automated tools to deal with the legacy code

[Repeats the gap diagram: China's models (~100T: pure CPU code, millions of lines of legacy code, poor scalability, written for multi-core rather than many-core) vs. China's supercomputers (~100P: heterogeneous systems with many-core chips, millions of cores)]

Is this an image of "failure" or of "success"? Can anyone guess what this is?

v Apple Ethos: Refine until it is near perfect.

v Google Ethos: Try early and try often!

26 Conclusions on China
v Hardware strategy
§ 3 prototype systems in 2017 (Sunway, Phytium, Tianhe-2a?)
§ ARMv8 systems more conventional than Sunway (less energy efficient)
v Software strategy
§ MPI+X where X = OpenACC directives
§ Sunway OpenACC programming similar to GPU systems (not exotic)
§ Plans to increase automation to port from old to new
§ Continued advances in algorithm design increase gains over the GB wins
v Overall: moving at a fast pace
§ Investing in a portfolio of risk (ranging from conventional to exotic)
§ There is little incentive to play it safe (no alternatives)
§ And they are not held back by an installed base (open-ended design space)
§ Not about out-selling the US in HPC! It's about creating a domestic supply chain to support domestic industry (cars, aerospace, basic science, ...)


Japan Update

28 FLAGSHIP2020 Project

Missions
• Building the Japanese national flagship supercomputer, Post K, and
• Developing a wide range of HPC applications, running on Post K, in order to solve social and science issues in Japan

Budget
• 110 billion JPY (about 0.91 billion USD in case of 120 JPY/$)
• including research, development, and acquisition, and application development

• RIKEN AICS is in charge of development
• Fujitsu is the vendor partner

[System diagram: the Post K Computer's compute nodes connect through an I/O network to login servers, maintenance servers, a portal, and a hierarchical storage system]

(2015/01/30, Yutaka Ishikawa @ RIKEN AICS)

Post-K Strategic Delay

[Old timeline (CY): basic design, then design and implementation, then manufacturing, installation, and tuning, then operation]

New Timeline
v Flagship 2020: 0.91B USD project (RIKEN + Fujitsu)
v Originally planned for 2020, moving to 2021 or 2022
§ Energy-efficiency benefits of a newer process technology offer better TCO
v Fujitsu: scalable-core/node ARMv8 + SVE 512-bit vectors (see the sketch below)
§ Wider vectors: K = 128-bit, FX100 = 256-bit, Post-K = 512-bit
§ 6D mesh interconnect, sector cache, fast synchronization between cores
§ Nearly the same microarchitecture as the SPARC64-based K computer
§ Gains the advantage of the larger market for the ARM software ecosystem
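One reason the ARMv8+SVE choice matters for the "wider vectors" roadmap: SVE code is vector-length agnostic, so the same source runs unchanged on 128-, 256-, or 512-bit implementations. A minimal sketch using ACLE SVE intrinsics, assuming an SVE-capable compiler (e.g. -march=armv8-a+sve); this is my illustration, not Fujitsu code:

    #include <arm_sve.h>
    #include <stdint.h>

    /* DAXPY (y += a*x): the loop strides by the hardware vector length
     * (svcntd() doubles per vector); the predicate masks the final
     * partial vector, so no scalar cleanup loop is needed.           */
    void daxpy(double a, const double *x, double *y, int64_t n)
    {
        for (int64_t i = 0; i < n; i += svcntd()) {
            svbool_t    pg = svwhilelt_b64_s64(i, n);  /* active-lane mask */
            svfloat64_t vx = svld1_f64(pg, &x[i]);
            svfloat64_t vy = svld1_f64(pg, &y[i]);
            vy = svmla_n_f64_x(pg, vy, vx, a);         /* vy += vx * a */
            svst1_f64(pg, &y[i], vy);
        }
    }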

29 Tsubame 3.0 @ TiTech/SGI/HP/NVIDIA Converged BigData/AI/HPC Supercomputer

12.5 PF DP, 47.2 PF half-precision; Omnipath @ 4x100Gb/s

30 ML Moving Towards AI is a Hotbed of Activity in US & Japan HPC

[Figure (Fujitsu): ARTIFICIAL INTELLIGENCE: a program that can sense, reason, act, and adapt; MACHINE LEARNING: algorithms whose performance improves when exposed to more data over time; DEEP LEARNING: multi-layered neural networks that learn from vast amounts of data]

v Deep learning is now reaching viable accuracy
v ImageNet 2012 winner (Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton, University of Toronto): "Our model is a large, deep convolutional neural network trained on raw RGB pixel values. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three globally-connected layers with a final 1000-way softmax. It was trained on two NVIDIA GPUs for about a week. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of convolutional nets."
v Continuing challenges: large compute requirements for training; performance that scales with data; calculation of increasingly complex models (courtesy of Nervana)

32 Vertically Integrated Data Centers Spinning their Own Designs
v NV DGX-1 & Fujitsu
v Google TPU
v MS Project Olympus HGX-1 hyperscale GPU accelerator (industry standard): configurable PCIe cable to host + expansion slots, NVIDIA P100 GPUs, NVLink hybrid cube mesh fabric at 20 Gbyte/sec per link duplex, adapters for other GPUs
v Facebook Big Basin: 8x Tesla P100 GPU server with hybrid mesh cube topology

4. Activate performs the nonlinear function of the artificial neuron, with options for ReLU, Sigmoid, and so on. Its ​ inputs are the Accumulators, and its output is the Unified Buffer. It can also perform the pooling operations needed for convolutions using the dedicated hardware on the die, as it is connected to nonlinear function logic. 5. Write_Host_Memory writes data from the Unified Buffer into the CPU host memory. ​ The other instructions are alternate host memory read/write, set configuration, two versions of synchronization, interrupt host, debug-tag, nop, and halt. The CISC MatrixMultiply instruction is 12 bytes, of which 3 are Unified Buffer address; 2 are accumulator address; 4 are length (sometimes 2 dimensions for convolutions); and the rest are opcode and flags. The philosophy of the TPU microarchitecture is to keep the matrix unit busy. It uses a 4-stage pipeline for these CISC instructions, where each instruction executes in a separate stage. The plan was to hide the execution of the other instructions by overlapping their execution with the MatrixMultiply instruction. Toward that end, the Read_Weights instruction ​ ​ ​ ​ follows the decoupled-access/execute philosophy [Smi82], in that it can complete after sending its address but before the weight is fetched from Weight Memory. The matrix unit will stall if the input activation or weight data is not ready. We don’t have clean pipeline overlap diagrams, because our CISC instructions can occupy a station for thousands of clock cycles, unlike the traditional RISC pipeline with one clock cycle per stage. Interesting cases occur when the activations for one network layer must complete before the matrix multiplications of the next layer can begin; we see a “delay slot,” where the matrix unit waits for explicit synchronization before safely reading from the Unified Buffer. As reading a large SRAM uses much more power than arithmetic, the matrix unit uses systolic execution to save energy by reducing reads and writes of the Unified Buffer [Kun80][Ram91][Ovt15b]. Figure 4 shows that data flows in from the left, and the weights are loaded from the top. A given 256-element multiply-accumulate operation moves through the matrix as a diagonal wavefront. The weights are preloaded, and take effect with the advancing wave alongside the first data of a new block. Control and data are pipelined to give the illusion that the 256 inputs are read at once, and that they instantly update one location of each of 256 accumulators. From a correctness perspective, software is unaware of the systolic nature of the matrix unit, but for performance, it does worry about the latency of the unit. The TPU software stack had to be compatible with those developed for CPUs and GPUs so that applications could be ported quickly to the TPU. The portion of the application run on the TPU is typically written in TensorFlow and is compiled into an API that can run on GPUs or TPUs [Lar16]. Like GPUs, the TPU stack is split into a User Space Driver and a Kernel Driver. The Kernel Driver is lightweight and handles only memory management and interrupts. It is designed for long-term stability. The User Space driver changes frequently. It sets up and controls TPU execution, reformats data into TPU order, translates API calls into TPU instructions, and turns them into an application binary. 
The User Space driver compiles a model the first time it is evaluated, caching the program image and writing the weight image into the TPU's weight memory; the second and following evaluations run at full speed. The TPU runs most models completely from inputs to outputs, maximizing the ratio of TPU compute time to I/O time. Computation is often done one layer at a time, with overlapped execution allowing the matrix multiply unit to hide most non-critical-path operations.

Google Tensor Processing Unit (TPU)

• Deployed in datacenters since 2015
• 10-30x faster than NVIDIA K80 or Intel Haswell for ML workloads (64K arithmetic ops per cycle)
• Could be faster with a better memory subsystem
• 8-bit integer arithmetic (all that is needed for ML)

Figure 3. TPU Printed Circuit Board. It can be inserted in the slot for an SATA disk in a server, but the card uses PCIe Gen3 x16.

Figure 4. Systolic data flow of the Matrix Multiply Unit. Software has the illusion that each 256B input is read at once, and that they instantly update one location of each of 256 accumulator RAMs.
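To ground the systolic description in something executable, here is a toy C model of what one pass through an N x N weight-stationary matrix unit computes. The real unit is 256x256 with 8-bit operands; the skewed wavefront timing is collapsed into plain loops here, so only the dataflow is modeled, and all names are my own:

    #include <stdint.h>

    #define N 4  /* 256 in the real TPU matrix unit */

    /* Each PE(row,col) holds a preloaded weight.  Activations stream
     * across the rows while partial sums advance down the columns one
     * PE per cycle, so the bottom of each column emits a finished dot
     * product into its accumulator RAM.  This model collapses that
     * pipeline into loops.                                          */
    void systolic_matvec(const int8_t w[N][N], const int8_t x[N], int32_t acc[N])
    {
        for (int col = 0; col < N; col++) {
            int32_t partial = 0;                 /* flows down column 'col' */
            for (int row = 0; row < N; row++)
                partial += (int32_t)w[row][col] * x[row]; /* MAC at PE(row,col) */
            acc[col] += partial;                 /* accumulator RAM at column foot */
        }
    }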


33 Ising Computer

[Excerpt (de-columned) from Hitachi Review, Vol. 65 (2016), No. 6, pp. 158-159:]

CMOS ISING COMPUTING

While computing methods that use superconductors to replicate an Ising model have been proposed in the past, Hitachi has proposed using a complementary metal oxide semiconductor (CMOS) circuit for this purpose. The benefits of using a CMOS circuit are simpler manufacturing, greater scalability, and ease of use. The updating of actual spin values is performed in accordance with the following rule:

New spin value = +1 (if a > b), -1 (if a < b), +/-1 (if a = b)

Here, a is the number of cases in which (adjacent spin value, interaction coefficient) is (+1, +1) or (-1, -1), and b is the number of cases in which it is (+1, -1) or (-1, +1). These interactions cause the energy of the Ising model to fall, following the energy contours (landscape) like that shown in Fig. 3. However, because the energy profile includes peaks and valleys (as shown in the figure), this interaction process operating on its own has the … [excerpt truncated]

In practice, this use of random numbers means that the solution obtained is not necessarily the optimal one. However, when the computing technique is used for parameter optimization, it is likely that it will not matter if the results obtained are not always optimal. In situations where this computing technique might be deployed, it is possible to anticipate applications where providing a theoretical guarantee that it will produce solutions with 99% or better accuracy, 90% or more of the time, for example, will mean that these solutions can be relied on to not cause any problems for the system.

PROTOTYPE COMPUTER

A prototype Ising chip was manufactured using a 65-nm CMOS process to test the proposed Ising computing technique. An Ising node was then built with this Ising chip, and its ability to solve optimization problems was demonstrated. This section describes the prototype and the results of its use to solve optimization problems.

Fig. 5: Ising Node. The photograph shows an Ising node with two Ising chips. The Ising node is connected to a server or PC via a LAN cable and can be used to solve combinatorial optimization problems. (PC: personal computer; LAN: local area network)

Fig. 6: Energy Efficiency of Solving Randomly Generated Maximum Cut Problem. The graph shows the relative energy efficiency of the calculation compared to an approximation algorithm executing on a general-purpose CPU (y-axis 1 to 10,000; x-axis number of spins from 8 to 32,768). The energy efficiency improves as the size (number of spins) of the problem increases, with the new technique being approximately 1,800 times more efficient for a 20,000-spin problem.
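The update rule above is compact enough to state in code; a minimal C sketch assuming +/-1 spins and +/-1 interaction coefficients, with flip_prob standing in for the random-number injection the article describes for escaping local minima (the names are mine, not Hitachi's):

    #include <stdlib.h>

    /* One spin update: a counts neighbors where (adjacent spin, coupling)
     * is (+1,+1) or (-1,-1); b counts (+1,-1) or (-1,+1).  Ties are
     * broken randomly, and a random flip is occasionally injected so the
     * system can escape local minima of the energy landscape.          */
    int update_spin(const int *nbr_spin, const int *coupling, int degree,
                    double flip_prob)
    {
        int a = 0, b = 0;
        for (int k = 0; k < degree; k++) {
            if (nbr_spin[k] * coupling[k] > 0) a++;   /* matching signs */
            else                               b++;   /* opposing signs */
        }
        int s = (a > b) ? +1 : (a < b) ? -1 : ((rand() & 1) ? +1 : -1);
        if ((double)rand() / RAND_MAX < flip_prob)
            s = -s;                                   /* random escape flip */
        return s;
    }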


A Brief EU Update

35 Recent Developments in the EU (and soon-to-be former members thereof)
v Jan 18, 2017: "Cray commits to deliver 10,000+ core ARM system"
§ "Isambard" @ UK Bristol, £4.7M, to be installed March-December 2017
§ Includes GPUs, x86 CPUs, and FPGAs (in addition to ARM)
§ Simon McIntosh-Smith: "Scientists have a growing choice of potential computer architectures to choose from, including new 64-bit ARM CPUs, graphics processors, and many-core CPUs from Intel. Choosing the best architecture for an application can be a difficult task, so the new Isambard GW4 Tier 2 HPC service aims to provide access to a wide range of the most promising emerging architectures, all using the same software stack."
v Change in EU Horizon 2020 strategy in Feb 2017
1. Refocus on domestic technologies
2. Preparatory call for proposals expected imminently
3. Open to non-traditional architectures (e.g., EU BRAIN project)
4. Current focus on ARM (ISA license, but indigenous microarchitecture)
5. Chiplet and SoC integration strategies are both being pursued

36 Conclusions
v China to have 3 prototype systems by 2017 (exascale candidates)
§ Try early, try often: a learning cycle
§ Broad range of architectures
§ Not constrained by an installed base (both an asset and a curse!)
v China's Sunway system has emboldened other countries to pursue an "all indigenous" processor approach
§ Enabled by the embedded ecosystem (don't have to own "all" of the design)
§ Started with China and Japan, but now the EU has joined in on the strategy
v Japan refocusing Flagship 2020
§ New roadmap for ARMv8-based Post-K (2021-2022)
§ Innovations happening at smaller scale for ML acceleration (TiTech)
v ML is taking off in Asia and the US
§ Plus: driving a lot of innovation and investment in HPC-relevant technologies (contributions from vertically integrated companies incl. Google/Facebook)
§ Minus: focus is on low-precision arithmetic (8-bit floats?!?!)
§ Broader trend towards AI (a superset of ML and neural networks)

39 A Short Diversion on Node Inflation

• Depends on the foundry
• Technology "node" might reflect other advances (lower leakage or FinFET transistors)
• Not consistent across foundries

Foundry node    IDM node    Min half-pitch
7 nm            10 nm       22 nm
5 nm            7 nm        16 nm
3 nm            5 nm        12 nm

Bottom Line: No longer a very meaningful metric

40 A Short Diversion About ARM Licenses
v 1980s-1990s: Custom Vector/MPP Market
§ NRE costs not shared by a broader market (hard to recoup development costs)
§ Technology development eclipsed by the ("killer micro") market
v 1990s-present: Commodity Microprocessor Market
§ The chip is the commodity: shared by the larger desktop/server market
v The ARM play is to make IP the commodity (not the chip)
§ Share NRE costs with an even larger embedded market
§ Feasible as 64-bit addressing and DP started to appear in embedded
§ Also feasible when clock rates stopped scaling (arrays of simple cores)
§ The embedded market also enables China, Japan, and the EU to develop a "domestic technology"
v Two kinds of licenses
§ ISA license: the vendor/country develops the microarchitecture, but ISA compliance ensures ALL licensees can rely on common software
§ IP license: can buy a "commodity" IP circuit design from ARM's design library (the cost of developing the technology is amortized across the broad licensee base)

41