The Tofu Interconnect D for Supercomputer Fugaku


Yuichiro Ajima, Fujitsu Limited
20th June 2019, ExaComm 2019
Copyright 2019 FUJITSU LIMITED

Overview of Fujitsu's Computing Products

Fujitsu continues to develop general-purpose computing products and new domain-specific computing devices.

General-purpose computing products:
- Servers with Fujitsu's own processors: mainframes, SPARC servers, x86 servers
- High-performance computing: supercomputers, PC clusters

Domain-specific computing devices:
- Deep learning
- Combinatorial problems
- Quantum computing

Role of the Supercomputer in Fujitsu Products

The supercomputer is a technology driver for Fujitsu's computing products, in technologies including packaging and interconnect.

Development of Packaging Technology

- 2003: HPC2500
- 2009: FX1 (single-socket node)
- 2012: K computer / FX10 (water cooling)
- 2015: FX100 (3D-stacked memory)
- 2021: Fugaku (2.5D package; the pictured Fugaku hardware is a prototype)

Fujitsu has developed single-socket-node, water-cooled supercomputers with 3D-stacked memory. Fugaku will integrate the memory stacks into the CPU package.

Development of Interconnect Technology

- 2003: HPC2500 (DTU)
- 2009: FX1 (InfiniBand)
- 2012: K computer / FX10 (Tofu1)
- 2015: FX100 (Tofu2)
- 2021: Fugaku (TofuD)

The Tofu interconnect (Tofu1) for the K computer introduced the 6D mesh/torus network, virtual torus rank mapping, and the Tofu Barrier. Tofu2 added new functions: atomic operations and cache injection. The Tofu interconnect D (TofuD) for Fugaku increases resources for the high-density node configuration and adds fault resilience through dynamic packet slicing.

Features of the Tofu Interconnect Family

- 6D mesh/torus network
- Virtual 3D-torus rank mapping
- Tofu Barrier
- Characteristics of torus networks

6D Mesh/Torus Network

There are six coordinate axes: X, Y, Z, A, B, and C.
- X, Y, Z: the sizes vary according to the system configuration
- A, B, C: the sizes are fixed at 2×3×2

Tofu stands for "torus fusion": (X, Y, Z) × (A, B, C), i.e. X×Y×Z×2×3×2 nodes in total.

Virtual 3D-Torus Rank Mapping

A rank-mapping option provides topology awareness: a 3D-torus rank can be mapped onto a 6D submesh even if the submesh contains an offline node. This fault tolerance contributes to system availability.

[Figure: a 12-node virtual 3D-torus ring folded onto a 6D submesh so that an offline node is bypassed.]

Tofu Barrier

The Tofu Barrier offloads Barrier and Allreduce communications. A barrier channel (BCH) is the interface, and a barrier gate (BG) is the communication engine. The Tofu Barrier can execute an arbitrary communication algorithm:
- The recursive-doubling algorithm uses log2(n) BGs in each process
- The reduce-broadcast algorithm uses a maximum of 5 BGs in each process

[Figure: a recursive-doubling barrier over processes 0 to 9; legend: BCH plus start/end BG, intermediate BG.]
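To see why recursive doubling consumes log2(n) BGs per process, the sketch below prints the stage-by-stage partner schedule. It is a minimal model of the communication pattern in plain C, not the real Tofu hardware or driver interface, and all names in it are illustrative.

```c
#include <stdio.h>

/* Recursive doubling over n = 2^k processes: in stage s, rank r
 * synchronizes with rank r XOR 2^s. Each stage occupies one barrier
 * gate (BG) per process, so a full barrier consumes log2(n) BGs,
 * matching the slide. This sketch only prints the schedule. */
static void recursive_doubling_schedule(int rank, int n)
{
    int stage = 0;
    for (int stride = 1; stride < n; stride <<= 1, stage++) {
        int partner = rank ^ stride;  /* peer handled by this stage's BG */
        printf("rank %d, stage %d (BG %d): sync with rank %d\n",
               rank, stage, stage, partner);
    }
}

int main(void)
{
    const int n = 8;  /* 8 processes -> log2(8) = 3 BGs per process */
    for (int r = 0; r < n; r++)
        recursive_doubling_schedule(r, n);
    return 0;
}
```

With n = 8, every rank completes in 3 stages, i.e. 3 BGs per process; doubling the process count adds only one more stage and one more BG.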
Characteristics of Torus Networks

System             Network             Total injection bandwidth   Bisection bandwidth
Blue Gene/Q        Torus (5D)          1.97 PB/s (40x)              49 TB/s
K Computer         Mesh/Torus (6D)     1.66 PB/s (36x)              46 TB/s
                   Virtual torus (3D)                               34 TB/s
Sunway TaihuLight  Tapered fat-tree    0.51 PB/s (7.3x)             70 TB/s
Piz Daint          Dragonfly           0.07 PB/s (2.0x)             36 TB/s
Summit             Fat-tree            0.12 PB/s (1.0x)            115 TB/s
Oakforest-PACS     Fat-tree            0.10 PB/s (1.0x)            102 TB/s

All of these systems have the same order of bisection bandwidth, so there is no significant performance difference in global data exchange. Torus networks, however, have much higher total injection bandwidth, so topology-aware communication such as nearest-neighbor data exchange achieves higher performance.

The Design of TofuD

- High-density node configuration
- Link configuration and injection bandwidth
- Packaging
- Dynamic packet slicing
- Increased Tofu Barrier resources

High-Density Node Configuration

The processor die is smaller than that of FX100, and the off-chip channels are halved:
- Memory stacks: from 8 (HMC) to 4 (HBM)
- High-speed serial lanes for Tofu: from 40 to 20

The die area of the Tofu interconnect shrinks to about one third.

[Figure: die layouts of FX100 (SPARC64 XIfx, 20 nm, with Tofu2 and 8 HMCs) and Fugaku (A64FX, 7 nm, with TofuD and 4 HBMs).]

High-Density Node Configuration (cont.)

More resources are integrated into the CPU:
- CPU memory groups (CMGs, i.e. NUMA nodes): from 2 to 4, so the expected number of processes per node also doubles
- Tofu network interfaces (TNIs): from 4 to 6, providing more resources and accelerating collective communications

[Figure: block diagrams of SPARC64 XIfx (Tofu2: 2 CMGs with HMCs, TNI0-TNI3, Tofu network router with 10 ports × 4 lanes) and A64FX (TofuD: 4 CMGs with HBM2, PCIe, network on chip (NOC), TNI0-TNI5, Tofu network router with 10 ports × 2 lanes).]

Link Configuration and Injection Bandwidth

                                     Tofu1   Tofu2      TofuD
Data rate (Gbps)                     6.25    25.78125   28.05
Signal lanes per link                8       4          2
Link bandwidth (GB/s)                5.0     12.5       6.8
TNIs per node                        4       4          6
Injection bandwidth per node (GB/s)  20      50         40.8

The data transfer rate increased from 25 Gbps to 28 Gbps, while the link bandwidth decreased from 12.5 GB/s to 6.8 GB/s because the number of lanes per link was halved. TofuD transmits in 6 directions simultaneously, up from 4 directions in Tofu1 and Tofu2, so the total injection bandwidth per node is 40.8 GB/s: approximately twice that of Tofu1, or 80% of that of Tofu2.
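The figures in this table are mutually consistent; the sketch below reproduces them from the per-lane data rates, assuming 8b/10b line coding for Tofu1 and 64b/66b coding for Tofu2 and TofuD. The coding factors are an assumption on my part, chosen because they reproduce the published bandwidths exactly; the slides themselves do not state them.

```c
#include <stdio.h>

/* Recomputes the link and injection bandwidths from the per-lane data
 * rates. The line-coding factors (8b/10b for Tofu1, 64b/66b for
 * Tofu2/TofuD) are assumptions that happen to reproduce the table. */
struct gen {
    const char *name;
    double lane_gbps;   /* per-lane data rate */
    int lanes;          /* signal lanes per link */
    double coding;      /* payload fraction of the raw line rate */
    int tnis;           /* TNIs transmitting simultaneously */
};

int main(void)
{
    const struct gen g[] = {
        { "Tofu1", 6.25,     8, 8.0 / 10.0,  4 },
        { "Tofu2", 25.78125, 4, 64.0 / 66.0, 4 },
        { "TofuD", 28.05,    2, 64.0 / 66.0, 6 },
    };
    for (int i = 0; i < 3; i++) {
        double link = g[i].lane_gbps * g[i].lanes * g[i].coding / 8.0;
        printf("%s: link %.1f GB/s, injection %.1f GB/s per node\n",
               g[i].name, link, link * g[i].tnis);
    }
    return 0;   /* prints 5.0/20.0, 12.5/50.0, and 6.8/40.8 GB/s */
}
```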
Packaging – CPU Memory Unit (CMU)

Two CPUs are connected along the C axis: X×Y×Z×A×B×C = 1×1×1×1×1×2. Two or three active optical cable (AOC) cages sit on the board, serving the X, Y, and Z axes; each cable bundles two lanes of signals from each of the two CPUs.

Packaging – Rack Structure

- Shelf: 24 CMUs (48 CPUs); X×Y×Z×A×B×C = 1×1×4×2×3×2
- Top or bottom half of a rack: 4 shelves; X×Y×Z×A×B×C = 2×2×4×2×3×2
- Rack: 8 shelves, i.e. 192 CMUs (384 CPUs)

Dynamic Packet Slicing – Split Mode

The physical layer of TofuD is independent for each lane. In ordinary multi-lane transmission, the physical layer presents a media-independent interface that hides the number of signal lanes. In TofuD, a packet is instead sliced, and each slice is injected into a different lane; the routing header of the packet is copied into both slices so that virtual cut-through packet transfer remains possible. This is the normal operation and is called split mode.

Dynamic Packet Slicing – Duplicate Mode

When the error rate is high, the operation falls back to duplicate mode, in which every packet is duplicated and sent on both lanes. If the error rate returns to a low level, the link can return to split mode. Neither lane is ever disconnected on its own; instead, the error rates of both lanes are continuously monitored and fed back to the mode selection.

Increased Tofu Barrier Resources

                                       Tofu1/2   TofuD
Per TNI    Number of BCHs              8         16
           Number of BGs               64        48
Number of TNIs with Tofu Barrier       1         6
Per node   Number of BCHs              8         96
           Number of BGs               64        288

The number of Tofu Barrier resources increased significantly: all 6 TNIs of TofuD have the Tofu Barrier, whereas only TNI #0 of Tofu1/2 had it. This change is intended to support intra-node synchronization.

Performance Evaluations

- Put latencies
- Latency breakdown
- Injection rates
- Tofu Barrier

Put Latencies

8-byte Put transfers between nodes on the same board, with the low-latency features enabled:

       Communication settings        Latency
Tofu1  Descriptor on main memory     1.15 µs
       Direct descriptor             0.91 µs
Tofu2  Cache injection OFF           0.87 µs
       Cache injection ON            0.71 µs
TofuD  To/from far CMGs              0.54 µs
       To/from near CMGs             0.49 µs

Tofu2 reduced the Put latency by 0.20 µs from that of Tofu1; the cache-injection feature contributed to this reduction. TofuD reduced the Put latency by a further 0.22 µs from that of Tofu2.

Latency Breakdown

[Figure: stacked-bar breakdown of Put latency (0 to 1000 nsec) for Tofu1, Tofu2, and TofuD into Tx CPU, Tx host bus, Tx TNI, packet transfer, cache injection, Rx TNI, Rx host bus, and Rx CPU; annotations mark the Tx and Rx optimizations, the increased overhead in the physical layer of Tofu2, and its reduction in TofuD.]

The overhead increase seen in Tofu2 has been reduced in TofuD.

Injection Rates per Node

Simultaneous Put transfers to the nearest-neighbor nodes; Tofu1 and Tofu2 used 4 TNIs, and TofuD used 6 TNIs.

              Injection rate   Efficiency
Tofu1 (K)     15.0 GB/s        77%
Tofu1 (FX10)  17.6 GB/s        88%
Tofu2         45.8 GB/s        92%
TofuD         38.1 GB/s        93%

The efficiencies of Tofu1 were below 90% because of a bottleneck in the bus that connects the CPU and the ICC. The efficiencies of Tofu2 and TofuD exceed 90%: integrating the interconnect into the processor chip removed that bottleneck.

Tofu Barrier – Intra-Node Synchronization

The test program synchronized multiple BCHs within a node. For 8 and 16 BCHs, some TNIs are shared by multiple BCHs, and sharing a TNI serializes the processing of its BCHs/BGs.

Number of BCHs                  1   4   8   16   48
Number of used TNIs             1   4   6    6    6
Number of communication stages  2   2   4    6    9

Max.
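To make the TNI-sharing effect concrete, here is a small sketch that assumes BCHs are assigned round-robin across the 6 TNIs; the round-robin policy is my assumption for illustration, not something the slides specify. It reproduces the "used TNIs" row above and shows how many BCHs pile up on a single TNI.

```c
#include <stdio.h>

/* Illustrates TNI sharing in the intra-node test: BCHs are assumed to
 * be distributed round-robin across the 6 TNIs (an assumed policy).
 * Once more than 6 BCHs are active, some TNIs host several BCHs and
 * their processing serializes. */
int main(void)
{
    const int tnis = 6;
    const int bchs[] = { 1, 4, 8, 16, 48 };
    for (int i = 0; i < 5; i++) {
        int n = bchs[i];
        int used = n < tnis ? n : tnis;        /* TNIs used: 1,4,6,6,6 */
        int per_tni = (n + tnis - 1) / tnis;   /* max BCHs on one TNI */
        printf("%2d BCHs -> %d TNIs used, up to %d BCHs serialized per TNI\n",
               n, used, per_tni);
    }
    return 0;
}
```

Under this assumption, 8 or 16 BCHs put two or three BCHs on at least one TNI, whose processing must then serialize; this is consistent with the growth in communication stages in the table above.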