Supercomputer Fugaku CPU A64FX Realizing High Performance, High-Density Packaging, and Low Power Consumption

Ryohei Okazaki, Takekazu Tabata, Sota Sakashita, Kenichi Kitamura, Noriko Takagi, Hideki Sakata, Takeshi Ishibashi, Takeo Nakamura, Yuichiro Ajima

The A64FX was developed as a processor for the supercomputer Fugaku. For the semiconductors, the 7-nm CMOS process technology of TSMC [1] has been used. For higher density, the Tofu interconnect D and the PCI Express controllers have been implemented in the CPU chip, and the high-bandwidth 3D stacked memory is integrated in the package. The A64FX employs the Arm architecture to improve the software development environment while inheriting Fujitsu's proven high-performance microarchitecture. In addition, we worked on the specification of Scalable Vector Extension (SVE) as a lead partner of Arm Limited, the result of which has been adopted. This article outlines the A64FX and describes its high-performance microarchitecture, architecture to achieve high-density packaging, and low power consumption design.

1. Introduction

The A64FX (Figure 1) was developed as a processor for the supercomputer Fugaku (hereafter, Fugaku). Fugaku features 158,976 processors [2]. These processors require high performance, high-density packaging, and low power consumption, all at high levels. The software development environment is also important.

This article outlines the A64FX and describes its high-performance microarchitecture, architecture to achieve high-density packaging, and low power consumption design.

Figure 1 Die photo of A64FX CPU.

2. A64FX overview and adoption of Arm architecture

For the A64FX, we decided to work on development with the aim of achieving the capability of higher-speed application execution based on the microarchitectures of processors developed by Fujitsu up to now, including the processor of the K computer. In order to accelerate application performance, we analyzed various applications and optimized the entire processor, revising the configuration of various blocks, optimizing resources, adding new circuits, selecting and optimizing memory components, and optimizing OS operations. In addition, we implemented measures to save power across a wide range of levels from the architecture to the device level with the aim of providing a system that uses the A64FX, a general-purpose CPU, with the capability of achieving performance per power equivalent to that of a system equipped with a GPU.

Figure 2 shows a block diagram of the A64FX CPU [3]. There are four core memory groups (CMGs), each of which is composed of 13 cores (12 used as computing cores and one as an assistant core), level 2 cache and memory controller, and a ring bus network on a chip (NoC) is used to connect them with the Tofu Interconnect D (hereafter, TofuD) [4] interface and PCI Express (PCIe) interface.

Figure 2 Block diagram of A64FX CPU.

In developing the A64FX, the Arm architecture was adopted with the aim of having the supercomputer Fugaku accepted by a wider range of software developers than the K computer, which saw a full-scale launch in 2012, and putting in place an environment allowing use of the latest software. The Arm architecture is an instruction set developed by Arm Limited (hereafter, Arm) widely used in software development for smartphones and embedded devices. Recently, Arm processors for servers have appeared by extension to a 64-bit architecture, which is the standard in the server sector, and new addition of the hypervisor extension feature for servers. These developments have raised expectations for expansion of Arm into the server sector.

In offering the Arm architecture as a processor for high-performance computing (HPC), the single instruction, multiple data (SIMD) extension, a unique feature of the Arm architecture, became an issue. This SIMD extension, known as the Advanced SIMD, is used for accelerating media processing and digital signal processing (DSP) for embedding and other applications. The SIMD length is 128 bits, the same as the K computer. This is shorter than 256 bits and 512 bits, the current trends for HPC CPUs, and was unsuitable for improving the computing performance per core. In addition, based on Fujitsu's past experience in HPC development, instructions useful for the HPC applications were also insufficient.

Therefore, by collaborating with Arm, Fujitsu contributed as a lead partner to the specification of Scalable Vector Extension (SVE) capable of high-speed execution of HPC applications including scientific computing and AI, the result of which has been adopted for the A64FX.
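To illustrate the programming model that SVE enables, the following C sketch (not taken from the article) uses the Arm C Language Extensions (ACLE) for SVE to write a vector-length-agnostic DAXPY loop. The same source works for any SVE vector length; on the A64FX, where the vector length is 512 bits, each iteration processes eight double-precision elements. The function and variable names are illustrative assumptions, and a compiler flag such as -march=armv8-a+sve is assumed.

```c
/* Minimal sketch of a vector-length-agnostic loop using SVE ACLE intrinsics.
 * Not from the article; shown only to illustrate the SVE programming model. */
#include <arm_sve.h>
#include <stdint.h>

/* y[i] = a * x[i] + y[i] for i in [0, n) */
void daxpy_sve(double *y, const double *x, double a, uint64_t n)
{
    for (uint64_t i = 0; i < n; i += svcntd()) {    /* svcntd(): DP elements per vector */
        svbool_t pg = svwhilelt_b64(i, n);           /* predicate masks the loop tail     */
        svfloat64_t vx = svld1_f64(pg, &x[i]);       /* contiguous predicated load        */
        svfloat64_t vy = svld1_f64(pg, &y[i]);
        vy = svmla_n_f64_m(pg, vy, vx, a);           /* vy += vx * a (fused multiply-add) */
        svst1_f64(pg, &y[i], vy);                    /* predicated store                  */
    }
}
```

Because the vector length is queried from the hardware at run time, the same binary exploits the A64FX's 512-bit datapath without being recompiled for it, which is one reason SVE suits a general-purpose HPC CPU.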
3. High-performance microarchitecture

With the A64FX, microarchitecture exclusively for Fugaku was developed with the aim of speeding up the execution of user applications. This section outlines the microarchitecture and describes the elemental technology utilized to make it a reality.

3.1 Microarchitecture overview

Figure 3 shows the pipeline of the A64FX [3]. A core is composed of the instruction control unit, execution unit, and level 1 cache unit. The instruction control unit performs instruction fetch, instruction decode, instruction out-of-order processing control, and instruction completion control.

Figure 3 Diagram of A64FX pipeline. (Pipeline stages shown: fetch, decode, issue, register read, execution, cache access, completion.)

The execution unit is equipped with two fixed-point functional units (EXA/EXB), two functional units for address computation and simple fixed-point arithmetic (known as EAGA/EAGB for address calculation and EXC/EXD for fixed-point arithmetic), two floating-point units for executing SVE instructions (FLA/FLB), and one predicate unit for executing predicate arithmetic (PRX). Both floating-point units have a 512-bit SIMD configuration and can perform a floating-point multiply-accumulate operation every cycle. Therefore, each computing core is capable of 32 double-precision floating-point operations per cycle, and use of all computing cores in the chip allows 1,536 double-precision floating-point operations to be performed per cycle. With single-precision floating-point operations and half-precision floating-point operations, the number of operations that can be performed is twice and four times the number of double-precision floating-point operations, respectively. It operates at 1.8 GHz/2 GHz/2.2 GHz, although the operating frequency varies depending on the system in which it is installed.
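As a quick sanity check of the arithmetic behind these figures, the per-unit width, unit counts, core counts, and the 2.2 GHz operating point are taken from the article text; the multiplication below is the only thing added, written as a small C sketch for consistency with the other example.

```c
/* Back-of-the-envelope check of the peak throughput figures quoted above.
 * Inputs come from the article text; only the arithmetic is added here. */
#include <stdio.h>

int main(void)
{
    const int simd_bits     = 512;             /* SVE width of FLA/FLB                */
    const int dp_lanes      = simd_bits / 64;  /* 8 double-precision lanes per unit   */
    const int fma_units     = 2;               /* FLA and FLB                         */
    const int flops_per_fma = 2;               /* multiply + add                      */
    const int cores         = 4 * 12;          /* 4 CMGs x 12 computing cores         */

    const int per_core  = dp_lanes * fma_units * flops_per_fma;  /* 32 DP flop/cycle     */
    const int per_chip  = per_core * cores;                      /* 1,536 DP flop/cycle  */
    const double tflops = per_chip * 2.2e9 / 1e12;               /* ~3.4 DP TFLOPS at 2.2 GHz */

    printf("%d flop/cycle per core, %d flop/cycle per chip, %.2f TFLOPS at 2.2 GHz\n",
           per_core, per_chip, tflops);
    return 0;
}
```

Single- and half-precision peaks follow by doubling and quadrupling the lane count, as stated above.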
The level 1 cache unit processes load/store instructions. Each core has a 64-KiB instruction cache and a 64-KiB data cache. The data cache is configured to be capable of two simultaneous load accesses and executes two 64-byte SIMD loads or one 64-byte SIMD store.

The level 2 cache unit has 8 MiB of unified cache per CMG and is shared by 13 cores including the assistant core.

3.2 Proven microarchitecture and resource optimization

For the A64FX, we have implemented various types of hardware resource optimization based on the high-performance, high-reliability microarchitecture adopted in the mainframes, UNIX servers, and the K computer that Fujitsu previously developed.

In particular, for the numbers of entries in the reorder buffer (ROB), reservation stations, and other queues, which are important performance indicators, we have adopted control that accelerates release by judging the release timing of a queue entry at the time of instruction execution. This maintains instruction execution performance without unnecessarily increasing the number of queue entries, which reduces the chip area by curbing the growth of logic circuits.

3.3 Branch prediction circuits

We have employed several branch prediction circuits so that optimum branch prediction can be performed for various applications.

For example, we have employed a circuit that uses a piecewise linear algorithm to perform branch prediction. This allows for high-accuracy branch prediction even with a program that has a complicated instruction structure. As a result, it is possible

3.5 Level 1 data cache to support various access patterns

In order to maximize the efficiency of 512-bit SIMD, it is important to maintain the access throughput of the L1 data cache for load instructions that transfer data to registers. When load instructions with addresses not on 512-bit boundaries are executed sequentially in address order, an access across cache lines occurs once in every several instructions. To avoid performance degradation in this case, the L1 data cache of the A64FX is configured to allow each of the two load ports to