SX-Aurora TSUBASA

Total pages: 16; file type: PDF; size: 1,020 KB

NEC SX-Aurora TSUBASA
© NEC Corporation 2021

NEC HPC Business
NEC Business
NEC Technology Labs

Solution Areas in EMEA
▌ Public sector: public safety (face & fingerprint recognition, urban surveillance cameras and analytics, cyber security), emergency service solutions
▌ Carrier: transport solutions (incl. mobile backhaul), Small Cells as a Service (ScaaS), TOMS (OSS/BSS), SDN/NFV solutions
▌ Enterprise: business mobility, unified communications & collaboration, IT solutions, retail, manufacturing, smart cities and transport, value-added services
▌ Smart Energy: energy storage systems (ESS), generation energy management systems (GEMS), distributed energy storage management (DEMS/SEMS), demand response systems (DREMS)
▌ System platform: desktop displays, large format displays, projectors, digital cinema solutions, high-availability servers, supercomputers, mainframes, storage, cloud services platform, broadcast systems, SDN/NFV solutions

NEC – Our Global Scope
▌ Marketing & service affiliates, manufacturing affiliates, liaison offices, branch offices, and laboratories
▌ Business activities in over 160 countries and territories
▌ Affiliates, offices, and laboratories in 58 countries and territories

NEC HPC EMEA

NEC HPC Portfolio
▌ SX-Aurora TSUBASA (vector CPU), LX Series (x86 CPU & GPU), and storage (scalable appliances)
▌ NEC expertise: application tuning, service & support, system architecting, R&D, and system integration, all aimed at sustained performance

NEC Offers Direct Full Customer Service
▌ We deliver comprehensive, turnkey solutions for High Performance Computing (HPC)
▌ Caring for the customer during the whole solution lifecycle, from individual consulting to managed services
▌ Customers have access to the NEC Aurora Benchmark Center
▌ Services include: individual presales consulting; application-, customer-, and site-specific sizing of the HPC solution; benchmarking of different systems; burn-in tests; software & OS installation; application installation; integration into the customer's environment; onsite hardware installation; customer training; maintenance, support & managed services; continual improvement

NEC Key Strengths
▌ NEC has expertise in software
  - Software and compiler development
  - Benchmarking and code optimization
  - Cluster management solution development
▌ NEC develops and builds optimized HPC hardware solutions
  - SX vector architecture for maximizing sustained performance
  - Storage appliances for parallel file systems
▌ NEC services
  - User and system administrator training
  - Onsite or remote administration
  - User application support (optimization)
  - Workshops and hackathons
  - Joint research activities

NEC – Main HPC Customers
▌ Meteorology, climatology, environment
▌ Research centers
▌ Universities and HPC centers
▌ Industry

NEC TOP500 Entries as of November 2020
▌ Rank 33: National Institute for Fusion Science (NIFS), Japan. Plasma Simulator, SX-Aurora TSUBASA A412-8, Vector Engine Type10AE 8C 1.58GHz, InfiniBand HDR 200 (NEC). 34,560 cores, Rmax 7,892.7 TFlop/s, Rpeak 10,510.7 TFlop/s
▌ Rank 140: Deutscher Wetterdienst, Germany. SX-Aurora TSUBASA A412-8, Vector Engine Type10AE 8C 1.58GHz, InfiniBand HDR 100 (NEC). 14,848 cores, Rmax 2,595.6 TFlop/s, Rpeak 4,360.0 TFlop/s
▌ Rank 152: RWTH Aachen University (IT Center), Germany. CLAIX (2018), Intel HNS2600BPB, Xeon Platinum 8160 24C 2.1GHz, Intel Omni-Path (NEC). 61,200 cores, Rmax 2,483.6 TFlop/s, Rpeak 4,112.6 TFlop/s
▌ Rank 251: Universitaet Mainz, Germany. Mogon II, NEC cluster, Xeon Gold 6130 16C 2.1GHz and MEGWARE MiriQuid Xeon E5-2630v4, Intel Omni-Path (NEC/MEGWARE). 49,432 cores, Rmax 1,967.8 TFlop/s, Rpeak 2,800.9 TFlop/s
▌ Rank 298: Institute for Molecular Science, Japan. Molecular Simulator, NEC LX cluster, Xeon Gold 6148/6154, Intel Omni-Path (NEC). 38,552 cores, Rmax 1,785.6 TFlop/s, Rpeak 3,072.8 TFlop/s
▌ Rank 304: German Aerospace Center, Germany. CARA, NEC LX cluster, AMD Epyc 7601 32C 2.2GHz, InfiniBand HDR (NEC). 145,920 cores, Rmax 1,746.0 TFlop/s, Rpeak 2,568.2 TFlop/s
▌ Rank 437: Center for Computational Sciences, University of Tsukuba, Japan. Cygnus, NEC LX cluster, Xeon Gold 6126 12C 2.6GHz, NVIDIA Tesla V100, InfiniBand HDR100 (NEC). 27,520 cores, Rmax 1,582.0 TFlop/s, Rpeak 2,399.7 TFlop/s
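A quick worked example, computed only from the TOP500 figures listed above: HPL efficiency is Rmax/Rpeak, so the two SX-Aurora TSUBASA entries reach roughly 75% and 60% of peak.

```latex
% HPL efficiency = Rmax / Rpeak, using only the TOP500 values quoted above
\[
  \frac{R_{\mathrm{max}}}{R_{\mathrm{peak}}}\bigg|_{\text{NIFS, rank 33}}
    = \frac{7892.7}{10510.7} \approx 0.75,
  \qquad
  \frac{R_{\mathrm{max}}}{R_{\mathrm{peak}}}\bigg|_{\text{DWD, rank 140}}
    = \frac{2595.6}{4360.0} \approx 0.60
\]
```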
NEC SX-Aurora TSUBASA: Developing Supercomputing Technologies for 35 Years
▌ NEC (founded 1899) has developed supercomputing technology for 35 years: from the SX-2 (1983) through the SX-3, SX-4, SX-5, the Earth Simulator, SX-6, SX-7, SX-8, SX-9, and SX-ACE, to the SX-Aurora TSUBASA (2020)

Development Philosophy
▌ 1. Highest memory bandwidth per processor and per core
▌ 2. Fortran/C/C++ programming with OpenMP; automatic vectorization and parallelization
▌ 3. Power limitation is becoming an issue for HPC systems: higher sustained performance with lower power

Architecture of SX-Aurora TSUBASA
▌ A standard x86/Linux server (controlling side) hosts the Vector Engine (processing side): a vector processor with 1.5 TB/s memory access and low power consumption of about 200 W
▌ Software stack: Linux OS with the VE OS, VE tools and libraries, and the vector compiler; the application and vector libraries run on the VE

Aurora Architecture
▌ x86 node + Vector Engine (VE): VE capability is provided within a standard x86/Linux environment
▌ Product: VE plus x86 server, with the VE OS and a Linux driver on the x86 Linux OS
▌ Software environment: standard Fortran/C/C++ programming with automatic vectorization by NEC's proven vector compiler
▌ Interconnect: InfiniBand for MPI

Execution Modes: Native
▌ The Vector Engine executes the entire application; only system calls (e.g. I/O) are handled by the x86 host
▌ MPI communication is direct between VEs; the x86 host is not involved

Execution Modes: Offload
▌ Accelerator mode (as with a GPU): the application runs on the x86 host and vectorized functions are offloaded to the VE
▌ The reverse direction: the application runs on the VE and "inefficient" scalar functions are offloaded from the VE to the x86 host

Large System Configuration
▌ Multiple x86 nodes (vector hosts, VH), each carrying VEs and InfiniBand HCAs behind a PCIe switch, are connected through an InfiniBand network to the file system

MPI Communications
▌ Direct VE-to-VE communication over InfiniBand with RDMA; no x86 involvement
▌ Hybrid mode is possible, with processes running on both x86 and Aurora processors (e.g. an x86 cluster and VEs sharing the same InfiniBand network)
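To make the native-execution and MPI points above concrete, here is a minimal flat-MPI sketch in C. The MPI calls are standard; the compiler wrapper name (mpincc) and the launch line shown in the comments are assumptions about the NEC MPI tooling, not commands taken from this document.

```c
/*
 * Minimal flat-MPI sketch for native execution on the Vector Engine.
 * Assumption (hypothetical invocation, not from this document):
 *   mpincc -O3 allreduce_demo.c -o allreduce_demo
 *   mpirun -np 16 ./allreduce_demo
 * When all ranks run on VEs, the collective below is a direct VE-to-VE
 * communication over InfiniBand (RDMA), as described above.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes a partial value; the sum is reduced across
       all VEs without involving the x86 hosts. */
    double local = (double)(rank + 1);
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("sum over %d ranks = %.1f\n", size, global);
    }

    MPI_Finalize();
    return 0;
}
```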
Vector Engine (VE) Card Implementation
▌ Standard PCIe implementation; connector: PCIe Gen.3 x16; double-height card (same form factor as NVIDIA GPUs)
▌ Power below 300 W (DGEMM ~210 W/VE, STREAM ~200 W/VE, HPCG ~215 W/VE)

VE20 Processor Specifications
▌ Processor versions: Type A (10 cores) and Type B (8 cores)
▌ Core performance: 307 GF (DP), 614 GF (SP)
▌ Processor performance: 3.07 TF DP / 6.14 TF SP (Type A), 2.45 TF DP / 4.91 TF SP (Type B)
▌ Cache: 16 MB software-controllable cache, 3 TB/s cache bandwidth (about 0.4 TB/s per core)
▌ Memory: 48 GB (6 x HBM2), 1.53 TB/s memory bandwidth
▌ Power: ~300 W (TDP), ~200 W (typical application)

Core Architecture
▌ Each core combines a scalar processing unit (SPU) with a vector processing unit whose parallel vector pipes each contain three FMA units (VFMA0-2), two ALUs, and a divide unit
▌ Memory bandwidth: 1.53 TB/s per processor, 400 GB/s per core

Vector Execution
▌ 64 vector registers of 256 elements each (128 KB per core); a 256-element vector operation is executed as 32 elements per cycle over 8 cycles
▌ Peak per core: 307.2 GF = 32 (elements/cycle) x 2 (operations per FMA) x 3 (FMA pipes) x 1.6 GHz

VE Roadmap (2019-2024)
▌ VE10: 8 cores / 2.45 TF, 1.22 TB/s memory bandwidth
▌ VE10E: 8 cores / 2.45 TF, 1.35 TB/s memory bandwidth
▌ VE20: 10 cores / 3.07 TF, 1.53 TB/s memory bandwidth
▌ VE30: 2+ TB/s memory bandwidth
▌ VE40 and VE50: 2++ TB/s memory bandwidth

VE10/10E/20 SKUs
▌ 10C: 0.75 TB/s, 24 GB
▌ 10A / 10B: 1.22 TB/s, 48 GB
▌ 10AE / 10BE: 1.35 TB/s, 48 GB (10AE runs at 1.584 GHz, 10BE at 1.408 GHz)
▌ 20A / 20B: 1.53 TB/s, 48 GB
▌ 30: 2+ TB/s (future)
▌ Performance points range from 2.15 TF (8 cores, 1.4 GHz) through 2.45 TF (8 cores, 1.6 GHz) to 3.07 TF (10 cores, 1.6 GHz)

Server Products
▌ A500 / A400 supercomputer models: for large-scale configurations, direct liquid cooling with 40°C water
▌ A300 rack-mount models: flexible, air-cooled configurations with 2, 4, or 8 VEs
▌ A100 tower model: for developers and programmers, 1 VE

B401-8 High-Density DLC Model
▌ Direct liquid cooling (DLC) of the VEs for higher density; hot-water cooling (~45°C); 8 VEs in 2U
▌ Per vector host (VH): 8 VEs, 1 AMD Rome processor, 2 InfiniBand HCAs
▌ Performance: 24.5 TF per VH; memory bandwidth: 12.2 TB/s per VH; power: 2.4 kW (HPL), 1.7 kW (HPCG)
▌ Up to 18 VH (144 VE) per rack
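As a quick sanity check, the B401-8 per-VH figures quoted above follow directly from aggregating eight 10-core VE20 cards per vector host, and the same arithmetic gives the per-rack capability.

```latex
% Per-VH and per-rack aggregates implied by the B401-8 configuration above
\[
  8 \times 3.07\,\mathrm{TF} \approx 24.5\,\mathrm{TF~per~VH},
  \qquad
  8 \times 1.53\,\mathrm{TB/s} \approx 12.2\,\mathrm{TB/s~per~VH},
  \qquad
  18 \times 24.5\,\mathrm{TF} \approx 441\,\mathrm{TF~per~rack}
\]
```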