Project Aurora 2017 Vector Inheritance

Vector Coding – a simple user's perspective
Rudolf Fischer, NEC Deutschland GmbH, Düsseldorf, Germany

SIMD vs. vector
[Diagram: input, pipeline and result compared for scalar, SIMD (which people call "vector"!) and true vector (SX) execution.]
2 © NEC Corporation 2017

Data Parallelism
▌ 'Vector loop', data parallel:
    real, dimension(n) :: a, b, c
    do i = 1, n
      a(i) = b(i) + c(i)
    end do
  which is simply the array statement a = b + c.
▌ 'Scalar loop', not data parallel, e.g. a linear recursion:
    do i = 2, n
      a(i) = a(i-1) + b(i)
    end do
▌ Reduction? The scalar form
    do i = 1, n
      s = s + v(i)*w(i)
    end do
  is strip-mined into
    do i = 1, n, VL
      do i_ = i, min(n, i+VL-1)
        s_(i_) = s_(i_) + v(i_)*w(i_)
      end do
    end do
    s = reduction(s_)      ! final reduction done in hardware!
  (a runnable sketch follows after the SMV slide below).
3 © NEC Corporation 2017

Vector Coding Paradigm
▌ 'Scalar' thinking / coding: there is a (grid point, particle, equation, element); what am I going to do with it?
▌ 'Vector' thinking / coding: there is a certain action or operation; to which (grid points, particles, equations, elements) am I going to apply it simultaneously?
4 © NEC Corporation 2017

Identifying the data-parallel structure
▌ Very simple case:
    do j = 2, m-1
      do i = 2, n-1
        rho_new(i,j) = rho_old(i,j) + dt * something_with_fluxes(i+/-1,j+/-1)
      end do
    end do
▌ Does it tell us something? Partial differential equations, local theories: vectorization is very "natural".
5 © NEC Corporation 2017

Identifying the data-parallel subset
▌ Simple case (V marks the vectorized loop):
    V------> do j = 1, m
    |+-----> do i = 2, n
    ||           a(i,j) = a(i-1,j) + b(i,j)
    |+------   end do
    V------  end do
▌ The compiler will vectorize along j.
▌ Non-unit-stride access, suboptimal.
▌ Results for a certain case:
    Totally scalar (directives): 25.516 ms
    Avoid outer vector loop:     21.032 ms
    Default compilation:          0.515 ms
6 © NEC Corporation 2017

Identifying the data-parallel subset
▌ Example: 2D ILU or Gauß-Seidel. Neither loop of
    do j = 2, n
      do i = 2, n
        x(i,j) = rhs(i,j) - ( a(i,j) * x(i-1,j) + b(i,j) * x(i,j-1) )
      end do
    end do
  is data parallel, because x(i-1,j) and x(i,j-1) are results of the same sweep.
▌ Solution: hyperplane ordering; along a diagonal of constant i+j the updates are independent:
    do idiag = 2, 2*n-1
      do i = max( 2, idiag+1-n ), min( n, idiag-1 )     ! vectorized
        j = idiag - i + 1
        x(i,j) = rhs(i,j) - ( a(i,j) * x(i-1,j) + b(i,j) * x(i,j-1) )
      end do
    end do
  (a runnable sketch follows after the SMV slide below).
▌ Results: default 36.2 ms, hyperplane 2.0 ms.
7 © NEC Corporation 2017

Data Layout: Sparse Matrix * Vector (SMV)
▌ Normal case: compressed row storage (CRS), res = A * x:
    do irow = 1, nrow
      do jcol = 1, jstart(irow+1)-jstart(irow)
        ind = jstart(irow)+jcol-1
        res(irow) = res(irow) + a(ind) * x(index(ind))
      end do
    end do
▌ Better: jagged diagonal storage (JDS) (Yousef Saad, 90s!):
    do jcol = 1, ncol
      do irow = 1, istart(jcol+1)-istart(jcol)
        ind = istart(jcol)+irow-1
        res(irow) = res(irow) + a(ind) * x(index(ind))
      end do
    end do
▌ JDS allows for quite some additional optimizations! (A sketch of both kernels on a small example follows below.)
8 © NEC Corporation 2017
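To make the strip-mining of the reduction on the Data Parallelism slide concrete, here is a minimal, self-contained Fortran sketch. The names (v, w, vl, partial) and the lane-indexed partial-sum array are illustrative choices of this sketch, not NEC's implementation; on a vector machine the compiler applies this transformation automatically and performs the final reduction of the partial sums in hardware.

    program reduction_sketch
      implicit none
      integer, parameter :: n = 1000, vl = 256     ! vl mimics the hardware vector length
      real :: v(n), w(n), partial(vl), s
      integer :: i, i_

      call random_number(v)
      call random_number(w)

      partial = 0.0
      do i = 1, n, vl                      ! strip-mine into chunks of at most vl elements
         do i_ = i, min(n, i+vl-1)         ! each chunk is data parallel
            partial(i_-i+1) = partial(i_-i+1) + v(i_)*w(i_)
         end do
      end do
      s = sum(partial)                     ! final reduction over the vl partial sums

      print *, 's =', s, '   reference =', dot_product(v, w)
    end program reduction_sketch

The two printed values agree up to rounding; only the order of the additions differs between the strip-mined sum and the straightforward dot product.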
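The hyperplane ordering from the ILU / Gauß-Seidel slide can be checked with a small self-contained program. This is only a sketch under illustrative assumptions (grid size, random coefficients scaled to stay small, zero boundary values); it uses idiag = i+j where the slide uses idiag = i+j-1, which is the same ordering, and it verifies that the straightforward sweep and the hyperplane sweep give the same result.

    program hyperplane_sketch
      implicit none
      integer, parameter :: n = 64
      real :: a(n,n), b(n,n), rhs(n,n), x(n,n), y(n,n)
      integer :: i, j, idiag

      call random_number(a); call random_number(b); call random_number(rhs)
      a = 0.1*a; b = 0.1*b
      x = 0.0; y = 0.0                      ! boundary values x(1,:), x(:,1) stay zero

      ! straightforward sweep: both loops carry a recursion, nothing vectorizes
      do j = 2, n
         do i = 2, n
            x(i,j) = rhs(i,j) - ( a(i,j)*x(i-1,j) + b(i,j)*x(i,j-1) )
         end do
      end do

      ! hyperplane sweep: all points with i+j = idiag are mutually independent,
      ! so the inner loop over i is data parallel
      do idiag = 4, 2*n
         do i = max(2, idiag-n), min(n, idiag-2)
            j = idiag - i
            y(i,j) = rhs(i,j) - ( a(i,j)*y(i-1,j) + b(i,j)*y(i,j-1) )
         end do
      end do

      print *, 'max difference between the two orderings:', maxval(abs(x-y))
    end program hyperplane_sketch

Each point needs only its western and southern neighbours, and both lie on the previous hyperplane, so within one hyperplane the updates may proceed in any order; the printed difference is zero.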
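Finally, a hedged sketch of the CRS and JDS kernels from the SMV slide, applied to a tiny hand-built 4x4 matrix. The concrete arrays are illustrative; the kernels themselves follow the slide. The rows of this example already happen to be ordered by decreasing length, so no row permutation appears; a real JDS implementation sorts the rows by length and applies the inverse permutation to res afterwards.

    program smv_sketch
      implicit none
      integer, parameter :: nrow = 4, nnz = 8, ncol = 3    ! ncol = length of the longest row
      ! CRS: jstart(irow) points at the first stored entry of row irow
      integer :: jstart(nrow+1)  = (/ 1, 4, 6, 8, 9 /)
      integer :: index_crs(nnz)  = (/ 1, 2, 4,  2, 3,  1, 3,  4 /)
      real    :: a_crs(nnz)      = (/ 4., 1., 2.,  5., 1.,  2., 6.,  3. /)
      ! JDS: istart(jcol) points at the first entry of jagged diagonal jcol
      integer :: istart(ncol+1)  = (/ 1, 5, 8, 9 /)
      integer :: index_jds(nnz)  = (/ 1, 2, 1, 4,  2, 3, 3,  4 /)
      real    :: a_jds(nnz)      = (/ 4., 5., 2., 3.,  1., 1., 6.,  2. /)
      real    :: x(nrow) = (/ 1., 2., 3., 4. /), res(nrow)
      integer :: irow, jcol, ind

      ! CRS kernel: inner loop length = number of nonzeros in one row (short)
      res = 0.0
      do irow = 1, nrow
         do jcol = 1, jstart(irow+1) - jstart(irow)
            ind = jstart(irow) + jcol - 1
            res(irow) = res(irow) + a_crs(ind) * x(index_crs(ind))
         end do
      end do
      print *, 'CRS res =', res

      ! JDS kernel: inner loop length = number of rows still active (long)
      res = 0.0
      do jcol = 1, ncol
         do irow = 1, istart(jcol+1) - istart(jcol)
            ind = istart(jcol) + irow - 1
            res(irow) = res(irow) + a_jds(ind) * x(index_jds(ind))
         end do
      end do
      print *, 'JDS res =', res     ! both kernels print 14 13 20 12
    end program smv_sketch

The point of JDS is the loop order: the vectorized inner loop runs over rows rather than over the few nonzeros of a single row, so its trip count is close to the matrix dimension and the vector pipes stay filled.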
Conclusion
▌ Vectorization is normal: mother nature tends to provide the necessary structures, and all contemporary architectures need it for performance anyway.
▌ There is a difference between SSE, AVX … you name it, and a real vector architecture. This can lead to different approaches in some cases; in many cases they are just identical.
▌ The techniques have been known for ages anyway, and they are by no means intellectually challenging.
▌ NEC expects that codes will be adapted gradually, simply because the architecture promises a lot of efficiency, and coding for Aurora is easier than CUDA anyway.
9 © NEC Corporation 2017

CPU Package Memory Bandwidth
▌ Measured:
    Single core: ~330 GB/s
    Best result: ~980 GB/s
▌ Past vectors, theoretical peak:
    Cray T90-32: 360 GB/s
    NEC SX-4/32: 512 GB/s
▌ World's first implementation of 6 HBM memories.
10 © NEC Corporation 2017

HPCG
▌ HPL and STREAM are extreme benchmarks; HPCG is somehow "in between", better representing real applications.
[Chart: performance, HPCG per node and HPCG per price for Aurora vs. SKL, with performance ratios of roughly 3x; characteristics: memory-bandwidth bound vs. performance bound. SKL = dual-socket Intel Skylake 6148.]
11 © NEC Corporation 2017

„green“ HPCG
Rank | Site | Computer | Cores | HPL Rmax (Pflop/s) | TOP500 rank | HPCG (Pflop/s) | Fraction of peak | Power (kW) | HPCG/Power (Gflops/kW)
1 | RIKEN Advanced Institute for Computational Science | K computer – SPARC64 VIIIfx 2.0GHz, Tofu interconnect | 705024 | 10.510 | 10 | 0.6030 | 5.30% | 12659.89 | 47.63
2 | NSCC / Guangzhou | Tianhe-2 (MilkyWay-2) – TH-IVB-FEP Cluster, Intel Xeon 12C 2.2GHz, TH Express 2, Intel Xeon Phi 31S1P 57-core | 3120000 | 33.863 | 2 | 0.5800 | 1.10% | 17808.00 | 32.57
3 | DOE/NNSA/LANL/SNL | Trinity – Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Aries interconnect | 979072 | 14.137 | 7 | 0.5460 | 1.80% | 3843.58 | 142.06
4 | Swiss National Supercomputing Centre (CSCS) | Piz Daint – Cray XC50, Intel Xeon E5-2690v3 12C 2.6GHz, Cray Aries, NVIDIA Tesla P100 16GB | 361760 | 19.590 | 3 | 0.4860 | 1.90% | 2271.99 | 213.91
5 | National Supercomputing Center in Wuxi | Sunway TaihuLight – Sunway MPP, SW26010 260C 1.45GHz, Sunway | 10649600 | 93.015 | 1 | 0.4810 | 0.40% | 15371.00 | 31.29
6 | Joint Center for Advanced High Performance Computing | Oakforest-PACS – PRIMERGY CX600 M1, Intel Xeon Phi 7250 68C 1.4GHz, Intel Omni-Path Architecture | 557056 | 13.555 | 9 | 0.3850 | 1.50% | 2718.70 | 141.61
7 | DOE/SC/LBNL/NERSC | Cori – XC40, Intel Xeon Phi 7250 68C 1.4GHz, Cray Aries | 632400 | 13.832 | 8 | 0.3550 | 1.30% | 3939.00 | 90.12
8 | DOE/NNSA/LLNL | Sequoia – IBM BlueGene/Q, PowerPC A2 1.6GHz 16-core, 5D Torus | 1572864 | 17.173 | 6 | 0.3300 | 1.60% | 7890.00 | 41.83
9 | DOE/SC/Oak Ridge Nat Lab | Titan – Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x | 560640 | 17.590 | 5 | 0.3220 | 1.20% | 8209.00 | 39.23
10 | GSIC Center, Tokyo Institute of Technology | TSUBAME3.0 – SGI ICE XA (HPE SGI 8600), IP139-SXM2, Intel Xeon E5-2680 v4 15120C 2.9GHz, Intel Omni-Path Architecture, NVIDIA Tesla P100 SXM2 with NVLink | 136080 | 8.125 | 13 | 0.1890 | 1.60% | 792.08 | 238.61
11 | NASA / Mountain View | Pleiades – SGI ICE X, Intel Xeon E5-2670, E5-2680V2, E5-2680V3, E5-2680V4, Infiniband FDR | 243008 | 5.952 | 17 | 0.1750 | 2.50% | 4407.00 | 39.71
12 | DOE/SC/Argonne National Laboratory | Mira – IBM BlueGene/Q, PowerPC A2 1.6GHz 16-core, 5D Torus | 786432 | 8.587 | 11 | 0.1670 | 1.70% | 3945.00 | 42.33
13 | TOTAL | Pangea – SGI ICE X, Intel Xeon E5-2670 12C 2.6GHz, Infiniband FDR | 218592 | 5.283 | 21 | 0.1630 | 2.40% | 4150.00 | 39.28
14 | HLRS/University of Stuttgart | Hazel Hen – Cray XC40, Intel Xeon E5-2680-V3, Cray Aries | 185088 | 5.640 | 19 | 0.1380 | 1.90% | 3615.00 | 38.17
15 | BSC-CNS | MareNostrum4 – ThinkSystem SD530, Intel Xeon Platinum 8160 24C 2.1GHz, Intel OmniPath | 241108 | 6.227 | 16 | 0.1220 | 1.10% | 1632.00 | 74.75
12 © NEC Corporation 2017

„green“ HPCG
Rank | Site | Computer | Cores | HPL Rmax (Pflop/s) | TOP500 rank | HPCG (Pflop/s) | Fraction of peak | Power (kW) | HPCG/Power (Gflops/kW)
61 | Cyberscience Center / Tohoku University | NEC SX – NEC SX-ACE 4C+IXS, NEC IXS | 2048 | 0.123 | – | 0.0150 | 11.40% | 140.80 | 106.53
36 | CEIST / JAMSTEC | Earth Simulator – NEC SX-ACE | 8192 | 0.487 | – | 0.0580 | 11.00% | 563.20 | 102.98
62 | Cybermedia Center, Osaka University | ACE – NEC SX-ACE 2048C 1.0GHz, IXS | 2048 | 0.123 | – | 0.0140 | 10.80% | 140.80 | 99.43
51 | Cyberscience Center, Tohoku University | SX-ACE – NEC SX-ACE 4096C 1.0GHz, IXS | 4096 | 0.246 | – | 0.0280 | 10.70% | 281.60 | 99.43
83 | Christian-Albrechts-Universitaet zu Kiel | NEC SX-ACE – NEC SX-ACE 1024C 1.0GHz, NEC IXS | 1024 | 0.062 | – | 0.0068 | 10.50% | 70.40 | 96.59
74 | Center for Global Environmental Research, National Institute for Environmental Studies | SX-ACE – NEC SX-ACE 1536C 1.0GHz, NEC IXS | 1536 | 0.092 | – | 0.0100 | 10.20% | 105.60 | 94.70
1 | RIKEN Advanced Institute for Computational Science | K computer – SPARC64 VIIIfx 2.0GHz, Tofu interconnect | 705024 | 10.510 | 10 | 0.6030 | 5.30% | 12659.89 | 47.63
37 | Information Technology Center, The University of Tokyo | Oakleaf-FX – PRIMEHPC FX10, SPARC64 IXfx 32C 1.848GHz, Tofu interconnect | 76800 | 1.043 | 157 | 0.0570 | 5.00% | 1176.80 | 48.44
35 | Max-Planck-Gesellschaft MPI/IPP | iDataPlex – iDataPlex DX360M4, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband FDR | 65320 | 1.283 | 100 | 0.0610 | 4.20% | 1260.00 | 48.41
17 | Japan Aerospace eXploration Agency | SORA-MA – PRIMEHPC FX100, SPARC64 XIfx 32C 1.975GHz, Tofu interconnect 2 | 103680 | 3.157 | 38 | 0.1100 | 3.20% | 1652.40 | 66.57
24 | EPSRC/University of Edinburgh | ARCHER – Cray XC30, Intel Xeon E5 v2 12C 2.700GHz, Aries interconnect | 118080 | 1.643 | 79 | 0.0810 | 3.20% | 3306.24 | 24.50
25 | DOE/SC/LBNL/NERSC | Edison – Cray XC30, Intel Xeon E5-2695v2 12C 2.4GHz, Aries interconnect | 133824 | 1.655 | 78 | 0.0790 | 3.10% | 3747.07 | 21.08
39 | CEA/TGCC-GENCI | Curie thin nodes – Bullx B510, Intel Xeon E5-2680 8C 2.700GHz, Infiniband QDR | 77184 | 1.359 | 93 | 0.0510 | 3.10% | 2132.00 | 23.92
26 | National Institute for Fusion Science | Plasma Simulator – Fujitsu PRIMEHPC FX100, SPARC64 XIfx, Tofu Interconnect 2 | 82944 | 2.376 | 59 | 0.0730 | 2.80% | 1244.16 | 58.67
21 | Information Technology Center, Nagoya University | ITC Nagoya – PRIMEHPC FX100, SPARC64 XIfx, Tofu interconnect 2 | 92160 | 2.910 | 43 | 0.0870 | 2.70% | 1382.40 | 62.93
11 | NASA / Mountain View | Pleiades – SGI ICE X, Intel Xeon E5-2670, E5-2680V2, E5-2680V3, E5-2680V4, Infiniband FDR | 243008 | 5.952 | 17 | 0.1750 | 2.50% | 4407.00 | 39.71
13 | TOTAL | Pangea – SGI ICE X, Intel Xeon E5-2670 12C 2.6GHz, Infiniband FDR | 218592 | 5.283 | 21 | 0.1630 | 2.40% | 4150.00 | 39.28
13 © NEC Corporation 2017

„green“ HPCG
Rank | Site | Computer | Cores | HPL Rmax (Pflop/s) | TOP500 rank | HPCG (Pflop/s) | Fraction of peak | Power (kW) | HPCG/Power (Gflops/kW)
55 | National Institute for Environmental Studies | Research Computation Facility for GOSAT-2 (RCF2) – SGI Rackable C1104-GP1, Intel Xeon E5-2650 v4 2880C 2.2GHz, Infiniband EDR, NVIDIA Tesla P100 PCIe | 16320 | 0.770 | 320 | 0.0230 | 2.10% | 78.64 | 292.47
52 | Information Technology Center, The University of Tokyo | Reedbush-L – SGI Rackable C1102-GP8, Intel Xeon E5-2695 v4 2304C 2.1GHz, EDR Infiniband, NVIDIA Tesla P100 NVLink | 16640 | 0.806 | 292 | 0.0230 | 1.60% | 79.24 | 290.26
Recommended publications
  • Performance Modeling the Earth Simulator and ASCI Q
    Performance Modeling the Earth Simulator and ASCI Q. Darren J. Kerbyson, Adolfy Hoisie, Harvey J. Wasserman, Performance and Architectures Laboratory (PAL), CCS-3, Los Alamos National Laboratory, April 2003. "26.58 Tflops on AFES … 64.9% of peak (640 nodes)"; "14.9 Tflops on Impact-3D … 45% of peak (512 nodes)"; "10.5 Tflops on PFES … 44% of peak (376 nodes)". Talk overview: an overview of the Earth Simulator, with a (quick) view of the architecture of the Earth Simulator (and Q) and a look at its performance characteristics; application-centric performance models, where the method of comparing performance is to use trusted models of applications that we are interested in, e.g. SAGE and Sweep3D, analytical / parameterized in system and application characteristics; models can be used to provide predicted performance prior to availability (hardware or software), insight into performance, and performance comparison (which is better?); and a system performance comparison (Earth Simulator vs ASCI Q). Earth Simulator overview: 640 nodes (vector processors) interconnected by a single-stage 640x640 crossbar with copper interconnect (~3,000 km of wire). NEC markets a product, the SX-6, which is not the same as an Earth Simulator node: similar, but with a different memory subsystem. [Diagram: an Earth Simulator node with 8 arithmetic (vector) processors (AP 0-7), 16 GByte memory, a Remote Control Unit (RCU) and an I/O Processor (IOP), connected to the crossbar, a LAN network and disks.]
  • 2020 Global High-Performance Computing Product Leadership Award
    2020 GLOBAL HIGH-PERFORMANCE COMPUTING PRODUCT LEADERSHIP AWARD. Strategic Imperatives: Frost & Sullivan identifies three key strategic imperatives that impact the automation industry: internal challenges, disruptive technologies, and innovative business models. Every company that is competing in the automation space is obligated to address these imperatives proactively; failing to do so will almost certainly lead to stagnation or decline. Successful companies overcome the challenges posed by these imperatives and leverage them to drive innovation and growth. Frost & Sullivan's recognition of NEC is a reflection of how well it is performing against the backdrop of these imperatives. Best Practices Criteria for World-Class Performance: Frost & Sullivan applies a rigorous analytical process to evaluate multiple nominees for each award category before determining the final award recipient. The process involves a detailed evaluation of best practices criteria across two dimensions for each nominated company. NEC excels in many of the criteria in the high-performance computing (HPC) space. About NEC: Established in 1899, NEC is a global IT, network, and infrastructure solution provider with a comprehensive product portfolio across computing, data storage, embedded systems, integrated IT infrastructure, network products, software, and unified communications. Headquartered in Tokyo, Japan, NEC has been at the forefront of accelerating the industrial revolution of the 20th and 21st centuries by leveraging its technical knowhow and product expertise across thirteen different industries in industrial and energy markets. Deeply committed to the vision of orchestrating a better world, NEC envisions a future that embodies the values of safety, security, fairness, and efficiency, thus creating long-lasting social value.
  • Hardware Technology of the Earth Simulator
    Special Issue on High Performance Computing: Architecture and Hardware for HPC. Hardware Technology of the Earth Simulator. By Jun INASAKA, Rikikazu IKEDA, Kazuhiko UMEZAWA, Ko YOSHIKAWA, Shitaka YAMADA and Shigemune KITAWAKI. ABSTRACT: This paper describes the hardware technologies of the supercomputer system "The Earth Simulator," which has the highest performance in the world and was developed by ESRDC (the Earth Simulator Research and Development Center)/NEC. The Earth Simulator has adopted NEC's leading-edge technologies, such as the most advanced device technology and process technology, to develop a high-speed and highly integrated LSI resulting in a one-chip vector processor. By combining this LSI technology with various advanced hardware technologies, including high-density packaging, high-efficiency cooling technology against high heat dissipation, and high-density cabling technology for the internal node shared memory system, ESRDC/NEC has been successful in implementing the supercomputer system with its Linpack benchmark performance of 35.86 TFLOPS, which is the world's highest performance (http://www.top500.org/ : the 20th TOP500 list of the world's fastest supercomputers). KEYWORDS: The Earth Simulator, Supercomputer, CMOS, LSI, Memory, Packaging, Build-up Printed Wiring Board (PWB), Connector, Cooling, Cable, Power supply. 1. INTRODUCTION: The Earth Simulator has adopted NEC's most advanced CMOS technologies to integrate vector and parallel processing functions for realizing a super… … (Main Memory Unit) packages are all interconnected with the fine coaxial cables to minimize the distances between them and maximize the system packaging density corresponding to the high-performance system.
  • Supercomputers – Prestige Objects Or Crucial Tools for Science and Industry?
    Supercomputers – Prestige Objects or Crucial Tools for Science and Industry? Hans W. Meuer (a), Horst Gietl (b). (a) University of Mannheim & Prometeus GmbH, 68131 Mannheim, Germany; e-mail: [email protected]. (b) Prometeus GmbH, 81245 Munich, Germany; e-mail: [email protected]. This paper is the revised and extended version of the Lorraine King Memorial Lecture Hans Werner Meuer was invited by Lord Laird of Artigarvan to give at the House of Lords, London, on April 18, 2012. Keywords: TOP500, High Performance Computing, HPC, Supercomputing, HPC Technology, Supercomputer Market, Supercomputer Architecture, Supercomputer Applications, Supercomputer Technology, Supercomputer Performance, Supercomputer Future. Content: 1 Introduction; 2 The TOP500 Supercomputer Project; 2.1 The LINPACK Benchmark; 2.2 TOP500 Authors; 2.3 The 39th TOP500 List since 1993; 2.4 The 39th TOP10 List since 1993 …
  • Beyond Earth Simulator
    Beyond Earth Simulator. International Computing for the Atmospheric Sciences Symposium (iCAS2017), September 11, 2017. Makoto Tsukakoshi, Information Systems Department, Center for Earth Information Science and Technology, JAMSTEC (Japan Agency for Marine-Earth Science and Technology). The main seven research and development issues during the third mid-term plan: during the third mid-term plan, we set and address the seven research and development issues with all our strength to promote strategic and focused research and development based on national and social needs. [Diagram of the seven issues: exploring untapped submarine resources; ocean drilling, getting to know the Earth from beneath the seabed; detecting signals of global environmental change; information science, predicting the Earth's future by simulations; understanding seismogenic zones and contributing to disaster mitigation; marine bioscience, exploring the unknown extreme biosphere to solve the mystery of life; construction of a research base to spawn the ocean frontier.] Typhoon-marine interaction using a non-hydrostatic atmosphere-wave-ocean coupling model: Typhoon Vera in 1959, by Kazuhisa Tsuboki (2015), "Typhoon-Ocean Interaction Study Using the Coupled Atmosphere-Ocean Non-hydrostatic Model: With Careful Consideration of Upper Outflow Layer Clouds of Typhoon". Earth Simulator: developed by NASDA, JAERI and JAMSTEC (563.4 Oku Yen, including facilities and building); operation started in March 2002 and it was immediately recognized as the #1 supercomputer; "It will keep #1 for 2 years in peak and 5 years in effective performance" (H. Miyoshi). The Earth Simulator changed its role when the "K computer" was completed. Council for Science and Technology Policy (CSTP) report of Nov 25, 2005: after the completion of the Leading Edge High Performance Supercomputer (i.e. …
  • Evaluation of Cache-Based Superscalar and Cacheless Vector Architectures for Scientific Computations
    Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations. Leonid Oliker, Jonathan Carter, John Shalf, David Skinner, CRD/NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720; Stephane Ethier, Princeton Plasma Physics Laboratory, Princeton University, Princeton, NJ 08453; Rupak Biswas, Jahed Djomehri, and Rob Van der Wijngaart, NAS Division, NASA Ames Research Center, Moffett Field, CA 94035. Abstract: The growing gap between sustained and peak performance for scientific applications has become a well-known problem in high performance computing. The recent development of parallel vector systems offers the potential to bridge this gap for a significant number of computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX6 vector processor and the cache-based IBM Power3/4 superscalar architectures across a number of key scientific computing areas. First, we present the performance of a microbenchmark suite that examines a full spectrum of low-level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks using some simple optimizations. Finally, we evaluate the performance of several numerical codes from key scientific computing domains. Overall results demonstrate that the SX6 achieves high performance on a large fraction of our application suite and in many cases significantly outperforms the RISC-based architectures. However, certain classes of applications are not easily amenable to vectorization and would likely require extensive reengineering of both algorithm and implementation to utilize the SX6 effectively. 1 Introduction: The rapidly increasing peak performance and generality of superscalar cache-based microprocessors long led researchers to believe that vector architectures hold little promise for future large-scale computing systems.
  • Japan's 10 Peta FLOPS Supercomputer Development
    Japan's 10 Peta FLOPS Supercomputer Development Project and its energy saving designs. Ryutaro Himeno: 1) Group Director of R&D, 2) Deputy Program Director, 3) Director of Advance group; NGS R&D Center for Computational Science / Center for Computer & Communication. Contents: 1. Self introduction; 2. Japan's next generation supercomputer R&D project (1. Before the project started, 2. Introduction and current status, 3. Energy issues); 3. Introduction of focused application area. Who am I? [Figure: career timeline with photographs from 1985 to 1995, courtesy of Nissan.] Hardware development at RIKEN, special purpose computers: [Chart: theoretical peak performance of Japanese and US systems from 1996 to 2010, including GRAPE (developed by U. Tokyo in 1989), MDM:MD-GRAPE2 (developed by RIKEN, integrated in RSCC in 2004), MD-GRAPE3 (1 PetaFLOPS, developed in 2004), GRAPE-DR (U. Tokyo/RIKEN), the Earth Simulator, the ASCI/ASC systems (Red, Blue Mountain, Blue Pacific, White, Q, Purple, Red Storm), CP-PACS, PACS-CS (Tsukuba), Tsubame (TITech), BlueGene/L (367 TeraFLOPS), NLCF, RSCC (RIKEN) and the planned HPCS; the 2nd fastest record on MD-GRAPE3 earned the Gordon Bell Prize in 2006, and MD-GRAPE-3 will be integrated in RSCC soon; ASCI: Accelerated Strategic Computing Initiative.] Current RSCC system, installed in 2004 at the RIKEN Advance Center for Computing & Communication: [Diagram: front-end server, 7 POWER4 servers, 2 Itanium2 servers, 20 TB HDD user storage, 200 TB tape storage, 4 (16) tape drives, 2 TB disk cache, Myrinet and InfiniBand interconnects.]
  • Recent Trends in the Marketplace of High Performance Computing
    Recent Trends in the Marketplace of High Performance Computing. Erich Strohmaier, Jack J. Dongarra, Hans W. Meuer, and Horst D. Simon. Keywords: High Performance Computing, HPC Market, Supercomputer Market, HPC technology, Supercomputer technology. In this paper we analyze major recent trends and changes in the High Performance Computing (HPC) market place. The introduction of vector computers started the era of 'Supercomputing'. The initial success of vector computers in the seventies was driven by raw performance. Massively Parallel Systems (MPP) became successful in the early nineties due to their better price/performance ratios, which was enabled by the attack of the 'killer-micros'. The success of microprocessor based SMP concepts, even for the very high-end systems, was the basis for the emerging cluster concepts in the early 2000s. Within the first half of this decade, clusters of PCs and workstations have become the prevalent architecture for many HPC application areas on all ranges of performance. However, the Earth Simulator vector system demonstrated that many scientific applications could benefit greatly from other computer architectures. At the same time there is renewed broad interest in the scientific HPC community for new hardware architectures and new programming paradigms. The IBM BlueGene/L system is one early example of a shifting design focus for large-scale systems. The DARPA HPCS program has the declared goal of building a Petaflops computer system by the end of the decade using novel computer architectures.
  • Notes on the Earth Simulator
    Notes on the Earth Simulator. Jack Dongarra, Computer Science Department, University of Tennessee. FASTEST COMPUTER TODAY: the Japanese Earth Simulator (Earth Simulator Research and Development Center, Japan Atomic Energy Research Institute). [Diagram of application areas: atmospheric and oceanographic science – high resolution global models for predictions of global warming etc., high resolution regional models for predictions of El Niño events and the Asian monsoon etc., high resolution local models for predictions of weather disasters such as typhoons, localized torrential downpour, oil spill, downburst etc.; solid earth science – a global dynamic model to describe the entire solid earth as a system, a regional model to describe crust/mantle activity in the Japanese Archipelago region, simulation of the earthquake generation process, and seismic wave tomography.] Earth Simulator: based on the NEC SX architecture, 640 nodes, each node with 8 vector processors (8 Gflop/s peak per processor), 2 ns cycle time, 16 GB shared memory; a total of 5104 processors, 40 TFlop/s peak, and 10 TB memory. It has a single-stage crossbar (1800 miles of cable, 83,000 copper cables, 16 GB/s cross-section bandwidth), 700 TB of disk space and a 1.6 PB mass store; the area of the computer equals 4 tennis courts on 3 floors. Outline of the Earth Simulator computer: Architecture: a MIMD-type, distributed-memory, parallel system consisting of computing nodes in which vector-type multi-processors are tightly connected by sharing main memory. Total number of processor nodes: 640. Number of PEs for each node: 8. Total number of PEs: 5120. Peak performance of each PE: 8 GFLOPS. Peak performance of each node: 64 GFLOPS. Main memory: 10 TB (total).
  • The TOP500 Project Looking Back Over 16 Years of Supercomputing Experience with Special Emphasis on the Industrial Segment
    The TOP500 Project: Looking Back Over 16 Years of Supercomputing Experience with Special Emphasis on the Industrial Segment. Hans Werner Meuer, [email protected], Prometeus GmbH & Universität Mannheim. Microsoft Industriekunden-Veranstaltung, Frankfurt, October 16, 2008, Windows HPC Server 2008 Launch.
    31st List: The TOP10
    Rank | Manufacturer | Computer | Rmax [TF/s] | Installation Site | Country | Power [MW] | #Cores
    1 | IBM | Roadrunner, BladeCenter QS22/LS21 | 1026 | DOE/NNSA/LANL | USA | 2.35 | 122,400
    2 | IBM | BlueGene/L, eServer Blue Gene Solution | 478.2 | DOE/NNSA/LLNL | USA | 2.33 | 212,992
    3 | IBM | Intrepid, Blue Gene/P Solution | 450.3 | DOE/ANL | USA | 1.26 | 163,840
    4 | Sun | Ranger, SunBlade x6420 | 326 | TACC | USA | 2.00 | 62,976
    5 | Cray | Jaguar, Cray XT4 QuadCore | 205 | DOE/ORNL | USA | 1.58 | 30,976
    6 | IBM | JUGENE, Blue Gene/P Solution | 180 | Forschungszentrum Juelich (FZJ) | Germany | 0.50 | 65,536
    7 | SGI | Encanto, SGI Altix ICE 8200 | 133.2 | New Mexico Computing Applications Center | USA | 0.86 | 14,336
    8 | HP | EKA, Cluster Platform 3000 BL460c | 132.8 | Computational Research Laboratories, TATA SONS | India | 1.60 | 14,384
    9 | IBM | Blue Gene/P Solution | 112.5 | IDRIS | France | 0.32 | 40,960
    10 | SGI | SGI Altix ICE 8200EX | 106.1 | Total Exploration Production | France | 0.44 | 10,240
    Outline: Mannheim Supercomputer Statistics & Top500 Project Start in 1993; Competition between Manufacturers, Countries and Sites; My Supercomputer Favorite in the Top500 Lists; The 31st List as of June 2008; Performance Development and Projection; Bell's Law; Supercomputing, quo vadis?; Top500, quo vadis?
  • Porting the 3D Gyrokinetic Particle-In-Cell Code GTC to the NEC SX-6 Vector Architecture: Perspectives and Challenges
    Computer Physics Communications 164 (2004) 456–458, www.elsevier.com/locate/cpc. Porting the 3D gyrokinetic particle-in-cell code GTC to the NEC SX-6 vector architecture: perspectives and challenges. S. Ethier (a,*), Z. Lin (b). (a) Princeton Plasma Physics Laboratory, Princeton, NJ 08543, USA; (b) Department of Physics and Astronomy, University of California, Irvine, CA 92697, USA. Available online 27 October 2004. Abstract: Several years of optimization on the cache-based super-scalar architecture has made it more difficult to port the current version of the 3D particle-in-cell code GTC to the NEC SX-6 vector architecture. This paper explains the initial work that has been done to port this code to the SX-6 computer and to optimize the most time consuming parts. After a few modifications, single-processor results show a performance increase of a factor of 5.2 compared to the IBM SP Power3 processor, and 2.7 compared to the Power4. © 2004 Elsevier B.V. All rights reserved. PACS: 02.70.Ns; 07.05.Tp. Keywords: Vector processor; Particle-in-cell; Code optimization. 1. Introduction: The impressive performance achieved in 2002 by the Japanese Earth Simulator computer (26.58 Tflops, 64.9% of peak) [1] has revived the interest in vector processors. Having been the flagship of high performance computing for more than two decades, vector computers were gradually replaced, at least in the US, by much cheaper multi-processor super-scalar … (< 10%). When properly vectorized, the same codes can, however, reach over 30 or even 40% of peak performance on a vector processor. Not all codes can achieve such performance. The purpose of this study is to evaluate the work/reward ratio involved in vectorizing our particle-in-cell code on the latest parallel vector machines, such as the NEC SX-6, which is the building block of the very large Earth Simulator system in Japan [2].
  • SX-Aurora TSUBASA Introduction – Vector Supercomputer Technology on a PCIe Card
    SX-Aurora TSUBASA Introduction: Vector Supercomputer Technology on a PCIe Card. What is a Vector Processor? (1/2): A vector processor can operate on large amounts of data at once and is suited for fast processing of large-scale data. General processor: suited for processing data in small units, such as business operations and web servers. Vector processor: suited for processing data in large units at once, such as simulation, AI and big data. [Diagram: scalar calculation versus vector calculation on 256 data elements.] 2 © NEC Corporation 2019. What is a Vector Processor? (2/2): ① many small cores vs. a small number of large cores, ② balance of computation performance and data access performance, ③ software development environment. GPU-like processors: ① many small cores, ② larger size of computation circuits, ③ special language (such as CUDA). Vector processors: ① small number of large cores, ② balanced size of computation circuits and data access circuits, ③ standard languages (C/C++/Fortran). [Diagram: cores, data access paths and memory for both processor types.] 3 © NEC Corporation 2018. Vector Processor – History & Future: ▌ The vector processor has traditionally been used to process big data, much earlier than the term big data was coined. ▌ The very first vector processor based machine, the Cray-1, was built by Seymour Cray in 1976. NEC made its first vector supercomputer, the SX-2, in 1981. The SX-2 was the first ever CPU to exceed 1 Gflops of peak performance. Soon, Fujitsu and Hitachi followed NEC's footsteps in the high-end HPC technology segment. ▌ However, in the 1990s, the computer industry changed drastically with the advent of affordable x86 processors. The eventual dominance of x86 played a key role in the democratization of HPC across academia & industry.