New Hardware Support in OSIRIS 4.0


R. A. Fonseca (1,2)
1 GoLP/IPFN, Instituto Superior Técnico, Lisboa, Portugal
2 DCTI, ISCTE-Instituto Universitário de Lisboa, Portugal
OSIRIS Workshop 2017

Outline
• The road to Exascale systems – HPC system evolution, current trends, OSIRIS support for the top-10 systems
• The new Intel Knights Landing architecture – architecture details, OSIRIS on the KNL
• OSIRIS on modern CPUs – overview of modern CPUs, compiling and running for these
• OSIRIS on GPUs

The road to Exascale systems

(Figure: High Performance Computing power evolution — peak performance in MFlop/s of leading systems from the UNIVAC 1 (1951, internal view shown) through the CDC 6600/7600, Cray-1/X-MP/2, NEC SX-3, Intel ASCI Red, the NEC Earth Simulator, IBM Blue Gene/L and RoadRunner, up to Tianhe-2 and the Sunway TaihuLight, 1940–2020; data from multiple sources.)

Sunway TaihuLight (NRCPC, China) — #1 on the TOP500 list, Jun/16
• Total system: 40 960 compute nodes, 10 649 600 cores, 1.31 PB RAM
• Node configuration: 1× SW26010 manycore processor, 4×(64+1) cores @ 1.45 GHz, 4× 8 GB DDR3
• Performance: Rpeak 125.4 PFlop/s, Rmax 93.0 PFlop/s

The Road to Power Efficient Computing

(Figure: Energy requirement evolution — energy per operation, from kilojoules per operation on the UNIVAC 1 and IBM 704 down to picojoules per operation on Blue Gene/L, Titan, Tianhe-2 and the Sunway TaihuLight, 1940–2020; data from multiple sources.)

Sunway TaihuLight
• Manycore architecture
• Peak performance 93 PFlop/s
• Total power 15.3 MW
• 6.07 Gflop/W, i.e. 165 pJ/flop

Petaflop systems firmly established

The drive towards Exaflop
• Steady progress for over 60 years – 138 systems above 1 PFlop/s
• Supported by many computing-paradigm evolutions
• The trend indicates Exaflop systems by the next decade
• Electric power is one of the limiting factors
  – Target: < 20 MW
  – The top system achieves ~6 Gflop/W, i.e. ~0.2 GW for a 1 Exaflop machine; a factor of ~10× improvement is still required
  – The best energy efficiency achieved so far is ~14 Gflop/W

Multicore systems
• Maintain complex cores and replicate them
• 2 systems in the top 10 are based on multicore CPUs
  – 1× Fujitsu SPARC64: #8 (K computer)
  – 1× Intel Xeon E5: #10 (Trinity)

Manycore systems
• Use many lower-power cores
• 5 systems in the top 10
  – #1 (Sunway TaihuLight)
  – 2× IBM Blue Gene/Q: #5 (Sequoia) and #9 (Mira) — they seem to be the last of their kind
  – 2× Intel Knights Landing: #6 (Cori) and #7 (Oakforest-PACS)

Accelerator / co-processor technology
• 3 systems in the top 10
• #3 (Piz Daint) and #4 (Titan) use NVIDIA GPUs (Piz Daint uses the Tesla P100)
• #2 (Tianhe-2) uses Intel Knights Corner
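As a quick consistency check of the figures above (taking the quoted TaihuLight numbers at face value): 93.0 PFlop/s / 15.3 MW ≈ 6.1 Gflop/W, which is the same as 1 / (6.1 × 10^9 flop/J) ≈ 165 pJ per flop; at that efficiency a 1 Exaflop/s machine would draw 10^18 / (6.1 × 10^9) W ≈ 0.16 GW, hence the ~0.2 GW estimate and the need for roughly another order of magnitude in energy efficiency to meet the < 20 MW target.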
Which top 10 systems are currently supported by OSIRIS?

Rank | System | Cores | Rmax [TFlop/s] | Rpeak [TFlop/s] | Power [kW] | OSIRIS support
1 | Sunway TaihuLight, China | 10649600 | 93014.6 | 125435.9 | 15371 | No
2 | Tianhe-2 (MilkyWay-2), China | 3120000 | 33862.7 | 54902.4 | 17808 | Full (KNC)
3 | Piz Daint, Switzerland | 361760 | 19590.0 | 25326.3 | 2272 | Full (CUDA)
4 | Titan, United States | 560640 | 17590.0 | 27112.5 | 8209 | Full (CUDA)
5 | Sequoia, United States | 1572864 | 17173.2 | 20132.7 | 7890 | Full (QPX)
6 | Cori, United States | 622336 | 14014.7 | 27880.7 | 3939 | Full (KNL)
7 | Oakforest-PACS, Japan | 556104 | 13554.6 | 24913.5 | 2719 | Full (KNL)
8 | K computer, Japan | 705024 | 10510.0 | 11280.4 | 12660 | Standard (Fortran)
9 | Mira, United States | 786432 | 8586.6 | 10066.3 | 3945 | Full (QPX)
10 | Trinity, United States | 301056 | 8100.9 | 11078.9 | 4233 | Full (AVX2)

Intel Knights Landing Architecture

Intel Knights Landing (KNL)
• Processing
  – 68 x86_64 cores @ 1.4 GHz, 4 threads per core
  – Scalar unit + 2× 512-bit vector units per core
  – 32 KB L1-I + 32 KB L1-D cache per core
  – Cores organized in 36 tiles (2 cores per tile, 1 MB shared L2 cache per tile), connected by a 2D mesh
  – Fully Intel Xeon ISA-compatible: binary compatible with other x86 code
  – No L3 cache (the MCDRAM can be configured as an L3-like cache)
• Memory
  – 16 GB of 8-channel 3D MCDRAM: 400+ GB/s
  – Up to 384 GB of 6-channel DDR4: 115.2 GB/s
  – 8×2 memory controllers, up to 350 GB/s
• System interface
  – Standalone host processor
  – Max power consumption 215 W

NERSC Cori (Cray XC40) — #5 on the TOP500 list, Nov/16
• 9152 compute nodes, Aries interconnect
• Total system: 622 336 cores, 0.15 PB (MCDRAM) + 0.88 PB (RAM), 3.9 MW
• Node configuration: 1× Intel Xeon Phi 7250 @ 1.4 GHz (68 cores), 16 GB (MCDRAM) + 96 GB (RAM)
• Performance: Rmax = 14.0 PFlop/s, Rpeak = 27.9 PFlop/s

KNL Memory Architecture

(Figure: KNL mesh interconnect — 36 tiles connected by a 2D mesh, with 8 MCDRAM EDCs, 2 DDR4 memory controllers (iMC) and the PCIe/IIO and misc blocks at the edges of the die.)

MCDRAM access modes
• Cache mode: SW-transparent; the MCDRAM acts as a cache covering the whole DDR range
• Flat mode: the MCDRAM is exposed as regular memory in the same physical address space as the DDR; SW-managed
• Hybrid mode: part of the MCDRAM (25% or 50%) is used as cache, the rest as regular memory

Clustering modes
• All2All: no affinity between tile, directory and memory; the most general mode, but usually lower performance than the other modes
• Quadrant: the chip is divided into 4 virtual quadrants, with affinity between directory and memory; lower latency and better bandwidth than All2All; SW-transparent
• Sub-NUMA clustering: the node looks like a 4-socket Xeon, with affinity between tile, directory and memory; lowest latency, but requires NUMA optimization to get the benefit

• Both the memory access mode and the clustering mode must be configured at boot time
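In flat (or hybrid) mode the MCDRAM is typically exposed as a separate NUMA domain, so bandwidth-critical buffers have to be placed there explicitly — either by binding the whole run with numactl or from the code. Below is a minimal C sketch using the memkind library's hbwmalloc interface (link with -lmemkind); it is not OSIRIS code, and the buffer name and size are purely illustrative.

```c
#include <hbwmalloc.h>   /* memkind's high-bandwidth-memory interface */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = (size_t)1 << 24;   /* illustrative: 16 Mi doubles (~128 MB) */
    int in_mcdram = (hbw_check_available() == 0);   /* 0 => HBW nodes found */

    /* Allocate a bandwidth-critical buffer from MCDRAM if it is available,
     * otherwise fall back to ordinary DDR4. */
    double *buf = in_mcdram ? hbw_malloc(n * sizeof *buf)
                            : malloc(n * sizeof *buf);
    if (!buf) return 1;

    printf("buffer placed in %s\n", in_mcdram ? "MCDRAM" : "DDR4");
    /* ... fill and use the buffer (e.g. field or current arrays) ... */

    if (in_mcdram) hbw_free(buf);
    else           free(buf);
    return 0;
}
```

In cache mode no such step is needed, since the MCDRAM is managed transparently by the hardware.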
Logical view of a KNL node

• The KNL node is simply a 68-core (4 threads per core) shared-memory computer
• Each core has two 512-bit VPUs
• An Intel KNL system can therefore be viewed as a small computer cluster with N SMP nodes and M cores per node
  – Use a distributed-memory algorithm (MPI) for parallelizing across nodes
  – Use a shared-memory algorithm (OpenMP) for parallelizing inside each node
• The exact same code used in "standard" HPC systems can run on the KNL nodes
• The vector units play a key role in system performance, so vectorization for this architecture must be considered

(Diagram: N SMP nodes of M cores each, one OpenMP region over the shared memory of each node — logically identical to an Intel KNL node.)

Parallelization for a KNL node

• HPC systems present a hierarchy of distributed/shared memory parallelism
  – The top level is a network of computing nodes; nodes exchange information through network messages (MPI)
  – Each computing node is a set of CPUs/cores; memory can be shared inside the node (OpenMP)
• Parallelization of the PIC algorithm
  – Spatial decomposition across nodes: each node handles a specific region of the simulation space
  – Inside each node, the particles are split over the cores (threads share the local particle buffer)
• This is the same approach as for standard CPU systems; a hybrid MPI + OpenMP skeleton is sketched below

(Diagram: the simulation volume is split into parallel domains, one per node; inside an SMP node, the particle buffer is split among threads 1, 2, 3, ...)
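A minimal sketch of this two-level scheme in C (OSIRIS itself is a Fortran code, so this only illustrates the structure; the particle count and the per-particle work are placeholders). Compile with, e.g., mpicc -qopenmp.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* MPI_THREAD_FUNNELED: only the master thread of each rank calls MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each rank owns one spatial sub-domain of the simulation;
     * its particles are shared among the OpenMP threads of the rank. */
    long np_local = 1000000;       /* illustrative local particle count */
    double work = 0.0;

    #pragma omp parallel for reduction(+:work)
    for (long i = 0; i < np_local; ++i) {
        /* ... interpolate fields, push particle i, deposit its current ... */
        work += 1.0;               /* stand-in for the per-particle work */
    }

    if (rank == 0)
        printf("%d MPI ranks x %d OpenMP threads, %.0f particles pushed on rank 0\n",
               nranks, omp_get_max_threads(), work);

    /* Guard-cell / boundary exchanges between neighbouring sub-domains
     * would go here, using MPI point-to-point messages. */

    MPI_Finalize();
    return 0;
}
```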
Vectorization of the PIC algorithm for KNL

• The code spends ~80–90% of its time advancing particles and depositing current, so the vectorization effort is focused there
• The strategy previously developed for other SIMD architectures works well: process nV = 16 particles at a time (512-bit vectors, single precision) through vector operations for most of the algorithm
• The particle push is vectorized by particle: load nV particles into the vector unit, interpolate the fields and push the particles, with each vector holding the same quantity for all nV particles
• The current deposition is vectorized by current line to avoid serialization — critical because of the large vector width
  – Calculate the currents, split the trajectories and create virtual particles, then load nV virtual particles into the vector unit
  – Use a vector transpose operation so that each vector holds one line of current for one particle
  – Loop over the vectors, accumulate the current and store the results
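To make the "vectorize by particle" idea concrete, here is a minimal C sketch of a position-update kernel; it is not the OSIRIS implementation, and the array names (x, ux, rgamma) are illustrative. With an OpenMP-SIMD-aware compiler targeting the KNL (e.g. -qopenmp-simd -xMIC-AVX512), the loop compiles to 512-bit operations, i.e. 16 single-precision particles per instruction. The current-deposition kernel is organized differently — per-particle data are transposed so that each vector holds one line of current — and is not shown here.

```c
#include <stddef.h>

/* Advance particle positions: x += ux * rgamma * dt.
 * Arrays are assumed to be in structure-of-arrays layout so that
 * consecutive particles map onto consecutive vector lanes. */
void advance_positions(size_t np, float *restrict x,
                       const float *restrict ux,
                       const float *restrict rgamma,
                       float dt)
{
    #pragma omp simd
    for (size_t i = 0; i < np; ++i)
        x[i] += ux[i] * rgamma[i] * dt;   /* 16 particles per AVX-512 op */
}
```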