Exploring Performance of Xeon Phi Co-Processor


Mateusz Iwo Dubaniowski
August 21, 2015
MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2015

Abstract

This project explores the performance of the Intel Xeon Phi co-processor. We use various parallelisation and vectorisation methods to port an LU decomposition library to the co-processor. The popularity of accelerators and co-processors is growing due to their good energy efficiency and their large potential for further performance improvements. These two factors make co-processors well suited to driving innovation in high performance computing towards its next goal, Exascale-level computing. In response to increasing demand, Intel has delivered a co-processor architecture designed to fit the requirements of the HPC community, the Intel MIC architecture, of which the most prominent example is the Intel Xeon Phi. The co-processor follows the many-core principle: it provides a large number of slower cores supplemented with vector processing units, thus forcing a high level of parallelisation upon its users.

LU factorisation is a matrix operation used in many fields to solve systems of linear equations, invert matrices, and calculate matrix determinants. In this project we port to the Intel Xeon Phi co-processor an LU factorisation algorithm that uses the Gaussian elimination method to perform the decomposition. We use various parallelisation techniques including Intel LEO, OpenMP 4.0 pragmas, Intel’s Cilk array notation, and the ivdep pragma. Furthermore, we examine the effect of data transfer to the co-processor on the overall execution time.

The results show that the best performance on Xeon Phi is achieved by using Intel Cilk array notation to vectorise and OpenMP 4.0 to parallelise the code. Averaged across sparse and dense benchmark matrices, Intel Cilk array notation gives a speed-up of 27 times over the single-threaded performance of the host processor. The peak speed-up achieved with this method across the attempted benchmarks is 49 times the performance of a single thread of the host processor.

Contents

Chapter 1 Introduction
    1.1 Obstacles and diversions from the original plan
    1.2 Structure of the dissertation
Chapter 2 Co-processors and accelerators in HPC
    2.1 Importance of energy efficiency in HPC
    2.2 Co-processors and the move to Exascale
    2.3 Intel Xeon Phi and other accelerators
    2.4 Related work
Chapter 3 Intel MIC architecture
    3.1 Architecture of Intel MICs
    3.2 Xeon Phi in EPCC and Hartree
    3.3 Xeon Phi programming tools
    3.4 Intel Xeon – host node
    3.5 Knights Landing – the future of Xeon Phi
Chapter 4 LU factorization – current implementation
    4.1 What is LU factorization?
    4.2 Applications of LU factorization
    4.3 Initial algorithm
    4.4 Matrix data structure
Chapter 5 Optimisation and parallelisation methods
    5.1 Intel “ivdep” pragma and compiler auto-vectorisation
    5.2 OpenMP 4.0
    5.3 Intel Cilk array notation
    5.4 Offload models
Chapter 6 Implementation of the solution
    6.1 Initial profiling
    6.2 Parallelising the code
        6.2.1 Hotspots analysis
    6.3 Offloading the code
    6.4 Hinting vectorisation with ivdep
        6.4.1 Hotspots for further vectorisation
    6.5 Ensuring vectorisation with Intel Cilk and OpenMP simd
        6.5.1 Intel Cilk array notation
        6.5.2 OpenMP simd pragma
Chapter 7 Benchmarking the solution
    7.1 Matrix format
    7.2 University of Florida sparse matrix collection
    7.3 Dense benchmarks
    7.4 Summary of benchmarks’ characteristics
Chapter 8 Analysis of performance of Xeon Phi
    8.1 Collection of results
    8.2 Validation of the results
    8.3 Overview of results
    8.4 Speed-up with different optimisation options
    8.5 Native speed-up on Intel Xeon and on Intel Xeon Phi
    8.6 Offloading overhead
    8.7 Speed-up on the host with different optimisation options
    8.8 Running NICSLU on the host
Chapter 9 Summary and conclusions
    9.1 Future work
Bibliography

List of Tables

Table 2-1: Overview of available co-processors and accelerators by vendor
Table 3-1: Overview of versions of Intel Xeon Phi available
Table 5-1: Outline of scheduling options available in OpenMP 4.0 [31]
Table 5-2: Intel Cilk array notation example
Table 6-1: gprof profile of running the LU factorization algorithm with ranmat4500 input on 4 host threads
Table 6-2: Intel Cilk array notation use in lup_od_omp function
Table 6-3: OpenMP simd pragma usage in lup_od_omp
Table 7-1: Characteristics of benchmark matrices
Table 8-1: Execution times (in seconds) of running benchmarks offloaded to Xeon Phi with different parallelisation methods
Table 8-2: Speed-up values summary against single-threaded host execution time
Table 8-3: Code snippets explaining performance difference between simd pragma and Intel Cilk array notation
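
The abstract describes LU factorisation by Gaussian elimination, parallelised with OpenMP and given a vectorisation hint on the inner loop. As a minimal sketch only, and not the dissertation's lup_od_omp routine (whose source is not reproduced on this page), the C function below shows the general shape such a kernel might take. The name lu_in_place, the dense row-major layout, and the absence of pivoting are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration: in-place LU factorisation (Doolittle form,
 * no pivoting) of a dense n x n matrix stored in row-major order.
 * On success the strict lower triangle holds the multipliers of L
 * (its unit diagonal is implied) and the upper triangle holds U.
 * Returns 0 on success, -1 if an exactly zero pivot is met. */
int lu_in_place(double *a, size_t n)
{
    assert(a != NULL && n > 0);

    for (size_t k = 0; k < n; ++k) {
        const double pivot = a[k * n + k];
        if (pivot == 0.0)
            return -1;  /* a real solver would pivot rows here */

        /* For a fixed k, each row below the pivot row is updated
         * independently, so the rows can be shared among threads. */
        #pragma omp parallel for
        for (size_t i = k + 1; i < n; ++i) {
            const double factor = a[i * n + k] / pivot;
            a[i * n + k] = factor;  /* store the multiplier (entry of L) */

            /* The row update has no loop-carried dependence,
             * so it is a candidate for vectorisation. */
            #pragma omp simd
            for (size_t j = k + 1; j < n; ++j)
                a[i * n + j] -= factor * a[k * n + j];
        }
    }
    return 0;
}
```

For example, factorising the 3 x 3 matrix with rows (2, 1, 1), (4, 3, 3), (8, 7, 9) leaves rows (2, 1, 1), (2, 1, 1), (4, 3, 2) in the array: the multipliers sit below the diagonal and U sits on and above it. When the code is built without OpenMP support, the pragmas are ignored and the routine runs serially.
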
Recommended publications
  • 2020 ALCF Science Report
    Argonne Leadership Computing Facility, 2020 Science. On the cover: A snapshot of a visualization of the SARS-CoV-2 viral envelope comprising 305 million atoms. A multi-institutional research team used multiple supercomputing resources, including the ALCF’s Theta system, to optimize codes in preparation for large-scale simulations of the SARS-CoV-2 spike protein that were recognized with the ACM Gordon Bell Special Prize for HPC-Based COVID-19 Research. Image: Rommie Amaro, Lorenzo Casalino, Abigail Dommer, and Zied Gaieb, University of California San Diego.
    Contents: Message from ALCF Leadership; Argonne Leadership Computing Facility (About ALCF; ALCF Team; ALCF Computing Resources); Advancing Science with HPC (ALCF Resources Contribute to Fight Against COVID-19; Edge Services Propel Data-Driven Science; Preparing for Science in the Exascale Era); Science (Accessing ALCF Resources for Science; 2020 Science Highlights).
    2020 Science Highlights: Biological Sciences (Scalable Reinforcement-Learning-Based Neural Architecture Search); GPCNeT: Designing a Benchmark Suite for Inducing and Measuring Contention in HPC Networks (Sudheer Chunduri); Parallel Relational Algebra for Logical Inferencing at Scale (Sidharth Kumar); Materials Science (Constructing and Navigating Polymorphic Landscapes of Molecular Crystals, Alexandre Tkatchenko; Data-Driven Materials Discovery for Optoelectronic Applications); Physics (Hadronic Light-by-Light Scattering and Vacuum Polarization Contributions to the Muon Anomalous Magnetic Moment, Thomas Blum; The Last Journey)
  • Introduction to Intel Xeon Phi Programming Models
    Introduction to Intel Xeon Phi programming models. F. Affinito, F. Salvadore, SCAI - CINECA.
    Part I: Introduction to the Intel Xeon Phi architecture.
    Trends: transistors; clock rates; cores and threads. Trends, summarizing: the number of transistors increases; the power consumption must not increase; the density cannot increase on a single chip. Solution: increase the number of cores.
    GP-GPU and Intel Xeon Phi: coupled to the CPU to accelerate highly parallel kernels, facing Amdahl's Law.
    What is Intel Xeon Phi? 7100 / 5100 / 3100 Series available. 5110P: Intel Xeon Phi clock 1053 MHz; 60 in-order cores; ~1 TFlops DP peak performance (2 TFlops SP); 4 hardware threads per core; 8 GB GDDR5 memory; 512-bit SIMD vectors (32 registers); fully-coherent L1 and L2 caches; PCIe bus (rev. 2.0); max memory bandwidth (theoretical) 320 GB/s; max TDP 225 W.
    MIC vs GPU naïve comparison (the comparison is naïve):
        System                   K20s                 5110P
        # cores                  2496                 60 (*4)
        Memory size              5 GB                 8 GB
        Peak performance (SP)    3.52 TFlops          2 TFlops
        Peak performance (DP)    1.17 TFlops          1 TFlops
        Clock rate               0.706 GHz            1.053 GHz
        Memory bandwidth         208 GB/s (ECC off)   320 GB/s
    Terminology: MIC = Many Integrated Core is the name of the architecture. Xeon Phi = commercial name of the Intel product based on the MIC architecture. Knights Corner, Knights Landing and Knights Ferry are development names of MIC architectures. We will often refer to the CPU as HOST and Xeon Phi as DEVICE.
    Is it an accelerator? YES: it can be used to “accelerate” hot-spots of the code that are highly parallel and computationally intensive. In this sense, it works alongside the CPU. It can be used as an accelerator using the “offload” programming model. An important bottleneck is the communication between host and device (through PCIe). In this respect, it is very similar to a GPU.
    Is it an accelerator? / 2 NOT ONLY: the Intel Xeon Phi can behave as a many-core x86 node.
  • Accelerators for HP Proliant Servers Enable Scalable and Efficient High-Performance Computing
    Family data sheet. Accelerators for HP ProLiant servers: Enable scalable and efficient high-performance computing. November 2014.
    HP high-performance computing has made it possible to accelerate innovation at any scale. But traditional CPU technology is no longer capable of sufficiently scaling performance to address the skyrocketing demand for compute resources. HP high-performance computing solutions are built on HP ProLiant servers using industry-leading accelerators to dramatically increase performance with lower power requirements.
    Innovation is the foundation for success. Accelerators are revolutionizing high performance computing. High-performance computing (HPC) is being used to address many of modern society’s biggest challenges, such as designing new vaccines and genetically engineering drugs to fight diseases, finding and extracting precious oil and gas resources, improving financial instruments, and designing more fuel efficient engines. This rapid pace of innovation has created an insatiable demand for compute power. At the same time, multiple strict requirements are placed on system performance, power consumption, size, response, reliability, portability, and design time. Modern HPC systems are rapidly evolving, already reaching petaflop and targeting exaflop performance. All of these challenges lead to a common set of requirements: a need for more computing
    What is hybrid computing? A hybrid computing model is one where accelerators (known as GPUs or coprocessors) work together with CPUs to perform computing tasks. As parallel processors, accelerators can split computations into hundreds or thousands of pieces and calculate them simultaneously. Offloading the most compute-intensive portions of applications to accelerators dramatically increases both application performance and computational efficiency.
  • Quick-Reference Guide to Optimization with Intel® Compilers
    Quick Reference Guide to Optimization with Intel® C++ and Fortran Compilers v19.1. For IA-32 processors, Intel® 64 processors, Intel® Xeon Phi™ processors and compatible non-Intel processors.
    Contents: Application Performance; General Optimization Options and Reports **; Parallel Performance **; Recommended Processor-Specific Optimization Options **; Optimizing for the Intel® Xeon Phi™ x200 product family; Interprocedural Optimization (IPO) and Profile-Guided Optimization (PGO) Options; Fine-Tuning (All Processors) **; Floating-Point Arithmetic Options; Processor Code Name With Instruction Set Extension Name Synonym; Frequently Used Processor Names in Compiler Options
  • Broadwell Skylake Next Gen* NEW Intel NEW Intel NEW Intel Microarchitecture Microarchitecture Microarchitecture
    15 years of availability. IOTG is extending the product availability for IOTG roadmap products from a minimum of 7 years to a minimum of 15 years when both processor and chipset are on 22nm and newer process technologies.
    - Xeon Scalable (w/ chipsets)
    - E3-12xx/15xx v5 and later (w/ chipsets)
    - 6th gen Core and later (w/ chipsets)
    - Bay Trail (E3800) and later products (Braswell, N3xxx)
    - Atom C2xxx (Rangeley) and later
    - Does not include Xeon-D (7 years) and E5-26xx v4 (7 years)
    IOTG Product Availability Life-Cycle. 15 year product availability will start with the following products:
    • Intel® Xeon® Processor Scalable Family codenamed Skylake-SP and later with associated chipsets
    • Intel® Xeon® E3-12xx/15xx v5 series (Skylake) and later with associated chipsets
    • 6th Gen Intel® Core™ processor family (Skylake) and later (includes Intel® Pentium® and Celeron® processors) with associated chipsets
    • Intel Pentium processor N3700 (Braswell) and later and Intel Celeron processors N3xxx (Braswell) and J1900/N2xxx family (Bay Trail) and later
    • Intel® Atom® processor C2xxx (Rangeley) and E3800 family (Bay Trail) and later
    Product Discontinuance Notification (PDN)†: PDNs will typically be issued no later than 13.5 years after component family introduction date. PDNs are published at https://qdms.intel.com/
    [Figure: timelines of 7 year and 15 year product availability, each marking the introduction of the component family and the Last Time Order / Last Time Ship periods, on an axis running from L-1 to L+15 years.]
    † Intel may support this extended manufacturing using reasonably feasible means deemed by Intel to be appropriate. Component family introduction dates are
  • Biology at the Exascale
    Biology at the Exascale Advances in computational hardware and algorithms that have transformed areas of physics and engineering have recently brought similar benefits to biology and biomedical research. Contributors: Laura Wolf and Dr. Gail W. Pieper, Argonne National Laboratory Biological sciences are undergoing a revolution. High‐performance computing has accelerated the transition from hypothesis‐driven to design‐driven research at all scales, and computational simulation of biological systems is now driving the direction of biological experimentation and the generation of insights. As recently as ten years ago, success in predicting how proteins assume their intricate three‐dimensional forms was considered highly unlikely if there was no related protein of known structure. For those proteins whose sequence resembles a protein of known structure, the three‐dimensional structure of the known protein can be used as a “template” to deduce the unknown protein structure. At the time, about 60 percent of protein sequences arising from the genome sequencing projects had no homologs of known structure. In 2001, Rosetta, a computational technique developed by Dr. David Baker and colleagues at the Howard Hughes Medical Institute, successfully predicted the three‐dimensional structure of a folded protein from its linear sequence of amino acids. (Baker now develops tools to enable researchers to test new protein scaffolds, examine additional structural hypothesis regarding determinants of binding, and ultimately design proteins that tightly bind endogenous cellular proteins.) Two years later, a thirteen‐year project to sequence the human genome was declared a success, making available to scientists worldwide the billions of letters of DNA to conduct postgenomic research, including annotating the human genome.
  • The Intel X86 Microarchitectures Map Version 2.0
    The Intel x86 Microarchitectures Map Version 2.0 P6 (1995, 0.50 to 0.35 μm) 8086 (1978, 3 µm) 80386 (1985, 1.5 to 1 µm) P5 (1993, 0.80 to 0.35 μm) NetBurst (2000 , 180 to 130 nm) Skylake (2015, 14 nm) Alternative Names: i686 Series: Alternative Names: iAPX 386, 386, i386 Alternative Names: Pentium, 80586, 586, i586 Alternative Names: Pentium 4, Pentium IV, P4 Alternative Names: SKL (Desktop and Mobile), SKX (Server) Series: Pentium Pro (used in desktops and servers) • 16-bit data bus: 8086 (iAPX Series: Series: Series: Series: • Variant: Klamath (1997, 0.35 μm) 86) • Desktop/Server: i386DX Desktop/Server: P5, P54C • Desktop: Willamette (180 nm) • Desktop: Desktop 6th Generation Core i5 (Skylake-S and Skylake-H) • Alternative Names: Pentium II, PII • 8-bit data bus: 8088 (iAPX • Desktop lower-performance: i386SX Desktop/Server higher-performance: P54CQS, P54CS • Desktop higher-performance: Northwood Pentium 4 (130 nm), Northwood B Pentium 4 HT (130 nm), • Desktop higher-performance: Desktop 6th Generation Core i7 (Skylake-S and Skylake-H), Desktop 7th Generation Core i7 X (Skylake-X), • Series: Klamath (used in desktops) 88) • Mobile: i386SL, 80376, i386EX, Mobile: P54C, P54LM Northwood C Pentium 4 HT (130 nm), Gallatin (Pentium 4 Extreme Edition 130 nm) Desktop 7th Generation Core i9 X (Skylake-X), Desktop 9th Generation Core i7 X (Skylake-X), Desktop 9th Generation Core i9 X (Skylake-X) • Variant: Deschutes (1998, 0.25 to 0.18 μm) i386CXSA, i386SXSA, i386CXSB Compatibility: Pentium OverDrive • Desktop lower-performance: Willamette-128
  • ECP Software Technology Capability Assessment Report
    ECP-RPT-ST-0001-2018 ECP Software Technology Capability Assessment Report Michael A. Heroux, Director ECP ST Jonathan Carter, Deputy Director ECP ST Rajeev Thakur, Programming Models & Runtimes Lead Jeffrey Vetter, Development Tools Lead Lois Curfman McInnes, Mathematical Libraries Lead James Ahrens, Data & Visualization Lead J. Robert Neely, Software Ecosystem & Delivery Lead July 1, 2018 DOCUMENT AVAILABILITY Reports produced after January 1, 1996, are generally available free via US Department of Energy (DOE) SciTech Connect. Website http://www.osti.gov/scitech/ Reports produced before January 1, 1996, may be purchased by members of the public from the following source: National Technical Information Service 5285 Port Royal Road Springfield, VA 22161 Telephone 703-605-6000 (1-800-553-6847) TDD 703-487-4639 Fax 703-605-6900 E-mail [email protected] Website http://www.ntis.gov/help/ordermethods.aspx Reports are available to DOE employees, DOE contractors, Energy Technology Data Exchange representatives, and International Nuclear Information System representatives from the following source: Office of Scientific and Technical Information PO Box 62 Oak Ridge, TN 37831 Telephone 865-576-8401 Fax 865-576-5728 E-mail [email protected] Website http://www.osti.gov/contact.html This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights.
  • FY18Q2 Marketwide Shopping Event
    $130 BRIGHTON AREA SCHOOL DISTRICT LAPTOP PRESIDENTS DAY EARLY ACCESS STARTS 2/12 DOORBUSTERS ALL WEEK AT 11AM ET 11.6" A DOORBUSTER 11AM ET, 2/16 A: Inspiron 11 3000 Market value* $199.99 | Save $70 $129.99 Intel® Celeron® Processor, Windows 10 Home, 2GB memory* & 32GB storage, 11.6" HD display DOORBUSTER 11AM ET, 2/13 23.8" B B: Dell 24 Monitor - S2418NX Market value* $259.99 | Save $100 $159.99 23.8" widescreen InfinityEdge monitor, C Full HD 1920 x 1080, Dell HDR DOORBUSTER 11AM ET, 2/14 C: Inspiron Small Desktop Market value* $319.99 | Save $100 $219.99 Intel® Pentium® processor, Windows 10 Home, 4GB memory* & 1TB* hard drive Limited quantities available at these prices SAVE BIG ON INCREDIBLE COLOR, SOUND AND STREAMING WITH DELL CINEMA PCs. www.Dell.com/BrightonASD US3936579 *See reverse for important details. Netflix streaming membership required. Screen images are simulated. EARLY ACCESS TO DOORBUSTERS PRESIDENTS DAY SALE STARTS EARLY FOR MEMBERS SAVE UP TO 40% A 15.6" B 27" C 15.6" DOORBUSTER 11AM ET, 2/12 DOORBUSTER 11AM ET, 2/15 D A: Inspiron 15 3000 B: New Inspiron Gaming Desktop Market value* $299.99 | Save $70 Market value* $799.99 | Save $100 $229.99 $699.99 Intel® Celeron® Processor, Windows 10 Home, 8th Gen Intel® Core™ i5 processor, Windows 10 Home, 4GB memory*, 500GB* hard drive, 15.6" HD 8GB memory*, 1TB* hard drive, 3GB* NVIDIA® display GeForce® GTX 1060 graphics card* DOORBUSTER 11AM ET, 2/14 DOORBUSTER 11AM ET, 2/13 C: Dell UltraSharp 27 D: Inspiron 15 5000 2-in-1 4K Monitor - U2718Q Market value* $499.99 | Save $100 Market value* $739.99 | Save $260 $399.99 $479.99 7th Gen Intel® Core™ i3 processor, 27" 4K InfinityEdge monitor, 3840 x 2160 Windows 10 Home, 4GB memory* & 1TB* resolution, Dell HDR hard drive, 15.6" Full HD Touch display *Limited quantities available at these prices.
  • Future Computing Platforms for Science in a Power Constrained Era
    Future Computing Platforms for Science in a Power Constrained Era. David Abdurachmanov (1), Peter Elmer (2), Giulio Eulisse (1), Robert Knight (3). (1) Fermilab, Batavia, IL 60510, USA; (2) Department of Physics, Princeton University, Princeton, NJ 08540, USA; (3) Research Computing, Office of Information Technology, Princeton University, Princeton, NJ, 08540, USA. E-mail: [email protected] arXiv:1510.03676v1 [cs.DC] 28 Jul 2015.
    Abstract. Power consumption will be a key constraint on the future growth of Distributed High Throughput Computing (DHTC) as used by High Energy Physics (HEP). This makes performance-per-watt a crucial metric for selecting cost-efficient computing solutions. For this paper, we have done a wide survey of current and emerging architectures becoming available on the market including x86-64 variants, ARMv7 32-bit, ARMv8 64-bit, Many-Core and GPU solutions, as well as newer System-on-Chip (SoC) solutions. We compare performance and energy efficiency using an evolving set of standardized HEP-related benchmarks and power measurement techniques we have been developing. We evaluate the potential for use of such computing solutions in the context of DHTC systems, such as the Worldwide LHC Computing Grid (WLCG).
    1. Introduction and Motivation. The data produced by the four experiments at the Large Hadron Collider (LHC) [1] or similar High Energy Physics (HEP) experiments requires a significant amount of human and computing resources which cannot be provided by a single research institute or even country. For this reason the various parties involved created the Worldwide LHC Computing Grid (WLCG) in order to tackle the data processing challenges posed by such a large amount of data. The WLCG consists of a highly federated union of computing centers spread across 40 countries and it represents an admirable example of international organization.
  • Introduction to Intel Xeon Phi (“Knights Landing”) on Cori
    Introduction to Intel Xeon Phi (“Knights Landing”) on Cori. Brandon Cook, Brian Friesen. 2017 June 9.
    Knights Landing is here!
    • KNL nodes installed in Cori in 2016
    • “Pre-production” for ~ 1 year
      – No charge for using KNL nodes
      – Limited access (until now!)
      – Intermittent downtime
      – Frequent software changes
    • KNL nodes on Cori will soon enter production
      – Charging begins 2017 July 1
    Knights Landing overview. Knights Landing: Next Intel® Xeon Phi™ Processor. Intel® Many-Core Processor targeted for HPC and Supercomputing. First self-boot Intel® Xeon Phi™ processor that is binary compatible with main line IA. Boots standard OS. Significant improvement in scalar and vector performance. Integration of Memory on package: innovative memory architecture for high bandwidth and high capacity. Integration of Fabric on package. Three products: KNL Self-Boot (Baseline); KNL Self-Boot w/ Fabric (Fabric Integrated); KNL Card (PCIe-Card). Potential future options subject to change without notice. All timeframes, features, products and dates are preliminary forecasts and subject to change without further notification.
    Knights Landing overview.
    [Figure: Knights Landing block diagram showing 36 tiles connected by a 2D mesh interconnect, each tile with 2 cores, 2 VPUs per core, a CHA and 1 MB of L2, surrounded by MCDRAM with EDC controllers, DDR4 memory controllers and channels, and PCIe Gen 3 and DMI links.]
    Chip: 36 Tiles interconnected by 2D Mesh
    Tile: 2 Cores + 2 VPU/core + 1 MB L2
    Memory: MCDRAM: 16 GB on-package; High BW. DDR4: 6 channels @ 2400 up to 384GB
    IO: 36 lanes PCIe Gen3. 4 lanes of DMI for chipset
    Node: 1-Socket only
    Fabric: Omni-Path on-package (not shown)
    Vector Peak Perf: 3+TF DP and 6+TF SP Flops
    Scalar Perf: ~3x over Knights Corner
    Streams Triad (GB/s): MCDRAM: 400+; DDR: 90+
    Source Intel: All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
  • Experimental and Analytical Study of Xeon Phi Reliability
    Experimental and Analytical Study of Xeon Phi Reliability. Daniel Oliveira (Institute of Informatics, UFRGS, Porto Alegre, RS, Brazil); Laércio Pilla (Department of Informatics and Statistics, UFSC, Florianópolis, SC, Brazil); Nathan DeBardeleben (Los Alamos National Laboratory, Los Alamos, NM, US); Sean Blanchard (Los Alamos National Laboratory, Los Alamos, NM, US); Heather Quinn (Los Alamos National Laboratory, Los Alamos, NM, US); Israel Koren (University of Massachusetts, UMass, Amherst, MA, US); Philippe Navaux (Institute of Informatics, UFRGS, Porto Alegre, RS, Brazil); Paolo Rech (Institute of Informatics, UFRGS, Porto Alegre, RS, Brazil).
    ABSTRACT. We present an in-depth analysis of transient faults effects on HPC applications in Intel Xeon Phi processors based on radiation experiments and high-level fault injection. Besides measuring the realistic error rates of Xeon Phi, we quantify Silent Data Corruption (SDCs) by correlating the distribution of corrupted elements in the output to the application’s characteristics. We evaluate the benefits of imprecise computing for reducing the programs’ error rate. For example, for HotSpot a 0.5% tolerance in the output value reduces the error rate by 85%.
    1 INTRODUCTION. Accelerators are extensively used to expedite calculations in large HPC centers. Tianhe-2, Cori, Trinity, and Oakforest-PACS use Intel Xeon Phi and many other top supercomputers use other forms of accelerators [17]. The main reasons to use accelerators are their high computational capacity, low cost, reduced per-task energy consumption, and flexible development platforms. Unfortunately, accelerators are also extremely likely to experience transient errors as they are built with cutting-edge technology, have very high