With Extreme Scale Computing the Rules Have Changed

Jack Dongarra
University of Tennessee, Oak Ridge National Laboratory, University of Manchester
11/17/15

• Overview of High Performance Computing
• With Extreme Computing the "rules" for computing have changed

• Create systems that can apply exaflops of computing power to exabytes of data.
• Keep the United States at the forefront of HPC capabilities.
• Improve HPC application developer productivity.
• Make HPC readily available.
• Establish hardware technology for future HPC systems.

[Figure: Performance development of the Top500, 1994-2015, with a projection to 2020. Three trend lines: Sum (362 PFlop/s on the June 2015 list, 420 PFlop/s on the November 2015 list), N=1 (33.9 PFlop/s), and N=500 (166 and 206 TFlop/s). Starting points in 1994: Sum 1.17 TFlop/s, N=1 59.7 GFlop/s, N=500 400 MFlop/s. Annotations for scale: "My Laptop" at about 70 Gflop/s and "My iPhone" at about 4 Gflop/s.]

• Pflops (> 10^15 Flop/s) computing is fully established, with 81 systems.
• Three technology architecture possibilities or "swim lanes" are thriving:
  - Commodity (e.g. Intel)
  - Commodity + accelerator (e.g. GPUs) (104 systems)
  - Special purpose lightweight cores (e.g. IBM BG, ARM, Knights Landing)
• Interest in supercomputing is now worldwide and growing in many new markets (over 50% of Top500 computers are in industry).
• Exascale (10^18 Flop/s) projects exist in many countries and regions.
• Intel processors have the largest share at 89%, followed by AMD at 4%.
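
To make the scale of the Top500 chart concrete, here is a minimal Python sketch (not part of the original slides) that lists the data points annotated on the figure in a common unit and compares each against the roughly 70 Gflop/s laptop; all values are taken from the chart described above.

```python
# Scale comparison for the Top500 performance-development chart above (values in flop/s).
data_points = {
    "1994 N=500":      400e6,     # 400 Mflop/s
    "2015 iPhone":     4e9,       # ~4 Gflop/s
    "1994 N=1":        59.7e9,    # 59.7 Gflop/s
    "2015 laptop":     70e9,      # ~70 Gflop/s
    "1994 Sum":        1.17e12,   # 1.17 Tflop/s
    "2015 N=500":      206e12,    # 206 Tflop/s
    "2015 N=1":        33.9e15,   # 33.9 Pflop/s (Tianhe-2)
    "2015 Sum":        420e15,    # 420 Pflop/s
    "Exascale target": 1e18,      # 1 Eflop/s
}

laptop = data_points["2015 laptop"]
for name, flops in sorted(data_points.items(), key=lambda kv: kv[1]):
    print(f"{name:16s} {flops:10.3e} flop/s  ({flops / laptop:14,.1f} laptops)")
```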

Top 10 systems of the June 2015 Top500 list:

Rank | Site | Computer | Country | Cores | Rmax [Pflop/s] | % of Peak | Power [MW] | MFlops/Watt
1 | National Super Computer Center in Guangzhou | Tianhe-2, NUDT, Xeon 12C 2.2 GHz + Intel Xeon Phi (57c) + Custom | China | 3,120,000 | 33.9 | 62 | 17.8 | 1905
2 | DOE / OS, Oak Ridge Nat Lab | Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14c) + Custom | USA | 560,640 | 17.6 | 65 | 8.3 | 2120
3 | DOE / NNSA, L Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + custom | USA | 1,572,864 | 17.2 | 85 | 7.9 | 2063
4 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8c) + Custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 827
5 | DOE / OS, Argonne Nat Lab | Mira, BlueGene/Q (16c) + Custom | USA | 786,432 | 8.16 | 85 | 3.95 | 2066
6 | Swiss CSCS | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom | Swiss | 115,984 | 6.27 | 81 | 2.3 | 2726
7 | KAUST | Shaheen II, Cray XC30, Xeon 16C + Custom | Saudi Arabia | 196,608 | 5.54 | 77 | 4.5 | 1146
8 | Texas Advanced Computing Center | Stampede, Dell Intel (8c) + Intel Xeon Phi (61c) + IB | USA | 204,900 | 5.17 | 61 | 4.5 | 1489
9 | Forschungszentrum Juelich (FZJ) | JuQUEEN, BlueGene/Q, Power BQC 16C 1.6 GHz + Custom | Germany | 458,752 | 5.01 | 85 | 2.30 | 2178
10 | DOE / NNSA, L Livermore Nat Lab | Vulcan, BlueGene/Q, Power BQC 16C 1.6 GHz + Custom | USA | 393,216 | 4.29 | 85 | 1.97 | 2177
500 (422) | Software Comp | HP Cluster | USA | 18,896 | 0.309 | 48 | |

Top 10 systems of the November 2015 Top500 list:

Rank | Site | Computer | Country | Cores | Rmax [Pflop/s] | % of Peak | Power [MW] | MFlops/Watt
1 | National Super Computer Center in Guangzhou | Tianhe-2, NUDT, Xeon 12C + Intel Xeon Phi (57c) + Custom | China | 3,120,000 | 33.9 | 62 | 17.8 | 1905
2 | DOE / OS, Oak Ridge Nat Lab | Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14c) + Custom | USA | 560,640 | 17.6 | 65 | 8.3 | 2120
3 | DOE / NNSA, L Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + custom | USA | 1,572,864 | 17.2 | 85 | 7.9 | 2063
4 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8c) + Custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 827
5 | DOE / OS, Argonne Nat Lab | Mira, BlueGene/Q (16c) + Custom | USA | 786,432 | 8.16 | 85 | 3.95 | 2066
6 | DOE / NNSA / Los Alamos & Sandia | Trinity, Cray XC40, Xeon 16C + Custom | USA | 301,056 | 8.10 | 80 | |
7 | Swiss CSCS | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom | Swiss | 115,984 | 6.27 | 81 | 2.3 | 2726
8 | HLRS Stuttgart | Hazel Hen, Cray XC40, Xeon 12C + Custom | Germany | 185,088 | 5.64 | 76 | |
9 | KAUST | Shaheen II, Cray XC30, Xeon 16C + Custom | Saudi Arabia | 196,608 | 5.54 | 77 | 4.5 | 1146
10 | Texas Advanced Computing Center | Stampede, Dell Intel (8c) + Intel Xeon Phi (61c) + IB | USA | 204,900 | 5.17 | 61 | 4.5 | 1489
500 (368) | Karlsruher | MEGAWARE Intel | Germany | 10,800 | 0.206 | 95 | |

♦ US DOE planning to deploy O(100) Pflop/s systems for 2017-2018 ($525M hardware)
  - Oak Ridge Lab and Lawrence Livermore Lab to receive IBM and Nvidia based systems
  - Argonne Lab to receive an Intel based system
  - After this: Exaflops
♦ US Dept of Commerce is preventing some China groups from receiving Intel technology (February 2015), citing concerns about nuclear research being done with the systems.
  - On the blockade list:
    - National SC Center Guangzhou, site of Tianhe-2
    - National SC Center Tianjin, site of Tianhe-1A
    - National University for Defense Technology, developer
    - National SC Center Changsha, location of NUDT
♦ Sunway BlueLight MPP, ShenWei processor SW1600, 975.00 MHz, Infiniband QDR
  - Site: National Supercomputing Center in Jinan
  - Cores: 137,200
  - Linpack Performance (Rmax): 795.9 TFlop/s
  - Theoretical Peak (Rpeak): 1,070.16 TFlop/s
  - Processor: ShenWei SW1600 16C 975 MHz (Alpha arch)
  - Interconnect: Infiniband QDR
♦ China is rumored to have a 100 Pflop/s system.
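
The derived columns in these tables follow directly from the raw figures: % of Peak is Rmax/Rpeak, and MFlops/Watt is Rmax divided by the power draw. The short Python sketch below (not from the slides; it uses only values quoted in the tables above) recomputes them for a few of the listed systems.

```python
# Recompute the derived Top500 columns from the raw values quoted in the tables above.
systems = [
    # name,        Rmax [Pflop/s], % of Peak, power [MW]
    ("Tianhe-2",   33.9,           62,        17.8),
    ("Titan",      17.6,           65,        8.3),
    ("Piz Daint",  6.27,           81,        2.3),
]

for name, rmax_pf, pct_peak, power_mw in systems:
    rpeak_pf = rmax_pf / (pct_peak / 100.0)                # implied theoretical peak (Rpeak)
    mflops_per_watt = (rmax_pf * 1e9) / (power_mw * 1e6)   # 1 Pflop/s = 1e9 Mflop/s, 1 MW = 1e6 W
    print(f"{name:10s} Rpeak ~ {rpeak_pf:5.1f} Pflop/s, "
          f"efficiency ~ {mflops_per_watt:5.0f} Mflops/W")
```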

Peak Performance per Core (floating point operations per cycle per core)
• Most of the recent processors have FMA (fused multiply-add), i.e. x ← x + y*z in one cycle.
• Intel Xeon earlier models and AMD Opteron have SSE2: 2 flops/cycle DP & 4 flops/cycle SP
• Intel Xeon Nehalem ('09) & Westmere ('10) have SSE4: 4 flops/cycle DP & 8 flops/cycle SP
• Intel Xeon Sandy Bridge ('11) & Ivy Bridge ('12) have AVX: 8 flops/cycle DP & 16 flops/cycle SP
• Intel Xeon Haswell ('13) & Broadwell ('14) have AVX2: 16 flops/cycle DP & 32 flops/cycle SP (we are here)
• Xeon Phi (per core) is at 16 flops/cycle DP & 32 flops/cycle SP
• Intel Xeon Skylake (server) ('15) has AVX-512: 32 flops/cycle DP & 64 flops/cycle SP (coming soon)

CPU Access Latencies in Clock Cycles
[Figure: access latencies of the memory hierarchy, in clock cycles. Annotation: in 167 cycles a core can do 2672 DP flops.]

♦ Processors are over-provisioned for floating point arithmetic.
♦ Data movement is extremely expensive.
♦ Operation count is not a good indicator of the time to solve a problem.
♦ Algorithms that do more ops may actually take less time.

Singular Value Decomposition (LAPACK version 1991, Level 1, 2 & 3 BLAS): the first stage costs 8/3 n^3 ops (square matrix, with vectors).
[Figure: speedup over EISPACK versus matrix size (N x N, up to 20,000 columns) for LAPACK QR, LAPACK QR (1 core), LINPACK QR, and EISPACK (1 core) on Intel Sandy Bridge 2.6 GHz (8 flops per core per cycle); QR refers to the QR algorithm for computing the eigenvalues; the speedup axis runs from 0 to 50.]

Bottleneck in the Bidiagonalization
The standard bidiagonal reduction (xGEBRD, computing Q^H * A * P) proceeds in two steps: factor the panel, then update the trailing matrix.
[Figure: sparsity patterns of the blocked reduction (nz = 3035, 275, 2500, 225): factor panel k, then update, then factor panel k+1; each panel factorization requires 2 GEMVs.]
Characteristics:
• Total cost 8n^3/3 (reduction to bi-diagonal)
• Too many Level 2 BLAS operations
• 4/3 n^3 from GEMV and 4/3 n^3 from GEMM
• Performance limited to 2x the performance of GEMV
• Memory-bound algorithm

Hardware: 16 cores Intel Sandy Bridge, 2.6 GHz, 20 MB shared L3 cache; the theoretical double precision peak is 20.8 Gflop/s per core.
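
The flops/cycle figures above combine with clock rate and core count into a theoretical peak. A minimal Python sketch (illustrative, not from the slides) reproduces the Sandy Bridge numbers used in the experiments above.

```python
# Theoretical peak = cores x clock (GHz) x flops/cycle.
# Double-precision flops per cycle per core, per instruction set (from the list above).
DP_FLOPS_PER_CYCLE = {"SSE2": 2, "SSE4": 4, "AVX": 8, "AVX2": 16, "AVX-512": 32}

def peak_gflops(cores: int, ghz: float, isa: str) -> float:
    """Theoretical double-precision peak in Gflop/s."""
    return cores * ghz * DP_FLOPS_PER_CYCLE[isa]

print(peak_gflops(1, 2.6, "AVX"))     # Sandy Bridge, per core: ~20.8 Gflop/s
print(peak_gflops(16, 2.6, "AVX"))    # 16-core Sandy Bridge: ~332.8 Gflop/s
# While an AVX2 core waits ~167 cycles for memory, it forgoes 167 * 16 = 2672 DP flops:
print(167 * DP_FLOPS_PER_CYCLE["AVX2"])
```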
Compiled with icc and using MKL 2015.3.187.

Recent Work on a 2-Stage Algorithm
[Figure: sparsity patterns of the two-stage reduction (nz = 3600, 605, 119): the first stage reduces the dense matrix to band form; the second stage (bulge chasing) reduces the band to bi-diagonal form.]
Characteristics:
• Stage 1:
  - Fully Level 3 BLAS
  - Dataflow asynchronous execution
• Stage 2:
  - Level "BLAS-1.5"
  - Asynchronous execution
  - Cache friendly kernel (reduced communication)

Operation counts for the two-stage approach (tile size n_b, n_t = n/n_b tiles):

flops_first stage ≈ Σ_{s=1..(n-n_b)/n_b} [ 2n_b^3 + (n_t - s)·3n_b^3 + (n_t - s)·(10/3)n_b^3 + (n_t - s)(n_t - s)·5n_b^3 ]
                  + Σ_{s=1..(n-n_b)/n_b} [ 2n_b^3 + (n_t - s - 1)·3n_b^3 + (n_t - s - 1)·(10/3)n_b^3 + (n_t - s)(n_t - s - 1)·5n_b^3 ]
                  ≈ (10/3) n^3 + (10 n_b/3) n^2 + lower-order terms
                  ≈ (10/3) n^3   (gemm, first stage)                                   (7)

flops_second stage = 6 n_b n^2   (gemv, second stage)                                  (8)

This is more flops than the original one-stage reduction, which did 8/3 n^3.

Maximum performance bound:

P_max = (reference number of operations) / (minimum time t_min)
      = (8/3) n^3 / [ t_min((10/3) n^3 flops in gemm) + t_min(6 n_b n^2 flops in gemv) ]
      = 8 n^3 P_gemm P_gemv / (10 n^3 P_gemv + 18 n_b n^2 P_gemm)                      (9)

This implies (8/40) P_gemm ≤ P_max ≤ (8/12) P_gemm if P_gemm is about 20x P_gemv and 120 ≤ n_b ≤ 240.
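
Equation (9) can be evaluated directly to see how the achievable rate of the two-stage reduction depends on the GEMM and GEMV rates and on the tile size n_b. The Python sketch below is only illustrative: the P_gemm value is an assumed placeholder rather than a measurement from the slides, and P_gemv follows the slide's assumption of being roughly 20x slower.

```python
# Evaluate the performance bound of eq. (9) for the two-stage bidiagonal reduction:
#   Pmax = (8/3) n^3 / ( (10/3) n^3 / Pgemm + 6 nb n^2 / Pgemv )

def pmax(n: int, nb: int, p_gemm: float, p_gemv: float) -> float:
    """Upper bound (eq. 9) on the two-stage reduction rate, in the same units as p_gemm/p_gemv."""
    ref_flops = (8 / 3) * n**3              # operation count of the standard one-stage reduction
    t_stage1 = (10 / 3) * n**3 / p_gemm     # first stage: GEMM-dominated
    t_stage2 = 6 * nb * n**2 / p_gemv       # second stage: GEMV-dominated
    return ref_flops / (t_stage1 + t_stage2)

p_gemm = 300.0           # Gflop/s, assumed GEMM rate (placeholder, not a measured value)
p_gemv = p_gemm / 20     # the slide assumes GEMV is roughly 20x slower than GEMM
for nb in (120, 160, 240):
    bound = pmax(n=20_000, nb=nb, p_gemm=p_gemm, p_gemv=p_gemv)
    print(f"nb={nb:3d}: Pmax <= {bound:6.1f} Gflop/s  ({bound / p_gemm:.2f} x Pgemm)")
```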