Fundamentals – Parallel Architectures, Models, and Languages

HPC – Algorithms and Applications
Michael Bader, TUM – SCCS, Winter 2017/2018

Part I: Parallel Architectures
(sorry, not everywhere the latest ones)

Manycore CPU – Intel Xeon Phi Coprocessor
• coprocessor = works as an extension card on the PCI bus
• ≈ 60 cores, 4 hardware threads per core
• simpler architecture for each core, but
• wider vector computing unit (8 double-precision floats)
• next generation (Knights Landing) available as standalone CPU (since 2017)

Manycore CPU – Intel “Knights Landing”
Unveiling Details of Knights Landing (Next Generation Intel® Xeon Phi™ Products)
• 1st commercial systems: 2nd half ’15
• 3+ TFLOPS [1] in one package: parallel performance and density
• Platform memory: DDR4 bandwidth and capacity comparable to Intel® Xeon® processors
• Compute: energy-efficient IA cores [2]
  - microarchitecture enhanced for HPC [3]
  - 3X single-thread performance vs Knights Corner [4]
  - Intel Xeon processor binary compatible [5]
• On-package memory:
  - up to 16 GB at launch
  - 1/3X the space [6]
  - 5X bandwidth vs DDR4 [7]
  - 5X power efficiency [6]
• Integrated fabric
Jointly developed with Micron Technology.
[Figure: processor package with Intel® Silvermont architecture cores enhanced for HPC, on-package memory, and integrated fabric. Conceptual, not actual package layout.]

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
[1] Over 3 teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency and floating-point operations per cycle. FLOPS = cores x clock frequency x floating-point operations per cycle.
[2] Modified version of the Intel® Silvermont microarchitecture currently found in Intel® Atom™ processors.
[3] Modifications include AVX512 and 4 threads/core support.
[4] Projected peak theoretical single-thread performance relative to 1st Generation Intel® Xeon Phi™ Coprocessor 7120P (formerly codenamed Knights Corner).
[5] Binary compatible with Intel Xeon processors using the Haswell instruction set (except TSX).
[6] Projected results based on internal Intel analysis of Knights Landing memory vs Knights Corner (GDDR5).
[7] Projected result based on internal Intel analysis of the STREAM benchmark using a Knights Landing processor with 16 GB of ultra high-bandwidth memory versus DDR4 memory only with all channels populated.
(source: Intel/Raj Hazra – ISC’14 keynote presentation)

GPGPU – NVIDIA Fermi

Hardware Execution
CUDA’s hierarchy of threads maps to a hierarchy of processors on the GPU: a GPU executes one or more kernel grids; a streaming multiprocessor (SM) executes one or more thread blocks; and CUDA cores and other execution units in the SM execute threads. The SM executes threads in groups of 32 threads called a warp. While programmers can generally ignore warp execution for functional correctness and think of programming one thread, they can greatly improve performance by having threads in a warp execute the same code path and access memory in nearby addresses.

An Overview of the Fermi Architecture
The first Fermi-based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread global scheduler distributes thread blocks to SM thread schedulers.

[Figure: Fermi’s 16 SMs are positioned around a common L2 cache.]
(source: NVIDIA – Fermi Whitepaper)
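The advice above, that threads in a warp should execute the same code path, can be illustrated with a small host-side model. This is only a sketch in plain Python, not CUDA code: the helper `divergent_warps` and the two branch predicates are hypothetical names introduced for illustration, and the only number taken from the text is the warp size of 32. The model counts the warps of a thread block in which threads disagree on a branch, which are the warps that would have to run both code paths.

```python
WARP_SIZE = 32  # warp width stated in the text above


def divergent_warps(block_size, takes_branch):
    """Count warps whose threads disagree on a branch (hypothetical helper).

    Threads are grouped into warps by consecutive thread index, which is an
    assumption of this model for a one-dimensional thread block.
    """
    count = 0
    for base in range(0, block_size, WARP_SIZE):
        last = min(base + WARP_SIZE, block_size)
        outcomes = {bool(takes_branch(tid)) for tid in range(base, last)}
        if len(outcomes) > 1:  # both True and False occur inside this warp
            count += 1
    return count


def branch_per_thread(tid):
    # Neighbouring threads take different paths: every warp diverges.
    return tid % 2 == 0


def branch_per_warp(tid):
    # All 32 threads of a warp take the same path: no warp diverges.
    return (tid // WARP_SIZE) % 2 == 0


if __name__ == "__main__":
    block_size = 256  # assumed block size, chosen only for the example
    print("warps per block:", block_size // WARP_SIZE)
    print("divergent warps, branch on thread parity:",
          divergent_warps(block_size, branch_per_thread))
    print("divergent warps, branch on warp index:",
          divergent_warps(block_size, branch_per_warp))
```

Both predicates send half of the threads down each path; only the grouping differs. Branching on a quantity that is constant within a warp keeps every warp on a single code path in this model.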
In the whitepaper's die diagram, each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).

GPGPU – NVIDIA Fermi (2)

Third Generation Streaming Multiprocessor
The third generation SM introduces several architectural innovations that make it not only the most powerful SM yet built, but also the most programmable and efficient.

[Figure: Fermi Streaming Multiprocessor (SM): instruction cache; two warp schedulers, each with a dispatch unit; register file (32,768 x 32-bit); CUDA cores (each with dispatch port, operand collector, FP unit, INT unit, and result queue); LD/ST units; SFUs; interconnect network; 64 KB shared memory / L1 cache; uniform cache.]
(source: NVIDIA – Fermi Whitepaper)

512 High Performance CUDA cores
Each SM features 32 CUDA processors, a fourfold increase over prior SM designs. Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately. GT200 implemented double precision FMA.

In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result, multi-instruction emulation sequences were required for integer arithmetic. In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended precision operations.
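The single final rounding step of FMA described above can be made visible with a small numerical experiment. The sketch below is an emulation in Python, not GPU code: `fma_exact` is a hypothetical helper that computes a*b + c exactly with rational arithmetic and rounds once to double precision, while `mad` rounds the product to nearest and then rounds the sum. That is an assumption of this sketch and is not claimed to match the MAD instruction of any particular GPU. The input values are chosen so that the rounding error of the product is exactly what remains.

```python
from fractions import Fraction


def mad(a, b, c):
    """Multiply, round, then add and round again (two rounding steps)."""
    return a * b + c


def fma_exact(a, b, c):
    """Emulated fused multiply-add: exact a*b + c, rounded once to a double.

    Fraction arithmetic is exact, and converting the result to float
    performs the single final rounding.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


if __name__ == "__main__":
    a = 1.0 + 2.0 ** -27   # a*a = 1 + 2**-26 + 2**-54 needs more than 53 bits
    p = a * a              # rounded product: the 2**-54 term is lost
    print("separate multiply and add:", mad(a, a, -p))
    print("fused multiply-add       :", fma_exact(a, a, -p))
```

With separate operations the residual a*a - p is lost, because the product was already rounded to p. The fused form recovers the rounding error of the product, which is why FMA is useful for error-compensated algorithms.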
Various instructions are supported, including Boolean, shift, move, compare, convert, bit-field extract, bit-reverse insert, and population count.

16 Load/Store Units
Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock. Supporting units load and store the data at each address to cache or DRAM.

GPGPU – NVIDIA Fermi (3)

General Purpose Graphics Processing Unit:
• 512 CUDA cores (organized in 16 streaming multiprocessors)
• improved double precision

Memory Subsystem Innovations
NVIDIA Parallel DataCache™ with Configurable L1 and Unified L2 Cache
Working with hundreds of GPU computing applications from various industries, we learned that while shared memory benefits many problems, it is not appropriate for all problems. Some algorithms map naturally to shared memory, others require a cache, while others require a combination of both. The optimal memory hierarchy should offer the benefits of both shared memory and cache, and allow the programmer a choice over its partitioning. The Fermi memory hierarchy
