HPC – Algorithms and Applications Fundamentals – Parallel Architectures, Models, and Languages
Technische Universität München

HPC – Algorithms and Applications
Fundamentals – Parallel Architectures, Models, and Languages
Michael Bader
Winter 2012/2013

Part I: Parallel Architectures

Multicore CPUs – Intel's Nehalem Architecture
• example: quad-core CPU with shared and private caches
• simultaneous multithreading: 8 threads on 4 cores
• memory architecture: QuickPath Interconnect (replaced the Front Side Bus)

Multicore CPUs – Intel's Nehalem Architecture (2)
• NUMA (non-uniform memory access) architecture: each CPU has "private" (local) memory, but also access to remote memory (at higher latency)
• max. 25 GB/s bandwidth
(source: Intel – Nehalem Whitepaper)

Manycore CPU – Intel Knights Ferry
[Figure: Intel® MIC Architecture: An Intel Co-Processor Architecture – vector IA cores with coherent caches, connected by an interprocessor network to fixed-function logic and to the I/O and memory interfaces.]
• Many cores and many, many more threads
• Standard IA programming and memory model
(source: Intel/K. Skaugen – SC'10 keynote presentation)

Manycore CPU – Intel Knights Ferry (2)
Knights Ferry
• Software development platform for Intel® MIC architecture
• Growing availability through 2010
• 32 cores, 1.2 GHz
• 128 threads at 4 threads / core
• 8 MB shared coherent cache
• 1–2 GB GDDR5
• Bundled with Intel HPC tools
(source: Intel/K. Skaugen – SC'10 keynote presentation)

GPGPU – NVIDIA Fermi
(source: NVIDIA – Fermi Whitepaper)

Hardware Execution
CUDA's hierarchy of threads maps to a hierarchy of processors on the GPU: a GPU executes one or more kernel grids; a streaming multiprocessor (SM) executes one or more thread blocks; and CUDA cores and other execution units in the SM execute threads. The SM executes threads in groups of 32 threads called a warp. While programmers can generally ignore warp execution for functional correctness and think of programming one thread, they can greatly improve performance by having the threads of a warp execute the same code path and access memory at nearby addresses.
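As an illustration (not from the lecture slides), the following minimal CUDA sketch shows how this hierarchy appears to the programmer and what a warp-friendly access pattern looks like: thread i of the grid touches x[i] and y[i], so the 32 threads of a warp access 32 consecutive addresses and follow the same code path. Kernel name and sizes are arbitrary.

    #include <cuda_runtime.h>

    // y = a*x + y, one element per thread
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        // hierarchy: grid of blocks, block of threads; warps of 32 are
        // formed from consecutive thread indices within a block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                  // at most the final warp diverges at this check
            y[i] = a * x[i] + y[i]; // neighbouring threads -> neighbouring addresses
    }

    int main()
    {
        const int n = 1 << 20;
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));
        cudaMemset(x, 0, n * sizeof(float));
        cudaMemset(y, 0, n * sizeof(float));

        int threadsPerBlock = 256;  // 8 warps per block
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, x, y); // launch the kernel grid
        cudaDeviceSynchronize();

        cudaFree(x);
        cudaFree(y);
        return 0;
    }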
An Overview of the Fermi Architecture
The first Fermi-based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread global scheduler distributes thread blocks to SM thread schedulers.
[Figure: Fermi die layout – the 16 SMs are positioned around a common L2 cache; each SM is a vertical rectangular strip that contains a scheduler and dispatch portion, an execution-unit portion, and register file and L1 cache portions.]

GPGPU – NVIDIA Fermi (2)
(source: NVIDIA – Fermi Whitepaper)

Third Generation Streaming Multiprocessor
The third generation SM introduces several architectural innovations that make it not only the most powerful SM yet built, but also the most programmable and efficient.
[Figure: Fermi streaming multiprocessor (SM) – instruction cache; two warp schedulers, each with a dispatch unit; a 32,768 x 32-bit register file; 32 CUDA cores, each with dispatch port, operand collector, FP unit, INT unit, and result queue; 16 load/store units; four special function units (SFUs); interconnect network; 64 KB shared memory / L1 cache; uniform cache.]

512 High Performance CUDA Cores
Each SM features 32 CUDA processors – a fourfold increase over prior SM designs. Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately. GT200 implemented double precision FMA.
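To make the single-rounding property concrete, here is a small hypothetical CUDA sketch (not from the lecture) contrasting a separately rounded multiply-then-add with the fused multiply-add. fmaf() and __fmul_rn() are standard CUDA device math functions; the explicit __fmul_rn() keeps the compiler from contracting the first expression into an FMA on its own. The input values are invented for the demonstration.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void fma_demo(float a, float b, float c, float *out)
    {
        out[0] = __fmul_rn(a, b) + c;  // round a*b to float, then round the sum
        out[1] = fmaf(a, b, c);        // exact a*b + c, one final rounding
    }

    int main()
    {
        float *d_out, h_out[2];
        cudaMalloc(&d_out, 2 * sizeof(float));
        // a*b lies just below 1.0; the separately rounded product collapses
        // to exactly 1.0f, so only the fused version keeps the tiny residue.
        fma_demo<<<1, 1>>>(1.0f + 1e-7f, 1.0f - 1e-7f, -1.0f, d_out);
        cudaMemcpy(h_out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
        printf("mul+add: %g   fma: %g\n", h_out[0], h_out[1]); // 0 vs. ~ -1.4e-14
        cudaFree(d_out);
        return 0;
    }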
In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result, multi-instruction emulation sequences were required for integer arithmetic. In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended precision operations. Various instructions are supported, including Boolean, shift, move, compare, convert, bit-field extract, bit-reverse insert, and population count.

16 Load/Store Units
Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock. Supporting units load and store the data at each address to cache or DRAM.

GPGPU – NVIDIA Fermi (3)
General Purpose Graphics Processing Unit:
• 512 CUDA cores
• improved double precision performance
• shared vs. global memory
(source: NVIDIA – Fermi Whitepaper)

Memory Subsystem Innovations
NVIDIA Parallel DataCache™ with Configurable L1 and Unified L2 Cache
Working with hundreds of GPU computing applications from various industries, we learned that while shared memory benefits many problems, it is not appropriate for all problems. Some algorithms map naturally to shared memory, others require a cache, while others require a combination of both. The optimal memory hierarchy should offer the benefits of both shared memory and cache, and allow the programmer a choice over its partitioning.
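As a concrete (hypothetical) illustration of that choice, the CUDA sketch below stages data in programmer-managed shared memory and uses the runtime call cudaFuncSetCacheConfig() to request the larger shared-memory share of the SM's configurable 64 KB on-chip storage. Kernel and variable names are invented for the example.

    #include <cuda_runtime.h>

    // Reverse each 256-element block of 'data' in place, staging the values
    // in on-chip shared memory so all global loads and stores stay coalesced.
    // Assumes blockDim.x == 256.
    __global__ void reverseInTiles(float *data)
    {
        __shared__ float tile[256];                   // programmer-managed shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = data[i];                  // stage this block's values
        __syncthreads();                              // wait until the tile is full
        data[i] = tile[blockDim.x - 1 - threadIdx.x]; // write back reversed
    }

    int main()
    {
        // Request the 48 KB shared / 16 KB L1 partitioning of the 64 KB
        // storage (the alternative split is 16 KB shared / 48 KB L1).
        cudaFuncSetCacheConfig(reverseInTiles, cudaFuncCachePreferShared);

        float *d;
        cudaMalloc(&d, 1024 * sizeof(float));
        reverseInTiles<<<1024 / 256, 256>>>(d);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }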