
Parallel Computer Architecture Concepts
TDDE35 Lecture 1
Christoph Kessler
PELAB / IDA, Linköping University, Sweden
2019

Outline: Lecture 1 – Parallel Architecture Concepts
• Parallel computer, multiprocessor, multicomputer
• SIMD vs. MIMD execution
• Shared memory vs. distributed memory architecture
• Interconnection networks
• Parallel architecture design concepts
  - Instruction-level parallelism
  - Hardware multithreading
  - Multi-core and many-core
  - Accelerators and heterogeneous systems
  - Clusters
• Implications for programming and algorithm design

Traditional Use of Parallel Computing: High Performance Computing (HPC)
• High Performance Computing (HPC)
  - E.g. climate simulations, particle physics, protein docking, …
  - Much computational work (in FLOPs, floating-point operations)
  - Often, large data sets
• Single-CPU and even today's multicore processors cannot provide such massive computation power
• Aggregate LOTS of computers → Clusters
  - Need scalable parallel algorithms
  - Need to exploit multiple levels of parallelism
  - Cost of communication, memory access

Large-Scale HPC Applications: Application Areas (Selection)
• Computational Fluid Dynamics
• Weather Forecasting and Climate Simulation
• Aerodynamics / Air Flow Simulations and Optimization
• Structural Engineering
• Fuel-Efficient Aircraft Design
• Molecular Modelling
• Material Science
• Battery Simulation and Optimization
• Galaxy Simulations
• Earthquake Engineering, Oil Reservoir Simulation
• Flood Prediction
• (DNA Pattern Matching, Protein Docking)
• Fluid / Structural Interaction
• Blood Flow Simulation
• fMRI Image Analysis
• ...

(Image: NSC Tetralith cluster – www.e-science.se)

Another Classical Use of Parallel Computing: Parallel Embedded Computing
• High-performance embedded computing
  - E.g. on-board realtime image/video processing, gaming, …
  - Much computational work (often fixed-point operations)
  - Often, in energy-constrained mobile devices
• Sequential programs on single-core computers cannot provide sufficient computation power at a reasonable power budget
• Use many small cores at low frequency
  - Need scalable parallel algorithms
  - Cost of communication, memory access
  - Energy cost (Power x Time)

Example: Weather Forecast (very simplified…)
• 3D space discretization (cells), time discretization (steps)
• Start from current observations (sent from weather stations etc.)
• Simulation step by evaluating the weather model equations
• Variables per cell: air pressure, temperature, humidity, sun radiation, wind direction, wind velocity, …
• E.g., cell size 1 km³ and 10-min time discretization → 10-day simulation: 10^15 FLOPs
• SMHI: 4 forecasts per day, 50 variants (simulations) per forecast
https://www.smhi.se/kunskapsbanken/meteorologi/sa-gor-smhi-en-vaderprognos-1.4662
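As a rough sanity check of the 10^15-FLOP figure above, the following small C sketch recomputes the estimate. The atmosphere height and the FLOPs per cell update are illustrative assumptions, not values from the slide.

    #include <stdio.h>

    int main(void) {
        /* Assumed discretization; height and FLOPs/cell are illustrative guesses. */
        double surface_km2    = 5.1e8;  /* Earth's surface area in km^2             */
        double height_km      = 20.0;   /* assumed simulated atmosphere height (km) */
        double cell_km3       = 1.0;    /* cell size 1 km^3 (from the slide)        */
        double step_min       = 10.0;   /* 10-minute time steps (from the slide)    */
        double days           = 10.0;   /* 10-day forecast (from the slide)         */
        double flops_per_cell = 100.0;  /* assumed FLOPs per cell and time step     */

        double cells = surface_km2 * height_km / cell_km3;
        double steps = days * 24.0 * 60.0 / step_min;
        printf("cells=%.2e steps=%.0f total FLOPs=%.2e\n",
               cells, steps, cells * steps * flops_per_cell);  /* ~1.5e15 */
        return 0;
    }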

More Recent Use of Parallel Computing: Big-Data Analytics Applications
• Big Data Analytics
  - Data access intensive (disk I/O, memory accesses)
    - Typically, very large data sets (GB … TB … PB … EB …)
  - Also some computational work for combining/aggregating data
  - E.g. data center applications, business analytics, click analysis, scientific data analysis, …
  - Soft real-time requirements on interactive queries
• Single-CPU and multicore processors cannot provide such massive computation power and I/O capacity
• Aggregate LOTS of computers → Clusters
  - Need scalable parallel algorithms
  - Need to exploit multiple levels of parallelism

HPC vs Big-Data Computing
• Both need parallel computing
• Same kind of hardware – clusters of (multicore) servers
• Same OS family (Linux)
• Different programming models, languages, and tools

  HPC application                                | Big-Data application
  HPC prog. languages: Fortran, C/C++ (Python)   | Big-Data prog. languages: Java, Scala, Python, …
  Par. programming models: MPI, OpenMP, …        | Par. programming models: MapReduce, Spark, …
  Scientific computing libraries: BLAS, …        | Big-data storage/access: HDFS, …
  OS: Linux                                      | OS: Linux
  HW: Cluster                                    | HW: Cluster

→ Let us start with the common basis: parallel computer architecture
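To make the HPC column of the table concrete, here is a minimal hybrid MPI + OpenMP sketch in C, combining two of the models listed above; the problem size and the trivial per-element work are illustrative placeholders.

    /* Compile e.g. with: mpicc -fopenmp hybrid_sum.c */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000           /* illustrative local problem size per process */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double local = 0.0;
        /* Shared-memory parallelism within one cluster node (OpenMP threads) */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < N; i++)
            local += 1.0;       /* stand-in for real per-element work */

        /* Message passing across nodes (MPI): combine the partial sums on rank 0 */
        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %.0f from %d processes\n", global, size);
        MPI_Finalize();
        return 0;
    }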

Parallel Computer Architecture Concepts

Classification of parallel computer architectures:
• by control structure
• by memory organization
  - in particular, Distributed memory vs. Shared memory
• by interconnection network


Classification by Control Structure
(Figure: single scalar operation (op) vs. vector operation (vop) vs. multiple independent operations op1 … op4 per step)

Classification by Memory Organization
• Distributed memory system, e.g. a (traditional) HPC cluster
• Shared memory system, e.g. a multiprocessor (SMP) or a computer with a standard multicore CPU
• Most common today in HPC and data centers: Hybrid memory system
  - a cluster (distributed memory) of hundreds or thousands of shared-memory servers, each containing one or several multi-core CPUs


Hybrid (Distributed + Shared) Memory
(Figure: cluster nodes, each with processors P, memory M, and a router R, connected by a network)

Interconnection Networks (1)
• Network = physical interconnection medium (wires, switches) + communication protocol
  (a) connecting cluster nodes with each other (DMS)
  (b) connecting processors with memory modules (SMS)
• Classification
  - Direct / static interconnection networks
    - connecting nodes directly to each other
    - Hardware routers (communication processors) can be used to offload processors from most communication work
  - Switched / dynamic interconnection networks


Interconnection Networks (2): Simple Topologies
(Figure: examples of simple topologies, e.g. a fully connected network of processors P)

Interconnection Networks (3): Fat-Tree Network
• Tree network extended for higher bandwidth (more switches, more links) closer to the root
  - avoids the bandwidth bottleneck at the root
• Example: InfiniBand network (www.mellanox.com)


More about Interconnection Networks
• Hypercube, Crossbar, Butterfly, Hybrid networks, … → TDDC78
• Switching and routing algorithms
• Discussion of interconnection network properties
  - Cost (#switches, #lines)
  - Scalability (asymptotically, cost grows not much faster than #nodes)
  - Node degree
  - Longest path (→ latency)
  - Accumulated bandwidth
  - Fault tolerance (worst-case impact of node or switch failure)
  - …

Instruction-Level Parallelism (1): Pipelined Execution Units

SIMD Computing with Pipelined Vector Units
• e.g., vector supercomputers (1970s, 1980s), Fujitsu, …

Instruction-Level Parallelism (2): VLIW and Superscalar
• Multiple functional units in parallel
• 2 main paradigms:
  - VLIW (very long instruction word) architecture
    - Parallelism is explicit, programmer-/compiler-managed (hard)
  - Superscalar architecture
    - Sequential instruction stream
    - Hardware-managed dispatch
    - power + area overhead
• ILP in applications is limited
  - typically < 3...4 instructions can be issued simultaneously
  - due to control and data dependences in applications
• Solution: Multithread the application and the processor
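A small, hypothetical C fragment illustrating the dependence limit mentioned above: in sum_chain every addition depends on the previous one, so the hardware finds little ILP, whereas sum_unrolled keeps four independent accumulators that can occupy several functional units at once.

    /* Dependence chain: each iteration needs the result of the previous one,
     * so the additions must execute (essentially) one after another. */
    double sum_chain(const double *a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Four independent accumulators: the four additions per iteration have no
     * dependences on each other, giving the hardware (or a VLIW compiler)
     * instruction-level parallelism to exploit. */
    double sum_unrolled(const double *a, int n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            s0 += a[i];   s1 += a[i+1];
            s2 += a[i+2]; s3 += a[i+3];
        }
        for (; i < n; i++) s0 += a[i];   /* remainder */
        return s0 + s1 + s2 + s3;
    }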

Hardware Multithreading
(Figure: a multithreaded processor interleaving several threads of control to hide latencies, e.g. of data dependences)

SIMD Instructions
• "Single Instruction stream, Multiple Data streams"
  - single thread of control flow
  - restricted form of data parallelism
    - apply the same primitive operation (a single instruction) in parallel to multiple data elements stored contiguously
  - SIMD units use long "vector registers", each holding multiple data elements
• Common today
  - MMX, SSE, SSE2, SSE3, …
  - Altivec, VMX, SPU, …
• Performance boost for operations on shorter data types
• Area- and energy-efficient
• Code to be rewritten (SIMDized) by programmer or compiler
• Does not help (much) for memory bandwidth

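As an illustration of what "SIMDizing" code means, here is a hedged C sketch using x86 SSE intrinsics: each _mm_add_ps performs four single-precision additions with one instruction on data held in a 128-bit vector register. It assumes n is a multiple of 4 and the arrays are 16-byte aligned.

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* c[i] = a[i] + b[i], four floats per instruction.
     * Assumes n is a multiple of 4 and the arrays are 16-byte aligned. */
    void add_simd(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(&a[i]);   /* load 4 floats into a vector register */
            __m128 vb = _mm_load_ps(&b[i]);
            __m128 vc = _mm_add_ps(va, vb);   /* one SIMD instruction: 4 additions */
            _mm_store_ps(&c[i], vc);          /* store 4 results contiguously */
        }
    }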

The Memory Wall
• Performance gap CPU – Memory
• Increasing cache sizes shows diminishing returns
  - Costs power and chip area
    - GPUs spend the area instead on many simple cores with little memory
  - Relies on good data locality in the application
• What if there is no / little data locality?
  - Irregular applications, e.g. sorting, searching, optimization...
• Solution: Spread out / overlap memory access delay
  - Programmer/Compiler: Prefetching, on-chip pipelining, SW-managed on-chip buffers
  - Generally: Hardware multithreading, again!

Moore's Law (since 1965)
• Exponential increase in transistor density
(Graph; data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović)
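One of the software remedies listed on the Memory Wall slide above, prefetching, can be sketched in C with the GCC/Clang builtin __builtin_prefetch; the prefetch distance of 16 elements is an assumed tuning parameter, not something from the slide.

    /* Sum an array while asking the hardware to start fetching data that will
     * be needed a few iterations ahead, overlapping memory latency with
     * computation. GCC/Clang builtin; the distance 16 is an assumption. */
    double sum_with_prefetch(const double *a, long n) {
        double s = 0.0;
        for (long i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0, 1);  /* 0 = read, 1 = low temporal locality */
            s += a[i];
        }
        return s;
    }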

The Power Issue
• Power = Static (leakage) power + Dynamic (switching) power
• Dynamic power ~ Voltage² × Clock frequency, where Clock frequency approx. ~ Voltage
  → Dynamic power ~ Frequency³
• Total power ~ #processors

Moore's Law vs. Clock Frequency
• #transistors / mm² still growing exponentially according to Moore's Law
• Clock speed flattening out at ~3 GHz since ca. 2003

More transistors + Limited frequency ⇒ More cores
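Written out, the dynamic-power argument behind this conclusion is the standard CMOS approximation (C here is the switched capacitance, an added symbol not on the slide; V is the supply voltage and f the clock frequency; the cubic law uses the slide's assumption that f scales roughly with V):

    P_dyn ~ C * V^2 * f,  with  f ~ V   ⇒   P_dyn ~ C * f^3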


Solution for CPU Design: Multicore + Multithreading
• Single-thread performance does not improve any more since ca. 2003
  - ILP wall
  - Memory wall
  - Power wall (end of "Dennard Scaling")
• But thanks to Moore's Law we could still put more cores on a chip
  - And hardware-multithread the cores to hide memory latency
  - All major chip manufacturers produce multicore CPUs today

Main features of a multicore system
• There are multiple computational cores on the same chip.
  - Homogeneous multicore (same core type)
  - Heterogeneous multicore (different core types)
• The cores might have (small) private on-chip memory modules and/or access to on-chip memory shared by several cores.
• The cores have access to a common off-chip main memory.
• There is a way by which these cores communicate with each other and/or with the environment.


Standard CPU Multicore Designs
• Standard desktop/server CPUs have a few cores with shared off-chip main memory
  - On-chip cache (typ., 2 levels)
    - L1-cache mostly core-private
    - L2-cache often shared by groups of cores
  - Memory access interface shared by all or groups of cores
• Caching → multiple copies of the same data item
• Writing to one copy (only) causes inconsistency
• Shared-memory coherence mechanism to enforce automatic (uniform) updating or invalidation of all copies around
→ More about shared-memory architecture, caches, data locality, consistency issues and coherence protocols in TDDC78/TDDD56
(Figure: cores with private L1 caches, shared L2 cache, interconnect / memory interface, main memory (DRAM))

Some early dual-core CPUs (2004/2005)
• IBM Power5 (2004), AMD Opteron dual-core (2005), Intel Xeon dual-core (2005)
(Figure: block diagrams with cores P0/P1 (SMT), per-core L1$/D1$, L2$, memory controller, main memory)
• Abbreviations: $ = "cache", L1$ = "level-1 instruction cache", D1$ = "level-1 data cache", L2$ = "level-2 cache"
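A side effect of the cache-coherence mechanism described above can be shown with a small, hypothetical C/pthreads example: if several threads' counters shared one cache line, every update would force the protocol to invalidate the other cores' copies ("false sharing"); padding each counter to a full cache line (64 bytes assumed here) keeps the updates local.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NITER    10000000L

    /* Each counter padded to 64 bytes (a typical cache-line size, assumed here)
     * so the per-thread counters live in different cache lines and the
     * coherence protocol does not have to bounce one line between cores. */
    struct padded_counter {
        long value;
        char pad[64 - sizeof(long)];
    };

    static struct padded_counter counters[NTHREADS];

    static void *worker(void *arg) {
        int id = *(int *)arg;
        for (long i = 0; i < NITER; i++)
            counters[id].value++;        /* private counter, no locking needed */
        return NULL;
    }

    int main(void) {
        pthread_t th[NTHREADS];
        int ids[NTHREADS];
        for (int t = 0; t < NTHREADS; t++) {
            ids[t] = t;
            pthread_create(&th[t], NULL, worker, &ids[t]);
        }
        long total = 0;
        for (int t = 0; t < NTHREADS; t++) {
            pthread_join(th[t], NULL);
            total += counters[t].value;
        }
        printf("total = %ld\n", total);
        return 0;
    }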

Sun UltraSPARC "Niagara" (8 cores)
• Niagara T1 (2005): 8 cores, 32 HW threads
(Figure: 8 cores P0–P7, each with L1$ and D1$, shared L2$, four memory controllers, main memory)

SUN / Oracle SPARC-T5 (2012)
• Niagara T1 (2005): 8 cores, 32 HW threads
• Niagara T2 (2008): 8 cores, 64 HW threads
• Niagara T3 (2010): 16 cores, 128 HW threads
• T5 (2012): 16 cores, 128 HW threads
  - 28 nm, 16 cores x 8 HW threads, L3 cache on-chip, on-die accelerators for common encryption algorithms

Scaling Up: Network-On-Chip
• Cache-coherent shared memory (hardware-controlled) does not scale well to many cores
  - power- and area-hungry
  - signal latency across whole chip
  - not well predictable access times
• Idea: NCC-NUMA – non-cache-coherent, non-uniform memory access
  - physically distributed on-chip [cache] memory,
  - on-chip network, connecting PEs or coherent "tiles" of PEs,
  - global shared address space,
  - but software responsible for maintaining coherence
• Examples:
  - STI Cell/B.E.
  - Tilera TILE64
  - Intel SCC, Kalray MPPA

Example: Cell/B.E. (IBM / Sony / Toshiba, 2006)
• An on-chip network (four parallel unidirectional rings) interconnects the master core, the slave cores and the main memory interface
• LS = local on-chip memory, PPE = master, SPE = slave

Towards Many-Core CPUs...
• Tilera TILE64 (2007): 64 cores, 8x8 2D-mesh on-chip network
• 1 tile: VLIW processor (P) + cache (C) + router (R)
(Figure: 8x8 tile array with memory controllers and I/O interfaces at the edges; image simplified)

Towards Many-Core Architectures
• Intel, second-generation many-core research processor: 48-core (in-order x86) SCC "Single-chip cloud computer", 2010
  - No longer fully cache coherent over the entire chip
  - MPI-like message passing over 2D mesh network on chip
(Source: Intel)

Kalray MPPA-256
• 16 tiles, with 16 VLIW compute cores each plus 1 control core per tile
• Network on chip
• Virtually unlimited array extension by clustering several chips
• 28 nm CMOS technology
• Low power dissipation, typ. 5 W
(Image source: Kalray)

Intel Xeon Phi
• First generation (late 2012): up to 61 cores, 244 HW threads, 1.2 Tflops peak performance
  - Simpler x86 (Pentium) cores (x 4 HW threads), with 512-bit wide SIMD vector registers (AVX-512)
  - Can also be used as a coprocessor, instead of a GPU
• Latest version (2016): x200 "Knights Landing" (up to 72 cores / 288 HW threads), no longer as coprocessor


"General-purpose" GPUs
• Main GPU providers for laptop/desktop: Nvidia, AMD (ATI), Intel
• Example: NVIDIA's 10-series GPU (Tesla, 2008) has 240 cores
  - C1060: 933 GFlops
• Each core has a
  - Floating point / integer unit
  - Logic unit
  - Move, compare unit
  - Branch unit
• Cores managed by thread manager
  - Thread manager can spawn and manage 30,000+ threads
  - Zero-overhead thread switching
(Images removed)

NVIDIA Fermi (2010): 512 cores
• 1 Fermi C2050 GPU: multiple "shared-memory multiprocessors" (SMs) sharing an L2 cache
• 1 SM: I-cache, scheduler, dispatch units, 32 streaming processors (cores), load/store units, special function units, 64K configurable L1 cache / shared memory
• 1 streaming processor (SP): FPU, integer unit
(Source: NVidia)

GPU Architecture Paradigm
• Optimized for high throughput
  - In theory, ~10x to ~100x higher throughput than CPU is possible
• Massive hardware multithreading hides memory access latency
• Massive parallelism
• GPUs are good at data-parallel computations
  - multiple threads executing the same instruction on different data, preferably located adjacently in memory

The future will be heterogeneous!
Need 2 kinds of cores – often on the same chip:
• For non-parallelizable code:
  Parallelism only from running several serial applications simultaneously on different cores
  (e.g. on desktop: word processor, email, virus scanner, … not much more)
  → Few (ca. 4-8) "fat" cores – designed for low latency, for high single-thread performance
  (power-hungry, area-costly, large caches, out-of-order issue / speculation)
• For well-parallelizable code:
  → hundreds of simple cores (GPU-/SCC-like) – designed for high throughput at low power consumption (power + area efficient)

Heterogeneous / Hybrid Multi-/Manycore
• Cell/B.E.
• GPU-based system:
  (Figure: CPU offloads heavy computation to the GPU; data transfer between main memory and device memory)

Heterogeneous / Hybrid Multi-/Manycore Systems
Key concept: Master-slave parallelism, offloading
• General-purpose CPU (master) processor controls execution of slave processors by submitting tasks to them and transferring operand data to the slaves' local memory
  → Master offloads computation to the slaves
• Slaves often optimized for heavy throughput computing
  - Master could do something else while waiting for the result, or switch to a power-saving mode
• Master and slave cores might reside on the same chip (e.g., Cell/B.E.) or on different chips (e.g., most GPU-based systems today)
• Slaves might have access to off-chip main memory (e.g., Cell) or not (e.g., today's GPUs)
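The offloading pattern above can be written, for example, with OpenMP target directives in C; this is a minimal sketch assuming a compiler with OpenMP 4.5+ device support (CUDA or OpenCL would be common alternatives). The map clauses correspond to the operand/result transfers between main memory and device memory described above.

    #include <stdio.h>

    #define N 1000000   /* illustrative problem size */

    /* Master (CPU) offloads the heavy loop to a slave device (e.g. a GPU).
     * map(to:...) copies operands to device memory, map(tofrom:...) copies
     * the result back. */
    void saxpy_offload(float a, const float *x, float *y, int n) {
        #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void) {
        static float x[N], y[N];
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
        saxpy_offload(3.0f, x, y, N);
        printf("y[0] = %.1f\n", y[0]);   /* expect 5.0 */
        return 0;
    }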

Multi-GPU Systems
• Connect one or few general-purpose (CPU) multicore processors with shared off-chip memory to several GPUs
• Cost- and (quite) energy-effective if the offloaded computation fits the GPU architecture well
(Figure: multicore CPUs with L2 caches and shared main memory (DRAM), attached to several GPUs)

FPGA – Field Programmable Gate Array
• Increasingly popular in high-performance computing
("Altera StratixIVGX FPGA" by Altera Corp. Licensed under CC BY 3.0 via Wikimedia Commons)

Example: Beowulf-class PC Clusters
• with off-the-shelf CPUs (Xeon, Opteron, …)

Example: Tetralith (NSC, 2018/2019)
• Each Tetralith compute node has 2 Intel Xeon Gold 6130 CPUs (2.1 GHz), each with 16 cores (32 hardware threads)
• 1832 "thin" nodes with 96 GiB of primary memory (RAM)
• and 60 "fat" nodes with 384 GiB
  → 1892 nodes, 60544 cores in total
• All nodes are interconnected with a 100 Gbps Intel Omni-Path network (fat-tree topology)
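On a distributed-memory cluster like Tetralith, the nodes cooperate via message passing. Below is a minimal MPI point-to-point example in C; it is a generic sketch, nothing in it is specific to Tetralith or Omni-Path, and it would typically be launched through the site's batch system with an mpirun/srun-style command.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal point-to-point message passing: rank 0 sends a value to rank 1. */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            double msg = 42.0;
            MPI_Send(&msg, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            double msg;
            MPI_Recv(&msg, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %.1f\n", msg);
        }
        MPI_Finalize();
        return 0;
    }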

The Challenge
• Today, basically all computers are parallel computers!
  - Single-thread performance stagnating
  - Dozens of cores and hundreds of HW threads available per server
  - May even be heterogeneous (core types, accelerators)
  - Data locality matters
  - Large clusters for HPC and data centers require message passing
• Utilizing more than one CPU core requires thread-level parallelism
• One of the biggest software challenges: Exploiting parallelism
  - Need LOTS of (mostly independent) tasks to keep cores/HW threads busy and overlap waiting times (cache misses, I/O accesses)
  - All application areas, not only traditional HPC
    - General-purpose, graphics, games, embedded, DSP, …
  - Affects HW/SW system architecture, programming languages, algorithms, data structures …
  - Parallel programming is more error-prone (deadlocks, races, further sources of inefficiencies)
    - and thus more expensive and time-consuming

Can't the compiler fix it for us?
• Automatic parallelization?
  - at compile time:
    - Requires static analysis – not effective for pointer-based languages
    - needs programmer hints / rewriting ...
    - ok for a few benign special cases:
      - (Fortran) loop SIMDization
      - extraction of instruction-level parallelism, …
  - at run time (e.g. speculative multithreading):
    - High overheads, not scalable
• More about parallelizing compilers in TDDD56 + TDDC78
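A small C example of the compile-time situation described above: in scale_ok the restrict qualifiers and the simple loop let static analysis prove the iterations independent, so the compiler can typically SIMDize it automatically, while in scale_maybe_aliased the pointers might overlap, which usually forces conservative sequential code unless the programmer rewrites it or adds hints.

    /* Easily auto-vectorized: restrict promises the arrays do not overlap,
     * so iterations are independent and the compiler can SIMDize the loop. */
    void scale_ok(float *restrict dst, const float *restrict src, float a, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = a * src[i];
    }

    /* Hard for the compiler: dst and src might alias (e.g. dst == src + 1),
     * so static analysis cannot prove the iterations independent, and
     * automatic parallelization/vectorization is usually suppressed. */
    void scale_maybe_aliased(float *dst, const float *src, float a, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = a * src[i];
    }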

Bread-and-Butter Programming is Not Sufficient for High-Performance Computing
• Resource-aware programming can give orders of magnitude in performance
• Exploit multiple levels of parallelism and optimizations:
  - Native progr. language
  - Multi-Core //
  - Loop tiling for data locality
  - SIMD //
(Table: successive speedups from these optimizations. Source: Turing Award lecture by J. Hennessy and D. Patterson, 2018. See also: J. Hennessy, D. Patterson: A New Golden Age for Computer Architecture. Communications of the ACM 62(2):48-60, Feb. 2019.)

And worse yet,
• A lot of variations/choices in hardware
  - Many will have performance implications
  - No standard parallel programming model
    - portability issue
• Understanding the hardware will make it easier to make programs get high performance
  - Performance-aware programming gets more important, also for single-threaded code
  - Adaptation leads to portability issue again
• How to write future-proof parallel programs?

The Challenge
• Bad news 1: Many programmers (also less skilled ones) need to use parallel programming in the future
  - Sequential / few-threaded languages: C/C++, Java, ... not designed for exploiting massive parallelism
• Bad news 2: There will be no single uniform parallel programming model as we were used to in the old sequential times
  → Several competing general-purpose and domain-specific languages and their concepts will co-exist

What we had learned so far …
• Sequential von-Neumann model: programming, algorithms, data structures, complexity
(Graph: sequential execution time, e.g. T(n) = O(n log n), as a function of problem size n)

… and what we need now
• Parallel programming!
  - Parallel algorithms and data structures
  - Analysis / cost model: parallel time, work, cost; scalability
  - Performance-awareness: data locality, load balancing, communication
(Graph: parallel execution time T(n,p) = O((n log n)/p + log p), as a function of problem size n and the number p of processing units used)

Questions?
