
18‐447 Lecture 21: Parallel Architecture Overview

James C. Hoe, Department of ECE, Carnegie Mellon University

Housekeeping

• Your goal today
  – see the diverse landscape of parallel architectures/organizations
  – set the context for focused topics to come
• Notices
  – Lab 4: status check 4/26, due 5/7
  – HW5: Friday, 5/7
  – Midterm 2 regrade: Monday, 5/3
  – Midterm 3: Tuesday, 5/11, 5:30~6:25pm
• Readings
  – P&H Ch 6

Parallelism Defined

• T1 (work, measured in time): time to do the work with 1 PE
• T∞ (critical path): time to do the work with infinitely many PEs; T∞ is bounded by dataflow dependence
• Average parallelism: Pavg = T1 / T∞
• For a system with p PEs: Tp ≥ max{ T1/p, T∞ }
• When Pavg >> p: Tp ≈ T1/p, aka "linear speedup"

Running example (worked out below): x = a + b; y = b * 2; z = (x-y) * (x+y)
[Figure: dataflow graph of the example, with inputs a and b feeding +, *2, -, +, and * nodes]
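To make the bounds concrete, here they are worked out for the running example, assuming each of its five operations (+, *2, -, +, *) takes unit time:

    T_1 = 5, \qquad T_\infty = 3 \;\;(\text{longest dependence chain: } a{+}b \to x{-}y \to *)
    P_{avg} = T_1 / T_\infty = 5/3
    T_2 \ge \max\{\,T_1/2,\; T_\infty\,\} = \max\{2.5,\; 3\} = 3 \quad (\text{with } p = 2 \text{ PEs})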

A Non-Parallel "Architecture"

• Memory holds both program and data
  – instructions and data in a linear memory array
  – instructions can be modified as data
• Sequential instruction processing
  1. program counter (PC) identifies current instruction
  2. fetch instruction from memory
  3. update some state (e.g., PC and memory) as a function of current state (according to instruction)
  4. repeat

[Figure: a linear memory array with locations 0, 1, 2, 3, 4, 5, ...]

Dominant paradigm since its invention

Inherently Parallel Architecture

• Consider a von Neumann program
  – What is the significance of the program order?
  – What is the significance of the storage locations?

    v := a + b ;
    w := b * 2 ;
    x := v - w ;
    y := v + w ;
    z := x * y ;

• Dataflow program (see the software analogy sketched below)
  – instruction ordering implied by data dependence
  – instruction specifies who receives the result
  – instruction executes when operands received
  – no program counter, no intermediate state

[Figure: dataflow graph with inputs a and b feeding +, *2, -, +, and * nodes, producing z]
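As a software analogy only (the dataflow machine above has no program counter at all), here is a minimal sketch using OpenMP task dependences, which are an assumption of this example and not part of the lecture: each "instruction" becomes a task that fires once its operands are ready.

    /* Dataflow-style execution of the example via OpenMP 4.0+ task
     * dependences. Compile with -fopenmp (gcc/clang). */
    #include <stdio.h>

    int main(void)
    {
        double a = 3, b = 4, v, w, x, y, z;
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: v)
            v = a + b;                     /* fires immediately             */
            #pragma omp task depend(out: w)
            w = b * 2;                     /* independent of the task above */
            #pragma omp task depend(in: v, w) depend(out: x)
            x = v - w;                     /* fires when v and w are ready  */
            #pragma omp task depend(in: v, w) depend(out: y)
            y = v + w;                     /* may run in parallel with x    */
            #pragma omp task depend(in: x, y) depend(out: z)
            z = x * y;
            #pragma omp taskwait
            printf("z = %g\n", z);         /* (7-8)*(7+8) = -15 */
        }
        return 0;
    }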

[dataflow figure and example from Arvind]

More Conventionally Parallel

[Figure: processors P1, P2, P3, ..., Pn sharing a common Memory]

Do you naturally think parallel or sequential?

Simple First Look: Data Parallelism

• Abundant in matrix operations and scientific/numerical applications
• Example: DAXPY/LINPACK (inner loop of Gaussian elimination and matrix-mult); a parallel software sketch follows below

    for (i = 0; i < N; i++)
        Y[i] = a*X[i] + Y[i];
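As a software-side illustration of dividing the independent iterations among however many processors are available, here is a minimal sketch assuming OpenMP; the function name and signature are illustrative, not from the slide.

    /* DAXPY: Y = a*X + Y. Every iteration is independent, so the loop can
     * be split across p processors. Compile with -fopenmp (gcc/clang). */
    void daxpy(int n, double a, const double *X, double *Y)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            Y[i] = a * X[i] + Y[i];
    }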

    for (i = 0; i < N; i++)
        c[i] = foo(a[i], b[i]);

• Instantiate k copies of the hardware unit foo to run k iterations of the loop in parallel

[Figure: eight parallel foo units, each consuming one a_i and one b_i and producing c_i, for i = 0..7]

Parallelism vs Concurrency

    for (i = 0; i < N; i++)
        c[i] = foo(a[i], b[i]);

[Figure: a single pipelined foo unit; the a_i and b_i streams are consumed 1 element per cycle, and each c_i becomes available many cycles later]

Concurrency? Yes! Can combine concurrency and pipelining at the same time.

A Spotty Tour of the MP Universe

Classic Thinking: Flynn's Taxonomy

• SISD (Single Instruction stream, Single Data stream): your vanilla uniprocessor
• SIMD (Single Instruction stream, Multiple Data streams): many PEs following a common instruction stream/control-flow, working in parallel on different data
• MISD (Multiple Instruction streams, Single Data stream): DB query??
• MIMD (Multiple Instruction streams, Multiple Data streams): fully independent programs/control-flows (collaborating SISDs?)

SIMD vs. MIMD (an abstract and general depiction)

[Figure: SIMD shown as one cntrl driving many alu/data pairs; MIMD shown with a cntrl per alu/data pair; "together or separate?" asked of both control and data]

Variety in the details

• Scale, technology, application . . . .
• Concurrency
  – granularity of concurrency (how finely is work divided): whole programs down to bits
  – regularity: all "nodes" look the same and look out to the same environment
  – static vs. dynamic: e.g., load-balancing
• Communication
  – message-passing vs. shared memory
  – granularity of communication: words to pages
  – interconnect and interface design/performance

SIMD: Vector Machines

• Vector data type and regfile
• Deeply pipelined fxn units
• Matching high-perf load-store units and multi-banked memory
• E.g., Cray 1, circa 1976
  – 64 x 64-word vector RF
  – 12 pipelines, 12.5ns
  – ECL 4-input NAND and SRAM (no caches!!)
  – 2x 25-ton cooling system
  – 250 MIPS peak for ~10M 1970$

[Figure from H&P CAaQA, COPYRIGHT 2007 Elsevier. ALL RIGHTS RESERVED.]

SIMD: Big-Irons

• Sea of PEs on a regular grid
  – synchronized common cntrl
  – direct access to local mem
  – nearest-neighbor exchanges
  – special support for broadcast, reduction, etc.
• E.g., Thinking Machines CM-2
  – 1000s of bit-sliced PEs lock-step controlled by a common sequencer
  – "hypercube" topology
  – special external I/O nodes

[Figure: a front end controlling a grid of PE + memory (M) nodes connected by routers (R)]

SIMD: Modern Renditions

• SSE (Streaming SIMD Extension), 1999
  – 16 x 128-bit "vector" registers, 4 floats or 2 doubles
  – SIMD instructions: ld/st, arithmetic, shuffle, bitwise (sketched below)
  – SSE4 with true full-width operations
  Core i7 does up to 4 sp-mult & 4 sp-add per cyc per core (24 GFLOPS @ 3GHz)
• AVX 2 doubles the above (over 1 TFLOPS/chip)
• "GP" GPUs . . . (next slide)

Simple hardware, big perf numbers, but only if massively data-parallel app!!
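A hedged sketch of this SIMD style in C using SSE intrinsics (the function name saxpy_sse is illustrative, and n is assumed to be a multiple of 4): each _mm_mul_ps/_mm_add_ps operates on 4 packed single-precision floats at once.

    #include <xmmintrin.h>   /* SSE intrinsics */

    void saxpy_sse(int n, float a, const float *x, float *y)
    {
        __m128 va = _mm_set1_ps(a);                   /* broadcast a to 4 lanes */
        for (int i = 0; i < n; i += 4) {
            __m128 vx = _mm_loadu_ps(&x[i]);          /* load 4 floats          */
            __m128 vy = _mm_loadu_ps(&y[i]);
            vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);  /* 4 mults + 4 adds       */
            _mm_storeu_ps(&y[i], vy);                 /* store 4 results        */
        }
    }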

8+ TFLOPs Nvidia GP104 GPU

• 20 Streaming Multiprocessors (SMs)
  – 128 SIMD lanes per SM
  – 1 mul, 1 add per lane
  – 1.73 GHz (boosted)
• Performance
  – 8874 GFLOPS
  – 320 GB/sec
  – 180 Watt

How many FLOPS per Watt? How many FLOPS per DRAM byte accessed?

[NVIDIA GeForce GTX 1080 Whitepaper]

Aside: IPC, ILP, and TLP

• Each cycle, select a "ready" thread from the pool (a selection sketch follows the figure below)
  – only one instruction per thread in flight at once
  – on a long-latency stall, remove the thread from scheduling
• Simpler and faster implementation since
  – no data dependence, hence no stall or forwarding
  – no penalty in making pipeline deeper

[Figure: pipeline diagram with one instruction from each of threads 1..8 issued per cycle in round-robin order, each flowing through stages A..H; e.g., barrel processing (HEP, Smith)]
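A minimal software sketch of the round-robin "ready thread" selection described above; the thread context structure and NUM_THREADS are hypothetical, for illustration only.

    #include <stdbool.h>

    #define NUM_THREADS 8

    typedef struct {
        unsigned pc;      /* per-thread program counter           */
        bool     stalled; /* e.g., waiting on a long-latency load */
    } thread_ctx_t;

    static thread_ctx_t threads[NUM_THREADS];

    /* Each cycle: issue from the next non-stalled thread after the one that
     * issued last; only one instruction per thread is ever in flight. */
    int select_thread(int last_issued)
    {
        for (int i = 1; i <= NUM_THREADS; i++) {
            int t = (last_issued + i) % NUM_THREADS;
            if (!threads[t].stalled)
                return t;
        }
        return -1;        /* every thread is stalled; pipeline idles */
    }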

What 1 TFLOP meant in 1996

• ASCI Red, 1996: World's 1st TFLOP computer!!
  – $50M, 1600 ft2 system
  – ~10K 200MHz PentiumPro's
  – ~1 TByte DRAM (total)
  – 500kW to power + 500kW on cooling
• Advanced Simulation and Computing Initiative
  – how to know if the nuclear stockpile is still good if you can't blow one up to find out?
  – requires ever more expensive simulation as the stockpile ages
  – Red 1.3 TFLOPS 1996; Blue Mountain/Pacific 4 TFLOPS 1998; White 12 TFLOPS 2000; Purple 100 TFLOPS 2005; . . . IBM Summit 200 PetaFLOPS

SIMD vs. MIMD (an abstract and general depiction)

[Figure repeated from above: SIMD as one cntrl driving many alu/data pairs vs. MIMD with a cntrl per alu/data pair; "together or separate?"]

MIMD: Message Passing

[Figure: processors P0, P1, P2 (running threads i, j, k), each with a private memory M and a network interface unit (NIU) with tx/rx FIFOs, connected by an Interconnect]

• Private address space and memory per processor
• Parallel threads on different processors communicate by explicit sending and receiving of messages (a sketch follows below)
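A minimal sketch of such explicit message passing, assuming MPI as the API (the lecture does not prescribe one; the ranks and payload here are illustrative): rank 0 sends one word, rank 1 receives it, and no shared address space is involved.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                 /* explicit send ...            */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {          /* ... matched by explicit recv */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }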

MIMD Message Passing Systems (by network interface placement)

• Beowulf Clusters (I/O)
  – Linux PCs connected by Ethernet
• High-Performance Clusters (I/O bus)
  – stock workstations/servers but exotic interconnects, e.g., Myrinet, HIPPI, Infiniband, etc.
• Supers (memory bus)
  – stock CPUs on custom platform
  – e.g., Cray XT5 ("fastest" in 2011), 224K AMD Opteron cores
• Inside the CPU
  – single-instruction send/receive
  – e.g., iAPX 432 (1981), Transputers (80s)

[Figure: a node with CPU (ALU, RF), cache ($), Memory Bus (GB/sec), I/O Bridge, I/O Bus, Main Memory, and Disks; X's mark candidate network-interface attachment points]

MIMD Shared Memory: Symmetric Multiprocessors (SMPs)

• Symmetric means
  – identical procs connected to common memory
  – all procs have equal access to system (mem & I/O)
  – OS can schedule any process on any processor
• Uniform memory access (UMA)
  – processor/memory connected by bus or crossbar
  – all processors have equal memory access performance to all memory locations
  – caches need to stay coherent (a programming-level sketch follows below)

[Figure: two CPUs (each with ALU, RF, cache) on a shared Memory Bus (GB/sec) with Main Memory, I/O Bridge, I/O Bus, and Disks]
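For contrast with the message-passing sketch above, a minimal shared-memory sketch using POSIX threads (the variable and function names are illustrative): both threads update the same location with ordinary loads and stores; the mutex orders the updates in software, while cache coherence keeps the hardware copies consistent.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;   /* lives in the single shared address space */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);
            counter++;                       /* plain store to shared memory */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, worker, NULL);
        pthread_create(&t1, NULL, worker, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("counter = %ld\n", counter);  /* prints 2000000 */
        return 0;
    }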

MIMD Shared Memory: Big-Iron Distributed Shared Memory

• UMA is hard to scale due to concentration of BW
• Large-scale SMPs have non-uniform memory access (NUMA)
  – "local" memory pages (faster to access)
  – "remote" memory pages (slower to access)
  – cache-coherence still possible but complicated
• E.g., SGI Origin 2000
  – up to 512 CPUs and 512GB DRAM ($40M)
  – 48 128-CPU systems were collectively the 2nd fastest computer (3 TFLOPS) in 1999

[Figure: NUMA nodes, each with a PE, memory, and directory (dir), connected through NIUs to a network]

MIMD Shared Memory: it is everywhere now!

• General-purpose "multicore" processors implement SMP (not UMA) on a single chip
• Moore's Law scaling in number of cores

[Figure: Intel Xeon e5345; from P&H CO&D, COPYRIGHT 2009 Elsevier. ALL RIGHTS RESERVED.]

Today's New Normal

Intel® Xeon® E5‐2600

[https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview]

Remember how we got here . . . .

[Figure: one Big Core (1970~2005) vs. many little cores (2005~??)]

Today's Exotic

Microsoft Catapult [MICRO 2016, Caulfield, et al.]

Google TPU [Hotchips, 2017, Jeff Dean]

March Toward Exascale (10^18) HPC

[Figure from www.top500.org]
