High-Performance

P. (Saday) Sadayappan

Rupesh Nasre

– 1 – Course Overview

• Emphasis on algorithm development and programming issues for high performance
• No assumed background in computer architecture; knowledge of C is assumed
• Grading:
  • 60% Programming Assignments (4 x 15%)
  • 40% Final Exam (July 4)
• Accounts will be provided on the IIT-M system

– 2 – Course Topics

• Architectures
  • Single processor core
  • Multi-core and SMP (Symmetric Multi-Processor) systems
  • GPUs (Graphics Processing Units)
  • Short-vector SIMD instruction set architectures
• Programming Models/Techniques
  • Single processor performance issues
    • Caches and data locality
    • Data dependence and fine-grained parallelism
    • Vectorization (SSE, AVX)
  • Parallel programming models
    • Shared-Memory Parallel Programming (OpenMP)
    • GPU Programming (CUDA)

– 3 – Class Meeting Schedule

• Lecture from 9am-12pm each day, with a mid-class break

• Optional Lab session from 12-1pm each day

• 4 Programming Assignments

Topic           Due Date   Weightage
Data Locality   June 24    15%
OpenMP          June 27    15%
CUDA            June 30    15%
Vectorization   July 2     15%

– 4 – The Good Old Days for Software (Source: J. Birnbaum)

• Single-processor performance experienced dramatic improvements from clock frequency and architectural improvements (pipelining, instruction-level parallelism)
• Applications experienced automatic performance improvement

– 5 – Hitting the Power Wall

[Figure: power density (W/cm²) of Intel processors on a log scale from 1 to 1000, versus process generation (1.5µm down to 0.07µm). Power doubles every 4 years; Sun's 5-year projection: 200W total, 125 W/cm²! The Pentium®, Pentium® Pro, Pentium® II, and Pentium® III approach hot-plate power density, with nuclear-reactor and rocket-nozzle levels marked above. P = VI: 75W @ 1.5V = 50 A!]

* "New Challenges in the Coming Generations of CMOS Process Technologies" – Fred Pollack, Intel Corp., Micro32 conference keynote, 1999. Courtesy Avi Mendelson, Intel.

– 6 – Hitting the Power Wall

[Figure: CPU clock frequency over time, rising "toward a brighter tomorrow".
Source: http://img.tomshardware.com/us/2005/11/21/the_mother_of_all_cpu_charts_2005/cpu_frequency.gif]

– 7 – Hitting the Power Wall

http://img.tomshardware.com/us/2005/11/21/the_mother_of_all_cpu_charts_2005/cpu_frequency.gif

2004 – Intel cancels Tejas and Jayhawk due to "heat problems due to the extreme power consumption of the core ..."

– 8 – The Only Option: Use Many Cores

• Chip density is continuing to increase ~2x every 2 years
  • Clock speed is not
  • Number of processor cores may double instead
• There is little or no more hidden parallelism (ILP) to be found
• Parallelism must be exposed to and managed by software

Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

– 9 – Can You Predict Performance?

double W,X,Y,Z;

for(j=0;j<100000000;j++){
  W = 0.999999*X;
  X = 0.999999*W;
}
// Example loop E1

for(j=0;j<100000000;j++){
  W = 0.999999*W + 0.000001;
  X = 0.999999*X + 0.000001;
  Y = 0.999999*Y + 0.000001;
  Z = 0.999999*Z + 0.000001;
}
// Example loop E2

• E1 takes about 0.4s to execute on this machine
• Performance in GFLOPs: billions (Giga) of FLoating-point Operations per Second
• About how long does E2 take?
  1. [0-0.4s]
  2. [0.4s-0.6s]
  3. [0.6s-0.8s]
  4. More than 0.8s

E2 takes 0.35s => 2.27 GFLOPs

– 10 – ILP Affects Performance

• ILP (Instruction Level Parallelism): many operations in a sequential code could be executed concurrently if they do not have dependences
• Pipelined stages in functional units can be exploited by ILP
• Multiple functional units in a CPU can be exploited by ILP
• ILP is automatically exploited by the system when possible
• E2's statements are independent and provide ILP, but E1's statements are not, and do not provide ILP (each statement depends on the one before it)

(Example loops E1 and E2 are shown on the previous slide; a timing sketch follows below.)
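As a hedged illustration (not from the original slides): a minimal, self-contained harness that times E1 and E2. The clock_gettime timing and the printf "sink" are my additions; absolute timings are machine-dependent.

/* Sketch: timing E1 vs E2. Assumes a POSIX system; compile with
 *   gcc -O2 ilp_demo.c -o ilp_demo                                  */
#include <stdio.h>
#include <time.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    double W = 1.0, X = 1.0, Y = 1.0, Z = 1.0;
    long j;

    double t0 = now();
    for (j = 0; j < 100000000; j++) {   /* E1: dependent statements, no ILP */
        W = 0.999999 * X;
        X = 0.999999 * W;
    }
    double t1 = now();

    for (j = 0; j < 100000000; j++) {   /* E2: independent statements, ILP */
        W = 0.999999 * W + 0.000001;
        X = 0.999999 * X + 0.000001;
        Y = 0.999999 * Y + 0.000001;
        Z = 0.999999 * Z + 0.000001;
    }
    double t2 = now();

    /* E1 does 2 FLOPs/iteration, E2 does 8; print time and GFLOP/s.
     * Using W..Z in the output keeps the loops from being removed. */
    printf("E1: %.3f s (%.2f GFLOPs)\n", t1 - t0, 2e8 / (t1 - t0) / 1e9);
    printf("E2: %.3f s (%.2f GFLOPs)\n", t2 - t1, 8e8 / (t2 - t1) / 1e9);
    printf("(sink: %g)\n", W + X + Y + Z);
    return 0;
}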

– 11 – Example: Memory Access Cost

#define N 32
#define T 1024*1024
// #define N 4096
// #define T 64

double A[N][N];
for(it=0; it<T; it++)
  ...   // remainder of the code truncated in the source

• About how long will the code run for the 4Kx4K matrix?
  1. [0-0.3s]
  2. [0.3s-0.6s]
  ...

(A reconstructed sketch of this benchmark follows below.)
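A minimal reconstruction, under one stated assumption: the truncated loop sweeps the whole N x N array T times, so the total number of accesses (N*N*T = 2^30) is identical for the (N=32, T=1M) and (N=4096, T=64) configurations; only the working-set size changes. The sum accumulator is my addition to keep the reads live.

/* Sketch: run under time(1); toggle the #defines to switch configurations. */
#include <stdio.h>

#define N 32
#define T (1024*1024)
/* #define N 4096 */
/* #define T 64   */

static double A[N][N];

int main(void) {
    double sum = 0.0;
    for (long it = 0; it < T; it++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += A[i][j];          /* row-major sweep of the array */
    printf("sum = %g\n", sum);           /* prevents dead-code elimination */
    return 0;
}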

– 12 – Cache Memories

Adapted from slides by Prof. Randy Bryant (CMU 15-213) and Dr. Jeffrey Jones (OSU CSE 5441)

Topics
  • Generic cache memory organization
  • Impact of caches on performance

– 13 – Cache Memories

Cache memories are small, fast SRAM-based memories managed automatically in hardware.
  • Hold frequently accessed blocks of main memory
The CPU looks first for data in L1, then in L2, then in main memory.
Typical bus structure:

[Figure: typical bus structure. On the CPU chip, the register file and ALU connect to the L1 cache; the cache bus links L1 to the off-chip L2 cache; the system bus connects the bus interface to the I/O bridge, which connects via the memory bus to main memory.]

– 14 – Inserting an L1 Cache Between the CPU and Main Memory

The tiny, very fast CPU register file has room for four 4-byte words. The transfer unit between the CPU register file and the cache is a 4-byte block.

The small, fast L1 cache has room for two 4-word blocks. The transfer unit between the cache and main memory is a 4-word block (16 bytes).

The big, slow main memory has room for many 4-word blocks.

[Figure: cache lines 0-1; memory blocks 10 (a b c d), 21 (p q r s), 30 (w x y z).]

– 15 – Direct-Mapped Cache

[Figure: a 64-byte memory, viewed as 8 blocks of 8 bytes each (byte addresses 000000-111111), mapped onto a 4-line cache where each line holds a valid bit, a tag, and data. Memory blocks 0-7 map round-robin onto lines 0-3: each memory block maps to a particular line in the cache.]

– 16 – Direct-Mapped Cache Lines and Pre-fetching

[Figure: a byte-addressable RAM; the cache loads an entire line (bytes AA-B1) at once. On a system with 64-bit data paths, 8 bytes are transferred per cache fill, while only the bytes appropriate for the data type are transferred on to the register. J. S. Jones]

– 17 – Accessing Direct-Mapped Caches

Set selection
  • Use the set (= line) index bits to determine the set of interest.

[Figure: the s set-index bits of the m-bit address (split as t tag bits, s set-index bits, b block-offset bits) select one set among sets 0 .. S-1; each set/line holds a valid bit, a tag, and a cache block.]

– 18 – Accessing Direct-Mapped Caches

Line matching and word selection
  • Line matching: find a valid line in the selected set with a matching tag
  • Word selection: then extract the word

[Figure: the selected set (i) holds a valid bit (1), tag 0110, and words w0-w3 at byte offsets 0-7; the address splits into tag 0110, set index i, block offset 100.]

(1) The valid bit must be set.
(2) The tag bits in the cache line must match the tag bits in the address.
(3) If (1) and (2) hold, it is a cache hit, and the block offset selects the starting byte.
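To make the (tag, set index, block offset) split concrete, here is a small sketch; the function and struct names are mine, not from the slides. The example uses the next slide's parameters (t=1, s=2, b=1), where address 13 = 1101₂ splits into tag 1, set 10₂ = 2, offset 1.

/* Sketch: decomposing an m-bit address into (tag, set index, block offset). */
#include <stdio.h>
#include <stdint.h>

typedef struct { uint64_t tag, set, offset; } AddrFields;

static AddrFields split_address(uint64_t addr, int s, int b) {
    AddrFields f;
    f.offset = addr & ((1ULL << b) - 1);          /* low b bits   */
    f.set    = (addr >> b) & ((1ULL << s) - 1);   /* next s bits  */
    f.tag    = addr >> (s + b);                   /* remaining t bits */
    return f;
}

int main(void) {
    AddrFields f = split_address(13, 2, 1);       /* s = 2, b = 1 */
    printf("tag=%llu set=%llu offset=%llu\n",
           (unsigned long long)f.tag, (unsigned long long)f.set,
           (unsigned long long)f.offset);         /* tag=1 set=2 offset=1 */
    return 0;
}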

– 19 – Direct-Mapped Cache Simulation

M = 16 byte addresses, B = 2 bytes/block, S = 4 sets, E = 1 entry/set
Address bits: t = 1 (tag), s = 2 (set index), b = 1 (offset)

Address trace (reads): 0 [0000₂], 1 [0001₂], 13 [1101₂], 8 [1000₂], 0 [0000₂]

(1) 0 [0000₂] (miss): line 0 ← v=1, tag=0, M[0-1]
(2) 1 [0001₂] (hit): same block as address 0
(3) 13 [1101₂] (miss): line 2 ← v=1, tag=1, M[12-13]
(4) 8 [1000₂] (miss): line 0 ← v=1, tag=1, M[8-9] (evicts M[0-1]); line 2 still holds M[12-13]
(5) 0 [0000₂] (miss): line 0 ← v=1, tag=0, M[0-1] (evicts M[8-9]); line 2 still holds M[12-13]
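The same trace can be replayed mechanically. A minimal sketch of a direct-mapped cache simulator, assuming the slide's parameters; the structure and names are mine, not from the slides.

/* Sketch: direct-mapped cache simulator for the trace above
 * (M=16 addresses, B=2 bytes/block, S=4 sets, E=1; t=1, s=2, b=1). */
#include <stdio.h>

#define S 4      /* number of sets (= lines, since E = 1) */
#define B 1      /* b = 1 block-offset bit */
#define SBITS 2  /* s = 2 set-index bits */

int main(void) {
    int valid[S] = {0}, tag[S];
    int trace[] = {0, 1, 13, 8, 0};

    for (int i = 0; i < 5; i++) {
        int addr = trace[i];
        int set = (addr >> B) & ((1 << SBITS) - 1);
        int t   = addr >> (B + SBITS);
        if (valid[set] && tag[set] == t) {
            printf("%2d: hit  (set %d)\n", addr, set);
        } else {                         /* miss: load the 2-byte block */
            printf("%2d: miss (set %d, load M[%d-%d])\n",
                   addr, set, addr & ~1, (addr & ~1) + 1);
            valid[set] = 1;
            tag[set]   = t;
        }
    }
    return 0;
}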

– 20 – General Structure of Cache Memory

A cache is an array of sets:
  • Each set contains one or more lines (E lines per set)
  • Each line holds a block of data (B = 2^b bytes per cache block), plus 1 valid bit and t tag bits
  • There are S = 2^s sets

[Figure: sets 0 .. S-1, each with E lines of the form (valid, tag, bytes 0 1 ... B-1).]

Cache size: C = B x E x S data bytes

– 21 – Addressing Caches

Address A: [ t bits (tag) | s bits (set index) | b bits (block offset) ], bits m-1 .. 0

[Figure: the set-index bits select one of sets 0 .. S-1; each line in the set holds (valid, tag, bytes 0 1 ... B-1).]

The word at address A is in the cache if the tag bits in one of the valid lines in set <set index> match <tag>.
The word contents begin at offset <block offset> bytes from the beginning of the block.

– 22 – Cache Mapping Examples

Address A: [ t bits | s bits | b bits ], bits m-1 .. 0

Example   m    C     B    E
   1      32   1024   4    1
   2      32   1024   8    4
   3      32   1024  32   32
   4      64   2048  64   16

– 23 – Cache Mapping Examples

Address A: [ t bits | s bits | b bits ], bits m-1 .. 0

Example   m    C     B    E    S    t   s  b
   1      32   1024   4    1  256   22  8  2
   2      32   1024   8    4   32   24  5  3
   3      32   1024  32   32    1   27  0  5
   4      64   2048  64   16    2   57  1  6
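The solved columns follow from S = C/(B*E), b = log2(B), s = log2(S), and t = m - s - b. A small sketch that reproduces the table (names are mine):

#include <stdio.h>

static int log2i(int x) {            /* x is assumed a power of two */
    int n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    int ex[4][4] = {                 /* { m, C, B, E } per example */
        {32, 1024,  4,  1},
        {32, 1024,  8,  4},
        {32, 1024, 32, 32},
        {64, 2048, 64, 16},
    };
    printf("Example   m     C    B    E    S    t   s  b\n");
    for (int i = 0; i < 4; i++) {
        int m = ex[i][0], C = ex[i][1], B = ex[i][2], E = ex[i][3];
        int S = C / (B * E);                 /* number of sets        */
        int b = log2i(B), s = log2i(S);      /* offset and index bits */
        int t = m - s - b;                   /* remaining tag bits    */
        printf("%4d     %2d  %5d  %3d  %3d  %3d  %3d  %2d  %d\n",
               i + 1, m, C, B, E, S, t, s, b);
    }
    return 0;
}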

– 24 – Cache Performance Metrics

Miss Rate
  • Fraction of memory references not found in cache (misses/references)
  • Typical numbers:
    • 3-10% for L1
    • can be quite small (e.g., < 1%) for L2, depending on size, etc.

Hit Time
  • Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
  • Typical numbers:
    • 1 clock cycle for L1
    • 3-8 clock cycles for L2

Miss Penalty
  • Additional time required because of a miss
    • Typically 25-100 cycles for main memory

– 25 – Overall Cache Performance

AMAT (Avg. Mem. Access Time) = t_hit + prob_miss * penalty_miss

• reduce hit time
• reduce miss penalty
• reduce miss rate

– 26 – Cache Performance: One Level

AMAT = t_hit + prob_miss * penalty_miss

• L1$ hits in 1 cycle, with 80% hit rate
• main memory hits in 1000 cycles

AMAT = 1 + (1 - .8)(1000) = 201

• L1$ hits in 1 cycle, with 85% hit rate
• main memory hits in 1000 cycles

AMAT = 1 + (1 - .85)(1000) = 151

– 27 – Cache Performance: Multi-Level

AMAT_i = t_hit_i + prob_miss_i * penalty_miss_i

• L1$ hits in 1 cycle, with 50% hit rate
• L2$ hits in 10 cycles, with 75% hit rate
• L3$ hits in 100 cycles, with 90% hit rate
• main memory hits in 1000 cycles

AMAT1 = 1 + (1 - .5)(AMAT2)
      = 1 + (1 - .5)(10 + (1 - .75)(AMAT3))
      = 1 + (1 - .5)(10 + (1 - .75)(100 + (1 - .9)(AMATm)))
      = 1 + (1 - .5)(10 + (1 - .75)(100 + (1 - .9)(1000)))
      = 31

J. S. Jones

– 28 – Cache Performance: Multi-Level

AMAT_i = t_hit_i + prob_miss_i * penalty_miss_i

• L1$ hits in 1 cycle, with 25% hit rate
• L2$ hits in 6 cycles, with 80% hit rate
• L3$ hits in 60 cycles, with 95% hit rate
• main memory hits in 1000 cycles

AMAT1 = 1 + (1 - .25)(AMAT2)
      = 1 + (1 - .25)(6 + (1 - .8)(AMAT3))
      = 1 + (1 - .25)(6 + (1 - .8)(60 + (1 - .95)(AMATm)))
      = 1 + (1 - .25)(6 + (1 - .8)(60 + (1 - .95)(1000)))
      = 22

J. S. Jones
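Both multi-level examples (and the one-level examples two slides back) follow the same recurrence, AMAT_i = t_hit_i + prob_miss_i * AMAT_{i+1}, with main memory as the innermost level. A hedged sketch; the function and array names are mine, but the numbers reproduce the slides' results.

#include <stdio.h>

/* hit_time[i] and hit_rate[i] per cache level; mem_time for main memory. */
static double amat(const double *hit_time, const double *hit_rate,
                   int levels, double mem_time) {
    if (levels == 0)
        return mem_time;                     /* innermost level: DRAM */
    return hit_time[0] + (1.0 - hit_rate[0])
                         * amat(hit_time + 1, hit_rate + 1, levels - 1, mem_time);
}

int main(void) {
    double t1[] = {1},          r1[] = {0.80};
    double t2[] = {1, 10, 100}, r2[] = {0.50, 0.75, 0.90};
    double t3[] = {1, 6, 60},   r3[] = {0.25, 0.80, 0.95};
    printf("one level, 80%% hits : %g\n", amat(t1, r1, 1, 1000)); /* 201 */
    printf("three levels (ex. 1): %g\n", amat(t2, r2, 3, 1000)); /*  31 */
    printf("three levels (ex. 2): %g\n", amat(t3, r3, 3, 1000)); /*  22 */
    return 0;
}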

– 29 – Layout of C Arrays in Memory

C arrays are allocated in row-major order
  • each row in contiguous memory locations
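Row-major order means &a[i][j] sits at a fixed offset from the array base: base + (i*N + j)*sizeof(element). A quick sketch that checks the formula (the value N = 4 and the indices are arbitrary choices of mine):

#include <stdio.h>

#define N 4

int main(void) {
    int a[N][N];
    char *base = (char *)a;
    int i = 2, j = 3;
    /* Both expressions should print the same address. */
    printf("&a[%d][%d] = %p, formula gives %p\n", i, j,
           (void *)&a[i][j],
           (void *)(base + (i * N + j) * sizeof(int)));
    return 0;
}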

– 30 – Layout of C Arrays in Memory

C arrays are allocated in row-major order
  • each row in contiguous memory locations

Stepping through columns in one row:
  • for (i = 0; i < N; i++) sum += a[0][i];
  • accesses successive elements
  • if block size (B) > 4 bytes, exploit spatial locality
    • compulsory miss rate = 4 bytes / B

Stepping through rows in one column:
  • for (i = 0; i < n; i++) sum += a[i][0];
  • accesses distant elements
  • no spatial locality!
    • compulsory miss rate = 1 (i.e. 100%)

– 31 – Writing Cache Friendly Code

Repeated references to variables are good (temporal locality)
Stride-1 reference patterns are good (spatial locality)

Example: summing all elements of a 2D array
  • cold cache, 4-byte words, 4-word cache blocks

int sumarrayrows(int a[N][N])
{
    int i, j, sum = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
/* Miss rate = 1/4 = 25% */

int sumarraycols(int a[N][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}
/* Miss rate = 100% */
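A hedged harness for comparing the two traversals above; the choice N = 4096, the prototypes, and the timing code are mine, not from the slides (assumes a POSIX clock_gettime, and that N is defined before the two functions).

#include <stdio.h>
#include <time.h>

#define N 4096

static int a[N][N];

int sumarrayrows(int a[N][N]);   /* defined above */
int sumarraycols(int a[N][N]);

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    double t0 = now();
    int r = sumarrayrows(a);     /* stride-1: ~25% miss rate */
    double t1 = now();
    int c = sumarraycols(a);     /* stride-N: ~100% miss rate */
    double t2 = now();
    printf("rows: %.3f s, cols: %.3f s (sums %d %d)\n",
           t1 - t0, t2 - t1, r, c);
    return 0;
}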

– 32 – Matrix Multiplication Example

Major Cache Effects to Consider
  • Total cache size
    • Exploit temporal locality and keep the working set small (e.g., by using blocking)
  • Block size
    • ...

/* ijk */
for (i=0; i< ...

(The code and its "Variable sum" annotation are truncated in the source; a reconstructed ijk loop nest follows the "Matrix Multiplication (ijk)" slide.)

– 33 – Miss Rate Analysis for Matrix Multiply

Assume:
  • Line size = 32B (big enough for 4 64-bit words)
  • Matrix dimension (N) is very large
    • Approximate 1/N as 0.0
  • Cache is not even big enough to hold multiple rows

Analysis Method:
  • Look at access pattern of inner loop

[Figure: for C = A x B, the inner loop scans row i of A (index k), column j of B (index k), and element (i, j) of C.]

– 34 – Matrix Multiplication (ijk)

/* ijk */
for (i=0; i< ...

Inner loop (over k): A accessed row-wise, B column-wise, C fixed

Misses per Inner Loop Iteration:
  A     B     C
  0.25  1.0   0.0

(A reconstructed ijk loop nest follows below.)
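The loop nest itself is truncated in this copy of the slides. A minimal reconstruction of the ijk version, consistent with the analysis above (sum stays in a register, so C costs no inner-loop misses); the function signature is my addition.

/* Sketch: ijk loop nest. The inner loop (over k) reads a[i][k] row-wise
 * and b[k][j] column-wise, while c[i][j] stays fixed. */
void matmul_ijk(int n, double a[n][n], double b[n][n], double c[n][n]) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;            /* held in a register */
            for (int k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}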

– 35 – Matrix Multiplication (jik)

/* jik */
for (j=0; j< ...

Same inner loop as ijk (only the i and j loops are interchanged): A row-wise, B column-wise, C fixed

Misses per Inner Loop Iteration:
  A     B     C
  0.25  1.0   0.0

– 36 – Matrix Multiplication (kij)

/* kij */
for (k=0; k< ...

Inner loop (over j): A fixed, B row-wise, C row-wise

Misses per Inner Loop Iteration:
  A     B     C
  0.0   0.25  0.25

– 37 – Matrix Multiplication (ikj)

/* ikj */
for (i=0; i< ...

Same inner loop as kij (only the k and i loops are interchanged): A fixed, B row-wise, C row-wise

Misses per Inner Loop Iteration:
  A     B     C
  0.0   0.25  0.25

– 38 – Matrix Multiplication (jki)

/* jki */
for (j=0; j< ...

Inner loop (over i): A column-wise, B fixed, C column-wise

Misses per Inner Loop Iteration:
  A     B     C
  1.0   0.0   1.0

– 39 – Matrix Multiplication (kji)

/* kji */
for (k=0; k< ...

Same inner loop as jki (only the j and k loops are interchanged): A column-wise, B fixed, C column-wise

Inner loop (over i): A column-wise, B fixed, C column-wise

Misses per Inner Loop Iteration:
  A     B     C
  1.0   0.0   1.0

– 40 – Summary of Matrix Multiplication

                ijk (& jik)         kij (& ikj)         jki (& kji)
                2 loads, 0 stores   2 loads, 1 store    2 loads, 1 store
                misses/iter = 1.25  misses/iter = 0.5   misses/iter = 2.0

(The three loop nests are truncated in the source; the ijk sketch appears after the "Matrix Multiplication (ijk)" slide, and reconstructed kij and jki sketches follow below.)
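For completeness, hedged reconstructions of the other two loop-order families summarized above. The function signatures are mine, and both variants assume c[][] is zero-initialized before the call.

/* Sketch: kij keeps a[i][k] in a scalar r and streams rows of B and C
 * (0.5 misses/iter); jki keeps b[k][j] in r and walks columns of A and C
 * (2.0 misses/iter). */
void matmul_kij(int n, double a[n][n], double b[n][n], double c[n][n]) {
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++) {
            double r = a[i][k];              /* A: fixed in the inner loop */
            for (int j = 0; j < n; j++)
                c[i][j] += r * b[k][j];      /* B, C: row-wise (stride 1) */
        }
}

void matmul_jki(int n, double a[n][n], double b[n][n], double c[n][n]) {
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            double r = b[k][j];              /* B: fixed in the inner loop */
            for (int i = 0; i < n; i++)
                c[i][j] += r * a[i][k];      /* A, C: column-wise (stride n) */
        }
}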

– 41 –