High-Performance Parallel Computing
P. (Saday) Sadayappan
Rupesh Nasre

Course Overview

• Emphasis on algorithm development and programming issues for high performance
• No assumed background in computer architecture; knowledge of C is assumed
• Grading:
  - 60% Programming Assignments (4 x 15%)
  - 40% Final Exam (July 4)
• Accounts will be provided on the IIT-M system

Course Topics

• Architectures
  - Single processor core
  - Multi-core and SMP (Symmetric Multi-Processor) systems
  - GPUs (Graphics Processing Units)
  - Short-vector SIMD instruction set architectures
• Programming Models/Techniques
  - Single-processor performance issues
    · Caches and data locality
    · Data dependence and fine-grained parallelism
    · Vectorization (SSE, AVX)
  - Parallel programming models
    · Shared-memory parallel programming (OpenMP)
    · GPU programming (CUDA)

Class Meeting Schedule

• Lecture from 9am-12pm each day, with a mid-class break
• Optional lab session from 12-1pm each day
• 4 programming assignments:

  Topic           Due Date   Weightage
  Data Locality   June 24    15%
  OpenMP          June 27    15%
  CUDA            June 30    15%
  Vectorization   July 2     15%

The Good Old Days for Software

• Single-processor performance experienced dramatic improvements from clock-frequency scaling and architectural improvements (pipelining, instruction-level parallelism)
• Applications experienced automatic performance improvement
(Source: J. Birnbaum)

Hitting the Power Wall

[Figure: power density in W/cm² (log scale, 1 to 1000) versus CMOS process generation (1.5µ down to 0.07µ) for i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4, with reference lines for a hot plate, a nuclear reactor surface, and a rocket nozzle. Annotations: "Power doubles every 4 years"; "Sun's 5-year projection: 200W total, 125 W/cm²!"; "P=VI: 75W @ 1.5V = 50A!"]

Source: Fred Pollack, "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies", Micro32 conference keynote, 1999. Courtesy Avi Mendelson, Intel.

[Figure: CPU clock frequency over time, flattening in the mid-2000s ("toward a brighter tomorrow"). Source: http://img.tomshardware.com/us/2005/11/21/the_mother_of_all_cpu_charts_2005/cpu_frequency.gif]

2004: Intel cancels Tejas and Jayhawk due to "heat problems due to the extreme power consumption of the core ..."

The Only Option: Use Many Cores

• Chip density continues to increase ~2x every 2 years
  - Clock speed is not increasing
  - The number of processor cores may double instead
• There is little or no more hidden parallelism (ILP) to be found
• Parallelism must be exposed to and managed by software
(Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond))

Can You Predict Performance?

    double W,X,Y,Z;

    for (j = 0; j < 100000000; j++) {   // Example loop E1
        W = 0.999999*X;
        X = 0.999999*W;
    }

    for (j = 0; j < 100000000; j++) {   // Example loop E2
        W = 0.999999*W + 0.000001;
        X = 0.999999*X + 0.000001;
        Y = 0.999999*Y + 0.000001;
        Z = 0.999999*Z + 0.000001;
    }

• E1 takes about 0.4s to execute on this laptop
• Performance is measured in GFLOPs: billions (Giga) of FLoating-point Operations Per Second
• About how long does E2 take?
  1. [0-0.4s]
  2. [0.4s-0.6s]
  3. [0.6s-0.8s]
  4. More than 0.8s
• Answer: E2 takes 0.35s => 2.27 GFLOPs (8 floating-point operations per iteration x 10^8 iterations)
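The GFLOPs numbers above can be reproduced with a simple timing harness. The sketch below is not from the slides; it assumes a POSIX clock_gettime timer and times loop E2 (8 floating-point operations per iteration). Absolute times will of course vary by machine and compiler flags:

    #include <stdio.h>
    #include <time.h>

    #define ITERS 100000000L

    int main(void) {
        double W = 1.0, X = 1.0, Y = 1.0, Z = 1.0;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long j = 0; j < ITERS; j++) {       /* Example loop E2 from the slide */
            W = 0.999999*W + 0.000001;
            X = 0.999999*X + 0.000001;
            Y = 0.999999*Y + 0.000001;
            Z = 0.999999*Z + 0.000001;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double flops = 8.0 * ITERS;              /* 4 multiplies + 4 adds per iteration */
        printf("result %f (printed so the optimizer keeps the loop)\n", W + X + Y + Z);
        printf("time %.3f s, %.2f GFLOPs\n", secs, flops / secs / 1e9);
        return 0;
    }

Compiling with optimization (e.g., gcc -O2) matters; the final printf uses the results so the compiler cannot delete the loop as dead code.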
ILP Affects Performance

• ILP (Instruction-Level Parallelism): many operations in a sequential code could be executed concurrently if they do not have dependences
• Pipelined stages in functional units can be exploited by ILP
• Multiple functional units in a CPU can be exploited by ILP
• ILP is automatically exploited by the system when possible
• E2's statements are independent and provide ILP; E1's statements are not independent, and do not provide ILP (loops E1 and E2 as shown above)

Example: Memory Access Cost

    #define N 32
    #define T 1024*1024
    // #define N 4096
    // #define T 64
    double A[N][N];

    for (it = 0; it < T; it++)
        for (j = 0; j < N; j++)
            for (i = 0; i < N; i++)
                A[i][j] += 1;

• About how long will the code run for the 4Kx4K matrix?
  1. [0-0.3s]
  2. [0.3s-0.6s]
  3. [0.6s-1.2s]
  4. More than 1.2s

            32x32    4Kx4K
  FLOPs     2^30     2^30
  Time      0.33s    15.5s
  GFLOPs    3.24     0.07

• Both configurations perform the same 2^30 updates, but data movement overheads from main memory to cache can greatly exceed computational costs: the 32x32 array (8 KB) fits in cache, while the 4Kx4K array (128 MB) does not, and the column-wise inner loop then touches a different cache line on every access

Cache Memories

Adapted from slides by Prof. Randy Bryant (CMU CS 15-213) and Dr. Jeffrey Jones (OSU CSE 5441)

Topics
• Generic cache memory organization
• Impact of caches on performance

Cache Memories

• Cache memories are small, fast SRAM-based memories managed automatically in hardware
  - They hold frequently accessed blocks of main memory
• The CPU looks first for data in L1, then in L2, then in main memory
• Typical bus structure: the CPU chip holds the register file, ALU, and L1 cache; a cache bus connects to the L2 cache; the bus interface connects over the system bus to an I/O bridge, which connects over the memory bus to main memory

Inserting an L1 Cache Between the CPU and Main Memory

• The tiny, very fast CPU register file has room for four 4-byte words
  - The transfer unit between the CPU register file and the cache is a 4-byte block
• The small, fast L1 cache has room for two 4-word blocks (line 0 and line 1)
  - The transfer unit between the cache and main memory is a 4-word block (16 bytes)
• The big, slow main memory has room for many 4-word blocks (e.g., block 10 holds a b c d, block 21 holds p q r s, block 30 holds w x y z)

Direct-Mapped Cache

[Figure: a 64-byte memory, viewed as 8 blocks of 8 bytes each, mapped onto a 4-line cache. Each cache line holds a valid bit, a tag, and a data block. Memory blocks map round-robin to cache lines (block b maps to line b mod 4), so each memory block maps to a particular line in the cache, selected by bits of the binary byte address.]

Direct-Mapped Cache: Cache Lines and Pre-fetching

[Figure (J. S. Jones): RAM is byte-addressable, but on a load of, say, byte AD, the entire cache line containing bytes AA-B1 is transferred from memory to the cache (a system with 64-bit data paths transfers 8 bytes at a time), while only the bytes appropriate for the data type are transferred from the cache to the register.]

Accessing Direct-Mapped Caches

Set selection
• Use the set (= line) index bits of the address to determine the set of interest (a sketch of this address slicing follows)
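As a concrete illustration of set selection, and of the tag / set index / block offset split shown in the figures here, a minimal C sketch follows. It is not from the slides; the field widths b = 1 and s = 2 match the 16-address simulation later in this section, and the function names are illustrative:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative field widths: B = 2^b bytes/block, S = 2^s sets.
       b = 1 and s = 2 match the 16-address simulation below. */
    enum { b = 1, s = 2 };

    static unsigned block_offset(uint64_t addr) { return addr & ((1u << b) - 1); }
    static unsigned set_index(uint64_t addr)    { return (addr >> b) & ((1u << s) - 1); }
    static uint64_t tag_bits(uint64_t addr)     { return addr >> (b + s); }

    int main(void) {
        uint64_t a = 13;  /* 1101 in binary: tag 1, set index 10 (= 2), offset 1 */
        printf("address %llu -> tag %llu, set %u, offset %u\n",
               (unsigned long long)a, (unsigned long long)tag_bits(a),
               set_index(a), block_offset(a));
        return 0;
    }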
[Figure: the m-bit address is split into t tag bits, s set-index bits, and b block-offset bits; the set index (e.g., 00001) selects one set (line) among sets 0 to S-1, each holding a valid bit, a tag, and a cache block.]

Accessing Direct-Mapped Caches

Line matching and word selection
• Line matching: find a valid line in the selected set with a matching tag
• Word selection: then extract the word

[Figure: for the selected set i, holding a valid line with tag 0110 and bytes 0-7 (words w0-w3): (1) the valid bit must be set; (2) the tag bits in the cache line must match the tag bits in the address; (3) if (1) and (2) hold, it is a cache hit, and the block offset (e.g., 100) selects the starting byte.]

Direct-Mapped Cache Simulation

M = 16 byte addresses, B = 2 bytes/block, S = 4 sets, E = 1 entry/set
Address bits: t = 1 (tag), s = 2 (set index), b = 1 (block offset)

Address trace (reads): 0 [0000₂], 1 [0001₂], 13 [1101₂], 8 [1000₂], 0 [0000₂]

  (1) 0  [0000₂] (miss): set 0 <- valid, tag 0, data M[0-1]
  (2) 1  [0001₂] (hit):  set 0 already holds M[0-1]
  (3) 13 [1101₂] (miss): set 2 <- valid, tag 1, data M[12-13]
  (4) 8  [1000₂] (miss): set 0 <- valid, tag 1, data M[8-9] (evicts M[0-1])
  (5) 0  [0000₂] (miss): set 0 <- valid, tag 0, data M[0-1] again

General Structure of Cache Memory

• A cache is an array of S = 2^s sets
• Each set contains one or more lines (E lines per set)
• Each line holds a block of B = 2^b data bytes, plus 1 valid bit and t tag bits
• Cache size: C = B x E x S data bytes

Addressing Caches

Address A:  <tag: t bits> <set index: s bits> <block offset: b bits>   (bits m-1 down to 0)

The word at address A is in the cache if the tag bits in one of the <valid> lines in set <set index> match <tag>.
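Tying the pieces together, the following C sketch (not from the slides) simulates the direct-mapped cache of the simulation above: B = 2 bytes/block, S = 4 sets, E = 1 line/set, so C = B x E x S = 8 data bytes, and a 4-bit address splits into t = 1, s = 2, b = 1 bits. Replaying the read trace 0, 1, 13, 8, 0 should print miss, hit, miss, miss, miss:

    #include <stdio.h>

    #define SETS   4   /* S = 4 sets (s = 2 index bits)        */
    #define BBYTES 2   /* B = 2 bytes per block (b = 1 offset bit) */

    struct line { int valid; unsigned tag; };  /* E = 1 line per set (direct-mapped) */

    int main(void) {
        struct line cache[SETS] = {{0, 0}};
        unsigned trace[] = {0, 1, 13, 8, 0};   /* read trace from the slide */

        for (int k = 0; k < 5; k++) {
            unsigned addr  = trace[k];
            unsigned block = addr / BBYTES;    /* strip the block-offset bit */
            unsigned set   = block % SETS;     /* set-index bits             */
            unsigned tag   = block / SETS;     /* remaining high (tag) bits  */
            if (cache[set].valid && cache[set].tag == tag) {
                printf("read %2u: hit  (set %u)\n", addr, set);
            } else {
                printf("read %2u: miss (set %u), load M[%u-%u]\n",
                       addr, set, block * BBYTES, block * BBYTES + BBYTES - 1);
                cache[set].valid = 1;          /* fill replaces whatever was there */
                cache[set].tag   = tag;
            }
        }
        return 0;
    }

The final access to address 0 misses even though that block was loaded first: direct mapping gives each block exactly one possible line, so block 0 (M[0-1]) and block 4 (M[8-9]) conflict in set 0.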
