1 SPARC T5 Processor

(Die diagram: 16 SPARC cores, L3 cache segments, crossbar, memory controllers (MCU), PCIe Gen3, power management, and SerDes I/O.)

16 cores @ 3.6 GHz, 8 threads/core, 8-way glueless 1-hop coherence

2 T5 8-Way Glueless

(Topology diagram: eight T5 sockets, each with local DIMMs and PCIe, directly connected by coherence links.)

DDR3-1066, 1+ TB/sec memory bandwidth
Coherency bisection bandwidth: 840 GB/sec

3 SPARC M6-32 System

(System diagram: 32 M6 processors.)

384 cores @ 3.6 GHz, 3,072 threads
32 TB shared memory

They Come In Smaller Sizes Too

4 SPARC M7 – On The Way To You

32 S4 Cores @ “Faster”, 8 threads/core, 8-way glueless

5 The New S4 Core

• Dynamically Threaded, 8 Threads
• Extreme Performance: Increased Frequency at Same Pipeline Depths
• Dual-Issue, OOO Execution Core
• 2 ALUs, 1 LSU, 1 FGU, 1 BRU, 1 SPU
• 40-Entry Pick Queue
• 16KB L1 I$ and a 16KB write-through L1 D$ per core
• 64-Entry Fully Assoc. I-TLB, 128-Entry Fully Assoc. D-TLB
• Cryptographic Performance Improvements
• 4X Addressing Increase (54-bit VA, 50-bit RA/PA)
• Advanced Power Management
• Application Acceleration Support: Application Data Integrity, Virtual Address Masking, User-Level Synchronization Instructions

6 The New SPARC M7 Processor

• 32 next-generation S4 cores
• 256 hardware threads per processor
• 64 MB L3 cache
• 8 DDR4 memory controllers
• Over 150 GB/s real memory bandwidth
• Coherence subsystem supports 1-32 socket scaling
• 8 DAX (“Data Analytics Accelerators”) for query acceleration and message handling
• Each DAX has 4 pipelines, for a total of 32 pipelines per processor

7 SPARC M7 Processor

(Die diagram: clusters of 4 cores, 8 MB L3 cache segments, memory controllers (MCU), DAX units, and SMP & I/O gateways, all connected through the on-chip network (OCN).)

8 Application Data Integrity (ADI)

(Diagram: memory shown as rows of data block / address / color; the SPARC M7 tags each block with a color.)

ADI detects illegal memory access at hardware speed:
• Memory blocks are tagged with a color
• An application can only access memory blocks with the color(s) it owns
• An invalid access is flagged immediately
(A conceptual code sketch follows below.)
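A minimal conceptual sketch in C of the coloring idea described above (this is a simulation for illustration only, not the Solaris ADI API; the function names and the 64-byte block size are assumptions):

#include <stdio.h>
#include <stdlib.h>

/* Conceptual simulation: every BLOCK-sized chunk of a region carries a
   color tag, and an access is only legal when the color presented by the
   "pointer" matches the tag of the block it touches. */
#define BLOCK 64

typedef struct {
    unsigned char *mem;      /* the data                      */
    unsigned char *color;    /* one color tag per BLOCK bytes */
    size_t         nblocks;
} region_t;

static void set_color(region_t *r, size_t offset, size_t len, unsigned char c)
{
    for (size_t b = offset / BLOCK; b <= (offset + len - 1) / BLOCK; b++)
        r->color[b] = c;
}

static int checked_access(region_t *r, size_t offset, unsigned char c)
{
    if (r->color[offset / BLOCK] != c) {
        fprintf(stderr, "color mismatch: %u vs tag %u at offset %zu\n",
                (unsigned)c, (unsigned)r->color[offset / BLOCK], offset);
        return -1;           /* real ADI raises a hardware trap (SEGV) instead */
    }
    return r->mem[offset];
}

int main(void)
{
    region_t r = { calloc(10 * BLOCK, 1), calloc(10, 1), 10 };
    set_color(&r, 0, 5 * BLOCK, 7);     /* first five blocks owned with color 7  */
    checked_access(&r, 10, 7);          /* OK: matching color                    */
    checked_access(&r, 6 * BLOCK, 7);   /* flagged: block carries another color  */
    free(r.mem); free(r.color);
    return 0;
}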

9 How ADI Works

(Diagram: the application owns color A; loads and stores to addresses whose blocks are tagged A succeed, while an access to a block with a different color traps.)

(dbx) run
signal SEGV (ADI version 13 mismatch for VA 0x4a900) in main at 0x10988
(dbx) where
…stack trace…

10 The SPARC Sonoma Processor (NEW)

(Diagram: Sonoma combines M7-class cores with DDR4 interfaces, InfiniBand, and PCIe Gen3 interfaces in a fully integrated design.)

11 The Sonoma Processor (NEW)

(Block diagram labels: core clusters, on-chip network, coherency, MCU, DAX, PCIe, DDR4, InfiniBand.)

• 8 SPARC S4 cores
• Fourth-generation CMT core (S4): dynamically threaded, 1 to 8 threads per core
• Optimized cache organization: shared Level 2 data and instruction caches; 16MB shared & partitioned Level 3 cache
• Integrated DDR4 DRAM: 4 direct-attached DDR4 channels, up to 1TB physical memory per processor
• Integrated InfiniBand HCA
• Advanced Software in Silicon features: real-time Application Data Integrity, Concurrent Memory Migration and VA Masking
• DAX DB query offload engines
• Integrated PCIe Gen3
• Scale-out IB interconnect
• Technology: 20nm, 13 metal layers

12 Sonoma Connectivity

(Diagram: coherency links, PCIe Gen3 x8 links, and InfiniBand FDR x4 links off the processor.)

• 2 InfiniBand x4 links @ FDR rate: 28 GB/s bidirectional bandwidth
• 4 coherency x8 links: 128 GB/s bidirectional bandwidth; Auto Frame Retry, Auto Link Retrain, Single Lane Failover
• 2 PCIe Gen3 x8 links: 32 GB/s bidirectional bandwidth

13 “OpenMP Does Not Scale”

Ruud van der Pas
Distinguished Engineer, Architecture and Performance
SPARC Microelectronics, Oracle
Santa Clara, CA, USA

14 Agenda
• The Myth
• Deep Trouble
• Get Real
• The Wrapping

15 The Myth

16 “A myth, in popular use, is something that is widely believed but false.” (source: www.reference.com)

17 The Myth: “OpenMP Does Not Scale”

18 Hmmm .... What Does That Really Mean ?

19 Some Questions I Could Ask

“Do you mean you wrote a parallel program using OpenMP, and it doesn't perform?” “I see. Did you make sure the program was fairly well optimized in sequential mode?”

20 Some Questions I Could Ask “Oh. You didn't. By the way, why do you expect the program to scale?” “Oh. You just think it should and you used all the cores. Have you estimated the speedup using Amdahl's Law?” “No, this law is not a new EU financial bail-out plan. It is something else.”
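As a reminder (standard formula, not from the slides): if a fraction f of the runtime can be parallelized, Amdahl's Law bounds the speedup on N threads by speedup(N) = 1 / ((1 - f) + f/N). Even with f = 0.9, the speedup can never exceed 10x, no matter how many cores are used.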

21 Some Questions I Could Ask

“I understand. You can't know everything. Have you at least used a tool to identify the most time consuming parts in your program?” “Oh. You didn't. You just parallelized all loops in the program. Did you try to avoid parallelizing innermost loops in a loop nest?” “Oh. You didn't. Did you minimize the number of parallel regions then?” “Oh. You didn't. It just worked fine the way it was.”

22 More Questions I Could Ask

“Did you at least use the nowait clause to minimize the use of barriers?” “Oh. You've never heard of a barrier. Might be worth reading up on.” “Do all threads roughly perform the same amount of work?” “You don't know, but think it is okay. I hope you're right.”

23 I Don’t Give Up That Easily

“Did you make optimal use of private data, or did you share most of it?” “Oh. You didn't. Sharing is just easier. I see.”

24 I Don’t Give Up That Easily “You seem to be using a cc-NUMA system. Did you take that into account?” “You've never heard of that either. How unfortunate. Could there perhaps be any false sharing affecting performance?” “Oh. Never heard of that either. It may come in handy to learn a little more about both.”

25 The Grass Is Always Greener ...

“So, what did you do next to address the performance ?” “Switched to MPI. I see. Does that perform any better then?”

“Oh. You don't know. You're still debugging the code.”

26 Going Into Pedantic Mode

“While you're waiting for your MPI debug run to finish (are you sure it doesn't hang by the way ?), please allow me to talk a little more about OpenMP and Performance.”

27 Deep Trouble

28 OpenMP And Performance/1

• The transparency and ease of use of OpenMP are a mixed blessing
  – Makes things pretty easy
  – May mask performance bottlenecks
• In the ideal world, an OpenMP application “just performs well”
• Unfortunately, this is not always the case

29 OpenMP And Performance/2

• Two of the more obscure things that can negatively impact performance are cc-NUMA effects and False Sharing
• Neither of these is restricted to OpenMP
  – They come with shared memory programming on modern cache-based systems
  – But they might show up because you used OpenMP
• In any case they are important enough to cover here

30 False Sharing

31 False Sharing

Storing data into a shared cache line invalidates the other copies of that line.

(Diagram: CPUs – caches – memory.) The system is not able to distinguish between changes within one individual line.

32 False Sharing Red Flags

• Be alert if all three of these conditions are met:
  – Shared data is modified by multiple processors
  – Multiple threads operate on the same cache line(s)
  – The update occurs simultaneously and very frequently
• Use local data where possible
• Shared read-only data does not lead to false sharing
(A minimal code sketch follows below.)
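To make the red flags concrete, here is a minimal sketch in C with OpenMP (variable names and the 64-byte line size are assumptions, not taken from the slides): per-thread counters packed into one array share a cache line, while padding each counter to its own line removes the interference.

#include <omp.h>
#include <stdio.h>

#define NTHREADS 8
#define ITER     10000000L

/* Packed counters: all eight longs fit in one or two cache lines, so every
   increment invalidates the line in the other cores' caches. */
long packed[NTHREADS];

/* Padded counters: each counter occupies its own 64-byte slot (assumed line
   size), so the threads no longer interfere with each other. */
struct { long value; char pad[64 - sizeof(long)]; } padded[NTHREADS];

int main(void)
{
    #pragma omp parallel num_threads(NTHREADS)
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < ITER; i++)
            packed[tid]++;             /* false sharing: same cache line   */
        for (long i = 0; i < ITER; i++)
            padded[tid].value++;       /* no false sharing: private lines  */
    }
    printf("%ld %ld\n", packed[0], padded[0].value);
    return 0;
}

Timing the two loops separately typically shows the padded version scaling far better; accumulating into private (local) variables and reducing at the end is an even cleaner fix.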


35 Considerations for cc-NUMA

36 A Generic cc-NUMA Architecture

(Diagram: two processors, each with local memory, connected by a cache coherent interconnect; local access is fast, remote access is slower.)

Main issue: how to distribute the data?

37 About Data Distribution

• An important aspect on cc-NUMA systems
  – If placement is not optimal, the result is longer memory access times and memory controller hotspots
• As of OpenMP 4.0 there is support for cc-NUMA placement, with user control through the OMP_PLACES and OMP_PROC_BIND environment variables (example below)
• Operating systems such as Windows and Solaris use the “First Touch” placement policy by default
  – It may be possible to override the default (check the docs)
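For example (illustrative settings, not from the original slides), one might request OMP_PROC_BIND=close and OMP_PLACES=cores in the environment before launching the program, so that threads stay near the memory they touched first; the appropriate values depend on the system's topology.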

38 Example First Touch Placement/1

(Diagram: a[0] through a[9999] all end up in the memory attached to the processor whose thread touches them first; the other processor's memory holds none of the array.)

for (i=0; i<10000; i++)
    a[i] = 0;

First Touch: all array elements are in the memory of the processor executing this thread.

39 Example First Touch Placement/2

(Diagram: a[0]-a[4999] are placed in the memory of one processor, a[5000]-a[9999] in the memory of the other.)

#pragma omp parallel for num_threads(2)
for (i=0; i<10000; i++)
    a[i] = 0;

First Touch: both memories now have “their own half” of the array.

40 Get Real

41 “I Value My Personal Space”

42 My Favorite Simple Algorithm

/*-- a = b * c : each row of the m x n matrix b times the vector c --*/
void mxv(int m, int n, double *a, double *b[], double *c)
{
   for (int i=0; i<m; i++)
   {
      double sum = 0.0;
      for (int j=0; j<n; j++)
         sum += b[i][j]*c[j];
      a[i] = sum;
   }
}
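A hypothetical driver (not part of the original slides) showing how the b argument, an array of row pointers, could be set up and the routine called:

#include <stdio.h>
#include <stdlib.h>

void mxv(int m, int n, double *a, double *b[], double *c);

int main(void)
{
    int m = 4, n = 3;
    double  *a = malloc(m * sizeof(double));
    double  *c = malloc(n * sizeof(double));
    double **b = malloc(m * sizeof(double *));

    for (int j = 0; j < n; j++) c[j] = 1.0;
    for (int i = 0; i < m; i++) {
        b[i] = malloc(n * sizeof(double));      /* one contiguous row */
        for (int j = 0; j < n; j++) b[i][j] = i + 1;
    }

    mxv(m, n, a, b, c);                         /* a[i] = sum_j b[i][j]*c[j] */

    for (int i = 0; i < m; i++) printf("a[%d] = %g\n", i, a[i]);
    return 0;
}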

43 The OpenMP Source

#pragma omp parallel for default(none) \
        shared(m,n,a,b,c)
for (int i=0; i<m; i++)
{
   double sum = 0.0;
   for (int j=0; j<n; j++)
      sum += b[i][j]*c[j];
   a[i] = sum;
}

(Diagram: row i of the matrix times the vector gives element i of the result.)

44 Performance On Intel Nehalem

(Chart: performance in Mflop/s versus memory footprint (KByte, log-2 scale) for 1x1, 2x1, 4x1, 8x1, and 8x2 threads.)

Wait a minute! This algorithm is highly parallel, yet the maximum speed-up is only ~1.6x.

System: Intel X5570, 2 sockets, 8 cores, 16 threads at 2.93 GHz. Notation: number of cores x number of threads within a core.

45 ?

46 Let's Get Technical

47 A Two Socket Nehalem System

(Diagram: two sockets, each with 4 cores, 2 hardware threads per core, per-core caches, a shared cache, and local memory; the sockets are connected through the QPI interconnect.)

48 Data Initialization Revisited

#pragma omp parallel default(none) \
        shared(m,n,a,b,c) private(i,j)
{
   /*-- Initialize the data inside the parallel region, distributed the
        same way as the compute loops, so that first touch places each
        thread's part of a, b, and c in its local memory
        (initialization values shown are illustrative) --*/
   #pragma omp for
   for (j=0; j<n; j++)
      c[j] = 1.0;

   #pragma omp for
   for (i=0; i<m; i++)
   {
      a[i] = 0.0;
      for (j=0; j<n; j++)
         b[i][j] = i;
   }
} /*-- End of parallel region --*/

49 Data Placement Matters!

(Chart: performance in Mflop/s versus memory footprint (KByte, log-2 scale) for 1x1, 2x1, 4x1, 8x1, and 8x2 threads.)

The only change is the way the data is distributed; the maximum speedup is ~3.3x now.

System: Intel X5570, 2 sockets, 8 cores, 16 threads at 2.93 GHz. Notation: number of cores x number of threads within a core.

50 A Two Socket SPARC T4-2 System

(Diagram: two sockets, each with 8 cores, 8 hardware threads per core, per-core caches, a shared cache, and local memory; the sockets are connected through the system interconnect.)

51 Performance On SPARC T4-2

(Chart: performance in Mflop/s versus memory footprint (KByte, log-2 scale) for 1x1, 8x1, 16x1, and 16x2 threads.)

Scaling on larger matrices is affected by cc-NUMA effects (similar to Nehalem); the maximum speed-up is ~5.8x. Note that there are no idle cycles to fill here.

System: SPARC T4, 2 sockets, 16 cores, 128 threads at 2.85 GHz. Notation: number of cores x number of threads within a core.

52 Data Placement Matters!

(Chart: performance in Mflop/s versus memory footprint (KByte, log-2 scale) for 1x1, 8x1, 16x1, and 16x2 threads.)

The only change is the way the data is distributed; the maximum speed-up is ~11.3x now.

System: SPARC T4, 2 sockets, 16 cores, 128 threads at 2.85 GHz. Notation: number of cores x number of threads within a core.

53 Summary Matrix Times Vector

(Chart: performance in Mflop/s versus memory footprint (KByte, log-2 scale) comparing Nehalem OMP, Nehalem OMP-FT, T4-2 OMP, and T4-2 OMP-FT, each with 16 threads; the annotated gains from the first-touch (FT) versions are 1.9x and 2.1x.)

54 “Room Sharing”

55 A 3D Matrix Update

do k = 2, n
   do j = 2, n
!$omp parallel do default(shared) private(i) &
!$omp schedule(static)
      do i = 1, m
         x(i,j,k) = x(i,j,k-1) + x(i,j-1,k)*scale
      end do
!$omp end parallel do
   end do
end do

❑ The loops are correctly nested for serial performance
❑ Due to a data dependency on J and K, only the inner loop can be parallelized
❑ This will cause the barrier to be executed (N-1)² times, once per (j,k) iteration

(Figure: data dependency graph in the J-K plane.)

56 The Performance

(Chart: performance in Mflop/s versus number of threads, 0-64.)

Scaling is very poor (as is to be expected): only the inner loop over I has been parallelized.

Dimensions: M=640, N=800; footprint ~3 GByte. System: SPARC T4, 2 sockets, 16 cores, 128 threads at 2.85 GHz.

57 Oracle Performance Analyzer

Results shown are for 1 and 2 threads

Barrier cost is high (to be expected). The compute part scales reasonably well, but why is there so much idle time?

58 Source Code View

Note the much higher “OMP Wait” time for 2 threads

59 The Analyzer Timeline

(Screenshot: timeline for two threads, shown in both the Operating System State view and the Application View.)

60 The Analyzer Timeline

(Screenshot: timelines for two and eight threads; legend: compute part, thread idle, barrier.)

61 Zoom In – 1, 2, 8 Threads

(Screenshot: zoomed timelines; legend: compute part, thread idle, barrier.)

All threads but the master thread are idle at times. Why?

62 Some Threads Are More Active ...

This is the timeline view for other threads in the 16-thread run.

63 This Is False Sharing At Work!

!$omp parallel do default(shared) private(i) &
!$omp schedule(static)
      do i = 1, m
         x(i,j,k) = x(i,j,k-1) + x(i,j-1,k)*scale
      end do
!$omp end parallel do

(Figure: cache line ownership for 1, 2, and 4 threads: with one thread there is no sharing, two threads share 1 line, and four threads share 3 lines. False sharing increases as the number of threads increases.)

64 The Smoking Gun ...

(Chart: aggregated number of L1 data cache line invalidations versus number of threads, 1-32.)

The total number of level 1 data cache line invalidations increases linearly with the thread count: a 225x increase over a single thread.

Note: the single-thread number is about 2.5 million.

65 An Idea

❑ There is no data dependency on 'i'
❑ Therefore we can split the 3D matrix into larger blocks and process these in parallel

do k = 2, n
   do j = 2, n
      do i = 1, m
         x(i,j,k) = x(i,j,k-1) + x(i,j-1,k)*scale
      end do
   end do
end do

66 The Implementation Of The Idea

❑ Distribute the iterations over the number of threads
❑ Do this by controlling the start (“is”) and end (“ie”) value of the inner loop
❑ Each thread calculates these values for its portion of the work

do k = 2, n
   do j = 2, n
      do i = is, ie
         x(i,j,k) = x(i,j,k-1) + x(i,j-1,k)*scale
      end do
   end do
end do

67 The Driver Code

subroutine kernel(is,ie,m,n,x,scale)
......
do k = 2, n
   do j = 2, n
      do i = is, ie
         x(i,j,k) = x(i,j,k-1) + x(i,j-1,k)*scale
      end do
   end do
end do

use omp_lib
......
nrem   = mod(m,nthreads)
nchunk = (m-nrem)/nthreads
!$omp parallel default (none)           &
!$omp private (P,is,ie)                 &
!$omp shared  (nrem,nchunk,m,n,x,scale)
P = omp_get_thread_num()
if ( P < nrem ) then
   is = 1 + P*(nchunk + 1)
   ie = is + nchunk
else
   is = 1 + P*nchunk + nrem
   ie = is + nchunk - 1
end if
call kernel(is,ie,m,n,x,scale)
!$omp end parallel

68 Performance Comparison

(Chart: performance in Mflop/s versus number of threads, 0-64, for the Baseline and 3D Blocks versions.)

The 3D Blocks version is a dramatic improvement (4x in peak); the sawtooth pattern is caused by load imbalance.

Dimensions: M=640, N=800; footprint ~3 GByte. System: SPARC T4, 2 sockets, 16 cores, 128 threads at 2.85 GHz.

69 Timeline For 3D Blocks Version

No more idle time, and only one barrier section now.

(Screenshot legend: compute part, thread idle, barrier.)

70 Timeline For 16 Threads

No more idle time, and only one barrier section now.

(Screenshot legend: compute part, thread idle, barrier.)

71 Zoom In – 8 Threads

Modest Load Imbalance

72 Cache Line Invalidations

(Chart: aggregated number of L1 data cache line invalidations versus number of threads, 1-32, for the original and 3D Blocks versions.)

The “3D Blocks” version shows a huge improvement: its maximum number of invalidations is less than what was measured on only 2 threads for the original version.

Note: the single-thread numbers are both about the same (as one would expect).

73 Another Idea: Use OpenMP

use omp_lib
implicit none
integer     :: is, ie, m, n
real(kind=8):: x(m,n,n), scale
integer     :: i, j, k

!$omp parallel default(none) &
!$omp private(i,j,k) shared(m,n,scale,x)
do k = 2, n
   do j = 2, n
!$omp do schedule(static)
      do i = 1, m
         x(i,j,k) = x(i,j,k-1) + x(i,j-1,k)*scale
      end do
!$omp end do nowait
   end do
end do
!$omp end parallel

74 Example Using 2 Threads

Thread 0 executes:             Thread 1 executes:

k=2, j=2 (parallel region)     k=2, j=2 (parallel region)
do i = 1,m/2                   do i = m/2+1,m
   x(i,2,2) = ...                 x(i,2,2) = ...     (work sharing)
end do                         end do

k=2, j=3 (parallel region)     k=2, j=3 (parallel region)
do i = 1,m/2                   do i = m/2+1,m
   x(i,3,2) = ...                 x(i,3,2) = ...     (work sharing)
end do                         end do

... etc ...                    ... etc ...

This splits the operation in a way that is similar to our manual implementation.

75 Performance Results

• Used M=640, N=800
  – Recall this problem size did not scale at all when we explicitly parallelized the inner loop over 'I'
• We have tested 4 versions of this program
  – Inner Loop Over 'I' – our first OpenMP version
  – AutoPar – automatically parallelized version of 'kernel'
  – 3D Blocks – manually parallelized using 3D blocks
  – OMP DO – OpenMP parallel region and work-sharing DO

76 Performance Comparison

(Chart: performance in Mflop/s versus number of threads, 0-64, for Baseline, 3D Blocks, OMP DO, and AutoPar.)

The automatically parallelizing compiler does very well!

Dimensions: M=640, N=800; footprint ~3 GByte. System: SPARC T4, 2 sockets, 16 cores, 128 threads at 2.85 GHz.

77 Enable Software Prefetch

(Chart: performance in Mflop/s versus number of threads, 0-64, for Baseline and 3D Blocks.)

System: SPARC T4, 2 sockets, 16 cores, 128 threads at 2.85 GHz.

78 Performance Comparison

(Chart: performance in Mflop/s versus number of threads, 0-64, for Baseline, 3D Blocks, and 3D Blocks No SW Pref.)

What the .... is going on here?!

System: SPARC T4, 2 sockets, 16 cores, 128 threads at 2.85 GHz.

79 How Many Remote Misses ?

(Chart: aggregated L1 data cache misses that hit in a remote L3, versus number of threads, 1-32, for Baseline and 3D Blocks.)

The number of remote references increases much more for the “3D Blocks” version than for the original version. The numbers are less than 20,000, though, and should indeed be very low.

80 ?

81 Let's Get Very Technical

82 Storage Of Array “X” In Memory

(Figure: elements (1,1,1), (2,1,1), ..., (m,1,1), (1,2,1), ..., (m,n,n) are laid out consecutively; the storage order runs over the first index fastest.)
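As a reminder (standard Fortran column-major layout, not from the slides): element x(i,j,k) of an m x n x n array lives at linear offset (i-1) + (j-1)*m + (k-1)*m*n elements from x(1,1,1), which is why consecutive values of the first index, and therefore whole pages, follow the storage order shown above.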

83 Page Mapping Follows Storage

(Figure: the same storage order, overlaid with page boundaries; pages #1 through #5 each cover a consecutive range of elements.)

84 Page Location Is Not Localized

(Figure: the element ranges processed by thread 0 and thread 1 cut across pages #1 through #5, so a thread's data is spread over pages that may reside in either memory.)

85 This Won’t Fix The Problem

!$omp parallel default (none)           &
!$omp private (P,is,ie)                 &
!$omp shared  (nrem,nchunk,m,n,x,scale)
P  = omp_get_thread_num()
is = ....
ie = ....
do k = 2, n
   do j = 2, n
      do i = is, ie
         x(i,j,k) = ...
      end do
   end do
end do
!$omp end parallel

This approach only works if “m” is really large, but that is not the case here.

86 A Solution – Remap “X” to 4D

(Figure: the remapped array laid out as (1,1,1,TID) ... (VLEN,n,n,TID), one contiguous slab per thread.)

“X” is re-dimensioned as (VLEN,n,n,0:Nthreads-1), where VLEN is the maximum of {ie-is+1} over all threads.

87 The Nasty Details/1

allocate ( x2(1:max_size,1:n,1:n,0:nthreads-1),STAT=memstat )

!$omp parallel default (none)                 &
!$omp private (P)                             &
!$omp shared  (nrem,nchunk,m,n,x2,scale)      &
!$omp shared  (max_size,is,ie)
P = omp_get_thread_num()
do k = 2, n
   do j = 2, n
      do i = 1, ie(P)-is(P)+1
         x2(i,j,k,P) = 1.0
      end do
   end do
end do
!$omp end parallel

First Touch: each thread now has most or all of its data in local memory.

88 The Nasty Details/2

!$omp parallel default (none)                 &
!$omp private (P,is,ie)                       &
!$omp shared  (nrem,nchunk,m,n,x2,scale,max_size)

P  = omp_get_thread_num()
is = ..... ; ie = ......

do k = 2, n
   do j = 2, n
      do i = 1, ie-is+1
         x2(i,j,k,P) = x2(i,j,k-1,P)+x2(i,j-1,k,P)*scale
      end do
   end do
end do
!$omp end parallel

The basic algorithm is almost the same as before.

89 Comparison Remote Misses

(Chart: number of L1 data cache misses per thread that hit in a remote L3, on a log scale, for thread IDs 0-15, comparing the 3D Blocks and Locality Opt versions.)

90 Now We’re Talking

(Chart: performance in Mflop/s versus number of threads, 0-64, for Baseline, 3D Blocks, 3D Blocks No SW Pref, and Locality Opt; the locality-optimized version is 28x faster.)

In peak performance the new version is 9.5x faster than the original code.

91 “BO2BO” (Bad OpenMP to Better OpenMP)

92 A Request

What follows next is work in progress, and somewhat sensitive. Therefore the following request. Please: no notes, no pictures, no attention.

93 Don’t Try This At Home (Yet)

94 Graph Analysis/1

• A graph is defined through the relationship between objects
• A graph G is defined by a set of
  – Vertices V(G)
  – Edges E(G)

95 Graph Analysis/2

• For example, for the graph below:
  – V(G) = { A,B,C,D,E }
  – E(G) = { (A,C), (A,E), (B,C), (C,E), (C,D), (D,E) }
• A graph is searched for vertices of interest
  – Example: “Are Joe and John directly connected?” (a small code sketch follows below)

(Figure: five vertices A-E connected by the edges listed above, with one vertex and one edge labeled.)
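A minimal sketch in C (hypothetical, not from the slides) of the “directly connected?” question for this small example graph, stored as an edge list:

#include <stdio.h>
#include <string.h>

/* The example graph: V = {A,B,C,D,E},
   E = {(A,C),(A,E),(B,C),(C,E),(C,D),(D,E)} */
static const char *edges[][2] = {
    {"A","C"}, {"A","E"}, {"B","C"}, {"C","E"}, {"C","D"}, {"D","E"}
};

/* "Are u and v directly connected?"  <=>  is (u,v) or (v,u) an edge? */
static int directly_connected(const char *u, const char *v)
{
    for (size_t i = 0; i < sizeof(edges)/sizeof(edges[0]); i++)
        if ((!strcmp(edges[i][0], u) && !strcmp(edges[i][1], v)) ||
            (!strcmp(edges[i][0], v) && !strcmp(edges[i][1], u)))
            return 1;
    return 0;
}

int main(void)
{
    printf("A-C connected: %d\n", directly_connected("A", "C"));   /* 1 */
    printf("A-B connected: %d\n", directly_connected("A", "B"));   /* 0 */
    return 0;
}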

96 A Graph Analysis Example

97 Application Characteristics

• Graph size is controlled by the SCALE parameter (see the sketch below)
  – Number of vertices = 2^SCALE
  – Number of edges = 16 * #vertices
• Large memory, random access
• Many threads
• Intense communication
• Long to very long run times
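A tiny helper in C (hypothetical; it simply restates the two formulas above) to see how quickly the graph grows with SCALE:

#include <stdio.h>

int main(void)
{
    /* Number of vertices = 2^SCALE, number of edges = 16 * #vertices */
    for (int scale = 26; scale <= 32; scale++) {
        unsigned long long vertices = 1ULL << scale;
        unsigned long long edges    = 16ULL * vertices;
        printf("SCALE %d: %llu vertices, %llu edges\n", scale, vertices, edges);
    }
    return 0;
}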

98 Memory Requirements

(Chart: memory requirements in GB and number of vertices in billions versus the SCALE value. SCALE 26 through 32 require roughly 35, 70, 140, 282, 580, 1150, and 2291 GB respectively; “Toy”, “Mini”, and “Small” mark the lower, middle, and upper end of this range.)

Note: SCALE = 32 (small) already requires 2.3 TB of memory.

99 Timeline For A Typical Run

(Timeline: construction of the graph, followed by search and verification.*)

*) The Oracle Studio Performance Analyzer was used to produce this view.

100 Initial Performance T5-2 (35 GB)

(Chart: billion traversed edges per second (GTEPS) versus number of threads, 0-48, for the SPARC T5-2 at SCALE 26.)

Game over beyond 16 threads.

101 That doesn’t scale very well

Let’s use a bigger machine !

102 Performance on T5-2 and T5-8

(Chart: billion traversed edges per second (GTEPS) versus number of threads, 0-48, for T5-2 SCALE 26 and T5-8 SCALE 26.)

103 Oops! That can’t be true

Let’s run a larger graph !

104 Let’s Try A Bigger Graph !

(Chart: billion traversed edges per second (GTEPS) versus number of threads, 0-48, for T5-2 SCALE 29 and T5-8 SCALE 29.)

105 Oops again !

Time for some science (or black art)

106 Initial Observations

107 Clocks Per Instruction (CPI)

• This is a metric that tells how long instructions take (see the note below)
• In the ideal case the CPI = 0.5
  – The S3 core can execute up to 2 instructions/cycle
• A significantly higher value warrants investigation:
  – Pipeline stalls (e.g. RAW)
  – Address translation/TLB issues
  – Remote memory access
  – .....

108 T5-2 Single Thread CPI

(Chart: CPI, Graph 500 baseline (SPARC T5-2, SCALE 26, 1 thread): cycles per instruction versus elapsed time in seconds; the generation phase is followed by the search and verify phase, executed 64 times.)

109 T5-2 Bandwidth (16 Threads)

(Chart: SPARC T5-2 measured bandwidth in GB/s versus elapsed time in seconds, for the baseline at SCALE 28 with 16 threads; total, per-socket read, and per-socket write bandwidth.)

Less than half of the memory bandwidth is used.

110 T5-2 Total CPU Time Distribution

(Chart: total CPU time percentage distribution for the baseline at SCALE 26 versus number of threads, 1-20, broken down into atomic operations, OMP-atomic_wait, OMP-critical_section_wait, OMP-implicit_barrier, other, make_bfs_tree, and verify_bfs_tree.)

111 Secret Sauce BO BO

112 T5-2 Bandwidth OPT1, 224 threads

(Chart: SPARC T5-2 measured bandwidth in GB/s versus elapsed time in seconds, for OPT1 at SCALE 28 with 224 threads; total and per-socket read bandwidth.)

Both sockets deliver peak bandwidth now.

113 T5-2 Performance OPT1 (280 GB)

(Chart: billion traversed edges per second (GTEPS) versus number of threads, 0-256, for T5-2 Baseline and T5-2 OPT1 at SCALE 29.)

Peak performance is 13.7x higher.

114 “You Call That A Knife ?”

(Chart: billion traversed edges per second (GTEPS) versus number of threads, 0-1024, for T5-2 Baseline, T5-2 OPT1, and T5-8 OPT1 at SCALE 29.)

Peak performance is 17.6x higher than the baseline, and the search time is reduced from 3.5 hours to 12 minutes.

115 Think Big And Then Bigger – Second Round Of Optimizations

116 More BO Secret Sauce MO

117 T5-8 Bandwidth (280 GB)

(Chart: SPARC T5-8 measured bandwidth in GB/s versus elapsed time in seconds, for OPT2 at SCALE 29 with 896 threads; total, read, and write bandwidth.)

Maximum read bandwidth: 455 GB/s.

118 “That’s A Knife !” (580 GB)

(Chart: billion traversed edges per second (GTEPS) versus number of threads, 0-1024, for T5-8 Base, T5-8 OPT1, and T5-8 OPT2 at SCALE 30, 580 GB.)

Peak performance is 50x higher than the baseline, and OPT2 delivers 2x the performance using 60% of the threads.

119 Bigger Is Definitely Better (580 GB)

(Chart: SPARC T5-8 graph search performance at SCALE 30 (memory requirement 580 GB): search time in minutes and speed-up versus number of threads, 1-512.)

72x faster! The search time is reduced from 12 hours to 10 minutes.

120 Performance Gains

(Chart: SPARC T5-8 speed-up of OPT1 and OPT2 over the OPT0 version versus SCALE value, 26-31; bigger is better, with somewhat diminishing returns at the largest scales.)

121 OPT2 Benefit Over OPT1

(Chart: SPARC T5-8 speed-up of OPT2 over OPT1 versus SCALE value, 26-32.)

Memory-level tuning gives up to 3x better performance (bigger is better here).

122 SPARC T5-8 Scalability (2.3 TB)

(Chart: billion traversed edges per second (GTEPS) versus number of threads, 0-1024, for the SPARC T5-8 at SCALE 32 with OPT2, reaching 1.50 GTEPS.)

Even starting at 32 threads, the speed-up is still 11x.

123 The Wrapping

124 Wrapping Things Up

“While we're still waiting for your MPI debug run to finish, I want to ask you whether you found my information useful.” “Yes, it is overwhelming. I know.” “And OpenMP is somewhat obscure in certain areas. I know that as well.”

125 Wrapping Things Up

“I understand. You're not a Computer Scientist and just need to get your scientific research done.” “I agree this is not a good situation, but it is all about Darwin, you know. I'm sorry, it is a tough world out there.”

126 It Never Ends

“Oh, your MPI job just finished! Great.” “Your program does not write a file called 'core' and it wasn't there when you started the program?” “You wonder where such a file comes from? Let's talk, but I need to get a big and strong coffee first.” “WAIT! What did you just say?”

127 It Really Never Ends

“Somebody told you WHAT ??” “You think GPUs and OpenCL will solve all your problems?”

“Let's make that an XL Triple Espresso. I’ll buy”

128 Thank You And ..... Stay Tuned ! [email protected]

129 = OpenMP barrier time
