Motivation PEM Model PEM Algorithms GPGPU Conclusions

Locality-conscious parallel algorithms

Nodari Sitchinava Karlsruhe Institute of Technology University of Hawaii

October 17, 2013

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions 1990s: Faster Programs

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions 1990s: Faster Programs

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions 1990s: Faster Programs

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions 1990s: Faster Programs

     

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Mid 2000s – Power Dissipation Is a Problem

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Mid 2000s – Power Dissipation Is a Problem

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions

1970-90s: PRAM, BSP, hypercubes... Important results on the limits of parallelism Are they practical?

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Example: Orthogonal Point Location

Runtime Θ(n log n)

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Example: Orthogonal Point Location

Runtime per element 1e-05

9e-06

8e-06

7e-06

seconds 6e-06

Runtime 5e-06 Θ(n log n) 4e-06 3e-06 0 2e+07 4e+07 6e+07 8e+07 1e+08 1.2e+08 n

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Example: Orthogonal Point Location

Runtime per element 1e-05

9e-06

8e-06

7e-06

seconds 6e-06

Runtime 5e-06 Θ(n log n) 4e-06 3e-06 0 2e+07 4e+07 6e+07 8e+07 1e+08 1.2e+08 n

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Θ notation

What does Θ notation represent? Number of comparisons Number of arithmetic operations Number of memory accesses

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Θ notation

What does Θ notation represent? Number of comparisons Number of arithmetic operations Number of memory accesses

All of the above in RAM/PRAM models

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Example: Orthogonal Point Location

Θ(n log n) memory accesses?

Runtime per element DRAM accesses per element 1e-05 80

9e-06 70

60 8e-06 50 7e-06 40

seconds 6e-06 30 5e-06 20

4e-06 10

3e-06 0 0 2e+07 4e+07 6e+07 8e+07 1e+08 1.2e+08 0 2e+07 4e+07 6e+07 8e+07 1e+08 1.2e+08 n n

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Modern memory design

CPU Disk Caches! L1 Cache L2 Cache

RAM

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Caches in Multi-cores

CPU 1

CPU 2

Disk

CPU 3

CPU 4

L1 L2 RAM

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Dealing with Caches

External Memory Model

CPU

M Cache

B

External memory

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions

RAM PRAM

CPU PE 1 PE 2 ... PE P

Random Access Memory Random access memory

External Memory

CPU

M Cache

B

External memory

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions

RAM PRAM

CPU PE 1 PE 2 ... PE P

Random Access Memory Random access memory

External Memory Parallel External Memory

CPU PE 1 PE 2 PE P ... M M M M Cache Cache Cache Cache

B B B B

External memory

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Parallel External Memory (PEM) Model [SPAA ’08]

PEM: A simple model of locality and parallelism

PE 1 PE 2 PE P ... M M M Cache Cache Cache Explicit cache replacement B B B Up to P block transfers = 1 I/O Shared memory Block-level CREW access

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Parallel External Memory (PEM) Model [SPAA ’08]

PEM: A simple model of locality and parallelism

PE 1 PE 2 PE P ... M M M Cache Cache Cache Explicit cache replacement B B B Up to P block transfers = 1 I/O Shared memory Block-level CREW access

Complexity: Parallel I/O complexity – number of parallel block transfers

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Solutions in PEM

Algorithms Sorting [SPAA’08] Orthogonal Computational Geometry [ESA’10,

Independent IPDPS’11] Set

List Problems on Graphs Ranking Tree [IPDPS’10] Euler Tour

Connected Expression Tree MST Components Evaluation

Bi−connected Batched LCA PEM Data Structures [SPAA’12] Components Ear Decomposition Parallel Buffer/Range Tree

Exp. Validation [ESA’13] Orthogonal Computational Geometry Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions

PEM Algorithms

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Sorting [SPAA ’08]

d = min{M/B, n/P} p

log (N/B) Sorting M/B d-way mergesort d-way distribution sort

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Sorting [SPAA ’08]

d = min{M/B, n/P} p

log (N/B) Sorting M/B d-way mergesort d-way distribution sort

n n sort Θ  PB logM/B B  = P (n) I/Os

Θ(n log n) comparisons

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Geometric Algorithms [ESA ’10, IPDPS ’11]

Solved Problems Line segment intersections Rectangle intersections Range Query Orthogonal point location

n n sort Θ  PB logM/B B  = P (n) I/Os

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Parallel Distribution Sweeping

d = min{M/B, n/P} p

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Parallel Distribution Sweeping

d = min{M/B, n/P} p

d vertical slabs

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Parallel Distribution Sweeping

d = min{M/B, n/P} p

d vertical slabs

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Parallel Distribution Sweeping

d = min{M/B, n/P} p

d vertical slabs

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Parallel Distribution Sweeping

d = min{M/B, n/P} p

d vertical slabs

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Parallel Distribution Sweeping

d = min{M/B, n/P} p

d vertical slabs

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Parallel Distribution Sweeping

d = min{M/B, n/P} p

O(N/P)

d vertical d vertical slabs slabs

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Parallel Distribution Sweeping

d = min{M/B, n/P} p

O(N/P)

d vertical d vertical slabs slabs

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Example: Orthogonal Point Location

Complexity n n O  PB logM/B B  I/Os O (n log n) comparisons

Runtime per element DRAM accesses per element 1e-05 80

9e-06 70

60 8e-06 50 7e-06 40

seconds 6e-06 30 5e-06 20

4e-06 10

3e-06 0 0 2e+07 4e+07 6e+07 8e+07 1e+08 1.2e+08 0 2e+07 4e+07 6e+07 8e+07 1e+08 1.2e+08 n n

logM/B (n/B) ≈ 2

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Graph Problems [IPDPS ’10]

Independent Set

List Ranking

Tree Euler Tour

Connected Expression Tree MST Components Evaluation

Bi−connected Batched LCA Components

Ear Decomposition

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Pointers And Caches Don’t Get Along

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Pointers And Caches Don’t Get Along

n n n min Θ + log n , Θ logM B I/Os for list ranking. n P   PB / B o

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions If you bear with me for a moment...

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions

GPGPU Computing

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions GPU computing

GPGPU – graphics cards for computation Hundreds of cores Thousands of threads

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions GPUs

Global Memory (n ≤ 3GB)

SM w SM w

sh. mem sh. mem Shared (Local) Memory (M ≈ 48 kB) Global memory

CPU memory Registers (≈ 32 kB)

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions GPUs

Global Memory (n ≤ 3GB) Coalesced accesses (w = 32)

SM w SM w

sh. mem sh. mem Shared (Local) Memory (M ≈ 48 kB) Global memory

CPU memory Registers (≈ 32 kB)

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions GPUs

Global Memory (n ≤ 3GB) Coalesced accesses (w = 32)

Hyperthreading SM w SM w

sh. mem sh. mem Shared (Local) Memory (M ≈ 48 kB) Global memory

CPU memory Registers (≈ 32 kB)

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions GPUs

Global Memory (n ≤ 3GB) Coalesced accesses (w = 32)

Hyperthreading SM w SM w

w w Shared (Local) Memory (M ≈ 48 kB) Global memory Memory banks

CPU memory Registers (≈ 32 kB)

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions GPUs

Global Memory (n ≤ 3GB) Coalesced accesses (w = 32)

Hyperthreading SM w SM w

w w Shared (Local) Memory (M ≈ 48 kB) Global memory w Memory banks

CPU memory Registers (≈ 32 kB)

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions GPUs

Global Memory (n ≤ 3GB) Coalesced accesses (w = 32)

Hyperthreading SM w SM w

w w Shared (Local) Memory (M ≈ 48 kB) Global memory w Memory banks

CPU memory Registers (≈ 32 kB) Fastest if addressable at compile time

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions GPU Programming Challenges

Type of memory

Coalescing complicates SM w SM w algorithm design sh. mem sh. mem Shared memory bank conflicts Global memory Registers private to threads

CPU memory Hyperthreading Resource sharing tradeoffs

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions GPU: PEM with multiprocessors [under submission]

Model SM w SM w Hyperthreading: k vSMs per SM sh. mem sh. mem Always load data into shared Global memory B = 4w memory of size M′ = M/k shared memory CPU memory using SIMD algorithms

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions GPU algorithm design

Algorithmic framework w w Pick minimum k to saturate SM SM I/O bandwidth sh. mem sh. mem

Minimize I/Os as in PEM Global memory Independently design (bank B = 4w conflict free) SIMD ′ algorithm on M items and CPU memory w processors

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Example: List Ranking

Problem List ranking with coalesced accesses?

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Example: List Ranking

Problem List ranking with coalesced accesses?

From the PEM model:

n n n min Θ + log n , Θ logM′ w I/Os. n P   Pw / w o

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Effect of bank conflicts on runtime

100 200 kernel 3 runtime kernel 3 bank conflicts improved kernel 3 runtime 95 175

90 150 ] 6 85 125

80 100

runtime [ms] 75 75 bank conflicts [10

70 50

65 25

60 0 0 2 4 6 8 10 12 14 16 colors

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Modeling bank conflicts

Matrix view of shared memory Column-major layout in ′ Memory Bank 0 A[0] A[8] A[16] A[24] A[32] A[40] A[48] A[56] w × (M /w) matrix Memory Bank 1 A[1] A[9] A[17] A[25] A[33] A[41] A[49] A[57] One per row Memory Bank 2 A[2] A[10] A[18] A[26] A[34] A[42] A[50] A[58] Memory Bank 3 A[3] A[11] A[19] A[27] A[35] A[43] A[51] A[59] Convert column- to Memory Bank 4 A[4] A[12] A[20] A[28] A[36] A[44] A[52] A[60] Memory Bank 5 A[5] A[13] A[21] A[29] A[37] A[45] A[53] A[61]

row-major to process Memory Bank 6 A[6] A[14] A[22] A[30] A[38] A[46] A[54] A[62] columns Memory Bank 7 A[7] A[15] A[23] A[31] A[39] A[47] A[55] A[63] Conversion for square matrices is a matrix transposition

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Example: Bank Conflict Free Sorting

ShearSort [Sen et al. 1986] ′ Memory Bank 0 A[0] A[8] A[16] A[24] A[32] A[40] A[48] A[56] Repeat log(M /w) times: Memory Bank 1 A[1] A[9] A[17] A[25] A[33] A[41] A[49] A[57] Sort columns in alternate Memory Bank 2 A[2] A[10] A[18] A[26] A[34] A[42] A[50] A[58] Memory Bank 3 A[3] A[11] A[19] A[27] A[35] A[43] A[51] A[59]

order Memory Bank 4 A[4] A[12] A[20] A[28] A[36] A[44] A[52] A[60]

Memory Bank 5 A[5] A[13] A[21] A[29] A[37] A[45] A[53] A[61]

Sort rows in increasing order Memory Bank 6 A[6] A[14] A[22] A[30] A[38] A[46] A[54] A[62]

Memory Bank 7 A[7] A[15] A[23] A[31] A[39] A[47] A[55] A[63] Result: Sorted matrix in column-major order

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Example: Bank Conflict Free Sorting

1 thrust mergesort thrust mergesort (base case 1024) 0.9 shearsort

0.8

0.7

0.6 elements/s] 9 0.5

0.4

0.3 throughput [10

0.2

0.1

0 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 input size

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Example: Bank Conflict Free Sorting

4 thrust mergesort thrust mergesort (base case 1024) mergesort/shearsort (base case 1024) 3.5 mergesort/shearsort (base case 8192)

3

2.5

2 runtime [s] 1.5

1

0.5

0 0 5e+07 1e+08 1.5e+08 2e+08 2.5e+08 3e+08 input size

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions

Conclusions

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions Theory can be useful

With a good model Faster implementations

Runtime per element 1e-05

9e-06

Solutions can become obvious 8e-06

7e-06

seconds 6e-06 Don’t need to reinvent the wheel 5e-06 4e-06

3e-06 0 2e+07 4e+07 6e+07 8e+07 1e+08 1.2e+08 Lower bounds save the frustration n

Simplicity vs Accuracy (P)RAM: uniform memory (P)EM: NUMA in the presence of caches

Nodari Sitchinava – Locality-conscious parallel algorithms Motivation PEM Model PEM Algorithms GPGPU Conclusions

Thank you!

Nodari Sitchinava – Locality-conscious parallel algorithms